Add precomputed dataset to skills data generation #171
The current iteration of #163 does not read recipe files at all, so if recipe files are how precomputed datasets will be read, we'll want to plumb some version of that reading back in once it lands. We'll also likely need the ability to specify the system prompt that was used with (or should be used with) each precomputed dataset. Previously this was read from recipe files, so it may need to be added back as well if configurable system prompts are required to properly use precomputed datasets.
This introduces recipe yaml files, which are used both as an input to the data mixing process and as an output of it.

As an input, we have some default recipe files that specify any precomputed datasets that should be mixed with data from new skills when generating the overall mix of samples sent to the training process. If a downstream user/packager wants to add default recipes (and datasets), they should install them to a path like `/usr/share/instructlab/sdg` (the exact location varies by platform; Python's `platformdirs.PlatformDirs` is used to respect platform conventions). Recipes should be in `sdg/default_data_recipes/{knowledge,skills}.yaml`. Datasets should be in `sdg/datasets`, but this location is not enforced. We are not currently shipping any default recipe files upstream, but there is a unit test in place to ensure the functionality to load default recipes from disk works once we decide how we want to ship a precomputed dataset to our upstream users.

As an output of the data generation process, we write recipe yamls to document which datasets were mixed together, in what proportions, and with which system prompt during generation. Here's an example of a recipe yaml put into the output directory after running data generation:

```yaml
datasets:
- path: node_datasets_2024-07-25T17_49_46/knowledge_tonsils_overview_e2e-tonsils_p10.jsonl
  sampling_size: 1.0
metadata:
  sys_prompt: "I am, Red Hat\xAE Instruct Model based on Granite 7B, an AI language\
    \ model developed by Red Hat and IBM Research, based on the Granite-7b-base language\
    \ model. My primary function is to be a chat assistant."
```

Datasets may be referenced by relative paths, which are resolved relative to the recipe's own directory, or by absolute filesystem paths. Anything written out under the `metadata` section (currently just `sys_prompt`) is purely informational for the user and is ignored when loading recipes.
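The loading behavior described above can be sketched roughly as follows. This is an illustrative sketch only, not the actual SDG API: the function name `load_recipe` is hypothetical, and it assumes recipes are plain yaml mappings with a `datasets` list as in the example.

```python
# Hypothetical sketch of recipe loading as described above; not the real
# instructlab/sdg implementation. Requires PyYAML.
from pathlib import Path

import yaml


def load_recipe(recipe_path):
    """Load a recipe yaml, resolving relative dataset paths against the
    recipe's own directory. Absolute paths are left as-is."""
    recipe_path = Path(recipe_path)
    with open(recipe_path, encoding="utf-8") as f:
        recipe = yaml.safe_load(f)
    for dataset in recipe.get("datasets", []):
        path = Path(dataset["path"])
        if not path.is_absolute():
            # Relative dataset paths are relative to the recipe's directory
            dataset["path"] = str((recipe_path.parent / path).resolve())
    # Everything under `metadata` (e.g. sys_prompt) is informational only
    # and is deliberately not interpreted here.
    return recipe
```

Note the design point this captures: because relative paths resolve against the recipe file's location, a packager can ship recipes and datasets together under one directory and they remain relocatable.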
Parts of this are extracted and rebased from aakankshaduggal#4 and aakankshaduggal#20.

Refs instructlab#162, instructlab#171, instructlab#185, instructlab#201.

Co-authored-by: shivchander <[email protected]>
Co-authored-by: Khaled Sulayman <[email protected]>
Co-authored-by: abhi1092 <[email protected]>
Co-authored-by: Aakanksha Duggal <[email protected]>
Co-authored-by: Mark McLoughlin <[email protected]>
Signed-off-by: Ben Browning <[email protected]>
This issue has been automatically marked as stale because it has not had activity within 90 days. It will be automatically closed if no further activity occurs within 30 days.
still relevant
As a followup to #163, we need to figure out the right way to wire a precomputed dataset into the skills data generation. One example of such a dataset is https://github.com/instructlab/training/blob/9fdeb87820d5000f7be60a199c4e24aec725772e/sample-data/train_all_pruned_SDG.jsonl ; however, downstream uses of InstructLab, CI, and other scenarios will warrant the ability to swap this dataset out in some way.
The initial implementation, dropped out of scope from #163, had a placeholder in
`src/instructlab/sdg/configs/skills/data_recipe/default_recipe.yaml`
like below:

Discussing this with the community, @shivchander suggested we may want to pull this dataset from somewhere like HuggingFace. So, there's work to be done to figure out where the dataset should live, how the user gets it (explicit pull, implicit pull as needed, caching, etc.), and how downstream uses, CI, or other scenarios will override this precomputed dataset with their own.
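One shape the "implicit pull as needed, with caching" option could take is sketched below. This is purely illustrative: `ensure_precomputed_dataset` is a hypothetical helper, and the pluggable `download_fn` stands in for whatever fetch mechanism is chosen (in practice that might wrap something like `huggingface_hub.hf_hub_download`, but no decision has been made here).

```python
# Illustrative sketch of cache-then-download behavior for a precomputed
# dataset; not the actual InstructLab implementation.
from pathlib import Path


def ensure_precomputed_dataset(cache_dir, filename, download_fn):
    """Return a local path to the dataset, downloading only on cache miss.

    download_fn(filename, dest_path) is expected to write the dataset file
    to dest_path; injecting it keeps this logic testable offline and lets
    downstreams or CI substitute their own dataset source.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    local_path = cache_dir / filename
    if not local_path.exists():
        # Cache miss: fetch once; subsequent calls reuse the cached copy
        download_fn(filename, local_path)
    return local_path
```

Keeping the downloader pluggable also speaks to the overriding concern above: CI or a downstream packager can point `download_fn` (or pre-seed the cache directory) at their own dataset without changing the calling code.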