forked from instructlab/sdg
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Introduce data mixing recipe yaml files
This introduces Recipe yaml files, which are used both as an input into the data mixing process and as an output of the process. As an input, we have some default recipe files that specify any precomputed datasets that should be mixed with data from new skills when generating the overall mix of samples that will be sent to the training process. If a downstream user/packager wants to add default recipes (and datasets), they should install them to a path like `/usr/share/instructlab/sdg` (varies by platform, uses Python's `platformdirs.PlatformDirs` to respect platform conventions). Recipes should be in sdg/default_data_recipes/{knowledge,skills}.yaml Datasets should be in sdg/datasets but this location is not enforced. Currently we are not shipping any default recipe files in the upstream, but there is a unit test in place to ensure the functionality to load default recipes from disk works once we decide how we want to ship a precomputed dataset to our upstream users. As an output of the data generation process, we write recipe yamls to document which datasets were mixed together and in what proportions along with the system prompt that was used during the generation. Here's an example of a recipe yaml put into the output directory after running data generation: ```yaml datasets: - path: node_datasets_2024-07-25T17_49_46/knowledge_tonsils_overview_e2e-tonsils_p10.jsonl sampling_size: 1.0 metadata: sys_prompt: "I am, Red Hat\xAE Instruct Model based on Granite 7B, an AI language\ \ model developed by Red Hat and IBM Research, based on the Granite-7b-base language\ \ model. My primary function is to be a chat assistant." ``` Datasets may be referenced by relative paths, which are relative to the recipe's own directory. Or, they may use absolute filesystem paths. Anything written out under the metadata section (currently just sys_prompt) is purely informational for the user and ignored when loading recipes. Parts of this are extracted and rebased from aakankshaduggal#4 aakankshaduggal#20 Refs instructlab#162, instructlab#171, instructlab#185, instructlab#201. Co-authored-by: shivchander <[email protected]> Co-authored-by: Khaled Sulayman <[email protected]> Co-authored-by: abhi1092 <[email protected]> Co-authored-by: Aakanksha Duggal <[email protected]> Co-authored-by: Mark McLoughlin <[email protected]> Signed-off-by: Ben Browning <[email protected]>
- Loading branch information
1 parent
afadfd5
commit fcfacfc
Showing
7 changed files
with
227 additions
and
34 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
{"id": "abc123", "messages": [], "metadata": {}} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
datasets: | ||
- path: test/knowledge.jsonl | ||
sampling_size: 1.0 |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
datasets: | ||
- path: test/skills.jsonl | ||
sampling_size: 1.0 |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
datasets: | ||
- path: datasets/samples.jsonl | ||
sampling_size: 1.0 | ||
sys_prompt: I am a reliable AI assistant. |