Pull taxonomy precomputed dataset from hugging face #201
Labels: enhancement (New feature or request)
Comments
bbrowning added a commit to bbrowning/instructlab-sdg that referenced this issue on Jul 25, 2024
This introduces Recipe yaml files, which are used both as an input into the data mixing process and as an output of the process.

As an input, we have some default recipe files that specify any precomputed datasets that should be mixed with data from new skills when generating the overall mix of samples that will be sent to the training process.

If a downstream user/packager wants to add default recipes (and datasets), they should install them to a path like `/usr/share/instructlab/sdg` (varies by platform, uses Python's `platformdirs.PlatformDirs` to respect platform conventions).

Recipes should be in sdg/default_data_recipes/{knowledge,skills}.yaml

Datasets should be in sdg/datasets but this location is not enforced.

Currently we are not shipping any default recipe files in the upstream, but there is a unit test in place to ensure the functionality to load default recipes from disk works once we decide how we want to ship a precomputed dataset to our upstream users.

As an output of the data generation process, we write recipe yamls to document which datasets were mixed together and in what proportions along with the system prompt that was used during the generation. Here's an example of a recipe yaml put into the output directory after running data generation:

```yaml
datasets:
- path: node_datasets_2024-07-25T17_49_46/knowledge_tonsils_overview_e2e-tonsils_p10.jsonl
  sampling_size: 1.0
metadata:
  sys_prompt: "I am, Red Hat\xAE Instruct Model based on Granite 7B, an AI language\
    \ model developed by Red Hat and IBM Research, based on the Granite-7b-base language\
    \ model. My primary function is to be a chat assistant."
```

Datasets may be referenced by relative paths, which are relative to the recipe's own directory. Or, they may use absolute filesystem paths.

Anything written out under the metadata section (currently just sys_prompt) is purely informational for the user and ignored when loading recipes.
Parts of this are extracted and rebased from aakankshaduggal#4 aakankshaduggal#20

Refs instructlab#162, instructlab#171, instructlab#185, instructlab#201.

Co-authored-by: shivchander <[email protected]>
Co-authored-by: Khaled Sulayman <[email protected]>
Co-authored-by: abhi1092 <[email protected]>
Co-authored-by: Aakanksha Duggal <[email protected]>
Co-authored-by: Mark McLoughlin <[email protected]>
Signed-off-by: Ben Browning <[email protected]>
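The path rules described in the commit message (relative dataset paths resolve against the recipe's own directory, absolute paths are used as given, and the metadata section is ignored on load) can be illustrated with a small sketch. This is not the code from the commit. The function name `resolve_recipe_datasets` and the example paths are hypothetical, and the recipe is assumed to be already parsed from YAML into a plain dict.

```python
from pathlib import Path


def resolve_recipe_datasets(recipe: dict, recipe_file: str) -> list[dict]:
    """Return the recipe's dataset entries with their paths made absolute.

    Hypothetical helper: `recipe` is assumed to be the already-parsed
    recipe yaml, and `recipe_file` the location it was loaded from.
    """
    recipe_dir = Path(recipe_file).parent
    resolved = []
    for entry in recipe.get("datasets", []):
        path = Path(entry["path"])
        if not path.is_absolute():
            # Relative paths are relative to the recipe's own directory.
            path = recipe_dir / path
        resolved.append(
            {"path": str(path), "sampling_size": entry["sampling_size"]}
        )
    # The "metadata" section is informational only, so it is never read here.
    return resolved


# Illustrative recipe with one relative and one absolute dataset path.
example_recipe = {
    "datasets": [
        {"path": "node_datasets/knowledge_p10.jsonl", "sampling_size": 1.0},
        {"path": "/data/precomputed.jsonl", "sampling_size": 0.5},
    ],
    "metadata": {"sys_prompt": "ignored when loading"},
}
resolved = resolve_recipe_datasets(example_recipe, "/out/skills_recipe.yaml")
```

With POSIX-style paths, the first entry resolves under `/out/` (the recipe's directory) while the second is left unchanged.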
This issue has been automatically marked as stale because it has not had activity within 90 days. It will be automatically closed if no further activity occurs within 30 days.
still relevant, as we aren't yet mixing in the community precomputed dataset
To improve skills training, we need a precomputed dataset that is mixed with the newly generated synthetic dataset. This is mostly used during full training via instructlab/training, and is not as important for the simpler legacy training in instructlab/instructlab.
The taxonomy precomputed dataset is hosted on Hugging Face: https://huggingface.co/datasets/instructlab/InstructLabCommunity
We need a way to incorporate this dataset from Hugging Face and mix it with the synthetically generated data during an `ilab data generate`. One proposal is at aakankshaduggal#20, but after some discussion at #203 (comment) we want to implement this in a slightly different way, so that we're not hitting Hugging Face silently/automatically but instead use an explicit step to download the precomputed dataset.

- Users will have to download the dataset and place it in an appropriate cache directory. Potentially, there could be an `ilab data download` command to do this with a nicer user experience than asking them to manually download it to the appropriate directory.
- Allow the name and/or path to the precomputed dataset to be supplied with e.g. `ilab data generate --skills-dataset=`, and ilab would construct a simple skills recipe (in memory) and pass it to the library. We could limit this to supplying a single precomputed skills dataset, or perhaps we want to allow the user to specify a list of precomputed skills and/or knowledge datasets on the command line?
- We'll want to test this precomputed dataset in our e2e CI, which means ensuring we download and cache the dataset there so it's available at data generation time.
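As a rough illustration of the proposal above, the sketch below builds a single-dataset skills recipe in memory from a user-supplied name or path, and fails with a clear error instead of contacting Hugging Face when the file is not already in the cache. Everything here is an assumption made for illustration: `build_skills_recipe`, `DatasetNotDownloadedError`, the `cache_dir` argument, and the error wording are hypothetical and not part of ilab or the SDG library.

```python
import tempfile
from pathlib import Path


class DatasetNotDownloadedError(Exception):
    """Raised when the precomputed dataset is missing from the local cache."""


def build_skills_recipe(
    skills_dataset: str, cache_dir: str, sampling_size: float = 1.0
) -> dict:
    """Build a minimal in-memory skills recipe for one precomputed dataset.

    Hypothetical helper: a bare name or relative path is looked up inside
    `cache_dir`, while an absolute path is used as given.
    """
    path = Path(skills_dataset)
    if not path.is_absolute():
        path = Path(cache_dir) / path
    if not path.is_file():
        # Never fetch from Hugging Face here: downloading is an explicit,
        # separate step that the user has to run beforehand.
        raise DatasetNotDownloadedError(
            f"Precomputed dataset not found at {path}; download it first."
        )
    return {"datasets": [{"path": str(path), "sampling_size": sampling_size}]}


# Usage: one dataset that is present in the cache and one that is not.
with tempfile.TemporaryDirectory() as cache:
    (Path(cache) / "skills.jsonl").write_text("{}\n")
    recipe = build_skills_recipe("skills.jsonl", cache)
    try:
        build_skills_recipe("missing.jsonl", cache)
        missing_error = None
    except DatasetNotDownloadedError as exc:
        missing_error = exc
```

Supporting a list of skills and/or knowledge datasets, as the issue asks about, would mostly mean accepting several paths and appending one entry per path to `datasets`.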
[Edited by @bbrowning to incorporate changes from https://github.com//pull/203#issuecomment-2250444499 as well as in-person discussions with @aakankshaduggal ].