Grounded skill samples generated by the simple pipeline are missing context? #258

Open

markmc opened this issue Aug 28, 2024 · 2 comments

Labels: question (Further information is requested)

markmc (Contributor) commented Aug 28, 2024

Just noticed this while documenting dataset formats in #236

In _gen_train_data(), we take a dataset with question and response columns and generate a training dataset in two different formats. (OK, in the case of the simple pipeline we actually parse question and response from the output column, but that's not super important here.)

If the dataset also contains a context column, we append the context to the question:

            user = _get_question_hack(synth_example)
            if len(synth_example.get("context", "")) > 0:
                user += "\n" + synth_example["context"]
            assistant = _unescape(_get_response_hack(synth_example))
            train_entry = {
                "system": _SYS_PROMPT,
                "user": _unescape(user),
                "assistant": assistant,
            }
            train_data.append(train_entry)
            sample = {
                "inputs": _unescape(user),
                "targets": assistant,
                "system": _SYS_PROMPT,
            }
            messages_data.append(_convert_to_messages(sample))
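
For illustration (a made-up grounded-skill row, not taken from the repo): if question is "What does clause 3 require?", context is "Clause 3: ...", and response is "It requires ...", the legacy-format entry ends up with the context appended directly after the question:

    train_entry = {
        "system": _SYS_PROMPT,
        "user": "What does clause 3 require?\nClause 3: ...",
        "assistant": "It requires ...",
    }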

In the full pipeline for grounded skills, we do generate this context column based on the seed_context column.

In the simple pipeline for grounded skills, we are not including a context column at all. I suspect the intent was to include the original seed context in each sample? If so, we'd need to add a DuplicateColumnsBlock that would copy seed_context to context?
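
To make the intent concrete, the copy I have in mind behaves roughly like this (just a sketch, and assuming the simple grounded skills pipeline still carries a seed_context column through to its output):

    from datasets import Dataset

    def _copy_seed_context(samples: Dataset) -> Dataset:
        # Duplicate seed_context into a context column so that
        # _gen_train_data() appends it to the question.
        return samples.add_column("context", samples["seed_context"])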

markmc (Contributor, Author) commented Aug 30, 2024

Another example of where I think we're missing context for grounded skills:

In datamixing.py:

def _convert_to_leaf_node_messages(sample: dict, sys_prompt: str):
    ...
    user_query = _unescape(_get_question_hack(sample))
    response = _unescape(_get_response_hack(sample))

    sample["id"] = str(uuid.uuid4())
    sample["messages"] = [
	{"content": sys_prompt, "role": "system"},
        {"content": user_query, "role": "user"},
	{"content": response, "role": "assistant"},
    ]

AIUI, we should be including context in the user message here?
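
Something along these lines, perhaps (untested sketch, reusing the same "append context to the question" convention as _gen_train_data()):

    context = _unescape(sample.get("context", ""))
    if context:
        user_query = user_query + "\n" + context

    sample["messages"] = [
        {"content": sys_prompt, "role": "system"},
        {"content": user_query, "role": "user"},
        {"content": response, "role": "assistant"},
    ]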

nathan-weinberg added the question (Further information is requested) label Sep 10, 2024
bbrowning (Contributor) commented

We should dig into this, assuming we keep the simple pipeline around.
