Grounded skill samples generated by the simple pipeline are missing context? #258

Open

markmc opened this issue Aug 28, 2024 · 2 comments

Labels: question (Further information is requested)

markmc (Contributor) commented Aug 28, 2024

Just noticed this while documenting dataset formats in #236

In _gen_train_data(), we take a dataset with question and response columns and generate a training dataset in two different formats. (OK, in the case of the simple pipeline we actually parse question and response from the output column, but that's not super important here.)

If the dataset also contains a context column, we append the context to the question:

            user = _get_question_hack(synth_example)
            if len(synth_example.get("context", "")) > 0:
                user += "\n" + synth_example["context"]
            assistant = _unescape(_get_response_hack(synth_example))
            train_entry = {
                "system": _SYS_PROMPT,
                "user": _unescape(user),
                "assistant": assistant,
            }
            train_data.append(train_entry)
            sample = {
                "inputs": _unescape(user),
                "targets": assistant,
                "system": _SYS_PROMPT,
            }
            messages_data.append(_convert_to_messages(sample))
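
For illustration (a made-up grounded-skill row, not taken from the repo): if question is "What does clause 3 require?", context is "Clause 3: ...", and response is "It requires ...", the legacy-format entry ends up with the context appended directly after the question:

    train_entry = {
        "system": _SYS_PROMPT,
        "user": "What does clause 3 require?\nClause 3: ...",
        "assistant": "It requires ...",
    }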

In the full pipeline for grounded skills, we do generate this context column based on the seed_context column.

In the simple pipeline for grounded skills, we are not including a context column at all. I suspect the intent was to include the original seed context in each sample? If so, we'd need to add a DuplicateColumnsBlock that would copy seed_context to context?
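
To make the intent concrete, the copy I have in mind behaves roughly like this (just a sketch, and assuming the simple grounded skills pipeline still carries a seed_context column through to its output):

    from datasets import Dataset

    def _copy_seed_context(samples: Dataset) -> Dataset:
        # Duplicate seed_context into a context column so that
        # _gen_train_data() appends it to the question.
        return samples.add_column("context", samples["seed_context"])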

markmc (Contributor, Author) commented Aug 30, 2024

Another example of where I think we're missing context for grounded skills:

In datamixing.py:

def _convert_to_leaf_node_messages(sample: dict, sys_prompt: str):
    ...
    user_query = _unescape(_get_question_hack(sample))
    response = _unescape(_get_response_hack(sample))

    sample["id"] = str(uuid.uuid4())
    sample["messages"] = [
	{"content": sys_prompt, "role": "system"},
        {"content": user_query, "role": "user"},
	{"content": response, "role": "assistant"},
    ]

AIUI, we should be including context in the user message here?
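
Something along these lines, perhaps (untested sketch, reusing the same "append context to the question" convention as _gen_train_data()):

    context = _unescape(sample.get("context", ""))
    if context:
        user_query = user_query + "\n" + context

    sample["messages"] = [
        {"content": sys_prompt, "role": "system"},
        {"content": user_query, "role": "user"},
        {"content": response, "role": "assistant"},
    ]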

nathan-weinberg added the question (Further information is requested) label Sep 10, 2024
bbrowning (Contributor) commented

We should dig into this, assuming we keep the simple pipeline around.
