Is there any way to do masking with DPO? #2226

mashdragon · 2024-12-28T20:04:16Z

mashdragon
Dec 28, 2024

So, I'm well aware that a DPO dataset essentially ranks some responses over other responses.

I have created a dataset in which I've corrected portions of a model's output. I've set it up so the "chosen" responses are my corrected outputs and the "rejected" responses are the original outputs. The catch is, these outputs are not complete responses, and the outputs are exactly the same up until the correction.

I am wondering if there is a way for me to mask the dataset during fine-tuning so that the model doesn't "learn" to stop generating early, since the responses in my dataset are not complete (I want the model to keep going but learn that at this stage you should be doing this, not that). Also, since the outputs are identical up until the correction, I'm wondering what kind of effect this will have on DPO, or if the algorithm just cancels out the effect of the identical tokens.

Thank you!

NanoCode012 · 2025-01-02T10:52:56Z

NanoCode012
Jan 2, 2025
Collaborator

Hey, could you elaborate what you mean the outputs are not complete responses? Are the "correct" ones not complete as well?

7 replies

NanoCode012 Jan 7, 2025
Collaborator

Hey!

Sorry for these "new user" questions. I'm still learning

No worries, we're happy to help guide you.

Earlier, I was asking around to see if others had some idea on this issue.

One idea proposed was to omit the EOS during the processing of your data. This could maybe tell the model that this isn't the end of the generation and to not incorrectly teach it to stop. You'll have to experiment and get back to us on this, maybe a simple LORA run to ensure that your training won't cause abrupt stops.

I think my partial responses shouldn't include the EOS token, which contradicts what is suggested here:

The reason we had that message was to let the user's know that we are not adding those tokens and to handle it themselves, in your case, you may choose to omit them. To clarify, this data loading is only supported in SFT as of the moment.

I'm still confused about the difference between -100 label tokens and a attention mask of 0.

Label of -100 would mean that it's ignored during loss calculation. Attention_mask of 0 would be for ignoring padding tokens which you don't need to worry about. Attention_mask should just always be [1]*len(input_ids).

So, I'm also confused by this one. I am guessing this means I would make it something like this (mostly taken from the example)?

Nope, the "processing" here refers to formatting it into the proper template (for example, adding chatml tokens to prompt

axolotl/src/axolotl/prompt_strategies/dpo/chatml.py

Lines 121 to 124 in 341b118

    
           sample["prompt"] = ( 
        
               f"<|im_start|>system\n{sample['system']}<|im_end|>\n" 
        
               f"<|im_start|>user\n{sample['prompt']}<|im_end|>\n<|im_start|>assistant\n" 
        
           )

). In your case, you want to add the BOS & conversation tokens (if any).

One thing, if you're training this, could you also include good/complete samples? This should help prevent the model from biasing to cut-off responses.

mashdragon Jan 7, 2025
Author

Thank you so much for such a detailed answer! I am very thankful for the tips about how the pre-tokenized dataset format is only supported for SFT currently and about the attention mask, and also the suggestion to try with complete examples. I didn't think of that.

So to recap, I will modify the relevant "processing" transformation function in src/axolotl/prompt_strategies/dpo/* to exclude the EOS token from the chosen and rejected keys and see if that works, putting my dataset back into the regular format rather than pre-tokenized but also taking care to ensure the tokenization would be the same by, for example, adding whitespace if the last token had some at the end. In another experiment I'll also add the EOS/EOT token manually for complete samples and see if that is needed for the model not to stop early.

I will run these experiments and report back how they go!

NanoCode012 Jan 7, 2025
Collaborator

I will modify the relevant "processing" transformation function in src/axolotl/prompt_strategies/dpo/* to exclude the EOS token from the chosen and rejected keys and see if that works, putting my dataset back into the regular format rather than pre-tokenized

Right

but also taking care to ensure the tokenization would be the same by, for example, adding whitespace if the last token had some at the end.

What do you mean by this? I don't think you need to manually tokenizing it anymore, only change the formatting.

In another experiment I'll also add the EOS/EOT token manually for complete samples and see if that is needed for the model not to stop early.

Yep!

mashdragon Jan 7, 2025
Author

What do you mean by this? I don't think you need to manually tokenizing it anymore, only change the formatting.

I just mean I need to be careful about the characters at the very end of a "partial prompt". In trying to generate a "guided" LLM output by iteration, what I do is I cut off the LLM's output, change the direction of the output by changing the last tokens to match the correct direction, and then resume generation from there.

What I've noticed is that for the model I'm currently using (nvidia/Llama-3.1-Nemotron-70B-Instruct), where I cut off the model can make a difference. If I do not preserve the tokens, for example by cutting off whitespace at the end if the last token was '}\n\n', sometimes that completely changes the continuation. If I don't get that last token correct and just end it with '}', then the LLM often predicts an EOT token afterward! So I just need to be careful I don't change the existing tokens in my training data.

(Also thank you for confirming the other things!! 🙂)

NanoCode012 Jan 9, 2025
Collaborator

Oh, that's an interesting detail for that model. Let us know how the experiment goes!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Is there any way to do masking with DPO? #2226

{{title}}

Replies: 1 comment 7 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Is there any way to do masking with DPO? #2226

mashdragon Dec 28, 2024

Replies: 1 comment · 7 replies

NanoCode012 Jan 2, 2025 Collaborator

NanoCode012 Jan 7, 2025 Collaborator

mashdragon Jan 7, 2025 Author

NanoCode012 Jan 7, 2025 Collaborator

mashdragon Jan 7, 2025 Author

NanoCode012 Jan 9, 2025 Collaborator

mashdragon
Dec 28, 2024

Replies: 1 comment 7 replies

NanoCode012
Jan 2, 2025
Collaborator

NanoCode012 Jan 7, 2025
Collaborator

mashdragon Jan 7, 2025
Author

NanoCode012 Jan 7, 2025
Collaborator

mashdragon Jan 7, 2025
Author

NanoCode012 Jan 9, 2025
Collaborator