Is there any way to do masking with DPO? #2226
Unanswered
mashdragon asked this question in Q&A
So, I'm well aware that a DPO dataset essentially ranks some responses over other responses.
I have created a dataset in which I've corrected portions of a model's output. I've set it up so the "chosen" responses are my corrected outputs and the "rejected" responses are the original outputs. The catch is, these outputs are not complete responses, and the outputs are exactly the same up until the correction.
I am wondering if there is a way for me to mask the dataset during fine-tuning so that the model doesn't "learn" to stop generating early, since the responses in my dataset are not complete (I want the model to keep going but learn that at this stage you should be doing this, not that). Also, since the outputs are identical up until the correction, I'm wondering what kind of effect this will have on DPO, or if the algorithm just cancels out the effect of the identical tokens.
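To make the second question concrete, here is a toy sketch of how I understand the sequence-level DPO loss (made-up per-token log-probs and a hand-rolled loss function, not the library's actual implementation). It shows both the kind of position mask I have in mind and why I suspect the tokens shared by the chosen and rejected responses cancel out:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps,
             mask_chosen=None, mask_rejected=None, beta=0.1):
    # Per-token log-probs, shape (seq_len,). A 0/1 mask drops positions
    # (e.g. a truncated tail) from the loss entirely.
    if mask_chosen is None:
        mask_chosen = torch.ones_like(policy_chosen_logps)
    if mask_rejected is None:
        mask_rejected = torch.ones_like(policy_rejected_logps)
    chosen_logratio = ((policy_chosen_logps - ref_chosen_logps) * mask_chosen).sum()
    rejected_logratio = ((policy_rejected_logps - ref_rejected_logps) * mask_rejected).sum()
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio))

# Shared prefix: identical tokens in an identical context get identical
# log-probs under both the policy and the reference model.
prefix_policy = torch.tensor([-1.2, -0.8, -2.0])
prefix_ref = torch.tensor([-1.0, -0.9, -1.7])

# Divergent continuations: my corrected tail (chosen) vs. the original tail (rejected).
chosen_tail_policy, chosen_tail_ref = torch.tensor([-0.5, -1.1]), torch.tensor([-0.6, -1.0])
rejected_tail_policy, rejected_tail_ref = torch.tensor([-1.9, -0.7]), torch.tensor([-1.5, -0.9])

args = (
    torch.cat([prefix_policy, chosen_tail_policy]),
    torch.cat([prefix_policy, rejected_tail_policy]),
    torch.cat([prefix_ref, chosen_tail_ref]),
    torch.cat([prefix_ref, rejected_tail_ref]),
)
full = dpo_loss(*args)

# Masking out the shared prefix gives exactly the same loss: those positions
# contribute equally to the chosen and rejected sides and cancel in the difference.
prefix_mask = torch.tensor([0.0, 0.0, 0.0, 1.0, 1.0])
masked = dpo_loss(*args, mask_chosen=prefix_mask, mask_rejected=prefix_mask)

print(full.item(), masked.item())  # same value
```

If this picture is right, the identical prefix shouldn't matter on its own, since it contributes equally to both sides of the difference, and masking would mainly be useful for keeping the truncated endings (and any EOS handling) out of the loss. Please correct me if I've misunderstood the loss.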
Thank you!
Replies: 1 comment · 7 replies
-
Hey, could you elaborate on what you mean by the outputs not being complete responses? Are the "correct" ones not complete as well?