Mistral Nemo LoRA training has super high grad_norm #2095
Comments
Hey, thanks for the report.
Are you able to test a run with chat_template on a commit before the GA patch (making sure you install the old requirements)?
@Nero10578 btw, I see you're using FSDP; how many GPUs? Thanks
It could also be that the grad_norm is higher because the model doesn't expect the repeated roles.
Will have to test, but I suspect this is the source of the issue.
I am using 2x 3090 Ti with FSDP and CPU offloading.
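For context, the FSDP + CPU offload side of an axolotl config is set up roughly like the sketch below; this is a minimal illustration rather than a verbatim copy of my config, and the wrap class and exact options are assumptions:

```yaml
# Minimal sketch of FSDP with CPU offloading in an axolotl config (values are illustrative)
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: true                          # CPU offloading
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: MistralDecoderLayer
```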
I don't think so, because the dataset didn't change and it was fine before.
So I tested it out with the commit before the GA patch (718cfb2) and with the latest commit (724b660). These are the differences in grad_norm:

The thing is, now that I have let both run, the GA patch does make the loss better; this can be seen in the first eval showing a lower loss after the GA patch. I am using a different config with Liger kernels enabled:
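For reference, enabling the Liger kernels in axolotl looks roughly like the snippet below; this is only a sketch of the relevant flags, and the exact set I enabled may differ:

```yaml
# Sketch of the Liger kernel integration flags in an axolotl config
plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_fused_linear_cross_entropy: true
```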
The high grad_norm is because there is a big mismatch between whatever the model originally fitted and what is being presented now.
Please check that this issue hasn't been reported before.
Expected Behavior
Before the recent gradient accumulation fixes and the related changes in transformers, the grad_norm when training Mistral Nemo 12B was below 1.0 as normal. Could it also be because of the change to using chat_templates?
This was using the same config with previous versions of axolotl and transformers:
Current behaviour
The gradient norm (grad_norm) is now around 5 when training:
Steps to reproduce
Train Mistral Nemo 12B Instruct with LoRA. I used the same config as I did back then, when this worked fine.
The only difference is that I am now using chat_templates, where I replace the chat template in the Mistral tokenizer_config.json with the chat template shown here so that it can accept consecutive messages from the same role.
I did this by changing the chat template in the Mistral Nemo tokenizer config to this:
If this is the wrong way to do it, could that be what is causing the high grad_norm? It seems unlikely, though, since the dataset appears to be tokenized properly when I use preprocess --debug.
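For completeness, this is roughly how the dataset is wired up on the axolotl side to use the chat template; the path and field names below are placeholders rather than my actual dataset, and the top-level chat_template option is included only as an illustration:

```yaml
# Sketch of a chat_template dataset entry in an axolotl config (path and field names are placeholders)
chat_template: tokenizer_default        # assumes the modified template in tokenizer_config.json is picked up
datasets:
  - path: ./data/train.jsonl
    type: chat_template
    field_messages: messages
    message_field_role: role
    message_field_content: content
```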
Config yaml
Possible solution
No response
Which Operating Systems are you using?
Python Version
3.11
axolotl branch-commit
main/db51a9e4
Acknowledgements