#- PyTorch Lightning Version (e.g., 2.5.0):
#- PyTorch Version (e.g., 2.5):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):
Bug description
I am trying to use Lightning with DeepSpeed stage 3 to train a model with precision "16-mixed". However, the model parameters contain NaN and Inf values at the first step. When I switch the strategy to DDP, the issue does not occur.
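A minimal sketch of one way to check parameters for NaN and Inf values, assuming plain PyTorch. The helper name `find_nonfinite_params` is hypothetical and not part of Lightning or DeepSpeed. Under DeepSpeed stage 3 the parameters are partitioned across ranks and may appear as empty placeholders, so they would likely need to be gathered first (for example inside DeepSpeed's `deepspeed.zero.GatheredParameters` context manager) before a check like this sees the full values.

```python
import torch
from torch import nn


def find_nonfinite_params(model: nn.Module) -> dict:
    """Return {parameter name: (nan_count, inf_count)} for parameters
    that contain at least one NaN or Inf value."""
    report = {}
    for name, param in model.named_parameters():
        data = param.detach()
        if data.numel() == 0:
            # Assumption: an empty tensor is a partitioned placeholder
            # (as can happen under ZeRO stage 3), so there is nothing to check.
            continue
        nan_count = int(torch.isnan(data).sum())
        inf_count = int(torch.isinf(data).sum())
        if nan_count or inf_count:
            report[name] = (nan_count, inf_count)
    return report


if __name__ == "__main__":
    # Toy model with one NaN and one Inf injected on purpose.
    model = nn.Linear(4, 2)
    with torch.no_grad():
        model.weight[0, 0] = float("nan")
        model.bias[1] = float("inf")
    print(find_nonfinite_params(model))
```

Calling such a helper right after model setup and again after the first optimizer step could help narrow down whether the non-finite values come from initialization or from the first update.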
I initialize my trainer as:
What version are you seeing the problem on?
v2.5
How to reproduce the bug
Error messages and logs
Environment
Current environment
More info
No response