Model does not update its weights #20215
Comments
Thanks for reporting the issue. Setting precision to '32-true' fixes the problem for me.
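For reference, the workaround in code looks roughly like the sketch below; the model and dataloader names are placeholders, not anything from the original setup:

import lightning as L

trainer = L.Trainer(
    precision="32-true",  # full fp32: no GradScaler, so no silently skipped optimizer steps
    max_epochs=1,
)
# trainer.fit(model, dataloader)  # `model` and `dataloader` stand in for your own objects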
Yes, but that is not really a solution. Besides, the problem might still be present and simply manifest itself at a different training step.
Agreed it's not a fix, but it saved me from having to rewrite my implementation or tell my PI that we had to wait for a bug to be fixed before we could finish our paper.
Looks like it's affected by the Lightning version.
PR #20460 fixes this issue, please take a look.
@kopalja thank you for the investigation and the reproduction
So this happens because automatic mixed precision in PyTorch is explicitly designed to work this way: in order to avoid issues with inf/NaN gradients under float16, the GradScaler skips the optimizer step whenever it finds non-finite gradients and lowers the loss scale instead, so some steps legitimately leave the weights unchanged. Here is the equivalent raw PyTorch code:

import torch
from torch.amp import GradScaler
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# CNN is the model from the original reproduction; forward(image, target) returns the loss.

def train():
    model = CNN().cuda()  # put the model on the GPU so CUDA autocast and the GradScaler apply
    dataset = datasets.MNIST(root=".mnist_data", download=True, transform=transforms.ToTensor())
    dataloader = DataLoader(dataset)
    scaler = GradScaler(device="cuda")
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-3)
    previous_params = None
    for batch_idx, batch in enumerate(dataloader):
        # snapshot the flattened weights and compare them with the previous step
        params = torch.cat([param.view(-1) for param in model.parameters()])
        if previous_params is not None:
            num_different_values = (previous_params != params).sum().item()
            assert num_different_values != 0  # fails at the step where the scaler skipped the update
        else:
            num_different_values = None
        previous_params = params
        optimizer.zero_grad()
        batch = [t.cuda() for t in batch]
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            loss = model.forward(*batch)
        scaler.scale(loss).backward()
        scaler.step(optimizer)  # skipped when inf/NaN gradients are found
        scaler.update()
        print(
            f"step {batch_idx} | diff weights: {num_different_values} | all weights: {params.numel()} | weights mean: {torch.mean(params)} | loss: {loss.item()}"
        )

if __name__ == '__main__':
    torch.set_float32_matmul_precision("high")
    torch.manual_seed(1337)
    train()

As you can see, you get steps with no weight updates, as expected. Note that if you don't use the scaler to step, you won't hit the assert (the weights always change), but the updates will ultimately be incorrect and are likely to blow up:

# same imports and CNN as above

def train():
    model = CNN().cuda()
    dataset = datasets.MNIST(root=".mnist_data", download=True, transform=transforms.ToTensor())
    dataloader = DataLoader(dataset)
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-3)
    previous_params = None
    for batch_idx, batch in enumerate(dataloader):
        params = torch.cat([param.view(-1) for param in model.parameters()])
        if previous_params is not None:
            num_different_values = (previous_params != params).sum().item()
            if num_different_values == 0:
                return  # never reached: without the scaler every step changes the weights
        else:
            num_different_values = None
        previous_params = params
        optimizer.zero_grad()
        batch = [t.cuda() for t in batch]
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            loss = model.forward(*batch)
        loss.backward()  # no GradScaler: float16 gradients can under/overflow, so updates are unreliable
        optimizer.step()
        print(
            f"step {batch_idx} | diff weights: {num_different_values} | all weights: {params.numel()} | weights mean: {torch.mean(params)} | loss: {loss.item()}"
        )

if __name__ == '__main__':
    torch.set_float32_matmul_precision("high")
    torch.manual_seed(1337)
    train()
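A related note: if you want to detect the skipped steps directly instead of diffing the weights, a common pattern with GradScaler is to compare the scale before and after update(), since the scaler only lowers the scale when it skips a step. A minimal sketch, meant to drop into the loop above:

        scale_before = scaler.get_scale()
        scaler.step(optimizer)  # silently skipped if inf/NaN gradients were found
        scaler.update()
        if scaler.get_scale() < scale_before:
            print(f"step {batch_idx}: optimizer step skipped by GradScaler")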
Bug description
Hi, I am using PyTorch Lightning to implement some new optimization strategies using automatic_optimization=False. For certain settings my optimization strategy (using automatic_optimization=False) should yield the same results as the standard optimization process (automatic_optimization=True). However, I could not make it work: my optimization process was returning slightly different results than the default one. After a while I figured out that PyTorch Lightning sometimes does not update the model weights when using the default automatic_optimization=True. I have put together a minimal example in which the model weights don't get updated on step 5. The weights also fail to update with different hyper-parameters (e.g., batch size, learning rate), just at a different training step. Am I missing something, or does this look like a bug?
Thanks!
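For context, the general shape of a manual-optimization module in Lightning is sketched below; this is only the generic pattern, not the exact strategy in question, and the class and attribute names are placeholders:

import lightning as L
import torch

class ManualOptModule(L.LightningModule):  # placeholder name
    def __init__(self, net):
        super().__init__()
        self.net = net
        self.automatic_optimization = False  # take over the optimization loop

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()
        opt.zero_grad()
        loss = self.net(*batch)
        self.manual_backward(loss)  # routed through Lightning's precision handling
        opt.step()
        return loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.net.parameters(), lr=2e-3)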
What version are you seeing the problem on?
v2.4
How to reproduce the bug
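A minimal sketch of the kind of setup that triggers this, based on the raw-PyTorch equivalent shown in the comments; the CNN architecture, the precision="16-mixed" setting, and the weight check are assumptions rather than the exact reproduction:

import lightning as L
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

class LitCNN(L.LightningModule):  # hypothetical stand-in for the reported CNN
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Flatten(), nn.Linear(16 * 28 * 28, 10),
        )
        self.previous_params = None

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.cross_entropy(self.net(x), y)

    def on_train_batch_start(self, batch, batch_idx):
        # check whether the previous optimizer step actually changed the weights
        params = torch.cat([p.detach().view(-1).cpu() for p in self.parameters()])
        if self.previous_params is not None:
            diff = (params != self.previous_params).sum().item()
            print(f"step {batch_idx} | changed weights: {diff} / {params.numel()}")
        self.previous_params = params

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=2e-3)

if __name__ == "__main__":
    torch.manual_seed(1337)
    dataset = datasets.MNIST(root=".mnist_data", download=True, transform=transforms.ToTensor())
    trainer = L.Trainer(precision="16-mixed", max_steps=20, accelerator="gpu", devices=1)
    trainer.fit(LitCNN(), DataLoader(dataset))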
Error messages and logs
Environment
Current environment
More info
No response