[Feature] support requires_grad key in paramwise_cfg #767
Conversation
Thanks for your contribution! The corresponding unit tests should also be updated.
Thanks for your contribution 🚀! The current implementation looks good to me. Would you mind updating the docstring? It would help others learn about this feature 😆
Where is the docstring?
@cxiang26 You can supplement the usage of
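For anyone following along, here is a hedged sketch of how the proposed key could look in a config. Only `requires_grad` is new; the other field names follow the existing `paramwise_cfg` / `custom_keys` convention, and the exact placement should be read as this PR's proposal rather than a settled API.

```python
# Sketch of a config using the proposed key (assumption: `requires_grad`
# is honored per matched key, the same way `lr_mult`/`decay_mult` are).
optim_wrapper = dict(
    type='OptimWrapper',
    optimizer=dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=1e-4),
    paramwise_cfg=dict(
        custom_keys={
            # Parameters whose names contain 'backbone' would be frozen:
            # requires_grad=False removes them from gradient computation.
            'backbone': dict(requires_grad=False),
            # Other keys keep the usual multipliers.
            'head': dict(lr_mult=1.0),
        }))
```

The docstring of `paramwise_cfg` could then document this key alongside `lr_mult` and `decay_mult`.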
@HAOCHENYE I got an error when setting it during distributed training (2022-11-29 00:27):

  File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 994, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 7: 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error

(Rank 5 raises the same RuntimeError with the same parameter indices.)
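The workaround the error message itself suggests looks roughly like this in plain PyTorch. This is a minimal sketch, not part of this PR: `MyModel` and `local_rank` are placeholders for the user's network and the rank assigned by the launcher.

```python
from torch.nn.parallel import DistributedDataParallel as DDP

# Minimal sketch of the workaround suggested by the error message above.
model = MyModel().cuda(local_rank)   # placeholder model and device index
ddp_model = DDP(
    model,
    device_ids=[local_rank],
    # Tell DDP to detect parameters that receive no gradient in a step,
    # so the reducer does not wait for them and raise the RuntimeError.
    find_unused_parameters=True,
)
```

Note that `find_unused_parameters=True` makes DDP traverse the autograd graph every iteration to find the unused parameters, which is why the timing question discussed below matters.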
Ahhhh, this error is raised as expected. Actually, when we enable
@HAOCHENYE Training time consumed by the standard model is similar to the model with
Thannnnnks for your feedback! I'll test the efficiency based on this PR in the next few days.
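In case it helps, a rough way to compare per-iteration time with and without the change is a small timing helper like the one below. This is a hypothetical sketch, not part of this PR; `ddp_model`, `data`, `target`, and `criterion` stand for whatever the experiment uses.

```python
import time

import torch


def mean_iter_time(ddp_model, data, target, criterion, iters=100, warmup=10):
    """Average forward+backward time per iteration for a DDP-wrapped model."""
    for _ in range(warmup):  # warm up CUDA kernels and DDP buckets
        criterion(ddp_model(data), target).backward()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        criterion(ddp_model(data), target).backward()
    torch.cuda.synchronize()  # wait for all kernels before stopping the clock
    return (time.perf_counter() - start) / iters
```

Running it once for the standard model and once for the model built from this PR's config would give a concrete number for the overhead.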
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@ Coverage Diff @@
##           main     #767   +/-   ##
=======================================
  Coverage      ?   78.69%
=======================================
  Files         ?      128
  Lines         ?     9346
  Branches      ?     1847
=======================================
  Hits          ?     7355
  Misses        ?     1675
  Partials      ?      316

Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help it get feedback more easily. If you do not understand some items, don't worry; just open the pull request and ask the maintainers for help.
Motivation
Issue #750
Modification
Support a `requires_grad` key in `paramwise_cfg`, so that the parameters it matches can be frozen (their `requires_grad` attribute is set to False and they receive no gradients).
BC-breaking (Optional)
Does the modification introduce changes that break the backward-compatibility of the downstream repos?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.
Use cases (Optional)
If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.
Checklist