
[Feature] support requires_grad key in paramwise_cfg #767

Open · wants to merge 6 commits into base: main

Conversation

cxiang26 (Contributor)

Thanks for your contribution; we appreciate it a lot. The following instructions will help keep your pull request healthy and make it easier to get feedback. If you do not understand some items, don't worry; just open the pull request and ask the maintainers for help.

Motivation

See issue #750.

optim_wrapper = dict(
    type="OptimWrapper",
    optimizer=dict(type="AdamW", lr=lr, weight_decay=0.01),
    paramwise_cfg=dict(
        custom_keys={
            "backbone": dict(requires_grad=False),  # instead of lr_mult=0
            "neck.conv0": dict(requires_grad=False),
        }
    ),
)
  1. It is flexible: any part of the model can be detached, e.g. backbone, neck, neck.xx, ... (see the sketch after this list).
  2. It saves more memory than using lr_mult=0.
  3. It is friendly to models that do not follow the standard MMLab-style implementation.
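
For reference, a minimal sketch of the idea (not the code in this PR; apply_requires_grad is a hypothetical helper): walk the named parameters and freeze every parameter whose name matches a custom key carrying requires_grad=False.

import torch.nn as nn


def apply_requires_grad(model: nn.Module, custom_keys: dict) -> None:
    """Freeze parameters whose names match a key with requires_grad=False.

    ``custom_keys`` maps a name prefix/substring to an option dict, e.g.
    ``{'backbone': dict(requires_grad=False)}``, as in the config above.
    """
    for name, param in model.named_parameters():
        for key, options in custom_keys.items():
            if key in name and options.get('requires_grad', True) is False:
                # Detach the parameter from autograd: no gradient buffers are
                # allocated for it, which is where the memory saving comes from.
                param.requires_grad = False

The PR itself implements this inside mmengine/optim/optimizer/default_constructor.py (see the review thread below).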

Modification

Please briefly describe the modifications made in this PR.

BC-breaking (Optional)

Does the modification introduce changes that break the backward-compatibility of the downstream repos?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.

Checklist

  1. Pre-commit or other linting tools are used to fix potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
  3. If the modification could affect downstream projects, this PR should be tested with them, e.g. MMDet or MMCls.
  4. The documentation has been modified accordingly, e.g. docstrings or example tutorials.

@HAOCHENYE (Collaborator) left a comment

Thanks for your contribution! The corresponding unit tests should also be updated.

mmengine/optim/optimizer/default_constructor.py (outdated review thread, resolved)
@HAOCHENYE (Collaborator) left a comment

Thanks for your contribution 🚀! The current implementation looks good to me. Would you mind updating the docstring? It would help others learn about this feature 😆

@cxiang26 requested a review from zhouzaida as a code owner on November 26, 2022, 15:54
@cxiang26 (Contributor, Author)

Where is the docstring?

@HAOCHENYE (Collaborator)

(screenshot of the relevant docstring location)

@cxiang26 You can document the usage of requires_grad here.
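
For example (a sketch only; the exact wording and placement are up to the author), the custom_keys description in that docstring could gain a short usage note like:

    >>> # freeze the whole backbone through paramwise_cfg
    >>> paramwise_cfg = dict(
    ...     custom_keys={'backbone': dict(requires_grad=False)})
    >>> # parameters whose names contain 'backbone' get requires_grad=False
    >>> # and therefore receive no gradients during training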

@zhouzaida requested a review from RangiLyu on November 28, 2022, 02:40
@cxiang26 (Contributor, Author)

@HAOCHENYE I got an error when setting requires_grad=False on a multi-GPU setup, but it runs normally when I set find_unused_parameters=True in the config:

2022-11-29 00:27   File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 994, in forward
2022-11-29 00:27 return forward_call(*input, **kwargs)
2022-11-29 00:27   File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 994, in forward
2022-11-29 00:27 
2022-11-29 00:27   File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 994, in forward
2022-11-29 00:27   File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 994, in forward
2022-11-29 00:27   File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 994, in forward
2022-11-29 00:27   File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 994, in forward
2022-11-29 00:27     return forward_call(*input, **kwargs)
2022-11-29 00:27   File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 994, in forward
2022-11-29 00:27     return forward_call(*input, **kwargs)
2022-11-29 00:27   File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 994, in forward
2022-11-29 00:27             if torch.is_grad_enabled() and self.reducer._rebuild_buckets():if torch.is_grad_enabled() and self.reducer._rebuild_buckets():if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
2022-11-29 00:27 
2022-11-29 00:27 
2022-11-29 00:27             if torch.is_grad_enabled() and self.reducer._rebuild_buckets():if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
2022-11-29 00:27 if torch.is_grad_enabled() and self.reducer._rebuild_buckets():        
2022-11-29 00:27 
2022-11-29 00:27 if torch.is_grad_enabled() and self.reducer._rebuild_buckets():if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
2022-11-29 00:27 
2022-11-29 00:27 RuntimeError: RuntimeErrorRuntimeErrorExpected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by 
2022-11-29 00:27 making sure all `forward` function outputs participate in calculating loss. 
2022-11-29 00:27 If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
2022-11-29 00:27 Parameter indices which did not receive grad for rank 7: 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315
2022-11-29 00:27  In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this errorRuntimeError: 
2022-11-29 00:27 : : Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by 
2022-11-29 00:27 making sure all `forward` function outputs participate in calculating loss. 
2022-11-29 00:27 If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
2022-11-29 00:27 Parameter indices which did not receive grad for rank 5: 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315
2022-11-29 00:27  In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this errorRuntimeErrorExpected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by 
2022-11-29 00:27 making sure all `forward` function outputs participate in calculating loss. 

@HAOCHENYE (Collaborator) commented on Nov 29, 2022

(quoting the error report and log above)

Ahhhh, this error is raised as expected. Actually, enabling find_unused_parameters=True slows down training. I'm not sure the speed-up gained from setting requires_grad=False can compensate for the slowdown caused by find_unused_parameters=True in DDP training. Would you mind providing more specific data?
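
For context, a bare-PyTorch sketch of the failure mode (toy modules, not MMEngine code; assumes a torchrun launch so the process group can be initialized): parameters frozen after DDP has been constructed are still registered with the reducer, which then waits for gradients that never arrive; find_unused_parameters=True makes DDP detect and skip them on every iteration, at the cost of an extra traversal of the autograd graph.

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group('gloo')  # 'nccl' for GPU training

# Toy "backbone" + "head".
model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 2))

# Without find_unused_parameters=True, freezing parameters *after* wrapping
# triggers the "Expected to have finished reduction ..." error on backward,
# because the reducer still expects gradients for the frozen parameters.
ddp_model = DDP(model, find_unused_parameters=True)

for p in ddp_model.module[0].parameters():  # freeze the toy "backbone"
    p.requires_grad = False

out = ddp_model(torch.randn(4, 8))
out.sum().backward()  # succeeds: the frozen parameters are marked as unused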

@cxiang26 (Contributor, Author) commented on Dec 2, 2022

(quoting the error log and @HAOCHENYE's reply above)

@HAOCHENYE In my case, the training time of the standard model is similar to that of the model with find_unused_parameters=True and requires_grad=False, while the latter saves noticeably more memory. requires_grad=False is applied to the ResNet-101 backbone, which accounts for most of the compute cost in my model, so it works for me. I have no time to test more cases ...

@HAOCHENYE (Collaborator)

Thanks for your feedback! I'll test the efficiency of this PR in the next few days.

@CLAassistant commented on Dec 14, 2022

CLA assistant check
All committers have signed the CLA.

@HAOCHENYE HAOCHENYE added this to the 0.6.0 milestone Jan 12, 2023
@zhouzaida zhouzaida modified the milestones: 0.6.0, 0.7.4 Apr 26, 2023
codecov bot commented on Sep 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Please upload report for BASE (main@d3d7528). Learn more about missing BASE report.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #767   +/-   ##
=======================================
  Coverage        ?   78.69%           
=======================================
  Files           ?      128           
  Lines           ?     9346           
  Branches        ?     1847           
=======================================
  Hits            ?     7355           
  Misses          ?     1675           
  Partials        ?      316           
Flag        Coverage Δ
unittests   78.69% <100.00%> (?)

Flags with carried forward coverage won't be shown.

