
[Feature] support requires_grad key in paramwise_cfg #767

Open · wants to merge 6 commits into base: main

Conversation

cxiang26 (Contributor)

Thanks for your contribution; we appreciate it a lot. The following instructions will help keep your pull request healthy and make it easier to get feedback. If you do not understand some items, don't worry; just open the pull request and ask the maintainers for help.

Motivation

See issue #750.

optim_wrapper = dict(
    type="OptimWrapper",
    optimizer=dict(type="AdamW", lr=lr, weight_decay=0.01),
    paramwise_cfg=dict(
        custom_keys={
            "backbone": dict(requires_grad=False),  # instead of lr_mult=0
            "neck.conv0": dict(requires_grad=False),
        }
    ),
)
  1. It is flexible: any part of the model can be detached, e.g. backbone, neck, neck.xx, ... (see the sketch after this list).
  2. It saves more memory than using lr_mult=0.
  3. It is friendly to models that do not follow the standard MMLab-style implementation.
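
For reference, a minimal sketch of the idea (not the code in this PR; apply_requires_grad is a hypothetical helper): walk the named parameters and freeze every parameter whose name matches a custom key carrying requires_grad=False.

import torch.nn as nn


def apply_requires_grad(model: nn.Module, custom_keys: dict) -> None:
    """Freeze parameters whose names match a key with requires_grad=False.

    ``custom_keys`` maps a name prefix/substring to an option dict, e.g.
    ``{'backbone': dict(requires_grad=False)}``, as in the config above.
    """
    for name, param in model.named_parameters():
        for key, options in custom_keys.items():
            if key in name and options.get('requires_grad', True) is False:
                # Detach the parameter from autograd: no gradient buffers are
                # allocated for it, which is where the memory saving comes from.
                param.requires_grad = False

The PR itself implements this inside mmengine/optim/optimizer/default_constructor.py (see the review thread below).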

Modification

Please briefly describe the modifications made in this PR.

BC-breaking (Optional)

Does the modification introduce changes that break the backward-compatibility of the downstream repos?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.

Checklist

  1. Pre-commit or other linting tools are used to fix potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure correctness.
  3. If the modification could affect downstream projects, this PR should be tested with them, e.g. MMDet or MMCls.
  4. The documentation has been modified accordingly, e.g. docstrings or example tutorials.

@HAOCHENYE (Collaborator) left a comment

Thanks for your contribution! The corresponding unit tests should also be updated.

mmengine/optim/optimizer/default_constructor.py (outdated review thread, resolved)
@HAOCHENYE (Collaborator) left a comment

Thanks for your contribution 🚀! The current implementation looks good to me. Would you mind updating the docstring? It would help others learn about this feature 😆

@cxiang26 requested a review from zhouzaida as a code owner on November 26, 2022, 15:54
@cxiang26 (Contributor, Author)

Where is the docstring?

@HAOCHENYE (Collaborator)

(screenshot of the relevant docstring location)

@cxiang26 You can document the usage of requires_grad here.
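
For example (a sketch only; the exact wording and placement are up to the author), the custom_keys description in that docstring could gain a short usage note like:

    >>> # freeze the whole backbone through paramwise_cfg
    >>> paramwise_cfg = dict(
    ...     custom_keys={'backbone': dict(requires_grad=False)})
    >>> # parameters whose names contain 'backbone' get requires_grad=False
    >>> # and therefore receive no gradients during training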

@zhouzaida requested a review from RangiLyu on November 28, 2022, 02:40
@cxiang26 (Contributor, Author)

@HAOCHENYE I got an error when setting requires_grad=False on a multi-GPU setup, but it runs normally when I set find_unused_parameters=True in the config:

2022-11-29 00:27   File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 994, in forward
2022-11-29 00:27 return forward_call(*input, **kwargs)
2022-11-29 00:27   File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 994, in forward
2022-11-29 00:27 
2022-11-29 00:27   File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 994, in forward
2022-11-29 00:27   File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 994, in forward
2022-11-29 00:27   File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 994, in forward
2022-11-29 00:27   File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 994, in forward
2022-11-29 00:27     return forward_call(*input, **kwargs)
2022-11-29 00:27   File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 994, in forward
2022-11-29 00:27     return forward_call(*input, **kwargs)
2022-11-29 00:27   File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/distributed.py", line 994, in forward
2022-11-29 00:27             if torch.is_grad_enabled() and self.reducer._rebuild_buckets():if torch.is_grad_enabled() and self.reducer._rebuild_buckets():if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
2022-11-29 00:27 
2022-11-29 00:27 
2022-11-29 00:27             if torch.is_grad_enabled() and self.reducer._rebuild_buckets():if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
2022-11-29 00:27 if torch.is_grad_enabled() and self.reducer._rebuild_buckets():        
2022-11-29 00:27 
2022-11-29 00:27 if torch.is_grad_enabled() and self.reducer._rebuild_buckets():if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
2022-11-29 00:27 
2022-11-29 00:27 RuntimeError: RuntimeErrorRuntimeErrorExpected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by 
2022-11-29 00:27 making sure all `forward` function outputs participate in calculating loss. 
2022-11-29 00:27 If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
2022-11-29 00:27 Parameter indices which did not receive grad for rank 7: 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315
2022-11-29 00:27  In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this errorRuntimeError: 
2022-11-29 00:27 : : Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by 
2022-11-29 00:27 making sure all `forward` function outputs participate in calculating loss. 
2022-11-29 00:27 If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
2022-11-29 00:27 Parameter indices which did not receive grad for rank 5: 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315
2022-11-29 00:27  In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this errorRuntimeErrorExpected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by 
2022-11-29 00:27 making sure all `forward` function outputs participate in calculating loss. 

@HAOCHENYE (Collaborator) commented on Nov 29, 2022

(quoting the error report and log above)

Ahhhh, this error is raised as expected. Actually, enabling find_unused_parameters=True slows down training. I'm not sure the speed-up gained from setting requires_grad=False can compensate for the slowdown caused by find_unused_parameters=True in DDP training. Would you mind providing more specific data?
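
For context, a bare-PyTorch sketch of the failure mode (toy modules, not MMEngine code; assumes a torchrun launch so the process group can be initialized): parameters frozen after DDP has been constructed are still registered with the reducer, which then waits for gradients that never arrive; find_unused_parameters=True makes DDP detect and skip them on every iteration, at the cost of an extra traversal of the autograd graph.

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group('gloo')  # 'nccl' for GPU training

# Toy "backbone" + "head".
model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 2))

# Without find_unused_parameters=True, freezing parameters *after* wrapping
# triggers the "Expected to have finished reduction ..." error on backward,
# because the reducer still expects gradients for the frozen parameters.
ddp_model = DDP(model, find_unused_parameters=True)

for p in ddp_model.module[0].parameters():  # freeze the toy "backbone"
    p.requires_grad = False

out = ddp_model(torch.randn(4, 8))
out.sum().backward()  # succeeds: the frozen parameters are marked as unused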

@cxiang26 (Contributor, Author) commented on Dec 2, 2022

(quoting the error log and @HAOCHENYE's reply above)

@HAOCHENYE In my case, the training time of the standard model is similar to that of the model with find_unused_parameters=True and requires_grad=False, while the latter saves noticeably more memory. requires_grad=False is applied to the ResNet-101 backbone, which accounts for most of the compute cost in my model, so it works for me. I have no time to test more cases ...

@HAOCHENYE (Collaborator)

Thanks for your feedback! I'll test the efficiency of this PR in the next few days.

@CLAassistant commented on Dec 14, 2022

CLA assistant check
All committers have signed the CLA.

@HAOCHENYE HAOCHENYE added this to the 0.6.0 milestone Jan 12, 2023
@zhouzaida zhouzaida modified the milestones: 0.6.0, 0.7.4 Apr 26, 2023
codecov bot commented on Sep 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Please upload report for BASE (main@d3d7528). Learn more about missing BASE report.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #767   +/-   ##
=======================================
  Coverage        ?   78.69%           
=======================================
  Files           ?      128           
  Lines           ?     9346           
  Branches        ?     1847           
=======================================
  Hits            ?     7355           
  Misses          ?     1675           
  Partials        ?      316           
Flag        Coverage Δ
unittests   78.69% <100.00%> (?)

Flags with carried forward coverage won't be shown.

