Multi-GPU Training GPU Usage #2701
-
Hi, I'm using Lightning with DDP as the backend for multi-GPU training, with Apex AMP (amp_level = 'O1') on 8 GPUs. I noticed that during training, GPU0's utilization is 0% most of the time while the others are almost at 100%, yet their memory usage is identical. Is this normal? I use OpenPAI and have attached the utilization and memory usage charts below. Thanks.
Replies: 4 comments
-
Check the CPU usage to make sure dataloading is not a bottleneck.
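A quick way to check this is to time the DataLoader in isolation, without any model or GPU work. This is a minimal sketch with a synthetic `TensorDataset` standing in for the real dataset; if increasing `num_workers` noticeably shrinks the iteration time, CPU-side dataloading was a bottleneck:

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in for the real dataset; replace with your own Dataset.
dataset = TensorDataset(
    torch.randn(1024, 3, 32, 32),
    torch.randint(0, 10, (1024,)),
)

def time_loader(num_workers: int) -> float:
    """Iterate the full loader once and return the elapsed wall time."""
    loader = DataLoader(dataset, batch_size=64, num_workers=num_workers)
    start = time.perf_counter()
    for _batch in loader:
        pass  # no model work: measures pure dataloading cost
    return time.perf_counter() - start

for w in (0, 2):
    print(f"num_workers={w}: {time_loader(w):.3f}s")
```

If the time barely changes with more workers, the loader is not the bottleneck and the stall is elsewhere (e.g. preprocessing on the main process or synchronization between ranks).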
-
Hi, thanks for the reply. The overall metrics (including CPU and GPU) are as follows: From these images I can't tell whether dataloading is the bottleneck. I also ran the profilers, as mentioned in this issue. The dataloading time is constant, but there is a strange call on 'torch._C._TensorBase' objects that takes a lot of time. Thanks!
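For reference, Lightning's built-in profiler can be enabled directly on the `Trainer`. This is a sketch of the setup described in the question (8 GPUs, DDP, Apex O1); argument names follow the Lightning versions current at the time of this thread and may differ in newer releases:

```python
from pytorch_lightning import Trainer

# profiler="simple" prints the time spent in each training hook after fit();
# profiler="advanced" produces full cProfile output per hook, which is where
# per-call entries like 'torch._C._TensorBase' show up.
trainer = Trainer(
    gpus=8,
    distributed_backend="ddp",
    precision=16,
    amp_level="O1",
    profiler="simple",
)
```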
-
Your CPU usage seems high; the CPU could be the bottleneck here. Try fewer GPUs and then observe the GPU utilization.
-
Yes, I have figured it out. Thank you so much!