
Error when using ppsci with single-machine multi-GPU parallelism #1065

Open
Bingohong opened this issue Jan 15, 2025 · 1 comment

@Bingohong

Please ask your question

The network structure is extremely simple:

model1 = ppsci.arch.MLP(**cfg.MODEL1)
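
For reference, this instantiation expands to something like the sketch below; the keys and sizes here are illustrative placeholders, not the actual `cfg.MODEL1` values, which are not shown in this issue:

```python
import ppsci

# Illustrative stand-in for **cfg.MODEL1 -- the real keys/sizes are assumptions.
model1 = ppsci.arch.MLP(
    input_keys=("x", "y", "t"),   # assumed input variable names
    output_keys=("u", "v"),       # assumed output variable names
    num_layers=5,
    hidden_size=50,
    activation="tanh",
)
```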

Command

Run from the command line: python -m paddle.distributed.launch --selected_gpus='0,1' --find_unused_parameters=True crack2d_unsteady.py

Error output

File "/home/hongguobin0094/.conda/envs/ppsci_py310/lib/python3.10/site-packages/paddle/base/dygraph/tensor_patch_methods.py", line 355, in backward
    core.eager.run_backward([self], grad_tensor, retain_graph)
RuntimeError: (PreconditionNotMet) Error happened, when parameter[19][linear_9.b_0] has been ready before. Please set find_unused_parameters=True to traverse backward graph in each step to prepare reduce in advance. If you have set, there may be several reasons for this error: 1) In multiple reentrant backward phase, some parameters are reused.2) Using model parameters outside of forward function. Please make sure that model parameters are not shared in concurrent forward-backward passes.
  [Hint: Expected has_marked_unused_vars_ == false, but received has_marked_unused_vars_:1 != false:0.] (at ../paddle/fluid/distributed/collective/reducer.cc:812)


Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
I0115 15:47:05.968951 2441365 process_group_nccl.cc:155] ProcessGroupNCCL destruct 
I0115 15:47:05.969121 2441365 process_group_nccl.cc:155] ProcessGroupNCCL destruct 
I0115 15:47:05.969133 2441365 process_group_nccl.cc:155] ProcessGroupNCCL destruct 
I0115 15:47:06.083209 2441630 tcp_store.cc:290] receive shutdown event and so quit from MasterDaemon run loop
[2025-01-15 15:47:11,318] [    INFO] launch_utils.py:334 - terminate all the procs
INFO 2025-01-15 15:47:11,318 launch_utils.py:334] terminate all the procs
[2025-01-15 15:47:11,319] [   ERROR] launch_utils.py:648 - ABORT!!! Out of all 2 trainers, the trainer process with rank=[0, 1] was aborted. Please check its log.
ERROR 2025-01-15 15:47:11,319 launch_utils.py:648] ABORT!!! Out of all 2 trainers, the trainer process with rank=[0, 1] was aborted. Please check its log.
[2025-01-15 15:47:15,323] [    INFO] launch_utils.py:334 - terminate all the procs
INFO 2025-01-15 15:47:15,323 launch_utils.py:334] terminate all the procs
[2025-01-15 15:47:15,324] [ WARNING] launch.py:443 - Terminating... exit
WARNING 2025-01-15 15:47:15,324 launch.py:443] Terminating... exit
[2025-01-15 15:47:19,328] [    INFO] launch_utils.py:334 - terminate all the procs
INFO 2025-01-15 15:47:19,328 launch_utils.py:334] terminate all the procs
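
For context on the RuntimeError above: under paddle.DataParallel, each parameter's gradient may be marked "ready" only once per backward cycle, and the error message lists parameter reuse across reentrant backward passes as one cause. A minimal sketch of that pattern (a hypothetical script, not the reporter's crack2d_unsteady.py) would be:

```python
import paddle
import paddle.distributed as dist

# Launch with: python -m paddle.distributed.launch --gpus="0,1" repro.py
dist.init_parallel_env()

model = paddle.nn.Linear(4, 4)
dp_model = paddle.DataParallel(model, find_unused_parameters=True)

x = paddle.randn([8, 4])
out = dp_model(x)

loss1 = out.mean()
loss2 = (out * out).mean()

loss1.backward(retain_graph=True)  # marks the shared parameters' grads as ready
loss2.backward()                   # marks the same parameters again in one cycle,
                                   # which can raise "parameter[...] has been ready before"
```

If the failing script computes several losses and calls backward on each one separately, combining them into a single loss and calling backward once is the usual way to avoid this.
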
@HydrogenSulfate
Collaborator

HydrogenSulfate commented Jan 15, 2025

Thanks a lot for filing this issue. I tried locally with a multi-GPU-capable case that is also based on an MLP model, viv.py:

[Screenshot: viv.py training normally on multiple GPUs]

It looks like it runs without problems...

So would it be convenient to provide the complete training script? You could upload it to AI Studio and share the link with me, or create a private GitHub repository and add me as a member; either works.

Command used: python -m paddle.distributed.launch --gpus="0,1" viv.py. Reference: https://paddlescience-docs.readthedocs.io/zh-cn/latest/zh/user_guide/#221
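
One incidental difference between the two commands, noted here as an observation rather than a confirmed cause: the launcher flags differ, and --gpus is the spelling used in current Paddle documentation, while --selected_gpus is the older form.

```bash
# Reporter's command (older --selected_gpus flag):
python -m paddle.distributed.launch --selected_gpus='0,1' --find_unused_parameters=True crack2d_unsteady.py

# Command used in the test above:
python -m paddle.distributed.launch --gpus="0,1" viv.py
```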
