
Error when using ppsci with single-machine multi-GPU parallelism #1065

Open
Bingohong opened this issue Jan 15, 2025 · 1 comment

@Bingohong

Please ask your question

The network structure is extremely simple:

model1 = ppsci.arch.MLP(**cfg.MODEL1)
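
For reference, this instantiation expands to something like the sketch below; the keys and sizes here are illustrative placeholders, not the actual `cfg.MODEL1` values, which are not shown in this issue:

```python
import ppsci

# Illustrative stand-in for **cfg.MODEL1 -- the real keys/sizes are assumptions.
model1 = ppsci.arch.MLP(
    input_keys=("x", "y", "t"),   # assumed input variable names
    output_keys=("u", "v"),       # assumed output variable names
    num_layers=5,
    hidden_size=50,
    activation="tanh",
)
```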

Command

Run from the command line: python -m paddle.distributed.launch --selected_gpus='0,1' --find_unused_parameters=True crack2d_unsteady.py

Error output

File "/home/hongguobin0094/.conda/envs/ppsci_py310/lib/python3.10/site-packages/paddle/base/dygraph/tensor_patch_methods.py", line 355, in backward
    core.eager.run_backward([self], grad_tensor, retain_graph)
RuntimeError: (PreconditionNotMet) Error happened, when parameter[19][linear_9.b_0] has been ready before. Please set find_unused_parameters=True to traverse backward graph in each step to prepare reduce in advance. If you have set, there may be several reasons for this error: 1) In multiple reentrant backward phase, some parameters are reused.2) Using model parameters outside of forward function. Please make sure that model parameters are not shared in concurrent forward-backward passes.
  [Hint: Expected has_marked_unused_vars_ == false, but received has_marked_unused_vars_:1 != false:0.] (at ../paddle/fluid/distributed/collective/reducer.cc:812)


Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
I0115 15:47:05.968951 2441365 process_group_nccl.cc:155] ProcessGroupNCCL destruct 
I0115 15:47:05.969121 2441365 process_group_nccl.cc:155] ProcessGroupNCCL destruct 
I0115 15:47:05.969133 2441365 process_group_nccl.cc:155] ProcessGroupNCCL destruct 
I0115 15:47:06.083209 2441630 tcp_store.cc:290] receive shutdown event and so quit from MasterDaemon run loop
[2025-01-15 15:47:11,318] [    INFO] launch_utils.py:334 - terminate all the procs
INFO 2025-01-15 15:47:11,318 launch_utils.py:334] terminate all the procs
[2025-01-15 15:47:11,319] [   ERROR] launch_utils.py:648 - ABORT!!! Out of all 2 trainers, the trainer process with rank=[0, 1] was aborted. Please check its log.
ERROR 2025-01-15 15:47:11,319 launch_utils.py:648] ABORT!!! Out of all 2 trainers, the trainer process with rank=[0, 1] was aborted. Please check its log.
[2025-01-15 15:47:15,323] [    INFO] launch_utils.py:334 - terminate all the procs
INFO 2025-01-15 15:47:15,323 launch_utils.py:334] terminate all the procs
[2025-01-15 15:47:15,324] [ WARNING] launch.py:443 - Terminating... exit
WARNING 2025-01-15 15:47:15,324 launch.py:443] Terminating... exit
[2025-01-15 15:47:19,328] [    INFO] launch_utils.py:334 - terminate all the procs
INFO 2025-01-15 15:47:19,328 launch_utils.py:334] terminate all the procs
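
For context on the RuntimeError above: under paddle.DataParallel, each parameter's gradient may be marked "ready" only once per backward cycle, and the error message lists parameter reuse across reentrant backward passes as one cause. A minimal sketch of that pattern (a hypothetical script, not the reporter's crack2d_unsteady.py) would be:

```python
import paddle
import paddle.distributed as dist

# Launch with: python -m paddle.distributed.launch --gpus="0,1" repro.py
dist.init_parallel_env()

model = paddle.nn.Linear(4, 4)
dp_model = paddle.DataParallel(model, find_unused_parameters=True)

x = paddle.randn([8, 4])
out = dp_model(x)

loss1 = out.mean()
loss2 = (out * out).mean()

loss1.backward(retain_graph=True)  # marks the shared parameters' grads as ready
loss2.backward()                   # marks the same parameters again in one cycle,
                                   # which can raise "parameter[...] has been ready before"
```

If the failing script computes several losses and calls backward on each one separately, combining them into a single loss and calling backward once is the usual way to avoid this.
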
@HydrogenSulfate
Collaborator

HydrogenSulfate commented Jan 15, 2025

Thanks a lot for filing this issue. I tried locally with a multi-GPU-capable case that is also based on an MLP model, viv.py:

[Screenshot: viv.py training normally on multiple GPUs]

It looks like it runs without problems...

So would it be convenient to provide the complete training script? You could upload it to AI Studio and share the link with me, or create a private GitHub repository and add me as a member; either works.

Command used: python -m paddle.distributed.launch --gpus="0,1" viv.py. Reference: https://paddlescience-docs.readthedocs.io/zh-cn/latest/zh/user_guide/#221
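
One incidental difference between the two commands, noted here as an observation rather than a confirmed cause: the launcher flags differ, and --gpus is the spelling used in current Paddle documentation, while --selected_gpus is the older form.

```bash
# Reporter's command (older --selected_gpus flag):
python -m paddle.distributed.launch --selected_gpus='0,1' --find_unused_parameters=True crack2d_unsteady.py

# Command used in the test above:
python -m paddle.distributed.launch --gpus="0,1" viv.py
```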
