distribution training #9

Closed
EDGSCOUT opened this issue May 8, 2021 · 2 comments

@EDGSCOUT

EDGSCOUT commented May 8, 2021

Traceback (most recent call last):
File "main.py", line 107, in <module>
main()
File "main.py", line 27, in main
torch.distributed.init_process_group(backend='nccl', init_method='env://')
File "/home/ps/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 423, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/home/ps/anaconda3/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 179, in _env_rendezvous_handler
store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
Torch Version: 1.7.0
Torch Version: 1.7.0
Traceback (most recent call last):
File "main.py", line 107, in <module>
main()
File "main.py", line 27, in main
torch.distributed.init_process_group(backend='nccl', init_method='env://')
File "/home/ps/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 423, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/home/ps/anaconda3/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 179, in _env_rendezvous_handler
store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
KeyboardInterrupt
Traceback (most recent call last):
File "/home/ps/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/ps/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/ps/anaconda3/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in <module>
main()
File "/home/ps/anaconda3/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ps/anaconda3/bin/python', '-u', 'main.py', '--local_rank=3', '--config=config/MGMatting-DIM.toml']' returned non-zero exit status 1.
(base) ps@ps:~/Downloads/MGMatting-main/code-base$
Traceback (most recent call last):
File "main.py", line 107, in <module>
main()
File "main.py", line 27, in main
torch.distributed.init_process_group(backend='nccl', init_method='env://')
File "/home/ps/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
barrier()
File "/home/ps/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
work = _default_pg.barrier()
RuntimeError: Broken pipe
Traceback (most recent call last):
File "main.py", line 107, in <module>
main()
File "main.py", line 27, in main
torch.distributed.init_process_group(backend='nccl', init_method='env://')
File "/home/ps/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
barrier()
File "/home/ps/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
work = _default_pg.barrier()
RuntimeError: Broken pipe
Traceback (most recent call last):
File "main.py", line 107, in <module>
main()
File "main.py", line 27, in main
torch.distributed.init_process_group(backend='nccl', init_method='env://')
File "/home/ps/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 442, in init_process_group
barrier()
File "/home/ps/anaconda3/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1947, in barrier
work = _default_pg.barrier()
RuntimeError: Broken pipe
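
The first traceback above fails inside `TCPStore` with `Address already in use`, which usually means something is already bound to the rendezvous port, for example a leftover process from an earlier run. The thread itself attributes the failure to a version problem, so the following is only a diagnostic sketch for ruling the port out. It assumes the `env://` rendezvous reads `MASTER_ADDR` and `MASTER_PORT`, and that the launcher's default port is 29500; `port_is_free` and `find_free_port` are hypothetical helper names, not part of PyTorch or this repository.

```python
import os
import socket


def port_is_free(host: str, port: int) -> bool:
    """Return True if a TCP socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Bind failed: the port is in use or the address is not usable.
            return False
    return True


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


if __name__ == "__main__":
    # Assumed defaults; adjust to match the actual launch command.
    host = os.environ.get("MASTER_ADDR", "127.0.0.1")
    port = int(os.environ.get("MASTER_PORT", "29500"))
    if port_is_free(host, port):
        print(f"{host}:{port} is free")
    else:
        print(f"{host}:{port} is in use; a free alternative: {find_free_port(host)}")
```

If the port turns out to be taken, a different one can be passed to `torch.distributed.launch` with `--master_port`, or the stale process holding it can be stopped before relaunching.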

@yucornetto
Owner

Hi, could you check your PyTorch and CUDA versions? It seems the problem may be caused by a library issue.
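
One way to collect the versions asked about here is a short script like the one below. This is a minimal sketch rather than anything from the repository: `collect_versions` is a hypothetical helper name, and it falls back to `None` for anything it cannot detect, including when PyTorch is not installed at all.

```python
import importlib
import platform


def collect_versions() -> dict:
    """Gather Python, PyTorch, CUDA, cuDNN and NCCL versions where available."""
    info = {
        "python": platform.python_version(),
        "torch": None,
        "cuda": None,
        "cudnn": None,
        "nccl": None,
        "gpus": 0,
    }
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # PyTorch is not installed in this environment.
        return info

    info["torch"] = torch.__version__
    # CUDA version PyTorch was built against; None for CPU-only builds.
    info["cuda"] = torch.version.cuda
    if torch.cuda.is_available():
        info["gpus"] = torch.cuda.device_count()
        info["cudnn"] = torch.backends.cudnn.version()
    try:
        dist = torch.distributed
        if dist.is_available() and dist.is_nccl_available():
            info["nccl"] = torch.cuda.nccl.version()
    except Exception as exc:  # keep the report usable even if the NCCL query fails
        info["nccl"] = f"unavailable ({exc})"
    return info


if __name__ == "__main__":
    for name, value in collect_versions().items():
        print(f"{name}: {value}")
```

Comparing the reported CUDA version against the installed driver (for example via `nvidia-smi`) is a reasonable next step when the `nccl` backend fails during `init_process_group`.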

@EDGSCOUT
Author

> Hi, could you check your PyTorch and CUDA versions? It seems the problem may be caused by a library issue.

Yes, I have solved it. It was a version problem.
We can close this.
