Does this have anything to do with the memory issue? #4

Open
minhphi1712 opened this issue Jan 21, 2025 · 0 comments

@minhphi1712

I ran the following command and hit a CUDA out-of-memory error while loading the model onto the GPU:

`bash scripts/cli_sat.sh --from_pretrained ./checkpoints/MSAGPT --input-source /home/mca/MP_PSP/myenv/env_name/MSAGPT/INPUT --output-path /home/mca/MP_PSP/myenv/env_name/MSAGPT/output --max-gen-length 64`

> NCCL_DEBUG=VERSION NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2 CUDA_LAUNCH_BLOCKING=0 torchrun --nproc_per_node 1 --master_port=19865 /home/mca/MP_PSP/myenv/env_name/MSAGPT/cli_sat.py --bf16 --skip-init --mode finetune --rotary-embedding-2d --seed 12345 --sampling-strategy BaseStrategy --max-gen-length 128 --min-gen-length 0 --num-beams 4 --length-penalty 1.0 --no-repeat-ngram-size 0 --multiline_stream --temperature 0.8 --top_k 0 --top_p 0.9 --from_pretrained ./checkpoints/MSAGPT --input-source /home/mca/MP_PSP/myenv/env_name/MSAGPT/INPUT --output-path /home/mca/MP_PSP/myenv/env_name/MSAGPT/output --max-gen-length 64
> [2025-01-21 11:21:52,777] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
> [2025-01-21 11:21:54,854] [WARNING] No training data specified
> [2025-01-21 11:21:54,855] [WARNING] No train_iters (recommended) or epochs specified, use default 10k iters.
> [2025-01-21 11:21:54,855] [INFO] using world size: 1 and model-parallel size: 1 
> [2025-01-21 11:21:54,855] [INFO] > padded vocab (size: 100) with 28 dummy tokens (new size: 128)
> [2025-01-21 11:21:54,855] [INFO] [RANK 0] > initializing model parallel with size 1
> [2025-01-21 11:21:54,856] [INFO] [comm.py:652:init_distributed] cdb=None
> [2025-01-21 11:21:54,856] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1
> [2025-01-21 11:21:54,857] [INFO] [checkpointing.py:1004:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
> [2025-01-21 11:21:54,857] [INFO] [checkpointing.py:1125:configure] Activation Checkpointing Information
> [2025-01-21 11:21:54,857] [INFO] [checkpointing.py:1126:configure] ----Partition Activations False, CPU CHECKPOINTING False
> [2025-01-21 11:21:54,857] [INFO] [checkpointing.py:1127:configure] ----contiguous Memory Checkpointing False with 6 total layers
> [2025-01-21 11:21:54,857] [INFO] [checkpointing.py:1128:configure] ----Synchronization False
> [2025-01-21 11:21:54,857] [INFO] [checkpointing.py:1129:configure] ----Profiling time in checkpointing False
> [2025-01-21 11:21:54,857] [INFO] [checkpointing.py:229:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 15063 and data parallel seed: 12345
> [2025-01-21 11:21:54,857] [INFO] [RANK 0] building MSAGPT model ...
> [2025-01-21 11:21:54,857] [INFO] [checkpointing.py:229:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 15063 and data parallel seed: 12345
> [2025-01-21 11:21:55,104] [INFO] [RANK 0]  > number of parameters on model parallel rank 0: 2860508544
> [2025-01-21 11:21:56,144] [INFO] [RANK 0] CUDA out of memory. Tried to allocate 68.00 MiB (GPU 0; 5.79 GiB total capacity; 5.30 GiB already allocated; 84.75 MiB free; 5.40 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
> [2025-01-21 11:21:56,144] [INFO] [RANK 0] global rank 0 is loading checkpoint ./checkpoints/MSAGPT/1/mp_rank_00_model_states.pt
> [2025-01-21 11:21:58,535] [INFO] [RANK 0] > successfully loaded ./checkpoints/MSAGPT/1/mp_rank_00_model_states.pt
> Traceback (most recent call last):
>   File "/home/mca/MP_PSP/myenv/env_name/MSAGPT/cli_sat.py", line 44, in <module>
>     model = model.to('cuda')
>             ^^^^^^^^^^^^^^^^
>   File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1145, in to
>     return self._apply(convert)
>            ^^^^^^^^^^^^^^^^^^^^
>   File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 797, in _apply
>     module._apply(fn)
>   File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 797, in _apply
>     module._apply(fn)
>   File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 797, in _apply
>     module._apply(fn)
>   [Previous line repeated 2 more times]
>   File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 820, in _apply
>     param_applied = fn(param)
>                     ^^^^^^^^^
>   File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1143, in convert
>     return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 68.00 MiB (GPU 0; 5.79 GiB total capacity; 5.30 GiB already allocated; 84.75 MiB free; 5.40 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
> ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 6700) of binary: /home/mca/anaconda3/bin/python
> Traceback (most recent call last):
>   File "/home/mca/anaconda3/bin/torchrun", line 8, in <module>
>     sys.exit(main())
>              ^^^^^^
>   File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
>     return f(*args, **kwargs)
>            ^^^^^^^^^^^^^^^^^^
>   File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 794, in main
>     run(args)
>   File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run
>     elastic_launch(
>   File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
>     return launch_agent(self._config, self._entrypoint, list(args))
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
>     raise ChildFailedError(
> torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
> ============================================================
> /home/mca/MP_PSP/myenv/env_name/MSAGPT/cli_sat.py FAILED
> ------------------------------------------------------------
> Failures:
>   <NO_OTHER_FAILURES>
> ------------------------------------------------------------
> Root Cause (first observed failure):
> [0]:
>   time      : 2025-01-21_11:22:01
>   host      : mca-lab6
>   rank      : 0 (local_rank: 0)
>   exitcode  : 1 (pid: 6700)
>   error_file: <N/A>
>   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
> ============================================================
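
For context, here is a rough back-of-the-envelope check I did (my own numbers, taken only from the log above, assuming bf16 weights take 2 bytes per parameter):

```python
# Rough estimate of the model's weight footprint, using the parameter count
# reported in the log ("number of parameters on model parallel rank 0: 2860508544")
# and the --bf16 flag (2 bytes per parameter).
num_params = 2_860_508_544
bytes_per_param = 2  # bf16
weights_gib = num_params * bytes_per_param / 1024**3
print(f"bf16 weights alone: {weights_gib:.2f} GiB")  # ~5.33 GiB
```

If that estimate is right, the weights alone are already about 5.33 GiB, which matches the "5.30 GiB already allocated" in the OOM message and leaves almost nothing of the 5.79 GiB that PyTorch sees on this card.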

Output of `nvidia-smi`:


> Tue Jan 21 11:24:48 2025       
> +---------------------------------------------------------------------------------------+
> | NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
> |-----------------------------------------+----------------------+----------------------+
> | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
> | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
> |                                         |                      |               MIG M. |
> |=========================================+======================+======================|
> |   0  NVIDIA GeForce GTX 1660 Ti     Off | 00000000:01:00.0  On |                  N/A |
> |  0%   48C    P8              12W / 120W |    219MiB /  6144MiB |      3%      Default |
> |                                         |                      |                  N/A |
> +-----------------------------------------+----------------------+----------------------+
>                                                                                          
> +---------------------------------------------------------------------------------------+
> | Processes:                                                                            |
> |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
> |        ID   ID                                                             Usage      |
> |=======================================================================================|
> |    0   N/A  N/A      2307      G   /usr/lib/xorg/Xorg                           92MiB |
> |    0   N/A  N/A      2435      G   ...libexec/gnome-remote-desktop-daemon        1MiB |
> |    0   N/A  N/A      2473      G   /usr/bin/gnome-shell                         70MiB |
> |    0   N/A  N/A      4776      G   ...seed-version=20250119-180455.285000       51MiB |
> +---------------------------------------------------------------------------------------+
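
And a small snippet (my own, not part of MSAGPT) that I can run to cross-check how much VRAM PyTorch actually sees as free against the nvidia-smi numbers above:

```python
import torch

# Query free/total memory on the current CUDA device, in bytes.
free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"free:  {free_bytes / 1024**3:.2f} GiB")
print(f"total: {total_bytes / 1024**3:.2f} GiB")  # ~5.8 GiB on this GTX 1660 Ti
```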

This is my INPUT file:

7pno_D:GSGSGSGSGTNSLLNLRSRLAAKAAKEAASSNSENLYFQ---SGGTRLTNSLLNLRSRLAAKAAKEAASSNAT------STSGGTRLTNSLLNLRSRLAAKAIKEST----------
