Running the inference script:

```bash
bash scripts/cli_sat.sh --from_pretrained ./checkpoints/MSAGPT --input-source /home/mca/MP_PSP/myenv/env_name/MSAGPT/INPUT --output-path /home/mca/MP_PSP/myenv/env_name/MSAGPT/output --max-gen-length 64
```

produces the following output:

```text
NCCL_DEBUG=VERSION NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2 CUDA_LAUNCH_BLOCKING=0 torchrun --nproc_per_node 1 --master_port=19865 /home/mca/MP_PSP/myenv/env_name/MSAGPT/cli_sat.py --bf16 --skip-init --mode finetune --rotary-embedding-2d --seed 12345 --sampling-strategy BaseStrategy --max-gen-length 128 --min-gen-length 0 --num-beams 4 --length-penalty 1.0 --no-repeat-ngram-size 0 --multiline_stream --temperature 0.8 --top_k 0 --top_p 0.9 --from_pretrained ./checkpoints/MSAGPT --input-source /home/mca/MP_PSP/myenv/env_name/MSAGPT/INPUT --output-path /home/mca/MP_PSP/myenv/env_name/MSAGPT/output --max-gen-length 64
[2025-01-21 11:21:52,777] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-01-21 11:21:54,854] [WARNING] No training data specified
[2025-01-21 11:21:54,855] [WARNING] No train_iters (recommended) or epochs specified, use default 10k iters.
[2025-01-21 11:21:54,855] [INFO] using world size: 1 and model-parallel size: 1
[2025-01-21 11:21:54,855] [INFO] > padded vocab (size: 100) with 28 dummy tokens (new size: 128)
[2025-01-21 11:21:54,855] [INFO] [RANK 0] > initializing model parallel with size 1
[2025-01-21 11:21:54,856] [INFO] [comm.py:652:init_distributed] cdb=None
[2025-01-21 11:21:54,856] [INFO] [config.py:733:__init__] Config mesh_device None world_size = 1
[2025-01-21 11:21:54,857] [INFO] [checkpointing.py:1004:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
[2025-01-21 11:21:54,857] [INFO] [checkpointing.py:1125:configure] Activation Checkpointing Information
[2025-01-21 11:21:54,857] [INFO] [checkpointing.py:1126:configure] ----Partition Activations False, CPU CHECKPOINTING False
[2025-01-21 11:21:54,857] [INFO] [checkpointing.py:1127:configure] ----contiguous Memory Checkpointing False with 6 total layers
[2025-01-21 11:21:54,857] [INFO] [checkpointing.py:1128:configure] ----Synchronization False
[2025-01-21 11:21:54,857] [INFO] [checkpointing.py:1129:configure] ----Profiling time in checkpointing False
[2025-01-21 11:21:54,857] [INFO] [checkpointing.py:229:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 15063 and data parallel seed: 12345
[2025-01-21 11:21:54,857] [INFO] [RANK 0] building MSAGPT model ...
[2025-01-21 11:21:54,857] [INFO] [checkpointing.py:229:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 15063 and data parallel seed: 12345
[2025-01-21 11:21:55,104] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 2860508544
[2025-01-21 11:21:56,144] [INFO] [RANK 0] CUDA out of memory. Tried to allocate 68.00 MiB (GPU 0; 5.79 GiB total capacity; 5.30 GiB already allocated; 84.75 MiB free; 5.40 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2025-01-21 11:21:56,144] [INFO] [RANK 0] global rank 0 is loading checkpoint ./checkpoints/MSAGPT/1/mp_rank_00_model_states.pt
[2025-01-21 11:21:58,535] [INFO] [RANK 0] > successfully loaded ./checkpoints/MSAGPT/1/mp_rank_00_model_states.pt
Traceback (most recent call last):
  File "/home/mca/MP_PSP/myenv/env_name/MSAGPT/cli_sat.py", line 44, in <module>
    model = model.to('cuda')
            ^^^^^^^^^^^^^^^^
  File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  [Previous line repeated 2 more times]
  File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
                    ^^^^^^^^^
  File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 68.00 MiB (GPU 0; 5.79 GiB total capacity; 5.30 GiB already allocated; 84.75 MiB free; 5.40 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 6700) of binary: /home/mca/anaconda3/bin/python
Traceback (most recent call last):
  File "/home/mca/anaconda3/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mca/anaconda3/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/mca/MP_PSP/myenv/env_name/MSAGPT/cli_sat.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2025-01-21_11:22:01
  host       : mca-lab6
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 6700)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
```
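For context, the parameter count in the log already suggests why this fails: 2,860,508,544 parameters at 2 bytes each (the run uses `--bf16`) is about 5.33 GiB of weights alone, which essentially fills the card before any activations are allocated. A minimal back-of-the-envelope check, using only the numbers from the log above and assuming bf16 weights with no offloading:

```python
# Rough memory estimate for the checkpoint reported in the log above.
# Assumptions: weights kept in bf16 (2 bytes/param), no CPU offload,
# activations and generation buffers not counted.
n_params = 2_860_508_544   # "number of parameters on model parallel rank 0"
bytes_per_param = 2        # bf16
weights_gib = n_params * bytes_per_param / 1024**3
print(f"weights alone: {weights_gib:.2f} GiB")  # ~5.33 GiB vs 5.79 GiB visible capacity
```

So tuning `max_split_size_mb` via `PYTORCH_CUDA_ALLOC_CONF` is unlikely to help here; the weights alone leave almost no headroom on this GPU.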
`nvidia-smi` output:

```text
Tue Jan 21 11:24:48 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1660 Ti    Off  | 00000000:01:00.0  On |                  N/A |
|  0%   48C    P8             12W / 120W  |    219MiB /  6144MiB |      3%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2307      G   /usr/lib/xorg/Xorg                           92MiB |
|    0   N/A  N/A      2435      G   ...libexec/gnome-remote-desktop-daemon        1MiB |
|    0   N/A  N/A      2473      G   /usr/bin/gnome-shell                         70MiB |
|    0   N/A  N/A      4776      G   ...seed-version=20250119-180455.285000       51MiB |
+---------------------------------------------------------------------------------------+
```
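As a sanity check, the memory actually available to PyTorch can also be queried from the same environment before loading the model; a minimal sketch using the standard `torch.cuda.mem_get_info` API:

```python
import torch

# Report free vs. total device memory as seen by the CUDA driver (in bytes).
free, total = torch.cuda.mem_get_info()
print(f"free: {free / 1024**3:.2f} GiB of {total / 1024**3:.2f} GiB total")
```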
This is my INPUT file:

```text
7pno_D:GSGSGSGSGTNSLLNLRSRLAAKAAKEAASSNSENLYFQ---SGGTRLTNSLLNLRSRLAAKAAKEAASSNAT------STSGGTRLTNSLLNLRSRLAAKAIKEST----------
```