I was trying to run the tree-based speculative decoding from the server_gpu_experiment in the specinfer-ae repository using the OPT models (facebook/opt-125m as the SSM and facebook/opt-6.7b as the LLM). However, the process gets stuck partway through and never makes any further progress. My GPU environment, the script I used, and the full output are provided below.
Has anyone encountered or solved this issue before?
Additionally, are there any other examples of running tree-based speculative decoding with a prompt file similar to chatgpt.json?
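For context on the prompt format: as far as I can tell, chatgpt.json (and the chatgpt_16.json file the script generates from it) is simply a JSON array of prompt strings, which is what the -prompt flag expects. A minimal hand-made prompt file for a batch of 2 could be created like this (the file name and prompt texts are placeholders I made up; this is a sketch based on my reading of the format, not a confirmed example):

cat > ./FlexFlow/inference/prompt/example_2.json <<'EOF'
[
  "Give three tips for staying healthy.",
  "Explain why the sky appears blue."
]
EOF

If that understanding is correct, spec_infer should accept it via -prompt ./FlexFlow/inference/prompt/example_2.json --max-requests-per-batch 2.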
GPU Information:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               On  | 00000000:B1:00.0 Off |                  Off |
| 30%   32C    P8              26W / 300W |     35MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
Script: I am running the following bash script:
#! /usr/bin/env bash
set -e
set -x
# Cd into directory holding this script
cd "${BASH_SOURCE[0]%/*}"
export UCX_DIR="$PWD/ucx-1.15.0/install"
export PATH=$UCX_DIR/bin:$PATH
export LD_LIBRARY_PATH=$UCX_DIR/lib:$LD_LIBRARY_PATH
./download_dataset.sh
./download_models.sh
batch_sizes=( 16 )
mkdir -p ./FlexFlow/inference/output
ncpus=1
ngpus=1
fsize=21890              # Legion GPU framebuffer memory per GPU, in MB (-ll:fsize)
zsize=80000              # Legion zero-copy (pinned host) memory, in MB (-ll:zsize)
max_sequence_length=128
ssm_model_name="facebook/opt-125m"   # small speculative model (SSM)
llm_model_name="facebook/opt-6.7b"   # large model being verified (LLM)
for bs in "${batch_sizes[@]}"
do
./FlexFlow/build/inference/spec_infer/spec_infer \
    -ll:cpu $ncpus -ll:util $ncpus -ll:gpu $ngpus \
    -ll:fsize $fsize -ll:zsize $zsize \
    -llm-model $llm_model_name -ssm-model $ssm_model_name \
    -prompt ./FlexFlow/inference/prompt/chatgpt_$bs.json \
    --max-requests-per-batch $bs --max-sequence-length $max_sequence_length
done
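For what it's worth, when the run hangs I can confirm it is actually stuck (rather than just slow) by attaching a debugger and dumping the thread stacks. A minimal sketch, assuming gdb is installed in the container:

# find the running spec_infer process and dump a backtrace of every thread
pid=$(pgrep -f spec_infer | head -n1)
gdb -p "$pid" -batch -ex "thread apply all bt"

gdb detaches after the batch commands finish, so the process is left in the same (stuck) state; the backtrace shows where each thread is blocked.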
Output:
$ ./server_gpu_experiments.sh
+ cd /workspace/.
+ export UCX_DIR=/workspace/ucx-1.15.0/install
+ UCX_DIR=/workspace/ucx-1.15.0/install
+ export PATH=/workspace/ucx-1.15.0/install/bin:/root/.cargo/bin:/usr/local/nvm/versions/node/v16.20.2/bin:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin
+ PATH=/workspace/ucx-1.15.0/install/bin:/root/.cargo/bin:/usr/local/nvm/versions/node/v16.20.2/bin:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin
+ export LD_LIBRARY_PATH=/workspace/ucx-1.15.0/install/lib:/usr/local/cuda/compat/lib.real:/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
+ LD_LIBRARY_PATH=/workspace/ucx-1.15.0/install/lib:/usr/local/cuda/compat/lib.real:/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
+ ./download_dataset.sh
+ cd .
+ cd FlexFlow
+ rm -rf inference/prompt
+ mkdir -p inference/prompt
+ cd inference/prompt
+ wget https://specinfer.s3.us-east-2.amazonaws.com/prompts/chatgpt.json
--2024-10-16 04:35:01-- https://specinfer.s3.us-east-2.amazonaws.com/prompts/chatgpt.json
Resolving specinfer.s3.us-east-2.amazonaws.com (specinfer.s3.us-east-2.amazonaws.com)... 3.5.129.104, 3.5.128.25, 52.219.109.114, ...
Connecting to specinfer.s3.us-east-2.amazonaws.com (specinfer.s3.us-east-2.amazonaws.com)|3.5.129.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 40769 (40K) [application/json]
Saving to: ‘chatgpt.json’
0K .......... .......... .......... ......... 100% 245K=0.2s
2024-10-16 04:35:02 (245 KB/s) - ‘chatgpt.json’ saved [40769/40769]
+ python -
+ rm chatgpt.json
+ ./download_models.sh
+ cd .
+ export UCX_DIR=/workspace/ucx-1.15.0/install
+ UCX_DIR=/workspace/ucx-1.15.0/install
+ export PATH=/workspace/ucx-1.15.0/install/bin:/workspace/ucx-1.15.0/install/bin:/root/.cargo/bin:/usr/local/nvm/versions/node/v16.20.2/bin:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin
+ PATH=/workspace/ucx-1.15.0/install/bin:/workspace/ucx-1.15.0/install/bin:/root/.cargo/bin:/usr/local/nvm/versions/node/v16.20.2/bin:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin
+ export LD_LIBRARY_PATH=/workspace/ucx-1.15.0/install/lib:/workspace/ucx-1.15.0/install/lib:/usr/local/cuda/compat/lib.real:/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
+ LD_LIBRARY_PATH=/workspace/ucx-1.15.0/install/lib:/workspace/ucx-1.15.0/install/lib:/usr/local/cuda/compat/lib.real:/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
++ date +%s
+ start_time=1729053302
+ python ./FlexFlow/inference/utils/download_hf_model.py --half-precision-only facebook/opt-125m
/usr/local/lib/python3.10/dist-packages/torch/__init__.py:613: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at /opt/pytorch/pytorch/torch/csrc/tensor/python_tensor.cpp:451.)
_C._set_default_tensor_type(t)
Creating directory /root/.cache/flexflow/weights/facebook/opt-125m/half-precision (if it doesn't exist)...
Loading 'facebook/opt-125m' model weights from the cache...
Loading tokenizer...
Loading 'facebook/opt-125m' tokenizer from the cache...
Creating directory /root/.cache/flexflow/configs/facebook/opt-125m (if it doesn't exist)...
Saving facebook/opt-125m configs to file /root/.cache/flexflow/configs/facebook/opt-125m/config.json...
+ python ./FlexFlow/inference/utils/download_hf_model.py --half-precision-only facebook/opt-6.7b
/usr/local/lib/python3.10/dist-packages/torch/__init__.py:613: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at /opt/pytorch/pytorch/torch/csrc/tensor/python_tensor.cpp:451.)
_C._set_default_tensor_type(t)
Creating directory /root/.cache/flexflow/weights/facebook/opt-6.7b/half-precision (if it doesn't exist)...
Loading 'facebook/opt-6.7b' model weights from the cache...
Loading tokenizer...
Loading 'facebook/opt-6.7b' tokenizer from the cache...
Creating directory /root/.cache/flexflow/configs/facebook/opt-6.7b (if it doesn't exist)...
Saving facebook/opt-6.7b configs to file /root/.cache/flexflow/configs/facebook/opt-6.7b/config.json...
++ date +%s
+ end_time=1729053310
+ execution_time=8
+ echo 'Total download time: 8 seconds'
Total download time: 8 seconds
+ batch_sizes=(16)
+ mkdir -p ./FlexFlow/inference/output
++ date +%s
+ start_time=1729053310
+ ncpus=1
+ ngpus=1
+ fsize=21890
+ zsize=80000
+ max_sequence_length=128
+ ssm_model_name=facebook/opt-125m
+ llm_model_name=facebook/opt-6.7b
+ for bs in "${batch_sizes[@]}"
+ ./FlexFlow/build/inference/spec_infer/spec_infer -ll:cpu 1 -ll:util 1 -ll:gpu 1 -ll:fsize 21890 -ll:zsize 80000 -llm-model facebook/opt-6.7b -ssm-model facebook/opt-125m -prompt ./FlexFlow/inference/prompt/chatgpt_16.json --max-requests-per-batch 16 --max-sequence-length 128
[1729053311.514464] [36506f48411f:1356 :0] parser.c:2036 UCX WARN unused environment variable: UCX_DIR (maybe: UCX_TLS?)
[1729053311.514464] [36506f48411f:1356 :0] parser.c:2036 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
...
Num of SSMs: 1
[0 - 7f0eec038000] 0.969221 {3}{RequestManager}: [1005740]New request tokens: 2 45714 10 5313 15923 59 2512 2190 13 110 659 4
[0]45714
[1]10
[2]5313
[3]15923
[4]59
[5]2512
[6]2190
[7]13
[8]110
[9]659
[10]4
Num of SSMs: 1
[0 - 7f0eec038000] 0.969258 {3}{RequestManager}: [1005741]New request tokens: 2 45714 10 5313 15923 59 2512 2190 13 110 659 4
[0]45714
[1]10
[2]5313
[3]15923
[4]59
[5]2512
[6]2190
[7]13
[8]110
[9]659
[10]4
Num of SSMs: 1
[0 - 7f0eec038000] 0.969298 {3}{RequestManager}: [1005742]New request tokens: 2 45714 10 5313 15923 59 2512 2190 13 110 659 4
[0]45714
[1]10
[2]5313
[3]15923
[4]59
[5]2512
[6]2190
[7]13
[8]110
[9]659
[10]4
Num of SSMs: 1
[0 - 7f0eec038000] 0.969334 {3}{RequestManager}: [1005743]New request tokens: 2 45714 10 5313 15923 59 2512 2190 13 110 659 4