Stuck During Tree-Based Speculative Decoding with OPT Model #4

Open
SeungjaeLim opened this issue Oct 16, 2024 · 0 comments
I was trying to run tree-based speculative decoding via the server_gpu_experiments.sh script in the specinfer-ae repository, using the OPT models. However, the process gets stuck partway through and does not progress any further, as shown in the output below. My GPU environment and the script I used are provided below.

Has anyone encountered or solved this issue before?
Additionally, are there any other examples of running tree-based speculative decoding with a prompt file similar to chatgpt.json?
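
For context, my understanding is that chatgpt.json is a plain JSON array of prompt strings, so a minimal custom prompt file for spec_infer would look something like the sketch below (the file name and prompt texts here are made up):

# Hypothetical minimal prompt file, in the same format chatgpt.json appears to use:
# a plain JSON array of prompt strings.
cat > my_prompts.json <<'EOF'
[
  "Give three tips for staying healthy.",
  "Explain speculative decoding in one paragraph."
]
EOF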

GPU Information:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------|
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               On  | 00000000:B1:00.0 Off |                  Off |
| 30%   32C    P8              26W / 300W |     35MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

Script: the bash script I am running:

#! /usr/bin/env bash
set -e
set -x

# Cd into directory holding this script
cd "${BASH_SOURCE[0]%/*}"

export UCX_DIR="$PWD/ucx-1.15.0/install"
export PATH=$UCX_DIR/bin:$PATH
export LD_LIBRARY_PATH=$UCX_DIR/lib:$LD_LIBRARY_PATH

./download_dataset.sh
./download_models.sh

batch_sizes=( 16 )

mkdir -p ./FlexFlow/inference/output

ncpus=1
ngpus=1
fsize=21890   # per-GPU framebuffer memory for the Legion runtime, in MB (-ll:fsize)
zsize=80000   # zero-copy (pinned host) memory for the Legion runtime, in MB (-ll:zsize)
max_sequence_length=128
ssm_model_name="facebook/opt-125m"   # small speculative model (SSM)
llm_model_name="facebook/opt-6.7b"   # large language model being verified (LLM)

for bs in "${batch_sizes[@]}"
do
    ./FlexFlow/build/inference/spec_infer/spec_infer -ll:cpu $ncpus -ll:util $ncpus -ll:gpu $ngpus -ll:fsize $fsize -ll:zsize $zsize -llm-model $llm_model_name -ssm-model $ssm_model_name -prompt ./FlexFlow/inference/prompt/chatgpt_$bs.json --max-requests-per-batch $bs --max-sequence-length $max_sequence_length
done
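
For reference, the "python -" step in download_dataset.sh (visible in the trace below) is presumably what produces chatgpt_16.json from the downloaded chatgpt.json. A hypothetical reconstruction of that step is sketched below; the identical "New request tokens" lines in the output suggest one prompt repeated per batch slot, but I have not verified the actual script:

# Hypothetical reconstruction of the "python -" step in download_dataset.sh,
# which seems to turn chatgpt.json into the per-batch-size chatgpt_16.json.
bs=16
python - "$bs" <<'PYEOF'
import json, sys

n = int(sys.argv[1])
with open("chatgpt.json") as f:
    prompts = json.load(f)            # chatgpt.json holds a JSON array of prompt strings
with open(f"chatgpt_{n}.json", "w") as f:
    json.dump([prompts[0]] * n, f)    # repeat one prompt n times (a guess from the token dumps)
PYEOF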

Output:

$  ./server_gpu_experiments.sh 
+ cd /workspace/.
+ export UCX_DIR=/workspace/ucx-1.15.0/install
+ UCX_DIR=/workspace/ucx-1.15.0/install
+ export PATH=/workspace/ucx-1.15.0/install/bin:/root/.cargo/bin:/usr/local/nvm/versions/node/v16.20.2/bin:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin
+ PATH=/workspace/ucx-1.15.0/install/bin:/root/.cargo/bin:/usr/local/nvm/versions/node/v16.20.2/bin:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin
+ export LD_LIBRARY_PATH=/workspace/ucx-1.15.0/install/lib:/usr/local/cuda/compat/lib.real:/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
+ LD_LIBRARY_PATH=/workspace/ucx-1.15.0/install/lib:/usr/local/cuda/compat/lib.real:/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
+ ./download_dataset.sh
+ cd .
+ cd FlexFlow
+ rm -rf inference/prompt
+ mkdir -p inference/prompt
+ cd inference/prompt
+ wget https://specinfer.s3.us-east-2.amazonaws.com/prompts/chatgpt.json
--2024-10-16 04:35:01--  https://specinfer.s3.us-east-2.amazonaws.com/prompts/chatgpt.json
Resolving specinfer.s3.us-east-2.amazonaws.com (specinfer.s3.us-east-2.amazonaws.com)... 3.5.129.104, 3.5.128.25, 52.219.109.114, ...
Connecting to specinfer.s3.us-east-2.amazonaws.com (specinfer.s3.us-east-2.amazonaws.com)|3.5.129.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 40769 (40K) [application/json]
Saving to: ‘chatgpt.json’

     0K .......... .......... .......... .........            100%  245K=0.2s

2024-10-16 04:35:02 (245 KB/s) - ‘chatgpt.json’ saved [40769/40769]

+ python -
+ rm chatgpt.json
+ ./download_models.sh
+ cd .
+ export UCX_DIR=/workspace/ucx-1.15.0/install
+ UCX_DIR=/workspace/ucx-1.15.0/install
+ export PATH=/workspace/ucx-1.15.0/install/bin:/workspace/ucx-1.15.0/install/bin:/root/.cargo/bin:/usr/local/nvm/versions/node/v16.20.2/bin:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin
+ PATH=/workspace/ucx-1.15.0/install/bin:/workspace/ucx-1.15.0/install/bin:/root/.cargo/bin:/usr/local/nvm/versions/node/v16.20.2/bin:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin
+ export LD_LIBRARY_PATH=/workspace/ucx-1.15.0/install/lib:/workspace/ucx-1.15.0/install/lib:/usr/local/cuda/compat/lib.real:/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
+ LD_LIBRARY_PATH=/workspace/ucx-1.15.0/install/lib:/workspace/ucx-1.15.0/install/lib:/usr/local/cuda/compat/lib.real:/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
++ date +%s
+ start_time=1729053302
+ python ./FlexFlow/inference/utils/download_hf_model.py --half-precision-only facebook/opt-125m
/usr/local/lib/python3.10/dist-packages/torch/__init__.py:613: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at /opt/pytorch/pytorch/torch/csrc/tensor/python_tensor.cpp:451.)
  _C._set_default_tensor_type(t)
Creating directory /root/.cache/flexflow/weights/facebook/opt-125m/half-precision (if it doesn't exist)...
Loading 'facebook/opt-125m' model weights from the cache...
Loading tokenizer...
Loading 'facebook/opt-125m' tokenizer from the cache...
Creating directory /root/.cache/flexflow/configs/facebook/opt-125m (if it doesn't exist)...
Saving facebook/opt-125m configs to file /root/.cache/flexflow/configs/facebook/opt-125m/config.json...
+ python ./FlexFlow/inference/utils/download_hf_model.py --half-precision-only facebook/opt-6.7b
/usr/local/lib/python3.10/dist-packages/torch/__init__.py:613: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at /opt/pytorch/pytorch/torch/csrc/tensor/python_tensor.cpp:451.)
  _C._set_default_tensor_type(t)
Creating directory /root/.cache/flexflow/weights/facebook/opt-6.7b/half-precision (if it doesn't exist)...
Loading 'facebook/opt-6.7b' model weights from the cache...
Loading tokenizer...
Loading 'facebook/opt-6.7b' tokenizer from the cache...
Creating directory /root/.cache/flexflow/configs/facebook/opt-6.7b (if it doesn't exist)...
Saving facebook/opt-6.7b configs to file /root/.cache/flexflow/configs/facebook/opt-6.7b/config.json...
++ date +%s
+ end_time=1729053310
+ execution_time=8
+ echo 'Total download time: 8 seconds'
Total download time: 8 seconds
+ batch_sizes=(16)
+ mkdir -p ./FlexFlow/inference/output
++ date +%s
+ start_time=1729053310
+ ncpus=1
+ ngpus=1
+ fsize=21890
+ zsize=80000
+ max_sequence_length=128
+ ssm_model_name=facebook/opt-125m
+ llm_model_name=facebook/opt-6.7b
+ for bs in "${batch_sizes[@]}"
+ ./FlexFlow/build/inference/spec_infer/spec_infer -ll:cpu 1 -ll:util 1 -ll:gpu 1 -ll:fsize 21890 -ll:zsize 80000 -llm-model facebook/opt-6.7b -ssm-model facebook/opt-125m -prompt ./FlexFlow/inference/prompt/chatgpt_16.json --max-requests-per-batch 16 --max-sequence-length 128
[1729053311.514464] [36506f48411f:1356 :0]          parser.c:2036 UCX  WARN  unused environment variable: UCX_DIR (maybe: UCX_TLS?)
[1729053311.514464] [36506f48411f:1356 :0]          parser.c:2036 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)

...

Num of SSMs: 1
[0 - 7f0eec038000]    0.969221 {3}{RequestManager}: [1005740]New request tokens: 2 45714 10 5313 15923 59 2512 2190 13 110 659 4
[0]45714
[1]10
[2]5313
[3]15923
[4]59
[5]2512
[6]2190
[7]13
[8]110
[9]659
[10]4
Num of SSMs: 1
[0 - 7f0eec038000]    0.969258 {3}{RequestManager}: [1005741]New request tokens: 2 45714 10 5313 15923 59 2512 2190 13 110 659 4
[0]45714
[1]10
[2]5313
[3]15923
[4]59
[5]2512
[6]2190
[7]13
[8]110
[9]659
[10]4
Num of SSMs: 1
[0 - 7f0eec038000]    0.969298 {3}{RequestManager}: [1005742]New request tokens: 2 45714 10 5313 15923 59 2512 2190 13 110 659 4
[0]45714
[1]10
[2]5313
[3]15923
[4]59
[5]2512
[6]2190
[7]13
[8]110
[9]659
[10]4
Num of SSMs: 1
[0 - 7f0eec038000]    0.969334 {3}{RequestManager}: [1005743]New request tokens: 2 45714 10 5313 15923 59 2512 2190 13 110 659 4
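
In case a backtrace helps with triage, the process can be inspected while it is stuck using standard gdb (a generic recipe, nothing FlexFlow-specific; assumes a single spec_infer process is running):

# Dump the backtraces of every thread in the stuck process.
pid=$(pgrep -f spec_infer | head -n 1)
gdb -p "$pid" -batch -ex 'thread apply all bt'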

lockshaw transferred this issue from flexflow/flexflow-train on Dec 16, 2024