Added vllm cuda support #582
Conversation
Reviewer's Guide by Sourcery

This pull request introduces CUDA support for vllm, along with several improvements to GPU detection and model loading.

Sequence diagram for container setup with GPU detection

sequenceDiagram
participant C as Container
participant G as GPUHandler
participant M as ModelHandler
M->>G: get_gpu()
G->>G: Detect Asahi/ROCm/CUDA
G->>G: Select best GPU
G-->>M: Set environment variables
M->>C: Setup container with GPU args
Note over C: For CUDA: Add nvidia.com/gpu=all
Note over C: For others: Add env variables
Class diagram showing updated GPU handling

classDiagram
class GPUTemplate {
+index: string
+vram: number
+env: string
}
class GPUHandler {
+get_gpu()
+get_env_vars()
}
class ModelHandler {
+setup_container(args)
+gpu_args(force, server)
+handle_runtime(args, exec_args, exec_model_path)
}
GPUHandler -- GPUTemplate : creates >
ModelHandler -- GPUHandler : uses >
Flow diagram for updated GPU detection process

flowchart TD
start[Start GPU Detection] --> checkAsahi{Check Asahi}
checkAsahi -->|Yes| setAsahi[Set ASAHI_VISIBLE_DEVICES]
checkAsahi -->|No| checkROCm{Check ROCm}
checkROCm -->|Found| addROCm[Add to GPU template]
checkROCm -->|Not Found| checkCUDA{Check CUDA}
checkCUDA -->|Found| addCUDA[Add to GPU template]
checkCUDA -->|Not Found| selectGPU{Select GPU}
addROCm --> selectGPU
addCUDA --> selectGPU
selectGPU -->|Single GPU| useGPU[Use that GPU]
selectGPU -->|Multiple GPUs| selectBest[Select GPU with max VRAM]
selectBest --> setEnv[Set environment variable]
useGPU --> setEnv
setAsahi --> finish[End]
setEnv --> finish
Hey @bmahabirbu - I've reviewed your changes and found some issues that need to be addressed.
Blocking issues:
- Missing 'vllm serve' prefix in exec_args for vllm runtime (link)
Overall Comments:
- Consider making max_model_len configurable rather than hardcoding it to 4096, to support users with different VRAM capacities and requirements (a sketch of one possible shape follows this review summary)
Here's what I looked at during the review
- 🔴 General issues: 1 blocking issue
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟢 Complexity: all looks good
- 🟢 Documentation: all looks good
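As a rough, hypothetical sketch of the configurability suggestion above (the flag name, default value, and wiring are assumptions, not code from this PR), the limit could be exposed on the CLI and passed through to the vllm arguments:

```python
# Hypothetical sketch of a configurable max_model_len; the flag name, default,
# and wiring are assumptions, not code from this PR.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--port", default="8080")
parser.add_argument(
    "--max-model-len",
    type=int,
    default=2048,  # llama.cpp's default context size, per the discussion later in this thread
    help="maximum context length passed to vllm",
)
args = parser.parse_args([])  # defaults only, for illustration

MNT_FILE = "/mnt/models/model.file"  # placeholder for the mounted model path
exec_args = ["--port", args.port, "--model", MNT_FILE, "--max_model_len", str(args.max_model_len)]
print(exec_args)
```

Defaulting to llama.cpp's 2048 would keep the two runtimes consistent, as discussed later in the thread.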
ramalama/model.py (outdated)

     exec_model_path = os.path.dirname(exec_model_path)
-    exec_args = ["vllm", "serve", "--port", args.port, exec_model_path]
+    exec_args = ["--port", args.port, "--model", MNT_FILE, "--max_model_len", "4096"]
issue (bug_risk): Missing 'vllm serve' prefix in exec_args for vllm runtime
The vllm command prefix ('vllm serve') was removed. This will cause the vllm runtime to fail. Restore these required command components.
I think this is 2048 in llama.cpp if we want consistency. We seem to be missing vllm serve also.
You are absolutely right, thanks for catching that! I missed the `vllm serve` prefix in the `exec_args`; it should be included along with the other arguments. Regarding the `--max_model_len` value, I agree that consistency with llama.cpp's default of 2048 is preferable unless we have a specific reason to deviate. Could you please try changing it to 2048 and see if it works as expected with your 10GB of VRAM? If it still runs into issues, we can investigate further and adjust it as needed. The primary goal is to ensure compatibility and avoid unnecessary memory constraints.
What's odd is that when I had vllm serve as part of the args, the container would crash stating those args weren't recognized. I'm guessing that when the vllm image is run its entrypoint is vllm serve; that's why I removed it.
Okay, let's investigate why `vllm serve` is causing the container to crash when included in the `exec_args`. If the entrypoint of the vLLM image is already `vllm serve`, then including it again in the command would indeed lead to issues.

Here's what we can try:

- Verify Entrypoint: Can you confirm the entrypoint of the vLLM Docker image you're using? You can usually find this information in the Dockerfile or by inspecting the image with `docker inspect <image_name>`.

- Entrypoint Override (Recommended): If the entrypoint is indeed `vllm serve`, the cleanest solution is to override the entrypoint when running the container. Instead of modifying the `exec_args` within the code, you can achieve this directly in the Docker run command using the `--entrypoint` flag. For example:

  docker run --entrypoint "" <image_name> vllm serve --port <port> --model <model_path> --max_model_len 2048

  This sets an empty entrypoint, allowing you to specify the full command including `vllm serve`.

- Debugging within Container (Alternative): If the above doesn't work, we need to debug inside the container to understand the exact error. You can start the container with a bash shell:

  docker run -it <image_name> /bin/bash

  Then, inside the container, manually execute the `vllm serve` command with the arguments to see the error message. This will give us more clues about the conflict.

Regarding `--max_model_len`, let's stick with 2048 for now to maintain consistency with llama.cpp. If we encounter further memory issues, we can revisit this parameter.

Please try these steps and let me know the outcome. We'll get to the bottom of this!
Ah ok, makes sense @bmahabirbu. Maybe just leave a comment noting that it assumes the entrypoint is "vllm serve".
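As a minimal sketch of the outcome agreed on here (placeholder values, not the PR's exact code; the max_model_len value is illustrative while 2048 vs 4096 is still being discussed), the argument list plus the suggested comment might look like:

```python
# Minimal sketch; MNT_FILE and the port are placeholders, and the
# max_model_len value is illustrative.

MNT_FILE = "/mnt/models/model.file"  # assumed mount path inside the container
port = "8080"

# NOTE: the vllm container image is assumed to use "vllm serve" as its
# entrypoint, so exec_args carries only the serve options themselves.
exec_args = ["--port", port, "--model", MNT_FILE, "--max_model_len", "2048"]
print(exec_args)
```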
@@ -171,10 +174,14 @@ def setup_container(self, args):
            conman_args += ["--device", "/dev/kfd"]

        for k, v in get_env_vars().items():
            conman_args += ["-e", f"{k}={v}"]

        # Special case for Cuda
I think we should remove the else here and pass the env var in all cases; a user might set CUDA_VISIBLE_DEVICES to run a specific GPU, and llama.cpp should be able to process that env var.
The else should be gone here right?
ramalama/common.py (outdated)

    else:
        best_gpu = max(gpu_template, key=lambda x: x["vram"])  # Use max for multiple entries

    # Set env var of picked gpu
    os.environ[best_gpu["env"]] = best_gpu["index"]
If CUDA_VISIBLE_DEVICES is set we should skip all this somehow also; I can't remember how we did that for ROCm or if we did it correctly. nvidia-smi is probably an expensive technique for nvidia detection, but maybe this is a start and we can do something cheaper later.
Make sure that if nvidia-smi doesn't exist it's super silent, otherwise our packagers will complain that ramalama is broken and needs more dependencies.
Sometimes I wonder if we should do something simple like:
I don't know if it's mandatory to have nvidia things in the kernel cmdline for every nvidia GPU to work correctly with the nvidia-supported kernel driver. Sometimes I wonder if we should forget about the AMD/Nvidia comparisons; a machine with both GPUs is going to be rare, and manual configuration isn't unreasonable at that point. But I trust your judgement here, I don't have an Nvidia machine of any kind.
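The snippet referenced by "something simple like:" isn't visible in this thread. Purely as a hypothetical illustration of that kind of cheap check (reading what the kernel exposes rather than shelling out to nvidia-smi), it could look roughly like:

```python
# Hypothetical illustration only -- not the snippet referenced in the comment
# above. Assumes a Linux host where the NVIDIA kernel driver, when loaded,
# exposes /proc/driver/nvidia/version; falls back to scanning the kernel cmdline.
import os


def nvidia_driver_present() -> bool:
    if os.path.exists("/proc/driver/nvidia/version"):
        return True
    try:
        with open("/proc/cmdline") as f:
            return "nvidia" in f.read()
    except OSError:
        return False


if __name__ == "__main__":
    print("nvidia driver detected:", nvidia_driver_present())
```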
I agree, I'll look into doing something simpler like this. No reason to use nvidia-smi if we can see what GPU driver Linux is using. I agree as well, let's forget about the comparison, it's a bit much. If a user has multiple different GPUs we can always have the user set the env var manually to whichever they want to use.
After testing a bit I think I'll stick with calling nvidia-smi for now, as I couldn't find any nvidia driver listed inside wsl2. I also shortened the code so that as long as nvidia-smi is recognized, I set the env var.
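A rough sketch of that approach (names and structure here are illustrative, not the PR's exact code): only shell out when nvidia-smi is on PATH, stay silent when it is missing, respect a user-set CUDA_VISIBLE_DEVICES, and pick the GPU with the most VRAM as discussed above.

```python
# Rough sketch of nvidia-smi-based detection as discussed above; names are illustrative.
import os
import shutil
import subprocess


def set_cuda_env_var() -> None:
    """Set CUDA_VISIBLE_DEVICES to the GPU with the most VRAM, if nvidia-smi is available."""
    if os.getenv("CUDA_VISIBLE_DEVICES"):
        return  # respect a value the user already set
    if shutil.which("nvidia-smi") is None:
        return  # stay silent when the tool is missing, per the review discussion
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=index,memory.total", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return  # treat any failure as "no usable NVIDIA GPU"

    gpus = []
    for line in out.strip().splitlines():
        index, vram = (field.strip() for field in line.split(","))
        gpus.append({"index": index, "vram": int(vram)})
    if gpus:
        best = max(gpus, key=lambda g: g["vram"])  # largest VRAM wins, as in the PR
        os.environ["CUDA_VISIBLE_DEVICES"] = best["index"]
```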
18745d0 to 15cac58
@ericcurtin PTAL
Sorry @bmahabirbu, when I asked to remove the else, I meant change:
to
We were skipping setting the -e in the CUDA case. We want the env vars set inside the container in all cases.
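The before/after snippets referred to here aren't shown in the thread. Based on the quoted setup_container diff and this comment, a hedged sketch of the intended shape (always pass -e for the detected env vars, and add the CUDA device flag on top) might be; the function name and the exact CUDA device flag are assumptions:

```python
# Sketch inferred from the comment above and the setup_container diff quoted
# earlier; the nvidia.com/gpu=all device mirrors the CUDA note in the reviewer's
# guide, but the exact flag the PR uses may differ. Not the PR's exact code.
def add_gpu_args(conman_args, env_vars):
    # Always forward the detected GPU env vars into the container with -e ...
    for k, v in env_vars.items():
        conman_args += ["-e", f"{k}={v}"]
        # ... and additionally request the NVIDIA CDI device for the CUDA case.
        if k == "CUDA_VISIBLE_DEVICES":
            conman_args += ["--device", "nvidia.com/gpu=all"]
    return conman_args


if __name__ == "__main__":
    print(add_gpu_args([], {"CUDA_VISIBLE_DEVICES": "0"}))
```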
Ah gotcha! Will change early today!
gpu_type, gpu_num = get_gpu()
if gpu_type not in env_vars and gpu_type in {"HIP_VISIBLE_DEVICES", "ASAHI_VISIBLE_DEVICES"}:
    env_vars[gpu_type] = str(gpu_num)

# gpu_type, gpu_num = get_gpu()
I guess we can delete this
Signed-off-by: Brian <[email protected]>
Added CUDA GPU support and vllm support for CUDA/ROCm.
Once a GPU is detected, it is added to the env vars so detection only needs to be called once.
Added functionality to pick the largest GPU between CUDA and ROCm by VRAM.
Added --max_model_len 4096, which reduces the maximum context length so models can be loaded. I did this because llama3.2 was not loading on my GPU with 10GB of VRAM, unlike with llama.cpp. I believe this has to do with vllm's parallelization optimizations, which increase VRAM usage.

Summary by Sourcery
New Features: