Added vllm cuda support #582
Conversation
Reviewer's Guide by Sourcery

This pull request introduces CUDA support for vllm, along with several improvements to GPU detection and model loading.

Sequence diagram for container setup with GPU detection

sequenceDiagram
participant C as Container
participant G as GPUHandler
participant M as ModelHandler
M->>G: get_gpu()
G->>G: Detect Asahi/ROCm/CUDA
G->>G: Select best GPU
G-->>M: Set environment variables
M->>C: Setup container with GPU args
Note over C: For CUDA: Add nvidia.com/gpu=all
Note over C: For others: Add env variables
Class diagram showing updated GPU handling

classDiagram
class GPUTemplate {
+index: string
+vram: number
+env: string
}
class GPUHandler {
+get_gpu()
+get_env_vars()
}
class ModelHandler {
+setup_container(args)
+gpu_args(force, server)
+handle_runtime(args, exec_args, exec_model_path)
}
GPUHandler -- GPUTemplate : creates >
ModelHandler -- GPUHandler : uses >
Flow diagram for updated GPU detection process

flowchart TD
start[Start GPU Detection] --> checkAsahi{Check Asahi}
checkAsahi -->|Yes| setAsahi[Set ASAHI_VISIBLE_DEVICES]
checkAsahi -->|No| checkROCm{Check ROCm}
checkROCm -->|Found| addROCm[Add to GPU template]
checkROCm -->|Not Found| checkCUDA{Check CUDA}
checkCUDA -->|Found| addCUDA[Add to GPU template]
checkCUDA -->|Not Found| selectGPU{Select GPU}
addROCm --> selectGPU
addCUDA --> selectGPU
selectGPU -->|Single GPU| useGPU[Use that GPU]
selectGPU -->|Multiple GPUs| selectBest[Select GPU with max VRAM]
selectBest --> setEnv[Set environment variable]
useGPU --> setEnv
setAsahi --> finish[End]
setEnv --> finish
Hey @bmahabirbu - I've reviewed your changes and found some issues that need to be addressed.
Blocking issues:
- Missing 'vllm serve' prefix in exec_args for vllm runtime (link)
Overall Comments:
- Consider making max_model_len configurable rather than hardcoding it to 4096, to support users with different VRAM capacities and requirements (a sketch of one possible shape follows this review summary)
Here's what I looked at during the review
- 🔴 General issues: 1 blocking issue
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟢 Complexity: all looks good
- 🟢 Documentation: all looks good
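As a rough, hypothetical sketch of the configurability suggestion above (the flag name, default value, and wiring are assumptions, not code from this PR), the limit could be exposed on the CLI and passed through to the vllm arguments:

```python
# Hypothetical sketch of a configurable max_model_len; the flag name, default,
# and wiring are assumptions, not code from this PR.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--port", default="8080")
parser.add_argument(
    "--max-model-len",
    type=int,
    default=2048,  # llama.cpp's default context size, per the discussion later in this thread
    help="maximum context length passed to vllm",
)
args = parser.parse_args([])  # defaults only, for illustration

MNT_FILE = "/mnt/models/model.file"  # placeholder for the mounted model path
exec_args = ["--port", args.port, "--model", MNT_FILE, "--max_model_len", str(args.max_model_len)]
print(exec_args)
```

Defaulting to llama.cpp's 2048 would keep the two runtimes consistent, as discussed later in the thread.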
ramalama/model.py (outdated)

     exec_model_path = os.path.dirname(exec_model_path)
-    exec_args = ["vllm", "serve", "--port", args.port, exec_model_path]
+    exec_args = ["--port", args.port, "--model", MNT_FILE, "--max_model_len", "4096"]
issue (bug_risk): Missing 'vllm serve' prefix in exec_args for vllm runtime
The vllm command prefix ('vllm serve') was removed. This will cause the vllm runtime to fail. Restore these required command components.
I think this is 2048 in llama.cpp if we want consistency. We seem to be missing vllm serve also.
You are absolutely right, thanks for catching that! I missed the `vllm serve` prefix in the `exec_args`; it should be included along with the other arguments. Regarding the `--max_model_len` value, I agree that consistency with llama.cpp's default of 2048 is preferable unless we have a specific reason to deviate. Could you please try changing it to 2048 and see if it works as expected with your 10GB of VRAM? If it still runs into issues, we can investigate further and adjust it as needed. The primary goal is to ensure compatibility and avoid unnecessary memory constraints.
What's odd is that when I had vllm serve as part of the args, the container would crash stating those args weren't recognized. I'm guessing that when the vllm image is run its entrypoint is vllm serve; that's why I removed it.
Okay, let's investigate why `vllm serve` is causing the container to crash when included in the `exec_args`. If the entrypoint of the vLLM image is already `vllm serve`, then including it again in the command would indeed lead to issues.

Here's what we can try:

- Verify Entrypoint: Can you confirm the entrypoint of the vLLM Docker image you're using? You can usually find this information in the Dockerfile or by inspecting the image with `docker inspect <image_name>`.

- Entrypoint Override (Recommended): If the entrypoint is indeed `vllm serve`, the cleanest solution is to override the entrypoint when running the container. Instead of modifying the `exec_args` within the code, you can achieve this directly in the Docker run command using the `--entrypoint` flag. For example:

  docker run --entrypoint "" <image_name> vllm serve --port <port> --model <model_path> --max_model_len 2048

  This sets an empty entrypoint, allowing you to specify the full command including `vllm serve`.

- Debugging within Container (Alternative): If the above doesn't work, we need to debug inside the container to understand the exact error. You can start the container with a bash shell:

  docker run -it <image_name> /bin/bash

  Then, inside the container, manually execute the `vllm serve` command with the arguments to see the error message. This will give us more clues about the conflict.

Regarding `--max_model_len`, let's stick with 2048 for now to maintain consistency with llama.cpp. If we encounter further memory issues, we can revisit this parameter.

Please try these steps and let me know the outcome. We'll get to the bottom of this!
Ah ok, makes sense @bmahabirbu. Maybe just leave a comment noting that it assumes the entrypoint is "vllm serve".
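As a minimal sketch of the outcome agreed on here (placeholder values, not the PR's exact code; the max_model_len value is illustrative while 2048 vs 4096 is still being discussed), the argument list plus the suggested comment might look like:

```python
# Minimal sketch; MNT_FILE and the port are placeholders, and the
# max_model_len value is illustrative.

MNT_FILE = "/mnt/models/model.file"  # assumed mount path inside the container
port = "8080"

# NOTE: the vllm container image is assumed to use "vllm serve" as its
# entrypoint, so exec_args carries only the serve options themselves.
exec_args = ["--port", port, "--model", MNT_FILE, "--max_model_len", "2048"]
print(exec_args)
```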
@@ -171,10 +174,14 @@ def setup_container(self, args):
            conman_args += ["--device", "/dev/kfd"]

        for k, v in get_env_vars().items():
            conman_args += ["-e", f"{k}={v}"]

        # Special case for Cuda
I think we should remove the else here and pass the env var in all cases; a user might set CUDA_VISIBLE_DEVICES to run a specific GPU, and llama.cpp should be able to process that env var.
The else should be gone here right?
ramalama/common.py (outdated)

    else:
        best_gpu = max(gpu_template, key=lambda x: x["vram"])  # Use max for multiple entries

    # Set env var of picked gpu
    os.environ[best_gpu["env"]] = best_gpu["index"]
If CUDA_VISIBLE_DEVICES is set we should skip all this somehow also; I can't remember how we did that for ROCm or if we did it correctly. nvidia-smi is probably an expensive technique for nvidia detection, but maybe this is a start and we can do something cheaper later.
Make sure that if nvidia-smi doesn't exist it's super silent, otherwise our packagers will complain that ramalama is broken and needs more dependencies.
Sometimes I wonder if we should do something simple like:
I don't know if it's mandatory to have nvidia things in the kernel cmdline for every nvidia GPU to work correctly with the nvidia-supported kernel driver. Sometimes I wonder if we should forget about the AMD/Nvidia comparisons; a machine with both GPUs is going to be rare, and manual configuration isn't unreasonable at that point. But I trust your judgement here, I don't have an Nvidia machine of any kind.
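The snippet referenced by "something simple like:" isn't visible in this thread. Purely as a hypothetical illustration of that kind of cheap check (reading what the kernel exposes rather than shelling out to nvidia-smi), it could look roughly like:

```python
# Hypothetical illustration only -- not the snippet referenced in the comment
# above. Assumes a Linux host where the NVIDIA kernel driver, when loaded,
# exposes /proc/driver/nvidia/version; falls back to scanning the kernel cmdline.
import os


def nvidia_driver_present() -> bool:
    if os.path.exists("/proc/driver/nvidia/version"):
        return True
    try:
        with open("/proc/cmdline") as f:
            return "nvidia" in f.read()
    except OSError:
        return False


if __name__ == "__main__":
    print("nvidia driver detected:", nvidia_driver_present())
```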
I agree, I'll look into doing something simpler like this. No reason to use nvidia-smi if we can see what GPU driver Linux is using. I agree as well, let's forget about the comparison, it's a bit much. If a user has multiple different GPUs we can always have the user set the env var manually to whichever they want to use.
After testing a bit I think I'll stick with calling nvidia-smi for now, as I couldn't find any nvidia driver listed inside wsl2. I also shortened the code so that as long as nvidia-smi is recognized, I set the env var.
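A rough sketch of that approach (names and structure here are illustrative, not the PR's exact code): only shell out when nvidia-smi is on PATH, stay silent when it is missing, respect a user-set CUDA_VISIBLE_DEVICES, and pick the GPU with the most VRAM as discussed above.

```python
# Rough sketch of nvidia-smi-based detection as discussed above; names are illustrative.
import os
import shutil
import subprocess


def set_cuda_env_var() -> None:
    """Set CUDA_VISIBLE_DEVICES to the GPU with the most VRAM, if nvidia-smi is available."""
    if os.getenv("CUDA_VISIBLE_DEVICES"):
        return  # respect a value the user already set
    if shutil.which("nvidia-smi") is None:
        return  # stay silent when the tool is missing, per the review discussion
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=index,memory.total", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return  # treat any failure as "no usable NVIDIA GPU"

    gpus = []
    for line in out.strip().splitlines():
        index, vram = (field.strip() for field in line.split(","))
        gpus.append({"index": index, "vram": int(vram)})
    if gpus:
        best = max(gpus, key=lambda g: g["vram"])  # largest VRAM wins, as in the PR
        os.environ["CUDA_VISIBLE_DEVICES"] = best["index"]
```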
18745d0 to 15cac58
@ericcurtin PTAL
Sorry @bmahabirbu, when I asked to remove the else, I meant change:
to
We were skipping setting the -e in the CUDA case. We want the env vars set inside the container in all cases.
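The before/after snippets referred to here aren't shown in the thread. Based on the quoted setup_container diff and this comment, a hedged sketch of the intended shape (always pass -e for the detected env vars, and add the CUDA device flag on top) might be; the function name and the exact CUDA device flag are assumptions:

```python
# Sketch inferred from the comment above and the setup_container diff quoted
# earlier; the nvidia.com/gpu=all device mirrors the CUDA note in the reviewer's
# guide, but the exact flag the PR uses may differ. Not the PR's exact code.
def add_gpu_args(conman_args, env_vars):
    # Always forward the detected GPU env vars into the container with -e ...
    for k, v in env_vars.items():
        conman_args += ["-e", f"{k}={v}"]
        # ... and additionally request the NVIDIA CDI device for the CUDA case.
        if k == "CUDA_VISIBLE_DEVICES":
            conman_args += ["--device", "nvidia.com/gpu=all"]
    return conman_args


if __name__ == "__main__":
    print(add_gpu_args([], {"CUDA_VISIBLE_DEVICES": "0"}))
```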
Ah gotcha! Will change early today!
gpu_type, gpu_num = get_gpu()
if gpu_type not in env_vars and gpu_type in {"HIP_VISIBLE_DEVICES", "ASAHI_VISIBLE_DEVICES"}:
    env_vars[gpu_type] = str(gpu_num)

# gpu_type, gpu_num = get_gpu()
I guess we can delete this
Signed-off-by: Brian <[email protected]>
Added CUDA GPU support and vllm support for CUDA/ROCm.
Once a GPU is detected, it is added to the env vars so detection only needs to be called once.
Added functionality to pick the largest GPU between CUDA and ROCm by VRAM.
Added --max_model_len 4096, which reduces the maximum context length so models can be loaded. I did this because llama3.2 was not loading on my GPU with 10GB of VRAM, unlike with llama.cpp. I believe this has to do with vllm's parallelization optimizations, which increase VRAM usage.

Summary by Sourcery
New Features: