
Added vllm cuda support #582

Merged 1 commit into containers:main on Jan 15, 2025
Conversation

@bmahabirbu (Collaborator) commented Jan 13, 2025

Added CUDA GPU support and vLLM support for CUDA/ROCm.

Once a GPU is detected, it is added to the environment variables so detection only needs to run once.

Added functionality to pick the largest GPU between CUDA and ROCm by VRAM.

Added "--max_model_len" = "4096", which reduces the context length so models can be loaded. I did this because llama3.2 was not loading on my GPU with 10 GB of VRAM, unlike with llama.cpp. I believe this is because vLLM's parallelization optimizations increase VRAM usage.
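As a rough illustration of that last point, here is a minimal sketch of how the serve arguments might be assembled; the helper name, its parameters, and the example paths are hypothetical, not the exact code in ramalama/model.py. Lowering --max_model_len reduces the KV-cache capacity vLLM has to reserve for a single sequence, which is typically what lets a model load on a 10 GB card.

    def vllm_exec_args(port, model_path, max_model_len="4096"):
        # The flags handed to the vLLM runtime; the command itself comes from
        # elsewhere (see the entrypoint discussion later in this conversation).
        return ["--port", port, "--model", model_path, "--max_model_len", max_model_len]

    # Example usage (paths are illustrative):
    exec_args = vllm_exec_args("8080", "/mnt/models/model.file")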

Summary by Sourcery

New Features:

  • Added support for CUDA GPUs in vLLM.

sourcery-ai bot (Contributor) commented Jan 13, 2025

Reviewer's Guide by Sourcery

This pull request introduces CUDA support for vllm, along with several improvements to GPU detection and model loading.

Sequence diagram for container setup with GPU detection

sequenceDiagram
    participant C as Container
    participant G as GPUHandler
    participant M as ModelHandler
    M->>G: get_gpu()
    G->>G: Detect Asahi/ROCm/CUDA
    G->>G: Select best GPU
    G-->>M: Set environment variables
    M->>C: Setup container with GPU args
    Note over C: For CUDA: Add nvidia.com/gpu=all
    Note over C: For others: Add env variables

Class diagram showing updated GPU handling

classDiagram
    class GPUTemplate {
        +index: string
        +vram: number
        +env: string
    }
    class GPUHandler {
        +get_gpu()
        +get_env_vars()
    }
    class ModelHandler {
        +setup_container(args)
        +gpu_args(force, server)
        +handle_runtime(args, exec_args, exec_model_path)
    }
    GPUHandler -- GPUTemplate : creates >
    ModelHandler -- GPUHandler : uses >

Flow diagram for updated GPU detection process

flowchart TD
    start[Start GPU Detection] --> checkAsahi{Check Asahi}
    checkAsahi -->|Yes| setAsahi[Set ASAHI_VISIBLE_DEVICES]
    checkAsahi -->|No| checkROCm{Check ROCm}
    checkROCm -->|Found| addROCm[Add to GPU template]
    checkROCm -->|Not Found| checkCUDA{Check CUDA}
    checkCUDA -->|Found| addCUDA[Add to GPU template]
    checkCUDA -->|Not Found| selectGPU{Select GPU}
    addROCm --> selectGPU
    addCUDA --> selectGPU
    selectGPU -->|Single GPU| useGPU[Use that GPU]
    selectGPU -->|Multiple GPUs| selectBest[Select GPU with max VRAM]
    selectBest --> setEnv[Set environment variable]
    useGPU --> setEnv
    setAsahi --> finish[End]
    setEnv --> finish
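In Python terms, the selection step at the bottom of this flowchart amounts to something like the sketch below; the field names mirror the GPUTemplate class diagram above, and the helper name select_gpu is a placeholder rather than the actual function in ramalama/common.py.

    import os

    def select_gpu(gpu_template):
        # gpu_template: list of dicts with "index", "vram" (relative size; the
        # unit only matters for comparison) and "env" (the env var that exposes
        # that GPU, e.g. CUDA_VISIBLE_DEVICES or HIP_VISIBLE_DEVICES).
        if not gpu_template:
            return  # nothing detected; fall back to CPU
        # One entry: use it. Several entries: prefer the one with the most VRAM.
        best_gpu = max(gpu_template, key=lambda gpu: gpu["vram"])
        # Cache the choice in an env var so detection only needs to run once.
        os.environ[best_gpu["env"]] = best_gpu["index"]

    # Example: a 10 GiB CUDA card wins over an 8 GiB ROCm card.
    select_gpu([
        {"index": "0", "vram": 10240, "env": "CUDA_VISIBLE_DEVICES"},
        {"index": "0", "vram": 8192, "env": "HIP_VISIBLE_DEVICES"},
    ])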

File-Level Changes

Improved GPU detection and selection to support CUDA and ROCm GPUs.
  • Added functions to detect and select the largest available GPU based on VRAM across CUDA and ROCm.
  • Set environment variables for CUDA and ROCm to specify the selected GPU.
  • Added handling for Asahi Linux systems to set the appropriate environment variable if detected.
  • Refactored GPU detection logic to prioritize GPUs with larger VRAM and handle cases with multiple GPUs of different types.
  • Simplified environment variable handling to use the detected GPU information directly.
  • Added error handling for Nvidia GPU detection failures.
  • Removed redundant checks for Asahi Linux systems in the GPU detection logic.
Files: ramalama/common.py
Added support for vllm CUDA execution and improved model loading (see the sketch below).
  • Added logic to select the appropriate vllm Docker image based on the detected GPU type (CUDA or ROCm).
  • Modified container setup to include necessary devices and environment variables for CUDA execution.
  • Added logic to handle CUDA devices and environment variables specifically.
  • Updated vllm execution arguments to include the model path and maximum model length.
  • Added a check for available GPUs before starting the vllm server.
  • Set the maximum model length to 4096 to reduce VRAM usage and improve model loading on GPUs with limited memory.
  • Added a call to get_gpu() before building execution arguments to ensure GPU detection is performed before running or serving the model.
Files: ramalama/model.py
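A hedged sketch of how the image selection and container arguments described above could fit together; the image names are placeholders and the function is illustrative, not the actual code in ramalama/model.py. The --device and -e handling mirrors the diff hunks quoted later in this conversation.

    import os

    def vllm_container_args(conman_args):
        # Pick a vLLM image for the detected GPU (tags here are placeholders).
        if "CUDA_VISIBLE_DEVICES" in os.environ:
            image = "<vllm-cuda-image>"
            # CDI device name used to expose NVIDIA GPUs to the container.
            conman_args += ["--device", "nvidia.com/gpu=all"]
        elif "HIP_VISIBLE_DEVICES" in os.environ:
            image = "<vllm-rocm-image>"
            conman_args += ["--device", "/dev/kfd"]
        else:
            image = "<vllm-cpu-image>"
        # Forward the GPU env vars in every case so the runtime can honour them.
        for k in ("CUDA_VISIBLE_DEVICES", "HIP_VISIBLE_DEVICES", "ASAHI_VISIBLE_DEVICES"):
            if k in os.environ:
                conman_args += ["-e", f"{k}={os.environ[k]}"]
        return image, conman_args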


@sourcery-ai bot (Contributor) left a comment:

Hey @bmahabirbu - I've reviewed your changes and found some issues that need to be addressed.

Blocking issues:

  • Missing 'vllm serve' prefix in exec_args for vllm runtime

Overall Comments:

  • Consider making max_model_len configurable rather than hardcoding to 4096, to support users with different VRAM capacities and requirements

Here's what I looked at during the review
  • 🔴 General issues: 1 blocking issue
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟢 Complexity: all looks good
  • 🟢 Documentation: all looks good


exec_model_path = os.path.dirname(exec_model_path)
exec_args = ["vllm", "serve", "--port", args.port, exec_model_path]
exec_args = ["--port", args.port, "--model", MNT_FILE, "--max_model_len", "4096"]
sourcery-ai bot (Contributor) commented:

issue (bug_risk): Missing 'vllm serve' prefix in exec_args for vllm runtime

The vllm command prefix ('vllm serve') was removed. This will cause the vllm runtime to fail. Restore these required command components.

Collaborator commented:

I think this is 2048 in llama.cpp if we want consistency. We seem to be missing vllm serve also.

sourcery-ai bot (Contributor) replied:

You are absolutely right, thanks for catching that! I missed the vllm serve prefix in the exec_args. It should be included along with the other arguments. Regarding the --max_model_len value, I agree that consistency with llama.cpp's default of 2048 is preferable unless we have a specific reason to deviate. Could you please try changing it to 2048 and see if it works as expected with your 10GB VRAM? If it still runs into issues, we can investigate further and potentially adjust it as needed. The primary goal is to ensure compatibility and avoid unnecessary memory constraints.

@bmahabirbu (Collaborator, Author) replied:

What's odd is that when I had vllm serve as part of the args, the container would crash stating those args aren't recognized. I'm guessing that when the vllm image is run, its entrypoint is vllm serve; that's why I removed it.

sourcery-ai bot (Contributor) replied:

Okay, let's investigate why vllm serve is causing the container to crash when included in the exec_args. If the entrypoint of the vLLM image is already vllm serve, then including it again in the command would indeed lead to issues.

Here's what we can try:

  1. Verify Entrypoint: Can you confirm the entrypoint of the vLLM Docker image you're using? You can usually find this information in the Dockerfile or by inspecting the image with docker inspect <image_name>.

  2. Environment Override (Recommended): If the entrypoint is indeed vllm serve, the cleanest solution is to override the entrypoint when running the container. Instead of modifying the exec_args within the code, you can achieve this directly in the Docker run command using the --entrypoint flag. For example:

    docker run --entrypoint "" <image_name> vllm serve --port <port> --model <model_path> --max_model_len 2048 

    This sets an empty entrypoint, allowing you to specify the full command including vllm serve.

  3. Debugging within Container (Alternative): If the above doesn't work, we need to debug inside the container to understand the exact error. You can start the container with a bash shell:

    docker run -it <image_name> /bin/bash

    Then, inside the container, manually execute the vllm serve command with the arguments to see the error message. This will give us more clues about the conflict.

Regarding --max_model_len, let's stick with 2048 for now to maintain consistency with llama.cpp. If we encounter further memory issues, we can revisit this parameter.

Please try these steps and let me know the outcome. We'll get to the bottom of this!

Collaborator replied:

Ah ok, makes sense @bmahabirbu. Maybe just leave a comment noting that this assumes the entrypoint is "vllm serve".
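For illustration, such a comment could sit directly above the argument line quoted earlier; this is an annotation of that existing snippet (args.port and MNT_FILE come from the surrounding code, and the final --max_model_len value was still under discussion above):

    # NOTE: the vLLM image's entrypoint is assumed to be "vllm serve", so
    # exec_args carries only the flags, not the command itself.
    exec_args = ["--port", args.port, "--model", MNT_FILE, "--max_model_len", "4096"]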

@@ -171,10 +174,14 @@ def setup_container(self, args):
            conman_args += ["--device", "/dev/kfd"]

        for k, v in get_env_vars().items():
            conman_args += ["-e", f"{k}={v}"]

            # Special case for Cuda
@ericcurtin (Collaborator) commented Jan 13, 2025:

I think we should remove the else here and pass the env var in all cases; a user might set CUDA_VISIBLE_DEVICES to run a specific GPU, and llama.cpp should be able to process that env var.

Collaborator commented:

The else should be gone here, right?

else:
    best_gpu = max(gpu_template, key=lambda x: x["vram"])  # Use max for multiple entries
    # Set env var of picked gpu
    os.environ[best_gpu["env"]] = best_gpu["index"]
@ericcurtin (Collaborator) commented Jan 13, 2025:

If CUDA_VISIBLE_DEVICES is already set we should skip all this somehow; I can't remember how we did that for ROCm, or whether we did it correctly. nvidia-smi is probably an expensive technique for Nvidia detection, but maybe this is a start and we can do something cheaper later.

Make sure that if nvidia-smi doesn't exist it's completely silent; otherwise our packagers will complain that ramalama is broken and needs more dependencies.

@ericcurtin (Collaborator) commented:

Sometimes I wonder if we should do something simple like:

    # NVIDIA CASE
    if os.path.exists('/proc/cmdline'):
        with open('/proc/cmdline', 'r') as file:
            if "nvidia" in file.read().lower():
                # Set Env Var and break
                os.environ["CUDA_VISIBLE_DEVICES"] = "1"
                return

I don't know if it's mandatory to have nvidia things in the kernel cmdline for every nvidia GPU to work correctly with the nvidia supported kernel driver.

Sometimes I wonder whether we should forget about the AMD/Nvidia comparison; a machine with both GPUs is going to be rare, and manual configuration isn't unreasonable at that point.

But I trust your judgement here; I don't have an Nvidia machine of any kind.

@bmahabirbu (Collaborator, Author) replied:

I agree, I'll look into doing something simpler like this. There's no reason to use nvidia-smi if we can see which GPU driver Linux is using.

I also agree we should forget about the comparison; it's a bit much. If a user has multiple different GPUs, we can always have them set the env var manually to the one they want to use.

@bmahabirbu (Collaborator, Author) commented Jan 14, 2025:

After testing a bit, I think I'll stick with calling nvidia-smi for now, as I couldn't find any Nvidia driver listed inside WSL2. I also shortened the code so that as long as nvidia-smi is recognized, I set the env var.
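A rough sketch of that shortened approach (not the exact code in ramalama/common.py), which also stays silent when nvidia-smi is missing and respects a GPU the user has already chosen, as requested earlier in the review:

    import os
    import shutil
    import subprocess

    def detect_cuda():
        # Respect a GPU the user already picked.
        if os.environ.get("CUDA_VISIBLE_DEVICES"):
            return
        # Stay completely silent when nvidia-smi is not installed.
        if shutil.which("nvidia-smi") is None:
            return
        try:
            subprocess.run(["nvidia-smi"], check=True,
                           stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        except (subprocess.CalledProcessError, OSError):
            return  # tool present but unusable; fall back quietly
        # nvidia-smi ran, so assume a CUDA-capable GPU is available.
        os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # illustrative index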

@bmahabirbu force-pushed the vllm branch 5 times, most recently from 18745d0 to 15cac58, on January 14, 2025 20:35
@rhatdan (Member) commented Jan 14, 2025:

@ericcurtin PTAL

@ericcurtin (Collaborator) commented:

Sorry @bmahabirbu, when I asked to remove the else, I meant change:

            # Special case for Cuda
            if k == "CUDA_VISIBLE_DEVICES":
                conman_args += ["--device", "nvidia.com/gpu=all"]
            else:
                conman_args += ["-e", f"{k}={v}"]

to

            # Special case for Cuda
            if k == "CUDA_VISIBLE_DEVICES":
                conman_args += ["--device", "nvidia.com/gpu=all"]

            conman_args += ["-e", f"{k}={v}"]

We were skipping setting the -e in the CUDA case. We want the env vars set inside the container in all cases.

@bmahabirbu (Collaborator, Author) replied:

Ah gotcha! Will change early today!

gpu_type, gpu_num = get_gpu()
if gpu_type not in env_vars and gpu_type in {"HIP_VISIBLE_DEVICES", "ASAHI_VISIBLE_DEVICES"}:
    env_vars[gpu_type] = str(gpu_num)
# gpu_type, gpu_num = get_gpu()
Collaborator commented:

I guess we can delete this

@ericcurtin merged commit 50b1fa2 into containers:main on Jan 15, 2025
11 checks passed