
Add support for vLLM backend #34

Open
wants to merge 8 commits into base: main
Conversation

anmarques commented Aug 19, 2024

This PR adds support for generating responses using the vLLM backend.

vLLM is an open-source project for efficient LLM inference that has seen increasing adoption. It is significantly faster than the HF backend and also supports speedups from model optimizations such as quantization and sparsity.

This PR adds two new classes: ChatModelVLLM and BaseModelVLLM. A new model can inherit from either of these classes to run inference using vLLM.
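For context, here is a rough sketch of the vLLM generation path these classes wrap. The model name and sampling settings are illustrative only and are not taken from this PR:

```python
# Minimal sketch of plain vLLM generation; the wrapper classes in this PR
# presumably delegate to calls like these. Model name and sampling settings
# below are illustrative, not part of this PR.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

# generate() takes a list of prompts and returns one RequestOutput per prompt
outputs = llm.generate(["What is speculative decoding?"], sampling_params)
print(outputs[0].outputs[0].text)
```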

This PR also makes three adjacent changes:

  1. It adds the optional argument cpu_offload_gb, which allows the user to offload some of the weights to CPU. This matches the vLLM interface more closely than setting max_gpu_memory (see the sketch after this list).
  2. It changes the evaluation logic so that a call to _eval returns the model. My understanding is that the model is instantiated within _eval so that this step can be skipped when results are already available. The issue with this logic is that it can lead to multiple instantiations of the model, which can crash vLLM's multi-GPU interface.
  3. It adds the definition of llama_3_1_8b_instruct_vllm as an example of how to create a model compatible with vLLM.
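For reference, cpu_offload_gb maps directly onto vLLM's engine argument of the same name. A minimal sketch (the model name and value are illustrative, not recommendations from this PR):

```python
# cpu_offload_gb is a standard vLLM engine argument; the value below is
# illustrative only.
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    cpu_offload_gb=4,  # offload up to 4 GiB of weights to CPU memory
)
```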
