This PR adds support for generating responses using the vLLM backend.
vLLM is an open-source project for efficient LLM inference that has seen increasing adoption. It is significantly faster than the HF backend, and it also supports speedups from model optimizations such as quantization and sparsity.
This PR adds two new classes, `ChatModelVLLM` and `BaseModelVLLM`. A new model can inherit from either of these classes to run inference with vLLM.
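As a rough illustration, a subclass might look like the sketch below. Only the class name `ChatModelVLLM` comes from this PR; the import path, constructor keywords, and `generate()` signature are assumptions for illustration, not the actual interface.

```python
# Minimal sketch, assuming a hypothetical import path and base-class interface.
from mypackage.models.vllm import ChatModelVLLM  # hypothetical module path


class LlamaChatVLLM(ChatModelVLLM):
    """A chat model served through the vLLM backend."""

    def __init__(self):
        # Assumed: the base class builds the vLLM engine from these kwargs.
        super().__init__(
            model_name="meta-llama/Llama-2-7b-chat-hf",
            max_new_tokens=256,
        )


if __name__ == "__main__":
    model = LlamaChatVLLM()
    # Assumed generate() interface: chat messages in, generated text out.
    print(model.generate([{"role": "user", "content": "Hello!"}]))
```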
There are 3 other adjacent changes also added by this PR:

- `cpu_offload_gb`, which allows the user to offload some of the weights to CPU. This better matches the vLLM interface than setting `max_gpu_memory` (see the sketch below).
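For reference, a minimal sketch of how `cpu_offload_gb` is expected to translate to the vLLM engine, assuming a recent vLLM release that exposes this option on the `LLM` constructor; the model name and values are illustrative only:

```python
from vllm import LLM, SamplingParams

# Build the vLLM engine, offloading part of the weights to CPU memory.
llm = LLM(
    model="meta-llama/Llama-2-7b-chat-hf",  # example model
    cpu_offload_gb=4,  # offload up to ~4 GiB of weights to CPU
)

# Generate a short completion to verify the engine works.
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```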