Add example on LLM offline inference #31

Open · wants to merge 10 commits into master
6 changes: 5 additions & 1 deletion .gitignore
@@ -1,3 +1,7 @@
.vscode
llm/.env
llm/condor_log/*.txt
**/condor_log/*.txt
.devcontainer
vllm_batch_inference/.env
.venv
vllm_batch_inference/outputs.jsonl
1 change: 1 addition & 0 deletions vllm_batch_inference/.env.example
@@ -0,0 +1 @@
HUGGING_FACE_HUB_TOKEN=your_token # Get it from https://huggingface.co/settings/tokens
3 changes: 3 additions & 0 deletions vllm_batch_inference/Dockerfile
@@ -0,0 +1,3 @@
FROM vllm/vllm-openai:v0.6.4
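# The upstream image defaults to launching the OpenAI-compatible API server;
# the entrypoint below is overridden to bash so HTCondor can run run.sh instead.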

ENTRYPOINT ["/bin/bash"]
66 changes: 66 additions & 0 deletions vllm_batch_inference/README.md
@@ -0,0 +1,66 @@
# vLLM Batch Inference on CHTC

This guide demonstrates how to set up, submit, and run batch inference jobs with open-source LLMs using `vllm` on CHTC via `HTCondor`. This is useful for:

- Generating large volumes of synthetic data using open-source LLMs.
- Conducting large-scale, structured data extraction from text.
- Embedding large volumes of text.
- Running any LLM-driven tasks cost-effectively at a massive scale, without relying on expensive commercial alternatives.

## Prerequisites

- Basic knowledge of CHTC, HTCondor, and Docker universe jobs.

## Introduction

In this example, we will use the [Phi-3.5-mini-instruct](https://huggingface.co/microsoft/Phi-3.5-mini-instruct) model with [vLLM v0.6.4 offline inference](https://docs.vllm.ai/en/v0.6.4/getting_started/examples/offline_inference.html) to process 100 [example inputs](inputs.jsonl). The results will be written to `outputs.jsonl` and transferred back to the submit node using the `transfer_output_files` feature in `HTCondor`.

Example inputs:
```jsonl
{"id": "q0001", "input": "What is the capital of France?"}
{"id": "q0002", "input": "What is the capital of Germany?"}
```

Example outputs:
```jsonl
{"id": "q0001", "input": "What is the capital of France?", "output": "The capital of France is Paris."}
{"id": "q0002", "input": "What is the capital of Germany?", "output": "The capital of Germany is Berlin."}
```
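
After the job finishes and `outputs.jsonl` is transferred back, the records can be loaded for downstream use. A minimal sketch, assuming the field names from this example:

```python
import json

# Collect each record's generated output, keyed by its id.
answers = {}
with open("outputs.jsonl") as f:
    for line in f:
        record = json.loads(line)
        answers[record["id"]] = record["output"]

print(answers["q0001"])  # e.g. "The capital of France is Paris."
```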

## Step-by-Step Guide

1. Visit [Hugging Face](https://huggingface.co/settings/tokens) and obtain a token for downloading open-source models.
2. `ssh` into a CHTC submit node.
3. Clone this repository: `git clone https://github.com/CHTC/templates-GPUs.git`.
4. Navigate to the example folder: `cd vllm_batch_inference`.
5. Create a `.env` file based on this [example](.env.example) (e.g., `cp .env.example .env`, then paste in your token).
6. Submit the job: `condor_submit job.sub`.

## FAQ

1. **How to find supported open-source models?**

Visit [Hugging Face Models](https://huggingface.co/models), select a model, and check whether it supports `vllm` by clicking `Use this model`: ![hugging face vllm](img/hf-vllm.png) Note that some models require approval or signing a user agreement on their website.

<span style="color:red">**Always review the model documentation, as input formats can vary significantly between models.**</span>

2. **Why use `vllm`?**

`vllm` offers some of the highest throughput currently available for offline batch inference.

3. **Why not use the official `vllm` container?**

The [official container's](https://hub.docker.com/r/vllm/vllm-openai/tags) entrypoint launches the API server, so this example overrides the entrypoint with `bash`. There may be a way to override it directly in the submit file, but I'm unsure how; please share if you have any insights. You can rebuild your own image with this [Dockerfile](Dockerfile).

4. **How can I avoid memory issues if I don't need maximum performance?**

Set `enforce_eager=True` when instantiating `vllm.LLM` in `batch_inference.py` to mitigate some memory issues (see the first sketch below this list). Refer to the [vllm debugging tips](https://docs.vllm.ai/en/stable/getting_started/debugging.html) for more details.

5. **How can I achieve better tokens-per-second performance?**

Depending on your use case, you may need to tune settings such as `max_tokens`, `max_model_len`, and the `batch_size` used in `batch_inference.py` (see the second sketch below this list). Refer to the [vllm debugging tips](https://docs.vllm.ai/en/stable/getting_started/debugging.html) for more details.
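
A minimal sketch for FAQ 4, mirroring the `LLM` instantiation in `batch_inference.py` with `enforce_eager` added:

```python
from vllm import LLM

# Eager mode skips CUDA graph capture, trading some throughput
# for a smaller GPU memory footprint.
llm = LLM(
    model="microsoft/Phi-3.5-mini-instruct",
    max_model_len=8192,
    enforce_eager=True,
)
```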
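
A sketch for FAQ 5. The values are illustrative placeholders to tune, not recommendations, and `gpu_memory_utilization` is a standard `vllm.LLM` argument that this example does not otherwise use:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3.5-mini-instruct",
    max_model_len=4096,           # smaller context window -> smaller KV cache
    gpu_memory_utilization=0.90,  # fraction of GPU memory vLLM may claim
)
# A lower max_tokens caps generation length and reduces time per request.
sampling_params = SamplingParams(temperature=0.0, max_tokens=512)

# batch_size here is the load_data() argument in batch_inference.py; larger
# batches let vLLM schedule more requests concurrently at the cost of memory.
```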


## About the Author

Contributed by [Jason from Data Science Institute, UW-Madison](https://github.com/jasonlo).
91 changes: 91 additions & 0 deletions vllm_batch_inference/batch_inference.py
@@ -0,0 +1,91 @@
import json
from pathlib import Path
from typing import Generator

from vllm import LLM, SamplingParams


def load_data(
file: Path, batch_size: int
) -> Generator[list[dict[str, str]], None, None]:
"""Load data from a jsonl file as a generator. Assuming at least `id` and `input` fields are present.

Example input format: (docs: https://jsonlines.org/)
{"id": "q0001", "input": "What is the capital of France?"}
{"id": "q0002", "input": "What is the capital of Germany?"}
"""

with open(file, mode="r") as f:
batch = []
for line in f:
batch.append(json.loads(line))
if len(batch) == batch_size:
yield batch
                batch = []  # start a fresh list; clear() would also empty the batch just yielded to the caller
if batch:
yield batch


def format_prompt(user_message: str, system_message: str | None = None) -> str:
"""This function formats the user message into the correct format for the `Phi-3.5-mini-instruct` model.

docs: https://huggingface.co/microsoft/Phi-3.5-mini-instruct#input-formats
"""

if system_message is None:
system_message = (
"You are a helpful assistant. Provide clear and short responses."
)
return (
"<|system|>\n"
f"{system_message}<|end|>\n"
"<|user|>\n"
f"{user_message}<|end|>\n"
"<|assistant|>\n"
)
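
# For reference, format_prompt("What is the capital of France?") renders as:
# <|system|>
# You are a helpful assistant. Provide clear and short responses.<|end|>
# <|user|>
# What is the capital of France?<|end|>
# <|assistant|>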


def inference(
llm: LLM,
data: list[dict[str, str]],
input_field: str = "input",
temperature: float = 0.0,
max_tokens: int = 2048,
) -> list[dict]:
"""Perform a batch of inference on Phi-3.5-mini-instruct via vllm offline mode.

docs: https://docs.vllm.ai/en/v0.6.4/getting_started/examples/offline_inference.html
"""

inputs = [item[input_field] for item in data]
formatted_inputs = [format_prompt(user_message=x) for x in inputs]

sampling_params = SamplingParams(temperature=temperature, max_tokens=max_tokens)
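    # llm.generate returns one RequestOutput per prompt, in the same order as
    # formatted_inputs, so the zip with `data` below keeps ids and outputs aligned.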
outputs = llm.generate(prompts=formatted_inputs, sampling_params=sampling_params)
text_outputs = [raw_output.outputs[0].text.strip() for raw_output in outputs]
return [{**item, "output": output} for item, output in zip(data, text_outputs)]


def save_data(data: list[dict], file: Path) -> None:
"""Save data to a jsonl file in append mode.

Example output format:
{"id": "q0001", "input": "What is the capital of France?", "output": "Paris."}
{"id": "q0002", "input": "What is the capital of Germany?", "output": "Berlin."}
"""
with open(file, mode="a") as f:
for item in data:
f.write(json.dumps(item) + "\n")


# Main script
# Perform mini-batch inference on the input data and save the results.
# Adjust the batch size based on your requirements.
# Note: Larger batch sizes may require more GPU memory but can be faster.
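# Note: save_data() appends to outputs.jsonl; delete any stale outputs.jsonl
# before re-running to avoid duplicate records.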

data = load_data(Path("inputs.jsonl"), batch_size=20)
llm = LLM(model="microsoft/Phi-3.5-mini-instruct", max_model_len=8192)
for batch in data:
outputs = inference(llm=llm, data=batch)
save_data(outputs, Path("outputs.jsonl"))
print(f"Processed {len(batch)} examples.")
Empty file.
Binary file added vllm_batch_inference/img/hf-vllm.png
100 changes: 100 additions & 0 deletions vllm_batch_inference/inputs.jsonl
@@ -0,0 +1,100 @@
{"id": "q0001", "input": "What is the capital of France?"}
{"id": "q0002", "input": "What is the capital of Germany?"}
{"id": "q0003", "input": "What is the capital of Italy?"}
{"id": "q0004", "input": "What is the capital of Spain?"}
{"id": "q0005", "input": "What is the capital of Portugal?"}
{"id": "q0006", "input": "What is the capital of Belgium?"}
{"id": "q0007", "input": "What is the capital of Netherlands?"}
{"id": "q0008", "input": "What is the capital of Switzerland?"}
{"id": "q0009", "input": "What is the capital of Austria?"}
{"id": "q0010", "input": "What is the capital of Sweden?"}
{"id": "q0011", "input": "What is the capital of Norway?"}
{"id": "q0012", "input": "What is the capital of Denmark?"}
{"id": "q0013", "input": "What is the capital of Finland?"}
{"id": "q0014", "input": "What is the capital of Iceland?"}
{"id": "q0015", "input": "What is the capital of Ireland?"}
{"id": "q0016", "input": "What is the capital of United Kingdom?"}
{"id": "q0017", "input": "What is the capital of Greece?"}
{"id": "q0018", "input": "What is the capital of Turkey?"}
{"id": "q0019", "input": "What is the capital of Russia?"}
{"id": "q0020", "input": "What is the capital of Ukraine?"}
{"id": "q0021", "input": "What is the capital of Poland?"}
{"id": "q0022", "input": "What is the capital of Czech Republic?"}
{"id": "q0023", "input": "What is the capital of Slovakia?"}
{"id": "q0024", "input": "What is the capital of Hungary?"}
{"id": "q0025", "input": "What is the capital of Romania?"}
{"id": "q0026", "input": "What is the capital of Bulgaria?"}
{"id": "q0027", "input": "What is the capital of Serbia?"}
{"id": "q0028", "input": "What is the capital of Croatia?"}
{"id": "q0029", "input": "What is the capital of Slovenia?"}
{"id": "q0030", "input": "What is the capital of Bosnia and Herzegovina?"}
{"id": "q0031", "input": "What is the capital of Montenegro?"}
{"id": "q0032", "input": "What is the capital of Albania?"}
{"id": "q0033", "input": "What is the capital of Macedonia?"}
{"id": "q0034", "input": "What is the capital of Kosovo?"}
{"id": "q0035", "input": "What is the capital of Armenia?"}
{"id": "q0036", "input": "What is the capital of Azerbaijan?"}
{"id": "q0037", "input": "What is the capital of Georgia?"}
{"id": "q0038", "input": "What is the capital of Kazakhstan?"}
{"id": "q0039", "input": "What is the capital of Uzbekistan?"}
{"id": "q0040", "input": "What is the capital of Turkmenistan?"}
{"id": "q0041", "input": "What is the capital of Kyrgyzstan?"}
{"id": "q0042", "input": "What is the capital of Tajikistan?"}
{"id": "q0043", "input": "What is the capital of China?"}
{"id": "q0044", "input": "What is the capital of Japan?"}
{"id": "q0045", "input": "What is the capital of South Korea?"}
{"id": "q0046", "input": "What is the capital of North Korea?"}
{"id": "q0047", "input": "What is the capital of Mongolia?"}
{"id": "q0048", "input": "What is the capital of India?"}
{"id": "q0049", "input": "What is the capital of Pakistan?"}
{"id": "q0050", "input": "What is the capital of Bangladesh?"}
{"id": "q0051", "input": "What is the capital of Sri Lanka?"}
{"id": "q0052", "input": "What is the capital of Nepal?"}
{"id": "q0053", "input": "What is the capital of Bhutan?"}
{"id": "q0054", "input": "What is the capital of Myanmar?"}
{"id": "q0055", "input": "What is the capital of Thailand?"}
{"id": "q0056", "input": "What is the capital of Laos?"}
{"id": "q0057", "input": "What is the capital of Cambodia?"}
{"id": "q0058", "input": "What is the capital of Vietnam?"}
{"id": "q0059", "input": "What is the capital of Malaysia?"}
{"id": "q0060", "input": "What is the capital of Singapore?"}
{"id": "q0061", "input": "What is the capital of Indonesia?"}
{"id": "q0062", "input": "What is the capital of Philippines?"}
{"id": "q0063", "input": "What is the capital of Brunei?"}
{"id": "q0064", "input": "What is the capital of Australia?"}
{"id": "q0065", "input": "What is the capital of New Zealand?"}
{"id": "q0066", "input": "What is the capital of Papua New Guinea?"}
{"id": "q0067", "input": "What is the capital of Fiji?"}
{"id": "q0068", "input": "What is the capital of Solomon Islands?"}
{"id": "q0069", "input": "What is the capital of Vanuatu?"}
{"id": "q0070", "input": "What is the capital of Samoa?"}
{"id": "q0071", "input": "What is the capital of Tonga?"}
{"id": "q0072", "input": "What is the capital of Kiribati?"}
{"id": "q0073", "input": "What is the capital of Micronesia?"}
{"id": "q0074", "input": "What is the capital of Palau?"}
{"id": "q0075", "input": "What is the capital of Marshall Islands?"}
{"id": "q0076", "input": "What is the capital of Nauru?"}
{"id": "q0077", "input": "What is the capital of Tuvalu?"}
{"id": "q0078", "input": "What is the capital of Canada?"}
{"id": "q0079", "input": "What is the capital of United States?"}
{"id": "q0080", "input": "What is the capital of Mexico?"}
{"id": "q0081", "input": "What is the capital of Guatemala?"}
{"id": "q0082", "input": "What is the capital of Belize?"}
{"id": "q0083", "input": "What is the capital of El Salvador?"}
{"id": "q0084", "input": "What is the capital of Honduras?"}
{"id": "q0085", "input": "What is the capital of Nicaragua?"}
{"id": "q0086", "input": "What is the capital of Costa Rica?"}
{"id": "q0087", "input": "What is the capital of Panama?"}
{"id": "q0088", "input": "What is the capital of Cuba?"}
{"id": "q0089", "input": "What is the capital of Jamaica?"}
{"id": "q0090", "input": "What is the capital of Haiti?"}
{"id": "q0091", "input": "What is the capital of Dominican Republic?"}
{"id": "q0092", "input": "What is the capital of Bahamas?"}
{"id": "q0093", "input": "What is the capital of Barbados?"}
{"id": "q0094", "input": "What is the capital of Saint Lucia?"}
{"id": "q0095", "input": "What is the capital of Saint Vincent and the Grenadines?"}
{"id": "q0096", "input": "What is the capital of Grenada?"}
{"id": "q0097", "input": "What is the capital of Trinidad and Tobago?"}
{"id": "q0098", "input": "What is the capital of Saint Kitts and Nevis?"}
{"id": "q0099", "input": "What is the capital of Antigua and Barbuda?"}
{"id": "q0100", "input": "What is the capital of Dominica?"}
33 changes: 33 additions & 0 deletions vllm_batch_inference/job.sub
@@ -0,0 +1,33 @@
JobBatchName = "vLLM batch inference template"

universe = docker
docker_image = ghcr.io/jasonlo/vllm-bash:v0.6.4

# Artefact
executable = run.sh
transfer_input_files = .env, batch_inference.py, inputs.jsonl
transfer_output_files = outputs.jsonl
should_transfer_files = YES

# Logging
stream_output = true
output = condor_log/output.$(Cluster)-$(Process).txt
error = condor_log/error.$(Cluster)-$(Process).txt
log = condor_log/log.$(Cluster)-$(Process).txt

# Compute resources
request_cpus = 2
request_memory = 8GB
request_disk = 10GB

# Extra GPU settings
request_gpus = 1

# (Optional) Depending on the model and batch size, request GPU with sufficient memory
require_gpus = GlobalMemoryMb >= 20000
Requirements = (Target.CUDADriverVersion >= 10.1)
+WantGPULab = true
+GPUJobLength = "short"

# Runs
queue 1
16 changes: 16 additions & 0 deletions vllm_batch_inference/run.sh
@@ -0,0 +1,16 @@
#!/bin/bash

export HOME=$_CONDOR_SCRATCH_DIR
export HF_HOME=$_CONDOR_SCRATCH_DIR/huggingface

# If your job requests a single GPU, setting `CUDA_VISIBLE_DEVICES=0` ensures vLLM sees the assigned device under the expected index and avoids device-name errors. For multiple GPUs, list the indices accordingly (e.g., `0,1` for two GPUs).
export CUDA_VISIBLE_DEVICES=0

echo "Running job on `hostname`"
echo "GPUs assigned: $CUDA_VISIBLE_DEVICES"

echo "Setting up environment variables"
source .env

python3 batch_inference.py
echo "Job completed"