Add example on LLM offline inference #31

Open · wants to merge 10 commits into master
6 changes: 5 additions & 1 deletion .gitignore
@@ -1,3 +1,7 @@
.vscode
llm/.env
llm/condor_log/*.txt
**/condor_log/*.txt
.devcontainer
vllm_batch_inference/.env
.venv
vllm_batch_inference/outputs.jsonl
1 change: 1 addition & 0 deletions vllm_batch_inference/.env.example
@@ -0,0 +1 @@
HUGGING_FACE_HUB_TOKEN=your_token # Get it from https://huggingface.co/settings/tokens
3 changes: 3 additions & 0 deletions vllm_batch_inference/Dockerfile
@@ -0,0 +1,3 @@
FROM vllm/vllm-openai:v0.6.4
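# The upstream image defaults to launching the OpenAI-compatible API server;
# the entrypoint below is overridden to bash so HTCondor can run run.sh instead.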

ENTRYPOINT ["/bin/bash"]
66 changes: 66 additions & 0 deletions vllm_batch_inference/README.md
@@ -0,0 +1,66 @@
# vLLM Batch Inference on CHTC

This guide demonstrates how to set up, submit, and run batch inference jobs with open-source LLMs using `vllm` on CHTC via `HTCondor`. This is useful for:

- Generating large volumes of synthetic data using open-source LLMs.
- Conducting large-scale, structured data extraction from text.
- Embedding large volumes of text.
- Running any LLM-driven tasks cost-effectively at a massive scale, without relying on expensive commercial alternatives.

## Prerequisites

- Basic knowledge of CHTC, HTCondor, and Docker universe jobs.

## Introduction

In this example, we will use the [Phi-3.5-mini-instruct](https://huggingface.co/microsoft/Phi-3.5-mini-instruct) model with [vLLM v0.6.4 offline inference](https://docs.vllm.ai/en/v0.6.4/getting_started/examples/offline_inference.html) to process 100 [example inputs](inputs.jsonl). The results will be written to `outputs.jsonl` and transferred back to the submit node using the `transfer_output_files` feature in `HTCondor`.

Example inputs:
```jsonl
{"id": "q0001", "input": "What is the capital of France?"}
{"id": "q0002", "input": "What is the capital of Germany?"}
```

Example outputs:
```jsonl
{"id": "q0001", "input": "What is the capital of France?", "output": "The capital of France is Paris."}
{"id": "q0002", "input": "What is the capital of Germany?", "output": "The capital of Germany is Berlin."}
```
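
After the job finishes and `outputs.jsonl` is transferred back, the records can be loaded for downstream use. A minimal sketch, assuming the field names from this example:

```python
import json

# Collect each record's generated output, keyed by its id.
answers = {}
with open("outputs.jsonl") as f:
    for line in f:
        record = json.loads(line)
        answers[record["id"]] = record["output"]

print(answers["q0001"])  # e.g. "The capital of France is Paris."
```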

## Step-by-Step Guide

1. Visit [Hugging Face](https://huggingface.co/settings/tokens) and obtain a token for downloading open-source models.
2. `ssh` into a CHTC submit node.
3. Clone this repository: `git clone https://github.com/CHTC/templates-GPUs.git`.
4. Navigate to the example folder: `cd vllm_batch_inference`.
5. Create a `.env` file based on this [example](.env.example) (e.g., `cp .env.example .env`, then paste in your token).
6. Submit the job: `condor_submit job.sub`.

## FAQ

1. **How to find supported open-source models?**

Visit [Hugging Face Models](https://huggingface.co/models), select a model, and check whether it supports `vllm` by clicking `Use this model`: ![hugging face vllm](img/hf-vllm.png) Note that some models require approval or signing a user agreement on their website.

<span style="color:red">**Always review the model documentation, as input formats can vary significantly between models.**</span>

2. **Why use `vllm`?**

`vllm` offers some of the highest throughput currently available for offline batch inference.

3. **Why not use the official `vllm` container?**

The [official container's](https://hub.docker.com/r/vllm/vllm-openai/tags) entrypoint launches the API server, so this example overrides the entrypoint with `bash`. There may be a way to override it directly in the submit file, but I'm unsure how; please share if you have any insights. You can rebuild your own image with this [Dockerfile](Dockerfile).

4. **How can I avoid memory issues if I don't need maximum performance?**

Set `enforce_eager=True` when instantiating `vllm.LLM` in `batch_inference.py` to mitigate some memory issues (see the first sketch below this list). Refer to the [vllm debugging tips](https://docs.vllm.ai/en/stable/getting_started/debugging.html) for more details.

5. **How can I achieve better tokens-per-second performance?**

Depending on your use case, you may need to tune settings such as `max_tokens`, `max_model_len`, and the `batch_size` used in `batch_inference.py` (see the second sketch below this list). Refer to the [vllm debugging tips](https://docs.vllm.ai/en/stable/getting_started/debugging.html) for more details.
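
A minimal sketch for FAQ 4, mirroring the `LLM` instantiation in `batch_inference.py` with `enforce_eager` added:

```python
from vllm import LLM

# Eager mode skips CUDA graph capture, trading some throughput
# for a smaller GPU memory footprint.
llm = LLM(
    model="microsoft/Phi-3.5-mini-instruct",
    max_model_len=8192,
    enforce_eager=True,
)
```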
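
A sketch for FAQ 5. The values are illustrative placeholders to tune, not recommendations, and `gpu_memory_utilization` is a standard `vllm.LLM` argument that this example does not otherwise use:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3.5-mini-instruct",
    max_model_len=4096,           # smaller context window -> smaller KV cache
    gpu_memory_utilization=0.90,  # fraction of GPU memory vLLM may claim
)
# A lower max_tokens caps generation length and reduces time per request.
sampling_params = SamplingParams(temperature=0.0, max_tokens=512)

# batch_size here is the load_data() argument in batch_inference.py; larger
# batches let vLLM schedule more requests concurrently at the cost of memory.
```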


## About the Author

Contributed by [Jason from Data Science Institute, UW-Madison](https://github.com/jasonlo).
91 changes: 91 additions & 0 deletions vllm_batch_inference/batch_inference.py
@@ -0,0 +1,91 @@
import json
from pathlib import Path
from typing import Generator

from vllm import LLM, SamplingParams


def load_data(
file: Path, batch_size: int
) -> Generator[list[dict[str, str]], None, None]:
"""Load data from a jsonl file as a generator. Assuming at least `id` and `input` fields are present.

Example input format: (docs: https://jsonlines.org/)
{"id": "q0001", "input": "What is the capital of France?"}
{"id": "q0002", "input": "What is the capital of Germany?"}
"""

with open(file, mode="r") as f:
batch = []
for line in f:
batch.append(json.loads(line))
if len(batch) == batch_size:
yield batch
                batch = []  # start a fresh list; clear() would also empty the batch just yielded to the caller
if batch:
yield batch


def format_prompt(user_message: str, system_message: str | None = None) -> str:
"""This function formats the user message into the correct format for the `Phi-3.5-mini-instruct` model.

docs: https://huggingface.co/microsoft/Phi-3.5-mini-instruct#input-formats
"""

if system_message is None:
system_message = (
"You are a helpful assistant. Provide clear and short responses."
)
return (
"<|system|>\n"
f"{system_message}<|end|>\n"
"<|user|>\n"
f"{user_message}<|end|>\n"
"<|assistant|>\n"
)
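
# For reference, format_prompt("What is the capital of France?") renders as:
# <|system|>
# You are a helpful assistant. Provide clear and short responses.<|end|>
# <|user|>
# What is the capital of France?<|end|>
# <|assistant|>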


def inference(
llm: LLM,
data: list[dict[str, str]],
input_field: str = "input",
temperature: float = 0.0,
max_tokens: int = 2048,
) -> list[dict]:
"""Perform a batch of inference on Phi-3.5-mini-instruct via vllm offline mode.

docs: https://docs.vllm.ai/en/v0.6.4/getting_started/examples/offline_inference.html
"""

inputs = [item[input_field] for item in data]
formatted_inputs = [format_prompt(user_message=x) for x in inputs]

sampling_params = SamplingParams(temperature=temperature, max_tokens=max_tokens)
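    # llm.generate returns one RequestOutput per prompt, in the same order as
    # formatted_inputs, so the zip with `data` below keeps ids and outputs aligned.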
outputs = llm.generate(prompts=formatted_inputs, sampling_params=sampling_params)
text_outputs = [raw_output.outputs[0].text.strip() for raw_output in outputs]
return [{**item, "output": output} for item, output in zip(data, text_outputs)]


def save_data(data: list[dict], file: Path) -> None:
"""Save data to a jsonl file in append mode.

Example output format:
{"id": "q0001", "input": "What is the capital of France?", "output": "Paris."}
{"id": "q0002", "input": "What is the capital of Germany?", "output": "Berlin."}
"""
with open(file, mode="a") as f:
for item in data:
f.write(json.dumps(item) + "\n")


# Main script
# Perform mini-batch inference on the input data and save the results.
# Adjust the batch size based on your requirements.
# Note: Larger batch sizes may require more GPU memory but can be faster.
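# Note: save_data() appends to outputs.jsonl; delete any stale outputs.jsonl
# before re-running to avoid duplicate records.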

data = load_data(Path("inputs.jsonl"), batch_size=20)
llm = LLM(model="microsoft/Phi-3.5-mini-instruct", max_model_len=8192)
for batch in data:
outputs = inference(llm=llm, data=batch)
save_data(outputs, Path("outputs.jsonl"))
print(f"Processed {len(batch)} examples.")
Empty file.
Binary file added vllm_batch_inference/img/hf-vllm.png
100 changes: 100 additions & 0 deletions vllm_batch_inference/inputs.jsonl
@@ -0,0 +1,100 @@
{"id": "q0001", "input": "What is the capital of France?"}
{"id": "q0002", "input": "What is the capital of Germany?"}
{"id": "q0003", "input": "What is the capital of Italy?"}
{"id": "q0004", "input": "What is the capital of Spain?"}
{"id": "q0005", "input": "What is the capital of Portugal?"}
{"id": "q0006", "input": "What is the capital of Belgium?"}
{"id": "q0007", "input": "What is the capital of Netherlands?"}
{"id": "q0008", "input": "What is the capital of Switzerland?"}
{"id": "q0009", "input": "What is the capital of Austria?"}
{"id": "q0010", "input": "What is the capital of Sweden?"}
{"id": "q0011", "input": "What is the capital of Norway?"}
{"id": "q0012", "input": "What is the capital of Denmark?"}
{"id": "q0013", "input": "What is the capital of Finland?"}
{"id": "q0014", "input": "What is the capital of Iceland?"}
{"id": "q0015", "input": "What is the capital of Ireland?"}
{"id": "q0016", "input": "What is the capital of United Kingdom?"}
{"id": "q0017", "input": "What is the capital of Greece?"}
{"id": "q0018", "input": "What is the capital of Turkey?"}
{"id": "q0019", "input": "What is the capital of Russia?"}
{"id": "q0020", "input": "What is the capital of Ukraine?"}
{"id": "q0021", "input": "What is the capital of Poland?"}
{"id": "q0022", "input": "What is the capital of Czech Republic?"}
{"id": "q0023", "input": "What is the capital of Slovakia?"}
{"id": "q0024", "input": "What is the capital of Hungary?"}
{"id": "q0025", "input": "What is the capital of Romania?"}
{"id": "q0026", "input": "What is the capital of Bulgaria?"}
{"id": "q0027", "input": "What is the capital of Serbia?"}
{"id": "q0028", "input": "What is the capital of Croatia?"}
{"id": "q0029", "input": "What is the capital of Slovenia?"}
{"id": "q0030", "input": "What is the capital of Bosnia and Herzegovina?"}
{"id": "q0031", "input": "What is the capital of Montenegro?"}
{"id": "q0032", "input": "What is the capital of Albania?"}
{"id": "q0033", "input": "What is the capital of Macedonia?"}
{"id": "q0034", "input": "What is the capital of Kosovo?"}
{"id": "q0035", "input": "What is the capital of Armenia?"}
{"id": "q0036", "input": "What is the capital of Azerbaijan?"}
{"id": "q0037", "input": "What is the capital of Georgia?"}
{"id": "q0038", "input": "What is the capital of Kazakhstan?"}
{"id": "q0039", "input": "What is the capital of Uzbekistan?"}
{"id": "q0040", "input": "What is the capital of Turkmenistan?"}
{"id": "q0041", "input": "What is the capital of Kyrgyzstan?"}
{"id": "q0042", "input": "What is the capital of Tajikistan?"}
{"id": "q0043", "input": "What is the capital of China?"}
{"id": "q0044", "input": "What is the capital of Japan?"}
{"id": "q0045", "input": "What is the capital of South Korea?"}
{"id": "q0046", "input": "What is the capital of North Korea?"}
{"id": "q0047", "input": "What is the capital of Mongolia?"}
{"id": "q0048", "input": "What is the capital of India?"}
{"id": "q0049", "input": "What is the capital of Pakistan?"}
{"id": "q0050", "input": "What is the capital of Bangladesh?"}
{"id": "q0051", "input": "What is the capital of Sri Lanka?"}
{"id": "q0052", "input": "What is the capital of Nepal?"}
{"id": "q0053", "input": "What is the capital of Bhutan?"}
{"id": "q0054", "input": "What is the capital of Myanmar?"}
{"id": "q0055", "input": "What is the capital of Thailand?"}
{"id": "q0056", "input": "What is the capital of Laos?"}
{"id": "q0057", "input": "What is the capital of Cambodia?"}
{"id": "q0058", "input": "What is the capital of Vietnam?"}
{"id": "q0059", "input": "What is the capital of Malaysia?"}
{"id": "q0060", "input": "What is the capital of Singapore?"}
{"id": "q0061", "input": "What is the capital of Indonesia?"}
{"id": "q0062", "input": "What is the capital of Philippines?"}
{"id": "q0063", "input": "What is the capital of Brunei?"}
{"id": "q0064", "input": "What is the capital of Australia?"}
{"id": "q0065", "input": "What is the capital of New Zealand?"}
{"id": "q0066", "input": "What is the capital of Papua New Guinea?"}
{"id": "q0067", "input": "What is the capital of Fiji?"}
{"id": "q0068", "input": "What is the capital of Solomon Islands?"}
{"id": "q0069", "input": "What is the capital of Vanuatu?"}
{"id": "q0070", "input": "What is the capital of Samoa?"}
{"id": "q0071", "input": "What is the capital of Tonga?"}
{"id": "q0072", "input": "What is the capital of Kiribati?"}
{"id": "q0073", "input": "What is the capital of Micronesia?"}
{"id": "q0074", "input": "What is the capital of Palau?"}
{"id": "q0075", "input": "What is the capital of Marshall Islands?"}
{"id": "q0076", "input": "What is the capital of Nauru?"}
{"id": "q0077", "input": "What is the capital of Tuvalu?"}
{"id": "q0078", "input": "What is the capital of Canada?"}
{"id": "q0079", "input": "What is the capital of United States?"}
{"id": "q0080", "input": "What is the capital of Mexico?"}
{"id": "q0081", "input": "What is the capital of Guatemala?"}
{"id": "q0082", "input": "What is the capital of Belize?"}
{"id": "q0083", "input": "What is the capital of El Salvador?"}
{"id": "q0084", "input": "What is the capital of Honduras?"}
{"id": "q0085", "input": "What is the capital of Nicaragua?"}
{"id": "q0086", "input": "What is the capital of Costa Rica?"}
{"id": "q0087", "input": "What is the capital of Panama?"}
{"id": "q0088", "input": "What is the capital of Cuba?"}
{"id": "q0089", "input": "What is the capital of Jamaica?"}
{"id": "q0090", "input": "What is the capital of Haiti?"}
{"id": "q0091", "input": "What is the capital of Dominican Republic?"}
{"id": "q0092", "input": "What is the capital of Bahamas?"}
{"id": "q0093", "input": "What is the capital of Barbados?"}
{"id": "q0094", "input": "What is the capital of Saint Lucia?"}
{"id": "q0095", "input": "What is the capital of Saint Vincent and the Grenadines?"}
{"id": "q0096", "input": "What is the capital of Grenada?"}
{"id": "q0097", "input": "What is the capital of Trinidad and Tobago?"}
{"id": "q0098", "input": "What is the capital of Saint Kitts and Nevis?"}
{"id": "q0099", "input": "What is the capital of Antigua and Barbuda?"}
{"id": "q0100", "input": "What is the capital of Dominica?"}
33 changes: 33 additions & 0 deletions vllm_batch_inference/job.sub
@@ -0,0 +1,33 @@
JobBatchName = "vLLM batch inference template"

universe = docker
docker_image = ghcr.io/jasonlo/vllm-bash:v0.6.4

# Artefact
executable = run.sh
transfer_input_files = .env, batch_inference.py, inputs.jsonl
transfer_output_files = outputs.jsonl
should_transfer_files = YES

# Logging
stream_output = true
output = condor_log/output.$(Cluster)-$(Process).txt
error = condor_log/error.$(Cluster)-$(Process).txt
log = condor_log/log.$(Cluster)-$(Process).txt

# Compute resources
request_cpus = 2
request_memory = 8GB
request_disk = 10GB

# Extra GPU settings
request_gpus = 1

# (Optional) Depending on the model and batch size, request GPU with sufficient memory
require_gpus = GlobalMemoryMb >= 20000
Requirements = (Target.CUDADriverVersion >= 10.1)
+WantGPULab = true
+GPUJobLength = "short"

# Runs
queue 1
16 changes: 16 additions & 0 deletions vllm_batch_inference/run.sh
@@ -0,0 +1,16 @@
#!/bin/bash

export HOME=$_CONDOR_SCRATCH_DIR
export HF_HOME=$_CONDOR_SCRATCH_DIR/huggingface

# If your job requests a single GPU, setting `CUDA_VISIBLE_DEVICES=0` ensures vLLM sees the assigned device under the expected index and avoids device-name errors. For multiple GPUs, list the indices accordingly (e.g., `0,1` for two GPUs).
export CUDA_VISIBLE_DEVICES=0

echo "Running job on `hostname`"
echo "GPUs assigned: $CUDA_VISIBLE_DEVICES"

echo "Setting up environment variables"
source .env

python3 batch_inference.py
echo "Job completed"