Quick guide to deploying a production-ready, secure Mistral-7B or Mixtral-8x7B vLLM server behind a Traefik reverse proxy and load balancer. By default it uses an HTTPS Let's Encrypt certificate with automated renewal and comes with a password-protected Traefik dashboard.
Note
- You can disable the Traefik dashboard or restrict it to internal access only
- You can replace Let's Encrypt with your own certificate, or use Cloudflare or another third-party provider for the DNS challenge
- Add CrowdSec, or a WAF through Cloudflare, to protect your DNS zone against bots and other cybersecurity threats
Requirements
- Git
- Docker ~ v24
- Docker Compose ~ v2
- CUDA 12 (NVIDIA GPU generations: Ampere, Hopper)
Refer to the guides for installing Docker and Docker Compose on Ubuntu 20.04
Note: To use an NVIDIA GPU you need Docker Compose v1.28.0 or later; otherwise you will get an error
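Before cloning, it can help to confirm that the required tools are on the PATH. The snippet below is a minimal sketch; `missing_cmds` is a hypothetical helper name, not something shipped with this repository.

```shell
# Hypothetical helper: print every command from the argument list
# that cannot be found on the PATH.
missing_cmds() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
  done
}

# Usage (run on the target host):
#   missing_cmds git docker nvidia-smi
#   docker compose version
```

An empty result means all listed tools were found; `docker compose version` then shows which Compose release is installed.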
git clone https://github.com/hurshd0/vllm-docker-traefik.git && cd vllm-docker-traefik
mv .env.example .env
nano .env
Fill in your environment variables
# Traefik settings
ADMIN_EMAIL=<your-email-address> # e.g. ADMIN_EMAIL=admin@example.com
DOMAIN=<yourwebsite.com> # e.g. DOMAIN=website.com
CERT_RESOLVER=letsencrypt # leave blank for an internal, private, or local network
TRAEFIK_USER=admin
TRAEFIK_PASSWORD_HASH=<your-password-hash> # e.g. TRAEFIK_PASSWORD_HASH=$2y$10$OfEBpHk52P/5Ad1qzDj79esMnuhaEbV5of7OBTSurzhtSENLeWzAW
# vLLM settings
MODEL_NAME=<your-huggingface-hub-model> # e.g. MODEL_NAME="mistralai/Mistral-7B-Instruct-v0.1"
HF_TOKEN=<your-huggingface-hub-token> # e.g. HF_TOKEN="hf_XXXXXXXXXXXXXXXXX"
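A rough way to catch variables that were left empty or still hold a `<...>` placeholder is sketched below. `check_env` is a hypothetical helper, and the variable list is simply copied from the template above; `CERT_RESOLVER` is skipped because it may legitimately be blank.

```shell
# Hypothetical helper: report variables in an env file that are missing,
# empty, or still set to a <placeholder>. Returns non-zero if any are found.
check_env() {
  env_file="${1:-.env}"
  missing=0
  for name in ADMIN_EMAIL DOMAIN TRAEFIK_USER TRAEFIK_PASSWORD_HASH MODEL_NAME HF_TOKEN; do
    # Take the text after "NAME=" from the first matching line (empty if absent).
    value=$(sed -n "s/^${name}=//p" "$env_file" | head -n 1)
    case "$value" in
      ''|'<'*) echo "unset: $name"; missing=1 ;;
    esac
  done
  return "$missing"
}

# Usage:
#   check_env .env && docker compose up -d
```

This only looks at the start of each value, so it will not notice an empty quoted value such as `HF_TOKEN=""`.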
Note:
- Install apache2-utils and use htpasswd to generate the value for TRAEFIK_PASSWORD_HASH
sudo apt update && sudo apt install apache2-utils -y
htpasswd -nBC 10 admin
- Get your HF_TOKEN from the Hugging Face Hub
- You can leave CERT_RESOLVER empty if you want to test a local deployment
CERT_RESOLVER=
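`htpasswd -n` prints a `user:hash` pair, while the example value of TRAEFIK_PASSWORD_HASH above contains only the hash. Assuming the variable expects just the part after the colon, a small sketch for stripping the user name follows; `extract_hash` is a hypothetical helper.

```shell
# Hypothetical helper: drop everything up to and including the first ":".
extract_hash() {
  printf '%s\n' "${1#*:}"
}

# Sample line in the format htpasswd prints, using the example hash from above.
sample='admin:$2y$10$OfEBpHk52P/5Ad1qzDj79esMnuhaEbV5of7OBTSurzhtSENLeWzAW'
extract_hash "$sample"

# Usage with a freshly generated hash (prompts for the password):
#   extract_hash "$(htpasswd -nBC 10 admin)"
```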
Start vLLM server
docker compose up -d
Check logs
docker compose logs
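Once the logs show the model has loaded, a test request is a quick sanity check. The sketch below assumes the compose file starts vLLM's OpenAI-compatible server and that Traefik routes `https://<DOMAIN>/v1/...` to it; neither is confirmed here, so adjust the URL to your routing rules. `completion_payload` is a hypothetical helper that builds a minimal request body.

```shell
# Hypothetical helper: build a minimal JSON body for /v1/completions.
# Arguments: model name, prompt, max tokens. No JSON escaping is done,
# so keep the prompt free of quotes and backslashes.
completion_payload() {
  printf '{"model": "%s", "prompt": "%s", "max_tokens": %d}\n' "$1" "$2" "$3"
}

# Usage (assumed URL layout, see note above):
#   curl -s "https://${DOMAIN}/v1/completions" \
#     -H "Content-Type: application/json" \
#     -d "$(completion_payload "$MODEL_NAME" "Hello" 16)"
```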
Get NVIDIA Device IDs
nvidia-smi -L
Add the IDs to the device_ids list
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          capabilities: [gpu]
          device_ids: ['0']
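With several GPUs, the list can be generated instead of typed by hand. This is a minimal sketch that relies on `nvidia-smi -L` printing one line per GPU in the form `GPU 0: <name> (UUID: ...)`; `gpu_ids_to_yaml` is a hypothetical helper.

```shell
# Hypothetical helper: read `nvidia-smi -L` output on stdin and print
# a device_ids line in the same style as the snippet above.
gpu_ids_to_yaml() {
  awk '/^GPU [0-9]+:/ {
         sub(":", "", $2)                    # "0:" -> "0"
         ids = ids sep "\047" $2 "\047"      # \047 is a single quote
         sep = ", "
       }
       END { printf "device_ids: [%s]\n", ids }'
}

# Usage:
#   nvidia-smi -L | gpu_ids_to_yaml
```

Paste the resulting line into the `devices` entry in place of the existing `device_ids`.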
Customize Docker Compose YAML file
Edit the vllm section of the docker-compose.yml file
command:
  --model ${MODEL_NAME}
  --tensor-parallel-size 2 # match your GPU count; it must evenly divide the model's attention-head count (e.g. 1, 2, 4, or 8)
  --load-format pt # needed since both `pt` and `safetensors` are available
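When picking a value for `--tensor-parallel-size`, note that vLLM expects the model's attention-head count to be divisible by it. A small sketch of that check follows; `valid_tp_size` is a hypothetical helper, and the head count of 32 used in the usage lines is an assumption you should confirm against your model's config.json.

```shell
# Hypothetical helper: succeed if the tensor-parallel size is positive
# and evenly divides the number of attention heads.
# Arguments: attention-head count, tensor-parallel size.
valid_tp_size() {
  heads=$1
  tp=$2
  [ "$tp" -gt 0 ] && [ $((heads % tp)) -eq 0 ]
}

# Usage (32 heads assumed):
#   valid_tp_size 32 2 && echo "ok"
#   valid_tp_size 32 3 || echo "3 GPUs will not work"
```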