This repository provides a starting point for using the vLLM library on the cluster. vLLM supports both offline and online usage:
- Offline Mode: Load the model once and process your data locally.
- Online Mode: Start a service to generate text via an endpoint, similar to OpenAI's GPT API.
This guide focuses on offline usage, as the online service may block cluster resources.
conda create -n vLLM-Starter python=3.11 -y
conda activate vLLM-Starter
pip install vllm

It is recommended to store LLM models in a central location, as these large files can be shared across different users.
By default, we use /ds/models/llms/cache as the storage location, where several LLMs are already stored.
To set this directory as the default cache location, you need to configure the environment variable HF_HUB_CACHE.
In the examples we define the cache location in the srun command.
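If you prefer to set the cache location inside your script rather than via srun --export, a minimal sketch (the path is the shared default mentioned above):

```python
import os

# Configure the shared Hugging Face cache before vLLM / huggingface_hub
# are imported, so that model downloads end up in the central location.
os.environ.setdefault("HF_HUB_CACHE", "/ds/models/llms/cache")

from vllm import LLM  # imported only after the cache location is set
```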
Please replace {script.py} with your script.
srun --partition=RTXA6000-SLT \
--job-name=vllm-test \
--export=ALL,HF_HUB_CACHE=/ds/models/hf-cache-slt/ \
--nodes=1 \
--ntasks=1 \
--cpus-per-task=6 \
--gpus-per-task=1 \
--mem-per-cpu=4G \
--time=1-00:00:00 \
python {script.py}

A simple example demonstrating how to use vLLM in offline mode can be found in the file offline_simpleInference.py. This script loads a model and generates text based on a prompt.
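The core of such a script looks roughly like the following sketch (the model name and prompt are placeholders, not necessarily the ones used in the repository):

```python
from vllm import LLM, SamplingParams

# Load the model once; it is fetched from HF_HUB_CACHE if already downloaded.
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")

sampling_params = SamplingParams(temperature=0.8, max_tokens=128)
outputs = llm.generate(["The capital of France is"], sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```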
Example
srun --partition=RTXA6000-SLT \
--job-name=vllm-test \
--export=ALL,HF_HUB_CACHE=/ds/models/hf-cache-slt/ \
--nodes=1 \
--ntasks=1 \
--cpus-per-task=6 \
--gpus-per-task=1 \
--mem-per-cpu=4G \
python offline_simpleInference.py

The following script demonstrates how to load a model and generate text based on a chat-style prompt. You can find this more complex example in the file offline_chatstyle.py.
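Conceptually, the chat-style variant passes a list of role/content messages instead of a raw prompt. A minimal sketch using vLLM's chat interface (the model name and the use of llm.chat are assumptions; the repository script may differ in detail):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")

# The chat interface applies the model's chat template to the messages.
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain what vLLM is in one sentence."},
]
outputs = llm.chat(conversation, SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)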
Example
srun --partition=RTXA6000-SLT \
--job-name=vllm-test \
--export=ALL,HF_HUB_CACHE=/ds/models/hf-cache-slt/ \
--nodes=1 \
--ntasks=1 \
--cpus-per-task=6 \
--gpus-per-task=1 \
--mem-per-cpu=4G \
python offline_chatstyle.py

The GuidedDecodingParams class in vLLM allows you to define the output structure for tasks that require a predefined format, such as Named Entity Recognition (NER).
You can use various methods to guide the decoding process, including regular expressions, JSON objects, grammar, or simple binary choices.
The example can be found in the file offline_structuredOutput.py.
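To illustrate the idea, here is a hedged sketch that restricts the output to one of two labels; the actual script may use a different model, task, and constraint type (regex, JSON schema, or grammar):

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")

# Constrain generation to a fixed set of choices.
guided = GuidedDecodingParams(choice=["positive", "negative"])
params = SamplingParams(max_tokens=8, guided_decoding=guided)

outputs = llm.generate(["Classify the sentiment: 'I love this movie.'"], params)
print(outputs[0].outputs[0].text)
```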
Example
srun --partition=RTXA6000-SLT \
--job-name=vllm-test \
--export=ALL,HF_HUB_CACHE=/ds/models/hf-cache-slt/ \
--nodes=1 \
--ntasks=1 \
--cpus-per-task=6 \
--gpus-per-task=1 \
--mem-per-cpu=4G \
python offline_structuredOutput.py

vLLM can also be applied to vision tasks, such as generating captions for images.
When using vision LLMs, you have to use the model-specific prompt template and provide the stop_token_ids.
Please check the official vLLM GitHub repository for the prompt template and stop_token_ids of your model.
The example can be found in the file offline_visionExample.py, which loads the image in data/example.jpg and generates a caption for it.
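In code, the image is passed via multi_modal_data next to the prompt. A rough sketch (the model, its prompt template, and the need for stop_token_ids depend on the model you pick and are assumptions here):

```python
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="llava-hf/llava-1.5-7b-hf")

image = Image.open("data/example.jpg").convert("RGB")
prompt = "USER: <image>\nDescribe this image briefly. ASSISTANT:"  # LLaVA-1.5 style template

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=128),  # add stop_token_ids here if your model requires them
)
print(outputs[0].outputs[0].text)
```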
Example
srun --partition=RTXA6000-SLT \
--job-name=vllm-test \
--export=ALL,HF_HUB_CACHE=/ds/models/hf-cache-slt/ \
--nodes=1 \
--ntasks=1 \
--cpus-per-task=6 \
--gpus-per-task=1 \
--mem-per-cpu=4G \
python offline_visionExample.py

As we have seen in the previous example, vLLM requires the correct prompt template and stop_token_ids for vision tasks. In the original examples from the vLLM repository, the code loads the LLM for each question. I have modified the code in offline_visionImproved.py to load the LLM only once and then reuse it for all questions.
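The idea is simply to hoist the LLM construction out of the per-question loop. Schematically (model and prompt template are placeholders, not the repository's exact code):

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Build the engine once ...
llm = LLM(model="llava-hf/llava-1.5-7b-hf")
sampling_params = SamplingParams(max_tokens=128)
image = Image.open("data/example.jpg").convert("RGB")

questions = ["What is shown in the image?", "Which colors dominate the image?"]

# ... and reuse it for every question instead of reloading the weights each time.
for question in questions:
    prompt = f"USER: <image>\n{question} ASSISTANT:"
    outputs = llm.generate(
        {"prompt": prompt, "multi_modal_data": {"image": image}},
        sampling_params,
    )
    print(outputs[0].outputs[0].text)
```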
Example
srun --partition=RTXA6000-SLT \
--job-name=vllm-test \
--export=ALL,HF_HUB_CACHE=/ds/models/hf-cache-slt/ \
--nodes=1 \
--ntasks=1 \
--cpus-per-task=6 \
--gpus-per-task=1 \
--mem-per-cpu=4G \
python offline_visionImproved.py --model=LLAVANext

LLMs are known for their substantial size, which makes them highly memory-intensive. Quantization is a technique designed to reduce a model's memory footprint. Quantized models can be loaded just like their standard counterparts in vLLM. In cases where a quantized version of your model isn’t readily available, you can perform the quantization yourself. Beyond the general concept, there are various methods and tools available for quantizing models. If you are interested in model quantization for vLLM, we refer to the vLLM documentation. As I currently lack experience with quantization, I cannot provide insights into best practices.
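For example, loading an already AWQ-quantized checkpoint looks just like loading a regular model; a minimal sketch (the model name is only an illustration):

```python
from vllm import LLM, SamplingParams

# vLLM detects the quantization scheme from the checkpoint's config;
# passing quantization="awq" simply makes the choice explicit.
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct-AWQ", quantization="awq")

outputs = llm.generate(["Quantization reduces"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```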
AWQ quantization
You can find an example for [AWQ quantization](https://arxiv.org/abs/2306.00978) in the file `quantisation_quantizeModel.py`. AWQ quantization uses calibration instances, so you should execute it with GPU resources.

pip install autoawq

srun --partition=RTXA6000-SLT \
--job-name=quantisation \
--export=ALL,HF_HUB_CACHE=/ds/models/hf-cache-slt/ \
--nodes=1 \
--ntasks=1 \
--gpus-per-task=1 \
--cpus-per-task=3 \
--mem=50G \
--time=1-00:00:00 \
python quantisation_quantizeModel.py

I followed the unsloth tutorial to fine-tune a Llama 3.2 model on the FineTome-100k dataset.
The current example is a simple fine-tuning example, which can be found in fineTuningSFT.py.
Training is currently only 60 steps and the model is saved to the /ds/models/hf-cache-slt/myAwesomeModel directory.
pip install unsloth

Example
Fine-Tuning:
srun --partition=RTXA6000-SLT \
--job-name=fine-tuning \
--export=ALL,HF_HUB_CACHE=/ds/models/hf-cache-slt/,HF_DATASETS_CACHE=/ds/models/hf-cache-slt/ \
--nodes=1 \
--ntasks=1 \
--cpus-per-task=6 \
--gpus-per-task=1 \
--mem-per-cpu=4G \
python fineTuningSFT.py

Inference using vLLM:
srun --partition=RTXA6000 \
--job-name=vllm-test \
--export=ALL,HF_HUB_CACHE=/ds/models/hf-cache-slt/ \
--nodes=1 \
--ntasks=1 \
--cpus-per-task=6 \
--gpus-per-task=1 \
--mem-per-cpu=4G \
python offline_simpleInference.py --model_name=/ds/models/hf-cache-slt/myAwesomeModel/ --prompt="Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,"

I followed the unsloth tutorial to fine-tune a Qwen2_VL model on the LaTeX_OCR dataset.
The current example is a simple fine-tuning example, which can be found in fineTuningVision.py.
Training is currently only 60 steps and the model is saved to the /ds/models/hf-cache-slt/myAwesomeVisionModel directory.
Example
srun --partition=RTXA6000 \
--job-name=fine-tuning \
--export=ALL,HF_HUB_CACHE=/ds/models/hf-cache-slt/,HF_DATASETS_CACHE=/ds/models/hf-cache-slt/ \
--nodes=1 \
--ntasks=1 \
--cpus-per-task=6 \
--gpus-per-task=1 \
--mem-per-cpu=4G \
python fineTuningVision.py

vLLM currently supports only single-file GGUF models. To use multi-file GGUF models, you need to download the individual GGUF files and merge them into a single file. This tutorial uses DeepSeek-R1 with 671B parameters as an example.
DeepSeek Tutorial
huggingface-cli download "unsloth/DeepSeek-R1-Q2_K" --local-dir /ds/models/hf-cache-slt/ --include='*-Q2_K-*'or alternatively for V3.1
huggingface-cli download "unsloth/DeepSeek-V3.1-GGUF" --local-dir /ds/models/hf-cache-slt/ --include='*-Q2_K_XL*'To merge GGUF files, we'll use LLAMA.cpp:
Note: The explanation below is for local usage. On the cluster you have to take a different approach.
# Clone and build LLAMA.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release
# Merge the GGUF files
./build/bin/llama-gguf-split --merge ~/deepseek/DeepSeek-V3-Q2_K_XS/DeepSeek-V3-Q2_K_XS-00001-of-00005.gguf ~/deepseek/oneModel/output.gguf

You can find the DeepSeek-R1 model here: /ds/models/hf-cache-slt/deepseek/DeepSeek-R1.gguf
Note: On the cluster, first build a new container using:
srun \
--ntasks=1 \
--nodes=1 \
--gpus=1 \
--partition=L40S,A100-40GB,A100-80GB,H100,RTXA6000,H200,RTX3090,H100-SLT,RTXA6000-SLT \
--time=04:00:00 \
--immediate=3600 \
--mem-per-cpu=18G \
--cpus-per-task=6 \
--container-image=/netscratch/enroot/nvcr.io_nvidia_pytorch_23.05-py3.sqsh \
--container-save=/netscratch/thomas/llama_30.06.2025.sqsh \
--container-mounts="`pwd`":"`pwd`" \
--container-workdir="`pwd`" \
bash install.sh

The install.sh file simply contains:
#!/bin/bash
apt-get update -y && apt-get install -y software-properties-common \
&& add-apt-repository -y ppa:deadsnakes/ppa \
&& apt-get -y update
# Clone and build LLAMA.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

The newly generated container (/netscratch/thomas/llama_30.06.2025.sqsh) now has a working version of llama.cpp. You can call it with:
srun -K --job-name="olmocr" --container-mounts=/netscratch:/netscratch,/ds:/ds,$HOME:$HOME --container-workdir="$(pwd)" --container-image=/netscratch/thomas/llama_30.06.2025.sqsh --ntasks=1 --nodes=1 --gpus=1 --partition=L40S,A100-40GB,A100-80GB,H100,RTXA6000,H200,RTX3090,H100-SLT,RTXA6000-SLT --time=04:00:00 --mem=100GB ./llama.cpp/build/bin/llama-gguf-split --merge ~/deepseek/DeepSeek-V3-Q2_K_XS/DeepSeek-V3-Q2_K_XS-00001-of-00005.gguf ~/deepseek/oneModel/output.ggufThere are some DeepSeek specific changes for this project, following this pull-request vllm-project/vllm#13167:
pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly

There is no need to download the following files; they can all be found here: /ds/models/hf-cache-slt/deepseek/deepseek-config
From: https://huggingface.co/deepseek-ai/DeepSeek-R1/tree/main
- generation_config.json
- tokenizer_config.json
- tokenizer.json
- model.safetensors.index.json
From: https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main
- config.json
- configuration_deepseek.py
- modeling_deepseek.py
Don't forget to change torch_dtype in config.json
export VLLM_MLA_DISABLE=1

For running DeepSeek-R1, please use the command below and disable MLA as shown above. For regular GGUF models, however, you can simply pass the GGUF file to the simple example script.
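For a regular single-file GGUF model, the offline API looks the same as before; vLLM recommends pointing the tokenizer at the original Hugging Face repository of the base model. A hedged sketch (the file path and base model name are assumptions):

```python
from vllm import LLM, SamplingParams

# Load the merged single-file GGUF checkpoint; the tokenizer is taken
# from the base model's Hugging Face repository.
llm = LLM(
    model="/ds/models/hf-cache-slt/my-model.Q4_K_M.gguf",  # hypothetical path
    tokenizer="Qwen/Qwen2.5-1.5B-Instruct",                # hypothetical base model
)

outputs = llm.generate(["GGUF models are"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```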
srun --partition=H100 \
--job-name=deepseek-r1 \
--export=ALL,HF_HUB_CACHE=/ds/models/hf-cache-slt/ \
--nodes=1 \
--ntasks=1 \
--gpus-per-task=4 \
--cpus-per-task=12 \
--mem=64G \
--time=1-00:00:00 \
python deepseek.py

You can also use vLLM in online/interactive mode. As previously mentioned, this mode is not recommended because it may block cluster resources, but it is useful for testing and debugging. Please ensure that you shut down the service once you are done with it, as it consumes the allocated resources even when idle. This mode starts a service on the cluster, which you can access via a REST interface. It is similar to the tutorial from perseus-textgen, but in my personal experience less brittle.
Steps:
- Start the service
- Retrieve the node name using `squeue -u $USER`.
- Access the service documentation at http://$NODE.kl.dfki.de:8000/docs.
vllm serve Qwen/Qwen2.5-1.5B-Instruct

Please set --download-dir accordingly.
srun --partition=RTXA6000-SLT \
--job-name=vllm_serve \
--nodes=1 \
--ntasks=1 \
--cpus-per-task=6 \
--gpus-per-task=1 \
--mem-per-cpu=4G \
--time=1-00:00:00 \
vllm serve "Qwen/Qwen2.5-1.5B-Instruct" \
--download-dir=/ds/models/llms/cache \
--port=8000

Call this on the head node to get the list of your running jobs:
squeue -u $USER

Then, you can access the API documentation at the following endpoint (replace $NODE with the node name): http://$NODE.kl.dfki.de:8000/docs
Replace $NODE with the node name.
curl http://${NODE}.kl.dfki.de:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-1.5B-Instruct",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'

Please keep in mind that in order to call the remote model from your local machine, you need to do one of two things:
- The VPN forwards all ports greater than 10000, so starting vLLM on port 18000 instead of 8000 is sufficient for calling the remote service from your local machine.
- Alternatively, you can forward the port from your local machine using SSH:
ssh -L 5001:<$NODE>:8000 <username>@<loginnode>

Then you can access the service on your local machine at http://localhost:5001.
curl http://localhost:5001/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen2.5-1.5B-Instruct",
"prompt": "San Francisco is a",
"max_tokens": 7,
"temperature": 0
}'

Check the example in online/remoteGeneration.py.
Check the example in online/remoteChat.py.
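Both scripts essentially talk to the server's OpenAI-compatible API. A rough sketch of the chat case using the openai client (the base URL, port, and model name must match your own vllm serve call and are assumptions here):

```python
from openai import OpenAI

# Talk to the vLLM server through its OpenAI-compatible endpoint
# (here via the SSH tunnel on localhost:5001; the API key is not checked).
client = OpenAI(base_url="http://localhost:5001/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "San Francisco is a"}],
    max_tokens=32,
    temperature=0,
)
print(response.choices[0].message.content)
```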