Automated testing framework for vLLM embedding and LLM generative models with NUMA-aware CPU optimization.
On your control machine (where you run Ansible):
# Install Ansible (if not already installed)
# On macOS
brew install ansible
# On Ubuntu/Debian
sudo apt update && sudo apt install -y ansible
# On RHEL/Fedora
sudo dnf install -y ansible
# Verify installation
ansible --version  # Should be 2.14+

Installing Required Packages on DUT and Load Generator:
Option 1: Manual installation:
# On both DUT and Load Generator hosts
# RHEL/Fedora:
sudo dnf install -y podman python3
# Ubuntu/Debian:
sudo apt update && sudo apt install -y podman python3
# Verify installation
podman --version
python3 --version

Option 2: Automated via platform setup:
# Configure BOTH DUT and Load Generator
ansible-playbook -i inventory/hosts.yml setup-platform.yml
# Configure ONLY DUT
ansible-playbook -i inventory/hosts.yml setup-platform.yml --limit dut
# Configure ONLY Load Generator
ansible-playbook -i inventory/hosts.yml setup-platform.yml --limit load_generator
# Reboot required for kernel parameters to take effect
ansible -i inventory/hosts.yml all -b -m reboot

What this configures:
- Installs: podman, tuned, kernel-tools, numactl
- Performance optimizations: CPU isolation, NUMA topology, IRQ balancing
- Targets: DUT and Load Generator hosts only (NOT your Ansible control machine)
Important Notes:
- Option 2 (setup-platform.yml): Automatically installs Podman, Python 3, and all performance tools on DUT and Load Generator hosts. Your Ansible control machine only needs Ansible itself and the collections from requirements.yml. Before running setup-platform.yml, install the required collections:
  cd automation/test-execution/ansible
  ansible-galaxy collection install -r requirements.yml
- Option 1 (Manual): If you prefer a minimal setup, manually install Podman/Docker and Python 3 on the DUT and Load Generator hosts before running test playbooks.
- Container Images (vLLM and GuideLLM): Automatically pulled during test execution. Test playbooks handle this—no manual installation needed.
- Reboot Required: After running setup-platform.yml, reboot hosts for kernel parameters to take effect.
Verify SSH and Network Access:
Ensure the following before running playbooks:
- OS: Ubuntu 22.04+, RHEL 9+, or Fedora 38+
- SSH Access: Password-less SSH access from control machine
- User privileges: User should have sudo access (or use root directly)
- Network: Hosts can reach each other (DUT port 8000 accessible from Load Generator)
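A quick way to verify the port requirement before running playbooks is a small TCP probe from the Load Generator. This is an illustrative sketch, not part of the playbooks: check_port is a hypothetical helper built on bash's /dev/tcp pseudo-device, and the fallback address is the inventory placeholder default.

```shell
# Illustrative helper: succeeds if a TCP connection to host:port
# opens within 3 seconds, using bash's /dev/tcp pseudo-device.
check_port() {
  timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Run this on the Load Generator against the DUT:
if check_port "${DUT_HOSTNAME:-192.168.1.10}" 8000; then
  echo "DUT port 8000 reachable"
else
  echo "DUT port 8000 unreachable"
fi
```

If the probe fails, check firewalls and security groups before suspecting vLLM itself.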
# Verify SSH access from control machine
ssh -i ~/.ssh/your-key.pem ec2-user@your-dut-hostname
ssh -i ~/.ssh/your-key.pem ec2-user@your-loadgen-hostname
# Test sudo access on remote hosts
ssh ec2-user@your-dut-hostname 'sudo whoami'  # Should return 'root'

Option A: Environment Variables (Recommended - No file edits)
export DUT_HOSTNAME=your-dut-hostname.compute.amazonaws.com
export LOADGEN_HOSTNAME=your-loadgen-hostname.compute.amazonaws.com
export ANSIBLE_SSH_USER=ec2-user
export ANSIBLE_SSH_KEY=~/.ssh/your-key.pem
export HF_TOKEN=hf_xxxxx # If using gated models like Llama
# Container images (optional - defaults are provided)
export VLLM_CONTAINER_IMAGE=docker.io/vllm/vllm-openai-cpu:v0.18.0
export GUIDELLM_CONTAINER_IMAGE=ghcr.io/vllm-project/guidellm:v0.6.0
# Health check timeout (optional - default: 600s)
export VLLM_HEALTH_TIMEOUT=600
# Custom entrypoint (optional - only needed for containers without vLLM in default path)
# Example for AMD ZenDNN containers:
export VLLM_CONTAINER_IMAGE=docker.io/amdih/zendnn_zentorch:vllm_v0.18.0_zentorch_v5.2.1_rhel9.5_r5.2.1
export VLLM_CONTAINER_ENTRYPOINT='["vllm", "serve"]'
export VLLM_HEALTH_TIMEOUT=1200  # AMD ZenDNN needs longer startup time

The inventory automatically uses these variables with sensible defaults.
Option B: Edit inventory/hosts.yml directly
Edit inventory/hosts.yml - change the IP addresses on lines 63 and 73 (see inventory documentation for details):
all:
  vars:
    # Common settings (pre-configured with defaults)
    ansible_user: "{{ lookup('env', 'ANSIBLE_SSH_USER') | default('ec2-user', true) }}"
    # ...
  children:
    # ...
    dut:
      hosts:
        vllm-server:
          ansible_host: "{{ lookup('env', 'DUT_HOSTNAME') | default('192.168.1.10', true) }}"  # ⚠️ Line 63
    load_generator:
      hosts:
        guidellm-client:
          ansible_host: "{{ lookup('env', 'LOADGEN_HOSTNAME') | default('192.168.1.20', true) }}"  # ⚠️ Line 73

The file uses Jinja2 templating to read environment variables. If a variable is not set, it falls back to the default IPs shown above. Everything else is pre-configured in inventory/group_vars/!
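The lookup-with-default pattern behaves like shell parameter expansion; a quick sketch of what the inventory resolves to (the addresses are the placeholder defaults above, and dut.example.com is a made-up example):

```shell
# Unset: the inventory falls back to the placeholder default
unset DUT_HOSTNAME
echo "${DUT_HOSTNAME:-192.168.1.10}"   # prints 192.168.1.10

# Exported: the environment value wins
export DUT_HOSTNAME=dut.example.com
echo "${DUT_HOSTNAME:-192.168.1.10}"   # prints dut.example.com
```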
cd automation/test-execution/ansible
ansible -i inventory/hosts.yml all -m ping

Note: For production benchmarking or accurate performance measurements, run Platform Setup first (requires reboot). This is optional for quick testing or development but highly recommended for reliable results.
# Set HuggingFace token (if using gated models)
export HF_TOKEN=hf_xxxxx
# Run LLM benchmark with auto-configured cores
ansible-playbook -i inventory/hosts.yml \
llm-benchmark-auto.yml \
-e "test_model=meta-llama/Llama-3.2-1B-Instruct" \
-e "workload_type=chat" \
-e "requested_cores=16"
# Results are saved locally at:
# results/llm/meta-llama__Llama-3.2-1B-Instruct/chat-*/benchmarks.json

Done! See sections below for advanced usage and additional playbooks.
┌─────────────────┐ ┌──────────────────┐
│ Load Generator │◄───────►│ DUT │
│ │ HTTP │ │
│ - GuideLLM │ :8000 │ - vLLM Server │
│ - vllm bench │ │ - Containerized │
└─────────────────┘ └──────────────────┘
Two-node architecture:
- DUT (Device Under Test): Runs vLLM server in container
- Load Generator: Runs benchmarking tools (GuideLLM, vllm bench serve)
| Playbook | Purpose | Usage |
|---|---|---|
| llm-benchmark.yml | Single LLM test with manual config | -e "core_config_name=..." |
| llm-benchmark-auto.yml | Single LLM test with auto core allocation | -e "requested_cores=16" |
| llm-benchmark-concurrent-load.yml | 3-phase concurrent load testing | -e "base_workload=chat" -e "core_sweep_counts=[16,32]" |
| llm-core-sweep.yml | Test multiple core configs | -e "core_config_names=[...]" |
| llm-core-sweep-auto.yml | Test multiple core counts (auto-allocated) | -e "requested_cores_list=[8,16,32]" |
| embedding-benchmark.yml | Single embedding test | -e "test_model=..." -e "scenario=baseline" |
| embedding-core-sweep.yml | Embedding core sweep | Multiple configs |
| Playbook | Purpose | When |
|---|---|---|
| setup-platform.yml | Configure DUT/LoadGen for optimal performance | One-time, before testing |
| collect-logs.yml | Collect logs from DUT | After tests |
| health-check.yml | Check vLLM server health | Standalone or imported |
Pre-configured in inventory/group_vars/all/test-workloads.yml:
| Workload | ISL:OSL | Use Case | Baseline vLLM Args |
|---|---|---|---|
| embedding | 512:1 | Embedding models | --dtype=bfloat16 --max-model-len=512 |
| chat | 512:256 | Chatbots | --dtype=bfloat16 --no-enable-prefix-caching |
| rag | 4096:512 | RAG applications | --dtype=bfloat16 --no-enable-prefix-caching |
| code | 512:4096 | Code generation | --dtype=bfloat16 --no-enable-prefix-caching |
| short_codegen | 256:2048 | Short code generation | --dtype=bfloat16 --no-enable-prefix-caching |
| summarization | 1024:256 | Text summarization | --dtype=bfloat16 --no-enable-prefix-caching |
| Workload | ISL±σ:OSL±σ (max) | Use Case | Baseline vLLM Args |
|---|---|---|---|
| chat_var | 512±128:512±128 (1024:1024) | Realistic chat traffic | --dtype=bfloat16 --no-enable-prefix-caching |
| code_var | 1024±256:1024±256 (2048:2048) | Realistic code generation | --dtype=bfloat16 --no-enable-prefix-caching |
Note: Baseline mode disables both prefix caching and radix cache for true baseline measurements. Production mode enables caching optimizations.
The recommended testing methodology for CPU inference performance evaluation follows a 3-phase approach:
Phase 1 - Baseline: Establish pure baseline performance without any caching optimizations.
- Configuration: vllm_caching_mode=baseline
- Workload: Fixed token counts (e.g., chat, rag, code)
- vLLM flags: --no-enable-prefix-caching
- Concurrency levels: [1, 2, 4, 8, 16, 32]
Phase 2 - Realistic: Measure performance under realistic traffic variability.
- Configuration: vllm_caching_mode=baseline
- Workload: Variable token counts (e.g., chat_var, code_var)
- vLLM flags: --no-enable-prefix-caching
- Concurrency levels: [1, 2, 4, 8, 16, 32]
Phase 3 - Production: Simulate true production conditions with realistic load and optimizations.
- Configuration: vllm_caching_mode=production
- Workload: Variable token counts (e.g., chat_var, code_var)
- vLLM flags: Default (caching enabled)
- Concurrency levels: [1, 2, 4, 8, 16, 32]
export HF_TOKEN=hf_xxxxx
# Full 3-phase concurrent load test with core sweep
ansible-playbook -i inventory/hosts.yml \
llm-benchmark-concurrent-load.yml \
-e "test_model=meta-llama/Llama-3.2-1B-Instruct" \
-e "base_workload=chat" \
-e "core_sweep_counts=[16,32,64]"

This runs:
- Phase 1: chat baseline (fixed tokens, no caching)
- Phase 2: chat_var realistic (variable tokens, no caching)
- Phase 3: chat_var production (variable tokens, with caching)
Note: You specify the base workload (e.g., base_workload=chat), and the playbook automatically:
- Uses chat for Phase 1 (fixed)
- Uses chat_var for Phase 2 (adds _var suffix)
- Uses chat_var for Phase 3 (adds _var suffix)
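The derivation is plain string suffixing; a minimal sketch (the phaseN_workload names are illustrative, not playbook variables):

```shell
base_workload=chat
phase1_workload="$base_workload"          # fixed-token workload
phase2_workload="${base_workload}_var"    # adds _var suffix
phase3_workload="${base_workload}_var"    # adds _var suffix
echo "$phase1_workload $phase2_workload $phase3_workload"   # prints: chat chat_var chat_var
```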
All phases use the same concurrency sweep: [1, 2, 4, 8, 16, 32]
# Phase 1 only (baseline)
ansible-playbook -i inventory/hosts.yml \
llm-benchmark-concurrent-load.yml \
-e "test_model=TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
-e "base_workload=chat" \
-e "requested_cores=16" \
-e "skip_phase_2=true" \
-e "skip_phase_3=true"
# Phase 3 only (production)
ansible-playbook -i inventory/hosts.yml \
llm-benchmark-concurrent-load.yml \
-e "test_model=meta-llama/Llama-3.2-1B-Instruct" \
-e "base_workload=rag" \
-e "core_sweep_counts=[16,32]" \
-e "skip_phase_1=true" \
-e "skip_phase_2=true"

cd automation/test-execution/ansible
# Configure CPU isolation, tuned profile, systemd pinning
ansible-playbook -i inventory/hosts.yml setup-platform.yml
# Reboot required
ansible -i inventory/hosts.yml all -b -m reboot
# Validate after reboot
ansible -i inventory/hosts.yml dut -b -a "tuned-adm active"

See Platform Setup Guide for detailed information on platform optimization.
Test with specific core configuration:
cd automation/test-execution/ansible
export HF_TOKEN=hf_xxxxx
# Manual core config
ansible-playbook -i inventory/hosts.yml \
llm-benchmark.yml \
-e "test_model=meta-llama/Llama-3.2-1B-Instruct" \
-e "workload_type=summarization" \
-e "core_config_name=32cores-single-socket"
# Or auto-configured cores (recommended)
ansible-playbook -i inventory/hosts.yml \
llm-benchmark-auto.yml \
-e "test_model=meta-llama/Llama-3.2-1B-Instruct" \
-e "workload_type=chat" \
-e "requested_cores=32"

cd automation/test-execution/ansible
ansible-playbook -i inventory/hosts.yml \
embedding-benchmark.yml \
-e "test_model=ibm-granite/granite-embedding-278m-multilingual" \
-e "scenario=baseline"

cd automation/test-execution/ansible
export HF_TOKEN=hf_xxxxx
for model in \
"meta-llama/Llama-3.2-1B-Instruct" \
"meta-llama/Llama-3.2-3B-Instruct"; do
ansible-playbook -i inventory/hosts.yml \
llm-benchmark-auto.yml \
-e "test_model=$model" \
-e "workload_type=chat" \
-e "requested_cores=16"
done

After running tests, results are automatically collected to your local machine:
# LLM results location
ls -la results/llm/
# JSON results for programmatic access
cat results/llm/meta-llama__Llama-3.2-1B-Instruct/chat-*/benchmarks.json
# CSV results for spreadsheet analysis
cat results/llm/meta-llama__Llama-3.2-1B-Instruct/chat-*/benchmarks.csv

Note: HTML output (benchmarks.html) is not currently enabled. See GuideLLM issue #627 for details.
Results include:
- benchmarks.json - Raw JSON data for analysis
- benchmarks.csv - CSV format for spreadsheet tools
- vllm-server.log - vLLM server logs (collected from DUT)
See Results Documentation for more details on result organization and analysis.
AMD provides optimized vLLM containers with ZenDNN (Zen Deep Neural Network) and ZenTorch acceleration for improved performance on AMD EPYC processors. These containers require a custom entrypoint configuration.
Why a custom entrypoint is needed:
The AMD ZenDNN containers package vLLM differently than the standard vLLM containers. Instead of having vLLM in the default PATH, you need to explicitly specify vllm serve as the entrypoint. Without this, the container will exit with code 127 ("command not found").
Configuration:
Set these environment variables before running your playbooks:
# AMD ZenDNN optimized vLLM container
export VLLM_CONTAINER_IMAGE=docker.io/amdih/zendnn_zentorch:vllm_v0.18.0_zentorch_v5.2.1_rhel9.5_r5.2.1
export VLLM_CONTAINER_ENTRYPOINT='["vllm", "serve"]'
# AMD ZenDNN containers have longer initialization time
# Increase health check timeout (default: 600s, AMD recommended: 900-1200s)
export VLLM_HEALTH_TIMEOUT=1200

Available AMD ZenDNN Container Versions:
- docker.io/amdih/zendnn_zentorch:vllm_v0.18.0_zentorch_v5.2.1_rhel9.5_r5.2.1 (RHEL 9.5)
- docker.io/amdih/zendnn_zentorch:vllm_v0.18.0_zentorch_v5.2.1_ubuntu22.04_r5.2.1 (Ubuntu 22.04)
Check AMD Infinity Hub for the latest versions.
Complete Example - AMD Platform Test:
# 1. Configure AMD container
export VLLM_CONTAINER_IMAGE=docker.io/amdih/zendnn_zentorch:vllm_v0.18.0_zentorch_v5.2.1_rhel9.5_r5.2.1
export VLLM_CONTAINER_ENTRYPOINT='["vllm", "serve"]'
export VLLM_HEALTH_TIMEOUT=1200 # AMD ZenDNN needs longer startup time
# 2. Set your infrastructure
export DUT_HOSTNAME=amd-epyc-server.example.com
export LOADGEN_HOSTNAME=loadgen-server.example.com
export HF_TOKEN=hf_xxxxx # If using gated models
# 3. Run benchmark
cd automation/test-execution/ansible
ansible-playbook llm-benchmark-auto.yml \
-e "test_model=meta-llama/Llama-3.2-1B-Instruct" \
-e "workload_type=chat" \
-e "requested_cores=64"

What gets optimized:
- ZenDNN: AMD's optimized deep learning library for EPYC processors
- ZenTorch: PyTorch integration with ZenDNN for transparent acceleration
- NUMA-aware execution: Leverages AMD EPYC multi-die architecture
- AVX-512 & AVX2: CPU instruction set optimizations
NUMA Socket Pinning with AMD ZenDNN:
AMD ZenDNN containers require special configuration for NUMA pinning. The automation automatically detects AMD ZenDNN containers and applies:
- CPU Pinning: Applied via cpuset_cpus (e.g., 96-127 for NUMA node 1)
- Thread Affinity: VLLM_CPU_OMP_THREADS_BIND env var set to match the cpuset (required for ZenDNN)
- Memory Binding: Skipped (ZenDNN's libnuma requires visibility to all NUMA nodes)
- Memory Affinity: Achieved naturally via CPU locality
You can use socket pinning parameters normally - the automation handles AMD-specific requirements:
-e "vllm_cpu_start=96" \
-e "vllm_numa_node=1" \
-e "guidellm_cpus=0-31" \
-e "guidellm_numa_node=0"

Why this configuration:
- AMD ZenDNN's internal thread affinity requires VLLM_CPU_OMP_THREADS_BIND to match the container's CPU set
- Without this env var, vLLM fails with sched_setaffinity errno 22
- ZenDNN's libnuma needs to see all NUMA nodes, so cpuset_mems is omitted
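In container-flag terms, the AMD-specific handling amounts to passing --cpuset-cpus plus the matching env var while omitting --cpuset-mems. The sketch below illustrates this; build_pin_flags is a hypothetical helper, not part of the roles:

```shell
# Sketch (assumption): for AMD ZenDNN containers, pin CPUs and set the
# matching VLLM_CPU_OMP_THREADS_BIND, but omit --cpuset-mems so ZenDNN's
# libnuma can still see every NUMA node.
build_pin_flags() {
  local cpus=$1
  echo "--cpuset-cpus ${cpus} -e VLLM_CPU_OMP_THREADS_BIND=${cpus}"
}

build_pin_flags 96-127
# prints: --cpuset-cpus 96-127 -e VLLM_CPU_OMP_THREADS_BIND=96-127
```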
Verifying ZenDNN is active:
Check the vLLM server logs for ZenDNN initialization messages:
# View logs from the DUT after starting a test
ssh ${DUT_HOSTNAME} "podman logs vllm-server 2>&1 | grep -i zendnn"

You should see messages indicating ZenDNN/ZenTorch are loaded.
Switching back to standard vLLM:
To use the standard vLLM container, simply unset the environment variables or set them to the default:
export VLLM_CONTAINER_IMAGE=docker.io/vllm/vllm-openai-cpu:v0.18.0
unset VLLM_CONTAINER_ENTRYPOINT

Technical Details:
The entrypoint configuration is set in inventory/group_vars/all/infrastructure.yml and read from the VLLM_CONTAINER_ENTRYPOINT environment variable. The framework automatically applies it when starting the container via the vllm_server role.
automation/test-execution/ansible/
├── ansible.cfg # Ansible configuration
├── inventory/
│ ├── hosts.yml # Main inventory - edit IPs here
│ ├── group_vars/ # Group variables
│ │ ├── all/ # Variables for all hosts
│ │ │ ├── benchmark-tools.yml # GuideLLM, vllm-bench config
│ │ │ ├── credentials.yml # HuggingFace token setup
│ │ │ ├── endpoints.yml # Network endpoints
│ │ │ ├── hardware-profiles.yml # Core configurations
│ │ │ ├── infrastructure.yml # Platform setup config
│ │ │ └── test-workloads.yml # Workload definitions
│ │ ├── dut/main.yml # DUT-specific vars
│ │ └── load_generator/main.yml # Load gen-specific vars
│ └── README.md # Inventory documentation
│
├── roles/ # Reusable roles
│ ├── vllm_server/ # vLLM server management
│ │ ├── defaults/main.yml # Default variables
│ │ └── tasks/ # Tasks
│ │ ├── main.yml
│ │ ├── start-llm.yml
│ │ ├── start-embedding.yml
│ │ └── clean-restart.yml
│ ├── hf_token/ # HuggingFace token setup
│ │ └── tasks/
│ │ ├── main.yml
│ │ └── setup-optional.yml
│ ├── benchmark_guidellm/ # GuideLLM benchmarks
│ │ ├── defaults/main.yml
│ │ └── tasks/main.yml
│ ├── benchmark_embedding/ # Embedding benchmarks
│ │ └── tasks/
│ │ ├── main.yml
│ │ ├── baseline.yml
│ │ └── latency.yml
│ ├── benchmark_vllm_bench/ # vllm-bench base
│ │ └── tasks/main.yml
│ ├── results_collector/ # Log/result collection
│ │ ├── defaults/main.yml # Default variables
│ │ ├── README.md # Role documentation
│ │ └── tasks/
│ │ ├── main.yml
│ │ ├── collect-vllm-logs.yml
│ │ └── collect-test-results.yml
│ └── common/ # Common role
│ └── tasks/ # Shared task definitions
│ ├── allocate-cores-from-count.yml
│ ├── detect-numa-topology.yml
│ └── setup-vllm-api-key.yml
│
├── llm-benchmark.yml # LLM playbook (manual config)
├── llm-benchmark-auto.yml # LLM playbook (auto-config cores)
├── llm-core-sweep.yml # LLM sweep (manual configs)
├── llm-core-sweep-auto.yml # LLM sweep (auto-config cores)
├── embedding-benchmark.yml # Embedding playbook
├── embedding-core-sweep.yml # Embedding sweep
├── setup-platform.yml # Platform setup
├── health-check.yml # Health check
├── start-vllm-server.yml # vLLM starter
│
├── filter_plugins/ # Custom Jinja2 filters
│ ├── cpu_utils.py # CPU topology filters
│ └── test_cpu_utils.py # Unit tests
│
├── tests/ # Ansible tests
│
└── ansible.md # This file
Manages vLLM server lifecycle:
- Starts vLLM for LLM or embedding workloads
- Handles HuggingFace token setup
- Clean container restarts
- CPU/NUMA pinning
HuggingFace authentication:
- Multiple token sources (env, file, vault, prompt)
- Optional setup (allows public models)
GuideLLM benchmark execution:
- Container or host execution
- Configurable workloads
- Result collection
Embedding model benchmarks:
- Baseline throughput tests
- Latency/concurrency tests
- Uses vllm-bench
Log and result collection:
- Collects vLLM logs from DUT
- Fetches benchmark results from load generator
- Organizes by test run ID
Edit inventory/group_vars/all/test-workloads.yml:
test_configs:
  my_custom_workload:
    workload_type: "summarization"
    isl: 2048
    osl: 512
    backend: "openai-completions"
    vllm_args:
      - "--dtype=bfloat16"
      - "--no-enable-prefix-caching"
      - "--my-custom-flag"
    kv_cache_space: "50GiB"

Edit inventory/group_vars/all/hardware-profiles.yml:
core_configs:
  - name: "my-24cores"
    cores: 24
    cpuset_cpus: "0-23"
    cpuset_mems: "0"
    tensor_parallel: 1

Default Configuration (CPU Testing):
- Profile: concurrent - Fixed concurrency level testing
- Concurrency Rates: [1, 2, 4, 8, 16, 32] - CPU-appropriate levels
- Test Duration: 600 seconds (10 minutes per concurrency level)
- Request Timeout: 600 seconds (matches test duration)
- Max Concurrency: 128 (CPU testing limit)
You can customize GuideLLM benchmark parameters in two ways:
Option 1: Simple flat variables (recommended for quick tests)
ansible-playbook -i inventory/hosts.yml llm-benchmark-auto.yml \
-e "test_model=Qwen/Qwen2.5-3B-Instruct" \
-e "workload_type=chat" \
-e "requested_cores=16" \
-e "guidellm_max_seconds=120" \
-e "guidellm_rate=[1,8,16]"

Option 2: Dictionary syntax (for multiple parameters)
ansible-playbook -i inventory/hosts.yml llm-benchmark-auto.yml \
-e "test_model=Qwen/Qwen2.5-3B-Instruct" \
-e "workload_type=chat" \
-e "requested_cores=16" \
-e '{"benchmark_tool": {"guidellm": {"max_seconds": 120, "profile": "throughput", "rate": [32]}}}'

Available flat variables:
- guidellm_profile - Benchmark profile (concurrent, throughput, sweep, synchronous)
- guidellm_rate - Concurrency levels for the concurrent profile (e.g., [1,8,16,32])
- guidellm_max_seconds - Maximum test duration (default: 600)
- guidellm_max_requests - Maximum requests to send (default: 1000)
- guidellm_request_timeout - Request timeout (default: 600)
- guidellm_max_concurrency - Maximum concurrent requests (default: 128)
- guidellm_warmup - Warmup percentage (default: 0.1 = 10%)
- guidellm_cooldown - Cooldown between tests in seconds (default: 30)
- guidellm_outputs - Output formats (default: "html,json,csv")
- guidellm_container_image - GuideLLM container image
- guidellm_use_container - Use container mode (default: true)
- guidellm_cpuset_cpus - CPU allocation for GuideLLM (default: "16-31")
- guidellm_cpuset_mems - NUMA node allocation (default: "0")
Defaults are defined in inventory/group_vars/all/benchmark-tools.yml.
Assign custom names to benchmark tests for easier identification and filtering in Streamlit dashboards.
Usage:
# Single test with custom name
ansible-playbook -i inventory/hosts.yml llm-benchmark-auto.yml \
-e "test_model=meta-llama/Llama-3.2-1B-Instruct" \
-e "workload_type=chat" \
-e "requested_cores=16" \
-e "test_name=baseline-v1"

Result: Test run ID becomes 20260423-143022-baseline-v1 (timestamp + name)
Core sweep with custom name:
ansible-playbook -i inventory/hosts.yml llm-core-sweep-auto.yml \
-e "test_model=meta-llama/Llama-3.2-1B-Instruct" \
-e "workload_type=chat" \
-e "requested_cores_list=[8,16,32,64]" \
-e "test_name=perf-optimization-test"

All tests in the sweep share the custom name.
Dashboard filtering:
When tests have custom names, Streamlit dashboards automatically display a "Custom Test Names" filter, allowing you to:
- Filter results by test name
- Compare specific test runs
- Track iterations (e.g., baseline-v1, baseline-v2, optimized-v1)
Best practices:
- Keep names under 30 characters
- Use alphanumeric characters and hyphens only
- Examples: baseline-v1, cache-enabled, production-candidate, bug-123-repro
- Avoid spaces (use hyphens instead)
Backward compatibility: Tests without test_name work as before (timestamp-only IDs).
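The ID scheme (timestamp, plus -<test_name> when provided) can be sketched in shell:

```shell
# With a custom name: <timestamp>-<test_name>
test_name=baseline-v1
run_id="$(date +%Y%m%d-%H%M%S)${test_name:+-$test_name}"
echo "$run_id"   # e.g. 20260423-143022-baseline-v1

# Without test_name: timestamp-only ID (backward compatible)
test_name=""
echo "$(date +%Y%m%d-%H%M%S)${test_name:+-$test_name}"
```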
Socket pinning allows you to isolate vLLM server and GuideLLM load generator to different CPU sockets/NUMA nodes, minimizing performance interference.
Supported Playbooks:
- ✅ llm-benchmark-auto.yml
- ✅ llm-benchmark-concurrent-load.yml
Parameters:
| Parameter | Description | Example |
|---|---|---|
| vllm_cpu_start | CPU ID to start vLLM allocation from | 64 (for socket 1) |
| vllm_numa_node | NUMA node for vLLM (required for socket pinning) | 1 |
| guidellm_cpus | CPU range for load generator | "0-31" |
| guidellm_numa_node | NUMA node for load generator | 0 |
Determine Your System's Socket Layout:
# Show NUMA topology
lscpu | grep NUMA
# Show cores per NUMA node
numactl --hardware
# Example output:
# node 0 cpus: 0 1 2 ... 31
# node 1 cpus: 64 65 66 ... 95

Example: vLLM on Socket 1, GuideLLM on Socket 0
Assuming a 2-socket system:
- Socket 0: Cores 0-31 (NUMA node 0)
- Socket 1: Cores 64-95 (NUMA node 1)
# Single test with socket separation
ansible-playbook llm-benchmark-auto.yml \
-e "test_model=meta-llama/Llama-3.2-1B-Instruct" \
-e "workload_type=chat" \
-e "requested_cores=32" \
-e "vllm_cpu_start=64" \
-e "vllm_numa_node=1" \
-e "guidellm_cpus=0-31" \
-e "guidellm_numa_node=0"

# Concurrent load test with socket separation
ansible-playbook llm-benchmark-concurrent-load.yml \
-e "test_model=meta-llama/Llama-3.2-1B-Instruct" \
-e "base_workload=chat" \
-e "requested_cores=32" \
-e "vllm_cpu_start=64" \
-e "vllm_numa_node=1" \
-e "guidellm_cpus=0-31" \
-e "guidellm_numa_node=0"

Validation:
# Verify vLLM container pinning
podman ps | grep vllm
podman inspect <container-id> | grep -A 2 cpuset
# Expected: "CpusetCpus": "64-95", "CpusetMems": "1"
# Verify GuideLLM container pinning
podman ps | grep guidellm
podman inspect <container-id> | grep -A 2 cpuset
# Expected: "CpusetCpus": "0-31", "CpusetMems": "0"

Use Cases:
- Minimize Interference: Eliminate CPU contention between server and load generator
- Test Cross-NUMA Performance: Measure impact of cross-socket memory access
- Multi-Tenant Systems: Isolate benchmarks from other workloads
Notes:
- Socket pinning requires vllm_numa_node to be set
- When pinning to a single NUMA node, tensor_parallel must equal 1
- The automation validates that the requested cores fit within the specified socket
# Test SSH manually
ssh -i ~/.ssh/your-key.pem ec2-user@<DUT_IP>
# Check key permissions
chmod 600 ~/.ssh/your-key.pem
# Verbose Ansible output
ansible -i inventory/hosts.yml all -m ping -vvv

# Verify token is set
echo $HF_TOKEN
# Test token works
huggingface-cli whoami

# Check vLLM logs on DUT
ssh <DUT_IP> "podman logs vllm-server"
# Common issues:
# - Out of memory: Reduce kv_cache_space in inventory
# - Model not found: Check HF_TOKEN
# - Port in use: Check no other vLLM running

If a benchmark times out waiting for completion:
# Default timeout = min(max_seconds + 600, 14400)
# - Short tests: max_seconds + 10min buffer
# - Long tests: capped at 4 hours
# Override timeout for very long tests:
ansible-playbook ... -e "guidellm_wait_timeout_seconds=7200" # 2 hours
# Monitor container in real-time:
ssh <DUT_IP> "sudo podman logs -f <container-name>"
# Check if container is stuck:
ssh <DUT_IP> "sudo podman ps -a"
ssh <DUT_IP> "sudo podman inspect <container-name> --format '{{.State.Status}}'"

Timeout Examples:
- max_seconds: 120 → timeout: 720s (12 min)
- max_seconds: 1800 → timeout: 2400s (40 min)
- max_seconds: 10800 → timeout: 14400s (4 hours, capped)
- Role-based architecture - Modular, reusable components
- Group variables - Environment-specific configuration
- Custom Jinja2 filters - Native Python (no shell/awk scripts)
- Clean restarts - Fresh vLLM state between tests
- Secure HF tokens - Multiple methods (env, file, vault, prompt)
- Container isolation - Podman/Docker with CPU/NUMA pinning
- Platform tuning - CPU isolation, tuned profiles, systemd pinning
- Comprehensive testing - Unit tests for all filters
- Single inventory - Just edit IPs and run
- Inventory Documentation - Detailed inventory guide
- Results Collector Role - Log collection documentation
- vLLM Documentation
- GuideLLM Documentation
- Ansible Best Practices