A benchmarking framework for LLM inference engines (vLLM, SGLang, NVIDIA NIM) that measures sustainable throughput under continuous full load.
Traditional benchmarks often use "open-loop" testing: send N requests, wait for completion, measure time. This approach has significant drawbacks:
- Unrealistic workload: Real production systems face continuous load, not bursts
- Cold start effects: First requests may be slower due to JIT compilation, memory allocation
- Queue dynamics ignored: Doesn't capture how the system behaves when the request queue is constantly full
This framework uses a closed-loop methodology that maintains a constant number of in-flight requests:
```
┌─────────────────────────────────────────────────────────────────┐
│                     CLOSED-LOOP BENCHMARK                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   Client                              Inference Server           │
│   ┌─────────┐                         ┌─────────────┐            │
│   │ Slot 1  │ ──── Request 1 ──────▶ │             │            │
│   │ Slot 2  │ ──── Request 2 ──────▶ │   vLLM /    │            │
│   │ Slot 3  │ ──── Request 3 ──────▶ │   SGLang /  │            │
│   │   ...   │         ...            │   NIM       │            │
│   │ Slot N  │ ──── Request N ──────▶ │             │            │
│   └─────────┘                         └─────────────┘            │
│        │                                     │                   │
│        │◀────── Response ──────────────────┘                     │
│        │                                                         │
│        └──── Immediately send next request ────▶                 │
│                                                                  │
│   Target: Always maintain N concurrent requests ("in-flight")    │
└─────────────────────────────────────────────────────────────────┘
```
Key principle: When a request completes, a new one is immediately dispatched. This keeps the server constantly saturated at exactly N concurrent requests.
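The dispatch loop can be sketched with `asyncio`. This is a minimal illustration, not the framework's actual implementation: `send_request` stands in for the real HTTP call, and each "slot" is a coroutine that re-dispatches as soon as its previous request completes.

```python
import asyncio

async def slot_worker(slot_id, stop, send_request):
    # Each slot keeps exactly one request in flight: as soon as a
    # response arrives, the next request is dispatched immediately.
    while not stop.is_set():
        await send_request(slot_id)

async def closed_loop(n_slots, send_request, duration_s):
    # Run n_slots concurrent workers for duration_s seconds, then
    # stop dispatching and let in-flight requests finish (drain).
    stop = asyncio.Event()
    workers = [asyncio.create_task(slot_worker(i, stop, send_request))
               for i in range(n_slots)]
    await asyncio.sleep(duration_s)
    stop.set()
    await asyncio.gather(*workers)
```

Because a new request is issued only when one completes, the server sees exactly `n_slots` concurrent requests for the whole run.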
```
Time ──────────────────────────────────────────────────────────────▶

┌──────────┐ ┌────────────────┐ ┌────────────────────┐ ┌────────┐
│   FILL   │ │   STABILIZE    │ │      MEASURE       │ │ DRAIN  │
│   ~30s   │ │     ~60s       │ │       ~120s        │ │  ~30s  │
└──────────┘ └────────────────┘ └────────────────────┘ └────────┘
     │              │                    │                  │
 Gradually     Let system          Collect all        Let pending
 ramp up to    reach steady        metrics here       requests
 N requests    state (JIT,         (this is the       complete
 (no burst!)   caches warm)        actual data)
```
- Fill Phase: Gradually add requests (1 per second) to avoid burst effects
- Stabilize Phase: System reaches steady state - JIT compilation completes, caches warm up
- Measure Phase: All metrics are collected here - this is the actual benchmark data
- Drain Phase: Let remaining requests complete gracefully
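The four phases can be laid out on a single timeline. The sketch below is illustrative only; it assumes the fill ramp of one request per second described above, and the function name `phase_at` is hypothetical:

```python
def phase_at(t, n_slots, stabilize_s=60.0, measure_s=120.0):
    # Return which benchmark phase a wall-clock offset t (seconds
    # since run start) falls into. Fill adds one slot per second,
    # so the fill phase lasts roughly n_slots seconds.
    fill_s = n_slots * 1.0
    if t < fill_s:
        return "fill"
    if t < fill_s + stabilize_s:
        return "stabilize"
    if t < fill_s + stabilize_s + measure_s:
        return "measure"
    return "drain"
```

Only requests whose timestamps fall in the `"measure"` window contribute to the reported metrics.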
| Aspect | Open-Loop | Closed-Loop (This Framework) |
|---|---|---|
| Server utilization | Variable, often underutilized | Constant, always at target load |
| Queue behavior | Empty → burst → empty | Constant queue depth |
| Measurement validity | Includes warm-up artifacts | Pure steady-state measurement |
| Real-world relevance | Synthetic | Matches production patterns |
| Throughput measurement | Peak burst capacity | Sustainable throughput |
What it measures: Total tokens generated per second across all requests.
How it's calculated:
```python
server_throughput = user_throughput_avg * batch_size
```

This is derived from per-request throughput rather than direct token counting, to avoid boundary issues with requests that span the measurement phase.
This represents the aggregate capacity of the inference server - how many tokens it can produce per second when fully loaded.
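As a concrete sketch of that calculation (the `(output_tokens, duration_s)` pair shape is illustrative, not the framework's actual data structure):

```python
from statistics import mean

def server_throughput(requests, batch_size):
    # requests: (output_tokens, duration_s) pairs for requests that
    # completed during the measure phase.
    per_user = [tokens / duration for tokens, duration in requests]
    # Aggregate capacity = average per-user rate times concurrency.
    return mean(per_user) * batch_size
```

For example, two requests of 100 and 120 tokens, each taking 2 s, average 55 tok/s per user; at a batch size of 32 that implies roughly 1,760 tok/s of server throughput.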
What it measures: Tokens per second from a single user's perspective.
How it's calculated:
```python
# For each completed request:
user_throughput = output_tokens / request_duration  # includes TTFT!

# Reported as mean ± std across all requests in the measure phase
```

Note: `request_duration` is the total time from request sent to last token received, which includes TTFT. This means user throughput reflects the complete user experience, not just the generation phase.
This answers: "How fast does generation feel to an individual user?"
At high concurrency, user throughput decreases because server capacity is shared and TTFT increases.
What it measures: Latency from request start until the first output token arrives.
How it's calculated:
```python
ttft = timestamp_first_token - timestamp_request_sent
```

TTFT includes:
- Network latency (request to server)
- Queue wait time (if server is busy)
- Prefill time (processing the entire input prompt)
- Network latency (first token back)
Why it matters: Users perceive TTFT as "thinking time" before the model starts responding. Long TTFT feels sluggish even if generation is fast afterward.
What it measures: Average time between consecutive output tokens during generation.
How it's calculated:
```python
# Native calculation (most accurate):
tpot = (timestamp_last_token - timestamp_first_token) / (output_tokens - 1)
```

Why it matters: TPOT determines the "streaming speed" - how fast text appears during generation. Lower TPOT = smoother, faster-feeling output.
What it measures: The actual time gap between each consecutive token pair.
```python
itl_values = [t2 - t1, t3 - t2, t4 - t3, ...]  # All inter-token gaps
```

While TPOT is an average, ITL captures the full distribution of token timings.
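Given the arrival timestamps of streamed tokens, the gaps can be computed as follows (a minimal sketch; the function name is hypothetical):

```python
def inter_token_latencies(token_timestamps):
    # token_timestamps: arrival time (seconds) of each streamed token,
    # in order. Returns the gap between every consecutive pair.
    return [t2 - t1 for t1, t2 in zip(token_timestamps, token_timestamps[1:])]
```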
What it measures: Stability of token generation - how consistent is the timing?
How it's calculated:
```python
itl_sorted = sorted(all_itl_values)
p50 = itl_sorted[len(itl_sorted) // 2]         # Median
p99 = itl_sorted[int(len(itl_sorted) * 0.99)]  # 99th percentile
jitter_ratio = p99 / p50
```

Interpretation:
- Jitter ≈ 1.0: Very stable, consistent token timing (ideal)
- Jitter = 2.0: P99 latency is 2x the median (some stuttering)
- Jitter > 3.0: Significant variance, noticeable pauses during generation
Why it matters: High jitter means users experience "stuttering" - the text flows smoothly, then pauses, then flows again. Even with good average TPOT, high jitter feels bad.
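The percentile logic above can be wrapped into a small self-contained function (with a bounds guard added so the p99 index can never run past the end of the list):

```python
def jitter_ratio(itl_values):
    # Ratio of 99th-percentile to median inter-token latency.
    itl_sorted = sorted(itl_values)
    n = len(itl_sorted)
    p50 = itl_sorted[n // 2]
    p99 = itl_sorted[min(int(n * 0.99), n - 1)]
    return p99 / p50
```

For example, 99 gaps of 20 ms plus one stall of 80 ms yields a jitter ratio of 4.0: the median is smooth, but the tail shows a noticeable pause.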
```bash
# 1. Install dependencies
pip install -r requirements.txt

# 2. Edit benchmark_config.json with your endpoints

# 3. Run benchmark
python run_full_benchmark.py

# Or dry-run first to see what will happen
python run_full_benchmark.py --dry-run
```

```json
{
  "endpoints": [
    {"name": "vllm", "url": "http://localhost:8000/v1", "engine": "vllm"}
  ],
  "models": [
    {"id": "llama-70b", "name": "meta-llama/Llama-3.1-70B-Instruct"}
  ],
  "scenarios": ["A1", "A2", "B1", "B2", "C", "D"],
  "batch_sizes": [1, 2, 8, 16, 32, 64, 128, 256],
  "measure_seconds": 120,
  "stabilize_seconds": 60,
  "max_tokens": 512,
  "output_dir": "results/my_benchmark"
}
```

The framework includes 6 scenarios designed to test different workload patterns:
| ID | Name | Input Tokens | Caching Potential | Purpose |
|---|---|---|---|---|
| A1 | Same Short | ~50 | Maximum (identical prompts) | Best-case prefix caching |
| A2 | Same Long | ~2,500 | Maximum + long prefill | Prefix caching with large context |
| B1 | Diverse Short | ~50 | None (all different) | Raw throughput, no caching |
| B2 | Diverse Long | ~2,500 | None + long prefill | Stress test: diverse + large |
| C | RAG/Agent | ~2,000 system + short query | System prompt caching | Typical RAG workload |
| D | Multi-Turn Dialog | Growing history | History caching | Chat with context buildup |
```
├── run_full_benchmark.py          # Quick start entry point
├── benchmark_config.json          # Default configuration
├── requirements.txt
│
├── src/
│   ├── benchmark_core/            # Core library
│   │   ├── closed_loop.py         # Closed-loop benchmark logic
│   │   ├── client.py              # Async OpenAI-compatible client
│   │   └── metrics.py             # Metric calculations
│   └── scenarios/                 # Benchmark scenarios & prompts
│
├── scripts/
│   ├── run_benchmark.py           # Full benchmark runner
│   ├── prepare_b2_dataset.py      # Download B2 dataset (LongBench)
│   └── prepare_dialog_dataset.py  # Download dialog dataset (oasst2)
│
├── analysis/
│   ├── generate_ix_figures.py     # Generate visualizations
│   └── validate_results.py        # Validate benchmark results
│
├── results/                       # Benchmark results (JSON)
└── figures/                       # Generated plots
```
After running benchmarks:
```bash
python analysis/generate_ix_figures.py --results-dir results/my_benchmark
```

This generates:
- Throughput vs. concurrency plots
- TTFT and TPOT latency curves
- Jitter analysis
- Workload difficulty heatmaps
- Engine comparison charts
For scenarios B2 (diverse long) and D (multi-turn dialog), you can download real datasets:
```bash
# B2: LongBench-v2 dataset (diverse long documents)
pip install datasets
python scripts/prepare_b2_dataset.py

# D: OpenAssistant dialogs (real multi-turn conversations)
python scripts/prepare_dialog_dataset.py
```

Areas identified for future development:
The JSON result files currently lack timing metadata. Future versions should include:
- Start timestamp: When the benchmark configuration started
- End timestamp: When it completed
- Total duration: Wall-clock time for the entire run (fill + stabilize + measure + drain)
This would enable analysis of total benchmark runtime and help identify slow configurations.
Currently, only aggregated metrics are stored. For deeper post-hoc analysis, future versions should track:
- Full request/response pairs: Store prompts and generated responses for quality analysis
- Per-request timing data: Individual TTFT, TPOT, ITL values (not just aggregates)
- Token-level events: Complete timeline of token arrivals for detailed latency profiling
This would enable retrospective analysis without re-running benchmarks.
The current user throughput metric includes TTFT in the calculation:
```python
# Current implementation:
user_throughput = output_tokens / request_duration  # includes TTFT
```

A more accurate measure of "streaming experience" would exclude TTFT:
```python
# Improved implementation:
generation_throughput = output_tokens / (request_duration - ttft)
```

This `generation_throughput` better reflects how fast tokens appear after generation starts, which is what users perceive as "streaming speed".
MIT