A benchmarking framework for LLM inference engines (vLLM, SGLang, NVIDIA NIM) that measures sustainable throughput under continuous full load.
Traditional benchmarks often use "open-loop" testing: send N requests, wait for completion, measure time. This approach has significant drawbacks:
- Unrealistic workload: Real production systems face continuous load, not bursts
- Cold start effects: First requests may be slower due to JIT compilation, memory allocation
- Queue dynamics ignored: Doesn't capture how the system behaves when the request queue is constantly full
This framework uses a closed-loop methodology that maintains a constant number of in-flight requests:
```
┌─────────────────────────────────────────────────────────────────┐
│                     CLOSED-LOOP BENCHMARK                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│   Client                              Inference Server           │
│   ┌─────────┐                         ┌─────────────┐            │
│   │ Slot 1  │ ──── Request 1 ──────▶ │             │            │
│   │ Slot 2  │ ──── Request 2 ──────▶ │   vLLM /    │            │
│   │ Slot 3  │ ──── Request 3 ──────▶ │   SGLang /  │            │
│   │   ...   │         ...            │   NIM       │            │
│   │ Slot N  │ ──── Request N ──────▶ │             │            │
│   └─────────┘                         └─────────────┘            │
│        │                                     │                   │
│        │◀────── Response ──────────────────┘                     │
│        │                                                         │
│        └──── Immediately send next request ────▶                 │
│                                                                  │
│   Target: Always maintain N concurrent requests ("in-flight")    │
└─────────────────────────────────────────────────────────────────┘
```
Key principle: When a request completes, a new one is immediately dispatched. This keeps the server constantly saturated at exactly N concurrent requests.
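The dispatch loop can be sketched with `asyncio`. This is a minimal illustration, not the framework's actual implementation: `send_request` stands in for the real HTTP call, and each "slot" is a coroutine that re-dispatches as soon as its previous request completes.

```python
import asyncio

async def slot_worker(slot_id, stop, send_request):
    # Each slot keeps exactly one request in flight: as soon as a
    # response arrives, the next request is dispatched immediately.
    while not stop.is_set():
        await send_request(slot_id)

async def closed_loop(n_slots, send_request, duration_s):
    # Run n_slots concurrent workers for duration_s seconds, then
    # stop dispatching and let in-flight requests finish (drain).
    stop = asyncio.Event()
    workers = [asyncio.create_task(slot_worker(i, stop, send_request))
               for i in range(n_slots)]
    await asyncio.sleep(duration_s)
    stop.set()
    await asyncio.gather(*workers)
```

Because a new request is issued only when one completes, the server sees exactly `n_slots` concurrent requests for the whole run.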
```
Time ──────────────────────────────────────────────────────────────▶

┌──────────┐ ┌────────────────┐ ┌────────────────────┐ ┌────────┐
│   FILL   │ │   STABILIZE    │ │      MEASURE       │ │ DRAIN  │
│   ~30s   │ │     ~60s       │ │       ~120s        │ │  ~30s  │
└──────────┘ └────────────────┘ └────────────────────┘ └────────┘
     │              │                    │                  │
 Gradually     Let system          Collect all        Let pending
 ramp up to    reach steady        metrics here       requests
 N requests    state (JIT,         (this is the       complete
 (no burst!)   caches warm)        actual data)
```
- Fill Phase: Gradually add requests (1 per second) to avoid burst effects
- Stabilize Phase: System reaches steady state - JIT compilation completes, caches warm up
- Measure Phase: All metrics are collected here - this is the actual benchmark data
- Drain Phase: Let remaining requests complete gracefully
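The four phases can be laid out on a single timeline. The sketch below is illustrative only; it assumes the fill ramp of one request per second described above, and the function name `phase_at` is hypothetical:

```python
def phase_at(t, n_slots, stabilize_s=60.0, measure_s=120.0):
    # Return which benchmark phase a wall-clock offset t (seconds
    # since run start) falls into. Fill adds one slot per second,
    # so the fill phase lasts roughly n_slots seconds.
    fill_s = n_slots * 1.0
    if t < fill_s:
        return "fill"
    if t < fill_s + stabilize_s:
        return "stabilize"
    if t < fill_s + stabilize_s + measure_s:
        return "measure"
    return "drain"
```

Only requests whose timestamps fall in the `"measure"` window contribute to the reported metrics.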
| Aspect | Open-Loop | Closed-Loop (This Framework) |
|---|---|---|
| Server utilization | Variable, often underutilized | Constant, always at target load |
| Queue behavior | Empty → burst → empty | Constant queue depth |
| Measurement validity | Includes warm-up artifacts | Pure steady-state measurement |
| Real-world relevance | Synthetic | Matches production patterns |
| Throughput measurement | Peak burst capacity | Sustainable throughput |
What it measures: Total tokens generated per second across all requests.
How it's calculated:
```python
server_throughput = user_throughput_avg * batch_size
```

This is derived from per-request throughput rather than direct token counting, to avoid boundary issues with requests that span the measurement phase.
This represents the aggregate capacity of the inference server - how many tokens it can produce per second when fully loaded.
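As a concrete sketch of that calculation (the `(output_tokens, duration_s)` pair shape is illustrative, not the framework's actual data structure):

```python
from statistics import mean

def server_throughput(requests, batch_size):
    # requests: (output_tokens, duration_s) pairs for requests that
    # completed during the measure phase.
    per_user = [tokens / duration for tokens, duration in requests]
    # Aggregate capacity = average per-user rate times concurrency.
    return mean(per_user) * batch_size
```

For example, two requests of 100 and 120 tokens, each taking 2 s, average 55 tok/s per user; at a batch size of 32 that implies roughly 1,760 tok/s of server throughput.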
What it measures: Tokens per second from a single user's perspective.
How it's calculated:
```python
# For each completed request:
user_throughput = output_tokens / request_duration  # includes TTFT!

# Reported as mean ± std across all requests in the measure phase
```

Note: `request_duration` is the total time from request sent to last token received, which includes TTFT. This means user throughput reflects the complete user experience, not just the generation phase.
This answers: "How fast does generation feel to an individual user?"
At high concurrency, user throughput decreases because server capacity is shared and TTFT increases.
What it measures: Latency from request start until the first output token arrives.
How it's calculated:
```python
ttft = timestamp_first_token - timestamp_request_sent
```

TTFT includes:
- Network latency (request to server)
- Queue wait time (if server is busy)
- Prefill time (processing the entire input prompt)
- Network latency (first token back)
Why it matters: Users perceive TTFT as "thinking time" before the model starts responding. Long TTFT feels sluggish even if generation is fast afterward.
What it measures: Average time between consecutive output tokens during generation.
How it's calculated:
```python
# Native calculation (most accurate):
tpot = (timestamp_last_token - timestamp_first_token) / (output_tokens - 1)
```

Why it matters: TPOT determines the "streaming speed" - how fast text appears during generation. Lower TPOT = smoother, faster-feeling output.
What it measures: The actual time gap between each consecutive token pair.
```python
itl_values = [t2 - t1, t3 - t2, t4 - t3, ...]  # All inter-token gaps
```

While TPOT is an average, ITL captures the full distribution of token timings.
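Given the arrival timestamps of streamed tokens, the gaps can be computed as follows (a minimal sketch; the function name is hypothetical):

```python
def inter_token_latencies(token_timestamps):
    # token_timestamps: arrival time (seconds) of each streamed token,
    # in order. Returns the gap between every consecutive pair.
    return [t2 - t1 for t1, t2 in zip(token_timestamps, token_timestamps[1:])]
```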
What it measures: Stability of token generation - how consistent is the timing?
How it's calculated:
```python
itl_sorted = sorted(all_itl_values)
p50 = itl_sorted[len(itl_sorted) // 2]         # Median
p99 = itl_sorted[int(len(itl_sorted) * 0.99)]  # 99th percentile
jitter_ratio = p99 / p50
```

Interpretation:
- Jitter ≈ 1.0: Very stable, consistent token timing (ideal)
- Jitter = 2.0: P99 latency is 2x the median (some stuttering)
- Jitter > 3.0: Significant variance, noticeable pauses during generation
Why it matters: High jitter means users experience "stuttering" - the text flows smoothly, then pauses, then flows again. Even with good average TPOT, high jitter feels bad.
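The percentile logic above can be wrapped into a small self-contained function (with a bounds guard added so the p99 index can never run past the end of the list):

```python
def jitter_ratio(itl_values):
    # Ratio of 99th-percentile to median inter-token latency.
    itl_sorted = sorted(itl_values)
    n = len(itl_sorted)
    p50 = itl_sorted[n // 2]
    p99 = itl_sorted[min(int(n * 0.99), n - 1)]
    return p99 / p50
```

For example, 99 gaps of 20 ms plus one stall of 80 ms yields a jitter ratio of 4.0: the median is smooth, but the tail shows a noticeable pause.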
```bash
# 1. Install dependencies
pip install -r requirements.txt

# 2. Edit benchmark_config.json with your endpoints

# 3. Run benchmark
python run_full_benchmark.py

# Or dry-run first to see what will happen
python run_full_benchmark.py --dry-run
```

```json
{
  "endpoints": [
    {"name": "vllm", "url": "http://localhost:8000/v1", "engine": "vllm"}
  ],
  "models": [
    {"id": "llama-70b", "name": "meta-llama/Llama-3.1-70B-Instruct"}
  ],
  "scenarios": ["A1", "A2", "B1", "B2", "C", "D"],
  "batch_sizes": [1, 2, 8, 16, 32, 64, 128, 256],
  "measure_seconds": 120,
  "stabilize_seconds": 60,
  "max_tokens": 512,
  "output_dir": "results/my_benchmark"
}
```

The framework includes 6 scenarios designed to test different workload patterns:
| ID | Name | Input Tokens | Caching Potential | Purpose |
|---|---|---|---|---|
| A1 | Same Short | ~50 | Maximum (identical prompts) | Best-case prefix caching |
| A2 | Same Long | ~2,500 | Maximum + long prefill | Prefix caching with large context |
| B1 | Diverse Short | ~50 | None (all different) | Raw throughput, no caching |
| B2 | Diverse Long | ~2,500 | None + long prefill | Stress test: diverse + large |
| C | RAG/Agent | ~2,000 system + short query | System prompt caching | Typical RAG workload |
| D | Multi-Turn Dialog | Growing history | History caching | Chat with context buildup |
```
├── run_full_benchmark.py          # Quick start entry point
├── benchmark_config.json          # Default configuration
├── requirements.txt
│
├── src/
│   ├── benchmark_core/            # Core library
│   │   ├── closed_loop.py         # Closed-loop benchmark logic
│   │   ├── client.py              # Async OpenAI-compatible client
│   │   └── metrics.py             # Metric calculations
│   └── scenarios/                 # Benchmark scenarios & prompts
│
├── scripts/
│   ├── run_benchmark.py           # Full benchmark runner
│   ├── prepare_b2_dataset.py      # Download B2 dataset (LongBench)
│   └── prepare_dialog_dataset.py  # Download dialog dataset (oasst2)
│
├── analysis/
│   ├── generate_ix_figures.py     # Generate visualizations
│   └── validate_results.py        # Validate benchmark results
│
├── results/                       # Benchmark results (JSON)
└── figures/                       # Generated plots
```
After running benchmarks:
```bash
python analysis/generate_ix_figures.py --results-dir results/my_benchmark
```

This generates:
- Throughput vs. concurrency plots
- TTFT and TPOT latency curves
- Jitter analysis
- Workload difficulty heatmaps
- Engine comparison charts
For scenarios B2 (diverse long) and D (multi-turn dialog), you can download real datasets:
```bash
# B2: LongBench-v2 dataset (diverse long documents)
pip install datasets
python scripts/prepare_b2_dataset.py

# D: OpenAssistant dialogs (real multi-turn conversations)
python scripts/prepare_dialog_dataset.py
```

Areas identified for future development:
The JSON result files currently lack timing metadata. Future versions should include:
- Start timestamp: When the benchmark configuration started
- End timestamp: When it completed
- Total duration: Wall-clock time for the entire run (fill + stabilize + measure + drain)
This would enable analysis of total benchmark runtime and help identify slow configurations.
Currently, only aggregated metrics are stored. For deeper post-hoc analysis, future versions should track:
- Full request/response pairs: Store prompts and generated responses for quality analysis
- Per-request timing data: Individual TTFT, TPOT, ITL values (not just aggregates)
- Token-level events: Complete timeline of token arrivals for detailed latency profiling
This would enable retrospective analysis without re-running benchmarks.
The current user throughput metric includes TTFT in the calculation:
```python
# Current implementation:
user_throughput = output_tokens / request_duration  # includes TTFT
```

A more accurate measure of "streaming experience" would exclude TTFT:
```python
# Improved implementation:
generation_throughput = output_tokens / (request_duration - ttft)
```

This `generation_throughput` better reflects how fast tokens appear after generation starts, which is what users perceive as "streaming speed".
MIT