AI Inference Benchmark

A benchmarking framework for LLM inference engines (vLLM, SGLang, NVIDIA NIM) that measures sustainable throughput under continuous full load.

Why Closed-Loop Benchmarking?

Traditional benchmarks often use "open-loop" testing: send N requests, wait for completion, measure time. This approach has significant drawbacks:

  • Unrealistic workload: Real production systems face continuous load, not bursts
  • Cold start effects: First requests may be slower due to JIT compilation, memory allocation
  • Queue dynamics ignored: Doesn't capture how the system behaves when the request queue is constantly full

The Closed-Loop Approach

This framework uses a closed-loop methodology that maintains a constant number of in-flight requests:

┌─────────────────────────────────────────────────────────────────┐
│                     CLOSED-LOOP BENCHMARK                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   Client                              Inference Server          │
│   ┌─────────┐                         ┌─────────────┐          │
│   │ Slot 1  │ ──── Request 1 ──────▶ │             │          │
│   │ Slot 2  │ ──── Request 2 ──────▶ │   vLLM /    │          │
│   │ Slot 3  │ ──── Request 3 ──────▶ │   SGLang /  │          │
│   │   ...   │         ...            │   NIM       │          │
│   │ Slot N  │ ──── Request N ──────▶ │             │          │
│   └─────────┘                         └─────────────┘          │
│        │                                    │                   │
│        │◀────── Response ──────────────────┘                   │
│        │                                                        │
│        └──── Immediately send next request ────▶               │
│                                                                 │
│   Target: Always maintain N concurrent requests ("in-flight")  │
└─────────────────────────────────────────────────────────────────┘

Key principle: When a request completes, a new one is immediately dispatched. This keeps the server constantly saturated at exactly N concurrent requests.
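
The slot mechanics above can be sketched with `asyncio` (a minimal, self-contained illustration, not the framework's actual code; `fake_request` stands in for the real streaming HTTP call, and the gradual fill-phase ramp is omitted):

```python
import asyncio
import time

async def fake_request() -> float:
    """Stand-in for one inference call; returns its wall-clock duration."""
    start = time.monotonic()
    await asyncio.sleep(0.01)  # placeholder for the real streaming HTTP call
    return time.monotonic() - start

async def worker(deadline: float, durations: list) -> None:
    """One slot: the moment a request finishes, dispatch the next one."""
    while time.monotonic() < deadline:
        durations.append(await fake_request())

async def run_closed_loop(n_slots: int, seconds: float) -> int:
    """Keep exactly n_slots requests in flight until the deadline passes."""
    durations: list = []
    deadline = time.monotonic() + seconds
    await asyncio.gather(*(worker(deadline, durations) for _ in range(n_slots)))
    return len(durations)

completed = asyncio.run(run_closed_loop(n_slots=4, seconds=0.1))
```

Because each slot re-dispatches immediately, the number of in-flight requests never drops below N until the deadline, which is exactly the saturation property the benchmark relies on.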

Benchmark Phases

Time ──────────────────────────────────────────────────────────────▶

┌──────────┐  ┌────────────────┐  ┌────────────────────┐  ┌────────┐
│   FILL   │  │   STABILIZE    │  │      MEASURE       │  │ DRAIN  │
│  ~30s    │  │     ~60s       │  │      ~120s         │  │  ~30s  │
└──────────┘  └────────────────┘  └────────────────────┘  └────────┘
     │              │                      │                   │
     │              │                      │                   │
  Gradually      Let system            Collect all         Let pending
  ramp up to     reach steady          metrics here        requests
  N requests     state (JIT,           (this is the        complete
  (no burst!)    caches warm)          actual data)

  1. Fill Phase: Gradually add requests (1 per second) to avoid burst effects
  2. Stabilize Phase: System reaches steady state - JIT compilation completes, caches warm up
  3. Measure Phase: All metrics are collected here - this is the actual benchmark data
  4. Drain Phase: Let remaining requests complete gracefully
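
One way to attribute events to these phases (a sketch only; the framework's actual bookkeeping may differ, and the durations simply mirror the defaults in the timeline above):

```python
FILL, STABILIZE, MEASURE = 30.0, 60.0, 120.0  # seconds, as in the timeline above

def phase_at(t: float) -> str:
    """Map elapsed time since benchmark start to the phase it falls in."""
    if t < FILL:
        return "fill"
    if t < FILL + STABILIZE:
        return "stabilize"
    if t < FILL + STABILIZE + MEASURE:
        return "measure"
    return "drain"

def counts_for_metrics(completion_time: float) -> bool:
    """Only events landing in the measure window contribute to the metrics."""
    return phase_at(completion_time) == "measure"
```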

Why This Produces Better Data

| Aspect | Open-Loop | Closed-Loop (This Framework) |
| --- | --- | --- |
| Server utilization | Variable, often underutilized | Constant, always at target load |
| Queue behavior | Empty → burst → empty | Constant queue depth |
| Measurement validity | Includes warm-up artifacts | Pure steady-state measurement |
| Real-world relevance | Synthetic | Matches production patterns |
| Throughput measurement | Peak burst capacity | Sustainable throughput |

Metrics Explained

Server Throughput (tok/s)

What it measures: Total tokens generated per second across all requests.

How it's calculated:

server_throughput = user_throughput_avg × batch_size

This value is derived from per-request throughput rather than from direct token counting, which would miscount tokens from requests that straddle the start or end of the measure phase.

This represents the aggregate capacity of the inference server - how many tokens it can produce per second when fully loaded.
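
A worked example with hypothetical per-request numbers (the values are made up for illustration):

```python
from statistics import mean

# Hypothetical per-request throughputs (tok/s) collected during the
# measure phase at batch_size = 8.
user_throughputs = [24.0, 25.5, 23.8, 24.7, 25.0, 24.3, 24.9, 25.1]
batch_size = 8

user_throughput_avg = mean(user_throughputs)          # ~24.66 tok/s
server_throughput = user_throughput_avg * batch_size  # ~197.3 tok/s
```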

User Throughput (tok/s per request)

What it measures: Tokens per second from a single user's perspective.

How it's calculated:

# For each completed request:
user_throughput = output_tokens / request_duration  # includes TTFT!

# Reported as mean ± std across all requests in measure phase

Note: request_duration is the total time from request sent to last token received, which includes TTFT. This means user throughput reflects the complete user experience, not just the generation phase.

This answers: "How fast does generation feel to an individual user?"

At high concurrency, user throughput decreases because server capacity is shared and TTFT increases.

TTFT - Time To First Token

What it measures: Latency from request start until the first output token arrives.

How it's calculated:

ttft = timestamp_first_token - timestamp_request_sent

TTFT includes:

  • Network latency (request to server)
  • Queue wait time (if server is busy)
  • Prefill time (processing the entire input prompt)
  • Network latency (first token back)

Why it matters: Users perceive TTFT as "thinking time" before the model starts responding. Long TTFT feels sluggish even if generation is fast afterward.

TPOT - Time Per Output Token

What it measures: Average time between consecutive output tokens during generation.

How it's calculated:

# Native calculation (most accurate):
tpot = (timestamp_last_token - timestamp_first_token) / (output_tokens - 1)

Why it matters: TPOT determines the "streaming speed" - how fast text appears during generation. Lower TPOT = smoother, faster-feeling output.

ITL - Inter-Token Latency

What it measures: The actual time gap between each consecutive token pair.

itl_values = [t2 - t1, t3 - t2, t4 - t3, ...]  # All inter-token gaps

While TPOT is an average, ITL captures the full distribution of token timings.
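
The relationship between the two, on a made-up arrival trace:

```python
# Hypothetical token-arrival timestamps (seconds) for one request,
# with the first token at t = 0.
timestamps = [0.00, 0.05, 0.11, 0.16, 0.30, 0.35]

# ITL: every consecutive gap -- note the 0.14 s stall in the middle.
itl = [b - a for a, b in zip(timestamps, timestamps[1:])]

# TPOT is the mean of those gaps, so the stall disappears into the average.
tpot = (timestamps[-1] - timestamps[0]) / (len(timestamps) - 1)  # 0.07 s
```

The average (TPOT) looks healthy, but the ITL list still shows the stall; that gap between average and distribution is what the jitter ratio below quantifies.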

Jitter Ratio (ITL P99 / P50)

What it measures: Stability of token generation - how consistent is the timing?

How it's calculated:

itl_sorted = sorted(all_itl_values)
p50 = itl_sorted[len(itl_sorted) // 2]           # Median
p99 = itl_sorted[int(len(itl_sorted) * 0.99)]    # 99th percentile

jitter_ratio = p99 / p50

Interpretation:

  • Jitter ≈ 1.0: Very stable, consistent token timing (ideal)
  • Jitter = 2.0: P99 latency is 2x the median (some stuttering)
  • Jitter > 3.0: Significant variance, noticeable pauses during generation

Why it matters: High jitter means users experience "stuttering" - the text flows smoothly, then pauses, then flows again. Even with good average TPOT, high jitter feels bad.
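
An illustration with synthetic ITL traces (invented data, using the percentile recipe above) of how a small fraction of stalls drives the ratio up:

```python
import random

random.seed(0)

def jitter_ratio(itl_values):
    s = sorted(itl_values)
    p50 = s[len(s) // 2]            # median (nearest-rank)
    p99 = s[int(len(s) * 0.99)]     # 99th percentile (nearest-rank)
    return p99 / p50

# Steady stream: every gap is ~50 ms, give or take 2 ms.
steady = [0.05 + random.uniform(-0.002, 0.002) for _ in range(1000)]

# Stuttering stream: same median, but 1% of gaps are 400 ms stalls.
stutter = [0.05] * 990 + [0.40] * 10

jitter_ratio(steady)   # close to 1.0
jitter_ratio(stutter)  # 0.40 / 0.05 = 8.0 -> noticeable pauses
```

Both streams have nearly the same median ITL; only the jitter ratio separates them.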


Quick Start

# 1. Install dependencies
pip install -r requirements.txt

# 2. Edit benchmark_config.json with your endpoints
# 3. Run benchmark
python run_full_benchmark.py

# Or dry-run first to see what will happen
python run_full_benchmark.py --dry-run

Configuration (benchmark_config.json)

{
    "endpoints": [
        {"name": "vllm", "url": "http://localhost:8000/v1", "engine": "vllm"}
    ],
    "models": [
        {"id": "llama-70b", "name": "meta-llama/Llama-3.1-70B-Instruct"}
    ],
    "scenarios": ["A1", "A2", "B1", "B2", "C", "D"],
    "batch_sizes": [1, 2, 8, 16, 32, 64, 128, 256],
    "measure_seconds": 120,
    "stabilize_seconds": 60,
    "max_tokens": 512,
    "output_dir": "results/my_benchmark"
}
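
Assuming the runner expands the cross product of endpoints × models × scenarios × batch sizes (a guess at the semantics, not something the config file itself states), the example above yields 48 runs:

```python
# The example config above, inlined and trimmed to the fields that matter here.
cfg = {
    "endpoints": [{"name": "vllm", "url": "http://localhost:8000/v1"}],
    "models": [{"id": "llama-70b"}],
    "scenarios": ["A1", "A2", "B1", "B2", "C", "D"],
    "batch_sizes": [1, 2, 8, 16, 32, 64, 128, 256],
}

# One run per (endpoint, model, scenario, batch_size) combination.
runs = [
    (e["name"], m["id"], s, b)
    for e in cfg["endpoints"]
    for m in cfg["models"]
    for s in cfg["scenarios"]
    for b in cfg["batch_sizes"]
]
len(runs)  # 1 * 1 * 6 * 8 = 48
```

With `measure_seconds: 120` and `stabilize_seconds: 60` each run takes several minutes, so the combination count is worth checking before a full sweep.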

Scenarios

The framework includes 6 scenarios designed to test different workload patterns:

| ID | Name | Input Tokens | Caching Potential | Purpose |
| --- | --- | --- | --- | --- |
| A1 | Same Short | ~50 | Maximum (identical prompts) | Best-case prefix caching |
| A2 | Same Long | ~2,500 | Maximum + long prefill | Prefix caching with large context |
| B1 | Diverse Short | ~50 | None (all different) | Raw throughput, no caching |
| B2 | Diverse Long | ~2,500 | None + long prefill | Stress test: diverse + large |
| C | RAG/Agent | ~2,000 system + short query | System prompt caching | Typical RAG workload |
| D | Multi-Turn Dialog | Growing history | History caching | Chat with context buildup |

Project Structure

├── run_full_benchmark.py      # Quick start entry point
├── benchmark_config.json      # Default configuration
├── requirements.txt
│
├── src/
│   ├── benchmark_core/        # Core library
│   │   ├── closed_loop.py     # Closed-loop benchmark logic
│   │   ├── client.py          # Async OpenAI-compatible client
│   │   └── metrics.py         # Metric calculations
│   └── scenarios/             # Benchmark scenarios & prompts
│
├── scripts/
│   ├── run_benchmark.py       # Full benchmark runner
│   ├── prepare_b2_dataset.py  # Download B2 dataset (LongBench)
│   └── prepare_dialog_dataset.py  # Download dialog dataset (oasst2)
│
├── analysis/
│   ├── generate_ix_figures.py # Generate visualizations
│   └── validate_results.py    # Validate benchmark results
│
├── results/                   # Benchmark results (JSON)
└── figures/                   # Generated plots

Generating Visualizations

After running benchmarks:

python analysis/generate_ix_figures.py --results-dir results/my_benchmark

This generates:

  • Throughput vs. concurrency plots
  • TTFT and TPOT latency curves
  • Jitter analysis
  • Workload difficulty heatmaps
  • Engine comparison charts

Optional: Dataset Preparation

For scenarios B2 (diverse long) and D (multi-turn dialog), you can download real datasets:

# B2: LongBench-v2 dataset (diverse long documents)
pip install datasets
python scripts/prepare_b2_dataset.py

# D: OpenAssistant dialogs (real multi-turn conversations)
python scripts/prepare_dialog_dataset.py

Future Improvements

Areas identified for future development:

Benchmark Timestamps

The JSON result files currently lack timing metadata. Future versions should include:

  • Start timestamp: When the benchmark configuration started
  • End timestamp: When it completed
  • Total duration: Wall-clock time for the entire run (fill + stabilize + measure + drain)

This would enable analysis of total benchmark runtime and help identify slow configurations.

Enhanced Request Tracking

Currently, only aggregated metrics are stored. For deeper post-hoc analysis, future versions should track:

  • Full request/response pairs: Store prompts and generated responses for quality analysis
  • Per-request timing data: Individual TTFT, TPOT, ITL values (not just aggregates)
  • Token-level events: Complete timeline of token arrivals for detailed latency profiling

This would enable retrospective analysis without re-running benchmarks.

User Throughput Calculation

The current user throughput metric includes TTFT in the calculation:

# Current implementation:
user_throughput = output_tokens / request_duration  # includes TTFT

A more accurate measure of "streaming experience" would exclude TTFT:

# Improved implementation:
generation_throughput = output_tokens / (request_duration - ttft)

This generation_throughput better reflects how fast tokens appear after generation starts, which is what users perceive as "streaming speed".
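
Plugging hypothetical numbers in shows how much TTFT can skew the metric:

```python
# Hypothetical request: 512 output tokens, 30 s end-to-end, 5 s spent in TTFT.
output_tokens, request_duration, ttft = 512, 30.0, 5.0

user_throughput = output_tokens / request_duration                 # ~17.1 tok/s
generation_throughput = output_tokens / (request_duration - ttft)  # 20.48 tok/s
```

Here the TTFT-inclusive figure understates perceived streaming speed by roughly 17%.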

License

MIT
