Merged
5 changes: 2 additions & 3 deletions .github/workflows/release.yml
@@ -1,9 +1,8 @@
name: Release to PyPI

on:
push:
tags:
- "v*"
release:
types: [published]

jobs:
build:
81 changes: 54 additions & 27 deletions README.md
@@ -1,14 +1,30 @@
# infer-check

**Correctness and reliability testing for LLM inference engines.**
[![PyPI - Version](https://img.shields.io/pypi/v/infer-check?logo=PyPi&color=%233775A9)](https://pypi.org/project/infer-check/)
[![Run tests and upload coverage](https://github.com/NullPointerDepressiveDisorder/infer-check/actions/workflows/coverage.yml/badge.svg?branch=main)](https://github.com/NullPointerDepressiveDisorder/infer-check/actions/workflows/coverage.yml)

`infer-check` is a CLI tool that tests whether LLM inference backends produce correct, stable, and deterministic output. It catches the bugs that benchmarks miss — quantization-induced failures, cross-backend divergence, KV cache corruption under load, and non-determinism at temperature=0.
**Catches the correctness bugs that benchmarks miss in LLM inference engines.**

## Key findings
Quantization silently breaks arithmetic. Serving layers silently alter output. KV caches silently corrupt under load. Benchmarks like lm-evaluation-harness test whether models are smart — `infer-check` tests whether engines are correct.

Tested across Llama-3.1-8B-Instruct and Qwen3.5-4B (MoE) on Apple Silicon using mlx-lm and vllm-mlx.
## The problem

**4-bit quantization degrades task-dependently.** Numerical tasks break worst:
Every LLM inference engine has correctness bugs that benchmarks don't catch:

- **KV cache NaN pollution** in vLLM-Ascend permanently corrupts all subsequent requests
- **FP8 KV quantization** in vLLM causes repeated garbage output
- **32.5% element mismatches** in SGLang's FP8 DeepGEMM kernels on Blackwell GPUs
- **Batch-size-dependent output** where tokens change depending on concurrent request count

These aren't model quality problems — they're engine correctness failures. `infer-check` is a CLI tool that runs differential tests across backends, quantization levels, and concurrency conditions to surface them automatically.
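The core of a differential check like this can be sketched in a few lines. This is an illustrative sketch only: `text_similarity`, `diff_outputs`, and the 0.7 threshold are assumptions for the example, not infer-check's actual API or scoring.

```python
from difflib import SequenceMatcher

def text_similarity(baseline: str, test: str) -> float:
    """Score two generations on a 0..1 scale (illustrative metric)."""
    return SequenceMatcher(None, baseline, test).ratio()

def diff_outputs(baseline: str, test: str, threshold: float = 0.7) -> dict:
    """Flag a cross-backend divergence when similarity drops below threshold."""
    score = text_similarity(baseline, test)
    return {"similarity": score, "is_failure": score < threshold}
```

Run the same prompt at temperature=0 through two engines, feed both texts in, and a low score marks a divergence worth inspecting by hand.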

## Example results

Results from running `infer-check` on Llama-3.1-8B-Instruct and Qwen3.5-4B (MoE) on Apple Silicon using mlx-lm and vllm-mlx. These demonstrate what the tool catches — not a comprehensive benchmark.

### Quantization sweep

4-bit quantization on Llama-3.1-8B showed clear task-dependent degradation. Numerical tasks broke worst:

```
Llama-3.1-8B: bf16 vs 4bit
@@ -21,17 +37,27 @@ Tested across Llama-3.1-8B-Instruct and Qwen3.5-4B (MoE) on Apple Silicon using
└───────────────────────┴───────────┴──────────┴─────────────────┘
```

**Dense and MoE architectures degrade similarly at 4-bit.** Qwen3.5-4B (Gated Delta Networks + sparse MoE) shows 35/50 severe on reasoning — the same rate as dense Llama-3.1-8B.
A "severe" divergence means the quantized output is functionally wrong — not just worded differently, but giving incorrect answers to questions the bf16 baseline handles correctly. This pattern is consistent with published research on quantization-induced degradation, reproduced here on MLX's native quantization scheme.

**vllm-mlx's serving layer is perfectly faithful.** mlx-lm vs vllm-mlx at temperature=0 on Llama-3.1-8B-4bit: 50/50 identical (reasoning) and 30/30 identical (numerics). The serving layer introduces zero divergence.
### Dense vs. MoE comparison

**Both engines are deterministic at temperature=0.** Llama-3.1-8B-4bit and Qwen3.5-4B both scored 50/50 perfect determinism across 20 runs per prompt.
Qwen3.5-4B (Gated Delta Networks + sparse MoE) showed similar degradation rates to dense Llama-3.1-8B in our testing — 35/50 severe on reasoning at 4-bit. Small sample, but the tool picks up the signal clearly on both architectures.

### Cross-backend diff

**vllm-mlx handles concurrent load without corruption.** Stress test at concurrency 1/2/4/8: zero errors, 100% output consistency at all levels.
mlx-lm vs vllm-mlx at temperature=0 on Llama-3.1-8B-4bit: 50/50 identical (reasoning) and 30/30 identical (numerics). In this test, the vllm-mlx serving layer introduced zero divergence — output differences in production would come from quantization, not from the serving layer itself.

### Determinism

Llama-3.1-8B-4bit and Qwen3.5-4B both scored 50/50 identical across 20 runs per prompt on single-request mlx-lm inference at temperature=0.
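The check itself is simple in principle: run the same prompt N times and require bit-identical output. A sketch of that loop, with a stand-in `generate` in place of a real backend call (infer-check's own implementation may differ):

```python
def check_determinism(generate, prompt: str, runs: int = 20) -> bool:
    """True when every run produces bit-identical output for the prompt."""
    outputs = [generate(prompt) for _ in range(runs)]
    return len(set(outputs)) == 1
```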

### Stress test

vllm-mlx at concurrency 1/2/4/8: zero errors, 100% output consistency at all levels. No KV cache corruption or batch-dependent divergence detected.
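A minimal sketch of the stress test's shape, assuming a `send_request` callable that wraps the real HTTP call; the consistency metric here (agreement with the most common output) is an illustration, not necessarily the tool's exact formula:

```python
from concurrent.futures import ThreadPoolExecutor

def output_consistency(send_request, prompt: str, concurrency: int) -> float:
    """Fraction of concurrent responses matching the most common output."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        outputs = list(pool.map(send_request, [prompt] * concurrency))
    most_common = max(set(outputs), key=outputs.count)
    return outputs.count(most_common) / len(outputs)
```

At temperature=0 this should be 1.0 at every concurrency level; anything lower points at batch-dependent kernels or KV cache state leaking between requests.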

## Installation

```bash
```
pip install infer-check

# With MLX backend support (Apple Silicon)
@@ -44,7 +70,7 @@ pip install "infer-check[mlx]"

Compare pre-quantized models against a baseline. Each model is a separate HuggingFace repo.

```bash
```
infer-check sweep \
--models "bf16=mlx-community/Meta-Llama-3.1-8B-Instruct-bf16,\
8bit=mlx-community/Meta-Llama-3.1-8B-Instruct-8bit,\
@@ -73,7 +99,7 @@ The baseline is automatically run twice as a self-check — if it's not 50/50 id

Same model, same quant, different inference paths. Catches serving-layer bugs.

```bash
```
# Start vllm-mlx in another terminal:
# vllm-mlx serve mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --port 8000

@@ -91,7 +117,7 @@ Uses `/v1/chat/completions` by default (`--chat`) so server-side chat templates

Same prompt N times at temperature=0. Output should be bit-identical every run.

```bash
```
infer-check determinism \
--model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
--backend mlx-lm \
@@ -104,7 +130,7 @@ infer-check determinism \

Concurrent requests through a serving backend. Tests KV cache correctness under load.

```bash
```
infer-check stress \
--model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
--backend openai-compat \
@@ -118,7 +144,7 @@ infer-check stress \

Generate an HTML report from all saved results.

```bash
```
infer-check report ./results/ --format html
```

@@ -127,34 +153,35 @@ infer-check report ./results/ --format html
Curated prompts targeting known quantization failure modes:

| Suite | Count | Purpose |
|---|---|---|
| --- | --- | --- |
| `reasoning.jsonl` | 50 | Multi-step math and logic |
| `code.jsonl` | 49 | Python, JSON, SQL generation |
| `adversarial-numerics.jsonl` | 30 | IEEE 754 edge cases, overflow, precision |
| `long-context.jsonl` | 10 | Tables and transcripts with recall questions |
| `determinism.jsonl` | 50 | High-entropy continuations for determinism testing |

All suites ship with the package — no need to clone the repo. Custom suites are JSONL files: `{"id": "...", "text": "...", "category": "...", "max_tokens": N}` per line.
All suites ship with the package — no need to clone the repo. Custom suites are JSONL files with one object per line:

```json
{"id": "custom-001", "text": "Your prompt here", "category": "math", "max_tokens": 512}
```
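A suite file in the format above can be loaded with a few lines of Python. This is a sketch of the format, not infer-check's own loader; `load_suite` is a hypothetical helper name.

```python
import json

def load_suite(path: str) -> list[dict]:
    """Parse a JSONL prompt suite, one prompt object per non-blank line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```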

## Supported backends

| Backend | Type | Use case |
|---|---|---|
| --- | --- | --- |
| **mlx-lm** | In-process | Local Apple Silicon inference with logprobs |
| **llama.cpp** | HTTP | `llama-server` via `/completion` endpoint |
| **vllm-mlx** | HTTP | Continuous batching on Apple Silicon |
| **openai-compat** | HTTP | Any OpenAI-compatible server (vLLM, SGLang, Ollama) |

## Why this exists

Every LLM inference engine has correctness bugs that benchmarks don't catch:

- **KV cache NaN pollution** in vLLM-Ascend permanently corrupts all subsequent requests
- **FP8 KV quantization** in vLLM causes repeated garbage output
- **32.5% element mismatches** in SGLang's FP8 DeepGEMM kernels on Blackwell GPUs
- **Batch-size-dependent output** where tokens change depending on concurrent request count
## Roadmap

These aren't model quality problems — they're engine correctness failures. Benchmarks like lm-evaluation-harness test whether models are smart. `infer-check` tests whether engines are correct.
- [ ] GGUF backend (direct llama.cpp integration without HTTP)
- [ ] CUDA vLLM backend for GPU-based differential testing
- [ ] Logprobs-based divergence scoring where backends support it
- [ ] Automated regression CI mode (`infer-check ci` with pass/fail exit codes)
- [ ] Expanded prompt suites for tool use and multi-turn conversations

## Requirements

2 changes: 1 addition & 1 deletion pyproject.toml
@@ -5,7 +5,7 @@ build-backend = "hatchling.build"

[project]
name = "infer-check"
version = "0.1.0"
version = "0.1.1"
description = "Correctness and reliability testing for LLM inference engines"
readme = "README.md"
license = { text = "Apache-2.0" }
105 changes: 87 additions & 18 deletions src/infer_check/reporting/html.py
@@ -318,8 +318,9 @@
<nav>
<a href="#summary">Summary</a>
{% if sweep_rows %}<a href="#sweep">Sweep</a>{% endif %}
{% if diff_failures %}<a href="#diff">Diff</a>{% endif %}
{% if diff_rows %}<a href="#diff">Diff</a>{% endif %}
{% if failures %}<a href="#cards">Failures</a>{% endif %}
{% if stress_rows %}<a href="#stress">Stress</a>{% endif %}
{% if determinism_rows %}<a href="#determinism">Determinism</a>{% endif %}
</nav>
</header>
@@ -402,14 +403,9 @@
{% endif %}

<!-- ── SECTION 3: Cross-Backend Comparison ── -->
{% if diff_failures %}
{% if diff_rows %}
<section id="diff">
<h2>
Cross-Backend Comparison
<small style="font-weight:400;font-size:0.8rem;color:var(--text-dim);">
(failures only)
</small>
</h2>
<h2>Cross-Backend Comparison</h2>
<div class="table-wrap">
Comment on lines 405 to 409

Copilot AI Mar 12, 2026

The diff section was switched to use diff_rows ({% if diff_rows %} and generate_report(..., diff_rows=...)), but inside the template the table loop still iterates over diff_failures. This mismatch will cause the Diff table to render no rows. Rename the loop variable to iterate over diff_rows (or pass diff_failures again) so the guard and loop are consistent.
<table>
<thead>
@@ -435,7 +431,7 @@
<small style="color:var(--text-dim);">{{ row.test_quant }}</small>
</td>
<td>
<span class="chip {{ 'chip-fail' if row.similarity | float < 70 else 'chip-ok' }}">
<span class="chip {{ 'chip-fail' if row.is_failure else 'chip-ok' }}">
{{ row.similarity }}%
</span>
</td>
@@ -463,6 +459,47 @@
</section>
{% endif %}

<!-- ── SECTION: Stress Test Results ── -->
{% if stress_rows %}
<section id="stress">
<h2>Stress Test Results</h2>
<div class="table-wrap">
<table>
<thead>
<tr>
<th>Model ID</th>
<th>Backend</th>
<th>Concurrency</th>
<th>Requests</th>
<th>Errors</th>
<th>Consistency</th>
</tr>
</thead>
<tbody>
{% for row in stress_rows %}
<tr>
<td style="font-family:monospace;font-size:0.78rem;">{{ row.model_id }}</td>
<td>{{ row.backend_name }}</td>
<td>{{ row.concurrency_level }}</td>
<td>{{ row.num_results }}</td>
<td>
<span class="chip {{ 'chip-fail' if row.error_count > 0 else 'chip-ok' }}">
{{ row.error_count }}
</span>
</td>
<td>
<span class="chip {{ 'chip-ok' if row.consistency_raw >= 0.9 else 'chip-fail' }}">
{{ row.output_consistency }}%
</span>
</td>
</tr>
{% endfor %}
</tbody>
</table>
</div>
</section>
{% endif %}

<!-- ── SECTION 4: Failure Cards ── -->
{% if failures %}
<section id="cards">
@@ -648,9 +685,19 @@ def _load_results(results_dir: Path) -> dict[str, list[Any]]:
sections["stress"].append(stress)
continue

stress_list = _try_load_list(raw, StressResult)
if stress_list is not None:
sections["stress"].extend(stress_list)
continue

det = _try_load(raw, DeterminismResult)
if det is not None:
sections["determinism"].append(det)
continue

det_list = _try_load_list(raw, DeterminismResult)
if det_list is not None:
sections["determinism"].extend(det_list)

return sections

@@ -715,28 +762,28 @@ def _build_sweep_context(


def _build_diff_context(diff_batches: list[list[ComparisonResult]]) -> list[dict[str, Any]]:
"""Build failure rows for the cross-backend comparison table."""
failures: list[dict[str, Any]] = []
"""Build rows for the cross-backend comparison table."""
rows: list[dict[str, Any]] = []
for batch in diff_batches:
for comp in batch:
if not comp.is_failure:
continue
failures.append(
rows.append(
{
"prompt_id": comp.baseline.prompt_id[:32],
"baseline_backend": comp.baseline.backend_name,
"baseline_quant": comp.baseline.quantization or "—",
"test_backend": comp.test.backend_name,
"test_quant": comp.test.quantization or "—",
"similarity": f"{comp.text_similarity * 100:.1f}",
"similarity_raw": comp.text_similarity,
"kl": f"{comp.kl_divergence:.4f}" if comp.kl_divergence is not None else "N/A",
"baseline_text": comp.baseline.text,
"test_text": comp.test.text,
"is_failure": comp.is_failure,
}
)
# Sort worst-first (lowest similarity).
failures.sort(key=lambda r: float(r["similarity"]))
return failures
rows.sort(key=lambda r: float(r["similarity_raw"]))
return rows


def _build_failure_cards(
@@ -786,6 +833,25 @@ def _build_failure_cards(
return cards


def _build_stress_context(stress_results: list[StressResult]) -> list[dict[str, Any]]:
"""Build table rows for the stress section."""
rows = []
for s in stress_results:
rows.append(
{
"model_id": s.model_id[:32],
"backend_name": s.backend_name,
"concurrency_level": s.concurrency_level,
"error_count": s.error_count,
"output_consistency": f"{s.output_consistency * 100:.1f}",
"consistency_raw": s.output_consistency,
"num_results": len(s.results),
}
)
rows.sort(key=lambda r: (r["backend_name"], r["concurrency_level"]))
return rows


def _build_determinism_context(
det_results: list[DeterminismResult],
) -> list[dict[str, Any]]:
@@ -847,6 +913,7 @@ def generate_report(results_dir: Path, output_path: Path) -> Path:
sections = _load_results(results_dir)
sweeps: list[SweepResult] = sections["sweep"]
diff_batches: list[list[ComparisonResult]] = sections["diff"]
stress_results: list[StressResult] = sections["stress"]
det_results: list[DeterminismResult] = sections["determinism"]

# ── Executive Summary ──────────────────────────────────────────────────
@@ -875,8 +942,9 @@

# ── Section data ───────────────────────────────────────────────────────
sweep_ctx = _build_sweep_context(sweeps)
diff_failures = _build_diff_context(diff_batches)
diff_rows = _build_diff_context(diff_batches)
failure_cards = _build_failure_cards(diff_batches)
stress_rows = _build_stress_context(stress_results)
determinism_rows = _build_determinism_context(det_results)

generated_at = datetime.now(UTC).strftime("%Y-%m-%d %H:%M UTC")
@@ -897,8 +965,9 @@
sweep_rows=sweep_ctx["sweep_rows"],
quant_cols=sweep_ctx["quant_cols"],
degradation_cliff=sweep_ctx["degradation_cliff"],
diff_failures=diff_failures,
diff_rows=diff_rows,
failures=failure_cards,
stress_rows=stress_rows,
determinism_rows=determinism_rows,
generated_at=generated_at,
)