Merged
5 changes: 2 additions & 3 deletions .github/workflows/release.yml
@@ -1,9 +1,8 @@
name: Release to PyPI

on:
push:
tags:
- "v*"
release:
types: [published]

jobs:
build:
81 changes: 54 additions & 27 deletions README.md
@@ -1,14 +1,30 @@
# infer-check

**Correctness and reliability testing for LLM inference engines.**
[![PyPI - Version](https://img.shields.io/pypi/v/infer-check?logo=PyPi&color=%233775A9)](https://pypi.org/project/infer-check/)
[![Run tests and upload coverage](https://github.com/NullPointerDepressiveDisorder/infer-check/actions/workflows/coverage.yml/badge.svg?branch=main)](https://github.com/NullPointerDepressiveDisorder/infer-check/actions/workflows/coverage.yml)

`infer-check` is a CLI tool that tests whether LLM inference backends produce correct, stable, and deterministic output. It catches the bugs that benchmarks miss — quantization-induced failures, cross-backend divergence, KV cache corruption under load, and non-determinism at temperature=0.
**Catches the correctness bugs that benchmarks miss in LLM inference engines.**

## Key findings
Quantization silently breaks arithmetic. Serving layers silently alter output. KV caches silently corrupt under load. Benchmarks like lm-evaluation-harness test whether models are smart — `infer-check` tests whether engines are correct.

Tested across Llama-3.1-8B-Instruct and Qwen3.5-4B (MoE) on Apple Silicon using mlx-lm and vllm-mlx.
## The problem

**4-bit quantization degrades task-dependently.** Numerical tasks break worst:
Every LLM inference engine has correctness bugs that benchmarks don't catch:

- **KV cache NaN pollution** in vLLM-Ascend permanently corrupts all subsequent requests
- **FP8 KV quantization** in vLLM causes repeated garbage output
- **32.5% element mismatches** in SGLang's FP8 DeepGEMM kernels on Blackwell GPUs
- **Batch-size-dependent output** where tokens change depending on concurrent request count

These aren't model quality problems — they're engine correctness failures. `infer-check` is a CLI tool that runs differential tests across backends, quantization levels, and concurrency conditions to surface them automatically.
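The core of a differential check like this can be sketched in a few lines. This is an illustrative sketch only: `text_similarity`, `diff_outputs`, and the 0.7 threshold are assumptions for the example, not infer-check's actual API or scoring.

```python
from difflib import SequenceMatcher

def text_similarity(baseline: str, test: str) -> float:
    """Score two generations on a 0..1 scale (illustrative metric)."""
    return SequenceMatcher(None, baseline, test).ratio()

def diff_outputs(baseline: str, test: str, threshold: float = 0.7) -> dict:
    """Flag a cross-backend divergence when similarity drops below threshold."""
    score = text_similarity(baseline, test)
    return {"similarity": score, "is_failure": score < threshold}
```

Run the same prompt at temperature=0 through two engines, feed both texts in, and a low score marks a divergence worth inspecting by hand.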

## Example results

Results from running `infer-check` on Llama-3.1-8B-Instruct and Qwen3.5-4B (MoE) on Apple Silicon using mlx-lm and vllm-mlx. These demonstrate what the tool catches — not a comprehensive benchmark.

### Quantization sweep

4-bit quantization on Llama-3.1-8B showed clear task-dependent degradation. Numerical tasks broke worst:

```
Llama-3.1-8B: bf16 vs 4bit
@@ -21,17 +37,27 @@ Tested across Llama-3.1-8B-Instruct and Qwen3.5-4B (MoE) on Apple Silicon using
└───────────────────────┴───────────┴──────────┴─────────────────┘
```

**Dense and MoE architectures degrade similarly at 4-bit.** Qwen3.5-4B (Gated Delta Networks + sparse MoE) shows 35/50 severe on reasoning — the same rate as dense Llama-3.1-8B.
A "severe" divergence means the quantized output is functionally wrong — not just worded differently, but giving incorrect answers to questions the bf16 baseline handles correctly. This pattern is consistent with published research on quantization-induced degradation, reproduced here on MLX's native quantization scheme.

**vllm-mlx's serving layer is perfectly faithful.** mlx-lm vs vllm-mlx at temperature=0 on Llama-3.1-8B-4bit: 50/50 identical (reasoning) and 30/30 identical (numerics). The serving layer introduces zero divergence.
### Dense vs. MoE comparison

**Both engines are deterministic at temperature=0.** Llama-3.1-8B-4bit and Qwen3.5-4B both scored 50/50 perfect determinism across 20 runs per prompt.
Qwen3.5-4B (Gated Delta Networks + sparse MoE) showed similar degradation rates to dense Llama-3.1-8B in our testing — 35/50 severe on reasoning at 4-bit. Small sample, but the tool picks up the signal clearly on both architectures.

### Cross-backend diff

**vllm-mlx handles concurrent load without corruption.** Stress test at concurrency 1/2/4/8: zero errors, 100% output consistency at all levels.
mlx-lm vs vllm-mlx at temperature=0 on Llama-3.1-8B-4bit: 50/50 identical (reasoning) and 30/30 identical (numerics). In this test, the vllm-mlx serving layer introduced zero divergence — output differences in production would come from quantization, not from the serving layer itself.

### Determinism

Llama-3.1-8B-4bit and Qwen3.5-4B both scored 50/50 identical across 20 runs per prompt on single-request mlx-lm inference at temperature=0.
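The check itself is simple in principle: run the same prompt N times and require bit-identical output. A sketch of that loop, with a stand-in `generate` in place of a real backend call (infer-check's own implementation may differ):

```python
def check_determinism(generate, prompt: str, runs: int = 20) -> bool:
    """True when every run produces bit-identical output for the prompt."""
    outputs = [generate(prompt) for _ in range(runs)]
    return len(set(outputs)) == 1
```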

### Stress test

vllm-mlx at concurrency 1/2/4/8: zero errors, 100% output consistency at all levels. No KV cache corruption or batch-dependent divergence detected.
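A minimal sketch of the stress test's shape, assuming a `send_request` callable that wraps the real HTTP call; the consistency metric here (agreement with the most common output) is an illustration, not necessarily the tool's exact formula:

```python
from concurrent.futures import ThreadPoolExecutor

def output_consistency(send_request, prompt: str, concurrency: int) -> float:
    """Fraction of concurrent responses matching the most common output."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        outputs = list(pool.map(send_request, [prompt] * concurrency))
    most_common = max(set(outputs), key=outputs.count)
    return outputs.count(most_common) / len(outputs)
```

At temperature=0 this should be 1.0 at every concurrency level; anything lower points at batch-dependent kernels or KV cache state leaking between requests.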

## Installation

```bash
```
pip install infer-check

# With MLX backend support (Apple Silicon)
@@ -44,7 +70,7 @@ pip install "infer-check[mlx]"

Compare pre-quantized models against a baseline. Each model is a separate HuggingFace repo.

```bash
```
infer-check sweep \
--models "bf16=mlx-community/Meta-Llama-3.1-8B-Instruct-bf16,\
8bit=mlx-community/Meta-Llama-3.1-8B-Instruct-8bit,\
@@ -73,7 +99,7 @@ The baseline is automatically run twice as a self-check — if it's not 50/50 id

Same model, same quant, different inference paths. Catches serving-layer bugs.

```bash
```
# Start vllm-mlx in another terminal:
# vllm-mlx serve mlx-community/Meta-Llama-3.1-8B-Instruct-4bit --port 8000

@@ -91,7 +117,7 @@ Uses `/v1/chat/completions` by default (`--chat`) so server-side chat templates

Same prompt N times at temperature=0. Output should be bit-identical every run.

```bash
```
infer-check determinism \
--model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
--backend mlx-lm \
@@ -104,7 +130,7 @@ infer-check determinism \

Concurrent requests through a serving backend. Tests KV cache correctness under load.

```bash
```
infer-check stress \
--model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
--backend openai-compat \
@@ -118,7 +144,7 @@ infer-check stress \

Generate an HTML report from all saved results.

```bash
```
infer-check report ./results/ --format html
```

@@ -127,34 +153,35 @@ infer-check report ./results/ --format html
Curated prompts targeting known quantization failure modes:

| Suite | Count | Purpose |
|---|---|---|
| --- | --- | --- |
| `reasoning.jsonl` | 50 | Multi-step math and logic |
| `code.jsonl` | 49 | Python, JSON, SQL generation |
| `adversarial-numerics.jsonl` | 30 | IEEE 754 edge cases, overflow, precision |
| `long-context.jsonl` | 10 | Tables and transcripts with recall questions |
| `determinism.jsonl` | 50 | High-entropy continuations for determinism testing |

All suites ship with the package — no need to clone the repo. Custom suites are JSONL files: `{"id": "...", "text": "...", "category": "...", "max_tokens": N}` per line.
All suites ship with the package — no need to clone the repo. Custom suites are JSONL files with one object per line:

```json
{"id": "custom-001", "text": "Your prompt here", "category": "math", "max_tokens": 512}
```
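A suite file in the format above can be loaded with a few lines of Python. This is a sketch of the format, not infer-check's own loader; `load_suite` is a hypothetical helper name.

```python
import json

def load_suite(path: str) -> list[dict]:
    """Parse a JSONL prompt suite, one prompt object per non-blank line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```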

## Supported backends

| Backend | Type | Use case |
|---|---|---|
| --- | --- | --- |
| **mlx-lm** | In-process | Local Apple Silicon inference with logprobs |
| **llama.cpp** | HTTP | `llama-server` via `/completion` endpoint |
| **vllm-mlx** | HTTP | Continuous batching on Apple Silicon |
| **openai-compat** | HTTP | Any OpenAI-compatible server (vLLM, SGLang, Ollama) |

## Why this exists

Every LLM inference engine has correctness bugs that benchmarks don't catch:

- **KV cache NaN pollution** in vLLM-Ascend permanently corrupts all subsequent requests
- **FP8 KV quantization** in vLLM causes repeated garbage output
- **32.5% element mismatches** in SGLang's FP8 DeepGEMM kernels on Blackwell GPUs
- **Batch-size-dependent output** where tokens change depending on concurrent request count
## Roadmap

These aren't model quality problems — they're engine correctness failures. Benchmarks like lm-evaluation-harness test whether models are smart. `infer-check` tests whether engines are correct.
- [ ] GGUF backend (direct llama.cpp integration without HTTP)
- [ ] CUDA vLLM backend for GPU-based differential testing
- [ ] Logprobs-based divergence scoring where backends support it
- [ ] Automated regression CI mode (`infer-check ci` with pass/fail exit codes)
- [ ] Expanded prompt suites for tool use and multi-turn conversations

## Requirements

2 changes: 1 addition & 1 deletion pyproject.toml
@@ -5,7 +5,7 @@ build-backend = "hatchling.build"

[project]
name = "infer-check"
version = "0.1.0"
version = "0.1.1"
description = "Correctness and reliability testing for LLM inference engines"
readme = "README.md"
license = { text = "Apache-2.0" }
105 changes: 87 additions & 18 deletions src/infer_check/reporting/html.py
@@ -318,8 +318,9 @@
<nav>
<a href="#summary">Summary</a>
{% if sweep_rows %}<a href="#sweep">Sweep</a>{% endif %}
{% if diff_failures %}<a href="#diff">Diff</a>{% endif %}
{% if diff_rows %}<a href="#diff">Diff</a>{% endif %}
{% if failures %}<a href="#cards">Failures</a>{% endif %}
{% if stress_rows %}<a href="#stress">Stress</a>{% endif %}
{% if determinism_rows %}<a href="#determinism">Determinism</a>{% endif %}
</nav>
</header>
@@ -402,14 +403,9 @@
{% endif %}

<!-- ── SECTION 3: Cross-Backend Comparison ── -->
{% if diff_failures %}
{% if diff_rows %}
<section id="diff">
<h2>
Cross-Backend Comparison
<small style="font-weight:400;font-size:0.8rem;color:var(--text-dim);">
(failures only)
</small>
</h2>
<h2>Cross-Backend Comparison</h2>
<div class="table-wrap">
Comment on lines 405 to 409

Copilot AI Mar 12, 2026

The diff section was switched to use diff_rows ({% if diff_rows %} and generate_report(..., diff_rows=...)), but inside the template the table loop still iterates over diff_failures. This mismatch will cause the Diff table to render no rows. Rename the loop variable to iterate over diff_rows (or pass diff_failures again) so the guard and loop are consistent.
<table>
<thead>
@@ -435,7 +431,7 @@
<small style="color:var(--text-dim);">{{ row.test_quant }}</small>
</td>
<td>
<span class="chip {{ 'chip-fail' if row.similarity | float < 70 else 'chip-ok' }}">
<span class="chip {{ 'chip-fail' if row.is_failure else 'chip-ok' }}">
{{ row.similarity }}%
</span>
</td>
@@ -463,6 +459,47 @@
</section>
{% endif %}

<!-- ── SECTION: Stress Test Results ── -->
{% if stress_rows %}
<section id="stress">
<h2>Stress Test Results</h2>
<div class="table-wrap">
<table>
<thead>
<tr>
<th>Model ID</th>
<th>Backend</th>
<th>Concurrency</th>
<th>Requests</th>
<th>Errors</th>
<th>Consistency</th>
</tr>
</thead>
<tbody>
{% for row in stress_rows %}
<tr>
<td style="font-family:monospace;font-size:0.78rem;">{{ row.model_id }}</td>
<td>{{ row.backend_name }}</td>
<td>{{ row.concurrency_level }}</td>
<td>{{ row.num_results }}</td>
<td>
<span class="chip {{ 'chip-fail' if row.error_count > 0 else 'chip-ok' }}">
{{ row.error_count }}
</span>
</td>
<td>
<span class="chip {{ 'chip-ok' if row.consistency_raw >= 0.9 else 'chip-fail' }}">
{{ row.output_consistency }}%
</span>
</td>
</tr>
{% endfor %}
</tbody>
</table>
</div>
</section>
{% endif %}

<!-- ── SECTION 4: Failure Cards ── -->
{% if failures %}
<section id="cards">
@@ -648,9 +685,19 @@ def _load_results(results_dir: Path) -> dict[str, list[Any]]:
sections["stress"].append(stress)
continue

stress_list = _try_load_list(raw, StressResult)
if stress_list is not None:
sections["stress"].extend(stress_list)
continue

det = _try_load(raw, DeterminismResult)
if det is not None:
sections["determinism"].append(det)
continue

det_list = _try_load_list(raw, DeterminismResult)
if det_list is not None:
sections["determinism"].extend(det_list)

return sections

@@ -715,28 +762,28 @@ def _build_sweep_context(


def _build_diff_context(diff_batches: list[list[ComparisonResult]]) -> list[dict[str, Any]]:
"""Build failure rows for the cross-backend comparison table."""
failures: list[dict[str, Any]] = []
"""Build rows for the cross-backend comparison table."""
rows: list[dict[str, Any]] = []
for batch in diff_batches:
for comp in batch:
if not comp.is_failure:
continue
failures.append(
rows.append(
{
"prompt_id": comp.baseline.prompt_id[:32],
"baseline_backend": comp.baseline.backend_name,
"baseline_quant": comp.baseline.quantization or "—",
"test_backend": comp.test.backend_name,
"test_quant": comp.test.quantization or "—",
"similarity": f"{comp.text_similarity * 100:.1f}",
"similarity_raw": comp.text_similarity,
"kl": f"{comp.kl_divergence:.4f}" if comp.kl_divergence is not None else "N/A",
"baseline_text": comp.baseline.text,
"test_text": comp.test.text,
"is_failure": comp.is_failure,
}
)
# Sort worst-first (lowest similarity).
failures.sort(key=lambda r: float(r["similarity"]))
return failures
rows.sort(key=lambda r: float(r["similarity_raw"]))
return rows


def _build_failure_cards(
@@ -786,6 +833,25 @@ def _build_failure_cards(
return cards


def _build_stress_context(stress_results: list[StressResult]) -> list[dict[str, Any]]:
"""Build table rows for the stress section."""
rows = []
for s in stress_results:
rows.append(
{
"model_id": s.model_id[:32],
"backend_name": s.backend_name,
"concurrency_level": s.concurrency_level,
"error_count": s.error_count,
"output_consistency": f"{s.output_consistency * 100:.1f}",
"consistency_raw": s.output_consistency,
"num_results": len(s.results),
}
)
rows.sort(key=lambda r: (r["backend_name"], r["concurrency_level"]))
return rows


def _build_determinism_context(
det_results: list[DeterminismResult],
) -> list[dict[str, Any]]:
@@ -847,6 +913,7 @@ def generate_report(results_dir: Path, output_path: Path) -> Path:
sections = _load_results(results_dir)
sweeps: list[SweepResult] = sections["sweep"]
diff_batches: list[list[ComparisonResult]] = sections["diff"]
stress_results: list[StressResult] = sections["stress"]
det_results: list[DeterminismResult] = sections["determinism"]

# ── Executive Summary ──────────────────────────────────────────────────
@@ -875,8 +942,9 @@

# ── Section data ───────────────────────────────────────────────────────
sweep_ctx = _build_sweep_context(sweeps)
diff_failures = _build_diff_context(diff_batches)
diff_rows = _build_diff_context(diff_batches)
failure_cards = _build_failure_cards(diff_batches)
stress_rows = _build_stress_context(stress_results)
determinism_rows = _build_determinism_context(det_results)

generated_at = datetime.now(UTC).strftime("%Y-%m-%d %H:%M UTC")
@@ -897,8 +965,9 @@
sweep_rows=sweep_ctx["sweep_rows"],
quant_cols=sweep_ctx["quant_cols"],
degradation_cliff=sweep_ctx["degradation_cliff"],
diff_failures=diff_failures,
diff_rows=diff_rows,
failures=failure_cards,
stress_rows=stress_rows,
determinism_rows=determinism_rows,
generated_at=generated_at,
)