Open
Changes from 6 commits
Commits
17 commits
b957eed
feat: add `compare` command for head-to-head model evaluation and ans…
NullPointerDepressiveDisorder Mar 21, 2026
d3f1f0a
feat: add token distribution capture and KL divergence calculation ac…
NullPointerDepressiveDisorder Mar 21, 2026
37e2248
feat: enhance backend auto-detection in CLI and refine model resoluti…
NullPointerDepressiveDisorder Mar 21, 2026
779218d
feat: implement HTML report generation for the compare command.
NullPointerDepressiveDisorder Mar 21, 2026
f1d59be
feat: add quant-sensitive prompt suite
NullPointerDepressiveDisorder Mar 21, 2026
f4d3c25
chore: remove old prompt-suites directory from the root.
NullPointerDepressiveDisorder Mar 21, 2026
12a252c
feat: implement token-ID alignment for accurate cross-backend KL dive…
NullPointerDepressiveDisorder Mar 22, 2026
87657b5
Here is a suggested commit message summarizing the provided diff:
NullPointerDepressiveDisorder Mar 22, 2026
d84530f
feat: limit MLX-LM backend logprobs extraction to top-K to prevent me…
NullPointerDepressiveDisorder Mar 22, 2026
8851a93
feat: allow string token IDs in distribution metadata to support KL d…
NullPointerDepressiveDisorder Mar 22, 2026
4e9cf72
fix: sanitize model labels for checkpoints and improve answer extract…
NullPointerDepressiveDisorder Mar 22, 2026
1be80d4
fix: enhance stability in MLX backend, filename generation, and model…
NullPointerDepressiveDisorder Mar 22, 2026
5b089e7
Here is a concise commit message summarizing the changes in the provi…
NullPointerDepressiveDisorder Mar 22, 2026
ac861ee
fix: skip KL divergence calculation when token IDs have different typ…
NullPointerDepressiveDisorder Mar 22, 2026
8e69c6a
fix: handle invalid logprobs in OpenAI backend and centralize filenam…
NullPointerDepressiveDisorder Mar 22, 2026
381f3af
feat: add dedicated model comparison section to HTML reports.
NullPointerDepressiveDisorder Mar 22, 2026
2895a9e
fix: handle empty comparisons in runner, resolve type checking errors…
NullPointerDepressiveDisorder Mar 22, 2026
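Several commit titles above mention KL divergence over token distributions, token-ID alignment across backends, top-K logprob limits, and skipping the calculation when token IDs have different types. The sketch below shows one way those ideas could fit together. It is not the PR's implementation: the function name `aligned_kl`, the dict-of-logprobs input shape, and the `floor_logprob` fallback for tokens missing from one side's top-K are all assumptions made for illustration.

```python
import math


def aligned_kl(p_logprobs, q_logprobs, floor_logprob=-20.0):
    """Approximate KL(P || Q) from two top-K {token_id: logprob} dicts.

    Hypothetical helper: names and the floor value are illustrative only.
    Returns None when the two sides cannot be compared.
    """
    if not p_logprobs or not q_logprobs:
        return None

    # Skip the comparison when token IDs are not of one shared type
    # (for example int IDs on one backend and string IDs on the other).
    p_types = {type(k) for k in p_logprobs}
    q_types = {type(k) for k in q_logprobs}
    if p_types != q_types or len(p_types) != 1:
        return None

    # Align on the union of token IDs; a token absent from one side's
    # top-K gets a small floor probability instead of zero.
    ids = sorted(set(p_logprobs) | set(q_logprobs))
    p = [math.exp(p_logprobs.get(t, floor_logprob)) for t in ids]
    q = [math.exp(q_logprobs.get(t, floor_logprob)) for t in ids]

    # Renormalize, since truncated top-K mass does not sum to 1.
    p_sum, q_sum = sum(p), sum(q)
    p = [x / p_sum for x in p]
    q = [x / q_sum for x in q]

    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

Because top-K truncation drops probability mass, a value computed this way is only an approximation of the full-vocabulary divergence, and the choice of floor affects the result.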
1 change: 1 addition & 0 deletions .pre-commit-config.yaml
@@ -34,6 +34,7 @@ repos:
- click>=8.1.0
- httpx>=0.27.0
- types-jinja2
- mlx-lm
args: ["--strict", "--ignore-missing-imports"]
pass_filenames: false
entry: mypy src/infer_check/
3 changes: 2 additions & 1 deletion README.md
@@ -80,7 +80,7 @@ infer-check sweep \
--output ./results/sweep/
```

-`--prompts` accepts either a bundled suite name (`reasoning`, `code`, `adversarial-numerics`, `determinism`, `long-context`) or a path to any `.jsonl` file.
+`--prompts` accepts either a bundled suite name (`reasoning`, `code`, `adversarial-numerics`, `determinism`, `long-context`, `quant-sensitive`) or a path to any `.jsonl` file.

The baseline is automatically run twice as a self-check — if it's not 50/50 identical, your comparison data is unreliable.
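As a reading aid for the self-check described in this README line, here is a minimal sketch of comparing two baseline runs prompt by prompt. The function name `self_check` and the list-of-strings input are assumptions for illustration; how infer-check actually stores and compares outputs is not shown in this diff.

```python
def self_check(first_run, second_run):
    """Count prompts whose outputs match exactly across two baseline runs.

    Hypothetical helper: returns (identical_count, total_count).
    """
    if len(first_run) != len(second_run):
        raise ValueError("both runs must cover the same prompts")
    identical = sum(a == b for a, b in zip(first_run, second_run))
    return identical, len(first_run)


def is_reliable(first_run, second_run):
    """The baseline counts as reliable only if every output is identical."""
    identical, total = self_check(first_run, second_run)
    return identical == total
```

With 50 prompts, a result of `(50, 50)` corresponds to the "50/50 identical" case the README describes; anything lower means the baseline itself is not deterministic.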

@@ -158,6 +158,7 @@ Curated prompts targeting known quantization failure modes:
| `code.jsonl` | 49 | Python, JSON, SQL generation |
| `adversarial-numerics.jsonl` | 30 | IEEE 754 edge cases, overflow, precision |
| `long-context.jsonl` | 10 | Tables and transcripts with recall questions |
| `quant-sensitive.jsonl` | 20 | Multi-digit arithmetic, long CoT, precise syntax |
| `determinism.jsonl` | 50 | High-entropy continuations for determinism testing |

All suites ship with the package — no need to clone the repo. Custom suites are JSONL files with one object per line:
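The README's own example of the per-line object is not visible in this hunk, so the sketch below deliberately assumes nothing about field names. It only illustrates the "one JSON object per line" shape by parsing JSONL text and rejecting lines that are not objects. The function name `load_suite` is hypothetical and is not part of infer-check.

```python
import json


def load_suite(text):
    """Parse JSONL text into a list of dicts, one per non-blank line.

    Hypothetical helper: it checks only that each line is a JSON object,
    not which keys the real suite format requires.
    """
    records = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines between records
        obj = json.loads(line)  # raises json.JSONDecodeError on bad JSON
        if not isinstance(obj, dict):
            raise ValueError(f"line {lineno}: expected a JSON object")
        records.append(obj)
    return records
```

A check like this can catch a malformed custom suite before it is passed to `--prompts`, though it cannot confirm the file has the keys the tool expects.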
30 changes: 0 additions & 30 deletions prompt-suites/adversarial-numerics.jsonl

This file was deleted.

49 changes: 0 additions & 49 deletions prompt-suites/code.jsonl

This file was deleted.

50 changes: 0 additions & 50 deletions prompt-suites/determinism.jsonl

This file was deleted.
