Conversation
…wer flip-rate analysis

- Introduce the `compare` CLI command and runner method to evaluate two arbitrary models or quantizations across different backends.
- Implement category-aware answer extraction (numeric, boolean, code, JSON) to calculate semantic flip rates, comparing functional answers instead of relying solely on text similarity.
- Add model resolution logic to support explicit backend prefixes (e.g., `ollama:`, `mlx:`, `gguf:`) and HuggingFace repo heuristics.
- Introduce the `CompareResult` data model for capturing and serializing evaluation metrics.
- Add comprehensive unit tests for answer extraction, model resolution, CLI, and runner logic.
…ross backends. Add `distributions` field to `InferenceResult` and update `openai_compat`, `mlx_lm`, and `llama_cpp` backends to parse and populate token probability distributions. Implement KL divergence computation in the runner's comparison logic, and include `mean_kl_divergence` statistics in both sweep and compare result summaries.
…on heuristics.

- Introduce `get_backend_for_model` to encapsulate model resolution and backend instantiation.
- Update `sweep`, `stress`, and `determinism` CLI commands to auto-detect the backend if not explicitly provided.
- Improve HuggingFace repository heuristics in `resolve.py` to better identify MLX and GGUF models, changing the default GGUF backend from `openai-compat` to `llama-cpp`.
- Fix a minor type checking issue in the `mlx_lm` backend adapter.
- Add and update unit tests for the new model resolution heuristics.
- Replace the placeholder in the CLI with an actual call to `generate_report` for the `compare` command.
- Add support for loading `CompareResult` instances in the HTML report generator.
Introduce `quant-sensitive.jsonl` with 20 new prompts targeting multi-digit arithmetic, long chain-of-thought, and precise syntax. Update the README to document the new suite in the bundled suites list and command examples.
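For illustration, a quant-sensitive prompt entry might look like the line below. The field names (`id`, `category`, `prompt`) are assumptions for the sketch; check the bundled suite for the actual schema.

```json
{"id": "arith-001", "category": "numeric", "prompt": "Compute 7391 * 4827. Reply with the number only."}
```

Multi-digit arithmetic like this is a good quantization probe because low-bit quants tend to flip individual digits while remaining fluent.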
Pull request overview
This PR adds a new “compare” workflow for head-to-head model/quant comparisons (including backend auto-detection), plus supporting analysis and prompt-suite updates to better measure quantization sensitivity.
Changes:
- Introduces a `compare` CLI command, `TestRunner.compare()`, and `CompareResult` to report flip-rate, similarity, (attempted) KL divergence, and per-category stats.
- Adds category-aware answer extraction (`analysis.answer_extract`) and new unit tests for resolver/compare/answer-extraction.
- Adds a bundled `quant-sensitive.jsonl` suite and updates docs; removes the repo-level `prompt-suites/*.jsonl` contents (migration toward packaged suites).
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| `src/infer_check/cli.py` | Adds `compare` command; switches other commands to backend auto-detection; prints compare summaries |
| `src/infer_check/runner.py` | Adds `TestRunner.compare()` and KL computation in `_compare()`; writes compare checkpoints |
| `src/infer_check/types.py` | Adds `CompareResult`; extends `InferenceResult` with `distributions` |
| `src/infer_check/resolve.py` | New model-spec resolver for backend + URL inference |
| `src/infer_check/backends/base.py` | Adds `get_backend_for_model()` (auto-resolve backend if not provided) |
| `src/infer_check/backends/mlx_lm.py` | Captures per-token distributions/logprobs from MLX generate-step; changes batch generation to sequential |
| `src/infer_check/backends/openai_compat.py` | Captures `top_logprobs` into `distributions` |
| `src/infer_check/backends/llama_cpp.py` | Requests `n_probs`; captures per-token probability lists into `distributions` |
| `src/infer_check/analysis/answer_extract.py` | New answer extraction + flip-rate logic |
| `src/infer_check/analysis/__init__.py` | Re-exports answer-extraction API |
| `src/infer_check/reporting/html.py` | Loads `CompareResult` files into report pipeline |
| `src/infer_check/prompt_suites/quant-sensitive.jsonl` | Adds new bundled quant-sensitive suite |
| `README.md` | Documents the new bundled suite |
| `tests/unit/test_compare.py` | Adds CLI + runner tests for compare (includes an unfinished stub test) |
| `tests/unit/test_resolve.py` | Adds resolver tests |
| `tests/unit/test_answer_extract.py` | Adds answer-extraction tests |
| `.pre-commit-config.yaml` | Adds `mlx-lm` to mypy hook env |
| `prompt-suites/*.jsonl` | Removes repo-level suite contents (migration/deprecation) |
Comments suppressed due to low confidence (1)
tests/unit/test_compare.py:343
`test_checkpoint_written` is currently a stub (only a docstring + commented-out code) and doesn't assert anything about checkpoint creation. Either implement the test (e.g., run `compare()` and assert a `compare_*_vs_*_*.json` file appears under `cache_dir`) or remove it to avoid dead/unfinished tests.
```python
def test_checkpoint_written(self, test_runner: TestRunner, tmp_path: Path) -> None:
    """A checkpoint file is written to cache_dir."""
    # prompts = _make_prompts(1)
    # resp = {prompts[0].id: "ok"}
```
```python
# KL Divergence computation
kl_div = None
if baseline.distributions and test.distributions:
    import numpy as np

    # Truncate to the shorter sequence
    min_len = min(len(baseline.distributions), len(test.distributions))
    kl_per_token = []

    for i in range(min_len):
        p = np.array(baseline.distributions[i])
        q = np.array(test.distributions[i])

        # Ensure we are dealing with probabilities
        # If they are logprobs (like from MLX), convert to probs
        # Actually, MLX gives logprobs. Llama.cpp gives probs.
        # Let's check if we can detect.
        # If sum of exp(logprobs) is close to 1, they are logprobs.
        # If sum is close to 1, they are probs.

        def to_probs(dist: Any) -> Any:
            import numpy as np

            if np.any(dist < 0):  # Likely logprobs
                # Numerical stability: shift logprobs
                dist = dist - np.max(dist)
                probs = np.exp(dist)
                return probs / np.sum(probs)
            else:  # Likely probs
                return dist / np.sum(dist)

        p_prob = to_probs(p)
        q_prob = to_probs(q)

        # Handle different distribution lengths (e.g. MLX full vocab vs Llama.cpp top 10)
        # This is tricky. If they are different lengths, we can't easily compute KL.
        # However, quants of the same model SHOULD have same vocab size.
        if len(p_prob) == len(q_prob):
            # Standard KL divergence: sum(p * log(p/q))
            # Using small epsilon to avoid log(0)
            eps = 1e-10
            p_prob = np.clip(p_prob, eps, 1.0)
            q_prob = np.clip(q_prob, eps, 1.0)
            kl = np.sum(p_prob * np.log(p_prob / q_prob))
            kl_per_token.append(float(kl))
        else:
            # If distributions are different sizes, we might be comparing
            # top-K from one model to full vocab of another.
            # This is not directly comparable.
            pass
```
KL divergence here assumes `baseline.distributions[i]` and `test.distributions[i]` are aligned vectors over the same token ordering. For backends that only expose top-K (or where token identities are not stored), equal-length arrays are not comparable (indices represent rank order, not token IDs), so this will produce meaningless KL values. Consider only computing KL when distributions are token-ID aligned (e.g., store per-step `{token_id: prob/logprob}` and align on union/intersection); otherwise skip and leave `kl_divergence=None`.
```python
# Top prob entry (index 0) contains the chosen token logprob.
probs = entry.get("probs", [])
if probs:
    # llama-server returns probabilities, not log-probabilities
    # in 'probs' list, but it's better than nothing.
    # Actually, KL divergence can be calculated from probs.
    # However, we need a consistent format.
    # MLX provides logprobs for the whole vocab.
    # llama-server provides probs for top N.
    logprobs.append(float(probs[0].get("prob", 0.0)))
    distributions.append([float(p.get("prob", 0.0)) for p in probs])
```
This stores prob values (0..1) into the `InferenceResult.logprobs` field, but the rest of the codebase (e.g., divergence analysis) treats `logprobs` as log-probabilities. Either convert with `math.log(prob)` (with epsilon clipping) or rename/store separately to avoid breaking KL/divergence calculations that assume log-space.
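The log-space fix could be as small as the helper below (a sketch; the `eps` value is arbitrary and only there to avoid `math.log(0.0)` raising a domain error):

```python
import math


def prob_to_logprob(prob: float, eps: float = 1e-10) -> float:
    """Convert a probability from llama-server's 'probs' field to log space.

    Clips at eps so a probability of exactly 0.0 maps to a large negative
    logprob instead of raising ValueError.
    """
    return math.log(max(prob, eps))
```

The loop would then append `prob_to_logprob(float(probs[0].get("prob", 0.0)))` instead of the raw probability.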
```diff
  tokens: list[str] = []
  logprobs: list[float] = []
  distributions: list[list[float]] = []

  start = time.perf_counter()

- for step_idx, (token, logprob_val) in enumerate(
+ for step_idx, (token, logprob_dist) in enumerate(
      generate_step(
          prompt=input_ids,
          model=self._model,
          sampler=sampler,
      )
  ):
+     dist_list: list[float] = logprob_dist.tolist()  # type: ignore[assignment]
+     distributions.append(dist_list)
```
`logprob_dist.tolist()` is likely the full-vocab logprob vector each step; appending that for every generated token will explode memory usage and make result/checkpoint JSONs enormous (since `InferenceResult` is persisted inside `CompareResult`/`ComparisonResult`). Prefer storing only top-K `(token_id, score)` per step, or compute KL online and discard per-token distributions before serialization (possibly behind an opt-in flag).
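One way to bound the memory is to keep only the top-K `(token_id, logprob)` pairs per step before appending. This is an illustrative sketch, not the PR's code:

```python
import numpy as np


def top_k_entries(logprob_dist: np.ndarray, k: int = 10) -> list[tuple[int, float]]:
    """Keep the k highest-scoring (token_id, logprob) pairs for one step,
    instead of persisting the full-vocab vector."""
    idx = np.argpartition(logprob_dist, -k)[-k:]     # top-k indices, unordered
    idx = idx[np.argsort(logprob_dist[idx])[::-1]]   # sort descending by score
    return [(int(i), float(logprob_dist[i])) for i in idx]
```

With a 128k-token vocab, storing 10 pairs per step instead of the full vector cuts the persisted data by roughly four orders of magnitude, and the token IDs make the distributions alignable across backends.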
| """ | ||
| # If either extraction failed, fall back to raw similarity. | ||
| if not a.value or not b.value: | ||
| return not a.value and not b.value |
The comment says to "fall back to raw similarity" when extraction fails, but the implementation treats any single empty extraction as a mismatch and both-empty as a match (no similarity check). Either update the comment/docstring to match the behavior, or implement the intended raw-similarity fallback for the one-empty case.
Suggested change:

```diff
-        return not a.value and not b.value
+        ratio = difflib.SequenceMatcher(None, a.value or "", b.value or "").ratio()
+        return ratio >= similarity_threshold
```
| """Model resolution for the compare command. | ||
|
|
||
| Takes a model spec string and resolves it to a backend type + model path. | ||
| Supports HuggingFace repo IDs, local paths, and Ollama-style tags. | ||
|
|
||
| Resolution rules (in order): | ||
| 1. Explicit prefix — ``ollama:llama3.1:8b-q4`` → openai-compat + Ollama | ||
| 2. Explicit prefix — ``mlx:mlx-community/...`` → mlx-lm | ||
| 3. Explicit prefix — ``gguf:/path/to/model.gguf`` → llama-cpp | ||
| 4. Local .gguf file — path exists and ends with ``.gguf`` → llama-cpp | ||
| 5. HF repo with ``-mlx`` or ``mlx-community/`` → mlx-lm | ||
| 6. HF repo with ``-GGUF`` or ``-gguf`` → llama-cpp (via Ollama) | ||
| 7. Fallback — assume mlx-lm (most common local Mac use case) | ||
| """ |
The module docstring frames this as "Model resolution for the compare command", but `resolve_model()` is now also used by `get_backend_for_model()` (and thus by `sweep`/`stress`/`determinism`). Updating the docstring (and rule #6's wording about "via Ollama") would prevent confusion about its broader role and defaults (e.g., llama-cpp default URL vs Ollama).
```text
    4bit=mlx-community/Llama-3.1-8B-Instruct-4bit,
    3bit=mlx-community/Llama-3.1-8B-Instruct-3bit" \\
    --backend mlx-lm \\
    --prompts ./prompt-suites/reasoning.jsonl
```
The sweep docstring example still uses --prompts ./prompt-suites/reasoning.jsonl, but the repo-level prompt-suites/*.jsonl files are removed in this PR and the CLI now prefers bundled suite names (via get_suite_path). Update the example to use --prompts reasoning (or a real on-disk path) to avoid pointing users at a non-existent file.
Suggested change:

```diff
-    --prompts ./prompt-suites/reasoning.jsonl
+    --prompts reasoning
```
```python
    report_path = output / f"report_{safe_a}_vs_{safe_b}.html"
    generate_report(output, report_path)
    console.print(f"[green]HTML report generated at {report_path}[/green]")
elif n > 0:
```
When --no-report is used, this always prints "No answer flips detected" whenever at least one prompt ran, even if flips were found (and even after rendering the flipped table). Gate this message on not flipped (or print a different summary) so the CLI output is not misleading.
Suggested change:

```diff
- elif n > 0:
+ elif n > 0 and not flipped:
```
```python
# ── Checkpoint ───────────────────────────────────────────────
timestamp_int = int(datetime.now(UTC).timestamp())
checkpoint_path = self.cache_dir / f"compare_{label_a}_vs_{label_b}_{timestamp_int}.json"
self._save_checkpoint(comparisons, checkpoint_path)
```
`checkpoint_path` interpolates `label_a`/`label_b` directly into the filename. If a user passes a label containing `/`, `..`, or other path-unfriendly characters, this can create unintended subdirectories or invalid paths. Sanitize/slugify labels here (similar to `safe_a`/`safe_b` in the CLI) before constructing the checkpoint filename.
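A minimal label sanitizer along these lines (illustrative; the CLI's existing `safe_a`/`safe_b` logic may well differ):

```python
import re


def slugify_label(label: str) -> str:
    """Replace path-hostile characters so a label is safe in a filename."""
    slug = re.sub(r"[^A-Za-z0-9._-]+", "_", label)
    # Guard against traversal-like results such as ".." after substitution,
    # and never return an empty string.
    return slug.strip("._") or "model"
```

Applied before the f-string, this keeps HuggingFace-style labels like `mlx-community/Llama-3.1-8B-4bit` from creating a subdirectory under `cache_dir`.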
```python
top_logprobs = lp_data.get("top_logprobs")
if top_logprobs:
    distributions = []
    for entry in top_logprobs:
        # entry is a dict of token: logprob
        distributions.append(list(entry.values()))
```
`distributions.append(list(entry.values()))` drops the token identities from `top_logprobs` and relies on dict value order, which makes the per-step distribution impossible to align across runs/backends (and unstable if server ordering changes). Store token→logprob mappings (or a list of `(token, logprob)` pairs in a deterministic order) so downstream KL/divergence computations can align by token identity or safely skip.
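A small sketch of the suggested fix: keep token identities and impose a deterministic order instead of `list(entry.values())`:

```python
def distribution_from_top_logprobs(entry: dict[str, float]) -> list[tuple[str, float]]:
    """Convert one top_logprobs entry into (token, logprob) pairs,
    sorted by descending logprob (ties broken by token) for determinism."""
    return sorted(entry.items(), key=lambda kv: (-kv[1], kv[0]))
```

With tokens preserved, a downstream comparison can align two steps on the intersection of their token sets rather than on positional index.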
Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️
This pull request updates the project's prompt suite infrastructure and documentation. The most significant changes are the addition of a new "quant-sensitive" prompt suite, documentation updates to reflect it, and the removal of all prompts from several existing suite files, likely as part of a migration, refactor, or deprecation. There is also a minor dependency update to support new model integrations.
Prompt suite and documentation updates:
- Added a `quant-sensitive.jsonl` prompt suite (20 prompts) targeting multi-digit arithmetic, long chain-of-thought, and precise syntax; updated the documentation to include this suite as a bundled option. [1] [2]
- Removed all prompts from `adversarial-numerics.jsonl`, `code.jsonl`, and `determinism.jsonl`, effectively emptying these files. This may indicate a migration, deprecation, or preparation for new content. [1] [2] [3]

Dependency management:

- Added `mlx-lm` as a dependency to the `mypy` pre-commit hook environment in `.pre-commit-config.yaml`, likely to enable type checking or integration for the new model.