Conversation
…wer flip-rate analysis

- Introduce the `compare` CLI command and runner method to evaluate two arbitrary models or quantizations across different backends.
- Implement category-aware answer extraction (numeric, boolean, code, JSON) to calculate semantic flip rates, comparing functional answers instead of relying solely on text similarity.
- Add model resolution logic to support explicit backend prefixes (e.g., `ollama:`, `mlx:`, `gguf:`) and HuggingFace repo heuristics.
- Introduce the `CompareResult` data model for capturing and serializing evaluation metrics.
- Add comprehensive unit tests for answer extraction, model resolution, CLI, and runner logic.
…ross backends. Add `distributions` field to `InferenceResult` and update `openai_compat`, `mlx_lm`, and `llama_cpp` backends to parse and populate token probability distributions. Implement KL divergence computation in the runner's comparison logic, and include `mean_kl_divergence` statistics in both sweep and compare result summaries.
…on heuristics.

- Introduce `get_backend_for_model` to encapsulate model resolution and backend instantiation.
- Update `sweep`, `stress`, and `determinism` CLI commands to auto-detect the backend if not explicitly provided.
- Improve HuggingFace repository heuristics in `resolve.py` to better identify MLX and GGUF models, changing the default GGUF backend from `openai-compat` to `llama-cpp`.
- Fix a minor type checking issue in the `mlx_lm` backend adapter.
- Add and update unit tests for the new model resolution heuristics.
- Replace the placeholder in the CLI with an actual call to `generate_report` for the `compare` command.
- Add support for loading `CompareResult` instances in the HTML report generator.
Introduce `quant-sensitive.jsonl` with 20 new prompts targeting multi-digit arithmetic, long chain-of-thought, and precise syntax. Update the README to document the new suite in the bundled suites list and command examples.
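For illustration, a quant-sensitive prompt entry might look like the line below. The field names (`id`, `category`, `prompt`) are assumptions for the sketch; check the bundled suite for the actual schema.

```json
{"id": "arith-001", "category": "numeric", "prompt": "Compute 7391 * 4827. Reply with the number only."}
```

Multi-digit arithmetic like this is a good quantization probe because low-bit quants tend to flip individual digits while remaining fluent.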
Pull request overview
This PR adds a new “compare” workflow for head-to-head model/quant comparisons (including backend auto-detection), plus supporting analysis and prompt-suite updates to better measure quantization sensitivity.
Changes:
- Introduces a `compare` CLI command, `TestRunner.compare()`, and `CompareResult` to report flip-rate, similarity, (attempted) KL divergence, and per-category stats.
- Adds category-aware answer extraction (`analysis.answer_extract`) and new unit tests for resolver/compare/answer-extraction.
- Adds a bundled `quant-sensitive.jsonl` suite and updates docs; removes the repo-level `prompt-suites/*.jsonl` contents (migration toward packaged suites).
Reviewed changes
Copilot reviewed 22 out of 22 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| `src/infer_check/cli.py` | Adds `compare` command; switches other commands to backend auto-detection; prints compare summaries |
| `src/infer_check/runner.py` | Adds `TestRunner.compare()` and KL computation in `_compare()`; writes compare checkpoints |
| `src/infer_check/types.py` | Adds `CompareResult`; extends `InferenceResult` with `distributions` |
| `src/infer_check/resolve.py` | New model-spec resolver for backend + URL inference |
| `src/infer_check/backends/base.py` | Adds `get_backend_for_model()` (auto-resolve backend if not provided) |
| `src/infer_check/backends/mlx_lm.py` | Captures per-token distributions/logprobs from MLX generate-step; changes batch generation to sequential |
| `src/infer_check/backends/openai_compat.py` | Captures `top_logprobs` into `distributions` |
| `src/infer_check/backends/llama_cpp.py` | Requests `n_probs`; captures per-token probability lists into `distributions` |
| `src/infer_check/analysis/answer_extract.py` | New answer extraction + flip-rate logic |
| `src/infer_check/analysis/__init__.py` | Re-exports answer-extraction API |
| `src/infer_check/reporting/html.py` | Loads `CompareResult` files into report pipeline |
| `src/infer_check/prompt_suites/quant-sensitive.jsonl` | Adds new bundled quant-sensitive suite |
| `README.md` | Documents the new bundled suite |
| `tests/unit/test_compare.py` | Adds CLI + runner tests for compare (includes an unfinished stub test) |
| `tests/unit/test_resolve.py` | Adds resolver tests |
| `tests/unit/test_answer_extract.py` | Adds answer-extraction tests |
| `.pre-commit-config.yaml` | Adds `mlx-lm` to mypy hook env |
| `prompt-suites/*.jsonl` | Removes repo-level suite contents (migration/deprecation) |
Comments suppressed due to low confidence (1)
tests/unit/test_compare.py:343
`test_checkpoint_written` is currently a stub (only a docstring + commented-out code) and doesn't assert anything about checkpoint creation. Either implement the test (e.g., run `compare()` and assert a `compare_*_vs_*_*.json` file appears under `cache_dir`) or remove it to avoid dead/unfinished tests.
```python
def test_checkpoint_written(self, test_runner: TestRunner, tmp_path: Path) -> None:
    """A checkpoint file is written to cache_dir."""
    # prompts = _make_prompts(1)
    # resp = {prompts[0].id: "ok"}
```
```python
# KL Divergence computation
kl_div = None
if baseline.distributions and test.distributions:
    import numpy as np

    # Truncate to the shorter sequence
    min_len = min(len(baseline.distributions), len(test.distributions))
    kl_per_token = []

    for i in range(min_len):
        p = np.array(baseline.distributions[i])
        q = np.array(test.distributions[i])

        # Ensure we are dealing with probabilities
        # If they are logprobs (like from MLX), convert to probs
        # Actually, MLX gives logprobs. Llama.cpp gives probs.
        # Let's check if we can detect.
        # If sum of exp(logprobs) is close to 1, they are logprobs.
        # If sum is close to 1, they are probs.

        def to_probs(dist: Any) -> Any:
            import numpy as np

            if np.any(dist < 0):  # Likely logprobs
                # Numerical stability: shift logprobs
                dist = dist - np.max(dist)
                probs = np.exp(dist)
                return probs / np.sum(probs)
            else:  # Likely probs
                return dist / np.sum(dist)

        p_prob = to_probs(p)
        q_prob = to_probs(q)

        # Handle different distribution lengths (e.g. MLX full vocab vs Llama.cpp top 10)
        # This is tricky. If they are different lengths, we can't easily compute KL.
        # However, quants of the same model SHOULD have same vocab size.
        if len(p_prob) == len(q_prob):
            # Standard KL divergence: sum(p * log(p/q))
            # Using small epsilon to avoid log(0)
            eps = 1e-10
            p_prob = np.clip(p_prob, eps, 1.0)
            q_prob = np.clip(q_prob, eps, 1.0)
            kl = np.sum(p_prob * np.log(p_prob / q_prob))
            kl_per_token.append(float(kl))
        else:
            # If distributions are different sizes, we might be comparing
            # top-K from one model to full vocab of another.
            # This is not directly comparable.
            pass
```
KL divergence here assumes `baseline.distributions[i]` and `test.distributions[i]` are aligned vectors over the same token ordering. For backends that only expose top-K (or where token identities are not stored), equal-length arrays are not comparable (indices represent rank order, not token IDs), so this will produce meaningless KL values. Consider only computing KL when distributions are token-ID aligned (e.g., store per-step `{token_id: prob/logprob}` and align on union/intersection); otherwise skip and leave `kl_divergence=None`.
```python
# Top prob entry (index 0) contains the chosen token logprob.
probs = entry.get("probs", [])
if probs:
    # llama-server returns probabilities, not log-probabilities
    # in 'probs' list, but it's better than nothing.
    # Actually, KL divergence can be calculated from probs.
    # However, we need a consistent format.
    # MLX provides logprobs for the whole vocab.
    # llama-server provides probs for top N.
    logprobs.append(float(probs[0].get("prob", 0.0)))
    distributions.append([float(p.get("prob", 0.0)) for p in probs])
```
This stores prob values (0..1) into the `InferenceResult.logprobs` field, but the rest of the codebase (e.g., divergence analysis) treats `logprobs` as log-probabilities. Either convert with `math.log(prob)` (with epsilon clipping) or rename/store separately to avoid breaking KL/divergence calculations that assume log-space.
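The log-space fix could be as small as the helper below (a sketch; the `eps` value is arbitrary and only there to avoid `math.log(0.0)` raising a domain error):

```python
import math


def prob_to_logprob(prob: float, eps: float = 1e-10) -> float:
    """Convert a probability from llama-server's 'probs' field to log space.

    Clips at eps so a probability of exactly 0.0 maps to a large negative
    logprob instead of raising ValueError.
    """
    return math.log(max(prob, eps))
```

The loop would then append `prob_to_logprob(float(probs[0].get("prob", 0.0)))` instead of the raw probability.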
```diff
  tokens: list[str] = []
  logprobs: list[float] = []
  distributions: list[list[float]] = []

  start = time.perf_counter()

- for step_idx, (token, logprob_val) in enumerate(
+ for step_idx, (token, logprob_dist) in enumerate(
      generate_step(
          prompt=input_ids,
          model=self._model,
          sampler=sampler,
      )
  ):
+     dist_list: list[float] = logprob_dist.tolist()  # type: ignore[assignment]
+     distributions.append(dist_list)
```
`logprob_dist.tolist()` is likely the full-vocab logprob vector each step; appending that for every generated token will explode memory usage and make result/checkpoint JSONs enormous (since `InferenceResult` is persisted inside `CompareResult`/`ComparisonResult`). Prefer storing only top-K `(token_id, score)` per step, or compute KL online and discard per-token distributions before serialization (possibly behind an opt-in flag).
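One way to bound the memory is to keep only the top-K `(token_id, logprob)` pairs per step before appending. This is an illustrative sketch, not the PR's code:

```python
import numpy as np


def top_k_entries(logprob_dist: np.ndarray, k: int = 10) -> list[tuple[int, float]]:
    """Keep the k highest-scoring (token_id, logprob) pairs for one step,
    instead of persisting the full-vocab vector."""
    idx = np.argpartition(logprob_dist, -k)[-k:]     # top-k indices, unordered
    idx = idx[np.argsort(logprob_dist[idx])[::-1]]   # sort descending by score
    return [(int(i), float(logprob_dist[i])) for i in idx]
```

With a 128k-token vocab, storing 10 pairs per step instead of the full vector cuts the persisted data by roughly four orders of magnitude, and the token IDs make the distributions alignable across backends.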
| """ | ||
| # If either extraction failed, fall back to raw similarity. | ||
| if not a.value or not b.value: | ||
| return not a.value and not b.value |
The comment says to "fall back to raw similarity" when extraction fails, but the implementation treats any single empty extraction as a mismatch and both-empty as a match (no similarity check). Either update the comment/docstring to match the behavior, or implement the intended raw-similarity fallback for the one-empty case.
Suggested change:

```diff
-        return not a.value and not b.value
+        ratio = difflib.SequenceMatcher(None, a.value or "", b.value or "").ratio()
+        return ratio >= similarity_threshold
```
| """Model resolution for the compare command. | ||
|
|
||
| Takes a model spec string and resolves it to a backend type + model path. | ||
| Supports HuggingFace repo IDs, local paths, and Ollama-style tags. | ||
|
|
||
| Resolution rules (in order): | ||
| 1. Explicit prefix — ``ollama:llama3.1:8b-q4`` → openai-compat + Ollama | ||
| 2. Explicit prefix — ``mlx:mlx-community/...`` → mlx-lm | ||
| 3. Explicit prefix — ``gguf:/path/to/model.gguf`` → llama-cpp | ||
| 4. Local .gguf file — path exists and ends with ``.gguf`` → llama-cpp | ||
| 5. HF repo with ``-mlx`` or ``mlx-community/`` → mlx-lm | ||
| 6. HF repo with ``-GGUF`` or ``-gguf`` → llama-cpp (via Ollama) | ||
| 7. Fallback — assume mlx-lm (most common local Mac use case) | ||
| """ |
The module docstring frames this as "Model resolution for the compare command", but `resolve_model()` is now also used by `get_backend_for_model()` (and thus by `sweep`/`stress`/`determinism`). Updating the docstring (and rule #6's wording about "via Ollama") would prevent confusion about its broader role and defaults (e.g., llama-cpp default URL vs Ollama).
```text
    4bit=mlx-community/Llama-3.1-8B-Instruct-4bit,
    3bit=mlx-community/Llama-3.1-8B-Instruct-3bit" \\
    --backend mlx-lm \\
    --prompts ./prompt-suites/reasoning.jsonl
```
The sweep docstring example still uses --prompts ./prompt-suites/reasoning.jsonl, but the repo-level prompt-suites/*.jsonl files are removed in this PR and the CLI now prefers bundled suite names (via get_suite_path). Update the example to use --prompts reasoning (or a real on-disk path) to avoid pointing users at a non-existent file.
Suggested change:

```diff
-    --prompts ./prompt-suites/reasoning.jsonl
+    --prompts reasoning
```
```python
    report_path = output / f"report_{safe_a}_vs_{safe_b}.html"
    generate_report(output, report_path)
    console.print(f"[green]HTML report generated at {report_path}[/green]")
elif n > 0:
```
When --no-report is used, this always prints "No answer flips detected" whenever at least one prompt ran, even if flips were found (and even after rendering the flipped table). Gate this message on not flipped (or print a different summary) so the CLI output is not misleading.
Suggested change:

```diff
- elif n > 0:
+ elif n > 0 and not flipped:
```
```python
# ── Checkpoint ───────────────────────────────────────────────
timestamp_int = int(datetime.now(UTC).timestamp())
checkpoint_path = self.cache_dir / f"compare_{label_a}_vs_{label_b}_{timestamp_int}.json"
self._save_checkpoint(comparisons, checkpoint_path)
```
`checkpoint_path` interpolates `label_a`/`label_b` directly into the filename. If a user passes a label containing `/`, `..`, or other path-unfriendly characters, this can create unintended subdirectories or invalid paths. Sanitize/slugify labels here (similar to `safe_a`/`safe_b` in the CLI) before constructing the checkpoint filename.
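A minimal label sanitizer along these lines (illustrative; the CLI's existing `safe_a`/`safe_b` logic may well differ):

```python
import re


def slugify_label(label: str) -> str:
    """Replace path-hostile characters so a label is safe in a filename."""
    slug = re.sub(r"[^A-Za-z0-9._-]+", "_", label)
    # Guard against traversal-like results such as ".." after substitution,
    # and never return an empty string.
    return slug.strip("._") or "model"
```

Applied before the f-string, this keeps HuggingFace-style labels like `mlx-community/Llama-3.1-8B-4bit` from creating a subdirectory under `cache_dir`.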
```python
top_logprobs = lp_data.get("top_logprobs")
if top_logprobs:
    distributions = []
    for entry in top_logprobs:
        # entry is a dict of token: logprob
        distributions.append(list(entry.values()))
```
`distributions.append(list(entry.values()))` drops the token identities from `top_logprobs` and relies on dict value order, which makes the per-step distribution impossible to align across runs/backends (and unstable if server ordering changes). Store token→logprob mappings (or a list of `(token, logprob)` pairs in a deterministic order) so downstream KL/divergence computations can align by token identity or safely skip.
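A small sketch of the suggested fix: keep token identities and impose a deterministic order instead of `list(entry.values())`:

```python
def distribution_from_top_logprobs(entry: dict[str, float]) -> list[tuple[str, float]]:
    """Convert one top_logprobs entry into (token, logprob) pairs,
    sorted by descending logprob (ties broken by token) for determinism."""
    return sorted(entry.items(), key=lambda kv: (-kv[1], kv[0]))
```

With tokens preserved, a downstream comparison can align two steps on the intersection of their token sets rather than on positional index.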
Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️
This pull request updates the project's prompt suite infrastructure and documentation. The most significant changes are the addition of a new "quant-sensitive" prompt suite, documentation updates to reflect it, and the removal of all prompts from several existing suite files, likely as part of a migration, refactor, or deprecation. There is also a minor dependency update to support new model integrations.
Prompt suite and documentation updates:
- Added a `quant-sensitive.jsonl` prompt suite (20 prompts) targeting multi-digit arithmetic, long chain-of-thought, and precise syntax; updated the documentation to include this suite as a bundled option. [1] [2]
- Removed all prompts from `adversarial-numerics.jsonl`, `code.jsonl`, and `determinism.jsonl`, effectively emptying these files. This may indicate a migration, deprecation, or preparation for new content. [1] [2] [3]

Dependency management:

- Added `mlx-lm` as a dependency to the `mypy` pre-commit hook environment in `.pre-commit-config.yaml`, likely to enable type checking or integration for the new model.