
Feature/quant compare mode#4

Open
NullPointerDepressiveDisorder wants to merge 6 commits into main from feature/quant-compare-mode

Conversation

@NullPointerDepressiveDisorder
Owner

This pull request updates the prompt suite infrastructure and documentation for the project, with a focus on prompt suite management and test coverage. The most significant changes are the addition of a new "quant-sensitive" prompt suite, updates to documentation to reflect this, and the removal of all prompts from several existing suite files—likely as part of a migration, refactor, or deprecation. There is also a minor dependency update to support new model integrations.

Prompt suite and documentation updates:

  • Added a new quant-sensitive.jsonl prompt suite (20 prompts) targeting multi-digit arithmetic, long chain-of-thought, and precise syntax; updated the documentation to include this suite as a bundled option.
  • Removed all prompts from adversarial-numerics.jsonl, code.jsonl, and determinism.jsonl, effectively emptying these files. This may indicate a migration, deprecation, or preparation for new content.

Dependency management:

  • Added mlx-lm as a dependency to the mypy pre-commit hook environment in .pre-commit-config.yaml, likely to enable type checking or integration for the new model.

…wer flip-rate analysis

- Introduce the `compare` CLI command and runner method to evaluate two arbitrary models or quantizations across different backends.
- Implement category-aware answer extraction (numeric, boolean, code, JSON) to calculate semantic flip rates, comparing functional answers instead of relying solely on text similarity.
- Add model resolution logic to support explicit backend prefixes (e.g., `ollama:`, `mlx:`, `gguf:`) and HuggingFace repo heuristics.
- Introduce the `CompareResult` data model for capturing and serializing evaluation metrics.
- Add comprehensive unit tests for answer extraction, model resolution, CLI, and runner logic.
…ross backends.

Add `distributions` field to `InferenceResult` and update `openai_compat`, `mlx_lm`, and `llama_cpp` backends to parse and populate token probability distributions.
Implement KL divergence computation in the runner's comparison logic, and include `mean_kl_divergence` statistics in both sweep and compare result summaries.
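For reference, the statistic being aggregated here is the discrete KL divergence, sum over tokens of p * log(p / q). A minimal pure-Python sketch, assuming both inputs are same-length distributions over the same support (the PR's implementation operates on per-token NumPy arrays):

```python
import math

def kl_divergence(p: list[float], q: list[float], eps: float = 1e-10) -> float:
    """Sketch of sum(p * log(p / q)) with normalization and epsilon clipping."""
    zp, zq = sum(p), sum(q)
    total = 0.0
    for pi, qi in zip(p, q):
        pi = max(pi / zp, eps)  # normalize and clip to avoid log(0)
        qi = max(qi / zq, eps)
        total += pi * math.log(pi / qi)
    return total

print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # → 0.0
```

Identical distributions give 0; the value grows as the test model's token probabilities drift from the baseline's.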
…on heuristics.

- Introduce `get_backend_for_model` to encapsulate model resolution and backend instantiation.
- Update `sweep`, `stress`, and `determinism` CLI commands to auto-detect the backend if not explicitly provided.
- Improve HuggingFace repository heuristics in `resolve.py` to better identify MLX and GGUF models, changing the default GGUF backend from `openai-compat` to `llama-cpp`.
- Fix a minor type checking issue in the `mlx_lm` backend adapter.
- Add and update unit tests for the new model resolution heuristics.
- Replace the placeholder in the CLI with an actual call to `generate_report` for the `compare` command.
- Add support for loading `CompareResult` instances in the HTML report generator.
Introduce `quant-sensitive.jsonl` with 20 new prompts targeting multi-digit arithmetic, long chain-of-thought, and precise syntax. Update the README to document the new suite in the bundled suites list and command examples.
Copilot AI review requested due to automatic review settings March 21, 2026 01:42

Copilot AI left a comment


Pull request overview

This PR adds a new “compare” workflow for head-to-head model/quant comparisons (including backend auto-detection), plus supporting analysis and prompt-suite updates to better measure quantization sensitivity.

Changes:

  • Introduces a compare CLI command, TestRunner.compare(), and CompareResult to report flip-rate, similarity, (attempted) KL divergence, and per-category stats.
  • Adds category-aware answer extraction (analysis.answer_extract) and new unit tests for resolver/compare/answer-extraction.
  • Adds a bundled quant-sensitive.jsonl suite and updates docs; removes the repo-level prompt-suites/*.jsonl contents (migration toward packaged suites).

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 9 comments.

File summary:

  • src/infer_check/cli.py: Adds compare command; switches other commands to backend auto-detection; prints compare summaries
  • src/infer_check/runner.py: Adds TestRunner.compare() and KL computation in _compare(); writes compare checkpoints
  • src/infer_check/types.py: Adds CompareResult; extends InferenceResult with distributions
  • src/infer_check/resolve.py: New model-spec resolver for backend + URL inference
  • src/infer_check/backends/base.py: Adds get_backend_for_model() (auto-resolve backend if not provided)
  • src/infer_check/backends/mlx_lm.py: Captures per-token distributions/logprobs from MLX generate-step; changes batch generation to sequential
  • src/infer_check/backends/openai_compat.py: Captures top_logprobs into distributions
  • src/infer_check/backends/llama_cpp.py: Requests n_probs; captures per-token probability lists into distributions
  • src/infer_check/analysis/answer_extract.py: New answer extraction + flip-rate logic
  • src/infer_check/analysis/__init__.py: Re-exports answer-extraction API
  • src/infer_check/reporting/html.py: Loads CompareResult files into report pipeline
  • src/infer_check/prompt_suites/quant-sensitive.jsonl: Adds new bundled quant-sensitive suite
  • README.md: Documents the new bundled suite
  • tests/unit/test_compare.py: Adds CLI + runner tests for compare (includes an unfinished stub test)
  • tests/unit/test_resolve.py: Adds resolver tests
  • tests/unit/test_answer_extract.py: Adds answer-extraction tests
  • .pre-commit-config.yaml: Adds mlx-lm to mypy hook env
  • prompt-suites/*.jsonl: Removes repo-level suite contents (migration/deprecation)
Comments suppressed due to low confidence (1)

tests/unit/test_compare.py:343

  • test_checkpoint_written is currently a stub (only a docstring + commented-out code) and doesn't assert anything about checkpoint creation. Either implement the test (e.g., run compare() and assert a compare_*_vs_*_*.json file appears under cache_dir) or remove it to avoid dead/unfinished tests.
    def test_checkpoint_written(self, test_runner: TestRunner, tmp_path: Path) -> None:
        """A checkpoint file is written to cache_dir."""
        # prompts = _make_prompts(1)
        # resp = {prompts[0].id: "ok"}
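One way to finish that stub is a glob-based assertion over the cache directory. The helper below is a self-contained sketch of the pattern, not project code: `assert_checkpoint_written` is a hypothetical name, and the demo fabricates the checkpoint file that a real `compare()` run would produce.

```python
from pathlib import Path
import tempfile

def assert_checkpoint_written(cache_dir: Path, label_a: str, label_b: str) -> Path:
    """Hypothetical helper: assert exactly one compare checkpoint exists."""
    matches = list(cache_dir.glob(f"compare_{label_a}_vs_{label_b}_*.json"))
    assert len(matches) == 1, f"expected one checkpoint, found {matches}"
    return matches[0]

# Self-contained demo: simulate the checkpoint a compare() run would leave behind.
with tempfile.TemporaryDirectory() as d:
    cache = Path(d)
    (cache / "compare_fp16_vs_4bit_1711000000.json").write_text("{}")
    print(assert_checkpoint_written(cache, "fp16", "4bit").name)
```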



Comment on lines +81 to +130
# KL Divergence computation
kl_div = None
if baseline.distributions and test.distributions:
    import numpy as np

    # Truncate to the shorter sequence
    min_len = min(len(baseline.distributions), len(test.distributions))
    kl_per_token = []

    for i in range(min_len):
        p = np.array(baseline.distributions[i])
        q = np.array(test.distributions[i])

        # Ensure we are dealing with probabilities. MLX gives logprobs;
        # llama.cpp gives probs. Detect which by sign: logprobs are <= 0.
        def to_probs(dist: Any) -> Any:
            import numpy as np

            if np.any(dist < 0):  # Likely logprobs
                # Numerical stability: shift logprobs before exponentiating
                dist = dist - np.max(dist)
                probs = np.exp(dist)
                return probs / np.sum(probs)
            else:  # Likely probs
                return dist / np.sum(dist)

        p_prob = to_probs(p)
        q_prob = to_probs(q)

        # Different distribution lengths (e.g. MLX full vocab vs llama.cpp
        # top 10) are not directly comparable; quants of the same model
        # SHOULD share a vocab size.
        if len(p_prob) == len(q_prob):
            # Standard KL divergence: sum(p * log(p/q)),
            # with a small epsilon to avoid log(0)
            eps = 1e-10
            p_prob = np.clip(p_prob, eps, 1.0)
            q_prob = np.clip(q_prob, eps, 1.0)
            kl = np.sum(p_prob * np.log(p_prob / q_prob))
            kl_per_token.append(float(kl))
        else:
            # Top-K from one model vs full vocab of another is not
            # directly comparable; skip this step.
            pass

Copilot AI Mar 21, 2026


KL divergence here assumes baseline.distributions[i] and test.distributions[i] are aligned vectors over the same token ordering. For backends that only expose top-K (or where token identities are not stored), equal-length arrays are not comparable (indices represent rank order, not token IDs), so this will produce meaningless KL values. Consider only computing KL when distributions are token-ID aligned (e.g., store per-step {token_id: prob/logprob} and align on union/intersection), otherwise skip and leave kl_divergence=None.

Comment on lines 85 to +95
# Top prob entry (index 0) contains the chosen token logprob.
probs = entry.get("probs", [])
if probs:
    # llama-server returns probabilities (not log-probabilities) in the
    # 'probs' list. KL divergence can be computed from probs, but the
    # formats differ: MLX provides logprobs for the whole vocab, while
    # llama-server provides probs for the top N.
    logprobs.append(float(probs[0].get("prob", 0.0)))
    distributions.append([float(p.get("prob", 0.0)) for p in probs])

Copilot AI Mar 21, 2026


This stores prob values (0..1) into the InferenceResult.logprobs field, but the rest of the codebase (e.g., divergence analysis) treats logprobs as log-probabilities. Either convert with math.log(prob) (with epsilon clipping) or rename/store separately to avoid breaking KL / divergence calculations that assume log-space.
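The conversion the review asks for is a one-liner; the sketch below shows the epsilon-clipped `math.log` form (the `probs_to_logprobs` name is illustrative):

```python
import math

EPS = 1e-10

def probs_to_logprobs(probs: list[float]) -> list[float]:
    # Clip before taking the log so zero-probability entries do not become -inf.
    return [math.log(max(p, EPS)) for p in probs]

print(probs_to_logprobs([1.0, 0.5])[0])  # → 0.0
```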

Comment on lines 201 to +216
tokens: list[str] = []
logprobs: list[float] = []
distributions: list[list[float]] = []

start = time.perf_counter()

-for step_idx, (token, logprob_val) in enumerate(
+for step_idx, (token, logprob_dist) in enumerate(
    generate_step(
        prompt=input_ids,
        model=self._model,
        sampler=sampler,
    )
):
    dist_list: list[float] = logprob_dist.tolist()  # type: ignore[assignment]
    distributions.append(dist_list)


Copilot AI Mar 21, 2026


logprob_dist.tolist() is likely the full-vocab logprob vector each step; appending that for every generated token will explode memory usage and make result/checkpoint JSONs enormous (since InferenceResult is persisted inside CompareResult/ComparisonResult). Prefer storing only top-K (token_id, score) per step, or compute KL online and discard per-token distributions before serialization (possibly behind an opt-in flag).
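The top-K truncation the review proposes is cheap to do per step. A sketch of the idea (the `top_k_pairs` name is illustrative; a real version would operate on the MLX logprob array before `tolist()`):

```python
import heapq

def top_k_pairs(scores: list[float], k: int = 10) -> list[tuple[int, float]]:
    """Keep only (token_id, score) for the k highest-scoring entries."""
    return heapq.nlargest(k, enumerate(scores), key=lambda t: t[1])

print(top_k_pairs([0.1, 0.7, 0.2], k=2))  # → [(1, 0.7), (2, 0.2)]
```

Storing ~10 pairs per step instead of a full-vocab vector (often 100k+ floats) keeps checkpoint JSONs tractable while preserving the token identities needed for aligned KL.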

    """
    # If either extraction failed, fall back to raw similarity.
    if not a.value or not b.value:
        return not a.value and not b.value

Copilot AI Mar 21, 2026


The comment says to "fall back to raw similarity" when extraction fails, but the implementation treats any single empty extraction as a mismatch and both-empty as a match (no similarity check). Either update the comment/docstring to match the behavior, or implement the intended raw-similarity fallback for the one-empty case.

Suggested change

-        return not a.value and not b.value
+        ratio = difflib.SequenceMatcher(None, a.value or "", b.value or "").ratio()
+        return ratio >= similarity_threshold

Comment on lines +1 to +14
"""Model resolution for the compare command.

Takes a model spec string and resolves it to a backend type + model path.
Supports HuggingFace repo IDs, local paths, and Ollama-style tags.

Resolution rules (in order):
1. Explicit prefix — ``ollama:llama3.1:8b-q4`` → openai-compat + Ollama
2. Explicit prefix — ``mlx:mlx-community/...`` → mlx-lm
3. Explicit prefix — ``gguf:/path/to/model.gguf`` → llama-cpp
4. Local .gguf file — path exists and ends with ``.gguf`` → llama-cpp
5. HF repo with ``-mlx`` or ``mlx-community/`` → mlx-lm
6. HF repo with ``-GGUF`` or ``-gguf`` → llama-cpp (via Ollama)
7. Fallback — assume mlx-lm (most common local Mac use case)
"""
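The documented resolution order could be sketched as a simple prefix/suffix cascade. This is an illustrative simplification, not the PR's `resolve_model()`: it collapses the result to a backend string and omits rule 4's filesystem existence check.

```python
def resolve_backend(spec: str) -> str:
    """Simplified sketch of the documented resolution order."""
    if spec.startswith("ollama:"):
        return "openai-compat"          # rule 1: Ollama via OpenAI-compat API
    if spec.startswith("mlx:"):
        return "mlx-lm"                 # rule 2
    if spec.startswith("gguf:") or spec.endswith(".gguf"):
        return "llama-cpp"              # rules 3-4 (no path-exists check here)
    if "mlx-community/" in spec or spec.endswith("-mlx"):
        return "mlx-lm"                 # rule 5
    if spec.lower().endswith("-gguf"):
        return "llama-cpp"              # rule 6
    return "mlx-lm"                     # rule 7: common local Mac default

print(resolve_backend("gguf:/path/to/model.gguf"))  # → llama-cpp
```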

Copilot AI Mar 21, 2026


The module docstring frames this as "Model resolution for the compare command", but resolve_model() is now also used by get_backend_for_model() (and thus by sweep/stress/determinism). Updating the docstring (and rule #6 wording about "via Ollama") would prevent confusion about its broader role and defaults (e.g., llama-cpp default URL vs Ollama).

4bit=mlx-community/Llama-3.1-8B-Instruct-4bit,
3bit=mlx-community/Llama-3.1-8B-Instruct-3bit" \\
--backend mlx-lm \\
--prompts ./prompt-suites/reasoning.jsonl

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sweep docstring example still uses --prompts ./prompt-suites/reasoning.jsonl, but the repo-level prompt-suites/*.jsonl files are removed in this PR and the CLI now prefers bundled suite names (via get_suite_path). Update the example to use --prompts reasoning (or a real on-disk path) to avoid pointing users at a non-existent file.

Suggested change

-    --prompts ./prompt-suites/reasoning.jsonl
+    --prompts reasoning

        report_path = output / f"report_{safe_a}_vs_{safe_b}.html"
        generate_report(output, report_path)
        console.print(f"[green]HTML report generated at {report_path}[/green]")
    elif n > 0:

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When --no-report is used, this always prints "No answer flips detected" whenever at least one prompt ran, even if flips were found (and even after rendering the flipped table). Gate this message on not flipped (or print a different summary) so the CLI output is not misleading.

Suggested change

-    elif n > 0:
+    elif n > 0 and not flipped:

Comment on lines +421 to +424
        # ── Checkpoint ───────────────────────────────────────────────
        timestamp_int = int(datetime.now(UTC).timestamp())
        checkpoint_path = self.cache_dir / f"compare_{label_a}_vs_{label_b}_{timestamp_int}.json"
        self._save_checkpoint(comparisons, checkpoint_path)

Copilot AI Mar 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checkpoint_path interpolates label_a / label_b directly into the filename. If a user passes a label containing /, .., or other path-unfriendly characters, this can create unintended subdirectories or invalid paths. Sanitize/slugify labels here (similar to safe_a/safe_b in the CLI) before constructing the checkpoint filename.
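A minimal slugify along the lines the review suggests (the `slugify_label` name is illustrative; the PR's CLI does something similar for `safe_a`/`safe_b`):

```python
import re

def slugify_label(label: str) -> str:
    # Replace path separators and any other path-unfriendly runs with '-',
    # then trim leading/trailing dots and dashes so '..' can't survive.
    return re.sub(r"[^A-Za-z0-9._-]+", "-", label).strip("-.") or "model"

print(slugify_label("mlx-community/Llama-3.1-8B-4bit"))  # → mlx-community-Llama-3.1-8B-4bit
```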

Comment on lines +190 to +195
top_logprobs = lp_data.get("top_logprobs")
if top_logprobs:
    distributions = []
    for entry in top_logprobs:
        # entry is a dict of token: logprob
        distributions.append(list(entry.values()))

Copilot AI Mar 21, 2026


distributions.append(list(entry.values())) drops the token identities from top_logprobs and relies on dict value order, which makes the per-step distribution impossible to align across runs/backends (and unstable if server ordering changes). Store token→logprob mappings (or a list of (token, logprob) pairs in a deterministic order) so downstream KL/divergence computations can align by token identity or safely skip.
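One deterministic alternative the review mentions: keep the token identities and fix the ordering yourself rather than trusting dict order. A sketch (the `stable_pairs` name is illustrative):

```python
def stable_pairs(entry: dict[str, float]) -> list[tuple[str, float]]:
    """Store (token, logprob) pairs sorted by logprob descending, then token."""
    return sorted(entry.items(), key=lambda kv: (-kv[1], kv[0]))

print(stable_pairs({"b": -0.5, "a": -0.1}))  # → [('a', -0.1), ('b', -0.5)]
```

With tokens preserved, downstream KL code can align two runs by token identity, or detect that alignment is impossible and skip the comparison.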

@codecov

codecov bot commented Mar 21, 2026

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️
