
Iris: Calculate verbalized confidence#507

Open
toukhi wants to merge 4 commits into main from iris/feature/calculate-verbalized-confidence

Conversation

@toukhi
Contributor

@toukhi toukhi commented Apr 13, 2026

Implements verbalized confidence scoring for the autonomous tutor pipeline, based on Yang et al. (2024) "On Verbalized Confidence Scores for LLMs". Previously, the pipeline returned a hardcoded confidence of 0.99 for every response. This PR replaces that placeholder with a real scoring mechanism where the model itself estimates the probability that its answer is correct.
How it works:

  • The existing system prompt is extended (not replaced) with confidence-scoring instructions
  • Two prompt variants are used depending on model size:
    • Combo method (large models: GPT-4/5 class, ≥32B open-source): 5-shot examples spanning the full confidence range,
      asks the model for a calibrated "best guess" + probability
    • Basic method (small models): minimal instruction, asks for a probability with no examples
  • The model responds in a structured Answer/Guess: ... \nProbability: ... format
  • A parser extracts the clean answer text and the probability score, handling edge cases (percentages, out-of-range
    values, missing scores)
  • The confidence score is included in the AutonomousTutorPipelineStatusUpdateDTO sent back to Artemis, enabling the
    existing shouldPostDirectly threshold logic
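The parsing step described above can be sketched as follows. This is a minimal illustration under stated assumptions; the function and regex names are hypothetical and not the PR's actual code:

```python
import re

# Matches a standalone "Probability: <number>[%]" line (case-insensitive).
_PROB_RE = re.compile(r"^\s*probability\s*:\s*(-?\d+(?:\.\d+)?)\s*(%)?\s*$", re.IGNORECASE)


def parse_confidence_response(raw: str) -> tuple[str, float]:
    """Split an Answer/Probability response into (answer_text, confidence)."""
    lines = raw.strip().splitlines()
    for i, line in enumerate(lines):
        m = _PROB_RE.match(line)
        if m:
            value = float(m.group(1))
            if m.group(2):  # percentage form, e.g. "85%"
                value /= 100.0
            value = min(max(value, 0.0), 1.0)  # clamp out-of-range values
            answer = "\n".join(lines[:i] + lines[i + 1:]).strip()
            # Strip an optional "Answer:"/"Guess:" prefix from the answer text.
            answer = re.sub(r"^(?:answer|guess)\s*:\s*", "", answer, flags=re.IGNORECASE)
            return answer, value
    return raw, 0.0  # missing score: fall back to 0.0 without raising
```

This mirrors the edge cases listed above: percentages are rescaled, out-of-range values are clamped, and a missing probability line yields the raw text with a 0.0 score.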

Summary by CodeRabbit

  • New Features

    • Tutor responses now include explicit confidence scores (0.0–1.0) shown with each answer.
    • Confidence guidance adapts based on model capability to improve calibration.
    • Responses follow a strict two-line output format (answer + probability) for consistency.
  • Bug Fixes

    • Confidence extraction is more robust; malformed or missing scores default to 0.0 without breaking replies.

Closes IRIS-23

@toukhi toukhi requested a review from a team as a code owner April 13, 2026 08:54
@github-actions github-actions Bot added the iris label Apr 13, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Apr 13, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 16b98d3d-3095-4961-8bf9-8dbbfa2b496d

📥 Commits

Reviewing files that changed from the base of the PR and between 1d76ced and 12de91f.

📒 Files selected for processing (3)
  • iris/src/iris/domain/autonomous_tutor/autonomous_tutor_pipeline_status_update_dto.py
  • iris/src/iris/pipeline/autonomous_tutor_pipeline.py
  • iris/src/iris/web/status/status_update.py
💤 Files with no reviewable changes (1)
  • iris/src/iris/web/status/status_update.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • iris/src/iris/pipeline/autonomous_tutor_pipeline.py

📝 Walkthrough

Walkthrough

The pipeline now appends model-size-specific confidence prompt fragments, parses an explicit probability from LLM responses (overwriting the result text), and uses new utilities to classify model size and extract a cleaned answer plus a clamped confidence float.

Changes

  • Pipeline Integration: iris/src/iris/pipeline/autonomous_tutor_pipeline.py
    Selects a confidence prompt fragment via is_large_model; post_agent_hook computes and logs confidence (removed threshold check and should_post_directly flow); _estimate_confidence uses parse_confidence_response, overwrites state.result with the cleaned answer, and returns the extracted probability.
  • Confidence Templates: iris/src/iris/pipeline/prompts/templates/autonomous_tutor_confidence_basic.j2, iris/src/iris/pipeline/prompts/templates/autonomous_tutor_confidence_combo.j2
    Added two Jinja2 fragments requiring a strict two-line model output with answer and probability; the combo variant adds calibration guidance and examples.
  • Confidence Utilities: iris/src/iris/pipeline/shared/confidence_scoring.py
    New helpers: is_large_model(model_id) detects large-model families or >=32b size tokens; parse_confidence_response(raw_response) extracts answer text and a clamped 0.0–1.0 probability, returning (raw_response, 0.0) on parse failure and never raising.
  • DTO Change: iris/src/iris/domain/autonomous_tutor/autonomous_tutor_pipeline_status_update_dto.py
    Removed the should_post_directly (shouldPostDirectly) field from AutonomousTutorPipelineStatusUpdateDTO; the exported DTO now contains only result and confidence.
  • Status Callback: iris/src/iris/web/status/status_update.py
    Removed the should_post_directly parameter from StatusCallback.done(...) and eliminated assignments to self.status.should_post_directly in DONE/cleanup paths.

Sequence Diagram(s)

sequenceDiagram
  participant Pipeline as Pipeline
  participant LLM as LLM
  participant Utils as ConfidenceUtils
  participant State as StateStore

  Pipeline->>State: read state.llm.model_name
  Pipeline->>Utils: is_large_model(model_name)
  Utils-->>Pipeline: boolean
  Pipeline->>LLM: send system prompt + selected confidence fragment
  LLM-->>Pipeline: verbalized Answer + Probability
  Pipeline->>Utils: parse_confidence_response(raw_response)
  Utils-->>Pipeline: (answer_text, probability)
  Pipeline->>State: overwrite state.result (strip prob line)
  Pipeline->>State: log confidence

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I nudged the prompt to ask, "How sure?"
Models answer, then number the cure.
I trim the tail, I tuck the score,
A tiny hop — truth at the core.
Confidence counted, tutor purrs for more.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 75.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title directly and specifically describes the main change: introducing model-generated verbalized confidence calculation to replace hardcoded values in the autonomous tutor pipeline.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Pylint (4.0.5)
iris/src/iris/pipeline/autonomous_tutor_pipeline.py
iris/src/iris/domain/autonomous_tutor/autonomous_tutor_pipeline_status_update_dto.py


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@iris/src/iris/pipeline/shared/confidence_scoring.py`:
- Around line 3-11: The current _LARGE_MODEL_PATTERNS and is_large_model check
only static substrings and misses open-source numeric sizes like
34B/40B/65B/etc.; update is_large_model to detect large models by parsing
numeric size tokens and known family names: implement logic in is_large_model to
(1) case-normalize the model string, (2) match known large-family names
("gpt-4","gpt-5","gpt-oss") OR extract a trailing numeric size before an
optional "b" (e.g., regex like r"(\d+)\s*b?\b"), convert that number to int and
treat >=32 as large, and (3) fallback to existing explicit checks in
_LARGE_MODEL_PATTERNS if needed—replace the static list usage with this
parse-and-compare routine so all ≥32B open-source models are correctly
classified.
- Around line 18-21: The _PROBABILITY_LINE_RE is too permissive and combined
with re.search allows incidental "p: 0.5" inside normal answers to be treated as
a confidence score; tighten the pattern (add anchors ^...$ or require word
boundaries like \b and ensure optional % handling stays correct) and change
parsing call sites to use re.match or re.fullmatch instead of search so only
strings that are exclusively a probability line are accepted; apply the same
change to the other usage sites in this module (the occurrences referenced
around lines 55-56) so should_post_directly logic only treats explicit
standalone probability lines as confidence.
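The parse-and-compare routine suggested in the first inline comment above could be sketched as follows. The family names and the 32B threshold come from the comment itself; everything else is illustrative, not the PR's actual implementation:

```python
import re

# Known large-model family names, per the review comment.
_LARGE_FAMILIES = ("gpt-4", "gpt-5", "gpt-oss")


def is_large_model(model_id: str) -> bool:
    """Classify a model as 'large' by family name or parameter count (>= 32B)."""
    name = model_id.lower()
    if any(family in name for family in _LARGE_FAMILIES):
        return True
    # Extract numeric size tokens such as "34b", "70B", or "65 b" and
    # compare against the 32-billion-parameter threshold.
    for match in re.finditer(r"(\d+)\s*b\b", name):
        if int(match.group(1)) >= 32:
            return True
    return False
```

Unlike a static substring list, this catches arbitrary sizes (34B, 40B, 65B, ...) without enumerating each one.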

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a7e71d5a-891a-4e5a-b93c-070d826574f1

📥 Commits

Reviewing files that changed from the base of the PR and between ed561ba and 9821e49.

📒 Files selected for processing (4)
  • iris/src/iris/pipeline/autonomous_tutor_pipeline.py
  • iris/src/iris/pipeline/prompts/templates/autonomous_tutor_confidence_basic.j2
  • iris/src/iris/pipeline/prompts/templates/autonomous_tutor_confidence_combo.j2
  • iris/src/iris/pipeline/shared/confidence_scoring.py

Comment thread iris/src/iris/pipeline/shared/confidence_scoring.py Outdated
Comment on lines +18 to +21
_PROBABILITY_LINE_RE = re.compile(
r"(?:probability|confidence|p)\s*:\s*(-?\d+(?:\.\d+)?)(\s*%)?",
re.IGNORECASE,
)
Contributor


⚠️ Potential issue | 🟠 Major

Probability parsing is too permissive and can misread normal answer text as confidence.

Using an unanchored regex with search() means any trailing p: <number> inside ordinary text can be parsed as confidence, which may inflate should_post_directly decisions.

Proposed fix
 _PROBABILITY_LINE_RE = re.compile(
-    r"(?:probability|confidence|p)\s*:\s*(-?\d+(?:\.\d+)?)(\s*%)?",
+    r"^\s*(?:probability|confidence|p)\s*:\s*(-?(?:\d+(?:\.\d+)?|\.\d+))\s*(%)?\s*$",
     re.IGNORECASE,
 )
@@
-            m = _PROBABILITY_LINE_RE.search(lines[i])
+            m = _PROBABILITY_LINE_RE.match(lines[i])

Also applies to: 55-56

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@iris/src/iris/pipeline/shared/confidence_scoring.py` around lines 18 - 21,
The _PROBABILITY_LINE_RE is too permissive and combined with re.search allows
incidental "p: 0.5" inside normal answers to be treated as a confidence score;
tighten the pattern (add anchors ^...$ or require word boundaries like \b and
ensure optional % handling stays correct) and change parsing call sites to use
re.match or re.fullmatch instead of search so only strings that are exclusively
a probability line are accepted; apply the same change to the other usage sites
in this module (the occurrences referenced around lines 55-56) so
should_post_directly logic only treats explicit standalone probability lines as
confidence.

Contributor

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
iris/src/iris/pipeline/shared/confidence_scoring.py (1)

12-15: ⚠️ Potential issue | 🟠 Major

Probability parsing still accepts embedded text; require standalone line matching.

Using an unanchored pattern with search() can parse incidental fragments like "... p: 0.9 ..." as confidence, which can incorrectly affect direct-post thresholds.

Suggested fix
 _PROBABILITY_LINE_RE = re.compile(
-    r"(?:probability|confidence|p)\s*:\s*(-?\d+(?:\.\d+)?)(\s*%)?",
+    r"^\s*(?:probability|confidence|p)\s*:\s*(-?(?:\d+(?:\.\d+)?|\.\d+))\s*(%)?\s*$",
     re.IGNORECASE,
 )
@@
-            m = _PROBABILITY_LINE_RE.search(lines[i])
+            m = _PROBABILITY_LINE_RE.match(lines[i])
#!/bin/bash
python - <<'PY'
import re
pat_search = re.compile(r"(?:probability|confidence|p)\s*:\s*(-?\d+(?:\.\d+)?)(\s*%)?", re.I)
pat_match  = re.compile(r"^\s*(?:probability|confidence|p)\s*:\s*(-?(?:\d+(?:\.\d+)?|\.\d+))\s*(%)?\s*$", re.I)

samples = [
    "Probability: 0.82",
    "Some answer text p: 0.82 maybe",
    "I think confidence: 90% overall",
]
for s in samples:
    print(s, "| search=", bool(pat_search.search(s)), "| anchored_match=", bool(pat_match.match(s)))
PY

Also applies to: 58-60

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@iris/src/iris/pipeline/shared/confidence_scoring.py` around lines 12 - 15,
The current _PROBABILITY_LINE_RE is unanchored and allows embedded matches;
replace it with an anchored pattern (e.g., start/end with optional surrounding
whitespace and allow leading dot-only decimals) like the reviewer suggested and
update any places that call .search() for this pattern to use .match() so only
standalone lines (exact lines containing "probability|confidence|p:
<number>[%]") are accepted; apply the same anchored-change to the related regex
used around lines 58-60 (the other probability/confidence regex constants) and
ensure callers reference the updated symbol names (_PROBABILITY_LINE_RE) and use
.match() instead of .search().
🧹 Nitpick comments (1)
iris/src/iris/pipeline/shared/confidence_scoring.py (1)

90-91: Narrow the exception scope in parser fallback.

Catching Exception hides unexpected bugs. Prefer parse-related exceptions (ValueError, TypeError) and keep the same fallback return.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@iris/src/iris/pipeline/shared/confidence_scoring.py` around lines 90 - 91,
The broad except Exception in the parser fallback hides unrelated bugs; replace
it with a narrow exception tuple such as except (ValueError, TypeError,
json.JSONDecodeError) around the parsing call so only parse-related errors are
caught, and keep the same fallback return of (raw_response, 0.0); ensure you
import json if adding json.JSONDecodeError and apply the change where the
current except Exception: block returns raw_response, 0.0.
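The narrowed fallback the nitpick describes might look like the following. This is a sketch; the surrounding parsing logic is hypothetical and only stands in for the module's actual parser:

```python
def parse_with_fallback(raw_response: str) -> tuple[str, float]:
    """Return (answer, confidence), falling back to (raw_response, 0.0)
    only on parse-related errors instead of a blanket `except Exception`."""
    try:
        answer, prob_text = raw_response.rsplit("Probability:", 1)
        return answer.strip(), min(max(float(prob_text.strip()), 0.0), 1.0)
    except (ValueError, TypeError):  # narrow scope: parse failures only
        return raw_response, 0.0
```

Unexpected bugs (AttributeError, KeyError, and so on) now propagate instead of being silently swallowed into a 0.0 confidence.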
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@iris/src/iris/pipeline/shared/confidence_scoring.py`:
- Around line 12-15: The current _PROBABILITY_LINE_RE is unanchored and allows
embedded matches; replace it with an anchored pattern (e.g., start/end with
optional surrounding whitespace and allow leading dot-only decimals) like the
reviewer suggested and update any places that call .search() for this pattern to
use .match() so only standalone lines (exact lines containing
"probability|confidence|p: <number>[%]") are accepted; apply the same
anchored-change to the related regex used around lines 58-60 (the other
probability/confidence regex constants) and ensure callers reference the updated
symbol names (_PROBABILITY_LINE_RE) and use .match() instead of .search().

---

Nitpick comments:
In `@iris/src/iris/pipeline/shared/confidence_scoring.py`:
- Around line 90-91: The broad except Exception in the parser fallback hides
unrelated bugs; replace it with a narrow exception tuple such as except
(ValueError, TypeError, json.JSONDecodeError) around the parsing call so only
parse-related errors are caught, and keep the same fallback return of
(raw_response, 0.0); ensure you import json if adding json.JSONDecodeError and
apply the change where the current except Exception: block returns raw_response,
0.0.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e1106bdb-487e-4d11-a454-e718ef4cbdbb

📥 Commits

Reviewing files that changed from the base of the PR and between 9821e49 and 1d76ced.

📒 Files selected for processing (1)
  • iris/src/iris/pipeline/shared/confidence_scoring.py

@github-actions
Copy link
Copy Markdown

There hasn't been any activity on this pull request recently. Therefore, this pull request has been automatically marked as stale and will be closed if no further activity occurs within seven days. Thank you for your contributions.

@github-actions github-actions Bot added the stale label Apr 27, 2026
