
Iris: Calculate verbalized confidence#507

Open
toukhi wants to merge 4 commits into main from iris/feature/calculate-verbalized-confidence

Conversation

@toukhi
Contributor

@toukhi toukhi commented Apr 13, 2026

Implements verbalized confidence scoring for the autonomous tutor pipeline, based on Yang et al. (2024) "On Verbalized Confidence Scores for LLMs". Previously, the pipeline returned a hardcoded confidence of 0.99 for every response. This PR replaces that placeholder with a real scoring mechanism where the model itself estimates the probability that its answer is correct.
How it works:

  • The existing system prompt is extended (not replaced) with confidence-scoring instructions
  • Two prompt variants are used depending on model size:
    • Combo method (large models: GPT-4/5 class, ≥32B open-source): 5-shot examples spanning the full confidence range,
      asks the model for a calibrated "best guess" + probability
    • Basic method (small models): minimal instruction, asks for a probability with no examples
  • The model responds in a structured Answer/Guess: ... \nProbability: ... format
  • A parser extracts the clean answer text and the probability score, handling edge cases (percentages, out-of-range
    values, missing scores)
  • The confidence score is included in the AutonomousTutorPipelineStatusUpdateDTO sent back to Artemis, enabling the
    existing shouldPostDirectly threshold logic
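The parsing step described above can be sketched as follows. This is a minimal illustration under stated assumptions; the function and regex names are hypothetical and not the PR's actual code:

```python
import re

# Matches a standalone "Probability: <number>[%]" line (case-insensitive).
_PROB_RE = re.compile(r"^\s*probability\s*:\s*(-?\d+(?:\.\d+)?)\s*(%)?\s*$", re.IGNORECASE)


def parse_confidence_response(raw: str) -> tuple[str, float]:
    """Split an Answer/Probability response into (answer_text, confidence)."""
    lines = raw.strip().splitlines()
    for i, line in enumerate(lines):
        m = _PROB_RE.match(line)
        if m:
            value = float(m.group(1))
            if m.group(2):  # percentage form, e.g. "85%"
                value /= 100.0
            value = min(max(value, 0.0), 1.0)  # clamp out-of-range values
            answer = "\n".join(lines[:i] + lines[i + 1:]).strip()
            # Strip an optional "Answer:"/"Guess:" prefix from the answer text.
            answer = re.sub(r"^(?:answer|guess)\s*:\s*", "", answer, flags=re.IGNORECASE)
            return answer, value
    return raw, 0.0  # missing score: fall back to 0.0 without raising
```

This mirrors the edge cases listed above: percentages are rescaled, out-of-range values are clamped, and a missing probability line yields the raw text with a 0.0 score.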

Summary by CodeRabbit

  • New Features

    • Tutor responses now include explicit confidence scores (0.0–1.0) shown with each answer.
    • Confidence guidance adapts based on model capability to improve calibration.
    • Responses follow a strict two-line output format (answer + probability) for consistency.
  • Bug Fixes

    • Confidence extraction is more robust; malformed or missing scores default to 0.0 without breaking replies.

Closes IRIS-23

@toukhi toukhi requested a review from a team as a code owner April 13, 2026 08:54
@github-actions github-actions Bot added the iris label Apr 13, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Apr 13, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 16b98d3d-3095-4961-8bf9-8dbbfa2b496d

📥 Commits

Reviewing files that changed from the base of the PR and between 1d76ced and 12de91f.

📒 Files selected for processing (3)
  • iris/src/iris/domain/autonomous_tutor/autonomous_tutor_pipeline_status_update_dto.py
  • iris/src/iris/pipeline/autonomous_tutor_pipeline.py
  • iris/src/iris/web/status/status_update.py
💤 Files with no reviewable changes (1)
  • iris/src/iris/web/status/status_update.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • iris/src/iris/pipeline/autonomous_tutor_pipeline.py

📝 Walkthrough

Walkthrough

The pipeline now appends model-size-specific confidence prompt fragments, parses an explicit probability from LLM responses (overwriting the result text), and uses new utilities to classify model size and extract a cleaned answer plus a clamped confidence float.

Changes

  • Pipeline Integration: iris/src/iris/pipeline/autonomous_tutor_pipeline.py
    Selects a confidence prompt fragment via is_large_model; post_agent_hook computes and logs confidence (removed threshold check and should_post_directly flow); _estimate_confidence uses parse_confidence_response, overwrites state.result with the cleaned answer, and returns the extracted probability.
  • Confidence Templates: iris/src/iris/pipeline/prompts/templates/autonomous_tutor_confidence_basic.j2, iris/src/iris/pipeline/prompts/templates/autonomous_tutor_confidence_combo.j2
    Added two Jinja2 fragments requiring a strict two-line model output with answer and probability; the combo variant adds calibration guidance and examples.
  • Confidence Utilities: iris/src/iris/pipeline/shared/confidence_scoring.py
    New helpers: is_large_model(model_id) detects large-model families or >=32b size tokens; parse_confidence_response(raw_response) extracts answer text and a clamped 0.0–1.0 probability, returning (raw_response, 0.0) on parse failure and never raising.
  • DTO Change: iris/src/iris/domain/autonomous_tutor/autonomous_tutor_pipeline_status_update_dto.py
    Removed the should_post_directly (shouldPostDirectly) field from AutonomousTutorPipelineStatusUpdateDTO; the exported DTO now contains only result and confidence.
  • Status Callback: iris/src/iris/web/status/status_update.py
    Removed the should_post_directly parameter from StatusCallback.done(...) and eliminated assignments to self.status.should_post_directly in DONE/cleanup paths.

Sequence Diagram(s)

sequenceDiagram
  participant Pipeline as Pipeline
  participant LLM as LLM
  participant Utils as ConfidenceUtils
  participant State as StateStore

  Pipeline->>State: read state.llm.model_name
  Pipeline->>Utils: is_large_model(model_name)
  Utils-->>Pipeline: boolean
  Pipeline->>LLM: send system prompt + selected confidence fragment
  LLM-->>Pipeline: verbalized Answer + Probability
  Pipeline->>Utils: parse_confidence_response(raw_response)
  Utils-->>Pipeline: (answer_text, probability)
  Pipeline->>State: overwrite state.result (strip prob line)
  Pipeline->>State: log confidence

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I nudged the prompt to ask, "How sure?"
Models answer, then number the cure.
I trim the tail, I tuck the score,
A tiny hop — truth at the core.
Confidence counted, tutor purrs for more.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 75.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title directly and specifically describes the main change: introducing model-generated verbalized confidence calculation to replace hardcoded values in the autonomous tutor pipeline.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Pylint (4.0.5)
iris/src/iris/pipeline/autonomous_tutor_pipeline.py
iris/src/iris/domain/autonomous_tutor/autonomous_tutor_pipeline_status_update_dto.py


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@iris/src/iris/pipeline/shared/confidence_scoring.py`:
- Around line 3-11: The current _LARGE_MODEL_PATTERNS and is_large_model check
only static substrings and misses open-source numeric sizes like
34B/40B/65B/etc.; update is_large_model to detect large models by parsing
numeric size tokens and known family names: implement logic in is_large_model to
(1) case-normalize the model string, (2) match known large-family names
("gpt-4","gpt-5","gpt-oss") OR extract a trailing numeric size before an
optional "b" (e.g., regex like r"(\d+)\s*b?\b"), convert that number to int and
treat >=32 as large, and (3) fallback to existing explicit checks in
_LARGE_MODEL_PATTERNS if needed—replace the static list usage with this
parse-and-compare routine so all ≥32B open-source models are correctly
classified.
- Around line 18-21: The _PROBABILITY_LINE_RE is too permissive and combined
with re.search allows incidental "p: 0.5" inside normal answers to be treated as
a confidence score; tighten the pattern (add anchors ^...$ or require word
boundaries like \b and ensure optional % handling stays correct) and change
parsing call sites to use re.match or re.fullmatch instead of search so only
strings that are exclusively a probability line are accepted; apply the same
change to the other usage sites in this module (the occurrences referenced
around lines 55-56) so should_post_directly logic only treats explicit
standalone probability lines as confidence.
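The parse-and-compare routine suggested in the first inline comment above could be sketched as follows. The family names and the 32B threshold come from the comment itself; everything else is illustrative, not the PR's actual implementation:

```python
import re

# Known large-model family names, per the review comment.
_LARGE_FAMILIES = ("gpt-4", "gpt-5", "gpt-oss")


def is_large_model(model_id: str) -> bool:
    """Classify a model as 'large' by family name or parameter count (>= 32B)."""
    name = model_id.lower()
    if any(family in name for family in _LARGE_FAMILIES):
        return True
    # Extract numeric size tokens such as "34b", "70B", or "65 b" and
    # compare against the 32-billion-parameter threshold.
    for match in re.finditer(r"(\d+)\s*b\b", name):
        if int(match.group(1)) >= 32:
            return True
    return False
```

Unlike a static substring list, this catches arbitrary sizes (34B, 40B, 65B, ...) without enumerating each one.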

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a7e71d5a-891a-4e5a-b93c-070d826574f1

📥 Commits

Reviewing files that changed from the base of the PR and between ed561ba and 9821e49.

📒 Files selected for processing (4)
  • iris/src/iris/pipeline/autonomous_tutor_pipeline.py
  • iris/src/iris/pipeline/prompts/templates/autonomous_tutor_confidence_basic.j2
  • iris/src/iris/pipeline/prompts/templates/autonomous_tutor_confidence_combo.j2
  • iris/src/iris/pipeline/shared/confidence_scoring.py

Comment thread iris/src/iris/pipeline/shared/confidence_scoring.py Outdated
Comment on lines +18 to +21
_PROBABILITY_LINE_RE = re.compile(
r"(?:probability|confidence|p)\s*:\s*(-?\d+(?:\.\d+)?)(\s*%)?",
re.IGNORECASE,
)
Contributor


⚠️ Potential issue | 🟠 Major

Probability parsing is too permissive and can misread normal answer text as confidence.

Using an unanchored regex with search() means any trailing p: <number> inside ordinary text can be parsed as confidence, which may inflate should_post_directly decisions.

Proposed fix
 _PROBABILITY_LINE_RE = re.compile(
-    r"(?:probability|confidence|p)\s*:\s*(-?\d+(?:\.\d+)?)(\s*%)?",
+    r"^\s*(?:probability|confidence|p)\s*:\s*(-?(?:\d+(?:\.\d+)?|\.\d+))\s*(%)?\s*$",
     re.IGNORECASE,
 )
@@
-            m = _PROBABILITY_LINE_RE.search(lines[i])
+            m = _PROBABILITY_LINE_RE.match(lines[i])

Also applies to: 55-56

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@iris/src/iris/pipeline/shared/confidence_scoring.py` around lines 18 - 21,
The _PROBABILITY_LINE_RE is too permissive and combined with re.search allows
incidental "p: 0.5" inside normal answers to be treated as a confidence score;
tighten the pattern (add anchors ^...$ or require word boundaries like \b and
ensure optional % handling stays correct) and change parsing call sites to use
re.match or re.fullmatch instead of search so only strings that are exclusively
a probability line are accepted; apply the same change to the other usage sites
in this module (the occurrences referenced around lines 55-56) so
should_post_directly logic only treats explicit standalone probability lines as
confidence.

Contributor

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
iris/src/iris/pipeline/shared/confidence_scoring.py (1)

12-15: ⚠️ Potential issue | 🟠 Major

Probability parsing still accepts embedded text; require standalone line matching.

Using an unanchored pattern with search() can parse incidental fragments like "... p: 0.9 ..." as confidence, which can incorrectly affect direct-post thresholds.

Suggested fix
 _PROBABILITY_LINE_RE = re.compile(
-    r"(?:probability|confidence|p)\s*:\s*(-?\d+(?:\.\d+)?)(\s*%)?",
+    r"^\s*(?:probability|confidence|p)\s*:\s*(-?(?:\d+(?:\.\d+)?|\.\d+))\s*(%)?\s*$",
     re.IGNORECASE,
 )
@@
-            m = _PROBABILITY_LINE_RE.search(lines[i])
+            m = _PROBABILITY_LINE_RE.match(lines[i])
#!/bin/bash
python - <<'PY'
import re
pat_search = re.compile(r"(?:probability|confidence|p)\s*:\s*(-?\d+(?:\.\d+)?)(\s*%)?", re.I)
pat_match  = re.compile(r"^\s*(?:probability|confidence|p)\s*:\s*(-?(?:\d+(?:\.\d+)?|\.\d+))\s*(%)?\s*$", re.I)

samples = [
    "Probability: 0.82",
    "Some answer text p: 0.82 maybe",
    "I think confidence: 90% overall",
]
for s in samples:
    print(s, "| search=", bool(pat_search.search(s)), "| anchored_match=", bool(pat_match.match(s)))
PY

Also applies to: 58-60

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@iris/src/iris/pipeline/shared/confidence_scoring.py` around lines 12 - 15,
The current _PROBABILITY_LINE_RE is unanchored and allows embedded matches;
replace it with an anchored pattern (e.g., start/end with optional surrounding
whitespace and allow leading dot-only decimals) like the reviewer suggested and
update any places that call .search() for this pattern to use .match() so only
standalone lines (exact lines containing "probability|confidence|p:
<number>[%]") are accepted; apply the same anchored-change to the related regex
used around lines 58-60 (the other probability/confidence regex constants) and
ensure callers reference the updated symbol names (_PROBABILITY_LINE_RE) and use
.match() instead of .search().
🧹 Nitpick comments (1)
iris/src/iris/pipeline/shared/confidence_scoring.py (1)

90-91: Narrow the exception scope in parser fallback.

Catching Exception hides unexpected bugs. Prefer parse-related exceptions (ValueError, TypeError) and keep the same fallback return.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@iris/src/iris/pipeline/shared/confidence_scoring.py` around lines 90 - 91,
The broad except Exception in the parser fallback hides unrelated bugs; replace
it with a narrow exception tuple such as except (ValueError, TypeError,
json.JSONDecodeError) around the parsing call so only parse-related errors are
caught, and keep the same fallback return of (raw_response, 0.0); ensure you
import json if adding json.JSONDecodeError and apply the change where the
current except Exception: block returns raw_response, 0.0.
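The narrowed fallback the nitpick describes might look like the following. This is a sketch; the surrounding parsing logic is hypothetical and only stands in for the module's actual parser:

```python
def parse_with_fallback(raw_response: str) -> tuple[str, float]:
    """Return (answer, confidence), falling back to (raw_response, 0.0)
    only on parse-related errors instead of a blanket `except Exception`."""
    try:
        answer, prob_text = raw_response.rsplit("Probability:", 1)
        return answer.strip(), min(max(float(prob_text.strip()), 0.0), 1.0)
    except (ValueError, TypeError):  # narrow scope: parse failures only
        return raw_response, 0.0
```

Unexpected bugs (AttributeError, KeyError, and so on) now propagate instead of being silently swallowed into a 0.0 confidence.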
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@iris/src/iris/pipeline/shared/confidence_scoring.py`:
- Around line 12-15: The current _PROBABILITY_LINE_RE is unanchored and allows
embedded matches; replace it with an anchored pattern (e.g., start/end with
optional surrounding whitespace and allow leading dot-only decimals) like the
reviewer suggested and update any places that call .search() for this pattern to
use .match() so only standalone lines (exact lines containing
"probability|confidence|p: <number>[%]") are accepted; apply the same
anchored-change to the related regex used around lines 58-60 (the other
probability/confidence regex constants) and ensure callers reference the updated
symbol names (_PROBABILITY_LINE_RE) and use .match() instead of .search().

---

Nitpick comments:
In `@iris/src/iris/pipeline/shared/confidence_scoring.py`:
- Around line 90-91: The broad except Exception in the parser fallback hides
unrelated bugs; replace it with a narrow exception tuple such as except
(ValueError, TypeError, json.JSONDecodeError) around the parsing call so only
parse-related errors are caught, and keep the same fallback return of
(raw_response, 0.0); ensure you import json if adding json.JSONDecodeError and
apply the change where the current except Exception: block returns raw_response,
0.0.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e1106bdb-487e-4d11-a454-e718ef4cbdbb

📥 Commits

Reviewing files that changed from the base of the PR and between 9821e49 and 1d76ced.

📒 Files selected for processing (1)
  • iris/src/iris/pipeline/shared/confidence_scoring.py

@github-actions
Copy link
Copy Markdown

There hasn't been any activity on this pull request recently. Therefore, this pull request has been automatically marked as stale and will be closed if no further activity occurs within seven days. Thank you for your contributions.

@github-actions github-actions Bot added the stale label Apr 27, 2026
