…ature/calculate-verbalized-confidence
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (3)
💤 Files with no reviewable changes (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 Walkthrough
The pipeline now appends model-size-specific confidence prompt fragments, parses an explicit probability from LLM responses (overwriting the result text), and uses new utilities to classify model size and extract a cleaned answer plus a clamped confidence float.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Pipeline as Pipeline
    participant LLM as LLM
    participant Utils as ConfidenceUtils
    participant State as StateStore
    Pipeline->>State: read state.llm.model_name
    Pipeline->>Utils: is_large_model(model_name)
    Utils-->>Pipeline: boolean
    Pipeline->>LLM: send system prompt + selected confidence fragment
    LLM-->>Pipeline: verbalized Answer + Probability
    Pipeline->>Utils: parse_confidence_response(raw_response)
    Utils-->>Pipeline: (answer_text, probability)
    Pipeline->>State: overwrite state.result (strip prob line)
    Pipeline->>State: log confidence
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
⚠️ Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it is a critical failure.
🔧 Pylint (4.0.5)
iris/src/iris/pipeline/autonomous_tutor_pipeline.py
iris/src/iris/domain/autonomous_tutor/autonomous_tutor_pipeline_status_update_dto.py
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@iris/src/iris/pipeline/shared/confidence_scoring.py`:
- Around lines 3-11: The current `_LARGE_MODEL_PATTERNS` and `is_large_model` check only static substrings and miss open-source numeric sizes such as 34B/40B/65B. Update `is_large_model` to detect large models by parsing numeric size tokens and known family names: (1) case-normalize the model string, (2) match known large-family names ("gpt-4", "gpt-5", "gpt-oss") OR extract a trailing numeric size before an optional "b" (e.g., a regex like `r"(\d+)\s*b?\b"`), convert that number to int, and treat >= 32 as large, and (3) fall back to the existing explicit checks in `_LARGE_MODEL_PATTERNS` if needed. Replace the static list usage with this parse-and-compare routine so all >= 32B open-source models are correctly classified.
- Around line 18-21: The _PROBABILITY_LINE_RE is too permissive and combined
with re.search allows incidental "p: 0.5" inside normal answers to be treated as
a confidence score; tighten the pattern (add anchors ^...$ or require word
boundaries like \b and ensure optional % handling stays correct) and change
parsing call sites to use re.match or re.fullmatch instead of search so only
strings that are exclusively a probability line are accepted; apply the same
change to the other usage sites in this module (the occurrences referenced
around lines 55-56) so should_post_directly logic only treats explicit
standalone probability lines as confidence.
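The parse-and-compare routine suggested in the first comment could be sketched roughly as follows. This is illustrative only: the family names and the 32B threshold are taken from the review comment, while the helper shape and pattern details are assumptions, not the actual iris code.

```python
import re

# Sketch of the suggested is_large_model rewrite (assumed names, not
# the real confidence_scoring.py implementation).
_LARGE_MODEL_PATTERNS = ("gpt-4", "gpt-5", "gpt-oss")
_SIZE_RE = re.compile(r"(\d+)\s*b\b", re.IGNORECASE)


def is_large_model(model_name: str) -> bool:
    # (1) case-normalize the model string
    name = model_name.lower()
    # (2a) known large families match by substring
    if any(pattern in name for pattern in _LARGE_MODEL_PATTERNS):
        return True
    # (2b) open-source models usually encode their size, e.g. "llama-2-70b"
    match = _SIZE_RE.search(name)
    if match and int(match.group(1)) >= 32:
        return True
    return False
```

With this shape, `"llama-2-70b"` and `"codellama-34b"` classify as large via the numeric token, while `"mistral-7b"` stays small.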
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a7e71d5a-891a-4e5a-b93c-070d826574f1
📒 Files selected for processing (4)
iris/src/iris/pipeline/autonomous_tutor_pipeline.py
iris/src/iris/pipeline/prompts/templates/autonomous_tutor_confidence_basic.j2
iris/src/iris/pipeline/prompts/templates/autonomous_tutor_confidence_combo.j2
iris/src/iris/pipeline/shared/confidence_scoring.py
```python
_PROBABILITY_LINE_RE = re.compile(
    r"(?:probability|confidence|p)\s*:\s*(-?\d+(?:\.\d+)?)(\s*%)?",
    re.IGNORECASE,
)
```
Probability parsing is too permissive and can misread normal answer text as confidence.
Using an unanchored regex with search() means any trailing p: <number> inside ordinary text can be parsed as confidence, which may inflate should_post_directly decisions.
Proposed fix
```diff
 _PROBABILITY_LINE_RE = re.compile(
-    r"(?:probability|confidence|p)\s*:\s*(-?\d+(?:\.\d+)?)(\s*%)?",
+    r"^\s*(?:probability|confidence|p)\s*:\s*(-?(?:\d+(?:\.\d+)?|\.\d+))\s*(%)?\s*$",
     re.IGNORECASE,
 )
@@
-    m = _PROBABILITY_LINE_RE.search(lines[i])
+    m = _PROBABILITY_LINE_RE.match(lines[i])
```

Also applies to: 55-56
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@iris/src/iris/pipeline/shared/confidence_scoring.py` around lines 18 - 21,
The _PROBABILITY_LINE_RE is too permissive and combined with re.search allows
incidental "p: 0.5" inside normal answers to be treated as a confidence score;
tighten the pattern (add anchors ^...$ or require word boundaries like \b and
ensure optional % handling stays correct) and change parsing call sites to use
re.match or re.fullmatch instead of search so only strings that are exclusively
a probability line are accepted; apply the same change to the other usage sites
in this module (the occurrences referenced around lines 55-56) so
should_post_directly logic only treats explicit standalone probability lines as
confidence.
♻️ Duplicate comments (1)
iris/src/iris/pipeline/shared/confidence_scoring.py (1)
12-15: ⚠️ Potential issue | 🟠 Major
Probability parsing still accepts embedded text; require standalone line matching.
Using an unanchored pattern with `search()` can parse incidental fragments like `"... p: 0.9 ..."` as confidence, which can incorrectly affect direct-post thresholds.
Suggested fix

```diff
 _PROBABILITY_LINE_RE = re.compile(
-    r"(?:probability|confidence|p)\s*:\s*(-?\d+(?:\.\d+)?)(\s*%)?",
+    r"^\s*(?:probability|confidence|p)\s*:\s*(-?(?:\d+(?:\.\d+)?|\.\d+))\s*(%)?\s*$",
     re.IGNORECASE,
 )
@@
-    m = _PROBABILITY_LINE_RE.search(lines[i])
+    m = _PROBABILITY_LINE_RE.match(lines[i])
```

```bash
#!/bin/bash
python - <<'PY'
import re
pat_search = re.compile(r"(?:probability|confidence|p)\s*:\s*(-?\d+(?:\.\d+)?)(\s*%)?", re.I)
pat_match = re.compile(r"^\s*(?:probability|confidence|p)\s*:\s*(-?(?:\d+(?:\.\d+)?|\.\d+))\s*(%)?\s*$", re.I)
samples = [
    "Probability: 0.82",
    "Some answer text p: 0.82 maybe",
    "I think confidence: 90% overall",
]
for s in samples:
    print(s, "| search=", bool(pat_search.search(s)), "| anchored_match=", bool(pat_match.match(s)))
PY
```

Also applies to: 58-60
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@iris/src/iris/pipeline/shared/confidence_scoring.py` around lines 12 - 15, The current _PROBABILITY_LINE_RE is unanchored and allows embedded matches; replace it with an anchored pattern (e.g., start/end with optional surrounding whitespace and allow leading dot-only decimals) like the reviewer suggested and update any places that call .search() for this pattern to use .match() so only standalone lines (exact lines containing "probability|confidence|p: <number>[%]") are accepted; apply the same anchored-change to the related regex used around lines 58-60 (the other probability/confidence regex constants) and ensure callers reference the updated symbol names (_PROBABILITY_LINE_RE) and use .match() instead of .search().
🧹 Nitpick comments (1)
iris/src/iris/pipeline/shared/confidence_scoring.py (1)
90-91: Narrow the exception scope in parser fallback.
Catching `Exception` hides unexpected bugs. Prefer parse-related exceptions (`ValueError`, `TypeError`) and keep the same fallback return.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@iris/src/iris/pipeline/shared/confidence_scoring.py` around lines 90 - 91, The broad except Exception in the parser fallback hides unrelated bugs; replace it with a narrow exception tuple such as except (ValueError, TypeError, json.JSONDecodeError) around the parsing call so only parse-related errors are caught, and keep the same fallback return of (raw_response, 0.0); ensure you import json if adding json.JSONDecodeError and apply the change where the current except Exception: block returns raw_response, 0.0.
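Applied to a parser of the shape this module appears to have, the narrowed fallback might look like the sketch below. The function and helper names are illustrative, not the actual `confidence_scoring.py` code; only the exception tuple and the `(raw_response, 0.0)` fallback come from the review comment.

```python
import json

# Illustrative parse helper: raises ValueError/TypeError on bad input.
def parse_probability(text: str) -> float:
    text = text.strip()
    if text.endswith("%"):
        return max(0.0, min(1.0, float(text[:-1]) / 100.0))
    return max(0.0, min(1.0, float(text)))


def parse_confidence_response(raw_response: str) -> tuple[str, float]:
    try:
        answer, _, prob_line = raw_response.rpartition("Probability:")
        return answer.strip(), parse_probability(prob_line)
    except (ValueError, TypeError, json.JSONDecodeError):
        # Same fallback as before; unexpected bugs now propagate instead
        # of being silently swallowed by a bare `except Exception`.
        return raw_response, 0.0
```

Note that `json.JSONDecodeError` subclasses `ValueError`, so listing it is redundant here but harmless, and it matches the review comment's wording.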
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@iris/src/iris/pipeline/shared/confidence_scoring.py`:
- Around line 12-15: The current _PROBABILITY_LINE_RE is unanchored and allows
embedded matches; replace it with an anchored pattern (e.g., start/end with
optional surrounding whitespace and allow leading dot-only decimals) like the
reviewer suggested and update any places that call .search() for this pattern to
use .match() so only standalone lines (exact lines containing
"probability|confidence|p: <number>[%]") are accepted; apply the same
anchored-change to the related regex used around lines 58-60 (the other
probability/confidence regex constants) and ensure callers reference the updated
symbol names (_PROBABILITY_LINE_RE) and use .match() instead of .search().
---
Nitpick comments:
In `@iris/src/iris/pipeline/shared/confidence_scoring.py`:
- Around line 90-91: The broad except Exception in the parser fallback hides
unrelated bugs; replace it with a narrow exception tuple such as except
(ValueError, TypeError, json.JSONDecodeError) around the parsing call so only
parse-related errors are caught, and keep the same fallback return of
(raw_response, 0.0); ensure you import json if adding json.JSONDecodeError and
apply the change where the current except Exception: block returns raw_response,
0.0.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: e1106bdb-487e-4d11-a454-e718ef4cbdbb
📒 Files selected for processing (1)
iris/src/iris/pipeline/shared/confidence_scoring.py
There hasn't been any activity on this pull request recently. Therefore, this pull request has been automatically marked as stale and will be closed if no further activity occurs within seven days. Thank you for your contributions.
Implements verbalized confidence scoring for the autonomous tutor pipeline, based on Yang et al. (2024), "On Verbalized Confidence Scores for LLMs". Previously, the pipeline returned a hardcoded confidence of `0.99` for every response. This PR replaces that placeholder with a real scoring mechanism where the model itself estimates the probability that its answer is correct.

How it works:
- A prompt fragment asks the model for a calibrated "best guess" + probability
- Responses are parsed from the `Answer/Guess: ... \nProbability: ...` format, with fallbacks for malformed output (out-of-range values, missing scores)
- The confidence score is included in the `AutonomousTutorPipelineStatusUpdateDTO` sent back to Artemis, enabling the existing `shouldPostDirectly` threshold logic

Summary by CodeRabbit
New Features
Bug Fixes
Closes IRIS-23
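The overall answer/probability flow the PR describes can be sketched as follows. This is a minimal illustration under assumed names: the prompt text, helper name, and the 0.0 missing-score fallback are assumptions, not the actual iris implementation.

```python
import re

# Hypothetical prompt fragment asking for a verbalized confidence score.
CONFIDENCE_FRAGMENT = (
    "Provide your best guess and the probability that it is correct "
    "(0.0 to 1.0) in the format:\n"
    "Guess: <answer>\nProbability: <p>"
)

# Anchored, standalone-line pattern, per the review comments above.
_PROB_RE = re.compile(r"^\s*probability\s*:\s*([\d.]+)\s*$",
                      re.IGNORECASE | re.MULTILINE)


def split_answer_and_confidence(raw: str) -> tuple[str, float]:
    match = _PROB_RE.search(raw)
    if match is None:
        return raw.strip(), 0.0  # missing score -> conservative fallback
    # clamp out-of-range values into [0, 1]
    confidence = max(0.0, min(1.0, float(match.group(1))))
    # strip the probability line so only the answer text remains
    answer = (raw[: match.start()] + raw[match.end():]).strip()
    return answer, confidence
```

For example, `split_answer_and_confidence("Guess: Paris\nProbability: 0.87")` yields the cleaned answer plus `0.87`, and an out-of-range `Probability: 1.7` is clamped to `1.0`.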