
feat: add single metric for llm predefined #164

Merged: 6 commits merged into main from feat/add_single_metric_for_llm_predefined on Feb 27, 2025

Conversation

@MaksymAI (Contributor) commented Feb 26, 2025

Important

Introduce a run_single() method for single-sample evaluation across multiple metrics, refactoring the existing run() methods for consistency and updating tests accordingly.

  • Behavior:
    • Add a run_single() method to AnswerCorrectnessEvaluator, ContextPrecisionEvaluator, ContextRecallEvaluator, FactualCorrectnessEvaluator, and FaithfulnessEvaluator for single-sample evaluation.
    • Refactor the run() methods in these evaluators to delegate to run_single() for each sample (a hedged sketch of this pattern appears below the description).
  • Models:
    • Add AnswerCorrectnessRunSingleInput to answer_correctness.py.
    • Add FaithfulnessRunSingleInput to faithfulness.py.
  • Tests:
    • Update test_answer_correctness_evaluator in test_answer_correctness.py to reflect changes in evaluation logic.

This description was created by Ellipsis for d81435e. It will automatically update as commits are pushed.
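
As a rough illustration of the delegation pattern described above. This is a hedged sketch: the evaluator and model names appear in this PR, but the fields, signatures, and scoring internals are assumptions, not the actual dynamiq code.

```python
# Hypothetical sketch of the run()/run_single() split; field names,
# signatures, and scoring internals are assumptions.
from pydantic import BaseModel


class AnswerCorrectnessRunSingleInput(BaseModel):
    question: str
    answer: str
    ground_truth_answer: str


class AnswerCorrectnessEvaluator:
    def run_single(self, question: str, answer: str, ground_truth_answer: str) -> float:
        """Evaluate a single sample and return its score."""
        single_input = AnswerCorrectnessRunSingleInput(
            question=question,
            answer=answer,
            ground_truth_answer=ground_truth_answer,
        )
        # Placeholder for the real metric pipeline (statement extraction,
        # classification, score computation).
        return self._score(single_input)

    def run(self, questions: list[str], answers: list[str], ground_truths: list[str]) -> list[float]:
        """Evaluate a batch by delegating each sample to run_single()."""
        return [
            self.run_single(q, a, gt)
            for q, a, gt in zip(questions, answers, ground_truths)
        ]

    def _score(self, single_input: AnswerCorrectnessRunSingleInput) -> float:
        raise NotImplementedError  # stands in for the LLM-backed scoring
```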


github-actions bot commented Feb 26, 2025

Coverage

Coverage Report

| File | Stmts | Miss | Cover | Missing |
| --- | ---: | ---: | ---: | --- |
| dynamiq/evaluations/metrics/answer_correctness.py | 144 | 9 | 93% | 56, 260, 282, 348, 375, 434–437 |
| dynamiq/evaluations/metrics/context_precision.py | 111 | 21 | 81% | 34, 40, 80–82, 282–287, 293–294, 299–304, 309–310 |
| dynamiq/evaluations/metrics/context_recall.py | 98 | 15 | 84% | 33, 38, 43, 64, 239–240, 244–245, 265–271 |
| dynamiq/evaluations/metrics/factual_correctness.py | 175 | 34 | 80% | 77, 82, 90, 258, 305, 375–376, 378–379, 414–416, 422–424, 431–433, 443–444, 447–448, 453–454, 457, 459, 477–484 |
| dynamiq/evaluations/metrics/faithfulness.py | 153 | 22 | 85% | 24, 52, 73, 110, 115, 120, 314, 423–425, 432–434, 451–459 |
| TOTAL | 11938 | 3630 | 69% | |

| Tests | Skipped | Failures | Errors | Time |
| ---: | ---: | ---: | ---: | ---: |
| 416 | 0 💤 | 0 ❌ | 0 🔥 | 49.660s ⏱️ |

@MaksymAI MaksymAI marked this pull request as ready for review February 26, 2025 09:07
@MaksymAI MaksymAI requested a review from a team as a code owner February 26, 2025 09:07
@ellipsis-dev (bot) left a comment

👍 Looks good to me! Reviewed everything up to d81435e in 2 minutes and 28 seconds

More details
  • Looked at 1279 lines of code in 6 files
  • Skipped 0 files when reviewing.
  • Skipped posting 10 drafted comments based on config settings.
1. tests/integration/evaluations/metrics/test_answer_correctness.py:30
  • Draft comment:
    Clear setup for mocking extraction; the side_effect sequence is well structured.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50%
    None
2. tests/integration/evaluations/metrics/test_answer_correctness.py:89
  • Draft comment:
    The side_effect sequence for classification mocks is clear and covers all calls.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50%
    None
3. tests/integration/evaluations/metrics/test_answer_correctness.py:115
  • Draft comment:
    Expected scores and explanation comments are detailed and improve test clarity.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50%
    None
4. dynamiq/evaluations/metrics/answer_correctness.py:429
  • Draft comment:
    Ensure extract_statements always returns a non-empty list before indexing [0].
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 0% vs. threshold = 50%
    The extract_statements() method has defensive coding - it converts non-list results to lists and handles empty results. It always returns a list of lists, with at least an empty list for each input question. Since we're passing single-element lists [question] and [answer], we'll always get back a single-element list containing a list of statements, making the [0] access safe.
    Could there be edge cases I'm missing? What if the LLM evaluator fails completely and returns None?
    Looking at lines 257-261, even if the LLM returns None or invalid data, it's handled by converting to an empty list. The code is quite defensive.
    The comment should be deleted. The code already handles empty/invalid results safely and there is no risk of IndexError.
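For concreteness, a minimal sketch of the defensive normalization described in this reasoning; call_llm_extractor is a hypothetical stand-in, and this is not the actual answer_correctness.py code:

```python
def call_llm_extractor(text: str):
    """Hypothetical stand-in for the LLM call; may return None, a string, or a list."""
    return None


def extract_statements(texts: list[str]) -> list[list[str]]:
    """Return one (possibly empty) list of statements per input text."""
    results: list[list[str]] = []
    for text in texts:
        raw = call_llm_extractor(text)
        if not isinstance(raw, list):
            # Normalize None/invalid results to an empty list, and wrap a
            # single value so the return shape stays list-of-lists.
            raw = [] if raw is None else [str(raw)]
        results.append(raw)
    # A one-element input batch always yields a one-element result,
    # so indexing [0] on the caller's side cannot raise IndexError.
    return results
```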
5. dynamiq/evaluations/metrics/context_precision.py:24
  • Draft comment:
    Good use of field_validator to normalize contexts_list; consider adding more detailed error messages if input is invalid.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50%
    None
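For reference, a hedged sketch of that normalization pattern using pydantic's field_validator (the model and field shapes here are assumptions, not the actual context_precision.py definitions):

```python
from pydantic import BaseModel, field_validator


class ContextPrecisionInput(BaseModel):
    contexts_list: list[list[str]]

    @field_validator("contexts_list", mode="before")
    @classmethod
    def normalize_contexts_list(cls, value):
        # Promote a flat list of strings to a single-element list of lists
        # so callers can pass either shape.
        if isinstance(value, list) and value and all(isinstance(v, str) for v in value):
            return [value]
        return value
```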
6. dynamiq/evaluations/metrics/context_recall.py:247
  • Draft comment:
    Consider handling missing keys more gracefully when building ClassificationItem inside run_single.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50%
    None
7. dynamiq/evaluations/metrics/factual_correctness.py:410
  • Draft comment:
    Ensure that decompose_claims properly handles cases where no claims are decomposed, to avoid possible issues downstream.
  • Reason this comment was not posted:
    Comment did not seem useful. Usefulness confidence = 0% <= threshold 50%
    This comment is asking the author to ensure that a function handles a specific case correctly. It is not making a specific suggestion or pointing out a specific issue, but rather asking for confirmation of intended behavior, which violates the rules.
8. dynamiq/evaluations/metrics/faithfulness.py:421
  • Draft comment:
    Handle the case when simplify_statements returns an empty list more explicitly to avoid potential downstream issues.
  • Reason this comment was not posted:
    Comment looked like it was already resolved.
9. dynamiq/evaluations/metrics/faithfulness.py:450
  • Draft comment:
    Ensure that verbose debug logs do not expose sensitive data in production environments.
  • Reason this comment was not posted:
    Comment did not seem useful. Usefulness confidence = 0% <= threshold 50%
    This comment is asking the author to ensure something, which violates the rule against asking the author to ensure behavior is intended or tested. It doesn't provide a specific suggestion or point out a specific issue in the code.
10. tests/integration/evaluations/metrics/test_answer_correctness.py:89
  • Draft comment:
    Mock side_effect lists rely on call order; consider adding comments or checks to ensure ordering is maintained to prevent brittle tests.
  • Reason this comment was not posted:
    Comment was on unchanged code.
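For context, a tiny generic illustration of the call-order dependence being flagged (plain unittest.mock usage, not the actual test code):

```python
from unittest.mock import MagicMock

mock_llm = MagicMock()
# side_effect consumes this list in call order; reordering the calls under
# test silently pairs them with the wrong canned responses.
mock_llm.run.side_effect = ["statements for sample 1", "statements for sample 2"]

assert mock_llm.run("question 1") == "statements for sample 1"  # 1st call
assert mock_llm.run("question 2") == "statements for sample 2"  # 2nd call
```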

Workflow ID: wflow_0T0t5zIhQnnwPySQ


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

@acoola (Collaborator) left a comment

Great job. Please pull the latest main changes.

@acoola force-pushed the feat/add_single_metric_for_llm_predefined branch from d81435e to de9dbe6 on February 27, 2025 13:34
@acoola merged commit 782a176 into main on Feb 27, 2025
7 checks passed
@acoola deleted the feat/add_single_metric_for_llm_predefined branch on February 27, 2025 14:03