
feat: add single metric for llm predefined #164

Merged: 6 commits merged into main from feat/add_single_metric_for_llm_predefined on Feb 27, 2025

Conversation

@MaksymAI (Contributor) commented Feb 26, 2025

Important

Introduce a run_single() method for single-sample evaluation across multiple metrics, refactoring the existing run() methods for consistency and updating tests accordingly.

  • Behavior:
    • Add a run_single() method to AnswerCorrectnessEvaluator, ContextPrecisionEvaluator, ContextRecallEvaluator, FactualCorrectnessEvaluator, and FaithfulnessEvaluator for single-sample evaluation.
    • Refactor the run() methods in these evaluators to delegate to run_single() for each sample (a hedged sketch of this pattern appears below the description).
  • Models:
    • Add AnswerCorrectnessRunSingleInput to answer_correctness.py.
    • Add FaithfulnessRunSingleInput to faithfulness.py.
  • Tests:
    • Update test_answer_correctness_evaluator in test_answer_correctness.py to reflect changes in evaluation logic.

This description was created by Ellipsis for d81435e. It will automatically update as commits are pushed.
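
As a rough illustration of the delegation pattern described above. This is a hedged sketch: the evaluator and model names appear in this PR, but the fields, signatures, and scoring internals are assumptions, not the actual dynamiq code.

```python
# Hypothetical sketch of the run()/run_single() split; field names,
# signatures, and scoring internals are assumptions.
from pydantic import BaseModel


class AnswerCorrectnessRunSingleInput(BaseModel):
    question: str
    answer: str
    ground_truth_answer: str


class AnswerCorrectnessEvaluator:
    def run_single(self, question: str, answer: str, ground_truth_answer: str) -> float:
        """Evaluate a single sample and return its score."""
        single_input = AnswerCorrectnessRunSingleInput(
            question=question,
            answer=answer,
            ground_truth_answer=ground_truth_answer,
        )
        # Placeholder for the real metric pipeline (statement extraction,
        # classification, score computation).
        return self._score(single_input)

    def run(self, questions: list[str], answers: list[str], ground_truths: list[str]) -> list[float]:
        """Evaluate a batch by delegating each sample to run_single()."""
        return [
            self.run_single(q, a, gt)
            for q, a, gt in zip(questions, answers, ground_truths)
        ]

    def _score(self, single_input: AnswerCorrectnessRunSingleInput) -> float:
        raise NotImplementedError  # stands in for the LLM-backed scoring
```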


github-actions bot commented Feb 26, 2025

Coverage

Coverage Report

| File | Stmts | Miss | Cover | Missing |
| --- | ---: | ---: | ---: | --- |
| dynamiq/evaluations/metrics/answer_correctness.py | 144 | 9 | 93% | 56, 260, 282, 348, 375, 434–437 |
| dynamiq/evaluations/metrics/context_precision.py | 111 | 21 | 81% | 34, 40, 80–82, 282–287, 293–294, 299–304, 309–310 |
| dynamiq/evaluations/metrics/context_recall.py | 98 | 15 | 84% | 33, 38, 43, 64, 239–240, 244–245, 265–271 |
| dynamiq/evaluations/metrics/factual_correctness.py | 175 | 34 | 80% | 77, 82, 90, 258, 305, 375–376, 378–379, 414–416, 422–424, 431–433, 443–444, 447–448, 453–454, 457, 459, 477–484 |
| dynamiq/evaluations/metrics/faithfulness.py | 153 | 22 | 85% | 24, 52, 73, 110, 115, 120, 314, 423–425, 432–434, 451–459 |
| TOTAL | 11938 | 3630 | 69% | |

| Tests | Skipped | Failures | Errors | Time |
| ---: | ---: | ---: | ---: | ---: |
| 416 | 0 💤 | 0 ❌ | 0 🔥 | 49.660s ⏱️ |

@MaksymAI MaksymAI marked this pull request as ready for review February 26, 2025 09:07
@MaksymAI MaksymAI requested a review from a team as a code owner February 26, 2025 09:07
@ellipsis-dev (bot) left a comment

👍 Looks good to me! Reviewed everything up to d81435e in 2 minutes and 28 seconds

More details
  • Looked at 1279 lines of code in 6 files
  • Skipped 0 files when reviewing.
  • Skipped posting 10 drafted comments based on config settings.
1. tests/integration/evaluations/metrics/test_answer_correctness.py:30
  • Draft comment:
    Clear setup for mocking extraction; the side_effect sequence is well structured.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50%
    None
2. tests/integration/evaluations/metrics/test_answer_correctness.py:89
  • Draft comment:
    The side_effect sequence for classification mocks is clear and covers all calls.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50%
    None
3. tests/integration/evaluations/metrics/test_answer_correctness.py:115
  • Draft comment:
    Expected scores and explanation comments are detailed and improve test clarity.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50%
    None
4. dynamiq/evaluations/metrics/answer_correctness.py:429
  • Draft comment:
    Ensure extract_statements always returns a non-empty list before indexing [0].
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 0% vs. threshold = 50%
    The extract_statements() method has defensive coding - it converts non-list results to lists and handles empty results. It always returns a list of lists, with at least an empty list for each input question. Since we're passing single-element lists [question] and [answer], we'll always get back a single-element list containing a list of statements, making the [0] access safe.
    Could there be edge cases I'm missing? What if the LLM evaluator fails completely and returns None?
    Looking at lines 257-261, even if the LLM returns None or invalid data, it's handled by converting to an empty list. The code is quite defensive.
    The comment should be deleted. The code already handles empty/invalid results safely and there is no risk of IndexError.
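For concreteness, a minimal sketch of the defensive normalization described in this reasoning; call_llm_extractor is a hypothetical stand-in, and this is not the actual answer_correctness.py code:

```python
def call_llm_extractor(text: str):
    """Hypothetical stand-in for the LLM call; may return None, a string, or a list."""
    return None


def extract_statements(texts: list[str]) -> list[list[str]]:
    """Return one (possibly empty) list of statements per input text."""
    results: list[list[str]] = []
    for text in texts:
        raw = call_llm_extractor(text)
        if not isinstance(raw, list):
            # Normalize None/invalid results to an empty list, and wrap a
            # single value so the return shape stays list-of-lists.
            raw = [] if raw is None else [str(raw)]
        results.append(raw)
    # A one-element input batch always yields a one-element result,
    # so indexing [0] on the caller's side cannot raise IndexError.
    return results
```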
5. dynamiq/evaluations/metrics/context_precision.py:24
  • Draft comment:
    Good use of field_validator to normalize contexts_list; consider adding more detailed error messages if input is invalid.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50%
    None
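For reference, a hedged sketch of that normalization pattern using pydantic's field_validator (the model and field shapes here are assumptions, not the actual context_precision.py definitions):

```python
from pydantic import BaseModel, field_validator


class ContextPrecisionInput(BaseModel):
    contexts_list: list[list[str]]

    @field_validator("contexts_list", mode="before")
    @classmethod
    def normalize_contexts_list(cls, value):
        # Promote a flat list of strings to a single-element list of lists
        # so callers can pass either shape.
        if isinstance(value, list) and value and all(isinstance(v, str) for v in value):
            return [value]
        return value
```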
6. dynamiq/evaluations/metrics/context_recall.py:247
  • Draft comment:
    Consider handling missing keys more gracefully when building ClassificationItem inside run_single.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50%
    None
7. dynamiq/evaluations/metrics/factual_correctness.py:410
  • Draft comment:
    Ensure that decompose_claims properly handles cases where no claims are decomposed, to avoid possible issues downstream.
  • Reason this comment was not posted:
    Comment did not seem useful. Usefulness confidence = 0% <= threshold 50%
    This comment is asking the author to ensure that a function handles a specific case correctly. It is not making a specific suggestion or pointing out a specific issue, but rather asking for confirmation of intended behavior, which violates the rules.
8. dynamiq/evaluations/metrics/faithfulness.py:421
  • Draft comment:
    Handle the case when simplify_statements returns an empty list more explicitly to avoid potential downstream issues.
  • Reason this comment was not posted:
    Comment looked like it was already resolved.
9. dynamiq/evaluations/metrics/faithfulness.py:450
  • Draft comment:
    Ensure that verbose debug logs do not expose sensitive data in production environments.
  • Reason this comment was not posted:
    Comment did not seem useful. Usefulness confidence = 0% <= threshold 50%
    This comment is asking the author to ensure something, which violates the rule against asking the author to ensure behavior is intended or tested. It doesn't provide a specific suggestion or point out a specific issue in the code.
10. tests/integration/evaluations/metrics/test_answer_correctness.py:89
  • Draft comment:
    Mock side_effect lists rely on call order; consider adding comments or checks to ensure ordering is maintained to prevent brittle tests.
  • Reason this comment was not posted:
    Comment was on unchanged code.
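For context, a tiny generic illustration of the call-order dependence being flagged (plain unittest.mock usage, not the actual test code):

```python
from unittest.mock import MagicMock

mock_llm = MagicMock()
# side_effect consumes this list in call order; reordering the calls under
# test silently pairs them with the wrong canned responses.
mock_llm.run.side_effect = ["statements for sample 1", "statements for sample 2"]

assert mock_llm.run("question 1") == "statements for sample 1"  # 1st call
assert mock_llm.run("question 2") == "statements for sample 2"  # 2nd call
```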

Workflow ID: wflow_0T0t5zIhQnnwPySQ


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

@acoola (Collaborator) left a comment

Great job. Please pull the latest main changes.

@acoola force-pushed the feat/add_single_metric_for_llm_predefined branch from d81435e to de9dbe6 on February 27, 2025 13:34
@acoola merged commit 782a176 into main on Feb 27, 2025
7 checks passed
@acoola deleted the feat/add_single_metric_for_llm_predefined branch on February 27, 2025 14:03