feat: add single metric for llm predefined #164
Conversation
👍 Looks good to me! Reviewed everything up to d81435e in 2 minutes and 28 seconds
More details
- Looked at 1279 lines of code in 6 files
- Skipped 0 files when reviewing
- Skipped posting 10 drafted comments based on config settings:
1. tests/integration/evaluations/metrics/test_answer_correctness.py:30
   - Draft comment: Clear setup for mocking extraction. side_effect is well structured.
   - Reason this comment was not posted: Confidence changes required: 0% <= threshold 50%
2. tests/integration/evaluations/metrics/test_answer_correctness.py:89
   - Draft comment: The side_effect sequence for classification mocks is clear and covers all calls.
   - Reason this comment was not posted: Confidence changes required: 0% <= threshold 50%
3. tests/integration/evaluations/metrics/test_answer_correctness.py:115
   - Draft comment: Expected scores and explanation comments are detailed and improve test clarity.
   - Reason this comment was not posted: Confidence changes required: 0% <= threshold 50%
4. dynamiq/evaluations/metrics/answer_correctness.py:429
   - Draft comment: Ensure extract_statements always returns a non-empty list before indexing [0].
   - Reason this comment was not posted: Decided after close inspection that this draft comment was likely wrong and/or not actionable (usefulness confidence = 0% vs. threshold = 50%). The extract_statements() method has defensive coding: it converts non-list results to lists and handles empty results, so it always returns a list of lists, with at least an empty list for each input question. Since single-element lists [question] and [answer] are passed in, the result is always a single-element list containing a list of statements, making the [0] access safe. Could there be edge cases? What if the LLM evaluator fails completely and returns None? Looking at lines 257-261, even if the LLM returns None or invalid data, it is converted to an empty list; the code is quite defensive. The comment should be deleted: the code already handles empty/invalid results safely and there is no risk of IndexError. (A sketch of this defensive-conversion pattern appears after this list.)
5. dynamiq/evaluations/metrics/context_precision.py:24
   - Draft comment: Good use of field_validator to normalize contexts_list; consider adding more detailed error messages if input is invalid. (See the field_validator sketch after this list.)
   - Reason this comment was not posted: Confidence changes required: 33% <= threshold 50%
6. dynamiq/evaluations/metrics/context_recall.py:247
   - Draft comment: Consider handling missing keys more gracefully when building ClassificationItem inside run_single.
   - Reason this comment was not posted: Confidence changes required: 33% <= threshold 50%
7. dynamiq/evaluations/metrics/factual_correctness.py:410
   - Draft comment: Ensure that decompose_claims properly handles cases where no claims are decomposed, to avoid possible issues downstream.
   - Reason this comment was not posted: Comment did not seem useful (confidence is useful = 0% <= threshold 50%). This comment is asking the author to ensure that a function handles a specific case correctly. It is not making a specific suggestion or pointing out a specific issue, but rather asking for confirmation of intended behavior, which violates the rules.
8. dynamiq/evaluations/metrics/faithfulness.py:421
   - Draft comment: Handle the case when simplify_statements returns an empty list more explicitly to avoid potential downstream issues.
   - Reason this comment was not posted: Comment looked like it was already resolved.
9. dynamiq/evaluations/metrics/faithfulness.py:450
   - Draft comment: Ensure that verbose debug logs do not expose sensitive data in production environments.
   - Reason this comment was not posted: Comment did not seem useful (confidence is useful = 0% <= threshold 50%). This comment is asking the author to ensure something, which violates the rule against asking the author to ensure behavior is intended or tested. It doesn't provide a specific suggestion or point out a specific issue in the code.
10. tests/integration/evaluations/metrics/test_answer_correctness.py:89
    - Draft comment: Mock side_effect lists rely on call order; consider adding comments or checks to ensure ordering is maintained to prevent brittle tests. (See the side_effect sketch after this list.)
    - Reason this comment was not posted: Comment was on unchanged code.
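Referring back to draft comment 4: a minimal sketch, with a hypothetical helper name, of the defensive-conversion pattern the reasoning above describes; it is not the actual extract_statements() implementation in answer_correctness.py.

```python
# Hypothetical helper illustrating the defensive pattern described above;
# not the actual dynamiq extract_statements() code.
def normalize_extraction_result(raw, num_questions: int) -> list[list[str]]:
    """Coerce an LLM extraction result into one statement list per question."""
    if not isinstance(raw, list):
        raw = []  # None or any unexpected type becomes "no statements"
    normalized: list[list[str]] = []
    for i in range(num_questions):
        item = raw[i] if i < len(raw) else []
        if isinstance(item, list):
            normalized.append([str(s) for s in item])
        elif item:
            normalized.append([str(item)])  # wrap a single stray statement
        else:
            normalized.append([])
    return normalized


# One entry is always produced per input question, so indexing [0] on the
# result of a single-question call cannot raise an IndexError.
statements = normalize_extraction_result(None, num_questions=1)[0]  # -> []
```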
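Referring back to draft comment 5: a minimal Pydantic sketch of normalizing a contexts_list field with an explicit error message; the model name, field shape, and messages are assumptions, not the actual dynamiq code.

```python
from pydantic import BaseModel, field_validator


class ContextPrecisionInput(BaseModel):
    """Hypothetical input model; the real dynamiq model may differ."""

    contexts_list: list[list[str]]

    @field_validator("contexts_list", mode="before")
    @classmethod
    def normalize_contexts_list(cls, value):
        # Accept a flat list of context strings and wrap it as a single group;
        # reject anything that is not a list with a descriptive message.
        if isinstance(value, list) and value and all(isinstance(c, str) for c in value):
            return [value]
        if not isinstance(value, list):
            raise ValueError(
                "contexts_list must be a list of context strings or a list of lists of strings"
            )
        return value
```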
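Referring back to draft comment 10: a minimal unittest.mock sketch showing that side_effect values are consumed strictly in call order; the names and return shapes are illustrative, not the actual test code.

```python
from unittest.mock import MagicMock

# Each call to the mock consumes the next side_effect value, so the list
# order must match the order of calls made by the code under test.
mock_llm = MagicMock()
mock_llm.run.side_effect = [
    {"statements": ["statement extracted from the answer"]},        # 1st call
    {"statements": ["statement extracted from the ground truth"]},  # 2nd call
]

first = mock_llm.run("answer")
second = mock_llm.run("ground truth")
assert first["statements"] == ["statement extracted from the answer"]
assert second["statements"] == ["statement extracted from the ground truth"]
```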
Workflow ID: wflow_0T0t5zIhQnnwPySQ
You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.
Great job. Please pull latest main changes
d81435e to de9dbe6
Important

Introduce a run_single method for single sample evaluation across multiple metrics, refactoring existing methods for consistency and updating tests accordingly.

- Add a run_single() method to AnswerCorrectnessEvaluator, ContextPrecisionEvaluator, ContextRecallEvaluator, FactualCorrectnessEvaluator, and FaithfulnessEvaluator for single sample evaluation.
- Refactor the run() methods in these evaluators to use run_single() for each sample.
- Add AnswerCorrectnessRunSingleInput to answer_correctness.py.
- Add FaithfulnessRunSingleInput to faithfulness.py.
- Update test_answer_correctness_evaluator in test_answer_correctness.py to reflect changes in evaluation logic.

This description was created by Ellipsis for d81435e. It will automatically update as commits are pushed.
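To illustrate the refactor summarized above, here is a minimal sketch of a run() method delegating to run_single() for each sample; the field names and signatures are simplified assumptions, not the actual dynamiq implementation.

```python
from pydantic import BaseModel


class FaithfulnessRunSingleInput(BaseModel):
    # Simplified stand-in for the input model added in this PR; the real model
    # in dynamiq/evaluations/metrics/faithfulness.py may carry more fields.
    question: str
    answer: str
    context: str


class FaithfulnessEvaluator:
    def run_single(self, single_input: FaithfulnessRunSingleInput) -> float:
        """Evaluate one sample and return its score (placeholder logic)."""
        # The real evaluator extracts, simplifies, and verifies statements
        # with an LLM; a dummy score stands in for that here.
        return 1.0 if single_input.answer else 0.0

    def run(self, questions: list[str], answers: list[str], contexts: list[str]) -> list[float]:
        """Evaluate a batch by delegating to run_single() for each sample."""
        return [
            self.run_single(FaithfulnessRunSingleInput(question=q, answer=a, context=c))
            for q, a, c in zip(questions, answers, contexts)
        ]
```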