Summary
In evaluation/utils.py, the get_final_score function calculates four distinct metrics (skill_match_score, entity_match_score, skill_with_entity_match_score, exact_match_score). However, when computing the weighted total_score, the code iterates only over the keys present in the skill_entity_scores dictionary. This causes skill_with_entity_match_score (10% weight) and exact_match_score (10% weight) to be completely ignored.
As a result, the maximum possible score for any task is capped at 80.0 instead of 100.0, and models are not rewarded for correct structural dependencies or joint skill-entity matching.
Code Analysis
The issue is located in evaluation/utils.py:
# ... lines 319-323
skill_entity_scores = calculate_skill_and_entity_scores(standard_skill_sequence, model_skill_sequence)
# This dict ONLY contains: ['skill_match_score', 'entity_match_score']
skill_with_entity_scores = calculate_skill_with_entity_scores(standard_skill_sequence, model_skill_sequence)
# This is a separate variable
exact_match_score = get_exact_match(standard_skill_sequence, model_skill_sequence, dependency)
# This is a separate variable
score_weight = {
    "skill_match_score": 0.4,
    "entity_match_score": 0.4,
    "skill_with_entity_match_score": 0.1,
    "exact_match_score": 0.1
}
# BUG HERE: This loop only iterates over keys in `skill_entity_scores`,
# effectively ignoring the other two metrics computed above.
total_score = sum(score_weight[key] * value for key, value in skill_entity_scores.items())
Reproduction Steps
- Run evaluation on any task where skill_with_entity_match_score > 0.
- Observe the generated output.json or final_score.json.
- Manually calculate the weighted sum: 0.4*skill + 0.4*entity + 0.1*joint + 0.1*exact.
- Compare it with the logged total_score.
Example:
If a model gets:
- Skill Match: 100 (weighted 40)
- Entity Match: 0 (weighted 0)
- Skill+Entity Match: 50 (weighted 5)
- Exact Match: 0 (weighted 0)
Expected Score: 45.0
Actual Score: 40.0
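The example above can be reproduced without running the evaluator. The sketch below is only an illustration: the weights are copied from the snippet in this issue, while the three score variables are hard-coded stand-ins for the return values of the real functions (their exact shapes are assumed from the suggested fix, not verified against the repository).

```python
# Weights as shown in evaluation/utils.py in this issue.
score_weight = {
    "skill_match_score": 0.4,
    "entity_match_score": 0.4,
    "skill_with_entity_match_score": 0.1,
    "exact_match_score": 0.1,
}

# Hard-coded stand-ins for the evaluator's outputs (assumed shapes, example values).
skill_entity_scores = {"skill_match_score": 100.0, "entity_match_score": 0.0}
skill_with_entity_scores = {"skill_with_entity_match_score": 50.0}
exact_match_score = 0.0

# Current behaviour: only the two keys of skill_entity_scores are weighted.
actual = sum(score_weight[key] * value for key, value in skill_entity_scores.items())

# Intended behaviour: all four metrics contribute to the total.
expected = (
    actual
    + score_weight["skill_with_entity_match_score"]
    * skill_with_entity_scores["skill_with_entity_match_score"]
    + score_weight["exact_match_score"] * exact_match_score
)

# Upper bound of the current formula when both included metrics are perfect.
buggy_maximum = sum(score_weight[key] * 100.0 for key in skill_entity_scores)

print(f"actual={actual:.1f} expected={expected:.1f} buggy_maximum={buggy_maximum:.1f}")
```

The gap between the two totals is exactly the weighted contribution of the two metrics that the loop never visits.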
Suggested Fix
Explicitly sum all weighted components instead of iterating through a partial dictionary.
total_score = (
    skill_entity_scores["skill_match_score"] * score_weight["skill_match_score"] +
    skill_entity_scores["entity_match_score"] * score_weight["entity_match_score"] +
    skill_with_entity_scores["skill_with_entity_match_score"] * score_weight["skill_with_entity_match_score"] +
    exact_match_score * score_weight["exact_match_score"]
)
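An alternative that guards against the same mistake recurring is to merge all metrics into one dictionary and iterate over the weights rather than the scores, so a missing metric raises an error instead of being silently dropped. This is a sketch only: weighted_total is a hypothetical helper name, and it assumes skill_with_entity_scores is a dict keyed by "skill_with_entity_match_score" and exact_match_score is a plain number, as the fix above implies.

```python
def weighted_total(skill_entity_scores, skill_with_entity_scores,
                   exact_match_score, score_weight):
    """Hypothetical helper: weighted sum over every metric named in score_weight."""
    # Collect all four metrics under the same key names used by score_weight.
    all_scores = {
        **skill_entity_scores,
        **skill_with_entity_scores,
        "exact_match_score": exact_match_score,
    }
    # Fail loudly if a weighted metric was never computed.
    missing = set(score_weight) - set(all_scores)
    if missing:
        raise KeyError(f"metrics missing from total_score: {sorted(missing)}")
    # Iterate over the weights so every weighted metric is always included.
    return sum(weight * all_scores[name] for name, weight in score_weight.items())


if __name__ == "__main__":
    weights = {
        "skill_match_score": 0.4,
        "entity_match_score": 0.4,
        "skill_with_entity_match_score": 0.1,
        "exact_match_score": 0.1,
    }
    total = weighted_total(
        {"skill_match_score": 100.0, "entity_match_score": 0.0},
        {"skill_with_entity_match_score": 50.0},
        0.0,
        weights,
    )
    print(f"total_score={total:.1f}")
```

Driving the loop from score_weight means that adding a fifth weighted metric later without computing it would surface as a KeyError rather than as a quietly lowered score.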