Bug Report: "total_score" calculation ignores partial metrics in "get_final_score" #77

@Evan-Joseph

Summary

In evaluation/utils.py, the get_final_score function calculates four distinct metrics (skill_match_score, entity_match_score, skill_with_entity_match_score, exact_match_score). However, when computing the weighted total_score, the code iterates only over the keys present in the skill_entity_scores dictionary. This causes skill_with_entity_match_score (10% weight) and exact_match_score (10% weight) to be completely ignored.

As a result, the maximum possible total_score for any task is 80.0 rather than 100.0, because only the two 0.4 weights are ever applied. Models receive no credit for correct structural dependencies or for joint skill-entity matching.
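
The cap can be checked with a short sketch. It assumes each metric is reported on a 0-100 scale, as the example figures in this report suggest. The helper name max_reachable is hypothetical and not part of evaluation/utils.py.

```python
# Weights as quoted in this report.
SCORE_WEIGHT = {
    "skill_match_score": 0.4,
    "entity_match_score": 0.4,
    "skill_with_entity_match_score": 0.1,
    "exact_match_score": 0.1,
}


def max_reachable(keys, per_metric_max=100.0):
    """Upper bound on the weighted total when only `keys` are summed.

    `per_metric_max` is an assumption (scores on a 0-100 scale).
    """
    return sum(SCORE_WEIGHT[key] * per_metric_max for key in keys)


# Keys the current loop actually visits, per the analysis in this report.
summed_today = ["skill_match_score", "entity_match_score"]

print("reachable with current loop:", max_reachable(summed_today))
print("reachable with all metrics: ", max_reachable(SCORE_WEIGHT))
```

The gap between the two printed values is the combined weight of the two metrics that never enter the sum.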

Code Analysis

The issue is located in evaluation/utils.py:

# ... lines 319-323
skill_entity_scores = calculate_skill_and_entity_scores(standard_skill_sequence, model_skill_sequence)
# This dict ONLY contains: ['skill_match_score', 'entity_match_score']

skill_with_entity_scores = calculate_skill_with_entity_scores(standard_skill_sequence, model_skill_sequence)
# This is a separate variable

exact_match_score = get_exact_match(standard_skill_sequence, model_skill_sequence, dependency)
# This is a separate variable

score_weight = {
    "skill_match_score": 0.4,
    "entity_match_score": 0.4,
    "skill_with_entity_match_score": 0.1,
    "exact_match_score": 0.1
}

# BUG HERE: This loop only iterates over keys in `skill_entity_scores`,
# effectively ignoring the other two metrics computed above.
total_score = sum(score_weight[key] * value for key, value in skill_entity_scores.items())

Reproduction Steps

  1. Run evaluation on any task where skill_with_entity_match_score > 0.
  2. Observe the generated output.json or final_score.json.
  3. Manually calculate the weighted sum: 0.4*skill + 0.4*entity + 0.1*joint + 0.1*exact.
  4. Compare it with the logged total_score.

Example:
If a model gets:

  • Skill Match: 100 (weighted 40)
  • Entity Match: 0 (weighted 0)
  • Skill+Entity Match: 50 (weighted 5)
  • Exact Match: 0 (weighted 0)

Expected Score: 45.0
Actual Score: 40.0
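
The example above can be reproduced without running the full evaluation. This is a minimal sketch: buggy_total mirrors the loop quoted under Code Analysis, while expected_total and the literal score values are illustrative stand-ins rather than code from the repository.

```python
SCORE_WEIGHT = {
    "skill_match_score": 0.4,
    "entity_match_score": 0.4,
    "skill_with_entity_match_score": 0.1,
    "exact_match_score": 0.1,
}


def buggy_total(skill_entity_scores):
    # Same shape as the reported loop: only keys of the partial dict are visited.
    return sum(SCORE_WEIGHT[key] * value for key, value in skill_entity_scores.items())


def expected_total(all_scores):
    # Every weighted metric contributes to the sum.
    return sum(SCORE_WEIGHT[key] * all_scores[key] for key in SCORE_WEIGHT)


# Scores from the example in this report.
example = {
    "skill_match_score": 100.0,
    "entity_match_score": 0.0,
    "skill_with_entity_match_score": 50.0,
    "exact_match_score": 0.0,
}

# The partial dict holds only the two keys that skill_entity_scores contains.
partial = {key: example[key] for key in ("skill_match_score", "entity_match_score")}

print("expected:", expected_total(example))
print("actual:  ", buggy_total(partial))
```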

Suggested Fix

Explicitly sum all weighted components instead of iterating through a partial dictionary.

    total_score = (
        skill_entity_scores["skill_match_score"] * score_weight["skill_match_score"] +
        skill_entity_scores["entity_match_score"] * score_weight["entity_match_score"] +
        skill_with_entity_scores["skill_with_entity_match_score"] * score_weight["skill_with_entity_match_score"] +
        exact_match_score * score_weight["exact_match_score"]
    )
