
Conversation

@yuvalluria
Contributor

Replace model_evaluator.score_model() composite scoring with direct
AA benchmark scores from usecase_quality_scorer. The composite score
incorrectly favored smaller models due to latency/budget bonuses.

Changes:
- Get raw accuracy from score_model_quality() in capacity_planner
- GPT-OSS 120B now correctly shows ~62% (was showing lower)
- GPT-OSS 20B now correctly shows ~55% (was showing higher)

Assisted-by: Claude <[email protected]>
Signed-off-by: Yuval Luria <[email protected]>
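For context, a minimal sketch of what the change amounts to inside capacity_planner. Only the module and function names (usecase_quality_scorer, score_model_quality(), model_evaluator.score_model()) come from the description above; the wrapper name, the argument list, and the "chat" use case are assumptions for illustration, not the actual code.

```python
# Sketch only: the exact signature of score_model_quality() and the
# surrounding capacity_planner code are assumed, not taken from the diff.
from usecase_quality_scorer import score_model_quality


def plan_quality(model_name: str, usecase: str) -> float:
    """Report the raw AA benchmark accuracy for a model on a use case.

    Previously this went through model_evaluator.score_model(), a composite
    score whose latency/budget bonuses skewed rankings toward smaller models.
    """
    # Old (removed): model_evaluator.score_model(model_name, usecase)
    return score_model_quality(model_name, usecase)


# Expected effect, using the figures quoted in the description
# (model/use-case strings are illustrative):
#   plan_quality("gpt-oss-120b", "chat")  -> ~0.62
#   plan_quality("gpt-oss-20b", "chat")   -> ~0.55
```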