-
|
Hey @ShiraVH, just saw this discussion! You should set the seed for your LLM judge by creating a custom evaluation model, as described here: https://www.deepeval.com/guides/guides-using-custom-llms
-
|
Reproducibility in LLM evaluation is fundamentally harder than in traditional ML: the ground truth itself can be model-dependent, and the same prompt can yield different evaluations across model versions. A few things that help in practice:

Seed the evaluator, not just the model under test. Most reproducibility work focuses on fixing temperature and seeds for the model being evaluated. But if your evaluator (the LLM judging the output) is also non-deterministic, you get compounding variance. Fix seeds at both levels, or use a deterministic judge (regex, schema validation) for metrics where that's feasible.

Version the evaluation schema. The criteria you use to evaluate quality drift over time. If "coherence" means something slightly different in v1 vs v2 of your rubric, your longitudinal metrics aren't comparable. We hash the evaluation prompt + criteria and store the hash alongside each eval run.

Store the full context, not just the score. For agent evaluations specifically, the relevant context is often the full delegation chain (what the root agent intended, what sub-agents were called, what the final output was). Storing just the final output + score makes it hard to diagnose why a score changed.

Behavioral drift detection. Rather than just checking individual eval scores, track distributions over time. We use KL divergence on action distributions to detect when an agent's behavior has shifted, even if individual eval metrics look fine.

More on evaluation design for multi-agent systems: https://blog.kinthai.ai/221-agents-multi-agent-coordination-lessons

Is the reproducibility issue mainly for CI-style regression testing, or for research comparisons across model versions?
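To make two of the points above concrete (hashing the evaluation prompt + criteria, and KL divergence on action distributions), here is a minimal, standard-library-only sketch. It is not deepeval API and not necessarily how the setup described above is implemented: the function names (`schema_fingerprint`, `action_distribution`, `kl_divergence`), the example rubric, and the action labels are all hypothetical, and the add-one smoothing is an assumption made so the divergence stays finite when an action is missing from one run.

```python
import hashlib
import json
import math
from collections import Counter


def schema_fingerprint(eval_prompt: str, criteria: dict) -> str:
    """Return a stable SHA-256 hash of the judge prompt plus rubric criteria.

    Keys are sorted so that dict ordering does not change the hash.
    """
    payload = json.dumps(
        {"prompt": eval_prompt, "criteria": criteria},
        sort_keys=True,
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def action_distribution(actions, vocabulary, smoothing=1.0):
    """Turn a list of action labels into a smoothed probability distribution.

    Additive smoothing (an assumption of this sketch) keeps every
    probability above zero, so the KL divergence below is always finite.
    """
    counts = Counter(actions)
    total = len(actions) + smoothing * len(vocabulary)
    return {a: (counts[a] + smoothing) / total for a in vocabulary}


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same keys."""
    return sum(p[a] * math.log(p[a] / q[a]) for a in p if p[a] > 0)


# Hypothetical rubric and action logs, for illustration only.
fingerprint = schema_fingerprint(
    "Rate the coherence of the answer from 1 to 5.",
    {"coherence": "ideas follow logically", "scale": [1, 5]},
)

baseline_actions = ["search", "search", "summarize", "answer"]
current_actions = ["search", "answer", "answer", "answer"]
vocabulary = sorted(set(baseline_actions) | set(current_actions))

baseline = action_distribution(baseline_actions, vocabulary)
current = action_distribution(current_actions, vocabulary)
drift = kl_divergence(current, baseline)
```

Storing `fingerprint` next to each eval run lets you tell at a glance whether two scores were produced under the same rubric, and comparing `drift` against a threshold you pick yourself gives a simple alert when the action mix shifts.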
-
Hello, I am using your deepeval benchmarks to test different variants of a model. I want to make sure that the tests are exactly the same each time I run them. How should I do that? Which seeds should I set manually?
Thank you!