Reproducibility #1119

ShiraVH · 2024-10-23T08:32:51Z

Hello, I am using your deepeval benchmarks to test different variants of a model. I want to make sure that the tests are exactly the same on every run. How should I do that? Which seeds should I set manually?
Thank you!

penguine-ip · 2025-04-28T03:03:39Z

Maintainer

Hey @ShiraVH just saw this discussion! You should set the seed for your LLM judge by creating a custom evaluation model here: https://www.deepeval.com/guides/guides-using-custom-llms
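For reference, here is a minimal sketch of what such a custom judge might look like, assuming the `DeepEvalBaseLLM` interface described in the linked guide (`load_model`, `generate`, `a_generate`, `get_model_name`). `SeededJudge` and the `complete` callable are hypothetical names: `complete` stands in for whatever client call your provider exposes, and whether a provider accepts a `seed` argument, and how deterministic the result really is, varies by provider.

```python
import asyncio
from typing import Callable

try:
    from deepeval.models import DeepEvalBaseLLM
except ImportError:
    # Fallback so the sketch still runs where deepeval is not installed.
    DeepEvalBaseLLM = object


class SeededJudge(DeepEvalBaseLLM):
    """Hypothetical judge wrapper that pins seed and temperature on every call."""

    def __init__(self, complete: Callable[..., str], seed: int = 42):
        # Assumption: complete(prompt, seed=..., temperature=...) returns the
        # judge's text response. Wire it to your provider's client yourself.
        self.complete = complete
        self.seed = seed

    def load_model(self):
        return self.complete

    def generate(self, prompt: str) -> str:
        # Same seed and temperature 0 for every judge call.
        return self.load_model()(prompt, seed=self.seed, temperature=0.0)

    async def a_generate(self, prompt: str) -> str:
        # Run the synchronous path in a worker thread.
        return await asyncio.to_thread(self.generate, prompt)

    def get_model_name(self) -> str:
        return f"seeded-judge(seed={self.seed})"
```

An instance would then be passed wherever the guide says a custom model is accepted. Recording the seed in the model name makes it visible in whatever logs or reports include the judge's name.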

2 replies

mausch Apr 28, 2025

Thanks, but I don't follow... Are users expected to create a custom eval model (which seems like quite a bit of boilerplate) for every eval, even if they just want to set a seed for reproducibility? That seems like overkill.

penguine-ip Apr 28, 2025
Maintainer

@mausch we welcome a PR from you to fix it!

kinthaiofficial · 2026-04-29T00:29:40Z

Reproducibility in LLM evaluation is fundamentally harder than in traditional ML — the ground truth itself can be model-dependent, and the same prompt can yield different evaluations across model versions.

A few things that help in practice:

Seed the evaluator, not just the model under test. Most reproducibility work focuses on fixing the temperature and seed of the model being evaluated. But if your evaluator (the LLM judging the output) is also non-deterministic, the two sources of variance compound. Fix seeds at both levels, or use a deterministic judge (regex, schema validation) for metrics where that's feasible.
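As an illustration of the deterministic-judge option, here is a minimal sketch using only the standard library. The function names, the ISO-date check, and the shape of `required` are all hypothetical choices for the example, not part of deepeval.

```python
import json
import re

# Hypothetical metric: the model was asked to answer with a YYYY-MM-DD date.
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def judge_iso_date(output: str) -> float:
    """Return 1.0 if the whole output is a YYYY-MM-DD string, else 0.0."""
    return 1.0 if ISO_DATE.fullmatch(output.strip()) else 0.0


def judge_json_schema(output: str, required: dict[str, type]) -> float:
    """Return 1.0 if output is a JSON object whose required keys have the expected types."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(data, dict):
        return 0.0
    ok = all(isinstance(data.get(key), typ) for key, typ in required.items())
    return 1.0 if ok else 0.0
```

Judges like these return the same score for the same output on every run, so any remaining variance comes from the model under test.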

Version the evaluation schema. The criteria you use to judge quality tend to drift over time. If "coherence" means something slightly different in v1 vs v2 of your rubric, your longitudinal metrics aren't comparable. We hash the evaluation prompt + criteria and store the hash alongside each eval run.
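One way to compute such a hash, as a sketch: serialize the prompt and criteria to canonical JSON and take a SHA-256 digest. The function name and the dictionary layout of `criteria` are assumptions for illustration.

```python
import hashlib
import json


def rubric_fingerprint(prompt_template: str, criteria: dict) -> str:
    """Return a stable SHA-256 hex digest of the evaluation prompt and criteria."""
    # sort_keys and fixed separators make the serialization independent of
    # dictionary ordering and whitespace, so equal rubrics hash equally.
    payload = json.dumps(
        {"prompt": prompt_template, "criteria": criteria},
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two runs are then comparable only if their fingerprints match; any edit to the rubric wording yields a different digest.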

Store the full context, not just the score. For agent evaluations specifically, the relevant context is often the full delegation chain (what the root agent intended, what sub-agents were called, what the final output was). Storing just the final output + score makes it hard to diagnose why a score changed.
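A sketch of what such a stored record might contain, using dataclasses and JSON. All class and field names here are hypothetical; adapt them to whatever your agent framework actually exposes.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AgentCall:
    """One step in the delegation chain (hypothetical shape)."""
    agent: str
    input: str
    output: str


@dataclass
class EvalRecord:
    """Everything needed to revisit a score later, not just the score itself."""
    run_id: str
    root_intent: str
    final_output: str
    score: float
    delegation_chain: list[AgentCall] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict recurses into the nested AgentCall entries.
        return json.dumps(asdict(self), sort_keys=True)
```

With the chain stored next to the score, a regression can be traced to the sub-agent call whose output changed.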

Behavioral drift detection. Rather than just checking individual eval scores, track distributions over time. We use KL divergence on action distributions to detect when an agent's behavior has shifted, even if individual eval metrics look fine.
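A minimal sketch of that kind of check, assuming actions are logged as strings drawn from a known vocabulary. The function names, the additive smoothing, and the direction of the divergence (current run against baseline) are assumptions for illustration, not a description of the exact setup above.

```python
import math
from collections import Counter


def action_distribution(actions, vocabulary, alpha=1.0):
    """Smoothed relative frequency of each action in the vocabulary."""
    counts = Counter(actions)
    # Additive smoothing keeps every probability positive, so the KL
    # divergence below never divides by zero.
    total = sum(counts[a] for a in vocabulary) + alpha * len(vocabulary)
    return {a: (counts[a] + alpha) / total for a in vocabulary}


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same keys."""
    return sum(p[a] * math.log(p[a] / q[a]) for a in p if p[a] > 0)


def drift(baseline_actions, current_actions, vocabulary):
    """Divergence of the current action distribution from the baseline."""
    p = action_distribution(current_actions, vocabulary)
    q = action_distribution(baseline_actions, vocabulary)
    return kl_divergence(p, q)
```

A value near zero means the action mix is unchanged. What counts as a meaningful shift depends on sample size and has to be calibrated against run-to-run variation on the baseline.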

More on evaluation design for multi-agent systems: https://blog.kinthai.ai/221-agents-multi-agent-coordination-lessons

Is the reproducibility issue mainly for CI-style regression testing, or for research comparisons across model versions?

0 replies
