refactor(gaia): use evaluation framework for max_retries parameter #268

juanmichelini · 2026-01-06T22:02:01Z

Summary

This PR fixes issue #171 by making GAIA's run_infer.py use the evaluation framework's parameter handling consistently with swebench/run_infer.py and commit0/run_infer.py.

Changes

- Added max_retries=args.max_retries to EvalMetadata: Previously, GAIA did not pass the max_retries argument to the metadata, so the evaluation framework's retry logic for error handling did not use the user-provided value.
- Renamed a confusing local variable: Renamed max_retries to max_event_sync_retries in _extract_answer_from_history to clarify that it controls WebSocket event synchronization (waiting for events to arrive), not error retries (which are handled by the Evaluation base class).
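A minimal sketch of the two changes, for illustration only. The EvalMetadata class below is a stand-in dataclass, not the framework's real class; its dataset field, the default of 3, and the build_metadata and wait_for_answer helpers are hypothetical names chosen for this example. Only max_retries=args.max_retries and the max_event_sync_retries name come from the PR description.

```python
import argparse
import time
from dataclasses import dataclass


@dataclass
class EvalMetadata:
    """Stand-in for the framework's metadata class; real fields may differ."""

    dataset: str
    max_retries: int = 3  # assumed default, for illustration only


def build_metadata(args: argparse.Namespace) -> EvalMetadata:
    # Forward the user-provided value so the framework's error-retry logic
    # sees it, instead of silently falling back to the default.
    return EvalMetadata(dataset=args.dataset, max_retries=args.max_retries)


def wait_for_answer(fetch_events, max_event_sync_retries: int = 3, delay: float = 0.0):
    """Poll for events a bounded number of times (hypothetical helper).

    The distinct name makes it clear that this bound is about waiting for
    events to arrive, not about retrying a failed instance.
    """
    for _ in range(max_event_sync_retries):
        events = fetch_events()
        if events:
            return events[-1]
        time.sleep(delay)
    return None
```

Keeping the two counters under different names means a reader of _extract_answer_from_history cannot mistake the polling bound for the framework-level retry setting.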

Why this matters

- The max_retries parameter controls how many times the evaluation framework retries an instance when it throws an exception.
- GAIA was ignoring this parameter, always using the default value from EvalMetadata.
- This change aligns GAIA with other benchmarks (swebench, commit0) that properly pass this parameter.
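The retry behavior described above can be sketched roughly as follows. This is not the Evaluation base class's actual implementation; run_with_retries and process_instance are hypothetical names, and the sketch assumes that max_retries counts retries after the first attempt.

```python
def run_with_retries(process_instance, instance, max_retries: int):
    """Run process_instance, retrying after exceptions (illustrative only).

    Assumption: max_retries is the number of retries after the first
    attempt, so the function makes at most max_retries + 1 attempts.
    """
    last_error = None
    for _attempt in range(max_retries + 1):
        try:
            return process_instance(instance)
        except Exception as exc:  # any exception from the instance triggers a retry
            last_error = exc
    raise RuntimeError(
        f"instance failed after {max_retries + 1} attempts"
    ) from last_error
```

Under this assumption, a benchmark that never forwards the command-line value would always run with the default retry count, whatever the user requested.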

Fixes #171

- Add max_retries=args.max_retries to EvalMetadata to use the evaluation framework's retry logic instead of ignoring the parameter
- Rename local variable max_retries to max_event_sync_retries in _extract_answer_from_history to clarify it's for WebSocket event synchronization, not error retries

This aligns GAIA's parameter handling with swebench and commit0.

Co-authored-by: openhands <[email protected]>

juanmichelini · 2026-01-09T21:26:31Z

testing on https://github.com/OpenHands/evaluation/actions/runs/20866021715

tested! See https://openhands-ai.slack.com/archives/C09QGUDQVTL/p1767998023821739 and OpenHands/openhands-index-results#129

simonrosenberg · 2026-01-10T09:17:59Z

There are 0 files changed

juanmichelini · 2026-01-11T02:15:57Z

the issue does not affect full datasets, only splits, so will come back to it after the index is completed

openhands-ai bot mentioned this pull request Jan 6, 2026

refactor GAIA to use standard prepare_dataset logic #171

Open

Revert "refactor(gaia): use evaluation framework for max_retries para…

7e48f39

…meter" This reverts commit 404d5d0.

juanmichelini requested a review from simonrosenberg January 9, 2026 22:37

juanmichelini closed this Jan 11, 2026
