@juanmichelini
Collaborator

Summary

This PR fixes issue #171 by making GAIA's run_infer.py use the evaluation framework's parameter handling consistently with swebench/run_infer.py and commit0/run_infer.py.

Changes

  1. Added max_retries=args.max_retries to EvalMetadata: Previously, GAIA was not passing the max_retries argument to the metadata, which meant the evaluation framework's retry logic for error handling was not using the user-provided value.

  2. Renamed confusing local variable: Renamed max_retries to max_event_sync_retries in _extract_answer_from_history to clarify that this is for WebSocket event synchronization (waiting for events to arrive), not for error retries (which are handled by the Evaluation base class).
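The two changes above can be sketched roughly as follows. This is a minimal illustration, not the actual diff: the `EvalMetadata` class here is a hypothetical stand-in with only two fields, its default of 3 is assumed, `build_metadata` and `fetch_events` are invented helper names, and `max_event_sync_retries` is shown as a keyword parameter rather than a local variable so the example is easy to exercise.

```python
import argparse
from dataclasses import dataclass


@dataclass
class EvalMetadata:
    """Hypothetical stand-in; the real class in the framework has more fields."""

    dataset: str
    max_retries: int = 3  # assumed default, for illustration only


def build_metadata(args: argparse.Namespace) -> EvalMetadata:
    # Change 1: forward the user-provided value instead of silently
    # falling back to the dataclass default.
    return EvalMetadata(dataset=args.dataset, max_retries=args.max_retries)


def extract_answer_from_history(fetch_events, max_event_sync_retries: int = 3):
    # Change 2: this counter only bounds how many times we poll for
    # events to arrive. It is unrelated to the framework's error retries.
    for _ in range(max_event_sync_retries):
        events = fetch_events()  # hypothetical callable returning a list
        if events:
            return events[-1]
    return None


# Simulated command-line arguments (values are made up).
args = argparse.Namespace(dataset="gaia", max_retries=5)
metadata = build_metadata(args)
```

With distinct names, a reader can tell at a glance which counter governs event polling and which one is handed to the framework.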

Why this matters

  • The max_retries parameter controls how many times the evaluation framework retries an instance when it throws an exception
  • GAIA was ignoring this parameter, always using the default value from EvalMetadata
  • This change aligns GAIA with other benchmarks (swebench, commit0) that properly pass this parameter
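As a rough sketch of the retry behavior described above: the function below is not the Evaluation base class's code. `run_with_retries` is a hypothetical name, and it assumes `max_retries` counts the extra attempts made after the first failure; the framework may count differently.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_retries(process_instance: Callable[[], T], max_retries: int) -> T:
    """Call process_instance, retrying up to max_retries times on exceptions."""
    last_error: Optional[Exception] = None
    # One initial attempt plus max_retries retries (assumed semantics).
    for _attempt in range(max_retries + 1):
        try:
            return process_instance()
        except Exception as exc:
            last_error = exc
    raise RuntimeError(
        f"instance failed after {max_retries} retries"
    ) from last_error
```

If the benchmark never passes the user's value through, a loop like this always runs with the default, which is the behavior this PR corrects for GAIA.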

Fixes #171

- Add max_retries=args.max_retries to EvalMetadata to use the evaluation
  framework's retry logic instead of ignoring the parameter
- Rename local variable max_retries to max_event_sync_retries in
  _extract_answer_from_history to clarify it's for WebSocket event
  synchronization, not error retries

This aligns GAIA's parameter handling with swebench and commit0.

Co-authored-by: openhands <[email protected]>
@juanmichelini
Collaborator Author

juanmichelini commented Jan 9, 2026

@simonrosenberg
Collaborator

@juanmichelini
Collaborator Author

testing on https://github.com/OpenHands/evaluation/actions/runs/20866021715
tested! See https://openhands-ai.slack.com/archives/C09QGUDQVTL/p1767998023821739 and OpenHands/openhands-index-results#129

There are 0 files changed

The issue affects only dataset splits, not full datasets, so I will come back to it after the index is completed.

Development

Successfully merging this pull request may close these issues.

refactor GAIA to use standard prepare_dataset logic

4 participants