Fix Browser action deserialization by using OpenHandsModel #136
Problem
When running GAIA evaluations with Browser actions enabled (`enable_browser=True`), the evaluation phase completes successfully, but the aggregation phase fails with a pydantic validation error. This error occurs when attempting to deserialize `output.jsonl` files that contain Browser events (actions and observations).
Root Cause
The issue stems from how pydantic handles discriminated unions with dynamically registered types:
1. `EvalOutput` contains a `history: list[Event]` field
2. `Event` is a discriminated union that can contain different action/observation types
3. Browser types (`BrowserGetContentAction`, `BrowserObservation`, etc.) are registered dynamically when `get_default_tools(enable_browser=True)` is called
4. Because `EvalOutput` extends `BaseModel`, the schema is frozen with only the action types that existed at import time
Solution
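The frozen-schema behavior described under Root Cause, and the effect of rebuilding the schema before validation, can be illustrated with a minimal sketch in plain pydantic. This is an analogy only: it does not use `OpenHandsModel` or the SDK's real event classes. `MessageEvent`, `FinishEvent`, `BrowserAction`, `EVENT_TYPES`, and `build_history_adapter` are hypothetical names introduced for illustration.

```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field, TypeAdapter, ValidationError


# Hypothetical stand-ins for the SDK's event types (not the real classes).
class MessageEvent(BaseModel):
    kind: Literal["message"] = "message"
    text: str


class FinishEvent(BaseModel):
    kind: Literal["finish"] = "finish"


# Hypothetical registry standing in for dynamic tool registration.
EVENT_TYPES: list[type[BaseModel]] = [MessageEvent, FinishEvent]


def build_history_adapter() -> TypeAdapter:
    """Build a validator for a list of events from the types registered right now."""
    event_union = Annotated[Union[tuple(EVENT_TYPES)], Field(discriminator="kind")]
    return TypeAdapter(list[event_union])


# Built once "at import": it only knows the two types registered so far.
frozen_adapter = build_history_adapter()


# Later, a browser type is registered (analogous to enabling browser tools).
class BrowserAction(BaseModel):
    kind: Literal["browser"] = "browser"
    url: str


EVENT_TYPES.append(BrowserAction)

raw_history = [
    {"kind": "message", "text": "hi"},
    {"kind": "browser", "url": "https://example.com"},
]


def frozen_schema_fails() -> bool:
    """Return True if the schema built before registration rejects the browser event."""
    try:
        frozen_adapter.validate_python(raw_history)
    except ValidationError:
        return True
    return False


def rebuilt_schema_parses() -> list[BaseModel]:
    """Rebuild the schema from the current registry, then validate."""
    return build_history_adapter().validate_python(raw_history)
```

In this sketch the validator built before registration rejects the `"browser"` tag, while one rebuilt afterwards accepts it. That is the same trade the fix makes: pay for a schema rebuild before validation so that late-registered types are recognized.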
Change `EvalOutput` to extend `OpenHandsModel` instead of `BaseModel`. `OpenHandsModel` is a custom base class in the OpenHands SDK that automatically calls `model_rebuild()` before validation. This regenerates the discriminated union schema to include all dynamically registered event types, ensuring Browser actions can be properly deserialized.
Changes
`benchmarks/utils/models.py`
- `class EvalOutput(BaseModel)` → `class EvalOutput(OpenHandsModel)`
- Import added: `from openhands.sdk.utils.models import OpenHandsModel`
- Note on why `OpenHandsModel` is necessary
Impact
✅ Minimal and safe change: Only the parent class is modified
✅ Backward compatible: `OpenHandsModel` extends `BaseModel`
✅ No API changes: All existing code continues to work
✅ Fixes: GAIA evaluations with Browser tools
✅ Future-proof: Handles any future dynamically registered tool types
Testing
Verified with GAIA evaluation runs:
Why This PR is Necessary
The main branch already has GAIA support (added in #129) and uses Browser tools by default. Without this fix, all GAIA evaluations on the main branch will fail during aggregation when they try to load results containing Browser events.
This is a critical bug fix that should be merged to prevent evaluation failures.