@simonrosenberg
Collaborator

Problem

When running GAIA evaluations with Browser actions enabled (enable_browser=True), the evaluation phase completes successfully, but the aggregation phase fails with a pydantic validation error:

ValidationError: Unexpected kind BrowserGetContentAction

This error occurs when attempting to deserialize output.jsonl files that contain Browser events (actions and observations).

Root Cause

The issue stems from how pydantic handles discriminated unions with dynamically registered types:

  1. EvalOutput contains a history: list[Event] field
  2. Event is a discriminated union that can contain different action/observation types
  3. Browser action types (BrowserGetContentAction, BrowserObservation, etc.) are registered dynamically when get_default_tools(enable_browser=True) is called
  4. Pydantic builds and caches a model's validation schema when the class is defined, which for EvalOutput happens at import time (before Browser types are registered)
  5. When EvalOutput extends BaseModel, that cached schema is frozen with only the action types that existed at import
  6. During deserialization, pydantic encounters Browser action types that aren't in its cached schema and raises a validation error

Solution

Change EvalOutput to extend OpenHandsModel instead of BaseModel.

OpenHandsModel is a custom base class in the OpenHands SDK that automatically calls model_rebuild() before validation. This regenerates the discriminated union schema to include all dynamically registered event types, ensuring Browser actions can be properly deserialized.
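
The failure mode and the rebuild-before-validation idea can be modeled with a small standard-library sketch. This is a toy illustration only, not pydantic or the OpenHands SDK: `REGISTRY`, `register`, `Event`, `FrozenOutput`, and `RebuildingOutput` are hypothetical names, and the snapshot dictionary merely stands in for the cached discriminated union schema.

```python
import json

# Hypothetical registry standing in for the SDK's dynamic event registration.
REGISTRY = {}


class Event:
    def __init__(self, **fields):
        self.fields = fields


def register(cls):
    """Add an event class to the registry under its class name."""
    REGISTRY[cls.__name__] = cls
    return cls


@register
class MessageAction(Event):
    pass


class FrozenOutput:
    # Snapshot taken once, when the class body runs (models the cached schema).
    known_kinds = dict(REGISTRY)

    @classmethod
    def load(cls, line):
        events = []
        for raw in json.loads(line)["history"]:
            kind = raw["kind"]
            if kind not in cls.known_kinds:
                raise ValueError(f"Unexpected kind {kind}")
            events.append(cls.known_kinds[kind](**raw))
        return events


class RebuildingOutput(FrozenOutput):
    @classmethod
    def load(cls, line):
        # Refresh the snapshot before validating (models model_rebuild()).
        cls.known_kinds = dict(REGISTRY)
        return super().load(line)


# Registered after both output classes were defined, like the Browser types.
@register
class BrowserGetContentAction(Event):
    pass


line = json.dumps({"history": [{"kind": "BrowserGetContentAction"}]})

try:
    FrozenOutput.load(line)
    frozen_error = None
except ValueError as exc:
    frozen_error = str(exc)

rebuilt = RebuildingOutput.load(line)
```

In this model, `FrozenOutput` rejects the late-registered kind because its snapshot predates the registration, while `RebuildingOutput` accepts it because it refreshes the snapshot on every load. The refresh costs a little on each call, which is the trade-off for never missing a newly registered type.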

Changes

  • File: benchmarks/utils/models.py
  • Change: class EvalOutput(BaseModel) → class EvalOutput(OpenHandsModel)
  • Added import: from openhands.sdk.utils.models import OpenHandsModel
  • Added comprehensive docstring explaining why OpenHandsModel is necessary

Impact

  • Minimal and safe change: Only the parent class is modified
  • Backward compatible: OpenHandsModel extends BaseModel
  • No API changes: All existing code continues to work
  • Fixes: GAIA evaluations with Browser tools
  • Future-proof: Handles any future dynamically registered tool types

Testing

Verified with GAIA evaluation runs:

  • Before: Evaluation succeeded but aggregation failed with deserialization errors
  • After: Complete end-to-end success with Browser actions in output.jsonl properly deserialized
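
To check whether a given output.jsonl actually contains Browser events before running aggregation, a quick scan along these lines may help. It is a rough sketch that assumes each line is a JSON object with a `history` list whose entries carry a `kind` key (inferred from the error message above, not confirmed against the SDK's serialization format); `count_event_kinds` is a hypothetical helper, not part of the benchmarks repository.

```python
import json
from collections import Counter


def count_event_kinds(lines):
    """Count the `kind` of every history entry across JSONL records."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        record = json.loads(line)
        for event in record.get("history", []):
            if isinstance(event, dict):
                counts[event.get("kind", "<missing>")] += 1
    return counts


# Made-up sample records, shaped like the assumed output.jsonl layout.
sample = [
    json.dumps({"history": [{"kind": "MessageAction"},
                            {"kind": "BrowserGetContentAction"}]}),
    json.dumps({"history": [{"kind": "BrowserObservation"}]}),
]
kinds = count_event_kinds(sample)
browser_kinds = sorted(k for k in kinds if k.startswith("Browser"))
```

For a real run, an open file handle can be passed in place of `sample`, since the function only iterates over lines.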

Why This PR is Necessary

The main branch already has GAIA support (added in #129) and uses Browser tools by default. Without this fix, every GAIA evaluation on the main branch will fail during aggregation when it tries to load results containing Browser events.

This is a critical bug fix that should be merged to prevent evaluation failures.

This fix resolves a critical deserialization error that occurs when GAIA
evaluations use Browser actions and then attempt to load the results.

Problem:
--------
When running GAIA evaluations with enable_browser=True, the evaluation
phase completes successfully but the aggregation phase fails with:

    ValidationError: Unexpected kind BrowserGetContentAction

This happens during deserialization of output.jsonl files that contain
Browser events (actions and observations).

Root Cause:
-----------
1. EvalOutput contains a 'history: list[Event]' field
2. Event is a discriminated union that can contain different action types
3. Browser action types (BrowserGetContentAction, etc.) are registered
   dynamically when get_default_tools(enable_browser=True) is called
4. Pydantic caches discriminated union schemas at import time
5. When EvalOutput extends BaseModel, the schema is frozen before Browser
   types are registered
6. During deserialization, pydantic doesn't recognize Browser action types
   because they weren't in the cached schema

Solution:
---------
Change EvalOutput to extend OpenHandsModel instead of BaseModel.

OpenHandsModel is a custom base class that automatically calls
model_rebuild() before validation, which regenerates the discriminated
union schema to include all dynamically registered types.

Impact:
-------
- Minimal change: Only the parent class is modified
- Backward compatible: OpenHandsModel extends BaseModel
- No API changes: All existing code continues to work
- Fixes: GAIA evaluations and any future benchmarks using dynamic tools

Testing:
--------
Verified with GAIA evaluation runs that previously failed with
deserialization errors now complete successfully end-to-end.
@openhands-ai

openhands-ai bot commented Dec 6, 2025

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Pre-commit checks
    • .github/workflows/build-gaia-image.yml

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #136 at branch `fix/browser-action-deserialization`

Feel free to include any additional details that might help me get this PR into a better state.

@neubig marked this pull request as draft on December 31, 2025, 02:07
Contributor

@neubig left a comment

Happy to take a look at this once the failing CI and merge conflicts are fixed, if it's ready.
