Add swebenchmultimodal support to run-eval workflow #1659
Summary
This PR adds support for the swebenchmultimodal benchmark to the run-eval workflow, enabling multimodal benchmark evaluations to be triggered through CI.
Changes
Usage
After merging, the software-agent-sdk can be evaluated against the multimodal benchmark using:
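The exact invocation is not shown in this excerpt; a minimal sketch of how such a run might be triggered with the GitHub CLI is below. The workflow file name and input names (`run-eval.yml`, `benchmark`) are assumptions for illustration, not confirmed from the PR.

```shell
# Hypothetical example: trigger the run-eval workflow for the
# swebenchmultimodal benchmark via workflow_dispatch.
# The workflow file name and input parameter are assumed.
gh workflow run run-eval.yml \
  --repo OpenHands/software-agent-sdk \
  -f benchmark=swebenchmultimodal
```

The actual inputs accepted by the workflow should be checked against its `workflow_dispatch` definition.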
Testing
The implementation follows the existing patterns from the swebench, gaia, and commit0 benchmarks, keeping the workflow consistent with the rest of the evaluation infrastructure.
Related
This is part of a coordinated effort to add swebenchmultimodal support across the evaluation infrastructure:
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
• eclipse-temurin:17-jdk
• nikolaik/python-nodejs:python3.12-nodejs22
• golang:1.21-bookworm
Pull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:6939d4b-python
Run
All tags pushed for this build
About Multi-Architecture Support
• The main tag (6939d4b-python) is a multi-arch manifest supporting both amd64 and arm64
• Architecture-specific tags (e.g. 6939d4b-python-amd64) are also available if needed
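To see which architectures a given tag actually covers, the manifest can be inspected directly. This is a sketch using the standard Docker CLI against the tag named above; it assumes you have pull access to the GHCR package.

```shell
# Inspect the multi-arch manifest to list the platforms it includes
# (expects entries for linux/amd64 and linux/arm64).
docker manifest inspect ghcr.io/openhands/agent-server:6939d4b-python
```

When running `docker pull` on a multi-arch manifest, Docker automatically selects the image matching the host architecture, so the architecture-specific tags are only needed to pin a platform explicitly.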