Add swebenchmultimodal support to run-eval workflow #1659
Summary
This PR adds support for the swebenchmultimodal benchmark to the run-eval workflow, enabling multimodal benchmark evaluations to be triggered through CI.
Changes
Usage
After merging, the software-agent-sdk can be evaluated against the multimodal benchmark using:
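The exact invocation is not shown in this excerpt; a minimal sketch of how such a run might be triggered with the GitHub CLI is below. The workflow file name and input names (`run-eval.yml`, `benchmark`) are assumptions for illustration, not confirmed from the PR.

```shell
# Hypothetical example: trigger the run-eval workflow for the
# swebenchmultimodal benchmark via workflow_dispatch.
# The workflow file name and input parameter are assumed.
gh workflow run run-eval.yml \
  --repo OpenHands/software-agent-sdk \
  -f benchmark=swebenchmultimodal
```

The actual inputs accepted by the workflow should be checked against its `workflow_dispatch` definition.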
Testing
The implementation follows the existing patterns from the swebench, gaia, and commit0 benchmarks, keeping the workflow consistent with the rest of the evaluation infrastructure.
Related
This is part of a coordinated effort to add swebenchmultimodal support across the evaluation infrastructure:
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
• eclipse-temurin:17-jdk
• nikolaik/python-nodejs:python3.12-nodejs22
• golang:1.21-bookworm
Pull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:6939d4b-python
Run
All tags pushed for this build
About Multi-Architecture Support
• The main tag (6939d4b-python) is a multi-arch manifest supporting both amd64 and arm64
• Architecture-specific tags (e.g. 6939d4b-python-amd64) are also available if needed
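To see which architectures a given tag actually covers, the manifest can be inspected directly. This is a sketch using the standard Docker CLI against the tag named above; it assumes you have pull access to the GHCR package.

```shell
# Inspect the multi-arch manifest to list the platforms it includes
# (expects entries for linux/amd64 and linux/arm64).
docker manifest inspect ghcr.io/openhands/agent-server:6939d4b-python
```

When running `docker pull` on a multi-arch manifest, Docker automatically selects the image matching the host architecture, so the architecture-specific tags are only needed to pin a platform explicitly.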