@juanmichelini
Collaborator

@juanmichelini juanmichelini commented Jan 9, 2026

Summary

This PR adds the swebenchmultimodal benchmark to the run-eval workflow so that multimodal benchmark evaluations can be triggered through CI.

Changes

  • Add swebenchmultimodal as a choice option in run-eval.yml, so that multimodal benchmark evaluations can be triggered through workflow dispatch

Usage

After merging, the software-agent-sdk can be evaluated against the multimodal benchmark using:

gh workflow run run-eval.yml -f benchmark=swebenchmultimodal
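
A slightly more defensive variant of the command above, as a sketch: it checks the benchmark name before dispatching. The list of choices is an assumption drawn from the benchmarks named in this PR (swebench, gaia, commit0, swebenchmultimodal); the actual options in run-eval.yml may differ. `is_known_benchmark` and `run_eval` are hypothetical helper names.

```shell
# Hypothetical helpers; the choice list below is assumed from this PR's text,
# not copied from run-eval.yml.
is_known_benchmark() {
  case "$1" in
    swebench|gaia|commit0|swebenchmultimodal) return 0 ;;
    *) return 1 ;;
  esac
}

# Dispatch the workflow only for a known benchmark.
# An optional second argument selects the branch or tag to run from (gh's --ref).
run_eval() {
  benchmark="$1"
  ref="${2:-}"
  if ! is_known_benchmark "$benchmark"; then
    echo "unknown benchmark: $benchmark" >&2
    return 1
  fi
  if [ -n "$ref" ]; then
    gh workflow run run-eval.yml --ref "$ref" -f benchmark="$benchmark"
  else
    gh workflow run run-eval.yml -f benchmark="$benchmark"
  fi
}
```

For example, `run_eval swebenchmultimodal` dispatches from the default branch, while `run_eval swebenchmultimodal add-swebenchmultimodal-support` would dispatch from this PR's branch.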

Testing

The change follows the pattern already used for the swebench, gaia, and commit0 benchmark options, which keeps the workflow consistent.

Related

This is part of a coordinated effort to add swebenchmultimodal support across the evaluation infrastructure:

  • benchmarks
  • software-agent-sdk (this PR)
  • evaluation
  • openhands-index-results (no changes needed)

Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant | Architectures | Base Image                                 | Docs / Tags
java    | amd64, arm64  | eclipse-temurin:17-jdk                     | Link
python  | amd64, arm64  | nikolaik/python-nodejs:python3.12-nodejs22 | Link
golang  | amd64, arm64  | golang:1.21-bookworm                       | Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:6939d4b-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-6939d4b-python \
  ghcr.io/openhands/agent-server:6939d4b-python

All tags pushed for this build

ghcr.io/openhands/agent-server:6939d4b-golang-amd64
ghcr.io/openhands/agent-server:6939d4b-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:6939d4b-golang-arm64
ghcr.io/openhands/agent-server:6939d4b-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:6939d4b-java-amd64
ghcr.io/openhands/agent-server:6939d4b-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:6939d4b-java-arm64
ghcr.io/openhands/agent-server:6939d4b-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:6939d4b-python-amd64
ghcr.io/openhands/agent-server:6939d4b-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:6939d4b-python-arm64
ghcr.io/openhands/agent-server:6939d4b-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:6939d4b-golang
ghcr.io/openhands/agent-server:6939d4b-java
ghcr.io/openhands/agent-server:6939d4b-python

About Multi-Architecture Support

  • Each variant tag (e.g., 6939d4b-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 6939d4b-python-amd64) are also available if needed
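
If an architecture-specific tag is needed, its suffix can be derived from the host machine. A minimal sketch, assuming the tag scheme listed above (a variant tag followed by `-amd64` or `-arm64`); `arch_suffix` and `image_ref` are hypothetical helper names, not part of the build tooling.

```shell
# Map a machine name (as reported by `uname -m`) to the tag suffix used above.
arch_suffix() {
  case "$1" in
    x86_64|amd64) echo amd64 ;;
    aarch64|arm64) echo arm64 ;;
    *) echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# Build the architecture-specific image reference for a variant tag.
# The second argument is optional and defaults to the current machine.
image_ref() {
  variant="$1"
  machine="${2:-$(uname -m)}"
  suffix="$(arch_suffix "$machine")" || return 1
  echo "ghcr.io/openhands/agent-server:${variant}-${suffix}"
}
```

The result can be passed straight to Docker, for example `docker pull "$(image_ref 6939d4b-python)"`. In most cases the plain variant tag is enough, since Docker resolves the multi-arch manifest on its own.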

- Add swebenchmultimodal as choice option in run-eval.yml
- Enables triggering multimodal benchmark evaluations through CI

Co-authored-by: openhands <[email protected]>
@all-hands-bot
Collaborator

[Automatic Post]: I have assigned @raymyers as a reviewer based on git blame information. Thanks in advance for the help!

@raymyers
Contributor

This one is trivial, but I think I'm getting auto-added to a lot of these and don't really know the triage expectations for this repo. So if I'm going to be on the list, maybe we should sync.

@openhands-ai

openhands-ai bot commented Jan 13, 2026

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Run Eval (swebenchmultimodal) Testing with max_retries input added
    • Run Eval (swebenchmultimodal) Testing with pushed step-level conditionals
    • Run Eval (swebenchmultimodal) Testing step-level conditionals for branch selection
    • Run Eval (swebenchmultimodal) Debug conditional logic for branch selection
    • Run Eval (swebenchmultimodal) Testing swebenchmultimodal CI with 50 instances - fixed conditional syntax
    • Run Eval (swebenchmultimodal) Testing swebenchmultimodal CI with 50 instances - fixed eval branch refs
    • Run Eval (swebenchmultimodal) Testing swebenchmultimodal CI with 50 instances to avoid image size issues

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1659 at branch `add-swebenchmultimodal-support`

Feel free to include any additional details that might help me get this PR into a better state.

You can manage your notification settings
