
Conversation

@juanmichelini (Collaborator)

Summary

This PR adds support for the swebenchmultimodal benchmark to the CI workflows, enabling evaluation of multimodal SWE-Bench tasks.

Changes

  • Add swebenchmultimodal as a choice option in run-eval.yml: enables triggering multimodal benchmark evaluations through the main workflow (sketched below)
  • Create the build-swebenchmultimodal-images.yml workflow: a new workflow for building multimodal benchmark images, following the same pattern as the existing swebench workflow but using the swebenchmultimodal build script (see the skeleton after this list)
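For orientation, here is a sketch of what the choice-option change in run-eval.yml might look like. The input name `benchmark` is taken from the usage command below; the existing options are inferred from the benchmarks this PR mentions, so the exact list in the real file may differ:

```yaml
# Hypothetical excerpt of run-eval.yml; option list inferred, not copied from the diff.
on:
  workflow_dispatch:
    inputs:
      benchmark:
        description: "Benchmark to evaluate"
        required: true
        type: choice
        options:
          - swebench
          - gaia
          - commit0
          - swebenchmultimodal  # new option added by this PR
```

And a minimal skeleton of the new build workflow, assuming it mirrors the swebench image-build workflow. The trigger, job layout, and script path are illustrative, not confirmed from the PR diff:

```yaml
# Hypothetical skeleton of build-swebenchmultimodal-images.yml.
name: Build SWE-Bench Multimodal Images
on:
  workflow_dispatch:
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build multimodal benchmark images
        run: ./scripts/build-swebenchmultimodal-images.sh  # assumed script location
```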

Usage

After merging, you can trigger multimodal evaluations using:

```bash
gh workflow run run-eval.yml -f benchmark=swebenchmultimodal
```
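Once dispatched, the run can be located and followed with standard gh commands (the run ID placeholder below is whatever `gh run list` reports):

```bash
# Find the run ID of the evaluation you just dispatched
gh run list --workflow=run-eval.yml --limit 5

# Stream its progress and logs until completion
gh run watch <run-id>
```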

Testing

The implementation follows the existing patterns from the swebench, gaia, and commit0 benchmarks, keeping the new workflow consistent with the rest of the evaluation CI.

Related

This is part of a coordinated effort to add swebenchmultimodal support across the evaluation infrastructure:

  • benchmarks (this PR)
  • software-agent-sdk
  • evaluation
  • openhands-index-results (no changes needed)

@juanmichelini (Collaborator, Author)

@simonrosenberg I need the build-swebenchmultimodal-images.yml workflow to exist on the main branch so I can test it and continue debugging.

@juanmichelini juanmichelini merged commit e4ea297 into main Jan 11, 2026
3 checks passed