
Conversation

@juanmichelini (Collaborator)

Summary

This PR adds support for the swebenchmultimodal benchmark to the CI workflows, enabling evaluation of multimodal SWE-Bench tasks.

Changes

  • Add swebenchmultimodal as a choice option in run-eval.yml: enables triggering multimodal benchmark evaluations through the main workflow (sketched below)
  • Create the build-swebenchmultimodal-images.yml workflow: a new workflow for building multimodal benchmark images, following the same pattern as the existing swebench workflow but using the swebenchmultimodal build script (see the skeleton after this list)
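For orientation, here is a sketch of what the choice-option change in run-eval.yml might look like. The input name `benchmark` is taken from the usage command below; the existing options are inferred from the benchmarks this PR mentions, so the exact list in the real file may differ:

```yaml
# Hypothetical excerpt of run-eval.yml; option list inferred, not copied from the diff.
on:
  workflow_dispatch:
    inputs:
      benchmark:
        description: "Benchmark to evaluate"
        required: true
        type: choice
        options:
          - swebench
          - gaia
          - commit0
          - swebenchmultimodal  # new option added by this PR
```

And a minimal skeleton of the new build workflow, assuming it mirrors the swebench image-build workflow. The trigger, job layout, and script path are illustrative, not confirmed from the PR diff:

```yaml
# Hypothetical skeleton of build-swebenchmultimodal-images.yml.
name: Build SWE-Bench Multimodal Images
on:
  workflow_dispatch:
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build multimodal benchmark images
        run: ./scripts/build-swebenchmultimodal-images.sh  # assumed script location
```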

Usage

After merging, you can trigger multimodal evaluations using:

```bash
gh workflow run run-eval.yml -f benchmark=swebenchmultimodal
```
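Once dispatched, the run can be located and followed with standard gh commands (the run ID placeholder below is whatever `gh run list` reports):

```bash
# Find the run ID of the evaluation you just dispatched
gh run list --workflow=run-eval.yml --limit 5

# Stream its progress and logs until completion
gh run watch <run-id>
```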

Testing

The implementation follows the existing patterns from the swebench, gaia, and commit0 benchmarks, keeping the new workflow consistent with the rest of the evaluation CI.

Related

This is part of a coordinated effort to add swebenchmultimodal support across the evaluation infrastructure:

  • benchmarks (this PR)
  • software-agent-sdk
  • evaluation
  • openhands-index-results (no changes needed)

@juanmichelini (Collaborator, Author)

@simonrosenberg I need the build-swebenchmultimodal-images.yml workflow to exist on the main branch so I can test it and continue debugging.

@juanmichelini juanmichelini merged commit e4ea297 into main Jan 11, 2026
3 checks passed