Forward output_jsonl_gcs to evaluation job #1582

simonrosenberg · 2026-01-03T11:10:11Z

Summary

- Add an output_jsonl_gcs input to the run-eval workflow
- Forward it to the evaluation dispatch payload
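
The forwarding step described above can be sketched roughly as follows. This is only an illustration, not the actual change: the real change lives in the workflow definition, and the function name `build_dispatch_payload`, the payload shape, and the `gs://` prefix check are all assumptions made for the example. Only the field names `benchmark`, `eval_limit`, and `output_jsonl_gcs` come from the commands in this PR.

```python
import json
from typing import Optional


def build_dispatch_payload(
    benchmark: str,
    eval_limit: int,
    output_jsonl_gcs: Optional[str] = None,
) -> dict:
    """Build a hypothetical dispatch payload for the evaluation job.

    The optional output_jsonl_gcs value is forwarded only when it is set,
    so runs that do not pass it keep the same payload as before.
    """
    # Assumption: the value is a Google Cloud Storage URI, as in the
    # test commands of this PR. The check itself is not from the PR.
    if output_jsonl_gcs and not output_jsonl_gcs.startswith("gs://"):
        raise ValueError(f"expected a gs:// URI, got {output_jsonl_gcs!r}")

    payload = {"benchmark": benchmark, "eval_limit": eval_limit}
    if output_jsonl_gcs:
        payload["output_jsonl_gcs"] = output_jsonl_gcs
    return payload


if __name__ == "__main__":
    # Placeholder bucket and object name, for illustration only.
    example = build_dispatch_payload(
        "swebench", 1, "gs://example-bucket/example-run.tar.gz"
    )
    print(json.dumps(example, indent=2))
```

Leaving the key out when the input is empty keeps the plain infer+eval invocation (the last test command, which passes no output_jsonl_gcs) unaffected by the new input.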

Testing

gh workflow run run-eval.yml --ref issue-236-output-jsonl -f benchmark=gaia -f eval_limit=1 -f sdk_ref=issue-236-output-jsonl -f benchmarks_branch=issue-236-output-jsonl -f eval_branch=issue-236-output-jsonl -f output_jsonl_gcs=gs://openhands-evaluation-results/eval-20622411875-claude-son_litellm_proxy-claude-sonnet-4-5-20250929_25-12-31-16-36.tar.gz -f reason="rerun latest gaia 1-image"
gh workflow run run-eval.yml --ref issue-236-output-jsonl -f benchmark=commit0 -f eval_limit=1 -f sdk_ref=issue-236-output-jsonl -f benchmarks_branch=issue-236-output-jsonl -f eval_branch=issue-236-output-jsonl -f output_jsonl_gcs=gs://openhands-evaluation-results/eval-20622412428-claude-son_litellm_proxy-claude-sonnet-4-5-20250929_25-12-31-18-13.tar.gz -f reason="rerun latest commit0 1-image"
gh workflow run run-eval.yml --ref issue-236-output-jsonl -f benchmark=swebench -f eval_limit=1 -f sdk_ref=issue-236-output-jsonl -f benchmarks_branch=issue-236-output-jsonl -f eval_branch=issue-236-output-jsonl -f output_jsonl_gcs=gs://openhands-evaluation-results/eval-20600678832-claude-son_litellm_proxy-claude-sonnet-4-5-20250929_25-12-30-16-22.tar.gz -f reason="eval-only swebench reuse eval-20600678832 (fix)"
gh workflow run run-eval.yml --ref issue-236-output-jsonl -f benchmark=swebench -f eval_limit=1 -f sdk_ref=issue-236-output-jsonl -f benchmarks_branch=issue-236-output-jsonl -f eval_branch=issue-236-output-jsonl -f reason="infer+eval swebench smoke (pydantic 2.12)"

enyst · 2026-01-03T12:14:42Z

@simonrosenberg out of curiosity, what version of GPT-5 are you using, and is it with oh via SDK or with good ole’ V0?

all-hands-bot · 2026-01-05T12:19:11Z

[Automatic Post]: I have assigned @raymyers as a reviewer based on git blame information. Thanks in advance for the help!

all-hands-bot · 2026-01-12T12:19:36Z

[Automatic Post]: It has been a while since there was any activity on this PR. @simonrosenberg, are you still working on it? If so, please go ahead, if not then please request review, close it, or request that someone else follow up.

enyst · 2026-01-12T17:40:02Z

@simonrosenberg out of curiosity, what version of GPT-5 are you using, and is it with oh via SDK or with good ole’ V0?

Just to clarify for everyone stumbling upon this: Simon answered in private, and it was GPT-5.2 with medium reasoning.

I think we have some phrase telling the LLM not to use \n\n, but it's obviously not working (the PR description had \n\n stuff all over).

Forward output_jsonl_gcs to evaluation job

55689c7

all-hands-bot requested a review from raymyers January 5, 2026 12:19

simonrosenberg removed the request for review from raymyers January 5, 2026 12:20

xingyaoww approved these changes Jan 5, 2026
