@simonrosenberg
Collaborator

@simonrosenberg simonrosenberg commented Jan 3, 2026

Summary

  • add output_jsonl_gcs input to run-eval workflow
  • forward to evaluation dispatch payload
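As a rough illustration of the two summary points, here is a minimal sketch of forwarding an optional input into a dispatch payload. It is not the code from this PR: the function name `build_dispatch_payload`, the payload keys other than `output_jsonl_gcs`, the `gs://` prefix check, and the example bucket path are all assumptions made for illustration. The test commands below suggest that supplying `output_jsonl_gcs` reuses an existing results archive (eval-only) and that omitting it runs inference plus evaluation, so the sketch forwards the field only when it is set.

```python
import json


def build_dispatch_payload(inputs: dict) -> dict:
    """Build a hypothetical evaluation dispatch payload from workflow inputs.

    All key names except `output_jsonl_gcs` are illustrative assumptions.
    """
    payload = {
        "benchmark": inputs["benchmark"],
        # Workflow inputs arrive as strings, so convert the limit to an int.
        "eval_limit": int(inputs.get("eval_limit", 1)),
    }

    # Forward the optional archive location only when the caller supplied one.
    output_jsonl_gcs = (inputs.get("output_jsonl_gcs") or "").strip()
    if output_jsonl_gcs:
        # Assumed sanity check: the examples in this PR all use gs:// URIs.
        if not output_jsonl_gcs.startswith("gs://"):
            raise ValueError(f"expected a gs:// URI, got {output_jsonl_gcs!r}")
        payload["output_jsonl_gcs"] = output_jsonl_gcs

    return payload


if __name__ == "__main__":
    example = build_dispatch_payload(
        {
            "benchmark": "swebench",
            "eval_limit": "1",
            # Placeholder path, not a real results archive.
            "output_jsonl_gcs": "gs://example-bucket/example-run.tar.gz",
        }
    )
    print(json.dumps(example, indent=2))
```

Leaving the key out of the payload when the input is empty keeps the existing infer+eval path unchanged for callers that never pass the new input.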

Testing

  • gh workflow run run-eval.yml --ref issue-236-output-jsonl -f benchmark=gaia -f eval_limit=1 -f sdk_ref=issue-236-output-jsonl -f benchmarks_branch=issue-236-output-jsonl -f eval_branch=issue-236-output-jsonl -f output_jsonl_gcs=gs://openhands-evaluation-results/eval-20622411875-claude-son_litellm_proxy-claude-sonnet-4-5-20250929_25-12-31-16-36.tar.gz -f reason="rerun latest gaia 1-image"
  • gh workflow run run-eval.yml --ref issue-236-output-jsonl -f benchmark=commit0 -f eval_limit=1 -f sdk_ref=issue-236-output-jsonl -f benchmarks_branch=issue-236-output-jsonl -f eval_branch=issue-236-output-jsonl -f output_jsonl_gcs=gs://openhands-evaluation-results/eval-20622412428-claude-son_litellm_proxy-claude-sonnet-4-5-20250929_25-12-31-18-13.tar.gz -f reason="rerun latest commit0 1-image"
  • gh workflow run run-eval.yml --ref issue-236-output-jsonl -f benchmark=swebench -f eval_limit=1 -f sdk_ref=issue-236-output-jsonl -f benchmarks_branch=issue-236-output-jsonl -f eval_branch=issue-236-output-jsonl -f output_jsonl_gcs=gs://openhands-evaluation-results/eval-20600678832-claude-son_litellm_proxy-claude-sonnet-4-5-20250929_25-12-30-16-22.tar.gz -f reason="eval-only swebench reuse eval-20600678832 (fix)"
  • gh workflow run run-eval.yml --ref issue-236-output-jsonl -f benchmark=swebench -f eval_limit=1 -f sdk_ref=issue-236-output-jsonl -f benchmarks_branch=issue-236-output-jsonl -f eval_branch=issue-236-output-jsonl -f reason="infer+eval swebench smoke (pydantic 2.12)"

@enyst
Collaborator

enyst commented Jan 3, 2026

@simonrosenberg out of curiosity, what version of GPT-5 are you using, and is it with oh via SDK or with good ole’ V0?

@all-hands-bot all-hands-bot requested a review from raymyers January 5, 2026 12:19
@all-hands-bot
Collaborator

[Automatic Post]: I have assigned @raymyers as a reviewer based on git blame information. Thanks in advance for the help!

@simonrosenberg simonrosenberg removed the request for review from raymyers January 5, 2026 12:20
@all-hands-bot
Collaborator

[Automatic Post]: It has been a while since there was any activity on this PR. @simonrosenberg, are you still working on it? If so, please go ahead, if not then please request review, close it, or request that someone else follow up.

@enyst
Collaborator

enyst commented Jan 12, 2026

> @simonrosenberg out of curiosity, what version of GPT-5 are you using, and is it with oh via SDK or with good ole’ V0?

To clarify for anyone stumbling upon this: Simon answered in private, and it was GPT-5.2 with medium reasoning.

I think we have some phrase telling the LLM not to use \n\n, but it's obviously not working (the PR description had \n\n stuff all over).
