Multi-step integration tests #1613

csmith49 · 2026-01-06T15:26:56Z

Summary

Refactors the BaseIntegrationTest.run_instruction method into two separate methods:

BaseIntegrationTest.run_integration_test is the new entry point. It handles setup, teardown, and logging, and delegates feeding instructions to the agent to...
BaseIntegrationTest.run_instructions, which sends the instruction class variable to the conversation and can be overridden for more complex conversation control (multiple messages, condensation, etc.).

Also updates the test runner to call these new methods. No changes to existing integration or behavior tests.
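
A minimal sketch of the split described above, not the SDK's actual code. Only the names BaseIntegrationTest, run_integration_test, run_instructions, and the instruction class variable come from this PR; FakeConversation, setup, teardown, and MultiStepTest are hypothetical stand-ins, and the real signatures may differ.

```python
class FakeConversation:
    """Hypothetical stand-in for the SDK conversation object."""

    def __init__(self) -> None:
        self.log: list[str] = []

    def send_message(self, text: str) -> None:
        self.log.append(f"send: {text}")

    def run(self) -> None:
        self.log.append("run")


class BaseIntegrationTest:
    # Instruction fed to the agent by the default run_instructions.
    instruction: str = ""

    def setup(self) -> None:  # hypothetical hook
        pass

    def teardown(self) -> None:  # hypothetical hook
        pass

    def run_instructions(self, conversation: FakeConversation) -> None:
        """Default behavior: send the single instruction and run the agent."""
        conversation.send_message(self.instruction)
        conversation.run()

    def run_integration_test(self, conversation: FakeConversation) -> None:
        """Entry point: set up, delegate to run_instructions, always tear down."""
        self.setup()
        try:
            self.run_instructions(conversation)
        finally:
            self.teardown()


class MultiStepTest(BaseIntegrationTest):
    instruction = "Create hello.sh"

    def run_instructions(self, conversation: FakeConversation) -> None:
        # Override to drive a multi-message conversation.
        super().run_instructions(conversation)
        conversation.send_message("Now make it executable")
        conversation.run()
```

In this sketch, a single-step test only sets instruction, while a multi-step test overrides run_instructions and leaves setup, teardown, and logging to the entry point.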

This PR is partially inspired by struggles to implement #1584.

Checklist

If the PR is changing/adding functionality, are there tests to reflect this?
If there is an example, have you run the example to make sure that it works?
If there are instructions on how to run the code, have you followed the instructions and made sure that it works?
If the feature is significant enough to require documentation, is there a PR open on the OpenHands/docs repository with the same branch name?
Is the GitHub CI passing?

Agent Server images for this PR

• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant	Architectures	Base Image	Docs / Tags
java	amd64, arm64	`eclipse-temurin:17-jdk`	Link
python	amd64, arm64	`nikolaik/python-nodejs:python3.12-nodejs22`	Link
golang	amd64, arm64	`golang:1.21-bookworm`	Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:8c04604-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-8c04604-python \
  ghcr.io/openhands/agent-server:8c04604-python

All tags pushed for this build

ghcr.io/openhands/agent-server:8c04604-golang-amd64
ghcr.io/openhands/agent-server:8c04604-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:8c04604-golang-arm64
ghcr.io/openhands/agent-server:8c04604-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:8c04604-java-amd64
ghcr.io/openhands/agent-server:8c04604-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:8c04604-java-arm64
ghcr.io/openhands/agent-server:8c04604-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:8c04604-python-amd64
ghcr.io/openhands/agent-server:8c04604-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:8c04604-python-arm64
ghcr.io/openhands/agent-server:8c04604-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:8c04604-golang
ghcr.io/openhands/agent-server:8c04604-java
ghcr.io/openhands/agent-server:8c04604-python
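
The tags above appear to follow a regular naming scheme. The sketch below reproduces it under rules inferred only from this list ("/" becomes "_s_" and ":" becomes "_tag_" in the base image name). The functions sanitize_image and build_tags are hypothetical and are not the build workflow's actual code.

```python
REPO = "ghcr.io/openhands/agent-server"


def sanitize_image(base_image: str) -> str:
    """Turn a base image reference into a tag-safe fragment (inferred rule)."""
    return base_image.replace("/", "_s_").replace(":", "_tag_")


def build_tags(sha: str, variant: str, base_image: str, arches: list[str]) -> list[str]:
    """List the per-architecture tags plus the multi-arch variant tag."""
    tags = []
    for arch in arches:
        tags.append(f"{REPO}:{sha}-{variant}-{arch}")
        tags.append(f"{REPO}:{sha}-{sanitize_image(base_image)}-{arch}")
    # The bare variant tag is the multi-arch manifest.
    tags.append(f"{REPO}:{sha}-{variant}")
    return tags
```

For example, build_tags("8c04604", "golang", "golang:1.21-bookworm", ["amd64", "arm64"]) yields the five golang tags listed above.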

About Multi-Architecture Support

Each variant tag (e.g., 8c04604-python) is a multi-arch manifest supporting both amd64 and arm64
Docker automatically pulls the correct architecture for your platform
Individual architecture tags (e.g., 8c04604-python-amd64) are also available if needed

This reverts commit 0656329.

This reverts commit 6754fbb.

github-actions · 2026-01-06T15:29:29Z

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

github-actions · 2026-01-06T15:40:47Z

🧪 Integration Tests Results

Overall Success Rate: 96.0%
Total Cost: $1.97
Models Tested: 6
Timestamp: 2026-01-06 15:40:41 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

litellm_proxy_mistral_devstral_2512: 📥 View & Download Logs
litellm_proxy_moonshot_kimi_k2_thinking: 📥 View & Download Logs
litellm_proxy_claude_sonnet_4_5_20250929: 📥 View & Download Logs
litellm_proxy_deepseek_deepseek_chat: 📥 View & Download Logs
litellm_proxy_gpt_5.1_codex_max: 📥 View & Download Logs
litellm_proxy_vertex_ai_gemini_3_pro_preview: 📥 View & Download Logs

📊 Summary

Model	Overall	Integration (Required)	Behavior (Optional)	Tests Passed	Skipped	Total	Cost	Tokens
litellm_proxy_mistral_devstral_2512	75.0%	75.0%	N/A	6/8	1	9	$0.13	320,196
litellm_proxy_moonshot_kimi_k2_thinking	100.0%	100.0%	N/A	8/8	1	9	$0.35	538,857
litellm_proxy_claude_sonnet_4_5_20250929	100.0%	100.0%	N/A	9/9	0	9	$0.65	564,130
litellm_proxy_deepseek_deepseek_chat	100.0%	100.0%	N/A	8/8	1	9	$0.05	418,355
litellm_proxy_gpt_5.1_codex_max	100.0%	100.0%	N/A	8/8	1	9	$0.19	257,189
litellm_proxy_vertex_ai_gemini_3_pro_preview	100.0%	100.0%	N/A	9/9	0	9	$0.60	355,803
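
The headline figures can be cross-checked against the table. The sketch below assumes that skipped tests are excluded from the denominator, which is consistent with the reported numbers but is an inference, not the bot's documented formula. The short model keys are abbreviations used only for this example.

```python
# (tests passed, tests run, cost in USD) per model, copied from the summary table.
results = {
    "mistral_devstral_2512": (6, 8, 0.13),
    "moonshot_kimi_k2_thinking": (8, 8, 0.35),
    "claude_sonnet_4_5_20250929": (9, 9, 0.65),
    "deepseek_deepseek_chat": (8, 8, 0.05),
    "gpt_5.1_codex_max": (8, 8, 0.19),
    "vertex_ai_gemini_3_pro_preview": (9, 9, 0.60),
}

total_passed = sum(passed for passed, _, _ in results.values())
total_run = sum(run for _, run, _ in results.values())

# Skipped tests are not counted in total_run (assumption stated above).
overall_rate = round(100 * total_passed / total_run, 1)
total_cost = round(sum(cost for _, _, cost in results.values()), 2)

print(f"Overall Success Rate: {overall_rate}%")
print(f"Total Cost: ${total_cost}")
```

With these inputs, 48 of 50 executed tests passed, which matches the reported 96.0%, and the per-model costs sum to the reported $1.97.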

📋 Detailed Results

litellm_proxy_mistral_devstral_2512

Overall Success Rate: 75.0% (6/8)
Integration Tests (Required): 75.0% (6/8 run; 1 of 9 skipped)
Total Cost: $0.13
Token Usage: prompt: 316,944, completion: 3,252
Run Suffix: litellm_proxy_mistral_devstral_2512_e1a357e_devstral_2512_run_N9_20260106_153021
Skipped Tests: 1

Skipped Tests:

t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

Failed Tests:

t02_add_bash_hello ⚠️ REQUIRED: Shell script is not executable (Cost: $0.0084)
t09_token_condenser ⚠️ REQUIRED: Condensation not triggered. Token counting may not work. (Cost: $0.03)

litellm_proxy_moonshot_kimi_k2_thinking

Overall Success Rate: 100.0% (8/8)
Integration Tests (Required): 100.0% (8/8 run; 1 of 9 skipped)
Total Cost: $0.35
Token Usage: prompt: 525,176, completion: 13,681, cache_read: 456,704
Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_e1a357e_kimi_k2_run_N9_20260106_153016
Skipped Tests: 1

Skipped Tests:

t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_claude_sonnet_4_5_20250929

Overall Success Rate: 100.0% (9/9)
Integration Tests (Required): 100.0% (9/9)
Total Cost: $0.65
Token Usage: prompt: 551,248, completion: 12,882, cache_read: 465,787, cache_write: 84,507, reasoning: 3,838
Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_e1a357e_sonnet_run_N9_20260106_153010

litellm_proxy_deepseek_deepseek_chat

Overall Success Rate: 100.0% (8/8)
Integration Tests (Required): 100.0% (8/8 run; 1 of 9 skipped)
Total Cost: $0.05
Token Usage: prompt: 407,395, completion: 10,960, cache_read: 381,568
Run Suffix: litellm_proxy_deepseek_deepseek_chat_e1a357e_deepseek_run_N9_20260106_153018
Skipped Tests: 1

Skipped Tests:

t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_gpt_5.1_codex_max

Overall Success Rate: 100.0% (8/8)
Integration Tests (Required): 100.0% (8/8 run; 1 of 9 skipped)
Total Cost: $0.19
Token Usage: prompt: 251,243, completion: 5,946, cache_read: 159,488, reasoning: 3,520
Run Suffix: litellm_proxy_gpt_5.1_codex_max_e1a357e_gpt51_codex_run_N9_20260106_153014
Skipped Tests: 1

Skipped Tests:

t09_token_condenser: This test stresses long repetitive tool loops to trigger token-based condensation. GPT-5.1 Codex Max often declines such requests for efficiency/safety reasons.

litellm_proxy_vertex_ai_gemini_3_pro_preview

Overall Success Rate: 100.0% (9/9)
Integration Tests (Required): 100.0% (9/9)
Total Cost: $0.60
Token Usage: prompt: 332,724, completion: 23,079, cache_read: 191,501, reasoning: 16,469
Run Suffix: litellm_proxy_vertex_ai_gemini_3_pro_preview_e1a357e_gemini_3_pro_run_N9_20260106_153019

openhands-ai · 2026-01-06T15:41:00Z

Looks like there are a few issues preventing this PR from being merged!

GitHub Actions are failing:
- Pre-commit checks

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1613 at branch `feat/multi-step-integration-tests`

Feel free to include any additional details that might help me get this PR into a better state.

You can manage your notification settings

enyst

Thank you!

Calvin Smith added 7 commits January 5, 2026 16:19

initial

6754fbb

renaming local vars and removing ref to unused test

0656329

Revert "renaming local vars and removing ref to unused test"

bbd17a1

This reverts commit 0656329.

Revert "initial"

f79e664

This reverts commit 6754fbb.

run_instruction broken out to support conversation manipulation

6766c78

minor readme

aadef40

instruction message property

e1a357e

csmith49 added the integration-test label ("Runs the integration tests and comments the results") Jan 6, 2026

csmith49 removed the integration-test label ("Runs the integration tests and comments the results") Jan 6, 2026

csmith49 and others added 2 commits January 6, 2026 08:41

Merge branch 'main' into feat/multi-step-integration-tests

bd7f363

linting

d3aee66

csmith49 marked this pull request as ready for review January 6, 2026 15:47

enyst approved these changes Jan 6, 2026

csmith49 merged commit d1baf30 into main Jan 7, 2026
21 checks passed

csmith49 deleted the feat/multi-step-integration-tests branch January 7, 2026 16:40
