@csmith49 csmith49 commented Jan 6, 2026

Summary

Refactors the BaseIntegrationTest.run_instruction method into two separate methods:

  • BaseIntegrationTest.run_integration_test is the new entry point. It handles setup, teardown, and logging, and delegates feeding instructions to the agent to...
  • BaseIntegrationTest.run_instructions, which feeds the instruction class variable to the conversation. It can be overridden for more complicated conversation control (multiple messages, condensation, etc.).

Also updates the test runner to respect these new methods. No changes to existing integration or behavior tests.
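
To illustrate the split described above, here is a minimal sketch of the pattern, not the actual implementation in this PR. Only the two method names `run_integration_test` and `run_instructions` come from the description; `FakeConversation`, `SketchIntegrationTest`, `INSTRUCTION`, `setup`, `teardown`, and `events` are hypothetical stand-ins chosen so the example runs on its own.

```python
class FakeConversation:
    """Hypothetical stand-in for the SDK conversation; not the real API."""

    def __init__(self) -> None:
        self.sent: list[str] = []
        self.run_calls = 0

    def send_message(self, text: str) -> None:
        self.sent.append(text)

    def run(self) -> None:
        self.run_calls += 1


class SketchIntegrationTest:
    """Illustrative base class; everything except the two method names is assumed."""

    INSTRUCTION: str = ""  # assumed name for the instruction class variable

    def __init__(self) -> None:
        self.conversation = FakeConversation()
        self.events: list[str] = []  # stands in for logging

    def setup(self) -> None:
        self.events.append("setup")

    def teardown(self) -> None:
        self.events.append("teardown")

    def run_instructions(self) -> None:
        """Default behavior: feed the single class-level instruction, then run."""
        self.conversation.send_message(self.INSTRUCTION)
        self.conversation.run()

    def run_integration_test(self) -> None:
        """Entry point: set up, delegate to run_instructions, always tear down."""
        self.setup()
        try:
            self.events.append("run_instructions")
            self.run_instructions()
        finally:
            self.teardown()


class MultiStepTest(SketchIntegrationTest):
    INSTRUCTION = "Create hello.sh"

    def run_instructions(self) -> None:
        # Override to drive a multi-message conversation.
        for message in (self.INSTRUCTION, "Now make it executable"):
            self.conversation.send_message(message)
            self.conversation.run()


case = MultiStepTest()
case.run_integration_test()
print(case.conversation.sent)
print(case.events)
```

In this sketch a test that needs only one message inherits the default `run_instructions`, while a multi-step test overrides that one method and leaves setup, teardown, and logging to the entry point.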

This PR is partially inspired by struggles to implement #1584.

Checklist

  • If the PR is changing/adding functionality, are there tests to reflect this?
  • If there is an example, have you run the example to make sure that it works?
  • If there are instructions on how to run the code, have you followed the instructions and made sure that it works?
  • If the feature is significant enough to require documentation, is there a PR open on the OpenHands/docs repository with the same branch name?
  • Is the github CI passing?

Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

| Variant | Architectures | Base Image | Docs / Tags |
| java | amd64, arm64 | eclipse-temurin:17-jdk | Link |
| python | amd64, arm64 | nikolaik/python-nodejs:python3.12-nodejs22 | Link |
| golang | amd64, arm64 | golang:1.21-bookworm | Link |

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:8c04604-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-8c04604-python \
  ghcr.io/openhands/agent-server:8c04604-python

All tags pushed for this build

ghcr.io/openhands/agent-server:8c04604-golang-amd64
ghcr.io/openhands/agent-server:8c04604-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:8c04604-golang-arm64
ghcr.io/openhands/agent-server:8c04604-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:8c04604-java-amd64
ghcr.io/openhands/agent-server:8c04604-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:8c04604-java-arm64
ghcr.io/openhands/agent-server:8c04604-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:8c04604-python-amd64
ghcr.io/openhands/agent-server:8c04604-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:8c04604-python-arm64
ghcr.io/openhands/agent-server:8c04604-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:8c04604-golang
ghcr.io/openhands/agent-server:8c04604-java
ghcr.io/openhands/agent-server:8c04604-python

About Multi-Architecture Support

  • Each variant tag (e.g., 8c04604-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 8c04604-python-amd64) are also available if needed

@csmith49 csmith49 added the integration-test label ("Runs the integration tests and comments the results") Jan 6, 2026

github-actions bot commented Jan 6, 2026

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

github-actions bot commented Jan 6, 2026

🧪 Integration Tests Results

Overall Success Rate: 96.0%
Total Cost: $1.97
Models Tested: 6
Timestamp: 2026-01-06 15:40:41 UTC

📁 Detailed Logs & Artifacts

Click the links below to access detailed agent/LLM logs showing the complete reasoning process for each model. On the GitHub Actions page, scroll down to the 'Artifacts' section to download the logs.

📊 Summary

| Model | Overall | Integration (Required) | Behavior (Optional) | Tests Passed | Skipped | Total | Cost | Tokens |
| litellm_proxy_mistral_devstral_2512 | 75.0% | 75.0% | N/A | 6/8 | 1 | 9 | $0.13 | 320,196 |
| litellm_proxy_moonshot_kimi_k2_thinking | 100.0% | 100.0% | N/A | 8/8 | 1 | 9 | $0.35 | 538,857 |
| litellm_proxy_claude_sonnet_4_5_20250929 | 100.0% | 100.0% | N/A | 9/9 | 0 | 9 | $0.65 | 564,130 |
| litellm_proxy_deepseek_deepseek_chat | 100.0% | 100.0% | N/A | 8/8 | 1 | 9 | $0.05 | 418,355 |
| litellm_proxy_gpt_5.1_codex_max | 100.0% | 100.0% | N/A | 8/8 | 1 | 9 | $0.19 | 257,189 |
| litellm_proxy_vertex_ai_gemini_3_pro_preview | 100.0% | 100.0% | N/A | 9/9 | 0 | 9 | $0.60 | 355,803 |
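
The headline figures can be cross-checked against the per-model rows. The sketch below assumes the overall rate is passed tests divided by non-skipped tests, pooled across all models; the bot's actual formula is not shown in this thread, so treat this as one plausible reading. Under that assumption the rows give 48 of 50 tests, which matches the reported 96.0%, and the per-model costs sum to the reported $1.97. The shortened dictionary keys are illustrative labels only.

```python
# (passed, run, cost in USD) per model, copied from the summary table.
# "run" excludes skipped tests, which is an assumption about the bot's formula.
results = {
    "mistral_devstral_2512": (6, 8, 0.13),
    "moonshot_kimi_k2_thinking": (8, 8, 0.35),
    "claude_sonnet_4_5_20250929": (9, 9, 0.65),
    "deepseek_deepseek_chat": (8, 8, 0.05),
    "gpt_5.1_codex_max": (8, 8, 0.19),
    "vertex_ai_gemini_3_pro_preview": (9, 9, 0.60),
}

passed = sum(p for p, _, _ in results.values())
run = sum(r for _, r, _ in results.values())

# Pooled success rate across all models, as a percentage.
overall_rate = 100 * passed / run

# Round to cents to avoid floating-point noise in the sum.
total_cost = round(sum(c for _, _, c in results.values()), 2)

print(f"Overall success rate: {overall_rate:.1f}%")
print(f"Total cost: ${total_cost:.2f}")
```

Averaging the six per-model percentages instead would give roughly 95.8%, so the pooled reading is the one consistent with the reported number.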

📋 Detailed Results

litellm_proxy_mistral_devstral_2512

  • Overall Success Rate: 75.0% (6/8)
  • Integration Tests (Required): 75.0% (6/9)
  • Total Cost: $0.13
  • Token Usage: prompt: 316,944, completion: 3,252
  • Run Suffix: litellm_proxy_mistral_devstral_2512_e1a357e_devstral_2512_run_N9_20260106_153021
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

Failed Tests:

  • t02_add_bash_hello ⚠️ REQUIRED: Shell script is not executable (Cost: $0.0084)
  • t09_token_condenser ⚠️ REQUIRED: Condensation not triggered. Token counting may not work. (Cost: $0.03)

litellm_proxy_moonshot_kimi_k2_thinking

  • Overall Success Rate: 100.0% (8/8)
  • Integration Tests (Required): 100.0% (8/9)
  • Total Cost: $0.35
  • Token Usage: prompt: 525,176, completion: 13,681, cache_read: 456,704
  • Run Suffix: litellm_proxy_moonshot_kimi_k2_thinking_e1a357e_kimi_k2_run_N9_20260106_153016
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_claude_sonnet_4_5_20250929

  • Overall Success Rate: 100.0% (9/9)
  • Integration Tests (Required): 100.0% (9/9)
  • Total Cost: $0.65
  • Token Usage: prompt: 551,248, completion: 12,882, cache_read: 465,787, cache_write: 84,507, reasoning: 3,838
  • Run Suffix: litellm_proxy_claude_sonnet_4_5_20250929_e1a357e_sonnet_run_N9_20260106_153010

litellm_proxy_deepseek_deepseek_chat

  • Overall Success Rate: 100.0% (8/8)
  • Integration Tests (Required): 100.0% (8/9)
  • Total Cost: $0.05
  • Token Usage: prompt: 407,395, completion: 10,960, cache_read: 381,568
  • Run Suffix: litellm_proxy_deepseek_deepseek_chat_e1a357e_deepseek_run_N9_20260106_153018
  • Skipped Tests: 1

Skipped Tests:

  • t08_image_file_viewing: This test requires a vision-capable LLM model. Please use a model that supports image input.

litellm_proxy_gpt_5.1_codex_max

  • Overall Success Rate: 100.0% (8/8)
  • Integration Tests (Required): 100.0% (8/9)
  • Total Cost: $0.19
  • Token Usage: prompt: 251,243, completion: 5,946, cache_read: 159,488, reasoning: 3,520
  • Run Suffix: litellm_proxy_gpt_5.1_codex_max_e1a357e_gpt51_codex_run_N9_20260106_153014
  • Skipped Tests: 1

Skipped Tests:

  • t09_token_condenser: This test stresses long repetitive tool loops to trigger token-based condensation. GPT-5.1 Codex Max often declines such requests for efficiency/safety reasons.

litellm_proxy_vertex_ai_gemini_3_pro_preview

  • Overall Success Rate: 100.0% (9/9)
  • Integration Tests (Required): 100.0% (9/9)
  • Total Cost: $0.60
  • Token Usage: prompt: 332,724, completion: 23,079, cache_read: 191,501, reasoning: 16,469
  • Run Suffix: litellm_proxy_vertex_ai_gemini_3_pro_preview_e1a357e_gemini_3_pro_run_N9_20260106_153019

openhands-ai bot commented Jan 6, 2026

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Pre-commit checks

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1613 at branch `feat/multi-step-integration-tests`

Feel free to include any additional details that might help me get this PR into a better state.

@csmith49 csmith49 removed the integration-test label ("Runs the integration tests and comments the results") Jan 6, 2026
@csmith49 csmith49 marked this pull request as ready for review January 6, 2026 15:47

@enyst enyst left a comment

Thank you!

@csmith49 csmith49 merged commit d1baf30 into main Jan 7, 2026
21 checks passed
@csmith49 csmith49 deleted the feat/multi-step-integration-tests branch January 7, 2026 16:40