Integration test for Opus thinking block constraints #1584

csmith49 · 2026-01-03T19:19:00Z

Summary

Opus 4.5 seems to occasionally respond with a malformed signature error. This PR adds an integration test that reproduces a possible scenario as simply as possible.

Checklist

If the PR is changing/adding functionality, are there tests to reflect this?
If there is an example, have you run the example to make sure that it works?
If there are instructions on how to run the code, have you followed the instructions and made sure that it works?
If the feature is significant enough to require documentation, is there a PR open on the OpenHands/docs repository with the same branch name?
Is the GitHub CI passing?

Agent Server images for this PR

• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant	Architectures	Base Image	Docs / Tags
java	amd64, arm64	`eclipse-temurin:17-jdk`	Link
python	amd64, arm64	`nikolaik/python-nodejs:python3.12-nodejs22`	Link
golang	amd64, arm64	`golang:1.21-bookworm`	Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:143a01d-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-143a01d-python \
  ghcr.io/openhands/agent-server:143a01d-python

All tags pushed for this build

ghcr.io/openhands/agent-server:143a01d-golang-amd64
ghcr.io/openhands/agent-server:143a01d-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:143a01d-golang-arm64
ghcr.io/openhands/agent-server:143a01d-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:143a01d-java-amd64
ghcr.io/openhands/agent-server:143a01d-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:143a01d-java-arm64
ghcr.io/openhands/agent-server:143a01d-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:143a01d-python-amd64
ghcr.io/openhands/agent-server:143a01d-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-amd64
ghcr.io/openhands/agent-server:143a01d-python-arm64
ghcr.io/openhands/agent-server:143a01d-nikolaik_s_python-nodejs_tag_python3.12-nodejs22-arm64
ghcr.io/openhands/agent-server:143a01d-golang
ghcr.io/openhands/agent-server:143a01d-java
ghcr.io/openhands/agent-server:143a01d-python

About Multi-Architecture Support

• Each variant tag (e.g., 143a01d-python) is a multi-arch manifest supporting both amd64 and arm64.
• Docker automatically pulls the correct architecture for your platform.
• Individual architecture tags (e.g., 143a01d-python-amd64) are also available if needed.

openhands-ai · 2026-01-03T19:23:02Z

Looks like there are a few issues preventing this PR from being merged!

GitHub Actions are failing:
- Pre-commit checks

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1584 at branch `fix/opus-thinking`

Feel free to include any additional details that might help me get this PR into a better state.

You can manage your notification settings

enyst · 2026-01-03T19:58:50Z

tests/integration/tests/t10_condensation_thinking.py

+
+            # Check if this atomic unit has any events with thinking blocks
+            for event in view.events[start_idx:end_idx]:
+                if isinstance(event, ActionEvent) and event.thinking_blocks:
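
For readers without the full diff, the check in this hunk can be sketched in isolation. This is a minimal sketch, not the test's actual code: `ActionEvent` and `ObservationEvent` below are stand-in dataclasses carrying only the field the check reads, and `unit_has_thinking` is a hypothetical helper name.

```python
from dataclasses import dataclass, field


@dataclass
class ActionEvent:
    """Stand-in for the SDK's ActionEvent; only the field used by the check."""
    thinking_blocks: list[str] = field(default_factory=list)


@dataclass
class ObservationEvent:
    """Stand-in for an event type that never carries thinking blocks."""
    content: str = ""


def unit_has_thinking(events: list, start_idx: int, end_idx: int) -> bool:
    """Return True if any ActionEvent in events[start_idx:end_idx] has thinking blocks."""
    for event in events[start_idx:end_idx]:
        if isinstance(event, ActionEvent) and event.thinking_blocks:
            return True
    return False


# Two atomic units of two events each; only the first contains a thinking block.
events = [
    ActionEvent(thinking_blocks=["signed reasoning"]),
    ObservationEvent("tool output"),
    ActionEvent(),
    ObservationEvent("tool output"),
]
first_unit = unit_has_thinking(events, 0, 2)
second_unit = unit_has_thinking(events, 2, 4)
```

An `ActionEvent` with an empty `thinking_blocks` list is treated the same as an event that cannot carry thinking blocks at all, which is why the second unit is reported as having none.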


Thank you! I appreciate this. I do wonder, though, whether it belongs as an integration test or as something else.

thinking_blocks are an Anthropic Claude parameter. Gemini 3 has them too, but they follow different rules (which also differ from Gemini 2.5 😅 ...); I think MiniMax has them as well. I can't think of another.

Since it feels very LLM-specific, I wonder if maybe a script in scripts/ would work? Alternatively, we now have an examples/ directory named llm_specific/ ... Neither seems great, idk, WDYT?

csmith49 · 2026-01-06

It probably is too LLM-specific to be an integration test. The scaffold is just convenient for condenser tests -- repeatable, no mocked objects, arbitrary LLMs configured from outside the test, etc.

What if we made it like the behavior tests? Using the same infra as the integration tests, but triggered separately and non-blocking. This would basically become c01_test... instead.

Would be good to have some batch of tests that stress the condenser across multiple LLMs we can run on occasion to make sure behavior isn't drifting.
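
One way to picture the split proposed above is a filename-prefix convention that decides which suite a test belongs to and whether a failure blocks the build. This is only a sketch under assumptions: the `SUITES` and `BLOCKING` tables, the helper names, and the file name `c01_thinking_blocks.py` are hypothetical; only the `t10_...` file and the `c01` prefix come from this thread.

```python
import re
from collections import defaultdict

# Hypothetical mapping from filename prefix to suite name.
SUITES = {"t": "integration", "c": "condenser"}
# Hypothetical policy: suites whose failures block the build.
BLOCKING = {"integration"}

# Matches names like t10_condensation_thinking.py: prefix letter, number, rest.
_NAME = re.compile(r"^([a-z])(\d+)_\w+\.py$")


def group_by_suite(filenames):
    """Group test files by suite, ignoring files that do not follow the convention."""
    groups = defaultdict(list)
    for name in filenames:
        match = _NAME.match(name)
        if match and match.group(1) in SUITES:
            groups[SUITES[match.group(1)]].append(name)
    return dict(groups)


def is_blocking(suite):
    """Return True if a failure in this suite should fail the build."""
    return suite in BLOCKING


groups = group_by_suite(
    ["t10_condensation_thinking.py", "c01_thinking_blocks.py", "README.md"]
)
```

With this layout the condenser suite can be run on demand across several LLMs while staying out of the blocking path.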

enyst · 2026-01-06

I actually have a somewhat similar problem: we want an integration test for conversation restore, where the user does a few actions with one LLM, exits and changes to another LLM, then restores the conversation.

The structure of the integration tests is almost okay, except that it is actually unnecessary to run all 6 LLMs or so; the test doesn't really depend on them. That seems like a bit of a waste: it matters that it "really" works, with real LLMs picking up and continuing the conversation, but it doesn't need the whole matrix. Maybe it matters for more than one pair (we might want a reasoning LLM paired with a non-reasoning one, for fun), but those are... different rules for choosing LLMs than the current matrix.

I'd call it similar to this, in the sense that thinking_blocks are under test, so we know exactly which LLM(s) to run it for, and can skip the others? If that's possible within the integration-test framework... for C01, C02, ...
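
The idea of letting each test choose its own LLMs, instead of running the full matrix, could look roughly like the sketch below. Everything here is an assumption for illustration: `ModelInfo`, `CaseSpec`, `plan_runs`, and the model names are hypothetical and are not part of the integration-test framework.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ModelInfo:
    """Hypothetical description of one configured LLM."""
    name: str
    supports_thinking_blocks: bool


@dataclass(frozen=True)
class CaseSpec:
    """Hypothetical test descriptor: an id plus a predicate selecting models."""
    test_id: str
    applies_to: Callable[[ModelInfo], bool]


def plan_runs(cases, models):
    """Return (test_id, model_name) pairs, keeping only models each case applies to."""
    return [
        (case.test_id, model.name)
        for case in cases
        for model in models
        if case.applies_to(model)
    ]


# Placeholder model names; real entries would come from the CI configuration.
models = [
    ModelInfo("thinking-model", supports_thinking_blocks=True),
    ModelInfo("plain-model", supports_thinking_blocks=False),
]
cases = [
    # A thinking-block test only runs where thinking blocks exist.
    CaseSpec("c01", lambda m: m.supports_thinking_blocks),
    # A model-agnostic test keeps the full matrix.
    CaseSpec("t01", lambda m: True),
]
plan = plan_runs(cases, models)
```

A conversation-restore test would need a predicate over pairs of models instead of single models, which is a different selection rule from the one sketched here.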

Calvin Smith added 13 commits January 3, 2026 10:36

initial integration test

cba7e3a

fixing logging

d689e28

test checks that thinking block is actually forgotten

61660f6

minor

3601cec

cleaning up logging

01c1ad5

renaming

322fef3

doc string

485a4ef

tidy pass

1dc0567

refactoring multi-step integration test -- new execute_conversation override in base test class

74a5dc7

minor linting pass

804e823

minor cleanup

3d8dba1

test cleanup, failing for other reasons

324dde2

custom condenser, passing integration test

e97570d

csmith49 mentioned this pull request Jan 3, 2026

Conversations stalling due to unhandled LLM API errors (thinking block signature, context length) #1575

Open

enyst reviewed Jan 3, 2026

This was referenced Jan 6, 2026

Multi-step integration tests #1613

Merged

Condenser integration tests #1652

Merged
