-
Notifications
You must be signed in to change notification settings - Fork 108
Integration test for Opus thinking block constraints #1584
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…verride in base test class
|
Looks like there are a few issues preventing this PR from being merged!
If you'd like me to help, just leave a comment, like Feel free to include any additional details that might help me get this PR into a better state. You can manage your notification settings |
|
|
||
| # Check if this atomic unit has any events with thinking blocks | ||
| for event in view.events[start_idx:end_idx]: | ||
| if isinstance(event, ActionEvent) and event.thinking_blocks: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! I appreciate this, I do wonder though if it belongs as integration test, or maybe something else.
thinking_blocks are an Anthropic Claudes' parameter, Gemini 3 has this but they go by different rules (and different from Gemini 2.5 too 😅 ...); Minimax I think has them too. And I can't think of another.
Since it feels very LLM specific, I wonder if maybe a script in scripts/ would work? Alternatively, we have now an examples/ directory named llm_specific/ ... None seems great, idk, WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It probably is too LLM specific to be an integration test. The scaffold is just convenient for condenser tests -- repeatable, no mocked objects, arbitrary LLMs configured from outside the test, etc.
What if we made it like the behavior tests? Using the same infra as the integration tests, but triggered separately and non-blocking. This would basically become c01_test... instead.
Would be good to have some batch of tests that stress the condenser across multiple LLMs we can run on occasion to make sure behavior isn't drifting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually just have a somewhat similar problem: we want an integration test for conversation restore, where the user does a few actions with an LLM, exit and change to another LLM, then restore the conversation.
The structure of integration tests is almost okay, except that it is actually unnecessary to run all 6 LLMs or so, the test doesn't really depend on them. It just seems to me a bit of a waste, it matters that it "really" works, with real LLMs picking up and continuing the conversation, but it doesn't matter for all matrix. Maybe it matters for more than a pair (we might want a reasoning LLM paired with a non-reasoning one, for fun), but those are... different rules to choose LLM than the current matrix.
I'd call it similar with this, in the sense where thinking_blocks are under test, so we know exactly which LLM(s) to run it for, and not run it for others? If that's possible within the integration-test framework... for C01, C02, ...
Summary
Opus 4.5 seems to occasionally respond with a malformed signature error. This PR adds an integration test that reproduces a possible scenario as simply as possible.
Checklist
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
eclipse-temurin:17-jdknikolaik/python-nodejs:python3.12-nodejs22golang:1.21-bookwormPull (multi-arch manifest)
# Each variant is a multi-arch manifest supporting both amd64 and arm64 docker pull ghcr.io/openhands/agent-server:143a01d-pythonRun
All tags pushed for this build
About Multi-Architecture Support
143a01d-python) is a multi-arch manifest supporting both amd64 and arm64143a01d-python-amd64) are also available if needed