
feat(providers): strip chain-of-thought markers from custom provider output#8635

Open
mvanhorn wants to merge 4 commits into aaif-goose:main from mvanhorn:feat/8590-cot-filter

Conversation

@mvanhorn
Contributor

Summary

Custom OpenAI-compatible providers (DeepSeek R1, Qwen-style proxies, LLMProxy, etc.) emit reasoning inline in the content field as <think>...</think> or <thinking>...</thinking> blocks rather than using the structured reasoning_content JSON field. Goose surfaced those blocks directly in chat, cluttering the UI and sometimes triggering repetition.

This change adds a streaming-safe ThinkFilter / split_think_blocks helper in crates/goose/src/providers/base.rs and wires it into both the non-streaming (response_to_message) and streaming (response_to_streaming_message) paths in crates/goose/src/providers/formats/openai.rs. Inline thinking is routed into MessageContent::thinking; cleaned content flows through MessageContent::text. Structured reasoning_content / reasoning fields still take precedence, so providers that already send reasoning as a first-class field do not get duplicated output.

Streaming semantics: when a chunk ends on a partial tag prefix (<thi, <think, </thinkin, etc.), the filter holds that suffix until it can decide whether it belongs to a think block or is regular content. A chunk ending mid-tag will never leak characters to the UI that later turn out to be inside a think block.
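The hold-back rule can be sketched roughly as follows. This is a minimal illustration with assumed helper names, not the actual `ThinkFilter` code in `base.rs`:

```rust
// Hypothetical sketch of the hold-back idea only; names and exact rules
// are assumptions, not the PR's implementation.

/// Could this chunk suffix still grow into a think tag?
fn is_possible_partial_tag(s: &str) -> bool {
    let l = s.to_ascii_lowercase();
    // A prefix of "<thinking" or "</thinking" covers "<thi", "<think",
    // "</thinkin", and so on ("<think" is itself a prefix of "<thinking").
    if "<thinking".starts_with(l.as_str()) || "</thinking".starts_with(l.as_str()) {
        return true;
    }
    // An open tag with attributes that has not reached '>' yet.
    (l.starts_with("<think ") || l.starts_with("<thinking ")) && !l.contains('>')
}

/// Split a chunk into (text safe to emit now, suffix to hold for the next chunk).
fn hold_partial_suffix(chunk: &str) -> (&str, &str) {
    for (i, _) in chunk.char_indices().filter(|&(_, c)| c == '<') {
        if is_possible_partial_tag(&chunk[i..]) {
            return (&chunk[..i], &chunk[i..]);
        }
    }
    (chunk, "")
}

fn main() {
    assert_eq!(hold_partial_suffix("hello <thi"), ("hello ", "<thi"));
    assert_eq!(hold_partial_suffix("x</thinkin"), ("x", "</thinkin"));
    assert_eq!(hold_partial_suffix("a <table"), ("a <table", ""));
    println!("ok");
}
```

The held suffix is prepended to the next chunk, so an ambiguous `<thi` is only classified once enough bytes have arrived.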

Testing

  • cargo fmt --check: clean
  • cargo clippy -p goose --all-targets -- -D warnings: clean
  • cargo test -p goose --lib providers::base::tests: 23 passed (includes new think filter tests)
  • cargo test -p goose --lib providers::formats::openai::tests: all passed (includes new regression tests)

New unit tests cover:

  • Full-content splitter for <think>, <thinking>, attributes (<think class="x">), case-insensitive tags, multiple blocks, plain content, unterminated opens
  • Streaming across partial tag boundaries (chunks like ["<thi", "nk>x</thi", "nk>y"])
  • Non-think tags (<table>) are treated as content
  • Structured reasoning_content path still wins when both are present
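For intuition, the full-content splitting those tests exercise looks roughly like this. This is an assumed simplification, not the PR's `split_think_blocks` signature, and it deliberately skips attribute and tag-boundary edge cases:

```rust
// Simplified full-content splitter sketch (assumed behavior):
// returns (visible content, concatenated thinking).
fn split_think(input: &str) -> (String, String) {
    let mut content = String::new();
    let mut thinking = String::new();
    // Lowercased copy for case-insensitive matching; byte offsets stay
    // aligned with `input` because ASCII lowercasing preserves length.
    let lower = input.to_ascii_lowercase();
    let mut pos = 0;
    while let Some(rel) = lower[pos..].find("<think") {
        let open = pos + rel;
        // End of the opening tag; if '>' never arrives, the rest is content.
        let Some(gt) = lower[open..].find('>') else { break };
        content.push_str(&input[pos..open]);
        let body = open + gt + 1;
        match lower[body..].find("</think") {
            Some(c) => {
                let close = body + c;
                thinking.push_str(&input[body..close]);
                pos = match lower[close..].find('>') {
                    Some(g) => close + g + 1,
                    None => input.len(),
                };
            }
            None => {
                // Unterminated open: everything after it is thinking.
                thinking.push_str(&input[body..]);
                return (content, thinking);
            }
        }
    }
    content.push_str(&input[pos..]);
    (content, thinking)
}

fn main() {
    assert_eq!(
        split_think("a<think>x</think>b"),
        ("ab".to_string(), "x".to_string())
    );
    assert_eq!(
        split_think("<think>open only"),
        ("".to_string(), "open only".to_string())
    );
    println!("ok");
}
```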

Related Issues

Closes #8590

Prior art: Zed's similar fix landed in zed-industries/zed#53488 ("Strip think tags from text stream") for an analogous UI-leakage issue.

This contribution was developed with AI assistance (Codex).

…output

Custom OpenAI-compatible providers (DeepSeek R1, Qwen, etc.) embed reasoning
in `<think>...</think>` / `<thinking>...</thinking>` blocks inline in the
content field rather than using the structured `reasoning_content` field.
Goose surfaced these blocks directly in chat, cluttering the UI.

Adds a streaming-safe `ThinkFilter` / `split_think_blocks` helper that
separates inline think blocks from content. Wires it into the OpenAI
response parser (both non-streaming and streaming paths), routing thinking
into `MessageContent::thinking` and emitting cleaned text through
`MessageContent::text`. Structured `reasoning_content` / `reasoning` fields
still take precedence so we do not duplicate reasoning for providers that
already send it as a first-class field.

Streaming buffers partial tag prefixes (`<thi`, `<think`, `</thin`, etc.)
until the decision can be made, so a chunk ending mid-tag never leaks
characters that later turn out to be inside a think block. Unit tests cover
the full-content splitter, streaming boundaries, case-insensitive tags,
attributes, multiple blocks, non-think tags, and unterminated opens.

Closes aaif-goose#8590

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8af5d16ed8

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread crates/goose/src/providers/formats/openai.rs
Comment thread crates/goose/src/providers/base.rs Outdated
@mvanhorn
Contributor Author

Still resolving routing.

@mvanhorn mvanhorn closed this Apr 17, 2026
@mvanhorn
Contributor Author

Apologies for the noise - I misread my tooling and closed this thinking it was misrouted. block/goose and aaif-goose/goose resolve to the same repo (transfer). Reopening.

@mvanhorn mvanhorn reopened this Apr 18, 2026
@jh-block
Collaborator

Thanks for this @mvanhorn -- can you address the Codex code review feedback? Either implement the requested changes or reply to the comments briefly explaining why they're invalid or unnecessary.

@jh-block jh-block self-assigned this Apr 20, 2026
Two fixes to the OpenAI-format provider path, per Codex P2 review on aaif-goose#8635:

- base.rs parse_think_tag: require a real tag boundary (>, /, or whitespace)
  after the name before classifying as ThinkTag::Open. Tags like <thinking-mode>
  or <thinking123> were being matched as think tags because the parser scanned
  blindly for > after matching 'think'/'thinking'. Adds regression tests for
  <thinking-mode>, <thinking123>, <thinker>, and <think data-source="x">.

- openai.rs response_to_streaming_message: buffer inline <think>...</think>
  output in pending_inline_thinking until end-of-stream or the tool-call flush.
  If reasoning_content arrives in any later chunk, saw_structured_reasoning flips
  and the buffered inline thinking is discarded so structured reasoning wins
  cleanly. Previously, inline think chunks were yielded immediately, so a later
  reasoning_content chunk produced duplicated reasoning. New streaming test
  covers the race.
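The boundary check described in the first fix can be sketched like this (assumed names, not the actual `parse_think_tag` implementation):

```rust
// Illustrative sketch of the tag-boundary rule from the commit above;
// names are assumptions, not the PR's code.

#[derive(Debug, PartialEq)]
enum Classified {
    ThinkOpen,
    NotThink,
}

/// After matching the name "think"/"thinking", the next byte must be
/// '>', '/', or ASCII whitespace for the tag to count as a think tag.
fn classify(tag: &str) -> Classified {
    let l = tag.to_ascii_lowercase();
    let rest = l
        .strip_prefix("<thinking")
        .or_else(|| l.strip_prefix("<think"));
    match rest.and_then(|r| r.bytes().next()) {
        Some(b) if b == b'>' || b == b'/' || b.is_ascii_whitespace() => Classified::ThinkOpen,
        _ => Classified::NotThink,
    }
}

fn main() {
    assert_eq!(classify("<think>"), Classified::ThinkOpen);
    assert_eq!(classify("<think data-source=\"x\">"), Classified::ThinkOpen);
    assert_eq!(classify("<thinking-mode>"), Classified::NotThink);
    assert_eq!(classify("<thinking123>"), Classified::NotThink);
    assert_eq!(classify("<thinker>"), Classified::NotThink);
    println!("ok");
}
```

Trying the longer name `<thinking` before `<think` matters: otherwise `<thinking>` would strip only `<think` and see `i` as the next byte.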

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 06d806af73


// Without this, `<thinking-mode>` or `<thinking123>` would be classified as a
// think tag and stripped from normal content.
let valid_open_boundary = match bytes.get(idx) {
Some(&b) => b == b'>' || b == b'/' || b.is_ascii_whitespace(),

P2: Treat self-closing tags as closed

The parser accepts '/' as a valid boundary after think/thinking, but open tags are always emitted as ThinkTag::Open, so <think/> increments think_depth and never closes it. In that case, all following text is routed to thinking and removed from visible assistant text (e.g., <think/>Visible answer becomes hidden reasoning). This is a data-loss regression for providers that emit self-closing tags, so the parser should either reject /> as an open-think marker or treat it as an immediate open+close.


Collaborator


I've not seen an empty think tag rendered as a self-closing tag (<think/>) before, but here Codex quite confidently says "for providers that emit self-closing tags", implying that some providers do this. It might be worth being defensive here, @mvanhorn?

@mvanhorn
Contributor Author

Thanks @jh-block. Both Codex P2 items addressed in 06d806a:

  • parse_think_tag in crates/goose/src/providers/base.rs: added a real-boundary check (>, /, whitespace) after the name match, so <thinking-mode>, <thinking123>, <thinker> no longer get classified as think tags and stripped. New regression tests in base.rs cover that plus <think data-source="x"> so tags with attributes still work.
  • response_to_streaming_message in crates/goose/src/providers/formats/openai.rs: inline <think>...</think> chunks now accumulate into pending_inline_thinking and are only flushed at end-of-stream (or at the tool-call flush block). When reasoning_content arrives in any later chunk, saw_structured_reasoning flips and the buffer is cleared, so structured reasoning wins cleanly without duplicated output. New streaming test test_streaming_suppresses_inline_think_when_structured_reasoning_follows covers the race.

Verified: 5 think-filter tests + 58 openai format tests pass, cargo fmt + cargo clippy -D warnings clean.

A self-closing `<think/>` used to be classified as `ThinkTag::Open`
because parse_think_tag looped until `>` without distinguishing the
trailing `/`. That incremented think_depth and routed every subsequent
chunk into the thinking bucket until the end of the stream, silently
dropping the provider's visible output.

Add a `ThinkTag::SelfClosing` variant, detect the trailing `/` before
`>`, and treat it as no-op (consume the tag text, leave think_depth
and inside_think unchanged). Works for `<think/>`, `<think />`,
`<thinking/>`, and `<think data-source="x"/>`.

New regression tests in base.rs cover the no-op cases and verify a
self-closing tag inside an open block does not pre-close it.
@mvanhorn
Contributor Author

@jh-block fair, defensive handling added in 7940020.

parse_think_tag now distinguishes self-closing (<think/>, <think/> with attrs, <thinking/>, <think />) from plain open. I added a ThinkTag::SelfClosing variant that is treated as a no-op - the tag is consumed from visible output (so <think/> doesn't leak to the user), but think_depth and inside_think are left unchanged, so the rest of the stream isn't silently swallowed into the thinking bucket.

Regression tests in base.rs cover:

  • <think/>, <think />, <thinking/>, <think data-source="x"/> -> content preserved, thinking empty
  • self-closing inside an open <think> block doesn't pre-close it (depth stays at 1 until the real </think>)
  • streaming case: content after <think/> across multiple chunks stays visible
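The self-closing handling can be sketched as follows, under assumed names (the real code detects the trailing `/` during its byte scan rather than on a complete tag string):

```rust
// Sketch of self-closing detection and its no-op depth handling;
// names and shape are assumptions, not the PR's parse_think_tag.

#[derive(Debug, PartialEq)]
enum ThinkTag {
    Open,
    SelfClosing,
}

/// The last non-whitespace byte before '>' decides self-closing vs open.
fn open_kind(tag: &str) -> ThinkTag {
    let inner = tag.trim_end_matches('>');
    if inner.trim_end().ends_with('/') {
        ThinkTag::SelfClosing
    } else {
        ThinkTag::Open
    }
}

/// Self-closing tags are consumed from visible output but leave the
/// nesting depth alone, so the rest of the stream is not swallowed
/// into the thinking bucket.
fn apply(tag: ThinkTag, depth: &mut usize) {
    match tag {
        ThinkTag::Open => *depth += 1,
        ThinkTag::SelfClosing => {}
    }
}

fn main() {
    assert_eq!(open_kind("<think/>"), ThinkTag::SelfClosing);
    assert_eq!(open_kind("<think />"), ThinkTag::SelfClosing);
    assert_eq!(open_kind("<think data-source=\"x\"/>"), ThinkTag::SelfClosing);
    assert_eq!(open_kind("<think>"), ThinkTag::Open);
    let mut depth = 0usize;
    apply(open_kind("<think/>"), &mut depth);
    assert_eq!(depth, 0);
    apply(open_kind("<think>"), &mut depth);
    assert_eq!(depth, 1);
    println!("ok");
}
```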


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7940020b67


Comment on lines +250 to +252
static OPEN_RE: LazyLock<Regex> = LazyLock::new(|| {
Regex::new(r"(?is)^<(?:t(?:h(?:i(?:n(?:k(?:i(?:n(?:g)?)?)?)?)?)?)?)(?:\s[^>]*)?$").unwrap()
});

P2: Recognize <think/ as an incomplete self-closing think tag

is_possible_partial_think_tag does not match the valid in-progress suffix <think/ (or <thinking/), so when a stream chunk splits a self-closing tag at that boundary (e.g. "...<think/" then ">..."), the first chunk is emitted as visible text and the marker is never stripped. That reintroduces leakage for one common chunk split pattern and violates the streaming guarantee that mid-tag boundaries are buffered until disambiguated.


Collaborator


@mvanhorn looks like there's one last issue to resolve here (the streaming parser can emit <think/ or <thinking/ if that appears at the end of a chunk). If you can tweak the regex to handle that I think then we should be able to merge this

The streaming partial-tag regex allowed `<think<whitespace>...` as a
partial open tag but rejected a bare trailing `/`. A chunk that ended
mid self-closing tag (e.g. `<think/` before `>` arrived in the next
chunk) was not buffered, so the `<think/` prefix leaked into visible
content.

Extend OPEN_RE with the `/` alternative and add a regression test that
exercises three partial-boundary cases: `<think/` + `>`, `<thinking/` +
`>`, and `<think ` + `/>`. All three now buffer correctly and never
leak `<` into visible output.
@mvanhorn
Contributor Author

@jh-block addressed in bb9f824. Extended OPEN_RE with a trailing / alternative so <think/ and <thinking/ at a chunk boundary buffer until the > arrives in the next chunk.

Root cause: the old regex accepted <think<whitespace>... but not a bare trailing /, so a chunk ending mid self-closing tag fell out of is_possible_partial_think_tag and leaked the <think/ prefix into visible content.

Added test_think_filter_streaming_across_self_closing_boundary covering three partial-boundary cases: <think/ + >, <thinking/ + >, and <think + />. Asserts no < leaks into content and thinking stays empty (self-closing = no payload). Verified with cargo test -p goose --lib providers::base::tests - 29 tests pass, clippy/fmt clean.
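The extended partial-open check is equivalent to something like the following regex-free sketch (the PR itself extends `OPEN_RE`; names and exact shape here are assumptions):

```rust
// Regex-free sketch of the extended partial-open-tag check; the PR's
// actual implementation extends the OPEN_RE regex instead.

/// A chunk suffix is still an ambiguous open tag if it is a prefix of
/// "<thinking", or the full name followed by attributes and/or a trailing
/// '/' with no '>' yet -- so "<think/" and "<thinking/" are buffered.
fn is_possible_partial_open(s: &str) -> bool {
    let l = s.to_ascii_lowercase();
    // Any prefix of the bare name: "<", "<t", ..., "<thinking".
    if "<thinking".starts_with(l.as_str()) {
        return true;
    }
    // Longer name first so "<thinking..." is not split as "<think" + "ing...".
    for name in ["<thinking", "<think"] {
        if let Some(rest) = l.strip_prefix(name) {
            let boundary = rest.starts_with(|c: char| c.is_ascii_whitespace())
                || rest.starts_with('/');
            return boundary && !rest.contains('>');
        }
    }
    false
}

fn main() {
    assert!(is_possible_partial_open("<think/"));
    assert!(is_possible_partial_open("<thinking/"));
    assert!(is_possible_partial_open("<think "));
    assert!(!is_possible_partial_open("<think/>")); // complete tag, not partial
    assert!(!is_possible_partial_open("<thinker"));
    println!("ok");
}
```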


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bb9f82416f


Comment on lines +231 to +233
while let Some(&byte) = bytes.get(idx) {
if byte == b'>' {
let kind = if last_non_ws == Some(b'/') {

P2: Skip quoted > when scanning think-tag end

The open-tag parser treats the first > byte as the end of a <think...> tag without tracking quoted attribute values, so inputs like <think data="a>b"/>Visible are misparsed: the tag closes early, then trailing bytes and even visible answer text can be routed into thinking or dropped. This is a data-corruption path for providers/models that emit > inside attribute values (or tool-generated XML-like tags), so the scanner should ignore > while inside quoted strings.
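One possible defensive fix -- not part of this PR -- is a quote-aware scan for the end of the open tag, sketched here with assumed names:

```rust
// Possible defensive fix (not in this PR): ignore '>' while inside a
// quoted attribute value when scanning for the end of an open tag.

/// Returns the byte index of the '>' that actually ends the tag,
/// skipping any '>' bytes inside single- or double-quoted values.
fn find_tag_end(tag: &str) -> Option<usize> {
    let mut quote: Option<u8> = None;
    for (i, b) in tag.bytes().enumerate() {
        match quote {
            Some(q) => {
                if b == q {
                    quote = None; // closing quote
                }
            }
            None => match b {
                b'"' | b'\'' => quote = Some(b),
                b'>' => return Some(i),
                _ => {}
            },
        }
    }
    None // tag not terminated in this input
}

fn main() {
    // The '>' inside the quoted value must not end the tag.
    assert_eq!(find_tag_end("<think data=\"a>b\"/>"), Some(18));
    assert_eq!(find_tag_end("<think>"), Some(6));
    assert_eq!(find_tag_end("<think data=\"a"), None);
    println!("ok");
}
```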




Development

Successfully merging this pull request may close these issues.

Chain-of-Thought Leakage Breaking Chat Readability & Poor User Experience
