Why is BaseChatOpenAI streaming the "get_final_completion" as a chunk? #29640
Example Code

```python
chain = self.model.bind(
    response_format=oai_response_format,
    tools=oai_tools,
    parallel_tool_calls=False,
) | JsonOutputParser()
```
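For context, a self-contained version of this setup might look like the sketch below. The model name, tool definition, and response format are assumptions reconstructed from the chunk dump in the Description, not the reporter's actual values.

```python
# Hypothetical, self-contained reconstruction of the setup above. The model
# name, tool definition, and response format are assumptions inferred from
# the chunk dump below, not the reporter's actual code.
from langchain_core.output_parsers import JsonOutputParser
from langchain_openai import ChatOpenAI

# OpenAI-format tool definition (the "oai_tools" in the snippet above).
oai_tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location.",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
    }
]

# OpenAI-format structured-output schema (the "oai_response_format" above).
# A "json_schema" response format routes BaseChatOpenAI through the code
# path that exposes get_final_completion.
oai_response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "weather_query",
        "schema": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
            "additionalProperties": False,
        },
        "strict": True,
    },
}

model = ChatOpenAI(model="gpt-4o-mini")
chain = model.bind(
    response_format=oai_response_format,
    tools=oai_tools,
    parallel_tool_calls=False,
) | JsonOutputParser()

# Streaming the chain surfaces the duplicated final chunk shown below.
for chunk in chain.stream("What is the weather in New York?"):
    print(chunk)
```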
Description

The chunk below is produced when tools and response_format are bound simultaneously. Note the doubled tool call id, name, and argument strings:

```python
AIMessageChunk(content='', additional_kwargs={'tool_calls': [{'index': 0, 'id': 'call_0xRjVSsqNJElRTnw357Z2CwIcall_0xRjVSsqNJElRTnw357Z2CwI', 'function': {'arguments': '{"location":"New York"}{"location":"New York"}', 'name': 'get_weatherget_weather', 'parsed_arguments': {'location': 'New York'}}, 'type': 'function'}], 'parsed': None, 'refusal': None}, response_metadata={'finish_reason': 'tool_calls', 'token_usage': None, 'model_name': '', 'system_fingerprint': 'fp_f3927aa00d', 'prompt_filter_results': [{'prompt_index': 0, 'content_filter_results': {}}]}, id='run-8ec9c1fe-4c25-42e6-a2c0-4b4e53ccee46', tool_calls=[{'name': 'get_weatherget_weather', 'args': {'location': 'New York'}, 'id': 'call_0xRjVSsqNJElRTnw357Z2CwIcall_0xRjVSsqNJElRTnw357Z2CwI', 'type': 'tool_call'}], tool_call_chunks=[{'name': 'get_weatherget_weather', 'args': '{"location":"New York"}{"location":"New York"}', 'id': 'call_0xRjVSsqNJElRTnw357Z2CwIcall_0xRjVSsqNJElRTnw357Z2CwI', 'index': 0, 'type': 'tool_call_chunk'}])
```

It is emitted from this code in BaseChatOpenAI's streaming path:

```python
if hasattr(response, "get_final_completion") and "response_format" in payload:
    final_completion = await response.get_final_completion()
    generation_chunk = self._get_generation_chunk_from_completion(
        final_completion
    )
    if run_manager:
        await run_manager.on_llm_new_token(
            generation_chunk.text, chunk=generation_chunk
        )
    yield generation_chunk
```

Is there a recommended way to identify that final chunk? It seems to me this shouldn't be a chunk but a complete message, since it is already fully formed. This change was introduced in #29044.
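As a downstream workaround, one option is to flag the synthetic final chunk by the parsed value it carries. This is only a sketch: it assumes, based on the dump above, that ordinary deltas carry parsed=None in additional_kwargs and only the chunk built from get_final_completion() carries a non-None value, which is an observation rather than a documented contract.

```python
# Sketch of a consumer-side filter for the synthetic final chunk.
# Assumption: only the chunk built from get_final_completion() carries a
# non-None "parsed" value in additional_kwargs (as the dump above suggests);
# this is observed behavior, not a documented contract.
from typing import AsyncIterator, Tuple

from langchain_core.messages import AIMessageChunk


async def tag_final_chunk(
    stream: AsyncIterator[AIMessageChunk],
) -> AsyncIterator[Tuple[bool, AIMessageChunk]]:
    """Yield (is_final, chunk) pairs, flagging the parsed-completion chunk."""
    async for chunk in stream:
        is_final = chunk.additional_kwargs.get("parsed") is not None
        yield is_final, chunk
```

A caller could then treat the flagged chunk as a complete message instead of appending it to the accumulated deltas.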
Replies: 1 comment
Thanks for raising this. I believe the bug was resolved in #29649. There are a few options for how we stream structured output with OpenAI:

1. Stream chunks with JSON string content, with a final chunk containing the parsed Pydantic object. Obtain this parsed object using get_final_completion. This is what is implemented now and what is demonstrated in OpenAI's docs. The downside, as you found, is that it erroneously doubled tool calls when we simultaneously stream tool calls + structured output (this particular bug is now fixed).
2. Stream chunks with JSON string content, with a final chunk containing the parsed Pydantic object. Obtain this parsed object during the stream from the content…

(2) in my opinion is not an obvious slam dunk as there are some trade-offs, so I will keep things as-is for now, but please let me know if there are other options or you have additional thoughts.
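For illustration, here is a sketch of how option (1) looks from the consumer side, assuming the post-#29649 behavior in which JSON string content streams in ordinary chunks and a single final chunk carries the parsed object. It reuses the hypothetical model and oai_response_format from the reproduction sketch above.

```python
# Sketch of consuming option (1), assuming post-#29649 behavior: JSON string
# content streams in ordinary chunks and one final chunk carries the parsed
# object. "model" and "oai_response_format" are the hypothetical values from
# the reproduction sketch earlier in this discussion.
aggregate = None
for chunk in model.bind(response_format=oai_response_format).stream(
    "What is the weather in New York?"
):
    # AIMessageChunk addition concatenates string fields such as content and
    # tool-call arguments, which is exactly why emitting the final
    # completion's tool calls on top of the already-streamed deltas doubled
    # every id, name, and argument string in the merged message.
    aggregate = chunk if aggregate is None else aggregate + chunk

# Assumption (see the sketch above): the final chunk populates "parsed" in
# additional_kwargs with the fully parsed object.
parsed = aggregate.additional_kwargs.get("parsed")
print(parsed)
```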