
[Optimization] Merge Text processor #7030

Merged
LiqinruiG merged 9 commits into PaddlePaddle:develop from luukunn:merge_processor_2
Mar 30, 2026

Conversation

@luukunn (Collaborator) commented Mar 26, 2026

Motivation

Before this refactor, text processing had two independent inheritance chains (DataProcessor and Ernie4_5Processor). They duplicated large amounts of logic in response handling (process_response_dict, ids2tokens, update_stop_seq, etc.) and request handling (process_request_dict), differing only in tokenizer loading, tokenization strategy, and a few branch behaviors. As a result, the same bug had to be fixed in two places, the response-handling signatures were inconsistent (ERNIE passed stream as a positional argument while HF passed it as a keyword argument), and new features (such as tool_parser and reasoning_parser) had to be adapted in both processors separately, raising maintenance cost.

💡 If this PR is a cherry-pick, the PR title must follow the format: add the [Cherry-Pick] label at the very beginning and append the original PR ID at the end, e.g. [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

Add a new abstract base class BaseTextProcessor(ABC) that extracts the duplicated response-handling logic and shared utility methods from both inheritance chains into a single implementation, while declaring abstract interfaces such as _load_tokenizer, text2ids, and messages2ids. On top of this, add a TextProcessor(BaseTextProcessor) class that dispatches between strategies on the key code paths via a tokenizer_type parameter ("auto" / "ernie4_5"). Finally, update the factory in preprocess.py so the text branch uniformly instantiates TextProcessor; the original DataProcessor and Ernie4_5Processor remain as deprecation aliases for backward compatibility.
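The structure described above can be sketched as follows. Class and method names (BaseTextProcessor, TextProcessor, tokenizer_type, _load_tokenizer, text2ids) follow the PR description, but the method bodies are illustrative placeholders, not the actual FastDeploy implementation.

```python
from abc import ABC, abstractmethod


class BaseTextProcessor(ABC):
    """Shared request/response handling for both tokenizer families (sketch)."""

    @abstractmethod
    def _load_tokenizer(self):
        ...

    @abstractmethod
    def text2ids(self, text):
        ...

    def process_response_dict(self, response, stream=False):
        # The unified implementation lives here once, instead of being
        # duplicated across DataProcessor and Ernie4_5Processor.
        response["processed"] = True
        return response


class TextProcessor(BaseTextProcessor):
    def __init__(self, tokenizer_type="auto"):
        if tokenizer_type not in ("auto", "ernie4_5"):
            raise ValueError(f"unknown tokenizer_type: {tokenizer_type}")
        self.tokenizer_type = tokenizer_type

    def _load_tokenizer(self):
        # Strategy dispatch on the key path, as the PR describes.
        if self.tokenizer_type == "ernie4_5":
            return "ernie4_5-tokenizer"  # placeholder for the ERNIE loader
        return "auto-tokenizer"          # placeholder for an AutoTokenizer loader

    def text2ids(self, text):
        return list(text.encode("utf-8"))  # placeholder tokenization
```

Under this shape, DataProcessor and Ernie4_5Processor can survive as thin aliases that construct TextProcessor with the appropriate tokenizer_type.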

Usage or Command

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code; run pre-commit before committing.
  • Add unit tests. If none are added, please explain the reason in this PR.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings March 26, 2026 08:31
@paddle-bot (bot) commented Mar 26, 2026

Thanks for your contribution!

Copilot AI (Contributor) left a comment

Pull request overview

This PR merges/unifies the text-side processor implementations: it lifts the request/response handling logic previously split across DataProcessor and Ernie4_5Processor into a common base class, and adds TextProcessor to cover both the auto and ernie4_5 tokenizer paths with a single class, simplifying the dispatch logic in InputPreprocessor.create_processor().

Changes:

  • Add BaseTextProcessor (fastdeploy/input/base_processor.py), centralizing request/response handling and shared utility methods.
  • Add TextProcessor and adjust InputPreprocessor.create_processor(): non-V1, text-only scenarios uniformly route through TextProcessor(tokenizer_type="auto"/"ernie4_5").
  • Turn Ernie4_5Processor into a deprecated wrapper, add tokenizer_type to the VL processor to fit the unified incremental-decoding logic, and update the related unit-test mocks.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Summary per file
File — Description
tests/input/test_preprocess.py — Update the routing unit tests: the mock target changes from DataProcessor to TextProcessor.
fastdeploy/input/text_processor.py — Introduce TextProcessor; DataProcessor now inherits the new common base class and drops its local request/response handling.
fastdeploy/input/preprocess.py — Text-only, non-V1 paths uniformly use TextProcessor, selecting tokenizer_type by architecture.
fastdeploy/input/ernie4_5_vl_processor/ernie4_5_vl_processor.py — Add tokenizer_type="ernie4_5" so the inherited incremental-decoding branch is selected correctly.
fastdeploy/input/ernie4_5_processor.py — Ernie4_5Processor becomes a deprecated wrapper forwarding to TextProcessor(tokenizer_type="ernie4_5").
fastdeploy/input/base_processor.py — New common base class BaseTextProcessor implementing shared request/response handling and utility methods.
Comments suppressed due to low confidence (1)

fastdeploy/input/text_processor.py:248

  • DataProcessor now inherits BaseTextProcessor, and BaseTextProcessor.process_response_dict_* assumes that ids2tokens() always returns a 3-tuple (delta_text, previous_token_ids, previous_texts). However, the envs.FD_USE_HF_TOKENIZER=True branch of the current DataProcessor.ids2tokens() still returns a single string, which will make unpacking fail during streaming / non-streaming response handling (e.g. delta_text, _, previous_texts = ...). Suggest either having DataProcessor reuse BaseTextProcessor.ids2tokens() directly (removing the override), or changing the HF branch to return the same 3-tuple as the base class.
class DataProcessor(BaseTextProcessor):
    def __init__(self, model_name_or_path, reasoning_parser_obj=None, tool_parser_obj=None):
        """
            Initializes the DecodeStatus object.

Comment on lines +39 to +40
* tool_parser result never updates ``outputs["text"]``; only ``tool_calls`` is
set. This matches DataProcessor behaviour.

Copilot AI Mar 26, 2026

This design note does not match the implementation: the docstring says tool_parser results never update outputs["text"], but process_response_dict_streaming() assigns tool_call_delta_message.content to outputs["text"] when tool_calls are parsed. Either update the documentation (distinguishing streaming from non-streaming behavior) or adjust the implementation to match the documented contract.

Suggested change
- * tool_parser result never updates ``outputs["text"]``; only ``tool_calls`` is
-   set. This matches DataProcessor behaviour.
+ * In non-streaming mode, tool_parser results never update ``outputs["text"]``;
+   only ``tool_calls`` is set, matching ``DataProcessor`` behaviour. In
+   streaming mode, ``process_response_dict_streaming()`` may temporarily set
+   ``outputs["text"]`` to the current ``tool_call_delta_message.content`` while
+   parsing tool calls.

Comment on lines +497 to +503
if token_id == think_start_id:
    started = True
    ended = False
    in_thinking = True
elif token_id == think_end_id and in_thinking:
    ended = True
    in_thinking = False

Copilot AI Mar 26, 2026

tokens_after_start is initialized to 0 but never incremented while iterating token_list, so think_prompt_tokens_after_start is always 0. This field is read by fastdeploy/model_executor/logits_processor/thinking_budget.py to update the reasoning-budget state, and a constant 0 distorts the budget logic. Suggest counting non-end-marker tokens after encountering <think>, or removing the field to avoid misleading consumers.

Suggested change
- if token_id == think_start_id:
-     started = True
-     ended = False
-     in_thinking = True
- elif token_id == think_end_id and in_thinking:
-     ended = True
-     in_thinking = False
+ if token_id == think_start_id:
+     # Enter thinking segment; do not count the start token itself
+     started = True
+     ended = False
+     in_thinking = True
+ elif token_id == think_end_id and in_thinking:
+     # Leave thinking segment; do not count the end token itself
+     ended = True
+     in_thinking = False
+ elif in_thinking:
+     # Count tokens inside the thinking segment, excluding start/end markers
+     tokens_after_start += 1

Comment on lines +88 to 93
tokenizer_type = "ernie4_5" if ErnieArchitectures.contains_ernie_arch(architecture) else "auto"
self.processor = TextProcessor(
    model_name_or_path=self.model_name_or_path,
    tokenizer_type=tokenizer_type,
    reasoning_parser_obj=reasoning_parser_obj,
    tool_parser_obj=tool_parser_obj,

Copilot AI Mar 26, 2026

This adds a branch that selects tokenizer_type (ernie4_5 vs auto) by architecture and routes through TextProcessor, but the current unit tests only cover the non-Ernie, non-multimodal path. Suggest adding a case covering an Ernie architecture (e.g. an architecture string related to ERNIE4.5) to verify that tokenizer_type is set to ernie4_5, and checking the constructor arguments passed to TextProcessor.
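The selection logic the review wants covered can be tested in isolation. The helper below mirrors the quoted one-liner; contains_ernie_arch is approximated by a substring check for illustration, and the architecture strings are hypothetical examples, so this is a sketch of the missing test case rather than the actual tests/input/test_preprocess.py code.

```python
def select_tokenizer_type(architecture: str) -> str:
    # Stand-in for: "ernie4_5" if ErnieArchitectures.contains_ernie_arch(...)
    # else "auto". Substring check is an assumption for this sketch.
    return "ernie4_5" if "Ernie" in architecture else "auto"


def test_ernie_architecture_selects_ernie_tokenizer():
    # The case the review says is missing: an Ernie architecture string
    # should route to the ernie4_5 tokenizer strategy.
    assert select_tokenizer_type("Ernie4_5ForCausalLM") == "ernie4_5"


def test_other_architecture_selects_auto():
    # Existing covered path: non-Ernie architectures use "auto".
    assert select_tokenizer_type("LlamaForCausalLM") == "auto"
```

In the real test, one would additionally mock TextProcessor (as test_preprocess.py already does) and assert on the tokenizer_type keyword it receives.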

@codecov-commenter commented Mar 26, 2026

Codecov Report

❌ Patch coverage is 89.37644% with 46 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@25d64ef). Learn more about missing BASE report.

Files with missing lines Patch % Lines
fastdeploy/input/base_processor.py 89.43% 23 Missing and 18 partials ⚠️
fastdeploy/input/preprocess.py 42.85% 3 Missing and 1 partial ⚠️
fastdeploy/input/text_processor.py 96.77% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7030   +/-   ##
==========================================
  Coverage           ?   73.21%           
==========================================
  Files              ?      401           
  Lines              ?    56322           
  Branches           ?     8887           
==========================================
  Hits               ?    41234           
  Misses             ?    12183           
  Partials           ?     2905           
Flag Coverage Δ
GPU 73.21% <89.37%> (?)

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@luukunn luukunn changed the title Merge processor 2 [Optimization]Merge Text processor Mar 27, 2026
Copilot AI review requested due to automatic review settings March 27, 2026 08:10

This comment was marked as resolved.

Copilot AI review requested due to automatic review settings March 27, 2026 11:31

This comment was marked as resolved.

@LiqinruiG (Collaborator) left a comment

LGTM

@LiqinruiG LiqinruiG merged commit b9f8873 into PaddlePaddle:develop Mar 30, 2026
35 of 38 checks passed
mattheliu pushed a commit to mattheliu/FastDeploy that referenced this pull request Apr 1, 2026
* merge text processor

* update

* fix unit test

* merge messages2ids

* fix unit test

* remove duplicated code

* remove redundant code

* delete code

* fix unit test

4 participants