[Optimization] Merge Text processor #7030
Conversation
Thanks for your contribution!
Pull request overview
This PR merges and unifies the text-side processor implementations: the request/response handling logic previously scattered across DataProcessor and Ernie4_5Processor is lifted into a common base class, and a new TextProcessor covers both the auto and ernie4_5 tokenizer paths with a single class, simplifying the dispatch logic in InputPreprocessor.create_processor().
Changes:
- Add `BaseTextProcessor` (fastdeploy/input/base_processor.py), centralizing request/response handling and shared utility methods.
- Add `TextProcessor` and adjust `InputPreprocessor.create_processor()`: non-V1, text-only scenarios now uniformly go through `TextProcessor` (tokenizer_type=auto/ernie4_5).
- Turn `Ernie4_5Processor` into a deprecated wrapper class, and add `tokenizer_type` to the VL processor to fit the unified incremental-decoding logic; update the related unit-test mocks accordingly.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/input/test_preprocess.py | Update the routing unit tests: the mock target changes from DataProcessor to TextProcessor. |
| fastdeploy/input/text_processor.py | Introduce TextProcessor; DataProcessor now inherits the new common base class and drops its local request/response handling. |
| fastdeploy/input/preprocess.py | The text-only, non-V1 path now uses TextProcessor uniformly, selecting tokenizer_type by architecture. |
| fastdeploy/input/ernie4_5_vl_processor/ernie4_5_vl_processor.py | Add tokenizer_type="ernie4_5" so the inherited decoding branch is selected correctly. |
| fastdeploy/input/ernie4_5_processor.py | Turn Ernie4_5Processor into a deprecated wrapper forwarding to TextProcessor(tokenizer_type='ernie4_5'). |
| fastdeploy/input/base_processor.py | Add the common base class BaseTextProcessor, implementing shared request/response handling and utility methods. |
Comments suppressed due to low confidence (1)
fastdeploy/input/text_processor.py:248
`DataProcessor` now inherits from `BaseTextProcessor`, and the `BaseTextProcessor.process_response_dict_*` methods assume `ids2tokens()` always returns a 3-tuple `(delta_text, previous_token_ids, previous_texts)`. However, the current `DataProcessor.ids2tokens()` still returns a single string in the `envs.FD_USE_HF_TOKENIZER=True` branch, which will break unpacking during streaming / non-streaming response handling (e.g. `delta_text, _, previous_texts = ...`). Consider letting `DataProcessor` reuse `BaseTextProcessor.ids2tokens()` directly (deleting the override), or changing the HF branch to return the same 3-tuple as the base class.
```python
class DataProcessor(BaseTextProcessor):
    def __init__(self, model_name_or_path, reasoning_parser_obj=None, tool_parser_obj=None):
        """
        Initializes the DecodeStatus object.
```
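To illustrate the contract the base class expects, here is a minimal, self-contained sketch of an incremental `ids2tokens()` that always returns the 3-tuple. `_FakeTokenizer`, `SketchProcessor`, and the per-task state dicts are made up for this example; real FastDeploy code would call the actual HF or ERNIE tokenizer.

```python
class _FakeTokenizer:
    """Stand-in tokenizer: maps each id to a letter, purely for illustration."""

    def decode(self, token_ids):
        return "".join(chr(ord("a") + (t % 26)) for t in token_ids)


class SketchProcessor:
    def __init__(self):
        self.tokenizer = _FakeTokenizer()
        self._prev_ids = {}   # task_id -> token ids decoded so far
        self._prev_text = {}  # task_id -> text decoded so far

    def ids2tokens(self, token_ids, task_id):
        """Always return (delta_text, previous_token_ids, previous_texts),
        matching what the base class unpacks in both response paths."""
        previous_ids = self._prev_ids.get(task_id, [])
        previous_text = self._prev_text.get(task_id, "")
        all_ids = previous_ids + list(token_ids)
        full_text = self.tokenizer.decode(all_ids)
        delta_text = full_text[len(previous_text):]
        self._prev_ids[task_id] = all_ids
        self._prev_text[task_id] = full_text
        return delta_text, previous_ids, previous_text
```

With this shape, `delta_text, _, previous_texts = processor.ids2tokens(...)` unpacks safely on every call, regardless of which tokenizer backend is in use.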
```
* tool_parser result never updates ``outputs["text"]``; only ``tool_calls`` is
  set. This matches DataProcessor behaviour.
```
The design note here is inconsistent with the actual implementation: the docs say the tool_parser result never updates outputs["text"], but process_response_dict_streaming() assigns outputs["text"] = tool_call_delta_message.content when tool_calls are parsed. Either update the documentation (distinguishing streaming from non-streaming behavior), or adjust the implementation to match the documented contract.
Suggested change:

```diff
-* tool_parser result never updates ``outputs["text"]``; only ``tool_calls`` is
-  set. This matches DataProcessor behaviour.
+* In non-streaming mode, tool_parser results never update ``outputs["text"]``;
+  only ``tool_calls`` is set, matching ``DataProcessor`` behaviour. In
+  streaming mode, ``process_response_dict_streaming()`` may temporarily set
+  ``outputs["text"]`` to the current ``tool_call_delta_message.content`` while
+  parsing tool calls.
```
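As a toy illustration of the streaming behavior this comment describes, the helper below shows `outputs["text"]` being overwritten with the parser's content delta. `apply_tool_call_delta_streaming` and the dict-shaped `tool_call_delta_message` are stand-ins invented for this sketch, not FastDeploy's real parser objects.

```python
def apply_tool_call_delta_streaming(outputs, tool_call_delta_message):
    """Streaming path: when a tool-call delta is parsed, mirror its
    content into outputs["text"] (the behavior the review points out)."""
    if tool_call_delta_message is not None:
        outputs["tool_calls"] = tool_call_delta_message["tool_calls"]
        # Note: this overwrites the decoded text with the parser's delta,
        # which the non-streaming path does not do.
        outputs["text"] = tool_call_delta_message["content"]
    return outputs
```

A doc fix would need to describe both branches, since only the non-streaming path leaves `outputs["text"]` untouched.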
```python
if token_id == think_start_id:
    started = True
    ended = False
    in_thinking = True
elif token_id == think_end_id and in_thinking:
    ended = True
    in_thinking = False
```
`tokens_after_start` is initialized to 0 but is never incremented while iterating over `token_list`, so `think_prompt_tokens_after_start` is always 0. This field is read by fastdeploy/model_executor/logits_processor/thinking_budget.py to update the reasoning-budget state, and a constant 0 distorts the budget logic. Consider counting non-marker tokens after `<think>` is seen, or removing the field to avoid misleading consumers.
Suggested change:

```diff
 if token_id == think_start_id:
+    # Enter thinking segment; do not count the start token itself
     started = True
     ended = False
     in_thinking = True
 elif token_id == think_end_id and in_thinking:
+    # Leave thinking segment; do not count the end token itself
     ended = True
     in_thinking = False
+elif in_thinking:
+    # Count tokens inside the thinking segment, excluding start/end markers
+    tokens_after_start += 1
```
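The suggested counting logic can be checked in isolation with a standalone sketch. The function name and the marker ids below are made up for the example; only the control flow mirrors the suggestion.

```python
def count_thinking_tokens(token_list, think_start_id, think_end_id):
    """Return (started, ended, tokens_after_start) for one pass over
    token_list, counting only tokens strictly between the markers."""
    started = ended = in_thinking = False
    tokens_after_start = 0
    for token_id in token_list:
        if token_id == think_start_id:
            # Enter thinking segment; the start marker itself is not counted
            started = True
            ended = False
            in_thinking = True
        elif token_id == think_end_id and in_thinking:
            # Leave thinking segment; the end marker itself is not counted
            ended = True
            in_thinking = False
        elif in_thinking:
            tokens_after_start += 1
    return started, ended, tokens_after_start
```

For example, with markers 100/101, the sequence `[1, 100, 5, 6, 7, 101, 9]` yields three counted tokens, while a sequence without markers yields zero.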
```python
tokenizer_type = "ernie4_5" if ErnieArchitectures.contains_ernie_arch(architecture) else "auto"
self.processor = TextProcessor(
    model_name_or_path=self.model_name_or_path,
    tokenizer_type=tokenizer_type,
    reasoning_parser_obj=reasoning_parser_obj,
    tool_parser_obj=tool_parser_obj,
)
```
This adds a branch that selects tokenizer_type (ernie4_5 vs auto) by architecture and routes through TextProcessor, but the current unit tests only cover the "non-Ernie, non-multimodal" path. Consider adding a case that covers an Ernie architecture (e.g. an ERNIE4.5-related architecture string), verifying that tokenizer_type is set to ernie4_5 and checking TextProcessor's constructor arguments.
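To make the missing branch concrete, here is a minimal sketch of the selection logic reduced to a pure function, so the Ernie path can be asserted directly. `pick_tokenizer_type` and its substring check are hypothetical stand-ins for `ErnieArchitectures.contains_ernie_arch()`; a real unit test would instead mock `TextProcessor` inside `InputPreprocessor.create_processor()` and inspect the call arguments.

```python
def pick_tokenizer_type(architecture: str) -> str:
    """Stand-in for the architecture check in preprocess.py:
    Ernie architectures get the ernie4_5 tokenizer path, everything
    else falls back to the HF auto path."""
    is_ernie = "Ernie" in architecture or "ERNIE" in architecture
    return "ernie4_5" if is_ernie else "auto"
```

A test for the real factory would cover both `pick`-style outcomes: an ERNIE4.5 architecture string mapping to `tokenizer_type="ernie4_5"`, and a non-Ernie one mapping to `"auto"`.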
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

```
@@           Coverage Diff            @@
##           develop    #7030   +/-   ##
==========================================
  Coverage         ?   73.21%
==========================================
  Files            ?      401
  Lines            ?    56322
  Branches         ?     8887
==========================================
  Hits             ?    41234
  Misses           ?    12183
  Partials         ?     2905
```
* merge text processor
* update
* fix unit test
* merge messages2ids
* fix unit test
* remove duplicate code
* remove redundant code
* delete code
* fix unit test
Motivation
Before this refactor, text processing had two independent inheritance chains (DataProcessor and Ernie4_5Processor). They duplicated large amounts of logic in response handling (process_response_dict, ids2tokens, update_stop_seq, etc.) and request handling (process_request_dict), differing only in tokenizer loading, tokenization strategy, and a few branch behaviors. As a result, the same bug had to be fixed in two places, the response-handling signatures were inconsistent (ERNIE passed stream as a positional argument, HF as a keyword argument), and new features (such as tool_parser and reasoning_parser) had to be adapted separately in both processors, raising maintenance cost.
Modifications
Add a new abstract base class, BaseTextProcessor(ABC), that extracts the duplicated response handling and shared utility methods from both inheritance chains into a single implementation, while declaring abstract interfaces such as _load_tokenizer, text2ids, and messages2ids. On top of that, add a TextProcessor(BaseTextProcessor) class that dispatches strategy on the key paths via the tokenizer_type parameter ("auto" / "ernie4_5"). Finally, update the factory in preprocess.py so the text branch uniformly instantiates TextProcessor; the original DataProcessor and Ernie4_5Processor are kept as deprecation aliases for backward compatibility.
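The class layout described above can be sketched schematically as follows. Class and method names follow the PR; every method body here is a placeholder invented for illustration, not the real FastDeploy implementation.

```python
from abc import ABC, abstractmethod


class BaseTextProcessor(ABC):
    """Shared request/response handling for all text processors."""

    def process_response_dict(self, response, stream=False, **kwargs):
        # Unified entry point; the real base class dispatches to shared
        # streaming / non-streaming handlers here.
        return response

    @abstractmethod
    def _load_tokenizer(self): ...

    @abstractmethod
    def text2ids(self, text): ...

    @abstractmethod
    def messages2ids(self, messages): ...


class TextProcessor(BaseTextProcessor):
    """Covers both tokenizer paths via the tokenizer_type strategy flag."""

    def __init__(self, model_name_or_path, tokenizer_type="auto", **kwargs):
        assert tokenizer_type in ("auto", "ernie4_5")
        self.model_name_or_path = model_name_or_path
        self.tokenizer_type = tokenizer_type

    def _load_tokenizer(self):
        # Strategy dispatch: the ernie4_5 path would load the ERNIE
        # tokenizer, the auto path the HF AutoTokenizer (placeholders here).
        if self.tokenizer_type == "ernie4_5":
            return "ernie-tokenizer"
        return "hf-auto-tokenizer"

    def text2ids(self, text):
        return list(text.encode())  # placeholder

    def messages2ids(self, messages):
        return []  # placeholder
```

The deprecation aliases then reduce to thin wrappers, e.g. `Ernie4_5Processor` forwarding to `TextProcessor(tokenizer_type="ernie4_5")`.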
Usage or Command
Accuracy Tests
Checklist
- Make sure your PR title starts with one of: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run `pre-commit` before commit.
- If the PR targets the `release` branch, make sure the PR has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.