[Optimization] Merge Text processor #7030
Conversation
Thanks for your contribution!
Pull request overview
This PR merges and unifies the text-side processor implementations: the request/response handling logic previously scattered across DataProcessor and Ernie4_5Processor is lifted into a common base class, and a new TextProcessor covers both the auto and ernie4_5 tokenizer paths with a single class, simplifying the dispatch logic in InputPreprocessor.create_processor().
Changes:
- Add `BaseTextProcessor` (fastdeploy/input/base_processor.py), centralizing request/response handling and shared utility methods.
- Add `TextProcessor` and adjust `InputPreprocessor.create_processor()`: non-V1, text-only scenarios now uniformly go through `TextProcessor` (tokenizer_type=auto/ernie4_5).
- Turn `Ernie4_5Processor` into a deprecated wrapper class, and add `tokenizer_type` to the VL processor to fit the unified incremental-decoding logic; update the related unit-test mocks accordingly.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/input/test_preprocess.py | Update the routing unit tests: the mock target changes from DataProcessor to TextProcessor. |
| fastdeploy/input/text_processor.py | Introduce TextProcessor; DataProcessor now inherits the new common base class and drops its local request/response handling. |
| fastdeploy/input/preprocess.py | The text-only, non-V1 path now uses TextProcessor uniformly, selecting tokenizer_type by architecture. |
| fastdeploy/input/ernie4_5_vl_processor/ernie4_5_vl_processor.py | Add tokenizer_type="ernie4_5" so the inherited decoding branch is selected correctly. |
| fastdeploy/input/ernie4_5_processor.py | Turn Ernie4_5Processor into a deprecated wrapper forwarding to TextProcessor(tokenizer_type='ernie4_5'). |
| fastdeploy/input/base_processor.py | Add the common base class BaseTextProcessor, implementing shared request/response handling and utility methods. |
Comments suppressed due to low confidence (1)
fastdeploy/input/text_processor.py:248
`DataProcessor` now inherits from `BaseTextProcessor`, and the `BaseTextProcessor.process_response_dict_*` methods assume `ids2tokens()` always returns a 3-tuple `(delta_text, previous_token_ids, previous_texts)`. However, the current `DataProcessor.ids2tokens()` still returns a single string in the `envs.FD_USE_HF_TOKENIZER=True` branch, which will break unpacking during streaming / non-streaming response handling (e.g. `delta_text, _, previous_texts = ...`). Consider letting `DataProcessor` reuse `BaseTextProcessor.ids2tokens()` directly (deleting the override), or changing the HF branch to return the same 3-tuple as the base class.
```python
class DataProcessor(BaseTextProcessor):
    def __init__(self, model_name_or_path, reasoning_parser_obj=None, tool_parser_obj=None):
        """
        Initializes the DecodeStatus object.
```
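To illustrate the contract the base class expects, here is a minimal, self-contained sketch of an incremental `ids2tokens()` that always returns the 3-tuple. `_FakeTokenizer`, `SketchProcessor`, and the per-task state dicts are made up for this example; real FastDeploy code would call the actual HF or ERNIE tokenizer.

```python
class _FakeTokenizer:
    """Stand-in tokenizer: maps each id to a letter, purely for illustration."""

    def decode(self, token_ids):
        return "".join(chr(ord("a") + (t % 26)) for t in token_ids)


class SketchProcessor:
    def __init__(self):
        self.tokenizer = _FakeTokenizer()
        self._prev_ids = {}   # task_id -> token ids decoded so far
        self._prev_text = {}  # task_id -> text decoded so far

    def ids2tokens(self, token_ids, task_id):
        """Always return (delta_text, previous_token_ids, previous_texts),
        matching what the base class unpacks in both response paths."""
        previous_ids = self._prev_ids.get(task_id, [])
        previous_text = self._prev_text.get(task_id, "")
        all_ids = previous_ids + list(token_ids)
        full_text = self.tokenizer.decode(all_ids)
        delta_text = full_text[len(previous_text):]
        self._prev_ids[task_id] = all_ids
        self._prev_text[task_id] = full_text
        return delta_text, previous_ids, previous_text
```

With this shape, `delta_text, _, previous_texts = processor.ids2tokens(...)` unpacks safely on every call, regardless of which tokenizer backend is in use.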
```
* tool_parser result never updates ``outputs["text"]``; only ``tool_calls`` is
  set. This matches DataProcessor behaviour.
```
The design note here is inconsistent with the actual implementation: the docs say the tool_parser result never updates outputs["text"], but process_response_dict_streaming() assigns outputs["text"] = tool_call_delta_message.content when tool_calls are parsed. Either update the documentation (distinguishing streaming from non-streaming behavior), or adjust the implementation to match the documented contract.
Suggested change:

```diff
-* tool_parser result never updates ``outputs["text"]``; only ``tool_calls`` is
-  set. This matches DataProcessor behaviour.
+* In non-streaming mode, tool_parser results never update ``outputs["text"]``;
+  only ``tool_calls`` is set, matching ``DataProcessor`` behaviour. In
+  streaming mode, ``process_response_dict_streaming()`` may temporarily set
+  ``outputs["text"]`` to the current ``tool_call_delta_message.content`` while
+  parsing tool calls.
```
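As a toy illustration of the streaming behavior this comment describes, the helper below shows `outputs["text"]` being overwritten with the parser's content delta. `apply_tool_call_delta_streaming` and the dict-shaped `tool_call_delta_message` are stand-ins invented for this sketch, not FastDeploy's real parser objects.

```python
def apply_tool_call_delta_streaming(outputs, tool_call_delta_message):
    """Streaming path: when a tool-call delta is parsed, mirror its
    content into outputs["text"] (the behavior the review points out)."""
    if tool_call_delta_message is not None:
        outputs["tool_calls"] = tool_call_delta_message["tool_calls"]
        # Note: this overwrites the decoded text with the parser's delta,
        # which the non-streaming path does not do.
        outputs["text"] = tool_call_delta_message["content"]
    return outputs
```

A doc fix would need to describe both branches, since only the non-streaming path leaves `outputs["text"]` untouched.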
```python
if token_id == think_start_id:
    started = True
    ended = False
    in_thinking = True
elif token_id == think_end_id and in_thinking:
    ended = True
    in_thinking = False
```
`tokens_after_start` is initialized to 0 but is never incremented while iterating over `token_list`, so `think_prompt_tokens_after_start` is always 0. This field is read by fastdeploy/model_executor/logits_processor/thinking_budget.py to update the reasoning-budget state, and a constant 0 distorts the budget logic. Consider counting non-marker tokens after `<think>` is seen, or removing the field to avoid misleading consumers.
Suggested change:

```diff
 if token_id == think_start_id:
+    # Enter thinking segment; do not count the start token itself
     started = True
     ended = False
     in_thinking = True
 elif token_id == think_end_id and in_thinking:
+    # Leave thinking segment; do not count the end token itself
     ended = True
     in_thinking = False
+elif in_thinking:
+    # Count tokens inside the thinking segment, excluding start/end markers
+    tokens_after_start += 1
```
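The suggested counting logic can be checked in isolation with a standalone sketch. The function name and the marker ids below are made up for the example; only the control flow mirrors the suggestion.

```python
def count_thinking_tokens(token_list, think_start_id, think_end_id):
    """Return (started, ended, tokens_after_start) for one pass over
    token_list, counting only tokens strictly between the markers."""
    started = ended = in_thinking = False
    tokens_after_start = 0
    for token_id in token_list:
        if token_id == think_start_id:
            # Enter thinking segment; the start marker itself is not counted
            started = True
            ended = False
            in_thinking = True
        elif token_id == think_end_id and in_thinking:
            # Leave thinking segment; the end marker itself is not counted
            ended = True
            in_thinking = False
        elif in_thinking:
            tokens_after_start += 1
    return started, ended, tokens_after_start
```

For example, with markers 100/101, the sequence `[1, 100, 5, 6, 7, 101, 9]` yields three counted tokens, while a sequence without markers yields zero.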
```python
tokenizer_type = "ernie4_5" if ErnieArchitectures.contains_ernie_arch(architecture) else "auto"
self.processor = TextProcessor(
    model_name_or_path=self.model_name_or_path,
    tokenizer_type=tokenizer_type,
    reasoning_parser_obj=reasoning_parser_obj,
    tool_parser_obj=tool_parser_obj,
)
```
This adds a branch that selects tokenizer_type (ernie4_5 vs auto) by architecture and routes through TextProcessor, but the current unit tests only cover the "non-Ernie, non-multimodal" path. Consider adding a case that covers an Ernie architecture (e.g. an ERNIE4.5-related architecture string), verifying that tokenizer_type is set to ernie4_5 and checking TextProcessor's constructor arguments.
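To make the missing branch concrete, here is a minimal sketch of the selection logic reduced to a pure function, so the Ernie path can be asserted directly. `pick_tokenizer_type` and its substring check are hypothetical stand-ins for `ErnieArchitectures.contains_ernie_arch()`; a real unit test would instead mock `TextProcessor` inside `InputPreprocessor.create_processor()` and inspect the call arguments.

```python
def pick_tokenizer_type(architecture: str) -> str:
    """Stand-in for the architecture check in preprocess.py:
    Ernie architectures get the ernie4_5 tokenizer path, everything
    else falls back to the HF auto path."""
    is_ernie = "Ernie" in architecture or "ERNIE" in architecture
    return "ernie4_5" if is_ernie else "auto"
```

A test for the real factory would cover both `pick`-style outcomes: an ERNIE4.5 architecture string mapping to `tokenizer_type="ernie4_5"`, and a non-Ernie one mapping to `"auto"`.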
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

```
@@           Coverage Diff            @@
##           develop    #7030   +/-   ##
==========================================
  Coverage         ?   73.21%
==========================================
  Files            ?      401
  Lines            ?    56322
  Branches         ?     8887
==========================================
  Hits             ?    41234
  Misses           ?    12183
  Partials         ?     2905
```
* merge text processor
* update
* fix unit test
* merge messages2ids
* fix unit test
* remove duplicate code
* remove redundant code
* delete code
* fix unit test
Motivation
Before this refactor, text processing had two independent inheritance chains (DataProcessor and Ernie4_5Processor). They duplicated large amounts of logic in response handling (process_response_dict, ids2tokens, update_stop_seq, etc.) and request handling (process_request_dict), differing only in tokenizer loading, tokenization strategy, and a few branch behaviors. As a result, the same bug had to be fixed in two places, the response-handling signatures were inconsistent (ERNIE passed stream as a positional argument, HF as a keyword argument), and new features (such as tool_parser and reasoning_parser) had to be adapted separately in both processors, raising maintenance cost.
Modifications
Add a new abstract base class, BaseTextProcessor(ABC), that extracts the duplicated response handling and shared utility methods from both inheritance chains into a single implementation, while declaring abstract interfaces such as _load_tokenizer, text2ids, and messages2ids. On top of that, add a TextProcessor(BaseTextProcessor) class that dispatches strategy on the key paths via the tokenizer_type parameter ("auto" / "ernie4_5"). Finally, update the factory in preprocess.py so the text branch uniformly instantiates TextProcessor; the original DataProcessor and Ernie4_5Processor are kept as deprecation aliases for backward compatibility.
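The class layout described above can be sketched schematically as follows. Class and method names follow the PR; every method body here is a placeholder invented for illustration, not the real FastDeploy implementation.

```python
from abc import ABC, abstractmethod


class BaseTextProcessor(ABC):
    """Shared request/response handling for all text processors."""

    def process_response_dict(self, response, stream=False, **kwargs):
        # Unified entry point; the real base class dispatches to shared
        # streaming / non-streaming handlers here.
        return response

    @abstractmethod
    def _load_tokenizer(self): ...

    @abstractmethod
    def text2ids(self, text): ...

    @abstractmethod
    def messages2ids(self, messages): ...


class TextProcessor(BaseTextProcessor):
    """Covers both tokenizer paths via the tokenizer_type strategy flag."""

    def __init__(self, model_name_or_path, tokenizer_type="auto", **kwargs):
        assert tokenizer_type in ("auto", "ernie4_5")
        self.model_name_or_path = model_name_or_path
        self.tokenizer_type = tokenizer_type

    def _load_tokenizer(self):
        # Strategy dispatch: the ernie4_5 path would load the ERNIE
        # tokenizer, the auto path the HF AutoTokenizer (placeholders here).
        if self.tokenizer_type == "ernie4_5":
            return "ernie-tokenizer"
        return "hf-auto-tokenizer"

    def text2ids(self, text):
        return list(text.encode())  # placeholder

    def messages2ids(self, messages):
        return []  # placeholder
```

The deprecation aliases then reduce to thin wrappers, e.g. `Ernie4_5Processor` forwarding to `TextProcessor(tokenizer_type="ernie4_5")`.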
Usage or Command
Accuracy Tests
Checklist
- Make sure your PR title starts with one of: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run `pre-commit` before commit.
- If the PR targets the `release` branch, make sure the PR has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.