[Feature] support stop_token_ids #4382
base: develop
Conversation
Thanks for your contribution!
zoooo0820 left a comment
fastdeploy/engine/sampling_params.py
Outdated
```diff
     Must be in [0, 1]. Set to 0 to disable this.
     seed: Random seed to use for the generation.
-    stop: list of strings that stop the generation when they are generated.
+    stop_seqs: list of strings that stop the generation when they are generated.
```
Does renaming stop to stop_seqs here change the externally exposed interface name?
```cpp
      next_tokens[bid] = end_ids[0];
      topk_ids[bid] = end_ids[0];
      return;
    }
```
Is this max_tokens check necessary? There should already be an existing limit on max_tokens.
This handles the case where a stop_token_ids match would only occur beyond max_tokens: generation is then cut off at max_tokens. I think the check is necessary.
```cpp
    // If haven't reached min_tokens, cannot stop for any reason
    if (below_min_tokens) {
      if (!beam_search && is_in_end(topk_ids[bid], end_ids, end_length)) {
        return;
```
If min_tokens has not been reached, shouldn't every case return directly, not just this one?
done
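To make the resolved behavior concrete, here is a minimal Python sketch of the guard discussed in this thread (function and variable names are illustrative; the real check lives in the CUDA kernel `custom_ops/gpu_ops/stop_generation_multi_ends.cu`):

```python
def may_stop(generated_len: int, min_tokens: int,
             token_id: int, stop_token_ids: list[int]) -> bool:
    # Before min_tokens is reached, no stop condition may fire at all.
    if generated_len < min_tokens:
        return False
    # Afterwards, stop only when the current token is a stop token.
    return token_id in stop_token_ids
```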
```python
    max_tokens: paddle.Tensor = None
    """
    the maximum tokens that will be generated
    """
```
Need to check whether max_tokens is actually a required parameter here.
Removed.
fastdeploy/input/text_processor.py
Outdated
```diff
 if stop_sequences is not None and len(stop_sequences) != 0:
     stop_seqs, stop_seqs_len = self.update_stop_seq(stop_sequences)
-    request.set("stop_token_ids", stop_seqs)
+    request.set("stop_seqs", stop_seqs)
```
Should this be unified with the other call sites, so everything uses either stop or stop_seqs?
done
```cpp
  PD_CHECK(stop_flags.dtype() == paddle::DataType::BOOL);
  bool prefill_one_step_stop = false;
  if (const char *env_p = std::getenv("PREFILL_NODE_ONE_STEP_STOP")) {
    // std::cout << "Your PATH is: " << env_p << '\n';
```
Please also remove the leftover commented-out debug print while you're at it.
done
The documentation needs additional explanation (e.g., of the new parameters).
```cpp
    // If haven't reached min_tokens, cannot stop for any reason
    if (below_min_tokens) {
      if (!beam_search && is_in_end(topk_ids[bid], end_ids, end_length)) {
        return;
```
Shouldn't this return whenever below_min_tokens holds, regardless of the other conditions?
done
zoooo0820 left a comment
LGTM
Pull Request Overview
This PR adds support for stop_token_ids to enable specifying individual token IDs that signal the model to terminate generation. It also fixes a bug where stop sequences were incorrectly assigned to stop_token_ids instead of stop (or stop_seqs), and adds min_tokens validation to prevent premature stopping.
Key Changes:
- Introduced a new `stop_token_ids` parameter for specifying individual stop token IDs (distinct from stop sequences)
- Fixed a bug where stop sequences were incorrectly passed to `request.stop_token_ids` instead of `request.stop` or `request.stop_seqs`
- Added `min_tokens` enforcement to ensure generation continues until the minimum token count is reached
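To illustrate how these changes fit together, here is a hedged offline sketch. It assumes FastDeploy exposes vLLM-style `LLM`/`SamplingParams` entry points; the model path and token IDs are placeholders, not values from this PR:

```python
# Hedged sketch: parameter names follow this PR's description; the API
# shape is an assumption about FastDeploy's offline interface.
from fastdeploy import LLM, SamplingParams

sampling_params = SamplingParams(
    max_tokens=128,      # hard cap on generated tokens
    min_tokens=4,        # no stop condition may fire before 4 tokens
    stop_token_ids=[2],  # individual token IDs that end generation
)
llm = LLM(model="./your_model_path")  # placeholder path
outputs = llm.generate(["Hello, world"], sampling_params)
```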
Reviewed Changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.
Summary per file:
| File | Description |
|---|---|
| `custom_ops/gpu_ops/stop_generation_multi_ends.cu` | Core CUDA kernel implementing stop_token_ids checking and min_tokens validation |
| `custom_ops/gpu_ops/cpp_extensions.cc` | Updated function signature to include new parameters |
| `fastdeploy/envs.py` | Added FD_STOP_TOKEN_IDS_MAX_LEN environment variable (with incorrect default reference) |
| `fastdeploy/config.py` | Added configuration for stop_token_ids_max_len with type casting |
| `fastdeploy/worker/output.py` | Added new fields for stop_seqs, stop_token_ids_len, and min_tokens |
| `fastdeploy/worker/gpu_model_runner.py` | Implemented stop_token_ids handling and fixed stop_seqs bug |
| `fastdeploy/model_executor/pre_and_post_process.py` | Updated post-processing to pass new parameters |
| `fastdeploy/input/text_processor.py` | Fixed bug: changed stop_seqs assignment from stop_token_ids to stop |
| `fastdeploy/input/qwen_vl_processor/qwen_vl_processor.py` | Fixed bug: changed stop_seqs assignment from stop_token_ids to stop |
| `fastdeploy/input/ernie4_5_processor.py` | Fixed bug: changed stop_seqs assignment from stop_token_ids to stop |
| `fastdeploy/entrypoints/engine_client.py` | Added explicit int() casting for environment variables |
| `tests/operators/test_stop_generation_multi_ends.py` | Added comprehensive tests for stop_token_ids and min_tokens |
| `tests/ci_use/Qwen3-MoE/test_Qwen3-MoE_serving.py` | Added integration test for stop_token_ids |
| `docs/zh/features/early_stop.md` | Added Chinese documentation for stop_token_ids feature |
| `docs/features/early_stop.md` | Added English documentation for stop_token_ids feature |
| "FD_STOP_SEQS_MAX_LEN": lambda: int(os.getenv("FD_STOP_SEQS_MAX_LEN", "8")), | ||
| "FD_STOP_SEQS_MAX_LEN": lambda: os.getenv("FD_STOP_SEQS_MAX_LEN", "8"), | ||
| # Maximum length of stop token ids. | ||
| "FD_STOP_TOKEN_IDS_MAX_LEN": lambda: os.getenv("FD_STOP_SEQS_MAX_LEN", "8"), |
Copilot AI commented on Nov 14, 2025
The environment variable name is incorrect. It reads from FD_STOP_SEQS_MAX_LEN instead of FD_STOP_TOKEN_IDS_MAX_LEN. This will cause stop_token_ids to use the wrong environment variable, making it impossible to configure them separately from stop sequences. Change to: lambda: os.getenv("FD_STOP_TOKEN_IDS_MAX_LEN", "8")
| "FD_STOP_TOKEN_IDS_MAX_LEN": lambda: os.getenv("FD_STOP_SEQS_MAX_LEN", "8"), | |
| "FD_STOP_TOKEN_IDS_MAX_LEN": lambda: os.getenv("FD_STOP_TOKEN_IDS_MAX_LEN", "8"), |
Before starting the service, set the following environment variables:

```
FD_STOP_SEQS_MAX_LEN (Maximum length of stop sequences, default is 8)
```
Copilot AI commented on Nov 14, 2025
The documentation mentions the wrong environment variable. For the stop_token_ids feature being documented in section 3, it should reference FD_STOP_TOKEN_IDS_MAX_LEN instead of FD_STOP_SEQS_MAX_LEN. These are two separate configuration parameters for different features.
Suggested change:

```diff
-FD_STOP_SEQS_MAX_LEN (Maximum length of stop sequences, default is 8)
+FD_STOP_TOKEN_IDS_MAX_LEN (Maximum length of stop_token_ids, default is 8)
```
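For completeness, a hedged example of setting both limits from Python before launching the service in-process (equivalent to exporting them in the shell); the values shown are just the documented defaults:

```python
import os

# The two variables bound different features, so each may need tuning:
os.environ["FD_STOP_SEQS_MAX_LEN"] = "8"       # max length of each stop sequence
os.environ["FD_STOP_TOKEN_IDS_MAX_LEN"] = "8"  # max number of stop token IDs
```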
This PR supports stop_token_ids, which allows specifying individual token IDs that signal the model to terminate the current generation process.

Related environment variables:
- `FD_STOP_TOKEN_IDS_MAX_LEN`: Maximum number of stop token IDs, default is 8

Usage:
- Online serving: pass the `stop_token_ids` parameter; it can be a `List[int]` (see the request sketch below).
- Offline demo
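A hedged online-serving sketch, assuming FastDeploy exposes an OpenAI-compatible endpoint; the URL, model name, and token IDs below are placeholders:

```python
# Hedged sketch: uses the openai client's extra_body to pass the
# non-standard stop_token_ids field described in this PR.
import openai

client = openai.OpenAI(base_url="http://localhost:8180/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="default",  # placeholder model name
    messages=[{"role": "user", "content": "Hello"}],
    extra_body={"stop_token_ids": [2]},  # List[int] of terminating token IDs
)
print(resp.choices[0].message.content)
```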
PR Summary

This PR addresses three main improvements:
1. Adds a new `stop_token_ids` parameter so individual token IDs can terminate generation.
2. Fixes a bug where stop sequences were assigned to `stop_token_ids` instead of `stop`/`stop_seqs`.
3. Enforces `min_tokens` so no stop condition can fire before the minimum token count is reached.

If a `stop_token_ids` match would only occur after `max_tokens`, generation stops at `max_tokens` instead (see the sketch below).
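A minimal Python sketch of that clamping rule (names are illustrative; the actual logic lives in the CUDA stop kernel):

```python
def effective_stop(step_idx: int, max_tokens: int,
                   token_id: int, stop_token_ids: list[int]) -> bool:
    # The max_tokens cap always wins: even if no stop token has appeared,
    # generation ends once max_tokens tokens have been produced.
    if step_idx + 1 >= max_tokens:
        return True
    return token_id in stop_token_ids
```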