Conversation

lizexu123 (Collaborator) commented Oct 13, 2025

This PR supports stop_token_ids, which allows specifying individual token IDs that signal the model to terminate the current generation process.

Related environment variables

  • FD_STOP_TOKEN_IDS_MAX_LEN: Maximum number of stop token IDs per request; default is 8
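If a request needs more stop token IDs than the default cap allows, the variable can be raised before launching. A minimal sketch, assuming the variable is read once at process startup (the value 16 is illustrative):

```python
import os

# Raise the per-request cap on stop token IDs from the default 8 to 16.
# FD_STOP_TOKEN_IDS_MAX_LEN must be set before the engine/server starts.
os.environ["FD_STOP_TOKEN_IDS_MAX_LEN"] = "16"
```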

Usage

online serving

  • Launch the server
  • Send a request with the stop_token_ids parameter, which accepts a List[int]
```bash
# create a chat request with the "stop_token_ids" parameter
curl -X POST "http://0.0.0.0:13312/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
    "model": "default",
    "messages": [
        {
            "role": "user",
            "content": "北京天安门在哪里?"
        }
    ],
    "temperature": 0.7,
    "stream": false,
    "seed": 1,
    "stop_token_ids":[104208]
}'

# the original output without `stop_token_ids` is: 
# {"id":"chatcmpl-33610f95-7d01-47a6-b040-39b18316f727","object":"chat.completion","created":1760692757,"model":"/root/paddlejob/workspace/env_run/output/models/paddle/Qwen/Qwen3-0.6B","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\n好的,用户问的是“北京天安门在哪里?”。首先,我需要确认用户的需求是什么。可能他们想知道天安门的具体位置,或者想了解它的重要性。接下来,我得回忆一下北京天安门广场的地理位置。天安门广场位于北京市中心,周围环绕着著名的胡同,比如大栅栏、小街等。用户可能对城市规划和地标建筑感兴趣,也可能是想了解天安门的历史和功能。\n\n用户可能没有明确说明他们的需求,但作为回答者,我需要确保信息准确且易于理解。天安门广场是北京的标志性建筑之一,周围有丰富的历史和文化元素。此外,用户可能还想知道天安门与周围其他景点的关系,比如人民广场、故宫等,这有助于提供更全面的回答。\n\n需要注意的是,用户可能对“哪里”这个词语有歧义,可能需要进一步澄清。但根据问题本身,直接回答地理位置是合适的。同时,保持回答简洁明了,避免使用过于专业的术语,让用户容易理解。\n</think>\n\n北京天安门广场位于中国北京市中心,是中华人民共和国的象征性建筑之一。广场周围环绕着著名的胡同,如大栅栏、小街等,是北京的城市地标和历史文化中心。","multimodal_content":null,"reasoning_content":null,"tool_calls":null,"prompt_token_ids":null,"completion_token_ids":null,"prompt_tokens":null,"completion_tokens":null},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":14,"total_tokens":276,"completion_tokens":262,"prompt_tokens_details":{"cached_tokens":0}}}

# the output with `stop_token_ids` is:
# {"id":"chatcmpl-51f772d0-e0d8-48da-9b5f-a3690849ffca","object":"chat.completion","created":1760692873,"model":"/root/paddlejob/workspace/env_run/output/models/paddle/Qwen/Qwen3-0.6B","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\n好的,用户问的是“北京天安门在哪里?”。首先,我需要确认用户的需求是什么。可能他们想知道天安门的具体位置,或者想了解它的重要性。接下来,我得回忆一下北京天安门","multimodal_content":null,"reasoning_content":null,"tool_calls":null,"prompt_token_ids":null,"completion_token_ids":null,"prompt_tokens":null,"completion_tokens":null},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":14,"total_tokens":64,"completion_tokens":50,"prompt_tokens_details":{"cached_tokens":0}}}```

offline demo

```python
from fastdeploy.engine.sampling_params import SamplingParams
from fastdeploy.entrypoints.llm import LLM

model_name_or_path = "/root/paddlejob/workspace/env_run/output/models/paddle/Qwen/Qwen3-0.6B"

# Sampling hyperparameters
sampling_params = SamplingParams(temperature=1, seed=1, stop_token_ids=[104208])
llm = LLM(model=model_name_or_path, tensor_parallel_size=1)
output = llm.chat(messages=[{"role": "user", "content": "北京天安门在哪里?"}], use_tqdm=True, sampling_params=sampling_params)

print(output)
```

PR Summary

This PR addresses three main improvements:

  1. Clarified the ambiguous usage of stop_seqs and stop_token_ids: previously, stop sequences were incorrectly passed to request.stop_token_ids; they are now passed to request.stop_seqs instead.
  2. Added support for stop_token_ids.
  3. Enhanced the stopping checks with min_tokens and max_tokens validation (see the sketch below): if a stop token ID is encountered before min_tokens is reached, generation continues until min_tokens; if a stop token ID would only trigger after max_tokens, generation stops at max_tokens instead.
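A minimal Python sketch of the resulting stopping rule, for illustration only; the real check lives in the CUDA kernel (custom_ops/gpu_ops/stop_generation_multi_ends.cu), and the function name and signature here are hypothetical:

```python
def should_stop(num_generated: int, token: int, stop_token_ids: list,
                min_tokens: int, max_tokens: int) -> bool:
    """Hypothetical per-step stop check mirroring points 2 and 3 above."""
    if num_generated < min_tokens:
        # Below min_tokens, no stop condition may fire yet.
        return False
    if num_generated >= max_tokens:
        # max_tokens always wins, even if no stop token was generated.
        return True
    # Otherwise, stop as soon as the current token is one of the stop IDs.
    return token in stop_token_ids
```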

paddle-bot bot commented Oct 13, 2025

Thanks for your contribution!

zoooo0820 (Collaborator) left a comment

Please also document the usage in both the Chinese and English versions of https://github.com/PaddlePaddle/FastDeploy/blob/develop/docs/features/early_stop.md

Must be in [0, 1]. Set to 0 to disable this.
seed: Random seed to use for the generation.
stop: list of strings that stop the generation when they are generated.
stop_seqs: list of strings that stop the generation when they are generated.
Collaborator:

Does this stop / stop_seqs change alter the externally exposed interface name?

next_tokens[bid] = end_ids[0];
topk_ids[bid] = end_ids[0];
return;
}
Collaborator:

Is the max_tokens limit here necessary? There should already be an existing check for max_tokens.

lizexu123 (Author):

This handles the case where a stop token ID would only fire beyond max_tokens: generation is then truncated at max_tokens. I think it is necessary.

// If haven't reached min_tokens, cannot stop for any reason
if (below_min_tokens) {
if (!beam_search && is_in_end(topk_ids[bid], end_ids, end_length)) {
return;
Collaborator:

When below min_tokens, shouldn't every case return directly?

lizexu123 (Author):

done

max_tokens: paddle.Tensor = None
"""
the maximum tokens that will be generated
"""
Collaborator:

Please check whether max_tokens is a necessary parameter here.

lizexu123 (Author):

Removed.

if stop_sequences is not None and len(stop_sequences) != 0:
stop_seqs, stop_seqs_len = self.update_stop_seq(stop_sequences)
request.set("stop_token_ids", stop_seqs)
request.set("stop_seqs", stop_seqs)
Collaborator:

Should this be unified with the other call sites, using either stop or stop_seqs everywhere?

lizexu123 (Author):

done

PD_CHECK(stop_flags.dtype() == paddle::DataType::BOOL);
bool prefill_one_step_stop = false;
if (const char *env_p = std::getenv("PREFILL_NODE_ONE_STEP_STOP")) {
// std::cout << "Your PATH is: " << env_p << '\n';
Collaborator:

Please also delete the leftover debug print code.

lizexu123 (Author):

done

Jiang-Jia-Jun (Collaborator):

The documentation needs additional explanation (e.g., of the parameters).

// If haven't reached min_tokens, cannot stop for any reason
if (below_min_tokens) {
if (!beam_search && is_in_end(topk_ids[bid], end_ids, end_length)) {
return;
Collaborator:

Doesn't this return whenever below_min_tokens holds?

lizexu123 (Author):

done

zoooo0820 (Collaborator) left a comment

LGTM

Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds support for stop_token_ids to enable specifying individual token IDs that signal the model to terminate generation. It also fixes a bug where stop sequences were incorrectly assigned to stop_token_ids instead of stop (or stop_seqs), and adds min_tokens validation to prevent premature stopping.

Key Changes:

  • Introduced new stop_token_ids parameter for specifying individual stop token IDs (distinct from stop sequences)
  • Fixed a bug where stop sequences were incorrectly passed to request.stop_token_ids instead of request.stop or request.stop_seqs (see the sketch below)
  • Added min_tokens enforcement to ensure generation continues until minimum token count is reached
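A minimal before/after sketch of that routing fix; the Request class below is a stand-in for FastDeploy's request object, used only to make the example self-contained:

```python
class Request:
    """Illustrative stand-in for FastDeploy's request object."""
    def __init__(self):
        self.fields = {}

    def set(self, key, value):
        self.fields[key] = value

def route_stop_params(request, stop_sequences, stop_token_ids):
    # Fixed routing: stop sequences and stop token IDs go to separate fields.
    if stop_sequences:
        request.set("stop_seqs", stop_sequences)       # was wrongly "stop_token_ids"
    if stop_token_ids:
        request.set("stop_token_ids", stop_token_ids)  # new in this PR

req = Request()
route_stop_params(req, stop_sequences=["###"], stop_token_ids=[104208])
print(req.fields)  # {'stop_seqs': ['###'], 'stop_token_ids': [104208]}
```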

Reviewed Changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.

Summary per file:

  • custom_ops/gpu_ops/stop_generation_multi_ends.cu: Core CUDA kernel implementing stop_token_ids checking and min_tokens validation
  • custom_ops/gpu_ops/cpp_extensions.cc: Updated function signature to include the new parameters
  • fastdeploy/envs.py: Added the FD_STOP_TOKEN_IDS_MAX_LEN environment variable (with an incorrect default reference)
  • fastdeploy/config.py: Added configuration for stop_token_ids_max_len with type casting
  • fastdeploy/worker/output.py: Added new fields for stop_seqs, stop_token_ids_len, and min_tokens
  • fastdeploy/worker/gpu_model_runner.py: Implemented stop_token_ids handling and fixed the stop_seqs bug
  • fastdeploy/model_executor/pre_and_post_process.py: Updated post-processing to pass the new parameters
  • fastdeploy/input/text_processor.py: Fixed bug: changed stop_seqs assignment from stop_token_ids to stop
  • fastdeploy/input/qwen_vl_processor/qwen_vl_processor.py: Fixed bug: changed stop_seqs assignment from stop_token_ids to stop
  • fastdeploy/input/ernie4_5_processor.py: Fixed bug: changed stop_seqs assignment from stop_token_ids to stop
  • fastdeploy/entrypoints/engine_client.py: Added explicit int() casting for environment variables
  • tests/operators/test_stop_generation_multi_ends.py: Added comprehensive tests for stop_token_ids and min_tokens
  • tests/ci_use/Qwen3-MoE/test_Qwen3-MoE_serving.py: Added an integration test for stop_token_ids
  • docs/zh/features/early_stop.md: Added Chinese documentation for the stop_token_ids feature
  • docs/features/early_stop.md: Added English documentation for the stop_token_ids feature

"FD_STOP_SEQS_MAX_LEN": lambda: int(os.getenv("FD_STOP_SEQS_MAX_LEN", "8")),
"FD_STOP_SEQS_MAX_LEN": lambda: os.getenv("FD_STOP_SEQS_MAX_LEN", "8"),
# Maximum length of stop token ids.
"FD_STOP_TOKEN_IDS_MAX_LEN": lambda: os.getenv("FD_STOP_SEQS_MAX_LEN", "8"),
Copilot AI commented Nov 14, 2025

The environment variable name is incorrect. It reads from FD_STOP_SEQS_MAX_LEN instead of FD_STOP_TOKEN_IDS_MAX_LEN. This will cause stop_token_ids to use the wrong environment variable, making it impossible to configure them separately from stop sequences. Change to: lambda: os.getenv("FD_STOP_TOKEN_IDS_MAX_LEN", "8")

Suggested change
"FD_STOP_TOKEN_IDS_MAX_LEN": lambda: os.getenv("FD_STOP_SEQS_MAX_LEN", "8"),
"FD_STOP_TOKEN_IDS_MAX_LEN": lambda: os.getenv("FD_STOP_TOKEN_IDS_MAX_LEN", "8"),

Before starting the service, set the following environment variables

```
FD_STOP_SEQS_MAX_LEN (Maximum length of stop sequences, default is 8)
```
Copilot AI commented Nov 14, 2025

The documentation mentions the wrong environment variable. For the stop_token_ids feature being documented in section 3, it should reference FD_STOP_TOKEN_IDS_MAX_LEN instead of FD_STOP_SEQS_MAX_LEN. These are two separate configuration parameters for different features.

Suggested change
FD_STOP_SEQS_MAX_LEN (Maximum length of stop sequences, default is 8)
FD_STOP_TOKEN_IDS_MAX_LEN (Maximum length of stop_token_ids, default is 8)
