Conversation

lizexu123 (Collaborator) commented Oct 13, 2025

This PR supports stop_token_ids, which allows specifying individual token IDs that signal the model to terminate the current generation process.

Related environment variables

  • FD_STOP_TOKEN_IDS_MAX_LEN: Maximum number of stop token IDs per request; default is 8
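If a request needs more stop token IDs than the default cap allows, the variable can be raised before launching. A minimal sketch, assuming the variable is read once at process startup (the value 16 is illustrative):

```python
import os

# Raise the per-request cap on stop token IDs from the default 8 to 16.
# FD_STOP_TOKEN_IDS_MAX_LEN must be set before the engine/server starts.
os.environ["FD_STOP_TOKEN_IDS_MAX_LEN"] = "16"
```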

Usage

online serving

  • Launch the server
  • Send a request with the stop_token_ids parameter, which accepts a List[int]
```bash
# create a chat request with the "stop_token_ids" parameter
curl -X POST "http://0.0.0.0:13312/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
    "model": "default",
    "messages": [
        {
            "role": "user",
            "content": "北京天安门在哪里?"
        }
    ],
    "temperature": 0.7,
    "stream": false,
    "seed": 1,
    "stop_token_ids":[104208]
}'

# the original output without `stop_token_ids` is: 
# {"id":"chatcmpl-33610f95-7d01-47a6-b040-39b18316f727","object":"chat.completion","created":1760692757,"model":"/root/paddlejob/workspace/env_run/output/models/paddle/Qwen/Qwen3-0.6B","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\n好的,用户问的是“北京天安门在哪里?”。首先,我需要确认用户的需求是什么。可能他们想知道天安门的具体位置,或者想了解它的重要性。接下来,我得回忆一下北京天安门广场的地理位置。天安门广场位于北京市中心,周围环绕着著名的胡同,比如大栅栏、小街等。用户可能对城市规划和地标建筑感兴趣,也可能是想了解天安门的历史和功能。\n\n用户可能没有明确说明他们的需求,但作为回答者,我需要确保信息准确且易于理解。天安门广场是北京的标志性建筑之一,周围有丰富的历史和文化元素。此外,用户可能还想知道天安门与周围其他景点的关系,比如人民广场、故宫等,这有助于提供更全面的回答。\n\n需要注意的是,用户可能对“哪里”这个词语有歧义,可能需要进一步澄清。但根据问题本身,直接回答地理位置是合适的。同时,保持回答简洁明了,避免使用过于专业的术语,让用户容易理解。\n</think>\n\n北京天安门广场位于中国北京市中心,是中华人民共和国的象征性建筑之一。广场周围环绕着著名的胡同,如大栅栏、小街等,是北京的城市地标和历史文化中心。","multimodal_content":null,"reasoning_content":null,"tool_calls":null,"prompt_token_ids":null,"completion_token_ids":null,"prompt_tokens":null,"completion_tokens":null},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":14,"total_tokens":276,"completion_tokens":262,"prompt_tokens_details":{"cached_tokens":0}}}

# the output with `stop_token_ids` is:
# {"id":"chatcmpl-51f772d0-e0d8-48da-9b5f-a3690849ffca","object":"chat.completion","created":1760692873,"model":"/root/paddlejob/workspace/env_run/output/models/paddle/Qwen/Qwen3-0.6B","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\n好的,用户问的是“北京天安门在哪里?”。首先,我需要确认用户的需求是什么。可能他们想知道天安门的具体位置,或者想了解它的重要性。接下来,我得回忆一下北京天安门","multimodal_content":null,"reasoning_content":null,"tool_calls":null,"prompt_token_ids":null,"completion_token_ids":null,"prompt_tokens":null,"completion_tokens":null},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":14,"total_tokens":64,"completion_tokens":50,"prompt_tokens_details":{"cached_tokens":0}}}```

offline demo

```python
from fastdeploy.engine.sampling_params import SamplingParams
from fastdeploy.entrypoints.llm import LLM

model_name_or_path = "/root/paddlejob/workspace/env_run/output/models/paddle/Qwen/Qwen3-0.6B"

# Sampling hyperparameters
sampling_params = SamplingParams(temperature=1, seed=1, stop_token_ids=[104208])
llm = LLM(model=model_name_or_path, tensor_parallel_size=1)
output = llm.chat(messages=[{"role": "user", "content": "北京天安门在哪里?"}], use_tqdm=True, sampling_params=sampling_params)

print(output)
```

PR Summary

This PR addresses three main improvements:

  1. Clarified the ambiguous usage of stop_seqs and stop_token_ids: previously, stop sequences were incorrectly passed to request.stop_token_ids; they are now passed to request.stop_seqs instead.
  2. Added support for stop_token_ids.
  3. Enhanced the stopping checks with min_tokens and max_tokens validation (see the sketch below): if a stop token ID is encountered before min_tokens is reached, generation continues until min_tokens; if a stop token ID would only trigger after max_tokens, generation stops at max_tokens instead.
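A minimal Python sketch of the resulting stopping rule, for illustration only; the real check lives in the CUDA kernel (custom_ops/gpu_ops/stop_generation_multi_ends.cu), and the function name and signature here are hypothetical:

```python
def should_stop(num_generated: int, token: int, stop_token_ids: list,
                min_tokens: int, max_tokens: int) -> bool:
    """Hypothetical per-step stop check mirroring points 2 and 3 above."""
    if num_generated < min_tokens:
        # Below min_tokens, no stop condition may fire yet.
        return False
    if num_generated >= max_tokens:
        # max_tokens always wins, even if no stop token was generated.
        return True
    # Otherwise, stop as soon as the current token is one of the stop IDs.
    return token in stop_token_ids
```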

paddle-bot bot commented Oct 13, 2025

Thanks for your contribution!

zoooo0820 (Collaborator) left a comment

Please also document the usage in both the Chinese and English versions of https://github.com/PaddlePaddle/FastDeploy/blob/develop/docs/features/early_stop.md

Must be in [0, 1]. Set to 0 to disable this.
seed: Random seed to use for the generation.
stop: list of strings that stop the generation when they are generated.
stop_seqs: list of strings that stop the generation when they are generated.
Collaborator:

Does this stop / stop_seqs change alter the externally exposed interface name?

next_tokens[bid] = end_ids[0];
topk_ids[bid] = end_ids[0];
return;
}
Collaborator:

Is the max_tokens limit here necessary? There should already be an existing check for max_tokens.

lizexu123 (Author):

This handles the case where a stop token ID would only fire beyond max_tokens: generation is then truncated at max_tokens. I think it is necessary.

// If haven't reached min_tokens, cannot stop for any reason
if (below_min_tokens) {
if (!beam_search && is_in_end(topk_ids[bid], end_ids, end_length)) {
return;
Collaborator:

When below min_tokens, shouldn't every case return directly?

lizexu123 (Author):

done

max_tokens: paddle.Tensor = None
"""
the maximum tokens that will be generated
"""
Collaborator:

Please check whether max_tokens is a necessary parameter here.

lizexu123 (Author):

Removed.

if stop_sequences is not None and len(stop_sequences) != 0:
stop_seqs, stop_seqs_len = self.update_stop_seq(stop_sequences)
request.set("stop_token_ids", stop_seqs)
request.set("stop_seqs", stop_seqs)
Collaborator:

Should this be unified with the other call sites, using either stop or stop_seqs everywhere?

lizexu123 (Author):

done

PD_CHECK(stop_flags.dtype() == paddle::DataType::BOOL);
bool prefill_one_step_stop = false;
if (const char *env_p = std::getenv("PREFILL_NODE_ONE_STEP_STOP")) {
// std::cout << "Your PATH is: " << env_p << '\n';
Collaborator:

Please also delete the leftover debug print code.

lizexu123 (Author):

done

Jiang-Jia-Jun (Collaborator):

The documentation needs additional explanation (e.g., of the parameters).

// If haven't reached min_tokens, cannot stop for any reason
if (below_min_tokens) {
if (!beam_search && is_in_end(topk_ids[bid], end_ids, end_length)) {
return;
Collaborator:

Doesn't this return whenever below_min_tokens holds?

lizexu123 (Author):

done

zoooo0820 (Collaborator) left a comment

LGTM

Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds support for stop_token_ids to enable specifying individual token IDs that signal the model to terminate generation. It also fixes a bug where stop sequences were incorrectly assigned to stop_token_ids instead of stop (or stop_seqs), and adds min_tokens validation to prevent premature stopping.

Key Changes:

  • Introduced new stop_token_ids parameter for specifying individual stop token IDs (distinct from stop sequences)
  • Fixed a bug where stop sequences were incorrectly passed to request.stop_token_ids instead of request.stop or request.stop_seqs (see the sketch below)
  • Added min_tokens enforcement to ensure generation continues until minimum token count is reached
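A minimal before/after sketch of that routing fix; the Request class below is a stand-in for FastDeploy's request object, used only to make the example self-contained:

```python
class Request:
    """Illustrative stand-in for FastDeploy's request object."""
    def __init__(self):
        self.fields = {}

    def set(self, key, value):
        self.fields[key] = value

def route_stop_params(request, stop_sequences, stop_token_ids):
    # Fixed routing: stop sequences and stop token IDs go to separate fields.
    if stop_sequences:
        request.set("stop_seqs", stop_sequences)       # was wrongly "stop_token_ids"
    if stop_token_ids:
        request.set("stop_token_ids", stop_token_ids)  # new in this PR

req = Request()
route_stop_params(req, stop_sequences=["###"], stop_token_ids=[104208])
print(req.fields)  # {'stop_seqs': ['###'], 'stop_token_ids': [104208]}
```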

Reviewed Changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.

Summary per file:

  • custom_ops/gpu_ops/stop_generation_multi_ends.cu: Core CUDA kernel implementing stop_token_ids checking and min_tokens validation
  • custom_ops/gpu_ops/cpp_extensions.cc: Updated function signature to include the new parameters
  • fastdeploy/envs.py: Added the FD_STOP_TOKEN_IDS_MAX_LEN environment variable (with an incorrect default reference)
  • fastdeploy/config.py: Added configuration for stop_token_ids_max_len with type casting
  • fastdeploy/worker/output.py: Added new fields for stop_seqs, stop_token_ids_len, and min_tokens
  • fastdeploy/worker/gpu_model_runner.py: Implemented stop_token_ids handling and fixed the stop_seqs bug
  • fastdeploy/model_executor/pre_and_post_process.py: Updated post-processing to pass the new parameters
  • fastdeploy/input/text_processor.py: Fixed bug: changed stop_seqs assignment from stop_token_ids to stop
  • fastdeploy/input/qwen_vl_processor/qwen_vl_processor.py: Fixed bug: changed stop_seqs assignment from stop_token_ids to stop
  • fastdeploy/input/ernie4_5_processor.py: Fixed bug: changed stop_seqs assignment from stop_token_ids to stop
  • fastdeploy/entrypoints/engine_client.py: Added explicit int() casting for environment variables
  • tests/operators/test_stop_generation_multi_ends.py: Added comprehensive tests for stop_token_ids and min_tokens
  • tests/ci_use/Qwen3-MoE/test_Qwen3-MoE_serving.py: Added an integration test for stop_token_ids
  • docs/zh/features/early_stop.md: Added Chinese documentation for the stop_token_ids feature
  • docs/features/early_stop.md: Added English documentation for the stop_token_ids feature

"FD_STOP_SEQS_MAX_LEN": lambda: int(os.getenv("FD_STOP_SEQS_MAX_LEN", "8")),
"FD_STOP_SEQS_MAX_LEN": lambda: os.getenv("FD_STOP_SEQS_MAX_LEN", "8"),
# Maximum length of stop token ids.
"FD_STOP_TOKEN_IDS_MAX_LEN": lambda: os.getenv("FD_STOP_SEQS_MAX_LEN", "8"),
Copilot AI commented Nov 14, 2025

The environment variable name is incorrect. It reads from FD_STOP_SEQS_MAX_LEN instead of FD_STOP_TOKEN_IDS_MAX_LEN. This will cause stop_token_ids to use the wrong environment variable, making it impossible to configure them separately from stop sequences. Change to: lambda: os.getenv("FD_STOP_TOKEN_IDS_MAX_LEN", "8")

Suggested change
"FD_STOP_TOKEN_IDS_MAX_LEN": lambda: os.getenv("FD_STOP_SEQS_MAX_LEN", "8"),
"FD_STOP_TOKEN_IDS_MAX_LEN": lambda: os.getenv("FD_STOP_TOKEN_IDS_MAX_LEN", "8"),

Before starting the service, set the following environment variables

```
FD_STOP_SEQS_MAX_LEN (Maximum length of stop sequences, default is 8)
```
Copilot AI commented Nov 14, 2025

The documentation mentions the wrong environment variable. For the stop_token_ids feature being documented in section 3, it should reference FD_STOP_TOKEN_IDS_MAX_LEN instead of FD_STOP_SEQS_MAX_LEN. These are two separate configuration parameters for different features.

Suggested change
FD_STOP_SEQS_MAX_LEN (Maximum length of stop sequences, default is 8)
FD_STOP_TOKEN_IDS_MAX_LEN (Maximum length of stop_token_ids, default is 8)
