
Conversation

@lizexu123 (Collaborator) commented Oct 13, 2025

This PR supports stop_token_ids, which allows specifying individual token IDs that signal the model to terminate the current generation process.

Related environment variables

  • FD_STOP_TOKEN_IDS_MAX_LEN: maximum number of stop token IDs accepted per request; defaults to 8. A sketch of overriding it follows below.
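A minimal sketch of raising the cap, assuming FastDeploy reads the variable at engine/process initialization (not confirmed by this PR):

```python
import os

# Assumption: FD_STOP_TOKEN_IDS_MAX_LEN is read when the engine initializes,
# so it must be set before the LLM or API server process is created.
os.environ["FD_STOP_TOKEN_IDS_MAX_LEN"] = "16"  # allow up to 16 stop token IDs
```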

Usage

online serving

  • launch the server
  • send a request with the stop_token_ids parameter, which accepts a List[int]
# create a chat request with "stop_token_ids" parameter
curl -X POST "http://0.0.0.0:13312/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{
    "model": "default",
    "messages": [
        {
            "role": "user",
            "content": "北京天安门在哪里?"
        }
    ],
    "temperature": 0.7,
    "stream": false,
    "seed": 1,
    "stop_token_ids":[104208]
}'

# the original output without `stop_token_ids` is: 
# {"id":"chatcmpl-33610f95-7d01-47a6-b040-39b18316f727","object":"chat.completion","created":1760692757,"model":"/root/paddlejob/workspace/env_run/output/models/paddle/Qwen/Qwen3-0.6B","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\n好的,用户问的是“北京天安门在哪里?”。首先,我需要确认用户的需求是什么。可能他们想知道天安门的具体位置,或者想了解它的重要性。接下来,我得回忆一下北京天安门广场的地理位置。天安门广场位于北京市中心,周围环绕着著名的胡同,比如大栅栏、小街等。用户可能对城市规划和地标建筑感兴趣,也可能是想了解天安门的历史和功能。\n\n用户可能没有明确说明他们的需求,但作为回答者,我需要确保信息准确且易于理解。天安门广场是北京的标志性建筑之一,周围有丰富的历史和文化元素。此外,用户可能还想知道天安门与周围其他景点的关系,比如人民广场、故宫等,这有助于提供更全面的回答。\n\n需要注意的是,用户可能对“哪里”这个词语有歧义,可能需要进一步澄清。但根据问题本身,直接回答地理位置是合适的。同时,保持回答简洁明了,避免使用过于专业的术语,让用户容易理解。\n</think>\n\n北京天安门广场位于中国北京市中心,是中华人民共和国的象征性建筑之一。广场周围环绕着著名的胡同,如大栅栏、小街等,是北京的城市地标和历史文化中心。","multimodal_content":null,"reasoning_content":null,"tool_calls":null,"prompt_token_ids":null,"completion_token_ids":null,"prompt_tokens":null,"completion_tokens":null},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":14,"total_tokens":276,"completion_tokens":262,"prompt_tokens_details":{"cached_tokens":0}}}

# the output with `stop_token_ids` is:
# {"id":"chatcmpl-51f772d0-e0d8-48da-9b5f-a3690849ffca","object":"chat.completion","created":1760692873,"model":"/root/paddlejob/workspace/env_run/output/models/paddle/Qwen/Qwen3-0.6B","choices":[{"index":0,"message":{"role":"assistant","content":"<think>\n好的,用户问的是“北京天安门在哪里?”。首先,我需要确认用户的需求是什么。可能他们想知道天安门的具体位置,或者想了解它的重要性。接下来,我得回忆一下北京天安门","multimodal_content":null,"reasoning_content":null,"tool_calls":null,"prompt_token_ids":null,"completion_token_ids":null,"prompt_tokens":null,"completion_tokens":null},"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":14,"total_tokens":64,"completion_tokens":50,"prompt_tokens_details":{"cached_tokens":0}}}```

offline demo

from fastdeploy.engine.sampling_params import SamplingParams
from fastdeploy.entrypoints.llm import LLM

model_name_or_path = "/root/paddlejob/workspace/env_run/output/models/paddle/Qwen/Qwen3-0.6B"

# Sampling hyperparameters
sampling_params = SamplingParams(temperature=1, seed=1, stop_token_ids=[104208])
llm = LLM(model=model_name_or_path, tensor_parallel_size=1)
output = llm.chat(messages=[{"role": "user", "content": "北京天安门在哪里?"}], use_tqdm=True, sampling_params=sampling_params)

print(output)
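As a side note, one way to verify what a candidate stop token ID decodes to is to inspect it with the tokenizer for the same checkpoint. A minimal sketch using the Hugging Face tokenizer (FastDeploy's own tokenizer API is not shown in this PR, so this dependency is an assumption):

```python
# Sketch only: load the Hugging Face tokenizer for the same Qwen3-0.6B
# checkpoint and decode the stop token ID used in the examples above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
print(tokenizer.decode([104208]))  # the token that truncates the answer above
```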

PR Summary

This PR addresses three main improvements:

  1. Clarified the ambiguous usage of stop_seqs and stop_token_ids: previously, stop_seqs was incorrectly passed to request.stop_token_ids; it is now passed to request.stop_seqs instead.
  2. Added support for stop_token_ids.
  3. Enhanced the stopping checks with min_tokens and max_tokens validation (see the sketch after this list): if a stop token ID is encountered before min_tokens is reached, generation continues until min_tokens is reached; if a stop token ID would only trigger after max_tokens, generation stops at max_tokens instead.
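A minimal Python sketch of the precedence described in item 3; the actual check in this PR lives in the C++ stop-checking kernel, so the function and parameter names here are illustrative only:

```python
def should_stop(token_id: int, generated_len: int,
                stop_token_ids: set[int],
                min_tokens: int, max_tokens: int) -> bool:
    """Illustrative only: mirrors the stopping precedence described above."""
    if generated_len >= max_tokens:    # max_tokens always terminates generation
        return True
    if generated_len < min_tokens:     # too early to stop for any reason
        return False
    return token_id in stop_token_ids  # otherwise stop on a matching token ID
```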

paddle-bot commented Oct 13, 2025

Thanks for your contribution!

@zoooo0820 (Collaborator) left a comment

Please also add usage notes to both the Chinese and English versions of https://github.com/PaddlePaddle/FastDeploy/blob/develop/docs/features/early_stop.md

Must be in [0, 1]. Set to 0 to disable this.
seed: Random seed to use for the generation.
stop: list of strings that stop the generation when they are generated.
stop_seqs: list of strings that stop the generation when they are generated.
Collaborator: Does changing stop to stop_seqs here alter the externally exposed interface name?

next_tokens[bid] = end_ids[0];
topk_ids[bid] = end_ids[0];
return;
}
Collaborator: Is the max_tokens limit here necessary? There should already be an existing check for max_tokens.

// If haven't reached min_tokens, cannot stop for any reason
if (below_min_tokens) {
if (!beam_search && is_in_end(topk_ids[bid], end_ids, end_length)) {
return;
Collaborator: When min_tokens has not been reached, should all cases simply return directly?

max_tokens: paddle.Tensor = None
"""
the maximum tokens that will be generated
"""
Collaborator: Please check whether max_tokens is a necessary parameter.
