
fix: fix returning the first token twice #3241


Closed
Shunkangz wants to merge 2 commits from the fix_double_first_gen_token branch

Conversation

@Shunkangz (Collaborator) commented Apr 2, 2025

In PD (prefill/decode disaggregation), the overlap and non-overlap schedulers behave differently. With the non-overlap scheduler, we always return the first two generated tokens together. With the overlap scheduler, the request might return a response before the second token has been generated. This MR fixes the change from #2986 for the overlap scheduler case.
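For context, a minimal sketch of the intended behavior (the function and variable names below are illustrative, not the actual executor code): with the overlap scheduler, a generation request that has finished only its first decoding iteration is kept active rather than answered, so its first response carries the first two generated tokens together, matching the non-overlap scheduler.

from enum import Enum, auto

class LlmRequestState(Enum):
    GENERATION_IN_PROGRESS = auto()
    GENERATION_COMPLETE = auto()

def split_requests(active_requests):
    # Requests still on their first decoding iteration have produced only one
    # token; defer them so the response is not sent before the second token
    # exists (and the first token is not reported twice).
    still_active, ready = [], []
    for request in active_requests:
        if (request.state == LlmRequestState.GENERATION_IN_PROGRESS
                and request.py_decoding_iter == 1):
            still_active.append(request)
        else:
            ready.append(request)
    return still_active, ready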

@Shunkangz Shunkangz requested review from kaiyux and Shixiaowei02 April 2, 2025 13:51
@Shunkangz Shunkangz self-assigned this Apr 2, 2025
@Shunkangz (Collaborator Author)

/bot run --add-multi-gpu-test

@tensorrt-cicd (Collaborator)

PR_Github #1022 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #1022 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #787 completed with status: 'FAILURE'

@Shunkangz (Collaborator Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #1336 [ run ] triggered by Bot

@Shunkangz Shunkangz force-pushed the fix_double_first_gen_token branch 2 times, most recently from df5c8c7 to 8fdb7a8 Compare April 7, 2025 15:31
@Shunkangz (Collaborator Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #1340 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #1336 [ run ] completed with state ABORTED

@tensorrt-cicd (Collaborator)

PR_Github #1340 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #1005 completed with status: 'FAILURE'

Comment on lines 1806 to 1831
if request.state == LlmRequestState.GENERATION_IN_PROGRESS:
    if request.py_decoding_iter == 1:
        # Only the first token has been generated; keep the request active
        # instead of returning a response for it yet.
        new_active_requests.append(request)
        continue
Member

Can we modify the condition here to return a nullptr when the request is generation-only but has generated only one token? I think that might be cleaner.
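A rough sketch of that alternative on the Python side (None standing in for nullptr; create_response, is_generation_only_request, and build_response are hypothetical names, not the actual API):

def create_response(request):
    # Generation-only request with just one token so far: no response yet.
    if request.is_generation_only_request and request.py_decoding_iter == 1:
        return None
    return build_response(request)  # assumed helper that packs the output tokens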

Collaborator

Iman, I also think this would be cleaner, but we might need to change the C++ disaggregated examples so that they extract tokens from the context response. It doesn't look like we're doing that right now:

https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/cpp/executor/executorExampleDisaggregated.cpp#L259-L266

https://github.com/NVIDIA/TensorRT-LLM/blob/main/benchmarks/cpp/disaggServerBenchmark.cpp

Collaborator Author

Thanks for the suggestion. I will modify this.

@Shunkangz Shunkangz force-pushed the fix_double_first_gen_token branch 2 times, most recently from f1bd653 to ed97ba3 Compare April 8, 2025 04:46
@Shunkangz (Collaborator Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #1405 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #1405 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #1053 completed with status: 'FAILURE'

@Shunkangz Shunkangz force-pushed the fix_double_first_gen_token branch from ed97ba3 to 7331397 Compare April 8, 2025 05:29
@Shunkangz (Collaborator Author)

/bot run --add-multi-gpu-test

@tensorrt-cicd (Collaborator)

PR_Github #1409 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #1409 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #1058 completed with status: 'FAILURE'

@Shunkangz (Collaborator Author)

/bot run --add-multi-gpu-test

@tensorrt-cicd (Collaborator)

PR_Github #1544 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #1544 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #1154 completed with status: 'FAILURE'

@Shunkangz (Collaborator Author)

/bot run --add-multi-gpu-test

@tensorrt-cicd (Collaborator)

PR_Github #1591 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #1591 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #1191 completed with status: 'FAILURE'

@Shunkangz (Collaborator Author)

/bot run --add-multi-gpu-test

Shunkang added 2 commits April 9, 2025 20:20

Add double first token check
Signed-off-by: Shunkang <[email protected]>

Adapt for py_decoding_iter
Signed-off-by: Shunkang <[email protected]>

Roll back CI change
Signed-off-by: Shunkang <[email protected]>

Add check
Signed-off-by: Shunkang <[email protected]>
Signed-off-by: Shunkang <[email protected]>
@Shunkangz Shunkangz force-pushed the fix_double_first_gen_token branch from cc12e50 to 1df2674 Compare April 9, 2025 12:20
@Shunkangz (Collaborator Author)

/bot run --add-multi-gpu-test

@tensorrt-cicd (Collaborator)

PR_Github #1611 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #1612 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #1611 [ run ] completed with state ABORTED

@tensorrt-cicd (Collaborator)

PR_Github #1612 [ run ] completed with state FAILURE

@Shunkangz (Collaborator Author)

/bot run --add-multi-gpu-test

@tensorrt-cicd (Collaborator)

PR_Github #1613 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #1613 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #1207 completed with status: 'FAILURE'

Comment on lines +182 to 184
# Disaggregated request whose output holds exactly two tokens: report only
# the last one, since the first token was already delivered.
if self.disaggregated_params is not None and \
        len(response_tensors.output_token_ids[src_idx]) == 2:
    output._last_token_ids_len = 1
Collaborator

I don't think we should rely on the length of output_token_ids, since with speculative decoding we could have 2 tokens even after the first generated token. Can you have a look at #3427? I think it's a more general fix.
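To illustrate the concern with hypothetical token IDs: a speculative-decoding step that accepts two draft tokens produces the same length-2 list as the first generation step, so the length check alone is ambiguous:

# First generation step in PD: the first token plus the newly computed second.
first_step_tokens = [101, 102]   # len == 2: the case the check targets
# A later speculative-decoding step accepting two draft tokens:
spec_step_tokens = [257, 981]    # len == 2 as well, but not the first step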

Collaborator Author

Thank you. I also think that is the better solution. I will close this PR.

@Shunkangz Shunkangz closed this Apr 10, 2025