fix: fix return double first token #3241
Conversation
/bot run --add-multi-gpu-test
PR_Github #1022 [ run ] triggered by Bot
PR_Github #1022 [ run ] completed with state
Force-pushed from eb8ebe0 to 399bc94 (compare)
/bot run
PR_Github #1336 [ run ] triggered by Bot
Force-pushed from df5c8c7 to 8fdb7a8 (compare)
/bot run
PR_Github #1340 [ run ] triggered by Bot
PR_Github #1336 [ run ] completed with state
PR_Github #1340 [ run ] completed with state
if request.state == LlmRequestState.GENERATION_IN_PROGRESS:
    if request.py_decoding_iter == 1:
        new_active_requests.append(request)
        continue
Can we modify the condition here to return a nullptr when the request is generation only but has created only one token? I think that might be cleaner.
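A rough sketch of what that could look like on the Python side (the helper and attribute names below are illustrative stand-ins, not the actual API): the check moves into the response-creation path and simply suppresses the response instead of filtering requests in the scheduler loop.

# Sketch only: suppress the response for a generation-only request that has
# produced just its first token, since the context phase already returned it.
# is_generation_only and create_response are hypothetical names.
def maybe_create_response(request):
    if request.is_generation_only and request.py_decoding_iter == 1:
        return None  # analogous to returning nullptr on the C++ side
    return request.create_response()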
Iman, I also think this would be cleaner, but we might need to change the C++ disaggregated examples so that we extract tokens from the context response. It doesn't look like we're doing that right now:
https://github.com/NVIDIA/TensorRT-LLM/blob/main/benchmarks/cpp/disaggServerBenchmark.cpp
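For context, a rough Python illustration of what "extracting tokens from the context response" could mean for a disaggregated client, assuming the context response carries the first generated token (field names are assumptions, not the benchmark's actual C++ API):

# Illustrative only: if the generation response stopped repeating the first
# token, the client would prepend it from the context response instead.
def merge_disagg_tokens(ctx_response, gen_response, beam=0):
    first_token = ctx_response.output_token_ids[beam][-1]
    return [first_token] + list(gen_response.output_token_ids[beam])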
Thanks for the suggestion. I will modify this.
Force-pushed from f1bd653 to ed97ba3 (compare)
/bot run
PR_Github #1405 [ run ] triggered by Bot
PR_Github #1405 [ run ] completed with state
Force-pushed from ed97ba3 to 7331397 (compare)
/bot run --add-multi-gpu-test
PR_Github #1409 [ run ] triggered by Bot
PR_Github #1409 [ run ] completed with state
/bot run --add-multi-gpu-test
PR_Github #1544 [ run ] triggered by Bot
PR_Github #1544 [ run ] completed with state
/bot run --add-multi-gpu-test
PR_Github #1591 [ run ] triggered by Bot
PR_Github #1591 [ run ] completed with state
/bot run --add-multi-gpu-test
Signed-off-by: Shunkang <[email protected]>
Add double first token check
Signed-off-by: Shunkang <[email protected]>
Adapt for py_decoding_iter
Signed-off-by: Shunkang <[email protected]>
Roll back CI change
Signed-off-by: Shunkang <[email protected]>
Add check
Signed-off-by: Shunkang <[email protected]>
Signed-off-by: Shunkang <[email protected]>
Force-pushed from cc12e50 to 1df2674 (compare)
/bot run --add-multi-gpu-test
PR_Github #1611 [ run ] triggered by Bot
PR_Github #1612 [ run ] triggered by Bot
PR_Github #1611 [ run ] completed with state
PR_Github #1612 [ run ] completed with state
/bot run --add-multi-gpu-test
PR_Github #1613 [ run ] triggered by Bot
PR_Github #1613 [ run ] completed with state
if self.disaggregated_params is not None and \
        len(response_tensors.output_token_ids[src_idx]) == 2:
    output._last_token_ids_len = 1
I don't think we should rely on the len of output_token_ids, since for spec decoding we could have 2 tokens even after the first gen token. Can you have a look at #3427? I think it's a more general fix.
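A small illustration of why the length check is fragile under speculative decoding (the values are made up):

# One decoding step can accept several draft tokens, so a later response can
# also carry exactly 2 tokens; the len(...) == 2 heuristic then misfires.
tokens_in_later_response = [17, 42]  # two draft tokens accepted in one step
looks_like_first_gen_response = len(tokens_in_later_response) == 2  # True, wrongly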
Thank you. I also think that should be a good solution. I will close this PR.
In PD, the overlap and non-overlap schedulers behave differently. With the non-overlap scheduler, we always return the first two generated tokens together. With the overlap scheduler, the request might return a response before the second generated token has been computed. This MR fixes the change from #2986 for the overlap-scheduler case.
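A minimal sketch of the difference described above, using made-up token ids rather than the real response objects:

first_gen_token, second_gen_token = 101, 202  # made-up token ids

# Non-overlap scheduler: the first generation response is emitted only after
# the second token is computed, so both tokens arrive together.
non_overlap_first_response = [first_gen_token, second_gen_token]

# Overlap scheduler: the response can be emitted before the second token
# exists, so it carries only the first token, which the context phase has
# already returned (hence the "double first token" this PR addresses).
overlap_first_response = [first_gen_token]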