
[Feature] Prompt lookup speculative decoding for LLM API #3138

Open
tonyay163 opened this issue Mar 28, 2025 · 4 comments

@tonyay163

It looks like the model runner API supports prompt lookup speculative decoding: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/prompt_lookup

However, it doesn't seem to be part of the LLM API yet:

speculative_config: Optional[Union[LookaheadDecodingConfig,
                                   MedusaDecodingConfig,
                                   EagleDecodingConfig,
                                   MTPDecodingConfig]] = None
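For context, prompt lookup speculative decoding drafts candidate tokens by matching the trailing n-gram of the sequence generated so far against earlier occurrences in the prompt, then copying the tokens that followed the match as a draft for the target model to verify. A minimal sketch of that drafting step (illustrative only, not the TensorRT-LLM implementation; the function and parameter names are made up):

from typing import List

def propose_draft_tokens(token_ids: List[int],
                         ngram_size: int = 3,
                         num_draft_tokens: int = 5) -> List[int]:
    """Find the most recent earlier occurrence of the trailing n-gram
    and copy its continuation as draft tokens."""
    if len(token_ids) <= ngram_size:
        return []
    pattern = token_ids[-ngram_size:]
    # Scan right to left (excluding the trailing n-gram itself) so the
    # most recent match wins.
    for start in range(len(token_ids) - ngram_size - 1, -1, -1):
        if token_ids[start:start + ngram_size] == pattern:
            continuation = token_ids[start + ngram_size:
                                     start + ngram_size + num_draft_tokens]
            if continuation:
                # The target model verifies these drafts in one forward pass.
                return continuation
    return []  # no match: fall back to normal one-token decoding

For example, with token_ids = [1, 2, 3, 1, 2, 3] and ngram_size = 2, the trailing bigram [2, 3] matches at position 1 and the sketch returns the continuation [1, 2, 3].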

@juney-nvidia
Collaborator

juney-nvidia commented Mar 28, 2025

Hi @tonyay163,

Thanks for bringing this to our attention.
It is true that prompt lookup speculative decoding is not exposed at the LLM API level yet.
We are currently working to make the LLM API stable enough for the official TensorRT-LLM 1.0 release, so for now we may not have the bandwidth to expose prompt lookup speculative decoding in the LLM API ourselves.

If you are interested, you are welcome to contribute the code to TensorRT-LLM directly.

@Superjomn for visibility on this.

@tonyay163
Author

Thanks for the quick response @juney-nvidia. Is there an example PR where the other speculative decoding methods were implemented that I can refer to?

@Superjomn
Collaborator

Superjomn commented Mar 28, 2025

Hi @tonyay163, I am afraid the major MRs were internal before we switched to GitHub. Recently, we have been focusing on the PyTorch path; here is some related code I know of.

cc @lfr-0531 in case there is more information about contributing to the PyTorch speculative decoding part.

@juney-nvidia
Collaborator

juney-nvidia commented Mar 28, 2025

@tonyay163

As @Superjomn said, we are now focusing on the PyTorch path to improve the ease of use of TensorRT-LLM (while still ensuring the best performance). Also, since there is already prompt lookup speculative decoding support in the TensorRT path, you can decide whether you want to implement it in the PyTorch path (by following the MTP example shared by @Superjomn) or expose the current prompt lookup implementation in the TensorRT path through the LLM API.

In our design, the details of both the TensorRT and PyTorch paths are hidden behind the LLM API, so as long as you are using the LLM API, switching between TensorRT and PyTorch should be relatively seamless for end users. (There can still be cases where switching from the TensorRT path to the PyTorch path requires some user-side changes, but the code changes should be very small.)
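For illustration, a minimal usage sketch of that backend-agnostic design (the model name and sampling values are placeholders, and the exact import paths may differ across TensorRT-LLM versions):

from tensorrt_llm import LLM, SamplingParams

# The LLM API hides whether the TensorRT or the PyTorch path runs underneath,
# so this call is meant to stay the same when the backend changes.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)

A prompt lookup config, once exposed, would presumably be passed through the same speculative_config argument as the existing options.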

Please let me know whether this is clear enough.

Thanks
June
