[Feature] Prompt lookup speculative decoding for LLM API #3138
Comments
Hi @tonyay163, thanks for bringing this to our attention. If you're interested, you are welcome to contribute the code to TensorRT-LLM directly. @Superjomn for visibility.
Thanks for the quick response @juney-nvidia. Is there an example PR where the other ones were implemented that I can refer to?
Hi @tonyay163, I'm afraid the major MRs were internal before we switched to GitHub. Recently we have been focusing on the PyTorch path, and here is some related code I know of. cc @lfr-0531 in case there is more information about contributing to the PyTorch speculative decoding code.
As @Superjomn said, we are now focusing on the PyTorch path to improve the ease of use of TensorRT-LLM (while still ensuring the best performance). Since Prompt Lookup speculative decoding is already supported in the TensorRT path, you can decide whether to implement it in the PyTorch path (by following the MTP example shared by @Superjomn) or to expose the current TensorRT implementation through the LLM API. In our design, the details of both the TensorRT and PyTorch paths are hidden by the LLM API, so as long as you use the LLM API, switching between TensorRT and PyTorch should be relatively seamless for end users (in some cases, switching from the TensorRT to the PyTorch path may still require small user-side code changes). Please let me know whether this is clear enough. Thanks!
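For reference, the core idea behind prompt lookup decoding is simple: match the trailing n-gram of the generated sequence against earlier context and copy the tokens that followed the match as draft tokens, which the target model then verifies. The sketch below is a minimal, generic illustration of that lookup step, not TensorRT-LLM's actual implementation; the function name and defaults are illustrative.

```python
def prompt_lookup_draft(tokens, ngram_size=3, num_draft_tokens=5):
    """Propose draft tokens by matching the trailing n-gram of `tokens`
    against earlier context (the core idea of prompt lookup decoding).

    Returns up to `num_draft_tokens` tokens that followed the most
    recent earlier occurrence of the trailing n-gram, or [] if no
    match is found. Illustrative sketch only.
    """
    if len(tokens) < ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Scan backwards for the most recent earlier occurrence of the tail,
    # excluding the tail's own position at the end of the sequence.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            begin = start + ngram_size
            return tokens[begin:begin + num_draft_tokens]
    return []
```

The draft tokens are then fed to the target model in a single forward pass; accepted tokens are kept and generation resumes from the first mismatch, which is where the speedup comes from on repetitive inputs such as code or retrieval-augmented prompts.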
It looks like the model runner API supports prompt lookup speculative decoding: https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/prompt_lookup
However, it does not seem to be exposed through the LLM API yet:
TensorRT-LLM/tensorrt_llm/llmapi/llm_args.py
Lines 851 to 854 in 3ee4332