When will Gemma 3 be supported? #3143

Open
bebilli opened this issue Mar 29, 2025 · 7 comments
Assignees: juney-nvidia
Labels: feature request (New feature or request), triaged (Issue has been triaged by maintainers)

Comments


bebilli commented Mar 29, 2025

No description provided.

juney-nvidia (Collaborator) commented:

@bebilli

Hi bebilli,

We haven't finalized the plan to support Gemma 3 yet. If you are interested, you are welcome to contribute this model support to TensorRT-LLM, and we can provide the needed support and consulting.

June

juney-nvidia self-assigned this Mar 29, 2025
juney-nvidia added the triaged and feature request labels Mar 29, 2025
bebilli (Author) commented Mar 29, 2025

I'm just an AI application developer. Does adapting Gemma 3 require a strong, professional AI development background? If not, could you give me some guidance?

juney-nvidia (Collaborator) commented:

@bebilli

Hi,

I would recommend using the PyTorch workflow to add Gemma 3 model support, which has a less steep learning curve for AI application developers. You can follow this guide, along with the LLaMA example code, to add Gemma 3.
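
To make the shape of that work concrete, here is a minimal, hypothetical sketch of a decoder-only Gemma 3 definition in plain PyTorch. It is not the TensorRT-LLM PyTorch modeling API; in a real contribution the attention, normalization, and projection blocks would come from the modules used in the LLaMA example, and the class and config names below (Gemma3Config, Gemma3ForCausalLM) are placeholders:

```python
# Hypothetical sketch only: class names, config fields, and layer choices are
# placeholders, not the TensorRT-LLM PyTorch modeling API. In a real port the
# building blocks come from the LLaMA example in the TensorRT-LLM repository.
from dataclasses import dataclass

import torch
import torch.nn as nn


@dataclass
class Gemma3Config:
    vocab_size: int = 262144   # placeholder values; the real ones come from the checkpoint config
    hidden_size: int = 1152
    num_layers: int = 26
    num_heads: int = 4


class Gemma3DecoderLayer(nn.Module):
    def __init__(self, cfg: Gemma3Config):
        super().__init__()
        self.norm = nn.LayerNorm(cfg.hidden_size)  # stand-in for the real normalization layer
        self.attn = nn.MultiheadAttention(cfg.hidden_size, cfg.num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(cfg.hidden_size, 4 * cfg.hidden_size),
            nn.GELU(),
            nn.Linear(4 * cfg.hidden_size, cfg.hidden_size),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        return x + self.mlp(self.norm(x))


class Gemma3ForCausalLM(nn.Module):
    """Decoder-only skeleton: embeddings -> N decoder layers -> LM head."""

    def __init__(self, cfg: Gemma3Config):
        super().__init__()
        self.embed = nn.Embedding(cfg.vocab_size, cfg.hidden_size)
        self.layers = nn.ModuleList(Gemma3DecoderLayer(cfg) for _ in range(cfg.num_layers))
        self.lm_head = nn.Linear(cfg.hidden_size, cfg.vocab_size, bias=False)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(input_ids)
        for layer in self.layers:
            x = layer(x)
        return self.lm_head(x)  # logits over the vocabulary
```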

If you hit any specific questions while adding Gemma 3, please let us know.

Thanks
June

bebilli (Author) commented Mar 30, 2025

@juney-nvidia If the method you mentioned is used, is it necessary to convert to the native TensorRT format before inference? If conversion is not required, can the performance match that of the native TensorRT format?

juney-nvidia (Collaborator) commented Mar 30, 2025

> @juney-nvidia If the method you mentioned is used, is it necessary to convert to the native TensorRT format before inference? If conversion is not required, can the performance match that of the native TensorRT format?

For the PyTorch workflow, you don't need to convert the PyTorch model to TensorRT format. Instead, you follow the step-by-step guide to add your new model, which includes writing the model definition against the TensorRT-LLM PyTorch modeling API and implementing the weight-loading logic.
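
As a hedged illustration of the weight-loading half, the work is typically a rename-and-copy pass over the checkpoint: parameter names in the source (e.g. Hugging Face) checkpoint rarely match the new model definition one-to-one, so the loader walks a mapping table and copies tensors in. The checkpoint keys and module names below are hypothetical placeholders, not the actual Gemma 3 or TensorRT-LLM names:

```python
# Hedged sketch of weight loading: the checkpoint keys and module parameter
# names below are hypothetical placeholders, not the real Gemma 3 / TensorRT-LLM names.
import torch
import torch.nn as nn


def load_weights(model: nn.Module, checkpoint: dict[str, torch.Tensor]) -> None:
    """Copy checkpoint tensors into the model, renaming keys along the way."""
    # Map checkpoint parameter names to the names used by the new model
    # definition. In a real port this table is derived from the Gemma 3
    # checkpoint layout and the module hierarchy of the new model class.
    rename = {
        "model.embed_tokens.weight": "embed.weight",
        "lm_head.weight": "lm_head.weight",
    }

    params = dict(model.named_parameters())
    with torch.no_grad():
        for src_name, tensor in checkpoint.items():
            dst_name = rename.get(src_name, src_name)
            if dst_name not in params:
                # Unmapped tensors (e.g. precomputed caches) are often skipped or recomputed.
                continue
            params[dst_name].copy_(tensor)
```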

As to performance, based on our internal benchmarks on key models such as LLaMA/Mistral/Mixtral, the PyTorch workflow is on par with (or even faster than) the TensorRT workflow. This is because the customized high-performance kernels are reused in both workflows (as TensorRT plugins in one, as torch custom ops in the other), and the high-performance C++ runtime building blocks (such as the batch scheduler, KV cache manager, and disaggregated-serving logic) are also shared between them.

Also, thanks to the flexibility of PyTorch, more optimizations can be added quickly to push the performance boundary further.

The recently announced world-class DeepSeek R1 performance numbers on Blackwell were all measured with the PyTorch workflow, and for now we only support DeepSeek R1 in the PyTorch workflow.

Please let me know if there is any further question.

Thanks
June

bebilli (Author) commented Mar 30, 2025

Thank you for your guidance. I'll go and give it a try.

juney-nvidia (Collaborator) commented:

> Thank you for your guidance. I'll go and give it a try.

Thanks, looking forward to your contribution MR :)

June
