
How to implement attention when query and value have different hidden dims? #3121

Closed
ChaseMonsterAway opened this issue Mar 27, 2025 · 5 comments

@ChaseMonsterAway

Hi, I'm trying to export an attention layer with different hidden dimensions for query and value to a TensorRT engine. Do you have any tips?

@juney-nvidia
Collaborator

@ming-wei

Hi Ming, do you have any suggestions for this question?

Thanks
June

juney-nvidia added the triaged and question labels on Mar 27, 2025
@ChaseMonsterAway
Author

@ming-wei Hi, could you give me any suggestions? Thanks in advance.

@ming-wei
Collaborator

Did you mean multi-query attention or grouped-query attention, where multiple q heads share the same kv head?

We have support for this use case already; see TensorRT-LLM/examples/llama/convert_checkpoint.py, line 421 at commit 794f61c:

'num_attention_heads': args.n_head,

Just set num_attention_heads to the number of q heads, and set num_key_value_heads to the number of kv heads.
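
To illustrate, here is a minimal sketch of the relevant part of the checkpoint config; the key names follow the convert_checkpoint.py snippet above, the head counts are made-up examples, and a real config needs more fields than shown:

```python
# Hypothetical fragment of the config dict built by convert_checkpoint.py;
# values are illustrative only, not a recommendation.
config = {
    'num_attention_heads': 32,  # number of query heads
    'num_key_value_heads': 8,   # number of KV heads; 8 < 32 gives GQA, 1 would give MQA
}
```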

You can check out the README in the examples/llama directory for more details.

Let me know if you have further questions.

Thanks,
Ming

@ChaseMonsterAway
Author


Thanks for the feedback. However, my question is different from MQA and GQA. It's still a normal attention layer, but the query, key, and value (Q/K/V) projection layers operate on different hidden dimensions.

Here is an example of my question, from UniMERNet.
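
To make the setup concrete, here is a minimal PyTorch sketch of the pattern I mean (my own illustration, not the actual UniMERNet code): the query comes from a 1024-dim decoder state while the key/value come from 768-dim encoder features.

```python
import torch
import torch.nn as nn

q_dim, kv_dim, num_heads = 1024, 768, 16  # made-up sizes for illustration

attn = nn.MultiheadAttention(
    embed_dim=q_dim,      # query / output hidden size
    num_heads=num_heads,
    kdim=kv_dim,          # key hidden size differs from the query's
    vdim=kv_dim,          # value hidden size differs from the query's
    batch_first=True,
)

query = torch.randn(2, 10, q_dim)     # (batch, tgt_len, q_dim)
memory = torch.randn(2, 50, kv_dim)   # (batch, src_len, kv_dim)
out, _ = attn(query, memory, memory)  # out: (2, 10, 1024)
```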

@ming-wei
Collaborator

Thanks for the elaboration.

I don't think TRTLLM supports such use cases.

A workaround you can try is to pad zeros onto q/k/v so their dimensions match; however, you'll lose the inference speedup and memory-saving benefits.
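
A rough sketch of that workaround, continuing the hypothetical 1024-vs-768 sizes from the earlier example (plain PyTorch, not TRT-LLM API code): zero-pad the key/value side so every dimension equals the query hidden size.

```python
import torch
import torch.nn.functional as F

q_dim, kv_dim, inner_dim = 1024, 768, 1024   # made-up sizes for illustration
memory = torch.randn(2, 50, kv_dim)          # encoder features (key/value input)
w_k = torch.randn(inner_dim, kv_dim)         # original K projection weight

# 1) Zero-pad the key/value input up to the query hidden size.
memory_pad = F.pad(memory, (0, q_dim - kv_dim))   # (2, 50, 1024)

# 2) Zero-pad the projection weight's input columns so the extra zero
#    channels contribute nothing to the projected keys.
w_k_pad = F.pad(w_k, (0, q_dim - kv_dim))         # (1024, 1024)

# The projected keys are unchanged, but the layer now looks like a standard
# equal-dim attention; the padded channels only waste compute and memory.
assert torch.allclose(memory_pad @ w_k_pad.t(), memory @ w_k.t())
```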

ming-wei added the feature request label on Mar 31, 2025
ming-wei closed this as not planned on Mar 31, 2025