
How to implement attention when query and value have different hidden dims? #3121

Closed
ChaseMonsterAway opened this issue Mar 27, 2025 · 5 comments

@ChaseMonsterAway

Hi, I'm trying to export an attention layer with different hidden dimensions for query and value to a TensorRT engine. Do you have any tips?

@juney-nvidia
Collaborator

@ming-wei

Hi Ming, do you have any suggestions for this question?

Thanks
June

juney-nvidia added the triaged and question labels on Mar 27, 2025
@ChaseMonsterAway
Author

@ming-wei Hi, could you give me any suggestions? Thanks in advance.

@ming-wei
Collaborator

Did you mean multi-query attention or grouped-query attention, where multiple q heads share the same kv head?

We have support for this use case already; see TensorRT-LLM/examples/llama/convert_checkpoint.py, line 421 at commit 794f61c:

'num_attention_heads': args.n_head,

Just set num_attention_heads to the number of q heads, and set num_key_value_heads to the number of kv heads.
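
To illustrate, here is a minimal sketch of the relevant part of the checkpoint config; the key names follow the convert_checkpoint.py snippet above, the head counts are made-up examples, and a real config needs more fields than shown:

```python
# Hypothetical fragment of the config dict built by convert_checkpoint.py;
# values are illustrative only, not a recommendation.
config = {
    'num_attention_heads': 32,  # number of query heads
    'num_key_value_heads': 8,   # number of KV heads; 8 < 32 gives GQA, 1 would give MQA
}
```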

You can check out the README in the examples/llama directory for more details.

Let me know if you have further questions.

Thanks,
Ming

@ChaseMonsterAway
Author


Thanks for the feedback. However, my question is different from MQA and GQA. It's still a normal attention layer, but the query, key, and value (Q/K/V) projection layers operate on different hidden dimensions.

Here is an example of my question, from UniMERNet.
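
To make the setup concrete, here is a minimal PyTorch sketch of the pattern I mean (my own illustration, not the actual UniMERNet code): the query comes from a 1024-dim decoder state while the key/value come from 768-dim encoder features.

```python
import torch
import torch.nn as nn

q_dim, kv_dim, num_heads = 1024, 768, 16  # made-up sizes for illustration

attn = nn.MultiheadAttention(
    embed_dim=q_dim,      # query / output hidden size
    num_heads=num_heads,
    kdim=kv_dim,          # key hidden size differs from the query's
    vdim=kv_dim,          # value hidden size differs from the query's
    batch_first=True,
)

query = torch.randn(2, 10, q_dim)     # (batch, tgt_len, q_dim)
memory = torch.randn(2, 50, kv_dim)   # (batch, src_len, kv_dim)
out, _ = attn(query, memory, memory)  # out: (2, 10, 1024)
```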

@ming-wei
Collaborator

Thanks for the elaboration.

I don't think TRTLLM supports such use cases.

A workaround you can try is to pad zeros onto q/k/v so their dimensions match; however, you'll lose the inference speedup and memory-saving benefits.
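
A rough sketch of that workaround, continuing the hypothetical 1024-vs-768 sizes from the earlier example (plain PyTorch, not TRT-LLM API code): zero-pad the key/value side so every dimension equals the query hidden size.

```python
import torch
import torch.nn.functional as F

q_dim, kv_dim, inner_dim = 1024, 768, 1024   # made-up sizes for illustration
memory = torch.randn(2, 50, kv_dim)          # encoder features (key/value input)
w_k = torch.randn(inner_dim, kv_dim)         # original K projection weight

# 1) Zero-pad the key/value input up to the query hidden size.
memory_pad = F.pad(memory, (0, q_dim - kv_dim))   # (2, 50, 1024)

# 2) Zero-pad the projection weight's input columns so the extra zero
#    channels contribute nothing to the projected keys.
w_k_pad = F.pad(w_k, (0, q_dim - kv_dim))         # (1024, 1024)

# The projected keys are unchanged, but the layer now looks like a standard
# equal-dim attention; the padded channels only waste compute and memory.
assert torch.allclose(memory_pad @ w_k_pad.t(), memory @ w_k.t())
```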

ming-wei added the feature request label on Mar 31, 2025
ming-wei closed this as not planned on Mar 31, 2025