MultiHeadAttentionWrapper should instantiate CausalSelfAttention with d_out = d_out // num_heads? #609

henrythe9th · 2025-04-09T01:57:31Z

henrythe9th
Apr 9, 2025

Since the MultiHeadAttentionWrapper class calls torch.cat([head(x) for head in self.heads], dim=-1)
shouldn't we be instantiating CausalSelfAttention with d_out = d_out // num_heads so that the final MultiHeadAttentionWrapper output has the same shape and d_out as was specified in the input?

In other words, is this a clearer implementation?

class MultiHeadAttentionWrapper(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        self.heads = nn.ModuleList(
            [CausalSelfAttention(d_in, d_out // num_heads, context_length, dropout, qkv_bias) 
             for _ in range(num_heads)]
        )
        self.out_proj = nn.Linear(d_out, d_out)

    def forward(self, x):
        context_vec = torch.cat([head(x) for head in self.heads], dim=-1)
        return self.out_proj(context_vec)

casinca · 2025-04-09T11:17:40Z

casinca
Apr 9, 2025

I believe the confusion lies in how we are interpreting d_out. In the specific case of MultiHeadAttentionWrapper from Sebastian's impl, d_out keyword is actually head_dim (seen from the classic MHA perspective.)

In your impl, d_out is the actual d_out as you'd see it from the classic MHA.

It's true that it's clearer in the sense that the d_out keyword is what we'd expect it to be. Sebastian's impl d_out argument will be for you d_out * num_heads.

1 reply

rasbt Apr 18, 2025
Maintainer

Thanks for the comments @henrythe9th and @casinca . I don't disagree with you here, and I remember toying with that idea. It's been a while, but I think I ultimately decided to go with the current way because I thought it might be easier to see what's happening (i.e., that we are stacking multiple heads). With d_out = d_out // num_heads the input and output would have been the same and it would have been harder to see.

But I can also see your point doing it differently.

fmatulic · 2026-01-19T14:36:05Z

fmatulic
Jan 19, 2026

@rasbt I hope it's ok if I borrow this thread to make a comment related to that code, specifically torch.cat([head(x) for head in self.heads], dim=-1).

The book states that the single-head attention modules are processed sequentially via [head(x) for head in self.heads] in the forward pass of MultiHeadAttentionWrapper. But that for loop is just for the list comprehension inside torch.cat. If I'm not mistaken, once the list of heads is created, the cat function uses Pytorch's internal computation graph to run all the head(x) concurrently.

Or did you mean in terms of kernel launches? Like, a single batched matmul across all heads in the optimised MultiHeadAttention class allows you to process all heads with one kernel launch, compared to multiple ones with MultiHeadAttentionWrapper?

2 replies

rasbt Jan 21, 2026
Maintainer

No worries. Thanks for the comment. In that code, afaik, torch.cat would only collect the already computed results. Like you said the computation graph does get built for each entry there, but I think you only have concurrency (for the matmuls) inside of each of these head(x) calls.

(I hope to be wrong though, which would be nice from an efficiency standpoint :))

fmatulic Jan 21, 2026

Thanks for the reply. I did some timing comparisons between the two approaches and you were right, there's linear scaling with head count for the wrapper. PyTorch doesn't magically parallelise the computations across the Python for loop.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MultiHeadAttentionWrapper should instantiate CausalSelfAttention with d_out = d_out // num_heads? #609

Uh oh!

{{title}}

Uh oh!

Replies: 2 comments 3 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

MultiHeadAttentionWrapper should instantiate CausalSelfAttention with d_out = d_out // num_heads? #609

Uh oh!

henrythe9th Apr 9, 2025

Replies: 2 comments · 3 replies

Uh oh!

casinca Apr 9, 2025

Uh oh!

rasbt Apr 18, 2025 Maintainer

Uh oh!

Uh oh!

fmatulic Jan 19, 2026

Uh oh!

rasbt Jan 21, 2026 Maintainer

Uh oh!

fmatulic Jan 21, 2026

henrythe9th
Apr 9, 2025

Replies: 2 comments 3 replies

casinca
Apr 9, 2025

rasbt Apr 18, 2025
Maintainer

fmatulic
Jan 19, 2026

rasbt Jan 21, 2026
Maintainer