
[DeepSeekV4] Compressor does not seem to account for padding tokens when forming compressed KV blocks #45938

@WKQ9411

Description


I noticed a potential padding-related issue in the current DeepSeekV4 implementation.

The sliding-window attention path appears to correctly receive the user-provided attention_mask through create_sliding_window_causal_mask, so padded key/value tokens are masked out there.

However, the HCA / CSA compressor path seems to form compressed KV entries from the projected kv / gate tensors without using the input attention_mask.

From my reading, the compressor first projects hidden states into kv and gate, then stores/chunks them through the cache layer, and finally performs softmax-gated pooling to produce compressed KV entries. This means padding tokens may still participate in compressed KV construction.
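For concreteness, here is a minimal sketch of that pooling step as I understand it. The names (compress_kv, kv, gate, compress_rate) are illustrative, not the actual DeepSeekV4 module attributes:

```python
import torch

def compress_kv(kv, gate, compress_rate):
    # kv:   (batch, seq_len, dim)  projected key/value states
    # gate: (batch, seq_len)       scalar gate logit per token
    bsz, seq_len, dim = kv.shape
    n_blocks = seq_len // compress_rate
    kv = kv[:, : n_blocks * compress_rate].reshape(bsz, n_blocks, compress_rate, dim)
    gate = gate[:, : n_blocks * compress_rate].reshape(bsz, n_blocks, compress_rate)
    weights = gate.softmax(dim=-1).unsqueeze(-1)  # pad positions keep nonzero weight
    return (weights * kv).sum(dim=2)              # compressed KV: (batch, n_blocks, dim)
```

Because the softmax here is taken over every position in the chunk, there is nothing that drives a pad position's weight to zero.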

For example, with right padding:

attention_mask = [1, 1, 1, 0]
compress_rate = 4

the compressor may form one compressed block from:

[token_0, token_1, token_2, pad]

Even though the pad token is masked in the normal sliding-window attention mask, it may still contribute to the compressed KV entry.
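Running the pooling sketch above on this example makes the leak visible (all values hypothetical):

```python
kv = torch.randn(1, 4, 8)   # token_0..token_2 plus one pad embedding
gate = torch.zeros(1, 4)    # uniform gate logits for illustration
pooled = compress_kv(kv, gate, compress_rate=4)
# With uniform gates, pooled[0, 0] is the mean of all four rows of kv,
# so the pad embedding contributes 25% of the compressed entry.
assert torch.allclose(pooled[0, 0], kv[0].mean(dim=0))
```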

Question

If padding is intended to be supported, should the compressor receive attention_mask and use it when forming compressed KV blocks?
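If so, a minimal sketch of one possible fix, reusing the hypothetical shapes from the sketch above, would be to mask the gate logits before the softmax so pad positions receive zero pooling weight:

```python
def compress_kv_masked(kv, gate, attention_mask, compress_rate):
    # attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    bsz, seq_len, dim = kv.shape
    n_blocks = seq_len // compress_rate
    kv = kv[:, : n_blocks * compress_rate].reshape(bsz, n_blocks, compress_rate, dim)
    gate = gate[:, : n_blocks * compress_rate].reshape(bsz, n_blocks, compress_rate)
    mask = attention_mask[:, : n_blocks * compress_rate].reshape(bsz, n_blocks, compress_rate)
    gate = gate.masked_fill(mask == 0, float("-inf"))  # pads get zero softmax weight
    weights = torch.nan_to_num(gate.softmax(dim=-1))   # all-pad blocks would be NaN; zero them
    return (weights.unsqueeze(-1) * kv).sum(dim=2)
```

With the right-padding example above, the last position's weight becomes exactly zero and the compressed block is formed from token_0..token_2 only.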

Thanks!
