Description
I noticed a potential padding-related issue in the current DeepSeekV4 implementation.
The sliding-window attention path appears to correctly receive the user-provided attention_mask through create_sliding_window_causal_mask, so padded key/value tokens can be masked out in the normal attention path.
However, the HCA / CSA compressor path seems to form compressed KV entries from the projected kv / gate tensors without using the input attention_mask.
From my reading, the compressor first projects hidden states into kv and gate, then stores/chunks them through the cache layer, and finally performs softmax-gated pooling to produce compressed KV entries. This means padding tokens may still participate in compressed KV construction.
For example, with right padding:
attention_mask = [1, 1, 1, 0]
compress_rate = 4
the compressor may form one compressed block from:
[token_0, token_1, token_2, pad]
Even though the pad token is masked in the normal sliding-window attention mask, it may still contribute to the compressed KV entry.
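To make the leakage concrete, here is a minimal, self-contained sketch of softmax-gated pooling over a single chunk. The names (`kv`, `gate`, `compress_rate`), the single-head layout, and the pooling details are simplifying assumptions for illustration only, not the actual DeepSeekV4 code:

```python
import torch

# Hypothetical, simplified sketch of softmax-gated pooling over one chunk,
# assuming compress_rate = 4 and a single head / small feature dim for readability.
compress_rate = 4
hidden_dim = 8

kv = torch.randn(compress_rate, hidden_dim)  # rows: [token_0, token_1, token_2, pad]
gate = torch.randn(compress_rate)            # one gate logit per token in the chunk

# Softmax over the chunk gives every position a nonzero weight,
# so the pad token's kv row leaks into the compressed entry.
weights = torch.softmax(gate, dim=0)                      # shape: (compress_rate,)
compressed_kv = (weights.unsqueeze(-1) * kv).sum(dim=0)   # shape: (hidden_dim,)

print(weights)  # the last (pad) position still receives a nonzero weight
```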
Question
If padding is intended to be supported, should the compressor receive attention_mask and use it when forming compressed KV blocks?
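If so, one possible approach (purely an assumption on my side, not a reference to any existing API in the model) would be to pass the per-chunk slice of attention_mask into the pooling step and set the gate logits of padded positions to -inf before the softmax, so they receive zero pooling weight:

```python
import torch

# Hypothetical sketch of how the compressor could consume attention_mask:
# mask the gate logits of padded positions before the softmax.
compress_rate = 4
hidden_dim = 8

kv = torch.randn(compress_rate, hidden_dim)
gate = torch.randn(compress_rate)
chunk_mask = torch.tensor([1, 1, 1, 0], dtype=torch.bool)  # right padding

masked_gate = gate.masked_fill(~chunk_mask, float("-inf"))
weights = torch.softmax(masked_gate, dim=0)                # pad position gets weight 0
compressed_kv = (weights.unsqueeze(-1) * kv).sum(dim=0)

print(weights)  # [w0, w1, w2, 0.0]
```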
Thanks!