Description
I noticed a potential padding-related issue in the current DeepSeekV4 implementation.
The sliding-window attention path appears to correctly receive the user-provided attention_mask through create_sliding_window_causal_mask, so padded key/value tokens can be masked out in the normal attention path.
However, the HCA / CSA compressor path seems to form compressed KV entries from the projected kv / gate tensors without using the input attention_mask.
From my reading, the compressor first projects hidden states into kv and gate, then stores/chunks them through the cache layer, and finally performs softmax-gated pooling to produce compressed KV entries. This means padding tokens may still participate in compressed KV construction.
For example, with right padding:
attention_mask = [1, 1, 1, 0]
compress_rate = 4
the compressor may form one compressed block from:
[token_0, token_1, token_2, pad]
Even though the pad token is masked in the normal sliding-window attention mask, it may still contribute to the compressed KV entry.
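To make the leakage concrete, here is a minimal, self-contained sketch of softmax-gated pooling over a single chunk. The names (`kv`, `gate`, `compress_rate`), the single-head layout, and the pooling details are simplifying assumptions for illustration only, not the actual DeepSeekV4 code:

```python
import torch

# Hypothetical, simplified sketch of softmax-gated pooling over one chunk,
# assuming compress_rate = 4 and a single head / small feature dim for readability.
compress_rate = 4
hidden_dim = 8

kv = torch.randn(compress_rate, hidden_dim)  # rows: [token_0, token_1, token_2, pad]
gate = torch.randn(compress_rate)            # one gate logit per token in the chunk

# Softmax over the chunk gives every position a nonzero weight,
# so the pad token's kv row leaks into the compressed entry.
weights = torch.softmax(gate, dim=0)                      # shape: (compress_rate,)
compressed_kv = (weights.unsqueeze(-1) * kv).sum(dim=0)   # shape: (hidden_dim,)

print(weights)  # the last (pad) position still receives a nonzero weight
```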
Question
If padding is intended to be supported, should the compressor receive attention_mask and use it when forming compressed KV blocks?
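If so, one possible approach (purely an assumption on my side, not a reference to any existing API in the model) would be to pass the per-chunk slice of attention_mask into the pooling step and set the gate logits of padded positions to -inf before the softmax, so they receive zero pooling weight:

```python
import torch

# Hypothetical sketch of how the compressor could consume attention_mask:
# mask the gate logits of padded positions before the softmax.
compress_rate = 4
hidden_dim = 8

kv = torch.randn(compress_rate, hidden_dim)
gate = torch.randn(compress_rate)
chunk_mask = torch.tensor([1, 1, 1, 0], dtype=torch.bool)  # right padding

masked_gate = gate.masked_fill(~chunk_mask, float("-inf"))
weights = torch.softmax(masked_gate, dim=0)                # pad position gets weight 0
compressed_kv = (weights.unsqueeze(-1) * kv).sum(dim=0)

print(weights)  # [w0, w1, w2, 0.0]
```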
Thanks!