What does this PR do?
#33932 introduced FlashAttentionKwargs as an alternative to using position_ids for padding-free training. The RoPE positional embeddings are not currently applied correctly in the FlashAttentionKwargs code path. This PR ensures that RoPE is applied properly for this path.

Code Notes
The Issue
The issue is that if position_ids is not provided, it is internally generated here:

transformers/src/transformers/models/llama/modeling_llama.py, lines 561 to 562 (at ec7afad)

and these values are then used to generate the RoPE embeddings here:

transformers/src/transformers/models/llama/modeling_llama.py, lines 570 to 571 (at ec7afad)
These RoPE embeddings are effectively built from ~torch.arange positions, whereas they should be non-trivially generated from the values in FlashAttentionKwargs.
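For a concrete toy example of the mismatch, consider two sequences of lengths 3 and 2 packed into a single row (the tensors below are illustrative, not code from the PR):

```python
import torch

# Two sequences of lengths 3 and 2 packed into one row of 5 tokens.
# FlashAttentionKwargs carries the boundaries as cumulative sequence lengths.
cu_seq_lens = torch.tensor([0, 3, 5])

# What the model falls back to today when position_ids is None:
fallback_position_ids = torch.arange(5).unsqueeze(0)   # tensor([[0, 1, 2, 3, 4]])

# What RoPE actually needs for padding-free training: positions restart at each boundary.
expected_position_ids = torch.tensor([[0, 1, 2, 0, 1]])
```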
The Fix
Introduce a get_position_ids_from_cu_seq_lens helper which converts FlashAttentionKwargs -> position_ids when the former is provided. Because many other models inherit from LlamaDecoder, this change propagates to many other models via modular_model_converter.py.
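As a rough illustration (not necessarily the exact implementation in this PR), such a conversion could look like:

```python
import torch

def get_position_ids_from_cu_seq_lens(cu_seq_lens: torch.Tensor) -> torch.Tensor:
    """Build position_ids for a packed (padding-free) batch from cumulative sequence lengths.

    cu_seq_lens: 1D int tensor, e.g. tensor([0, 3, 5]) for sequences of lengths 3 and 2.
    Returns a (1, total_tokens) tensor whose positions restart at every sequence boundary.
    """
    seq_lengths = (cu_seq_lens[1:] - cu_seq_lens[:-1]).tolist()
    position_ids = torch.cat([torch.arange(length) for length in seq_lengths])
    return position_ids.unsqueeze(0)

# Example: get_position_ids_from_cu_seq_lens(torch.tensor([0, 3, 5]))
# -> tensor([[0, 1, 2, 0, 1]])
```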
Tests
The solution is tested in LlamaModelTest::test_attn_mask_position_ids_flash_attn_equality, which checks that the logits in the following cases are consistent with each other:

- position_ids
- FlashAttentionKwargs

This test fails on latest main without the above fix.
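Roughly, the equivalence being asserted can be sketched as follows (a simplified illustration, not the actual test body; it assumes a flash-attention-enabled model and the cu_seq_lens_q / cu_seq_lens_k / max_length_q / max_length_k kwarg names carried by FlashAttentionKwargs):

```python
import torch

def assert_padding_free_paths_match(model, input_ids, seq_lengths, atol=1e-4):
    """Logits from explicit position_ids should match logits from FlashAttentionKwargs.

    input_ids: packed batch of shape (1, total_tokens); seq_lengths: list of per-sequence lengths.
    Assumes the model runs with flash attention and accepts the kwargs used below.
    """
    cu_seq_lens = torch.cumsum(torch.tensor([0] + seq_lengths), dim=0).to(torch.int32)
    position_ids = torch.cat([torch.arange(length) for length in seq_lengths]).unsqueeze(0)

    with torch.no_grad():
        logits_from_position_ids = model(input_ids=input_ids, position_ids=position_ids).logits
        logits_from_fa_kwargs = model(
            input_ids=input_ids,
            cu_seq_lens_q=cu_seq_lens,
            cu_seq_lens_k=cu_seq_lens,
            max_length_q=max(seq_lengths),
            max_length_k=max(seq_lengths),
        ).logits

    torch.testing.assert_close(logits_from_position_ids, logits_from_fa_kwargs, atol=atol, rtol=0)
```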
Fixes # (issue)
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.