Offloading support for multiple attention layouts #2024

Open
wants to merge 5 commits into base: main

Conversation

sanandaraj5597
Contributor

Description

CPU offloading currently only supports the sbhd_sbhd_sbhd layout, but we use several other layouts for pre-training and fine-tuning of LLMs.

This PR adds offloading support for all attention layouts.
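
For context, a rough sketch of what these layout strings describe (shapes and sizes are illustrative only; the names follow the qkv_layout strings used in the attention code):

```python
import torch

s, b, h, d = 2048, 2, 16, 64   # illustrative sizes

# "sbhd_sbhd_sbhd": three separate, contiguous Q/K/V tensors.
q = torch.empty(s, b, h, d)
k = torch.empty(s, b, h, d)
v = torch.empty(s, b, h, d)

# "sbh3d": Q/K/V interleaved in one tensor along a size-3 dimension;
# slicing Q back out gives a non-contiguous view.
qkv = torch.empty(s, b, h, 3, d)
q_view = qkv[..., 0, :]
assert not q_view.is_contiguous()

# "thd_thd_thd": variable-length sequences packed over a total-token
# dimension t, with no explicit batch dimension.
t = 4096
q_packed = torch.empty(t, h, d)
```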

Selvaraj Anandaraj and others added 4 commits August 2, 2025 19:30
Signed-off-by: Selvaraj Anandaraj <[email protected]>
Signed-off-by: Selvaraj Anandaraj <[email protected]>
Signed-off-by: Selvaraj Anandaraj <[email protected]>
@pggPL
Collaborator

pggPL commented Aug 19, 2025

I think it will not work for offloaded layers, because .to() by default preserves memory format. I think it needs to be changed to .to(device="cpu", memory_format=torch.contiguous_format).
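
A minimal sketch of the suggested change (illustrative only, not the PR's actual offload code): the default memory_format=torch.preserve_format can carry the source tensor's strides over to the copy, while torch.contiguous_format forces a densely packed CPU copy.

```python
import torch

def offload_to_cpu(t: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper, not the PR's code: force a densely packed CPU copy
    # instead of relying on the default memory_format=torch.preserve_format.
    return t.to(device="cpu", memory_format=torch.contiguous_format)
```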

@pggPL
Collaborator

pggPL commented Aug 19, 2025

Otherwise it looks good. Have you tested it somehow?

@sanandaraj5597
Contributor Author

sanandaraj5597 commented Aug 19, 2025

I think it will not work for offloaded layers, because .to() by default preserves memory format. I think it needs to be changed to .to(device="cpu", memory_format=torch.contiguous_format).

It will work, because the CPU copy is created in a contiguous fashion. When you call .to(), the CPU copy is moved back to the GPU and is also contiguous. That's why we break down all the interleaved formats (sbh3d/th3d/...) into the contiguous ones (sbhd_sbhd_sbhd/thd_thd_thd/...) in attention.
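
A minimal sketch of this argument (illustrative shapes, not the attention code): once the interleaved layout is split and made contiguous, the CPU copy is dense, and the round trip back to the GPU stays contiguous.

```python
import torch

qkv = torch.randn(128, 2, 16, 3, 64)                          # "sbh3d"
q, k, v = (x.contiguous() for x in torch.unbind(qkv, dim=3))  # -> "sbhd_sbhd_sbhd"

q_cpu = q.to("cpu")            # contiguous source -> contiguous copy
assert q_cpu.is_contiguous()

if torch.cuda.is_available():
    q_gpu = q_cpu.to("cuda")   # copying back also yields a contiguous tensor
    assert q_gpu.is_contiguous()
```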

have you tested it somehow?

Yes, I've run E2E tests on top of this for pre-training and fine-tuning.

@pggPL
Collaborator

pggPL commented Aug 20, 2025

https://docs.pytorch.org/docs/stable/generated/torch.Tensor.to.html It preserves the memory format. So it will split sbh3d into 3 separate tensors (non-interleaved), none of them contiguous; all of them will keep the same strides as before.
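
An illustration of this point (hypothetical shapes): slicing Q, K and V out of an interleaved sbh3d tensor without .contiguous() gives three views that all keep the strides of the packed buffer and are not contiguous.

```python
import torch

qkv = torch.randn(128, 2, 16, 3, 64)           # "sbh3d"
q, k, v = torch.unbind(qkv, dim=3)              # three non-interleaved views

print(q.is_contiguous())                        # False
print(q.stride() == k.stride() == v.stride())   # True: strides of the packed qkv buffer
```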
