

@delock delock commented Oct 24, 2025

This PR puts the Muon optimizer momentum buffer on GPU. This makes the Muon optimizer execute much faster (finetuning Qwen2.5-3B on 2xA100 cards, iteration time drops from 1500ms to 910ms). Previously this buffer was on CPU.
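For illustration, a minimal sketch of the idea in PyTorch: allocate the momentum buffer lazily on the same device as the parameter instead of pinning it to CPU. The function name, `state` dict layout, and `beta` default are hypothetical, not the actual DeepSpeed code.

```python
import torch

def muon_momentum_update(param, grad, state, beta=0.95):
    """Hypothetical helper: update a Muon-style momentum buffer in place.

    The buffer is created on param.device (GPU when the parameter is on
    GPU), so the update involves no H2D/D2H copies; previously the
    equivalent buffer lived on CPU.
    """
    if "momentum_buffer" not in state:
        # key change: allocate on param.device instead of device="cpu"
        state["momentum_buffer"] = torch.zeros_like(param, device=param.device)
    buf = state["momentum_buffer"]
    buf.mul_(beta).add_(grad)  # in-place update, stays on the device
    return buf
```

With the buffer on the parameter's device, the per-step update is a pure on-device kernel, which is where the iteration-time improvement comes from.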


delock commented Oct 24, 2025

Hi @PKUWZP, I want to confirm this change with you. I saw comments saying to put the momentum buffer on CPU to save device memory, so I guess the intention was to allow training larger models with the Muon optimizer. But putting the momentum buffer on CPU also makes the Muon optimizer run slower. Maybe allowing the Muon optimizer with ZeRO offload would work for large models instead.


delock commented Nov 3, 2025

Hi @PKUWZP , do you have comments for this PR? Thanks!

@PKUWZP PKUWZP self-requested a review November 3, 2025 13:32

PKUWZP commented Nov 3, 2025

@delock Do you have any benchmarking results?


delock commented Nov 4, 2025

> @delock Do you have any benchmarking results?

I tested finetuning Qwen2.5-3B on 2xA100 cards with a global batch size of 8.

On the master branch the finetune iteration time is 1430ms. With this PR it is 918ms.

Profiling data shows that before this change, a lot of time was spent on H2D and D2H copies. After this change, there are no H2D or D2H copies among the top profiled items.
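A sketch of how such a check can be done with `torch.profiler`: profile a training-step callable and inspect the most expensive ops for `Memcpy HtoD`/`Memcpy DtoH` entries. The helper name and the step callable are assumptions for illustration, not the profiling setup actually used here.

```python
import torch
from torch.profiler import profile, ProfilerActivity

def top_ops(step_fn, num_warmup=2, num_iters=5, top_k=10):
    """Hypothetical helper: return names of the most expensive ops.

    Run a few warmup iterations, then profile num_iters calls of
    step_fn and sort aggregated events by self CPU time. On a GPU run
    one would also pass ProfilerActivity.CUDA and look for
    "Memcpy HtoD"/"Memcpy DtoH" near the top of the list.
    """
    for _ in range(num_warmup):
        step_fn()
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        for _ in range(num_iters):
            step_fn()
    events = sorted(prof.key_averages(),
                    key=lambda e: e.self_cpu_time_total, reverse=True)
    return [e.key for e in events[:top_k]]
```

If the memcpy entries disappear from the top of this list after the change, the optimizer step is no longer bouncing the momentum buffer between host and device.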
