

@delock delock commented Oct 24, 2025

This PR puts the Muon optimizer momentum buffer on GPU. This makes the Muon optimizer execute much faster (finetuning Qwen2.5-3B on 2xA100 cards, iteration time drops from 1500ms to 910ms). Previously this buffer was on CPU.
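For illustration, a minimal sketch of the idea in PyTorch: allocate the momentum buffer lazily on the same device as the parameter instead of pinning it to CPU. The function name, `state` dict layout, and `beta` default are hypothetical, not the actual DeepSpeed code.

```python
import torch

def muon_momentum_update(param, grad, state, beta=0.95):
    """Hypothetical helper: update a Muon-style momentum buffer in place.

    The buffer is created on param.device (GPU when the parameter is on
    GPU), so the update involves no H2D/D2H copies; previously the
    equivalent buffer lived on CPU.
    """
    if "momentum_buffer" not in state:
        # key change: allocate on param.device instead of device="cpu"
        state["momentum_buffer"] = torch.zeros_like(param, device=param.device)
    buf = state["momentum_buffer"]
    buf.mul_(beta).add_(grad)  # in-place update, stays on the device
    return buf
```

With the buffer on the parameter's device, the per-step update is a pure on-device kernel, which is where the iteration-time improvement comes from.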


delock commented Oct 24, 2025

Hi @PKUWZP, I want to confirm this change with you. I saw comments saying to put the momentum buffer on CPU to save device memory, so I guess the intention was to allow training larger models with the Muon optimizer. But putting the momentum buffer on CPU also makes the Muon optimizer run slower. Maybe allowing the Muon optimizer with ZeRO offload would work for large models instead.


delock commented Nov 3, 2025

Hi @PKUWZP , do you have comments for this PR? Thanks!

@PKUWZP PKUWZP self-requested a review November 3, 2025 13:32

PKUWZP commented Nov 3, 2025

@delock Do you have any benchmarking results?


delock commented Nov 4, 2025

> @delock Do you have any benchmarking results?

I tested finetuning Qwen2.5-3B on 2xA100 cards with a global batch size of 8.

On the master branch the finetune iteration time is 1430ms. With this PR it is 918ms.

Profiling data shows that before this change, a lot of time was spent on H2D and D2H copies. After this change, there are no H2D or D2H copies among the top profiled items.
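A sketch of how such a check can be done with `torch.profiler`: profile a training-step callable and inspect the most expensive ops for `Memcpy HtoD`/`Memcpy DtoH` entries. The helper name and the step callable are assumptions for illustration, not the profiling setup actually used here.

```python
import torch
from torch.profiler import profile, ProfilerActivity

def top_ops(step_fn, num_warmup=2, num_iters=5, top_k=10):
    """Hypothetical helper: return names of the most expensive ops.

    Run a few warmup iterations, then profile num_iters calls of
    step_fn and sort aggregated events by self CPU time. On a GPU run
    one would also pass ProfilerActivity.CUDA and look for
    "Memcpy HtoD"/"Memcpy DtoH" near the top of the list.
    """
    for _ in range(num_warmup):
        step_fn()
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        for _ in range(num_iters):
            step_fn()
    events = sorted(prof.key_averages(),
                    key=lambda e: e.self_cpu_time_total, reverse=True)
    return [e.key for e in events[:top_k]]
```

If the memcpy entries disappear from the top of this list after the change, the optimizer step is no longer bouncing the momentum buffer between host and device.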
