Trainer saving checkpoint crashes in multi-gpu (DDP): OSError: [Errno 39] Directory not empty #36076
Comments
same issue
If possible, can you also try this patch out (directly edit /lib/python3.11/site-packages/transformers/trainer.py with the fix below and re-run your script) and see if it fixes the error? Can you try converting transformers/trainer.py:3160 into …? I am hoping this should handle the … Thanks in advance!
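(The proposed replacement snippet did not survive extraction. Below is a rough, hypothetical sketch of the kind of guard being discussed, not the actual patch: the staging-to-final rename is performed by a single process and a lost race is treated as non-fatal. The names `staging_output_dir`, `output_dir`, and `is_main_process` are illustrative assumptions.)

```python
import os
import shutil


def promote_checkpoint(staging_output_dir: str, output_dir: str, is_main_process: bool) -> None:
    """Hypothetical sketch (not the actual patch): promote the staged checkpoint
    directory to its final name from a single process, tolerating lost races."""
    if staging_output_dir == output_dir:
        return
    if not is_main_process:
        return  # non-main ranks never touch the directory rename
    try:
        os.rename(staging_output_dir, output_dir)
    except OSError:
        # Another process already created a non-empty output_dir
        # (OSError: [Errno 39] Directory not empty); keep that copy and
        # discard the now-redundant staging directory.
        shutil.rmtree(staging_output_dir, ignore_errors=True)
```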
@SilverSoldier I can reproduce. We shouldn't …
@muellerzr I just tried the latest, and it works great now!
@muellerzr Thanks a lot!
System Info
Training with DDP crashes when saving a periodic/intermediate checkpoint within the Trainer class.
I'm using pytorch 2.5.1, accelerate (with DDP), SFTTrainer (Trainer), and an 8x A100 GPU machine.
It was working fine prior to this recent PR, where the bug was probably introduced: #35580
It crashes every time. It looks like several processes simultaneously try to copy/move a temporary checkpoint directory to its final location.
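For context on the error itself, here is a standalone illustration (not the Trainer code): `os.rename` cannot overwrite a non-empty destination directory, which is exactly what a rank hits if another rank has already promoted the staging checkpoint to the final name. The directory names are placeholders.

```python
import os

# Standalone illustration of the failure mode (not Trainer code).
os.makedirs("staging-ckpt", exist_ok=True)
os.makedirs("final-ckpt", exist_ok=True)
open(os.path.join("staging-ckpt", "model.bin"), "w").close()
open(os.path.join("final-ckpt", "model.bin"), "w").close()  # "another rank" already promoted its copy

# Raises OSError: [Errno 39] Directory not empty on Linux.
os.rename("staging-ckpt", "final-ckpt")
```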
@SilverSoldier @rwightman
transformers version: 4.49.0.dev0

Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
DDP with SFTTrainer
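No script was attached; a minimal reproduction in this spirit might look like the sketch below (the model, dataset, and step counts are placeholders, not taken from the report), launched across the 8 GPUs with something like `accelerate launch repro.py` or `torchrun --nproc_per_node=8 repro.py`.

```python
# repro.py -- hypothetical minimal reproduction; placeholder model/dataset/steps.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

train_dataset = load_dataset("trl-lib/Capybara", split="train[:1%]")  # placeholder dataset

args = SFTConfig(
    output_dir="sft-ddp-repro",
    per_device_train_batch_size=1,
    max_steps=20,
    save_steps=10,   # intermediate checkpointing is what triggers the crash
    logging_steps=5,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",  # placeholder small model
    train_dataset=train_dataset,
    args=args,
)
trainer.train()
```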
Expected behavior
No crashes when saving intermediate checkpoints.