
Trainer saving checkpoint crashes in multi-gpu (DDP): OSError: [Errno 39] Directory not empty #36076

Closed
myron opened this issue Feb 7, 2025 · 7 comments · Fixed by #36094
myron commented Feb 7, 2025

System Info

Training with DDP crashes when saving a periodic/intermediate checkpoint within the Trainer class.
I'm using PyTorch 2.5.1, Accelerate (with DDP), SFTTrainer (Trainer), on an 8x A100 GPU machine.

It was working fine prior to this recent PR, which likely introduced the bug:
#35580

It crashes every time. It appears that several processes simultaneously try to move the temporary checkpoint directory to its final location.

[rank1]:     return inner_training_loop(
[rank1]:            ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/lib/python3.11/site-packages/transformers/trainer.py", line 2555, in _inner_training_loop
[rank1]:     self._maybe_log_save_evaluate(
[rank1]:   File "/lib/python3.11/site-packages/transformers/trainer.py", line 3035, in _maybe_log_save_evaluate
[rank1]:     self._save_checkpoint(model, trial)
[rank1]:   File "/lib/python3.11/site-packages/transformers/trainer.py", line 3160, in _save_checkpoint
[rank1]:     shutil.rmtree(checkpoint_dir)
[rank1]:   File "/lib/python3.11/shutil.py", line 752, in rmtree
[rank1]:     _rmtree_safe_fd(fd, path, onerror)
[rank1]:   File "/lib/python3.11/shutil.py", line 703, in _rmtree_safe_fd
[rank1]:     onerror(os.unlink, fullname, sys.exc_info())
[rank1]:   File "/lib/python3.11/shutil.py", line 701, in _rmtree_safe_fd
[rank1]:     os.unlink(entry.name, dir_fd=topfd)
[rank1]: FileNotFoundError: [Errno 2] No such file or directory: 'model-00001-of-00002.safetensors'
[rank7]: Traceback (most recent call last):
[rank7]:   File "/lib/python3.11/site-packages/transformers/trainer.py", line 3157, in _save_checkpoint
[rank7]:     os.renames(output_dir, checkpoint_dir)
[rank7]:   File "<frozen os>", line 272, in renames
[rank7]: OSError: [Errno 39] Directory not empty: 'data/mymodel/tmp-checkpoint-16k7xf30' -> 'data/mymodel/checkpoint-20'
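For what it's worth, the `[Errno 39]` failure in the second traceback is deterministic once two ranks race: `os.renames()` cannot move a directory onto an existing, non-empty destination. A minimal standalone sketch (hypothetical paths, not the Trainer code) reproduces the same errno:

```python
import errno
import os
import tempfile

# Illustration only: os.renames() onto an existing non-empty destination
# fails with OSError [Errno 39], which is what a rank sees when another
# rank has already promoted its tmp-checkpoint-* directory.
base = tempfile.mkdtemp()
src = os.path.join(base, "tmp-checkpoint-aaaa")
dst = os.path.join(base, "checkpoint-20")
os.makedirs(src)
os.makedirs(dst)
# Simulate another rank having already populated the destination.
open(os.path.join(dst, "model.safetensors"), "w").close()

try:
    os.renames(src, dst)
except OSError as e:
    # On Linux this is ENOTEMPTY (39): "Directory not empty".
    assert e.errno == errno.ENOTEMPTY
    print("OSError:", e.strerror)
```

(On other POSIX systems the errno may be `EEXIST` instead; the traceback above is from Linux.)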

@SilverSoldier @rwightman

System

  • transformers version: 4.49.0.dev0
  • Platform: Linux-5.15.0-1032-oracle-x86_64-with-glibc2.31
  • Python version: 3.11.11
  • Huggingface_hub version: 0.28.1
  • Safetensors version: 0.5.2
  • Accelerate version: 1.3.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.5.1+cu124 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA A100-SXM4-80GB

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

DDP with SFTTrainer

Expected behavior

No crashes.

myron added the bug label Feb 7, 2025

MrToy commented Feb 7, 2025

same issue


SilverSoldier commented Feb 7, 2025

@myron / @MrToy can you please share the exact command and flags that you ran? I am trying with DDP but am not able to reproduce this error.

@SilverSoldier

If possible, can you also try this patch (directly edit /lib/python3.11/site-packages/transformers/trainer.py with the fix below and re-run your script) and see if it fixes the error?

Can you try converting transformers/trainer.py:3160

shutil.rmtree(checkpoint_dir)

into

shutil.rmtree(checkpoint_dir, ignore_errors=True)

I am hoping this handles the FileNotFoundError: [Errno 2] No such file or directory: 'model-00001-of-00002.safetensors' part when removing the directory, which should allow the removal to complete without errors.

Thanks in advance!
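For reference, `ignore_errors=True` simply makes `shutil.rmtree` swallow errors, including a path that is already gone or entries that vanish while it is being deleted, rather than raising. A one-line illustration (hypothetical path):

```python
import shutil

# With ignore_errors=True, rmtree on a path that no longer exists (or
# whose entries vanish mid-deletion, as in the rank1 traceback) returns
# silently instead of raising FileNotFoundError.
shutil.rmtree("/no/such/tmp-checkpoint-dir", ignore_errors=True)
print("rmtree returned without raising")
```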

@muellerzr

@SilverSoldier I can reproduce. We shouldn't use ignore_errors, but we should guard against this. Give me a few minutes to whip up a solution.
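(Not the actual fix from the linked PR; just a sketch of what such a guard could look like, under the assumption that only the main process promotes the staging directory. Function and variable names here are illustrative.)

```python
import os
import shutil
import tempfile

def finalize_checkpoint(rank: int, staging_dir: str, checkpoint_dir: str) -> None:
    """Hypothetical guard: only rank 0 promotes the staging directory to
    its final location, so there is no race on os.renames()/shutil.rmtree()."""
    if rank != 0:
        return
    if os.path.exists(checkpoint_dir):
        # A stale directory from a previous run can be removed safely here,
        # since no other rank touches it.
        shutil.rmtree(checkpoint_dir)
    os.renames(staging_dir, checkpoint_dir)

# Simulate 8 ranks calling into the save path.
base = tempfile.mkdtemp()
staging = os.path.join(base, "tmp-checkpoint-xyz")
final = os.path.join(base, "checkpoint-20")
os.makedirs(staging)
for rank in range(8):
    finalize_checkpoint(rank, staging, final)
print(os.path.isdir(final), os.path.exists(staging))  # prints: True False
```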


muellerzr commented Feb 7, 2025

@MrToy @myron can you try it via pip install git+https://github.com/huggingface/transformers?

It should work fine now; I tested both on 8 GPUs and on a multi-node 16-GPU setup.


myron commented Feb 7, 2025

@muellerzr I just tried the latest, and it works great now!
Thank you for the prompt fix!


myron commented Feb 7, 2025

@muellerzr
It may be too much to ask, but is there a chance we can also improve the log output?
It used to say "Saving model checkpoint to .../checkpoint-20", but now it says "Saving model checkpoint to tmp-checkpoint-pe8ry7n_". See the console log below, with many non-informative messages. At the least, it should report where the temp dir was moved to (the actual checkpoint path).

Thanks a lot!

***** Running Evaluation *****
[INFO|trainer.py:4168] 2025-02-07 15:26:56,948 >>   Num examples = 10671
[INFO|trainer.py:4171] 2025-02-07 15:26:56,948 >>   Batch size = 1
{'eval_loss': 0.7353992462158203, 'eval_runtime': 132.7399, 'eval_samples_per_second': 80.39, 'eval_steps_per_second': 10.05, 'epoch': 0.03}    
  1%|▉                                     | 40/4500 [07:21<4:45:21,  3.84s/it]
[INFO|trainer.py:3850] 2025-02-07 15:29:09,695 >> Saving model checkpoint to data/tmp-checkpoint-pe8ry7n_
[INFO|configuration_utils.py:420] 2025-02-07 15:29:09,701 >> Configuration saved in data/tmp-checkpoint-pe8ry7n_/config.json
[INFO|configuration_utils.py:907] 2025-02-07 15:29:09,704 >> Configuration saved in data/tmp-checkpoint-pe8ry7n_/generation_config.json
[INFO|modeling_utils.py:3028] 2025-02-07 15:29:16,285 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 2 checkpoint shards. You can find where each parameters has been saved in the index located at data/tmp-checkpoint-pe8ry7n_/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2500] 2025-02-07 15:29:16,289 >> tokenizer config file saved in data/tmp-checkpoint-pe8ry7n_/tokenizer_config.json
[INFO|tokenization_utils_base.py:2509] 2025-02-07 15:29:16,291 >> Special tokens file saved in data/tmp-checkpoint-pe8ry7n_/special_tokens_map.json
