
Trainer saving checkpoint crashes in multi-gpu (DDP): OSError: [Errno 39] Directory not empty #36076

Closed
myron opened this issue Feb 7, 2025 · 7 comments · Fixed by #36094
myron commented Feb 7, 2025

System Info

Training with DDP crashes when saving a periodic/intermediate checkpoint within the Trainer class.
I'm using PyTorch 2.5.1, Accelerate (with DDP), SFTTrainer (Trainer), on an 8x A100 GPU machine.

It was working fine prior to this recent PR, which likely introduced the bug:
#35580

It crashes every time. It appears that several processes simultaneously try to move the temporary checkpoint directory to its final location.

[rank1]:     return inner_training_loop(
[rank1]:            ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/lib/python3.11/site-packages/transformers/trainer.py", line 2555, in _inner_training_loop
[rank1]:     self._maybe_log_save_evaluate(
[rank1]:   File "/lib/python3.11/site-packages/transformers/trainer.py", line 3035, in _maybe_log_save_evaluate
[rank1]:     self._save_checkpoint(model, trial)
[rank1]:   File "/lib/python3.11/site-packages/transformers/trainer.py", line 3160, in _save_checkpoint
[rank1]:     shutil.rmtree(checkpoint_dir)
[rank1]:   File "/lib/python3.11/shutil.py", line 752, in rmtree
[rank1]:     _rmtree_safe_fd(fd, path, onerror)
[rank1]:   File "/lib/python3.11/shutil.py", line 703, in _rmtree_safe_fd
[rank1]:     onerror(os.unlink, fullname, sys.exc_info())
[rank1]:   File "/lib/python3.11/shutil.py", line 701, in _rmtree_safe_fd
[rank1]:     os.unlink(entry.name, dir_fd=topfd)
[rank1]: FileNotFoundError: [Errno 2] No such file or directory: 'model-00001-of-00002.safetensors'
[rank7]: Traceback (most recent call last):
[rank7]:   File "/lib/python3.11/site-packages/transformers/trainer.py", line 3157, in _save_checkpoint
[rank7]:     os.renames(output_dir, checkpoint_dir)
[rank7]:   File "<frozen os>", line 272, in renames
[rank7]: OSError: [Errno 39] Directory not empty: 'data/mymodel/tmp-checkpoint-16k7xf30' -> 'data/mymodel/checkpoint-20'
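For what it's worth, the `[Errno 39]` failure in the second traceback is deterministic once two ranks race: `os.renames()` cannot move a directory onto an existing, non-empty destination. A minimal standalone sketch (hypothetical paths, not the Trainer code) reproduces the same errno:

```python
import errno
import os
import tempfile

# Illustration only: os.renames() onto an existing non-empty destination
# fails with OSError [Errno 39], which is what a rank sees when another
# rank has already promoted its tmp-checkpoint-* directory.
base = tempfile.mkdtemp()
src = os.path.join(base, "tmp-checkpoint-aaaa")
dst = os.path.join(base, "checkpoint-20")
os.makedirs(src)
os.makedirs(dst)
# Simulate another rank having already populated the destination.
open(os.path.join(dst, "model.safetensors"), "w").close()

try:
    os.renames(src, dst)
except OSError as e:
    # On Linux this is ENOTEMPTY (39): "Directory not empty".
    assert e.errno == errno.ENOTEMPTY
    print("OSError:", e.strerror)
```

(On other POSIX systems the errno may be `EEXIST` instead; the traceback above is from Linux.)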

@SilverSoldier @rwightman

System

  • transformers version: 4.49.0.dev0
  • Platform: Linux-5.15.0-1032-oracle-x86_64-with-glibc2.31
  • Python version: 3.11.11
  • Huggingface_hub version: 0.28.1
  • Safetensors version: 0.5.2
  • Accelerate version: 1.3.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.5.1+cu124 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA A100-SXM4-80GB

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

DDP with SFTTrainer

Expected behavior

No crashes.

myron added the bug label Feb 7, 2025

MrToy commented Feb 7, 2025

same issue


SilverSoldier commented Feb 7, 2025

@myron / @MrToy can you please share the exact command and flags that you ran? I am trying with DDP but am not able to reproduce this error.

@SilverSoldier

If possible, can you also try this patch (directly edit /lib/python3.11/site-packages/transformers/trainer.py with the fix below and re-run your script) and see if it fixes the error?

Can you try converting transformers/trainer.py:3160

shutil.rmtree(checkpoint_dir)

into

shutil.rmtree(checkpoint_dir, ignore_errors=True)

I am hoping this handles the FileNotFoundError: [Errno 2] No such file or directory: 'model-00001-of-00002.safetensors' part when removing the directory, which should allow the removal to complete without errors.

Thanks in advance!
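For reference, `ignore_errors=True` simply makes `shutil.rmtree` swallow errors, including a path that is already gone or entries that vanish while it is being deleted, rather than raising. A one-line illustration (hypothetical path):

```python
import shutil

# With ignore_errors=True, rmtree on a path that no longer exists (or
# whose entries vanish mid-deletion, as in the rank1 traceback) returns
# silently instead of raising FileNotFoundError.
shutil.rmtree("/no/such/tmp-checkpoint-dir", ignore_errors=True)
print("rmtree returned without raising")
```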

@muellerzr

@SilverSoldier I can reproduce. We shouldn't use ignore_errors, but we should guard against this. Give me a few minutes to whip up a solution.
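(Not the actual fix from the linked PR; just a sketch of what such a guard could look like, under the assumption that only the main process promotes the staging directory. Function and variable names here are illustrative.)

```python
import os
import shutil
import tempfile

def finalize_checkpoint(rank: int, staging_dir: str, checkpoint_dir: str) -> None:
    """Hypothetical guard: only rank 0 promotes the staging directory to
    its final location, so there is no race on os.renames()/shutil.rmtree()."""
    if rank != 0:
        return
    if os.path.exists(checkpoint_dir):
        # A stale directory from a previous run can be removed safely here,
        # since no other rank touches it.
        shutil.rmtree(checkpoint_dir)
    os.renames(staging_dir, checkpoint_dir)

# Simulate 8 ranks calling into the save path.
base = tempfile.mkdtemp()
staging = os.path.join(base, "tmp-checkpoint-xyz")
final = os.path.join(base, "checkpoint-20")
os.makedirs(staging)
for rank in range(8):
    finalize_checkpoint(rank, staging, final)
print(os.path.isdir(final), os.path.exists(staging))  # prints: True False
```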


muellerzr commented Feb 7, 2025

@MrToy @myron can you try it via pip install git+https://github.com/huggingface/transformers?

It should work fine now; I tested both on 8 GPUs and on a multi-node 16-GPU setup.


myron commented Feb 7, 2025

@muellerzr I just tried the latest, and it works great now!
Thank you for the prompt fix!


myron commented Feb 7, 2025

@muellerzr
It may be too much to ask, but is there a chance we can also improve the log output?
It used to say "Saving model checkpoint to .../checkpoint-20", but now it says "Saving model checkpoint to tmp-checkpoint-pe8ry7n_". See the console log below, with many non-informative messages. At the least, it should report where the temp dir was moved to (the actual checkpoint path).

Thanks a lot!

***** Running Evaluation *****
[INFO|trainer.py:4168] 2025-02-07 15:26:56,948 >>   Num examples = 10671
[INFO|trainer.py:4171] 2025-02-07 15:26:56,948 >>   Batch size = 1
{'eval_loss': 0.7353992462158203, 'eval_runtime': 132.7399, 'eval_samples_per_second': 80.39, 'eval_steps_per_second': 10.05, 'epoch': 0.03}    
  1%|▉                                     | 40/4500 [07:21<4:45:21,  3.84s/it]
[INFO|trainer.py:3850] 2025-02-07 15:29:09,695 >> Saving model checkpoint to data/tmp-checkpoint-pe8ry7n_
[INFO|configuration_utils.py:420] 2025-02-07 15:29:09,701 >> Configuration saved in data/tmp-checkpoint-pe8ry7n_/config.json
[INFO|configuration_utils.py:907] 2025-02-07 15:29:09,704 >> Configuration saved in data/tmp-checkpoint-pe8ry7n_/generation_config.json
[INFO|modeling_utils.py:3028] 2025-02-07 15:29:16,285 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 2 checkpoint shards. You can find where each parameters has been saved in the index located at data/tmp-checkpoint-pe8ry7n_/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2500] 2025-02-07 15:29:16,289 >> tokenizer config file saved in data/tmp-checkpoint-pe8ry7n_/tokenizer_config.json
[INFO|tokenization_utils_base.py:2509] 2025-02-07 15:29:16,291 >> Special tokens file saved in data/tmp-checkpoint-pe8ry7n_/special_tokens_map.json
