Describe the bug
We're observing that "pre-uploaded" progress (via Xet) is sometimes lost after a graceful shutdown (Ctrl-C) and restart. When this happens, the uploader re-uploads data that was previously shown as fully pre-uploaded, leading to significantly higher actual network traffic than the dataset size.
Specifically, we observed:
- Pre-upload progress appears complete
  Example: in a previous run, the UI/logs showed ~17.5 TB of files as pre-uploaded. At that point, the only active thread was performing the commit, while the other threads were idle/waiting.
- After Ctrl-C and restart, pre-upload progress drops
  After stopping the program with Ctrl-C (graceful shutdown) and restarting the same upload job, the reported pre-uploaded amount drops from ~17.5 TB to ~5.3 TB. The uploader then re-uploads the missing ~12 TB, even though it previously indicated those files were already pre-uploaded.
- Network traffic is much larger than the dataset size
  We see significantly more traffic than the total file size, in both upload and download directions. Additionally, during the commit phase, we sometimes observe large upload and download traffic even when the logs indicate all data is already pre-uploaded via Xet. Interestingly, this "large traffic during commit" pattern does not occur after a restart in which upload progress is lost and the data is re-uploaded.
We expect that if files are already pre-uploaded, that state should be durable across a graceful restart, and the uploader should resume from the last known pre-upload point without re-uploading previously completed data.
As a side note, the upload is slow on our end, and we would like to confirm: is there a time-to-live (TTL) for pre-uploaded data before commit? We suspect this might be a cause of the problem.
The issue is observed in the following repos: https://huggingface.co/datasets/OpenDCAI/PKU_TianWang/tree/main, https://huggingface.co/datasets/OpenDCAI/PKU_TianWang2/tree/main, and https://huggingface.co/datasets/OpenDCAI/PKU_TianWang3/tree/main
Reproduction
Use hf upload-large-folder to upload a dataset of 26,156 files totaling 17.5 TB, using 4 threads. All files are of similar size.
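For reference, the job was launched roughly as follows (a sketch: the local directory path is a placeholder, and the `--num-workers` value corresponds to the 4 threads mentioned above):

```shell
# upload-large-folder keeps resumption state in a local .cache/huggingface/
# folder inside the uploaded directory, which is why we expect pre-upload
# progress to survive a Ctrl-C and restart.
hf upload-large-folder OpenDCAI/PKU_TianWang /path/to/local/dataset \
    --repo-type=dataset \
    --num-workers=4
```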
Logs
System info
- huggingface_hub version: 1.3.2
- Platform: Linux-6.12.0-124.21.1.el10_1.x86_64-x86_64-with-glibc2.39
- Python version: 3.12.12
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Running in Google Colab Enterprise ?: No
- Token path ?: /home/<username>/.cache/huggingface/token
- Has saved token ?: True
- Configured git credential helpers:
- Installation method: unknown
- httpx: 0.28.1
- hf_xet: 1.2.0
- gradio: N/A
- tensorboard: N/A
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: /home/<username>/.cache/huggingface/hub
- HF_ASSETS_CACHE: /home/<username>/.cache/huggingface/assets
- HF_TOKEN_PATH: /home/<username>/.cache/huggingface/token
- HF_STORED_TOKENS_PATH: /home/<username>/.cache/huggingface/stored_tokens
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_DISABLE_XET: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10
- HF_XET_HIGH_PERFORMANCE: False