Pre-upload progress is not persisted across restart (Xet), causing re-uploads and excess traffic #3726

@uebian

Description

Describe the bug

We're observing that "pre-uploaded" progress (via Xet) is sometimes lost after a graceful shutdown (Ctrl-C) and restart. When this happens, the uploader re-uploads data that was previously reported as fully pre-uploaded, producing actual network traffic far in excess of the dataset size.

Specifically, we observed:

  1. Pre-upload progress appears complete
    Example: in a previous run, the UI/logs showed ~17.5 TB of files as pre-uploaded. At that point, the only active thread was performing the commit, while the other threads were idle/waiting.
  2. After Ctrl-C and restart, pre-upload progress drops
    After stopping the program with Ctrl-C (graceful shutdown) and restarting the same upload job, the reported pre-uploaded amount drops from ~17.5 TB → ~5.3 TB. The uploader then uploads the missing ~12 TB again, even though it previously indicated those files were already pre-uploaded.
  3. Network traffic is much larger than the dataset size
    We see significantly more traffic than the total file size, in both the upload and download directions. Additionally, during the commit phase, we sometimes observe large upload + download traffic even when the logs indicate all data is already pre-uploaded via Xet. Interestingly, this "large traffic during commit" pattern does not occur after a restart in which upload progress was lost and the data was re-uploaded.

We expect that if files are already pre-uploaded, that state should be durable across a graceful restart, and the uploader should resume from the last known pre-upload point without re-uploading previously completed data.

As a side note, the upload is slow on our end, and we would like to confirm whether there is a time-to-live (TTL) for pre-uploaded data before commit. We suspect this might be a cause of the problem.

The issue is observed in the following repos: https://huggingface.co/datasets/OpenDCAI/PKU_TianWang/tree/main , https://huggingface.co/datasets/OpenDCAI/PKU_TianWang2/tree/main , and https://huggingface.co/datasets/OpenDCAI/PKU_TianWang3/tree/main

Reproduction

Use hf upload-large-folder to upload a dataset of 26,156 files (17.5 TB in total) using 4 worker threads. All files are of similar size.
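For reference, the reproduction above corresponds roughly to the following command line. The local path is a placeholder, and the repo id is taken from the affected repos listed earlier; flag names follow the current hf CLI (--repo-type, --num-workers), not the exact invocation used in the report:

```shell
# Sketch of the reproduction command (local path is a placeholder).
# upload-large-folder chunks the work across worker threads and is
# designed to be resumable: re-running the same command after an
# interruption should pick up from the last persisted state.
hf upload-large-folder OpenDCAI/PKU_TianWang ./local_dataset \
    --repo-type=dataset \
    --num-workers=4
```

The bug reported here is that after Ctrl-C and re-running this command, the resumed run does not credit all previously pre-uploaded data.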

Logs

System info

- huggingface_hub version: 1.3.2
- Platform: Linux-6.12.0-124.21.1.el10_1.x86_64-x86_64-with-glibc2.39
- Python version: 3.12.12
- Running in iPython ?: No
- Running in notebook ?: No
- Running in Google Colab ?: No
- Running in Google Colab Enterprise ?: No
- Token path ?: /home/<username>/.cache/huggingface/token
- Has saved token ?: True
- Configured git credential helpers: 
- Installation method: unknown
- httpx: 0.28.1
- hf_xet: 1.2.0
- gradio: N/A
- tensorboard: N/A
- ENDPOINT: https://huggingface.co
- HF_HUB_CACHE: /home/<username>/.cache/huggingface/hub
- HF_ASSETS_CACHE: /home/<username>/.cache/huggingface/assets
- HF_TOKEN_PATH: /home/<username>/.cache/huggingface/token
- HF_STORED_TOKENS_PATH: /home/<username>/.cache/huggingface/stored_tokens
- HF_HUB_OFFLINE: False
- HF_HUB_DISABLE_TELEMETRY: False
- HF_HUB_DISABLE_PROGRESS_BARS: None
- HF_HUB_DISABLE_SYMLINKS_WARNING: False
- HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
- HF_HUB_DISABLE_IMPLICIT_TOKEN: False
- HF_HUB_DISABLE_XET: False
- HF_HUB_ETAG_TIMEOUT: 10
- HF_HUB_DOWNLOAD_TIMEOUT: 10
- HF_XET_HIGH_PERFORMANCE: False

Labels: bug