
load_dataset does not check .no_exist files in the hub cache #7686

@jmaccarl

Describe the bug

I'm not entirely sure if this should be submitted as a bug in the datasets library or the huggingface_hub library, given it could be fixed at different levels of the stack.

The fundamental issue is that the load_dataset API does not use the .no_exist files in the hub cache, unlike other wrapper APIs. This is because utils.file_utils.cached_path calls hf_hub_download directly instead of first checking file_download.try_to_load_from_cache from huggingface_hub (see utils.hub.cached_files in the transformers library for one alternative example).

As a result, every call issues unnecessary metadata HTTP requests for files that do not exist in the repository. load_dataset neither generates the .no_exist cache files nor uses them.
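To make the missing behavior concrete, here is a minimal sketch of a negative cache that is consulted before any network call. It is not the huggingface_hub implementation: the helper names (repo_folder, marker_path, file_exists_remotely) and the head_request callback are hypothetical, and the .no_exist/<revision>/<filename> layout is my assumption about how the hub cache stores these markers.

```python
from pathlib import Path
from typing import Callable


def repo_folder(cache_dir: Path, repo_id: str, repo_type: str = "dataset") -> Path:
    # Assumed folder naming, e.g. datasets--Salesforce--wikitext
    return cache_dir / f"{repo_type}s--{repo_id.replace('/', '--')}"


def marker_path(cache_dir: Path, repo_id: str, revision: str, filename: str) -> Path:
    # Assumed marker location: <repo folder>/.no_exist/<revision>/<filename>
    return repo_folder(cache_dir, repo_id) / ".no_exist" / revision / filename


def file_exists_remotely(
    cache_dir: Path,
    repo_id: str,
    revision: str,
    filename: str,
    head_request: Callable[[str], bool],
) -> bool:
    """Return whether the file exists, skipping the request if it is known to be missing.

    head_request is a hypothetical stand-in for the metadata HTTP request.
    Only negative results are cached here; positive results always hit head_request.
    """
    marker = marker_path(cache_dir, repo_id, revision, filename)
    if marker.is_file():
        return False  # known missing for this revision, no request needed

    exists = head_request(filename)
    if not exists:
        # Record the miss as an empty file so later calls can skip the request.
        marker.parent.mkdir(parents=True, exist_ok=True)
        marker.touch()
    return exists
```

Keying the marker on a pinned revision, as in the reproduction below, is what makes caching the miss safe: a file absent at a given commit stays absent at that commit.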

Steps to reproduce the bug

Save the following snippet as repro.py and run it with the cache directories pointed at clean paths:
env HF_HOME=~/local_hf_hub python repro.py

from datasets import load_dataset

import huggingface_hub

# monkeypatch to print out metadata requests being made
original_get_hf_file_metadata = huggingface_hub.file_download.get_hf_file_metadata

def get_hf_file_metadata_wrapper(*args, **kwargs):
    print("File metadata request made (get_hf_file_metadata):", args, kwargs)
    return original_get_hf_file_metadata(*args, **kwargs)

# Apply the patch
huggingface_hub.file_download.get_hf_file_metadata = get_hf_file_metadata_wrapper

dataset = load_dataset(
    "Salesforce/wikitext",
    "wikitext-2-v1",
    split="test",
    trust_remote_code=True,
    cache_dir="~/local_datasets",
    revision="b08601e04326c79dfdd32d625aee71d232d685c3",
)

The script can be run repeatedly, and every run prints the same metadata requests for files that do not exist in the repository:

File metadata request made (get_hf_file_metadata): () {'url': 'https://huggingface.co/datasets/Salesforce/wikitext/resolve/b08601e04326c79dfdd32d625aee71d232d685c3/wikitext.py', 'proxies': None, 'timeout': 10, 'headers': {'user-agent': 'datasets/3.6.0; hf_hub/0.33.2; python/3.12.11; torch/2.7.0; huggingface_hub/0.33.2; pyarrow/20.0.0; jax/0.5.3'}, 'token': None}
File metadata request made (get_hf_file_metadata): () {'url': 'https://huggingface.co/datasets/Salesforce/wikitext/resolve/b08601e04326c79dfdd32d625aee71d232d685c3/.huggingface.yaml', 'proxies': None, 'timeout': 10, 'headers': {'user-agent': 'datasets/3.6.0; hf_hub/0.33.2; python/3.12.11; torch/2.7.0; huggingface_hub/0.33.2; pyarrow/20.0.0; jax/0.5.3'}, 'token': None}
File metadata request made (get_hf_file_metadata): () {'url': 'https://huggingface.co/datasets/Salesforce/wikitext/resolve/b08601e04326c79dfdd32d625aee71d232d685c3/dataset_infos.json', 'proxies': None, 'timeout': 10, 'headers': {'user-agent': 'datasets/3.6.0; hf_hub/0.33.2; python/3.12.11; torch/2.7.0; huggingface_hub/0.33.2; pyarrow/20.0.0; jax/0.5.3'}, 'token': None}

You can also see that the .no_exist folder is never created:

$ ls ~/local_hf_hub/hub/datasets--Salesforce--wikitext/
blobs  refs  snapshots

Expected behavior

The "File metadata request made" output should stop after the first call, and the .no_exist directory and its files should be populated under ~/local_hf_hub/hub/datasets--Salesforce--wikitext/
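One way to check the second half of that expectation from Python rather than with ls is sketched below. The helper name list_no_exist_markers is hypothetical, and the function only walks whatever sits under .no_exist without assuming more about the layout.

```python
from pathlib import Path


def list_no_exist_markers(repo_cache: Path) -> list[str]:
    """Return the relative paths of all files under <repo_cache>/.no_exist, sorted."""
    root = repo_cache / ".no_exist"
    if not root.is_dir():
        return []  # the situation reported in this issue
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())


if __name__ == "__main__":
    # Path taken from the reproduction above.
    repo_cache = Path("~/local_hf_hub/hub/datasets--Salesforce--wikitext").expanduser()
    print(list_no_exist_markers(repo_cache))
```

With the current behavior this prints an empty list after any number of runs; with the expected behavior it would list entries for files such as wikitext.py, .huggingface.yaml and dataset_infos.json.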

Environment info

  • datasets version: 3.6.0
  • Platform: Linux-6.5.13-65-650-4141-22041-coreweave-amd64-85c45edc-x86_64-with-glibc2.35
  • Python version: 3.12.11
  • huggingface_hub version: 0.33.2
  • PyArrow version: 20.0.0
  • Pandas version: 2.3.1
  • fsspec version: 2024.9.0
