Describe the bug
I'm not entirely sure if this should be submitted as a bug in the datasets library or the huggingface_hub library, given it could be fixed at different levels of the stack.
The fundamental issue is that the load_datasets api doesn't use the .no_exist files in the hub cache unlike other wrapper APIs that do. This is because the utils.file_utils.cached_path used directly calls hf_hub_download instead of using file_download.try_to_load_from_cache from huggingface_hub (see transformers library utils.hub.cached_files for one alternate example).
This results in unnecessary metadata HTTP requests occurring for files that don't exist on every call. It won't generate the .no_exist cache files, nor will it use them.
Steps to reproduce the bug
Run the following snippet as one example (setting cache dirs to clean paths for clarity)
env HF_HOME=~/local_hf_hub python repro.py
from datasets import load_dataset
import huggingface_hub
# monkeypatch to print out metadata requests being made
original_get_hf_file_metadata = huggingface_hub.file_download.get_hf_file_metadata
def get_hf_file_metadata_wrapper(*args, **kwargs):
print("File metadata request made (get_hf_file_metadata):", args, kwargs)
return original_get_hf_file_metadata(*args, **kwargs)
# Apply the patch
huggingface_hub.file_download.get_hf_file_metadata = get_hf_file_metadata_wrapper
dataset = load_dataset(
"Salesforce/wikitext",
"wikitext-2-v1",
split="test",
trust_remote_code=True,
cache_dir="~/local_datasets",
revision="b08601e04326c79dfdd32d625aee71d232d685c3",
)
This may be called over and over again, and you will see the same calls for files that don't exist:
File metadata request made (get_hf_file_metadata): () {'url': 'https://huggingface.co/datasets/Salesforce/wikitext/resolve/b08601e04326c79dfdd32d625aee71d232d685c3/wikitext.py', 'proxies': None, 'timeout': 10, 'headers': {'user-agent': 'datasets/3.6.0; hf_hub/0.33.2; python/3.12.11; torch/2.7.0; huggingface_hub/0.33.2; pyarrow/20.0.0; jax/0.5.3'}, 'token': None}
File metadata request made (get_hf_file_metadata): () {'url': 'https://huggingface.co/datasets/Salesforce/wikitext/resolve/b08601e04326c79dfdd32d625aee71d232d685c3/.huggingface.yaml', 'proxies': None, 'timeout': 10, 'headers': {'user-agent': 'datasets/3.6.0; hf_hub/0.33.2; python/3.12.11; torch/2.7.0; huggingface_hub/0.33.2; pyarrow/20.0.0; jax/0.5.3'}, 'token': None}
File metadata request made (get_hf_file_metadata): () {'url': 'https://huggingface.co/datasets/Salesforce/wikitext/resolve/b08601e04326c79dfdd32d625aee71d232d685c3/dataset_infos.json', 'proxies': None, 'timeout': 10, 'headers': {'user-agent': 'datasets/3.6.0; hf_hub/0.33.2; python/3.12.11; torch/2.7.0; huggingface_hub/0.33.2; pyarrow/20.0.0; jax/0.5.3'}, 'token': None}
And you can see that the .no_exist folder is never created
$ ls ~/local_hf_hub/hub/datasets--Salesforce--wikitext/
blobs refs snapshots
Expected behavior
The expected behavior is for the print "File metadata request made" to stop after the first call, and for .no_exist directory & files to be populated under ~/local_hf_hub/hub/datasets--Salesforce--wikitext/
Environment info
datasets version: 3.6.0
- Platform: Linux-6.5.13-65-650-4141-22041-coreweave-amd64-85c45edc-x86_64-with-glibc2.35
- Python version: 3.12.11
huggingface_hub version: 0.33.2
- PyArrow version: 20.0.0
- Pandas version: 2.3.1
fsspec version: 2024.9.0
Describe the bug
I'm not entirely sure if this should be submitted as a bug in the
datasetslibrary or thehuggingface_hublibrary, given it could be fixed at different levels of the stack.The fundamental issue is that the
load_datasetsapi doesn't use the.no_existfiles in the hub cache unlike other wrapper APIs that do. This is because theutils.file_utils.cached_pathused directly callshf_hub_downloadinstead of usingfile_download.try_to_load_from_cachefromhuggingface_hub(seetransformerslibraryutils.hub.cached_filesfor one alternate example).This results in unnecessary metadata HTTP requests occurring for files that don't exist on every call. It won't generate the .no_exist cache files, nor will it use them.
Steps to reproduce the bug
Run the following snippet as one example (setting cache dirs to clean paths for clarity)
env HF_HOME=~/local_hf_hub python repro.pyThis may be called over and over again, and you will see the same calls for files that don't exist:
And you can see that the .no_exist folder is never created
Expected behavior
The expected behavior is for the print "File metadata request made" to stop after the first call, and for .no_exist directory & files to be populated under ~/local_hf_hub/hub/datasets--Salesforce--wikitext/
Environment info
datasetsversion: 3.6.0huggingface_hubversion: 0.33.2fsspecversion: 2024.9.0