Description
Feature request
Simplify downloading and streaming datasets locally. Specifically, perhaps add an option to load_dataset(..., streaming="download_first") or add better support for streaming symlinked or arrow files.
Motivation
I have downloaded FineWeb-edu locally and am currently trying to stream the dataset from the local files. I have both the raw parquet files, obtained with huggingface-cli download --repo-type dataset HuggingFaceFW/fineweb-edu, and the processed arrow files, obtained with load_dataset("HuggingFaceFW/fineweb-edu").
Streaming the files locally does not work well for either file type, for two different reasons.
Arrow files
When running load_dataset("arrow", data_files={"train": "~/.cache/huggingface/datasets/HuggingFaceFW___fineweb-edu/default/0.0.0/5b89d1ea9319fe101b3cbdacd89a903aca1d6052/fineweb-edu-train-*.arrow"}), resolving the data files is fast. However, because arrow is not included in the list of known file extensions, every file is opened and scanned to determine its compression type. Adding arrow to the known extension types resolves this issue.
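To illustrate why the extension list matters, here is a minimal, stdlib-only sketch of the pattern described above. It is not the datasets library's actual code: KNOWN_EXTENSIONS, MAGIC_NUMBERS, infer_compression and count_opens are hypothetical names, and the extension set is an assumed stand-in for the real list.

```python
import os
import tempfile

# Hypothetical stand-in for a list of extensions known to be uncompressed.
KNOWN_EXTENSIONS = {".parquet", ".json", ".csv"}

# Magic-number prefixes for two common compression formats.
MAGIC_NUMBERS = {b"\x1f\x8b": "gzip", b"BZh": "bz2"}


def infer_compression(path, known_extensions, opened):
    """Return a compression name or None, recording every file that gets opened."""
    extension = os.path.splitext(path)[1]
    if extension in known_extensions:
        return None  # fast path: the extension alone answers the question
    opened.append(path)  # slow path: the file header has to be read
    with open(path, "rb") as f:
        head = f.read(4)
    for magic, name in MAGIC_NUMBERS.items():
        if head.startswith(magic):
            return name
    return None


def count_opens(known_extensions, n_files=3):
    """Create n_files dummy .arrow files and count how many must be opened."""
    opened = []
    with tempfile.TemporaryDirectory() as tmp:
        for i in range(n_files):
            path = os.path.join(tmp, f"train-{i:05d}.arrow")
            with open(path, "wb") as f:
                f.write(b"dummy")
            infer_compression(path, known_extensions, opened)
    return len(opened)
```

With the assumed list, count_opens(KNOWN_EXTENSIONS) opens every file, while count_opens(KNOWN_EXTENSIONS | {".arrow"}) opens none. For a dataset with thousands of shards, that difference is the startup delay described above.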
Parquet files
When running load_dataset("arrow", data_files={"train": "~/.cache/huggingface/hub/dataset-HuggingFaceFW___fineweb-edu/snapshots/5b89d1ea9319fe101b3cbdacd89a903aca1d6052/data/CC-MAIN-*/train-*.parquet"}), the paths do not get resolved, because the parquet files are symlinks into the blobs directory (which holds all files in case there are different versions). This occurs because the pattern matching checks whether each path is a file and does not account for symlinks. Symlinks (at least on my machine) are reported as type "other".
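The symlink behavior can be reproduced with the standard library alone. The sketch below is an assumption about the mechanism, not the library's actual pattern-matching code: classify and demo are hypothetical helpers, and it assumes a platform where creating symlinks is permitted. A type check that does not follow symlinks sees the link itself, which is neither a regular file nor a directory.

```python
import os
import stat
import tempfile


def classify(path, follow_symlinks):
    """Return 'file', 'directory', or 'other' for path (hypothetical helper)."""
    mode = os.stat(path, follow_symlinks=follow_symlinks).st_mode
    if stat.S_ISREG(mode):
        return "file"
    if stat.S_ISDIR(mode):
        return "directory"
    return "other"


def demo():
    """Classify a symlinked parquet file with and without following the link."""
    with tempfile.TemporaryDirectory() as tmp:
        blob = os.path.join(tmp, "blob")
        link = os.path.join(tmp, "train-00000.parquet")
        with open(blob, "wb") as f:
            f.write(b"data")
        os.symlink(blob, link)  # mimics a snapshot entry pointing at a blob
        return (
            classify(link, follow_symlinks=False),
            classify(link, follow_symlinks=True),
        )
```

Here demo() returns ("other", "file"): the same path is skipped by a file-only filter unless the link is followed first.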
Your contribution
I have created a PR that fixes arrow file streaming and symlink resolution. However, I have not checked locally whether the existing tests pass or whether new tests need to be added.
IMO, the easiest option would be to add a streaming="download_first" option, but I'm afraid that exceeds my current knowledge of how the datasets library works. #7083