More easily support streaming local files #7084

@fschlatt

Description

Feature request

Simplify downloading and streaming datasets locally. Specifically, perhaps add an option to load_dataset(..., streaming="download_first") or add better support for streaming symlinked or arrow files.

Motivation

I have downloaded FineWeb-edu locally and am currently trying to stream the dataset from the local files. I have both the raw parquet files, obtained with huggingface-cli download --repo-type dataset HuggingFaceFW/fineweb-edu, and the processed arrow files, obtained with load_dataset("HuggingFaceFW/fineweb-edu").

Streaming the files locally does not work well for either file type, for two different reasons.

Arrow files

When running load_dataset("arrow", data_files={"train": "~/.cache/huggingface/datasets/HuggingFaceFW___fineweb-edu/default/0.0.0/5b89d1ea9319fe101b3cbdacd89a903aca1d6052/fineweb-edu-train-*.arrow"}), resolving the data files is fast. However, because arrow is not included in the list of known file extensions, every file is opened and scanned to determine its compression type. Adding arrow to the known extensions resolves this issue.
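To illustrate why a missing extension is costly, here is a minimal sketch of extension-based short-circuiting. It is not the datasets library's actual code: KNOWN_EXTENSIONS, MAGIC_NUMBERS, and infer_compression are hypothetical names, and the `opened` list exists only to make the file I/O visible.

```python
import os

# Hypothetical list of extensions that are trusted without inspecting the file.
KNOWN_EXTENSIONS = {".parquet", ".csv", ".json", ".arrow"}

# Leading magic bytes of a few common compression formats.
MAGIC_NUMBERS = {
    b"\x1f\x8b": "gzip",
    b"BZh": "bz2",
    b"\xfd7zXZ\x00": "xz",
}


def infer_compression(path, opened):
    """Return a compression name or None; record every file that had to be opened."""
    ext = os.path.splitext(path)[1].lower()
    if ext in KNOWN_EXTENSIONS:
        # Fast path: the extension is known, so no file I/O is needed.
        return None
    # Slow path: open the file and sniff its leading bytes.
    opened.append(path)
    with open(path, "rb") as f:
        head = f.read(6)
    for magic, name in MAGIC_NUMBERS.items():
        if head.startswith(magic):
            return name
    return None
```

With ".arrow" in the known set, a glob that matches thousands of shards never touches the disk at this step. Without it, each shard is opened once before streaming even starts.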

Parquet files

When running load_dataset("arrow", data_files={"train": "~/.cache/huggingface/hub/dataset-HuggingFaceFW___fineweb-edu/snapshots/5b89d1ea9319fe101b3cbdacd89a903aca1d6052/data/CC-MAIN-*/train-*.parquet"}), the paths do not get resolved because the parquet files are symlinks into the blobs directory (which holds all files in case there are different versions). This occurs because the pattern matching checks whether each path is a file and does not account for symlinks. Symlinks (at least on my machine) are reported as type "other".
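The symlink behaviour can be reproduced with the standard library alone. The sketch below is an illustration, not the library's resolution code: classify and demo are hypothetical helpers, and it assumes a platform where os.symlink is permitted. It shows that a type check which does not follow symlinks classifies a link as "other", while one that follows them sees a regular file.

```python
import os
import stat
import tempfile


def classify(path, follow_symlinks):
    """Classify a path as 'file', 'directory', or 'other' (hypothetical helper)."""
    mode = os.stat(path, follow_symlinks=follow_symlinks).st_mode
    if stat.S_ISREG(mode):
        return "file"
    if stat.S_ISDIR(mode):
        return "directory"
    return "other"


def demo():
    """Create a blob plus a symlink to it, mimicking the hub cache layout."""
    with tempfile.TemporaryDirectory() as tmp:
        blob = os.path.join(tmp, "blob")
        link = os.path.join(tmp, "train-00000.parquet")
        with open(blob, "wb") as f:
            f.write(b"data")
        os.symlink(blob, link)
        # First value: the link itself is inspected; second: its target.
        return (
            classify(link, follow_symlinks=False),
            classify(link, follow_symlinks=True),
        )
```

A filter that keeps only entries of type "file" drops the link in the first case, which would match the empty resolution result described above.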

Your contribution

I have created a PR (#7083) that fixes arrow file streaming and symlink resolution. However, I have not checked locally whether the existing tests pass or whether new tests need to be added.

IMO, the easiest option would be to add a streaming="download_first" option, but I'm afraid that exceeds my current knowledge of how the datasets library works.
