What filesystem to use with parquet files #1648

kthyng · 2024-07-14T17:52:02Z

This might be a naive question but I have spent a bit of time trying to figure it out and haven't made much progress.

I'm trying to do this workflow for a parquet file:

import fsspec

fs = fsspec.filesystem().open(path_to_file)

This sort of workflow, without specifying a protocol, finds that the parquet file is a directory and raises an IsADirectoryError. So I am trying to figure out which protocol to use. Looking through the docs, two built-in implementations mention parquet files, but both seem aimed at kerchunk files specifically, and I'm not sure whether they can be used for anything else. I tried protocol="reference" but wasn't sure what to pass for fo. I am using a local parquet file, so I passed that as fo, something like this:

fs = fsspec.filesystem("reference", fo=path_to_file).open(path_to_file)

but then it couldn't find my file, even though it is sitting in the same directory and I had given just the file name in "path_to_file". I am using local files for now, but that won't always be the case.

Am I taking the wrong approach altogether? Any idea for how to approach this? Thanks.

martindurant · 2024-07-16T17:32:45Z

I am a little confused on what you want to do. As you say, a parquet dataset is (usually) a collection of files in a directory or tree. fsspec is for reading bytes or doing filesystem manipulations, so it makes no sense to "open" a directory.

import fsspec

fs = fsspec.filesystem("file")  # a protocol is required; "file" is the local filesystem
fs.find(path)  # list all files
fsspec.open_files(path + "/**/*.parquet", "rb")  # "open" all matching data files
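To make the directory point concrete, here is a minimal self-contained sketch. It assumes a local, throwaway directory laid out like a partitioned dataset; the names dataset.parquet, year=2023 and part.0.parquet are made up for illustration, and the part files hold placeholder bytes rather than valid parquet data.

```python
import os
import tempfile

import fsspec

# Build a throwaway directory laid out like a partitioned parquet dataset.
# The part files hold placeholder bytes only; they are NOT valid parquet.
root = os.path.join(tempfile.mkdtemp(), "dataset.parquet")
for part in ("year=2023", "year=2024"):
    os.makedirs(os.path.join(root, part))
    with open(os.path.join(root, part, "part.0.parquet"), "wb") as f:
        f.write(b"PAR1")

fs = fsspec.filesystem("file")  # local filesystem

# Opening the dataset root fails because it is a directory, not a file.
try:
    fs.open(root, "rb")
    open_failed = False
except OSError as exc:  # IsADirectoryError on POSIX systems
    open_failed = True
    print(type(exc).__name__)

# Listing and globbing work, because they operate on the files inside.
found = fs.find(root)
opened = fsspec.open_files(root + "/**/*.parquet", "rb")
print(len(found), len(opened))
```

The same calls should carry over to remote storage by swapping "file" for another protocol and adjusting the path, provided the matching fsspec backend is installed.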

However, the parquet libraries understand the layout of parquet files, so you don't need to do this.

import pandas as pd
pd.read_parquet(path)

will call fsspec as needed (via arrow, which also has a concept of filesystems, or via fastparquet). Same goes for dask, polars, etc.

And of course Intake

data = intake.readers.datatypes.Parquet(path)
reader = data.to_reader("pandas")
# or
reader = intake.auto_pipeline(path, "pandas:DataFrame") # works if path matches *.parquet
