Access via Dask read_parquet #22
-
This is a wonderful project; thanks for your work on it. I'm primarily a Python user, so I wanted to explore accessing the data with dask.dataframe.read_parquet or pandas.read_parquet. Given the size of the dataset, I started with Dask because it can load select row groups based on filters. Loading the places data with Dask's read_parquet mostly works, but I've run into a couple of challenges. This is where I am at:
Two main challenges:
Any advice would be appreciated. I also wanted to put this here in case it helps anyone else trying the same approach. |
Replies: 1 comment 3 replies
-
I can't tell from the documentation at https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html whether referencing struct members like 'bbox.min' is supported. Maybe @jorisvandenbossche knows off hand? Perhaps you need to make some kind of pyarrow.compute.Expression, but I couldn't sort it out.
If you load the data as binary, you can use `geopandas.GeoSeries.from_wkb` to parse it into polygons:

```python
import dask.dataframe as dd
import geopandas
import dask_geopandas

df = dd.read_parquet(
    'az://release/2023-07-26-alpha.0/theme=places/type=place/*',
    columns=['bbox', 'geometry'],
    engine='pyarrow',
    dtype_backend="pyarrow",
    storage_options={"anon": True, "account_name": "overturemapswestus2"},
    parquet_file_extensions=False,
)
geometry = df["geometry"].map_partitions(
    geopandas.GeoSeries.from_wkb,
    meta=geopandas.GeoSeries(name="geometry"),
)
gdf = dask_geopandas.from_dask_dataframe(df, geometry=geometry)
```

`gdf.head()` outputs: