Use DatasetGroupBy.quantile for DatasetGroupBy.median for multiple groups when using dask arrays #9935

Open
adriaat opened this issue Jan 9, 2025 · 0 comments

Is your feature request related to a problem?

I am grouping data in a Dataset and computing statistics. I wanted to take the median after grouping by two variables, but got the following error:

>>> ds.groupby(['x', 'y']).median()
# NotImplementedError: The da.nanmedian function only works along an axis or a subset of axes.  The full algorithm is difficult to do in parallel

while ds.groupby(['x']).median() works without any problem.

The error only occurs when the data variables are backed by dask arrays; with NumPy arrays there is no problem. Replacing .median() with .quantile(0.5) also avoids the error. See below:

import dask.array as da
import numpy as np
import xarray as xr

rng = da.random.default_rng(0)
ds = xr.Dataset(
    {'a': (('x', 'y'), rng.random((10, 10)))},
    coords={'x': np.arange(5).repeat(2), 'y': np.arange(5).repeat(2)}
)

# Raises:
# NotImplementedError: The da.nanmedian function only works along an axis or a subset of axes.  The full algorithm is difficult to do in parallel
try:
    ds.groupby(['x', 'y']).median()
except NotImplementedError as e:
    print(e)

# No problems with the following:
ds.groupby(['x']).median()
ds.groupby(['x', 'y']).quantile(0.5)
ds.compute().groupby(['x', 'y']).median()  # Load into memory first, so the data is NumPy-backed
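The error text names da.nanmedian, so it can also be triggered in dask directly, without xarray. The sketch below assumes (this is not confirmed in the issue) that the two-variable groupby ends up asking dask for a median with no axis given, while the single-variable case passes an explicit axis. The array `arr` and its chunking are made up for illustration.

```python
import dask.array as da
import numpy as np

# Illustrative array: 3 rows, 4 columns, chunked along the columns only.
arr = da.from_array(np.arange(12.0).reshape(3, 4), chunks=(3, 2))

# Reducing along an explicit axis is supported by dask.
per_column = da.nanmedian(arr, axis=0).compute()

# Reducing over the whole array (no axis given) raises NotImplementedError.
try:
    da.nanmedian(arr)
    full_reduction_error = None
except NotImplementedError as e:
    full_reduction_error = str(e)

print(per_column)
print(full_reduction_error)
```

If that assumption holds, the restriction sits in dask's median implementation rather than in the grouping logic itself.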

Describe the solution you'd like

A straightforward solution seems to be for DatasetGroupBy.median() to use DatasetGroupBy.quantile(0.5) when the data is dask-backed and the grouping is over multiple variables.

Describe alternatives you've considered

No response

Additional context

My xr.show_versions():

INSTALLED VERSIONS
------------------
commit: None
python: 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:06:46) [GCC 10.3.0]
python-bits: 64
OS: Linux
OS-release: 6.8.0-49-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.3-development

xarray: 2024.10.0
pandas: 2.2.3
numpy: 1.26.4
scipy: 1.14.1
netCDF4: 1.6.5
pydap: None
h5netcdf: 1.4.1
h5py: 3.12.1
zarr: 2.18.3
cftime: 1.6.4.post1
nc_time_axis: None
iris: None
bottleneck: 1.4.2
dask: 2024.11.2
distributed: None
matplotlib: 3.9.2
cartopy: 0.24.0
seaborn: 0.13.2
numbagg: None
fsspec: 2024.10.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 75.5.0
pip: 24.3.1
conda: None
pytest: None
mypy: None
IPython: 8.29.0
sphinx: 7.4.7