Skip to content

Support rechunking to seasonal frequency with SeasonalResampler #10519

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

dhruvak001
Copy link
Contributor

users could not use SeasonResampler for chunking operations in xarray, despite it being a natural fit for seasonal data analysis. When attempting ds.chunk(time=SeasonResampler(["DJF", "MAMJ", "JAS", "ON"])), users encountered obscure errors because the chunking logic was hardcoded to only work with TimeResampler objects. This limitation prevented efficient seasonal analysis workflows and forced users to use workarounds or manual chunking strategies.

Now Added a generalized chunking approach by adding a resolve_chunks method to the Resampler base class and updating the chunking logic to work with all Resampler objects, not just TimeResampler. We also added a _for_chunking method to SeasonResampler that ensures drop_incomplete=False during chunking operations to prevent silent data loss. The solution maintains full backward compatibility with existing TimeResampler functionality while enabling seamless seasonal chunking

@dhruvak001 dhruvak001 changed the title Support chunking Support rechunking to seasonal frequency with SeasonalResampler Jul 9, 2025
DHRUVA KUMAR KAUSHAL added 2 commits July 10, 2025 00:06
@@ -201,7 +201,7 @@ def copy(
# FYI in some cases we don't allow `None`, which this doesn't take account of.
# FYI the `str` is for a size string, e.g. "16MB", supported by dask.
T_ChunkDim: TypeAlias = str | int | Literal["auto"] | tuple[int, ...] | None # noqa: PYI051
T_ChunkDimFreq: TypeAlias = Union["TimeResampler", T_ChunkDim]
T_ChunkDimFreq: TypeAlias = Union["TimeResampler", "SeasonResampler", T_ChunkDim]
Copy link
Contributor

@dcherian dcherian Jul 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
T_ChunkDimFreq: TypeAlias = Union["TimeResampler", "SeasonResampler", T_ChunkDim]
T_ChunkDimFreq: TypeAlias = Union["Resampler", T_ChunkDim]

Good time to generalize here

@@ -52,6 +52,7 @@
"EncodedGroups",
"Grouper",
"Resampler",
"SeasonResampler",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"SeasonResampler",
"SeasonGrouper",
"SeasonResampler",

This was an oversight.

Comment on lines +200 to +204
if variable.ndim != 1:
raise ValueError(
f"chunks={self!r} only supported for 1D variables. "
f"Received variable {name!r} with {variable.ndim} dimensions instead."
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this check should go in _resolve_frequency in dataset.py

Comment on lines +206 to +210
# Create a temporary resampler that ignores drop_incomplete for chunking
# This prevents data from being silently dropped during chunking
resampler_for_chunking = (
self._for_chunking() if hasattr(self, "_for_chunking") else self
)
Copy link
Contributor

@dcherian dcherian Jul 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Create a temporary resampler that ignores drop_incomplete for chunking
# This prevents data from being silently dropped during chunking
resampler_for_chunking = (
self._for_chunking() if hasattr(self, "_for_chunking") else self
)

Instead let's define this compute_chunks method independently on TimeResampler and SeasonResampler. We would like to add FloatResampler (#4008) in the future, so we should keep it general. For example, the "contains datetime" check should be specific to the resamplers

@@ -169,7 +170,60 @@ class Resampler(Grouper):
Currently only used for TimeResampler, but could be used for SpaceResampler in the future.
"""

pass
def resolve_chunks(self, name: Hashable, variable: Variable) -> tuple[int, ...]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def resolve_chunks(self, name: Hashable, variable: Variable) -> tuple[int, ...]:
def compute_chunks(self, name: Hashable, variable: Variable) -> tuple[int, ...]:

@@ -65,7 +69,7 @@ Bug fixes
creates extra variables that don't match the provided coordinate names, instead
of silently ignoring them. The error message suggests using the factory method
pattern with :py:meth:`xarray.Coordinates.from_xindex` and
:py:meth:`Dataset.assign_coords` for advanced use cases (:issue:`10499`).
:py:meth:`Dataset.assign_coords` for advanced use cases (:issue:`10499`, , :pull:`10503`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
:py:meth:`Dataset.assign_coords` for advanced use cases (:issue:`10499`, , :pull:`10503`).
:py:meth:`Dataset.assign_coords` for advanced use cases (:issue:`10499`, :pull:`10503`).

Comment on lines +16 to +17
- Allow skipping the creation of default indexes when opening datasets (:pull:`8051`).
By `Benoit Bovy <https://github.com/benbovy>`_ and `Justus Magin <https://github.com/keewis>`_.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- Allow skipping the creation of default indexes when opening datasets (:pull:`8051`).
By `Benoit Bovy <https://github.com/benbovy>`_ and `Justus Magin <https://github.com/keewis>`_.

bad merge

Comment on lines +212 to +225
chunks = (
DataArray(
np.ones(variable.shape, dtype=int),
dims=(name,),
coords={name: variable},
)
.resample({name: resampler_for_chunking})
.sum()
)
# When bins (binning) or time periods are missing (resampling)
# we can end up with NaNs. Drop them.
if chunks.dtype.kind == "f":
chunks = chunks.dropna(name).astype(int)
chunks_tuple: tuple[int, ...] = tuple(chunks.data.tolist())
Copy link
Contributor

@dcherian dcherian Jul 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're interested, a better approach here would be to pass this DataArray to Resampler.factorize() and use the group_indices of the returned EncodedGroups object to figure out the chunks tuple.

This is another reason I'd prefer just sending the variable to a method on the Resampler. That way the resamplers are free to optimize if they can.

Comment on lines +1140 to +1158
data = create_test_data()
for chunks in [1, 2, 3, 4, 5]:
rechunked = data.chunk({"dim1": chunks})
assert rechunked.chunks["dim1"] == (chunks,) * (8 // chunks) + (
(8 % chunks,) if 8 % chunks else ()
)

rechunked = data.chunk({"dim2": chunks})
assert rechunked.chunks["dim2"] == (chunks,) * (9 // chunks) + (
(9 % chunks,) if 9 % chunks else ()
)

rechunked = data.chunk({"dim1": chunks, "dim2": chunks})
assert rechunked.chunks["dim1"] == (chunks,) * (8 // chunks) + (
(8 % chunks,) if 8 % chunks else ()
)
assert rechunked.chunks["dim2"] == (chunks,) * (9 // chunks) + (
(9 % chunks,) if 9 % chunks else ()
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the point of this test?


# Test standard seasons
rechunked = ds.chunk(x=2, time=SeasonResampler(["DJF", "MAM", "JJA", "SON"]))
expected = tuple(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is expected needed?

import dask.array

N = 365 * 2 # 2 years
time = xr.date_range("2001-01-01", periods=N, freq="D")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's parameterize this over use_cftime=[True, False]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support rechunking to seasonal frequency with SeasonalResampler
2 participants