AI-written (Claude Code, prompted by @FBumann). memray numbers verified on the #751 branch.
Follow-up to #753, splitting out the part that #751 did not solve. #753's fast-path half shipped in #751 (multi-key groupby([names]) now takes the reindex path, one dimension per key, non-breaking); this issue tracks the memory half.
Problem
expr.groupby(["period","season"]).sum() returns separate period × season dims. In dense xarray that's a full cartesian grid — every absent key combination is a real fill cell. For a sparse/correlated crossing it blows up. memray, diagonal crossing (N=1000 observed combos):
| grouping | output | peak memory |
| --- | --- | --- |
| groupby([names]) | {period:1000, season:1000} dense grid | 33.3 MB |
| groupby(df) | {group:1000} MultiIndex, observed-only | 0.33 MB |
The dense result peaks ~100× higher at N=1000, and the gap scales as N. The whole difference is the final densification.
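The cell counts behind that gap can be sketched in a few lines. This is plain arithmetic, not a memory measurement: `grid_vs_observed` is a hypothetical helper, and it assumes a perfectly diagonal crossing where each period pairs with exactly one season.

```python
def grid_vs_observed(n: int) -> dict:
    """Cell counts for a diagonal crossing with n observed (period, season) pairs."""
    observed = n   # one cell per observed combination
    dense = n * n  # full period x season grid
    return {
        "observed": observed,
        "dense": dense,
        "fill": dense - observed,    # cells that exist only as padding
        "ratio": dense // observed,  # grows linearly with n
    }


for n in (10, 100, 1000):
    print(n, grid_vs_observed(n))
```

At N=1000 the grid holds 1,000,000 cells against 1,000 observed ones. The measured memory ratio is not the same number as the cell ratio, since per-cell and index overheads differ between the two layouts.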
Why it's inherent (in dense xarray)
Separate dims are a dense grid; the only compact form is a stacked MultiIndex (what the DataFrame grouper returns). "Separate dims and compact" is impossible without a genuinely sparse store.
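A small plain-Python sketch of why the two properties conflict. `unstack` is a hypothetical helper (not xarray's or linopy's API) that spreads observed-only records keyed by (period, season) over separate dimensions; as soon as it does, every absent combination needs a fill value.

```python
import math


def unstack(observed: dict, fill=math.nan):
    """Spread {(period, season): value} over separate period and season axes."""
    periods = sorted({p for p, _ in observed})
    seasons = sorted({s for _, s in observed})
    # Every (period, season) pair gets a cell; unobserved pairs receive `fill`.
    grid = [[observed.get((p, s), fill) for s in seasons] for p in periods]
    return periods, seasons, grid


# Diagonal crossing: each period occurs with exactly one season.
observed = {(f"p{i}", f"s{i}"): float(i) for i in range(4)}
periods, seasons, grid = unstack(observed)

n_cells = len(periods) * len(seasons)
n_fill = sum(math.isnan(v) for row in grid for v in row)
print(len(observed), n_cells, n_fill)  # 4 observed, 16 grid cells, 12 of them fill
```

The tuple-keyed dict plays the role of the stacked form here: it stays at 4 entries, but it has a single combined key rather than two independent axes.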
What #751 shipped (mitigation, not a fix)
- UserWarning when the grid ≫ observed combinations, nudging users to the DataFrame grouper.
- DataFrame grouper as the compact (observed-only, stacked) escape hatch.
Real fix → #756
The sparse / long-format _term kernel (umbrella #756, which lists groupby densification as an entry point). Under a long-format kernel, groupby(k).sum() is a relational group_by().agg() over observed combinations only: no grid, no padding.
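For illustration, the relational shape of that aggregation in plain Python. This is a sketch of the idea only, not the #756 kernel: the row layout (period, season, coeff), the helper `group_sum`, and the use of a single numeric column to stand in for the terms of a linear expression are all assumptions made for the example.

```python
from collections import defaultdict


def group_sum(rows, key_cols):
    """Sum 'coeff' per distinct key tuple; only observed keys ever appear."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        totals[key] += row["coeff"]
    return dict(totals)


# Long format: one row per observed entry, no padding rows.
rows = [
    {"period": "p0", "season": "winter", "coeff": 1.0},
    {"period": "p0", "season": "winter", "coeff": 2.0},
    {"period": "p1", "season": "summer", "coeff": 4.0},
]
result = group_sum(rows, ["period", "season"])
print(result)
```

Two periods and two seasons would span four grid cells, but the result holds only the two combinations that actually occur.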