
Multi-key groupby([names]).sum() densifies to a dense cartesian grid (memory) #757

@FBumann


Note

AI-written (Claude Code, prompted by @FBumann). memray numbers verified on the #751 branch.

Follow-up to #753, splitting out the part that #751 did not solve. #753's fast-path half shipped in #751 (multi-key groupby([names]) now takes the reindex path, one dimension per key, non-breaking); this issue tracks the memory half.

Problem

expr.groupby(["period","season"]).sum() returns separate period × season dims. In dense xarray that's a full cartesian grid — every absent key combination is a real fill cell. For a sparse/correlated crossing it blows up. memray, diagonal crossing (N=1000 observed combos):

| grouping | output | peak memory |
|---|---|---|
| groupby([names]) | {period:1000, season:1000} dense grid | 33.3 MB |
| groupby(df) | {group:1000} MultiIndex, observed-only | 0.33 MB |

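Roughly 100× at N=1000, with the gap scaling as N: the grid holds N × N cells, while only N combinations are observed. The whole difference comes from the final densification.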

Why it's inherent (in dense xarray)

Separate dims are a dense grid; the only compact form is a stacked MultiIndex (what the DataFrame grouper returns). "Separate dims and compact" is impossible without a genuinely sparse store.
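The trade-off can be illustrated with plain pandas, used here only as an analogy for the xarray case: a stacked MultiIndex Series holds the observed combinations, and unstacking it into one axis per key materialises the full grid. The column names and the diagonal data are assumptions chosen to mirror the N=1000 case above; the byte counts are raw value buffers, not the memray peaks.

```python
import numpy as np
import pandas as pd

N = 1000  # diagonal crossing: period i only ever occurs with season i

df = pd.DataFrame({
    "period": np.arange(N),
    "season": np.arange(N),
    "value": np.ones(N),
})

# Stacked form: one row per observed (period, season) combination.
observed = df.groupby(["period", "season"])["value"].sum()

# Separate axes: unstacking turns "season" into columns and pads every
# absent combination with NaN, which is the dense cartesian grid.
grid = observed.unstack("season")

observed_cells = observed.size
grid_cells = grid.size
fill_cells = int(grid.isna().to_numpy().sum())

print(f"observed: {observed_cells} cells, {observed.to_numpy().nbytes} bytes")
print(f"grid:     {grid_cells} cells, {grid.to_numpy().nbytes} bytes")
print(f"fill cells that carry no data: {fill_cells}")
```

On a diagonal crossing the cell-count ratio equals N, so the padding dominates as soon as the keys are correlated.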

What #751 shipped (mitigation, not a fix)

  • A UserWarning when the grid ≫ observed combinations, nudging users to the DataFrame grouper.
  • The DataFrame grouper as the compact (observed-only, stacked) escape hatch.
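A check of the kind described in the first bullet might look like the following. This is a sketch under assumptions, not the code from #751: the function names, the threshold of 10, and the warning text are all hypothetical, and the keys are taken as a plain pandas DataFrame with one column per grouping key.

```python
import math
import warnings

import pandas as pd


def grid_to_observed_ratio(keys: pd.DataFrame) -> float:
    """Cells in the dense grid divided by the number of observed key combinations."""
    observed = len(keys.drop_duplicates())
    if observed == 0:
        return 1.0  # nothing to group, so there is no blow-up to report
    grid_cells = math.prod(int(keys[col].nunique()) for col in keys.columns)
    return grid_cells / observed


def warn_if_grid_dominates(keys: pd.DataFrame, threshold: float = 10.0) -> float:
    """Emit a UserWarning when the grid is much larger than the observed set.

    `threshold` is a hypothetical cut-off, not a value taken from #751.
    """
    ratio = grid_to_observed_ratio(keys)
    if ratio > threshold:
        warnings.warn(
            f"Dense grid is about {ratio:.0f}x larger than the observed key "
            "combinations; consider grouping by a DataFrame instead.",
            UserWarning,
            stacklevel=2,
        )
    return ratio


# Diagonal crossing: 100 x 100 grid, 100 observed combinations.
diagonal = pd.DataFrame({"period": range(100), "season": range(100)})
# Full crossing: every combination occurs, so the grid wastes nothing.
full = pd.DataFrame({"period": [0, 0, 1, 1], "season": ["a", "b", "a", "b"]})

print(grid_to_observed_ratio(diagonal))
print(grid_to_observed_ratio(full))
```

Comparing the product of per-key cardinalities against the count of distinct rows only needs the key columns, so the check can run before any grid is allocated.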

Real fix → #756

The sparse / long-format _term kernel (umbrella #756, which lists groupby densification as an entry point). Under a long-format kernel, groupby(k).sum() is a relational group_by().agg() over observed combinations only — no grid, no padding.
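As a rough picture of what "relational" means here, the sketch below uses a pandas long-format table as a stand-in; it is not the #756 kernel, and the table layout (one row per term, with a `coeff` column) is an assumption made for illustration.

```python
import pandas as pd

# Long format: one row per term, keys stored as ordinary columns.
terms = pd.DataFrame({
    "period": [2030, 2030, 2030, 2040, 2040],
    "season": ["winter", "winter", "summer", "summer", "summer"],
    "coeff": [1.0, 2.0, 4.0, 8.0, 16.0],
})

# Group-by-aggregate over the key columns: the result has one row per
# combination that actually occurs, and nothing for absent ones.
summed = terms.groupby(["period", "season"], as_index=False).agg(
    coeff=("coeff", "sum")
)

print(summed)
# (2040, "winter") never occurs in `terms`, so it produces no row,
# whereas a 2 x 2 dense grid would need a fill cell for it.
```

The result size is bounded by the number of input rows rather than by the product of the key cardinalities.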


Labels: performance (This improves performance while not (meaningfully) altering behaviour for users)
