
Reduce memory usage: lazy writing to dask.array #69

@aufdenkampe

Description


Splitting out from #62: this issue tracks another approach to managing memory, addressing the running-out-of-memory issue described in #57 (comment) and complementing that work.

Dask Array

We also want to consider using dask.array within Xarray, which under the hood is just a chunked collection of numpy.ndarray objects. This gives us the ability to handle arrays that are larger than memory by accessing only certain chunks at a time (e.g., by timestep). These docs (at the bottom of the section) explain how lazy writing works:
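As a minimal sketch of the chunking idea (the variable names `flow`/`reach` and the chunk size are made up for illustration; this assumes `dask` is installed):

```python
import numpy as np
import xarray as xr

# Build a small example dataset, then chunk it so each variable is
# backed by a dask.array instead of an in-memory numpy.ndarray.
ds = xr.Dataset(
    {"flow": (("time", "reach"), np.random.rand(1000, 50))},
    coords={"time": np.arange(1000), "reach": np.arange(50)},
)
ds = ds.chunk({"time": 100})  # 10 chunks along the time dimension

# Operations are now lazy: this builds a task graph, nothing is computed yet.
monthly_mean = ds["flow"].mean(dim="reach")

print(type(ds["flow"].data))  # dask.array, not numpy.ndarray
```

Because every downstream operation stays lazy, only the chunks touched by a computation (or by a later `to_netcdf()`/`to_zarr()` call) need to fit in memory at once.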

Once you’ve manipulated a Dask array, you can still write a dataset too big to fit into memory back to disk by using to_netcdf() in the usual way...
By setting the compute argument to False, to_netcdf() will return a dask.delayed object that can be computed later.

from dask.diagnostics import ProgressBar

delayed_obj = ds.to_netcdf("manipulated-example-data.nc", compute=False)

with ProgressBar():
    results = delayed_obj.compute()

NOTE: We can use the Dataset.to_zarr() method the same way.

The solution near the bottom of this thread describes a smart approach to doing exactly what we need. Let's implement something similar (see the suggested code in response 8): https://discourse.pangeo.io/t/processing-large-too-large-for-memory-xarray-datasets-and-writing-to-netcdf/1724/8

Metadata


    Labels

    enhancement (New feature or request), refactor-core (For core functionality, not model specific)
