Larger-than-memory, on-disk Zarr dataset buffering #10107
-
Hello! I apologize if this is already documented somewhere and I just missed it, but my search did not turn up the information I was hoping to find. I have some questions about how Xarray buffers data with on-disk Zarr-based datasets, particularly when writing.

My first question (and possibly my only question, depending on the answer) is: does Xarray automatically buffer writes to the on-disk Zarr dataset? That is, if I open a dataset for writing and modify the values of an array (say, one by one), are these changes written to disk individually? Or are they written together or in batches in some way, by buffering the changes? If no such buffering or batching occurs, the remaining questions can be ignored.

Assuming there is some sort of buffering, what kind of buffering occurs? Do all changes remain in memory until the file is closed? If so, is there a way to force it to apply the in-memory changes earlier, to free up memory? I expect to be applying changes that are larger than memory, so I would need to apply them before the full modification is complete. Or is there some arbitrary buffer size, such that after enough values are modified, the buffer is flushed to disk?

I ask because I'm going to be ingesting a significant amount of data that arrives in a (multi-stream) serial format. Currently, I plan to create a large N-dimensional array to hold that data, then walk through the serial data and fill the array index by index. If there is some form of buffering built in, that simplifies this process quite a bit. If there is no buffering, I'll need to set up batching of the data for writing, and I'll likely structure the code quite differently in terms of the order in which it reads from the serial inputs.

Thank you for your time!
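For concreteness, a rough sketch of the workflow I have in mind (the store path "ingest.zarr", the variable "signal", and the dimension "sample" are made-up names, and the values are placeholders):

```python
import xarray as xr

# Hypothetical names throughout: "ingest.zarr", "signal", "sample".
ds = xr.open_zarr("ingest.zarr")
ds = ds.load()  # pull the variables into memory so element assignment works
                # (fine for a toy example, but not an option once the data
                # is larger than memory)

# Walk some serial input and fill the array index by index.
for i in range(ds.sizes["sample"]):
    ds["signal"][{"sample": i}] = float(i)  # placeholder for a real value

# The question: do these per-element changes reach the on-disk store
# individually, in buffered batches, or not at all until something is
# explicitly written back?
```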
-
While trying to implement this, I believe I may have found my answer. If someone can confirm what I think I've found, that would be great though!

It looks like modifying the dataset does not change the Zarr file at all. You need to explicitly call `to_zarr` again to write to the file. This can be done in 4 ways (as described in this part of the documentation):

1. Calling `to_zarr` with the full dataset, thereby re-writing the entire dataset.
2. Calling `to_zarr` with `mode='a'` to overwrite individual variables.
3. Calling it with `append_dim` to append along a specific dimension of the Zarr store.
4. Calling it with `region` to write to a specific region.

In my case, it would seem I will want to batch the data and then write to a region. Again, if someone could confirm my thinking here and that I'm not missing something else, that would be great! Thank you much!
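For reference, a rough sketch of those four call patterns. The store path ("ingest.zarr"), variable name ("signal"), dimension name ("sample"), and sizes are all made up for illustration; the exact semantics of each mode are in the Zarr section of the Xarray docs.

```python
import numpy as np
import xarray as xr

# Hypothetical dataset; store path, names, and sizes are made up.
ds = xr.Dataset(
    {"signal": (("sample",), np.zeros(1000))},
    coords={"sample": np.arange(1000)},
)

# 1. Write (or completely re-write) the whole dataset.
ds.to_zarr("ingest.zarr", mode="w")

# 2. Overwrite individual variables in the existing store.
ds[["signal"]].to_zarr("ingest.zarr", mode="a")

# 3. Append along an existing dimension.
more = xr.Dataset(
    {"signal": (("sample",), np.ones(100))},
    coords={"sample": np.arange(1000, 1100)},
)
more.to_zarr("ingest.zarr", append_dim="sample")

# 4. Write into a specific region of the existing store (the slice is in
#    integer index space of the target; the batch carries no "sample"
#    coordinate, matching the pattern in the docs).
batch = xr.Dataset({"signal": (("sample",), np.full(100, 7.0))})
batch.to_zarr("ingest.zarr", region={"sample": slice(200, 300)})
```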
-
And reading the Distributed writes section of the documentation made it fairly clear that this is correct (and clarified a bit more of the details of how it works). Notably, it mentions creating separate, smaller datasets, with indexes matching the larger dataset, to use with the `region` argument.
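Assuming that conclusion is right, a minimal sketch of the batched-ingestion pattern along the lines of the Distributed writes docs: lay out the full-size store up front with `compute=False` (which writes only the metadata, not the dummy values), then fill it batch by batch with `region` writes. The store path, variable and dimension names, sizes, and the `read_batch` helper are all hypothetical.

```python
import dask.array as da
import numpy as np
import xarray as xr

N_SAMPLES = 1_000_000          # hypothetical total size of the ingest
BATCH = 10_000                 # hypothetical batch size
store = "ingest.zarr"          # hypothetical store path


def read_batch(start, stop):
    """Hypothetical stand-in for reading one batch from the serial inputs."""
    return np.zeros(stop - start)


# Lay out the full-size dataset lazily; compute=False writes only the
# store metadata, not the dummy dask values.
template = xr.Dataset(
    {"signal": (("sample",), da.zeros(N_SAMPLES, chunks=BATCH))}
)
template.to_zarr(store, mode="w", compute=False)

# Walk the serial input and write each batch into its region of the store.
for start in range(0, N_SAMPLES, BATCH):
    stop = start + BATCH
    batch = xr.Dataset({"signal": (("sample",), read_batch(start, stop))})
    batch.to_zarr(store, region={"sample": slice(start, stop)})
```

Keeping each region aligned to whole Zarr chunks (here, chunks of size `BATCH`) means no two writes touch the same chunk, which is also what makes it safe to write the batches in parallel later if needed.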