Larger-than-memory, on-disk Zarr dataset buffering #10107
-
Hello! I apologize if this is already documented somewhere and I just missed it, but my search did not turn up the information I was hoping to find. I have some questions about how Xarray buffers data with on-disk Zarr-based datasets, particularly when writing.

My first question (and possibly my only question, depending on the answer) is: does Xarray automatically buffer writes to the on-disk Zarr dataset? That is, if I open a dataset for writing and modify the values of an array (say, one by one), are these changes written to disk individually? Or are they written together or in batches in some way, by buffering the changes? If no such buffering or batching occurs, the remaining questions can be ignored.

Assuming there is some sort of buffering, what kind of buffering occurs? Do all changes remain in memory until the file is closed? If so, is there a way to force it to apply the in-memory changes earlier, to free up memory? I expect to be applying changes that are larger than memory, so I would need to apply them before the full modification is complete. Or is there some arbitrary buffer size, such that after enough values are modified, the buffer is flushed to disk?

I ask because I'm going to be ingesting a significant amount of data that arrives in a (multi-stream) serial format. Currently, I plan to create a large N-dimensional array to hold that data, then walk through the serial data and fill the array index by index. If there is some form of buffering built in, that simplifies this process quite a bit. If there is no buffering, I'll need to set up batching of the data for writing, and I'll likely structure the code quite differently in terms of the order in which it reads from the serial inputs.

Thank you for your time!
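For concreteness, a rough sketch of the workflow I have in mind (the store path "ingest.zarr", the variable "signal", and the dimension "sample" are made-up names, and the values are placeholders):

```python
import xarray as xr

# Hypothetical names throughout: "ingest.zarr", "signal", "sample".
ds = xr.open_zarr("ingest.zarr")
ds = ds.load()  # pull the variables into memory so element assignment works
                # (fine for a toy example, but not an option once the data
                # is larger than memory)

# Walk some serial input and fill the array index by index.
for i in range(ds.sizes["sample"]):
    ds["signal"][{"sample": i}] = float(i)  # placeholder for a real value

# The question: do these per-element changes reach the on-disk store
# individually, in buffered batches, or not at all until something is
# explicitly written back?
```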
-
While trying to implement this, I believe I may have found my answer. If someone can confirm what I think I've found, that would be great though!

It looks like modifying the dataset does not change the Zarr file at all. You need to explicitly call `to_zarr` again to write to the file. This can be done in 4 ways (as described in this part of the documentation):

1. Calling `to_zarr` with the full dataset, thereby re-writing the entire dataset.
2. Calling `to_zarr` with `mode='a'` to overwrite individual variables.
3. Calling it with `append_dim` to append along a specific dimension of the Zarr store.
4. Calling it with `region` to write to a specific region.

In my case, it would seem I will want to batch the data and then write to a region. Again, if someone could confirm my thinking here and that I'm not missing something else, that would be great! Thank you much!
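For reference, a rough sketch of those four call patterns. The store path ("ingest.zarr"), variable name ("signal"), dimension name ("sample"), and sizes are all made up for illustration; the exact semantics of each mode are in the Zarr section of the Xarray docs.

```python
import numpy as np
import xarray as xr

# Hypothetical dataset; store path, names, and sizes are made up.
ds = xr.Dataset(
    {"signal": (("sample",), np.zeros(1000))},
    coords={"sample": np.arange(1000)},
)

# 1. Write (or completely re-write) the whole dataset.
ds.to_zarr("ingest.zarr", mode="w")

# 2. Overwrite individual variables in the existing store.
ds[["signal"]].to_zarr("ingest.zarr", mode="a")

# 3. Append along an existing dimension.
more = xr.Dataset(
    {"signal": (("sample",), np.ones(100))},
    coords={"sample": np.arange(1000, 1100)},
)
more.to_zarr("ingest.zarr", append_dim="sample")

# 4. Write into a specific region of the existing store (the slice is in
#    integer index space of the target; the batch carries no "sample"
#    coordinate, matching the pattern in the docs).
batch = xr.Dataset({"signal": (("sample",), np.full(100, 7.0))})
batch.to_zarr("ingest.zarr", region={"sample": slice(200, 300)})
```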
-
And reading the Distributed writes section of the documentation made it fairly clear that this is correct (and clarified a bit more of the details of how it works). Notably, it mentions creating separate, smaller datasets, with indexes matching the larger dataset, to use with the `region` argument.
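Assuming that conclusion is right, a minimal sketch of the batched-ingestion pattern along the lines of the Distributed writes docs: lay out the full-size store up front with `compute=False` (which writes only the metadata, not the dummy values), then fill it batch by batch with `region` writes. The store path, variable and dimension names, sizes, and the `read_batch` helper are all hypothetical.

```python
import dask.array as da
import numpy as np
import xarray as xr

N_SAMPLES = 1_000_000          # hypothetical total size of the ingest
BATCH = 10_000                 # hypothetical batch size
store = "ingest.zarr"          # hypothetical store path


def read_batch(start, stop):
    """Hypothetical stand-in for reading one batch from the serial inputs."""
    return np.zeros(stop - start)


# Lay out the full-size dataset lazily; compute=False writes only the
# store metadata, not the dummy dask values.
template = xr.Dataset(
    {"signal": (("sample",), da.zeros(N_SAMPLES, chunks=BATCH))}
)
template.to_zarr(store, mode="w", compute=False)

# Walk the serial input and write each batch into its region of the store.
for start in range(0, N_SAMPLES, BATCH):
    stop = start + BATCH
    batch = xr.Dataset({"signal": (("sample",), read_batch(start, stop))})
    batch.to_zarr(store, region={"sample": slice(start, stop)})
```

Keeping each region aligned to whole Zarr chunks (here, chunks of size `BATCH`) means no two writes touch the same chunk, which is also what makes it safe to write the batches in parallel later if needed.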