Concatenate arrays with varchunks #374

Draft · wants to merge 8 commits into main
Conversation

ivirshup
Contributor

@ivirshup ivirshup commented Oct 16, 2023

Allow kerchunk-based concatenation of zarr arrays with variable-length chunks. This is mostly to let me play around with some downstream use cases.

This is basically feature complete at the moment. The remaining tasks are largely polish (error handling, further testing) or upstream (ZEP 3 approval).

Works off of zarr-developers/zarr-python#1483

TODO

Upstream tasks

  • Get the zarr PR merged (😉)
  • Make work for v3 (downstream of Kerchunk and Zarr V3 #235)
    • Figure out if this should ONLY work for v3

This PR

  • Allow mixed input chunk types (but only if there is no remainder for any fixed length chunks)
    • The more I think about allowing extra values in any of the variable length chunks the less I want to support it
  • Better tests
    • Parameterize success cases
    • Parameterize failure cases
  • Potentially allow going from all fixed chunks to variable chunks
  • Better error messages; in this case we have far more limited expectations for chunking on the concatenation dimension
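For intuition, here is an illustrative sketch (plain Python, not this PR's actual code) of the concatenation rule described above: fixed-size chunk specs may be mixed with variable-length ones, but a fixed size must divide its array's extent with no remainder:

```python
# Illustrative sketch only, not this PR's implementation.
# Each input is (length, chunks), where chunks is an int (fixed size)
# or a list of ints (variable-length, ZEP 3 style).
def concat_chunk_specs(specs):
    out = []
    for length, chunks in specs:
        if isinstance(chunks, int):
            if length % chunks:
                raise ValueError("fixed chunks with a remainder are not supported")
            out.extend([chunks] * (length // chunks))
        else:
            if sum(chunks) != length:
                raise ValueError("variable chunks must sum to the array length")
            out.extend(chunks)
    return out

print(concat_chunk_specs([(4, 2), (5, [2, 3])]))  # [2, 2, 2, 3]
```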

@martindurant
Member

I would really like to see some success examples, even if they are POCs built on POCs, to help justify the whole idea!

One thing I have been meaning to check: I believe that passing a zarr array with complex chunks to dask will do the right thing, since dask just reads the .chunks attribute, which is already in the right format. This should be tested.
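One way to check part of this (a minimal sketch against a plain NumPy array, not a variable-chunked zarr array) is to hand dask an explicit per-chunk spec and confirm it is taken as-is, since that tuple-of-tuples format is what a variable-length .chunks attribute would need to match:

```python
import numpy as np
import dask.array as da

# Sketch: dask accepts an explicit per-chunk spec (tuple of tuples) directly.
x = np.ones(7)
d = da.from_array(x, chunks=((2, 2, 3),))
print(d.chunks)  # ((2, 2, 3),)
```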

@ivirshup
Contributor Author

Demo gist

I've got virtual concatenation of sparse arrays working. Dataframes should be easier. Unfortunately I can't use this with existing stores (data has to be rewritten) since the chunk boundaries need to be exact.

Dask does not seem to work out of the box, because of:

https://github.com/dask/dask/blob/a6be172ecdddb7a16c923190a561cb8fc88bcf21/dask/array/core.py#L3097-L3098

traceback
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[21], line 1
----> 1 da.from_zarr(result_group["data"], chunks=result_group["data"].chunks)

File ~/miniforge3/envs/variable-chunks/lib/python3.11/site-packages/dask/array/core.py:3600, in from_zarr(url, component, storage_options, chunks, name, inline_array, **kwargs)
   3598 if name is None:
   3599     name = "from-zarr-" + tokenize(z, component, storage_options, chunks, **kwargs)
-> 3600 return from_array(z, chunks, name=name, inline_array=inline_array)

File ~/miniforge3/envs/variable-chunks/lib/python3.11/site-packages/dask/array/core.py:3483, in from_array(x, chunks, name, lock, asarray, fancy, getitem, meta, inline_array)
   3479     asarray = not hasattr(x, "__array_function__")
   3481 previous_chunks = getattr(x, "chunks", None)
-> 3483 chunks = normalize_chunks(
   3484     chunks, x.shape, dtype=x.dtype, previous_chunks=previous_chunks
   3485 )
   3487 if name in (None, True):
   3488     token = tokenize(x, chunks, lock, asarray, fancy, getitem, inline_array)

File ~/miniforge3/envs/variable-chunks/lib/python3.11/site-packages/dask/array/core.py:3098, in normalize_chunks(chunks, shape, limit, dtype, previous_chunks)
   3095     chunks = auto_chunks(chunks, shape, limit, dtype, previous_chunks)
   3097 if shape is not None:
-> 3098     chunks = tuple(c if c not in {None, -1} else s for c, s in zip(chunks, shape))
   3100 if chunks and shape is not None:
   3101     chunks = sum(
   3102         (
   3103             blockdims_from_blockshape((s,), (c,))
   (...)
   3108         (),
   3109     )

File ~/miniforge3/envs/variable-chunks/lib/python3.11/site-packages/dask/array/core.py:3098, in <genexpr>(.0)
   3095     chunks = auto_chunks(chunks, shape, limit, dtype, previous_chunks)
   3097 if shape is not None:
-> 3098     chunks = tuple(c if c not in {None, -1} else s for c, s in zip(chunks, shape))
   3100 if chunks and shape is not None:
   3101     chunks = sum(
   3102         (
   3103             blockdims_from_blockshape((s,), (c,))
   (...)
   3108         (),
   3109     )

TypeError: unhashable type: 'list'
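The root cause is visible in the failing line: `c not in {None, -1}` hashes each chunk entry against a set, and a variable-length chunk spec is a Python list, which is unhashable. A minimal reproduction:

```python
# Minimal reproduction of the failure mode in normalize_chunks:
# set membership hashes the candidate, and lists are unhashable.
chunk = [2, 2, 3]  # a variable-length chunk spec, as zarr would report it
try:
    chunk in {None, -1}
except TypeError as err:
    print(err)  # unhashable type: 'list'
```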

But we can do something like:

da.from_array(
    zarr_array,
    chunks=tuple(tuple(c) if isinstance(c, list) else c for c in zarr_array.chunks),
)

@martindurant
Member

I was just looking at normalize_chunks, since it turns out it gets called a lot when constructing an xr.Dataset with chunks={}. It should not be called at all in the case where no rechunking is requested (and in any case, xarray should only call it when a variable is accessed, as currently happens on the non-dask path).
Each call of normalize_chunks is only ~10 ms, almost all of it in the final tuple(tuple(..) for ..) line, but for a dataset with many variables it adds up.

(cc @rsignell-usgs )

@martindurant
Member

dask/dask#10579

@NikosAlexandris

> (quoting @ivirshup's demo-gist comment and workaround above)

I can run the example from the gist. Now trying to understand how I can adapt this to my use-case.

@ivirshup What does indptr(s) mean? Kindest request: please spell out names to make them easy to grasp.

@ivirshup
Contributor Author

@NikosAlexandris, that's the indptr array from a CSR or CSC matrix. Its entries are offsets into the indices and data arrays. Was this your question? Also, it may be better to discuss in comments on the gist.
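For anyone else puzzling over the name, a toy illustration (plain Python, hypothetical values) of how indptr offsets slice the flat indices and data arrays, one row of a CSR matrix at a time:

```python
# Hypothetical CSR representation of the matrix [[1, 0, 2], [0, 0, 3]].
indptr = [0, 2, 3]    # row i occupies the half-open range indptr[i]:indptr[i+1]
indices = [0, 2, 2]   # column index of each stored value
data = [1, 2, 3]      # the stored values themselves

# Reassemble each row as (column, value) pairs using the indptr offsets.
rows = [
    list(zip(indices[indptr[i]:indptr[i + 1]], data[indptr[i]:indptr[i + 1]]))
    for i in range(len(indptr) - 1)
]
print(rows)  # [[(0, 1), (2, 2)], [(2, 3)]]
```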

@ivirshup
Contributor Author

ivirshup commented Oct 31, 2023

Updated to allow inference of variable-chunked output from inputs with fixed chunking. E.g., arrays like these can now be concatenated:

[zarr.ones(4, chunks=(2,)), zarr.ones(3, chunks=(3,))]
# result has chunking ([2, 2, 3],)
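The inference above amounts to expanding each fixed chunk size into an explicit per-chunk list (a sketch, not this PR's code; a trailing remainder becomes a final short chunk, as zarr allows for the last chunk of an array):

```python
# Sketch: expand a fixed chunk size into an explicit per-chunk list.
def expand_fixed(length, chunk):
    full, rem = divmod(length, chunk)
    return [chunk] * full + ([rem] if rem else [])

# Mirrors the example above: zarr.ones(4, chunks=(2,)) then zarr.ones(3, chunks=(3,)).
print(expand_fixed(4, 2) + expand_fixed(3, 3))  # [2, 2, 3]
```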

Should there be a switch so users can turn this off? It would probably be better to error on this input if you know downstream consumers won't be able to handle the output.

@tinaok
Contributor

tinaok commented Nov 17, 2023

This would fix my problem if it works!! I tried to apply your approach, but I think I missed something and cannot apply it to my workflow... @martindurant, any thoughts?
https://gist.github.com/tinaok/d232cb7b9f31fd0cee26ce7c3c865958

Merging this pull request may close: concatenate_arrays with (slightly) different array shapes