Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parametrized tests for netcdf encoding options #274

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

rabernat
Copy link
Contributor

@rabernat rabernat commented Dec 19, 2022

This PR adds a test to verify that kerchunk can decode all possible flavors of netcdf4.

This reveals that fletcher32 decoding is broken, despite #34 and #35. Attempting to decode data written with zlib=True and fletcher32=True always results in

        zlib.error: Error -5 while decompressing data: incomplete or truncated stream

xref pydata/xarray#7388

@martindurant
Copy link
Member

martindurant commented Dec 19, 2022

Thanks for making sure this got tested.
The documentation misled me: the 4-byte Fletcher checksum is added to the byte buffer before compression, not after, it seems.

@rabernat
Copy link
Contributor Author

Awesome! Great to hear that you can see a solution. Feel free to push to my branch if you want to just fix it in this PR.

@martindurant
Copy link
Member

Already did :)

@rabernat
Copy link
Contributor Author

rabernat commented Dec 19, 2022

😆 you're too fast Martin!

A great next step would be to actually implement the real checksum filter in numcodecs, rather than a dummy passthrough.

I found implementations here: https://github.com/njaladan/hashpy/blob/master/hashpy/fletcherNbit.py

@martindurant
Copy link
Member

@rabernat
Copy link
Contributor Author

I can't imagine that computational cost of fletcher32 would be comparable to the actual decompression step (not to mention the i/o itself).

return buff[:-4]

def encode(self, buf):
pass
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't you want to append 4 empty bytes here, just in case someone tried to use it in a round-trip context?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I'm not sure. Maybe raising NotImplemented is the correct thing to do, as the output can't be used by anything that really does the check.

@martindurant
Copy link
Member

Incoming new codec: zarr-developers/numcodecs#412 (so there would be no need for one here once that is in)

@martindurant
Copy link
Member

I can change this to point to the new fletcher codec in numcodecs. Is the null version, skipping the CPU time to do the comparison, useful? I suppose fletcher ought to be pretty cheap to compute.

@rabernat
Copy link
Contributor Author

I personally consider it quite dangerous to pass through the data without actually checking the checksum. Since we now have a fast fletcher32 codec in numcodecs, I see no reason to keep the passthrough codec.

Profiling would be easy to do if we want some actual numbers.

@martindurant
Copy link
Member

We'll need a numcodecs release anyway

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants