Only read payload buffer #343
Comments
The current kerchunk grib decoder: https://github.com/fsspec/kerchunk/blob/main/kerchunk/codecs.py#L87 (eccodes, not cfgrib)
Hi Martin, Sorry for the delay in getting back to you, very busy :) To be honest, I'm not sure that cfgrib will help you here. kerchunk's current implementation using eccodes directly looks ok to me with one exception, which I'll come to! cfgrib will always try to generate lat/lons, as its purpose is to create a hypercube that includes the geographical information where possible. I'm not sure what extra value cfgrib gives you over the current eccodes-based implementation, even if we removed the geometry. As for the problem I see with your current implementation, it looks like you are using the default eccodes missing value. This is set, unfortunately, to 9999, which means that any missing values in the GRIB will be returned as 9999. This of course could clash with valid values in the data. The
Now, when you ask for the values, any missing values will be returned as np.finfo(np.float32).max, which should not clash with any data.
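To illustrate the point about sentinels, here is a minimal NumPy-only sketch. It assumes the message's missing value has already been set to `np.finfo(np.float32).max` before the values were read, as described above; the ecCodes call itself is not shown. The helper name `mask_missing` and the sample array are hypothetical.

```python
import numpy as np

# Sentinel assumed to have been configured as the missing value before decoding.
MISSING = float(np.finfo(np.float32).max)

def mask_missing(values, missing=MISSING):
    """Replace the missing-value sentinel with NaN (hypothetical helper)."""
    values = np.asarray(values, dtype="float64")
    return np.where(values == missing, np.nan, values)

# 9999.0 is a legitimate data value here; with the default sentinel of 9999
# it would be indistinguishable from a missing point.
decoded = np.array([9999.0, 280.5, MISSING])
masked = mask_missing(decoded)
```

With the larger sentinel, the genuine 9999.0 survives and only the last element becomes NaN.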
By the way, 'kerchunk' is a fantastic name :)
I should say, I am basically ignorant of what eccodes does. I don't even know if it produces the geometry proactively, but I would imagine "no", since cfgrib now has the option to intercept it. We could probably establish this by looking at memory monitoring. Thanks for the tip about missing values. We can fix that. (@emfdavid, in case you have come across this)
Not everyone agrees, but I'm glad you like it.
Hi Martin, In case it helps, here's what happens with the geometry: a GRIB message contains only a scant description of the geometry. For a regular lat/lon grid, for example, it contains the N/S/E/W bounds plus the lat/lon increments in degrees (plus a scanning mode, but that's a detail). So it's literally just a few bytes on disk. cfgrib then asks the ecCodes library for the list of latitudes and longitudes for the grid; ecCodes then computes them from the description I mentioned a moment ago. So any cost is not in terms of disk access; it is a little computational power to compute the lists of lats and lons, plus the memory to store them. However, the more I think about it, the more I can see that many computations do not require the lats and lons (e.g. simply computing a monthly mean across all points regardless of their location), so I can see the option to turn geographical coordinate generation off as being generally useful. I will look into it! In the meantime, I wish you a good weekend! As for 'kerchunk', I guess whether people like the name depends on whether they played a certain marble-based game in their childhood...!
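The geometry computation described above can be sketched roughly as follows. This is not what ecCodes actually runs; it is a simplified illustration for a regular lat/lon grid that assumes rows are scanned north to south and columns west to east, and it ignores the scanning mode. The function name and the example grid are hypothetical.

```python
import numpy as np

def regular_ll_coords(north, south, west, east, dlat, dlon):
    """Expand a regular lat/lon grid description into coordinate arrays.

    Assumes north-to-south rows and west-to-east columns (an assumption;
    a real GRIB message carries a scanning mode that can change this).
    """
    nlat = int(round((north - south) / dlat)) + 1
    nlon = int(round((east - west) / dlon)) + 1
    lats = np.linspace(north, south, nlat)
    lons = np.linspace(west, east, nlon)
    return lats, lons

# A global 1-degree grid: six numbers on disk expand to 181 + 360 values.
lats, lons = regular_ll_coords(90.0, -90.0, 0.0, 359.0, 1.0, 1.0)
```

The cost is small per message, but it is repeated for every chunk unless coordinate generation is skipped.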
Is your feature request related to a problem? Please describe.
No response
Describe the solution you'd like
Following from #341 (comment)
kerchunk is a library for extracting the constituent data buffers from various data storage formats and organising them into zarr datasets, potentially across many input files. By "extract", I mean: find the exact byte range and write this to a "references" file, so that the logical zarr dataset created does not need to duplicate any of the data, which remains in situ.
Kerchunk's GRIB2 support currently regards each grib message as a "chunk" in this sense, and a whole message is loaded and decoded by eccodes for each chunk - we save the start/end of each message. Actually, the coordinates of all of the chunks have already been considered at this point, and the location of each chunk in the overall dataset determined, so the coordinates portion of the message (as opposed to the actual variable payload) is unnecessary.
We would like, if possible, to extract the byte range of the actual payload rather than a whole message. I appreciate that since #341, the coordinates will no longer be constructed for each chunk, but it would be nice not to even download the bytes that define them. This may also allow simpler decoding of the payload.
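As a rough sketch of what locating the payload could look like: a GRIB edition 2 message starts with a 16-byte indicator section (`GRIB`, edition number in byte 8, total length in bytes 9-16), followed by sections that each begin with a 4-byte big-endian length and a 1-byte section number, and ends with `7777`. The Data Section is section 7. The function below walks those headers using only the standard library; its name and the synthetic message are hypothetical, and this is not how kerchunk or ecCodes currently work.

```python
import struct

def grib2_data_sections(msg: bytes):
    """Return (offset, length) of each section-7 payload in one GRIB2 message."""
    if msg[:4] != b"GRIB" or msg[7] != 2:
        raise ValueError("not a GRIB edition 2 message")
    total_length = struct.unpack(">Q", msg[8:16])[0]
    pos = 16  # the indicator section is always 16 bytes
    payloads = []
    while pos < total_length - 4:  # the final 4 bytes are the "7777" end marker
        length, number = struct.unpack(">IB", msg[pos:pos + 5])
        if length < 5:
            raise ValueError("corrupt section header")
        if number == 7:
            # skip the 5-byte section header to reach the packed values
            payloads.append((pos + 5, length - 5))
        pos += length
    return payloads

# Synthetic message for illustration only: a dummy section 1 and a 4-byte payload.
sec1 = struct.pack(">IB", 21, 1) + bytes(16)
sec7 = struct.pack(">IB", 9, 7) + b"DATA"
total = 16 + len(sec1) + len(sec7) + 4
msg = b"GRIB" + bytes([0, 0, 0, 2]) + struct.pack(">Q", total) + sec1 + sec7 + b"7777"
ranges = grib2_data_sections(msg)
```

Each `(offset, length)` pair could then be added to the message's start offset to form a reference of the kind `[url, offset, length]`. One caveat: the packed values cannot be decoded from section 7 alone, since the packing parameters live in the Data Representation Section (5) and any bitmap in section 6, so those few bytes would still need to be captured somewhere.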
Describe alternatives you've considered
No response
Additional context
No response
Organisation
anaconda, fsspec, zarr, pangeo