Only read payload buffer #343

Open
martindurant opened this issue Jun 14, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@martindurant

Is your feature request related to a problem? Please describe.

No response

Describe the solution you'd like

Following from #341 (comment)

kerchunk is a library for extracting the constituent data buffers from various data storage formats and organising them into zarr datasets, potentially across many input files. By "extract", I mean: find the exact byte range and write it to a "references" file, so that the logical zarr dataset created does not need to duplicate any of the data, which remains in situ.
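To make the "references" idea concrete, here is a minimal sketch of a version 1 reference set, in which each chunk key maps to a `[url, offset, length]` triple. The bucket name, variable name, and byte ranges are invented for illustration, and a usable dataset would also need `.zarray` and `.zattrs` entries, which are omitted here.

```python
import json


def build_reference_set(chunks):
    """Build a kerchunk-style version 1 reference set.

    chunks: iterable of (zarr_key, url, offset, length) tuples.
    """
    # Inline metadata is stored as a JSON string under its zarr key.
    refs = {".zgroup": json.dumps({"zarr_format": 2})}
    for key, url, offset, length in chunks:
        # The data stays in the original file; only its location is recorded.
        refs[key] = [url, offset, length]
    return {"version": 1, "refs": refs}


# Hypothetical file and byte ranges, for illustration only.
reference_set = build_reference_set([
    ("t2m/0.0", "s3://example-bucket/forecast.grib2", 0, 250_000),
    ("t2m/1.0", "s3://example-bucket/forecast.grib2", 250_000, 248_500),
])
print(json.dumps(reference_set, indent=2))
```

A reader of this reference set fetches only the listed byte ranges, so narrowing a range from a whole message to its payload directly reduces the bytes downloaded.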

Kerchunk's GRIB2 support currently regards each grib message as a "chunk" in this sense, and a whole message is loaded and decoded by eccodes for each chunk - we save the start/end of each message. Actually, the coordinates of all of the chunks have already been considered at this point, and the location of each chunk in the overall dataset determined, so the coordinates portion of the message (as opposed to the actual variable payload) is unnecessary.

We would like, if possible, to extract the byte range of the actual payload rather than a whole message. I appreciate that since #341, the coordinates will no longer be constructed for each chunk, but it would be nice not to even download the bytes that define them. This may also allow simpler decoding of the payload.
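As a rough sketch of what locating the payload could look like, the code below walks the sections of a single GRIB edition 2 message and returns the byte range of each Data Section (section 7). It relies only on the GRIB2 framing: a 16-byte Indicator Section whose last 8 bytes hold the total message length, followed by sections that each start with a 4-byte length and a 1-byte section number, and a closing `7777`. The function name and the synthetic message are made up for the example, and this is not how kerchunk currently works. Note that the payload is still packed, so decoding it needs the Data Representation Section (5) and, where present, the Bitmap Section (6).

```python
import struct


def find_data_sections(message: bytes):
    """Return (offset, length) of the payload of every section 7 in a GRIB2 message."""
    if message[:4] != b"GRIB" or message[7] != 2:
        raise ValueError("not a GRIB edition 2 message")
    total_length = struct.unpack(">Q", message[8:16])[0]
    pos = 16  # first byte after the Indicator Section
    ranges = []
    # The End Section ("7777") occupies the last 4 bytes of the message.
    while pos < total_length - 4:
        sec_len, sec_num = struct.unpack(">IB", message[pos:pos + 5])
        if sec_len < 5:
            raise ValueError("corrupt section length")
        if sec_num == 7:
            # Skip the 5-byte section header to reach the packed values.
            ranges.append((pos + 5, sec_len - 5))
        pos += sec_len
    return ranges


def _section(number: int, body: bytes) -> bytes:
    """Helper for the demo: frame a section body with its length and number."""
    return struct.pack(">IB", len(body) + 5, number) + body


# Synthetic message with dummy section bodies, for illustration only.
payload = b"\x01\x02\x03\x04"
body = _section(1, b"\x00" * 16) + _section(3, b"\x00" * 10) + _section(7, payload)
total = 16 + len(body) + 4
msg = b"GRIB" + b"\x00\x00" + bytes([0, 2]) + struct.pack(">Q", total) + body + b"7777"

offset, length = find_data_sections(msg)[0]
print(offset, length, msg[offset:offset + length] == payload)
```

A message may repeat sections 2 to 7 to carry several fields, which is why the function returns a list rather than a single range.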

Describe alternatives you've considered

No response

Additional context

No response

Organisation

anaconda, fsspec, zarr, pangeo

@martindurant martindurant added the enhancement New feature or request label Jun 14, 2023
@martindurant martindurant changed the title Payload buffer read only Only read payload buffer Jun 14, 2023
@martindurant
Author

The current kerchunk grib decoder: https://github.com/fsspec/kerchunk/blob/main/kerchunk/codecs.py#L87 (eccodes, not cfgrib)

@iainrussell
Member

Hi Martin,

Sorry for the delay in getting back to you, very busy :)

To be honest, I'm not sure that cfgrib will help you here. kerchunk's current implementation using eccodes directly looks ok to me with one exception, which I'll come to!

cfgrib will always try to generate lat/lons, as its purpose is to create a hypercube that includes the geographical information where possible. I'm not sure what extra value cfgrib gives you over the current eccodes-based implementation, even if we removed the geometry.

As for the problem I see with your current implementation, it looks like you are using the default eccodes missing value. This is set, unfortunately, to 9999, which means that any missing values in the GRIB will be returned as 9999. This of course could clash with valid values in the data. The missingValue key in eccodes is actually writable. What we do in cfgrib is to set it like this:

self["missingValue"] = np.finfo(np.float32).max

Now, when you ask for the values, any missing values will be returned as np.finfo(np.float32).max, which should not clash with any data.
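The clash can be illustrated without eccodes at all. The sketch below simulates a decoder that substitutes a sentinel for missing points, then shows that masking on 9999 also discards a genuine 9999.0, while the largest finite float32 does not. The `decode` and `mask_missing` helpers are hypothetical stand-ins, not eccodes or kerchunk functions.

```python
import struct

# Largest finite float32, numerically the same as np.finfo(np.float32).max.
FLOAT32_MAX = struct.unpack(">f", bytes.fromhex("7f7fffff"))[0]


def decode(field, sentinel):
    """Stand-in for a decoder: missing points (None) come back as the sentinel."""
    return [sentinel if v is None else v for v in field]


def mask_missing(values, sentinel):
    """Recover missing points by comparing against the sentinel."""
    return [None if v == sentinel else v for v in values]


# A field with one genuine 9999.0 and one truly missing point.
field = [9999.0, 12.5, None]

with_default = mask_missing(decode(field, 9999.0), 9999.0)
with_float_max = mask_missing(decode(field, FLOAT32_MAX), FLOAT32_MAX)

print(with_default)    # the valid 9999.0 is wrongly treated as missing
print(with_float_max)  # only the truly missing point is masked
```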

@iainrussell
Member

By the way, 'kerchunk' is a fantastic name :)

@martindurant
Author

> cfgrib will always try to generate lat/lons

I should say, I am basically ignorant of what eccodes does. I don't even know if it produces the geometry proactively, but I would imagine "no", since cfgrib now has the option to intercept it. We could probably establish this by looking at memory monitoring.
I also don't know whether the geometry and other metadata definitions ever make up a significant fraction of the bytes of a message on-disk (as opposed to the actual variable values, the payload).

Thanks for the tip about missing values. We can fix that. (@emfdavid, in case you have come across this)

> By the way, 'kerchunk' is a fantastic name

Not everyone agrees, but I'm glad you like it.

@iainrussell
Member

Hi Martin,

In case it helps, here's what happens with the geometry: a GRIB message contains only a scant description of the geometry. For a regular lat/lon grid for example, it contains the N/S/E/W bounds plus the lat/lon increments in degrees (plus a scanning mode, but that's a detail). So it's literally just a few bytes on disk. cfgrib then asks the ecCodes library for the list of latitudes and longitudes for the grid; ecCodes then computes them from the description I mentioned a moment ago. So any cost is not in terms of disk access, it is a little computational power to compute the lists of lats and lons, plus the memory to store them.
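A minimal sketch of that computation for a regular lat/lon grid: each axis is rebuilt from a first value, an increment, and a point count. The parameter names and the 0.25-degree global grid are assumptions chosen for illustration, not ecCodes key names, and the scanning mode is ignored.

```python
def regular_axis(first, increment, count):
    """Coordinates of a regularly spaced axis, in degrees."""
    return [first + i * increment for i in range(count)]


# Hypothetical 0.25-degree global grid, scanning north to south and west to east.
lats = regular_axis(90.0, -0.25, 721)
lons = regular_axis(0.0, 0.25, 1440)

print(lats[0], lats[-1], lons[0], lons[-1])
# The grid description is a handful of numbers, but the expanded grid is large:
print(len(lats) * len(lons), "grid points")
```

The description costs a few bytes on disk, whereas the expanded coordinates cost compute and memory, which matches the point above that the expense is not in disk access.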

However, the more I think about it, the more I can see that many computations do not require the lats and lons (e.g. simply computing a monthly mean across all points regardless of their location), so I can see an option to turn geographical coordinate generation off as being generally useful. I will look into it! In the meantime, I wish you a good weekend!

As for 'kerchunk', I guess whether people like the name depends on whether they played a certain marble-based game in their childhood...!

2 participants