fsspec source #467

Open
juntyr opened this issue Sep 20, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@juntyr

juntyr commented Sep 20, 2024

Is your feature request related to a problem? Please describe.

I haven't yet found a good way to open large (exceeds RAM) remote (not on my local file system) GRIB files in xarray.

Describe the solution you'd like

A new source would be added, e.g.

earthkit.data.from_source(
    "fsspec", uri_or_file, fs=None, **storage_args,
)

that would be similar to the "file" source in making use of random access but use Python's file-like interface (so perhaps "file-like" would be another name) and thus add support for fsspec's numerous backends to earthkit for free.

This new source should also support loading large GRIB datasets without reading the entire file. Ideally, loading the GRIB file into xarray would read as little data as possible and defer any data reads until the user explicitly asks for the data (similar to how NetCDF and Zarr support lazy loading).

Describe alternatives you've considered

import earthkit.data
import fsspec

ds = earthkit.data.from_source(
    "stream", fsspec.open("<uri>").open(), batch_size=0,
).to_xarray()

(inspired by ecmwf/cfgrib#326 (comment)) provides the closest current solution but treats the file pessimistically as only a stream and not as a random-access file, which results in excessive reads.
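
To illustrate the difference between the two access patterns, here is a minimal sketch using only the standard library. It is not earthkit code: `CountingReader`, `read_as_stream`, and `read_with_seek` are hypothetical helpers, and an in-memory `io.BytesIO` stands in for a remote file. It counts how many bytes have to be read to reach a small region in the middle of a file.

```python
import io


class CountingReader:
    """Wrap a binary file-like object and count the bytes actually read."""

    def __init__(self, raw):
        self._raw = raw
        self.bytes_read = 0

    def read(self, n=-1):
        data = self._raw.read(n)
        self.bytes_read += len(data)
        return data

    def seek(self, offset, whence=io.SEEK_SET):
        return self._raw.seek(offset, whence)


def read_as_stream(f, offset, length):
    """Stream-style access: everything before `offset` must be consumed."""
    f.read(offset)  # read and discard the leading bytes
    return f.read(length)


def read_with_seek(f, offset, length):
    """Random access: jump straight to `offset` and read only `length` bytes."""
    f.seek(offset)
    return f.read(length)


# Synthetic "file": 1000 filler bytes, a 5-byte region of interest, 1000 more.
payload = bytes(1000) + b"FIELD" + bytes(1000)

stream = CountingReader(io.BytesIO(payload))
random_access = CountingReader(io.BytesIO(payload))

assert read_as_stream(stream, 1000, 5) == b"FIELD"
assert read_with_seek(random_access, 1000, 5) == b"FIELD"

print(stream.bytes_read, random_access.bytes_read)  # 1005 5
```

For a remote file, each avoided byte is a byte not transferred over the network, which is why a seekable file-like source would matter here.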

Additional context

I am working in an extremely memory-constrained environment and would like to support opening remote GRIB files (in addition to NetCDF and Zarr datasets which already work).

Organisation

University of Helsinki, ESiWACE3 project

@sandorkertesz
Collaborator

Thank you for the suggestion.

Just a remark: if you want to convert GRIB to xarray, you first need to scan the whole file or files (all the messages) for metadata. So this is very different from the use case of NetCDF and Zarr, where this information is available "instantly".
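
As a rough illustration of why a scan is needed: a GRIB file is a sequence of self-describing messages with no file-level table of contents, so the position of each message is only known after visiting the previous one. The sketch below assumes GRIB edition 2, where the 16-byte indicator section holds the total message length in octets 9 to 16, and it runs on synthetic messages. `make_grib2_message` and `scan_messages` are hypothetical helpers, not part of earthkit or ecCodes. A real scan would also have to decode the metadata sections of every message and handle edition 1 and any padding between messages.

```python
import io
import struct


def make_grib2_message(body: bytes) -> bytes:
    """Build a synthetic GRIB2-shaped message: 16-byte indicator + body + '7777'."""
    total = 16 + len(body) + 4
    # Section 0: 'GRIB', 2 reserved bytes, discipline, edition, 8-byte total length.
    header = b"GRIB" + b"\x00\x00" + bytes([0, 2]) + struct.pack(">Q", total)
    return header + body + b"7777"


def scan_messages(f):
    """Yield (offset, length) per message, reading only the 16-byte indicators."""
    offset = 0
    while True:
        f.seek(offset)
        header = f.read(16)
        if len(header) < 16:
            return  # end of file
        if header[:4] != b"GRIB" or header[7] != 2:
            raise ValueError(f"no GRIB2 message at offset {offset}")
        (length,) = struct.unpack(">Q", header[8:16])
        yield offset, length
        offset += length  # the next message starts right after this one


data = make_grib2_message(b"a" * 10) + make_grib2_message(b"b" * 20)
offsets = list(scan_messages(io.BytesIO(data)))
print(offsets)  # [(0, 30), (30, 40)]
```

Even with random access, every message has to be visited once, which is the cost that NetCDF and Zarr avoid by storing their metadata up front.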

@juntyr
Author

juntyr commented Sep 20, 2024

> Thank you for the suggestion.
>
> Just a remark: if you want to convert GRIB to xarray, you first need to scan the whole file or files (all the messages) for metadata. So this is very different from the use case of NetCDF and Zarr, where this information is available "instantly".

I didn’t know that. Is it related to GRIB’s format? In that case, would index files help? If so, would it be possible to also check whether a pre-generated index file is available (e.g. a file-like object passed alongside, or a relative fsspec URI) and to use it to skip the initial full-file scan?

In any case, it would be important that once the metadata has been extracted, the actual data is not kept in memory until requested by the user (so slices would still only be lazily loaded in).
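
A minimal sketch of what I have in mind, using only the standard library. The JSON index layout, `LazyMessage`, and `open_with_index` are all hypothetical and do not correspond to any existing earthkit, cfgrib, or ecCodes index format. The point is only that a sidecar index lets the reader hold offsets and lengths, and touch the data file when a field is actually requested.

```python
import io
import json


class LazyMessage:
    """Hold only (offset, length); read the bytes when .load() is called."""

    def __init__(self, f, offset, length):
        self._f = f
        self.offset = offset
        self.length = length

    def load(self) -> bytes:
        self._f.seek(self.offset)
        return self._f.read(self.length)


def open_with_index(data_file, index_file):
    """Build lazy message handles from a JSON sidecar index (hypothetical format)."""
    entries = json.load(index_file)
    return {
        e["name"]: LazyMessage(data_file, e["offset"], e["length"])
        for e in entries
    }


# Stand-ins for a remote data file and its pre-generated index.
data = io.BytesIO(b"AAAA" + b"BBBBBB")
index = io.StringIO(json.dumps([
    {"name": "t2m", "offset": 0, "length": 4},
    {"name": "msl", "offset": 4, "length": 6},
]))

fields = open_with_index(data, index)  # no data bytes read yet
print(fields["msl"].load())  # b'BBBBBB'
```

Both `data_file` and `index_file` only need to be file-like objects, so in principle either could come from `fsspec.open(...)`.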

@observingClouds

I like the idea of fsspec support. I would actually see fsspec as the de facto standard for data access in the Python ecosystem, particularly because widely adopted tools like xarray support it natively. I would therefore argue that, instead of earthkit developing access to the fsspec drivers, fsspec drivers for FDB, MARS, and CDS should be developed themselves. Those stand-alone drivers would be truly general and could be used across a variety of tools. I just started developing ecmwfspec for exactly this reason: to provide a general fsspec driver for interacting with ECFS (ECMWF File Storage). I could imagine PRs that would extend this to e.g. FDB.

3 participants