Support DataTree for organizing Datasets by type of level #327

Open
jthielen opened this issue Jan 10, 2023 · 4 comments

Comments

@jthielen

jthielen commented Jan 10, 2023

As discussed in xarray-contrib/datatree#195, it would be wonderful (and relatively straightforward) to add support for DataTree in cfgrib. This would allow improved organization, in a single data collection, of the different datasets that cfgrib.open_datasets() would previously have returned.

As far as implementation, I would propose refactoring the existing open_datasets() to something like:

def open_datatree(path, backend_kwargs=None, **kwargs):
    # type: (str, T.Optional[T.Dict[str, T.Any]], T.Any) -> datatree.DataTree
    """
    Open a GRIB file, grouping incompatible hypercubes into tree nodes by typeOfLevel.
    """
    backend_kwargs = dict(backend_kwargs or {})  # copy; avoids a mutable default
    squeeze = backend_kwargs.get("squeeze", True)
    backend_kwargs["squeeze"] = False
    datasets = open_variable_datasets(path, backend_kwargs=backend_kwargs, **kwargs)

    type_of_level_datasets = {}  # type: T.Dict[str, T.List[xr.Dataset]]
    for ds in datasets:
        for _, da in ds.data_vars.items():
            type_of_level = da.attrs.get("GRIB_typeOfLevel", "undef")
            type_of_level_datasets.setdefault(type_of_level, []).append(ds)

    # DataTree.from_dict() takes one Dataset per node path, not a list, so merge
    # each group first. The "<typeOfLevel>/<index>" naming is only a placeholder.
    nodes = {}  # type: T.Dict[str, xr.Dataset]
    for type_of_level in sorted(type_of_level_datasets):
        merged = merge_datasets(type_of_level_datasets[type_of_level], join="exact")
        for i, ds in enumerate(merged):
            nodes["%s/%d" % (type_of_level, i)] = ds.squeeze() if squeeze else ds
    return datatree.DataTree.from_dict(nodes)

Then, open_datasets could be re-implemented something like:

def open_datasets(path, backend_kwargs=None, **kwargs):
    # type: (str, T.Optional[T.Dict[str, T.Any]], T.Any) -> T.List[xr.Dataset]
    tree = open_datatree(path, backend_kwargs=backend_kwargs, **kwargs)
    # Only the leaves hold data; the typeOfLevel nodes above them are empty.
    return [node.to_dataset() for node in tree.leaves]

(these snippets were edited quickly in between conference sessions; no guarantee that I didn't miss something or that they work properly as-is)
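For readers without a GRIB file at hand, the grouping step can be tried in isolation. The following is a dependency-free sketch, not cfgrib code: VariableRecord and group_by_type_of_level are hypothetical names, and each record stands in for one single-variable dataset carrying a GRIB_typeOfLevel attribute.

```python
from collections import defaultdict
from typing import Any, Dict, List, NamedTuple


class VariableRecord(NamedTuple):
    """Hypothetical stand-in for one single-variable dataset from cfgrib."""

    name: str
    attrs: Dict[str, Any]


def group_by_type_of_level(records: List[VariableRecord]) -> Dict[str, List[str]]:
    """Map each typeOfLevel to the names of the variables found on it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for record in records:
        # Same fallback as the snippet above: a missing key becomes "undef".
        type_of_level = record.attrs.get("GRIB_typeOfLevel", "undef")
        groups[type_of_level].append(record.name)
    # Sort the keys so the resulting node order is deterministic.
    return {key: groups[key] for key in sorted(groups)}


records = [
    VariableRecord("t", {"GRIB_typeOfLevel": "isobaricInhPa"}),
    VariableRecord("t2m", {"GRIB_typeOfLevel": "heightAboveGround"}),
    VariableRecord("u", {"GRIB_typeOfLevel": "isobaricInhPa"}),
    VariableRecord("unknown", {}),
]
tree_layout = group_by_type_of_level(records)
for node, names in tree_layout.items():
    print(f"/{node}: {names}")
```

Each key of tree_layout would become one node of the tree; with real data the values would be datasets rather than variable names.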

That said, discussion would likely be needed to decide whether this should be supported before or after the integration of DataTree into xarray proper (xref pydata/xarray#7418).

cc @TomNicholas, @blaylockbk

@blaylockbk
Contributor

#187 and #321 are additional cases where DataTree could help cfgrib: precipitation (and possibly other) variables that have a different stepRange.

@blaylockbk
Contributor

Wanted to bring up this topic again now that datatree was merged into xarray.

I envision that, instead of needing cfgrib.open_datasets(), cfgrib could read a GRIB file into an xarray.DataTree rather than returning a list of datasets.

See also xarray-contrib/datatree#195, #344

cc @shahramn, @iainrussell

@TomNicholas

You might also want to use the new open_groups API.

@jthielen
Author

I would definitely recommend open_groups! I got a bit into this back at the SciPy sprints and that seemed like the best approach, though it did require a little reordering of steps within some of the CfGribDataStore methods. Hopefully I can get the chance sometime soon to put together a PR here now that there's a released version on the xarray side to base it off of, but if someone else wants to take this on, don't let me stop you either!
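For context, xarray's open_groups returns a flat dict that maps group paths to datasets, and the hierarchy is implied by the path keys. The sketch below only illustrates that shape with plain Python: nest_groups is a hypothetical helper, the strings are stand-ins for datasets, and splitting heightAboveGround by stepType ("instant" vs. "accum") is an assumed layout for the stepRange cases mentioned above, not anything cfgrib does today.

```python
from typing import Any, Dict


def nest_groups(flat: Dict[str, Any]) -> Dict[str, Any]:
    """Turn {"/a/b": value} into {"a": {"b": value}} to show the implied tree."""
    root: Dict[str, Any] = {}
    for path, value in flat.items():
        parts = [part for part in path.split("/") if part]
        if not parts:
            continue  # the root path "/" has no name to nest under; skipped here
        node = root
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return root


# Flat, path-keyed mapping: the general shape of an open_groups result.
flat_groups = {
    "/isobaricInhPa": "t, u",
    "/heightAboveGround/instant": "t2m",
    "/heightAboveGround/accum": "tp",
}
nested = nest_groups(flat_groups)
print(nested)
```

A backend that produces such a flat mapping leaves the tree construction to xarray, which is one reason the flat form is convenient to target.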
