Strengthen "Groups" conventions #333

pvanlaake · 2024-07-08T10:08:39Z

pvanlaake
Jul 8, 2024

Topic for discussion

“Groups” (section 2.7 in CF 1.11) are currently only described in terms of scoping and (not even) a handful of attributes. There is nothing on what goes into the groups, which creates ambiguity and leaves open the possibility of creating unnecessarily complex data sets. The language on name resolution could also be strengthened (for instance, netCDF defines a group as a namespace for variables, groups and types so there shouldn’t be any duplicates).

More problematic is the reference to a “dimension id”. For starters, the quote from the netCDF Data Model seems to be outdated. The original text currently reads: “Dimensions are scoped such that they can be seen in all descendant groups. That is, dimensions can be shared between variables in different groups, if they are defined in a parent group”. The reference to “dimension id” is no longer there. Indeed, a “netCDF dimension has both a name and a length”, but not an id as an identifying property. The final sentence in the opening part of the section should thus read: “If any dimension of an out-of-group variable has the same name as a dimension of the referring variable, the two must be the same dimension. This implies that all out-of-group dimensions defined in the entire data set across all groups must have unique names” (my emphasis).

Given the absence of specific language on where to locate related objects, it is now possible to create a CF-compliant data set that is more complex than necessary. As a simple example, it is now permitted to create a coordinate variable x in, say, group /g11/g12/cv, referencing dimension /g11/x, with the CV used by data variable /g21/g22/vars/my_var (CV x “sees” dimension x higher up in the hierarchy, data variable my_var finds CV x via lateral search).
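To make the example concrete, here is a minimal sketch in plain Python of how a reader might resolve the unqualified reference to `x` from `my_var`'s group. It assumes the resolution order as I read it in section 2.7.1: search by proximity first (the referring group, then its ancestors), then a lateral search running breadth-first downwards from the root group. The `groups` dictionary and the `resolve` helper are hypothetical illustrations and not part of any netCDF or CF library.

```python
from collections import deque
import posixpath

# Hypothetical layout from the example: group path -> names of variables it holds.
groups = {
    "/": [],
    "/g11": [],
    "/g11/g12": [],
    "/g11/g12/cv": ["x"],
    "/g21": [],
    "/g21/g22": [],
    "/g21/g22/vars": ["my_var"],
}

def ancestors(group):
    """Yield the group itself, then each parent up to and including the root."""
    while True:
        yield group
        if group == "/":
            return
        group = posixpath.dirname(group)

def children(group):
    """Direct child groups of `group`, in a stable order."""
    return sorted(g for g in groups if g != "/" and posixpath.dirname(g) == group)

def resolve(name, referring_group):
    """Return (absolute path, search strategy that found it)."""
    # Search by proximity: referring group first, then towards the root.
    for g in ancestors(referring_group):
        if name in groups[g]:
            return posixpath.join(g, name), "proximity"
    # Lateral search: breadth-first from the root through every group.
    queue = deque(["/"])
    while queue:
        g = queue.popleft()
        if name in groups[g]:
            return posixpath.join(g, name), "lateral"
        queue.extend(children(g))
    return None, "not found"

print(resolve("x", "/g21/g22/vars"))
```

In this layout the proximity search fails, so the reader has to fall back on walking the whole tree before it finds `/g11/g12/cv/x`.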

Proposal

Overly complex group layouts could be avoided by two conventions:

The dimension and variable that define a coordinate variable are located in the same group.
A coordinate variable is located in the same group as, or a parent group of, the referring data variable.
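The two conventions above could be checked mechanically. The following is only a sketch under stated assumptions: objects are identified by absolute path strings, "parent group" is read as any ancestor group, and `check_layout` is a hypothetical helper rather than something an existing CF checker provides.

```python
import posixpath

def group_of(path):
    """Group that contains the object with this absolute path."""
    return posixpath.dirname(path)

def is_same_or_ancestor(candidate, group):
    """True if `candidate` is `group` itself or one of its ancestors."""
    if candidate == "/":
        return True  # the root group is an ancestor of every group
    return group == candidate or group.startswith(candidate + "/")

def check_layout(dim_path, cv_path, data_var_path):
    """Return a list of violations of the two proposed conventions."""
    problems = []
    # Convention 1: dimension and coordinate variable share a group.
    if group_of(dim_path) != group_of(cv_path):
        problems.append("dimension and coordinate variable are in different groups")
    # Convention 2: coordinate variable sits in the data variable's group or above it.
    if not is_same_or_ancestor(group_of(cv_path), group_of(data_var_path)):
        problems.append("coordinate variable is not in the data variable's group or an ancestor")
    return problems

# The layout from the example earlier in this post breaks both conventions.
print(check_layout("/g11/x", "/g11/g12/cv/x", "/g21/g22/vars/my_var"))
```

A layout with the dimension and coordinate variable both in the root group, such as `check_layout("/x", "/x", "/g21/my_var")`, returns an empty list.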

Similar conventions could be defined for other concepts such as bounds variables, formula_terms, auxiliary CVs, grids (NUG best practices include “Variables with the same coordinate system implicitly form a group”, so the grid mapping variable should logically also be in the group or immediately above), and possibly others.

erget · 2024-07-10T15:15:18Z

erget
Jul 10, 2024
Maintainer

Hi @pvanlaake , thanks for raising this.

What you're proposing sounds reasonable and beneficial from my perspective if we're promoting it as a good practice ("should" rather than "shall"). One of the principles that we try to follow with the Conventions is preserving backwards compatibility when we make changes, and our support of groups essentially makes it possible for CF-aware software to interpret files as if they were flat, which is what netCDF-3 allowed.

I see the benefit of providing guidance on how to use groups so that they are not confusing humans, and would request that if we do so we ensure that it's done in such a way that it doesn't invalidate existing data, which may be encoded primarily with machines in mind and would not be restricted by guidelines such as the one you outline above.

0 replies

pvanlaake · 2024-07-11T10:34:24Z

pvanlaake
Jul 11, 2024
Author

Hi @erget, thanks for your response.

I understand and fully subscribe to the need to remain backwards compatible. That, however, does not preclude CF 1.12 from "highly recommending" certain conventions for data sets that claim conformance with this newer version of the conventions, similar to language used elsewhere in the document. That is in line with principle for design 8. What I am proposing is done with an eye to principles 4 and 7, i.e. reduce potential complexity, which would benefit both data producers and readers.

On CF-aware software, this is exactly where I am coming from. I am writing an R package for the full CF data model and the current "open" treatment of groups makes writing a good reader very hard (as every reference to coordinates, formula_terms, bounds, grid_mapping, etc., could involve group tree traversal either from or towards the root group, down to descendants, or into lateral branches). Having trawled through the discussion leading up to the inclusion of groups in CF (cf-convention/cf-conventions#144) I understand the need to remain flexible but at the same time it would be worthwhile to recommend use patterns for groups such that new data collections may be easier to produce and parse.

If there is interest in defining such recommended use patterns, I am happy to pitch in more time.

Best,
Patrick

0 replies

erget · 2024-07-15T14:55:15Z

erget
Jul 15, 2024
Maintainer

Hi @pvanlaake , yes certainly, we can make recommendations that don't invalidate existing data! And in the very long term we can even think about deprecating items for new versions, but we'd have to give tons of advance warning, and that's probably more of a CF-2 situation. I think you've summarised well how best to go about it.

Any contributions you're willing to make in this area would be welcome! Are you planning on coming to the CF Workshop this year? It might be a good venue to talk with others to advance these thoughts.

2 replies

ChrisBarker-NOAA Jul 15, 2024
Collaborator

The xarray folks are working on DataTree.

https://xarray-datatree.readthedocs.io/en/latest/

Which I think is a standardized way to share coordinates among groups.

It would be great if what they finally settle on is CF compatible. With luck, it could be a prototype for a new CF spec.

pvanlaake
Jul 19, 2024
Author

Hi @erget, I'll work on a few best practices based on a common principle:

"From the group that contains a data variable, CF constructs that can be shared with other data variables are placed in the same group or a parent group, while CF constructs that are specific to a single data variable are placed in the same group or a descendant group."

A best practice would look like:

"A bounds variable should be located in the same group as the referencing coordinate variable, scalar coordinate variable or auxiliary coordinate variable."
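As a rough illustration of the common principle, the sketch below classifies where a construct sits relative to the data variable's group and accepts "same or ancestor" for shareable constructs and "same or descendant" for constructs specific to one data variable. The function names, the `shared` flag and the example paths are hypothetical and only serve to show the idea.

```python
import posixpath

def relation(group, other):
    """Position of `other` relative to `group`: same, ancestor, descendant or lateral."""
    if other == group:
        return "same"
    # A trailing slash makes prefix tests match whole path components only.
    g = group.rstrip("/") + "/"
    o = other.rstrip("/") + "/"
    if g.startswith(o):
        return "ancestor"
    if o.startswith(g):
        return "descendant"
    return "lateral"

def placement_ok(data_var_path, construct_path, shared):
    """Apply the principle: shared constructs go up, specific constructs go down."""
    rel = relation(posixpath.dirname(data_var_path), posixpath.dirname(construct_path))
    allowed = {"same", "ancestor"} if shared else {"same", "descendant"}
    return rel in allowed

# A grid mapping variable in the root group, shared by data variables below it.
print(placement_ok("/obs/temp", "/crs", shared=True))
# A construct specific to one data variable, kept in a sub-group.
print(placement_ok("/obs/temp", "/obs/aux/temp_flags", shared=False))
# A shared construct in an unrelated branch would need a lateral search.
print(placement_ok("/obs/temp", "/model/x", shared=True))
```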

There could be some preceding text emphasising that these are best practices that should be followed whenever it is practical to do so, and with the warning on potential future deprecation of alternative arrangements at the end. And a bunch of examples that demonstrate preferred arrangements over multiple groups.

Assuming that this may be discussed at the CF Workshop, by when would you need this worked out, possibly in a PR for section 2.7?

Unfortunately I cannot attend the CF Workshop. I'll register for online participation but I'll be travelling that week.

4 replies

erget Jul 26, 2024
Maintainer

Ok. It's a pity that you're traveling on that date - we don't have a hard and fast deadline for submissions, unless you'd want to give a talk about it, which might be useful.

For a hackathon, if we have an early draft announced that would be helpful so that the group of interested parties for a hackathon would be prepared to work on the PR. That would get it on track to go into the next version of the standard. Of course it would need review by you as kind of the champion of the PR and by the Conventions Committee afterwards. That would all be fast if we do it during the week, but it won't be slow if we do it offline - it just moves it into async territory. The workshop is designed to accelerate the community but we try to have parity for those who can and can't attend.

pvanlaake Jul 26, 2024
Author

I could do a (lightning?) talk on Tuesday and/or a hackathon session on Thursday morning, virtually, otherwise I am not available.

I'm happy to prepare some materials for the hackathon. Would that be something structured like this? I can similarly prepare a PR (I am assuming here that this refers to section 2.7).

This being my first active contribution, any pointers you can give would be much appreciated, such as any guidance on creating a fork (in my own GitHub environment, or is there a shared location and if so how do I get permissions to edit there).

erget Jul 26, 2024
Maintainer

Cool, take a look please at the preliminary agenda and see if there's a possibility for you to make something fit with your schedule, considering your time zone :)

Yes, the issue you cite is a good example. The process follows the Rules for CF Conventions Changes and that's set forth in a bit more detail from an implementation perspective on GitHub in our CONTRIBUTING.md. In a nutshell, you raise an issue proposing what you think would be useful, the community (if interested - it certainly seems this is the case!) engages, and once any high-level discussion has concluded, you make a PR proposing the actual textual changes. There's a review period, and if everybody's happy it gets merged into the next draft.

Great to have new contributors - thanks for improving things around here! :)

pvanlaake Jul 26, 2024
Author

Will work on an issue, will post within a few days

JonathanGregory · 2024-07-19T17:32:21Z

JonathanGregory
Jul 19, 2024
Maintainer

Your recommendations sound sensible to me, Patrick @pvanlaake, but I have not used groups so I'm not well-qualified. Do you suggest that your recommendations should be CF recommendations, included in the conformance document, so that the checker issues a warning if they're contravened? What do you think, Daniel @erget?

2 replies

pvanlaake Jul 19, 2024
Author

Hi @JonathanGregory, I am definitely aiming for CF recommendations. Some well-designed best practices would go a long way to making it easier to upgrade currently non-supportive readers / applications to support group-based data sets (or develop new ones, as I am doing). Think "guidance", rather than "instruction".

Whether that should be integrated into the conformance document and checker is up to the community (I vote "yes" for a note or warning). In either case, the conformance document needs some edits because there are some issues with the current text on section 2.7.

erget Jul 26, 2024
Maintainer

I agree, it would be good to have it in the conformance doc as well, that would make it easier to implement as software for automated data vetting, etc.

ethanrd · 2024-07-19T20:01:46Z

ethanrd
Jul 19, 2024
Maintainer

I like the idea of adding a should statement about keeping related/ancillary variables in close proximity to the relevant variables within a group structure.

I would also suggest adding a should statement about all CF references to dimensions or variables following the NUG scoping mechanism principle quoted in CF section 2.7.1. That would mean the objects referenced should always be in the referring group or an ancestor group, not in descendant groups. This suggestion goes against one of your comments, @pvanlaake, about a reference into descendant groups. Do you have a particular reason for preferring the descendant direction for the situation you mention?

2 replies

pvanlaake Jul 20, 2024
Author

Hi @ethanrd, on an abstract level and relative to the data variable, things that are general (and could thus potentially be shared between data variables) should be placed high in the group hierarchy, while things that are specific should be placed lower down. In the case of coordinate variables, which are based on NUG guidance, this makes the lateral search superfluous and there is already a note in section 2.7.1 that this may be deprecated in the future. Of course, the current text in the section already has guidance to this effect but it could be made more explicit.

There are other references than NUG-style coordinate variables, though. A data variable is a netCDF variable that references some coordinate variables plus potentially scalar coordinate variables, that can each reference a bounds variable, formula terms having data variables, that in turn can have bounds variables, ancillary coordinate variables, grid mapping variables, etc. Overall, a lot of these would be either specific to a single data variable or coordinate variable (like ancillary CVs), or "naturally" general (like a grid mapping variable). It is this level of indirection between the elements that make up a data variable that I feel would benefit from some conventions.

In this context, another general guidance could be like-with-like, i.e. place similar data in a specific branch of the group tree and other data in another branch. This could have particular relevance to DSGs, especially if the restriction on a single featureType is relaxed.

It might be useful to reference any guidance to the CF data model in Appendix I (which itself should mention groups, if only to remind the reader of that feature).

erget Jul 26, 2024
Maintainer

@czender fyi - we'd discussed a few years back about how the lateral search is the cause of many solitary tears into our lonely CF pillows at night. You may be interested in moves we're making towards guidance in the form of "should" - and on a distant horizon I could imagine a 2.0 that cleans things up for the Alpha Centauri colonists.

ethanrd · 2024-07-19T20:04:24Z

ethanrd
Jul 19, 2024
Maintainer

I'm not sure there's enough experience with real-world datasets using groups to call anything regarding groups best or even good practice. Perhaps there is some common practice already developing, I'm not sure. Either way, I think it would be better to simply consider it a CF recommendation and not mention best practice.

Also, I don't think deprecation should be mentioned until there is a concrete proposal that changes an existing CF requirement or recommendation. As this introduces a new recommendation, it seems premature to mention possible future deprecations.

0 replies

JonathanGregory · 2024-07-26T12:22:33Z

JonathanGregory
Jul 26, 2024
Maintainer

I agree with @erget's summary:

In a nutshell, you raise an issue proposing what you think would be useful, the community (if interested - it certainly seems this is the case!) engages, and once any high-level discussion has concluded, you make a PR proposing the actual textual changes

except that "high-level" is a bit vague. I think it's usually a good idea to discuss and agree the text itself in the issue, before putting it in the PR, so that textual comments on the PR are really only minor ones about typos etc., not the substantive content. That is because (a) it's common for there to be several affected pieces of text in different parts of the conventions and conformance documents, and it's easier to understand the change and get a consistent result by seeing them all in the same place, (b) if discussion about substance goes on in both the issue and the PR, it's much harder to follow the debate in retrospect, since the sequence becomes unclear, especially because "mini-debates" could arise at more than one place in the PR. I think it's essential to keep the history clear, because we often review it as background to further developments.

However, the above point is a digression. We should follow it up in a separate issue. Daniel and I have a joint existing issue from quite a while ago to do something about coordinating the documents that contain procedures and guidance. We will get there, I'm sure!

1 reply

pvanlaake Jul 26, 2024
Author

Thanks for the further guidance. There will indeed be several sections in the conventions document besides 2.7 where changes have to be made if the issue is accepted, i.e. cross-references or flagging any peculiarities introduced by the use of groups.

pvanlaake · 2024-07-29T10:59:57Z

pvanlaake
Jul 29, 2024
Author

Issue opened: cf-convention/cf-conventions#533

0 replies
