Strengthen "Groups" conventions #333
-
Hi @pvanlaake, thanks for raising this. What you're proposing sounds reasonable and beneficial from my perspective if we're promoting it as a good practice ("should" rather than "shall"). One of the principles that we try to follow with the Conventions is preserving backwards compatibility when we make changes, and our support of groups essentially makes it possible for CF-aware software to interpret files as if they were flat, which is what netCDF-3 allowed. I see the benefit of providing guidance on how to use groups so that they do not confuse human readers, and would request that if we do so we ensure that it's done in such a way that it doesn't invalidate existing data, which may be encoded primarily with machines in mind and would not be restricted by guidelines such as the one you outline above.
-
Hi @erget, thanks for your response. I understand and fully subscribe to the need to remain backwards compatible. That, however, does not preclude CF 1.12 from "highly recommending" certain conventions for data sets that claim conformance with this newer version of the conventions, similar to language used elsewhere in the document. That is in line with principle for design 8. What I am proposing is done with an eye to principles 4 and 7, i.e. reduce potential complexity, which would benefit both data producers and readers. On CF-aware software, this is exactly where I am coming from. I am writing an R package for the full CF data model and the current "open" treatment of groups makes writing a good reader very hard (as every reference to If there is interest in defining such recommended use patterns, I am happy to pitch in more time. Best,
-
Hi @pvanlaake, yes certainly, we can make recommendations that don't invalidate existing data! And in the very long term we can even think about deprecating items for new versions, but we'd have to give tons of advance warning, and that's probably more of a CF-2 situation. I think you've summarised well how best to go about it. Any contributions you're willing to make in this area would be welcome! Are you planning on coming to the CF Workshop this year? It might be a good venue to talk with others to advance these thoughts.
-
Hi @erget, I'll work on a few best practices based on a common principle: "From the group that contains a data variable, CF constructs that can be shared with other data variables are placed in the same group or a parent group, while CF constructs that are specific to a single data variable are placed in the same group or a descendant group." A best practice would look like: "A bounds variable should be located in the same group as the referencing coordinate variable, scalar coordinate variable or auxiliary coordinate variable." There could be some preceding text emphasising that these are best practices that should be followed whenever it is practical to do so, and with the warning on potential future deprecation of alternative arrangements at the end. And a bunch of examples that demonstrate preferred arrangements over multiple groups. Assuming that this may be discussed at the CF Workshop, by when would you need this worked out, possibly in a PR for section 2.7? Unfortunately I cannot attend the CF Workshop. I'll register for online participation but I'll be travelling that week.
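To make the bounds example concrete, here is a minimal sketch of how a reader or checker might test that one best practice. It assumes objects are identified by absolute, slash-separated paths (for example /g1/lat); the function names and the paths are hypothetical and are not part of CF or of any netCDF library.

```python
import posixpath


def group_of(object_path):
    """Return the group part of an absolute netCDF-style object path.

    Assumption for this sketch: "/g1/lat" names variable "lat" in group "/g1".
    """
    return posixpath.dirname(object_path)


def bounds_in_same_group(coordinate_path, bounds_path):
    """True if the bounds variable sits in the same group as the
    coordinate variable that references it."""
    return group_of(coordinate_path) == group_of(bounds_path)
```

A checker could issue a warning rather than an error when this returns False, which would match a "should" recommendation.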
-
Your recommendations sound sensible to me, Patrick @pvanlaake, but I have not used groups so I'm not well-qualified. Do you suggest that your recommendations should be CF recommendations, included in the conformance document, so that the checker issues a warning if they're contravened? What do you think, Daniel @erget?
-
I like the idea of adding a "should" statement about keeping related/ancillary variables in close proximity to the relevant variables within a group structure. I would also suggest adding a "should" statement that all CF references to dimensions or variables follow the NUG scoping mechanism principle quoted in CF section 2.7.1. That would mean the objects referenced should always be in the referring group or an ancestor group, not in descendant groups. This suggestion goes against one of your comments, @pvanlaake, about a reference into descendant groups. Do you have a particular reason for preferring the descendant direction for the situation you mention?
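As an illustration of the "referring group or an ancestor group" rule, the sketch below classifies the direction of a reference between two objects. It assumes absolute, slash-separated paths; the function names and the labels "same", "ancestor", "descendant" and "lateral" are hypothetical choices for this example, not CF or NUG terminology.

```python
import posixpath


def is_same_or_ancestor(referring_group, target_group):
    """True if target_group is referring_group itself or one of its ancestors."""
    ref = posixpath.normpath(referring_group)
    tgt = posixpath.normpath(target_group)
    if tgt == "/":
        # The root group is an ancestor of every group.
        return True
    return ref == tgt or ref.startswith(tgt + "/")


def reference_direction(referring_var, target_var):
    """Classify where target_var lives relative to referring_var."""
    ref_group = posixpath.dirname(referring_var)
    tgt_group = posixpath.dirname(target_var)
    if ref_group == tgt_group:
        return "same"
    if is_same_or_ancestor(ref_group, tgt_group):
        return "ancestor"
    if is_same_or_ancestor(tgt_group, ref_group):
        return "descendant"
    return "lateral"
```

Under the recommendation suggested here, only "same" and "ancestor" would pass without a warning.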
-
I'm not sure there's enough experience with real-world datasets using groups to call anything regarding groups best or even good practice. Perhaps there is some common practice already developing, I'm not sure. Either way, I think it would be better to simply consider it a CF recommendation and not mention best practice. Also, I don't think deprecation should be mentioned until there is a concrete proposal that changes an existing CF requirement or recommendation. As this introduces a new recommendation, it seems premature to mention possible future deprecations.
-
I agree with @erget's summary:
except that "high-level" is a bit vague. I think it's usually a good idea to discuss and agree the text itself in the issue, before putting it in the PR, so that textual comments on the PR are really only minor ones about typos etc., not the substantive content. That is because (a) it's common for there to be several affected pieces of text in different parts of the conventions and conformance documents, and it's easier to understand the change and get a consistent result by seeing them all in the same place, and (b) if discussion about substance goes on in both the issue and the PR, it's much harder to follow the debate in retrospect, since the sequence becomes unclear, especially because "mini-debates" could arise at more than one place in the PR. I think it's essential to keep the history clear, because we often review it as background to further developments. However, the above point is a digression. We should follow it up in a separate issue. Daniel and I have a joint existing issue from quite a while ago to do something about coordinating the documents that contain procedures and guidance. We will get there, I'm sure!
-
Issue opened: cf-convention/cf-conventions#533
-
Topic for discussion
“Groups” (section 2.7 in CF 1.11) are currently only described in terms of scoping and (not even) a handful of attributes. There is nothing on what goes into the groups, which creates ambiguity and leaves open the possibility of creating unnecessarily complex data sets. The language on name resolution could also be strengthened (for instance, netCDF defines a group as a namespace for variables, groups and types so there shouldn’t be any duplicates).
More problematic is the reference to a “dimension id”. For starters, the quote from the netCDF Data Model seems to be outdated. The original text currently reads: “Dimensions are scoped such that they can be seen in all descendant groups. That is, dimensions can be shared between variables in different groups, if they are defined in a parent group”. The reference to “dimension id” is no longer there. Indeed, a “netCDF dimension has both a name and a length”, but not an id as an identifying property. The final sentence in the opening part of the section should thus read: “If any dimension of an out-of-group variable has the same name as a dimension of the referring variable, the two must be the same dimension. This implies that all out-of-group dimensions defined in the entire data set across all groups must have unique names” (my emphasis).
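One way to read the proposed sentence is as a data-set-wide uniqueness check on dimension names. The sketch below takes that strictest reading and reports any dimension name that is defined in more than one group. It assumes dimensions are listed by absolute, slash-separated paths; the function name and the example paths are hypothetical.

```python
from collections import defaultdict
import posixpath


def duplicate_dimension_names(dimension_paths):
    """Map each dimension name defined in more than one group to the
    sorted list of groups that define it.

    Assumption for this sketch: "/g1/x" is dimension "x" defined in group "/g1".
    """
    groups_by_name = defaultdict(set)
    for path in dimension_paths:
        name = posixpath.basename(path)
        groups_by_name[name].add(posixpath.dirname(path))
    return {
        name: sorted(groups)
        for name, groups in groups_by_name.items()
        if len(groups) > 1
    }
```

For example, duplicate_dimension_names(["/x", "/g1/x", "/g1/y"]) flags x as defined in both / and /g1, while a data set whose dimension names are all distinct yields an empty result.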
Given the absence of specific language on where to locate related objects, it is now possible to create a CF-compliant data set that is more complex than necessary. As a simple example, it is now permitted to create a coordinate variable x in, say, group /g11/g12/cv, referencing dimension /g11/x, with the CV used by data variable /g21/g22/vars/my_var (CV x “sees” dimension x higher up in the hierarchy, data variable my_var finds CV x via lateral search).
Proposal
Overly complex group layouts could be avoided by two conventions:
Similar conventions could be defined for other concepts such as bounds variables, formula_terms, auxiliary CVs, grids (NUG best practices includes “Variables with the same coordinate system implicitly form a group”, so the grid mapping variable should logically also be in the group or immediately above), and possibly others.