Location / Identification for grid specs outside of a file #357
https://cfconventions.org/Data/Trac-tickets/145
V. Balaji Office: +1-609-452-6516
Advanced Computing Projects Mobile: +1-917-273-9824
CIMES/GFDL Email:
***@***.***://www.gfdl.noaa.gov/v-balaji-homepage
…On Thu, Mar 10, 2022 at 1:58 PM Chris Barker ***@***.***> wrote:
As we all know, CF data (or model results, anyway) are often associated
with a particular grid specification.
This could be a simple rectangular grid, or a more complex grid, such as
those defined by the UGRID and SGRID specs.
It's a common practice to store the grid definition in the same file as
the data itself, which works out fine. But in some cases, the grid
specification may be substantial, and so it can be stored in a separate
file, so it doesn't need to be repeated.
In that case, there needs to be some way to find that other file, and,
ideally, determine that it is, indeed, the correct grid. Various folks are
doing this already, but not in a standard or robust way -- so it would be
nice to have a standard way to do that in CF.
I'm pretty sure I recall some previous discussion about this, but was not
able to find it -- thus the new issue.
But please feel free to redirect this discussion to an existing issue if
there is one.
I bring this up now because there was a proposal on the UGRID spec site:
ugrid-conventions/ugrid-conventions#59
But it would really be nicer to have a way to do that for any grid type in
CF itself.
I refer you to that discussion for a more fleshed-out idea. If the UGRID
community wants to take this up, I recommend we move the discussion from
the UGRID repo to here.
|
Thanks -- looking at that issue, it resulted in this in the current standard: """ Which is clearly similar / related, but doesn't help with the issue at hand - particularly as it is restricted to cell_measures. In a sense, this is a broader problem -- we're not talking about referring to a particular variable per se, but to an entire grid definition -- a concept that may not yet be included in CF, but will be if/when UGRID is included. So maybe we should talk about this in: #153 ? |
@ChrisBarker-NOAA Has this discussion progressed anywhere else? I see the UGRID ticket is waiting on us (CF). I think it would be entirely proper to make a proposal which is a variant on the external measures option 2.6.3, but which provides a UUID as well as a variable name for the variable in another file. |
Hi All, Since CF-1.9, we can point to an entire grid definition simply by referring to a domain variable that contains it (https://cfconventions.org/Data/cf-conventions/cf-conventions-1.9/cf-conventions.html#domain-variables). Such a reference is, as has been pointed out, not currently allowed - but in principle it need not be more complicated than referring to a cell measure variable. In CF-1.9 we disallowed a data variable to replace its traditional domain definition (attributes Now, all of this would apply also to a mesh topology variable [1], as it is just a collection of one or more connected domains. The reference required from a data variable would require two parts: the mesh topology variable name (possibly external), and the location on the mesh (e.g. "node"). I have no comment right now on what the reference encoding could/should look like - just that I don't think that there are any more high-level barriers to doing this! Hope that helps, |
Dear @ChrisBarker-NOAA Thanks for raising this issue. Things have moved on a bit in the last couple of years. In particular, we have made the link between CF and UGRID in the latest CF release, as you know. How would you like to proceed with this? At the outset you mentioned a use-case from UGRID. Can we address that by permitting some different types of external variable in Sect 2.6.3, as suggested earlier? Best wishes Jonathan |
Hello @ChrisBarker-NOAA, @JonathanGregory and All, I am keen that this is pursued, but only for UGRID mesh topology variables and the variables that they reference (such as connectivity variables). To that end @bnlawrence, @koldunovn and I have been working on an encoding proposal, and are close to actually proposing it. As @JonathanGregory suggests it could be, our proposal is quite simple and includes an extension of the existing Why only for UGRID? Apart from it being the only use case on the table (as far as I'm aware), there are complications if it is extended to arbitrary domain definitions. These complications arise from having to think about whether or not a data variable can be allowed to reference a domain variable instead of the usual collection of metadata variables via named attributes ( Many thanks, |
Makes sense, though:
So: Yes, only for UGRID now, but let's make sure to leave the door open for extending it to other grid types in the future. |
Hi Chris,
Absolutely. If/when SGRID makes it into CF, external variables could be extended to include it. To be discussed at that time.
Indeed, large "ordinary" grids could well benefit from this. I think that is better dealt with separately because a) no one is actually asking for it right now, and b) even if they were, it will involve a discussion (instigated by me :)) on whether to allow domain variables to be referenced from data variables - this discussion is not relevant to UGRID, so would only hold up the current discussion, which does have an existing use case. I'm not expressing an opinion either way on ordinary grids - just kicking the can down the road for now!
Good. Are you happy for me to put in a PR, linked to this issue, that I have ready and that could implement this? Thanks, |
Absolutely—go for it. |
Thanks Chris. Here is a proposal ...
Allowing UGRID mesh variables to be external
Moderator
To be decided.
Moderator Status Review [last updated: YYYY-MM-DD]
Requirement Summary
For data defined on UGRID mesh topologies, it is already the case that datasets are being written for which the mesh variables, i.e. the mesh topology variable and all the variables that it references (e.g. the node coordinates), are stored in a different netCDF file to all of the data variable files. This is because the UGRID description of a mesh can be very large in terms of the disk space required to store it, so repeating it in every file can be expensive. For instance, depending on how many of the optional (yet useful) UGRID features are included, the specification of mesh faces can be approximately 6 to 20 times larger than a 1-d data array defined on those faces. The mechanism for allowing the mesh variables from one file to be associated with the data variable in another file needs to be standardised so that users of the data can correctly interpret such datasets, and generic software applications can be built that can facilitate the recombination of data and metadata.
Cell measures: A bit of history
CF-1.7 introduced the concept of external variables, but only for cell measures (Section 2.6.3. External Variables). The motivation for the cell measures case was to save space, and it had already been done in a non-standardised way prior to CF-1.7. In other words, the pragmatic situation and motivation for allowing cell measures to reside in an external file are the same as for mesh topology variables. The discussion on allowing cell measures to be external (Trac ticket #145):
Why allow mesh-related variables to be external, but not arbitrary geolocation variables?
Whilst being able to put arbitrary geolocation variables in external files has been talked about for years, there are technical aspects that have not been addressed which do not apply to UGRID meshes (such as whether or not to allow data variables to reference domain variables), and it is the UGRID mesh use case that is being actively discussed here. Restricting the proposal to UGRID meshes will not directly impact on any future discussion on a use case for a further generalisation of external variables, so there is no need to complicate the current situation.
Functional summary
We propose extending the attributes for which CF standardises the use of external variables to include In addition we propose a simple mechanism to better prevent an incorrect external variable being used by mistake. A new optional global attribute will allow an arbitrary string-valued identity to be associated with each external variable. When provided, and the same identity is also given as an attribute of the external variable in the external file, then the identities must match for the external variable to be used.
Technical Proposal Summary
Allowing meshes to be external
Extending the attributes for which CF standardises the use of external variables to include a mesh topology variable will follow the same principle as for external cell measure variables, with additional rules concerning the variables that the mesh topology variable refers to. The first (and currently only) paragraph of Section 2.6.3 External Variables will be changed to: An "external variable" is one which is named by an attribute in the file but which is not present in the file. These variables are to be found in other files called "external files". A variable named by an attribute of an external variable must also be an external variable.
CF does not provide conventions for identifying external files, but an external variable must be in the same external file as any variables named by its attributes. The only attributes in the file for which CF standardises the use of external variables are
Providing identifiers for external variables
The following paragraph will be inserted at the end of Section 5.9 Mesh Topology Variables: A mesh topology variable referenced by the
Providing identifiers for external variables
The following new paragraph will be inserted at the end of Section 2.6.3. External Variables: A variable listed in the
Examples
A new example for Section 2.6.3. External Variables:
Example 2.2: A file with external variables
Example 2.3: The external file containing the external variables for example 2.2
Updating appendix A: Attributes
The new attributes
Benefits
All users of datasets with external meshes would benefit from this proposal.
Status Quo
Users of datasets with external meshes will struggle to read datasets with external meshes.
Associated pull request
Not yet written, but all changes are detailed above. |
Dear @davidhassell Thanks for this proposal, which I think is sound and sensible. I have only one minor comment, namely that I wonder if the attribute Best wishes Jonathan |
Thanks for looking at this, @JonathanGregory. I agree with your suggestion on the name of the external variable identifier attribute (i.e. calling it Given this initial support, I shall create an actual PR with the above changes, incorporating Jonathan's suggestion. |
Whilst writing the external variables PR, I came up with some better (?) text for Section 2.6.3, which is, in full: An "external variable" is one which is named by an attribute in the file but which is not present in the file, or is named by an attribute of another external variable. These variables are to be found in other files called "external files". An external variable must be in the same external file as any variables named by its attributes. The only attributes in the file for which CF standardizes the use of external variables are CF does not provide conventions for finding external files, but once an external file for an external variable has been located it is expected to contain a variable with the same name; and the names of the variable's dimensions (if any) are the same as the corresponding dimensions in the parent file, and with the same sizes. An additional check that an external variable is the correct one may be optionally provided via an external variable identifier. The global |
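As an illustration of the consistency check described in the draft text above (same variable name, and corresponding dimensions with the same names and sizes), here is a minimal Python sketch. It is not part of the proposal: the `Variable` class and `find_external_variable` function are hypothetical stand-ins for a real netCDF library, the variable and dimension names are only illustrative, and reading "corresponding dimensions" as "dimensions that also exist in the parent file" is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Variable:
    """Toy stand-in for a netCDF variable (not a real library class)."""
    name: str
    dims: tuple = ()                      # ordered (dimension name, size) pairs
    attrs: dict = field(default_factory=dict)


def find_external_variable(name, parent_dims, external_file):
    """Look up `name` in `external_file` and check its dimensions.

    parent_dims   -- dict mapping dimension name to size in the parent file
    external_file -- dict mapping variable name to Variable
    Returns the Variable if it is a plausible match, otherwise None.
    """
    candidate = external_file.get(name)
    if candidate is None:
        return None  # no variable with the expected name
    for dim_name, size in candidate.dims:
        # Assumption: only dimensions that also exist in the parent file
        # "correspond"; those must have the same size.
        if dim_name in parent_dims and parent_dims[dim_name] != size:
            return None
    return candidate
```

A check like this can only show that a candidate is plausible, which is why the thread goes on to discuss an additional identifier.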
A couple comments:
I presume this has been hashed out previously, but are we sure we want the identifier to be optional? That seems like a really good idea, why not require it? (client code could ignore it of course...) The other thought on that is whether some guidance should be provided on the identifier -- ideally it would be likely (or guaranteed) to be unique:
Maybe that's too much to require, but it could be suggested. My concern is that we'll end up with identifiers, like FVCOM-mesh or the like, or even "mesh"
given that the mesh variable is always the same dimensions, and names like "mesh" are likely -- that does reinforce why an identifier is a really good idea! In practice, for a UGRID, the wrong mesh definition is highly unlikely to be compatible with the other variables, but it would be better to fail early. -CHB |
Dear David Your text is clear, thanks, except that I think this sentence could be improved
I suggest the following, if it's correct
Cheers Jonathan |
Following @ChrisBarker-NOAA's comment, I suggest we could strengthen
to
We should say here what format this identifier can have. Since they're contained in a blank-separated list, each identifier must be a single word not containing any whitespace. It doesn't have to be intelligible text, but if it's contained in a string I suppose it ought to consist of printable ASCII characters only i.e. hex 21-7E inclusive. Is that too restrictive? I agree with Chris that we could give some guidance about choosing the identifier, along the lines that to fulfil its purpose it should somehow distinguish among variables that have the right name and shape in the set of files which the user program might consider, so it probably won't help if the identifier is always given the same contents by a given program writing datasets of this kind. Best wishes Jonathan |
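The character-set restriction suggested above (a single word of printable ASCII, hex 21-7E inclusive) could be checked along these lines. This is only a sketch: the function names are hypothetical, and it assumes the attribute value is nothing more than a blank-separated list of identifiers, which the thread has not settled.

```python
def is_valid_identifier(identifier: str) -> bool:
    """True if non-empty and made only of printable ASCII 0x21-0x7E.

    Space (0x20) is excluded, so a valid identifier is a single word.
    """
    return bool(identifier) and all(0x21 <= ord(ch) <= 0x7E for ch in identifier)


def parse_identifier_list(value: str) -> list:
    """Split a blank-separated attribute value and validate each word."""
    identifiers = value.split()
    invalid = [word for word in identifiers if not is_valid_identifier(word)]
    if invalid:
        raise ValueError("invalid identifier(s): %r" % invalid)
    return identifiers
```

For example, `is_valid_identifier("FVCOM-mesh")` is true, while a string containing a space or a non-ASCII character is rejected.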
Hi Chris, thanks for your feedback.
This has not been thrashed out in public before. Off-line @bnlawrence and I have talked about it whilst preparing this, but didn't come to a consensus. I favour optional, but Bryan mandatory. My reasons for optional are:
I'm not worried about people not providing identifiers simply because they don't have to - standard_names are (almost) never required, but they are used because people want to, for whatever motivation (they can see that they're useful, or their project insists on them, etc.) We could certainly recommend that people use an identifier though, then the CF checker will warn in its absence for files from CF-1.12 onwards. I hope that Bryan will chime in with his point of view :) I'm happy leaving the content of an identifier unrestricted (though with @JonathanGregory's caveats) - I can see use cases for a UUID and for a nice name like |
Ah, at last, my chance to air my opinion, and without seeding the conversation first! Thanks @ChrisBarker-NOAA ! I think we should : a) make it not just a string identifier but a UUID string identifier, and a) would nearly guarantee uniqueness (you couldn’t get in trouble later on because someone copied a different grid into the same path and filename on your machine), and b) would make sure that it worked as intended - without it, it's basically fragile beyond belief. We get away with areacella because only a tiny fraction of applications need it ... |
From my outside perspective I would side with @bnlawrence and @ChrisBarker-NOAA for exactly the same arguments. BTW: Currently the CF-Checker handles versions up to and including CF-1.8, so bringing it up to CF-1.12 would require some dedicated time and resources ... |
Dear all I agree with @davidhassell that the new I like Bryan @bnlawrence's idea of using a "universally unique identifier" UUID. For the sake of anyone who, like me, didn't know what this is, wikipedia says
In that case, the identifier would be a string consisting of hexdigits and a few allowed punctuation marks. Also, we could call the attribute Cheers Jonathan |
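For readers unfamiliar with UUIDs, the following sketch shows how a data-writing program might generate one and how a checker might test whether a string has the canonical UUID form, using Python's standard `uuid` module. The function names are hypothetical, and nothing in the thread fixes which UUID version would be used; version 4 (random) is an assumption here.

```python
import uuid


def new_identifier() -> str:
    """Generate a random (version 4) UUID in canonical hyphenated form."""
    return str(uuid.uuid4())


def looks_like_uuid(text: str) -> bool:
    """True if `text` is a UUID written in the canonical 36-character form.

    This only checks the shape of the string; it cannot tell whether the
    value was really generated as a UUID or is unique in practice.
    """
    try:
        return str(uuid.UUID(text)) == text.lower()
    except ValueError:
        return False
```

A name such as `FVCOM-mesh` fails this test, so a checker could at most warn that an identifier does not look unique.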
I'm only superficially familiar with uuid, but in CMIP we require each file to include a global attribute we call "tracking_id", and it contains a prefix plus a uuid constructed as described here: tracking_id should be of the form I'm not sure why the prefix is helpful. I think it has something to do with the DKRZ (Hamburg Germany) method of referencing datasets. Does anyone know its purpose? |
Dear all On further reflection, while walking home, I've changed my mind to agree with @davidhassell that sometimes it would be better to have a meaningful identifier, rather than a UUID. It's a good idea to use a UUID when the external variable could take many different values, such as time-dependent cell metrics. On the other hand, for static information, such as David's example of This consideration suggests we should leave it to the data-writer to decide what's sensible, and give them guidelines. The identifier should not have a generic value like Best wishes Jonathan |
Following the CMIP example, you could have the best of both worlds by using an attribute with a value like |
hi Everyone I knew that suggesting something would be mandatory would be controversial, but up until now, it has been mandatory to put the coordinate information in the file, indeed our "primary principle" has been "Data should be self-describing, without external tables needed for interpretation." and unlike areacella which is mostly not needed for interpretation, one will not be able to interpret these files without the external information. The suggestion that unavailability of specific files would cause problems is, I think confusing information with usage. I think the use of a uuid makes it possible for software to be sure that the coordinate file is the correct one - but if you haven't got access to it, software can be advised to use a different one - which puts the onus on the human advisor to get it right. Given that these files will always be created by software, it will not be onerous to require a UUID at creation time - what one does when confronted by the lack of the appropriate file at usage time should be up to a human. What this discussion has made me change my mind on is that it would be desirable to have both a UUID and a human readable identifier. |
(I should say that one might imagine the appropriate coordinate files could be distributed with standard configurations of the code so while there is a risk of many files with the same content and different UUIDs this is a risk that can be mitigated against - and won't be a big risk in practice since the coordinate files should travel with the data. But this is the risk which is most likely to cause me to disagree with myself if someone can convince me.) |
Hmm:
But IIUC, it's only interchangeable IF it's the same content, in which case it could have the same identifier, yes? the identifier is associated with the variable, not the file, correct? When you generate a UUID it's unique, but you can make as many copies of it as you like. |
Agreed. So at data writing time there are two choices:
At reading the data time, there are two choices:
|
many thanks all for raising this issue's profile and exploring opportunities to progress this. There is an aspect I am cautious of, which I would like to explore, if I may? Firstly, context & support: However, I am a bit more cautious about the limited scope of the proposed The UGRID case is particularly interesting, as well as useful, as we are likely exploring putting numerous external variables within a single file / object to describe the unstructured domain. So, should we be thinking cautiously about how the data consumers need to act in identifying the object to reference (a netCDF file, a data service endpoint, ...) and the specific variable set within that object, how to obtain that information at data read time, how to check intent, consistency? Considerations on my mind include:
I am a bit concerned if there's only one item of information one can provide per variable, and the conventions only recommend this, and only tell us it may be a string-name, or a UUID or something else. Are there options that we can explore, that are still scope constrained, but give opportunity for more than one piece of information to be encoded and provided per variable when describing external variables? |
Hi @bnlawrence, you wrote
This highlights something that hasn't been explicitly addressed, i.e. what to do when an identifier in the parent file doesn't match the identifier (if there is one) in the external file. I would say that if the identifier is mandatory then it follows that it must be forbidden to use any external variable that doesn't have that identifier, otherwise "mandatory" has no real meaning in this context. On the other hand, if it were OK to ignore a mandatory identifier when you don't like it, then it follows that the identifier is in fact logically optional, because ignoring it is akin to it not having been provided in the first place. My text suggested that when an identifier had been provided then it is only recommended that software use only an external variable with a matching identifier. This, I think, gives us three choices:
The in the parent phrase in 2. and 3. is important - it allows the creator of the parent file to happily ignore any identifiers in the external files if that suits their purpose. Given that identifiers not matching seems to be a real use case that we want to be able to deal with nicely, I would suggest that one of the "optional" choices make sense. |
I don't see how you get from "it is mandatory when writing" to "it is mandatory to use that information" - that's akin to saying if I put both U and V in the file, you must use both! |
It comes back to the "CF prime directive" - at the moment it is mandatory to put the domain IN the file (while reading, you could ignore that now if you wanted to). I am suggesting that if you want to break the prime directive it should be mandatory to give users of that data the best possible chance of finding that information. |
I don't see it quite like that (today) - e.g. you can't ignore standard names, in that you can't treat a variable as eastward_wind if it has a standard name of air_temperature, however convenient that may be for you :-). By which I mean, if metadata (not variables) are provided then you can put minimum expectation on their use, (edit:) if you want to, or set recommendations, etc. |
(I should add that I'm not entrenched here, just finding my way like the rest of us.) |
As a data-reader, I would like to comment on a couple of Bryan's points:
For such reasons I maintain that it should be optional, but recommended, to require the identifier to be present and matching in the external file. |
I mean, you absolutely can do that. I've done it. Generally it happens when you're calculating some derived quantity and you don't update the standard names until you're done, so there are intermediate steps with stale metadata. All CF can do is to say that it's not correct to do that, assuming all of the entities in question have accurate and CF-compliant metadata. (Which they don't, in the case I describe above.) Likewise, if you have a parent file and an external file that both have identifiers and they don't match, CF can tell you that it's incorrect to use them together (under the assumption that both have accurate and compliant metadata). If either or both files are lacking identifiers, you have no information about whether or not the contents of the files match (and therefore should proceed with caution), and CF can say it's not compliant to produce a pair of files intended to be linked that way without matching identifiers. But I agree with Bryan that it's weird to say it's forbidden to use files that don't have matching identifiers. |
I see that "forbid" was a poor choice of word - sorry! I meant that we can't say that providing an identifier is mandatory without also saying that a dataset is not CF-compliant if the external file doesn't also have a matching identifier. Otherwise I could just further game the system by putting in an arbitrary identifier in the parent file knowing that I never need to abide by it. In this situation CF-compliant software should refuse to entrain an external variable without a matching identifier. This wasn't an argument against mandatory per se (although I still favour optional), rather an observation about the consequences of making it mandatory. This is similar to some other of the few cases of mandatory attributes. E.g. the |
Of course, that is the idea, and true for any other requirement of CF - that's kind of the point :-) It seems to me that CF is all about the data writer -- THIS is how you make your files properly described by metadata -- and it's guidance for data readers -- This is how you CAN interpret the metadata -- but, of course, we have no say whatsoever over what data readers actually choose to do -- if they want to ignore units, they can ignore units, or standard names, or whatever.
Hmm -- it seems to me that in that case, your external file might (or probably would) be the same -- but is not actually guaranteed to be -- maybe only different to the tenth significant figure, or ... maybe you used a float32, and the original used a float64, or .... So I would argue that it is not guaranteed to be the exact same information. (If we want to, we could use a hash of the data (not the file, as metadata could change) to be a unique id of those exact variables -- but in the example above, it probably wouldn't match.)
I don't think so -- if you are creating that external file, then you create the identifier, and you point to it -- done. It may well be that there are other external files out there with the same information, but with different identifiers, but I think that's good, not bad. I guess I'm missing something -- I can see two cases:
How does leaving out the identifier help anyone here? I think we have some consensus here:
(whether we should specify A or B (or a combo) is still open)
The open question is whether it should be required or not. My thoughts -- I'm having a really hard time finding any reason NOT to require it -- from a practical perspective, I see a LOT of files out there that are not quite strictly CF compliant (way too many :-( ) -- so if someone really has a reason to not supply an identifier, they can not supply an identifier, and their file won't be fully compliant -- that's their choice. NOTE on (1) above -- I don't think it's a deal breaker, but I'm trying to see how a compliance checker would enforce a "proper" identifier -- it can look like a UUID, but not actually be one, though maybe a warning is enough -- "this doesn't look like a unique identifier to me". Hmm -- I'm thinking of a realistic use case -- not sure how that impacts this issue: It's quite common for the original provider to create a huge pile of files that all have the grid specification in them -- maybe one for each day, or each timestep, or .... Then a downstream distributor (or more than one) of the model results may want to aggregate them in a particular way, and create an external grid definition file. The aggregator has the grid info, and can create an external file, and give it a unique identifier. All good and that aggregator can keep using that same external file as it adds to the aggregation. All good. But anyone else that is also aggregating, or is using the original files, will have no way to know what identifier anyone else is using -- so they will need to create their own, with their own identifier. Is this a problem? Now that I've written it out, I don't think so -- and it's probably a good thing -- the only way to know for sure that your grid matches your data is if they were, in fact, created in the same way from the same source -- so this is all a good thing. And, indeed, ideally, the original source would have provided an identifier for the grid, even if it wasn't in an external file :-) I'm coming down on the mandatory side here.
In short: If we provide no way to locate an external file, and no way to know for sure that an external file you find is the right one, how in the world can we call that dataset self-describing? |
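The idea floated in the comment above, of using a hash of the data values (rather than of the file) as an identifier, can be sketched as follows. This is purely illustrative and not something the thread proposes for CF: the `data_digest` function, the choice of SHA-256 and the little-endian packing are all assumptions. It also shows the caveat raised in that comment: the same numbers stored as float32 and as float64 produce different digests.

```python
import hashlib
import struct


def data_digest(values, fmt):
    """SHA-256 hex digest of `values` packed as little-endian numbers.

    fmt is a struct format character: 'f' for float32, 'd' for float64.
    Only the data bytes are hashed, so changes to metadata have no effect.
    """
    packed = struct.pack("<%d%s" % (len(values), fmt), *values)
    return hashlib.sha256(packed).hexdigest()


node_x = [0.0, 0.1, 0.2, 0.3]

same = data_digest(node_x, "d") == data_digest(list(node_x), "d")
differs_by_type = data_digest(node_x, "d") != data_digest(node_x, "f")
```

Here `same` and `differs_by_type` are both true, so a content hash would reject a recreated mesh that is numerically equivalent but stored at a different precision.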
Ah, that makes sense. I disagree on only one point: I think CF-compliant software shouldn't automatically use external files without matching identifiers, but that it's reasonable for a user to override that manually. The nature of external variables is that they've been separated out because they apply to multiple files, but that means there are cases where a user won't be able to get the external file and will need to either recreate it or provide something they can assert is functionally identical. And although these cases will be (hopefully) rare, they're not impossible, and as Jonathan says, we don't want to force people to spoof UUIDs when that happens. |
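One possible reading of the position in the comment above (no automatic use without a matching identifier, but a user may override manually) is sketched below. The function name and the `user_override` flag are hypothetical, and treating a missing identifier in the parent file as "nothing to check" is an assumption rather than anything agreed in the thread.

```python
def may_use_external(parent_id, external_id, user_override=False):
    """Decide whether software should use a candidate external variable.

    parent_id     -- identifier given in the parent file, or None if absent
    external_id   -- identifier on the external variable, or None if absent
    user_override -- True if the user explicitly accepts a non-matching file
    """
    if parent_id is None:
        # Assumption: the writer chose not to constrain the external variable.
        return True
    if external_id == parent_id:
        return True
    # Mismatch, or no identifier in the external file: never automatic.
    return user_override
```

With this policy, `may_use_external("a", "b")` is false unless the user passes `user_override=True`, which keeps the final decision with a human when the original external file cannot be obtained.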
In a similar way that CF allows minimal_dataset.nc.zip as a CF-compliant dataset:
It might not be very useful to some, is not very self describing, and would throw up a load of CF checker warnings (not errors), but it's OK! CF has always provided the means for creators to describe their data, and for readers to interpret them as the creator intended. Should we make setting at least one of External variables have always been delivered with the expectation that the user should be careful to find the correct ones. The identifier is intended to make that choice easier in case that the data writer wants to be extra sure that appropriate variables are used, and in that case surely the writer would not want people to ignore the identifier? If providing an identifier in the parent file were mandatory, then when the parent and external files are both passed to the CF checker, it would throw an error (not a warning) if the external file did not have the matching identifier. If we actually expect software to provide an "ignore identifier" option in the mandatory case, then we should in fact make the identifier optional so the data creator can say that they're happy for people to find/create their own, thereby preventing the need for software to provide a work-around that the data creator did not intend. One of the reasons CF is successful is that it gives data writers freedom of expression with a well defined set of tools. We've seen that there are cases where flexibility is desired during the location of external variables, so we shouldn't force data writers to remove that flexibility when they don't want to. It's fine for CF to have an opinion (i.e. to make a recommendation), but I don't think it should remove choice when we know that that choice is sometimes needed. Cheers, |
I guess my point is that file is self describing. Not fully self describing, but it is. What would one do with a UGRID file of data without its domain description. It's properly useless right? It doesn't pass the minimal requirement of being usable in its own right. |
Hi Bryan - I think "file" is misleading. We've moved into "dataset" territory. When external variables are in play, we have one dataset comprising two or three files. The dataset is self-describing, the constituent files aren't so much. |
Quite. So I am suggesting it be mandatory to give people enough information to turn our files into a dataset. I am not suggesting it be mandatory that people use that information, but I think without it, we are putting a lot of requirements on data managers as opposed to users. |
I don't see the conflict. In the optional case, if I were a data manager who liked the idea of the identifiers, then I simply wouldn't accept files without them (just like ESGF does not accept non-CMOR-ized CMIP files). Everyone's happy, right? |
Dear all I agree with David that CF aims to provide conventions which "give data writers freedom of expression with a well defined set of tools." That's related to principle 8 in sect 1.2, "Conventions are provided to allow data-producers to describe the data they wish to produce, rather than attempting to prescribe what data they should produce; consequently most CF conventions are optional." Hence my opinion (I think the same as David's) is still that it should be optional to provide an identifier in the parent file, and optional to require a match when using an external file, although they could be recommended by CF, and they could be required by a project that has its own stricter requirements. In the case of the CMIP CMIP is another case (distinct from my earlier example of manually creating an external file) where I think you don't want an opaque unique identifier. When the files are CMORised, they should all give the same identifier for Best wishes Jonathan |
Hello. Bryan gave us these user choices:
I think that these user choices are correct, but there is a third choice of:
I don't think we should force a writer to put a user in that position. The writer can choose to put a user who likes choice 3 in that position - perfectly fine - but there will be cases when they don't want to. A use case has been discussed here: the fact that there may sometimes be multiple identifiers in multiple valid external files, any of which might be acceptable for use. The writer should have the option of giving a user the flexibility of not being constrained by an arbitrary one of those identifiers. There's no danger that optional identifiers will not get used by data creators who want to use them (who doesn't use at least one of standard_name and long_name, even though it's allowed to provide neither?), but there is a danger of mandatory identifiers having the unintended consequence of preventing the use of a dataset. I think that identifiers are great, that their use should be recommended by CF (so the checker will warn if none are provided in the parent file), but that it is not appropriate for them to be mandatory. On the question of UUID/meaningful-string, I think that there are good arguments for both types, so the identifier should be unrestricted (within the permitted character set). Cheers, |
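To make the two positions concrete, here is a minimal Python sketch of how a checker might treat the identifier under the "mandatory" and "recommended" readings discussed above. It is an illustration only: the attribute name `external_identifier`, the function `check_external`, and the choice to downgrade a mismatch to a warning in the recommended case are all assumptions made for this example, not part of CF or of any existing checker.

```python
from typing import Dict, Tuple

# Hypothetical attribute name, chosen only for this illustration.
ID_ATTR = "external_identifier"


def check_external(
    parent_attrs: Dict[str, str],
    external_attrs: Dict[str, str],
    mandatory: bool = False,
) -> Tuple[str, str]:
    """Return (level, message), where level is 'ok', 'warning' or 'error'."""
    expected = parent_attrs.get(ID_ATTR)
    if expected is None:
        # Mandatory reading: a missing identifier is an error.
        # Recommended reading: the checker only warns.
        if mandatory:
            return ("error", "parent file provides no identifier")
        return ("warning", "no identifier provided; choose the external file with care")
    if external_attrs.get(ID_ATTR) != expected:
        # Assumption: a mismatch is an error only in the mandatory reading.
        level = "error" if mandatory else "warning"
        return (level, "external file identifier does not match the parent file")
    return ("ok", "identifiers match")
```

With these assumptions, a parent file with no identifier yields a warning under the recommended reading and an error under the mandatory one, which is the practical difference between the two positions.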
Dear all, This issue urgently needs a moderator! Moreover, I suggest that the moderator at the earliest convenience convenes an offline group to tease out the essentials in the different viewpoints. /Lars |
Hi Lars. David and I did discuss this offline this morning. Our suggestion is that we'll wait another day, then generate a summary of the (one?) key issue(s) of disagreement (we can agree that easily enough) and then I would suggest we simply vote on which direction we pursue in further fleshing out the proposal. In advance of that summary, I would say that there are two respectable positions on the table, neither of which is False in the sense that any of us would argue that others are (logically) Wrong. I believe that the proponents of both sides believe their positions are correct, and are unlikely to be further persuaded by finer detail of the arguments - so a consensus by mutual agreement seems unlikely. I also believe that no one so far believes that their position represents something upon which they MUST WIN, so a vote would be a reasonable way of achieving a way forward. If that's an acceptable way forward, then we can worry about how to vote next. But come what may, David and I will attempt a summary of things in the next couple of days. |
Meanwhile, I'll add two more use cases for folks to mull over (ok, the second is a summary, but ...)
|
If we can't reach a consensus in this GitHub discussion, I think the right procedure would be the one we followed for (what became) |
That'd certainly be fine by me, I had forgotten that we'd used that mechanism despite Lars actively suggesting it above. Sorry. Do you think it's worth trying to summarise where things are now first? My sense is that we have a lot of information on the table now, and we can synthesise it now as input to such a discussion. |
A summary would certainly be helpful for me. Presumably it could provide some of the text that eventually made it into the "recommendation" resulting from the process, reducing overall effort down the road. |
hello regarding:
how may one suggest themself to be involved in this activity please? many thanks |
I plan to get a summary back here some time early next week, try to take the temperature of the immediate response, and then probably poll for "interested parties" on this ticket. Following that, we'll likely doodle for a suitable time. We may be in a hurry if we want to get folks before they disappear for summer. |
thank you |
As we all know, CF data (or model results, anyway) are often associated with a particular grid specification.
This could be a simple rectangular grid, or a more complex grid, such as those defined by the UGRID and SGRID specs.
It's a common practice to store the grid definition in the same file as the data itself, which works out fine. But in some cases, the grid specification may be substantial, and so it can be stored in a separate file, so it doesn't need to be repeated.
In that case, there needs to be some way to find that other file, and, ideally, determine that it is, indeed, the correct grid. Various folks are doing this already, but not in a standard or robust way -- so it would be nice to have a standard way to do that in CF.
I'm pretty sure I recall some previous discussion about this, but was not able to find it -- thus the new issue.
But please feel free to redirect this discussion to an existing issue if there is one.
I bring this up now because there was a proposal on the UGRID spec site:
ugrid-conventions/ugrid-conventions#59
But it would really be nicer to have a way to do that for any grid type in CF itself.
I refer you to that discussion for a more fleshed-out idea. If the UGRID community wants to take this up, I recommend we move the discussion from the UGRID repo to here.