
How to Report Uncertainty Chapter #320

Closed
kenkehoe opened this issue Apr 6, 2021 · 30 comments · May be fixed by #321
Labels
dormant: Issue closed without a conclusion on a proposed change
enhancement: Proposals to add new capabilities, improve existing ones in the conventions, improve style or format

Comments

@kenkehoe

kenkehoe commented Apr 6, 2021

Before submitting an issue be sure you have read and understand the github contributing guidelines: https://github.com/cf-convention/cf-conventions/blob/master/CONTRIBUTING.md and the rules for CF changes: http://cfconventions.org/rules.html

If the modification is straightforward and non-controversial, feel free to open a pull request simultaneously with the proposed changes.

Change proposals should include the following information as applicable.

Title

Add a new chapter to explain how to report uncertainty values that correspond to data in the file

Moderator

@user

Moderator Status Review [last updated: YYYY-MM-DD]

Brief comment on current status, update periodically

Requirement Summary

Proposing a new chapter for the CF conventions to report uncertainty values in a netCDF file that correspond to one or more linked data variables. Since there is no single clear definition of uncertainty, the proposal is flexible enough to accommodate many different types and shapes.

Technical Proposal Summary

Brief proposal overview

Benefits

Any data producers or users who would like to include uncertainty values within (or external to) a netCDF data file.

Status Quo

Discussion of the current state of CF and other standards.

Associated pull request

#321

Detailed Proposal

I have been working on a proposal for adding uncertainties to CF for a number of years. I've presented these proposals at the CF meetings and taken many suggestions into account. In addition to the proposals to the CF community, I have engaged other communities to learn their needs and how to accommodate as many use cases as possible. This has culminated in a working Google Doc (https://docs.google.com/document/d/1UR0flhrEE3yw_3dKW8NpCrGymLt9idwFXJBhZ5ngX3Y/edit#) with the core proposal and examples. Most of the details of the proposal are best summed up in the Google Doc, which is open for anyone to add comments and suggestions.

The basic summary is to use ancillary variables to contain the uncertainty values, with flexibility in how those values are represented: as scalars, as vectors, in external files, or as formulas that allow users to calculate the uncertainty values themselves.
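
As a rough illustration of that approach, here is a minimal CDL sketch; the variable names, dimension, and attribute values are invented for illustration and are not taken from the proposal itself:

float relative_humidity(time) ;
  relative_humidity:standard_name = "relative_humidity" ;
  relative_humidity:units = "%" ;
  relative_humidity:ancillary_variables = "relative_humidity_uncert" ;
float relative_humidity_uncert(time) ;
  relative_humidity_uncert:long_name = "Relative humidity uncertainty" ;
  relative_humidity_uncert:units = "%" ;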

@kenkehoe added the enhancement label Apr 6, 2021
@kenkehoe changed the title from "New How to Report Uncertainty Chapter" to "How to Report Uncertainty Chapter" Apr 6, 2021
@davidhassell
Contributor

Dear Ken,

Thanks for putting this together - a lot of work has clearly gone into it.

I have a number of comments and questions which I'd like to think about some more before posting, but I'd like to highlight at this time that the proposal as it stands would require a number of changes to the CF data model.

The features that would need data model changes are, as far as I can tell (a CDL sketch of each follows the list):

  1. Ancillary variables referencing other ancillary variables
  2. Ancillary variables having cell methods
  3. Ancillary variables having a trailing size-2 "interval" dimension
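
For orientation, a CDL sketch of the three features, using hypothetical variable names; none of this is existing CF, it only illustrates what would be new:

float air_temperature_uncert(time) ;
  // 1. an ancillary variable that itself references another ancillary variable
  air_temperature_uncert:ancillary_variables = "air_temperature_uncert_random" ;
  // 2. an ancillary variable carrying its own cell_methods
  air_temperature_uncert:cell_methods = "time: standard_deviation" ;
float air_temperature_uncert_random(time) ;
// 3. an ancillary variable with a trailing size-2 "interval" dimension
float air_temperature_uncert_range(time, interval) ;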

Extending the data model is not a problem provided that it has been established that we can't meet the requirements with the current data model. I don't have any answers as yet, but will carry on thinking about it.

I'll post again with some more detailed thoughts on the text ...

All the best,
David

@davidhassell
Contributor

davidhassell commented May 18, 2021

Overall, I think that the general approach of using ancillary variables and cell methods is a good one.

There was considerable discussion around the topic of "standard name modifiers or cell methods?" in 2011, 2012, and 2013 (e.g. http://mailman.cgd.ucar.edu/pipermail/cf-metadata/2013/006106.html, https://cf-trac.llnl.gov/trac/ticket/74) - which is well worth revisiting if you have the time.

Here are my initial thoughts on the detailed proposal:

Standard names

I don't think that these standard names will work, for two reasons:

  1. they do not describe what the quantity stored by the ancillary variable is; rather, they describe its role. For example, if a variable contains the standard deviation of air temperature, then its standard name and cell methods should say so. How that quantity is to be interpreted by the parent data variable should be stored elsewhere. I'm reminded of the cf_role attribute here, as used by DSGs.

  2. it is not possible for them to have canonical units, which all standard names must have.

From reading the [GUM] reference you very helpfully provided in the bibliography (https://www.bipm.org/utils/common/documents/jcgm/JCGM_100_2008_E.pdf), I'm a bit confused on the definition of "total_uncertainty". Your reference to "the square root of the sum of squares" might suggest that it is the [GUM]'s "standard uncertainty" - is that right?

I felt the prefix specific_ was a bit misleading. Perhaps component_ might be clearer?

Cell methods

It would be very useful to include the parent data variable in the examples that have ancillary data cell methods. Without that reference I can't fully understand what the name: of the cell method is referring to - is it a standard name, or a dimension of the parent data variable? It's worth reviewing Jonathan Gregory's "measurement" dimension idea in this light (but I've not followed that train of thought through myself, yet).

I'm not sure that other_than_statistical_analysis is a cell method. The [GUM] says that Type B uncertainties are characterised in the same way as Type A ones (e.g. by standard deviations); it is only how the uncertainty was arrived at that differs: Type A is obtained from an observed frequency distribution; Type B comes from an assumed probability density function.

In the Type B case, there are perhaps further complications, if one considers the stored values to not be representative of sub-grid variation.

There is already a standard_error_multiplier attribute that states the multiplication factor for the standard error. Including a standard_deviation_multiplier (for example) would seem to be the way to go, rather than using the cell method comment section.
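
For reference, the existing attribute looks something like the sketch below (the variable and values are invented); a standard_deviation_multiplier, if adopted, would presumably work the same way:

float air_temperature_error(time) ;
  air_temperature_error:standard_name = "air_temperature standard_error" ;
  air_temperature_error:units = "K" ;
  air_temperature_error:standard_error_multiplier = 2. ;   // stored values are 2 times the standard error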

I'm confused how the confidence interval in example 10.4 is stored as a scalar. Is it in fact half of an interval that is symmetric about the measurement? If so, the method should perhaps reflect this.

Ancillary variables

Cell methods on ancillary variables could be allowed, provided that the cell method names could also plausibly be applied to the data variable - see comments above on the interpretation of the cell method names.

Ancillary methods containing ancillary methods and having trailing dimensions - these need more thought, and I'll post again when I've had time to do so.

All the best,
David

@kenkehoe
Author

David,

Thanks for the comments. I initially proposed a standard name modifier and it was not well accepted by the audiences I presented to. I think a standard name modifier is a more difficult and confusing way to use the standard names, and would require updating the CF document every time we want to add a new name. That is significantly more difficult than adding a standard name (which is not a trivial process). If we want to use some other attribute that's fine. I was trying to keep the number of new attributes to a minimum. I also don't want to develop a new system to track names. We could use cf_role, but I get worried about overloading attributes with too many different concepts.

There is no standard uncertainty. This is the crux of the problem. There are literally thousands of ways to derive uncertainty, so we can't be too specific with the definitions. That's why using the standard_error standard name modifier will not be enough. "The square root of the sum of squares" is just the component-sum method for adding values. It only works if the values are independent, which may or may not be the case.

Whether we use specific_ or component_ doesn't matter to me. Someone will not like whichever one is chosen. That's just how it goes. I got the term from someone else.

I think the confidence interval needs to be listed as the full range for a confidence value to be calculated. It's a scalar because the same interval is used for all values. The value listed in the variable would be half the range centered on the middle of the range. I don't think the details matter that much, just that the variable indicates the values are from a 95% confidence range. I see the cell_methods as just a high level indication of what is going on. There are many other decisions made in the calculation that are not indicated in cell_methods (e.g. how was missing data treated, was QC applied).

I don't follow the concern with ancillary variables. I assume the cell methods attribute describes the process to create the values in the variable. So of course the method listed in cell_methods would need to work on the data variable.

Thanks for the comments,

Ken

@davidhassell
Contributor

Hi Ken,

I'm trying to better understand the issues around cell methods, and would find it very useful to have the parent data variable that goes with this example ancillary variable in the new chapter 10:

float relative_humidity_uncert ;
  relative_humidity_uncert:long_name = "Relative humidity uncertainty" ;
  relative_humidity_uncert:units = "%" ;
  relative_humidity_uncert:standard_name = "random_uncertainty" ;
  relative_humidity_uncert:cell_methods = "time: height: confidence_interval (at 95%)" ;

Would it be possible to update the example?

Many thanks,
David

@kenkehoe
Author

David,

Sure, no problem. I've updated the example in Chapter 10 to include a data variable for both confidence interval examples.

Ken

@kenkehoe
Author

kenkehoe commented Jul 2, 2021

This review appears to have stalled. What can we do to get this going again?

@davidhassell
Contributor

Hi Ken,

I'm sorry that this has stalled, but I don't have as much time as I would like to devote to lengthy and involved proposals such as this one. Perhaps when some other CF issues I've been involved with for some time have concluded I will have more time here.

There are still many outstanding questions for me on a few areas, such as: the use of standard names, role identification, interpretation of cell methods on ancillary variables, ancillary variables referencing other ancillary variables[*]. These will need careful thought, especially the last two which, as proposed, break the CF data model.

[*] in my original post I meant Ancillary variables containing ancillary variables when I wrote Ancillary methods containing ancillary methods. Sorry.

I think the first points we need to resolve are the use of standard names and role identification (I agree that re-using the "cf_role" name is probably not the best choice here, but another name could work). Do you have any more thoughts on that?

All the best,
David

@JonathanGregory
Contributor

Dear @kenkehoe

Thanks for making this proposal. I realise that three months have passed. I regret that I have not yet had time to study and review it, although it's been on my agenda all this time.

Best wishes

Jonathan

@kenkehoe
Author

@davidhassell I am trying to keep this proposal as simple as possible. I've gone through a few iterations on options and I think the current proposal is the least drastic change to the convention. I see no issue with ancillary variables containing ancillary variables as that attribute is just a linkage.

A previous discussion was to use standard name modifiers, but that became cumbersome and would require a change to the CF document each time we add a new kind of uncertainty. I also thought about creating a new attribute, but that would require either changing the CF document appendix where attributes are listed each time a new one is added, or a new external lookup. Since we have the standard name table already, I am suggesting we use that.

If we wanted to use cf_role set to "uncertainty" to indicate that the variable contains uncertainty values, that is fine. But we don't require that for other variables (like state variables, quality control, data, platform information), so it would be a strange one-off. If we wanted to create a new attribute to signify that the values are uncertainties, that is fine. But I'd prefer to keep the description of the variable contents in the standard name table so we don't need to create a new table.
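
To make the comparison concrete, the cf_role option might look like the sketch below; "uncertainty" is not a permitted cf_role value in the current conventions, so this is purely hypothetical:

float relative_humidity_uncert(time) ;
  relative_humidity_uncert:long_name = "Relative humidity uncertainty" ;
  relative_humidity_uncert:units = "%" ;
  relative_humidity_uncert:cf_role = "uncertainty" ;   // hypothetical value, not in the current cf_role list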

Honestly I'm not a huge fan of cell methods. If that is causing strife I'm OK with dropping it and just using a comment attribute, standard name table, or reference attribute to point to a document with the description of how it was computed.

@kenkehoe
Author

kenkehoe commented Aug 4, 2021

What can I do to help move this along?

@davidhassell
Contributor

Hi Ken,

Sorry to have abandoned this again. Your proposal is clearly workable in practice, but there are some issues with the implementation, of varying degrees of seriousness, that mean that it is not yet suitable for inclusion into CF. I agree that it seems like a minimally intrusive set of changes, but some of the aspects you propose are a bit like the tip of the iceberg in complexity, like the ancillary variables containing ancillary variables issue.

The issue there is that the ancillary_variables attribute is not currently standardised on ancillary variables, so the linkage doesn't formally exist. If you made it meaningful when viewing a variable as an ancillary variable then this would need further additions to the conventions text and the CF data model would need extending in as-yet-unknown ways. Before that route is explored we have to be sure that there is no way of representing the information you require with the existing machinery.

This could well sound worse than it is! It is possible that minor changes could resolve these problems and make your proposal fit in with the CF view of data. I think that the key to progression might be to demonstrate clearly when any raised concerns are not in fact valid, or else to be proactive in suggesting concrete alternative ideas, with examples, if none have been suggested.

I think I can devote some time to thinking about this in detail again towards the end of August, when I am back from leave, so I look forward to carrying on the discussion then.

All the best,
David

@JonathanGregory
Contributor

Dear Ken

I have had time at last to study and think a bit about your detailed proposal. Thank you for preparing and presenting it. I appreciate it's frustrating for you that this issue is going slowly. Speaking for myself and from David's comments too, I believe this is because it is a large and complicated proposal; when you're busy (as we all are), it's hard to create a large enough chunk of time to address something requiring lengthy thought. Things might go faster if we dealt with it a piece at a time.

I formed my opinions before reading David's, and I find (without surprise) that many of them are the same. Like David, I'm grateful for your link to the GUM. I too agree with your approach of using ancillary variables to contain measures of uncertainty. The CF standard (section 3.4) doesn't say what dimensions ancillary variables should have. Since they're intended to provide metadata about individual values of a data variable, they would normally have all the same dimensions. However, I don't think it would be problematic to allow dimensions to be dropped over which the uncertainty doesn't vary. You could drop all the dimensions to provide a scalar uncertainty, as in your examples.
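
A sketch of the scalar case, under the assumption that dropping all dimensions is permitted (names are illustrative only):

float air_temperature(time, lat, lon) ;
  air_temperature:standard_name = "air_temperature" ;
  air_temperature:units = "K" ;
  air_temperature:ancillary_variables = "air_temperature_uncert" ;
float air_temperature_uncert ;   // all dimensions dropped: one uncertainty value applies everywhere
  air_temperature_uncert:units = "K" ;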

I don't think that standard names are the right way to describe the uncertainties, because the standard name should still identify the geophysical quantity for which it is an uncertainty e.g. air_temperature, and because each standard name requires particular canonical units, whereas the uncertainties have the same units as the data.

David mentioned that your proposal requires ancillary variables themselves to have ancillary variables. I didn't notice an instance of that in the examples - is there one?

The earlier long and detailed discussion of 2013, which David referenced, is certainly very relevant to your proposal, regarding the distinction between cell_methods and standard name modifiers. Two of the four standard name modifiers (number_of_observations and status_flag) are now deprecated, in favour of using them as standard names instead. That is fine because they don't have units. The other two (detection_minimum and standard_error) are uncertainty measures, and hence relate to your proposal particularly. In order not to complicate the standard and software, it is one of the CF principles that we don't introduce a new way to do something we can already do, even if the new way is agreed to be better, but even so I would be happy if your proposal provided an alternative and better framework for these measures!

Since ancillary variables are like data variables, I think we could allow them to have cell_methods. As in the discussion of 2013, I believe that cell_methods would be a good place to identify the variable as a measure of uncertainty. This would mean expanding the idea of what cell methods is for. At the moment its role is to describe how the data represents statistical variation of the geophysical quantity within the cells. It seems to me that this can encompass uncertainty as well if we regard that as being variation over different realisations of the cells.

If the uncertainty comes from repeated measurement of a quantity with the same spatiotemporal coordinates, you might really add a dimension which runs over the individual measurements. This is exactly like an ensemble of model runs e.g. float air_temperature(time,lat,lon,realization), where realization is the sample dimension. Then if you calculated the standard deviation of the sample in each spatiotemporal cell, it would have cell_methods="realization: standard_deviation". The collapsed realization dimension, now of size 1, could be dropped, because realization is also a standard name, and hence the cell_methods implies that a standard deviation was computed over the entire set of realizations, about which no information is retained (Section 7.3.4).
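
Sketched in CDL (dimension names and the collapsed variable's name are illustrative only):

float air_temperature(time, lat, lon, realization) ;   // the sample of repeated measurements
  air_temperature:standard_name = "air_temperature" ;
  air_temperature:units = "K" ;
float air_temperature_sd(time, lat, lon) ;   // collapsed size-1 realization dimension dropped
  air_temperature_sd:standard_name = "air_temperature" ;
  air_temperature_sd:units = "K" ;
  air_temperature_sd:cell_methods = "realization: standard_deviation" ;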

Most of your examples of uncertainty are mathematically described as standard deviations. I think they are actually standard errors in the statistical sense: "The standard error (SE) of a statistic is the standard deviation of its sampling distribution or an estimate of that standard deviation" (wikipedia). I note that the GUM doesn't use that term, and probably "experimental standard deviation" is the same concept, isn't it? I think it's confusing to call it a standard deviation, however, because it is not the SD of the sample; it's divided by sqrt(N). I would prefer standard_error as a new cell_method, also for consistency with the standard name modifier that has the same meaning, and allowing us to use the existing standard_error_multiplier attribute, as David mentioned, instead of a standardised comment in cell methods, as you suggest.

All the above leads me to suggest a syntax such as cell_methods="uncertainty: standard_error" for an uncertainty that is mathematically treated as an SD, like most of your examples. In this syntax, standard_error would be a new cell method, and uncertainty would be a new special keyword, rather like realization in meaning, as above, but not requiring the idea of a collapsed dimension.
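
Under that suggestion (which introduces a keyword and a cell method that do not yet exist in CF), an uncertainty ancillary variable might be written:

float air_temperature_uncert(time, lat, lon) ;
  air_temperature_uncert:standard_name = "air_temperature" ;
  air_temperature_uncert:units = "K" ;
  air_temperature_uncert:cell_methods = "uncertainty: standard_error" ;   // proposed new keyword and method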

You would also like to be able to provide intervals when not symmetrical. That could be done by adding a size-one dimension for probability or percentile, with bounds to specify the interval e.g. air_temperature(time,lat,lon,probability), where probability is a size-one coordinate or scalar coordinate variable. This could be identified with a syntax such as cell_methods="probability: expanded_uncertainty". I think that's the term the GUM uses, isn't it? It could also be called e.g. uncertainty_bounds. The GUM deprecates "confidence interval". An interval which contains all conceivable values is one which spans probability 0.0 to 1.0.

So far this is all about describing the mathematical nature of the uncertainty. You also want to describe what it represents. You do this with the standard name, which David and I both think wouldn't work. Could you do this with standardised comments in the cell methods? For instance, you could add (statistical) and (subjective) for the GUM's Type A and B. The GUM says, "a Type A standard uncertainty is obtained from a probability density function (C.2.5) derived from an observed frequency distribution, while a Type B standard uncertainty is obtained from an assumed probability density function based on the degree of belief that an event will occur, often called subjective probability. Both approaches employ recognized interpretations of probability." I think that if the uncertainty is unqualified it should be assumed to be the "combined" or total uncertainty. That is consistent with the convention in CF standard names that an unqualified name means everything is included.

I think that's enough for now! I wonder what you think.

Best wishes

Jonathan

@kenkehoe
Author

Hi Jonathan,

Thanks for the reply. You have proposed some good ideas to ponder.

I like your idea to use statistical and subjective for the Type A and Type B terms. I'll incorporate that suggestion.

I understand the argument against using standard_name because of the canonical units issue. If that is a deal-breaker we can find another method.

I will need to think about the cell_methods suggestions some more. But I can say I'm not excited about that option. I've been pushing the use of cell_methods with my institution and it's not being accepted well. It is often quite difficult to encapsulate a description of the process into the cell_methods attribute, and most of my colleagues don't want to add that attribute. I spend a lot of time ensuring other critical attributes make it into the datasets, so I've been pushing less hard lately. I also see this as a slippery slope. If cell methods becomes required (in this case it is THE metadata to indicate uncertainty) then we should require it for other metadata variables. I'd prefer to require a less complicated way to indicate a type of data value.

I agree we should not add new attributes if an existing attribute already exists. My main concern is to find a method that is simple to write and simple to understand for our data users. Most of our data products have different types of methods of uncertainty, and I need to find a solution that works for all of them. Most data users will not care how the uncertainty value was derived, only that the institution which created the data file is providing its best-guess uncertainty estimate. They will then use that uncertainty estimate in their work. Often the uncertainty value provided will not be of the form most desirable for their research, but if that is all the researcher is provided they will find a way to use it. Therefore, I'm trying not to require the uncertainty to be defined by the details of the method used to derive it.

Thanks,

Ken

@JonathanGregory
Contributor

Correction to what I wrote yesterday:

You would also like to be able to provide intervals when not symmetrical. That could be done by adding a size-one dimension for probability or percentile, with bounds to specify the interval e.g. air_temperature(time,lat,lon,probability), where probability is a size-one coordinate or scalar coordinate variable. This could be identified with a syntax such as cell_methods="probability: expanded_uncertainty". I think that's the term the GUM uses, isn't it? It could also be called e.g. uncertainty_bounds.

Since the uncertainty variable is like a data variable, it doesn't have bounds. I was getting confused. Here, probability should be a dimension of size two, with a coordinate variable probability(probability) to contain the probability values, such as 0.05, 0.95 for a 5-95% interval. We would need to introduce some appropriate standard name for this "probability", to indicate its role in defining a confidence limit.
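
A sketch of the corrected encoding, assuming the 5-95% example; the coordinate's standard name is deliberately left open, as noted above, and expanded_uncertainty is only a proposed method:

dimensions:
  time = 100 ;
  probability = 2 ;
variables:
  double probability(probability) ;
    probability:long_name = "probability defining the confidence limits" ;
  float air_temperature_uncert(time, probability) ;
    air_temperature_uncert:units = "K" ;
    air_temperature_uncert:cell_methods = "probability: expanded_uncertainty" ;   // proposed new method
data:
  probability = 0.05, 0.95 ;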

@JonathanGregory
Contributor

Dear @kenkehoe

I would encourage you to read http://mailman.cgd.ucar.edu/pipermail/cf-metadata/2013/006106.html if you haven't, because there's a lot of discussion there about the advantages and disadvantages of cell_methods and comparison with standard names and modifiers. Many views are expressed which are relevant to the present issue, including the use of cell_methods for the kind of purpose I've suggested above.

The cell_methods is an important part of CF metadata. It's really not an optional extra, because a lot of data does not have the "default" cell methods, which imply an instantaneous point measurement for an intensive quantity. To describe a time mean or spatial mean requires cell_methods, for instance. The CF standard recommends that cell_methods should always be included. The conformance document says, "If a data variable has any dimensions or scalar coordinate variables referring to horizontal, vertical or time dimensions, it should have a cell_methods attribute with an entry for each of these spatiotemporal dimensions or scalar coordinate variables," and the CF checker should give a warning if it's not included.

The cell_methods and standard_name are both optional in CF. They're not alternatives. They provide complementary parts of the metadata. Suggestions have been made several times for adding a further attribute that would encapsulate standard_name, cell_methods and other essential simple metadata into a single string, which could be a new extra attribute - redundant in content with others (which is a hazard), but convenient for searching (which is an advantage). This idea was elaborated in the common concepts and standard string discussions, both several years ago, but neither was eventually agreed. However, that's a separate topic from this issue.

Best wishes

Jonathan

@larsbarring
Contributor

I am trying to follow this issue, but don't expect to have much to add. However, the pointer to this email thread (thanks @JonathanGregory) particularly caught my eye because I am looking at cell methods in a different context (climate indices). In the email thread the question is asked what a statistician would say about the difference (or not) between a mean and a standard deviation. I am pretty sure the answer would be along the lines that the former is a measure of location and the latter is a measure of spread. Whether they are 'geophysical quantities' or 'derived statistics' is another matter, the point is that they are not commensurate.

And measures of spread have a lot in common with uncertainty, even though uncertainty is a much more advanced and theoretically sophisticated concept. So, with regard to this issue I think it is relevant to revisit the list of cell methods and clarify which are measures of location and which of spread, and how they relate to standard names and their canonical units. In particular the latter is important for temperature. This is something that I raised in another issue.

I do not in any way want to hijack this thread, only to point out that there are aspects of standard names, standard name modifiers, and cell methods that are relevant for this thread but might be better separated into a conversation over in that issue.

@kenkehoe
Author

Jonathan,

Thanks for the link to the standard_name vs. cell_methods discussion. I found some useful discussions in there. It was nice to see others have the same confusion about standard_error being a standard_name modifier (only) while standard_deviation is part of cell_methods. I see Jonathan indicates the standard error is not in cell_methods because it does not relate to a particular dimension. If this is true then I don't see how we can rely on cell_methods to define uncertainty when standard error is the most common method to estimate uncertainty. This could result in the description of the method living in two different locations depending on the case (for standard error, look in standard_name; for anything else, look in cell_methods).

I've had a really hard time trying to understand why CF recommends that a statistical process that changes the essence of the values from a mean to a standard deviation should use the same standard_name. Jim points out this is quite confusing, and it assumes a person or the software would always need to analyze multiple pieces of metadata just to determine that a value is a standard deviation and not a mean or an instantaneous value.

I think we should stick with Ken's idea that the standard_name (with modifiers) defines the data and the cell_methods is a second-order description of what is really in the cells. This is how I have always interpreted the difference.

As I go through this email thread I think I should return to my original proposal, which follows Jim's idea of using standard name modifiers, and point to cell_methods for information on what sort of operation was performed. This follows David's interpretation of a standard name modifier as something that further describes a data variable beyond the initial description by the standard name. We could limit the standard name modifiers to total_uncertainty, random_uncertainty, and systematic_uncertainty, and then expect users to look in cell_methods for the other details.
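
To show what that alternative might look like in practice, here is a sketch using one of the proposed modifiers; random_uncertainty is a proposed modifier only, not something in the current standard name table or modifier list:

float air_temperature_uncert(time) ;
  air_temperature_uncert:standard_name = "air_temperature random_uncertainty" ;   // proposed modifier, hypothetical
  air_temperature_uncert:units = "K" ;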

I'm still concerned with the need Steve pointed out to balance a technically correct description of the values with ease of use. Since this method will be used by software to discover data, and most users will still use an uncertainty estimate even when it does not align with their understanding of uncertainty if presented with no other option, we should make the process of discovering a measurement's uncertainty estimate as simple as possible.

The other issue I see is that cell_methods (as far as I can tell from the CF document) is used to describe how the values were derived using the other data in the file. So a variable's standard deviation can be simply explained in cell_methods by indicating the dimension the operation was performed over and what statistical process was performed on the data. But this will typically not be how the uncertainty estimate was derived. Most uncertainty estimates will include, or be based entirely on, data that is not part of the provided data file. Or the estimate of uncertainty is provided by an equation from an instrument manufacturer that is not just a simple standard error calculation. Trying to put that description of operations into cell_methods is currently not possible.

My program attempted to get an uncertainty estimate for all our primary measurements and present them in a single location. This was a huge task that took many resources to produce. We ended up with a simplified PDF document. As you can see in its appendix, there are many different methods used to derive an uncertainty estimate. Many of the estimates are from vendors with proprietary methods they are not willing to share, or are too complicated to put into cell_methods. A majority of the uncertainty estimates are single values. My goal is just to provide a simple method to give data users the current best estimate of uncertainty and not bog them down with too much detail.

Thanks for the links and discussion,

Ken

@JonathanGregory
Contributor

Dear Ken

Yes, the standard_error modifier is not in cell_methods because it doesn't relate to a particular dimension. But considering (as I said above) that the statistical standard error of a quantity is 1/sqrt(N) times the standard deviation over an imaginary realization dimension which indexes the observations of the quantity, I think it's very similar to a cell method.
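
In symbols, with s the sample standard deviation of the N values along that imaginary realization dimension (a standard textbook relation, added here only for reference):

\mathrm{SE} = \frac{s}{\sqrt{N}}, \qquad
s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2}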

My suggestion above is to define standard_error as a new cell method, and introduce a special keyword of uncertainty which would go in place of the dimension, indicating a statistic calculated over a set of realisations, thus making cell_methods="uncertainty: standard_error". There would be no need to use the standard_error modifier.

I think some of the comments in the previous email discussion, as well as yours and @larsbarring's, may be partly addressed by keeping in mind that the standard_name is not the entire description of the quantity. The cell_methods is often a very important qualifier. For example, standard_name="air_temperature" with cell_methods="time: mean longitude: maximum" means the zonal maximum of the time-mean air temperature. I appreciate the point that a standard deviation "feels" a bit different from a mean, and a variance even more so because of its different units, but I can't see a clear justification for treating these two differently from mean, median, mode or other cell methods - they are all statistics which characterise variation of a single quantity over its dimensions.

I agree that what I sketched above is insufficient to deal with your more complex description of uncertainty computations. I didn't write any more because I'd already written quite a lot! I agree that some further attributes may be needed to provide information about how the uncertainty is derived.

Best wishes

Jonathan

@larsbarring
Contributor

larsbarring commented Aug 24, 2021

Dear Jonathan,

standard deviation "feels" a bit different from a mean, and a variance even more so because of its different units, but I can't see a clear justification for treating these two differently from mean, median, mode or other cell methods - they are all statistics which characterise variation of a single quantity over its dimensions

I have a hard time understanding how mean, median or mode can be used to characterise variation. For each of these statistics one can (in principle) use one data value to argue that it is the best available estimate of the corresponding true unobserved value "out there". However, for standard deviation, variance and standard error this is clearly not possible. So, I would say that there is a fundamental difference. In the light of this issue, mean, median or mode do not (as far as I understand) give any information that can be related to uncertainty, in contrast to standard deviation, variance or standard error.

Kind regards,
Lars

@JonathanGregory
Contributor

Sorry to be unclear. Let me try again. The default cell methods (point for intensive quantities, sum for extensive) are for a quantity which in principle has a single value within the gridcell, either exactly at the coordinate (for point) or being a property of the entire cell (for sum). All the other cell methods (mean, median, maximum, standard_deviation, root_mean_square, mean_absolute_value etc.) are statistics which are computed from the subgridcell variation of the quantity. Best wishes, Jonathan

@kenkehoe
Author

Jonathan,

While I understand the intention of cell_methods to describe the modified values so the standard_name list does not grow exponentially, I don't agree with the method. With the general concept of uncertainty so poorly defined, I don't see how we can require listing all possible methods for a user to add in a cell_methods = "uncertainty: <method>" syntax. Here are two variables from a new data product I just reviewed. The creator just wants to report the random and systematic uncertainty for the data users. The data user just wants to take the data creator's best guess at the random and systematic uncertainty and use it. Requiring the data producer to explain the full process in a cell_methods attribute that the data user will only glance at is an excessive requirement on the data producer, if it is even possible.

float calibration_e_LH_uncertainty_random(time) ;
  calibration_e_LH_uncertainty_random:long_name = "Random uncertainty in calibration_e_LH" ;
  calibration_e_LH_uncertainty_random:units = "1" ;
  calibration_e_LH_uncertainty_random:comment = "The random uncertainty is derived from the elastic high channel signal, the elastic low channel signal, and their random uncertainties" ;
  calibration_e_LH_uncertainty_random:missing_value = -9999.f ;

float calibration_e_LH_uncertainty_systematic(time) ;
  calibration_e_LH_uncertainty_systematic:long_name = "Systematic uncertainty in calibration_e_LH" ;
  calibration_e_LH_uncertainty_systematic:units = "1" ;
  calibration_e_LH_uncertainty_systematic:comment = "The systematic uncertainty is derived from the elastic high channel signal, the elastic low channel signal, and their systematic uncertainties" ;
  calibration_e_LH_uncertainty_systematic:missing_value = -9999.f ;

I am now leaning towards using a standard name modifier of uncertainty and distinguishing the systematic, total, and random qualifiers in the cell_methods. That was the original intent of the standard_error standard name modifier. I am now suggesting adding uncertainty as a standard name modifier. I believe the whole reason for a standard file format is to aid in categorizing the information for easy understanding and parsing. If we are making the process of determining whether a variable is an uncertainty more difficult than just searching a comment attribute for the word uncertainty, we are making things too complicated. An all-encompassing and correct method of describing a variable as an uncertainty may be more accurate and succinct for the power users, but it alienates the majority of the other data users.

Thanks,

Ken

@JonathanGregory
Contributor

Dear @kenkehoe

You write

Requiring the data producer to explain the full process in a cell_methods attribute that the data user will only glance at is an excessive requirement on the data producer, if even possible

and I agree with that. I didn't suggest such a requirement myself. In #320 (comment) I wrote,

I agree that what I sketched above is insufficient to deal with your more complex description of uncertainty computations. ... I agree that some further attributes may be needed to provide information about how the uncertainty is derived.

The detailed description which you give in these two examples is probably too cumbersome for cell_methods, as you say, and maybe comment is where it should go, as you have shown. Omitting the comment and missing_value and using shorter variable names for brevity, my suggestion for these examples would be

float random(time);
   random:standard_name = "something";
   random:long_name = "Random uncertainty in calibration_e_LH";
   random:units = "1";
   random:cell_methods = "uncertainty: standard_error (statistical)";

float systematic(time);
   systematic:standard_name = "something";
   systematic:long_name = "Systematic uncertainty in calibration_e_LH";
   systematic:units = "1";
   systematic:cell_methods = "uncertainty: standard_error (subjective)";

I don't know what quantity this is so I can't suggest the standard name! But it doesn't need a modifier in this suggestion.

If we are making the process of determining if a variable is an uncertainty more difficult than just searching a comment attribute for the word uncertainty, we are making things too complicated.

The proposal is that to determine whether a variable contains an uncertainty you would search the cell_methods attribute for the word uncertainty:. That's just the same level of complexity as you advocate. The next word in the attribute states the kind of statistic which is used to measure the uncertainty (standard_error in this case).

The next word (the comment in ()) indicates whether it is random or systematic (which I called statistical and subjective following the GUM, which you cited). As I said before, the assumption elsewhere in CF (standard names in particular) is that the absence of a qualification implies the "whole thing". Following this, if the cell_methods uncertainty entry contains neither statistical nor subjective, we assume it is the "total" error.

Best wishes

Jonathan

@kenkehoe
Author

Jonathan,

Thanks for the suggestion. I can see how using cell_methods = "uncertainty: ..." would work, but the main issue is that we don't know the method. The method is not standard_error, so the information provided in cell_methods in your example is incorrect. I would be on board with moving the statistical or subjective term into the method location, as that does not name a specific mathematical process.

float random(time);
   random:long_name = "Random uncertainty in calibration_e_LH";
   random:units = "1";
   random:cell_methods = "uncertainty: statistical (random)";

float systematic(time);
   systematic:long_name = "Systematic uncertainty in calibration_e_LH";
   systematic:units = "1";
   systematic:cell_methods = "uncertainty: subjective (systematic)";

I think there is confusion between the need to describe the process (statistical vs. subjective) and the need to describe the type (total, random, systematic, specific random, specific systematic). According to the GUM, both classifiers are needed for describing uncertainty. Following the idea that the absence of a qualifier defaults to something, the default would be "(total)".

With this method I would suggest that the data user search cell_methods for a string starting with "uncertainty: " as the indicator that a variable is an uncertainty variable. This would be equivalent to searching for a standard_name ending in "uncertainty". Then the keywords random or systematic in parentheses would indicate the type; absence of these words would mean total. This would require the addition of subjective and statistical to the cell_methods method list.

I feel we are getting close to a solution.

Thanks,

Ken

@JonathanGregory
Contributor

Dear @kenkehoe

I suggested the keyword standard_error as the cell method for the uncertainty because, if I understand correctly, the number which is provided will be used mathematically as if it were a standard deviation quantifying the uncertainty distribution. For example, 2.3.1 of the GUM says "standard uncertainty: uncertainty of the result of a measurement expressed as a standard deviation."

I think that standard_error would be a better choice for it than the existing cell method of standard_deviation because, when it's computed as a random statistical error, it is actually the standard error of the mean, isn't it, i.e. it's multiplied by 1/sqrt(N). Therefore it's a different calculation from a standard deviation, which doesn't have that factor.

The standard error is not the only statistic you might use for an uncertainty distribution. Confidence limits are an alternative, for example (the "expanded uncertainty" of GUM 6.2), and that would be a different cell method again.

I had assumed that random errors are evaluated statistically (Type A in GUM 2.3.2) and systematic errors are evaluated subjectively (Type B in GUM 2.3.3). However, if any combination is allowed, as you say, I agree that they should be separately indicated. I think they should all be put in the comment in () after the cell method. That is, the comment could begin with one of statistical and subjective (but not both), and one of random and systematic (but not both), in either order.

I'm glad you think we are making progress. I agree. Best wishes

Jonathan

@JonathanGregory
Contributor

JonathanGregory commented Sep 16, 2021

PS and if it contains neither random nor systematic it must be the combined or total uncertainty.

@JonathanGregory
Contributor

Dear @kenkehoe

A couple of years ago we made some progress with this issue that you raised about the description of uncertainty in CF. Do you have any time to continue with this, or is someone else able to pursue it?

Best wishes

Jonathan

@JonathanGregory
Contributor

I propose to close this issue, labelled dormant, if no-one makes a new contribution in the next three weeks, before 18th March.

@DocOtak
Member

DocOtak commented Feb 29, 2024

I just got back from the AGU Ocean Science Meeting, where uncertainty and how to report it (not from a technical standpoint, though) received a lot of attention in a few sessions, both modeling and observational. I would be reluctant to close the issue, even if it is dormant.

@JonathanGregory
Contributor

Dear Andrew @DocOtak

I agree it's a potentially useful enhancement to make, and we have made some progress in this issue. Closing it doesn't mean deleting it, of course. It could always be reopened if there is a new contribution to make. The motivation for closing dormant issues is to clarify our view of the truly active issues, which helps with managing them. We could put a separate link on the discussion page to produce a list of dormant issues, if that would be useful.

Best wishes

Jonathan

@JonathanGregory added the dormant label Mar 18, 2024
@JonathanGregory
Contributor

Three weeks have passed with no new contribution to the discussion, so I am closing it as dormant. However, it can be reopened in order to continue the discussion. Following Andrew @DocOtak's comment, in website issue 467 I have made a new link in the CF discussion page that will produce a list of dormant conventions issues, so they aren't forgotten about when no longer visible on the default view. I hope that helps.
