Clarification of cell_methods #414

Open
taylor13 opened this issue Nov 23, 2022 · 10 comments
Labels
enhancement Proposals to add new capabilities, improve existing ones in the conventions, improve style or format

Comments

@taylor13

taylor13 commented Nov 23, 2022

Moderator:

@bnlawrence

Last updated:

2022-11-22 (initiated proposal)

Requirement Summary

The current description of the cell_methods attribute is unclear and sometimes less definitive than it could be. Changes are proposed to remedy this.

Technical Proposal Summary

Rewording of text of conventions is proposed to provide better guidance on how cell_methods should be defined.

Benefits:

Those writing and reading CF-compliant data will have clearer guidance and more definitive rules for interpreting the cell_methods.

Status Quo:

???

Associated pull request:

None yet.

Detailed Proposal

For more than 15 years now I have had trouble understanding exactly how to define cell_methods that correctly describe variables included in the CMIP request for model output. I have recently been reviewing the variables defined for CMIP6, in preparation for a possible CMIP7. Again, I'm not sure if we've defined cell_methods consistent with the intentions of those requesting the variables. I suspect others have also had difficulties correctly defining their cell_methods. Below I suggest specific rewording of the CF conventions text, and in a few places define default interpretations of the cell_methods that specify more definitively how they should be interpreted. I include further rationale for the suggested changes below.

The "cell_methods" section opens with:
Screen Shot 2022-11-22 at 4 19 49 PM
I have offered only minor non-substantive edits in the above, but I think the rewording reads better.


@JonathanGregory copied this part to #447

I then suggest inserting a paragraph while not modifying the paragraph immediately following:
Screen Shot 2022-11-22 at 5 17 14 PM
The inserted paragraph explains how by default the grid-cell values have been computed from the contributing samples. This greatly reduces the need to include the so-called "non-standardized information" regarding the cell_methods.


I next suggest removing the next two paragraphs:
Screen Shot 2022-11-22 at 5 29 50 PM
We advise at the end that users not rely on the default (different) treatments of intensive and extensive variables, and I suspect no careful data writer has relied on this. Furthermore, I think most readers simply give up trying to understand these paragraphs, so why not delete them?

I have no suggested changes to the example (7.5) that follows.

In section 7.3.1, I have made no changes to the 1st paragraph, but I have edited the 2nd paragraph to improve clarity.

Screen Shot 2022-12-02 at 4 12 09 PM
image


@JonathanGregory copied this part to #447

In the 3rd paragraph of 7.3.1 I define default weighting for 2-d (area) means and 3-d means, which also apply to other statistics involving sums. Without this default specification of weighting, data writers would have to provide parenthetical non-standardized information for most of the variables they write.
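As an illustration of why a default weighting matters, here is a minimal sketch comparing an area-weighted mean with an unweighted one. The function name `weighted_mean` and the sample numbers are hypothetical and are not taken from the conventions text; the sketch assumes only that each value comes with the area of its cell.

```python
def weighted_mean(values, weights):
    """Return the mean of `values`, each weighted by the matching entry of `weights`."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical data: three cells, the last one twice as large as the others.
temps = [280.0, 290.0, 300.0]
areas = [1.0, 1.0, 2.0]

area_mean = weighted_mean(temps, areas)                 # weighted by cell area
simple_mean = weighted_mean(temps, [1.0] * len(temps))  # every cell counts equally
```

With these numbers the two means differ (292.5 versus 290.0), which is the kind of ambiguity a stated weighting is meant to remove.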


The next subsection (7.3.2), I suggest, should be placed after section 7.3.3. The reason is that this sub-section discusses how to record supplemental (sometimes non-standardized) information about the method. It seems to me that is much less important and less-often used than specifying what portions of a cell are reported on by a statistic, which is the subject of the current subsection 7.3.3. I have thus renumbered that section 7.3.2, and also suggest the following changes to its first two paragraphs:
Screen Shot 2022-11-22 at 5 44 15 PM
The first paragraph has been modified to make it clear that "where" can also apply to the time dimension. The other changes add clarity (I hope) to exactly how to apply "where" in practice.

I have made no changes to the next paragraph or Example 7.7 that follows it, which should, however, now be renumbered Example 7.6. Here is the unaltered text (without the example):
Screen Shot 2022-11-22 at 5 52 00 PM

Within the current example 7.7 is some text that really belongs outside (following) the example. I have suggested a few changes to that text:
Screen Shot 2022-11-22 at 5 56 31 PM
Most of the suggested edits follow this discussion.

The next section number 7.3.2 would become section 7.3.3 in my revision. It is largely unchanged except for the third paragraph where the example described has been modified since the original example is now already handled by the default weighting imposed in section 7.3.1.
Screen Shot 2022-11-22 at 7 37 37 PM
I suggest no changes to the remaining section 7.3.4.

In summary, the proposal is to:

  1. Introduce default interpretations on how weighting of means and other statistics should be applied. I think this constitutes the most important substantive change because existing datasets may have relied on the vagueness of the convention (to this point) in accommodating a different weighting (i.e., no default specification of the weighting), which would be at odds with what is proposed here. @JonathanGregory copied all of the material related to this point into the separate issue 447
  2. Reorder subsections 7.3.2 and 7.3.3.
  3. Delete two paragraphs that say what to assume about cell methods if the cell_methods is not specified. I don't think the default interpretation should hold. Users should explicitly say what the cell_methods is.
  4. Explicitly indicate that the "where" directive can be used for the time dimension as well as spatial dimensions and that "where" can sometimes be interpreted as "when".
  5. Revise some of the text and modify some of the examples for clarity and for readability.
@taylor13 taylor13 added the enhancement Proposals to add new capabilities, improve existing ones in the conventions, improve style or format label Nov 23, 2022
@bnlawrence

Karl, since you proposed it, I am happy to do the moderation!

@taylor13
Author

Thanks @bnlawrence !

@taylor13
Author

taylor13 commented Dec 2, 2022

I just noticed that I omitted my suggested revisions to section 7.3.1. I'll add those shortly.
[Now done; 12/2/22]

@taylor13
Author

Perhaps we should also mention in the renumbered section 7.3.2 that the default area-weighting applied in this case refers to the area of the portion of the cell indicated by the "where" directive.
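One possible reading of this point, sketched under stated assumptions: for a statistic such as "area: mean where land", the weight of each cell would be its area multiplied by the fraction of the cell occupied by land. The function name `mean_where` and the numbers below are hypothetical, and each value is assumed to represent only the land portion of its cell.

```python
def mean_where(values, cell_areas, fractions):
    """Mean over the selected portion of each cell.

    Each cell is weighted by cell area times the fraction of the cell
    covered by the selected surface type (an assumed interpretation).
    """
    weights = [area * frac for area, frac in zip(cell_areas, fractions)]
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("no cell contains the selected portion")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical data: equal-area cells that are all land, half land, and all sea.
values = [10.0, 20.0, 30.0]
areas = [100.0, 100.0, 100.0]
land_fraction = [1.0, 0.5, 0.0]

land_mean = mean_where(values, areas, land_fraction)
```

The all-sea cell contributes nothing, and the half-land cell counts half as much as the all-land cell.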

@taylor13
Author

About 5 months ago, I got started trying to create a pull request for the above proposed changes. Not knowing what I was doing, I created https://github.com/cf-convention/cf-conventions/blob/taylor13cell_methods_edits/ch07.adoc . I think @davidhassell gave me some advice offline on how to do things correctly and how to proceed, but I think I've lost that email. I need advice on how to proceed and get the proposed changes implemented. I think the revised version of the cell_methods section pointed to earlier in this paragraph reflects the text above but may need some further editing to clean up the way other portions of the conventions document are referenced.

As already noted, these changes were introduced to try to clarify how exactly to interpret cell_methods for the purpose of clearly defining variables requested as part of CMIP. I would really like to see this happen before the end of summer or sooner since they are needed for CMIP7, with work on the data request well underway.

@JonathanGregory
Contributor

Dear Karl @taylor13

Thanks for your proposal. It is regrettable that it hasn't received any substantive comments up to now. For me, the reason for not commenting is that it's rather a large proposal, and there have always been smaller issues to be considered. Maybe this reason applies to others as well. I suggest that this difficulty could be mitigated by splitting up the issue into smaller pieces. At the end of your first contribution, you've helpfully set out a summary in five points. These could logically constitute five separate issues, which are of different sorts, and some could be resolved more quickly than others. If they're all discussed in one issue, I suspect that multiple threads of discussion will become entangled.

Of your five, I agree with you that the first is the largest and most substantial change, and perhaps this is the one which you would like to see most urgently.

  1. Introduce default interpretations on how weighting of means and other statistics should be applied. I think this constitutes the most important substantive change because existing datasets may have relied on the vagueness of the convention (to this point) in accommodating a different weighting (i.e., no default specification of the weighting), which would be at odds with what is proposed here.

At the moment, as you say, there is no information in cell_methods about what weighting should be assumed. Weighting is certainly an aspect of the statistical computations which are described by cell_methods, and it makes sense to include it. If we define default interpretations, as you suggest, new data and old data with the same cell_methods would have different interpretations, since the old data has undefined weights, but the new data has defined weights. Although this is not strictly a backwards incompatibility, it is a potential pitfall for data-users of the kind that principle 9 in Sect 1.2 says we should avoid:

  9. Because many datasets remain in use for a long time after production, it is desirable that metadata written according to previous versions of the convention should also be compliant with and have the same interpretation under later versions.

Therefore, rather than defining defaults, I think we should introduce new syntax for indicating the weights explicitly. If no weighting was indicated, it would mean the same as now i.e. undefined. The syntax could be e.g. "name: method weighted_by keyword", or perhaps just by instead of weighted_by, where keyword would be chosen from a new list of possibilities, such as extent for the simple weighting by the size of the cell calculated as the difference between its bounds, unity if all cells have the same weight, or mass for mass-weighting.
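To make the suggested syntax concrete, here is a rough sketch of how a single entry of the form "name: method weighted_by keyword" might be parsed. The `weighted_by` clause and the keywords `extent`, `unity` and `mass` are only the proposal floated in this comment and are not part of any CF release; the function `parse_entry` is hypothetical, and the sketch ignores the rest of the cell_methods grammar (multiple entries, "where", "within", parenthesised comments).

```python
import re

# Keywords suggested in the comment above; an assumption, not CF vocabulary.
PROPOSED_WEIGHT_KEYWORDS = {"extent", "unity", "mass"}

# One entry only: "name: method" optionally followed by "weighted_by keyword".
_ENTRY = re.compile(
    r"^(?P<name>\w+):\s+(?P<method>\w+)"
    r"(?:\s+weighted_by\s+(?P<weight>\w+))?$"
)


def parse_entry(entry):
    """Split one simplified cell_methods entry into name, method and weight."""
    match = _ENTRY.match(entry.strip())
    if match is None:
        raise ValueError(f"unrecognised entry: {entry!r}")
    weight = match.group("weight")
    if weight is not None and weight not in PROPOSED_WEIGHT_KEYWORDS:
        raise ValueError(f"unknown weighting keyword: {weight!r}")
    # A weight of None means the weighting is undefined, as it is today.
    return {
        "name": match.group("name"),
        "method": match.group("method"),
        "weight": weight,
    }


explicit = parse_entry("area: mean weighted_by extent")
legacy = parse_entry("time: mean")
```

An entry without the clause parses with `weight` set to `None`, which mirrors the idea that existing data keep their current (undefined) interpretation.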

Best wishes

Jonathan

@taylor13
Author

Thanks, Jonathan, for your input on how weighting might be included without violating principle 9. We would want to consider whether to include it within the parentheses (the way we include "interval:") or whether it would follow directly the "where" directive. Also, we need to think about what "key words" would be needed and the procedure for expanding the list if need be (e.g., "weighted_by mass" might not be specific enough; might need "weighted_by mass_of_snow", or "weighted_by mass_of_seaice", etc.)

You are right that it is clearly specifying the weights that is highest priority, although summary point 4 is also particularly urgent and doesn't require any major revisions.

Thanks for responding and suggesting we consider the 5 summary points one at a time in hopes of provoking additional input.

@bnlawrence

I think breaking this up into multiple issues is an excellent idea, and will likely speed things up. If we do go that route, there is another (slightly tangential) issue to consider, and that is the difference between cell methods and cell output frequency. The latter appears nowhere in CF and needs to be inferred by examining the time coordinate, and so in CMOR and XIOS other ways have been introduced to guide the user (e.g. 3hr in the filename, or interval-write in the attributes). This is intimately related to the interval discussion above. It would be good to be clear about that relationship in whichever sub-issue picks this up.
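As a small illustration of what inferring the output frequency from the time coordinate can involve, the sketch below checks whether successive time values are evenly spaced. The function `infer_output_interval` and the sample values are hypothetical; it assumes the time coordinate is already a list of numbers in a single unit (here, days since some reference).

```python
def infer_output_interval(time_values, tolerance=1e-9):
    """Return the constant spacing of `time_values`, or None if it is irregular."""
    if len(time_values) < 2:
        return None
    steps = [later - earlier for earlier, later in zip(time_values, time_values[1:])]
    first = steps[0]
    if all(abs(step - first) <= tolerance for step in steps):
        return first
    return None


# Hypothetical time coordinate in days: one value every 0.125 day (3 hours).
three_hourly = [0.0, 0.125, 0.25, 0.375]
interval_days = infer_output_interval(three_hourly)

# Unevenly spaced values give no single answer.
irregular = infer_output_interval([0.0, 1.0, 3.0])
```

Even when this works, it recovers only the spacing of the written values, not how those values were computed, which is the distinction raised in the comment.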

@larsbarring
Contributor

I agree that breaking up this issue would be helpful (maybe even necessary to make it more manageable). I was just reading section 7.3.2, "Recording the spacing of the original data and other information", when Bryan's comment popped in. And I totally agree that if/when we are dealing with this part of CF, the cell output frequency should be considered.

In 7.3.2 the following appears

Currently the only standardized information is to provide the typical interval between the original data values to which the method was applied, in the situation where the present data values are statistically representative of original data values which had a finer spacing. ... ... ... Recording the original interval is particularly important for standard deviations. For example, the standard deviation of daily values could be indicated by cell_methods="time: standard_deviation (interval: 1 day)" and of annual values by cell_methods="time: standard_deviation (interval: 1 year)".

If I understand these two examples correctly they are to be interpreted as
The standard deviation over some period, e.g. {month, season, year} | {decade, 30-year period}, given by a bounds variable, of daily | annual data.

Now, for the standard deviation calculation the daily/annual data is the input data, but is that to be interpreted as the "original data values"? In particular, I think that this would be problematic given that the default interpretation depends on whether the variable is intensive or extensive. E.g. what does it mean if it is an intensive quantity from a model?

Tentatively this could be clarified in the two examples above by rewriting them as
cell_methods="time: mean within days time: standard_deviation (interval: 1 day)"
cell_methods="time: sum within days time: standard_deviation (interval: 1 year)"
But that also brings the conversation in issue #197 into play.
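To illustrate why the interval matters for the standard deviations discussed above, here is a minimal sketch with made-up numbers. It compares the standard deviation of daily means with the standard deviation of the underlying sub-daily samples; the data and variable names are hypothetical, and the population standard deviation is used only for simplicity.

```python
from statistics import mean, pstdev

# Hypothetical samples: two days with four sub-daily values each.
samples = [
    [1.0, 3.0, 1.0, 3.0],
    [5.0, 7.0, 5.0, 7.0],
]

# First average within each day, then take the spread of the daily means.
daily_means = [mean(day) for day in samples]
std_of_daily_means = pstdev(daily_means)

# Alternatively, take the spread of all sub-daily values directly.
all_values = [value for day in samples for value in day]
std_of_samples = pstdev(all_values)
```

The two results differ (2.0 versus roughly 2.24 here), so a bare "time: standard_deviation" does not say which quantity was computed.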

So, to sum up, I think that it is necessary to

  • Break up the discussion to make it more focused, while still maintaining overall consistency across the separate parts.
  • Find a mechanism within CF to make a distinction between "input data" (as I tentatively called it earlier in this comment) and what Bryan calls "cell output frequency". Relying on having this recorded ad hoc and entirely outside CF does not seem in keeping with CF.
  • Consider the conversation in issue #197.

The cell methods construct is already complex and difficult to understand and interpret, which means that we have to be careful and keep different user communities' needs in mind when making changes.

@JonathanGregory
Contributor

I have created a new issue 447 for discussion of Karl @taylor13's first point, about weighting. I hope it's OK to continue with discussion of that point in #447 rather than here.
