Incorporating the CFA convention for aggregated datasets into CF #508

Open
davidhassell opened this issue Feb 7, 2024 · 29 comments · May be fixed by #534
Labels
CF1.12? (We might conclude this issue in time for CF1.12) · enhancement (Proposals to add new capabilities, improve existing ones in the conventions, improve style or format)

Comments

@davidhassell
Contributor

Incorporating the CFA convention for aggregated datasets into CF

Moderator

To be decided

Moderator Status Review [last updated: 2024-02-07]

  • New issue created on 2024-02-07

Requirement Summary

This is a proposal to incorporate the CFA conventions into CF.

CFA (Climate and Forecast Aggregation) is a convention for recording aggregations of data, without copying their data.

The CFA conventions were discussed at the 2021 and 2023 annual CF workshops, the latter discussion resulting in an agreement to propose their incorporation into CF.

By an “aggregation” we mean a single dataset which has been formed by combining several datasets stored in any number of files. In the CFA convention, an aggregation is recorded by variables with a special function, called “aggregation variables”, in a single netCDF file called an “aggregation file”. The aggregation variables contain no data but instead record instructions on both how to find the data in their original files, and how to combine the data into an aggregated data array. An aggregation variable will almost always take up a negligible amount of disk space compared with the space taken up by the data that belongs to it, because each constituent piece, called a “fragment”, of the aggregated data array is represented solely by file and netCDF variable names and a few indices that describe where its data should be placed relative to the other fragments (see examples 1 and 2).

Example 1: For a timeseries of surface air temperature from 1861 to 2100 that is archived across 24 files, each spanning 10 years, it is useful to view this as if it were a single netCDF dataset spanning 240 years.
[Figure: CFA_1]
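As a rough illustration of Example 1, the sketch below maps a calendar year to the fragment file that would hold it. It assumes one value per year and equal-length fragments; the constant and function names are hypothetical and are not part of CFA or cf-python.

```python
# Hypothetical layout for Example 1: 24 fragment files, each holding 10 years.
FIRST_YEAR = 1861
YEARS_PER_FRAGMENT = 10
N_FRAGMENTS = 24


def fragment_for_year(year):
    """Return (fragment index, offset within that fragment) for a calendar year."""
    offset = year - FIRST_YEAR
    if not 0 <= offset < YEARS_PER_FRAGMENT * N_FRAGMENTS:
        raise ValueError(f"year {year} is outside the aggregated time span")
    return divmod(offset, YEARS_PER_FRAGMENT)


first = fragment_for_year(1861)  # (0, 0): first element of the first fragment
last = fragment_for_year(2100)   # (23, 9): last element of the last fragment
```

An aggregation variable stores this kind of bookkeeping once, so that readers do not have to rediscover it from the 24 files.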

CFA has been developed since 2012 and is now a stable and versioned convention that has been fully implemented by cf-python for both aggregation file creation and reading.

Note that this proposal does not cover how to decide whether or not the data arrays of two existing variables could or should be aggregated into a single larger array. That is a software implementation decision. For instance, cf-python has an algorithm for this purpose. (We think that the cf-python aggregation rules are complete and consistent because they are entirely based on the CF data model.)

Storing aggregations of existing datasets is useful for data analysis and archive curation. Data analysis benefits from being able to view an aggregation as a single entity and from avoiding the computational expense of creating aggregations on-the-fly; and aggregation files can act as metadata-rich archive indices that consume a very small amount of disk space.

The CFA conventions only affect the representation of a variable’s data, and thus they work alongside all CF metadata, i.e. the CFA conventions do not duplicate, extend, nor re-define any of the metadata elements defined by the CF conventions.

An aggregation file may, and often will, contain both aggregation variables and normal CF-netCDF variables i.e. those with data arrays. All kinds of CF-netCDF variables (e.g. data variables, coordinate variables, cell measures) can be aggregated using the CFA conventions. For instance an aggregated data variable (whose actual data are in other files) may have normal CF-netCDF coordinate variables (whose data are in the aggregation file).

Another approach to file aggregation without copying data is NcML Aggregation, which has been extensively used. CFA is similar in intent to NcML but is more general and efficient, because it

  • keeps the CF metadata in the same place as the aggregation instructions;
  • allows aggregations over any number of dimensions in any array positions;
  • places no restrictions on netCDF elements that are not standardised by CF (such as variable names);
  • uses the binary netCDF format to speed up read times for large aggregations.

Technical Proposal Summary

The CFA conventions currently have their own document (https://github.com/NCAS-CMS/cfa-conventions/blob/main/source/cfa.md) which describes in detail how to create and interpret an "aggregation variable", i.e. a netCDF variable that does not contain a data array but instead has attributes that contain instructions on how to assemble the data array as an aggregation of data from other sources.

A Pull Request to incorporate CFA into CF has not been created yet. Before starting any work on translating the content of the CFA document into the CF conventions document, it is important to get the community’s consensus that this is a good idea, and about how the new content should be structured (e.g. a new section, a new appendix, both, or something else).

The main features of CFA are summarised in example 2, a CDL view of an aggregation of two 6-month datasets into a single 12-month variable (see the CFA document for details).

Example 2: An aggregation data variable whose aggregated data comprises two fragments. Each fragment spans half of the aggregated time dimension and the whole of the other three aggregated dimensions, and is stored in an external netCDF file in a variable called temp. The fragment URIs define the file locations. Both fragment files have the same format, so the format variable can be stored as a scalar variable.
[Figure: CFA_diagram_CF]

Benefits

Aggregations persisted to disk allow users and software libraries to access pre-created aggregations with no complicated and time-consuming processing.

Status Quo

Not being able to persist fully generalised aggregations to disk means that every user/software library has to be able to create their own aggregations every time the data files are accessed. This is a complicated and time-consuming task.

Associated pull request

None yet (see above).

CFA authors

CFA has been developed by David Hassell, Jonathan Gregory, Neil Massey, Bryan Lawrence, and Sadie Bartholomew.

Contributors to CFA discussions at the CF workshops

Chris Barker, Ethan Davies, Roland Schweitzer, Karl Taylor, Charlie Zender, and Klaus Zimmermann (please let us know if we have accidentally missed you off this list).

@davidhassell davidhassell added the enhancement Proposals to add new capabilities, improve existing ones in the conventions, improve style or format label Feb 7, 2024
@larsbarring
Contributor

David,
This will be an excellent and very useful addition to the CF Conventions! I have not yet wrapped my head around the technical details. There is one thing I do not quite understand. First you write:

The aggregation variables contain no data but instead record instructions on both how to find the data in their original files, and how to combine the data into an aggregated data array

And below Figure 1 you write:

Note that this proposal does not cover how to decide whether or not the data arrays of two existing variables could or should be aggregated into a single larger array.

Probably I am missing something here, but to me this seems contradictory? Anyway, that is a detail, and I think the more important questions are the ones you raise in the Technical Proposal Summary:

... incorporate CFA into CF ... that this is a good idea, ...

To me this is no doubt a good idea, which already has a strong community backing.

... how the new content should be structured (e.g. a new section, a new appendix, both, or something else).

Perhaps an outline somewhere in the main text: end of Chapter 2 regarding aggregation files and their relation to the fragment files, somewhere in Chapter 3 regarding aggregation variables? And then an exhaustive description in an Appendix?

This brings me to a more general thought that I have had for some time:
I think that the CF Conventions document is getting increasingly long and complex/difficult to get an overview of. The Table of Content takes 8 full screens (5 pdf pages), then 5 screens of Tables of tables/figures/examples (3 pdf pages). I have no idea how to improve upon this, but it becomes more and more of a concern as we add new features to the Conventions. However, this is not something to discuss and solve here in this enhancement proposal, but I wanted to bring it up here anyway.

@davidhassell
Contributor Author

davidhassell commented Apr 9, 2024

Thank you for your comments, Lars, and sorry that it has taken me some time to respond.

Even though you are the only person to have commented here (and in support), this proposal has been scrutinised carefully at two CF workshops, with a group decision being reached in 2023 to work towards incorporating CFA into CF. I'm therefore minded to move to writing the PR, now that Lars has made a good suggestion of how and where the content could go into the existing CF conventions. This shouldn't take too long, because it will largely be a "cut and paste" job from the existing CFA description, which was deliberately written in a CF-ish style in anticipation of this :).

The aggregation variables contain no data but instead record instructions on both how to find the data in their original files, and how to combine the data into an aggregated data array
...
Note that this proposal does not cover how to decide whether or not the data arrays of two existing variables could or should be aggregated into a single larger array.

Good point. The first statement applies to the reading of the data, and the second to the writing of the data. The CFA conventions do not give any guidance on the decision of how fragment files can be combined prior to creating an aggregation variable; rather, once you have an aggregation in mind, they provide a framework in which you can encode it in such a way that other people can decode it.

If I give you two datasets (A and B) then the CFA conventions won't give you any help in working out if A and B can be sensibly combined into a single larger dataset (C). There are various ways in which you could work this out yourself - you could inspect the metadata and apply an aggregation algorithm (e.g. this one, or by visual inspection), or base it on file names (e.g. I know that model outputs from March.nc and April.nc are safe to combine into a 2-month dataset), etc.

Perhaps an outline somewhere in the main text: end of Chapter 2 regarding aggregation files and their relation to the fragment files, somewhere in Chapter 3 regarding aggregation variables? And then an exhaustive description in an Appendix?

I like the idea of a Chapter 2 outline. I might suggest content from Introduction, Terminology, Aggregation variables, and Aggregation instructions (without its subsections) for Chapter 2, and everything else - which is most of the existing CFA document - (Standardized aggregation instructions, Non-standardized terms, Fragment Storage and examples) for the appendix.

The Table of Content takes 8 full screens (5 pdf pages), then 5 screens of Tables of tables/figures/examples (3 pdf pages).

Just a thought - the TOC currently shows all subsections - maybe it could be restricted to just one level of subsection, so for instance Chapter 7 would go from

[7. Data Representative of Cells](https://cfconventions.org/cf-conventions/cf-conventions.html#_data_representative_of_cells)
    [7.1. Cell Boundaries](https://cfconventions.org/cf-conventions/cf-conventions.html#cell-boundaries)
    [7.2. Cell Measures](https://cfconventions.org/cf-conventions/cf-conventions.html#cell-measures)
    [7.3. Cell Methods](https://cfconventions.org/cf-conventions/cf-conventions.html#cell-methods)
        [7.3.1. Statistics for more than one axis](https://cfconventions.org/cf-conventions/cf-conventions.html#statistics-more-than-one-axis)
        [7.3.2. Recording the spacing of the original data and other information](https://cfconventions.org/cf-conventions/cf-conventions.html#recording-spacing-original-data)
        [7.3.3. Statistics applying to portions of cells](https://cfconventions.org/cf-conventions/cf-conventions.html#statistics-applying-portions)
        [7.3.4. Cell methods when there are no coordinates](https://cfconventions.org/cf-conventions/cf-conventions.html#cell-methods-no-coordinates)
    [7.4. Climatological Statistics](https://cfconventions.org/cf-conventions/cf-conventions.html#climatological-statistics)
    [7.5. Geometries](https://cfconventions.org/cf-conventions/cf-conventions.html#geometries)

to

[7. Data Representative of Cells](https://cfconventions.org/cf-conventions/cf-conventions.html#_data_representative_of_cells)
    [7.1. Cell Boundaries](https://cfconventions.org/cf-conventions/cf-conventions.html#cell-boundaries)
    [7.2. Cell Measures](https://cfconventions.org/cf-conventions/cf-conventions.html#cell-measures)
    [7.3. Cell Methods](https://cfconventions.org/cf-conventions/cf-conventions.html#cell-methods)
    [7.4. Climatological Statistics](https://cfconventions.org/cf-conventions/cf-conventions.html#climatological-statistics)
    [7.5. Geometries](https://cfconventions.org/cf-conventions/cf-conventions.html#geometries)

That alone would remove 71 lines from the TOC! But as you say, any more on that should be discussed elsewhere, which I would welcome.

@taylor13

taylor13 commented Apr 9, 2024

I think this is generally a good idea and have been meaning to go over the details.

A quick thought about the table of contents: Would it be easy in the web view to collapse the subsection hierarchy to 1 or 2 levels, then click on an upper level to display its subsections? That might give a newbie a more accessible overview. On the other hand, I usually just execute "find" for some key word I know is relevant to what I want to look up, and if that word becomes hidden (in a hidden low level subsection), then I may have a harder time navigating quickly to the relevant section. So I can see arguments for the current expanded table of contents.

@davidhassell
Contributor Author

Hello,

We have finally prepared a pull request for incorporating aggregation into the CF conventions: #534

It touches on 9 files:

Chapter 1: Terminology
Chapter 2: Full description of aggregation variables
Appendix A: new aggregation attributes
Appendix L: Aggregation examples
Conformance: Requirements and recommendations
History: History
Bibliography: New URI reference
toc-extra: New examples
conventions: New authors

All the best,
David

@davidhassell
Contributor Author

Hello,

I fully appreciate that this is a large pull request, but it would be very nice if someone who wasn't involved in its development could look it over. The PR already has the support of the original CFA authors (@JonathanGregory, @bnlawrence, @nmassey001, @sadielbartholomew and myself), but at least one "outside" perspective is necessary, I think. It would be great if this could get into CF-1.12, which means in practice that it would all have to be agreed by (roughly) the end of October.

Any takers? We'd be much obliged :)

Many thanks,
David

@taylor13

taylor13 commented Oct 16, 2024

Hi all,

I have now studied the proposal as given above and also read the CFA conventions documentation. What's the easiest way to review the pull request? When I look at the changes made to files, I can see the new text, but it isn't particularly easy to read. [I know I should know how to do this by now.] NOTE ADDED AFTER THE FACT: I HAVE NOW BEGUN TO READ THE ACTUAL PULL REQUEST, WHICH DIFFERS SUBSTANTIALLY FROM THE PROPOSAL ABOVE, SO MOST OF THE FOLLOWING COMMENTS MAY BE IRRELEVANT. I'LL PROVIDE FEEDBACK ON THE ACTUAL PULL REQUEST IN THE NEXT DAY OR TWO.

Anyway, based on what I've read, I have a few comments and questions:

  1. This proposal provides a really useful enhancement of CF, and it seems to be very mature and tested in practice, so I strongly support it.
  2. It appears that some of the features in the CFA conventions document haven't made it into CF (but I might be wrong about that). For example, the option to specify units of a variable in the aggregation file that are different from (but compatible with) the units in the fragment files seems like a nice feature. Is that included in this CF proposal?
  3. It's not clear to me how the "substitution" option helps us much. Is it assumed that each time you move the location of your fragment files, you would update the aggregation file (using this feature) to fix broken links that might be introduced?
  4. What value is represented by "_" in a ncdump listing? Is that an undefined value? If I'm writing data, and want to define aggregation_location=[6, 6, 73, _, 144, _], how do I assign the 4th and 6th element of the array?
  5. You use the term "address" and describe it as the "addresses of data in fragment files". Then you indicate the aggregation_address is "tos1", "tos2". Are "tos1" and "tos2" the names of variables containing the fragments being aggregated into a variable named "temp"? If so, could "address" be replaced with "name"?
  6. The aggregated_data attribute defines the names of the "location", "file", "format", and "address". I would find these 4 terms more aptly named "shapes", "files", "formats", and "names".
  7. In the example(s), I think I would find it less confusing if you came up with different variable names for the "location", "file", "format", and "address". I think I'd replace:
  • "aggregation_location" with "fragment_shapes"
  • "aggregation_file" with "fragment_files"
  • "aggregation_format" with "fragment_file_formats"
  • "aggregation_address" with "fragment_variable_names"

I understand that the data writer can name these however (s)he likes, but for the CF documentation, I think the above would be easier for users to understand (unless, of course, I've completely misrepresented what they are).

@davidhassell
Contributor Author

Thanks, Karl - I very much appreciate your comments. I've seen your edit about reading the latest, and will hold off responding to everything until you have done so.

However, it could be useful to mention some of your comments now, and with reference to the PR text:

  2. It appears that some of the features in the CFA conventions document haven't made it into CF (but I might be wrong about that). For example, the option to specify units of a variable in the aggregation file that are different from (but compatible with) the units in the fragment files seems like a nice feature. Is that included in this CF proposal?

It has indeed. In section 2.8.2 Fragment Interpretation it says that converting the fragment to its canonical form may involve "Transforming the fragment's data to have the aggregation variable's units (e.g. as required when aggregating time fragments whose units have different reference date/times)."
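To make the time example concrete, here is a minimal sketch of the conversion for "days since" units. It assumes the standard calendar and uses hypothetical reference dates; real software would use a full units library rather than this hand-rolled shift.

```python
from datetime import date


def rebase_days(values, fragment_ref, aggregation_ref):
    """Convert 'days since fragment_ref' values to 'days since aggregation_ref'."""
    shift = (fragment_ref - aggregation_ref).days
    return [v + shift for v in values]


# Fragment time coordinates in "days since 2001-07-01" (hypothetical units),
# converted to the aggregation variable's "days since 2001-01-01".
fragment_times = [0, 31, 62]
aggregated_times = rebase_days(fragment_times, date(2001, 7, 1), date(2001, 1, 1))
# aggregated_times is [181, 212, 243]
```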

  3. It's not clear to me how the "substitution" option helps us much. Is it assumed that each time you move the location of your fragment files, you would update the aggregation file (using this feature) to fix broken links that might be introduced?

That's right - it's a convenience feature. Instead of having to update the file paths of 1 million fragment file names when the files are moved, if the file names have been defined with a substitution (cf. environment variable) then you just have to update that one attribute to set the new location for the 1 million files. The new text in section 2.8.1 Aggregated Dimensions and Data aims to clarify this: "The use of substitutions can save space in the aggregation file; and in the event that the fragment locations need to be updated after the aggregation file has been created, it may be possible to achieve this by modifying the substitutions attribute rather than by changing the actual location fragment array variable values."
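A small sketch of the idea, for illustration only: the `${base}` key syntax, the paths and the function name are assumptions made here, not text taken from the PR.

```python
def apply_substitutions(locations, substitutions):
    """Expand each substitution key found in the fragment locations."""
    expanded = []
    for location in locations:
        for key, value in substitutions.items():
            location = location.replace(key, value)
        expanded.append(location)
    return expanded


# Hypothetical fragment locations sharing one "${base}" prefix.
locations = ["${base}January-June.nc", "${base}July-December.nc"]

# Moving the archive only requires changing the one substitution value;
# the stored fragment locations themselves are untouched.
old = apply_substitutions(locations, {"${base}": "/archive/v1/"})
new = apply_substitutions(locations, {"${base}": "/archive/v2/"})
```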

  4. What value is represented by "_" in a ncdump listing? Is that an undefined value? If I'm writing data, and want to define aggregation_location=[6, 6, 73, _, 144, _], how do I assign the 4th and 6th element of the array?

_ in CDL is a placeholder for a missing or fill value. The "shape variable" definition in section 2.8.1 Aggregated Dimensions and Data hopefully makes clear that this variable can only have trailing missing values, which account for the different numbers of fragments that span different dimensions.
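As an in-memory illustration (not a netCDF write), the padding could be sketched like this. Reading the six values in the question as three rows of two is an assumption, and `None` simply stands in for whatever fill value the writing library uses.

```python
MISSING = None  # stands in for the netCDF fill value shown as "_" in CDL


def pad_rows(rows, missing=MISSING):
    """Pad ragged rows with trailing missing values to a rectangular array."""
    width = max(len(row) for row in rows)
    return [list(row) + [missing] * (width - len(row)) for row in rows]


# Fragment sizes per aggregated dimension (assumed layout): two fragments
# along the first dimension, one along each of the others.
sizes = [[6, 6], [73], [144]]
padded = pad_rows(sizes)
# padded is [[6, 6], [73, None], [144, None]]
```

When the padded array is written to a file, the missing elements would be stored as the variable's fill value, which is what ncdump prints as `_`.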

  5. You use the term "address" and describe it as the "addresses of data in fragment files". Then you indicate the aggregation_address is "tos1", "tos2". Are "tos1" and "tos2" the names of variables containing the fragments being aggregated into a variable named "temp"? If so, could "address" be replaced with "name"?

If the intention were only ever to aggregate netCDF fragments, I might agree, but we'd like the conventions to allow the aggregation of non-CF datasets (2.8. Aggregation Variables: "Fragment datasets may be CF-compliant or have any other format, thereby allowing an aggregation variable to act as a CF-compliant view of non-CF datasets"). Not all datasets use a nice name to identify a variable in their contents (it could be an integer file position, as is the case for Met Office PP format files, which at NCAS we are using as fragment files), so we landed on the term "address". There is nothing particularly special about this term, though, so I am happy to consider any other.

@taylor13

Hi David and all,

[My input below is, I hope, constructive, because I think that adoption of a CF-compliant approach to creating aggregated datasets will be very useful. Thanks for all your work on this. It's likely that something quite obvious has eluded me, in which case, please excuse my ignorance, but perhaps you could provide further explanation (or examples) that might enlighten me.]

I've spent some time studying the proposed pull request changes to the conventions document. I spent most of my time trying to figure out exactly how to interpret the fragment_shape array and thinking about how I might form an aggregated array from the fragment arrays, based on the (array size?) numbers it gives. I failed. Then, I thought about what alternative options there might be for providing mapping information in a concise form, which codes could use to combine fragments into a single aggregate. In the pdf file attached below, I've suggested an alternative to the pull request proposal and highlighted it in yellow. I think my approach is easier to explain to users and would facilitate the construction of aggregated variables. (It has similarities with conventional "pointer" approaches to accessing array data.) I know code has already been written based on the original proposal, so perhaps my alternative will not be popular. More likely, those of you who spent so much time coming up with the "fragment_shape" approach of describing how the fragments fit together will find an obvious problem with my suggestion. If so, perhaps all that is needed is a better explanation of your method.

In particular, as a first step, I would be interested in someone telling me what the "fragment_shape" is for the example I came up with. (See the few lines of red-highlighted text below the colorful graphic on the attached.) Perhaps that will enable me to finally "get" how this shape information can be used.

The following document contains a suggestion on how the approach might be modified and made simpler to explain to new users. Most of the "edit suggestions" contained in the file are unrelated to the new approach.
cf-conventions_aggregation_PR_KET.pdf

Thanks again for all the thought and work that has already gone into this.
Karl

@taylor13

Just wanted to ask about a use case:

Suppose I want to aggregate a surface temperature field provided by multiple models, all on a common grid. There is no "model axis" in the files. Can I combine the fields defining a "model label" coordinate?

@JonathanGregory JonathanGregory added the CF1.12? We might conclude this issue in time for CF1.12 label Oct 20, 2024
@davidhassell
Contributor Author

Karl asked:

Suppose I want to aggregate a surface temperature field provided by multiple models, all on a common grid. There is no "model axis" in the files. Can I combine the fields defining a "model label" coordinate?

Yes, provided that all of the models are on the same domain, of course. Here is a modification of the new Example L.1, with an extra "model" axis included:

dimensions:
  model = 4 ;               // New model axis
  time = 12 ;
  level = 1 ;
  latitude = 73 ;
  longitude = 144 ;
  // Fragment array dimensions
  f_model = 4 ;
  f_time = 1 ;
  f_level = 1 ;
  f_latitude = 1 ;
  f_longitude = 1 ;
  // Fragment shape dimensions
  j = 5 ;         // Equal to the number of aggregated dimensions
  i = 4 ;         // Equal to the size of the largest fragment array dimension
variables:
  // Aggregation data variable
  double temperature ;
    temperature:standard_name = "air_temperature" ;
    temperature:units = "K" ;
    temperature:cell_methods = "time: mean" ;
    temperature:coordinates = "model_label" ;
    temperature:aggregated_dimensions = "model time level latitude longitude" ;
    temperature:aggregated_data = "location: fragment_location
                                   address: fragment_address
                                   shape: fragment_shape" ;
  // Coordinate variables
  double time(time) ;
    time:standard_name = "time" ;
    time:units = "days since 2001-01-01" ;
  double level(level) ;
    level:standard_name = "height_above_mean_sea_level" ;
    level:units = "m" ;
  double latitude(latitude) ;
    latitude:standard_name = "latitude" ;
    latitude:units = "degrees_north" ;
  double longitude(longitude) ;
    longitude:standard_name = "longitude" ;
    longitude:units = "degrees_east" ;
  string model_label(model) ;            // New model label auxiliary coordinate
    model_label:long_name = "Name of model" ;
  // Fragment array variables
  string fragment_location(f_model, f_time, f_level, f_latitude, f_longitude) ;
  string fragment_address ;
  int fragment_shape(j, i) ;
data:
  temperature = _ ;
  time = 0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334 ;
  level = ... ;
  latitude = ... ;
  longitude = ... ;
  model_label = "model1", "model2", "model3", "model4" ;
  fragment_location = "model1.nc", "model2.nc", "model3.nc", "model4.nc" ;
  fragment_address = "temperature" ;
  fragment_shape = 1, 1, 1, 1,
                   12, _, _, _, 
                   1, _, _, _, 
                   73, _, _, _, 
                   144, _, _, _ ;

We could include this example in the appendix.
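As a quick consistency check on the CDL above, the non-missing values in each row of fragment_shape should add up to the size of the corresponding aggregated dimension. A minimal sketch, with `None` standing in for `_` and the helper name chosen here for illustration:

```python
MISSING = None  # stands in for "_" in the CDL above

# Values copied from the CDL example above.
aggregated_dimensions = {"model": 4, "time": 12, "level": 1,
                         "latitude": 73, "longitude": 144}
fragment_shape = [
    [1, 1, 1, 1],
    [12, MISSING, MISSING, MISSING],
    [1, MISSING, MISSING, MISSING],
    [73, MISSING, MISSING, MISSING],
    [144, MISSING, MISSING, MISSING],
]


def row_totals(shape_rows):
    """Sum the non-missing fragment sizes in each row."""
    return [sum(v for v in row if v is not MISSING) for row in shape_rows]


totals = row_totals(fragment_shape)
consistent = totals == list(aggregated_dimensions.values())
```

Here the four size-1 fragments along the model axis account for `model = 4`, and every other dimension is spanned by a single fragment.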

@davidhassell
Contributor Author

Hi Karl,

Just quickly jumping on your suggested structural change, before getting into your text suggestions and questions ... I'm intrigued by your new fragment_shape variable. I've always been annoyed that we have to pad it out with missing data, thereby wasting space (and making it more complicated to explain!).

I can't see any problems with your new approach by just thinking about it :) (I'd like to run it past my software implementation to be sure). I'm wondering if the super-general figure on page 12 of conventions_aggregation_PR_KET.pdf should be restricted to the cases allowed by the current "fragment array" - i.e. where all of the fragments are aligned in neat hyper-rows. This is because a) I doubt there's a use case for non-aligned fragments; b) I very much doubt that there's any software out there that can handle the fully general case whilst also applying "lazy loading" of the data, and we shouldn't encourage people to tackle this very thorny problem without need; and c) full generality could easily be allowed if a use case ever arose.

What does anyone else think?

@taylor13

taylor13 commented Oct 22, 2024

Hi David,

That is very encouraging. I must admit I was unable to understand how your fragment_shape information got utilized. I didn't realize that the constraint was imposed that "all of the fragments are aligned in neat hyper-rows." (I'm not sure I still understand what that means exactly, but for now, that's o.k.)

There is an important constraint even for my "super-general" example: All fragments must be logically rectangular (in hyperspace), and together they must fill a logically-rectangular aggregated array. One can think of each fragment as a block and the blocks together are used to build a single aggregated block (without leaving any spaces).
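The "blocks filling a block" constraint can be written down as a small check. This is only a sketch: representing each fragment as a (starts, sizes) pair with positive sizes is an assumption made here, not an encoding taken from the PR or from the attached PDF.

```python
from itertools import combinations
from math import prod


def overlaps(a, b):
    """True if two blocks, each given as (starts, sizes), share any element."""
    return all(
        sa < sb + nb and sb < sa + na
        for sa, na, sb, nb in zip(a[0], a[1], b[0], b[1])
    )


def tiles_exactly(blocks, shape):
    """Check that the blocks fill an array of this shape with no gaps or overlaps."""
    for starts, sizes in blocks:
        if any(s < 0 or s + n > dim for s, n, dim in zip(starts, sizes, shape)):
            return False  # block sticks out of the aggregated array
    if any(overlaps(a, b) for a, b in combinations(blocks, 2)):
        return False
    # All blocks are inside and disjoint, so equal volume means no gaps.
    return sum(prod(sizes) for _, sizes in blocks) == prod(shape)


# A 4x4 array built from one 2x4 block and two 2x2 blocks.
blocks = [((0, 0), (2, 4)), ((2, 0), (2, 2)), ((2, 2), (2, 2))]
ok = tiles_exactly(blocks, (4, 4))           # True
gap = tiles_exactly(blocks[:2], (4, 4))      # False: one corner is left empty
```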

I agree we could be more restrictive (for reasons you've listed above) if that really makes a difference to those writing code. Thinking in terms of Fortran-style coding (which is my default thought process), I don't think it would be difficult to handle the general case, but then I'm not familiar with what you call "lazy loading" of the data. Is that loading the data into a vector without preserving the multi-dimensional structure?

In any case, I think the primary advantage offered by the alternative approach is that it seems to me to be easier to explain. Let's see what others think.

A note on the examples and notation: New users might better follow what we're doing if we change the keyword (under aggregated_data) from "shape" to "map" since the values tell you how to map your fragments into the aggregated array. In the example, I named the "shape" variable "fragment_starts", but a better variable name might be "insert_at", since the fragment arrays get inserted in the aggregated array at the index values provided by the "insert_at" variable.

@taylor13

While I'm thinking about it, the term "address" doesn't immediately bring to mind the name of a variable, but rather a location of the variable; would "identifier" be a better term? It doesn't specifically have to be a variable name, but could be, as you noted earlier, an integer or some other kind of identifier of the variable of interest.

@davidhassell
Contributor Author

Hi Karl,

I'm going to tackle all of your comments very soon, but first would like to try to conclude the discussion on the shape variable, if we can.

I have thought a lot more about your suggested proposal for replacing the shape variable with a map variable (and discussed it with @JonathanGregory), and produced a software implementation for it ... and all that has led me to think that the technique described in the PR is preferable, after all, for the following reasons:

  • The suggested map variable takes up more space (or the same amount)
  • It is more complicated to implement. Most (certainly both existing) implementations will read the fragment array into memory, only creating the aggregated data if requested later. To instantiate the fragment array into memory, the corner indices are not useful and need to be converted to sizes, like those provided by the original proposal. Converting to sizes also requires "sort" and "argsort" functions (the only way I could see how to do it!) - perhaps not universally available?
  • There is no use case for the "super general" case. If there ever was one (which is not envisaged, at this time) then extending the convention to also include your technique would be no problem
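To illustrate the kind of conversion meant in the second point, here is a sketch for a single dimension that turns fragment start indices into fragment sizes using an argsort. It is not the implementation referred to above, and the input layout (one start index per fragment, in arbitrary order) is an assumption.

```python
def starts_to_sizes(starts, dim_size):
    """Convert fragment start indices along one dimension to fragment sizes.

    The starts may be in any order; sizes are returned in the same order.
    """
    # argsort: positions of the starts in ascending order
    order = sorted(range(len(starts)), key=starts.__getitem__)
    sizes = [0] * len(starts)
    for rank, i in enumerate(order):
        if rank + 1 < len(order):
            end = starts[order[rank + 1]]  # next fragment's start
        else:
            end = dim_size                 # last fragment runs to the end
        sizes[i] = end - starts[i]
    return sizes


sizes = starts_to_sizes([91, 0, 136], 181)
# sizes is [45, 91, 45]
```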

Your point that it was hard to understand the description is certainly correct, though!

I propose a new, and hopefully understandable, description of the original shape variable, which includes renaming it to a map variable following your suggestion. Would this be acceptable? The following text is a drop-in replacement for the shape description in the original PR:


The features must comprise either all three of the map, location, and address keywords; or else both of the map and value keywords. No other combinations of keywords are allowed. These features are defined as follows:

map

The integer-valued map variable maps the canonical form of each fragment (see Section 2.8.2 "Fragment Interpretation") to a part of the aggregated data.

For each element of the fragment array, the map variable defines the number of elements occupied by its fragment along each of the aggregated dimensions. For instance, in Example 2.2 the fragment array has six elements with shape (1, 3, 2), and the corresponding map variable has 3 rows, one for each fragment array dimension. Each of these rows contains the sizes of the fragments along that dimension, in their fragment array order, as follows:

1st row:  17
2nd row:  91,  45,  45
3rd row: 180, 180

The part of each aggregated dimension that is occupied by a fragment is defined by the fragment size along that dimension, offset by the sizes of the fragments that precede it in the fragment array. For instance, the fragment in fragment array position [0, 1, 1] occupies elements 1 to 17 of the Z aggregated dimension, 92 to 136 of the Y aggregated dimension, and 181 to 360 of the X aggregated dimension (using one-based indices in this example).

The rows of the map variable (i.e. the slowest-varying dimension, and the first dimension in CDL order) correspond to the aggregated dimensions in the same order, and each row is padded with missing values to create a rectangular array. In Example 2.2, the map variable stored in a netCDF dataset therefore has the following 3 by 3 array, where _ denotes a missing value (see Aggregation Example 4):

 17,   _,  _,
 91,  45, 45,
180, 180,  _ ;

When the aggregated data is scalar, the fragment array is also scalar and the map fragment array variable must be stored as a scalar variable containing the value 1. See Aggregation variable example 8.
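A minimal sketch of how software might use the rows above, in case it helps to visualise the text. The function names `fragment_extent` and `padded_map` are hypothetical, and the `-1` fill value merely stands in for a netCDF missing value. Ranges are returned as zero-based, half-open (start, stop) pairs.

```python
import numpy as np

# Fragment sizes along each aggregated dimension (Z, Y, X), one row per dimension
map_rows = [[17], [91, 45, 45], [180, 180]]

def fragment_extent(map_rows, position):
    """Return the zero-based, half-open (start, stop) range of one fragment
    along each aggregated dimension."""
    extent = []
    for sizes, index in zip(map_rows, position):
        start = sum(sizes[:index])  # total size of the preceding fragments
        extent.append((start, start + sizes[index]))
    return extent

def padded_map(map_rows, fill=-1):
    """Pad the rows to a rectangular array; `fill` stands in for the missing value."""
    width = max(len(row) for row in map_rows)
    out = np.full((len(map_rows), width), fill, dtype=int)
    for i, row in enumerate(map_rows):
        out[i, :len(row)] = row
    return out

extent = fragment_extent(map_rows, (0, 1, 1))
stored = padded_map(map_rows)
```

For fragment array position [0, 1, 1], `extent` is `[(0, 17), (91, 136), (180, 360)]`, i.e. 17, 45 and 180 elements along Z, Y and X respectively.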

@taylor13

Hi David,

I'm not going to be able to get to this before next week, unfortunately. A quick read through raised a question. Could you clarify what you mean by "all of the fragments are aligned in neat hyper-rows."? This apparently excludes the "super-general" case I was considering, but I'm just not sure what the actual constraints are on what kind of fragments can be aggregated.

Also, above you state "For each element of the fragment array, the map variable defines the number of elements occupied by its fragment along each of the aggregated dimensions". I might be able to figure out what is meant by studying the example, but on its own I can't visualize what you have in mind.

thanks,
Karl

@davidhassell
Contributor Author

Hi Karl,

Could you clarify what you mean by "all of the fragments are aligned in neat hyper-rows."

Yes - sorry about this made-up phrase! I'm struggling to find the correct terminology ... How about (borrowing from the processor distribution of parallelised NWP and climate models) "The fragments comprise a regular domain decomposition of the aggregated data".

Both of the following examples have a 2 x 2 fragment array, but the fragments are not fully aligned in the second case, so that is not OK.

+---+------+
|   |      |
+---+------+           Regular domain decomposition: OK
|   |      |
+---+------+

+---+------+
|   |      |
+---+-+----+        Irregular domain decomposition: Not OK
|     |    |
+-----+----+
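The distinction between the two diagrams can be checked mechanically. The following is only a sketch under assumed inputs: each fragment is described by a hypothetical tuple of zero-based, half-open (start, stop) ranges, one per dimension, and the sizes are made up to resemble the pictures.

```python
from itertools import product

def is_regular_decomposition(fragments):
    """fragments: list of boxes, each a tuple of (start, stop) per dimension."""
    n_dims = len(fragments[0])
    per_dim = []
    for dim in range(n_dims):
        intervals = sorted({box[dim] for box in fragments})
        # The distinct intervals along a dimension must follow one another
        # with no gaps or overlaps
        for (_, stop), (next_start, _) in zip(intervals, intervals[1:]):
            if stop != next_start:
                return False
        per_dim.append(intervals)
    # Every combination of per-dimension intervals must occur exactly once
    return sorted(fragments) == sorted(product(*per_dim))

# First diagram: both rows are split at the same column
regular = [
    ((0, 1), (0, 3)), ((0, 1), (3, 9)),
    ((1, 2), (0, 3)), ((1, 2), (3, 9)),
]
# Second diagram: the rows are split at different columns
irregular = [
    ((0, 1), (0, 3)), ((0, 1), (3, 9)),
    ((1, 2), (0, 5)), ((1, 2), (5, 9)),
]
regular_ok = is_regular_decomposition(regular)
irregular_ok = is_regular_decomposition(irregular)
```

Here `regular_ok` is `True` and `irregular_ok` is `False`, because the second case has overlapping, unequal column ranges (0 to 3 and 0 to 5).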

@taylor13

O.K., I think I get it now. The partitioning of any given dimension into fragments must be consistent across all fragments comprising the aggregate. So, for example, you couldn't aggregate data that was originally stored on a global grid for part of a simulation with data you might have stored in two parts (say a N.H. chunk and a separate S.H. chunk) for the remainder of the simulation. Right?

I suppose if it were possible to aggregate two "aggregate" files into a super-aggregate, then you could handle the above case. First you would aggregate the two hemispheres of data for the portion of the simulation where they had been separately stored. Then you would aggregate this aggregated data with the data stored originally on a global grid. [I'm not suggesting this is an important "use case", but just checking on the limits of the approach you've proposed.]

@taylor13

In the conventions document, I think we need to mention that an aggregation variable can't be directly accessed through the netCDF API, but an intermediary code must be written that interprets the construction of an aggregated array. This intermediary code will obtain the fragments using the netCDF API, and enable the user then to manipulate the aggregated array. Do I have that right?

One question that popped into my head is: Does your current "intermediary code" enable the user to obtain a subset of the aggregated variable without reading in the entire array? For example, if I have 100 years of a monthly, globally-gridded 3-d field (like AirTemperature(time, pressure, lat, lon)) which has been saved as 10-year chunks, and I've constructed an aggregation file spanning the 10 fragments, can I ask your intermediary code to extract just the 500 hPa pressure level of data from the aggregated dataset without first storing in memory all pressure-levels?

@bnlawrence

Depends on your definition of intermediary code: yes, the data can be accessed through the NetCDF API, but it's a two step process, you need to use the NetCDF API to work out which fragment files to open and where to put the content in the aggregated array, but both steps use the NetCDF API ...

... and yes, both the working implementations (xarray and cf-python) are fully lazy and only extract the data you want when you want to do the computation (but of course CF itself doesn't require or say anything about that).
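As a rough illustration of why a subset need not touch all of the data (this is a sketch only, not the cf-python or xarray code), the fragment sizes along a dimension are enough to work out which fragments overlap a requested index range. The function name `fragments_needed`, the 17 pressure levels and the position of the 500 hPa level are all assumptions made for the example.

```python
import numpy as np

def fragments_needed(sizes, start, stop):
    """Indices of the fragments along one dimension that overlap the
    zero-based, half-open index range [start, stop)."""
    edges = np.concatenate(([0], np.cumsum(sizes)))
    first = np.searchsorted(edges, start, side="right") - 1
    last = np.searchsorted(edges, stop, side="left")
    return list(range(first, last))

# 100 years of monthly data stored as ten 10-year fragments along time
time_sizes = [120] * 10
# The pressure dimension is not fragmented (17 levels assumed)
pressure_sizes = [17]

# All times, but only one pressure level (index 5 assumed to be 500 hPa)
time_fragments = fragments_needed(time_sizes, 0, 1200)
level_fragments = fragments_needed(pressure_sizes, 5, 6)

# A request for months 130 to 249 touches only the second and third fragments
partial = fragments_needed(time_sizes, 130, 250)
```

In the 500 hPa case every fragment file is opened, but a lazy implementation would read only the one requested level from each file rather than all pressure levels.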

@JonathanGregory
Contributor

Dear Karl

I suppose if it were possible to aggregate two "aggregate" files into a super-aggregate, then you could handle the above case. First you would aggregate the two hemispheres of data for the portion of the simulation where they had been separately stored. Then you would aggregate this aggregated data with the data stored originally on a global grid. [I'm not suggesting this is an important "use case", but just checking on the limits of the approach you've proposed.]

As it happens, @davidhassell and I discussed this situation on Thursday. Yes, you could take such a hierarchical approach, if necessary. An aggregation may have aggregations as its fragments.
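A toy sketch of the hierarchical idea, using in-memory NumPy arrays in place of fragment files. The shapes and variable names are made up for illustration, with dimensions ordered (time, lat, lon).

```python
import numpy as np

# Hypothetical fragments, dimensions ordered (time, lat, lon)
global_part = np.zeros((12, 4, 8))    # first year, stored on the global grid
south = np.ones((24, 2, 8))           # later years, S.H. fragment
north = np.full((24, 2, 8), 2.0)      # later years, N.H. fragment

# Inner aggregation: join the two hemispheres along latitude
hemispheres = np.concatenate([south, north], axis=1)

# Outer aggregation: join the global part and the inner aggregation along time
aggregated = np.concatenate([global_part, hemispheres], axis=0)
```

Each of the two steps on its own is a regular decomposition, even though the three original fragments taken together are not.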

Best wishes

Jonathan

@davidhassell
Contributor Author

Hi - just to let you know that I'm in the midst of preparing a detailed review of Karl's comments, and preparing a new PR that incorporates many of the suggestions made here (and mainly by Karl!). I hope to have it all ready in a day or two.

@taylor13

Hi David and all,

I now understand how you know how to "map" the fragments to the aggregated array given the aggregated_data's shape specifications. I hadn't realized all the constraints placed on the fragments that make this work. I think newbies would understand your approach more easily if they had in mind the constraints. [When I was first trying to figure it out, I had in mind that you could aggregate fragments that had fewer constraints, and I couldn't see how your approach would handle that.]

Under either of the two approaches we considered, the following constraints are imposed:

• The fragment arrays cannot be “ragged”, i.e., they can, in general, be visualized as multi-dimensional rectangular solids.
• The entire variable array stored in a fragment file must be used in constructing the aggregated array.
• The fragment arrays share a common set of dimensions, and the dimensions appear in the same order for all fragments.
• The aggregated array is also a multi-dimensional rectangular array which includes the common set of dimensions found in its fragments (but the size of any dimension may be larger than any of its fragments' corresponding dimensions). The aggregated array might also include one or more additional dimensions when constructed from fragments that are distinguishable in some way not captured by the fragment dimensions (e.g., forming an aggregate variable containing multi-model ensemble output from the fragments containing results from individual models would require a “model identifier” coordinate). Except for any additional dimensions, the ordering of the aggregated array dimensions must be the same as the ordering in each of its fragments.
• The fragment arrays (conceptually, the rectangular solids) comprising the aggregate must be non-overlapping and must fill the aggregate array space completely.

For the original approach (@davidhassell et al.), there is an additional constraint (which is difficult to describe):

• Along each dimension, the same number of fragment arrays will together fill that dimension’s aggregated space, independent of all other dimensions. Although along a given dimension the fragments comprising the aggregated array can occupy unequal portions of that dimension, they must be aligned consistently across all the other dimensions. None of the fragments comprising the aggregated array will be offset from any neighboring fragment.

While thinking about the generality of the approaches, the following use-cases came to mind:

  1. Consider monthly-mean surface temperatures spanning a 100-year interval, as produced by different CMIP models. Different models have provided data on different grids and usually stored the data in multiple files, each one spanning a portion of the 100-years (e.g., in files each containing 10-years or 120 months of model output). The chunking across years has not been consistent. Suppose for multi-model analysis purposes I want to regrid all the models to a common grid and then aggregate all the model output into a single aggregate array.

Regridding can be computationally expensive, so I might choose to do that once and for all before starting my analysis. The most straightforward way of proceeding would be to consider each file individually, regrid its data, and then rewrite it. This would result in the same number of files as before, but now with all data on a common grid.

Ideally, I would then simply aggregate the data across time and across models to form a 100-year multi-model gridded dataset of surface temperature. As I understand it, this would only be possible if the original model output were stored in identical temporal chunks. Since this is not the case (with, for example, some models storing data in 20-year chunks and others in 10-year chunks), how would you proceed?

  2. Now consider a simpler case where each CMIP model has provided 10-years of monthly means on a common grid in a single file. If this data is from a control run, each model will likely report output from a different 10-year period (so the time coordinate values will differ from one model to the next). Can this data be aggregated to form a multi-model aggregated dataset?

  3. What about a variant on case 2 above with each model reporting data for the same time-period, but with different models relying on different calendars in creating the time coordinate? Some models might assume no leap years while others include leap years, which will lead to a mismatch in their time-coordinates once the first leap day is encountered. Can this data be aggregated, with a new time coordinate defined that is consistent across models (e.g., adjusting the time-coordinate in models with leap years so all models follow a no-leap-year calendar)?

@davidhassell
Contributor Author

Hi Karl,

Thank you for your constraints and examples! Here is a sneak preview of text from the new PR, which I think encapsulates all of our requirements:


The aggregated dimensions are partitioned by the fragments (in their canonical forms, see Section 2.8.2 "Fragment Interpretation"), and this partitioning is consistent across all of the fragments, i.e. any two fragments either span the same part of a given aggregated dimension, or else do not overlap along that same dimension. In addition, each fragment data value provides exactly one aggregated data value, and each aggregated data value comes from exactly one fragment. With these constraints, the fragments can be organised into a fully-populated orthogonal multidimensional array of fragments, for which the size of each dimension is equal to the number of fragments that span its corresponding aggregated dimension.

The aggregated data is formed by combining the fragments in the same relative positions as they appear in the array of fragments, and with no gaps or overlaps between neighbouring fragments.
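For a small in-memory picture of "combining the fragments in the same relative positions", NumPy's `np.block` does this for a nested list of arrays. This is only an analogy with made-up fragment shapes, not part of the proposed text.

```python
import numpy as np

# A 2 x 2 array of fragments; each fragment is filled with its own label.
# Fragments in the same row share a height, and fragments in the same
# column share a width, so the partitioning is consistent.
fragment_array = [
    [np.full((2, 1), 0), np.full((2, 3), 1)],
    [np.full((1, 1), 2), np.full((1, 3), 3)],
]

# Join the fragments in their relative positions, with no gaps or overlaps
aggregated = np.block(fragment_array)
```

The result has shape (3, 4): the sizes along each aggregated dimension are the sums of the fragment sizes along that dimension (2 + 1 and 1 + 3).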


Ideally, I would then simply aggregate the data across time and across models to form a 100-year multi-model gridded dataset of surface temperature. As I understand it, this would only be possible if the original model output were stored in identical temporal chunks. Since this is not the case (with, for example, some models storing data in 20-year chunks and others in 10-year chunks), how would you proceed?

As you say, unless the partitioning across time is consistent between models, you can't aggregate this example.

  2. Now consider a simpler case where each CMIP model has provided 10-years of monthly means on a common grid in a single file. If this data is from a control run, each model will likely report output from a different 10-year period (so the time coordinate values will differ from one model to the next). Can this data be aggregated to form a multi-model aggregated dataset?

If all models have data for the same 10 year period, then they can be aggregated, otherwise not. In the latter case, the mechanics of aggregation would of course allow you to stitch them together along a "model" dimension, but what would you put as the common time coordinates? Like with many things in CF, just because you can do something, it doesn't mean that it correctly describes what you did :).

  3. What about a variant on case 2 above with each model reporting data for the same time-period, but with different models relying on different calendars in creating the time coordinate? Some models might assume no leap years while others include leap years, which will lead to a mismatch in their time-coordinates once the first leap day is encountered. Can this data be aggregated, with a new time coordinate defined that is consistent across models (e.g., adjusting the time-coordinate in models with leap years so all models follow a no-leap-year calendar)?

These cannot be aggregated, because at least one of the fragments will have units that are not equivalent (i.e. convertible) to the units defined on the aggregation variable.

@taylor13

Hi David,

Just to let you know I liked the clause, “ this partitioning is consistent across all of the fragments, i.e. any two fragments either span the same part of a given aggregated dimension, or else do not overlap along that same dimension”. That explains it better than I could have.

Regarding the use cases that can’t be handled, perhaps we should consider what modifications would be needed to make it possible to handle these, since I think for CMIP data it would be quite useful. Since aggregation is mostly handled in index space, as I understand it, then shouldn’t it be possible for a user to aggregate datasets based solely on indexes? When generating an aggregate file, a user could provide whatever coordinate values they wanted for 1 or more of the aggregate’s dimensions. If the user elected not to provide values, then your software could read the coordinate values from the fragment files (and presumably check for consistency across fragments).

I’m quite ignorant about your software package(s). Does your package help users create the aggregate files (as shown in the examples), or is that left totally up to the user? When a user accesses data through an aggregate file, does your package check that the coordinate values match the coordinate values already in the aggregate file?

Still coming up to speed on this.
Thanks,
Karl

@dwest77a

Hi All,
Adding my thoughts on this proposal here. I'm a developer with the Centre for Environmental Data Analysis (CEDA) and recently took part in a secondment to CMS at Reading to create an implementation of the CFA Conventions in Xarray, enabling the use of Aggregated NetCDF files in Xarray. I also added in the capability to generate the aggregated files using a simple Python interface that only uses the NetCDF4 Python library. This only requires a user to deliver the set of fragment files to be aggregated - the software then decides how to orient and assemble all dimensions and variables within the aggregated dataset. The link to the repository for this code is here https://github.com/cedadev/CFAPyX, with documentation linked. This package is also installable with pip using pip install cfapyx.

I don't have any particular comments on the content of the conventions, I was fairly satisfied with the scope and coverage, and I'm very much in favour of the overall aim - to persist aggregations to disk, rather than relying on on-the-fly aggregations for which Xarray seems particularly slow. I have a few suggestions from my exposure so far:

  • Others in this thread have touched on the 'substitutions' for the location of fragment files - I find this a very useful feature actually since at CEDA we may well create aggregations before the data is finally ingested into the archive. I would say it isn't exactly clear if it is allowed to make any string substitutions, as the conventions mostly focus on examples with an existing $BASE or similar in the location itself. Whether you're 'supposed' to use the substitutions to just change the name of a directory for example is a bit unclear.

  • The CFA conventions themselves are very clear once you get your head around some of the specific terminology, like fragment array variables and aggregated dimensions but I would still suggest possibly a key to define these terms somewhere in the conventions?

  • The cf-python algorithm which serves as an implementation of these conventions is also very well defined and consistent, but contains a large number of specific terms and jargon that present a significant barrier. This may not be a problem for the average user, who we do not expect to derive their own implementation. The Xarray implementation I've written, which you can find at https://github.com/cedadev/CFAPyX uses a very similar algorithm, but since I'm not well versed in the language of the cf-python documentation I can't say if they are exactly the same, only that my implementation is consistent with the CFA Conventions (both the current version and the changes made to incorporate into the CF conventions).

  • I've attempted to review the suggestion made by @taylor13 but I may not know enough about the practical application of these aggregations to understand. I don't believe the current CFA conventions would be able to support the aggregation of fragments shown on page 12 of the linked PDF but I may be mistaken, I would assume the fragments must themselves be regularly shaped in all fragmented dimensions. I believe this is also what @davidhassell is pointing out when referring to the alignment of neat hyper-rows - the underlying architecture that allows loading of the individual fragments of data using Dask requires a set size of each fragmented dimension. I would not be opposed to considering the map attribute that David later suggested but would need to consider the implementation a little more.

  • I would agree with some others about the address variable being less obvious as to what it means - actually referring to the name of the variable, but I'm also much less familiar with fragment file formats other than NetCDF where this may have specific uses.

In summary, I also agree with others contributing to this thread that this is a useful addition to the CF conventions as a whole, and I would hope they are advertised/promoted accordingly, since there are other aggregation formats with significantly more widespread awareness (Kerchunk/Zarr etc.). My main suggestions revolve around some additional documentation specifically aimed at people with no or little knowledge of the specific terminology with NetCDF or CF in general.

@davidhassell
Contributor Author

Hi all,

I'm still preparing responses to Karl's PDF comments, and have yet to properly read the last couple of comments from Karl and Dan (I will do all of that today!), but I'd like to get my alternative PR out there. This incorporates a lot of Karl's suggestions, and thanks to those, regardless of where we end up, I think that it is a much clearer exposition.

@larsbarring
Contributor

I have been in favour of this addition from the beginning, and still am, although I have not followed this discussion in any detail.

A general (in fact CF-wide) thought that again springs to mind when I read @dwest77a's comment is that I think that CF should try (hard) to be independent of any language references. Do not misunderstand this as me being in any way critical of the work Daniel has done -- rather the contrary!

But sometimes references to language --- especially python --- implementations may limit the discussion. If some aspect of a language implementation of the proposal stands out in some way (difficult, simple, slow, fast, easy ....), or depends on a mechanism or library only available in that language (we are here talking about python) that might make it difficult to implement the proposal in other languages, then I think this needs to be discussed. I have no idea if this is the case here.

Anyway, and in general I think it is important in the long run to keep a clear separation between the CF Conventions as a convention/standard, and its implementations in different languages. Not having done that I personally think this is one of the weaknesses of many very popular formats/tools/etc. Daniel mentions zarr and kerchunk (I have no experience of either, but they seem to be oriented/limited to python).

If we really want a closer connection between CF and some software tool I think that C is the way to go (still?), with bindings to R, Matlab, Julia, Fortran and more. But that is for another day and conversation.

@davidhassell
Contributor Author

Replying to Karl,

Regarding the use cases that can’t be handled, perhaps we should consider what modifications would be needed to make it possible to handle these, since I think for CMIP data it would be quite useful. Since aggregation is mostly handled in index space, as I understand it, then shouldn’t it be possible for a user to aggregate datasets based solely on indexes? When generating an aggregate file, a user could provide whatever coordinate values they wanted for 1 or more of the aggregate’s dimensions. If the user elected not to provide values, then your software could read the coordinate values from the fragment files (and presumably check for consistency across fragments).

An aggregation file contains exactly the same information as its equivalent non-aggregation file that contains copies of all of the fragment data. The new PR has the line "The aggregated data is identical to the data that would be stored within a dataset that contained the equivalent non-aggregation variable.". This is near the end of the description but should, I realise, be much more prominent.

This means that aggregation is not solely in index space - we can't aggregate a 1 degree horizontal grid with a 2 degree horizontal grid because we couldn't represent the data in a non-aggregation file.

I’m quite ignorant about your software package(s). Does your package help users create the aggregate files (as shown in the examples), or is that left totally up to the user? When a user accesses data through an aggregate file, does your package check that the coordinate values match the coordinate values already in the aggregate file?

cf-python does indeed aggregate for you, using its aggregation algorithm [*]. You can tell cf-python in one command to "read all of these files", and it will aggregate their contents into as few fields as possible. Then, you can treat a field's data in memory as a single entity, write it to disk as a normal CF dataset (copying in all of the data), or write it to disk as an aggregation file.

There is no requirement nor expectation for CF nor any software implementation to check other parts of the dataset (e.g. the coordinates) against any equivalent entities in the fragment files. You can easily create the coordinates as aggregation variables themselves if you wanted, which would mean that their values would come only from the fragments. This has implications for accessibility (i.e. the coordinate values are not readily available to casual inspection, e.g. with ncdump), but there may well be times when you want to do this. Example L.5 in Appendix L of the new PR shows just this.

[*] This aggregation algorithm is completely based on the CF data model, and should be guaranteed to work for any conceivable CF datasets. It was proposed to CF nearly a decade ago (https://cfconventions.org/Data/Trac-tickets/78.html), which was probably a little premature. But now that we are on the brink of being able to store aggregations in CF, perhaps it's time to think again about providing guidance on how to create them. This is not part of this proposal, though!

@taylor13

To be sure, is it possible (even if perhaps inadvisable) for me to create an aggregation file linking 10-years of monthly-mean data created by multiple models even when the models have assumed different calendars (so some include leap years and others don't)? Assume I've recorded in my aggregation file 120 values for the time dimension of the aggregated array, which I've based, let's say, on a no-leap-year calendar. Now if the software reading my aggregation file simply creates the aggregation array based on the mapping information I've provided, won't I get a properly defined aggregation array? Or is the aggregation software required to check that the time coordinate values I've defined are consistent with the coordinate values stored in each of the fragment files? Basically, can't I create an aggregated dataset that has only approximately correct time coordinate information (for some models), which I deem to be good enough for analysis of multi-model ensembles?
