Conventions for anomalies #582

Open
JonathanGregory opened this issue Dec 28, 2024 · 2 comments
Labels
enhancement Proposals to add new capabilities, improve existing ones in the conventions, improve style or format

Comments


JonathanGregory commented Dec 28, 2024

Moderator

None yet

Requirement Summary

To provide conventions which describe the calculation of an anomaly (i.e. deviation) from the normal (i.e. reference or baseline) of the same geophysical quantity. The most important and complicated case is the calculation of anomalies with respect to climatological statistics.

Status Quo

Section 7.3 introduces cell_methods, and Section 7.4 defines the use of cell_methods to describe climatological statistics. There are various standard names of the form X_anomaly, where X is another standard name. Vocabulary issue 27 proposes to add text in the descriptions of those standard names to say that a size-one coordinate variable with the standard name of reference_epoch can be used to record the time-bounds of a climatology with respect to which the anomaly was calculated. CF doesn't have any other conventions for describing anomalies or relating them to normals.

Associated pull request

None yet.

Background

This issue arises from three others:

  • vocabulary issue 27 on "Adding reference epoch sentence to anomaly terms".

  • vocabulary issue 70 (now closed, opened as discuss issue 252) on "reference periods for variables derived from climatology".

  • Discussion 305, entitled "Can CF offer better support for the representation of climate anomalies?"

There are some other unconcluded discussions relating to cell_methods too, which are more or less related to this one:

  • Discussion 372 (opened as issue #197 in this repo), entitled "Cell methods: within|over days|months and time axis (Section 7.4)" about the relationship of these to climatologies, and the possibility of cell_methods recording more than two statistical processing steps applied to the time axis.

  • Issue #414 in this repo, "Clarification of cell_methods", proposing various clarifications to cell_methods and its documentation:

    • Reorder subsections 7.3.2 and 7.3.3.
    • Delete the default interpretation of cell_methods as point or sum.
    • Explicitly indicate that "where" can be used for the time dimension as well as spatial dimensions and that "where" can sometimes be interpreted as "when".
    • Revise some of the text and modify some of the examples for clarity and for readability.
  • Issue #447 in this repo, "Clarification of weighting in cell_methods", on recording weighting in cell_methods. (This was raised in issue #414 and split off because of its greater urgency, and to make the discussion more manageable.)

A lot of points are made in all these discussions. I expect we will be able to address them all eventually, but it's hard to comprehend them all at once!

Detailed Proposal

I propose that we introduce a new Section 7.5 on "Anomaly data", renumbering the existing Section 7.5 on "Geometries" as 7.6.

The first aim of this issue is to provide examples of how to use reference_epoch, for which @TomLav and @sethmcg have both pointed out the need. For this aim, I propose the following text for the new section. In a subsequent posting, I will propose a more general convention, for any quantity (not just anomaly standard names), and including the cell_methods of the climatology.

In this draft text, I've assumed that an anomaly variable might refer to a multiannual mean (of entire years) or a climatological monthly (or other sub-annual) multiannual mean. As we've discussed, without its cell_methods, we don't know what the climatology is. That's an unsatisfactory situation, but some of the anomaly standard names have been around for a long time, and we don't know how they've been used. We could leave it as vague, or we could clarify it for CF 1.13 onwards. Possible choices to remove the vagueness include:

  • Defining anomaly to mean difference with respect to the time-mean of the entire time-interval bounded by the reference_epoch bounds. That's simple, but you couldn't use anomaly standard names for anomalies wrt a climatological monthly (or other subannual) mean e.g. mean July of 1990-2019.

  • Defining anomaly to mean difference with respect to the mean over the years in the reference_epoch of the subannual period specified by the time coordinate. For instance, in the example below, we have an anomaly for 16th July wrt a reference epoch of 1990-2019. This would be interpreted as an anomaly wrt the mean of 16th July over the 30 years.

Postscript: We could support both the above interpretations, distinguishing between them by assuming the first if the reference_epoch has a bounds attribute, and the second if it has a climatology attribute (instead of bounds). That distinction would be made if the climatology variable were present in the file; its time coordinate bounds would be in bounds if its cell_methods indicated an ordinary time-mean, and in climatology if cell_methods indicated a climatological time-mean, i.e. with "within" and "over".
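
To make the difference between the two interpretations concrete, here is a minimal Python sketch. It is only an illustration: the function synthetic_tas and its values are made up (not real data), the helper names are hypothetical, and a single daily value stands in for the daily maximum. The first normal is the mean over every day of 1990-2019; the second is the mean of 16th July over those 30 years.

```python
from datetime import date, timedelta

def synthetic_tas(day):
    """Made-up temperature in degC (illustration only, not real data)."""
    doy = day.timetuple().tm_yday
    seasonal = 20.0 - 0.1 * abs(doy - 183)      # warmest near mid-year
    return seasonal + 0.02 * (day.year - 1990)  # small artificial trend

def days_in(start, end):
    """Yield each date in the half-open interval [start, end)."""
    d = start
    while d < end:
        yield d
        d += timedelta(days=1)

# Reference epoch 1990-2019, expressed as [1 Jan 1990, 1 Jan 2020)
epoch_start, epoch_end = date(1990, 1, 1), date(2020, 1, 1)
target = date(2023, 7, 16)

# Interpretation 1: normal = time-mean over the entire reference epoch
all_days = list(days_in(epoch_start, epoch_end))
normal_whole = sum(synthetic_tas(d) for d in all_days) / len(all_days)

# Interpretation 2: normal = mean of the same calendar day over the epoch years
same_days = [date(y, target.month, target.day)
             for y in range(epoch_start.year, epoch_end.year)]
normal_same_day = sum(synthetic_tas(d) for d in same_days) / len(same_days)

value = synthetic_tas(target)
anomaly_whole = value - normal_whole        # A = P - N, first interpretation
anomaly_same_day = value - normal_same_day  # A = P - N, second interpretation

print(f"anomaly wrt whole-epoch mean:   {anomaly_whole:.2f}")
print(f"anomaly wrt 16 July 30-yr mean: {anomaly_same_day:.2f}")
```

With any seasonal signal the two anomalies differ substantially for a mid-summer date, which is why a reader needs to know which normal was used.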

What do you think of this question, the draft text below, or any other related matter?


7.5. Anomaly data

In a data variable containing anomaly data, each element A is the difference P - N between a particular value P of any quantity and the normal value or norm N of the same quantity. N is a statistic calculated from all values of the quantity that lie within specified ranges of one or more of its coordinates. P can be, but is not necessarily, one of the set of values from which N is calculated.

The commonest kind of anomaly is the difference between the value P of a quantity and the mean N of the same quantity over some range of time coordinates, usually called the "climatological normal", the "climate normal", or the "climatology". N is usually either a mean over a number of entire years or a climatological mean (Section 7.4, "Climatological statistics"). The time coordinate of the anomaly may or may not lie within the range of times from which N is calculated.

A data variable A containing anomalies with respect to climatology is notionally the difference between a variable P with all the same coordinates as A and a multiannual or climatological time-mean data variable N which has all the same coordinates as A except for time. The time coordinate variable of N may be multivalued only if it is climatological time, and must otherwise be single-valued. P and N are usually not actually present in the dataset.

Several CF standard names have been defined for anomalies with respect to a climatology. The start and end of the climatological period may be recorded in the bounds of either a scalar coordinate variable, or a coordinate variable with a single size-one dimension, having reference_epoch as its standard name attribute, as in Example 7.A.

The use of the convention with anomaly standard names and reference_epoch is restricted to those common cases which have such standard names defined, and where the data variable contains anomalies with respect to a single climatology. Furthermore, if the climatology variable is not present in the file, there is no indication of the cell_methods of the climatology (see Example 7.A). Hence the interpretation of the anomaly is unclear.


Example 7.A. An anomaly data variable with a reference epoch.

variables:
  float delta_tas(time,latitude,longitude);
    delta_tas:standard_name="air_temperature_anomaly";
    delta_tas:units="degC";
    delta_tas:coordinates="climatological_time";
    delta_tas:cell_methods="time: maximum";
  double time(time);
    time:standard_name="time";
    time:units="days since 2023-7-16";
    time:bounds="time_bounds";
    time:calendar="standard";
  double time_bounds(time,two);
  double climatological_time;
    climatological_time:standard_name="reference_epoch";
    climatological_time:units="days since 1990-1-1";
    climatological_time:bounds="climatological_time_bounds";
    climatological_time:calendar="standard";
  double climatological_time_bounds(two);
data:
  time_bounds=0,1, 1,2, 2,3, 3,4;
  climatological_time_bounds=0,10957;

The data variable delta_tas contains daily maximum temperatures for 16th-19th July 2023 expressed as anomalies with respect to the climatological normal time-mean of 1990-2019, which is defined by the bounds of climatological_time. Note that 10957 days since 1st January 1990 in the standard calendar is 1st January 2020. The single value of the climatological_time coordinate variable should be a representative time within the climatological interval (see Section 7.4).
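
The date arithmetic in this example can be checked with a short Python sketch. It assumes that Python's proleptic Gregorian dates agree with the CF standard calendar for these modern dates; the helper decode_days is a hypothetical name introduced only for this illustration.

```python
from datetime import date, timedelta

def decode_days(offset_days, reference):
    """Convert a 'days since <reference>' offset to a calendar date."""
    return reference + timedelta(days=offset_days)

# Bounds of the reference epoch, units "days since 1990-1-1"
epoch_ref = date(1990, 1, 1)
epoch_start = decode_days(0, epoch_ref)
epoch_end = decode_days(10957, epoch_ref)

# Bounds of the anomaly time axis, units "days since 2023-7-16"
time_ref = date(2023, 7, 16)
time_bounds = [(0, 1), (1, 2), (2, 3), (3, 4)]
cells = [(decode_days(lo, time_ref), decode_days(hi, time_ref))
         for lo, hi in time_bounds]

print("reference epoch:", epoch_start, "to", epoch_end)
for start, end in cells:
    print("anomaly cell:", start, "to", end)
```

The 30 years 1990-2019 contain 7 leap days, so 30 x 365 + 7 = 10957, consistent with the upper bound falling on 1st January 2020.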

In this example, climatological_time is a scalar coordinate variable. We could alternatively define a dimension climatological_time=1, with a one-dimensional coordinate variable climatological_time(climatological_time) and bounds climatological_time_bounds(climatological_time,two). In this case, we must include climatological_time among the dimensions of delta_tas e.g. delta_tas(climatological_time,time,latitude,longitude), but the coordinates attribute is not needed, unlike in the case of a scalar coordinate variable.

delta_tas is interpreted as the difference between two other variables, which may or may not be contained in the dataset e.g. delta_tas = tas - climatological_tas, with

  float tas(time,latitude,longitude);
    tas:standard_name="air_temperature";
    tas:units="degC";
    tas:cell_methods="time: maximum";
  float climatological_tas(climatological_time,latitude,longitude);
    climatological_tas:standard_name="air_temperature";
    climatological_tas:units="degC";
    climatological_tas:coordinates="climatological_time";
    climatological_tas:cell_methods="climatological_time: mean";

In the above example, the daily anomalies are differences between the daily maxima and the time-mean of the entire 30-year period 1990-2019. Alternatively, we might have

  float climatological_tas(climmonths,latitude,longitude);
    climatological_tas:standard_name="air_temperature";
    climatological_tas:units="degC";
    climatological_tas:coordinates="climmonths";
    climatological_tas:cell_methods="climmonths: mean within years climmonths: mean over years";

with dimension climmonths=12 for climatological monthly means. In that case, the daily anomalies would be calculated with respect to the 30-year July mean. If the climatology variable is not present in the dataset, no information is available about its cell_methods, and we cannot distinguish the possibilities.

@JonathanGregory JonathanGregory added the enhancement Proposals to add new capabilities, improve existing ones in the conventions, improve style or format label Dec 28, 2024

TomLav commented Jan 9, 2025

Dear @JonathanGregory, Thank you very much for this proposal. I have no doubt that a new section 7.5 Anomaly data is needed and will help. I support adding a new section.

I read carefully through the proposed text and have the following comments.

  • The first two paragraphs are very clear.

  • At the start of the 3rd paragraph, I would change "A data variable A containing anomalies with respect to climatology" into "A data variable A containing anomalies with respect to a normal value" (because A is P - N and we called N a normal value).

  • In the 4th paragraph, I would modify the first sentence as "Several CF standard names (ending with _anomaly) have been defined for anomalies with respect to a normal value."

  • The 2nd sentence in the 4th paragraph I would split as: "The start and end of the climatological period may be recorded in the bounds of a coordinate variable having reference_epoch as its standard name attribute. This coordinate variable can either be a scalar coordinate variable, as in Example 7.A., or a coordinate variable with a single size-one dimension." I am unsure if "may be recorded" should be changed to "should be recorded".

  • I suggest we use "_anomaly standard names" (instead of "anomaly standard names") in paragraph 5.

  • In the text below example 7A, I must admit I do not know why the consequence of having climatological_time(climatological_time) is that we must include climatological_time among the dimensions of delta_tas. I trust you on that, but I do not know what part of the convention imposes this.

Finally, I am not sure if the part below "delta_tas is interpreted as the difference ..." helps. The final sentence reads: "If the climatology variable is not present in the dataset, no information is available about its cell_methods, and we cannot distinguish the possibilities." You also had the sentence (just before Example 7A): "Furthermore, if the climatology variable is not present in the file [...] the interpretation of the anomaly is unclear." Both sentences seem to indicate that if the climatology variable is present, we can distinguish between several cases. However, I do not think this holds, as there is no formal way for delta_tas to know that climatological_tas was the normal value used. There are no formal links (like attributes, common variables,...) between the two (or I missed something).

So I would suggest instead:

  • Delete the part starting with "delta_tas is interpreted as the difference ..."

  • Delete "Furthermore, if the climatology variable is not present in the file, there is no indication of the cell_methods of the climatology (see Example 7.A). Hence the interpretation of the anomaly is unclear." just before Example 7.A.

  • Find another way to highlight the ambiguity, e.g. (but there will surely be better ways) add the following paragraph (and Notes) at the end.

Edit 12:50 CET: I no longer think this makes sense:
====>
In example 7A, the reference_epoch variable climatological_time indicates the period 1990 - 2019. At this stage, the convention does not allow us to distinguish between any of the following cases:

  • the daily anomalies in delta_tas are differences between the daily maxima and the time-mean of the entire 30-year period 1990-2019
  • the daily anomalies in delta_tas are differences between the daily maxima and the 1990-2019 July mean
  • the daily anomalies in delta_tas are differences between the daily maxima and the daily maxima for the same days in 1990-2019

Note: Distinguishing between these (and other) cases would require the cell_methods of the normal value used to compute the anomalies stored in delta_tas to be recorded in one way or another in the dataset, which is not possible at this stage.
<====

Unfortunately, I did not find a way to easily prepare a modified version of your proposal with acceptable formatting. I guess it will be easier once you have created a Pull Request with the source code file.

All in all, I am very positive about this proposal, with the caveat that I do not think we should suggest that having the climatological_tas variable in the dataset puts us in a better position than if it is not present.

Addendum: If a variable storing the normal values is present in the same dataset (e.g. a variable named climatological_tas), its cell_methods attribute might be used to distinguish between the different cases. However, 1) there is no formal link (attribute, common variables, etc.) from delta_tas to the information in climatological_tas, and 2) climatological_tas would then also give access to the time period over which it was evaluated (via the bounds of the climmonths coordinate variable, in your example), which is redundant with the information already recorded in the reference_epoch variable.


TomLav commented Jan 9, 2025

As a complement to my comment above (and its later edit), I wanted to contribute the following example, which should correspond to Example 7A above, but in the case where the anomaly is wrt the 1990-2019 (climatological) July mean:

variables:
  float delta_tas(time,latitude,longitude);
    delta_tas:standard_name="air_temperature_anomaly";
    delta_tas:units="degC";
    delta_tas:coordinates="climatological_time";
    delta_tas:cell_methods="time: maximum";
  double time(time);
    time:standard_name="time";
    time:units="days since 2023-7-16";
    time:bounds="time_bounds";
    time:calendar="standard";
  double time_bounds(time,two);
  double climatological_time;
    climatological_time:standard_name="reference_epoch";
    climatological_time:units="days since 1990-1-1";
    climatological_time:bounds="climatological_time_bounds";
    climatological_time:calendar="standard";
  double climatological_time_bounds(two);
data:
  time_bounds=0,1, 1,2, 2,3, 3,4;
  climatological_time_bounds=181,10803;   // 01-07-1990,31-07-2019

The only changes wrt Example 7A are the two values stored in climatological_time_bounds.
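
As a quick check of these two bounds values, the following Python sketch decodes them against the units "days since 1990-1-1". It assumes Python's proleptic Gregorian dates match the standard calendar for these dates; the variable names are only illustrative.

```python
from datetime import date, timedelta

reference = date(1990, 1, 1)  # from units "days since 1990-1-1"

lower = reference + timedelta(days=181)
upper = reference + timedelta(days=10803)
print("lower bound:", lower)
print("upper bound:", upper)

# For comparison: the offset at which 1st August 2019 begins
offset_1_aug_2019 = (date(2019, 8, 1) - reference).days
print("offset of 2019-08-01:", offset_1_aug_2019)
```

The values decode to 1st July 1990 and 31st July 2019, matching the inline comment. One observation that is not part of the comment above: if bounds are read as instants at the start of the day, an upper bound of 10803 stops at the beginning of 31st July 2019, and 10804 would be needed to include the whole of that day.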

I agree there is no formal way to record that the climatology is the average of July months (climatology) rather than the average across the full time period July 1st 1990 - July 31st 2019 (bounds). Nevertheless, I can also see that a data consumer could reasonably assume the climatological meaning given such bounds.

Maybe this example (or just the delta to Example 7A) should be added to the text of the new section? It gives a practical example of a very common case.


2 participants