Extending xclim¶

xclim tries to make it easy for users to add their own indices and indicators. The following goes into detail on how to create **Indices** and document them so that xclim can parse most of the metadata directly. We then explain the multiple ways new **Indicators** can be created and, finally, how they can be regrouped and structured in virtual submodules.

Central to xclim are the Indicators, objects computing indices over climate variables, but xclim also provides many other modules:

[Figure: overview of xclim's modules]

This introduction will focus on the Indicator/Index part of xclim and how one can extend it by implementing new ones.

Indices vs Indicators¶

Internally and in the documentation, xclim makes a distinction between “indices” and “indicators”.

index¶

A Python function accepting DataArrays and other parameters (usually built-in types).
Returns one or several DataArrays.
Handles units: checks input units and sets proper CF-compliant output units. It doesn’t usually prescribe specific units, but the output will at minimum have the proper dimensionality.
Performs no other checks and sets no other (non-unit) metadata.
Accessible through xclim.indices.

indicator¶

An instance of a subclass of xclim.core.indicator.Indicator that wraps around an index (stored in its compute property).
Returns one or several DataArrays.
Handles missing values, performs input data and metadata checks (see usage).
Always outputs data in the same units.
Adds dynamically generated metadata to the output after computation.
Accessible through xclim.indicators.

Most metadata stored in the Indicators is parsed from the underlying index documentation, so defining indices with complete documentation and an appropriate signature helps the process. The next two sections go into detail on the definition of both objects.
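The division of labour between the two objects can be illustrated with a dependency-free sketch. This is not xclim's implementation: `mean_index` and `ToyIndicator` are hypothetical names, and plain lists stand in for DataArrays. It only shows the idea that the index computes, while the indicator wraps it with checks, missing-value handling, fixed units and metadata.

```python
from dataclasses import dataclass
from typing import Callable


def mean_index(values: list) -> float:
    """Toy 'index': a plain function that only computes."""
    return sum(values) / len(values)


@dataclass
class ToyIndicator:
    """Toy 'indicator': wraps an index (hypothetical, not xclim's class)."""

    identifier: str
    compute: Callable
    units: str
    long_name: str

    def __call__(self, values: list) -> dict:
        # Input check, a stand-in for the data and metadata checks.
        if not values:
            raise ValueError("Input series is empty.")
        # Missing-value handling: here, any None invalidates the result.
        if any(v is None for v in values):
            result = None
        else:
            result = self.compute(values)
        # Metadata is attached after the computation, always with the same units.
        attrs = {"units": self.units, "long_name": self.long_name}
        return {"data": result, "attrs": attrs}


toy_mean = ToyIndicator("toy_mean", mean_index, units="K", long_name="Mean value")
valid = toy_mean([1.0, 2.0, 3.0])
masked = toy_mean([1.0, None])
```

The same `mean_index` function stays usable on its own, which mirrors how indices remain accessible independently of the indicators built on them.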

Call sequence¶

The following graph shows the steps done when calling an Indicator. Attributes and methods of the Indicator object relating to those steps are listed on the right side.

[Figure: call sequence of an Indicator]

Defining new indices¶

The annotated example below shows the general template to be followed when defining proper indices. In the comments, Ind is the indicator instance that would be created from this function.

Note that following these standards is not required when writing indices that will be wrapped in indicators. Parsing problems will not raise errors at runtime, but they might raise warnings and will result in Indicators with poorer metadata than most users expect, especially users who call indicators dynamically from other applications where the code is inaccessible, like web services.

[Figure: annotated index docstring template]

The following code is another example.

[ ]:

import xarray as xr
from IPython.display import Code, display

import xclim
from xclim.core.units import convert_units_to, declare_units
from xclim.indices.generic import threshold_count


@declare_units(tasmax="[temperature]", thresh="[temperature]")
def tx_days_compare(tasmax: xr.DataArray, thresh: str = "0 degC", op: str = ">", freq: str = "YS"):
    r"""
    Number of days where maximum daily temperature is above or under a threshold.

    The daily maximum temperature is compared to a threshold using a given operator and the number
    of days where the condition is true is returned.

    It assumes a daily input.

    Parameters
    ----------
    tasmax : xarray.DataArray
        Maximum daily temperature.
    thresh : str
        Threshold temperature to compare to.
    op : {'>', '<'}
        The operator to use.
        # A fixed set of choices can be imposed. Only strings, numbers, booleans or None are accepted.
    freq : str
        Resampling frequency.

    Returns
    -------
    xarray.DataArray, [time]
        Number of days where the comparison between daily maximum temperature and the threshold is true.

    Notes
    -----
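    Let :math:`TX_{ij}` be the maximum temperature at day :math:`i` of period :math:`j` and :math:`T` the
    threshold. Then the number of days in period :math:`j` where the comparison holds is:

    .. math::

        N_j = \sum_i \left[ TX_{ij} \; \mathrm{op} \; T \right]

    where :math:`[P]` is 1 if :math:`P` is true and 0 otherwise.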

    References
    ----------
    :cite:cts:`smith_citation_2020`
    """
    thresh = convert_units_to(thresh, tasmax)
    out = threshold_count(tasmax, op, thresh, freq)
    out.attrs["units"] = "days"
    return out

Naming and conventions¶

Variable names should correspond to CMIP6 variables, whenever possible. The file xclim/data/variables.yml lists all variables that xclim can use when generating indicators from YAML files (see below), and new indices should try to reflect these also.

Generic functions for common operations¶

The xclim.indices.generic submodule contains useful functions for common computations (like threshold_count or select_resample_op) and many basic index functions, as defined by clix-meta. In order to reduce duplicate code, their use is recommended for xclim’s indices. As previously said, units must be handled explicitly when the handling is non-trivial; xclim.core.units exposes a few helpers for that (like convert_units_to, to_agg_units or rate2amount).
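To make the idea of a thresholded count concrete, here is a minimal standard-library sketch of the operation: compare each value to a threshold with a chosen operator, then count the matches per year. The function `count_by_year` and the `OPS` table are hypothetical illustrations, not the implementation of threshold_count, and plain lists replace DataArrays.

```python
import operator
from collections import defaultdict
from datetime import date

# Map operator strings to comparison functions (illustrative subset).
OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}


def count_by_year(days, values, op, thresh):
    """Count, for each year, the values satisfying `value op thresh`."""
    compare = OPS[op]
    counts = defaultdict(int)
    for day, value in zip(days, values):
        # A boolean adds 1 when the condition holds and 0 otherwise.
        counts[day.year] += int(compare(value, thresh))
    return dict(counts)


days = [date(2000, 1, 1), date(2000, 1, 2), date(2001, 1, 1), date(2001, 1, 2)]
tasmax = [-2.0, 1.5, 3.0, 4.0]  # already in the same units as the threshold
counts = count_by_year(days, tasmax, ">", 0.0)
```

Note that the values and the threshold are assumed to share the same units before the comparison, which is the role convert_units_to plays in the index shown earlier.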

Documentation¶

As shown in both examples, a certain level of convention is best followed when writing the docstring of the index function. The general structure follows the NumpyDoc conventions, and some fields might be parsed when creating the indicator (see the image above and the section below). If you are contributing to the xclim codebase and want to add a citation to the docstring, this is best done by adding the reference to the references.bib file and then citing it using its label with the :cite:cts: directive (or one of its variants). See the contributing docs.

Defining new indicators¶

xclim’s Indicators are instances of (subclasses of) xclim.core.indicator.Indicator. While they are central to xclim, their construction can be somewhat tricky as a lot happens backstage. Essentially, they act as self-aware functions, taking a set of input variables (DataArrays) and parameters (usually strings, integers or floats), performing some health checks on them and returning one or multiple DataArrays, with CF-compliant (and potentially translated) metadata attributes, masked according to a given set of missing-value rules. They define the following key attributes:

the identifier, a string that uniquely identifies the indicator, usually all caps.
the realm, one of “atmos”, “land”, “seaIce” or “ocean”, classifying the domain of use of the indicator.
the compute function that returns one or more DataArrays, the “index”.
the cfcheck and datacheck methods that make sure the inputs are appropriate and valid.
the missing function that masks elements based on null values in the input.
all metadata attributes that will be attributed to the output and that document the indicator:
- Indicator-level attributes are: title, abstract, keywords, references and notes.
- Output variables attributes (respecting CF conventions) are: var_name, standard_name, long_name, units, cell_methods, description and comment.

Output variables attributes are regrouped in Indicator.cf_attrs and input parameters are documented in Indicator.parameters.

A particularity of Indicators is that each instance corresponds to a single class: when creating a new indicator, a new class is automatically created. This makes it easy to construct indicators based on others, as shown further down.
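The "one class per instance" pattern can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `ToyIndicator` is a hypothetical class, and the real Indicator does considerably more work (metadata parsing, registration, checks) at creation time.

```python
class ToyIndicator:
    """Toy base class: every instantiation creates a dedicated subclass."""

    identifier = "base"
    units = ""

    def __new__(cls, **attrs):
        # Build a new subclass carrying the overridden attributes...
        name = attrs.get("identifier", cls.identifier).upper()
        new_cls = type(name, (cls,), attrs)
        # ...and return an instance of that subclass instead of `cls`.
        return super().__new__(new_cls)

    def __init__(self, **attrs):
        # The attributes already live on the new class; nothing to set here.
        pass


tg = ToyIndicator(identifier="tg", units="K")
# Instantiating the class of an existing indicator derives a child from it.
tg_c = type(tg)(identifier="tg_c", units="degC")
```

Because `tg_c` is built from the class of `tg`, it inherits everything that was not overridden, which is the mechanism that makes deriving one indicator from another cheap.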

See the class documentation for more info on the meaning of each attribute. The indicators module contains over 50 examples of indicators to draw inspiration from.

Identifier vs python name¶

An indicator’s identifier is not the same as the name it has within the python module. For example, xclim.atmos.relative_humidity has hurs as its identifier. As explained below, indicator classes can be accessed through xclim.core.indicator.registry with their identifier.

Metadata parsing vs explicit setting¶

As explained above, most metadata can be parsed from the index’s signature and docstring. Otherwise, it can always be set when creating a new Indicator instance or a new subclass. When creating an indicator, output metadata attributes can be given as strings, or as lists of strings in the case of an indicator returning multiple outputs. Internally, however, they are stored in cf_attrs, a list of dictionaries (one per output) on the instance.
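The conversion from "string or list of strings" to "list of dictionaries" can be pictured with the following sketch. The helper `build_cf_attrs` is hypothetical, and the rule that a single string applies to every output is an assumption made for the illustration, not a documented guarantee.

```python
def build_cf_attrs(n_outputs, **attrs):
    """Turn per-attribute values into one dictionary per output variable."""
    cf_attrs = [{} for _ in range(n_outputs)]
    for name, value in attrs.items():
        # Assumption: a single string is shared by all outputs,
        # while a list provides one value per output.
        values = [value] * n_outputs if isinstance(value, str) else list(value)
        if len(values) != n_outputs:
            raise ValueError(f"'{name}' needs {n_outputs} values, got {len(values)}.")
        for out_attrs, val in zip(cf_attrs, values):
            out_attrs[name] = val
    return cf_attrs


cf_attrs = build_cf_attrs(2, units="mm", var_name=["r95p", "r95p_count"])
```

Here `cf_attrs[0]` describes the first output and `cf_attrs[1]` the second, each with its own var_name and the shared units.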

Internationalization of metadata¶

xclim offers the possibility to translate the main Indicator metadata fields and automatically add the translations to the outputs. The mechanism is explained in the Internationalization page.

Inputs and checks¶

xclim decides which input arguments of the indicator’s call function are considered variables and which are parameters using the annotations of the underlying index (the compute method). Arguments annotated with the xarray.DataArray type are considered variables and can be read from the dataset passed in ds.
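The annotation-based split between variables and parameters can be sketched with the standard `inspect` module. The `DataArray` class below is a hypothetical stand-in for xarray.DataArray so that the example has no dependencies, and `split_arguments` is an illustrative helper, not an xclim function.

```python
import inspect


class DataArray:
    """Hypothetical stand-in for xarray.DataArray."""


def toy_index(tasmax: DataArray, thresh: str = "0 degC", freq: str = "YS"):
    """Toy index with one variable and two parameters."""


def split_arguments(func):
    """Separate arguments annotated as DataArray from the other parameters."""
    variables, parameters = [], []
    for name, param in inspect.signature(func).parameters.items():
        # Only the annotation decides: DataArray means "variable".
        target = variables if param.annotation is DataArray else parameters
        target.append(name)
    return variables, parameters


variables, parameters = split_arguments(toy_index)
```

This is why an accurate signature matters: an argument missing its DataArray annotation would be treated as a plain parameter and could not be looked up by name in a dataset.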

Indicator creation¶

There are two ways of creating indicators:

By initializing an existing indicator (sub)class
From a dictionary

The first method is best when defining indicators in scripts or external modules and is explained here. The second is best used when building virtual modules through YAML files, and is explained further down and in the submodule doc.

Creating a new indicator that simply modifies some metadata output of an existing one is a simple call like:

[ ]:

from xclim.core.indicator import registry

# An indicator based on tg_mean, but returning Celsius and fixed on annual resampling
tg_mean_c = registry["TG_MEAN"](
    identifier="tg_mean_c",
    units="degC",
    title="Mean daily mean temperature but in degC",
    parameters=dict(freq="YS"),  # We inject the freq arg.
)

[ ]:

display(Code(tg_mean_c.__doc__, language="rst"))

The registry is a dictionary mapping indicator identifiers (in uppercase) to their class. This way, we could subclass tg_mean to create our new indicator. tg_mean_c is exactly the same as atmos.tg_mean, but it outputs the result in degrees Celsius instead of kelvins, has a different title, and removes control over the freq argument by fixing the resampling to “YS”. The identifier keyword is needed here to differentiate the new indicator from tg_mean itself. Had it not been given, a warning would have been raised and further subclassing of tg_mean would in fact have subclassed tg_mean_c, which is not wanted!

By default, indicator classes are registered in xclim.core.indicator.registry, using their identifier, which is prepended by the indicator’s module if that indicator is declared outside xclim. A “child” indicator inherits its module from its parent:

[ ]:

tg_mean_c.__module__ == xclim.atmos.tg_mean.__module__

To create indicators with a different module, for example to differentiate them in the registry, two methods can be used: passing module to the constructor, or using conventional class inheritance.

[ ]:

# Passing module
tg_mean_c2 = registry["TG_MEAN_C"](module="test")  # we didn't change the identifier!
print(tg_mean_c2.__module__)
"test.TG_MEAN_C" in registry

[ ]:

# Conventional class inheritance, uses the current module name

class TG_MEAN_C3(registry["TG_MEAN_C"]):  # noqa
    pass  # nothing to change really

tg_mean_c3 = TG_MEAN_C3()

print(tg_mean_c3.__module__)
"__main__.TG_MEAN_C" in registry

While the former method is shorter, the latter is what xclim uses internally, as it provides some clean code structure. See the code in the GitHub repo.
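The registration rules described in this section (uppercase identifier as key, module prefix for indicators declared outside xclim, module inherited from the parent) can be summarized with a small sketch. It relies on `__init_subclass__` and on hypothetical class names; the actual registration code in xclim may be organized differently.

```python
registry = {}


class ToyIndicator:
    """Toy base class registering each subclass under an uppercase key."""

    identifier = "base"
    module = "xclim.indicators.atmos"  # inherited by children unless overridden

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        key = cls.identifier.upper()
        # Assumption: only indicators declared outside xclim get a module prefix.
        if not cls.module.startswith("xclim"):
            key = f"{cls.module}.{key}"
        registry[key] = cls


class TgMean(ToyIndicator):
    identifier = "tg_mean"


class TgMeanC(TgMean):
    identifier = "tg_mean_c"  # a new identifier avoids overwriting TG_MEAN


class TgMeanCTest(TgMeanC):
    module = "test"  # same identifier, different module


keys = sorted(registry)
```

Omitting the new identifier in `TgMeanC` would register the child under "TG_MEAN" again and replace the parent entry, which is the toy equivalent of the situation the warning mentioned above protects against.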

Virtual modules¶

xclim gives users the ability to generate their own modules from the existing library of indices. These mappings can help in emulating existing libraries (such as icclim), with the added benefit of CF-compliant metadata, multilingual metadata support, and optimized calculations using federated resources (using Dask). This can be used, for example, to tailor existing indices with predefined thresholds without having to rewrite them.

Presently, xclim is capable of approximating the indices developed in icclim, ANUCLIM and clix-meta and is open to contributions of new indices and library mappings.

This notebook serves as an example of how one might go about creating their own library of mapped indices. Two ways are possible:

From a YAML file (recommended way)
From a mapping (dictionary) of indicators

YAML file¶

The first method is based on the YAML syntax proposed by clix-meta, expanded to xclim’s needs. The full documentation on that syntax is here. This notebook shows an example of different complexities of indicator creation. It creates a minimal python module defining an index, creates a YAML file with the metadata for several indicators and then parses it into xclim.

[ ]:

# These variables were generated by a hidden cell above that syntax-colored them.
print("Content of example.py :")
display(Code(pydata, language="python"))
print("\n\nContent of example.yml :")
display(Code(ymldata, language="yaml"))
print("\n\nContent of example.fr.json :")
display(Code(jsondata, language="json"))

example.yml defines a module of 7 indicators.

Values of the base arguments are the identifier of the associated indicators, and those can be different from their name within the Python modules. For example, xclim.atmos.relative_humidity has HURS as identifier. One can always access xclim.atmos.relative_humidity.identifier to get the correct name to use. The base argument also accepts generic base classes which are registered in xc.core.indicator.base_registry.

RX1day is the same as registry['RX1DAY'], but with an updated long_name and an injected argument: its indexer arg is now set to only compute over May to September.
RX5day_canopy is based on registry['MAX_N_DAY_PRECIPITATION_AMOUNT'], changes the long_name and injects the window and freq arguments.
- It also requests a different variable than the original indicator: prveg instead of pr. As xclim doesn’t know about prveg, a definition is given in the variables section.
R75pdays is based on registry['DAYS_OVER_PRECIP_THRESH'], injects the thresh argument and changes the description of the per argument.
first_frost_day is a more complex example. As there is no base: entry, the Daily class serves as a base by default. This class doesn’t do much, so a lot has to be given explicitly:
- A compute function name is given. Here it refers to a “generic” function (in xclim.indices.generic), which means it doesn’t provide any pertinent metadata.
- Thus, output metadata fields are given
- Some parameters are injected, the default for freq is modified, but left as an argument.
- The input variable data is mapped to a known variable. “Generic” functions do not handle the units, so we need to tell xclim that the data argument is minimum daily temperature. This will activate the proper units check and CF-compliance checks within the indicator class.
winter_fd’s compute uses an index function instead of a “generic” one. Functions directly in xclim.indices have docstrings that the indicator builder can parse to populate the indicator’s metadata. They also handle units and expose that information to the indicator class. This example also specifies a base indicator class that supports indexing (which the default Daily does not), which allows the injection of an indexer.
R95p is similar to first_frost_day but here the compute is not defined in xclim but rather in example.py. Also, the custom function returns two outputs, so the output section is a list of mappings rather than a mapping directly.
R99p is the same as R95p but changes the injected value. In order to avoid rewriting the output metadata and allowed periods, we based it on R95p: as the latter was defined within the current YAML file, the identifier is prefixed by a dot (.).

Additionally, the YAML specified a realm and references to be used on all indices and provided a submodule docstring.

Finally, French translations for the main attributes and the new indicators are given in example.fr.json. Even though new indicator objects are created for each YAML entry, translations missing from the JSON file are taken from the base classes.

Note that all files are named the same way: example.<ext>, with the translations having an additional suffix giving the locale name. In the next cell, we build the module by passing only the path without extension. This absence of extension is what tells xclim to try to parse a module (*.py) and custom translations (*.<locale>.json). Those two could also be read beforehand and passed through the indices= and translations= arguments.
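The naming convention can be illustrated with a short `pathlib` sketch that, given a path without extension, locates the sibling YAML, Python and translation files. The helper `find_module_files` is hypothetical and only mimics the convention described above; it is not how xclim necessarily performs the lookup.

```python
import tempfile
from pathlib import Path


def find_module_files(stem):
    """Locate the YAML, Python and translation files sharing a common stem."""
    stem = Path(stem)
    indices = stem.with_suffix(".py")
    translations = {
        # "example.fr.json" has suffixes [".fr", ".json"], so the locale is "fr".
        path.suffixes[-2].lstrip("."): path
        for path in sorted(stem.parent.glob(f"{stem.name}.*.json"))
    }
    return {
        "yaml": stem.with_suffix(".yml"),
        "indices": indices if indices.exists() else None,  # the module is optional
        "translations": translations,
    }


# Demonstration on empty files created in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("example.yml", "example.py", "example.fr.json"):
        (Path(tmp) / name).touch()
    found = find_module_files(Path(tmp) / "example")
    yaml_name = found["yaml"].name
    indices_name = found["indices"].name
    locales = sorted(found["translations"])
```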

Validation of the YAML file¶

Using yamale, it is possible to check if the YAML file is valid. xclim ships with a schema file (xclim/data/schema.yml).

The validation can be executed in a python session:

[ ]:

from importlib.resources import files

import yamale

data = files("xclim.data").joinpath("schema.yml")
schema = yamale.make_schema(data)

example_module = yamale.make_data(example_dir / "example.yml")  # in the example folder

yamale.validate(schema, example_module)

Or the validation can alternatively be run from the command line with:

yamale -s path/to/schema.yml path/to/module.yml

Note that when xclim builds indicators from a YAML file, as shown in the next example, it validates the file first.

Loading the module and computing indicators¶

[ ]:

import xclim as xc

example = xc.core.indicator.build_indicator_module_from_yaml(example_dir / "example", mode="raise")

[ ]:

docstring = f"{example.__doc__}\n---\n\n{xc.indicators.example.R99p.__doc__}"
display(Code(docstring, language="rst"))

When using this technique in large projects, it is useful to iterate over the indicators like so:

[ ]:

from xclim.testing import open_dataset

ds = open_dataset("ERA5/daily_surface_cancities_1990-1993.nc")
with xr.set_options(keep_attrs=True):
    ds2 = ds.assign(
        pr_per=xc.core.calendar.percentile_doy(ds.pr, window=5, per=75).isel(percentiles=0),
        prveg=ds.pr * 1.1,  # Very realistic
    )
    ds2.prveg.attrs["standard_name"] = "precipitation_flux_onto_canopy"

outs = []
with xc.set_options(metadata_locales="fr"):
    inds = ["Indicators:"]
    for name, ind in example.iter_indicators():
        inds.append(f"  {name}:")
        inds.append(f"    identifier: {ind.identifier}")
        out = ind(ds=ds2)  # Use all default arguments and variables from the dataset
        if isinstance(out, tuple):
            outs.extend(out)
            for i, o in enumerate(out):
                inds.append(f"    long_name_{i}: ({o.name}) {o.long_name}")
        else:
            outs.append(out)
            inds.append(f"    long_name: ({out.name}) {out.long_name}")

display(Code("\n".join(inds), language="yaml"))

Once merged in the next cell, out contains all the computed indices, with translated metadata. Note that this merge doesn’t make much sense with the current list of indicators since they have different frequencies (freq).

[ ]:

out = xr.merge(outs)
out.attrs = {
    "title": "Indicators computed from the example module."
}  # Merge puts the attributes of the first variable, we don't want that.
out

Mapping of indicators¶

For more complex mappings, submodules can be constructed from Indicators directly. This is not the recommended way, but can sometimes be a workaround when the YAML version is lacking features.

[ ]:

from xclim.core.indicator import build_indicator_module, registry

mapping = dict(
    egg_cooking_season=registry["MAXIMUM_CONSECUTIVE_WARM_DAYS"](
        module="awesome",
        compute=xc.indices.maximum_consecutive_tx_days,
        parameters=dict(thresh="35 degC"),
        long_name="Season for outdoor egg cooking.",
    ),
    fish_feeling_days=registry["WETDAYS"](
        module="awesome",
        compute=xc.indices.wetdays,
        parameters=dict(thresh="14.0 mm/day"),
        long_name="Days where we feel we are fishes",
    ),
    sweater_weather=xc.atmos.tg_min.__class__(module="awesome"),
)

awesome = build_indicator_module(
    name="awesome",
    objs=mapping,
    doc="""
        =========================
        My Awesome Custom indices
        =========================
        There are only 3 indices that really matter when you come down to brass tacks.
        This mapping library exposes them to users who want to perform real deal
        climate science.
        """,
)

[ ]:

print(xc.indicators.awesome.__doc__)