Frequency analysis

Frequency analysis refers to the study of the probability of occurrence of events. It’s often used in regulatory contexts to determine design values for infrastructure. For example, your city might require that road drainage systems be able to cope with a level of rainfall that is exceeded only once every 20 years on average. This 20-year return event, the infrastructure design value, is computed by first extracting annual precipitation maxima from a rainfall observation time series, fitting a statistical distribution to the maxima, then estimating the 95th percentile (a 1-in-20 chance of being exceeded in any given year).
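
To make the return period arithmetic explicit: a T-year event is exceeded with probability 1/T in any given year, so it corresponds to the quantile q = 1 - 1/T of the annual maxima. A quick check, not part of the analysis itself:

[ ]:
# For a 20-year return period, the corresponding quantile is 0.95,
# i.e. the 95th percentile of the distribution of annual maxima.
T = 20
q = 1 - 1 / T
q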

To facilitate this type of analysis on numerous time series from model simulations or observations, xclim provides a few common utility functions. In the following example, we estimate the 20-year return value, i.e. the 95th percentile of the daily precipitation maximum over the May-October period, using a Generalized Extreme Value (GEV) distribution.

Note that at the moment, all frequency analysis functions are hard-coded to operate along the time dimension.

Let’s first create a synthetic time series of daily precipitation.

[ ]:
from __future__ import annotations

import warnings

import numpy as np
import xarray as xr
from scipy.stats import bernoulli, gamma

from xclim.core.missing import missing_pct
from xclim.indices.generic import select_resample_op
from xclim.indices.stats import fa, fit, frequency_analysis, parametric_quantile

warnings.simplefilter("ignore")


# Create synthetic daily precipitation time series (mm/d)
n = 50 * 366
start = np.datetime64("1950-01-01")
time = start + np.timedelta64(1, "D") * range(n)
# time = xr.cftime_range(start="1950-01-01", periods=n)

# Generate wet (1) / dry (0) days, then multiply by rain magnitude.
wet = bernoulli.rvs(0.1, size=n)
intensity = gamma(a=4, loc=1, scale=6).rvs(n)
pr = xr.DataArray(
    wet * intensity,
    dims=("time",),
    coords={"time": time},
    attrs={"units": "mm/d", "standard_name": "precipitation_flux"},
)
pr

The frequency_analysis function combines all the necessary steps to estimate our design value:

  1. Extract May to October period (month=[5,6,7,8,9,10])

  2. Extract maxima (mode="max")

  3. Fit the GEV distribution to the maxima (dist="genextreme")

  4. Compute the value exceeded, on average, once every 20 years (t=20)

Note that xclim essentially wraps scipy.stats distributions, so many distributions, such as norm, gumbel_r and lognorm, are supported; an example with a different distribution follows the next cell.

[ ]:
# Compute the design value
frequency_analysis(
    pr, t=20, dist="genextreme", mode="max", freq="YS", month=[5, 6, 7, 8, 9, 10]
)
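
Because dist maps directly to scipy.stats distribution names, swapping in another distribution only requires changing that argument. As a sketch (not part of the original analysis), the same design value could be estimated with a Gumbel fit:

[ ]:
# Same computation, but fitting a Gumbel (gumbel_r) distribution instead of the GEV.
frequency_analysis(
    pr, t=20, dist="gumbel_r", mode="max", freq="YS", month=[5, 6, 7, 8, 9, 10]
)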

In practice, it’s often useful to save intermediate results, for example the parameters of the fitted distribution, so in the following we break down what goes on behind the frequency_analysis function.

The first step of the frequency analysis is to extract the May-October maxima. This is done using the indices.generic.select_resample_op function, which applies an operator (op) on a resampled time series. It can also select a portion of the year, such as a climatological season (e.g. season="DJF" for winter) or individual months (e.g. month=[1] for January); an example follows the next cell.

[ ]:
sub = select_resample_op(pr, op="max", freq="YS", month=[5, 6, 7, 8, 9, 10])
sub
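
For example, the same function can extract winter maxima through the season indexer. This is a sketch assuming daily data; anchoring the resampling year to December (freq="YS-DEC") keeps each DJF winter in a single group:

[ ]:
# Winter (DJF) maxima; the December-anchored year keeps each winter together.
winter_max = select_resample_op(pr, op="max", freq="YS-DEC", season="DJF")
winter_max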

The next step is to fit the statistical distribution to these maxima. This is done by the fit function, which takes as arguments the sample series, the distribution’s name and the parameter estimation method. By default, the fit is done using the maximum likelihood algorithm (method="ML"); parameters can also be estimated using the method of moments (method="MM").
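
For instance, a default maximum likelihood fit of the GEV on our maxima would look like this (a minimal sketch of the call):

[ ]:
# Fit the GEV to the annual maxima by maximum likelihood.
params_ml = fit(sub, dist="genextreme", method="ML")
params_ml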

xclim can also accept a distribution instance instead of a name (i.e. an instance of a scipy.stats.rv_continuous subclass). This is useful because maximum likelihood estimation is not always robust for some extreme value distributions; in that case, estimation through “Probability Weighted Moments” (method="PWM") can help. This is possible by passing a distribution object from the lmoments3 package together with method="PWM". That package currently only supports expon, gamma, genextreme, genpareto, gumbel_r, pearson3 and weibull_min (under different names; see its documentation). In the following example, we fit the “Generalized extreme value” distribution from lmoments3.

[ ]:
from lmoments3.distr import gev

# The fitting dimension is hard-coded as `time`.
params = fit(sub, dist=gev, method="PWM")
params

Finally, the last step is to compute the percentile, or quantile, from the fitted parameters, using the parametric_quantile function. The function uses metadata stored in the attributes of the parameters generated by fit to determine which distribution to use and what the units of the quantiles are. Here we need to pass the quantile (a value between 0 and 1), which for a T-year return period of maxima is simply 1 - 1/T.

[ ]:
parametric_quantile(params, q=1 - 1.0 / 20)
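
Since q accepts a sequence of values, several return periods can be evaluated in a single call. A sketch reusing the parameters fitted above:

[ ]:
# Quantiles corresponding to the 2-, 20- and 100-year return periods.
parametric_quantile(params, q=[1 - 1 / t for t in (2, 20, 100)])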

As a convenience, the last two steps (fit and parametric_quantile) are bundled into the fa function, which takes care of converting the return period into a quantile value and renames the output dimension from quantile to return_period. This renaming avoids name clashes with the quantile method: when analysing large ensembles or probabilistic samples, it’s often necessary to compute quantiles of the quantiles, which would make xarray raise an error if the dimension were still named quantile. The mode argument specifies whether we are working with maxima (max) or minima (min). This is important because a 100-year return period value for minima corresponds to the 0.01 quantile, while a 100-year return period value for maxima corresponds to the 0.99 quantile.

[ ]:
fa(sub, t=20, dist="genextreme", mode="max")
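
The t argument likewise accepts a sequence, in which case the results are laid out along the return_period dimension. A sketch:

[ ]:
# Return values for several return periods at once.
fa(sub, t=[2, 20, 100], dist="genextreme", mode="max")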

Handling missing values

When working with observations from weather stations, there are often stretches of days without measurements due to equipment malfunction. Practitioners usually do not want to discard an entire year of data because of a few missing days, so one option is to record the annual maximum only if the fraction of missing values stays below a given threshold, say 5%. These kinds of filters can easily be applied with xclim.

[ ]:
# Set the first half of the first year as missing.
pr[:200] = np.nan

# Compute a boolean mask flagging years with more than 5% of missing values.
null = missing_pct(pr, tolerance=0.05, freq="YS", month=[5, 6, 7, 8, 9, 10])

# Mask the maxima of those years before computing the frequency analysis.
fa(sub.where(~null), t=20, dist="genextreme", mode="max")
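
A stricter filter, flagging any year with at least one missing value, is available through missing_any. This is a sketch using the same indexer arguments:

[ ]:
from xclim.core.missing import missing_any

# Mask years containing any missing value at all.
null_any = missing_any(pr, freq="YS", month=[5, 6, 7, 8, 9, 10])
fa(sub.where(~null_any), t=20, dist="genextreme", mode="max")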