Download this notebook from github.

Frequency analysis

Frequency analysis refers to the study of the probability of occurrence of events. It’s often used in regulatory contexts to determine design values for infrastructures. For example, your city might require that road drainage systems be able to cope with a level of rainfall that is exceeded only once every 20 years on average. This 20-year return event, the infrastructure design value, is computed by first extracting precipitation annual maxima from a rainfall observation time series, fitting a statistical distribution to the maxima, then estimating the 95th percentile (1:20 chance of being exceeded).

To facilitate this type of analysis on a large number of time series from model simulations or observations, xclim packs a few common utility functions. In the following example, we’re estimating the 95th percentile of the daily precipitation maximum over the May-October period using a Generalized Extreme Value distribution.

Note that at the moment, all frequency analysis functions are hard-coded to operate along the time dimension.

Let’s first create a synthetic time series of daily precipitation.

[1]:
import numpy as np
import xarray as xr
import warnings

warnings.simplefilter("ignore")
from xclim.indices.stats import frequency_analysis, fit, fa, parametric_quantile
from xclim.indices.generic import select_resample_op
from xclim.core.missing import missing_pct
from scipy.stats import gamma, bernoulli


# Create synthetic daily precipitation time series (mm/d)
n = 50 * 366
start = np.datetime64("1950-01-01")
time = start + np.timedelta64(1, "D") * range(n)
# time = xr.cftime_range(start="1950-01-01", periods=n)

# Generate wet (1) /dry (0) days, then multiply by rain magnitude.
wet = bernoulli.rvs(0.1, size=n)
intensity = gamma(a=4, loc=1, scale=6).rvs(n)
pr = xr.DataArray(
    wet * intensity,
    dims=("time",),
    coords={"time": time},
    attrs={"units": "mm/d", "standard_name": "precipitation_flux"},
)
pr
[1]:
<xarray.DataArray (time: 18300)>
array([0., 0., 0., ..., 0., 0., 0.])
Coordinates:
  * time     (time) datetime64[ns] 1950-01-01 1950-01-02 ... 2000-02-07
Attributes:
    units:          mm/d
    standard_name:  precipitation_flux

The frequency_analysis function combines all the necessary steps to estimate our design value:

  1. Extract May to October period (month=[5,6,7,8,9,10])

  2. Extract maxima (mode="max")

  3. Fit the GEV distribution on the maxima (dist="genextreme")

  4. Compute the value exceeded, on average, once every 20 years (t=20)

Note that xclim essentially wraps scipy.stats distributions, so many distributions like norm, gumbel_r, lognorm, etc. are supported.

[2]:
# Compute the design value
frequency_analysis(
    pr, t=20, dist="genextreme", mode="max", freq="Y", month=[5, 6, 7, 8, 9, 10]
)
[2]:
<xarray.DataArray (return_period: 1)>
array([70.23818681])
Coordinates:
  * return_period  (return_period) int64 20
Attributes:
    units:          mm/d
    standard_name:  precipitation_flux
    long_name:      genextreme quantiles
    description:    Quantiles estimated by the genextreme distribution
    method:         ML
    estimator:      Maximum likelihood
    scipy_dist:     genextreme
    history:        [2022-01-07 21:39:39] fit: Estimate distribution paramete...
    cell_methods:
    mode:           max

In practice, it’s often useful to be able to save intermediate results, for example the parameters of the fitted distribution, so in the following we crack open what goes on behind the frequency_analysis function.

The first step of the frequency analysis is to extract the May-October maxima. This is done using the indices.generic.select_resample_op function, which applies an operator (op) on a resampled time series. It can also select portion of the year, such as climatological seasons (e.g. ‘DJF’ for winter months), or individual months (e.g. month=[1] for January).

[3]:
sub = select_resample_op(pr, op="max", freq="Y", month=[5, 6, 7, 8, 9, 10])
sub
[3]:
<xarray.DataArray (time: 51)>
array([41.0694836 , 63.37267089, 32.93156261, 73.99708785, 61.0269173 ,
       35.74498299, 46.47430442, 60.97010895, 62.57380973, 72.85306513,
       55.35330239, 49.52772208, 40.43039547, 61.43515214, 52.85715653,
       58.37708215, 45.64379164, 47.75724642, 45.96885773, 48.70402096,
       61.99072445, 47.2510167 , 52.86097199, 68.48756422, 78.48291857,
       52.08310351, 58.77181308, 40.43888027, 44.94718191, 35.27677597,
       57.27703974, 56.1227869 , 56.60102404, 49.83999645, 38.52026   ,
       44.44470716, 62.56411449, 42.2306041 , 48.84394259, 45.90049813,
       46.10668638, 42.5222162 , 45.18204632, 46.51212781, 53.57941062,
       63.20017324, 63.24391576, 43.80265745, 45.26901168, 53.76222462,
               nan])
Coordinates:
  * time     (time) datetime64[ns] 1950-12-31 1951-12-31 ... 2000-12-31
Attributes:
    units:          mm/d
    standard_name:  precipitation_flux

The next step is to fit the statistical distribution on these maxima. This is done by the fit method, which takes as argument the sample series, the distribution’s name and the parameter estimation method. The fit is done by default using the Maximum Likelihood algorithm. For some extreme value distributions however, the maximum likelihood is not always robust, and xclim offers the possibility to use Probability Weighted Moments (PWM) to estimate parameters. Note that the lmoments3 package which is used by xclim to compute the PWM only supports expon, gamma, genextreme, genpareto, gumbel_r, pearson3 and weibull_min.

[4]:
# The fitting dimension is hard-coded as `time`.
params = fit(sub, dist="genextreme")
params
[4]:
<xarray.DataArray (dparams: 3)>
array([ 0.12837868, 47.79607913,  9.08764043])
Coordinates:
  * dparams  (dparams) <U5 'c' 'loc' 'scale'
Attributes:
    original_units:          mm/d
    original_standard_name:  precipitation_flux
    long_name:               genextreme parameters
    description:             Parameters of the genextreme distribution
    method:                  ML
    estimator:               Maximum likelihood
    scipy_dist:              genextreme
    units:
    history:                 [2022-01-07 21:39:39] fit: Estimate distribution...

Finally, the last step is to compute the percentile, or quantile, using the fitted parameters, using the parametric_quantile function. The function uses metadata stored in attributes of the parameters generated by fit to determine what distribution to use and what are the units of the quantiles. Here we need to pass the quantile (values between 0 and 1), which for exceedance probabilities is just :math``1 - 1/T``.

[5]:
parametric_quantile(params, q=1 - 1.0 / 20)
[5]:
<xarray.DataArray (quantile: 1)>
array([70.23818681])
Coordinates:
  * quantile  (quantile) float64 0.95
Attributes:
    units:          mm/d
    standard_name:  precipitation_flux
    long_name:      genextreme quantiles
    description:    Quantiles estimated by the genextreme distribution
    method:         ML
    estimator:      Maximum likelihood
    scipy_dist:     genextreme
    history:        [2022-01-07 21:39:39] fit: Estimate distribution paramete...
    cell_methods:   

As a convenience utility, the two last steps (fit and parametric_quantile) are bundled into the fa function, which takes care of converting the return period into a quantile value, and renames the quantile output dimension to return_period. This dimension renaming is done to avoid name clashes with the quantile method. Indeed, it’s often necessary when analysing large ensembles, or probabilistic samples, to compute the quantiles of the quantiles, which will cause xarray to raise an error. The mode argument specifies whether we are working with maxima (max) or minima (min). This is important because a 100-year return period value for minima corresponds to a 0.01 quantile, while a 100-year return period value for maxima corresponds to a 0.99 quantile.

[6]:
fa(sub, t=20, dist="genextreme", mode="max")
[6]:
<xarray.DataArray (return_period: 1)>
array([70.23818681])
Coordinates:
  * return_period  (return_period) int64 20
Attributes:
    units:          mm/d
    standard_name:  precipitation_flux
    long_name:      genextreme quantiles
    description:    Quantiles estimated by the genextreme distribution
    method:         ML
    estimator:      Maximum likelihood
    scipy_dist:     genextreme
    history:        [2022-01-07 21:39:39] fit: Estimate distribution paramete...
    cell_methods:
    mode:           max

Handling missing values

When working with observations from weather stations, there are often stretches of days without measurements due to equipment malfunction. Practitioners usually do not want to ignore entire years of data due to a few missing days, so one option is to record annual maxima only if there are no more than a given percentage of missing values, say 5%. These kinds of filters can easily be applied using xclim.

[7]:
# Set the first half of the first year as missing.
pr[:200] = np.nan

# Compute vector returning which years should be considered missing.
null = missing_pct(pr, tolerance=0.05, freq="Y", month=[5, 6, 7, 8, 9, 10])

# Compute stats on masked values
fa(sub.where(~null), t=20, dist="genextreme", mode="high")
[7]:
<xarray.DataArray (return_period: 1)>
array([70.32263951])
Coordinates:
  * return_period  (return_period) int64 20
Attributes:
    units:          mm/d
    standard_name:  precipitation_flux
    long_name:      genextreme quantiles
    description:    Quantiles estimated by the genextreme distribution
    method:         ML
    estimator:      Maximum likelihood
    scipy_dist:     genextreme
    history:        [2022-01-07 21:39:40] fit: Estimate distribution paramete...
    cell_methods:
    mode:           high