{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Frequency analysis\n",
    "\n",
    "Frequency analysis refers to the study of the probability of occurrence of events. It's often used in regulatory contexts to determine design values for infrastructures. For example, your city might require that road drainage systems be able to cope with a level of rainfall that is exceeded only once every 20 years on average. This 20-year return event, the infrastructure *design value*, is computed by first extracting precipitation annual maxima from a rainfall observation time series, fitting a statistical distribution to the maxima, then estimating the 95th percentile (1:20 chance of being exceeded).\n",
    "\n",
    "To facilitate this type of analysis on numerous time series from model simulations or observations, `xclim` packs a few common utility functions. In the following example, we're estimating the 95th percentile of the daily precipitation maximum over the May-October period using a Generalized Extreme Value distribution.\n",
    "\n",
    "Note that at the moment, all frequency analysis functions are hard-coded to operate along the `time` dimension.\n",
    "\n",
    "Let's first create a synthetic time series of daily precipitation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from __future__ import annotations\n",
    "\n",
    "import warnings\n",
    "\n",
    "import numpy as np\n",
    "import xarray as xr\n",
    "from scipy.stats import bernoulli, gamma\n",
    "\n",
    "from xclim.core.missing import missing_pct\n",
    "from xclim.indices.generic import select_resample_op\n",
    "from xclim.indices.stats import fa, fit, frequency_analysis, parametric_quantile\n",
    "\n",
    "warnings.simplefilter(\"ignore\")\n",
    "\n",
    "\n",
    "# Create synthetic daily precipitation time series (mm/d)\n",
    "n = 50 * 366\n",
    "start = np.datetime64(\"1950-01-01\")\n",
    "time = start + np.timedelta64(1, \"D\") * range(n)\n",
    "# time = xr.cftime_range(start=\"1950-01-01\", periods=n)\n",
    "\n",
    "# Generate wet (1) /dry (0) days, then multiply by rain magnitude.\n",
    "wet = bernoulli.rvs(0.1, size=n)\n",
    "intensity = gamma(a=4, loc=1, scale=6).rvs(n)\n",
    "pr = xr.DataArray(\n",
    "    wet * intensity,\n",
    "    dims=(\"time\",),\n",
    "    coords={\"time\": time},\n",
    "    attrs={\"units\": \"mm/d\", \"standard_name\": \"precipitation_flux\"},\n",
    ")\n",
    "pr"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `frequency_analysis` function combines all the necessary steps to estimate our design value:\n",
    "\n",
    "1. Extract May to October period (`month=[5,6,7,8,9,10]`)\n",
    "2. Extract maxima (`mode=\"max\"`)\n",
    "3. Fit the GEV distribution on the maxima (`dist=\"genextreme\"`)\n",
    "4. Compute the value exceeded, on average, once every 20 years (`t=20`)\n",
    "\n",
    "Note that `xclim` essentially wraps `scipy.stats` distributions, so many distributions like `norm`, `gumbel_r`, `lognorm`, etc. are supported."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Compute the design value\n",
    "frequency_analysis(pr, t=20, dist=\"genextreme\", mode=\"max\", freq=\"YS\", month=[5, 6, 7, 8, 9, 10])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In practice, it's often useful to be able to save intermediate results, for example the parameters of the fitted distribution, so in the following we crack open what goes on behind the `frequency_analysis` function.\n",
    "\n",
    "The first step of the frequency analysis is to extract the May-October maxima. This is done using the `indices.generic.select_resample_op` function, which applies an operator (`op`) on a resampled time series. It can also select a portion of the year, such as climatological seasons (e.g. 'DJF' for winter months), or individual months (e.g. `month=[1]` for January)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sub = select_resample_op(pr, op=\"max\", freq=\"YS\", month=[5, 6, 7, 8, 9, 10])\n",
    "sub"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next step is to fit the statistical distribution on these maxima. This is done by the `.fit` method, which takes as argument the sample series, the distribution's name and the parameter estimation `method`. The fit is done by default using the Maximum Likelihood algorithm (`method=\"ML\"`). Parameters can also be estimated using the method of moments (`method=\"MM\"`).\n",
    "\n",
    "`xclim` can also accept a distribution instance instead of name (i.e. a subclass of `scipy.stats.rv_continuous`). For example, for some extreme value distributions, the maximum likelihood is not always robust. Using the \"Probability Weighted Moments\" (`method=\"PWM\"`) method can help in that case. This is possible by passing a distribution object from the `lmoments3` package together with `method=\"PWM\"`. That package currently only supports `expon`, `gamma`, `genextreme`, `genpareto`, `gumbel_r`, `pearson3`, and `weibull_min` (with other names, see [the documentation](https://lmoments3.readthedocs.io/en/stable/distributions.html)). In the following example, we fit using the \"Generalized extreme value\" distribution from `lmoments3`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from lmoments3.distr import gev\n",
    "\n",
    "# The fitting dimension is hard-coded as `time`.\n",
    "params = fit(sub, dist=gev, method=\"PWM\")\n",
    "params"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, the last step is to compute the percentile, or quantile, using the fitted parameters, using the `parametric_quantile` function. The function uses metadata stored in attributes of the parameters generated by `fit` to determine what distribution to use and what are the units of the quantiles. Here we need to pass the quantile (values between 0 and 1), which for exceedance probabilities is just :math`1 - 1/T`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "parametric_quantile(params, q=1 - 1.0 / 20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a convenience utility, the two last steps (`fit` and `parametric_quantile`) are bundled into the `fa` function, which takes care of converting the return period into a quantile value, and renames the `quantile` output dimension to `return_period`. This dimension renaming is done to avoid name clashes with the `quantile` method. Indeed, it's often necessary when analysing large ensembles, or probabilistic samples, to compute the quantiles of the quantiles, which will cause `xarray` to raise an error. The `mode` argument specifies whether we are working with maxima (max) or minima (min). This is important because a 100-year return period value for minima corresponds to a 0.01 quantile, while a 100-year return period value for maxima corresponds to a 0.99 quantile."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fa(sub, t=20, dist=\"genextreme\", mode=\"max\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Handling missing values\n",
    "\n",
    "When working with observations from weather stations, there are often stretches of days without measurements due to equipment malfunction. Practitioners usually do not want to ignore entire years of data due to a few missing days, so one option is to record annual maxima only if there are no more than a given percentage of missing values, say 5%. These kinds of filters can easily be applied using `xclim`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set the first half of the first year as missing.\n",
    "pr[:200] = np.nan\n",
    "\n",
    "# Compute vector returning which years should be considered missing.\n",
    "null = missing_pct(pr, tolerance=0.05, freq=\"YS\", month=[5, 6, 7, 8, 9, 10])\n",
    "\n",
    "# Compute stats on masked values\n",
    "fa(sub.where(~null), t=20, dist=\"genextreme\", mode=\"high\")"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}