{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# This cell is not visible when the documentation is built.\n",
    "\n",
    "from __future__ import annotations\n",
    "\n",
    "import numpy as np\n",
    "import xarray as xr\n",
    "\n",
    "# Create 'toy' criteria selection data\n",
    "np.random.normal(loc=3.5, scale=1.5, size=50)\n",
    "# crit['delta_annual_tavg']\n",
    "np.random.seed(0)\n",
    "test = xr.DataArray(np.random.normal(loc=3, scale=1.5, size=50), dims=[\"realization\"]).assign_coords(\n",
    "    horizon=\"2041-2070\"\n",
    ")\n",
    "test = xr.concat(\n",
    "    (\n",
    "        test,\n",
    "        xr.DataArray(np.random.normal(loc=5.34, scale=2, size=50), dims=[\"realization\"]).assign_coords(\n",
    "            horizon=\"2071-2100\"\n",
    "        ),\n",
    "    ),\n",
    "    dim=\"horizon\",\n",
    ")\n",
    "\n",
    "ds_crit = xr.Dataset()\n",
    "\n",
    "ds_crit[\"delta_annual_tavg\"] = test\n",
    "test = xr.DataArray(np.random.normal(loc=5, scale=5, size=50), dims=[\"realization\"]).assign_coords(horizon=\"2041-2070\")\n",
    "test = xr.concat(\n",
    "    (\n",
    "        test,\n",
    "        xr.DataArray(np.random.normal(loc=10, scale=8, size=50), dims=[\"realization\"]).assign_coords(\n",
    "            horizon=\"2071-2100\"\n",
    "        ),\n",
    "    ),\n",
    "    dim=\"horizon\",\n",
    ")\n",
    "ds_crit[\"delta_annual_prtot\"] = test\n",
    "test = xr.DataArray(np.random.normal(loc=0, scale=3, size=50), dims=[\"realization\"]).assign_coords(horizon=\"2041-2070\")\n",
    "test = xr.concat(\n",
    "    (\n",
    "        test,\n",
    "        xr.DataArray(np.random.normal(loc=2, scale=4, size=50), dims=[\"realization\"]).assign_coords(\n",
    "            horizon=\"2071-2100\"\n",
    "        ),\n",
    "    ),\n",
    "    dim=\"horizon\",\n",
    ")\n",
    "ds_crit[\"delta_JJA_prtot\"] = test"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ensemble-Reduction Techniques\n",
    "=============================\n",
    "\n",
    "`xclim.ensembles` provides means of reducing the number of candidates in a sample to get a reasonable and representative spread of outcomes using a reduced number of candidates. By reducing the number of realizations in a strategic manner, we can significantly reduce the number of realizations to examine, while maintaining a statistical representation of the original dataset. This is particularly useful when computation power or time is a factor.\n",
    "\n",
    "For more information on the principles and methods behind ensemble reduction techniques, see: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0152495 and https://doi.org/10.1175/JCLI-D-14-00636.1\n",
    "\n",
    "**Selection Criteria**\\\n",
    "The following example considers a 50-member ensemble with a total of 6 criteria considered (3 variable deltas * 2 time horizons). Our goal here is to reduce this number to a more manageable size while preserving the range of uncertainty across our different criteria."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "\n",
    "from xclim import ensembles\n",
    "\n",
    "# Using an xarray dataset of our criteria\n",
    "ds_crit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.style.use(\"seaborn-v0_8-dark\")\n",
    "plt.rcParams[\"figure.figsize\"] = (13, 5)\n",
    "fig = plt.figure(figsize=(11, 9))\n",
    "ax = plt.axes(projection=\"3d\")\n",
    "\n",
    "for h in ds_crit.horizon:\n",
    "    ax.scatter(\n",
    "        ds_crit.sel(horizon=h).delta_annual_tavg,\n",
    "        ds_crit.sel(horizon=h).delta_annual_prtot,\n",
    "        ds_crit.sel(horizon=h).delta_JJA_prtot,\n",
    "        label=f\"delta {h.values}\",\n",
    "    )\n",
    "\n",
    "ax.set_xlabel(\"delta_annual_tavg (C)\")\n",
    "ax.set_ylabel(\"delta_annual_prtot (%)\")\n",
    "ax.set_zlabel(\"delta_JJA_prtot (%)\")\n",
    "plt.legend()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ensemble reduction techniques in `xclim` require a 2D array with dimensions of `criteria` (values) and `realization` (runs/simulations). Hopefully, xclim has a function for this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "crit = ensembles.make_criteria(ds_crit)\n",
    "crit"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### **K-Means reduce ensemble**\n",
    "\n",
    "The `kmeans_reduce_ensemble` works by grouping realizations into subgroups based on the provided criteria and retaining a representative `realization` per subgroup.\n",
    "\n",
    "For a real-world example of the K-means clustering algorithm applied to climate data selection, see: https://doi.org/10.1371/journal.pone.0152495 and https://doi.org/10.1175/JCLI-D-11-00440.1\n",
    "\n",
    "The following example uses `method = dict(n_clusters=25)` in order to take the original `50` realizations and reduce them down to `25`. The function itself returns the `ids` (indexes: `int`) of the realizations, which can then be used to select the data from the original ensemble."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ids, cluster, fig_data = ensembles.kmeans_reduce_ensemble(\n",
    "    data=crit, method={\"n_clusters\": 25}, random_state=42, make_graph=True\n",
    ")\n",
    "ds_crit.isel(realization=ids)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With this reduced number, we can now compare the distribution of the selection versus the original ensemble of simulations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.style.use(\"seaborn-v0_8-dark\")\n",
    "fig = plt.figure(figsize=(11, 9))\n",
    "ax = plt.axes(projection=\"3d\")\n",
    "\n",
    "for h in ds_crit.horizon:\n",
    "    ax.scatter(\n",
    "        ds_crit.sel(horizon=h, realization=ids).delta_annual_tavg,\n",
    "        ds_crit.sel(horizon=h, realization=ids).delta_annual_prtot,\n",
    "        ds_crit.sel(horizon=h, realization=ids).delta_JJA_prtot,\n",
    "        label=f\"delta {h.values} - selected\",\n",
    "    )\n",
    "\n",
    "ax.set_xlabel(\"delta_annual_tavg (C)\")\n",
    "ax.set_ylabel(\"delta_annual_prtot (%)\")\n",
    "ax.set_zlabel(\"delta_JJA_prtot (%)\")\n",
    "plt.legend()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The function optionally produces a data dictionary for figure production of the associated R² profile.\n",
    "\n",
    "The function ``ensembles.plot_rsqprofile`` provides plotting for evaluating the proportion of total variance in climate realizations that is covered by the selection.\n",
    "\n",
    "In this case ~88% of the total variance in original ensemble is covered by the selection."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ensembles.plot_rsqprofile(fig_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Alternatively, we can use `method = {'rsq_cutoff':Float}` or `method = {'rsq_optimize':None}`\n",
    "* For example, with `rsq_cutoff` we instead find the number of realizations needed to cover the provided $R^{2}$ value"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ids1, cluster1, fig_data1 = ensembles.kmeans_reduce_ensemble(\n",
    "    data=crit, method={\"rsq_cutoff\": 0.75}, random_state=42, make_graph=True\n",
    ")\n",
    "ensembles.plot_rsqprofile(fig_data1)\n",
    "ds_crit.isel(realization=ids1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### **KKZ reduce ensemble**\n",
    "\n",
    "`xclim` also makes available a similar ensemble reduction algorithm, `ensembles.kkz_reduce_ensemble`. See: https://doi.org/10.1175/JCLI-D-14-00636.1\n",
    "\n",
    "The advantage of this algorithm is largely that fewer realizations are needed in order to reach the same level of representative members than the K-means clustering algorithm, as the KKZ methods tends towards identifying members that fall towards the extremes of criteria values.\n",
    "\n",
    "This technique also produces nested selection results, where an additional increase in desired selection size does not alter the previous choices, which is not the case for the K-means algorithm."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ids = ensembles.kkz_reduce_ensemble(crit, num_select=10)\n",
    "ds_crit.isel(realization=ids)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.style.use(\"seaborn-v0_8-dark\")\n",
    "fig = plt.figure(figsize=(9, 9))\n",
    "ax = plt.axes(projection=\"3d\")\n",
    "\n",
    "for h in ds_crit.horizon:\n",
    "    ax.scatter(\n",
    "        ds_crit.sel(horizon=h, realization=ids).delta_annual_tavg,\n",
    "        ds_crit.sel(horizon=h, realization=ids).delta_annual_prtot,\n",
    "        ds_crit.sel(horizon=h, realization=ids).delta_JJA_prtot,\n",
    "        label=f\"delta {h.values} - selected\",\n",
    "    )\n",
    "\n",
    "ax.set_xlabel(\"delta_annual_tavg (C)\")\n",
    "ax.set_ylabel(\"delta_annual_prtot (%)\")\n",
    "ax.set_zlabel(\"delta_JJA_prtot (%)\")\n",
    "plt.legend()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### **KKZ algorithm vs K-Means algorithm**\n",
    "\n",
    "To give a better sense of the differences between **Nested (KKZ)** and **Unnested (K-Means)** results, we can progressively identify members that would be chosen by each algorithm through an iterative fashion.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# NESTED results using KKZ\n",
    "for n in np.arange(5, 15, 3):\n",
    "    ids = ensembles.kkz_reduce_ensemble(crit, num_select=n)\n",
    "    print(ids)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# UNNESTED results using k-means\n",
    "for n in np.arange(5, 15, 3):\n",
    "    ids, cluster, fig_data = ensembles.kmeans_reduce_ensemble(\n",
    "        crit, method={\"n_clusters\": n}, random_state=42, make_graph=True\n",
    "    )\n",
    "    print(ids)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "While the **Nested** feature of the KKZ results is typically advantageous, it can sometimes result in unbalanced coverage of the original ensemble. **In general, careful consideration and validation of selection results is suggested when `n` is small, regardless of the technique used.**\n",
    "\n",
    "To illustrate, a simple example using only 2 of our criteria shows differences in results between the two techniques:\n",
    "\n",
    "The **KKZ** algorithm iteratively maximizes distance from previous selected candidates - potentially resulting in 'off-center' results versus the original ensemble\n",
    "\n",
    "The **K-means** algorithm will redivide the data space with each iteration, producing results that are consistently centered on the original ensemble but lacking coverage in the extremes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = crit.isel(criteria=[0, 1])\n",
    "\n",
    "# Use standardized data in the plot so that selection distances is better visualized\n",
    "df = (df - df.mean(\"realization\")) / df.std(\"realization\")\n",
    "\n",
    "plt.figure(figsize=(18, 3))\n",
    "for n in np.arange(1, 6):\n",
    "    plt.subplot(1, 5, n, aspect=\"equal\")\n",
    "    plt.scatter(df.isel(criteria=0), df.isel(criteria=1))\n",
    "    ids_KKZ = ensembles.kkz_reduce_ensemble(crit.isel(criteria=[0, 1]), num_select=n)\n",
    "    plt.scatter(\n",
    "        df.isel(criteria=0, realization=ids_KKZ),\n",
    "        df.isel(criteria=1, realization=ids_KKZ),\n",
    "        s=100,\n",
    "    )\n",
    "    plt.title(f\"KKZ={n}\")\n",
    "    if n == 1:\n",
    "        plt.ylabel(\"standardized delta_annual_prtot\")\n",
    "    if n == 3:\n",
    "        plt.xlabel(\"standardized delta_annual_tavg\")\n",
    "plt.suptitle(\"KKZ selection results\")\n",
    "\n",
    "plt.figure(figsize=(18, 3))\n",
    "for n in np.arange(1, 6):\n",
    "    plt.subplot(1, 5, n, aspect=\"equal\")\n",
    "    plt.scatter(df.isel(criteria=0), df.isel(criteria=1))\n",
    "    ids_Kmeans, c, figdata = ensembles.kmeans_reduce_ensemble(\n",
    "        crit.isel(criteria=[0, 1]),\n",
    "        method={\"n_clusters\": n},\n",
    "        random_state=42,\n",
    "        make_graph=True,\n",
    "    )\n",
    "    plt.scatter(\n",
    "        df.isel(criteria=0, realization=ids_Kmeans),\n",
    "        df.isel(criteria=1, realization=ids_Kmeans),\n",
    "        s=100,\n",
    "    )\n",
    "    plt.title(f\"Kmeans={n}\")\n",
    "    if n == 1:\n",
    "        plt.ylabel(\"standardized delta_annual_prtot\")\n",
    "    if n == 3:\n",
    "        plt.xlabel(\"standardized delta_annual_tavg\")\n",
    "plt.suptitle(\"K-means selection results\")\n",
    "plt.show()"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Edit Metadata",
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.10"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}