Ensemble-Reduction Techniques

xclim.ensembles provides tools for reducing the number of candidate realizations in an ensemble while keeping a reasonable and representative spread of outcomes. By selecting members strategically, we can greatly reduce the number of realizations to examine while maintaining the statistical representation of the original dataset. This is particularly useful when computing power or time is a limiting factor.

For more information on the principles and methods behind ensemble reduction techniques, see: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0152495 and https://doi.org/10.1175/JCLI-D-14-00636.1

Selection Criteria
The following example considers a 50-member ensemble with a total of 6 criteria (3 variable deltas * 2 time horizons). Our goal is to reduce this ensemble to a more manageable size while preserving the range of uncertainty across the different criteria.
:
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr
from xclim import ensembles

# Using an xarray dataset of our criteria
ds_crit
:
<xarray.Dataset>
Dimensions:             (horizon: 2, realization: 50)
Coordinates:
* horizon             (horizon) <U9 '2041-2070' '2071-2100'
Dimensions without coordinates: realization
Data variables:
delta_annual_tavg   (horizon, realization) float64 5.646 3.6 ... 5.594 6.144
delta_annual_prtot  (horizon, realization) float64 14.42 -1.739 ... 20.69
delta_JJA_prtot     (horizon, realization) float64 -1.108 -0.7181 ... 3.48
:
plt.style.use("seaborn-dark")
plt.rcParams["figure.figsize"] = (13, 5)
fig = plt.figure(figsize=(11, 9))
ax = plt.axes(projection="3d")

for h in ds_crit.horizon:
    ax.scatter(
        ds_crit.sel(horizon=h).delta_annual_tavg,
        ds_crit.sel(horizon=h).delta_annual_prtot,
        ds_crit.sel(horizon=h).delta_JJA_prtot,
        label=f"delta {h.values}",
    )

ax.set_xlabel("delta_annual_tavg (C)")
ax.set_ylabel("delta_annual_prtot (%)")
ax.set_zlabel("delta_JJA_prtot (%)")
plt.legend()
plt.show()

Ensemble reduction techniques in xclim require a 2D array with dimensions of criteria (values) and realization (runs/simulations).

:
# Create 2d xr.DataArray containing criteria values
crit = None
for h in ds_crit.horizon:
    for v in ds_crit.data_vars:
        if crit is None:
            crit = ds_crit[v].sel(horizon=h)
        else:
            crit = xr.concat((crit, ds_crit[v].sel(horizon=h)), dim="criteria")
crit.name = "criteria"
crit.shape
:
(6, 50)
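As an aside, the concatenation loop above can also be written as a single vectorized step with xarray's `to_array` and `stack`. The sketch below uses synthetic data in place of `ds_crit`; listing `horizon` first in `stack` reproduces the loop's horizon-major criteria ordering.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for ds_crit: 3 delta variables x 2 horizons x 50 realizations
rng = np.random.default_rng(0)
ds = xr.Dataset(
    {
        v: (("horizon", "realization"), rng.normal(size=(2, 50)))
        for v in ("delta_annual_tavg", "delta_annual_prtot", "delta_JJA_prtot")
    },
    coords={"horizon": ["2041-2070", "2071-2100"]},
)

# Stack (horizon, variable) pairs into a single "criteria" dimension
crit = (
    ds.to_array("variable")  # dims: (variable, horizon, realization)
    .stack(criteria=("horizon", "variable"))  # dims: (realization, criteria)
    .transpose("criteria", "realization")
)
crit.name = "criteria"
print(crit.shape)  # (6, 50)
```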

K-Means reduce ensemble

The kmeans_reduce_ensemble function works by grouping realizations into subgroups based on the provided criteria and retaining one representative realization per subgroup.

For a real-world example of the K-means clustering algorithm applied to climate data selection, see: https://doi.org/10.1371/journal.pone.0152495 and https://doi.org/10.1175/JCLI-D-11-00440.1

The following example uses method = dict(n_clusters=25) to reduce the original 50 realizations to 25. The function returns the ids (integer indices) of the selected realizations, which can then be used to select the data from the original ensemble.
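Conceptually, the grouping-and-representative step resembles the sketch below, which clusters a synthetic criteria matrix with scikit-learn and keeps the member nearest each cluster centroid. This illustrates the principle only, not xclim's exact implementation (which also handles standardization and other options).

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic criteria matrix: 50 realizations (rows) x 6 criteria (columns)
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 6))

n_clusters = 10
km = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(X)

# For each cluster, keep the realization closest to the centroid
ids = []
for c in range(n_clusters):
    members = np.flatnonzero(km.labels_ == c)
    dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
    ids.append(int(members[np.argmin(dists)]))
ids = sorted(ids)
print(len(ids))  # 10 representative realizations
```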

:
ids, cluster, fig_data = ensembles.kmeans_reduce_ensemble(
    data=crit, method={"n_clusters": 25}, random_state=42, make_graph=True
)
ds_crit.isel(realization=ids)
:
<xarray.Dataset>
Dimensions:             (horizon: 2, realization: 25)
Coordinates:
* horizon             (horizon) <U9 '2041-2070' '2071-2100'
Dimensions without coordinates: realization
Data variables:
delta_annual_tavg   (horizon, realization) float64 5.646 4.468 ... 5.594
delta_annual_prtot  (horizon, realization) float64 14.42 -1.352 ... 27.31
delta_JJA_prtot     (horizon, realization) float64 -1.108 3.299 ... 0.4022

With this reduced number, we can now compare the distribution of the selection versus the original ensemble of simulations.

:
plt.style.use("seaborn-dark")
fig = plt.figure(figsize=(11, 9))
ax = plt.axes(projection="3d")

for h in ds_crit.horizon:
    ax.scatter(
        ds_crit.sel(horizon=h, realization=ids).delta_annual_tavg,
        ds_crit.sel(horizon=h, realization=ids).delta_annual_prtot,
        ds_crit.sel(horizon=h, realization=ids).delta_JJA_prtot,
        label=f"delta {h.values} - selected",
    )

ax.set_xlabel("delta_annual_tavg (C)")
ax.set_ylabel("delta_annual_prtot (%)")
ax.set_zlabel("delta_JJA_prtot (%)")
plt.legend()
plt.show()

The function optionally produces a data dictionary for plotting the associated R² profile.

The ensembles.plot_rsqprofile function plots the proportion of total variance in the climate realizations that is covered by the selection.

In this case, ~88% of the total variance in the original ensemble is covered by the selection.
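The R² measure here can be understood as the fraction of total ensemble variance explained by the cluster centroids. A sketch of how such a profile could be computed on synthetic data, using scikit-learn's inertia_ (the within-cluster sum of squares), follows; it illustrates the idea, not xclim's exact computation.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 6))  # 50 realizations x 6 criteria

# Total sum of squares around the ensemble mean
sst = float(((X - X.mean(axis=0)) ** 2).sum())

# R^2 = 1 - within-cluster SS / total SS, for increasing cluster counts
rsq = {}
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    rsq[k] = 1.0 - km.inertia_ / sst

print(round(rsq[10], 3))  # variance fraction covered by 10 clusters
```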

:
ensembles.plot_rsqprofile(fig_data)

Alternatively, we can use method = {'rsq_cutoff': float} or method = {'rsq_optimize': None}. For example, with rsq_cutoff we instead find the number of realizations needed to reach the provided R² value.

:
ids1, cluster1, fig_data1 = ensembles.kmeans_reduce_ensemble(
data=crit, method={"rsq_cutoff": 0.75}, random_state=42, make_graph=True
)
ensembles.plot_rsqprofile(fig_data1)
ds_crit.isel(realization=ids1)
:
<xarray.Dataset>
Dimensions:             (horizon: 2, realization: 17)
Coordinates:
* horizon             (horizon) <U9 '2041-2070' '2071-2100'
Dimensions without coordinates: realization
Data variables:
delta_annual_tavg   (horizon, realization) float64 5.646 4.468 ... 6.144
delta_annual_prtot  (horizon, realization) float64 14.42 -1.352 ... 20.69
delta_JJA_prtot     (horizon, realization) float64 -1.108 3.299 ... 3.48

KKZ reduce ensemble

xclim also provides a similar ensemble reduction algorithm, ensembles.kkz_reduce_ensemble. See: https://doi.org/10.1175/JCLI-D-14-00636.1

The advantage of this algorithm is that fewer realizations are typically needed to reach the same level of representativeness as with the K-means clustering algorithm, since the KKZ method tends to identify members that fall towards the extremes of the criteria values.

This technique also produces nested selections: increasing the desired selection size does not alter the previous choices, which is not the case for the K-means algorithm.
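The greedy max-min logic behind a KKZ-style selection, and its nesting property, can be sketched as follows. This is a simplified illustration on synthetic data, seeding with the member closest to the ensemble mean; it is not xclim's exact implementation.

```python
import numpy as np


def kkz_style_select(X, n):
    """Greedy max-min selection: seed with the member closest to the
    ensemble mean, then repeatedly add the member whose minimum distance
    to the current selection is largest."""
    sel = [int(np.argmin(np.linalg.norm(X - X.mean(axis=0), axis=1)))]
    while len(sel) < n:
        # Distance of every member to each selected member: (n_members, len(sel))
        d = np.linalg.norm(X[:, None, :] - X[sel][None, :, :], axis=-1)
        min_d = d.min(axis=1)
        min_d[sel] = -np.inf  # never reselect a member
        sel.append(int(np.argmax(min_d)))
    return sel


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))  # 50 realizations x 6 criteria

s5 = kkz_style_select(X, 5)
s8 = kkz_style_select(X, 8)
print(s8[:5] == s5)  # True: the selection is nested by construction
```

Because the procedure is deterministic and only appends members, any larger selection begins with the smaller one, which is exactly the nesting behaviour described above.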

:
ids = ensembles.kkz_reduce_ensemble(crit, num_select=10)
ds_crit.isel(realization=ids)
:
<xarray.Dataset>
Dimensions:             (horizon: 2, realization: 10)
Coordinates:
* horizon             (horizon) <U9 '2041-2070' '2071-2100'
Dimensions without coordinates: realization
Data variables:
delta_annual_tavg   (horizon, realization) float64 1.719 6.405 ... 7.449
delta_annual_prtot  (horizon, realization) float64 9.611 1.527 ... 22.34
delta_JJA_prtot     (horizon, realization) float64 -0.1268 -4.622 ... 7.207
:
plt.style.use("seaborn-dark")
fig = plt.figure(figsize=(9, 9))
ax = plt.axes(projection="3d")

for h in ds_crit.horizon:
    ax.scatter(
        ds_crit.sel(horizon=h, realization=ids).delta_annual_tavg,
        ds_crit.sel(horizon=h, realization=ids).delta_annual_prtot,
        ds_crit.sel(horizon=h, realization=ids).delta_JJA_prtot,
        label=f"delta {h.values} - selected",
    )

ax.set_xlabel("delta_annual_tavg (C)")
ax.set_ylabel("delta_annual_prtot (%)")
ax.set_zlabel("delta_JJA_prtot (%)")
plt.legend()
plt.show()

KKZ algorithm vs K-Means algorithm

To give a better sense of the differences between nested (KKZ) and unnested (K-means) results, we can progressively identify the members each algorithm would choose in an iterative fashion.

:
## NESTED results using KKZ
for n in np.arange(5, 14, 2):
    ids = ensembles.kkz_reduce_ensemble(crit, num_select=n)
    print(ids)
[19, 24, 33, 3, 21]
[19, 24, 33, 3, 21, 18, 35]
[19, 24, 33, 3, 21, 18, 35, 48, 40]
[19, 24, 33, 3, 21, 18, 35, 48, 40, 39, 29]
[19, 24, 33, 3, 21, 18, 35, 48, 40, 39, 29, 11, 2]
:
## UNNESTED results using k-means
for n in np.arange(5, 14, 2):
    ids, cluster, fig_data = ensembles.kmeans_reduce_ensemble(
        crit, method={"n_clusters": n}, random_state=42, make_graph=True
    )
    print(ids)
[7, 12, 27, 35, 45]
[12, 14, 26, 27, 35, 45, 49]
[12, 14, 19, 27, 32, 35, 38, 46, 49]
[0, 10, 12, 14, 19, 32, 35, 38, 39, 45, 49]
[2, 12, 14, 16, 17, 19, 22, 27, 38, 39, 40, 45, 49]

While the nested property of the KKZ results is typically advantageous, it can sometimes result in unbalanced coverage of the original ensemble. In general, careful consideration and validation of the selection results is recommended when n is small, regardless of the technique used.

A simple example using only 2 of our criteria illustrates the differences between the two techniques:

The KKZ algorithm iteratively maximizes the distance from previously selected candidates, potentially resulting in 'off-center' results relative to the original ensemble.

The K-means algorithm re-divides the data space with each iteration, producing results that are consistently centered on the original ensemble but lack coverage of the extremes.

:
df = crit.isel(criteria=[0, 1])

# Use standardized data in the plot so that selection distances are better visualized
df = (df - df.mean("realization")) / df.std("realization")

plt.figure(figsize=(18, 3))
for n in np.arange(1, 6):
    plt.subplot(1, 5, n, aspect="equal")
    plt.scatter(df.isel(criteria=0), df.isel(criteria=1))
    ids_KKZ = ensembles.kkz_reduce_ensemble(crit.isel(criteria=[0, 1]), num_select=n)
    plt.scatter(
        df.isel(criteria=0, realization=ids_KKZ),
        df.isel(criteria=1, realization=ids_KKZ),
        s=100,
    )
    plt.title(f"KKZ={n}")
    if n == 1:
        plt.ylabel("standardized delta_annual_prtot")
    if n == 3:
        plt.xlabel("standardized delta_annual_tavg")
plt.suptitle("KKZ selection results")

plt.figure(figsize=(18, 3))
for n in np.arange(1, 6):
    plt.subplot(1, 5, n, aspect="equal")
    plt.scatter(df.isel(criteria=0), df.isel(criteria=1))
    ids_Kmeans, c, figdata = ensembles.kmeans_reduce_ensemble(
        crit.isel(criteria=[0, 1]),
        method={"n_clusters": n},
        random_state=42,
        make_graph=True,
    )
    plt.scatter(
        df.isel(criteria=0, realization=ids_Kmeans),
        df.isel(criteria=1, realization=ids_Kmeans),
        s=100,
    )
    plt.title(f"Kmeans={n}")
    if n == 1:
        plt.ylabel("standardized delta_annual_prtot")
    if n == 3:
        plt.xlabel("standardized delta_annual_tavg")
plt.suptitle("K-means selection results")
plt.show()