
Ensemble-Reduction Techniques

xclim.ensembles provides means of selecting a subset of ensemble members that still offers a reasonable and representative spread of outcomes. By reducing the number of realizations in a strategic manner, we can significantly cut down the amount of data to examine while maintaining a statistical representation of the original ensemble. This is particularly useful when computational power or time is a limiting factor.

For more information on the principles and methods behind ensemble reduction techniques, see: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0152495 and https://doi.org/10.1175/JCLI-D-14-00636.1

Selection Criteria
The following example considers a 50-member ensemble with a total of 6 criteria (3 variable deltas × 2 time horizons). Our goal is to reduce this ensemble to a more manageable size while preserving the range of uncertainty across the different criteria.
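
The example assumes an xarray dataset, ds_crit, holding the criteria. Since its construction is not shown here, a minimal synthetic stand-in could look like the following sketch; the horizon labels and delta distributions are made-up placeholders, not real model output:

[ ]:
import numpy as np
import xarray as xr

rng = np.random.default_rng(42)
horizons = ["2041-2070", "2071-2100"]  # hypothetical time horizons

# 50 realizations x 2 horizons for each of the 3 delta criteria
ds_crit = xr.Dataset(
    {
        name: xr.DataArray(
            rng.normal(loc, scale, size=(50, 2)),
            dims=("realization", "horizon"),
            coords={"horizon": horizons},
        )
        for name, loc, scale in [
            ("delta_annual_tavg", 2.0, 1.0),  # deg C
            ("delta_annual_prtot", 5.0, 10.0),  # %
            ("delta_JJA_prtot", 0.0, 15.0),  # %
        ]
    }
)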
[ ]:
import matplotlib.pyplot as plt
import numpy as np

from xclim import ensembles

# Using an xarray dataset of our criteria
ds_crit
[ ]:
plt.style.use("seaborn-v0_8-dark")
plt.rcParams["figure.figsize"] = (13, 5)
fig = plt.figure(figsize=(11, 9))
ax = plt.axes(projection="3d")

for h in ds_crit.horizon:
    ax.scatter(
        ds_crit.sel(horizon=h).delta_annual_tavg,
        ds_crit.sel(horizon=h).delta_annual_prtot,
        ds_crit.sel(horizon=h).delta_JJA_prtot,
        label=f"delta {h.values}",
    )

ax.set_xlabel("delta_annual_tavg (C)")
ax.set_ylabel("delta_annual_prtot (%)")
ax.set_zlabel("delta_JJA_prtot (%)")
plt.legend()
plt.show()

Ensemble reduction techniques in xclim require a 2D array with dimensions of criteria (values) and realization (runs/simulations). Helpfully, xclim provides a function for exactly this reshaping.

[ ]:
crit = ensembles.make_criteria(ds_crit)
crit
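
make_criteria stacks each variable and every non-realization dimension into a single criteria dimension. A quick sanity check (assuming the synthetic stand-in dataset above) confirms the expected size:

[ ]:
# Expect a 'realization' and a 'criteria' dimension,
# with 6 criteria = 3 variables x 2 horizons for 50 realizations.
print(crit.dims, crit.shape)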

K-Means reduce ensemble

The kmeans_reduce_ensemble function works by grouping realizations into subgroups, based on the provided criteria, and retaining one representative realization per subgroup.
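
Conceptually, the selection can be sketched as follows. This is a simplified illustration using scikit-learn, not xclim's actual implementation; real use should go through ensembles.kmeans_reduce_ensemble:

[ ]:
import numpy as np
from sklearn.cluster import KMeans


def kmeans_select(data, n_clusters, random_state=42):
    # Illustrative sketch only: cluster the (realization x criteria) matrix,
    # then keep the realization closest to each cluster centre.
    data = np.asarray(data, dtype=float)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(data)
    ids = []
    for k in range(n_clusters):
        members = np.flatnonzero(km.labels_ == k)
        dists = np.linalg.norm(data[members] - km.cluster_centers_[k], axis=1)
        ids.append(int(members[dists.argmin()]))
    return sorted(ids)


kmeans_select(crit, n_clusters=5)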

For a real-world example of the K-means clustering algorithm applied to climate data selection, see: https://doi.org/10.1371/journal.pone.0152495 and https://doi.org/10.1175/JCLI-D-11-00440.1

The following example uses method={"n_clusters": 25} to reduce the original 50 realizations down to 25. The function returns the ids (integer indexes) of the selected realizations, which can then be used to select the data from the original ensemble.

[ ]:
ids, cluster, fig_data = ensembles.kmeans_reduce_ensemble(
    data=crit, method={"n_clusters": 25}, random_state=42, make_graph=True
)
ds_crit.isel(realization=ids)

With this reduced number, we can now compare the distribution of the selection versus the original ensemble of simulations.

[ ]:
plt.style.use("seaborn-v0_8-dark")
fig = plt.figure(figsize=(11, 9))
ax = plt.axes(projection="3d")

for h in ds_crit.horizon:
    ax.scatter(
        ds_crit.sel(horizon=h, realization=ids).delta_annual_tavg,
        ds_crit.sel(horizon=h, realization=ids).delta_annual_prtot,
        ds_crit.sel(horizon=h, realization=ids).delta_JJA_prtot,
        label=f"delta {h.values} - selected",
    )

ax.set_xlabel("delta_annual_tavg (C)")
ax.set_ylabel("delta_annual_prtot (%)")
ax.set_zlabel("delta_JJA_prtot (%)")
plt.legend()
plt.show()

When make_graph=True, the function also produces a data dictionary (fig_data above) for plotting the associated R² profile.

The function ensembles.plot_rsqprofile uses this dictionary to show the proportion of the original ensemble's total variance that is captured by the selection.

In this case, ~88% of the total variance of the original ensemble is covered by the 25-member selection.

[ ]:
ensembles.plot_rsqprofile(fig_data)

Alternatively, we can use method={"rsq_cutoff": float} or method={"rsq_optimize": None}. For example, with rsq_cutoff, we instead find the number of realizations needed to cover the provided R² value.

[ ]:
ids1, cluster1, fig_data1 = ensembles.kmeans_reduce_ensemble(
    data=crit, method={"rsq_cutoff": 0.75}, random_state=42, make_graph=True
)
ensembles.plot_rsqprofile(fig_data1)
ds_crit.isel(realization=ids1)

KKZ reduce ensemble

xclim also makes available a similar ensemble reduction algorithm, ensembles.kkz_reduce_ensemble. See: https://doi.org/10.1175/JCLI-D-14-00636.1

The main advantage of this algorithm is that fewer realizations are typically needed to reach the same level of representativeness as with the K-means clustering algorithm, since the KKZ method tends to identify members that fall towards the extremes of the criteria values.

This technique also produces nested selection results: increasing the desired selection size does not alter the previous choices, which is not the case for the K-means algorithm.
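
Both behaviours follow from the greedy construction. As a rough sketch (a common max-min formulation, assuming a (realization, criteria) matrix; not necessarily xclim's exact implementation), the first pick lies near the ensemble mean and each later pick maximizes the distance to everything already selected, so earlier picks are never revisited:

[ ]:
import numpy as np


def kkz_select(data, num_select):
    # Illustrative sketch only; use ensembles.kkz_reduce_ensemble for real work.
    data = np.asarray(data, dtype=float)
    # First pick: the member closest to the ensemble mean.
    ids = [int(np.linalg.norm(data - data.mean(axis=0), axis=1).argmin())]
    # Later picks: the member farthest from all already-selected members,
    # which favours the extremes of the criteria space.
    while len(ids) < num_select:
        dists = np.linalg.norm(data[:, None, :] - data[ids][None, :, :], axis=-1)
        ids.append(int(dists.min(axis=1).argmax()))
    return ids


kkz_select(crit, num_select=5)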

[ ]:
ids = ensembles.kkz_reduce_ensemble(crit, num_select=10)
ds_crit.isel(realization=ids)
[ ]:
plt.style.use("seaborn-v0_8-dark")
fig = plt.figure(figsize=(9, 9))
ax = plt.axes(projection="3d")

for h in ds_crit.horizon:
    ax.scatter(
        ds_crit.sel(horizon=h, realization=ids).delta_annual_tavg,
        ds_crit.sel(horizon=h, realization=ids).delta_annual_prtot,
        ds_crit.sel(horizon=h, realization=ids).delta_JJA_prtot,
        label=f"delta {h.values} - selected",
    )

ax.set_xlabel("delta_annual_tavg (C)")
ax.set_ylabel("delta_annual_prtot (%)")
ax.set_zlabel("delta_JJA_prtot (%)")
plt.legend()
plt.show()

KKZ algorithm vs K-Means algorithm

To give a better sense of the differences between nested (KKZ) and un-nested (K-means) results, we can progressively identify the members each algorithm would choose as the selection size grows.

[ ]:
## NESTED results using KKZ
for n in np.arange(5, 15, 3):
    ids = ensembles.kkz_reduce_ensemble(crit, num_select=n)
    print(ids)
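
The nesting can also be verified directly: the first members of a larger KKZ selection are exactly the smaller selection (a quick check, reusing crit from above):

[ ]:
ids5 = ensembles.kkz_reduce_ensemble(crit, num_select=5)
ids8 = ensembles.kkz_reduce_ensemble(crit, num_select=8)
# The first 5 picks are unchanged when asking for more members.
assert list(ids8)[:5] == list(ids5)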
[ ]:
## UNNESTED results using k-means
for n in np.arange(5, 15, 3):
    ids, cluster, fig_data = ensembles.kmeans_reduce_ensemble(
        crit, method={"n_clusters": n}, random_state=42, make_graph=True
    )
    print(ids)

While the nested nature of the KKZ results is typically advantageous, it can sometimes result in unbalanced coverage of the original ensemble. In general, careful consideration and validation of the selection results is suggested when n is small, regardless of the technique used.

To illustrate, a simple example using only 2 of our criteria shows the differences between the two techniques:

* The KKZ algorithm iteratively maximizes the distance from the previously selected candidates, potentially resulting in "off-center" selections relative to the original ensemble.

* The K-means algorithm re-divides the data space at each iteration, producing selections that are consistently centered on the original ensemble but lack coverage of the extremes.

[ ]:
df = crit.isel(criteria=[0, 1])

# Standardize the data for plotting so that selection distances are easier to visualize
df = (df - df.mean("realization")) / df.std("realization")

plt.figure(figsize=(18, 3))
for n in np.arange(1, 6):
    plt.subplot(1, 5, n, aspect="equal")
    plt.scatter(df.isel(criteria=0), df.isel(criteria=1))
    ids_KKZ = ensembles.kkz_reduce_ensemble(crit.isel(criteria=[0, 1]), num_select=n)
    plt.scatter(
        df.isel(criteria=0, realization=ids_KKZ),
        df.isel(criteria=1, realization=ids_KKZ),
        s=100,
    )
    plt.title(f"KKZ={n}")
    if n == 1:
        plt.ylabel("standardized delta_annual_prtot")
    if n == 3:
        plt.xlabel("standardized delta_annual_tavg")
plt.suptitle("KKZ selection results")

plt.figure(figsize=(18, 3))
for n in np.arange(1, 6):
    plt.subplot(1, 5, n, aspect="equal")
    plt.scatter(df.isel(criteria=0), df.isel(criteria=1))
    ids_Kmeans, c, figdata = ensembles.kmeans_reduce_ensemble(
        crit.isel(criteria=[0, 1]),
        method={"n_clusters": n},
        random_state=42,
        make_graph=True,
    )
    plt.scatter(
        df.isel(criteria=0, realization=ids_Kmeans),
        df.isel(criteria=1, realization=ids_Kmeans),
        s=100,
    )
    plt.title(f"Kmeans={n}")
    if n == 1:
        plt.ylabel("standardized delta_annual_prtot")
    if n == 3:
        plt.xlabel("standardized delta_annual_tavg")
plt.suptitle("K-means selection results")
plt.show()