xclim.ensembles package

Ensemble tools.

This submodule defines some useful methods for dealing with ensembles of climate simulations. In xclim, an “ensemble” is a Dataset or a DataArray where multiple climate realizations or models are concatenated along the realization dimension.

Submodules

xclim.ensembles._base module

Ensembles Creation and Statistics

xclim.ensembles._base._ens_align_datasets(datasets: list[xr.Dataset | Path | str | list[Path | str]] | str, mf_flag: bool = False, resample_freq: str | None = None, calendar: str = 'default', **xr_kwargs) list[xr.Dataset][source]

Create a list of aligned xarray Datasets for ensemble Dataset creation.

Parameters
  • datasets (list[xr.Dataset | xr.DataArray | Path | str | list[Path | str]] or str) – List of netcdf file paths or xarray Dataset/DataArray objects . If mf_flag is True, ‘datasets’ should be a list of lists where each sublist contains input NetCDF files of a xarray multi-file Dataset. DataArrays should have a name, so they can be converted to datasets. If a string, it is assumed to be a glob pattern for finding datasets.

  • mf_flag (bool) – If True climate simulations are treated as xarray multi-file datasets before concatenation. Only applicable when ‘datasets’ is a sequence of file paths.

  • resample_freq (str or None) – If the members of the ensemble have the same frequency but not the same offset, they cannot be properly aligned. If resample_freq is set, the time coordinate of each member will be modified to fit this frequency.

  • calendar (str) – The calendar of the time coordinate of the ensemble. For conversions involving ‘360_day’, the align_on=’date’ option is used. See xclim.core.calendar.convert_calendar. ‘default’ is the standard calendar using np.datetime64 objects.

  • xr_kwargs – Any keyword arguments to be given to xarray when opening the files.

Returns

list[xr.Dataset]

xclim.ensembles._base.create_ensemble(datasets: list[xr.Dataset | xr.DataArray | Path | str | list[Path | str]] | str, mf_flag: bool = False, resample_freq: str | None = None, calendar: str = 'default', **xr_kwargs) xr.Dataset[source]

Create an xarray dataset of an ensemble of climate simulation from a list of netcdf files.

Input data is concatenated along a newly created data dimension (‘realization’). Returns an xarray dataset object containing input data from the list of netcdf files concatenated along a new dimension (name:’realization’). In the case where input files have unequal time dimensions, the output ensemble Dataset is created for maximum time-step interval of all input files. Before concatenation, datasets not covering the entire time span have their data padded with NaN values. Dataset and variable attributes of the first dataset are copied to the resulting dataset.

Parameters
  • datasets (List[Union[xr.Dataset, Path, str, List[Path, str]]] or str) – List of netcdf file paths or xarray Dataset/DataArray objects . If mf_flag is True, ncfiles should be a list of lists where each sublist contains input .nc files of an xarray multifile Dataset. If DataArray object are passed, they should have a name in order to be transformed into Datasets. If a string is passed, it is assumed to be a glob pattern for finding datasets.

  • mf_flag (bool) – If True, climate simulations are treated as xarray multifile Datasets before concatenation. Only applicable when “datasets” is a sequence of file paths.

  • resample_freq (Optional[str]) – If the members of the ensemble have the same frequency but not the same offset, they cannot be properly aligned. If resample_freq is set, the time coordinate of each members will be modified to fit this frequency.

  • calendar (str) – The calendar of the time coordinate of the ensemble. For conversions involving ‘360_day’, the align_on=’date’ option is used. See xclim.core.calendar.convert_calendar. ‘default’ is the standard calendar using np.datetime64 objects.

  • xr_kwargs – Any keyword arguments to be given to xr.open_dataset when opening the files (or to xr.open_mfdataset if mf_flag is True)

Returns

xr.Dataset – Dataset containing concatenated data from all input files.

Notes

Input netcdf files require equal spatial dimension size (e.g. lon, lat dimensions). If input data contains multiple cftime calendar types they must be at monthly or coarser frequency.

Examples

>>> from xclim.ensembles import create_ensemble
>>> ens = create_ensemble(temperature_datasets)

Using multifile datasets, through glob patterns. Simulation 1 is a list of .nc files (e.g. separated by time):

>>> datasets = glob.glob("/dir/*.nc")  

Simulation 2 is also a list of .nc files:

>>> datasets.append(glob.glob("/dir2/*.nc"))  
>>> ens = create_ensemble(datasets, mf_flag=True)  
xclim.ensembles._base.ensemble_mean_std_max_min(ens: Dataset) Dataset[source]

Calculate ensemble statistics between a results from an ensemble of climate simulations.

Returns an xarray Dataset containing ensemble mean, standard-deviation, minimum and maximum for input climate simulations.

Parameters

ens (xr.Dataset) – Ensemble dataset (see xclim.ensembles.create_ensemble).

Returns

xr.Dataset – Dataset with data variables of ensemble statistics.

Examples

>>> from xclim.ensembles import create_ensemble, ensemble_mean_std_max_min

Create the ensemble dataset:

>>> ens = create_ensemble(temperature_datasets)

Calculate ensemble statistics:

>>> ens_mean_std = ensemble_mean_std_max_min(ens)
xclim.ensembles._base.ensemble_percentiles(ens: xr.Dataset | xr.DataArray, values: Sequence[float] = [10, 50, 90], keep_chunk_size: bool | None = None, split: bool = True) xr.Dataset[source]

Calculate ensemble statistics between a results from an ensemble of climate simulations.

Returns a Dataset containing ensemble percentiles for input climate simulations.

Parameters
  • ens (Union[xr.Dataset, xr.DataArray]) – Ensemble dataset or dataarray (see xclim.ensembles.create_ensemble).

  • values (Tuple[int, int, int]) – Percentile values to calculate. Default: (10, 50, 90).

  • keep_chunk_size (Optional[bool]) – For ensembles using dask arrays, all chunks along the ‘realization’ axis are merged. If True, the dataset is rechunked along the dimension with the largest chunks, so that the chunks keep the same size (approx) If False, no shrinking is performed, resulting in much larger chunks If not defined, the function decides which is best

  • split (bool) – Whether to split each percentile into a new variable of concatenate the ouput along a new “percentiles” dimension.

Returns

Union[xr.Dataset, xr.DataArray] – If split is True, same type as ens; dataset otherwise, containing data variable(s) of requested ensemble statistics

Examples

>>> from xclim.ensembles import create_ensemble, ensemble_percentiles

Create ensemble dataset:

>>> ens = create_ensemble(temperature_datasets)

Calculate default ensemble percentiles:

>>> ens_percs = ensemble_percentiles(ens)

Calculate non-default percentiles (25th and 75th)

>>> ens_percs = ensemble_percentiles(ens, values=(25, 50, 75))

If the original array has many small chunks, it might be more efficient to do:

>>> ens_percs = ensemble_percentiles(ens, keep_chunk_size=False)

xclim.ensembles._reduce module

Ensemble Reduction

Ensemble reduction is the process of selecting a subset of members from an ensemble in order to reduce the volume of computation needed while still covering a good portion of the simulated climate variability.

xclim.ensembles._reduce._calc_rsq(z, method, make_graph, n_sim, random_state, sample_weights)[source]

Sub-function to kmeans_reduce_ensemble. Calculates r-square profile (r-square versus number of clusters.

xclim.ensembles._reduce._get_nclust(method=None, n_sim=None, rsq=None, max_clusters=None)[source]

Sub-function to kmeans_reduce_ensemble. Determine number of clusters to create depending on various methods.

xclim.ensembles._reduce.kkz_reduce_ensemble(data: DataArray, num_select: int, *, dist_method: str = 'euclidean', standardize: bool = True, **cdist_kwargs) list[source]

Return a sample of ensemble members using KKZ selection.

The algorithm selects num_select ensemble members spanning the overall range of the ensemble. The selection is ordered, smaller groups are always subsets of larger ones for given criteria. The first selected member is the one nearest to the centroid of the ensemble, all subsequent members are selected in a way maximizing the phase-space coverage of the group. Algorithm taken from [CannonKKZ].

Parameters
  • data (xr.DataArray) – Selection criteria data : 2-D xr.DataArray with dimensions ‘realization’ (N) and ‘criteria’ (P). These are the values used for clustering. Realizations represent the individual original ensemble members and criteria the variables/indicators used in the grouping algorithm.

  • num_select (int) – The number of members to select.

  • dist_method (str) – Any distance metric name accepted by scipy.spatial.distance.cdist.

  • standardize (bool) – Whether to standardize the input before running the selection or not. Standardization consists in translation as to have a zero mean and scaling as to have a unit standard deviation.

  • cdist_kwargs – All extra arguments are passed as-is to scipy.spatial.distance.cdist, see its docs for more information.

Returns

list – Selected model indices along the realization dimension.

References

CannonKKZ

Cannon, Alex J. (2015). Selecting GCM Scenarios that Span the Range of Changes in a Multimodel Ensemble: Application to CMIP5 Climate Extremes Indices. Journal of Climate, (28)3, 1260-1267. https://doi.org/10.1175/JCLI-D-14-00636.1

xclim.ensembles._reduce.kmeans_reduce_ensemble(data: xarray.DataArray, *, method: dict = None, make_graph: bool = True, max_clusters: int | None = None, variable_weights: np.ndarray | None = None, model_weights: np.ndarray | None = None, sample_weights: np.ndarray | None = None, random_state: int | np.random.RandomState | None = None) tuple[list, np.ndarray, dict][source]

Return a sample of ensemble members using k-means clustering.

The algorithm attempts to reduce the total number of ensemble members while maintaining adequate coverage of the ensemble uncertainty in an N-dimensional data space. K-Means clustering is carried out on the input selection criteria data-array in order to group individual ensemble members into a reduced number of similar groups. Subsequently, a single representative simulation is retained from each group.

Parameters
  • data (xr.DataArray) – Selecton criteria data : 2-D xr.DataArray with dimensions ‘realization’ (N) and ‘criteria’ (P). These are the values used for clustering. Realizations represent the individual original ensemble members and criteria the variables/indicators used in the grouping algorithm.

  • method (dict) – Dictionary defining selection method and associated value when required. See Notes.

  • max_clusters (Optional[int]) – Maximum number of members to include in the output ensemble selection. When using ‘rsq_optimize’ or ‘rsq_cutoff’ methods, limit the final selection to a maximum number even if method results indicate a higher value. Defaults to N.

  • variable_weights (Optional[np.ndarray]) – An array of size P. This weighting can be used to influence of weight of the climate indices (criteria dimension) on the clustering itself.

  • model_weights (Optional[np.ndarray]) – An array of size N. This weighting can be used to influence which realization is selected from within each cluster. This parameter has no influence on the clustering itself.

  • sample_weights (Optional[np.ndarray]) – An array of size N. sklearn.cluster.KMeans() sample_weights parameter. This weighting can be used to influence of weight of simulations on the clustering itself. See: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

  • random_state (Optional[Union[int, np.random.RandomState]]) – sklearn.cluster.KMeans() random_state parameter. Determines random number generation for centroid initialization. Use an int to make the randomness deterministic. See: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

  • make_graph (bool) – output a dictionary of input for displays a plot of R² vs. the number of clusters. Defaults to True if matplotlib is installed in runtime environment.

Notes

Parameters for method in call must follow these conventions:

rsq_optimize

Calculate coefficient of variation (R²) of cluster results for n = 1 to N clusters and determine an optimal number of clusters that balances cost / benefit tradeoffs. This is the default setting. See supporting information S2 text in https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0152495

method={‘rsq_optimize’:None}

rsq_cutoff

Calculate Coefficient of variation (R²) of cluster results for n = 1 to N clusters and determine the minimum numbers of clusters needed for R² > val.

val : float between 0 and 1. R² value that must be exceeded by clustering results.

method={‘rsq_cutoff’: val}

n_clusters

Create a user determined number of clusters.

val : integer between 1 and N

method={‘n_clusters’: val}

Returns

  • list – Selected model indexes (positions)

  • np.ndarray – KMeans clustering results

  • dict – Dictionary of input data for creating R² profile plot. ‘None’ when make_graph=False

References

Casajus et al. 2016. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0152495

Examples

>>> import xclim
>>> from xclim.ensembles import create_ensemble, kmeans_reduce_ensemble
>>> from xclim.indices import hot_spell_frequency

Start with ensemble datasets for temperature:

>>> ensTas = create_ensemble(temperature_datasets)

Calculate selection criteria – Use annual climate change Δ fields between 2071-2100 and 1981-2010 normals. First, average annual temperature:

>>> tg = xclim.atmos.tg_mean(tas=ensTas.tas)
>>> his_tg = tg.sel(time=slice("1990", "2019")).mean(dim="time")
>>> fut_tg = tg.sel(time=slice("2020", "2050")).mean(dim="time")
>>> dtg = fut_tg - his_tg

Then, Hotspell frequency as second indicator:

>>> hs = hot_spell_frequency(tasmax=ensTas.tas, window=2, thresh_tasmax="10 degC")
>>> his_hs = hs.sel(time=slice("1990", "2019")).mean(dim="time")
>>> fut_hs = hs.sel(time=slice("2020", "2050")).mean(dim="time")
>>> dhs = fut_hs - his_hs

Create a selection criteria xr.DataArray:

>>> from xarray import concat
>>> crit = concat((dtg, dhs), dim="criteria")

Finally, create clusters and select realization ids of reduced ensemble:

>>> ids, cluster, fig_data = kmeans_reduce_ensemble(
...     data=crit, method={"rsq_cutoff": 0.9}, random_state=42, make_graph=False
... )
>>> ids, cluster, fig_data = kmeans_reduce_ensemble(
...     data=crit, method={"rsq_optimize": None}, random_state=42, make_graph=True
... )
xclim.ensembles._reduce.plot_rsqprofile(fig_data)[source]

Create an R² profile plot using kmeans_reduce_ensemble output.

The R² plot allows evaluation of the proportion of total uncertainty in the original ensemble that is provided by the reduced selected.

Examples

>>> from xclim.ensembles import kmeans_reduce_ensemble, plot_rsqprofile
>>> is_matplotlib_installed()
>>> crit = xr.open_dataset(path_to_ensemble_file).data
>>> ids, cluster, fig_data = kmeans_reduce_ensemble(
...     data=crit, method={"rsq_cutoff": 0.9}, random_state=42, make_graph=True
... )
>>> plot_rsqprofile(fig_data)

xclim.ensembles._robustness module

Ensemble Robustness metrics.

Robustness metrics are used to estimate the confidence of the climate change signal of an ensemble. This submodule is inspired by and tries to follow the guidelines of the IPCC, more specifically the 12th chapter of the Working Group 1’s contribution to the AR5 [AR5WG1C12] (see box 12.1).

References

AR5WG1C12

https://www.ipcc.ch/site/assets/uploads/2018/02/WG1AR5_Chapter12_FINAL.pdf

xclim.ensembles._robustness.change_significance(fut: xr.DataArray | xr.Dataset, ref: xr.DataArray | xr.Dataset = None, test: str = 'ttest', **kwargs) tuple[xr.DataArray | xr.Dataset, xr.DataArray | xr.Dataset][source]

Robustness statistics qualifying how the members of an ensemble agree on the existence of change and on its sign.

Parameters
  • fut (Union[xr.DataArray, xr.Dataset]) – Future period values along ‘realization’ and ‘time’ (…, nr, nt1) or if ref is None, Delta values along realization (…, nr).

  • ref (Union[xr.DataArray, xr.Dataset], optional) – Reference period values along realization’ and ‘time’ (…, nt2, nr). The size of the ‘time’ axis does not need to match the one of fut. But their ‘realization’ axes must be identical. If None (default), values of fut are assumed to be deltas instead of a distribution across the future period. fut and ref must be of the same type (Dataset or DataArray). If they are Dataset, they must have the same variables (name and coords).

  • test ({‘ttest’, ‘welch-ttest’, ‘threshold’, None}) – Name of the statistical test used to determine if there was significant change. See notes.

  • kwargs – Other arguments specific to the statistical test.

    For ‘ttest’ and ‘welch-ttest’:
    p_changefloat (default0.05)

    p-value threshold for rejecting the hypothesis of no significant change.

    For ‘threshold’: (Only one of those must be given.)
    abs_threshfloat (no default)

    Threshold for the (absolute) change to be considered significative.

    rel_threshfloat (no default, in [0, 1])

    Threshold for the relative change (in reference to ref) to be significative. Only valid if ref is given.

Returns

  • change_frac – The fraction of members that show significant change [0, 1]. Passing test=None yields change_frac = 1 everywhere. Same type as fut.

  • pos_frac – The fraction of members showing significant change that show a positive change ]0, 1]. Null values are returned where no members show significant change.

    The table below shows the coefficient needed to retrieve the number of members that have the indicated characteristics, by multiplying it to the total number of members (fut.realization.size).

    Significant change

    Non-significant change

    Any direction

    change_frac

    1 - change_frac

    Positive change

    pos_frac * change_frac

    N.A.

    Negative change

    (1 - pos_frac) * change_frac

Notes

Available statistical tests are :

‘ttest’ :

Single sample T-test. Same test as used by [tebaldi2011]. The future values are compared against the reference mean (over ‘time’). Change is qualified as ‘significant’ when the test’s p-value is below the user-provided p_change value.

‘welch-ttest’ :

Two-sided T-test, without assuming equal population variance. Same significance criterion as ‘ttest’.

‘threshold’ :

Change is considered significative if the absolute delta exceeds a given threshold (absolute or relative).

None :

Significant change is not tested and, thus, members showing no change are included in the sign_frac output.

References

tebaldi2011

Tebaldi C., Arblaster, J.M. and Knutti, R. (2011) Mapping model agreement on future climate projections. GRL. doi:10.1029/2011GL049863

Example

This example computes the mean temperature in an ensemble and compares two time periods, qualifying significant change through a single sample T-test.

>>> from xclim import ensembles
>>> ens = ensembles.create_ensemble(temperature_datasets)
>>> tgmean = xclim.atmos.tg_mean(tas=ens.tas, freq="YS")
>>> fut = tgmean.sel(time=slice("2020", "2050"))
>>> ref = tgmean.sel(time=slice("1990", "2020"))
>>> chng_f, pos_f = ensembles.change_significance(fut, ref, test="ttest")

If the deltas were already computed beforehand, the ‘threshold’ test can still be used, here with a 2 K threshold.

>>> delta = fut.mean("time") - ref.mean("time")
>>> chng_f, pos_f = ensembles.change_significance(
...     delta, test="threshold", abs_thresh=2
... )
xclim.ensembles._robustness.robustness_coefficient(fut: xr.DataArray | xr.Dataset, ref: xr.DataArray | xr.Dataset) xr.DataArray | xr.Dataset[source]

Robustness coefficient quantifying the robustness of a climate change signal in an ensemble.

Taken from Knutti and Sedlacek (2013).

The robustness metric is defined as R = 1 − A1 / A2 , where A1 is defined as the integral of the squared area between two cumulative density functions characterizing the individual model projections and the multi-model mean projection and A2 is the integral of the squared area between two cumulative density functions characterizing the multi-model mean projection and the historical climate. (Description taken from [knutti2013])

A value of R equal to one implies perfect model agreement. Higher model spread or smaller signal decreases the value of R.

Parameters
  • fut (Union[xr.DataArray, xr.Dataset]) – Future ensemble values along ‘realization’ and ‘time’ (nr, nt). Can be a dataset, in which case the coeffcient is computed on each variables.

  • ref (Union[xr.DataArray, xr.Dataset]) – Reference period values along ‘time’ (nt). Same type as fut.

Returns

R – The robustness coeffcient, ]-inf, 1], float. Same type as fut or ref.

References

knutti2013

Knutti, R. and Sedláček, J. (2013) Robustness and uncertainties in the new CMIP5 climate model projections. Nat. Clim. Change. doi:10.1038/nclimate1716