Statistical Downscaling and Bias-Adjustment - Advanced tools

The previous notebook covered the most common utilities of xclim.sdba for conventional cases. Here, we explore more advanced usage of xclim.sdba tools.

Optimization with dask

Adjustment processes can be very heavy when we need to compute them over large regions and long timeseries. Using small groupings (like time.dayofyear) adds precision and robustness, but also greatly increases the load and computing complexity. Fortunately, unlike the heroic pioneers of scientific computing who managed to write parallelized FORTRAN, we now have dask. With only a few parameters, we can magically distribute the computing load to multiple workers and threads.

A good first read on the use of dask within xarray is the latter’s Optimization tips.

Some xclim.sdba-specific tips:

Most adjustment methods need to operate on the whole time coordinate, so it is best to optimize chunking along the other dimensions. This often differs from how public data is shared, where more universal 3D chunks are used.

Chunking of outputs can be controlled in xarray’s to_netcdf. We also suggest using Zarr files. According to its creators, Zarr stores should give better performance, especially because of their better ability for parallel I/O. See Dataset.to_zarr and this useful rechunking package.

One of the main bottlenecks for adjustments with small groups is that dask needs to build and optimize an enormous task graph. This issue has been greatly reduced with xclim 0.27 and the use of map_blocks in the adjustment methods. However, not all adjustment methods use this optimized syntax.

In order to help dask, one can split the processing into parts. For splitting training and adjustment, see the section below.

Another massive bottleneck of parallelization of xarray is the thread-locking behaviour of some methods. It is quite difficult to isolate and avoid these locking instances, so one of the best workarounds is to use dask configurations with many processes and few threads. Processes do not share memory and thus are not impacted when a lock is activated from a thread in another worker. However, this adds many memory transfer operations and, in our experience, reduces dask’s ability to parallelize some pipelines. Such a dask Client is usually created with a large n_workers and a small threads_per_worker.

Sometimes, datasets have auxiliary coordinates (for example: lat / lon in a rotated pole dataset). Xarray handles these variables as data variables and will not load them if dask is used. However, some operations in xclim or xarray access those variables and, since they are dask-based, trigger a computation each time. To avoid this behaviour, one can load the coordinates, or simply remove them from the inputs.
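
The chunking and auxiliary-coordinate tips can be sketched as follows, assuming xarray and dask are installed. The toy dataset, the chunk sizes and the helper make_client are illustrative assumptions, not xclim recommendations; n_workers and threads_per_worker are the Client arguments mentioned in the list.

```python
import numpy as np
import xarray as xr

# Hypothetical toy dataset on a rotated-pole-like grid with 2D lat / lon
time = np.arange("2000-01-01", "2003-01-01", dtype="datetime64[D]").astype(
    "datetime64[ns]"
)
rng = np.random.default_rng(0)
ds = xr.Dataset(
    {"tas": (("time", "rlat", "rlon"), rng.random((time.size, 8, 10)))},
    coords={
        "time": time,
        "rlat": np.arange(8),
        "rlon": np.arange(10),
        "lat": (("rlat", "rlon"), rng.random((8, 10))),
        "lon": (("rlat", "rlon"), rng.random((8, 10))),
    },
)

# A single chunk along time, several chunks along the spatial dimensions
ds = ds.chunk({"time": -1, "rlat": 4, "rlon": 5})

# Option 1: load the auxiliary coordinates in memory once
ds_loaded = ds.assign_coords(
    lat=(ds.lat.dims, ds.lat.values), lon=(ds.lon.dims, ds.lon.values)
)

# Option 2: simply drop them from the inputs
ds_dropped = ds.drop_vars(["lat", "lon"])


def make_client(n_workers=8, threads_per_worker=1):
    """Many single-threaded processes, to limit the impact of thread locks."""
    from dask.distributed import Client

    return Client(n_workers=n_workers, threads_per_worker=threads_per_worker)
```

The helper is only defined here, not called; the right number of workers depends on the machine and on the memory each worker needs.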

LOESS smoothing and detrending

As described in Cleveland (1979), locally weighted linear regressions are multiple regression methods using a nearest-neighbour approach. Instead of using all data points to compute a linear or polynomial regression, LOESS algorithms compute a local regression for each point in the dataset, using only its k nearest neighbours, weighted by a weighting function. This weighting function must fulfill some strict requirements; see the documentation of xclim.sdba.loess.loess_smoothing for more details.

In xclim’s implementation, the user can choose between local constancy (\(d=0\), local estimates are weighted averages) and local linearity (\(d=1\), local estimates are taken from linear regressions). Two weighting functions are currently implemented: “tricube” (\(w(x) = (1 - x^3)^3\)) and “gaussian” (\(w(x) = e^{-x^2 / 2\sigma^2}\)). Finally, the number of Cleveland’s robustifying iterations is controllable through niter. After computing an estimate of \(y(x)\), the weights are modulated by a function of the distance between the estimate and the points and the procedure is started over. These iterations are made to weaken the effect of outliers on the estimate.
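
To make the procedure concrete, here is a minimal NumPy sketch of a LOESS pass with the tricube weights, for \(d=0\) and \(d=1\). It is an illustration under simplifying assumptions (1D inputs, \(0 < f < 1\), bisquare robustness weights as in Cleveland's paper), not xclim's implementation; the names tricube and loess_sketch are hypothetical.

```python
import numpy as np


def tricube(x):
    """Tricube kernel on scaled distances; zero outside the window."""
    x = np.abs(x)
    return np.where(x < 1, (1 - x**3) ** 3, 0.0)


def loess_sketch(x, y, f=0.5, d=1, niter=1):
    """Local weighted mean (d=0) or local linear fit (d=1) at each point."""
    n = x.size
    r = int(f * n)  # number of neighbours inside the window (assumes f < 1)
    robust = np.ones(n)  # robustness weights, updated between passes
    yest = np.zeros(n)
    design = np.stack([np.ones(n), x], axis=1)  # columns: intercept, slope
    for it in range(niter):
        for i in range(n):
            dist = np.abs(x - x[i])
            h = np.sort(dist)[r]  # bandwidth: distance to the r-th neighbour
            w = robust * tricube(dist / h)
            if d == 0:
                yest[i] = np.sum(w * y) / np.sum(w)
            else:
                # Weighted least squares for y = a + b * x
                sw = np.sqrt(w)
                coef, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
                yest[i] = coef[0] + coef[1] * x[i]
        if it < niter - 1:
            # Down-weight points that lie far from the current estimate
            res = y - yest
            s = np.median(np.abs(res))
            if s == 0:
                break  # perfect fit, nothing left to robustify
            u = np.clip(res / (6 * s), -1, 1)
            robust = (1 - u**2) ** 2
    return yest
```

With \(d=1\), a perfectly linear series is returned unchanged, which is a quick sanity check of the local regression.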

The next example shows the application of LOESS to daily temperature data. The black line and dot are the estimated \(y\), outputs of the sdba.loess.loess_smoothing function, using local linear regression (passing \(d = 1\)), a window spanning 20% (\(f = 0.2\)) of the domain, the “tricube” weighting function and only one iteration. The red curve illustrates the weighting function for a point in early January 2014, where the red circles are the nearest neighbours used in the estimation.

[1]:

from __future__ import annotations

import matplotlib.pyplot as plt
import nc_time_axis
import numpy as np
import xarray as xr

from xclim.sdba import loess

%matplotlib inline

[2]:

# Daily temperature data from xarray's tutorials
ds = xr.tutorial.open_dataset("air_temperature").resample(time="D").mean()
tas = ds.isel(lat=0, lon=0).air

# Compute the smoothed series
f = 0.2
ys = loess.loess_smoothing(tas, d=1, weights="tricube", f=f, niter=1)

# Plot data points and smoothed series
fig, ax = plt.subplots()
ax.plot(tas.time, tas, "o", fillstyle="none")
ax.plot(tas.time, ys, "k")
ax.set_xlabel("Time")
ax.set_ylabel("Temperature [K]")

## The code below calls internal functions to demonstrate how the weights are computed.

# LOESS algorithms as implemented here use scaled coordinates.
x = tas.time
x = (x - x[0]) / (x[-1] - x[0])
xi = x[366]
ti = tas.time[366]

# The weighting function takes the distances to all points, scaled by h, the distance to the r-th nearest neighbor
r = int(f * tas.time.size)
h = np.sort(np.abs(x - xi))[r]
weights = loess._tricube_weighting(np.abs(x - xi).values / h)

# Plot nearest neighbors and weighting function
wax = ax.twinx()
wax.plot(tas.time, weights, color="indianred")
ax.plot(
    tas.time, tas.where(tas * weights > 0), "o", color="lightcoral", fillstyle="none"
)

ax.plot(ti, ys[366], "ko")
wax.set_ylabel("Weights")
plt.show()

../_images/notebooks_sdba-advanced_3_0.png

LOESS smoothing can suffer from heavy boundary effects. On the previous graph, the strange bend at the left end of the line is one of them. The next example shows a stronger case. Usually, \(\frac{f}{2}N\) points on each side should be discarded. On the other hand, LOESS has the advantage of always staying within the bounds of the data.

LOESS Detrending

In climate science, LOESS can be used for detrending. xclim provides sdba.detrending.LoessDetrend to compute trends with LOESS smoothing and remove them from timeseries.

First we create some toy data with a sinusoidal annual cycle, random noise and a linear temperature increase.

[3]:

time = xr.cftime_range("1990-01-01", "2049-12-31", calendar="noleap")
tas = xr.DataArray(
    (
        10 * np.sin(time.dayofyear * 2 * np.pi / 365)  # Annual cycle
        + 5 * (np.random.random_sample(time.size) - 0.5)  # Random noise
        + np.linspace(0, 1.5, num=time.size)  # 1.5 degC increase in 60 years
    ),
    dims=("time",),
    coords={"time": time},
    attrs={"units": "degC"},
    name="temperature",
)
tas.plot()

[3]:

[<matplotlib.lines.Line2D at 0x7feec4745640>]

../_images/notebooks_sdba-advanced_5_1.png

Then we compute the trend on the data. Here, we compute on the whole timeseries (group='time'), using local constancy (d=0), two robustifying iterations (niter=2) and the same window fraction as above (f=0.2).

[4]:

from xclim.sdba.detrending import LoessDetrend

# Create the detrending object
det = LoessDetrend(group="time", d=0, niter=2, f=0.2)
# Fitting returns a new object and computes the trend.
fit = det.fit(tas)
# Get the detrended series
tas_det = fit.detrend(tas)

[5]:

fig, ax = plt.subplots()
fit.ds.trend.plot(ax=ax, label="Computed trend")
ax.plot(time, np.linspace(0, 1.5, num=time.size), label="Expected trend")
ax.plot([time[0], time[int(0.1 * time.size)]], [0.4, 0.4], linewidth=6, color="gray")
ax.plot([time[-int(0.1 * time.size)], time[-1]], [1.1, 1.1], linewidth=6, color="gray")
ax.legend()

[5]:

<matplotlib.legend.Legend at 0x7feec4724bc0>

../_images/notebooks_sdba-advanced_8_1.png

As said earlier, this example shows how LOESS has strong boundary effects. It is recommended to remove the \(\frac{f}{2}\cdot N\) outermost points on each side, as shown by the gray bars in the graph above.
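
The trimming rule can be written in a few lines. This is a minimal NumPy sketch; the helper name trim_loess_edges and the stand-in array are assumptions made for illustration.

```python
import numpy as np


def trim_loess_edges(values, f):
    """Drop the f / 2 * N outermost points on each side of a smoothed series."""
    n_trim = int(f / 2 * values.size)
    if n_trim == 0:
        return values
    return values[n_trim:-n_trim]


smoothed = np.arange(100.0)  # stand-in for a LOESS output with N = 100
core = trim_loess_edges(smoothed, f=0.2)
```

For the toy series above (\(N = 60 \times 365 = 21900\) and \(f = 0.2\)), this amounts to 2190 points, or six years, on each side.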

Retrieving extra output diagnostics

To fully understand what is happening during the bias-adjustment process, sdba can output diagnostic variables, giving more visibility into what the adjustment is doing behind the scenes. This behaviour, a verbose option, is controlled by the sdba_extra_output option, set with xclim.set_options. When True, train calls are instructed to include additional variables in the training datasets. In addition, the adjust calls will always output a dataset, with scen and, depending on the algorithm, other diagnostic variables. See the documentation of each Adjustment object to see which extra variables are available.

For the moment, this feature is still under construction and only a few Adjustment objects actually provide these extra outputs. Please open issues on the GitHub repo if you have needs or ideas of interesting diagnostic variables.

For example, QDM.adjust adds sim_q, which gives the quantile of each element of sim within its group.

[12]:

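from xclim import set_options
from xclim.sdba import QuantileDeltaMapping

# ref, hist and sim are the series defined in the earlier sections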

with set_options(sdba_extra_output=True):
    QDM = QuantileDeltaMapping.train(
        ref, hist, nquantiles=15, kind="+", group="time.dayofyear"
    )
    out = QDM.adjust(sim)

out.sim_q

[12]:

<xarray.DataArray 'sim_q' (time: 11315)> Size: 91kB
array([0.13333333, 0.03333333, 0.1       , ..., 1.        , 0.93333333,
       0.9       ])
Coordinates:
  * time     (time) object 91kB 2000-01-01 00:00:00 ... 2030-12-31 00:00:00
Attributes:
    group:               time.dayofyear
    group_compute_dims:  time
    group_window:        1
    long_name:           Group-wise quantiles of `sim`.
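
To illustrate what a group-wise quantile means, here is a minimal NumPy sketch that ranks each value within its group and rescales the ranks to [0, 1]. The values printed above are multiples of 1/30, which is consistent with 31 values per day-of-year group (2000 to 2030) ranked this way, but the exact ranking convention used by xclim is an assumption here, and groupwise_quantiles is a hypothetical helper.

```python
import numpy as np


def groupwise_quantiles(values, groups):
    """Rank of each value within its group, rescaled to the [0, 1] interval."""
    out = np.full(values.size, np.nan)
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)
        if idx.size < 2:
            continue  # undefined for single-member groups, left as NaN
        # argsort of argsort gives 0-based ranks (ties broken by position)
        ranks = values[idx].argsort().argsort()
        out[idx] = ranks / (idx.size - 1)
    return out


values = np.array([3.0, 1.0, 2.0, 10.0, 30.0, 20.0])
groups = np.array([0, 0, 0, 1, 1, 1])
quantiles = groupwise_quantiles(values, groups)
```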

Frequency adaptation with a rolling window

In the previous example, we performed bias adjustment with a rolling window. Here we show how to include frequency adaptation (see sdba.ipynb for the simple case group="time"). We first generate the same precipitation dataset used in sdba.ipynb.

[25]:

import numpy as np
import xarray as xr

t = xr.cftime_range("2000-01-01", "2030-12-31", freq="D", calendar="noleap")

vals = np.random.randint(0, 1000, size=(t.size,)) / 100
vals_ref = (4 ** np.where(vals < 9, vals / 100, vals)) / 3e6
vals_sim = (
    (1 + 0.1 * np.random.random_sample((t.size,)))
    * (4 ** np.where(vals < 9.5, vals / 100, vals))
    / 3e6
)

pr_ref = xr.DataArray(
    vals_ref, coords={"time": t}, dims=("time",), attrs={"units": "mm/day"}
)
pr_ref = pr_ref.sel(time=slice("2000", "2015"))
pr_sim = xr.DataArray(
    vals_sim, coords={"time": t}, dims=("time",), attrs={"units": "mm/day"}
)
pr_hist = pr_sim.sel(time=slice("2000", "2015"))

Bias adjustment on a rolling window can be performed in the same way as shown in sdba.ipynb, but instead of being a single string specifying the time grouping (e.g. time.month), the group argument is built with sdba.Grouper.

[26]:

import matplotlib.pyplot as plt

# adapt_freq with a sdba.Grouper
from xclim import sdba

group = sdba.Grouper("time.dayofyear", window=31)
sim_ad, pth, dP0 = sdba.processing.adapt_freq(
    pr_ref, pr_sim, thresh="0.05 mm d-1", group=group
)
QM_ad = sdba.EmpiricalQuantileMapping.train(
    pr_ref, sim_ad, nquantiles=15, kind="*", group=group
)
scen_ad = QM_ad.adjust(pr_sim)

pr_ref.sel(time="2010").plot(alpha=0.9, label="Reference")
pr_sim.sel(time="2010").plot(alpha=0.7, label="Model - biased")
scen_ad.sel(time="2010").plot(alpha=0.6, label="Model - adjusted")
plt.legend()

[26]:

<matplotlib.legend.Legend at 0x7feec4616180>

../_images/notebooks_sdba-advanced_44_1.png

In the figure above, scen occasionally has small peaks where sim is 0, indicating that there are more “dry days” (days with almost no precipitation) in hist than in ref. The frequency adaptation (Themeßl et al., 2010) performed in the step above only worked partially.

The reason is the following. The first step above combines precipitation in 365 overlapping blocks of 31 days * Y years, one block for each day of the year. Each block is adapted, and the 16th day-of-year slice (at the center of the block) is assigned to the corresponding day-of-year in the adapted dataset sim_ad. As we proceed to the training, we re-form those 31 days * Y years blocks, but this step does not invert the previous one: there can still be more zeroes in the simulation than in the reference.
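
The overlapping blocks can be pictured with a small NumPy sketch for a noleap calendar. It only selects the time steps belonging to the block of a given day of the year and ignores how the real implementation handles the start and end of the series; doy_window_mask is a hypothetical helper.

```python
import numpy as np


def doy_window_mask(doy, center, window=31, ndays=365):
    """Mask of the time steps whose day-of-year lies in the window around `center`."""
    half = window // 2
    # Circular distance between days of the year (Dec 31 is next to Jan 1)
    diff = np.abs(doy - center)
    circ = np.minimum(diff, ndays - diff)
    return circ <= half


n_years = 16  # e.g. 2000 to 2015
doy = np.tile(np.arange(1, 366), n_years)
block_jan1 = doy_window_mask(doy, center=1)
block_jul2 = doy_window_mask(doy, center=183)
```

Each block holds 31 days * 16 years = 496 values, and two blocks centered on neighbouring days share 30 of their 31 day-of-year slices, which is why adapting the blocks one by one and keeping only the central slice is not undone when the blocks are formed again.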

To alleviate this issue, another way of proceeding is to perform a frequency adaptation on the blocks, and then use the same blocks in the training step, as we show below.

[27]:

# adapt_freq directly in the training step
group = sdba.Grouper("time.dayofyear", window=31)

QM_ad = sdba.EmpiricalQuantileMapping.train(
    pr_ref,
    sim_ad,
    nquantiles=15,
    kind="*",
    group=group,
    adapt_freq_thresh="0.05 mm d-1",
)
scen_ad = QM_ad.adjust(pr_sim)

pr_ref.sel(time="2010").plot(alpha=0.9, label="Reference")
pr_sim.sel(time="2010").plot(alpha=0.7, label="Model - biased")
scen_ad.sel(time="2010").plot(alpha=0.6, label="Model - adjusted")
plt.legend()

[27]:

<matplotlib.legend.Legend at 0x7feec42a5280>

../_images/notebooks_sdba-advanced_46_1.png
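
As a reminder of what frequency adaptation tries to correct, the sketch below computes the share of the simulation's dry days that are in excess of the reference. It assumes the quantity is defined as \((P_0^{sim} - P_0^{ref}) / P_0^{sim}\), where \(P_0\) is the proportion of values under the threshold; this is a simplified illustration, not xclim's adapt_freq, and dry_day_excess is a hypothetical helper.

```python
import numpy as np


def dry_day_excess(ref, sim, thresh):
    """Share of sim's dry days that would need to become wet to match ref."""
    p0_ref = np.mean(ref < thresh)  # proportion of dry days in the reference
    p0_sim = np.mean(sim < thresh)  # proportion of dry days in the simulation
    if p0_sim <= p0_ref:
        return 0.0  # sim is not drier than ref: nothing to adapt
    return (p0_sim - p0_ref) / p0_sim


ref = np.array([0.0, 1.0, 1.0, 1.0])
sim = np.array([0.0, 0.0, 1.0, 1.0])
excess = dry_day_excess(ref, sim, thresh=0.5)
```

Here one of the two dry days in sim is in excess, so half of them would have to be replaced by wet values.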

Tests for sdba

It can be useful to perform diagnostic tests on adjusted simulations to assess if the bias correction method is working properly, or to compare two different bias correction techniques.

A diagnostic test computes a property (mean, 20-year return value, annual cycle amplitude, …) on the simulation and on the scenario (adjusted simulation), then a measure (bias, relative bias, ratio, …) of the difference with the same property computed on the reference. Usually, the property collapses the time dimension of the simulation/scenario and returns one value per grid point.

You’ll find those in xclim.sdba.properties and xclim.sdba.measures, where they are implemented as special subclasses of xclim’s Indicator, which means they can be used in the same way as conventional indicators (in YAML modules, for example).
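
The property / measure pattern can be sketched with plain NumPy on a single grid point. The helpers mean_spell_length and bias below are simplified, hypothetical stand-ins for the corresponding xclim.sdba objects, and the input values are made up.

```python
import numpy as np


def mean_spell_length(values, thresh):
    """Property: mean length of runs of consecutive values above `thresh`."""
    above = np.concatenate(([0], (values > thresh).astype(int), [0]))
    edges = np.diff(above)
    starts = np.flatnonzero(edges == 1)  # first index of each spell
    ends = np.flatnonzero(edges == -1)  # first index after each spell
    lengths = ends - starts
    return lengths.mean() if lengths.size else np.nan


def bias(sim_prop, ref_prop):
    """Measure: absolute bias of a property with respect to the reference."""
    return sim_prop - ref_prop


sim_values = np.array([29.0, 30.0, 20.0, 31.0, 29.0, 29.0, 10.0])
ref_values = np.array([29.0, 20.0, 29.0, 20.0])
sim_spell = mean_spell_length(sim_values, thresh=28.0)  # spells of 2 and 3 days
ref_spell = mean_spell_length(ref_values, thresh=28.0)  # two spells of 1 day
spell_bias = bias(sim_spell, ref_spell)
```

The property collapses the time axis into one number per series, and the measure then compares two such numbers.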

[28]:

from matplotlib import pyplot as plt

import xclim as xc
from xclim import sdba
from xclim.testing import open_dataset

# load test data
hist = open_dataset("sdba/CanESM2_1950-2100.nc").sel(time=slice("1950", "1980")).tasmax
ref = open_dataset("sdba/nrcan_1950-2013.nc").sel(time=slice("1950", "1980")).tasmax
sim = (
    open_dataset("sdba/CanESM2_1950-2100.nc").sel(time=slice("1980", "2010")).tasmax
)  # biased

# learn the bias in historical simulation compared to reference
QM = sdba.EmpiricalQuantileMapping.train(
    ref, hist, nquantiles=50, group="time", kind="+"
)

# correct the bias in the future
scen = QM.adjust(sim, extrapolation="constant", interp="nearest")
ref_future = (
    open_dataset("sdba/nrcan_1950-2013.nc").sel(time=slice("1980", "2010")).tasmax
)  # truth

plt.figure(figsize=(15, 5))
lw = 0.3
sim.isel(location=1).plot(label="sim", linewidth=lw)
scen.isel(location=1).plot(label="scen", linewidth=lw)
hist.isel(location=1).plot(label="hist", linewidth=lw)
ref.isel(location=1).plot(label="ref", linewidth=lw)
ref_future.isel(location=1).plot(label="ref_future", linewidth=lw)
leg = plt.legend()

../_images/notebooks_sdba-advanced_48_0.png

[29]:

# calculate the mean warm Spell Length Distribution
sim_prop = sdba.properties.spell_length_distribution(
    da=sim, thresh="28 degC", op=">", stat="mean", group="time"
)


scen_prop = sdba.properties.spell_length_distribution(
    da=scen, thresh="28 degC", op=">", stat="mean", group="time"
)

ref_prop = sdba.properties.spell_length_distribution(
    da=ref_future, thresh="28 degC", op=">", stat="mean", group="time"
)
# measure the difference between the prediction and the reference with an absolute bias of the properties
measure_sim = sdba.measures.bias(sim_prop, ref_prop)
measure_scen = sdba.measures.bias(scen_prop, ref_prop)

plt.figure(figsize=(5, 3))
plt.plot(measure_sim.location, measure_sim.values, ".", label="biased model (sim)")
plt.plot(measure_scen.location, measure_scen.values, ".", label="adjusted model (scen)")
plt.title(
    "Bias of the mean of the warm spell \n length distribution compared to observations"
)
plt.legend()
plt.ylim(-2.5, 2.5)

[29]:

(-2.5, 2.5)

../_images/notebooks_sdba-advanced_49_1.png

It is possible to change the ‘group’ of the property from ‘time’ to ‘time.season’ or ‘time.month’. This will return 4 or 12 values per grid point, respectively.

[30]:

# calculate the mean warm Spell Length Distribution
sim_prop = sdba.properties.spell_length_distribution(
    da=sim, thresh="28 degC", op=">", stat="mean", group="time.season"
)

scen_prop = sdba.properties.spell_length_distribution(
    da=scen, thresh="28 degC", op=">", stat="mean", group="time.season"
)

ref_prop = sdba.properties.spell_length_distribution(
    da=ref_future, thresh="28 degC", op=">", stat="mean", group="time.season"
)
# Properties are often associated with the same measures. This correspondence is implemented in xclim:
measure = sdba.properties.spell_length_distribution.get_measure()
measure_sim = measure(sim_prop, ref_prop)
measure_scen = measure(scen_prop, ref_prop)

fig, axs = plt.subplots(2, 2, figsize=(9, 6))
axs = axs.ravel()
for i in range(4):
    axs[i].plot(
        measure_sim.location,
        measure_sim.isel(season=i).values,
        ".",
        label="biased model (sim)",
    )
    axs[i].plot(
        measure_scen.location,
        measure_scen.isel(season=i).values,
        ".",
        label="adjusted model (scen)",
    )
    axs[i].set_title(measure_scen.season.values[i])
    axs[i].legend(loc="lower right")
    axs[i].set_ylim(-2.5, 2.5)
fig.suptitle(
    "Bias of the mean of the warm spell length distribution compared to observations"
)
plt.tight_layout()

../_images/notebooks_sdba-advanced_51_0.png
