Documentation overview
Overview of the method
This is a Bayesian calibration method fitted using the 20002019 ECMWF hindcasts and their corresponding observations. The method is the same for t2m and precipitation, and it is applied separately to each variable, each grid point and each leadtime.
It is also fitted separately for each realtime start date, e.g the statistical model that is applied to start date 20200102 is fitted using only the hindcast data for start dates 20000102 to 20190102.
See also Specq, D. and Batté, L. (2020)
Scripts
Script directory
s2saichallengedamienspecq/notebooks/damien_specq_scripts
Purpose of the scripts
dates.txt : list of the 2020 ECMWF forecast start dates to be used for loops in the other scripts
download_forecast_input.sh : Prepare readytouse NetCDF files of the 2020 ECMWF S2S forecasts of t2m and tp.
For t2m :
 Retrieve the ECMWF 2020 S2S forecasts for t2m from climetlab (download_forecast_input.sh#L11)
 Select the weeks 34 (download_forecast_input.sh#L14) and 56 windows (download_forecast_input.sh#L18)
 Average t2m over the weeks 34 (download_forecast_input.sh#L15) and 56 (download_forecast_input.sh#L19) windows
For tp :
 Retrieve the ECMWF 2020 S2S forecasts of accumulated precipitation from climetlab (download_forecast_input.sh#L27)
 Select the lead times at the beginning and the end of the weeks 34 (day 14 and day 28, download_forecast_input.sh#L3132) and 56 windows (day 28 and day 42, download_forecast_input.sh#L3940)
 Take the difference between the lead time at the end minus lead time at the beginning (e.g day 28 minus day 14) to get accumulated precipitation over the weeks 34 (download_forecast_input.sh#L33) and 56 (download_forecast_input.sh#L41) windows
 Remove the few negative values due to a bug in ECMWF S2S outputs, by setting them to zero (download_forecast_input.sh#L34, download_forecast_input.sh#L42)
download_hindcast_input.sh : Prepare readytouse NetCDF files of the 20002019 ECMWF S2S hindcasts of t2m and tp, to be used for training. The preparation process is the same as for realtime forecasts
download_hindcast_like_observations.sh : Prepare readytouse NetCDF files of the observations of t2m and tp corresponding to the 20002019 hindcasts, to be used for training. The preparation process is the same as for realtime forecasts.
ML_train_and_predict.R : Main script that carries out the computation of the tercile probabilities for 2020 forecasts provided for submission
Steps
As an illustration, let's assume we want to apply the method to the ECMWF 20200102 forecast for the average weeks 34 t2m at a fixed grid point.
Line numbers refer to ML_train_and_predict.R.
The script loops on:
 all 2020 start dates
 all grid points
 all lead times (weeks 34 and 56)
 all variables (t2m and tp).
Fitting
 Retrieve the t2m weeks 34 hindcasts for start dates 20000102 to 20190102 (ML_train_and_predict.R#L54 and ML_train_and_predict.R#L62)
 Retrieve the t2m observations corresponding to these hindcasts (ML_train_and_predict.R#L58 and ML_train_and_predict.R#L63)

Convert the t2m hindcast distribution into a normal distribution of mean 0 and standard deviation 1 through a quantilequantile method:
 Apply to t2m hindcast values their own cumulative distribution function (ML_train_and_predict.R#L96). The cumulative distribution function is based on the values of all ensemble members and hindcast years 20002019. It is determined with the R function "pemp" from package "EnvStats".
 Apply the inverse cumulative distribution function of the normal distribution N(0,1) to the values obtained in the previous step (ML_train_and_predict.R#L98) with R function "qnorm".
 Take the ensemble mean of the transformed values (ML_train_and_predict.R#L110)

Convert the corresponding t2m observation distribution into a normal distribution N(0,1) with the same quantilequantile method:
 Apply to t2m observed values their own cumulative distribution function (ML_train_and_predict.R#L92)
 Apply the inverse cumulative distribution function of the normal distribution N(0,1) to the values obtained in the previous step (ML_train_and_predict.R#L94)
 Fit a linear regression function between the values from steps 3 and 4 (ML_train_and_predict.R#L124), to express forecasts from step 3 as a function of observations from step 4. This is the likelihood function making the approach Bayesian.
 Save the three coefficients:
 a: intercept (ML_train_and_predict.R#L125)
 b: slope (ML_train_and_predict.R#L126)
 std: standard deviation of residuals (ML_train_and_predict.R#L127)
Prediction
 Retrieve the raw t2m weeks 34 ECMWF forecasts issued on 20200102 (ML_train_and_predict.R#L56 and ML_train_and_predict.R#L64)

Convert the t2m forecast values in the normal distribution N(0,1) using the same quantilequantile function as Step 3:
 Apply to t2m forecast values the cumulative distribution function of t2m hindcast values (ML_train_and_predict.R#L101)
 Apply the inverse cumulative distribution function of the normal distribution N(0,1) (ML_train_and_predict.R#L102)
 Take the ensemble mean (ML_train_and_predict.R#L111)

Infer the normal posterior distribution of 20200102 t2m weeks34 forecast given the likelihood function coefficients (Step 56), the raw forecast value (Step 8) and the prior distribution parameters (mean 0, standard deviation 1, ML_train_and_predict.R#L120121).
See Specq, D. and Batté, L. (2020) for more details on the formulae: Standard deviation of the posterior distribution sd: ML_train_and_predict.R#L131
 Mean of the posterior distribution m: ML_train_and_predict.R#L132
 Compute the probability that a variable following the normal posterior distribution N(m,sd) will be below the lower tercile of the prior distribution N(0,1): ML_train_and_predict.R#L141
This is the newly predicted probability to be in the lower tercile.  Compute the probability that a variable following the normal posterior distribution N(m,sd) will exceed the upper tercile of the prior distribution N(0,1): ML_train_and_predict.R#L142
This is the newly predicted probability to be in the upper tercile.  Compute the probability to be in the middle tercile so that the three probabilities add up to 1: ML_train_and_predict.R#L143
This is the newly predicted probability to be in the middle tercile.
Additional data wrangling
 Load the provided NetCDF file as a template for file submission (ML_train_and_predict.R#L13)
 Overwrite this file with the newly predicted values of Steps 10, 11 and 12 (ML_train_and_predict.R#L154)
 A mask on t2m data is determined based on available training observations (ML_train_and_predict.R#L78 to 81)
 The masked grid points are removed to the t2m data before fitting (ML_train_and_predict.R#L84 to 86)
 Grid points with no prediction because masked are completed with NAs (ML_train_and_predict.R#L146 to 151)
 The masking is more complex for precipitation because we need to account for:
 The dry mask that was intended by the challenge organizers (ML_train_and_predict.R#L23 to 26)
 The available training observations (same as for t2m, ML_train_and_predict.R#L198 to 200)
 The areas that are always dry for the specific calendar fortnight in the 20 years of hindcasts, where Steps 3 and 8 are not applicable (ML_train_and_predict.R#L193 to 195)
These three masks are merged into a single one at ML_train_and_predict.R#L208 to 210
Safeguards to prevent overfitting
The fit itself relies on a basic linear regression (Step 5) using the standard R function "lm" (ML_train_and_predict.R#L124).
It is very constrained with no tunable parameters.
The only opportunity for overfitting would be to adjust the fitting period within the hindcast period to see if one works better than another. We did not do such trials, as we directly decided to use the full hindcast period.
Safeguards to prevent data leakage
The fitting procedure is very constrained too.
The fit only relies on the ECMWF hindcasts and their corresponding observations.
Moreover, the fit for each realtime start date is done using its corresponding hindcasts calendar dates.
In doing so, the most recent observations used in the fit are at least one year older than the realtime forecast.