Documentation overview
Overview of the method
This is a Bayesian calibration method fitted using the 2000-2019 ECMWF hindcasts and their corresponding observations. The method is the same for t2m and precipitation, and it is applied separately to each variable, each grid point and each lead-time.
It is also fitted separately for each real-time start date, e.g the statistical model that is applied to start date 2020-01-02 is fitted using only the hindcast data for start dates 2000-01-02 to 2019-01-02.
See also Specq, D. and Batté, L. (2020)
Scripts
Script directory
s2s-ai-challenge-damien-specq/notebooks/damien_specq_scripts
Purpose of the scripts
dates.txt : list of the 2020 ECMWF forecast start dates to be used for loops in the other scripts
download_forecast_input.sh : Prepare ready-to-use NetCDF files of the 2020 ECMWF S2S forecasts of t2m and tp.
For t2m :
- Retrieve the ECMWF 2020 S2S forecasts for t2m from climetlab (download_forecast_input.sh#L11)
- Select the weeks 3-4 (download_forecast_input.sh#L14) and 5-6 windows (download_forecast_input.sh#L18)
- Average t2m over the weeks 3-4 (download_forecast_input.sh#L15) and 5-6 (download_forecast_input.sh#L19) windows
For tp :
- Retrieve the ECMWF 2020 S2S forecasts of accumulated precipitation from climetlab (download_forecast_input.sh#L27)
- Select the lead times at the beginning and the end of the weeks 3-4 (day 14 and day 28, download_forecast_input.sh#L31-32) and 5-6 windows (day 28 and day 42, download_forecast_input.sh#L39-40)
- Take the difference between the lead time at the end minus lead time at the beginning (e.g day 28 minus day 14) to get accumulated precipitation over the weeks 3-4 (download_forecast_input.sh#L33) and 5-6 (download_forecast_input.sh#L41) windows
- Remove the few negative values due to a bug in ECMWF S2S outputs, by setting them to zero (download_forecast_input.sh#L34, download_forecast_input.sh#L42)
download_hindcast_input.sh : Prepare ready-to-use NetCDF files of the 2000-2019 ECMWF S2S hindcasts of t2m and tp, to be used for training. The preparation process is the same as for real-time forecasts
download_hindcast_like_observations.sh : Prepare ready-to-use NetCDF files of the observations of t2m and tp corresponding to the 2000-2019 hindcasts, to be used for training. The preparation process is the same as for real-time forecasts.
ML_train_and_predict.R : Main script that carries out the computation of the tercile probabilities for 2020 forecasts provided for submission
Steps
As an illustration, let's assume we want to apply the method to the ECMWF 2020-01-02 forecast for the average weeks 3-4 t2m at a fixed grid point.
Line numbers refer to ML_train_and_predict.R.
The script loops on:
- all 2020 start dates
- all grid points
- all lead times (weeks 3-4 and 5-6)
- all variables (t2m and tp).
Fitting
- Retrieve the t2m weeks 3-4 hindcasts for start dates 2000-01-02 to 2019-01-02 (ML_train_and_predict.R#L54 and ML_train_and_predict.R#L62)
- Retrieve the t2m observations corresponding to these hindcasts (ML_train_and_predict.R#L58 and ML_train_and_predict.R#L63)
-
Convert the t2m hindcast distribution into a normal distribution of mean 0 and standard deviation 1 through a quantile-quantile method:
- Apply to t2m hindcast values their own cumulative distribution function (ML_train_and_predict.R#L96). The cumulative distribution function is based on the values of all ensemble members and hindcast years 2000-2019. It is determined with the R function "pemp" from package "EnvStats".
- Apply the inverse cumulative distribution function of the normal distribution N(0,1) to the values obtained in the previous step (ML_train_and_predict.R#L98) with R function "qnorm".
- Take the ensemble mean of the transformed values (ML_train_and_predict.R#L110)
-
Convert the corresponding t2m observation distribution into a normal distribution N(0,1) with the same quantile-quantile method:
- Apply to t2m observed values their own cumulative distribution function (ML_train_and_predict.R#L92)
- Apply the inverse cumulative distribution function of the normal distribution N(0,1) to the values obtained in the previous step (ML_train_and_predict.R#L94)
- Fit a linear regression function between the values from steps 3 and 4 (ML_train_and_predict.R#L124), to express forecasts from step 3 as a function of observations from step 4. This is the likelihood function making the approach Bayesian.
- Save the three coefficients:
- a: intercept (ML_train_and_predict.R#L125)
- b: slope (ML_train_and_predict.R#L126)
- std: standard deviation of residuals (ML_train_and_predict.R#L127)
Prediction
- Retrieve the raw t2m weeks 3-4 ECMWF forecasts issued on 2020-01-02 (ML_train_and_predict.R#L56 and ML_train_and_predict.R#L64)
-
Convert the t2m forecast values in the normal distribution N(0,1) using the same quantile-quantile function as Step 3:
- Apply to t2m forecast values the cumulative distribution function of t2m hindcast values (ML_train_and_predict.R#L101)
- Apply the inverse cumulative distribution function of the normal distribution N(0,1) (ML_train_and_predict.R#L102)
- Take the ensemble mean (ML_train_and_predict.R#L111)
-
Infer the normal posterior distribution of 2020-01-02 t2m weeks3-4 forecast given the likelihood function coefficients (Step 5-6), the raw forecast value (Step 8) and the prior distribution parameters (mean 0, standard deviation 1, ML_train_and_predict.R#L120-121).
See Specq, D. and Batté, L. (2020) for more details on the formulae:- Standard deviation of the posterior distribution sd: ML_train_and_predict.R#L131
- Mean of the posterior distribution m: ML_train_and_predict.R#L132
- Compute the probability that a variable following the normal posterior distribution N(m,sd) will be below the lower tercile of the prior distribution N(0,1): ML_train_and_predict.R#L141
This is the newly predicted probability to be in the lower tercile. - Compute the probability that a variable following the normal posterior distribution N(m,sd) will exceed the upper tercile of the prior distribution N(0,1): ML_train_and_predict.R#L142
This is the newly predicted probability to be in the upper tercile. - Compute the probability to be in the middle tercile so that the three probabilities add up to 1: ML_train_and_predict.R#L143
This is the newly predicted probability to be in the middle tercile.
Additional data wrangling
- Load the provided NetCDF file as a template for file submission (ML_train_and_predict.R#L13)
- Overwrite this file with the newly predicted values of Steps 10, 11 and 12 (ML_train_and_predict.R#L154)
- A mask on t2m data is determined based on available training observations (ML_train_and_predict.R#L78 to 81)
- The masked grid points are removed to the t2m data before fitting (ML_train_and_predict.R#L84 to 86)
- Grid points with no prediction because masked are completed with NAs (ML_train_and_predict.R#L146 to 151)
- The masking is more complex for precipitation because we need to account for:
- The dry mask that was intended by the challenge organizers (ML_train_and_predict.R#L23 to 26)
- The available training observations (same as for t2m, ML_train_and_predict.R#L198 to 200)
- The areas that are always dry for the specific calendar fortnight in the 20 years of hindcasts, where Steps 3 and 8 are not applicable (ML_train_and_predict.R#L193 to 195)
These three masks are merged into a single one at ML_train_and_predict.R#L208 to 210
Safeguards to prevent overfitting
The fit itself relies on a basic linear regression (Step 5) using the standard R function "lm" (ML_train_and_predict.R#L124).
It is very constrained with no tunable parameters.
The only opportunity for overfitting would be to adjust the fitting period within the hindcast period to see if one works better than another. We did not do such trials, as we directly decided to use the full hindcast period.
Safeguards to prevent data leakage
The fitting procedure is very constrained too.
The fit only relies on the ECMWF hindcasts and their corresponding observations.
Moreover, the fit for each real-time start date is done using its corresponding hindcasts calendar dates.
In doing so, the most recent observations used in the fit are at least one year older than the real-time forecast.