Documentation overview

Overview of the method

This is a Bayesian calibration method fitted using the 2000-2019 ECMWF hindcasts and their corresponding observations. The method is the same for t2m and precipitation, and it is applied separately to each variable, each grid point and each lead-time.

It is also fitted separately for each real-time start date, e.g the statistical model that is applied to start date 2020-01-02 is fitted using only the hindcast data for start dates 2000-01-02 to 2019-01-02.

See also Specq, D. and Batté, L. (2020)

Scripts

Script directory

s2s-ai-challenge-damien-specq/notebooks/damien_specq_scripts

Purpose of the scripts

dates.txt : list of the 2020 ECMWF forecast start dates to be used for loops in the other scripts

download_forecast_input.sh : Prepare ready-to-use NetCDF files of the 2020 ECMWF S2S forecasts of t2m and tp.

For t2m :

Retrieve the ECMWF 2020 S2S forecasts for t2m from climetlab (download_forecast_input.sh#L11)
Select the weeks 3-4 (download_forecast_input.sh#L14) and 5-6 windows (download_forecast_input.sh#L18)
Average t2m over the weeks 3-4 (download_forecast_input.sh#L15) and 5-6 (download_forecast_input.sh#L19) windows

For tp :

Retrieve the ECMWF 2020 S2S forecasts of accumulated precipitation from climetlab (download_forecast_input.sh#L27)
Select the lead times at the beginning and the end of the weeks 3-4 (day 14 and day 28, download_forecast_input.sh#L31-32) and 5-6 windows (day 28 and day 42, download_forecast_input.sh#L39-40)
Take the difference between the lead time at the end minus lead time at the beginning (e.g day 28 minus day 14) to get accumulated precipitation over the weeks 3-4 (download_forecast_input.sh#L33) and 5-6 (download_forecast_input.sh#L41) windows
Remove the few negative values due to a bug in ECMWF S2S outputs, by setting them to zero (download_forecast_input.sh#L34, download_forecast_input.sh#L42)

download_hindcast_input.sh : Prepare ready-to-use NetCDF files of the 2000-2019 ECMWF S2S hindcasts of t2m and tp, to be used for training. The preparation process is the same as for real-time forecasts

download_hindcast_like_observations.sh : Prepare ready-to-use NetCDF files of the observations of t2m and tp corresponding to the 2000-2019 hindcasts, to be used for training. The preparation process is the same as for real-time forecasts.

ML_train_and_predict.R : Main script that carries out the computation of the tercile probabilities for 2020 forecasts provided for submission

Steps

As an illustration, let's assume we want to apply the method to the ECMWF 2020-01-02 forecast for the average weeks 3-4 t2m at a fixed grid point.
Line numbers refer to ML_train_and_predict.R.
The script loops on:

all 2020 start dates
all grid points
all lead times (weeks 3-4 and 5-6)
all variables (t2m and tp).

Fitting

Retrieve the t2m weeks 3-4 hindcasts for start dates 2000-01-02 to 2019-01-02 (ML_train_and_predict.R#L54 and ML_train_and_predict.R#L62)
Retrieve the t2m observations corresponding to these hindcasts (ML_train_and_predict.R#L58 and ML_train_and_predict.R#L63)
Convert the t2m hindcast distribution into a normal distribution of mean 0 and standard deviation 1 through a quantile-quantile method:
- Apply to t2m hindcast values their own cumulative distribution function (ML_train_and_predict.R#L96). The cumulative distribution function is based on the values of all ensemble members and hindcast years 2000-2019. It is determined with the R function "pemp" from package "EnvStats".
- Apply the inverse cumulative distribution function of the normal distribution N(0,1) to the values obtained in the previous step (ML_train_and_predict.R#L98) with R function "qnorm".
- Take the ensemble mean of the transformed values (ML_train_and_predict.R#L110)
Convert the corresponding t2m observation distribution into a normal distribution N(0,1) with the same quantile-quantile method:
- Apply to t2m observed values their own cumulative distribution function (ML_train_and_predict.R#L92)
- Apply the inverse cumulative distribution function of the normal distribution N(0,1) to the values obtained in the previous step (ML_train_and_predict.R#L94)
Fit a linear regression function between the values from steps 3 and 4 (ML_train_and_predict.R#L124), to express forecasts from step 3 as a function of observations from step 4. This is the likelihood function making the approach Bayesian.
Save the three coefficients:
- a: intercept (ML_train_and_predict.R#L125)
- b: slope (ML_train_and_predict.R#L126)
- std: standard deviation of residuals (ML_train_and_predict.R#L127)

Prediction

Retrieve the raw t2m weeks 3-4 ECMWF forecasts issued on 2020-01-02 (ML_train_and_predict.R#L56 and ML_train_and_predict.R#L64)
Convert the t2m forecast values in the normal distribution N(0,1) using the same quantile-quantile function as Step 3:
- Apply to t2m forecast values the cumulative distribution function of t2m hindcast values (ML_train_and_predict.R#L101)
- Apply the inverse cumulative distribution function of the normal distribution N(0,1) (ML_train_and_predict.R#L102)
- Take the ensemble mean (ML_train_and_predict.R#L111)
Infer the normal posterior distribution of 2020-01-02 t2m weeks3-4 forecast given the likelihood function coefficients (Step 5-6), the raw forecast value (Step 8) and the prior distribution parameters (mean 0, standard deviation 1, ML_train_and_predict.R#L120-121).
See Specq, D. and Batté, L. (2020) for more details on the formulae:
- Standard deviation of the posterior distribution sd: ML_train_and_predict.R#L131
- Mean of the posterior distribution m: ML_train_and_predict.R#L132
Compute the probability that a variable following the normal posterior distribution N(m,sd) will be below the lower tercile of the prior distribution N(0,1): ML_train_and_predict.R#L141
This is the newly predicted probability to be in the lower tercile.
Compute the probability that a variable following the normal posterior distribution N(m,sd) will exceed the upper tercile of the prior distribution N(0,1): ML_train_and_predict.R#L142
This is the newly predicted probability to be in the upper tercile.
Compute the probability to be in the middle tercile so that the three probabilities add up to 1: ML_train_and_predict.R#L143
This is the newly predicted probability to be in the middle tercile.

Additional data wrangling

Load the provided NetCDF file as a template for file submission (ML_train_and_predict.R#L13)
Overwrite this file with the newly predicted values of Steps 10, 11 and 12 (ML_train_and_predict.R#L154)
A mask on t2m data is determined based on available training observations (ML_train_and_predict.R#L78 to 81)
The masked grid points are removed to the t2m data before fitting (ML_train_and_predict.R#L84 to 86)
Grid points with no prediction because masked are completed with NAs (ML_train_and_predict.R#L146 to 151)
The masking is more complex for precipitation because we need to account for:
- The dry mask that was intended by the challenge organizers (ML_train_and_predict.R#L23 to 26)
- The available training observations (same as for t2m, ML_train_and_predict.R#L198 to 200)
- The areas that are always dry for the specific calendar fortnight in the 20 years of hindcasts, where Steps 3 and 8 are not applicable (ML_train_and_predict.R#L193 to 195)
  These three masks are merged into a single one at ML_train_and_predict.R#L208 to 210

Safeguards to prevent overfitting

The fit itself relies on a basic linear regression (Step 5) using the standard R function "lm" (ML_train_and_predict.R#L124).
It is very constrained with no tunable parameters.
The only opportunity for overfitting would be to adjust the fitting period within the hindcast period to see if one works better than another. We did not do such trials, as we directly decided to use the full hindcast period.

Safeguards to prevent data leakage

The fitting procedure is very constrained too.
The fit only relies on the ECMWF hindcasts and their corresponding observations.
Moreover, the fit for each real-time start date is done using its corresponding hindcasts calendar dates.
In doing so, the most recent observations used in the fit are at least one year older than the real-time forecast.

Edited 3 years ago

Designs

Child items ...

Activity

Aaron Spring @aaron.spring · 3 years ago

thanks for the overview @damien.specq

Could you please add links to the code with markdown links, e.g. ML_train_and_predict.R#L92? will make the code faster to find.
Damien Specq @damien.specq · 3 years ago

Author Owner

Hi @aaron.spring

I made the modifications, I hope it's fine this way. Please let me know if more explanations are needed for review.

Aaron Spring @aaron.spring · 3 years ago

Thank you

Please register or sign in to reply
Damien Specq changed the description 3 years ago

changed the description

Please register or sign in to reply