Documentation overview
Datasets
For training the models, we only use the ECMWF hindcasts (2000–2019) and the corresponding categorical observations. For prediction, we only require the ECMWF 2020 forecasts. We always use the biweekly aggregates.
Data preprocessing
To enlarge the training sample, we train some models on all weeks of the year together, which requires removing the seasonal cycle. We compute ECMWF hindcast anomalies with respect to the hindcast's own biweekly, lead-time-dependent climatology. The 2020 forecast anomalies are computed in the same way, with respect to the hindcast climatology.
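The anomaly step can be illustrated with a minimal NumPy sketch for a single grid point. The array layout (year, week, lead time, member), the toy data and the choice to average the climatology over both years and members are assumptions made for illustration, not details taken from the actual code.

```python
import numpy as np

# Hypothetical toy hindcast at one grid point: (year, week, lead time, member).
rng = np.random.default_rng(0)
n_years, n_weeks, n_leads, n_members = 20, 53, 2, 11
seasonal_cycle = 10.0 * np.sin(2 * np.pi * np.arange(n_weeks) / n_weeks)
hindcast = seasonal_cycle[None, :, None, None] + rng.normal(
    size=(n_years, n_weeks, n_leads, n_members)
)

# Climatology: one value per week of the year and lead time
# (assumption: averaged over all hindcast years and members).
climatology = hindcast.mean(axis=(0, 3), keepdims=True)  # (1, week, lead, 1)

# Hindcast anomalies: the seasonal cycle is removed.
hindcast_anom = hindcast - climatology

# The 2020 forecast (week, lead time, member) is referenced to the
# hindcast climatology, never to a climatology computed from 2020 itself.
forecast_2020 = seasonal_cycle[:, None, None] + rng.normal(
    size=(n_weeks, n_leads, 51)
)
forecast_anom = forecast_2020 - climatology[0]  # broadcasts over members
```

Because the climatology depends on both week and lead time, anomalies from different weeks become comparable and can be pooled into one training sample.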
Methods
We have employed four statistical methods that correct the ECMWF model output of the target variable. All methods are trained separately for each grid point and lead time.
Climatology: simply issue a probability of ⅓ of observing each tercile category. No training is needed. No predictors are used. We include this method to make sure that we do not perform worse than climatology.
Raw ECMWF forecasts: this method consists of counting the number of members exceeding each tercile edge and using the proportion of members in each group as the class probability. The tercile edges are computed from the hindcast for the 2000–2019 period (2020 is excluded to prevent data leakage), separately for each week of the year but for all members at once. Although the method is described as “raw”, using tercile edges from the hindcast acts as an implicit bias adjustment. From an ML perspective, training this method consists of determining the tercile edges, so we have a separate model for each week of the year. We include this method to ensure that we do not perform worse than the ECMWF model.
Logistic regression: we model the probabilities of observing each tercile category as a regression on the ECMWF ensemble mean (as in Hamill et al. 2004). The logistic regression models the logarithm of the odds as a linear regression: $\log\frac{p(Y)}{1-p(Y)} = a + bx$ (where $x$ is the ensemble mean and $p(Y)$ is the probability of observing the class $Y$). To have a larger training sample (i.e. to determine $a$ and $b$), we train one single model for all weeks of the year using ensemble-mean anomalies with respect to the climatological mean. A multiclass “one-versus-rest” implementation is used to obtain probabilities for the three classes that sum to one.
Random forest: we employ a random forest classification algorithm with 11 sorted ensemble members as predictors and the observed category as the target. All the weeks of the year are trained together by using anomalies. For the 2020 forecasts, we subset 11 out of the 51 members available by picking sorted members 1, 6, 11, 16, ..., 51. The random forest (James et al. 2013) trains 100 bagged trees of depth 4 with a random preselection of the split variables. The forecast class probabilities are obtained by taking, in each tree, the class proportions of the training data that ended up in the same leaf as the forecast predictors, and averaging over all trees.
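Two of the steps described in the list above, member counting for the raw ECMWF probabilities and the selection of 11 sorted members out of 51, can be sketched with NumPy as follows. The function names and toy data are hypothetical, and the handling of members that fall exactly on a tercile edge (assigned to the middle category) is an assumption.

```python
import numpy as np

def tercile_edges(hindcast_members):
    """Lower and upper tercile edges, pooling all hindcast years and members."""
    return np.quantile(np.ravel(hindcast_members), [1 / 3, 2 / 3])

def raw_tercile_probabilities(forecast_members, edges):
    """Fraction of members below, between and above the tercile edges."""
    members = np.asarray(forecast_members)
    below = np.mean(members < edges[0])
    above = np.mean(members > edges[1])
    return np.array([below, 1.0 - below - above, above])

def subset_sorted_members(forecast_members, step=5):
    """Sort the members and keep every `step`-th one (sorted members 1, 6, ..., 51)."""
    return np.sort(np.asarray(forecast_members))[::step]

# Hypothetical data for one grid point, lead time and week of the year.
rng = np.random.default_rng(1)
hindcast = rng.normal(size=(20, 11))     # 20 hindcast years x 11 members
forecast = rng.normal(loc=0.5, size=51)  # 51-member 2020 forecast

edges = tercile_edges(hindcast)
probs = raw_tercile_probabilities(forecast, edges)  # three class probabilities
predictors = subset_sorted_members(forecast)        # 11 random forest predictors
```

Keeping every fifth sorted member retains the minimum, the median and the maximum of the 51-member ensemble, so the 11 predictors span the same range as the full forecast.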
Train/test/validation strategy
We employ K-fold cross-validation (James et al. 2013) to split the hindcast period into train/validation sets. Specifically, as the test dataset (i.e. the final verification) is a complete year (2020), we employ leave-one-year-out cross-validation to obtain forecast quality estimates for one-year periods. As we have 20 years of hindcast, this corresponds to a 20-fold CV. We train each method 20 times, setting aside one year from the training and reserving it for prediction and verification. Although we do not tune any hyperparameter of the methods, the validation set helps to understand the year-to-year variability of the performance. It also helps to prevent overfitting during the multi-method combination at each grid point.
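The leave-one-year-out split amounts to the following minimal sketch in plain Python. The function name is hypothetical; the actual code may build the folds differently.

```python
def leave_one_year_out(years):
    """Yield (training_years, held_out_year) pairs, one fold per year."""
    for held_out in years:
        training = [year for year in years if year != held_out]
        yield training, held_out

hindcast_years = list(range(2000, 2020))  # 2000 to 2019 inclusive
folds = list(leave_one_year_out(hindcast_years))

for training, held_out in folds[:2]:
    print(f"validate on {held_out}, train on {len(training)} years")
```

Each of the 20 folds trains on 19 years and verifies on the remaining one, which matches the one-year length of the 2020 test period.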
Combination of methods
The performance of each method in 2020 can be estimated as a random draw from the 20 performance results obtained in the cross-validation. At each grid point, we want to select the method that performs best. As the RPSS distribution over the 20 hindcast years is skewed, we use the median RPSS instead of the mean RPSS to select the best method. For precipitation, the random forest and logistic regression methods were disregarded due to time constraints (there was no time to train them after the changes in the challenge verification method).
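The per-grid-point selection can be sketched with NumPy as below. The array layout (method, year, lat, lon), the method labels and the random RPSS values are hypothetical; setting the climatology RPSS to zero assumes that climatology is the RPSS reference forecast.

```python
import numpy as np

method_names = ["climatology", "raw_ecmwf", "logistic_regression", "random_forest"]

# Hypothetical cross-validated RPSS with shape (method, year, lat, lon).
rng = np.random.default_rng(2)
rpss = rng.normal(loc=0.0, scale=0.1, size=(4, 20, 3, 4))
rpss[0] = 0.0  # assumption: climatology is the reference, so its RPSS is zero

# Median over the 20 cross-validation years, which is robust to skewness.
median_rpss = np.median(rpss, axis=1)  # (method, lat, lon)

# Index and name of the best method at every grid point.
best_index = np.argmax(median_rpss, axis=0)  # (lat, lon)
best_method = np.array(method_names)[best_index]
```

With climatology among the candidates, the selected median RPSS is never negative at any grid point. `np.argmax` returns the first maximum, so ties resolve in favour of the method listed first.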
Final forecasts
As a final step, we train each method with all the hindcast data (2000–2019) and predict 2020. Then we select the best method based on the hindcast CV results.
Safeguards

2020 observations were never loaded.

20-fold cross-validation

Although some data preprocessing (anomaly and tercile edge computations) is not done in cross-validation, the 2020 data was never used during 2020 preprocessing.
Implementation
The analyses are done in Python. We use xarray and dask to parallelize train/predict computations with the apply_ufunc function. The random forest and logistic regression classifiers are from scikit-learn. Aaron’s code make_probabilistic is used for the raw ECMWF model.
Bibliography
Hamill, T. M., Whitaker, J. S., & Wei, X. (2004). Ensemble Reforecasting: Improving Medium-Range Forecast Skill Using Retrospective Forecasts. In Monthly Weather Review (Vol. 132, Issue 6, pp. 1434–1447). American Meteorological Society.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning. Springer New York.