%% Cell type:markdown id: tags:
# Train ML model to correct predictions of week 3-4 & 5-6
This notebook creates a Machine Learning model (`ML_model`) to predict weeks 3-4 & 5-6 based on `S2S` weeks 3-4 & 5-6 forecasts; predictions are compared against `CPC` observations for the [`s2s-ai-challenge`](https://s2s-ai-challenge.github.io/).
%% Cell type:markdown id: tags:
# Synopsis
%% Cell type:markdown id: tags:
## Data used
Training input for the Machine Learning model:
- hindcasts of models: ECMWF
Forecast input for the Machine Learning model:
- real-time 2020 forecasts of the same models
Compare the Machine Learning model forecast against:
- `CPC` observations 2020
%% Cell type:markdown id: tags:
## Method: (`name`) mean bias reduction
- calculate bias from 2000-2019
- remove bias from 2020 forecast
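%% Cell type:markdown id: tags:
A minimal sketch of this mean bias reduction in xarray (the names `hindcast`, `obs_hindcast` and `forecast_2020` are illustrative; all are assumed to be aligned on the same grid and `lead_time`):
%% Cell type:code id: tags:
``` python
def remove_mean_bias(hindcast, obs_hindcast, forecast_2020):
    """Remove the 2000-2019 mean bias from the 2020 forecast (sketch)."""
    bias = (hindcast - obs_hindcast).mean('time')  # mean bias over the hindcast period
    return forecast_2020 - bias
```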
%% Cell type:markdown id: tags:
## Resources used
For training:
- platform: renku
- memory: 8 GB
- processors: 2 CPU
- storage required: 10 GB
%% Cell type:markdown id: tags:
## Safeguards
All points below have to be checked [x]. If not, your submission is invalid.
Changes to the code after submission are not possible, as the `commit` before the `tag` will be reviewed.
(Only in exceptional cases, and where prior effort toward reproducibility is evident, may improvements to readability and reproducibility be allowed after November 1st 2021.)
%% Cell type:markdown id: tags:
### Safeguards to prevent [overfitting](https://en.wikipedia.org/wiki/Overfitting?wprov=sfti1)
If the organizers suspect overfitting, your contribution can be disqualified.
- [ ] We did not use 2020 observations in training (explicit overfitting and cheating).
- [ ] We did not repeatedly verify our model on 2020 observations and incrementally improve our RPSS (implicit overfitting).
- [ ] We tried our best to prevent [data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)?wprov=sfti1).
- [ ] We honor the `train-validate-test` [split principle](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets). This means that the hindcast data is split into `train` and `validate`, whereas `test` is withheld.
- [ ] We did not use `test` explicitly in training or implicitly in incrementally adjusting parameters.
- [ ] We considered [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)); a minimal sketch follows below.
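%% Cell type:markdown id: tags:
A minimal sketch of leave-one-year-out cross-validation over the hindcast years (illustrative only; the 2000-2019 year range is taken from the hindcast period used below):
%% Cell type:code id: tags:
``` python
import numpy as np

# leave-one-year-out folds over the hindcast period (sketch)
years = np.arange(2000, 2020)
for held_out in years[:3]:  # only the first three folds shown
    train_years = [int(y) for y in years if y != held_out]
    print(f'validate on {held_out}, train on {len(train_years)} years')
```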
%% Cell type:markdown id: tags:
### Safeguards for Reproducibility
The notebook/code must be independently reproducible from scratch by the organizers (after the competition); if this is not possible, no prize can be awarded.
- [ ] All training data is publicly available (no pre-trained private neural networks, as they are not reproducible for us)
- [ ] Code is well documented, readable and reproducible.
- [ ] Code to reproduce runs within a day.
%% Cell type:markdown id: tags:
# Todos to improve template
This is just a demo.
- [ ] for both variables
- [ ] for both `lead_time`s
- [ ] ensure probabilistic prediction outcome with `category` dim
%% Cell type:markdown id: tags:
# Imports
%% Cell type:code id: tags:
``` python
from tensorflow.keras.layers import Input, Dense, Flatten
from tensorflow.keras.models import Sequential
import matplotlib.pyplot as plt
import xarray as xr
xr.set_options(display_style='text')
from dask.utils import format_bytes
import xskillscore as xs
```
%% Cell type:markdown id: tags:
# Get training data
Preprocessing of the input data may be done in a separate notebook/script.
%% Cell type:markdown id: tags:
## Hindcast
Get weekly initialized hindcasts.
%% Cell type:code id: tags:
``` python
v = 'tp'  # variable: total precipitation
```
%% Cell type:code id: tags:
``` python
# preprocessed as renku dataset
!renku storage pull ../data/ECMWF_hc_tp_weekly_2000_2019.zarr/
```
%% Output
Warning: Run CLI commands only from project's root directory.

%% Cell type:code id: tags:
``` python
hc_weekly = xr.open_zarr('../data/ECMWF_hc_tp_weekly_2000_2019.zarr')
```
%% Cell type:code id: tags:
``` python
def add_time_from_forecast_reference_time_and_step(benchmark, init_dim='time'):
    """Creates valid_time(forecast_reference_time, step).

    step: pd.Timedelta
    forecast_reference_time: datetime
    """
    times = xr.concat(
        [
            xr.DataArray(
                benchmark[init_dim] + step,
                dims=init_dim,
                coords={init_dim: benchmark[init_dim]},
            )
            for step in benchmark.step
        ],
        dim="step",
        join="inner",
        compat="broadcast_equals",
    )
    benchmark = benchmark.assign_coords(valid_time=times)
    return benchmark
```
%% Cell type:code id: tags:
``` python
hc_weekly = add_time_from_forecast_reference_time_and_step(hc_weekly)
```
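%% Cell type:markdown id: tags:
A toy illustration of the `valid_time` construction above (hypothetical dates): each `valid_time` is simply `forecast_reference_time + step`.
%% Cell type:code id: tags:
``` python
import pandas as pd
import xarray as xr

init = xr.DataArray(pd.date_range('2000-01-02', periods=2, freq='7D'), dims='time')
print(init + pd.Timedelta('21 days'))  # valid_time = forecast_reference_time + step
```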
%% Cell type:markdown id: tags:
## Observations
corresponding to the hindcasts
%% Cell type:code id: tags:
``` python
# as prepared renku datasets FIXME
!renku storage pull ../data/cpc-rain-1998-2020-weekly-averaged-1.5-deg/rain_verification_1998_2020.nc
```
%% Output
Warning: Run CLI commands only from project's root directory.

%% Cell type:code id: tags:
``` python
obs = xr.open_dataset(
    '../data/cpc-rain-1998-2020-weekly-averaged-1.5-deg/rain_verification_1998_2020.nc',
    chunks={},
).rename({'rain': v, 'time': 'valid_time'})
```
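%% Cell type:markdown id: tags:
A quick sanity check that observations and hindcasts share the same 1.5° grid (a sketch; assumes both carry `latitude`/`longitude` coordinates):
%% Cell type:code id: tags:
``` python
import numpy as np

assert np.allclose(obs.latitude, hc_weekly.latitude)
assert np.allclose(obs.longitude, hc_weekly.longitude)
```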
%% Cell type:markdown id: tags:
# ML model
%% Cell type:markdown id: tags:
based on [Weatherbench](https://github.com/pangeo-data/WeatherBench/blob/master/quickstart.ipynb)
%% Cell type:code id: tags:
``` python
import sys
sys.path.insert(1, '/work/s2s-ai-competition-bootstrap/WeatherBench')
from src.train_nn import DataGenerator, PeriodicConv2D, create_predictions
import tensorflow.keras as keras
```
%% Cell type:code id: tags:
``` python
bs = 32  # batch size
import numpy as np

class DataGenerator(keras.utils.Sequence):
    def __init__(self, ds, verif, step, batch_size=bs, shuffle=True, load=True, mean=None, std=None):
        """
        Data generator for WeatherBench data.
        Template from https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly

        Args:
            ds: model
            verif: obs
            step: lead_time/step as in model
            batch_size: batch size
            shuffle: bool. If True, data is shuffled.
            load: bool. If True, dataset is loaded into RAM.
            mean: if None, compute mean from data.
            std: if None, compute standard deviation from data.

        Todo:
        - use all ensemble members (`number`)
        - don't use .sel(step=step) to train over all steps at once
        """
        self.ds = ds
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.lead_time = step

        self.data = ds.transpose('time', ...).sel(step=step)
        self.mean = self.data.mean('time').compute() if mean is None else mean
        self.std = self.data.std('time').compute() if std is None else std

        self.verif_data = verif.transpose('time', ...).sel(step=step)
        self.verif_mean = self.verif_data.mean('time').compute() if mean is None else mean
        self.verif_std = self.verif_data.std('time').compute() if std is None else std

        # Normalize
        self.data = (self.data - self.mean) / self.std
        self.verif_data = (self.verif_data - self.verif_mean) / self.verif_std

        self.n_samples = self.data.time.size
        self.time = ds.time

        self.on_epoch_end()

        # For some weird reason calling .load() earlier messes up the mean and std computations
        if load:
            print('Loading data into RAM')
            self.data.load()

    def __len__(self):
        'Denotes the number of batches per epoch'
        return int(np.ceil(self.n_samples / self.batch_size))

    def __getitem__(self, i):
        'Generate one batch of data'
        idxs = self.idxs[i * self.batch_size:(i + 1) * self.batch_size]
        # got all-NaN output if NaNs are not masked
        X = self.data.isel(time=idxs).fillna(0.).values
        y = self.verif_data.isel(time=idxs).fillna(0.).values
        return X, y

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.idxs = np.arange(self.n_samples)
        if self.shuffle:
            np.random.shuffle(self.idxs)
```
%% Cell type:code id: tags:
``` python
# just train the model for week 5; FIXME: eventually it should be the 2 bi-weekly `lead_time`s
step = hc_weekly.isel(step=2).step
step
```
%% Cell type:code id: tags:
``` python
# mask: keep only grid points with valid observations
hc_weekly = hc_weekly.where(obs.isel(valid_time=0,drop=True).notnull())
```
%% Cell type:markdown id: tags:
## data prep: train, valid, test
%% Cell type:code id: tags:
``` python
# time is the forecast_reference_time
time_train_start,time_train_end='2000','2017'
time_valid_start,time_valid_end='2018','2019'
time_test = '2020'
```
%% Cell type:code id: tags:
``` python
dg_train = DataGenerator(
hc_weekly.isel(number=0).sel(time=slice(time_train_start,time_train_end))[v],
obs.sel(valid_time=hc_weekly.valid_time, method='nearest')[v].sel(time=slice(time_train_start,time_train_end)),
step=step, batch_size=bs, load=True)
```
%% Output
/opt/conda/lib/python3.7/site-packages/dask/array/numpy_compat.py:40: RuntimeWarning: invalid value encountered in true_divide
x = np.divide(x1, x2, out)
/opt/conda/lib/python3.7/site-packages/dask/array/numpy_compat.py:40: RuntimeWarning: invalid value encountered in true_divide
x = np.divide(x1, x2, out)
/opt/conda/lib/python3.7/site-packages/dask/array/numpy_compat.py:40: RuntimeWarning: invalid value encountered in true_divide
x = np.divide(x1, x2, out)
/opt/conda/lib/python3.7/site-packages/dask/array/numpy_compat.py:40: RuntimeWarning: invalid value encountered in true_divide
x = np.divide(x1, x2, out)
Loading data into RAM
%% Cell type:code id: tags:
``` python
dg_valid = DataGenerator(
hc_weekly.isel(number=0).sel(time=slice(time_valid_start,time_valid_end))[v],
obs.sel(valid_time=hc_weekly.valid_time, method='nearest')[v].sel(time=slice(time_valid_start,time_valid_end)),
step=step, batch_size=bs, mean=dg_train.mean, std=dg_train.std, shuffle=False)
```
%% Output
Loading data into RAM
%% Cell type:code id: tags:
``` python
dg_test = DataGenerator(hc_weekly.isel(number=0).sel(time=time_test)[v],
obs.sel(valid_time=hc_weekly.valid_time, method='nearest')[v].sel(time=time_test),
step, batch_size=bs, mean=dg_train.mean, std=dg_train.std, shuffle=False)
```
%% Output
Loading data into RAM
%% Cell type:code id: tags:
``` python
X, y = dg_valid[0]
X.shape, y.shape
```
%% Output
((32, 121, 240), (32, 121, 240))
%% Cell type:code id: tags:
``` python
# a short look into the training data: large biases
# any problem arising from the normalization?
i = 4
xr.DataArray(np.vstack([X[i],y[i]])).plot(yincrease=False, robust=True)
```
%% Cell type:markdown id: tags:
## `fit`
%% Cell type:code id: tags:
``` python
cnn = keras.models.Sequential([
PeriodicConv2D(filters=32, kernel_size=5, conv_kwargs={'activation':'relu'}, input_shape=(32, 64, 1)),
PeriodicConv2D(filters=1, kernel_size=5)
])
```
%% Cell type:code id: tags:
``` python
cnn.summary()
```
%% Output
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
periodic_conv2d (PeriodicCon (None, 32, 64, 32) 832
_________________________________________________________________
periodic_conv2d_1 (PeriodicC (None, 32, 64, 1) 801
=================================================================
Total params: 1,633
Trainable params: 1,633
Non-trainable params: 0
_________________________________________________________________
%% Cell type:code id: tags:
``` python
cnn.compile(keras.optimizers.Adam(1e-4), 'mse')
```
%% Cell type:code id: tags:
``` python
import warnings
warnings.simplefilter("ignore")
```
%% Cell type:code id: tags:
``` python
cnn.fit(dg_train, epochs=1, validation_data=dg_valid)
```
%% Output
30/30 [==============================] - 70s 2s/step - loss: 0.3419 - val_loss: 0.6589
<tensorflow.python.keras.callbacks.History at 0x7f094c44efd0>
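%% Cell type:markdown id: tags:
For a real submission one would train for more than one epoch; a sketch with early stopping on the validation loss (the callback settings are an assumption, not part of the template):
%% Cell type:code id: tags:
``` python
# sketch: longer training, stopping when the validation loss stops improving
early_stopping = keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)
cnn.fit(dg_train, epochs=20, validation_data=dg_valid, callbacks=[early_stopping])
```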
%% Cell type:markdown id: tags:
## `predict`
%% Cell type:code id: tags:
``` python
def create_predictions(model, dg, step):
    """Create non-iterative predictions."""
    preds = model.predict(dg).squeeze()
    # Unnormalize
    preds = preds * dg.std.values + dg.mean.values
    da = xr.DataArray(
        preds,
        dims=['time', 'latitude', 'longitude'],
        coords={'time': dg.ds.time, 'latitude': dg.ds.latitude, 'longitude': dg.ds.longitude},
    )
    da = da.assign_coords(step=step)
    return da
```
%% Cell type:code id: tags:
``` python
preds = create_predictions(cnn, dg_test, step)
```
%% Cell type:code id: tags:
``` python
preds.coords
```
%% Output
Coordinates:
* time (time) datetime64[ns] 2000-01-02 2000-01-09 ... 2000-12-31
* latitude (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
* longitude (longitude) float64 0.0 1.5 3.0 4.5 ... 354.0 355.5 357.0 358.5
surface int64 ...
step timedelta64[ns] 35 days
%% Cell type:markdown id: tags:
### Validate predictions
%% Cell type:code id: tags:
``` python
obs_test = obs.sel(valid_time=hc_weekly.valid_time, method='nearest')[v].sel(time=time_test)
```
%% Cell type:code id: tags:
``` python
rmse_ML = xs.rmse(preds, obs_test.sel(step=step), dim='time')
rmse_ML.plot(robust=True)
plt.title('RMSE ML predictions 2020')
```
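%% Cell type:markdown id: tags:
For a single summary number, the gridded RMSE can be reduced to an area-weighted global mean (a sketch; the cos(latitude) weighting is an assumption, not part of the template):
%% Cell type:code id: tags:
``` python
import numpy as np

weights = np.cos(np.deg2rad(rmse_ML.latitude))  # area weights on the regular lat/lon grid
rmse_global = rmse_ML.weighted(weights).mean(['latitude', 'longitude'])
print(float(rmse_global))
```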
%% Cell type:markdown id: tags:
#### predict over all steps
%% Cell type:code id: tags:
``` python
# this is not a useful forecast yet, but the results have the expected dimensions
# TODO: actually train for each step and use all members for training and validation
preds = []
for step in hc_weekly.step:
    dg_test = DataGenerator(
        hc_weekly.isel(number=0).sel(time=slice(time_test))[v], obs_test,
        step=step, batch_size=bs, mean=dg_train.mean, std=dg_train.std, shuffle=False)
    preds.append(create_predictions(cnn, dg_test, step))
preds = xr.concat(preds, 'step')
preds['step'] = hc_weekly.step
preds = preds.to_dataset(name=v)
```
%% Output
Loading data into RAM
Loading data into RAM
Loading data into RAM
Loading data into RAM
%% Cell type:code id: tags:
``` python
preds = preds.expand_dims('number').rename({'time':'forecast_reference_time'})
```
%% Cell type:code id: tags:
``` python
preds.coords
```
%% Output
Coordinates:
* forecast_reference_time (forecast_reference_time) datetime64[ns] 2000-01...
* latitude (latitude) float64 90.0 88.5 87.0 ... -88.5 -90.0
* longitude (longitude) float64 0.0 1.5 3.0 ... 357.0 358.5
surface int64 ...
* step (step) timedelta64[ns] 21 days 28 days ... 42 days
%% Cell type:code id: tags:
``` python
# todo: convert preds to preds_as_terciles
```
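%% Cell type:markdown id: tags:
A hedged sketch of the TODO above: mapping continuous predictions onto a 3-entry `category` dimension given per-gridpoint tercile edges (the `edges` argument is assumed to look like `ML_terciles` from the Misc section below; aligning its week-of-year index with the forecast dates is left out):
%% Cell type:code id: tags:
``` python
import xarray as xr

def to_terciles(da, edges):
    """Turn a deterministic forecast into 0/1 tercile 'probabilities' (sketch)."""
    low = edges.sel(category_edge=0.33, drop=True)
    high = edges.sel(category_edge=0.66, drop=True)
    below = da < low
    above = da > high
    normal = ~below & ~above
    return xr.concat([below, normal, above], dim='category').assign_coords(
        category=['below normal', 'near normal', 'above normal']).astype(float)
```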
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
# Submission
%% Cell type:code id: tags:
``` python
preds.sizes # expect: category(3), longitude, latitude, step(2), forecast_time (53)
```
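%% Cell type:markdown id: tags:
A small sanity check of the submission dimensions before writing the file (expected sizes taken from the comment above; the dimension names are assumptions and may need adjusting):
%% Cell type:code id: tags:
``` python
expected = {'step': 2, 'forecast_reference_time': 53}  # plus category(3) once terciles are added
for dim, size in expected.items():
    assert preds.sizes.get(dim) == size, f'unexpected size for {dim}: {preds.sizes.get(dim)}'
```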
%% Cell type:code id: tags:
``` python
format_bytes(preds.nbytes)
```
%% Cell type:code id: tags:
``` python
preds.to_netcdf('submissions/ML_prediction_2020.nc')
```
%% Cell type:code id: tags:
``` python
!git commit -m 'method name'
```
%% Cell type:code id: tags:
``` python
!git tag "predefined-tag-0.0.1" # if this is to be checked by scorer
!git push --tags
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
# Misc
%% Cell type:markdown id: tags:
## reforecasts for `ML_model`
- required for terciles
- dependent on `lead_time`/`step`
%% Cell type:code id: tags:
``` python
# run a reforecast based on the training data
# TODO: require cnn to do bi-weekly predictions; otherwise average step/lead_time bi-weekly
preds = create_predictions(cnn, dg_train, step)
```
%% Cell type:code id: tags:
``` python
quantile_kwargs={'q':[.33,.66], 'skipna':False}
```
%% Cell type:code id: tags:
``` python
ML_terciles = (
    preds.groupby('time.weekofyear')
    .quantile(dim=['time'], **quantile_kwargs)
    .rename({'quantile': 'category_edge', 'weekofyear': 'forecast_reference_time'})
    .compute()
)
ML_terciles.coords
```
%% Output
Coordinates:
* latitude (latitude) float64 90.0 88.5 87.0 ... -88.5 -90.0
* longitude (longitude) float64 0.0 1.5 3.0 ... 357.0 358.5
* category_edge (category_edge) float64 0.33 0.66
* forecast_reference_time (forecast_reference_time) int64 1 2 3 ... 51 52 53
%% Cell type:code id: tags:
``` python
#ML_terciles.isel(forecast_reference_time=[0,24]).plot(col='category_edge',row='forecast_reference_time')
```
%% Cell type:code id: tags:
``` python
format_bytes(ML_terciles.nbytes) # *2 for variable; *2 for steps
```
%% Output
'24.63 MB'
%% Cell type:code id: tags:
``` python
ML_terciles.to_netcdf('ML_terciles.nc')
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
```