# Train ML model for predictions of week 3-4 & 5-6

This notebook creates a machine learning model, `ML_model`, that predicts weeks 3-4 & 5-6 from `S2S` forecasts for the same lead times. The predictions are compared to `CPC` observations for the [`s2s-ai-challenge`](https://s2s-ai-challenge.github.io/).

# Synopsis

## Method: `name`

- description
- a few details

## Data used

Training-input for Machine Learning model:
- renku datasets, climetlab, IRIDL

Forecast-input for Machine Learning model:
- renku datasets, climetlab, IRIDL

Compare Machine Learning model forecast against ground truth:
- renku datasets, climetlab, IRIDL

## Resources used
Resources used for training; see the Reproducibility section for details.

- platform: renku
- memory: 8 GB
- processors: 2 CPU
- storage required: 10 GB

## Safeguards

All points have to be [x] checked. If not, your submission is invalid.

Changes to the code after submission are not possible, as the `commit` before the `tag` will be reviewed.
(Exceptions may be granted to improve readability and reproducibility after November 1st 2021, but only if a previous effort towards reproducibility is evident.)

### Safeguards to prevent [overfitting](https://en.wikipedia.org/wiki/Overfitting?wprov=sfti1) 

If the organizers suspect overfitting, your contribution can be disqualified.

  - [ ] We did not use 2020 observations in training (explicit overfitting and cheating)
  - [ ] We did not repeatedly verify our model on 2020 observations and incrementally improve our RPSS (implicit overfitting)
  - [ ] We provide RPSS scores for the training period with script `skill_by_year`, see section 6.3 `predict`.
  - [ ] We tried our best to prevent [data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)?wprov=sfti1).
  - [ ] We honor the `train-validate-test` [split principle](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets). This means that the hindcast data is split into `train` and `validate`, whereas `test` is withheld.
  - [ ] We did not use `test` explicitly in training or implicitly in incrementally adjusting parameters.
  - [ ] We considered [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)).

### Safeguards for Reproducibility
The notebook/code must be independently reproducible from scratch by the organizers (after the competition). If this is not possible, no prize will be awarded.
  - [ ] All training data is publicly available (no pre-trained private neural networks, as they are not reproducible for us)
  - [ ] Code is well documented, readable and reproducible.
  - [ ] Code to reproduce training and predictions should run within a day on the described architecture. If the training takes longer than a day, please justify why this is needed. Please do not submit training pipelines that take weeks to train.

# Todos to improve template

This is just a demo.

- [ ] for both variables
- [ ] for both `lead_time`s
- [ ] ensure probabilistic prediction outcome with `category` dim
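
One way to obtain a `category` dimension is to count, per grid point, the fraction of ensemble members that fall below, between, and above the tercile edges. The sketch below is a minimal NumPy illustration under that assumption; the function name `tercile_probabilities`, the array shapes, and the edge values are hypothetical and not part of this template.

```python
import numpy as np


def tercile_probabilities(members, lower_edge, upper_edge):
    """Turn an ensemble into below/near/above-normal probabilities.

    members: array shaped (member, ...), a hypothetical ensemble of predictions
    lower_edge, upper_edge: tercile edges broadcastable against members[0]
    Returns an array shaped (category=3, ...).
    """
    below = (members < lower_edge).mean(axis=0)
    above = (members > upper_edge).mean(axis=0)
    near = 1.0 - below - above  # remaining probability mass
    return np.stack([below, near, above])


# Toy example: 4 members at 2 grid points, made-up edges.
members = np.array([[1.0, 5.0], [2.0, 6.0], [3.0, 7.0], [4.0, 8.0]])
probs = tercile_probabilities(members, lower_edge=1.5, upper_edge=3.5)
print(probs.shape)        # (3, 2)
print(probs.sum(axis=0))  # each grid point sums to 1
```

In practice the edges would come from the observed climatology and would vary by grid point and time of year; scalars are used here only to keep the example short.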

# Imports

In [None]:
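# `keras` is referenced directly below (keras.utils.Sequence, keras.optimizers.Adam)
from tensorflow import keras
from tensorflow.keras.layers import Input, Dense, Flatten
from tensorflow.keras.models import Sequential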

import matplotlib.pyplot as plt

import xarray as xr
xr.set_options(display_style='text')

from dask.utils import format_bytes
import xskillscore as xs

# Get training data

Preprocessing of input data may be done in a separate notebook/script.

## Hindcast

Get weekly initialized hindcasts.

In [None]:
# consider renku datasets
#! renku storage pull path

## Observations
Get the observations corresponding to the hindcasts.

In [None]:
# consider renku datasets
#! renku storage pull path

# ML model

In [None]:
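bs=32

import numpy as np
class DataGenerator(keras.utils.Sequence):
    # NOTE: the argument names below are placeholders; adapt them to your data
    def __init__(self, data, verif_data, batch_size=bs, shuffle=True, load=True):
        """
        Data generator
        
        Template from https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly
        
        Args:
            data: forecast input with a `time` dimension (xr.DataArray)
            verif_data: observations aligned with `data` along `time`
            batch_size: number of samples per batch
            shuffle: if True, reshuffle the sample order after each epoch
            load: if True, load `data` into RAM
        """
        super().__init__()
        self.data = data
        self.verif_data = verif_data
        self.batch_size = batch_size
        self.shuffle = shuffle
        # assumption: one sample per `time` step
        self.n_samples = self.data.sizes['time']

        self.on_epoch_end()

        # For some weird reason calling .load() earlier messes up the mean and std computations
        if load:
            print('Loading data into RAM')
            self.data.load()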

    def __len__(self):
        'Denotes the number of batches per epoch'
        return int(np.ceil(self.n_samples / self.batch_size))

    def __getitem__(self, i):
        'Generate one batch of data'
        idxs = self.idxs[i * self.batch_size:(i + 1) * self.batch_size]
        # got all nan if nans not masked
        X = self.data.isel(time=idxs).fillna(0.).values
        y = self.verif_data.isel(time=idxs).fillna(0.).values
        return X, y

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.idxs = np.arange(self.n_samples)
        if self.shuffle:
            np.random.shuffle(self.idxs)
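
The batching arithmetic used by `__len__` and `__getitem__` can be checked in isolation. The sketch below reproduces it with plain NumPy; the sample count of 70 is a made-up value chosen so that the last batch is shorter than the others.

```python
import numpy as np

n_samples, batch_size = 70, 32  # hypothetical sizes for illustration

# Same formula as DataGenerator.__len__: round up so no sample is dropped.
n_batches = int(np.ceil(n_samples / batch_size))

# Same slicing as DataGenerator.__getitem__ for each batch index i.
idxs = np.arange(n_samples)
batches = [idxs[i * batch_size:(i + 1) * batch_size] for i in range(n_batches)]

print(n_batches)                  # 3
print([len(b) for b in batches])  # [32, 32, 6]
```

Because the final batch may be smaller than `batch_size`, any downstream code should not assume a fixed leading dimension.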

## data prep: train, valid, test

In [None]:
# time is the forecast_reference_time
time_train_start,time_train_end='2000','2017'
time_valid_start,time_valid_end='2018','2019'
time_test = '2020'
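
The year boundaries above translate into three non-overlapping subsets of the `forecast_reference_time` axis. The sketch below illustrates this with a made-up weekly time axis and NumPy boolean masks; the start date and variable names are assumptions for illustration only.

```python
import numpy as np

# Hypothetical weekly forecast_reference_time axis covering 2000-2020.
time = np.arange(
    np.datetime64("2000-01-06"), np.datetime64("2021-01-01"), np.timedelta64(7, "D")
)
years = time.astype("datetime64[Y]").astype(int) + 1970

train_mask = (years >= 2000) & (years <= 2017)
valid_mask = (years >= 2018) & (years <= 2019)
test_mask = years == 2020

# Every time step must belong to exactly one split.
membership = train_mask.astype(int) + valid_mask.astype(int) + test_mask.astype(int)
print(bool(np.all(membership == 1)))  # True
```

With an xarray object, the same split is usually written as `ds.sel(time=slice(time_train_start, time_train_end))`, which selects both end years inclusively.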

In [None]:
dg_train = DataGenerator()

In [None]:
dg_valid = DataGenerator()

In [None]:
dg_test = DataGenerator()

## `fit`

In [None]:
cnn = keras.models.Sequential([])

In [None]:
cnn.summary()

In [None]:
cnn.compile(keras.optimizers.Adam(1e-4), 'mse')

In [None]:
import warnings
warnings.simplefilter("ignore")

In [None]:
cnn.fit(dg_train, epochs=1, validation_data=dg_valid)

## `predict`

Create predictions and print the RPSS, averaged as `mean(variable, lead_time, longitude, weighted latitude)`, for all years as calculated by `skill_by_year`. For now, RPS is reported; todo: change to RPSS.
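
To make the difference between RPS and RPSS concrete, the sketch below computes both for tercile probabilities with plain NumPy. It is not the implementation behind `skill_by_year`: the function names, the `(category, latitude, longitude)` array layout, the equal-thirds climatology reference, and the cosine-latitude weighting are assumptions chosen for illustration.

```python
import numpy as np


def rps(prob, obs_onehot, category_axis=0):
    """Ranked probability score: squared distance between cumulative distributions."""
    prob_cdf = np.cumsum(prob, axis=category_axis)
    obs_cdf = np.cumsum(obs_onehot, axis=category_axis)
    return np.sum((prob_cdf - obs_cdf) ** 2, axis=category_axis)


def weighted_rpss(prob, obs_onehot, lat):
    """RPSS against a 1/3-1/3-1/3 climatology with cos(latitude) weights.

    prob, obs_onehot: arrays shaped (category, latitude, longitude)
    lat: latitudes in degrees, shaped (latitude,)
    """
    clim = np.full_like(prob, 1.0 / 3.0)
    weights = np.cos(np.deg2rad(lat))[:, np.newaxis]
    weights = np.broadcast_to(weights, prob.shape[1:])
    rps_fct = np.average(rps(prob, obs_onehot), weights=weights)
    rps_clim = np.average(rps(clim, obs_onehot), weights=weights)
    return 1.0 - rps_fct / rps_clim


# Toy data: the upper tercile is observed at both grid points.
lat = np.array([0.0, 60.0])
obs_onehot = np.zeros((3, 2, 1))
obs_onehot[2] = 1.0

perfect = obs_onehot.copy()
climatology = np.full((3, 2, 1), 1.0 / 3.0)

print(weighted_rpss(perfect, obs_onehot, lat))      # 1.0
print(weighted_rpss(climatology, obs_onehot, lat))  # 0.0
```

RPS is negatively oriented (lower is better), whereas RPSS is positively oriented: 1 is a perfect forecast, 0 matches the reference, and negative values are worse than the reference.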

In [None]:
from scripts import skill_by_year

In [None]:
def create_predictions(model, dg):
    """Create non-iterative predictions"""
    preds = model.predict(dg).squeeze()
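    # transform the raw array into an xarray object with named dimensions
    # (required by `.sizes` and `.to_netcdf` in the Submission section)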
    
    return preds

### `predict` training period in-sample

In [None]:
preds_is = create_predictions(cnn, dg_train)

In [None]:
skill_by_year(preds_is)

### `predict` valid out-of-sample

In [None]:
preds_os = create_predictions(cnn, dg_valid)

In [None]:
skill_by_year(preds_os)

### `predict` test

In [None]:
preds_test = create_predictions(cnn, dg_test)

In [None]:
skill_by_year(preds_test)

# Submission

In [None]:
preds_test.sizes # expect: category(3), longitude, latitude, lead_time(2), forecast_time (53)
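
The expected sizes noted in the comment above can also be checked programmatically before running `assert_predictions_2020`. The helper below is a hypothetical sketch that works on a plain mapping such as `dict(preds_test.sizes)`; it complements the official check and does not replace it, and the grid sizes in the example are made up.

```python
EXPECTED_SIZES = {"category": 3, "lead_time": 2, "forecast_time": 53}
REQUIRED_DIMS = ("longitude", "latitude")


def find_size_problems(sizes):
    """Return a list of human-readable problems; empty if the sizes look right."""
    problems = []
    for dim, expected in EXPECTED_SIZES.items():
        if dim not in sizes:
            problems.append(f"missing dimension {dim!r}")
        elif sizes[dim] != expected:
            problems.append(f"{dim!r} has size {sizes[dim]}, expected {expected}")
    for dim in REQUIRED_DIMS:
        if dim not in sizes:
            problems.append(f"missing dimension {dim!r}")
    return problems


# Made-up sizes standing in for `dict(preds_test.sizes)`.
good = {"category": 3, "longitude": 240, "latitude": 121, "lead_time": 2, "forecast_time": 53}
bad = {"category": 3, "longitude": 240, "latitude": 121, "lead_time": 1}

print(find_size_problems(good))  # []
print(find_size_problems(bad))
```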

In [None]:
from scripts import assert_predictions_2020
assert_predictions_2020(preds_test)

In [None]:
preds_test.to_netcdf('../submissions/ML_prediction_2020.nc')

In [None]:
#!git add ../submissions/ML_prediction_2020.nc

In [None]:
#!git commit -m "commit submission for my_method_name" # whatever message you want

In [None]:
#!git tag "submission-my_method_name-0.0.1" # if this is to be checked by scorer, only the last submitted==tagged version will be considered

In [None]:
#!git push --tags

# Reproducibility

## memory

In [None]:
# https://phoenixnap.com/kb/linux-commands-check-memory-usage
!free -g

## CPU

In [None]:
!lscpu

## software

In [None]:
!conda list