ML_forecast_template.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Train ML model for predictions of week 3-4 & 5-6\n",
    "\n",
    "This notebook create a Machine Learning `ML_model` to predict weeks 3-4 & 5-6 based on `S2S` weeks 3-4 & 5-6 forecasts and is compared to `CPC` observations for the [`s2s-ai-challenge`](https://s2s-ai-challenge.github.io/)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Synopsis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Method: `name`\n",
    "\n",
    "- decription\n",
    "- a few details"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data used\n",
    "\n",
    "Training-input for Machine Learning model:\n",
    "- renku datasets, climetlab, IRIDL\n",
    "\n",
    "Forecast-input for Machine Learning model:\n",
    "- renku datasets, climetlab, IRIDL\n",
    "\n",
    "Compare Machine Learning model forecast against ground truth:\n",
    "- renku datasets, climetlab, IRIDL"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Resources used\n",
    "for training, details in reproducibility\n",
    "\n",
    "- platform: renku\n",
    "- memory: 8 GB\n",
    "- processors: 2 CPU\n",
    "- storage required: 10 GB"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Safeguards\n",
    "\n",
    "All points have to be [x] checked. If not, your submission is invalid.\n",
    "\n",
    "Changes to the code after submissions are not possible, as the `commit` before the `tag` will be reviewed.\n",
    "(Only in exceptions and if previous effort in reproducibility can be found, it may be allowed to improve readability and reproducibility after November 1st 2021.)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Safeguards to prevent [overfitting](https://en.wikipedia.org/wiki/Overfitting?wprov=sfti1) \n",
    "\n",
    "If the organizers suspect overfitting, your contribution can be disqualified.\n",
    "\n",
    "  - [ ] We didnt use 2020 observations in training (explicit overfitting and cheating)\n",
    "  - [ ] We didnt repeatedly verify my model on 2020 observations and incrementally improved my RPSS (implicit overfitting)\n",
    "  - [ ] We provide RPSS scores for the training period with script `skill_by_year`, see in section 6.3 `predict`.\n",
    "  - [ ] We tried our best to prevent [data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)?wprov=sfti1).\n",
    "  - [ ] We honor the `train-validate-test` [split principle](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets). This means that the hindcast data is split into `train` and `validate`, whereas `test` is withheld.\n",
    "  - [ ] We did use `test` explicitly in training or implicitly in incrementally adjusting parameters.\n",
    "  - [ ] We considered [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics))."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Safeguards for Reproducibility\n",
    "Notebook/code must be independently reproducible from scratch by the organizers (after the competition), if not possible: no prize\n",
    "  - [ ] All training data is publicly available (no pre-trained private neural networks, as they are not reproducible for us)\n",
    "  - [ ] Code is well documented, readable and reproducible.\n",
    "  - [ ] Code to reproduce training and predictions should run within a day on the described architecture. If the training takes longer than a day, please justify why this is needed. Please do not submit training piplelines, which take weeks to train."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Todos to improve template\n",
    "\n",
    "This is just a demo.\n",
    "\n",
    "- [ ] for both variables\n",
    "- [ ] for both `lead_time`s\n",
    "- [ ] ensure probabilistic prediction outcome with `category` dim"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from tensorflow.keras.layers import Input, Dense, Flatten\n",
    "from tensorflow.keras.models import Sequential\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "import xarray as xr\n",
    "xr.set_options(display_style='text')\n",
    "\n",
    "from dask.utils import format_bytes\n",
    "import xskillscore as xs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Get training data\n",
    "\n",
    "preprocessing of input data may be done in separate notebook/script"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Hindcast\n",
    "\n",
    "get weekly initialized hindcasts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# consider renku datasets\n",
    "#! renku storage pull path"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Observations\n",
    "corresponding to hindcasts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# consider renku datasets\n",
    "#! renku storage pull path"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# ML model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bs=32\n",
    "\n",
    "import numpy as np\n",
    "class DataGenerator(keras.utils.Sequence):\n",
    "    def __init__(self):\n",
    "        \"\"\"\n",
    "        Data generator\n",
    "        \n",
    "        Template from https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly\n",
    "        \n",
    "        Args:\n",
    "            \n",
    "        \"\"\"\n",
    "\n",
    "        self.on_epoch_end()\n",
    "\n",
    "        # For some weird reason calling .load() earlier messes up the mean and std computations\n",
    "        if load: print('Loading data into RAM'); self.data.load()\n",
    "\n",
    "    def __len__(self):\n",
    "        'Denotes the number of batches per epoch'\n",
    "        return int(np.ceil(self.n_samples / self.batch_size))\n",
    "\n",
    "    def __getitem__(self, i):\n",
    "        'Generate one batch of data'\n",
    "        idxs = self.idxs[i * self.batch_size:(i + 1) * self.batch_size]\n",
    "        # got all nan if nans not masked\n",
    "        X = self.data.isel(time=idxs).fillna(0.).values\n",
    "        y = self.verif_data.isel(time=idxs).fillna(0.).values\n",
    "        return X, y\n",
    "\n",
    "    def on_epoch_end(self):\n",
    "        'Updates indexes after each epoch'\n",
    "        self.idxs = np.arange(self.n_samples)\n",
    "        if self.shuffle == True:\n",
    "            np.random.shuffle(self.idxs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## data prep: train, valid, test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# time is the forecast_reference_time\n",
    "time_train_start,time_train_end='2000','2017'\n",
    "time_valid_start,time_valid_end='2018','2019'\n",
    "time_test = '2020'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dg_train = DataGenerator()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dg_valid = DataGenerator()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dg_test = DataGenerator()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## `fit`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cnn = keras.models.Sequential([])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cnn.summary()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cnn.compile(keras.optimizers.Adam(1e-4), 'mse')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.simplefilter(\"ignore\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cnn.fit(dg_train, epochs=1, validation_data=dg_valid)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## `predict`\n",
    "\n",
    "Create predictions and print `mean(variable, lead_time, longitude, weighted latitude)` RPSS for all years as calculated by `skill_by_year`. For now RPS, todo: change to RPSS."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scripts import skill_by_year"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_predictions(model, dg):\n",
    "    \"\"\"Create non-iterative predictions\"\"\"\n",
    "    preds = model.predict(dg).squeeze()\n",
    "    # transform\n",
    "    \n",
    "    return preds"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `predict` training period in-sample"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "preds_is = create_predictions(cnn, dg_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "skill_by_year(preds_is)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `predict` valid out-of-sample"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "preds_os = create_predictions(cnn, dg_valid)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "skill_by_year(preds_os)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `predict` test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "preds_test = create_predictions(cnn, dg_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "skill_by_year(preds_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Submission"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "preds_test.sizes # expect: category(3), longitude, latitude, lead_time(2), forecast_time (53)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scripts import assert_predictions_2020\n",
    "assert_predictions_2020(preds_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "preds_test.to_netcdf('../submissions/ML_prediction_2020.nc')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#!git add ../submissions/ML_prediction_2020.nc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#!git commit -m \"commit submission for my_method_name\" # whatever message you want"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#!git tag \"submission-my_method_name-0.0.1\" # if this is to be checked by scorer, only the last submitted==tagged version will be considered"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#!git push --tags"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Reproducibility"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## memory"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# https://phoenixnap.com/kb/linux-commands-check-memory-usage\n",
    "!free -g"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## CPU"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!lscpu"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## software"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!conda list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.6"
  },
  "toc-autonumbering": true
 },
 "nbformat": 4,
 "nbformat_minor": 4
}