{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Train ML model for predictions of weeks 3-4 & 5-6\n", "\n", "This notebook creates a Machine Learning `ML_model` to predict weeks 3-4 & 5-6 based on `S2S` weeks 3-4 & 5-6 forecasts and compares its predictions to `CPC` observations for the [`s2s-ai-challenge`](https://s2s-ai-challenge.github.io/)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Synopsis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Method: `name`\n", "\n", "- description\n", "- a few details" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data used\n", "\n", "Training-input for Machine Learning model:\n", "- renku datasets, climetlab, IRIDL\n", "\n", "Forecast-input for Machine Learning model:\n", "- renku datasets, climetlab, IRIDL\n", "\n", "Compare Machine Learning model forecast against ground truth:\n", "- renku datasets, climetlab, IRIDL" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Resources used\n", "for training, see details in section Reproducibility\n", "\n", "- platform: renku\n", "- memory: 8 GB\n", "- processors: 2 CPU\n", "- storage required: 10 GB" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Safeguards\n", "\n", "All points have to be [x] checked. 
If not, your submission is invalid.\n", "\n", "Changes to the code after submission are not possible, as the `commit` before the `tag` will be reviewed.\n", "(Only in exceptional cases, and if previous effort towards reproducibility is evident, improving readability and reproducibility may be allowed after November 1st 2021.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Safeguards to prevent [overfitting](https://en.wikipedia.org/wiki/Overfitting?wprov=sfti1) \n", "\n", "If the organizers suspect overfitting, your contribution can be disqualified.\n", "\n", " - [ ] We didn't use 2020 observations in training (explicit overfitting and cheating)\n", " - [ ] We didn't repeatedly verify our model on 2020 observations and incrementally improve our RPSS (implicit overfitting)\n", " - [ ] We provide RPSS scores for the training period with script `skill_by_year`, see section 6.3 `predict`.\n", " - [ ] We tried our best to prevent [data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)?wprov=sfti1).\n", " - [ ] We honor the `train-validate-test` [split principle](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets). This means that the hindcast data is split into `train` and `validate`, whereas `test` is withheld.\n", " - [ ] We did not use `test` explicitly in training or implicitly in incrementally adjusting parameters.\n", " - [ ] We considered [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics))." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Safeguards for Reproducibility\n", "Notebook/code must be independently reproducible from scratch by the organizers (after the competition). If this is not possible: no prize.\n", " - [ ] All training data is publicly available (no pre-trained private neural networks, as they are not reproducible for us)\n", " - [ ] Code is well documented, readable and reproducible.\n", " - [ ] Code to reproduce training and predictions should run within a day on the described architecture. If the training takes longer than a day, please justify why this is needed. Please do not submit training pipelines that take weeks to train." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Todos to improve template\n", "\n", "This is just a demo.\n", "\n", "- [ ] for both variables\n", "- [ ] for both `lead_time`s\n", "- [ ] ensure probabilistic prediction outcome with `category` dim" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Imports" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from tensorflow import keras\n", "from tensorflow.keras.layers import Input, Dense, Flatten\n", "from tensorflow.keras.models import Sequential\n", "\n", "import matplotlib.pyplot as plt\n", "\n", "import xarray as xr\n", "xr.set_options(display_style='text')\n", "\n", "from dask.utils import format_bytes\n", "import xskillscore as xs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Get training data\n", "\n", "Preprocessing of input data may be done in a separate notebook/script." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Hindcast\n", "\n", "Get weekly initialized hindcasts." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# consider renku datasets\n", "#! 
renku storage pull path" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Observations\n", "corresponding to hindcasts" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# consider renku datasets\n", "#! renku storage pull path" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# ML model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n", "bs = 32  # batch size\n", "\n", "class DataGenerator(keras.utils.Sequence):\n", "    def __init__(self, fct, verif, batch_size=bs, shuffle=True, load=True):\n", "        \"\"\"\n", "        Data generator\n", "\n", "        Template from https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly\n", "\n", "        Args (the names `fct` and `verif` are placeholders, adapt them to your data):\n", "            fct: forecasts as xr.DataArray with a `time` dimension, used as input X\n", "            verif: observations as xr.DataArray matching `fct` along `time`, used as target y\n", "            batch_size: number of samples per batch\n", "            shuffle: if True, shuffle the sample order after each epoch\n", "            load: if True, load the forecasts into RAM\n", "        \"\"\"\n", "        super().__init__()\n", "        self.data = fct\n", "        self.verif_data = verif\n", "        self.batch_size = batch_size\n", "        self.shuffle = shuffle\n", "        self.n_samples = self.data.sizes['time']\n", "\n", "        self.on_epoch_end()\n", "\n", "        # For some weird reason calling .load() earlier messes up the mean and std computations\n", "        if load:\n", "            print('Loading data into RAM')\n", "            self.data.load()\n", "\n", "    def __len__(self):\n", "        'Denotes the number of batches per epoch'\n", "        return int(np.ceil(self.n_samples / self.batch_size))\n", "\n", "    def __getitem__(self, i):\n", "        'Generate one batch of data'\n", "        idxs = self.idxs[i * self.batch_size:(i + 1) * self.batch_size]\n", "        # replace NaNs by 0, otherwise the whole batch becomes NaN\n", "        X = self.data.isel(time=idxs).fillna(0.).values\n", "        y = self.verif_data.isel(time=idxs).fillna(0.).values\n", "        return X, y\n", "\n", "    def on_epoch_end(self):\n", "        'Updates indexes after each epoch'\n", "        self.idxs = np.arange(self.n_samples)\n", "        if self.shuffle:\n", "            np.random.shuffle(self.idxs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## data prep: train, valid, test" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# time is the 
forecast_reference_time\n", "time_train_start, time_train_end = '2000', '2017'\n", "time_valid_start, time_valid_end = '2018', '2019'\n", "time_test = '2020'" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# todo: provide hindcasts and observations from time_train_start to time_train_end\n", "dg_train = DataGenerator()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# todo: provide hindcasts and observations from time_valid_start to time_valid_end\n", "dg_valid = DataGenerator()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# todo: provide forecasts and observations for time_test\n", "dg_test = DataGenerator()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## `fit`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cnn = keras.models.Sequential([])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cnn.summary()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cnn.compile(keras.optimizers.Adam(1e-4), 'mse')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "warnings.simplefilter(\"ignore\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cnn.fit(dg_train, epochs=1, validation_data=dg_valid)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## `predict`\n", "\n", "Create predictions and print `mean(variable, lead_time, longitude, weighted latitude)` RPSS for all years as calculated by `skill_by_year`. For now this is RPS, todo: change to RPSS." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scripts import skill_by_year" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def create_predictions(model, dg):\n", "    \"\"\"Create non-iterative predictions\"\"\"\n", "    preds = model.predict(dg).squeeze()\n", "    # todo: transform the numpy output into an xarray object with labeled dimensions\n", "\n", "    return preds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `predict` training period in-sample" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "preds_is = create_predictions(cnn, dg_train)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "skill_by_year(preds_is)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `predict` valid out-of-sample" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "preds_os = create_predictions(cnn, dg_valid)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "skill_by_year(preds_os)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `predict` test" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "preds_test = create_predictions(cnn, dg_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "skill_by_year(preds_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Submission" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "preds_test.sizes # expect: category(3), longitude, latitude, lead_time(2), forecast_time (53)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from scripts import assert_predictions_2020\n", "assert_predictions_2020(preds_test)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, 
"outputs": [], "source": [ "preds_test.to_netcdf('../submissions/ML_prediction_2020.nc')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#!git add ../submissions/ML_prediction_2020.nc" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#!git commit -m \"commit submission for my_method_name\" # whatever message you want" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#!git tag \"submission-my_method_name-0.0.1\" # if this is to be checked by scorer, only the last submitted==tagged version will be considered" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#!git push --tags" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Reproducibility" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## memory" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# https://phoenixnap.com/kb/linux-commands-check-memory-usage\n", "!free -g" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## CPU" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!lscpu" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## software" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!conda list" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.6" }, 
"toc-autonumbering": true }, "nbformat": 4, "nbformat_minor": 4 }