From 1484cce0a3b7620879065269625d2db206ee3316 Mon Sep 17 00:00:00 2001
From: AS <aaron.spring@mpimet.mpg.de>
Date: Mon, 10 May 2021 15:19:47 +0200
Subject: [PATCH] git lfs track submissions/*.nc and ML_forecast template

---
 notebooks/ML_forecast.ipynb | 1003 +++++++++++++++++++++++++++++++++++
 submissions/.gitattributes  |    1 +
 2 files changed, 1004 insertions(+)
 create mode 100644 notebooks/ML_forecast.ipynb
 create mode 100644 submissions/.gitattributes

diff --git a/notebooks/ML_forecast.ipynb b/notebooks/ML_forecast.ipynb
new file mode 100644
index 0000000..575ffe5
--- /dev/null
+++ b/notebooks/ML_forecast.ipynb
@@ -0,0 +1,1003 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Train ML model to correct predictions of week 3-4 & 5-6\n",
+    "\n",
+    "This notebook creates a Machine Learning model (`ML_model`) to predict weeks 3-4 & 5-6 based on `S2S` weeks 3-4 & 5-6 forecasts; the predictions are compared to `CPC` observations for the [`s2s-ai-challenge`](https://s2s-ai-challenge.github.io/)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Synopsis"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Data used\n",
+    "\n",
+    "Training input for the Machine Learning model:\n",
+    "- hindcasts of models: ECMWF\n",
+    "\n",
+    "Forecast input for the Machine Learning model:\n",
+    "- real-time 2020 forecasts of the same models\n",
+    "\n",
+    "The Machine Learning model forecast is compared against:\n",
+    "- `CPC` observations 2020"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Method: (`name`) mean bias reduction\n",
+    "\n",
+    "- calculate the mean bias over 2000-2019\n",
+    "- remove this bias from the 2020 forecast"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Resources used\n",
+    "\n",
+    "For training:\n",
+    "\n",
+    "- platform: renku\n",
+    "- memory: 8 GB\n",
+    "- processors: 2 CPU\n",
+    "- storage required: 10 GB"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Safeguards\n",
+    "\n",
+    "All points have to be checked `[x]`. If not, your submission is invalid.\n",
+    "\n",
+    "Changes to the code after submission are not possible, as the `commit` before the `tag` will be reviewed.\n",
+    "(Only in exceptional cases, and only where prior effort toward reproducibility is evident, may improvements to readability and reproducibility be allowed after November 1st 2021.)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Safeguards to prevent [overfitting](https://en.wikipedia.org/wiki/Overfitting?wprov=sfti1)\n",
+    "\n",
+    "If the organizers suspect overfitting, your contribution can be disqualified.\n",
+    "\n",
+    " - [ ] We did not use 2020 observations in training (explicit overfitting and cheating).\n",
+    " - [ ] We did not repeatedly verify our model on 2020 observations and incrementally improve our RPSS (implicit overfitting).\n",
+    " - [ ] We tried our best to prevent [data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)?wprov=sfti1).\n",
+    " - [ ] We strictly honor the `train-validate-test` [split principle](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets). This means that the hindcast data is split into `train` and `validate`, whereas `test` is withheld.\n",
+    " - [ ] We did not use `test` explicitly in training or implicitly in incrementally adjusting parameters.\n",
+    " - [ ] We considered [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) (an illustrative sketch follows below)."
+   ]
+  },
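+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The last checklist item mentions cross-validation. The following cell is a minimal, illustrative sketch of a leave-one-year-out split over the hindcast years (added for clarity; it is not executed and not used elsewhere in this notebook)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# illustrative sketch only: leave-one-year-out cross-validation over the\n",
+    "# hindcast period, so hyperparameters can be tuned without touching 2020\n",
+    "import numpy as np\n",
+    "\n",
+    "years = np.arange(2000, 2020)  # hindcast years; 2020 stays withheld as `test`\n",
+    "for held_out in years:\n",
+    "    train_years = [y for y in years if y != held_out]\n",
+    "    # fit on train_years, validate on held_out, then average the scores\n",
+    "    pass  # model fitting omitted in this sketch"
+   ]
+  },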
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Safeguards for Reproducibility\n",
+    "\n",
+    "Notebook/code must be independently reproducible from scratch by the organizers (after the competition); if this is not possible, no prize can be awarded.\n",
+    " - [ ] All training data is publicly available (no pre-trained private neural networks, as they are not reproducible for us).\n",
+    " - [ ] Code is well documented, readable and reproducible.\n",
+    " - [ ] Code to reproduce runs within a day."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Todos to improve template\n",
+    "\n",
+    "This is just a demo.\n",
+    "\n",
+    "- [ ] for both variables\n",
+    "- [ ] for both `lead_time`s\n",
+    "- [ ] ensure probabilistic prediction outcome with `category` dim"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Imports"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from tensorflow.keras.layers import Input, Dense, Flatten\n",
+    "from tensorflow.keras.models import Sequential\n",
+    "\n",
+    "import matplotlib.pyplot as plt\n",
+    "\n",
+    "import xarray as xr\n",
+    "xr.set_options(display_style='text')\n",
+    "\n",
+    "from dask.utils import format_bytes\n",
+    "import xskillscore as xs"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Get training data\n",
+    "\n",
+    "Preprocessing of the input data may be done in a separate notebook/script."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Hindcast\n",
+    "\n",
+    "Get weekly initialized hindcasts."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "v = 'tp'  # target variable: total precipitation"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "\u001b[33m\u001b[1mWarning: \u001b[0mRun CLI commands only from project's root directory.\n",
+      "\u001b[0m\n"
+     ]
+    }
+   ],
+   "source": [
+    "# preprocessed as renku dataset\n",
+    "!renku storage pull ../data/ECMWF_hc_tp_weekly_2000_2019.zarr/"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "hc_weekly = xr.open_zarr('../data/ECMWF_hc_tp_weekly_2000_2019.zarr')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def add_time_from_forecast_reference_time_and_step(benchmark, init_dim='time'):\n",
+    "    \"\"\"Add a valid_time(time, step) coordinate: valid_time = forecast_reference_time + step.\n",
+    "    \n",
+    "    step: pd.Timedelta\n",
+    "    init_dim (forecast_reference_time): datetime\n",
+    "    \"\"\"\n",
+    "    times = xr.concat(\n",
+    "        [\n",
+    "            xr.DataArray(\n",
+    "                benchmark[init_dim] + step,\n",
+    "                dims=init_dim,\n",
+    "                coords={init_dim: benchmark[init_dim]},\n",
+    "            )\n",
+    "            for step in benchmark.step\n",
+    "        ],\n",
+    "        dim=\"step\",\n",
+    "        join=\"inner\",\n",
+    "        compat=\"broadcast_equals\",\n",
+    "    )\n",
+    "    benchmark = benchmark.assign_coords(valid_time=times)\n",
+    "    return benchmark"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "hc_weekly = add_time_from_forecast_reference_time_and_step(hc_weekly)"
+   ]
+  },
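+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A quick sanity check (added; not part of the original template): `valid_time` should equal the forecast reference time plus the step."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# added check: valid_time == forecast_reference_time + step (one sample)\n",
+    "(hc_weekly.valid_time.isel(time=0, step=0).values\n",
+    " == (hc_weekly.time + hc_weekly.step).isel(time=0, step=0).values)"
+   ]
+  },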
"\u001b[33m\u001b[1mWarning: \u001b[0mRun CLI commands only from project's root directory.\n", + "\u001b[0m\n" + ] + } + ], + "source": [ + "# as prepared renku datasets FIXME\n", + "!renku storage pull ../data/cpc-rain-1998-2020-weekly-averaged-1.5-deg/rain_verification_1998_2020.nc" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "obs = xr.open_dataset(f'../data/cpc-rain-1998-2020-weekly-averaged-1.5-deg/rain_verification_1998_2020.nc', chunks={}).rename({'rain':v,'time':'valid_time'})" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# ML model" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "based on [Weatherbench](https://github.com/pangeo-data/WeatherBench/blob/master/quickstart.ipynb)" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "import sys\n", + "sys.path.insert(1, '/work/s2s-ai-competition-bootstrap/WeatherBench')\n", + "from src.train_nn import DataGenerator, PeriodicConv2D, create_predictions\n", + "import tensorflow.keras as keras" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "bs=32\n", + "\n", + "import numpy as np\n", + "class DataGenerator(keras.utils.Sequence):\n", + " def __init__(self, ds, verif, step, batch_size=bs, shuffle=True, load=True, mean=None, std=None):\n", + " \"\"\"\n", + " Data generator for WeatherBench data.\n", + " Template from https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly\n", + " Args:\n", + " ds: model\n", + " verif: obs\n", + " step: Lead_time/step as in model\n", + " batch_size: Batch size\n", + " shuffle: bool. If True, data is shuffled.\n", + " load: bool. 
+    "            mean: If None, compute mean from data.\n",
+    "            std: If None, compute standard deviation from data.\n",
+    "        \n",
+    "        Todo:\n",
+    "        - use number\n",
+    "        - don't use .sel(step=step) to train over all steps at once\n",
+    "        \"\"\"\n",
+    "\n",
+    "        self.ds = ds\n",
+    "        self.batch_size = batch_size\n",
+    "        self.shuffle = shuffle\n",
+    "        self.lead_time = step\n",
+    "\n",
+    "        self.data = ds.transpose('time', ...).sel(step=step)\n",
+    "        self.mean = self.data.mean('time').compute() if mean is None else mean\n",
+    "        self.std = self.data.std('time').compute() if std is None else std\n",
+    "        \n",
+    "        # note: if mean/std are passed (validation/test), the same training\n",
+    "        # statistics are reused for the verification data\n",
+    "        self.verif_data = verif.transpose('time', ...).sel(step=step)\n",
+    "        self.verif_mean = self.verif_data.mean('time').compute() if mean is None else mean\n",
+    "        self.verif_std = self.verif_data.std('time').compute() if std is None else std\n",
+    "\n",
+    "        # Normalize\n",
+    "        self.data = (self.data - self.mean) / self.std\n",
+    "        self.verif_data = (self.verif_data - self.verif_mean) / self.verif_std\n",
+    "        \n",
+    "        self.n_samples = self.data.time.size\n",
+    "        self.time = ds.time\n",
+    "\n",
+    "        self.on_epoch_end()\n",
+    "\n",
+    "        # For some weird reason calling .load() earlier messes up the mean and std computations\n",
+    "        if load: print('Loading data into RAM'); self.data.load()\n",
+    "\n",
+    "    def __len__(self):\n",
+    "        'Denotes the number of batches per epoch'\n",
+    "        return int(np.ceil(self.n_samples / self.batch_size))\n",
+    "\n",
+    "    def __getitem__(self, i):\n",
+    "        'Generate one batch of data'\n",
+    "        idxs = self.idxs[i * self.batch_size:(i + 1) * self.batch_size]\n",
+    "        # fill NaNs: without masking them, the whole batch becomes NaN\n",
+    "        X = self.data.isel(time=idxs).fillna(0.).values\n",
+    "        y = self.verif_data.isel(time=idxs).fillna(0.).values\n",
+    "        return X, y\n",
+    "\n",
+    "    def on_epoch_end(self):\n",
+    "        'Updates indexes after each epoch'\n",
+    "        self.idxs = np.arange(self.n_samples)\n",
+    "        if self.shuffle:\n",
+    "            np.random.shuffle(self.idxs)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# just train the model for week 5; FIXME: eventually this should be the two bi-weekly `lead_time`s\n",
+    "step = hc_weekly.isel(step=2).step\n",
+    "\n",
+    "step"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# mask: keep only grid points where observations exist\n",
+    "hc_weekly = hc_weekly.where(obs.isel(valid_time=0, drop=True).notnull())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## data prep: train, valid, test"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# time is the forecast_reference_time\n",
+    "time_train_start, time_train_end = '2000', '2017'\n",
+    "time_valid_start, time_valid_end = '2018', '2019'\n",
+    "time_test = '2020'"
+   ]
+  },
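+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A small added check (illustrative, not part of the original template): the three periods must be disjoint, otherwise information leaks from `validate` or `test` into `train`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# periods must not overlap (string comparison works for 4-digit years)\n",
+    "assert time_train_end < time_valid_start <= time_valid_end < time_test"
+   ]
+  },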
"/opt/conda/lib/python3.7/site-packages/dask/array/numpy_compat.py:40: RuntimeWarning: invalid value encountered in true_divide\n", + " x = np.divide(x1, x2, out)\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Loading data into RAM\n" + ] + } + ], + "source": [ + "dg_train = DataGenerator(\n", + " hc_weekly.isel(number=0).sel(time=slice(time_train_start,time_train_end))[v],\n", + " obs.sel(valid_time=hc_weekly.valid_time, method='nearest')[v].sel(time=slice(time_train_start,time_train_end)),\n", + " step=step, batch_size=bs, load=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Loading data into RAM\n" + ] + } + ], + "source": [ + "dg_valid = DataGenerator(\n", + " hc_weekly.isel(number=0).sel(time=slice(time_valid_start,time_valid_end))[v],\n", + " obs.sel(valid_time=hc_weekly.valid_time, method='nearest')[v].sel(time=slice(time_valid_start,time_valid_end)),\n", + " step=step, batch_size=bs, mean=dg_train.mean, std=dg_train.std, shuffle=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Loading data into RAM\n" + ] + } + ], + "source": [ + "dg_test = DataGenerator(hc_weekly.isel(number=0).sel(time=time_test)[v],\n", + " obs.sel(valid_time=hc_weekly.valid_time, method='nearest')[v].sel(time=time_test),\n", + " step, batch_size=bs, mean=dg_train.mean, std=dg_train.std, shuffle=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "((32, 121, 240), (32, 121, 240))" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "X, y = dg_valid[0]\n", + "X.shape, y.shape" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# short look into training data: large biases\n", + "# any problem from normalizing?\n", + "i=4\n", + "xr.DataArray(np.vstack([X[i],y[i]])).plot(yincrease=False, robust=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## `fit`" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [], + "source": [ + "cnn = keras.models.Sequential([\n", + " PeriodicConv2D(filters=32, kernel_size=5, conv_kwargs={'activation':'relu'}, input_shape=(32, 64, 1)),\n", + " PeriodicConv2D(filters=1, kernel_size=5)\n", + "])" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Model: \"sequential\"\n", + "_________________________________________________________________\n", + "Layer (type) Output Shape Param # \n", + "=================================================================\n", + "periodic_conv2d (PeriodicCon (None, 32, 64, 32) 832 \n", + "_________________________________________________________________\n", + "periodic_conv2d_1 (PeriodicC (None, 32, 64, 1) 801 \n", + "=================================================================\n", + "Total params: 1,633\n", + "Trainable params: 1,633\n", + "Non-trainable params: 0\n", + "_________________________________________________________________\n" + ] + } + ], + "source": [ + "cnn.summary()" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [], + "source": [ + 
"cnn.compile(keras.optimizers.Adam(1e-4), 'mse')" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [], + "source": [ + "import warnings\n", + "warnings.simplefilter(\"ignore\")" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "30/30 [==============================] - 70s 2s/step - loss: 0.3419 - val_loss: 0.6589\n" + ] + }, + { + "data": { + "text/plain": [ + "<tensorflow.python.keras.callbacks.History at 0x7f094c44efd0>" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "cnn.fit(dg_train, epochs=1, validation_data=dg_valid)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## `predict`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def create_predictions(model, dg, step):\n", + " \"\"\"Create non-iterative predictions\"\"\"\n", + " preds = model.predict(dg).squeeze()\n", + " # Unnormalize\n", + " preds = preds * dg.std.values + dg.mean.values\n", + " da = xr.DataArray(\n", + " preds,\n", + " dims=['time', 'latitude', 'longitude'],\n", + " coords={'time': dg.ds.time, 'latitude': dg.ds.latitude, 'longitude': dg.ds.longitude},\n", + " )\n", + " da=da.assign_coords(step=step)\n", + " return da" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [], + "source": [ + "preds = create_predictions(cnn, dg_test, step)" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Coordinates:\n", + " * time (time) datetime64[ns] 2000-01-02 2000-01-09 ... 2000-12-31\n", + " * latitude (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0\n", + " * longitude (longitude) float64 0.0 1.5 3.0 4.5 ... 
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## `predict`"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def create_predictions(model, dg, step):\n",
+    "    \"\"\"Create non-iterative predictions.\"\"\"\n",
+    "    preds = model.predict(dg).squeeze()\n",
+    "    # Unnormalize\n",
+    "    preds = preds * dg.std.values + dg.mean.values\n",
+    "    da = xr.DataArray(\n",
+    "        preds,\n",
+    "        dims=['time', 'latitude', 'longitude'],\n",
+    "        coords={'time': dg.ds.time, 'latitude': dg.ds.latitude, 'longitude': dg.ds.longitude},\n",
+    "    )\n",
+    "    da = da.assign_coords(step=step)\n",
+    "    return da"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 25,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "preds = create_predictions(cnn, dg_test, step)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 26,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "Coordinates:\n",
+       "  * time       (time) datetime64[ns] 2000-01-02 2000-01-09 ... 2000-12-31\n",
+       "  * latitude   (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0\n",
+       "  * longitude  (longitude) float64 0.0 1.5 3.0 4.5 ... 354.0 355.5 357.0 358.5\n",
+       "    surface    int64 ...\n",
+       "    step       timedelta64[ns] 35 days"
+      ]
+     },
+     "execution_count": 26,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "preds.coords"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Validate predictions"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 27,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "obs_test = obs.sel(valid_time=hc_weekly.valid_time, method='nearest')[v].sel(time=time_test)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "rmse_ML = xs.rmse(preds, obs_test.sel(step=step), dim='time')\n",
+    "rmse_ML.plot(robust=True)\n",
+    "plt.title('RMSE ML predictions 2020')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### predict over all steps"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 31,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Loading data into RAM\n",
+      "Loading data into RAM\n",
+      "Loading data into RAM\n",
+      "Loading data into RAM\n"
+     ]
+    }
+   ],
+   "source": [
+    "# this is not a useful forecast yet, but the results have the expected dimensions\n",
+    "# TODO: actually train for each step and use all members for training and validation\n",
+    "\n",
+    "preds = []\n",
+    "for step in hc_weekly.step:\n",
+    "    dg_test = DataGenerator(hc_weekly.isel(number=0).sel(time=slice(time_test))[v], obs_test,\n",
+    "                            step=step, batch_size=bs, mean=dg_train.mean, std=dg_train.std, shuffle=False)\n",
+    "    preds.append(create_predictions(cnn, dg_test, step))\n",
+    "preds = xr.concat(preds, 'step')\n",
+    "preds['step'] = hc_weekly.step\n",
+    "preds = preds.to_dataset(name=v)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 32,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "preds = preds.expand_dims('number').rename({'time': 'forecast_reference_time'})"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 33,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "Coordinates:\n",
+       "  * forecast_reference_time  (forecast_reference_time) datetime64[ns] 2000-01...\n",
+       "  * latitude                 (latitude) float64 90.0 88.5 87.0 ... -88.5 -90.0\n",
+       "  * longitude                (longitude) float64 0.0 1.5 3.0 ... 357.0 358.5\n",
+       "    surface                  int64 ...\n",
+       "  * step                     (step) timedelta64[ns] 21 days 28 days ... 42 days"
+      ]
+     },
+     "execution_count": 33,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "preds.coords"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# todo: convert preds to preds_as_terciles (a sketch follows below)"
+   ]
+  },
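+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A minimal added sketch of the missing tercile conversion (illustrative only). It assumes tercile edges with a `category_edge` dimension holding the 1/3 and 2/3 quantiles, as computed for `ML_terciles` in the Misc section below; aligning the weekly edges with the forecast times is omitted here. With a single member this yields 0/1 indicators rather than probabilities, so one would average over `number` for a probabilistic output."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def to_terciles(da, edges):\n",
+    "    \"\"\"Illustrative helper: hard-assign tercile categories.\n",
+    "    \n",
+    "    da: continuous predictions\n",
+    "    edges: quantiles with a category_edge dim (1/3 and 2/3 edges)\n",
+    "    \"\"\"\n",
+    "    below = da < edges.isel(category_edge=0)\n",
+    "    above = da > edges.isel(category_edge=1)\n",
+    "    normal = ~below & ~above\n",
+    "    terciles = xr.concat([below, normal, above], dim='category').astype(float)\n",
+    "    return terciles.assign_coords(\n",
+    "        category=['below normal', 'near normal', 'above normal'])"
+   ]
+  },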
42 days" + ] + }, + "execution_count": 33, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "preds.coords" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# todo: convert preds to preds_as_terciles" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Submission" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "preds.sizes # expect: category(3), longitude, latitude, step(2), forecast_time (53)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "format_bytes(preds.nbytes)" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": {}, + "outputs": [], + "source": [ + "preds.to_netcdf('submissions/ML_prediction_2020.nc')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!git commit -m 'method name'" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!git tag \"predefined-tag-0.0.1\" # if this is to be checked by scorer\n", + "git push --tags" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Misc" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## reforecasts for `ML_model`\n", + "- required for terciles\n", + "- dependent of `lead_time/step`" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": {}, + "outputs": [], + "source": [ + "# running a reforecast based on training data\n", + "# todo: require cnn todo bi-weekly predictions, otherwise average step/lead_time bi-weekly\n", + "preds = create_predictions(cnn, dg_train, step)" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": {}, + "outputs": [], + "source": [ + "quantile_kwargs={'q':[.33,.66], 'skipna':False}" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Coordinates:\n", + " * latitude (latitude) float64 90.0 88.5 87.0 ... -88.5 -90.0\n", + " * longitude (longitude) float64 0.0 1.5 3.0 ... 357.0 358.5\n", + " * category_edge (category_edge) float64 0.33 0.66\n", + " * forecast_reference_time (forecast_reference_time) int64 1 2 3 ... 
51 52 53" + ] + }, + "execution_count": 39, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "ML_terciles = preds.groupby('time.weekofyear').quantile(dim=['time'], **quantile_kwargs).rename({'quantile':'category_edge','weekofyear':'forecast_reference_time'}).compute()\n", + "ML_terciles.coords" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "metadata": {}, + "outputs": [], + "source": [ + "#ML_terciles.isel(forecast_reference_time=[0,24]).plot(col='category_edge',row='forecast_reference_time')" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'24.63 MB'" + ] + }, + "execution_count": 41, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "format_bytes(ML_terciles.nbytes) # *2 for variable; *2 for steps" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": {}, + "outputs": [], + "source": [ + "ML_terciles.to_netcdf('ML_terciles.nc')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.6" + }, + "toc-autonumbering": true + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/submissions/.gitattributes b/submissions/.gitattributes new file mode 100644 index 0000000..d77cc4d --- /dev/null +++ b/submissions/.gitattributes @@ -0,0 +1 @@ +“*.nc†filter=lfs diff=lfs merge=lfs -text -- GitLab