{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Train ML model for predictions of week 3-4 & 5-6\n",
"\n",
"This notebook create a Machine Learning `ML_model` to predict weeks 3-4 & 5-6 based on `S2S` weeks 3-4 & 5-6 forecasts and is compared to `CPC` observations for the [`s2s-ai-challenge`](https://s2s-ai-challenge.github.io/)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Synopsis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data used\n",
"\n",
"Training-input for Machine Learning model:\n",
"- renku datasets, climetlab, IRIDL\n",
"\n",
"Forecast-input for Machine Learning model:\n",
"- renku datasets, climetlab, IRIDL\n",
"\n",
"Compare Machine Learning model forecast against ground truth:\n",
"- renku datasets, climetlab, IRIDL"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Resources used\n",
"\n",
"- platform: renku\n",
"- memory: 8 GB\n",
"- processors: 2 CPU\n",
"- storage required: 10 GB"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Safeguards\n",
"\n",
"All points have to be [x] checked. If not, your submission is invalid.\n",
"\n",
"Changes to the code after submissions are not possible, as the `commit` before the `tag` will be reviewed.\n",
"(Only in exceptions and if previous effort in reproducibility can be found, it may be allowed to improve readability and reproducibility after November 1st 2021.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Safeguards to prevent [overfitting](https://en.wikipedia.org/wiki/Overfitting?wprov=sfti1) \n",
"\n",
"If the organizers suspect overfitting, your contribution can be disqualified.\n",
"\n",
" - [ ] We didnt use 2020 observations in training (explicit overfitting and cheating)\n",
" - [ ] We didnt repeatedly verify my model on 2020 observations and incrementally improved my RPSS (implicit overfitting)\n",
" - [ ] We provide RPSS scores for the training period with script `skill_by_year`, see in section 6.3 `predict`.\n",
" - [ ] We tried our best to prevent [data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)?wprov=sfti1).\n",
" - [ ] We honor the `train-validate-test` [split principle](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets). This means that the hindcast data is split into `train` and `validate`, whereas `test` is withheld.\n",
" - [ ] We did use `test` explicitly in training or implicitly in incrementally adjusting parameters.\n",
" - [ ] We considered [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics))."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Safeguards for Reproducibility\n",
"Notebook/code must be independently reproducible from scratch by the organizers (after the competition), if not possible: no prize\n",
" - [ ] All training data is publicly available (no pre-trained private neural networks, as they are not reproducible for us)\n",
" - [ ] Code is well documented, readable and reproducible.\n",
" - [ ] Code to reproduce training and predictions should run within a day on the described architecture. If the training takes longer than a day, please justify why this is needed. Please do not submit training piplelines, which take weeks to train."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Todos to improve template\n",
"\n",
"This is just a demo.\n",
"\n",
"- [ ] for both variables\n",
"- [ ] for both `lead_time`s\n",
"- [ ] ensure probabilistic prediction outcome with `category` dim"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from tensorflow.keras.layers import Input, Dense, Flatten\n",
"from tensorflow.keras.models import Sequential\n",
"\n",
"import matplotlib.pyplot as plt\n",
"\n",
"import xarray as xr\n",
"xr.set_options(display_style='text')\n",
"\n",
"from dask.utils import format_bytes\n",
"import xskillscore as xs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Get training data\n",
"\n",
"preprocessing of input data may be done in separate notebook/script"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Hindcast\n",
"\n",
"get weekly initialized hindcasts"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# consider renku datasets\n",
"#! renku storage pull path"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Observations\n",
"corresponding to hindcasts"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# consider renku datasets\n",
"#! renku storage pull path"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# ML model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"bs=32\n",
"\n",
"import numpy as np\n",
"class DataGenerator(keras.utils.Sequence):\n",
" def __init__(self):\n",
" \"\"\"\n",
" Data generator\n",
" \n",
" Template from https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly\n",
" \n",
" Args:\n",
" \n",
" \"\"\"\n",
"\n",
" self.on_epoch_end()\n",
"\n",
" # For some weird reason calling .load() earlier messes up the mean and std computations\n",
" if load: print('Loading data into RAM'); self.data.load()\n",
"\n",
" def __len__(self):\n",
" 'Denotes the number of batches per epoch'\n",
" return int(np.ceil(self.n_samples / self.batch_size))\n",
"\n",
" def __getitem__(self, i):\n",
" 'Generate one batch of data'\n",
" idxs = self.idxs[i * self.batch_size:(i + 1) * self.batch_size]\n",
" # got all nan if nans not masked\n",
" X = self.data.isel(time=idxs).fillna(0.).values\n",
" y = self.verif_data.isel(time=idxs).fillna(0.).values\n",
" return X, y\n",
"\n",
" def on_epoch_end(self):\n",
" 'Updates indexes after each epoch'\n",
" self.idxs = np.arange(self.n_samples)\n",
" if self.shuffle == True:\n",
" np.random.shuffle(self.idxs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## data prep: train, valid, test"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# time is the forecast_reference_time\n",
"time_train_start,time_train_end='2000','2017'\n",
"time_valid_start,time_valid_end='2018','2019'\n",
"time_test = '2020'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dg_train = DataGenerator()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dg_valid = DataGenerator()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dg_test = DataGenerator()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## `fit`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cnn = keras.models.Sequential([])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cnn.summary()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cnn.compile(keras.optimizers.Adam(1e-4), 'mse')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
"warnings.simplefilter(\"ignore\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cnn.fit(dg_train, epochs=1, validation_data=dg_valid)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## `predict`\n",
"\n",
"Create predictions and print `mean(variable, lead_time, longitude, weighted latitude)` RPSS for all years as calculated by `skill_by_year`. For now RPS, todo: change to RPSS."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from scripts import skill_by_year"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def create_predictions(model, dg):\n",
" \"\"\"Create non-iterative predictions\"\"\"\n",
" preds = model.predict(dg).squeeze()\n",
" # transform\n",
" \n",
" return preds"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### `predict` training period in-sample"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"preds_is = create_predictions(cnn, dg_train)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### `predict` valid out-of-sample"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"preds_os = create_predictions(cnn, dg_valid)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### `predict` test"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"preds_test = create_predictions(cnn, dg_test)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Submission"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"preds_test.sizes # expect: category(3), longitude, latitude, lead_time(2), forecast_time (53)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from scripts import assert_predictions_2020\n",
"assert_predictions_2020(preds_test)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"preds_test.to_netcdf('../submissions/ML_prediction_2020.nc')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#!git commit -m \"commit submission for my_method_name\" # whatever message you want"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#!git tag \"submission-my_method_name-0.0.1\" # if this is to be checked by scorer, only the last submitted==tagged version will be considered"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Reproducibility"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## memory"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# https://phoenixnap.com/kb/linux-commands-check-memory-usage\n",
"!free -g"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## CPU"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!lscpu"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## software"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!conda list"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
},
"toc-autonumbering": true
},
"nbformat": 4,
"nbformat_minor": 4
}