"# Train ML model to correct predictions of week 3-4 & 5-6\n",
"\n",
"This notebook creates a Machine Learning `ML_model` that predicts weeks 3-4 & 5-6 from `S2S` weeks 3-4 & 5-6 forecasts and compares its predictions with `CPC` observations for the [`s2s-ai-challenge`](https://s2s-ai-challenge.github.io/)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Synopsis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data used\n",
"\n",
"Training-input for Machine Learning model:\n",
"- hindcasts of models: ECMWF\n",
"\n",
"Forecast-input for Machine Learning model:\n",
"- real-time 2020 forecasts of the same models\n",
"\n",
"Compare Machine Learning model forecast against:\n",
"- `CPC` observations 2020"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Method: (`name`) mean bias reduction\n",
"\n",
"- calculate bias from 2000-2019\n",
"- remove bias from 2020 forecast"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Resources used\n",
"for training\n",
"\n",
"- platform: renku\n",
"- memory: 8 GB\n",
"- processors: 2 CPU\n",
"- storage required: 10 GB"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Safeguards\n",
"\n",
"All points have to be [x] checked. If not, your submission is invalid.\n",
"\n",
"Changes to the code after submissions are not possible, as the `commit` before the `tag` will be reviewed.\n",
"(Only in exceptions and if previous effort in reproducibility can be found, it may be allowed to improve readability and reproducibility after November 1st 2021.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Safeguards to prevent [overfitting](https://en.wikipedia.org/wiki/Overfitting?wprov=sfti1)\n",
"\n",
"If the organizers suspect overfitting, your contribution can be disqualified.\n",
"\n",
" - [ ] We didn't use 2020 observations in training (explicit overfitting and cheating)\n",
" - [ ] We didn't repeatedly verify our model on 2020 observations and incrementally improve our RPSS (implicit overfitting)\n",
" - [ ] We tried our best to prevent [data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)?wprov=sfti1).\n",
" - [ ] We honor the `train-validate-test` [split principle](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets). This means that the hindcast data is split into `train` and `validate`, whereas `test` is withheld.\n",
" - [ ] We did not use `test` explicitly in training or implicitly in incrementally adjusting parameters.\n",
" - [ ] We considered [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics))."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Safeguards for Reproducibility\n",
"Notebook/code must be independently reproducible from scratch by the organizers (after the competition); if this is not possible, no prize is awarded.\n",
" - [ ] All training data is publicly available (no pre-trained private neural networks, as they are not reproducible for us)\n",
" - [ ] Code is well documented, readable and reproducible.\n",
" - [ ] Code to reproduce runs within a day."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Todos to improve template\n",
"\n",
"This is just a demo.\n",
"\n",
"- [ ] for both variables\n",
"- [ ] for both `lead_time`s\n",
"- [ ] ensure probabilistic prediction outcome with `category` dim"
"format_bytes(ML_terciles.nbytes) # *2 for variable; *2 for steps"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
"ML_terciles.to_netcdf('ML_terciles.nc')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6"
},
"toc-autonumbering": true
},
"nbformat": 4,
"nbformat_minor": 4
}
%% Cell type:markdown id: tags:
# Train ML model to correct predictions of week 3-4 & 5-6
This notebook creates a Machine Learning `ML_model` that predicts weeks 3-4 & 5-6 from `S2S` weeks 3-4 & 5-6 forecasts and compares its predictions with `CPC` observations for the [`s2s-ai-challenge`](https://s2s-ai-challenge.github.io/).
%% Cell type:markdown id: tags:
# Synopsis
%% Cell type:markdown id: tags:
## Data used
Training-input for Machine Learning model:
- hindcasts of models: ECMWF
Forecast-input for Machine Learning model:
- real-time 2020 forecasts of the same models
Compare Machine Learning model forecast against:
- `CPC` observations 2020
%% Cell type:markdown id: tags:
## Method: (`name`) mean bias reduction
- calculate bias from 2000-2019
- remove bias from 2020 forecast
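
The two steps above can be sketched in plain Python. This is a minimal illustration and not the notebook's implementation: the helper names `mean_bias` and `remove_bias` and all numbers are hypothetical, and the data stands for a single grid cell and lead time.

```python
from statistics import mean


def mean_bias(hindcast, observations, years):
    """Average of (hindcast - observation) over the training years."""
    return mean(hindcast[year] - observations[year] for year in years)


def remove_bias(forecasts, bias):
    """Subtract the training-period bias from each forecast value."""
    return [value - bias for value in forecasts]


# Hypothetical toy values for one grid cell and one lead time.
train_years = range(2000, 2020)  # 2000-2019 inclusive
observations = {year: 10.0 + 0.5 * (year - 2000) for year in train_years}
# Assumption for the demo: the model runs 1.0 too high in every year.
hindcast = {year: observations[year] + 1.0 for year in train_years}

bias = mean_bias(hindcast, observations, train_years)
corrected_2020 = remove_bias([13.0, 14.5], bias)

print(bias)            # 1.0
print(corrected_2020)  # [12.0, 13.5]
```

Only 2000-2019 data enters `mean_bias`; the 2020 values are touched only when the already-computed bias is subtracted.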
%% Cell type:markdown id: tags:
## Resources used
for training
- platform: renku
- memory: 8 GB
- processors: 2 CPU
- storage required: 10 GB
%% Cell type:markdown id: tags:
## Safeguards
All points have to be [x] checked. If not, your submission is invalid.
Changes to the code after submissions are not possible, as the `commit` before the `tag` will be reviewed.
(Only in exceptions and if previous effort in reproducibility can be found, it may be allowed to improve readability and reproducibility after November 1st 2021.)
%% Cell type:markdown id: tags:
### Safeguards to prevent [overfitting](https://en.wikipedia.org/wiki/Overfitting?wprov=sfti1)
If the organizers suspect overfitting, your contribution can be disqualified.
- [ ] We didn't use 2020 observations in training (explicit overfitting and cheating)
- [ ] We didn't repeatedly verify our model on 2020 observations and incrementally improve our RPSS (implicit overfitting)
- [ ] We tried our best to prevent [data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)?wprov=sfti1).
- [ ] We honor the `train-validate-test` [split principle](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets). This means that the hindcast data is split into `train` and `validate`, whereas `test` is withheld.
- [ ] We did not use `test` explicitly in training or implicitly in incrementally adjusting parameters.
- [ ] We considered [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)).
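
One way to honor the split principle is to split by year, so that no validation year is seen in training and 2020 stays untouched. The following is a minimal sketch, assuming the hindcast covers 2000-2019 as in the method description. `split_years` and `leave_one_year_out` are hypothetical helper names, and holding out the last four years is an arbitrary choice.

```python
def split_years(years, n_validate):
    """Hold out the last `n_validate` years for validation."""
    ordered = sorted(years)
    if not 0 < n_validate < len(ordered):
        raise ValueError("n_validate must be between 1 and len(years) - 1")
    return ordered[:-n_validate], ordered[-n_validate:]


def leave_one_year_out(years):
    """Yield (train_years, validate_year) pairs for cross-validation."""
    ordered = sorted(years)
    for held_out in ordered:
        yield [year for year in ordered if year != held_out], held_out


hindcast_years = range(2000, 2020)  # 2000-2019 inclusive
test_year = 2020                    # withheld: never used for fitting or tuning

train_years, validate_years = split_years(hindcast_years, n_validate=4)
folds = list(leave_one_year_out(hindcast_years))

# The three sets must not overlap.
assert not set(train_years) & set(validate_years)
assert test_year not in train_years and test_year not in validate_years
```

Splitting by whole years rather than by random samples is one way to reduce leakage between neighbouring forecast dates.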
%% Cell type:markdown id: tags:
### Safeguards for Reproducibility
Notebook/code must be independently reproducible from scratch by the organizers (after the competition); if this is not possible, no prize is awarded.
- [ ] All training data is publicly available (no pre-trained private neural networks, as they are not reproducible for us)
- [ ] Code is well documented, readable and reproducible.
- [ ] Code to reproduce runs within a day.
%% Cell type:markdown id: tags:
# Todos to improve template
This is just a demo.
- [ ] for both variables
- [ ] for both `lead_time`s
- [ ] ensure probabilistic prediction outcome with `category` dim
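
For the last item, a probabilistic outcome means one probability per tercile category rather than a single value. The following is a minimal sketch for one grid cell, assuming the tercile edges come from a climatology and each probability is the fraction of ensemble members in that category. The category labels, the helper `tercile_probabilities`, and all numbers are assumptions for illustration.

```python
# Assumed category labels along the `category` dimension.
CATEGORIES = ("below normal", "near normal", "above normal")


def tercile_probabilities(members, lower_edge, upper_edge):
    """Fraction of ensemble members falling in each tercile category."""
    if not members:
        raise ValueError("need at least one ensemble member")
    counts = [0, 0, 0]
    for value in members:
        if value < lower_edge:
            counts[0] += 1
        elif value <= upper_edge:
            counts[1] += 1
        else:
            counts[2] += 1
    return {name: count / len(members) for name, count in zip(CATEGORIES, counts)}


# Hypothetical bias-corrected ensemble members and climatological tercile edges.
members = [8.0, 9.5, 10.0, 11.0, 12.5, 13.0, 14.0, 15.5]
probabilities = tercile_probabilities(members, lower_edge=10.0, upper_edge=13.0)

print(probabilities)
# {'below normal': 0.25, 'near normal': 0.5, 'above normal': 0.25}
```

The three probabilities sum to 1 by construction, which is a cheap sanity check before writing the result to disk.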