# Data Access
- European Weather Cloud:
  - [`climetlab-s2s-ai-challenge`](https://github.com/ecmwf-lab/climetlab-s2s-ai-challenge)
  - `wget`: wget_curl.ipynb
  - `curl`: wget_curl.ipynb
  - `mouse`: wget_curl.ipynb
  - `intake`: intake.ipynb
- [IRI Data Library](http://iridl.ldeo.columbia.edu/): IRIDL.ipynb
  - S2S: http://iridl.ldeo.columbia.edu/SOURCES/.ECMWF/.S2S/ (restricted access, explained in IRIDL.ipynb)
  - SubX: http://iridl.ldeo.columbia.edu/SOURCES/.Models/.SubX/
  - NMME: http://iridl.ldeo.columbia.edu/SOURCES/.Models/.NMME/
- s2sprediction.net
plugins:
  source:
    - module: intake_xarray
sources:
  training-input:
    description: S2S hindcasts from IRIDL regridded to 1.5 deg grid and aggregated by mean over lead, https://iridl.ldeo.columbia.edu/SOURCES/.ECMWF/.S2S/overview.html
    driver: opendap
    parameters:
      center:
        description: name of the center issuing the hindcast
        type: str
        default: ECMF
        allowed: [BOM, CNRM, ECCC, ECMF, HMCR, ISAC, JMA, KMA, NCEP, UKMO]
      grid:
        description: regrid to this global resolution
        type: float
        default: 1.5
      lead_name:
        description: name of the lead_time dimension
        type: str
        default: LA
        allowed: [LA, L]
      lead_start:
        description: aggregation start lead passed to RANGEEDGES
        type: int
        default: 14
      lead_end:
        description: aggregation end lead passed to RANGEEDGES
        type: int
        default: 27
      experiment_type:
        description: type of experiment
        type: str
        default: perturbed
        allowed: [control, perturbed, RMMS]
      group:
        description: group of variables
        type: str
        default: 2m_above_ground
        # allowed: [2m_above_ground, ...] see https://iridl.ldeo.columbia.edu/SOURCES/.ECMWF/.S2S/.ECMF/.reforecast/.perturbed/
      param:
        description: variable name
        type: str
        default: 2t
        # allowed: [2t] see https://iridl.ldeo.columbia.edu/SOURCES/.ECMWF/.S2S/.ECMF/.reforecast/.perturbed/
    xarray_kwargs:
      engine: netcdf4
    args:
      urlpath: http://iridl.ldeo.columbia.edu/SOURCES/.ECMWF/.S2S/.{{center}}/.reforecast/.{{experiment_type}}/.{{group}}/{{param}}/{{lead_name}}/({{lead_start}})/({{lead_end}})/RANGEEDGES/[{{lead_name}}]average/X/0/{{grid}}/358.5/GRID/Y/90/{{grid}}/-90/GRID/dods
  test-input:
    description: S2S forecasts from IRIDL regridded to 1.5 deg grid and aggregated by mean over lead, https://iridl.ldeo.columbia.edu/SOURCES/.ECMWF/.S2S/overview.html
    driver: opendap
    parameters:
      center:
        description: name of the center issuing the forecast
        type: str
        default: ECMF
        allowed: [BOM, CNRM, ECCC, ECMF, HMCR, ISAC, JMA, KMA, NCEP, UKMO]
      grid:
        description: regrid to this global resolution
        type: float
        default: 1.5
      lead_name:
        description: name of the lead_time dimension
        type: str
        default: LA
        allowed: [LA, L, L1]
      lead_start:
        description: aggregation start lead passed to RANGEEDGES
        type: int
        default: 14
      lead_end:
        description: aggregation end lead passed to RANGEEDGES
        type: int
        default: 27
      experiment_type:
        description: type of experiment
        type: str
        default: perturbed
        allowed: [control, perturbed, RMMS]
      group:
        description: group of variables
        type: str
        default: 2m_above_ground
        # allowed: see https://iridl.ldeo.columbia.edu/SOURCES/.ECMWF/.S2S/.ECMF/.reforecast/.perturbed/
      param:
        description: variable name
        type: str
        default: 2t
        # allowed: [2t] see https://iridl.ldeo.columbia.edu/SOURCES/.ECMWF/.S2S/.ECMF/.reforecast/.perturbed/
    xarray_kwargs:
      engine: netcdf4
    args:
      urlpath: http://iridl.ldeo.columbia.edu/SOURCES/.ECMWF/.S2S/.{{center}}/.forecast/.{{experiment_type}}/.{{group}}/{{param}}/S/(0000%201%20Jan%202020)/(0000%2031%20Dec%202020)/RANGEEDGES/{{lead_name}}/({{lead_start}})/({{lead_end}})/RANGEEDGES/[{{lead_name}}]average/X/0/{{grid}}/358.5/GRID/Y/90/{{grid}}/-90/GRID/dods
plugins:
  source:
    - module: intake_xarray
sources:
  training-input:
    description: SubX hindcasts from IRIDL regridded to 1.5 deg grid and aggregated by mean over lead, http://iridl.ldeo.columbia.edu/SOURCES/.Models/.SubX/outline.html
    driver: opendap
    parameters:
      center:
        description: name of the center issuing the hindcast
        type: str
        default: EMC
        allowed: [CESM, ECCC, EMC, ESRL, GMAO, NCEP, NRL, RSMAS]
      model:
        description: name of the model
        type: str
        default: GEFS
        allowed: [30LCESM1, 46LCESM1, GEM, GEPS6, GEPS5, GEFS, GEFSv12, FIMr1p1, GEOS_V2p1, CFSv2, NESM, CCSM4]
      grid:
        description: regrid to this global resolution
        type: float
        default: 1.5
      lead_start:
        description: aggregation start lead passed to RANGEEDGES
        type: int
        default: 14
      lead_end:
        description: aggregation end lead passed to RANGEEDGES
        type: int
        default: 27
      param:
        description: variable name
        type: str
        default: pr
        # allowed: [pr]
    xarray_kwargs:
      engine: netcdf4
    args:
      urlpath: http://iridl.ldeo.columbia.edu/SOURCES/.Models/.SubX/.{{center}}/.{{model}}/.hindcast/.{{param}}/L/({{lead_start}})/({{lead_end}})/RANGEEDGES/[L]average/X/0/{{grid}}/358.5/GRID/Y/90/{{grid}}/-90/GRID/dods
  test-input:
    description: SubX forecasts from IRIDL regridded to 1.5 deg grid and aggregated by mean over lead, http://iridl.ldeo.columbia.edu/SOURCES/.Models/.SubX/outline.html
    driver: opendap
    parameters:
      center:
        description: name of the center issuing the forecast
        type: str
        default: EMC
        allowed: [CESM, ECCC, EMC, ESRL, GMAO, NCEP, NRL, RSMAS]
      model:
        description: name of the model
        type: str
        default: GEFS
        allowed: [30LCESM1, 46LCESM1, GEM, GEPS6, GEPS5, GEFS, GEFSv12, FIMr1p1, GEOS_V2p1, CFSv2, NESM, CCSM4]
      grid:
        description: regrid to this global resolution
        type: float
        default: 1.5
      lead_start:
        description: aggregation start lead passed to RANGEEDGES
        type: int
        default: 14
      lead_end:
        description: aggregation end lead passed to RANGEEDGES
        type: int
        default: 27
      param:
        description: variable name, see http://iridl.ldeo.columbia.edu/SOURCES/.Models/.SubX/outline.html
        type: str
        default: pr
        # allowed: [pr]
    xarray_kwargs:
      engine: netcdf4
    args:
      urlpath: http://iridl.ldeo.columbia.edu/SOURCES/.Models/.SubX/.{{center}}/.{{model}}/.forecast/.{{param}}/S/(0000%201%20Jan%202020)/(0000%2031%20Dec%202020)/RANGEEDGES/L/({{lead_start}})/({{lead_end}})/RANGEEDGES/[L]average/X/0/{{grid}}/358.5/GRID/Y/90/{{grid}}/-90/GRID/dods
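Either catalog above can be opened directly with `intake`. A minimal usage sketch, assuming the first (S2S) catalog is saved as `IRIDL_S2S_catalog.yml` (a file name chosen here for illustration):

``` python
import intake

# open the S2S catalog (hypothetical file name) and lazily load the ECMWF
# reforecast of 2m temperature, lead-averaged over days 14-27 via OPeNDAP
cat = intake.open_catalog('IRIDL_S2S_catalog.yml')
ds = cat['training-input'](center='ECMF', group='2m_above_ground', param='2t',
                           lead_start=14, lead_end=27).to_dask()
print(ds)
```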
%% Cell type:markdown id: tags:
# Data Access from EWC via `intake`
Data is easily available via `climetlab`: https://github.com/ecmwf-lab/climetlab-s2s-ai-challenge
Data holdings are listed here: https://storage.ecmwf.europeanweather.cloud/s2s-ai-challenge/data/test-input/0.3.0/netcdf/index.html
Therefore, the S3 data is also accessible with `intake-xarray` and cacheable with `fsspec`.
%% Cell type:code id: tags:
``` python
import intake
import fsspec
import xarray as xr
import os, glob
import pandas as pd
xr.set_options(display_style='text')
```
%% Output
/opt/conda/lib/python3.8/site-packages/xarray/backends/cfgrib_.py:27: UserWarning: Failed to load cfgrib - most likely there is a problem accessing the ecCodes library. Try `import cfgrib` to get the full error message
warnings.warn(
<xarray.core.options.set_options at 0x7fa0100dcdc0>
%% Cell type:code id: tags:
``` python
# prevent aiohttp timeout errors
from aiohttp import ClientSession, ClientTimeout
timeout = ClientTimeout(total=600)
fsspec.config.conf['https'] = dict(client_kwargs={'timeout': timeout})
```
%% Cell type:markdown id: tags:
# intake
[`intake-xarray`](https://github.com/intake/intake-xarray) can read and cache `grib` and `netcdf` files from catalogs.
Caching is done via `fsspec`: https://filesystem-spec.readthedocs.io/en/latest/features.html#caching-files-locally
%% Cell type:code id: tags:
``` python
import intake_xarray
cache_path = '/work/s2s-ai-challenge-template/data/cache'
fsspec.config.conf['simplecache'] = {'cache_storage': cache_path, 'same_names':True}
```
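%% Cell type:markdown id: tags:
A minimal sketch of what the `simplecache::` prefix does (file name assumed from the holdings listed above): the first open downloads the file into `cache_path`, later opens reuse the local copy.
%% Cell type:code id: tags:
``` python
import fsspec
import xarray as xr

url = ('simplecache::https://storage.ecmwf.europeanweather.cloud/'
       's2s-ai-challenge/data/test-input/0.3.0/netcdf/ecmwf-forecast-t2m-20200102.nc')
with fsspec.open(url) as f:
    ds = xr.open_dataset(f, engine='h5netcdf')
    print(ds.sizes)
```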
%% Cell type:code id: tags:
``` python
%%writefile EWC_catalog.yml
plugins:
  source:
    - module: intake_xarray
sources:
  training-input:
    description: climetlab name in AI/ML community naming for hindcasts as input to the ML model in the training period
    driver: netcdf
    parameters:
      model:
        description: name of the S2S model
        type: str
        default: ecmwf
        allowed: [ecmwf, eccc, ncep]
      param:
        description: variable name
        type: str
        default: tp
        allowed: [t2m, ci, gh, lsm, msl, q, rsn, sm100, sm20, sp, sst, st100, st20, t, tcc, tcw, ttr, tp, v, u]
      date:
        description: initialization date (weekly Thursdays)
        type: datetime
        default: 2020.01.02
        min: 2020.01.02
        max: 2020.12.31
      version:
        description: version of the data
        type: str
        default: 0.3.0
      format:
        description: file format
        type: str
        default: netcdf
        allowed: [netcdf, grib]
      ending:
        description: file ending compatible with format; netcdf -> nc, grib -> grib
        type: str
        default: nc
        allowed: [nc, grib]
    xarray_kwargs:
      engine: h5netcdf
    args:  # add simplecache:: for caching: https://filesystem-spec.readthedocs.io/en/latest/features.html#caching-files-locally
      urlpath: https://storage.ecmwf.europeanweather.cloud/s2s-ai-challenge/data/training-input/{{version}}/{{format}}/{{model}}-hindcast-{{param}}-{{date.strftime("%Y%m%d")}}.{{ending}}
  test-input:
    description: climetlab name in AI/ML community naming for 2020 forecasts as input to the ML model in the test period 2020
    driver: netcdf
    parameters:
      model:
        description: name of the S2S model
        type: str
        default: ecmwf
        allowed: [ecmwf, eccc, ncep]
      param:
        description: variable name
        type: str
        default: tp
        allowed: [t2m, ci, gh, lsm, msl, q, rsn, sm100, sm20, sp, sst, st100, st20, t, tcc, tcw, ttr, tp, v, u]
      date:
        description: initialization date (weekly Thursdays)
        type: datetime
        default: 2020.01.02
        min: 2020.01.02
        max: 2020.12.31
      version:
        description: version of the data
        type: str
        default: 0.3.0
      format:
        description: file format
        type: str
        default: netcdf
        allowed: [netcdf, grib]
      ending:
        description: file ending compatible with format; netcdf -> nc, grib -> grib
        type: str
        default: nc
        allowed: [nc, grib]
    xarray_kwargs:
      engine: h5netcdf
    args:  # add simplecache:: for caching: https://filesystem-spec.readthedocs.io/en/latest/features.html#caching-files-locally
      urlpath: https://storage.ecmwf.europeanweather.cloud/s2s-ai-challenge/data/test-input/{{version}}/{{format}}/{{model}}-forecast-{{param}}-{{date.strftime("%Y%m%d")}}.{{ending}}
  training-output-reference:
    description: climetlab name in AI/ML community naming for observations matching the hindcasts, as output reference to compare ML model output against in the training period
    driver: netcdf
    parameters:
      param:
        description: variable name
        type: str
        default: tp
        allowed: [t2m, ci, gh, lsm, msl, q, rsn, sm100, sm20, sp, sst, st100, st20, t, tcc, tcw, ttr, tp, v, u]
      date:
        description: initialization date (weekly Thursdays)
        type: datetime
        default: 2020.01.02
        min: 2020.01.02
        max: 2020.12.31
    xarray_kwargs:
      engine: h5netcdf
    args:  # add simplecache:: for caching: https://filesystem-spec.readthedocs.io/en/latest/features.html#caching-files-locally
      urlpath: https://storage.ecmwf.europeanweather.cloud/s2s-ai-challenge/data/training-output-reference/{{param}}-{{date.strftime("%Y%m%d")}}.nc
  test-output-reference:
    description: climetlab name in AI/ML community naming for observations matching the 2020 forecasts, as output reference to compare ML model output against in the test period 2020
    driver: netcdf
    parameters:
      param:
        description: variable name
        type: str
        default: tp
        allowed: [t2m, ci, gh, lsm, msl, q, rsn, sm100, sm20, sp, sst, st100, st20, t, tcc, tcw, ttr, tp, v, u]
      date:
        description: initialization date (weekly Thursdays)
        type: datetime
        default: 2020.01.02
        min: 2020.01.02
        max: 2020.12.31
    xarray_kwargs:
      engine: h5netcdf
    args:  # add simplecache:: for caching: https://filesystem-spec.readthedocs.io/en/latest/features.html#caching-files-locally
      urlpath: https://storage.ecmwf.europeanweather.cloud/s2s-ai-challenge/data/test-output-reference/{{param}}-{{date.strftime("%Y%m%d")}}.nc
```
%% Output
Writing EWC_catalog.yml
%% Cell type:code id: tags:
``` python
cat = intake.open_catalog('EWC_catalog.yml')
```
%% Cell type:code id: tags:
``` python
# dates for 2020 forecasts and their on-the-fly reforecasts
dates = pd.date_range(start='2020-01-02', freq='7D', end='2020-12-31')
dates
```
%% Output
DatetimeIndex(['2020-01-02', '2020-01-09', '2020-01-16', '2020-01-23',
'2020-01-30', '2020-02-06', '2020-02-13', '2020-02-20',
'2020-02-27', '2020-03-05', '2020-03-12', '2020-03-19',
'2020-03-26', '2020-04-02', '2020-04-09', '2020-04-16',
'2020-04-23', '2020-04-30', '2020-05-07', '2020-05-14',
'2020-05-21', '2020-05-28', '2020-06-04', '2020-06-11',
'2020-06-18', '2020-06-25', '2020-07-02', '2020-07-09',
'2020-07-16', '2020-07-23', '2020-07-30', '2020-08-06',
'2020-08-13', '2020-08-20', '2020-08-27', '2020-09-03',
'2020-09-10', '2020-09-17', '2020-09-24', '2020-10-01',
'2020-10-08', '2020-10-15', '2020-10-22', '2020-10-29',
'2020-11-05', '2020-11-12', '2020-11-19', '2020-11-26',
'2020-12-03', '2020-12-10', '2020-12-17', '2020-12-24',
'2020-12-31'],
dtype='datetime64[ns]', freq='7D')
%% Cell type:markdown id: tags:
# `hindcast-input`
on-the-fly hindcasts corresponding to the 2020 forecasts
%% Cell type:code id: tags:
``` python
cat['training-input'](date=dates[10], param='tp', model='eccc').to_dask()
```
%% Output
/opt/conda/lib/python3.8/site-packages/xarray/backends/plugins.py:61: RuntimeWarning: Engine 'cfgrib' loading failed:
/opt/conda/lib/python3.8/site-packages/gribapi/_bindings.cpython-38-x86_64-linux-gnu.so: undefined symbol: codes_bufr_key_is_header
warnings.warn(f"Engine {name!r} loading failed:\n{ex}", RuntimeWarning)
<xarray.Dataset>
Dimensions: (forecast_time: 20, latitude: 121, lead_time: 32, longitude: 240, realization: 4)
Coordinates:
* realization (realization) int64 0 1 2 3
* forecast_time (forecast_time) datetime64[ns] 1998-03-12 ... 2017-03-12
* lead_time (lead_time) timedelta64[ns] 1 days 2 days ... 31 days 32 days
* latitude (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
* longitude (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
valid_time (forecast_time, lead_time) datetime64[ns] ...
Data variables:
tp (realization, forecast_time, lead_time, latitude, longitude) float32 ...
Attributes:
GRIB_edition: [2]
GRIB_centre: cwao
GRIB_centreDescription: Canadian Meteorological Service - Montreal
GRIB_subCentre: [0]
Conventions: CF-1.7
institution: Canadian Meteorological Service - Montreal
history: 2021-05-11T10:03 GRIB to CDM+CF via cfgrib-0.9.9...
%% Cell type:markdown id: tags:
# `forecast-input`
real-time 2020 forecasts
%% Cell type:code id: tags:
``` python
cat['test-input'](date=dates[10], param='t2m', model='ecmwf').to_dask()
```
%% Output
<xarray.Dataset>
Dimensions: (forecast_time: 1, latitude: 121, lead_time: 46, longitude: 240, realization: 51)
Coordinates:
* realization (realization) int64 0 1 2 3 4 5 6 7 ... 44 45 46 47 48 49 50
* forecast_time (forecast_time) datetime64[ns] 2020-03-12
* lead_time (lead_time) timedelta64[ns] 1 days 2 days ... 45 days 46 days
* latitude (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
* longitude (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
valid_time (forecast_time, lead_time) datetime64[ns] ...
Data variables:
t2m (realization, forecast_time, lead_time, latitude, longitude) float32 ...
Attributes:
GRIB_edition: [2]
GRIB_centre: ecmf
GRIB_centreDescription: European Centre for Medium-Range Weather Forecasts
GRIB_subCentre: [0]
Conventions: CF-1.7
institution: European Centre for Medium-Range Weather Forecasts
history: 2021-05-10T16:14:36 GRIB to CDM+CF via cfgrib-0....
%% Cell type:markdown id: tags:
# `hindcast-like-observations`
observations matching hindcasts
%% Cell type:code id: tags:
``` python
cat['training-output-reference'](date=dates[10], param='t2m').to_dask()
```
%% Output
<xarray.Dataset>
Dimensions: (forecast_time: 1, latitude: 121, lead_time: 47, longitude: 240)
Coordinates:
valid_time (lead_time, forecast_time) datetime64[ns] ...
* latitude (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
* longitude (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
* forecast_time (forecast_time) datetime64[ns] 2020-03-12
* lead_time (lead_time) timedelta64[ns] 0 days 1 days ... 45 days 46 days
Data variables:
t2m (lead_time, forecast_time, latitude, longitude) float32 ...
Attributes:
source_dataset_name: temperature daily from NOAA NCEP CPC: Climate Predi...
source_hosting: IRIDL
source_url: http://iridl.ldeo.columbia.edu/SOURCES/.NOAA/.NCEP/...
created_by_software: climetlab-s2s-ai-challenge
created_by_script: tools/observations/makefile
%% Cell type:markdown id: tags:
# `forecast-like-observations`
observations matching 2020 forecasts
%% Cell type:code id: tags:
``` python
cat['test-output-reference'](date=dates[10], param='t2m').to_dask()
```
%% Output
<xarray.Dataset>
Dimensions: (forecast_time: 1, latitude: 121, lead_time: 47, longitude: 240)
Coordinates:
valid_time (lead_time, forecast_time) datetime64[ns] ...
* latitude (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
* longitude (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
* forecast_time (forecast_time) datetime64[ns] 2020-03-12
* lead_time (lead_time) timedelta64[ns] 0 days 1 days ... 45 days 46 days
Data variables:
t2m (lead_time, forecast_time, latitude, longitude) float32 ...
Attributes:
source_dataset_name: temperature daily from NOAA NCEP CPC: Climate Predi...
source_hosting: IRIDL
source_url: http://iridl.ldeo.columbia.edu/SOURCES/.NOAA/.NCEP/...
created_by_software: climetlab-s2s-ai-challenge
created_by_script: tools/observations/makefile
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
# Data Access via `curl` or `wget`
Data is easily available via `climetlab`: https://github.com/ecmwf-lab/climetlab-s2s-ai-challenge
Data holdings are listed here:
- https://storage.ecmwf.europeanweather.cloud/s2s-ai-challenge/data/test-input/0.3.0/netcdf/index.html
- https://storage.ecmwf.europeanweather.cloud/s2s-ai-challenge/data/training-input/0.3.0/netcdf/index.html
- https://storage.ecmwf.europeanweather.cloud/s2s-ai-challenge/data/test-output-reference/index.html
- https://storage.ecmwf.europeanweather.cloud/s2s-ai-challenge/data/training-output-reference/index.html

Therefore, the S3 data is also accessible with `curl` or `wget`; a minimal `curl` sketch follows below. Alternatively, you can open the HTML links above and download files by mouse click.
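%% Cell type:markdown id: tags:
A minimal `curl` equivalent of the `wget` calls used below (file name taken from the holdings listed above):
%% Cell type:code id: tags:
``` python
from subprocess import call

# hedged sketch: -f fail on HTTP errors, -L follow redirects, -O keep the remote file name
url = 'https://storage.ecmwf.europeanweather.cloud/s2s-ai-challenge/data/training-input/0.3.0/netcdf/ecmwf-hindcast-t2m-20200102.nc'
call(['curl', '-fLO', url])
```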
%% Cell type:code id: tags:
``` python
import xarray as xr
import os
from subprocess import call
xr.set_options(display_style='text')
```
%% Output
/opt/conda/lib/python3.8/site-packages/xarray/backends/cfgrib_.py:27: UserWarning: Failed to load cfgrib - most likely there is a problem accessing the ecCodes library. Try `import cfgrib` to get the full error message
warnings.warn(
<xarray.core.options.set_options at 0x7f5170570520>
%% Cell type:code id: tags:
``` python
# version of the EWC data
version = '0.3.0'
```
%% Cell type:markdown id: tags:
# `hindcast-input`
on-the-fly hindcasts corresponding to the 2020 forecasts
%% Cell type:code id: tags:
``` python
parameter = 't2m'
date = '20200102'
model = 'ecmwf'
```
%% Cell type:code id: tags:
``` python
url = f'https://storage.ecmwf.europeanweather.cloud/s2s-ai-challenge/data/training-input/{version}/netcdf/{model}-hindcast-{parameter}-{date}.nc'
os.system(f'wget {url}')
assert os.path.exists(f'{model}-hindcast-{parameter}-{date}.nc')
```
%% Cell type:markdown id: tags:
# `forecast-input`
real-time 2020 forecasts
%% Cell type:code id: tags:
``` python
url = f'https://storage.ecmwf.europeanweather.cloud/s2s-ai-challenge/data/test-input/{version}/netcdf/{model}-forecast-{parameter}-{date}.nc'
os.system(f'wget {url}')
assert os.path.exists(f'{model}-forecast-{parameter}-{date}.nc')
```
%% Cell type:markdown id: tags:
# `hindcast-like-observations`
CPC observations formatted like training period hindcasts
%% Cell type:code id: tags:
``` python
url = f'https://storage.ecmwf.europeanweather.cloud/s2s-ai-challenge/data/training-output-reference/{parameter}-{date}.nc'
os.system(f'wget {url}')
assert os.path.exists(f'{parameter}-{date}.nc')
```
%% Cell type:markdown id: tags:
# `forecast-like-observations`
CPC observations formatted like test period 2020 forecasts
%% Cell type:code id: tags:
``` python
url = f'https://storage.ecmwf.europeanweather.cloud/s2s-ai-challenge/data/test-output-reference/{parameter}-{date}.nc'
os.system(f'wget {url}')
assert os.path.exists(f'{parameter}-{date}.nc')
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
# Train ML model to correct predictions of week 3-4 & 5-6
This notebook creates a Machine Learning model `ML_model` to predict weeks 3-4 & 5-6 based on `S2S` weeks 3-4 & 5-6 forecasts; the predictions are compared to `CPC` observations for the [`s2s-ai-challenge`](https://s2s-ai-challenge.github.io/).
%% Cell type:markdown id: tags:
# Synopsis
%% Cell type:markdown id: tags:
## Method: `mean bias reduction`
- calculate the mean bias of the 2000-2019 deterministic ensemble-mean hindcast with respect to observations
- remove that mean bias from the 2020 deterministic ensemble-mean forecast
- no Machine Learning used here (see the numeric sketch below)
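%% Cell type:markdown id: tags:
A minimal numeric sketch of the method on synthetic data (all numbers here are invented for illustration): estimate the mean bias from hindcast-observation pairs, then subtract it from a new forecast.
%% Cell type:code id: tags:
``` python
import numpy as np

rng = np.random.default_rng(0)
obs = rng.normal(size=20)                     # observed anomalies, one per hindcast year
hind = obs + 1.5 + 0.1 * rng.normal(size=20)  # hindcast ensemble mean carrying a +1.5 bias
bias = (hind - obs).mean()                    # mean bias estimated over the hindcast period
fct = 0.7 + 1.5                               # a new forecast carrying the same bias
print(f'estimated bias: {bias:.2f}, corrected forecast: {fct - bias:.2f}')
```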
%% Cell type:markdown id: tags:
## Data used
type: renku datasets

Training-input for the Machine Learning model:
- hindcasts of models:
  - ECMWF: `ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr`

Forecast-input for the Machine Learning model:
- real-time 2020 forecasts of models:
  - ECMWF: `ecmwf_forecast-input_2020_biweekly_deterministic.zarr`

The Machine Learning model forecast is compared against ground truth:
- `CPC` observations:
  - `hindcast-like-observations_biweekly_deterministic.zarr`
  - `forecast-like-observations_2020_biweekly_deterministic.zarr`
%% Cell type:markdown id: tags:
## Resources used
for training, details in the reproducibility section
- platform: MPI-M supercomputer, 1 node
- memory: 64 GB
- processors: 36 CPUs
- storage required: 10 GB
%% Cell type:markdown id: tags:
## Safeguards
All points have to be [x] checked. If not, your submission is invalid.
Changes to the code after submission are not possible, as the `commit` before the `tag` will be reviewed.
(Only in exceptional cases, and if prior effort toward reproducibility is evident, improvements to readability and reproducibility may be allowed after November 1st 2021.)
%% Cell type:markdown id: tags:
### Safeguards to prevent [overfitting](https://en.wikipedia.org/wiki/Overfitting?wprov=sfti1)
If the organizers suspect overfitting, your contribution can be disqualified.
- [x] We didn't use 2020 observations in training (explicit overfitting and cheating)
- [x] We didn't repeatedly verify our model on 2020 observations and incrementally improve our RPSS (implicit overfitting)
- [x] We provide RPSS scores for the training period with the script `skill_by_year`, see section 6.3 `predict`.
- [x] We tried our best to prevent [data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)?wprov=sfti1).
- [x] We honor the `train-validate-test` [split principle](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets). This means that the hindcast data is split into `train` and `validate`, whereas `test` is withheld.
- [x] We did not use `test` explicitly in training or implicitly in incrementally adjusting parameters.
- [x] We considered [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)).
%% Cell type:markdown id: tags:
### Safeguards for Reproducibility
Notebook/code must be independently reproducible from scratch by the organizers (after the competition); if that is not possible: no prize.
- [x] All training data is publicly available (no pre-trained private neural networks, as they are not reproducible for us).
- [x] Code is well documented, readable and reproducible.
- [x] Code to reproduce training and predictions should preferably run within a day on the described architecture. If the training takes longer than a day, please justify why this is needed. Please do not submit training pipelines which take weeks to train.
%% Cell type:markdown id: tags:
# Imports
%% Cell type:code id: tags:
``` python
import xarray as xr
xr.set_options(display_style='text')
```
%% Output
<xarray.core.options.set_options at 0x7f05cc486340>
%% Cell type:markdown id: tags:
# Get training data
preprocessing of the input data may be done in a separate notebook/script
%% Cell type:markdown id: tags:
## Hindcast
get weekly initialized hindcasts
%% Cell type:code id: tags:
``` python
# preprocessed as renku dataset
!renku storage pull ../data/ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr
```
%% Output
Warning: Run CLI commands only from project's root directory.

%% Cell type:code id: tags:
``` python
hind_2000_2019 = xr.open_zarr("../data/ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr", consolidated=True)
```
%% Cell type:code id: tags:
``` python
# preprocessed as renku dataset
!renku storage pull ../data/ecmwf_forecast-input_2020_biweekly_deterministic.zarr
```
%% Output
Warning: Run CLI commands only from project's root directory.

%% Cell type:code id: tags:
``` python
fct_2020 = xr.open_zarr("../data/ecmwf_forecast-input_2020_biweekly_deterministic.zarr", consolidated=True)
```
%% Cell type:markdown id: tags:
## Observations
corresponding to hindcasts
%% Cell type:code id: tags:
``` python
# preprocessed as renku dataset
!renku storage pull ../data/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr
```
%% Output
Warning: Run CLI commands only from project's root directory.

%% Cell type:code id: tags:
``` python
obs_2000_2019 = xr.open_zarr("../data/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr", consolidated=True)
```
%% Cell type:code id: tags:
``` python
# preprocessed as renku dataset
!renku storage pull ../data/forecast-like-observations_2020_biweekly_deterministic.zarr
```
%% Output
Warning: Run CLI commands only from project's root directory.

%% Cell type:code id: tags:
``` python
obs_2020 = xr.open_zarr("../data/forecast-like-observations_2020_biweekly_deterministic.zarr", consolidated=True)
```
%% Cell type:markdown id: tags:
# no ML model
%% Cell type:markdown id: tags:
Here, we just remove the mean bias from the ensemble mean forecast.
%% Cell type:code id: tags:
``` python
from scripts import add_year_week_coords
obs_2000_2019 = add_year_week_coords(obs_2000_2019)
hind_2000_2019 = add_year_week_coords(hind_2000_2019)
```
%% Cell type:code id: tags:
``` python
bias_2000_2019 = (hind_2000_2019.mean('realization') - obs_2000_2019).groupby('week').mean().compute()
```
%% Output
/opt/conda/lib/python3.8/site-packages/dask/array/numpy_compat.py:39: RuntimeWarning: invalid value encountered in true_divide
x = np.divide(x1, x2, out)
%% Cell type:markdown id: tags:
## `predict`
Create predictions and print `mean(variable, lead_time, longitude, weighted latitude)` RPSS for all years as calculated by `skill_by_year`.
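%% Cell type:markdown id: tags:
For orientation, a standard definition of the Ranked Probability (Skill) Score over the three tercile categories; the exact latitude weighting and averaging inside `skill_by_year` may differ:

$$\mathrm{RPS} = \sum_{k=1}^{3}\left(\sum_{j=1}^{k} p_j - \sum_{j=1}^{k} o_j\right)^2, \qquad \mathrm{RPSS} = 1 - \frac{\overline{\mathrm{RPS}}_{\mathrm{forecast}}}{\overline{\mathrm{RPS}}_{\mathrm{climatology}}}$$

where $p_j$ are the forecast category probabilities, $o_j$ is the one-hot observed category, and the climatological reference uses $p_j = 1/3$ for each tercile.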
%% Cell type:code id: tags:
``` python
from scripts import make_probabilistic
```
%% Cell type:code id: tags:
``` python
!renku storage pull ../data/hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc
```
%% Output
Warning: Run CLI commands only from project's root directory.

%% Cell type:code id: tags:
``` python
tercile_file = f'../data/hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc'
tercile_edges = xr.open_dataset(tercile_file)
```
%% Cell type:code id: tags:
``` python
def create_predictions(fct, bias):
    if 'week' not in fct.coords:
        fct = add_year_week_coords(fct)
    preds = fct - bias.sel(week=fct.week)
    preds = make_probabilistic(preds, tercile_edges)
    return preds.astype('float32')
```
%% Cell type:markdown id: tags:
### `predict` training period in-sample
%% Cell type:code id: tags:
``` python
!renku storage pull ../data/forecast-like-observations_2020_biweekly_terciled.nc
```
%% Output
Warning: Run CLI commands only from project's root directory.

%% Cell type:code id: tags:
``` python
!renku storage pull ../data/hindcast-like-observations_2000-2019_biweekly_terciled.zarr
```
%% Output
Warning: Run CLI commands only from project's root directory.

%% Cell type:code id: tags:
``` python
preds_is = create_predictions(hind_2000_2019, bias_2000_2019).compute()
```
%% Cell type:code id: tags:
``` python
from scripts import skill_by_year
```
%% Cell type:code id: tags:
``` python
skill_by_year(preds_is)
```
%% Cell type:markdown id: tags:
### `predict` test
%% Cell type:code id: tags:
``` python
preds_test = create_predictions(fct_2020, bias_2000_2019)
```
%% Cell type:code id: tags:
``` python
skill_by_year(preds_test)
```
%% Cell type:markdown id: tags:
# Submission
%% Cell type:code id: tags:
``` python
from scripts import assert_predictions_2020
assert_predictions_2020(preds_test)
```
%% Cell type:code id: tags:
``` python
preds_test.attrs = {'author': 'Aaron Spring', 'author_email': 'aaron.spring@mpimet.mpg.de',
                    'comment': 'created for the s2s-ai-challenge as a template for the website',
                    'notebook': 'mean_bias_reduction.ipynb',
                    'website': 'https://s2s-ai-challenge.github.io/#evaluation'}
html_repr = xr.core.formatting_html.dataset_repr(preds_test)
with open('submission_template_repr.html', 'w') as myFile:
    myFile.write(html_repr)
```
%% Cell type:code id: tags:
``` python
preds_test.to_netcdf('../submissions/ML_prediction_2020.nc')
```
%% Cell type:code id: tags:
``` python
# !git add ../submissions/ML_prediction_2020.nc
# !git add mean_bias_reduction.ipynb
```
%% Cell type:code id: tags:
``` python
#!git commit -m "template_test no ML mean bias reduction" # whatever message you want
```
%% Cell type:code id: tags:
``` python
#!git tag "submission-no_ML_mean_bias_reduction-0.0.2" # if this is to be checked by scorer, only the last submitted==tagged version will be considered
```
%% Cell type:code id: tags:
``` python
#!git push --tags
```
%% Cell type:markdown id: tags:
# Reproducibility
%% Cell type:markdown id: tags:
## memory
%% Cell type:code id: tags:
``` python
# https://phoenixnap.com/kb/linux-commands-check-memory-usage
!free -g
```
%% Cell type:markdown id: tags:
## CPU
%% Cell type:code id: tags:
``` python
!lscpu
```
%% Cell type:markdown id: tags:
## software
%% Cell type:code id: tags:
``` python
!conda list
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
# Create biweekly renku datasets from `climetlab-s2s-ai-challenge`
%% Cell type:markdown id: tags:
Goal:
- Create biweekly renku datasets from [`climetlab-s2s-ai-challenge`](https://github.com/ecmwf-lab/climetlab-s2s-ai-challenge).
- These renku datasets are then used in the notebooks:
  - `ML_train_and_predict.ipynb` to train the ML model and do ML-based predictions
  - `RPSS_verification.ipynb` to calculate the RPSS of the ML model
  - `mean_bias_reduction.ipynb` to remove the mean bias

Requirements:
- [`climetlab`](https://github.com/ecmwf/climetlab)
- [`climetlab-s2s-ai-challenge`](https://github.com/ecmwf-lab/climetlab-s2s-ai-challenge)
- S2S and CPC observations uploaded on the [European Weather Cloud (EWC)](https://storage.ecmwf.europeanweather.cloud/s2s-ai-challenge/data/training-input/0.3.0/netcdf/index.html)

Output: [renku dataset](https://renku-python.readthedocs.io/en/latest/commands.html#module-renku.cli.dataset) `s2s-ai-challenge`
- observations
  - deterministic:
    - `hindcast-like-observations_2000-2019_biweekly_deterministic.zarr`
    - `forecast-like-observations_2020_biweekly_deterministic.zarr`
  - edges:
    - `hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc`
  - probabilistic:
    - `hindcast-like-observations_2000-2019_biweekly_terciled.zarr`
    - `forecast-like-observations_2020_biweekly_terciled.nc`
- forecasts/hindcasts
  - deterministic:
    - `ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr`
    - `ecmwf_forecast-input_2020_biweekly_deterministic.zarr`
    - more models could be added
  - benchmark:
    - probabilistic:
      - `ecmwf_recalibrated_benchmark_2020_biweekly_terciled.nc`
%% Cell type:code id: tags:
``` python
import matplotlib.pyplot as plt
import xarray as xr
import xskillscore as xs
import pandas as pd
import climetlab_s2s_ai_challenge
import climetlab as cml
print(f'Climetlab version : {cml.__version__}')
print(f'Climetlab-s2s-ai-challenge plugin version : {climetlab_s2s_ai_challenge.__version__}')
xr.set_options(keep_attrs=True)
xr.set_options(display_style='text')
```
%% Output
WARNING: ecmwflibs universal: cannot find a library called MagPlus
Magics library could not be found
Climetlab version : 0.8.6
Climetlab-s2s-ai-challenge plugin version : 0.8.0
<xarray.core.options.set_options at 0x2b51c148f590>
%% Cell type:code id: tags:
``` python
# caching path for climetlab
cache_path = "/work/mh0727/m300524/S2S_AI/cache4" # set your own path
cml.settings.set("cache-directory", cache_path)
```
%% Cell type:code id: tags:
``` python
cache_path = "../data"
```
%% Cell type:markdown id: tags:
# Download and cache
Download all files for the observations, forecast and hindcast.
%% Cell type:code id: tags:
``` python
# shortcut
from scripts import download
#download()
```
%% Cell type:markdown id: tags:
## hindcast and forecast `input`
%% Cell type:code id: tags:
``` python
# starting dates forecast_time in 2020
dates = xr.cftime_range(start='20200102',freq='7D', periods=53).strftime('%Y%m%d').to_list()
forecast_dataset_labels = ['training-input','test-input'] # ML community
# equiv to
forecast_dataset_labels = ['hindcast-input','forecast-input'] # NWP community
varlist_forecast = ['tp','t2m'] # can add more
center_list = ['ecmwf'] # 'ncep', 'eccc'
```
%% Cell type:code id: tags:
``` python
%%time
# takes ~ 10-30 min to download for one model one variable depending on number of model realizations
# and download settings https://climetlab.readthedocs.io/en/latest/guide/settings.html
for center in center_list:
    for ds in forecast_dataset_labels:
        cml.load_dataset(f"s2s-ai-challenge-{ds}", origin=center, parameter=varlist_forecast, format='netcdf').to_xarray()
```
%% Cell type:markdown id: tags:
## observations `output-reference`
%% Cell type:code id: tags:
``` python
obs_dataset_labels = ['training-output-reference','test-output-reference'] # ML community
# equiv to
obs_dataset_labels = ['hindcast-like-observations','forecast-like-observations'] # NWP community
varlist_obs = ['tp', 't2m']
```
%% Cell type:code id: tags:
``` python
%%time
# takes 10min to download
for ds in obs_dataset_labels:
    print(ds)
    # only netcdf, no format choice
    cml.load_dataset(f"s2s-ai-challenge-{ds}", date=dates, parameter=varlist_obs).to_xarray()
```
%% Cell type:code id: tags:
``` python
# download obs_time to create output-reference/observations for models other than ecmwf and eccc,
# i.e. ncep or any other S2S or SubX model
obs_time = cml.load_dataset(f"s2s-ai-challenge-observations", parameter=['t2m', 'pr']).to_xarray()
```
%% Cell type:markdown id: tags:
# create bi-weekly aggregates
%% Cell type:code id: tags:
``` python
from scripts import aggregate_biweekly, ensure_attributes
#aggregate_biweekly??
```
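%% Cell type:markdown id: tags:
A hypothetical stand-in for `scripts.aggregate_biweekly`, to illustrate the intended lead aggregation (the real helper may differ, e.g. in attribute handling or in how the accumulated `tp` is treated): average daily leads over days 14-27 (weeks 3-4) and days 28-41 (weeks 5-6), labeled by leads of 14 and 28 days as in the resulting zarr stores.
%% Cell type:code id: tags:
``` python
import numpy as np
import pandas as pd
import xarray as xr

def aggregate_biweekly_sketch(da):
    biweekly = []
    for start in [14, 28]:  # weeks 3-4 and weeks 5-6
        leads = da.sel(lead_time=slice(pd.Timedelta(f'{start}D'),
                                       pd.Timedelta(f'{start + 13}D')))
        agg = leads.mean('lead_time').assign_coords(lead_time=pd.Timedelta(f'{start}D'))
        biweekly.append(agg)
    return xr.concat(biweekly, dim='lead_time')

# toy daily field with 46 leads
da = xr.DataArray(np.arange(1., 47.),
                  coords={'lead_time': pd.timedelta_range('1D', periods=46)},
                  dims='lead_time')
print(aggregate_biweekly_sketch(da).values)  # [20.5 34.5]
```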
%% Cell type:code id: tags:
``` python
for c, center in enumerate(center_list):  # forecast centers (could also take models)
    for dsl in obs_dataset_labels:  # + forecast_dataset_labels: # climetlab dataset labels
        for p, parameter in enumerate(varlist_forecast):  # variables
            if c != 0 and 'observation' in dsl:  # only do once for observations
                continue
            print(f"datasetlabel: {dsl}, center: {center}, parameter: {parameter}")
            if 'input' in dsl:
                ds = cml.load_dataset(f"s2s-ai-challenge-{dsl}", origin=center, parameter=parameter, format='netcdf').to_xarray()
            elif 'observation' in dsl:  # obs only netcdf, no choice
                if parameter not in ['t2m', 'tp']:
                    continue
                ds = cml.load_dataset(f"s2s-ai-challenge-{dsl}", parameter=parameter, date=dates).to_xarray()
            if p == 0:
                ds_biweekly = ds.map(aggregate_biweekly)
            else:
                ds_biweekly[parameter] = ds.map(aggregate_biweekly)[parameter]
        ds_biweekly = ds_biweekly.map(ensure_attributes, biweekly=True)
        ds_biweekly = ds_biweekly.sortby('forecast_time')
        if 'test' in dsl:
            ds_biweekly = ds_biweekly.chunk('auto')
        else:
            ds_biweekly = ds_biweekly.chunk({'forecast_time': 'auto', 'lead_time': -1, 'longitude': -1, 'latitude': -1})
        if 'hindcast' in dsl:
            time = f'{int(ds_biweekly.forecast_time.dt.year.min())}-{int(ds_biweekly.forecast_time.dt.year.max())}'
            if 'input' in dsl:
                name = f'{center}_{dsl}'
            elif 'observations' in dsl:
                name = dsl
        elif 'forecast' in dsl:
            time = '2020'
            if 'input' in dsl:
                name = f'{center}_{dsl}'
            elif 'observations' in dsl:
                name = dsl
        else:
            assert False
        # pattern: {model_if_not_observations}{observations/forecast/hindcast}_{time}_biweekly_deterministic.zarr
        zp = f'{cache_path}/{name}_{time}_biweekly_deterministic.zarr'
        ds_biweekly.attrs.update({'postprocessed_by': 'https://renkulab.io/gitlab/aaron.spring/s2s-ai-challenge-template/-/blob/master/notebooks/renku_datasets_biweekly.ipynb'})
        print(f'save to: {zp}')
        ds_biweekly.astype('float32').to_zarr(zp, consolidated=True, mode='w')
```
%% Output
datasetlabel: hindcast-like-observations, center: ecmwf, parameter: tp
By downloading data from this dataset, you agree to the terms and conditions defined at https://apps.ecmwf.int/datasets/data/s2s/licence/. If you do not agree with such terms, do not download the data. This dataset has been dowloaded from IRIDL. By downloading this data you also agree to the terms and conditions defined at https://iridl.ldeo.columbia.edu.
WARNING: ecmwflibs universal: found eccodes at /work/mh0727/m300524/conda-envs/s2s-ai/lib/libeccodes.so
Warning: ecCodes 2.21.0 or higher is recommended. You are running version 2.18.0
By downloading data from this dataset, you agree to the terms and conditions defined at https://apps.ecmwf.int/datasets/data/s2s/licence/. If you do not agree with such terms, do not download the data.
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/xarray/core/indexing.py:1379: PerformanceWarning: Slicing with an out-of-order index is generating 20 times more chunks
return self.array[key]
datasetlabel: hindcast-like-observations, center: ecmwf, parameter: t2m
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/xarray/core/indexing.py:1379: PerformanceWarning: Slicing is producing a large chunk. To accept the large
chunk and silence this warning, set the option
>>> with dask.config.set(**{'array.slicing.split_large_chunks': False}):
... array[indexer]
To avoid creating the large chunks, set the option
>>> with dask.config.set(**{'array.slicing.split_large_chunks': True}):
... array[indexer]
return self.array[key]
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/xarray/core/indexing.py:1379: PerformanceWarning: Slicing is producing a large chunk. To accept the large
chunk and silence this warning, set the option
>>> with dask.config.set(**{'array.slicing.split_large_chunks': False}):
... array[indexer]
To avoid creating the large chunks, set the option
>>> with dask.config.set(**{'array.slicing.split_large_chunks': True}):
... array[indexer]
return self.array[key]
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/xarray/core/indexing.py:1379: PerformanceWarning: Slicing with an out-of-order index is generating 20 times more chunks
return self.array[key]
save to: ../data/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/dask/array/numpy_compat.py:40: RuntimeWarning: invalid value encountered in true_divide
x = np.divide(x1, x2, out)
datasetlabel: forecast-like-observations, center: ecmwf, parameter: tp
By downloading data from this dataset, you agree to the terms and conditions defined at https://apps.ecmwf.int/datasets/data/s2s/licence/. If you do not agree with such terms, do not download the data. This dataset has been dowloaded from IRIDL. By downloading this data you also agree to the terms and conditions defined at https://iridl.ldeo.columbia.edu.
datasetlabel: forecast-like-observations, center: ecmwf, parameter: t2m
save to: ../data/forecast-like-observations_2020_biweekly_deterministic.zarr
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/dask/array/numpy_compat.py:40: RuntimeWarning: invalid value encountered in true_divide
x = np.divide(x1, x2, out)
%% Cell type:markdown id: tags:
## add to `renku` dataset `s2s-ai-challenge`
%% Cell type:code id: tags:
``` python
# observations as hindcast
# run renku commands from projects root directory only
# !renku dataset add s2s-ai-challenge data/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr
```
%% Cell type:code id: tags:
``` python
# for further use retrieve from git lfs
# !renku storage pull ../data/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr
```
%% Cell type:code id: tags:
``` python
obs_2000_2019 = xr.open_zarr(f"{cache_path}/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr", consolidated=True)
print(obs_2000_2019.sizes,'\n',obs_2000_2019.coords,'\n', obs_2000_2019.nbytes/1e6,'MB')
```
%% Output
Frozen(SortedKeysDict({'forecast_time': 1060, 'latitude': 121, 'lead_time': 2, 'longitude': 240}))
Coordinates:
* forecast_time (forecast_time) datetime64[ns] 2000-01-02 ... 2019-12-31
* latitude (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
* lead_time (lead_time) timedelta64[ns] 14 days 28 days
* longitude (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
valid_time (lead_time, forecast_time) datetime64[ns] dask.array<chunksize=(2, 1060), meta=np.ndarray>
492.546744 MB
%% Cell type:code id: tags:
``` python
# observations as forecast
# run renku commands from projects root directory only
# !renku dataset add s2s-ai-challenge data/forecast-like-observations_2020_biweekly_deterministic.zarr
```
%% Cell type:code id: tags:
``` python
# for further use retrieve from git lfs
# !renku storage pull ../data/forecast-like-observations_2020_biweekly_deterministic.zarr
```
%% Cell type:code id: tags:
``` python
obs_2020 = xr.open_zarr(f"{cache_path}/forecast-like-observations_2020_biweekly_deterministic.zarr", consolidated=True)
print(obs_2020.sizes,'\n',obs_2020.coords,'\n', obs_2020.nbytes/1e6,'MB')
```
%% Output
Frozen(SortedKeysDict({'forecast_time': 53, 'latitude': 121, 'lead_time': 2, 'longitude': 240}))
Coordinates:
* forecast_time (forecast_time) datetime64[ns] 2020-01-02 ... 2020-12-31
* latitude (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
* lead_time (lead_time) timedelta64[ns] 14 days 28 days
* longitude (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
valid_time (lead_time, forecast_time) datetime64[ns] dask.array<chunksize=(2, 53), meta=np.ndarray>
24.630096 MB
%% Cell type:code id: tags:
``` python
# ecmwf hindcast-input
# run renku commands from projects root directory only
# !renku dataset add s2s-ai-challenge data/ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr
```
%% Cell type:code id: tags:
``` python
# for further use retrieve from git lfs
# !renku storage pull ../data/ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr
```
%% Cell type:code id: tags:
``` python
hind_2000_2019 = xr.open_zarr(f"{cache_path}/ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr", consolidated=True)
print(hind_2000_2019.sizes,'\n',hind_2000_2019.coords,'\n', hind_2000_2019.nbytes/1e6,'MB')
```
%% Cell type:code id: tags:
``` python
# ecmwf forecast-input
# run renku commands from projects root directory only
# !renku dataset add s2s-ai-challenge data/ecmwf_forecast-input_2020_biweekly_deterministic.zarr
```
%% Cell type:code id: tags:
``` python
# for further use retrieve from git lfs
# !renku storage pull ../data/ecmwf_forecast-input_2020_biweekly_deterministic.zarr
```
%% Cell type:code id: tags:
``` python
fct_2020 = xr.open_zarr(f"{cache_path}/ecmwf_forecast-input_2020_biweekly_deterministic.zarr", consolidated=True)
print(fct_2020.sizes,'\n',fct_2020.coords,'\n', fct_2020.nbytes/1e6,'MB')
```
%% Cell type:markdown id: tags:
# tercile edges
Create 2 tercile edges at the 1/3 and 2/3 quantiles of the 2000-2019 biweekly distribution for each week of the year
%% Cell type:code id: tags:
``` python
obs_2000_2019 = xr.open_zarr(f'{cache_path}/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr', consolidated=True)
```
%% Cell type:code id: tags:
``` python
from scripts import add_year_week_coords
```
%% Cell type:code id: tags:
``` python
# add week for groupby, see https://renkulab.io/gitlab/aaron.spring/s2s-ai-challenge/-/issues/29
obs_2000_2019 = add_year_week_coords(obs_2000_2019)
```
%% Output
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/xarray/core/accessor_dt.py:383: FutureWarning: dt.weekofyear and dt.week have been deprecated. Please use dt.isocalendar().week instead.
FutureWarning,
%% Cell type:code id: tags:
``` python
obs_2000_2019
```
%% Output
<xarray.Dataset>
Dimensions: (forecast_time: 1060, latitude: 121, lead_time: 2, longitude: 240)
Coordinates:
* forecast_time (forecast_time) datetime64[ns] 2000-01-02 ... 2019-12-31
* latitude (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
* lead_time (lead_time) timedelta64[ns] 14 days 28 days
* longitude (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
valid_time (lead_time, forecast_time) datetime64[ns] dask.array<chunksize=(2, 1060), meta=np.ndarray>
week (forecast_time) int64 1 2 3 4 5 6 7 ... 47 48 49 50 51 52 53
year (forecast_time) int64 2000 2000 2000 2000 ... 2019 2019 2019
Data variables:
t2m (lead_time, forecast_time, latitude, longitude) float32 dask.array<chunksize=(2, 530, 121, 240), meta=np.ndarray>
tp (lead_time, forecast_time, latitude, longitude) float32 dask.array<chunksize=(2, 530, 121, 240), meta=np.ndarray>
Attributes:
created_by_script: tools/observations/makefile
created_by_software: climetlab-s2s-ai-challenge
function: climetlab_s2s_ai_challenge.extra.forecast_like_obse...
postprocessed_by: https://renkulab.io/gitlab/aaron.spring/s2s-ai-chal...
regrid_method: conservative
source_dataset_name: NOAA NCEP CPC UNIFIED_PRCP GAUGE_BASED GLOBAL v1p0 ...
source_hosting: IRIDL
source_url: http://iridl.ldeo.columbia.edu/SOURCES/.NOAA/.NCEP/...
%% Cell type:code id: tags:
``` python
tercile_file = f'{cache_path}/hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc'
```
%% Cell type:code id: tags:
``` python
%%time
(obs_2000_2019.chunk({'forecast_time': -1, 'longitude': 'auto'})
    .groupby('week')
    .quantile(q=[1./3., 2./3.], dim='forecast_time')
    .rename({'quantile': 'category_edge'})
    .astype('float32')
    .to_netcdf(tercile_file))
```
%% Output
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/numpy/lib/nanfunctions.py:1390: RuntimeWarning: All-NaN slice encountered
overwrite_input, interpolation)
CPU times: user 19min 35s, sys: 8min 33s, total: 28min 9s
Wall time: 16min 44s
%% Cell type:code id: tags:
``` python
tercile_edges = xr.open_dataset(tercile_file)
tercile_edges
```
%% Output
<xarray.Dataset>
Dimensions: (category_edge: 2, latitude: 121, lead_time: 2, longitude: 240, week: 53)
Coordinates:
* latitude (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
* lead_time (lead_time) timedelta64[ns] 14 days 28 days
* longitude (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
* category_edge (category_edge) float64 0.3333 0.6667
* week (week) int64 1 2 3 4 5 6 7 8 9 ... 45 46 47 48 49 50 51 52 53
Data variables:
t2m (week, category_edge, lead_time, latitude, longitude) float32 ...
tp (week, category_edge, lead_time, latitude, longitude) float32 ...
Attributes:
created_by_script: tools/observations/makefile
created_by_software: climetlab-s2s-ai-challenge
function: climetlab_s2s_ai_challenge.extra.forecast_like_obse...
postprocessed_by: https://renkulab.io/gitlab/aaron.spring/s2s-ai-chal...
regrid_method: conservative
source_dataset_name: NOAA NCEP CPC UNIFIED_PRCP GAUGE_BASED GLOBAL v1p0 ...
source_hosting: IRIDL
source_url: http://iridl.ldeo.columbia.edu/SOURCES/.NOAA/.NCEP/...
%% Cell type:code id: tags:
``` python
tercile_edges.nbytes*1e-6,'MB'
```
%% Output
(49.255184, 'MB')
%% Cell type:code id: tags:
``` python
# run renku commands from projects root directory only
# tercile edges
#!renku dataset add s2s-ai-challenge data/hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc
```
%% Cell type:code id: tags:
``` python
# to use retrieve from git lfs
#!renku storage pull ../data/hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc
#xr.open_dataset("../data/hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc")
```
%% Cell type:markdown id: tags:
# observations in categories
- counting how many deterministic forecast realizations fall into each category, as when computing the RPS
- categorizing the 2020 forecast-like observations into categories (see the sketch below)
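%% Cell type:markdown id: tags:
A hypothetical sketch of the terciling step (the real `scripts.make_probabilistic` additionally handles the weekly tercile edges, the dry mask and both variables): one-hot membership of a value in the below/near/above-normal categories.
%% Cell type:code id: tags:
``` python
import pandas as pd
import xarray as xr

def tercile_onehot_sketch(value, lower, upper):
    below = value < lower
    above = value > upper
    near = ~below & ~above
    cats = pd.Index(['below normal', 'near normal', 'above normal'], name='category')
    return xr.concat([below, near, above], dim=cats).astype('float32')

val = xr.DataArray([0.1, 0.5, 0.9], dims='x')
print(tercile_onehot_sketch(val, lower=1/3, upper=2/3).values)
```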
%% Cell type:code id: tags:
``` python
obs_2020 = xr.open_zarr(f'{cache_path}/forecast-like-observations_2020_biweekly_deterministic.zarr', consolidated=True)
obs_2020.sizes
```
%% Output
Frozen(SortedKeysDict({'forecast_time': 53, 'latitude': 121, 'lead_time': 2, 'longitude': 240}))
%% Cell type:code id: tags:
``` python
# create a mask for land grid
mask = obs_2020.std(['lead_time','forecast_time']).notnull()
```
%% Cell type:code id: tags:
``` python
# mask.to_array().plot(col='variable')
```
%% Cell type:code id: tags:
``` python
# total precipitation in arid regions is masked
# Frederic Vitart suggested by email: "Based on your map we could mask all the areas where the lower tercile boundary is lower than 0.1 mm"
# we are using a dry mask as in https://doi.org/10.1175/MWR-D-17-0092.1
th = 0.01
tp_arid_mask = tercile_edges.tp.isel(category_edge=0, lead_time=0, drop=True) > th
#tp_arid_mask.where(mask.tp).plot(col='forecast_time', col_wrap=4)
#plt.suptitle(f'dry mask: week 3-4 tp 1/3 category_edge > {th} kg m-2',y=1., x=.4)
#plt.savefig('dry_mask.png')
```
%% Cell type:code id: tags:
``` python
# look into tercile edges
# tercile_edges.isel(forecast_time=0)['tp'].plot(col='lead_time',row='category_edge', robust=True)
# tercile_edges.isel(forecast_time=[0,20],category_edge=1)['tp'].plot(col='lead_time', row='forecast_time', robust=True)
# tercile_edges.tp.mean(['forecast_time']).plot(col='lead_time',row='category_edge',vmax=.5)
```
%% Cell type:markdown id: tags:
## categorize observations
%% Cell type:markdown id: tags:
### forecast 2020
%% Cell type:code id: tags:
``` python
from scripts import make_probabilistic
```
%% Cell type:code id: tags:
``` python
# tp_arid_mask.isel(week=[0,10,20,30,40]).plot(col='week')
```
%% Cell type:code id: tags:
``` python
obs_2020_p = make_probabilistic(obs_2020, tercile_edges, mask=mask)
```
%% Output
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/xarray/core/accessor_dt.py:383: FutureWarning: dt.weekofyear and dt.week have been deprecated. Please use dt.isocalendar().week instead.
FutureWarning,
%% Cell type:code id: tags:
``` python
obs_2020_p.nbytes/1e6, 'MB'
```
%% Output
(147.75984, 'MB')
%% Cell type:code id: tags:
``` python
obs_2020_p
```
%% Output
<xarray.Dataset>
Dimensions: (category: 3, forecast_time: 53, latitude: 121, lead_time: 2, longitude: 240)
Coordinates:
* forecast_time (forecast_time) datetime64[ns] 2020-01-02 ... 2020-12-31
* latitude (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
* lead_time (lead_time) timedelta64[ns] 14 days 28 days
* longitude (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
valid_time (lead_time, forecast_time) datetime64[ns] dask.array<chunksize=(2, 53), meta=np.ndarray>
* category (category) <U12 'below normal' 'near normal' 'above normal'
Data variables:
t2m (category, lead_time, forecast_time, latitude, longitude) float64 dask.array<chunksize=(1, 2, 53, 121, 240), meta=np.ndarray>
tp (category, lead_time, forecast_time, latitude, longitude) float64 dask.array<chunksize=(1, 2, 53, 121, 240), meta=np.ndarray>
%% Cell type:code id: tags:
``` python
obs_2020_p.astype('float32').to_netcdf(f'{cache_path}/forecast-like-observations_2020_biweekly_terciled.nc')
```
%% Output
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/dask/array/numpy_compat.py:40: RuntimeWarning: invalid value encountered in true_divide
x = np.divide(x1, x2, out)
%% Cell type:code id: tags:
``` python
# forecast-like-observations terciled
# run renku commands from the project's root directory only
# !renku dataset add s2s-ai-challenge data/forecast-like-observations_2020_biweekly_terciled.nc
```
%% Cell type:code id: tags:
``` python
# to use, retrieve from git LFS first
#!renku storage pull ../data/forecast-like-observations_2020_biweekly_terciled.nc
xr.open_dataset("../data/forecast-like-observations_2020_biweekly_terciled.nc")
```
%% Output
<xarray.Dataset>
Dimensions: (category: 3, forecast_time: 53, latitude: 121, lead_time: 2, longitude: 240)
Coordinates:
* forecast_time (forecast_time) datetime64[ns] 2020-01-02 ... 2020-12-31
* latitude (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
* lead_time (lead_time) timedelta64[ns] 14 days 28 days
* longitude (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
valid_time (lead_time, forecast_time) datetime64[ns] ...
* category (category) object 'below normal' 'near normal' 'above normal'
Data variables:
t2m (category, lead_time, forecast_time, latitude, longitude) float32 ...
tp (category, lead_time, forecast_time, latitude, longitude) float32 ...
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
### hindcast 2000_2019
%% Cell type:code id: tags:
``` python
obs_2000_2019 = xr.open_zarr(f'{cache_path}/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr', consolidated=True)
```
%% Cell type:code id: tags:
``` python
obs_2000_2019_p = make_probabilistic(obs_2000_2019, tercile_edges, mask=mask)
```
%% Output
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/xarray/core/accessor_dt.py:383: FutureWarning: dt.weekofyear and dt.week have been deprecated. Please use dt.isocalendar().week instead.
FutureWarning,
%% Cell type:code id: tags:
``` python
obs_2000_2019_p.nbytes/1e6, 'MB'
```
%% Output
(2955.138888, 'MB')
%% Cell type:code id: tags:
``` python
obs_2000_2019_p.astype('float32').chunk('auto').to_zarr(f'{cache_path}/hindcast-like-observations_2000-2019_biweekly_terciled.zarr', consolidated=True, mode='w')
```
%% Output
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/dask/array/numpy_compat.py:40: RuntimeWarning: invalid value encountered in true_divide
x = np.divide(x1, x2, out)
<xarray.backends.zarr.ZarrStore at 0x2b51c1233360>
%% Cell type:code id: tags:
``` python
# hindcast-like-observations terciled
# run renku commands from the project's root directory only
# !renku dataset add s2s-ai-challenge data/hindcast-like-observations_2000-2019_biweekly_terciled.zarr
```
%% Cell type:code id: tags:
``` python
# to use, retrieve from git LFS first
#!renku storage pull ../data/hindcast-like-observations_2000-2019_biweekly_terciled.zarr
xr.open_zarr("../data/hindcast-like-observations_2000-2019_biweekly_terciled.zarr")
```
%% Output
<xarray.Dataset>
Dimensions: (category: 3, forecast_time: 1060, latitude: 121, lead_time: 2, longitude: 240)
Coordinates:
* category (category) <U12 'below normal' 'near normal' 'above normal'
* forecast_time (forecast_time) datetime64[ns] 2000-01-02 ... 2019-12-31
* latitude (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
* lead_time (lead_time) timedelta64[ns] 14 days 28 days
* longitude (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
valid_time (lead_time, forecast_time) datetime64[ns] dask.array<chunksize=(2, 1060), meta=np.ndarray>
Data variables:
t2m (category, lead_time, forecast_time, latitude, longitude) float32 dask.array<chunksize=(1, 2, 530, 121, 240), meta=np.ndarray>
tp (category, lead_time, forecast_time, latitude, longitude) float32 dask.array<chunksize=(1, 2, 530, 121, 240), meta=np.ndarray>
%% Cell type:code id: tags:
``` python
# checking category frequencies
# o = xr.open_zarr("../data/hindcast-like-observations_2000-2019_biweekly_terciled.zarr")
# w=0
# v='tp'
# o.sel(forecast_time=o.forecast_time.dt.dayofyear==2+7*w).sum('forecast_time', skipna=False)[v].plot(row='lead_time',col='category', levels=[5.5,6.5,7.5])
# o.sel(forecast_time=o.forecast_time.dt.dayofyear==2+7*w).sum('forecast_time', skipna=False).sum('category', skipna=False)[v].plot(row='lead_time', levels=[16.5,17.5,18.5,19.5,20.5])
```
%% Cell type:markdown id: tags:
# Benchmark
center: ECMWF
The calibration was performed using tercile boundaries from the model climatology rather than from observations. Script by Frederic Vitart.
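The idea sketched on toy data (this is not Frederic Vitart's actual script): the probability of each tercile category is the fraction of ensemble members falling into it, with category edges taken from the model climatology; the edge values below are illustrative.
%% Cell type:code id: tags:
``` python
# toy ensemble at a single grid point; model_edges stands in for
# hypothetical model-climatology tercile edges
import numpy as np
import xarray as xr

ens = xr.DataArray(np.random.randn(11), dims='realization')
model_edges = xr.DataArray([-0.43, 0.43], dims='category_edge')
bn = (ens < model_edges.isel(category_edge=0)).mean('realization')
an = (ens >= model_edges.isel(category_edge=1)).mean('realization')
n = 1.0 - bn - an
float(bn), float(n), float(an)  # member fractions, add up to 1
```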
%% Cell type:code id: tags:
``` python
bench_p = cml.load_dataset("s2s-ai-challenge-test-output-benchmark", parameter=['tp','t2m']).to_xarray()
```
%% Output
50%|█████ | 1/2 [00:00<00:00, 6.11it/s]
By downloading data from this dataset, you agree to the terms and conditions defined at https://apps.ecmwf.int/datasets/data/s2s/licence/. If you do not agree with such terms, do not download the data.
100%|██████████| 2/2 [00:00<00:00, 6.89it/s]
WARNING: ecmwflibs universal: found eccodes at /work/mh0727/m300524/conda-envs/s2s-ai/lib/libeccodes.so
Warning: ecCodes 2.21.0 or higher is recommended. You are running version 2.12.3
%% Cell type:code id: tags:
``` python
bench_p['category'].attrs = {'long_name': 'tercile category probabilities', 'units': '1',
'description': 'Probabilities for three tercile categories. All three tercile category probabilities must add up to 1.'}
```
%% Cell type:code id: tags:
``` python
bench_p['lead_time'] = [pd.Timedelta(f"{i} d") for i in [14, 28]] # take first day of biweekly average as new coordinate
bench_p['lead_time'].attrs = {'long_name':'forecast_period', 'description': 'Forecast period is the time interval between the forecast reference time and the validity time.',
'aggregate': 'The pd.Timedelta corresponds to the first day of a biweekly aggregate.',
'week34_t2m': 'mean[day 14, 27]',
'week56_t2m': 'mean[day 28, 41]',
'week34_tp': 'day 28 minus day 14',
'week56_tp': 'day 42 minus day 28'}
```
%% Cell type:code id: tags:
``` python
bench_p = bench_p / 100 # convert percent to [0-1] probability
```
%% Cell type:code id: tags:
``` python
bench_p = bench_p.map(ensure_attributes, biweekly=True)
```
%% Output
0%| | 0/1 [00:00<?, ?it/s]
By downloading data from this dataset, you agree to the terms and conditions defined at https://apps.ecmwf.int/datasets/data/s2s/licence/. If you do not agree with such terms, do not download the data.
100%|██████████| 1/1 [00:00<00:00, 4.34it/s]
100%|██████████| 1/1 [00:00<00:00, 4.22it/s]
%% Cell type:code id: tags:
``` python
# bench_p.isel(forecast_time=2).t2m.plot(row='lead_time', col='category')
```
%% Cell type:code id: tags:
``` python
bench_p
```
%% Output
<xarray.Dataset>
Dimensions: (category: 3, forecast_time: 53, latitude: 121, lead_time: 2, longitude: 240)
Coordinates:
* category (category) object 'below normal' 'near normal' 'above normal'
* forecast_time (forecast_time) datetime64[ns] 2020-01-02 ... 2020-12-31
* lead_time (lead_time) timedelta64[ns] 14 days 28 days
* latitude (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
* longitude (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
valid_time (forecast_time, lead_time) datetime64[ns] 2020-01-16 ... 2...
Data variables:
tp (category, forecast_time, lead_time, latitude, longitude) float32 ...
t2m (category, forecast_time, lead_time, latitude, longitude) float32 ...
%% Cell type:code id: tags:
``` python
bench_p.astype('float32').to_netcdf('../data/ecmwf_recalibrated_benchmark_2020_biweekly_terciled.nc')
```
%% Cell type:code id: tags:
``` python
#!renku dataset add s2s-ai-challenge data/ecmwf_recalibrated_benchmark_2020_biweekly_terciled.nc
```
import xarray as xr
import pandas as pd
import numpy as np
import climetlab_s2s_ai_challenge
import climetlab as cml
cache_path = '../data'
def download(varlist_forecast=['tp','t2m'],
center_list=['ecmwf'],
forecast_dataset_labels=['hindcast-input','forecast-input'],
obs_dataset_labels=['hindcast-like-observations','forecast-like-observations'],
varlist_observations=['t2m','tp'],
benchmark=True,
format='netcdf'
):
"""Download files with climetlab to cache_path. Set cache_path:
cml.settings.set("cache-directory", cache_path)
"""
if isinstance(center_list, str):
center_list = [center_list]
if isinstance(varlist_forecast, str):
varlist_forecast = [varlist_forecast]
dates = xr.cftime_range(start='20200102',freq='7D', periods=53).strftime('%Y%m%d').to_list()
if forecast_dataset_labels:
print(f'Downloads variables {varlist_forecast} from datasets {forecast_dataset_labels} from center {center_list} in {format} format.')
for center in center_list:
for ds in forecast_dataset_labels:
for parameter in varlist_forecast:
try:
cml.load_dataset(f"s2s-ai-challenge-{ds}", origin=center, parameter=parameter, format=format).to_xarray()
except Exception: # some center/parameter combinations may be unavailable
pass
if obs_dataset_labels:
print(f'Downloads variables tp and t2m from datasets {obs_dataset_labels} in netcdf format. Additionally downloads raw t2m and pr observations with a time dimension.')
try:
for ds in obs_dataset_labels:
for parameter in varlist_observations:
cml.load_dataset(f"s2s-ai-challenge-{ds}", date=dates, parameter=parameter).to_xarray()
except Exception:
pass
# raw
cml.load_dataset(f"s2s-ai-challenge-observations", parameter=varlist_observations).to_xarray()
if benchmark:
cml.load_dataset("s2s-ai-challenge-test-output-benchmark", parameter=['tp','t2m']).to_xarray()
print('finished')
return
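# example usage (a sketch; argument values are illustrative, not required):
# download(varlist_forecast=['t2m'], center_list=['ecmwf'],
#          forecast_dataset_labels=['hindcast-input'], benchmark=False)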
def add_valid_time_from_forecast_reference_time_and_lead_time(forecast, init_dim='forecast_time'):
"""Creates valid_time(forecast_time, lead_time).
lead_time: pd.Timedelta
forecast_time: datetime
"""
times = xr.concat(
[
xr.DataArray(
forecast[init_dim] + lead,
dims=init_dim,
coords={init_dim: forecast[init_dim]},
)
for lead in forecast.lead_time
],
dim="lead_time",
join="inner",
compat="broadcast_equals",
)
forecast = forecast.assign_coords(valid_time=times)
return forecast
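# example (sketch): adds valid_time(lead_time, forecast_time) = forecast_time + lead_time
# ds = add_valid_time_from_forecast_reference_time_and_lead_time(ds)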
def aggregate_biweekly(da):
"""
Aggregate initialized S2S forecasts biweekly for xr.DataArrays.
Use ds.map(aggregate_biweekly) for xr.Datasets.
Applies to the ECMWF S2S data model: https://confluence.ecmwf.int/display/S2S/Parameters
"""
# biweekly averaging
w34 = [pd.Timedelta(f'{i} d') for i in range(14,28)]
w34 = xr.DataArray(w34,dims='lead_time', coords={'lead_time':w34})
w56 = [pd.Timedelta(f'{i} d') for i in range(28,42)]
w56 = xr.DataArray(w56,dims='lead_time', coords={'lead_time':w56})
biweekly_lead = [pd.Timedelta(f"{i} d") for i in [14, 28]] # take first day of biweekly average as new coordinate
v = da.name
if climetlab_s2s_ai_challenge.CF_CELL_METHODS[v] == 'sum': # biweekly difference for sum variables: tp and ttr
d34 = da.sel(lead_time=pd.Timedelta("28 d")) - da.sel(lead_time=pd.Timedelta("14 d")) # tp from day 14 to day 27
d56 = da.sel(lead_time=pd.Timedelta("42 d")) - da.sel(lead_time=pd.Timedelta("28 d")) # tp from day 28 to day 41
da_biweekly = xr.concat([d34,d56],'lead_time').assign_coords(lead_time=biweekly_lead)
else: # t2m, see climetlab_s2s_ai_challenge.CF_CELL_METHODS # biweekly: mean [day 14, day 27]
d34 = da.sel(lead_time=w34).mean('lead_time')
d56 = da.sel(lead_time=w56).mean('lead_time')
da_biweekly = xr.concat([d34,d56],'lead_time').assign_coords(lead_time=biweekly_lead)
da_biweekly = add_valid_time_from_forecast_reference_time_and_lead_time(da_biweekly)
da_biweekly['lead_time'].attrs = {'long_name':'forecast_period', 'description': 'Forecast period is the time interval between the forecast reference time and the validity time.',
'aggregate': 'The pd.Timedelta corresponds to the first day of a biweekly aggregate.',
'week34_t2m': 'mean[day 14, 27]',
'week56_t2m': 'mean[day 28, 41]',
'week34_tp': 'day 28 minus day 14',
'week56_tp': 'day 42 minus day 28'}
return da_biweekly
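# example (sketch): aggregate a Dataset with daily lead times variable-wise
# ds_biweekly = ds.map(aggregate_biweekly)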
def ensure_attributes(da, biweekly=False):
"""Ensure that coordinates and variables have proper attributes. Set biweekly==True to set special comments for the biweely aggregates."""
template = cml.load_dataset('s2s-ai-challenge-test-input',parameter='t2m', origin='ecmwf', format='netcdf', date='20200102').to_xarray()
for c in da.coords:
if c in template.coords:
da.coords[c].attrs.update(template.coords[c].attrs)
if 'valid_time' in da.coords:
da['valid_time'].attrs.update({'long_name': 'validity time',
'standard_name': 'time',
'description': 'time for which the forecast is valid',
'calculate':'forecast_time + lead_time'})
if 'forecast_time' in da.coords:
da['forecast_time'].attrs.update({'long_name' : 'initial time of forecast', 'standard_name': 'forecast_reference_time',
'description':'The forecast reference time in NWP is the "data time", the time of the analysis from which the forecast was made. It is not the time for which the forecast is valid.'})
# fix tp
if da.name == 'tp':
da.attrs['units'] = 'kg m-2'
if biweekly:
da['lead_time'].attrs.update({'standard_name':'forecast_period', 'long_name': 'lead time',
'description': 'Forecast period is the time interval between the forecast reference time and the validity time.',
'aggregate': 'The pd.Timedelta corresponds to the first day of a biweekly aggregate.',
'week34_t2m': 'mean[14 days, 27 days]',
'week56_t2m': 'mean[28 days, 41 days]',
'week34_tp': '28 days minus 14 days',
'week56_tp': '42 days minus 28 days'})
if da.name == 'tp':
da.attrs.update({'aggregate_week34': '28 days minus 14 days',
'aggregate_week56': '42 days minus 28 days',
'description': 'https://confluence.ecmwf.int/display/S2S/S2S+Total+Precipitation'})
if da.name == 't2m':
da.attrs.update({'aggregate_week34': 'mean[14 days, 27 days]',
'aggregate_week56': 'mean[28 days, 41 days]',
'variable_before_categorization': 'https://confluence.ecmwf.int/display/S2S/S2S+Surface+Air+Temperature'})
return da
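# example (sketch): copy attributes from the climetlab template onto biweekly data
# ds = ds.map(ensure_attributes, biweekly=True)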
def add_year_week_coords(ds):
import numpy as np
if 'week' not in ds.coords and 'year' not in ds.coords:
year = ds.forecast_time.dt.year.to_index().unique()
week = (list(np.arange(1,54)))
weeks = week * len(year)
years = np.repeat(year,len(week))
ds.coords["week"] = ("forecast_time", weeks)
ds.coords['week'].attrs['description'] = "This week represents the number of forecast_time starting from 1 to 53. Note: This week is different from the ISO week from groupby('forecast_time.weekofyear'), see https://en.wikipedia.org/wiki/ISO_week_date and https://renkulab.io/gitlab/aaron.spring/s2s-ai-challenge/-/issues/29"
ds.coords["year"] = ("forecast_time", years)
ds.coords['year'].attrs['long_name'] = "calendar year"
return ds
def make_probabilistic(ds, tercile_edges, member_dim='realization', mask=None, groupby_coord='week'):
"""Compute probabilities from ds (observations or forecasts) based on tercile_edges."""
# broadcast
ds = add_year_week_coords(ds)
tercile_edges = tercile_edges.sel({groupby_coord: ds.coords[groupby_coord]})
bn = ds < tercile_edges.isel(category_edge=0, drop=True) # below normal
n = (ds >= tercile_edges.isel(category_edge=0, drop=True)) & (ds < tercile_edges.isel(category_edge=1, drop=True)) # normal
an = ds >= tercile_edges.isel(category_edge=1, drop=True) # above normal
if member_dim in ds.dims:
bn = bn.mean(member_dim)
an = an.mean(member_dim)
n = n.mean(member_dim)
ds_p = xr.concat([bn, n, an],'category').assign_coords(category=['below normal', 'near normal', 'above normal'])
if mask is not None:
ds_p = ds_p.where(mask)
if 'tp' in ds_p.data_vars:
# mask arid grid cells where category_edge are too close to 0
# we are using a dry mask as in https://doi.org/10.1175/MWR-D-17-0092.1
tp_arid_mask = tercile_edges.tp.isel(category_edge=0, lead_time=0, drop=True) > 0.01
ds_p['tp'] = ds_p['tp'].where(tp_arid_mask)
ds_p['category'].attrs = {'long_name': 'tercile category probabilities', 'units': '1',
'description': 'Probabilities for three tercile categories. All three tercile category probabilities must add up to 1.'}
ds_p['tp'].attrs = {'long_name': 'Probability of total precipitation in tercile categories', 'units': '1',
'comment': 'All three tercile category probabilities must add up to 1.',
'variable_before_categorization': 'https://confluence.ecmwf.int/display/S2S/S2S+Total+Precipitation'
}
ds_p['t2m'].attrs = {'long_name': 'Probability of 2m temperature in tercile categories', 'units': '1',
'comment': 'All three tercile category probabilities must add up to 1.',
'variable_before_categorization': 'https://confluence.ecmwf.int/display/S2S/S2S+Surface+Air+Temperature'
}
if 'year' in ds_p.coords:
del ds_p.coords['year']
if groupby_coord in ds_p.coords:
ds_p = ds_p.drop(groupby_coord)
return ds_p
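# examples (sketch): one-hot probabilities for observations, member fractions for ensembles
# obs_p = make_probabilistic(obs, tercile_edges, mask=mask)
# fct_p = make_probabilistic(fct, tercile_edges, member_dim='realization')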
def skill_by_year(preds, adapt=False):
"""Returns pd.Dataframe of RPSS per year."""
# similar to verification_RPSS.ipynb
# like the scorer bot, but returns a score for each year
import xarray as xr
import xskillscore as xs
import pandas as pd
import numpy as np
xr.set_options(keep_attrs=True)
# from root
#renku storage pull data/forecast-like-observations_2020_biweekly_terciled.nc
#renku storage pull data/hindcast-like-observations_2000-2019_biweekly_terciled.nc
cache_path = '../data'
if 2020 in preds.forecast_time.dt.year:
obs_p = xr.open_dataset(f'{cache_path}/forecast-like-observations_2020_biweekly_terciled.nc').sel(forecast_time=preds.forecast_time)
else:
obs_p = xr.open_dataset(f'{cache_path}/hindcast-like-observations_2000-2019_biweekly_terciled.zarr', engine='zarr').sel(forecast_time=preds.forecast_time)
# ML probabilities
fct_p = preds
# climatology
clim_p = xr.DataArray([1/3, 1/3, 1/3], dims='category', coords={'category':['below normal', 'near normal', 'above normal']}).to_dataset(name='tp')
clim_p['t2m'] = clim_p['tp']
if adapt:
# select only obs_p where fct_p forecasts provided
for c in ['longitude', 'latitude', 'forecast_time', 'lead_time']:
obs_p = obs_p.sel({c:fct_p[c]})
obs_p = obs_p[list(fct_p.data_vars)]
clim_p = clim_p[list(fct_p.data_vars)]
else:
# check inputs
assert_predictions_2020(obs_p)
assert_predictions_2020(fct_p)
# rps_ML
rps_ML = xs.rps(obs_p, fct_p, category_edges=None, dim=[], input_distributions='p').compute()
# rps_clim
rps_clim = xs.rps(obs_p, clim_p, category_edges=None, dim=[], input_distributions='p').compute()
## RPSS
# penalize # https://renkulab.io/gitlab/aaron.spring/s2s-ai-challenge-template/-/issues/7
expect = obs_p.sum('category')
expect = expect.where(expect > 0.98).where(expect < 1.02) # should be True if not all NaN
# https://renkulab.io/gitlab/aaron.spring/s2s-ai-challenge-template/-/issues/50
rps_ML = rps_ML.where(expect, other=2) # assign RPS=2 where value was expected but NaN found
# following Weigel 2007: https://doi.org/10.1175/MWR3280.1
rpss = 1 - (rps_ML.groupby('forecast_time.year').mean() / rps_clim.groupby('forecast_time.year').mean())
# clip
rpss = rpss.clip(-10, 1)
# weighted area mean
weights = np.cos(np.deg2rad(np.abs(rpss.latitude)))
# spatially weighted score averaged over lead_times and variables to one single value
scores = rpss.sel(latitude=slice(None, -60)).weighted(weights).mean('latitude').mean('longitude')
scores = scores.to_array().mean(['lead_time', 'variable'])
return scores.to_dataframe('RPSS')
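# example (sketch): RPSS per year for 2020 predictions, as a pd.DataFrame
# skill_by_year(preds)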
def assert_predictions_2020(preds_test, exclude='weekofyear'):
"""Check the variables, coordinates and dimensions of 2020 predictions."""
from xarray.testing import assert_equal # doesn't care about attrs but checks coords
# is dataset
assert isinstance(preds_test, xr.Dataset)
# has both vars: tp and t2m
if 'data_vars' not in exclude:
assert 'tp' in preds_test.data_vars
assert 't2m' in preds_test.data_vars
## coords
# ignore weekofyear coord if not dim
if 'weekofyear' in exclude and 'weekofyear' in preds_test.coords and 'weekofyear' not in preds_test.dims:
preds_test = preds_test.drop('weekofyear')
# forecast_time
if 'forecast_time' not in exclude:
d = pd.date_range(start='2020-01-02', freq='7D', periods=53)
forecast_time = xr.DataArray(d, dims='forecast_time', coords={'forecast_time':d}, name='forecast_time')
assert_equal(forecast_time, preds_test['forecast_time'])
# longitude
if 'longitude' not in exclude:
lon = np.arange(0., 360., 1.5)
longitude = xr.DataArray(lon, dims='longitude', coords={'longitude': lon}, name='longitude')
assert_equal(longitude, preds_test['longitude'])
# latitude
if 'latitude' not in exclude:
lat = np.arange(-90., 90.1, 1.5)[::-1]
latitude = xr.DataArray(lat, dims='latitude', coords={'latitude': lat}, name='latitude')
assert_equal(latitude, preds_test['latitude'])
# lead_time
if 'lead_time' not in exclude:
lead = [pd.Timedelta(f'{i} d') for i in [14, 28]]
lead_time = xr.DataArray(lead, dims='lead_time', coords={'lead_time': lead}, name='lead_time')
assert_equal(lead_time, preds_test['lead_time'])
# category
if 'category' not in exclude:
cat = np.array(['below normal', 'near normal', 'above normal'], dtype='<U12')
category = xr.DataArray(cat, dims='category', coords={'category': cat}, name='category')
assert_equal(category, preds_test['category'])
# size
if 'size' not in exclude:
from dask.utils import format_bytes
size_in_MB = float(format_bytes(preds_test.nbytes).split(' ')[0])
# todo: refine for dtypes
assert size_in_MB > 50
assert size_in_MB < 250
# no other dims
if 'dims' not in exclude:
assert set(preds_test.dims) - {'category', 'forecast_time', 'latitude', 'lead_time', 'longitude'} == set()
"*.nc" filter=lfs diff=lfs merge=lfs -text
echo remove previous zarrs from renku dataset
renku dataset unlink s2s-ai-challenge --include "data/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr/**/**" -y
renku dataset unlink s2s-ai-challenge --include "data/hindcast-like-observations_2000-2019_biweekly_terciled.zarr/**/**" -y
renku dataset unlink s2s-ai-challenge --include "data/ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr/**/**" -y
renku dataset unlink s2s-ai-challenge --include "data/forecast-like-observations_2020_biweekly_deterministic.zarr/**/**" -y
renku dataset unlink s2s-ai-challenge --include "data/ecmwf_forecast-input_2020_biweekly_deterministic.zarr/**/**" -y
renku dataset unlink s2s-ai-challenge --include "data/**/**" -y
# observations and derived
renku dataset add s2s-ai-challenge data/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr
renku dataset add s2s-ai-challenge data/forecast-like-observations_2020_biweekly_deterministic.zarr
renku dataset add s2s-ai-challenge -o data/hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc
renku dataset add s2s-ai-challenge -o data/forecast-like-observations_2020_biweekly_terciled.nc
renku dataset add s2s-ai-challenge data/hindcast-like-observations_2000-2019_biweekly_terciled.zarr
# benchmark
renku dataset add s2s-ai-challenge -o data/ecmwf_recalibrated_benchmark_2020_biweekly_terciled.nc
# forecast / hindcast
for center in ecmwf; do
renku dataset add s2s-ai-challenge data/${center}_hindcast-input_2000-2019_biweekly_deterministic.zarr
renku dataset add s2s-ai-challenge data/${center}_forecast-input_2020_biweekly_deterministic.zarr
done
renku dataset ls-files s2s-ai-challenge
renku dataset ls-tags s2s-ai-challenge
echo consider new tag: renku dataset tag -d description s2s-ai-challenge 0.x