%% Cell type:markdown id: tags:
# Create biweekly renku datasets from `climetlab-s2s-ai-challenge`
%% Cell type:markdown id: tags:
Goal:
- Create biweekly renku datasets from [`climetlab-s2s-ai-challenge`](https://github.com/ecmwf-lab/climetlab-s2s-ai-challenge).
- These renku datasets are then used in the notebooks:
    - `ML_train_and_predict.ipynb` to train the ML model and make ML-based predictions
    - `RPSS_verification.ipynb` to calculate the RPSS of the ML model
    - `mean_bias_reduction.ipynb` to remove the mean bias

Requirements:
- [`climetlab`](https://github.com/ecmwf/climetlab)
- [`climetlab-s2s-ai-challenge`](https://github.com/ecmwf-lab/climetlab-s2s-ai-challenge)
- S2S and CPC observations uploaded to the [European Weather Cloud (EWC)](https://storage.ecmwf.europeanweather.cloud/s2s-ai-challenge/data/training-input/0.3.0/netcdf/index.html)
Output: [renku dataset](https://renku-python.readthedocs.io/en/latest/commands.html#module-renku.cli.dataset) `s2s-ai-challenge`
- observations
    - deterministic:
        - `hindcast-like-observations_2000-2019_biweekly_deterministic.zarr`
        - `forecast-like-observations_2020_biweekly_deterministic.zarr`
    - edges:
        - `hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc`
    - probabilistic:
        - `hindcast-like-observations_2000-2019_biweekly_terciled.zarr`
        - `forecast-like-observations_2020_biweekly_terciled.nc`
- forecasts/hindcasts
    - deterministic:
        - `ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr`
        - `ecmwf_forecast-input_2020_biweekly_deterministic.zarr`
        - more models could be added
- benchmark:
    - probabilistic:
        - `ecmwf_recalibrated_benchmark_2020_biweekly_terciled.nc`
%% Cell type:code id: tags:
``` python
#!renku dataset create s2s-ai-challenge
```
%% Cell type:code id: tags:
``` python
import climetlab as cml
import climetlab_s2s_ai_challenge
import matplotlib.pyplot as plt
import pandas as pd
import xarray as xr
import xskillscore as xs

print(f'Climetlab version : {cml.__version__}')
print(f'Climetlab-s2s-ai-challenge plugin version : {climetlab_s2s_ai_challenge.__version__}')
xr.set_options(keep_attrs=True)
xr.set_options(display_style='text')
```
%% Output
WARNING: ecmwflibs universal: cannot find a library called MagPlus
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/climetlab/plotting/drivers/magics/actions.py:36: UserWarning: Magics library could not be found
warnings.warn(str(e))
Magics library could not be found
Climetlab version : 0.8.6
Climetlab-s2s-ai-challenge plugin version : 0.8.0
<xarray.core.options.set_options at 0x2b51c148f590>
%% Cell type:code id: tags:
``` python
# caching path for climetlab
cache_path = "/work/mh0727/m300524/S2S_AI/cache" # set your own path
cache_path = "/work/mh0727/m300524/S2S_AI/cache4" # set your own path
cml.settings.set("cache-directory", cache_path)
```
%% Cell type:code id: tags:
``` python
cache_path = "../data"
```
%% Cell type:markdown id: tags:
# Download and cache
Download all files for the observations, forecast and hindcast.
%% Cell type:code id: tags:
``` python
dates = xr.cftime_range(start='20200102',freq='7D', periods=53).strftime('%Y%m%d').to_list()
dates[:5]
# shortcut
from scripts import download
#download()
```
%% Output
['20200102', '20200109', '20200116', '20200123', '20200130']
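%% Cell type:markdown id: tags:
A minimal usage sketch for the `scripts.download` shortcut (its full signature is listed in `scripts.py` at the bottom of this page). Like the renku commands, it stays commented out here because it triggers large downloads:
%% Cell type:code id: tags:
``` python
# sketch: fetch only the ECMWF t2m forecasts/hindcasts plus the matching observations
# cml.settings.set("cache-directory", cache_path)
# download(varlist_forecast='t2m', center_list='ecmwf', benchmark=False)
```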
%% Cell type:markdown id: tags:
## observations `output-reference`
%% Cell type:code id: tags:
``` python
obs_dataset_labels = ['training-output-reference','test-output-reference'] # ML community
# equiv to
obs_dataset_labels = ['hindcast-like-observations','forecast-like-observations'] # NWP community
varlist_obs = ['tp','t2m']
```
%% Cell type:code id: tags:
``` python
%%time
# takes ~10 min to download
for ds in obs_dataset_labels:
    print(ds)
    # obs only available as netcdf, no format choice
    cml.load_dataset(f"s2s-ai-challenge-{ds}", date=dates, parameter=varlist_obs).to_xarray()
```
%% Cell type:markdown id: tags:
## hindcast and forecast `input`
%% Cell type:code id: tags:
``` python
forecast_dataset_labels = ['training-input','test-input'] # ML community
# equiv to
forecast_dataset_labels = ['hindcast-input','forecast-input'] # NWP community
varlist_forecast = ['tp','t2m'] # can add more
center_list = ['ecmwf'] # 'ncep', 'eccc'
```
%% Cell type:code id: tags:
``` python
%%time
# takes 4-5h for one model; one variable takes ~10-30 min to download, depending on the number of
# model realizations and download settings https://climetlab.readthedocs.io/en/latest/guide/settings.html
for center in center_list:
    for v in varlist_forecast:
        for d in dates:
            for ds in forecast_dataset_labels:
                print(ds, v, d)
                cml.load_dataset(f"s2s-ai-challenge-{ds}", origin=center, date=d, parameter=v, format='netcdf').to_xarray()

# download raw observations with a time dimension to create output-reference/observations
# for models other than ecmwf and eccc, i.e. ncep or any other S2S or SubX model
obs_time = cml.load_dataset(f"s2s-ai-challenge-observations", parameter=['t2m', 'pr']).to_xarray()
```
%% Cell type:markdown id: tags:
# create bi-weekly aggregates
%% Cell type:code id: tags:
``` python
from scripts import add_valid_time_from_forecast_reference_time_and_lead_time
from scripts import aggregate_biweekly, ensure_attributes
#aggregate_biweekly??
```
%% Cell type:code id: tags:
``` python
w34 = [pd.Timedelta(f'{i} d') for i in range(14,28)]
w34 = xr.DataArray(w34,dims='lead_time', coords={'lead_time':w34})
w34
```
%% Output
<xarray.DataArray (lead_time: 14)>
array([1209600000000000, 1296000000000000, 1382400000000000,
       1468800000000000, 1555200000000000, 1641600000000000,
       1728000000000000, 1814400000000000, 1900800000000000,
       1987200000000000, 2073600000000000, 2160000000000000,
       2246400000000000, 2332800000000000], dtype='timedelta64[ns]')
Coordinates:
  * lead_time  (lead_time) timedelta64[ns] 14 days 15 days ... 26 days 27 days
%% Cell type:code id: tags:
``` python
w56 = [pd.Timedelta(f'{i} d') for i in range(28,42)]
w56 = xr.DataArray(w56, dims='lead_time', coords={'lead_time': w56})
w56
```
%% Output
<xarray.DataArray (lead_time: 14)>
array([2419200000000000, 2505600000000000, 2592000000000000,
       2678400000000000, 2764800000000000, 2851200000000000,
       2937600000000000, 3024000000000000, 3110400000000000,
       3196800000000000, 3283200000000000, 3369600000000000,
       3456000000000000, 3542400000000000], dtype='timedelta64[ns]')
Coordinates:
  * lead_time  (lead_time) timedelta64[ns] 28 days 29 days ... 40 days 41 days
%% Cell type:code id: tags:
``` python
biweekly_lead = [pd.Timedelta(f"{i} d") for i in [14, 28]] # take first day of biweekly average as new coordinate
# alternative: use 21 and 35, the mid-points of week 3-4 and week 5-6
```
%% Cell type:code id: tags:
``` python
%%time
# takes 4-5h for one model
for c, center in enumerate(center_list): # forecast centers (could also take more models)
    for dsl in obs_dataset_labels + forecast_dataset_labels: # climetlab dataset labels
        for p, parameter in enumerate(varlist_forecast): # variables
            if c != 0 and 'observation' in dsl: # only do once for observations
                continue
            print(f"datasetlabel: {dsl}, center: {center}, parameter: {parameter}")
            if 'input' in dsl:
                ds = cml.load_dataset(f"s2s-ai-challenge-{dsl}", origin=center, parameter=parameter, format='netcdf').to_xarray()
            elif 'observation' in dsl: # obs only netcdf, no format choice
                if parameter not in ['t2m', 'tp']:
                    continue
                ds = cml.load_dataset(f"s2s-ai-challenge-{dsl}", parameter=parameter, date=dates).to_xarray()
            if p == 0:
                ds_biweekly = ds.map(aggregate_biweekly)
            else:
                ds_biweekly[parameter] = ds.map(aggregate_biweekly)[parameter]
        ds_biweekly = ds_biweekly.map(ensure_attributes, biweekly=True)
        ds_biweekly = ds_biweekly.sortby('forecast_time')
        if 'test' in dsl:
            ds_biweekly = ds_biweekly.chunk('auto')
        else:
            ds_biweekly = ds_biweekly.chunk({'forecast_time':'auto','lead_time':-1,'longitude':-1,'latitude':-1})
        if 'hindcast' in dsl:
            time = f'{int(ds_biweekly.forecast_time.dt.year.min())}-{int(ds_biweekly.forecast_time.dt.year.max())}'
        elif 'forecast' in dsl:
            time = '2020'
        if 'input' in dsl:
            name = f'{center}_{dsl}'
        elif 'observations' in dsl:
            name = dsl
        else:
            assert False
        # pattern: {model_if_not_observations_}{observations/forecast/hindcast}_{time}_biweekly_deterministic.zarr
        zp = f'{cache_path}/{name}_{time}_biweekly_deterministic.zarr'
        ds_biweekly.attrs.update({'postprocessed_by':'https://renkulab.io/gitlab/aaron.spring/s2s-ai-challenge-template/-/blob/master/notebooks/renku_datasets_biweekly.ipynb'})
        print(f'save to: {zp}')
        ds_biweekly.astype('float32').to_zarr(zp, consolidated=True, mode='w')
```
%% Output
datasetlabel: hindcast-like-observations, center: ecmwf, parameter: tp
By downloading data from this dataset, you agree to the terms and conditions defined at https://apps.ecmwf.int/datasets/data/s2s/licence/. If you do not agree with such terms, do not download the data. This dataset has been dowloaded from IRIDL. By downloading this data you also agree to the terms and conditions defined at https://iridl.ldeo.columbia.edu.
datasetlabel: hindcast-like-observations, center: ecmwf, parameter: t2m
save to: ../data/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr
datasetlabel: forecast-like-observations, center: ecmwf, parameter: tp
datasetlabel: forecast-like-observations, center: ecmwf, parameter: t2m
save to: ../data/forecast-like-observations_2020_biweekly_deterministic.zarr
datasetlabel: hindcast-input, center: ecmwf, parameter: tp
By downloading data from this dataset, you agree to the terms and conditions defined at https://apps.ecmwf.int/datasets/data/s2s/licence/. If you do not agree with such terms, do not download the data.
datasetlabel: hindcast-input, center: ecmwf, parameter: t2m
save to: ../data/ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr
datasetlabel: forecast-input, center: ecmwf, parameter: tp
datasetlabel: forecast-input, center: ecmwf, parameter: t2m
save to: ../data/ecmwf_forecast-input_2020_biweekly_deterministic.zarr
WARNING: ecmwflibs universal: found eccodes at /work/mh0727/m300524/conda-envs/s2s-ai/lib/libeccodes.so
Warning: ecCodes 2.21.0 or higher is recommended. You are running version 2.18.0
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/xarray/core/indexing.py:1379: PerformanceWarning: Slicing with an out-of-order index is generating 20 times more chunks
  return self.array[key]
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/dask/array/numpy_compat.py:40: RuntimeWarning: invalid value encountered in true_divide
  x = np.divide(x1, x2, out)
%% Cell type:markdown id: tags:
## add to `renku` dataset `s2s-ai-challenge`
%% Cell type:code id: tags:
``` python
# observations as hindcast
# run renku commands from the project's root directory only
# !renku dataset add s2s-ai-challenge data/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr
```
%% Cell type:code id: tags:
``` python
# for further use retrieve from git lfs
# !renku storage pull ../data/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr
```
%% Cell type:code id: tags:
``` python
obs_2000_2019 = xr.open_zarr(f"{cache_path}/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr", consolidated=True)
print(obs_2000_2019.sizes,'\n',obs_2000_2019.coords,'\n', obs_2000_2019.nbytes/1e6,'MB')
```
%% Output
Frozen(SortedKeysDict({'forecast_time': 1060, 'latitude': 121, 'lead_time': 2, 'longitude': 240}))
Coordinates:
* forecast_time (forecast_time) datetime64[ns] 2000-01-02 ... 2019-12-31
* latitude (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
* lead_time (lead_time) timedelta64[ns] 14 days 28 days
* longitude (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
valid_time (lead_time, forecast_time) datetime64[ns] dask.array<chunksize=(2, 1060), meta=np.ndarray>
492.546744 MB
%% Cell type:code id: tags:
``` python
# observations as forecast
# run renku commands from the project's root directory only
# !renku dataset add s2s-ai-challenge data/forecast-like-observations_2020_biweekly_deterministic.zarr
```
%% Cell type:code id: tags:
``` python
# for further use retrieve from git lfs
# !renku storage pull ../data/forecast-like-observations_2020_biweekly_deterministic.zarr
```
%% Cell type:code id: tags:
``` python
obs_2020 = xr.open_zarr(f"{cache_path}/forecast-like-observations_2020_biweekly_deterministic.zarr", consolidated=True)
print(obs_2020.sizes,'\n',obs_2020.coords,'\n', obs_2020.nbytes/1e6,'MB')
```
%% Output
Frozen(SortedKeysDict({'forecast_time': 53, 'latitude': 121, 'lead_time': 2, 'longitude': 240}))
Coordinates:
* forecast_time (forecast_time) datetime64[ns] 2020-01-02 ... 2020-12-31
* latitude (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
* lead_time (lead_time) timedelta64[ns] 14 days 28 days
* longitude (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
valid_time (lead_time, forecast_time) datetime64[ns] dask.array<chunksize=(2, 53), meta=np.ndarray>
24.630096 MB
%% Cell type:code id: tags:
``` python
# ecmwf hindcast-input
# run renku commands from the project's root directory only
# !renku dataset add s2s-ai-challenge data/ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr
```
%% Cell type:code id: tags:
``` python
# for further use retrieve from git lfs
# !renku storage pull ../data/ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr
hind_2000_2019 = xr.open_zarr(f"{cache_path}/ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr", consolidated=True)
print(hind_2000_2019.sizes,'\n',hind_2000_2019.coords,'\n', hind_2000_2019.nbytes/1e6,'MB')
```
%% Output
Frozen(SortedKeysDict({'forecast_time': 1060, 'latitude': 121, 'lead_time': 2, 'longitude': 240, 'realization': 11}))
Coordinates:
  * forecast_time  (forecast_time) datetime64[ns] 2000-01-02 ... 2019-12-31
  * latitude       (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
  * lead_time      (lead_time) timedelta64[ns] 14 days 28 days
  * longitude      (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
  * realization    (realization) int64 0 1 2 3 4 5 6 7 8 9 10
    valid_time     (lead_time, forecast_time) datetime64[ns] dask.array<chunksize=(2, 1060), meta=np.ndarray>
5417.730832 MB
%% Cell type:code id: tags:
``` python
# ecmwf forecast-input
# run renku commands from the project's root directory only
# !renku dataset add s2s-ai-challenge data/ecmwf_forecast-input_2020_biweekly_deterministic.zarr
```
%% Cell type:code id: tags:
``` python
# for further use retrieve from git lfs
# !renku storage pull ../data/ecmwf_forecast-input_2020_biweekly_deterministic.zarr
fct_2020 = xr.open_zarr(f"{cache_path}/ecmwf_forecast-input_2020_biweekly_deterministic.zarr", consolidated=True)
print(fct_2020.sizes,'\n',fct_2020.coords,'\n', fct_2020.nbytes/1e6,'MB')
```
%% Output
Frozen(SortedKeysDict({'forecast_time': 53, 'latitude': 121, 'lead_time': 2, 'longitude': 240, 'realization': 51}))
Coordinates:
  * forecast_time  (forecast_time) datetime64[ns] 2020-01-02 ... 2020-12-31
  * latitude       (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
  * lead_time      (lead_time) timedelta64[ns] 14 days 28 days
  * longitude      (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
  * realization    (realization) int64 0 1 2 3 4 5 6 7 ... 44 45 46 47 48 49 50
    valid_time     (lead_time, forecast_time) datetime64[ns] dask.array<chunksize=(2, 53), meta=np.ndarray>
1255.926504 MB
%% Cell type:markdown id: tags:
# tercile edges
Create the two tercile edges at the 1/3 and 2/3 quantiles of the 2000-2019 biweekly distribution for each week of the year.
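%% Cell type:markdown id: tags:
A minimal toy sketch of this computation (synthetic data, not the challenge grid): group by `week` so that each group holds the 20 values from the 20 years, then take the 1/3 and 2/3 quantiles over `forecast_time`:
%% Cell type:code id: tags:
``` python
# toy example: 20 "years" x 53 "weeks" of random data
import numpy as np
toy = xr.DataArray(np.random.rand(20 * 53), dims='forecast_time',
                   coords={'week': ('forecast_time', list(range(1, 54)) * 20)})
toy_edges = (toy.groupby('week')
                .quantile(q=[1./3., 2./3.], dim='forecast_time')
                .rename({'quantile': 'category_edge'}))
toy_edges.sizes  # (week: 53, category_edge: 2)
```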
%% Cell type:code id: tags:
``` python
obs_2000_2019 = xr.open_zarr(f'{cache_path}/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr', consolidated=True)
```
%% Cell type:code id: tags:
``` python
from scripts import add_year_week_coords
```
%% Cell type:code id: tags:
``` python
# add week for groupby, see https://renkulab.io/gitlab/aaron.spring/s2s-ai-challenge/-/issues/29
obs_2000_2019 = add_year_week_coords(obs_2000_2019)
```
%% Output
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/xarray/core/accessor_dt.py:383: FutureWarning: dt.weekofyear and dt.week have been deprecated. Please use dt.isocalendar().week instead.
FutureWarning,
%% Cell type:code id: tags:
``` python
obs_2000_2019
```
%% Output
<xarray.Dataset>
Dimensions: (forecast_time: 1060, latitude: 121, lead_time: 2, longitude: 240)
Coordinates:
* forecast_time (forecast_time) datetime64[ns] 2000-01-02 ... 2019-12-31
* latitude (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
* lead_time (lead_time) timedelta64[ns] 14 days 28 days
* longitude (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
valid_time (lead_time, forecast_time) datetime64[ns] dask.array<chunksize=(2, 1060), meta=np.ndarray>
week (forecast_time) int64 1 2 3 4 5 6 7 ... 47 48 49 50 51 52 53
year (forecast_time) int64 2000 2000 2000 2000 ... 2019 2019 2019
Data variables:
t2m (lead_time, forecast_time, latitude, longitude) float32 dask.array<chunksize=(2, 530, 121, 240), meta=np.ndarray>
tp (lead_time, forecast_time, latitude, longitude) float32 dask.array<chunksize=(2, 530, 121, 240), meta=np.ndarray>
Attributes:
created_by_script: tools/observations/makefile
created_by_software: climetlab-s2s-ai-challenge
function: climetlab_s2s_ai_challenge.extra.forecast_like_obse...
postprocessed_by: https://renkulab.io/gitlab/aaron.spring/s2s-ai-chal...
regrid_method: conservative
source_dataset_name: NOAA NCEP CPC UNIFIED_PRCP GAUGE_BASED GLOBAL v1p0 ...
source_hosting: IRIDL
source_url: http://iridl.ldeo.columbia.edu/SOURCES/.NOAA/.NCEP/...
%% Cell type:code id: tags:
``` python
tercile_file = f'{cache_path}/hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc'
```
%% Cell type:code id: tags:
``` python
%%time
obs_2000_2019.chunk({'forecast_time':-1,'longitude':'auto'}).groupby('week').quantile(q=[1./3.,2./3.], dim='forecast_time').rename({'quantile':'category_edge'}).astype('float32').to_netcdf(tercile_file)
```
%% Output
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/xarray/core/accessor_dt.py:381: FutureWarning: dt.weekofyear and dt.week have been deprecated. Please use dt.isocalendar().week instead.
FutureWarning,
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/numpy/lib/nanfunctions.py:1390: RuntimeWarning: All-NaN slice encountered
overwrite_input, interpolation)
CPU times: user 19min 35s, sys: 8min 33s, total: 28min 9s
Wall time: 16min 44s
%% Cell type:code id: tags:
``` python
tercile_edges = xr.open_dataset(tercile_file)
tercile_edges
```
%% Output
<xarray.Dataset>
Dimensions:        (category_edge: 2, latitude: 121, lead_time: 2, longitude: 240, week: 53)
Coordinates:
  * latitude       (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
  * lead_time      (lead_time) timedelta64[ns] 14 days 28 days
  * longitude      (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
  * category_edge  (category_edge) float64 0.3333 0.6667
  * week           (week) int64 1 2 3 4 5 6 7 8 9 ... 45 46 47 48 49 50 51 52 53
Data variables:
    t2m            (week, category_edge, lead_time, latitude, longitude) float32 ...
    tp             (week, category_edge, lead_time, latitude, longitude) float32 ...
Attributes:
created_by_script: tools/observations/makefile
created_by_software: climetlab-s2s-ai-challenge
function: climetlab_s2s_ai_challenge.extra.forecast_like_obse...
postprocessed_by: https://renkulab.io/gitlab/aaron.spring/s2s-ai-chal...
regrid_method: conservative
source_dataset_name: NOAA NCEP CPC UNIFIED_PRCP GAUGE_BASED GLOBAL v1p0 ...
source_hosting: IRIDL
source_url: http://iridl.ldeo.columbia.edu/SOURCES/.NOAA/.NCEP/...
%% Cell type:code id: tags:
``` python
tercile_edges.nbytes*1e-6,'MB'
```
%% Output
(49.255184, 'MB')
%% Cell type:code id: tags:
``` python
# run renku commands from the project's root directory only
# tercile edges
#!renku dataset add s2s-ai-challenge data/hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc
```
%% Cell type:code id: tags:
``` python
# to use retrieve from git lfs
#!renku storage pull ../data/hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc
#xr.open_dataset("../data/hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc")
```
%% Cell type:markdown id: tags:
# observations in categories
- count how many deterministic forecast realizations fall into each tercile category (as done when computing the RPS)
- categorize the 2020 forecast-like observations into tercile categories
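%% Cell type:markdown id: tags:
A toy sketch of the categorization step (synthetic numbers, mirroring the comparisons in `scripts.make_probabilistic`): a value is compared against the two edges and becomes a one-hot vector over the three categories, i.e. an observed "probability" of 0 or 1:
%% Cell type:code id: tags:
``` python
# toy example: edges at 10 and 20, observed value 17 -> 'near normal'
toy_obs = xr.DataArray(17.)
toy_edge = xr.DataArray([10., 20.], dims='category_edge')
bn = toy_obs < toy_edge.isel(category_edge=0)                                                 # below normal
n = (toy_obs >= toy_edge.isel(category_edge=0)) & (toy_obs < toy_edge.isel(category_edge=1))  # near normal
an = toy_obs >= toy_edge.isel(category_edge=1)                                                # above normal
xr.concat([bn, n, an], 'category').astype('int')  # -> [0, 1, 0]
```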
%% Cell type:code id: tags:
``` python
obs_2020 = xr.open_zarr(f'{cache_path}/forecast-like-observations_2020_biweekly_deterministic.zarr', consolidated=True)
obs_2020.sizes
```
%% Output
Frozen(SortedKeysDict({'forecast_time': 53, 'latitude': 121, 'lead_time': 2, 'longitude': 240}))
%% Cell type:code id: tags:
``` python
# create a mask for land grid
mask = obs_2020.std(['lead_time','forecast_time']).notnull()
```
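%% Cell type:markdown id: tags:
The mask is derived from the observations themselves: the standard deviation over all times is `NaN` exactly where CPC provides no data (ocean), so `notnull()` yields a land mask per variable. A quick, unweighted sanity check (sketch):
%% Cell type:code id: tags:
``` python
# fraction of grid cells that are land, per variable (unweighted in latitude)
mask.to_array().astype('float32').mean(['latitude','longitude']).compute()
```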
%% Cell type:code id: tags:
``` python
# mask.to_array().plot(col='variable')
```
%% Cell type:code id: tags:
``` python
tercile_edges = xr.open_dataset(tercile_file)
```
%% Cell type:code id: tags:
``` python
tercile_edges.tp
```
%% Output
<xarray.DataArray 'tp' (week: 53, category_edge: 2, lead_time: 2, latitude: 121, longitude: 240)>
[6156480 values with dtype=float32]
Coordinates:
  * latitude       (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
  * lead_time      (lead_time) timedelta64[ns] 14 days 28 days
  * longitude      (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
  * category_edge  (category_edge) float64 0.3333 0.6667
  * week           (week) int64 1 2 3 4 5 6 7 8 ... 47 48 49 50 51 52 53
Attributes:
comment: precipitation accumulated since lead_time including 0 days
long_name: total precipitation
pointwidth: 0
standard_name: precipitation_amount
units: kg m-2
%% Cell type:code id: tags:
``` python
# total precipitation in arid regions is masked
# Frederic Vitart suggested by email: "Based on your map we could mask all the areas where the lower tercile boundary is lower than 0.1 mm"
# we are using a dry mask as in https://doi.org/10.1175/MWR-D-17-0092.1
th = 0.01
tp_arid_mask = tercile_edges.tp.isel(category_edge=0, lead_time=0, drop=True) > th
#tp_arid_mask.where(mask.tp).plot(col='forecast_time', col_wrap=4)
#plt.suptitle(f'dry mask: week 3-4 tp 1/3 category_edge > {th} kg m-2',y=1., x=.4)
#plt.savefig('dry_mask.png')
```
%% Cell type:code id: tags:
``` python
# look into tercile edges
# tercile_edges.isel(week=0)['tp'].plot(col='lead_time', row='category_edge', robust=True)
# tercile_edges.isel(week=[0,20], category_edge=1)['tp'].plot(col='lead_time', row='week', robust=True)
# tercile_edges.tp.mean(['week']).plot(col='lead_time', row='category_edge', vmax=.5)
# tp_arid_mask.isel(week=[0,10,20,30,40]).plot(col='week')
```
%% Cell type:markdown id: tags:
## categorize observations
### forecast 2020
%% Cell type:code id: tags:
``` python
from scripts import make_probabilistic
```
%% Cell type:code id: tags:
``` python
obs_2020_p = make_probabilistic(obs_2020, tercile_edges, mask=mask)
```
%% Output
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/xarray/core/accessor_dt.py:381: FutureWarning: dt.weekofyear and dt.week have been deprecated. Please use dt.isocalendar().week instead.
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/xarray/core/accessor_dt.py:383: FutureWarning: dt.weekofyear and dt.week have been deprecated. Please use dt.isocalendar().week instead.
FutureWarning,
%% Cell type:code id: tags:
``` python
obs_2020_p.nbytes/1e6, 'MB'
```
%% Output
(147.75984, 'MB')
%% Cell type:code id: tags:
``` python
obs_2020_p
```
%% Output
<xarray.Dataset>
Dimensions: (category: 3, forecast_time: 53, latitude: 121, lead_time: 2, longitude: 240)
Coordinates:
* forecast_time (forecast_time) datetime64[ns] 2020-01-02 ... 2020-12-31
* latitude (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
* lead_time (lead_time) timedelta64[ns] 14 days 28 days
* longitude (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
valid_time (lead_time, forecast_time) datetime64[ns] dask.array<chunksize=(2, 53), meta=np.ndarray>
* category (category) <U12 'below normal' 'near normal' 'above normal'
Data variables:
t2m (category, lead_time, forecast_time, latitude, longitude) float64 dask.array<chunksize=(1, 2, 53, 121, 240), meta=np.ndarray>
tp (category, lead_time, forecast_time, latitude, longitude) float64 dask.array<chunksize=(1, 2, 53, 121, 240), meta=np.ndarray>
%% Cell type:code id: tags:
``` python
obs_2020_p.astype('float32').to_netcdf(f'{cache_path}/forecast-like-observations_2020_biweekly_terciled.nc')
```
%% Output
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/dask/array/numpy_compat.py:40: RuntimeWarning: invalid value encountered in true_divide
x = np.divide(x1, x2, out)
%% Cell type:code id: tags:
``` python
# forecast-like-observations terciled
# run renku commands from the project's root directory only
# !renku dataset add s2s-ai-challenge data/forecast-like-observations_2020_biweekly_terciled.nc
```
%% Cell type:code id: tags:
``` python
# to use retrieve from git lfs
#!renku storage pull ../data/forecast-like-observations_2020_biweekly_terciled.nc
xr.open_dataset("../data/forecast-like-observations_2020_biweekly_terciled.nc")
```
%% Output
<xarray.Dataset>
Dimensions: (category: 3, forecast_time: 53, latitude: 121, lead_time: 2, longitude: 240)
Coordinates:
* forecast_time (forecast_time) datetime64[ns] 2020-01-02 ... 2020-12-31
* latitude (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
* lead_time (lead_time) timedelta64[ns] 14 days 28 days
* longitude (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
valid_time (lead_time, forecast_time) datetime64[ns] ...
weekofyear (forecast_time) int64 ...
* category (category) object 'below normal' 'near normal' 'above normal'
Data variables:
    t2m            (category, lead_time, forecast_time, latitude, longitude) float32 ...
    tp             (category, lead_time, forecast_time, latitude, longitude) float32 ...
%% Cell type:markdown id: tags:
### hindcast 2000-2019
%% Cell type:code id: tags:
``` python
obs_2000_2019 = xr.open_zarr(f'{cache_path}/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr', consolidated=True)
```
%% Cell type:code id: tags:
``` python
obs_2000_2019_p = make_probabilistic(obs_2000_2019, tercile_edges, mask=mask)
```
%% Output
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/xarray/core/accessor_dt.py:381: FutureWarning: dt.weekofyear and dt.week have been deprecated. Please use dt.isocalendar().week instead.
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/xarray/core/accessor_dt.py:383: FutureWarning: dt.weekofyear and dt.week have been deprecated. Please use dt.isocalendar().week instead.
FutureWarning,
%% Cell type:code id: tags:
``` python
obs_2000_2019_p.nbytes/1e6, 'MB'
```
%% Output
(2955.138888, 'MB')
%% Cell type:code id: tags:
``` python
obs_2000_2019_p.astype('float32').chunk('auto').to_zarr(f'{cache_path}/hindcast-like-observations_2000-2019_biweekly_terciled.zarr', consolidated=True, mode='w')
```
%% Output
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/dask/array/numpy_compat.py:40: RuntimeWarning: invalid value encountered in true_divide
x = np.divide(x1, x2, out)
<xarray.backends.zarr.ZarrStore at 0x2b51c1233360>
%% Cell type:code id: tags:
``` python
# hindcast-like-observations terciled
# run renku commands from the project's root directory only
# !renku dataset add s2s-ai-challenge data/hindcast-like-observations_2000-2019_biweekly_terciled.zarr
```
%% Cell type:code id: tags:
``` python
# to use retrieve from git lfs
#!renku storage pull ../data/hindcast-like-observations_2000-2019_biweekly_terciled.zarr
xr.open_zarr("../data/hindcast-like-observations_2000-2019_biweekly_terciled.zarr")
```
%% Output
<xarray.Dataset>
Dimensions: (category: 3, forecast_time: 1060, latitude: 121, lead_time: 2, longitude: 240)
Coordinates:
* category (category) <U12 'below normal' 'near normal' 'above normal'
* forecast_time (forecast_time) datetime64[ns] 2000-01-02 ... 2019-12-31
* latitude (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
* lead_time (lead_time) timedelta64[ns] 14 days 28 days
* longitude (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
valid_time (lead_time, forecast_time) datetime64[ns] dask.array<chunksize=(2, 1060), meta=np.ndarray>
weekofyear (forecast_time) int64 dask.array<chunksize=(1060,), meta=np.ndarray>
Data variables:
    t2m            (category, lead_time, forecast_time, latitude, longitude) float32 dask.array<chunksize=(1, 2, 530, 121, 240), meta=np.ndarray>
    tp             (category, lead_time, forecast_time, latitude, longitude) float32 dask.array<chunksize=(1, 2, 530, 121, 240), meta=np.ndarray>
%% Cell type:code id: tags:
``` python
# checking category frequencies
# o = xr.open_zarr("../data/hindcast-like-observations_2000-2019_biweekly_terciled.zarr")
# w=0
# v='tp'
# o.sel(forecast_time=o.forecast_time.dt.dayofyear==2+7*w).sum('forecast_time', skipna=False)[v].plot(row='lead_time',col='category', levels=[5.5,6.5,7.5])
# o.sel(forecast_time=o.forecast_time.dt.dayofyear==2+7*w).sum('forecast_time', skipna=False).sum('category', skipna=False)[v].plot(row='lead_time', levels=[16.5,17.5,18.5,19.5,20.5])
```
%% Cell type:markdown id: tags:
# Benchmark
center: ECMWF
The calibration has been performed using the tercile boundaries from the model climatology rather than from observations. Script by Frederic Vitart.
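%% Cell type:markdown id: tags:
The calibration script itself is not part of this repository; the cells below only postprocess its output. A rough, hypothetical sketch of the idea (an assumption for illustration, not Frederic Vitart's actual script):
%% Cell type:code id: tags:
``` python
# hypothetical sketch: `model_edges` would be tercile edges computed from the hindcast
# (model) climatology, analogous to `tercile_edges` from observations above
# p_near_normal = ((fct >= model_edges.isel(category_edge=0)) &
#                  (fct < model_edges.isel(category_edge=1))).mean('realization')
```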
%% Cell type:code id: tags:
``` python
bench_p = cml.load_dataset("s2s-ai-challenge-test-output-benchmark",
parameter='t2m',
weeks=['34','56']
).to_xarray()
bench_p = cml.load_dataset("s2s-ai-challenge-test-output-benchmark", parameter=['tp','t2m']).to_xarray()
```
%% Output
50%|█████ | 1/2 [00:00<00:00, 6.11it/s]
By downloading data from this dataset, you agree to the terms and conditions defined at https://apps.ecmwf.int/datasets/data/s2s/licence/. If you do not agree with such terms, do not download the data.
100%|██████████| 2/2 [00:00<00:00, 6.89it/s]
WARNING: ecmwflibs universal: found eccodes at /work/mh0727/m300524/conda-envs/s2s-ai/lib/libeccodes.so
Warning: ecCodes 2.21.0 or higher is recommended. You are running version 2.12.3
%% Cell type:code id: tags:
``` python
bench_p['category'] = ['below normal','near normal','above normal']
bench_p['category'].attrs = {'long_name': 'tercile category probabilities', 'units': '1',
                             'description': 'Probabilities for three tercile categories. All three tercile category probabilities must add up to 1.'}
bench_p['lead_time'] = biweekly_lead # take first day of biweekly average as new coordinate
bench_p = bench_p / 100 # convert percent to [0-1] probability
bench_p.tp.attrs['units'] = ''
bench_p.t2m.attrs['units'] = ''
bench_p = bench_p.map(ensure_attributes, biweekly=True)
```
%% Output
0%| | 0/1 [00:00<?, ?it/s]
By downloading data from this dataset, you agree to the terms and conditions defined at https://apps.ecmwf.int/datasets/data/s2s/licence/. If you do not agree with such terms, do not download the data.
100%|██████████| 1/1 [00:00<00:00, 4.34it/s]
100%|██████████| 1/1 [00:00<00:00, 4.22it/s]
%% Cell type:code id: tags:
``` python
# bench_p.isel(forecast_time=2).t2m.plot(row='lead_time', col='category')
```
%% Cell type:code id: tags:
``` python
bench_p
```
%% Output
<xarray.Dataset>
Dimensions:        (category: 3, forecast_time: 53, latitude: 121, lead_time: 2, longitude: 240)
Coordinates:
  * category       (category) <U12 'below normal' 'near normal' 'above normal'
  * forecast_time  (forecast_time) datetime64[ns] 2020-01-02 ... 2020-12-31
  * lead_time      (lead_time) timedelta64[ns] 14 days 28 days
  * latitude       (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
  * longitude      (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
    valid_time     (forecast_time, lead_time) datetime64[ns] 2020-01-16 ... 2...
Data variables:
    t2m            (category, forecast_time, lead_time, latitude, longitude) float32 ...
    tp             (category, forecast_time, lead_time, latitude, longitude) float32 ...
Attributes:
    GRIB_edition:            1
    GRIB_centre:             ecmf
    GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
    GRIB_subCentre:          0
    Conventions:             CF-1.7
    institution:             European Centre for Medium-Range Weather Forecasts
    history:                 2021-05-25T12:55 GRIB to CDM+CF via cfgrib-0.9.9...
%% Cell type:code id: tags:
``` python
bench_p.astype('float32').to_netcdf('../data/ecmwf_recalibrated_benchmark_2020_biweekly_terciled.nc')
```
%% Cell type:code id: tags:
``` python
#!renku dataset add s2s-ai-challenge data/ecmwf_recalibrated_benchmark_2020_biweekly_terciled.nc
```
# scripts.py
import xarray as xr
import pandas as pd
import numpy as np
import xskillscore as xs
import climetlab_s2s_ai_challenge
import climetlab as cml
cache_path = '../data'
def download(varlist_forecast=['tp','t2m'],
             center_list=['ecmwf'],
             forecast_dataset_labels=['hindcast-input','forecast-input'],
             obs_dataset_labels=['hindcast-like-observations','forecast-like-observations'],
             varlist_observations=['t2m','tp'],
             benchmark=True,
             format='netcdf'
             ):
    """Download files with climetlab to cache_path. Set cache_path first:
    cml.settings.set("cache-directory", cache_path)
    """
    if isinstance(center_list, str):
        center_list = [center_list]
    if isinstance(varlist_forecast, str):
        varlist_forecast = [varlist_forecast]
    dates = xr.cftime_range(start='20200102', freq='7D', periods=53).strftime('%Y%m%d').to_list()
    if forecast_dataset_labels:
        print(f'Downloads variables {varlist_forecast} from datasets {forecast_dataset_labels} from center {center_list} in {format} format.')
        for center in center_list:
            for ds in forecast_dataset_labels:
                for parameter in varlist_forecast:
                    try:
                        cml.load_dataset(f"s2s-ai-challenge-{ds}", origin=center, parameter=parameter, format=format).to_xarray()
                    except:
                        pass
    if obs_dataset_labels:
        print(f'Downloads variables {varlist_observations} from datasets {obs_dataset_labels} in netcdf format. Additionally downloads raw t2m and pr observations with a time dimension.')
        try:
            for ds in obs_dataset_labels:
                for parameter in varlist_observations:
                    cml.load_dataset(f"s2s-ai-challenge-{ds}", date=dates, parameter=parameter).to_xarray()
        except:
            pass
        # raw observations with a time dimension
        cml.load_dataset(f"s2s-ai-challenge-observations", parameter=varlist_observations).to_xarray()
    if benchmark:
        cml.load_dataset("s2s-ai-challenge-test-output-benchmark", parameter=['tp','t2m']).to_xarray()
    print('finished')
    return
def add_valid_time_from_forecast_reference_time_and_lead_time(forecast, init_dim='forecast_time'):
    """Creates valid_time(forecast_time, lead_time).
    ...
    """
    # ... (body elided in this view)
    return forecast
def aggregate_biweekly(da):
    """
    Aggregate initialized S2S forecasts biweekly for xr.DataArrays.
    Use ds.map(aggregate_biweekly) for xr.Datasets.
    Applies to the ECMWF S2S data model: https://confluence.ecmwf.int/display/S2S/Parameters
    """
    # biweekly lead_time windows
    w34 = [pd.Timedelta(f'{i} d') for i in range(14,28)]
    w34 = xr.DataArray(w34, dims='lead_time', coords={'lead_time': w34})
    w56 = [pd.Timedelta(f'{i} d') for i in range(28,42)]
    w56 = xr.DataArray(w56, dims='lead_time', coords={'lead_time': w56})
    biweekly_lead = [pd.Timedelta(f"{i} d") for i in [14, 28]]  # take first day of biweekly average as new coordinate

    v = da.name
    if climetlab_s2s_ai_challenge.CF_CELL_METHODS[v] == 'sum':  # biweekly difference for sum variables: tp and ttr
        d34 = da.sel(lead_time=pd.Timedelta("28 d")) - da.sel(lead_time=pd.Timedelta("14 d"))  # tp from day 14 to day 27
        d56 = da.sel(lead_time=pd.Timedelta("42 d")) - da.sel(lead_time=pd.Timedelta("28 d"))  # tp from day 28 to day 41
        da_biweekly = xr.concat([d34, d56], 'lead_time').assign_coords(lead_time=biweekly_lead)
    else:  # t2m, see climetlab_s2s_ai_challenge.CF_CELL_METHODS; biweekly mean [day 14, day 27]
        d34 = da.sel(lead_time=w34).mean('lead_time')
        d56 = da.sel(lead_time=w56).mean('lead_time')
        da_biweekly = xr.concat([d34, d56], 'lead_time').assign_coords(lead_time=biweekly_lead)

    da_biweekly = add_valid_time_from_forecast_reference_time_and_lead_time(da_biweekly)
    da_biweekly['lead_time'].attrs = {'long_name': 'forecast_period',
                                      'description': 'Forecast period is the time interval between the forecast reference time and the validity time.',
                                      'aggregate': 'The pd.Timedelta corresponds to the first day of a biweekly aggregate.',
                                      'week34_t2m': 'mean[day 14, 27]',
                                      'week56_t2m': 'mean[day 28, 41]',
                                      'week34_tp': 'day 28 minus day 14',
                                      'week56_tp': 'day 42 minus day 28'}
    return da_biweekly
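# Usage sketch (illustration, not part of the original scripts.py): for a dataset `ds`
# with daily lead_time steps covering days 0..45,
#     ds_biweekly = ds.map(aggregate_biweekly)
# returns lead_time = [14 days, 28 days]; accumulated fields (CF cell method 'sum',
# e.g. tp) are differenced between the window edges, all others (e.g. t2m) are averaged.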
def ensure_attributes(da, biweekly=False):
    """Ensure that coordinates and variables have proper attributes. Set biweekly=True to set special comments for the biweekly aggregates."""
    template = cml.load_dataset('s2s-ai-challenge-test-input', parameter='t2m', origin='ecmwf', format='netcdf', date='20200102').to_xarray()
    for c in da.coords:
        if c in template.coords:
            da.coords[c].attrs.update(template.coords[c].attrs)
    if 'valid_time' in da.coords:
        da['valid_time'].attrs.update({'long_name': 'validity time',
                                       'standard_name': 'time',
                                       'description': 'time for which the forecast is valid',
                                       'calculate': 'forecast_time + lead_time'})
    if 'forecast_time' in da.coords:
        da['forecast_time'].attrs.update({'long_name': 'initial time of forecast',
                                          'standard_name': 'forecast_reference_time',
                                          'description': 'The forecast reference time in NWP is the "data time", the time of the analysis from which the forecast was made. It is not the time for which the forecast is valid.'})
    # fix tp units
    if da.name == 'tp':
        da.attrs['units'] = 'kg m-2'
    if biweekly:
        da['lead_time'].attrs.update({'standard_name': 'forecast_period', 'long_name': 'lead time',
                                      'description': 'Forecast period is the time interval between the forecast reference time and the validity time.',
                                      'aggregate': 'The pd.Timedelta corresponds to the first day of a biweekly aggregate.',
                                      'week34_t2m': 'mean[14 days, 27 days]',
                                      'week56_t2m': 'mean[28 days, 41 days]',
                                      'week34_tp': '28 days minus 14 days',
                                      'week56_tp': '42 days minus 28 days'})
        if da.name == 'tp':
            da.attrs.update({'aggregate_week34': '28 days minus 14 days',
                             'aggregate_week56': '42 days minus 28 days',
                             'description': 'https://confluence.ecmwf.int/display/S2S/S2S+Total+Precipitation'})
        if da.name == 't2m':
            da.attrs.update({'aggregate_week34': 'mean[14 days, 27 days]',
                             'aggregate_week56': 'mean[28 days, 41 days]',
                             'variable_before_categorization': 'https://confluence.ecmwf.int/display/S2S/S2S+Surface+Air+Temperature'})
    return da
def add_year_week_coords(ds):
    import numpy as np
    if 'week' not in ds.coords and 'year' not in ds.coords:
        year = ds.forecast_time.dt.year.to_index().unique()
        week = list(np.arange(1, 54))
        weeks = week * len(year)
        years = np.repeat(year, len(week))
        ds.coords["week"] = ("forecast_time", weeks)
        ds.coords['week'].attrs['description'] = "This week represents the number of forecast_time starting from 1 to 53. Note: This week is different from the ISO week from groupby('forecast_time.weekofyear'), see https://en.wikipedia.org/wiki/ISO_week_date and https://renkulab.io/gitlab/aaron.spring/s2s-ai-challenge/-/issues/29"
        ds.coords["year"] = ("forecast_time", years)
        ds.coords['year'].attrs['long_name'] = "calendar year"
    return ds
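# Note (illustration): with 20 years x 53 starts, groupby('week') pools exactly one value
# per year and week, i.e. 20 values per group, because 'week' simply counts the starts
# within each year. groupby('forecast_time.weekofyear') follows the ISO calendar instead
# and can assign the same start date to different weeks in different years (issue 29).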
def make_probabilistic(ds, tercile_edges, member_dim='realization', mask=None, groupby_coord='week'):
"""Compute probabilities from ds (observations or forecasts) based on tercile_edges."""
# broadcast
if 'forecast_time' not in tercile_edges.dims and 'weekofyear' in tercile_edges.dims:
tercile_edges = tercile_edges.sel(weekofyear=ds.forecast_time.dt.weekofyear)
ds = add_year_week_coords(ds)
tercile_edges = tercile_edges.sel({groupby_coord: ds.coords[groupby_coord]})
bn = ds < tercile_edges.isel(category_edge=0, drop=True) # below normal
n = (ds >= tercile_edges.isel(category_edge=0, drop=True)) & (ds < tercile_edges.isel(category_edge=1, drop=True)) # normal
an = ds >= tercile_edges.isel(category_edge=1, drop=True) # above normal
......@@ -48,11 +180,24 @@ def make_probabilistic(ds, tercile_edges, member_dim='realization', mask=None):
# we are using a dry mask as in https://doi.org/10.1175/MWR-D-17-0092.1
tp_arid_mask = tercile_edges.tp.isel(category_edge=0, lead_time=0, drop=True) > 0.01
ds_p['tp'] = ds_p['tp'].where(tp_arid_mask)
# ds_p = ds_p * 100 # in percent %
ds_p['category'].attrs = {'long_name': 'tercile category probabilities', 'units': '1',
'description': 'Probabilities for three tercile categories. All three tercile category probabilities must add up to 1.'}
ds_p['tp'].attrs = {'long_name': 'Probability of total precipitation in tercile categories', 'units': '1',
'comment': 'All three tercile category probabilities must add up to 1.',
'variable_before_categorization': 'https://confluence.ecmwf.int/display/S2S/S2S+Total+Precipitation'
}
ds_p['t2m'].attrs = {'long_name': 'Probability of 2m temperature in tercile categories', 'units': '1',
'comment': 'All three tercile category probabilities must add up to 1.',
'variable_before_categorization': 'https://confluence.ecmwf.int/display/S2S/S2S+Surface+Air+Temperature'
}
if 'year' in ds_p.coords:
del ds_p.coords['year']
if groupby_coord in ds_p.coords:
ds_p = ds_p.drop(groupby_coord)
return ds_p
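# A minimal usage sketch, assuming the tercile-edges file listed in the dataset
# commands at the end of this page:
#     obs = xr.open_zarr('../data/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr')
#     edges = xr.open_dataset('../data/hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc')
#     obs_p = make_probabilistic(obs, edges)  # tercile probabilities summing to 1 over 'category'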
def skill_by_year(preds, adapt=False):
"""Returns pd.Dataframe of RPSS per year."""
    # similar to verification_RPSS.ipynb and the scorer bot,
    # but returns a score for each year
@@ -65,98 +210,114 @@ def skill_by_year(preds):
# from root
    # renku storage pull data/forecast-like-observations_2020_biweekly_terciled.nc
    # renku storage pull data/hindcast-like-observations_2000-2019_biweekly_terciled.zarr
cache_path = '../data'
if 2020 in preds.forecast_time.dt.year:
obs_p = xr.open_dataset(f'{cache_path}/forecast-like-observations_2020_biweekly_terciled.nc').sel(forecast_time=preds.forecast_time)
else:
obs_p = xr.open_dataset(f'{cache_path}/hindcast-like-observations_2000-2019_biweekly_terciled.zarr', engine='zarr').sel(forecast_time=preds.forecast_time)
# ML probabilities
fct_p = preds
# climatology
clim_p = xr.DataArray([1/3, 1/3, 1/3], dims='category', coords={'category':['below normal', 'near normal', 'above normal']}).to_dataset(name='tp')
clim_p['t2m'] = clim_p['tp']
if adapt:
# select only obs_p where fct_p forecasts provided
for c in ['longitude', 'latitude', 'forecast_time', 'lead_time']:
obs_p = obs_p.sel({c:fct_p[c]})
obs_p = obs_p[list(fct_p.data_vars)]
clim_p = clim_p[list(fct_p.data_vars)]
else:
# check inputs
assert_predictions_2020(obs_p)
assert_predictions_2020(fct_p)
# rps_ML
rps_ML = xs.rps(obs_p, fct_p, category_edges=None, dim=[], input_distributions='p').compute()
    # rps_clim
    rps_clim = xs.rps(obs_p, clim_p, category_edges=None, dim=[], input_distributions='p').compute()
## RPSS
# penalize # https://renkulab.io/gitlab/aaron.spring/s2s-ai-challenge-template/-/issues/7
expect = obs_p.sum('category')
    expect = expect.where(expect > 0.98).where(expect < 1.02)  # ~1 where observed category probabilities exist, NaN elsewhere
    # https://renkulab.io/gitlab/aaron.spring/s2s-ai-challenge-template/-/issues/50
    rps_ML = rps_ML.where(expect, other=2)  # assign RPS=2 where a value was expected but NaN was submitted
# following Weigel 2007: https://doi.org/10.1175/MWR3280.1
rpss = 1 - (rps_ML.groupby('forecast_time.year').mean() / rps_clim.groupby('forecast_time.year').mean())
# clip
rpss = rpss.clip(-10, 1)
# weighted area mean
    weights = np.cos(np.deg2rad(np.abs(rpss.latitude)))
# spatially weighted score averaged over lead_times and variables to one single value
    scores = rpss.sel(latitude=slice(None, -60)).weighted(weights).mean('latitude').mean('longitude')  # latitudes south of 60S are excluded
scores = scores.to_array().mean(['lead_time', 'variable'])
return scores.to_dataframe('RPSS')
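# A minimal usage sketch; the submission path and `hind_p` are hypothetical:
#     preds = xr.open_dataset('../submissions/ML_prediction_2020.nc')  # hypothetical submission file
#     skill_by_year(preds)               # one RPSS row for year 2020
#     skill_by_year(hind_p, adapt=True)  # hindcast years 2000-2019, obs subset to the provided coords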
def assert_predictions_2020(preds_test, exclude='weekofyear'):
    """Check the variables, coordinates and dimensions of 2020 predictions.

    Checks whose name appears in `exclude` are skipped."""
    from xarray.testing import assert_equal  # doesn't care about attrs but checks coords
# is dataset
assert isinstance(preds_test, xr.Dataset)
# has both vars: tp and t2m
    if 'data_vars' not in exclude:
        assert 'tp' in preds_test.data_vars
        assert 't2m' in preds_test.data_vars
## coords
# ignore weekofyear coord if not dim
if 'weekofyear' in exclude and 'weekofyear' in preds_test.coords and 'weekofyear' not in preds_test.dims:
preds_test = preds_test.drop('weekofyear')
# forecast_time
    if 'forecast_time' not in exclude:
        d = pd.date_range(start='2020-01-02', freq='7D', periods=53)
        forecast_time = xr.DataArray(d, dims='forecast_time', coords={'forecast_time': d}, name='forecast_time')
        assert_equal(forecast_time, preds_test['forecast_time'])
# longitude
    if 'longitude' not in exclude:
        lon = np.arange(0., 360., 1.5)
        longitude = xr.DataArray(lon, dims='longitude', coords={'longitude': lon}, name='longitude')
        assert_equal(longitude, preds_test['longitude'])
# latitude
    if 'latitude' not in exclude:
        lat = np.arange(-90., 90.1, 1.5)[::-1]  # 90 ... -90, matching the data orientation
        latitude = xr.DataArray(lat, dims='latitude', coords={'latitude': lat}, name='latitude')
        assert_equal(latitude, preds_test['latitude'])
# lead_time
    if 'lead_time' not in exclude:
        lead = [pd.Timedelta(f'{i} d') for i in [14, 28]]
        lead_time = xr.DataArray(lead, dims='lead_time', coords={'lead_time': lead}, name='lead_time')
        assert_equal(lead_time, preds_test['lead_time'])
# category
    if 'category' not in exclude:
        cat = np.array(['below normal', 'near normal', 'above normal'], dtype='<U12')
        category = xr.DataArray(cat, dims='category', coords={'category': cat}, name='category')
        assert_equal(category, preds_test['category'])
# size
    if 'size' not in exclude:
        from dask.utils import format_bytes
        size_in_MB = float(format_bytes(preds_test.nbytes).split(' ')[0])
        # todo: refine for dtypes
        assert size_in_MB > 50
        assert size_in_MB < 250
# no other dims
    if 'dims' not in exclude:
        assert set(preds_test.dims) - {'category', 'forecast_time', 'latitude', 'lead_time', 'longitude'} == set()
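# A minimal usage sketch, validating a submission before upload:
#     assert_predictions_2020(preds_test)                               # run all checks (default)
#     assert_predictions_2020(preds_test, exclude=['weekofyear', 'size'])  # additionally skip the size check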
echo remove previous zarrs from renku dataset
renku dataset unlink s2s-ai-challenge --include "data/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr/**/**" -y
renku dataset unlink s2s-ai-challenge --include "data/hindcast-like-observations_2000-2019_biweekly_terciled.zarr/**/**" -y
renku dataset unlink s2s-ai-challenge --include "data/ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr/**/**" -y
renku dataset unlink s2s-ai-challenge --include "data/forecast-like-observations_2020_biweekly_deterministic.zarr/**/**" -y
renku dataset unlink s2s-ai-challenge --include "data/ecmwf_forecast-input_2020_biweekly_deterministic.zarr/**/**" -y
renku dataset unlink s2s-ai-challenge --include "data/**/**" -y
# observations and derived
renku dataset add s2s-ai-challenge data/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr
renku dataset add s2s-ai-challenge data/forecast-like-observations_2020_biweekly_deterministic.zarr
renku dataset add s2s-ai-challenge -o data/hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc
renku dataset add s2s-ai-challenge -o data/forecast-like-observations_2020_biweekly_terciled.nc
renku dataset add s2s-ai-challenge data/hindcast-like-observations_2000-2019_biweekly_terciled.zarr
# benchmark
renku dataset add s2s-ai-challenge -o data/ecmwf_recalibrated_benchmark_2020_biweekly_terciled.nc
# forecast / hindcast
for center in ecmwf; do
renku dataset add s2s-ai-challenge data/${center}_hindcast-input_2000-2019_biweekly_deterministic.zarr
renku dataset add s2s-ai-challenge data/${center}_forecast-input_2020_biweekly_deterministic.zarr
done
renku dataset ls-files s2s-ai-challenge
renku dataset ls-tags s2s-ai-challenge
echo consider new tag: renku dataset tag -d description s2s-ai-challenge 0.x