%% Cell type:markdown id: tags:
# Create biweekly renku datasets from `climetlab-s2s-ai-challenge`
%% Cell type:markdown id: tags:
Goal:
- Create biweekly renku datasets from [`climetlab-s2s-ai-challenge`](https://github.com/ecmwf-lab/climetlab-s2s-ai-challenge).
- These renku datasets are then used in the notebooks:
  - `ML_train_and_predict.ipynb` to train the ML model and make ML-based predictions
  - `RPSS_verification.ipynb` to calculate the RPSS of the ML model
  - `mean_bias_reduction.ipynb` to remove the mean bias
Requirements:
- [`climetlab`](https://github.com/ecmwf/climetlab)
- [`climetlab-s2s-ai-challenge`](https://github.com/ecmwf-lab/climetlab-s2s-ai-challenge)
- S2S and CPC observations uploaded on [European Weather Cloud (EWC)](https://storage.ecmwf.europeanweather.cloud/s2s-ai-challenge/data/training-input/0.3.0/netcdf/index.html)
Output: [renku dataset](https://renku-python.readthedocs.io/en/latest/commands.html#module-renku.cli.dataset) `s2s-ai-challenge`
- observations
  - deterministic:
    - `hindcast-like-observations_2000-2019_biweekly_deterministic.zarr`
    - `forecast-like-observations_2020_biweekly_deterministic.zarr`
  - edges:
    - `hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc`
  - probabilistic:
    - `hindcast-like-observations_2000-2019_biweekly_terciled.zarr`
    - `forecast-like-observations_2020_biweekly_terciled.nc`
- forecasts/hindcasts
  - deterministic:
    - `ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr`
    - `ecmwf_forecast-input_2020_biweekly_deterministic.zarr`
    - more models could be added
- benchmark:
  - probabilistic:
    - `ecmwf_recalibrated_benchmark_2020_biweekly_terciled.nc`
%% Cell type:code id: tags:
``` python
import matplotlib.pyplot as plt
import xarray as xr
import xskillscore as xs
import pandas as pd
import climetlab_s2s_ai_challenge
import climetlab as cml
print(f'Climetlab version : {cml.__version__}')
print(f'Climetlab-s2s-ai-challenge plugin version : {climetlab_s2s_ai_challenge.__version__}')
xr.set_options(keep_attrs=True)
xr.set_options(display_style='text')
```
%% Output
Climetlab version : 0.8.6
Climetlab-s2s-ai-challenge plugin version : 0.8.0
<xarray.core.options.set_options at 0x2b51c148f590>
%% Cell type:code id: tags:
``` python
# caching path for climetlab
cache_path = "/work/mh0727/m300524/S2S_AI/cache4" # set your own path
cml.settings.set("cache-directory", cache_path)
```
%% Cell type:code id: tags:
``` python
cache_path = "../data"
```
%% Cell type:markdown id: tags:
# Download and cache
Download all files for the observations, forecasts and hindcasts.
%% Cell type:code id: tags:
``` python
# shortcut
from scripts import download
#download()
```
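%% Cell type:markdown id: tags:
`scripts.download` wraps the download loops shown below; its source appears in the `scripts.py` diff at the end of this page. A typical call might look like this (argument names taken from that diff; defaults are assumptions):
%% Cell type:code id: tags:
``` python
# hedged sketch: download tp and t2m for the ecmwf hindcast and forecast inputs
# center_list also accepts a single string (it is wrapped into a list internally)
download(varlist_forecast=['tp', 't2m'], center_list='ecmwf', format='netcdf')
```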
%% Cell type:markdown id: tags:
## hindcast and forecast `input`
%% Cell type:code id: tags:
``` python
# starting dates forecast_time in 2020
dates = xr.cftime_range(start='20200102',freq='7D', periods=53).strftime('%Y%m%d').to_list()
forecast_dataset_labels = ['training-input','test-input'] # ML community
# equiv to
forecast_dataset_labels = ['hindcast-input','forecast-input'] # NWP community
varlist_forecast = ['tp','t2m'] # can add more
center_list = ['ecmwf'] # 'ncep', 'eccc'
```
%% Cell type:code id: tags:
``` python
%%time
# takes ~ 10-30 min to download for one model one variable depending on number of model realizations
# and download settings https://climetlab.readthedocs.io/en/latest/guide/settings.html
for center in center_list:
    for ds in forecast_dataset_labels:
        cml.load_dataset(f"s2s-ai-challenge-{ds}", origin=center, parameter=varlist_forecast, format='netcdf').to_xarray()
```
%% Cell type:markdown id: tags:
## observations `output-reference`
%% Cell type:code id: tags:
``` python
obs_dataset_labels = ['training-output-reference','test-output-reference'] # ML community
# equiv to
obs_dataset_labels = ['hindcast-like-observations','forecast-like-observations'] # NWP community
varlist_obs = ['tp', 't2m']
```
%% Cell type:code id: tags:
``` python
%%time
# takes 10min to download
for ds in obs_dataset_labels:
    print(ds)
    # only netcdf, no format choice
    cml.load_dataset(f"s2s-ai-challenge-{ds}", date=dates, parameter=varlist_obs).to_xarray()
```
%% Cell type:code id: tags:
``` python
# download obs_time to create output-reference/observations for models other than ecmwf and eccc,
# i.e. ncep or any other S2S or SubX model
obs_time = cml.load_dataset(f"s2s-ai-challenge-observations", parameter=['t2m', 'pr']).to_xarray()
```
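%% Cell type:markdown id: tags:
With `obs_time`, observations can be reshaped to match any model's `forecast_time`/`lead_time` structure. A minimal sketch (the function name appears, truncated, in the dataset attributes below; its exact signature is an assumption):
%% Cell type:code id: tags:
``` python
# hedged sketch: build forecast-like observations for another center, e.g. ncep,
# by aligning the raw observations with that model's valid_time
ncep_forecast = cml.load_dataset("s2s-ai-challenge-forecast-input", origin="ncep",
                                 parameter=varlist_forecast, format="netcdf").to_xarray()
ncep_obs = climetlab_s2s_ai_challenge.extra.forecast_like_observations(ncep_forecast, obs_time)
```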
%% Cell type:markdown id: tags:
# create bi-weekly aggregates
%% Cell type:code id: tags:
``` python
from scripts import aggregate_biweekly, ensure_attributes
#aggregate_biweekly??
```
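%% Cell type:markdown id: tags:
The lead_time attributes set later in this notebook (`week34_t2m: mean[day 14, 27]`, `week34_tp: day 28 minus day 14`, ...) imply the following aggregation logic. This is a hypothetical stand-in for `scripts.aggregate_biweekly`, not its actual source:
%% Cell type:code id: tags:
``` python
import pandas as pd
import xarray as xr

def aggregate_biweekly_sketch(da):
    """Hypothetical stand-in for scripts.aggregate_biweekly.

    Returns the two biweekly aggregates, labeled by their first lead day:
    - t2m: mean over lead days [14, 27] and [28, 41]
    - tp (accumulated since forecast_time): day 28 minus day 14, day 42 minus day 28
    """
    d = lambda n: pd.Timedelta(f"{n} d")
    if da.name == "tp":  # accumulated variable
        w34 = da.sel(lead_time=d(28)) - da.sel(lead_time=d(14))
        w56 = da.sel(lead_time=d(42)) - da.sel(lead_time=d(28))
    else:  # averaged variable like t2m
        w34 = da.sel(lead_time=slice(d(14), d(27))).mean("lead_time")
        w56 = da.sel(lead_time=slice(d(28), d(41))).mean("lead_time")
    return xr.concat([w34, w56], dim="lead_time").assign_coords(lead_time=[d(14), d(28)])
```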
%% Cell type:code id: tags:
``` python
for c, center in enumerate(center_list):  # forecast centers (could also take models)
    for dsl in obs_dataset_labels:  # (+ forecast_dataset_labels) climetlab dataset labels
        for p, parameter in enumerate(varlist_forecast):  # variables
            if c != 0 and 'observation' in dsl:  # only do once for observations
                continue
            print(f"datasetlabel: {dsl}, center: {center}, parameter: {parameter}")
            if 'input' in dsl:
                ds = cml.load_dataset(f"s2s-ai-challenge-{dsl}", origin=center, parameter=parameter, format='netcdf').to_xarray()
            elif 'observation' in dsl:  # obs only netcdf, no format choice
                if parameter not in ['t2m', 'tp']:
                    continue
                ds = cml.load_dataset(f"s2s-ai-challenge-{dsl}", parameter=parameter, date=dates).to_xarray()
            if p == 0:
                ds_biweekly = ds.map(aggregate_biweekly)
            else:
                ds_biweekly[parameter] = ds.map(aggregate_biweekly)[parameter]

        ds_biweekly = ds_biweekly.map(ensure_attributes, biweekly=True)
        ds_biweekly = ds_biweekly.sortby('forecast_time')

        if 'test' in dsl:
            ds_biweekly = ds_biweekly.chunk('auto')
        else:
            ds_biweekly = ds_biweekly.chunk({'forecast_time': 'auto', 'lead_time': -1, 'longitude': -1, 'latitude': -1})

        if 'hindcast' in dsl:
            time = f'{int(ds_biweekly.forecast_time.dt.year.min())}-{int(ds_biweekly.forecast_time.dt.year.max())}'
            if 'input' in dsl:
                name = f'{center}_{dsl}'
            elif 'observations' in dsl:
                name = dsl
        elif 'forecast' in dsl:
            time = '2020'
            if 'input' in dsl:
                name = f'{center}_{dsl}'
            elif 'observations' in dsl:
                name = dsl
        else:
            assert False

        # pattern: {model_if_not_observations}{observations/forecast/hindcast}_{time}_biweekly_deterministic.zarr
        zp = f'{cache_path}/{name}_{time}_biweekly_deterministic.zarr'
        ds_biweekly.attrs.update({'postprocessed_by': 'https://renkulab.io/gitlab/aaron.spring/s2s-ai-challenge-template/-/blob/master/notebooks/renku_datasets_biweekly.ipynb'})
        print(f'save to: {zp}')
        ds_biweekly.astype('float32').to_zarr(zp, consolidated=True, mode='w')
```
%% Output
datasetlabel: hindcast-like-observations, center: ecmwf, parameter: tp
By downloading data from this dataset, you agree to the terms and conditions defined at https://apps.ecmwf.int/datasets/data/s2s/licence/. If you do not agree with such terms, do not download the data. This dataset has been dowloaded from IRIDL. By downloading this data you also agree to the terms and conditions defined at https://iridl.ldeo.columbia.edu.
WARNING: ecmwflibs universal: found eccodes at /work/mh0727/m300524/conda-envs/s2s-ai/lib/libeccodes.so
Warning: ecCodes 2.21.0 or higher is recommended. You are running version 2.18.0
By downloading data from this dataset, you agree to the terms and conditions defined at https://apps.ecmwf.int/datasets/data/s2s/licence/. If you do not agree with such terms, do not download the data.
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/xarray/core/indexing.py:1379: PerformanceWarning: Slicing with an out-of-order index is generating 20 times more chunks
return self.array[key]
datasetlabel: hindcast-like-observations, center: ecmwf, parameter: t2m
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/xarray/core/indexing.py:1379: PerformanceWarning: Slicing is producing a large chunk. To accept the large
chunk and silence this warning, set the option
>>> with dask.config.set(**{'array.slicing.split_large_chunks': False}):
... array[indexer]
To avoid creating the large chunks, set the option
>>> with dask.config.set(**{'array.slicing.split_large_chunks': True}):
... array[indexer]
return self.array[key]
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/xarray/core/indexing.py:1379: PerformanceWarning: Slicing with an out-of-order index is generating 20 times more chunks
return self.array[key]
save to: ../data/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/dask/array/numpy_compat.py:40: RuntimeWarning: invalid value encountered in true_divide
x = np.divide(x1, x2, out)
datasetlabel: forecast-like-observations, center: ecmwf, parameter: tp
By downloading data from this dataset, you agree to the terms and conditions defined at https://apps.ecmwf.int/datasets/data/s2s/licence/. If you do not agree with such terms, do not download the data. This dataset has been dowloaded from IRIDL. By downloading this data you also agree to the terms and conditions defined at https://iridl.ldeo.columbia.edu.
datasetlabel: forecast-like-observations, center: ecmwf, parameter: t2m
save to: ../data/forecast-like-observations_2020_biweekly_deterministic.zarr
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/dask/array/numpy_compat.py:40: RuntimeWarning: invalid value encountered in true_divide
x = np.divide(x1, x2, out)
%% Cell type:markdown id: tags:
## add to `renku` dataset `s2s-ai-challenge`
%% Cell type:code id: tags:
``` python
# observations as hindcast
# run renku commands from the project's root directory only
# !renku dataset add s2s-ai-challenge data/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr
```
%% Cell type:code id: tags:
``` python
# for further use retrieve from git lfs
# !renku storage pull ../data/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr
```
%% Cell type:code id: tags:
``` python
obs_2000_2019 = xr.open_zarr(f"{cache_path}/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr", consolidated=True)
print(obs_2000_2019.sizes,'\n',obs_2000_2019.coords,'\n', obs_2000_2019.nbytes/1e6,'MB')
```
%% Output
Frozen(SortedKeysDict({'forecast_time': 1060, 'latitude': 121, 'lead_time': 2, 'longitude': 240}))
Coordinates:
* forecast_time (forecast_time) datetime64[ns] 2000-01-02 ... 2019-12-31
* latitude (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
* lead_time (lead_time) timedelta64[ns] 14 days 28 days
* longitude (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
valid_time (lead_time, forecast_time) datetime64[ns] dask.array<chunksize=(2, 1060), meta=np.ndarray>
492.546744 MB
%% Cell type:code id: tags:
``` python
# observations as forecast
# run renku commands from the project's root directory only
# !renku dataset add s2s-ai-challenge data/forecast-like-observations_2020_biweekly_deterministic.zarr
```
%% Cell type:code id: tags:
``` python
# for further use retrieve from git lfs
# !renku storage pull ../data/forecast-like-observations_2020_biweekly_deterministic.zarr
```
%% Cell type:code id: tags:
``` python
obs_2020 = xr.open_zarr(f"{cache_path}/forecast-like-observations_2020_biweekly_deterministic.zarr", consolidated=True)
print(obs_2020.sizes,'\n',obs_2020.coords,'\n', obs_2020.nbytes/1e6,'MB')
```
%% Output
Frozen(SortedKeysDict({'forecast_time': 53, 'latitude': 121, 'lead_time': 2, 'longitude': 240}))
Coordinates:
* forecast_time (forecast_time) datetime64[ns] 2020-01-02 ... 2020-12-31
* latitude (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
* lead_time (lead_time) timedelta64[ns] 14 days 28 days
* longitude (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
valid_time (lead_time, forecast_time) datetime64[ns] dask.array<chunksize=(2, 53), meta=np.ndarray>
24.630096 MB
%% Cell type:code id: tags:
``` python
# ecmwf hindcast-input
# run renku commands from the project's root directory only
# !renku dataset add s2s-ai-challenge data/ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr
```
%% Cell type:code id: tags:
``` python
hind_2000_2019 = xr.open_zarr(f"{cache_path}/ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr", consolidated=True)
print(hind_2000_2019.sizes,'\n',hind_2000_2019.coords,'\n', hind_2000_2019.nbytes/1e6,'MB')
# for further use retrieve from git lfs
# !renku storage pull ../data/ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr
```
%% Output
Frozen(SortedKeysDict({'forecast_time': 1060, 'latitude': 121, 'lead_time': 2, 'longitude': 240, 'realization': 11}))
Coordinates:
* forecast_time (forecast_time) datetime64[ns] 2000-01-02 ... 2019-12-31
* latitude (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
* lead_time (lead_time) timedelta64[ns] 14 days 28 days
* longitude (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
* realization (realization) int64 0 1 2 3 4 5 6 7 8 9 10
valid_time (lead_time, forecast_time) datetime64[ns] dask.array<chunksize=(2, 1060), meta=np.ndarray>
5417.730832 MB
%% Cell type:code id: tags:
``` python
# ecmwf forecast-input
# run renku commands from the project's root directory only
# !renku dataset add s2s-ai-challenge data/ecmwf_forecast-input_2020_biweekly_deterministic.zarr
```
%% Cell type:code id: tags:
``` python
fct_2020 = xr.open_zarr(f"{cache_path}/ecmwf_forecast-input_2020_biweekly_deterministic.zarr", consolidated=True)
print(fct_2020.sizes,'\n',fct_2020.coords,'\n', fct_2020.nbytes/1e6,'MB')
# for further use retrieve from git lfs
# !renku storage pull ../data/ecmwf_forecast-input_2020_biweekly_deterministic.zarr
```
%% Output
Frozen(SortedKeysDict({'forecast_time': 53, 'latitude': 121, 'lead_time': 2, 'longitude': 240, 'realization': 51}))
Coordinates:
* forecast_time (forecast_time) datetime64[ns] 2020-01-02 ... 2020-12-31
* latitude (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
* lead_time (lead_time) timedelta64[ns] 14 days 28 days
* longitude (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
* realization (realization) int64 0 1 2 3 4 5 6 7 ... 44 45 46 47 48 49 50
valid_time (lead_time, forecast_time) datetime64[ns] dask.array<chunksize=(2, 53), meta=np.ndarray>
1255.926504 MB
%% Cell type:markdown id: tags:
# tercile edges
Create the two tercile edges at the 1/3 and 2/3 quantiles of the 2000-2019 biweekly distribution for each week of the year.
%% Cell type:code id: tags:
``` python
obs_2000_2019 = xr.open_zarr(f'{cache_path}/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr', consolidated=True)
```
%% Cell type:code id: tags:
``` python
from scripts import add_year_week_coords
```
%% Cell type:code id: tags:
``` python
# add week for groupby, see https://renkulab.io/gitlab/aaron.spring/s2s-ai-challenge/-/issues/29
obs_2000_2019 = add_year_week_coords(obs_2000_2019)
```
%% Output
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/xarray/core/accessor_dt.py:383: FutureWarning: dt.weekofyear and dt.week have been deprecated. Please use dt.isocalendar().week instead.
FutureWarning,
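%% Cell type:markdown id: tags:
This custom `week` coordinate simply numbers the 53 weekly starting dates within each year (1 to 53); it is not the ISO calendar week (see the issue linked above). A quick check (sketch):
%% Cell type:code id: tags:
``` python
# hedged check: the custom 'week' coord vs. the ISO calendar week
print(obs_2000_2019.week.values[:5])                                  # [1 2 3 4 5]
print(obs_2000_2019.forecast_time.dt.isocalendar().week.values[:5])   # ISO weeks can differ near year boundaries
```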
%% Cell type:code id: tags:
``` python
obs_2000_2019
```
%% Output
<xarray.Dataset>
Dimensions: (forecast_time: 1060, latitude: 121, lead_time: 2, longitude: 240)
Coordinates:
* forecast_time (forecast_time) datetime64[ns] 2000-01-02 ... 2019-12-31
* latitude (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
* lead_time (lead_time) timedelta64[ns] 14 days 28 days
* longitude (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
valid_time (lead_time, forecast_time) datetime64[ns] dask.array<chunksize=(2, 1060), meta=np.ndarray>
week (forecast_time) int64 1 2 3 4 5 6 7 ... 47 48 49 50 51 52 53
year (forecast_time) int64 2000 2000 2000 2000 ... 2019 2019 2019
Data variables:
t2m (lead_time, forecast_time, latitude, longitude) float32 dask.array<chunksize=(2, 530, 121, 240), meta=np.ndarray>
tp (lead_time, forecast_time, latitude, longitude) float32 dask.array<chunksize=(2, 530, 121, 240), meta=np.ndarray>
Attributes:
created_by_script: tools/observations/makefile
created_by_software: climetlab-s2s-ai-challenge
function: climetlab_s2s_ai_challenge.extra.forecast_like_obse...
postprocessed_by: https://renkulab.io/gitlab/aaron.spring/s2s-ai-chal...
regrid_method: conservative
source_dataset_name: NOAA NCEP CPC UNIFIED_PRCP GAUGE_BASED GLOBAL v1p0 ...
source_hosting: IRIDL
source_url: http://iridl.ldeo.columbia.edu/SOURCES/.NOAA/.NCEP/...
%% Cell type:code id: tags:
``` python
tercile_file = f'{cache_path}/hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc'
```
%% Cell type:code id: tags:
``` python
%%time
# note: quantile over a dask-backed dimension requires that dimension in a single chunk, hence forecast_time=-1
obs_2000_2019.chunk({'forecast_time': -1, 'longitude': 'auto'}).groupby('week').quantile(q=[1./3., 2./3.], dim='forecast_time').rename({'quantile': 'category_edge'}).astype('float32').to_netcdf(tercile_file)
```
%% Output
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/numpy/lib/nanfunctions.py:1390: RuntimeWarning: All-NaN slice encountered
overwrite_input, interpolation)
CPU times: user 19min 35s, sys: 8min 33s, total: 28min 9s
Wall time: 16min 44s
%% Cell type:code id: tags:
``` python
tercile_edges = xr.open_dataset(tercile_file)
tercile_edges
```
%% Output
<xarray.Dataset>
Dimensions: (category_edge: 2, latitude: 121, lead_time: 2, longitude: 240, week: 53)
Coordinates:
* latitude (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
* lead_time (lead_time) timedelta64[ns] 14 days 28 days
* longitude (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
* category_edge (category_edge) float64 0.3333 0.6667
* week (week) int64 1 2 3 4 5 6 7 8 9 ... 45 46 47 48 49 50 51 52 53
Data variables:
t2m (week, category_edge, lead_time, latitude, longitude) float32 ...
tp (week, category_edge, lead_time, latitude, longitude) float32 ...
Attributes:
created_by_script: tools/observations/makefile
created_by_software: climetlab-s2s-ai-challenge
function: climetlab_s2s_ai_challenge.extra.forecast_like_obse...
postprocessed_by: https://renkulab.io/gitlab/aaron.spring/s2s-ai-chal...
regrid_method: conservative
source_dataset_name: NOAA NCEP CPC UNIFIED_PRCP GAUGE_BASED GLOBAL v1p0 ...
source_hosting: IRIDL
source_url: http://iridl.ldeo.columbia.edu/SOURCES/.NOAA/.NCEP/...
%% Cell type:code id: tags:
``` python
tercile_edges.nbytes*1e-6,'MB'
```
%% Output
(49.255184, 'MB')
%% Cell type:code id: tags:
``` python
# run renku commands from the project's root directory only
# tercile edges
#!renku dataset add s2s-ai-challenge data/hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc
```
%% Cell type:code id: tags:
``` python
# to use retrieve from git lfs
#!renku storage pull ../data/hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc
#xr.open_dataset("../data/hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc")
```
%% Cell type:markdown id: tags:
# observations in categories
- count how many deterministic forecast realizations fall into each tercile category, as done when computing the RPS (see the sketch below)
- categorize the 2020 forecast-like observations into tercile categories
%% Cell type:code id: tags:
``` python
obs_2020 = xr.open_zarr(f'{cache_path}/forecast-like-observations_2020_biweekly_deterministic.zarr', consolidated=True)
obs_2020.sizes
```
%% Output
Frozen(SortedKeysDict({'forecast_time': 53, 'latitude': 121, 'lead_time': 2, 'longitude': 240}))
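%% Cell type:markdown id: tags:
For deterministic observations these "probabilities" are one-hot: each value falls below, between, or above the two weekly tercile edges. A minimal sketch of the comparison inside `scripts.make_probabilistic` (full source in the `scripts.py` diff at the end of this page):
%% Cell type:code id: tags:
``` python
# hedged sketch: one-hot tercile categories for deterministic observations
from scripts import add_year_week_coords
edges = tercile_edges.sel(week=add_year_week_coords(obs_2020).week)  # align weekly edges with forecast_time
below  = obs_2020 < edges.isel(category_edge=0, drop=True)
normal = (obs_2020 >= edges.isel(category_edge=0, drop=True)) & (obs_2020 < edges.isel(category_edge=1, drop=True))
above  = obs_2020 >= edges.isel(category_edge=1, drop=True)
# each grid point is True in exactly one category (NaNs aside)
```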
%% Cell type:code id: tags:
``` python
# create a land mask: grid cells without any observations (ocean) are NaN
mask = obs_2020.std(['lead_time','forecast_time']).notnull()
```
%% Cell type:code id: tags:
``` python
# mask.to_array().plot(col='variable')
```
%% Cell type:code id: tags:
``` python
# total precipitation in arid regions is masked
# Frederic Vitart suggested by email: "Based on your map we could mask all the areas where the lower tercile boundary is lower than 0.1 mm"
# we use a dry mask as in https://doi.org/10.1175/MWR-D-17-0092.1
th = 0.01
tp_arid_mask = tercile_edges.tp.isel(category_edge=0, lead_time=0, drop=True) > th
#tp_arid_mask.where(mask.tp).plot(col='forecast_time', col_wrap=4)
#plt.suptitle(f'dry mask: week 3-4 tp 1/3 category_edge > {th} kg m-2',y=1., x=.4)
#plt.savefig('dry_mask.png')
```
%% Cell type:code id: tags:
``` python
# look into tercile edges
```
%% Cell type:code id: tags:
``` python
# tercile_edges.isel(forecast_time=0)['tp'].plot(col='lead_time',row='category_edge', robust=True)
# tercile_edges.isel(forecast_time=[0,20],category_edge=1)['tp'].plot(col='lead_time', row='forecast_time', robust=True)
# tercile_edges.tp.mean(['forecast_time']).plot(col='lead_time',row='category_edge',vmax=.5)
```
%% Cell type:markdown id: tags:
## categorize observations
%% Cell type:markdown id: tags:
### forecast 2020
%% Cell type:code id: tags:
``` python
from scripts import make_probabilistic
```
%% Cell type:code id: tags:
``` python
# tp_arid_mask.isel(week=[0,10,20,30,40]).plot(col='week')
```
%% Cell type:code id: tags:
``` python
obs_2020_p = make_probabilistic(obs_2020, tercile_edges, mask=mask)
```
%% Output
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/xarray/core/accessor_dt.py:383: FutureWarning: dt.weekofyear and dt.week have been deprecated. Please use dt.isocalendar().week instead.
FutureWarning,
%% Cell type:code id: tags:
``` python
obs_2020_p.nbytes/1e6, 'MB'
```
%% Output
(147.75984, 'MB')
%% Cell type:code id: tags:
``` python
obs_2020_p
```
%% Output
<xarray.Dataset>
Dimensions: (category: 3, forecast_time: 53, latitude: 121, lead_time: 2, longitude: 240)
Coordinates:
* forecast_time (forecast_time) datetime64[ns] 2020-01-02 ... 2020-12-31
* latitude (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
* lead_time (lead_time) timedelta64[ns] 14 days 28 days
* longitude (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
valid_time (lead_time, forecast_time) datetime64[ns] dask.array<chunksize=(2, 53), meta=np.ndarray>
* category (category) <U12 'below normal' 'near normal' 'above normal'
Data variables:
t2m (category, lead_time, forecast_time, latitude, longitude) float64 dask.array<chunksize=(1, 2, 53, 121, 240), meta=np.ndarray>
tp (category, lead_time, forecast_time, latitude, longitude) float64 dask.array<chunksize=(1, 2, 53, 121, 240), meta=np.ndarray>
%% Cell type:code id: tags:
``` python
obs_2020_p.astype('float32').to_netcdf(f'{cache_path}/forecast-like-observations_2020_biweekly_terciled.nc')
```
%% Output
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/dask/array/numpy_compat.py:40: RuntimeWarning: invalid value encountered in true_divide
x = np.divide(x1, x2, out)
%% Cell type:code id: tags:
``` python
# forecast-like-observations terciled
# run renku commands from the project's root directory only
# !renku dataset add s2s-ai-challenge data/forecast-like-observations_2020_biweekly_terciled.nc
```
%% Cell type:code id: tags:
``` python
# to use retrieve from git lfs
#!renku storage pull ../data/forecast-like-observations_2020_biweekly_terciled.nc
xr.open_dataset("../data/forecast-like-observations_2020_biweekly_terciled.nc")
```
%% Output
<xarray.Dataset>
Dimensions: (category: 3, forecast_time: 53, latitude: 121, lead_time: 2, longitude: 240)
Coordinates:
* forecast_time (forecast_time) datetime64[ns] 2020-01-02 ... 2020-12-31
* latitude (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
* lead_time (lead_time) timedelta64[ns] 14 days 28 days
* longitude (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
valid_time (lead_time, forecast_time) datetime64[ns] ...
* category (category) object 'below normal' 'near normal' 'above normal'
Data variables:
t2m (category, lead_time, forecast_time, latitude, longitude) float32 ...
tp (category, lead_time, forecast_time, latitude, longitude) float32 ...
%% Cell type:markdown id: tags:
### hindcast 2000-2019
%% Cell type:code id: tags:
``` python
obs_2000_2019 = xr.open_zarr(f'{cache_path}/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr', consolidated=True)
```
%% Cell type:code id: tags:
``` python
obs_2000_2019_p = make_probabilistic(obs_2000_2019, tercile_edges, mask=mask)
```
%% Output
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/xarray/core/accessor_dt.py:383: FutureWarning: dt.weekofyear and dt.week have been deprecated. Please use dt.isocalendar().week instead.
FutureWarning,
%% Cell type:code id: tags:
``` python
obs_2000_2019_p.nbytes/1e6, 'MB'
```
%% Output
(2955.138888, 'MB')
%% Cell type:code id: tags:
``` python
obs_2000_2019_p.astype('float32').chunk('auto').to_zarr(f'{cache_path}/hindcast-like-observations_2000-2019_biweekly_terciled.zarr', consolidated=True, mode='w')
```
%% Output
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/dask/array/numpy_compat.py:40: RuntimeWarning: invalid value encountered in true_divide
x = np.divide(x1, x2, out)
<xarray.backends.zarr.ZarrStore at 0x2b51c1233360>
%% Cell type:code id: tags:
``` python
# hindcast-like-observations terciled
# run renku commands from the project's root directory only
# !renku dataset add s2s-ai-challenge data/hindcast-like-observations_2000-2019_biweekly_terciled.zarr
```
%% Cell type:code id: tags:
``` python
# to use retrieve from git lfs
#!renku storage pull ../data/hindcast-like-observations_2000-2019_biweekly_terciled.zarr
xr.open_zarr("../data/hindcast-like-observations_2000-2019_biweekly_terciled.zarr")
```
%% Output
<xarray.Dataset>
Dimensions: (category: 3, forecast_time: 1060, latitude: 121, lead_time: 2, longitude: 240)
Coordinates:
* category (category) <U12 'below normal' 'near normal' 'above normal'
* forecast_time (forecast_time) datetime64[ns] 2000-01-02 ... 2019-12-31
* latitude (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
* lead_time (lead_time) timedelta64[ns] 14 days 28 days
* longitude (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
valid_time (lead_time, forecast_time) datetime64[ns] dask.array<chunksize=(2, 1060), meta=np.ndarray>
Data variables:
t2m (category, lead_time, forecast_time, latitude, longitude) float32 dask.array<chunksize=(1, 2, 530, 121, 240), meta=np.ndarray>
tp (category, lead_time, forecast_time, latitude, longitude) float32 dask.array<chunksize=(1, 2, 530, 121, 240), meta=np.ndarray>
%% Cell type:code id: tags:
``` python
# checking category frequencies
# o = xr.open_zarr("../data/hindcast-like-observations_2000-2019_biweekly_terciled.zarr")
# w=0
# v='tp'
# o.sel(forecast_time=o.forecast_time.dt.dayofyear==2+7*w).sum('forecast_time', skipna=False)[v].plot(row='lead_time',col='category', levels=[5.5,6.5,7.5])
# o.sel(forecast_time=o.forecast_time.dt.dayofyear==2+7*w).sum('forecast_time', skipna=False).sum('category', skipna=False)[v].plot(row='lead_time', levels=[16.5,17.5,18.5,19.5,20.5])
```
%% Cell type:markdown id: tags:
# Benchmark
center: ECMWF
The calibration was performed using tercile boundaries from the model climatology rather than from observations. Script by Frederic Vitart.
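%% Cell type:markdown id: tags:
In such a recalibration, tercile probabilities come from counting ensemble members per category against the climatological edges. A minimal sketch of that counting (mirroring `scripts.make_probabilistic` with `member_dim='realization'`; not Frederic Vitart's actual script):
%% Cell type:code id: tags:
``` python
# hedged sketch: ensemble members -> tercile category probabilities
def ensemble_to_tercile_probs(fct, edges, member_dim='realization'):
    bn = (fct < edges.isel(category_edge=0, drop=True)).mean(member_dim)   # below normal
    an = (fct >= edges.isel(category_edge=1, drop=True)).mean(member_dim)  # above normal
    n = 1 - bn - an                                                        # near normal
    return xr.concat([bn, n, an], 'category').assign_coords(category=['below normal', 'near normal', 'above normal'])
```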
%% Cell type:code id: tags:
``` python
bench_p = cml.load_dataset("s2s-ai-challenge-test-output-benchmark", parameter=['tp','t2m']).to_xarray()
```
%% Output
50%|█████ | 1/2 [00:00<00:00, 6.11it/s]
By downloading data from this dataset, you agree to the terms and conditions defined at https://apps.ecmwf.int/datasets/data/s2s/licence/. If you do not agree with such terms, do not download the data.
100%|██████████| 2/2 [00:00<00:00, 6.89it/s]
WARNING: ecmwflibs universal: found eccodes at /work/mh0727/m300524/conda-envs/s2s-ai/lib/libeccodes.so
Warning: ecCodes 2.21.0 or higher is recommended. You are running version 2.12.3
%% Cell type:code id: tags:
``` python
bench_p['category'].attrs = {'long_name': 'tercile category probabilities', 'units': '1',
                             'description': 'Probabilities for three tercile categories. All three tercile category probabilities must add up to 1.'}
```
%% Cell type:code id: tags:
``` python
bench_p['lead_time'] = [pd.Timedelta(f"{i} d") for i in [14, 28]]  # take first day of biweekly average as new coordinate
bench_p['lead_time'].attrs = {'long_name': 'forecast_period',
                              'description': 'Forecast period is the time interval between the forecast reference time and the validity time.',
                              'aggregate': 'The pd.Timedelta corresponds to the first day of a biweekly aggregate.',
                              'week34_t2m': 'mean[day 14, 27]',
                              'week56_t2m': 'mean[day 28, 41]',
                              'week34_tp': 'day 28 minus day 14',
                              'week56_tp': 'day 42 minus day 28'}
```
%% Cell type:code id: tags:
``` python
bench_p = bench_p / 100 # convert percent to [0-1] probability
```
%% Cell type:code id: tags:
``` python
bench_p = bench_p.map(ensure_attributes, biweekly=True)
```
%% Output
0%| | 0/1 [00:00<?, ?it/s]
By downloading data from this dataset, you agree to the terms and conditions defined at https://apps.ecmwf.int/datasets/data/s2s/licence/. If you do not agree with such terms, do not download the data.
100%|██████████| 1/1 [00:00<00:00, 4.34it/s]
%% Cell type:code id: tags:
``` python
# bench_p.isel(forecast_time=2).t2m.plot(row='lead_time', col='category')
```
%% Cell type:code id: tags:
``` python
bench_p
```
%% Output
<xarray.Dataset>
Dimensions: (category: 3, forecast_time: 53, latitude: 121, lead_time: 2, longitude: 240)
Coordinates:
* category (category) object 'below normal' 'near normal' 'above normal'
* forecast_time (forecast_time) datetime64[ns] 2020-01-02 ... 2020-12-31
* lead_time (lead_time) timedelta64[ns] 14 days 28 days
* latitude (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0
* longitude (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5
valid_time (forecast_time, lead_time) datetime64[ns] 2020-01-16 ... 2...
Data variables:
tp (category, forecast_time, lead_time, latitude, longitude) float32 ...
t2m (category, forecast_time, lead_time, latitude, longitude) float32 ...
%% Cell type:code id: tags:
``` python
bench_p.astype('float32').to_netcdf('../data/ecmwf_recalibrated_benchmark_2020_biweekly_terciled.nc')
```
%% Cell type:code id: tags:
``` python
#!renku dataset add s2s-ai-challenge data/ecmwf_recalibrated_benchmark_2020_biweekly_terciled.nc
```
......
......@@ -19,7 +19,7 @@ def download(varlist_forecast=['tp','t2m'],
    cml.settings.set("cache-directory", cache_path)
    """
    if isinstance(center_list, str):
        center_list = [center_list]
    if isinstance(varlist_forecast, str):
        varlist_forecast = [varlist_forecast]
......@@ -27,11 +27,11 @@ def download(varlist_forecast=['tp','t2m'],
    if forecast_dataset_labels:
        print(f'Downloads variables {varlist_forecast} from datasets {forecast_dataset_labels} from center {center_list} in {format} format.')
        for center in center_list:
            for ds in forecast_dataset_labels:
                for parameter in varlist_forecast:
                    try:
                        cml.load_dataset(f"s2s-ai-challenge-{ds}", origin=center, parameter=varlist_forecast, format=format).to_xarray()
                    except:
                        pass
    if obs_dataset_labels:
......@@ -146,11 +146,25 @@ def ensure_attributes(da, biweekly=False):
    return da


def add_year_week_coords(ds):
    import numpy as np
    if 'week' not in ds.coords and 'year' not in ds.coords:
        year = ds.forecast_time.dt.year.to_index().unique()
        week = list(np.arange(1, 54))
        weeks = week * len(year)
        years = np.repeat(year, len(week))
        ds.coords["week"] = ("forecast_time", weeks)
        ds.coords['week'].attrs['description'] = "This week represents the number of forecast_time starting from 1 to 53. Note: This week is different from the ISO week from groupby('forecast_time.weekofyear'), see https://en.wikipedia.org/wiki/ISO_week_date and https://renkulab.io/gitlab/aaron.spring/s2s-ai-challenge/-/issues/29"
        ds.coords["year"] = ("forecast_time", years)
        ds.coords['year'].attrs['long_name'] = "calendar year"
    return ds


def make_probabilistic(ds, tercile_edges, member_dim='realization', mask=None, groupby_coord='week'):
    """Compute probabilities from ds (observations or forecasts) based on tercile_edges."""
    # broadcast
    ds = add_year_week_coords(ds)
    tercile_edges = tercile_edges.sel({groupby_coord: ds.coords[groupby_coord]})
    bn = ds < tercile_edges.isel(category_edge=0, drop=True)  # below normal
    n = (ds >= tercile_edges.isel(category_edge=0, drop=True)) & (ds < tercile_edges.isel(category_edge=1, drop=True))  # normal
    an = ds >= tercile_edges.isel(category_edge=1, drop=True)  # above normal
......@@ -176,12 +190,14 @@ def make_probabilistic(ds, tercile_edges, member_dim='realization', mask=None):
        'comment': 'All three tercile category probabilities must add up to 1.',
        'variable_before_categorization': 'https://confluence.ecmwf.int/display/S2S/S2S+Surface+Air+Temperature'
    }
    if 'year' in ds_p.coords:
        del ds_p.coords['year']
    if groupby_coord in ds_p.coords:
        ds_p = ds_p.drop(groupby_coord)
    return ds_p


def skill_by_year(preds, adapt=False):
    """Returns pd.Dataframe of RPSS per year."""
    # similar to verification_RPSS.ipynb
    # as the scorer bot, but returns a score for each year
......@@ -194,44 +210,49 @@ def skill_by_year(preds):
    # from root
    # renku storage pull data/forecast-like-observations_2020_biweekly_terciled.nc
    # renku storage pull data/hindcast-like-observations_2000-2019_biweekly_terciled.nc
    cache_path = '../data'
    if 2020 in preds.forecast_time.dt.year:
        obs_p = xr.open_dataset(f'{cache_path}/forecast-like-observations_2020_biweekly_terciled.nc').sel(forecast_time=preds.forecast_time)
    else:
        obs_p = xr.open_dataset(f'{cache_path}/hindcast-like-observations_2000-2019_biweekly_terciled.zarr', engine='zarr').sel(forecast_time=preds.forecast_time)

    # ML probabilities
    fct_p = preds

    # climatology
    clim_p = xr.DataArray([1/3, 1/3, 1/3], dims='category', coords={'category': ['below normal', 'near normal', 'above normal']}).to_dataset(name='tp')
    clim_p['t2m'] = clim_p['tp']

    if adapt:
        # select only obs_p where fct_p forecasts are provided
        for c in ['longitude', 'latitude', 'forecast_time', 'lead_time']:
            obs_p = obs_p.sel({c: fct_p[c]})
        obs_p = obs_p[list(fct_p.data_vars)]
        clim_p = clim_p[list(fct_p.data_vars)]
    else:
        # check inputs
        assert_predictions_2020(obs_p)
        assert_predictions_2020(fct_p)

    # rps_ML
    rps_ML = xs.rps(obs_p, fct_p, category_edges=None, dim=[], input_distributions='p').compute()
    # rps_clim
    rps_clim = xs.rps(obs_p, clim_p, category_edges=None, dim=[], input_distributions='p').compute()

    ## RPSS
    # penalize # https://renkulab.io/gitlab/aaron.spring/s2s-ai-challenge-template/-/issues/7
    expect = obs_p.sum('category')
    expect = expect.where(expect > 0.98).where(expect < 1.02)  # should be True if not all NaN
    # https://renkulab.io/gitlab/aaron.spring/s2s-ai-challenge-template/-/issues/50
    rps_ML = rps_ML.where(expect, other=2)  # assign RPS=2 where a value was expected but NaN found
    # following Weigel 2007: https://doi.org/10.1175/MWR3280.1
    rpss = 1 - (rps_ML.groupby('forecast_time.year').mean() / rps_clim.groupby('forecast_time.year').mean())
    # clip
    rpss = rpss.clip(-10, 1)

    # weighted area mean
    weights = np.cos(np.deg2rad(np.abs(rpss.latitude)))
......
echo remove previous zarrs from renku dataset
renku dataset unlink s2s-ai-challenge --include "data/**/**" -y

# observations and derived
renku dataset add s2s-ai-challenge data/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr
renku dataset add s2s-ai-challenge data/forecast-like-observations_2020_biweekly_deterministic.zarr
renku dataset add s2s-ai-challenge -o data/hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc
renku dataset add s2s-ai-challenge -o data/forecast-like-observations_2020_biweekly_terciled.nc
renku dataset add s2s-ai-challenge data/hindcast-like-observations_2000-2019_biweekly_terciled.zarr

# benchmark
renku dataset add s2s-ai-challenge -o data/ecmwf_recalibrated_benchmark_2020_biweekly_terciled.nc

# forecast / hindcast
for center in ecmwf; do
    renku dataset add s2s-ai-challenge data/${center}_hindcast-input_2000-2019_biweekly_deterministic.zarr
    renku dataset add s2s-ai-challenge data/${center}_forecast-input_2020_biweekly_deterministic.zarr
done
......@@ -25,4 +40,4 @@ renku dataset ls-files s2s-ai-challenge
renku dataset ls-tags s2s-ai-challenge
echo consider new tag: renku dataset tag -d description s2s-ai-challenge 0.x