renku_datasets_biweekly.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Create biweekly renku datasets from `climetlab-s2s-ai-challenge`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Goal:\n",
    "\n",
    "- Create biweekly renku datasets from [`climatelab-s2s-ai-challenge`](https://github.com/ecmwf-lab/climetlab-s2s-ai-challenge).\n",
    "- These renku datasets are then used in notebooks:\n",
    "    - `ML_train_and_predict.ipynb` to train the ML model and do ML-based predictions\n",
    "    - `RPSS_verification.ipynb` to calculate RPSS of the ML model\n",
    "\n",
    "Requirements:\n",
    "- [`climetlab`](https://github.com/ecmwf/climetlab)\n",
    "- [`climatelab-s2s-ai-challenge`](https://github.com/ecmwf-lab/climetlab-s2s-ai-challenge)\n",
    "- S2S and CPC observations uploaded on [European Weather Cloud (EWC)](https://storage.ecmwf.europeanweather.cloud/s2s-ai-challenge/data/training-input/0.3.0/netcdf/index.html)\n",
    "\n",
    "Output: [renku dataset](https://renku-python.readthedocs.io/en/latest/commands.html#module-renku.cli.dataset) `s2s-ai-challenge`\n",
    "- observations\n",
    "    - deterministic:\n",
    "        - `hindcast-like-observations_2000-2019_biweekly_deterministic.zarr`\n",
    "        - `forecast-like-observations_2020_biweekly_deterministic.zarr`\n",
    "    - edges:\n",
    "        - `hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc`\n",
    "    - probabilistic:\n",
    "        - `hindcast-like-observations_2000-2019_biweekly_terciled.zarr`\n",
    "        - `forecast-like-observations_2020_biweekly_terciled.nc`\n",
    "- forecasts/hindcasts\n",
    "    - deterministic:\n",
    "        - `ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr`\n",
    "        - `ecmwf_forecast-input_2020_biweekly_deterministic.zarr`\n",
    "    - more models could be added\n",
    "- benchmark:\n",
    "    - probabilistic:\n",
    "        - `ecmwf_recalibrated_benchmark_2020_biweekly_terciled.nc`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#!renku dataset create s2s-ai-challenge"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "WARNING: ecmwflibs universal: cannot find a library called MagPlus\n",
      "/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/climetlab/plotting/drivers/magics/actions.py:36: UserWarning: Magics library could not be found\n",
      "  warnings.warn(str(e))\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<xarray.core.options.set_options at 0x2b2b3825c090>"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import climetlab as cml\n",
    "import matplotlib.pyplot as plt\n",
    "import xarray as xr\n",
    "import xskillscore as xs\n",
    "import pandas as pd\n",
    "\n",
    "xr.set_options(keep_attrs=True)\n",
    "xr.set_options(display_style='text')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# caching path for climetlab\n",
    "cache_path = \"/work/mh0727/m300524/S2S_AI/cache\" # set your own path\n",
    "cml.settings.set(\"cache-directory\", cache_path)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "cache_path = \"../data\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Climetlab version : 0.7.0\n",
      "Climetlab-s2s-ai-challenge plugin version : 0.6.0\n"
     ]
    }
   ],
   "source": [
    "import climetlab_s2s_ai_challenge\n",
    "print(f'Climetlab version : {cml.__version__}')\n",
    "print(f'Climetlab-s2s-ai-challenge plugin version : {climetlab_s2s_ai_challenge.__version__}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Download and cache\n",
    "\n",
    "Download all files for the observations, forecast and hindcast."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['20200102', '20200109', '20200116', '20200123', '20200130']"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dates = xr.cftime_range(start='20200102',freq='7D', periods=53).strftime('%Y%m%d').to_list()\n",
    "dates[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## observations `output-reference`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "obs_dataset_labels = ['training-output-reference','test-output-reference'] # ML community\n",
    "# equiv to\n",
    "obs_dataset_labels = ['hindcast-like-observations','forecast-like-observations'] # NWP community"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "varlist_obs = ['tp','t2m']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "# takes 5h for one model\n",
    "for v in varlist_obs:\n",
    "    for d in dates:\n",
    "        for ds in obs_dataset_labels:\n",
    "            print(ds,v,d)\n",
    "            # only netcdf, no choice\n",
    "            cml.load_dataset(f\"s2s-ai-challenge-{ds}\", date=d, parameter=v).to_xarray()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## hindcast and forecast `input`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "forecast_dataset_labels = ['training-input','test-input'] # ML community\n",
    "# equiv to\n",
    "forecast_dataset_labels = ['hindcast-input','forecast-input'] # NWP community"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "varlist_forecast = ['tp','t2m']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_list = ['ecmwf']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "# takes 4-5h for one model\n",
    "for model in model_list:\n",
    "    for v in varlist_forecast:\n",
    "        for d in dates:\n",
    "            for ds in forecast_dataset_labels:\n",
    "                print(ds,v,d)\n",
    "                # take netcdf format\n",
    "                cml.load_dataset(f\"s2s-ai-challenge-{ds}\", origin=model, date=d, parameter=v, format='netcdf').to_xarray()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# create bi-weekly aggregates"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scripts import add_valid_time_from_forecast_reference_time_and_lead_time"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<pre>&lt;xarray.DataArray (lead_time: 14)&gt;\n",
       "array([1209600000000000, 1296000000000000, 1382400000000000,\n",
       "       1468800000000000, 1555200000000000, 1641600000000000,\n",
       "       1728000000000000, 1814400000000000, 1900800000000000,\n",
       "       1987200000000000, 2073600000000000, 2160000000000000,\n",
       "       2246400000000000, 2332800000000000], dtype=&#x27;timedelta64[ns]&#x27;)\n",
       "Coordinates:\n",
       "  * lead_time  (lead_time) timedelta64[ns] 14 days 15 days ... 26 days 27 days</pre>"
      ],
      "text/plain": [
       "<xarray.DataArray (lead_time: 14)>\n",
       "array([1209600000000000, 1296000000000000, 1382400000000000,\n",
       "       1468800000000000, 1555200000000000, 1641600000000000,\n",
       "       1728000000000000, 1814400000000000, 1900800000000000,\n",
       "       1987200000000000, 2073600000000000, 2160000000000000,\n",
       "       2246400000000000, 2332800000000000], dtype='timedelta64[ns]')\n",
       "Coordinates:\n",
       "  * lead_time  (lead_time) timedelta64[ns] 14 days 15 days ... 26 days 27 days"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "w34 = [pd.Timedelta(f'{i} d') for i in range(14,28)]\n",
    "w34 = xr.DataArray(w34,dims='lead_time', coords={'lead_time':w34})\n",
    "w34"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<pre>&lt;xarray.DataArray (lead_time: 14)&gt;\n",
       "array([2419200000000000, 2505600000000000, 2592000000000000,\n",
       "       2678400000000000, 2764800000000000, 2851200000000000,\n",
       "       2937600000000000, 3024000000000000, 3110400000000000,\n",
       "       3196800000000000, 3283200000000000, 3369600000000000,\n",
       "       3456000000000000, 3542400000000000], dtype=&#x27;timedelta64[ns]&#x27;)\n",
       "Coordinates:\n",
       "  * lead_time  (lead_time) timedelta64[ns] 28 days 29 days ... 40 days 41 days</pre>"
      ],
      "text/plain": [
       "<xarray.DataArray (lead_time: 14)>\n",
       "array([2419200000000000, 2505600000000000, 2592000000000000,\n",
       "       2678400000000000, 2764800000000000, 2851200000000000,\n",
       "       2937600000000000, 3024000000000000, 3110400000000000,\n",
       "       3196800000000000, 3283200000000000, 3369600000000000,\n",
       "       3456000000000000, 3542400000000000], dtype='timedelta64[ns]')\n",
       "Coordinates:\n",
       "  * lead_time  (lead_time) timedelta64[ns] 28 days 29 days ... 40 days 41 days"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "w56 = [pd.Timedelta(f'{i} d') for i in range(28,42)]\n",
    "w56 = xr.DataArray(w56,dims='lead_time', coords={'lead_time':w56})\n",
    "w56"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [],
   "source": [
    "# starting dates forecast_time in 2020\n",
    "dates = xr.cftime_range(start='20200102',freq='7D', periods=53).strftime('%Y%m%d').to_list()\n",
    "\n",
    "biweekly_lead = [pd.Timedelta(f\"{i} d\") for i in [14, 28]] # take first day of biweekly average \n",
    "# alternative: new lead_time coordinate: use 21 35 as mid points of week 3-4 and week 5-6"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "hindcast-input tp\n",
      "By downloading data from this dataset, you agree to the terms and conditions defined at https://apps.ecmwf.int/datasets/data/s2s/licence/. If you do not agree with such terms, do not download the data. \n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "WARNING: ecmwflibs universal: found eccodes at /work/mh0727/m300524/conda-envs/s2s-ai/lib/libeccodes.so\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Warning: ecCodes 2.21.0 or higher is recommended. You are running version 2.12.3\n",
      "hindcast-input t2m\n",
      "save to: ../data/ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr\n",
      "forecast-input tp\n",
      "By downloading data from this dataset, you agree to the terms and conditions defined at https://apps.ecmwf.int/datasets/data/s2s/licence/. If you do not agree with such terms, do not download the data. \n",
      "forecast-input t2m\n",
      "save to: ../data/ecmwf_forecast-input_2020_biweekly_deterministic.zarr\n",
      "hindcast-like-observations tp\n",
      "By downloading data from this dataset, you agree to the terms and conditions defined at https://apps.ecmwf.int/datasets/data/s2s/licence/. If you do not agree with such terms, do not download the data.  This dataset has been dowloaded from IRIDL. By downloading this data you also agree to the terms and conditions defined at https://iridl.ldeo.columbia.edu.\n",
      "hindcast-like-observations t2m\n",
      "save to: ../data/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/dask/array/numpy_compat.py:40: RuntimeWarning: invalid value encountered in true_divide\n",
      "  x = np.divide(x1, x2, out)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "forecast-like-observations tp\n",
      "By downloading data from this dataset, you agree to the terms and conditions defined at https://apps.ecmwf.int/datasets/data/s2s/licence/. If you do not agree with such terms, do not download the data.  This dataset has been dowloaded from IRIDL. By downloading this data you also agree to the terms and conditions defined at https://iridl.ldeo.columbia.edu.\n",
      "forecast-like-observations t2m\n",
      "save to: ../data/forecast-like-observations_2020_biweekly_deterministic.zarr\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/dask/array/numpy_compat.py:40: RuntimeWarning: invalid value encountered in true_divide\n",
      "  x = np.divide(x1, x2, out)\n"
     ]
    }
   ],
   "source": [
    "# refactor after closed: https://github.com/ecmwf/climetlab/issues/16 https://github.com/ecmwf-lab/climetlab-s2s-ai-challenge/issues/9\n",
    "# then, no need to loop over dates\n",
    "model = 'ecmwf'\n",
    "for dsl in forecast_dataset_labels + obs_dataset_labels: # dataset labels\n",
    "    for v in varlist_forecast:\n",
    "        print(dsl,v)\n",
    "        ds = []\n",
    "        for d in dates:\n",
    "            if 'input' in dsl:\n",
    "                # just ecmwf model for now\n",
    "                ds2 = cml.load_dataset(f\"s2s-ai-challenge-{dsl}\", origin=model, date=d, parameter=v, format='netcdf').to_xarray()\n",
    "            elif 'observation' in dsl: # obs only netcdf, no choice\n",
    "                ds2 = cml.load_dataset(f\"s2s-ai-challenge-{dsl}\", date=d, parameter=v).to_xarray().chunk()\n",
    "            if v == 't2m': # biweekly mean\n",
    "                d34 = ds2.sel(lead_time=w34).mean('lead_time')\n",
    "                d56 = ds2.sel(lead_time=w56).mean('lead_time')\n",
    "                ds2 = xr.concat([d34,d56],'lead_time').assign_coords(lead_time=biweekly_lead)\n",
    "            elif v=='tp': # biweekly sum\n",
    "                d34 = ds2.sel(lead_time=pd.Timedelta(\"27 d\")) - ds2.sel(lead_time=pd.Timedelta(\"13 d\")) # tp from day 14 to day 27\n",
    "                d56 = ds2.sel(lead_time=pd.Timedelta(\"41 d\")) - ds2.sel(lead_time=pd.Timedelta(\"27 d\")) # tp from day 28 to day 42\n",
    "                ds2 = xr.concat([d34,d56],'lead_time').assign_coords(lead_time=biweekly_lead)\n",
    "            ds.append(ds2.chunk({'forecast_time':1}))\n",
    "        \n",
    "        # sort forecast_time\n",
    "        ds = xr.concat(ds, 'forecast_time').sortby('forecast_time')\n",
    "        ds = add_valid_time_from_forecast_reference_time_and_lead_time(ds)\n",
    "        ds.lead_time.attrs['comment'] = 'lead_time describes bi-weekly aggregates. The pd.Timedelta corresponds to the first day of the '\n",
    "\n",
    "        # first run on tp and add t2m to that dataset\n",
    "        if v == 'tp':\n",
    "            ds_ = ds.copy()\n",
    "        elif v == 't2m':\n",
    "            ds_[v] = ds[v]\n",
    "            ds = ds_\n",
    "\n",
    "    if 'test' in dsl:\n",
    "        ds = ds.chunk('auto')\n",
    "    else:\n",
    "        ds = ds.chunk({'forecast_time':12,'lead_time':-1,'longitude':'auto','latitude':'auto'}) # 3 month init chunk\n",
    "    \n",
    "    if 'hindcast' in dsl:\n",
    "        time = '2000-2019'\n",
    "        if 'input' in dsl:\n",
    "            name = f'{model}_{dsl}'\n",
    "        elif 'observations':\n",
    "            name = dsl\n",
    "        \n",
    "    elif 'forecast' in dsl:\n",
    "        time = '2020'\n",
    "        if 'input' in dsl:\n",
    "            name = f'{model}_{dsl}'\n",
    "        elif 'observations':\n",
    "            name = dsl\n",
    "    else:\n",
    "        assert False\n",
    "    \n",
    "    # pattern: {model if applies _}{observations/forecast/hindcast}_{time}_biweekly_deterministic.zarr\n",
    "    zp = f'{cache_path}/{name}_{time}_biweekly_deterministic.zarr'\n",
    "    print(f'save to: {zp}')\n",
    "    ds.to_zarr(zp, consolidated=True, mode='w')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## add to `renku` dataset `s2s-ai-challenge`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# observations as hindcast\n",
    "# run renku commands from projects root directory only\n",
    "# !renku dataset add s2s-ai-challenge data/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# for further use retrieve from git lfs\n",
    "# !renku storage pull ../data/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Frozen(SortedKeysDict({'forecast_time': 1060, 'latitude': 121, 'lead_time': 2, 'longitude': 240})) \n",
      " Coordinates:\n",
      "  * forecast_time  (forecast_time) datetime64[ns] 2000-01-02 ... 2019-12-31\n",
      "  * latitude       (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0\n",
      "  * lead_time      (lead_time) timedelta64[ns] 14 days 28 days\n",
      "  * longitude      (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5\n",
      "    valid_time     (lead_time, forecast_time) datetime64[ns] dask.array<chunksize=(2, 1060), meta=np.ndarray> \n",
      " 985.065144 MB\n"
     ]
    }
   ],
   "source": [
    "obs_2000_2019 = xr.open_zarr(f\"{cache_path}/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr\", consolidated=True)\n",
    "print(obs_2000_2019.sizes,'\\n',obs_2000_2019.coords,'\\n', obs_2000_2019.nbytes/1e6,'MB')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# observations as forecast\n",
    "# run renku commands from projects root directory only\n",
    "# !renku dataset add s2s-ai-challenge data/forecast-like-observations_2020_biweekly_deterministic.zarr"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Frozen(SortedKeysDict({'forecast_time': 53, 'latitude': 121, 'lead_time': 2, 'longitude': 240})) \n",
      " Coordinates:\n",
      "  * forecast_time  (forecast_time) datetime64[ns] 2020-01-02 ... 2020-12-31\n",
      "  * latitude       (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0\n",
      "  * lead_time      (lead_time) timedelta64[ns] 14 days 28 days\n",
      "  * longitude      (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5\n",
      "    valid_time     (lead_time, forecast_time) datetime64[ns] dask.array<chunksize=(2, 53), meta=np.ndarray> \n",
      " 49.256016 MB\n"
     ]
    }
   ],
   "source": [
    "obs_2020 = xr.open_zarr(f\"{cache_path}/forecast-like-observations_2020_biweekly_deterministic.zarr\", consolidated=True)\n",
    "print(obs_2020.sizes,'\\n',obs_2020.coords,'\\n', obs_2020.nbytes/1e6,'MB')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ecmwf hindcast-input\n",
    "# run renku commands from projects root directory only\n",
    "# !renku dataset add s2s-ai-challenge data/ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Frozen(SortedKeysDict({'forecast_time': 1060, 'latitude': 121, 'lead_time': 2, 'longitude': 240, 'realization': 11})) \n",
      " Coordinates:\n",
      "  * forecast_time  (forecast_time) datetime64[ns] 2000-01-02 ... 2019-12-31\n",
      "  * latitude       (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0\n",
      "  * lead_time      (lead_time) timedelta64[ns] 14 days 28 days\n",
      "  * longitude      (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5\n",
      "  * realization    (realization) int64 0 1 2 3 4 5 6 7 8 9 10\n",
      "    valid_time     (lead_time, forecast_time) datetime64[ns] dask.array<chunksize=(2, 1060), meta=np.ndarray> \n",
      " 5417.730832 MB\n"
     ]
    }
   ],
   "source": [
    "hind_2000_2019 = xr.open_zarr(f\"{cache_path}/ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr\", consolidated=True)\n",
    "print(hind_2000_2019.sizes,'\\n',hind_2000_2019.coords,'\\n', hind_2000_2019.nbytes/1e6,'MB')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ecmwf forecast-input\n",
    "# run renku commands from projects root directory only\n",
    "# !renku dataset add s2s-ai-challenge data/ecmwf_forecast-input_2020_biweekly_deterministic.zarr"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Frozen(SortedKeysDict({'forecast_time': 53, 'latitude': 121, 'lead_time': 2, 'longitude': 240, 'realization': 51})) \n",
      " Coordinates:\n",
      "  * forecast_time  (forecast_time) datetime64[ns] 2020-01-02 ... 2020-12-31\n",
      "  * latitude       (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0\n",
      "  * lead_time      (lead_time) timedelta64[ns] 14 days 28 days\n",
      "  * longitude      (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5\n",
      "  * realization    (realization) int64 0 1 2 3 4 5 6 7 ... 44 45 46 47 48 49 50\n",
      "    valid_time     (lead_time, forecast_time) datetime64[ns] dask.array<chunksize=(2, 53), meta=np.ndarray> \n",
      " 1255.926504 MB\n"
     ]
    }
   ],
   "source": [
    "fct_2020 = xr.open_zarr(f\"{cache_path}/ecmwf_forecast-input_2020_biweekly_deterministic.zarr\", consolidated=True)\n",
    "print(fct_2020.sizes,'\\n',fct_2020.coords,'\\n', fct_2020.nbytes/1e6,'MB')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# tercile edges\n",
    "\n",
    "Create 2 tercile edges at 1/3 and 2/3 quantiles of the 2000-2019 biweekly distrbution for each week of the year"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "tercile_file = f'{cache_path}/hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/xarray/core/accessor_dt.py:381: FutureWarning: dt.weekofyear and dt.week have been deprecated. Please use dt.isocalendar().week instead.\n",
      "  FutureWarning,\n",
      "/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/numpy/lib/nanfunctions.py:1390: RuntimeWarning: All-NaN slice encountered\n",
      "  overwrite_input, interpolation)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 15min 49s, sys: 6min 16s, total: 22min 6s\n",
      "Wall time: 13min 14s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "xr.open_zarr(f'{cache_path}/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr',\n",
    "             consolidated=True).chunk({'forecast_time':-1,'longitude':'auto'}).groupby('forecast_time.weekofyear').quantile(q=[1./3.,2./3.], dim=['forecast_time']).rename({'quantile':'category_edge'}).to_netcdf(tercile_file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<pre>&lt;xarray.Dataset&gt;\n",
       "Dimensions:        (category_edge: 2, latitude: 121, lead_time: 2, longitude: 240, weekofyear: 53)\n",
       "Coordinates:\n",
       "  * latitude       (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0\n",
       "  * lead_time      (lead_time) timedelta64[ns] 14 days 28 days\n",
       "  * longitude      (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5\n",
       "  * category_edge  (category_edge) float64 0.3333 0.6667\n",
       "  * weekofyear     (weekofyear) int64 1 2 3 4 5 6 7 8 ... 47 48 49 50 51 52 53\n",
       "Data variables:\n",
       "    t2m            (weekofyear, category_edge, lead_time, latitude, longitude) float64 ...\n",
       "    tp             (weekofyear, category_edge, lead_time, latitude, longitude) float64 ...</pre>"
      ],
      "text/plain": [
       "<xarray.Dataset>\n",
       "Dimensions:        (category_edge: 2, latitude: 121, lead_time: 2, longitude: 240, weekofyear: 53)\n",
       "Coordinates:\n",
       "  * latitude       (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0\n",
       "  * lead_time      (lead_time) timedelta64[ns] 14 days 28 days\n",
       "  * longitude      (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5\n",
       "  * category_edge  (category_edge) float64 0.3333 0.6667\n",
       "  * weekofyear     (weekofyear) int64 1 2 3 4 5 6 7 8 ... 47 48 49 50 51 52 53\n",
       "Data variables:\n",
       "    t2m            (weekofyear, category_edge, lead_time, latitude, longitude) float64 ...\n",
       "    tp             (weekofyear, category_edge, lead_time, latitude, longitude) float64 ..."
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tercile_edges = xr.open_dataset(tercile_file)\n",
    "\n",
    "tercile_edges"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(98.507024, 'MB')"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tercile_edges.nbytes*1e-6,'MB'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# run renku commands from projects root directory only\n",
    "# tercile edges\n",
    "#!renku dataset add s2s-ai-challenge data/hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# to use retrieve from git lfs\n",
    "#!renku storage pull ../data/hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc\n",
    "#xr.open_dataset(\"../data/hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# observations in categories\n",
    "\n",
    "- counting how many deterministic forecasts realizations fall into each category, like counting rps\n",
    "- categorize forecast-like-observations 2020 into categories"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Frozen(SortedKeysDict({'forecast_time': 53, 'latitude': 121, 'lead_time': 2, 'longitude': 240}))"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "obs_2020 = xr.open_zarr(f'{cache_path}/forecast-like-observations_2020_biweekly_deterministic.zarr', consolidated=True)\n",
    "obs_2020.sizes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create a mask for land grid\n",
    "mask = obs_2020.std(['lead_time','forecast_time']).notnull()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# mask.to_array().plot(col='variable')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "tercile_edges = xr.open_dataset(tercile_file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<pre>&lt;xarray.DataArray &#x27;tp&#x27; (weekofyear: 53, category_edge: 2, lead_time: 2, latitude: 121, longitude: 240)&gt;\n",
       "[6156480 values with dtype=float64]\n",
       "Coordinates:\n",
       "  * latitude       (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0\n",
       "  * lead_time      (lead_time) timedelta64[ns] 14 days 28 days\n",
       "  * longitude      (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5\n",
       "  * category_edge  (category_edge) float64 0.3333 0.6667\n",
       "  * weekofyear     (weekofyear) int64 1 2 3 4 5 6 7 8 ... 47 48 49 50 51 52 53\n",
       "Attributes:\n",
       "    comment:        precipitation accumulated since lead_time including 0 days\n",
       "    long_name:      total precipitation\n",
       "    pointwidth:     0\n",
       "    standard_name:  precipitation_amount\n",
       "    units:          kg m-2</pre>"
      ],
      "text/plain": [
       "<xarray.DataArray 'tp' (weekofyear: 53, category_edge: 2, lead_time: 2, latitude: 121, longitude: 240)>\n",
       "[6156480 values with dtype=float64]\n",
       "Coordinates:\n",
       "  * latitude       (latitude) float64 90.0 88.5 87.0 85.5 ... -87.0 -88.5 -90.0\n",
       "  * lead_time      (lead_time) timedelta64[ns] 14 days 28 days\n",
       "  * longitude      (longitude) float64 0.0 1.5 3.0 4.5 ... 355.5 357.0 358.5\n",
       "  * category_edge  (category_edge) float64 0.3333 0.6667\n",
       "  * weekofyear     (weekofyear) int64 1 2 3 4 5 6 7 8 ... 47 48 49 50 51 52 53\n",
       "Attributes:\n",
       "    comment:        precipitation accumulated since lead_time including 0 days\n",
       "    long_name:      total precipitation\n",
       "    pointwidth:     0\n",
       "    standard_name:  precipitation_amount\n",
       "    units:          kg m-2"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tercile_edges.tp"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "# total precipitation in arid regions are masked\n",
    "# Frederic Vitart suggested by email: \"Based on your map we could mask all the areas where the lower tercile boundary is lower than 0.1 mm\"\n",
    "# we are using a dry mask as in https://doi.org/10.1175/MWR-D-17-0092.1\n",
    "th = 0.01\n",
    "tp_arid_mask = tercile_edges.tp.isel(category_edge=0, lead_time=0, drop=True) > th\n",
    "#tp_arid_mask.where(mask.tp).plot(col='forecast_time', col_wrap=4)\n",
    "#plt.suptitle(f'dry mask: week 3-4 tp 1/3 category_edge > {th} kg m-2',y=1., x=.4)\n",
    "#plt.savefig('dry_mask.png')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# look into tercile edges"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "#tercile_edges.isel(forecast_time=0)['tp'].plot(col='lead_time',row='category_edge', robust=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "#tercile_edges.isel(forecast_time=[0,20],category_edge=1)['tp'].plot(col='lead_time', row='forecast_time', robust=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "# tercile_edges.tp.mean(['forecast_time']).plot(col='lead_time',row='category_edge',vmax=.5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## categorize observations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scripts import make_probabilistic"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/xarray/core/accessor_dt.py:381: FutureWarning: dt.weekofyear and dt.week have been deprecated. Please use dt.isocalendar().week instead.\n",
      "  FutureWarning,\n"
     ]
    }
   ],
   "source": [
    "obs_2020_p = make_probabilistic(obs_2020, tercile_edges, mask=mask)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(147.760264, 'MB')"
      ]
     },