mean_bias_reduction.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Train ML model to correct predictions of week 3-4 & 5-6\n",
    "\n",
    "This notebook create a Machine Learning `ML_model` to predict weeks 3-4 & 5-6 based on `S2S` weeks 3-4 & 5-6 forecasts and is compared to `CPC` observations for the [`s2s-ai-challenge`](https://s2s-ai-challenge.github.io/)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Synopsis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Method: `mean bias reduction`\n",
    "\n",
    "- calculate the mean bias from 2000-2019 deterministic ensemble mean forecast\n",
    "- remove that mean bias from 2020 forecast deterministic ensemble mean forecast\n",
    "- no Machine Learning used here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data used\n",
    "\n",
    "type: renku datasets\n",
    "\n",
    "Training-input for Machine Learning model:\n",
    "- hindcasts of models:\n",
    "    - ECMWF: `ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr`\n",
    "\n",
    "Forecast-input for Machine Learning model:\n",
    "- real-time 2020 forecasts of models:\n",
    "    - ECMWF: `ecmwf_forecast-input_2020_biweekly_deterministic.zarr`\n",
    "\n",
    "Compare Machine Learning model forecast against against ground truth:\n",
    "- `CPC` observations:\n",
    "    - `hindcast-like-observations_biweekly_deterministic.zarr`\n",
    "    - `forecast-like-observations_2020_biweekly_deterministic.zarr`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Resources used\n",
    "for training, details in reproducibility\n",
    "\n",
    "- platform: MPI-M supercompute 1 Node\n",
    "- memory: 64 GB\n",
    "- processors: 36 CPU\n",
    "- storage required: 10 GB"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Safeguards\n",
    "\n",
    "All points have to be [x] checked. If not, your submission is invalid.\n",
    "\n",
    "Changes to the code after submissions are not possible, as the `commit` before the `tag` will be reviewed.\n",
    "(Only in exceptions and if previous effort in reproducibility can be found, it may be allowed to improve readability and reproducibility after November 1st 2021.)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Safeguards to prevent [overfitting](https://en.wikipedia.org/wiki/Overfitting?wprov=sfti1) \n",
    "\n",
    "If the organizers suspect overfitting, your contribution can be disqualified.\n",
    "\n",
    "  - [x] We didnt use 2020 observations in training (explicit overfitting and cheating)\n",
    "  - [x] We didnt repeatedly verify my model on 2020 observations and incrementally improved my RPSS (implicit overfitting)\n",
    "  - [x] We provide RPSS scores for the training period with script `skill_by_year`, see in section 6.3 `predict`.\n",
    "  - [x] We tried our best to prevent [data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)?wprov=sfti1).\n",
    "  - [x] We honor the `train-validate-test` [split principle](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets). This means that the hindcast data is split into `train` and `validate`, whereas `test` is withheld.\n",
    "  - [x] We did use `test` explicitly in training or implicitly in incrementally adjusting parameters.\n",
    "  - [x] We considered [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics))."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Safeguards for Reproducibility\n",
    "Notebook/code must be independently reproducible from scratch by the organizers (after the competition), if not possible: no prize\n",
    "  - [x] All training data is publicly available (no pre-trained private neural networks, as they are not reproducible for us)\n",
    "  - [x] Code is well documented, readable and reproducible.\n",
    "  - [x] Code to reproduce training and predictions is preferred to run within a day on the described architecture. If the training takes longer than a day, please justify why this is needed. Please do not submit training piplelines, which take weeks to train."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<xarray.core.options.set_options at 0x7f05cc486340>"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import xarray as xr\n",
    "xr.set_options(display_style='text')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Get training data\n",
    "\n",
    "preprocessing of input data may be done in separate notebook/script"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Hindcast\n",
    "\n",
    "get weekly initialized hindcasts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33m\u001b[1mWarning: \u001b[0mRun CLI commands only from project's root directory.\n",
      "\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "# preprocessed as renku dataset\n",
    "!renku storage pull ../data/ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "hind_2000_2019 = xr.open_zarr(\"../data/ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr\", consolidated=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33m\u001b[1mWarning: \u001b[0mRun CLI commands only from project's root directory.\n",
      "\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "# preprocessed as renku dataset\n",
    "!renku storage pull ../data/ecmwf_forecast-input_2020_biweekly_deterministic.zarr"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "fct_2020 = xr.open_zarr(\"../data/ecmwf_forecast-input_2020_biweekly_deterministic.zarr\", consolidated=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Observations\n",
    "corresponding to hindcasts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33m\u001b[1mWarning: \u001b[0mRun CLI commands only from project's root directory.\n",
      "\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "# preprocessed as renku dataset\n",
    "!renku storage pull ../data/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "obs_2000_2019 = xr.open_zarr(\"../data/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr\", consolidated=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33m\u001b[1mWarning: \u001b[0mRun CLI commands only from project's root directory.\n",
      "\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "# preprocessed as renku dataset\n",
    "!renku storage pull ../data/forecast-like-observations_2020_biweekly_deterministic.zarr"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "obs_2020 = xr.open_zarr(\"../data/forecast-like-observations_2020_biweekly_deterministic.zarr\", consolidated=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# no ML model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, we just remove the mean bias from the ensemble mean forecast."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scripts import add_year_week_coords\n",
    "obs_2000_2019 = add_year_week_coords(obs_2000_2019)\n",
    "hind_2000_2019 = add_year_week_coords(hind_2000_2019)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/opt/conda/lib/python3.8/site-packages/dask/array/numpy_compat.py:39: RuntimeWarning: invalid value encountered in true_divide\n",
      "  x = np.divide(x1, x2, out)\n"
     ]
    }
   ],
   "source": [
    "bias_2000_2019 = (hind_2000_2019.mean('realization') - obs_2000_2019).groupby('week').mean().compute()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## `predict`\n",
    "\n",
    "Create predictions and print `mean(variable, lead_time, longitude, weighted latitude)` RPSS for all years as calculated by `skill_by_year`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scripts import make_probabilistic"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33m\u001b[1mWarning: \u001b[0mRun CLI commands only from project's root directory.\n",
      "\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "!renku storage pull ../data/hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "tercile_file = f'../data/hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc'\n",
    "tercile_edges = xr.open_dataset(tercile_file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_predictions(fct, bias):\n",
    "    if 'week' not in fct.coords:\n",
    "        fct = add_year_week_coords(fct)\n",
    "    preds = fct - bias.sel(week=fct.week)\n",
    "    preds = make_probabilistic(preds, tercile_edges)\n",
    "    return preds.astype('float32')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `predict` training period in-sample"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33m\u001b[1mWarning: \u001b[0mRun CLI commands only from project's root directory.\n",
      "\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "!renku storage pull ../data/forecast-like-observations_2020_biweekly_terciled.nc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[33m\u001b[1mWarning: \u001b[0mRun CLI commands only from project's root directory.\n",
      "\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "!renku storage pull ../data/hindcast-like-observations_2000-2019_biweekly_terciled.zarr"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "preds_is = create_predictions(hind_2000_2019, bias_2000_2019).compute()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scripts import skill_by_year"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "skill_by_year(preds_is)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `predict` test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "preds_test = create_predictions(fct_2020, bias_2000_2019)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "skill_by_year(preds_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Submission"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scripts import assert_predictions_2020\n",
    "assert_predictions_2020(preds_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "preds_test.attrs = {'author': 'Aaron Spring', 'author_email': 'aaron.spring@mpimet.mpg.de',\n",
    "               'comment': 'created for the s2s-ai-challenge as a template for the website',\n",
    "               'notebook': 'mean_bias_reduction.ipynb',\n",
    "               'website': 'https://s2s-ai-challenge.github.io/#evaluation'}\n",
    "\n",
    "html_repr = xr.core.formatting_html.dataset_repr(preds_test)\n",
    "\n",
    "with open('submission_template_repr.html', 'w') as myFile:\n",
    "    myFile.write(html_repr)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "preds_test.to_netcdf('../submissions/ML_prediction_2020.nc')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# !git add ../submissions/ML_prediction_2020.nc\n",
    "# !git add mean_bias_reduction.ipynb"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#!git commit -m \"template_test no ML mean bias reduction\" # whatever message you want"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#!git tag \"submission-no_ML_mean_bias_reduction-0.0.2\" # if this is to be checked by scorer, only the last submitted==tagged version will be considered"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#!git push --tags"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Reproducibility"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## memory"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# https://phoenixnap.com/kb/linux-commands-check-memory-usage\n",
    "!free -g"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## CPU"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!lscpu"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## software"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!conda list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.6"
  },
  "toc-autonumbering": true
 },
 "nbformat": 4,
 "nbformat_minor": 4
}