{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Train ML model to correct predictions of week 3-4 & 5-6\n",
"\n",
"This notebook creates a Machine Learning `ML_model` to predict weeks 3-4 & 5-6 based on `S2S` weeks 3-4 & 5-6 forecasts and compares it to `CPC` observations for the [`s2s-ai-challenge`](https://s2s-ai-challenge.github.io/)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Synopsis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Method: `mean bias reduction`\n",
"\n",
"- calculate the mean bias of the 2000-2019 deterministic ensemble mean hindcasts relative to observations\n",
"- remove that mean bias from the 2020 deterministic ensemble mean forecast\n",
"- no Machine Learning used here"
]
},
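{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of the two steps above, the correction can be written as follows. The symbols are notation assumed here for illustration, not taken from the challenge material: $\\bar{f}(y, w)$ is the ensemble mean forecast initialized in week $w$ of year $y$, $o(y, w)$ is the matching observation, and $N$ is the number of hindcast years (20, assuming each year from 2000 to 2019 contributes one hindcast per week).\n",
"\n",
"$$b(w) = \\frac{1}{N} \\sum_{y=2000}^{2019} \\left[ \\bar{f}(y, w) - o(y, w) \\right]$$\n",
"\n",
"$$\\hat{f}(2020, w) = \\bar{f}(2020, w) - b(w)$$\n",
"\n",
"The bias $b(w)$ is presumably evaluated separately for each variable, lead time and grid point. Because $b(w)$ does not depend on the ensemble member, subtracting it from every member shifts the ensemble mean by the same amount."
]
},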
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data used\n",
"\n",
"type: renku datasets\n",
"\n",
"Training-input for Machine Learning model:\n",
"- hindcasts of models:\n",
"    - ECMWF: `ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr`\n",
"\n",
"Forecast-input for Machine Learning model:\n",
"- real-time 2020 forecasts of models:\n",
"    - ECMWF: `ecmwf_forecast-input_2020_biweekly_deterministic.zarr`\n",
"\n",
"Compare Machine Learning model forecast against ground truth:\n",
"- `CPC` observations:\n",
"    - `hindcast-like-observations_2000-2019_biweekly_deterministic.zarr`\n",
"    - `forecast-like-observations_2020_biweekly_deterministic.zarr`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Resources used\n",
"\n",
"- platform: MPI-M supercomputer, 1 node\n",
"- memory: 64 GB\n",
"- processors: 36 CPU\n",
"- storage required: 10 GB"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Safeguards\n",
"\n",
"All points have to be [x] checked. If not, your submission is invalid.\n",
"\n",
"Changes to the code after submission are not possible, as the `commit` before the `tag` will be reviewed.\n",
"(Only in exceptional cases, and if previous effort towards reproducibility can be shown, may improvements to readability and reproducibility be allowed after November 1st 2021.)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Safeguards to prevent [overfitting](https://en.wikipedia.org/wiki/Overfitting?wprov=sfti1)\n",
"\n",
"If the organizers suspect overfitting, your contribution can be disqualified.\n",
"\n",
" - [x] We didn't use 2020 observations in training (explicit overfitting and cheating)\n",
" - [x] We didn't repeatedly verify our model on 2020 observations and incrementally improve our RPSS (implicit overfitting)\n",
" - [x] We provide RPSS scores for the training period with script `skill_by_year`, see in section 6.3 `predict`.\n",
" - [x] We tried our best to prevent [data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)?wprov=sfti1).\n",
" - [x] We honor the `train-validate-test` [split principle](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets). This means that the hindcast data is split into `train` and `validate`, whereas `test` is withheld.\n",
" - [x] We did not use `test` explicitly in training or implicitly in incrementally adjusting parameters.\n",
" - [x] We considered [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics))."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Safeguards for Reproducibility\n",
"Notebook/code must be independently reproducible from scratch by the organizers (after the competition). If this is not possible: no prize.\n",
" - [x] All training data is publicly available (no pre-trained private neural networks, as they are not reproducible for us)\n",
" - [x] Code is well documented, readable and reproducible.\n",
" - [x] Code to reproduce training and predictions is preferred to run within a day on the described architecture. If the training takes longer than a day, please justify why this is needed. Please do not submit training pipelines which take weeks to train."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Imports"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<xarray.core.options.set_options at 0x7f05cc486340>"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import xarray as xr\n",
"\n",
"xr.set_options(display_style='text')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Get training data\n",
"\n",
"Preprocessing of input data may be done in a separate notebook/script."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Hindcast\n",
"\n",
"get weekly initialized hindcasts"
]
},
{
"cell_type": "code",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[33m\u001b[1mWarning: \u001b[0mRun CLI commands only from project's root directory.\n",
"\u001b[0m\n"
]
}
],
"source": [
"# preprocessed as renku dataset\n",
"!renku storage pull ../data/ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"hind_2000_2019 = xr.open_zarr(\"../data/ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr\", consolidated=True)"
]
},
{
"cell_type": "code",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[33m\u001b[1mWarning: \u001b[0mRun CLI commands only from project's root directory.\n",
"\u001b[0m\n"
]
}
],
"source": [
"# preprocessed as renku dataset\n",
"!renku storage pull ../data/ecmwf_forecast-input_2020_biweekly_deterministic.zarr"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"fct_2020 = xr.open_zarr(\"../data/ecmwf_forecast-input_2020_biweekly_deterministic.zarr\", consolidated=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Observations\n",
"corresponding to hindcasts"
]
},
{
"cell_type": "code",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[33m\u001b[1mWarning: \u001b[0mRun CLI commands only from project's root directory.\n",
"\u001b[0m\n"
]
}
],
"source": [
"# preprocessed as renku dataset\n",
"!renku storage pull ../data/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"obs_2000_2019 = xr.open_zarr(\"../data/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr\", consolidated=True)"
]
},
{
"cell_type": "code",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[33m\u001b[1mWarning: \u001b[0mRun CLI commands only from project's root directory.\n",
"\u001b[0m\n"
]
}
],
"source": [
"# preprocessed as renku dataset\n",
"!renku storage pull ../data/forecast-like-observations_2020_biweekly_deterministic.zarr"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"obs_2020 = xr.open_zarr(\"../data/forecast-like-observations_2020_biweekly_deterministic.zarr\", consolidated=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# no ML model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, we just remove the mean bias from the ensemble mean forecast."
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"from scripts import add_year_week_coords\n",
"obs_2000_2019 = add_year_week_coords(obs_2000_2019)\n",
"hind_2000_2019 = add_year_week_coords(hind_2000_2019)"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/opt/conda/lib/python3.8/site-packages/dask/array/numpy_compat.py:39: RuntimeWarning: invalid value encountered in true_divide\n",
" x = np.divide(x1, x2, out)\n"
]
}
],
"source": [
"bias_2000_2019 = (hind_2000_2019.mean('realization') - obs_2000_2019).groupby('week').mean().compute()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## `predict`\n",
"\n",
"Create predictions and print `mean(variable, lead_time, longitude, weighted latitude)` RPSS for all years as calculated by `skill_by_year`."
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"from scripts import make_probabilistic"
]
},
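{
"cell_type": "markdown",
"metadata": {},
"source": [
"For orientation, a common definition of the ranked probability skill score (RPSS) over the $K = 3$ tercile categories is sketched below. The notation is an assumption made here; the exact averaging used for scoring is implemented in `skill_by_year` and is not shown in this notebook. $P_j$ is the forecast probability of category $j$ and $O_j$ the observed one (1 for the category that occurred, 0 otherwise).\n",
"\n",
"$$\\mathrm{RPS} = \\sum_{k=1}^{K} \\left( \\sum_{j=1}^{k} P_j - \\sum_{j=1}^{k} O_j \\right)^2$$\n",
"\n",
"$$\\mathrm{RPSS} = 1 - \\frac{\\overline{\\mathrm{RPS}}_{\\mathrm{forecast}}}{\\overline{\\mathrm{RPS}}_{\\mathrm{climatology}}}$$\n",
"\n",
"Here the climatological reference assigns probability $1/3$ to each tercile. An RPSS above 0 means the forecast beats climatology, and 1 is a perfect forecast."
]
},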
{
"cell_type": "code",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[33m\u001b[1mWarning: \u001b[0mRun CLI commands only from project's root directory.\n",
"\u001b[0m\n"
]
}
],
"source": [
"!renku storage pull ../data/hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"tercile_file = '../data/hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc'\n",
"tercile_edges = xr.open_dataset(tercile_file)"
]
},
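{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"def create_predictions(fct, bias):\n",
"    # add the `week` coordinate if missing, so the weekly bias can be selected\n",
"    if 'week' not in fct.coords:\n",
"        fct = add_year_week_coords(fct)\n",
"    # remove the mean bias of the matching week of the year\n",
"    preds = fct - bias.sel(week=fct.week)\n",
"    # convert to probabilistic tercile forecasts using `tercile_edges`\n",
"    preds = make_probabilistic(preds, tercile_edges)\n",
"    return preds.astype('float32')"
]
},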
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### `predict` training period in-sample"
]
},
{
"cell_type": "code",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[33m\u001b[1mWarning: \u001b[0mRun CLI commands only from project's root directory.\n",
"\u001b[0m\n"
]
}
],
"source": [
"!renku storage pull ../data/forecast-like-observations_2020_biweekly_terciled.nc"
]
},
{
"cell_type": "code",
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[33m\u001b[1mWarning: \u001b[0mRun CLI commands only from project's root directory.\n",
"\u001b[0m\n"
]
}
],
"source": [
"!renku storage pull ../data/hindcast-like-observations_2000-2019_biweekly_terciled.zarr"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"preds_is = create_predictions(hind_2000_2019, bias_2000_2019).compute()"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"from scripts import skill_by_year"
]
},
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### `predict` test"
]
},
{
"cell_type": "code",
"source": [
"preds_test = create_predictions(fct_2020, bias_2000_2019)"
]
},
{
"cell_type": "code",
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Submission"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"from scripts import assert_predictions_2020\n",
"assert_predictions_2020(preds_test)"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"preds_test.attrs = {'author': 'Aaron Spring', 'author_email': 'aaron.spring@mpimet.mpg.de',\n",
"                    'comment': 'created for the s2s-ai-challenge as a template for the website',\n",
"                    'notebook': 'mean_bias_reduction.ipynb',\n",
"                    'website': 'https://s2s-ai-challenge.github.io/#evaluation'}\n",
"\n",
"html_repr = xr.core.formatting_html.dataset_repr(preds_test)\n",
"\n",
"with open('submission_template_repr.html', 'w') as myFile:\n",
"    myFile.write(html_repr)"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"preds_test.to_netcdf('../submissions/ML_prediction_2020.nc')"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"# !git add ../submissions/ML_prediction_2020.nc\n",
"# !git add mean_bias_reduction.ipynb"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"#!git commit -m \"template_test no ML mean bias reduction\" # whatever message you want"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#!git tag \"submission-no_ML_mean_bias_reduction-0.0.2\" # if this is to be checked by scorer, only the last submitted==tagged version will be considered"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#!git push --tags"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Reproducibility"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## memory"
]
},
{
"cell_type": "code",
"source": [
"# https://phoenixnap.com/kb/linux-commands-check-memory-usage\n",
"!free -g"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## CPU"
]
},
{
"cell_type": "code",
"source": [
"!lscpu"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## software"
]
},
{
"cell_type": "code",
"source": [
"!conda list"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
},
"toc-autonumbering": true
},
"nbformat": 4,
"nbformat_minor": 4
}