Commit 40fc39be authored by Aaron Spring

Merge branch 'AS_year_week' into 'master'

add year week as coord not replacing forecast_time

Closes s2s-ai-challenge#29

See merge request !18
parents 3fd65723 f0198ecd
Pipeline #234736 passed with stage in 24 seconds
%% Cell type:markdown id: tags:
# Train ML model to correct predictions of week 3-4 & 5-6
This notebook creates a Machine Learning `ML_model` to predict weeks 3-4 & 5-6 based on `S2S` weeks 3-4 & 5-6 forecasts and compares it to `CPC` observations for the [`s2s-ai-challenge`](https://s2s-ai-challenge.github.io/).
%% Cell type:markdown id: tags:
# Synopsis
%% Cell type:markdown id: tags:
## Method: `mean bias reduction`
- calculate the mean bias from 2000-2019 deterministic ensemble mean forecast
- remove that mean bias from the 2020 deterministic ensemble mean forecast
- no Machine Learning used here
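The two steps above can be sketched without any library. This is a minimal illustration on toy numbers, assuming the bias is averaged per week of the year. The dictionaries keyed by `(year, week)` and the helper names `weekly_mean_bias` and `remove_bias` are hypothetical stand-ins for the notebook's xarray datasets and are not part of `scripts`.

```python
from collections import defaultdict
from statistics import mean


def weekly_mean_bias(hindcasts, observations):
    """Mean (forecast - observation) per week of the year.

    Both inputs map (year, week) -> value. These toy structures are
    assumptions for illustration, not the notebook's xarray datasets.
    """
    errors = defaultdict(list)
    for (year, week), forecast in hindcasts.items():
        errors[week].append(forecast - observations[(year, week)])
    return {week: mean(errs) for week, errs in errors.items()}


def remove_bias(forecasts, bias):
    """Subtract the week-of-year bias from every forecast value."""
    return {key: value - bias[key[1]] for key, value in forecasts.items()}


# Toy hindcasts: 1.5 units too high in week 1, 0.5 units too low in week 2.
obs = {(2000, 1): 10.0, (2001, 1): 12.0, (2000, 2): 11.0, (2001, 2): 13.0}
hind = {(2000, 1): 11.5, (2001, 1): 13.5, (2000, 2): 10.5, (2001, 2): 12.5}

bias = weekly_mean_bias(hind, obs)
corrected_2020 = remove_bias({(2020, 1): 14.0, (2020, 2): 9.0}, bias)
print(bias)
print(corrected_2020)
```

The bias is estimated from the hindcast years only, so the 2020 observations never enter the correction.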
%% Cell type:markdown id: tags:
## Data used
type: renku datasets
Training-input for Machine Learning model:
- hindcasts of models:
    - ECMWF: `ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr`
Forecast-input for Machine Learning model:
- real-time 2020 forecasts of models:
    - ECMWF: `ecmwf_forecast-input_2020_biweekly_deterministic.zarr`
Compare Machine Learning model forecast against ground truth:
- `CPC` observations:
    - `hindcast-like-observations_2000-2019_biweekly_deterministic.zarr`
    - `forecast-like-observations_2020_biweekly_deterministic.zarr`
%% Cell type:markdown id: tags:
## Resources used
for training, details in reproducibility
- platform: MPI-M supercomputer, 1 node
- memory: 64 GB
- processors: 36 CPU
- storage required: 10 GB
%% Cell type:markdown id: tags:
## Safeguards
All points have to be [x] checked. If not, your submission is invalid.
Changes to the code after submissions are not possible, as the `commit` before the `tag` will be reviewed.
(Only in exceptions and if previous effort in reproducibility can be found, it may be allowed to improve readability and reproducibility after November 1st 2021.)
%% Cell type:markdown id: tags:
### Safeguards to prevent [overfitting](https://en.wikipedia.org/wiki/Overfitting?wprov=sfti1)
If the organizers suspect overfitting, your contribution can be disqualified.
- [x] We didn't use 2020 observations in training (explicit overfitting and cheating)
- [x] We didn't repeatedly verify our model on 2020 observations and incrementally improve our RPSS (implicit overfitting)
- [x] We provide RPSS scores for the training period with script `skill_by_year`, see in section 6.3 `predict`.
- [x] We tried our best to prevent [data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)?wprov=sfti1).
- [x] We honor the `train-validate-test` [split principle](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets). This means that the hindcast data is split into `train` and `validate`, whereas `test` is withheld.
- [x] We did not use `test` explicitly in training or implicitly in incrementally adjusting parameters.
- [x] We considered [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)).
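One way to honor the split is to hold out whole years, as sketched below. The helper names and the choice of 2017-2019 as validation years are assumptions made purely for illustration. The only fixed point taken from the safeguards is that 2020 stays out of both `train` and `validate`.

```python
def split_by_year(years, validate_years):
    """Split hindcast years into train and validate lists (hypothetical helper)."""
    validate = [year for year in years if year in validate_years]
    train = [year for year in years if year not in validate_years]
    return train, validate


def leave_one_year_out(years):
    """Yield (train_years, validate_years) folds for year-wise cross-validation."""
    for held_out in years:
        yield [year for year in years if year != held_out], [held_out]


hindcast_years = list(range(2000, 2020))  # 2000-2019; 2020 is the withheld test year
train, validate = split_by_year(hindcast_years, {2017, 2018, 2019})
folds = list(leave_one_year_out(hindcast_years))
print(train[0], train[-1], validate)
print(len(folds))
```

Splitting by whole years avoids putting neighbouring, strongly correlated forecast dates on both sides of the split.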
%% Cell type:markdown id: tags:
### Safeguards for Reproducibility
Notebook/code must be independently reproducible from scratch by the organizers (after the competition), if not possible: no prize
- [x] All training data is publicly available (no pre-trained private neural networks, as they are not reproducible for us)
- [x] Code is well documented, readable and reproducible.
- [x] Code to reproduce training and predictions is preferred to run within a day on the described architecture. If the training takes longer than a day, please justify why this is needed. Please do not submit training pipelines that take weeks to train.
%% Cell type:markdown id: tags:
# Imports
%% Cell type:code id: tags:
``` python
import xarray as xr
xr.set_options(display_style='text')
```
%%%% Output: execute_result
<xarray.core.options.set_options at 0x2b37fc26ec50>
<xarray.core.options.set_options at 0x2b858800a050>
%% Cell type:markdown id: tags:
# Get training data
preprocessing of input data may be done in separate notebook/script
%% Cell type:markdown id: tags:
## Hindcast
get weekly initialized hindcasts
%% Cell type:code id: tags:
``` python
# preprocessed as renku dataset
!renku storage pull ../data/ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr
```
%% Cell type:code id: tags:
``` python
hind_2000_2019 = xr.open_zarr("../data/ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr", consolidated=True)
```
%% Cell type:code id: tags:
``` python
# preprocessed as renku dataset
!renku storage pull ../data/ecmwf_forecast-input_2020_biweekly_deterministic.zarr
```
%% Cell type:code id: tags:
``` python
fct_2020 = xr.open_zarr("../data/ecmwf_forecast-input_2020_biweekly_deterministic.zarr", consolidated=True)
```
%% Cell type:markdown id: tags:
## Observations
corresponding to hindcasts
%% Cell type:code id: tags:
``` python
# preprocessed as renku dataset
!renku storage pull ../data/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr
```
%% Cell type:code id: tags:
``` python
obs_2000_2019 = xr.open_zarr("../data/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr", consolidated=True)
```
%% Cell type:code id: tags:
``` python
# preprocessed as renku dataset
!renku storage pull ../data/forecast-like-observations_2020_biweekly_deterministic.zarr
```
%% Cell type:code id: tags:
``` python
obs_2020 = xr.open_zarr("../data/forecast-like-observations_2020_biweekly_deterministic.zarr", consolidated=True)
```
%% Cell type:markdown id: tags:
# no ML model
%% Cell type:markdown id: tags:
Here, we just remove the mean bias from the ensemble mean forecast.
%% Cell type:code id: tags:
``` python
from scripts import add_year_week_coords

# superseded: bias grouped by `forecast_time.weekofyear`
# bias_2000_2019 = (hind_2000_2019.mean('realization') - obs_2000_2019).groupby('forecast_time.weekofyear').mean().compute()

# add `year` and `week` as coordinates without replacing `forecast_time`
obs_2000_2019 = add_year_week_coords(obs_2000_2019)
hind_2000_2019 = add_year_week_coords(hind_2000_2019)
```
%%%% Output: stream
WARNING: ecmwflibs universal: cannot find a library called MagPlus
Magics library could not be found
%% Cell type:code id: tags:
``` python
bias_2000_2019 = (hind_2000_2019.mean('realization') - obs_2000_2019).groupby('week').mean().compute()
```
%%%% Output: stream
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/xarray/core/accessor_dt.py:381: FutureWarning: dt.weekofyear and dt.week have been deprecated. Please use dt.isocalendar().week instead.
FutureWarning,
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/dask/array/numpy_compat.py:40: RuntimeWarning: invalid value encountered in true_divide
x = np.divide(x1, x2, out)
%% Cell type:markdown id: tags:
## `predict`
Create predictions and print `mean(variable, lead_time, longitude, weighted latitude)` RPSS for all years as calculated by `skill_by_year`.
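For orientation, the score can be sketched in plain Python. This is a minimal sketch, assuming a climatological reference forecast of 1/3 per tercile and cosine-of-latitude weights; the helper names are hypothetical, and `skill_by_year` in `scripts` may differ in its details.

```python
import math
from itertools import accumulate


def rps(forecast_probs, observed_probs):
    """Ranked probability score: squared distance between cumulative distributions."""
    cumulative_forecast = accumulate(forecast_probs)
    cumulative_observed = accumulate(observed_probs)
    return sum((f - o) ** 2 for f, o in zip(cumulative_forecast, cumulative_observed))


def rpss(forecast_probs, observed_probs, reference_probs=(1 / 3, 1 / 3, 1 / 3)):
    """Skill relative to a reference forecast (assumed here: climatology)."""
    return 1 - rps(forecast_probs, observed_probs) / rps(reference_probs, observed_probs)


def latitude_weighted_mean(values_by_latitude):
    """Average values keyed by latitude in degrees, weighted by cos(latitude)."""
    weights = {lat: math.cos(math.radians(lat)) for lat in values_by_latitude}
    weighted_sum = sum(values_by_latitude[lat] * weights[lat] for lat in weights)
    return weighted_sum / sum(weights.values())


observed = (0.0, 0.0, 1.0)  # the upper tercile occurred
print(rpss((0.2, 0.3, 0.5), observed))  # positive: better than climatology
print(rpss((1.0, 0.0, 0.0), observed))  # negative: worse than climatology
print(latitude_weighted_mean({0.0: 1.0, 60.0: -1.0}))
```

Under this definition a negative RPSS means the forecast scores worse than always predicting 1/3 for each category.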
%% Cell type:code id: tags:
``` python
from scripts import make_probabilistic
```
%%%% Output: stream
WARNING: ecmwflibs universal: cannot find a library called MagPlus
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/climetlab/plotting/drivers/magics/actions.py:36: UserWarning: Magics library could not be found
warnings.warn(str(e))
%% Cell type:code id: tags:
``` python
!renku storage pull ../data/hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc
```
%% Cell type:code id: tags:
``` python
tercile_file = '../data/hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc'
tercile_edges = xr.open_dataset(tercile_file)
```
%% Cell type:code id: tags:
``` python
def create_predictions(fct, bias):
    """Remove the weekly mean bias from `fct` and convert to tercile probabilities."""
    # superseded: preds = fct - bias.sel(weekofyear=fct.forecast_time.dt.weekofyear)
    if 'week' not in fct.coords:
        fct = add_year_week_coords(fct)
    preds = fct - bias.sel(week=fct.week)
    preds = make_probabilistic(preds, tercile_edges)
    return preds.astype('float32')
```
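%% Cell type:markdown id: tags:
The conversion from a bias-corrected value to tercile probabilities can be pictured as follows. This is only a sketch of the idea, assuming two tercile edges per grid point and one-hot probabilities for a deterministic forecast; the function names are hypothetical, and the actual `make_probabilistic` in `scripts` may handle edge values, masks and dimensions differently.

```python
def tercile_probabilities(value, lower_edge, upper_edge):
    """One-hot probabilities for (below normal, near normal, above normal)."""
    if value < lower_edge:
        return (1.0, 0.0, 0.0)
    if value > upper_edge:
        return (0.0, 0.0, 1.0)
    return (0.0, 1.0, 0.0)


def ensemble_tercile_probabilities(members, lower_edge, upper_edge):
    """Fraction of ensemble members falling into each tercile category."""
    one_hot = [tercile_probabilities(m, lower_edge, upper_edge) for m in members]
    return tuple(sum(category) / len(members) for category in zip(*one_hot))


# Toy tercile edges at 5.0 and 9.0 (assumed values for illustration).
print(tercile_probabilities(2.0, 5.0, 9.0))
print(ensemble_tercile_probabilities([2.0, 6.0, 7.0, 10.0], 5.0, 9.0))
```

With a deterministic ensemble-mean input, as in this notebook, each prediction would place all probability in a single category under this assumption.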
%% Cell type:markdown id: tags:
### `predict` training period in-sample
%% Cell type:code id: tags:
``` python
!renku storage pull ../data/forecast-like-observations_2020_biweekly_terciled.nc
```
%% Cell type:code id: tags:
``` python
!renku storage pull ../data/hindcast-like-observations_2000-2019_biweekly_terciled.zarr
```
%% Cell type:code id: tags:
``` python
preds_is = create_predictions(hind_2000_2019, bias_2000_2019).compute()
```
%%%% Output: stream
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/xarray/core/accessor_dt.py:381: FutureWarning: dt.weekofyear and dt.week have been deprecated. Please use dt.isocalendar().week instead.
FutureWarning,
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/xarray/core/accessor_dt.py:381: FutureWarning: dt.weekofyear and dt.week have been deprecated. Please use dt.isocalendar().week instead.
FutureWarning,
%% Cell type:code id: tags:
``` python
from scripts import skill_by_year
```
%% Cell type:code id: tags:
``` python
skill_by_year(preds_is)
```
%%%% Output: execute_result
RPSS
year
2000 -0.141857
2001 -0.203405
2002 -0.202549
2003 -0.206234
2004 -0.549463
2005 -0.168421
2006 -0.184515
2007 -0.616939
2008 -0.195251
2009 -0.202809
2010 -0.189126
2011 -0.678302
2012 -0.620137
2013 -0.202285
2014 -0.206982
2015 -0.172498
2016 -0.136464
2017 -0.638293
2018 -0.667205
2019 -0.180896
2000 -0.147543
2001 -0.190513
2002 -0.186966
2003 -0.197091
2004 -0.194299
2005 -0.163684
2006 -0.180648
2007 -0.182294
2008 -0.173836
2009 -0.208382
2010 -0.150463
2011 -0.177190
2012 -0.176229
2013 -0.186028
2014 -0.192715
2015 -0.165891
2016 -0.159139
2017 -0.193667
2018 -0.202936
2019 -0.177937
%% Cell type:markdown id: tags:
### `predict` test
%% Cell type:code id: tags:
``` python
preds_test = create_predictions(fct_2020, bias_2000_2019)
```
%%%% Output: stream
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/xarray/core/accessor_dt.py:381: FutureWarning: dt.weekofyear and dt.week have been deprecated. Please use dt.isocalendar().week instead.
FutureWarning,
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/xarray/core/accessor_dt.py:381: FutureWarning: dt.weekofyear and dt.week have been deprecated. Please use dt.isocalendar().week instead.
FutureWarning,
%% Cell type:code id: tags:
``` python
skill_by_year(preds_test)
```
%%%% Output: stream
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/dask/array/numpy_compat.py:40: RuntimeWarning: invalid value encountered in true_divide
x = np.divide(x1, x2, out)
%%%% Output: execute_result
RPSS
year
2020 -0.093422
2020 -0.096323
%% Cell type:markdown id: tags:
# Submission
%% Cell type:code id: tags:
``` python
from scripts import assert_predictions_2020
assert_predictions_2020(preds_test)
```
%% Cell type:code id: tags:
``` python
preds_test.attrs = {'author': 'Aaron Spring', 'author_email': 'aaron.spring@mpimet.mpg.de',
                    'comment': 'created for the s2s-ai-challenge as a template for the website',
                    'notebook': 'mean_bias_reduction.ipynb',
                    'website': 'https://s2s-ai-challenge.github.io/#evaluation'}
html_repr = xr.core.formatting_html.dataset_repr(preds_test)
with open('submission_template_repr.html', 'w') as myFile:
    myFile.write(html_repr)
```
%% Cell type:code id: tags:
``` python
preds_test.to_netcdf('../submissions/ML_prediction_2020.nc')
```
%% Cell type:code id: tags:
``` python
#!git add ../submissions/ML_prediction_2020.nc
#!git add mean_bias_reduction.ipynb
# !git add ../submissions/ML_prediction_2020.nc
# !git add mean_bias_reduction.ipynb
```
%% Cell type:code id: tags:
``` python
#!git commit -m "template_test no ML mean bias reduction" # whatever message you want
```
%% Cell type:code id: tags:
``` python
#!git tag "submission-no_ML_mean_bias_reduction-0.0.1" # if this is to be checked by scorer, only the last submitted==tagged version will be considered
#!git tag "submission-no_ML_mean_bias_reduction-0.0.2" # if this is to be checked by scorer, only the last submitted==tagged version will be considered
```
%% Cell type:code id: tags:
``` python
#!git push --tags
```
%% Cell type:markdown id: tags:
# Reproducibility
%% Cell type:markdown id: tags:
## memory
%% Cell type:code id: tags:
``` python
# https://phoenixnap.com/kb/linux-commands-check-memory-usage
!free -g
```
%%%% Output: stream
total used free shared buffers cached
Mem: 62 15 46 0 0 5
-/+ buffers/cache: 10 52
Mem: 62 18 43 0 0 5
-/+ buffers/cache: 12 49
Swap: 0 0 0
%% Cell type:markdown id: tags:
## CPU
%% Cell type:code id: tags:
``` python
!lscpu
```
%%%% Output: stream
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 72
On-line CPU(s) list: 0-71
Thread(s) per core: 2
Core(s) per socket: 18
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz
Stepping: 1
CPU MHz: 2100.000
BogoMIPS: 4190.01
CPU MHz: 2101.000
BogoMIPS: 4190.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 46080K
NUMA node0 CPU(s): 0-17,36-53
NUMA node1 CPU(s): 18-35,54-71
%% Cell type:markdown id: tags:
## software
%% Cell type:code id: tags:
``` python
!conda list
```
%%%% Output: stream
# packages in environment at /work/mh0727/m300524/conda-envs/s2s-ai:
#
# Name Version Build Channel
_libgcc_mutex 0.1 main
_tflow_select 2.3.0 mkl
absl-py 0.12.0 py37h06a4308_0
aiobotocore 1.2.2 pyhd3eb1b0_0
aiohttp 3.7.4 py37h27cfd23_1
aioitertools 0.7.1 pyhd3eb1b0_0
anyio 2.2.0 pypi_0 pypi
appdirs 1.4.4 py_0
argcomplete 1.12.2 pypi_0 pypi
argon2-cffi 20.1.0 py37h27cfd23_1
asciitree 0.3.3 py_2
astunparse 1.6.3 py_0
async-timeout 3.0.1 py37h06a4308_0
async_generator 1.10 py37h28b3542_0
attrs 20.2.0 pypi_0 pypi
babel 2.9.0 pypi_0 pypi
backcall 0.2.0 pyhd3eb1b0_0
backrefs 5.0.1 pypi_0 pypi
bagit 1.8.1 pypi_0 pypi
beautifulsoup4 4.9.3 pyha847dfd_0
black 20.8b1 pypi_0 pypi
blas 1.0 mkl
bleach 3.3.0 pyhd3eb1b0_0
blinker 1.4 py37h06a4308_0
bokeh 2.3.0 py37h06a4308_0
botocore 1.20.33 pyhd3eb1b0_1
botocore 1.19.52 pyhd3eb1b0_0
bottleneck 1.3.2 py37heb32a55_1
bracex 2.1.1 pypi_0 pypi
branca 0.3.1 pypi_0 pypi
brotlipy 0.7.0 py37h27cfd23_1003
bzip2 1.0.8 h7b6447c_0
c-ares 1.17.1 h27cfd23_0
ca-certificates 2021.1.19 h06a4308_1
ca-certificates 2021.5.30 ha878542_0 conda-forge
cachecontrol 0.11.7 pypi_0 pypi
cachetools 4.2.1 pyhd3eb1b0_0
calamus 0.3.7 pypi_0 pypi
cdsapi 0.5.1 pypi_0 pypi
certifi 2020.12.5 py37h06a4308_0
certifi 2021.5.30 py37h89c1867_0 conda-forge
cffi 1.14.5 py37h261ae71_0
cfgrib 0.9.8.5 pyhd8ed1ab_0 conda-forge
cftime 1.4.1 py37h6323ea4_0
cftime 1.5.0 pypi_0 pypi
chardet 3.0.4 py37h06a4308_1003
click 7.1.2 pyhd3eb1b0_0
click-completion 0.5.2 pypi_0 pypi
click-plugins 1.1.1 pypi_0 pypi
climetlab 0.8.0 pypi_0 pypi
climetlab-s2s-ai-challenge 0.6.7 pypi_0 pypi
climetlab-s2s-ai-competition 0.3.7 pypi_0 pypi
climetlab 0.8.6 pypi_0 pypi
climetlab-s2s-ai-challenge 0.8.0 pypi_0 pypi
climpred 2.1.4 pypi_0 pypi
cloudpickle 1.6.0 py_0
colorama 0.4.4 pypi_0 pypi
coloredlogs 15.0 pypi_0 pypi
commonmark 0.9.1 pypi_0 pypi
configargparse 1.4 pypi_0 pypi
coverage 5.5 py37h27cfd23_2
cryptography 3.4.6 py37hd23ed53_0
curl 7.71.1 hbc83047_1
cwlgen 0.4.2 pypi_0 pypi
cwltool 3.0.20210319143721 pypi_0 pypi
cycler 0.10.0 py37_0
cython 0.29.22 py37h2531618_0
cytoolz 0.11.0 py37h7b6447c_0
dask 2021.3.0 pypi_0 pypi
dask-core 2021.6.2 pyhd8ed1ab_0 conda-forge
dask-labextension 5.0.1 pypi_0 pypi
dbus 1.13.18 hb2f20db_0
decorator 4.4.2 pyhd3eb1b0_0
defusedxml 0.7.1 pyhd3eb1b0_0
distributed 2021.3.0 py37h06a4308_0
distributed 2021.6.2 py37h89c1867_0 conda-forge
docopt 0.6.2 py37h06a4308_0
docutils 0.15.2 py37h89c1867_2 conda-forge
eccodes 1.2.0 pypi_0 pypi
ecmwf-api-client 1.6.1 pypi_0 pypi
ecmwflibs 0.2.3 pypi_0 pypi
entrypoints 0.3 py37_0
environ-config 20.1.0 pypi_0 pypi
expat 2.2.10 he6710b0_2
fasteners 0.16 pyhd3eb1b0_0
fastprogress 1.0.0 py_0 conda-forge
filelock 3.0.12 pypi_0 pypi
folium 0.12.1 pypi_0 pypi
fontconfig 2.13.1 h6c09931_0
freetype 2.10.4 h5ab3b9f_0
frozendict 1.2 pypi_0 pypi
fsspec 0.8.7 pyhd3eb1b0_0
gast 0.4.0 py_0
gitdb 4.0.6 pypi_0 pypi
gitpython 3.1.12 pypi_0 pypi
glib 2.67.4 h36276a3_1
google-auth 1.28.0 pyhd3eb1b0_0
google-auth-oauthlib 0.4.3 pyhd3eb1b0_0
google-pasta 0.2.0 py_0
grpcio 1.36.1 py37h2157cd5_1
gst-plugins-base 1.14.0 h8213a91_2
gstreamer 1.14.0 h28cd5cc_2
h5netcdf 0.10.0 pyhd8ed1ab_0 conda-forge
h5py 2.10.0 py37h7918eee_0
h5py 2.10.0 nompi_py37h1e651dc_105 conda-forge
hdf4 4.2.13 h3ca952b_2
hdf5 1.10.4 hb1b8bf9_0