Commit fb0ceb3b authored by Aaron Spring

Merge branch 'safeguards_typo' into 'master'

fix typo in safeguards

See merge request aaron.spring/s2s-ai-challenge-template!8
parents 19c69a9c 0d362947
Pipeline #204827 passed with stage in 19 seconds
......@@ -2,6 +2,7 @@
### unreleased
- Fix typo in safeguards in [ML_forecast_template.ipynb](https://renkulab.io/gitlab/aaron.spring/s2s-ai-challenge-template/-/blob/master/notebooks/ML_forecast_template.ipynb): "We did NOT use `test` explicitly in training or implicitly in incrementally adjusting parameters." (!8, [Aaron Spring](https://renkulab.io/gitlab/aaron.spring))
- Add notebooks showcasing accessing output of different models from different sources: (!2, [Aaron Spring](https://renkulab.io/gitlab/aaron.spring))
  - S2S-Project models:
    - from European Weather Cloud:
......
%% Cell type:markdown id: tags:
# Train ML model for predictions of week 3-4 & 5-6
This notebook creates a Machine Learning `ML_model` to predict weeks 3-4 & 5-6 based on `S2S` weeks 3-4 & 5-6 forecasts and compares it to `CPC` observations for the [`s2s-ai-challenge`](https://s2s-ai-challenge.github.io/).
%% Cell type:markdown id: tags:
# Synopsis
%% Cell type:markdown id: tags:
## Method: `name`
- description
- a few details
%% Cell type:markdown id: tags:
## Data used
Training-input for Machine Learning model:
- renku datasets, climetlab, IRIDL
Forecast-input for Machine Learning model:
- renku datasets, climetlab, IRIDL
Compare Machine Learning model forecast against ground truth:
- renku datasets, climetlab, IRIDL
%% Cell type:markdown id: tags:
## Resources used
for training, details in reproducibility
- platform: renku
- memory: 8 GB
- processors: 2 CPU
- storage required: 10 GB
%% Cell type:markdown id: tags:
## Safeguards
All points have to be [x] checked. If not, your submission is invalid.
Changes to the code after submissions are not possible, as the `commit` before the `tag` will be reviewed.
(Only in exceptional cases, and if previous effort toward reproducibility is evident, improvements to readability and reproducibility may be allowed after November 1st 2021.)
%% Cell type:markdown id: tags:
### Safeguards to prevent [overfitting](https://en.wikipedia.org/wiki/Overfitting?wprov=sfti1)
If the organizers suspect overfitting, your contribution can be disqualified.
- [ ] We didnt use 2020 observations in training (explicit overfitting and cheating)
- [ ] We didnt repeatedly verify my model on 2020 observations and incrementally improved my RPSS (implicit overfitting)
- [ ] We did not use 2020 observations in training (explicit overfitting and cheating)
- [ ] We did not repeatedly verify my model on 2020 observations and incrementally improved my RPSS (implicit overfitting)
- [ ] We provide RPSS scores for the training period with script `skill_by_year`, see in section 6.3 `predict`.
- [ ] We tried our best to prevent [data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)?wprov=sfti1).
- [ ] We honor the `train-validate-test` [split principle](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets). This means that the hindcast data is split into `train` and `validate`, whereas `test` is withheld.
- [ ] We did use `test` explicitly in training or implicitly in incrementally adjusting parameters.
- [ ] We did not use `test` explicitly in training or implicitly in incrementally adjusting parameters.
- [ ] We considered [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)).
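One way the cross-validation item could be approached is a leave-one-year-out scheme over the hindcast years. The sketch below is only an illustration: the helper `leave_one_year_out` is hypothetical and not part of the template, and it assumes the hindcast period 2000-2019 with 2020 kept out of every fold.

```python
import numpy as np


def leave_one_year_out(years):
    """Yield (train_years, validate_year) pairs (hypothetical helper)."""
    years = np.asarray(years)
    for held_out in years:
        # all remaining years are used for training in this fold
        yield years[years != held_out], held_out


# hindcast years only; 2020 (`test`) never enters any fold
hindcast_years = np.arange(2000, 2020)
folds = list(leave_one_year_out(hindcast_years))
```

Each fold trains on 19 years and validates on the remaining one, so every hindcast year serves as validation data exactly once.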
%% Cell type:markdown id: tags:
### Safeguards for Reproducibility
Notebook/code must be independently reproducible from scratch by the organizers (after the competition); if that is not possible: no prize
- [ ] All training data is publicly available (no pre-trained private neural networks, as they are not reproducible for us)
- [ ] Code is well documented, readable and reproducible.
- [ ] Code to reproduce training and predictions should preferably run within a day on the described architecture. If the training takes longer than a day, please justify why this is needed. Please do not submit training pipelines that take weeks to train.
%% Cell type:markdown id: tags:
# Todos to improve template
This is just a demo.
- [ ] for both variables
- [ ] for both `lead_time`s
- [ ] ensure probabilistic prediction outcome with `category` dim
%% Cell type:markdown id: tags:
# Imports
%% Cell type:code id: tags:
``` python
from tensorflow.keras.layers import Input, Dense, Flatten
from tensorflow.keras.models import Sequential
import matplotlib.pyplot as plt
import xarray as xr
xr.set_options(display_style='text')
from dask.utils import format_bytes
import xskillscore as xs
```
%% Cell type:markdown id: tags:
# Get training data
preprocessing of input data may be done in separate notebook/script
%% Cell type:markdown id: tags:
## Hindcast
get weekly initialized hindcasts
%% Cell type:code id: tags:
``` python
# consider renku datasets
#! renku storage pull path
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
## Observations
corresponding to hindcasts
%% Cell type:code id: tags:
``` python
# consider renku datasets
#! renku storage pull path
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
# ML model
%% Cell type:code id: tags:
``` python
bs=32

import numpy as np
import tensorflow.keras as keras  # `keras` is used below but not imported in the Imports cell


class DataGenerator(keras.utils.Sequence):
    def __init__(self):
        """
        Data generator

        Template from https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly

        Args:
        """
        # template: set self.data, self.verif_data, self.n_samples, self.batch_size,
        # self.shuffle and `load` here; the calls below rely on them
        self.on_epoch_end()

        # For some weird reason calling .load() earlier messes up the mean and std computations
        if load:
            print('Loading data into RAM')
            self.data.load()

    def __len__(self):
        'Denotes the number of batches per epoch'
        return int(np.ceil(self.n_samples / self.batch_size))

    def __getitem__(self, i):
        'Generate one batch of data'
        idxs = self.idxs[i * self.batch_size:(i + 1) * self.batch_size]
        # got all nan if nans not masked
        X = self.data.isel(time=idxs).fillna(0.).values
        y = self.verif_data.isel(time=idxs).fillna(0.).values
        return X, y

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.idxs = np.arange(self.n_samples)
        if self.shuffle:
            np.random.shuffle(self.idxs)
```
%% Cell type:markdown id: tags:
## data prep: train, valid, test
%% Cell type:code id: tags:
``` python
# time is the forecast_reference_time
time_train_start,time_train_end='2000','2017'
time_valid_start,time_valid_end='2018','2019'
time_test = '2020'
```
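A minimal sketch of how the period boundaries above might be applied with xarray, in line with the `train-validate-test` split from the safeguards. The `hindcast` array here is synthetic and purely illustrative; real data would come from the renku datasets, and the variable names `train`, `valid` and `test` are assumptions.

```python
import numpy as np
import pandas as pd
import xarray as xr

# synthetic stand-in for weekly initialized data (illustrative only)
time = pd.date_range("2000-01-06", "2020-12-31", freq="7D")
hindcast = xr.DataArray(
    np.random.rand(time.size), dims=["time"], coords={"time": time}, name="t2m"
)

# time is the forecast_reference_time
time_train_start, time_train_end = "2000", "2017"
time_valid_start, time_valid_end = "2018", "2019"
time_test = "2020"

# label-based slices on a datetime index include both end years
train = hindcast.sel(time=slice(time_train_start, time_train_end))
valid = hindcast.sel(time=slice(time_valid_start, time_valid_end))
test = hindcast.sel(time=time_test)  # withheld until the final prediction
```

Selecting by year strings keeps the three periods disjoint without hard-coding exact initialization dates.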
%% Cell type:code id: tags:
``` python
dg_train = DataGenerator()
```
%% Cell type:code id: tags:
``` python
dg_valid = DataGenerator()
```
%% Cell type:code id: tags:
``` python
dg_test = DataGenerator()
```
%% Cell type:markdown id: tags:
## `fit`
%% Cell type:code id: tags:
``` python
cnn = keras.models.Sequential([])
```
%% Cell type:code id: tags:
``` python
cnn.summary()
```
%% Cell type:code id: tags:
``` python
cnn.compile(keras.optimizers.Adam(1e-4), 'mse')
```
%% Cell type:code id: tags:
``` python
import warnings
warnings.simplefilter("ignore")
```
%% Cell type:code id: tags:
``` python
cnn.fit(dg_train, epochs=1, validation_data=dg_valid)
```
%% Cell type:markdown id: tags:
## `predict`
Create predictions and print `mean(variable, lead_time, longitude, weighted latitude)` RPSS for all years as calculated by `skill_by_year`. For now RPS, todo: change to RPSS.
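For the RPS to RPSS todo, a rough sketch of the relationship may help: RPSS = 1 - RPS_ML / RPS_clim, where the reference is a climatological forecast with probability 1/3 per tercile category. The functions `rps` and `rpss` below are hypothetical and are not the implementation in `scripts`; the sketch assumes 2D fields of shape (latitude, longitude) and averages with cos(latitude) weights before forming the ratio.

```python
import numpy as np


def rps(fct_prob, obs_onehot, category_axis=0):
    """Ranked probability score: squared distance between cumulative probabilities."""
    cum_fct = np.cumsum(fct_prob, axis=category_axis)
    cum_obs = np.cumsum(obs_onehot, axis=category_axis)
    return ((cum_fct - cum_obs) ** 2).sum(axis=category_axis)


def rpss(rps_ml, rps_clim, latitude):
    """1 - RPS_ML / RPS_clim after a cos(latitude)-weighted spatial mean.

    Assumes inputs of shape (latitude, longitude); a simplification.
    """
    weights = np.cos(np.deg2rad(latitude))
    mean_ml = np.average(rps_ml.mean(axis=1), weights=weights)
    mean_clim = np.average(rps_clim.mean(axis=1), weights=weights)
    return 1 - mean_ml / mean_clim


# toy example: 3 tercile categories on a 2 x 2 grid
latitude = np.array([0.0, 60.0])
obs = np.zeros((3, 2, 2))
obs[0] = 1.0                      # lower tercile observed everywhere
clim = np.full((3, 2, 2), 1 / 3)  # climatological tercile probabilities
perfect = obs.copy()

rpss_perfect = rpss(rps(perfect, obs), rps(clim, obs), latitude)
rpss_clim = rpss(rps(clim, obs), rps(clim, obs), latitude)
```

Under these assumptions a perfect forecast scores 1 and the climatological forecast scores 0; positive values indicate skill over climatology.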
%% Cell type:code id: tags:
``` python
from scripts import skill_by_year
```
%% Cell type:code id: tags:
``` python
def create_predictions(model, dg):
    """Create non-iterative predictions"""
    preds = model.predict(dg).squeeze()
    # transform
    return preds
```
%% Cell type:markdown id: tags:
### `predict` training period in-sample
%% Cell type:code id: tags:
``` python
preds_is = create_predictions(cnn, dg_train)
```
%% Cell type:code id: tags:
``` python
skill_by_year(preds_is)
```
%% Cell type:markdown id: tags:
### `predict` valid out-of-sample
%% Cell type:code id: tags:
``` python
preds_os = create_predictions(cnn, dg_valid)
```
%% Cell type:code id: tags:
``` python
skill_by_year(preds_os)
```
%% Cell type:markdown id: tags:
### `predict` test
%% Cell type:code id: tags:
``` python
preds_test = create_predictions(cnn, dg_test)
```
%% Cell type:code id: tags:
``` python
skill_by_year(preds_test)
```
%% Cell type:markdown id: tags:
# Submission
%% Cell type:code id: tags:
``` python
preds_test.sizes # expect: category(3), longitude, latitude, lead_time(2), forecast_time (53)
```
%% Cell type:code id: tags:
``` python
from scripts import assert_predictions_2020
assert_predictions_2020(preds_test)
```
%% Cell type:code id: tags:
``` python
preds_test.to_netcdf('../submissions/ML_prediction_2020.nc')
```
%% Cell type:code id: tags:
``` python
#!git add ../submissions/ML_prediction_2020.nc
```
%% Cell type:code id: tags:
``` python
#!git commit -m "commit submission for my_method_name" # whatever message you want
```
%% Cell type:code id: tags:
``` python
#!git tag "submission-my_method_name-0.0.1" # if this is to be checked by scorer, only the last submitted==tagged version will be considered
```
%% Cell type:code id: tags:
``` python
#!git push --tags
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
# Reproducibility
%% Cell type:markdown id: tags:
## memory
%% Cell type:code id: tags:
``` python
# https://phoenixnap.com/kb/linux-commands-check-memory-usage
!free -g
```
%% Cell type:markdown id: tags:
## CPU
%% Cell type:code id: tags:
``` python
!lscpu
```
%% Cell type:markdown id: tags:
## software
%% Cell type:code id: tags:
``` python
!conda list
```
%% Cell type:code id: tags:
``` python
```
......
%% Cell type:markdown id: tags:
# Train ML model to correct predictions of week 3-4 & 5-6
This notebook creates a Machine Learning `ML_model` to predict weeks 3-4 & 5-6 based on `S2S` weeks 3-4 & 5-6 forecasts and compares it to `CPC` observations for the [`s2s-ai-challenge`](https://s2s-ai-challenge.github.io/).
%% Cell type:markdown id: tags:
# Synopsis
%% Cell type:markdown id: tags:
## Method: `ML-based mean bias reduction`
- calculate the ML-based bias from the 2000-2019 deterministic ensemble mean forecast
- remove the ML-based bias from the 2020 deterministic ensemble mean forecast
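To illustrate the two steps, here is a sketch of the classical, non-ML analogue: a time-mean bias per grid cell, estimated on the training period and subtracted from new forecasts. This is not the notebook's ML-based method, which learns the correction with a neural network. All arrays are synthetic and the names `hind_train`, `obs_train` and `fct_new` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic stand-ins with shape (time, latitude, longitude)
obs_train = rng.normal(280.0, 5.0, size=(200, 4, 8))
hind_train = obs_train + 1.5  # hindcast runs 1.5 K too warm everywhere
fct_new = rng.normal(280.0, 5.0, size=(10, 4, 8)) + 1.5

# step 1: estimate the mean bias per grid cell from the training period
bias = (hind_train - obs_train).mean(axis=0)

# step 2: remove that bias from the new forecast
fct_new_corrected = fct_new - bias
```

An ML-based variant replaces the fixed `bias` field with a learned mapping from forecast to observation, but the train-then-apply structure stays the same.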
%% Cell type:markdown id: tags:
## Data used
type: renku datasets
Training-input for Machine Learning model:
- hindcasts of models:
- ECMWF: `ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr`
Forecast-input for Machine Learning model:
- real-time 2020 forecasts of models:
- ECMWF: `ecmwf_forecast-input_2020_biweekly_deterministic.zarr`
Compare Machine Learning model forecast against ground truth:
- `CPC` observations:
- `hindcast-like-observations_biweekly_deterministic.zarr`
- `forecast-like-observations_2020_biweekly_deterministic.zarr`
%% Cell type:markdown id: tags:
## Resources used
for training, details in reproducibility
- platform: MPI-M supercomputer, 1 node
- memory: 64 GB
- processors: 36 CPU
- storage required: 10 GB
%% Cell type:markdown id: tags:
## Safeguards
All points have to be [x] checked. If not, your submission is invalid.
Changes to the code after submissions are not possible, as the `commit` before the `tag` will be reviewed.
(Only in exceptional cases, and if previous effort toward reproducibility is evident, improvements to readability and reproducibility may be allowed after November 1st 2021.)
%% Cell type:markdown id: tags:
### Safeguards to prevent [overfitting](https://en.wikipedia.org/wiki/Overfitting?wprov=sfti1)
If the organizers suspect overfitting, your contribution can be disqualified.
- [x] We didnt use 2020 observations in training (explicit overfitting and cheating)
- [x] We didnt repeatedly verify my model on 2020 observations and incrementally improved my RPSS (implicit overfitting)
- [x] We did not use 2020 observations in training (explicit overfitting and cheating)
- [x] We did not repeatedly verify my model on 2020 observations and incrementally improved my RPSS (implicit overfitting)
- [x] We provide RPSS scores for the training period with script `print_RPS_per_year`, see in section 6.3 `predict`.
- [x] We tried our best to prevent [data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)?wprov=sfti1).
- [x] We honor the `train-validate-test` [split principle](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets). This means that the hindcast data is split into `train` and `validate`, whereas `test` is withheld.
- [x] We did use `test` explicitly in training or implicitly in incrementally adjusting parameters.
- [x] We did not use `test` explicitly in training or implicitly in incrementally adjusting parameters.
- [x] We considered [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)).
%% Cell type:markdown id: tags:
### Safeguards for Reproducibility
Notebook/code must be independently reproducible from scratch by the organizers (after the competition); if that is not possible: no prize
- [x] All training data is publicly available (no pre-trained private neural networks, as they are not reproducible for us)
- [x] Code is well documented, readable and reproducible.
- [x] Code to reproduce training and predictions should preferably run within a day on the described architecture. If the training takes longer than a day, please justify why this is needed. Please do not submit training pipelines that take weeks to train.
%% Cell type:markdown id: tags:
# Todos to improve template
This is just a demo.
- [ ] use multiple predictor variables and two predicted variables
- [ ] for both `lead_time`s in one go
- [ ] consider seasonality, for now all `forecast_time` months are mixed
- [ ] make probabilistic predictions with `category` dim, for now works deterministic
%% Cell type:markdown id: tags:
# Imports
%% Cell type:code id: tags:
``` python
from tensorflow.keras.layers import Input, Dense, Flatten
from tensorflow.keras.models import Sequential
import matplotlib.pyplot as plt
import xarray as xr
xr.set_options(display_style='text')
import numpy as np
from dask.utils import format_bytes
import xskillscore as xs
```
%% Cell type:markdown id: tags:
# Get training data
preprocessing of input data may be done in separate notebook/script
%% Cell type:markdown id: tags:
## Hindcast
get weekly initialized hindcasts
%% Cell type:code id: tags:
``` python
v='t2m'
```
%% Cell type:code id: tags:
``` python
# preprocessed as renku dataset
!renku storage pull ../data/ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr
```
%% Cell type:code id: tags:
``` python
hind_2000_2019 = xr.open_zarr("../data/ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr", consolidated=True)
```
%% Cell type:code id: tags:
``` python
# preprocessed as renku dataset
!renku storage pull ../data/ecmwf_forecast-input_2020_biweekly_deterministic.zarr
```
%% Cell type:code id: tags:
``` python
fct_2020 = xr.open_zarr("../data/ecmwf_forecast-input_2020_biweekly_deterministic.zarr", consolidated=True)
```
%% Cell type:markdown id: tags:
## Observations
corresponding to hindcasts
%% Cell type:code id: tags:
``` python
# preprocessed as renku dataset
!renku storage pull ../data/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr
```
%% Cell type:code id: tags:
``` python
obs_2000_2019 = xr.open_zarr("../data/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr", consolidated=True)#[v]
```
%% Cell type:code id: tags:
``` python
# preprocessed as renku dataset
!renku storage pull ../data/forecast-like-observations_2020_biweekly_deterministic.zarr
```
%% Cell type:code id: tags:
``` python
obs_2020 = xr.open_zarr("../data/forecast-like-observations_2020_biweekly_deterministic.zarr", consolidated=True)#[v]
```
%% Cell type:markdown id: tags:
# ML model
%% Cell type:markdown id: tags:
based on [Weatherbench](https://github.com/pangeo-data/WeatherBench/blob/master/quickstart.ipynb)
%% Cell type:code id: tags:
``` python
# run once only and don't commit
!git clone https://github.com/pangeo-data/WeatherBench/
```
%% Cell type:code id: tags:
``` python
import sys
sys.path.insert(1, 'WeatherBench')
from WeatherBench.src.train_nn import DataGenerator, PeriodicConv2D, create_predictions
import tensorflow.keras as keras
```
%% Cell type:code id: tags:
``` python
bs=32
import numpy as np