Commit 2364af44 authored by Aaron Spring's avatar Aaron Spring 🚼
Browse files

fix RPSS_verif and skill_by_year

parent ebb987a6
Pipeline #201598 passed with stage
in 17 seconds
%% Cell type:markdown id: tags:
# Train ML model for predictions of week 3-4 & 5-6
This notebook create a Machine Learning `ML_model` to predict weeks 3-4 & 5-6 based on `S2S` weeks 3-4 & 5-6 forecasts and is compared to `CPC` observations for the [`s2s-ai-challenge`](https://s2s-ai-challenge.github.io/).
%% Cell type:markdown id: tags:
# Synopsis
%% Cell type:markdown id: tags:
## Method: `name`
- decription
- a few details
%% Cell type:markdown id: tags:
## Data used
Training-input for Machine Learning model:
- renku datasets, climetlab, IRIDL
Forecast-input for Machine Learning model:
- renku datasets, climetlab, IRIDL
Compare Machine Learning model forecast against ground truth:
- renku datasets, climetlab, IRIDL
%% Cell type:markdown id: tags:
## Resources used
for training, details in reproducibility
- platform: renku
- memory: 8 GB
- processors: 2 CPU
- storage required: 10 GB
%% Cell type:markdown id: tags:
## Safeguards
All points have to be [x] checked. If not, your submission is invalid.
Changes to the code after submissions are not possible, as the `commit` before the `tag` will be reviewed.
(Only in exceptions and if previous effort in reproducibility can be found, it may be allowed to improve readability and reproducibility after November 1st 2021.)
%% Cell type:markdown id: tags:
### Safeguards to prevent [overfitting](https://en.wikipedia.org/wiki/Overfitting?wprov=sfti1)
If the organizers suspect overfitting, your contribution can be disqualified.
- [ ] We didnt use 2020 observations in training (explicit overfitting and cheating)
- [ ] We didnt repeatedly verify my model on 2020 observations and incrementally improved my RPSS (implicit overfitting)
- [ ] We provide RPSS scores for the training period with script `skill_by_year`, see in section 6.3 `predict`.
- [ ] We tried our best to prevent [data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)?wprov=sfti1).
- [ ] We honor the `train-validate-test` [split principle](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets). This means that the hindcast data is split into `train` and `validate`, whereas `test` is withheld.
- [ ] We did use `test` explicitly in training or implicitly in incrementally adjusting parameters.
- [ ] We considered [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)).
%% Cell type:markdown id: tags:
### Safeguards for Reproducibility
Notebook/code must be independently reproducible from scratch by the organizers (after the competition), if not possible: no prize
- [ ] All training data is publicly available (no pre-trained private neural networks, as they are not reproducible for us)
- [ ] Code is well documented, readable and reproducible.
- [ ] Code to reproduce training and predictions should run within a day on the described architecture. If the training takes longer than a day, please justify why this is needed. Please do not submit training piplelines, which take weeks to train.
- [ ] Code to reproduce training and predictions is preferred to run within a day on the described architecture. If the training takes longer than a day, please justify why this is needed. Please do not submit training piplelines, which take weeks to train.
%% Cell type:markdown id: tags:
# Todos to improve template
This is just a demo.
- [ ] for both variables
- [ ] for both `lead_time`s
- [ ] ensure probabilistic prediction outcome with `category` dim
%% Cell type:markdown id: tags:
# Imports
%% Cell type:code id: tags:
``` python
from tensorflow.keras.layers import Input, Dense, Flatten
from tensorflow.keras.models import Sequential
import matplotlib.pyplot as plt
import xarray as xr
xr.set_options(display_style='text')
from dask.utils import format_bytes
import xskillscore as xs
```
%% Cell type:markdown id: tags:
# Get training data
preprocessing of input data may be done in separate notebook/script
%% Cell type:markdown id: tags:
## Hindcast
get weekly initialized hindcasts
%% Cell type:code id: tags:
``` python
# consider renku datasets
#! renku storage pull path
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
## Observations
corresponding to hindcasts
%% Cell type:code id: tags:
``` python
# consider renku datasets
#! renku storage pull path
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
# ML model
%% Cell type:code id: tags:
``` python
bs=32
import numpy as np
class DataGenerator(keras.utils.Sequence):
def __init__(self):
"""
Data generator
Template from https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly
Args:
"""
self.on_epoch_end()
# For some weird reason calling .load() earlier messes up the mean and std computations
if load: print('Loading data into RAM'); self.data.load()
def __len__(self):
'Denotes the number of batches per epoch'
return int(np.ceil(self.n_samples / self.batch_size))
def __getitem__(self, i):
'Generate one batch of data'
idxs = self.idxs[i * self.batch_size:(i + 1) * self.batch_size]
# got all nan if nans not masked
X = self.data.isel(time=idxs).fillna(0.).values
y = self.verif_data.isel(time=idxs).fillna(0.).values
return X, y
def on_epoch_end(self):
'Updates indexes after each epoch'
self.idxs = np.arange(self.n_samples)
if self.shuffle == True:
np.random.shuffle(self.idxs)
```
%% Cell type:markdown id: tags:
## data prep: train, valid, test
%% Cell type:code id: tags:
``` python
# time is the forecast_reference_time
time_train_start,time_train_end='2000','2017'
time_valid_start,time_valid_end='2018','2019'
time_test = '2020'
```
%% Cell type:code id: tags:
``` python
dg_train = DataGenerator()
```
%% Cell type:code id: tags:
``` python
dg_valid = DataGenerator()
```
%% Cell type:code id: tags:
``` python
dg_test = DataGenerator()
```
%% Cell type:markdown id: tags:
## `fit`
%% Cell type:code id: tags:
``` python
cnn = keras.models.Sequential([])
```
%% Cell type:code id: tags:
``` python
cnn.summary()
```
%% Cell type:code id: tags:
``` python
cnn.compile(keras.optimizers.Adam(1e-4), 'mse')
```
%% Cell type:code id: tags:
``` python
import warnings
warnings.simplefilter("ignore")
```
%% Cell type:code id: tags:
``` python
cnn.fit(dg_train, epochs=1, validation_data=dg_valid)
```
%% Cell type:markdown id: tags:
## `predict`
Create predictions and print `mean(variable, lead_time, longitude, weighted latitude)` RPSS for all years as calculated by `skill_by_year`. For now RPS, todo: change to RPSS.
%% Cell type:code id: tags:
``` python
from scripts import skill_by_year
```
%% Cell type:code id: tags:
``` python
def create_predictions(model, dg):
"""Create non-iterative predictions"""
preds = model.predict(dg).squeeze()
# transform
return preds
```
%% Cell type:markdown id: tags:
### `predict` training period in-sample
%% Cell type:code id: tags:
``` python
preds_is = create_predictions(cnn, dg_train)
```
%% Cell type:code id: tags:
``` python
skill_by_year(preds_is)
```
%% Cell type:markdown id: tags:
### `predict` valid out-of-sample
%% Cell type:code id: tags:
``` python
preds_os = create_predictions(cnn, dg_valid)
```
%% Cell type:code id: tags:
``` python
skill_by_year(preds_os)
```
%% Cell type:markdown id: tags:
### `predict` test
%% Cell type:code id: tags:
``` python
preds_test = create_predictions(cnn, dg_test)
```
%% Cell type:code id: tags:
``` python
skill_by_year(preds_test)
```
%% Cell type:markdown id: tags:
# Submission
%% Cell type:code id: tags:
``` python
preds_test.sizes # expect: category(3), longitude, latitude, lead_time(2), forecast_time (53)
```
%% Cell type:code id: tags:
``` python
from scripts import assert_predictions_2020
assert_predictions_2020(preds_test)
```
%% Cell type:code id: tags:
``` python
preds_test.to_netcdf('../submissions/ML_prediction_2020.nc')
```
%% Cell type:code id: tags:
``` python
#!git add ../submissions/ML_prediction_2020.nc
```
%% Cell type:code id: tags:
``` python
#!git commit -m "commit submission for my_method_name" # whatever message you want
```
%% Cell type:code id: tags:
``` python
#!git tag "submission-my_method_name-0.0.1" # if this is to be checked by scorer, only the last submitted==tagged version will be considered
```
%% Cell type:code id: tags:
``` python
#!git push --tags
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
# Reproducibility
%% Cell type:markdown id: tags:
## memory
%% Cell type:code id: tags:
``` python
# https://phoenixnap.com/kb/linux-commands-check-memory-usage
!free -g
```
%% Cell type:markdown id: tags:
## CPU
%% Cell type:code id: tags:
``` python
!lscpu
```
%% Cell type:markdown id: tags:
## software
%% Cell type:code id: tags:
``` python
!conda list
```
%% Cell type:code id: tags:
``` python
```
......
This diff is collapsed.
%% Cell type:markdown id: tags:
# Score biweekly ML forecasts against re-calibrated ECWMF forecast with RPSS
%% Cell type:markdown id: tags:
Goal:
- Score biweekly 2020 ML-based forecasts against climatology with RPSS
- identical to scorer bot script: https://renkulab.io/gitlab/tasko.olevski/s2s-ai-competition-scoring-image/-/blob/master/scoring/scoring_script.py
- (for reference) RPSS of recalibrated real-time 2020 ECMWF forecasts
Requirements:
- [`xskillscore`](https://github.com/xarray-contrib/xskillscore)
- [renku datasets](https://renku-python.readthedocs.io/en/latest/commands.html#module-renku.cli.dataset) / file
- observations
- probabilistic:
- renku dataset: `forecast-like-observations_2020_biweekly_terciled.nc`
- ML forecasts
- probabilistic:
- file: `../submissions/ML_prediction_2020.nc`
- benchmark to check whether ML better than recalibrated ECMWF:
- probabilistic:
- renku dataset: `ecmwf_recalibrated_benchmark_2020_biweekly_terciled.nc`
Output:
- RPSS score spatially averaged from [90N-60S] and averaged over both `lead_time`s and variables
%% Cell type:code id: tags:
``` python
import xarray as xr
import xskillscore as xs
import numpy as np
xr.set_options(keep_attrs=True)
xr.set_options(display_style='text')
cache_path = "../data"
plot = False # take much more space in git, consider not commiting figures regularly because git size will increase
```
%%%% Output: stream
/opt/conda/lib/python3.8/site-packages/xarray/backends/cfgrib_.py:27: UserWarning: Failed to load cfgrib - most likely there is a problem accessing the ecCodes library. Try `import cfgrib` to get the full error message
warnings.warn(
%% Cell type:markdown id: tags:
# Get categorized
%% Cell type:markdown id: tags:
## 2020 observations
%% Cell type:code id: tags:
``` python
# to use retrieve from git lfs
!renku storage pull ../data/forecast-like-observations_2020_biweekly_terciled.nc
```
%%%% Output: stream
Warning: Run CLI commands only from project's root directory.

%% Cell type:code id: tags:
``` python
obs_p = xr.open_dataset(f'{cache_path}/forecast-like-observations_2020_biweekly_terciled.nc')
obs_p.sizes
```
%%%% Output: stream
/opt/conda/lib/python3.8/site-packages/xarray/backends/plugins.py:61: RuntimeWarning: Engine 'cfgrib' loading failed:
/opt/conda/lib/python3.8/site-packages/gribapi/_bindings.cpython-38-x86_64-linux-gnu.so: undefined symbol: codes_bufr_key_is_header
warnings.warn(f"Engine {name!r} loading failed:\n{ex}", RuntimeWarning)
%%%% Output: execute_result
Frozen(SortedKeysDict({'category': 3, 'lead_time': 2, 'forecast_time': 53, 'latitude': 121, 'longitude': 240}))
%% Cell type:markdown id: tags:
## 2020 ML-based forecasts
%% Cell type:code id: tags:
``` python
!renku storage pull ../submissions/ML_prediction_2020.nc
```
%%%% Output: stream
Warning: Run CLI commands only from project's root directory.

%% Cell type:code id: tags:
``` python
# submission for your ML model
fct_p = xr.open_dataset(f"../submissions/ML_prediction_2020.nc")
fct_p.sizes
```
%%%% Output: execute_result
Frozen(SortedKeysDict({'forecast_time': 53, 'latitude': 121, 'lead_time': 2, 'longitude': 240, 'category': 3}))
%% Cell type:markdown id: tags:
## climatology
%% Cell type:code id: tags:
``` python
clim_p = xr.DataArray([1/3, 1/3, 1/3], dims='category', coords={'category':['below normal', 'near normal', 'above normal']}).to_dataset(name='tp')
clim_p['t2m'] = clim_p['tp']
```
%% Cell type:markdown id: tags:
### checks
%% Cell type:code id: tags:
``` python
from scripts import assert_predictions_2020
```
%% Cell type:code id: tags:
``` python
assert_predictions_2020(fct_p)
assert_predictions_2020(obs_p)
```
%% Cell type:markdown id: tags:
# RPS
%% Cell type:markdown id: tags:
## RPS ML model
%% Cell type:code id: tags:
``` python
rps_ML = xs.rps(obs_p, fct_p, category_edges=None, dim='forecast_time', input_distributions='p').compute()
```
%% Cell type:code id: tags:
``` python
if plot:
for v in rps_ML.data_vars:
rps_ML[v].plot(robust=True, col='lead_time')
```
%% Cell type:markdown id: tags:
## RPS climatology
%% Cell type:code id: tags:
``` python
rps_clim = xs.rps(obs_p, clim_p, category_edges=None, dim='forecast_time', input_distributions='p').compute()
```
%% Cell type:code id: tags:
``` python
if plot:
for v in rps_clim.data_vars:
rps_clim[v].plot(robust=True, col='lead_time')
```
%% Cell type:markdown id: tags:
## RPSS
%% Cell type:code id: tags:
``` python
rpss = (1 - rps_ML / rps_clim)
```
%% Cell type:code id: tags:
``` python
if plot:
for v in rpss.data_vars:
rpss[v].plot(robust=True, col='lead_time')
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
# Scoring for RPSS leaderboard
%% Cell type:code id: tags:
``` python
# check for -inf grid cells
if (rpss==-np.inf).to_array().any():
(rpss == rpss.min()).sum()
print(f'find N={(rpss == rpss.min()).sum()} -inf grid cells')
# dirty fix
rpss = rpss.clip(-1, 1)
```
%% Cell type:code id: tags:
``` python
# what to do with requested grid cells where NaN is submitted? also penalize, todo: https://renkulab.io/gitlab/aaron.spring/s2s-ai-challenge-template/-/issues/7
```