# Train ML model for predictions of week 3-4 & 5-6
This notebook create a Machine Learning `ML_model` to predict weeks 3-4 & 5-6 based on `S2S` weeks 3-4 & 5-6 forecasts and is compared to `CPC` observations for the [`s2s-ai-challenge`](https://s2s-ai-challenge.github.io/).
%% Cell type:markdown id: tags:
# Synopsis
%% Cell type:markdown id: tags:
## Method: `name`
- decription
- a few details
%% Cell type:markdown id: tags:
## Data used
Training-input for Machine Learning model:
- renku datasets, climetlab, IRIDL
Forecast-input for Machine Learning model:
- renku datasets, climetlab, IRIDL
Compare Machine Learning model forecast against ground truth:
- renku datasets, climetlab, IRIDL
%% Cell type:markdown id: tags:
## Resources used
for training, details in reproducibility
- platform: renku
- memory: 8 GB
- processors: 2 CPU
- storage required: 10 GB
%% Cell type:markdown id: tags:
## Safeguards
All points have to be [x] checked. If not, your submission is invalid.
Changes to the code after submissions are not possible, as the `commit` before the `tag` will be reviewed.
(Only in exceptions and if previous effort in reproducibility can be found, it may be allowed to improve readability and reproducibility after November 1st 2021.)
%% Cell type:markdown id: tags:
### Safeguards to prevent [overfitting](https://en.wikipedia.org/wiki/Overfitting?wprov=sfti1)
If the organizers suspect overfitting, your contribution can be disqualified.
- [ ] We didnt use 2020 observations in training (explicit overfitting and cheating)
- [ ] We didnt repeatedly verify my model on 2020 observations and incrementally improved my RPSS (implicit overfitting)
- [ ] We provide RPSS scores for the training period with script `skill_by_year`, see in section 6.3 `predict`.
-[ ] We tried our best to prevent [data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)?wprov=sfti1).
-[ ] We honor the `train-validate-test` [split principle](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets). This means that the hindcast data is split into `train` and `validate`, whereas `test` is withheld.
- [ ] We did use `test` explicitly in training or implicitly in incrementally adjusting parameters.
-[ ] We considered [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)).
%% Cell type:markdown id: tags:
### Safeguards for Reproducibility
Notebook/code must be independently reproducible from scratch by the organizers (after the competition), if not possible: no prize
- [ ] All training data is publicly available (no pre-trained private neural networks, as they are not reproducible for us)
- [ ] Code is well documented, readable and reproducible.
- [ ] Code to reproduce training and predictions should run within a day on the described architecture. If the training takes longer than a day, please justify why this is needed. Please do not submit training piplelines, which take weeks to train.
- [ ] Code to reproduce training and predictions is preferred to run within a day on the described architecture. If the training takes longer than a day, please justify why this is needed. Please do not submit training piplelines, which take weeks to train.
%% Cell type:markdown id: tags:
# Todos to improve template
This is just a demo.
- [ ] for both variables
- [ ] for both `lead_time`s
- [ ] ensure probabilistic prediction outcome with `category` dim
Create predictions and print `mean(variable, lead_time, longitude, weighted latitude)` RPSS for all years as calculated by `skill_by_year`. For now RPS, todo: change to RPSS.
- RPSS score spatially averaged from [90N-60S] and averaged over both `lead_time`s and variables
%% Cell type:code id: tags:
``` python
importxarrayasxr
importxskillscoreasxs
importnumpyasnp
xr.set_options(keep_attrs=True)
xr.set_options(display_style='text')
cache_path="../data"
plot=False# take much more space in git, consider not commiting figures regularly because git size will increase
```
%%%% Output: stream
/opt/conda/lib/python3.8/site-packages/xarray/backends/cfgrib_.py:27: UserWarning: Failed to load cfgrib - most likely there is a problem accessing the ecCodes library. Try `import cfgrib` to get the full error message
# what to do with requested grid cells where NaN is submitted? also penalize, todo: https://renkulab.io/gitlab/aaron.spring/s2s-ai-challenge-template/-/issues/7