Commit 38c9843c authored by Cheikh Modou Noreyni Fall's avatar Cheikh Modou Noreyni Fall Committed by renku 0.10.4

renku migrate

parent 77be7795
'@context':
'@version': 1.1
_id: '@id'
created: schema:dateCreated
creator:
'@context':
'@version': 1.1
_id: '@id'
affiliation: schema:affiliation
alternate_name: schema:alternateName
email: schema:email
label: rdfs:label
name: schema:name
prov: http://www.w3.org/ns/prov#
rdfs: http://www.w3.org/2000/01/rdf-schema#
schema: http://schema.org/
'@id': schema:creator
name: schema:name
prov: http://www.w3.org/ns/prov#
schema: http://schema.org/
updated: schema:dateUpdated
version: schema:schemaVersion
'@id': https://renkulab.io/projects/aaron.spring/s2s-ai-challenge-template
'@type':
- http://schema.org/Project
- http://www.w3.org/ns/prov#Location
- prov:Location
- schema:Project
_id: https://renkulab.io/projects/cheikhnoreyni.fall/s2s-ai-challenge-template-noreynidioum
created: '2021-04-28T12:48:48.162011+00:00'
creator:
'@type':
- prov:Person
- schema:Person
_id: mailto:aaron.spring@mpimet.mpg.de
affiliation: null
alternate_name: null
email: aaron.spring@mpimet.mpg.de
label: Aaron Spring
name: Aaron Spring
http://schema.org/agent: 0.16.0
http://schema.org/creator:
'@id': mailto:aaron.spring@mpimet.mpg.de
https://swissdatasciencecenter.github.io/renku-ontology#templateMetadata: '{"__t
https://swissdatasciencecenter.github.io/renku-ontology#templateReference: 0.1.19
https://swissdatasciencecenter.github.io/renku-ontology#templateSource: https://github.com/SwissDataScienceCenter/renku-project-template
https://swissdatasciencecenter.github.io/renku-ontology#templateVersion: 817b03c0f434a79eb54473688c1f897a66228b3d
name: s2s-ai-challenge-template
updated: '2021-10-11T19:42:44.168852+00:00'
version: '4'
%% Cell type:markdown id: tags:
# Train ML model to correct predictions of week 3-4 & 5-6
This notebook creates a Machine Learning `ML_model` to predict weeks 3-4 & 5-6 based on `S2S` weeks 3-4 & 5-6 forecasts; the predictions are compared to `CPC` observations for the [`s2s-ai-challenge`](https://s2s-ai-challenge.github.io/).
%% Cell type:markdown id: tags:
# Synopsis
%% Cell type:markdown id: tags:
## Method: `mean bias reduction`
- calculate the mean bias of the 2000-2019 deterministic ensemble-mean hindcast against observations
- remove that mean bias from the 2020 deterministic ensemble-mean forecast
- no Machine Learning used here
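The two steps above can be sketched on toy data (a minimal illustration with made-up dimensions; the real datasets carry `forecast_time`, `lead_time`, `latitude`, and `longitude` instead of the `year`/`week` stand-ins here):

``` python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
# toy hindcast with a constant +1 bias relative to the observations
hind = xr.DataArray(rng.normal(1.0, 0.5, (20, 4)), dims=("year", "week"))
obs = xr.DataArray(rng.normal(0.0, 0.5, (20, 4)), dims=("year", "week"))

# step 1: mean bias over the 2000-2019 hindcast period, per week of year
bias = (hind - obs).mean("year")

# step 2: subtract that bias from a new (2020-like) forecast
fct_2020 = xr.DataArray(rng.normal(1.0, 0.5, 4), dims="week")
fct_debiased = fct_2020 - bias
```

By construction the bias-corrected hindcast has zero mean error over the training period; the hope is that the same correction also reduces the error of the 2020 forecast.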
%% Cell type:markdown id: tags:
## Data used
type: renku datasets
Training-input for Machine Learning model:
- hindcasts of models:
- ECMWF: `ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr`
Forecast-input for Machine Learning model:
- real-time 2020 forecasts of models:
- ECMWF: `ecmwf_forecast-input_2020_biweekly_deterministic.zarr`
Compare the Machine Learning model forecast against ground truth:
- `CPC` observations:
- `hindcast-like-observations_biweekly_deterministic.zarr`
- `forecast-like-observations_2020_biweekly_deterministic.zarr`
%% Cell type:markdown id: tags:
## Resources used
for training; details in the Reproducibility section below
- platform: MPI-M supercomputer, 1 node
- memory: 64 GB
- processors: 36 CPU
- storage required: 10 GB
%% Cell type:markdown id: tags:
## Safeguards
All points have to be [x] checked. If not, your submission is invalid.
Changes to the code after submissions are not possible, as the `commit` before the `tag` will be reviewed.
(Only in exceptional cases, and if prior reproducibility effort is evident, may improvements to readability and reproducibility be allowed after November 1st 2021.)
%% Cell type:markdown id: tags:
### Safeguards to prevent [overfitting](https://en.wikipedia.org/wiki/Overfitting?wprov=sfti1)
If the organizers suspect overfitting, your contribution can be disqualified.
- [x] We didn't use 2020 observations in training (explicit overfitting and cheating).
- [x] We didn't repeatedly verify our model on 2020 observations and incrementally improve our RPSS (implicit overfitting).
- [x] We provide RPSS scores for the training period with script `skill_by_year`, see in section 6.3 `predict`.
- [x] We tried our best to prevent [data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)?wprov=sfti1).
- [x] We honor the `train-validate-test` [split principle](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets). This means that the hindcast data is split into `train` and `validate`, whereas `test` is withheld.
- [x] We did not use `test` explicitly in training or implicitly in incrementally adjusting parameters.
- [x] We considered [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)).
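The `train-validate-test` split principle from the checklist above can be sketched on the available years (plain Python; the particular year boundaries here are hypothetical, not prescribed by the challenge):

``` python
# Hindcast years 2000-2019 are available for model development;
# 2020 is the withheld test period, scored only by the organizers.
years = list(range(2000, 2020))

# Hypothetical split: hold out the last 4 hindcast years for validation.
train_years = years[:-4]      # 2000-2015
validate_years = years[-4:]   # 2016-2019
test_years = [2020]           # never touched while developing the model

# the three sets must not overlap
assert set(train_years).isdisjoint(validate_years)
assert set(test_years).isdisjoint(train_years + validate_years)
```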
%% Cell type:markdown id: tags:
### Safeguards for Reproducibility
Notebook/code must be independently reproducible from scratch by the organizers (after the competition); if this is not possible, no prize can be awarded.
- [x] All training data is publicly available (no pre-trained private neural networks, as they are not reproducible for us)
- [x] Code is well documented, readable and reproducible.
- [x] Code to reproduce training and predictions is preferred to run within a day on the described architecture. If the training takes longer than a day, please justify why this is needed. Please do not submit training pipelines which take weeks to train.
%% Cell type:markdown id: tags:
# Imports
%% Cell type:code id: tags:
``` python
import xarray as xr
xr.set_options(display_style='text')
```
%%%% Output: execute_result
<xarray.core.options.set_options at 0x2b858800a050>
%% Cell type:code id: tags:
``` python
!renku githooks install
```
%%%% Output: stream
/bin/sh: 1: renku: not found
%% Cell type:markdown id: tags:
# Get training data
Preprocessing of the input data may be done in a separate notebook/script.
%% Cell type:markdown id: tags:
## Hindcast
get weekly initialized hindcasts
%% Cell type:code id: tags:
``` python
# preprocessed as renku dataset
!renku storage pull ../data/ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr
```
%%%% Output: stream
/bin/sh: 1: renku: not found
%% Cell type:code id: tags:
``` python
hind_2000_2019 = xr.open_zarr("../data/ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr", consolidated=True)
```
%%%% Output: error
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-7-c0d1e31a7f4a> in <module>
----> 1 hind_2000_2019 = xr.open_zarr("../data/ecmwf_hindcast-input_2000-2019_biweekly_deterministic.zarr", consolidated=True)
/opt/conda/lib/python3.7/site-packages/xarray/backends/zarr.py in open_zarr(store, group, synchronizer, chunks, decode_cf, mask_and_scale, decode_times, concat_characters, decode_coords, drop_variables, consolidated, overwrite_encoded_chunks, chunk_store, storage_options, decode_timedelta, use_cftime, **kwargs)
781 backend_kwargs=backend_kwargs,
782 decode_timedelta=decode_timedelta,
--> 783 use_cftime=use_cftime,
784 )
785 return ds
/opt/conda/lib/python3.7/site-packages/xarray/backends/api.py in open_dataset(filename_or_obj, engine, chunks, cache, decode_cf, mask_and_scale, decode_times, decode_timedelta, use_cftime, concat_characters, decode_coords, drop_variables, backend_kwargs, *args, **kwargs)
481 engine = plugins.guess_engine(filename_or_obj)
482
--> 483 backend = plugins.get_backend(engine)
484
485 decoders = _resolve_decoders_kwargs(
/opt/conda/lib/python3.7/site-packages/xarray/backends/plugins.py in get_backend(engine)
155 if engine not in engines:
156 raise ValueError(
--> 157 f"unrecognized engine {engine} must be one of: {list(engines)}"
158 )
159 backend = engines[engine]
ValueError: unrecognized engine zarr must be one of: ['store']
%% Cell type:code id: tags:
``` python
# preprocessed as renku dataset
!renku storage pull ../data/ecmwf_forecast-input_2020_biweekly_deterministic.zarr
```
%% Cell type:code id: tags:
``` python
fct_2020 = xr.open_zarr("../data/ecmwf_forecast-input_2020_biweekly_deterministic.zarr", consolidated=True)
```
%% Cell type:markdown id: tags:
## Observations
corresponding to hindcasts
%% Cell type:code id: tags:
``` python
# preprocessed as renku dataset
!renku storage pull ../data/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr
```
%% Cell type:code id: tags:
``` python
obs_2000_2019 = xr.open_zarr("../data/hindcast-like-observations_2000-2019_biweekly_deterministic.zarr", consolidated=True)
```
%% Cell type:code id: tags:
``` python
# preprocessed as renku dataset
!renku storage pull ../data/forecast-like-observations_2020_biweekly_deterministic.zarr
```
%% Cell type:code id: tags:
``` python
obs_2020 = xr.open_zarr("../data/forecast-like-observations_2020_biweekly_deterministic.zarr", consolidated=True)
```
%% Cell type:markdown id: tags:
# no ML model
%% Cell type:markdown id: tags:
Here, we just remove the mean bias from the ensemble mean forecast.
%% Cell type:code id: tags:
``` python
from scripts import add_year_week_coords
obs_2000_2019 = add_year_week_coords(obs_2000_2019)
hind_2000_2019 = add_year_week_coords(hind_2000_2019)
```
%%%% Output: stream
WARNING: ecmwflibs universal: cannot find a library called MagPlus
Magics library could not be found
%% Cell type:code id: tags:
``` python
bias_2000_2019 = (hind_2000_2019.mean('realization') - obs_2000_2019).groupby('week').mean().compute()
```
%%%% Output: stream
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/dask/array/numpy_compat.py:40: RuntimeWarning: invalid value encountered in true_divide
x = np.divide(x1, x2, out)
%% Cell type:markdown id: tags:
## `predict`
Create predictions and print `mean(variable, lead_time, longitude, weighted latitude)` RPSS for all years as calculated by `skill_by_year`.
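The "weighted latitude" in the mean above refers to area weighting by the cosine of latitude, since grid cells shrink toward the poles. A minimal sketch of such a weighted mean with xarray (illustrative only, not the challenge's `skill_by_year` implementation; the coarse 30° grid is made up):

``` python
import numpy as np
import xarray as xr

lat = xr.DataArray(np.arange(-60.0, 61.0, 30.0), dims="latitude")
field = np.abs(lat)  # toy field that grows toward the poles

# weight each latitude band by cos(latitude) so polar rows count less
weights = np.cos(np.deg2rad(lat))
weighted_mean = field.weighted(weights).mean("latitude")
unweighted_mean = field.mean("latitude")
# weighted mean 30.0 vs unweighted 36.0: the polar values are down-weighted
```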
%% Cell type:code id: tags:
``` python
from scripts import make_probabilistic
```
%% Cell type:code id: tags:
``` python
!renku storage pull ../data/hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc
```
%% Cell type:code id: tags:
``` python
tercile_file = '../data/hindcast-like-observations_2000-2019_biweekly_tercile-edges.nc'
tercile_edges = xr.open_dataset(tercile_file)
```
%% Cell type:code id: tags:
``` python
def create_predictions(fct, bias):
    """Remove the weekly mean bias from a forecast and convert to tercile probabilities."""
    if 'week' not in fct.coords:
        fct = add_year_week_coords(fct)
    # subtract the week-of-year mean bias, then map onto tercile categories
    preds = fct - bias.sel(week=fct.week)
    preds = make_probabilistic(preds, tercile_edges)
    return preds.astype('float32')
```
%% Cell type:markdown id: tags:
### `predict` training period in-sample
%% Cell type:code id: tags:
``` python
!renku storage pull ../data/forecast-like-observations_2020_biweekly_terciled.nc
```
%% Cell type:code id: tags:
``` python
!renku storage pull ../data/hindcast-like-observations_2000-2019_biweekly_terciled.zarr
```
%% Cell type:code id: tags:
``` python
preds_is = create_predictions(hind_2000_2019, bias_2000_2019).compute()
```
%% Cell type:code id: tags:
``` python
from scripts import skill_by_year
```
%% Cell type:code id: tags:
``` python
skill_by_year(preds_is)
```
%%%% Output: execute_result
RPSS
year
2000 -0.147543
2001 -0.190513
2002 -0.186966
2003 -0.197091
2004 -0.194299
2005 -0.163684
2006 -0.180648
2007 -0.182294
2008 -0.173836
2009 -0.208382
2010 -0.150463
2011 -0.177190
2012 -0.176229
2013 -0.186028
2014 -0.192715
2015 -0.165891
2016 -0.159139
2017 -0.193667
2018 -0.202936
2019 -0.177937
%% Cell type:markdown id: tags:
### `predict` test
%% Cell type:code id: tags:
``` python
preds_test = create_predictions(fct_2020, bias_2000_2019)
```
%% Cell type:code id: tags:
``` python
skill_by_year(preds_test)
```
%%%% Output: stream
/work/mh0727/m300524/conda-envs/s2s-ai/lib/python3.7/site-packages/dask/array/numpy_compat.py:40: RuntimeWarning: invalid value encountered in true_divide
x = np.divide(x1, x2, out)
%%%% Output: execute_result
RPSS
year
2020 -0.096323
%% Cell type:markdown id: tags:
# Submission
%% Cell type:code id: tags:
``` python
from scripts import assert_predictions_2020
assert_predictions_2020(preds_test)
```
%% Cell type:code id: tags:
``` python
preds_test.attrs = {'author': 'Aaron Spring', 'author_email': 'aaron.spring@mpimet.mpg.de',
'comment': 'created for the s2s-ai-challenge as a template for the website',
'notebook': 'mean_bias_reduction.ipynb',
'website': 'https://s2s-ai-challenge.github.io/#evaluation'}
html_repr = xr.core.formatting_html.dataset_repr(preds_test)
with open('submission_template_repr.html', 'w') as myFile:
myFile.write(html_repr)
```
%% Cell type:code id: tags:
``` python
preds_test.to_netcdf('../submissions/ML_prediction_2020.nc')
```
%% Cell type:code id: tags:
``` python
# !git add ../submissions/ML_prediction_2020.nc
# !git add mean_bias_reduction.ipynb
```
%% Cell type:code id: tags:
``` python
#!git commit -m "template_test no ML mean bias reduction" # whatever message you want
```
%% Cell type:code id: tags:
``` python
#!git tag "submission-no_ML_mean_bias_reduction-0.0.2" # if this is to be checked by scorer, only the last submitted==tagged version will be considered
```
%% Cell type:code id: tags:
``` python
#!git push --tags
```
%% Cell type:markdown id: tags:
# Reproducibility
%% Cell type:markdown id: tags:
## memory
%% Cell type:code id: tags:
``` python
# https://phoenixnap.com/kb/linux-commands-check-memory-usage
!free -g
```
%%%% Output: stream
total used free shared buffers cached
Mem: 62 18 43 0 0 5
-/+ buffers/cache: 12 49
Swap: 0 0 0
%% Cell type:markdown id: tags:
## CPU
%% Cell type:code id: tags:
``` python
!lscpu
```
%%%% Output: stream
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 72
On-line CPU(s) list: 0-71
Thread(s) per core: 2
Core(s) per socket: 18
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz
Stepping: 1
CPU MHz: 2101.000
BogoMIPS: 4190.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 46080K
NUMA node0 CPU(s): 0-17,36-53
NUMA node1 CPU(s): 18-35,54-71
%% Cell type:markdown id: tags:
## software
%% Cell type:code id: tags:
``` python
!conda list
```
%%%% Output: stream
# packages in environment at /work/mh0727/m300524/conda-envs/s2s-ai:
#
# Name Version Build Channel
_libgcc_mutex 0.1 main
_tflow_select 2.3.0 mkl
absl-py 0.12.0 py37h06a4308_0
aiobotocore 1.2.2 pyhd3eb1b0_0
aiohttp 3.7.4 py37h27cfd23_1
aioitertools 0.7.1 pyhd3eb1b0_0
anyio 2.2.0 pypi_0 pypi
appdirs 1.4.4 py_0
argcomplete 1.12.2 pypi_0 pypi
argon2-cffi 20.1.0 py37h27cfd23_1
asciitree 0.3.3 py_2
astunparse 1.6.3 py_0
async-timeout 3.0.1 py37h06a4308_0
async_generator 1.10 py37h28b3542_0
attrs 20.2.0 pypi_0 pypi
babel 2.9.0 pypi_0 pypi
backcall 0.2.0 pyhd3eb1b0_0
backrefs 5.0.1 pypi_0 pypi
bagit 1.8.1 pypi_0 pypi
beautifulsoup4 4.9.3 pyha847dfd_0
black 20.8b1 pypi_0 pypi
blas 1.0 mkl
bleach 3.3.0 pyhd3eb1b0_0
blinker 1.4 py37h06a4308_0
bokeh 2.3.0 py37h06a4308_0
botocore 1.19.52 pyhd3eb1b0_0
bottleneck 1.3.2 py37heb32a55_1
bracex 2.1.1 pypi_0 pypi
branca 0.3.1 pypi_0 pypi
brotlipy 0.7.0 py37h27cfd23_1003
bzip2 1.0.8 h7b6447c_0
c-ares 1.17.1 h27cfd23_0
ca-certificates 2021.5.30 ha878542_0 conda-forge
cachecontrol 0.11.7 pypi_0 pypi
cachetools 4.2.1 pyhd3eb1b0_0
calamus 0.3.7 pypi_0 pypi
cdsapi 0.5.1 pypi_0 pypi
certifi 2021.5.30 py37h89c1867_0 conda-forge
cffi 1.14.5 py37h261ae71_0
cfgrib 0.9.8.5 pyhd8ed1ab_0 conda-forge
cftime 1.5.0 pypi_0 pypi
chardet 3.0.4 py37h06a4308_1003
click 7.1.2 pyhd3eb1b0_0
click-completion 0.5.2 pypi_0 pypi
click-plugins 1.1.1 pypi_0 pypi
climetlab 0.8.6 pypi_0 pypi
climetlab-s2s-ai-challenge 0.8.0 pypi_0 pypi
climpred 2.1.4 pypi_0 pypi
cloudpickle 1.6.0 py_0
colorama 0.4.4 pypi_0 pypi
coloredlogs 15.0 pypi_0 pypi
commonmark 0.9.1 pypi_0 pypi
configargparse 1.4 pypi_0 pypi
coverage 5.5 py37h27cfd23_2
cryptography 3.4.6 py37hd23ed53_0
curl 7.71.1 hbc83047_1
cwlgen 0.4.2 pypi_0 pypi
cwltool 3.0.20210319143721 pypi_0 pypi
cycler 0.10.0 py37_0
cython 0.29.22 py37h2531618_0
cytoolz 0.11.0 py37h7b6447c_0
dask 2021.3.0 pypi_0 pypi
dask-core 2021.6.2 pyhd8ed1ab_0 conda-forge
dask-labextension 5.0.1 pypi_0 pypi
dbus 1.13.18 hb2f20db_0
decorator 4.4.2 pyhd3eb1b0_0
defusedxml 0.7.1 pyhd3eb1b0_0
distributed 2021.6.2 py37h89c1867_0 conda-forge
docopt 0.6.2 py37h06a4308_0
docutils 0.15.2 py37h89c1867_2 conda-forge
eccodes 1.2.0 pypi_0 pypi
ecmwf-api-client 1.6.1 pypi_0 pypi
ecmwflibs 0.2.3 pypi_0 pypi
entrypoints 0.3 py37_0
environ-config 20.1.0 pypi_0 pypi
expat 2.2.10 he6710b0_2
fasteners 0.16 pyhd3eb1b0_0
fastprogress 1.0.0 py_0 conda-forge
filelock 3.0.12 pypi_0 pypi
folium 0.12.1 pypi_0 pypi
fontconfig 2.13.1 h6c09931_0
freetype 2.10.4 h5ab3b9f_0
frozendict 1.2 pypi_0 pypi
fsspec 0.8.7 pyhd3eb1b0_0
gast 0.4.0 py_0
gitdb 4.0.6 pypi_0 pypi
gitpython 3.1.12 pypi_0 pypi
glib 2.67.4 h36276a3_1
google-auth 1.28.0 pyhd3eb1b0_0
google-auth-oauthlib 0.4.3 pyhd3eb1b0_0
google-pasta 0.2.0 py_0
grpcio 1.36.1 py37h2157cd5_1
gst-plugins-base 1.14.0 h8213a91_2
gstreamer 1.14.0 h28cd5cc_2
h5netcdf 0.10.0 pyhd8ed1ab_0 conda-forge
h5py 2.10.0 nompi_py37h1e651dc_105 conda-forge