# Introduction
The exercise for the SDSC Renku Tutorial is derived from the [IEEE Investment Ranking Challenge](https://www.crowdai.org/challenges/ieee-investment-ranking-challenge). The initial part of this document explains the contents of this zip file and the project setup. The section that follows describes the CrowdAI challenge.
# Archive Contents
This zip file contains data and two initial notebooks, which make up the starting point of the challenge.
The ```data``` folder contains one file:
* full_dataset.csv : The dataset described below
The ```templates``` folder contains one file:
* prediction_template.csv : The prediction template.
The ```notebooks``` folder contains two files:
* 01_features.ipynb : A notebook that does feature engineering
* 02_model.ipynb : A notebook that builds a simple random-forest model from the features.
The notebooks together constitute a workflow. The notebook [01_features.ipynb](01_features.ipynb) reads in the data and extracts the features used in the model notebook, [02_model.ipynb](02_model.ipynb).
# Installation and Set Up
## Step 0: Prerequisites
You need a Python environment with renku, papermill, and nbdime installed.
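
One quick way to confirm the prerequisites is to check that each tool's command is on your `PATH`. This is a minimal sketch, not part of the original tutorial: it assumes each package installs a command-line entry point named after the tool, and that the packages are installable from PyPI under those same names.

```bash
# Report which of the required command-line tools are available on PATH.
# Assumption: missing tools can be installed with
#   pip install renku papermill nbdime
missing=0
for tool in renku papermill nbdime; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
        missing=$((missing + 1))
    fi
done
echo "$missing tool(s) missing"
```

If anything is reported missing, install it into the active environment before continuing with Step 1.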
## Step 1 : Create a Project
**This step should be done by just one team member.** Once completed, the other team members can clone the repository (Step 2).
### 1.1
Go to https://internal.renku.ch/gitlab/profile/keys, log in or register a new account and add an SSH key.
Create a group for your team and add the team members to your group.
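
For the SSH key step above, the following sketch prints an existing public key so it can be pasted into the keys page, or suggests a command to generate one. The key type and file location (`~/.ssh/id_ed25519.pub`) and the e-mail address are assumptions for illustration; use whatever key you already have.

```bash
# Print the public key so it can be pasted into the GitLab SSH keys page.
# Assumption: the key lives at the default ed25519 location.
pubkey="$HOME/.ssh/id_ed25519.pub"
if [ -f "$pubkey" ]; then
    cat "$pubkey"
else
    echo "No key found at $pubkey. You could generate one with:"
    echo "  ssh-keygen -t ed25519 -C \"your.email@example.com\""
fi
```

Only the public half (the `.pub` file) is uploaded; the private key stays on your machine.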
### 1.2
Go to https://internal.renku.ch/ and create a project.
To do so first make sure you are logged in, then go to projects and click on 'New Project'.
Fill in the Title and Description.
To make it easy for team members to collaborate, move the project to your team's group.
Move to group: Gitlab Project > Settings > General > Advanced settings > Transfer project
### 1.3
Refresh the browser with the Renku UI.
Clone the project repo to your machine. The repository URL is on the Renku > Project > Settings page:
go to the settings tab, copy the SSH repository URL, and use it to clone the repository.
This should look like:
```bash
$ git clone git@internal.renku.ch:jane.doe/my-project.git
```
Create a README.md describing the project and add the readme to git.
### 1.4
In the repo folder, do:
1. Copy the provided requirements.txt into the repo folder, replacing the empty one.
2. Install pipenv with `pip install pipenv`
3. Run `pipenv install -r requirements.txt [--skip-lock]`
4. Run `pipenv shell` to activate the virtualenv.
5. Commit the changes by running: `git add requirements.txt Pipfile` and then `git commit -m "Added requirements and Pipfile"`
### 1.5
Create a dataset
```bash
$ renku dataset create invest
```
Then add the files to the dataset:
```bash
# renku dataset add invest [path to data_sample.csv]
$ renku dataset add invest <path-to-full-dataset-csv>
```
### 1.6
Add the template. Create a folder called templates and put the prediction_template file there, then add it to git.
```bash
$ mkdir -p templates
$ cp [path to prediction_template.csv] templates/
$ git add templates/
$ git commit -m "Added output template."
```
### 1.7
Add the notebooks. Put them in the notebooks folder of the project and add to git.
```bash
mkdir -p notebooks
cp [path to 01_features.ipynb] notebooks
cp [path to 02_model.ipynb] notebooks
git add notebooks/
git commit -m "Added initial notebooks."
```
### 1.8
Run the notebooks. First the features notebook, then the model notebook:
```bash
$ mkdir -p data/outputs
$ renku run papermill notebooks/01_features.ipynb notebooks/01_features_run.ipynb -p dataset_file_path data/invest/full_dataset.csv -p features_pickle_file_path data/outputs/features.pkl
$ renku run papermill notebooks/02_model.ipynb notebooks/02_model_run.ipynb -p features_pickle_file_path data/outputs/features.pkl -p pred_template_file_path templates/prediction_template.csv -p pred_output_file_path data/outputs/predictions.csv
```
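
If either command above fails with a missing-file error, a quick check of the input paths can help narrow it down. This is a small sketch to run from the repo folder; the paths are the ones used in the commands above, and the variable name `missing_inputs` is only illustrative.

```bash
# Check that every input read by the two papermill commands exists.
missing_inputs=0
for f in \
    data/invest/full_dataset.csv \
    templates/prediction_template.csv \
    notebooks/01_features.ipynb \
    notebooks/02_model.ipynb
do
    if [ -e "$f" ]; then
        echo "ok:      $f"
    else
        echo "missing: $f"
        missing_inputs=$((missing_inputs + 1))
    fi
done
echo "$missing_inputs input file(s) missing"
```

A missing dataset file usually points back to step 1.5, a missing template to step 1.6, and a missing notebook to step 1.7.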
### 1.9
Push the project to the server.
## Step 2: Collaboration
The following instructions are for all other team members.
### 2.1
Go to https://internal.renku.ch/gitlab/profile/keys, log in or register a new account and add an SSH key.
### 2.2
Go to the settings page of the team project in https://internal.renku.ch/, copy the SSH repository URL and use it to clone it on your machine.
This should look like:
```bash
$ git clone git@internal.renku.ch:jane.doe/my-project.git
```
### 2.3
In the repo folder, do:
1. Install pipenv with `pip install pipenv`
2. Run `pipenv install -r requirements.txt [--skip-lock]`
3. Run `pipenv shell` to activate the virtualenv.
## Step 3: Modify the notebooks and be the best data science team!
You can either work using the server environment or on your local environment.
To work on the server environment, launch the notebook from the Project > Files > Notebook tab.
To work locally, launch Jupyter and update the notebooks. For example, modify either the features or the model notebook. (It is enough to add a cell that prints something.) If you execute ```renku status```, you will see that your outputs are out of date. Running ```renku update``` will update everything.
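
The local edit cycle described above can be sketched as follows. This is a hedged example rather than a definitive recipe: it assumes the notebook changes have already been committed (Renku generally expects a clean working tree), and that `renku update` takes no arguments, as in the text above. Other Renku releases may need extra arguments, so check `renku update --help` if it complains. The `refreshed` variable is only illustrative.

```bash
# Show which outputs are stale, then regenerate them.
if command -v renku >/dev/null 2>&1; then
    renku status    # lists outputs that are out of date
    renku update    # re-runs the recorded steps to refresh them
    refreshed=yes
else
    echo "renku not found; activate the project environment first (pipenv shell)"
    refreshed=no
fi
```

After a successful update, running `renku status` again should no longer list the outputs as out of date.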
# CrowdAI Challenge Introduction
Using the provided data sets of financial predictors and semi-annual returns, participants are challenged to develop a model that will help identify the best-performing stocks in each time-period.
Research Question: **Which stocks will experience the highest and lowest returns during the next six months?**
Out of the thousands of stocks in the market, small groups will experience exceptionally high or low returns. Considering the distribution of stock returns, a portfolio manager must buy the stocks in the right tail of the distribution and avoid the stocks in the left tail. The performance of an entire equity portfolio is often driven by these key investment decisions. **The goal of this challenge is to explore methodology that will increase the probability that portfolio managers identify these stocks with extreme positive or negative returns.**
## Access Dataset
Teams are provided with predictors and semi-annual returns for a group of stocks from `1996` to `2017`. This span of **21 years** is represented as **42 non-overlapping 6-month periods**. In each of the `42 time periods`, roughly **900 stocks** with the largest market capitalization (i.e., total market value in USD) were selected. Therefore, the selected set of stocks at each time period changes as companies increase or decrease in value. **All stock identifiers have been removed** and **all numeric variables have been anonymized and normalized**. Training and test datasets were created by selecting a **random sample of stocks** at each time period. `60%` of stocks were sampled into the training set and the remaining `40%` created the test set. Finally, all data from the second half of 2017 was allocated to the test set. This 6-month period will provide a final out-of-sample test of a model’s performance.
## Problem Statement
Each team must create a model that ranks a set of stocks based on the expected return over a forward 6-month window. This model can be a risk factor-based strategy (multi-factor model), predictive model, or any other data-based heuristic. There are many ways to approach this task and creative, non-traditional solutions are strongly encouraged. The final model will be tested on each 6-month period from 2002 to 2017.
# Author
SP Mohanty <sharada.mohanty@epfl.ch>

Harlander, Benjamin <Harlander.Benjamin@principal.com>