# Introduction

The exercise for the SDSC Renku Tutorial is derived from the [IEEE Investment Ranking Challenge](https://www.crowdai.org/challenges/ieee-investment-ranking-challenge). The initial part of this document explains the contents of this zip file and the project setup. The section that follows describes the CrowdAI challenge.


# Archive Contents

This zip file contains the data and two initial notebooks that make up the starting point of the challenge.

The `data` folder contains one file:
* full_dataset.csv : The dataset described below

The `templates` folder contains one file:
* prediction_template.csv : The prediction template.

The `notebooks` folder contains two files:
* 01_features.ipynb : A notebook that does feature engineering
* 02_model.ipynb : A notebook that builds a simple random-forest model from the features.

Together, the notebooks constitute a workflow: [01_features.ipynb](01_features.ipynb) reads in the data and extracts the features used by the model notebook, [02_model.ipynb](02_model.ipynb).


# Installation and Set Up

## Step 0: Prerequisites 

You need a python environment with renku, papermill, and nbdime installed.
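One way to set this up (a sketch, assuming Python 3 with `pip` is already available and that you are installing the `renku`, `papermill`, and `nbdime` packages from PyPI):

```shell
# Install the required tools into the current Python environment:
pip install renku papermill nbdime

# Sanity-check that the command-line entry points are available:
renku --version
papermill --version
nbdime --version
```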

## Step 1 : Create a Project

**This step should be done by just one team member.** Once completed, the other team members can clone the repository (Step 2).

### 1.1

Go to https://internal.renku.ch/gitlab/profile/keys, log in or register a new account and add an SSH key.

Create a group for your team and add the team members to your group.

### 1.2

Go to https://internal.renku.ch/ and create a project.
To do so first make sure you are logged in, then go to projects and click on 'New Project'.
Fill in the Title and Description.

To make it easy for team members to collaborate, move the project to your team's group.

Move to group: GitLab Project > Settings > General > Advanced settings > Transfer project

### 1.3

Refresh the browser with the Renku UI.

Clone the project repository to your machine: go to the Renku > Project > Settings page, copy the SSH repository URL, and use it to clone the project. This should look like:
```bash
$ git clone git@internal.renku.ch:jane.doe/my-project.git
```

Create a README.md describing the project and add the readme to git.
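For example (a minimal sketch, run from inside the cloned repository; the project name and description text are placeholders):

```shell
# Create a short README describing the project:
cat > README.md <<'EOF'
# my-project
A Renku tutorial project based on the IEEE Investment Ranking Challenge.
EOF

# Track it in git:
git add README.md
git commit -m "Add project README"
```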

### 1.4

In the repo folder, do:
1. Copy the provided requirements.txt into the repo, replacing the empty one.
2. Install pipenv with `pip install pipenv`
3. Run `pipenv install -r requirements.txt [--skip-lock]`
4. Run `pipenv shell` to activate the virtualenv.
5. Commit the changes by running: `git add requirements.txt Pipfile` and then `git commit -m "Added requirements and Pipfile"`

### 1.5

Create a dataset

```bash
$ renku dataset create invest
```

Then add the files to the dataset:
```bash
$ renku dataset add invest <path-to-full-dataset-csv>
```

### 1.6

Add the template. Create a folder called `templates`, put the prediction_template.csv file there, and add it to git.

```bash
$ mkdir -p templates
$ cp [path to prediction_template.csv] templates/
$ git add templates/
$ git commit -m "Added output template."
```

### 1.7


Add the notebooks: put them in the `notebooks` folder of the project and add them to git.

```bash
$ mkdir -p notebooks
$ cp [path to 01_features.ipynb] notebooks/
$ cp [path to 02_model.ipynb] notebooks/
$ git add notebooks/
$ git commit -m "Added initial notebooks."
```

### 1.8

Run the notebooks. First the features notebook, then the model notebook:

```bash
$ mkdir -p data/outputs
$ renku run papermill notebooks/01_features.ipynb notebooks/01_features_run.ipynb -p dataset_file_path data/invest/full_dataset.csv -p features_pickle_file_path data/outputs/features.pkl
$ renku run papermill notebooks/02_model.ipynb notebooks/02_model_run.ipynb -p features_pickle_file_path data/outputs/features.pkl -p pred_template_file_path templates/prediction_template.csv -p pred_output_file_path data/outputs/predictions.csv
```

### 1.9

Push the project to the server.
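Assuming the remote from the clone in Step 1.3 is still named `origin` and the default branch is `master`, this is a single push:

```shell
# Publish all local commits (data, notebooks, and renku metadata) to the server:
git push origin master
```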

## Step 2: Collaboration

The following instructions are for all other team members.

### 2.1

Go to https://internal.renku.ch/gitlab/profile/keys, log in or register a new account and add an SSH key.

### 2.2

Go to the settings page of the team project in https://internal.renku.ch/, copy the SSH repository URL and use it to clone it on your machine.
This should look like:
```bash
$ git clone git@internal.renku.ch:jane.doe/my-project.git
```

### 2.3

In the repo folder, do:
1. Install pipenv with `pip install pipenv`
2. Run `pipenv install -r requirements.txt [--skip-lock]`
3. Run `pipenv shell` to activate the virtualenv.


## Step 3: Modify the notebooks and be the best data science team!

You can work either in the server environment or in your local environment.

To work on the server environment, launch the notebook from the Project > Files > Notebook tab.

To work locally, launch jupyter and update the notebooks. For example, modify either the features or the model notebook (it is enough to add a cell that prints something). If you execute `renku status`, you will see that your outputs are out of date. Running `renku update` will bring everything up to date.
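A typical local edit-and-update cycle might look like the following (a sketch; the notebook choice and commit message are arbitrary):

```shell
# Commit the notebook edit so renku sees a clean working tree:
git add notebooks/01_features.ipynb
git commit -m "Add a debugging print to the features notebook"

# Show which generated outputs are now stale:
renku status

# Re-run the recorded workflow steps to regenerate them:
renku update
```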


# CrowdAI Challenge Introduction

Using the provided datasets of financial predictors and semi-annual returns, participants are challenged to develop a model that will help identify the best-performing stocks in each time period.

Research Question: **Which stocks will experience the highest and lowest returns during the next six months?**

Out of the thousands of stocks in the market, small groups will experience exceptionally high or low returns. Considering the distribution of stock returns, a portfolio manager must buy the stocks in the right tail of the distribution and avoid the stocks in the left tail. The performance of an entire equity portfolio is often driven by these key investment decisions. **The goal of this challenge is to explore methodology that will increase the probability that portfolio managers identify these stocks with extreme positive or negative returns.**

## Access Dataset

Teams are provided with predictors and semi-annual returns for a group of stocks from `1996` to `2017`. This span of **21 years** is represented as **42 non-overlapping 6-month periods**. In each of the `42 time periods`, roughly **900 stocks** with the largest market capitalization (i.e., total market value in USD) were selected. Therefore, the selected set of stocks at each time period changes as companies increase or decrease in value. **All stock identifiers have been removed** and **all numeric variables have been anonymized and normalized**. Training and test datasets were created by selecting a **random sample of stocks** at each time period. `60%` of stocks were sampled into the training set and the remaining `40%` created the test set. Finally, all data from the second half of 2017 was allocated to the test set. This 6-month period will provide a final out-of-sample test of a model’s performance.

## Problem Statement

Each team must create a model that ranks a set of stocks based on the expected return over a forward 6-month window. This model can be a risk factor-based strategy (multi-factor model), predictive model, or any other data-based heuristic. There are many ways to approach this task and creative, non-traditional solutions are strongly encouraged. The final model will be tested on each 6-month period from 2002 to 2017.


# Author
SP Mohanty <sharada.mohanty@epfl.ch>
Harlander, Benjamin <Harlander.Benjamin@principal.com>