Commit 4f03206e authored by Eniko Szekely

Only mean

parent 91039bdd
Pipeline #72 failed with stages in 2 minutes and 16 seconds
%% Cell type:code id: tags:
``` python
import base64
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stat
sns.set()
```
%% Cell type:markdown id: tags:
# Preparation
This code is derived from the [crowdAI challenge starter kit](https://github.com/crowdAI/ieee_investment_ranking_challenge-starter-kit). The notebook reads in the challenge data, recomputes the return ranks, and writes out a set of averaged features for later processing.
## Read Challenge Data and the Prediction Template
%% Cell type:code id: tags:parameters
``` python
dataset_file_path = "../data/invest/full_dataset.csv"
features_pickle_file_path = '../data/outputs/features.pkl'
```
%% Cell type:code id: tags:
``` python
df = pd.read_csv(dataset_file_path)
```
%% Cell type:markdown id: tags:
## Inspect the data / EDA
%% Cell type:code id: tags:
``` python
# How many time periods are there?
time_periods = pd.unique(df['time_period'])
len(time_periods)
```
%%%% Output: execute_result
41
%% Cell type:code id: tags:
``` python
# Let's take a look at the distribution of returns
# in a (non-random) sampling of time periods
fig, axs = plt.subplots(2, 3, sharex=True, sharey=True)
fig.suptitle("Distribution of 6M Forward Returns")
ts = time_periods[::7] # extract 6 periods to look at
sample_df = df[df['time_period'].isin(ts)]
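for idx, (name, period_df) in enumerate(sample_df.groupby('time_period')):
    # Place the six periods on the 2x3 grid, row by row
    ax = axs[idx//3][idx%3]
    # Note: distplot is deprecated since seaborn 0.11; histplot(..., kde=True) is the closest replacement
    sns.distplot(period_df['Norm_Ret_F6M'].dropna(), ax=ax, axlabel="")
    ax.set_title("period|{}".format(name))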
```
%% Cell type:markdown id: tags:
## Recompute the `Rank_F6M`
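As a small illustration of the ranking convention used in the cell that follows, here is a sketch on made-up returns (the values in `toy_returns` are hypothetical, not challenge data). `rankdata(..., method='ordinal')` ranks in ascending order, so subtracting the ranks from the number of rows and adding 1 flips the order: the highest return ends up with rank 1.

```python
import numpy as np
import scipy.stats as stat

# Toy returns for one hypothetical time period (not challenge data)
toy_returns = np.array([0.10, 0.50, -0.20, 0.30])

# Ascending ordinal ranks: the smallest return gets rank 1
ascending = stat.rankdata(toy_returns, method='ordinal').astype(int)

# Flip the order so that the largest return gets rank 1
descending = len(toy_returns) - ascending + 1
print(descending)
```

With `method='ordinal'`, tied returns receive distinct ranks in order of appearance, so every row in a period gets a unique rank.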
%% Cell type:code id: tags:
``` python
for time in time_periods:
    # Select the training rows of this period
    mask = (df['time_period'] == time) & (df['Train'] == 1)
    returns = df.loc[mask, 'Norm_Ret_F6M']
    # Ordinal ranks ascend with the return; flip them so the highest return gets rank 1
    rank = len(returns) - stat.rankdata(returns, method='ordinal').astype(int) + 1
    df.loc[mask, 'Rank_F6M'] = rank
```
%% Cell type:markdown id: tags:
## Example Feature Engineering
Each of the 70 variables (`X1` to `X70`) is broken up into **6 non-overlapping observations** in each time period. For example, `X1` has six monthly observations in each period, represented as `X1_1`, `X1_2`, ..., `X1_6`.
To make the data easier to model, we average the 6 observations of each variable within each `time_period`, which yields one `_avg` column per variable.
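The averaging step can be sketched on a toy frame. The column values and the two-variable layout below are made up for illustration, and the `^` anchor in the regex is an added safeguard that is not part of the notebook's own code.

```python
import numpy as np
import pandas as pd

# Toy frame: two hypothetical variables with three observations each
toy = pd.DataFrame({
    'X1_1': [1.0, 2.0], 'X1_2': [2.0, np.nan], 'X1_3': [3.0, 4.0],
    'X2_1': [10.0, 20.0], 'X2_2': [20.0, 20.0], 'X2_3': [30.0, 20.0],
})

toy_avg = pd.DataFrame(index=toy.index)
for var in ['X1_', 'X2_']:
    # '^' anchors the match at the start of the column name (an assumption added here)
    cols = toy.filter(regex='^' + var)
    # mean(axis=1) skips NaN by default, so row 1 of X1 averages 2.0 and 4.0 only
    toy_avg[var + 'avg'] = cols.mean(axis=1)

print(toy_avg)
```

Because missing observations are skipped, a row only gets a `NaN` average when all of its observations for that variable are missing.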
%% Cell type:code id: tags:
``` python
# Create a new frame that contains averages over the observations.
# Percentile ranks of the averaged columns are disabled (mean only).
model_columns = ['time_period', 'index', 'Train', 'Norm_Ret_F6M', 'Rank_F6M']
model_df = pd.DataFrame(df[model_columns])
variable_list = ["X" + str(i) + '_' for i in range(1,71)]
for var in variable_list:
    # Average the columns whose names contain this prefix (e.g. X1_1 ... X1_6)
    var_avg = df.filter(regex=(var)).mean(axis=1)
    model_df[var + 'avg'] = var_avg
    # model_df[var + 'avg' + '_pctile'] = stat.rankdata(var_avg)/len(var_avg)
model_df.head()
```
%%%% Output: execute_result
time_period index Train Norm_Ret_F6M Rank_F6M X1_avg \
0 1996_2 1996_2_lo2py80q 1 -0.164343 563.0 0.103141
1 1996_2 1996_2_c0lbkx5l 1 0.159314 402.0 0.595348
2 1996_2 1996_2_awxeoifz 1 0.931337 131.0 -0.130188
3 1996_2 1996_2_4s31wr2v 1 0.520933 254.0 -0.171538
4 1996_2 1996_2_d70vvuvm 1 -0.750410 772.0 -0.064863
X2_avg X3_avg X4_avg X5_avg ... X61_avg X62_avg \
0 0.266471 0.090441 -0.001460 0.004838 ... NaN -0.168426
1 0.794574 0.512223 0.264419 0.249596 ... 0.074326 -1.240017
2 -0.532513 -0.980416 -0.649966 -0.584488 ... -0.286353 -1.000232
3 -0.607686 -1.247847 -0.776422 -0.495723 ... -0.005309 -0.263559
4 -0.175671 -0.145160 -0.111024 -0.370753 ... -0.882975 -0.932800
X63_avg X64_avg X65_avg X66_avg X67_avg X68_avg X69_avg \
0 -0.543723 0.045641 -0.237641 -0.393841 0.222733 -0.004981 0.002180
1 -0.461571 0.312671 -0.359308 -0.686705 -0.166179 -0.005312 0.002179
2 -0.380619 -0.620955 -0.500304 -0.125145 0.876411 -0.005404 0.002178
3 -0.282470 -0.470143 -0.515858 -0.508529 0.984860 -0.005187 0.002178
4 1.022226 0.223307 -0.398994 -0.844874 0.139202 -0.005607 0.002148
X70_avg
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
[5 rows x 75 columns]
%% Cell type:markdown id: tags:
# Write out features
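A minimal sketch of the pickle round trip, assuming a toy frame and a temporary directory (both hypothetical, chosen so the example does not touch `../data/outputs`). It shows that `pd.read_pickle` restores the frame written by `to_pickle`.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the feature frame (values are illustrative only)
toy_features = pd.DataFrame({
    'time_period': ['1996_2', '1996_2'],
    'X1_avg': [0.1, 0.6],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'features.pkl')
    # Write the frame, then read it back from the same path
    toy_features.to_pickle(toy_path)
    restored = pd.read_pickle(toy_path)

print(restored.equals(toy_features))
```

Note that `to_pickle` does not create missing parent directories, so the output directory has to exist before the write.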
%% Cell type:code id: tags:
``` python
model_df.to_pickle(features_pickle_file_path)
```
%% Cell type:code id: tags:
``` python
```