IPython Notebook, Numpy, Pandas, MongoDB, R — for the better part of a year now, I have been trying out these technologies as part of Udacity's Data Analyst Nanodegree. My undergrad education barely touched on data visualization or more broadly data science, and so I figured being exposed to the aforementioned technologies would be fun. And fun it has been, with R's powerful IDE-powered data mundging and visualization techniques having been particularly revelatory. I learned enough of R to create some complex visualizations, and was impressed by how easy is to import data into its Dataframe representations and then transform and visualize that data. I also thought RStudio's paradigm of continuously intermixed code editing and execution was superior to my habitual workflow of just endlessly cycling between tweaking and executing of Python scripts.
Still, R is a not-quite-general-purpose-language and I hit upon multiple instances in which simple things were hard to do. In such times, I could not help but miss the powers of Python, a language I have tons of experience with and which is about as general purpose as it gets. Luckily, the courses also covered the equivalent of an R implementation for Python: the Python Data Analysis Library, Pandas. This let me use the features of R I now liked — dataframes, powerful plotting methods, elegant methods for transforming data — with Python's lovely syntax and libraries I already knew and loved. And soon I got to do just that, using both Pandas and the supremely good Machine Learning package Scikit-learn for the final project of Udacity's Intro to Machine Learning Course. Not only that, but I also used IPython Notebook for RStudio-esque intermixed code editing and execution and nice PDF output.
I had such a nice experience with this combination of tools that I decided to dedicate a post to it, and what follows is mostly a summation of that experience. Reading it should be sufficient to get a general idea for why these tools are useful, whereas a much more detailed introdution and tutorial for Pandas can be found elsewhere (for instance here). Incidentally, this whole post was written in IPython Notebook and the source of that can be found here with the produced HTML here.
First, a bit about the project. The task was to first explore and clean a given dataset, and then train classification models using it. The dataset contained dozens of features about roughly 150 important employees from the notoriously corrupt company Enron, witch were classified as either a "Person of Interest" or not based on the outcome of investigations into Enron's corruption. It's a tiny dataset and not what I would have chosen, but such were the instructions. The data was provided in a bunch of Python dictionaries, and at first I just used a Python script to change it into a CSV and started exploring it in RStudio. But, it soon dawned on me that I would be much better off just working entirely in Python, and the following code is taken verbatim from my final project submission.
And so, the code. Following some imports and a '%matplotlib notebook' comment to allow plotting within IPython, I loaded the data using pickle and printed out some basic things about it (not yet using Pandas):
import matplotlib.pyplot as plt
import matplotlib
import pickle
import pandas as pd
import numpy as np
from IPython.display import display
%matplotlib notebook
enron_data = pickle.load(open("./ud120-projects/final_project/final_project_dataset.pkl", "rb"))
print("Number of people: %d"%len(enron_data.keys()))
print("Number of features per person: %d"%len(list(enron_data.values())[0]))
print("Number of POI: %d"%sum([1 if x['poi'] else 0 for x in enron_data.values()]))
But working with this set of dictionaries would not be nearly as fast or easy as a Pandas dataframe, so I soon converted it to that and went ahead and summarized all the features with a single method call:
df = pd.DataFrame.from_dict(enron_data)
del df['TOTAL']
df = df.transpose()
numeric_df = df.apply(pd.to_numeric, errors='coerce')
del numeric_df['email_address']
numeric_df.describe()
Looking through these, I found one instance of a valid outlier - Mark A. Frevert (CEO of Enron), and removed him from the dataset.
I should emphasize the benefits of doing all this in IPython Notebook. Being able to tweak parts of the code without reexecuting all of it and reloading all the data made iterating on ideas much faster, and iterating on ideas fast is essential for exploratory data analysis and development of machine learned models. It's no accident that the Matlab IDE and RStudio, both tools commonly used in the sciences for data processing and analysis, have essentially the same structure. I did not understand the benefits of IPython Notebook when I was first made to use it for class assignments in College, but now it has finally dawned on me that it fills the same role as those IDEs and became popular because it is similaly well suited for working with data.
del numeric_df['loan_advances']
del numeric_df['restricted_stock_deferred']
del numeric_df['director_fees']
std = numeric_df.apply(lambda x: np.abs(x - x.mean()) / x.std())
std = std.fillna(std.mean())
std.describe()
This result suggested that most features have large outliers (larger than 3 standard deviations). In order to be careful not to remove any useful data, I manually inspected all rows with large outliers to see any values that seem appropriate for removal:
outliers = std.apply(lambda x: x > 5).any(axis=1)
outlier_df = pd.DataFrame(index=numeric_df[outliers].index)
for col in numeric_df.columns:
outlier_df[str((col,col+'_std'))] = list(zip(numeric_df[outliers][col],std[outliers][col]))
display(outlier_df)
numeric_df.drop('FREVERT MARK A',inplace=True)
df.drop('FREVERT MARK A',inplace=True)
Looking through these, I found one instance of a valid outlier - Mark A. Frevert (CEO of Enron), and removed him from the dataset.
I should emphasize the benefits of doing all this in IPython Notebook. Being able to tweak parts of the code without reexecuting all of it and reloading all the data made iterating on ideas much faster, and iterating on ideas fast is essential for exploratory data analysis and development of machine learned models. It's no accident that the Matlab IDE and RStudio, both tools commonly used in the sciences for data processing and analysis, have essentially the same structure. I did not understand the benefits of IPython Notebook when I was first made to use it for class assignments in College, but now it has finally dawned on me that it fills the same role as those IDEs and became popular because it is similaly well suited for working with data.
The project also instructed me to choose a set of features, and to engineer some of my own. In order to get an initial idea of possible promising features and how I could use them to create new features, I computed the correlation of each feature to the Person of Interest classification:
corr = numeric_df.corr()
print('\nCorrelations between features to POI:\n ' +str(corr['poi']))
The results indicated that 'exercised_stock_options', 'total_stock_value', and 'bonus' are the most promising features. Just for fun, I went ahead and plotted these features to see if I could visually verify their significance:
numeric_df.hist(column='exercised_stock_options',by='poi',bins=25,sharex=True,sharey=True)
plt.suptitle("exercised_stock_options by POI")
numeric_df.hist(column='total_stock_value',by='poi',bins=25,sharex=True,sharey=True)
plt.suptitle("total_stock_value by POI")
numeric_df.hist(column='bonus',by='poi',bins=25,sharex=True,sharey=True)
plt.suptitle("bonus by POI")
As well as one that is not strongly correlated:
numeric_df.hist(column='to_messages',by='poi',bins=25,sharex=True,sharey=True)
plt.suptitle("to_messages by POI")
The data and plots above indicated that the exercised_stock_options, total_stock_value, and restricted_stock, and to a lesser extent to payment related information (total_payments, salary, bonus, and expenses), are all correlated to Persons of Interest. Therefore, I created new features as sums and ratios of these ones. Working with Pandas made this incredibely easy due to vectorized operations, and though Numpy could similarly make this easy I think Pandas' Dataframe construct makes it especially easy.
It was also easy to fix any problems with the data before starting to train machine learning models. In order to use the data for evaluation and training, I replaced null values with the mean of each feature so as to be able to use the dataset with Scikit-learn. I also scaled all features to a range of 1-0, to better work with Support Vector Machines:
#Get rid of label
del numeric_df['poi']
poi = df['poi']
#Create new features
numeric_df['stock_sum'] = numeric_df['exercised_stock_options'] +\
numeric_df['total_stock_value'] +\
numeric_df['restricted_stock']
numeric_df['stock_ratio'] = numeric_df['exercised_stock_options']/numeric_df['total_stock_value']
numeric_df['money_total'] = numeric_df['salary'] +\
numeric_df['bonus'] -\
numeric_df['expenses']
numeric_df['money_ratio'] = numeric_df['bonus']/numeric_df['salary']
numeric_df['email_ratio'] = numeric_df['from_messages']/(numeric_df['to_messages']+numeric_df['from_messages'])
numeric_df['poi_email_ratio_from'] = numeric_df['from_poi_to_this_person']/numeric_df['to_messages']
numeric_df['poi_email_ratio_to'] = numeric_df['from_this_person_to_poi']/numeric_df['from_messages']
#Feel in NA values with 'marker' value outside range of real values
numeric_df = numeric_df.fillna(numeric_df.mean())
#Scale to 1-0
numeric_df = (numeric_df-numeric_df.min())/(numeric_df.max()-numeric_df.min())
Then, I scored features using Scikit-learn's SelectKBest to get an ordering of them to test with multiple algorithms afterward. Pandas Dataframes can be used directly with Scikit-learn, which is another great benefit of it:
from sklearn.feature_selection import SelectKBest
selector = SelectKBest()
selector.fit(numeric_df,poi.tolist())
scores = {numeric_df.columns[i]:selector.scores_[i] for i in range(len(numeric_df.columns))}
sorted_features = sorted(scores,key=scores.get, reverse=True)
for feature in sorted_features:
print('Feature %s has value %f'%(feature,scores[feature]))
It appeared that several of my features are among the most useful, as 'poi_email_ratio_to', 'stock_sum', and 'money_total' are all ranked highly. But, since the data is so small I had no need to get rid of any of the features and went ahead with testing several classifiers with several sets of features.
Proceding with the project, I selected three algorithms to test and compare: Naive Bayes, Decision Trees, and Support Vector Machines. Naive Bayes is a good baseline for any ML task, and the other two fit well into the task of binary classification with many features and can both be automatically tuned using sklearn classes. A word on SkLearn: it is simply a very well designed Machine Learning toolkit, with great compatibility with Numpy (and therefore also Pandas) and an elegant and smart API structure that makes trying out different models and evaluating features and just about anything one might want short of Deep Learning easy.
I think the code that follows will attest to that. I tested those three algorithms with a variable number of features, from one to all of them ordered by the SelectKBest scoring. Because the data is so small, I could afford an extensive validation scheme and did multiple random splits of the data into training and testing to get an average that best indicated the strength of each algorithm. I also went ahead and evaluated precision and recall besides accuracy, since those were to be the metric of performance. And all it took to do all that is maybe 50 lines of code:
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.grid_search import RandomizedSearchCV, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_score, recall_score, accuracy_score
from sklearn.cross_validation import StratifiedShuffleSplit
import scipy
import warnings
warnings.filterwarnings('ignore')
gnb_clf = GridSearchCV(GaussianNB(),{})
#No params to tune for for linear bayes, use for convenience
svc_clf = SVC()
svc_search_params = {'C': scipy.stats.expon(scale=1),
'gamma': scipy.stats.expon(scale=.1),
'kernel': ['linear','poly','rbf'],
'class_weight':['balanced',None]}
svc_search = RandomizedSearchCV(svc_clf,
param_distributions=svc_search_params,
n_iter=25)
tree_clf = DecisionTreeClassifier()
tree_search_params = {'criterion':['gini','entropy'],
'max_leaf_nodes':[None,25,50,100,1000],
'min_samples_split':[2,3,4],
'max_features':[0.25,0.5,0.75,1.0]}
tree_search = GridSearchCV(tree_clf,
tree_search_params,
scoring='recall')
search_methods = [gnb_clf,svc_search,tree_search]
average_accuracies = [[0],[0],[0]]
average_precision = [[0],[0],[0]]
average_recall = [[0],[0],[0]]
num_splits = 10
train_split = 0.9
indices = list(StratifiedShuffleSplit(poi.tolist(),
num_splits,
test_size=1-train_split,
random_state=0))
best_features = None
max_score = 0
best_classifier = None
num_features = 0
for num_features in range(1,len(sorted_features)+1):
features = sorted_features[:num_features]
feature_df = numeric_df[features]
for classifier_idx in range(3):
sum_values = [0,0,0]
#Only do parameter search once, too wasteful to do a ton
search_methods[classifier_idx].fit(feature_df.iloc[indices[0][0],:],
poi[indices[0][0]].tolist())
classifier = search_methods[classifier_idx].best_estimator_
for split_idx in range(num_splits):
train_indices, test_indices = indices[split_idx]
train_data = (feature_df.iloc[train_indices,:],poi[train_indices].tolist())
test_data = (feature_df.iloc[test_indices,:],poi[test_indices].tolist())
classifier.fit(train_data[0],train_data[1])
predicted = classifier.predict(test_data[0])
sum_values[0]+=accuracy_score(predicted,test_data[1])
sum_values[1]+=precision_score(predicted,test_data[1])
sum_values[2]+=recall_score(predicted,test_data[1])
avg_acc,avg_prs,avg_recall = [val/num_splits for val in sum_values]
average_accuracies[classifier_idx].append(avg_acc)
average_precision[classifier_idx].append(avg_prs)
average_recall[classifier_idx].append(avg_recall)
score = (avg_prs+avg_recall)/2
if score>max_score and avg_prs>0.3 and avg_recall>0.3:
max_score = score
best_features = features
best_classifier = search_methods[classifier_idx].best_estimator_
print('Best classifier found is %s \n\
with score (recall+precision)/2 of %f\n\
and feature set %s'%(str(best_classifier),max_score,best_features))
Then, I could go right back to Pandas to plot the results. Sure, I could do this with matplotlib just as well, but the flexibility and simplicity of the 'plot' function call on a DataFrame makes it much less annoying to use in my opinion.
results = pd.DataFrame.from_dict({'Naive Bayes': average_accuracies[0],
'SVC':average_accuracies[1],
'Decision Tree':average_accuracies[2]})
results.plot(xlim=(1,len(sorted_features)-1),ylim=(0,1))
plt.suptitle("Classifier accuracy by # of features")
results = pd.DataFrame.from_dict({'Naive Bayes': average_precision[0],
'SVC':average_precision[1],
'Decision Tree':average_precision[2]})
results.plot(xlim=(1,len(sorted_features)-1),ylim=(0,1))
plt.suptitle("Classifier precision by # of features")
results = pd.DataFrame.from_dict({'Naive Bayes': average_recall[0],
'SVC':average_recall[1],
'Decision Tree':average_recall[2]})
results.plot(xlim=(1,len(sorted_features)-1),ylim=(0,1))
plt.suptitle("Classifier recall by # of features")
As output by my code, the best algorithm was consistently found to be Decision Trees and so I could finally finish up the project by submitting that as my model.
I did not much care for the project's dataset and overall structure, but I still greatly enjoyed completing it because of how fun it was to combine Pandas data processing with Scikit-learn model training in the process, with IPython Notebook making that process even more fluid. While not at all a well written introduction or tutorial for these packages, I do hope that this write up about a single project I finished using them might inspire some readers to try out doing that as well.