Introduction to Machine Learning @Udacity
Final Project: Identify suspects in Enron Fraud
1. Dataset and goal of project
Goal
The main purpose of this project is to develop a machine learning algorithm that detects persons of interest (POIs) in the dataset. A POI is someone who was indicted for fraud, settled with the government, or testified in exchange for immunity.
Dataset
We have the Enron email + financial (E+F) dataset. It contains records for 146 Enron managers to investigate, each described by 21 features; 18 people in the dataset are labeled as POIs. The two classes are therefore heavily imbalanced (many more non-POIs than POIs). Here's an example of one POI data point:
LAY KENNETH L
salary : 1072321
to_messages : 4273
deferral_payments : 202911
total_payments : 103559793
exercised_stock_options : 34348384
bonus : 7000000
restricted_stock : 14761694
shared_receipt_with_poi : 2411
restricted_stock_deferred: NaN
total_stock_value : 49110078
expenses : 99832
loan_advances : 81525000
from_messages : 36
other : 10359729
from_this_person_to_poi : 16
poi : 1
director_fees : NaN
deferred_income : -300000
long_term_incentive : 3600000
email_address : [email protected]
from_poi_to_this_person : 123
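For orientation, here is a minimal sketch of loading the dataset and reproducing the counts above. It assumes the `final_project_dataset.pkl` file and dict-of-dicts layout provided by the Udacity starter code, with the string 'NaN' marking missing values:

```python
import pickle

# Load the E+F dataset: a dict keyed by person name, where each value is
# a dict of the 21 features shown above ('NaN' marks missing data).
with open("final_project_dataset.pkl", "rb") as f:
    data_dict = pickle.load(f)

print("People:", len(data_dict))                                  # 146
print("POIs:  ", sum(1 for p in data_dict.values() if p["poi"]))  # 18
print(data_dict["LAY KENNETH L"])                                 # the record above
```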
Outliers
The dataset contains some outliers. The TOTAL row is the biggest outlier in the E+F dataset: it is a spreadsheet quirk (the sum of every column), so we should remove it. Beyond that, there are four more data points with very large salaries and bonuses; two people earned bonuses of more than $6 million and salaries over $1 million. These are not data errors: Ken Lay and Jeffrey Skilling really did make that much money. So these data points should be left in and examined together with the others.
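Continuing the loading sketch above, removing the spreadsheet artifact is a one-liner, and a quick look at the top salaries confirms the remaining big earners are real executives:

```python
# 'TOTAL' is the spreadsheet's column-sum row, not a person: drop it.
data_dict.pop("TOTAL", None)

# Sanity check: the top remaining salaries belong to real executives
# (e.g. LAY KENNETH L, SKILLING JEFFREY K), so those points stay in.
by_salary = sorted(data_dict.items(),
                   key=lambda kv: 0 if kv[1]["salary"] == "NaN" else kv[1]["salary"],
                   reverse=True)
print([name for name, _ in by_salary[:4]])
```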
2. Feature selection process
Feature | Justification |
---|---|
expenses | 100% of POIs reported expenses, compared to 60% of non-POIs |
shared_receipt_with_poi | 14/18 POIs (77%) shared a receipt with a POI, compared to 77/127 (57%) of non-POIs |
from_poi_to_this_person | Emails from a POI to the employee may carry material information |
New features
In addition, I created two new features of the kind considered in the course:
Feature | Justification |
---|---|
payment_to_salary | total_payments / salary, to identify those who had the most to gain from non-salaried compensation |
total_stock_to_payments | total_stock_value / total_payments, to identify those who had the most to gain from a higher stock price |
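A minimal sketch of how these ratio features can be added to `data_dict`; the `safe_ratio` helper is hypothetical, introduced here to cope with the 'NaN' markers and zero denominators:

```python
def safe_ratio(numerator, denominator):
    """Ratio of two E+F fields; returns 0.0 when either field is the
    string 'NaN' (missing) or the denominator is zero."""
    if numerator == "NaN" or denominator in ("NaN", 0):
        return 0.0
    return float(numerator) / float(denominator)

for person in data_dict.values():
    person["payment_to_salary"] = safe_ratio(person["total_payments"],
                                             person["salary"])
    person["total_stock_to_payments"] = safe_ratio(person["total_stock_value"],
                                                   person["total_payments"])
```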
Impact of New Features (using Decision Tree Classifier)
Metric | Best scores without new features | Best scores with new features |
---|---|---|
Accuracy | 0.82793 | 0.84443 |
Precision | 0.36532 | 0.43697 |
Recall | 0.39400 | 0.30850 |
F1 | 0.37912 | 0.36166 |
F2 | 0.38791 | 0.32777 |
The feature selection process took several iterations. In the first step I created a set of candidate features based on data visualization and intuition. I then examined seven classifiers on these features and optimized them using a pipeline with SelectKBest, PCA, and MinMaxScaler (a sketch of such a pipeline appears at the end of this section). Of these, the Decision Tree gave the best accuracy, precision, and recall, with the following feature importances:
Feature Importance:
expenses 0.396
shared_receipt_with_poi 0.271
payment_to_salary 0.070
total_stock_to_payments 0.263
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
max_features=None, max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='best')
Accuracy: 0.84443 Precision: 0.43697 Recall: 0.30850 F1: 0.36166 F2: 0.32777
Total predictions: 14000 True positives: 617 False positives: 795 False negatives: 1383 True negatives: 11205
Therefore, I chose the following features for the final run (dropping payment_to_salary), which improved precision to 0.53 and recall to 0.36 (see the results table in section 3):
Selected Features:
expenses
shared_receipt_with_poi
total_stock_to_payments
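For reference, here is a sketch of the kind of pipeline used in the comparison above. The step order, `k`, and `random_state` are illustrative assumptions, and `features_train`/`labels_train` stand for arrays prepared with the course's feature_format helpers:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.tree import DecisionTreeClassifier

# Scale each feature to [0, 1], project with PCA, keep the k strongest
# components, then classify.
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("pca", PCA()),
    ("kbest", SelectKBest(k=3)),
    ("tree", DecisionTreeClassifier(max_depth=4, random_state=42)),
])
pipe.fit(features_train, labels_train)
```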
3. Pick an algorithm
The following table summarizes the results for every algorithm examined:
Algorithm | Pipeline | Accuracy | Precision | Recall | F1 | F2 |
---|---|---|---|---|---|---|
Naive Bayes | No | 0.67838 | 0.19497 | 0.34850 | 0.25004 | 0.30108 |
Naive Bayes | Yes | 0.72746 | 0.21792 | 0.29800 | 0.25174 | 0.27760 |
SVM | No* | | | | | |
SVM | Yes* | | | | | |
Decision Tree | No | 0.85292 | 0.53235 | 0.36200 | 0.43095 | 0.38675 |
Decision Tree | Yes | 0.82408 | 0.37951 | 0.22600 | 0.28330 | 0.24589 |
Nearest Neighbors | No | 0.78369 | 0.18232 | 0.11650 | 0.14216 | 0.12557 |
Nearest Neighbors | Yes | 0.78046 | 0.19759 | 0.13950 | 0.16354 | 0.14822 |
Random Forest | No | 0.77738 | 0.28468 | 0.29550 | 0.28999 | 0.29327 |
Random Forest | Yes | 0.77785 | 0.28937 | 0.30500 | 0.29698 | 0.30174 |
AdaBoost | No | 0.81585 | 0.33852 | 0.20650 | 0.25652 | 0.22397 |
AdaBoost | Yes | 0.78877 | 0.24347 | 0.17700 | 0.20498 | 0.18722 |
QDA | No | 0.67600 | 0.18237 | 0.31750 | 0.23167 | 0.27652 |
QDA | Yes | 0.67600 | 0.18237 | 0.31750 | 0.23167 | 0.27652 |
Chosen algorithm
Based on the best overall performance, I picked the Decision Tree as the final algorithm.
4. Tune the algorithm
Reasons for algorithm tuning
The main reason to tune is to get better results out of the algorithm: hyperparameters such as tree depth control the bias/variance trade-off, and default values are rarely optimal for a small, imbalanced dataset. I used GridSearchCV with the following parameters to tune the algorithm (a sketch appears after the table).
Parameter | Settings for investigation | Best value |
---|---|---|
min_samples_split | [2, 4, 6, 8] | 2 |
splitter | ['random', 'best'] | 'best' |
max_depth | [2, 4, 6, 8, 10, 15] | 4 |
criterion | ['gini', 'entropy'] | 'entropy' |
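A sketch of the grid search, assuming the feature/label arrays from the previous section; the `scoring` and `cv` choices here are illustrative assumptions rather than the project's exact settings:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    "min_samples_split": [2, 4, 6, 8],
    "splitter": ["random", "best"],
    "max_depth": [2, 4, 6, 8, 10, 15],
    "criterion": ["gini", "entropy"],
}

# 'f1' balances the project's precision and recall targets (assumption).
grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid, scoring="f1", cv=5)
grid.fit(features, labels)
print(grid.best_params_)  # e.g. criterion='entropy', max_depth=4, ...
```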
5. Validation
To validate my analysis I used the stratified shuffle split cross-validation developed by Udacity and defined in the tester.py file. Stratification matters here because POIs are rare: each fold preserves the POI/non-POI ratio, so every test set contains some POIs. I had to modify test_classifier to return all of the computed metrics so they could be compared with the prevailing values. In addition, the input arrays to the fit function had to be NumPy arrays for the pipeline classifier. A sketch of the validation loop follows.
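In this sketch, `n_splits=1000` and `test_size=0.1` are assumptions mirroring the typical tester.py setup, and `clf` stands for the tuned classifier:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

features = np.array(features)  # the pipeline's fit expects NumPy arrays
labels = np.array(labels)

# Many small randomized splits, each preserving the POI/non-POI ratio.
sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.1, random_state=42)
for train_idx, test_idx in sss.split(features, labels):
    clf.fit(features[train_idx], labels[train_idx])
    predictions = clf.predict(features[test_idx])
    # accumulate TP/FP/FN/TN across folds here, as test_classifier does
```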
6. Evaluation metrics
I used the precision and recall evaluation metrics to assess the model: precision is the fraction of predicted POIs who really are POIs, and recall is the fraction of actual POIs the model finds. Final results can be found in the tables below.
Accuracy | Precision | Recall | F1 | F2 |
---|---|---|---|---|
0.88100 | 0.71228 | 0.38000 | 0.49560 | 0.41910 |
True positives | False positives | False negatives | True negatives | Total |
---|---|---|---|---|
760 | 307 | 1240 | 10693 | 13000 |
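As a quick arithmetic check, the headline metrics follow directly from these confusion counts:

```python
# Recompute accuracy, precision, and recall from the confusion counts.
tp, fp, fn, tn = 760, 307, 1240, 10693
print((tp + tn) / (tp + fp + fn + tn))  # accuracy  = 11453 / 13000 ~ 0.88100
print(tp / (tp + fp))                   # precision =   760 /  1067 ~ 0.71228
print(tp / (tp + fn))                   # recall    =   760 /  2000 = 0.38000
```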
Conclusion
With a precision of 0.71 and a recall of 0.38, the project goal of exceeding 0.3 on both metrics was reached. In this setting, higher precision is the more important of the two, since we want to minimize the number of innocent employees flagged as POI suspects. At the same time, a recall of 0.38 means only 38% of all POIs were identified.
The classes in the E+F dataset are heavily imbalanced, and almost half of all known POIs are not included in the dataset at all. Under those circumstances, the result is quite good, though of course not ideal.