
# Introduction to Machine Learning @Udacity

## Final Project: Identify Suspects in the Enron Fraud

### 1. Dataset and goal of project

#### Goal

The main goal of this project is to develop a machine learning algorithm that detects persons of interest (POIs) in the dataset. A POI is someone who was indicted for fraud, settled with the government, or testified in exchange for immunity.

#### Dataset

We work with the Enron email + financial (E+F) dataset, which covers 146 Enron managers. Each sample is a dictionary of 21 features, and 18 people in the dataset are labeled as POIs, so the two classes are imbalanced (many more non-POIs than POIs). Here's an example of one POI data point:

```
LAY KENNETH L
salary                    : 1072321
to_messages               : 4273
deferral_payments         : 202911
total_payments            : 103559793
exercised_stock_options   : 34348384
bonus                     : 7000000
restricted_stock          : 14761694
shared_receipt_with_poi   : 2411
restricted_stock_deferred : NaN
total_stock_value         : 49110078
expenses                  : 99832
loan_advances             : 81525000
from_messages             : 36
other                     : 10359729
from_this_person_to_poi   : 16
poi                       : 1
director_fees             : NaN
deferred_income           : -300000
long_term_incentive       : 3600000
email_address             : [email protected]
from_poi_to_this_person   : 123
```
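The dataset ships as a pickled dictionary in the Udacity starter code (the file path below is an assumption, so it is left commented out). A minimal sketch of its shape, using the Lay record above and one hypothetical non-POI:

```python
import pickle  # used for the real loading step shown in the comment below

# Real loading step (path assumed from the Udacity final_project layout):
# with open("final_project_dataset.pkl", "rb") as f:
#     data_dict = pickle.load(f)

# Toy stand-in with the same shape: person name -> dict of features.
# "EMPLOYEE A" is hypothetical; missing values are the string "NaN".
data_dict = {
    "LAY KENNETH L": {"salary": 1072321, "bonus": 7000000, "poi": 1},
    "EMPLOYEE A":    {"salary": 250000,  "bonus": "NaN",   "poi": 0},
}

# Count the two (imbalanced) classes.
n_poi = sum(1 for person in data_dict.values() if person["poi"])
n_non_poi = len(data_dict) - n_poi
```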

#### Outliers

The dataset contains some outliers. The biggest one is the TOTAL row, a spreadsheet artifact that sums every column; it should be removed. Beyond that, there are four more points with very large salary and bonus: two people received bonuses above $6 million on salaries above $1 million. These are not data errors, since Ken Lay and Jeffrey Skilling really did make that much, so their data points are left in and examined with the rest.
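The cleanup itself is one dictionary operation; a sketch, with illustrative values standing in for the real TOTAL row:

```python
# data_dict maps person name -> feature dict; the spreadsheet "TOTAL" row
# aggregates every column and must be dropped before any analysis.
data_dict = {
    "TOTAL":         {"salary": 99999999, "poi": 0},  # illustrative values
    "LAY KENNETH L": {"salary": 1072321,  "poi": 1},
}

data_dict.pop("TOTAL", None)  # safe even if the key was already removed
```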

### 2. Feature selection process

| Feature | Selection justification |
|---------|-------------------------|
| expenses | All POIs (100%) had expenses, compared to 60% of non-POIs |
| shared_receipt_with_poi | 14/18 POIs (77%) shared a receipt with a POI, compared to 77/127 (57%) of non-POIs |
| from_poi_to_this_person | Emails from a POI to the employee might carry material information |

#### New features

In addition, I created two new features of the kind considered in the course:

| Feature | Justification |
|---------|---------------|
| payment_to_salary | total_payments / salary, to identify those with the most to gain from non-salaried compensation |
| total_stock_to_payments | total_stock_value / total_payments, to identify those with the most to gain from a higher stock price |
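A sketch of building the two ratios, checked against Lay's figures from the data point above; the E+F convention of using the string "NaN" for missing values is preserved:

```python
def add_ratio_features(data_dict):
    """Add payment_to_salary and total_stock_to_payments to each record."""
    for f in data_dict.values():
        tp, sal, stock = f["total_payments"], f["salary"], f["total_stock_value"]
        # Only divide when both operands are numeric and the denominator is nonzero.
        f["payment_to_salary"] = tp / sal if "NaN" not in (tp, sal) and sal else "NaN"
        f["total_stock_to_payments"] = stock / tp if "NaN" not in (stock, tp) and tp else "NaN"
    return data_dict

data = add_ratio_features({
    "LAY KENNETH L": {"total_payments": 103559793, "salary": 1072321,
                      "total_stock_value": 49110078},
})
r = data["LAY KENNETH L"]
# Lay's total payments were roughly 96.6x his salary; stock was ~0.47x payments.
```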

#### Impact of new features (Decision Tree classifier)

| Metric | Best scores without new features | Best scores with new features |
|--------|----------------------------------|-------------------------------|
| Accuracy | 0.82793 | 0.84443 |
| Precision | 0.36532 | 0.43697 |
| Recall | 0.39400 | 0.30850 |
| F1 | 0.37912 | 0.36166 |
| F2 | 0.38791 | 0.32777 |

Feature selection took several iterations. First, I created a set of features based on data visualization and intuition. I then examined seven classifiers on these features, optimizing each with a pipeline of MinMaxScaler, SelectKBest, and PCA. Of these, the Decision Tree gave the best accuracy, precision, and recall, with the following feature importances:
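The pipeline described above can be sketched as follows (the `k`, `n_components`, and `random_state` choices here are illustrative, not the tuned values):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

pipe = Pipeline([
    ("scale", MinMaxScaler()),        # financial features span wildly different ranges
    ("kbest", SelectKBest(k="all")),  # univariate screen; k is tuned in practice
    ("pca",   PCA(n_components=3)),
    ("clf",   DecisionTreeClassifier(max_depth=4, random_state=42)),
])

# Toy data shaped like featureFormat output: rows = people, columns = features.
X = np.random.RandomState(0).rand(20, 4)
y = np.array([0, 1] * 10)
pipe.fit(X, y)
preds = pipe.predict(X)
```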

Feature importances:

```
expenses                   0.396
shared_receipt_with_poi    0.271
payment_to_salary          0.070
total_stock_to_payments    0.263

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_split=1e-07, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       presort=False, random_state=None, splitter='best')

Accuracy: 0.84443   Precision: 0.43697   Recall: 0.30850   F1: 0.36166   F2: 0.32777
Total predictions: 14000   True positives: 617   False positives: 795
False negatives: 1383   True negatives: 11205
```

Therefore, I dropped payment_to_salary and chose the following features for the final run, which improved precision to 0.53 and recall to 0.36 (see below for the full results).

Selected features:

- expenses
- shared_receipt_with_poi
- total_stock_to_payments

### 3. Pick an algorithm

The following table summarizes the results for each algorithm I examined, with and without the pipeline:

| Algorithm | Pipeline | Accuracy | Precision | Recall | F1 | F2 |
|-----------|----------|----------|-----------|--------|----|----|
| Naive Bayes | No | 0.67838 | 0.19497 | 0.34850 | 0.25004 | 0.30108 |
| Naive Bayes | Yes | 0.72746 | 0.21792 | 0.29800 | 0.25174 | 0.27760 |
| SVM | No | \* | | | | |
| SVM | Yes | \* | | | | |
| Decision Tree | No | 0.85292 | 0.53235 | 0.36200 | 0.43095 | 0.38675 |
| Decision Tree | Yes | 0.82408 | 0.37951 | 0.22600 | 0.28330 | 0.24589 |
| Nearest Neighbors | No | 0.78369 | 0.18232 | 0.11650 | 0.14216 | 0.12557 |
| Nearest Neighbors | Yes | 0.78046 | 0.19759 | 0.13950 | 0.16354 | 0.14822 |
| Random Forest | No | 0.77738 | 0.28468 | 0.29550 | 0.28999 | 0.29327 |
| Random Forest | Yes | 0.77785 | 0.28937 | 0.30500 | 0.29698 | 0.30174 |
| AdaBoost | No | 0.81585 | 0.33852 | 0.20650 | 0.25652 | 0.22397 |
| AdaBoost | Yes | 0.78877 | 0.24347 | 0.17700 | 0.20498 | 0.18722 |
| QDA | No | 0.67600 | 0.18237 | 0.31750 | 0.23167 | 0.27652 |
| QDA | Yes | 0.67600 | 0.18237 | 0.31750 | 0.23167 | 0.27652 |

\* Got a divide-by-zero when trying out: `SVC(C=0.1, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape=None, degree=3, gamma=0.1, kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)`

#### Chosen algorithm

Based on its overall performance, I picked the Decision Tree as the final algorithm.

### 4. Tune the algorithm

#### Reasons for algorithm tuning

The main reason for tuning is to get better results out of the chosen algorithm. I used GridSearchCV with the following parameters:

| Parameter | Settings investigated | Best value |
|-----------|-----------------------|------------|
| min_samples_split | [2, 4, 6, 8] | 2 |
| splitter | ['random', 'best'] | 'best' |
| max_depth | [2, 4, 6, 8, 10, 15] | 4 |
| criterion | ['gini', 'entropy'] | 'entropy' |
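A sketch of the grid search over exactly these settings; the `scoring` and `cv` choices here are assumptions, since the project actually scored candidates via tester.py:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# The grid matches the table above: 4 * 2 * 6 * 2 = 96 candidate settings.
param_grid = {
    "min_samples_split": [2, 4, 6, 8],
    "splitter": ["random", "best"],
    "max_depth": [2, 4, 6, 8, 10, 15],
    "criterion": ["gini", "entropy"],
}

grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid, scoring="f1", cv=5)

# Toy data; the real run uses the selected E+F features and labels.
X = np.random.RandomState(1).rand(40, 3)
y = np.array([0, 1] * 20)
grid.fit(X, y)
best = grid.best_params_  # dict with one chosen value per parameter
```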

### 5. Validation

To validate my analysis I used the stratified shuffle split cross-validation developed by Udacity and defined in the tester.py file. I had to modify test_classifier to return all the computed metrics for comparison with prevailing values. In addition, the input arrays passed to the fit function had to be converted to NumPy arrays for the pipeline classifier.
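The validation scheme can be sketched with scikit-learn's StratifiedShuffleSplit (tester.py uses 1000 folds; far fewer here, and the data is a toy stand-in):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in: 50 people, 3 features, imbalanced labels like POI vs non-POI.
rng = np.random.RandomState(0)
features = rng.rand(50, 3)
labels = np.array([1] * 10 + [0] * 40)

clf = DecisionTreeClassifier(max_depth=4, random_state=42)
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=42)

# Accumulate predictions over all folds, as tester.py does, so the
# precision/recall estimates are pooled rather than averaged per fold.
preds, truths = [], []
for train_idx, test_idx in sss.split(features, labels):
    clf.fit(features[train_idx], labels[train_idx])
    preds.extend(clf.predict(features[test_idx]))
    truths.extend(labels[test_idx])
```

Each fold keeps the POI/non-POI ratio of the full dataset, which matters with classes this imbalanced.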

### 6. Evaluation metrics

I used precision and recall to evaluate the model. Final results are shown in the tables below.

| Accuracy | Precision | Recall | F1 | F2 |
|----------|-----------|--------|----|----|
| 0.88100 | 0.71228 | 0.38000 | 0.49560 | 0.41910 |

| True positives | False positives | False negatives | True negatives | Total |
|----------------|-----------------|-----------------|----------------|-------|
| 760 | 307 | 1240 | 10693 | 13000 |
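The headline metrics follow directly from these confusion counts; a quick arithmetic check:

```python
tp, fp, fn, tn = 760, 307, 1240, 10693

accuracy  = (tp + tn) / (tp + fp + fn + tn)   # 11453 / 13000 = 0.88100
precision = tp / (tp + fp)                    # 760 / 1067 = 0.71228
recall    = tp / (tp + fn)                    # 760 / 2000 = 0.38000
f1 = 2 * precision * recall / (precision + recall)
f2 = 5 * precision * recall / (4 * precision + recall)  # F-beta with beta = 2
```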

### Conclusion

With a precision of 0.71 and a recall of 0.38, the project goal of exceeding 0.3 on both metrics was reached. Higher precision matters more here, since we want to minimize the number of innocent employees flagged as POI suspects. At the same time, a recall of 0.38 means that only 38% of all POIs were identified.

The classes in the E+F dataset are very imbalanced, and almost half of all known POIs are not included in the dataset at all. Under the circumstances the result is quite good, though of course not ideal.
