Introduction to Machine Learning @Udacity
Final Project: Identify suspects in Enron Fraud
1. Dataset and goal of project
Goal
The main purpose of this project is to develop a machine learning algorithm that detects persons of interest (POIs) in the dataset. A POI is someone who was indicted for fraud, settled with the government, or testified in exchange for immunity.
Dataset
We have the Enron email + financial (E+F) dataset. It contains records for 146 Enron managers to investigate, each described by 21 features; 18 people in the dataset are labeled as POIs. The two classes are therefore heavily imbalanced (many more non-POIs than POIs). Here's an example of one POI data point:
LAY KENNETH L
salary : 1072321
to_messages : 4273
deferral_payments : 202911
total_payments : 103559793
exercised_stock_options : 34348384
bonus : 7000000
restricted_stock : 14761694
shared_receipt_with_poi : 2411
restricted_stock_deferred: NaN
total_stock_value : 49110078
expenses : 99832
loan_advances : 81525000
from_messages : 36
other : 10359729
from_this_person_to_poi : 16
poi : 1
director_fees : NaN
deferred_income : -300000
long_term_incentive : 3600000
email_address : [email protected]
from_poi_to_this_person : 123
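For orientation, here is a minimal sketch of loading the dataset and reproducing the counts above. It assumes the `final_project_dataset.pkl` file and dict-of-dicts layout provided by the Udacity starter code, with the string 'NaN' marking missing values:

```python
import pickle

# Load the E+F dataset: a dict keyed by person name, where each value is
# a dict of the 21 features shown above ('NaN' marks missing data).
with open("final_project_dataset.pkl", "rb") as f:
    data_dict = pickle.load(f)

print("People:", len(data_dict))                                  # 146
print("POIs:  ", sum(1 for p in data_dict.values() if p["poi"]))  # 18
print(data_dict["LAY KENNETH L"])                                 # the record above
```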
Outliers
The dataset contains some outliers. The TOTAL row is the biggest outlier in the E+F dataset: it is a spreadsheet quirk (the sum of every column), so we should remove it. Beyond that, there are four more data points with very large salaries and bonuses; two people earned bonuses of more than $6 million and salaries over $1 million. These are not data errors: Ken Lay and Jeffrey Skilling really did make that much money. So these data points should be left in and examined together with the others.
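Continuing the loading sketch above, removing the spreadsheet artifact is a one-liner, and a quick look at the top salaries confirms the remaining big earners are real executives:

```python
# 'TOTAL' is the spreadsheet's column-sum row, not a person: drop it.
data_dict.pop("TOTAL", None)

# Sanity check: the top remaining salaries belong to real executives
# (e.g. LAY KENNETH L, SKILLING JEFFREY K), so those points stay in.
by_salary = sorted(data_dict.items(),
                   key=lambda kv: 0 if kv[1]["salary"] == "NaN" else kv[1]["salary"],
                   reverse=True)
print([name for name, _ in by_salary[:4]])
```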
2. Feature selection process
Feature | Justification |
---|---|
expenses | 100% of POIs reported expenses, compared to 60% of non-POIs |
shared_receipt_with_poi | 14/18 POIs (77%) shared a receipt with a POI, compared to 77/127 (57%) of non-POIs |
from_poi_to_this_person | Emails from a POI to the employee may carry material information |
New features
In addition, I created two new features of the kind considered in the course:
Feature | Justification |
---|---|
payment_to_salary | total_payments / salary, to identify those who had the most to gain from non-salaried compensation |
total_stock_to_payments | total_stock_value / total_payments, to identify those who had the most to gain from a higher stock price |
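A minimal sketch of how these ratio features can be added to `data_dict`; the `safe_ratio` helper is hypothetical, introduced here to cope with the 'NaN' markers and zero denominators:

```python
def safe_ratio(numerator, denominator):
    """Ratio of two E+F fields; returns 0.0 when either field is the
    string 'NaN' (missing) or the denominator is zero."""
    if numerator == "NaN" or denominator in ("NaN", 0):
        return 0.0
    return float(numerator) / float(denominator)

for person in data_dict.values():
    person["payment_to_salary"] = safe_ratio(person["total_payments"],
                                             person["salary"])
    person["total_stock_to_payments"] = safe_ratio(person["total_stock_value"],
                                                   person["total_payments"])
```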
Impact of New Features (using Decision Tree Classifier)
Metric | Best scores without new features | Best scores with new features |
---|---|---|
Accuracy | 0.82793 | 0.84443 |
Precision | 0.36532 | 0.43697 |
Recall | 0.39400 | 0.30850 |
F1 | 0.37912 | 0.36166 |
F2 | 0.38791 | 0.32777 |
The feature selection process took several iterations. In the first step I created a set of candidate features based on data visualization and intuition. I then examined seven classifiers on these features and optimized them using a pipeline with SelectKBest, PCA, and MinMaxScaler (a sketch of such a pipeline appears at the end of this section). Of these, the Decision Tree gave the best accuracy, precision, and recall, with the following feature importances:
Feature Importance:
expenses 0.396
shared_receipt_with_poi 0.271
payment_to_salary 0.070
total_stock_to_payments 0.263
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,
max_features=None, max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='best')
Accuracy: 0.84443 Precision: 0.43697 Recall: 0.30850 F1: 0.36166 F2: 0.32777
Total predictions: 14000 True positives: 617 False positives: 795 False negatives: 1383 True negatives: 11205
Therefore, I chose the following features for the final run (dropping payment_to_salary), which improved precision to 0.53 and recall to 0.36 (see the results table in section 3):
Selected Features:
expenses
shared_receipt_with_poi
total_stock_to_payments
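For reference, here is a sketch of the kind of pipeline used in the comparison above. The step order, `k`, and `random_state` are illustrative assumptions, and `features_train`/`labels_train` stand for arrays prepared with the course's feature_format helpers:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.tree import DecisionTreeClassifier

# Scale each feature to [0, 1], project with PCA, keep the k strongest
# components, then classify.
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("pca", PCA()),
    ("kbest", SelectKBest(k=3)),
    ("tree", DecisionTreeClassifier(max_depth=4, random_state=42)),
])
pipe.fit(features_train, labels_train)
```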
3. Pick an algorithm
The following table summarizes the results for every algorithm examined:
Algorithm | Pipeline | Accuracy | Precision | Recall | F1 | F2 |
---|---|---|---|---|---|---|
Naive Bayes | No | 0.67838 | 0.19497 | 0.34850 | 0.25004 | 0.30108 |
Naive Bayes | Yes | 0.72746 | 0.21792 | 0.29800 | 0.25174 | 0.27760 |
SVM | No* | | | | | |
SVM | Yes* | | | | | |
Decision Tree | No | 0.85292 | 0.53235 | 0.36200 | 0.43095 | 0.38675 |
Decision Tree | Yes | 0.82408 | 0.37951 | 0.22600 | 0.28330 | 0.24589 |
Nearest Neighbors | No | 0.78369 | 0.18232 | 0.11650 | 0.14216 | 0.12557 |
Nearest Neighbors | Yes | 0.78046 | 0.19759 | 0.13950 | 0.16354 | 0.14822 |
Random Forest | No | 0.77738 | 0.28468 | 0.29550 | 0.28999 | 0.29327 |
Random Forest | Yes | 0.77785 | 0.28937 | 0.30500 | 0.29698 | 0.30174 |
AdaBoost | No | 0.81585 | 0.33852 | 0.20650 | 0.25652 | 0.22397 |
AdaBoost | Yes | 0.78877 | 0.24347 | 0.17700 | 0.20498 | 0.18722 |
QDA | No | 0.67600 | 0.18237 | 0.31750 | 0.23167 | 0.27652 |
QDA | Yes | 0.67600 | 0.18237 | 0.31750 | 0.23167 | 0.27652 |
Chosen algorithm
Based on the best overall performance, I picked the Decision Tree as the final algorithm.
4. Tune the algorithm
Reasons for algorithm tuning
The main reason to tune is to get better results out of the algorithm: hyperparameters such as tree depth control the bias/variance trade-off, and default values are rarely optimal for a small, imbalanced dataset. I used GridSearchCV with the following parameters to tune the algorithm (a sketch appears after the table).
Parameter | Settings for investigation | Best value |
---|---|---|
min_samples_split | [2, 4, 6, 8] | 2 |
splitter | ['random', 'best'] | 'best' |
max_depth | [2, 4, 6, 8, 10, 15] | 4 |
criterion | ['gini', 'entropy'] | 'entropy' |
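A sketch of the grid search, assuming the feature/label arrays from the previous section; the `scoring` and `cv` choices here are illustrative assumptions rather than the project's exact settings:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    "min_samples_split": [2, 4, 6, 8],
    "splitter": ["random", "best"],
    "max_depth": [2, 4, 6, 8, 10, 15],
    "criterion": ["gini", "entropy"],
}

# 'f1' balances the project's precision and recall targets (assumption).
grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid, scoring="f1", cv=5)
grid.fit(features, labels)
print(grid.best_params_)  # e.g. criterion='entropy', max_depth=4, ...
```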
5. Validation
To validate my analysis I used the stratified shuffle split cross-validation developed by Udacity and defined in the tester.py file. Stratification matters here because POIs are rare: each fold preserves the POI/non-POI ratio, so every test set contains some POIs. I had to modify test_classifier to return all of the computed metrics so they could be compared with the prevailing values. In addition, the input arrays to the fit function had to be NumPy arrays for the pipeline classifier. A sketch of the validation loop follows.
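In this sketch, `n_splits=1000` and `test_size=0.1` are assumptions mirroring the typical tester.py setup, and `clf` stands for the tuned classifier:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

features = np.array(features)  # the pipeline's fit expects NumPy arrays
labels = np.array(labels)

# Many small randomized splits, each preserving the POI/non-POI ratio.
sss = StratifiedShuffleSplit(n_splits=1000, test_size=0.1, random_state=42)
for train_idx, test_idx in sss.split(features, labels):
    clf.fit(features[train_idx], labels[train_idx])
    predictions = clf.predict(features[test_idx])
    # accumulate TP/FP/FN/TN across folds here, as test_classifier does
```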
6. Evaluation metrics
I used the precision and recall evaluation metrics to assess the model: precision is the fraction of predicted POIs who really are POIs, and recall is the fraction of actual POIs the model finds. Final results can be found in the tables below.
Accuracy | Precision | Recall | F1 | F2 |
---|---|---|---|---|
0.88100 | 0.71228 | 0.38000 | 0.49560 | 0.41910 |
True positives | False positives | False negatives | True negatives | Total |
---|---|---|---|---|
760 | 307 | 1240 | 10693 | 13000 |
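As a quick arithmetic check, the headline metrics follow directly from these confusion counts:

```python
# Recompute accuracy, precision, and recall from the confusion counts.
tp, fp, fn, tn = 760, 307, 1240, 10693
print((tp + tn) / (tp + fp + fn + tn))  # accuracy  = 11453 / 13000 ~ 0.88100
print(tp / (tp + fp))                   # precision =   760 /  1067 ~ 0.71228
print(tp / (tp + fn))                   # recall    =   760 /  2000 = 0.38000
```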
Conclusion
With a precision of 0.71 and a recall of 0.38, the project goal of exceeding 0.3 on both metrics was reached. In this setting, higher precision is the more important of the two, since we want to minimize the number of innocent employees flagged as POI suspects. At the same time, a recall of 0.38 means only 38% of all POIs were identified.
The classes in the E+F dataset are heavily imbalanced, and almost half of all known POIs are not included in the dataset at all. Under those circumstances, the result is quite good, though of course not ideal.