SAMPLE TEST RESULT GRAPHS ARE AVAILABLE IN THE test_results FOLDER; they are there mainly as an example of the output.
The project code is structured as follows:
├── malwareClustering.py <-- MAIN FILE, also defines configuration for directories for storage etc
├── README.txt
├── src <-- contains all project main code, clustering, analysis, plotting, etc
│ ├── apkFeature.py <-- simple object made for storing apk data
│ ├── algorithms <-- contains both unsupervised and supervised learning methods code
│ │ ├── supervised.py
│ │ └── unsupervised.py
│ ├── config.py <-- configuration file for the application
│ ├── dataAnalyzer.py <-- used to perform one-hot encoding
│ ├── dataClusterer.py <-- contains clustering process
│ ├── dataProcessor.py <-- extracts data from APK or existing feature vector of strings
│ ├── errorLog.py <-- for outputting logs and errors
│ ├── fileReader.py <-- reads all files necessary for the project
│ └── util.py <-- evaluation score computation occurs here
└── test_results <-- contains test result graphs from various scenarios
To run the project software, complete the following steps:
1 - install Python 3.7
2 - install scikit-learn, androguard, pandas, matplotlib and numpy via pip
Alternatively, on Linux, the following command performs both steps:
sudo apt-get install python3.7 python3.7-dev python3-pip && python3.7 -m pip install numpy pandas scikit-learn matplotlib androguard
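Before running the project, it can help to confirm that the packages listed above are importable from the Python interpreter in use. The following is a minimal sketch, not part of the project code; the function name missing_modules is hypothetical, and the only assumption is that each package is imported under the module name shown in the list.

```python
import importlib.util
import sys

# Import names of the packages listed in the installation steps.
# Note: the scikit-learn package is imported under the name "sklearn".
REQUIRED_MODULES = ("sklearn", "androguard", "pandas", "matplotlib", "numpy")


def missing_modules(modules=REQUIRED_MODULES):
    """Return the names in `modules` that the import system cannot find."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python version:", sys.version.split()[0])
    missing = missing_modules()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

Run it with the same interpreter that will run malwareClustering.py (for example python3.7), since packages are installed per interpreter.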
TO CONFIGURE THE LOCATION OF THE DREBIN DATASET, CHANGE THE FOLLOWING LINES INSIDE malwareClustering.py:
input_drebin_directory = '/home/user/Projects/drebin/' <-- LOCATION OF THE DREBIN APK ZIP FILES
output_directory = '/home/user/Projects/output_dir/' <-- WHERE ALL OUTPUT IS STORED
feature_vector_directory = '/home/user/Projects/drebin/feature_vectors/*' <-- MUST CONTAIN THE FEATURE VECTORS OF THE APKS FROM THE DREBIN DATASET
labeled_apk_csv_file_path = '/home/user/Projects/drebin/sha256_family.csv' <-- PATH OF THE DREBIN LABELS CSV FILE
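A wrong path in these four settings is an easy mistake to make, so a quick check before a long run may save time. The sketch below is an illustration only and is not part of the project; check_paths is a hypothetical helper. It assumes that feature_vector_directory is meant as a glob pattern (because of the trailing '*') and that output_directory should already exist, which the project itself may or may not require.

```python
import glob
import os


def check_paths(input_drebin_directory, output_directory,
                feature_vector_directory, labeled_apk_csv_file_path):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if not os.path.isdir(input_drebin_directory):
        problems.append("input_drebin_directory is not a directory: " + input_drebin_directory)
    # Assumption: the output directory is expected to exist beforehand.
    if not os.path.isdir(output_directory):
        problems.append("output_directory is not a directory: " + output_directory)
    # The configured value ends in '*', so it is treated as a glob pattern here.
    if not glob.glob(feature_vector_directory):
        problems.append("feature_vector_directory matches no files: " + feature_vector_directory)
    if not os.path.isfile(labeled_apk_csv_file_path):
        problems.append("labeled_apk_csv_file_path is not a file: " + labeled_apk_csv_file_path)
    return problems


if __name__ == "__main__":
    # Example values copied from the configuration lines above.
    for problem in check_paths(
        "/home/user/Projects/drebin/",
        "/home/user/Projects/output_dir/",
        "/home/user/Projects/drebin/feature_vectors/*",
        "/home/user/Projects/drebin/sha256_family.csv",
    ):
        print("WARNING:", problem)
```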
To run the project, execute it from inside the project directory (where malwareClustering.py is located). The following command displays the list of available parameters:
python3.7 malwareClustering.py --help
To run clustering with the full dataset, run the following command (results are stored in the output_dir/stats/ folder):
python3.7 malwareClustering.py -nc adaptive -m extract_existing_feature analyze_data clustering cache_all plot_data
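The full-dataset run can also be started from a Python script, for instance as part of a larger experiment. This is a minimal sketch under stated assumptions: build_command and run_full_clustering are hypothetical helpers, the arguments are exactly those of the command above, and the working directory is assumed to be the project directory.

```python
import subprocess

# Operation modes taken from the full-dataset command above.
FULL_RUN_MODES = (
    "extract_existing_feature",
    "analyze_data",
    "clustering",
    "cache_all",
    "plot_data",
)


def build_command(num_clusters="adaptive", modes=FULL_RUN_MODES, python="python3.7"):
    """Build the argument list for one malwareClustering.py invocation."""
    return [python, "malwareClustering.py", "-nc", str(num_clusters), "-m", *modes]


def run_full_clustering(project_dir="."):
    """Run the full-dataset clustering; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(build_command(), cwd=project_dir, check=True)
```

Passing the arguments as a list avoids shell quoting issues, which matters for values such as "2|166" where the | character would otherwise be interpreted by the shell.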
##################################################################################################################
The available parameters are listed below:
-h, --help
    show this help message and exit
-p DREBIN_PATTERN, --drebin_pattern DREBIN_PATTERN
    -p folder name pattern (if any) under which the DREBIN APKs are
    stored; used as the search pattern, e.g. drebin- (optional)
-odir OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
    -odir output directory for cluster statistics and data
-idir INPUT_DIRECTORY, --input_directory INPUT_DIRECTORY
    -idir input directory for the DREBIN APKs (zip
    containing the manifest files only)
-dcsv DREBIN_CSV_LOCATION, --drebin_csv_location DREBIN_CSV_LOCATION
    -dcsv / --drebin_csv_location location of the
    DREBIN CSV that contains the labels of all APK samples
-f FEATURE_VECTOR_DIRECTORY, --feature_vector_directory FEATURE_VECTOR_DIRECTORY
    -f feature vector directory
-m, --operation_mode [{extract_apk_feature,extract_existing_feature,learning,analyze_data,clustering,cache_all,load_cache,plot_data,visualize_dataset,threshold_dataset,recompute_stats,include_benign}]
    -m operation mode: extract data from APKs,
    apply machine learning on existing feature vectors, or
    simply process data from APKs (feature
    vectors); default: none
-ls LOAD_PREVIOUS_STATS, --load_previous_stats LOAD_PREVIOUS_STATS
    -ls load previous stats from file
-dhs [DOWNSAMPLE_THRESHOLD [DOWNSAMPLE_THRESHOLD ...]], --downsample_threshold [DOWNSAMPLE_THRESHOLD [DOWNSAMPLE_THRESHOLD ...]]
    consider only X samples from each of the thresholded
    classes; must be used in conjunction with the
    -ths argument: -ths X 1, where 1 enables downsampling
-ths [THRESHOLD_SAMPLES [THRESHOLD_SAMPLES ...]], --threshold_samples [THRESHOLD_SAMPLES [THRESHOLD_SAMPLES ...]]
    only consider malware classes that have more than X
    samples; classes can be downsampled to X samples if
    a second parameter is added: 1 or 0
-nc NUM_CLUSTERS, --num_clusters NUM_CLUSTERS
    -num_clusters number of clusters to be used in
    unsupervised methods; range (n, m) with n, m >= 2, e.g. -nc
    "2|166" will iteratively cluster from 2 to 166
    clusters (this will take some time), -nc 23 44 66
    will cluster only with 23, 44 and 66 clusters
-c [CLUSTERING_METHOD [CLUSTERING_METHOD ...]], --clustering_method [CLUSTERING_METHOD [CLUSTERING_METHOD ...]]
    -c clustering method type (e.g. All, k-means,
    dbscan, etc.)
#####################################################################################################################################
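The -nc and -ths descriptions above are compact, so the sketch below spells out one possible reading of them. It is not the project's implementation: parse_num_clusters and threshold_samples are hypothetical names, the upper bound of a "2|166" range is assumed to be inclusive, "above X samples" is read as strictly more than X, and downsampling is assumed to be a random choice of X samples per class. The value "adaptive" used in the full-dataset command is not covered here.

```python
import random
from collections import defaultdict


def parse_num_clusters(values):
    """Expand -nc values into a list of cluster counts.

    "2|166" is read as the range 2..166 (inclusive upper bound is an assumption);
    plain integers such as "23" are taken as given. All counts must be >= 2.
    """
    counts = []
    for value in values:
        if "|" in value:
            low, high = (int(part) for part in value.split("|"))
            if low < 2 or high < low:
                raise ValueError("invalid cluster range: " + value)
            counts.extend(range(low, high + 1))
        else:
            count = int(value)
            if count < 2:
                raise ValueError("number of clusters must be >= 2: " + value)
            counts.append(count)
    return counts


def threshold_samples(samples, min_samples, downsample=False, seed=0):
    """Keep only classes with more than `min_samples` samples.

    `samples` is a list of (sha256, family) pairs. With downsample=True,
    each kept class is reduced to `min_samples` randomly chosen samples.
    """
    by_family = defaultdict(list)
    for sha256, family in samples:
        by_family[family].append(sha256)

    rng = random.Random(seed)  # fixed seed so repeated runs pick the same samples
    kept = {}
    for family, hashes in by_family.items():
        if len(hashes) <= min_samples:
            continue  # class is too small, drop it
        if downsample:
            hashes = rng.sample(hashes, min_samples)
        kept[family] = hashes
    return kept
```

Under this reading, -nc 23 44 66 corresponds to parse_num_clusters(["23", "44", "66"]), and -ths X 1 corresponds to threshold_samples(samples, X, downsample=True).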