README.txt


SAMPLE TEST RESULT GRAPHS ARE AVAILABLE IN THE test_results FOLDER. They are mainly there as an example of the output.


The project code is structured as follows:

├── malwareClustering.py  <-- MAIN FILE, also defines configuration for directories for storage etc
├── README.txt
├── src   	          <-- contains all project main code, clustering, analysis, plotting, etc
│   ├── apkFeature.py     <-- simple object made for storing apk data
│   ├── algorithms 	  <-- contains both unsupervised and supervised learning methods code
│   │   ├── supervised.py
│   │   └── unsupervised.py
│   ├── config.py	  <-- configuration file for the application
│   ├── dataAnalyzer.py   <-- used to perform one-hot encoding
│   ├── dataClusterer.py  <-- contains the clustering process
│   ├── dataProcessor.py  <-- extracts data from APKs or from existing feature vectors of strings
│   ├── errorLog.py	  <-- for outputting logs and errors
│   ├── fileReader.py	  <-- reads all files necessary for the project
│   └── util.py		  <-- evaluation score computation occurs here
└── test_results          <-- contains test result graphs from various scenarios


In order to run the project software, the following steps must be done:

1 - install Python 3.7
2 - install scikit-learn (imported as sklearn), androguard, pandas, matplotlib and numpy via pip

Alternatively, on Linux, both steps can be done with the following command:

sudo apt-get install python3.7 python3.7-dev python3-pip && python3.7 -m pip install numpy pandas scikit-learn matplotlib androguard


TO CONFIGURE THE LOCATION OF THE DREBIN DATASET PLEASE CHANGE THE FOLLOWING LINES INSIDE malwareClustering.py:

input_drebin_directory = '/home/user/Projects/drebin/'    <-- DREBIN APK ZIP FILE LOCATIONS
output_directory = '/home/user/Projects/output_dir/'    <-- WHERE ALL OUTPUT IS STORED

feature_vector_directory = '/home/user/Projects/drebin/feature_vectors/*'   <-- NEEDS TO HAVE THE FEATURES OF THE APKS FROM THE DREBIN DATASET

labeled_apk_csv_file_path = '/home/user/Projects/drebin/sha256_family.csv' <-- DREBIN LABELS CSV FILE PATH
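
Before starting a long run it can help to confirm that the four locations above point where expected. The following is only a minimal sketch and not part of the project: the function name check_paths is hypothetical, and creating the output directory up front is a choice made by this sketch, not documented project behaviour.

```python
import glob
import os


def check_paths(input_dir, output_dir, feature_glob, labels_csv):
    """Return a list of problems found; an empty list means the paths look usable."""
    problems = []
    # DREBIN APK zip file location
    if not os.path.isdir(input_dir):
        problems.append("input directory not found: " + input_dir)
    # feature_vector_directory ends in '*', so it is treated as a glob pattern
    if not glob.glob(feature_glob):
        problems.append("no feature vector files match: " + feature_glob)
    # DREBIN labels CSV file
    if not os.path.isfile(labels_csv):
        problems.append("labels CSV not found: " + labels_csv)
    # Assumption of this sketch: create the output directory if it is missing
    os.makedirs(output_dir, exist_ok=True)
    return problems
```

Calling check_paths with the four values configured in malwareClustering.py and printing the returned list shows which locations still need fixing.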


To run the project, execute malwareClustering.py from inside the project directory (where malwareClustering.py is located). The following command displays the list of available parameters:

python3.7 malwareClustering.py --help


To run clustering with the full dataset, please run the following command:

python3.7 malwareClustering.py -nc adaptive -m extract_existing_feature analyze_data clustering cache_all plot_data <--results are stored in the output_dir/stats/ folder


##################################################################################################################
Below are the parameter commands available:

  -h, --help            show this help message and exit
  -p DREBIN_PATTERN, --drebin_pattern DREBIN_PATTERN
                        -p folder names (if any) of where the drebin APKs are
                        stored, will be used as pattern for searching e.g:
                        drebin- (optional)
  -odir OUTPUT_DIRECTORY, --output_directory OUTPUT_DIRECTORY
                        -o output directory for cluster statistics and data.
  -idir INPUT_DIRECTORY, --input_directory INPUT_DIRECTORY
                        -i input directory for the DREBIN APKs (zip
                        containing the manifest files only)
  -dcsv DREBIN_CSV_LOCATION, --drebin_csv_location DREBIN_CSV_LOCATION
                        -drebin_csv /--drebin_csv_location location of the
                        drebin csv that contains the labels of all APK samples
  -f FEATURE_VECTOR_DIRECTORY, --feature_vector_directory FEATURE_VECTOR_DIRECTORY
                        -f feature vector directory
  -m, --operation_mode [{extract_apk_feature,extract_existing_feature,learning,analyze_data,clustering,cache_all,load_cache,plot_data,visualize_dataset,threshold_dataset,recompute_stats,include_benign} ...]
                        -m operation mode: extract data from APKs, apply
                        machine learning on existing feature vectors, or
                        simply process data from APKs (feature vectors);
                        several modes can be given at once; default: none
  -ls LOAD_PREVIOUS_STATS, --load_previous_stats LOAD_PREVIOUS_STATS
                        -ls load previous stats from file
  -dhs [DOWNSAMPLE_THRESHOLD [DOWNSAMPLE_THRESHOLD ...]], --downsample_threshold [DOWNSAMPLE_THRESHOLD [DOWNSAMPLE_THRESHOLD ...]]
                        only consider X samples from each class that passes
                        the threshold; must be used in conjunction with the
                        -ths argument, e.g. -ths X 1, where 1 enables
                        downsampling
  -ths [THRESHOLD_SAMPLES [THRESHOLD_SAMPLES ...]], --threshold_samples [THRESHOLD_SAMPLES [THRESHOLD_SAMPLES ...]]
                        only consider malware classes that have more than X
                        samples; these classes can be downsampled to X
                        samples if a second parameter is added: 1 or 0
  -nc NUM_CLUSTERS, --num_clusters NUM_CLUSTERS
                        -num_clusters number of clusters to be used in
                        unsupervised methods, range (n, m) with n, m >= 2,
                        e.g. -nc "2|166" will iteratively cluster from 2 to
                        166 clusters (will take some time), -nc 23 44 66
                        will cluster only with 23, 44 and 66 clusters
  -c [CLUSTERING_METHOD [CLUSTERING_METHOD ...]], --clustering_method [CLUSTERING_METHOD [CLUSTERING_METHOD ...]]
                        -c clustering_method method type (e.g. All, k-means,
                        dbscan, etc.)


#####################################################################################################################################
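
The value formats accepted by -nc and -ths can be easier to follow with a small example. The sketch below is not the project's own code: parse_num_clusters and threshold_families are hypothetical names, the "n|m" range is assumed to include both ends, "more than X" is assumed to be a strict comparison, and downsampling is assumed to pick samples at random. The "adaptive" value used in the full-dataset command is not covered, because this README does not describe it.

```python
import random
from collections import defaultdict


def parse_num_clusters(values):
    """Expand -nc style values into a list of cluster counts.

    "2|5" becomes 2, 3, 4, 5 (inclusive range, an assumption of this sketch);
    plain numbers such as "23" "44" "66" are taken as they are.
    """
    counts = []
    for value in values:
        if "|" in value:
            low, high = (int(part) for part in value.split("|"))
            if low > high:
                raise ValueError("range start must not exceed range end")
            counts.extend(range(low, high + 1))
        else:
            counts.append(int(value))
    # The help text requires n, m >= 2
    if any(count < 2 for count in counts):
        raise ValueError("cluster counts must be >= 2")
    return counts


def threshold_families(labels, min_samples, downsample=False, seed=0):
    """Keep only families with more than min_samples samples.

    labels is an iterable of (sha256, family) pairs. With downsample=True,
    every kept family is reduced to min_samples randomly chosen samples.
    """
    by_family = defaultdict(list)
    for sha256, family in labels:
        by_family[family].append(sha256)

    rng = random.Random(seed)  # fixed seed so repeated runs pick the same samples
    kept = {}
    for family, hashes in by_family.items():
        if len(hashes) <= min_samples:
            continue  # family is at or below the threshold, drop it
        if downsample:
            hashes = rng.sample(hashes, min_samples)
        kept[family] = hashes
    return kept
```

For example, parse_num_clusters(["2|5"]) yields the counts 2, 3, 4 and 5, and threshold_families(labels, 20, downsample=True) corresponds roughly to passing -ths 20 1 under the assumptions stated above.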