Implementation of the Evolutionary Bagging algorithm and reproduction of the majority of the experiments conducted in the paper.


EvoBagging-implementation

Implementation of the Evolutionary Bagging algorithm, developed as an assignment for the Ensemble Learning course at CentraleSupélec.

Description

This repository contains an implementation of the Evolutionary Bagging algorithm proposed in "Evolutionary bagging for ensemble learning" by Giang Ngo, Rodney Beard, and Rohitash Chandra (Link to article). The original implementation by the paper's authors can be found at the following link: repository. Our implementation cuts training time roughly 7-fold: on the red wine dataset, training for 25 iterations took 50 s with the original implementation and 7 s with ours (Notebook with results). Note, however, that the results reported in the paper differ from those we computed here. In addition to classification, our implementation also lets the user build regression models.

Repository organization

.
├── data 
│   *folder containing data files from original repository*
├── experiments 
│   *folder containing subdirectories with scripts to recreate the experiments from the article, together with .csv result files,
│    a detailed description of the conducted experiments, and a Python notebook to plot graphs and create tables*
│    ├── comparison_with_paper
│    ├── baselines
│    ├── bias_reduction
│    ├── diveristy_between_evolved_bags
│    ├── evobagging_grid_search
│    ├── influence_of_hyper_parameters
│    ├── nbit_parity
│    ├── optimal_bag_number     
│    └── voting_rule
├── implementation_comparison
│   *folder containing results of comparison of training Evolutionary Bagging algorithm on red wine dataset using 
│   original implementation as well and implementation from this repository*
├── DataProcessing.py
│   *class implementation of data container for easier data preparation for EvoBagging class*
└── Evobagging.py
    *class implementation of Evolutionary Bagging algorithm*

The following files are taken from the repository published by the authors of the paper:

Example usage

This repository provides access to the transformed data used in the original experiments via the DataProcessing class. After loading DataProcessing from DataProcessing.py and EvoBagging from Evobagging.py, the data can be prepared for training as follows:

data = DataProcessing()
data.from_original_paper(dataset_name="red_wine", test_size=0.2)

To train the model, we first need to instantiate it and then call the fit method.

model = EvoBagging(number_of_initial_bags=50,
                   maximum_bag_size=data.X_train.shape[0],
                   generation_gap_size=10,
                   k=5000,
                   mutation_size=65,
                   mode='classification',
                   number_of_mutated_bags=5,
                   number_of_iteration=25,
                   selection='naive',
                   logging=False,
                   metric_name="accuracy_score")
model.fit(data.X_train, data.y_train)

To predict on the test set, run the following:

predictions_test = model.predict(data.X_test)

Using the DataProcessing class is not obligatory. The model accepts a numpy matrix of shape (num_data_instance, num_features) as X_train and a numpy array of shape (num_data_instance,) as y_train.
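For instance, a toy input of the expected shapes can be built directly with numpy (the data below is purely illustrative, and the commented fit call assumes EvoBagging is imported as in the example above):

```python
import numpy as np

rng = np.random.default_rng(0)

# X_train: shape (num_data_instance, num_features)
# y_train: shape (num_data_instance,)
X_train = rng.normal(size=(100, 4))
y_train = (X_train[:, 0] > 0).astype(int)  # toy binary target

# The model is then trained exactly as shown above, e.g.:
# model = EvoBagging(...)  # same parameters as in the earlier example
# model.fit(X_train, y_train)
```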


Further documentation is provided in the EvoBagging class.

Implementation of experiments

In the experiments subdirectory one can find the implementations and results of the experiments from the cited paper. A description of the experiments can be found here, and a comparison with the graphs and tables reported in the paper can be found here. The graphs were produced by the following Python notebook.

Inaccuracies and differences between original implementation and paper

| What is written in the paper | What the authors implemented in their repository | What we did |
| --- | --- | --- |
| "Bag B will be mutated by replacing selected random samples given by the mutation size (MS) from B with the same amount of random samples from complement of B." | All occurrences of the sampled replacements are deleted from the bag and replaced with new observations; since the mutation size is fixed, more samples may be deleted than introduced into the bag. | The same as in the authors' repository. |
| No information about data preprocessing. | The data is significantly changed, e.g. red wine initially has 6 classes (from 3 to 8); the authors assign all values < 6.5 to 0 and > 6.5 to 1. | We follow what was done in the authors' repository. |
| "Note that we use a binary encoded evolutionary algorithm where each gene in the chromosome (individual) represents whether the data instance is in the bag as shown in Fig 1." | Binary encoding of the data instances in each bag is in fact not possible in this scenario, since instances are sampled with replacement, so a bag can contain several instances with the same index. | We store the indexes of the data instances in each bag. |
| "We simplify the step of selecting C crossover parents by using a rank selection scheme where the offspring with the highest fitness are selected for crossover." | A rank selection scheme includes randomness; in their implementation, however, the C bags with the highest fitness are always chosen for crossover. | In our implementation the user can choose which selection scheme to use: "naive" or "rank". |
| "we terminate the EvoBagging training after a certain number of iterations which have been determined from trial experiments." | There is no detailed information on how the number of iterations is established. | We chose the same number of iterations per dataset as indicated in the paper. |
| No information about the parameters of the classifier (e.g. max depth). | While building the classifier for a specific bag, the authors limit its max depth to the number of features. | We let the user choose whether the classifier's max depth should be fixed to the number of features via the classifier_restricted parameter. |
| The mean test accuracy on the red wine dataset is reported as 92.76% with standard deviation 0.013 (30 experiments). | Running the provided implementation, we obtained a lower accuracy. | The left box is in line with the accuracy obtained by our implementation (Notebook with results). |
| Models are trained for a given number of iterations. | The reported test accuracy comes from the iteration where the model obtained the highest training accuracy, which is not always the last iteration. | Depends on the experiment. |
| Table 5 on page 8. | The maximal tree depth is limited by the number of features, which is 8 for the Pima dataset, so it is not possible to obtain a higher average depth using their script. | We run both variants, one with restricted classifier depth and one without. |
| "In each bag of the ensemble, we obtain the average bias of the respective individual learner for all the data samples in the test set." | Bias reduction is reported on the train set. | We report bias on the train set. |
| "To find the optimal number of bags for bagging, we run a search with an interval of 10 and select the one with the highest test classification metric." | In the implementation an interval of 5 is used, between 5 and 100. | We use the same search space as in the implementation. |
| The number of folds used while determining the optimal number of bags is not specified. | 3-fold cross-validation is used in the implementation. | For a better estimate we use 5-fold cross-validation. |
| "Bag B will be mutated by replacing selected random samples given by the mutation size (MS) from B with the same amount of random samples from Bc." | It is not clear whether the data instances used to replace samples are drawn with or without replacement; the implementation samples without replacement. | We do the same as the original implementation, i.e. random sampling without replacement. |
| The number of iterations for the n-bit parity problem is not specified. | The iteration number is set to 20. | We fix the number of iterations to 20. |
| Voting rule experiment. | The voting rule affects the crossover part; however, when comparing the two prediction schemes, the model is trained using majority voting, and test-set accuracy is reported for both majority voting and weighted voting. | We implement the experiment as in the repository, as well as our own variant where training and evaluation are both done with each voting scheme. |
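To make the mutation rows above concrete, here is a minimal sketch (not the repository's actual code) of the mutation step under the assumptions stated in the table: a bag is stored as an array of dataset indices (duplicates allowed, since bags are sampled with replacement), and mutation replaces `mutation_size` entries of the bag with indices drawn from the bag's complement without replacement.

```python
import numpy as np

def mutate_bag(bag, n_instances, mutation_size, rng):
    """Replace `mutation_size` entries of `bag` with indices from its complement.

    Hypothetical helper for illustration only; the complement is sampled
    without replacement, as in the original implementation.
    """
    bag = np.asarray(bag)
    # positions inside the bag whose entries will be replaced
    positions = rng.choice(len(bag), size=mutation_size, replace=False)
    # complement of the bag: dataset indices not currently in the bag
    complement = np.setdiff1d(np.arange(n_instances), bag)
    # draw the replacements from the complement without replacement
    replacements = rng.choice(complement, size=mutation_size, replace=False)
    mutated = bag.copy()
    mutated[positions] = replacements
    return mutated

rng = np.random.default_rng(0)
bag = np.array([0, 1, 1, 3, 5])  # duplicate index 1 is allowed
mutated = mutate_bag(bag, n_instances=10, mutation_size=2, rng=rng)
```

Because the replacements come from the complement, exactly `mutation_size` positions change, so the bag size stays fixed; as noted in the first table row, the actual implementations may delete more occurrences than they introduce when a replaced index appears several times in the bag.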
