Handling Imbalanced Data Sets Using SMOTE and ADASYN to Improve Classification Performance of Ecoli Data Sets

− In this digital era, machine learning is in demand by organizations and individuals alike, and the ability to process data efficiently is essential. As the amount of data grows, machine learning faces various problems; one of them, increasingly common as data sets grow, is class imbalance. Class imbalance is a condition in which one class dominates another, for example when the positive class has far fewer instances than the negative class. The under-represented class is categorized as the minority class, while the class that dominates the data set is called the majority class. Class imbalance can distort classification performance, so handling imbalanced classes is needed to improve classification results. Classification of imbalanced data using Random Forest gives satisfactory results, and SMOTE and ADASYN are widely used, easy-to-implement sampling methods. In this research, we use e-coli protein data sets to evaluate the performance of a random forest classifier with and without oversampling methods, using the f1 score and balanced accuracy as the primary evaluation metrics. The highest average scores were 84% for the f1 score and 90% for balanced accuracy. SMOTE and ADASYN perform similarly in improving classification performance, and we found that balanced accuracy is a better-suited metric for imbalanced classification.


INTRODUCTION
The world has entered the digital age. Various technologies, including the computer, facilitate many human activities, and organizations, groups, and individuals compete to obtain the latest and best information. The methods used to obtain this information are often not effective; machine learning is one step toward making information processing more effective. However, the data obtained is sometimes not in its best form and cannot always be processed efficiently, which is common in large data sets. These large data sets often contain imbalanced classes, which may affect classification performance [1].
Data sets with a significant difference between classes with very few instances, known as minority classes [2], and classes with sufficient instances, known as majority classes, are considered imbalanced data sets. The Synthetic Minority Oversampling Technique (SMOTE) is a frequently used oversampling technique for handling class imbalance and is deemed very successful at generating synthetic data [3]. In brief, SMOTE works as follows: it selects an instance from the minority class as a point, identifies its k-nearest neighbors, and draws a line from the point to one of those neighbors; points along this line become synthetic observations. Because SMOTE generates synthetic observations across all the minority class data, it alters the original data distribution of the minority class. ADASYN has been proposed as an alternative that mitigates this problem [4]: it generates synthetic observations using a weighting scheme that favors the minority class data points that are harder to classify.
The classification method we are using is Random Forest Classifier. In order to create an efficient classification model, our research includes data resampling with SMOTE and ADASYN. The purpose of this research is to compare the effect of imbalanced data sets on classification performance and how an effective oversampling method can improve the performance of the model that has been built.
In 2019, Brandt et al. [4] compared SMOTE and ADASYN for imbalanced data classification. The data set used was a credit card fraud data set collected from Kaggle.com, and the machine learning model was a random forest classifier. SMOTE and ADASYN both worked efficiently with RFC, increasing sensitivity by 2.99% and 2.57%, respectively. In that study, SMOTE was deemed to perform better than ADASYN, with the higher increase in performance.
In 2019, Gameng et al. [5] performed research on a modified adaptive synthetic (ADASYN) SMOTE for imbalanced data sets. The primary data set came from an open admission program of a state college, and a random forest classifier was applied with SMOTE and modified ADASYN. Gameng et al. conclude that the modified ADASYN performs best when evaluated on four performance metrics (accuracy, precision, recall, and F1-score).
In 2021, Syaliman et al. [2] discussed enhancing machine learning classification accuracy. One example data set used is the e-coli data set with eight classes; imbalanced data was handled with SMOTE and Gain Ratio (GR), and the classification model was K-NN. The proposed method performed better than K-NN classification without SMOTE and GR, with an 11.4% increase in accuracy.
In 2021, Ramadhan et al. [6] analyzed SVM classification combined with the SMOTE and ADASYN oversampling methods. The example data set was based on diabetes examination results from the Karya Medika Laboratory and has nine classes. The data set is technical, with parameters that are difficult for the public to understand, and there is a massive difference between the amounts of data for diabetics and non-diabetics. The model achieved 83% accuracy without oversampling; after oversampling, accuracy rose to 85.4% with SMOTE and 87.3% with ADASYN, with SMOTE making more false-negative prediction errors.
In this research, a scope and limitations are applied to evaluate model performance under several research scenarios. The limitation of this research is that classification using Random Forest (RF) is first carried out on the imbalanced data, and then SMOTE / ADASYN is applied before repeating the RF classification. The data used are e-coli data sets obtained from the KEEL website, with imbalance ratios ranging from nine to thirteen. This study has two objectives: to see how well the model performs when analyzing imbalanced data using the SMOTE oversampling method, and to compare this with the ADASYN oversampling method.
Handling imbalanced data sets with SMOTE and ADASYN attempts to solve imbalanced classification problems. We applied oversampling techniques to improve the class distribution of the e-coli data sets. Previous research rarely compares the raw performance of these oversampling techniques, especially on the e-coli data sets. We show how SMOTE and ADASYN work and evaluate how each sampling method affects classification performance. We use the balanced accuracy score as a comparable metric that covers the overall performance of our classifier. This research aims to determine which sampling method works best and whether every data set needs to be resampled to achieve the best classification performance.

Research Stages
The system built in this research is a classification system that uses a random forest classifier on the e-coli data sets. As one of the research scenarios, the SMOTE and ADASYN oversampling methods are also applied. The flow of the system is depicted in Figure 1 below:

Data set
In this study, the data sets used are e-coli data sets obtained from the KEEL data set repository website [7]. The e-coli data set represents classification problems for protein prediction obtained through microbiological data on Escherichia coli (E. coli) bacteria. The data used comprise five data sets with different imbalance ratios (IR) ranging from nine to thirteen. In this research, positive classes are labeled as one (1) and negative classes are labeled as zero (0). The details of the data sets can be seen in Table 1 below:

Data splitting
After all data is collected, the first stage is to split it into two parts: train data and test data. This study uses a split ratio of 80% train data and 20% test data. Details of the split data can be seen in Table 2 below:
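The 80/20 split can be sketched as follows; this is a minimal illustration using scikit-learn's `train_test_split`, where the toy `X` and `y` are stand-ins for one of the e-coli data sets (the actual KEEL features are not reproduced here):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in: X holds feature columns, y holds binary labels
# (1 = positive/minority class, 0 = negative/majority class).
X = np.random.rand(100, 7)
y = np.array([1] * 10 + [0] * 90)

# 80% train / 20% test; stratify keeps the class ratio in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```

Stratifying the split matters for imbalanced data: without it, a small test set could end up with almost no minority instances.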

Oversampling
Oversampling is one technique for overcoming class imbalance in a data set. Oversampling the minority class aims to balance the class distribution by repeating data in the minority class. In 2019, Yahaya et al. [8] stated that training data with a large amount of noise can drastically degrade classification performance. One way to overcome this problem is to reduce class imbalance by oversampling the minority class. This research applies oversampling methods only to the training data. This is because both oversampling techniques may create exact copies that, if applied to the entire data set, could cause certain instances to appear in both the training and testing data, leading to biased classification [9].
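To make the leakage argument concrete, the sketch below rebalances the training split only and never touches the test split. Plain random duplication is used here purely as a stand-in for SMOTE/ADASYN, and the function name is illustrative:

```python
import numpy as np

def oversample_train_only(X_train, y_train, rng=None):
    """Duplicate minority rows until the classes are balanced.

    Random duplication stands in for SMOTE/ADASYN here; the point is
    *where* oversampling is applied: the training split only, so no
    copied instance can also appear in the test split.
    """
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y_train, return_counts=True)
    minority = classes[np.argmin(counts)]
    deficit = counts.max() - counts.min()
    minority_idx = np.where(y_train == minority)[0]
    extra = rng.choice(minority_idx, size=deficit, replace=True)
    return (np.vstack([X_train, X_train[extra]]),
            np.concatenate([y_train, y_train[extra]]))
```

Because the duplicates are drawn only from training rows, the evaluation on the untouched test split stays unbiased.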

SMOTE
SMOTE is the most widely used and popular sampling technique and is considered very successful [10]. SMOTE is a minority-class approach: the under-represented class is oversampled by creating synthetic clones from existing data. SMOTE creates additional training data by taking samples from the minority class and forming new samples based on the k minority-class nearest neighbors, which yields better decision regions for classifiers such as decision trees. SMOTE is also able to reduce the chance of overfitting compared with simple random oversampling [11]. The steps of how SMOTE works can be seen in Table 3 below:
1. Identify the number of nearest neighbors (k) to consider.
2. Calculate the K-NN using Euclidean distance from a minority class sample, then select one of the K-NN at random.
3. Take the vector difference between the selected point and the chosen neighbor, then multiply the difference by a random number between 0 and 1.
4. Identify the new point on the line segment by adding the scaled difference to the selected point.
5. Repeat the process until the required number of synthetic samples is generated.
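A minimal from-scratch sketch of this interpolation procedure is shown below, assuming the minority-class rows are given as a NumPy array (in practice a library implementation such as imblearn's `SMOTE` would normally be used):

```python
import numpy as np

def smote_sample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by SMOTE interpolation.

    For each new sample: pick a minority point, find its k nearest
    minority neighbors (Euclidean distance), pick one neighbor at
    random, and place the new point a random fraction of the way
    along the segment between the two.
    """
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        x = X_min[i]
        d = np.linalg.norm(X_min - x, axis=1)  # distances to all minority points
        d[i] = np.inf                          # exclude the point itself
        neighbors = np.argsort(d)[:k]          # k nearest minority neighbors
        z = X_min[rng.choice(neighbors)]
        lam = rng.random()                     # random fraction in [0, 1)
        synthetic.append(x + lam * (z - x))
    return np.array(synthetic)
```

Every synthetic point is a convex combination of two existing minority points, so it always lies inside the minority class's region of feature space.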

ADASYN
ADASYN is a minority-class approach similar to SMOTE; it differs from SMOTE in the number of samples created for each point. ADASYN samples more heavily in the minority neighborhoods that are harder to learn, using a weighted distribution over the minority class samples based on their learning difficulty [12]. ADASYN works by calculating the degree of class imbalance and then the number of synthetic data examples that need to be generated. The steps of how ADASYN works can be seen in Table 4 below:
1. Calculate the degree of class imbalance, d = m_s / m_l, where m_s is the number of minority instances and m_l is the number of majority instances.
2. Calculate the total number of synthetic observations to generate (G).
3. Find the K-NN for each minority point and calculate a value r_i indicating how many of its neighbors come from the majority class.
4. Normalize the r_i so that they sum to 1.
5. Calculate the number of synthetic observations to generate for each neighborhood (G_i).
6. Generate G_i data points for each neighborhood, producing the new synthetic observations (s_i).
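The distinguishing part of ADASYN is the weighting step; the sketch below computes how many synthetic samples each minority point would receive (the generation step then follows the same interpolation idea as SMOTE). This is a simplified illustration with assumed names, not a full library implementation:

```python
import numpy as np

def adasyn_allocation(X, y, beta=1.0, k=5):
    """Return G_i, the number of synthetic samples ADASYN assigns to
    each minority point (label 1 = minority, 0 = majority)."""
    X_min = X[y == 1]
    m_s, m_l = (y == 1).sum(), (y == 0).sum()
    G = int((m_l - m_s) * beta)             # total synthetic samples needed
    r = np.empty(len(X_min))
    for i, x in enumerate(X_min):
        d = np.linalg.norm(X - x, axis=1)
        neighbors = np.argsort(d)[1:k + 1]  # k nearest, skipping the point itself
        r[i] = (y[neighbors] == 0).mean()   # fraction of majority neighbors
    if r.sum() == 0:                        # no hard points: spread evenly
        r = np.ones_like(r)
    r_hat = r / r.sum()                     # normalize so the weights sum to 1
    return np.round(r_hat * G).astype(int)  # G_i per minority point
```

Points surrounded by many majority neighbors get larger r_i, and hence more synthetic samples, which is exactly the "harder to classify" weighting described above.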

Random Forest
Random forest is an ensemble learning model in which multiple models are trained and combined to make predictions; in random forests, those individual models are decision trees [13]. A decision tree predicts by applying a sequence of rules to the input features. Each decision tree in a random forest is trained on a randomly selected subset of the original data set, and the training process recursively splits the data on selected features until reaching leaf nodes that represent class labels. In the prediction process, each decision tree makes an individual prediction based on the trained model, and each of these predictions is counted as a vote for a particular label. The RF classifier is a popular model due to its ability to handle complications such as missing values and imbalanced data sets [14]. The steps of the Random Forest classifier can be seen in Table 5 below:
1. Random Forest (RF) accepts as input a labelled data set that has been split into a training set and a testing set.
2. RF runs bootstrap sampling to randomly select subsets of the training data and create bootstrap samples.
3. For every bootstrap sample created, RF constructs a decision tree.
4. At each split, RF selects a random subset of features from the total features.
5. Once decision tree construction is finished, RF makes a prediction with each tree on the testing set.
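These steps can be sketched with scikit-learn's `RandomForestClassifier`; the toy data below is an assumed stand-in for an e-coli split, not the actual data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy binary data standing in for an e-coli train/test split.
rng = np.random.default_rng(0)
X = rng.random((100, 7))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Each of the 100 trees is grown on a bootstrap sample of the training
# rows, with a random subset of features considered at every split;
# the final prediction is the majority vote across trees.
clf = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=0)
clf.fit(X[:80], y[:80])
preds = clf.predict(X[80:])
```

The `bootstrap=True` default corresponds to step 2 above, and the per-split feature subsampling (controlled by `max_features`) corresponds to step 4.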

Evaluation
The confusion matrix is used as the evaluation basis in this research to compare factual information with classification prediction results [15]. The confusion matrix is one of the evaluation methods for measuring the performance of a classification system; the confusion matrix table can be seen in Table 6 below. To measure the performance of the classification model, the information obtained from the confusion matrix is used to calculate the precision, recall, F1-score, specificity, and balanced accuracy values.
Precision measures the accuracy of positive predictions by considering all predicted positives, TP and FP [16]:
Precision = TP / (TP + FP)
Recall measures the accuracy of predicting all positive values; it is obtained by dividing TP by all actually positive data [17]:
Recall = TP / (TP + FN)
The F1 score is the harmonic mean of precision and recall, where FP and FN are weighted equally [18]:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Specificity measures the accuracy of predicting all negative values (the true-negative rate) [19]; it is obtained by dividing TN by all actually negative data:
Specificity = TN / (TN + FP)
Balanced accuracy (BA) is the arithmetic mean of specificity and recall [20]:
BA = (Recall + Specificity) / 2
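The five metrics follow directly from the confusion-matrix counts, as this small helper shows (the function name and the example counts are illustrative):

```python
def metrics_from_confusion(tp, fp, fn, tn):
    """Precision, recall, F1, specificity, and balanced accuracy
    computed directly from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # true-positive rate
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)     # true-negative rate
    balanced_accuracy = (recall + specificity) / 2
    return precision, recall, f1, specificity, balanced_accuracy

# Example with a 9:1 imbalance: tp=8, fp=2, fn=2, tn=88 gives
# precision = recall = f1 = 0.8, specificity ≈ 0.978, BA ≈ 0.889.
```

Note how a classifier that labels everything negative on such data would score TN-heavy plain accuracy of 90% yet a BA of only 50%, which is why BA is preferred here for imbalanced evaluation.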

RESULT AND DISCUSSION
The evaluation in this research tests the classification model that has been built, referring to several predetermined metrics; balanced accuracy and the f1-score are chosen as the main comparison scores. The system starts by splitting the data with a ratio of 80% for the training set and 20% for the testing set. Three scenarios are applied. The first scenario compares the performance of the model on the five data sets without any oversampling method, classified with a random forest classifier. The second scenario compares the performance of the model on the five data sets after applying SMOTE oversampling, classified with a random forest classifier. The third scenario compares the performance of the model on the five data sets after applying ADASYN oversampling, classified with a random forest classifier. The detailed description of the scenarios can be seen in Table 7 below:
1. Performance evaluation of the random forest classifier without an oversampling method
2. Performance evaluation of the random forest classifier combined with the SMOTE oversampling method
3. Performance evaluation of the random forest classifier combined with the ADASYN oversampling method

Performance result without oversampling method
In the first scenario, the goal is to compare the five data sets without using oversampling. The results of the first scenario can be seen in Table 8 below:

Performance result with SMOTE oversampling method
The details of the data distribution before and after applying SMOTE can be seen in Table 9. In the second scenario, the goal is to compare the five data sets using the SMOTE oversampling method. The results of the second scenario can be seen in Table 10 below:

Performance result with ADASYN oversampling method
The details of the data distribution before and after applying ADASYN can be seen in Table 11. In the third scenario, the goal is to compare the five data sets using the ADASYN oversampling method. The results of the third scenario can be seen in Table 12 below:

Analysis of experiment result
We compare the performance on imbalanced data with and without the oversampling methods, using two main performance metrics: f1-score and balanced accuracy. Based on the f1 score, as seen in Figure 2, DS1 and DS2 show an increase in f1 score when SMOTE and ADASYN are applied. However, for DS3 the effect of SMOTE and ADASYN is less significant, and even less so for DS4 and DS5. Imbalanced classification without sampling methods has a higher f1 score than balanced classification, even though the IRs of the data sets are not far apart. For DS1, DS2, and DS3, SMOTE and ADASYN perform similarly; for DS4, SMOTE performs slightly better than ADASYN, and for DS5, ADASYN performs slightly better than SMOTE. Although the differences in imbalance ratio are slight, DS3, DS4, and DS5 are the data sets with higher IR, and SMOTE and ADASYN are less effective on them than on DS1 and DS2 with lower IR. From this result we conclude that, based on the f1 score, IR is not the only parameter for deciding whether a data set needs oversampling. Figure 2 below shows the f1 scores of all scenarios on all five data sets.

Figure 2. F1 Score Results
The analysis based on balanced accuracy can be seen in Figure 3. DS1 and DS2 show an increase in balanced accuracy when SMOTE and ADASYN are applied; the increase is less significant for DS3 and DS4, and DS5 experiences a slight decrease in balanced accuracy.
For DS1, DS2, and DS3, SMOTE and ADASYN perform similarly; SMOTE performs better than ADASYN on DS4, and ADASYN performs better than SMOTE on DS5. Although the differences in imbalance ratio are slight, DS3, DS4, and DS5 are the data sets with higher IR, and SMOTE and ADASYN are less effective on them than on DS1 and DS2 with lower IR. From this result we conclude that, based on balanced accuracy, IR is not the only parameter for deciding whether a data set needs oversampling. Figure 3 below shows the balanced accuracy of all scenarios on all five data sets.

Figure 3. Balanced Accuracy Results
Based on Table 11, imbalanced classification has the better average f1 score, while balanced classification has the better average balanced accuracy. We conclude that the balanced accuracy score is a better-suited performance metric for evaluating imbalanced data sets. SMOTE and ADASYN have a higher average balanced accuracy and perform better than imbalanced classification, with an average balanced accuracy of 90% for both SMOTE and ADASYN. The averages of the f1 score and balanced accuracy can be seen in the accompanying table.

CONCLUSION
We draw several conclusions from the analysis carried out in this research on handling imbalanced data sets using SMOTE and ADASYN to improve minority class classification performance. Based on the f1 score, SMOTE and ADASYN are less effective: classification performed better when no oversampling method was applied, with an 84% f1 score, higher than SMOTE and ADASYN at 78% and 76%, respectively. Comparing SMOTE and ADASYN, the two have similar performance, each only slightly better than the other in different cases. Based on balanced accuracy, SMOTE and ADASYN perform better, with higher averages than imbalanced classification: classification performs better when the data set is preprocessed with an oversampling method, with SMOTE and ADASYN averaging 90% balanced accuracy versus 88% for imbalanced classification. A further conclusion is that the imbalance ratio is not the only parameter for deciding whether a data set needs to be resampled. Finally, this research concludes that balanced accuracy is a better-suited performance metric for imbalanced learning.