Employee Attrition Prediction Using Feature Selection with Information Gain and Random Forest Classification



INTRODUCTION
With the rapid development of the economy and industry, employee attrition has become increasingly common in recent years [1]. In a company or agency, attrition, the process of losing employees, is caused by various factors. Employee attrition analysis is one part of people analytics that helps make more appropriate human resource (HR) decisions [2]. Employees are an important element of a company in fulfilling the vision and mission the company wants to achieve. By having superior employees, the company gains a competitive advantage over other companies [3]. Therefore, a system is needed that can manage human resources effectively and efficiently.
Employee attrition can have a negative impact on the company because it brings new problems if not handled properly. When a company replaces employees frequently, its attrition level can be said to be very high. The attrition level itself is measured by the number of employees who stop working within a certain period of time. A high attrition level causes problems for the company, including the time needed to recruit, train, and develop new employees to fill vacant positions [4]; productivity declines, and new employees have to adapt, so performance is not optimal.
Prediction of employee attrition is carried out to determine which factors affect employee attrition and to provide early information about employee reductions that may occur, so that the company can take appropriate action. In this final project, the employee attrition prediction is built using the IBM HR Analytics dataset from the Kaggle.com site [5].
In this study, the authors compare the Information Gain, Select K Best, and Recursive Feature Elimination (RFE) feature-selection methods to find out which factors affect attrition and to give the company early information about employee attrition that may occur. The performance of the three feature-selection methods is then compared using the Random Forest classification method, which is chosen because it is very suitable for developing predictive models [6].
This study differs from previous work: in research [7], Random Forest produced a good accuracy of 0.85 but low precision, recall, and F1 score values, namely a precision of 0.60, a recall of 0.28, and an F1 score of 0.39. Therefore, this study develops the previous research [7], [8] to seek more optimal results.
Figure 1 shows the flow of the system built in this study. The first step is to prepare the dataset, the IBM HR Analytics data from the Kaggle.com site [5]. The second stage divides the dataset into train and test data. The third stage is preprocessing, consisting of encoding, scaling, and sampling, which transforms the raw data so it is easier to understand and ready to be processed by the system in the next stage. The fourth stage is feature selection, to find the features that most influence the prediction process. The fifth stage fits a classification model on the train data. The sixth stage makes predictions. The last stage is an evaluation that measures the performance of the classification process.

Dataset
The dataset used in this study is the IBM HR Analytics dataset from the kaggle.com site [5]. It consists of 1,470 records and 35 attributes, in English. After analyzing the contents of the attributes, the 'EmployeeNumber' attribute is dropped because it is only a sequential identifier, and the 'EmployeeCount', 'StandardHours', and 'Over18' attributes are dropped because each contains the same value for every record. After removing these features, the dataset consists of 1,470 records and 31 attributes. The attributes used are listed in Table 1.
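The column-dropping step above can be sketched with pandas. The small DataFrame here is an illustrative stand-in for the full 1,470-row file; the column names match the real IBM HR Analytics dataset, but the values are invented for the example.

```python
import pandas as pd

# Tiny stand-in for the IBM HR Analytics dataset (illustrative values only;
# the real Kaggle file has 1,470 rows and 35 columns).
df = pd.DataFrame({
    "Age": [41, 49, 37],
    "Attrition": ["Yes", "No", "Yes"],
    "EmployeeCount": [1, 1, 1],       # constant -> no information
    "EmployeeNumber": [1, 2, 4],      # sequential identifier only
    "StandardHours": [80, 80, 80],    # constant
    "Over18": ["Y", "Y", "Y"],        # constant
})

# Drop the four uninformative attributes identified during data analysis.
drop_cols = ["EmployeeCount", "StandardHours", "Over18", "EmployeeNumber"]
df = df.drop(columns=drop_cols)
print(df.columns.tolist())  # ['Age', 'Attrition']
```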

Split Data
In this data splitting process, the dataset is divided into 70% train data and 30% test data. The split uses a fixed random state so that the results are consistent each time the system is run. After splitting, there are 1,029 train records and 441 test records.
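A minimal sketch of the 70/30 split with scikit-learn's train_test_split; the feature matrix and labels here are dummies of the same size as the paper's dataset, and the random_state value is an arbitrary placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy stand-ins with 1,470 samples, matching the dataset size;
# in practice X and y come from the preprocessed DataFrame.
X = np.arange(1470 * 2).reshape(1470, 2)
y = np.array([0, 1] * 735)

# 70/30 split with a fixed random_state for reproducible results.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42
)
print(len(X_train), len(X_test))  # 1029 441
```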

Preprocessing
Preprocessing is an important step before classification. It transforms raw data into data that is ready to use and facilitates the classification process. Preprocessing in this study consists of the following stages:
a. Data Encoding. At this step, dataset variables of type "categorical" are converted to "numerical" to make all data types uniform and modeling easier. The conversion is done using 'LabelEncoder'.
b. Feature Scaling. In an HR dataset, the features generally have different scales [10]. For example, employee age in the dataset ranges from 20 to 50 years, while earnings range from $1,000 to $15,000. Large scale gaps between features usually slow down optimization algorithms [10]. This study normalizes and standardizes the dataset after the type-conversion step. Normalization is done using the MinMax scaler with equation (1):

x' = (x - min(x)) / (max(x) - min(x)) (1)

where x' is the new value, x is the old value, min(x) is the minimum value, and max(x) is the maximum value.
c. Sampling

Figure 2. Sampling Train and Test
At this stage, the dataset is resampled to overcome class imbalance using the SMOTE-ENN method. SMOTE-ENN is a sampling technique that combines over-sampling of the minority class with under-sampling [11]. SMOTE-ENN works by first finding the k nearest neighbors of each observation and then checking whether the majority class among those neighbors matches the observation's class; if it does not, the observation is removed. The number of neighbors used in the ENN step is the default value, k = 3.

Feature Selection
After the preprocessing stage, the next step is feature selection on the dataset. Feature selection aims to eliminate features considered unnecessary and to find out which features most affect employee attrition. This study compares three feature-selection methods:
a. Information Gain. Information Gain is a simple and efficient feature selection method [12]. Feature ranking in Information Gain is based on the most informative features for a particular class [13]. The best feature is determined by first calculating the entropy value using equation (2):

Entropy(S) = - Σ_{i=1}^{c} p_i log2(p_i) (2)

where c is the number of values in the classification class and p_i is the proportion of samples in class i. After that, the Information Gain is calculated using equation (3):

Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v) (3)

where A is an attribute, v is a possible value of attribute A, Values(A) is the set of possible values of A, |S_v| is the number of samples with value v, and |S| is the total number of data samples.
b. Select K Best. SelectKBest is a module in the scikit-learn library that selects the k features with the top scores. Scores are calculated based on univariate statistical analysis, i.e., variables are analyzed one by one [14]. In this study, selecting the k best features uses the SelectKBest() function from scikit-learn.
c. Recursive Feature Elimination.
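The Information Gain and Select K Best steps can both be sketched with scikit-learn; mutual_info_classif is scikit-learn's estimator of an information-gain-style score, and SelectKBest keeps the k highest-scoring features. The synthetic data is a stand-in for the preprocessed attrition features.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Stand-in data: 10 features, of which 3 are informative.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# Score every feature one by one (univariate), then keep the top k = 5.
selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_new = selector.fit_transform(X, y)
print(X_new.shape)  # (200, 5)
```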
Recursive Feature Elimination (RFE) is a recursive process that ranks features according to their importance to the prediction process [15]. In each iteration, feature importance is measured and the least relevant features are removed. The advantages of RFE are that it is easy to configure and use, and it can efficiently select the features that best predict the target variable.
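A minimal RFE sketch, pairing it with a Random Forest estimator as in this study; the synthetic data and the choice of 5 retained features are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Stand-in data for the attrition features.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# RFE repeatedly fits the estimator, ranks features by importance,
# and drops the weakest until n_features_to_select remain.
rfe = RFE(estimator=RandomForestClassifier(random_state=0),
          n_features_to_select=5)
rfe.fit(X, y)
print(rfe.ranking_)  # selected features have rank 1
```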

Figure 3. Random Forest method step
Random Forest is a supervised machine-learning classification technique introduced by Leo Breiman and Adele Cutler in 2000 [6] and developed to improve the decision tree method, which is prone to overfitting [16]. It has since become one of the most popular methods in machine learning [16] and is well suited to developing predictive models [6]. A Random Forest consists of many decision trees, from the 1st tree to the nth tree, where n is the total number of trees in the forest [17]. As shown in Figure 3, the Random Forest method combines the individual decision trees into a single model; the more trees used, the better the accuracy tends to be. The final classification is determined by voting over the formed trees.
This method takes data attributes at random according to a set of rules and builds decision trees consisting of root nodes, internal nodes, and leaf nodes. The root node is the top node, the input commonly called the root of the decision tree. An internal node is a branch node that has at least two outputs and only one input. A leaf node is a final node that has only one input and no output. The decision tree first calculates the Gini value to determine the branch at a node, using equation (4):

Gini(t) = 1 - Σ_j [p(j|t)]^2 (4)

where p(j|t) is the relative frequency of class j at node t. When node p is then divided into k partitions (children), the quality of the split is calculated using equation (5):

GiniSplit = Σ_{i=1}^{k} (n_i / n) Gini(i) (5)

where n_i is the number of records in child i and n is the number of records in node p.
The attribute with the lowest Gini index value gives the best split. After each decision tree produces a class, voting is carried out over the sample data: the votes from all trees are combined, and the class with the most votes is chosen.
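Equations (4) and (5) can be checked on a toy split; the node sizes and labels below are invented purely to make the arithmetic concrete.

```python
# Worked example of equations (4) and (5) on a toy split.
def gini(labels):
    """Gini(t) = 1 - sum_j p(j|t)^2 for the class labels at node t."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

# Parent node with 10 records split into two children of 5 records each.
left = ["yes"] * 4 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 4

# Each child: Gini = 1 - (0.8^2 + 0.2^2) = 0.32.
# GiniSplit = sum_i (n_i / n) * Gini(i)  -- equation (5)
n = len(left) + len(right)
gini_split = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
print(round(gini_split, 2))  # 0.32
```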

Evaluation
This system measures the performance of the classification predictions using the Confusion Matrix, shown in Table 2. The confusion matrix is divided into two classes, positive and negative [18], yielding four categories: true positive (TP), true negative (TN), false positive (FP), and false negative (FN).
a. Accuracy. Accuracy is the proportion of correct predictions over all predictions [18]:
Accuracy = (TP + TN) / (TP + TN + FP + FN) (6)
b. Precision. Precision is the proportion of positive predictions that are correct [18]:
Precision = TP / (TP + FP) (7)
c. Recall. Recall is the proportion of actual positives that are correctly identified [18]:
Recall = TP / (TP + FN) (8)
d. F-Measure. F-Measure balances the values of Precision and Recall [18]:
F-Measure = 2 × (Precision × Recall) / (Precision + Recall) (9)
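Equations (6) through (9) can be verified with toy confusion-matrix counts; the TP/TN/FP/FN values below are invented for illustration.

```python
# Toy confusion-matrix counts illustrating equations (6)-(9).
TP, TN, FP, FN = 40, 45, 5, 10

accuracy = (TP + TN) / (TP + TN + FP + FN)                 # (6)
precision = TP / (TP + FP)                                 # (7)
recall = TP / (TP + FN)                                    # (8)
f_measure = 2 * precision * recall / (precision + recall)  # (9)

print(accuracy, precision, recall, round(f_measure, 3))
```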

AUROC Curve
The ROC curve is widely used by researchers to evaluate predictive results [19]. It is a two-dimensional plot in which the true-positive rate is plotted on the Y axis and the false-positive rate on the X axis. The area under the ROC curve is calculated with the AUC (Area Under the Curve) method. AUC is a fraction of a unit square, so its value always ranges from 0.0 to 1.0; the greater the AUC value, the stronger the resulting classifier [20].
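Computing AUC can be sketched with scikit-learn's roc_auc_score; the labels and scores below are toy values, while in this study the scores would come from the classifier's predicted probabilities on the test split.

```python
from sklearn.metrics import roc_auc_score

# Toy ground-truth labels and predicted probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Area under the ROC curve: 1.0 is a perfect ranking, 0.5 is random.
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```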

RESULTS AND DISCUSSION
In this study, the dataset that has gone through the SMOTE-ENN resampling stage is used to predict employee attrition, with two target classes: yes (1) and no (0). The train data has 1,263 records, consisting of 857 yes and 406 no; the test data has 528 records, consisting of 345 yes and 183 no.
There are four test scenarios. The first examines the performance of the Information Gain feature selection, the second the Select K Best feature selection, and the third Recursive Feature Elimination (RFE). The last scenario examines performance without feature selection. All four scenarios use the Random Forest classification method, and the performance results of the three feature-selection methods are compared.
The performance results of the Information Gain feature selection with the Random Forest classifier are shown in Table 5. Based on Table 5, the highest accuracy, 89.2%, was obtained using 25 features. This suggests that the number of features used in classification modeling is very influential. In this scenario, the more features used, the higher the accuracy, precision, recall, and F1 score. The AUC values are shown in Figure 4; the highest AUROC, 0.953, was obtained using 25 features.
In the second scenario, the Select K Best feature selection is used by ranking the features by their k-best scores, again with the Random Forest classifier. The ranked features can be seen in Table 6 and are tested with 10, 15, 20, and 25 features, comparing the accuracy obtained for each number of features. The results are shown in Table 7.
Based on Table 7, the accuracy, precision, recall, and F1 score increased across the tests with 10, 15, and 20 features but decreased slightly with 25 features. The highest accuracy in this scenario, 87.8%, was obtained using 20 features. The AUC values are shown in Figure 5; the highest AUROC, 0.949, was obtained using 20 features.
In the third scenario, Recursive Feature Elimination is used by ranking the most important features, again with the Random Forest classifier. The top 10, 15, 20, and 25 features are tested, comparing the accuracy for each number of features. The feature ranking produced by Recursive Feature Elimination can be seen in Table 8.
The performance results of Recursive Feature Elimination with the Random Forest classifier are shown in Table 9. Based on Table 9, the more features used, the higher the accuracy: the highest accuracy, precision, recall, and F1 score were obtained using 25 features, with an accuracy of 88.8%. The AUC values are shown in Figure 6; the highest AUROC, 0.950, was obtained using 20 features.

Random Forest Performance without Feature Selection
In this scenario, testing is carried out using all 30 features in the dataset with the Random Forest classification method. The performance results can be seen in Table 10.

Implementation
The results of the feature selection comparison are shown in Figure 7. The accuracy of the Information Gain feature selection increased; using the Information Gain feature selection method, the highest accuracy value was obtained