Depression Detection of User in Media Social Twitter Using Random Forest

Depression is one of the most common mental health disorders. Identifying it early is important for the individual, yet early screening for depression still has drawbacks, and depression that continues to be ignored can harm the individual's health. Other ways of estimating an individual's level of depression are therefore needed, for example through social media such as Twitter, where users share what they experience or feel. This motivates detecting depression in Twitter users. The data come from a questionnaire based on DASS-42 distributed to 159 Twitter users, with 100 tweets collected for each username. This study uses Word2Vec for feature extraction, converting text into vectors that reflect the relationships between words, and Random Forest as the classification method, chosen because it copes with imbalanced classes, especially in very large data sets. Based on the test results, the Random Forest model achieves an accuracy of 68.75%.


INTRODUCTION
Mental health is important both for the individual and for the development of the country. Maintaining mental health not only prevents the development of mental illness but also makes individuals wealthier and better able to interact with the community. According to the World Health Organization (WHO), there are several types of mental illness: depression, anxiety disorders, bipolar disorders, eating disorders, post-traumatic stress disorders, and psychosis [1]. According to Pieper and Uden (2006), mental health is a condition in which a person does not feel guilty about himself, has a religious view of himself, is able to accept his weaknesses, and is satisfied with his life, satisfied with his social life, and happy in his life [2]. The World Federation for Mental Health states that mental health is a condition that allows for optimal physical, mental, and emotional development, insofar as it is compatible with that of other people. A mentally healthy society is one that allows its members to develop according to their abilities.
The importance of identifying depression early is shown by the high incidence of suicide in various countries, caused by the lack of prevention or early treatment in cases of depression [3]. The World Health Organization (WHO) reports that more than 264 million people of all ages suffer from depression worldwide [4]. A committee of the Institute of Medicine on the Prevention of Mental Illness has identified depression as the most preventable disorder [5]. It is therefore very important to identify individuals suffering from depression early in order to minimize the impact on public health. In addition, early detection can reduce the risk of self-harm. In practice, however, early screening for depression still has drawbacks [6], starting with the lack of public awareness of depression and the tendency to neglect it [7]. If depression continues to be ignored, it can affect public health. Other methods are therefore needed to estimate the level of depression in individuals, for example through social media. Social media, especially Twitter, has become a place where users share what they experience or feel. Previous research has examined the identification of depression on Twitter through the tweets that users post. Santos et al. [8] conducted a study to detect mental health issues in Brazil using tweets from Brazilian Twitter users who had been diagnosed by mental health practitioners. In addition, research by Tubagus Rahman Ramadan using the Naive Bayes method found that Twitter can be used as a platform to detect depression through the tweets that users post [9]. These studies support the hypothesis that a person's mental health condition can be analyzed from the tweets they post on social media.
This study proposes the Random Forest method, a choice based on several previous studies. Eky Cahya [10] obtained the highest average accuracy, 90.62%, with the Random Forest method, compared with 82.29% for the Artificial Neural Network method. Ahmed Husseini et al. [11] studied the detection of depression in Twitter users using several methods and noted that the Recurrent Neural Network (RNN) has limitations with long sentences because of exploding or vanishing gradients. The Random Forest classification method is reported to handle high-dimensional data, to be fast, and to suffer little from overfitting [12]. In contrast, Sho Tsugawa et al. [13] detected depression from Twitter activity using a Support Vector Machine (SVM) with a bag-of-words model and achieved low accuracy because of overfitting. Random Forest is a supervised ensemble learning algorithm that in theory handles imbalanced data well and classifies quickly [12]. Feature extraction in this study uses an existing Word2Vec model, following Faisal et al. [14], who compared several feature extraction approaches, namely Trainable, GloVe, Word2Vec, FastText, and Metadata, and found that Word2Vec performed best. The dataset was built by crawling the tweets of respondents to a questionnaire based on DASS-42 and is labeled according to the DASS-42 rules. The use of DASS-42 is also based on previous research: Sitti Rahmah [15] used the DASS-42 instrument to identify stress, anxiety, and depression in college students, and the instrument identified all three successfully.
This study focuses on detection using Twitter data in Indonesian. We hope that the results of this research can be used by companies, especially in the recruitment process, as an indicator of whether prospective employees are depressed. In addition, the results can be used in the healthcare industry to speed up the process of detecting depression in individuals and reduce its operational costs.

RESEARCH METHOD
The system is built with a machine learning technique, namely Random Forest with Word2Vec feature extraction, and uses tweets as the input for detecting depression in Twitter users. Figure 1 shows the research method flowchart.

Figure 1. Research Method Flowchart
Building the system involves several phases, as shown in Figure 1. The first step is data collection, carried out by distributing a form. The next step is data preprocessing, which cleans noise from the words. After preprocessing, feature extraction converts the text data into a form that the model can read. Before classification, the data are split into training data and test data: the training data are used to train the Random Forest model and the test data to test it. In the last step, an evaluation measures the performance of the Random Forest model.
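The split into training and test data can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the split is made at the user level, and the function name, the seed, and the 20% test fraction used in the example are hypothetical choices.

```python
import random

def train_test_split_users(users, test_fraction=0.2, seed=42):
    """Shuffle the users and split them into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(users)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder usernames standing in for the 159 respondents.
users = [f"user_{i}" for i in range(159)]
train, test = train_test_split_users(users, test_fraction=0.2)
print(len(train), len(test))
```

Splitting by user rather than by tweet keeps all tweets of one person on the same side of the split, which avoids leaking a user's writing style from training into testing.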

Data Collection
The data are the result of crawling Twitter for the accounts of respondents who had previously filled out a form based on DASS-42. The crawled data form a labeled dataset, with labels obtained from the DASS-42 scores. The Depression Anxiety Stress Scale (DASS-42) is an instrument for measuring a person's negative emotional states: depression, anxiety, and stress [16]. It has 42 questions that assess symptom severity across three sub-variables, namely physical, psychological, and behavioral. Each sub-variable has 14 items; the division of items can be seen in Table 1. Respondents answer each item on a scale of 0 to 3 (0: never, 1: sometimes, 2: often, 3: almost always), and the answers are summed per disorder, so the maximum total score for each disorder is 3 x 14 = 42. The severity levels of each disorder can be seen in Table 2 [17]. In this study, only the depression scale was used for labeling. To simplify data processing, users with depressive symptoms are labeled "1" and users without depressive symptoms are labeled "0". A sample of the crawled data can be seen in Table 3.
Table 3. Sample of the crawled data
Username | Tweets | Label
User 3 | ya allah mau punya cewe muka bad bitch:(( gagitu konsepnyaa, we still can change this. hold on plisss @pengencptkaya Yg di tiktok bukan? @minorneeds Jika ada umur yg panjang, marilah kita sudahi Siapa disini yg kitchen officenya bau arak bali temen"? | 0
The dataset contains 159 users, each with a username, tweets, and a label; 94 users show symptoms of depression and 65 do not. Figure 2 shows the proportion of labels 0 and 1 based on DASS-42.
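The labeling rule described above (sum the 14 depression items, then map the total to 0 or 1) can be sketched as follows. The cut-off is an assumption: the standard DASS-42 bands treat a depression score of 0 to 9 as normal, but the severity table this paper relies on is not reproduced here, so the threshold the authors actually used may differ. All names are hypothetical.

```python
# Assumed cut-off: standard DASS-42 depression bands treat 0-9 as "normal".
# The threshold actually used for the dataset may differ.
DEPRESSION_NORMAL_MAX = 9

def depression_score(item_answers):
    """Sum the 14 depression-item answers, each on the 0-3 scale."""
    if len(item_answers) != 14:
        raise ValueError("the depression scale has exactly 14 items")
    if any(answer not in (0, 1, 2, 3) for answer in item_answers):
        raise ValueError("each answer must be 0, 1, 2, or 3")
    return sum(item_answers)

def depression_label(item_answers, normal_max=DEPRESSION_NORMAL_MAX):
    """Return 1 for depressive symptoms, 0 otherwise."""
    return 1 if depression_score(item_answers) > normal_max else 0

# Made-up answer sheets for illustration only.
mild_answers = [0] * 10 + [1] * 4   # total score 4
high_answers = [1] * 14             # total score 14
print(depression_label(mild_answers), depression_label(high_answers))
```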

Preprocessing Data
The labeled dataset then enters the preprocessing stage, which produces data that are easier to process and improves the performance of the system. Figure 3 shows the preprocessing pipeline used in this study.

Figure 3. The Preprocessing Pipeline
In the first step, case folding converts all letters to lowercase. After case folding, punctuation removal strips numbers and symbols so that the dataset contains only letters. Next, tokenization divides the text, whether a sentence, a paragraph, or a document, into several parts. Before the normalization step, stop-word removal reduces the number of words in
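The five steps named in this section (case folding, punctuation removal, tokenization, stop-word removal, and normalization) can be sketched as a single function. The stop-word list and slang dictionary below are tiny hypothetical stand-ins; the resources actually used in the study are not specified here.

```python
import re

# Hypothetical resources: small stand-ins for a real Indonesian stop-word
# list and slang dictionary.
STOP_WORDS = {"yg", "yang", "di", "dan", "ini", "itu"}
SLANG_MAP = {"gak": "tidak", "temen": "teman"}

def preprocess(tweet):
    """Return the cleaned list of tokens for one tweet."""
    text = tweet.lower()                                   # 1. case folding
    text = re.sub(r"[^a-z\s]", " ", text)                  # 2. drop digits, punctuation, symbols
    tokens = text.split()                                  # 3. tokenization on whitespace
    tokens = [t for t in tokens if t not in STOP_WORDS]    # 4. stop-word removal
    tokens = [SLANG_MAP.get(t, t) for t in tokens]         # 5. normalization of slang
    return tokens

cleaned = preprocess("Yg di tiktok bukan? 123 gak tau")
print(cleaned)
```

In this sketch the order of the last two steps follows the pipeline as described, so a slang word is only expanded if it survived stop-word removal.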

Feature Extraction
After preprocessing, the next step is feature extraction using Word2Vec, so that the input to the Random Forest algorithm is a vector. Word2Vec is one of the most widely used word embedding methods for creating vector representations of words. It has two architectures, Continuous Bag of Words (CBOW) and Skip-gram. CBOW is faster to compute and gives better representations for frequently used words [18], so this study uses CBOW. The CBOW model takes the context of each word as input and tries to predict the word that belongs to that context. Here is an illustration of the CBOW architecture.
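To make the CBOW idea concrete, the sketch below builds the (context, target) training pairs that a CBOW model learns from. It only illustrates how the pairs are formed, not the neural network itself; the function name, window size, and example sentence are hypothetical.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for CBOW training."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]       # up to `window` words before
        right = tokens[i + 1:i + 1 + window]      # up to `window` words after
        context = left + right
        if context:                               # skip one-word sentences
            pairs.append((context, target))
    return pairs

# Made-up example sentence, already preprocessed into tokens.
tokens = ["aku", "merasa", "sangat", "lelah", "hari", "ini"]
pairs = cbow_pairs(tokens, window=1)
for context, target in pairs:
    print(context, "->", target)
```

In practice a library is normally used for training. In gensim, for instance, the `sg=0` setting of `Word2Vec` selects CBOW, although this section does not state which implementation the study used.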

Random Forest Modeling
The Random Forest algorithm is a classification method with a supervised learning approach. Introduced by Leo Breiman in 2001, it is a very successful algorithm for classification [19], and many researchers have since developed it further. It is often used for classification because it is known for its accuracy and its ability to handle large amounts of data [12]. Random Forest is applied to classification, regression, and related problems. It relies on random vectors sampled with the same distribution for all trees, and each decision tree is grown to its maximum depth [20]. In general, Random Forest is an evolution of bagging: it repeatedly bootstraps the training data and builds a classification tree on each resample. Prediction combines the outputs of the trees, usually by majority vote [21]. The basic algorithm of Random Forest is shown in Figure 4 [21].
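The bagging-and-voting idea can be sketched in a few lines. This is a deliberately simplified illustration, not the algorithm in Figure 4: each "tree" here is a one-split decision stump rather than a tree grown to maximum depth, and the data and all names are made up.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(X, y, feature_ids):
    """Find the best single-threshold split among the given features."""
    best = None  # (impurity, feature, threshold, left_label, right_label)
    for f in feature_ids:
        for threshold in sorted({row[f] for row in X}):
            left = [label for row, label in zip(X, y) if row[f] <= threshold]
            right = [label for row, label in zip(X, y) if row[f] > threshold]
            if not left or not right:
                continue
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or impurity < best[0]:
                best = (impurity, f, threshold, majority(left), majority(right))
    if best is None:  # no valid split: always predict the majority class
        label = majority(y)
        return (0, float("inf"), label, label)
    return best[1:]

def fit_forest(X, y, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]     # bootstrap sample
        feature_ids = rng.sample(range(len(X[0])), n_features)   # random feature subset
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature_ids))
    return forest

def predict(forest, row):
    """Majority vote over the predictions of all stumps."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return majority(votes)

# Made-up two-feature data: class 0 has a low first feature, class 1 a high one.
X = [[0.1, 0.9], [0.2, 0.8], [0.3, 0.7], [0.7, 0.2], [0.8, 0.3], [0.9, 0.1]]
y = [0, 0, 0, 1, 1, 1]
forest = fit_forest(X, y)
print(predict(forest, [0.05, 0.95]), predict(forest, [0.95, 0.05]))
```

An odd number of trees is used so that a two-class vote cannot end in a tie.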

Evaluation
This study uses accuracy and F1-score, both derived from the confusion matrix. A confusion matrix is a tool for analyzing how well a classifier identifies tuples of different classes [22]. It has four values: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). TP is the number of samples correctly predicted as belonging to a class. FP is the number of samples predicted as belonging to the class that do not actually belong to it. FN is the number of samples that belong to the class but that the model fails to predict as such. TN is the number of samples outside the class that are correctly predicted as outside it. These values are used to compute accuracy, precision, recall, and F1-score.
Accuracy is the ratio of correctly classified data to all data [23]:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (1)$$
Precision is the ratio of correctly classified positive data to all data predicted as positive [23]:
$$\text{Precision} = \frac{TP}{TP + FP} \quad (2)$$
Recall is the ratio of correctly classified positive data to all data that actually belong to that class [23]:
$$\text{Recall} = \frac{TP}{TP + FN} \quad (3)$$
F1-score is the harmonic mean of precision and recall [23]:
$$\text{F1-Score} = \frac{2 \times (\text{Recall} \times \text{Precision})}{\text{Recall} + \text{Precision}} \quad (4)$$
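The four metrics can be computed directly from the confusion-matrix counts, as in the sketch below. The counts in the example are made up for illustration and are not results from this study; the function name is hypothetical.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up counts, not taken from the experiments.
metrics = classification_metrics(tp=8, tn=6, fp=2, fn=4)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The guards against zero denominators matter for small test sets, where a model may predict no positives at all.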

RESULTS AND DISCUSSION
This section describes the results obtained from the system model that was built and implemented. One objective of this research is to measure the performance of the Random Forest model. To that end, three tests were carried out:
a. Test 1 compares combinations of data preprocessing steps and the accuracy each combination produces.
b. Test 2 varies the split between training data and test data when building the Random Forest model and compares the resulting accuracy.
c. Test 3 compares the accuracy of the Random Forest model across combinations of parameters; the best parameters are found with the grid search method.

Implementation Result of Test 1
The first test compares the accuracy of the system model across several combinations of preprocessing steps. The results of these experiments can be seen in Table 1.

According to Table 1, the scenario with the best performance applies stop word removal and normalization in preprocessing. Both steps are important in text preprocessing because they clean slang words from the text and restore abbreviated words to their full form. The configuration from this first test is applied in the subsequent tests. The results of the preprocessing stage can be seen in Table 2.

Implementation Result of Test 3
The third test compares the performance of the system model when some parameter values of the Random Forest classifier are changed using the grid search method. The preprocessing combination follows the first test, and the 80:20 split between training and test data follows the second test. The parameters changed are max_depth and criterion, because these two parameters are considered the most influential in Random Forest classification. The best result of this hyperparameter tuning can be seen in Table 4, with an accuracy of 68.75%.
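Grid search simply evaluates every combination of candidate parameter values and keeps the best one. The sketch below shows that loop. The candidate values are hypothetical, and the scoring function is a made-up placeholder so that the example runs without data; its numbers are not results from this study.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Try every parameter combination and return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical candidate values for the two tuned parameters.
param_grid = {"max_depth": [2, 4, 6, 8], "criterion": ["gini", "entropy"]}

def toy_score(params):
    """Placeholder score. A real run would train and validate a Random Forest here."""
    penalty = abs(params["max_depth"] - 4) * 0.05
    bonus = 0.02 if params["criterion"] == "entropy" else 0.0
    return 0.6 - penalty + bonus

best_params, best_score = grid_search(param_grid, toy_score)
print(best_params)
```

With scikit-learn, one common option is `GridSearchCV` wrapped around `RandomForestClassifier`, which accepts `max_depth` and `criterion` directly; whether the study used that tooling is not stated in this section.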

Discussion
The first test examined how the model performs with different combinations of preprocessing steps. The scenario using stopword removal and normalization achieved the best accuracy, because both steps matter for text preprocessing: they clean slang words from the text and restore abbreviated words. This configuration was used in the later tests.
The second test examined how the Random Forest model performs when only the split between training data and test data changes. Its results show that the proportions of training and test data affect the accuracy of the Random Forest model. In general, the more training data is used, the more accurate the Random Forest model becomes. However, a training share close to 100% is not recommended, because it does not give the best accuracy. The best split varies from algorithm to algorithm, so this test needs to be run when building a system.
The second test did not adjust the hyperparameters of the Random Forest model, so the third test was run to improve performance. The results of the third test were very satisfactory: accuracy improved by 12.5 percentage points, from the initial 56.25% to 68.75%. Based on the three tests, the best performance of the system model is obtained by preprocessing with stopword removal and normalization, using 20% of the data for testing, and tuning the Random Forest hyperparameters to max_depth: 4 and criterion: entropy. Table 4 shows the test results for detecting depression in tweets across the three experimental scenarios, with the best accuracy of the Random Forest model.
The accuracy obtained by the Random Forest classification model is 68.75%.

CONCLUSION
This study set out to develop a Random Forest classification model that can detect depression on Twitter. The model was developed together with Word2Vec feature extraction. The results show how to build a system model for detecting depression on Twitter: collect data into a dataset, clean the data through preprocessing, extract features, classify with Random Forest, and evaluate the model. The tests of the model gave an accuracy of 68.75%. The study has shortcomings: crawling did not yield a large amount of data, and the labels were not balanced, which resulted in poor accuracy. For further research, testing can be carried out by adding more datasets, tuning