Depression Detection on Social Media Twitter Using Hierarchical Attention Network Method



INTRODUCTION
Mental illness, including depression, is not a mild condition that only some mentally weak people experience. According to the World Health Organization (WHO), the number of people suffering from this type of mental illness exceeds 264 million [1]. Depression can affect anyone, regardless of gender, age, race, or social status. Some people with mental disorders are reluctant to admit their condition, while others are sick and do not realize they need help [2]. In addition, some sufferers consciously deny that they are sick and need help because they are ashamed or afraid of being ostracized by those around them. This condition is exacerbated by the lack of social support, because people with mental disorders are often isolated in many communities. As a result, the illness goes unrecognized and is detected only when it is too late. Meanwhile, technology is developing rapidly, especially communication technology through social media, whose features make it easier for users to carry out their daily activities.
Twitter is a platform that enables people from all backgrounds to express their opinions about life, business, products, brands, or services [3], and it can be used to spread information quickly in real time [4]. Twitter is used by various groups of users, including celebrities, political figures, actors, business people, and various leaders, to convey facts or opinions [5]. According to the Kamus Besar Bahasa Indonesia (KBBI), facts are things that resemble reality [6], while opinions are expressions that describe the thoughts or emotions of the author [7]. Through tweets, Twitter users can easily express feelings such as sadness, anger, and confusion, as well as their personal experiences. This phenomenon provides an opportunity for psychologists to obtain additional data through Twitter [2]. Automated analysis of social media has the potential to provide a method for the early detection of depression [8].
According to a study by Semiocast, a social media research institute based in Paris, France, Indonesia has the fifth-highest number of Twitter account users in the world. It also ranks third among countries by the number of tweets posted per day [4]. When examined further, the emotions expressed through these tweets can be associated with mental illness, especially depression. Depression is a negative emotional state that persists over a fairly long period of time [9].
Several studies have used social media as a source to identify depressive disorders through sentiment and emotion analysis approaches and techniques [10]. For example, research by Ivan Sekulic and Michael Strube in 2019 assessed mental health on social media using Reddit data from thousands of users who had been diagnosed with one or more mental disorders. The study tested several methods, namely Logistic Regression, Linear SVM, Supervised FastText, and HAN. The HAN method proved to be better at predicting mental health than the other methods [11].
A study by Hasan et al. [12] used the SVM and Naïve Bayes methods. A weakness of these methods is that linear classifiers do not share parameters between features and classes, which may limit their generalization when the output space is large. One solution to this problem is to factorize the linear classifier into a multilayer neural network [13]. The research that has been done shows that a user's mental health category can be inferred from the user's posts. This indicates that social media can be a useful resource for mental health professionals in anticipating mental illnesses such as depression.
In a study by Zichao Yang et al. [14] titled Hierarchical Attention Networks for Document Classification, the HAN method was tested and compared with several other classification methods: Logistic Regression, SVM, LSTM, CNN, Conv-GRNN, and LSTM-GRNN. The study used datasets from other studies, such as Yelp reviews, IMDB reviews, Yahoo answers, and Amazon reviews. The resulting HAN classification model significantly outperformed CNN, by up to 7.3%, 8.8%, 8.5%, and 10.2%. The study concludes that the Hierarchical Attention Network provides the best performance on all datasets.
A study by Hayatin N [2], titled Implementation of Naïve Bayes Multinomial for Data Classification of Tweets Containing the Term of Depression, used the Sentiment140 dataset from Kaggle, which contains 1,600,000 tweets extracted using the Twitter API. In the experiments, the accuracy was 70%, the precision and recall were 72% and 65%, and the F-measure was 68%. From these tests, it can be concluded that the algorithm used produces poor accuracy.
Given the problems above and a number of relevant studies, the researchers were motivated to conduct a similar study that detects depression in Twitter users using a deep learning algorithm, the Hierarchical Attention Network (HAN). HAN is a multilayer neural network classification method designed to capture two main insights about document structure. It first builds sentence representations and then combines them into a document representation [14]. Research by Ivan Sekulic et al. [11] showed that HAN produces better text classification results than several other methods, such as Logistic Regression, Linear SVM, and Supervised FastText.

Research Flow
To develop a classification system that identifies depression, this research is divided into several steps: collecting the dataset, preprocessing, splitting the data, training, testing, modeling, and evaluating the classification model. Figure 1 shows a description of the system to be built.

Dataset
The study collects tweets from Twitter users who responded to a 42-question depression test based on the Depression Anxiety Stress Scale (DASS-42). DASS-42 is the measuring instrument used in this study to measure the severity of disorders such as depression, anxiety, and stress. The DASS-42 questionnaire, developed by Lovibond, S.H. and Lovibond, P.F., has been tested for validity and reliability and has been declared valid and reliable [15].
For each user, we gathered their tweets, mentions, and replies using the Twitter API. The data are categorized into two labels: a positive label implies that the user's tweet indicates potential depression, while a negative label implies the opposite. The negative label consists of 1856 tweets and the positive label consists of 2013 tweets.

Preprocessing
Preprocessing is a crucial initial step in sentiment analysis and depression detection. It transforms raw data into a format that can be analyzed, chiefly by cleaning and transforming the data. The preprocessing methods used in this study are cleansing, case folding, stopword removal, tokenization, and stemming. Cleansing removes symbols, numbers, punctuation marks, redundant spaces, and non-alphabetic characters. Case folding converts all letters to lowercase. Stopword removal removes words that carry little meaning and have little influence on the subsequent steps. Tokenization splits a sentence into a sequence of words. Stemming reduces a word to its base form by removing affixes. Table 1 shows an example of the preprocessing process, starting with the input data and ending with the cleaned data.

Table 1. Example of the preprocessing process
Step | Input | Output
Tokenization | isu mau offline enakan online bisa sambil kerja | "isu", "mau", "offline", "enakan", "online", "bisa", "sambil", "kerja"
Stemming | isu mau offline enakan online bisa sambil kerja | isu mau offline enak online bisa sambil kerja
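The preprocessing steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the study: the stopword list is a tiny hypothetical sample, `toy_stem` is a one-rule stand-in for a real Indonesian stemmer, and the raw input sentence is invented so that it reduces to the example in Table 1.

```python
import re

# Hypothetical, deliberately tiny stopword list (assumption for illustration).
STOPWORDS = {"yang", "dan", "di", "ke"}


def cleanse(text):
    """Drop symbols, digits, and punctuation, then collapse redundant spaces."""
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def case_fold(text):
    """Convert all letters to lowercase."""
    return text.lower()


def tokenize(text):
    """Split a sentence into a list of words."""
    return text.split()


def remove_stopwords(tokens):
    """Remove tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def toy_stem(token):
    """Toy stemmer: strip the suffix "-an" from longer words.

    A stand-in for a real Indonesian stemmer, which handles many more affixes.
    """
    if len(token) > 5 and token.endswith("an"):
        return token[:-2]
    return token


def preprocess(text):
    """Run cleansing, case folding, tokenization, stopword removal, stemming."""
    tokens = tokenize(case_fold(cleanse(text)))
    tokens = remove_stopwords(tokens)
    return [toy_stem(t) for t in tokens]


# Invented raw input that reduces to the Table 1 example.
print(preprocess("Isu mau OFFLINE, enakan online; bisa sambil kerja!! 123"))
# ['isu', 'mau', 'offline', 'enak', 'online', 'bisa', 'sambil', 'kerja']
```

In this sketch, tokenization runs before stopword removal so that stopwords can be matched as whole tokens; the two steps could also be applied in the opposite order on the raw string.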

Hierarchical Attention Network
The method used in this research is the Hierarchical Attention Network (HAN). HAN consists of a word sequence encoder, a word-level attention layer, a sentence encoder, and a sentence-level attention layer [14]. Figure 3 shows the overall architecture of HAN. It uses GRU-based sequence encoders at the sentence and document levels, producing a document representation in the process. The word sequence encoder creates a representation of each sentence and passes it to the sentence sequence encoder, which, given a list of encoded sentences, produces a document representation. Both the word sequence encoder and the sentence sequence encoder use attention mechanisms to enhance the representation of the input sequence [11].

Word Encoder
A bidirectional GRU is used to annotate words from both directions and thereby incorporate contextual information into each word annotation.
Here x_{it} is the embedding vector corresponding to the word w_{it}, the t-th word of sentence i.
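Following the formulation in [14], the word encoder can be written as below. The symbols W_e (embedding matrix), T (number of words in sentence i), and h_{it} (word annotation) are notation assumed from [14] and are not defined elsewhere in this section.

$$x_{it} = W_e w_{it}, \quad t \in [1, T]$$
$$\overrightarrow{h}_{it} = \overrightarrow{\mathrm{GRU}}(x_{it}), \quad t \in [1, T]$$
$$\overleftarrow{h}_{it} = \overleftarrow{\mathrm{GRU}}(x_{it}), \quad t \in [T, 1]$$
$$h_{it} = [\overrightarrow{h}_{it}, \overleftarrow{h}_{it}]$$

The annotation h_{it} concatenates the forward and backward hidden states, so it summarizes the sentence around the word w_{it}.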

Word Attention
Not all words contribute equally to the meaning of a sentence. An attention mechanism is therefore used to extract the important or informative words and to combine their representations into a sentence vector.
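In the formulation of [14], word attention can be written as follows, where h_{it} is the annotation of the t-th word of sentence i. The parameters W_w, b_w, and the word-level context vector u_w are learned during training; these symbols are assumed from [14].

$$u_{it} = \tanh(W_w h_{it} + b_w)$$
$$\alpha_{it} = \frac{\exp(u_{it}^{\top} u_w)}{\sum_{t} \exp(u_{it}^{\top} u_w)}$$
$$s_i = \sum_{t} \alpha_{it} h_{it}$$

The weight \alpha_{it} measures the importance of each word, and the sentence vector s_i is the weighted sum of the word annotations.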

Sentence Encoder
Like the word encoder, the sentence encoder uses a bidirectional GRU, here to encode the sequence of sentence vectors.
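Following [14], and assuming a document with L sentences whose sentence vectors are denoted s_i (notation assumed from [14]), the sentence encoder can be written as:

$$\overrightarrow{h}_{i} = \overrightarrow{\mathrm{GRU}}(s_{i}), \quad i \in [1, L]$$
$$\overleftarrow{h}_{i} = \overleftarrow{\mathrm{GRU}}(s_{i}), \quad i \in [L, 1]$$
$$h_{i} = [\overrightarrow{h}_{i}, \overleftarrow{h}_{i}]$$

The annotation h_i summarizes the neighboring sentences while still focusing on sentence i.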

Sentence Attention
The sentence-level attention mechanism measures how important each sentence is and combines the sentence representations into a document vector.
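In the formulation of [14], sentence attention mirrors word attention, with a sentence-level context vector u_s and parameters W_s and b_s (symbols assumed from [14]), applied to the sentence annotations h_i:

$$u_{i} = \tanh(W_s h_{i} + b_s)$$
$$\alpha_{i} = \frac{\exp(u_{i}^{\top} u_s)}{\sum_{i} \exp(u_{i}^{\top} u_s)}$$
$$v = \sum_{i} \alpha_{i} h_{i}$$

The document vector v summarizes all sentences of the document and serves as the input feature for classification.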
In this study, we treat a user as a document, which allows HAN to be adapted with little modification. Just as a document is a sequence of sentences, a Twitter user is modeled as a sequence of posts. Tweets are treated as sentences because both consist of a sequence of tokens. This interpretation makes it possible to apply HAN, which has been successful in classifying documents, to Twitter users.
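The user-as-document interpretation can be illustrated with a small pure-Python sketch of two-level attention pooling. It is a simplified illustration under stated assumptions, not the model trained in this study: the GRU encoders are omitted, the learned projection is replaced by a plain tanh, and the context vectors and token vectors are hypothetical hand-picked values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_pool(vectors, context):
    """Attention-weighted sum of vectors.

    Scores are dot(tanh(v), context); the learned projection W, b of the
    full model is omitted here (simplifying assumption).
    """
    scores = [dot([math.tanh(x) for x in v], context) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights


def encode_user(user, word_context, tweet_context):
    """Pool token vectors into tweet vectors, then tweet vectors into a user vector.

    `user` is a list of tweets; each tweet is a list of token vectors.
    """
    tweet_vectors = [attention_pool(tweet, word_context)[0] for tweet in user]
    return attention_pool(tweet_vectors, tweet_context)


# Hypothetical user with two tweets and 2-dimensional token vectors.
user = [
    [[1.0, 0.0], [0.0, 1.0]],  # tweet 1: two tokens
    [[1.0, 1.0]],              # tweet 2: one token
]
user_vector, tweet_weights = encode_user(user, [1.0, 0.0], [0.5, 0.5])
```

The returned `tweet_weights` sum to 1 and indicate how much each tweet contributes to the user representation, which is the quantity the sentence-level attention layer provides in HAN.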

Evaluation
Evaluation is conducted to determine the performance of the developed classification model. In this paper, performance is measured using the confusion matrix. Four values represent the results of the classification process: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). Based on these values, the accuracy, precision, recall, and F1-score can be obtained with the formulas given below:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
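As a minimal sketch, the four metrics can be computed directly from the confusion-matrix counts. The function name and the counts in the usage example are hypothetical and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.

    Each ratio falls back to 0.0 when its denominator is zero.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, chosen only to illustrate the calculation.
metrics = classification_metrics(tp=80, tn=70, fp=30, fn=20)
```

With these hypothetical counts the accuracy is 150/200 = 0.75 and the recall is 80/100 = 0.80.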

RESULTS AND DISCUSSION
The tests in this section tune the parameters used with the Hierarchical Attention Network (HAN) method. The goal of tuning is to find the parameter values that give the best accuracy, so that the model is suitable for classifying depression. The parameters are the partitioning of training and test data, the number of epochs, the batch size, and the preprocessing components that are applied. The first test determines the impact of the split between training and test data. The second test determines the impact of batch size and number of epochs. The last test determines the effect of preprocessing without stemming, without stopword removal, without both, and with complete preprocessing. The data come from Twitter users who filled out the questionnaire. The dataset contains 3,869 tweets divided into two classes, positive and negative. A positive label indicates that the user's tweet may reflect depression, while a negative label indicates the opposite. The negative label contains 1856 tweets and the positive label contains 2013 tweets.

Result and discussion of the effect of data partitioning
Data partition testing is done by changing the proportions of training and test data. For each partition, the precision, recall, F1-score, and accuracy were calculated; the results are shown in Table 2 and visualized in Figure 4. Figure 4 shows that the fourth test achieves the highest value for every metric, with an accuracy of 75.44%. This partition is used in the subsequent tests.

Result and discussion of the effect of batch size and epoch
Batch size and epoch testing is done by changing the batch size and epoch values according to the settings in Table 3. This test uses the data partition with the best result from the previous test, which is 90:10. Table 3 shows the results of the batch size and epoch tests. The third test achieves the highest accuracy, 74.82%. However, the accuracy did not differ significantly between tests; the main difference was in training time.
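The relationship between the data partition, the batch size, and the number of epochs can be sketched as follows. This is an illustrative sketch under assumptions: it splits the 3,869 labeled tweets by simple random shuffling without stratification, the seed and helper names are hypothetical, and the batch size of 32 and 20 epochs are the values reported as best in the conclusion.

```python
import math
import random


def split_indices(n_items, train_fraction=0.9, seed=42):
    """Shuffle item indices and split them into training and test sets."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = int(n_items * train_fraction)
    return indices[:n_train], indices[n_train:]


def steps_per_epoch(n_train, batch_size):
    """Number of parameter updates needed to see every training item once."""
    return math.ceil(n_train / batch_size)


# 1856 negative + 2013 positive tweets, split 90:10.
train_idx, test_idx = split_indices(1856 + 2013, train_fraction=0.9)

# Updates per epoch and in total for batch size 32 and 20 epochs.
updates_per_epoch = steps_per_epoch(len(train_idx), batch_size=32)
total_updates = updates_per_epoch * 20
```

Under these assumptions the split yields 3,482 training and 387 test items, and a batch size of 32 gives 109 updates per epoch. A larger batch size lowers the number of updates per epoch, which is one reason training time changes across the batch size tests even when accuracy does not.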

Result and discussion of the effect of preprocessing
In this scenario, four tests were carried out with different combinations of preprocessing techniques: without stemming, without stopword removal, without both stemming and stopword removal, and with full preprocessing. The accuracy results for this scenario are shown in Table 4. Based on these results, the tests with complete preprocessing and with stopword removal but without stemming improve the performance of the classification system. By comparison, the test without both stemming and stopword removal achieves an accuracy of only 68.37%.

CONCLUSION
In conclusion, this paper developed and analyzed the performance of Hierarchical Attention Networks for identifying depressed and non-depressed participants from their tweets, which were collected from users who filled out the questionnaire. Based on the results of the various test scenarios, the best system performance is obtained when preprocessing is applied without stemming. According to the data partition test, the more training data used, the higher the accuracy; the best data partition is 90% training data and 10% test data. A larger batch size speeds up the training process. However, a larger batch size requires more epochs to reach maximum accuracy, and it requires more capable computing hardware. In this case, the best batch size and epoch values are 32 and 20. By changing the data partition, preprocessing, batch size, and epoch, the highest accuracy obtained in this test is 74.13%. Based on these results, several suggestions can be made for further research. The first is to increase the amount of data labeled by experts. The second is to use a high-performance computer so that the training process is more optimal and gives better results. The third is to apply feature extraction to provide more information and reduce misclassification.