Depression Detection on Twitter Social Media Using Bidirectional Encoder Representations from Transformers (BERT)

Abstract: Human health is an essential part of a country's welfare. Early detection of a disease is necessary to prevent it from spreading in an area. Social media has developed rapidly and widely, making it convenient for the public to communicate. Depressed people exhibit a variety of depressive symptoms in their behaviour. Psychiatrists often diagnose depression through face-to-face interviews based on the commonly used criteria of the Diagnostic and Statistical Manual of Mental Disorders. Depression is a mental disorder that typically appears with the characteristics of depressed mood, loss of interest and pleasure, unstable body energy, and poor concentration. This research aims to detect people who are depressed using the machine learning-based BERT (Bidirectional Encoder Representations from Transformers) method. BERT performs binary classification of text from the social media platform Twitter to detect depression. Based on the tests that have been carried out, the best accuracy value is 0.7176 (71.76%).


INTRODUCTION
Human health is an essential part of the well-being of a country. Early detection of a disease is necessary to prevent its spread in an area. In today's society, social media plays an essential role, and it can be used to detect depression through posts on the social media platform Twitter. Indonesia is one example where people express their lives and their opinions online, as noted in a study conducted by T. Moslmi (2017) [1].
Social media has developed rapidly and widely, making it easier for people to communicate. In 2018, internet users in Indonesia reached 171.17 million, a significant increase from 143.26 million users in 2017, as reported in a study conducted by G. Mahendra (2021) [2]. Users use Twitter to express their emotions, communicating through Tweets. Tweets usually contain short messages and statements and, in some cases, links to articles, blog posts, podcasts, or videos, as described in a study conducted by M. I. Maulana (2019) [3]. Depression is the leading cause of disability worldwide. Depressed people exhibit various kinds of depressive symptoms in their behaviour. Psychiatrists often diagnose depression through face-to-face interviews based on the commonly used criteria of the Diagnostic and Statistical Manual of Mental Disorders. Globally, an estimated 350 million people suffer from depression, according to a study conducted by G. Shen (2017) [4]. Depression is a common mental disorder that appears with characteristics such as depressed mood, loss of interest and pleasure, unstable body energy, and poor concentration, as described in a study conducted by T. R. Ramadan (2021) [5].
Several methods can be compared with one another. The Naïve Bayes classifier is used to calculate the probability that the proposed data is correct; however, Naïve Bayes cannot measure the accuracy of a prediction, and the method also has weaknesses in feature selection, as noted in a study conducted by A. Pattekari (2019) [6]. Existing research using the Naïve Bayes method found 3,049 cases showing depression and 15,705 that did not, in a study conducted by T. R. Ramadan (2021) [5]. The Support Vector Machine (SVM), meanwhile, can only handle two-class classification, as noted in a study conducted by F. Zikra (2021) [7]; previous research using SVM produced a classification accuracy of only 85.38%, and it was also found that mental disorders account for 16% of the global burden of disease and injury for people aged 10 to 19 years (Trisni Handayani, Dian Ayubi, 2020) [8]. Based on the accuracies reported in existing research, the SVM method performs better than the Naïve Bayes method. The method used in this study is BERT (Bidirectional Encoder Representations from Transformers), a machine learning-based method introduced in 2018. BERT has been trained on large amounts of data, so it can handle various tasks in its field, namely NLP (natural language processing). This study aims to determine BERT's performance in classifying crawled tweets using suitable fine-tuning strategies, as in a study conducted by Y. Ajitama (2021) [9]. BERT can perform binary classification of text on social media, namely Twitter, to detect depression. Depression detection here means identifying Twitter account users who are experiencing symptoms of depression and expressing them on Twitter, as in a study conducted by I. Prapitasari (2019) [10]. The BERT method can also read long text well: the Transformer architecture uses fewer parameters than a CNN model and can produce good performance in a short time, as shown in a study conducted by A. Khan (2020) [11].

System Design
At this stage, to carry out depression detection using data taken from the social media platform Twitter, a system is built following the designed flow. This classification system consists of several stages: dataset retrieval, preprocessing, data splitting, training, testing, modelling, and evaluation. The flow chart of the BERT method can be seen in Figure 1.

Dataset
Data collection involved several stages. In the first stage, respondents filled out a questionnaire form containing a depression screening using the DASS-42 assessment. In the second stage, tweets were crawled from the accounts of respondents who had filled out the questionnaire form. The Indonesian-language tweets form a dataset of 3,867 entries. The dataset contains two columns, label and tweet: label 0 marks tweets containing elements of depression, while label 1 marks tweets that do not, and the tweet column contains the crawled tweets. The labelling was done manually.

Pre-Processing Data
Pre-processing consists of several stages: case folding, cleaning, stopword removal, stemming, and tokenizing. This process tidies up the respondents' data to make it easier to process at later stages. The procedures are as follows:

Case Folding
This process converts all letters/words in the data into lowercase so that the data has a uniform form. Case folding can be seen in Table 1.

Cleaning
This process cleans the data into a form that is easier to process by removing redundant symbols, numbers, punctuation marks, and extra spaces. Cleaning can be seen in Table 2.

Stopword
This process removes words that do not need to be included, after first checking whether these words affect the data. Stopword removal can be seen in Table 3.

Stemming
This process removes affixes at the beginning and the end of words in order to avoid ambiguity in the data. Stemming can be seen in Table 4.
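Illustratively, the pre-processing pipeline above can be sketched in Python. The stopword list and cleaning rules below are simplified assumptions made for illustration only; a real Indonesian pipeline would typically use a dedicated library (for example Sastrawi, which the paper does not name) for the full stopword list and for stemming, which this sketch omits.

```python
import re

# Tiny illustrative stopword list (an assumption, not the study's list).
STOPWORDS = {"yang", "dan", "di", "ke", "aku", "saya"}

def preprocess(tweet: str) -> list[str]:
    # Case folding: make every character lowercase.
    text = tweet.lower()
    # Cleaning: drop URLs, mentions, hashtags, then non-letters and
    # redundant whitespace.
    text = re.sub(r"https?://\S+|@\w+|#\w+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    # Tokenizing: split into word tokens.
    tokens = text.split()
    # Stopword removal: discard words carrying little meaning.
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("Aku merasa SEDIH sekali hari ini... https://t.co/xyz"))
# -> ['merasa', 'sedih', 'sekali', 'hari', 'ini']
```

Each comment marks one of the stages named above; stemming would be applied to the surviving tokens as a final step.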

DASS-42
The DASS is a 42-item questionnaire designed to measure the three negative emotional states of depression, anxiety, and stress. Each respondent's score on each subscale is then graded according to severity; the severity of depression can be measured with the DASS-42, as in a study conducted by N. Syafitri (2020) [12]. In this research, data collection is based on Twitter tweets from respondents who filled out the DASS-42 form. The collected data consists of 15 Twitter accounts detected as depressed, with severity ratings of Very Severe, Severe, and Moderate. This data collection was carried out on July 25, 2021.
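As a rough sketch of how a DASS subscale is graded: the 14 depression items are each scored 0 to 3 and summed, and the total is mapped to a severity band. The cut-off values below are the commonly published DASS depression thresholds, which the paper itself does not list, so they are stated here as an assumption.

```python
# Assumed DASS depression-subscale severity cut-offs (lower bound, label).
DEPRESSION_CUTOFFS = [
    (0, "Normal"), (10, "Mild"), (14, "Moderate"),
    (21, "Severe"), (28, "Extremely Severe"),
]

def grade_depression(item_scores):
    """Sum the 14 depression-item scores (each 0-3) and grade severity."""
    assert len(item_scores) == 14 and all(0 <= s <= 3 for s in item_scores)
    total = sum(item_scores)
    label = DEPRESSION_CUTOFFS[0][1]
    for cutoff, name in DEPRESSION_CUTOFFS:
        if total >= cutoff:      # keep the highest band reached
            label = name
    return total, label

print(grade_depression([3] * 10 + [0] * 4))  # -> (30, 'Extremely Severe')
```

The anxiety and stress subscales are graded the same way, with their own cut-off tables.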

Twitter
Twitter is one of the social media platforms with a large user base, and in 2018 it had many active users (Wearesocial, 2015).

BERT
BERT (Bidirectional Encoder Representations from Transformers) is a language model that uses a fine-tuning approach. BERT is pre-trained in an unsupervised way by looking at the left and right context simultaneously in each layer, as in a study conducted by Y. Ajitama (2021) [9]. BERT uses a technique called Masked Language Modelling, which enables bidirectional training of the model. The standard Transformer consists of two mechanisms: an encoder that reads the input and a decoder that produces the prediction [14]. BERT requires only the encoder: it uses the encoder's attention mechanism and reads the entire text as input, so BERT can build contextual relationships for each token well, as in a study conducted by J. Devlin (2019) [15]. The BERT method can also read long text well. The Transformer architecture uses fewer parameters than a CNN model, so it can produce good performance in a short time, as in a study conducted by A. Khan (2020) [11].

Input BERT
BERT uses WordPiece embeddings with a vocabulary of 30,000 word pieces. A [CLS] token is placed at the beginning of each sequence, and a [SEP] token is placed after each sentence. A segment embedding is then added to each token to distinguish which segment the token belongs to. The input representation is the sum of the token embedding, the segment embedding, and the position embedding, all of the same dimension, as in a study conducted by F. A. Pratama (2020) [16]. The embedding calculation can be seen in Formulas (1) and (2).
Each embedding has a dimension of 768, as can be seen in Figure 2.
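The input construction can be illustrated with a toy example. The vocabulary below is made up purely for illustration; a real pipeline would use a pretrained WordPiece tokenizer (such as HuggingFace's BertTokenizer) with its 30,000-piece vocabulary.

```python
# Hypothetical miniature vocabulary; 101/102 mirror BERT's conventional
# ids for [CLS] and [SEP], and 0 its id for [PAD].
VOCAB = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102,
         "aku": 5, "sedih": 6, "sekali": 7}

def encode(tokens, max_len=8):
    # Wrap the sentence as [CLS] ... [SEP], map tokens to ids, then pad.
    pieces = ["[CLS]"] + tokens + ["[SEP]"]
    ids = [VOCAB.get(t, VOCAB["[UNK]"]) for t in pieces]
    attention_mask = [1] * len(ids)
    while len(ids) < max_len:
        ids.append(VOCAB["[PAD]"])
        attention_mask.append(0)  # padding positions are masked out
    return ids, attention_mask

print(encode(["aku", "sedih", "sekali"]))
# -> ([101, 5, 6, 7, 102, 0, 0, 0], [1, 1, 1, 1, 1, 0, 0, 0])
```

The returned pair corresponds to the input_ids and attention_mask described later in the classification stage.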

BERT Pre-training and Fine-Tuning
In the pre-training process, BERT carries out two tasks, namely Masked Language Modelling (MLM) and Next Sentence Prediction (NSP). MLM allows the BERT method to combine context from the left and the right simultaneously: MLM masks some input tokens, and the model predicts the original vocabulary id of the masked tokens [15]. NSP, in turn, lets the model understand the relationship between two sentences. The BERT method is given sentence pairs as input and learns to identify whether the second sentence follows the first in the original document. When choosing sentences A and B for each pre-training example, 50% of the time B is the actual next sentence after A (IsNext), and the other 50% of the time it is a random sentence from the corpus (NotNext). BERT uses pre-training data from BookCorpus, which contains 800 million words, and from English Wikipedia, which contains 2.5 billion words [15]. After the data has been encoded, the data fed into the BERT model is ready for fine-tuning. There are several strategies for fine-tuning, including [9]:

Text length
This study uses tweet data, so the sequence length will not exceed 512 tokens due to the limited size of a tweet.

Layer selection
In this study, an effective layer must be chosen for the classification task.

Overfitting Problem
To handle overfitting, this study uses the BERT optimizer with a suitable learning rate.

BERT Model
BERT is a pre-trained model trained on a large amount of data, so it has suitable parameters for tasks related to language understanding, one of which is sentiment analysis. For BERT to be used for this analysis, an additional layer is added on top of it to handle the classification task. Fine-tuning is then done using the dataset of this final project, in the form of tweets related to depression detection. This analysis uses the BERT-base model, which contains 110 million parameters with 12 layers, a hidden size of 768, and 12 attention heads. BERT requires only an encoder: it uses the encoder's attention mechanism and reads the entire text as input.
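Conceptually, the additional layer is a small linear classifier over BERT's 768-dimensional [CLS] output. The sketch below uses hand-picked, hypothetical weights purely to show the shape of the computation; in practice the weights are learned during fine-tuning and the layer is implemented with a deep learning framework such as PyTorch.

```python
HIDDEN = 768  # BERT-base hidden size

# Hypothetical weights of the added binary head: one row of weights and
# one bias per label (0 = depression, 1 = no depression).
W = [[0.01] * HIDDEN,    # row for label 0
     [-0.01] * HIDDEN]   # row for label 1
b = [0.0, 0.0]

def classify(cls_vector):
    """Map the 768-dim [CLS] embedding to a predicted label via argmax."""
    logits = [sum(w * x for w, x in zip(row, cls_vector)) + bias
              for row, bias in zip(W, b)]
    return max(range(2), key=lambda k: logits[k])

print(classify([0.5] * HIDDEN))   # with these weights -> label 0
print(classify([-0.5] * HIDDEN))  # with these weights -> label 1
```

Fine-tuning adjusts both these head weights and BERT's own parameters against the labelled tweets.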

Evaluation Metrics
In this depression detection study, Scikit-Learn, a free machine learning library for the Python programming language, is used. This stage describes the machine learning process on models, data, algorithms, and functions, although the calculation and sentiment-analysis process cannot be visualised in detail. The metrics used as benchmarks for the results are: a) F1 score, b) Precision, c) Recall, d) Accuracy.

Precision is the ratio of correctly predicted positive observations to the total predicted positive observations:
Precision = TP / (TP + FP)

Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class:
Recall = TP / (TP + FN)

F1 score is the harmonic mean of precision and recall:
F1 = 2 x (Precision x Recall) / (Precision + Recall)

Accuracy is the ratio of correctly predicted observations to the total observations:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
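A minimal sketch of these four metrics computed directly from the confusion-matrix counts (the study itself uses Scikit-Learn, which provides the same calculations ready-made):

```python
def evaluate(y_true, y_pred):
    """Compute precision, recall, F1 and accuracy for binary labels.

    Assumes at least one predicted positive and one actual positive,
    so the denominators are non-zero.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / len(y_true)
    return precision, recall, f1, accuracy

print(evaluate([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

For this toy example, TP = 2, TN = 1, FP = 1, FN = 1, giving precision, recall, and F1 of 2/3 and accuracy of 0.6.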

RESULT AND DISCUSSION
In the text classification test, the dataset consists of 3,867 Indonesian-language tweets. The dataset contains two columns, label and tweet: label 0 marks tweets containing elements of depression, while label 1 marks tweets that do not, and the labelling was done manually. The testing in this final project focuses on the preprocessing stage and the BERT method. The first scenario tests different data splits, aiming to find the best performance across various divisions of train data and test data. The second scenario tests different batch sizes, that is, the number of data samples distributed to the neural network at a time, aiming to determine the best classification configuration for this test.

Results and discussion of the effect of Classification
In the classification stage, the input data is first converted into a form that BERT can read: a [CLS] token is added at the beginning of the sentence, a [SEP] token at the end, and the sentence is padded with as many [PAD] tokens as needed to reach the chosen sequence length. After the [CLS], [SEP], and [PAD] tokens are added, BERT converts each word token into token ids and returns input_ids and attention_mask, which are later passed into the BERT model. After converting the data into BERT-readable input, a data loader is created to help speed up data retrieval. The next step is modelling; once the model is created, it is trained and evaluated. The classification flow can be seen in Figure 3.
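The data-loader step can be illustrated with a minimal batching sketch. In the study this would typically be done with a framework class such as PyTorch's DataLoader, but the grouping idea is the same:

```python
# Group (input_ids, attention_mask, label) examples into fixed-size
# batches; the final batch may be smaller than batch_size.
def batches(examples, batch_size):
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]

# Ten toy examples, batched four at a time.
data = [([101, 5, 102], [1, 1, 1], 0) for _ in range(10)]
print([len(b) for b in batches(data, 4)])  # -> [4, 4, 2]
```

Changing the batch_size argument here corresponds to the batch-size values (16, 32, 64) compared in the second test scenario.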

Results and discussion of the effect of Split Data
In the first scenario, testing is done by changing the data split. The data is divided into two parts, train data and test data: train data trains the algorithm to find a suitable model, while test data measures the performance of the resulting model. This test uses BERT classification to determine whether or not the data split affects the accuracy value. The results of the first scenario can be seen in Table 5: each data split produces a different accuracy value. The test shows that the 90:10 split between train data and test data gives the best accuracy value, 0.7176 (71.76%), so it can be said that the more train data there is, the better the accuracy value becomes.
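A minimal sketch of the 90:10 split, assuming a simple shuffled split; in practice this is typically done with scikit-learn's train_test_split (test_size=0.1):

```python
import random

def split(data, test_ratio=0.1, seed=42):
    """Shuffle a copy of the data and cut off test_ratio as test data."""
    data = data[:]                        # copy; caller's list untouched
    random.Random(seed).shuffle(data)
    n_test = int(len(data) * test_ratio)
    return data[n_test:], data[:n_test]   # train, test

# 3,867 placeholder tweets, matching the dataset size in the paper.
tweets = [f"tweet_{i}" for i in range(3867)]
train, test = split(tweets)
print(len(train), len(test))  # -> 3481 386
```

With 3,867 tweets, a 90:10 split yields 3,481 training examples and 386 test examples.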

Results and discussion of the effect of Batch Size
In the second scenario, the test is performed by changing the batch size, using values of 16, 32, and 64. Batch size is the number of data samples distributed to the neural network at a time. Since the first scenario showed that the 90:10 data split gives the best accuracy, this test keeps the 90:10 split while changing the batch size to 16, 32, and 64. The results of the second scenario can be seen in Table 6.

The results of the tests carried out, shown in Table 6, are as follows (BERT accuracy, data split 90:10):

Batch size 16: 0.6943
Batch size 32: 0.6580
Batch size 64: 0.7172

Each batch size produces a different accuracy value; with the 90:10 data split, batch size 64 gives the best value for this test. Based on the experiment above, it can be said that the batch size affects the increase or decrease in the accuracy value.

CONCLUSION
After the scenario testing carried out for depression detection on Indonesian-language tweets using the BERT method, it can be concluded that the best result is an accuracy of 0.7176 (71.76%), obtained with a 90:10 split between train data and test data (90% train data, 10% test data), batch size 64, and 4 epochs. The best system performance comes from dividing the data between training data and test data at a ratio of 90:10; with this split, classification using the BERT method produces an accuracy of 0.7176. In this test, the more train data there is, the better the accuracy value; the more test data, the lower the accuracy value; and a larger batch size gives a better accuracy value. The evaluation and analysis stage uses a confusion matrix with four combinations, namely True Negative (TN), True Positive (TP), False Negative (FN), and False Positive (FP); from these combinations, the values of Accuracy, Precision, Recall, and F1-score are calculated. The values obtained on the test set are: Accuracy 71%, Precision 81%, Recall 71%, and F1-score 75%. For further research using the BERT method, a larger dataset is recommended so that a much better accuracy value can be obtained.