Clustering Content Types and User Motivation using DBSCAN on Twitter

− We are currently in an era full of information and communication technology. One of the communication media used is Twitter. Twitter is a microblogging service that is used by its users to express their thoughts on a topic called a tweet. Tweets that are posted can be either positive tweets or negative tweets. One of the topics that is currently being discussed by Twitter users is Anies Baswedan as a 2024 Indonesian Presidential Candidate. Many people have tweeted this but it is not known how many users support or reject Anies Baswedan to run as a 2024 Indonesian presidential candidate. To assist the analysis, use the method clustering namely algorithm (Density-Based Spatial Clustering of Application with Noise). DBSCAN has the advantage of being able to detect data that is not included in a cluster and will be considered noise. This can improve the accuracy of the grouping because the data in the cluster will be cleaner. The TF-IDF Vectorizer is used to make it easier for programs to manage data because it can turn sentences into vectors that can be processed by the algorithm. To determine the evaluation of the program, the silhouette score method will be used. The results of calculating the silhouette score show a value of 0.29 with the formation of 3 clusters. Then an analysis is carried out based on the top words from each cluster and it can be identified that cluster 0 has a positive category supporting Anies Baswedan to run for the 2024 Presidential Candidate and cluster 1 has a negative category that does not support Anies Baswedan not advancing for the 2024 Presidential Candidate.


INTRODUCTION
The age of information and communication technologies is currently upon us.Advances in technology have provided information and communication resources that are broad than what humans already have.The need for information and communication is no less important than the need for human clothing and food [1].
One service that provides a source of information and communication is Twitter.Twitter is a microblogging service that is used by millions of users to convey ideas or opinions.Users can create, publish and exchange short messages called tweets.Twitter is available on various platforms such as applications on smartphones and websites [2] making it easier for Twitter users to access it anywhere and anytime.
With information and interactions carried out on Twitter, a topic of discussion will be formed indirectly with keywords related to that topic or commonly called hashtags.Currently, there are various kinds of topics on Twitter, one of which is the topic of politics.This topic is being widely discussed considering that the 2024 Indonesian presidential election will soon be held.Many candidates have been informed that they will run for president and one of them is Anies Baswedan.Anies Baswedan is one of the topics that is hotly discussed because of Anies' proven track record, which is proven by many pollsters that have noted that Anies' chances of winning the 2024 presidential election are relatively high [3].One of the keywords in the topic of Anies as a 2024 presidential candidate is "#AniesPresiden2024".This keyword is often used because it relates to Anies' advance as president in 2024.
Previous research that became a reference for making this final project was research conducted by [4].The author uses the DBSCAN method to classify text data with 2,184 text data.Several previous steps were carried out, namely cleaning, eliminating data duplication, stemming, and stopwords.Then classification was carried out with DBSCAN using different Eps and MinPts parameters.The Silhouette Index was used to evaluate, and the result was 0.413 with Eps 0.1 and MinPts 10 parameters.31 clusters were formed with the highest frequency of occurrence of the word "kpu", followed by "firdaus", "kota", "pasang", and "ayat".
There is also research conducted by [5] on the topic of the influence of content and customer engagement in the context of social media.This study aims to determine the effect of information, entertainment, remuneration, and relational content on passive and active engagement behavior of social media users.The data used is data from 12 wine brands on Facebook for 12 months.Multi-variant Linear Regression Analysis was chosen as a method to investigate the effects of content on the behavior, contribution, and engagement of consumers.The results reveal the effect that rational attraction on social media has a superior effect in facilitating the active or passive engagement of social media users whereas emotional appeal facilitates passive rather than very active engagement behavior.
Another reference in [6] reveals the structural dimensions of consumers' motives for using Instagram and to explore the relationship between the identified motivations and the main attitudinal and behavioral intention variables by using a comprehensive survey on a total of 212 Instagram users.The study concluded that Instagram users have five main social and psychological motives: social interaction, archiving, self-expression, escapism, and peeking.Research conducted by [7] discusses the topic of the 2019 presidential election by establishing a tweet grouping system to distinguish topics of discussion regarding the presidential election related to the two presidential candidate pairs.This grouping system uses the DBSCAN method combined with ontology-based concept weighting.This study uses ontology-based concept weighting to apply knowledge about the hierarchical structure of topics, so that each topic that is at the same hierarchical level has equality.In research [8] discussing the 2019 election on Twitter social media, people convey positive and negative comments and even tend to "black campaigns" and hoaxes before the election is held or when the election is in progress regarding the election being held, comments on Twitter at this time cannot be determined more precisely.positive or negative direction, therefore it is necessary to carry out a sentiment analysis to determine the tendency of public opinion towards elections.Based on research that has been done before, in this study using the DBSCAN method which has the advantage of being able to detect data that does not enter into clusters will be considered as Noise data.With this Noise data, clustering will be more optimal because the data included in the cluster is data that is cleaner and more compatible between one data and another.This is what makes this research different from previous related studies.In addition, the DBSCAN method uses the Euclidean distance formula to calculate the distance between data points and the Silhouette score to evaluate the program.The topic raised in this research is also different from previous related studies because it raises the topic of Anies Baswedan as a presidential candidate for RI 2024 which is being hotly discussed on the Twitter platform.This research also discusses the type of content and user motivation to find out how much the public supports and wants to overthrow Anies Baswedan as the 2024 Republic of Indonesia presidential candidate.
Based on research that has been done before, this research was made to detect whether the type of content is positive or negative and the user's motivation in writing a tweet on the topic.The topic chosen is Anies Baswedan as a candidate for President of Indonesia in 2024.The method that will be used is the DBSCAN Algorithm because DBSCAN has the advantage of being able to detect data that is not included in the cluster and will be considered as Noise data.With this Noise data, clustering will be more optimal because the data included in the cluster is data that is cleaner and more compatible between one data and another.This research will be carried out by collecting data, performing preprocessing, inputting it into the DBSCAN algorithm to determine the clusters formed, calculating the Silhouette Score to evaluate the program and performing cluster analysis to determine the type of content and user motivation in each cluster.

RESEARCH METHODOLOGY 2.1 Research Steps
In this study, eight stages will be carried out, namely Crawling data to get tweet data, pre-processing to produce clean data, TF-IDF process to convert sentences into vectors, DBSCAN Clustering to get clusters and Silhouette Score calculations, Data Visualization to display data in each cluster, Cluster Analysis to determine the type of user context and motivation.For more details, it can be seen in Figure 1.

Figure 1. Research Flow
In the research flow, the first process is Crawling data to get the raw data to be used, after getting the dataset, the dataset will enter the pre-processing stage to be cleaned, then enter the TF-IDF weighting stage after that it will be clustered using DBSCAN Clustering then on visualize it to be able to see the data for each cluster, after which it is analyzed to determine the type of content and motivation of the user.

Data Crawling
One way to get datasets is to do Data Crawling.Data crawling is an activity of retrieving data from a website or database [9].One way to do data crawling is to use a python-language program combined with the snscrape library.To do crawling on twitter, use the Twitter API.Tweets were taken from about January 2022 to April 2023.The total data obtained was 49,519 tweets using the ni situ language using five hashtags or search terms, namely "AniesPresiden2024", "#AniesBaswedan2024", "#Anies2024", #AniesPresidenku", and # AniesPresident RI2024".The results of the crawling data will become the dataset used in this study.The results of crawling data can be seen in Table 1.Anomali @aniesbaswedan, dirilis sejumlah Lembaga survey selalu dibawah GP dan PS, tapi setiap agenda jalan2x dimonitor dan dihadang dengan demo kecil-kecilan, oleh orang kecil yang dijanji fulus yang cukup buat makan sehari #AniesPresiden2024

Data Preprocessing
The dataset that has been crawled will be included in the preprocessing to get more optimal results.Six stages will be carried out in preprocessing, namely Cleaning Data, Case Folding Removal, Tokenizing, Data Normalization, Stop Word Removal, and Stemming.The preprocessing stages can be seen in Figure 2.

Data Cleaning
Data cleaning is a process used to remove urls, numbers, enter, tabs and symbols in sentences.This is done because when clustering the data is not used so it can be deleted.

Case Folding
Case folding is a process used to convert all capital letters into non-capital letters [10].This aims to facilitate the process of Stop Word and Stemming in identifying words.

Tokenizing
In text mining, tokenizing is a procedure used to turn sentences into strings of words.Sentences with large dimensions will be divided into several smaller sentences and then separated into rows of words [11].Some symbols that will be identified as delimiters are periods (.), commas (,), and spaces.

Data Normalization
Data normalization is the process of simplifying or changing a word into the standard form of that word according to the KKBI.The purpose of data normalization is to reduce word errors after the data crawling process [12].

Stop Word Removal
Stop word is a process carried out to eliminate words that lack or do not have the information contained in the sentence.This can improve the accuracy of the process because the data that is processed is data that has information value in the word.In Stopwords the missing words are called special words.In English, examples of words to be deleted are all, am, an, are, etc [13].

Stemming
Stemming is the process of mapping and decomposing various forms (variants) of a word into its basic form (stem). Stemming changes words that contain affixes to stems.The main goal of the stemming algorithm is to minimize grammatical forms and get meaningful terms from the morphological structure of the language [14].In this study, a literature library will be used which contains Indonesian words as a reference in carrying out the Stemming process.

Feature Extraction
One of the methods used to determine the significance of words in a document is TF-IDF.The frequency of occurrence of a word in a document determines whether or not the word is significant [15].In the TF-IDF method, feature extraction is used to determine the level of importance of words in a group of documents.To assist in the clustering process, datasets that are in the form of words must be converted into numbers so that they can be read by the program.The TF-IDF formula can be defined in the following equation ( 1) [16].
After obtaining the TF-IDF results from the above formula, then proceed with the normalization process using the Euclidean method with equation (2).

DBSCAN Model Clustering
DBSCAN is a grouping technique or algorithm by creating regions based on related densities.DBSCAN has an advantage over other methods because this method can identify outliers and noise.Outliers or noise can be formed because items are not close to other objects [17].Unlike the K-Means and K-Medoids algorithms, DBSCAN does not require defining the number of clusters to be formed because DBSCAN will identify disordered cluster structures using clustering techniques [18].DBSCAN clustering has 2 parameters, namely MinPts and Epsilons.MinPts serves to determine the minimum data that can be used as initial data to determine the boundaries of a cluster and Epsilons are used to determine the distance between data in a cluster.DBSCAN uses the Euclidean Distance function to determine the distance between items.The Euclidean Distance formula can be seen in equation (3) [19]: Where is the variable of object ( ) and is the Euclidean Distance value.

Sillhouette Score
The performance of the algorithms was assessed using the algorithm's silhouette score value.By measuring both intra-cluster cohesion and inter-cluster separation, Silhouette assists in determining whether allocating a data point to one cluster rather than another is the best course of action.The purpose of the cluster validation technique is to evaluate the cluster results, the results of this evaluation can be used to determine the number of clusters in the dataset.This technique provides a brief graphical representation of how well each object is located within its cluster.The Silhouette Score formula can be seen in equation ( 4) :

Data Visualization
Data visualization is used to display data with various methods, one of which is in the form of graphs or charts.
One way to display visualization data is to use the Word Cloud library to collect words into an image and the Matploblib library to display the image so that it is easy to see [20].

RESULT AND DISCUSSION
The dataset that has been collected by the data crawling process uses five different keywords or hashtags, namely#AniesPresiden2024, #AniesBaswedan2024, #Anies2024, #AniesPresidenku, and #AniesPresidenRI2024.And we have succeeded in obtaining data for a total of 49,519 Indonesian language tweet data.The dataset will enter into the pre-processing process to help clean up the data so that later the data entered into the clustering algorithm can produce more optimal clusters.Figure 3 is the result of the preprocessing process which is visualized in the form of an image using the word cloud library.

Figure 3. Data Visualization
After the preprocessing is done, the feature extraction process is carried out using TF-IDF to convert words into vectors so that it can make the program easier to carry out the clustering process.After that, the dataset will be entered into the DBSCAN Clustering algorithm program.

DBSCAN Clustering
In the clustering process using the DBSCAN Clustering algorithm, the clustering process will be carried out using different parameters to find the best results.In this research, an experiment will be carried out to find parameters using an Epsilons value of 0.01 and MinPts with a range from 1 to 10.To measure the evaluation value in the experiment, calculations are carried out using the silhouette score in each experiment.The results of the experiment can be seen in Table 2.

Silhouette Scoring
In this section, the clustering results from DBSCAN will be calculated with the values from the dataset.The calculations were carried out for a number of trials, and the results of calculating the silhouette score on the clustering results can be seen in Table 3.

Cluster Analysis
The results in Table 3 show that the highest silhouette score is at MinPts 10 with the number of clusters formed being 3 clusters.These three clusters are labeled -1, 0, and 1. Labels with -1 are grouped as Noise.The highest frequency of words in each cluster can be seen in Table 4. From the table above it can be seen that on label 0 there are 4 words namely "Semangat", "Tumbuh", "Ubah", and "Indonesia".These words appear 10 times.On label 1 there are 2 words, namely "Anies" and "Bohong".This result can be obtained because several users write tweets with sentences that are similar to one another, both from the same hashtags and the same content.The contents of label -1 can be categorized as Noise because the data cannot be clustered into 1 cluster because several other words have far-reaching values so they cannot be included in any cluster.
After all the steps are done, the final step is to determine the type of content and user motivation based on the type of tweets related to politics.To do this, it will take the words that appear the most in each cluster and then manually identify them by determining whether the type of content and motivation of users in each cluster is positive (supporting) or negative (bringing down).The cluster labeled -1 will be ignored because the cluster is Noise data which has very diverse information so it cannot be included in any cluster.
In the cluster labeled 0, the words that appear the most are "Semangat", "Tumbuh", "Ubah", and "Indonesia" where the word indicates that this cluster has a positive type of content with user motivation, namely flattering Anies Baswedan as the 2024 presidential candidate.In the cluster labeled 1, the word that appears the most is "Anies", and "Bohong", this makes this cluster a negative type of content because the motivation of users to post tweets with these words is to destroy Anies Baswedan's image as the next 2024 presidential candidate.general elections.The grouping of content types and user motivation based on the words that appear the most can be seen in table 4 and table 5.

Table 3 .
Silhouette Score Result

Table 4 .
Top Word Frequency

Table 5 .
Content Type