Recommender System Based on Tweets with Singular Value Decomposition and Support Vector Machine Classification

In modern times, the movie industry is growing rapidly. Netflix is one of the platforms that can be used to watch movies and provides many genres and movie titles. With so many genres and titles, people sometimes find it difficult to choose a movie to watch; one solution to this problem is a recommendation system that recommends movies based on user ratings. One method used in recommendation systems is collaborative filtering, and one of the algorithms within collaborative filtering is singular value decomposition. Twitter is one of the places where people often write their opinions about the movies they have watched; in this system, those tweets become the input that is processed into rating data. This research implements a recommendation system based on ratings from tweets, combining collaborative filtering with the Singular Value Decomposition (SVD) algorithm and Support Vector Machine (SVM) classification, applied to both user-based and item-based approaches, with the aim of producing a good movie recommendation model and providing accurate predictions for recommended and non-recommended movies. The test results in this study show that collaborative filtering obtains its best RMSE value of 0.8162 on user-based and 0.5911 on item-based. The combination of the Singular Value Decomposition (SVD) algorithm and Support Vector Machine (SVM) classification using hyperparameter tuning resulted in 81% precision and 81% recall for user-based, and 80% precision and 80% recall for item-based.


INTRODUCTION
In this modern era, the film industry and its supporting technology are developing very rapidly. Movies are entertainment media with various genres and titles to watch. Nowadays, people can watch movies not only in theaters but also on the Netflix platform. Netflix is an application that provides online streaming services, offering movies of various genres and titles [1]. With so many movies available on Netflix, users can become confused about which movie to watch. The solution to this problem is a movie recommendation system on Netflix that recommends highly rated movies to users.
More and more people are accessing social media. This is one of the reasons why the world of technology is growing so rapidly. One of the most popular social media at the moment is Twitter. This is because users can express their feelings, ideas, and thoughts to their followers in the form of tweet updates [2].
There are several types of recommender systems, including collaborative filtering, content-based filtering, knowledge-based, and association-rule-based recommendation. The most successful among the available techniques is collaborative filtering, which falls into two categories, model-based and memory-based [3]. Collaborative filtering is a method that provides recommendations based on the similarity of users [4]. Singular Value Decomposition (SVD) is one of the collaborative filtering algorithms, which factorizes a non-zero matrix [5]. Support Vector Machine (SVM), introduced by Vapnik, is a supervised learning method for classification grounded in statistical learning theory. The method can be seen as a general approach to expressing functions in a higher-dimensional space and in many cases can overcome the "curse of dimensionality" [6].
In previous studies on overcoming data sparsity and scalability, researchers combined collaborative filtering with a Transductive Support Vector Machine (TSVM) based on Active Learning (AL), and with SVMCF4R [3], [7]. Therefore, in this research, the author combines the collaborative filtering technique with the Singular Value Decomposition (SVD) algorithm and Support Vector Machine (SVM) classification, using hyperparameter tuning with grid search to improve classification performance. To the author's knowledge, there has been no research using hyperparameter tuning with grid search to improve the performance of combining the SVD algorithm with SVM classification on top of collaborative filtering.
This research aims to implement a system that combines collaborative filtering techniques with the Singular Value Decomposition (SVD) algorithm and Support Vector Machine (SVM) classification, with the hope that applying SVM classification after collaborative filtering produces a good movie recommendation model and accurate predictions for recommended and non-recommended movies. The collaborative filtering results use RMSE as the model evaluation to determine the best user-based and item-based forms, while the SVM classification uses precision and recall as benchmarks for the recommended and non-recommended movie results.

Data Crawling
We crawled data from Twitter using the snscrape Python library. The crawl produced several review tweets from Twitter users who are trusted movie reviewers. The data was sourced from movie titles on the Netflix platform, covering titles from 2005-2021, and the fields we retrieved were id_tweet, username, date, tweet, and movie title.
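As a sketch of this crawling step, assuming snscrape's Python API (whose attribute names can vary between versions); the query format, field names, and helper functions here are illustrative:

```python
def to_record(tweet_id, username, date, content, title):
    """Shape one crawled tweet into the row format described in the text
    (id_tweet, username, date, tweet, movie title)."""
    return {"id_tweet": tweet_id, "username": username,
            "date": date, "tweet": content, "movie_title": title}

def crawl_title(title, since="2005-01-01", until="2021-12-31", limit=50):
    """Search Twitter for review tweets mentioning one movie title.

    snscrape is imported lazily so the module loads even when the
    third-party dependency is not installed."""
    import snscrape.modules.twitter as sntwitter  # third-party dependency
    query = f'"{title}" review since:{since} until:{until}'
    rows = []
    for i, tw in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
        if i >= limit:
            break
        rows.append(to_record(tw.id, tw.user.username, str(tw.date),
                              tw.content, title))
    return rows
```

In practice one such crawl would be run per Netflix title, and the resulting rows collected into the dataset summarized in Table 1.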
After obtaining data containing reviews of movie titles on Netflix, we select several reviews that match the movie reviews and then choose the single best tweet review discussing the related movie title. The selected data is then supplemented with rating values from specialized movie review websites (IMDb, Rotten Tomatoes, and Metacritic) according to the movie titles on Netflix. The results obtained from the data crawling process are shown in Table 1.

Data Preprocessing
Data preprocessing is an important step in data processing; the goal is to obtain good and efficient data quality. At this stage, the data that was originally in the form of Twitter reviews is converted into a 0-5 rating form that can be used by the recommendation system. The conversion of tweets into ratings consists of several stages: text processing, polarity, and labeling.
Text processing is a stage to obtain more structured data; at this stage, the text is cleaned of punctuation, numbers, emoticons, URLs, and hashtags.
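The cleaning step described above can be sketched with standard regular expressions; the exact patterns used in the study are not specified, so these are illustrative:

```python
import re

def clean_tweet(text):
    """Remove URLs, hashtags/mentions, numbers, punctuation, and
    non-ASCII symbols (emoticons/emoji) from a tweet."""
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # URLs
    text = re.sub(r"[@#]\w+", " ", text)            # hashtags and mentions
    text = text.encode("ascii", "ignore").decode()  # emoji / emoticons
    text = re.sub(r"[^A-Za-z\s]", " ", text)        # numbers and punctuation
    return re.sub(r"\s+", " ", text).strip().lower()
```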
Polarity is a process used to identify how negative or how positive a text is. Polarity is useful for predicting whether sentences contain positive or negative words; for example, in "the movie is cool", the word "cool" has a positive meaning [8]. In this research, polarity is computed using the TextBlob library, which helps the processed text data to identify the meaning of words properly. Text data with a polarity value close to -1 is given a rating between 0-2.4, data with a polarity close to 1 is given a rating between 2.6-5, and data with a polarity value of 0 is given a rating of 2.5.
Labeling in this research is the process of identifying the data from the polarity results to check whether it follows the existing rating context. The text data becomes a rating with a value of 0-5.
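The polarity-to-rating conversion described above can be sketched as follows. A linear mapping is assumed, since the text only gives the endpoint ranges (-1 → 0-2.4, 0 → 2.5, +1 → 2.6-5); the TextBlob call shown in the comment is that library's standard sentiment API:

```python
def polarity_to_rating(polarity):
    """Map a polarity score in [-1, 1] onto the 0-5 rating scale:
    -1 -> 0.0, 0 -> 2.5, +1 -> 5.0 (linear mapping assumed)."""
    return round(2.5 + 2.5 * polarity, 2)

# With TextBlob (third-party), the polarity itself would come from e.g.:
#   from textblob import TextBlob
#   polarity = TextBlob("the movie is cool").sentiment.polarity
```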
Preprocessing 2 converts the rating data, originally on a 0-5 scale, into 0 and 1. In general, at the preprocessing 2 stage, the data used has already gone through the collaborative filtering process. Ratings on a 0-2 scale are converted to 0, meaning the user does not like the movie, and ratings on a 3-5 scale are converted to 1, meaning the user likes the movie.

Collaborative Filtering
Collaborative filtering is a recommendation system that provides recommendations based on information from users, by searching for other users with similar interests. Recommendations are based on a set of users who yield a similarity score [9]. In collaborative filtering, recommendations can be made using similarity based on an item (item-based) or similarity based on the user (user-based).
The steps taken in the collaborative filtering system in this study are data normalization, calculating similarity values, predicting ratings, and evaluating collaborative filtering models.
Data normalization is a process for grouping data attributes into simple, non-redundant, and flexible entities, so that the normalized data has good quality. The formula used for data normalization is formula 1:

$r'_{u,i} = r_{u,i} - \bar{r}_u \quad (1)$

where $r'_{u,i}$ is the normalized rating of item $i$ by user $u$, $r_{u,i}$ is the actual rating of item $i$ from user $u$, and $\bar{r}_u$ is the average rating of the items rated by user $u$.
When calculating the similarity value, there are many ways to find it; one of them is Pearson correlation. The Pearson correlation similarity between users $u_1$ and $u_2$ can be calculated as formula 2 [10]:

$sim(u_1,u_2) = \frac{\sum_{i \in I_{u_1 u_2}} (r_{u_1,i} - \bar{r}_{u_1})(r_{u_2,i} - \bar{r}_{u_2})}{\sqrt{\sum_{i \in I_{u_1 u_2}} (r_{u_1,i} - \bar{r}_{u_1})^2}\,\sqrt{\sum_{i \in I_{u_1 u_2}} (r_{u_2,i} - \bar{r}_{u_2})^2}} \quad (2)$

where $I_{u_1 u_2}$ designates the set of items rated by both $u_1$ and $u_2$, and $\bar{r}_{u_1}$ denotes the average rating of user $u_1$. Next, the rating prediction stage predicts values for the empty rating cells, using the value of $n$ in top-$N$ that yields the smallest RMSE. For item-based prediction, formula 3 applies [10]:

$\hat{r}_{i,j} = \frac{\sum_{k \in N_i(j)} sim(j,k)\, r_{i,k}}{\sum_{k \in N_i(j)} |sim(j,k)|} \quad (3)$

where $N_i(j)$ is the set of $N$ items that are similar to item $j$ and have been rated by user $i$. For user-based prediction, formula 4 applies [10]:

$\hat{r}_{i,j} = \frac{\sum_{u \in N_j(i)} sim(i,u)\, r_{u,j}}{\sum_{u \in N_j(i)} |sim(i,u)|} \quad (4)$

where $N_j(i)$ is the set of $N$ nearest neighbors of user $i$ who have rated item $j$.
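A minimal, dependency-free sketch of these steps (mean-centering, Pearson similarity over co-rated items, and top-N similarity-weighted prediction); the dictionary data layout is illustrative:

```python
from math import sqrt

def mean_center(ratings):
    """Formula 1: subtract each user's mean from their ratings.
    `ratings` is {user: {item: rating}}."""
    centered = {}
    for u, items in ratings.items():
        mean = sum(items.values()) / len(items)
        centered[u] = {i: r - mean for i, r in items.items()}
    return centered

def pearson(a, b):
    """Formula 2: Pearson correlation over the items both users rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    ma = sum(a[i] for i in common) / len(common)
    mb = sum(b[i] for i in common) / len(common)
    num = sum((a[i] - ma) * (b[i] - mb) for i in common)
    den = sqrt(sum((a[i] - ma) ** 2 for i in common)) * \
          sqrt(sum((b[i] - mb) ** 2 for i in common))
    return num / den if den else 0.0

def predict_user_based(ratings, user, item, top_n=5):
    """Formula 4: similarity-weighted average over the top-N most
    similar users who rated `item`."""
    neighbors = [(pearson(ratings[user], ratings[v]), v)
                 for v in ratings if v != user and item in ratings[v]]
    neighbors.sort(reverse=True)
    top = neighbors[:top_n]
    num = sum(s * ratings[v][item] for s, v in top)
    den = sum(abs(s) for s, v in top)
    return num / den if den else 0.0
```

The item-based variant (formula 3) is symmetric: similarities are computed between item columns instead of user rows.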

Singular Value Decomposition (SVD)
The Singular Value Decomposition (SVD) algorithm is a factorization of a real or complex matrix, in which the original matrix is decomposed into three matrices; multiplying the three decomposed matrices together reproduces the original matrix [11]. The equation of the SVD algorithm is:

$A = U S V^T$

where $U$ is an $m \times n$ matrix, $S$ is an $n \times n$ diagonal matrix, and $V$ is an $n \times n$ matrix. $U$ is called the left singular vector matrix; its columns $\{u_k\}$ form an orthonormal basis, so that $u_i \cdot u_j = 1$ for $i = j$ and $u_i \cdot u_j = 0$ otherwise. The rows of $V^T$ contain the elements of the right singular vectors $\{v_k\}$ and likewise form an orthonormal basis. The elements of $S$ are nonzero only on the diagonal and are called singular values, hence $S = diag(s_1, \dots, s_n)$ with $s_k > 0$ for $1 \le k \le r$ and $s_k = 0$ for $(r+1) \le k \le n$. By convention, the singular values are sorted from high to low, with the highest singular value at the top-left index of the matrix $S$ [5]. To compute the SVD, we first calculate the eigenvalues and eigenvectors of $A A^T$ and $A^T A$: the eigenvectors of $A A^T$ form the columns of $U$, while the eigenvectors of $A^T A$ form the columns of $V$. In addition, the singular values $s_k$ in $S$ are the square roots of the eigenvalues of $A A^T$ or $A^T A$.
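A minimal illustration of these properties using NumPy (assuming NumPy is available; the matrix is arbitrary example data):

```python
import numpy as np

# Decompose a small matrix A into U, S, V^T and verify that multiplying
# the three factors back together recovers A.
A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
S = np.diag(s)            # singular values on the diagonal, high to low
A_rebuilt = U @ S @ Vt    # U S V^T reproduces the original matrix

# The singular values are the square roots of the eigenvalues of A A^T.
eigvals = np.linalg.eigvalsh(A @ A.T)
```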

Support Vector Machine (SVM)
Support Vector Machine (SVM) is a classification algorithm that provides a more efficient and accurate classification process than other classification methods because it applies the principle of Structural Risk Minimization (SRM), which guarantees a low classification error rate [12]. SVM constructs a hyperplane that separates the data points of each class [13]. The best hyperplane is the one at the maximum distance from the supporting points of each class; in particular, to reduce the global error, the margin should be as large as possible. SVM is an appropriate classifier for separating two classes in the input space because the SVM algorithm seeks the best hyperplane [14]. The two classes separated by the hyperplane are labeled +1 and -1, as in formulas 5 and 6 [14]:

$w \cdot x_i + b \ge +1 \quad \text{for } y_i = +1 \quad (5)$
$w \cdot x_i + b \le -1 \quad \text{for } y_i = -1 \quad (6)$

where $x_i$ is the $i$-th data point, $w$ is the weight vector of the support vectors, perpendicular to the hyperplane, $b$ is the bias value, and $y_i$ is the class of the $i$-th data point.
SVM is a supervised learning method that focuses on extracting features from user profiles and training classifiers for the classification process [12]. The SVM method is usually used in two ways; in the first, applied to the user-item matrix, all items are used as features [12]. The main principle of SVM is linear classification, but it has been extended to overcome non-linear problems by using the kernel trick, as in formula 7 [15]:

$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j) \quad (7)$

where $\phi$ maps the input vectors into a higher-dimensional feature space.
The use of kernels can optimize the Support Vector Machine (SVM) classification process, provided the kernel function to be used is chosen appropriately. The following kernels can be used, as in formulas 8, 9, 10, and 11 [15].
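For reference, the kernel functions commonly used with SVM are listed below. These standard forms are assumed (the referenced formulas are not reproduced in the text); they include the rbf and poly kernels tuned later in this study:

```latex
K(x_i, x_j) = x_i \cdot x_j                              % linear
K(x_i, x_j) = (\gamma\, x_i \cdot x_j + r)^d             % polynomial
K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)    % RBF
K(x_i, x_j) = \tanh(\gamma\, x_i \cdot x_j + r)          % sigmoid
```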
Hyperparameter tuning is a method for algorithm optimization. Techniques in hyperparameter tuning that can be used are grid search, random search, evolutionary algorithms, and sequential model-based optimization [16]. In this study, the authors use hyperparameter tuning with grid search. The grid search algorithm works by trying every combination of parameter values and returning the combination with the highest score [17].
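Grid search itself is simple to sketch without dependencies. The scoring function below is a placeholder (in practice it would be cross-validated SVM precision/recall, e.g. via scikit-learn's GridSearchCV), and the example grid mirrors the C, gamma, and kernel parameters tuned in this study:

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Exhaustively try every parameter combination and return the
    best-scoring one together with its score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Illustrative grid over the parameters tuned in this study.
grid = {"C": [0.1, 1, 10], "gamma": [0.1, 1], "kernel": ["rbf", "poly"]}
```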

Performance Evaluation
Evaluating the CF model is the process of calculating the RMSE value of the collaborative filtering method. Root Mean Square Error (RMSE) penalizes large errors in the predicted ratings more heavily [18]. The closer the RMSE value is to 0, the better. RMSE is found as formula 12 [18]:

$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (r_i - \hat{r}_i)^2} \quad (12)$

where $r_i$ is the actual rating, $\hat{r}_i$ is the predicted rating, and $n$ is the number of predictions.
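The RMSE computation described above can be sketched directly:

```python
from math import sqrt

def rmse(actual, predicted):
    """Root of the mean squared difference between actual and predicted
    ratings; closer to 0 is better."""
    assert len(actual) == len(predicted)
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                / len(actual))
```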
Furthermore, to evaluate the classification model, performance on classification accuracy metrics can be measured using the confusion matrix. To calculate the ratio of relevant recommendation results, precision and recall are computed.
The confusion matrix has four cells: True Positive (TP), False Positive (FP), False Negative (FN), and True Negative (TN). True Positive (TP) is the amount of data that is positive and correctly predicted as positive, False Positive (FP) is the amount of data that is negative but predicted as positive, False Negative (FN) is the amount of data that is positive but predicted as negative, and True Negative (TN) is the amount of data that is negative and correctly predicted as negative.
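Precision and recall follow directly from these cells; a minimal sketch with illustrative counts:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN),
    from the confusion matrix cells described above."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```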

RESULT AND DISCUSSION
In this research, the first step is to calculate the predicted rating values using the memory-based collaborative filtering methods, namely user-based and item-based; the results are then evaluated using the RMSE value to select the best top-N. The second step is the classification process using the Support Vector Machine (SVM) algorithm, optimized with grid search hyperparameter tuning, which aims to determine which films will be recommended or not recommended; this step is evaluated using precision and recall values.

Data Result
Based on crawling data on Twitter, we used 785 movie titles on Netflix and 30 Twitter users, using the snscrape library. We then selected one review tweet from thousands of review data items for each related movie title; after selecting the best review, we obtained a total of 3134 data items. The selected reviews then go to the text processing stage, where punctuation, numbers, emoticons, URLs, and hashtags are cleaned. Table 2 displays the crawled data after selecting the one review that matches each movie title and applying text processing. The polarity stage follows, which produces values from -1 to 1 and converts them into ratings of 0-5; the pattern can be seen in Table 3.

Collaborative Filtering Result
Using the dataset that is already in matrix form, the normalization process is carried out to make it non-redundant and flexible. Normalization is performed both user-based and item-based; the normalized user-based data can be seen in Table 5 and the item-based data in Table 6. After normalizing, the similarity values are calculated using Pearson correlation, as shown for user-based similarity in Table 7 and item-based in Table 8. If the similarity value is close to 1, the two items are very similar; close to -1, the two items are very different; and close to 0, they have little correlation. Next, the similarity values are used to calculate the prediction ratings with top-N. The prediction ratings are evaluated using RMSE to find the best n value; the relationship between RMSE and n is shown in Figure 3 for user-based and Figure 4 for item-based. A better RMSE value is one closer to 0, so the best item-based RMSE occurs at top-N = 5 with an RMSE of 0.5911, and the best user-based RMSE at top-N = 5 with an RMSE of 0.8162. Each top-N that produces the best RMSE value is then selected to calculate the prediction ratings used in the Support Vector Machine (SVM) classification in the next step. Dataset 2, based on the prediction ratings of the best top-N values, is used for the user-based and item-based classification process, as shown in Table 9 for the user-based dataset and Table 10 for the item-based dataset.

SVM Classification Result
The dataset used in this classification process is obtained by changing the 0-5 rating values to 0 and 1: rating values of 0-2 are changed to 0 and rating values of 3-5 are changed to 1. A value of 1 means the user likes the movie and a value of 0 means the user does not. Table 11 shows the resulting 0/1 values for user-based and Table 12 for item-based. After the dataset has been converted to 0 and 1, it is processed with the SVD algorithm and SVM classification. In the SVM classification, hyperparameter tuning is done using grid search; the SVM data process can be seen in Table 13. Grid search finds the best parameter values to optimize precision and recall, and the hyperparameter tuning steps are shown in Table 14. After hyperparameter tuning, the best parameters for user-based data are C = 0.1, Gamma = 1, and Kernel = rbf, with equal precision and recall values and an optimal increase in precision after tuning. For item-based data, the best parameters are C = 0.1, Gamma = 1, and Kernel = poly, also with an optimal increase in precision after tuning.

CONCLUSION
In this research, we developed a collaborative filtering method combined with the Singular Value Decomposition (SVD) algorithm and Support Vector Machine (SVM) classification, using a dataset crawled from Twitter combined with rating reviews from the IMDb, Rotten Tomatoes, and Metacritic websites. The data is processed to produce movie ratings. Collaborative filtering rating prediction using user-based and item-based approaches obtains the best top-N value at top-N = 5 for user-based with an RMSE of 0.8162, and top-N = 5 for item-based with an RMSE of 0.5911. The best top-N results are then processed by the Support Vector Machine (SVM) classification algorithm, optimized with grid search hyperparameter tuning. With the best parameters C = 0.1, Gamma = 1, Kernel = rbf, user-based classification achieves 81% precision and 81% recall; for item-based, with the best parameters C = 0.1, Gamma = 1, Kernel = poly, precision is 80% and recall is 80%. It can be concluded that the combination of collaborative filtering and the SVM classification algorithm can be used to determine whether a movie should be recommended or not, and that hyperparameter tuning with grid search can find the best parameters to produce optimal values. Future research can improve the performance of the recommendation system with a larger dataset, and the recommendation system can be combined with other methods, such as content-based filtering and other classification algorithms.