Implementation of Neural Machine Translation for English-Sundanese Language using Long Short Term Memory (LSTM)

In this modern era, machine translation is used all over the world to solve problems that deal with language. Machine translation serves many purposes, such as learning another language, communicating, finding a certain or better word to use, and even writing a book or another article. It is also used by people who want to translate their native language into a foreign language, and the most widely used international language is English. The input of a machine translation system is a word or a sentence from the source language, which is then translated into the target language. Several methods have been applied to the machine translation task, such as the statistical approach and the neural approach. For Sundanese machine translation, several methods and approaches have been explored by other researchers; however, none of that research translated English into Sundanese. Meanwhile, in the previous five years, an average of approximately 2.67 million people from all over the world came to West Java each year, and approximately 12,163 people ended up staying in the West Java region. This study uses an encoder-decoder LSTM architecture and achieves a good result in building a model for the machine translation task. The model achieves 0.99 accuracy on both the training and testing data, with a loss value below 0.1 for both, and an average BLEU score above 0.8 for both the training and testing data.


INTRODUCTION
West Java is one of the most populous provinces in Indonesia. According to 2022 statistics from the West Java government, approximately 49 million people live in West Java [1]. The people in West Java use the Sundanese language to communicate.
In this modern era, machine translation is used all over the world to solve problems that deal with language. Machine translation is mostly used by people who want to translate their native language into a foreign language, and the most widely used international language is English. Many tourists come to West Java, which is the place of origin of the Sundanese language. In the previous five years, an average of approximately 2.67 million people [2] from all over the world came to West Java each year, and approximately 12,163 people ended up staying in the West Java region. That is the reason why this study builds a machine translation model from English to Sundanese, which is still rarely found in current studies.
Machine translation is the task of translating a source language into another language. Its input is a word or a sentence from the source language, which is then translated into the target language. There are many purposes for using machine translation, such as learning another language, communicating, finding a certain or better word to use, and even writing a book or another article. Several methods have been applied to the machine translation task, such as the statistical approach and the neural approach. The statistical approach mostly performs word-to-word or sentence-level translation based on the features and statistics of words in a certain language. Neural-based approaches use deep learning methods to perform machine translation. Most of these models are Recurrent Neural Networks modified into a Long Short Term Memory (LSTM) model or a Gated Recurrent Unit (GRU). There are two types of neural machine translation approaches based on the form of the data input: the first is word-based neural machine translation, and the second is character-based machine translation. Neural-based machine translation has been improved by attention mechanisms, which form the basis of the transformer model. This study focuses on neural machine translation with a simple LSTM method, since it has proven good enough to perform machine translation tasks for other languages as well.
In terms of Sundanese machine translation, several methods and approaches have been explored by other researchers. For example, Suryani et al. used a phrase-based statistical approach [3] with the help of PoS tag information for Sundanese words. This statistical approach can also be enhanced with monolingual corpus methods, as in the research of Darwis et al. [4]. Some research, instead of using a statistical approach, used neural machine translation or a deep learning approach, which is the state of the art of the current machine translation task according to the survey by Yang et al. [5]. Yustiana et al. [6] used a Recurrent Neural Network-based model to perform machine translation from Indonesian to Sundanese, and that study achieved a great result with a GRU and attention-based model.
Regarding studies on Sundanese machine translation, none of the existing research has translated English into Sundanese. The purpose of this research is to build an English-Sundanese machine translation system with a neural or deep learning approach, because that is the state of the art among current machine translation methods. Building the model covers preparing how the model processes the text and training the model so it can solve the translation problem. Evaluation measures the result yielded by the model. This study goes through data preparation, building the model, and evaluation of the model; see Figure 1.

Data Preparation
The data for this study consist of 20,000 sentences in total across both languages, with English as the source language and Sundanese as the target language. Table 1 shows an example of sentences from both languages. The number of words in each sentence varies, from just one word to more than 20 words. This text data must go through data preprocessing [7], [8], because the model cannot read words directly; they must be represented as vectors before they can be processed. First, the sentences are converted to lowercase, and then the tokenization method is applied. Tokenization is the task of converting sentences into words [9]. Table 2 shows an example of tokenization, which converts all of the sentences into words; the words are then used to build the vocabulary. The vocabulary is later used as the reference for encoding or vectorizing the input.

Table 2. Tokenization example for both the target language and the source language dataset. Tokenization is a process that changes sentences into tokens; the tokens are later used to build the vocabulary.

English:   Sentence: I go to school; Tokens: I, go, to, school
Sundanese: Sentence: Abdi berangkat ka sakola; Tokens: Abdi, berangkat, ka, sakola

The vectorization method used in this study is the simple integer encoding or ordinal encoding method [10]. With integer encoding, the words in the vocabulary are labeled with integers from one up to the size of the vocabulary itself (Table 3). The words that have been labeled with integer numbers are then used as a reference to build the sequence encoder.
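As a concrete illustration, a minimal sketch of this tokenization and vocabulary-building step follows, using the Keras Tokenizer (the paper uses TensorFlow Keras; the variable names and sample sentences are illustrative assumptions, not the paper's code):

```python
# Minimal sketch of tokenization and integer (ordinal) encoding, assuming
# the Keras Tokenizer; sentences are taken from the Table 2 example.
from tensorflow.keras.preprocessing.text import Tokenizer

english_sentences = ["I go to school"]               # source language
sundanese_sentences = ["Abdi berangkat ka sakola"]   # target language

# The Tokenizer lowercases by default (lower=True), and fit_on_texts
# builds the vocabulary from the resulting tokens.
src_tokenizer = Tokenizer()
src_tokenizer.fit_on_texts(english_sentences)
tgt_tokenizer = Tokenizer()
tgt_tokenizer.fit_on_texts(sundanese_sentences)

# Each word is labeled with an integer from 1 up to the vocabulary size,
# exactly the integer encoding described in the text.
print(src_tokenizer.word_index)  # {'i': 1, 'go': 2, 'to': 3, 'school': 4}
print(tgt_tokenizer.word_index)  # {'abdi': 1, 'berangkat': 2, 'ka': 3, 'sakola': 4}
```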
The purpose of the encoding is to convert each sentence in the data into a vector. For example, in Tables 3 and 4, every sentence is converted into a vector using the integer encoding as a reference for each word in the sentence. The vector length differs between the source data and the target data; the neural method makes such flexible input possible.

Table 4. Encoding sentences into integer encoding sequences. This is the final process of changing every sentence into a vector so the method can process it to build a translation model.

After the data has been vectorized, it must be split into a training set and a testing set. The training set is used to train the model, and the test set is used to test the model and to evaluate whether this neural model is good or bad. Table 5 shows the portion and the total number of data used for training, validation, and testing. The purpose of the validation data is to validate the model during the training process; it is treated like a test set, but inside the training process [5].
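Continuing the tokenizer sketch above, a hedged example of encoding every sentence as a fixed-length zero-padded vector and splitting the data follows. The maximum lengths (79/66) and the 18,000/2,000 split follow the paper; post-padding and the use of scikit-learn's train_test_split are illustrative assumptions:

```python
# Encode sentences as integer sequences, zero-pad to fixed lengths,
# and hold out 10% (2,000 of 20,000) for testing.
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

src_seqs = src_tokenizer.texts_to_sequences(english_sentences)
tgt_seqs = tgt_tokenizer.texts_to_sequences(sundanese_sentences)

# Every vector gets one fixed size; unused positions are filled with 0.
X = pad_sequences(src_seqs, maxlen=79, padding="post")
y = pad_sequences(tgt_seqs, maxlen=66, padding="post")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
```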

Building the Model
This study uses a modified Recurrent Neural Network as the model, namely the Long Short Term Memory (LSTM) [11], [12]. The LSTM model is good at capturing the information of a sequence because of its capability to store some memory, which solves the vanishing gradient problem that occurs in a plain Recurrent Neural Network [13].
The input of the model is a vectorized English sentence, and the output is a vectorized Sundanese sentence (Figure 2). The RNN architecture of the neural network model uses the many-to-many architecture [14], [15]. Figure 3 shows the common RNN architectures for different inputs and outputs. The architecture follows the task at hand; for example, in a sentiment analysis task, where the input is a sequence and the output is just one label, the architecture would be many-to-one, and so on.
Since neural machine translation takes sentences of more than one vector length as input and produces output of the same kind, this study uses the many-to-many architecture (Figure 4).

Figure 2. The model input is the source sentence (English), which is processed with the LSTM, the neural-based approach method; the output is the target sentence (Sundanese).

Figure 3. Common RNN architectures used in various types of machine learning tasks. This study uses the many-to-many architecture because the input is a source sentence (one or more tokens) and the output is also a sentence (one or more tokens).

Figure 4. The model input is a sentence that contains one or many words or tokens in the English language, which is mapped or processed to the output, namely Sundanese words or tokens. The number of words in the input and the output can differ, because translation is not always as simple as word-to-word translation.

This model is compiled with the system specification shown in Table 6. The hardware specification mostly uses recent hardware and a recent operating system. The software specification uses the Python programming language in the Anaconda Jupyter Notebook application with the TensorFlow Keras deep learning framework.

Evaluation
The purpose of the evaluation is to know whether the model that has been created is good enough. There are two approaches to evaluating a machine translation model: the automatic approach and the manual approach. The manual approach simply uses human experts in the language field to evaluate the result of the model [16]. The automatic approach uses a certain calculation based on the model result, for example the Bilingual Evaluation Understudy (BLEU) score [17]. This study uses the automatic approach, namely the BLEU score, to evaluate the machine translation model.

Table 6. System specification used to compile the model. Many deep learning tasks require a high specification to process a lot of data; this study processes 16,000 data points in the form of encoded sentences.
The BLEU score is calculated using n-gram matching between the result of the model (the candidate) and the references. The BLEU score lies between 0 and 1, where 0 means not related at all and 1 means most related to all the references. This study uses only one reference and evaluates the training data and testing data using 1-grams, 1-2 grams, 1-3 grams, and 1-4 grams.
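As a concrete illustration, a minimal sketch of this evaluation using NLTK's sentence_bleu with a single reference follows; the cumulative weight tuples correspond to the 1-gram, 1-2 gram, 1-3 gram, and 1-4 gram scores, and the example sentence (taken from Table 10) is only illustrative:

```python
# Single-reference BLEU with cumulative 1- to 4-gram weights (NLTK).
from nltk.translate.bleu_score import sentence_bleu

reference = [["abdi", "resep", "olahraga", "sapertos", "maén", "bal"]]
candidate = ["abdi", "resep", "olahraga", "sapertos", "maén", "bal"]

weights = {
    "1-gram":   (1.0, 0.0, 0.0, 0.0),
    "1-2 gram": (0.5, 0.5, 0.0, 0.0),
    "1-3 gram": (1/3, 1/3, 1/3, 0.0),
    "1-4 gram": (0.25, 0.25, 0.25, 0.25),
}
for name, w in weights.items():
    # A perfect match like this example scores 1.0 at every granularity.
    print(name, sentence_bleu(reference, candidate, weights=w))
```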
In addition to the BLEU score, this study also shows the model loss and accuracy for both the training and validation data [18]. The purpose is to track the performance during the training process.

Data Preparation
The total data used is 20,000 parallel sentences (and their translations), of which 18,000 are for training and validation data and 2,000 for testing data (Figure 5). The length of a sentence (the count of words contained in the sentence) varies. Figure 6 shows an example of random sentences from the source language (English) and the target language (Sundanese) with various numbers of words per sentence. Figure 7 shows the sentences after preprocessing, which consists of lowercasing and deleting unused symbols.
After the data goes through the tokenization process, it yields vocabulary sizes of 3,149 and 3,251 for the source and target language respectively (Table 7). It also yields maximum vector sizes of 79 and 66 for the source and target language respectively. Accordingly, the input vector size is 79 and the output vector size is 66, with zero padding filling the remainder whenever a sentence does not reach the full vector length, as shown in Table 8.

Table 8. Encoding sentences into integer encoding sequences. The vector length of every sentence follows the maximum vector length of its language. For example, a source sentence is encoded into a vector of length 79, so if the sentence contains just one word, the rest of the vector is filled with zeros.

Building the Model
The architecture implementation of the model is shown in Figure 8. The first layer is the input layer, which reads the input, namely the source language vector with a size of 79, and then passes it to the Embedding layer to embed the vector so the LSTM layer can process it. The first LSTM layer is used as the encoder, and its output is processed by the RepeatVector layer. The purpose of the RepeatVector layer is to work as a bridge from the first LSTM layer (the encoder) to the second LSTM layer (the decoder).
Afterward, the output of the LSTM decoder goes to a Dense layer in time-distributed form with a softmax activation function (Equation 3), which distributes the probabilities over each class, namely the words of the target vocabulary. The model summary is shown in Figure 9.
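The softmax equation the text refers to did not survive extraction; for reference, the standard softmax over a target vocabulary of size $K$, which this activation presumably denotes, is

$$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K.$$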
Figure 8. The model architecture first goes through the input layer and then the embedding layer, which embeds the input for the LSTM. The output of the LSTM is processed in the RepeatVector layer before being processed by the LSTM again. Finally, the output goes through the Dense time-distributed layer.

Figure 9. The embedding layer takes an input size of 79, which is the maximum vector length in the source language. The final layer, the Dense time-distributed layer, yields an output size of 66, which is the maximum vector size for the target language.
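For reference, a minimal Keras sketch of this architecture is shown below. The layer order follows Figures 8 and 9; the embedding dimension and LSTM hidden size are illustrative assumptions, since the paper does not report them:

```python
# Encoder-decoder sketch: Embedding -> LSTM encoder -> RepeatVector ->
# LSTM decoder -> TimeDistributed Dense with softmax over the target vocab.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Embedding, LSTM, RepeatVector,
                                     TimeDistributed, Dense)

src_vocab_size = 3149 + 1   # +1 for the zero-padding index
tgt_vocab_size = 3251 + 1
src_len, tgt_len = 79, 66   # maximum vector sizes from Table 7
units = 256                 # assumed hidden size (not stated in the paper)

model = Sequential([
    # Embed the 79-length integer-encoded English sentence.
    Embedding(src_vocab_size, units, input_length=src_len, mask_zero=True),
    # Encoder LSTM compresses the sentence into one state vector.
    LSTM(units),
    # Bridge: repeat the encoder state once per output time step (66).
    RepeatVector(tgt_len),
    # Decoder LSTM produces one hidden state per Sundanese token.
    LSTM(units, return_sequences=True),
    # Softmax over the target vocabulary at every time step.
    TimeDistributed(Dense(tgt_vocab_size, activation="softmax")),
])
model.summary()
```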
The model parameters are shown in Table 9. The optimizer is Adam [19], since this optimizer performs well in most deep learning settings in recent studies [20]. The loss function is categorical crossentropy (Equation 4), which computes the loss for determining the output, namely the words from the target vocabulary. The validation split is 0.1, which means 10% of the training data is used as validation data. The number of epochs is 200, but an Early Stopping callback with a patience of two steps is used to prevent the overfitting problem [21] from occurring in the model.
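Equation 4 likewise did not survive extraction; the standard categorical crossentropy, computed at each output time step between the one-hot target word $y$ and the predicted softmax distribution $\hat{y}$ over the target vocabulary of size $K$, is

$$\mathcal{L} = -\sum_{i=1}^{K} y_i \log \hat{y}_i .$$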

Training Process
The input is a vector of size 79, where each number represents a word token in an English sentence. This input goes to the input layer and then to the Embedding layer, which embeds the input before it is processed by the LSTM. The LSTM layer captures the information in the sequence of words in a sentence. Thereafter, the output of the LSTM is processed by the time-distributed layer to map it to the output vector of size 66, which contains the tokens of the Sundanese sentence. If the model guesses wrongly, it updates the weights until they fit the training sentences well enough; the Adam optimizer searches for the best weights from epoch to epoch. If an indication of overfitting occurs between two epochs, the callback forces the training process to stop. Overfitting occurs when the accuracy or loss for the training data is better than that for the validation data by a certain threshold. This process repeats until 200 epochs are reached or the callback function is triggered.
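A hedged sketch of this training configuration, matching Table 9 (Adam, categorical crossentropy, validation split 0.1, 200 epochs, Early Stopping with patience 2), could look like the following; monitoring val_loss is an assumed detail, since the paper only states the patience value:

```python
# Compile and train the model with the Table 9 parameters.
import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Categorical crossentropy expects one-hot targets over the target vocabulary.
y_train_onehot = tf.keras.utils.to_categorical(y_train,
                                               num_classes=tgt_vocab_size)

early_stop = EarlyStopping(monitor="val_loss", patience=2)

history = model.fit(X_train, y_train_onehot,
                    validation_split=0.1,   # 10% of training data for validation
                    epochs=200,
                    callbacks=[early_stop])
```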

Evaluation
The training process took approximately 6 hours, with 70 epochs in total because of the Early Stopping trigger. As shown in Figure 10, the accuracy and the loss are 0.996 and 0.016 for the training data. The average BLEU score for the training data is 0.863 (Figure 11), and for the testing data it is 0.806 (Figure 12). Despite the long training process, this model achieves a good result for the translation task in terms of loss and accuracy as well as the BLEU score evaluation. For training, the loss and accuracy are 0.016 and 0.996 respectively; for testing, the loss and accuracy are 0.0403 and 0.993 respectively. For the training set, the 1-gram, 1-2 gram, 1-3 gram, and 1-4 gram BLEU scores are 0.957, 0.891, 0.846, and 0.759 respectively, with a 0.863 average. For the test set, the 1-gram, 1-2 gram, 1-3 gram, and 1-4 gram scores are 0.913, 0.835, 0.787, and 0.690 respectively, with a 0.806 average. Table 10 shows five sentence translation examples from the test set; the sentences are mostly translated well, matching the target language.

Figure 1. Data preparation: obtain the data and convert the text into numerical form so the model can process it.

Figure 5. The total number of parallel sentences is 20,000; 18,000 are used as training and validation data, and 2,000 sentences are used as testing data.

Figure 6. An example of the first 10 rows of raw data, already shuffled. The raw data must be processed to make the data clean.

Figure 7. Data after preprocessing, which simply removes the unused symbols and lowercases all of the words, because vocabulary building is case-sensitive.

Figure 10. Accuracy and loss values from epoch to epoch. In this experiment, the epochs were set to 200, but since the model uses an early stopping callback function, training stops at epoch 70 because the last two epochs indicate that the model encounters an overfitting problem.

Figure 11. The BLEU score results for the training data, calculated in 1, 1-2, 1-3, and 1-4 grams.

Figure 12. The BLEU score results on the testing data are not far from the training data, which is a good indication. Calculated in 1, 1-2, 1-3, and 1-4 grams. The 1-gram score is the highest and the rest keep decreasing, because each score is calculated cumulatively over all grams from 1 up to n.

Table 1. An example of the dataset: a sentence from the source language (English) and the target language (Sundanese). The number of words contained in each sentence varies.

Table 3. Integer encoding example for target and source tokens. Every token is changed to a different integer number.

Table 5. Data split into training, validation, and testing sets. Training data is used to train the model; validation data is used to validate the model before the testing data is used, in order to prevent overfitting; the testing data is used to test and evaluate the model.

Table 7. Vocabulary and maximum vector length in both languages. The vocabulary is the set of tokens, and the maximum vector length is the largest number of tokens contained in any encoded sentence.

Table 9. Model parameters for the model architecture: the Adam optimizer, the loss function, the number of epochs, the validation split, and a callback function.

Table 10. Translation examples on the testing data. The English and Sundanese columns are the true values of the translation; the Predict column is the translation result predicted by the model. For these five examples, the results are approximately the same; the only difference is in the third example, between "Abdi resep olahraga sapertos maén bal" and "Abdi resep olahraga sapertos maén bal".