Multi-Class Speaker Recognition Using Deep Learning with the CN-Celeb Dataset
Abstract
Speaker recognition is widely used in everyday technology, for example in Apple's Siri, Microsoft's Cortana, and Google's Voice Assistant. A recurring problem in building speaker recognition systems is the dataset used for modeling: most datasets do not represent real-world conditions, so models trained on them perform poorly when deployed. This study develops a speaker recognition model using deep learning (LSTM) with the CN-Celeb dataset. CN-Celeb was collected directly from real-world recordings and therefore contains considerable noise, so it is expected to better represent real-world conditions. The model uses two stacked LSTM layers for the multi-class speaker recognition task, and hyperparameters are tuned with a grid search to obtain the most effective configuration. The resulting LSTM model achieved an equal error rate (EER) of 10.13%, better than the 15.52% reported by the reference baseline paper. When compared with other studies that use the CN-Celeb dataset with different models, including x-vectors, PLDA, TDNN, and transformer-based models, the LSTM model also shows promising performance.
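Since the paper's implementation is not included here, the following is a minimal sketch, assuming a Keras/TensorFlow setup and fixed-length MFCC inputs, of what a two-layer stacked LSTM speaker classifier with grid-search hyperparameter tuning could look like. The layer sizes, grid values, speaker count, and input shapes are illustrative assumptions, not the configuration reported in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

N_SPEAKERS = 100             # assumption: size of the closed speaker set
N_FRAMES, N_MFCC = 300, 40   # assumption: fixed-length MFCC feature sequences

def build_model(units=128, dropout=0.3, lr=1e-3):
    """Two stacked LSTM layers followed by a softmax over speaker classes."""
    model = models.Sequential([
        layers.Input(shape=(N_FRAMES, N_MFCC)),
        layers.LSTM(units, return_sequences=True),  # first LSTM layer: returns the full frame-level sequence
        layers.LSTM(units),                         # second LSTM layer: summarizes the utterance into one vector
        layers.Dropout(dropout),
        layers.Dense(N_SPEAKERS, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Grid search over a small, illustrative hyperparameter grid; the paper's actual
# grid and chosen configuration are not reproduced here.
for units in (64, 128):
    for dropout in (0.2, 0.3):
        for lr in (1e-3, 1e-4):
            model = build_model(units=units, dropout=dropout, lr=lr)
            # model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=20)
            # keep the configuration with the best validation accuracy (or lowest EER)
```

The key design point in a stacked LSTM is that the first layer must return the full frame-level sequence (`return_sequences=True`) so the second layer can consume it and reduce the utterance to a single vector for classification.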
Pages: 1202–1211
Copyright (c) 2022 Adipta Martulandi, Amalia Zahra

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under Creative Commons Attribution 4.0 International License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (Refer to The Effect of Open Access).