Breast Cancer Prediction Using the Synthetic Minority Oversampling Technique and Categorical Boosting Classifier

Muhamad Bintang Mandala; Wina Witanti; Agus Komarudin

doi:10.47065/bits.v7i1.7403

Muhamad Bintang Mandala (*), Universitas Jenderal Achmad Yani, Cimahi, Indonesia
Wina Witanti, Universitas Jenderal Achmad Yani, Cimahi, Indonesia
Agus Komarudin, Universitas Jenderal Achmad Yani, Cimahi, Indonesia

(*) Corresponding Author

Keywords: CatBoost; Detection; Data Mining; Cancer; Machine Learning

Abstract

Breast cancer remains one of the leading causes of mortality worldwide and is highly prevalent among women in Indonesia. Accurate and efficient diagnostic models are essential to support early detection and reduce mortality. This study develops a predictive model for breast cancer classification using CatBoost, a gradient boosting algorithm that handles categorical features natively and reduces overfitting through ordered boosting. The dataset consists of diagnostic features of breast tumors. Preprocessing checked the data for completeness and transformed numerical attributes into categorical bins to capture the distribution of values more effectively. To address the class imbalance between benign and malignant cases, SMOTE (Synthetic Minority Over-sampling Technique) was applied, yielding a balanced training set. CatBoost hyperparameters were tuned with Bayesian optimization; the key parameters were tree depth, learning rate, and L2 regularization. The model was then trained and evaluated using recall, accuracy, and F1-score, with a confusion matrix used to assess prediction quality. CatBoost achieved a recall of 1.0, an accuracy of 98.6%, and an F1-score of 0.99, matching or outperforming benchmark models such as SVM, neural network, and XGBoost. These findings highlight the reliability and effectiveness of CatBoost in supporting medical decision-making for breast cancer diagnosis.
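
As a rough illustration of two steps named in the abstract, the sketch below shows the core idea of SMOTE (interpolating between a minority sample and one of its nearest minority neighbours) and how recall, accuracy, and F1-score follow from confusion-matrix counts. It is a minimal standard-library sketch, not the authors' implementation: the function names `smote` and `scores`, the toy data, and the confusion-matrix counts are all hypothetical and chosen only for illustration.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (the core SMOTE idea)."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest neighbours of `base` within the minority class (excluding itself)
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def scores(tp, fp, fn, tn):
    """Recall, accuracy and F1 for the positive (malignant) class."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"recall": recall, "accuracy": accuracy, "f1": f1}


# Toy data (made up, not the study's dataset): 6 benign vs 3 malignant samples.
benign = [[1.0, 1.2], [0.9, 1.0], [1.1, 0.8], [1.3, 1.1], [0.8, 0.9], [1.0, 1.0]]
malignant = [[3.0, 3.2], [3.4, 2.9], [2.8, 3.1]]

# Oversample the minority class until both classes have the same size.
new_points = smote(malignant, n_new=len(benign) - len(malignant), k=2)
balanced_malignant = malignant + new_points

# Illustrative confusion-matrix counts (also made up, not the paper's results).
result = scores(tp=8, fp=1, fn=0, tn=11)
```

In practice, oversampling of this kind is applied to the training split only, so that synthetic points do not leak into the evaluation data. A recall of 1.0 corresponds to zero false negatives, which is why recall is often emphasized for diagnostic screening.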
