Bayesian Optimization of Gradient Boosting for E-Commerce Purchase Intention Prediction on a Class-Imbalanced Dataset

Imam Bagus Setyawan; Heribertus Himawan

doi:10.47065/bits.v8i1.9710

Imam Bagus Setyawan (*), Universitas Dian Nuswantoro, Semarang, Indonesia
Heribertus Himawan, Universitas Dian Nuswantoro, Semarang, Indonesia

(*) Corresponding Author

Keywords: CatBoost; Class Imbalance; LightGBM; Purchase Intention; Optuna; XGBoost

Abstract

Predicting consumer purchase intention in e-commerce is difficult because the classes are severely imbalanced: most visitors browse without completing a transaction. This study compares three gradient boosting algorithms (XGBoost, LightGBM, and CatBoost) on the Online Shoppers Intention dataset, whose class ratio is 84.5% to 15.5%. To counter bias toward the majority class, the Synthetic Minority Oversampling Technique (SMOTE) was applied to the training data. Hyperparameters were optimized with the Optuna framework using the Tree-structured Parzen Estimator (TPE), and the results were validated statistically with the Friedman test and the Nemenyi post-hoc test. Evaluation with stratified 10-fold cross-validation shows that all three models handle the class imbalance effectively. LightGBM achieved an accuracy of 88.36% with an ROC-AUC of 0.9138, XGBoost achieved 88.56% with an ROC-AUC of 0.9127, and CatBoost achieved 88.56% with an ROC-AUC of 0.9121. Feature importance analysis identifies ProductRelated_Duration and ExitRates as the main predictors of purchase intention. The Friedman test detected a global performance difference (p = 0.0450), but the Nemenyi post-hoc test found insufficient evidence of a significant difference between any pair of models. For e-commerce practitioners, this means the choice among these ensemble algorithms need not rest on marginal accuracy differences; it can instead be based on computational latency, a criterion on which LightGBM proves efficient.
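
The statistical comparison described above (a Friedman test over cross-validation folds, followed by a Nemenyi post-hoc check) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the per-fold scores are hypothetical and do not reproduce the study's results, the function names are invented for this sketch, the Friedman statistic is computed without a tie correction, and the critical value 2.343 (three models, alpha = 0.05) is taken from the standard Nemenyi table rather than from the abstract.

```python
import math

# Nemenyi critical value for k = 3 models at alpha = 0.05 (standard table value;
# an assumption of this sketch, not a figure reported in the abstract).
Q_ALPHA_K3 = 2.343


def average_ranks(scores):
    """Mean rank of each model over folds (rank 1 = highest score; ties share the mean rank)."""
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        i = 0
        while i < k:
            j = i
            # Extend j over any models tied with the one at position i.
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            shared = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
            for t in range(i, j + 1):
                totals[order[t]] += shared
            i = j + 1
    return [t / len(scores) for t in totals]


def friedman_chi2(avg_ranks, n):
    """Friedman chi-square statistic for k models over n folds (no tie correction)."""
    k = len(avg_ranks)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4)


def nemenyi_cd(k, n, q_alpha):
    """Critical difference: two models differ if their mean ranks differ by more than this."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))


# Hypothetical ROC-AUC values for three models (model_a, model_b, model_c) over 10 folds.
folds = (
    [(0.92, 0.91, 0.90)] * 4
    + [(0.92, 0.90, 0.91)] * 2
    + [(0.91, 0.92, 0.90)] * 2
    + [(0.91, 0.90, 0.92)]
    + [(0.90, 0.92, 0.91)]
)

ranks = average_ranks(folds)
chi2 = friedman_chi2(ranks, len(folds))
# With k = 3 the chi-square approximation has 2 degrees of freedom,
# whose survival function is exactly exp(-x / 2).
p_value = math.exp(-chi2 / 2)
cd = nemenyi_cd(3, len(folds), Q_ALPHA_K3)
largest_gap = max(ranks) - min(ranks)

print(f"mean ranks: {ranks}")
print(f"Friedman chi2 = {chi2:.3f}, p = {p_value:.4f}")
print(f"Nemenyi CD = {cd:.3f}, largest rank gap = {largest_gap:.3f}")
```

With only three models and ten folds, the critical difference is roughly one full rank position, so a pair of models has to be separated by about a whole rank on average before the post-hoc test flags it. In practice a library routine such as SciPy's Friedman test would normally replace the hand-written statistic.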
