Bayesian Optimization of Gradient Boosting for E-Commerce Purchase Intention Prediction on a Class-Imbalanced Dataset
Abstract
Predicting consumer purchase intention in e-commerce is challenging because of severe class imbalance: most visitors browse without completing a transaction. This study compares three gradient boosting algorithms (XGBoost, LightGBM, and CatBoost) on the Online Shoppers Intention dataset, which has a class ratio of 84.5% to 15.5%. To counter bias toward the majority class, the Synthetic Minority Oversampling Technique (SMOTE) was applied to the training data. The study focuses on hyperparameter optimization with the Optuna framework, based on the Tree-structured Parzen Estimator (TPE), and validates the resulting model comparison statistically with the Friedman test and the Nemenyi post-hoc test. Evaluation with stratified 10-fold cross-validation shows that all three models handle the class imbalance effectively: LightGBM achieved an accuracy of 88.36% with an ROC-AUC of 0.9138, XGBoost achieved 88.56% with an ROC-AUC of 0.9127, and CatBoost achieved 88.56% with an ROC-AUC of 0.9121. Feature importance analysis identifies ProductRelated_Duration and ExitRates as the main predictors of purchase intention. The Friedman test detected a global performance difference (p = 0.0450), but the Nemenyi post-hoc test found insufficient evidence of a significant difference between any pair of models. The practical implication for the e-commerce industry is that the choice among these ensemble algorithms need not rest on small accuracy margins that lack statistical support; it can instead be based on computational latency, where LightGBM proves efficient.
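The statistical comparison described in the abstract can be illustrated with a minimal sketch. It assumes three models scored on the same 10 folds, computes the Friedman chi-square statistic from mean ranks, and derives the Nemenyi critical difference (CD) for pairwise comparison. The per-fold scores below are hypothetical placeholders, not the study's results, and the function names are illustrative. The critical value q = 2.343 is the tabulated Nemenyi value for three groups at alpha = 0.05, and the statistic is computed without a tie correction.

```python
import math

# Hypothetical per-fold ROC-AUC scores (10 folds x 3 models).
# These are made-up illustration values, NOT the study's data.
scores = {
    "LightGBM": [0.915, 0.912, 0.918, 0.910, 0.916, 0.913, 0.917, 0.911, 0.914, 0.919],
    "XGBoost":  [0.913, 0.914, 0.916, 0.908, 0.915, 0.911, 0.915, 0.912, 0.912, 0.917],
    "CatBoost": [0.912, 0.911, 0.915, 0.909, 0.913, 0.912, 0.914, 0.909, 0.911, 0.916],
}


def average_ranks(row):
    """Rank one fold's scores, 1 = best (highest); tied values share the mean rank."""
    order = sorted(range(len(row)), key=lambda i: -row[i])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks


def friedman_nemenyi(scores, q_alpha=2.343):
    """Return mean ranks, Friedman chi-square, p-value (k = 3 only), and Nemenyi CD."""
    names = list(scores)
    k = len(names)                    # number of models
    n = len(scores[names[0]])         # number of folds
    rank_sums = [0.0] * k
    for fold in range(n):
        row = [scores[name][fold] for name in names]
        for idx, rank in enumerate(average_ranks(row)):
            rank_sums[idx] += rank
    mean_ranks = [s / n for s in rank_sums]
    # Friedman statistic (no tie correction applied in this sketch).
    chi2 = 12 * n / (k * (k + 1)) * (sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4)
    # With k = 3 the statistic has 2 degrees of freedom, where the chi-square
    # survival function has the closed form exp(-x / 2).
    p_value = math.exp(-chi2 / 2) if k == 3 else None
    # Two models differ significantly only if their mean ranks differ by more than CD.
    cd = q_alpha * math.sqrt(k * (k + 1) / (6 * n))
    return dict(zip(names, mean_ranks)), chi2, p_value, cd


mean_ranks, chi2, p_value, cd = friedman_nemenyi(scores)
print(mean_ranks, round(chi2, 3), p_value, round(cd, 4))
```

Because the Friedman test is a global test and the Nemenyi test compares each pair of mean ranks against the CD, a significant global result can occur with no pair exceeding the CD, which is the pattern the abstract reports. In practice, a library routine such as `scipy.stats.friedmanchisquare` (which also corrects for ties) would normally replace the hand-written statistic.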
Pages: 51-61
Copyright (c) 2026 Imam Bagus Setyawan, Heribertus Himawan

This work is licensed under a Creative Commons Attribution 4.0 International License.