Hybrid Feature Selection with Metaheuristics for Improving the Accuracy of Diabetes Disease Prediction

Ida Maratul Khamidah; Suci Ramadhani; Aulia Khoirunnita

doi:10.47065/bits.v7i4.9541

Ida Maratul Khamidah * Politeknik Pertanian Negeri Samarinda, Samarinda, Indonesia
Suci Ramadhani Politeknik Pertanian Negeri Samarinda, Samarinda, Indonesia
Aulia Khoirunnita Universitas Mulawarman, Samarinda, Indonesia

(*) Corresponding Author

Keywords: Diabetes Prediction; Feature Selection; Meta-Heuristic Optimization; Machine Learning; Hybrid Methods

Abstract

Early diagnosis of diabetes mellitus is crucial to prevent severe complications and reduce long-term healthcare costs, making accurate and efficient predictive models an important research focus in medical data analytics. However, one of the main challenges in diabetes prediction lies in the presence of irrelevant and redundant features within medical datasets, which can degrade classification accuracy, increase computational complexity, and reduce model generalizability. To address this issue, this study proposes a Hybrid Feature Selection (HFS) approach that integrates filter-based methods and meta-heuristic optimization to identify an optimal subset of features for diabetes prediction. In the proposed framework, statistical filter techniques combining Chi-square and Mutual Information are first employed to rank the features and reduce dimensionality by retaining the most relevant attributes. Subsequently, a Genetic Algorithm (GA) is applied to further optimize the feature subset by maximizing classification accuracy while minimizing the number of selected features. The effectiveness of the proposed HFS approach is evaluated using the Pima Indian Diabetes Dataset, consisting of 768 instances and 8 clinical features, and tested across multiple machine learning classifiers, including Random Forest, Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and XGBoost. Experimental results demonstrate that the proposed HFS improves predictive performance compared to baseline models without feature selection. Specifically, the Random Forest classifier achieved the highest accuracy of 79.22%, compared to 74.03% in the baseline model, an improvement of approximately 5.2 percentage points. Additionally, improvements were observed in F1-score and AUC, with AUC increasing from 0.8336 to 0.8403. Beyond accuracy gains, the proposed method reduced feature dimensionality from 8 to 5 features, resulting in lower computational cost and shorter model training time.
These findings indicate that the hybrid integration of filter-based selection and meta-heuristic optimization provides a robust and efficient solution for feature selection in medical prediction tasks. Overall, the proposed HFS framework offers a promising approach for developing accurate, efficient, and reliable decision-support systems for early diabetes diagnosis.
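The two-stage idea described in the abstract (a Chi-square and Mutual Information filter followed by a Genetic Algorithm wrapper) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data are synthetic rather than the Pima dataset, features are binarized at their median before scoring, the two filter scores are combined by summing ranks, the classifier is a simple nearest-centroid rule instead of Random Forest, and every function name, the penalty weight, and the GA settings are hypothetical choices made for the example.

```python
import math
import random

def make_data(n=300, n_features=8, informative=(0, 1, 2), seed=0):
    """Synthetic binary-class data (assumption): only `informative` columns depend on the label."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        X.append([rng.gauss(1.5 * label if j in informative else 0.0, 1.0)
                  for j in range(n_features)])
        y.append(label)
    return X, y

def contingency(col, y):
    """2x2 counts of (feature above its median?, label)."""
    med = sorted(col)[len(col) // 2]
    counts = [[0, 0], [0, 0]]
    for v, label in zip(col, y):
        counts[int(v > med)][label] += 1
    return counts

def chi2_and_mi(counts):
    """Chi-square statistic and mutual information (in nats) of a 2x2 table."""
    n = sum(map(sum, counts))
    chi2 = mi = 0.0
    for a in range(2):
        for b in range(2):
            expected = sum(counts[a]) * (counts[0][b] + counts[1][b]) / n
            if expected > 0:
                chi2 += (counts[a][b] - expected) ** 2 / expected
            if counts[a][b] > 0:
                mi += counts[a][b] / n * math.log(counts[a][b] / expected)
    return chi2, mi

def filter_stage(X, y, keep):
    """Stage 1: rank features by the sum of their chi-square rank and MI rank; keep the best."""
    scores = [chi2_and_mi(contingency([r[j] for r in X], y)) for j in range(len(X[0]))]
    def ranks(values):
        order = sorted(range(len(values)), key=lambda j: -values[j])
        return {j: r for r, j in enumerate(order)}
    r_chi, r_mi = ranks([s[0] for s in scores]), ranks([s[1] for s in scores])
    combined = sorted(range(len(scores)), key=lambda j: r_chi[j] + r_mi[j])
    return sorted(combined[:keep])

def accuracy(features, train, test):
    """Hold-out accuracy of a nearest-centroid classifier restricted to `features`."""
    (Xtr, ytr), (Xte, yte) = train, test
    cents = {}
    for c in (0, 1):
        rows = [x for x, label in zip(Xtr, ytr) if label == c]
        cents[c] = [sum(r[j] for r in rows) / len(rows) for j in features]
    hits = 0
    for x, label in zip(Xte, yte):
        dist = {c: sum((x[j] - m) ** 2 for j, m in zip(features, cents[c])) for c in cents}
        hits += min(dist, key=dist.get) == label
    return hits / len(yte)

def genetic_search(candidates, train, test, pop_size=20, generations=15,
                   penalty=0.01, p_mut=0.1, seed=1):
    """Stage 2: GA over bit masks; fitness rewards accuracy and penalizes subset size."""
    rng = random.Random(seed)
    def fitness(mask):
        chosen = [f for f, bit in zip(candidates, mask) if bit]
        if not chosen:
            return 0.0
        return accuracy(chosen, train, test) - penalty * len(chosen)
    pop = [[rng.randint(0, 1) for _ in candidates] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        children = [best[:]]  # elitism: carry the best mask forward unchanged
        while len(children) < pop_size:
            p1 = max(rng.sample(pop, 3), key=fitness)  # tournament selection
            p2 = max(rng.sample(pop, 3), key=fitness)
            cut = rng.randrange(1, len(candidates))    # one-point crossover
            child = p1[:cut] + p2[cut:]
            children.append([bit ^ (rng.random() < p_mut) for bit in child])  # bit-flip mutation
        pop = children
        best = max(pop, key=fitness)
    return [f for f, bit in zip(candidates, best) if bit], fitness(best)

X, y = make_data()
train, test = (X[:200], y[:200]), (X[200:], y[200:])
candidates = filter_stage(*train, keep=6)   # filter scored on training rows only
selected, score = genetic_search(candidates, train, test)
print("filter kept:", candidates)
print("GA selected:", selected, "fitness:", round(score, 3))
```

For brevity the GA fitness here is scored on a single hold-out split. The abstract does not state the validation protocol, so in practice the fitness would more plausibly be estimated with cross-validation inside the training data, keeping a separate test set for the final comparison against the baseline.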
