Hybrid Feature Selection with Metaheuristics for Improving the Accuracy of Diabetes Disease Prediction
Abstract
Early diagnosis of diabetes mellitus is crucial for preventing severe complications and reducing long-term healthcare costs, which makes accurate and efficient predictive models an important research focus in medical data analytics. A main challenge in diabetes prediction is the presence of irrelevant and redundant features in medical datasets, which can degrade classification accuracy, increase computational complexity, and reduce model generalizability. To address this issue, this study proposes a Hybrid Feature Selection (HFS) approach that integrates filter-based methods and meta-heuristic optimization to identify an optimal subset of features for diabetes prediction. In the proposed framework, two statistical filter techniques, Chi-square and Mutual Information, are first combined to rank the features and reduce dimensionality by retaining the most relevant attributes. A Genetic Algorithm (GA) then further optimizes the feature subset by maximizing classification accuracy while minimizing the number of selected features. The HFS approach is evaluated on the Pima Indian Diabetes Dataset, which consists of 768 instances and 8 clinical features, using multiple machine learning classifiers: Random Forest, Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and XGBoost. Experimental results show that the proposed HFS improves predictive performance compared to baseline models without feature selection. Specifically, the Random Forest classifier achieved the highest accuracy of 79.22%, compared to 74.03% for the baseline model, an improvement of 5.19 percentage points. Improvements were also observed in F1-score and AUC, with AUC increasing from 0.8336 to 0.8403. Beyond accuracy gains, the proposed method reduced the feature set from 8 to 5 features, resulting in lower computational cost and shorter model training time.
These findings indicate that the hybrid integration of filter-based selection and meta-heuristic optimization provides a robust and efficient solution for feature selection in medical prediction tasks. Overall, the proposed HFS framework offers a promising approach for developing accurate, efficient, and reliable decision-support systems for early diabetes diagnosis.
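The two-stage idea described in the abstract (a Chi-square and Mutual Information filter followed by a Genetic Algorithm wrapper) can be illustrated with a minimal, standard-library-only sketch. Everything in it is an assumption made for illustration and not the authors' implementation: the data are synthetic (not the Pima dataset), the two filter scores are combined by averaging their ranks, a nearest-centroid classifier stands in for Random Forest, and the GA operators, population size, and `penalty` weight are hypothetical choices. The fitness is measured on a single validation split; a real study would keep a separate test set or use cross-validation.

```python
import math
import random
from collections import Counter

def discretize(col, bins=3):
    """Equal-frequency binning so chi-square and MI can use count tables."""
    order = sorted(range(len(col)), key=lambda i: col[i])
    out = [0] * len(col)
    for pos, i in enumerate(order):
        out[i] = pos * bins // len(col)
    return out

def chi_square(xs, ys):
    """Pearson chi-square statistic of the xs-by-ys contingency table."""
    n, px, py, pxy = len(xs), Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((pxy[(x, y)] - px[x] * py[y] / n) ** 2 / (px[x] * py[y] / n)
               for x in px for y in py)

def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n, px, py, pxy = len(xs), Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * math.log(c * n / (px[x] * py[y])) for (x, y), c in pxy.items())

def filter_stage(X, y, keep):
    """Average the chi-square rank and the MI rank; return the `keep` best columns."""
    cols = [discretize([row[j] for row in X]) for j in range(len(X[0]))]
    def ranks(scores):
        order = sorted(range(len(scores)), key=lambda j: -scores[j])
        return {j: pos for pos, j in enumerate(order)}
    chi_r = ranks([chi_square(c, y) for c in cols])
    mi_r = ranks([mutual_information(c, y) for c in cols])
    return sorted(sorted(range(len(cols)), key=lambda j: chi_r[j] + mi_r[j])[:keep])

def holdout_accuracy(X, y, features, split):
    """Nearest-centroid accuracy on rows[split:] after fitting on rows[:split]."""
    if not features:
        return 0.0
    cent = {}
    for label in (0, 1):
        rows = [X[i] for i in range(split) if y[i] == label]
        cent[label] = [sum(r[j] for r in rows) / len(rows) for j in features]
    def predict(row):
        dist = {lab: sum((row[j] - c) ** 2 for j, c in zip(features, cent[lab]))
                for lab in cent}
        return min(dist, key=dist.get)
    test = range(split, len(X))
    return sum(predict(X[i]) == y[i] for i in test) / len(test)

def ga_stage(X, y, candidates, split, rng, pop_size=12, generations=15, penalty=0.01):
    """Evolve binary masks over `candidates`; fitness = accuracy - penalty * subset size."""
    def fitness(mask):
        chosen = [f for f, bit in zip(candidates, mask) if bit]
        return holdout_accuracy(X, y, chosen, split) - penalty * len(chosen)
    pop = [[rng.randint(0, 1) for _ in candidates] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]               # truncation selection, elites survive
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(candidates))  # one-point crossover
            child = a[:cut] + b[cut:]
            flip = rng.randrange(len(child))         # single bit-flip mutation
            child[flip] = 1 - child[flip]
            children.append(child)
        pop = parents + children
    best = max(pop, key=fitness)
    return [f for f, bit in zip(candidates, best) if bit]

# Synthetic stand-in data: columns 0-2 are shifted by the label, columns 3-7 are noise.
rng = random.Random(0)
y = [rng.randint(0, 1) for _ in range(300)]
X = [[(1.5 * label if j < 3 else 0.0) + rng.gauss(0, 1) for j in range(8)] for label in y]

candidates = filter_stage(X, y, keep=6)                    # stage 1: filter
selected = ga_stage(X, y, candidates, split=200, rng=rng)  # stage 2: GA wrapper
print("after filter:", candidates)
print("after GA:    ", selected)
```

The penalty term is what pushes the search toward smaller subsets: two masks with equal validation accuracy are ranked by how few features they use, which mirrors the abstract's stated goal of maximizing accuracy while minimizing the number of selected features.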
Pages: 2704-2714
Copyright (c) 2026 Ida Maratul Khamidah, Suci Ramadhani, Aulia Khoirunnita

This work is licensed under a Creative Commons Attribution 4.0 International License.