Leakage-Aware Random Forest Regression for Predicting Job Automation Risk Using Structured Labor Market Data

Alya Zalfa Chairunnisa; Nawirah Athqiyah; Vanisa Amalia Putri; Ken Dhita Tania; Allsela Meiriza

doi:10.47065/bits.v8i1.9706

Alya Zalfa Chairunnisa Universitas Sriwijaya, Palembang, Indonesia
Nawirah Athqiyah Universitas Sriwijaya, Palembang, Indonesia
Vanisa Amalia Putri * Universitas Sriwijaya, Palembang, Indonesia
Ken Dhita Tania Universitas Sriwijaya, Palembang, Indonesia
Allsela Meiriza Indonesia

(*) Corresponding Author

Keywords: Artificial Intelligence; Automation Risk; Data Leakage; Random Forest; Regression

Abstract

This study predicts job automation risk in the era of artificial intelligence (AI) using a leakage-aware Random Forest Regression approach. The target variable is the automation risk score, a composite index derived from task exposure to AI, occupational routine intensity, and technological susceptibility indicators in the AI Impact Jobs Dataset. The dataset comprises 5,000 job vacancy records from 44 countries across 9 industries, spanning 2010 to 2025. Potential data leakage features are systematically identified and removed, including ai_intensity_score, reskilling_required, and ai_mentioned, which were found to share mathematical or conceptual derivation paths with the target variable. The model is evaluated using R², RMSE, MAE, and MAPE with 5-fold cross-validation. It achieves an R² of 0.8087 on the test data, with an RMSE of 0.1129 and an MAE of 0.0893. Feature importance analysis shows that salary_change_vs_prev_year_percent is the most influential predictor (55.85%); although this dominance may reflect a bias typical of synthetic datasets, it is consistent with economic theories linking wage dynamics to automation incentives. Removing the leakage features lowers R² from 0.8857 to 0.8087, indicating that the higher figure was inflated by leakage, and the results suggest that Random Forest Regression provides a robust predictive framework for tabular socio-economic data when combined with rigorous preprocessing. This study contributes a methodological template for preventing data leakage in labor market prediction tasks.
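
The leakage-removal step and the four evaluation metrics named in the abstract can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' pipeline: the Random Forest model and the 5-fold cross-validation are omitted, the target column name `automation_risk_score`, the helper names `drop_leakage` and `regression_metrics`, and the toy rows are all hypothetical, and only the three leakage feature names are taken from the abstract.

```python
import math

# Columns the abstract names as leakage-prone.
LEAKAGE_FEATURES = {"ai_intensity_score", "reskilling_required", "ai_mentioned"}


def drop_leakage(records, target, leakage=LEAKAGE_FEATURES):
    """Split row dicts into leakage-free feature dicts and a list of targets."""
    features = [
        {k: v for k, v in row.items() if k != target and k not in leakage}
        for row in records
    ]
    targets = [row[target] for row in records]
    return features, targets


def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE, MAE and MAPE (in percent) for paired values.

    MAPE divides by each true value, so y_true must not contain zeros.
    """
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "mape": 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n,
    }


# Hypothetical toy rows; the target column name is an assumption.
rows = [
    {"ai_intensity_score": 0.9, "ai_mentioned": 1,
     "salary_change_vs_prev_year_percent": -2.0, "automation_risk_score": 0.8},
    {"ai_intensity_score": 0.1, "ai_mentioned": 0,
     "salary_change_vs_prev_year_percent": 3.5, "automation_risk_score": 0.2},
]
features, targets = drop_leakage(rows, target="automation_risk_score")
print(features)

# Hypothetical predictions, each off by 0.05 from the true value.
print(regression_metrics([0.2, 0.4, 0.6, 0.8], [0.25, 0.35, 0.65, 0.75]))
```

Computing the same metrics once with the leakage columns kept and once with them dropped is one way to expose the kind of gap the abstract reports between R² = 0.8857 and R² = 0.8087.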
