Leakage-Aware Random Forest Regression for Predicting Job Automation Risk Using Structured Labor Market Data
Abstract
This study predicts job automation risk in the era of artificial intelligence (AI) using a leakage-aware Random Forest Regression approach. The target variable is the automation risk score, a composite index derived from task exposure to AI, occupational routine intensity, and technological susceptibility indicators sourced from the AI Impact Jobs Dataset. The dataset comprises 5,000 job vacancy records from 44 countries and 9 industries, spanning 2010 to 2025. Potential data leakage features are systematically identified and removed, including ai_intensity_score, reskilling_required, and ai_mentioned, which were found to share mathematical or conceptual derivation paths with the target variable. The model is evaluated using R², RMSE, MAE, and MAPE with 5-fold cross-validation. It achieves an R² of 0.8087 on the test data, with an RMSE of 0.1129 and an MAE of 0.0893. Feature importance analysis identifies salary_change_vs_prev_year_percent as the most influential predictor (55.85%); although this dominance is indicative of the bias typical of synthetic datasets, it aligns with economic theories linking wage dynamics to automation incentives. The findings show that leakage control lowers the performance estimate from an inflated R² of 0.8857 to 0.8087, and that Random Forest Regression provides a robust predictive framework for tabular socio-economic data when combined with rigorous preprocessing. This study contributes a methodological template for preventing data leakage in labor market prediction tasks.
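As an illustration of the two mechanical steps named in the abstract, the following minimal sketch drops the three flagged leakage columns from a record and computes the four reported metrics (R², RMSE, MAE, MAPE). It is not the authors' code: the target column name automation_risk_score, the helper names drop_leakage and regression_metrics, and all numeric values are assumptions made up for the example, and no Random Forest is trained here.

```python
import math

# Leakage features named in the abstract.
LEAKAGE_COLUMNS = {"ai_intensity_score", "reskilling_required", "ai_mentioned"}
# Hypothetical target column name; the abstract does not give the actual name.
TARGET = "automation_risk_score"


def drop_leakage(record):
    """Return a copy of one record without the leakage columns or the target."""
    return {
        key: value
        for key, value in record.items()
        if key not in LEAKAGE_COLUMNS and key != TARGET
    }


def regression_metrics(y_true, y_pred):
    """Compute R^2, RMSE, MAE, and MAPE (in percent) for paired values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        # MAPE is undefined when any true value is zero.
        "mape": 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n,
    }


# Toy record and toy predictions, invented purely for illustration.
record = {
    "salary_change_vs_prev_year_percent": 3.2,
    "industry": "Finance",
    "ai_intensity_score": 0.7,
    "reskilling_required": 1,
    "ai_mentioned": 1,
    "automation_risk_score": 0.6,
}
features = drop_leakage(record)
metrics = regression_metrics([0.2, 0.4, 0.6, 0.8], [0.25, 0.35, 0.65, 0.75])
print(sorted(features))
print(metrics)
```

In a full pipeline, the filtered features would feed the regressor inside each cross-validation fold, and the same metric function would be applied to each fold's held-out predictions.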
Pages: 217-227
Copyright (c) 2026 Alya Zalfa Chairunnisa, Nawirah Athqiyah, Vanisa Amalia Putri, Ken Dhita Tania, Allsela Meiriza

This work is licensed under a Creative Commons Attribution 4.0 International License.