Explainable AI and robust forecasting of global salary trends: Addressing data drift and unseen categories with tree-based models

Vol. 12, No. 3, 2025; pp. 993–1004
https://doi.org/10.23939/mmc2025.03.993
Received: August 31, 2025
Revised: September 27, 2025
Accepted: October 03, 2025

Shakhovska N. B.  Explainable AI and robust forecasting of global salary trends: Addressing data drift and unseen categories with tree-based models.  Mathematical Modeling and Computing. Vol. 12, No. 3, pp. 993–1004 (2025)

Lviv Polytechnic National University

This article studies salary prediction under distributional drift using explainable boosting models and hybrid forecasting.  We integrate unseen-aware feature engineering, robust objectives, SHAP-based interpretability, drift detection, and time-series forecasting (Prophet/SARIMAX) on multi-year data (2020–2024), and report a comprehensive evaluation.  Modern salary data are heterogeneous, heavy-tailed, and non-stationary; we therefore combine robust tree-based learners with drift monitoring and explainable forecasting, prioritizing stable absolute error, transparency, and maintainability over raw variance capture.  Our best integrated pipeline reaches $R^2=0.31$ on a 2024 hold-out while keeping MAE/RMSE stable across folds, and uncovers year-to-year drift that necessitates periodic retraining.  Monthly and quarterly forecasts indicate a sustained upward trend with seasonality, where SARIMAX captures short-term fluctuations and Prophet yields interpretable trend decompositions.
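To make the described pipeline concrete, the sketch below shows one way its ingredients could fit together in Python: integer encoding with an explicit fallback code for categories unseen at training time, a LightGBM regressor with a robust Huber objective, a Kolmogorov–Smirnov check for feature drift between the training years and the 2024 hold-out, and SHAP values for interpretability.  This is a minimal illustration under assumed column names (e.g. work_year, job_title, salary_in_usd) and an assumed salaries.csv file, not the authors' implementation; the forecasting stage (Prophet/SARIMAX on aggregated salary series) is omitted.

```python
# Minimal sketch of the pipeline described above, not the authors' exact code.
# Column names ('work_year', 'job_title', 'experience_level', 'company_size',
# 'salary_in_usd') and the file 'salaries.csv' are assumptions.
import numpy as np
import pandas as pd
import lightgbm as lgb
import shap
from scipy.stats import ks_2samp
from sklearn.metrics import mean_absolute_error, r2_score

CAT_COLS = ["job_title", "experience_level", "company_size"]
TARGET = "salary_in_usd"

def encode_with_unseen(train: pd.DataFrame, test: pd.DataFrame):
    """Integer-encode categories; values absent from the training split
    fall back to a dedicated 'unseen' code (-1)."""
    train, test = train.copy(), test.copy()
    for col in CAT_COLS:
        codes = {v: i for i, v in enumerate(train[col].unique())}
        train[col] = train[col].map(codes)
        test[col] = test[col].map(codes).fillna(-1).astype(int)
    return train, test

def detect_drift(train: pd.DataFrame, test: pd.DataFrame, cols, alpha=0.01):
    """Flag features whose train/hold-out distributions differ under a
    two-sample Kolmogorov-Smirnov test."""
    return [c for c in cols if ks_2samp(train[c], test[c]).pvalue < alpha]

df = pd.read_csv("salaries.csv")
train_df = df[df["work_year"] < 2024]
test_df = df[df["work_year"] == 2024]        # 2024 hold-out, as in the abstract
train_df, test_df = encode_with_unseen(train_df, test_df)

features = [c for c in train_df.columns
            if c != TARGET and pd.api.types.is_numeric_dtype(train_df[c])]
X_tr, y_tr = train_df[features], train_df[TARGET]
X_te, y_te = test_df[features], test_df[TARGET]

# Robust (Huber) objective to limit the influence of heavy-tailed salaries.
model = lgb.LGBMRegressor(objective="huber", n_estimators=500, learning_rate=0.05)
model.fit(X_tr, y_tr)

pred = model.predict(X_te)
print("MAE :", mean_absolute_error(y_te, pred))
print("R^2 :", r2_score(y_te, pred))
print("Drifted features:", detect_drift(train_df, test_df, features))

# SHAP-based interpretability of the tree ensemble.
shap_values = shap.TreeExplainer(model).shap_values(X_te)
importance = dict(zip(features, np.abs(shap_values).mean(axis=0).round(1)))
print("Mean |SHAP| per feature:", importance)
```

The unseen-code fallback keeps hold-out categories that never appeared in 2020–2023 from breaking inference, at the cost of grouping them under a single code, while the KS check flags the features whose distributional shift would trigger the periodic retraining mentioned above.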
