Data potential and feasibility study with Grid Mean Algorithm

Mathematical Modeling and Computing. Vol. 12, No. 1, pp. 331–341 (2025)
https://doi.org/10.23939/mmc2025.01.331
Received: January 13, 2025
Revised: March 20, 2025
Accepted: March 25, 2025

Holdovanskyi V. A., Alieksieiev V. I.  Data potential and feasibility study with Grid Mean Algorithm.  Mathematical Modeling and Computing. Vol. 12, No. 1, pp. 331–341 (2025)

1,2 Lviv Polytechnic National University

The Grid Mean Algorithm is a computational approach for evaluating regression metrics such as the coefficient of determination ($R^2$), mean absolute error (MAE), root mean square error (RMSE), and mean absolute percentage error (MAPE) directly on tabular data, without training machine learning (ML) models. The method enables researchers and practitioners to assess the potential of data for regression tasks, estimate the feasibility of ML projects, and make informed decisions about resource allocation. In addition, the algorithm estimates the approximate accuracy limit achievable with the given data, which makes it a useful criterion for judging whether a trained model is close to optimal. By indicating whether further research stages are necessary or redundant, it provides a practical tool for planning ML experiments and evaluating the economic viability of investing in such models. Experiments on synthetic datasets demonstrate that the method produces accurate metric estimates across various functional forms and noise levels, making it a robust choice for initial data exploration and ML project planning.
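The abstract does not spell out the computation, but the name suggests partitioning the feature range into grid cells and using each cell's mean target value as the prediction, then scoring that prediction with the standard regression metrics. The sketch below implements that reading for a single feature on synthetic data; the cell count, equal-width binning, and empty-cell fallback are assumptions for illustration, not the authors' specification.

```python
import numpy as np

def grid_mean_metrics(x, y, n_cells=20):
    """Estimate R^2, MAE, RMSE, and MAPE by binning x into equal-width
    cells and predicting each cell's mean target value (a sketch of the
    grid-mean idea; the binning rule and cell count are assumptions)."""
    edges = np.linspace(x.min(), x.max(), n_cells + 1)
    # Assign each sample to a cell; clip so x.max() lands in the last cell.
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_cells - 1)
    # Prediction for a sample = mean of y within its cell
    # (fall back to the global mean for an empty cell).
    cell_means = np.array([y[idx == c].mean() if np.any(idx == c) else y.mean()
                           for c in range(n_cells)])
    y_hat = cell_means[idx]
    resid = y - y_hat
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MAE": np.mean(np.abs(resid)),
        "RMSE": np.sqrt(np.mean(resid ** 2)),
        "MAPE": np.mean(np.abs(resid / y)) * 100.0,  # assumes y stays away from 0
    }

# Synthetic example in the spirit of the paper's experiments:
# a noisy quadratic dependence, scored without fitting any model.
rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, 2000)
y = x ** 2 + 5.0 + rng.normal(0.0, 0.5, x.size)
m = grid_mean_metrics(x, y)
```

The estimated $R^2$ here approaches the ceiling implied by the noise level, illustrating the abstract's point: the metrics bound what any regression model could achieve on this data.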
