This study explores the correlation between the popularity of open-source repositories and their quality as assessed with static code metrics. Its primary focus is on defining key quality indicators for two distinct paradigms, functional and object-oriented programming, and on developing a code search method that systematically processes the retrieved repositories. The ultimate purpose of the research is an effective code search method for identifying high-quality GitHub repositories, one that balances repository popularity against code quality so that the selected projects can later be used to train machine learning models. To this end, the study reviews current methods of repository extraction, defines relevant code quality metrics for the two paradigms, and analyzes the correlation between the quality indicators and repository popularity.
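To make the final analysis step concrete, the sketch below computes a rank correlation between star counts and one quality indicator. It is a minimal Python illustration, assuming the metric values have already been collected; the sample data and the choice of Spearman's coefficient are assumptions for demonstration, not the study's fixed procedure.

```python
from scipy.stats import spearmanr

# Illustrative (made-up) data: in practice both series would be collected
# from the mined repositories and their static analysis results.
stars = [15200, 8400, 3100, 950, 410, 120]          # popularity
technical_debt_min = [5400, 1200, 2600, 300, 4100, 90]  # quality indicator

# Spearman (rank-based) rather than Pearson, because star counts are
# heavily skewed and the relationship need not be linear.
rho, p_value = spearmanr(stars, technical_debt_min)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.3f}")
```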
A comparative analysis of three data extraction methods, the GitHub REST API, GHTorrent, and GitHub Archive, was carried out. A detailed comparison table summarizes the advantages and limitations of each method and identifies the optimal approach for further work. In addition, both fundamental and niche quality metrics were identified for each programming paradigm to enable a more comprehensive evaluation of repository quality. The study also examines SonarQube, which provides insight into code quality, maintainability, and technical debt, making it a valuable tool for assessing a repository's suitability for machine learning-based defect prediction.
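As an illustration of the extraction step, the following minimal sketch queries the GitHub REST API search endpoint for popular repositories in a given language. The query terms, star threshold, and page size are illustrative assumptions, not the study's exact configuration.

```python
import requests

# GitHub REST API repository search endpoint.
API_URL = "https://api.github.com/search/repositories"

def search_repositories(language: str, min_stars: int = 100, per_page: int = 30):
    """Return repository metadata sorted by star count, descending."""
    params = {
        "q": f"language:{language} stars:>={min_stars}",
        "sort": "stars",
        "order": "desc",
        "per_page": per_page,
    }
    # An "Authorization: Bearer <token>" header can be added to raise
    # the (fairly low) unauthenticated rate limit for the search API.
    headers = {"Accept": "application/vnd.github+json"}
    response = requests.get(API_URL, params=params, headers=headers, timeout=30)
    response.raise_for_status()
    return response.json()["items"]

# Example: candidate projects for the functional paradigm.
for repo in search_repositories("haskell")[:5]:
    print(repo["full_name"], repo["stargazers_count"])
```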
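Similarly, once a project has been analyzed by SonarQube, its computed measures can be retrieved through the server's Web API. The sketch below is a minimal example assuming a locally running server; the URL, project key, and token are hypothetical placeholders, while the metric keys (bugs, code_smells, sqale_index for technical debt, duplicated_lines_density) are standard SonarQube measures.

```python
import requests

# Placeholders for a concrete SonarQube deployment.
SONAR_URL = "http://localhost:9000"
PROJECT_KEY = "my-project"   # hypothetical project key
TOKEN = "squ_example_token"  # hypothetical user token

METRICS = "bugs,code_smells,sqale_index,duplicated_lines_density"

response = requests.get(
    f"{SONAR_URL}/api/measures/component",
    params={"component": PROJECT_KEY, "metricKeys": METRICS},
    auth=(TOKEN, ""),  # SonarQube accepts the token as the basic-auth username
    timeout=30,
)
response.raise_for_status()

# Each measure pairs a metric key with its computed value for the project.
for measure in response.json()["component"]["measures"]:
    print(measure["metric"], "=", measure["value"])
```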
Many widely used open-source projects gain traction through active community contributions and extensive use, yet their intrinsic code quality does not always meet high standards. Conversely, lesser-known repositories may exhibit superior quality but lack the adoption needed to serve as representative datasets for training machine learning models. The results of this study contribute to the broader field of software quality assurance and defect prediction by providing a structured approach to evaluating open-source repositories. The proposed method can improve the selection of reliable datasets for training AI models in software engineering, ultimately leading to more effective defect detection and better software quality control.