Research of data mining methods for classification of imbalanced data sets

A. V. Doroshenko; D. Y. Savchuk

With the rapid development of information technology, which is now used in all spheres of human life and activity, extremely large amounts of data have been accumulated. Applying machine learning methods to this data can yield new, practically useful knowledge. The main goal of this paper is to study different machine learning methods for solving the classification problem and to compare their efficiency and accuracy. A separate task is data pre-processing, aimed at resolving the imbalance of the sample and at identifying the principal components to be used in classification. For this purpose, an information system for classifying company bankruptcy from specified economic and financial characteristics was researched and developed. The study uses a dataset on which the efficiency and quality of several existing classification algorithms are evaluated: conventional and linear Support Vector Machine, Extra Trees, Random Forest, Decision Tree, Logistic Regression, Multilayer Perceptron classifier, Gradient Boosting, and Naive Bayes classifier. For data pre-processing, we scaled the data, used the SMOTE method to remove the imbalance of the training sample, and performed principal component analysis and L1 regularisation. Principal component analysis allowed us to identify 15 principal components that have the greatest impact on classification accuracy and to use them in the classification process. Analysing the results, we found that the best classifier was Random Forest with 95.9% accuracy, and the worst was Naive Bayes with 85.1%.
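The oversampling step mentioned above can be illustrated with a minimal sketch of the core SMOTE idea: each synthetic minority sample is placed at a random position on the segment between a minority sample and one of its nearest minority-class neighbours. This is an assumption-laden illustration, not the implementation used in the study; the function name `smote`, the toy data, and the parameter values are hypothetical.

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbours (Euclidean distance)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base sample, excluding itself
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy training set: 6 majority samples and 3 minority samples.
majority = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [0.0, 0.4], [0.4, 0.1]]
minority = [[2.0, 2.0], [2.5, 2.2], [2.2, 2.6]]

# Generate just enough synthetic points to equalise the class sizes.
new_points = smote(minority, n_synthetic=len(majority) - len(minority), k=2)
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))  # prints "6 6"
```

As the abstract states, the oversampling is applied to the training sample only, so a held-out test set keeps the original class proportions.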
To evaluate the quality of classification and select the best classifier, the confusion matrix is used, which records the number of true positive (TP) and true negative (TN) results as well as the number of false negative (FN) and false positive (FP) results, together with the metrics accuracy, precision, recall (sensitivity), F1, and ROC AUC. Accuracy is the proportion of correct answers given by the algorithm. Recall is the number of TPs divided by the number of TPs plus the number of FNs. Precision is the number of TPs divided by the number of TPs plus the number of FPs. F1 indicates the balance between precision and recall. ROC AUC measures classification performance across different decision thresholds and shows how well a model can distinguish between classes. The conclusions present the main results of the study and indicate the main direction of future work, namely studying classification results on other datasets and more efficient processing and analysis.
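The metrics described above can be computed directly from the four confusion-matrix counts. The sketch below uses the standard definitions; the function names and all numbers are hypothetical illustrations, not results from the study. ROC AUC is computed here through its rank interpretation, the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def roc_auc(labels, scores):
    """Share of (positive, negative) pairs in which the positive example
    has the higher score; tied pairs contribute 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical imbalanced test set: 90 positives among 1000 examples.
m = classification_metrics(tp=40, tn=900, fp=10, fn=50)
print(m)

# Hypothetical scores for four examples (two negatives, two positives).
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # prints 0.75
```

With these illustrative counts the accuracy is 0.94 while recall is only about 0.44, which shows why accuracy alone can be misleading on imbalanced data and why precision, recall, F1, and ROC AUC are reported alongside it.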
