This article discusses the practical aspects of applying logistic regression to binary data classification. Logistic regression estimates the probability that an object belongs to one of two classes. This probability is computed by a sigmoid function whose argument is the scalar product (linear combination) of the object's feature vector and the weight coefficients obtained by minimizing the logarithmic loss function. The predicted class label is determined by comparing the computed probability with a given threshold value.
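As a minimal sketch of this computation (the weight values, bias and 0.5 threshold below are assumed for illustration and are not taken from the article), the probability and the thresholded label can be obtained as follows:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function mapping a real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, w, b):
    """Probability of the positive class: sigmoid of the linear combination w . x + b."""
    return sigmoid(np.dot(w, x) + b)

def predict_label(x, w, b, threshold=0.5):
    """Class label obtained by comparing the probability with the threshold."""
    return int(predict_proba(x, w, b) >= threshold)

# Weights and bias are assumed for illustration; in practice they are
# obtained by minimizing the logarithmic (cross-entropy) loss on training data.
w = np.array([1.2, -0.7])
b = 0.3
x = np.array([0.5, 1.0])        # feature vector of an object
print(predict_proba(x, w, b))   # predicted probability of the positive class
print(predict_label(x, w, b))   # predicted class label at threshold 0.5
```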
The logistic regression study was performed by computer simulation. For this purpose, a software system was developed that reproduces the main stages of logistic regression: preparation of the input data, training, testing with computation of binary classification quality metrics, and practical application of the logistic regression method to data classification.
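The authors' software system is not reproduced here; the sketch below only approximates the same four stages using the scikit-learn library and synthetic two-dimensional normal data, all of which are assumptions made for illustration rather than the article's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

# 1. Input data preparation (here: synthetic two-class data, assumed for illustration)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (500, 2)), rng.normal(2.0, 1.0, (500, 2))])
y = np.array([0] * 500 + [1] * 500)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 2. Training (minimization of the logarithmic loss)
model = LogisticRegression().fit(X_train, y_train)

# 3. Testing with binary classification quality metrics
proba = model.predict_proba(X_test)[:, 1]
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("ROC AUC :", roc_auc_score(y_test, proba))

# 4. Application: classifying a new, previously unseen object
print("new object ->", model.predict([[1.0, 1.0]])[0])
```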
The paper examines the effect of class overlap and class imbalance in the input data set on the efficiency of binary classification. Class overlap is modeled by generating the input data from two normal probability density functions shifted relative to each other. Class imbalance is modeled by the probability of switching between these distributions.
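A minimal sketch of such a data generator is given below, assuming one-dimensional features; the parameter names p_switch, mu0, mu1 and sigma are introduced here for illustration. The switching probability selects the class of each object, and the class-conditional normal densities differ only by a shift of the mean.

```python
import numpy as np

def generate_data(n, p_switch=0.5, mu0=0.0, mu1=2.0, sigma=1.0, seed=0):
    """Generate a one-dimensional two-class data set.

    The class of each object is chosen with the switching probability p_switch
    (which controls class imbalance); its feature is drawn from a normal density
    centred at mu0 or mu1 (the shift mu1 - mu0 controls class overlap).
    """
    rng = np.random.default_rng(seed)
    y = (rng.random(n) < p_switch).astype(int)           # class labels
    x = rng.normal(np.where(y == 1, mu1, mu0), sigma)    # shifted normal densities
    return x.reshape(-1, 1), y

# Strong overlap (small shift) and strong imbalance (p_switch far from 0.5);
# the concrete values are assumed for illustration.
X, y = generate_data(1000, p_switch=0.9, mu0=0.0, mu1=0.5, sigma=1.0)
```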
It is shown that decreasing the distance between the mathematical expectations of the normal density functions, or increasing the variance of the random variables, increases the overlap of the corresponding classes, which in turn increases the number of objects that the classifier may assign to either class.
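For two equal-variance normal densities (an assumption matching the simulation model above), the overlap area has the closed form 2Φ(−Δ/(2σ)), where Δ is the distance between the means and Φ is the standard normal distribution function. The short sketch below, written only for illustration, shows how this overlap grows as Δ decreases or σ increases.

```python
from scipy.stats import norm

def overlap_coefficient(mu0, mu1, sigma):
    """Overlap area of two equal-variance normal densities: 2 * Phi(-|mu1 - mu0| / (2 * sigma))."""
    return 2.0 * norm.cdf(-abs(mu1 - mu0) / (2.0 * sigma))

# Smaller distance between the means, or larger sigma, yields larger overlap.
for delta, sigma in [(3.0, 1.0), (1.0, 1.0), (1.0, 2.0)]:
    print(f"delta={delta}, sigma={sigma}: overlap={overlap_coefficient(0.0, delta, sigma):.3f}")
```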
As the probability of switching between the distributions approaches the extreme values of the unit interval, class imbalance increases, which manifests itself as a growing number of elements of the input data set labeled with the same class label.
It has been experimentally confirmed that the AUC ROC metric, popular in binary classification problems, depends on the degree of class overlap and is relatively robust to class imbalance.
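The sketch below reproduces this kind of experiment in spirit only: it regenerates data as in the simulation model described above, trains scikit-learn's LogisticRegression, and reports the ROC AUC for several assumed values of the mean shift (overlap) and the switching probability (imbalance); the specific numbers are illustrative, not the article's results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def auc_for(p_switch, shift, n=5000, sigma=1.0, seed=0):
    """ROC AUC of logistic regression on data with a given switching
    probability (class imbalance) and mean shift (class overlap)."""
    rng = np.random.default_rng(seed)
    y = (rng.random(n) < p_switch).astype(int)
    X = rng.normal(np.where(y == 1, shift, 0.0), sigma).reshape(-1, 1)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)
    proba = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    return roc_auc_score(y_te, proba)

# AUC drops noticeably as the shift between the means shrinks (more overlap) ...
for shift in (3.0, 2.0, 1.0, 0.5):
    print(f"shift={shift}: AUC={auc_for(0.5, shift):.3f}")

# ... but changes comparatively little as the class imbalance grows.
for p in (0.5, 0.7, 0.9):
    print(f"p_switch={p}: AUC={auc_for(p, 2.0):.3f}")
```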