Automated error detection in Ukrainian texts is becoming particularly relevant amid the growth of digital content. A mathematical model of a decision support system for detecting errors in Ukrainian-language texts has been developed. Error identification is treated as a multi-class classification task at the token level that takes the textual context into account. Probabilistic models are proposed to determine the error type from the tokens surrounding a given position in the text. The study establishes the value of training samples that combine real and artificially created errors, which keeps the learning process balanced. Text-vectorization approaches that account for the morphological and syntactic structure of the Ukrainian language are shown to increase model accuracy. Integrating contextual information is found to significantly improve error-identification results. Detailed DFD diagrams have been constructed that formalize the system's processes and the interaction of its components. The ukr-roberta-base model was trained experimentally on the UA-GEC corpus for the task of identifying errors in Ukrainian texts, achieving the following quality results: F1 – 0.736, accuracy – 0.76, precision – 0.85, recall – 0.65. Examples of the model's performance on test data are provided. The model already detects punctuation and basic spelling errors, which indicates its effectiveness and promise for further development. Prospects for further research include scaling the developed model and extending its coverage to more complex types of language errors.
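The token-level, multi-class framing and the reported metrics can be illustrated with a minimal sketch. The label names and the toy sentence below are assumptions for illustration only (the paper's actual tag set is not given here); the final check confirms that the reported precision (0.85) and recall (0.65) are consistent with the reported F1 of 0.736 as their harmonic mean.

```python
# Sketch of token-level, multi-class error evaluation.
# Label names ("OK", "SPELL_ERR", "PUNCT_ERR") and the toy sentence
# are illustrative assumptions, not taken from the paper.

def precision_recall_f1(gold, pred, ok_label="OK"):
    """Micro-averaged precision/recall/F1 over the error labels,
    ignoring tokens correctly or incorrectly tagged as OK-vs-OK."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p and g != ok_label)
    fp = sum(1 for g, p in zip(gold, pred) if p != ok_label and g != p)
    fn = sum(1 for g, p in zip(gold, pred) if g != ok_label and g != p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: one spelling error found, one punctuation error missed.
tokens = ["Це", "речення", "з", "помилкою", "."]
gold = ["OK", "OK", "OK", "SPELL_ERR", "PUNCT_ERR"]
pred = ["OK", "OK", "OK", "SPELL_ERR", "OK"]
p, r, f1 = precision_recall_f1(gold, pred)

# Cross-check of the reported scores: F1 is the harmonic mean
# of precision and recall, so 0.85 and 0.65 yield about 0.737,
# consistent with the reported F1 of 0.736 (likely truncated).
f1_reported = 2 * 0.85 * 0.65 / (0.85 + 0.65)
```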
1. Bryant, C., Yuan, Z., Qorib, M. R., Cao, H., Ng, H. T., & Briscoe, T. (2023). Grammatical Error Correction: A Survey of the State of the Art. Computational Linguistics, 49(3), 643–701. https://doi.org/10.1162/coli_a_00478
2. Smith, O. B., Ilori, J. O., & Onesirosan, P. (1984). The proximate composition and nutritive value of the winged bean Psophocarpus tetragonolobus (L.) DC for broilers. Animal Feed Science and Technology, 11(1), 231–237. https://doi.org/10.1016/0377-8401(84)90066-X
3. Grammarly Inc. Free Grammar Checker. Retrieved from: https://www.grammarly.com/grammar-check
4. Meet UA-GEC – a grammar correction dataset for the Ukrainian language. Retrieved from: https://dou.ua/forums/topic/33272/
5. Syvokon, O., Nahorna, O., Kuchmiichuk, P., & Osidach, N. (2023). UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language. Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP), 96–102. https://doi.org/10.18653/v1/2023.unlp-1.12
6. Syvokon, O., & Nahorna, O. (2021). UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language. ArXiv. https://doi.org/10.48550/arXiv.2103.16997
7. Writing correctly is easy. OnlineCorrector. Retrieved from: https://onlinecorrector.com.ua/ [in Ukrainian]
8. Omelianchuk, K., Atrasevych, V., Chernodub, A. N., & Skurzhanskyi, O. (2020). GECToR – Grammatical Error Correction: Tag, Not Rewrite. Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications, 163–170. https://doi.org/10.18653/v1/2020.bea-1.16
9. HuggingFace. Transformers Documentation. Retrieved from: https://huggingface.co/docs/transformers/index
10. Katinskaia, A., & Yangarber, R. (2024). GPT-3.5 for Grammatical Error Correction. ArXiv. https://doi.org/10.48550/arXiv.2405.08469
11. Luhtaru, A., Korotkova, E., & Fishel, M. (2024). No Error Left Behind: Multilingual Grammatical Error Correction with Pre-trained Translation Models. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), 1209–1222. Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.eacl-long.73
12. Shvedova, M., et al. (2017–2022). General Regionally Annotated Corpus of Ukrainian Language (GRAC). Network for Ukrainian Studies Jena.
13. Ukrainian RoBERTa base model. Hugging Face. Retrieved from: https://huggingface.co/youscan/ukr-roberta-base
14. Kaggle. Learn Documentation. Retrieved from: https://www.kaggle.com/learn
15. Stanza – A Python NLP Package for Many Human Languages. Stanford NLP Group. Retrieved from: https://stanfordnlp.github.io/stanza