This article is dedicated to the study and analysis of grammatical error correction (GEC) in Ukrainian-language texts, a significant task in natural language processing (NLP). The paper addresses the specific challenges that automatic error correction systems face due to peculiarities of the Ukrainian language, such as its morphological complexity and context dependence. Examples of typical errors are provided, and the reasons why existing GEC methods often prove insufficient for Ukrainian are analysed. The literature review covers recent research and publications in the GEC field, particularly for other languages, and highlights approaches that can be adapted to Ukrainian. Special attention is given to existing Ukrainian text corpora, such as UA-GEC and others used for training machine learning models; their volume, text types, specifications, advantages, and disadvantages are described. Natural language processing tools that support Ukrainian, such as LanguageTool, NLP-uk, Stanza, NLP-Cube, pymorphy2, and Tree_stem, are examined, and their functionality, performance, and accuracy are analysed. Pre-trained machine learning models, including mBART50 and mT5, were adapted to Ukrainian, and their effectiveness in GEC tasks is described. The article presents practical aspects of applying these models and corpora to automatic grammatical error correction of Ukrainian texts: the process of adapting the models to the specifics of the Ukrainian language is detailed, practical case examples are provided, and the results are analysed. A significant part of the paper describes the developed decision support system for correcting errors in Ukrainian-language texts. The system's architecture, its main components, and its processes are presented through UML diagrams; the input and output data are described, along with an analysis of the obtained results, demonstrating the effectiveness of the proposed solutions. The results of this study can be useful for NLP system developers, researchers in text processing, and educational institutions focused on improving the quality of written Ukrainian.
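As an illustration of the model-adaptation workflow mentioned above, the sketch below shows how a sequence-to-sequence model of the mT5/mBART50 family could be applied to correct a Ukrainian sentence via the Hugging Face transformers library. This is a minimal sketch under stated assumptions: the checkpoint name `your-org/mt5-base-uk-gec` is a hypothetical placeholder for a model fine-tuned on UA-GEC, and the snippet does not reproduce the exact configuration used in the paper's system.

```python
# Minimal illustrative sketch (not the paper's implementation):
# applying a seq2seq model to Ukrainian grammatical error correction.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Hypothetical checkpoint name; substitute a model actually fine-tuned on UA-GEC.
MODEL_NAME = "your-org/mt5-base-uk-gec"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def correct(sentence: str) -> str:
    """Return the model's corrected version of a Ukrainian sentence."""
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    # Beam search tends to give more fluent corrections than greedy decoding.
    outputs = model.generate(**inputs, max_new_tokens=128, num_beams=4)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(correct("Я дуже хочу навчитися писати без помилок."))
```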
- Bryant, C., Yuan, Z., Qorib, M. R., Cao, H., Ng, H. T., & Briscoe, T. (2023). Grammatical Error Correction: A Survey of the State of the Art. Computational Linguistics, 49(3), 643–701. https://doi.org/10.48550/arXiv.2211.05166
- Chomsky, N. (1961). On the notion "rule of grammar" (pp. 155–210). USA: American Mathematical Society.
- Naghshnejad, M., Joshi, T., & Nair, V. N. (2020). Recent Trends in the Use of Deep Learning Models for Grammar Error Handling. arXiv:2009.02358.
- Brockett, C., Dolan, W. B., & Gamon, M. (2006). Correcting ESL Errors Using Phrasal SMT Techniques. Association for Computational Linguistics, Proceedings of the 21st Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, 249–256. https://doi.org/10.3115/1220175.1220207
- Yoshimoto, I., Kose, T., Mitsuzawa, K., Sakaguchi, K., Mizumoto, T., Hayashibe, Y., Komachi, M., & Matsumoto, Y. (2013). NAIST at 2013 CoNLL Grammatical Error Correction Shared Task. Association for Computational Linguistics, 26–33. https://aclanthology.org/W13-3604
- Felice, M., Yuan, Z., Andersen, E., Yannakoudakis, H., & Kochmar, E. (2014). Grammatical error correction using hybrid systems and type filtering. Association for Computational Linguistics, 15–24. https://doi.org/10.3115/v1/W14-1702
- Junczys-Dowmunt, M., & Grundkiewicz, R. (2014). The AMU System in the CoNLL-2014 Shared Task: Grammatical Error Correction by Data-Intensive and Feature-Rich Statistical Machine Translation. Association for Computational Linguistics, Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, 25–33. https://doi.org/10.3115/v1/W14-1703
- Cho, K., Merriënboer, B. V., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. Association for Computational Linguistics, 1724–1734. https://doi.org/10.3115/v1/D14-1179.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. https://doi.org/10.48550/arXiv.1706.03762
- Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al. (2020). Transformers: State-of-the-Art Natural Language Processing. Association for Computational Linguistics, 38–45. https://doi.org/10.18653/v1/2020.emnlp-demos.6
- LanguageTool – About. https://languagetool.org/about
- LanguageTool API NLP UK. GitHub. https://github.com/brown-uk/nlp_uk
- Stanza – A Python NLP Package for Many Human Languages. Stanford NLP. https://stanfordnlp.github.io/stanza
- NLP-Cube. GitHub. https://github.com/adobe/NLP-Cube
- pymorphy2. GitHub. https://github.com/pymorphy2/pymorphy2
- Tree_stem. GitHub. https://github.com/amakukha/stemmers_ukrainian
- mT5: Multilingual T5. GitHub. https://github.com/google-research/multilingual-t5
- Multilingual Machine Translation. GitHub. https://github.com/facebookresearch/fairseq/tree/main/examples/m2m_100
- mBART50. GitHub. https://github.com/facebookresearch/fairseq/tree/main/examples/multiling... models
- Ukrainian RoBERTa base model. Hugging Face. https://huggingface.co/youscan/ukr-roberta-base
- Uk-punctcase model. Hugging Face. https://huggingface.co/ukr-models/uk-punctcase
- Ukrainian model to restore punctuation and capitalization. Hugging Face. https://huggingface.co/dchaplinsky/punctuation_uk_bert
- XLM-RoBERTa Base Uk model. Hugging Face. https://huggingface.co/ukr-models/xlm-roberta-base-uk
- Chaplynskyi, D. (2023). Introducing UberText 2.0: A Corpus of Modern Ukrainian at Scale. Association for Computational Linguistics, Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP), 1–10. https://doi.org/10.18653/v1/2023.unlp-1.1
- Abadji, J., Suarez, P. O., Romary, L., & Sagot, B. (2022). Towards a Cleaner Document-Oriented Multilingual Crawled Corpus. European Language Resources Association, Proceedings of the Thirteenth Language Resources and Evaluation Conference, 4344–4355. https://aclanthology.org/2022.lrec-1.463
- Darchuk, N. (2017). Possibilities of semantic marking of the corpus of the Ukrainian language (KUM). Digital Repository Dragomanov Ukrainian State University. https://enpuir.npu.edu.ua/handle/123456789/17838
- Shvedova, M., et al. (2017–2022). General Regionally Annotated Corpus of Ukrainian (GRAC). Network for Ukrainian Studies Jena.
- BRUK: Brown Corpus of the Ukrainian Language (Braunskyi korpus ukrainskoi movy). GitHub. https://github.com/brown-uk/corpus
- Kotsyba, N., et al. (2018). Laboratorija ukrajins'koji [Ukrainian Language Laboratory]. https://mova.institute/
- UA-GEC. GitHub. https://github.com/grammarly/ua-gec
- Syvokon, O., & Nahorna, O. (2021). UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language. https://doi.org/10.48550/arXiv.2103.16997
- Syvokon, O., Nahorna, O., Kuchmiichuk, P., & Osidach, N. (2023). UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language. Association for Computational Linguistics, Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP), 96–102. https://doi.org/10.18653/v1/2023.unlp-1.12
- Bondarenko, M., et al. (2023). Comparative Study of Models Trained on Synthetic Data for Ukrainian Grammatical Error Correction. Association for Computational Linguistics, Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP), 103–113. https://doi.org/10.18653/v1/2023.unlp-1.13
- Romanyshyn, M. (Ed.). (2023). Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP). Association for Computational Linguistics. https://aclanthology.org/2023.unlp-1.pdf
- Didenko, B., & Sameliuk, A. (2023). RedPenNet for Grammatical Error Correction: Outputs to Tokens, Attentions to Spans. Association for Computational Linguistics, Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP), 121–131. https://doi.org/10.18653/v1/2023.unlp-1.15
- Gomez, F. P., Rozovskaya, A., & Roth, D. (2023). A Low-Resource Approach to the Grammatical Error Correction of Ukrainian. Association for Computational Linguistics, Proceedings of the Second Ukrainian Natural Language Processing Workshop (UNLP), 114–120. https://doi.org/10.18653/v1/2023.unlp-1.14.
- Vysotska, V. (2024). Linguistic intellectual analysis methods for Ukrainian textual content processing. CEUR Workshop Proceedings. https://ceur-ws.org/Vol-3722/paper25.pdf.
- Vysotska, V. (2024). Linguistic intellectual analysis methods for Ukrainian textual content processing. CEUR Workshop Proceedings. https://ceur-ws.org/Vol-3722/paper18.pdf.
- Vysotska, V., Holoshchuk, S., & Holoshchuk, R. (2021). A Comparative Analysis for English and Ukrainian Texts Processing Based on Semantics and Syntax Approach. https://ceur-ws.org/Vol-2870/paper26.pdf.
- Vysotska, V. (2024). Computer Linguistic Systems Design and Development Features for Ukrainian Language Content Processing. In COLINS (3) (pp. 229–271). https://ceur-ws.org/Vol-3688/paper18.pdf.
- Kholodna, N., et al. (2022, November). Machine Learning Model for Paraphrases Detection Based on Text Content Pair Binary Classification. In MoMLeT+DS (pp. 283–306). https://ceur-ws.org/Vol-3312/paper23.pdf
- Lytvyn, V., et al. (2023). Identification and Correction of Grammatical Errors in Ukrainian Texts Based on Machine Learning Technology. Mathematics, 11(4), 904. https://doi.org/10.3390/math11040904
- Kholodna, N., et al. (2021). A Machine Learning Model for Automatic Emotion Detection from Speech. In MoMLeT+DS (pp. 699–713). https://ceur-ws.org/Vol-2917/paper42.pdf.
- Kholodna, N., et al. (2023). Technology for grammatical errors correction in Ukrainian text content based on machine learning methods. Radio Electronics, Computer Science, Control, 1, 114. https://doi.org/10.15588/1607-3274-2023-1-12