Information technology for text classification using large language models

Maksym Yaromych

The article examines architectural solutions for building distributed online transaction processing systems quickly and efficiently using cloud tooling, cloud-native architectural principles, and database replication methods. It focuses on methods for reducing network latency, optimizing resource usage and, as a consequence, cost, and on data replication and fault tolerance. The article demonstrates how modern cloud solutions and technologies make it possible to build an enterprise-grade distributed online transaction processing system quickly and easily. The solutions used in the article can be applied to individual subsystems or adopted as a holistic architectural approach. The article covers the principles of system architecture design and the choice of technologies for ensuring performance and fault tolerance. It also analyzes modern methods of deploying web applications, including the use of containerization and orchestration to simplify infrastructure management. Automatic scaling mechanisms, which let the system adapt dynamically to changes in load while optimizing resource usage, are considered separately. The proposed methods and approaches are relevant to developers, architects, and researchers who work on optimizing distributed web applications and aim to build high-performance, scalable, and fault-tolerant systems.

text classification

information technology

automated text monitoring

Aggarwal, C. C. & Zhai, C. (2012). Mining text data. Springer. https://doi.org/10.1007/978-1-4614-3223-4
Allam, H., Makuvubre, L. & Gyamfi, B. (2025). Text classification: How machine learning is revolutionizing text categorization. Information, 16(2), 130–177. https://doi.org/10.3390/info16020130
Arthur, D. & Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '07). Society for Industrial and Applied Mathematics. 1027–1035. https://dl.acm.org/doi/10.5555/1283383.1283494
Baly, R., Da San Martino, G. & Glass, J. (2020). We can detect your bias: Predicting the political ideology of news articles [Preprint]. arXiv. https://doi.org/10.48550/arXiv.2010.05338
Barberá, P., Boydstun, A. E. & Linn, S. (2021). Automated text classification of news articles: A practical guide. Political Analysis, 29(1), 19–42. https://doi.org/10.1017/pan.2020.8
Bibi, A., Ihsan, U. & Ashraf, H. (2024). Multilingual sentiment analysis using deep learning: Survey [Preprint]. Preprints. https://doi.org/10.20944/preprints202312.1990.v2
Caliński, T. & Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics, 3(1), 1–27. https://doi.org/10.1080/03610927408827101
Davies, D. L. & Bouldin, D. W. (1979). A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-1(2), 224–227. https://doi.org/10.1109/TPAMI.1979.4766909
Elkan, C. (2003). Using the triangle inequality to accelerate k-means. In Proceedings of the Twentieth International Conference on Machine Learning (ICML '03). AAAI Press. 147–153. https://dl.acm.org/doi/10.5555/3041838.3041857
Henriksson, E., Myntti, A. & Hellström, S. (2024). Automatic register identification for the open web using multilingual deep learning [Preprint]. arXiv. https://arxiv.org/abs/2406.19892
Hong, C. & Oh, T. (2025). Optimization for threat classification of various data types based on ML model and LLM. Scientific Reports, 15(22768). https://doi.org/10.1038/s41598-025-05182-y
Hubert, L. & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193–218. https://doi.org/10.1007/BF01908075
Islam, Z., Ahmed, S. & Khan, M. (2017). Multilingual text classification using information-theoretic features [Master's thesis, Deutsche Nationalbibliothek]. DNB. https://d-nb.info/1077557639/34
Jain, A. K., Murty, M. N. & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264–323. https://doi.org/10.1145/331499.331504
Kim, J. Y., Kwon, H. Y. & Lee, S. (2022). Threat classification model for security information event management focusing on model efficiency. Computers & Security, 120(102789). https://doi.org/10.1016/j.cose.2022.102789
Kreek, R. A., Apostolova, E. & Xu, W. (2018). Training and prediction data discrepancies: Challenges of text classification with noisy, historical data. In Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text. Association for Computational Linguistics. 104–109. https://doi.org/10.18653/v1/W18-6114
Kunanets', N. & Yaromych, M. (2025). Vydilennya kontseptiv u literaturnykh tekstakh iz vykorystannyam velykykh movnykh modelei [Extraction of concepts in literary texts using large language models]. Visnyk nauky ta osvity, 32(2), 343–357. https://doi.org/10.52058/2786-6165-2025-2(32)-343-357
Li, Q., Peng, H. & Li, J. (2022). A survey on text classification: From traditional to deep learning. ACM Transactions on Intelligent Systems and Technology, 13(2), 31. https://doi.org/10.1145/3495162
Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129–137. https://doi.org/10.1109/TIT.1982.1056489
Manning, C. D., Raghavan, P. & Schütze, H. (2008). Introduction to information retrieval. Cambridge University Press. https://doi.org/10.1017/CBO9780511809071
Päpcke, S., Weitin, T. & Herget, K. (2022). Stylometric similarity in literary corpora: Non-authorship clustering and Deutscher Novellenschatz. Digital Scholarship in the Humanities, 38(1), 277–295. https://doi.org/10.1093/llc/fqac039
Pasichnyk, V. & Yaromych, M. (2025). Large language models and ontologies in philological research: Analytical review of sources. Current Issues of the Humanities: Linguistics and Literary Studies, 83(3), 236–250. https://doi.org/10.24919/2308-4863/83-3-35
Pasichnyk, V. V. & Yaromych, M. V. (2025a). Automated formation of technical documentation in the IT field using large language models. Studia Methodologica, 59, 250–273. https://doi.org/10.32782/2307-1222.2025-59-22
Pasichnyk, V. V. & Yaromych, M. V. (2025b). Genre classification of literature by metrics using large language models. Scientific Works of the Interregional Academy of Personnel Management. Philology, 1(15), 60–68. https://doi.org/10.32689/maup.philol.2025.1.11
Pei, Z. (2022). The impact of semantic and stylistic features in genre classification for news [Master's thesis, Uppsala University]. DiVA Portal. http://uu.diva-portal.org/smash/get/diva2:1670078/FULLTEXT01.pdf
Rao, Y., Li, Y. & Zhang, X. (2022). A method for classifying information in education policy texts based on an improved attention mechanism model. Wireless Communications and Mobile Computing, 2022, (5467572). https://doi.org/10.1155/2022/5467572
Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65. https://doi.org/10.1016/0377-0427(87)90125-7
Strehl, A. & Ghosh, J. (2002). Cluster ensembles–A knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3, 583–617. https://jmlr.org/papers/v3/strehl02a.html
Sulea, O.-M., Zampieri, M. & Malmasi, S. (2017). Exploring the use of text classification in the legal domain [Preprint]. arXiv. https://doi.org/10.48550/arXiv.1710.09306
Taha, K., Yoo, P. D. & Yeun, C. (2024). A comprehensive survey of text classification techniques and their research applications: Observational and experimental insights. Computer Science Review, 54, Article 100664. https://doi.org/10.1016/j.cosrev.2024.100664
Wiegreffe, S. & Pinter, Y. (2019). Attention is not not explanation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics. 11–20. https://doi.org/10.18653/v1/D19-1002
Wyawhare, A., Jain, S. & Sharma, R. (2023). Comparative analysis of multilingual text classification & identification through deep learning and embedding visualization [Preprint]. arXiv. https://arxiv.org/abs/2312.03789
Xiang, L. (2022). Application of an improved TF-IDF method in literary text classification. Advances in Multimedia, 2022, 1–10. https://doi.org/10.1155/2022/9285324
Xu, R. & Wunsch, D. (2009). Clustering. Wiley. https://doi.org/10.1002/9780470382776
Zhao, W. X., Zhou, K., Li, J. & Wen, J.-R. (2023). A survey of large language models [Preprint]. arXiv. https://doi.org/10.48550/arXiv.2303.18223
Zhu, H., Zhang, Y. & Li, X. (2022). The research trends of text classification studies (2000–2022): A bibliometric analysis. Sage Open, 12(1). https://doi.org/10.1177/21582440221089963