Information Technology for Text Classification Tasks Using Large Language Models

Maksym Yaromych

The article addresses the problem of text classification amid growing information flows and the resulting need for automated content analysis. It proposes a universal information technology that combines classical machine learning methods with Large Language Models to process news, scientific, literary, journalistic and legal texts. On the BBC News corpus (2225 texts), k-means clustering over TF-IDF features produced clear thematic groupings. The scientific contribution is a methodological framework that supports the transition from statistical to semantic classification models. The technology can be applied in education, research, media and legal analytics. Future directions include multimodal data integration and explainability mechanisms for decision-support systems.

Keywords: text classification; information technology; k-means; TF-IDF; Large Language Models; semantic analysis; automated text monitoring
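
The statistical stage named in the abstract, TF-IDF weighting followed by k-means clustering, can be illustrated with a minimal sketch. It is not the author's implementation. The six-document toy corpus, the whitespace tokenizer, the smoothed IDF formula, the deterministic farthest-first seeding and the function names tfidf_vectors and kmeans are all assumptions made for illustration; the article does not state its exact weighting or initialization, and the BBC News corpus is not used here.

```python
import math
from collections import Counter

# Hypothetical toy corpus (NOT the BBC News corpus): three sport-like
# and three business-like documents.
docs = [
    "football match goal team win",
    "team win league football season",
    "goal striker football team",
    "market shares bank profit rise",
    "bank profit economy market growth",
    "economy growth shares market",
]


def tfidf_vectors(texts):
    """Return L2-normalised TF-IDF vectors (dense lists) and the vocabulary."""
    tokenized = [t.lower().split() for t in texts]  # assumed: whitespace tokens
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(texts)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Assumed weighting: smoothed IDF, ln((1 + n) / (1 + df)) + 1.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = [tf[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vectors, vocab


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, n_iter=50):
    """Lloyd-style k-means with deterministic farthest-first seeding."""
    # Assumed seeding: start from the first vector, then repeatedly add the
    # vector farthest from all centroids chosen so far.
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(farthest))
    labels = None
    for _ in range(n_iter):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors
        ]
        if new_labels == labels:  # assignments stable, so stop
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


vectors, vocab = tfidf_vectors(docs)
labels = kmeans(vectors, k=2)
print(labels)
```

On this toy input the three sport-like documents share one label and the three business-like documents share the other. A real corpus would normally call for a library implementation with sparse vectors, and a semantic stage based on a Large Language Model would replace or supplement the TF-IDF features.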
