The article addresses the problem of text classification in the context of growing information flows and the need for automated content analysis. A universal information technology is proposed that combines classical machine learning methods with the capabilities of Large Language Models for processing news, scientific, literary, journalistic, and legal texts. On the BBC News corpus (2225 texts), k-means clustering over TF-IDF features produced clear thematic grouping. The scientific contribution is a methodological framework that supports the transition from statistical to semantic classification models. The technology can be applied in education, research, media, and legal analytics. Future directions include multimodal data integration and explainability mechanisms for decision-support systems.
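As a rough illustration of the statistical baseline mentioned above (TF-IDF features followed by k-means), the sketch below clusters a toy four-sentence corpus that merely stands in for BBC News. The stop-word list, the smoothed IDF variant, the farthest-first seeding, and all function names are assumptions made for this example and are not taken from the article's implementation.

```python
import math
from collections import Counter

# Hypothetical toy corpus: two sports-like and two business-like sentences.
TEXTS = [
    "the match ended with a late goal for the home team",
    "the striker scored a goal in the cup match",
    "the bank raised interest rates as inflation grew",
    "inflation and interest rates worry the central bank",
]
# Tiny illustrative stop-word list (an assumption, not a standard resource).
STOP_WORDS = {"the", "a", "with", "for", "in", "as", "and", "of"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return L2-normalised sparse TF-IDF vectors (term -> weight dicts)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Raw term count times a smoothed inverse document frequency.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    return sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in a.keys() | b.keys())


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def farthest_first(vectors, k):
    """Deterministic seeding: start at the first vector, then repeatedly add
    the vector farthest from all centroids chosen so far."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids)))
    return centroids


def kmeans(vectors, k, max_iter=50):
    """Lloyd-style k-means; returns one cluster label per input vector."""
    centroids = farthest_first(vectors, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j])) for v in vectors]
        if new_labels == labels:  # assignments stable, so the loop has converged
            break
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean_vector(members)
    return labels


labels = kmeans(tfidf_vectors([tokenize(t) for t in TEXTS]), k=2)
print(labels)
```

On a real corpus, a library implementation with k-means++ seeding and several restarts would normally replace the hand-written loop; the pure-Python version is used here only to keep the example self-contained.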
- Aggarwal, C. C. & Zhai, C. (2012). Mining text data. Springer. https://doi.org/10.1007/978-1-4614-3223-4
- Allam, H., Makuvubre, L. & Gyamfi, B. (2025). Text classification: How machine learning is revolutionizing text categorization. Information, 16(2), 130–177. https://doi.org/10.3390/info16020130
- Arthur, D. & Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '07). Society for Industrial and Applied Mathematics. 1027–1035. https://dl.acm.org/doi/10.5555/1283383.1283494
- Baly, R., Da San Martino, G. & Glass, J. (2020). We can detect your bias: Predicting the political ideology of news articles [Preprint]. arXiv. https://doi.org/10.48550/arXiv.2010.05338
- Barberá, P., Boydstun, A. E. & Linn, S. (2021). Automated text classification of news articles: A practical guide. Political Analysis, 29(1), 19–42. https://doi.org/10.1017/pan.2020.8
- Bibi, A., Ihsan, U. & Ashraf, H. (2024). Multilingual sentiment analysis using deep learning: Survey [Preprint]. Preprints. https://doi.org/10.20944/preprints202312.1990.v2
- Caliński, T. & Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics, 3(1), 1–27. https://doi.org/10.1080/03610927408827101
- Davies, D. L. & Bouldin, D. W. (1979). A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-1(2), 224–227. https://doi.org/10.1109/TPAMI.1979.4766909
- Elkan, C. (2003). Using the triangle inequality to accelerate k-means. In Proceedings of the Twentieth International Conference on Machine Learning (ICML '03). AAAI Press. 147–153. https://dl.acm.org/doi/10.5555/3041838.3041857
- Henriksson, E., Myntti, A. & Hellström, S. (2024). Automatic register identification for the open web using multilingual deep learning [Preprint]. arXiv. https://arxiv.org/abs/2406.19892
- Hong, C. & Oh, T. (2025). Optimization for threat classification of various data types based on ML model and LLM. Scientific Reports, 15(22768). https://doi.org/10.1038/s41598-025-05182-y
- Hubert, L. & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193–218. https://doi.org/10.1007/BF01908075
- Islam, Z., Ahmed, S. & Khan, M. (2017). Multilingual text classification using information-theoretic features [Master's thesis, Deutsche Nationalbibliothek]. DNB. https://d-nb.info/1077557639/34
- Jain, A. K., Murty, M. N. & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264–323. https://doi.org/10.1145/331499.331504
- Kim, J. Y., Kwon, H. Y. & Lee, S. (2022). Threat classification model for security information event management focusing on model efficiency. Computers & Security, 120(102789). https://doi.org/10.1016/j.cose.2022.102789
- Kreek, R. A., Apostolova, E. & Xu, W. (2018). Training and prediction data discrepancies: Challenges of text classification with noisy, historical data. In Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text. Association for Computational Linguistics. 104–109. https://doi.org/10.18653/v1/W18-6114
- Kunanets', N. & Yaromych, M. (2025). Vydilennya kontseptiv u literaturnykh tekstakh iz vykorystannyam velykykh movnykh modelei [Extraction of concepts in literary texts using large language models]. Visnyk nauky ta osvity, 32(2), 343–357. https://doi.org/10.52058/2786-6165-2025-2(32)-343-357
- Li, Q., Peng, H. & Li, J. (2022). A survey on text classification: From traditional to deep learning. ACM Transactions on Intelligent Systems and Technology, 13(2), 31. https://doi.org/10.1145/3495162
- Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129–137. https://doi.org/10.1109/TIT.1982.1056489
- Manning, C. D., Raghavan, P. & Schütze, H. (2008). Introduction to information retrieval. Cambridge University Press. https://doi.org/10.1017/CBO9780511809071
- Päpcke, S., Weitin, T. & Herget, K. (2022). Stylometric similarity in literary corpora: Non-authorship clustering and Deutscher Novellenschatz. Digital Scholarship in the Humanities, 38(1), 277–295. https://doi.org/10.1093/llc/fqac039
- Pasichnyk, V. & Yaromych, M. (2025). Large language models and ontologies in philological research: Analytical review of sources. Current Issues of the Humanities: Linguistics and Literary Studies, 83(3), 236–250. https://doi.org/10.24919/2308-4863/83-3-35
- Pasichnyk, V. V. & Yaromych, M. V. (2025a). Automated formation of technical documentation in the IT field using large language models. Studia Methodologica, 59, 250–273. https://doi.org/10.32782/2307-1222.2025-59-22
- Pasichnyk, V. V. & Yaromych, M. V. (2025b). Genre classification of literature by metrics using large language models. Scientific Works of the Interregional Academy of Personnel Management. Philology, 1(15), 60–68. https://doi.org/10.32689/maup.philol.2025.1.11
- Pei, Z. (2022). The impact of semantic and stylistic features in genre classification for news [Master's thesis, Uppsala University]. DiVA Portal. http://uu.diva-portal.org/smash/get/diva2:1670078/FULLTEXT01.pdf
- Rao, Y., Li, Y. & Zhang, X. (2022). A method for classifying information in education policy texts based on an improved attention mechanism model. Wireless Communications and Mobile Computing, 2022, (5467572). https://doi.org/10.1155/2022/5467572
- Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20, 53–65. https://doi.org/10.1016/0377-0427(87)90125-7
- Strehl, A. & Ghosh, J. (2002). Cluster ensembles–A knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3, 583–617. https://jmlr.org/papers/v3/strehl02a.html
- Sulea, O.-M., Zampieri, M. & Malmasi, S. (2017). Exploring the use of text classification in the legal domain [Preprint]. arXiv. https://doi.org/10.48550/arXiv.1710.09306
- Taha, K., Yoo, P. D. & Yeun, C. (2024). A comprehensive survey of text classification techniques and their research applications: Observational and experimental insights. Computer Science Review, 54, Article 100664. https://doi.org/10.1016/j.cosrev.2024.100664
- Wiegreffe, S. & Pinter, Y. (2019). Attention is not not explanation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics. 11–20. https://doi.org/10.18653/v1/D19-1002
- Wyawhare, A., Jain, S. & Sharma, R. (2023). Comparative analysis of multilingual text classification & identification through deep learning and embedding visualization [Preprint]. arXiv. https://arxiv.org/abs/2312.03789
- Xiang, L. (2022). Application of an improved TF-IDF method in literary text classification. Advances in Multimedia, 2022, 1–10. https://doi.org/10.1155/2022/9285324
- Xu, R. & Wunsch, D. (2009). Clustering. Wiley. https://doi.org/10.1002/9780470382776
- Zhao, W. X., Zhou, K., Li, J. & Wen, J.-R. (2023). A survey of large language models [Preprint]. arXiv. https://doi.org/10.48550/arXiv.2303.18223
- Zhu, H., Zhang, Y. & Li, X. (2022). The research trends of text classification studies (2000–2022): A bibliometric analysis. Sage Open, 12(1). https://doi.org/10.1177/21582440221089963