Information Technology for Text Classification Tasks Using Large Language Models

2025; pp. 201–214
Authors:
1
Lviv Polytechnic National University Department of Applied Linguistics, Ukraine

The article addresses the problem of text classification in the context of growing information flows and the need for automated content analysis. A universal information technology is proposed, combining classical machine learning methods with the potential of Large Language Models for processing news, scientific, literary, journalistic and legal texts. Using the BBC News corpus (2225 texts), k-means clustering with TF-IDF demonstrated clear thematic grouping. The scientific contribution lies in the development of a methodological framework capable of transitioning from statistical to semantic classification models. The technology can be implemented in education, research, media and legal analytics. Future directions include multimodal data integration and explainability mechanisms for decision-support systems.
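
The statistical stage mentioned in the abstract (TF-IDF weighting followed by k-means clustering) can be illustrated with a minimal sketch. It is not the authors' implementation. The six-sentence toy corpus stands in for the BBC News corpus, and all function names, the smoothed IDF variant and the deterministic farthest-point seeding are assumptions made for illustration only.

```python
import math
import re
from collections import Counter

# Toy stand-in corpus (hypothetical; NOT the BBC News data from the article).
DOCS = [
    "football match goal striker team",
    "team wins football cup final",
    "striker scores goal football league",
    "stock market shares bank profit",
    "bank raises interest rates market",
    "shares fall market investors profit",
]


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return L2-normalised TF-IDF vectors as sparse dicts {term: weight}."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # Assumed smoothed IDF: log((1 + N) / (1 + df)) + 1
        vec = {
            t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
            for t, c in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in set(a) | set(b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    return {t: w / len(vectors) for t, w in total.items()}


def kmeans(vectors, k, max_iter=50):
    """Lloyd-style k-means with deterministic farthest-point seeding."""
    # Seeding is a simplification for reproducibility; it is not k-means++.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        )
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every document.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(v, centroids[j])) for v in vectors
        ]
        if new_labels == labels:
            break  # converged: assignments no longer change
        labels = new_labels
        # Update step: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = mean_vector(members)
    return labels


labels = kmeans(tfidf_vectors(DOCS), k=2)
print(labels)  # → [0, 0, 0, 1, 1, 1]
```

On this toy input the sports and finance sentences fall into separate clusters because their vocabularies do not overlap. A real corpus would also need stop-word handling and, typically, a library implementation such as scikit-learn's `TfidfVectorizer` and `KMeans`.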
