Comparison and Clustering of Textual Information Sources Based on the Cosine Similarity Algorithm

Zhengbing Hu; Dmytro Uhryn; Artem Kalancha

This article presents a study aimed at developing an optimal approach to analyzing and comparing information sources from large volumes of text using natural language processing (NLP) methods. The object of the study was a set of Telegram news channels, which served as the sources of text data. The texts were pre-processed (cleaning, tokenization and lemmatization) to form a global dictionary of the unique words found across all information sources. For each source, a vector representation of its texts was constructed whose dimension equals the number of unique words in the global dictionary; each position of the vector holds the frequency of the corresponding word in that channel's texts. Applying the cosine similarity algorithm to every pair of vectors produced a square matrix showing the degree of similarity between sources. The similarity of channels was also analyzed over limited time intervals, which made it possible to identify trends in how their information policies change. The model parameters were optimized for maximum differentiation between channels, which increased the efficiency of the analysis. Clustering algorithms were then applied to divide the channels into groups according to their degree of lexical similarity. The results demonstrate the effectiveness of the proposed approach for quantitatively assessing the similarity of text data from different sources and for clustering it. The proposed method can be used to analyze information sources, identify relationships between them, study how their activity changes over time, and assess the socio-cultural impact of media content.
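
The vectorization and comparison steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy channel texts, and the simplified tokenizer (lowercasing and letter-only matching, with no real lemmatization) are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: a simplified stand-in for cleaning, tokenization and
    # lemmatization. It lowercases the text and keeps runs of letters only.
    return re.findall(r"[^\W\d_]+", text.lower())


def build_vocabulary(token_lists):
    # Global dictionary: the sorted unique words across all sources.
    return sorted({tok for tokens in token_lists for tok in tokens})


def frequency_vector(tokens, vocabulary):
    # One position per dictionary word, holding that word's frequency.
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty sources
    return dot / (norm_u * norm_v)


def similarity_matrix(sources):
    # sources: mapping of source name -> concatenated text of that source.
    names = list(sources)
    token_lists = [tokenize(sources[name]) for name in names]
    vocabulary = build_vocabulary(token_lists)
    vectors = [frequency_vector(t, vocabulary) for t in token_lists]
    matrix = [[cosine_similarity(u, v) for v in vectors] for u in vectors]
    return names, matrix


# Hypothetical toy data standing in for the texts of three channels.
channels = {
    "channel_a": "markets fall as rates rise",
    "channel_b": "rates rise and markets fall again",
    "channel_c": "football team wins the cup",
}

names, matrix = similarity_matrix(channels)
for name, row in zip(names, matrix):
    print(name, [round(x, 2) for x in row])
```

In this toy example the first two channels share most of their vocabulary and therefore score higher against each other than either does against the third. The resulting square matrix is the kind of input on which a clustering step could then operate.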

Keywords: information source; text; similarity; natural language processing
