Method for Detection of Disinformation Based on Text Data Analysis Using TF-IDF and Contextual Vector Representations

Olga Lozynska; Victoria Vysotska; Oksana Markiv; Marian Kuspis

The article considers an approach to detecting fake news in the digital environment through text analysis using machine learning and natural language processing methods. The proposed method is based on a hybrid text representation that combines frequency features (TF-IDF) with contextual embeddings obtained using the IBM Granite model. A complete data processing cycle was developed, covering exploratory data analysis (EDA), text preprocessing and tokenization, construction of vector representations, training of a logistic regression model, and computation of key metrics. The main preprocessing steps were converting all characters to lowercase, removing URLs and HTML tags, stripping non-letter characters and extra whitespace, removing duplicates to avoid overfitting, and unifying the values of specific fields. The cleaned texts were vectorized with a combination of TF-IDF and contextual embeddings, which allowed the model to take into account both the statistical significance of terms and their semantic context within the messages. The logistic regression model trained on this hybrid representation achieved an overall accuracy of 82% and balanced F1 scores for the “true” and “fake” classes. To identify the most relevant terms, the TF-IDF feature weights were analyzed using the logistic regression coefficients. The study showed that the model tends to associate truthful information with neutral Ukrainian-language vocabulary, while texts with signs of disinformation often contain Russian-language elements characteristic of propaganda or manipulative messages. Further research will aim to expand the dataset and create new ensemble models to identify sources of disinformation.
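The preprocessing steps and the hybrid representation described in the abstract can be illustrated with a minimal sketch that uses only the Python standard library. It is not the authors' implementation. All function names (`clean_text`, `deduplicate`, `tfidf_vectors`, `placeholder_embedding`, `hybrid_features`) and the sample texts are hypothetical. In particular, `placeholder_embedding` is a hashed bag-of-words stand-in, not a contextual embedding and not the IBM Granite model; it only marks the place where a real embedding vector would be concatenated with the TF-IDF features.

```python
import hashlib
import math
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")                 # HTML tags
URL_RE = re.compile(r"https?://\S+|www\.\S+")   # URLs
NON_LETTER_RE = re.compile(r"[^\w\s]|[\d_]")    # punctuation, digits, underscores


def clean_text(text):
    """Lowercase, strip HTML tags and URLs, drop non-letters, collapse spaces."""
    text = text.lower()
    text = TAG_RE.sub(" ", text)   # tags first, so URLs inside attributes go too
    text = URL_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    return " ".join(text.split())


def deduplicate(texts):
    """Keep the first occurrence of each non-empty text, preserving order."""
    seen, unique = set(), []
    for t in texts:
        if t and t not in seen:
            seen.add(t)
            unique.append(t)
    return unique


def tfidf_vectors(docs):
    """TF-IDF with raw term counts, smoothed IDF and L2-normalised rows."""
    tokenized = [d.split() for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n = len(docs)
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n) / (1 + df[tok])) + 1.0 for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows


def placeholder_embedding(text, dim=8):
    """ASSUMPTION: hashed bag-of-words stand-in for a real contextual embedding."""
    vec = [0.0] * dim
    for tok in text.split():
        digest = hashlib.sha256(tok.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def hybrid_features(docs, dim=8):
    """Concatenate each TF-IDF row with the dense vector of the same document."""
    vocab, rows = tfidf_vectors(docs)
    return vocab, [row + placeholder_embedding(doc, dim)
                   for row, doc in zip(rows, docs)]


# Hypothetical sample messages; the second duplicates the first after cleaning.
raw = [
    "Breaking <b>NEWS</b>: see https://example.com/a?b=1 NOW!!! 123",
    "breaking news see now",
    "Офіційна заява уряду 2024",
]
cleaned = deduplicate([clean_text(t) for t in raw])
vocab, features = hybrid_features(cleaned)
```

Each row of `features` holds the TF-IDF weights for the vocabulary followed by the dense vector, so a linear classifier such as logistic regression trained on these rows would have one coefficient per term, which is what makes a term-level inspection of the weights possible. Because the regular expressions rely on Unicode-aware `\w`, Cyrillic text is preserved, although apostrophes inside words are removed along with other punctuation.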

Keywords: disinformation; fake; contextual vector representations
