Method for Detection of Disinformation Based on Text Data Analysis Using TF-IDF and Contextual Vector Representations

2025; pp. 98-110
1 Lviv Polytechnic National University, Lviv, Ukraine
2 Lviv Polytechnic National University, Information Systems and Networks Department; Osnabrück University, Institute of Computer Science
3 Lviv Polytechnic National University, Lviv, Ukraine
4 Lviv Polytechnic National University, Lviv, Ukraine

The article considers an approach to detecting fake news in the digital environment through text analysis using machine learning and natural language processing methods. The proposed method is based on a hybrid text representation that combines frequency features (TF-IDF) with contextual embeddings obtained from the IBM Granite model. A complete data processing cycle was developed, covering exploratory data analysis (EDA), text preprocessing and tokenization, construction of vector representations, training of a logistic regression model, and computation of key metrics. Text preprocessing included converting all characters to lowercase, removing URLs and HTML tags, stripping non-letter characters and excess spaces, eliminating duplicates to avoid re-training, and unifying the values of specific fields. The cleaned texts were vectorized by combining TF-IDF with contextual embeddings, which allowed the model to consider both the statistical significance of terms and their semantic context within the messages. The logistic regression model trained on this hybrid representation achieved an overall accuracy of 82% and balanced F1-measure values for the “true” and “fake” classes. To identify the most relevant terms, the TF-IDF feature weights were analyzed using the logistic regression coefficients. The study showed that the model tends to associate truthful information with Ukrainian-language, neutral vocabulary, while texts with signs of disinformation often contain Russian-language elements characteristic of propaganda or manipulative messages. Further research will be
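The preprocessing steps listed in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `clean_text` and `deduplicate`, the regular expressions, the order of the steps, and the decoding of HTML entities are all assumptions made for illustration. The unification of field values is omitted because the abstract does not specify it, and vectorization and model training are outside the scope of the sketch.

```python
import html
import re

# Assumed patterns; the abstract does not give the exact rules used.
TAG_RE = re.compile(r"<[^>]+>")                # HTML tags
URL_RE = re.compile(r"https?://\S+|www\.\S+")  # URLs
NON_LETTER_RE = re.compile(r"[\W\d_]+")        # runs of anything that is not a Unicode letter


def clean_text(text: str) -> str:
    """Apply the cleaning steps named in the abstract to a single message."""
    text = text.lower()                   # convert all characters to lowercase
    text = html.unescape(text)            # assumption: decode entities such as &amp;
    text = TAG_RE.sub(" ", text)          # remove HTML tags (before URLs, so URLs
                                          # inside attributes vanish with the tag)
    text = URL_RE.sub(" ", text)          # remove URLs
    text = NON_LETTER_RE.sub(" ", text)   # drop digits, punctuation and symbols;
                                          # each run, whitespace included, becomes one space
    return text.strip()


def deduplicate(texts: list[str]) -> list[str]:
    """Keep the first occurrence of each text, preserving the original order."""
    return list(dict.fromkeys(texts))


if __name__ == "__main__":
    raw = [
        "Breaking NEWS!!! <b>Read</b> more at https://example.com/a?b=1  now 2025",
        "breaking news, read more at now",
        "Київ, столиця України, 2025!",
    ]
    for message in deduplicate([clean_text(t) for t in raw]):
        print(message)
```

Because the non-letter pattern relies on Python's Unicode-aware `\W`, Cyrillic letters are preserved, which matters for a corpus containing Ukrainian and Russian text. One side effect of this choice is that apostrophes inside words are also replaced by spaces.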
