DECISION SUPPORT SYSTEM FOR DISINFORMATION, FAKES AND PROPAGANDA DETECTION BASED ON MACHINE LEARNING

2024: 105–116. https://doi.org/10.23939/ujit2024.02.105
Received: September 09, 2024
Accepted: November 19, 2024
1 Lviv Polytechnic National University, Information Systems and Networks Department; Osnabrück University, Institute of Computer Science
2 Lviv Polytechnic National University, Information Systems and Networks Department

Due to the simplification of creating and distributing news via the Internet, and the physical impossibility of checking the large volumes of information circulating in the network, the spread of disinformation and fake news has increased significantly. A decision support system for identifying disinformation, fakes and propaganda based on machine learning has been built, and a method of news text analysis for identifying fakes and predicting the detection of disinformation in news texts has been studied. Detecting fake news is therefore a critical task: it not only ensures that users receive verified and reliable information, but also helps prevent manipulation of public consciousness. Strengthening control over the credibility of news is important for maintaining a reliable ecosystem of the information space. Combining information retrieval (IR) and natural language processing (NLP) allows systems to automatically analyse and track information to detect potential misinformation or fake news; it is also important to consider the context, the source of information, and other factors to determine credibility accurately. Such automated methods can help detect and resolve problems related to the spread of misinformation in social networks in real time. For our experiment, we use a dataset of 20,000 articles: 10,000 entries for fake news and 10,000 for non-fake news, most of them related to politics. For both subsets of the data, basic text cleaning procedures were performed, such as lowercasing the text, removing punctuation marks, cleaning location and author tags, and removing stop words.
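The cleaning step described above can be sketched with the Python standard library alone. This is an illustrative sketch, not the study's implementation: the stop-word list here is a small invented subset (a full list such as NLTK's would normally be used), and tag removal is omitted.

```python
import string

# Illustrative stop-word subset; a full stop-word list was used in the study.
STOP_WORDS = {"the", "a", "an", "in", "of", "to", "and", "is", "are"}

def clean_text(text: str) -> str:
    """Lowercase the text, strip punctuation, and drop stop words."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

print(clean_text("The Senate, in a vote of 51-49, approved the bill."))
# senate vote 5149 approved bill
```

After this normalization, the resulting token streams feed the downstream tokenization and lemmatization steps.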
After cleaning, tokenization and lemmatization were performed. For better lemmatization results, each token was labelled with a part-of-speech (POS) tag; using POS tags helps perform lemmatization more accurately. For both subsets of the data, bigrams and trigrams were created to better capture the context of the articles in the dataset. It was found that non-fake news uses a more formal language style. Next, we performed sentiment analysis on both subsets of the data: the fake sub-dataset contains more negative scores, while the non-fake sub-dataset has mostly positive scores. The subsets were combined before building the prediction model, which used bag-of-words (BOW) features and logistic regression. The F1 score is 0.98 for both the fake and non-fake classes.
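The prediction step (bag-of-words features plus a logistic-regression classifier) can be sketched end to end on a toy corpus. This stdlib-only sketch illustrates the technique rather than the study's actual pipeline, which a practitioner would typically build with a library such as scikit-learn; the four labelled headlines are invented for illustration.

```python
import math
from collections import Counter

# Toy labelled corpus (1 = fake, 0 = non-fake); headlines invented for illustration.
docs = [
    ("shocking secret cure they hide from you", 1),
    ("you will not believe this shocking miracle", 1),
    ("parliament approves annual budget report", 0),
    ("minister presents official economic report", 0),
]

# Bag-of-words: fixed vocabulary, one count per vocabulary word.
vocab = sorted({w for text, _ in docs for w in text.split()})

def vectorize(text):
    counts = Counter(text.split())
    return [counts.get(w, 0) for w in vocab]

X = [vectorize(text) for text, _ in docs]
y = [label for _, label in docs]

# Logistic regression trained with plain stochastic gradient descent.
w = [0.0] * len(vocab)
b = 0.0
lr = 0.5
for _ in range(200):
    for xi, yi in zip(X, y):
        z = b + sum(wj * xj for wj, xj in zip(w, xi))
        p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
        err = p - yi                     # gradient of log loss w.r.t. z
        w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
        b -= lr * err

def predict(text):
    """Return the estimated probability that the text is fake."""
    z = b + sum(wj * xj for wj, xj in zip(w, vectorize(text)))
    return 1.0 / (1.0 + math.exp(-z))

print(predict("shocking secret miracle"))  # high probability -> fake
print(predict("official budget report"))   # low probability -> non-fake
```

On the real 20,000-article dataset the same idea scales by replacing the toy vocabulary with the full corpus vocabulary and evaluating with the F1 score on held-out data.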
