DECISION SUPPORT SYSTEM FOR DETECTING DISINFORMATION, FAKES, AND PROPAGANDA BASED ON MACHINE LEARNING

V. A. Vysotska; Roman Romanchuk

Because creating and distributing news online has become much easier, and because the large volume of information circulating on the network is physically impossible to verify manually, the spread of disinformation and fake news has grown considerably. Detecting false news has therefore become a critical task: it not only ensures that users receive verified and reliable information but also helps prevent the manipulation of public opinion. Tighter control over the reliability of news is important for maintaining a trustworthy information ecosystem. This work builds a decision support system for detecting disinformation, fakes, and propaganda based on machine learning, and studies a method of analyzing news text to identify fakes and predict the presence of disinformation. Combining information retrieval (IR) and natural language processing (NLP) lets systems automatically analyze and track information in order to detect possible disinformation or fake news. Context, the source of the information, and other factors should also be taken into account to determine reliability accurately. Such automatic methods help detect and address, in real time, problems related to the spread of disinformation on social networks. For the experiment we used a dataset of 20,000 articles: 10,000 fake news records and 10,000 non-fake ones. Most of the articles concern politics. Both subsets underwent basic text cleaning: conversion to lowercase, removal of punctuation, removal of location and author tags, removal of stop words, and so on. After cleaning, the text was tokenized and lemmatized. Each token was labeled with a part-of-speech (POS) tag, which makes lemmatization more accurate. Bigrams and trigrams were generated for both subsets to better understand the context of the articles in the dataset. The non-fake news was found to use a more formal style of language.
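The cleaning and n-gram steps described above can be illustrated with a minimal sketch that uses only the Python standard library. Everything named here is an assumption made for illustration: the tiny `STOP_WORDS` set, the `DATELINE` pattern (a guess at what a location/author tag such as "WASHINGTON (Reuters) - " looks like), and the helper names `clean`, `ngrams`, and `top_ngrams`. POS-based lemmatization is left out because it requires an external NLP library and language resources.

```python
import re
import string
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real run would use a full one.
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "on", "and", "is", "was", "that", "for"}

# Assumed shape of a location/author tag, e.g. "WASHINGTON (Reuters) - ".
DATELINE = re.compile(r"^[A-Z][A-Za-z .,/]*\([A-Za-z ]+\)\s*-\s*")


def clean(text):
    """Strip the dateline, lowercase, drop punctuation and stop words; return tokens."""
    text = DATELINE.sub("", text)
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(texts, n, k=3):
    """Count n-grams over all cleaned texts and return the k most frequent."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(clean(text), n))
    return counts.most_common(k)


if __name__ == "__main__":
    # Invented example sentences, not taken from the dataset used in the paper.
    sample = [
        "WASHINGTON (Reuters) - The Senate passed the bill on Tuesday.",
        "The Senate passed a law.",
    ]
    print(top_ngrams(sample, 2))
    print(top_ngrams(sample, 3))
```

Comparing the most frequent bigrams and trigrams of the fake and non-fake subsets side by side is one simple way to surface stylistic differences between them.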
Sentiment was analyzed in both subsets. The results show that the fake subset contains more negative sentiment scores, whereas the non-fake subset contains mostly positive ones. The subsets were merged before the prediction model was built. The prediction model uses bag-of-words (BOW) features and logistic regression. The F1 score is 0.98 for both classes, fake and non-fake.
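As a rough illustration of the prediction step, the sketch below builds bag-of-words count vectors, fits a logistic regression by plain batch gradient descent, and computes a per-class F1 score. It is not the authors' implementation: the four training sentences are invented, the learning rate and epoch count are arbitrary, and all function names are hypothetical. A library implementation would normally replace the hand-written training loop, and a toy example like this says nothing about the reported 0.98.

```python
import math
from collections import Counter


def build_vocab(docs):
    """Map each distinct token to a column index."""
    vocab = {}
    for doc in docs:
        for tok in doc.split():
            vocab.setdefault(tok, len(vocab))
    return vocab


def bow(doc, vocab):
    """Bag-of-words count vector; out-of-vocabulary tokens are ignored."""
    vec = [0.0] * len(vocab)
    for tok, n in Counter(doc.split()).items():
        if tok in vocab:
            vec[vocab[tok]] = float(n)
    return vec


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(X, y, lr=0.5, epochs=200):
    """Fit weights and bias by batch gradient descent on the log loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(x, w, b):
    """Return 1 (fake) if the predicted probability is at least 0.5, else 0."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


def f1(y_true, y_pred, positive):
    """F1 score of one class, treating `positive` as the positive label."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Invented toy data: 1 = fake, 0 = non-fake (already cleaned and lowercased).
docs = [
    "shocking secret exposed",
    "you will not believe this hoax",
    "senate passed the budget bill",
    "ministry published official statement",
]
labels = [1, 1, 0, 0]

vocab = build_vocab(docs)
X = [bow(d, vocab) for d in docs]
w, b = train(X, labels)
preds = [predict(x, w, b) for x in X]
print("F1 fake:", f1(labels, preds, 1), "F1 non-fake:", f1(labels, preds, 0))
```

Scoring on the training sentences, as done here, only checks that the pieces fit together; a meaningful F1 requires a held-out test split of the merged dataset.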

disinformation detection

fake detection

natural language processing

machine learning

news classification
