Data Protection in the Utilization of Natural Language Processors for Trend Analysis and Public Opinion: Cryptographic Aspect

Authors: 

Inna Rozlomii¹, Nataliia Yehorchenkova², Andrii Yarmilko¹, and Serhii Naumenko¹

1. Bohdan Khmelnytsky National University of Cherkasy

2. Slovak University of Technology in Bratislava

In the digital age, the rapid growth in the volume of information generated and processed is accompanied by a growing threat of unauthorized access to that information and of its illegal distribution and use. One promising strategy for protecting information from cyber threats and malicious attacks is the use of natural language processing (NLP). This article focuses on the methodology of data protection when NLP is used for sentiment analysis and trend detection, and emphasizes the relevance of NLP for analyzing text content to identify suspicious or dangerous information. The article covers the stages of text data collection and processing. Data are first gathered from sources such as social media, news portals, forums, and blogs. Preprocessing follows: noise removal, tokenization, stemming, and lemmatization prepare the text for further analysis. NLP then identifies keywords, topics, sentiment, and text structure, which supports categorization and the identification of trends in public opinion. A mathematical model for detecting phishing indicators is presented, along with an example of identifying suspicious text features. Cryptographic methods are noted to secure the processed data effectively, reducing the risk of unauthorized access or misuse. The article describes in detail the data protection methods used during NLP-based sentiment analysis and underscores the need for cryptographic techniques to ensure the security of processed text data.
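The stages summarized above can be illustrated with a minimal Python sketch. It is not the model or the implementation from the article: the indicator phrases, their weights, and the function names `clean`, `tokenize`, `phishing_score`, and `pseudonymize` are assumptions introduced here for illustration. Stemming and lemmatization are omitted because they would require a language-specific library. The cryptographic step is a keyed hash (HMAC-SHA256) that pseudonymizes source identifiers; it stands in for the cryptographic protection the abstract mentions and is not encryption of the text itself.

```python
import hashlib
import hmac
import re

# Hypothetical indicator phrases and weights, chosen only for illustration;
# the article's own mathematical model is not reproduced here.
INDICATOR_WEIGHTS = {
    "verify your account": 0.4,
    "urgent": 0.2,
    "password": 0.2,
    "click here": 0.2,
}

URL_PATTERN = re.compile(r"https?://\S+")
NON_WORD_PATTERN = re.compile(r"[^a-z0-9\s]")


def clean(text: str) -> str:
    """Noise removal: lowercase, drop URLs and punctuation, collapse whitespace."""
    text = URL_PATTERN.sub(" ", text.lower())
    text = NON_WORD_PATTERN.sub(" ", text)
    return " ".join(text.split())


def tokenize(text: str) -> list:
    """Whitespace tokenization of the cleaned text."""
    return clean(text).split()


def phishing_score(text: str) -> float:
    """Sum the weights of indicator phrases found in the text, capped at 1.0."""
    padded = f" {clean(text)} "  # padding makes phrase matches whole-word
    score = sum(
        weight
        for phrase, weight in INDICATOR_WEIGHTS.items()
        if f" {phrase} " in padded
    )
    return round(min(score, 1.0), 2)


def pseudonymize(identifier: str, key: bytes) -> str:
    """Replace a source identifier with its HMAC-SHA256 hex digest under a secret key."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


if __name__ == "__main__":
    message = "URGENT: verify your account now! Click here: http://example.test/login"
    print(tokenize(message))
    print(phishing_score(message))
    # The key below is a placeholder; a real key would come from a secure source.
    print(pseudonymize("user-123", b"placeholder-secret-key"))
```

A keyed hash lets records from the same source be linked during analysis without storing the original identifier. Confidentiality of the text itself would additionally require encryption, which the Python standard library does not provide.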
