Large Language Models and Personal Information: Security Challenges and Solutions Through Anonymization

2024, pp. 72-81

1 Lviv Polytechnic National University, Department of Electronic Computing Machines
2 Lviv Polytechnic National University

The rapid adoption of large language models (LLMs) has created an urgent need for effective methods to protect personal data in online texts. Existing anonymization methods often prove ineffective against complex LLM analysis algorithms, especially when processing sensitive information such as medical data. This research proposes an anonymization approach that combines k-anonymity with adversarial methods, aiming to improve the efficiency and speed of anonymization while maintaining a high level of data protection. Experimental results on a dataset of 10,000 comments showed a 40% reduction in processing time (from 250 ms to 150 ms per comment) compared to traditional adversarial methods, a 5-percentage-point improvement in medical data anonymization accuracy (from 90% to 95%), and a 7-percentage-point improvement in data utility preservation (from 85% to 92%). Special attention is paid to applying the method in interactions with LLM-based chatbots and in medical information processing. We experimentally evaluate our method against existing industrial anonymizers on real and synthetic datasets; the results demonstrate significant improvements in both data utility preservation and privacy protection. The method also accounts for GDPR requirements, setting a new standard for data anonymization in AI interactions. This research offers a practical solution for protecting user privacy in the era of LLMs, especially in sensitive areas such as healthcare.
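To make the combination of k-anonymity and adversarial methods concrete, the following minimal Python sketch shows one possible shape of such a hybrid pipeline: a personal attribute is generalized into a coarse bucket (the k-anonymity step), and a simulated adversary then attempts to re-infer the exact value, triggering further generalization on success (the adversarial step). Every name, pattern, and bucket width here is an illustrative assumption, not the implementation evaluated in this paper.

```python
import re

# Hypothetical sketch of the hybrid pipeline described in the abstract:
# (1) detect a span of personal data, (2) generalize it into a bucket wide
# enough that the surviving value is shared by at least k records
# (the k-anonymity step), (3) let a simulated adversary try to re-infer the
# exact attribute and re-generalize on success (the adversarial step).

EXACT_AGE = re.compile(r"(?<![\d-])(\d{1,2}) years old\b")

def generalize_age(text: str) -> str:
    """Replace exact ages with 10-year ranges; in a full system the bucket
    width would be chosen so each bucket covers at least k users."""
    def to_range(m: re.Match) -> str:
        age = int(m.group(1))
        lo = (age // 10) * 10
        return f"{lo}-{lo + 9} years old"
    return EXACT_AGE.sub(to_range, text)

def adversary_can_reinfer(text: str) -> bool:
    """Stand-in for an LLM-based adversary: here, any residual exact age
    counts as a successful re-identification."""
    return EXACT_AGE.search(text) is not None

def anonymize(comment: str, max_rounds: int = 3) -> str:
    """Generalize repeatedly until the simulated adversary fails."""
    for _ in range(max_rounds):
        if not adversary_can_reinfer(comment):
            break
        comment = generalize_age(comment)
    return comment

if __name__ == "__main__":
    print(anonymize("I'm 37 years old and was recently diagnosed with asthma."))
    # -> "I'm 30-39 years old and was recently diagnosed with asthma."
```

In the evaluated system the regular-expression detector and the heuristic adversary would presumably be replaced by an NER model and an LLM-based inference attack, respectively, but the control flow sketched above (generalize, test, repeat) captures the idea of the hybrid approach.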
