Each new generation of large language models (LLMs) brings significant improvements in text understanding and generation, opening ever wider opportunities for their integration into information processing systems and the digital business processes of enterprises and institutions. As the complexity and functionality of LLMs continue to grow, developing reliable evaluation methods becomes a fundamental challenge for the research community, since traditional metrics for evaluating text often fail to capture the full depth and multifaceted nature of these models' capabilities. Comprehensive LLM quality assessment systems are intended not only to enable objective comparison of different models but also to provide critical feedback for targeted improvement of the technology and for preventing the risks associated with its large-scale deployment. This article systematizes the criteria for assessing the quality of large language models that have become widespread in natural language processing tasks, with the aim of building a comprehensive approach to LLM quality assessment covering the main aspects of their operation. The paper defines LLM quality criteria such as precision and completeness of responses, naturalness of language, consistency, toxicity, bias, and security vulnerabilities, among others. Three main approaches to assessing LLM quality are analyzed in detail: expert assessment, comparison with reference data, and automated reference-free methods. For each quality criterion, the most effective assessment methods are identified, and their advantages and disadvantages in different application contexts are noted. The article concludes that, despite the high reliability of expert assessment, automated methods are becoming increasingly important for large-scale LLM evaluation, especially for subjective criteria such as toxicity and bias.
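To make the contrast between these approaches concrete, the sketch below (illustrative only, not taken from the article) combines two of them: reference-based evaluation with BERTScore, followed by a check of how well the automated metric agrees with expert assessment using Kendall's rank correlation. It assumes the open-source `bert-score` and `scipy` packages; the candidate answers, references, and expert ratings are placeholder data invented for the example.

```python
# Minimal sketch: reference-based scoring with BERTScore, then checking how
# well the automated metric agrees with expert ratings via Kendall's tau.
# Assumes `pip install bert-score scipy`; all texts and expert scores below
# are illustrative placeholders, not data from the article.

from bert_score import score
from scipy.stats import kendalltau

# LLM answers and the gold references they are compared against.
candidates = [
    "The Danube flows through ten countries before reaching the Black Sea.",
    "Photosynthesis converts light into chemical energy stored in glucose.",
    "HTTP is a stateless protocol used to transfer hypertext documents.",
    "Water boils at 100 degrees Celsius at standard atmospheric pressure.",
]
references = [
    "The Danube crosses ten countries and empties into the Black Sea.",
    "In photosynthesis, plants turn light energy into chemical energy.",
    "HTTP is a stateless application protocol for transmitting hypermedia.",
    "At standard pressure, water boils at 100 °C.",
]

# Reference-based evaluation: token similarity in contextual embedding space.
# score() returns per-example precision, recall, and F1 tensors.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
metric_scores = F1.tolist()
for text, f1 in zip(candidates, metric_scores):
    print(f"BERTScore F1 = {f1:.3f} | {text[:45]}...")

# Expert assessment as the yardstick: hypothetical 1-5 quality ratings.
expert_scores = [4.5, 4.0, 4.8, 4.2]

# Kendall's tau measures rank agreement between the automated metric and the
# experts; values near 1.0 mean the metric orders outputs the same way they do.
tau, p_value = kendalltau(metric_scores, expert_scores)
print(f"Kendall's tau vs. expert ranking: {tau:.2f} (p = {p_value:.2f})")
```

The same rank-agreement check is, in practice, how automated reference-free judges (including LLM-as-a-judge setups) are validated against expert panels before being trusted for large-scale evaluation.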