Each new generation of large language models (LLMs) brings significant improvements in text understanding and generation, opening ever wider opportunities for their integration into information processing systems and the digital business processes of enterprises and institutions. As the complexity and functionality of LLMs continue to grow, developing reliable evaluation methods becomes a fundamental challenge for the research community, since traditional metrics for evaluating text often fail to capture the full depth and multifaceted nature of these models' capabilities. Comprehensive LLM quality assessment systems are intended not only to enable objective comparison of different models but also to provide critical feedback for targeted improvement of the technology and for preventing the risks associated with its large-scale deployment. This article systematizes the criteria for assessing the quality of large language models that have become widespread in natural language processing tasks, with the aim of building a comprehensive approach to LLM quality assessment covering the main aspects of their operation. The paper defines LLM quality criteria such as precision and completeness of responses, naturalness of language, consistency, toxicity, bias, and security vulnerabilities, among others. Three main approaches to assessing LLM quality are analyzed in detail: expert assessment, comparison with reference data, and automated reference-free methods. For each quality criterion, the most effective assessment methods are identified, and their advantages and disadvantages in different application contexts are noted. The article concludes that, despite the high reliability of expert assessment, automated methods are becoming increasingly important for large-scale LLM evaluation, especially for subjective criteria such as toxicity and bias.
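To make the contrast between these approaches concrete, the sketch below (illustrative only, not taken from the article) combines two of them: reference-based evaluation with BERTScore, followed by a check of how well the automated metric agrees with expert assessment using Kendall's rank correlation. It assumes the open-source `bert-score` and `scipy` packages; the candidate answers, references, and expert ratings are placeholder data invented for the example.

```python
# Minimal sketch: reference-based scoring with BERTScore, then checking how
# well the automated metric agrees with expert ratings via Kendall's tau.
# Assumes `pip install bert-score scipy`; all texts and expert scores below
# are illustrative placeholders, not data from the article.

from bert_score import score
from scipy.stats import kendalltau

# LLM answers and the gold references they are compared against.
candidates = [
    "The Danube flows through ten countries before reaching the Black Sea.",
    "Photosynthesis converts light into chemical energy stored in glucose.",
    "HTTP is a stateless protocol used to transfer hypertext documents.",
    "Water boils at 100 degrees Celsius at standard atmospheric pressure.",
]
references = [
    "The Danube crosses ten countries and empties into the Black Sea.",
    "In photosynthesis, plants turn light energy into chemical energy.",
    "HTTP is a stateless application protocol for transmitting hypermedia.",
    "At standard pressure, water boils at 100 °C.",
]

# Reference-based evaluation: token similarity in contextual embedding space.
# score() returns per-example precision, recall, and F1 tensors.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
metric_scores = F1.tolist()
for text, f1 in zip(candidates, metric_scores):
    print(f"BERTScore F1 = {f1:.3f} | {text[:45]}...")

# Expert assessment as the yardstick: hypothetical 1-5 quality ratings.
expert_scores = [4.5, 4.0, 4.8, 4.2]

# Kendall's tau measures rank agreement between the automated metric and the
# experts; values near 1.0 mean the metric orders outputs the same way they do.
tau, p_value = kendalltau(metric_scores, expert_scores)
print(f"Kendall's tau vs. expert ranking: {tau:.2f} (p = {p_value:.2f})")
```

The same rank-agreement check is, in practice, how automated reference-free judges (including LLM-as-a-judge setups) are validated against expert panels before being trusted for large-scale evaluation.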