D. Teterevenkov¹
¹ Financial University under the Government of the Russian Federation (Moscow, Russia)
¹ 249453@edu.fa.ru
Human evaluation is traditionally treated as the "gold standard" for assessing the quality of texts generated by large language models (LLMs). However, it is itself prone to subjectivity, variability, and systematic perceptual biases, which calls into question the reliability and reproducibility of model verification results.
The aim of this work is to systematize and analyze modern approaches to assessing the quality of expert annotators involved in verifying the outputs of large language models, in order to increase the objectivity and reliability of human evaluation.
The main sources of unreliability in expert judgments are analyzed. Methods for improving the reliability of human evaluation are presented and examined in detail: multiple annotation with calculation of inter-annotator agreement coefficients (Cohen's κ, Fleiss' κ, Krippendorff's α, the intraclass correlation coefficient ICC), the use of control ("gold") tasks, blind testing, meta-evaluation, and statistical monitoring of evaluators' work. Special attention is paid to probabilistic models of annotator quality, such as the Dawid-Skene model.
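To make two of the listed measures concrete, a minimal sketch in standard notation is given below, assuming a categorical labeling task with two annotators for Cohen's κ and the classical parameterization of the Dawid-Skene model; it reproduces the textbook definitions rather than any formulation specific to this article.

```latex
% Cohen's kappa for two annotators over categorical labels:
% p_o is the observed agreement rate, p_e the agreement expected by chance.
\kappa = \frac{p_o - p_e}{1 - p_e}

% Dawid-Skene E-step: posterior probability that item i has true class k,
% given class priors p_k, annotator-specific confusion matrices
% \pi^{(j)}_{kl} = P(\text{annotator } j \text{ assigns label } l \mid \text{true class } k),
% and counts n^{(j)}_{il} of how often annotator j assigned label l to item i.
T_{ik} \propto p_k \prod_{j} \prod_{l} \left( \pi^{(j)}_{kl} \right)^{\, n^{(j)}_{il}}
```

In the M-step the class priors and confusion matrices are re-estimated from the posteriors T_{ik}, and the two steps are iterated to convergence; annotators whose estimated confusion matrices stay close to uniform can then be flagged as unreliable.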
Systematic calibration and verification of experts is a necessary condition for ensuring the objectivity and reproducibility of human-in-the-loop experiments in natural language processing research and in the evaluation of large language models. The methods considered make it possible to formalize the human evaluation process, minimize subjective distortions, and increase the reliability of the resulting data, which is critical for the correct comparison and further development of LLMs.
Teterevenkov D. Methods for assessing the quality of experts in the verification of large language models. Neurocomputers. 2025. V. 27. № 6. P. 69–76. DOI: 10.18127/j19997493-202506-07 (in Russian).
- Chiang C.-H., Lee H.-Y. Can Large Language Models Be an Alternative to Human Evaluations? Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023). Toronto, Canada, 2023. P. 15607–15631. DOI: 10.18653/v1/2023.acl-long.870.
- Gao M. LLM-based NLG Evaluation: Current Status and Challenges. Computational Linguistics. 2025. V. 51. № 2.
- Tam T.Y.C., Chow T.Y. et al. A framework for human evaluation of large language models in healthcare derived from literature review. npj Digital Medicine. 2024.
- Anthropic. Challenges in evaluating AI systems. 2023.
- Liu S., Wang H., Ma Z., Li X. How Humans Help LLMs: Assessing and Incentivizing Human Preference Annotators. arXiv:2502.06387. 2025.
- Guo Z. et al. Evaluating Large Language Models: A Comprehensive Survey. arXiv:2310.19736. 2023.
- Ouyang L., Wu J., Jiang X. et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems (NeurIPS 2022). 2022.
- Yao S. et al. HUGAGENT: Evaluating LLMs in Simulating Human Reasoning. 2025.
- TR-Labs. How to Build Reliable Human Annotation Guidelines with LLMs. 2023.
- OpenAI. GPT-4 – research materials. 2023.
- Oltyan N.N. Methods for converting semi-structured data into relational models: classification, application, and suitability assessment for analytics and machine learning. Myagkie izmereniya i vychisleniya. 2025. V. 90. № 5. P. 48–67. (in Russian).

