R.D. Shuvalov¹, E.A. Sidorova²
¹ Novosibirsk State University (Novosibirsk, Russia)
² A.P. Ershov Institute of Informatics Systems, Siberian Branch of the Russian Academy of Sciences (Novosibirsk, Russia)
¹ r.shuvalov@g.nsu.ru, ² lsidorova@iis.nsk.su
Problem statement. The task of coreference resolution is to identify mentions in a text that refer to the same real-world object. Current research in this area relies on annotated datasets that consist mostly of news texts, and models and methods built on these datasets generalize poorly to specialized domains of knowledge. In addition, training data for coreference resolution in Russian is scarce.
Objective. The aim of this work is to create a new Russian-language corpus of texts from a specialized subject area, annotated for coreference.
Results. The study produced a methodology for automating coreference annotation of Russian-language texts in a limited subject area, together with an annotated corpus of articles in the field of «Computational Linguistics».
Practical significance. The proposed methodology can be used to automate coreference annotation of texts from narrow subject areas, and the resulting corpus can be used to train and evaluate coreference resolution models for Russian.
Shuvalov R.D., Sidorova E.A. HaRuCo: A new Russian-language corpus of popular science texts with coreference annotation // Highly Available Systems. 2026. V. 22. № 2. P. 83−88. DOI: https://doi.org/10.18127/j20729472-202602-07
- Paducheva E.V. Vyskazyvanie i ego sootnesennost s deistvitelnostiu. Izd. 5, ispr. M.: URSS, 2008 (in Russian).
- Dobrovolskii V.A., Michurina M.A., Ivoylova A.M. RuCoCo: a new Russian corpus with coreference annotation. Computational Linguistics and Intellectual Technologies: Proc. of the Int. Conference «Dialogue 2022». 2022. P. 141–149. DOI: 10.28995/2075-7182-2022-21-141-149
- Azerkovich I. Using Semantic Information for Coreference Resolution with Neural Networks in Russian. Analysis of Images, Social Networks and Texts. AIST 2019. Communications in Computer and Information Science. Cham: Springer International Publishing. 2020. V. 1086. P. 85–93.
- Toldova S., Roytberg A., Ladygina A., Vasilyeva M., Azerkovich I., Kurzukov M., Sim G., Gorshkov D., Ivanova A., Nedoluzhko A., Grishina Y. Ru-eval-2014: Evaluating anaphora and coreference resolution for Russian. Computational Linguistics and Intellectual Technologies: Proc. of the Int. Conference «Dialogue 2014». 2014. P. 681–694.
- Budnikov A.E., Toldova S.Yu., Zvereva D.S., Maximova D.M., Ionov M.I. Ru-eval-2019: Evaluating anaphora and coreference resolution for Russian. Computational Linguistics and Intellectual Technologies – Supplementary Volume. 2019.
- Ovchinnikova K., Ivanov A., Sidorova E. Automation of the construction of the terminological core of ontology in computer linguistics based on a corpus of texts. System Informatics. 2023. 23. P. 13–32. DOI: 10.31144/SI.2307-6410.2023.N23.P13-32 (in Russian).
- Le N.T., Ritter A. Are Large Language Models Robust Coreference Resolvers? First Conference on Language Modeling (COLM-2024). 2024.
- Moosavi N.S., Strube M. Which coreference evaluation metric do you trust? A proposal for a link-based entity aware metric. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. 2016. V. 1. P. 632–642.

