M.G. Kobuk1, O.M. Ataeva2
1-2 Moscow Witte University (Moscow, Russia)
2 Federal Research Center "Computer Science and Control" of the RAS (Moscow, Russia)
1 mikhail.kobuk@mail.ru; 2 oataeva@frccsc.ru
Problem statement. This work addresses the problem of structuring scientific articles for building data corpora within the SciLibAIRU semantic ecosystem, and the transition from a document-oriented data representation to a format suitable for automated analysis and semantic search.
Objective. The objective of this work is to investigate an optimal method for vector-based fuzzy search over mathematical texts and to implement it in conjunction with a parser for mathematical LaTeX texts.
Results. A prototype vector search system is proposed that is capable of ingesting LaTeX versions of scientific texts and providing a fuzzy search interface over textual fragments within the SciLibAIRU library.
Practical significance. The results are applicable to library and editorial information systems.
Kobuk M.G., Ataeva O.M. Methods of semantic annotation and ontological modeling of mathematical texts in LaTeX format. Highly Available Systems. 2026. V. 22. № 1. P. 90−94. DOI: https://doi.org/10.18127/j20729472-202601-18 (in Russian)
- Hoftich M. TeX4ht: LaTeX to Web Publishing. TUGboat. 2019. V. 40. № 1. P. 76–81.
- Frankston C. et al. Using HTML Papers on arXiv: Why It’s Important, and How We Made It Happen. arXiv:2402.08954, 2024.
- Serebryakov V.A., Galochkin M.P., Gonchar D.R., Furugyan M.G. Theory and Implementation of Programming Languages. 2nd ed. Moscow: MZ-Press. 2006. 352 p. (in Russian)
- Hopcroft J., Motwani R., Ullman J. Introduction to Automata Theory, Languages, and Computation. Moscow: Williams. 2002. 528 p. (in Russian)
- Aho A.V., Lam M.S., Sethi R., Ullman J.D. Compilers: Principles, Techniques, and Tools. 2nd ed. Moscow: Williams. 2008. 1184 p. (in Russian)
- Peters M., Neumann M., Iyyer M., Gardner M., Clark C., Lee K., Zettlemoyer L. Deep contextualized word representations. arXiv:1802.05365v2, 2018. DOI: 10.48550/arXiv.1802.05365
- Pennington J., Socher R., Manning C. GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014. P. 1532–1543. DOI: 10.3115/v1/D14-1162
- Joulin A., Grave E., Bojanowski P., Mikolov T. Bag of Tricks for Efficient Text Classification. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Vol. 2. Short Papers. Valencia, Spain, April 2017. P. 427–431. DOI: 10.18653/v1/E17-2068
- Feng F., Yang Y., Cer D., Arivazhagan N., Wang W. Language-agnostic BERT Sentence Embedding. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL). Dublin, Ireland. May 2022. P. 878–891. DOI: 10.18653/v1/2022.acl-long.62
- Zmitrovich D. et al. A Family of Pretrained Transformer Language Models for Russian. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). Torino, Italia. May 2024. P. 507–524. arXiv:2309.10931. DOI: 10.48550/arXiv.2309.10931
- Kuratov Y., Arkhipov M. Adaptation of Deep Bidirectional Multilingual Transformers for Russian Language. Computational Linguistics and Intellectual Technologies: Proceedings of the International Conference «Dialogue 2019». Moscow. May–June 2019. arXiv:1905.07213. DOI: 10.48550/arXiv.1905.07213
- Nikolich A., Puchkova A. Fine-tuning GPT-3 for Russian Text Summarization. arXiv preprint 2021. arXiv:2108.03502. DOI: 10.48550/arXiv.2108.03502
- Mikolov T., Sutskever I., Chen K., Corrado G., Dean J. Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems (NIPS 26). 2013. P. 3111–3119. DOI: 10.5555/2999792.2999959
- Gerasimenko N., Vatolin A., Ianina A., Vorontsov K. SciRus: Tiny and Powerful Multilingual Encoder for Scientific Texts. Doklady Mathematics. 2024. V. 110. № 1. P. S193–S202. DOI: 10.1134/S1064562424602178

