A method for calculating the similarity measure of program code fragments

Journal: Highly Available Systems, № 1, 2026

Type of article: scientific article

DOI: https://doi.org/10.18127/j20729472-202601-09

UDC: 004.4

Keywords: code embedding, code similarity, cross-language clone, cross-language code search, machine learning models, semantic clone, syntax tree

Authors:

V.I. Zorin1, E.K. Lipachev2

1 KNRTU-KAI (Kazan, Russia)
2 KFU (Kazan, Russia)
2 Innopolis University (Innopolis, Russia)

1 addefan@mail.ru, 2 elipachev@gmail.com

Abstract:

Problem Statement. An effective system is needed for cross-language search of semantically similar source code fragments, because existing solutions typically support only a limited number of programming languages and take little account of the syntactic and semantic content of source code.

Objective. To develop a method for calculating the similarity measure of source code fragments based on identifying syntactic and semantic characteristics of the code and ensuring comparability across different programming languages.

Results. A method for calculating the similarity measure of program code fragments is proposed. The method is based on vectorizing the fragments with machine learning models. Experiments confirm that the UniXcoder and CodeBERT models are highly effective at identifying semantic cross-language clones. An algorithm that identifies the programming language from reserved words has been developed and validated.
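The two ingredients named above, reserved-word language identification and vector-based similarity, can be illustrated with a minimal sketch. It is not the authors' implementation. The reserved-word lists, the hit-counting rule, and the names `RESERVED_WORDS`, `guess_language` and `cosine_similarity` are hypothetical, and cosine similarity is assumed here as the measure because it is a common choice for embedding vectors; the abstract does not name the measure. Short hand-written vectors stand in for the embeddings that a model such as UniXcoder or CodeBERT would produce.

```python
import math
import re
from typing import Optional, Sequence

# Hypothetical, heavily abbreviated reserved-word lists. The lists and the
# scoring rule used in the paper are not given in the abstract.
RESERVED_WORDS = {
    "python": {"def", "elif", "lambda", "import", "None", "self"},
    "java": {"public", "static", "void", "extends", "implements", "new"},
    "go": {"func", "package", "defer", "chan", "range"},
}


def guess_language(code: str) -> Optional[str]:
    """Return the language whose reserved words occur most often in `code`.

    Assumed rule: count identifier-like tokens that appear in each list and
    pick the highest count; return None when no reserved word is found.
    """
    tokens = re.findall(r"[A-Za-z_]\w*", code)
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in RESERVED_WORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    java_fragment = "public static void main(String[] args) {}"
    python_fragment = "def f(x):\n    return x if x else None"
    print(guess_language(java_fragment))
    print(guess_language(python_fragment))

    # Toy vectors standing in for model embeddings of two code fragments.
    print(cosine_similarity([0.2, 0.7, 0.1], [0.25, 0.6, 0.05]))
```

In a real pipeline the toy vectors would be replaced by embeddings computed from the code fragments, and the detected language could be used to select language-specific preprocessing before vectorization.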

Practical Relevance. The software tool for finding similar code fragments, implemented with the method presented in this paper, can be used for cross-language code search. Its integration into the OntoMath digital ecosystem of the Lobachevskii Digital Mathematical Library is planned.

Pages: 47-50

For citation

Zorin V.I., Lipachev E.K. A method for calculating the similarity measure of program code fragments. Highly Available Systems. 2026. V. 22. № 1. P. 47−50. DOI: https://doi.org/10.18127/j20729472-202601-09 (in Russian)

References

Zhou S. et al. What Is the Impact of Releasing Code With Publications? IEEE Control Systems. 2024. V. 44 (4). P. 38–46. DOI: https://doi.org/10.1109/MCS.2024.340288
Zakeri M., Parsa S., Ramezani M., Roy C., Ekhtiarzadeh M. A systematic literature review on source code similarity measurement and clone detection. Journal of Systems and Software. 2023. V. 204 (3). P. 111796. DOI: https://doi.org/10.1016/j.jss.2023.111796
Vislavski T. et al. LICCA: A tool for cross-language clone detection. IEEE Int. Conf. on Software Analysis. 2018. P. 512–516. DOI: https://doi.org/10.1109/SANER.2018.8330250
Nafi K.W. et al. CLCDSA: Cross Language Code Clone Detection. IEEE/ACM Int. Conf. (ASE). 2019. P. 1026–1037. DOI: https://doi.org/10.1109/ASE.2019.00099
Mathew G., Stolee K.T. Cross-language code search using static and dynamic analyses. ESEC/FSE 2021. 2021. P. 205–217. DOI: https://doi.org/10.1145/3468264.3468538
Tao C., Zhan Q., Hu X., Xia X. C4: contrastive cross-language code clone detection. ICPC '22. 2022. P. 413–424. DOI: https://doi.org/10.1145/3524610.3527911
Feng Z. et al. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. EMNLP 2020. P. 1536–1547. DOI: https://doi.org/10.18653/v1/2020.findings-emnlp.139
Guo D. et al. GraphCodeBERT: Pre-training Code Representations with Data Flow. arXiv:2009.08366. 2020. DOI: https://doi.org/10.48550/arXiv.2009.08366
Guo D. et al. UniXcoder: Unified Cross-Modal Pre-training for Code Representation. Proc. of the Association for Computational Linguistics. 2022. P. 7212–7225. DOI: https://doi.org/10.18653/v1/2022.acl-long.499
Svajlenko J., Roy C.K. BigCloneEval: A Clone Detection Tool Evaluation Framework with BigCloneBench. ICSME. 2016. P. 596–600. DOI: https://doi.org/10.1109/ICSME.2016.62
Zorin V.I. Programming Language Detection Dataset. Zenodo. 2025. DOI: https://doi.org/10.5281/zenodo.15661548
Elizarov A. et al. Digital OntoMath Ecosystem Tools for Managing and Developing Mathematical Knowledge. LNNS. 2024. V. 912. P. 110–117. DOI: https://doi.org/10.1007/978-3-031-53488-1_13
Elizarov A.M., Kirillovich A.V., Lipachyov E.K., Nevzorova O.A. Cifrovaya e`kosistema OntoMath kak podxod k postroeniyu prostranstva matematicheskix znanij. E`lektronny`e biblioteki. 2023. T. 26. № 2. S. 154–202. DOI: https://doi.org/10.26907/1562-5419-2023-26-2-154-202

Date of receipt: 24.02.2026

Approved after review: 26.02.2026

Accepted for publication: 10.03.2026