V.I. Zorin1, E.K. Lipachev2
1 KNRTU-KAI (Kazan, Russia)
2 KFU (Kazan, Russia)
2 Innopolis University (Innopolis, Russia)
1 addefan@mail.ru, 2 elipachev@gmail.com
Problem Statement. An effective system is needed for cross-language search of semantically similar source code fragments: existing solutions typically support only a small number of programming languages and take little account of the syntactic and semantic content of the code.
Objective. To develop a method for calculating the similarity measure of source code fragments based on identifying syntactic and semantic characteristics of the code and ensuring comparability across different programming languages.
Results. A method for calculating the similarity measure of program code fragments is proposed. The method is based on vectorizing the fragments with machine learning models. Experiments confirm that the UniXcoder and CodeBERT models are highly effective at identifying semantic cross-language clones. An algorithm that identifies the programming language of a fragment from its reserved words has been developed and validated.
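The two ingredients named above, language identification by reserved words and a similarity measure over fragment vectors, can be illustrated with a minimal sketch. It is not the authors' implementation. The reserved-word sets are small hypothetical subsets chosen for illustration, the function names are invented here, cosine similarity is assumed as the measure (the abstract does not name one), and the vectors are taken as already computed by an embedding model such as UniXcoder or CodeBERT, which the sketch does not call.

```python
import math
import re

# Hypothetical, deliberately tiny reserved-word sets (assumption for illustration;
# not the lists used in the paper).
RESERVED_WORDS = {
    "python": {"def", "elif", "lambda", "None", "import", "pass", "yield"},
    "java": {"public", "private", "void", "extends", "implements", "new", "throws"},
    "go": {"func", "package", "defer", "chan", "range", "struct", "go"},
}


def detect_language(code):
    """Return the language whose reserved words match the most tokens, or None."""
    tokens = re.findall(r"[A-Za-z_]\w*", code)
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in RESERVED_WORDS.items()
    }
    best = max(scores, key=scores.get)
    # No reserved word of any known language was found.
    return best if scores[best] > 0 else None


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)
```

In such a setup, `detect_language` would run first to label each fragment, and `cosine_similarity` would then compare the embedding vectors of two fragments regardless of their languages. A real reserved-word detector would need full keyword lists and a tie-breaking rule, since many keywords are shared between languages.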
Practical Relevance. A software tool for finding similar code fragments, implemented with the method presented in this paper, can be used for cross-language code search. Its integration into the OntoMath digital ecosystem of the Lobachevskii Digital Mathematical Library is planned.
Zorin V.I., Lipachev E.K. A method for calculating the similarity measure of program code fragments. Highly Available Systems. 2026. V. 22. № 1. P. 47−50. DOI: https://doi.org/10.18127/j20729472-202601-09 (in Russian)
- Zhou S. et al. What Is the Impact of Releasing Code With Publications? IEEE Control Systems. 2024. V. 44 (4). P. 38–46. DOI: https://doi.org/10.1109/MCS.2024.340288
- Zakeri M., Parsa S., Ramezani M., Roy C., Ekhtiarzadeh M. A systematic literature review on source code similarity measurement and clone detection. Journal of Systems and Software. 2023. V. 204 (3). P. 111796. DOI: https://doi.org/10.1016/j.jss.2023.111796
- Vislavski T. et al. LICCA: A tool for cross-language clone detection. IEEE Int. Conf. on Software Analysis. 2018. P. 512–516. DOI: https://doi.org/10.1109/SANER.2018.8330250
- Nafi K.W. et al. CLCDSA: Cross Language Code Clone Detection. IEEE/ACM Int. Conf. (ASE). 2019. P. 1026–1037. DOI: https://doi.org/10.1109/ASE.2019.00099
- Mathew G., Stolee K.T. Cross-language code search using static and dynamic analyses. ESEC/FSE 2021. 2021. P. 205–217. DOI: https://doi.org/10.1145/3468264.3468538
- Tao C., Zhan Q., Hu X., Xia X. C4: contrastive cross-language code clone detection. ICPC '22. 2022. P. 413–424. DOI: https://doi.org/10.1145/3524610.3527911
- Feng Z. et al. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. EMNLP 2020. P. 1536–1547. DOI: https://doi.org/10.18653/v1/2020.findings-emnlp.139
- Guo D. et al. GraphCodeBERT: Pre-training Code Representations with Data Flow. arXiv:2009.08366. 2020. DOI: https://doi.org/10.48550/arXiv.2009.08366
- Guo D. et al. UniXcoder: Unified Cross-Modal Pre-training for Code Representation. Proc. of the Association for Computational Linguistics. 2022. P. 7212–7225. DOI: https://doi.org/10.18653/v1/2022.acl-long.499
- Svajlenko J., Roy C.K. BigCloneEval: A Clone Detection Tool Evaluation Framework with BigCloneBench. ICSME. 2016. P. 596–600. DOI: https://doi.org/10.1109/ICSME.2016.62
- Zorin V.I. Programming Language Detection Dataset. Zenodo. 2025. DOI: https://doi.org/10.5281/zenodo.15661548
- Elizarov A. et al. Digital OntoMath Ecosystem Tools for Managing and Developing Mathematical Knowledge. LNNS. 2024. V. 912. P. 110–117. DOI: https://doi.org/10.1007/978-3-031-53488-1_13
- Elizarov A.M., Kirillovich A.V., Lipachyov E.K., Nevzorova O.A. Digital ecosystem OntoMath as an approach to building the space of mathematical knowledge. Elektronnye biblioteki (Russian Digital Libraries Journal). 2023. V. 26. № 2. P. 154–202. DOI: https://doi.org/10.26907/1562-5419-2023-26-2-154-202 (in Russian)

