V.V. Petrov
Kazan Federal University (Kazan, Russia)
valeryvpetrov.itis@gmail.com
Problem statement. Large mobile app repositories contain functionally related and derivative Android applications, including different versions of the same app, modified copies, and apps with injected third-party modules. The task is to develop a method that, given the APK file of a target application, automatically identifies the most similar applications in a collection and computes a numerical similarity score that remains robust to code obfuscation.
Objective. To propose and implement a prototype pipeline for numerical similarity scoring of Android applications based on static analysis of APK files, including scalable candidate generation using compact fingerprints.
Results. A two-stage architecture is introduced: 1) fast candidate generation using compact MinHash/SimHash fingerprints and approximate nearest neighbor indexing; 2) refined comparison at the function and structural levels with normalization of the final score and a structural penalty for injected or unmatched code.
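The two stages can be illustrated with a minimal Python sketch. It is not the author's implementation. It assumes each application has already been reduced to a set of string features (for example, normalized method signatures), uses MinHash only, and replaces the approximate nearest neighbor index with a brute-force scan. The names minhash_signature, top_candidates, and refined_score, the signature length of 64, and the penalty formula with penalty_weight are all hypothetical choices made for illustration.

```python
import hashlib
from typing import Dict, Iterable, List, Set, Tuple

NUM_HASHES = 64  # assumed signature length; not taken from the paper


def _hash(token: str, seed: int) -> int:
    """Seeded 64-bit hash of a feature string."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(features: Iterable[str], num_hashes: int = NUM_HASHES) -> List[int]:
    """Compact fingerprint: the minimum hash value per seeded hash function."""
    feats = set(features)
    if not feats:
        raise ValueError("feature set must be non-empty")
    return [min(_hash(f, seed) for f in feats) for seed in range(num_hashes)]


def estimate_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching positions estimates the Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have equal length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def top_candidates(query_sig: List[int], index: Dict[str, List[int]], k: int = 5) -> List[Tuple[str, float]]:
    """Stage 1: rank stored fingerprints by estimated similarity.

    A brute-force scan stands in for an approximate nearest neighbor index.
    """
    scored = [(name, estimate_jaccard(query_sig, sig)) for name, sig in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


def refined_score(target: Set[str], candidate: Set[str], penalty_weight: float = 0.5) -> float:
    """Stage 2 (hypothetical formula): exact Jaccard similarity, scaled down
    by the share of candidate features absent from the target, then clamped
    to the range [0, 1]."""
    union = target | candidate
    if not union:
        return 0.0
    jaccard = len(target & candidate) / len(union)
    injected = len(candidate - target) / len(candidate) if candidate else 0.0
    return max(0.0, min(1.0, jaccard * (1.0 - penalty_weight * injected)))


if __name__ == "__main__":
    original = {f"method_{i}" for i in range(10)}
    repackaged = original | {"ads_sdk_init", "ads_sdk_show"}  # injected module
    unrelated = {f"other_{i}" for i in range(10)}

    index = {
        "repackaged.apk": minhash_signature(repackaged),
        "unrelated.apk": minhash_signature(unrelated),
    }
    for name, estimate in top_candidates(minhash_signature(original), index, k=2):
        print(name, round(estimate, 2))
    print("refined:", round(refined_score(original, repackaged), 3))
```

In this sketch the first stage touches only fixed-length signatures, so its cost per comparison does not depend on application size, while the more expensive set comparison runs only on the short candidate list.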
Practical significance. The approach supports quality control (duplicate and version detection), security analysis (injected code detection), and semantic search and recommendation over large application repositories.
Petrov V.V. A two-stage recommendation-based system for numerical similarity assessment of Android applications using static features. Highly Available Systems. 2026. V. 22. № 1. P. 61−64. DOI: https://doi.org/10.18127/j20729472-202601-12 (in Russian)

