Comparative analysis of clustering methods for graph data

350 rub

Journal Neurocomputers №4 for 2023 г.

Article in number:

Type of article: scientific article

DOI: https://doi.org/10.18127/j19998554-202304-05

UDC: 004.021

Keywords: Сlustering graphs artificial graphs graph. Lancichinetti–Fortunato–Radicchi graph clustering methods

Authors:

N.V. Gryneva1, S.S. Mikhaylova2, A.A. Vilcul3

1–3 Financial University under the Government of the Russian Federation (Moscow, Russia)

Abstract:

Problem setting. Clustering is widely used in practice for the analysis of biological, economic, road, social and many other networks. At the same time, communities in the graph can be distinguished in various ways, based on graph properties such as density, modularity, point eccentricity, etc. This is due to the variety of methods of clustering their types. At the same time, there is no generally accepted algorithm for testing and evaluating the quality of graph community allocation algorithms. Many scientific papers describe the work of graphs on real networks, comparing their performance and accuracy. It seems possible to compare several clustering methods using a synthetic graph.

Target. Evaluate the performance of one of the most popular graph data clustering algorithms and choose the optimal one by constructing a set of synthetic Lancichinetti–Fortunato–Radicchi graphs.

Results. The article defines the concepts of clustering, a synthetic graph Lancichinetti–Fortunato–Radicchi, methods of clustering graph data. Five methods of graph data clustering have been studied in detail, namely: the Louvain method, the K-means method, the spectral clustering method, the label propagation method and the agglomerative clustering method. The process of applying the methods is shown. The analysis of the main metrics of clustering quality is carried out: adjusted Rand index and normalized mutual information. During the execution of the algorithms, the execution time was recorded to analyze the performance and speed of work. To solve these problems, the Python programming language and its packages networks, generators.community, sklearn.metrics, numpy, time, sklearn.cluster were used. The result of the research work is to identify the optimal algorithm for processing large graph data.

Practical significance. The results of the study of the possibilities of existing clustering methods and the features of their work on specific examples allow us to choose from them the optimal one for clustering a large amount of data.

Pages: 32-44

For citation

Grineva N.V., Mikhaylova S.S., Vilkul A.A. Comparative analysis of clustering methods for graph data. Neurocomputers. 2023. V. 25. № 4. Р. 32-44. DOI: https://doi.org/10.18127/j19998554-202304-05 (In Russian)

References

Driver H.E., Kroeber A.L. Quantitative Expression of Cultural Relationships. University of California Publications in American Archeology and Ethnology. 1932. V. 31. P. 211–256.
Doan D.H., Zhuleva S.Yu., Kroshilin A.V., Kroshilina S.V., Tishkina V.V. Formation of sets of variants of the course of the disease by fuzzy clustering in medical decision support systems. Biomedical radioelectronics. 2017. № 7. P. 60–66. (In Russian)
Gros S.A., Onufrieva T.A. Features of clustering application in the transport monitoring system. Information-measuring and control systems. 2020. V. 18. № 2. P. 44–50. DOI 10.18127/j20700814-202002-07. (In Russian)
Lancichinetti A., Fortunato S. Community detection algorithms: a comparative analysis. Physical Review E – Statistical, Nonlinear and Soft Matter Physics. 2009. V. 80. № 5. P. 056117. DOI 10.1103/PhysRevE.80.056117.
Blondel V.D., Guillaume J.-L., Lambiotte R., Lefebvre E. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment. 2008. V.2008. № 10. P. 10008. DOI 10.1088/1742-5468/2008/10/P10008.
Von Luxburg U. Clustering stability: An overview. Foundations and Trends in Machine Learning. 2010. V. 2. № 3. P. 235–274. DOI 10.1561/2200000008.
Von Luxburg U. A tutorial on spectral clustering. Statistics and Computing. 2007. V. 17. № 4. P. 395–416. DOI 10.1007/s11222-007-9033-z.
Zhang X.-K., Ren J., Song C., Jia J., Zhang Q. Label propagation algorithm for community detection based on node importance and label influence. Physics Letters A. 2017. V. 381. № . 2017. V. 381. № 3. P. 2691–2698. DOI 10.1016/j.physleta.2017.06.018.
Zachary W.W. An Information Flow Model for Conflict and Fission in Small Groups. Journal of Anthropological Research. 1977. V. 33. № 4. P. 452–473. DOI 10.1086/jar.33.4.3629752.
Girvan M., Newman M.E.J. Community structure in social and biological networks. Proceedings of the National Academy of Sciences of the United States of America. 2002. V. 99. № 12. P. 7821–7826. DOI 10.1073/pnas.122653799.
Shen H.-W. Community Structure of Complex Networks. Springer Science & Business Media. 2013. DOI 10.1007/978-3-642-31821-4.
Vinh N.X., Epps J., Bailey J. Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance. Journal of Machine Learning Research. 2010. V. 11. P. 2837–2854.
Danon L., Duch, J., Diaz-Guilera A., Arenas A. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment. 2005. №. 9. P. 09008. DOI 10.1088/1742-5468/2005/09/P09008.
Strehl A., Ghosh J. Cluster Ensembles – A Knowledge Reuse Framework for Combining Multiple Partitions. Journal of Machine Learning Research. 2002. V. 3. P. 583–617.
Rand W.M. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association. 1971. V. 66.
P. 846–850.
Koufos N., Martin B. K-Means & Other Clustering Algorithms: A Quick Intro with Python. 2023.

Date of receipt: 20.06.2023

Approved after review: 07.07.2023

Accepted for publication: 26.07.2023