A.D. Modyaev, P.I. Rudakov, A.A. Raskin
Preliminary clustering of the initial data set is an efficient technique used in data visualization. A graph model is the most appropriate way to describe data with a formalized structure (for example, measurement results obtained in predefined formats or reports on rendered services). In this case, the similarity of data fragments (chains) reduces to the similarity of graphs, and choosing a distance measure between graph structures becomes the most difficult problem.
This paper analyzes whether methods for comparing weighted chainlike graphs can be used to cluster data with a formalized structure. The main techniques for comparing graphs and individual graph elements were studied on a data set of rendered medical services: the isomorphism method, the Levenshtein (edit) distance, maximum common subgraph search (the MCS method), the statistical method (based on comparing statistical characteristics of the graph structure), the g-model construction method, and the iterative method.
It is shown that using the Levenshtein distance together with a modification of the TF-IDF technique reduces the influence of intrastructural outliers. A modification of the distance measure is proposed to reduce the influence of intrastructural outliers on the clustering process.
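As an illustration of the Levenshtein (edit) distance mentioned above, the following minimal sketch treats a chainlike graph as an ordered sequence of vertex labels and counts the insertions, deletions, and substitutions needed to turn one chain into another. The function names, the normalization by the longer chain length, and the example service labels are assumptions made for illustration; this is the plain unweighted distance, not the modified measure proposed by the authors.

```python
from typing import Hashable, Sequence


def levenshtein(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Unweighted edit distance between two chains of vertex labels."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty chain
        for j, y in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if x == y else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def normalized_distance(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Edit distance scaled to [0, 1] by the longer chain (an assumed choice)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


# Hypothetical chains of rendered services for two patients.
chain_a = ["exam", "xray", "cast"]
chain_b = ["exam", "blood_test", "xray", "cast"]
print(levenshtein(chain_a, chain_b))
print(normalized_distance(chain_a, chain_b))
```

A pairwise matrix of such distances could then be passed to any clustering routine that accepts precomputed distances.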