Journal Neurocomputers, No. 7, 2009.
Article in issue:
Cluster-based method for missing data imputation in neural network training
Authors:
V. V. Ayuyev, Z. Y. Aung, K. M. Thein, M. B. Loginova
Abstract:
The problem of handling missing information is important for many fields of information technology. This work analyzes and applies an original method of missing-data imputation, using as an example neural-network-based prediction of recidivism among persons enrolled in correction programs. Traditional ways of dealing with missing data are either to ignore instances with missing elements or to fill them in with mean/mode-based methods. Although both approaches have obvious disadvantages, their widespread use is explained by their simplicity and minimal computational complexity. This work describes an alternative: the instance space is first clustered, and the data are then imputed in each cluster separately. Contrary to the common practice of applying hierarchical clustering algorithms, we use a density-based algorithm that remains effective in high-dimensional spaces regardless of cluster shape. Another feature of our approach is a modified mixed-type similarity measure that uses the complete information in a vector when computing its relevance. The modification scales the differences of vector attributes in high-dimensional space according to their significance, which is computed from χ² statistics. Six methods were created for data imputation within clusters, all based either on inter-vector distances or on the vectors' relative positions under a shared-nearest-neighbors measure. Experimental results obtained on a public-domain dataset show a stable 15-22% advantage of the proposed imputation method over the traditional linear mean-value method. The benefits, observed on every type of data, were especially pronounced for data with a large (15-30%) fraction of missing values.
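The core idea above (cluster the complete instances, then impute each incomplete instance from its own cluster rather than from the global statistics) can be sketched as follows. This is a minimal illustrative sketch, not the authors' algorithm: the paper uses a shared-nearest-neighbors density-based clustering and a χ²-weighted mixed-type similarity, for which a plain epsilon-neighborhood clustering and Euclidean distance stand in here; the function names `eps_clusters` and `impute_by_cluster` are hypothetical.

```python
# Hedged sketch of cluster-then-impute vs. global mean imputation.
# Assumptions: numeric attributes only, Euclidean distance, and a crude
# epsilon-neighborhood clustering in place of the paper's SNN density-based
# algorithm and chi-square-weighted mixed-type similarity.
import math

def eps_clusters(rows, eps):
    """Group complete rows into connected components of the
    eps-neighborhood graph (a stand-in for density-based clustering)."""
    n = len(rows)
    labels = [-1] * n
    cur = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        labels[i] = cur
        stack = [i]
        while stack:
            j = stack.pop()
            for k in range(n):
                if labels[k] == -1 and math.dist(rows[j], rows[k]) <= eps:
                    labels[k] = cur
                    stack.append(k)
        cur += 1
    return labels

def impute_by_cluster(data, eps=1.5):
    """Fill None entries with the per-attribute mean of the nearest cluster
    of complete rows (nearest = closest cluster centroid measured on the
    attributes the incomplete row actually has)."""
    complete = [r for r in data if None not in r]
    labels = eps_clusters(complete, eps)
    k = max(labels) + 1
    dims = len(data[0])
    # per-cluster attribute means
    means = []
    for c in range(k):
        members = [r for r, l in zip(complete, labels) if l == c]
        means.append([sum(r[d] for r in members) / len(members)
                      for d in range(dims)])
    out = []
    for row in data:
        if None not in row:
            out.append(list(row))
            continue
        obs = [d for d in range(dims) if row[d] is not None]
        # pick the cluster whose centroid is closest on observed attributes
        best = min(range(k), key=lambda c: math.dist(
            [row[d] for d in obs], [means[c][d] for d in obs]))
        out.append([row[d] if row[d] is not None else means[best][d]
                    for d in range(dims)])
    return out
```

On a toy set with two well-separated groups, e.g. `[[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [None, 5.0]]`, the incomplete row is completed from the mean of the nearby cluster (≈ 5.1) rather than the global mean (≈ 2.6), which is exactly the advantage over linear mean-value imputation that the abstract reports.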
Various kinds of predictors trained on the restored data showed a stable 5-14% improvement in overall prediction accuracy compared with the dataset restored by the linear method. These results make the prediction accuracy competitive with that obtained on the reference (complete) dataset. The only weak point of our approach, which limits its use in real-time models, is its high demands on running time and memory. As future work on the proposed methodology, we suggest multiple imputation within separate clusters, or dynamic cluster formation. Another possible way to improve imputation accuracy is specialized algorithms that preserve the statistical properties of the clusters being formed.
References
  1. Little, R. J. A. and Rubin, D. B., Statistical Analysis with Missing Data. 2nd edition. New Jersey: John Wiley and Sons. 2002. 408 P.
  2. Schafer, J. L., Multiple imputation: a primer // Statistical Methods in Medical Research. 1999. V. 8. N. 1. P. 3-15.
  3. Fujikawa, Y. and Ho, T. B., Cluster-based algorithms for dealing with missing values // Proceedings in Advances in Knowledge Discovery and Data Mining. Berlin: Springer. 2002. P. 549-554.
  4. Mantaras, R. L., A distance-based attribute selection measure for decision tree induction // Machine Learning. 1991. V. 6. P. 81-92.
  5. Gan, G., Ma, C., and Wu, J., Data Clustering: Theory, Algorithms, and Applications // ASA-SIAM Series on Statistics and Applied Probability. Philadelphia: SIAM Press. 2007. V. 20. 466 P.
  6. Wishart, D., K-means clustering with outlier detection, mixed variables and missing values // Schwaiger, M., Opitz, O., Exploratory Data Analysis in Empirical Research. New York: Springer. 2003. P. 216-226.
  7. Chernoff, H. and Lehmann, E. L., The use of maximum likelihood estimates in χ² tests for goodness of fit // The Annals of Mathematical Statistics. 1954. V. 25. P. 579-586.
  8. Ertoz, L., Steinbach, M., and Kumar, V., Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data // Second SIAM International Conference on Data Mining. San Francisco: SIAM Press. 2003. P. 47-58.
  9. Ayuyev V. V., Tun Ch., Tura A., Aung Z. Y. A domain-based method for compensating for database information incompleteness // Proceedings of Bauman MSTU. V. 2. Moscow: Bauman MSTU. 2007. P. 57-64. (in Russian)
  10. Luger, G. F., Artificial Intelligence: Strategies and Methods for Solving Complex Problems. Moscow: Williams. 2005. 864 P. (Russian edition)
  11. Crime and Justice Research Center, Temple University. http://www.temple.edu/prodes/
  12. Tan, P. N., Steinbach, M., and Kumar, V., Introduction to Data Mining. New York: Addison Wesley. 2005. 769 P.
  13. Ester, M., Kriegel, H. P., Sander, J., and Xu, X., A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining. Portland: AAAI Press. 1996. P. 226-231.
  14. Lehmann, E. L. and Romano, J. P., Testing Statistical Hypotheses. 3rd edition. New York: Springer. 2005. 786 P.
  15. Loginov B. M., Ayuyev V. V. Neural-network agents in control problems with time-shared high-dimensional input data // Neurocomputers: Development and Application. 2007. N. 5. P. 21-31. (in Russian)
  16. Tarkhov D. A. Neural Networks: Models and Algorithms. Book 18 / Ed. by A. I. Galushkin. Moscow: Radiotekhnika. 2005. 256 P. (in Russian)
  17. Hosmer, D. W. and Lemeshow, S., Applied logistic regression, 2-nd edition. New York: John Wiley and Sons. 2000. 392 P.
  18. Cristianini, N. and Shawe-Taylor, J., An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge: Cambridge University Press. 2000. 189 P.
  19. Bentley, J. L., K-d Trees for Semidynamic Point Sets // SCG '90: Proc. 6-th Annual Symposium on Computational Geometry. 1990. P. 187-197.
  20. King, G., Honaker, J., Joseph, A., and Scheve, K., Analyzing Incomplete Political Science Data: An Alternative Algorithm for Multiple Imputation // American Political Science Review. 2001. V. 95. N. 1. P. 49-69.