Journal "Neurocomputers", No. 3, 2010
Article in issue:
Complex method for selection of significant input features in solving regression problems with neural networks
Authors:
A. G. Guzhva, S. A. Dolenko, I. G. Persiantsev
Abstract:
This paper considers the problem of reducing the dimensionality of data, for the case when the data are studied using neural networks, namely multilayer perceptrons. For a neural network model, using a larger number of input features from a dataset does not necessarily yield a higher-quality model. This is primarily due to the "curse of dimensionality" in the weight space of the network: a larger number of input features requires a larger number of neurons and, consequently, increases the dimensionality of the neuron weight space. A number of methods are considered that reduce the number of input features by selecting the most significant ones from the original set (Feature Selection methods); we refer to them as methods for Selection of Significant Features (SSF methods). This paper is an attempt to combine a variety of SSF methods into a unified methodology for solving regression problems with multilayer perceptrons. The need for such a methodology stems from significant differences among the SSF methods, as well as from differences in the required accuracy of the solution and in the computing resources available for various problems. The datasets used were taken from the public WEKA dataset collection (http://www.cs.waikato.ac.nz/ml/weka/). Numerous datasets from different collections were investigated, but the final conclusions are given for two collections: "Friedman datasets" and "Datasets-numeric". The paper gives a complete description of the dataset preprocessing, which was the same for all datasets used. For each neural network, 8 training runs were made (each from a new starting point in the weight space), for each of 10 variants of partitioning the original dataset into the subsets used in the multilayer perceptron training procedure.
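The evaluation protocol described above (8 training runs from fresh random weight initializations, for each of 10 partitions of the dataset) can be sketched as follows. This is an illustrative sketch only, not the authors' code; `train_model` is a hypothetical callable standing in for training and scoring a multilayer perceptron, and the 80/20 split ratio is an assumption.

```python
import random
import statistics

def evaluate_protocol(dataset, train_model, n_partitions=10, n_runs=8):
    """Average a quality metric over n_partitions random splits of the
    data, with n_runs networks trained per split, each run starting from
    a new random point in the neuron weight space.

    `train_model(train, test, seed)` is a hypothetical callable that
    trains one network (seeded initialization) and returns a scalar
    score, e.g. a coefficient of determination."""
    scores = []
    for p in range(n_partitions):
        rng = random.Random(p)
        data = dataset[:]
        rng.shuffle(data)                      # a new partition of the original dataset
        split = int(0.8 * len(data))           # assumed train/test ratio
        train, test = data[:split], data[split:]
        for run in range(n_runs):              # a new starting point in weight space
            scores.append(train_model(train, test, seed=run))
    # Report both the averaged indicator and its spread, as in the paper
    return statistics.mean(scores), statistics.stdev(scores)
```

Both the mean and the standard deviation are returned because the paper draws conclusions from the averaged indicators together with their standard deviations.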
Thus the results reflect statistical patterns associated with the SSF methods under review, rather than the specifics of individual datasets. The following SSF methods were compared: correlation analysis, cross-entropy analysis, and three methods from the family of neural network weight analysis methods. The "Add" method (also well known as SFS, Sequential Forward Selection), which sequentially chooses the most significant input features in descending order of significance, was investigated most actively; variations of the Add method, as well as its combinations with other SSF methods, were investigated as well. Initially, a number of "reference" neural networks were trained on the complete set of input features. Then, with the help of one of the SSF methods (or a combination of them), the most significant input features were selected. On this reduced set of input features, further neural networks were trained and compared to the "reference" networks, and the results of this comparison were used to draw conclusions about the SSF method in question. The conclusions were based on the numerical values of the following statistical indicators computed from the results of applying the neural networks: the linear correlation coefficient, the coefficient of multiple determination, and the number of input features in the reduced input set. The indicators were averaged over the neural networks used, as well as over the datasets used; in addition to the indicator values themselves, their standard deviations were also considered. The study confirmed that using SSF methods can reduce both the error of the solution and the computational cost of obtaining it.
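The greedy selection performed by the Add (SFS) method can be sketched as follows. This is a minimal illustration, not the paper's implementation; `score` is a hypothetical callable that would, in the paper's setting, train and evaluate networks on the given feature subset.

```python
def sequential_forward_selection(features, score, max_features=None):
    """Sketch of the "Add" (SFS) method: starting from an empty set,
    repeatedly add the feature whose inclusion gives the best score,
    i.e. features are chosen in descending order of significance.

    `score(subset)` is a hypothetical evaluation callable returning a
    quality value to maximize (e.g. a correlation coefficient on a
    validation set)."""
    selected, remaining = [], list(features)
    best_score = float("-inf")
    while remaining and (max_features is None or len(selected) < max_features):
        candidate, cand_score = None, best_score
        for f in remaining:
            s = score(selected + [f])
            if s > cand_score:
                candidate, cand_score = f, s
        if candidate is None:          # no remaining feature improves the score
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = cand_score
    return selected
```

The inner loop is what makes the method expensive: each step evaluates every remaining feature, so a full pass costs O(n²) model trainings in the number of features — consistent with the "extraordinary growth of computation time" noted below.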
Based on the results obtained, one can conclude that correlation analysis and cross-entropy analysis should be used as SSF methods for rapid assessment of input feature significance. Using either of these methods usually does not seriously reduce the solution error, but it does exclude a number of non-significant input features. The Add method allows one to improve the results substantially, both in accuracy and in an even more severe reduction of the input feature set, but at the cost of an extraordinary growth in computation time. The most efficient approach is to apply the Add method after the initial set of input features has been reduced with the rapid assessment methods.
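The rapid-assessment stage recommended above can be illustrated with correlation analysis: rank input features by the absolute linear correlation with the target and drop those below a threshold, then hand the reduced set to a more expensive method such as Add. The threshold value here is an assumption for illustration, not one from the paper.

```python
import numpy as np

def rapid_filter(X, y, threshold=0.1):
    """Rapid assessment of feature significance by correlation analysis:
    keep the columns of X whose absolute Pearson correlation with the
    target y reaches `threshold` (a hypothetical cutoff)."""
    kept = []
    for j in range(X.shape[1]):
        r = np.corrcoef(X[:, j], y)[0, 1]   # linear correlation coefficient
        if abs(r) >= threshold:
            kept.append(j)
    return kept
```

In the combined scheme suggested by the conclusions, the indices returned by such a filter would form the initial feature pool for the Add method, so that the expensive sequential search runs over far fewer candidates.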
Pages: 20-32
References
  1. Carreira-Perpiñán, M. Á., A review of Dimension Reduction Techniques. Technical Report CS-96-09, Dept. of Computer Science, University of Sheffield.
  2. Pudil, P., Somol, P., Current Feature Selection Techniques in Statistical Pattern Recognition. Computer Recognition Systems, Springer Berlin / Heidelberg. 2005. P. 53-68.
  3. Guyon, I., Elisseeff, A., An Introduction to Variable and Feature Selection. Journal of Machine Learning Research. 2003. V. 3. P. 1157-1182.
  4. Mladenić, D., Feature Selection for Dimensionality Reduction. C. Saunders et al. (Eds.) // SLSFS 2005. Lecture Notes in Computer Science. 2006. V. 3940. P. 84-102.
  5. Guzhva, A. G., Dolenko, S. A., Persiantsev, I. G., Shugay, Yu. S., Elenskiy, V. G. Selection of Significant Variables in Neural Network Forecasting: A Comparative Analysis of Methods // IX All-Russian Scientific Conference "Neuroinformatics-2007": Proceedings, Part 2. Moscow: MEPhI. 2007. P. 251-258. (In Russian.)
  6. Guzhva, A. G., Dolenko, S. A., Persiantsev, I. G., Shugay, Yu. S. Comparative Analysis of Methods for Determining the Significance of Input Variables in Neural Network Modeling: A Comparison Methodology and Its Application to Known Real-World Problems // X All-Russian Scientific Conference "Neuroinformatics-2008": Proceedings, Part 2. Moscow: MEPhI. 2008. P. 216-225. (In Russian.)
  7. The WEKA dataset collection. URL: http://www.cs.waikato.ac.nz/ml/weka/
  8. Friedman, J. H. Stochastic Gradient Boosting (Tech. Rep.). 1999. URL: http://www-stat.stanford.edu/~jhf/ftp/stobst.ps
  9. Kostenko, V. A., Smolik, A. E. A Multistart Algorithm with Pruning for Training Feedforward Neural Networks // IX All-Russian Scientific Conference "Neuroinformatics-2007": Proceedings, Part 3. Moscow: MEPhI. 2007. P. 251-257. (In Russian.)
  10. Ezhov, A. A., Shumsky, S. A. Neurocomputing and Its Applications in Economics and Business. Moscow: MEPhI. 1998. (In Russian.)
  11. Gevrey, M., Dimopoulos, I., Lek, S., Review and comparison of methods to study the contribution of variables in artificial neural network models // Ecological Modeling. 2003. V. 160. P. 249-264.
  12. Milne, L. Feature Selection Using Neural Networks with Contribution Measures // Computer Science and Engineering, University of New South Wales.
  13. Sarle, W. S. How to Measure Importance of Inputs. SAS Institute Inc., Cary, NC, USA. URL: ftp://ftp.sas.com/pub/neural/importance.html
  14. Zagoruiko, N. G. Applied Methods of Data and Knowledge Analysis. Novosibirsk: Institute of Mathematics Publishing House. 1999. (In Russian.)