Journal Highly Available Systems №3, 2022
Article in issue:
System for distributed training of Kernel forests
Type of article: scientific article
DOI: https://doi.org/10.18127/j20729472-202203-05
UDC: 004.89
Authors:

D.A. Devyatkin1

1 Federal Research Center «Computer Science and Control» of the Russian Academy of Sciences (Moscow, Russia)

Abstract:

Random forests of univariate decision trees are widely used for data and text analysis, but their applicability to large and sparse datasets is limited. One approach to this problem is to build forests of decision trees with oblique or more complex splits. However, training such forests on large-scale data takes significantly longer than training univariate decision trees. In addition, different types of splits require different computing resources: a linear split can be trained on central processors (CPUs), whereas training nonlinear splits requires graphics processors (GPUs). This paper proposes a distributed architecture for a random forest training system in which the training process is parallelized at the level of individual splits. The architecture reduces downtime of hardware resources, assigns training tasks to different types of computing nodes depending on the type of split, and dynamically scales resources according to the load. Experiments have shown that it significantly reduces the total computing time required to train forests on large high-dimensional datasets. The presented architecture can serve as a basis for applied systems for data mining and image processing intended for use in various sectors of the economy: agriculture, industry, and transport.
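The abstract describes parallelization at the level of individual splits, with training tasks routed to CPU or GPU nodes depending on the split type. Below is a minimal illustrative sketch in Python of such routing under a simple task-queue model; all names (SplitTask, dispatch, the two queues) are hypothetical and are not the API of the system presented in the article.

    # Hypothetical sketch: route split-training tasks to CPU or GPU worker pools.
    from dataclasses import dataclass
    from queue import Queue
    from typing import List

    @dataclass
    class SplitTask:
        node_id: int              # tree node whose split is being trained
        sample_indices: List[int] # subset of the training data at this node
        kind: str                 # "linear" (CPU pool) or "kernel" (GPU pool)

    cpu_queue: Queue = Queue()    # linear splits are cheap enough for CPU nodes
    gpu_queue: Queue = Queue()    # nonlinear (kernel) splits go to GPU nodes

    def dispatch(task: SplitTask) -> None:
        # Route a split-training task to the pool matching its cost profile.
        (gpu_queue if task.kind == "kernel" else cpu_queue).put(task)

    def train_split(task: SplitTask):
        # Placeholder: fit the split on the assigned samples, then enqueue
        # the resulting child partitions as new SplitTask objects until the
        # stopping criteria are met.
        ...

In such a scheme, workers of each type pull only from their own queue, so idle time on one pool does not block the other, and each pool can be scaled independently with the load.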

Pages: 59-68
For citation:

Devyatkin D.A. System for distributed training of Kernel forests. Highly Available Systems / Sistemy vysokoy dostupnosti. 2022. V. 18. № 3. P. 59−68. DOI: https://doi.org/10.18127/j20729472-202203-05 (in Russian)

References
  1. Breiman L. Classification and regression trees. Routledge. 2017. 368 p.
  2. Breiman L. Random forests. Machine learning. 2001. V. 45. № 1. P. 5−32.
  3. Khan Z., Rahimi-Eichi V., Haefele S., Garnett T., Miklavcic S.J. Estimation of vegetation indices for high-throughput phenotyping of wheat using aerial imaging. Plant methods. 2018. V. 14. № 1. P. 1−11.
  4. Devyatkin D.A., Grigoriev O.G. Random Kernel Forests. IEEE Access. 2022. V. 10. P. 77962−77979.
  5. Friedman J.H. Stochastic gradient boosting. Computational statistics & data analysis. 2002. V. 38. № 4. P. 367−378.
  6. Kraska T., Talwalkar A., Duchi J.C., Griffith R., Franklin M.J., Jordan M.I. MLbase: A Distributed Machine-learning System. CIDR. 2013. V. 1. P. 2−1.
  7. Zaharia M., Chowdhury M., Das T., Dave A., Ma J., McCauley M., Franklin M., Shenker S., Stoica I. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12). 2012. P. 15−28.
  8. Hindman B., Konwinski A., Zaharia M., Ghodsi A., Joseph A.D., Katz R., Stoica I. Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. 8th USENIX Symposium on Networked Systems Design and Implementation (NSDI 11). 2011.
  9. Malewicz G., Austern M.H., Bik A.J., Dehnert J.C., Horn I., Leiser N., Czajkowski G. Pregel: a system for large-scale graph processing. Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. 2010. P. 135−146.
  10. Low Y., Gonzalez J., Kyrola A., Bickson D., Guestrin C., Hellerstein J.M. Distributed Graphlab: A framework for machine learning in the cloud. arXiv preprint arXiv:1204.6078. 2012.
  11. Low Y., Gonzalez J.E., Kyrola A., Bickson D., Guestrin C.E., Hellerstein J. Graphlab: A new framework for parallel machine learning. arXiv preprint arXiv:1408.2041. 2014.
  12. Gillick D., Faria A., DeNero J. Mapreduce: Distributed computing for machine learning. Berkeley. December 2006. V. 18.
  13. Panda B., Herbach J.S., Basu S., Bayardo R.J. PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce. Proceeding VLDB Endow. 2009. V. 2. P. 1426–1437.
  14. Murdopo A. Distributed decision tree learning for mining big data streams. Master of Science Thesis, European Master in Distributed Computing. 2013. P. 75.
  15. Ye J., Chow J.-H., Chen J., Zheng Z. Stochastic gradient boosted distributed decision trees. Proceedings of the 18th ACM conference on Information and knowledge management. 2009. P. 2061−2064.
  16. Li B., Yu Q., Peng L. Ensemble of fast learning stochastic gradient boosting. Communications in Statistics-Simulation and Computation. 2022. V. 51. № 1. P. 40−52.
  17. Chen T., He T., Benesty M., Khotilovich V., Tang Y., Cho H., Chen K. Xgboost: extreme gradient boosting. R package version 0.4-2. 2015. V. 1. № 4. P. 1−4.
  18. Dorogush A.V., Ershov V., Gulin A. CatBoost: gradient boosting with categorical features support. arXiv preprint arXiv:1810.11363. 2018.
  19. Druzhkov P.N., Polovinkin A.N. Implementation of a parallel learning algorithm for gradient boosting of decision trees on distributed-memory systems. Parallel Computational Technologies 2012 (PAVT'2012). Novosibirsk, March 26−30, 2012. Novosibirsk. 2012. P. 459−465. (in Russian)
  20. Zhang H. et al. Real-time distributed-random-forest-based network intrusion detection system using Apache spark. IEEE 37th international performance computing and communications conference (IPCCC). 2018. P. 1−7.
  21. Murphy P.M. and Aha D.W. UCI Repository of machine learning databases. Dept. Inf. Comput. Sci., Univ. California, Irvine, CA, USA, Tech. Rep., 1991. Accessed: Jul. 24, 2022. URL: https://archive.ics.uci.edu/ml/about.html.
  22. Krizhevsky A. Learning multiple layers of features from tiny images. M.S. thesis, Dept. Comput. Sci., Univ. Toronto, Toronto, ON. Canada. 2009.
Date of receipt: 12.08.2022
Approved after review: 23.08.2022
Accepted for publication: 29.08.2022