Big data analytical processing technologies | Publishing house Radiotekhnika

Journal: Information-measuring and Control Systems, No. 12, 2016

Article in issue:

Big data analytical processing technologies

Keywords: OLAP, MapReduce, Hadoop, Spark, advantages and shortcomings

Authors:

Yu.A. Grigorev - Dr.Sc. (Eng.), Professor, Department of Information Processing Systems and Management, Bauman Moscow State Technical University. E-mail: grigorev@bmstu.ru

Abstract:

The term Big Data appears in almost every discussion of data mining and analysis across a wide range of areas, including economics, manufacturing, marketing, and telecommunications. Companies most often apply Big Data to customer service (53%) and operational effectiveness (40%). The article describes the most popular technologies for Big Data processing: OLAP, MapReduce/Hadoop, and Spark. OLAP was introduced in 1993. It is based on data warehouses (DW): databases containing a large amount of data structured in a way that is convenient for analysis. A DW relies on a multidimensional data representation, implemented as MOLAP, ROLAP, or HOLAP. The last of these is a hybrid model that combines the first two approaches: the data is stored in a relational database (ROLAP) and the aggregates in a multidimensional one (MOLAP). Scalability is a problem for OLAP tools: as the original dataset grows, the cost of implementation rises dramatically. NoSQL implements a new strategy based on open-source solutions, with native scalability and reliability achieved through multiple replication of database records across a number of low-cost nodes. The data is stored as (key, value) pairs. The value may contain aggregates to avoid reading from multiple tables. However, NoSQL has limited functionality for processing complex queries. The next step in this evolution is the MapReduce (MR) technique, e.g. the Hadoop implementation with the HDFS file system. The article details an example of joining two tables with Hadoop. Record processing consists of two functions: Map: (k1, v1) → list(k2, v2) | Reduce: (k2, list(v2)) → list(k3, v3). Hadoop is hard to configure, involves many read/write operations, and has a high query processing time. Further studies focused on improving the MR technique, and Spark is one such tool. The article reviews the query processing schemes of Hadoop and Spark and compares their performance, fault tolerance, and ease of programming. Spark demonstrates better performance and usability. Hadoop is more reliable: if a node fails, it restarts the map/reduce functions on another node.
To conclude, the article reviews the strengths and weaknesses of the OLAP, MapReduce/Hadoop, and Spark techniques. Implementing OLAP on expensive parallel relational DBMSs (Teradata, Oracle Exadata) is costly. The MapReduce technique provides affordable tools for processing Big Data.
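
The abstract mentions joining two tables with MapReduce but does not reproduce the article's example. The following is only a minimal sketch of one common approach, a reduce-side join, simulated in plain Python without Hadoop. The tables, field names, and the helper functions `map_customer`, `map_order`, `shuffle`, `reduce_join`, and `run_join` are hypothetical and chosen for illustration; they are not taken from the article.

```python
from collections import defaultdict
from itertools import product

# Hypothetical sample tables (illustrative data, not from the article).
customers = [(1, "Alice"), (2, "Bob")]              # (customer_id, name)
orders = [(101, 1, 250), (102, 1, 90), (103, 2, 40)]  # (order_id, customer_id, amount)


def map_customer(record):
    """Map: emit the join key with a value tagged by its source table."""
    cust_id, name = record
    yield cust_id, ("C", name)


def map_order(record):
    """Map: emit the join key with a value tagged by its source table."""
    order_id, cust_id, amount = record
    yield cust_id, ("O", (order_id, amount))


def shuffle(pairs):
    """Group mapped values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(key, values):
    """Reduce: pair every customer row with every order row sharing the key."""
    names = [v for tag, v in values if tag == "C"]
    order_rows = [v for tag, v in values if tag == "O"]
    for name, (order_id, amount) in product(names, order_rows):
        yield key, name, order_id, amount


def run_join(customers, orders):
    """Run the map, shuffle, and reduce phases sequentially in one process."""
    mapped = []
    for record in customers:
        mapped.extend(map_customer(record))
    for record in orders:
        mapped.extend(map_order(record))
    result = []
    for key, values in sorted(shuffle(mapped).items()):
        result.extend(reduce_join(key, values))
    return result


joined = run_join(customers, orders)
print(joined)
```

In a real Hadoop job the shuffle step is performed by the framework and its intermediate results pass through disk, which is one source of the read/write overhead noted in the abstract.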

Pages: 59-68
