E.Yu. Ermakov - Ph.D. (Eng.), Project Manager, Mail.Ru Group LLC (Moscow)
V.A. Proletarskaya - Post-graduate Student, Department of Information Processing Systems and Management, Bauman Moscow State Technical University
E-mail: email@example.com
Currently, one of the most popular solutions for developing distributed data processing systems is the MapReduce / Spark technology.
The MapReduce / Spark concept has established a whole set of new paradigms and structures for query creation and processing. Published works offer two mutually exclusive solutions, implemented with Spark tools:
1. Using a Bloom filter to filter the fact table;
2. Reading and filtering the dimension tables and then broadcasting them to all cluster nodes.
However, the first method incurs a substantial time overhead (up to 50%) due to the Bloom filter, while the second demands a large amount of memory on the nodes. Thus, the existing methods of accessing data warehouses have drawbacks. This paper proposes a new method that is free of these shortcomings.
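The Bloom-filter approach from item 1 above can be illustrated with a minimal, self-contained Python sketch. This is not the authors' implementation or Spark code; the class, table contents, and parameters are hypothetical and serve only to show how a filter built from dimension-table keys prunes fact-table rows before the join.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions over an m-bit array."""
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _positions(self, item):
        # Derive k deterministic bit positions from salted SHA-256 digests.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # No false negatives; false positives are possible by design.
        return all(self.bits[pos] for pos in self._positions(item))

# Hypothetical data: keys of dimension rows that satisfy the predicate.
dim_keys = [10, 42, 99]
bf = BloomFilter()
for key in dim_keys:
    bf.add(key)

# Prune fact-table rows whose join key cannot match any dimension key.
fact_rows = [(10, "a"), (7, "b"), (42, "c"), (5, "d")]
filtered = [row for row in fact_rows if bf.might_contain(row[0])]
```

Because a Bloom filter admits false positives, the surviving rows still require an exact join; the filter only reduces the volume shipped to that join, which is the source of both the benefit and the overhead discussed above.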
The paper presents the implementation of the developed method without caching dimension tables (MWCDT) in MapReduce / Hadoop and in Spark. Unlike Hadoop, Spark does not buffer intermediate data to disk (provided there is enough RAM), except during the Shuffle phase.
The paper describes the MWCDT algorithms for the MapReduce (Hadoop) and Spark technologies. The dimension and fact tables are stored in the ORCFile format, and records are processed in key-value form. The Hadoop algorithm is implemented as three jobs, each consisting of Map and Reduce phases. The Spark implementation uses a single job that includes three map transformations and two reduceByKey transformations.
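The key-value data flow described above can be sketched in plain Python by emulating a reduce-side (repartition) join between a fact table and one dimension. This is an illustration of the Map / reduce-by-key pattern only, not the MWCDT algorithm itself; the table contents and tags are hypothetical.

```python
from collections import defaultdict

def map_phase(records, tag):
    """Emit (join_key, (tag, payload)) pairs, as in a Map phase."""
    return [(key, (tag, payload)) for key, payload in records]

def reduce_by_key(pairs):
    """Group values by key, emulating a reduceByKey/groupByKey step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

# Hypothetical tables in key-value form: (join_key, attributes).
fact = [(1, "sale:100"), (2, "sale:250"), (1, "sale:70")]
dim = [(1, "city:Moscow"), (2, "city:Vladivostok")]

# Map both tables, tagging each record with its origin,
# then join the tagged groups on the key in the reduce step.
pairs = map_phase(fact, "F") + map_phase(dim, "D")
joined = []
for key, values in reduce_by_key(pairs).items():
    dims = [payload for tag, payload in values if tag == "D"]
    facts = [payload for tag, payload in values if tag == "F"]
    for f in facts:
        for d in dims:
            joined.append((key, f, d))
```

In this pattern the join happens at the reducers, so no dimension table has to be cached or broadcast to every node, which is the property the MWCDT method exploits.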
The MWCDT method was compared with the Apache Hive and Spark SQL methods in the DigitalOcean cloud architecture. The cluster consists of 7 nodes, one of which hosts the controlling services of HDFS, Hive, Spark, and YARN. Main characteristics of each deployed virtual node: 4-core CPU, 8 GB of RAM, 80 GB SSD drive, OS CentOS 7.2 x64. A synthetically generated star schema with one fact table and nine dimension tables was chosen as the database schema. Three query variants were selected, differing in the tables and attributes involved and in the filtering conditions.
On Query 1 and Query 3 the proposed method shows, on average, a 31% better result than Hive, and it outperforms Spark SQL on Query 3 (by 5%). As the parameter (the selectivity of the fact-table predicate multiplied by the number of dimensions involved in the query) decreases, the execution time of MWCDT falls faster than that of the other methods, and MWCDT begins to outperform them. Thus, with a low selectivity of the fact-table predicate and a small number of dimensions, MWCDT exceeds the other methods in query execution speed. It was found that a potential cause of the long execution of Query 2 is the serialization of the fact-table attributes. A further step in the development of the proposed method is planned to overcome this disadvantage.
The work was supported by the Russian Foundation for Basic Research within the framework of research project 16-37-00117, "Complex decision support at the design stage of a Data Warehouse." Head of research: Prof. Dr. Yuri Grigoriev.