Journal Information-Measuring and Control Systems, № 12, 2016
Article in issue:
The data warehouse access method without caching dimension tables in random access memory using MapReduce/Spark technology
Authors:
E.Yu. Ermakov - Ph.D. (Eng.), Project Manager, Mail.Ru Group LLC (Moscow). E-mail: JK.Ermakov@gmail.com
V.A. Proletarskaya - Post-graduate Student, Department of Information Processing Systems and Management, Bauman Moscow State Technical University. E-mail: vilka2000@mail.ru
Abstract:
Currently, one of the most popular technologies for building distributed data processing systems is MapReduce/Spark. The MapReduce/Spark concept has established a whole set of new paradigms and structures for query creation and query processing. Published works offer two mutually exclusive solutions implemented with Spark tools: (1) using a Bloom filter to filter the fact table; (2) reading and filtering the dimension tables and then distributing them to all cluster nodes. However, the first method incurs a substantial time overhead (up to 50%) because of the Bloom filter, while the second requires a large amount of memory on the worker nodes. Thus, the existing data warehouse access methods have drawbacks, and this paper proposes a new method that is free of these shortcomings.

The paper presents the implementation of the developed method without caching dimension tables (MWCDT) in MapReduce/Hadoop and in Spark. Unlike Hadoop, Spark does not buffer intermediate data to disk (provided there is enough RAM), except during the Shuffle phase. The paper describes the MWCDT algorithms for the MapReduce (Hadoop) and Spark technologies. The dimension and fact tables are stored in the ORCFile format, and records are represented as key-value pairs. The Hadoop algorithm is implemented as three jobs, each consisting of Map and Reduce phases. The Spark implementation uses a single job that includes three Map transformations and two reduceByKey transformations.

The MWCDT method was compared with Apache Hive and Spark SQL in the DigitalOcean cloud infrastructure. The cluster consists of 7 nodes, one of which is the control node; the HDFS, Hive, Spark, and YARN services were deployed. The main characteristics of the deployed virtual machines are: a 4-core CPU, 8 GB of RAM, an 80 GB SSD drive, and CentOS 7.2 x64. A synthetically generated star schema with one fact table and nine dimension tables was selected as the database schema. Three query variants were selected, differing in the set of tables involved, the output attributes, and the filtering conditions.

On Query 1 and Query 3 the proposed method shows, on average, a 31% better result than Hive, and it outperforms Spark SQL on Query 3 (by 5%). As the parameter (the selectivity of the fact table predicate multiplied by the number of dimensions involved in the query) decreases, the MWCDT execution time decreases faster than that of the other methods, and MWCDT begins to outperform them. With a low selectivity of the fact table predicate and a small number of dimensions, MWCDT exceeds the other methods in query execution speed. It was found that a likely cause of the long running time of Query 2 is the serialization of the fact table attributes; overcoming this drawback is planned as the next step in the development of the proposed method.

The work was supported by the Russian Foundation for Basic Research within research project 16-37-00117 "Complex decision support at the design stage of a Data Warehouse". Head of research: Prof. Dr. Yuri Grigoriev.
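To make the first of the published approaches concrete, the following is a minimal Scala sketch of a Spark job that builds a Bloom filter over the keys of the filtered dimension table, broadcasts it, and uses it to prune the fact table before the join. It is not the code of the cited works or of this paper; the table paths, column names (dim_key, attribute), and filter parameters are illustrative assumptions only.

```scala
import org.apache.spark.sql.SparkSession

object BloomFilterStarJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("bloom-filter-join-sketch").getOrCreate()

    // Hypothetical ORC tables and column names; not the paper's actual schema.
    val dim  = spark.read.orc("hdfs:///warehouse/dim1").filter("attribute = 'target'")
    val fact = spark.read.orc("hdfs:///warehouse/fact")

    // Build a Bloom filter over the join keys of the filtered dimension rows
    // (expected number of distinct keys and false-positive rate are illustrative).
    val bloom   = dim.stat.bloomFilter("dim_key", 1000000L, 0.03)
    val bloomBc = spark.sparkContext.broadcast(bloom)

    // Pre-filter the fact table before the shuffle join: rows whose key cannot occur
    // in the dimension are dropped; false positives are eliminated by the join itself.
    val prunedFact = fact.filter(row => bloomBc.value.mightContain(row.getAs[Long]("dim_key")))

    prunedFact.join(dim, "dim_key").write.orc("hdfs:///warehouse/out_bloom")
    spark.stop()
  }
}
```

The broadcast filter itself is small, but every fact row pays the cost of a membership test, which is one source of the time overhead noted above.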
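By contrast, a join that does not cache dimension tables in the memory of every node can be sketched as a plain repartition (shuffle) join followed by reduceByKey, in line with the pipeline of Map transformations and reduceByKey described above. This is only an illustrative sketch under assumed table layouts and paths, not the authors' MWCDT implementation.

```scala
import org.apache.spark.sql.SparkSession

object ShuffleStarJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("no-dimension-caching-sketch").getOrCreate()

    // Both tables are read from ORC files, as in the paper's experiments;
    // paths and column names are illustrative assumptions.
    val factKV = spark.read.orc("hdfs:///warehouse/fact").rdd
      .map(r => (r.getAs[Long]("dim_key"), r.getAs[Double]("measure")))   // (key, measure)

    val dimKV = spark.read.orc("hdfs:///warehouse/dim1").rdd
      .filter(r => r.getAs[String]("attribute") == "target")              // filter the dimension early
      .map(r => (r.getAs[Long]("dim_key"), r.getAs[String]("attribute"))) // (key, attribute)

    // Repartition (shuffle) join: neither table is broadcast or cached in the RAM
    // of every node; only the Shuffle phase writes intermediate data to disk.
    val joined = factKV.join(dimKV)                                       // (key, (measure, attribute))

    // Aggregate measures per dimension attribute with reduceByKey.
    val result = joined
      .map { case (_, (measure, attribute)) => (attribute, measure) }
      .reduceByKey(_ + _)

    result.saveAsTextFile("hdfs:///warehouse/out_shuffle_join")
    spark.stop()
  }
}
```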
Pages: 90-97
References
1. Chandar J. Join Algorithms using Map/Reduce. Edinburgh: University of Edinburgh, 2010.
2. Zhou G., Zhu Y., Wang G. Cache Conscious Star-Join in MapReduce Environments // Cloud-I '13: Proceedings of the 2nd International Workshop on Cloud Intelligence. August 26, 2013.
3. Lin Y., Agrawal D., Chen C., Ooi B.C., Wu S. Llama: Leveraging Columnar Storage for Scalable Join Processing in the MapReduce Framework // Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. P. 961-972.
4. Brito J., Mosqueiro T., Ciferri R.R., Ciferri C.D.A. Faster Cloud Star Joins with Reduced Disk Spill and Network Communication. Chemometrics and Intelligent Laboratory Systems. 2016.
5. Lee R., Huai Y., Shao Z., et al. RCFile: A Fast and Space-Efficient Data Placement Structure in MapReduce-Based Warehouse Systems // ICDE. 2011. P. 1199-1208.
6. Grigoriev Yu.A., Plutenko A.D. Analysis of the table join time in a row-oriented parallel database system and with MapReduce technology // Informatika i sistemy upravlenija (Informatics and Control Systems). 2014. № 2. P. 3-11.
7. Chen S. (Turn Inc.). Cheetah: A High Performance, Custom Data Warehouse on Top of MapReduce // Proceedings of the VLDB Endowment. 2010. V. 3. Is. 1-2. September. P. 1459-1468.
8. Grigoriev Yu.A., Plutenko A.D., Pluzhnikov V.L., Ermakov E.Yu., Tsvyashchenko E.V., Proletarskaya V.A. Theory and practice of the analysis of parallel database systems (Teorija i praktika analiza parallelnykh sistem baz dannykh). Vladivostok: Dalnauka. 2015. 336 p.
9. Grigoriev Yu.A., Proletarskaya V.A. Early materialization method for data warehouse access with MapReduce technology // Informatika i sistemy upravlenija (Informatics and Control Systems). 2015. № 3. P. 3-16.
10. Eltabakh M.Y., Tian Yu., Özcan F., Gemulla R., Krettek A., McPherson J. CoHadoop: Flexible Data Placement and Its Exploitation in Hadoop // Proceedings of the VLDB Endowment. 2011. V. 4. Is. 9. June. P. 575-585.