Journal Highly Available Systems, No. 3, 2014
Article in issue:
Environment for integration of large heterogeneous data collections
Authors:
V. I. Budzko - Dr.Sc. (Eng.), Professor, Deputy Director, Institute of Informatics Problems, Russian Academy of Sciences. E-mail: vbudzko@ipiran.ru
L. A. Kalinichenko - Dr.Sc. (Phys.-Math.), Head of Laboratory, Institute of Informatics Problems, Russian Academy of Sciences. E-mail: leonidandk@gmail.com
S. A. Stupnikov - Ph.D. (Eng.), Senior Research Scientist, Institute of Informatics Problems, Russian Academy of Sciences. E-mail: ssa@ipi.ac.ru
A. E. Vovchenko - Ph.D. (Eng.), Senior Research Scientist, Institute of Informatics Problems, Russian Academy of Sciences. E-mail: alexey.vovchenko@gmail.com
D. O. Briukhov - Ph.D. (Eng.), Senior Research Scientist, Institute of Informatics Problems, Russian Academy of Sciences. E-mail: brd@ipi.ac.ru
D. Yu. Kovalev - Programmer, Institute of Informatics Problems, Russian Academy of Sciences. E-mail: dm.kovalev@gmail.com
Abstract:
Data exploration underlies the science and IT paradigm that has become dominant in recent years. Within this paradigm, techniques are needed to overcome the diversity of data models (including semi-structured and unstructured data), metadata, and data semantics. Semi-structured NoSQL databases, combined with the Hadoop and MapReduce technologies for parallel processing of large semi-structured data collections, are steadily growing in popularity, which is explained by a multitude of actual and potential applications. This work belongs to the area of data-intensive systems development. Its aim is to analyze approaches to developing an environment for the integration of heterogeneous data collections. The environment should support both virtual and materialized integration of data collections represented in traditional and non-traditional data models. Virtual integration is provided by subject mediation technology: mediators form a layer between users (applications) and heterogeneous information resources. Materialized integration is proposed to be implemented using Hadoop, an open-source system for distributed data storage and processing, combined with a data warehouse system running over Hadoop (Hive or IBM Big SQL can be used for this purpose). The paper presents the basic features of the environment for integration of large heterogeneous data collections. It illustrates the transformation of collections represented in non-traditional data models into the integrated data model, which is based on the warehouse data model. A brief overview of methods for information extraction from text, entity resolution, and data fusion is provided. Techniques for programming entity resolution and data fusion methods in HIL, a high-level integration language, are illustrated. An example of a problem to be solved using the proposed environment is provided.
Pages: 3-19