Approximate queries processing with estimated errors in big multidimensional data warehouses

350 rub

Journal Information-measuring and Control Systems №12 for 2016 г.

Article in number:

Keywords: OLAP Approximate Queries Processing Haar Wavelet Transform

Authors:

Yu.A. Grigorev - Dr.Sc. (Eng.), Professor, Department of Information Processing Systems and Management, Bauman Moscow State Technical University E-mail: grigorev@bmstu.ru A.O. Ukharov - Ph.D.(Eng.), PH Informatics Corp. (Moscow) E- mail: oukharov@gmail.com

Abstract:

Approximate query processing originated in OLAP where it is important to discover data relationships and trends rather than get an accurate answer. It involves creation of the compressed dataset and run queries over it. There are 3 main approaches for approximate query processing: 1) sampling, 2) histograms, and 3) wavelet trans-forms. The quality of sampling and histogramshighly depend on the dataset. Histograms provide substantial errors for res-toring single values andhigh building cost for multidimensional data. The wavelet approach performs well, buthas some fundamental issues: the existing errors estimation methods are focused on the overall relative errors and their upper limits; errors of single and summary values cannot be evaluated. We obtained the random error of restoring single, summary values andan arbitrary function over the compressed da-taset. Wesetuptheprobabilityspace. Let be randomly uniformly distributed over the wavelet de-composition cube. Then the particular decomposition W with coefficients is a single sample of the relevant population.These coefficients located in certain positions and will be wiped during compression. We showed that the error distribution function converges to the normal distribution while increasing the compression ratio and.It-s true for a general query (sum,average). We investigated the rate of convergence andestimatedthe confidence interval of original values restoration errors. We discovered that in case of a large number of wiped wavelet coefficients the variance ofthe original single value restoration is approximately equal to the square of the mean square error of restoration ofall elements. We developed the software tool and could reduce the sample data up to 60% while getting up to 47% gain in query processing time. The single value error was observed at 5.8 (estimated 6.2%) for confidence interval. The proposed method of approximate query processing provides estimated confidence intervals of original values restoration errors depending on the compression ratio as well as estimated size of the compressed dataset for as given error. We analyzed the mathematical model adequacy and acceptable accuracy for single and summary queries. We showed the gain in query processing time depending on a compression ratio.

Pages: 79-89

References

Chaudhuri S., Das G., Datar M. Overcoming Limitations of Sampling for Aggregation Queries // Proceedings of the 17th International Conference on Data Engineering. Washington. 2001. P. 534-542.
Ganti V., Lee M., Ramakrishnan R. ICICLES: Self-Tuning Samples for Approximate Query Answering. Proceedings of International Conference on Very Large Databases // Proceedings of the 26th International Conference on Very Large Data Bases. San Francisco, 2000. P. 176-187.
Gibbons P.B., Matias Y. New Sampling-Based Summary Statistics for Improving Approximate Query Answers // Proceedings of the 1998 ACM SIGMOD international conference on Management of data. New York. 1998. P. 331-342.
Poosala V., Ioannidis Y. Selectivity Estimation Without the Attribute Value Independence Assumption // Proceedings of the 23rd International Conference on Very Large Data Bases. San Francisco. 1997. P. 486-495.
Ioannidis Y. The History of Histograms // Proceedings of the 29th international conference on Very large data bases. Berlin. 2003. V. 27 (1). P. 19-30.
Chakrabarti K., Garofalakis M., Rastogi R. Approximate query processing using wavelets // The VLDB Journal - The International Journal on Very Large Data Bases. 2001. V. 10(2). P. 199-223.
Garofalakis M., Gibbons P.B. Probabilistic Wavelet Synopses // ACM Transactions on Database Systems (TODS). 2004. V. 29(1). P. 43-90.
Garofalakis M., Kumar A. Deterministic Wavelet Thresholding for Maximum Error Metrics // Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. New York. 2004. P. 166-176.
Vitter J.S., Wang M. Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets // Proceedings of the 1999 ACM SIGMOD international conference on Management of data. New York, 1999. P. 193-204.
Vitter J.F., Wang M., Iyer B. Data Cube Approximation and Histograms via Wavelets // Proceedings of the seventh international conference on Information and knowledge management. Bethesda. 1998. P. 96-104.
Stollnitz E.J., DeRose T.D., Salesin D.H. Wavelets for Computer Graphics: Theory and Applications. Morgan Kaufmann Publishers. Inc., San Francisco. California. 1996.
Welstead S. Fractal and Wavelet Image Compression Techniques. SpiePress. 1999.
Burdakov A., Grigorev U., Ploutenko A., Ukharov A. Approximate Query Processing Using Wavelets in OLAP with Arbitrarily Sized Data and Bounded Errors // 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing PDP 2016, Heraklion, Greece, February 2016.
Central limit theorem: http://en.wikipedia.org/wiki/Central_limit_theorem#Lyapunov_CLT.