I.O. Shkokov1
1 AVIV Group GmbH (Paris, France)
1 igor.shkokov@gmail.com
The rapid growth in the volume of data processed in the e-commerce sector drives up operational costs in this segment of the economy. The choice of data storage and processing format directly affects the performance of analytical systems and the speed of decision-making, which makes it a relevant research problem. Numerical evidence for the right choice of format can help reduce the operational costs of e-commerce systems.
Through a numerical experiment, the study aims to quantify how the choice of data storage and processing format (CSV, Parquet, or Avro) affects the performance of e-commerce analytical systems, focusing on the disk space occupied by the data and the execution speed of Structured Query Language (SQL) queries.
Experiments were conducted using the RetailRocket dataset, the DuckDB library, and the Python programming language to simulate the operation of modern cloud data warehouses. Five typical analytical database queries (event aggregation, table joins, filtering and sorting, window functions, and multi-table analysis) were run and their execution times measured. The columnar Parquet format showed a clear advantage: it executed queries up to 7 times faster than CSV and provided the best compression (40.2% of the original data size). Conversely, the Avro format was on average 624.2% slower than the standard CSV format while offering only a marginal disk space saving of 11.9%.
This article is of practical value to specialists working in analytics and data processing in the e-commerce sector: it provides numerical evidence supporting the selection of the Parquet format to optimize storage and speed up analytical queries.
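The figures reported above use several conventions at once: a speed-up factor, a percentage slowdown, a size relative to the original, and a percentage saving. The short sketch below puts them on a common scale. It rests on two assumptions that are not stated explicitly in the text: that "worse by 624.2%" means the Avro runtime exceeds the CSV runtime by 624.2% of the CSV runtime, and that sizes are relative to the original CSV data. The function names are illustrative only.

```python
def runtime_ratio_from_percent_slower(percent_slower):
    """Convert 'X% slower than baseline' into runtime as a multiple of baseline.

    Assumes the percentage is the increase relative to the baseline runtime.
    """
    return 1.0 + percent_slower / 100.0


def saving_from_relative_size(percent_of_original):
    """Convert 'file is X% of the original size' into a percentage saving."""
    return 100.0 - percent_of_original


def relative_size_from_saving(percent_saving):
    """Convert 'saves X% of disk space' into size as a percentage of original."""
    return 100.0 - percent_saving


if __name__ == "__main__":
    # Avro versus CSV runtime, under the interpretation stated above.
    print(round(runtime_ratio_from_percent_slower(624.2), 2))
    # Parquet disk space saving implied by a file at 40.2% of the original.
    print(round(saving_from_relative_size(40.2), 1))
    # Avro file size implied by an 11.9% saving.
    print(round(relative_size_from_saving(11.9), 1))
```

Under these assumptions, Avro queries take roughly 7.24 times as long as the same queries on CSV, Parquet saves about 59.8% of disk space, and Avro files remain at about 88.1% of the original size.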
Shkokov I.O. Optimizing data storage format selection: An empirical performance comparison of CSV, Avro, and Parquet in e-commerce analytics systems // Neurocomputers. 2026. V. 28. № 3. P. 25–34. DOI: https://doi.org/10.18127/j19998554-202603-04.

