mPyPl: functional monadic approach to data processing in deep learning


Journal: Information-measuring and Control Systems, № 5, 2024


Type of article: scientific article

DOI: 10.18127/j20700814-202405-10 © Soshnikov D.V., 2024

UDC: 510.23 004.622

Keywords: Monadic Computations, Data Processing, Python, Deep Learning

Authors:

D.V. Soshnikov¹

¹ National Research University Higher School of Economics (Moscow, Russia)

¹ Moscow Aviation Institute – National Research University (Moscow, Russia)

¹ dmitri@soshnikov.com

Abstract:

This paper presents an approach to data processing in deep learning data preparation pipelines based on the functional paradigm. We also present a new Python library called mPyPl, which implements the presented approach and is intended to simplify complex data processing tasks. The library defines operations on lazy data streams of named dictionaries represented as generators (so-called multi-field data streams), and allows those data streams to be enriched with additional 'fields' during data preparation and feature extraction. Thus, most data preparation tasks can be expressed as a neat linear 'pipeline', similar in syntax to UNIX pipes or the |> functional composition operator in F#. Different evaluation strategies allow for different trade-offs between memory and performance, and make it natural to express data augmentation tasks. We define basic operations on multi-field data streams, which resemble classical monadic operations, and show the similarity of the proposed approach to monads in functional programming.
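
The idea of a lazy stream of dictionaries that gains fields as it flows through a linear pipeline can be illustrated with plain Python generators. The sketch below is not the mPyPl API: the names `Stage`, `add_field`, `keep_if`, and `numbers` are hypothetical and chosen only for this illustration, and the `|` operator is emulated through `__ror__`.

```python
from typing import Any, Callable, Dict, Iterable, Iterator

Record = Dict[str, Any]


class Stage:
    """Hypothetical wrapper that lets a stream transformer be applied with `|`."""

    def __init__(self, fn: Callable[[Iterable[Record]], Iterator[Record]]):
        self.fn = fn

    def __ror__(self, stream: Iterable[Record]) -> Iterator[Record]:
        # `stream | stage` evaluates to stage.fn(stream); nothing is computed yet,
        # because fn is a generator function.
        return self.fn(stream)


def add_field(name: str, fn: Callable[[Record], Any]) -> Stage:
    """Enrich every record with a new field computed from the existing ones."""
    def run(stream: Iterable[Record]) -> Iterator[Record]:
        for rec in stream:
            yield {**rec, name: fn(rec)}
    return Stage(run)


def keep_if(pred: Callable[[Record], bool]) -> Stage:
    """Pass through only the records that satisfy the predicate."""
    def run(stream: Iterable[Record]) -> Iterator[Record]:
        for rec in stream:
            if pred(rec):
                yield rec
    return Stage(run)


def numbers(n: int) -> Iterator[Record]:
    """Source stream: one single-field record per integer in range(n)."""
    for i in range(n):
        yield {"x": i}


pipeline = (
    numbers(5)
    | add_field("square", lambda r: r["x"] ** 2)
    | keep_if(lambda r: r["square"] % 2 == 0)
    | add_field("label", lambda r: "big" if r["square"] > 5 else "small")
)

# The stream is evaluated only when it is consumed.
result = list(pipeline)
```

Because every stage is a generator, records are pulled through the whole chain one at a time, so the complete dataset never has to be held in memory at once.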

Pages: 88-96

For citation

Soshnikov D.V. mPyPl: functional monadic approach to data processing in deep learning. Information-measuring and Control Systems. 2024. V. 22. № 5. P. 88−96. DOI: https://doi.org/10.18127/j20700814-202405-10 (in Russian)

References

Soshnikov D., Valieva A. mPyPl: Python Monadic Pipeline Library for Complex Functional Data Processing. Microsoft Journal of Applied Research (MS-JAR). December 2019. V. 12. Also available externally as arXiv:2106.09164 [cs.PL]. DOI: 10.48550/arXiv.2106.09164.
CrowdFlower. Data science report. 2016. https://visit.figure-eight.com/rs/416-ZBE-142/images/CrowdFlower_DataScienceReport_2016.pdf (accessed: 9.06.2024).
McKinney W. Data structures for statistical computing in Python. Proceedings of the 9th Python in Science Conference (editors Van der Walt S. and Millman J.). 2010. P. 51−56.
Chollet F. et al. Keras. 2015. https://keras.io.
Palard J. Pipe Python package. 2016. https://github.com/JulienPalard/Pipe.
Chollet F. Building powerful image classification models using very little data. 2016. https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html.
Databricks 2019a. Deep learning pipelines for Apache Spark. https://github.com/databricks/spark-deep-learning.
Databricks 2019b. Distributed training with Horovod on Databricks. https://docs.databricks.com/applications/deep-learning/distributed-training/index.html.
Wadler P. Comprehending monads. Mathematical Structures in Computer Science. 1992. P. 61−78.
Valieva Y., Soshnikov D., Scarfe T. Race events recognition project. 2019. https://github.com/vJenny/race-events-recognition.

Date of receipt: 30.08.2024

Approved after review: 13.09.2024

Accepted for publication: 27.09.2024