A.S. Lebedev1
1 Design Information Technologies Center, Russian Academy of Sciences (Odintsovo, Moscow Region, Russia)
1 tementy@gmail.com
The purpose of this work is to increase the performance of affine programs executed on systems with distributed memory. Reducing the cost of communication in multiprocessor systems is one way to improve the performance of a parallel program. For distributed-memory systems, placements of computations and data are found jointly. The mapping of virtual processors to physical ones then distributes computations and data among the physical processors according to the placements found for each statement and array in the program. A set of macros has been developed that simplifies this distribution and provides the means for processing communication polyhedra in parallel programs, assuming two-sided communication between processes within the MPI standard. With this approach, it is not necessary to replicate data on all machines or to apply special models for organizing the computation. For each array access, the following static analysis is performed: it is checked whether the index of the virtual processor owning the array element coincides with the index of the virtual processor executing the dynamic instance of the statement in which the access occurs. If they coincide, the access is treated as local; otherwise it is treated as remote. The developed software translates the original sequential C program into an optimized parallel version, also in C.
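As an illustration of the owner-computes distribution and the local/remote classification described above, below is a minimal hand-written sketch; it is not actual ilpy output, and the array sizes, the block-placement owner function, and all identifiers are assumptions made for the example. It shows a one-dimensional affine loop under a block placement, where the only remote access occurs at block boundaries and is served by a two-sided MPI send/receive pair.

```c
/*
 * Hypothetical sketch (not actual ilpy output): owner-computes execution of a
 * one-dimensional affine loop under a block placement, with two-sided MPI
 * communication for the single remote access at block boundaries.
 *
 * Statement:  for (i = 1; i < N; i++) a[i] = b[i] + b[i-1];
 * Placement:  element i of a and b, and iteration i of the statement, are all
 *             mapped to virtual processor owner(i) = i / block.
 * Analysis:   access b[i] is local (the owner of the element equals the
 *             executing processor); access b[i-1] is remote only when i is
 *             the first index of a block, so each rank receives one element
 *             from its left neighbour.
 */
#include <mpi.h>
#include <stdio.h>

#define N 1024                      /* problem size, assumed divisible by size */

static double a[N], b[N];           /* each rank touches only its own block    */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int block = N / size;           /* size of the block owned by each rank    */
    int lo = rank * block;          /* first index owned by this rank          */
    int hi = lo + block;            /* one past the last owned index           */

    for (int i = lo; i < hi; i++)   /* initialize only the owned part of b     */
        b[i] = (double)i;

    /* The communication polyhedron contains one point per pair of neighbouring
       ranks: element b[hi-1] is needed by iteration i = hi of the next rank.  */
    if (rank + 1 < size)
        MPI_Send(&b[hi - 1], 1, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD);
    if (rank > 0)
        MPI_Recv(&b[lo - 1], 1, MPI_DOUBLE, rank - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    /* Owner computes: each rank executes only the iterations placed on it. */
    for (int i = (lo > 0 ? lo : 1); i < hi; i++)
        a[i] = b[i] + b[i - 1];     /* b[i-1] is now available locally */

    printf("rank %d computed a[%d..%d]\n", rank, (lo > 0 ? lo : 1), hi - 1);
    MPI_Finalize();
    return 0;
}
```

In the translated programs, such send/receive pairs are produced from the communication polyhedra by the developed macros; the example above only mirrors the structure of the locality analysis described in the text.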
The developed software can be used to analyze the data flow of affine programs, improve data locality, and parallelize such programs for cluster systems, since MPI is supported as the parallelization technology. Experimental results on the performance of parallel versions of an LU-decomposition program for a square matrix are presented. The performance of parallel programs obtained with the developed translator ilpy is compared with that of programs obtained with the modern translator pluto 0.11.4. All parallel versions achieved a speedup of the computation. The parallel version obtained with ilpy shows better performance in all runs except those on two machines, where the performance loss ranged from 2% to 8% depending on the number of processes. The largest advantage over pluto is achieved when running with eight processes and amounts to 27% on one machine and 18% on eight machines. The pluto versions show better CPU utilization in all parallel runs, but a performance advantage over ilpy is achieved only when running on two machines.
Lebedev A.S. Translation of affine programs for parallel execution on systems with distributed memory. Highly Available Systems. 2023. V. 19. № 3. P. 35−47. DOI: https://doi.org/10.18127/j20729472-202303-03 (in Russian)