E. L. Syromyatnikov, D. V. Makagon, S. I. Paruta, A. A. Rumyantsev
Collective operations are used for inter-node communication in a broad variety of tasks. Typical examples include broadcast (sending the same data from a single node to a set of nodes), reduce (combining data from a set of nodes with a commutative, associative binary operation and sending the result to a given single node), scatter (distributing an array from a single node to a set of nodes, with every node receiving its part of the array), gather (sending parts of an array from a set of nodes to a single node, which receives the complete array), allreduce (same as reduce, but the result is sent to all nodes that took part in the operation), allgather (same as gather, but the complete array is received by every node in the set), and alltoall (every node distributing its own array of data across all nodes in the set).
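The semantics listed above can be made concrete with a small reference model. The following is only an illustrative sketch, not the implementation described in this work: it assumes each node's data is one entry of a Python list indexed by rank, that arrays divide evenly among the nodes, and it uses `None` for nodes that receive nothing. All function names are chosen here for illustration.

```python
from functools import reduce as fold
from operator import add

def broadcast(root_data, n):
    """Every one of the n nodes receives the same data."""
    return [root_data] * n

def reduce(per_node, op, root):
    """Only the root receives op folded over all contributions."""
    result = fold(op, per_node)
    return [result if rank == root else None for rank in range(len(per_node))]

def scatter(root_array, n):
    """Node i receives the i-th of n equal chunks of the root's array."""
    chunk = len(root_array) // n
    return [root_array[i * chunk:(i + 1) * chunk] for i in range(n)]

def gather(per_node, root):
    """Only the root receives the concatenation of all nodes' parts."""
    full = [x for part in per_node for x in part]
    return [full if rank == root else None for rank in range(len(per_node))]

def allreduce(per_node, op):
    """Like reduce, but every node receives the result."""
    result = fold(op, per_node)
    return [result] * len(per_node)

def allgather(per_node):
    """Like gather, but every node receives the complete array."""
    full = [x for part in per_node for x in part]
    return [list(full) for _ in per_node]

def alltoall(per_node):
    """Node j receives element j from every node i (a transpose)."""
    n = len(per_node)
    return [[per_node[i][j] for i in range(n)] for j in range(n)]

print(allreduce([1, 2, 3], add))
print(alltoall([[1, 2], [3, 4]]))
```

In this model, alltoall is a transpose of the per-node data, and allreduce gives the same result as a reduce followed by a broadcast.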
Collective operations are among the basic communication primitives in most parallel programming standards (MPI, Shmem, and PGAS languages such as UPC and X10). Collectives can account for a significant part of inter-node communication in many applications (at least in those that use linear algebra, graphs, or structured and unstructured grids). Although collectives can be implemented straightforwardly through point-to-point operations, a more sophisticated implementation that relies on hardware support can significantly increase the performance and scalability of parallel programs by aggregating data and avoiding duplicate traffic.
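To illustrate what avoiding duplicate traffic can mean, the sketch below counts link traversals for a broadcast on a k×k torus in two idealized cases: a naive scheme in which the root sends a separate shortest-path message to every other node, and a tree in which the data crosses each tree edge exactly once. Both models are assumptions made for this example (the function names are hypothetical, and real point-to-point broadcasts are often smarter than the naive scheme); the numbers are not measurements of any hardware.

```python
def ring_distance(a, b, k):
    """Shortest hop count between positions a and b on a ring of k nodes."""
    d = abs(a - b)
    return min(d, k - d)

def unicast_broadcast_hops(k, root=(0, 0)):
    """Link traversals when the root sends one message per destination."""
    return sum(
        ring_distance(x, root[0], k) + ring_distance(y, root[1], k)
        for x in range(k)
        for y in range(k)
    )

def tree_broadcast_hops(k):
    """Link traversals when data crosses each edge of a spanning tree once."""
    return k * k - 1

for k in (3, 8, 16):
    print(k, unicast_broadcast_hops(k), tree_broadcast_hops(k))
```

Under these assumptions, the gap between the two counts widens as the torus grows, because the naive scheme resends the same data over links close to the root.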
JSC “NICEVT” is developing a high-speed interconnection network with a multidimensional torus topology. Hardware support for collective broadcast and reduce is implemented by adding two virtual subnetworks with a tree topology.
The tree has a root, relative to which two directions of movement are defined: towards the root and away from the root. Each direction has its own virtual channel. The tree is constructed according to the XYZW order of the dimensions, which avoids deadlocks between different intersecting trees. Auxiliary transit nodes that do not logically belong to the tree can be used to keep it connected when no connected tree that complies with the XYZW-order rule can otherwise be built.
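One way to picture a tree built in dimension order is sketched below. It is an assumption-laden illustration, not the construction used in the hardware: every node of the torus is taken to belong to the tree (so no auxiliary transit nodes are needed), and each node's parent is taken to be the previous hop on a route from the root that corrects the first dimension fully, then the second, and so on. The names `step_toward`, `parent`, `build_tree` and `depth` are hypothetical.

```python
from itertools import product

def step_toward(pos, target, k):
    """Move one hop from pos toward target the shorter way round a ring of k."""
    forward = (target - pos) % k  # hops needed when moving in the +1 direction
    return (pos + 1) % k if forward <= k - forward else (pos - 1) % k

def parent(node, root, dims):
    """Previous hop on the dimension-ordered route from root to node."""
    # The last hop of that route lies in the highest dimension that differs.
    for axis in reversed(range(len(dims))):
        if node[axis] != root[axis]:
            moved = list(node)
            moved[axis] = step_toward(node[axis], root[axis], dims[axis])
            return tuple(moved)
    return None  # the node is the root itself

def build_tree(root, dims):
    """Map every torus node to its parent (None for the root)."""
    return {n: parent(n, root, dims) for n in product(*(range(k) for k in dims))}

def depth(tree, node):
    """Number of hops from node to the root along parent links."""
    hops = 0
    while tree[node] is not None:
        node = tree[node]
        hops += 1
    return hops

tree = build_tree(root=(0, 0), dims=(3, 3))
for node in sorted(tree):
    print(node, "->", tree[node])
```

In this sketch, traffic moving away from the root traverses X before Y, and traffic moving towards the root traverses the dimensions in the reverse order, which matches the two directions described above.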
The implemented collective operations are asynchronous and one-sided, i.e. control returns to the processor as soon as the operation is injected into the network, and the result is stored in the memory of the receiving side without involving its processor. This allows computation and communication to overlap.
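The overlap of computation and communication can be mimicked in software as a rough analogy. In the sketch below, a background thread stands in for the network hardware and a Python list stands in for the receiver's memory; the class `ToyNetwork` and its method `broadcast_async` are hypothetical names invented for this example and do not correspond to any real interface of the interconnect.

```python
import threading

class ToyNetwork:
    """Stand-in for the network hardware: delivers data on its own thread."""

    def broadcast_async(self, data, receive_buffers):
        done = threading.Event()

        def deliver():
            for buf in receive_buffers:
                buf[:] = data  # written without any receiver-side code running
            done.set()

        threading.Thread(target=deliver).start()
        return done  # handed back to the caller immediately

net = ToyNetwork()
buffers = [[0, 0, 0] for _ in range(4)]  # one receive buffer per toy node

handle = net.broadcast_async([1, 2, 3], buffers)
local = sum(i * i for i in range(1000))  # local work overlaps the transfer
handle.wait()  # block only when the delivered data is actually needed

print(buffers, local)
```

The caller blocks only at `handle.wait()`, so any work placed between the injection and the wait proceeds while the data is being delivered.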
Currently, the third-generation interconnect prototype (M3) is operational. It consists of 9 nodes connected in a 3x3 two-dimensional torus. Debugging and fine-tuning of the collective operations are in their final stage.