LIP6 2002/018
- Thesis
Optimizations of the MPI communication library for "PC cluster" parallel machines based on a remote-write primitive - O. Glück
- 219 pages - 07/12/2002 - document en - http://www.lip6.fr/lip6/reports/2002/lip6.2002.018.pdf - 6,227 KB
- Contact : Olivier.Gluck (at) lip6.fr
- Former Theme : ASIM
- Keywords : parallel computer, PC clusters, communication library, message passing, MPI, remote write, Direct Memory Access, user-level communication, memory management, virtual/physical address, address translation
- Publisher : Francois.Dromard (at) lip6.fr
This Ph.D. thesis is part of the MPC (Multi-PC) research project, started in 1995 at Pierre et Marie Curie University, Paris. The goal of the project was to design a low-cost, high-performance parallel computer. The MPC parallel computer consists of several processing nodes interconnected by a gigabit High Speed Link network. This work presents how the Message Passing Interface (MPI) communication library can be optimized for a parallel computer built from clusters of workstations that provides a remote-write communication primitive. From the hardware point of view, this communication mechanism is very efficient. Our goal is to minimize the overhead of the communication software layers that applications use to access the high-speed network. This thesis focuses on an efficient implementation of MPI built on a simple Remote Direct Memory Access hardware primitive. The MPC parallel computer of the LIP6 laboratory was used for the experiments. However, our communication software layers were built over a generic remote-write API so that our MPI implementation can be ported easily to any hardware platform that offers a remote-write primitive. We study the impact of several factors on application performance and propose efficient mechanisms for implementing the Message Passing Interface on a remote DMA communication primitive. In particular, we describe solutions that eliminate system calls and interrupts during communications. A drawback of the remote-write primitive is that it uses physical memory addresses for sending data: the network controller directly accesses host memory on both the sender node and the receiver node. The major difficulty of this work lies in user-level access to the network interface by several processes and in address translation. We propose a mechanism that significantly reduces the overhead of translating the virtual addresses supplied by applications into the physical addresses used by the network controller.