IRVS VLSI IDEA INNOVATORS
VLSI Project, Embedded Project, Matlab Projects and courses with 100% Placements

Tuesday, June 21, 2011

Using MCAPI to lighten an MPI load

High-performance computing (HPC) relies on large numbers of computers to get a tough job done. Often, one computer will act as a master, parceling out data to processes that may be located anywhere in the world. The Message Passing Interface (MPI) provides a way to move the data from one place to the next.
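To make that concrete, here is a minimal master/worker sketch using standard MPI calls in C; the payload, tag, and rank assignments are arbitrary choices for this example, not anything prescribed by MPI.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);                  /* join the MPI job */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* which process am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* how many processes are there? */

    if (rank == 0) {
        /* Master: parcel out one integer of "work" to every other process. */
        for (int dest = 1; dest < size; dest++) {
            int work = dest * 100;           /* arbitrary payload */
            MPI_Send(&work, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
        }
    } else {
        /* Worker: receive the data, wherever the master happens to run. */
        int work;
        MPI_Recv(&work, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank %d received %d\n", rank, work);
    }

    MPI_Finalize();
    return 0;
}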

Normally, MPI would be implemented once in each server to handle the messaging traffic. But with multicore servers using more than a few cores, it can be very expensive to use a complete MPI implementation because MPI would have to run on each core in the computer in an asymmetric multi-processing (AMP) configuration. The Multicore Communications API (MCAPI), on the other hand – a protocol designed with embedded systems in mind – is a much more efficient way to move MPI messages around within the computer.

Heavyweight champion

MPI was designed for HPC and is a well-established protocol that is robust enough to handle the problems that might be encountered in a dynamic network of computers. For example, such networks are rarely static. Whether it’s due to updates, maintenance, the purchase of additional machines, or even the simple fact that there is a physical network cable that can be inadvertently unplugged, MPI must be able to handle the eventuality of the number of nodes in the network changing. Even with a constant number of servers, those servers run processes that may start or stop at any time. So MPI includes the ability to discover who’s out there on the network.

At the programming level, MPI doesn’t reflect anything about computers or cores. It knows only about processes. Processes start at initialization, and then this discovery mechanism builds a picture of how the processes are arranged. MPI is very flexible in terms of how the topology can be created, but, when everything is up and running, there is a map of processes that can be used to exchange data. A given program can exchange messages with one process inside or outside a group or with every process in a group. The program itself has no idea whether it’s talking to a computer next to it or one on another continent.
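As a small illustration of that flexibility, the sketch below uses standard MPI communicator calls to split the processes into groups and broadcast within one group; the even/odd split and the function name broadcast_within_group() are inventions for this example.

#include <mpi.h>

/* Split MPI_COMM_WORLD into two groups (even and odd ranks) and
 * broadcast a value to every process in the caller's own group.
 * The grouping criterion is arbitrary -- purely for illustration. */
void broadcast_within_group(int *value)
{
    int world_rank;
    MPI_Comm group_comm;

    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Processes that pass the same "color" end up in the same communicator. */
    MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &group_comm);

    /* Rank 0 of each group broadcasts to the rest of that group only. */
    MPI_Bcast(value, 1, MPI_INT, 0, group_comm);

    MPI_Comm_free(&group_comm);
}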

So a program doesn’t care whether a computer running a process with which it’s communicating is single-core or multicore, homogeneous or heterogeneous, symmetric (SMP) or asymmetric (AMP). It just knows there’s a process to which it wants to send an instant message. It’s up to the MPI implementation on the computer to ensure that the messages get through to the targeted processes.

Due to the architectural homogeneity of SMP multicore, this is pretty simple. A single OS instance runs over a group of cores and manages them as a set of identical resources, so a process's work is naturally spread over the cores. If the process is multi-threaded, then its threads can take advantage of the cores to improve computing performance; nothing more must be done.
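As a rough sketch (the function scale_buffer() and its arguments are invented for illustration), a multi-threaded loop like this is typically all an SMP process needs in order to occupy every core; the OS schedules the threads.

#include <omp.h>

/* On an SMP node, one process can exploit all the cores by threading its
 * hot loop; the OS schedules the threads across the identical cores. */
void scale_buffer(double *buf, long n, double factor)
{
    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        buf[i] *= factor;
    }
}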

However, SMP starts to bog down as the core count grows, because contention for the shared bus and memory limits how much each additional core can contribute. For computers that are intended to help solve big problems as fast as possible, it stands to reason that more cores in a box is better, but only if they can be utilized effectively. To avoid the SMP limitations, we can use AMP instead for larger-core-count (so-called “many-core”) systems.

With AMP, each core (or each subgroup of cores) runs its own independent OS instance, and some might even have no OS at all, running on “bare metal.” Because a process cannot span more than one OS instance, each OS instance – potentially each core – runs its own processes. So, whereas an SMP configuration can still look like one process, AMP looks like many processes – even if they’re multiple instances of the same process.

Configured this way, each OS must run its own instance of MPI to ensure that its processes are represented in the network and get fed any messages coming their way. The issue is that MPI is a heavyweight protocol, a consequence of the range of conditions it must handle on a network. The environment connecting the cores within a closed box – or even on a single chip – is much more limited than the network within which MPI must operate. It also typically has far fewer resources than a network does. So MPI is over-provisioned for communication within a server (see sidebar).

Assisted by a featherweight

Unlike MPI, the MCAPI specification was designed by the Multicore Association specifically to be lightweight, so that it can handle inter-process communication (IPC) in embedded systems, which usually have considerably more limited resources. While MCAPI works differently from MPI, it still provides a basic, simple means of getting a message from one core to another. So we can use MCAPI to deliver MPI functionality much more inexpensively within a system that has limited resources but also more limited requirements.
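The sketch below gives a feel for MCAPI-style messaging. It is patterned on the MCAPI 1.x generation of the C API; exact header names, function names, and argument lists vary between specification versions and vendor implementations, and the node and port numbers are arbitrary, so treat it as illustrative rather than definitive.

#include <mcapi.h>   /* header name may vary by vendor */

/* Illustrative only: node/port numbers are arbitrary, and the call
 * signatures follow the MCAPI 1.x style, which differs from MCAPI 2.0. */
#define LOCAL_NODE    1
#define LOCAL_PORT    0
#define REMOTE_NODE   2
#define REMOTE_PORT   0

void send_hello(void)
{
    mcapi_version_t  version;
    mcapi_status_t   status;
    mcapi_endpoint_t local_ep, remote_ep;
    char msg[] = "hello";

    mcapi_initialize(LOCAL_NODE, &version, &status);

    /* Endpoints are MCAPI's addressing unit: (node, port) pairs. */
    local_ep  = mcapi_create_endpoint(LOCAL_PORT, &status);
    remote_ep = mcapi_get_endpoint(REMOTE_NODE, REMOTE_PORT, &status);

    /* Connectionless message send, roughly analogous to an MPI send. */
    mcapi_msg_send(local_ep, remote_ep, msg, sizeof(msg),
                   1 /* priority */, &status);

    mcapi_finalize(&status);
}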

There are two possible ways to bring MCAPI into an MPI design. The first works if the program using MPI uses very few MPI constructs – more or less just sending and receiving simple messages. The idea is to designate one “master” core within the server to run a full-up MPI service plus a translator for all the other “accelerator” cores in the box. The accelerator cores will run MCAPI instead of MPI. This means that MPI messages will run between the servers, but MCAPI messages will run between the cores inside the server (see Figure 1).



Fig 1: MPI messages will run between the servers, but MCAPI messages will run between the cores inside the server.


For the program instances running on the accelerator cores, you then replace the MPI calls with the equivalent MCAPI calls. This is why the approach works only for simpler uses of MPI: many MPI constructs have no MCAPI equivalents. A translator converts any messages moving between the MPI and MCAPI domains (see Figure 2).
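As a hedged illustration of that substitution on the accelerator side, the wrapper below stands in for what was originally an MPI receive call; the function name recv_work(), the node and port numbers, and the assumption that the translator forwards each MPI message as one MCAPI message are all inventions for this sketch, and the MCAPI calls follow the 1.x style noted above.

#include <stddef.h>
#include <mcapi.h>   /* header name may vary by vendor */

/* Hypothetical wrapper used on an accelerator core. The original program
 * called MPI_Recv(); here the same "receive my work" step is satisfied by
 * an MCAPI message that the translator process on the master core
 * forwarded from the real MPI network. */
#define MY_RECV_PORT  5   /* arbitrary port for this accelerator core */

size_t recv_work(void *buffer, size_t max_size)
{
    mcapi_status_t   status;
    mcapi_endpoint_t my_ep;
    size_t           received = 0;

    /* An endpoint would normally be created once at startup; it is shown
     * inline here only to keep the sketch self-contained. */
    my_ep = mcapi_create_endpoint(MY_RECV_PORT, &status);

    /* Blocking receive of whatever the translator forwarded to us. */
    mcapi_msg_recv(my_ep, buffer, max_size, &received, &status);

    return received;
}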

The cost of this arrangement is that the program must be edited and recompiled to use MCAPI instead of MPI on the accelerator cores. It also complicates program maintenance, since there are now two versions of the program: one using MPI and one using MCAPI.



Information is shared by www.irvs.info
