Performance of MPICH-MX 1.2.6..0.92c

Uniprocessor (UP) case (one process per node)

MPICH is a portable implementation of MPI, developed by Argonne National Laboratory. It is designed to be highly portable and is currently used by a large number of providers of MPI implementations. MPICH-MX is a "port" of MPICH on top of MX (ch_mx) developed and supported by Myricom.

The current MPICH-MX release is MPICH-MX 1.2.6..0.92c, which is based on MPICH 1.2.6. MX-2G versions of MX and MPICH-MX are available to Myricom customers. MX will ultimately be available in two versions: MX-2G for the PCI-X series NICs and MX-10G for the Myrinet-10G NICs. For upcoming releases, Myricom plans to charge nominal license and support fees for MX software support.

Performance data is presented for the Pallas MPI Benchmark Suite, Version 2.2.

Each benchmark is run with varying message lengths, and timings are averaged over multiple samples. Details of each test can be found in Section 4.2.1 of the Pallas MPI Benchmarks -- documentation MPI-1 part.

Important Note: For these performance graphs, only one process per node was used, so there was no intranode communication by shared memory.

The test environment consists of sixty-four dual AMD64 1.4 GHz machines, each with an M3F-PCIXF-2 NIC on a 133 MHz 64-bit PCI-X bus, connected by a Myrinet-2000 switch. Each machine has 2 GB of memory and is running Linux 2.4.19 (SuSE SLES 8.1), MX-2G (version 1.0.0) and MPICH-MX 1.2.6..0.92c. The Pallas MPI Benchmark is compiled with gcc 3.2.2 using -O.

Point to Point Communication

Point to point communication performance is measured between two processes. Latency is measured in microseconds (us) and bandwidth in MB/s. The latency scale is logarithmic and the bandwidth scale is linear.

PMB PingPong

PingPong is the classical pattern used for measuring startup (latency) and throughput (bandwidth) of a single message sent between two processes.
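As a rough illustration of how one PingPong sample could be turned into the two plotted quantities, the sketch below halves a measured round-trip time to get a one-way latency and divides the message size by that latency to get a bandwidth. The function name, the convention that 1 MB is 2**20 bytes, and the sample numbers are assumptions made for this example; they are not taken from the benchmark source.

```python
def pingpong_metrics(message_bytes, round_trip_us):
    """Return (one-way latency in us, bandwidth in MB/s) for one sample.

    Assumes the one-way latency is half the round-trip time and that
    1 MB means 2**20 bytes.
    """
    latency_us = round_trip_us / 2.0
    if message_bytes == 0 or latency_us == 0:
        # Bandwidth is not meaningful for empty messages or zero time.
        return latency_us, 0.0
    megabytes = message_bytes / 2**20
    seconds = latency_us * 1e-6
    return latency_us, megabytes / seconds


# Hypothetical sample: a 1 MB message with an 8000 us round trip.
latency, bandwidth = pingpong_metrics(2**20, 8000.0)
print(latency, bandwidth)
```

Small messages are dominated by the fixed startup cost, so their latency is the interesting figure; for large messages the bandwidth figure is the more informative one.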

graph

PMB PingPing

Like PingPong, PingPing measures the startup and throughput of a single message sent between two processes, with the crucial difference that messages are obstructed by oncoming messages. For this, two processes communicate (MPI_Isend/MPI_Recv/MPI_Wait) with each other, with the MPI_Isend calls issued simultaneously.

graph

PMB Sendrecv

Based on MPI_Sendrecv(), the processes form a periodic communication chain. Each process sends to the right neighbor and receives from the left neighbor in the chain. The turnover count is 2 messages per sample (1 in, 1 out) for each process.

For 2 processes, Sendrecv will report the bi-directional bandwidth of the system, as obtained by the (optimized) MPI_Sendrecv function.
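The periodic chain can be sketched with modular arithmetic on the process ranks. This is a plain-Python model rather than MPI code, and the helper name ring_neighbors is hypothetical.

```python
def ring_neighbors(rank, num_procs):
    """Return (left, right) neighbors of `rank` in a periodic chain."""
    left = (rank - 1) % num_procs   # the process this rank receives from
    right = (rank + 1) % num_procs  # the process this rank sends to
    return left, right


for rank in range(4):
    left, right = ring_neighbors(rank, 4)
    print(f"rank {rank}: receives from {left}, sends to {right}")
```

With only 2 processes, the left and right neighbors are the same process, so each pair sends and receives at once. This is why the 2-process result reflects bi-directional bandwidth.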

Results are presented for 2, 4, 8, 16, and 32 processes.

graph

PMB Exchange

Exchange is a communications pattern that often occurs in grid splitting algorithms (boundary exchanges). The group of processes is seen as a periodic chain, and each process exchanges data with both left and right neighbor in the chain.

The turnover count is 4 messages per sample (2 in, 2 out) for each process.
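To make the "2 in, 2 out" count concrete, the following sketch simulates one boundary exchange for a one-dimensional grid split across a periodic chain. It models the data movement in plain Python; the function name and the sample grid are illustrative assumptions, not part of the benchmark.

```python
def exchange_halos(chunks):
    """Simulate one boundary exchange on a periodic chain of processes.

    chunks[i] is the list of cells owned by process i. Each process sends
    its first cell to the left neighbor and its last cell to the right
    neighbor (2 out), and receives one cell from each side (2 in).
    Returns a list of (left_halo, right_halo) pairs, one per process.
    """
    n = len(chunks)
    halos = []
    for rank in range(n):
        left = (rank - 1) % n
        right = (rank + 1) % n
        left_halo = chunks[left][-1]   # last cell of the left neighbor
        right_halo = chunks[right][0]  # first cell of the right neighbor
        halos.append((left_halo, right_halo))
    return halos


# Hypothetical grid of 9 cells split across 3 processes.
chunks = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
halos = exchange_halos(chunks)
print(halos)
```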

Results are presented for 2, 4, 8, 16, and 32 processes.

graph

Collective Communication

Collective communication performance is measured between all or a subset of the nodes in the system.

PMB Reduce

Benchmark of the MPI_Reduce() function. Reduces a vector of length L = X/sizeof(float) float items. The MPI datatype is MPI_FLOAT; the MPI operation is MPI_SUM. The root of the operation is changed cyclically.
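A minimal sketch of the setup described above: the vector length follows from the message size, the root rotates between samples, and the reduction is an element-wise sum. It assumes sizeof(float) is 4 bytes and that "changed cyclically" means the root for sample i is i mod the number of processes. Both are assumptions for illustration, and the function names are hypothetical.

```python
FLOAT_SIZE = 4  # assumed sizeof(float) in bytes


def reduce_setup(message_bytes, num_procs, num_samples):
    """Return (vector length L, list of root ranks, one per sample)."""
    length = message_bytes // FLOAT_SIZE
    # Assumed cycling rule: root advances by one rank per sample.
    roots = [sample % num_procs for sample in range(num_samples)]
    return length, roots


def reduce_sum(vectors):
    """Element-wise sum of one equal-length vector per process."""
    return [sum(column) for column in zip(*vectors)]


length, roots = reduce_setup(1024, 4, 6)
total = reduce_sum([[1.0, 2.0], [3.0, 4.0]])
print(length, roots, total)
```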

Results are presented for 2, 4, 8, 16, and 32 processes.

graph

PMB Reduce_scatter

Benchmark of the MPI_Reduce_scatter() function. Reduces a vector of length L = X/sizeof(float) float items. The MPI datatype is MPI_FLOAT; the MPI operation is MPI_SUM. In the scatter phase, the L items are split as evenly as possible between all processes. More precisely, when

np = #processes, L = r*np+s (s = L mod np),
then the process with rank i gets r+1 items when i < s, and r items when i >= s.
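The splitting rule can be written as a few lines of Python. The function name is hypothetical; the arithmetic follows the rule stated above.

```python
def reduce_scatter_counts(length, num_procs):
    """Return the number of items each rank receives in the scatter phase.

    With length = r * num_procs + s, ranks below s get r + 1 items and
    the remaining ranks get r items.
    """
    r, s = divmod(length, num_procs)
    return [r + 1 if rank < s else r for rank in range(num_procs)]


# Example: L = 10 items over 4 processes, so r = 2 and s = 2.
counts = reduce_scatter_counts(10, 4)
print(counts)
```

The counts always add up to L, and no two ranks differ by more than one item.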

Results are presented for 2, 4, 8, 16, and 32 processes.

graph

PMB Allreduce

Benchmark of the MPI_Allreduce() function. Reduces a vector of length L = X/sizeof(float) float items. The MPI datatype is MPI_FLOAT; the MPI operation is MPI_SUM.

Results are presented for 2, 4, 8, 16, and 32 processes.

graph

PMB Allgather

Benchmark of the MPI_Allgather() function. Every process sends X bytes and receives the gathered X*(#processes) bytes.
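The data volumes can be checked with a small plain-Python model of the gather, in which every process ends up with a copy of every block. The function name and the block contents are made up for the example.

```python
def allgather(blocks):
    """blocks[i] is the X-byte block contributed by process i.

    Returns one receive buffer per process; each buffer holds all blocks
    in rank order.
    """
    num_procs = len(blocks)
    return [list(blocks) for _ in range(num_procs)]


# Hypothetical case: 3 processes, X = 8 bytes each.
blocks = [b"\x00" * 8, b"\x01" * 8, b"\x02" * 8]
gathered = allgather(blocks)
received_bytes = sum(len(block) for block in gathered[0])
print(received_bytes)
```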

Results are presented for 2, 4, 8, 16, and 32 processes.

graph

PMB Allgatherv

Functionally the same as Allgather, but using the MPI_Allgatherv() function. It shows whether the MPI implementation incurs extra overhead from the more general interface compared with MPI_Allgather().

Results are presented for 2, 4, 8, 16, and 32 processes.

graph

PMB Alltoall

Benchmark of the MPI_Alltoall() function. Every process inputs X*(#processes) bytes (X for each process) and receives X*(#processes) bytes (X for each process).
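One way to picture this pattern is as a transpose of the blocks: the block that process i addresses to process j ends up in slot i of process j's receive buffer. The sketch below models that in plain Python; the function name and the sample data are illustrative only.

```python
def alltoall(send_buffers):
    """send_buffers[i][j] is the block process i sends to process j.

    Returns recv_buffers, where recv_buffers[j][i] is that same block.
    """
    n = len(send_buffers)
    return [[send_buffers[src][dst] for src in range(n)] for dst in range(n)]


# Hypothetical case with 2 processes and one labeled block per destination.
send = [["a0", "a1"], ["b0", "b1"]]
recv = alltoall(send)
print(recv)
```

Each process both supplies and receives one block per process, which matches the X*(#processes) byte counts on both sides.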

Results are presented for 2, 4, 8, 16, and 32 processes.

graph

PMB Broadcast

Benchmark of the MPI_Bcast() function. A root process broadcasts X bytes to all other processes. The root of the operation is changed cyclically.

Results are presented for 2, 4, 8, 16, and 32 processes.

graph

PMB Barrier

This is a benchmark of the MPI_Barrier() function.

graph


SMP (two processes per node) timings for the Pallas Benchmark

SMP (two processes per node) timings for the Pallas Benchmark are also available.

Last updated: 20 June 2005