Myricom at ISC2006 (ISC2006 Official Sponsor)

Low-Latency 10-Gigabit Ethernet

Myricom's principal announcement at ISC2006 is the MX/Ethernet software, which, when used with Myri-10G NICs and a low-latency 10-Gigabit Ethernet switch, achieves performance metrics over 10-Gigabit Ethernet similar to MX-10G over 10-Gigabit Myrinet.

HPC Clustering with 10-Gigabit Ethernet Switches
2.4µs MPI latency, 1200 MByte/s one-way data rates

Myricom is announcing and demonstrating at the International Supercomputer Conference the new capability of its Myrinet Express (MX) message-passing software and 10-Gigabit/s, dual-protocol, Myri-10G NICs to support "MX over Ethernet," delivering HPC-calibre performance using standard 10-Gigabit Ethernet switches.  MX/Ethernet operates by kernel bypass, in which application programs communicate directly with the firmware in the programmable Myri-10G NIC, to achieve low latency and low host-CPU utilization nearly on par with MX over 10-Gigabit Myrinet.

MX/Ethernet uses 10-Gigabit Ethernet as a layer-2 network with an MX EtherType[1] to identify MX packets (frames).  The same network and Myri-10G NICs can carry TCP/IP traffic along with the MX traffic, but MX/Ethernet uses MX's efficient reliability layer rather than TCP/IP.  The technique is open, transparent to Ethernet switch makers, less expensive than proprietary HPC solutions, and applicable both to HPC and to enterprises.
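
As a rough illustration of how an EtherType lets MX and TCP/IP traffic share one layer-2 network, the sketch below builds an Ethernet II header and dispatches received frames on the EtherType field. It is not Myricom's implementation. The numeric MX EtherType is not given here, so the code uses 0x88B5 (the IEEE 802 Local Experimental EtherType 1) purely as a stand-in, and the names build_frame and classify are hypothetical.

```python
import struct

# Assumption: the source does not give the numeric MX EtherType. 0x88B5 is the
# IEEE 802 "Local Experimental EtherType 1", used here only as a stand-in.
ETHERTYPE_MX_PLACEHOLDER = 0x88B5
ETHERTYPE_IPV4 = 0x0800  # standard EtherType for IPv4

# Ethernet II header: destination MAC, source MAC, EtherType (network byte order).
ETH_HEADER = struct.Struct("!6s6sH")


def build_frame(dst: bytes, src: bytes, ethertype: int, payload: bytes) -> bytes:
    """Prepend a 14-byte Ethernet II header to the payload (no FCS, no padding)."""
    if len(dst) != 6 or len(src) != 6:
        raise ValueError("MAC addresses must be 6 bytes")
    return ETH_HEADER.pack(dst, src, ethertype) + payload


def classify(frame: bytes) -> str:
    """Decide which protocol stack should handle a received frame."""
    _dst, _src, ethertype = ETH_HEADER.unpack_from(frame)
    if ethertype == ETHERTYPE_MX_PLACEHOLDER:
        return "mx"   # would be handed to the MX reliability layer
    if ethertype == ETHERTYPE_IPV4:
        return "ip"   # would be handed to the ordinary TCP/IP stack
    return "other"


if __name__ == "__main__":
    a = bytes.fromhex("020000000001")  # locally administered example addresses
    b = bytes.fromhex("020000000002")
    print(classify(build_frame(b, a, ETHERTYPE_MX_PLACEHOLDER, b"mx payload")))
    print(classify(build_frame(b, a, ETHERTYPE_IPV4, b"ip payload")))
```

Because a switch forwards on the MAC addresses and treats the EtherType as opaque, nothing in this scheme requires switch support, which is consistent with the claim that the technique is transparent to Ethernet switch makers.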

MX/Ethernet is plug-and-play with any 10-Gigabit Ethernet switch, although some switches deliver better performance than others.  The table of MPI benchmarks[2] below starts with MX/Myrinet with Myri-10G NICs and a 128-port 10-Gigabit Myrinet switch as a baseline.  The performance of MX/Ethernet with the new Fulcrum Microsystems FM2224, 24-port, 10GBase-CX4, 10-Gigabit Ethernet switch is almost as good as the MX/Myrinet performance.  The Fujitsu XG700, 12-port, 10GBase-CX4, 10-Gigabit Ethernet switch exhibits slightly lower data rates than the Fulcrum switch, and slightly higher latencies, but is still entirely adequate for most HPC applications.  The last column of the table cites recently published[3] MPI benchmarks for Mellanox InfiniBand to show that MX with Myri-10G NICs soundly beats InfiniBand, even with standard 10-Gigabit Ethernet switches.

MPI Benchmark | MX/Myrinet, Myricom 10G Myrinet switch | MX/Ethernet, Fulcrum 10G Ethernet switch | MX/Ethernet, Fujitsu 10G Ethernet switch | OpenIB with Intel MPI, Mellanox InfiniBand
--- | --- | --- | --- | ---
PingPong latency | 2.4µs | 2.4µs | 2.8µs | 4.0µs
One-way data rate (PingPong) | 1204 MByte/s | 1201 MByte/s | 1002 MByte/s | 964 MByte/s
Two-way data rate (SendRecv) | 2397 MByte/s | 2162 MByte/s | 1762 MByte/s | 1902 MByte/s
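
To put the one-way data rates in context, the short sketch below expresses them as a fraction of a nominal 10 Gbit/s link. It assumes a raw rate of exactly 10 Gbit/s (1250 MByte/s) for both 10-Gigabit Myrinet and 10-Gigabit Ethernet and ignores framing overhead, so the fractions are only indicative. The InfiniBand column is left out because its link rate is not stated here.

```python
# Nominal link rate, assumed to be exactly 10 Gbit/s with no framing overhead.
LINE_RATE_BIT_PER_S = 10e9
LINE_RATE_MBYTE_PER_S = LINE_RATE_BIT_PER_S / 8 / 1e6  # 1250 MByte/s

# One-way (PingPong) data rates in MByte/s, copied from the benchmark table.
ONE_WAY_MBYTE_PER_S = {
    "MX/Myrinet, Myricom switch": 1204,
    "MX/Ethernet, Fulcrum switch": 1201,
    "MX/Ethernet, Fujitsu switch": 1002,
}


def line_rate_fraction(rate_mbyte_per_s: float) -> float:
    """Return the measured rate as a fraction of the nominal link rate."""
    return rate_mbyte_per_s / LINE_RATE_MBYTE_PER_S


if __name__ == "__main__":
    for name, rate in ONE_WAY_MBYTE_PER_S.items():
        print(f"{name}: {line_rate_fraction(rate):.1%} of nominal 10 Gbit/s")
```

Under these assumptions, the Myrinet and Fulcrum configurations reach roughly 96% of the nominal rate and the Fujitsu configuration roughly 80%.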

These results show that for small clusters, up to the size that a single switch can support, 10-Gigabit Ethernet is capable of performance formerly associated only with specialty cluster interconnects.  Inasmuch as there are no high-port-count, low-latency, full-bisection, 10-Gigabit Ethernet switches on the market today, MX/Myrinet with 10-Gigabit Myrinet switches will continue to be preferred for large clusters because of the economy and scalability of Myrinet switching.

This MX/Ethernet innovation provides strong new evidence that 10-Gigabit Ethernet will become the interconnect technology of choice for High-Performance-Computing (HPC) clusters, initially for small clusters, but, as 10-Gigabit Ethernet switch technology advances, for larger clusters as well.  Clusters have come to dominate the TOP500 supercomputer list in recent years.  Over the past two years, commodity Gigabit Ethernet has eclipsed specialty interconnects, including Myricom’s earlier Myrinet-2000 interconnect, in the number of systems in the TOP500 list.  However, Gigabit Ethernet is not fast enough for leading-edge cluster hosts with their multiple, multi-core processors.  In anticipation of these trends, Myricom’s latest generation of products, Myri-10G, was designed as a convergence at 10-Gigabit/s data rates of Myrinet, the most successful specialty network for HPC applications, and mainstream Ethernet.  As these MX/Ethernet results demonstrate, Myricom’s Myri-10G technology combines the best of both worlds.

Booth exhibit: In Myricom’s ISC2006 booth, we are exhibiting a cluster of eight hosts, each with two 3 GHz dual-core Intel Xeon (Woodcrest) processors, for 32 processor cores in total, connected with Myri-10G networks in both Myrinet and Ethernet modes. The cluster can be booted into either Linux or Microsoft Windows Compute Cluster Server 2003.  With an Rpeak of 384 Gflops and an Rmax of ~300 Gflops, this cluster is as fast as those qualifying for the TOP500 supercomputer list just a few years ago.
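
The Rpeak figure can be reproduced with simple arithmetic. The sketch below assumes four double-precision floating-point operations per core per clock cycle for these Woodcrest processors; that per-cycle figure is not stated in this announcement and is supplied here as an assumption. The Rmax value is the approximate ~300 Gflops quoted above.

```python
HOSTS = 8
SOCKETS_PER_HOST = 2     # "dual" processors per host
CORES_PER_SOCKET = 2     # dual-core
CLOCK_GHZ = 3.0
FLOPS_PER_CYCLE = 4      # assumption: not stated in the announcement
RMAX_GFLOPS = 300.0      # approximate value quoted in the text

cores = HOSTS * SOCKETS_PER_HOST * CORES_PER_SOCKET
rpeak_gflops = cores * CLOCK_GHZ * FLOPS_PER_CYCLE
efficiency = RMAX_GFLOPS / rpeak_gflops  # fraction of peak actually achieved

if __name__ == "__main__":
    print(f"cores      = {cores}")
    print(f"Rpeak      = {rpeak_gflops:.0f} Gflops")
    print(f"Rmax/Rpeak = {efficiency:.0%}")
```

With these inputs the cluster has 32 cores and an Rpeak of 384 Gflops, and the quoted Rmax corresponds to roughly 78% of peak.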

This demonstration cluster was provided for Myricom's use at ISC2006 by our highly experienced Swiss cluster integrator, Dalco AG, which has supplied numerous Myrinet clusters, including clusters used for computational-fluid-dynamics applications in the design of Formula-1 race cars.

One Myri-10G NIC in each host of the demonstration cluster operates in Myrinet mode, and is connected to a 128-port, 10-Gigabit Myrinet switch. The second Myri-10G NIC in each host is connected to a Fulcrum Microsystems FM2224, 24-port, 10GBase-CX4, 10-Gigabit Ethernet switch.  Visitors to the Myricom booth will be able to observe benchmarks and applications in both MX/Myrinet and MX/Ethernet modes.

Myricom attendees at ISC2006 are Dr. Chuck Seitz, CEO; Scott Atchley, Member Technical Staff; Martin Benes, Principal VLSI Architect; Patrice Duffort, Director of EMEA Sales; Brett Ellis, Senior Systems Administrator; and Dr. Markus Fischer, Senior Software Architect.


[1] If you are not familiar with the Ethernet EtherType, see http://www.answers.com/main/ntquery?s=EtherType .

[2] The MPI benchmarks for MX are the standard Pallas, now Intel, MPI benchmarks.  The data rates are converted from the Mebibyte (2^20 Byte) per second measure reported to the standard MByte/s measure.

[3] OSU Benchmark Comparison: May 11, 2006.  The numbers cited are typical of the best of 45 benchmarks reported.  The reported latency appears not to include the latency of an InfiniBand switch; thus, the actual in-system latency may be higher.  The data rates are from streaming tests, which are less demanding than PingPong tests.
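
The unit conversion described in note [2] can be sketched as follows. The function names are hypothetical, and the sketch takes MByte to mean 10^6 bytes, which is what the contrast with the Mebibyte in note [2] implies.

```python
MEBIBYTE = 2**20   # bytes in one mebibyte, the unit the benchmark reports
MEGABYTE = 10**6   # bytes in one MByte, assumed to be the decimal megabyte


def mib_per_s_to_mbyte_per_s(rate_mib_per_s: float) -> float:
    """Convert a data rate from MiB/s (as reported) to MByte/s (as tabulated)."""
    return rate_mib_per_s * MEBIBYTE / MEGABYTE


def mbyte_per_s_to_mib_per_s(rate_mbyte_per_s: float) -> float:
    """Inverse conversion, from MByte/s back to MiB/s."""
    return rate_mbyte_per_s * MEGABYTE / MEBIBYTE


if __name__ == "__main__":
    # A rate reported as 1000 MiB/s is about 4.9% larger when stated in MByte/s.
    print(mib_per_s_to_mbyte_per_s(1000.0))
```

The two units differ by a factor of 1.048576, so the choice of unit matters when comparing figures from different benchmark reports.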

28-30 June 2006