Bus Performance of Myrinet-2000/PCI-X NICs

Performance of the PCIX-series PCI-bus implementation

All of the PCIX-series NICs use the same silicon implementation for the PCI/PCI-X logic, and exhibit the same performance in PCI DMA benchmarks.

The limit of a 64-bit, 133.3MHz PCI-X bus is 1067 MB/s (8 bytes per bus clock cycle), either reading or writing. In hosts with 64-bit, 100MHz PCI-X buses, the limit is 800 MB/s. The PCIX-series NICs achieve these data rates in bursts of up to 4KB, the maximal DMA-transfer size for PCI-X, and perform all PCI-X bus protocols between bursts in a minimum of bus clock cycles.
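
As a quick check of these limits, the following minimal sketch computes the peak rate as bus width times clock frequency. The function name is illustrative only, and 133.33 MHz is used as the nominal clock that the text rounds to 133.3MHz.

```python
def pcix_peak_mb_per_s(bus_width_bits: int, clock_mhz: float) -> float:
    """Peak rate in MB/s (1 MB = 10**6 bytes): one bus-width word per clock."""
    bytes_per_clock = bus_width_bits / 8
    return bytes_per_clock * clock_mhz


# 64-bit slot at the nominal 133.33 MHz clock
print(round(pcix_peak_mb_per_s(64, 133.33)))  # 1067
# 64-bit slot at 100 MHz
print(round(pcix_peak_mb_per_s(64, 100.0)))   # 800
```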

The PCI-X slots in host computers transfer data to and from system memory, and thus can only approach the limit. For example, the host will typically delay the beginning of DMA-read transfers for ~60 PCI-X clock cycles while it starts fetching and buffering data from the host memory. The following table provides measurements of the PCI-DMA performance, as shown by the GM-2 "gm_debug" utility, of a sample of today's best cluster hosts. The test reports the average data rate for 32 chained 4KB reads to the same block, followed by 32 chained 4KB writes to the same block. Note: Small differences are not significant.

Host / OS | Bus read (send) | Bus write (recv)
AMD "Melody" dual 1.6GHz Opteron server (AMD 8131 chip set) / SuSE 8 Linux | 936 MB/s | 1032 MB/s
AMD "Quartet" quad Opteron server / SuSE 8 Linux | 870 MB/s | 989 MB/s
Apple dual 2GHz G5 / MacOS X | 882 MB/s | 1036 MB/s
HP "Marvel" (Alpha EV-7, es47) quad-Alpha server / either Linux or Tru64 | 908 MB/s | 1038 MB/s
HP rx2000 dual 900MHz Itanium2 (HP chip set) / Linux | 784 MB/s | 1044 MB/s
IBM BladeCenter HS20 Xeon blade, D-card HCA, 100MHz PCI-X / Linux | 716 MB/s | 784 MB/s
Intel quad 900MHz Itanium2 (Intel 870 chip set) / Linux | 819 MB/s | 947 MB/s
Intel dual 1.5GHz Itanium2 Madison / Linux | 874 MB/s | 946 MB/s
Intel dual 2.4GHz Xeon whitebox (Serverworks GC chipset, 400MHz FSB) / Linux | 856 MB/s | 1044 MB/s
Intel dual 1.8GHz Xeon whitebox (Intel E7500 chipset, 400MHz FSB) / Linux | 816 MB/s | 853 MB/s
Microway Navion dual 1.6GHz Opteron (AMD 8131 chip set) / United Linux kernel | 939 MB/s | 1036 MB/s
Newisys dual 1.4GHz Opteron server (AMD 8131 chip set) / SuSE 8 Linux | 929 MB/s | 1032 MB/s
Sunfire v60x dual Xeon server (Intel E7500 chip set, 100MHz PCI-X) / Linux | 675 MB/s | 782 MB/s
Supermicro X5DL8-GG dual 2.4GHz Xeon (Serverworks GC-LE chip set, 533MHz FSB) / Linux | 932 MB/s | 1044 MB/s
Supermicro X5DPE-G2 dual 2.4GHz Xeon (Intel E7501 chip set, 533MHz FSB) / Linux | 826 MB/s | 853 MB/s
Tyan Trinity single 3.06GHz Pentium-4 (Serverworks GC-SL chip set, 533MHz FSB) / Linux (performance in the 133MHz PCI-X slot) | 859 MB/s | 1040 MB/s
Tyan Trinity single 3.06GHz Pentium-4 (Serverworks GC-SL chip set, 533MHz FSB) / Linux (performance in the 100MHz PCI-X slot) | 708 MB/s | 782 MB/s
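
To see why a read-startup delay of ~60 clock cycles keeps hosts below the bus limit, here is a rough sketch. It assumes a deliberately simplified model in which every 4KB DMA read pays the full delay once and nothing else costs time. This is an illustrative assumption, not a description of how gm_debug measures or how any particular chip set behaves, and the function name and constants are hypothetical.

```python
BURST_BYTES = 4096      # maximal PCI-X DMA burst size (from the text)
BYTES_PER_CLOCK = 8     # 64-bit bus moves 8 bytes per clock
STARTUP_CLOCKS = 60     # approximate host delay before a DMA read (from the text)


def effective_read_rate(clock_mhz: float, startup_clocks: int = STARTUP_CLOCKS) -> float:
    """Estimated read rate in MB/s when each burst pays a fixed startup delay."""
    data_clocks = BURST_BYTES / BYTES_PER_CLOCK   # 512 clocks of actual transfer
    total_clocks = data_clocks + startup_clocks   # transfer plus startup delay
    # bytes per microsecond equals MB/s when 1 MB = 10**6 bytes
    return BURST_BYTES * clock_mhz / total_clocks


print(f"{effective_read_rate(133.33):.0f} MB/s at 133 MHz")
print(f"{effective_read_rate(100.0):.0f} MB/s at 100 MHz")
```

Under these assumptions the estimate is roughly 90% of the bus limit, which is broadly in line with the better read figures measured above. Real hosts differ in how much of the delay they can overlap with other transfers.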

How much bus performance do you need?

With the one-port NICs, 500 MB/s PCI-DMA performance is sufficient to achieve the maximal summed-bidirectional performance of 250+250 MB/s on the Myrinet port. For these one-port NICs, all of the hosts listed above have PCI-DMA performance to spare.

The two ports of the M3F2-PCIXE NICs have an aggregate peak throughput of 500 MB/s from the Myrinet fabric plus 500 MB/s to the fabric, a total of 1 GB/s. The user-level summed-bidirectional data rate that GM-2.1 or MX-2G achieves is 838-912 MB/s, limited by NIC firmware and memory bandwidth. With balanced bidirectional traffic (an equal number of bytes read and written), many of the hosts listed above have sufficient PCI-X bus performance to support ~900 MB/s summed-bidirectional data rates. For example, the Supermicro X5DL8-GG hosts show 932 MB/s bus read and 1044 MB/s bus write. With an equal number of bytes read and written, the peak bus performance is 2/(1/932 + 1/1044) = 985 MB/s. (Note: To understand this formula, observe that 1/932 µs is the time to read a byte, and 1/1044 µs is the time to write a byte. In balanced traffic, two bytes are transferred every (1/932 + 1/1044) µs.)
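
The balanced-traffic formula is the harmonic mean of the read and write rates, and it can be applied to any host in the table. The sketch below does so for three of them. The function and dictionary names are illustrative, and the read/write figures are copied from the measurements above.

```python
def balanced_rate(read_mb_s: float, write_mb_s: float) -> float:
    """Summed bus rate in MB/s when equal numbers of bytes are read and written."""
    # Time per byte read plus time per byte written moves two bytes in total.
    return 2.0 / (1.0 / read_mb_s + 1.0 / write_mb_s)


# (bus read, bus write) in MB/s, taken from the table of measurements
hosts = {
    "Supermicro X5DL8-GG": (932, 1044),
    "AMD Melody": (936, 1032),
    "Supermicro X5DPE-G2": (826, 853),
}

for name, (read_rate, write_rate) in hosts.items():
    print(f"{name}: {balanced_rate(read_rate, write_rate):.0f} MB/s")
```

For the Supermicro X5DL8-GG this reproduces the 985 MB/s figure given above. The result is always nearer the slower direction than a simple average would suggest.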

6 June 2006