
Ethernet Emulation (TCP/IP and UDP/IP) Performance for MX
10 March 2005
In addition to its OS-bypass features, MX also presents itself to the host operating system as an ethernet interface. This "ethernet emulation" feature of MX allows Myrinet to carry any packet traffic and protocols that can be carried on ethernet, including TCP/IP and UDP/IP.
It is helpful to understand that when using ethernet emulation over MX, traffic goes from the application through the OS kernel to the MX driver, following the same path as it would for a "real" ethernet interface; traffic does not go directly from the application to the interface, as it does when using MX in its OS-bypass mode. Thus, the TCP/IP and UDP/IP performance over MX depends primarily on the host-CPU performance and the host-OS's IP protocol stack. This performance varies widely for different hosts and operating systems. Also, unlike MX's OS-bypass mode, which exhibits a very small host-CPU overhead, TCP/IP and UDP/IP protocol processing at high data-transfer rates may use a significant fraction of the host-CPU cycles.
The MX developers have streamlined ethernet emulation over MX wherever practical. For example, the ethernet-emulation code uses the PCI-X DMA engines to offload the receive-side IP-checksum computation for TCP/IP and UDP/IP in operating systems that support it. This optimization results in less data being accessed in the host-OS kernel. MX supports 9000-Byte jumbo frames in addition to the standard 1500-Byte ethernet frames; indeed, the MTU (Maximum Transmission Unit) can be set to any value between 64 Bytes and 9000 Bytes. Larger frames result in fewer packets being sent to transfer the same amount of data. MX also uses interrupt-coalescing, which reduces host overhead by batching multiple transmitted and received packets together, thereby reducing the number of interrupts the host needs to service.
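The effect of frame size on packet count can be illustrated with a little arithmetic. The sketch below is not part of MX; it assumes plain 20-Byte IPv4 and 20-Byte TCP headers with no options (real stacks often add TCP options such as timestamps, which shrinks the payload slightly), and the function name `frames_needed` is made up for this example.

```python
import math

IP_HEADER_BYTES = 20   # assumption: IPv4 header without options
TCP_HEADER_BYTES = 20  # assumption: TCP header without options


def frames_needed(payload_bytes: int, mtu: int) -> int:
    """Return how many IP packets are needed to carry payload_bytes of TCP data."""
    per_packet = mtu - IP_HEADER_BYTES - TCP_HEADER_BYTES
    if per_packet <= 0:
        raise ValueError("MTU too small to carry any TCP payload")
    return math.ceil(payload_bytes / per_packet)


if __name__ == "__main__":
    one_mib = 1 << 20  # 1 MiB of application data
    for mtu in (1500, 9000):
        print(f"MTU {mtu}: {frames_needed(one_mib, mtu)} packets per MiB")
```

Under these assumptions, moving from a 1500-Byte to a 9000-Byte MTU cuts the packet count for the same data by roughly a factor of six, which is where the reduction in per-packet host overhead comes from.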
In the tables below, we report the ethernet-emulation (TCP/IP and UDP/IP) performance of MX-2G Beta 0.8.8 between a pair of 3.06 GHz Intel Pentium-4 hosts that use the ServerWorks Grand Champion chipset. The test machines were running Debian 3.0 and the kernel.org 2.6.11smp Linux kernel. Hyperthreading was enabled. The MX driver was configured to use a 9000-Byte MTU for ethernet emulation.
The standard netperf 2.2pl4 benchmark resulted in the following bandwidth performance for TCP and UDP. The TCP test uses 256-KByte (262144-Byte) socket buffers; the UDP test uses an 8-KByte (8192-Byte) message size.
| NIC | Protocol | Bandwidth | Sender CPU Utilization | Receiver CPU Utilization |
|---|---|---|---|---|
| PCIXE | TCP | 3946 Mb/s | 35% | 40% |
| PCIXE | UDP | 3964 Mb/s | 31% | 39% |
| PCIXD | TCP | 1977 Mb/s | 17% | 21% |
| PCIXD | UDP | 1982 Mb/s | 15% | 17% |
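As a sanity check, the UDP bandwidth figures can be recomputed from the message counts in the raw netperf output attached at the end of this page. The following is only an illustrative sketch; the helper name `udp_throughput_mbps` is made up here, and small differences from netperf's own figure are expected because the elapsed time is printed to only two decimal places.

```python
def udp_throughput_mbps(messages: int, message_bytes: int, elapsed_s: float) -> float:
    """Throughput in 10^6 bits/s from a message count, message size, and elapsed time."""
    total_bits = messages * message_bytes * 8
    return total_bits / elapsed_s / 1e6


if __name__ == "__main__":
    # Sender-side values copied from the raw UDP_STREAM output (8192-Byte messages).
    print(f"PCIXE: {udp_throughput_mbps(3629347, 8192, 60.00):.1f} Mb/s")
    print(f"PCIXD: {udp_throughput_mbps(1814716, 8192, 60.01):.1f} Mb/s")
```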
The following table shows the one-way (half-round-trip) latency for a 1-Byte message. The netperf request/response tests report a transaction rate (transactions per second), so we divide 1 second by that rate to get the full round-trip latency, then divide that by 2 to obtain the results below.
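The conversion just described is simple enough to write down. The sketch below uses a made-up helper name, `one_way_latency_us`, and takes the transaction rates from the raw TCP_RR and UDP_RR output attached at the end of this page.

```python
def one_way_latency_us(transactions_per_sec: float) -> float:
    """Convert a netperf request/response rate into one-way latency in microseconds."""
    round_trip_us = 1e6 / transactions_per_sec  # one transaction = one full round trip
    return round_trip_us / 2.0                  # half the round trip = one-way latency


if __name__ == "__main__":
    rates = {
        "PCIXE TCP": 29271.92,
        "PCIXE UDP": 31185.80,
        "PCIXD TCP": 27306.40,
        "PCIXD UDP": 28942.29,
    }
    for name, rate in rates.items():
        print(f"{name}: {one_way_latency_us(rate):.1f} us")
```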
| NIC | Protocol | One-way Latency | Sender CPU Utilization | Receiver CPU Utilization |
|---|---|---|---|---|
| PCIXE | TCP | 17 µs | 19% | 18% |
| PCIXE | UDP | 16 µs | 22% | 22% |
| PCIXD | TCP | 18 µs | 17% | 18% |
| PCIXD | UDP | 17 µs | 19% | 18% |
The "raw" netperf output for these tests is attached below.
Raw netperf output for PCIXE NICs:
    >netperf224 -Hshout-my -l 60 -c -C -- -S262144 -s262144
    TCP STREAM TEST to shout-my
    Recv   Send   Send                          Utilization       Service Demand
    Socket Socket Message  Elapsed              Send     Recv     Send    Recv
    Size   Size   Size     Time     Throughput  local    remote   local   remote
    bytes  bytes  bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
    217088 217088 217088   60.01    3945.89     34.60    39.96    1.437   1.659

    >netperf224 -Hshout-my -l 60 -c -C -tUDP_STREAM -- -m 8192
    UDP UNIDIRECTIONAL SEND TEST to shout-my
    Socket  Message  Elapsed      Messages                   CPU     Service
    Size    Size     Time         Okay Errors   Throughput   Util    Demand
    bytes   bytes    secs            #      #   10^6bits/sec % SS    us/KB
    108544  8192     60.00     3629347      0   3964.0       30.98   1.280
    108544           60.00     3629335          3964.0       39.04   1.614

    >netperf224 -Hshout-my -l 60 -c -C -tTCP_RR
    TCP REQUEST/RESPONSE TEST to shout-my
    Local /Remote
    Socket Size   Request Resp.  Elapsed Trans.    CPU    CPU    S.dem   S.dem
    Send   Recv   Size    Size   Time    Rate      local  remote local   remote
    bytes  bytes  bytes   bytes  secs.   per sec   % S    % S    us/Tr   us/Tr
    16384  87380  1       1      60.01   29271.92  19.48  17.68  13.307  12.083
    16384  87380

    >netperf224 -Hshout-my -l 60 -c -C -tUDP_RR
    UDP REQUEST/RESPONSE TEST to shout-my
    Local /Remote
    Socket Size   Request Resp.  Elapsed Trans.    CPU    CPU    S.dem   S.dem
    Send   Recv   Size    Size   Time    Rate      local  remote local   remote
    bytes  bytes  bytes   bytes  secs.   per sec   % S    % S    us/Tr   us/Tr
    108544 108544 1       1      60.01   31185.80  21.77  21.61  13.960  13.859
    108544 108544
Raw netperf output for PCIXD NICs:
    >netperf224 -Hshout-my -l 60 -c -C -- -S262144 -s262144
    TCP STREAM TEST to shout-my
    Recv   Send   Send                          Utilization       Service Demand
    Socket Socket Message  Elapsed              Send     Recv     Send    Recv
    Size   Size   Size     Time     Throughput  local    remote   local   remote
    bytes  bytes  bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB
    217088 217088 217088   60.01    1976.95     17.35    20.87    1.438   1.729

    >netperf224 -Hshout-my -l 60 -c -C -tUDP_STREAM -- -m 8192
    UDP UNIDIRECTIONAL SEND TEST to shout-my
    Socket  Message  Elapsed      Messages                   CPU     Service
    Size    Size     Time         Okay Errors   Throughput   Util    Demand
    bytes   bytes    secs            #      #   10^6bits/sec % SS    us/KB
    108544  8192     60.01     1814716      0   1982.0       15.07   1.245
    108544           60.01     1814702          1982.0       17.35   1.434

    >netperf224 -Hshout-my -l 60 -c -C -tTCP_RR
    TCP REQUEST/RESPONSE TEST to shout-my
    Local /Remote
    Socket Size   Request Resp.  Elapsed Trans.    CPU    CPU    S.dem   S.dem
    Send   Recv   Size    Size   Time    Rate      local  remote local   remote
    bytes  bytes  bytes   bytes  secs.   per sec   % S    % S    us/Tr   us/Tr
    16384  87380  1       1      60.01   27306.40  17.18  17.55  12.580  12.854
    16384  87380

    >netperf224 -Hshout-my -l 60 -c -C -tUDP_RR
    UDP REQUEST/RESPONSE TEST to shout-my
    Local /Remote
    Socket Size   Request Resp.  Elapsed Trans.    CPU    CPU    S.dem   S.dem
    Send   Recv   Size    Size   Time    Rate      local  remote local   remote
    bytes  bytes  bytes   bytes  secs.   per sec   % S    % S    us/Tr   us/Tr
    108544 108544 1       1      60.01   28942.29  18.88  18.33  13.049  12.669
    108544 108544
02 June 2006