
Sockets-MX Overview and Performance

Overview

Sockets-MX is a new middleware layer that mimics sockets semantics and replaces the traditional TCP/IP over Ethernet path to allow low-latency, high-speed data transfers. It avoids the high system load of current TCP/IP implementations by bypassing the TCP/IP protocol stack, which takes up to 50% of the time spent in communication.

It can be applied to existing, already linked applications in binary format. In popular benchmarks, a performance gain of an order of magnitude is quite common.

Features

Sockets-MX achieves binary compatibility for existing applications through different interception techniques.
These interception techniques are dependent on the operating system.
Currently, Sockets-MX is available for Linux 2.4 and 2.6 kernels.

Test cases

Several projects have been validated using Sockets-MX. To name a few:
PVM 3.4, MPICH-2, LAM MPI, Pallas PMB, Intel iSCSI, netperf, netpipe, Iperf, NTttcp

Concept

Besides standard TCP/IP, an application can switch to Sockets-MX. Sockets-MX is based entirely on MX and bypasses any protocol stack. The result is the lowest latency combined with superior connection setup time between sockets.

Sockets-MX Concept

The concept of Sockets-MX has been implemented using different approaches.

When using Sockets-MX under Linux, applications use a new AF_MYRI protocol family. Until a couple of years ago, kernel traps increased the latency significantly. Modern systems and recent Linux versions have lowered this overhead to well below 1 usec.
Sockets-MX is implemented to mimic the existing socket interfaces. This allows for high efficiency and lets optimization techniques be applied.
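
As a rough illustration of what selecting a protocol family looks like from an application's point of view, the following Python sketch tries to open a stream socket in a Myrinet family and falls back to AF_INET when the kernel does not know that family. The numeric value given to AF_MYRI and the helper name open_stream_socket are placeholders invented for this example. The real constant comes from the Sockets-MX installation, and existing binaries are normally redirected by the interception layer without any source change.

```python
import socket

# Hypothetical placeholder: the real AF_MYRI value is defined by the
# Sockets-MX installation and is not given in this document.
AF_MYRI = 1000


def open_stream_socket(preferred_family=AF_MYRI):
    """Return (sock, family), falling back to AF_INET if needed."""
    try:
        sock = socket.socket(preferred_family, socket.SOCK_STREAM)
        return sock, preferred_family
    except OSError:
        # The kernel does not support this protocol family; use plain TCP/IP.
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        return sock, socket.AF_INET


if __name__ == "__main__":
    sock, family = open_stream_socket()
    print("using Myrinet family" if family == AF_MYRI else "using AF_INET")
    sock.close()
```

Apart from the family argument, the rest of the application code (connect, send, recv) stays the same, which is the point of mimicking the socket interface.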

For this, Sockets-MX offers two communication modes: buffered communication, in which the data is copied once into pre-registered buffers, and a zero-copy protocol, in which data is exchanged directly between application buffers using Myrinet RDMA functions. The latter is known for cutting down system load. For example, a performance benchmark shows 490 MB/s for an 8 KByte payload using less than 4% CPU load.

Another advantage of Sockets-MX is that it can be tuned for specific applications: threshold values can be set dynamically to specify when the zero-copy protocol should be used. For latency-sensitive applications, Sockets-MX returns to the calling application much earlier, because only a copy of the data is made; the actual message delivery is then handled by the Myrinet NIC. Moreover, the performance boost is consistent on any given system. On operating systems that do not allow tuning of their protocol stacks, the performance increase is even higher.
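
The threshold idea can be pictured with a small sketch. The names below (choose_protocol, zero_copy_threshold) and the 4 KByte default are invented for illustration; this document does not state the actual tunables or their default values.

```python
def choose_protocol(message_size, zero_copy_threshold=4 * 1024):
    """Pick a transfer path for one message (illustrative only).

    Messages below the threshold are copied into a pre-registered buffer,
    so the call can return early; larger ones use the zero-copy RDMA path.
    The default threshold is a made-up value, not a Sockets-MX default.
    """
    if message_size < 0:
        raise ValueError("message_size must be non-negative")
    if message_size >= zero_copy_threshold:
        return "zero-copy"
    return "buffered"


for size in (64, 8 * 1024, 1024 * 1024):
    print(size, choose_protocol(size))
```

Raising the threshold favors early return for small, latency-sensitive messages, while lowering it moves more traffic to the zero-copy path and reduces CPU load.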

Compared with other TCP/IP implementations, Sockets-MX delivers higher throughput at lower CPU load. An application running under Sockets-MX can still communicate with applications that are not connected via Myrinet; in that case, conventional TCP/IP over Ethernet is used.

Performance

Depending on your configuration, a benchmark such as netperf achieves up to 3.9 Gbps with a CPU utilization of 7%.
Latency numbers based on round-trip communication are slightly higher than those of MX itself. The netperf TCP_RR and NetPIPE latency tests both show less than 5 usec for a TCP/IP socket application.
Detailed performance graphs are being updated. Some results of netperf, NetPIPE and MPI benchmarks are presented below. These results were obtained with Myrinet E cards, which offer a bidirectional bandwidth of 4+4 Gbps. The results show that Sockets-MX matches this performance.

Results using Myri-10G will be available soon.

-- LATENCY:
fischer@serenade2:~/Sockets-MX_MODULE$ ~/NetPIPE_3.6.2/NPtcp -h serenade1 -u 128
Send and receive buffers are 135168 and 135168 bytes
(A bug in Linux doubles the requested buffer sizes)
Now starting the main loop
  0:       1 bytes  19082 times -->      1.52 Mbps in       4.92 usec
  1:       2 bytes  19923 times -->      3.05 Mbps in       5.00 usec
  2:       3 bytes  19986 times -->      4.57 Mbps in       5.01 usec
  3:       4 bytes  13302 times -->      6.08 Mbps in       5.02 usec
  4:       6 bytes  14938 times -->      8.98 Mbps in       5.10 usec
  5:       8 bytes   9805 times -->     11.96 Mbps in       5.10 usec
  6:      12 bytes  12249 times -->     17.68 Mbps in       5.18 usec
  7:      13 bytes   8046 times -->     19.14 Mbps in       5.18 usec
  8:      16 bytes   8907 times -->     23.72 Mbps in       5.15 usec
  9:      19 bytes  10930 times -->     27.78 Mbps in       5.22 usec
 10:      21 bytes  12103 times -->     30.60 Mbps in       5.24 usec
 11:      24 bytes  12732 times -->     35.31 Mbps in       5.19 usec
 12:      27 bytes  13661 times -->     39.31 Mbps in       5.24 usec
 13:      29 bytes   8481 times -->     41.97 Mbps in       5.27 usec
 14:      32 bytes   9158 times -->     46.61 Mbps in       5.24 usec
 15:      35 bytes  10142 times -->     48.60 Mbps in       5.49 usec
 16:      45 bytes  10400 times -->     59.33 Mbps in       5.79 usec
 17:      48 bytes  11520 times -->     64.83 Mbps in       5.65 usec
 18:      51 bytes  12170 times -->     66.72 Mbps in       5.83 usec
 19:      61 bytes   6724 times -->     79.14 Mbps in       5.88 usec
 20:      64 bytes   8362 times -->     85.39 Mbps in       5.72 usec
 21:      67 bytes   9017 times -->     85.92 Mbps in       5.95 usec
 22:      93 bytes   9031 times -->    117.09 Mbps in       6.06 usec
 23:      96 bytes  11001 times -->    121.32 Mbps in       6.04 usec
 24:      99 bytes  11215 times -->    121.71 Mbps in       6.21 usec
 25:     125 bytes   5859 times -->    149.83 Mbps in       6.36 usec
 26:     128 bytes   7792 times -->    156.80 Mbps in       6.23 usec

-- BANDWIDTH:
fischer@atipa4:~/Sockets-MX_MODULE$ ./tests/netperf/netperf -l 1 -H atipa3 -- -m 4000 -M 4000
TCP STREAM TEST to atipa3
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 65535  65535   4000    1.00     3388.58
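
netperf reports throughput in 10^6 bits/sec, while other figures in this document are given in MB/s or as a fraction of the link rate. The short sketch below converts the result above, assuming decimal megabytes (10^6 bytes) and the 4 Gbps per-direction rate quoted for Myrinet E cards; the helper name is made up for this example.

```python
def mbits_to_mbytes(mbits_per_sec):
    """Convert 10^6 bits/s to 10^6 bytes/s."""
    return mbits_per_sec / 8.0


throughput = 3388.58  # netperf TCP STREAM result above, in 10^6 bits/s
link_rate = 4000.0    # 4 Gbps per direction (Myrinet E cards)

print(f"{mbits_to_mbytes(throughput):.2f} MB/s")
print(f"{100.0 * throughput / link_rate:.1f}% of link rate")
```

For this run with 4000-byte messages, that works out to roughly 424 MB/s, or about 85% of the link rate.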


Another important factor is the performance of accept/connect handling.
-- CONNECTION PERFORMANCE:
-- Dual Xeons, Linux 2.4, E cards

GigEth for comparison

./tests/netperf-2.2pl4/netperf -p 5000 -l 1 -t TCP_CC -H atipa1
TCP Connect/Close TEST to atipa1
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

262142 262142 1        1       0.99     4216.39
16384  87380

TCP/IP over MX

./tests/netperf-2.2pl4/netperf -p 5000 -l 1 -t TCP_CC -H 192.168.1.1
TCP Connect/Close TEST to 192.168.1.1
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

262142 262142 1        1       1.00     10119.09
16384  87380

Sockets-MX

./tests/netperf-2.2pl4/netperf -l 1 -t TCP_CC -H atipa1
TCP Connect/Close TEST to atipa1
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

262142 262142 1        1       1.00     18434.88
65535  65535

This means that the number of connections per second
is more than 4 times higher than with traditional TCP/IP over Gigabit Ethernet.
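
The ratio can be checked directly from the three TCP_CC transaction rates reported above. The snippet below is only a small arithmetic aid; the dictionary labels and the helper name are chosen for this example.

```python
# Transactions per second from the three netperf TCP_CC runs above.
rates = {
    "Gigabit Ethernet": 4216.39,
    "TCP/IP over MX": 10119.09,
    "Sockets-MX": 18434.88,
}


def speedup(rate, baseline):
    """Return how many times faster `rate` is than `baseline`."""
    return rate / baseline


baseline = rates["Gigabit Ethernet"]
for name, rate in rates.items():
    print(f"{name:18s} {rate:10.2f} conn/s  {speedup(rate, baseline):.2f}x")
```

Sockets-MX comes out at about 4.4 times the Gigabit Ethernet rate, and TCP/IP over MX at about 2.4 times.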

-- Dual Opterons:

./tests/netperf-2.2pl4/netperf -l 1 -t TCP_CC -H serenade1
TCP Connect/Close TEST to serenade1
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

1048574 1048574 1        1       1.00     23449.96
524287 135168

-- Intel Pallas Benchmark
Sockets-MX can also speed up HPC applications in binary format which use TCP/IP.
For the following test the PMB benchmark was compiled and run under LAM MPI.
The binary was pointed to the AF_MYRI protocol family and reports a zero-byte latency (including MPI overhead) of 5.33 usec.
#---------------------------------------------------
# Benchmarking PingPong
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0        10000         5.33         0.00
            1        10000         8.88         0.11
            2        10000         8.82         0.22
            4        10000         8.83         0.43
            8        10000         8.89         0.86
           16        10000         8.89         1.72
           32        10000         8.82         3.46
           64        10000         9.19         6.64
          128        10000         9.17        13.31
          256        10000        10.67        22.89
          512        10000        12.56        38.88
         1024        10000        16.09        60.69
         2048        10000        18.41       106.08
         4096        10000        24.13       161.90
         8192         5120        40.23       194.18
        16384         2560        72.42       215.75
        32768         1280       109.07       286.50
        65536          640       175.53       356.07
       131072          320       327.27       381.95
       262144          160       608.50       410.84
       524288           80      1171.41       426.84
      1048576           40      2312.47       432.44
      2097152           20      4566.38       437.98
      4194304           10      9077.20       440.66
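
The Mbytes/sec column can be reproduced from the first and third columns: it is the message size divided by the measured time, with 1 Mbyte taken as 2^20 bytes. The sketch below recomputes three rows of the table; the function name is chosen for this example, and small differences come from the rounding of the printed times.

```python
def pmb_mbytes_per_sec(nbytes, t_usec):
    """Bandwidth as printed by PMB: bytes / time, with 1 Mbyte = 2**20 bytes."""
    bytes_per_sec = nbytes / (t_usec * 1e-6)
    return bytes_per_sec / 2**20


# (bytes, t[usec], reported Mbytes/sec) taken from the table above
rows = [
    (1024, 16.09, 60.69),
    (65536, 175.53, 356.07),
    (4194304, 9077.20, 440.66),
]
for nbytes, t_usec, reported in rows:
    computed = pmb_mbytes_per_sec(nbytes, t_usec)
    print(f"{nbytes:8d} bytes: computed {computed:7.2f}, reported {reported:7.2f}")
```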

