Sockets-MX Overview and Performance
Sockets-MX is a new middleware layer that mimics socket semantics and replaces the traditional Ethernet protocol to allow for low-latency, high-speed data transfers. It avoids the high system load of current TCP/IP implementations by bypassing the TCP/IP protocol stack, which takes up to 50% of the time spent in communication.
It can be applied to existing, already linked applications in binary format. In popular benchmarks, a gain of an order of magnitude in performance is quite common. Sockets-MX achieves binary compatibility for existing applications through different interception techniques, which depend on the operating system. Currently, Sockets-MX is available for Linux 2.4 and 2.6.
Instead of standard TCP/IP, applications can switch to Sockets-MX. Sockets-MX is based entirely on MX and bypasses the protocol stack. The result is very low latency combined with short connection setup times between sockets.
When using Sockets-MX under Linux, applications use a new AF_MYRI protocol family. This path involves kernel traps, which until a couple of years ago increased latency significantly. Modern systems and recent Linux versions have lowered this overhead to well under 1 usec.
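As an illustration of what requesting a separate protocol family looks like from the application side, here is a minimal sketch in Python. It is not taken from Sockets-MX itself. The numeric value of AF_MYRI is not given in this document, so the constant below is a deliberately invalid placeholder, and the application-level fallback to AF_INET is an assumption made only so that the sketch runs on machines without Sockets-MX.

```python
import socket

# ASSUMPTION: placeholder only. The real AF_MYRI value must come from the
# Sockets-MX headers; 9999 is deliberately invalid so the fallback path runs.
AF_MYRI_PLACEHOLDER = 9999


def open_stream_socket(preferred_family=AF_MYRI_PLACEHOLDER):
    """Create a stream socket in the preferred family, else fall back to AF_INET."""
    try:
        return socket.socket(preferred_family, socket.SOCK_STREAM)
    except OSError:
        # The kernel does not know this address family: use ordinary TCP/IP.
        return socket.socket(socket.AF_INET, socket.SOCK_STREAM)


if __name__ == "__main__":
    s = open_stream_socket()
    print("address family in use:", s.family)
    s.close()
```

Existing binaries do not need code like this, since Sockets-MX intercepts their socket calls; the sketch only shows where the protocol family enters the standard socket API.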
Sockets-MX is implemented to mimic the existing socket interfaces, which allows for high efficiency and leaves room for optimization techniques.
For this, Sockets-MX offers two different communication concepts: buffered communication, in which the data is copied once into pre-registered buffers, and a zero-copy protocol, in which data is exchanged directly between application buffers using Myrinet RDMA functions. The latter is known for cutting down system load. As a matter of fact, a performance benchmark shows 490 MB/s for an 8 KByte payload using less than 4% CPU load.
Another advantage of Sockets-MX is that it can be tuned for specific applications: threshold values can be set dynamically to specify when the zero-copy protocol should be used. For latency-sensitive applications, the buffered protocol lets Sockets-MX return to the calling application much earlier, because only a copy of the data is made; the actual message delivery is then handled by the Myrinet NIC. Moreover, the performance boost is consistent on any given system. On operating systems that do not allow their protocol stacks to be tuned, the performance increase is even higher.
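The threshold-based choice between the two protocols can be sketched as follows. This is only an illustration: the function name, the default threshold, and the rule that messages at or above the threshold take the zero-copy path are assumptions, not values or interfaces documented for Sockets-MX.

```python
# ASSUMPTION: hypothetical default; the document gives no actual threshold value.
ZERO_COPY_THRESHOLD = 32 * 1024  # bytes


def choose_protocol(message_size, threshold=ZERO_COPY_THRESHOLD):
    """Pick 'buffered' below the threshold and 'zero-copy' at or above it."""
    if message_size < 0:
        raise ValueError("message_size must be non-negative")
    return "zero-copy" if message_size >= threshold else "buffered"


if __name__ == "__main__":
    for size in (64, 8 * 1024, 1024 * 1024):
        print(size, "->", choose_protocol(size))
```

A latency-sensitive application would raise the threshold so that more of its messages are copied and the call returns early; a throughput-oriented one would lower it to save CPU load on large transfers.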
Compared with TCP/IP implementations, Sockets-MX allows for higher throughput while requiring less CPU load. An application running under Sockets-MX can also communicate with applications that are not connected via Myrinet. In this case, the conventional TCP/IP over Ethernet protocol is used.
Results using Myri-10G will be available soon.
fischer@serenade2:~/Sockets-MX_MODULE$ ~/NetPIPE_3.6.2/NPtcp -h serenade1 -u 128
Send and receive buffers are 135168 and 135168 bytes
(A bug in Linux doubles the requested buffer sizes)
Now starting the main loop
  0:     1 bytes  19082 times -->    1.52 Mbps in  4.92 usec
  1:     2 bytes  19923 times -->    3.05 Mbps in  5.00 usec
  2:     3 bytes  19986 times -->    4.57 Mbps in  5.01 usec
  3:     4 bytes  13302 times -->    6.08 Mbps in  5.02 usec
  4:     6 bytes  14938 times -->    8.98 Mbps in  5.10 usec
  5:     8 bytes   9805 times -->   11.96 Mbps in  5.10 usec
  6:    12 bytes  12249 times -->   17.68 Mbps in  5.18 usec
  7:    13 bytes   8046 times -->   19.14 Mbps in  5.18 usec
  8:    16 bytes   8907 times -->   23.72 Mbps in  5.15 usec
  9:    19 bytes  10930 times -->   27.78 Mbps in  5.22 usec
 10:    21 bytes  12103 times -->   30.60 Mbps in  5.24 usec
 11:    24 bytes  12732 times -->   35.31 Mbps in  5.19 usec
 12:    27 bytes  13661 times -->   39.31 Mbps in  5.24 usec
 13:    29 bytes   8481 times -->   41.97 Mbps in  5.27 usec
 14:    32 bytes   9158 times -->   46.61 Mbps in  5.24 usec
 15:    35 bytes  10142 times -->   48.60 Mbps in  5.49 usec
 16:    45 bytes  10400 times -->   59.33 Mbps in  5.79 usec
 17:    48 bytes  11520 times -->   64.83 Mbps in  5.65 usec
 18:    51 bytes  12170 times -->   66.72 Mbps in  5.83 usec
 19:    61 bytes   6724 times -->   79.14 Mbps in  5.88 usec
 20:    64 bytes   8362 times -->   85.39 Mbps in  5.72 usec
 21:    67 bytes   9017 times -->   85.92 Mbps in  5.95 usec
 22:    93 bytes   9031 times -->  117.09 Mbps in  6.06 usec
 23:    96 bytes  11001 times -->  121.32 Mbps in  6.04 usec
 24:    99 bytes  11215 times -->  121.71 Mbps in  6.21 usec
 25:   125 bytes   5859 times -->  149.83 Mbps in  6.36 usec
 26:   128 bytes   7792 times -->  156.80 Mbps in  6.23 usec
fischer@atipa4:~/Sockets-MX_MODULE$ ./tests/netperf/netperf -l 1 -H atipa3 -- -m 4000 -M 4000
TCP STREAM TEST to atipa3
Recv    Send    Send
Socket  Socket  Message  Elapsed
Size    Size    Size     Time     Throughput
bytes   bytes   bytes    secs.    10^6bits/sec

65535   65535   4000     1.00     3388.58

Another important factor is the performance of accept/connect handling.
-- Dual Xeons, Linux 2.4, E cards

GigEth for comparison
./tests/netperf-2.2pl4/netperf -p 5000 -l 1 -t TCP_CC -H atipa1
TCP Connect/Close TEST to atipa1
Local /Remote
Socket Size     Request  Resp.   Elapsed  Trans.
Send    Recv    Size     Size    Time     Rate
bytes   Bytes   bytes    bytes   secs.    per sec

262142  262142  1        1       0.99     4216.39
16384   87380

TCP/IP over MX
./tests/netperf-2.2pl4/netperf -p 5000 -l 1 -t TCP_CC -H 192.168.1.1
TCP Connect/Close TEST to 192.168.1.1
Local /Remote
Socket Size     Request  Resp.   Elapsed  Trans.
Send    Recv    Size     Size    Time     Rate
bytes   Bytes   bytes    bytes   secs.    per sec

262142  262142  1        1       1.00     10119.09
16384   87380

Sockets-MX
./tests/netperf-2.2pl4/netperf -l 1 -t TCP_CC -H atipa1
TCP Connect/Close TEST to atipa1
Local /Remote
Socket Size     Request  Resp.   Elapsed  Trans.
Send    Recv    Size     Size    Time     Rate
bytes   Bytes   bytes    bytes   secs.    per sec

262142  262142  1        1       1.00     18434.88
65535   65535

This means that the number of connections per second is more than 4 times higher than with traditional TCP/IP.

-- Dual Opterons:
./tests/netperf-2.2pl4/netperf -l 1 -t TCP_CC -H serenade1
TCP Connect/Close TEST to serenade1
Local /Remote
Socket Size     Request  Resp.   Elapsed  Trans.
Send    Recv    Size     Size    Time     Rate
bytes   Bytes   bytes    bytes   secs.    per sec

1048574 1048574 1        1       1.00     23449.96
524287  135168
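The "more than 4 times" figure can be checked directly from the Dual Xeon connect/close rates above. The short sketch below does that arithmetic; the dictionary keys and the function name are labels chosen here for illustration, and the rates are copied from the netperf TCP_CC output.

```python
# Transaction rates (connections per second) from the Dual Xeon TCP_CC runs.
RATES = {
    "GigEth": 4216.39,
    "TCP/IP over MX": 10119.09,
    "Sockets-MX": 18434.88,
}


def speedups(rates, baseline="GigEth"):
    """Return each transport's connection rate relative to the baseline."""
    base = rates[baseline]
    return {name: rate / base for name, rate in rates.items()}


if __name__ == "__main__":
    for name, factor in speedups(RATES).items():
        print(f"{name:>15}: {factor:.2f}x")
```

With these numbers, Sockets-MX comes out at roughly 4.4 times the Gigabit Ethernet rate, and TCP/IP over MX at roughly 2.4 times.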
#---------------------------------------------------
# Benchmarking PingPong
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
0 10000 5.33 0.00
1 10000 8.88 0.11
2 10000 8.82 0.22
4 10000 8.83 0.43
8 10000 8.89 0.86
16 10000 8.89 1.72
32 10000 8.82 3.46
64 10000 9.19 6.64
128 10000 9.17 13.31
256 10000 10.67 22.89
512 10000 12.56 38.88
1024 10000 16.09 60.69
2048 10000 18.41 106.08
4096 10000 24.13 161.90
8192 5120 40.23 194.18
16384 2560 72.42 215.75
32768 1280 109.07 286.50
65536 640 175.53 356.07
131072 320 327.27 381.95
262144 160 608.50 410.84
524288 80 1171.41 426.84
1048576 40 2312.47 432.44
2097152 20 4566.38 437.98
4194304 10 9077.20 440.66
Full Pallas
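The Mbytes/sec column of the PingPong table can be reproduced from the message size and the time per message. The sketch below assumes that "Mbytes" means 2^20 bytes, which is not stated in the output but appears to match the reported figures; the three rows are copied from the table and the function name is chosen here for illustration.

```python
# (bytes, t[usec], reported Mbytes/sec) rows copied from the PingPong table.
ROWS = [
    (1024, 16.09, 60.69),
    (65536, 175.53, 356.07),
    (4194304, 9077.20, 440.66),
]


def mbytes_per_sec(nbytes, t_usec):
    """Bandwidth in units of 2**20 bytes per second (assumed meaning of 'Mbytes')."""
    return nbytes / (t_usec * 1e-6) / 2**20


if __name__ == "__main__":
    for nbytes, t_usec, reported in ROWS:
        computed = mbytes_per_sec(nbytes, t_usec)
        print(f"{nbytes:>8} bytes: computed {computed:7.2f}, reported {reported:7.2f}")
```

Small differences in the last digit are expected, because the times in the table are themselves rounded to two decimals.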