Sockets-GM Overview and Performance
Sockets-GM is a new middleware layer that mimics sockets semantics and replaces the traditional TCP/IP over Ethernet protocol, allowing low-latency, high-speed data transfers. Current TCP/IP implementations impose a high system load. Sockets-GM avoids this by bypassing the TCP/IP protocol stack, which can take up to 50% of the time spent in communication.
Sockets-GM achieves binary compatibility for existing applications through interception techniques that depend on the operating system. Currently, Sockets-GM is available for all Linux and Solaris versions, and also for Windows 2000 Professional and Server as well as .NET.
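To make "existing applications" concrete, the following is a minimal sketch of an ordinary sockets program: a local echo round trip that uses only the standard sockets calls. Nothing in it is specific to Sockets-GM, and the helper names `echo_once` and `round_trip` are invented for this illustration. How a given runtime's socket calls would be intercepted is not described here and is not assumed.

```python
import socket
import threading


def echo_once(server: socket.socket) -> None:
    """Accept one connection and echo back whatever arrives."""
    conn, _ = server.accept()
    with conn:
        while True:
            data = conn.recv(4096)
            if not data:
                break
            conn.sendall(data)


def round_trip(payload: bytes) -> bytes:
    """Send a small payload to a local echo server and return the reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        server.listen(1)
        port = server.getsockname()[1]
        worker = threading.Thread(target=echo_once, args=(server,))
        worker.start()
        with socket.create_connection(("127.0.0.1", port)) as client:
            client.sendall(payload)          # intended for small payloads only
            client.shutdown(socket.SHUT_WR)  # signal end of data to the server
            chunks = []
            while True:
                chunk = client.recv(4096)
                if not chunk:
                    break
                chunks.append(chunk)
        worker.join()
    return b"".join(chunks)


if __name__ == "__main__":
    print(round_trip(b"hello over plain sockets"))
```

The point of the sketch is that the application only sees `connect`, `send`, `recv` and friends. A sockets-compatible layer has to preserve exactly this interface.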
Depending on the installation, Sockets-GM can co-exist with the traditional TCP/IP protocol. In that case, Sockets-GM creates a companion socket, which is used once a connection between two endpoints has been established.
For Windows, Sockets-GM has been implemented as a Layered Service Provider and as a System Area Network proxy for Winsock Direct.
When using Sockets-GM, applications operate entirely in user mode. Costly traps into the kernel are avoided, which reduces latency significantly. Sockets-GM is implemented to mimic the given socket interfaces. This allows for high efficiency and lets optimization techniques be applied.
For this, Sockets-GM offers two different communication concepts: buffered communication, in which the data is copied once into pre-registered buffers, and a zero-copy protocol, in which data is exchanged directly between application buffers using GM RDMA functions. The latter is known for cutting down system load: approximately 200 MBytes/s are exchanged with a CPU load of 3%.
Another advantage of Sockets-GM is that it can be tuned for specific applications: threshold values can be set dynamically to specify when the zero-copy protocol should be used. For latency-sensitive applications, Sockets-GM returns to the calling application much earlier, because only a copy of the data is made. The actual message delivery is then handled by the Myrinet NIC. Moreover, the performance boost is consistent on any given system, and it is even larger on operating systems that do not allow tuning of their protocol stacks.
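The threshold-based choice between the buffered and the zero-copy path can be pictured with a small sketch. This is an illustration only, not the Sockets-GM implementation: the function name `choose_protocol`, the constant `ZERO_COPY_THRESHOLD` and its 16 KB value are all assumptions made up for the example.

```python
# Illustrative only: names and the threshold value are assumptions,
# not part of the Sockets-GM interface.
ZERO_COPY_THRESHOLD = 16 * 1024  # bytes; hypothetical tunable value


def choose_protocol(message_size: int, threshold: int = ZERO_COPY_THRESHOLD) -> str:
    """Pick a transfer path for a message of the given size."""
    if message_size < 0:
        raise ValueError("message_size must be non-negative")
    # Small messages: copy once into a pre-registered buffer and return early.
    if message_size < threshold:
        return "buffered"
    # Large messages: transfer directly between application buffers.
    return "zero-copy"


if __name__ == "__main__":
    for size in (64, 4096, 16 * 1024, 1024 * 1024):
        print(size, choose_protocol(size))
```

Lowering the threshold sends more traffic down the zero-copy path and reduces CPU load. Raising it favours the buffered path, where the send call can return as soon as the copy is made.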
Compared with TCP/IP implementations, Sockets-GM allows higher throughput while requiring less CPU load. An application running under Sockets-GM can still communicate with applications that are not connected via Myrinet. In this case, the conventional TCP/IP over Ethernet protocol is used.
fischer@atipa4:~/Sockets-GM_MODULE$ ./tests/netperf/netperf -l 1 -H atipa3 -- -m 4000 -M 4000
TCP STREAM TEST to atipa3
Recv   Send   Send
Socket Socket Message Elapsed
Size   Size   Size    Time    Throughput
bytes  bytes  bytes   secs.   10^6bits/sec
65535  65535  4000    1.00    3388.58
fischer@atipa4:~/Sockets-GM_MODULE$ ./tests/netperf/netperf -l 1 -H atipa4 -- -m 8100 -M 8100
TCP STREAM TEST to atipa4
Recv   Send   Send
Socket Socket Message Elapsed
Size   Size   Size    Time    Throughput
bytes  bytes  bytes   secs.   10^6bits/sec
65535  65535  8100    1.00    3829.28
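netperf reports throughput in units of 10^6 bits per second. To compare the two runs above with figures quoted in bytes per second, divide by 8. The helper name `mbits_to_mbytes` below is chosen for this example only.

```python
def mbits_to_mbytes(throughput_mbits: float) -> float:
    """Convert 10^6 bits/sec (netperf's unit) into 10^6 bytes/sec."""
    return throughput_mbits / 8.0


if __name__ == "__main__":
    runs = (
        ("4000-byte messages", 3388.58),
        ("8100-byte messages", 3829.28),
    )
    for label, mbits in runs:
        print(f"{label}: {mbits_to_mbytes(mbits):.2f} x 10^6 bytes/sec")
```

The two runs therefore correspond to roughly 424 and 479 million bytes per second.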
fischer@atipa3:~/Sockets-GM-1.7/Sockets-GM_MODULE$ ~/mpich2-0.97/bin/mpirun -hf ~/hostfile -np 2 ~/PMB2.2/SRC/PMB-MPI1
#---------------------------------------------------
#---------------------------------------------------
# Date : Sat Aug 14 09:02:59 2004
# Machine : i686
# System : Linux
# Release : 2.4.25generic
# Version : #3 SMP Wed Mar 17 15:41:21 PST 2004
#
# Minimum message length in bytes: 0
# Maximum message length in bytes: 4194304
#
# MPI_Datatype : MPI_BYTE
# MPI_Datatype for reductions : MPI_FLOAT
# MPI_Op : MPI_SUM
#
#
# List of Benchmarks to run:
# PingPong
# PingPing
# Sendrecv
# Exchange
# Allreduce
# Reduce
# Reduce_scatter
# Allgather
# Allgatherv
# Alltoall
# Bcast
# Barrier
#---------------------------------------------------
# Benchmarking PingPong
# ( #processes = 2 )
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0           10        19.50         0.00
            1           10        25.89         0.04
            2           10        25.65         0.07
            4           10        26.25         0.15
            8           10        26.66         0.29
           16           10        26.39         0.58
           32           10        27.41         1.11
           64           10        26.89         2.27
          128           10        27.00         4.52
          256           10        28.15         8.67
          512           10        32.20        15.16
         1024           10        37.15        26.29
         2048           10        45.99        42.47
         4096           10        55.40        70.51
         8192           10        76.64       101.94
        16384           10       108.70       143.75
        32768           10       159.50       195.92
        65536           10       254.45       245.63
       131072           10       522.35       239.30
       262144           10      1023.45       244.27
       524288           10      1982.75       252.18
      1048576           10      3857.29       259.25
      2097152           10      7708.90       259.44
      4194304           10     15271.85       261.92
The full Pallas PMB run between two Xeon 2.4 nodes equipped with Myrinet 2XP cards can be found here.
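As a cross-check on the PingPong table, the Mbytes/sec column can be recomputed from the message size and the t[usec] column. The printed figures are consistent with a megabyte of 2^20 bytes. The function name `pingpong_bandwidth` is chosen for this sketch only, and the three sample rows are copied from the output above.

```python
MEBIBYTE = 2 ** 20  # the table's Mbytes/sec values match 2^20 bytes per Mbyte


def pingpong_bandwidth(message_bytes: int, t_usec: float) -> float:
    """Bandwidth in Mbytes/sec from message size and time per message."""
    return message_bytes / (t_usec * 1e-6) / MEBIBYTE


if __name__ == "__main__":
    # (bytes, t[usec], Mbytes/sec as printed in the table)
    samples = (
        (1024, 37.15, 26.29),
        (65536, 254.45, 245.63),
        (4194304, 15271.85, 261.92),
    )
    for size, t_usec, reported in samples:
        computed = pingpong_bandwidth(size, t_usec)
        print(f"{size:>8} bytes: computed {computed:.2f}, reported {reported:.2f}")
```

The 0-byte row gives the latency floor of this run (19.50 usec), while the largest messages show the sustained bandwidth (about 262 Mbytes/sec).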