*************************************************************************
*                                                                       *
*                               MPICH2-MX                               *
*                                                                       *
*           MPICH2 over Myrinet Express (ch_mx) documentation           *
*                                                                       *
*           Copyright (C) 2007 Myricom, Inc.                            *
*           Author: Myricom, Inc.                                       *
*                                                                       *
*************************************************************************

README of MPICH2-MX

MPICH2-MX provides support for Myricom's Myrinet Express (MX)
communication layer. MPICH2-MX may be used with either MX-10G or MX-2G.
See MX's README for supported NICs.

Table of Contents:

I.   Installation
     1. Configuring and compiling
     2. Runtime tunables
        2.1 Registration cache
        2.2 Error handling
        2.3 Send cancellation
        2.4 Recv mode
        2.5 Unexpected queue length
        2.6 Message checksums
        2.7 Using multiple NICs
II.  MPICH2-MX Performance
III. Caveats
     1. Send cancellation
IV.  License
V.   Support

===============
I. Installation
===============

MPICH2-MX requires Myricom's MX version 1.2.1 or higher. See MX's README
for the supported list of platforms.

1. Configuring and compiling

MX has been fully integrated into the MPICH2 build process. To build
MPICH2-MX, run:

    $ ./configure --with-device=ch_mx --with-mx=/opt/mx

replacing /opt/mx with the actual path to MX. Then run:

    $ make
    $ make install

You can override the MX include directory and/or the MX library
directory with:

    --with-mx-include=/path/to/mx/include
    --with-mx-lib=/path/to/mx/lib

If you want to build shared libraries, run:

    $ ./configure --help

to see the available options, or read the MPICH2 manual.

If you are compiling with Portland Group compilers, you will also need
to set:

    $ export CFLAGS=-c9x

2. Runtime tunables

You can change some MPICH2-MX behaviors by setting environment
variables. Some of these affect MX directly and others only affect
MPICH2.

2.1 Registration cache

MX has an internal memory registration cache (regcache) that can improve
repetitive communication of large messages. By default, MX will try to
use the regcache. Previously, the regcache was not the default and was
enabled with MX_RCACHE=1.
In more recent MX versions on Linux, the regcache is enabled by default.
The MX regcache will not work in applications that override memory
functions such as malloc(). You can disable the regcache with:

    $ export MX_RCACHE=0

2.2 Error handling

By default, MX will abort if an error occurs. This is useful for
catching errors, but the abort can be disabled if the upper layers of
software expect errors and can handle them correctly. MPICH2, in
general, can tolerate some errors. The ch_mx device can handle some
errors and will abort for others. You can safely change the behavior to
not abort on MX errors by setting:

    $ export MX_ERRORS_ARE_FATAL=0

This setting is necessary to pass the errhan tests in the MPICH2 test
suite.

2.3 Send cancellation

In MPI, whether an implementation will cancel a send is optional. By
default, MPICH2-MX will not cancel sends. You can enable this feature by
setting:

    $ export MX_ENABLE_CANCEL_SEND=1

This setting is necessary to pass the *scancel tests among the pt2pt
tests in the MPICH2 test suite. It will also switch error handling to
return rather than abort.

2.4 Recv mode

By default, MPICH2-MX will use polling for blocking receives. You can
change this behavior to a blocking mode or a mixed mode (some polling,
then blocking) by setting:

    $ export MX_RECV_POLLING=N

where N is -1, 0, or a positive integer. The value -1 indicates polling,
the value 0 indicates blocking, and a positive integer value will poll
that many times before blocking. Changing the behavior to blocking will
lower CPU usage but increase latency. You will need to test various
values to determine which is best for your application.

2.5 Unexpected queue length

By default, MX will buffer up to 4 MB of unexpected messages before
starting to drop unexpected messages (the sender will automatically try
to retransmit). You can alter this amount by setting:

    $ export MX_UNEX_Q_LENGTH=N

where N is the number of bytes to buffer.

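The tunables in sections 2.2 through 2.5 are usually set together in a
job script before the application is launched. The following is a
minimal sketch and is not part of MPICH2-MX: the helper functions
set_recv_polling and set_unex_q_mb are hypothetical names, the values
1000 and 8 are arbitrary, and the size conversion assumes that 1 MB is
1024 * 1024 bytes.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with MPICH2-MX): validate a receive-mode
# value before exporting it. Accepts -1 (poll), 0 (block), or a string of
# digits (poll that many times, then block).
set_recv_polling() {
    case "$1" in
        -1|0) ;;
        ''|*[!0-9]*) echo "invalid MX_RECV_POLLING value: $1" >&2; return 1 ;;
    esac
    MX_RECV_POLLING=$1
    export MX_RECV_POLLING
}

# Hypothetical helper: convert a size in MB to the byte count that
# MX_UNEX_Q_LENGTH expects (assumes 1 MB = 1024 * 1024 bytes).
set_unex_q_mb() {
    case "$1" in
        ''|*[!0-9]*) echo "invalid size in MB: $1" >&2; return 1 ;;
    esac
    MX_UNEX_Q_LENGTH=$(( $1 * 1024 * 1024 ))
    export MX_UNEX_Q_LENGTH
}

set_recv_polling 1000      # mixed mode: poll 1000 times, then block
set_unex_q_mb 8            # buffer up to 8 MB of unexpected messages
MX_ERRORS_ARE_FATAL=0      # return MX errors instead of aborting
export MX_ERRORS_ARE_FATAL
```

Source this from the script that launches the job. Whether the exported
variables reach remote processes depends on your process launcher, so
check that before comparing different MX_RECV_POLLING values.
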
2.6 Message checksums

Starting with MX 1.2.4, you can checksum all messages sent using MX in
MPICH2-MX (or any other MX application). Using checksums will lower
performance, so use them only for debugging. To enable checksums, use:

    $ export MX_CSUM=1

2.7 Using multiple NICs

By default, MPICH2-MX will only use the first Myricom NIC in each host.
You can override this behavior by using a machinefile that specifies the
interface hostname (ifhn=hostname:board_index) for each NIC. For example,
if your machines (e.g. compute[0-7]) have two NICs each and you want to
run a maximum of 8 processes per machine, you can specify a machinefile
like:

    compute0:4 ifhn=compute0:0
    compute0:4 ifhn=compute0:1
    compute1:4 ifhn=compute1:0
    compute1:4 ifhn=compute1:1
    ...
    compute7:4 ifhn=compute7:0
    compute7:4 ifhn=compute7:1

=========================
II. MPICH2-MX Performance
=========================

On MX-2G systems, MPICH2-MX should easily saturate the link and use
minimal CPU.

On MX-10G systems, MPICH2-MX can saturate the link and use moderate CPU
resources. MX-10G relies on PCI-Express, which is relatively new, and
performance varies considerably by processor, motherboard and PCI-E
chipset. Refer to Myricom's website for the latest DMA read/write
performance results by motherboard. The DMA results place an upper bound
on MPICH2-MX performance.

============
III. Caveats
============

1. Send cancellation

If the sender sends two identical messages (same receiver, same MPI tag,
same communicator), the receiver has not posted a recv for either
message, and the sender wants to cancel the second message, MPICH2-MX
will instead cancel the first message (because it will be matched
first). This will lead to undefined behavior.

===========
IV. License
===========

In addition to the standard MPICH2 license found in the COPYRIGHT file,
Myricom adds the following for ch_mx:

Copyright (c) 2007, Myricom, Inc.
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:

    * Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in
      the documentation and/or other materials provided with the
      distribution.
    * Neither the name of the Myricom nor the names of its contributors
      may be used to endorse or promote products derived from this
      software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY Myricom, Inc. ``AS IS'' AND ANY EXPRESS OR
IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL Myricom, Inc. BE LIABLE FOR ANY DIRECT,
INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.

==========
V. Support
==========

If you have questions about MPICH2-MX, please contact help@myri.com.