Portable MPI Model Implementation over MX
Version mpich-1.2.7..1, November, 2005

README-mpich-MX
===============

MPICH is a portable implementation of MPI, developed by Argonne National
Laboratory. MPICH is designed to be highly portable and is currently used
by a large number of providers of MPI implementations.

  http://www.mcs.anl.gov/mpi/mpich/

MPICH-MX is a port of MPICH on top of MX (ch_mx) and is supported by
Myricom. This release is based on MPICH 1.2.7 from ANL.

See the list of changes for this release in
mpich-mx-1.2.7..1/mpid/ch_mx/CHANGES

* Binaries compiled with versions <= 1.2.6..0.92c are incompatible with
  the current mpirun.ch_mx script and must use the mpirun.ch_mx_compat
  script.

************************************************************************
For updates to this software, visit `http://www.myri.com/scs'.
The FAQ file is located at `http://www.myri.com/scs/FAQ/'.
All Myrinet hardware and software questions should be directed to
help@myri.com.
************************************************************************

Table of Contents:

  I.   Installation of MPICH-MX (compilation and usage) (READ THIS!!)
  II.  Spawning MPI processes
  III. Run-Time options
  IV.  Running TOTALVIEW
  V.   Running GDB
  VI.  Passing Environment Variables

I. Installation of MPICH-MX (compilation and usage)
===================================================

The installation of MPICH-MX involves the following 3 steps:

  1. Configure/make/install MPICH-MX
  2. Modify the machine file
  3. Run a program

STEP 1: Configure/make/install MPICH-MX
---------------------------------------

MPICH-MX uses the same configure script as the generic MPICH. To assist
with this process, we have provided template scripts which set the
required environment variables, run configure with the appropriate
options for the respective compiler, run make and install the binaries.
All output is logged to config-mine.log, make-mine.log and
install-mine.log.
The currently available scripts are named:

  mpich.make.linux   -- Linux
  mpich.make.macosx  -- Mac OS X

We encourage the user to edit one of these scripts as needed and then
execute the script. For example, for Linux:

  ./mpich.make.linux

The site-specific information that should be modified in this script is:

  CC:         the C compiler
  CXX:        the C++ compiler
  F77:        the Fortran compiler
  F90:        the Fortran 90 compiler
  MX_HOME:    the path to the MX install tree.
  PREFIX:     the location where the MPICH-MX binary tree will be
              installed.
  RSHCOMMAND: the program (rsh/ssh) used to spawn processes when not
              using MPD.

Configuring on other operating systems/architectures:
=====================================================

Linux
-----

Shared libraries are supported by including the --enable-sharedlib option
on the configure command line. Be aware that with some compilers other
than gcc, for instance the PGI pgcc compiler, using shared libraries
creates a runtime dependency on some PGI shared libraries. The same PGI
runtime must therefore be present on the nodes where the applications
will be executed.

Mac OS X
--------

You must use GNU make. Refer to the file mpich.make.macosx in this
distribution for an example configure script.

Errors:
=======

The MPICH make process generates a lot of output. If the make fails in
one directory, it skips that directory and continues with the rest of the
make. As a result, the failure only shows up later, when building an
application fails with 'undefined reference to' errors. In this case,
inspect the 3 log files (config-mine.log, make-mine.log and
install-mine.log) for any errors.

STEP 2: Modify the machine file
-------------------------------

The machine file is located in ${PREFIX}/share/machines.ch_mx.ARCH, where
PREFIX specifies the binary installation directory, and ARCH = LINUX,
etc. This machine file specifies the hosts on which the MPI application
will run.
This file only needs to be accessible on the machine where you invoke
mpirun.ch_mx, as it is read only by this script. Comments (lines that
begin with '#') and blank lines are allowed. An example machine file is
given below:

  # the list of nodes that make the MPI World
  node1.myri.com:4
  node2.myri.com 0
  node2.myri.com 1
  node2.myri.com 0
  node2.myri.com 1
  node3.myri.com
  node4.myri.com
  # end of machine file

MPICH-MX automatically allocates the MX endpoints on the nodes of the MPI
job, load-balancing ports on multiple Myrinet interfaces if available.
This machine file is completely compatible with the generic MPICH machine
file.

If you have multiple boards in a single machine, you can add the board
number to the end of the line to restrict the MX port allocation to that
NIC. A line without a board number leaves MPICH-MX free to allocate MX
ports on any available board in the host. The dynamic allocation strategy
minimizes the number of MX ports opened per board. In the example above,
node2 has two boards and will allocate two processes on board 0 and two
processes on board 1.

To set up the machine file for SMP use, you can simply list the SMP
machine N times (once for each processor), or add the number of
processors next to the hostname (separated by a colon). The example above
uses two SMP machines (node1 and node2), each of which has 4 processors,
and two uni-processor machines (node3 and node4).

If more processes are requested by -np than there are machines listed in
the machine file, the processes are spawned cyclically in the order
specified in the list until the desired number of processes has been
spawned.

The machine names in the machine file are the hostnames used to rsh/ssh
to the remote nodes. They can be IP addresses, and they do not need to be
identical to the host names in the MX network table.

This machine file is system-wide, so all MPI applications will use it
unless overridden by the user with the run-time option "-machinefile".
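
To illustrate the machine file rules described above, the following
Python sketch expands a machine file into one slot per process and
assigns the -np processes to those slots cyclically. It is only a sketch:
it is not the logic of mpirun.ch_mx (which is a Perl script), the
function names parse_machine_file and place are hypothetical, and the
handling of "host:N" entries and trailing board numbers is an assumption
drawn solely from the description in this README.

```python
from itertools import cycle, islice


def parse_machine_file(text):
    """Return a list of (host, board) slots, one per process slot.

    board is None when the line carries no board number, meaning any
    available board may be used. (Assumed format, see the text above.)
    """
    slots = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # comments and blank lines are allowed
        fields = line.split()
        # "host:N" lists an SMP host once with N processors
        host, _, count = fields[0].partition(":")
        # an optional second field restricts the slot to one board
        board = int(fields[1]) if len(fields) > 1 else None
        slots.extend([(host, board)] * (int(count) if count else 1))
    return slots


def place(slots, np):
    """Assign np processes to slots cyclically, in file order."""
    return list(islice(cycle(slots), np))


EXAMPLE = """\
# the list of nodes that make the MPI World
node1.myri.com:4
node2.myri.com 0
node2.myri.com 1
node2.myri.com 0
node2.myri.com 1
node3.myri.com
node4.myri.com
# end of machine file
"""

if __name__ == "__main__":
    example_slots = parse_machine_file(EXAMPLE)
    for rank, (host, board) in enumerate(place(example_slots, 12)):
        where = "any board" if board is None else "board %d" % board
        print("rank %2d -> %s (%s)" % (rank, host, where))
```

With the example file this gives 10 slots, so a request for 12 processes
wraps around and places the last two on node1.myri.com again.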

STEP 3: Run a program
---------------------

Sample test programs are in /examples, examples/perftest and
examples/tests. To run the cpi program in /examples:

  cd /examples
  make cpi
  ../bin/mpirun -np 2 cpi

If the make process fails with 'undefined reference' errors, see the
"Errors" section under Step 1 on building MPICH-MX.

If you are using rsh, the .rhosts file must be set up appropriately for
all of the hosts in the Myrinet network before running an MPI program.
If this has not been done, you will see 'Permission Denied' messages at
run-time. If you are using ssh and ssh is configured to prompt the user
for a password, you will also see a 'Connection Refused' or 'Permission
Denied' message.

The user should put /bin in his path to access all MPICH scripts.

II. Spawning MPI processes
==========================

mpirun:       generic MPICH script; it calls mpirun.ch_mx underneath for
              the ch_mx device. It is a compatibility script, so flags
              specific to mpirun.ch_mx cannot be passed to mpirun.

mpirun.ch_mx: Perl script; uses rsh/ssh and fork to spawn processes.
              Each MPI process communicates with the mpirun.ch_mx Perl
              script via sockets to pass MX port allocation information.

Note: Batch queue systems
-------------------------

MPICH-MX uses a machine file ("host file"). Batch queue systems usually
provide a similar file ($PBS_NODEFILE, etc.) when resources are allocated
for a job. A "host file" derived from the file provided by the batch
queue system can be used to spawn the MPI processes, using the
mpirun.ch_mx flag "-machinefile" listed below.

III. Run-Time options
=====================

A number of run-time tuning options can be supplied to mpirun.ch_mx.

Usage: mpirun.ch_mx [options] [-np N] prog [flags]

  -v                    Verbose - provide additional details of the
                        script's execution.
  -t                    Testing - do not actually run, just print what
                        would be executed.
  -s                    Close stdin - can run in background without tty
                        input problems.
  -r                    Clean up remote shared memory files - these
                        should be removed automatically, but it is
                        always good to have an option to force it.
  -machinefile FILE     Specifies a machine file (default is
                        ${PREFIX}/share/machines.ch_mx.ARCH).
  --mx-wait X           Wait X seconds between each spawning step.
  --mx-kill X           Kill all processes X seconds after the first
                        exits.
  --mx-recv MODE        Specifies the receive mode: polling or blocking.
  --mx-no-shmem         Disable the shared memory support (enabled by
                        default).
  --mx-copy-env         Have each process inherit most variables of the
                        current environment.
  --mx-tree-spawn       Use a two-level spawn tree to launch processes.
  -totalview            Specifies a Totalview debugging session.
  -ddt                  Specifies a DDT debugging session.
  -pg FILE              Specifies the procgroup file.
  -wd DIR               Specifies the working directory.
  -np N                 Specifies the number of processes.
  prog [flags]          Specifies which command line to run.

Examples:
---------

1. The "--mx-recv" run-time option changes the behavior of blocking MPI
calls. Two modes may be specified at run-time. For example:

  mpirun.ch_mx --mx-recv polling -np 4 foo.x
  mpirun.ch_mx --mx-recv blocking -np 4 foo.x

The default is "--mx-recv polling". The "polling" mode asks MPI to "poll"
MX continually to check for the completion of an event (send or
receive). This mode provides the lowest latency but also has the highest
CPU utilization. It is enabled by default as it provides the best
performance when each process has a dedicated processor.

The "blocking" mode uses blocking MX functions (i.e., each MPI blocking
function call will effectively block, sleeping in the kernel while
waiting for an interrupt from the Myrinet interface). The CPU utilization
is minimal, as a blocked process will not be scheduled on any processor,
but the cost of the interrupt and context switches increases the latency
by 10-20 microseconds, depending upon the architecture. This receive mode
is very efficient when several processes compete for the same processor.
This is the case for some multi-threaded applications or some MPI
applications that spawn several processes per processor by default
(GAMESS, for example).

2. To disable the shared memory support (enabled by default), use the
mpirun.ch_mx flag "--mx-no-shmem", as shown below. This disables shared
memory between local processes for this run only.

  mpirun.ch_mx --mx-no-shmem -np 4 foo.x

Disabling shared memory may improve or reduce performance, depending
closely on the MPI application: latency over shared memory is much lower,
but the peak bandwidth depends on the performance of the memory copy code
provided by the OS. Memory bus traffic and cache thrashing are two
unpleasant side effects of shared memory.

3. The "--mx-kill X" option is a way to kill all remaining processes X
seconds after the first one dies or exits.

  mpirun.ch_mx --mx-kill 5 -np 4 foo.x

In this example, if one process dies or exits, mpirun.ch_mx will kill all
remaining processes 5 seconds later. Thus the abort is propagated to the
rest of the MPI job, but from outside of the MPI code. This is a very
useful tool when debugging.

IV. Running TOTALVIEW
=====================

To run with Totalview, first set the TOTALVIEW environment variable, and
then run with the "-totalview" flag:

  setenv TOTALVIEW /totalview
or
  export TOTALVIEW=/totalview

  mpirun.ch_mx -totalview -np 2 foo.x

V. Running GDB
==============

Running an MPICH-MX job under gdb is fairly easy. You need to have "gdb"
and "xterm" installed on the nodes and in your path, and run:

  mpirun.ch_mx DISPLAY= -np 2 xterm -e gdb foo.x

This will run two instances of the MPI program foo.x under gdb, each in
its own xterm, and display these two windows on one machine.