Portable MPI Model Implementation over GM
Version mpich-1.2.7..15a, June, 2006

README-mpich-gm
===============

MPICH is a portable implementation of MPI, developed by Argonne National
Laboratory. MPICH is designed to be highly portable and is currently
used by a large number of providers of MPI implementations.

    http://www.mcs.anl.gov/mpi/mpich/

MPICH-GM is a port of MPICH on top of GM (ch_gm) and is supported by
Myricom. This release is based on MPICH 1.2.7 from ANL.

A detailed list of changes in this release can be found in the
mpichgm-1.2.7..15a/mpid/ch_gm/CHANGES file.

************************************************************************
For updates to this software, visit `http://www.myri.com/scs'.
The FAQ file is located at `http://www.myri.com/scs/FAQ/'.
All Myrinet hardware and software questions should be directed to
help@myri.com.
************************************************************************

Table of Contents:

  I.   Installation of MPICH-GM (compilation and usage) (READ THIS!!)
  II.  Spawning MPI processes
  III. Run-Time options
  IV.  Running TOTALVIEW
  V.   Running GDB
  VI.  Passing Environment Variables


I. Installation of MPICH-GM (compilation and usage)
===================================================

The installation of MPICH-GM involves the following 3 steps:

  1. Configure/make/install MPICH-GM
  2. Modify the machine file
  3. Run a program

STEP 1: Configure/make/install MPICH-GM
---------------------------------------

MPICH-GM uses the same configure script as the generic MPICH. To assist
with this process, we have provided several scripts which set the
required environment variables, run configure with the appropriate
options for the respective compiler, run make and install the binaries.
All output is logged to config-mine.log, make-mine.log and
install-mine.log.
The currently available scripts are named:

    mpich.make.gcc        - Gnu gcc and g77
    mpich.make.absoft     - Absoft Fortran compiler and Gnu gcc
    mpich.make.pgi        - Portland Group Fortran, Fortran90 and C compilers
    mpich.make.freebsd    - FreeBSD C compiler and Gnu g77
    mpich.make.intel_ia32 - Intel Fortran, Fortran90 and C compilers on IA32
    mpich.make.intel_ia64 - Intel Fortran, Fortran90 and C compilers on IA64
    mpich.make.solaris_32 - Solaris Fortran, Fortran90 and C compilers (32 bits)
    mpich.make.solaris_64 - Solaris Fortran, Fortran90 and C compilers (64 bits)
    mpich.make.tru64      - Compaq C compiler and Gnu g77
    mpich.make.macosx     - Apple C compiler and Gnu g77
    mpich.make.aix        - Gnu gcc and g77 on AIX (in progress)
    mpich.make.lahey      - Lahey Fortran compiler and Gnu gcc

We encourage the user to edit one of these scripts as needed and then
execute the script. For example, for gcc/g77:

    ./mpich.make.gcc

The site-specific information in each script is set through the
following variables:

GM_HOME:    the path to the GM binary tree.
PREFIX:     the location where the MPICH-GM binary tree will be
            installed.
RSHCOMMAND: the program (rsh/ssh) used to spawn processes when not
            using MPD. rsh is preferred over ssh because it has lower
            overhead and because ssh does not properly clean up the
            remote process when killed by a signal (CTRL+C for
            example). mpirun.ch_gm contains reaper code for the ssh
            case, but using rsh avoids this trouble altogether.


Configuring on other operating systems/architectures:
=====================================================

Linux
-----

There are several build scripts for Linux, depending on the compilers
used: mpich.make.gcc, mpich.make.pgi, mpich.make.absoft,
mpich.make.lahey, mpich.make.intel_ia32 and mpich.make.intel_ia64.

Shared libraries are supported by including --enable-sharedlib on the
configure command line. Be aware that with some compilers other than
gcc, for instance the PGI pgcc compiler, using shared libraries creates
a runtime dependency on some PGI shared libraries.
The same PGI runtime must therefore be present on the nodes where the
applications will be executed.

Windows 2000
------------

This release of MPICH-GM is not yet supported on Windows.

Solaris
-------

You must use Gnu make. At this time, Solaris does not yet support
memory registration, so you must configure with the flag
--with-device=ch_gm:-disable-registration. Performance for messages
larger than 16K will be affected, since without registration this is
not a zero-copy protocol. Refer to the mpich.make.solaris_32 and
mpich.make.solaris_64 scripts in this distribution for example
configure scripts.

Tru64
-----

You must use Gnu make. Refer to the file mpich.make.tru64 in this
distribution for an example configure script.

FreeBSD
-------

You must use Gnu make. Refer to the file mpich.make.freebsd in this
distribution for an example configure script.

Mac OS X
--------

You must use Gnu make. Refer to the file mpich.make.macosx in this
distribution for an example configure script.

Errors:
=======

The MPICH make process generates a lot of output. If the make fails in
one directory, it skips that directory and continues with the rest of
the make. As a result, building an application later fails with
'undefined reference to' errors. In this case, inspect the 3 log files
(config-mine.log, make-mine.log and install-mine.log) for errors.

STEP 2: Modify the machine file
-------------------------------

The machine file is located in ${PREFIX}/share/machines.ch_gm.ARCH,
where PREFIX specifies the binary installation directory and ARCH =
LINUX, etc. This machine file specifies the hosts on which the MPI
application will run. The file only needs to be accessible on the
machine where you invoke mpirun.ch_gm, as it is read only by this
script. Comments (lines that begin with '#') and blank lines are
allowed.
An example machine file is given below:

    # the list of nodes that make the MPI World
    node1.myri.com:4
    node2.myri.com 0
    node2.myri.com 1
    node2.myri.com 0
    node2.myri.com 1
    node3.myri.com
    node4.myri.com
    # end of machine file

By default, this release of MPICH-GM automatically allocates the GM
ports on the nodes of the MPI job, load-balancing ports across multiple
Myrinet interfaces if available. No information related to Myrinet or
GM is needed; this machine file is completely compatible with the
generic MPICH machine file.

If you have multiple boards in a single machine, you can add the board
number to the end of the line to restrict the GM port allocation to
that NIC. Omitting the board number leaves MPICH-GM free to allocate GM
ports on any available board in the host. The dynamic allocation
strategy minimizes the number of GM ports opened per board. In the
example above, node2 has two boards and will run two processes on
board 0 and two processes on board 1.

To set up the machine file for SMP use, you can simply list the SMP
machine N times (once for each processor), or add the number of
processors next to the hostname (separated by a colon). The example
above uses two SMP machines (node1 and node2), each of which has 4
processors, and two uni-processor machines (node3 and node4).

If more processes are requested by -np than there are machines listed
in the machine file, the processes will be spawned cyclically in the
order specified in the list until the desired number of processes has
been spawned.

The machine names in the machine file are the hostnames used to rsh/ssh
to the remote nodes. They can be IP addresses; they do not need to be
identical to the host names in the GM routing table.

This machine file is system-wide, so all MPI applications will use it
unless the user overrides it with the run-time option "-machinefile".

STEP 3: Run a program
---------------------

Sample test programs are in /examples, examples/perftest and
examples/tests.
To run the cpi program in /examples:

    cd /examples
    make cpi
    ../bin/mpirun -np 2 cpi

If the make process fails with 'undefined reference' errors, see the
"Errors" section under STEP 1 on building MPICH-GM.

Before running an MPI program with rsh, we assume that the user has set
up a .rhosts file appropriately for all of the hosts in the Myrinet
network. If this has not been done, 'Permission Denied' messages will
appear at run-time. If you're using ssh and it is not configured to log
in without prompting for a password, you will also see a 'Connection
Refused' or 'Permission Denied' message.

The user should add /bin to the PATH to access all MPICH scripts.


II. Spawning MPI processes
==========================

There are two ways in which MPI processes can be spawned: using
rsh/ssh, or using MPD. The instructions in this README-mpich-gm assume
that you are using the rsh/ssh method of spawning processes. If you
have more than 128 nodes in your cluster, you should consider using MPD
for spawning the MPI processes. Details about MPD can be found in the
MPICH-GM documentation available on the Myricom web page, as well as in
the MPICH documentation at Argonne.

MPICH-GM, in contrast to MPICH-P4, can use the two spawning methods,
rsh/ssh and MPD, from the same build. No configuration flag is needed
to compile MPD; it is built by default.

There are 3 different scripts that can be used to spawn MPI processes:

mpirun: generic MPICH script, which calls mpirun.ch_gm underneath for
the ch_gm device. This is a compatibility script; mpirun.ch_gm-specific
flags cannot be passed to it.

mpirun.ch_gm: perl script, uses rsh/ssh and fork to spawn processes.
Each MPI process communicates with the mpirun.ch_gm perl script via
sockets to pass GM port allocation information.

mpirun.mpd: alias to mpd/mpdcon, the interface used to access the MPD
ring, manage jobs, etc. MPD will not be described in this README.
Argonne's documentation is more appropriate.
It is completely compatible with the way MPD is used for the MPICH
ch_p4mpd device. Argonne's documentation will be ported to ch_gm in the
upcoming release.

Note: Batch queue systems
-------------------------

MPICH-GM uses a machine file instead of a ch_gm-specific configuration
file. A machine file, also called a "host file", is exactly the type of
information generated by almost all batch queue systems (PBS, etc.)
when resources are allocated for a job. Such a machine file can be used
directly to spawn the MPI processes, using the mpirun.ch_gm flag
"-machinefile" listed below.


III. Run-Time options
=====================

A number of run-time tuning options can be supplied to mpirun.ch_gm.

Usage: mpirun.ch_gm [options] [-np ] prog [flags]

   -v                    Verbose - provide additional details of the
                         script's execution.
   -t                    Testing - do not actually run, just print what
                         would be executed.
   -s                    Close stdin - can run in background without tty
                         input problems.
   -r                    Clean up remote shared memory files - they
                         should be removed automatically, but it is
                         always good to have an option to force it.
   -machinefile          Specifies a machine file (default is
                         /share/machines.ch_gm.).
   --gm-no-shmem         Disable the shared memory support (enabled by
                         default).
   --gm-shmem-prefix     File prefix of the shared-mem communications
                         storage (default: /tmp/gmpi_shmem-)
   --gm-numa-shmem       Enable shared memory only for processes sharing
                         the same Myrinet interface.
   --gm-wait             Wait the given number of seconds between each
                         spawning step.
   --gm-kill             Kill all processes the given number of seconds
                         after the first exits.
   --gm-eager            Specifies the Eager/Rendez-vous protocol
                         threshold size.
   --gm-recv             Specifies the receive mode: polling, blocking
                         or hybrid; polling is the default.
   --gm-lock-mbytes      Maximum number of megabytes of memory locked by
                         gm (for communications) per process
   --gm-label            Prefix each process output with its rank
   --gm-copy-env         Have each process inherit most variables of the
                         current environment
   --gm-bounce-buffers   Maximum number of bounce buffers (only valid
                         for --disable-registration mode)
   --gm-tree-spawn       Use a two-level spawn tree to launch processes
   -totalview            Specifies Totalview debugging session.
   -ddt                  Specifies DDT debugging session.
   -pg                   Specifies the procgroup file.
   -wd                   Specifies the working directory.
   -np                   Specifies the number of processes.
   prog [flags]          Specifies which command line to run.

Note: The shared memory support is enabled by default and may be
DISABLED at runtime with the "--gm-no-shmem" flag to mpirun.ch_gm.

Examples:
---------

1. Specifying "--gm-recv" as a run-time option changes the behavior of
   blocking MPI calls. Three modes may be specified at runtime. For
   example:

       mpirun.ch_gm --gm-recv polling -np 4 foo.x
       mpirun.ch_gm --gm-recv blocking -np 4 foo.x
       mpirun.ch_gm --gm-recv hybrid -np 4 foo.x

   The default is "--gm-recv polling".

   The "polling" mode asks MPI to "poll" all devices continually to
   check for the completion of an event (a send or a receive). This
   mode provides the lowest latency but also has the highest CPU
   utilization. It is enabled by default as it provides the best
   performance when each process has a dedicated processor.

   The "blocking" mode uses a "blocking GM receive" function called
   gm_blocking_receive_no_spin(): each MPI blocking function call
   effectively blocks, sleeping in the kernel while waiting for an
   interrupt from the Myrinet interface. CPU utilization is minimal, as
   a blocked process is not scheduled on any processor, but the cost of
   the interrupt and context switches increases the latency by 15-40
   microseconds, depending upon the architecture. This receive mode is
   very efficient when several processes compete for the same processor.
   This is the case for some multi-threaded applications or some MPI
   applications that spawn several processes per processor by default
   (GAMESS for example).

   The "hybrid" mode is a combination of the two previous modes. In
   this mode, the process will "poll" for one millisecond
   (gm_blocking_receive) and then sleep as in the blocking mode. This
   receive mode provides a good balance between the release of CPU
   cycles and the cost of the interrupt overhead.

2. To change the Eager/Rendez-vous threshold at run time, use the
   "--gm-eager" flag as shown below:

       mpirun.ch_gm --gm-eager 4096 -np 2 foo.x

   The default value of the Eager size limit is 32672, and the minimum
   value that can be specified is 128 bytes.

   WARNING: Do not change this value unless you know what you are doing!

   The Eager protocol is a non-blocking protocol where the sender sends
   a message without knowing if the receiver has posted a matching
   receive. If the receiver does not provide a matching receive in
   time, the message is saved in a temporary buffer. This protocol is
   used for small messages as it provides the lowest latency.

   The Rendez-vous protocol forces synchronization between sender and
   receiver by handshaking with small messages. The data is then
   transmitted with a gm_directed_send_with_callback() (PUT), and is
   written directly to the receive buffer without intermediate
   buffering. This protocol is used for large messages, where the
   buffering overhead would become unmanageable.

   In MPICH-P4, the usual threshold between these two protocols is 16K.
   This value can be changed at run-time, but you must be very careful.
   The MPI specification says that the application cannot assume
   anything about the blocking behavior of the MPI functions for
   different message sizes. However, a large set of MPI applications
   happily violate this rule and will deadlock if the threshold is
   decreased to less than 16K.

3.
   To disable the shared memory support (enabled by default), use the
   mpirun.ch_gm flag "--gm-no-shmem", as shown below. This will disable
   shared memory between local processes for this run only.

       mpirun.ch_gm --gm-no-shmem -np 4 foo.x

   Disabling shared memory may improve or reduce performance, depending
   closely on the MPI application: latency is much better with shared
   memory, but the peak bandwidth depends on the performance of the
   memory copy code provided by the OS. Memory bus traffic and cache
   thrashing are two unpleasant side effects of shared memory.

4. The "--gm-kill X" option is a way to kill all remaining processes X
   seconds after the first one dies or exits.

       mpirun.ch_gm --gm-kill 5 -np 4 foo.x

   In this example, if one process dies or exits, mpirun.ch_gm will
   kill all remaining processes 5 seconds later. The abort is thus
   propagated to the rest of the MPI job, but from outside the MPI
   code. This is a very useful tool when debugging.


IV. Running TOTALVIEW
=====================

To run with totalview, first set the TOTALVIEW environment variable,
and then run with the "-totalview" flag:

    setenv TOTALVIEW /totalview
  or
    export TOTALVIEW=/totalview

    mpirun.ch_gm -totalview -np 2 foo.x


V. Running GDB
==============

Running an MPICH-GM job under gdb is fairly easy. You need to have
"gdb" and "xterm" installed on the nodes and in your path, and run:

    mpirun.ch_gm DISPLAY= -np 2 xterm -e gdb foo.x

This will run two instances of the MPI program foo.x under gdb, each in
its own xterm, and display the two windows on one machine.


VI. Passing Environment Variables
=================================

As of MPICH-GM 1.2.6..14, you can export the variables of the current
environment to the MPI processes using the --gm-copy-env run-time
option to mpirun.ch_gm. Earlier releases of MPICH-GM only export two
environment variables by default: DISPLAY and LD_LIBRARY_PATH.
If you are using a release earlier than MPICH-GM 1.2.6..14 and you need
environment variables to be passed to the MPI processes, there are two
ways to accomplish this:

  * set the environment variable in your login script on each node
    (.bashrc/.cshrc), so that the remote shell uses it as part of its
    execution context.

  * pass the environment variable in the mpirun.ch_gm call:

        mpirun.ch_gm FOO_1= FOO_2= -np 2 foo.x
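
As a minimal sketch of the second method, the shell function below
assembles such a command line from variables that are already set in
the calling shell. The function name build_mpirun_cmd and the values
alpha and beta are made up for illustration and are not part of
MPICH-GM. The function only prints the command; it does not run
mpirun.ch_gm. It assumes that the values contain no whitespace or shell
metacharacters.

```shell
# Hypothetical helper (not shipped with MPICH-GM): print an
# mpirun.ch_gm command line that passes the named variables as
# NAME=value arguments, in the form shown in the example above.
# Usage: build_mpirun_cmd NP PROG [VARNAME ...]
build_mpirun_cmd() {
    np=$1
    prog=$2
    shift 2
    cmd="mpirun.ch_gm"
    for name in "$@"; do
        # Look up the current value of the variable called $name
        # (empty if the variable is unset).
        eval "value=\${$name-}"
        cmd="$cmd $name=$value"
    done
    printf '%s -np %s %s\n' "$cmd" "$np" "$prog"
}

# Example with two illustrative variables.
FOO_1=alpha
FOO_2=beta
build_mpirun_cmd 2 foo.x FOO_1 FOO_2
# prints: mpirun.ch_gm FOO_1=alpha FOO_2=beta -np 2 foo.x
```

Printing the command first makes it easy to check what will be passed
before launching the job. With MPICH-GM 1.2.6..14 or later, the
--gm-copy-env option described above is the simpler choice.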