Myrinet Protocol Module -- Implementation of Sun HPC ClusterTools MPI over GM
Version CT-Myrinet-PM-1.1-no-mt
April 26, 2004

README-ClusterTools
===================

The Myrinet protocol module (Myrinet PM) is a loadable protocol module
implemented for Sun HPC ClusterTools MPI over GM. This Myrinet PM is for
Sun HPC ClusterTools 4.0 and higher.

Features of this Myrinet PM implementation include:

  * Uses efficient pipelined memory copy.

  * Uses GM directed sends (PUT) (gm_directed_send_with_callback()) for
    large messages (Rendez-vous) to improve scalability. Uses all of the
    send and recv tokens efficiently.

  * The Eager/Rendez-vous thresholds can be changed at run-time. The
    default is 16KB minus the bookkeeping bytes (the bookkeeping uses
    96 bytes for 32-bit applications and 112 bytes for 64-bit
    applications).

  * The Rendez-vous/Multiphase-Rendez-vous thresholds can be changed at
    run-time. The default is 32KB minus 8 bytes.

  * Uses the RTE to exchange GM initialization information among
    processes. This provides very good scalability for large jobs.

Table of Contents:

  I.   Installation of Myrinet PM for Sun ClusterTools
  II.  Run-Time (Tuning) Options
  III. FAQ for GM Driver for SUN UltraSPARC Systems
  IV.  FAQ for Myrinet Protocol Module for Sun HPC ClusterTools


I. Installation of Myrinet PM for Sun ClusterTools (4.0 and higher)
===================================================================

The installation of Myrinet PM involves the following 8 steps.

Step 1. Install Sun HPC ClusterTools.

    Install Sun HPC ClusterTools 4.0 or higher into the directory
    (default is /opt/SUNWhpc).

Step 2. Install the Myrinet PM libraries.

        gunzip -c CT-Myrinet-PM-1.1-no-mt.tar.gz | tar xvf -
        su root
        cp myr*.so* /lib
        cp sparcv9/myr*.so* /lib/sparcv9

Step 3. Set up the GM hostnames and bring up the ethernet driver.

    For each node of the cluster, run ip2hostname to obtain the hostname
    corresponding to the IP address of myri0 on that node:

        ip2hostname x.x.x.x        (myri0's IP address)

    Edit or create the file /etc/gm/hostname.0 and put this name into it.

    Restart the /etc/init.d/gm script on each node:

        /etc/init.d/gm restart

    This sets the node's GM name to the hostname in /etc/gm/hostname.0
    and makes the change known to all nodes.

    The name setup needs to be done only once. Once the file
    /etc/gm/hostname.0 is in place, the /etc/init.d/gm script will pick
    up the right name for the GM node whenever the daemon is started.

    For each node of the cluster, plumb myri0 and set its IP address
    with the command:

        ifconfig myri0 plumb x.x.x.x up

Step 4. Construct the configuration file /conf/gm.conf.

    An example gm.conf file for a cluster consisting of two nodes is
    given below:

    -----------------------------------------
    2
    # node_name  board_num  port_num  port_ids
    u81-t        1          12        4 5 6 7 8 9 10 11 12 13 14 15
    u82-t        1          12        4 5 6 7 8 9 10 11 12 13 14 15
    -----------------------------------------

    In this file, the first line specifies the total number of nodes in
    the cluster. Starting on the second line, the columns have the
    following meaning:

      * the first column specifies the node's GM name;
      * the second column specifies the number of Myrinet interfaces
        installed on this node;
      * the third column specifies the number of GM ports available for
        CT to use;
      * the fourth and following columns are the GM port ids that can
        be used.

    Set the environment parameter MPI_MYR_CONF to the /conf/gm.conf file:

        setenv MPI_MYR_CONF /conf/gm.conf

    If MPI_MYR_CONF is not set, /opt/SUNWhpc/conf/gm.conf is used as
    the default.

Step 5. Modify /conf/hpc.conf.

    ------------------------------------------------------------
    ...
    # List the available Protocol Modules
    # PMODULE       LIBRARY
    Begin PMODULES
    shm             ()
    rsm             ()
    myr             ()
    tcp             ()
    End PMODULES

    # SHM settings
    # NAME  RANK
    Begin PM=shm
    shm     5
    End PM

    # RSM settings
    # NAME  RANK  AVAIL
    Begin PM=rsm
    wrsm    20    1
    End PM

    # MYR settings
    # NAME  RANK
    Begin PM=myr
    myr     30
    End PM
    ...
    # TCP settings
    # NAME  RANK  MTU    STRIPE  LATENCY  BANDWIDTH
    Begin PM=tcp
    midn    0     16384  0       20       150
    idn     10    16384  0       20       150
    ...
    myri    161   4096   0       20       150
    ...
    ------------------------------------------------------------

Step 6. Start the daemon (requires root privileges).

    On the master node:

        su root
        /etc/init.d/sunhpc.cre_master start
        /etc/init.d/sunhpc.cre_node start

    On the non-master nodes:

        su root
        /etc/init.d/sunhpc.cre_node start

Step 7. Initialize the "all" partition (if necessary).

    If "mpinfo -N" does not show all nodes in the "all" partition,
    initialize the partition (requires root privileges) using:

        su root
        /etc/part_initialize

================================================================================
example:

mpinfo -N
NAME  UP  PARTITION  OS     OSREL  NCPU  FMEM  FSWP  LOAD1  LOAD5  LOAD15
u81   y   all        SunOS  5.8    4     3901  3910  0.02   0.04   0.02
u82   y   -          SunOS  5.8    4     3873  3873  3.05   2.48   1.42
================================================================================

Step 8. Restart the daemon (if necessary).

    If /conf/hpc.conf is modified (e.g. rank changed, interface
    added/removed, PM added/removed), or the status of any interface
    listed in /conf/hpc.conf changes (e.g. interface up/down), the
    daemon should be restarted.

    Stop the daemon (requires root privileges).

    On the non-master nodes:

        su root
        /etc/init.d/sunhpc.cre_node stop

    On the master node:

        su root
        /etc/init.d/sunhpc.cre_node stop
        /etc/init.d/sunhpc.cre_master stop

    Restart the daemon by repeating Step 6.


II. Run-Time (Tuning) Options
=============================

A number of run-time tuning options can be supplied to Myrinet PM using
environment parameters.

MPI_MYR_CONF
    specifies a configuration file
    default /opt/SUNWhpc/conf/gm.conf

MPI_MYR_RENDVSIZE
    specifies the eager/rendez-vous thresholds
    default 16K - bookkeeping bytes
    (96/112 bytes for 32/64-bit applications respectively)

MPI_MYR_MULTIPHASE_RENDVSIZE
    specifies the rendez-vous/multiphase-rendez-vous thresholds
    default 32K - 8


III. FAQ for GM Driver for SUN UltraSPARC Systems
=================================================

Q1. 32/64-bit GM driver and kernel matching issue:

A1. GM-1.6_Solaris simultaneously supports 32-bit and 64-bit
    applications on a 64-bit kernel.


IV. FAQ for Myrinet Protocol Module for Sun HPC ClusterTools (4.0 and higher)
============================================================================

Q1. What is the status of using multiple Myrinet interfaces in a single
    server, and how is communication between servers with multiple
    interfaces accomplished?

A1. The syntax of the gm_open API is:

------------------------------------------------------------------------------
SYNOPSIS
    gm_status_t gm_open (struct gm_port **p, int device, int port_id,
                         char *port_name);

DESCRIPTION
    Opens GM port "port_id" of Myrinet interface "device", and returns a
    pointer to the port's state at "*p". This pointer must be passed to
    all subsequent functions that operate on the opened port.
    "port_name" is a null-terminated ASCII string that is used to
    identify the port client for debugging (and potentially other)
    purposes; pass in the name of your program. Note that the device
    unit numbers and port numbers start at 0, and that ports 0 and 1 are
    reserved, so clients will usually open ports 2 and higher.
------------------------------------------------------------------------------

    On a node with multiple Myrinet interfaces, a GM port is defined by
    the combination of device (interface) unit id and port id. Myrinet
    PM for HPC CT 4.0 uses a configuration file (by default
    /opt/SUNWhpc/conf/gm.conf) to specify and enforce an order of all
    legitimate GM ports available for CT across all Myrinet interfaces.
    This configuration file is used only by Myrinet PM. All processes
    walk sequentially through the list looking for the first available
    GM port, using gm_open(). If gm_open() returns "GM_FAILURE", the
    port is unavailable, and the next (device, port) entry in the list
    is tested.
    After a GM port is acquired by a process, it is used exclusively by
    this process for all Myrinet PM traffic until gm_close() is called.

    While Myrinet PM may be selected for inter-process communication
    between nodes, by default the shared memory protocol is always used
    for inter-process communication within an SMP node because of its
    higher priority in the HPC configuration file
    (/opt/SUNWhpc/conf/hpc.conf). This configuration file is used by
    all PMs.

    -------------------------
    Example of hpc.conf file
    -------------------------
    ...
    # SHM settings
    # NAME  RANK
    Begin PM=shm
    shm     5
    End PM

    # MYR settings
    # NAME  RANK
    Begin PM=myr
    myr     30
    End PM
    ...
    -----------------

Q2. In Myrinet PM for HPC CT 4.0, in what order are multiple Myrinet
    interfaces used?

A2. As specified in A1, a configuration file (by default
    /opt/SUNWhpc/conf/gm.conf) is used to specify and enforce an order
    of all legitimate GM ports. When this file is parsed, the GM ports
    are tested in the following order: first the smallest port on each
    device, then the next smallest port on each device, and so on up to
    the largest port on each device. A Myrinet PM process tries each
    (device, port) pair in order until gm_open() succeeds for a pair
    (indicating that this port on this device is available).

    ------------------------------------------------------
    Example of gm.conf file
    ------------------------------------------------------
    #node_name  #boards  #ports  port_ids
    u81-t       2        12      4 5 6 7 8 9 10 11 12 13 14 15
    u82-t       1        12      4 5 6 7 8 9 10 11 12 13 14 15
    ------------------------------------------------------

    This gm.conf file specifies that u81-t has 2 Myrinet interfaces,
    u82-t has 1 Myrinet interface, and each interface has 12 GM ports
    available for CT, numbered 4 to 15.

    The processes on u81-t which use Myrinet PM will then try the
    (device, port) pairs in this order:

        (device 1, port 4)
        (device 2, port 4)
        ...
        (device 1, port 15)
        (device 2, port 15)

    The processes on u82-t will then try the (device, port) pairs in
    this order:

        (device 1, port 4)
        (device 1, port 5)
        ...
        (device 1, port 14)
        (device 1, port 15)

    To use a configuration file other than the default
    /opt/SUNWhpc/conf/gm.conf file, set the environment parameter with
    the following command:

        setenv MPI_MYR_CONF your_gm_conf_file

Q3. How many processes can be simultaneously transmitting or receiving?

A3. A GM port is used exclusively by a process. The GM-2 driver
    supports 16 ports per interface by default. Among these ports,
    ports 0 and 1 are reserved by GM. When the IP driver is loaded,
    port 3 emulates ethernet. We also leave port 2 free for any
    possible GM test (by default, GM tests use port 2). This leaves
    12 GM ports per interface (16 - 2 - 1 - 1 = 12) for MPI processes
    using Myrinet PM. Therefore, each Myrinet interface supports 12
    MPI processes simultaneously. Each GM port is bidirectional.

    Example: A machine has two Myrinet interfaces installed. The IP
    driver is loaded. When Myrinet PM is used for MPI processes, up to
    24 processes can simultaneously perform Myrinet transmission.

    When examining performance issues, the user should be aware of the
    competition among multiple processes for the PCI bus.

Q4. Does Myrinet PM support multithreading?

A4. Currently no.

Q5. Does Myrinet PM support memory registration for Solaris?

A5. Currently no.

Q6. Does Myrinet PM have client-server support?

A6. Currently no.
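
Appendix note: previewing the GM port search order

    The search order described in A2 of section IV can be previewed for
    a given node with a small script. The following is only a sketch
    and not part of Myrinet PM or ClusterTools: the function name
    myr_port_order is hypothetical, a POSIX sh and awk are assumed, and
    the port ids are assumed to be listed in ascending order as in the
    gm.conf examples above. Devices are numbered from 1 only to match
    the listing in A2.

```shell
#!/bin/sh
# myr_port_order CONF_FILE NODE_NAME
# Hypothetical helper: prints one "(device D, port P)" line per pair, in
# the order A2 describes (each port id on every device before the next
# port id). It only reads the file; it does not call gm_open().
myr_port_order() {
    awk '
        /^[ \t]*#/ { next }                 # skip comment lines
        $1 == node && NF >= 4 {
            boards = $2                     # column 2: number of interfaces
            nports = $3                     # column 3: number of usable ports
            for (i = 1; i <= nports && 3 + i <= NF; i++) {
                port = $(3 + i)             # columns 4 and up: port ids
                for (d = 1; d <= boards; d++)
                    printf "(device %d, port %d)\n", d, port
            }
        }
    ' node="$2" "$1"
}

# Example call (file and node name taken from the examples above):
#   myr_port_order /opt/SUNWhpc/conf/gm.conf u81-t
```

    For the two-interface node u81-t in the A2 example this lists 24
    pairs, which matches the 24-process limit worked out in A3.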