************************************************************************
* Myricom GM networking software and documentation                     *
* Copyright (c) 2003 by Myricom, Inc.                                  *
* All rights reserved.  See the file `COPYING' for copyright notice.   *
************************************************************************

README-linux for gm-1.6.5

README for linux distribution

Supported platforms:
   Linux 2.2 and 2.4 for IA32, PowerPC, Alpha.
   Linux 2.4 for IA64 (Itanium).

   - For Alphas, if you have 2 GB or more of memory, we recommend
     kernel version 2.4.18 or later to install GM.  You must use
     kernel version 2.4.14 or later (2.4.9 also works).

Supported interfaces:
   LANai7 (PCI64, PCI64A), and LANai9 (PCI64B, PCI64C)

   If you have LANai4, you will need to upgrade your interface, or use
   a previous version of GM such as gm-1.2.3 for 256K and gm-1.5.2.1
   for larger memory sizes.  (Please also note that Linux 2.4 is not
   supported on gm-1.2.3).  For installation instructions of an
   earlier GM version please refer to the respective README and
   README- files.

WARNING: When building/linking GM applications, you must do so on a
linux box that matches the OS version of the machine on which you will
be running.  You cannot compile on a 2.2.x machine and run the
executable on a 2.4.x machine.

Table of Contents:
-----------------
I.   GM Installation
     a. Configuring and compiling GM
     b. Installing the GM driver
     c. Running the GM Mapper
     d. Enabling IP over Myrinet (Ethernet emulation)
     e. Testing the GM installation
II.  Verifying the GM performance
III. Improving IP Performance
IV.  Fork() Support
V.   Sample Scripts to automatically load GM and start the Mapper
VI.  Operating-system-specific Caveats
     a. Using Compaq Compilers for Alpha Linux (ccc cxx)
     b. PCI Chipset Tweaks
     c. APIC IRQ conflict on Tyan and AMD motherboards
     d. AGP (nVidia and ATI) conflicts
VII. Miscellaneous
     a. Uninstallation of the GM driver

************************************************************************
If difficulties are encountered, please consult the FAQ
   http://www.myri.com/scs/FAQ/
and direct all technical support questions to help@myri.com.
************************************************************************

===================
I. GM Installation
===================

GM installation is performed in the following five steps.

1. Configuring and compiling GM:
---------------------------------------------

   gunzip -c gm-1.6.5_Linux.tar.gz | tar xvf -
   cd {GM_HOME}
   ./configure
   make

By default, we assume that the header files for your Linux installation
are located in /usr/src/linux.  If your Linux kernel source is not
located in /usr/src/linux, you must configure with the following
option:

   ./configure --with-linux=

where the value given to --with-linux specifies the directory of the
linux kernel source.

The kernel header files MUST match the running kernel exactly: not only
should they both be from the same version, but they should also contain
the same kernel configuration options.

Note: If you have a mixture of hosts with LANai4 and LANai7 (or LANai9)
interfaces that need to talk to each other, you must configure with
--disable-new-features on all of the hosts.

For a complete listing of all options to configure, type:

   ./configure --help

Note: Do not use the configure flag --enable-directcopy.  This flag is
not a valid option to GM 1.6.5.  It will be re-enabled in a future
release.

2. Installing the GM driver:
---------------------------------------------

Select an installation directory path.  It is usually best for this
path to be an NFS directory available on all machines that are to share
this GM installation.  The directory must be accessible using the same
path on all machines that are to share the installation.  The path must
be absolute; it must start with "/".  However, it may contain symbolic
links.
   cd binary
   ./GM_INSTALL

If you omit the installation directory path, the driver will be
installed in the default directory, /opt/gm/.

Next, you must run

   su root
   /sbin/gm_install_drivers
   /etc/init.d/gm start

on each machine to install the drivers on that machine.

The gm_install_drivers script performs the following operations:

   * Shuts down existing IP over Myrinet
   * Unloads existing GM module, if it exists (rmmod)
   * Creates the devices (/dev/gm* and /dev/gmp*), one device per
     interface
   * Loads the GM module (insmod)

Important note: The gm_install_drivers script does not configure the IP
device.  If you wish to run IP over GM/Myrinet (ethernet emulation),
you must configure the device.  Refer to step 4.

If you wish for the driver to auto-load at boot, you must create
appropriate links in the /etc/rcN directories to the /etc/init.d/gm and
/etc/init.d/myri scripts.  Alternatively, you may start and stop the
drivers manually using

   su root
   /etc/init.d/gm start
   /etc/init.d/gm stop

or

   su root
   /etc/init.d/gm restart

to start, stop, or restart the driver, respectively.

For directions on how to uninstall the GM driver, refer to the
"Miscellaneous" section.

Note: If the host is rebooted, you must reload the GM driver (and rerun
the GM mapper).

3. Running the GM Mapper
------------------------

Myrinet is a source-routed network, i.e., each host must know the route
to all other hosts through the switching fabric.  The GM mapper
automatically discovers all of the hosts connected to the Myrinet
network, computes a set of deadlock-free minimum-length routes between
the hosts, and distributes appropriate routes to each host on the
connected network.

Loopback and point-to-point network topologies require gm_simpleroute
to be run instead of the GM Mapper.  (Refer to the GM README and the
FAQ for details.)

For a switch network topology, the GM Mapper must be run before any
communication over Myrinet can be initiated.

Further technical details about the GM mapper can be found in
mt/README.

Depending upon the user's needs, there are three different ways in
which the GM mapper may be used.

MAP_ONCE mapping:
----------------

The first way is by far the most common, and we shall refer to it as
"map_once".  In this method, the mapper is run on one host in the
network (any of the hosts).  It is rerun if a host (re)boots, a
hostname is changed, or the Myrinet topology changes (swapping of ports
on a switch).  (If the Mapper must be rerun for any of these reasons,
it is strongly advised to run it on the same host.)

The command for this method of running the GM mapper is:

   cd /sbin/
   su root
   ./mapper ../etc/gm/map_once.args

STATIC mapping:
--------------

The second way in which the GM mapper may be used is called "static
mapping" or "file mapping".  In this method, an active mapper is run
once when ALL of the hosts are up and running the GM driver.  This
initial active mapper will generate a map file and a host file.  These
files are then copied to all of the hosts in the network, or shared by
NFS.  An entry in the boot scripts will allow each host to read the map
file and the host file and update the routing table on its local
Myrinet interface(s).  This method is particularly appealing as no
human intervention is needed and no traffic is generated at boot time.

The commands for this method of running the GM mapper are:

   cd /sbin/
   su root
   ./mapper ../etc/gm/static.args

Copy the 3 files created by this command (static.map, static.routes,
and static.hosts) to each /sbin/ directory on each host if the gm tree
is not mounted by NFS.

Add the following command to the boot scripts of the host (scripts in
/etc/init.d or /etc/rc.d/init.d).

   cd /sbin/
   su root
   ./file_mapper ../etc/gm/file.args

HA mapping:
-----------

The third way in which the GM mapper may be used is for the users who
have a need for High Availability (HA) in an aggressive computing
environment.

The command for this method of running the GM Mapper is:

   cd /sbin/
   su root
   ./mapper ../etc/gm/active.args &

It will continuously run the GM mapper in the background to detect and
add any new hosts or remove any non-responding hosts, to detect any
change of topology (change of slots in the switch, change of
innerswitch topology), and periodically update the routing tables of
the Myrinet cards (by default, every 30 seconds).

You should note that this mapping method is quite intrusive.  Users are
strongly advised to avoid this method of running the GM mapper if their
applications produce heavy network traffic (e.g., MPI applications),
since the GM Mapper uses non-reliable messages that may be dropped
under heavy contention.  Hosts whose messages are dropped may be marked
as "non-responding" and removed because they appear unreachable.

A few expert customers use this mapping method to satisfy their high
availability constraints for GM applications designed to handle a
dynamic change of configuration (by design, MPI is NOT a fault-tolerant
application).

For the majority of users, the "map_once" GM mapping method is
sufficient.  For the users with more production-level constraints, the
"static mapping" is the most adequate method.  For fault-tolerant GM
applications, the third method provides the best alternative.

4. Enabling IP over Myrinet (Ethernet Emulation) (OPTIONAL)
-----------------------------------------------------------

If you wish to run IP over Myrinet (ethernet emulation), the Linux
command to enable IP over GM is as follows:

   /sbin/ifconfig myri0 up

where you must replace myri0 with the appropriate name (myri1, myri2,
etc.) if you have more than one Myrinet interface per host.

For suggestions on improving performance, please refer to section
"III. Improving IP Performance".

Refer to the FAQ entry "Can you explain the "channel bonding" support
that was added to GM in gm-1.5.2.1 and gm-1.6.3?" for details of
channel bonding support in GM.
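
For hosts with several Myrinet interfaces, the per-interface command
above can be generated in a loop.  This is a minimal sketch, assuming
the interfaces are numbered consecutively from myri0.  The helper name
myri_up_cmds is hypothetical, and it only prints the commands so that
nothing changes until they are run as root.

```shell
#!/bin/sh
# Print one "ifconfig ... up" command per Myrinet interface.
# $1 = number of Myrinet interfaces in this host (assumed to be
#      numbered consecutively, starting at myri0)
myri_up_cmds() {
    count=$1
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "/sbin/ifconfig myri$i up"
        i=$((i + 1))
    done
}

# Example: a host with two Myrinet interfaces
myri_up_cmds 2
```

The printed lines can be reviewed and then piped to sh by root.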

Consult the "Running IP" section of the FAQ
(http://www.myri.com/scs/FAQ/) for other related questions.

5. Testing the GM Installation
------------------------------

Once the GM software has been properly installed on all of the hosts in
your cluster, you are ready to validate your Myrinet installation by
performing the following sequence of tests.

   * Check the LEDs on each switch port and interface port
   * Run gm_board_info on one host
   * Run gm_debug to test the PCI bandwidth
   * Run gm_allsize to test the links in the network
   * Run gm_stress to test the network
   * Run Mute on the cluster to test for bad links

Each of these steps is detailed in the Troubleshooting section of the
FAQ
   http://www.myri.com/scs/FAQ/

The test programs (gm_board_info, gm_debug, gm_allsize, gm_stress) are
available in /bin in your GM installation.  A README describing each of
these tests can be found in /bin/README.

Mute is not included in the GM distribution, but can be downloaded from
   http://www.myri.com/scs/

================================
II. Verifying the GM Performance
================================

We recommend the following test to verify your GM performance.

   cd /bin/
   gm_debug -L

This gm_debug test displays the results of the hardware benchmark test
of the PCI bus with the DMA engine of the Myrinet interface.  The
output of this command indicates the maximum sustained bandwidth that
can be obtained from the PCI bus, and thus provides an upper bound on
GM performance.  A detailed description of this benchmark can be found
in the FAQ entry "Can you describe in detail the "hardware benchmark of
the PCI bus" that is returned by gm_debug?"

The output of this command also tells you if the Myrinet interface was
correctly detected as 64-bit / 66 MHz, for example.  If the interface
was not correctly detected by the BIOS, you should suspect a riser card
problem or a PCI slot problem.

Performance graphs (http://www.myri.com/myrinet/performance) for GM are
available.  The performance measurements were obtained by running
gm_allsize tests for latency and bandwidth as described in the FAQ
entry ("What are the run-time options to gm_allsize?").

Refer to the section entitled "GM Performance" in the /README for
complete details on expected GM performance.

=============================
III. Improving IP performance
=============================

To obtain good IP performance over Myrinet:

   * use Linux-2.4 (Linux-2.4.20 is now available)
   * configure GM with --enable-new-features (a default for gm-1.5 and
     later) to get a larger 9000-byte MTU for IP-over-Myrinet

You definitely want to use Linux 2.4 instead of Linux 2.2, and NFS-v3
over TCP.  Linux 2.4 has vastly better TCP/IP and UDP/IP numbers than
Linux-2.2.  Also, there have been some recent patches to Linux-2.4 that
help UDP performance.

If you are running Linux 2.2 or earlier, you should use the following
tuning options to get good NFS bandwidth.  Otherwise, you are latency
dominated and Myrinet IP and Ethernet IP performance will be about the
same.

   - Increase the TCP windows:

        echo "262144" > /proc/sys/net/core/rmem_max
        echo "262144" > /proc/sys/net/core/wmem_max
        echo "262144" > /proc/sys/net/core/wmem_default
        echo "262144" > /proc/sys/net/core/rmem_default

   - In linux/include/net/tcp.h, replace the value of

        #define MAX_WINDOW 32767

     with the value of your choice (200k~500k might be good)

   - Check that /proc/sys/net/ipv4/tcp_window_scaling is enabled with
     the value 1 (as it should be by default).

   - Play with the buffer sizes of netperf or your favorite net tester.

Note: These tuning options are not required for Linux 2.4.

===================
IV. Fork() Support
===================

As of gm-1.5.2 and later, GM has full support for fork() under Linux.
It works for all processor families.  There are no restrictions; GM can
fork() with or without a GM port open.
However, if the customer has a choice between using vfork() or fork(),
there will be better performance with vfork() since the time to fork a
process with vfork() is much shorter.

===============================================================
V. Sample Scripts to automatically load GM and start the Mapper
===============================================================

The directory {GM_HOME}/drivers/linux/scripts contains some sample
initialization scripts, contributed by customers, that can be
customized to suit your system to automatically load the gm driver and
start the GM Mapper.

======================================
VI. Operating-system-specific Caveats
======================================

---------------------------------------------------
a. Using Compaq Compilers for Alpha Linux (ccc cxx)
---------------------------------------------------

Under the C shell:

   setenv CC ccc
   setenv CXX cxx
   setenv CXXFLAGS \
     "-g -O2 -inline speed -x cxx -noexceptions -nocxxstd -using_std -w2"
   setenv CFLAGS -gcc_messages
   setenv KCC gcc
   rm -f config.cache
   ./configure

or under a Bourne shell or Bash:

   CC=ccc ; export CC
   CXX=cxx ; export CXX
   CXXFLAGS="-g -O2 -inline speed -x cxx -noexceptions -nocxxstd"
   CXXFLAGS="${CXXFLAGS} -using_std -w2" ; export CXXFLAGS
   CFLAGS=-gcc_messages ; export CFLAGS
   KCC=gcc ; export KCC
   rm -f config.cache
   ./configure

----------------------
b. PCI Chipset Tweaks
----------------------

In the file:

   {GM_HOME}/drivers/linux/gm/gm_arch.c

If you have an i840 chipset, modify the flag to be

   #define GM_INTEL_840 1

There are similar defines for:

   #define GM_INTEL_860 1
   #define GM_21154 1
   #define GM_INTEL_450NX 1
   #define GM_KT266A 1

Also from this file, please read this warning:

/****************** PCI CHIPSET TWEAKS: WARNING *************************
 *                                                                      *
 * The patches below were supplied by customers who reported that       *
 * their PCI performance was improved when using these patches          *
 * on a particular chipset.                                             *
 * These patches tweak certain bits in the chipset and have not been    *
 * verified or reviewed by Myricom and may have other, possibly         *
 * negative, side-effects. Before applying one of these patches,        *
 * you may wish to check for a newer BIOS for your machine.             *
 * Also, a newer linux kernel may provide better PCI performance,       *
 * and might be a safer course of action than applying one of           *
 * these patches.                                                       *
 *                                                                      *
 * Use these patches at your own risk.                                  *
 *                                                                      *
 ***********************************************************************/

--------------------------------------------------
c. APIC IRQ conflict on Tyan and AMD motherboards
--------------------------------------------------

We have encountered APIC IRQ conflicts on several Tyan and AMD
motherboards.  The installation of GM will fail with an error message
similar to the following:

   GM: LANai rate set to 198 MHz (max=2-2MHz)
   GM: Board 0 page hash cache has 32768
   GM: Allocated IRQ 11
   GM: NOTICE: GM: board interrupt (configured on IRQ 11) is not working
   GM: NOTICE: GM: Failed to initialize Myrinet Card
   GM: gm: driver unloading
   GM: WARNING: GM: No Board Initialized
   #############################
   Error Installing GM driver module
   #############################

or

   GM: Version 1.5.2.1_Linux build 1.5.2.1_Linux xxxh@xxx.xx.xx Fri Jul 19 14:03:17 EDT 2002
   GM: NOTICE: GM: Module not compiled from a real kernel build source tree
   GM: This build might not be supported.
   GM: Highmem memory configuration:
   GM: PAGE_ZERO=0x0, HIGH_MEM=0x3ff80000, KERNEL_HIGH_MEM=0x38000000
   GM: Memory available for registration: 224748 pages (877 MBytes)
   GM: MCP for unit 0: L9 4K (new features)
   GM: LANai rate set to 133 MHz (max = 134 MHz)
   GM: Board 0 page hash cache has 32768 bins.
   GM: Allocated IRQ5
   GM: NOTICE: GM: Board interrupt (configured on IRQ 5) is not working.
   GM: NOTICE: GM: Failed to initialize Myrinet Card
   GM: gm: driver unloading

The IRQ error message means that the driver asked the Myrinet NIC to
raise the interrupt assigned by the BIOS, to check that it is working,
and did not receive it within the expected timeout.  The driver
therefore cannot use the Myrinet board and exits from the
initialization.

The most frequent cause for this problem is:

   * The interrupt lines are managed by an APIC (Advanced Programmable
     Interrupt Controller) chipset and it is not supported correctly by
     the BIOS and/or by the current Linux kernel.

Possible solutions:

   1. Try a different PCI slot.
   2. Upgrade the BIOS.
   3. Upgrade the Linux kernel version if available.
   4. Boot the Linux kernel without APIC support; pass the noapic
      option to the kernel at the LILO boot prompt.  In this case, the
      kernel will use a safer compatibility mode.  It is important to
      note that if this error occurs on any node in the cluster, all
      nodes in the cluster should be booted with noapic.

Refer to the Myrinet FAQ entry "GM_INSTALL or gm_install_drivers fails.
What does this error message mean?" for further details.

---------------------------------
d. AGP (nVidia and ATI) conflicts
---------------------------------

Two types of problems were reported.

1. If the GM module is loaded first and the nVidia or ATI module
   afterwards, everything works.  If the nVidia or ATI module is loaded
   first, GM will not load.

The GM_INSTALL error message looks like:

   n03 135# ./GM_INSTALL
   Making device files in /dev.
   ifconfig myri0 down - in case it was up
   myri0: unknown interface: No such device
   Adding new GM driver.
   sbin/gm: init_module: No such device
   Hint: insmod errors can be caused by incorrect module parameters,
   including invalid IO or IRQ parameters
   **** Error installing GM driver module. ****

and then in the kernel log, you see something like:

   GM: Version 1.5.2_Linux build 1.5.2_Linux x@x Wed Aug 21 16:17:08 PDT 2002
   GM: NOTICE: GM: Module not compiled from a real kernel build source tree
   GM: This build might not be supported.
   GM: Highmem memory configuration:
   GM: PAGE_ZERO=0x0, HIGH_MEM=0x7fff0000, KERNEL_HIGH_MEM=0x38000000
   GM: Memory available for registration: 451752 pages (1764 MBytes)
   GM: NOTICE: GM: pci_rev2: Could NOT map board into kernel (span = 0x1000000)
   GM: WARNING: GM: Can't map IO memory to system memory
   GM: NOTICE: GM: gm_instance_init failed
   GM: NOTICE: GM: Failed to initialize Myrinet Card
   GM: gm: driver unloading
   GM: WARNING: GM: No board initialized

This is a shortage of kernel virtual memory (used for I/O-mapping PCI
memory).  On configurations with a lot of physical memory, Linux
reserves only 128 MB of address space for dynamically allocated virtual
memory.  Unfortunately the nVidia card seems to consume as much virtual
memory as it can (it occupies at least 128 MB in PCI memory space), so
if you load its module before the gm module on such a configuration,
you will get the error shown above.

The fix, recommended for hosts with more than 768 MB of memory and an
nVidia or ATI card, is to apply the following patch to the kernel:

   --- arch/i386/kernel/setup.c Thu Aug 2 17:00:46 2001
   +++ arch/i386/kernel/setup.c.2 Thu Oct 11 09:00:59 2001
   @@ -815,7 +815,7 @@
    /*
     * 128MB for vmalloc and initrd
     */
   -#define VMALLOC_RESERVE (unsigned long)(128 << 20)
   +#define VMALLOC_RESERVE (unsigned long)(256 << 20)
    #define MAXMEM (unsigned long)(-PAGE_OFFSET-VMALLOC_RESERVE)
    #define MAXMEM_PFN PFN_DOWN(MAXMEM)
    #define MAX_NONPAE_PFN (1 << 20)

Also make sure that the HIGHMEM option is enabled when configuring the
kernel.

If you do not mind losing memory, or just want to run a test, you can
boot your current kernel with mem=768m to see if the problem
disappears.
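
The 768 MB figure follows from the kernel's address-space arithmetic.
The sketch below assumes the default i386 layout, in which PAGE_OFFSET
is 0xC0000000.  That leaves the kernel 1024 MB of virtual address
space, split between directly mapped memory (MAXMEM) and
VMALLOC_RESERVE.  The function name maxmem_mb is hypothetical.

```shell
#!/bin/sh
# Directly mapped memory limit (MAXMEM) in MB for a given
# VMALLOC_RESERVE in MB.
# Assumption: PAGE_OFFSET = 0xC0000000 (3072 MB), so the kernel owns
# 4096 MB - 3072 MB = 1024 MB of virtual address space.
maxmem_mb() {
    reserve_mb=$1
    kernel_space_mb=$((4096 - 3072))
    echo $((kernel_space_mb - reserve_mb))
}

echo "reserve 128 MB -> MAXMEM $(maxmem_mb 128) MB"
echo "reserve 256 MB -> MAXMEM $(maxmem_mb 256) MB"
```

Under this assumption the patched 256 MB reserve lowers MAXMEM from
896 MB to 768 MB, which is why memory above 768 MB then needs HIGHMEM
to remain usable.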

Refer to the Myrinet FAQ entry "GM_INSTALL or gm_install_drivers fails.
What does this error message mean?" for further details.

2. Overlapping of prefetch memory for the AGP and PCI bridges.

This was seen on an SGI Visual Workstation 550 machine with AGP cards
(nVidia Quadro, ATI Mach64 PCI graphics card, ATI Rage AGP).

What we see with them is that the prefetchable memory assigned by the
BIOS for the AGP and PCI bridges is overlapping.  This looks like a
BIOS problem and we have asked the customer to look into upgrading the
BIOS, or to play with the BIOS settings to attempt to get the BIOS to
do the right thing (things to try: toggling the plug-n-play OS setting,
changing the size of the AGP graphics aperture, reinitializing or
re-detecting the PCI space in the configuration space, etc.)

Specifically, it was seen that the memory for the Myrinet card is
mapped at exactly the same spot with the ATI Mach64 PCI graphics card
as it is with the ATI Rage AGP graphics card:

   03:01.0 Non-VGA unclassified device: MYRICOM Inc.: Unknown device 8043 (rev 03)
           Region 0: Memory at 82000000 (64-bit, prefetchable) [size=16M]

However, now look at the bridges leading to bus 3 (PCI where Myrinet
card is) and bus 1 (AGP) in the ATI Rage AGP config:

   00:01.0 PCI bridge: Intel Corporation 82840 840 (Carmel) Chipset AGP Bridge (rev 01) (prog-if 00 [Normal decode])
           Bus: primary=00, secondary=01, subordinate=01, sec-latency=64
           Prefetchable memory behind bridge: 82300000-850fffff

   00:02.0 PCI bridge: Intel Corporation 82840 840 (Carmel) Chipset PCI Bridge (Hub B) (rev 01) (prog-if 00 [Normal decode])
           Bus: primary=00, secondary=02, subordinate=03, sec-latency=0
           Prefetchable memory behind bridge: 81600000-831fffff

See how those prefetchable memory regions overlap?  And, more
importantly, see how the bridge to the AGP bus's prefetchable memory
region overlaps that of the Myrinet card?
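
The overlap can be confirmed by comparing the range endpoints
numerically.  This is a minimal sketch using the addresses from the
listing above.  The function name ranges_overlap is hypothetical, and
the Myrinet end address 82ffffff is derived from the 16M region that
starts at 82000000.

```shell
#!/bin/sh
# Report whether two inclusive hex address ranges overlap.
# $1,$2 = start and end of the first range; $3,$4 = second range
# (hex digits without the 0x prefix, as shown in the bridge listing).
ranges_overlap() {
    s1=$((0x$1)); e1=$((0x$2))
    s2=$((0x$3)); e2=$((0x$4))
    # Two ranges overlap when each one starts before the other ends.
    if [ "$s1" -le "$e2" ] && [ "$s2" -le "$e1" ]; then
        echo "overlap"
    else
        echo "disjoint"
    fi
}

# AGP bridge window vs. PCI bridge window
ranges_overlap 82300000 850fffff 81600000 831fffff
# AGP bridge window vs. Myrinet card region (16M at 82000000)
ranges_overlap 82300000 850fffff 82000000 82ffffff
```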

Note that the only prefetchable memory on the AGP bus is for the rage
card and that this memory is a small subset of the region the bridge is
claiming:

   01:00.0 VGA compatible controller: ATI Technologies Inc 3D Rage IIC AGP (rev 7a) (prog-if 00 [VGA])
           Region 0: Memory at 84000000 (32-bit, prefetchable) [size=16M]

This issue is now resolved.  You need to download BIOS version A9 from
the SGI website.

==================
VII. Miscellaneous
==================

------------------------------------
a. Uninstallation of the GM driver
------------------------------------

The gm_install_drivers script generates the script
/sbin/gm_uninstall_drivers, which can be used to uninstall the drivers.

The GM_INSTALL script generates the script /sbin/GM_UNINSTALL, which
can be used to uninstall GM.
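
For the boot-time links described in step 2 of section I, the commands
might look like the output of the sketch below.  The runlevel (3), the
S90 start prefix, and the /etc/rcN.d directory layout are assumptions
that vary between distributions, and the function name gm_rc_link_cmds
is hypothetical.  It prints the commands instead of running them so
they can be reviewed first.

```shell
#!/bin/sh
# Print (do not execute) the ln commands that would link the GM init
# scripts into one runlevel directory.
# $1 = runlevel (assumed default: 3)
# $2 = start-order prefix (assumed default: S90)
gm_rc_link_cmds() {
    runlevel=${1:-3}
    prefix=${2:-S90}
    for script in gm myri; do
        echo "ln -s /etc/init.d/$script /etc/rc$runlevel.d/$prefix$script"
    done
}

gm_rc_link_cmds 3
```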