A Graphical, Diagnostic, Monitoring Tool for Myrinet-2000 M3-E* Networks
Mute is a graphical diagnostic monitoring tool for Myrinet-2000 M3-E* networks (switch(es), cables, and hosts). It exercises the monitoring capabilities of the Myrinet-2000 M3-E* switches (M3 Switch Tools, m3-dist.tar.gz), as well as the Mapper Tools (located in the mt subdirectory of the GM distribution). Mute builds an image/picture of the Myrinet-2000 Network, and can be used to non-intrusively monitor the Myrinet-2000 network in real time, analyzing the network traffic, and diagnosing/validating the integrity of the hardware components. Mute works only for Myrinet-2000 M3-E* switches equipped with a monitoring line card. The graphical interface uses GNOME/GTK, which is currently only available for Linux.
Note: The notation M3 denotes third generation Myrinet, a.k.a. Myrinet-2000.
Some of the features of Mute are:
Commonly-asked questions about Mute can be found on the FAQ.
Screenshots of Mute can be found at: http://www.myri.com/staff/finucane/mute.
In order to install Mute, the following requirements must be met:
Myrinet-2000 M3-E* switch(es) equipped with a monitoring line card in each switch.
Note: The monitoring line card must also be installed in each switch. Installation instructions are available.
The following instructions assume that you have Gnome/Gtk installed (or are willing to install Gnome/Gtk) on a Myrinet node in the cluster. If this is not the case, then you will need to follow these instructions for the installation of Mute.
http://www.myri.com/ftp/pub/m3-dist.tar.gz
gunzip -c m3-dist.tar.gz | tar xvf -
If you're using m3-dist version 1.0.14 or later, you need to add -DFOR_GM1 to the m3-dist/makefile as instructed in the following FAQ entry.
http://www.myri.com/ftp/pub/mute-1.9.6.tar.gz
gunzip -c mute-1.9.6.tar.gz | tar xvf -
If you're using Mute 1.9.6 or later with GM-1, you need to also specify the -DFOR_GM1 in the mute-1.9.6/makefile as instructed in the following FAQ entry.
You will need to install the GNOME/GTK package if it is not already available.
gnome-libs-devel gtk+-devel gtk+
apt-get install libgnome-dev
cd {GM_SRC_HOME}/binary
mkdir lib
ln .gm_uninstalled_libs/lib/.libs/libgm.a lib/libgm.a
cd ../mt
make all gm
where GM_SRC_HOME specifies the directory where GM was compiled.
If you're using a version of GM-1 prior to gm-1.6.3, you can skip the mkdir and ln steps and only need to do the following:
cd {GM_SRC_HOME}/mt
make all gm
cd $HOME/m3-dist
make gmdir={GM_SRC_HOME} mtdir={GM_SRC_HOME}/mt host-no-snmp
where GM_SRC_HOME specifies the directory where GM was compiled.
cd $HOME/mute
make gmdir={GM_SRC_HOME} gminstalldir=<install_path> m3dir=$HOME/m3-dist
where GM_SRC_HOME specifies the directory where GM was compiled, and <install_path> is the directory where the GM binaries and libraries are installed. If you're using gm-1.5.2.1 or earlier, there is no need to specify the gminstalldir= path.
Note: If you're using Gnu gcc 3.2 or later, you will need to change all instances of gcc to g++ in the m3-dist/makefile and the mute/makefile.
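The gcc-to-g++ substitution described in the note above can be previewed with sed before the real makefiles are touched. This is only a minimal sketch: the function name preview_gxx is made up for illustration, and it blindly replaces every occurrence of the string gcc, so the output should be reviewed before it overwrites a makefile.

```shell
# Hypothetical helper (not part of m3-dist or Mute): print a makefile with
# every occurrence of "gcc" replaced by "g++", leaving the file untouched.
preview_gxx() {
    sed 's/gcc/g++/g' "$1"
}
```

For example, `preview_gxx $HOME/m3-dist/makefile > makefile.new` writes the modified copy to a new file that can be compared against the original.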
Refer to {GM_SRC_HOME}/mt/README, $HOME/m3-dist/README, and $HOME/mute/README for detailed instructions on how to build mt, m3-dist, and mute.
You can now proceed to Building the Cluster's Image with Mute.
If Gnome/Gtk is not available on a compute node in the Myrinet cluster, you can install Mute on another machine in your local network that has Gnome/Gtk available and has IP access to the monitoring line card in the switch(es). I.e., this machine must be able to ping each of the Myrinet switches in the cluster.
Compile GM, Mapper Tools, Switch Tools, and Mute on this machine in the local network, as detailed in these instructions.
Compile the Mapper Tools and the Switch Tools on one of the compute nodes in the cluster, and then generate the needed Mute files (mute.map, mute.hosts, mute.switches, mute.xbars).
cd <GM_install_path>/sbin
cp mapper.map $HOME/mute/mute.map
cp mapper.hosts $HOME/mute/mute.hosts
mute.switches contains the IP address of each Myrinet switch in the cluster, one IP address per line.
mute.xbars is a list of Myrinet-2000 switch MAC addresses, port correction numbers, and routes to the xbar. (A route from the root host of the map file uniquely defines every xbar).
   mac address    .-----------xbar id
        |         |
        v         v
00:60:dd:7f:8d:88:1:0 <----------------------------- port correction
-2 <------------------------- route to xbar.
The first xbar from the root host has an empty route, so its route line is blank.
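To make the layout easier to read, the records can be split into labeled fields. The sketch below is an assumption based only on the example above: it treats mute.xbars as two-line records, a header line of the form mac:xbar id:port correction followed by a route line (blank for the first xbar). The function name parse_xbars is hypothetical and is not one of the Switch Tools.

```shell
# Hypothetical helper (not part of m3-dist): label the fields of a
# mute.xbars-style file. Assumes two-line records:
#   line 1: <6-byte mac>:<xbar id>:<port correction>
#   line 2: route to the xbar (blank for the first xbar)
parse_xbars() {
    awk '
        NR % 2 == 1 {
            # Header line: the first six colon-separated fields are the MAC.
            split($1, f, ":")
            mac = f[1]
            for (i = 2; i <= 6; i++) mac = mac ":" f[i]
            xbar = f[7]
            corr = f[8]
            next
        }
        {
            # Route line: a blank line means the empty route.
            route = ($0 == "") ? "(empty)" : $0
            printf "mac=%s xbar=%s correction=%s route=%s\n", mac, xbar, corr, route
        }
    ' "$1"
}
```

Run as `parse_xbars mute.xbars`. If the real file deviates from the two-line assumption, the output will be misaligned, which is itself a useful hint that the file should be inspected by hand.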
mute.xbars is generated using the find_xbars and xbars4mute tools located in the Switch Tools (m3-dist) distribution. The syntax of these commands is as follows:
find_xbars 0 mute.map <host> <switchnames>
xbars4mute mute.map <host> <find_xbars_output>
E.g.,
cd $HOME/m3-dist/intel_linux
find_xbars 0 ~/mute/mute.map falbala01 falbala-switch | xbars4mute ~/mute/mute.map falbala01 > mute.xbars
Transfer these 4 Mute files to the Mute directory located on the machine in the local network. Invoke Mute on the Gnome/Gtk-enabled machine with the -w option to specify the directory where these Mute files are located, and it will build the image of the cluster using these Mute files.
mute -w $HOME/mute
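Before invoking Mute on the transferred files, it can help to confirm that all of them actually arrived in the working directory. The following is a minimal sketch; missing_mute_files is a hypothetical helper, not a Mute command, and the list of expected files is passed in by the caller.

```shell
# Hypothetical pre-flight check (not part of Mute): print the name of each
# expected file that is absent from the given working directory.
missing_mute_files() {
    dir=$1
    shift
    for f in "$@"; do
        [ -f "$dir/$f" ] || echo "$f"
    done
}
```

For example, `missing_mute_files $HOME/mute mute.map mute.hosts mute.switches mute.xbars` prints nothing when all four files are present.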
You can now proceed to Building the Cluster's Image with Mute.
Note: If the topology of the cluster changes, you will need to regenerate the mute.map and mute.xbars files on a node in the cluster and transfer these files to your Mute directory and rebuild the image of the cluster.
This mode is useful for real-time non-intrusive monitoring of traffic on the Myrinet network, as well as analysis of the static image of the cluster.
The following instructions assume that you have Gnome/Gtk installed (or are willing to install Gnome/Gtk) on a Myrinet node in the cluster. If this is not the case, then you will need to follow these instructions for the installation of Mute.
Important Note: Mute is not yet supported on GM-2.1.1 with M3F2-PCIXE-2 interfaces.
http://www.myri.com/ftp/pub/m3-dist.tar.gz
gunzip -c m3-dist.tar.gz | tar xvf -
http://www.myri.com/ftp/pub/mute-1.9.6.tar.gz
gunzip -c mute-1.9.6.tar.gz | tar xvf -
You will need to install the GNOME/GTK package if it is not already available.
gnome-libs-devel gtk+-devel gtk+
apt-get install libgnome-dev
cd {GM_SRC_HOME}/binary
mkdir lib
ln .gm_uninstalled_libs/lib/.libs/libgm.a lib/libgm.a
cd ../mt
make all gm
where GM_SRC_HOME specifies the directory where GM was compiled.
cd $HOME/m3-dist
make gmdir={GM_SRC_HOME} mtdir={GM_SRC_HOME}/mt host-no-snmp
where GM_SRC_HOME specifies the directory where GM was compiled.
cd $HOME/mute
make gmdir={GM_SRC_HOME} gminstalldir=<install_path> m3dir=$HOME/m3-dist
where GM_SRC_HOME specifies the directory where GM was compiled, and <install_path> is the directory where the GM binaries and libraries are installed.
Note: If you're using Gnu gcc 3.2 or later, you will need to change all instances of gcc to g++ in the m3-dist/makefile and the mute/makefile.
killall gm_mapper
cd <install_path>/sbin/
./gm_mapper -v --level=10 --pause --map-file=$HOME/mute/mute.map
Control-C the mapper by hand when it finishes mapping and starts verify mode:
...
14 3,-7,1,7
h6 3,-7,2,7
h31 -5,-12,6 <---computing routes
h23 -5,-10,6
h15 3,-6,-7,7
h7 3,-6,-6,7
h39 3,-6,-5,7
map version is now 1929343636
h0 checking hosts                <------- entering verify mode
checking host h1
checking host h2
checking disconnected link -15 on x0
checking for new hosts on x0
map version 0:60:dd:7f:3b:bf 1929343636
43 hosts and 17 xbars
I am h0
verifying again
h0 checking hosts
checking host h1
checking host h2                 <-------- control-C anywhere around here's fine
OPTIONAL: If you are using GM-2.0.6 or later and you would like to see the hostnames in your Mute display, you will need to perform the following step. (This extra step is necessary because the GM-2 mapper does not know about hostnames.)
After you have a map file (mute.map), the next step is to convert its logical hostnames (h0, h1, h2, etc.) into real hostnames (e.g., falbala1, falbala2, etc.)
Available in GM-2.0.6 and later, this conversion tool is called board_names (<GM_SRC_HOME>/mt/tools/board_names.c). It uses the mac address to hostname routing table information output by gm_board_info (e.g.,)
1 00:60:dd:7f:7b:a5 falbala1 (this node) (mapper)
2 00:60:dd:7f:7b:76 falbala2 81
3 00:60:dd:7f:7b:82 falbala3 ba
and then searches through the map file (mute.map) for each host by its mac address (e.g.,)
h - h0
1
0 s - x0 14
address 00:60:dd:7f:7b:a5 <------- mac address
hostType 1
and replaces the hostname.
h - "falbala1" <------------------- replaced
1
0 s - x0 14
address 00:60:dd:7f:7b:a5
hostType 1
By default, the new map file goes to stdout. This output should be saved and used as your mute.map file in order to see hostnames on the Mute display.
For example (assuming mute.map is the original map file)
cd <install_path>/bin/
./gm_board_info > board_info.out
cd <GM_SRC_HOME>/mt/tools/intel_linux/
./board_names mute.map board_info.out > map.with.hostnames.map
mv map.with.hostnames.map mute.map
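After the conversion, a quick way to see whether any hosts kept their logical names is to count the host lines that still look like h0, h1, and so on. This is a sketch under an assumption: it expects each host entry in the map file to begin with a line such as `h - h0` (unconverted) or `h - "falbala1"` (converted), as in the example above. The function name count_logical_hosts is hypothetical.

```shell
# Hypothetical check (not part of GM): count host entries in a map file
# that still carry a logical name such as h0 or h12.
# Assumes host entries start with a line of the form: h - <name>
count_logical_hosts() {
    # grep -c exits non-zero when the count is 0, so mask the status.
    grep -c -E '^h - h[0-9]+' "$1" || true
}
```

`count_logical_hosts mute.map` printing 0 suggests every host was matched by MAC address. A non-zero count may simply mean that those hosts did not appear in the gm_board_info output.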
Create the mute.switches file. mute.switches contains the IP address of each Myrinet switch in the cluster, one IP address per line.
mute -w $HOME/mute
In the Build window check Find Xbars, Find Loopbacks, and Write Routes only. Uncheck everything related to mapping and build. When Find Xbars terminates, the GM mapper port will close and you can run mappers again without killing Mute.
You can now proceed to Building the Cluster's Image with Mute.
Rerun the mapper without --pause, to unpause everyone:
cd <install_path>/sbin/
./gm_mapper -v --level=10 --map-once
Note: The --level=10 option is important for both runs of the mapper (pausing and unpausing): you need to make sure this mapper has the highest priority. The default level for GM mappers is 1, so 10 is fine; 2 would also have worked.
Let --map-once terminate the mapper. Now all the other mappers are awake again.
Refer to {GM_SRC_HOME}/mt/README, $HOME/m3-dist/README, and $HOME/mute/README for detailed instructions on how to build mt, m3-dist, and mute.
If Gnome/Gtk is not available on a compute node in the Myrinet cluster, you can install Mute on another machine in your local network that has Gnome/Gtk available and has IP access to the monitoring line card in the switch(es). I.e., this machine must be able to ping each of the Myrinet switches in the cluster.
Compile GM, Mapper Tools, Switch Tools, and Mute on this machine in the local network, as detailed in these instructions.
Compile the Mapper Tools and the Switch Tools on one of the compute nodes in the cluster, and then generate the needed Mute files (mute.map, mute.switches, mute.xbars).
killall gm_mapper
cd <GM_install_path>/sbin
./gm_mapper -v --level=10 --pause --map-file=$HOME/mute/mute.map
If you are using GM-2.0.6 or later and you would like to see the hostnames in your Mute display, you will need to perform the following step. (This extra step is necessary because the GM-2 mapper does not know about hostnames.)
After you have a map file (mute.map), the next step is to convert its logical hostnames (h0, h1, h2, etc.) into real hostnames (e.g., falbala1, falbala2, etc.)
For example (assuming mute.map is the original map file)
cd <install_path>/bin/
./gm_board_info > board_info.out
cd <GM_SRC_HOME>/mt/tools/intel_linux/
./board_names mute.map board_info.out > map.with.hostnames.map
mv map.with.hostnames.map mute.map
mute.xbars is a list of Myrinet-2000 switch MAC addresses, port correction numbers, and routes to the xbar. (A route from the root host of the map file uniquely defines every xbar).
   mac address    .-----------xbar id
        |         |
        v         v
00:60:dd:7f:8d:88:1:0 <----------------------------- port correction
-2 <------------------------- route to xbar.
The first xbar from the root host has an empty route, so its route line is blank.
mute.xbars is generated using the find_xbars and xbars4mute tools located in the Switch Tools (m3-dist) distribution. The syntax of these commands is as follows:
find_xbars 0 mute.map <host> <switchnames>
xbars4mute mute.map <host> <find_xbars_output>
E.g.,
cd $HOME/m3-dist/intel_linux
find_xbars 0 ~/mute/mute.map falbala01 falbala-switch | xbars4mute ~/mute/mute.map falbala01 > mute.xbars
Transfer these 3 Mute files to the Mute directory located on the machine in the local network. Invoke Mute on the Gnome/Gtk-enabled machine with the -w option to specify the directory where these Mute files are located, and it will build the image of the cluster using these Mute files.
mute -w $HOME/mute
You can now proceed to Building the Cluster's Image with Mute.
Note: If the topology of the cluster changes, you will need to regenerate the mute.map and mute.xbars files on a node in the cluster and transfer these files to your Mute directory and rebuild the image of the cluster.
This mode is useful for real-time non-intrusive monitoring of traffic on the Myrinet network, as well as analysis of the static image of the cluster.
Usage:
mute [options]
  -w, --working-directory=WORKING_DIRECTORY    path to working directory
  -f, --firmware-directory=FIRMWARE_DIRECTORY  path to firmware directory
  -t, --max-threads=MAX_THREADS                max thread count
  -h, --event-history=EVENT_HISTORY            event history length
The first step in using Mute is to build an image/picture of the cluster.
Before you invoke Mute, you should create a mute.switches file in the working directory. This file is a list of monitoring card IP addresses, one address per line. This file used to be created automatically by the "Find Switches" feature, but this feature fails with most customers, so it has been permanently disabled.
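Since the file is now written by hand, a small format check can catch typos before Mute is started. The sketch below is illustrative only: invalid_switch_lines is a hypothetical helper, and it assumes every line must be a dotted-quad IPv4 address with nothing else on it. It does not verify that each octet is at most 255 or that the address is reachable.

```shell
# Hypothetical helper (not part of Mute): list the lines of a
# mute.switches-style file that are not a bare dotted-quad IPv4 address.
# Output format is "<line number>:<offending line>"; no output means the
# file passed this (purely syntactic) check.
invalid_switch_lines() {
    grep -n -v -E '^[0-9]{1,3}(\.[0-9]{1,3}){3}$' "$1" || true
}
```

Run as `invalid_switch_lines mute.switches`. Reachability of each monitoring card can then be checked separately, for example with ping.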
To run Mute, type
su root
cd $HOME/mute
./mute
This command will result in two windows appearing -- a Build Network Graph window
and an empty Myricom Mute 1.8 window.
Because this is the first time Mute is being run, it does not allow you to un-check the boxes labeled Run Mapper, Find Loopbacks, Write Routes, or Find Xbars in the Build Network Graph window. These operations must be performed the first time you run Mute in order for the initial picture/image of the cluster to be generated.
Once you have checked/un-checked the desired boxes in this Build Network Graph window, click on Build to generate the image/picture of the cluster. The following output will appear in the Build Network Graph window.
If the above output does not appear, then refer to the Troubleshooting Section for guidance.
Upon successful completion of this Build process, the image/picture of the cluster will appear in the Myricom Mute 1.8 window (as depicted below),
and the four configuration files (mute.map, mute.hosts, mute.switches, and mute.xbars) will have been created in the working directory, along with the optional mute.routes (written if the Write Routes box is checked) and mute.state files.
If the cluster's image does not appear in the Myricom Mute 1.8 window, or is only partially drawn, then refer to the Troubleshooting Section for guidance.
Subsequent executions of Mute will allow the boxes in the Build Network Graph window to be un-checked (operation not performed) if you wish to use existing Mute configuration files. (To accomplish this you could run Mute with the -w runtime option and specify the directory where the Mute configuration files can be found, or run Mute in the same working directory where the configuration files are located.)
In this case, the output in the Build Network Graph would look like the following:
Current cautions and common problems:
Scenario 1:
If errors appear in the Build Network Graph window, does the output look like the following?
If yes, then there are two possible explanations:
Find Xbars sends invalid routes to each xbar and checks the invalidRoute counter to see which Myrinet-2000 M3-E* switch xbar each invalid route caused. If there is a mapper running it will prevent find_xbars from working because it will generate its own invalid routes as part of the mapping process.
Scenario 2:
Do you see Permission denied messages in the Build Network Graph window?
To build an image of your cluster, Mute executes three steps.
As each of these steps is completed, diagnostic information will scroll in the Build Network Graph, and Mute will create the following six files in the working directory where it builds its image of the cluster.
You can edit the Mute configuration files by hand if you like. They are humanly readable ascii files.
mute.xbars is a list of Myrinet-2000 switch mac addresses, port correction numbers, and routes to the xbar. (A route from the root host of the map file uniquely defines every xbar).
   mac address    .-----------xbar id
        |         |
        v         v
00:60:dd:7f:8d:88:1:0 <----------------------------- port correction
-2 <------------------------- route to xbar.
The first xbar from the root host has an empty route, so its route line is blank.
lynx -source 206.117.208.81/all > 206.117.208.81.html
How do I isolate the cause of a high badcrc count?
Isolating the cause of a high badcrc count is an iterative process.
The procedure detailed below can also be performed by hand using the steps outlined in the Troubleshooting section of the FAQ.
After building the cluster's image, the following steps should be performed:
If you left-click on the magnifying glass icon and then left-click on the switch, the image is enlarged so that you can more clearly see the names of the hosts connected to the specific ports on the switch.
You could then use File-->Save Positions to save the desired resizing of the images for the next invocation of Mute.
From this output information, we can see that the cable connecting the host brutus06 and port 8 on the lowermost line card has the highest badcrc count.
Note:
And to toggle back to the original view of the switch, just click on View-->Show Insides again, and it will turn off this view.
Also refer to the FAQ entry "Are the counters (badcrcs) reported by gm_counters on each machine the same as I see in Mute?". Mute only reports badcrc information from the switch counters. It is important that you also check for badcrcs in the host counters (gm_counters output).
How do I check the temperature of the switches?
After building the cluster's image, select View-->Show Temperatures
and the switch will appear in color as follows:
To determine the meanings of these colors, you can then select Windows-->Legend
and the following Legend of colors will appear.
All temperatures reported in the Legend are in degrees Celsius. The highest temperature is represented by the color red and the coolest temperature is represented by the color blue. The colors are relative, so it is important to refer to the Legend to see what range of temperatures corresponds to the color red. A component shown in red is not necessarily a cause for alarm.
Typical operating temperatures for the components of the switch are in the range of 30-39 degrees Celsius. If you see temperatures in the range of 50-55 degrees Celsius, the switch is too hot and shutdown may occur. (Shutdown occurs at 55 degrees Celsius and operation resumes at 50.) Check that the switch has proper ventilation.
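The thresholds in the paragraph above can be summarized as a small lookup. This is only an illustrative sketch: classify_temp is a hypothetical helper that takes a whole-number temperature in degrees Celsius read off the Legend by hand. Mute itself is not known to expose such a command, and the label for values between the stated ranges is an assumption.

```shell
# Hypothetical helper (not part of Mute): classify an integer temperature
# in degrees Celsius using the ranges quoted in the text:
#   30-39 typical, 50-54 too hot, 55 and above shutdown.
classify_temp() {
    t=$1
    if [ "$t" -ge 55 ]; then
        echo "shutdown"
    elif [ "$t" -ge 50 ]; then
        echo "too hot"
    elif [ "$t" -ge 30 ] && [ "$t" -le 39 ]; then
        echo "typical"
    else
        # The text gives no guidance for these values.
        echo "outside typical range"
    fi
}
```

For example, `classify_temp 52` reports that the switch is too hot and that ventilation should be checked.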
How do I upgrade the firmware on the Myrinet-2000 M3-E* switch(es)?
You can use Mute to upgrade the firmware on the Myrinet-2000 M3-E* switch(es) in your network.
If your Myrinet-2000 M3-E* switch(es) are not running release-0.9.9.2 or later, you must first use Mute to upgrade the firmware before you can build the image of the cluster. Details of this process can also be found at "How do I upgrade the firmware on the monitoring line card?" on the FAQ.
Upgrading the firmware on the Myrinet-2000 M3-E* switch(es) is performed in the following four steps. If you have multiple switches, Mute will upgrade the firmware on all switches simultaneously.
mute -w <working dir> -f <firmware dir>
where <working dir> is the directory containing the mute.switches file (a list of IP addresses of all Myrinet-2000 M3-E* switches, one per line), and <firmware dir> is the directory containing the vxWorks_var.hex file (m3-dist directory).
This will program every Myrinet-2000 M3-E* switch with vxWorks_var.hex if the hex file is newer than the firmware in the switch. Reprogrammed switches will have their monitoring cards automatically rebooted.
Note: Alternatively, you could invoke mute with no options, close the Build window, and select Preferences and set the firmware directory to where the vxWorks_var.hex is located, for instance, the m3-dist directory, and working directory to the directory containing your mute.switches file. You would then follow steps 2-4 as outlined above to upgrade the firmware on the switch(es).
How do I run Mute in offline mode?
You can run Mute in offline mode on any host that has Gnome/Gtk installed.
*.html          lynx -source m3-switch-name/all
mute.hosts      mapper's mapper.hosts file
mute.map        mapper's mapper.map file
mute.switches   find_switches (or typically by hand)
mute.xbars      find_xbars
The *.html file for each Myrinet-2000 M3-E* switch is named with an IP address.
(e.g., lynx -source 206.117.208.81/all > 206.117.208.81.html)
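With several switches, the same lynx command has to be repeated once per address. The sketch below generates those commands from a mute.switches file so they can be reviewed before being run. snapshot_cmds is a hypothetical helper name, and the sketch assumes the file holds one IP address per line as described earlier.

```shell
# Hypothetical helper (not part of Mute): for each address in a
# mute.switches-style file, print the lynx command that saves that
# switch's "all" page to <address>.html. Nothing is executed here.
snapshot_cmds() {
    while IFS= read -r ip; do
        # Skip blank lines.
        [ -n "$ip" ] || continue
        printf 'lynx -source %s/all > %s.html\n' "$ip" "$ip"
    done < "$1"
}
```

After checking the output of `snapshot_cmds mute.switches`, the commands can be executed by piping them to a shell, for example `snapshot_cmds mute.switches | sh`.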
mute -w <working dir>
where <working dir> is the directory where these files are located.
You can then proceed to exploring and investigating the cluster's image remotely.
Can I use Mute to interactively monitor the traffic in the Myrinet network?
Yes. You can use Mute for "real-time" non-intrusive monitoring of traffic on the Myrinet network.
Select Windows-->Counters and a small window will appear where you can specify the monitoring of Bad Crcs or Good Crcs.
How do I interactively monitor the switch traps using Mute?
Using Mute, you can interactively view the switch traps that are being generated on each switch in the Myrinet network.
Select Windows-->Events and the Events window will appear. To start listening to the traps on the switches, click on the Listen toggle at the bottom of the Events window, and the events will start scrolling by.
If Log Events is enabled (File->Preferences dialog box), events will also be written to a file called mute.events in the working directory. The file is reopened and appended to every time the events Listen toggle is activated. The file is closed when the listening is stopped.
The Events window contains several columns of information. The columns are labeled Time, IP address, Event, Count, Part, and Index.
The Time column lists the current local time. The IP address lists the IP address of the switch reporting the specified trap. The Event column lists the switch trap that is being reported. A list of all available switch traps can be found in the following FAQ entry. The Count lists the number of times this trap has been generated. The Part specifies which component is generating the trap. Possible values of Part are:
Since there is an xbar16 on each line card, there are 16 xbarPorts on each switch line card. Eight of these xbar ports correspond to the numbered ports on the faceplate of the line card, and the remaining 8 xbar ports correspond to the connections to the backplane.
The xbarPorts numbered 1-16 will correspond to the topmost switch line card in the switch. xbarPorts 17-32 will correspond to the switch line card beneath that first switch line card, and so on.
These components denote the fiber transceiver for each port on the faceplate of the switch line card.
For example, let's say xbarPort 122 is generating lots of missedBeatTrap, and you would like to know where that xbarPort is located. The easiest way to determine this information is to go to the web interface, and select "all".
http://<switch_name>/all
where <switch_name> is the name or IP address of the switch (listed in the IP address column of the Events window). You can then do a Find (with your HTTP browser) for xbarPort 122. Once you have located the specific Part, you can then scroll up slightly and you will be able to see with which line card it's associated, as well as to which physical port number it's associated.
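As a rough cross-check of what the web interface shows, the numbering rule stated earlier (16 xbarPorts per switch line card, counted from the topmost card) can be turned into arithmetic. This is a sketch under an assumption: it presumes the numbering is contiguous across switch line cards, which may not hold for every chassis configuration, so the web interface remains the authoritative source. xbarport_location is a hypothetical helper name.

```shell
# Hypothetical helper (not part of Mute): estimate which switch line card
# an xbarPort belongs to, assuming 16 consecutive xbarPorts per line card
# with xbarPorts 1-16 on the topmost card.
xbarport_location() {
    port=$1
    card=$(( (port - 1) / 16 + 1 ))
    index=$(( (port - 1) % 16 + 1 ))
    echo "line card $card from top, xbarPort $index of 16 on that card"
}
```

For the xbarPort 122 example, `xbarport_location 122` points at the eighth switch line card from the top. It does not say whether that xbarPort faces the faceplate or the backplane, which still has to be read from the web interface.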
The Myricom Mute 1.8 window has four menu items -- File, Edit, View, and Windows, as well as positioning/resizing features such as point, zoom in/out, scroll, info, and control. For example,
Click on the magnifying glass to zoom in and out (left / right mouse click) and use the hand to slide the image around, and the arrow to move items. Drag or shift select multiple items.
The File menu button has the following options:
The Edit menu button has one option, Find, for locating ascii text strings in the image. For example, use Edit->Find to find switches or hosts by their names (ip, mac address, etc.).
The View menu button has the following options:
The Windows menu button has the following options:
Warnings:
Last updated: 19 May 2006