How does the Mapper work in GM-2?

Table of Contents

Introduction

Myrinet is a source-routed network. I.e., each host must know the route to all other hosts through the switching fabric. It is the mapper's job to explore/determine the cluster's topology and to load routing tables into each host on the cluster. Exploring/Determining the topology is called mapping and loading routes is called route computation or route configuring. Mappers run on every host, although in the end only one mapper does the actual mapping. Once the map is created it is sent over the myrinet to the mappers on all hosts, and these mappers compute their hosts' routing tables. Once routing is completed, the mappers go into a phase of methodically testing parts of the network for changes in topology. This is called verify mode. When a change is detected, remapping occurs.

The basic mapping algorithm has not changed between GM-1 and GM-2. However, the GM-2 Mapper is scalable, and provides better routing and fault tolerance.

For GM-2, there is a mapper running as a user process for each Myrinet interface.

The GM-2 mapper and the GM-1 mapper do not inter-operate.

Installation

The Mapper software is available as part of the GM-2 distribution. As part of the GM installation process

/etc/init.d/gm start
will load the GM module (insmod), and start a mapper daemon called gm_mapper for each Myrinet interface contained in the host. The PIDs of the running gm_mappers are stored in /var/run/gm_mapper/pid.{board_id}.

Alternatively, if you choose to run the mapper by hand, gm_mapper is located in <install_path>/sbin/ and has the following options:

Usage:

./gm_mapper <options>
where valid <options> are as follows:
--daemon-pid-file=<path>  Run as a daemon, storing daemon PID at <path>.
--help                    Print this message and exit
--level=<level>           mapper level (default 1, 0 for never map).
--num-passes=<integer>    number of route computation passes (default 8).
--map-file=<path>         Store map files at <path>.
--printeur                Debugging feature. accept network printfs.
--thrash                  Another debugging feature. keep remapping.
--extra-hosts=<count>     A debugging feature. make artificially inflated map.
--pause                   for mute find_xbars: pause all other mappers
--map-once                terminate after mapping is complete everywhere.
--map-incrementally       re-map incrementally.
-B <Board>,
-b <board>,
--unit=<board>            board number.
--non-clos                for safe routing on non-clos networks.
--no-xbar-check           skip xbar check.
-v,
--verbose                 verbose output.
--verbose-file=<path>     write verbose output to <path>

As soon as gm_board_info reports Network is fully configured, then the Mapper has completed the mapping process and communication over Myrinet can begin.

All Myrinet nodes must have a mapper running on them. Nodes without mappers running on them are invisible as far as mapping is concerned.

Overview

Mapping

The mapping algorithm is a simple breadth-first-search exploration of the cluster. The mapper sends special mapper messages. These messages are small, unreliable, and have their own routes appended to them; in other words they bypass the reliability and routing provided by the underlying transport (GM, MX, etc). To determine whether a port on a crossbar (xbar) is connected, and, if so, to what it is connected, the mapper sends mapper messages to the port. First, the mapper sends a switch probe message with a loopback route on it: if the port is connected to another xbar, the message will return to the mapper. Otherwise after a timeout the mapper will try to see if the port is connected instead to a host. To test this, the mapper sends a scout message to that port: if there's a host connected, and a mapper running on that host, that remote mapper will reply with a scout reply message. The first mapper will receive this message, and add the host to its map.

Routing

Each mapper computes its host's routes to all other hosts. The algorithm for this process is essentially Dijkstra's algorithm with link costs incremented for each route that goes across them. There is no coordination between mappers in computing routes. To insure even link usage (also called dispersive routing) the mapper relies on randomness in the order of exploring links. The mapper computes multiple routes from a host to all other hosts. The number of routes computed per host is a parameter that can be varied by the user. The underlying transport uses multiple routes for dispersive routing which helps for fault tolerance and hot spot avoidance. Deadlocks are not a problem with Clos networks. Mappers running on non-Clos networks compute deadlock-free routes with up*/down*.

Scout Messages

Scout messages are the small messages exchanged between mappers. There are different types of scout messages, and the type is encoded in the reason field of the scout message. The valid types are:

and their meanings are described in the following sections.

Like all mapper messages, scout messages are unreliable. Timeouts and a limited number of retries (LX_*_TRIES) give some semblance of reliability, but in pathological cases, when there is limited MCP support or congestion, the mapping algorithm reacts gracefully.

Stages of the Mapping Process

Leader Election and the Provisional Tree

Ultimately only one mapper maps. This mapper is called the global mapper or the winner or the active mapper. Determining which mapper is the global mapper is called the leader election. Mappers are ranked by mac address and a user-settable integer called level (--level=). The mapper with the highest level becomes the global mapper; in case of ties in level, the mapper with the higher mac address becomes the global mapper. Initially all mappers start mapping the network. When two mappers discover each other, the one with the lower level/mac address becomes passive.

When a mapper becomes passive it retains information about the higher mapper -- its mac address and the route to it. The lower mapper will periodically (every second) ping the higher mapper with a scout message (LX_QUERYING_MAP_VERSION) to ensure that the higher mapper still exists. As long as the higher mapper responds, the lower mapper knows that it is still connected to the network, and that eventually it will get the global map. When the leader election completes, the entire network will be arranged into a kind of tree, with lower mappers verifying that their higher mappers (a.k.a., parents) are still alive. This tree is provisional. The structure of this tree depends upon:

At the leaf level of a Clos network, you will have 7 hosts becoming passive to the highest host on these hosts' xbar. This higher host will in turn become passive to some still higher host on some other xbar in the network. We prefer a binary tree because it makes it easier to guarantee receive buffering between mappers. Dropped scouting messages due to lack of receive buffers (say a dozen hosts happen to share the same parent and all simultaneously ask the parent, are you still there?) or because of some timing misfortune may result in one child's scout messages being dropped after multiple (30?) retries. This child will incorrectly conclude that the map has changed, and it will start remapping. However, it will eventually collide/encounter with a higher mapper and become passive again.

Some guarantees could be made about the structure of the provisional tree if mappers all started at the same time. This could be accomplished by having each mapper sleep a bit before it starts mapping. (During the sleep they would respond to scout messages).

Map Distribution and the Second Tree

When the global mapper finishes mapping, it starts constructing a new tree of higher-lower mapper dependencies. This tree is binary. Each interior node is a mapper responsible for delivering the complete map to two children. The tree is arranged by partitioning a sorted (mac address / level) list of all mappers. In the obvious manner. n is the parent of 2n and 2n+1.

This second tree replaces the provisional winner-loser tree when the global mapper (the root of the tree) sends scout messages (LX_TREE) to each of its children.

When a child learns who his real parent is, he abandons his provisional parent, and instead asks the new parent if there is a new map to collect (LX_QUERYING_MAP_VERSION).

A map version consists of the mac address of the global mapper, and a counter (map_version). This counter begins at a random number and increments when a new map is made. If counter=0, this value is reserved to mean no valid map. The random starting value tries to detect the case where a mapper restarts.

A parent has a new map if its map version mac_address or map_version has changed (and if map_version is nonzero).

Since a global map can exceed the underlying transport's MTU, the map is distributed in pieces. It is the child's job to ask for and wait for each piece. As usual with mapper messages, timeouts and retries are used to obtain some pretense of reliability. If this fails, however, the child simply gives up and starts mapping again. If the network really hasn't changed, this child will become passive again, re-learn its parent, and try fetching the map again.

Each global map piece contains about 70 host or 20 xbar definitions.

Route Computation

Once a mapper has a map, the next step is for it to compute routes. There's little to say on this that wasn't in the overview. It's a repeated Dijkstra's algorithm, where one shortest path breadth-first-search of the graph is performed for each set of routes. Multiple routes are computed for route dispersion; each host computes repeated one-to-all routes. Randomness is used to pretend that globally (all-to-all) routes are evenly distributed across the network. Specifically this randomness is the order in which ports are explored. To spread out routes in the local sense (taking into account only one host's routes), the search is done more than once for every set of routes. Within this inner loop of searching, routes are assigned to hosts periodically. It's up to the underlying transport (GM, MX, etc.) to decide how to handle the case of hosts being removed from the map -- hosts that occurred in the previous map not be in the new map.

Due to space constraints there's a limit on the longest possible route: 11 hops. It is a fatal user error if the network diameter exceeds this dimension.

In addition to providing the underlying transport with the mac address and routes to each host, a host_type integer is also provided. This integer defines what kind of transport the host is running (GM, XM, MX). Hostnames are not collected by the mapper.

Verify Mode

Once a mapper has configured its host with routes, it enters verify mode. The purpose of verify mode is to detect a change in the network topology. In the background, the mapper will still respond to requests from its children for the global map, and (as always) it will reply (in the background) to all sorts of scout messages, but its primary function is verification.

Verify mode occurs in an infinite loop with a pause of 1 second at the end of each iteration. The default value of this pause is 1 second. The pause is to make the mapper less intrusive on the host. A missing xbar-xbar link or a missing host or an added host or an added xbar-xbar link or a changed link -- all of these will result in the mapper marking his map as invalid and exiting verify mode. Another thing the mapper checks for is that the hosts it is verifying (kids and parents) have valid maps. In this way the knowledge of the network's change propogates up to the root of the tree (global leader) which then re-maps the network.

If the change was a missing parent, the mapper cannot rely on the tree to trigger remapping again. Instead there is no choice, but to start to map from scratch. This starts another leader election.

Verify mode consists of three phases.

Mapping Algorithm Details.



Last updated: 23 October 2003

Home | Mail for Product Information | Mail to Tech Support