Estimated FIT Rates for
Current-Production Myrinet-2000 Components

Last significant revision: 3 January 2005

This page provides estimates that can be used for decisions about spares. Please see the notes below about the basis for and limitations on these estimates.

Definition of FIT Rates

At Myricom we prefer to use the Failure In Time (FIT) method of specifying component reliability. The FIT rate is defined as the expected number of component failures per 10^9 (ten to the ninth power, or 1,000,000,000) hours. The FIT rate can be converted immediately to the MTBF (Mean Time Between Failures) in hours as MTBF = 10^9/FIT. The Annualized Failure Rate (AFR) can be calculated as AFR = (FIT * 8760)/10^9, where 8760 is the number of hours in a year.

The advantage of using FIT rates rather than the MTBF metric is that FIT rates are additive. For example, the FIT rate of a current-production M3F-PCIXD-2 NIC is 200, the sum of the FIT rate of the fiber transceiver, 125, and the FIT rate of all other component failures, 75. A FIT rate of 200 corresponds to an MTBF of 10^9/200 = 5M hours, and to an AFR of (200 * 8760)/10^9 = 0.00175 = 0.175%.
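The conversions above can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function names are ours and are not part of any Myricom tool.

```python
HOURS_PER_YEAR = 8760      # 365 days * 24 hours
FIT_BASE_HOURS = 1e9       # FIT counts failures per 10^9 hours


def mtbf_hours(fit):
    """Convert a FIT rate to an MTBF in hours."""
    return FIT_BASE_HOURS / fit


def afr(fit):
    """Convert a FIT rate to an annualized failure rate (a fraction)."""
    return fit * HOURS_PER_YEAR / FIT_BASE_HOURS


# M3F-PCIXD-2 NIC: fiber transceiver (125) plus all other components (75).
nic_fit = 125 + 75

print(f"FIT  = {nic_fit}")
print(f"MTBF = {mtbf_hours(nic_fit) / 1e6:.1f}M hours")
print(f"AFR  = {afr(nic_fit):.3%}")
```

For the NIC example this reproduces the figures in the text: a FIT rate of 200, an MTBF of 5.0M hours, and an AFR of about 0.175%.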

Please note that these are very low failure rates. 5M hours is 570 years!

Similarly, you can add the FIT rates of all of the switch and NIC components in a cluster to determine the FIT rate of the interconnect components of the cluster.
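As a rough sketch of that addition, the Python fragment below sums FIT rates over a parts list and converts the total to expected failures per year, which is the quantity of interest for spares planning. The FIT rates are those published in the summary table on this page; the inventory itself (one M3-E128, 16 line cards, 128 NICs) is a hypothetical example, not a validated configuration.

```python
HOURS_PER_YEAR = 8760

# FIT rates from the summary table for current-production components.
FIT = {
    "M3-E128 enclosure": 2000,
    "M3-M monitoring line card": 435,
    "M3-SW16-8F line card": 1400,
    "M3F-PCIXD-2 NIC": 200,
}

# Hypothetical inventory: the counts are illustrative only.
inventory = {
    "M3-E128 enclosure": 1,
    "M3-M monitoring line card": 1,
    "M3-SW16-8F line card": 16,
    "M3F-PCIXD-2 NIC": 128,
}

# FIT rates are additive, so the cluster rate is a weighted sum.
cluster_fit = sum(FIT[name] * count for name, count in inventory.items())

# Expected number of interconnect component failures per year of operation.
failures_per_year = cluster_fit * HOURS_PER_YEAR / 1e9

print(f"Cluster FIT rate: {cluster_fit}")
print(f"Expected failures per year: {failures_per_year:.2f}")
```

For this made-up inventory the total is 50,435 FIT, or roughly 0.44 expected failures per year across all interconnect components.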

Summary for Current-Production Components

| Component | FIT Rate | MTBF | Comments |
|---|---|---|---|
| M3-E16 enclosure with no line cards | 650 | 1.5M hours | See notes below regarding fans |
| M3-E32 enclosure with no line cards | 650 | 1.5M hours | See notes below regarding fans |
| M3-E64 enclosure with no line cards | 2000 | 0.5M hours | See notes below regarding fans |
| M3-E128 enclosure with no line cards | 2000 | 0.5M hours | See notes below regarding fans |
| M3-M monitoring line card | 435 | 2.3M hours | |
| M3-SW16-8F and M3-SPINE-8F switch line cards with fiber ports | 1400 | 0.7M hours | 8*125 FIT for fiber transceivers + 400 FIT for all other components |
| M3-SW16-8E GbE line card | 800 est | 1.2M hours | New product. |
| M3F-PCIXD-2 or -4 one-port PCI-X NIC | 200 | 5.0M hours | 125 FIT for fiber transceiver + 75 FIT for all other components |
| M3F-PCIXF-2 or -4 one-port PCI-X NIC | 200 est | 5.0M hours | New product, but should have the same failure rate as M3F-PCIXD NICs |
| M3F2-PCIXE-2 or -4 dual-port PCI-X NIC | 325 | 3.1M hours | 2*125 FIT for fiber transceivers + 75 FIT for all other components |
| M3F-PCI64B and M3F-PCI64C PCI NICs with Fiber ports | 500 | 2.0M hours | |

| Component | Estimated L10 Life Expectancy |
|---|---|
| Fan trays for switches and switch networks: M3-E16-FAN, M3-E32-FAN, M3-E64-FAN, M3-E128-FAN | at least 10 years of operation |

Summary for the Components in Myrinet Switches for Large Clusters (New)

| Component | FIT Rate | MTBF | Comments |
|---|---|---|---|
| M3-CLOS-ENCL, M3-SPINE-ENCL | 200 est | 5M hours | These enclosures differ only in the backplane. The FIT rate does not include the FRUs listed below. See notes below. |
| M3-MONITOR | 400 est | 2.5M hours | |
| M3-POWER (4 per enclosure) | 500 est | 2M hours | N+1 redundant. These 350W Astec power supplies are ~50% derated if all four are operating. |
| M3-SW32-16F | 2200 est | 0.5M hours | 16*125 FIT for fiber transceivers + 200 FIT for all other components |
| M3-2SW32 | 200 est | 5M hours | |
| M3-4SW32-16Q | 3600 est | 0.28M hours | 16*200 FIT for the quad-fiber transceivers + 400 FIT for all other components |
| M3-THRU-16Q | 3400 est | 0.3M hours | 16*200 FIT for the quad-fiber transceivers + 200 FIT for all other components |
| M3-AIRDAM | 0 | - | No active components |

| Component | Estimated L10 Life Expectancy |
|---|---|
| M3-FAN, fan assembly for M3-CLOS-ENCL and M3-SPINE-ENCL (2 per enclosure), 2 series fans per assembly | at least 5 years of operation |


Comments on the Basis for these estimates

Of course, we prefer experienced rather than calculated FIT rates. For products that have been in production long enough and have shipped in sufficient volume, the estimates above correspond to experienced FIT rates, but sometimes corrected for changes in component FIT rates. Myricom responds to all RMAs with a root-cause analysis, which allows the reliability data to be broken down into FIT rates for individual components. For example, because current-production Myrinet components use fiber transceivers from three manufacturers (Infineon, Stratos, and Finisar) on both switch line cards and on NICs, the RMA data allows us to track the reliability of the fiber transceivers from different manufacturers as well as to apply experienced component reliability from one product to other products that use the same components. An example from the table above is that we are able to provide a good estimate of the FIT rate of the M3F2-PCIXE-2 dual-port NIC, a relatively new product, from its similarity to the M3F-PCIXD-2 NIC, for which we have data from more than 500M hours of field operation.

Returned components that are damaged or that pass production tests are not considered to be field failures.
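To make the idea of an experienced FIT rate concrete, here is a minimal Python sketch that derives per-component FIT rates from RMA records and accumulated field hours. All of the data in it is invented for illustration: the record layout, the failure counts, and the 800M device-hours figure are assumptions chosen so that the result matches the 125 + 75 breakdown quoted for the M3F-PCIXD NIC. It is not Myricom's actual RMA data or procedure.

```python
from collections import Counter

# Hypothetical RMA records as (root cause, disposition) pairs.
rmas = (
    [("fiber transceiver", "field failure")] * 100
    + [("other components", "field failure")] * 60
    + [("none found", "passed production tests")] * 15
    + [("physical damage", "damaged")] * 5
)

# Hypothetical accumulated field operation, in device-hours.
DEVICE_HOURS = 800e6


def experienced_fit(failures, device_hours):
    """Failures per 10^9 device-hours of field operation."""
    return failures * 1e9 / device_hours


# Damaged returns and returns that pass production tests are excluded,
# so only confirmed field failures are counted per root cause.
counts = Counter(cause for cause, disposition in rmas
                 if disposition == "field failure")

fit_by_cause = {cause: experienced_fit(n, DEVICE_HOURS)
                for cause, n in counts.items()}
total_fit = sum(fit_by_cause.values())

for cause, fit in fit_by_cause.items():
    print(f"{cause}: {fit:.0f} FIT")
print(f"total: {total_fit:.0f} FIT")
```

Because the rates are broken down by root cause, the per-transceiver figure obtained from one product can be reused when estimating another product that carries the same transceiver.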


Notes

M3-E16, M3-E32, M3-E64, M3-E128 Enclosures with no line cards

The dominant failure mode of these components is power-supply failures.

Fans

The MTTF (Mean Time To Failure, which is distinct from MTBF) of the Panasonic Panaflo fans on the fan trays of the M3-E* enclosures is estimated to be 500,000 hours each. The basis for this estimate is a report from Panasonic Industrial Company on their "Hydro-Wave Bearing Technology Fans." The thrust plate of these unusual fans floats on a circulating film of oil. The tests done by the Panasonic engineers are based on measurements of oil consumption from fans operating over an 18,000-hour period at temperatures of 20C, 30C, 40C, 50C, 60C, and 70C.

The data reported show that the MTTF of an FBA06A fan (exactly the fan used in the Myricom M3-E16) is extrapolated to be 517,000 hours at 20C and 431,000 hours at 30C. The L10 Life Expectancy (the operating time that 90% of fans are expected to survive) is 212,000 hours at 20C and 177,000 hours at 30C. The extrapolation treats a fan as having reached the end of its life when half of its original oil is exhausted. Thus, we estimate an MTTF of ~500,000 hours (to the significance of the data) at the typical operating temperature of 25C.
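One plausible way to arrive at the 25C figure is to interpolate between the two reported temperatures and then round to the significance of the data. The Python sketch below does this with straight-line interpolation. That choice is our assumption; the page does not say how the estimate was formed, and fan life need not vary linearly with temperature.

```python
# Extrapolated MTTF values reported for the FBA06A fan, in hours.
MTTF_AT_20C = 517_000
MTTF_AT_30C = 431_000


def interpolate_mttf(temp_c):
    """Straight-line interpolation between the 20C and 30C data points.

    Linear interpolation is an assumption made for this illustration.
    """
    fraction = (temp_c - 20) / (30 - 20)
    return MTTF_AT_20C + fraction * (MTTF_AT_30C - MTTF_AT_20C)


mttf_25c = interpolate_mttf(25)

# Round to the nearest 100,000 hours, i.e. one significant figure here.
mttf_rounded = round(mttf_25c, -5)

print(f"Interpolated MTTF at 25C: {mttf_25c:,.0f} hours")
print(f"Rounded to one significant figure: {mttf_rounded:,.0f} hours")
```

The interpolated value is 474,000 hours, which rounds to the ~500,000 hours quoted above.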

Fans experience wear-out rather than random failures; thus, the MTTF is not a completely relevant measure. The field experience with Myrinet products that use these Panaflo fans began in 4Q99. There have now been more than 100M hours of fan operation, and the number of failures reported is zero, but all fans are within the first 5 years of service.
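For a sense of what zero failures in 100M fan-hours implies, the sketch below computes a one-sided upper confidence bound on the failure rate. It assumes a constant (random) failure rate, which, as noted above, is not how fans actually fail, so the result only describes the early-life period covered by the field data and says nothing about wear-out. The function name and the 95% confidence level are our choices for illustration.

```python
import math


def zero_failure_fit_upper_bound(device_hours, confidence=0.95):
    """Upper bound on the FIT rate when no failures have been observed.

    Assumes a constant failure rate: the chance of seeing zero failures
    in T hours is exp(-rate * T), so the bound is -ln(1 - confidence) / T.
    """
    rate_per_hour = -math.log(1.0 - confidence) / device_hours
    return rate_per_hour * 1e9


fan_hours = 100e6  # more than 100M hours of fan operation, zero failures
bound = zero_failure_fit_upper_bound(fan_hours)

print(f"95% upper bound: {bound:.1f} FIT per fan")
```

Under these assumptions the random-failure rate of a fan during its first five years is bounded at roughly 30 FIT, which is small next to the FIT rates of the enclosures themselves.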

Although a mechanical failure of these excellent fans prior to 100,000 hours (11 years) would be unusual according to the L10 data, the fan tray is an FRU (field-replaceable unit). The M3-E128 has 8 92mm fans on the fan tray and a fan of unknown MTTF built into the power supply. The M3-E64 has 4 92mm fans on the fan tray and a fan of unknown MTTF built into the power supply. The M3-E32 has 3 92mm fans on the fan tray, with the rear-most fan cooling the power supply. The M3-E16 has 4 60mm fans on the fan tray, with the rear-most fan cooling the power supply.

Power Cord

The M3-E128, M3-CLOS-ENCL, and M3-SPINE-ENCL ship with a 15A 100-127V IEC power cord or a 10A 200-240V IEC power cord. The type of power cord (the plug end) depends on the shipping destination. The IEC line cords for the smaller enclosures are all rated at 10A. We have never observed a failure in an IEC power cord in these or any other Myrinet products.

M3-CLOS-ENCL and M3-SPINE-ENCL

These new 14U enclosure products include a limited amount of active electronics: a TFT display, small monitoring and control boards for each fan assembly, and power supply monitoring. Until we have sufficient field experience with the failure rates, we have assigned a provisional FIT rate of 200 to the enclosure. Note that the TFT display also has a limited lifetime if it is used continuously. However, the firmware on the M3-MONITOR, which drives the TFT display, will turn the display off if it receives no commands from the turn/push control for 2 hours.

M3-M (monitoring line card)

The FIT rate of 435 is based on field experience since early 2000.

Fiber ports

The FIT rate of Myrinet NICs and switch line cards is generally dominated by the FIT rate of the optical-fiber transceivers. The experienced FIT rates of the transceivers from Infineon, Stratos, and Finisar are all ~125 (MTBF = 8M hours), a rate that is consistent with reliability testing and results from these manufacturers. We expect the FIT rate of these transceivers to decline over the next year thanks to reliability improvements in the transceiver construction and in the optical assembly. We have assigned a provisional FIT rate of 200 to the Agilent quad-fiber (MTP) transceivers on the "-16Q" line cards based on reliability data from Agilent.

The FIT estimates above do not apply to earlier Myrinet products that used fiber transceivers from E2O Communications, Inc. These transceivers suffered from a high rate of failure of the VCSEL laser. The failures are insidious (delayed), appearing after 200-400 days of operation, and are tightly clustered. Myricom no longer uses these transceivers, and has been systematically replacing components at sites with high failure rates.

M3-SW16-8F and M3-SPINE-8F (line cards with Fiber ports)

As noted in the table above, the FIT rate of 1400 breaks down to a FIT rate of 8*125 for the 8 fiber transceivers, and a FIT rate of 400 for the rest of the components on these line cards. The failure rate is the same for the M3-SW16-8F and M3-SPINE-8F because the boards are nearly identical except for the Myricom XBar16 chip, and its FIT rate is very small (statistically less than 5).
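The same breakdown applies to every fiber product in the tables: the number of transceivers times the per-transceiver FIT rate, plus a FIT rate for everything else on the board. The short Python check below recomputes the published totals from the published breakdowns. The helper name is ours; the figures are those in the tables on this page.

```python
SINGLE_FIBER_FIT = 125   # Infineon, Stratos, and Finisar transceivers
QUAD_FIBER_FIT = 200     # provisional rate for the quad-fiber transceivers


def card_fit(transceivers, fit_per_transceiver, other_fit):
    """FIT rate of a card: its transceivers plus all other components."""
    return transceivers * fit_per_transceiver + other_fit


# Product: (FIT computed from the breakdown, FIT published in the table)
rows = {
    "M3-SW16-8F / M3-SPINE-8F": (card_fit(8, SINGLE_FIBER_FIT, 400), 1400),
    "M3F-PCIXD-2": (card_fit(1, SINGLE_FIBER_FIT, 75), 200),
    "M3F2-PCIXE-2": (card_fit(2, SINGLE_FIBER_FIT, 75), 325),
    "M3-SW32-16F": (card_fit(16, SINGLE_FIBER_FIT, 200), 2200),
    "M3-4SW32-16Q": (card_fit(16, QUAD_FIBER_FIT, 400), 3600),
    "M3-THRU-16Q": (card_fit(16, QUAD_FIBER_FIT, 200), 3400),
}

for name, (computed, published) in rows.items():
    status = "matches" if computed == published else "differs from"
    print(f"{name}: computed {computed} {status} published {published}")
```

On the 8-port line cards the transceivers account for 1000 of the 1400 FIT, which is why the transceiver rate dominates the overall estimate.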

Fiber cables

There have been no reported field failures of the 50/125 fiber-pair cables or of the LC fiber-end connectors. The fiber and the LC connector system appear to be highly reliable. However, a small fraction, much less than 1%, of the cables shipped are DOA (dead on arrival, defective) or damaged in installation.

M3F-PCIXD-2 and M3F-PCIXD-4 one-port Myrinet/PCI-X NIC

See the notes above regarding Fiber ports. As noted in the table above, the FIT rate of 200 breaks down to a FIT rate of 125 for the single fiber transceiver and a FIT rate of 75 for the remaining circuitry. The improvement in reliability of these NICs relative to the older M3F-PCI64 series of NICs is due to their higher level of integration.

M3F2-PCIXE-2 and M3F2-PCIXE-4 Dual-port Myrinet/PCI-X NIC

See the notes above regarding Fiber ports. As noted in the table above, the FIT rate of 325 breaks down to a FIT rate of 2*125 for the dual fiber transceivers and a FIT rate of 75 for the remaining circuitry. Although this is a new product, we can estimate its FIT rate with confidence due to its similarity to the M3F-PCIXD NICs.

M3F-PCI64B and M3F-PCI64C (Myrinet/PCI NICs with Fiber ports)

See the notes above regarding Fiber ports. The FIT rate of 500 is based on many years and many hundreds of millions of hours of field operation. The FIT rate of the "B cards" and "C cards" is statistically the same. These cards are made with 2MB, 4MB, or 8MB of memory (-2, -4, and -8 versions), but there is no statistical difference in the FIT rate depending upon the amount of memory.

Last updated: 3 January 2005