README - Myricom 10GbE driver for Linux Contents: I. Installation II. Performance Tuning III. Troubleshooting IV. Compiling against another kernel V. Compile-time options VI. Load-time options This Myricom 10GbE driver for Myri-10G NICs is intended for use only with Linux kernel version 2.6 or later. It has been tested with Red Hat Enterprise Linux 4, and several kernel.org kernels. I. Installation =============== To build the driver, type % cd myri10ge/linux % make clean % make % su root # make install-only To compile against a kernel that is different than the current running kernel, see the "Compiling against another kernel" section below. To load the Myricom 10GbE driver, type the command # modprobe myri10ge A new ethernet interface, having a MAC address beginning with 00:60:DD, should now appear in the output of ifconfig -a . For example: # ifconfig -a | grep 00:60:DD eth2 Link encap:Ethernet HWaddr 00:60:DD:47:E5:31 In the examples below, we will assume our device is named "eth2". If the driver fails to load, refer to the "Troubleshooting" section below. If an error occurs during the installation procedure or at run-time, please send the output of myri10ge_bugreport.sh to help@myri.com. II. Performance Tuning ====================== In addition to the suggestions below, please see http://www.myri.com/cgi-bin/fom?file=511#linux for additional performance tuning recommendations. A. Write Combining ------------------ Enabling Write Combining (WC) on the device's memory range will improve performance. Running the command ethtool -S eth2 | grep WC will indicate if the driver was able to enable WC. If WC is disabled, please see http://www.myri.com/cgi-bin/fom?file=416 for tips on how to allow the driver to enable it. If WC is enabled, performance can be improved (at the cost of slightly higher host-CPU utilization) by enabling the WC fifo using: # modprobe myri10ge myri10ge_wcfifo=1 B. Network Buffer Sizes ----------------------- For best performance, we recommend increasing several network buffer sizes from their default values. Add the following lines to /etc/sysctl.conf and execute the command "sysctl -p /etc/sysctl.conf". net.core.rmem_max = 16777216 net.core.wmem_max = 16777216 net.ipv4.tcp_rmem = 4096 87380 16777216 net.ipv4.tcp_wmem = 4096 65536 16777216 net.core.netdev_max_backlog = 250000 For best performance with a 1500 byte MTU on a LAN, we suggest disabling TCP timestamps by adding the following line to /etc/sysctl.conf and executing the command "sysctl -p /etc/sysctl.conf". net.ipv4.tcp_timestamps = 0 C. Interrupt Coalescing ----------------------- This driver is ethtool compliant, and the interrupt coalescing parameter can be adjusted via "ethtool -C $DEVNAME rx-usecs $VALUE". The default setting is a compromise between latency and cpu overhead. You may wish to reduce rx-usecs if latency is more important and you are using a low-latency switch or a point-to-point connection. Similarly, you may wish to increase rx-usecs if you are interested in reducing CPU overhead for large transfers. Note that rx-usecs controls both transmit and receive coalescing. If you are using a kernel prior to 2.6.15, and notice that increasing rx-usecs results in a sharp decline in TCP performance, you may want to increase the TSO window divisor by adding the following line to /etc/sysctl.conf and executing the command "sysctl -p /etc/sysctl.conf". net.ipv4.tcp_tso_win_divisor = 32 For example, for the best performance on opterons, you should load the driver with myri10ge_wcfifo=0 and set rx-usecs to at least 75us (ethtool -C ethX rx-usecs 75). Also, we've found that disabling TCP timestamps is very important on opterons. Try sysctl net.ipv4.tcp_timestamps=0. D. MSI versus Legacy Interrupts ------------------------------- Enabling MSI interrupts will lower interrupt latency and can improve performance under some workloads. Our driver will only request MSI interrupts on chipsets it has confidence will work with MSI interrupts. To use MSI interrupts, the Linux kernel must be compiled with MSI support (CONFIG_PCI_MSI=y). To see if MSI interrupts were enabled, check ethtool for the myri10ge device: # ethtool -S eth2 | grep MSI MSI: 1 A non-zero value indicates that an MSI interrupt is being used by our device. However, if the value is 0 and dmesg shows a message like the following, it means that the Linux kernel did not allow our device to use MSI interrupts: myri10ge: Error setting up MSI on device 0000:05:00.0, falling back to xPIC If you would like to force the use of MSI interrupts, you should load the driver using: # modprobe myri10ge myri10ge_msi=1 If MSI interrupts are still not enabled even when setting myri10ge_msi=1, this may mean your Linux distribution disables MSI by default on a global basis. Recent Ubuntu and Fedora Core versions are known to do this. To enable MSI, you must add pci=msi to the kernel parameters and reboot. Note that if MSI interrupts were forced to be enabled, but the interface now fails to pass traffic, you should revert to using xPIC interrupts by reloading the driver without using myri10ge_msi=1, and remove pci=msi from your kernel parameters. If it's not possible to enable MSI interrupts with the specific Linux release that you're using, you can make xPIC interrupts less expensive by loading the driver with: # modprobe myri10ge myri10ge_deassert_wait=0 or set it at runtime via 'echo 0 > /sys/module/myri10ge/myri10ge_deassert_wait' Not using MSI or myri10ge_deassert_wait=0 costs about 500Mb/s in our performance measurements for a single stream. E. Module compilation --------------------- If you are using Linux kernel version 2.6.16 or higher, you will see improved receive performance if you change the definition of MYRI10GE_ALLOC_ORDER to 2 or more. This will cause the driver to allocate receive buffers from 2^MYRI10GE_ALLOC_ORDER contiguous pages. This reduces the number of allocations that the driver will make, as well as potentially reducing the number of IOMMU manipulations, at the cost of making each allocation more expensive. Please note that if the system is under heavy memory load, you will have an increased likelihood of allocation failures because it is harder for the kernel to provide contiguous pages. To change the this parameter, rebuild by: % make clean % make MYRI10GE_ALLOC_ORDER=$ORDER % su root # make install-only # rmmod myri10ge # modprobe myri10ge Where $ORDER ranges in value from 1..3. A good value to choose is MYRI10GE_ALLOC_ORDER=2, as it results in 16KB allocations. This is the same size allocation as a driver which does not use PAGE_SIZE buffers, and simply allocates 9KB jumbo frames. You may want to experiment with making MYRI10GE_ALLOC_ORDER=3, but this is a bit more likely to fail under heavy memory pressure. F. Packet forwarding --------------------- If your workload is primarily traffic forwarding or traffic analysis, you should build the driver using the MYRI10GE_RX_SKBS=1 compile option. This causes the driver to receive into standard skbufs, rather than into pages attached to an skbuf. Using this option is critical for forwarding standard MTU frames at line rate, and for forwarding frames to interfaces whose drivers do not support scatter-gather DMA. However, this option is incompatible with LRO, and should therefor not be used on an endstation. To change this parameter, rebuild by: % make clean % make MYRI10GE_RX_SKBS=1 MYRI10GE_LRO=0 % su root # make install-only # rmmod myri10ge # modprobe myri10ge G. MSI-X interrupts and Multiple Receive Queues ------------------------------------------------- If your kernel, motherboard and NIC are MSI-X capable, you can take advantage of hardware steering of incoming IPv4 traffic into multiple sets of receive queues. To check if your NIC is capable of MSIX, use lspci: # lspci -v -d 14c1: | grep 'MSI-X' If your NIC is MSI-X capable, you should see: Capabilities: [d0] MSI-X: Enable- Mask- TabSize=128 To enable multiple sets of receive queues (called slices), load the driver with the module parameter myri10ge_max_slices set to the number of slices you want to use, or to -1, which will pick the optimal number. Note that the driver will use a number of slices which is the largest power of two equal to or below myri10ge_max_slices that it can successfully configure. After loading the driver, you should see a line mentioning the number of MSI-X IRQs in use in your kernel messages log: myri10ge 0000:05:00.0: 4 MSI-X IRQs, tx bndry 4096, fw myri10ge_rss_eth_z8e.dat, WC Enabled Most Linux kernel versions with which we have tested will deliver all MSI-X interrupts to the same CPU by default, which defeats the purpose of steering packets into multiple queues. To achieve optimal results you should manually bind the interrupt from each slice to a different CPU core. We include an example script, msixbind.sh, which will bind slice 0 to CPU 0, slice 1 to CPU 1, etc. H. Intel Direct Cache Access (DCA) ---------------------------------- If you have a recent Intel server or workstation chipset, you may be able to take advantage of DCA. DCA causes DMA writes posted by the NIC to be prefetched into a CPU's cache, thereby reducing cache misses and increasing CPU efficiency for received network traffic. Support for DCA is provided by the Intel ioatdma driver. DCA support was added to the ioatdma in the 2.6.24 series. Please make sure you have both ioatdma and dca configured. Note that we have also found that using ioatdma's copy offload for TCP actually degrades performance at 10GbE speeds, so make sure to avoid configuring that (or disable it at runtime via sysctl net.ipv4.tcp_dma_copybreak=2147483647) Here is a snippet from a 2.6.24 .config file showing an optimal configuration: # # DMA Devices # CONFIG_INTEL_IOATDMA=m CONFIG_DMA_ENGINE=y # # DMA Clients # # CONFIG_NET_DMA is not set CONFIG_DCA=m If you have an older kernel or prefer not to recompile your 2.6.24 or newer kernel, you may download the ioatdma driver from Intel (version 2.15 or newer). Build and install the ioatdma driver prior to installing the myri10ge driver. Save the ioatdma source directory, and point the Myri10GE build at it when building the Myri10GE driver via the DCA_FLAGS argument to make. # make DCA_FLAGS="-DCONFIG_DCA -I/path/to/ioatdma-/include" # make install-only After loading the driver, you can confirm that DCA is enabled via: # ethtools -S eth2 | grep dca dca_capable: 1 dca_enabled: 1 III. Troubleshooting ==================== If the recommendations below do not resolve the problem you have encountered, please send a full description, along with the output of myri10ge_bugreport.sh, to help@myri.com. Large Receive Offload (LRO) is enabled by default. This will interfere with forwarding TCP traffic. If you plan to forward TCP traffic (using the host with the Myri10GE NIC as a router or bridge), you must disable LRO. To disable LRO, load the myri10ge driver with myri10ge_lro set to 0: # modprobe myri10ge myri10ge_lro=0 Alternatively, you can disable LRO at runtime by disabling receive checksum offloading via ethtool: # ethtool -K eth2 rx off The ability to saturate a 10GbE link depends on having sufficient PCI-Express bandwidth. When loaded, our driver calculates the available bus bandwidth (read DMA, write DMA, and simultaneous read and write DMA) and stores it so that ethtool may retrieve it later. To view the bus bandwidth, use the following command: # ethtool -S eth2 | grep dma Note that the reported bandwidth is measured in megabytes per second, not megabits. This means that 10Gb/s corresponds to 1280MB/s. This driver uses the Linux hotplug facility to load its firmware by default. It will look in /lib/firmware (Redhat), or /usr/lib/hotplug/firmware (SuSE) for a firmware image. The firmware images are copied there at install time. If there is a problem locating the firmware, the driver will fail to load, and you will see a message like this on the console: Myricom MYRI10GE driver 0000:05:00.0: Unable to load myri10ge_eth_z8e.dat firmware image, status = -2 This may be caused by your distribution using a different location for firmware. Please contact help@myri.com if you have a problem loading firmware. If the driver fails to load because of the unknown symbols "release_firmware" and "request_firmware", this means that you need to install the firmware loading module via "modprobe firmware_class". Also, make sure your kernel is built with CONFIG_FW_LOADER= 'y' or 'm'. As a workaround, you may wish to build the firmware into the myri10ge kernel module itself. To do this, build the module using MYRI10GE_BUILTIN_FW=1 # make MYRI10GE_BUILTIN_FW=1 If the driver fails to load because of the unknown symbols "zlib_inflate", "zlib_inflateInit2", and "zlib_inflate_workspacesize", this means you need to install the zlib module via # modprobe zlib_inflate If MSI interrupts were automatically enabled, but the interface fails to pass traffic, you should revert to using xPIC interrupts by reloading the driver using: # modprobe myri10ge myri10ge_msi=0 If you are using 802.1q VLANs, and you see an error message in the kernel log which looks like: hw tcp v4 csum failed you need to adjust the myri10ge_vlan_csum_fixup parameter. This tunable parameter controls whether or not the driver corrects the hardware checksum of received 802.1q VLAN tagged frames to account for the extra 4 bytes of VLAN header. In kernel.org kernels 2.6.14 and later, the Linux 802.1q VLAN module automatically does this correction, so our driver does not need to. In earlier Linux kernels (2.6.13 and earlier), however, the correction is not included, so our driver needs to perform this modification. Thus, the myri10ge_vlan_csum_fixup parameter defaults to true (non-zero) on kernel versions prior to 2.6.14, and to false (zero) on newer kernel versions. To enable the correction in the Myri10GE driver, reload the driver using: # modprobe myri10ge myri10ge_vlan_csum_fixup=1 Or you can adjust this at runtime using: # echo 1 > /sys/module/myri10ge/myri10ge_vlan_csum_fixup Or (depending on your kernel version): # echo 1 > /sys/module/myri10ge/parameters/myri10ge_vlan_csum_fixup Similarly, replace "0" with "1" above to disable the correction. TSO can potentially overwhelm the receiver and lead to packet loss and retransmissions. If you see an increase in bandwidth after disabling TSO, check your switch counters and settings to ensure flow control is enabled. TSO can be disabled as follows: # ethtool -K eth2 tso off If you are using another vendor's driver which also sets up PAT write combining, and that vendor's driver uses a different PAT index than our driver, the other vendor's driver may note a conflict and run in reduced performance mode. Examples of other drivers which use PAT include the Nvidia graphics driver, and various Infiniband drivers. One way to work around this PAT conflict is to change the PAT index used by the Myri10GE driver by rebooting, and loading the module using the myri10ge_pat_idx modparam to specify a different PAT index. We currently default to 6, while some other vendors use 1. Good values to try are 1, 4, 5, and 7. IV. Compiling against another kernel ==================================== To build for kernel different than the installed kernel, assuming its `uname -r` is 2.6.12-1-686 and its modules have been installed into /lib/modules/2.6.12-1-686, type % make clean % make KVER=2.6.12-1-686 ... To build against a kernel that has not been installed yet, but whose sources are in and have been built in (possibly the same directory), type % make clean % make KSRC= KDIR= ... Be sure to always 'make clean' before compiling against another kernel since the myri10ge_checks.h has to be regenerated according to the right kernel headers before compiling. V. Compile-time options ======================= To rebuild the module in a non-default manner, simply type: % make OPTION=value % su root # make install-only # rmmod myri10ge # modprobe myri10ge where the following OPTIONs are available: Option Values Default Meaning ------------ ------ ------- --------- MYRI10GE_LRO 0, 1 1 Enable or disable LRO MYRI10GE_BUILTIN_FW 0, 1 0 Build in firmware? (see above) MYRI10GE_ALLOC_ORDER 0..3 0 Allocate pages of this "order", see explanation above. MYRI10GE_RX_SKBS 0, 1 0 Receive into skbufs? (see above) MYRI10GE_THROTTLE 0, 1 0 Throttle transmit bandwidth After rebuilding and re-installing the module, you can confirm the module was built correctly by checking the compile options using ethtool -S. The presence of LRO flushed indicates LRO is compiled in, for all others, simply look for the option name in lower case without the leading MYRI10GE_. For example, to confirm a driver is compiled with MYRI10GE_ALLOC_ORDER=3, do the following: # ethtool -S eth2 | grep alloc_order alloc_order: 3 And to confirm the driver is using LRO: # ethtool -S eth1 | grep 'LRO flushed' LRO flushed: 11320658 VI. Load-time options ===================== When loading the myri10ge module, you may change a variety of options by appending them to the modprobe line: # modprobe myri10ge OPTION=value Option Values Default Meaning ------------ ------ ------- --------- myri10ge_force_firmware 0, 1 0 Force firmware to assume that the host provides aligned PCIe completions. myri10ge_fw_name string [a] Name of firmware image to load via hotplug. myri10ge_ecrc_enable 0,1 1 Enable ECRC on parent bridge if needed. myri10ge_msi -1,0,1 -1 Enable use of MSI interrupts. myri10ge_intr_coal_delay 0..N [b] Initial interrupt coalescing delay in usecs. myri10ge_flow_control 0, 1 1 Enable flow control. myri10ge_deassert_wait 0, 1 1 Wait for xPIC interrupt deassertion before exiting interrupt handler. myri10ge_initial_mtu 128..9000 9000 Initial default MTU. myri10ge_vlan_csum_fixup 0,1 [c] Do VLAN Checksum fixup for received frames. myri10ge_wcfifo 0,1 1 Use WC fifo when WC was enabled. myri10ge_max_slices -1,1..N 1 How many sets of receive queues to use per NIC. myri10ge_pat_idx 1,4..7 6 Which PAT idx to use to setup WC a: This defaults to myri10ge_eth_z8e.dat or myri10ge_ethp_z8e.dat depending on the host bridge chip in your machine. b: This defaults to 25us for kernels older than 2.6.15, and 75us for newer kernels. c: This defaults to 1 for kernels older than 2.6.14