*************************************************************************
*                                                                       *
*    Myrinet Express Lustre Networking Driver (MXLND) documentation     *
*                                                                       *
*************************************************************************

README of MXLND

MXLND provides support for Myricom's Myrinet Express (MX) communication
layer in Lustre.

MXLND may be used with either MX-10G or MX-2G. See MX's README for
supported NICs.

Table of Contents:
    I. Installation
       1. Configuring and compiling
       2. Module Parameters
   II. MXLND Performance
  III. Caveats
       1. Systems with different page sizes
       2. Multi-homing
       3. MX endpoint collision
   IV. License
    V. Support

================
I. Installation
================

MXLND is supported on Linux 2.6. It may be possible to run it on 2.4,
but it has not been tested. MXLND requires Myricom's MX version 1.2.8
or higher. See MX's README for the supported list of processors.

MXLND requires the optional MX kernel library interface. MX must be compiled
with --enable-kernel-lib.

1. Configuring and compiling

MXLND should already be integrated into the Lustre build process. To
build MXLND, set the path to your MX installation in Lustre's
./configure:

    --with-mx=/opt/mx

replacing /opt/mx with the actual path. Configure checks that the MX version
provides the required functions. If it does not, the build will fail.
To check whether MXLND will be built, look for:

    checking whether to enable Myrinet MX support... yes

in configure's output, or check for the presence of a Makefile in
$LUSTRE/lnet/klnds/mxlnd.
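
As a quick check, configure's output can be saved to a file and searched for
that line. The following is a minimal sketch only: the function name
mx_support_enabled and the log file name configure.log are illustrative
assumptions and are not part of Lustre or MX.

```shell
# Hypothetical helper: succeeds if a saved configure log reports that
# Myrinet MX support was enabled.
mx_support_enabled() {
    # $1 = path to a file holding configure's output
    grep -q 'checking whether to enable Myrinet MX support\.\.\. yes' "$1"
}

# Example use (the log file name is an assumption):
#   ./configure --with-mx=/opt/mx 2>&1 | tee configure.log
#   mx_support_enabled configure.log && echo "MX support enabled"
```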

2. Module Parameters

MXLND supports a number of load-time parameters using Linux's module
parameter system. On our test systems, we created the following file:

    /etc/modprobe.d/kmxlnd

On some older systems, you may need to modify /etc/modprobe.conf instead.

The available options are:

    n_waitd      number of completion daemons
    cksum        set non-zero to enable small message (< 4KB) checksums
    ntx          total number of tx message descriptors
    peercredits  number of concurrent sends to one peer
    board        index value of the Myrinet board
    ep_id        MX endpoint ID
    ipif_name    IPoMX interface name
    polling      0 blocks (waits); a value > 0 polls that many times
                 before blocking

    credits      unused; was number of concurrent sends to all peers
    max_peers    unused; was maximum number of peers that may connect
    hosts        unused; was IP-to-hostname resolution file

You may want to vary the options to obtain the optimal performance for your
platform.

    n_waitd sets the number of threads that process completed MX requests
(sends and receives). In our testing, the default of 1 performed best.

    cksum turns on small message checksums. It can be used to aid in
troubleshooting. MX also provides an optional checksumming feature which can
check all messages (large and small). See the MX README for details.

    ntx is the total number of sends in flight from this machine.

    peercredits is the number of in-flight messages for a specific peer. This
is part of the flow-control system in Lustre. Increasing this value may improve
performance, but it requires more memory since each message requires at least
one page.

    board is the index of the Myricom NIC. Hosts can have multiple Myricom NICs
and this identifies which one MXLND should use.

    ep_id is the MX endpoint ID. Each process that uses MX is required to have
at least one MX endpoint to access the MX library and NIC. The ID is a simple
index starting at 0. When used on a server, the server will attempt to use this
endpoint. When used on a client, it specifies the endpoint to connect to on the
management server.

    ipif_name is the name of the Ethernet interface over MX. Generally, it is
myriN, where N matches the MX board index.

    polling determines whether this host will poll or block for MX request
completions. A value of 0 blocks and any positive value will poll that many
times before blocking. Since polling increases CPU usage, we suggest you set
this to 0 on the client and experiment with different values for servers.
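
To make the parameters above concrete, the sketch below prints an options line
that could be placed in /etc/modprobe.d/kmxlnd. The module name kmxlnd is an
assumption inferred from that file name, and the function name and the values
shown are illustrative only. ipif_name is derived from the board index,
following the myriN convention described above.

```shell
# Hypothetical helper: print a modprobe options line for MXLND.
# Assumes the module is named "kmxlnd" (inferred from the file name above).
mxlnd_options_line() {
    # $1 = board index, $2 = MX endpoint ID, $3 = polling count (0 = block)
    printf 'options kmxlnd board=%s ep_id=%s ipif_name=myri%s polling=%s\n' \
        "$1" "$2" "$1" "$3"
}

# Client-style settings: board 0, endpoint 3, block instead of polling.
mxlnd_options_line 0 3 0
```

Parameters left out of the options line keep their defaults, so it is
reasonable to start with a short line like this one and add ntx or peercredits
only when tuning.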

=====================
II. MXLND Performance
=====================

On MX-2G systems, MXLND should easily saturate the link and use minimal CPU 
(5-10% for read and write operations). On MX-10G systems, MXLND can saturate 
the link and use moderate CPU resources (20-30% for read and write operations).
MX-10G relies on PCI-Express, which is relatively new, and performance varies
considerably by processor, motherboard and PCI-E chipset. Refer to Myricom's
website for the latest DMA read/write performance results by motherboard. The
DMA results place an upper bound on MXLND performance.

============
III. Caveats
============

1. Systems with different page sizes

MXLND will set the maximum small message size equal to the kernel's page size.
This means that machines running MXLND that have different page sizes are not
able to communicate with each other. If you wish to run MXLND in this case,
send email to help@myri.com.
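
Before connecting two hosts, their page sizes can be compared by hand. The
sketch below relies on the standard getconf PAGESIZE query. The function name
same_page_size and the approach of passing in the peer's value manually are
illustrative assumptions, not part of MXLND.

```shell
# Hypothetical helper: succeed only if the local page size equals the
# page size reported by a peer (obtained by running getconf PAGESIZE there).
same_page_size() {
    # $1 = peer's page size in bytes
    [ "$(getconf PAGESIZE)" -eq "$1" ]
}

# Example: 4096 is only a placeholder; substitute the peer's actual value.
if same_page_size 4096; then
    echo "page sizes match"
else
    echo "page sizes differ; these MXLND hosts cannot communicate"
fi
```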

2. Multi-homing

At this time, MXLND does not support more than one interface at a time.
Thus, a single Lustre router cannot route between two MX-10G, between two
MX-2G, or between MX-10G and MX-2G fabrics.

3. MX endpoint collision

Each process that uses MX is required to have at least one MX endpoint to
access the MX library and NIC. Other processes may need to use MX, and no two
processes can use the same endpoint ID. MPICH-MX dynamically chooses one at
MPI startup and should not interfere with MXLND. Sockets-MX, on the other hand,
is hard-coded to use 0 for its ID. If anyone may want to run Sockets-MX on
this system, use a non-zero value for MXLND's endpoint ID.
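
A simple guard against this collision is to reject endpoint ID 0 before it
is written into the module options. This is a sketch only; the function name
ep_id_avoids_sockets_mx is hypothetical.

```shell
# Hypothetical check: succeed only for a non-zero endpoint ID, since
# Sockets-MX is hard-coded to use endpoint 0.
ep_id_avoids_sockets_mx() {
    # $1 = endpoint ID intended for MXLND's ep_id parameter
    [ "$1" -gt 0 ]
}

# Example: 3 is an arbitrary illustrative value.
ep_id_avoids_sockets_mx 3 && echo "ep_id 3 does not collide with Sockets-MX"
```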


===========
IV. License
===========

MXLND is copyright (C) 2006 of Myricom, Inc. 

MXLND is part of Lustre, http://www.lustre.org.

MXLND is free software; you can redistribute it and/or modify it under the
terms of version 2 of the GNU General Public License as published by the Free
Software Foundation.

MXLND is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.  See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
Lustre; if not, write to the Free Software Foundation, Inc., 675 Mass Ave,
Cambridge, MA 02139, USA.

==========
V. Support
==========

If you have questions about MXLND, please contact help@myri.com.
