README for libmmx.

General Information:
	The home site for libmmx is off the SWAR homepage at Purdue University:
		http://shay.ecn.purdue.edu/~swar

	libmmx was written by Hank Dietz and Randy Fisher, who can currently be
	contacted at:
		hankd@ecn.purdue.edu
		rfisher@ecn.purdue.edu

	Please include "libmmx" in the subject line of any correspondence
	pertaining to the library.

	Please see the file "bug-reports" for information on reporting
	problems with the library.

	Please read the file INSTALL for information on making and istalling
	the library, or the file UPGRADE for information on upgrading from an
	earlier version of libmmx.

Introduction:

	Intel's MMX family of multimedia extensions to the x86 instruction
	set contains CPU instructions which allow a single operation to be
	applied to multiple data items simultaneously.  This data is stored
	in a "partitioned" floating-point (FP) register, meaning that the
	register is logically divided into multiple independent sections
	called "fields", each of which can hold a single datum.  For
	example, a 64-bit register may be partitioned into two 32-bit
	fields, with the first consisting of bits 0 through 31 and the
	second consisting of bits 32-63.

	Throughout this document, and all SWAR literature, the notation AxB
	will be used to indicate a register partitioning of A fields of B
	bits each.  "AxB" is read as "A by B".  For example, a 64-bit
	register can be partitioned as 4 fields of 16-bits each (4 by 16).
	The notation "AxBu" indicates A fields of B bits each, containing
	unsigned integer data, and the notation "AxBf" indicates A fields of
	B bits each, containing floating point data.

	Once the data has been stored in the partitioned register, MMX
	instructions can be used to operate simultaneously on all the fields
	of the register.  Most of these instructions are "non-interfering",
	meaning that their application to one field is independent of their
	application to any other field of the same register.  In this manner,
	a single operation can be applied to multiple data streams
	concurrently.  Thus, the MMX instructions treat the fields of a
	partitioned register as though they were equivalent registers on
	separate nodes of a SIMD parallel computer.  We refer to this type of
	processing as SWAR (SIMD Within A Register).

	This library is intended to provide C function level support for the
	MMX instruction set.  It does so by providing a data type for
	partitioned registers, and functions which allow operands of this
	type to be passed to an MMX instruction and returned from it, loaded
	in MMX registers, and stored from MMX registers to memory.  All of
	the original MMX instructions are supported by the library.

	The libmmx functions access MMX instructions via inline assembly and
	use the MMX support provided by the GNU assembler (gas) versions
	2.8.1 and later.  It is possible to modify the library sources to use
	earlier versions of gas, however, we suggest that you upgrade to a
	newer version of gas if possible.


The mmx_t data type union:
	
	The data is either signed or unsigned, integer, and may be 8, 16, 32,
	or 64 bits long.  libmmx stores the contents of each register as an
	unsigned 64-bit entity which may take one of several forms.  This
	first-class type union is defined to be "mmx_t" in the header file
	"mmx.h":

	typedef union {
		long long		q;      /* Quadword */
		unsigned long long	uq;     /* Unsigned Quadword */
		int			d[2];   /* 2 Doublewords */
		unsigned int		ud[2];  /* 2 Unsigned Doublewords */
		short			w[4];   /* 4 Words */
		unsigned short		uw[4];  /* 4 Unsigned Words */
		char			b[8];   /* 8 Bytes */
		unsigned char		ub[8];  /* 8 Unsigned Bytes */
	} mmx_t;

	Within an application, variables of type mmx_t are declared as
	normal, and can be initialized by initializing the "elements" of the
	chosen partitioning as would be done for an array:
			mmx_t a,
			mmx_t b.d = {456L, 98L};
			mmx_t c.q = 0xF1F2F3F4F5F6F7F8LL;

	Values may be set within the application by setting the elements of
	the chosen partitioning:
			a.q = 0xF1F2F3F4F5F6F7F8LL;
			b.d[1] = 456L; b.d[0] = 98L;
			c.uw[3] = 3; c.uw[2] = 2; c.uw[1] = 1; c.uw[0] = 0;

	Because the data is stored as a single mmx_t, there is no inherent
	partitioning due to the storage type.  The partitioning of the
	register is dependent only upon the operation applied, and is not
	enforced by the library between operations.  Thus, data may be
	stored in an mmx_t variable with an 2x32 partitioning, then a 4x16
	addition can be applied to the mmx_t variable, immediately followed
	by an 8x8 addition.  It is left to the programmer to use the proper
	version of each instruction to maintain the desired partitioning
	throughout an application.

	Note that the data may be accessed with various partitionings.  Thus,
	the following sequence is legal:
			a.q = 0x0123456789abcdefLL;
			a.d[1] = a.d[0];
			a.w[3] = a.d[2];
			a.w[1] = a.d[0];
			a.ub[7] = a.b[6];
			a.ub[5] = a.b[4];
			a.ub[3] = a.b[2];
			a.ub[1] = a.b[0];

	Immediate data can be used where a value of type mmx_t is required by
	casting it to be a value of one of the types within the mmx_t union,
	then casting it to be an mmx_t value:
			mmx_t a;
			a = (mmx_t) (long long) 0x0123456789abcdefLL;
	Notice we did not say "a.q" here.


Using libmmx in an application:

	To use libmmx in an application, the header file "mmx.h" must be
	#included in the application before any variables of type mmx are
	declared, and before the first occurrence of a libmmx function.

	The processor's floating point registers must usable in MMX mode
	for any MMX instruction to be used, and must be put back into FP mode
	once MMX operations have concluded.  mmx_ok() checks to see if the
	processor supports MMX.  If so, mmx_ok() returns 1, otherwise, it
	returns 0.  This return value may be checked by a program, and allows
	MMX code to be skipped or alternative code to be used if MMX is not
	supported by the CPU.  mm_support() can be used to see which multimedia
	extensions are supported by the processor including MMX, Extended MMX,
	and 3DNow!, although the library currently only supports standard MMX.

	If MMX is supported, any of the other libmmx functions may be called.

	When these are completed, emms() must be called to place the CPU back
	into FP mode, and must be called before any FP instructions are used
	which follow an MMX instruction.

	See the document "functions" for complete descriptions of the libmmx
	functions.


Using MMX_TRACE:
	Defining MMX_TRACE before the inclusion of mmx.h, either in the
	source or on the compiler command line, enables the printing of
	trace information onto stderr.  This information should be useful
	for debugging and optimizing your code.

