E-values for pairwise local sequence alignment
==============================================

If we compare two completely random sequences, we may find weak
alignments that arise just by chance.  The expected number of such
chance alignments is called the "E-value".  The E-value depends on
the alignment score threshold, the sequence lengths, the scoring
scheme (score matrix and gap costs), and the letter abundances.

E-values provide guidance for choosing alignment score thresholds: the
score threshold should be high enough that alignments are unlikely to
exist just by chance.

The following tables show E-values expressed as "alignments per square
gigabase".  In other words, if we compared two completely random
sequences of length 1 billion each, this is how many alignments we
would expect to find.  In each table, the numeric column headings are
score thresholds, and each entry is the expected number of alignments
with at least that score.  For example, if we compare DNA sequences
with 60% A+T, using the HOXD70 matrix with a gap cost of 400 + 30 *
(gap length), we expect about 14 alignments with score >= 4000 per
square gigabase.
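
As a rough sketch of how these figures might be applied to sequences
of other lengths, the following Python snippet rescales a tabulated
E-value by the product of the two sequence lengths.  It assumes that
the expected number of chance alignments is proportional to (length
1) * (length 2), which is what the unit "per square gigabase"
suggests.  The function name ``rescale_evalue`` is made up for this
illustration::

  def rescale_evalue(evalue_per_square_gigabase, length1, length2):
      """Rescale a tabulated E-value to two sequences of the given lengths.

      Assumption: the E-value is proportional to length1 * length2.
      """
      square_gigabase = 1e9 * 1e9
      return evalue_per_square_gigabase * (length1 * length2) / square_gigabase

  # 14 alignments per square gigabase (HOXD70, 60% A+T, score >= 4000),
  # rescaled to two sequences of 100 million letters each:
  print(rescale_evalue(14, 1e8, 1e8))

Under this assumption, shrinking both sequences by a factor of 10
shrinks the E-value by a factor of 100.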

The E-values were calculated with this formula::

  E-value = 10^18 * K * exp(-lambda * score)

The values of K and lambda are also shown: they depend on the scoring
scheme and letter abundances.  They were calculated using ALP
(http://www.ncbi.nlm.nih.gov/CBBresearch/Spouge/html/index/software.html).
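
The formula can be checked against the tables with a few lines of
Python.  This is only a minimal sketch: the function name
``evalue_per_square_gigabase`` is hypothetical, and because the
tabulated K and lambda are rounded, the result agrees with the tables
only approximately::

  import math

  def evalue_per_square_gigabase(score, lambda_, k):
      """Expected number of chance alignments with at least this score,
      for two random sequences of 10^9 letters each."""
      return 1e18 * k * math.exp(-lambda_ * score)

  # HOXD70, gap cost = 400 + 30 * (gap length), 60% A+T:
  # lambda = 0.00908, K = 0.0806 (taken from the table below).
  e = evalue_per_square_gigabase(4000, lambda_=0.00908, k=0.0806)
  print(f"{e:.2g}")  # about 14, as in the table
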

For short sequences -- very roughly, shorter than 100 letters -- these
E-values are inaccurate: they are too high.  This can be fixed using a
so-called finite size correction, which is available in ALP.
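
To pick a score threshold for a target E-value, the formula above can
be solved for the score.  The sketch below does this without any
finite size correction, so it applies only to long sequences.  The
function name ``score_threshold`` is made up for this illustration,
and rounding up to an integer assumes that alignment scores are
integers::

  import math

  def score_threshold(target_evalue, lambda_, k):
      """Smallest integer score whose E-value per square gigabase
      is at most target_evalue."""
      score = math.log(1e18 * k / target_evalue) / lambda_
      return math.ceil(score)

  # HOXD70, gap cost = 400 + 30 * (gap length), 60% A+T:
  # score needed for at most 1 chance alignment per square gigabase.
  print(score_threshold(1.0, lambda_=0.00908, k=0.0806))

The result lies between the tabulated thresholds 4000 (E-value about
14) and 4500 (E-value about 0.14), as expected.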


DNA scoring schemes
-------------------

Match score = 1, mismatch cost = 1, gap cost = 7 + (gap length)
(For larger gap costs, e.g. infinity, the table remains very similar.)

===  =======  =======  ======  ======  ======  ======  ======  ======  ======
AT%  Lambda   K        20      25      30      35      40      45      50
===  =======  =======  ======  ======  ======  ======  ======  ======  ======
50   1.10     0.332    9.3e+7  3.8e+5  1500    6.3     0.026   1.1e-4  4.3e-7
60   1.05     0.309    2.3e+8  1.2e+6  6500    34      0.18    9.3e-4  4.9e-6
70   0.893    0.242    4.2e+9  4.9e+7  5.6e+5  6500    74      0.85    0.0098
===  =======  =======  ======  ======  ======  ======  ======  ======  ======

Match/mismatch scores = TiTv212, gap cost = 16 + (gap length)
(For larger gap costs, e.g. infinity, the table remains very similar.)

===  =======  =======  ======  ======  ======  ======  ======  ======  ======
AT%  Lambda   K        50      60      70      80      90      100     110
===  =======  =======  ======  ======  ======  ======  ======  ======  ======
50   0.481    0.200    7.2e+6  59000   480     3.9     0.032   2.6e-4  2.1e-6
60   0.455    0.184    2.4e+7  2.6e+5  2700    29      0.3     0.0032  3.4e-5
70   0.380    0.137    7.7e+8  1.7e+7  3.8e+5  8600    190     4.3     0.096
===  =======  =======  ======  ======  ======  ======  ======  ======  ======

Match/mismatch scores = HOXD70, gap cost = 400 + 30 * (gap length)

===  =======  =======  ======  ======  ======  ======  ======  ======  ======
AT%  Lambda   K        3000    3500    4000    4500    5000    5500    6000
===  =======  =======  ======  ======  ======  ======  ======  ======  ======
50   0.00936  0.0895   57000   530     4.9     0.046   4.2e-4  3.9e-6  3.6e-8
60   0.00908  0.0806   1.2e+5  1300    14      0.14    0.0015  1.7e-5  1.8e-7
70   0.00714  0.0312   1.6e+7  4.4e+5  12000   350     9.8     0.28    0.0077
===  =======  =======  ======  ======  ======  ======  ======  ======  ======

Match/mismatch scores = HOXD70, gap cost = infinity

===  =======  =======  ======  ======  ======  ======  ======  ======  ======
AT%  Lambda   K        2000    2500    3000    3500    4000    4500    5000
===  =======  =======  ======  ======  ======  ======  ======  ======  ======
50   0.0104   0.186    1.7e+8  9.5e+5  5200    29      0.16    8.8e-4  4.9e-6
60   0.0102   0.183    2.5e+8  1.5e+6  9400    57      0.35    0.0021  1.3e-5
70   0.00911  0.165    2.0e+9  2.1e+7  2.2e+5  2300    25      0.26    0.0027
===  =======  =======  ======  ======  ======  ======  ======  ======  ======


Protein scoring schemes (with standard amino acid abundances)
-------------------------------------------------------------

Match/mismatch scores = Blosum62, gap cost = 11 + 2 * (gap length)

=======  =======  ======  ======  ======  ======  ======  ======  ======
Lambda   K        80      90      100     110     120     130     140
=======  =======  ======  ======  ======  ======  ======  ======  ======
0.299    0.0883   3.6e+6  1.8e+5  9100    460     23      1.2     0.058
=======  =======  ======  ======  ======  ======  ======  ======  ======

Match/mismatch scores = Blosum62, gap cost = infinity

=======  =======  ======  ======  ======  ======  ======  ======  ======
Lambda   K        60      70      80      90      100     110     120
=======  =======  ======  ======  ======  ======  ======  ======  ======
0.318    0.134    6.9e+8  2.9e+7  1.2e+6  50000   2100    86      3.6
=======  =======  ======  ======  ======  ======  ======  ======  ======
