Aligning bisulfite-converted DNA reads to a genome
==================================================

Bisulfite is used to detect methylated cytosines.  It converts
unmethylated Cs to Ts, but it leaves methylated Cs intact.  If we then
sequence the DNA and align it to a reference genome, we can infer
cytosine methylation.

To align the DNA accurately, we should take the C->T conversion into
account.  Here is how to do it with LAST.

Let's assume we have bisulfite-converted DNA reads in a file called
"reads.fastq" (in fastq-sanger format), and the genome is in
"mygenome.fa" (in fasta format).  We will also assume that all the
reads are from the converted strand, and not its reverse-complement
(i.e. they have C->T conversions and not G->A conversions).
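To illustrate the difference between these two kinds of read, here is
a minimal sketch using standard Unix tools.  The 8-base sequence is
made up, and the sketch assumes that no cytosine is methylated, so
every C gets converted.

```shell
# Hypothetical genome fragment (forward strand); assume no C is methylated.
genome=ACGTCCGA

# A read from the converted strand: every C appears as T.
converted=$(printf '%s\n' "$genome" | tr C T)

# Reverse-complements of the converted read and of the genome fragment.
converted_rc=$(printf '%s\n' "$converted" | rev | tr ACGT TGCA)
genome_rc=$(printf '%s\n' "$genome" | rev | tr ACGT TGCA)

echo "genome, forward strand:  $genome"
echo "converted read:          $converted   (differs by C->T)"
echo "genome, reverse strand:  $genome_rc"
echo "reverse-complement read: $converted_rc   (differs by G->A)"
```

The recipe below is for reads like the "converted read" here, not
like the "reverse-complement read".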

First, we need to run lastdb twice, for forward-strand and
reverse-strand alignments:
  lastdb -u bisulfite_f.seed my_f mygenome.fa
  lastdb -u bisulfite_r.seed my_r mygenome.fa

Then find alignments, one strand at a time:
  lastal -p bisulfite_f.mat -s1 -Q1 -d108 -e120 my_f reads.fastq > temp_f
  lastal -p bisulfite_r.mat -s0 -Q1 -d108 -e120 my_r reads.fastq > temp_r

Finally, merge the alignments and estimate which one represents the
genomic source of each read:
  last-merge-batches.py temp_f temp_r | last-map-probs.py -s150 > myalns.maf

These commands refer to files (bisulfite_f.seed, bisulfite_r.seed,
bisulfite_f.mat and bisulfite_r.mat), which are in the examples
directory.  You need to specify exactly where they are
(e.g. "-u examples/bisulfite_f.seed").

Explanation of the parameters
-----------------------------

The options "-u bisulfite_f.seed" and "-p bisulfite_f.mat" enable
accurate forward-strand alignments.  Likewise, "-u bisulfite_r.seed"
and "-p bisulfite_r.mat" enable accurate reverse-strand alignments.
Option "-s1" tells lastal to find forward-strand alignments only, and
"-s0" reverse-strand alignments only.  The lastal options -Q1 -d108
-e120 and the last-map-probs.py option -s150 are the same as in
last-map-probs.txt: please see the explanation there.

Avoiding biased methylation estimates
-------------------------------------

Imagine that one genomic cytosine is methylated in 50% of cells in
your sample, so that 50% of reads covering it have C and 50% have T.
It is possible that the reads with C are easier to align, so that we
align more of them than of the reads with T.  The estimated
methylation rate would then be higher than the true 50%.
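A small calculation shows how large the bias could be.  The
alignment success rates used here (90% of reads with C, 80% of reads
with T) are invented purely for illustration: they are not measured
values.

```shell
# Hypothetical numbers: true methylation rate, and the fractions of
# C-containing and T-containing reads that get aligned.
apparent=$(awk 'BEGIN {
  meth = 0.5; aligned_c = 0.9; aligned_t = 0.8
  c = meth * aligned_c          # aligned reads showing C
  t = (1 - meth) * aligned_t    # aligned reads showing T
  printf "%.1f", 100 * c / (c + t)
}')
echo "apparent methylation rate: ${apparent}%"
```

With these made-up rates, the site would appear to be about 52.9%
methylated instead of 50%.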

We can avoid this bias by converting all Cs in the reads to Ts, before
aligning, like this:

  perl -pe 'y/C/t/ if $. % 4 == 2' reads.fastq | lastal ... my_f - > temp_f
  perl -pe 'y/C/t/ if $. % 4 == 2' reads.fastq | lastal ... my_r - > temp_r

These perl commands assume that the fastq files have no line-wrapping
or blank lines.  Also, they convert Cs to lowercase Ts: lowercase has
no effect on the alignment, but it lets you see where the Cs were in
the output (assuming the reads were all uppercase to start with).

Aligning reads in chunks
------------------------

Rather than align 1 billion reads all at once, it's probably better to
align them in chunks of, say, 1 million reads per chunk.  This has two
advantages: it avoids huge temp files, and you can align the chunks in
parallel.
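
One possible way to make the chunks is the standard "split" command,
sketched here on a tiny made-up file.  It assumes 4 lines per read
(no line-wrapping or blank lines); the directory, the file contents
and the "chunk_" prefix are arbitrary choices for illustration.

```shell
# Tiny made-up input: 3 reads, 4 lines each.
workdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nCCGG\n+\nIIII\n@r3\nTTCA\n+\nIIII\n' \
  > "$workdir/reads.fastq"

# 2 reads per chunk here; for 1 million reads per chunk, use 1000000.
reads_per_chunk=2
split -l $((reads_per_chunk * 4)) "$workdir/reads.fastq" "$workdir/chunk_"

# Show how many lines ended up in each chunk (chunk_aa, chunk_ab, ...).
wc -l "$workdir"/chunk_*
```

Each chunk can then be given to the lastal commands in place of
reads.fastq, with a separate pair of temp files per chunk.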
