*** Assembly mapping ***

This document describes how to create a whole genome alignment between two
assembly versions of a genome.

All scripts used in this pipeline have a --help option to print the available
arguments. Arguments can be passed on the command line or read from an
ini-style configuration file specified with --conffile.
Be sure to supply a config file or update the default file
(sanger-plugins/vega/conf/ini-files/Conversion.ini), because paths to a logging
directory and a binary directory are expected.
Check the files in the logging dir for errors after every step.
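
Checking the logging dir by hand after every step is easy to forget, so here is
a minimal Python sketch that lists suspicious lines in all files below a log
directory (including subdirectories). It is not part of the pipeline. The
function name scan_logs and the search pattern are assumptions: the sketch
simply treats any line containing "error" or "warn" as worth a look, which may
not match what the scripts actually print.

```python
import re
from pathlib import Path

# Assumption: problems show up as lines containing "error" or "warn" (any
# case). Adjust the pattern to whatever your logs actually contain.
PROBLEM = re.compile(r"error|warn", re.IGNORECASE)


def scan_logs(logdir):
    """Return (file name, line number, text) for every suspicious log line."""
    hits = []
    # rglob("*") also descends into subdirectories of the log directory.
    for path in sorted(Path(logdir).rglob("*")):
        if not path.is_file():
            continue
        with path.open(errors="replace") as handle:
            for lineno, line in enumerate(handle, start=1):
                if PROBLEM.search(line):
                    hits.append((path.name, lineno, line.rstrip("\n")))
    return hits
```

Call scan_logs() with the path of your logging directory and print the
returned tuples; an empty list only means that the pattern did not match.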

For an example of how to *use* the mapping created by these scripts, please see
EXAMPLE.use_mapping.pl.


-------------------------------------------------------------------------------
Overview
-------------------------------------------------------------------------------

Creating the whole genome alignment between two assembly versions of a genome is
a two-step process. Alignments are first created by matching identical clones,
and the remaining region pairs are then aligned using blastz. This only works
well for assemblies without major rearrangements (especially no clones moved
between toplevel seq_regions). You can use a test script to get a picture of
your assembly differences:

  $ ensembl/misc-scripts/assembly/compare_assemblies.pl

Also note that the current version of the alignment code allows basepair
mismatches (but no insertions/deletions), so if you need perfect sequence
matches you'll have to post-process the projection results as required. Earlier
versions of this code didn't allow mismatches, which resulted in a very
fragmented mapping; in most use cases this wasn't desirable.
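
If you do need perfect matches, one possible post-processing step is to cut
each ungapped block at its mismatches. The Python sketch below illustrates the
idea only. The function name exact_subblocks, the 1-based inclusive coordinates
and the assumption that both sequences are given in the same orientation are
choices made for this example, not something the pipeline provides.

```python
def exact_subblocks(ref_seq, alt_seq, ref_start=1, alt_start=1):
    """Split one ungapped block into maximal runs of identical bases.

    Returns (ref_start, ref_end, alt_start, alt_end) tuples with 1-based,
    inclusive coordinates. Both sequences must have the same length, which
    holds for blocks that contain no insertions/deletions.
    """
    if len(ref_seq) != len(alt_seq):
        raise ValueError("ungapped block expected: sequences differ in length")
    blocks = []
    run_start = None  # offset at which the current run of matches began
    for offset, (r, a) in enumerate(zip(ref_seq.upper(), alt_seq.upper())):
        if r == a:
            if run_start is None:
                run_start = offset
        elif run_start is not None:
            # A mismatch closes the current run of matches.
            blocks.append((ref_start + run_start, ref_start + offset - 1,
                           alt_start + run_start, alt_start + offset - 1))
            run_start = None
    if run_start is not None:
        last = len(ref_seq) - 1
        blocks.append((ref_start + run_start, ref_start + last,
                       alt_start + run_start, alt_start + last))
    return blocks
```

For example, exact_subblocks("ACGTAC", "ACCTAC", 101, 201) returns
[(101, 102, 201, 202), (104, 106, 204, 206)]: the single mismatch at the third
base splits the block in two.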


-------------------------------------------------------------------------------
Creating the whole genome alignment
-------------------------------------------------------------------------------

0. Check dbs:
=============

For every toplevel coord_system you intend to include in the mapping, you will
need mapping paths to contig and clone, in both the reference and alternative
db. If a path is not present, add the appropriate entry to the meta table,
provided that the mapping actually exists. A commonly missing entry is the
3-way mapping between chromosomes, contigs and clones, e.g.

mysql> select * from meta where meta_key = 'assembly.mapping';
+---------+------------------+-----------------------------------------+
| meta_id | meta_key         | meta_value                              |
+---------+------------------+-----------------------------------------+
|      34 | assembly.mapping | chromosome:NCBIM37#contig               |
|     125 | assembly.mapping | chromosome:NCBIM37#contig#clone         |
|      37 | assembly.mapping | clone#contig                            |
+---------+------------------+-----------------------------------------+
[...only partial result shown...]

The entry with meta_id 125 is often missing. The 3-way mapping is then still
defined implicitly by the entries with meta_id 34 and 37, but the
AssemblyMapper can't use it without the explicit entry.
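
The check for a missing 3-way entry can be sketched in a few lines of Python.
The function name suggest_three_way is hypothetical, and the sketch only
recognises the exact spellings used in the example above (toplevel#contig and
clone#contig). It works on a plain list of meta_value strings that you have
fetched yourself with the query shown above, and it does not touch the
database.

```python
def suggest_three_way(meta_values, toplevel):
    """Return the 3-way meta_value that should be added, or None.

    meta_values: all meta_value strings stored under 'assembly.mapping'.
    toplevel:    coord_system as written in the meta table,
                 e.g. 'chromosome:NCBIM37'.

    None is returned both when the 3-way entry already exists and when the
    two pairwise entries it would be built from are not both present.
    """
    values = set(meta_values)
    three_way = f"{toplevel}#contig#clone"
    if three_way in values:
        return None
    if f"{toplevel}#contig" in values and "clone#contig" in values:
        return three_way
    return None
```

With the pairwise entries 34 and 37 from the example and entry 125 absent, the
function returns 'chromosome:NCBIM37#contig#clone', which is the value to add
to the meta table.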


1. Load alternative assembly:
=============================

To start the process, you'll have to load the alternative toplevel seq_regions
into the Ensembl database for further processing. The reference and alternative
databases need to be on the same db host.

a. Make sure you have all databases on the same host.

b. Then run the script:

  $ ensembl/misc-scripts/assembly/load_alternative_assembly.pl

The script creates backup tables of all tables that will subsequently be
modified, so that you can trace back errors if necessary.


2. Align identical clones:
==========================

In the first step of creating the alignment, clones with the same name and
version are matched directly and alignment blocks for these regions are
created. Clones can be tagged manually to be excluded from these direct matches
by listing them in a file of clones to skip (--skipclones argument). This can
be useful to get better results in regions with major assembly differences.

The result is stored in the assembly table as an assembly between the
toplevel seq_regions of both genome assemblies.

Non-aligned blocks are stored in a temporary table (tmp_align) and will be
aligned using blastz in the second step.

To run the script:

  $ ensembl/misc-scripts/assembly/align_by_clone_identity.pl
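
The core idea of this step, reduced to its bare minimum, looks roughly like the
Python sketch below. It is an illustration only and leaves out everything the
real script does with coordinates and alignment blocks. The function name
split_by_clone_identity and the input format are assumptions, as is the choice
to identify skipped clones by name alone.

```python
def split_by_clone_identity(ref_clones, alt_clones, skip=()):
    """Partition the alternative clones into matched and left-over ones.

    ref_clones, alt_clones: iterables of (name, version) tuples.
    skip: clone names to exclude from direct matching (the role played by
          the --skipclones file; matching by name only is an assumption).
    """
    skip = set(skip)
    ref_set = set(ref_clones)
    matched, leftover = [], []
    for clone in alt_clones:
        name, _version = clone
        # A direct match needs the same name AND the same version.
        if clone in ref_set and name not in skip:
            matched.append(clone)
        else:
            leftover.append(clone)
    return matched, leftover
```

Clones that end up in the second list correspond to the regions that are left
for the blastz step.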


3. Align non-identical regions:
===============================

The block pairs left unaligned by the clone identity step (stored in tmp_align)
are now aligned using blastz. Alignments are calculated by this algorithm:

  1. fetch region from tmp_align
  2. write soft-masked sequences to temporary files
  3. align using blastz
  4. filter best hits (for query sequences, i.e. alternative regions) using
     axtBest
  5. parse blastz output to create ungapped blocks (mismatches are allowed,
     insertions/deletions are not)
  6. remove overlapping target (reference) alignments
  7. write alignments to assembly table
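
Step 5 of the list above can be pictured as cutting a gapped pairwise alignment
at every gap. The Python sketch below shows that idea on two aligned strings.
It does not read blastz or axt output and is not the code used by
BlastzAligner.pm. The function name ungapped_blocks, the '-' gap character, the
1-based inclusive coordinates and the forward orientation of both sequences are
assumptions made for this example.

```python
def ungapped_blocks(ref_aln, alt_aln, ref_start=1, alt_start=1):
    """Cut a gapped pairwise alignment into ungapped blocks.

    ref_aln, alt_aln: aligned strings of equal length, '-' marks a gap.
    Returns (ref_start, ref_end, alt_start, alt_end) tuples with 1-based,
    inclusive coordinates. Mismatches stay inside a block, gaps end it.
    """
    if len(ref_aln) != len(alt_aln):
        raise ValueError("aligned strings must have the same length")
    blocks = []
    ref_pos, alt_pos = ref_start, alt_start  # next base on each sequence
    open_block = None  # (ref_start, alt_start) of the block being built
    for r, a in zip(ref_aln, alt_aln):
        if r != "-" and a != "-":
            if open_block is None:
                open_block = (ref_pos, alt_pos)
        elif open_block is not None:
            # A gap on either side closes the current block.
            blocks.append((open_block[0], ref_pos - 1,
                           open_block[1], alt_pos - 1))
            open_block = None
        # Advance only on the sequence(s) that have a base in this column.
        if r != "-":
            ref_pos += 1
        if a != "-":
            alt_pos += 1
    if open_block is not None:
        blocks.append((open_block[0], ref_pos - 1, open_block[1], alt_pos - 1))
    return blocks
```

For example, ungapped_blocks("ACGT--ACG", "ACCTTTA-G") returns
[(1, 4, 1, 4), (5, 5, 7, 7), (7, 7, 8, 8)]. The G/C mismatch in the first block
is kept, while each gap starts a new block.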

A wrapper script will run the actual script (align_nonident_regions.pl) one
chromosome at a time via LSF:

  $ ensembl/misc-scripts/assembly/align_nonident_regions_wrapper.pl

After running this, check all logfiles for errors (also the lsf logs in the lsf/
subdirectory of your logpath).


4. Remove overlaps:
===================

The previous script may generate overlapping mappings which need to be removed
(otherwise the AssemblyMapper may break when using the mapping). You need to run
the script:

  $ ensembl/misc-scripts/assembly/fix_overlaps.pl
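
To see what an overlapping mapping means in terms of coordinates, the following
Python sketch reports all pairs of overlapping blocks on one seq_region. It
only detects overlaps and makes no attempt to resolve them, and it is not a
description of how fix_overlaps.pl works. The function name find_overlaps and
the (start, end) tuple format with 1-based inclusive coordinates are
assumptions.

```python
def find_overlaps(blocks):
    """Return all pairs of blocks whose coordinate ranges overlap.

    blocks: (start, end) tuples on a single seq_region, 1-based inclusive.
    """
    ordered = sorted(blocks)
    overlaps = []
    for i, (start, end) in enumerate(ordered):
        for other in ordered[i + 1:]:
            if other[0] > end:
                break  # sorted by start: no later block can overlap this one
            overlaps.append(((start, end), other))
    return overlaps
```

An empty result for both the reference and the alternative coordinates of a
seq_region is what you would expect to see after the overlaps have been
removed.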


5. QC:
======

a. mapping stats:
-----------------

This script prints some statistics about the alignment, like the alignment
coverage and length of alignment blocks.

  $ ensembl/misc-scripts/assembly/mapping_stats.pl
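
The kind of numbers involved can be illustrated with a short Python sketch.
This is not the implementation of mapping_stats.pl and its output will differ.
The function name mapping_stats, the dictionary keys and the input format
(non-overlapping (start, end) blocks on one seq_region plus the length of that
seq_region) are all assumptions.

```python
def mapping_stats(blocks, region_length):
    """Summarise alignment blocks on one seq_region.

    blocks: non-overlapping (start, end) tuples, 1-based inclusive.
    region_length: length of the seq_region in basepairs (must be > 0).
    """
    lengths = sorted(end - start + 1 for start, end in blocks)
    aligned = sum(lengths)
    return {
        "blocks": len(lengths),
        "aligned_bp": aligned,
        "coverage_pct": 100.0 * aligned / region_length,
        "shortest_block": lengths[0] if lengths else 0,
        "longest_block": lengths[-1] if lengths else 0,
    }
```

A low coverage combined with many short blocks is a hint that the two
assemblies differ more than this pipeline handles well.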


b. check if mapping is correct:
-------------------------------

This script checks if the whole genome alignment between two assemblies is
correct. It does so by comparing the sequence in the reference database with
the sequence of the projected fragments in the alternative database.

  $ ensembl/misc-scripts/assembly/check_mapping.pl


6. Temporary table cleanup:
===========================

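Finally, once you are happy with the results, delete all temporary and backup
tables that are no longer needed. The statements for this are collected in an
SQL file, which has to be run against the database (e.g. through the mysql
client) rather than executed as a script:

  ensembl/misc-scripts/assembly/cleanup_tmp_tables.sql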
  

-------------------------------------------------------------------------------
Scripts and main modules used
-------------------------------------------------------------------------------

  ensembl/modules/Bio/EnsEMBL/Utils/ConversionSupport.pm

  ensembl/misc-scripts/assembly/compare_assemblies.pl

  ensembl/misc-scripts/assembly/load_alternative_assembly.pl
  ensembl/misc-scripts/assembly/align_by_clone_identity.pl
  ensembl/misc-scripts/assembly/align_nonident_regions_wrapper.pl
  ensembl/misc-scripts/assembly/align_nonident_regions.pl
  ensembl/misc-scripts/assembly/AssemblyMapper/BlastzAligner.pm
  ensembl/misc-scripts/assembly/fix_overlaps.pl

  ensembl/misc-scripts/assembly/mapping_stats.pl
  ensembl/misc-scripts/assembly/check_mapping.pl

  ensembl/misc-scripts/assembly/cleanup_tmp_tables.sql


-------------------------------------------------------------------------------
Known bugs
-------------------------------------------------------------------------------

1. PAR regions:
===============

When projecting from a symlinked region of a PAR, you will end up in the
reference region (e.g. projecting from human chromosome Y gets you to X). The
coordinates of the projected slice will therefore also reflect the reference
region and might not be what you expect. Since features should only be stored
once (on the reference region) anyway, these projections should be ignored.


2. Log output of align_by_clone_identity.pl:
============================================

Sometimes the clone counts for directly aligned blocks in the logfile are
wrong. Note that only the log output is wrong; the actual data generated is
correct.
  

If you find other bugs please report them to the Ensembl development mailing
list <dev@ensembl.org> or to Patrick Meidl <meidl@ebi.ac.uk>.


-------------------------------------------------------------------------------
Further documentation
-------------------------------------------------------------------------------

Each script involved is commented and has detailed information in POD format.

A datareview presentation describing this pipeline can be found at

  ensembl-personal/datareviews/presentations/meidl20060413.ppt

(internal users only)

