August 18, 2004  -  Graham McVicker

This README describes the human2chimp.pl program.


DESCRIPTION
===========

The human2chimp.pl program is used to transfer human genes onto
chimpanzee sequence using a whole genome alignment.

In theory this program is not restricted to use between human and
chimpanzee and could be used to transfer genes between any two
assemblies with a good whole genome alignment.  For notational
convenience the sequences and databases are still refered to as human
and chimp databases.

The program can finish within the span of a few hours on a single
processor.


EXECUTION
=========

The program takes many command line arguments.  Most of these
arguments are database connection parameters for the three databases
used by the script.  See the DATABASES section of the README for
details on what the databases are.

  Human database arguments:
    -hdbname, -hhost, -huser, -hpass, -hport, -hassembly

  Chimp database arguments:
    -cdbname, -chost, -cuser, -cpass, -cport, -cassembly

  Destination database arguments:
    -ddbname, -dhost, -duser, -dpass, -dport

  Other arguments:
    -store   -  Flag. Provided if genes are to be stored.  If
                this flag is not provided genes will be created but not stored.
                Not storing genes is useful for testing and debugging, and 
                removes the requirement for providing destination database
                arguments.

    -help    - Flag.  If provided usage information is printed to STDERR.
    
    -logfile - file to log status messages to.  See the LOGGING section of
               the README for more information

    -verbose - Flag. If provided progress and diagnostic information
               are printed to STDERR during execution

It may be easier to prepare a shell script to execute the program
rather than typing in all of the command line arguments.  The
ensembl/misc-scripts/chimp/run_hum2chimp.sh script is an example of
how this can be done.

*important*
The program requires the use of the API's ChainedMapper even when there is a
a 2-step mapping path.  This is because the ChainedMapper correctly deals
with a component sequence regions which map to multiple assembled sequence
regions whereas the standard AssemblyMapper does not (for speed reasons).

The use of the ChainedMapper currently needs to be hardcoded. It was
intended to make the mapper used configurable with database meta data
but this has not been done yet. The following shows lines in the 
Bio::EnsEMBL::DBSQL::AssemblyMapperAdaptor module that must be changed 
prior to running the human2chimp.pl program.  The $asm_mapper variable 
is set to a ChainedAssemblyMapper rather than a normal AssemblyMapper.  
At the time of writing these were lines 183-192:

  if(@mapping_path == 2) {
    #1 step regular mapping
#    $asm_mapper = Bio::EnsEMBL::AssemblyMapper->new($self, @mapping_path);

#   If you want multiple pieces on two seqRegions to map to each other
#   uncomment following. AssemblyMapper assumes only one mapped piece per contig
   $asm_mapper = Bio::EnsEMBL::ChainedAssemblyMapper->new( $self, $mapping_path[0], undef, $mapping_path[1] );
    $self->{'_asm_mapper_cache'}->{$key} = $asm_mapper;
    return $asm_mapper;
  }



ALGORITHM
=========

The program steps through each human gene in the human database.  Each
transcript of each gene is looped over.  Each human exon of each human
transcript is then looped over and mapped to the chimpanzee one at a
time.  Insertions and deletions are processed and recorded as they are
encountered along each exon.  Gaps in the alignment are considered to
be the same as deletions/insertions.  Human reading frame is
maintained through the introduction of 1-2bp "frameshift" introns when
a CDS insertion/deletion which is not a multiple of 3 is encountereed.

After the exons of a human transcript are transfered chimp transcripts
are constructed from them.  An exons is discarded from a transcript if it:
  * contains a long in-frame CDS deletion
  * contains a long in-frame CDS insertion
  * contains a medium frameshifting CDS deletion
  * contains a medium frameshifting CDS insertion
  * crosses strands
  * contains an inversion
  * spans multiple sequence regions (e.g. chromosomes)
  * contains a difficult to resolve situation (e.g. an insertion which 
    would be in an introduced frameshift intron)
  * completely failed to be transfered (e.g. was entirely deleted)

These fatal exon criteria are defined by the @FATAL array in the
InterimExon module and by the InterimExon::fail() getter/setter
method.

The definitions of short, medium and long lengths are defined in the Length
module.  These are currently set to:
  SHORT  < 9  bp
  MEDIUM 9-48 bp
  LONG   > 48 bp

Where a transcript would contain a discarded exon the exon is removed
and the transcript is split into two at that point (or left as a
single transcript if the discarded exon was the first/last exon of the
transcript).

A transcript is also split into multiple transcripts if it:
  * spans multiple sequence regions (e.g. chromosomes)
  * contains exons out of order (an inversion)
  * contains exons on different strands
  * contains an intron greater than MAX_INTRON_LEN in length

MAX_INTRON_LEN is defined in the Transcript module and is currently
set to 2Mb.

After all transcripts of a gene are transfered several steps occur.
Transcripts which now contain stop codons are made into pseudogenes.
Transcripts which are now duplicates are merged into a single
transcript.  All transcripts which are close together and from the
same human gene are grouped together in the same chimp gene.
Transcripts are considered to be close together if they are on the
same strand of the same sequence region and within Gene::NEAR bp of
each other.  Gene::NEAR is currently defined as 1.5Mb.

The total number of amino acids and nucleotides in all transcripts of
each newly created chimp gene is calculated and genes with less then
15 amino acids and less than 600 nucleotides are discarded.  This is
to prevent the creation of very small fragmentary genes as a result of
a spurious aligment region.



DATABASES
=========

The program uses three databases:
  * A human database to retrieve genes from
  * A chimp database containing chimpanzee sequence and an alignment 
    between the human and chimpanzee sequences.
  * A destination database where the newly created chimpanzee genes are 
    to be stored.  The destination database and the database of 
    chimpanzee sequence can optionally be made the same database.

The human database should be a normal ensembl v20+ database containing
a set of human genes.  If the stable identifiers have not yet been
loaded into this database some modifications may be necessary to the
script.

The chimp database must contain a chimpanzee assembly and sequence.
Additionally, the alignment between the chimpanzee and human sequences
must be loaded into the assembly table.  The human seq_regions must be
loaded into the seq_region table with the same names and internal
identifiers that are present in the human database.


ALIGNMENT
=========

The chimp database must contain an assembly table loaded with a whole
genome alignment between the chimpanzee and human sequence regions.
Previously, blastz alignments were imported from UCSC.  The
ensembl/misc-scripts/chain/chain2assembly.pl or
ensembl/misc-scripts/axt2assembly.pl programs can be used to parse
UCSC formats.

Once the alignment is loaded into the assembly table and the both the
human and chimp seq_regions are present in the seq_region table an
appropriate assembly.mapping entry must be added to the meta table.
The following is a sample coord_system and meta table.  The NCBI34
assembly is the human assembly and the BROAD1 is the chimp assembly:


mysql> select * from coord_system;
+-----------------+------------+---------+------+--------------------------------+
| coord_system_id | name       | version | rank | attrib                         |
+-----------------+------------+---------+------+--------------------------------+
|               1 | chromosome | BROAD1  |    1 | default_version                |
|               2 | scaffold   | BROAD1  |    2 | default_version                |
|               3 | contig     | NULL    |    4 | default_version,sequence_level |
|               4 | chromosome | NCBI34  |    3 | default_version                |
+-----------------+------------+---------+------+--------------------------------+

mysql> select * from meta;
+---------+------------------------+-------------------------------------+
| meta_id | meta_key               | meta_value                          |
+---------+------------------------+-------------------------------------+
|       1 | schema_version         | $Revision: 1.2 $                  |
|       3 | assembly.default       | BROAD1                              |
|       4 | species.taxonomy_id    | 9598                                |
|       5 | species.classification | troglodytes                         |
|       6 | species.classification | Pan                                 |
|       7 | species.classification | Hominidae                           |
|       8 | species.classification | Catarrhini                          |
|       9 | species.classification | Primates                            |
|      10 | species.classification | Eutheria                            |
|      11 | species.classification | Mammalia                            |
|      12 | species.classification | Vertebrata                          |
|      13 | species.classification | Chordata                            |
|      14 | species.classification | Metazoa                             |
|      15 | species.classification | Eukaryota                           |
|      16 | species.common_name    | Chimpanzee                          |
|      17 | assembly.maxcontig     | 1000000000                          |
|      18 | assembly.mapping       | scaffold:BROAD1|contig              |
|      22 | assembly.mapping       | chromosome:BROAD1|scaffold:BROAD1   |
|      27 | assembly.mapping       | chromosome:BROAD1|contig            |
|      26 | assembly.mapping       | chromosome:NCBI34|chromosome:BROAD1 |
+---------+------------------------+-------------------------------------+
20 rows in set (0.38 sec)



*important* The alignment which is loaded into the assembly table must
be best-reciprocal.  One human region cannot align to multiple
chimpanzee regions or vice-versa.  One human region must align to only
one chimpanzee region.

In order to improve performance it may be necessary to "raise" the
alignment so that only one step mapping between the genomes is
necessary.  For example if the original alignment was between human
chromosomes to chimpanzee scaffolds but the genes were built on
chimpanzee chromosomes two step mapping would be required:

  human chromosomes -> chimp scaffolds -> chimp chromosomes

By pushing the alignment up so that it was between human chromosomes
and chimp chromosomes only faster two step mapping would be required:

  human chromosomes -> chimp chromosomes

For an example of how to push up an alignment in the form of a dumped
assembly table see the
ensembl/misc-scripts/chimp/chimp/lift_assembly.pl script.


STABLE IDENTIFIERS
==================

Stable identifiers are created on the fly by the
Gene::generate_stable_id method.  Currently this method starts with
hardcoded stable ids of the form ENSPTR(G|T|P|E)00000000001 and
increments them as genes, transcripts, translations and exons are
stored.  This is unlikely to be appropriate for the creation of a
second chimp gene set, or for the creation of another species' gene
set.


PROTEIN FEATURES
================

Protein features are not transfered by this program.  They must be
generated by the protein feature pipeline after the gene set is
created.

XREFS
=====

Only some xrefs are currently transfered by this program.  The names
of the databases with transfered xrefs are defined by the
%Gene::KEEP_XREF hash.  Currently xrefs from the SWISSPROT, SPTREMBL
and HUGO databases are the only xrefs which are transfered.


LOGGING
=======

The human2chimp.pl program logs events to a file specified by the
-logfile command line argument.  Numbers are written to this file
which are bitvectors of combined status message.  Use of the
bitvectors allows stats to be compiled for certain classes of events.
For example all messages relating to entire transcripts have been
xor'd with the StatMsg::TRANSCRIPT bit flag.  The
ensembl/misc-scripts/chimp/get_stats.pl script is an example program
that can be used to extract information from the log files.  The
following is an example of usage and output

UNIX> cat chimp.final.log | perl get_stats.pl
Transcript Summary:
  Split Events = 8681
  Translates (Partial/Entire) = 30005
    Partial = 835
    Entire = 29170
  Doesn't Translate = 2792
    Partial = 835
    Entire = 1957
  No Sequence Left = 1527
  No CDS Left = 1671

Exon Summary:
  Split Events = 22256
  Inserts = 40844
    Long = 2413
    Medium = 2641
    Short = 35790
    CDS = 17086
      Frameshifting = 14593
        Long = 985
        Medium = 397
        Short = 13211
    UTR = 23516
  Deletes = 63388
    Entire = 1231
    Long = 25754
    Medium = 7292
    Short = 30342
    CDS = 35776
      Frameshifting = 17376
        Long = 6862
        Medium = 2213
        Short = 8301
    UTR = 27612
  Confused Events = 251

Useful status and debugging output will also be printed to STDERR if
the -verbose command line option is provided.


BUGS
====

* UTRs are allowed to grow to an arbitrary size.  This results in a few
  extremely large transcripts when there is a *very* large insertion in a
  UTR.  This should be changed, probably through an addition to the 
  @FATAL hash in the InterimExon module, and possibly an additional length
  classification in the Length.pm module.

* Transcripts with stop codons are made into pseudogenes even if the stop
  occurs within the last few amino acids.  It might be an idea to keep these
  transcripts and simply shorten the CDS by a few bases.

* The "confusing" exon situation occasionally occurs.  This is due to a 
  complicated situation where an insert immediately follows the introduction
  of a frameshift intron.  Currently the exon is discarded, but it could be
  kept with some tricky changes to the code.

* There are probably other bugs as well...

