ENSEMBL - API Change Specification
==================================

CONTENTS
--------

Introduction
Goals
Schema Modifications
  Proposed New/Modified Tables
    seq_region
    coord_system
    seq_region_annotation
    dna
    assembly
    gene
    transcript
    translation
    all feature tables
    meta_coord
    misc_feature
    misc_set
    misc_feature_misc_set
    misc_attrib
  Removed Tables
    contig
    clone
    chromosome
Meta Information
API Changes
  Slice
  Tile
  SliceAdaptor
  RawContig
  RawContigAdaptor
  Clone
  CloneAdaptor
  Chromosome
  ChromosomeAdaptor
  Root
  Storable Base Class
  Features
    transform
    transfer
    move
    project
  StickyExon
  AssemblyMapper
  FeatureAdaptors
  CoordSystemAdaptor
New Features
  Assembly Exceptions
  Haplotypes
  Pseudo Autosomal Regions
  Multiple Assemblies
Other Considerations
  Loci  


INTRODUCTION
------------

This document describes the changes that are being made to the EnsEMBL core
schema and Perl/Java/C APIs.

GOALS
-----
-A cleaner, more intuitive API
-A more general schema able to better capture divergent assembly types
-More flexibility with regards to assembly related data such as haplotypes,
 PARs, WGS assemblies etc.

SCHEMA MODIFICATIONS
--------------------

Proposed New/Modified Tables:
-----------------------------

  seq_region
  ----------
  The seq_region table is a generic replacement for the clone, contig, 
  and chromosome tables.  Additionally supercontigs which were formerly in the
  assembly table are also present in this table.  The name column can contain
  chromosome names, clone accessions, supercontig names or anything that is
  appropriate for the seq_region it describes.  The coord_system_id is a 
  foreign key to the new coordinate system and is used to distinguish 
  between the divergent types of sequence regions in the table.

  seq_region_id    int
  name             varchar
  coord_system_id  int         references coord_system table
  length           int


  coord_system
  ------------
  The coordinate system table lists the available coordinate systems in the
  database.  The attrib is mysql set and is used to denote the default version
  of each named coordinate system. E.g. there may be two 'chromosome' coordiate
  systems and the default may be version 'NCBI34'.  The 'top_level' and 
  sequence level attribs denote the coordinate system from which sequence is
  retrieved and the coordinate system which has the largest assembled pieces.
  The top_level coordinate system will usually be 'chromosome' but for some
  shrapnel assemblied this may be something like 'supercontig' or 'clone'.

  There may be multiple toplevel coordinate systems providing that they share
  the same name (but different version) and providing one of them is the 
  default.  There may only be a single sequence level coordinate system.

  Note that the version in the coordinate system can be viewed as applying
  to every seq_region of a given coordinate system.  It is analagous to 
  a CVS tag, not a CVS version.  E.g. The version would  'NCBI33' apply to 
  every chromosome seq_region so it is a valid version.  A clone accession of 
  '8' would not be a valid version because it only describes a particular
  seq_region of the coordinate system - not all of them.

  coord_system_id   int
  name              varchar
  version           varchar
  attrib            set ('top_level', 'default_version', 'sequence_level')          


  seq_region_annotation
  ---------------------
  This table allows for extra arbitrary information to be attached to 
  seq_regions. For example the htg_phase was formerly part of the clone table
  but now is stored in this table.
  
  seq_region_id   int
  attrib_type_id  smallint          references attrib_type table
  value           varchar


  dna
  ---
  Formerly the contig table referenced the dna table.  Now the dna table 
  refrences the seq_region_table.  Every seq_region which has a coordinate
  system with the 'sequence_level' attrib should be referenced by an entry in
  the dna table.

  seq_region_id  int 
  sequence       varchar


  assembly
  --------
  The assembly table has been made more generic.  Columns that previously
  were names chr_* and contig_* have been renamed asm_* and cmp_* (assembled
  and component) respectively.   The superctg_name column has been removed.
  Supercontigs are now defined in the seq_region table.

  The makeup of all seq_regions from smaller seq_regions can be described in
  this table.  The relationships which are explicitly defined must be listed 
  in the meta table.  For example, the clone <-> contig mapping used to be
  defined in the contig table with an embl_offset column. This information is
  now found in this table instead. 

  asm_seq_region_id  int
  asm_start          int
  asm_end            int
  cmp_seq_region_id  int
  cmp_start          int
  cmp_end            int
  ori                tinyint

  gene
  ----
  For faster retrieval and retrieval independently of transcripts and 
  exons, genes have a seq_region_id, seq_region_start and seq_region_end
  which defines the span of their transcript.

  The transcript_count column has been removed as it was never used.
  
  gene_id             int
  type                varchar
  analysis_id         int
  seq_region_id       int
  seq_region_start    int
  seq_region_end      int
  seq_region_strand   tinyint
  display_xref_id     int


  transcript
  ----------
  For faster retrieval and retrieval independently of genes and exons
  transcripts also have a seq_region_id, seq_region_start and 
  seq_region_end. The translation_id has been removed; translations will point 
  to transcripts instead (and pseudogenes will have no translation).  
 
  The exon_count column has been removed as it was never used.

  transcript_id      int
  gene_id            int 
  seq_region_id      int
  seq_region_start   int
  seq_region_end     int
  seq_region_strand  tinyint
  display_xref_id    int

  
  translation
  -----------
  Translations now reference transcripts rather than transcripts referencing
  a single (or no) translation.  This allows for more elegant handling of 
  pseudogenes (where there is no translation) and also can be used to supply
  multiple translations for a single transcript (e.g. polycistronic genes).

  translation_id   int
  transcript_id    int
  start_exon_id    int
  end_exon_id      int
  seq_start        int
  seq_end          int


  all feature tables
  ------------------
  All feature tables would now have seq_region_id, seq_region_start, 
  seq_region_end, seq_region_strand instead of contig_id, contig_start,
  contig_end.  This includes the repeat_feature, simple_feature, 
  dna_align_feature, protein_align_feature, exon, marker_feature,
  karyotype and qtl_feature tables.

  meta_coord
  ----------
  The meta coord table defines what coordinate systems are used to store each
  type of feature.  A given type of feature may be stored in multiple
  coordinate systems, but these will not be retrieved by the API unless there
  is an entry in the meta_coord table.

  table_name       varchar
  coord_system_id  int


  misc_feature
  ------------
  This is a renaming of the mapfrag table. The renaming reflects the fact that
  this table can be used to store any type of feature.

  misc_feature_id
  seq_region_id
  seq_region_start
  seq_region_end
  seq_region_strand

  misc_set
  --------
  This table was formerly names mapset.  It defines 'sets' that can be used
  to group misc_features together.

  misc_set_id  smallint
  code         varchar
  name         description
  description  text
  max_length   int

  misc_feature_misc_set
  ---------------------
  This is a link table defining the many-to-many relationship between the 
  misc_set and misc_feature tables.

  misc_feature_id   int
  misc_set_id       smallint


  misc_attrib
  -----------
  This table was formerly named mapfrag_annotation.  It contains arbitrary
  annotations of misc_features and links to the same attrib_type table that
  the seq_region_attrib table uses.

  misc_feature_id  int
  attrib_type_id   smallint
  value            varchar


  
Removed Tables
--------------

  contig
  ------
  Contigs are no longer needed.  They are stored as entries in the seq_region
  table with type 'contig'.  The embl_offset and clone_id will not be
  necessary as their relationship to clones can be described by the 
  assembly table.

  clone
  -----
  Clones are no longer needed.  Clones are stored as entries in the seq_region 
  table with coord_system 'clone'.  The modified timestamp will be discarded 
  as it is no longer maintained anyway.  The embl_acc, version, and 
  embl_version columns are redundant and will also be discarded.  Versions
  are simply appended onto the end of the name with a delimiting '.'. 

  Any additional information that needs to be present (such as htg_phase) can 
  be added to the seq_region_attrib table.

  chromosome
  ----------
  This table is no longer needed.  Chromosomes can be stored in the 
  seq_region table with a 'chromosome' coord_system.



META INFORMATION
----------------

Considerable more meta information is stored in the core
databases in order for the general approach to be maintained.  
This information is stored in the new coord_system table and in the 
meta, and meta_coord tables.

Meta information includes the following:

  * The coordinate system that features of a given type are stored in.  This
    information is stored in the meta_coord table and is used when constructing
    queries for a particular feature table.

  * The top-level coordinate system. For human
    this would be 'chromosome'.  For briggsae this may be something like
    'scaffold' or 'super contig'.  This information would be used to construct
    the web display and would possibly be the default coordinate system when 
    a coordinate system is unspecified by a user. This is stored as a flag
    in the coord_system table.

  * The default version of each coordinate system.  This is stored as a flag
    in the coord_system table.

  * The coord_system where sequence is stored.  This will be stored as a
    flag in the coord_system table.  Initially it will only be possible
    to have a single coord_system in which sequence is stored.  This 
    may be extended in the future to allow sequence to be stored for multiple
    coord_systems.

  * The coordinate system relationships between that are explicitly defined 
    in the assembly table.  The new API is capable of 2 step (implicit) mapping
    between coordinate systems, but these relationships can be determined
    through the direct relationship information.
 
    For example the clone, chromosome and nt_contig coordinate systems may all
    be constructed from the contig coordinate system:
      contig -> clone
      contig -> chromosome
      contig -> nt_contig
    Or there may be a more hierarchical approach:
      contig    -> clone
      clone     -> nt_contig
      nt_contig -> chromosome
    This information is stored in the meta table under the key 
   'assembly.mapping' with the following format (versions are optional):
    assembled_coord_system_name[:version]|component_coord_system_name[:version]

    For example the meta table for human might contain the following entries:
    mysql> select * from meta where meta_key = 'assembly.mapping';
     +---------+------------------+--------------------------+
     | meta_id | meta_key         | meta_value               |
     +---------+------------------+--------------------------+
     |      43 | assembly.mapping | chromosome:NCBI33|contig |
     |      44 | assembly.mapping | clone|contig             |
     |      45 | assembly.mapping | supercontig|contig       |
     +---------+------------------+--------------------------+

   * The names of the allowable coordinate systems.  This would allow for 
     quick validation of API requests and provide a list that could be used
     by the website for coordinate system selection.  This information will be
     stored in the coord_system table.

   * The coordinate system(s) that each feature type is stored in. This is
     stored in the meta_coord table.


API CHANGES
-----------

Slice
-----
  Slice methods chr_start, chr_end, chr_name will be renamed start, end, 
  seq_region_name.  For backwards compatibility the old methods are 
  chained to the new methods with deprecated warnings. 

  A new slice method 'coord_system' will be added and will return a
  Bio::EnsEMBL::CoordSystem object.

  Slices will represent a region on a seq_region as opposed to a region on a
  chromosome.  Slices will be immutable (i.e. their attributes will not be
  changeable).  A new slice will have to be created if the attributes are to
  be changed.

  The following attributes will therefore define a unique slice:
  coord_system    (e.g. object with name and version)
  seq_region_name (e.g. 'X' or 'AL035554.1')
  start           (e.g. 1000000 or 1)
  end             (e.g. 2000001 or 800)
  strand          (e.g. 1 or -1)

  The name method will return the above values joined by a ':' delimiter, and
  will not be settable:
  e.g.  'chromosome:NCBI33:X:1000000:2000001:1' or 'clone::AL035554.1:1:800:-1'
  This value can be used as a hashvalue that uniquely defines a slice.

  The concept of an 'empty' slice will no longer exist.

  The get_tiling_path method will be deprecated in favour of a more general
  method project().  Whereas get_tiling_path() implies a relationship between 
  an assembly and the coordinate system which makes up the assembly the
  project method will allow conversion accross any two coordinate systems.
  It will take a coord_system string as an argument and rather 
  than returning a list of Tile objects it will return a listref of triplets 
  containing a start int, and end int, and a 'to' slice object.  The following 
  is an example of how this method would be used ($clone is a reference to a 
  slice object in the clone coordinate system):

    my $clone_path = $slice->project('clone');

    foreach my $segment (@$clone_path) {
      my ($start, $end, $clone) = @$segment;
      print $slice->seq_region_name, ':', $start, '-', $end , ' -> ',
            $clone->seq_region_name, ':', $clone->start, '-', $clone->end, 
            $clone->strand, "\n";
    }

    An optional second argument to project() will be the coordinate
    system version.  E.g.:
     $ncbi34_path = $slice->project('chromosome','NCBI34').


Tile
----
  The tile object will no longer be necessary.  However for backwards
  compatibility it will remain in the system for some time before being phased
  out along with the get_tiling_path method.


SliceAdaptor
------------
  The Slice adaptor must provide a method to fetch a slice via its coordinate
  system, seq_region_name, start, end, and strand.  
  The old, commonly used method fetch_by_chr_start_end has been altered to 
  simply chain to this new method (with a warning) as do most other 
  SliceAdaptor methods.

  Another method which is necessary with the disapearence of the Clone,
  RawContig and Chromosome adaptors is one which allows for all slices
  of a certain type to be retrieved.  For example it is often necessary to 
  retrieve all chromosomes, or clones for a species.  This method is simply
  named fetch_all.  The old fetch_all methods on the ChromosomeAdaptor, 
  RawContigAdaptor, CloneAdaptor, etc. chain to the new method for backwards 
  compatibility.

  Method Names and Signatures
  ---------------------------
    Slice fetch_by_region(coord_system, name)
    Slice fetch_by_region(coord_system, name, start)
    Slice fetch_by_region(coord_system, name, start, end)
    Slice fetch_by_region(coord_system, name, start, end, strand)
    Slice fetch_by_region(coord_system, name, start, end, strand, version)
    listref of Slices fetch_all(coord_system)
    listref of Slices fetch_all(coord_system, version)
  
RawContig
---------
  The RawContig object is no longer necessary with the new system.  RawContigs
  are replaced by Slices with coord_system = 'contig'. In the interests of 
  backwards compatibility the RawContig class will still be present for
  sometime as a minimal implmentation inheriting from the Slice class.


RawContigAdaptor
----------------
  The RawContigAdaptor is no longer necessary.  The RawContigAdaptor is 
  replaced by the SliceAdaptor.  For backwards compatibility a minimal 
  implementation of the RawContigAdaptor will remain which inherits from the 
  SliceAdaptor.

Clone
-----
  The Clone object is no longer necessary in the new system.  Clones are 
  replaced by Slices with coord_system = 'clone'. For backwards compatibility
  a minimal implementation will remain which inherits from the Slice object.

CloneAdaptor
------------
  The CloneAdaptor object is no longer necessary in the new system.  The
  CloneAdaptor is replaced by the SliceAdaptor.  For backwards compatibility
  a minimal implementation will remain which inherits from the SliceAdaptor.

Chromosome
----------
  The Chromosome object is no longer necessary in the new system.  The
  Chromosome is replaced by Slices with coord system 'chromosome' (or
  whatever the top level seq_region type is for that species).  For backwards
  compatibility a minimal implementation will remain which inherits from the
  Slice object.  

  Statistical information (e.g. known genes, genes, snps) that 
  was on chromosomes may be stored in the seq_region_attrib table or
  in some sort of density table.

ChromosomeAdaptor
-----------------
  The Chromosome object is no longer necessary in the new system. The 
  ChromosomeAdaptor is replaced by the SliceAdaptor.  For backwards 
  compatibility a minimal implementation which inherits from the SliceAdaptor
  will remain.


Root
----
  Every class in the current EnsEMBL perl API inherits directly or indirectly
  from Bio::EnsEMBL::Root.  This inheritance is almost exclusively for the
  following following three methods:
    throw
    warn
    _rearrange

  Nothing is gained by implementing this relationship as inheritance, and there
  are several disadvantages:
    (1) Everything must inherit from this class to use those 3 object methods.
    This can result in patterns of multiple inheritance which are generally
    considered to be a bad thing.

    (2) It is not possible to use the throw, warn or rearrange method within
    the constructor until the object is blessed.  Blessing the object first
    and then calling rearrange to extract named arguments is slower because
    the blessed hash needs to be expanded as more keys are added and several
    key access/value assignements may need to be performed.

    (3) Objects become larger and object construction becomes slightly slower
    because constructors traverse an additional level of inheritance.

  A better approach, which we have used, is to make the methods static and 
  create a static utility class that exports the methods.  The warn method has
  been renamed warning so as not to conflict with the builtin perl function
  warn and the _rearrange method has be renamed rearrange.

  The following is an example of the old styl Root inheritance and the new
  style static utility methods:

  #
  # OLD STYLE
  #
  package Old;

  use Bio::EnsEMBL::Root;

  @ISA = qw(Bio::EnsEMBL::Root);

  sub new {
    my $caller = shift;
    my $class = ref($caller) || $caller;

    $self = $class->SUPER::new(@_);
    
    my ($start, $end) = $self->_rearrange(['START', 'END'], @_);

    if(!defined($start) || !$defined($end)) {
      $self->throw('-START and -END arguments are required');
    }

    $self->{'start'} = $start;
    $self->{'end'}   = $end;

    return $self;
  }

  #
  # NEW STYLE
  #
  package New;

  use Bio::EnsEMBL::Utils::Exception qw(throw warning);
  use Bio::EnsEMBL::Utils::Argument  qw(rearrange);

  sub new {
    my $caller = shift;

    my $class = ref($caller) || $caller;

    my ($start, $end) = rearrange(['START', 'END'], @_);

    if(!defined($start) || !defined($end)) {
      throw('-START and -END arguments are required');
    }

    return bless {'start' => $start, 'end' => $end}, $class;
  }

  The calls to $self->rearrange $self->warn and $self->throw have been 
  replaced by class method calls to warning() and throw() inside the core API.
  However, for backwards compatibility the existance inheritance to 
  Bio::EnsEMBL::Root will remain in many cases (and be removed at a later date)


Storable Base Class
-------------------
  Almost all business objects in the EnsEMBL system are storable in the db
  and the ones which are always require 2 methods: dbID() and adaptor().  These
  methods have been moved to a Storable base class which most of the 
  business objects now inherit from.  This module has an additional method
  is_stored() which takes a database argument and returns true if the object
  appears to already have been stored in the provided database.

Features
--------
  All features should inherit from a base class that implements common feature
  functionality.  Formerly this role was filled by the bloated SeqFeature class
  which inherits from Bio::SeqFeature and Bio::SeqFeatureI etc.
  This class has been replaced by a smaller, less complicated 
  implementation named Feature.  To make classes more polymorphic in general,
  the gene, and transcript objects should now also inherit from the Feature
  class.  This class implements the following core methods common to all 
  features:

    start
    end
    strand
    slice (formerly named contig/entire_seq/etc.)
    transform
    transfer
    project
    analysis
    
  The feature class inherits from the Storable base class and thereby inherits
  the following methods:

    adaptor
    dbID
    is_stored

  The signature and behaviour of the transform method has been changed.  The
  existing method works differently depending on the arguments passed as 
  described below.

    OLD transform(no arguments)
    -----------------------
      Transforms from slice coordinates to contig coordinates.  The feature
      is changed in place and returned.  If the feature already is in contig
      coordinates an exception is thrown.  The feature may be split into two
      features in which case both features are returned (not sure if one of
      them is transformed in place).  Some features are not permitted to be
      split in two in which case an exception is thrown? (not sure) if it is
      to be split accross contigs.

    OLD transform(slice)
    ----------------
      If the feature is already in slice coordinates and the slice is on the
      same chromosome the features coordinates are simply shifted.  If the
      feature is already in slice coordinates but on a different chromosome
      an exception is thrown.
      It the feature is in contig coordinates and the slice is not empty then
      it is transformed onto the new slice (or an exception is thrown if the 
      transform would cause the feature to end up on a different chromosome
      than the slice).  If the feature is in contig coordinates and the
      slice is an empty slice the feature is transformed into chromosomal
      coordinates and placed on a newly created slice of the entire chromosome.

    The new transformation has only a single valid signature and splits its 
    responsibilities with the new transfer method.  The transfer 
    method transfers a feature onto another slice, whereas the transform
    method simply converts coordinate systems. Transform does 
    NOT transform features in place but rather returns the newly 
    transformed feature as a new object:
    
    transform(coord_system, [version])
    -----------------------
      Takes a single string specifying the new coord system. If the coord
      system is not valid an exception is thrown. If the coord system is the
      same coord system as the feature is currently in a new feature that is
      a copy of the old one is still be returned.  This also retrieves
      a slice which is the entire span of the region of the coordinate system
      that this feature is being transformed to.  For example transforming
      an exon in contig coordinates to chromosomal coodinates will place a 
      copied exon on a slice of an entire chromosome.  If a feature spans a 
      boundary in the coordinate system, undef is returned by the method 
      instead.

    transfer(slice)
    ----------------
      Shifts a feature from one slice to another.  If the new slice is in the
      same coordinate system but different seq_region_name (e.g. both 
      chromosomal but different chromosomes) an exception is thrown.  
      If the new slice is in a different coordinate system then the 
      transform method is internally called first.  If the feature would be 
      split across a boundary undef is returned instead.  After the transform 
      there follows a potential move, if the slice does not cover the full 
      seq_region. If there is no transform call necessary, the feature is 
      copied and then moved.

    move( start, end, strand )
    --------------------------
      In place change of the coordinates of the feature. It will stay on the 
      same slice.

    project(coord_system, [version])
    -----------------
      This method is analagous to the project method on Bio::EnsEMBL::Slice.
      It 'projects' a feature onto another coordinate system and returns the
      results formatted as a listref of [$start, $end, $feature] triplets.
      The $features returned are copies of the feature on which the method was
      called, but with coordinates in the coordinate system that was projected
      to.  If the feature maps entirely to a gap then an empty list ref [] will
      be returned.  If the feature is mapped to multiple locations a listref
      containing split features will be returned.

StickyExon
----------
  The sticky exon object is not be present in the new system.  It does not
  make sense to define features in a coordinate system where they are simply 
  not present.  Exons are calculated in chromosomal coordinates and they
  will generally be retrieved in the same coordinates system.  It will
  of course be possible to still retrieve exons in contig coordinates but only
  is they are fully defined on the contigs of interest.
  The split coordinates can be obtained through a call to the project
  method.


AssemblyMapper
--------------
  The assembly mapper and assembly mapper adaptor classes have become more
  general and sophisticated.Not only is it possible to map between two
  coordinates systems whose relationship is explicitly defined in the assembly
  table, but it is also possible to perform implicit, 2-step mapping
   using 'coordinate system chaining'.

  For example if no explicit relationship is defined between the supercontig
  and clone coordinate systems but relationships between the clone and contig
  and the supercontig and contig coordinate systems is present the mapper has
  the faculty to perform the mapping between the clone and supercontig systems.
  In this case the contig cooridinate system is used as an intermediary:
  
  NTContig <-> Contig <-> Clone

  In the above example the assembly mapper adaptor internally does the 
  following:
  
  (1) Create a mapper object between the NTContig and Contig region
  (2) Create a mapper object between the Contig and Clone region
  (3) Create and return a third mapper constructed from the sets of mappings
  generated by the intermediate mappers.

  
FeatureAdaptors
---------------
  Most FeatureAdaptors inherit from the BaseFeature adaptor.  As a 
  minimum feature adaptors provide fetch_all_by_Slice and fetch_by_dbID 
  methods. The fetch by slice method provides the same return types and 
  requires the same arguments as before, but required some internal changes.

  The simplified algorithm for fetching features via a slice is:

  (1) Check with coord system is requested or that slice is in.
  (2) Check which coord system features are in
  (3) Obtain mapper between coord systems
  (5) Retrieve features in their native coord system.
  (6) Remap features to the requested coord system using the mapper
  (7) Return the features

  The method fetch_all_by_RawContig is obsolete (it is equivalent to
  fetching by a slice of a contig) but has be left in as an alias for the
  fetch all by slice method for backwards compatibility.

  When performing a non-locational fetch (e.g. by dbID) features are still
  returned in the coordinate system that they are calculated in.  This is to
  ensure that the feature can always be retreived in this manner of fetching
  and so that features which are not in the database can be distinguished from
  features which are simply not in the requested coordinate sytem.  When a 
  single feature which is not in the database is requested via a non-locational
  fetch undef is returned instead.  If multiple features are requested but none
  are present in the database a reference to an empty list is returned.  If the
  features are required in a specific coordinate system the transfer, project 
  or transform method can always be used.


CoordSystemAdaptor
------------------
  A CoordSystemAdaptor provides access to the information in the 
  coord_system, meta and meta_coord tables.  This adaptor provides 
  Bio::EnsEMBL::CoordSystem objects.


NEW FEATURES
------------

Assembly Exceptions (Symbolic Sequence Links)
---------------------------------------------

  It is sometimes desirable to have multiple regions refer to the same 
  sequence.

  In much the same way a symlinked file acts as a pointer to a real file, 
  a symlinked region can point to another region of sequence.

  This can be described in the database through the addition of a table which 
  has a structure that mirrors that of the assembly table. The assembly table
  does not define the structure underlying this seq_region, and it does not
  have sequence of its own.  By means of the assembly_exception table this 
  seq_region points to another seq_region where the underlying sequence is 
  defined:

      assembly_exception
      ------------------   
      seq_region_id        int
      seq_region_start     int
      seq_region_end       int
      exc_type             enum('HAP', 'PAR')
      exc_seq_region_id    int
      exc_seq_region_start int
      exc_seq_region_end   int
      ori                  int  (may not be needed, may implicitly be 1)

   When fetching features and sequence from a slice that overlaps a symlinked
   region, the features and sequence from the symlinked region are returned.  
   This may be implemented by altering fetch by slice calls and adding a 
   SliceAdaptor method with splits a slice into non-symlinked components.  
   The following algorithm would apply to sequence and feature fetches:
      (1) Split the slice into non-symlinked component slices
      (2) Recursively call the method with the component slices
      (3) Adjust the start and end of the returned features and place them
          back on the original slice (or splice the sequence together if this
          is a sequence fetch)
      (4) Return the features or sequence
   
   Consider a slice which overlaps regions (A), (B), and (C) on chromosome Y:

             ===============  (chrX)
               ^^^^^^^^^^^
     ========   symlink     =========  (chrY)
      (A)          (B)         (C)

   Regions (A) and (C) are described by the assembly table, but region (B)
   is described in the assembly_exception table and points to a region of
   chromosome Y.  When features or sequence are retrieved the slice is split
   into 3 component slices which have no symlinks:  region (A) and (C) are
   slices on chromosome Y but region (B) is made into a slice on chromsome (X).
   All of the features are fetched from the individual slices adjusted by
   some addition and placed on back on the original Slice before being
   returned. 
  
   
    
Haplotypes (and the MHC region)
-------------------------------
  There are several requirements related to haplotypes:
    - Must be able to determine which haplotypes overlap a slice
    - Must be able to run genebuild/raw computes over the haplotypes
    - Must be able to retrieve a slice on a haplotype and its flanking
      regions (i.e. the regions of the default assembly bordering the 
      haplotype).
    - It may be desireable to interpolate features from the default sequence
      onto the haplotype

   Proposal:
    The haplotype will be present as a full length 'chromosome' in the 
    seq_region table (or other appropriate coordinate system) but only the 
    region which differs from the the default assembly will be described in 
    the chromosome table.  The regions which are identical will be described 
    by the assembly_exception table. 
  
    It is possible to retrieve a slice on a haplotype just as any other slice
    is retrieved from the SliceAdaptor.  For example: 
    $slice = $slice_adaptor->fetch_by_region('chromosome', '6_DR52');
    
    A slice created on a haplotype will have coordinates relative to the
    start of the chromosome NOT relative to the start of the haplotype
    region. For all intents and purposes a haplotype slice will behave as
    a normal slice.

    For example, the assembly table could define the composition of the 
    divergent region of chromosome 6_DR52 (C), but leave the remainder of the
    chromosomal composition undefined.  The remainder of the 
    chromosome composition would be accounted for by 2 rows in the 
    assembly_exception table which described the synonymous regions in terms 
    of chromosome 6:
 
       ==============  6       ==============  6
            ^                         ^
       _____|________          _______|______  
                   C ==========               6_DR52





Pseudo Autosomal Regions (PARs)
-------------------------------
  There are several requirements related to PARs:
    - The same sequence and features must be present on a region of
      both chromosome X and chromosome Y
    - The region and features should be returned when retreiving features
      from either chromosome.
    - It must still be possible to retrieve one of the features via its 
      identifier
    - It must still be possible to transform features in the region from 
      chromosomal coordinates to contig coords and vice-versa.
    - The genebuild should run over the region, but only once.

  Proposal:
    Use the assembly_exception table in a similar fashion as it is used
    for the haplotypes described above. Chromosome X can be the 'default' 
    chromosome for the PAR and Chromosome Y can be described by the assembly 
    table except in the PAR.  The PAR on chromosome Y can be defined by the 
    assembly_exception table and refer to the corresponding sequence on 
    chromosome X.  The same algorithm as used for haplotypes can then be used 
    when retrieving sequence or features from slices which overlap this 
    exception on chromosome Y.

    The following diagram illustrates how chromsome X and chromosome Y could
    be defined:

    ========================================== X
           ^                 ^ 
          _|_            ____|____ 
    ======   ============         ============= Y


Multiple Assemblies
-------------------

In theory it is possible to load multiple assemblies into the same database.
For example two coordinate systems with two versions chromosome:NCBI33 and
chromosome:NCBI34 could be loaded into the database.  Leveraging the fact that
two step mapping is possible and that these coordinate systems share a 
coincident mapping with the contig coorinate system it is possible to pull
across annotation from one assembly to the other. The following example 
illustrates the transfer of genes from the chromosome X on the NCBI33 assembly
to the NCBI34 assembly:

  $slice = $slice_adaptor->fetch_by_region('chromosome', 'X', undef,
                                           undef,undef, 'NCBI33');
  @genes = @{$gene_adaptor->fetch_all_by_Slice($slice)};

  foreach my $gene (@genes) {
    $gene->transform('chromosome', 'NCBI34');
    #...
  }


OTHER CONSIDERATIONS
--------------------

Loci
----

Similar genes which are defined across haplotypes need
to be somehow linked into loci.  The intent is that 
a user would be able to see that a gene has a counterpart
on an equivalent haplotypic sequence. 

This is work in progress. The current opinion among us is to implement
it via a relationship table that specifies which genes on default
haplotypes are considered equivalent to other genes on
(overlapping) haplotypes.
  
