The Xref System
========================================================================

The external database references (Xrefs) are added to the Ensembl
databases using the code found in this directory.  The process consists
of two parts.  First part is parsing the data into a temporary database
(Xref database).  The second part is to map the new Xrefs to the Ensembl
database.


Parsing the external database references
------------------------------------------------------------------------

In this directory you will find an ini-file called 'xref_config.ini'.
This file contains two types of configuration sections: source sections
and species sections.  A source section defines Xref priority, order
etc. (as key-value pairs, see the comment at the top of the source
sections for a fuller explanation of these keys) for the source and
also the URIs pointing to the data files that the source should use.
The source label will only be used to refer to the source within the
ini-file (from a species section), so this can be any text string which
is easy to understand the meaning of.

A species section contains information about species aliases, the
numerical taxonomy ID(s) and what sources to use for that species.  If
a species has more than one taxonomy ID (in the case where there are
multiple strains or subspecies, for example), there can be more than one
'taxonomy_id' key.  The name of the species is defined by the source
label and will be store in the Xref database.

For now, the script 'xref_config2sql.pl' (also found in this directory)
should be used to convert the ini-file into a SQL file which you
should replace the file 'sql/populate_metadata.sql' with.  The
'xref_config2sql.pl' script expects to find 'xref_config.ini' in the
current directory, but you may specify an alternative file as the first
command line argument to the script if you have moved or renamed the
ini-file.  When 'xref_parser.pl' is run it will load the generated SQL
file into the database and will then download and parse all external
data files for one or several specified species.

If you want to add a new source you will have to add a new source
section, following the pattern used by the other source sections.  You
will then have to add it to the species that require the data.

If the new data comes in files not previously handled by the Xref
system, you will now also have to write the parser NewSourceParser.pm
(the parser name may be arbitrary chosen) in the XrefParser directory.
You can find lots of examples of parsers in this directory.

Before running the Xref parser, make sure that the environment
variable 'http_proxy' is set to point to the local HTTP proxy to get
outside the firewall.  For Sanger, the value of the variable should be
"http://cache.internal.sanger.ac.uk:3128", i.e. for tcsh shells you
should have

    setenv http_proxy http://cache.internal.sanger.ac.uk:3128

in your ~/.tcshrc file, while for bash-like shell you should have

    export http_proxy=http://cache.internal.sanger.ac.uk:3128

in your ~/.profile or ~/.bashrc file.

When you run the script 'xref_parser.pl' to do the Xrefs you must pass
to it several options but for most runs all you need to specify it the
user (user name on the database), pass (password), host (database host),
dbname, and species, i.e.

    perl xref_parser.pl -host mymachine -user admin -pass XXXX \
        -dbname new_human_xref -species human

Please keep the output from this script and check it later.  At the end
of the output there will be a summary of what was successful and what
failed to run.  This is important.

The parsing can create three types of Xrefs these are

1) Primary   (These have sequence and are mapped via exonerate)
2) Dependent (Have no sequence but are dependent on the Primary ones)
3) Direct    (These are directly linked to the Ensembl entities, so the
             mapping is already done)

Some sources will have more than one set of files associated with it,
in these cases they have the same source name but different source IDs.
These are known as "priority Xrefs" as the Xrefs are mapped according to
the priority of the source.  An example of this is the HUGOs.

For more information on the what data can be parsed see the
'parsing_information.txt' file.


Mapping the external database references to the Ensembl core database
------------------------------------------------------------------------

This is an overview of what goes on in the script 'xref_mapper.pl' .

Primary Xrefs are dumped out to two Fasta files, one for peptides and
the other for DNA.  Ensembl Transcripts and Translations are then dumped
out to two files in Fasta format.

Exonerate is then used to find the best matches for the Xrefs.
If there is more than one best match then the Xref is mapped to
more than one Ensembl entity.  A cutoff is used to filter the best
matches to make sure they pass certain criteria.  By default this
is that the query identity OR the target identity must be over
90%.  This can be changed by creating your own '<method>.pm' file
in the directory 'XrefMapper/Methods' and creating subroutines
'query_identity_threshold()' and 'target_identity_threshold()' which
return the new values.

So exonerate will generate a set of .map files with the mapping in.  The
map-files are then parsed and any that pass the criteria are stored in
the 'xref' table, 'object_xref' table and the 'identity_xref' table.
All dependent Xrefs are also stored if the parent is mapped.

Direct Xrefs are also stored at this stage but no mapping is needed here
as we already knew what each Xref maps too.

For priority Xrefs (ones that have multiple sources) the highest
priority one is only stored.

Any Xrefs which fail to be mapped are written to the unmapped_object
table with a brief explanation of why they could not be mapped.

Once all the mapping have been stored the display Xrefs and the
descriptions are generated for the transcripts and genes.

If you want to change any of the default settings you can create a new
'<species>.pm' for your particular species, or '<taxon>.pm' and override
the script 'BasicMapper.pm' (see 'rattus_norvegicus.pm' as an example).

The 'xref_mapper.pl' script needs a configuration file which has
information on the Xref database and the core database and also the
species name.  Below is an example of running the mapping.

    perl ~/ensembl-live/ensembl/misc-scripts/xref_mapping/xref_mapper.pl \
        -file xref_input -upload >&MAPPER.OUT


Here is an example of a configuration file for 'xref_mapper.pl':
------------------------------------------------------------------------
xref
host=ensembl-machine
port=3306
dbname=human_xref_42
user=admin
password=xxxx
dir=./xref

species=homo_sapiens
taxon=mammalia (this is optional - use taxon if you need more than one species to use the same '<taxon>.pm' module) 
host=ensembl-machine
port=3306
dbname=homo_sapiens_core_42_36d
user=admin
password=xxxx
dir=./ensembl

farm
queue=long
exonerate=/software/ensembl/bin/exonerate-1.4.0
------------------------------------------------------------------------

Note it is good practice to put a sub-directory for the Ensembl
directory as many files are generated and hence best to put these all
together and way from everything else or it will be hard to find things.
Also the directory can be tared and zipped in case you need to check
things later.
