You should already be familiar with the document xrefs_overview.txt. This document
primarily details the inner workings of the xref_mapper.pl program explaining how 
to run it, how to track what it is doing and how to recover if something goes 
wrong.
This is quite technical so if you are having problems and just want it to work, 
it might be best to skip to the FAQ.tx document.

So i am going to start by presuming the parsing of the external databases data 
has been successful. 

The last entry in the process_status table should at this stage read 
"parsing_finished". The process_status table is used by the mapper program to 
work out what stage it is at and what it has to do next. So at this point if we 
look at what is in the table we should see something like:-

[user@machine] [macaca_xref]> select * from process_status;
+----+--------------------------+---------------------+
| id | status                   | date                |
+----+--------------------------+---------------------+
|  1 | xref_created             | 2009-05-26 16:21:43 |
|  2 | parsing_started          | 2009-05-26 16:21:43 |
|  3 | parsing_finished         | 2009-05-26 16:54:25 |
+----+--------------------------+---------------------+

Another useful table to look at is the meta table which contains general info

[user@machine] [macaca_xref]> select meta_key, meta_value from meta;
+----------------------------------------------------------------+
| meta_key   | meta_value                                        |
+------------+---------------------------------------------------+
| options    | -user rw -pass noway -host machine                |
|              -species macaca_mulatta -dbname macaca_xref       |
|              -create -stats -checkdownload                     |
| fullmode   | yes                                               |
+------------+---------------------------------------------------+

So this shows the options used by the parser and that it is being ran in fullmode.

Fullmode means that all the xrefs are being updated and not just a few specific 
external database sources. This is important as this affects the way the 
display_xrefs, descriptions are calculated at the end.The user can override this 
by setting -partupdate option in the mapper options or change the entry in the 
table.

If we are doing all the xref sources then we know that all the data is local and 
hence can do some SQL to get  the display_xrefs etc But if this is not the case 
then the core database will have extra information in it that may be needed so 
we have to query the core database. The xref database has extra information that 
is not in the xref database and so simple SQL can be used whereas with the core 
database we have to go for each gene and then for each transcript etc using the
 API which is alot slower.

In summary only alter the mode here if you know what you are doing and what 
consequences there will be.

You now need to create a configuration file which specifys which xref database is
going to be used against which core database (see the overview document for this)

So the mapper has several stages these are:-

 1) process the config file
 2) Dump the fasta files (unless they already exist and dumpcheck option is used)
 3) run exonerate to produce the mapping files
 4) process the mapping files
 5) load specific core data into the xref database
 6) process the direct xrefs
 7) flag the priority xrefs (flag the best ones)
 8) process paired data
 9) check if any source that is on more than one ensembl type and fix.
10) official naming (for human and mouse only)
11) checks. (looks for possible errors in the tables plus looks at number of xrefs) 
12) coordinate mapping (for human and mouse only to map by location (UCSC))
13) load data into the core
14) calulate display xref for genes and transcripts
15) calulate the gene descriptions.

When you first use the mapper the first time it is advised that you do not use 
the -upload flag so that the program termiates before it loads the data enabling
you to look at the checks that are performed. One set of tests looks at the number
of xrefs that have been mapped in this run to the ones that are already in the 
core database so you can see which source have been missed or have gone wrong. The
check gives you the difference as a percentage for each source so any big changes 
need to be thought about.

Process the config file
-----------------------

Get information on the databases to do the mapping on.
Also store the options passed to the program and info on the databases to the 
meta table.


+---------+------------+----------------+-------------------+--------------------+
| meta_id | species_id | meta_key       | meta_value        | date               |
+---------+------------+----------------+-------------------+--------------------+
|       5 |          1 | xref           | host1:macaca_xref | 2009-05-26 16:56:52|
|       6 |          1 | species        | host2:macaca_core | 2009-05-26 16:56:52|
|       7 |          1 | mapper options | -file xref_input  | 2009-05-26 16:56:52|
+---------+------------+----------------+-------------------+--------------------+


Dumping the fasta files.
------------------------

If -dumpcheck is specified then the system checks to see if the fasta files exist
already and if it does, does not redump the fasta files. Dumping the fasta files 
from the core database can be one of the longest steps so if the core fasta files
exist already this is very useful. The fasta file will be written to where ever 
was specified in the configuration file.

The process_status table should now read something like :-
+----+--------------------------+---------------------+
| id | status                   | date                |
+----+--------------------------+---------------------+
|  1 | xref_created             | 2009-05-26 16:21:43 |
|  2 | parsing_started          | 2009-05-26 16:21:43 |
|  3 | parsing_finished         | 2009-05-26 16:54:25 |
|  4 | xref_fasta_dumped        | 2009-05-26 16:57:08 |
|  5 | core_fasta_dumped        | 2009-05-26 16:57:08 |
+----+--------------------------+---------------------+


Exonerate mapping.
------------------

The mapper uses exonerate to produce the mapping files. If the option -nofarm is
used then exonerate will run locally. The mapper sets the exonerate jobs running 
and  then writes to the tables mapping_jobs and mapping storing information about
these jobs.

The table mapping holds information about what mapping method and the mapping_jobs
table holds information about the individual exonerate jobs.

For most runs there will be 2 entries in the mapping table one for the alignment 
of the dna and another for the peptides :-

[user@machine] [macaca_xref]> select * from mapping \G
*************************** 1. row ***************************
               job_id: 76917
                 type: dna
         command_line: exonerate-1.4.0 xref_0_dna.fasta macaca_mulatta_dna.fasta 
                       --querychunkid $LSB_JOBINDEX --querychunktotal 586 
                       --showvulgar false --showalignment FALSE 
                       --ryo "xref%qi:%ti:%ei:%ql:%tl:%qa:%qae:%tab:%tae:%C:%s\n" 
                       --gappedextension FALSE --model affine:local --subopt no 
                       --bestn 1 
                      | grep "^xref" > ExonerateGappedBest1_dna_$LSB_JOBINDEX.map
 percent_query_cutoff: 90
percent_target_cutoff: 90
               method: ExonerateGappedBest1
           array_size: 586
*************************** 2. row ***************************
               job_id: 76919
                 type: peptide
         command_line: exonerate-1.4.0 xref_0_peptide.fasta 
                       macaca_mulatta_protein.fasta 
                       --querychunkid $LSB_JOBINDEX --querychunktotal 73 
                       --showvulgar false --showalignment FALSE 
                       --ryo "xref:%qi:%ti:%ei:%ql:%tl:%qa:%qae:%tab:%tae:%C:%s\n"
                        --gappedextension FALSE --model affine:local --subopt no 
                        --bestn 1 
                  | grep "^xref" > ExonerateGappedBest1_peptide_$LSB_JOBINDEX.map
 percent_query_cutoff: 90
percent_target_cutoff: 90
               method: ExonerateGappedBest1
           array_size: 73


So here we can see that the dna alignment was split into 586 seperate farm jobs. 
The percentage cutoffs are used when the mapping files are processed and not by 
exonerate itself.

We can see the individual jobs ran by looking at the table mapping_jobs:-

[user@machine] [macaca_xref]>select map_file, status, array_number, job_id from 
mapping_jobs limit 2;
+--------------------------------+-----------+--------------+--------+
| map_file                       | status    | array_number | job_id |
+--------------------------------+-----------+--------------+--------+
| ExonerateGappedBest1_dna_1.map | SUBMITTED |            1 |  76917 |
| ExonerateGappedBest1_dna_2.map | SUBMITTED |            2 |  76917 |
+--------------------------------+-----------+--------------+--------+


So here we can see the map file that will be produced and the array number for a 
particular job_id;


After the exonerate jobs have been issued to the farm a depend job is set to wait
for all the exonerate jobs to finish. If the farm is not used then since the
jobs are run locally no depend job is needed.


Process the mapping files
--------------------------

Several checks are made while processing the mapping files and if the farm was 
used then checks are made on its error files to make sure they are empty.If any 
errors are found then the status for this mapping job is set to "FAILED" and the 
program exits gracefully after all the files have been read. So if there is an 
error in one of the mapping files then this is noted but all the other mapping 
files are still read in afterwards but the program then exits. 

 When the first entry is read from a mapping file the first object_xref that is 
stored has its id stored in the column object_xref_start in the table 
object_xref_start and also the last object_xef that is stored has its id stored 
into the object_xref_end column. So from this we know for each mapping file the 
range of object_xrefs that have been stored.
If the mapping file is processed with no errors then the status is set to 
"SUCCESS".


select map_file, status, array_number as arr, job_id, object_xref_start as start, 
 object_xref_end as end from mapping_jobs limit 2;
+--------------------------------+---------+------+--------+-------+------+
| map_file                       | status  | arr  | job_id | start | end  |
+--------------------------------+---------+------+--------+-------+------+
| ExonerateGappedBest1_dna_1.map | SUCCESS |    1 |  76917 |     1 |  461 |
| ExonerateGappedBest1_dna_2.map | SUCCESS |    2 |  76917 |   462 |  674 |
+--------------------------------+---------+------+--------+-------+------+

Also if start or end are null this indicates something went wrong.


If for some reason a mapping job failed this tends to be things like running out 
of disk space, the compute farm loosing a job etc then you have a couple of 
options.

1) reset then database to the parsing stage and rerun all the mappings

To reset the database use the  option -reset_to_parsing_finished

>xref_mapper.pl -file config_file -reset_to_parsing_finished

then redo the mapping
 
>xref_mapper.pl -file config_file -dumpcheck >& MAPPER.OUT

Note here we use -dumpcheck to make the program does not dump the fasta files if 
they  are already there as this process can take along time and the fasta files 
will not have changed.

Also it is a good idea to save the output from the job (>&MAPPER.OUT)

2) just redo those jobs that failed.

Run the mapper with the -resubmit_failed_jobs flag

>xref_mapper.pl -file config_file -resubmit_failed_jobs


The processing of the mapping file creates entries in the object_xref and 
identity_xref tables for the primary xrefs and also its dependents. Also entries
 may be added to the go_xref table if there are GO xrefs.

In the object_xref table we set the default ox_status to be "DUMP_OUT" if the 
mapping passes the percentage cutoff criteria else it is set to "FAILED_CUTOFF". 
So that we know which mapping are good.



Load specific core data into the xref database
-----------------------------------------------

For ease of use and to reorganise some data we copy data from the core database 
to the xref database.

gene_stable_id, transcript_stable_id and translation_stable_id are all copied as 
is from the core database. gene_transcript_translation table though has 
information regarding as to how these are all linked in one easy table.

Information about the external databases that are "KNOWN" or "KNOWNXREF" are 
stored for that source in the source table as this is needed for the 
gene/transcript status calculation later on.


Process the direct xrefs
------------------------
 
Direct xrefs are those xrefs where we have a direct mapping is taken from a file 
or database. The mapper is not used for these ones as the mapping is already 
specified.
So we now take all the entries from the tables gene_direct_xref, 
transcript_direct_xref and translation_direct_xref and create the object_xrefs 
for these.
If the stable_id cannot be found at this point a warning is given but the progam
does not halt. This is why it is strongly recomended that you do not use -upload 
the first time you use the mapper so that these types of things can be looked at 
before loading them into the core.


flag the priority xrefs
-----------------------

Priority xrefs are those xrefs where the external database xrefs may come from 
several different sources that have different prioritys. a good exmaple here is 
the HGNC xrefs for human.

> select source_id as id, name, priority, priority_description from source where
   name like "HGNC";
+----+------+----------+----------------------+
| id | name | priority | priority_description |
+----+------+----------+----------------------+
| 53 | HGNC |        1 | havana               |
| 54 | HGNC |        2 | ccds                 |
| 55 | HGNC |        4 | entrezgene_manual    |
| 56 | HGNC |        5 | refseq_manual        |
| 57 | HGNC |        6 | entrezgene_mapped    |
| 58 | HGNC |        7 | refseq_mapped        |
| 59 | HGNC |        8 | uniprot              |
| 60 | HGNC |        3 | ensembl_mapped       |
| 61 | HGNC |      100 | desc_only            |
+----+------+----------+----------------------+

So we can see that HGNC has 9 sources where havana (manual annotation) has the 
best priority.

What the flag priority xrefs step does is sets the ox_status in the object_xref 
table to be "FAILED_PRIORITY" for those where ther is a better match (better 
priority).


Here is an example

>select x.label, x.info_type, s.name, s.priority, s.priority_description, 
        ox.ox_status 
        from xref x, source s, object_xref ox 
        where x.source_id = s.source_id and ox.xref_id = x.xref_id 
          and s.name like "HGNC" and label like "BRCA2" limit 10;
+-------+-----------+------+----------+----------------------+-----------------+
| label | info_type | name | priority | priority_description | ox_status       |
+-------+-----------+------+----------+----------------------+-----------------+
| BRCA2 | DEPENDENT | HGNC |        8 | uniprot              | FAILED_PRIORITY |
| BRCA2 | DEPENDENT | HGNC |        5 | refseq_manual        | FAILED_PRIORITY |
| BRCA2 | DEPENDENT | HGNC |        7 | refseq_mapped        | FAILED_PRIORITY |
| BRCA2 | DEPENDENT | HGNC |        4 | entrezgene_manual    | FAILED_PRIORITY |
| BRCA2 | DEPENDENT | HGNC |        6 | entrezgene_mapped    | FAILED_PRIORITY |
| BRCA2 | DIRECT    | HGNC |        1 | havana               | DUMP_OUT        |
| BRCA2 | DIRECT    | HGNC |        2 | ccds                 | FAILED_PRIORITY |
+-------+-----------+------+----------+----------------------+-----------------+

So becouse the havana one is the best we set the ox_status for the other matches 
in the object_xref table to be "FAILED_PRIORITY" for BRCA2.
From this point on the "FAILED_PRIORITY" object_xref s are ignored.



process paired data
-------------------

For some database sources one data set is paired with another, an example here is
the relationship between reseq_peptide and refseq_dna. So if one of these is 
mapped  but the paired one is not then a new object_xref is created linking the 
one not mapped to the other one.



check if any source that is on more than one ensembl type and fix
-----------------------------------------------------------------

Becouse Biomart does not like having any particular data source on more then one
ensembl type (gene/transcript/translation) this check searches for these instances
and moves all the object_xrefs onto the highest level. So if HGNC is on Genes and 
Transcripts then all the one on Transcripts will be moved to the correspondiong 
Genes.



official naming 
---------------

Becouse we want to have the same names (with a postfix) for all the transcripts in
a gene for human and mouse, this process gets the best name (HGNC/MGI) for a gene
taken from any of its transcripts and then applies this to all the transcripts and
gene.
An example here is PDS5B which has two transcripts PDS5B-006 and PDS5B-201.
If the postfix start with a 0 it means that this comes directly from havana and is
the name there. If it starts with a 2 then the transcript is not found in havana.



checks
------

There is a list of checks which are performed. Some check primary/foreign key 
pairs, others check the number of xref and object_xrefs in the xref database 
compared to the core database. Depending on the seriousness of the problem a 
warning is given and then may exit gracefully.



coordinate mapping
------------------

For human and mouse we map UCSC stable ids to our system using their locations.
This mapping is done using exonerate and the reuslt are written directly to the 
object_xef table in the core database.



load data into the core
-----------------------

First for each source that is in the xref database the corresdonding data is 
deleted from the core database. This includes xref, object_xref, identity_xref, 
external_synonym, go_xref, dependent_xref and unmapped_object tables. Also all the
Projected xrefs are deleted.

Via some complex sql these tables are now filled with the new data.


calulate display xref for genes and transcripts
-----------------------------------------------

The external databases to be used for the display_xrefs are taken from either the 
BasicMapper.pm subroutine transcript_display_sources  i.e.

sub transcript_display_xref_sources {
  my @list = qw(miRBase
                RFAM
                HGNC_curated_gene
		HGNC_automatic_gene
                MGI_curated_gene
		MGI_automatic_gene
		Clone_based_vega_gene
		Clone_based_ensembl_gene
		HGNC_curated_transcript
		HGNC_automatic_transcript
		MGI_curated_transcript
		MGI_automatic_transcript
		Clone_based_vega_transcript
		Clone_based_ensembl_transcript
		IMGT/GENE_DB
		HGNC
		SGD
		MGI
		flybase_symbol
		Anopheles_symbol
		Genoscope_annotated_gene
		Uniprot/SWISSPROT
		Uniprot/Varsplic
		RefSeq_peptide
		RefSeq_dna
		Uniprot/SPTREMBL
		EntrezGene
	        IPI);

  my %ignore;
  $ignore{"EntrezGene"}= 'FROM:RefSeq_[pd][en][pa].*_predicted';
  
  return [\@list,\%ignore];

}




or if you want to create your own list then you need to create a species.pm file 
and create a new subroutine there an example here is for drosophila_melanogaster.
So in the file drosophila_melanogaster.pm  we have :-

sub transcript_display_xref_sources {

  my @list = qw(FlyBaseName_transcript FlyBaseCGID_transcript 
                flybase_annotation_id);
                

  my %ignore;
  $ignore{"EntrezGene"}= 'FROM:RefSeq_[pd][en][pa].*_predicted';

  return [\@list,\%ignore];

}


If we are in fullmode then a tempory table is created with these sources in and 
some sql is used to find the best display_xrefs for the transcripts and genes. 

If we are not in fullmode then we have to use the API to cycle though each gene, 
transcript and translation fecth all there xrefs and get the best. This is not that quick.

The results are stored directly to the core database.


calulate the gene descriptions.
-------------------------------

As above except this time we use the sub gene_description_sources.



