ABOUT
~~~~~

This document describes how to run the ensembl-compara genetree
pipeline.

- Loads the canonical or longest peptide for each gene loci in each
  configured species database into the compara DB's member table.

- Loads each UniProt protein into the member table (metazoa only for
  now). Only needed if the family pipeline is to be run subsequently.

- Performs all-against-all wublastp searched of the member proteins in
  the DB.

- Run hcluster_sg to create the clusters of proteins.

- Run MCoffee (or Muscle) to obtain the protein multialignments for
  each cluster.

- Run NJTREE_PHYML (aka TreeBeST) with a given species tree to create
  the protein trees.

- Run OrthoTree to create the homology relationships per each genepair
  in the tree.

- QuickTreeBreak will run in case NJTREE_PHYML fails on big alignments
  and create new jobs for the subclusters that creates. Old clusters
  will be automatically deleted.

- Provided that the species are close enough, the pipeline can also
  run a dN/dS analysis for the orthologs of pairs of species, in the
  Homology_dNdS analysis. This step is optional.

- For every tree and subsequent subtrees inside, Sitewise_dNdS can be
  executed to obtain column-wise dN/dS estimates. This step is
  optional.

The genetree process overlaps initially with the homology pipeline,
but once the blasts hits are created, it differs in the way things are
done. Read README-homology for more information.


1- code API needed and executable
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  bioperl-live (bioperl-1-2-3)
  bioperl-run (1.4) for the CodeML runnable/parser
  ensembl
  ensembl-compara
  ensembl-hive
  ensembl-analysis
  ensembl-pipeline

  executables
  ~~~~~~~~~~~
  wublastp
      using /usr/local/ensembl/bin/wublastp
  setdb
      using /usr/local/ensembl/bin/setdb
  hcluster_sg
      using /nfs/users/nfs_a/avilella/src/treesoft/trunk/ia64/hcluster/hcluster_sg (turing)
      using /nfs/users/nfs_a/avilella/src/treesoft/trunk/hcluster/hcluster_sg (x86_64)
  mcoffee
      using /software/ensembl/compara/tcoffee-7.86b/t_coffee (needs library dir as well)
  muscle
      using /usr/local/ensembl/bin/muscle
  quicktree
      using /nfs/users/nfs_a/avilella/src/quicktree/quicktree_1.1/bin/quicktree
      needs /usr/local/ensembl/bin/sreformat
  njtree
      using /nfs/users/nfs_a/avilella/src/treesoft/trunk/treebest/treebest
  codeml
      using /usr/local/ensembl/bin/codeml
  Slr_ensembl
      using /software/ensembl/compara/bin/Slr_ensembl
      needs /software/ensembl/compara/bin/Gblocks

1.2 Code checkout

First we export our CVS connection strings

export BIOPERL_CVS_REPO=':ext:bio.perl.org:/home/repository/bioperl'
export ENSEMBL_CVS_REPO=':ext:cvs.sanger.ac.uk:/nfs/ensembl/cvsroot'
#or
#export ENSEMBL_CVS_REPO=':pserver:cvsuser@cvs.sanger.ac.uk:/cvsroot/ensembl'

#export ENSEMBL_TAG='branch-ensembl-57'
export ENSEMBL_TAG='HEAD'

bioperl code;

  $ cvs -d $BIOPERL_CVS_REPO \
    co -r branch-1-2-3 bioperl-live

bioperl-run code (for the CodeML runnable/parser)
  
  $ cvs -d $BIOPERL_CVS_REPO \
    co -r bioperl-run-release-1-4-0 bioperl-run

core ensembl code

  $ cvs -d $ENSEMBL_CVS_REPO co -r $ENSEMBL_TAG ensembl

ensembl-compara code (for Compara API and schema)

  $ cvs -d $ENSEMBL_CVS_REPO co -r $ENSEMBL_TAG ensembl-compara

ensembl-analysis code (for Runnables)

  $ cvs -d $ENSEMBL_CVS_REPO co -r $ENSEMBL_TAG ensembl-analysis

ensembl-pipeline code (for Pipeline)

  $ cvs -d $ENSEMBL_CVS_REPO co ensembl-pipeline

ensembl-hive code

  $ cvs -d $ENSEMBL_CVS_REPO co  ensembl-hive


1.3 Perl environment

  Beware the pipeline has been seen to misbehave when using csh, and
  is mostly tested under bash terminals.

in tcsh
  $ setenv BASEDIR   /some/path/to/modules
  $ setenv PERL5LIB ${BASEDIR}/ensembl/modules:${PERL5LIB}
  $ setenv PERL5LIB ${BASEDIR}/ensembl-compara/modules:${PERL5LIB}
  $ setenv PERL5LIB ${BASEDIR}/ensembl-pipeline/modules:${PERL5LIB}
  $ setenv PERL5LIB ${BASEDIR}/ensembl-analysis/modules:${PERL5LIB}
  $ setenv PERL5LIB ${BASEDIR}/ensembl-hive/modules:${PERL5LIB}
  $ setenv PERL5LIB ${BASEDIR}/bioperl-live:${PERL5LIB}
  $ setenv PERL5LIB ${BASEDIR}/bioperl-run:${PERL5LIB}

in bash
  $ BASEDIR=/some/path/to/modules
  $ export PERL5LIB=${BASEDIR}/ensembl/modules:${PERL5LIB}
  $ export PERL5LIB=${BASEDIR}/ensembl-compara/modules:${PERL5LIB}
  $ export PERL5LIB=${BASEDIR}/ensembl-pipeline/modules:${PERL5LIB}
  $ export PERL5LIB=${BASEDIR}/ensembl-analysis/modules:${PERL5LIB}
  $ export PERL5LIB=${BASEDIR}/ensembl-hive/modules:${PERL5LIB}
  $ export PERL5LIB=${BASEDIR}/bioperl-live:${PERL5LIB}
  $ export PERL5LIB=${BASEDIR}/bioperl-run:${PERL5LIB}


1.4 CPAN Perl modules
  www.cpan.org

  DBI
  DBD::mysql
  Data::UUID
  Statistics::Descriptive


2- Create the tables in the database
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
These tables are already in the main ensembl-compara/sql/table.sql
file as of ensembl v40.

Pick a mysql instance and create a database;

  $ set COMPARA_DBNAME=${USER}_ensembl_compara_52
  $ mysql -e "create database $COMPARA_DBNAME"

And load the schemas;

  $ cat $BASEDIR/ensembl-compara/sql/table.sql \
        $BASEDIR/ensembl-compara/sql/pipeline-tables.sql \
        $BASEDIR/ensembl-hive/sql/tables.sql \
    | mysql $COMPARA_DBNAME

And set an environment variable containing the connection parameters
to the DB as used by many of the compara scripts:

  $ set COMPARA_URL=mysql://<user>:<pass>@<host>:<port>/${COMPARA_DBNAME}

If your mysql server supports table partitioning (v.5.1 or later), you
should use the partitioned version of the paf tables now, with
something like this:

  $ mysql $COMPARA_DBNAME \
    -e "rename table peptide_align_feature to peptide_align_feature_orig"
  $ mysql $COMPARA_DBNAME \
    -e "rename table peptide_align_feature_prod to peptide_align_feature"

Unless you have a very good reason to stick to MyISAM, you should also
change the engine to INNODB with a bash script like this:

   for t in `mysql $COMPARA_DBNAME -N -e "show tables"`
   do
           mysql $COMPARA_DBNAME -N -e "ALTER TABLE $t engine=InnoDB"
   done


3- Load the NCBI taxonomy data 
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You now need to load the NCBI taxonomy data into the ncbi_taxa_name
and ncbi_taxa_node tables. See the following file;

  $BASEDIR/ensembl-compara/scripts/taxonomy/README-taxonomy

You may alternatively be able to copy taxonomy tables from an existing
compara db (this may take a few mins), e.g.

  $ mysqldump --no-create-info ensembl_compara_old \
    ncbi_taxa_node ncbi_taxa_name \
    | mysql $COMPARA_DBNAME

Or, (Ensembl-site-specific)

  $ mysqldump -u ensro -h ens-livemirror \
    --extended-insert --compress --delayed-insert \
    ncbi_taxonomy ncbi_taxa_node ncbi_taxa_name \
    | mysql $COMPARA_DBNAME

Or, to copy from a compara DB on ensembldb (using bash);

  $ for TABLE in ncbi_taxa_node ncbi_taxa_name; do
    mysql -hensembldb.ensembl.org -P5306 -uanonymous ensembl_compara_59 -p'' \
    -N -e "select * from $TABLE" \
    | sed s/NULL/\\\\N/g > /tmp/${TABLE}.txt
    mysqlimport -L $COMPARA_DBNAME /tmp/${TABLE}.txt
    done

4- compara-hive config file
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Copy and modify the compara-hive config file
   
  $ cd $BASEDIR/ensembl-compara/scripts/pipeline
  $ cp compara-hive-genetree.hcluster.conf.example my_compara.conf
  <editor> my_compara.conf

You may need to change the database names, port, dbnames, and the
paths to the 'hive_output_dir' and 'fasta_dir'. Note that these dirs
must exist.

You can enable/disable various components of the system
(e.g. MCL-clustering, dNdS) where indicated depending on your
preferences.

IMPORTANT: The hive system can generate an awful lot of log outputs
that are dumped in the hive_output_dir. When a pipeline runs fine,
these are not needed and can take a lot of disk space as well as
generate a large number of files. If you don't want log outputs
(recommended), then just don't specify any hive_output_dir.


4.1- Species Configuration
     ----------

Connection parameters to the source ensembl core should be configured
in this file.

Some 'gotchas' with the core databases should be checked;

- The top-level coordinate system must include a version,
  mysql> SELECT name, version FROM coord_system ORDER BY rank LIMIT 1;

- The meta_coord table must be correctly populated, e.g. using;
  $BASEDIR/ensembl/misc-scripts/meta_coord/update_meta_coord.pl

- Only genes on slices marked as 'toplevel' will be included.
  mysql> SELECT count(*) 
         FROM seq_region cs, seq_region_attrib a, attrib_type at 
         WHERE cs.seq_region_id=a.seq_region_id 
         AND   a.attrib_type_id=at.attrib_type_id 
         AND   at.code="toplevel";

- Only genes with biotype of 'protein_coding' will be dumped;
  mysql> SELECT count(*) FROM gene WHERE biotype='protein_coding';
  mysql> SELECT count(*) FROM transcript WHERE biotype='protein_coding';

- Only canonical transcripts will be loaded
  mysql> SELECT count(*) FROM gene JOIN transcript 
         ON( gene.canonical_transcript_id=transcript.transcript_id );

- The dumping goes much faster if the genes/transcripts are on
  toplevel slices;
  mysql> SELECT COUNT(*) 
         FROM transcript f, seq_region_attrib sra, attrib_type at 
         WHERE f.seq_region_id=sra.seq_region_id 
         AND sra.attrib_type_id=at.attrib_type_id  
         AND at.code='toplevel'
         AND f.biotype='protein_coding';

- The species.taxonomy_id in the meta table must correspond to that in
  the 'my_compara.conf' file;
  mysql> SELECT * FROM meta WHERE meta_key = 'species.taxonomy_id';

4.2- dNdS Configuration
     ----------

In case you want to run the dN/dS part, you will also need this:

  { TYPE => dNdS,
    'codeml_parameters' 
      => do($ENV{BASEDIR}.'/ensembl-compara/scripts/homology/codeml.ctl.hash'),
    'species_sets' => '[[3,22,25,31,32,33,34,35,38],[4,13]]',
    'method_link_types' => ['ENSEMBL_ORTHOLOGUES']
  },

Where the species sets define the pairs of species and
method_link_types (typically 'ENSEMBL_ORTHOLOGUES' and/or
'ENSEMBL_PARALOGUES') for which one wants to run the dNdS. For example:

    'species_sets' => '[[a,b,c,d],[x,y]]'

would do the dNdS for the species pairs: a+b, a+c, a+d, b+c, b+d and
c+d and also between x+y.

One example of the codeml parameters needed can be found in
ensembl-compara/scripts/homology/codeml.ctl.hash


4.3- Bio::EnsEMBL::Analysis::Config files 
     ----------

Copy and modify the Bio::EnsEMBL::Analysis::Config::General file
You may need to change the locations of BIN_DIR, DATA_DIR and LIB_DIR;

  $ cd $BASEDIR/ensembl-analysis/modules/Bio/EnsEMBL/Analysis/Config
  $ cp General.pm.example General.pm
  <editor> General.pm

Copy and modify the Bio::EnsEMBL::Analysis::Config::Blast file

  $ cd $BASEDIR/ensembl-analysis/modules/Bio/EnsEMBL/Analysis/Config
  $ cp Blast.pm.example Blast.pm
  <editor> Blast.pm


Set the blast envirnment; these environment variables cannot be set in
either ::Config::Blast or ::Config::General;

in tcsh:
  $ setenv PATH /usr/local/ensembl/bin:$PATH
  $ setenv BLASTMAT /usr/local/ensembl/data/blastmat:$BLASTMAT
  $ setenv BLASTFILTER /usr/local/ensembl/bin:$BLASTFILTER


5- Create the species tree file 
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A NCBI taxon_id phylogenetic tree file representing the species in the
compara DB is required for the njtree stage. The file must be in Nexus
(.nh) format. If you are running the pipeline for your own set of
species, you can generate such a tree with the script
Ensembl-compara/scripts/tree/testTaxonTree.pl.

One example of the species_tree_file can be found in
ensembl-compara/scripts/pipeline/species_tree_njtree.taxon_id.nh

To have your list of NCBI taxon_ids, you can use "-query_ncbi_name"
like this:

$ perl $BASEDIR/ensembl-compara/scripts/tree/testTaxonTree.pl \
  -url $COMPARA_URL -query_ncbi_name "Anopheles gambiae"

NJTREE can distinguish between species that are supposed to have all
the genes represented in the DB and "incomplete" species, that may
have only a part of the genes represented. For the latter, NJTREE
won't try to estimate gene losses given this info.

If you have a DB with say 3 completely sequenced species, Anopheles
gambiae, Aedes aegypti and Drosophila melanogaster, and 2 incomplete
species, Pediculus humanus and Culex pipiens, you can create the
species tree like this:

(a) Look for the NCBI taxon_ids:

$ perl $BASEDIR/ensembl-compara/scripts/tree/testTaxonTree.pl \
  -url $COMPARA_URL -query_ncbi_name "Anopheles gambiae"

$ perl $BASEDIR/ensembl-compara/scripts/tree/testTaxonTree.pl \
  -url $COMPARA_URL -query_ncbi_name "Aedes aegypti"

$ perl $BASEDIR/ensembl-compara/scripts/tree/testTaxonTree.pl \
  -url $COMPARA_URL -query_ncbi_name "Drosophila melanogaster"

$ perl $BASEDIR/ensembl-compara/scripts/tree/testTaxonTree.pl \
  -url $COMPARA_URL -query_ncbi_name "Culex pipiens"

$ perl $BASEDIR/ensembl-compara/scripts/tree/testTaxonTree.pl \
  -url $COMPARA_URL -query_ncbi_name "Pediculus humanus"

the first three then being "-extrataxon_sequenced 7165_7159_7227", 
and the last two being: "-extrataxon_incomplete 7175_121225"

(b) Then you call the script like this:

$ perl $BASEDIR/ensembl-compara/scripts/tree/testTaxonTree.pl \
  -url $COMPARA_URL \
  -create_species_tree -extrataxon_sequenced 7165_7159_7227 \
  -extrataxon_incomplete 7175_121225  -no_previous

Which in this case will create your file with the name
"njtree.5.ensembl_compara_52.nh":

((((7175,7159*)53550,7165*)7157,7227*)7147,121225);

Also, there is an option to ignore internal nodes present in the NCBI
taxonomy. This will multifurcate the tree at the point when the given
node would disappear. For example, if one wants to ignore the
Coelomata (33316) internal node, and create a multifurcation at that
point, the option would be "-multifurcation_deletes_node 33316".

If you want to create a species_tree with all the species already
present in a previous Compara DB plus a few new species, you can use
the script connecting to that given DB and without the "-no_previous"
option. For example:

$ testTaxonTree.pl -url $COMPARA_URL \
  -create_species_tree -extrataxon_sequenced 9823 \
  -multifurcation_deletes_node 33316

will create a tree with the species in the compara DB, deleting
the Coelomata internal node and adding the Sus scrofa species as
completely sequenced.

You also need to add the species tree string into the protein_tree_tag
table, so that it doesnt overload the filesystem by asking for it
every time, with a command like this:

  $ mysql $COMPARA_DBNAME \
    -e "insert into protein_tree_tag (node_id,tag,value) values (1,'species_tree_string','something')"

6- Run the configure scripts
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The comparaLoadGenomes script use the information in the conf file to
connect to the core databases, queries for things like taxon_id,
assembly, gene_build, and names to create entries in the genome_db
table.  It also sets the genome_db.locator column to allow the system
to know where the respective core databases are located.  There is an
additional table called genome_db_extn which will be deprecated
shortly, but as of today (23 July 2008) it is still used to hold a
'phylum' value for each genome to allow the option of building
homologies only within phylums.  This script will also 'seed' the
pipeline/hive system by creating the first jobs in the analysis_job
table for the analysis 'SubmitGenome'. Note that the
coord_system.version fields in the source species databases must be
set to a 'true' value for the script to work.

The loadGeneTreeSystem script creates the analysis entries for the
processing system, and creates both the dataflow rule and the analysis
control rules.  It also initializes the analysis_stats row for each
analysis.  These rows hold information like batch_size, hive_capacity,
and run-time stats that the Hive's Queen will update.

These scripts may give you warnings if the output directories are not
available or if it's unable to connect to core databases. You need a
fasta_dir and a cluster_dir, and if you are using lustre, you need to
change the type for the fasta dir like this:

 $ FASTA_DIR=somewhere
 $ lfs setstripe $FASTA_DIR -c -1

 $ cd $BASEDIR/ensembl-compara/scripts/pipeline
 $ perl comparaLoadGenomes.pl -conf my_compara.conf
 $ perl loadGeneTreeSystem.hcluster_tablereuse.pl -conf my_compara.conf

If comparaLoadGenomes.pl or loadGeneTreeSystem.hcluster_tablereuse.pl goes wrong, the DB
can be reset as follows (you need to re-run both);

 $ mysql $COMPARA_DBNAME \
   -e"delete from analysis;" -e"delete from analysis_ctrl_rule;" \
   -e"delete from analysis_data;" -e"delete from analysis_description;" \
   -e"delete from analysis_job;" -e"delete from analysis_stats;" \
   -e"delete from dataflow_rule;" \
   -e"delete from genome_db;" -e"delete from genome_db_extn;" \
   -e"delete from hive;" -e"delete from monitor;"

  $ mysql $COMPARA_DBNAME -e"show tables like 'peptide_align_feature_%'"
  (drop each of these)

There are occasions where big alignments are produced that are
unmanagable by the backbone part of the pipeline, and need to be
processed differently. These will cause fail jobs in certain analyses,
only for a small percent of cases. You need to allow that to tolerate
that with this command:

 $ mysql $COMPARA_DBNAME \
 -e "update analysis_stats left join analysis on analysis_stats.analysis_id=analysis.analysis_id set analysis_stats.failed_job_tolerance=5 where analysis.logic_name in (\"NJTREE_PHYML\",\"OrthoTree\",\"QuickTreeBreak\",\"OtherParalogs\",\"BuildHMMaa\",\"BuildHMMcds\")"

7- Create the method_link_species_sets
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you are running this for the Ensembl production release, make sure
that the method_link, species_set and method_link_species_set tables
in your compara DB are in sycn with the method_link_species_set in the
compara master db.

$ mysqldump -t ensembl_compara_master \
  species_set method_link method_link_species_set \
  | mysql $COMPARA_DBNAME

If not, create the method_link_species_sets that you need with the
create_mlss.pl script as follows;

7.1. First you will need an ensembl registry file that configures the
   database connection to the compara database. An example registry
   file would be;

  ---
  use Bio::EnsEMBL::Utils::ConfigRegistry;
  use Bio::EnsEMBL::DBSQL::DBAdaptor;
  use Bio::EnsEMBL::Compara::DBSQL::DBAdaptor;
  new Bio::EnsEMBL::Compara::DBSQL::DBAdaptor(
      -host    => 'localhost',
      -user    => 'admin',
      -pass    => 'secret',
      -port    => 3306,
      -species => 'compara-master',
      -dbname  => 'my_ensembl_compara_45');
  Bio::EnsEMBL::Registry->add_alias("compara-master","DEFAULT");
  1;
  ---

7.2. You may need to create the following method_link entries;
     (consistent IDs make merging DBs easier)

  $ mysql $COMPARA_DBNAME -e \
    'INSERT IGNORE INTO method_link (method_link_id,type, class) \
     VALUES (201,"ENSEMBL_ORTHOLOGUES","Homology.homology"), \
            (202,"ENSEMBL_PARALOGUES", "Homology.homology"), \
            (204,"ENSEMBL_HOMOLOGUES", "Homology.homology"), \
            (301,"FAMILY",             "Family.family"), \
            (401,"PROTEIN_TREES",      "ProteinTree.protein_tree_node"), \
            (402,"NC_TREES",           "NCTree.nc_tree_node")'

7.3. Run the create_mlss.pl script. 

  $ cd $BASEDIR/ensembl-compara/scripts/pipeline

  $ # This gets a comma-separated list of all genome_db_ids in the DB
  $ set SPECIES_LIST=`mysql $COMPARA_DBNAME -r -e \
    'select group_concat(genome_db_id) from genome_db' | tail -n 1`

  $ # For orthologues
  $ perl create_mlss.pl --f --pw \
    --reg_conf ensembl.registry \
    --method_link_type ENSEMBL_ORTHOLOGUES \
    --genome_db_id "$SPECIES_LIST"

  $ # For between-species paralogues
  $ perl create_mlss.pl --f --pw \
    --reg_conf ensembl.registry \
    --method_link_type ENSEMBL_PARALOGUES \
    --genome_db_id "$SPECIES_LIST"

  $ # For same-species paralogues
  $ perl create_mlss.pl --f --sg \
    --reg_conf ensembl.registry \
    --method_link_type ENSEMBL_PARALOGUES \
    --genome_db_id "$SPECIES_LIST"

  $ # For protein trees
  $ perl create_mlss.pl --f \
    --reg_conf ensembl.registry \
    --method_link_type PROTEIN_TREES \
    --genome_db_id "$SPECIES_LIST"

7.4. This step is optional to determine the genomes that have and have
not changed gene sets between releases:

# Run the healthcheck and list the changing species:
# org.ensembl.healthcheck.testcase.generic.ComparePreviousVersionExonCoords

./run-healthcheck.sh -d '.*core_57.*' ComparePreviousVersionExonCoords

# Build the genome_db list of species that haven't changed:
mysql -hdbname -uensro compara_master_db -e "select group_concat(genome_db_id order by genome_db_id) from genome_db where assembly_default=1 and taxon_id!=0 and name not in ('Genus species4','Genus species5','Genus species6')"

# Then fill the reuse parameters in BLASTP_TEMPLATE:
reuse_db=>'mysql://ensro\@server-with-previous-release/compara_prev_release',reuse_gdb=>[1,2,3,7,8,9]

At this point the system is ready to run..


8- Run the hive
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

One must first make sure that the available hive scripts are available
in one's path.  This can be done by either extending PATH or linking the
scripts to an existing directory in the PATH

  tcsh$ setenv PATH ${PATH}:${BASEDIR}/ensembl-hive/scripts
  bash$ export PATH=$PATH:$BASEDIR/ensembl-hive/scripts

You may also need to set the LSB_DEFAULTPROJECT environment variable
if you are using LSF across the sanger farm;

  tcsh$ setenv LSB_DEFAULTPROJECT ensembl-compara
  bash$ export LSB_DEFAULTPROJECT=ensembl-compara
       
There are two beekepers, one which uses LSF job submission system, and
another that uses the local machine only. If using the local machine,
add the -local flag to the command and the workers will run as 
background system commands rather than being submited to an LSF resource.

 $ beekeeper.pl -url $COMPARA_URL -loop

Running the complete build for ensembl45 with 35 species on the Sanger
CPU cluster takes (depending on availability of resources):

- from 1-3 days for the blasts
- half a day for the clustering
- 2-3 days for the multialignments, trees and homologies

If, for some reason, the pipeline hangs, it can sometimes be kicked
back into action using;

  $ beekeeper.pl -url $COMPARA_URL -sync

If you or your queueing system killed all running jobs, you can bring
the beekeeper.pl back to sync with something like:

  $ beekeeper.pl -url $COMPARA_URL -alldead

If you want to manually run one of the jobs in the pipeline in
debugging mode to see if everything is working as expected, query the
db for your analysis_job_id and use the runWorker.pl script like this:

 $ mysql $COMPARA_DBNAME -e \
   "select * from analysis_job aj, analysis a where \
   a.analysis_id=aj.analysis_id and a.logic_name='MCoffee' limit 10"

This will print the first 10 MCoffee jobs, then get the analysis_job_id
for one of them and run:

 $ runWorker.pl -bk LOCAL \
   -url $COMPARA_URL -outdir '' -no_cleanup -debug 1 --job_id 1234567

Sometimes you may need to reset a particular job (or even analysis) to
rerun it. See the beekeeper's '-reset_job_id' and
'-reset_all_jobs_for_analysis' flags.

Sometimes MCoffee jobs fail due to large genetrees, but sometimes it's
simply that the aligner gets confused somehow. The following SQL lists
the first 10 failed MCoffee jobs;
 
  $ mysql $COMPARA_DBNAME -e \
   'select a.logic_name, aj.analysis_job_id, aj.input_id \
    from analysis a join analysis_job aj using( analysis_id ) \
    where aj.status = "FAILED" and a.logic_name="MCoffee" limit 10';

For more details on the failed jobs, e.g. the gene_count, use the
followimng SQL;

  $ mysql $COMPARA_DBNAME -e \
'SELECT * \
 FROM(    \
   SELECT a.logic_name, CAST( \
     TRIM( TRAILING "\, \'clusterset_id\'\=\>1\}" \
           FROM TRIM( LEADING "\{\'protein_tree_id\'\=\>" FROM aj.input_id)) \
     AS UNSIGNED INT) AS root_id, aj.analysis_job_id \
   FROM analysis_job aj JOIN analysis a using( analysis_id ) \
   WHERE a.logic_name="MCoffee" \
   AND aj.status="Failed" ) AS aj \
   JOIN protein_tree_tag p ON p.node_id=aj.root_id \
   WHERE p.tag IN( "gene_count" ) \
 LIMIT 10'


Finally, some jobs (e.g. Sitewise_dNdS) rely on some jobs failing in
order to trigger a resubmit on a different set. As the threshold for
failed jobs is 0, the pipeline will seem to have failed. The following
SQL fixes this particular example;

  $ mysql $COMPARA_DBNAME -e \
"update analysis_stats ast, analysis a \
set ast.failed_job_tolerance=98 \
where ast.analysis_id=a.analysis_id and logic_name='Sitewise_dNdS'"

for more details on controling the hive system do
  $ beekeeper.pl -help

MCoffee jobs shouldn't fail, since at least MAFFT should be able to do
tens of thousands of sequences. If they fail, it's usually because
they need more memory. Once they have failed, one can manually retry
the jobs with more memory like this (7600MB in this example):

beekeeper.pl -url $COMPARA_URL -failed | grep MCoffee | grep job_id | awk -F\= '{print $2}' | awk '{print $1}' | while read i; do echo $i===; export TIMESTAMP=`date +2%3y%m%d_%H%M%S` && bsub -J7600.MCoffee.$i -q normal -R"select[mem>7600] rusage[mem=7600]" -M7600000 -o $HOME/$TIMESTAMP.out -e /$HOME/$TIMESTAMP.err "perl runWorker.pl -bk LSF -url $COMPARA_URL -outdir '' -debug 1 -job_id $i" && sleep 1; done

Then check how much memory they are taking with something like this:
while [ 1 ] ; do bjobs -l | grep 'MEM\:' | awk -FMEM\: '{print $2}' | sort -rn; date; sleep 120; done

8a- The last gene tree alignments take too long. What can I do?
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you have a handful of alignments and are in a hurry to finish, you
can change from the comprehensive mcoffee to the fast mcoffee option
dynamically using:

UPDATE analysis SET parameters=replace(parameters,'cmcoffee','fmcoffee');

Or even more:

update analysis set parameters=replace(parameters,'cmcoffee','mafft');

9 - Quality Control
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There are two QC analysis that are run during the pipeline,
ClustersetQC and GenetreesetQC:

ClustersetQC calculates the number and percent of orphans for every
genome and stores the values as protein_tree_tags. Once it's run you
can check the values running this sql query:

SELECT tag, value, (select name from genome_db where genome_db_id=SUBSTRING_INDEX(tag,'_',1)) as name from protein_tree_tag where node_id=1 and tag like '%ClustersetQC' order by cast(value as decimal)";

GenetreesetQC calculates the number and percent of orphans again for
every genome after all the alignments, trees and homologies have been
processed. You can check the values running:

SELECT tag, value, (select name from genome_db where genome_db_id=SUBSTRING_INDEX(tag,'_',1)) as name from protein_tree_tag where node_id=1 and tag like '%ClustersetQC' order by cast(value as decimal)";

If you are re-running the pipeline using the reuse_db feature from the
previous release, ClustersetQC and GenetreesetQC will also give two
more numbers:

stable_id_mapping_mapped_clusters
stable_id_mapping_novel_clusters

Out of these two numbers, a proportion will be calculated, which is
the proportion of novel clusters. There is a hard-coded default value
of less than 0.2 for these that will make the analysis fail and the
pipeline to stop. If you want to increase the tolerance, set your
preferred value in the BLASTP_TEMPLATE section of the config file:

'unmap_tolerance'=>'0.2'

Once the pipeline is completely finished, consistency of the entries
can be tested with the following script;

 $ perl ${BASEDIR}/ensembl-compara/scripts/pipeline/check_genetree_data.pl \
   -url $COMPARA_URL -longtests 1

It has happened before that the farm node on which an OrthoTree job is
running dies in the middle of the storing of homologies, and this gives
inconsistent homologies. This script is to check this kind of cases.

10 - Adding the ensembl aliases
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Ensembl species have aliases that are used in the webcode. In the
EnsemblCompara GeneTrees, aliases can also be defined for the internal
nodes in the species tree that relates the species set.

The aliases are mapped to the common names in NCBI taxonomy or added
ad-hoc when not existing, and stored in this file:

ensembl-compara/scripts/taxonomy/ensembl_aliases.sql

For example, the internal node for the puffer fishes:

http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=31031

Tetraodontidae
Taxonomy ID: 31031
Genbank common name: puffers
Other names:
common name: 	puffer fishes

With this SQL, we map the taxon_ids to "taxon_alias", so that the
aliases can be used in GeneTreeView (v51ss):

  $ mysql -N $COMPARA_DBNAME -e \
    "INSERT INTO protein_tree_tag SELECT p1.node_id, 'taxon_alias', n.name \
     FROM protein_tree_tag p1, protein_tree_tag p2, ncbi_taxa_name n \
     WHERE n.taxon_id=p2.value and p1.node_id=p2.node_id and \
     p2.tag=\"taxon_id\" and p1.tag=\"taxon_name\" \
     and n.name_class=\"ensembl alias name\"

Additionally, if you have "ensembl timetree mya" entries in the
ensembl_aliases.sql table, you can also add the "(XXX MYA)" tags in
the taxon_alias, so that they can be used in GeneTreeView:

  $ mysql -N $COMPARA_DBNAME -e \
    "INSERT INTO protein_tree_tag SELECT p1.node_id, 'taxon_alias_mya', \
     concat(n.name, ' (', round(n2.name,0), ' MYA)') \
     FROM protein_tree_tag p1, protein_tree_tag p2, ncbi_taxa_name n, ncbi_taxa_name n2 \
     WHERE n.taxon_id=p2.value and n2.taxon_id=p2.value and p1.node_id=p2.node_id and \
     p2.tag=\"taxon_id\" and p1.tag=\"taxon_name\" and n.name_class=\"ensembl alias name\" \
     and n2.name_class=\"ensembl timetree mya\"

To obtain MYA estimates for your species, you can do pairwise queries
using the timetree database (http://www.timetree.org).


11 - Generate basic statistics
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A basic set of gene tree stats can be generated using the following
command (should complete within 5 mins):

 $ cat ${BASEDIR}/ensembl-compara/sql/protein_tree-stats.sql \
   | mysql -vv -t $COMPARA_DBNAME | less

There is also a 'homology-stats.sql' file that can be processed in the
same way, but this takes longer to run.

For a more detailed analysis of the gene tree data, look at;

 $ perl ${BASEDIR}/ensembl-compara/scripts/tree/geneTreeTool.pl -help


12 - Creating the Release Database
     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

*** What to do about the protein_feature_<species> tables? ***

Once the pipeline has completed, the database contains many
pipeline/hive specific tables. These can be removed as follows;

  $ cat ${BASEDIR}/ensembl-compara/sql/drop-pipeline-tables.sql \
    | mysql $COMPARA_DBNAME


To merge the GeneTrees + Families to the general compara DB, look at;

  scripts/pipeline/merge_protein_data.pl

And use the standard ImportDB.pl or CopyDBoverserver.pl to copy things
around...



