ABOUT
~~~~~

This document describes how to run the ensembl-compara BRH/RS-based
homology pipeline.

- Loads the longest peptide for each gene loci in each configured
  species database into the compara DB's member table.

- Loads each UniProt protein into the member table (metazoa only for
  now; needs extending for e.g. viridiplantae). Only needed if the
  family pipeline is to be run subsequently.

- Performs all-against-all blastp searched of the member proteins in
  the DB, and loads the processed results into the homology_member
  table.

- Run this pipeline in preparation for running the family
  pipeline. This pipeline roughly corresponds to the first part of the
  genetree pipeline, for which there is a README-genetree explanatory
  file.


1- code API needed and executable
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1.1 Executables

  wublastp
      using /usr/local/ensembl/bin/wublastp
  setdb
      using /usr/local/ensembl/bin/setdb
  codeml
      using /usr/local/ensembl/bin/codeml

1.2 Code checkout

See README.genetree

1.3 Perl environment

See README.genetree

1.4 CPAN Perl modules

See README.genetree


2- Configure database
3- Load the NCBI taxonomy data
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See README.genetree


4- compara-hive config file
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Copy and modify the compara-hive config file

  $ cd $BASEDIR/ensembl-compara/scripts/pipeline
  $ cp compara-hive-homology.conf.example my_compara.conf
  <editor> my_compara.conf
   
You may need to change the database names, port, dbnames, and the
paths to the 'hive_output_dir' and 'fasta_dir'. Note that these dirs
must exist.


4.1- Species Configuration
4.2- dNdS Configuration
4.3- Bio::EnsEMBL::Analysis::Config files 
     ----------

See README.genetree


5- Run the configure scripts
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The loadHomologySystem script creates the analysis entries for the
processing system, and creates both the dataflow rule and the analysis
control rules.  It also initializes the analysis_stats row for each
analysis.  These rows hold information like batch_size, hive_capacity,
and run-time stats that the Hive's Queen will update.

The comparaLoadGenomes script use the information in the conf file to
connect to the core databases, queries for things like taxon_id,
assembly, gene_build, and names to create entries in the genome_db
gtable.  It also sets the genome_db.locator column to allow the system
to know where the respective core databases are located.  There is an
additional table called genome_db_extn which will be deprecated
shortly, but as of today (11 April, 2007) it is still used to hold a
'phylum' value for each genome to allow the option of building
homologies only within phylums.  This script will also 'seed' the
pipeline/hive system by creating the first jobs in the analysis_job
table for the analysis 'SubmitGenome'. Note that the
coord_system.version fields in the source species databases must be
set to a 'true' value for the script to work.

These scripts may give you warnings if the output directories are not
available or if it's unable to connect to core databases.

 $ cd $BASEDIR/ensembl-compara/scripts/pipeline
 $ ./comparaLoadGenomes.pl -conf my_compara.conf
 $ ./loadHomologySystem.pl -conf my_compara.conf


7- Create the method_link_species_sets
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

See REAME.genetree


6- Run the hive
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

One must first make sure that the available hive scripts are available
in one's path.  This can be done by either extending PATH or linking the
scripts to an existing directory in the PATH

  tcsh$ setenv PATH ${PATH}:${BASEDIR}/ensembl-hive/scripts
  bash$ export PATH=$PATH:$BASEDIR/ensembl-hive/scripts

You may also need to set the LSB_DEFAULTPROJECT environment variable
if you are using LSF across the sanger farm;

  tcsh$ setenv LSB_DEFAULTPROJECT ensembl-compara
  bash$ export LSB_DEFAULTPROJECT=ensembl-compara
       
There are two beekepers, one which uses LSF job submission system, and
another that uses the local machine only. If using the local machine,
add the -local flag to the command and the workers will run as 
background system commands rather than being submited to an LSF resource.

 $ beekeeper.pl \
   -url mysql://user:password@ecs2:3306/my_ensembl_compara_45 -loop

where 'user' and 'password' allow write access to the database

Running the complete homology build for ensembl31 with 16 species takes around
10,000 CPU-hrs of total work.  On a 600 CPU cluster this can finish in anywhere
from 1-3 days depending on availability of resources.

for more details on controling the hive system do
  $ beekeeper.pl -help

