

1- code API needed and executable
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  bioperl-live (bioperl-1-2-0?)
  ensembl
  ensembl-compara
  ensembl-hive
  ensembl-pipeline
  ensembl-analysis

  executables
  ~~~~~~~~~~~
  mercator
      using /usr/local/ensembl/bin/mercator
  blastz
      using /usr/local/ensembl/bin/blastz

1.2 Code checkout

      cvs -d :ext:bio.perl.org:/home/repository/bioperl co -r branch-07 bioperl-live
      cvs -d :ext:cvs.sanger.ac.uk:/nfs/ensembl/cvsroot co ensembl
      cvs -d :ext:cvs.sanger.ac.uk:/nfs/ensembl/cvsroot co  ensembl-pipeline
      cvs -d :ext:cvs.sanger.ac.uk:/nfs/ensembl/cvsroot co  ensembl-hive
      cvs -d :ext:cvs.sanger.ac.uk:/nfs/ensembl/cvsroot co  ensembl-analysis

in tcsh
    setenv BASEDIR   /some/path/to/modules
    setenv PERL5LIB  ${BASEDIR}/ensembl/modules:${BASEDIR}/ensembl-pipeline/modules:${BASEDIR}/bioperl-live:${BASEDIR}/ensembl-compara:${BASEDIR}/ensembl-hive:${BASEDIR}/ensembl-analysis
    setenv PATH $PATH:${BASEDIR}/ensembl-compara/script/pipeline:${BASEDIR}/ensembl-hive/scripts

in bash
    BASEDIR=/some/path/to/modules
    PERL5LIB=${BASEDIR}/ensembl/modules:${BASEDIR}/ensembl-pipeline/modules:${BASEDIR}/bioperl-live:${BASEDIR}/ensembl-compara:${BASEDIR}/ensembl-hive:${BASEDIR}/ensembl-hive
    PATH=$PATH:${BASEDIR}/ensembl-compara/scripts/pipeline:${BASEDIR}/ensembl-hive/scripts

1.3 Configure the Pipeline:

  Copy ${BASEDIR}/ensembl-pipeline/modules/Bio/EnsEMBL/Pipeline/Config/Blast.pm.example to ${BASEDIR}/ensembl-pipeline/modules/Bio/EnsEMBL/Pipeline/Config/Blast.pm


2- Create a hive/compara database
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* Pick a mysql instance and create a database

mysql -h compara1 -uensadmin -pxxxx -e "create database jh7_pecan_47"

cd ~/src/ensembl_main/ensembl-hive/sql
mysql -h compara1 -uensadmin -pxxxx jh7_pecan_47 < tables.sql

cd ~/src/ensembl_main/ensembl-compara/sql
mysql -h compara1 -uensadmin -pxxxx jh7_pecan_47 < table.sql
mysql -h compara1 -uensadmin -pxxxx jh7_pecan_47 < pipeline-tables.sql

* Populate your database:

cd /lustre/work1/ensembl/jh7/release47/pecan/

* Create a registry configuration file:

=======================================
use strict;
use Bio::EnsEMBL::Registry;
use Bio::EnsEMBL::Compara::DBSQL::DBAdaptor;

new Bio::EnsEMBL::Compara::DBSQL::DBAdaptor(
    -host => 'compara1',
    -user => 'ensro',
    -port => 3306,
    -species => 'compara-master',
    -dbname => 'jh7_ensembl_compara_master');

new Bio::EnsEMBL::Compara::DBSQL::DBAdaptor(
    -host => 'compara1',
    -user => 'ensadmin',
    -pass => 'XXXXXXXX',
    -port => 3306,
    -species => 'jh7_pecan_47',
    -dbname => 'jh7_pecan_47');

Bio::EnsEMBL::Registry->load_registry_from_url('mysql://ensro@ens-staging/47');

1;
=======================================

* Populate the database with data from the master DB:

~/src/ensembl_main/ensembl-compara/scripts/pipeline/populate_new_database.pl \
  --master compara-master --new jh7_pecan_47 --reg reg.conf

3- Choose a working directory with some disk space
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* The multiple aligner pipeline tries to minimize the amount of output data for workers.
But if debug options is on this can take some space. So be careful.

mkdir -p /lustre/scratch1/ensembl/jh7/release47/pecan/blastDB
## Sets the stripping patterns for big files
lsf setstripe /lustre/scratch1/ensembl/jh7/release47/pecan/blastDB 0 -1 -1
mkdir -p /lustre/scratch1/ensembl/jh7/release47/pecan/workers

* This directory needs to be set in 'hive_output_dir' variable in the compara-hive
configuration file. See below. If not set, all STDOUT/STDERR goes to /dev/null. :)

4- Copy and modify your compara-hive config file
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

cd /lustre/work1/ensembl/jh7/release47/pecan/

cp ~/src/ensembl_main/ensembl-compara/scripts/pipeline/compara-hive-multiplealigner.conf.example hive.conf.pl
cp ~/src/ensembl_main/ensembl-compara/scripts/pipeline/hmr.tree.example hmr.tree
<editor> hive.conf.pl
<editor> hmr.tree

You may need to change the database names, port, dbnames, and the
paths to the 'hive_output_dir' to
/lustre/scratch1/ensembl/jh7/release47/pecan/workers
'fasta_dir' and 'tree_file' will need to be updated as well.
The species names should correspond to the genome_db_id of the species used (and defined in the 
hive configuration file). You can get a newick species tree using the geneTreeTool.pl script:

perl ~/src/ensembl_main/ensembl-compara/scripts/tree/geneTreeTool.pl --file \
  ~/src/ensembl_main/ensembl-compara/scripts/pipeline/species_tree_rap.nh --newick \
  --keep_leaves "Homo sapiens,Mus musculus,Rattus norvegicus"

Species trees in other formats can be found in the same directory
~/src/ensembl_main/ensembl-compara/scripts/pipeline/species_tree*

The species tree for the Pecan pipeline requires the species to be identified using the genome_db_id. To convert the species names to genome_db_ids, run the
script:

perl ~/src/ensembl_main/ensembl-compara/scripts/pipeline/tree_name_to_ids.pl ~/src/ensembl_main/ensembl-compara/scripts/pipeline/species_tree_blength.nh 'mysql://ensro@compara1/jh7_pecan_47' > tree.nw

<editor> tree.nw


You can have several multiple alignment pipelines working in the same HIVE. To do this, add an extra
MULTIPLE_ALIGNER block in the configuration file and change the corresponding list of species and the
tree file. The HIVE will run the all-vs-all blastp jobs and then build a Synteny map for each
multiple alignment pipeline. The resulting syntenies will be used to feed the multiple aligner.

SET_INTERNAL_IDS is optional and if defined in the conf file will ensure that 
the internals ids in the genomic_align_block and genomic_align tables are unique across a release by adding $mlss_id*10**10.


You can also have a CONSERVATION_SCORE block if you want to run Gerp or any other similar software
to get constrained elements and conservation scores from the resulting multiple alignments.

The default version of Gerp is version 2.1 and for this version, no parameter files are required. To run Gerp version 1 you will also need to create a parameter file (tab-delimited), for example:

-----------------------------------
alignment	gerp_alignment.mfa
phylo_tree	gerp_tree.nw
window_length	1
repeats	NULL
no_rsmin_estimate
merge_distance	6
rej_subs_min	8.5
no_ABC_files
-----------------------------------

where alignment: alignment file used in multi-fasta format
      phylo_tree: tree file
      window_length: length of window used for sliding window rate estimation
      repeats: either the repeat annonation file or NULL to ignore repeats
      no_rsmin_estimate: no automatic estimation of restricted substitution rate
      merge_distance: maximum number of unconstrained scores allowed between candiate constrained elements
      rej_sub_min: threshold score to define candidate constrained element as significant
      no_ABC_files: no ABC files produced

You will also need to add the paths for gerp and semphy if you want to run Gerp. At the time of writing these can be found in:
/software/ensembl/compara/semphy
/software/ensembl/compara/gerp/GERPv2.1

For the SYNTENY_MAP_BUILDER, the MULTIPLE_ALIGNER and the CONSERVATION_SCORE blocks you can specify
the logic_name for the method. You can also specify the module name. If none is given, the hive will
use the Bio::EnsEMBL::Compara::Production::GenomicAlignBlock::$logic_name module.

At the end of the pipeline, some basic healthchecks are performed. Currently these are conservation_jobs which checks whether there is one conservation job per multiple alignment and conservation_scores which checks if the conservation scores correspond to existing genomic_align_blocks, whether the correct gerp_XXX entry exists in the meta table and whether there are no alignments with more than 3 seqs and no scores. If no HEALTHCHECK module is defined, the parameters are automatically created.


5- Run the configure scripts
   ~~~~~~~~~~~~~~~~~~~~~~~~~
The following scripts are in ensembl-compara/scripts/pipeline (should be in your PATH)
The order of execution is important

comparaLoadGenomes.pl -conf hive.conf.pl
loadMultipleAlignerSystem.pl -conf hive.conf.pl

The comparaLoadGenomes script uses the information in the conf file to connect to
the core databases, queries for things like taxon_id, assembly, gene_build, and names
to create entries in the genome_db table.  It also sets the genome_db.locator column to
allow the system to know where the respective core databases are located. This script 
will also 'seed' the pipeline/hive system by creating the first jobs in the analysis_job
table for the analysis 'SubmitGenome'.

The loadMultipleAlingerSystem script creates the analysis entries for the processing
system, and creates both the dataflow rule and the analysis control rules.
It also initializes the analysis_stats row for each analysis.  These row hold
information like batch_size, hive_capacity, and run-time stats that the Hive's
Queen will update.

These scripts may give you warnings if the output directories are not available or if
it's unable to connect to core databases.

At this point the system is ready to run

NB: Although this feature has not been extensivelly tested, it should be
  possible to start the pipeline with all but one species if this one is not
  ready yet. You will need to update manually the analysis_job table and set
  the status of the relevant analysis_job (<analysis_id> = 1 for SubmitGenome
  analysis and <input_id> should match the genome_db_id of the species you
  want to block) to "BLOCKED". Later on, you can set the status back to "READY"
  in order to unblock it.


6- Run the beekeeper
   ~~~~~~~~~~~~~~~~~

The following scripts are in ensembl-hive/scripts (should be in your PATH)
beekeeper.pl -url mysql://ensadmin:xxxx@compara1:3306/jh7_pecan_47 -loop

where xxxx is the password for write access to the database

for more details on controling/monitoring the hive system see 
beekeeper -help

7- Healthchecks
   ~~~~~~~~~~~~
The healthchecks are run at the end of the pipeline. If a healthcheck fails, it is important to discover why it has failed. It is possible to do this either by looking at the output file if hive_output_dir is defined or by running the job again interactively using runWorker.pl.
for example
runWorker.pl -bk LSF -url mysql://ensadmin:xxxx@compara1:3306/kb3_compara_human_mouse_tblat_healthcheck_test -job_id 115 -outdir '' -no_cleanup

where the job_id is the analysis_job_id of the job you wish to rerun. For more information on the options of runWorker.pl see
runWorker.pl -help

*TROUBLESHOOTING*

* Some sequences refuse to be blasted.

  - The actual pipeline uses a blast filter that only allows for the 20 standard aminoancids
    and X. Check that your sequence do not contain selecisteines (* or U) for instance and
    replace them by X's:
    UPDATE sequence SET sequence = REPLACE(sequence, "U", "X") WHERE sequence_id = 123456;
    UPDATE sequence SET sequence = REPLACE(sequence, "*", "X") WHERE sequence_id = 123456;

* Mlagan/Pecan jobs are failing

  - Some of these jobs may run out of memory. Reset those jobs and overwrite the default amount
    of memory allowed for the Java VM:

    UPDATE analysis_job SET input_id = concat("{java_options=>'-server -Xmx2500M',", substr(input_id,2)),
        status = "READY", job_claim = "" WHERE status = "FAILED" AND analysis_id = 8;

    Run these jobs on the long queue (requesting quite a lot of memory)

    beekeeper.pl --lsf_options '-R "select[mem>=6000] rusage[mem=6000]" -q long' \
        --url mysql://ensadmin:xxxxxxx@compara1/jh7_pecan_42 --run

* Gerp jobs are failing.

  - Check you set up your environment so the main gerp script can find
    the remaining gerp scripts and the semphy program.

  - Check the source code for the Bio::EnsEMBL::Compara::Production::GenomicAlignBlock::Gerp
    module. The path to the main Gerp script is hard-coded there.

