

1 - code API needed and executable
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Perl code:

    bioperl (1.2 or above)
    ensembl (use a stable branch)
    ensembl-compara
    ensembl-hive
    ensembl-pipeline
    ensebml-analysis

    executables:

    blastz, lavToAxt, faToNib, axtChain

    See other documentation for how to get hold of these.

2 - Environment set-up
    ~~~~~~~~~~~~~~~~~~

    set BASEDIR to  /some/path/to/your/code/checkouts/

    PERL5LIB needs to include:
    ${BASEDIR}/ensembl/modules
    ${BASEDIR}/ensembl-pipeline/modules
    ${BASEDIR}/bioperl
    ${BASEDIR}/ensembl-compara/modules
    ${BASEDIR}/ensembl-hive/modules
    ${BASEDIR}/ensembl-analysis/modules

    PATH need to include:

    ${BASEDIR}/ensembl-hive/scripts

    Also, copy ensembl-analysis/modules/Bio/EnsEMBL/Analysis/Config/General.pm.example to
    General.pm, and make sure BIN_DIR is set to the directory containing the executables
    above. 


3 - Create a hive/compara database
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Example: database klh_compara_human_pangolin_blastz on genebuild1

    mysql -h genebuild1 -uensadmin -pxxxx -e "create database klh_compara_human_pangolin_blastz";

    cd $BASEDIR/ensembl-hive/sql
    mysql -h genebuild1 -uensadmin -pxxxx klh_compara_human_pangolin_blastz < tables.sql

    cd $BASEDIR/ensembl-compara/sql
    mysql -h genebuild1 -uensadmin -pxxxx klh_compara_human_pangolin_blastz < table.sql
    mysql -h genebuild1 -uensadmin -pxxxx klh_compara_human_pangolin_blastz < pipeline-tables.sql

4- Copy and modify your compara-hive config file
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

   cp $BASEDIR/ensembl-compara/scripts/pipeline/compara-hive-2xalignment.conf.example ./compara_2x.conf

   Now change the relevant details of the sections. The comments in the 
   file should help you to do this. Take particular care that after editing, the
   resulting file is still valid perl.

5 - Run the configure scripts for raw alignment
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    perl $BASEDIR/ensembl-compara/scripts/pipeline/comparaLoadGenomes.pl -conf ./compara_2x.conf
    perl $BASEDIR/ensembl-compara/scripts/pipeline/loadPairAlignerSystem.pl -conf ./compara_2x.conf 

    Note: you want to add the config entry dump_nib => '1' to the compara_2x.conf config file in the 
    DNA_COLLECTION configuration hash of your target (usually human ) genome

6 - Run the beekeeper to generate raw alignment
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    beekeeper.pl -url mysql://ensadmin:xxxx@ia64g:3306/klh_compara_human_pangolin_blastz -loop

    Now wait. The raw alignment can take 1-3 days of wall-clock time
    (depending on farm load). 


7 - Run the configure script for alignment processing
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    
    perl $BASEDIR/ensembl-compara/scripts/pipeline/load2xAlignmentFilterSystem.pl\
     -conf ./compara_2x.conf


8 - Run the beekeeper for alignment processing
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    beekeeper.pl -url mysql://ensadmin:xxxx@ia64g:3306/klh_compara_human_pangolin_blastz -loop

9 - Check that everything has finished
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    If the single job for the analysis "UpdateMaxAlignLengthAfterNet" is
    registered as "DONE" (run the beekeeper without the -loop option to 
    get the current status of the system), then because this is the last
    job to be run, it's a good sign. You should also check the  
    genomic_align_block table to make sure that there are aligments for
    the three expected method_links in the system (raw, chains and nets). 

    If something is wrong, you will need to find out what it is. See next
    section. 

=====================
Debugging strategies
=====================

- Signs that there is a problem

  The majority of time in this pipeline is spent (a) performing searches/filters 
  writing the results back to the database, or (b) creating and writing jobs for
  the next stage of the pipeline. If after some time neither the genomic_align_block
  table nor the analysis_job table are increasing in size, it's likely that something
  is wrong. No jobs running on a the farm is also a sign that something is wrong
  (submitted jobs are failing quickly). 

  If you suspect that something is wrong, there is no harm in killing the beekeeper
  during its sleep phase. It will work out what to do when it is restarted. After
  killing it, you can run in again in non-loop mode to get a progress report:
  
  beekeeper.pl -url mysql://ensadmin:xxxx@ia64g:3306/klh_compara_human_pangolin_blastz -analysis_stats

  This will indicate the analysis the pipeline has got stuck on. You now need
  to find out why jobs from that analysis are failing. There are two main
  strategies for this:


(a) Keeping the output

  You can keep the compara-output files for debugging 
  purposes if you add an entry 'hive_output_dir' to the meta-table 
  of the compara db - but be careful and DON'T run a whole pipeline in 
  debugging-mode - it will produce one output-file per job and you will
  end up with > 500k files.
   
  INSERT INTO META(meta_key,meta_value) VALUES ('hive_output_dir','/path/to/scratch/') ;

(b) Run the worker locally

  The Hive analogue of the ensembl-pipeline runner script is called runWorker.pl. 
  You can debug a process that seems to be failing by running runWorker.pl locally
  on the command line in debug mode:

  runWorker.pl  -url mysql://ensadmin:xxxx@ia64g:3306/klh_compara_human_pangolin_blastz -job_id 10 -no_write -debug 4

  Errors will be sent to screen. You will need to examine the analysis_job table 
  to get the ID of a job belonging to the particular analysis that seems to be failing. 
  Also, it might sometimes be useful to omit the "-no_write" flag, because
  it is possible that the jobs are failing at the stage where they write to the
  database. 


=======
Gotchas
=======

a. Your config file is not valid perl

   Check it with perl -c before running the set-up script(s)

b. Impedence mismatch between core databases and core code

   You need to make sure that your core databases are on the same schema
   branch, which is matched by your core code check-out. This is crucial;
   if there's a problem with this, you will probably get silent death of
   your blastz jobs.

[list to be added to]
