This document wraps up how to configure and run the Ensembl Functional Genomics (eFG) 
analysis pipeline.

Prerequisites
=============

Before you can start you need to make sure you have installed the following 
Ensembl API code:

ensembl
ensembl-functgenomics
ensembl-pipeline
ensembl-analysis

as well as

bioperl-live (bioperl-release-1-2-3)

For details please see http://www.ensembl.org/info/using/api/index.html


Background
==========

The Ensembl Functional Genomics (eFG) analysis pipeline makes use of the
Ensembl pipeline system that has been developed for the Ensembl Genebuild. The
documents overview.txt, quick_start_pipeline_guide,
the_ensembl_pipeline_infrastructure.txt, and custom_analyses.txt in
ensembl-doc/pipeline_docs give a good overview. 

At the bottom of this document is a list of utility methods which may come in 
useful if you go wrong.  Take a brief look at these first before running the 
configuration steps.

Setup/Configuration
===================

To get started first you need to fill out a couple of configuration files.


1. Global shell environment
---------------------------

The eFG analysis pipeline system uses various environmental variables
and functions to facilitate common tasks. So on one hand it reuses variables
and functions that have already been defined in your eFG environment (for
details see
http://www.ensembl.org/info/using/api/funcgen/efg_introduction.html). Some  
pipeline specific functions we will refer to later on are defined within 
the environment file

  $SRC/ensembl-functgenomics/scripts/pipeline/.efg_pipeline

This file contains common settings and functions utilised by the pipeline
environment and hence does not usually need to be edited. Instead a second 
config file is used to define project specific settings, like 

  $SRC/ensembl-functgenomics/scripts/pipeline/human_50.env

Such a file has to be set up and maintained according to your project (see
comments within the example file example.env that can be used as template). To
ease the access to the environment it is handy to set up an alias in your
shell configuration as follows (in bash syntax):

  alias eFG_human='. $SRC/ensembl-functgenomics/scripts/pipeline/human_50.env'

A similar alias you probably have set up already for your general eFG 
environment, i.e.

  alias eFG='. $SRC/ensembl-functgenomics/scripts/.efg'

Once you have set up all the config files you first need to source the general 
.efg file, and subsequently your project specific env file.


2. Configuration of the eFG runnables
-------------------------------------

All the configuration modules for the Ensembl analysis runnables can be found
at 

   ensembl-analysis/modules/Bio/EnsEMBL/Analysis/Config

First you need to rename the file General.pm.example to General.pm. Nothing
needs to be edited within this module. The eFG analysis specific configuration
modules are at

   ensembl-analysis/modules/Bio/EnsEMBL/Analysis/Config/Funcgen

This directory contains all the example config modules (.pm.example) for 
the analysis that are currently implemented:

        Nessie (Flicek et al., unpublished)
        TileMap (Ji and Wong (2005), Bioinformatics)
        Chipotle
        ACME
        Splitter
        SWEmbl (Wilder et al., unpublished)
                
Rename the config files for the analysis you are going to use so that it 
ends on ".pm" and edit the CONFIG hash within that module as required to 
configure your analysis. DEFAULT represents the default settings for any 
analysis configured in the module. The environment variables used in the
DEFAULT section of the CONFIG hash have already been set in your efg and
efg_pipeline environment and should not need to be edited. To configure a
analysis with a parameter set different from DEFAULT settings you have to add
another element to the CONFIG hash, where the key resembles the logic_name of
this analysis and the values overwrite the DEFAULT settings. Each example
module contains sample configs that can be used as templates. 


3. Configuration of the analysis pipeline
-----------------------------------------

To configure the pipeline system you need to rename and edit the following two
files:

    ensembl-pipeline/modules/Bio/EnsEMBL/Pipeline/Config/General.pm.example
    ensembl-pipeline/modules/Bio/EnsEMBL/Pipeline/Config/BatchQueue.pm.Example

For details see the documentation in ensembl-doc as mentioned above. To help
especially filling in the BatchQueue.pm module there is a function called
WritePipelineConfig (underlying script: write_pipeline_config.pl) that prints
a config hash into a file (see section 4. for further details). For eFG
purposes you must set the follwong variables as follows: 

DEFAULT_OUTPUT_DIR => $ENV{'ANALYSIS_WORK_DIR'}
DEFAULT_BATCH_QUEUE => 'your queue name'
DEFAULT_RETRY_QUEUE => 'your retry queue name'

Or if you intend to run the pipeline locally rather than with LSF, you can
disregard the above QUEUE variables and set the following:

QUEUE_MANAGER       => 'Local',

BatchQueue.pm will also need the relevant analysis config hashes adding, these 
are autogenerated and detailed in the following section below.


4. Setting up pipeline tables
-----------------------------

The eFG analysis pipeline system uses a separate pipeline database to control
the pipeline and to avoid messing around with the actual eFG database analysis 
table. 

Therefore we first need to create this pipeline database with the command

    CreatePipelineDB <password>

which has been defined in the global efg shell environment configuration
file. It 1.) creates the pipeline database, 2.) creates the pipeline tables,
and 3.) adds schema version to the meta table.

The next step is to configure the pipeline. Hence we first need to generate the 
analysis and rules config (see the_ensembl_pipeline_infrastructure.txt) as well
as the pipeline BatchQueue config mentioned above. To generate the necessary
files use the shell function

     WritePipelineConfig <password> <module> <SubmitType> ['overwrite']

The underlying script (write_pipeline_config.pl) reads in the analysis config
hash of the runnable (Analysis/Config/Funcgen/<module>.pm) that has already
been set up in section 2 and write three files. As options to the function you
need to pass

     <module>      to specify the Runnable config module to be used
 
     <SubmitType>  to set the SubmitType, i.e. slice, file, or array 
                   respectively. It indicates the type of input ids. Some 
                   Runnables need slices as input, like Nessie, or entire
                   array chips, like ACME and Chipotle, others need files, 
                   like SWEmbl

Examples:
     
     WritePipelineConfig **** SWEmbl file
     WritePipelineConfig **** Nessie slice 
     WritePipelineConfig **** ACME array 

     or using the perl script directly

     write_pipeline_config.pl -module Nessie -module TileMap -slice

The output is written to three files located in the $SCRIPTSDIR/conf directory:

    o BatchQueue.conf - contains some analysis-specific settings to the pipeline
      to be copied and pasted into the actual BatchQueue.pm in
      Bio/EnsEMBL/Pipeline/Config (here it needs to be added to the QUEUE_CONFIG 
      list of the Config hash). NOTE: It is important to remove or comment out 
      any other analyses which maybe present as default and may not have been 
      configured properly as these could be configured incorrectly and hence 
      cause misdirecting errors.         
    o analysis.conf - contains config to set up the analysis table and will be 
      read when setting up pipeline tables below
    o rules.conf - contains config to set up the rules table and will be read
      when setting up pipeline tables below

A more detailed explanation about the pipeline config files and what the actual 
parameters are for you can find in Ensembl pipeline documentation at 
ensembl-doc/pipeline_docs/. 

Now the pipeline config can be imported by using the shell command

    ImportPipelineConfig <password>

which runs the three pipeline scripts analysis_setup.pl, rule_setup.pl, and 
setup_batchqueue_outputdir.pl.

Now double check the analysis and rule tables in your pipeline DB.


Finally we needs to create input_ids. In eFG analysis pipeline terms an
input_id consists of an experiment name and a genomic slice or path to an input 
file name concatenated by a colon, i.e.

    GM12878_FAIRE_ENCODE:chromosome:NCBI36:2:234156564:234656627:1

or 

    CD4_H3K4ac:CD4_H3K4ac.bed.gz

or 

    GM12878_PolII_ENCODE:ARRAY

or    

    MEF_H3K4me3_Mikkelsen2007:MEF_H3K4me3_le2m_reads.bed.gz

The shell function

    CreateInputIds <password> <SubmitType> [<exp_regex> <exp_suffix>]

allows for doing this, where <SubmitType> can be either

    slice <'encode'|'toplevel'>

to create Encode slices or toplevel slices as input_ids, 

   array

to analyse entire array chips or

    file <directory>

to use the gzipped bed files in a given <directory> to create the input_ids.

Behind the scenes it uses a perl script called create_input_ids.pl that either 
fetches the slices from the database or reads the files, concatenates them
with the experiment names (optionally selected by <exp_regex>), and imports
them directly into the pipeline database input_id_analysis table. With
<exp_regex> you can pass any Perl regular expression to match a certain subset
of experiment names, i.e. '^CD4_' would create input_ids for all experiments
that begin with 'CD4_' in the name (see examples below). The optional paramter
<exp_suffix> allows you to add an additional string to the experiment name
that is subsequently stored in the database. This option is to distinguish
between sets/experiments that have been performed on the same feature and cell
type. Note that if you only want to add a suffix to the experiment name
without filtering with <exp_regex> option, <exp_regex> needs to be an empty
sting ('', see example below).

If files are used as input (i.e. SWEmbl) the experiment name needs to be
encoded in the filename that should have the format 

    <cell_type>_<feature_type>[_<...>].bed.gz

Note that the infiles are expected to be gzipped and formatted according to
the bed file format definitions with the first 6 fields being set: chr, start,
end, name, score, strand. (see http://genome.ucsc.edu/FAQ/FAQformat)

Also you need to make sure that your eFG database knows about cell and feature 
types to be analysed and imported. You can use $SCRIPTSDIR/run_import_type.pl 
to import cell and feature types. See perl documentation of the script for 
details. This script already contains a lot of predefined cell and feature
types hard coded in the types hash that can be used by uncommenting particular
lines.

Examples:

    CreateInputIds **** file $EFG_DATA/input/SOLEXA/LMI_methylation

  creates for all gzipped bed files in $EFG_DATA/input/SOLEXA/LMI_methylation
  the corresponding input_id

    CreateInputIds **** file /lustre/work1/ensembl/graef/efg/input/SOLEXA/LMI_acetylation '^CD4_H3K.*'

  creates for all gzipped bed files in .../SOLEXA/LMI_methylation that match 
  '^CD4_H3K.*' the corresponding input_id

    CreateInputIds **** slice encode 'CD4_DNAse'

  creates for all experiments with 'CD4_DNAse' in the name input_ids using the
  Encode region

    CreateInputIds **** slice toplevel 'PolII'

  creates for all experiments with 'PolII' in the name input_ids using all 
  toplevel slices

    CreateInputIds **** file /home/graef/data/Mikkelsen2007/alignments '' 'Mikkelsen2007'

  creates for all gzipped bed files in given directory the corresponding input_id 
  and appends the string 'Mikkelsen2007' to the experiment name comprising cell 
  and feature type.

Once the input_ids have been imported into the pipeline database, check the 
input_id_analysis table to make sure they look sensible. Now you're ready for 
running the analysis.

The shell function

    CleanInputIds <password>

allows you to delete ALL input_ids from the input_id_analysis table in the
pipeline database

After setting up and configuring the pipeline system 

    CheckPipelineSanity <password>

performs a series of sanity checks on your database. The underlying perl
script pipeline_sanity.pl is described in more detail in the Ensembl pipeline
infrastructure document (the_ensembl_pipeline_infrastructure.txt). 

Note (taken from the infrastructure document): "Once you have run the 
pipeline_sanity script and the only errors reported are acceptable (note that 
dummy analyses will generally fail the 'module compile' and 'entry in 
batchqueue' checks and this is fine), then you can consider starting the 
pipeline."

To test if everything has been set up you can use

    test_eFG_Runnable <password> <module> <logic_name> <experiment> <input_id> [<options>]

where
   
    <module>     is the name of the runnable
    <logic_name> is the logic)_name of the analysis
    <experiment> is the name of the experiment to be analysed
    <input_id>   is the slice of the genomic region to be analysed
    [<options>]  are optional additional flags that are passed through to the
                 underlysing perl script, i.e. -write to perform a write test
                 that actually stores the resulting annotating features

Example:

  using a file as input_id:
    test_eFG_Runnable **** SWEmbl SWEmbl_default CD4_H3K36me3:CD4_H3K36me3.bed.gz -write

  using a slice (here encode region) as input_id:
    test_eFG_Runnable **** Nessie Nessie_NG_g0100_s1M_d0150_p99 GM12878_c-Myc_ENCODE:chromosome:NCBI36:5:141880151:142380150:1

This shell function calls behind the scenes the test_RunnableDB perl script of
the ensembl-pipeline API. Results are going to be written to the eFG database
unless you add further optional test_RunnableDB flags, like -nowrite, -check,
-input_id_type, or -verbose (for details see test_RunnableDB documentation).


Running the pipeline
====================

Once you have run the CheckPipelineSanity and test_eFG_Runnable functions
successfully you can consider starting the pipeline. This can be done with 

   RunAnalysis <password> [<ln_regex>]

where <ln_regex> is a optional mysql regular expression to match logic_name of
analysis.

Example:

    RunAnalysis **** 'Nessie_NG_g0100_.*'

  to run all analyses which logic_name begins with 'Nessie_NG_g0100_'


Supervising the pipeline run (Errors and Output)
================================================

There are three shell functions that allow you to supervise and analyse a
pipeline run:

Monitor - reports an overview on the current status of the pipeline
GetFailedJobs - lists all failed jobs and the corresponding log files 
GetRunningJobs - lists all jobs that currently run

Logfiles are written to $ANALYSIS_WORK_DIR/$DBNAME/[0-9]/



Utility methods
===============


 o RemoveLock <password>

   Remove a pipeline lock from the pipeline DB should your pipeline fall over 
   and refuse to restart.


 o CleanInputs <password>

   Cleans all input IDs from the pipeline DB should you need to reload them.


 o DropPipelineDB <password>

   Removes entire pipeline DB should you wish to start from fresh


Summary (Protocol of release 51 running SWEmbl on histone data sets)
====================================================================

source efg and pipeline environment

   eFG
   eFG_pipeline 

import feature types

   run_import_type.pl -pass ****

create pipeline database and establish tables in pipeline database

   CreatePipelineDB ****

edit parameter setting in Bio::Ensembl::Analysis::Config::Funcgen::SWEmbl and
run 

   WritePipelineConfig **** SWEMBL file

and copy and paste entires from BatchQueue.conf to BatchQueue.pm

set up rules and analysis

   ImportPipelineConfig ensembl

generate input_ids

   CreateInputIds **** file /lustre/work1/ensembl/graef/efg/input/SOLEXA/LMI_methylation
   CreateInputIds **** file /lustre/work1/ensembl/graef/efg/input/SOLEXA/LMI_acetylation

kick-off analysis

   RunAnalysis **** file all
