:::: The Ensembl Functional Genomics Pipeline Environment ::::

This document details the configuration and functionality available using the eFG 'pipeline' 
environment. The eFG pipeline environment provides basic configuration and support for specific 
eFG analysis environments (e.g. arrays.env, peaks.env), along with command line access to various 
functions which provide administrative support to the Ensembl genebuild pipeline technology.

NOTE: This currently also contains functions which should be moved to peaks.env.

:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::


Contents

1    Introduction
2    Overview
3    Pre-requisites
4    The Ensembl Pipeline
5    The eFG Pipeline Environment
6	 Administrative Functions

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::


1 Introduction
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::


2 Overview
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::


3 Pre-requsites/Requirements

bioperl-1.2.3

ensembl

ensembl-functgenomics

ensembl-analysis

ensembl-pipeline

All the above ensembl packages and bioperl are available via CVS following the instructions here:
http://www.ensembl.org/info/docs/api/api_installation.html

unix/linux

bash

perl (prefereably 5.8.8)

LSF
Not strictly essential as the pipeline code will run offline, but it would take a long time to run 
these analyses using one machine.

Also see main pipeline docs (see below).

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::


4 The Ensembl Pipeline

The main documentation for the Ensembl pipeline is available here:

http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembl-doc/pipeline_docs/the_ensembl_pipeline_infrastructure.txt?view=markup&root=ensembl

In summary, the ensembl-pipeline code deals with submission of jobs to the farm dependant on a set 
of rules which describe the dependancies of each analysis step in a given pipeline. The rules and 
job tracking information are stored in a special pipeline DB which contains a few extra tables for 
the pipeline data. This can be your output DB, but for the purposes of safety, the eFG environment 
always uses a separate DB to handle this information, named using the environment name e.g.

array_pipeline_homo_sapiens_funcgen_54_37p

This also means no post pipeline clean up is required, whilst maintaining the ability to retain job 
information should it be required.

There are two configuration files which should be considered:

ensembl-pipeline/modules/Bio/EnsEMBL/Pipeline/Config/General.pm.example

This contains some general pipeline set up which is largely over-ridden by more specific parameters 
within the arrays environment. Make sure the following are correct for your installation:

	BIN_DIR  => '/software/ensembl/bin',
	LIB_DIR  => '/usr/local/ensembl/lib',


ensembl-pipeline/modules/Bio/EnsEMBL/Pipeline/Config/BatchQueue.pm

This contains the general pipeline configuration and specific configuration for how each analysis job
should be run on the farm. A template file should be available for the efg analysis environment (see
specific analysis documentation e.g. arrays.txt). It may be necessary to alter some of the general 
pipeline configuration. 

NOTE: There is most likely an efg version of the files above which will dynamically configure themselves
using the relevant environment.  However, you may find problems with the Blast.pm config not being found.  
This is due to some hardcoding withint the ensembl-pipline code and can be circumvented by simply mv'ing
the relevant file e.g.

mv ensembl-analysis/modules/Bio/EnsEMBL/Analysis/Config/Blast.pm.example ensembl-analysis/modules/Bio/EnsEMBL/Analysis/Config/Blast.pm

:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::


5 The eFG Pipeline Environment

The eFG shell environment was developed to aid administration of a local eFG instance by providing a
collection of command line functions to perform common tasks. This has been extended to provide 
basic support for different analysis pipelines, which contain more specific configuration and 
functions.

.efg			 This provides the most basic configuration and a handfull of administration functions.

pipeline.env	 Provides functions to support any analysis efg environment which utilises the 
				 Ensembl pipeline technology.

pipeline.config	 Provides deployment configuration for pipeline.env.

You will need to create pipeline.config e.g.

efg@bc-9-1-02>cd ensembl-functgenomics/scripts/environments/
efg@bc-9-1-02>cp pipeline.config_example pipeline.config

Esit this file setting data and binary paths where appropriate. All environmental variables should be 
documented or self explanatory. These should only need setting up once. It should be noted that any 
variables set in pipeline.config will be superceded by those set in any 'analysis'.config or instance 
file (see next section).

:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::


6 Administrative Functions

This list defines some generic administrative functions, more specific functions may also be 
available in the analysis environment, check 'anaylsis'.txt for more information.


DropPipelineDB	 Drops the pipeline DB, removing any record of job rule/goals, analyses or tracking.

CleanJobs		 Deletes all input_id_analysis entries for a given RunnableDB.  This can be used if 
				 an analysis was configured in error and needs to be removed before starting the 
				 pipeline.

GetRunningJobs	 Lists output files for jobs which are currently running.

GetFailedJobs	 Lists output files for jobs with the current status of FAILED, FAIL_NOT_RETY or 
				 AWOL.

ResetFailedJobs	 Resets the rety_count of failed jobs to allow the pipeline to run if they have 
				 exceeded the max DEFAULT_RETRIES/retries configuration set in BatchQueue.pm

RemoveLock		 Removes the pipeline lock from the meta table.  This can occur if a pipeline exits
				 unexpectedly.

BackUpTables	 Performs table dumps for related groups of tables e.g. arrays, xrefs or pipeline.

