:::: The Ensembl Functional Genomics Array Mapping Environment ::::

This document details the configuration and functionality available using the eFG 'arrays' 
environment, which utilises both the eFG pipeline environment and the Ensembl genebuild pipeline 
technology. The eFG environment provides configuration and command line access to various functions 
which can run the whole mapping pipeline or allow a more flexible step wise approach.

The eFG environment currently supports genomic and transcript alignment, and transcript annotation of
the following formats unless otherwise stated:

Format		  Description		Definition	   
AFFY_UTR	  Standard IVT		25mer Probesets anti-sense target
AFFY_ST		  Sense Target		25mer Probesets sense target
ILLUMINA_WG	  WholeGenome		50mer Probes sense target 
CODELINK_WG	  WholeGenome		30mer Probes sense target 
PHALANX		  OneArray			60mer Probes sense target
AGILENT		  Formats?			60mer Probes sense target

Not fully supported
LEIDEN	  Danio rerio
NIMBLEGEN_TILING

:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::


Contents

1    Introduction
2    Overview
3    Pre-requisites
4    The Ensembl Pipeline
5    The eFG Arrays Environment
6    Input Data
7    Initiating An Instance
8    Running The Pipeline
8.1  Probe Alignment
8.2  Transcript Annotation
9    Administration Functions & Trouble Shooting
10   Adding Additional Support
10.1 Adding A New Array Format
10.2 Adding A New Array
10.3 Tiling Arrays
10.4 Multi-Species Support
10.5 Multi Species Name Support (SPECIES_COMMON) 
11   Known Issues/Caveats

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::


1 Introduction

The array mapping pipeline consists of two distinct stages:

Probe Alignment

The probe sequences from a given array are aligned to the genomic sequence. If applicable (i.e. 
expression design) probes are also mapped to transcript sequences(cDNA).  These alignments are then 
mapped back to the genome and stored as gapped alignments, any ungapped alignments from this process 
are discarded as they will be represented by the genomic alignments. Transcript associations defined by 
this process are stored using a DBEntry object.

Transcript Annotation

Probe/sets are assigned to transcripts given a set of simple rules dependant on the array design.
Historically this has involved a 2KB extension of the 3' UTR sequence as it is known that the Ensembl 
gene build pipeline can be conservative in predicting UTRs. The new pipeline allows for much more 
flexible configuration taking into account species specific variation of UTRs. The current default 
strategy for annotating arrays is detailed here:

www.ensembl.org/info/docs/microarray_probe_set_mapping.html

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::


2 Overview

If required, edit the following pipeline configurations files(see section 4)

ensembl-pipeline/modules/Bio/EnsEMBL/Pipeline/Config/General.pm
ensembl-pipeline/modules/Bio/EnsEMBL/Pipeline/Config/BatchQueue.pm.efg_arrays
ensembl-analysis/modules/Bio/EnsEMBL/Analysis/Config/ImportArrays.pm 
ensembl-analysis/modules/Bio/EnsEMBL/Analysis/Config/ProbeAlign.pm 

If required, edit the following environment files(see section 5):

ensembl-functgenomics/scripts/efg.config
ensembl-functgenomics/scripts/environments/pipeline.config
ensembl-functgenomics/scripts/environments/arrays.config

If required, set up input directory structure and data (see section 6) e.g.

$DATA_HOME/HOMO_SAPIENS/AFFY_UTR/HC-G110_probe_fasta     
$DATA_HOME/HOMO_SAPIENS/AFFY_UTR/HG-U133A_probe_fasta        
...

$DATA_HOME/HOMO_SAPIENS/AFFY_ST/HuEx-1_0-st-v2.probe.fa
$DATA_HOME/HOMO_SAPIENS/AFFY_ST/HuGene-1_0-st-v1.probe.fa

etc.

Create an instance file(see section 6), e.g.

my_human_54_37p.arrays

Initialise the environment and run the alignments and annotation (see section 7).
If required, run TestImportArrays and TestProbleAlign.
Check alignment reports before running transcript annotation.

>bash
>. ensembl-functgenomics/scripts/.efg
>. my_human_54_37p.arrays password
>RunAlignments
>RunTranscriptXrefs

Check transcript annotation logs to ensure probe2transcript has completed succesfully.
See section 8 for rollback and administration functions.

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::


3 Pre-requisites/Requirements

See pipeline.txt Pre-requisites.

exonerate-v2.2.0 can be found here :
http://www.ebi.ac.uk/~guy/exonerate/

Memory

This is dependant on several factors including the number of arrays to be mapped, the
format of the arrays and the size of the genome/transcriptome of a given species. For human, which 
tends to be the largest dataset Ensembl deals with, some steps can require upto 12GB of memory. However, 
this is handled dynamically by the environment dependant on the values of:

HUGEMEM_HOME	- Directory visible to huge memory host(this may be the same as DATA_HOME)
HUGEMEM_QUEUE	- LSF queue name for huge memory host
MAX_NORMAL_MEM  - Maximum memory usage on normal memory host
MAX_HUGE_MEM    - Maximum memory usage on huge memory host

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::


4 The Ensembl Pipeline

See pipeline.txt for generic pipeline information and configuration. Note: General.pm does not need 
editing unless you want to specifically modify some of the General pipeline config.

The ensembl-analysis code deals with the actual work of a given analysis, with modules being split
into Runnables which perform the actual analysis and RunnableDBs which handle post processing and
interaction with the ensembl output DB e.g. your funcgen DB.  There are just three modules which 
perform the bulk of the array mapping analyses:

ensembl-analysis/modules/Bio/EnsEMBL/Analysis/RunnableDB/ImportArrays.pm - Runs as a single job to 
collapse arrays of the same format into a non-redundant set of probes, storing probe records in the 
output DB and writing a non-reundant probe fasta file to be used in the alignment step.

ensembl-analysis/modules/Bio/EnsEMBL/Analysis/Runnable/ExonerateProbe.pm - Performs exonerate 
alignments and generates and filters ProbeFeatures given a max 
mismatch value.

ensembl-analysis/modules/Bio/EnsEMBL/Analysis/RunnableDB/ProbeAlign.pm - Runs the ExonerateProbe 
Runnable using genomic or transcript target sequence, and performs post processing and storage of 
ProbeFeatures, UnmappedObjects and DBEntries. Dependant on the target sequence the analysis run by 
this module is refered to as ProbeAlign(genomic) or ProbeTranscriptAlign

There are three main configuration files which need to be considered:

ensembl-pipeline/modules/Bio/EnsEMBL/Pipeline/Config/BatchQueue.pm.efg_arrays

A QUEUE_CONFIG entry is defined for each analysis, one for each array format per mapping type, with 
an additional 'Submit' type analysis. This already contains config for all supported array formats 
and should be ready to use by simply stripping the efg_arrays suffix.

ensembl-analysis/modules/Bio/EnsEMBL/Analysis/Config/ImportArrays.pm 

This specifies some DB parameters along with specific configuration for each import analysis. For any given
array format, this will include a regular expression and a field order list to parse the fasta headers and 
a list of ARRAY_PARAMS which contain meta data defining each supported array for that format. 

ensembl-analysis/modules/Bio/EnsEMBL/Analysis/Config/ProbeAlign.pm

This provides analysis specific configuration for exonerate alignment and filter options.

Once the environment is initiatied, the alias 'configdir' can be used to access this directory.

All of the above have been edited such that, where appropriate, configuration of analyses can be 
accessed using environmental variables defined in a give instance of the arrays environment. This 
prevents having to manually edit these configuration modules every time an instance is run.  However, 
it may be necessary to add configuration the first time you run a particular array or format. See 
section 10 for futher details.

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::


5 The eFG Arrays Environment

See pipeline.txt for information about general pipeline.env configuration.

arrays.env		 Provides array mapping specific configuration and functions.

arrays.config	 Provides deployment configuration for arrays.env

You will need to create the arrays.config file e.g.

efg@bc-9-1-02>cd ensembl-functgenomics/scripts/environments/
efg@bc-9-1-02>cp arrays.config_example arrays.config

Edit this file setting data and binary paths where appropriate. All environmental variables should be 
documented or self explanatory. These should only need setting up once. It should be noted that any 
variables set in arrays.config will override those set in pipeline.config and efg.config, likewise 
any set in your instance file (see section 7) will override those set in arrays.config.

:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::


6 Input Data

The arrays environment assumes a particular directory structure based on the value of $DATA_HOME and 
$SPECIES e.g.

$DATA_HOME/HOMO_SAPIENS

Each species directory should contain sub-directories for each array format, where the input array 
fasta files should be located. If you do not specify GENOMIC/TRANSCRIPTSEQS paths in your instance 
file, additional sub-directories will be created here which will be used to dump the relevant 
sequence from the core DB e.g.

AFFY_UTR
AFFY_ST
ILLUMINA_WG
CODELINK
PHALANX
AGILENT
GENOMICSEQS
TRANSCRIPTSEQS

By default, the environment will map whatever formats are present in this directory, unless this is 
restricted by redefining ARRAY_FORMATS in the instance file or passing the required array format as 
a parameter to a given function. Use the alias 'arraysdir' to access this array formats root 
directory.

There are many array design file(adf) formats which are implemented by various different array vendors, 
sometimes differing even from the same vendor. Whilst the ImportArrays config(see section 10.1) goes 
someway to account for this, this only supports fasta format. To transform non-affy adfs use
the following script:

ensembl-functgenomics/scripts/array_mapping/pre_process_probe_seqs.sh

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::


7 Initiating An Instance 

To initialise an instance of the environment a small 'instance' file is sourced. An example of this is 
available here:

ensembl-functgenomics/scripts/environments/example.arrays

This contains a few variables to inform the pipeline where the output DB is.  As the eFG DBAdaptor
auto-selects a core DB if one is not already specified, it may be necessary to define some DNADB parameters.
Do this if a valid corresponding core DB in not available on ensembldb.ensembl.org, or if you want to use 
a particular core DB. Due to the multi-assembly nature of the eFG schema it is also necessary to follow
the ensembl standard naming convention for DBs i.e.

your_prefix_species_name_funcgen_RELEASE_BUILDversion > my_homo_sapiens_funcgen_54_37p

As detailed above, you may also want to add some more variables to the instance file to override those in 
arrays.config or pipeline.env.

Tip: Due to the nature of the efg arrays environemnt is it useful to create a separate 'screen' session for each.
This prevents the pipeline being interupted should disconnection from the network occur. This also provides an easy
way to manage multiple arrays environments.

Sourcing the environment will print some config setting for you to review and also change the prompt 
and window title, to inform you which instance of the environment you are using.  This is useful when running
numerous instance in parallel.  Source an instance file by sourcing the base eFG environment first (invoking bash 
and passing a dnadbpass if required):

>bash
>screen -R my_instance_file_name          # Optional
>. ensembl-functgenomics/scripts/.efg     # or just efg is you have set the alias in your .bashrc
Setting up the Ensembl Function Genomics environment...
Welcome to eFG!
>. path/to/my_instance_file.arrays dbpass dnadbpass

:::: Welcome to the eFG array mapping environment
:::: Setting up the eFG pipeline environment
:: Sourcing ARRAYS_CONFIG: /nfs/acari/nj1/src/ensembl-functgenomics/scripts/environments/arrays.config
DB:               ensadmin@ens-genomics1:homo_sapiens_funcgen_54_36p:3306
DNADB:            ensro@ens-staging:homo_sapiens_core_54_36p:3306
PIPELINEDB:       ensadmin@ens-genomics1:arrays_pipeline_homo_sapiens_funcgen_54_36p:3306
VERSION:          54
BUILD:            36
:: Setting config for homo_sapiens array mapping
: Setting ARRAY_FORMATS: AFFY_UTR AFFY_ST ILLUMINA
: Setting align types: GENOMIC TRANSCRIPT
GENOMICSEQS:      /path/to/your/genomicseqs.fasta 
TRANSCRIPTSEQS:   /path/to/your/transcriptseqs.fasta 

arrays:homo_sapiens_funcgen_54_36p>


If the output directory does not already exist, it will be created here:

$DATA_HOME/$DB_NAME

All output from the mapping pipeline will be written here. Use the alias 'workdir' top access this 
directory.

You are now ready to run the array mapping pipeline. If for some reason you have not specified the
GENOMIC/TRANSCRIPTSEQS paths, the environment will prompt you to accept previously existing files, or 
will dump the sequence for you.

The alias 'mysqlefg' and 'mysqlpipe' will allow you to connect to the output and pipeline DBs respectively.

:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::


8 Running The Pipeline

The arrays environment specifies several functions(not all documented here) available via the command 
line, these can generally be invoked with a -h option to print a help or usage message.


8.1 Probe Alignment 

There are basically 5 main steps to running the probe alignment element of the pipeline:

BuildFastas     Looks in the each array format directory and updates a cat'd file of all available 
			    fasta.

SetUpPipeline   Creates pipeline DB and imports analyses and pipeline rule conditions and goals.
			    This may show some errors as the pipeline system expects all the analyses in 
			    BatchQueue.pm to be configured in the DB, as we the efg environment has dynamic and 
			    templated configuration this can be ignored. There may also be some errors refering 
			    to the submit type analyses having no module, accumulator or input_ids not being 
				present, these can also be ignored.
			   
ImportArrays    Invokes the pipeline to import all available arrays. Followed by ImportWait

CreateAlignIDs  Creates ProbeAlign job/chunk IDs to enable parallel processing of the non-redundant 
                fasta output of ImportArrays.

SubmitAlign     Submits ProbeAlign/ProbeTranscriptAlign jobs for each format. Followed by AlignWait 
                and ProbeAlignReport.

These are accessible as separate funtions but can be invoked by a wrapper function:

RunAlignments     - Does all of the above plus some pipeline monitoring.


NOTE: The ImportWait step is a pipeline accumulator and simply waits for the ImportArrays jobs to 
finished successfully.  If these jobs fail, then ImportWait will wait indefinitely.  It is a good idea
to source the environment in a separate window and use the 'monitor' function to track the progress
of the pipeline.


When running an array or species for the first time you it is wise to test some small jobs locally before 
submitting thousands of jobs to the farm which may potentially fail. This can be done using the following functions:

TestImportArrays - Will parse the fasta file and store the results if the -w flag is set.  Note:  This will 
not be recorded as a succesful job as it is being run outside of the pipeline. Hence specifying -w may cause 
problems when trying to ImportArrays.

TestProbeAlign   - Will run a genomic or transcript alignment for a given input id.  The write flag -w will
                   cause output to be written to the DB as with TestImportArrays

To rollback the data written in these test cases or if the alignments fail for some reason and require some clean up,
use the RollbackArrays function. There are also various administrations functions which can be used in conjuctions with
RollbackArrays (see section 9). It will then be possible to either RunAlignments again or invoke the individual functions 
required.

When doing this for the first time it is likely that some configuration will have been omitted from 
one of the files in section 4.  It is also possible that some external_db entries may be missing 
from the DB. The external_db table is normally curated outside of the API and so needs populating 
separately.  If this happens, then error message will contain the correct sql to populate the table. 
This is done manually to avoid multiple inserts by parallel processes, and also as we may simply 
want to change the schema_build/version of the external DB if only the schema has changed.

Once these errors have been remedied the 'Test' functions can be abandoned in favour of RunAlignments.

If only the genebuild has been updated and the genome assembly is unchanged, it is possible to only re-run the 
ProbeTranscriptAlign step.  However, this requires the non-redundant dbID header fasta file and the array names file 
from the original import.  Simply copy or link to the appropriate files from the original import workdir e.g.

arrays_nr.AFFY_UTR.fasta
arrays.AFFY_UTR.names

It will also be necessary to restrict the align types variable to TRANSCRIPT in the instance file:

export ALIGN_TYPES='TRANSCRIPT'

Once all the probe alignments jobs have completed successfully for a given array format, or the AlignWait step has 
completed, the ProbeAlignReport function can be used to generate reports detailing the numbers of probes mapped 
for each array format. These provide an easy way to detect whether any unseen errors have occured during the 
alignment process.


NOTE: It may be apparent that the ProbeTranscriptAlign type analyses are returning only low levels
of ProbeFeatures or none at all.  This is because any alignments are discarded when they map back to 
the genome as ungapped to avoid redundancy of feature between the ProbeAlign and ProbeTranscriptAlign
analyses.

8.2 Transcript Annotation

This depends on the array names file for each format which is normally written during the import step.  
This should always be present as running the transcript annotation implies the gene build has been 
updated, which also means a ProbeTranscriptAlign step is required and hence a non-redundant probe fasta 
file, which is generated by the import step.

If for some reason it is not present then it will need to be created, using the relevant array names from the array table:

arrays:danio_rerio_funcgen_54_8>more arrays.AGILENT.names 
G2519F
G2518A

The RunTranscriptXrefs function handles submitting the probe2transcript.pl jobs to the farm.  By default this will
be done for all available array formats, but it is possible to specify additional paramaters to change this behaviour, 
try -h for some usage information.

If there are already transcript annotations present in the xref schema, the healthcheck mode will cause an error and 
exit. To delete the existing xrefs simply specify the -d delete flag or RollbackArrays.

The default settings for the probe2transcript.pl script are set using the PROBE2TRANSCRIPT_PARAMS variable e.g.

export PROBE2TRANSCRIPT_PARAMS='--calculate_utrs --utr_multiplier 1'

By default, this script requires at least 50% of probes in a given probeset to match with a maximum of 1bp overlap 
mismatch.  These parameters can be set and changed in the instance file, dependant or the requirements for a given 
species(i.e. no utr extension for bacteria) or a given array format(which might not be recognised by probe2transcript 
and therefore require extra configuration). To see the full range of parameters available run: 

ensembl-functgenomics/scripts/array_mapping/probe2transcript.pl -help
or
perldoc ensembl-functgenomics/scripts/array_mapping/probe2transcript.pl

This step can also be done by running the probe2transcript.pl script directly on a format by format basis.
However, this will not perform the healthchecks and back ups which are part of the RunTranscriptXrefs function.
Hence, this is not advised unless it is not possible to use RunTranscriptXrefs for some reason.

Each probe2transcript.pl job will generate four output files per array format e.g.

AFFY_ST_probe2transcript.err
AFFY_ST_probe2transcript.out 
homo_sapiens_funcgen_54_36p_AFFY_ST_probe2transcript.log
homo_sapiens_funcgen_54_36p_AFFY_ST_probe2transcript.out

The first two are LSF output, which can largely be ignored unless it appears the job has failed in 
some way. The later two are a log of the job progression and a record of each ProbeFeature which has
been considered for mapping.  This information is also stored as UnmappedObject or DBEntries, so it 
is slightly redundant, but is handy for quickly looking up how a ProbeFeature was processed.

As this step is not parallelised it can take a long time, especially when there are many arrays within
a given format.  It is useful to monitor the progression of the job by tailing the log e.g.

>tail -f homo_sapiens_funcgen_54_36p_AFFY_ST_probe2transcript.log
::      Calculating default UTR lengths from greatest of max median|mean - Wed Mar 11 15:57:38 2009
::      Seen 40728 5' UTRs, 22552 have length 0
::      Calculated default unannotated 5' UTR length:   278
::      Seen 40500 3' UTRs, 22780 have length 0
::      Calculated default unannotated 3' UTR length:   1083
::      Finished calculating unannotated UTR lengths - Wed Mar 11 16:21:31 2009
::      Caching arrays per ProbeSet - Wed Mar 11 16:21:31 2009
::      Performing overlap analysis. % Complete:
::      0 ::    1 ::    2 ::    3 ::    4

Once the overlap analysis has completed 100%, the DBEntries and remaning UnmappedObjects will be written.
Finally a summary report of transcript annotation will be written which is useful for making sure everything
is as it should be e.g.

::      Updating 0 promiscuous probesets - Thu Mar 12 11:56:19 2009
::      Loaded a total of 983575 UnmappedObjects to xref DB
::      HuEx-1_0-st-v2 total xrefs mapped:      1568554
::      HuGene-1_0-st-v1 total xrefs mapped:    78399
::      Mapped 61502/63280 transcripts  - Thu Mar 12 11:56:22 2009


::      ::      Top 5 most mapped transcripts:  ::      ::

::      ENST00000369202 mapped 22 times
::      ENST00000369198 mapped 22 times
::      ENST00000369326 mapped 19 times
::      ENST00000369194 mapped 16 times
::      ENST00000321694 mapped 13 times


::      ::      Top 5 most mapped ProbeSets(inc. promoscuous):  ::      ::

::      1400062 mapped 70 times
::      585732 mapped 67 times
::      1579351 mapped 60 times
::      888066 mapped 55 times
::      379672 mapped 55 times


::      ::      Top 5 most mapped ProbeSets(no promiscuous):    ::      ::

::      1400062 mapped 70 times
::      585732 mapped 67 times
::      1579351 mapped 60 times
::      888066 mapped 55 times
::      379672 mapped 55 times


::      ::      Completed Transcript ProbeSet annotation for HuEx-1_0-st-v2 HuGene-1_0-st-v1    ::      :: - Thu Mar 12 11:56:27 2009

::      Logging complete Thu Mar 12 11:59:07 2009.


Checking that every array format log file ends as above will ensure that all the transcript annotation
jobs have completed succesfully.

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

9 Administration Functions & Trouble Shooting


See also the corresponging section in pipeline.txt for more general pipeline functions.

Here is a summary of some helpful aliases.

Add this to your .bashrc for convinient initiation of base environment

efg='. ~/src/ensembl-functgenomics/scripts/.efg'

After the base environment is initialised 'efg' and other aliases are defined as:

efg	    	- cd's to root efg code dir
efgd        - cd's to root data dir 
workdir		- cd's to working database dir
configdir	- cd's to analysis config dir
configmacs	- Opens xemacs in configdir
arraysdir	- cd's to the root array formats dir for a given species
			   mysqlefg	- Connects to the output DB
mysqlpipe	- Connects to the pipeline DB
mysqlcore   - Connects to the core/dna DB
monitor		- Prints a summary of the pipeline status. Initialise a fresh environment to track a running 
			  pipeline.

This section could also be called 'What to do when things go wrong'.  Inevitably there will be times 
when jobs fail. It is important to understand what has happened during the failure to take the 
correct course of action before restarting the pipeline, i.e. has any output been written? 

The import stage will fail if an array had already been recorded as IMPORTED, which is a sign that you either 
need to rollback or consider whether you actually need to import it. The alignment stage is not able
to record import success for individual chunks, so it is very important that you are sure a 
particular job has not written any output. The job output and error files should be able to provide
this information. If it seems like the jobs has died before any processing has taken place, then it is 
safe to just restart. If it seems like the job has fallen over half way through, then a rollback is 
required.  This is due to the fact that currently, UnmappedObjects are written as they are created, 
rather than being delayed as per the ProbeFeatures. The error file will have warnings if unmapped 
objects or other otuput has been written, simply search for the following:

		UnmappedObject
		Writing ProbeFeatures

If these are not present then it is safe to resubmit this job without rolling back(see below).

ProbeAlignReport

Running this shell function will generate alignment reports for each of your array formats. Output
can be found here:
	
	$WORK_DIR/ProbeAlign.'ARRAY_FORMAT'.report


probe2transcript.pl output

See section 8.2


UnmappedObjects

These are stored at various points throughout the array mapping pipeline to capture details 
of failed mappings. Investigating the unmapped_object and unmapped_reason tables can often
reveal information about why the alignments or transcript annotations failed.


RollbackArrays

This handles all levels of rollback and will fail if there are data dependencies which are not 
accounted for i.e. it will not let you orphan records in the database. There are several levels of 
rollback available using the -m option:

probe					 This is the default level, and will only rollback the data stored during 
						 the import step i.e. array, array_chip, probe, probeset and any status 
						 records which may have been stored

ProbeAlign				 This will rollback any records stored by the genomic alignment analysis, 
						 this includes any UnmappedObjects which may have been stored.

ProbeTranscriptAlign	 As above but for transcript alignments.

probe_feature			 Both the above alignment analyses.

probe2transcript		 This will rollback all the output from the transcript annotation step.

Use the -h option for more information.

Dependant on the nature of the job failure, it may also be necessary to perform some job clean up 
using the functions listed in pipeline.txt.


:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::


10 Adding Additional Support


10.1 Adding a new format:

Assuming the array model is comparable to something that is already supported e.g. single probe or 
probeset, sense/anti-sense target.

Create your input data dir e.g.

$EFG_DATA/HOMO_SAPIENS/NEW_FORMAT

The pipeline BatchQueue.pm file is now configured dynamically and should not need editing.

Edit ensembl-analysis/modules/Bio/EnsEMBL/Analysis/Config/ImportArrays.pm 

Add a new ARRAY_CONFIG entry named after your import analysis e.g.

IMPORT_NEW_FORMAT_ARRAYS

Using an existing one a template, being careful to maintain any environment variables. Redefine the 
input ID regular expression(IIDREGEXP) and the input field order(IFIELDORDER) to enable parsing of 
array fasta file. Add any array meta data to the ARRAY_PARAMS hash as described below.


Edit ensembl-analysis/modules/Bio/EnsEMBL/Analysis/Config/ProbeAlign.pm 

Add a new PROBE_CONFIG entry for each of the alignment analyses e.g.

NEW_FORMAT_PROBEALIGN
NEW_FORMAT_PROBETRANSCRIPTALIGN

Use an existing comparable format as a template, being careful to maintain any environment 
variables. Edit other paramters as required:

HIT_SATURATION_LEVEL - Number of allowed alignments before probe is called 'promiscuous', is
                       disregarded and stored as UnmappedObject.

MAX_MISMATCHES       - Maximimum number of sequence mismatches allowed in an alignment.

OPTIONS              - Exonerate options tuned with respect to probe length and MAX_MISMATCHES i.e.
                       seedrepeat = MAX_MISMATCHES + 1
                       dnawordlen = probe length/seedrepeat (rounded down)
                       Alternatively for longer probes it may be better to set a -perc cutoff and tune
                       the MAX_MISMATCHES parameters to ensure that all possible mismatches are caught
                       for the longest probe in your array,

Finally, add the format name to VALID_ARRAY_FORMATS in arrays.config.

WARNING: The probe2transcript.pl script uses hardcoded format configuration for supported formats.
Hence, running the RunTranscriptXrefs step will fail unless the probe2transcript.pl code is edited 
directly to add the necessary format config(see . This can be overcome by running the probe2transcript.pl 
script directly using the following parameters:

	   -linked_arrays
	   -probeset_arrays
	   -sense_interrogation


10.2 Adding A New Array 

To add a new array to an existing format, simply add the array meta data to the relevant format 
specific ARRAY_PARAMS hash in: 

ensembl-analysis/modules/Bio/EnsEMBL/Analysis/Config/ImportArrays.pm 

Due to the nature of probes being redundant across arrays of the same format, it will be necessary to
rollback and re-run all the arrays for the given format e.g.

>RollbackArrays -f YOUR_FORMAT -d
>DropPipelineDB
>RunAlignments -f YOUR_FORMAT
etc..


10.3 Tiling Arrays

As eFG was initially designed as a data storage and analysis platform for ChIP-Chip analysis, it is 
possible to use the arrays environment to map tiling arrays which have been import via the eFG import 
API.  To do this it is necessary to specify the -fasta flag when running parse_and_import.pl. This
will dump a non-reundundant fasta file of the array used in a given experiment. Use this file as the 
input file to the alignment step by moving or linking it to the working directory with the 
appropriate name e.g.

arrays_nr.NIMBLEGEN_TILING.fasta

To restrict the analyses to genomic alignments only, add this to the instance file:

export ALIGN_TYPES='GENOMIC'

As the import has already taken place it is necessary to run the rest of the alignments stage in a
step wise manner e.g.

SetUpPipeline
CreateAlignIds
SubmitAlign
AlignWait
ProbeAlignReport		#This may take some time if it is a whole genome tiling array


10.4 Multi-Species Support(EnsemblGenomes)

The arrays environment and pipeline code has been developed to support the new multi-species aspect
of the EnsemblGenomes DBs. This has a few impacts on setup and configuration of the arrays 
environment.

To enable multi-species support for funcgen databases add the following to 
either arrays.config or the instance file.

export MULTI_SPECIES=1

To turn on core database multi-species support then export the following:

export DNADB_MULTI_SPECIES=1

If multi-species is specified for a core database and no species is found the
environment will exit. However if you specify it for the funcational genomics
database the environment will insert a multi-species entry for the current 
species name if no species ID can be found in the database.

As species are grouped into 'collection' DBs, this means that they will most 
likely share at least some if not all of the arrays to be mapped. To 
reduce redundancy in the input data, fasta files should be stored in 
a collection directory and specific species directories soft linked to 
these e.g

$DATA_HOME/STAPH_COLLECTION
$DATA_HOME/STAPHYLOCOCCUS_AUREUS -> $DATA_HOME/STAPH_COLLECTION/

If there are differences between the arrays to be mapped within the 
collection, then more exhaustive individual file links will need to be set 
up in a standard input directory for that species.

When running the analysis using MULTI_SPECIES you will only have to import the
array once so runs of other species in the same schema require you to
use RunAlignments -s. If running the pipeline this way then each and we assume
our first species is called e_coli & our second s_flex then our file system 
would look like:

$DATA_HOME/e_coli/ARRAY_TYPE/arrays.ARRAY_TYPE.fasta*

$DATA_HOME/s_flex/ARRAY_TYPE/symbolic_link_to_*

Sequence files i.e. transcripts & genomic are still dumped to the 
$ARRAYS_HOME/GENOMICSEQS and $ARRAYS_HOME/TRANSCRIPTSEQS directories.

See known issues for race condition problems which can occur when running
multi-species databases from the Ensembl Bacteria project.

10.5 Multi Species Name Support (SPECIES_COMMON) 

SPECIES_COMMON is an attribute which when set will be used for file names. This
is not a specific multi-species support attribute but more support for
Ensembl Genomes & the various strain names that can appear which are not
file system friendly. If you have a strain: 
 
e.g. Escherichia coli O1:K1 / APEC 

This can be changed to a more filesystem friendly name: 

e.g. E_coli_O1_K1_APEC 

This conversion must be done by yourself but the usage of it is automatic 

:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::


11 Known Issues/Caveats 

- Hardcoded perl

Some of the main ensembl-pipeline scripts contain hardcoded perl paths which may need to be edited.


- Updating ProbeTranscriptAlignments

Given a new genebuild it is not possible to update just the ProbeTranscriptAlignments without the 
original arrays_nr.FORMAT.fasta file. This could be reconsituted using the original sources but at 
present requires a complete re-import.


- MAX_HIT_SATURATION

If a probe exceeds the maximum number of mappings(HIT_SATURATION_LEVEL) for a particular mapping type, 
then the hits are filtered for perfect matches.  If the remaining perfect match hits do not exceed 
the HIT_SATURATION_LEVEL then these are kept and the mismatch hits are disregarded. 
HIT_SATURATION_LEVELs are currently not calculated cumulatively over mapping types.


- Alt-trans ProbeFeature duplication

Some alternative transcripts may give rise to duplicate ProbeFeatures from the ProbeTranscriptAlign
analysis. This is as yet unseen and will not effect the transcript annotations.


- Tiling array support 

This is not yet fully implemented.


- ProbeFeature/Overlap mismatches

The probe2transcript.pl script needs some work around resolving some rare cases concerning 
ProbeFeature mismatches and overlap mismatches.

When populating the pipeline for species with coordinate systems other than
chromosome e.g. super-contig or plasmid a race condition can occur in the
pipeline causing processes to fail as each one attempts to insert the missing
coordinate system(s). To avoid this run the following script before the 
pipeline:

$EFG_PERL $EFG_SCRIPTS/import/import_coord_systems.pl

This loads config from Bio::EnsEMBL::Funcgen::Config::ProbeAlign & imports
the missing coordinate systems. Eventually this import will be done by
the API in a safe manner. The issue does not affect sequence regions insertions.
