:::: Introduction to Ensembl Functional Genomics ::::


:: Installation Requirements

Firstly, follow the instructions for installation of the Ensembl core, functional genomics and BioPerl packages available here:

http://www.ensembl.org/info/docs/api/api_installation.html


The next step is to set up the eFG-specific requirements. The list below is exhaustive; not every item is necessary if you do not intend to use the full functionality of eFG. Install the following as required:


RDBMS Platform:

	MySQL 5.1 Community Server
	http://dev.mysql.com/downloads/mysql/5.1.html
	Version 5.1 is required as some tables utilise table partitioning


Required for DB Access - These may already be present on your system:

	DBI
	http://search.cpan.org/~timb/DBI/DBI.pm

	DBD::mysql
	http://search.cpan.org/dist/DBD-mysql/	 


Required for ChIP-Chip normalisation, QC reports and analysis:

	R
	http://cran.r-project.org/

	RMySQL - R package
	http://cran.r-project.org/

	BioConductor - R platform
	http://www.bioconductor.org/download

	Ringo - BioConductor package
	http://bioconductor.org/packages/2.1/bioc/html/Ringo.html


Required for ChIP-Chip meta data validation and replicate definition:

	MAGEstk - Bio::MAGE software tool kit.
	Pre-requisite for Tab2MAGE.
	http://www.mged.org/Workgroups/MAGE/magestk.html

	Tab2MAGE & MAGE.dtd
	Tab2MAGE is available from http://sourceforge.net/project/showfiles.php?group_id=120325
	MAGE.dtd v1.1 is available from http://mged.sourceforge.net/software/docs.php
	The MAGE.dtd should be placed in the root data directory ($EFG_DATA).


Required for DAS functionality:

	 Bio-Das-Proserver
	 svn co https://proserver.svn.sf.net/svnroot/proserver/trunk Bio-Das-ProServer
	 Install normally or simply unpack in the $SRC directory (see below) and modify $PERL5LIB accordingly

	 POE-1.006
	 http://search.cpan.org/~rcaputo/POE-1.006/

	 POE-Test-Loops-1.005
	 http://search.cpan.org/~rcaputo/POE-Test-Loops-1.005/lib/POE/Test/Loops.pm

	 Config-IniFiles-2.52
	 http://search.cpan.org/~shlomif/Config-IniFiles-2.52/lib/Config/IniFiles.pm

	 Readonly-1.03
	 http://search.cpan.org/~roode/Readonly-1.03/Readonly.pm


NOTE: It is not necessary to set up the eFG environment if you simply want to query a remote eFG DB, e.g. the eFG DBs available at ensembldb.ensembl.org. Only the ensembl-functgenomics package is required for this.


:: Configuration

The eFG system uses shell environments to set global variables and help perform common tasks.
The base 'efg' environment is split into two files, one containing functions and constants (efg.env) and another containing installation configuration (efg.config). To enable initialisation of the efg environment, carry out the following:



1 Add the contents of the following file to the end of your .bashrc:

  ensembl-functgenomics/scripts/environments/efg/efg.bashrc


2 Create the configuration file by copying the example file and editing accordingly:

  efg@bc-9-1-02>cd ensembl-functgenomics/scripts/environments/
  efg@bc-9-1-02>cp efg.config_example efg.config
  efg@bc-9-1-02>more efg.config

  # Experimental group data
  export EFG_GROUP='eFG'                          
  export EFG_LOCATION='Hinxton, Cambridge'        
  export EFG_CONTACT='your@email.ac.uk'           

  #Code/Data Directories
  export EFG_SRC=$SRC/ensembl-functgenomics       #eFG source directory
  export EFG_SQL=$EFG_SRC/sql                     #eFG SQL
  export EFG_DATA=$HOME/data/efg                  #Data directory.

  export EFG_PERL=/software/bin/perl              
  export PATH=$PATH:$EFG_SRC/scripts              #eFG scripts directory
  export PERL5LIB=$EFG_SRC/modules:$PERL5LIB      #Update PERL5LIB. EDIT add $SRC/ensembl and $SRC/Bio-Das-Proserver/lib  if required


  #Your efg DB connection params
  export EFG_WRITE_USER="write_user"                  
  export EFG_READ_USER="read_user"                    
  export EFG_HOST='efg_host'
  export EFG_PORT=3306                                
  export MYSQL_ARGS="-h${EFG_HOST} -P${EFG_PORT}"         
  #pass always supplied on cmd line e.g. mysqlw -p your password

  #Your ensembl core DB connection params, read only
  export DNADB_USER='ensro'                        #EDIT if required e.g. anonymous
  export DNADB_HOST='ens-staging'                  #EDIT if required e.g. ensembldb.ensembl.org
  export DNADB_PORT=3306                           #EDIT if required
  export DNADB_SCRIPT_ARGS="-dnadb_host $DNADB_HOST -dnadb_user $DNADB_USER -dnadb_port $DNADB_PORT"

  #DAS params
  export EFG_DAS_CONFIG=$EFG_SRC/config/DAS       #DAS config dir where pid and config files are written
  export EFG_DAS_HOST=$(hostname -f)              #DAS server host, set to $(hostname -f) or EDIT?
  export EFG_DAS_PORT=9876                        #Default DAS port
  export EFG_DAS_NAME=efg                         #DAS instance name, can redefine this and port dynamically to enable multiple servers
  export EFG_DAS_HOME=$SRC/Bio-Das-ProServer      #DAS code dir, must be ProServer!

  #Default norm and analysis methods
  export NORM_METHOD='VSN_GLOG'                   #EDIT if required e.g. T.Biweight, Loess

  #R config
  export R_LIBS=${R_LIBS:=$SRC/R-modules}
  export R_PATH=/software/bin/R-2.7.1-dev
  export R_FARM_PATH=/software/bin/R-2.7.1
  export R_BSUB_OPTIONS="-R'select[type==X86_64 && mem>6000] rusage[mem=6000]' -q bigmem"
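
Before entering the environment it can be useful to confirm that the settings above are actually defined. The following is a minimal sketch only: check_efg_config is a hypothetical helper, not part of the eFG environment, and the list of variable names passed to it is simply taken from the example configuration above.

```shell
#!/bin/sh
# Hypothetical helper (not part of eFG): report which of the named
# variables are unset or empty. Returns non-zero if any are missing.
check_efg_config() {
    missing=0
    for name in "$@"; do
        # Indirect lookup of the variable whose name is held in $name
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "Missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example usage, after sourcing efg.config:
# check_efg_config EFG_SRC EFG_DATA EFG_HOST EFG_PORT EFG_WRITE_USER \
#     EFG_READ_USER DNADB_HOST DNADB_USER DNADB_PORT \
#     || echo "Please edit efg.config"
```

The variable names are passed as arguments so that the same check can be reused for the DAS or R settings if those parts of eFG are in use.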


:: Initialisation

Once the configuration is complete, simply type 'efg' to enter the environment. This prints a brief welcome message and resets the prompt to show that you are in the eFG environment:

user_name@host>efg
Setting up the Ensembl Function Genomics environment...
Welcome to eFG!
efg@host>

The title bar of your terminal should also change to show your user name and the current working directory:

	user_name@host:/your/current/working/dir

You will now have access to some useful aliases and helper functions. Help for most of the functions can be accessed by specifying the -h option e.g.

efg@host> CreateDB -h
usage: CreateDB -d dbname -p password -s(pecies) e.g. e.g homo_sapiens [ -f(orce drop database) -t(skip type import)]

It is desirable to maintain the standard Ensembl nomenclature for a database and simply prefix it with some descriptive tag e.g. my_homo_sapiens_funcgen_47_36i

Failure to keep the species name, db type and schema version information intact may cause problems in dynamically detecting the correct core DB to use.
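
To illustrate the naming convention, the sketch below checks that a DB name contains the funcgen type tag followed by the schema and assembly versions, and prints the two versions. parse_efg_dbname is a hypothetical helper written for this example; it is not how eFG itself detects the core DB.

```shell
#!/bin/sh
# Hypothetical helper (not part of eFG). Expects names of the form
#   [prefix_]species_funcgen_SCHEMA_ASSEMBLY
# e.g. my_homo_sapiens_funcgen_47_36i
parse_efg_dbname() {
    dbname=$1
    case "$dbname" in
        *_funcgen_[0-9]*_[0-9]*) ;;
        *) echo "Unexpected DB name: $dbname" >&2; return 1 ;;
    esac
    suffix=${dbname##*_funcgen_}           # e.g. 47_36i
    echo "schema_version=${suffix%%_*}"    # text before the first underscore
    echo "assembly_version=${suffix#*_}"   # text after the first underscore
}

parse_efg_dbname my_homo_sapiens_funcgen_47_36i
```

A name that drops the type tag or the version suffix is rejected with a non-zero return status.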

All of the efg environments make use of command line functions and aliases; type EFGHelp or aliases to list the functions or the aliases respectively. If a function exits for some reason, it will also exit the bash environment. Simply rerun the initialisation steps to get back into the environment.
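
One way to avoid losing the environment is to run a function inside a subshell, so that any 'exit' ends only the subshell. This is a general shell technique rather than an eFG feature, and it assumes the function does not need to change variables or the working directory of the calling shell, since such changes are lost when the subshell ends. failing_function below is a hypothetical stand-in for any environment function.

```shell
#!/bin/sh
# Hypothetical stand-in for an environment function that calls 'exit'.
failing_function() {
    echo "something went wrong" >&2
    exit 3
}

# The parentheses run the call in a subshell; 'exit' ends only the subshell.
status=0
( failing_function ) 2>/dev/null || status=$?
echo "function exited with status $status; the calling shell is still running"
```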


:: Data Import

There are various types of data import, export and transformation which can be performed using the scripts available in the scripts directory. These encompass simple cell and feature type imports, through to array design and full experiment imports. Most of the more common tasks have template shell scripts with required parameters set and others left for editing. Here follows a list of the main types of tool script:

	Feature/CellType/Analysis import
    This is done explicitly to avoid propagation of duplicated types through the database.
    CreateDB automatically runs this for all the standard types available in ensembl-functgenomics/scripts/import/types
    See: run_import_type.sh

	Array design import
    Importing arrays prior to receipt of experimental data can be beneficial. During the initial import of an array an exhaustive process is undertaken to resolve replicate probes and build a cache external to the DB. Depending on the size of the array, this process can take quite a few hours. The benefit of this process, apart from the correct handling of replicate probes, is that subsequent imports based on the given array are much faster. eFG also supports import for custom pre-production arrays to facilitate analysis and design turnover during development.
    See: run_import_design.sh and run_import_array_from_fasta.sh

	ChIP-Chip experiment import
    eFG supports import of experiments using the Nimblegen tiling array and Sanger PCR array platforms. See: run_NIMBLEGEN.sh and run_Sanger.sh 

	ChIP-Seq/Feature level import
    It is possible to perform a direct feature import. This accommodates pre-analysed data or technologies which are not fully supported by eFG. e.g. Hitlists, Wiggletracks, Short reads.
    See: run_import_BED.sh, run_import_hitlist.sh and run_import_wiggle.sh

	Feature mapping
    eFG utilises the assembly mapping information in the core DB to remap features between genome assemblies.
    See: run_remap_features.sh and run_project_feature_set.sh
    Note: This has been shown to give spurious results when remapping probe level features, dependent on the nature of the assembly change. A more exhaustive full genomic mapping is also possible via the eFG array mapping environment.

	Data export
    Various flavours of data export including GFF and custom formats.
    See: run_dump_gff_features.sh and get_data.pl

	Data rollback
    Removes part or all of an experimental import and its associated data.
    See: roll_back_experiment.pl


:: Importing an experiment

Prior to running your first experiment import, you will need to make sure that the Feature/CellType and Analysis used in the experiment are present in the DB. CreateDB will automatically load the default types available in ensembl-functgenomics/scripts/import/types.  If your 'types' are not present then they must be imported using run_import_type.sh:

efg@bc-9-1-02>more run_import_type.sh 
#!/bin/sh

PASS=$1
shift

$EFG_SRC/scripts/import_type.pl\
        -type FeatureType\
        -name H3K4me3\
        -dbname your_homo_sapiens_funcgen_48_36j\
        -description 'Histone 3 Lysine 4 Tri-methyl'\
        -class HISTONE\
        -pass $PASS

FeatureType names should correspond to a recognised ontology or nomenclature where appropriate e.g. Brno nomenclature for histones. The class parameter is not required for CellType imports.


To import an experiment you must first create an input directory for the array vendor and your experiment e.g.

mkdir $EFG_DATA/input/homo_sapiens/NIMBLEGEN/EXPERIMENT_NAME
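
Note that a plain mkdir fails if the species or vendor directory does not yet exist. The sketch below creates the whole path in one step. make_input_dir is a hypothetical helper for illustration only, and it assumes $EFG_DATA has been set by the eFG environment.

```shell
#!/bin/sh
# Hypothetical helper (not part of eFG).
# Usage: make_input_dir SPECIES VENDOR EXPERIMENT_NAME
make_input_dir() {
    # Abort with a message if EFG_DATA is unset or empty
    dir="${EFG_DATA:?EFG_DATA is not set}/input/$1/$2/$3"
    # -p creates any missing parent directories and is a no-op if the
    # directory already exists
    mkdir -p "$dir" && echo "$dir"
}

# Example usage:
# make_input_dir homo_sapiens NIMBLEGEN EXPERIMENT_NAME
```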


:: Importing a ChIP-Chip experiment

The eFG system currently expects only one experiment per input directory. If your data contains more than one experiment, you will need to split the files up, recreating any meta files accordingly e.g. DesignNotes.txt, SampleKey.txt. A Nimblegen experiment import can be done using the appropriate run script:

efg@bc-9-1-02>more run_NIMBLEGEN.sh
#!/bin/sh

PASS=$1
shift

#Parameter notes (a comment cannot follow a trailing backslash):
# -name           Name of the data directory
# -format         Array format
# -vendor         Technology vendor/import parser
# -array_set      Flag to treat every chip/slide as part of one array
# -cell_type      e.g. U2OS
# -feature_type   e.g. H3K4me3
# -version        DB version the registry should load
# -fasta          Fasta dump flag, useful for remapping
# -location, -contact, -group             Your experimental group details
# -dbname, -user, -pass, -port, -host     eFG DB details
# -recover        Enables import/rollback when experiment is already present
#                 e.g. 2nd stage MAGE import or failed/partial imports

$EFG_SRC/scripts/import/parse_and_import.pl\
       -name 'DVD_OR_EXPERIMENT_NAME'\
       -format tiled\
       -vendor NIMBLEGEN\
       -array_name "DESIGN_NAME"\
       -array_set\
       -cell_type U2OS\
       -feature_type H3K4me3\
       -version 55\
       -assembly 37\
       -fasta\
       -location Hinxton\
       -contact 'your@email.com'\
       -group efg\
       -species homo_sapiens\
       -dbname 'your_homo_sapiens_funcgen_55_37'\
       -user write_user\
       -pass $PASS\
       -port 3306\
       -host dbhost\
       -tee\
       #-recover

There are many additional optional parameters; for full documentation specify -help or -man.

Running the above script will perform a preliminary import, which involves validation checks and import of some basic experiment meta data. The meta data gleaned from the import parameters and available vendor files is automatically populated within a tab2mage file located in the output directory e.g. E-TABM-EXPERIMENT_NAME.txt. The import will then stop to allow manual curation of the tab2mage file. Due to the lack of comprehensive meta data associated with an experiment DVD, it is necessary to inspect, correct and annotate the tab2mage file where possible, paying close attention to the definition of the replicate sets. Failure to check this properly may result in permanent loss of meta data, an inability to submit to ArrayExpress and a corrupted import which ultimately may prevent any further analysis.

There are three main areas to be addressed, most of which may have been automatically populated. Fields which need attention are marked with three question marks e.g. ???

   1. Experiment section: This includes information about the laboratory where the experiment was performed along with some fields which are required for ArrayExpress submission.
   2. Protocol section: This is automatically populated from the defaults present in the vendor definitions package e.g. NimblegenDefs.pm. These should be checked and edited accordingly, propagating the changes to the Hybridisation section.
   3. Hybridisation section: This is possibly the most important section to scrutinise, as any inconsistencies here may result in import failure or anomalous "ResultSet" generation. This section provides for capturing information about multi cell or feature type experiments, as well as the relationship between individual slides or, in eFG parlance, "ExperimentalChips". The submitter should cross reference this section with the meta files supplied by the vendor e.g. SampleKey.txt, DesignNotes.txt.

      The following fields should be checked explicitly:

      tab2mage field                              Value
      BioSource                                   CellType or specific source sample name if known.
      Sample                                      Biological replicate name.
      Extract                                     Technical replicate name.
      LabeledExtract                              Control/Experimental channel sample.
      Immunoprecipitate                           Description of IP, blank for control channel.
      Hybridization                               Description of Hybridisation.
      BioSourceMaterial                           e.g. cell, tissue MGED?
      Dye                                         e.g. Cy3/5
      BioMaterialCharacteristics[StrainOrLine]
      BioMaterialCharacteristics[CellType]
      FactorValue[StrainOrLine]
      FactorValue[Immunoprecipitate]              e.g. anti-H3ac antibody

      Some standard naming formats have been put in place to aid validation. Primarily these are the replicate names, which have been given the BRN and TRN denominations for biological replicates and technical replicates respectively e.g.

      EXPERIMENT_BR1_TR1
      EXPERIMENT_BR1_TR2
      EXPERIMENT_BR2_TR1

      Here we see two biological replicates for "EXPERIMENT", the first having two technical replicates and the second having just one. The other naming convention adopted is that of the "FactorValue[Immunoprecipitate]" field. This must follow the format of the example above i.e. anti-"FeatureType Name" antibody

      This is parsed during validation and used to store chip/slide level information. Replicates with mismatching feature type names in this field will fail validation.

      Note: The autogenerated tab2mage file is simply a template, you may add further valid fields to any section of the file, or supply your own tab2mage file. However, the validation checks will still be necessary for correct import, therefore it is advised to follow the standard naming rules given above.
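
The two naming conventions above can be checked quickly before rerunning the import. The functions below are hypothetical helpers written for this example and are not the validation performed by the eFG importer or by tab2mage. They assume replicate names end in _BR<number>_TR<number> and that the immunoprecipitate value has the form anti-<FeatureType name> antibody.

```shell
#!/bin/sh
# Hypothetical helpers (not part of eFG or tab2mage).

# Succeeds if the name looks like EXPERIMENT_BR1_TR2
is_replicate_name() {
    printf '%s\n' "$1" | grep -Eq '^.+_BR[0-9]+_TR[0-9]+$'
}

# Prints the FeatureType name from "anti-<FeatureType> antibody",
# or nothing if the value does not follow that format
ip_feature_type() {
    printf '%s\n' "$1" | sed -n 's/^anti-\(..*\) antibody$/\1/p'
}

# Example usage:
# is_replicate_name EXPERIMENT_BR1_TR1 && echo "replicate name OK"
# ip_feature_type 'anti-H3ac antibody'
```

Comparing the output of ip_feature_type across all replicates of one experiment is a simple way to spot mismatching feature type names.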


Once the tab2mage file has been edited, the second stage of the import can commence. Simply uncomment the -recover option and rerun the import script as above.  The tab2mage package will run an extensive validation step on the tab2mage file (involving lots of warnings) before writing the MAGE XML to the output directory. Unfortunately it is not possible to reliably capture a true error from this process, so a secondary validation of the XML is performed by the import script.  Unless you are familiar with tab2mage files, it is likely this step will fail a few times before you manage to get the data correct. Look out for warning messages from the XML validation, which may give you useful information to help correct the data.

On successful validation of the tab2mage and XML files, the main data import will begin. If the array design for a given experiment is not present, it will be imported prior to loading the raw experimental data. Depending on the size and nature of the array being loaded, this stage can take several hours.  However, once the design is loaded, subsequent experiment imports using the same design will skip this step. Once loading is complete, a job will be submitted to the compute farm to run the chosen normalisation method.

The eFG analysis pipeline can now be used to perform peak calling using the imported experiment data.

:: Importing a ChIP-Seq experiment

ChIP-Seq imports work slightly differently from ChIP-Chip imports, in that alignment (e.g. MAQ) and peak calling (e.g. SWEmbl) analyses are performed prior to import into an eFG DB.  This avoids having to store huge amounts of raw read data, which might be better suited to an archive such as the ENA. The output of the analysis pipeline is a set of peak calls in the form of BED files. The analysis pipeline can load these directly into an eFG DB, or alternatively any BED file can be loaded using the following script:

efg@bc-9-1-02> more run_import_BED.sh
#!/bin/sh

PASS=$1
shift

#$* is an optional list of bed file paths
#Contents of input_dir used if not supplied
#
#Parameter notes (a comment cannot follow a trailing backslash):
# -experimental_set   Used to name the ExperimentalSet and the FeatureSet
# -ucsc_coords        Input uses UCSC coordinates i.e. starts at 0 not 1
# -feature_analysis   e.g. SWEmbl etc.

$EFG_SRC/scripts/import/parse_and_import.pl\
    -name             LMI_CD4_H3K4ac\
    -format           SEQUENCING\
    -parser           Bed\
    -vendor           SOLEXA\
    -location         Hinxton\
    -contact          your_name@email_host.ac.uk\
    -group            your_research_group\
    -species          homo_sapiens\
    -experimental_set LMI_CD4_H3K4ac\
    -ucsc_coords\
    -assembly         37\
    -host             your_host\
    -port             3306\
    -dbname           my_homo_sapiens_funcgen_55_37\
    -pass             $PASS\
    -cell_type        CD4\
    -feature_type     H3K4ac\
    -feature_analysis WindowInterval\
    -tee\
    -result_files $*
		

The -experimental_set option allows several sets to be defined for one experiment.
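
The -ucsc_coords flag exists because BED files use 0-based, end-exclusive coordinates whereas Ensembl uses 1-based, end-inclusive coordinates. The sketch below shows what that shift amounts to: only the start column changes. bed_to_one_based is a hypothetical helper for illustration; it assumes tab-delimited input without track or browser header lines, and it should not be applied to a file that is then imported with -ucsc_coords, since the start would be shifted twice.

```shell
#!/bin/sh
# Hypothetical helper (not part of eFG).
# BED:     0-based start, end-exclusive
# Ensembl: 1-based start, end-inclusive
# Adding 1 to the start is the whole conversion; the end value is unchanged.
bed_to_one_based() {
    awk 'BEGIN { FS = OFS = "\t" } { $2 = $2 + 1; print }' "$@"
}

# A BED record covering the first 100 bases of chr1
printf 'chr1\t0\t100\tpeak_1\n' | bed_to_one_based
```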



:: Displaying data with DAS


The eFG API and environment now support dynamic Hydra DAS sources, removing the need for extra configuration every time
a new DAS source is loaded. Given that the DAS pre-requisites have been met, use the following functions to aid
configuration of DAS sources:

	  
	  ListDASSources

	  TurnOnDASSources
	  
	  TurnOffDASSources

	  LoadBedDASSources	- Loads individual DAS sources from BED files of aligned reads, as read or wiggle/profile displays

	  GenerateDASConfig - Uses the generate_DAS_config.pl script to detect which FeatureSets have the 'DAS_DISPLAYABLE' status and
	  generates a DAS instance specific config file.  This will also take feature_set and result_set options to flag
	  specific sets as 'DAS_DISPLAYABLE'. This will provide a basic configuration file which can be edited for
	  further display customisation.
	  An HTML file will also be written showing links which will attach the DAS sources to the Ensembl browser. This
	  assumes your DAS server is visible to the web.
	  
	  StartDASServer

	  StopDASServer
	  
	  ForwardDASPorts - Automates the complicated port forwarding required from a private DAS host (i.e. your desktop/laptop) to a publicly
	  available machine, which can then be used as the public DAS server.


Remember to use the -h option with the functions to get some additional help.

There are two ways to display data via DAS. The first is to simply set the 'DAS_DISPLAYABLE' status for any FeatureSet of peak calls loaded using the load scripts detailed in the previous section. The second loads read alignments from BED files into individual tables of a DAS database.  This database is not required to have an efg schema, but the tables can be integrated with an efg database. This goes some way to providing support for read alignment data, which is not currently integrated within the main funcgen schema. In future these read alignment DAS tables will be merged with the main efg schema.

A typical routine for setting up a DAS source from a BED file containing read alignments is as follows:

  njohnson@mac>efg												#Initialise the environment
  SOURCING SSH ENVIRONMENT
  Setting up the Ensembl Function Genomics environment...
  Welcome to eFG!

  EFG:mac>efgd													#cd to the root data dir
  EFG:mac>cd input/homo_sapiens/								#cd to human input dir	
  EFG:mac>ls
  ES_DNase_le1m_reads.bed.gz									#Here we have a bed file of aligned reads

  #Now we load this file using the 'reads' and 'profile' display options using LoadBedDASSources

  EFG:mac>LoadBedDASSources  -host $DB_HOST --user $DB_USER --pass ensembl --dbname your_das_dbname --files ES_DNase_le1m_reads.bed.gz --names ES_DNase --reads --profile --frag_length 150 --bin_size 25

  :: Building profile for:        ES_DNase_le1m_reads.bed.gz
  :: Decompressing:				  ES_DNase_le1m_reads.bed.profile_025.gz
  :: Loading profile file:        ES_DNase_le1m_reads.bed.profile_025.gz
  :: Table name:				  bed_profile_ES_DNase
