Environment Variables
---------------------

These point to the server where the working databases for the
annotation will be created.(Alternatively these values can be given on
the command line of each script)

ENSFGUSER=ensadmin
ENSFGPWD=big_secret
ENSFGHOST=ens-genomics2
ENSFGPORT=3306
ENSFGDRIVER=mysql


Non-standard Perl modules
-------------------------

use lib '/nfs/users/nfs_d/dkeefe/src/personal/ensembl-personal/dkeefe/perl/modules/';

use DBSQL::Utils;
use DBSQL::DBS; # to use databases listed in ~/dbs.


Database access specifications
------------------------------

In your home directory create a subdir called dbs. This will contain
files with database connection details

Four files are required

current_funcgen
current_mouse_funcgen
current_core
current_mouse_core

Each contains 5 lines eg

NAME = homo_sapiens_core_58_37c
HOST = ens-staging
PORT = 3306
USER = ensro
PASS =


Scripts
-------

found in

ensembl-functgenomics/scripts/regulatory_annotation


Overview of the process.
-----------------------

Each funcgen database contains regulatory features for multiple cell lines.
Annotation is done on each cell line separately in its own working database.

Output from the scripts is piped into log files so it is useful to
have a working directory for each cell line. eg

mkdir v58
mkdir v58/dbK562

We create the working database at the mysql command line.

create database dk_funcgen_classify_58_K562

We then populate this database with tables from the current funcgen
and core databases using scripts. (These scripts, by default, pick up
the database server host from the environment variables set above.)

gen_feats_4_classification.pl
reg_feats_4_classification.pl

which must be run in the above order. These scripts also create tables
of genomic features for analysis and filter the regulatory features to
remove unwanted instances.

We then create a set of randomly located regulatory features using

mock_reg_feat_gen.pl

Finally we run 

reg_feat_gen_feat_overlaps.pl

which does the analysis and creates the annotation in a table called
regulatory_features_classified in the working database. This script
reads a configuration file which allows the analysis to be configured
for both research and production runs. For production the file is
called reg_feat_gen_feat_conf.2 and contains the following text (no
blank lines).

protein_coding_exon1_plus_enhancer
protein_coding_transcript_downstream_2500
protein_coding_single_exon_gene_plus_enhancer
protein_coding_intron1
protein_coding_gene_body
intergenic_2500
RNA_gene_single_exon_gene_plus_enhancer
tss_centred_5000




In Practice
-----------

( In the future, when the dust has settled, I will write one script to
rule them all, but for the time being each script is run separately at
the command line.)

Most of the work is done by the database server so its OK to run these
on the head nodes.

To ease the pain there is a script called reg_feat_class_prep.pl This
looks at the regulatory features in the funcgen database and prints
out the commands needed for each cell line. (currently this is not
species aware so you need to check that the cell_name doesn't conflict
with that from another species)

To start with, create a working directory for this release and change to it.

mkdir mus58
cd mus58

reg_feat_class_prep.pl -e dev_mus_musculus_funcgen_58_37k -H ens-genomics1 -s mus_musculus

outputs :-

>>
processing SQL.
RegulatoryFeatures:ES 8 ES

create database dk_funcgen_classify_58_ES;

 Now edit ~/dbs/current_funcgen to point to dev_mus_musculus_funcgen_58_37k on ens-genomics1

 and ~/dbs/ens_staging to point to latest human core

 and copy reg_feat_gen_feat_conf.2 to the v58 dir

mkdir dbES
cd dbES
gen_feats_4_classification.pl -e dk_funcgen_classify_58_ES -v2 -s mus_musculus > & log1
reg_feats_4_classification.pl -e dk_funcgen_classify_58_ES -v2 -c ES -s mus_musculus > & log2
mock_reg_feat_gen.pl -e dk_funcgen_classify_58_ES -i regulatory_features_filtered -o mockreg_features_filtered >& log3
reg_feat_gen_feat_overlaps.pl -e dk_funcgen_classify_58_ES -v1 -c ../reg_feat_gen_feat_conf.1 > & log4 &
grep -v Ex log4 | tail -22 | grep -v 'and cell_type_specific' > summary1
cd ..
<<


The line RegulatoryFeatures:ES 8 ES tells us that there is only one
cell line with regulatory features. If there were more there would be
more lines. There are 8 bits in the binary string display_label. For
cell lines with greater than 2 bits the cell name follows. For those
with fewer this is blank signifying that regulatory annotation will
not be attempted on this cell line.

The create database line(s) can be pasted into the mysql prompt  

The next three lines are reminders 

And the next 8 are commands to enter at the unix command line. If
there are more than 1 cell line to be processed there will be further
blocks of 8 command lines.

The file summary1 in the working directory for each cell line gives
the number of features classified and a bit of QC.

The regulatory feature classifications are loaded into the funcgen
database using Nathan's scripts.



  $Log: running_the_annotation_scripts,v $
  Revision 1.1  2010-07-15 11:21:09  dkeefe
  Instructions

