README for running translated BLAT

The best way to get tBLAT alignments is using the run_BLAT_pipeline.pl you can find in the ~ensembl-compara/scripts/BLAT directory. This script takes care of downloading the sequences in chunks and groups and launches all the tBLAT jobs to the farm using the LSF system. It checks performs several checks and re-runs the jobs when needed. If a particular job fails persistently, the scripts dies throwing an exception. At the moment, the tBLAT pipeline relies on storing data in flat files.

In order to run the script, you will need a copy of the ensembl core, ensembl-compara and ensembl-pipeline APIs, a working directory with plenty of space and a Registry configuration file.

For example, the default working directory is /ecs4/work1/jh7/BLAT. Here is an example of a Registry configuration file:

===========================================================
use strict;
use Bio::EnsEMBL::Utils::ConfigRegistry;
use Bio::EnsEMBL::DBSQL::DBAdaptor;

my @aliases;

new Bio::EnsEMBL::DBSQL::DBAdaptor(
    -host => 'ecs2',
    -user => 'ensro',
    -port => 3365,
    -species => 'Homo sapiens',
    -group => 'core',
    -dbname => 'homo_sapiens_core_34_35g');
@aliases = ('human');
Bio::EnsEMBL::Utils::ConfigRegistry->add_alias(
    -species => "Homo sapiens", -alias => \@aliases);

new Bio::EnsEMBL::DBSQL::DBAdaptor(
    -host => 'ecs2',
    -user => 'ensro',
    -port => 3365,
    -species => 'Ciona intestinalis JGI2',
    -group => 'core',
    -dbname => 'jhv_ciona_intestinalis_seq_34_JGI2');
@aliases = ('ciona', 'Ciona intestinalis', "Ciona");
Bio::EnsEMBL::Utils::ConfigRegistry->add_alias(
    -species => "Ciona intestinalis JGI2", -alias => \@aliases);

1;
===========================================================

Then, you just have to run the script like this:
/usr/local/ensembl/bin/perl -w ~/src/ensembl_main/ensembl-compara/scripts/BLAT/run_BLAT_pipeline.pl --reg_conf registry.conf --species1 fugu --species2 danio

You can even launch the main script using the LSF queuing system:
bsub -m ecs4_hosts -J BLAT.fugu-danio -q basement /usr/local/ensembl/bin/perl -w ~/src/ensembl_main/ensembl-compara/scripts/BLAT/run_BLAT_pipeline.pl --reg_conf registry.conf --species1 fugu --species2 danio --queue basement

This will create the following file ~working_directory/run/Homo_sapiens_NCBI35:c100000:o1000:m2_vs_Ciona_intestinalis_JGI2:c100000:o1000:m2/all.data which holds all the results
NB: The script chooses automatically which one of the two species will be the consensus depending on the quality of the assembly. Therefore, you may find the species in a different order in the name of this directory.

Then, you just have to load the data using the ~ensembl-compara/scripts/BLAT/LoadBLATAlignments.pl script:

~ensembl-compara/scripts/BLAT/LoadBLATAlignments.pl -dbname my_compara -cs_genome_db_id 1 -qy_genome_db_id 18 -file ~working_directory/run/Homo_sapiens_NCBI35:c100000:o1000:m2_vs_Ciona_intestinalis_JGI2:c100000:o1000:m2/all.data -alignment_type TRANSLATED_BLAT -conf_file registry.conf -tab 1

This will produce the following files 

genomic_align_block.Gg1Hs35.data
genomic_align.Gg1Hs35.data
genomic_align_group.Gg1Hs35.data
meta.Gg1Hs35.data

logon to the machine that has the db on it. (eg. caa-stat mysql_3364 on ecs2 will tell you which machine ecs2:3364 is currently on). These are loaded into the db using :

mysql -S /mysql/data_3364/mysql_3364.sock -u ensadmin -Pxxxxxxx 

mysql> use cara_ensembl_compara_27_1 ;

database changed

mysql> load data infile /ecs2/scratch3/cara/BLAT/Gg1Hs35/genomic_align_block.Gg1Hs35.data into table genomic_align_block ;

repeat for the other three files and you're done!

NB: You can also load the data directly into the DB using the -tab 0 option but this uses to be slower!


