
Here is an example of out procedure to align human and mouse genomes.
The procedure was designed by Jim Mullikin at the Sanger Institute,
and adapted to fit Ensembl compara by Abel Ureta-Vidal.

All command lines were run in bash shell.

1 - Move to your working directory. Check the disk space availible
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

> cd /ecs2/work3/abel/phusion/
	
 Create a FastaFiles directory if it does not exit.

> mkdir FastaFiles
> cd FastaFiles

2 - dump genomic dna from human and mouse genomes
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

> mkdir Hs31
> cd Hs31

 Then dump the sequence for each chromosome:
 
 To get the list of toplevel sequences (normally chromosomes, but may be a mix of eg chromosomes and scaffolds).
 
> echo "select sr.name, cs.name from coord_system cs, seq_region sr, seq_region_attrib sra, attrib_type at where sra.attrib_type_id=at.attrib_type_id and at.code='toplevel' and sr.seq_region_id=sra.seq_region_id and sr.coord_system_id=cs.coord_system_id and sr.name not like \"%_NT_%\" and sr.name not like \"%_DR_%\" and sr.name not like \"UNKN\";" | mysql -h ecs4 -u ensro -P 3351 homo_sapiens_core_20_34 | awk '!/name/' | sort -u > Hs34_chr_names

This produces a file of form:

1       chromosome
10      chromosome
11      chromosome
12      chromosome
13      chromosome
14      chromosome
15      chromosome
16      chromosome
17      chromosome
18      chromosome
19      chromosome
2       chromosome
20      chromosome
21      chromosome
22      chromosome
etc
 
 This file may be used to run the 

ensembl-compara/scripts/dumps/DumpChromosomeFragments.pl on the farm


eg To send that on the farm


> cat Hs31_chr_names | while read i j ;do echo bsub -q acari -o $i.out -e $i.err ensembl-compara/scripts/dumps/DumpChromosomeFragments.pl -dbname homo_sapiens_core_12_31 -chr_names $i -overlap 0 -chunk_size 60000 -masked 0 -phusion Hs -o $i.fa -coord_system $j -conf /nfs/acari/cara/.Registry.conf;done

 Check each job went fine, and concatenate the all chromosome fasta files in one big, e.g. Hs31.fa
 Delete the intermediate files.

 Do the same for mouse.

 Usually the dumps of human and mouse distribute over the farm take around 6 hours.

3- Indexing fasta files for phusion and blast jobs
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 The index file is needed in the post-processing of phusion
> ensembl-compara/scripts/phusion/fasta2tag.pl Hs31.fa > Hs31.tag

 This index file is needed when running the blast jobs
> /nfs/acari/abel/bin/fastaindex Hs31.fa Hs31.fa.index

4- run phusion (on aristotle)
   ~~~~~~~~~~~~~~~~~~~~~~~~~~

> cd /ecs2/work3/phusion
> mkdir Hs31Mm3
> cd Hs31Mm3

The phusion executable initially used is in /nfs/team71/psg/jcm/src/phusion/phusion.
I copied it to my home directory to make make sure I always use the same
/nfs/acari/abel/src/ensembl_main/ensembl-compara/scripts/phusion/phusion

For details on the phusion program, see
Mullikin, JC and Ning, Z. "The Phusion Assembler", Genome Research 13, 81-90, 2003

 The phusion run on human/mouse needs ~32gb of memory, so you'll need to run it on aristotle.
 Here is the command line to send the job to aristotle

bsub -q acaritest -C0 -R"select[mem>=20000] rusage[mem=20000]" -o phusion17-60k.out -e phusion17-60k.err /nfs/acari/abel/src/ensembl_main/ensembl-compara/scripts/phusion/phusion -kmer 17 -depth 2 -match 2 -match2 2 -set 50 -break 1 -SCG /ecs2/work3/abel/phusion/FastaPeptidesFiles/Hs31/Hs31.fa mates /ecs2/work3/abel/phusion/FastaPeptidesFiles/Hs31/Hs31.fa /ecs2/work3/abel/phusion/FastaPeptidesFiles/Mm3/Mm3.fa


 Takes another ~6 hours

5- Reading Phusion output to obtain pairwise comparison
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ensembl-compara/scripts/phusion/GetExtra.pl Hs 1 /ecs2/work3/abel/phusion/FastaPeptidesFiles/Hs31/Hs31.tag /ecs2/work3/abel/phusion/FastaPeptidesFiles/Mm3/Mm3.tag phusion17-60k.out > Extra 2> Extra.err

 Hs the reference species,
 1 try to expand to neighbouring segment even if no 17-mer connection exists

        60Kb A         60Kb B
Mm -------------- --------------
       |            |
Hs    ------------------ ---------------
           60Kb	C		60Kb D
  
 when argument2 is 1
  A, B, C and D are part on the same cluster.
 when argument2 is 0,
  just do a strict connected "graph", so A, B and C are in the same cluster.

6- Dispatch the pairwise comparison from the 'Extra' file to run blastn
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

> mkdir Extra_dir
> cd Extra_dir
> ensembl-compara/scripts/phusion/DispatchExtra.pl ../Extra

 That will create in the working directory, the input files needed by LaunchBlast.pl

7- LaunchBlast (LSF job arrays)
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 For LSF job arrays specific doc, Tim Cutts has written a very detailed web page.
 http://www.sanger.ac.uk/Users/tjrc/lsf/job-arrays.html

> ls |wc -l
 That will be you the number of jobs to be sent

> mkdir ../Cigar
 This directory will receive the parsed blast outputs

 NB: There is a restriction to alpha machine because the LaunchBlast.pl use a c executable, fastafetch,
     only compiled on alpha machines. We have to think maybe to use dbfetch instead...need to investigate.

     LaunchBlast.pl runs a filtering program, ensembl-compara/scripts/phusion/FilterBlast.pl, that parse the blast 
     output obtained for each mouse query and keep hits the best hitted human chromosome, and print out
     in a gff-like format with ciger lines.
     !!! Maybe a bug, few hits are pointed out with empty cigar line, need investigation !!!

     Not generic at all, I know, so you may need to modify these lines in the code of LaunchBlast.pl
     my $fastafetch_executable = "/nfs/acari/abel/bin/fastafetch";
     my $FilterBlast_executable = "/nfs/acari/abel/src/ensembl_main/ensembl-compara/scripts/phusion/FilterBlast.pl";
     
 Check with one job, if everything goes fine
> echo 'ensembl-compara/scripts/phusion/LaunchBlast.pl -i Extra.set${LSB_JOBINDEX} -st Hs -sf /ecs2/work3/abel/phusion/FastaPeptidesFiles/Hs31/Hs31.fa -si /ecs2/work3/abel/phusion/FastaPeptidesFiles/Hs31/Hs31.fa.index -qt Mm -qf /ecs2/work3/abel/phusion/FastaPeptidesFiles/Mm3/Mm3.fa -qi /ecs2/work3/abel/phusion/FastaPeptidesFiles/Mm3/Mm3.fa.index -min_score 300 -dir /ecs2/work3/abel/phusion/Hs31Mm3/Cigar' | bsub -q acari -Ralpha -JHs31Mm3"[1]" -o Extra.set%I.out -e Extra.set%I.err

 If everything goes fine, send all the remaining jobs

> [...] | bsub -q acari -Ralpha -JHs31Mm3"[2-16364]" -o Extra.set%I.out -e Extra.set%I.err

 Resend failed jobs. To check which ones failed e.g.

> ls|grep out|while read i;do awk '/^Subject/ && $NF!="Done" {print;exit}' $i;done
Subject: Job 276328[1004]: <Hs31Mm3[1-16364]> Exited
Subject: Job 276328[10055]: <Hs31Mm3[1-16364]> Exited

jobs #1004 and #10055 failed. To resend them only

> rm -f Extra.set1004.* Extra.set10055.*
> [...] | bsub -q acari -Ralpha -JHs31Mm3"[1004,10055]" -o Extra.set%I.out -e Extra.set%I.err


8- Further filter the hits
   ~~~~~~~~~~~~~~~~~~~~~~~

> cd ../Cigar
> ls |wc -l
There should be the same number of files as jobs you sent.

> ls | xargs cat | ensembl-compara/scripts/phusion/RemoveNonSyntenic.pl 6 > ../rmRedun.1st
> cd ../
> ensembl-compara/scripts/phusion/RemoveNonSyntenic.pl 6 rmRedun.1st > rmRedun.2nd


The first argument needs to be an even number

          *
Mm  - - -   - - -  6 consecutive matches on the same block, 3 are 5' of another 
                   mouse matches, and 3 are 3'. * deleted, because not 
                   consistent with the 3 matches before and after

Hs ---------------


9- Load data in compara db
   ~~~~~~~~~~~~~~~~~~~~~~~

 Make sure you have a compaar database to populate. If not create one

> echo "create database ensembl_compara_12_2;" |mysql -h ecs1b -u ensadmin -pxxxx
> mysql -h ecs1b -u ensadmin -pxxxx ensembl_compara_12_2 < ensembl-compara/sql/table.sql

 make sure that the genome_db table is populated with the corresponding genome entries.
 if not create those entries

> mysql -h ecs1b -u ensadmin -pxxxx ensembl_compara_12_2
> insert into genome_db (genome_db_id,taxon_id,name,assembly) values (1,9606,"Homo sapiens","NCBI31");
> insert into genome_db (genome_db_id,taxon_id,name,assembly) values (1,10090,"Mus musculus","MGSC3");


 Then load the alignments in your compara db with ensembl-compara/scripts/phusion/LoadComparaDb.pl

>ensembl-compara/scripts/phusion/LoadComparaDb.pl -dbname ensembl_compara_12_1 -conf_file /nfs/acari/cara/.Registry.conf -cs_genome_db_id 1 -qy_genome_db_id 2 -qy_tag Mm -tab 1 -file rmRedun


10- run health check test suite
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~

Foreign keys db contraints
==========================

genome_db_id	PK	genome_db
		FK	danfrag
		FK	gene_relationship_member
		FK	method_link_species
		FK	genomic_align_genome	-> consensus_genome_db_id
						-> query_genome_db_id

danfrag_id	PK	dnafrag
		FK	dnafrag_region
		FK	genomic_align_block	-> consensus_dnafrag_id
						-> query_danfrag_id

synteny_region_id	PK	synteny_region
			FK	dnafrag_region

method_link_id	PK	method_link
		FK	method_link_species
		FK	genomic_align_genome **
		FK	genomic_aling_block **

gene_relationship_id	PK	gene_relationship
			FK	gene_relationship_member

meta table
~~~~~~~~~~

Make sure the 'max_alignment_length' key is there and set up **

