Populating an Ensembl Variation database with data from dbSNP
======================================================================

Checkout required scripts and APIs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Set the working directory

  $ cd /some/path/to/workdir
  $ setenv BASEDIR $PWD

Set the path to the directory holding the perl binary that is available
on the cluster (the commands below call $PERLDIR/perl);

  $ setenv PERLDIR /usr/bin

Get bioperl code

  $ cvs -d :pserver:cvs@cvs.open-bio.org:/home/repository/bioperl login
    (pass cvs)

  $ cvs -d :pserver:cvs@cvs.open-bio.org:/home/repository/bioperl \
    co bioperl-live 

Get ensembl-core code

  $ cvs -d :ext:cvs.sanger.ac.uk:/cvsroot/ensembl co ensembl

Get ensembl-variation code

  $ cvs -d :ext:cvs.sanger.ac.uk:/cvsroot/ensembl co ensembl-variation

Get ensembl-analysis code

  $ cvs -d :ext:cvs.sanger.ac.uk:/cvsroot/ensembl co ensembl-analysis

Set the Perl environment

  $ setenv PERL5LIB ${BASEDIR}/bioperl-live:${PERL5LIB}
  $ setenv PERL5LIB ${BASEDIR}/ensembl/modules:${PERL5LIB}
  $ setenv PERL5LIB ${BASEDIR}/ensembl-variation/modules:${PERL5LIB}
  $ setenv PERL5LIB ${BASEDIR}/ensembl-variation/scripts/import:${PERL5LIB}
  $ setenv PERL5LIB ${BASEDIR}/ensembl-analysis/modules:${PERL5LIB}
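
A mistyped path or a missing checkout only shows up later as a Perl
"Can't locate ..." error, so it can help to test each PERL5LIB entry
first. The following is a minimal sketch written for sh/bash rather
than the csh used above; check_perl5lib is a hypothetical helper name
and not part of the Ensembl code.

```sh
# Sketch only: report PERL5LIB entries that are not directories.
# check_perl5lib is a hypothetical helper, not an Ensembl script.
# The function body runs in a subshell so that the IFS change stays local.
check_perl5lib() (
    # $1: colon-separated path list (defaults to $PERL5LIB)
    IFS=:
    missing=0
    for dir in ${1-${PERL5LIB-}}; do
        # Skip empty entries such as a trailing ":".
        [ -n "$dir" ] || continue
        if [ ! -d "$dir" ]; then
            echo "missing: $dir"
            missing=1
        fi
    done
    # Exit status 0 only if every entry exists.
    [ "$missing" -eq 0 ]
)
```

Usage, from an sh-compatible shell with PERL5LIB exported;

  $ check_perl5lib "$PERL5LIB" && echo "all PERL5LIB entries found"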


Configure databases
~~~~~~~~~~~~~~~~~~~
Variation databases are created on a per-species basis. Pick a mysql
instance and create an empty ensembl variation database;

  $ mysql \
    --host=localhost --port=3306 --user=rw_user --password=secret \
    -e "create database oryza_sativa_variation_38"

And load the schema;

  $ mysql \
    --host=localhost --port=3306 --user=rw_user --password=secret \
    oryza_sativa_variation_38 < $BASEDIR/ensembl-variation/sql/table.sql

Finally, set the schema_version in the meta table;

  $ mysql \
    --host=localhost --port=3306 --user=rw_user --password=secret \
    oryza_sativa_variation_38 -e \
    'INSERT INTO meta (meta_key, meta_value) VALUES ("schema_version", "38")'
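
The release number appears twice above, once in the database name and
once in the meta table, and the two are easy to let drift apart. The
sketch below (sh/bash, not csh) derives both from one value. The
helper names are hypothetical, and the <species>_variation_<version>
naming pattern is an assumption read off the rice example.

```sh
# Sketch only: both helpers are hypothetical, not part of Ensembl.

# Build the database name, assuming the pattern
# <lower-case species>_variation_<version> seen in the rice example.
variation_db_name() {
    # $1: species (e.g. Oryza_sativa), $2: schema version (e.g. 38)
    printf '%s_variation_%s\n' \
        "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" "$2"
}

# Build the INSERT statement used above for the meta table.
schema_version_sql() {
    # $1: schema version (e.g. 38)
    printf 'INSERT INTO meta (meta_key, meta_value) VALUES ("schema_version", "%s")\n' "$1"
}
```

The two helpers can then feed the same mysql call as above;

  $ mysql \
    --host=localhost --port=3306 --user=rw_user --password=secret \
    "$(variation_db_name Oryza_sativa 38)" -e "$(schema_version_sql 38)"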

The SNP pipeline scripts use an ensembl.registry file to store the DB
connection info. For each species the core, variation and dbsnp
databases must be configured. An example for rice is shown below;

  $ <editor> ensembl-variation/scripts/import/ensembl.registry
  Bio::EnsEMBL::DBSQL::DBAdaptor->new
  ( '-species' => "Oryza_sativa",
    '-group'   => "core",
    '-port'    => 3306,
    '-host'    => 'localhost',
    '-user'    => 'ro_user',
    '-pass'    => '',
    '-dbname'  => 'oryza_sativa_core_38',);

  Bio::EnsEMBL::Variation::DBSQL::DBAdaptor->new
  ( '-species' => 'Oryza_sativa',
    '-group'   => 'variation',
    '-port'    => 3306,
    '-host'    => 'localhost',
    '-user'    => 'rw_user',
    '-pass'    => 'secret',
    '-dbname'  => 'oryza_sativa_variation_38',);

  Bio::EnsEMBL::DBSQL::DBAdaptor->new
  ( '-species' => 'Oryza_sativa',
    '-group'   => 'dbsnp',
    '-port'    => 3306,
    '-host'    => 'localhost',
    '-user'    => 'ro_user',
    '-pass'    => '',
    '-dbname'  => 'dbSNP_125_rice_4530', );
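
Since all three groups (core, variation and dbsnp) must be present, a
quick text check of the registry file can catch a forgotten stanza.
This is a rough sketch in sh/bash: missing_registry_groups is a
hypothetical helper, it only greps for the '-group' lines and does not
parse the Perl, and it does not tell species apart.

```sh
# Sketch only: list which of the three required groups have no
# '-group' line in the registry file. Hypothetical helper.
missing_registry_groups() {
    # $1: path to the ensembl.registry file
    for group in core variation dbsnp; do
        # Accept either single or double quotes around the group name.
        if ! grep -Eq "'-group'[[:space:]]*=>[[:space:]]*[\"']$group[\"']" "$1"; then
            echo "$group"
        fi
    done
}
```

No output means that all three groups were found;

  $ missing_registry_groups ensembl-variation/scripts/import/ensembl.registry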


Configure import script
~~~~~~~~~~~~~~~~~~~~~~~
Go to the base dir;

  $ cd $BASEDIR

Look at the following file, and determine whether an appropriate
SPECIES_PREFIX is configured;

  $ <editor> $BASEDIR/ensembl-variation/scripts/import/dbSNP.pl

Add a new "if ($species =~ /xx/i)" block if required. The most basic
block, which imports only variations (e.g. SNPs) and their flanking
sequences, may resemble;

  my $dbsnp = dbSNP::GenericContig->new(-dbSNP => $dbSNP,
                                        -dbCore => $dbCore,
                                        -dbVariation => $dbVar,
                                        -tmpdir => $TMP_DIR,
                                        -tmpfile => $TMP_FILE,
                                        -limit => $LIMIT_SQL,
                                        -taxID => $TAX_ID,
                                        -species_prefix => $SPECIES_PREFIX
                                        );
  $dbsnp->dump_dbSNP();



Run import script
~~~~~~~~~~~~~~~~~
Go to basedir;

  $ cd $BASEDIR

Make sure the dbSNP.pl script can run (you should get usage info);

  $ $PERLDIR/perl $BASEDIR/ensembl-variation/scripts/import/dbSNP.pl

Make sure the dbSNP.pl script can run through the batch system, LSF or
PBS (more usage info);

  LSF:
  $ bsub -I $PERLDIR/perl $BASEDIR/ensembl-variation/scripts/import/dbSNP.pl

  PBS:
  $ qsub -v PERL5LIB,BASEDIR,PERLDIR -I
  $ $PERLDIR/perl $BASEDIR/ensembl-variation/scripts/import/dbSNP.pl

Make sure that the species of interest is configured in dbSNP.pl. If
not, then the dbSNP::GenericContig dumper will be used by default. In
this case make sure that the correct methods are configured in
dump_dbSNP.
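
To see which species already have a block, the patterns can be pulled
out of the script with sed. The sketch below (sh/bash) assumes the
blocks follow the "$species =~ /xx/i" form quoted earlier.
configured_species_patterns is a hypothetical helper, and a block
written in any other style will be missed.

```sh
# Sketch only: print the regex from every "$species =~ /.../" test.
# Hypothetical helper; relies on the "if ($species =~ /xx/i)" form.
configured_species_patterns() {
    # $1: path to dbSNP.pl
    # "#" is used as the sed delimiter because the pattern contains "/".
    sed -n 's#.*\$species[[:space:]]*=~[[:space:]]*/\([^/]*\)/.*#\1#p' "$1"
}
```

Usage;

  $ configured_species_patterns \
    $BASEDIR/ensembl-variation/scripts/import/dbSNP.pl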

Run the import script for real;

  LSF:
  $ bsub -e dbSNP_errors.txt -o dbSNP_output.txt    \
    -K $PERLDIR/perl $BASEDIR/ensembl-variation/scripts/import/dbSNP.pl \
    -species=Oryza_sativa -dbSNP_version=b125 \
    -tmpdir=/tmp -tmpfile=dbSNP_tmpfile.txt

  PBS:
  $ qsub -v PERL5LIB,BASEDIR,PERLDIR -I
  $ $PERLDIR/perl $BASEDIR/ensembl-variation/scripts/import/dbSNP.pl \
    -species=Oryza_sativa -dbSNP_version=b125 \
    -tmpdir=/tmp -tmpfile=dbSNP_tmpfile.txt

Check the dbSNP_errors.txt file for errors;

  $ cat dbSNP_errors.txt
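
For a long error file, a short summary can be easier to read than a
full cat. The following sh/bash sketch uses a hypothetical helper,
report_errors. Note that a non-empty file does not by itself prove a
failure, since warnings may also be written there.

```sh
# Sketch only: summarise an error file. Hypothetical helper.
# Returns 0 if the file is empty, 1 if it has content, 2 if it is absent.
report_errors() {
    # $1: path to the error file (e.g. the one given to bsub -e)
    if [ ! -e "$1" ]; then
        echo "$1 not found (has the job run?)"
        return 2
    fi
    if [ -s "$1" ]; then
        echo "$1 is not empty; first lines:"
        head -n 5 "$1"
        return 1
    fi
    echo "$1 is empty"
}
```

Usage;

  $ report_errors dbSNP_errors.txt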


Mapping dbSNP flanking sequences to the genome
==============================================

This procedure is not required if the dbSNP database already contains
the mappings between the species of interest and the specific assembly
in the ensembl core DB. If the data is missing, or the mappings are to
a different assembly, then the flanks must be mapped to the genome
using SSAHA, and the results loaded into the variation DB.


Dump the flanking sequences into a text file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Go to the script dir;

  $ cd $BASEDIR/ensembl-variation/scripts/import/

Run the input file generation program. This will generate a fasta file
per 10000 SNPs in the current working dir;

  LSF:
  $ bsub -e bsub_errors.txt -o bsub_output.txt                      \
    -K $PERLDIR/perl                                                \
    $BASEDIR/ensembl-variation/scripts/import/generate_input_seq.pl \
    -species=Oryza_sativa -generate_input_seq -output_dir=$PWD      \
    -tmpdir=/tmp -tmpfile=tmpfile.txt
    

  PBS:
  $ qsub -v PERL5LIB,BASEDIR,PERLDIR -I
  $ $PERLDIR/perl                                                   \
    $BASEDIR/ensembl-variation/scripts/import/generate_input_seq.pl \
    -species=Oryza_sativa -generate_input_seq -output_dir=$PWD      \
    -tmpdir=/tmp -tmpfile=tmpfile.txt
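
Once the job has finished, counting the records in each chunk is a
cheap sanity check. The sketch below (sh/bash) counts FASTA header
lines. count_fasta_records is a hypothetical helper, and it assumes
one FASTA record per SNP, so a full chunk should report about 10000.
The chunk file names are not assumed; pass whichever files the script
wrote.

```sh
# Sketch only: print "<file><TAB><number of FASTA records>" per file.
# Hypothetical helper; a record is any line starting with ">".
count_fasta_records() {
    for f in "$@"; do
        # grep -c prints the number of matching lines (0 if none).
        printf '%s\t%s\n' "$f" "$(grep -c '^>' "$f")"
    done
}
```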


Prepare the target sequence
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Run the target file generation program. You can either dump the
entire genome into a single FASTA file, or create a FASTA file per
chromosome;

  $  $PERLDIR/perl                                                    \
     $BASEDIR/ensembl-variation/scripts/import/generate_target_seq.pl \
     -species=Oryza_sativa -output_dir=$PWD


Generate the ssaha alignments
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Run the run_ssaha2 script. The following example will cause the first
10 snp chunk files to be aligned against the target sequence.

  $ $PERLDIR/perl                                           \
    $BASEDIR/ensembl-variation/scripts/import/run_ssaha2.pl \
    -input_dir  $BASEDIR/data/fasta/snp                     \
    -target_dir $BASEDIR/data/fasta/dna                     \
    -output_dir $BASEDIR/data/mapping                       \
    -start 1 -end 10

The run typically takes several hours. Check the job output and
error files after the job has completed to ensure that it was
successful. These are out_N and err_N, and can be very large. Keep
an eye on free disk space and clean up as needed. Rerun failed jobs
using the script's -start and -end params. The end product will be a
set of mapping_file_N files containing the filtered, processed SNP
locations.
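
To decide which chunks to rerun, it can help to list the chunk numbers
that have no usable mapping file. The sketch below (sh/bash) uses a
hypothetical helper, failed_chunks. It assumes that the
mapping_file_N files land in the output dir and treats a missing or
empty file as a failure, which may flag a chunk in which no SNP
happened to map.

```sh
# Sketch only: print chunk numbers whose mapping_file_N is missing or
# empty. Hypothetical helper; file location is an assumption.
failed_chunks() {
    # $1: output dir, $2: first chunk number, $3: last chunk number
    n=$2
    while [ "$n" -le "$3" ]; do
        if [ ! -s "$1/mapping_file_$n" ]; then
            echo "$n"
        fi
        n=$((n + 1))
    done
}
```

Usage, for the 10 chunks of the example above;

  $ failed_chunks $BASEDIR/data/mapping 1 10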


Load the SNP features into the Variation DB
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Concatenate all mapping files into a single file;

  $ cat mapping_file_* > mapping_file

Go to the following script dir;

  $ cd $BASEDIR/ensembl-variation/scripts/import

You now need to rerun the dbSNP load script to load the new mappings.

  $ $PERLDIR/perl $BASEDIR/ensembl-variation/scripts/import/dbSNP.pl \
    -species=Oryza_sativa -dbSNP_version=b125 \
    -tmpdir=/tmp -tmpfile=dbSNP_tmpfile.txt \
    -mapping_file=$BASEDIR/mapping_file
