The pipeline for generating the Ensembl Variation database consists of two main processes: importing data from dbSNP, followed by post-processing.

The data importing process:

The raw data import mainly consists of a series of SQL queries that dump data from the internal mirror of the dbSNP MySQL database at the Sanger Institute and import it into the Ensembl variation database.
There are 10 core steps in the data import for all species; see Table 1 below.

Table 1 - List of Ensembl tables generated and dbSNP tables used
Ensembl tables generated                                     dbSNP tables used
source
variation                                                    SNP, SNPAncestralAllele
variation synonym*                                           SubSNP, SNPSubSNPLink, ObsVariation, Batch, UniVariAllele
population/sample/population structure/sample synonym        PopClassCode, PopClass, PopLine
individual/sample/individual population/sample synonym       SubmittedIndividual, Individual, PedigreeIndividual
allele                                                       Allele, AlleleFreqBySsPop, SubSNP
flanking sequence                                            SubSNPSeq5, SubSNPSeq3, SNP
individual genotype                                          SubInd, ObsGenotype, SubmittedIndividual
population genotype                                          GtyFreqBySsPop, UniGty, Allele
variation feature                                            SNPContigLoc, ContigInfo

*All of the subsnp information is used in the intermediate steps and deleted from the final
variation synonym table.

The tables listed in the left column are generated in the Ensembl variation database. The corresponding tables listed in the right column are the dbSNP tables used to generate them.
During the import, the variation synonym table holds most of the information about subsnps, such as the subsnp id, snp id and substrand reversed flag, which is used to produce other Ensembl tables such as the allele and genotype tables. When all the tables have been generated, a final clean-up step drops unwanted tables and columns. During this clean-up, the subsnp information stored in the variation synonym table is deleted because it is not needed in the end.

The scripts used for the dbSNP import and for post-processing are stored in the CVS repository:
http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembl-variation/scripts/import/?root=ensembl, and the database schema is at http://cvs.sanger.ac.uk/cgi-bin/viewvc.cgi/ensembl-variation/schema/?root=ensembl.

Importing :

The main driving script is dbSNP.pl. Several modules are used for different species and different cases:

MappingChromosome.pm--used when Ensembl has a new assembly but dbSNP still uses the old one. The SNPs need to be remapped from the old assembly to the new one, and this module imports the newly mapped SNPs into the Ensembl variation database

GenericContig.pm--used when dbSNP data is imported at contig level (because our chromosome names differ from those in dbSNP, as for anopheles; it was also used for the mouse import, but not any more). The coordinates are then mapped to toplevel during post-processing

GenericChromosome.pm--used in most cases, to import chromosome and toplevel coordinates from dbSNP

Apart from the general cases above, there are specific cases for human and mosquitos.

For human, apart from the normal import, we also import HGVbaseIDs, TSCIDs and AFFYIDs, as well as Illumina data. HGVbaseIDs up to version 15 are stored in a flat file: dbSNP/rs_hgvbase.txt.gz. Affy data is stored separately for each array: 100k, 500k and genome wide 6.0.

For mosquitos, Ensembl uses a different naming system, i.e. chromosomes 1, 2L, 2R, 3L, 3R, X. We also put "PEST" and "MOPTI" into the genotype table by sorting them out using dbSNP.loc_snp_id (we submitted this data ourselves, so the local id has MOPTI or PEST in it to distinguish between them).

Post-processing :

Post-processing runs after the data import and handles the following tables: the variation feature table, to remove unwanted features; the flanking sequence table, to reduce its size; and the transcript variation table, to calculate the effect of each variation on transcripts.

variation feature table: 

1. Transfer coordinates from contig to toplevel if they are not already on toplevel.
2. Add allele_string to the variation_feature table. The first allele is always the reference allele.
3. Variation features with the following properties are removed from the table (as well as from other tables):

- more than three mappings
- reference allele not in the allele string column of the variation feature table
- more than three different alleles in the allele string column of the variation feature table
- duplicated positions and alleles (the SNP with the smallest rsId is kept)
- variation with no variation_feature

Note that the removed variations themselves are kept only in the variation and failed variation tables, with a status of failed and the reason for failing.
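
The first four removal rules above can be sketched as follows. This is only an illustration in Python, not the code in the import scripts: the function name find_failed, the dictionary keys and the reason strings are all hypothetical, and the sketch assumes the reference allele is already known for each feature.

```python
from collections import Counter

MAX_MAPPINGS = 3  # "more than three mappings" fails
MAX_ALLELES = 3   # "more than three different alleles" fails


def find_failed(features):
    """Return {rs_id: reason} for variations that break one of the rules.

    Each feature is a dict with the (hypothetical) keys rs_id, seq_region,
    start, ref_allele and allele_string.
    """
    failed = {}
    mappings = Counter(f["rs_id"] for f in features)
    first_seen = {}  # (seq_region, start, allele_string) -> smallest rs_id
    # Visit features by ascending rs_id so the smallest rsId of a
    # duplicated position/allele pair is the one that is kept.
    for f in sorted(features, key=lambda f: f["rs_id"]):
        rs = f["rs_id"]
        alleles = f["allele_string"].split("/")
        if mappings[rs] > MAX_MAPPINGS:
            failed.setdefault(rs, "more than three mappings")
        elif f["ref_allele"] not in alleles:
            failed.setdefault(rs, "reference allele not in allele_string")
        elif len(set(alleles)) > MAX_ALLELES:
            failed.setdefault(rs, "more than three alleles")
        else:
            key = (f["seq_region"], f["start"], f["allele_string"])
            if key in first_seen and first_seen[key] != rs:
                failed.setdefault(rs, "duplicate of rs%d" % first_seen[key])
            else:
                first_seen.setdefault(key, rs)
    return failed
```

The fifth rule (a variation with no variation_feature) is not covered here, since it needs the variation table as well.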

flanking sequence table: 

Some flanking sequences are very long. In order to reduce disk space usage, each flanking sequence is aligned back against the reference sequence at the given SNP position. If there is an exact match, the flanking sequence table stores the seq region id, seq region start and seq region end rather than the original sequence. For the human genome this method has reduced disk space usage from 12.8 GB to 2 GB.
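
A minimal sketch of this comparison is shown below, in Python. It assumes the reference is available as an in-memory string, that coordinates are 1-based and inclusive, and that the upstream flank ends immediately before the SNP and the downstream flank starts immediately after it. The function name compress_flanks and the *_seq_region_start/end keys are hypothetical.

```python
def compress_flanks(reference, snp_start, snp_end, up_seq, down_seq):
    """Replace each flank by reference coordinates when it matches exactly.

    reference is the sequence of one seq region; snp_start and snp_end are
    1-based inclusive positions on it. A flank that does not match the
    reference is kept as a literal sequence.
    """
    row = {
        "up_seq": up_seq, "up_seq_region_start": None, "up_seq_region_end": None,
        "down_seq": down_seq, "down_seq_region_start": None, "down_seq_region_end": None,
    }

    up_start, up_end = snp_start - len(up_seq), snp_start - 1
    if up_start >= 1 and reference[up_start - 1:up_end] == up_seq:
        row.update(up_seq=None, up_seq_region_start=up_start, up_seq_region_end=up_end)

    down_start, down_end = snp_end + 1, snp_end + len(down_seq)
    if down_end <= len(reference) and reference[down_start - 1:down_end] == down_seq:
        row.update(down_seq=None, down_seq_region_start=down_start,
                   down_seq_region_end=down_end)
    return row
```

For example, with reference "ACGTACGTAC" and a SNP at position 5, an upstream flank of "ACGT" is stored as coordinates 1 to 4, while a downstream flank of "CGA" does not match the reference and stays as sequence.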

transcript variation table: 

In Ensembl, sixteen consequence types are defined; see Table 2.
These sixteen consequence types are listed in order of importance, i.e. the first one, ESSENTIAL SPLICE SITE, is considered the most important type.

Table 2 - List of consequence types in the transcript variation table 
ESSENTIAL SPLICE SITE
STOP GAINED 
STOP LOST 
COMPLEX_INDEL
FRAMESHIFT CODING 
NON SYNONYMOUS CODING 
SPLICE SITE 
SYNONYMOUS CODING 
REGULATORY REGION
WITHIN_MATURE_miRNA 
5PRIME UTR 
3PRIME UTR 
INTRONIC 
WITHIN_NON_CODING_GENE
UPSTREAM 
DOWNSTREAM 

For each SNP that is associated with a transcript (i.e. the SNP lies within 5 kb on either side of a transcript, or within a regulatory region that regulates a gene/transcript), the resulting consequence type is calculated. The consequence types ESSENTIAL SPLICE SITE and SPLICE SITE exist in combination with one of the other types. The consequence type REGULATORY REGION can exist on its own or combined with one of the other types.

ESSENTIAL SPLICE SITE is defined for SNPs that affect the first two bases of an intron (the GT) or the last two bases of an intron (the AG).

SPLICE SITE is defined for other SNPs that are either up to 3 bp into the exon or 3 to 8 bp into the intron.
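
One way to read the two splice definitions above is sketched below in Python. This is an interpretation, not the pipeline's code: it assumes a single-base SNP, an intron given by 1-based inclusive coordinates, and distances counted from the nearest intron boundary. The function name splice_consequence is hypothetical.

```python
def splice_consequence(pos, intron_start, intron_end):
    """Classify a single-base SNP position relative to one intron.

    Returns "ESSENTIAL SPLICE SITE", "SPLICE SITE" or None.
    """
    if intron_start <= pos <= intron_end:
        # 1-based depth into the intron, measured from each end.
        from_start = pos - intron_start + 1
        from_end = intron_end - pos + 1
        if from_start <= 2 or from_end <= 2:
            return "ESSENTIAL SPLICE SITE"   # first or last two intron bases
        if 3 <= from_start <= 8 or 3 <= from_end <= 8:
            return "SPLICE SITE"             # 3 to 8 bp into the intron
        return None
    # Outside the intron: 1 to 3 bp into the flanking exon on either side.
    if 1 <= intron_start - pos <= 3 or 1 <= pos - intron_end <= 3:
        return "SPLICE SITE"
    return None
```

The essential check comes first, so in a very short intron a base that is within two bases of one end is never downgraded to SPLICE SITE because of its distance from the other end.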

REGULATORY REGION is defined for a SNP that occurs in a regulatory_feature in the Ensembl Funcgen database. These features come from sets such as miRanda, the VISTA enhancer set and cisRED motifs, as well as from the regulatory build predicted by the Ensembl Functional Genomics team.

In the variation feature table, there is a column called consequence type which holds the consequence type for that variation. If the variation affects multiple transcripts, only the consequence type considered most important is given in this column (see Table 2 for the order).
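
Picking the single most important consequence amounts to taking the entry that appears earliest in Table 2. A small Python sketch, with the list copied from Table 2 and a hypothetical function name most_important:

```python
# Consequence types in order of importance, as listed in Table 2.
CONSEQUENCE_ORDER = [
    "ESSENTIAL SPLICE SITE", "STOP GAINED", "STOP LOST", "COMPLEX_INDEL",
    "FRAMESHIFT CODING", "NON SYNONYMOUS CODING", "SPLICE SITE",
    "SYNONYMOUS CODING", "REGULATORY REGION", "WITHIN_MATURE_miRNA",
    "5PRIME UTR", "3PRIME UTR", "INTRONIC", "WITHIN_NON_CODING_GENE",
    "UPSTREAM", "DOWNSTREAM",
]
RANK = {name: i for i, name in enumerate(CONSEQUENCE_ORDER)}


def most_important(consequences):
    """Return the consequence that comes first in Table 2."""
    return min(consequences, key=lambda name: RANK[name])
```

For a variation that is INTRONIC in one transcript, STOP GAINED in another and UPSTREAM of a third, this returns STOP GAINED.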

The above description applies to all species.

For some species we have called our own SNPs which are not in dbSNP yet, so they need to be imported into the database. So far, there are about 10 species for which we have called SNPs that need to be imported and merged with dbSNP data.

Before importing, we need to move SNPs that are mapped to the reverse strand onto the forward strand: this is easier for the user to look at, and it also makes it easier to merge the called SNPs with those from dbSNP.

The reverse strand process includes:
	change strand/allele_string in variation_feature table
	change strand/flanking_sequence/up_seq/down_seq information in flanking_sequence table
	change allele in allele table
	change genotypes in individual/population genotype tables
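
As an illustration of the first two steps, the Python sketch below flips one record to the forward strand. It assumes strand is stored as 1 or -1 and that flipping means reverse complementing each allele and swapping the reverse complemented flanks; the function names and dictionary keys are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement a DNA string; other characters such as '-' are kept."""
    return seq.translate(COMPLEMENT)[::-1]


def to_forward(record):
    """Return a copy of a reverse-strand record expressed on the forward strand."""
    if record["strand"] == 1:
        return dict(record)
    alleles = record["allele_string"].split("/")
    return {
        **record,
        "strand": 1,
        "allele_string": "/".join(reverse_complement(a) for a in alleles),
        # On the opposite strand the flanks swap sides.
        "up_seq": reverse_complement(record["down_seq"]),
        "down_seq": reverse_complement(record["up_seq"]),
    }
```

The same reverse_complement helper would apply to the alleles in the allele table and to each allele of a genotype.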

Now merge the called SNPs with the dbSNP SNPs:

       If a called SNP has the same coordinates as a SNP that is already in the database, the called SNP is merged: it is put in the variation_synonym table and its record is deleted from the variation, variation_feature, flanking_sequence and allele tables. Its alleles are merged into allele_string in variation_feature if they are not in allele_string already, and its variation_id in the genotype tables is changed to the merged variation_id.
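
The merge rule can be sketched with in-memory dictionaries standing in for the tables. This Python example is only an illustration: merge_called_snp, the dictionary layout and the example name ENSSNP1 are hypothetical, and the deletion of the called SNP's own rows is implied by simply not storing them.

```python
def merge_called_snp(db_by_position, synonyms, called):
    """Merge one called SNP into the existing data and return the surviving name.

    db_by_position maps (seq_region, start, end) to a variation dict with
    keys name and allele_string; synonyms maps a variation name to a list
    of synonym names.
    """
    key = (called["seq_region"], called["start"], called["end"])
    existing = db_by_position.get(key)
    if existing is None:
        # No SNP at these coordinates yet: keep the called SNP as it is.
        db_by_position[key] = {"name": called["name"],
                               "allele_string": called["allele_string"]}
        return called["name"]

    # Same coordinates: the called SNP becomes a synonym of the existing one.
    synonyms.setdefault(existing["name"], []).append(called["name"])
    alleles = existing["allele_string"].split("/")
    for allele in called["allele_string"].split("/"):
        if allele not in alleles:
            alleles.append(allele)
    existing["allele_string"] = "/".join(alleles)
    return existing["name"]
```

The returned name is the one that genotype rows of the called SNP would be re-pointed to.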

After the merge, the allele and genotype tables are checked for inconsistent alleles/genotypes, for example an allele_string of A/C with genotypes containing any of A/C/T/G. In this case, we change the alleles/genotypes to match those in allele_string by reverse complementing them. If they still do not match allele_string after reverse complementing, they are left unchanged.
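
The consistency check can be sketched per allele as below, in Python. The function name reconcile_allele is hypothetical, and the sketch only handles one allele at a time; a genotype would be handled by applying it to each of its alleles.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reconcile_allele(allele_string, allele):
    """Make an allele agree with allele_string where that is possible.

    An allele already present in allele_string is returned as is. Otherwise
    its reverse complement is tried; if that does not match either, the
    original allele is returned unchanged.
    """
    valid = set(allele_string.split("/"))
    if allele in valid:
        return allele
    flipped = allele.translate(COMPLEMENT)[::-1]
    return flipped if flipped in valid else allele
```

With an allele_string of A/C, T becomes A and G becomes C. Note that for A/T or C/G variations every allele already matches, so a strand problem cannot be detected this way.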

The next step is to compress genotypes.

For the human genome, tagged SNPs and Linkage Disequilibrium (LD) are also calculated during the post-processing step.

For both the pairwise LD and the tagged SNP calculations, three PerlEgen populations (PERLEGEN:AFD EUR PANEL, PERLEGEN:AFD AFR PANEL and PERLEGEN:AFD CHN PANEL) and four HapMap populations (CSHL-HAPMAP:HapMap-CEU, CSHL-HAPMAP:HapMap-HCB, CSHL-HAPMAP:HapMap-JPT and CSHL-HAPMAP:HapMap-YRI) are used. Where a HapMap population consists of 30 mother-father-child trios from the CEPH collection, the child genotypes are excluded from the calculation.

Calculation of Linkage Disequilibrium (LD) 

The method used to calculate LD (r2 and D') is described in [?]. Within each 100 kb window, variations are ordered by position, and pairwise LD (r2) is then calculated for each pair of variations in each population. All r2 values are stored in the table pairwise ld, except those with r2 < 0.5 or based on genotype data with a population sample size < 40. This table is only used for the calculation of tagged SNPs. The r2 and D' values used in the Ensembl web interface are calculated on demand.
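
For orientation, the standard textbook definitions of D, D' and r2 for two biallelic sites are sketched below in Python. This is not the method referenced above: it assumes the haplotype frequency p_ab and the allele frequencies p_a and p_b are already known, whereas estimating haplotype frequencies from unphased genotypes is a separate step that is not shown. The function name ld_stats is hypothetical.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, |D'|, r2) for alleles A and B at two biallelic sites.

    p_ab is the frequency of the A-B haplotype; p_a and p_b are the
    frequencies of allele A and allele B.
    """
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2
```

Two sites whose alleles always occur together (p_ab = p_a = p_b = 0.5) give D' = 1 and r2 = 1, while independent sites (p_ab = p_a * p_b) give 0 for both.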

Calculation of tagged SNPs 

All variations with genotypes in the HapMap and PerlEgen populations are pulled out from the variation database and sorted by seq region id and seq region start.

For each variation, the Minor Allele Frequency (MAF) is calculated for all populations that have genotype data, and the variations are then ordered by MAF, highest first. For each variation in this order, r2 is pulled out from the table pairwise ld (see above) for this variation and all other variations within a 100 kb window, to check whether the r2 value between the two variations is greater than 0.99 (r2 > 0.99 means the two variations are in high LD). If so, the associated variation is removed from further calculation. In this process, variations that are in high LD with other variations but have a lower MAF are removed.

The remaining variations are called tagged SNPs and are stored in table tagged variation feature. 
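
The selection described above is a greedy pass over the variations in decreasing MAF order. The Python sketch below shows the idea for a single population and a single seq region; the function name select_tag_snps and the in-memory dictionaries (standing in for the pairwise ld table) are hypothetical.

```python
def select_tag_snps(maf, positions, r2, window=100_000, threshold=0.99):
    """Return the variations kept as tagged SNPs, highest MAF first.

    maf maps variation name to minor allele frequency, positions maps
    variation name to its start coordinate, and r2 maps
    frozenset({name1, name2}) to the pairwise r2 value.
    """
    removed = set()
    tags = []
    # Highest MAF first; ties are broken by name to keep the order stable.
    for v in sorted(maf, key=lambda name: (-maf[name], name)):
        if v in removed:
            continue
        tags.append(v)
        for w in maf:
            if w == v or w in removed:
                continue
            if abs(positions[w] - positions[v]) > window:
                continue
            if r2.get(frozenset((v, w)), 0.0) > threshold:
                removed.add(w)  # in high LD with v and not of higher MAF
    return tags
```

Pairs that are missing from r2 are treated as not being in high LD, which matches the fact that low r2 values are not stored in the pairwise ld table.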

Once the tagged variation calculation has finished, the database is ready to be copied to the staging server for health checks.

health check:
go to the web page: http://admin.ensembl.org/common/web/admin_home
and go through each species to check for any problems.

Also, if new data or displays have been added or changed, check the related pages to see whether the display is correct. There are several pages worth looking at:

#Human SNPView with minus strand, ancestral allele, synonym and consequence in tv, also should have LDView linked
http://staging.ensembl.org/Homo_sapiens/Variation/Summary?v=rs745764
#Another Human SNPView with synonym with Venter and Watson
http://staging.ensembl.org/Homo_sapiens/Variation/Summary?v=rs539157
#Resequencing alignment Ref:36/Venter/Watson
http://staging.ensembl.org/Homo_sapiens/Location/SequenceAlignment?r=20:38084753-38104752
#Good to see transcriptSNPView/GeneSNPView
http://staging.ensembl.org/Homo_sapiens/Variation/Mappings?r=1:228911917-228912917;v=rs699;vdb=variation;vf=724
Also prepare the emf dump files to be handed over to the web team.

The method to prepare emf dumps:

We don't have a variation directory on lustre
I have a directory: [yuans_lustre_home]/emf_dumps/
Under this directory, there are three directories: hum, rat and mouse.
Under each of the three species directories, there are two directories, one for the previous release and one for the current release (if there isn't one for the current release, create one).

- If there was no change in the variation data (no new import from dbSNP) and no new assembly, you can run the script
ensembl-variation/scripts/import/upgrade_resequencing_files.pl.
First copy all the files from the previous release into the new directory (cp [lustre_home]/emf_dumps/hum/release_53/* .)
Once the files are copied, you are ready to run the script:
bsub -q normal -o [lustre_home]/emf_dumps/hum/out_release_54 perl upgrade_resequencing_files.pl -path [lustre_home]/emf_dumps/hum/release_54 -version 54

You should repeat the process for all the resequencing species we dump data for (currently human, mouse and rat). This does not access ens-staging, so it is safe to run it while martians are hammering the server.

- If there is a change in the data, you will need to run a different script. First, create the directory structure again (e.g. mkdir release_54). Copy the README file (e.g. cp [lustre_home]/emf_dumps/rat/release_53/README .)
You are now ready to run the script:

perl dump_strain_wrapper.pl -species rat -dump_file '[yuans_lustre_home]/emf_dumps/rat/release_54/Rattus_norvegicus.RGSC3.4.54.resequencing.chromosome.'

Be careful: the "dump_file" argument must include both the path and the file name (which is fairly standard for all the EMF files).
Creating the new files takes around 20 hours for human and accesses the ens-staging server, so run it before the martians start, or after they finish, their build.

After submitting the jobs, check that they all finished successfully. If so, send an email to the release coordinator to say that the emf dumps are finished and where the dump files are.

Make patch sql files to patch the new schema/version for all databases:

Go to ensembl-variation/sql and copy patch_53_54_a.sql to patch_54_55_a.sql (i.e. copy the release 53_54 version patch to a 54_55 patch), then edit patch_54_55_a.sql and change the version to 55.
Also make other patch sql files to reflect any schema changes.

Once we have a number of sql patch files, apply the schema patches using this script:

The script is in the ensembl folder (in the core, not the variation):

ensembl/misc-scripts/schema_patch.pl

to run it, you should do something like:

schema_patch.pl --host ens-staging --user ****** --pass ******* --port 3306 --pattern '%variation_55_%' --schema 55 --patch_variation_database --dry_run=1 --interactive=0

With the dry_run option you can check which patches are going to be
applied and to which databases. When you are sure about the patches and
databases, remove the --dry_run option.


_________________________________________________________________________
The following is a list of commands for each of the steps described above:
_________________________________________________________________________

#check affyIDs in Human.pm and make sure to run it
bsub -q basement -o out_dbsnp_hum_51 /usr/local/bin/perl dbSNP.pl -species human -tmpdir [lustre_home]/tmp/hum -dbSNP_version b126 &

#if the basement queue takes a long time to start, Tim suggested using the nice +20 command or the yesterday queue

/usr/local/bin/perl parallel_post_process.pl -species human -tmpdir [lustre_home]/tmp/hum -tmpfile 'human.txt' -num_processes 20 -variation_feature -flanking_sequence -transcript_variation -variation_group_feature -top_level 1

/usr/local/bin/perl parallel_post_process.pl -species human -tmpdir [lustre_home]/tmp/hum -tmpfile 'human.txt' -check_seq_region_id -remove_wrong_variations -merge_rs_feature -reverse_things -read_updated_files -merge_ensembl_snps

#before merging with ensembl_snps, run var_db_check.pl first
/usr/local/bin/perl var_db_check.pl
#import ensembl_snps and genotypes from ensembl that cover dbSNP SNPs
bsub -q long -o out_import_hum_51 perl ./import_Sanger_database.pl -species human -tmpdir [lustre_home]/tmp/hum/sanger_import/venterl

#now then merge ensembl_snps
#then run parallel_post_process.pl to make allele unique in allele table

/usr/local/bin/perl parallel_read_coverage.pl -species human -tmpdir /[lustre_home]/tmp/hum -readdir [lustre_home]/hum/read_coverage -maxlevel 2

bsub -q basement -o [lustre_home]/tmp/hum/out_compress.txt /usr/local/bin/perl compress_genotypes.pl -species human -tmpdir [lustre_home]/tmp/hum/compress_gtype -tmpfile compress

#create_ld_table needs the compressed genotype table to be finished
bsub -q long -W12:00 -o [lustre_home]/tmp/hum/tag_snp/output_tag.txt perl create_ld_table.pl -tmpdir [lustre_home]/tmp/hum/tag_snp -tmpfile tag_snps.txt
bsub -q long -W12:00 -o [lustre_home]/tmp/hum/tag_snp/output_tag.txt perl tag_snps.pl -tmpdir [lustre_home]/tmp/hum/tag_snp -tmpfile tagsnps.txt


#if a new variation_feature table is generated from mapping, transcript_variation, flanking_sequence and tagged_variation_feature need to be rerun, using:
create table tmp_vf_id select ovf.variation_feature_id as old_vf_id, vf.variation_feature_id as vf_id
from old_variation_feature ovf, variation_feature vf, tmp_seq_region_old tso, tmp_seq_region ts
where ovf.variation_id=vf.variation_id and ovf.seq_region_id=tso.seq_region_id and vf.seq_region_id=ts.seq_region_id and tso.name=ts.name;

#after the database is ready, copy all databases to the staging server (or ask the release coordinator to copy the databases over), then:

#check the seq_region table for databases with a patched gene set:
select count(*) from seq_region sr, homo_sapiens_core_52_36n.seq_region sr1 where sr.seq_region_id=sr1.seq_region_id and sr.name=sr1.name;
select count(*) from variation_feature vf left join seq_region sr on sr.seq_region_id=vf.seq_region_id where sr.seq_region_id is null;

1: prepare emf dump files for human/mouse/rat (see Daniel's email about this)
2: make patch sql to patch new schema/version for all databases
3: run healthcheck to see if anything is wrong

 

