
#########################################################################

Achtung: a Hive pipeline exists now that does the same thing.

    You can run it (assuming you have ensembl, ensembl-hive and ensembl-compara installed into <path_to_your_ensembl_cvs_root>) like this:
    
    # 1. edit ensembl-compara/scripts/taxonomy/web_name_patches.sql to patch web names (if needed)

    # 2. edit ensembl-compara/scripts/taxonomy/ensembl_aliases.sql to ensembl aliases (if needed)

        # 3. initialize the pipeline:
    init_pipeline.pl Bio::EnsEMBL::Compara::PipeConfig::ImportNCBItaxonomy_conf -ensembl_cvs_root_dir <path_to_your_ensembl_cvs_root> -password <your_password>

        # 4. run the pipeline:
    beekeeper.pl ... (specific command line(s) will be printed by init_pipeline.pl)

#########################################################################

OTHERWISE do everything by hand (old instruction follows) :


This file describes how to "mirror" the NCBI taxonomy database by creating
and populating MySQL tables (to be used in a compara database), that retain
the necessary information to generate the taxonomy tree structure used by
the Bio::EnsEMBL::Compara::NCBITaxon object (and its corresponding adaptor
Bio::EnsEMBL::Compara::DBSQL::NCBITaxonAdaptor).

First of all, send an email to ensembl-admin and alert people that you
will be updating ncbi_taxonomy in ens-livemirror.

The NCBI taxonomy database dumps are apparently generated every hour.
We'll update our "mirror" at each release (every 2 months).

wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz

tar xzf taxdump.tar.gz

# It will generate several dmp files of which we want nodes and names

rm -f taxdump.tar.gz

# this should take about 10 seconds
perl -e 'while(<>) {chomp;s/\t\|$//;my @array = split(/\t\|\t/);my ($taxon_id, $parent_id, $rank, $genbank_hidden_flag) = ($array[0],$array[1],$array[2],$array[10]);print $taxon_id,"\t", $parent_id,"\t", $rank,"\t",$genbank_hidden_flag,"\n";}' nodes.dmp > ncbi_taxa_node.txt

# this should take about 10 seconds
perl -e 'while(<>) {chomp;s/\t\|$//;my ($taxon_id, $name, $unique_name, $name_class) = split(/\t\|\t/);print $taxon_id,"\t", $name,"\t",$name_class,"\n";}' names.dmp > ncbi_taxa_name.txt

mysql -hens-livemirror -uensadmin -pensembl -N -e "drop database ncbi_taxonomy"
mysql -hens-livemirror -uensadmin -pensembl -N -e "create database ncbi_taxonomy"

mysql -hens-livemirror -uensadmin -pensembl ncbi_taxonomy

# This two tables also exist in ensembl-compara/sql/table.sql

CREATE TABLE ncbi_taxa_node (
  taxon_id                        int(10) unsigned NOT NULL,
  parent_id                       int(10) unsigned NOT NULL,

  rank                            char(32) default '' NOT NULL,
  genbank_hidden_flag             boolean default 0 NOT NULL,

  left_index                      int(10) NOT NULL,
  right_index                     int(10) NOT NULL,
  root_id                         int(10) default 1 NOT NULL,
  
  PRIMARY KEY (taxon_id),
  KEY (parent_id),
  KEY (rank)
);

CREATE TABLE ncbi_taxa_name (
  taxon_id                    int(10) unsigned NOT NULL,

  name                        varchar(255),
  name_class                  varchar(50),

  KEY (taxon_id),
  KEY (name),
  KEY (name_class)
);

exit

# This should take about 60 seconds
mysqlimport -hens-livemirror -uensadmin -pensembl -L ncbi_taxonomy ncbi_taxa_node.txt
#ncbi_taxonomy.ncbi_taxa_node: Records: 573624  Deleted: 0  Skipped: 0  Warnings: 1720872
# v57
#ncbi_taxonomy.ncbi_taxa_node: Records: 540612  Deleted: 0  Skipped: 0  Warnings: 1621836

# This should take about 90-120 seconds
mysqlimport -hens-livemirror -uensadmin -pensembl -L ncbi_taxonomy ncbi_taxa_name.txt
#ncbi_taxonomy.ncbi_taxa_name: Records: 826789  Deleted: 0  Skipped: 0  Warnings: 0
#v56
#ncbi_taxonomy.ncbi_taxa_name: Records: 773164  Deleted: 0  Skipped: 0  Warnings: 0

mysql -hens-livemirror -uensadmin -pensembl -N -e "update ncbi_taxa_node set parent_id=0 where parent_id=taxon_id" ncbi_taxonomy

awk '{print $1,$3}' merged.dmp |awk '{print "insert into ncbi_taxa_name (taxon_id, name_class, name) values ("$2",\"merged_taxon_id\","$1");"}' > merged_nodes.sql

# This should take about 10 seconds
mysql -hens-livemirror -uensadmin -pensembl ncbi_taxonomy < merged_nodes.sql

mysql -hens-livemirror -uensadmin -pensembl -N -e "insert into ncbi_taxa_name (taxon_id, name, name_class) select distinct taxon_id, CURRENT_TIMESTAMP, 'import date' from ncbi_taxa_node where parent_id=0" ncbi_taxonomy

# Change some of the names to better reflect web names:

mysql -hens-livemirror -uensadmin -pensembl ncbi_taxonomy < ~/src/ensembl_main/ensembl-compara/scripts/taxonomy/web_name_patches.sql


# Build the left/right indexes and store them in the database. This indexing help speeding up subtree retrieval if needed.

# This takes about 2 hours
# /usr/local/ensembl/bin/perl ~/src/ensembl_main/ensembl-compara/scripts/taxonomy/taxonTreeTool.pl -url mysql://ensadmin:ensembl@ens-livemirror:3306/ncbi_taxonomy -index > leftright_indexing.err 2>&1 &
bsub -q yesterday -o leftright_indexing.out -e leftright_indexing.err -R"select[mem>3000] rusage[mem=3000]" -M3000000 '/usr/local/ensembl/bin/perl ~/src/ensembl_main/ensembl-compara/scripts/taxonomy/taxonTreeTool.pl -url mysql://ensadmin:ensembl@ens-livemirror:3306/ncbi_taxonomy -index'

# Started at Tue Oct 20 11:07:23 2009
# Results reported at Tue Oct 20 12:49:50 2009


# Import 'ensembl common name' and 'ensembl alias name' for Ensembl web team use.

mysql -hens-livemirror -uensadmin -pensembl ncbi_taxonomy < ~/src/ensembl_main/ensembl-compara/scripts/taxonomy/ensembl_aliases.sql

ensembl-compara/scripts/taxonomy/ensembl_aliases.sql needs to be updated as new species
are integrated into Ensembl.

The basic rule for "ensembl common name" is to use "genbank common
name" or any of "common name" available. If they don't exist you are
free to invent one. This is used in the download page 
http://www.ensembl.org/info/data/download.html

The other source of info is here:
http://www.intlgenome.org/viewDatabase.cfm

For "ensembl alias name", it is used at the top of the web page of
each species, when it says e.g. "e!EnsemblHuman" or
"e!EnsemblMouse". This should be agreed with the web team or just take
the one they have chosen from the pre site.

mysql -hens-livemirror -uensro ncbi_taxonomy -e "select * from ncbi_taxa_name where taxon_id=1"
+----------+---------------------+-----------------+
| taxon_id | name                | name_class      |
+----------+---------------------+-----------------+
|        1 | all                 | synonym         | 
|        1 | root                | scientific name | 
|        1 | 2009-10-20 11:04:59 | import date     | 
+----------+---------------------+-----------------+

