  COMPARA PRODUCTION CYCLE
============================


#============================================================================
#  Intentions (declaration of intentions)
#============================================================================

1. Once the release coordinator has sent the mail for the declaration of
intentions, ask the other members of the compara group (or related groups)
what they intend to produce. Sometimes you have to wait until the gene
builders have made their declarations, so that you know what compara will
need to produce and which schema changes may be needed. Compara has one
extra day to send its declaration of intentions, because we have to know
which new species are being added.

2. Submit the declaration of intentions using the admin website. Remember
that compara has one additional day after the declaration of intentions
deadline for the genebuilders.

3. Set up a web page with intentions in the Confluence wiki system to
allow easy tracking of the progress.

Note: The current (rel.63) compara_master database is sf5_ensembl_compara_master on compara1 

3.1  As many example scripts/files in this document are given relative to $ENSEMBL_CVS_ROOT_DIR,
    make sure this variable is defined in your terminal
    (the Hive now needs it to run anyway, so it should already be in your shell configs).
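
A quick way to confirm the variable is usable before running any of the commands below
is a small shell check. This is only a sketch: the function name check_root_dir is made up
for illustration and is not part of ensembl-compara.

```shell
# Hypothetical helper (not part of ensembl-compara): verify that
# ENSEMBL_CVS_ROOT_DIR is set and points at an existing directory.
check_root_dir() {
    if [ -z "${ENSEMBL_CVS_ROOT_DIR:-}" ]; then
        echo "ENSEMBL_CVS_ROOT_DIR is not set" >&2
        return 1
    fi
    if [ ! -d "$ENSEMBL_CVS_ROOT_DIR" ]; then
        echo "ENSEMBL_CVS_ROOT_DIR does not point at a directory: $ENSEMBL_CVS_ROOT_DIR" >&2
        return 1
    fi
    # Echo the value so it can be eyeballed before a long run.
    echo "ENSEMBL_CVS_ROOT_DIR=$ENSEMBL_CVS_ROOT_DIR"
}
```

Typical use would be "check_root_dir || exit 1" at the top of a wrapper script.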


#============================================================================
# NCBI taxonomy data (handover to compara)
#============================================================================

4. Update the NCBI taxonomy local "mirror". Follow the instructions in
    $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/scripts/taxonomy/README-taxonomy
There is now a hive pipeline that does all of these steps (see the beginning of the README mentioned above).


5. Update the ncbi_taxa_node and ncbi_taxa_name tables in the master DB using the
ncbi_taxonomy database located in mysql://ens-livemirror:3306/ncbi_taxonomy
This may be done by the person who runs the orthologs/paralogs, so check with them first.

time mysqldump -u ensro -h ens-livemirror -P3306 --extended-insert --compress --delayed-insert ncbi_taxonomy \
ncbi_taxa_node ncbi_taxa_name | mysql -u ensadmin -pxxxx -h compara1 sf5_ensembl_compara_master

Release 60: 1 min

#============================================================================
# Master compara database (handover to compara)
#============================================================================

6.0  Create a registry configuration file that will be used throughout the release process
    (make sure to have edited the release numbers in the paragraph below) :


    cat >~/release_63/reg_conf.pl <<EOF
use strict;
use Bio::EnsEMBL::Registry;
use Bio::EnsEMBL::Utils::ConfigRegistry;
use Bio::EnsEMBL::DBSQL::DBAdaptor;
use Bio::EnsEMBL::Compara::DBSQL::DBAdaptor;

Bio::EnsEMBL::Registry->load_registry_from_url(
  'mysql://ensro@ens-staging1/63');

Bio::EnsEMBL::Registry->load_registry_from_url(
  'mysql://ensro@ens-staging2/63');

Bio::EnsEMBL::Compara::DBSQL::DBAdaptor->new(
    -host => 'compara1',
    -user => 'ensadmin',
    -pass => 'ensembl',
    -port => 3306,
    -species => 'compara_master',
    -dbname => 'sf5_ensembl_compara_master');

Bio::EnsEMBL::Compara::DBSQL::DBAdaptor->new(
    -host => 'compara1',
    -user => 'ensadmin',
    -pass => 'ensembl',
    -port => 3306,
    -species => 'compara_62',
    -dbname => 'sf5_ensembl_compara_62');

Bio::EnsEMBL::Compara::DBSQL::DBAdaptor->new(
    -host => 'compara1',
    -user => 'ensadmin',
    -pass => 'ensembl',
    -port => 3306,
    -species => 'compara_63',
    -dbname => 'lg4_ensembl_compara_63');

1;
EOF
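
Since the release numbers in this file are edited by hand, a small check can catch a
load_registry_from_url() line that still points at the wrong release. The sketch below is
an illustration only: stale_registry_urls is a made-up name, and it inspects only the
mysql:// URLs, not the -dbname entries (compara_62 is legitimately the previous release).

```shell
# Hypothetical helper: print every mysql:// URL in a registry file whose
# trailing release number differs from the expected one.
#   $1 = registry configuration file
#   $2 = expected release number (e.g. 63)
stale_registry_urls() {
    # Extract the URLs up to the closing single quote, then drop the ones
    # that already end in "/<release>". "|| true" keeps the exit status 0
    # when nothing is stale.
    grep -o "mysql://[^']*" "$1" | grep -v "/$2\$" || true
}
```

For example, "stale_registry_urls ~/release_63/reg_conf.pl 63" should print nothing once
the file has been fully updated.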


6.1 Add the new entries to the genome_db and dnafrag tables of the master
compara database (compara1:3306/sf5_ensembl_compara_master). You can set up
your registry and use the
$ENSEMBL_CVS_ROOT_DIR/ensembl-compara/scripts/pipeline/update_genome.pl script. This script
sets the new genome_dbs as the default assemblies.
A new genome_db_id (with its dnafrags) has to be created for every new assembly or new species.
Sometimes this is done by the pairwise guys, as they want to start building earlier.

eg. 
perl update_genome.pl --reg_conf ~/release_63/reg_conf.pl --compara compara_master --species bushbaby

eg.
perl $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/scripts/pipeline/update_genome.pl \
 --reg_conf ~/release_63/reg_conf.pl \
 --compara compara_master --species "Ochotona princeps"

6.2 Add in extra non-reference patches 

To add extra non-reference patches to an assembly, eg human, you need the --force option, which adds only those dnafrags that are not already in the database.

perl $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/scripts/pipeline/update_genome.pl --reg_conf ~/release_63/reg_conf.pl --compara compara_master --species human --force


7. New method_link_species_set entries might be added using the
$ENSEMBL_CVS_ROOT_DIR/ensembl-compara/scripts/pipeline/create_mlss.pl script. The release
coordinator (or any team member) should create a new
method_link_species_set in the master database before starting a new
pipeline in order to get a unique method_link_species_set_id. Ideally
they can be created before starting to build the new database although
new method_link_species_sets can be added later on.
Sometimes the DNA alignment guys add the MLSS themselves. 
For the homologies this is done by the homologues guys

eg. for the pairwise alignment
 perl create_mlss.pl --method_link_type  BLASTZ_NET --genome_db_id 22,51 --source "ensembl"  --reg_conf ~/release_63/reg_conf.pl --compara compara_master

# --pw creates one MLSS for every pair of genome_db_ids in the list provided
# --sg creates one MLSS for each genome_db_id in the list on its own (singleton)
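
To make the two options concrete, the following sketch expands a comma-separated list the
way the comments above describe: every unordered pair for --pw, each id on its own for --sg.
It only illustrates the resulting species sets and is not how create_mlss.pl is implemented;
the function names are made up and the ids are assumed to be numeric.

```shell
# Hypothetical illustration of --pw: print every unordered pair of ids.
#   $1 = comma-separated list of numeric genome_db_ids
list_pairs() {
    ids=$(printf '%s' "$1" | tr ',' ' ')
    for a in $ids; do
        for b in $ids; do
            # Emit each pair once, smaller id first.
            if [ "$a" -lt "$b" ]; then
                echo "$a,$b"
            fi
        done
    done
}

# Hypothetical illustration of --sg: print each id on its own line.
list_singletons() {
    printf '%s\n' "$1" | tr ',' '\n'
}
```

With three ids, list_pairs yields three pairs and list_singletons yields three single-species
sets, which is why the number of pairwise MLSS entries grows quadratically with the species list.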



7.1. Create MLSS entries for homology side pipelines:

## generate the species_set and check it :

    export ALL_GENOMEDB_IDS=`mysql -hcompara1 -uensro sf5_ensembl_compara_master -N -e "select group_concat(genome_db_id order by genome_db_id) from genome_db where (assembly_default=1 and taxon_id!=0)" | cat`
    echo $ALL_GENOMEDB_IDS


## choose a temp. directory where the output will be generated:

    export MLSS_DIR="/tmp/mlss_creation"
    mkdir $MLSS_DIR


## run the loading script several times:

#orthologues
echo -e "201\n" | perl $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/scripts/pipeline/create_mlss.pl --f --reg_conf ~/release_63/reg_conf.pl \
--pw --genome_db_id "$ALL_GENOMEDB_IDS" 1>$MLSS_DIR/create_mlss.ENSEMBL_ORTHOLOGUES.201.out 2>$MLSS_DIR/create_mlss.ENSEMBL_ORTHOLOGUES.201.err

# paralogues btw
echo -e "202\n" | perl $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/scripts/pipeline/create_mlss.pl --f --reg_conf ~/release_63/reg_conf.pl \
--pw --genome_db_id "$ALL_GENOMEDB_IDS" 1>$MLSS_DIR/create_mlss.ENSEMBL_PARALOGUES.btw.202.out 2>$MLSS_DIR/create_mlss.ENSEMBL_PARALOGUES.btw.202.err

# paralogues wth
echo -e "202\n" | perl $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/scripts/pipeline/create_mlss.pl --f --reg_conf ~/release_63/reg_conf.pl \
--sg --genome_db_id "$ALL_GENOMEDB_IDS" 1>$MLSS_DIR/create_mlss.ENSEMBL_PARALOGUES.wth.202.out 2>$MLSS_DIR/create_mlss.ENSEMBL_PARALOGUES.wth.202.err

# proteintrees
echo -e "401\n" | perl $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/scripts/pipeline/create_mlss.pl --f --reg_conf ~/release_63/reg_conf.pl \
--name "protein trees" --genome_db_id "$ALL_GENOMEDB_IDS" 1>$MLSS_DIR/create_mlss.PROTEIN_TREES.401.out 2>$MLSS_DIR/create_mlss.PROTEIN_TREES.401.err

# nctrees
echo -e "402\n" | perl $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/scripts/pipeline/create_mlss.pl --f --reg_conf ~/release_63/reg_conf.pl \
--name "nc trees" --genome_db_id "$ALL_GENOMEDB_IDS" 1>$MLSS_DIR/create_mlss.NC_TREES.402.out 2>$MLSS_DIR/create_mlss.NC_TREES.402.err

# families
echo -e "301\n" | perl $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/scripts/pipeline/create_mlss.pl --f --reg_conf ~/release_63/reg_conf.pl \
--name "families" --genome_db_id "$ALL_GENOMEDB_IDS" 1>$MLSS_DIR/create_mlss.FAMILY.301.out 2>$MLSS_DIR/create_mlss.FAMILY.301.err



## if output/error files are ok, remove them all:

    rm -rf $MLSS_DIR

## ,otherwise *PANIC*
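
One way to make the "are the files ok" check less manual is sketched below. It assumes that
"ok" simply means every *.err file is empty, which may be too strict if the script prints
harmless warnings to stderr; check_mlss_logs is a made-up name.

```shell
# Hypothetical helper: report non-empty *.err files in a directory.
# Returns 0 only when at least one .err file exists and all of them are empty.
#   $1 = directory holding the create_mlss output/error files
check_mlss_logs() {
    dir=$1
    found=0
    bad=0
    for f in "$dir"/*.err; do
        # Skip the unexpanded pattern when no .err files exist.
        [ -e "$f" ] || continue
        found=1
        if [ -s "$f" ]; then
            echo "non-empty error log: $f" >&2
            bad=1
        fi
    done
    [ "$found" -eq 1 ] && [ "$bad" -eq 0 ]
}
```

It could then guard the cleanup, for instance: check_mlss_logs "$MLSS_DIR" && rm -rf "$MLSS_DIR"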


7.2. Update the species_set_tags.
Run the script: update_species_sets.pl

eg
perl $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/scripts/pipeline/update_species_sets.pl  --conf ~/release_63/reg_conf.pl --dbname compara_master > update_species_sets.out 2> update_species_sets.err

Add/Update any dna tags if necessary. There should be a tag for each of the multiple alignments. Note that fish will already have had a species_set_tag added by the above script.
eg
6way epo
INSERT INTO species_set_tag (species_set_id, tag,value) VALUES (33161,"name","primates");

11way epo
INSERT INTO species_set_tag (species_set_id, tag,value) VALUES (33162,"name","mammals");

16way mercator/pecan
INSERT INTO species_set_tag (species_set_id, tag,value) VALUES (33012,"name","amniotes");

33way low coverage
INSERT INTO species_set_tag (species_set_id, tag,value) VALUES (33163,"name","mammals");
 

7.3. Wait for the handover before starting to build the new database in case
any of the new species cannot make it. Don't forget to switch the
assembly_default values of genome_db in this case.
 [1] for species making it / used in the pipeline
 [0] for species not making it / or old assemblies
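
Switching the flag is a one-line UPDATE per genome_db. The sketch below only prints the
statement so that it can be reviewed before being piped into mysql against the master
database; assembly_default_sql is a made-up name, and the table and column names are taken
from the genome_db query in step 7.1.

```shell
# Hypothetical helper: print (not run) the SQL that sets assembly_default
# for one genome_db.
#   $1 = genome_db_id (numeric)
#   $2 = 1 (species/assembly is used) or 0 (not used / old assembly)
assembly_default_sql() {
    case "$1" in
        ''|*[!0-9]*) echo "genome_db_id must be numeric" >&2; return 1 ;;
    esac
    case "$2" in
        0|1) ;;
        *) echo "assembly_default must be 0 or 1" >&2; return 1 ;;
    esac
    echo "UPDATE genome_db SET assembly_default = $2 WHERE genome_db_id = $1;"
}
```

The genome_db_id passed in is whatever id the species that dropped out has in the master
database; any id used when trying this out is just a placeholder.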


8. Create the new database for the new release and add it to your
registry configuration file. Use the $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/sql/table.sql
file to create the tables and populate the database with the relevant
primary data and genomic alignments that can be reused from the
previous release. This can be done with the
$ENSEMBL_CVS_ROOT_DIR/ensembl-compara/scripts/pipeline/populate_new_database.pl script.  It
requires the master database, the previous released database and the
fresh new database with the tables already created. The script will
copy relevant data from the master and the old database into the new
one. 

mysql --defaults-group-suffix=_compara1 -e "CREATE DATABASE lg4_ensembl_compara_63"
mysql --defaults-group-suffix=_compara1 lg4_ensembl_compara_63 < $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/sql/table.sql

# NB: before you start copying, review the list of mlss_ids that are NOT going to be copied
# and synchronize it with the 'skip_mlss' meta entries in the master database.

time $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/scripts/pipeline/populate_new_database.pl \
    --reg-conf ~/release_63/reg_conf.pl --master compara_master --old compara_62 --new compara_63

# took 3 hours for rel. 60 (copied from rel.59)
# took 2:09 hours for rel.59 (copied from rel.58)
# took 2:15 hours for rel.58 (copied from rel.57)
# took 3 hours for rel.57 (copied from rel.pre57)
# took 3 hours for rel.pre57 (copied from rel.56)

If new method_link_species_sets are added in the master after this, you can run this
script again to copy the new relevant data. In that case, leave out the old
database (no --old option) so that the script does not try to copy the dna-dna
alignments and syntenies again.

$ENSEMBL_CVS_ROOT_DIR/ensembl-compara/scripts/pipeline/populate_new_database.pl \
    --reg-conf ~/release_63/reg_conf.pl --master compara_master --new compara_63


8.1. Copy the species_set_tag table -- NB: this may have generated the species_set_tag mess we have in rel.63!
Ideally this would be done by the populate_new_database script rather than by copying everything from the master, where all the old tags need to be deleted.
Until that is implemented, copy from the master:
INSERT INTO species_set_tag SELECT * FROM sf5_ensembl_compara_master.species_set_tag;


8.2. Add new species to phylogenetic tree
The easiest way to do this is to use the phylowidget.
From the EnsEMBL home page:
View full list of all Ensembl species
Species tree (Requires Java)

Select Arrow and select where you want the new species to go (use ncbi taxonomy or wikipedia etc) eg  Canis familiaris
Tree Edit
Add
Sister

Click on empty node
Edit Name (add new name)

The tree should appear in the Toolbox but if not, then save the tree
Copy the new tree into
    $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/scripts/pipeline/species_tree_blength.nh
cvs commit

9. Check that primary data (species data, dnafrags...) in the new compara DB
match the data in the corresponding core databases using the healthchecks. You
may have to edit the ensj-healthcheck/database.properties file. It should look
like this:

    host=compara1
    port=3306
    user=ensadmin
    password=***********

    # Database driver class - shouldn't need to be changed
    driver=org.gjt.mm.mysql.Driver

    # Master schema - see CompareSchema healthcheck
    # This setting is ignored if CompareSchema is not run
    master.schema=master_schema_38

    # Secondary database connection details
    secondary.host=ecs2
    secondary.port=3364
    secondary.user=ensro
    secondary.driver=org.gjt.mm.mysql.Driver


Make sure everything is up to date:
cd ensj-healthcheck
cvs update -Pd

Recompile:
export JAVA_HOME=/software/jdk
bsub -Is ant

Now you can run the compara-compara_external_foreign_keys healthchecks:

time ./run-healthcheck.sh -d lg4_ensembl_compara_63 -type compara -d2 .+_core_63_.+ compara_external_foreign_keys

...and correct mismatches if any!


#============================================================================
# Merging (pre-compara handover)
#============================================================================

10. Merge data.

10.0 Human patches for high coverage blastz-net alignments
     Find all the relevant mlss_ids and copy each of them with the --merge option of copy_data.pl
     eg   select distinct(method_link_species_set_id) from kb3_hsap_blastz_hap_60.genomic_align join dnafrag using (dnafrag_id) where genome_db_id=90 and method_link_species_set_id< 1000;   

     for i in 384 385 388 390 392 393 394 404 405 410 428 433 455 473; do echo $i; $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/scripts/pipeline/copy_data.pl --from_url mysql://ensro@compara1/kb3_hsap_blastz_hap_60 --to_url mysql://ensadmin:xxxxxxx@compara1/kb3_ensembl_compara_60 --mlss $i --merge; done


10.1 TRANSLATED_BLAT_NET, BlastZ-Net, Pecan, Gerp

    The removal of old data shouldn't be necessary unless the skip_mlss entries are not up to date.
    Removal of old data :
        # constrained elements are removed directly by mlss_id OF THE CONSTRAINED_ELEMENT:
            DELETE FROM constrained_element WHERE method_link_species_set_id=CE_MLSS_ID;

        # conservation_scores can only be removed by gab_id, which can be done together with gabs:
            DELETE gab, cs FROM genomic_align_block gab LEFT JOIN conservation_score cs ON gab.genomic_align_block_id=cs.genomic_align_block_id \
                WHERE gab.method_link_species_set_id=Main_MLSS_ID;

        # genomic_align_trees and genomic_align_groups are linked to genomic_aligns:
            DELETE ga, gag, gat FROM genomic_align ga LEFT JOIN genomic_align_group gag ON ga.genomic_align_id=gag.genomic_align_id \
                LEFT JOIN genomic_align_tree gat ON gag.node_id=gat.node_id \
                WHERE ga.method_link_species_set_id=Main_MLSS_ID;
        # in rel.pre57 didn't cleanly remove all gat entries, please check by node_id range.

    These data are usually in separate production databases. You can copy them using the
    copy_data.pl script in $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/scripts/pipeline. This script requires
    write access to the production database if the dnafrag_ids need fixing or the
    data must be copied in binary mode (this is required for conservation scores).
    Example:
          bsub -q yesterday -R "select[mem>5000] rusage[mem=5000]" -M5000000 \
            -I time $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/scripts/pipeline/copy_data.pl \
            --from_url mysql://ro_user@host/production_db \
            --to_url mysql://rw_user:password@host/release_db --mlss 268

    Note_1: bear in mind that even though constrained elements and conservation scores are associated
    with a multiple alignment, they are not copied automatically and have their own mlss_id,
    so you should copy them by a separate execution of copy_data.

    Note_2: for copying conservation scores you have to provide rw_user and password for --from_url,
    because the script needs to write into the production database.

    Note_3: multiple alignments THAT PRODUCE ANCESTRAL SEQUENCES (not all of them do), eg EPO alignments, also need
    their ancestral sequences copied to the core ancestral database. The copy_data script has been altered to automatically copy
    the ancestral dnafrags into the compara database.

    Note_4: for multiple alignments that SHOULD HAVE BEEN automatically copied from the prev.release,
    check that you have the ancestral dnafrags copied. Again this should now happen automatically.
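
Because the alignment, its constrained elements and its conservation scores each have their
own mlss_id (Note_1), one multiple alignment usually means several copy_data.pl runs. The
sketch below is a dry run that only prints one command per mlss_id, using just the options
shown in the example above; print_copy_commands is a made-up name.

```shell
# Hypothetical dry-run helper: print one copy_data.pl command per mlss_id.
#   $1 = value for --from_url
#   $2 = value for --to_url
#   remaining arguments = mlss_ids to copy
print_copy_commands() {
    from_url=$1
    to_url=$2
    shift 2
    for mlss_id in "$@"; do
        # The root-dir variable is deliberately left unexpanded so that the
        # printed line can be pasted into a shell where it is defined.
        printf '%s\n' "\$ENSEMBL_CVS_ROOT_DIR/ensembl-compara/scripts/pipeline/copy_data.pl --from_url $from_url --to_url $to_url --mlss $mlss_id"
    done
}
```

Remember that for conservation scores the --from_url user needs write access (Note_2), so
the URLs passed in may differ between runs.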


# rel.58
#           30m to copy over Human-vs-Marmoset
#
#           76m to copy over 33way LC EPO
#           23m to copy over 33way LC EPO constrained elements
#           70m to copy over 33way LC EPO conservation scores
#
# don't forget to copy ancestral dnafrags for 6-way primates! (done)
#
#           48m to copy over 6way primate EPO
#
# don't forget to copy ancestral dnafrags for 12-way eutherians! (done)
#
#           45m to copy over 12way eutherian EPO
#
#           43m to copy over 16way placental mercator/pecan
#           17m to copy over 16way placental mercator/pecan constrained elements
#            1h to copy over 16way placental mercator/pecan conservation scores

# rel.63
#           53m to copy over Human-vs-Marmoset LASTZ
#           54m to copy over Human-vs-Microbat LASTZ
#
#           54m to copy over 6way primate EPO
#           77m to copy over 12way mammal EPO
#
#           65m to copy over 19way amniota PECAN
#           10m to copy over 19way amniota PECAN GERP_CONSTRAINED_ELEMENTS
#           40m to copy over 19way amniota PECAN GERP_CONSERVATION_SCORES
#
#           373m to copy over 35way mammal LC EPO
#           66m to copy over 35way mammal LC EPO GERP_CONSTRAINED_ELEMENTS
#           61m to copy over 35way mammal LC EPO GERP_CONSERVATION_SCORES
#
#           90m to copy over 5way fish EPO
#           27m to copy over 5way fish EPO GERP_CONSTRAINED_ELEMENTS
#           23m to copy over 5way fish EPO GERP_CONSERVATION_SCORES


10.2 Syntenies

  Please refer to the documentation in the $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/scripts/synteny directory.

    1) First make sure the entries in ~/release_63/reg_conf.pl file point at the latest (staging) versions of the core databases.
    
    2) Then run something like:
    $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/scripts/synteny/LoadSyntenyData.pl --reg_conf ~/release_63/reg_conf.pl \
        --dbname compara_63 -qy "Homo sapiens" -tg "Callithrix jacchus" \
        /lustre/scratch101/ensembl/kb3/scratch/hive/release_63/kb3_hsap_cjac_synteny_63/synteny/all.100000.100000.BuildSynteny

    3) Check that the MLSS object in the method_link_species_set is fully defined (the loading script may need fixing?)


10.3 Put together ancestral database:
This is now done using the script copy_ancestral_core.pl. You will need to add any ancestral sequences from the previous release if these have not changed, in addition to adding any new ones. 

    1) Create a new core database:
        mysql --defaults-group-suffix=_compara1 -e 'CREATE DATABASE lg4_ensembl_ancestral_63'
        mysql --defaults-group-suffix=_compara1 lg4_ensembl_ancestral_63 <$ENSEMBL_CVS_ROOT_DIR/ensembl/sql/table.sql

    2) Copy over the ancestralsegment coord_system:
        mysql --defaults-group-suffix=_compara1 lg4_ensembl_ancestral_63 -e 'INSERT INTO lg4_ensembl_ancestral_63.coord_system SELECT * FROM sf5_ensembl_ancestral_62.coord_system'

    3) Add new data using copy_ancestral_core.pl script:

bsub -o copy_ancest_525.out -e copy_ancest_525.err -R "select[mem>5000] rusage[mem=5000]" -M5000000 perl $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/scripts/pipeline/copy_ancestral_core.pl --from_url mysql://ensadmin:ensembl@compara3/sf5_63compara_ortheus6way_ancestral_core --to_url mysql://ensadmin:ensembl@compara1/lg4_ensembl_ancestral_63 --mlss 525
# rel.63: 10 min.
bsub -o copy_ancest_524.out -e copy_ancest_524.err -R "select[mem>5000] rusage[mem=5000]" -M5000000 time perl $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/scripts/pipeline/copy_ancestral_core.pl --from_url mysql://ensadmin:ensembl@compara3/sf5_63compara_ortheus12way_ancestral_core --to_url mysql://ensadmin:ensembl@compara1/lg4_ensembl_ancestral_63 --mlss 524
# rel.63: 18 min.
bsub -o copy_ancest_528.out -e copy_ancest_528.err -R "select[mem>5000] rusage[mem=5000]" -M5000000 time perl $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/scripts/pipeline/copy_ancestral_core.pl --from_url mysql://ensadmin:ensembl@compara1/sf5_compara_5fishHMM_ortheus_ancestral_core --to_url mysql://ensadmin:ensembl@compara1/lg4_ensembl_ancestral_63 --mlss 528
# rel.63: 1.5 min.

    4) Do not forget to copy (from the previous release ancestral_core database) the sources that have not been updated in this release:

bsub -o copy_ancest_505.out -e copy_ancest_505.err -R "select[mem>5000] rusage[mem=5000]" -M5000000 time perl $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/scripts/pipeline/copy_ancestral_core.pl --from_url mysql://ensadmin:ensembl@compara1/sf5_ensembl_ancestral_62 --to_url mysql://ensadmin:ensembl@compara1/lg4_ensembl_ancestral_63 --mlss 505

    5) Check that you have done it correctly:
        SELECT left(name,12) na, count(*), min(seq_region_id), max(seq_region_id), max(seq_region_id)-min(seq_region_id)+1 FROM seq_region GROUP BY na;

======================================


10.4 Merging GeneTrees+Families+NCTrees together is now done by running a mini-pipeline.

    Go to $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/modules/Bio/EnsEMBL/Compara/PipeConfig and open the PipeConfig file MergeHomologySideTogether_conf.pm
    
    It has 6 sections for connecting to databases where you will have to change the names of the databases and possibly their locations:
        pipeline_db  - is your intermediate target (all protein side pipelines merged together)
        master_db    - is the main compara master
        prevrel_db   - should point to the previous release database
        genetrees_db - should point to the current GeneTrees pipeline database
        families_db  - should point to the current Families pipeline database
        nctrees_db   - should point to the current ncRNAtrees pipeline database

    Save the changes, exit the editor and run init_pipeline.pl with this file:
        init_pipeline.pl MergeHomologySideTogether_conf.pm -password <our_most_secret_password>

    Then run both -sync and -loop variations of the beekeeper.pl command suggested by init_pipeline.pl .
    This pipeline will create a database with protein side pipeline databases merged together.


10.5 Final merger of "protein side" into the release database is done by running another mini-pipeline.

    Go to (or stay in) $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/modules/Bio/EnsEMBL/Compara/PipeConfig and open the PipeConfig file MergeHomologyIntoRelease_conf.pm
    
    It has 3 sections for connecting to databases where you will have to change the names of the databases and possibly their locations:
        pipeline_db        - is a hive database that is only used for job tracking - it may/should be removed right after the pipeline is done
        merged_homology_db - is the result of the previous step (protein side databases merged together)
        rel_db             - is the main release database

    Save the changes, exit the editor and run init_pipeline.pl with this file:
        init_pipeline.pl MergeHomologyIntoRelease_conf.pm -password <our_most_secret_password>

    Then run both -sync and -loop variations of the beekeeper.pl command suggested by init_pipeline.pl .
    This pipeline will merge the protein side tables into the main release database.

10.6 Cleanup and CVS commit

    After you are happy about the result of both mergers
    you can drop both "compara_homology_merged" and "compara_full_merge" databases.

    Also, please commit the changes to the PipeConfig files that you have made.

11. Drop method_link_species_set entries for alignments which did not make it.

11.5 Check for method_links that do not have a corresponding method_link_species_set:
    SELECT ml.* FROM method_link ml LEFT JOIN method_link_species_set mlss ON ml.method_link_id=mlss.method_link_id WHERE mlss.method_link_id IS NULL;
In most cases they can be removed, but check with other members of Compara.

Removal of redundant method_link entries:
    DELETE ml FROM method_link ml LEFT JOIN method_link_species_set mlss ON ml.method_link_id=mlss.method_link_id WHERE mlss.method_link_id IS NULL;


12. Updating member.display_label and member.description fields for the members generated by EnsEMBL prediction.

This step has to happen ASAP, but AFTER the Core name projections have been done and BEFORE you analyze/optimize the tables.
(1. Matthieu/Leo runs the Homology pipeline
 2. The homology database is given to Rhoda, who uses it to run name projections on Core databases (display_labels and gene_descriptions change in Core databases)
 3. We use the information in Core databases (derived from Homology, i.e. Compara) to fix the display_labels and gene_descriptions for Compara
!MAKE SURE YOU ARE IN SYNC WITH THE REST OF THE WORLD!
)

I made a backup of the member table before starting:
RENAME TABLE member TO member_orig;
CREATE TABLE member LIKE member_orig;
INSERT INTO member SELECT * FROM member_orig;

12.1 Ensure your registry is correct

The registry file which is used should point to the server where all updated
Xref projections are located. This will mean staging servers 1 and 2. To load
the data from these two servers you can use the 
Bio::EnsEMBL::Registry->load_registry_from_multiple_dbs() call on both servers.

12.2 Run the command

bsub -o populate_member_display_labels.out -e populate_member_display_labels.err -R "select[mem>5000] rusage[mem=5000]" -M5000000 time perl $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/scripts/pipeline/populate_member_display_labels.pl --registry ~/release_63/reg_conf.pl --compara compara_63 --verbose
# rel59 real    9m44.754s
# rel60 real    10m23.002s
# rel62 real    11m
# rel63 real    11m

This will iterate over all GenomeDBs and update any label which is empty. Should
you wish to replace existing labels then rerun with the -replace switch. 

Ran a few tests:

mysql -hcompara1 -uensro kb3_ensembl_compara_60 -e "select count(*) from member_orig mo, member m where m.member_id=mo.member_id and m.display_label IS NOT NULL and mo.display_label IS NULL"
+----------+
| count(*) |
+----------+
|  1278014 |
+----------+

mysql -hens-livemirror -uensro ensembl_compara_59 -e "select stable_id, display_label from member m where m.display_label IS NOT NULL" | sort > 59.display_label.tsv 
mysql -hcompara1 -uensro kb3_ensembl_compara_60 -e "select stable_id,display_label from member m where m.display_label IS NOT NULL" | sort > 60.display_label.tsv 
join 59.display_label.tsv 60.display_label.tsv | wc -l
3017840
join -v 1 59.display_label.tsv 60.display_label.tsv | wc -l
42957
join -v 2 59.display_label.tsv 60.display_label.tsv | wc -l
126040

These look fine.
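
The three join calls above can be wrapped into one function so that the same comparison is
easy to repeat for the next release. This is a sketch under a few assumptions: the input
files are tab-separated (as produced by piping mysql output), the first column is the
stable_id, and compare_label_files is a made-up name.

```shell
# Hypothetical helper: count stable_ids shared between two label dumps and
# those present in only one of them.
#   $1 = older dump (tab-separated: stable_id, display_label)
#   $2 = newer dump (same format)
compare_label_files() {
    tab=$(printf '\t')
    old_sorted=$(mktemp)
    new_sorted=$(mktemp)
    # join needs both inputs sorted on the join field in the same collation.
    LC_ALL=C sort -t "$tab" -k1,1 "$1" > "$old_sorted"
    LC_ALL=C sort -t "$tab" -k1,1 "$2" > "$new_sorted"
    shared=$(LC_ALL=C join -t "$tab" "$old_sorted" "$new_sorted" | wc -l | tr -d ' ')
    only_old=$(LC_ALL=C join -t "$tab" -v 1 "$old_sorted" "$new_sorted" | wc -l | tr -d ' ')
    only_new=$(LC_ALL=C join -t "$tab" -v 2 "$old_sorted" "$new_sorted" | wc -l | tr -d ' ')
    rm -f "$old_sorted" "$new_sorted"
    echo "shared=$shared only_old=$only_old only_new=$only_new"
}
```

Called as "compare_label_files 59.display_label.tsv 60.display_label.tsv", it prints the
three counts on one line. Duplicate stable_ids in either file would inflate the shared count.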

Remember to DROP TABLE member_orig if you made a backup.

13. Run the healthchecks:

13.1 Run the healthchecks for ancestral database:

    time ./run-healthcheck.sh -d lg4_ensembl_ancestral_63 compara-ancestral
# rel.62: 4sec, all successful
# rel.63: 14sec, all successful after analyzing 3 tables

13.2 Update the max_alignment_length. You can use the corresponding healthcheck with the -repair option:

    time ./run-healthcheck.sh -d lg4_ensembl_compara_63 -type compara -repair Meta
# rel.61: 6m30, "no repair needed" (probably after Stephen has already done it)
# rel.63: 8m (full repair)

13.3 Now run the remaining healthchecks:

    time ./run-healthcheck.sh -d lg4_ensembl_compara_63 -type compara -d2 .+_core_63_.+ compara_external_foreign_keys
# in rel.56 everything passed apart from CheckTaxon - according to Javier in this particular case it was not a problem
# in rel.pre57 it took 20 minutes (all passed).
# in rel.57 it took ?? minutes ('genbank common name' for 4 species had to be copied from their 'ensembl common name' in ncbi_taxa_name table)
# rel.58:   22m
# rel.61:   32m, 1 failure
# rel.63:   23m, 1 failure ( taeniopygia_guttata_core_63_1: common_name::zebra finch is not in lg4_ensembl_compara_63 )

    time ./run-healthcheck.sh -d lg4_ensembl_compara_63 -type compara compara_genomic
# rel.58:   47m, 7 failures
# rel.61:   51m, 3 failures
# rel.63:   84m, (2.5 errors that are "ok")

    time ./run-healthcheck.sh -d lg4_ensembl_compara_63 -type compara compara_homology
# rel.58:   14m, 5 failures
# rel.61:   30m, 3 failures
# rel.62:   3m, 1 failure (CheckSpeciesSetTag, ok?)
# rel.63:   52m, success (after some fixing, of course)

...and correct mismatches if any!


14. Ask the release coordinator to point the test web server to the compara DB.

Upon confirmation from the release coordinator ask other members of Compara to go to
    http://staging.ensembl.org/
and check their data.


15. Run ANALYZE TABLE and OPTIMIZE TABLE commands for both databases produced
#
# This is required for the CopyDbOverServer script to work properly.
# So if you (suspect that you) have changed anything in the database, do run these two commands just in case -
# a dry run of each doesn't even take a minute.

time mysqlcheck --analyze --verbose --host=compara1 --port=3306 --user=ensadmin --password=ensembl --databases lg4_ensembl_compara_63
# rel.56    12min
# rel.pre57 30+105min
# rel.57    9+4+5min
# rel.58    3min
# rel.62    6min
# rel.63    25m

time mysqlcheck --optimize --verbose --host=compara1 --port=3306 --user=ensadmin --password=ensembl --databases lg4_ensembl_compara_63
# rel.56    2.5 hours
# rel.pre57 : took several iterations (not all tables were MyISAM initially), last one 132min.
# rel.57    2+1.6 hours
# rel.62    32min
# rel.63    61m

time mysqlcheck --analyze --verbose --host=compara1 --port=3306 --user=ensadmin --password=ensembl --databases lg4_ensembl_ancestral_63
# rel.57    took seconds to complete
# rel.62    1sec
# rel.63    3sec

time mysqlcheck --optimize --verbose --host=compara1 --port=3306 --user=ensadmin --password=ensembl --databases lg4_ensembl_ancestral_63
# rel.57    took seconds to complete
# rel.62
# rel.63    14m


16. WHEN EVERYBODY IS HAPPY ABOUT THE DATABASES, actually copy them to the two staging servers
This is done by a strange script with a clumsy interface, but take heart:

16.1. First, ssh into the DESTINATION machine:

    # NB: ask for the password on staging well in advance - there may be no one around you at the right moment!
ssh mysqlens@ens-staging

    # switch shells, as it is running tcsh by default
bash

16.2. Create a file that will contain one line with the source/destination parameters, like this:

cat <<EOF >/tmp/lg4_ensembl_compara_63.copy_options
#from_host      from_port   from_dbname                 to_host         to_port     to_dbname
#
compara1        3306        lg4_ensembl_compara_63      ens-staging     3306        ensembl_compara_63
compara1        3306        lg4_ensembl_ancestral_63    ens-staging     3306        ensembl_ancestral_63
EOF


16.3. Run the script to actually copy the data:

time perl ~lg4/work/ensembl/misc-scripts/CopyDBoverServer.pl -pass ensembl \
        -noflush /tmp/lg4_ensembl_compara_63.copy_options > /tmp/lg4_ensembl_compara_63.copy.err 2>&1

# copying of rel_56 took 2 hours (SUCCESSFUL for both databases - you should check the output file)
# copying of ensembl_compara_pre57 took 2 hours (SUCCESSFUL)
# copying of ensembl_compara_57 took 2 hours (SUCCESSFUL)
# copying of ensembl_ancestral_57 took 20 minutes (only SUCCESSFUL after analyzing/optimizing)
# copying of ensembl_compara_58 and ensembl_ancestral_58 together took 1:30h (SUCCESSFUL)
# copying of ensembl_compara_62 and ensembl_ancestral_62 took 2h38
# copying of ensembl_compara_63 and ensembl_ancestral_63 took 2h


16.4 Do the same thing in parallel on ens-staging2 :

ssh mysqlens@ens-staging2

bash

# NB: the ancestral database doesn't need to be copied to the second staging server

cat <<EOF >/tmp/lg4_ensembl_compara_63.copy_options
#from_host      from_port   from_dbname                 to_host         to_port     to_dbname
#
compara1        3306        lg4_ensembl_compara_63      ens-staging2    3306        ensembl_compara_63
EOF

time perl ~lg4/work/ensembl/misc-scripts/CopyDBoverServer.pl -pass ensembl \
        -noflush /tmp/lg4_ensembl_compara_63.copy_options > /tmp/lg4_ensembl_compara_63.copy.err 2>&1
# copying of ensembl_compara_58 took 1:15h (SUCCESSFUL)
# copying of ensembl_compara_62 took 1:34h (SUCCESSFUL)
# copying of ensembl_compara_63 took 3:44h (SUCCESSFUL)


#### At this point you are "handing over the databases" to the person running Compara Mart (usually Rhoda).
#### Let the main Release Coordinator and Rhoda know about it.
#### But your job is not over yet! Carry on:


17. Dump both the current and the previous release schemas and compare them.

17.1 Create a patch to convert a compara DB from the previous release to the new one.

# The patch should include at least an update of the schema_version in the meta table!
#
mysqldump --defaults-group-suffix=_compara4 --no-data --skip-add-drop-table sf5_ensembl_compara_62 | sed 's/AUTO_INCREMENT=[0-9]*\b//' >old_schema.sql
mysqldump --defaults-group-suffix=_compara4 --no-data --skip-add-drop-table lg4_ensembl_compara_63 | sed 's/AUTO_INCREMENT=[0-9]*\b//' >new_schema.sql
#
sdiff -b old_schema.sql new_schema.sql | less
#
# (create the patch_62_63.sql file by hand)
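
A possible starting point for the hand-written patch is sketched below. It only creates the
file if it does not exist yet, and it contains nothing but the schema_version update that
every patch needs; the real schema changes found with sdiff still have to be added by hand.
The comment lines inside the file are illustrative.

```shell
# Start patch_62_63.sql with the mandatory schema_version update,
# unless the file already exists (never overwrite a hand-written patch).
patch_file=patch_62_63.sql
if [ ! -e "$patch_file" ]; then
    cat > "$patch_file" <<'EOF'
# patch_62_63.sql: upgrade a compara schema from release 62 to release 63
# (add the ALTER/CREATE/DROP statements found with sdiff above the final UPDATE)

UPDATE meta SET meta_value = 63 WHERE meta_key = 'schema_version';
EOF
fi
```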

17.2 Generate an empty database from the old schema, apply the patch, dump it, and check that you get the new schema.

mysql --defaults-group-suffix=_compara4 -e 'create database lg4_schema_patch_test'
mysql --defaults-group-suffix=_compara4 lg4_schema_patch_test < old_schema.sql
mysql --defaults-group-suffix=_compara4 lg4_schema_patch_test < patch_62_63.sql
mysqldump --defaults-group-suffix=_compara4 --no-data --skip-add-drop-table lg4_schema_patch_test | sed 's/AUTO_INCREMENT=[0-9]*\b//' >patched_old_schema.sql
#
sdiff -bs patched_old_schema.sql new_schema.sql | less
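
Note that the two dumps will never be byte-identical: mysqldump writes "--" comment lines
with the database name and the dump date. A rough pass/fail check that ignores those
comment lines could look like this (schemas_match is a hypothetical helper; the sdiff
output above remains the thing to read when it reports a difference):

```shell
# Hypothetical helper: compare two schema dumps, ignoring mysqldump "--" comment
# lines and whitespace-amount changes. Returns 0 if they match, 1 otherwise.
schemas_match() {
    sm_a=$(mktemp) && sm_b=$(mktemp) || return 2
    grep -v '^--' "$1" > "$sm_a" || true
    grep -v '^--' "$2" > "$sm_b" || true
    diff -b "$sm_a" "$sm_b" > /dev/null && sm_status=0 || sm_status=1
    rm -f "$sm_a" "$sm_b"
    return $sm_status
}
```

Usage: schemas_match patched_old_schema.sql new_schema.sql && echo "patch reproduces the new schema"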

17.3 CVS commit the patch.


18. Update the files in the $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/sql directory:

cd $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/sql/

mysql --defaults-group-suffix=_compara1 -N -e "SELECT * FROM genome_db order by genome_db_id asc" lg4_ensembl_compara_63 > genome_db.txt
mysql --defaults-group-suffix=_compara1 -N -e "SELECT * FROM method_link order by method_link_id asc" lg4_ensembl_compara_63 > method_link.txt

# You will have to change the default schema_version in the table.sql file (last line of the file)

# CVS commit these files.


19. Update files in public-plugins/ensembl/htdocs/info/docs/compara

You might need to update the create_mlss_table.conf file with any newly added species.
Or you can use the order of species given in the species tree:

  $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/docs/create_mlss_table.pl \
    --reg_conf ~/release_63/reg_conf.pl --dbname compara_63 --method_link PECAN \
    --list --species_tree $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/scripts/pipeline/species_tree_blength.nh > pecan.inc

  $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/docs/create_mlss_table.pl \
    --reg_conf ~/release_63/reg_conf.pl --dbname compara_63 --method_link EPO \
    --list --species_tree $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/scripts/pipeline/species_tree_blength.nh > epo.inc

  $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/docs/create_mlss_table.pl \
    --reg_conf ~/release_63/reg_conf.pl --dbname compara_63 --method_link EPO_LOW_COVERAGE \
    --list --species_tree $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/scripts/pipeline/species_tree_blength.nh > epo_lc.inc

  $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/docs/create_mlss_table.pl \
    --reg_conf ~/release_63/reg_conf.pl --dbname compara_63 --method_link TRANSLATED_BLAT_NET \
    --trim --species_tree $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/scripts/pipeline/species_tree_blength.nh > tblat_net.inc

# for blastz_net/lastz_net produce a simple list, because the table is getting too big:
  $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/docs/create_mlss_table.pl \
    --reg_conf ~/release_63/reg_conf.pl --dbname compara_63 --method_link BLASTZ_NET --method_link LASTZ_NET \
    --blastz_list --species_tree $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/scripts/pipeline/species_tree_blength.nh > blastz_net.inc

  $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/docs/create_mlss_table.pl \
    --reg_conf ~/release_63/reg_conf.pl --dbname compara_63 --method_link SYNTENY \
    --trim --species_tree $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/scripts/pipeline/species_tree_blength.nh > synteny.inc
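
Because each command above redirects its output with ">", a run that fails still leaves an
(empty) .inc file behind. A quick sanity check before committing, as a sketch
(check_inc_files is a hypothetical helper; run it in the directory holding the .inc files):

```shell
# Hypothetical helper: report any of the six expected .inc files that is
# missing or empty. Returns 0 only if all of them exist and are non-empty.
check_inc_files() {
    cif_bad=0
    for cif_f in pecan.inc epo.inc epo_lc.inc tblat_net.inc blastz_net.inc synteny.inc; do
        if [ ! -s "$cif_f" ]; then
            echo "missing or empty: $cif_f"
            cif_bad=1
        fi
    done
    return $cif_bad
}
```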

CVS commit these files.


20. Update the schema and tutorial documentation files compara_schema.html and compara_tutorial.html in this directory:
    ensembl-webcode/htdocs/info/docs/api/compara/
    (previously - public-plugins/ensembl/htdocs/info/docs/api/compara )

CVS commit the changes.


21. Remind Javier to do the DNA part of the dumps.


22. Run the pipelines for dumping GeneTrees and ncRNAtrees yourself:

22a. Go to $ENSEMBL_CVS_ROOT_DIR/ensembl-compara/modules/Bio/EnsEMBL/Compara/PipeConfig and open the PipeConfig file DumpTrees_conf.pm

    Check that you are happy about all parameters:
        rel             being the current release number (sometimes this is the only thing to change),
        rel_db          pointing at the release database,
        target_dir      suitable for creating and storing the dumps in
    If not, make your changes, save the config file and exit the editor.

    Run init_pipeline.pl with this file:
        init_pipeline.pl DumpTrees_conf.pm -tree_type protein_trees -password <our_most_secret_password>

    Then run both -sync and -loop variations of the beekeeper.pl command suggested by init_pipeline.pl .
    This pipeline will produce protein_tree dumps in the directory pointed at by 'target_dir' parameter.
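
    The two beekeeper.pl steps can be wrapped as below. This is only a sketch: run_hive_pipeline
    is a hypothetical helper, and the URL is whatever init_pipeline.pl printed for your pipeline
    database (copy it from that output rather than typing it by hand).

```shell
# Hypothetical wrapper: synchronize the pipeline database once, then start the
# beekeeper loop. $1 is the pipeline database URL printed by init_pipeline.pl.
run_hive_pipeline() {
    beekeeper.pl -url "$1" -sync && beekeeper.pl -url "$1" -loop
}
```

    Usage: run_hive_pipeline "mysql://USER:PASSWORD@HOST:PORT/PIPELINE_DB_NAME"   (placeholders, not real values)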

# rel_60: took 5 hours on a "bad lustre" day. On such days you're better off pointing 'target_dir' at your home directory!
    
22b. Stay in the same directory, but now create another pipeline from the same config file:
        init_pipeline.pl DumpTrees_conf.pm -tree_type ncrna_trees -password <our_most_secret_password>

    Then run both -sync and -loop variations of the beekeeper.pl command suggested by init_pipeline.pl .
    This pipeline will produce ncrna_tree dumps in the directory pointed at by 'target_dir' parameter.

# rel_60: took 8 minutes. The other extreme.

22c. Commit the DumpTrees_conf.pm file into CVS if you'd like to keep the changes.

22d. Report the locations of the dumps to the main Release Coordinator and/or the web people.

