
The chain/net process assumes that you have already run a pairwise comparison pipeline
between 2 genomes (whole or partial) usually using blastz or blat aligner. Please start reading
README-pairlaigner first

Include in your PERL5LIB, the ensembl-analysis code

      cvs -d :ext:cvs.sanger.ac.uk:/nfs/ensembl/cvsroot co  ensembl-analysis

in tcsh
    setenv BASEDIR   /some/path/to/modules
    setenv PERL5LIB  ${PERL5LIB}:${BASEDIR}/ensembl-analysis/modules

in bash
    BASEDIR=/some/path/to/modules
    PERL5LIB=${PERL5LIB}:${BASEDIR}/ensembl-analysis/modules

1- Choose a working directory with some disk space
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can decice to use the same directory that was used for the pair aligner pipeline.
But sometimes the workers output can get huge, so we'd recommend that you create/use
a new directory.

mkdir -p /acari/work7a/abel/hive/abel_compara_human_cow_blastz_27cn/workers

This directory needs to be set in 'hive_output_dir' variable in the hive
configuration file. See below. If not set, all STDOUT/STDERR goes to /dev/null. :)

# IMPORTANT: The hive system can generate an awful lot of log outputs that are dumped in
# the hive_output_dir. When a pipeline runs fine, these are not needed and can take a lot of
# disk space as well as generate a large number of files. If you don't want log outputs (recommended),
# then just don't specify any hive_output_dir (delete or comment the line or set to "" if you don't want
# any STDOUT/STDERR files)

4- Copy and modify your hive config file
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

cd /acari/work7a/abel/hive/abel_compara_human_cow_blastz_27cn

cp ~/src/ensembl_main/ensembl-compara/scripts/pipeline/compara-hive-chain-net.conf.example hive.conf
<editor> hive.conf

you may need to change the database names, port, dbnames, and the
paths to the 'hive_output_dir' to
/acari/work7a/abel/hive/abel_compara_human_cow_blastz_27cn/workers

Make directories to store the nib files:
mkdir -p /acari/work7a/abel/hive/abel_compara_human_cow_blastz_27cn/cow_nib_for_chain
mkdir -p /acari/work7a/abel/hive/abel_compara_human_cow_blastz_27cn/human_nib_for_chain

These need to be set in the hive configuration file as the 'dump_loc' variable.

At the end of the pipeline, some basic healthchecks are performed. Currently these are 'pairwise_gabs' which checks that for each genome_align_block there are 2 genomic_aligns and 'compare_to_previous_db' which checks that the number of genomic_align_blocks after netting is within $max_percentage_diff of a previous database. If no HEALTHCHECK module is defined, the parameters are automatically created. The method_link_type comes from the output_method_link type from NET_CONFIG and the current_genome_db_ids come from the DNA_COLLECTION. The default previous_db_url is mysql://ensro@ens-livemirror and the default max_percentage_diff is 20. It is possible to either define all the params for the 2 healthchecks or only the previous_db_url for the compare_to_previous_db healthcheck and in this case the method_link_type and current_genome_db_ids will be created automatically.


5- Run the configure scripts
   ~~~~~~~~~~~~~~~~~~~~~~~~~
The following script is in ensembl-compara/scripts/pipeline (should be in your PATH)

loadChainNetSystem.pl -conf hive.conf

The loadChainNetSystem script creates the analysis entries for the processing
system, and creates both the dataflow rule and the analysis control rules.
It also initializes the analysis_stats row for each analysis.  These row hold
information like batch_size, hive_capacity, and run-time stats that the Hive's
Queen will update.

This script may give you warnings if the output directories are not available or if
it's unable to connect to core databases.

At this point the system is ready to run

IMPORTANT NOTES:
Few points on specific analysis (logic_name provided here)
- DumpLargeNibForChains dumps each chromosome as a whole in a specific format (.nib files)
  This analysis needs a lot of memory. So you will need to either run it locally on the ecs
  machine you're logged in or specify -lsf_options in the beekeeper to set some resource
  requirements
- AlignmentNets-xxxx in cases where the genomic region to net is very large can take a lot
  of memory. This happens usually when comparing very close species (mouse/rat). Apply the same
  approach as for DumpLargeNibForChains.
- SetInternalIds is optional and if defined in the conf file will ensure that
  the  BLASTZ_NET genomic_align_block_id and genomic_align_ids to be unique 
  across a release by adding $mlss_id*10**10

For the above specific problems, avoid running the beekeeper.pl with -loop alone. Prefer
the addition of -logic_name as well to restrict the loop to particular analysis. This will
allow to stop the pipeline at specific stages, where you can then restart it with customized
resource requirements for the next step.

6- Run the beekeeper
   ~~~~~~~~~~~~~~~~~
The following scripts are in ensembl-hive/scripts (should be in your PATH)
beekeeper.pl -url mysql://ensadmin:xxxx@ia64g:3306/abel_compara_human_cow_blastz_27 -loop

where xxxx is the password for write access to the database

for more details on controling/monitoring the hive system see 
beekeeper -help

7- Healthchecks
   ~~~~~~~~~~~~
The healthchecks are run at the end of the pipeline. If a healthcheck fails, it is important to discover why it has failed. It is possible to do this either by looking at the output file if hive_output_dir is defined or by running the job again interactively using runWorker.pl.
for example
runWorker.pl -bk LSF -url mysql://ensadmin:xxxx@ia64g:3306/abel_compara_human_cow_blastz_27  -job_id 115 -outdir '' -no_cleanup

where the job_id is the analysis_job_id of the job you wish to rerun. For more information on the options of runWorker.pl see
runWorker.pl -help

8- Pair Aligner Config
   ~~~~~~~~~~~~~~~~~~~
If defined, this will automatically update the pair aligner configuration database which keeps track of what parameters were used for each pipeline run and calculates some simple statistics, such as percentage coverage. It will also produce the web pages used in the web documentation. It requires the species used as the reference, the bed files for the pairwise genomes and coding exons, the location of the configuration database, the location of the configuration file and the path the perl modules to access the the compare_beds.pl script.
