
*==========================================================*
|            -Changelog.txt file for HarvestMan-           |
|                                                          |
|           URL: http://www.harvestmanontheweb.com         |
*==========================================================*
Version 2.0 b1
Release Date: TBD

Release Focus: Major Enhancements, Bug-fixes

New Features
============
1. A new plugin architecture which allows to extend
the program very easily by defining simple plugin modules.
2. Integration with swish-e which allows HarvestMan to
run as an external program for feeding content to swish-e.
Swish-e support is implemented as a standard plugin.
3. Integration with Lucene using PyLucene which allows
to use HarvestMan to crawl and index web pages using
lucene. Lucene support is implemented as a standard plugin.
3. A simulate feature which allows to simulate crawls 
without performing any actual downloads. Simulate feature
is implemented as plugin.
4. A session-save & restore feature which allows to save
state of HarvestMan in a session file upon crashes or manual
kills. The program can then resume from this state on the
next invocation.
5. A thread check-pointing and migration feature which
automatically moves data from threads which die to new
threads which allows the program to handle exceptions very
well.
6. Command line options can be now mixed with a configuration
file when using the -C option.
7. New command line option to read a list of URLs and use
them for crawling.
8. Ability to parse stylesheets and download URLs and
other stylesheets referred in them.
9. Support for META robot tags (INDEX, FOLLOW, NOINDEX, NOFOLLOW)
10. Support for META REFRESH tags.
11. Support for HTML embed, object tags.
12. Support for HTTP compression.
13. Support for byte-range downloading in threads. Program
attempts to split large files to several chunks across multiple
threads.
14. Support for HTML entities in URLs.

Other Changes
=============
1. Rewrote logger module to use Python logging API. 
2. All print statements replaced to use logconsole
method of the logger.
3. Log for the same project run after the first time,
appends to the same log file, instead of deleting
old logs.
4. No-crawl logic completely rewritten. It uses
a separate flow of information from the regular
crawler.
5. Program now creates a ".harvestman" folder in user's 
home directory on POSIX systems. Session files are saved
under this directory. Also a copy of the config.xml
file is stored there. The program tries to load this
file if it cannot find a config.xml anywhere else.
6. Replaced use of md5 module with sha module for
checking digest sums since it is more secure.
7. Removed lock objects in datamgr module since 
operations on basic data types is automatically
synchronized by the GIL.
8. Removed strptime module since it is no longer used.
9. Rewrote headers of all modules with correct module
names, author information and copyright information.
10. A __version__ and __author__ attribute added to all
modules.
11. Renamed certain classes which were starting with lower
cases. All classes now start with upper-case letters.
12. The following command line options have been added,
removed or updated.

  1. --subdomain, -S: Changed to -s.
  2. -m,--simulate: New option to simulate crawling.
  3. -g,--plugin: New option to apply a given plugin.
  4. -S,--savesession: New option to turn session saver feature on/off.
  5. --urllistfile, --urltreefile: Removed, can be used by combining with config.xml
    option.
  6. -F, --urlfile: Read list of start URLs from a file (New option).

13. Default value for maximum file size (for single file) increased
to 5MB (was 1 MB earlier).
14. Changed sleep times inside the loops of fetcher/crawler thread
clases to random times ranging from 0 to 0.3 second. This allows
for more better resource allocation and pooling.
15. Sleep times is available as a configuration option. This can
be used to increase sleep-times for slow/traffic intense websites.
16. Caching now supports HTTP 304 for projects run a second time
with datacache enabled. If datacache is available, HarvestMan will
use the HTTP header 'If-Modified-Since' to see if the file on 
server-side is modified.
17. Changes in data cache format. Data cache for each URL is
dumped separately into a <hash>.data file where <hash> is the 
hash value of the associated URL. The .data files are stored
in a .cache-data folder in the project's cache directory.

Bug-fixes
=========
1. Fixed 100% CPU utilization bug. Bug #005972
2. Fixed bug in single-thread mode. 
3. Fixed many bugs in urlparser module - Correctly interpreting
URLs with entities (&amp; etc), resolving URLs with ".." inside
them etc.
4. Bug-fix in rules module to speed up rules checks
5. Bug-fix to avoid exceptions thrown at program exit (interpreter 
shutdown)
6. Fixed a bug which was causing a fetcher thread to download
the same content twice.
7. As usual, localizing of URLs gets many bug-fixes :)
8. Fixed a minor bug in datamgr which was incorrectly printing
download rate stats as bps (should be kbps).
9. Fixed problem with URLs not downloaded when base url is
redefined. This was causing a deep-crawling problem for
some sites.
10. Fixed bugs in re-resolving of URLs in urlparser module.
