/**
 * \file SCRIPT-tech
 *
 * see script_lua.c
 * see fetchnews.c::getarticle()
 * see store.c::store_stream()
 *
 * for testing all this,
 * see script-test-store.c
 * see Makefile-script-test
 *
 * clemens fischer <ino-news@spotteswoode.dnsalias.org>
 */

= The example implementation of "scripthooks.lua" =

The example scripthooks.lua is centered around a couple of tables
specifying rules for filtering groups and the logic to evaluate them
from the callouts of the lua backend. No support for PCRE regular
expressions or networking has been included, because the various
platforms do not have uniform access to the vast library of lua add-ons
available from e.g. lua-users.org or luaforge.net. Nevertheless, all the
functionality of leafnode's filtering engine has been retained and
enhanced here and there, with one exception: the size limit is measured
in bytes rather than in lines. If you need access to the "Lines:"
header, use e.g. "lines = tonumber(Article.s_header.lines)". With the
HTML removal hook activated, which strips most of the useless HTML, it
does not make sense to even look at it.

All the configuration tables are indexed by lua regular expressions
matching group names. There are two types of tables: one returns exactly
one result for the longest match, the other enumerates all matches in
the order of decreasing match length. This makes it possible to specify
defaults while customizing certain groups.

Every article starts out with some score limit.  Hooks can add to
or subtract from it.  If the score reaches zero, the article is
considered spam.  Users can decide whether this just means redirecting
it to some local spam group or rejecting it entirely.  A few points
below the spam_cutoff lies the reject_cutoff.  Any article scoring at or
below this limit is rejected with a message in the log.

Score limits can be defined explicitly per group or defaulted with
shorter matches of the group's name.  The limit called "." matches any
group name.  The same mechanism is applied to the values of
maximum cross-posts, maximum size, maximum age etc.  It makes sense to
write custom hooks in two parts:  one is the code, the other a table,
indexed by group name, which holds the per-group parameters.

As lua regular expressions have no alternation operator (normally "|" in
other implementations), an operator named "Art" (for "Art"icle data)
has been devised. It is used in the "ScoredMatches" table, which lists
regular lua functions, each eventually resulting in a score value. This
mechanism makes it possible to combine any number of header or body
matches with a user-defined score.

Score functions in table ScoredMatches typically look like this:

ScoredMatches = {
    ["."] = {
        function(g, a, s)
            -- "-" is a magic character in lua patterns, so escape it
            if Art(a, "from", {"%@my%-email%-domain%."}) then
                return 2*max_scorelimit
            else return 0 end
        end,
        -- more functions here
    },
    ["^sci%.electronics%."] = {
        function(g, a, s)
            if Art(a, "from", {"someJerk", "B...%@hotmail%.com"}) then
                return reject_cutoff
            else return 0 end
        end,
    },
    -- yet more newsgroups with score functions
}

"ScoredMatches" is a regular hook table and is run by the function
"run_hooks", which gives every function in the table the current group,
the article and some specific article data as arguments. It is a
"multi-value" table in that all items selected by matching group names
are included, in the order of decreasing match length. So, if the
current group's name is "alt.usenet.kooks", only the ["."] entry
matches; it scores my own articles up by twice the maximum score limit,
in case some other hook scores them down. If the group's name is
"sci.electronics.design", both of the hooks shown run, the more specific
one first. Look at how the "Art" operator is used:
"Art(a, "from", {"someJerk", "B...%@hotmail%.com"})" returns "true" when
the article's "From:" header matches either "someJerk" or
"B...%@hotmail%.com". Should this happen, the function returns the
"reject_cutoff" value, making the article disappear or be refiled into a
spam newsgroup. This decision is implemented in one of the user
callouts, which is "fetchnews_headertxt_bodytxt" in the example
implementation.

Instead of "from", you can use any other header name or the special
constant "search_body", which is just a pattern that is not a legal
header name, currently defined as ".X.".  "Art" has to be able to tell
whether a match fails because the header is there but does not match, or
because the article does not have a header with this name in the first
place.  This keeps it from looking into the body when it cannot find the
header specified.
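
The behaviour just described can be sketched in a few lines.  This is
not the code from scripthooks.lua, only an illustration under stated
assumptions: headers are taken from "a.s_header" under lower-case names
(as in the "Article.s_header.lines" example), the body text is assumed
to live in a hypothetical field "a.body", and returning "nil" for a
missing header versus "false" for a failed match is my own choice for
telling the two cases apart.

```lua
search_body = ".X."   -- value quoted in the text; not a legal header name

-- Sketch: true if any of the patterns matches the named header (or the
-- body, when field is search_body).
-- Assumptions: a.s_header is keyed by lower-case header name, a.body
-- holds the body text.
function Art(a, field, patterns)
    local text
    if field == search_body then
        text = a.body
    else
        text = a.s_header[string.lower(field)]
    end
    if text == nil then
        return nil          -- no such header: never fall back to the body
    end
    for _, pattern in ipairs(patterns) do
        if string.find(text, pattern) then
            return true     -- first matching alternative wins
        end
    end
    return false            -- text is there, but no alternative matched
end

-- Made-up article for trying it out.
local a = {
    s_header = { from = "Some One <someone@example.org>" },
    body = "hello world",
}
print(Art(a, "from", {"nobody", "%@example%.org"}))
print(Art(a, "x-face", {"."}))
print(Art(a, search_body, {"wor"}))
```

Both "nil" and "false" count as false in an "if", so scorers written
like the ones above work unchanged with this sketch.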

In your own implementation, it is not necessary to provide group or
article data arguments to the scoring functions or the "Art" operator, but
you might choose to keep them nonetheless, because they make it possible
to keep a selection of articles ready for comparison.

If you have many identical scorers in your ScoredMatches table, you can
save a lot of typing by defining these scorers as named functions.
This is the way all of the "hook*" functions are defined.  In the table
"fn_headertable_hooks" I have, among others:

  fn_headertable_hooks = {
      ["."] = {
          hook_count_newsgroups,
          hook_htable_age,
      },
      ["^.-[%.%-]abuse[%.%-]"] = {
          hook_htable_google,
      },
  }

The items called "hook_something" are regular functions like the scorers
above.  They take the same arguments and return some number, which is
added to the current article's running score.  Take "hook_htable_age" as
an example.  It is the name of a function defined as:

  function hook_htable_age(group, article, htab)
      local iam = "hook_htable_age"
      local testresult = 0
      local logmess = iam .. ": "
      local group_maxage = select_match(MaxAge, group.name)
      if article.article_age > group_maxage then
          testresult = spam_cutoff
          logmess = logmess .. group.name .. " age > " .. group_maxage
      elseif article.article_age > group_maxage-3 then
          testresult = reject_cutoff - spam_cutoff
          logmess = logmess .. group.name .. " age > " .. group_maxage-3
      else
          logmess = logmess .. group.name .. " ok"
      end
      if testresult < 0 then
          ln_log(LNLOG_SNOTICE, LNLOG_CARTICLE, logmess)
          addHeader(article, mkmyhead(iam, logmess))
      end
      return testresult
  end

It receives a group, an article and, in this case, the current article's
table of header name/value pairs.  The latter argument is not used, but
is still given by the central hook runner "run_hooks".  What is most
important here is the function "select_match".  It is the simple device
for choosing some table entry given a group name.  In the form shown,
without a third argument or when that argument is "false", it will
return exactly one result, the one with the longest match.

Some other hooks, notably "run_hooks" itself, use the form
"select_match(some_tab, groupname, true)". It returns an array of
matching items, ordered by decreasing match length. If you write your
own hooks, this means an extra loop to work on the items in the array.
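
To make the two calling forms concrete, here is a minimal sketch of such
a lookup function.  It is an assumption about how "select_match" could
work, not a copy of the example implementation: match length is taken
from "string.find", and ties are broken by pattern text simply to get a
stable order.  The table "Limits" is made up.

```lua
-- Sketch only; the real select_match in scripthooks.lua may differ.
-- tab:  table indexed by lua patterns matching group names
-- name: group name to look up
-- all:  if true, return an array of every matching value, longest
--       match first; otherwise return only the longest match's value
function select_match(tab, name, all)
    local hits = {}
    for pattern, value in pairs(tab) do
        local s, e = string.find(name, pattern)
        if s then
            hits[#hits + 1] =
                { len = e - s + 1, pattern = pattern, value = value }
        end
    end
    -- longest match first; equal lengths are ordered by pattern text
    table.sort(hits, function(x, y)
        if x.len ~= y.len then return x.len > y.len end
        return x.pattern < y.pattern
    end)
    if all then
        local values = {}
        for i, hit in ipairs(hits) do values[i] = hit.value end
        return values
    end
    if hits[1] then return hits[1].value end
    return nil
end

-- Made-up table: the extra loop needed in "all matches" mode.
local Limits = {
    ["."] = 10,
    ["^sci%."] = 20,
    ["^sci%.electronics%."] = 30,
}
for _, limit in ipairs(select_match(Limits, "sci.electronics.design", true)) do
    print(limit)    -- 30, then 20, then 10
end
print(select_match(Limits, "alt.usenet.kooks"))    -- 10
```

Note that an entry may legitimately be "false", which is why the sketch
collects matches first instead of testing the value for truth.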

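All the variables and hooks have to be defined before the tables that
refer to them.  Lua compiles in one pass and runs the file from top to
bottom, so a table constructor stores whatever value a name has at that
point, and that is "nil" for anything defined further down.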
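
The effect of getting the order wrong is easy to demonstrate.  The names
"hook_late", "early_hooks" and "late_hooks" below are hypothetical and
exist only for this illustration.

```lua
-- Built before the hook exists: the global hook_late is still nil here,
-- so nothing is stored in the inner table.
early_hooks = { ["."] = { hook_late } }

function hook_late(group, article, data)
    return 0
end

-- Built after the definition: the function value is stored.
late_hooks = { ["."] = { hook_late } }

print(early_hooks["."][1])          -- nil
print(type(late_hooks["."][1]))     -- function
```

No error is raised for the first table; the hook is silently missing,
which is why the ordering matters.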

== The tables are: ==

If you want to throw out posts consisting mostly of citation lines
with very few lines of "real" material, use this.

-- longest match wins.
-- { cites/origs, check after at least min_lines, score-if-match }

ArtCiteStats = { }

Specify the "fuel" for any group here.  The more fuel it has,
the more spamminess will be tolerated.

-- longest match wins.
Scorelimit = { }

In some groups, a lot of crossposts make sense, in others, not.

-- longest match wins.
-- {limit, score-if-match}

MaxXposts = { }

In case your upstream or some moderators inject articles late,
articles that are some number of days old may be ok, but you might
want to score very old articles down.

-- longest match wins.
-- { limit, score-if-match }

MaxAge = { }

-- score on the size of an article.
-- longest match wins.
-- { minimum bytes, maximum bytes }
MaxBytes = { }
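
As an illustration of how such tables might be filled in, here is a
sketch with a "." default and a few overrides.  All patterns and numbers
are made up, and I am assuming that "Scorelimit" holds plain numbers;
check the example scripthooks.lua for the shapes it really expects.

```lua
-- All values below are illustrations, not recommended settings.
Scorelimit = {
    ["."] = 10,                     -- default for every group
    ["^alt%."] = 5,                 -- less tolerance in alt.*
    ["^sci%.electronics%."] = 15,   -- longest match wins for these groups
}

MaxXposts = {                       -- { limit, score-if-match }
    ["."] = { 5, -3 },
    ["^alt%.test"] = { 2, -5 },
}

MaxBytes = {                        -- { minimum bytes, maximum bytes }
    ["."] = { 100, 65536 },
}
```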

You should definitely consider using a bayes filter on your newsgroups.
Note that some very kooky groups would only mess up the bayes counts,
so you can switch filtering off for them.  Also, in the beginning it
is risky to use an auto-update feature; you might consider doing
"train-on-error" instead.  Watch your local spam or reject groups
for the relevant material.

-- longest match wins.

FilterSettings = {
    ["."] = {
        program = "/usr/local/bin/bogofilter -uv",
        is_spam = "^X%-Bogosity:%s-Yes,",
        is_ham = "^X%-Bogosity:%s-No,",
    },
    ["^alt%.usenet%.kooks"] = false,
    ["^news%.software%.readers"] = false,
    ["^rec%.arts%.drwho"] = false,
}

This is the main filtering table, the language being regular lua
functions, with as many of them as you need.  Pick the headers to
match on with the utility functions "Art" and "ArtAnd".  Both accept
as the last parameter a table of alternatives, each tested for matches.
This is because lua's regular expressions do not have the "|"
alternation operator.  "Art" returns true if any one alternative
matches; "ArtAnd" tests all alternatives and returns true only if all
of them match.
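
A sketch of "ArtAnd" next to a scorer using it may help.  The real
helper in scripthooks.lua may differ in detail; the header lookup via
"a.s_header" follows the "Article.s_header.lines" usage, and the subject
patterns and the score of -3 are made up.

```lua
-- Sketch: true only if every pattern matches the named header.
-- Assumption: headers are stored in a.s_header under lower-case names.
function ArtAnd(a, field, patterns)
    local text = a.s_header[string.lower(field)]
    if text == nil then
        return nil              -- header missing
    end
    for _, pattern in ipairs(patterns) do
        if not string.find(text, pattern) then
            return false        -- one pattern failed, so not all match
        end
    end
    return true
end

-- Made-up scorer: penalize subjects that mention both words.
ScoredMatches = {
    ["."] = {
        function(g, a, s)
            if ArtAnd(a, "subject", {"[Ff]ree", "[Mm]oney"}) then
                return -3
            else return 0 end
        end,
    },
}
```

A subject of "Free money now" would score -3 here, while "Free beer"
scores 0 because only one of the two patterns matches.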

-- these tables are run from longest to shortest match, _all matches_.

ScoredMatches = { }

In the following tables you would list the various hook functions you
define for spam/ham statistics.  The example hooks cater for shouting,
ie. the ratio of capital letters to non-caps, the ratio of citations to
original lines, bayes filtering, HTML removal etc.  It is not difficult
to devise your own.
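
As a starting point for your own hooks, here is a sketch of a shouting
test.  It is not the hook from the example file: the name
"hook_shouting", the threshold and the penalty are hypothetical, it
measures capitals as a fraction of all letters rather than against
non-caps, and it assumes that hooks in "fn_bodytxt_hooks" receive the
body text as their third argument.

```lua
-- Hypothetical settings, not taken from the example file.
local shout_ratio_limit = 0.5   -- more than half of the letters in caps
local shout_penalty = -2        -- score added when the limit is exceeded

function hook_shouting(group, article, bodytxt)
    -- string.gsub returns the number of substitutions as second result,
    -- which gives a cheap way of counting characters of one class
    local _, upper = string.gsub(bodytxt, "%u", "")
    local _, lower = string.gsub(bodytxt, "%l", "")
    if upper + lower == 0 then
        return 0                -- no letters at all, nothing to judge
    end
    if upper / (upper + lower) > shout_ratio_limit then
        return shout_penalty
    end
    return 0
end

fn_bodytxt_hooks = {
    ["."] = { hook_shouting },
}
```

Like the other hooks, it returns a number to be added to the running
score and leaves the decision about the article to the caller.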

fn_headertxt_bodytxt_hooks = { }
fn_bodytxt_hooks = { }
fn_headertxt_hooks = { }
fn_headertable_hooks = { }

= Writing other extensions besides lua =

Scripting is realized by inserting "hooks" into the programs that should
have this feature enabled. The hooks should have at least two different
implementations selected by preprocessor symbols. One is for leafnode
proper, ie. without any extension mechanism, for administrators not
wanting it, and another one is your extension. The hooks are simple
function calls that are generic and can be implemented in or for any
language. This way people can add whatever they want in the "backend",
be it lua, perl, python, spam-extinguishers etc.

If you want your extension to coexist with lua, you will have to solve
the problem arising with error conditions, because one backend may run
without errors while another one cannot even initialize.  The place to
start with this is the file "script-redirects.c" in the source
distribution.  It currently contains dummy hooks which do nothing but
return a special exception code to the upper layer, in case an
administrator decides to go without any scripting, plus the same
functions calling up the lua versions of the hooks.  You can either
replace the lua names with the names of your target language hooks,
or expand them to handle the various exceptional conditions when
combining several extensions.  The former is easier, the latter
provides more features.

= Testing the extension =

Post a few test articles to a _local newsgroup_.  It must be local for
several reasons:  (1) your articles will not pollute USENET, (2) you
need to change some headers of the posts for re-injection and (3) you
need to be able to check the results.

The most important steps are:

1.	Testing if _unmodified_ copies of leafnode, with dummy scripting
	hooks enabled, work undisturbed.  This is because administrators
	might choose to use leafnode _without scripting_!

	For this you would post a few lines to a local newsgroup, copy the
	resulting article to a working directory and change the message-ID
	and anything else you need to make the raw material for a new
	article.

	You should leave the "Newsgroups: local..." line intact so you know
	where to find the result.

2.	Build leafnode proper using the customary "autoreconf -i" (optional
	if you already have the "./configure" script), "./configure" and
	"[g]make".  Do not install yet, because these steps are only needed
	to get leafnode's project library called "liblnutil.a", which is
	most likely needed in your tests.  Use "[g]make -f Makefile-script"
	to generate a little test program.  This test program does nothing
	but exercise some aspect of leafnode, ie. the store_stream()
	component in the case of scripting for fetchnews.  The resulting
	"script-test-store" program calls store_stream() with some testing
	article you made in step (1) _on stdin_.  Thus there shouldn't be
	any networking involved, but you'd still see what your extension
	does.

3.	Enable your scripting.  You would normally do this without changing
	anything within leafnode proper, instead you'd define preprocessor
	symbols like "WITH_SCRIPT_LUA".  Make dummy scripts doing not much
	more than progress reporting and writing important variables to
	screen.
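
For the lua backend, a dummy script of the kind meant in step (3) could
look like the sketch below.  The helper "traced" is hypothetical.  The
arguments passed to "fetchnews_headertxt_bodytxt" and the return value
expected from it are not spelled out here, so the dummy accepts anything
and returns nothing; see script_lua.c for the real contract.

```lua
-- Wrap a function so that every call is reported on stderr together
-- with its arguments, then hand over to the wrapped function.
function traced(name, fn)
    return function(...)
        local args = {}
        for i = 1, select("#", ...) do
            args[i] = tostring((select(i, ...)))
        end
        io.stderr:write(name, "(", table.concat(args, ", "), ")\n")
        return fn(...)
    end
end

-- Dummy callout: reports that it was called and does nothing else.
fetchnews_headertxt_bodytxt = traced("fetchnews_headertxt_bodytxt",
    function(...) end)
```

Once the progress reports show up as expected, the empty function can
be replaced by real logic one hook at a time.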
