#!/usr/local/bin/tops -s /usr/local/tops/sys -u /opt/mytops/usr/
{
   File tops_rtcmon
   January 2008

   Copyright (C) 2008-2012  Dale R. Williamson

   Real time collection monitor

   This script runs as a daemon on a machine that is collecting data. 

   It creates and sets into motion a directory watching word, D2(), 
   to watch a directory where collected files are placed, and sense
   changes as files are added or updated.

   Notification of changed files comes to word dir_monitor() defined
   below, because dir_monitor() is specified as the Action word for 
   directory watching word D2() just described.
   
   Servers on remote machines can request this script to connect to 
   their listening ports by running a companion script to this one 
   such as tops_rtc, which will trigger word server_connect() defined 
   below to make the connection.  The connected sockets are stored in 
   a table in word dir_monitor() and remain connected.

   When changed files are then detected, dir_monitor() will send the 
   files in a tar archive (compressed to about one-quarter size) 
   through the sockets given in its list of remote machines.

   Script tops_rtc is an example of a script that might be running on
   a remote machine with a server that interacts with this script.

------------------------------------------------------------------------

   Contents:

      "tops_rtcmon"  asciiload this " inline:" grepr reach dot

   inline: clr_monitor ( --- ) \ clear out FILES in dir_monitor
   inline: dir_monitor (hFnew hFgone --- ) \ operate on files
   inline: LOG_MEM ( --- qFile) \ the name of this script's mem log file
   inline: LOG_RTCMON ( --- qFile) \ the name of this script's log file
   inline: netlistener ( --- f) \ this machine's network listener is on
   inline: NEW_CLIENT (nS --- ) \ report new clients
   inline: rtc_files ( --- hT) \ list of all collected files
   inline: rtc_for (nYYYMMDD --- hT) \ list of files for YYYMMDD
   inline: send_files (hFiles hSockets --- ) \ send Files to Sockets
   inline: server_close (nS --- ) \ close socket S to remote server
   inline: server_close1 (nS --- ) \ close socket S to remote server
   inline: server_connect (qIP nPort --- ) \ connect to remote server

   Running the server:
   inline: D2 ( --- ) \ watch the collection directory DIR for changes
   inline: RTC_CONNECTack (qName --- ) \ acknowledgement

   Debugging:
   inline: D2 ( --- )
   inline: IDLE ( --- )
   inline: idle1 (secs --- ) \ idle for secs

------------------------------------------------------------------------

   Example: 

   This shows a job running this script.  It has made connections back 
   to three servers (denoted by C>S) and is periodically sending col-
   lected files to them.  

   The socket 6 connection was just made using the remoteprompt command
   to get a prompt for working interactively on the running job to run
   the clients command and make the clients listing shown.

      [dale@fortycoupe] /home/dale > tops
               Tops 3.0.1
      Tue Mar 18 17:58:13 PDT 2008
      [tops@fortycoupe] ready > IPloop 9879 CLIENT remoteprompt
      tops@socket3 > clients
       Server local is listening on port 9879
       Clients:
        socket 2, port  9881, conn C>S, XX.XXX.244.138 dale plunger
        socket 3, port  9882, conn C>S, XX.XXX.148.191 dale topsdog
        socket 5, port  9878, conn C>S, XX.XXX.34.154 dale clacker
        socket 6, port  1725, conn S<C, 127.0.0.1 LOGIN dale fortycoupe
      tops@socket3 > [pressing Esc+q jumps back to the local prompt]

      [tops@fortycoupe] ready >    

   These lines are handy in a script to see what tops_rtcmon and 
   tops_rtc jobs are running:

      #File rtc
      ps -Af --cols 512 | grep tops_rtc
      ps -Af --cols 512 | grep collec
}
\-----------------------------------------------------------------------

\  Words.

   CATMSG push no catmsg

   "dir_watch" missing IF "dog.v" source THEN
   "soonest_end" missing IF "mrc.v" source THEN
   "hist_fname" missing IF "mfil.v" source THEN

   fence

   inline: clr_monitor ( --- ) \ clear out FILES in dir_monitor
{     This word runs in the multitasker and periodically clears out
      the FILES in dir_monitor().

      Force FILES to be sent.
}
      "dir_monitor" "FILES" yank rows any
      "dir_monitor" "S"     yank rows any and
      IF 
         "dir_monitor" "tSENT" yank "t1" book
         0 "dir_monitor" "tSENT" bank

         " clr_monitor: remote sending " date + . nl
         VOL tpurged "_clr_monitor" naming dup dir_monitor

         t1 "dir_monitor" "tSENT" yank (nT1 nT2) max (tSENT)
         (tSENT) "dir_monitor" "tSENT" bank

      THEN
   end

   "epath0" "DIR" macro \ directory of this machine's collected files

   inline: dir_monitor (hFnew hFgone --- ) \ operate on files
\     This is the "Action" word for word D2(), a directory watching 
\     word.

      [ 0 2 null "S" book        \ table of connected sockets

        VOL tpurged "FILES" book \ table of file names
        4 "rFILES" book          \ max rows in FILES before send files

        0 "tSENT" book           \ time that files were sent
        300 "tMAX" book          \ max seconds between files sent
      ]
      (hFgone) drop \ don't care about files gone

      (hFnew) -path 
      dup chars any IF dup 1st "." cite rake lop THEN \ no . files

      (hFnew) FILES pile noq_alike "FILES" book

      S rows any (f)
      IF FILES rows rFILES >= (f1)                   \ max out on files
         time tSENT - tMAX > FILES rows any and (f2) \ max out on time

         (f1 f2) or (f)
         IF FILES S send_files 
            time "tSENT" book 
            VOL tpurged "FILES" book
         THEN
      THEN
   end

   inline: LOG_MEM ( --- qFile) \ the name of this script's mem log file
      [ "HOME" env "tops_rtcmonmem.log" catpath "log" book ] log
   end

   inline: LOG_RTCMON ( --- qFile) \ the name of this script's log file
      [ "HOME" env "tops_rtcmon.log" catpath "log" book ] log
   end

   inline: netlistener ( --- f) \ this machine's network listener is on
      [ 3 "TRIES" book
        5 "SEC" book

      \ Sat Jan 21 15:42:17 PST 2012.  Updated to use HTTPport:
        "HTTPport port_on" "TEST" macro
      ] 
{     If SERVER-CYCLE is running, there may be a failure if the duty
      cycle is in the off period.  For current SERVER-CYCLE parameters,
      three tries at 5 seconds between should capture the on period.
}
      false TRIES 1st DO TEST IF drop true EXIT THEN SEC idle LOOP
   end

   inline: NEW_CLIENT (nS --- ) \ report new clients
\     Bank the ptr to this word in new_conn.PTR to get a report every 
\     time a new client connects.

      " NEW_CLIENT: new client on socket " swap intstr + spaced
      date + . nl
   end
   "NEW_CLIENT" ptr "new_conn" "PTR" bank

   inline: rtc_files ( --- hT) \ list of all collected files
\     Volume T contains a list of file names and times for files that
\     have been and are being collected in directory DIR.
      DIR dirfiles (hNames hTimes) " %0.0f" format park
   end

   inline: rtc_for (nYYYMMDD --- hT) \ list of files for YYYMMDD
\     Volume T contains a list of file names and times for files that
\     have been collected for YYYMMDD in directory DIR.
      DIR dirfiles (hNames hTimes) " %0.0f" format park
      dup rot intstr grepr any?
      IF reach ELSE drop VOL tpurged THEN
   end

   inline: send_files (hFiles hSockets --- ) \ send Files to Sockets
\     Make a tar file archive of certain files in DIR that are listed
\     in Files, and send it to Sockets.  

      [ MAXBLOCK 2 * "socket_ack1" "SEC" bank 
        yes "socket_ack1" "VERBOSE" bank ]  

      hand "S" book
      hand (hFiles)

      dup (hFiles) rows dup intstr "R" book 0> not (f1)
      S rows 0> not (f1 f2) or
      IF (hFiles) drop return THEN

      " send_files begin: " date + ":" + . nl 
      (hFiles) dup 3 indent . nl

      clients

      (hFiles) DIR archive any?
      IF "T" book

       \ Put to sleep all the tasks being run in the multitasker:
         "D2" SLEEP
         "RTC_CONNECT" SLEEP
         "clr_monitor" SLEEP
 
         " send_files: sending " T sizeof intstr + 
         " byte archive of " + R + " files" + . nl

         depth push \ monitor depth to dump stk items if error

         S rows 1st \ loop over the sockets 
         DO S I pry socket_ack1 (f)
            IF " send_files: to socket " S I pry intstr +
               spaced date + . nl
               T "extract_files" S I pry remoterun2
            ELSE
               " send_files: no files to closed socket " 
               S I pry intstr + . nl jmptable nl
            THEN
         LOOP

         purged "T" book
         depth pull - dump \ clean up stack

       \ Awaken all the multitasker tasks:
         "D2" WAKE
         "RTC_CONNECT" WAKE
         "clr_monitor" WAKE

      ELSE " send_files: archive is empty, no files sent" . nl 
      THEN
      " send_files end: " date + . nl 
   end

   inline: server_close (nS --- ) \ close socket S to remote server
{     Close the socket S connection to remote server and remove S from
      the sockets list in dir_monitor().  This word may be run from the
      remote server so its socket will be removed from the list.

      Example: a remote server connected can run the following phrase
      to cause this word to run and disconnect its socket:

         "remotefd server_close" 2 remoterun

      where 2 is the socket number (for this example) on the remote
      server, not the socket number S at this end.

      In the quoted phrase, word remotefd() will put on the stack the
      correct socket number S for this end of the connection; next,
      this word, server_close(), will run and remove S from the list
      in dir_monitor() and close the connection.
}
      (nS) "S" book

    \ Remove S from the sockets table:
      "dir_monitor" "S" yank dup S = rake drop \ rake out S
      nodupes "dir_monitor" "S" bank           \ bank new table

    \ Close the connection:
      " server_close: closing socket " S intstr + spaced date + . nl
      S sclose

      clients
   end

{ --------- NOT USED; PERHAPS two-column S can become one-columned.

   inline: server_close1 (nS --- ) \ close socket S to remote server
{     Close the socket S connection to remote server and remove S from
      the sockets list in dir_monitor() if its socket_ack failure counts
      have hit Smax.
}
      [ 3 "Smax" book ]

      "server_close1" ERRset
      (nS) "S" book

    \ Remove S from the sockets table if Smax counts in column two have
    \ been accumulated:
      "dir_monitor" "S" yank (hS) \ col 1 socket, col 2 counts 
      dup 1st catch S = "R" book  \ rake for nS
      (hS) R rake (hS0 hS1) any?  \ rake out S1, the row containing S 
      IF dup 2nd pry Smax >=  
         IF drop \ Smax has been hit; close the connection:
            " server_close1: at max count, closing socket " S intstr + 
            spaced date + . nl 
            S sclose
         ELSE dup 2nd pry 1+ over 2nd poke R tier \ bump the count
         THEN
         nodupes "dir_monitor" "S" bank \ bank new table
         clients
      ELSE " server_close1: socket " S intstr + " not found" + . nl
         (hS0) drop
      THEN
      ERR
   end
-------- }

   inline: server_connect (qIP nPort --- ) \ connect to remote server
{     Connect to server at IP(Port) and store the socket of the con-
      nection in list S in dir_monitor().  Success or failure will be
      reported in the log file.
}
      "PORT" book "IP" book

    \ Close any servers at this IP address that are now connected:
      clientIPs IP grepr any?
      IF clientsockets swap reach 1st catch (nS) dup rows 1st
         DO dup I pry server_close LOOP drop
      THEN

      IP PORT CLIENT (nS) dup 0> (f) \ socket nS > 0 if succeed
      IF (nS) " server_connect: connected to " IP + 
         ":" + PORT intstr + spaced date + . nl

       \ Set word server_close to run when socket S closes:
         (nS) "server_close" ptr over (ptr nS) ptrCls_upd

       \ Add socket number nS to the sockets table:
         (nS) dup 0 park \ park 0 count as 2nd column
         "dir_monitor" "S" yank pile nodupes
         "dir_monitor" "S" bank

       \ Inform remote server of its socket to here by having it run
       \ its word CONN_SET:
         (nS) "remotefd CONN_SET WAIT_END" over (qS nS) remoterun

         (nS) ontheweb
         IF (nS) drop
         ELSE (nS) drop
{
         Not necessary.  Each machine uses NIST_SYNC (see below).

            (nS) time_sync (f)
            IF " server_connect: time sync with remote " 
               "time_sync" "DT" yank intstr + " seconds" + 
            ELSE " server_connect: time sync with remote failed" 
            THEN . nl
}
         THEN

      ELSE (nS) " server_connect: failed to connect to "
         IP + ":" + PORT intstr + spaced date + . nl
         (nS) drop
      THEN

      clients
   end

   pull catmsg

   keys? IF halt THEN \ interactive testing, do not start server

\-----------------------------------------------------------------------

\  Verify a listening http server:
   netlistener not
   IF "tops_rtcmon: a listening http server is not running; halting" 
      . nl HALT
   THEN

\-----------------------------------------------------------------------

\  SYSOUT must be defined for daemon output.  The following sets SYSOUT
\  to file LOG_RTCMON:
   LOG_RTCMON set_sysout \ log file receives output

\  Write the first lines in LOG_RTCMON:
   "-" 72 cats nl dot nl
   "PID " getpid intstr + spaced date + dot nl

\-----------------------------------------------------------------------
{
   This section sets up directory watching to run in the multitasker, 
   and the message polling word that will enable connect requests from 
   remote servers.

   Make a directory-watch word called D2(), and bank the ptr to word
   dir_monitor() into D2 as its Action word, so D2() will run word
   dir_monitor() when files in directory DIR change.
   
   By default, all directory watch words such as D2(), made by running 
   dir_watch(), write to the same log file (each word's output is duly 
   noted).  So if another directory watching word like tops_deny is 
   running, its word D1() will also be writing to file dirwatch.log.

   Note that from this script, log file dirwatch.log receives output 
   only from D2(), and dirwatch.log is different from file LOG_RTCMON 
   defined above.  File LOG_RTCMON receives output whenever any program
   function writes, including new words in this script, like dir_moni-
   tor() and send_files().  The important messages for the task of send-
   ing files to remote machines will be on log file LOG_RTCMON rather 
   than file dirwatch.log.
}
{
   The lines in this brace region are no longer run to use the standard
   word dir_watch() to create watchdog word D2().  Instead, a much sim-
   pler and faster D2() is made below for this particular problem.  

\  Running dir_watch() to make a watchdog word called D2(), to watch
\  files in the directory defined by DIR:
   "/tmp" DIR "D2" dir_watch  

\  By setting D2.Action = ptr("dir_monitor"), the action of new word
\  D2() is to run dir_monitor():
   "dir_monitor" ptr "D2" "Action" bank 
}
   CATMSG push no catmsg

   inline: D2 ( --- ) \ watch the collection directory DIR for changes
{     This word runs in the multitasker to sense changes to collection
      directory DIR, and send the list of changed files to word dir_mon-
      itor(), the Action word.  

      It takes advantage of the fact that the collection word (probably
      HIST_ADD()) touches the collection directory every time it writes
      to it, making its change easy to spot.

      It also only looks at files in DIR that are being collected right
      now, and not at any earlier ones.

      When this word runs its Action word, dir_monitor, networking words
      will be run, and if the program is in an idle state, this can lead
      to unpredictable behavior (see notes in Appendix below).

      So before continuing, this word checks the idle state (word LOCK-
      ED) and if the program is idling (LOCKED returns true), this word
      simply returns.
}
      [ "dir_watch" "DGLOG" yank "DGLOG" book \ log file for this word

        0 "tDIR" book

        tracklist rows 1 null "TIMES" book
        tracklist rows 1 blockofblanks "FNAMES" book

      {" ( --- hFnames)
      \ This is the list of files being collected right now:
        list: tracklist rows 1st
           DO tracklist I pry hist_fname LOOP
        end words DIR nose 
      "} "get_fnames" macro

      ]
      SYSOUT push DGLOG set_sysout

      LOCKED 
      IF "D2: the program is idling, this word will return" . nl
      ELSE
         DIR filetime (s) dup tDIR <>
         IF (s) "tDIR" book

          \ Is list of file names current?
            tracklist 1st pry hist_fname FNAMES 1st pry -path <>
            IF "D2: updating FNAMES on " date + . nl 
               get_fnames "FNAMES" book
            THEN

            FNAMES dup filetime dup TIMES <> (hTimes hf)
            swap "TIMES" book
            (hFnames hf) rake lop (hFchg) any?
            IF (hFchg) 
               "D2: files changed " date + . nl
               (hFchg) dup 3 indent . nl

               peek set_sysout
               (hFchg) "" (hFchg hFgone) dir_monitor
            THEN
         ELSE (s) drop 
         THEN
      THEN
      pull set_sysout
   end

\  Start D2() running under the multitasker:
   1 10 / (Hz) "D2" PLAY \ once every 10 seconds

\  Files might cease changing while some FILES in word dir_monitor()
\  have not yet been sent.  Run word clr_monitor() to periodically
\  empty FILES from dir_monitor():
   1 300 / (Hz) "clr_monitor" PLAY \ every 5 minutes

\  The following message poll word looks for RTC_CONNECT messages from 
\  servers that want this program to connect to them so they can re-
\  ceive changed files: 
   "RTC_CONNECT" msgPoll \ the word's name matches the message name

\  This optional word is run by polling word RTC_CONNECT when it gets
\  a message:
   inline: RTC_CONNECTack (qName --- ) \ acknowledgement
\     RTC_CONNECT acknowledging message receipt and time.
      "Name" book \ should be RTC_CONNECT
      " " Name + ": received on " + date + . nl
      Name "M" yank 1st quote Name "M" bank \ just take 1st row
   end
   "RTC_CONNECTack" ptr "RTC_CONNECT" "Ack" bank 

\  Remove old RTC_CONNECT messages; this job must be running before 
\  others can successfully connect:
   "RTC_CONNECT" msgGet drop

\  Start polling for connections:
   1 (Hz) "RTC_CONNECT" PLAY \ fast polling

   pull catmsg
{
   How does the RTC_CONNECT interprocess message from a remote server 
   appear on this machine?  

   In addition to this script, the collection machine is also running
   this program's HTTP server to allow connections from the Internet 
   (see script tops_http).  That was just checked above by running
   netlistener().

   The remote server, also running this program, runs word msgPutIP(),
   which makes a connection to this machine's HTTP server and places a
   message like the following on this machine's interprocess message
   file:

      RTC_CONNECT "71.107.4.6" 9886 server_connect

   Message poll word RTC_CONNECT() (started above by PLAY) will pick up 
   the message and run the phrase it contains, thereby running word 
   server_connect() defined above to make the connection to the given
   IP address and listening port. 

   Script tops_rtc runs a remote server that connects to a collection 
   machine and leaves a message in the way just described.
}
\-----------------------------------------------------------------------

\  Start a multitasker job to track memory usage:
\  July 2009: Memory has looked fine for months, no leaks.  Discontinue
\  this:
\  LOG_MEM "memlog" "LOG" bank
\  1 900 / "memlog" PLAY \ every 15 minutes

\-----------------------------------------------------------------------

\  Settings:
   12 new_client_timeout \ time allowed for remote to make connection

 \ Wed Jun 27 19:20:30 PDT 2012:
   host collector (LAN collector?)
   IF 5 "NIST_DELTA" ELSE 5 "NIST_SYNC" THEN ALARM

   10 "tasks" ALARM

{  Start a daemon server, running forever.

   This server is only for the purpose of connecting for maintenance 
   and debugging, through word remoteprompt() as shown in the example
   at the top of this file:
}  
   "*" def_port nextport DSERVER

\-----------------------------------------------------------------------

Appendix

When this message appears in the log file:

 send_files end: Sun Apr 27 20:52:24 PDT 2008
 drain: ignore invalid runflag from socket 2
 drain: ignore invalid runflag from socket 2

it is probably due to sending a tar file with a date in the future.
Turning on wtrace will show the bytes that were ignored.  Here they
are for a case when the tar file is 3 seconds in the future:

 sflush: 6 bytes from socket 2
       0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F  0123456789ABCDEF
   0  2F 74 61 72 3A 20 00 00 00 00 00 00 00 00 00 00  /tar: ..........
 75941 microsec delta: drain entering
 readn1: 4 bytes from socket 2
 263 microsec delta: end readn1
 drain: 4 bytes in socket 2, runflag=DRAIN_INVALID
 drain: ignore invalid runflag from socket 2
 readn1: 64 bytes from socket 2
 321 microsec delta: end readn1
 sflush: 64 bytes from socket 2
       0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F  0123456789ABCDEF
   0  34 32 38 5F 4D 50 2E 62 69 6E 3A 20 74 69 6D 65  428_MP.bin: time
   2  20 73 74 61 6D 70 20 32 30 30 38 2D 30 34 2D 32   stamp 2008-04-2
   4  38 20 30 34 3A 30 31 3A 31 39 20 69 73 20 33 20  8 04:01:19 is 3
   6  73 20 69 6E 20 74 68 65 20 66 75 74 75 72 65 0A  s in the future.

While the program times can be synched closely between machines as they
run, the system time remains the same, and it governs the time on the 
tar file.

This really was not a problem once it was understood.  The key thing
for the program was changing it to ignore invalid runflags rather than 
giving up and closing the connection.
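
As an illustration only (this is not part of the program, and the names
future_members, make_archive and tolerance are hypothetical), the Python
sketch below shows one way to list the members of a tar archive whose
time stamps are ahead of the receiving machine's clock, which is the
condition tar reported above:

```python
import io
import tarfile
import time


def future_members(archive_bytes, now=None, tolerance=0.0):
    """Return (name, seconds_ahead) for members stamped later than now."""
    if now is None:
        now = time.time()
    ahead = []
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:*") as tar:
        for member in tar.getmembers():
            delta = member.mtime - now
            if delta > tolerance:
                ahead.append((member.name, delta))
    return ahead


def make_archive(files):
    """Build an in-memory gzip tar from (name, data, mtime) tuples."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data, mtime in files:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = mtime
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()
```

A receiver could run a check like this before extracting, and treat
small positive deltas as harmless clock skew rather than as an error.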

{

Thu May  1 08:01:47 PDT 2008

The notes that follow are concerned with uncovering a mysterious prob-
lem where tops_rtcmon became unreachable by CLIENTs trying to connect 
to its listening port.  This condition has been seen before in other
servers, so it is not new.

It has been concluded that the problem comes from running network words
while the program is idle.  Looking at jmptable is key to seeing the 
error.  Jmptable will show the word running high in the run levels 
above an idling word, when it should be running at a low run level.  The
program appears to be able to run for long periods in this incorrect 
state before something makes it crash.

This shows the incorrect condition, with word D2 running above idle:

 LOCK: flag to unlock received on run level 13
 send_files: socket_ack on socket 2 failed
 Jmp table at lev 10
  lev   ret  typ  Lib:lib
  10     2    1     0:send_files
   9     2    1     0:dir_monitor
   8     2    1     0:D2
   7     5    1     0:idle
   6     2    1     0:idle
   5     2    1     0:msgHold
   4     2    1     0:msgComm
   3     2    1     0:msgGet
   2     2    1     0:RTC_CONNECT
   1     1    2     0:DATA__
 send_files: to socket 2 Wed Apr 30 19:35:47 PDT 2008

It has been concluded that trying to make the program smarter to avoid
this problem is not in keeping with the design philosophy of local
control.  Every word being launched would somehow have to let the
program know whether it can or cannot run above an idling run level.

On the other hand, giving words a tool to determine the idle state
allows local control: they simply can choose not to run if the program
is idling. 

New word LOCKED has been written that network words, running in the 
multitasker, can use to see if they are being asked to run while the 
program is locked.  With this change (adding LOCKED to word D2 above), 
tops_rtcmon has run without the mysterious hangup, and has recovered 
nicely from connection errors.
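
The guard idea can be sketched outside this program.  The Python toy
model below is only an assumption-laden illustration: the class Program,
its locked flag and watch_task are hypothetical stand-ins, and it does
not show how LOCKED itself is implemented.  It shows a periodic task
that asks about the idle state first and declines to run when idling:

```python
class Program:
    """Toy model: an idle (locked) flag plus one periodic task."""

    def __init__(self):
        self.locked = False  # True while the program is idling
        self.log = []

    def watch_task(self):
        # Guard: decline to run network work while the program idles.
        if self.locked:
            self.log.append("watch_task: idling, return")
            return False
        self.log.append("watch_task: run network words")
        return True
```

The decision stays with the task itself, which is the local control
described above: the scheduler does not need to know which tasks are
safe to run during an idle.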

Log file dirwatch.log shows the change in action when the program is
found idling:

   D2: files changed Thu May  1 00:21:10 PDT 2008
      /mdat/edat0/1080501_NQ.bin
   D2: the program is idling, this word will return
   Thu May  1 00:21:48 PDT 2008 D1: dir_watch files new or changed:
      /var/log/cron
   D2: files changed Thu May  1 00:21:50 PDT 2008
      /mdat/edat0/1080501_HG.bin

This shows tops_rtcmon recovering from a connection problem, and D2
running at a low run level as it should:

 netvolread error: invalid rows = 2832596927 on socket 2
 drain: error reading VOL or STR
 clientclose: socket 2 closing on flag 2, Thu May  1 03:35:55 PDT 2008
 clientclose: socket 2, port  9881, conn C>S is closed
 server_close: closing socket 2 Thu May  1 03:35:55 PDT 2008
 Server local is listening on port 9882
 No clients
 send_files: socket_ack on socket 2 failed
 Jmp table at lev 4
  lev   ret  typ  Lib:lib
   4     2    1     0:send_files
   3     2    1     0:dir_monitor
   2     2    1     0:D2
   1     1    2     0:DATA__
 send_files: to socket 2 Thu May  1 03:35:57 PDT 2008


On the efficiency of msgPoll:

There aren't that many RTC_CONNECT messages, so msgPoll really should
not wait to get one from msgcomm.  It should use msgPeek first (which 
works, even if msgcomm is busy), and then skip msgGet (where it has to
wait until others are not holding msgcomm) and move on if there is no 
message.

Word msgPoll has been changed to run msgPeek first, before committing
to getting a message.
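
A minimal Python sketch of the peek-before-get idea follows.  It is not
the msgPoll code; MessageFile, peek, get and poll are hypothetical names,
and a threading lock stands in for whatever holds msgcomm:

```python
import threading


class MessageFile:
    """Toy stand-in for a shared message file guarded by a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._messages = []

    def put(self, name, text):
        with self._lock:
            self._messages.append((name, text))

    def peek(self, name):
        # Cheap check that does not wait for the lock.
        return any(n == name for n, _ in list(self._messages))

    def get(self, name):
        # May block while another holder has the lock.
        with self._lock:
            for i, (n, text) in enumerate(self._messages):
                if n == name:
                    del self._messages[i]
                    return text
        return None


def poll(msgs, name):
    """Peek first; commit to the blocking get only if a message is there."""
    if not msgs.peek(name):
        return None
    return msgs.get(name)
```

With no message waiting, poll returns at once even while another job
holds the lock, so the poller never enters the waiting state that
produced the msgHold lines.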

With this change, commonly occurring msgHold messages like these have 
not been seen:

 send_files end: Thu May  1 02:28:00 PDT 2008
 msgHold: T6795 idling Thu May  1 02:28:13 PDT 2008
 msgHold: T6795 idling Thu May  1 02:28:22 PDT 2008
 send_files begin: Thu May  1 02:29:20 PDT 2008:

Wed Apr 30 20:24:15 PDT 2008

From man, a note for word idle:
  note: during the idle period, reading of sockets from remote clients
    (see words SERVER and CLIENT) is blocked; all data comes through
    when the period ends

Is it possible that RTC_CONNECT is in an idle period waiting to get
a message, and socket_ack gets run and the ack can't get through?
Then a socket error occurs and the socket is closed?  And we get all
the mess shown below?

That is the current theory.

You simply cannot expect any networking words to work when the program
is idling, but that may be the situation you find yourself in when
multitasker words run asynchronously.  A networking word just should
not be allowed to start running if the program is idle, but there is 
no check for this.  

Can there be a word that flags the idle state?  It would only run when 
a multitasker word runs (since only they can run during an idle), and 
the word could just return if idling.

Yes.  New word LOCKED was just written.  Word D2 has been changed to
run LOCKED before it makes a periodic run, and to just return if the
program is idling.

What causes the hangup of tops_rtcmon.

Tops_rtcmon is running a couple of multitasker words.  One of them,
D2, sends files periodically to another machine and the other one,
RTC_CONNECT, tests periodically for a message to connect to another
machine.

Most of the time, D2 and RTC_CONNECT don't get in each other's way.

This shows D2 running a send_files job, and after it finishes we see
RTC_CONNECT running:

D2 sensed changed files, and causes send_files to run.  This shows
send_files beginning and ending:
 send_files begin: Wed Apr 30 19:32:55 PDT 2008:
   1080501_EU.bin
   1080501_PL.bin
   1080501_SI.bin
   1080501_GC.bin
 Server local is listening on port 9882
 Clients:
  socket 2, port  9881, conn C>S, XXX.XX.148.191 dale topsdog
 send_files: sending 1455 byte archive of 4 files
 send_files: socket_ack on socket 2 ok
 send_files: to socket 2 Wed Apr 30 19:32:55 PDT 2008
 send_files end: Wed Apr 30 19:32:56 PDT 2008

RTC_CONNECT runs next, with debug showing jmptable.  Word msgHold runs
idle because msgGet is delayed, probably because other jobs (collection
jobs started by topse) are writing to or reading from the messages
file, msgcomm:

 LOCK: locked on run level 7
 Jmp table at lev 7
  lev   ret  typ  Lib:lib
   7     5    1     0:idle
   6     2    1     0:idle
   5     2    1     0:msgHold
   4     2    1     0:msgComm
   3     2    1     0:msgGet
   2     2    1     0:RTC_CONNECT
   1     1    2     0:DATA__
 LOCK: flag to unlock received on run level 7
 LOCK: unlock run level 7
 LOCK: unwound to k 1, no more locked
 msgHold: T22592 idling Wed Apr 30 19:33:01 PDT 2008

Here is a case where the first socket_ack failed.

It shows send_files starting to run:

 send_files begin: Wed Apr 30 19:35:47 PDT 2008:
   1080501_HO.bin
   1080501_HG.bin
   1080501_NQ.bin
   1080501_BP.bin
 Server local is listening on port 9882
 Clients:
  socket 2, port  9881, conn C>S, XXX.XX.148.191 dale topsdog
 send_files: sending 1373 byte archive of 4 files

But apparently RTC_CONNECT was running and in an idle state waiting
to read msgcomm (word msgGet).  This jmptable shows that D2 has 
started running above its idle run level, a bad place to be.

 LOCK: flag to unlock received on run level 13
 send_files: socket_ack on socket 2 failed
 Jmp table at lev 10
  lev   ret  typ  Lib:lib
  10     2    1     0:send_files
   9     2    1     0:dir_monitor
   8     2    1     0:D2
   7     5    1     0:idle
   6     2    1     0:idle
   5     2    1     0:msgHold
   4     2    1     0:msgComm
   3     2    1     0:msgGet
   2     2    1     0:RTC_CONNECT
   1     1    2     0:DATA__
 send_files: to socket 2 Wed Apr 30 19:35:47 PDT 2008

There is no line about socket_ack ok or failed.  That is a puzzle.

This is probably from socket_ack:

 writen1: socket 2 is not open

and writen1 calls sclose(), so the socket will close.  

These are from server_close, set to run when sclose() runs:

 server_close: closing socket 2 Wed Apr 30 19:35:47 PDT 2008
 Server local is listening on port 9882
 No clients

These are probably from remoterun2, but it should never have been
started since the socket was just now closed:

 netvolwrite: 3 bytes out of 1381 to socket 2
 remoterun2: error writing T1 to socket 2

And send_files ends:

 send_files end: Wed Apr 30 19:35:47 PDT 2008


What a mess.  In a short time, an RTC_CONNECT message gets through and
then by some miracle the connection is reestablished.

 msgHold: T22592 idling Wed Apr 30 19:36:06 PDT 2008
 msgHold: T22592 idling Wed Apr 30 19:36:06 PDT 2008
 msgHold: T22592 idling Wed Apr 30 19:36:06 PDT 2008
 msgHold: T22592 idling Wed Apr 30 19:36:06 PDT 2008
 msgHold: T22592 idling Wed Apr 30 19:36:06 PDT 2008
 msgHold: T22592 idling Wed Apr 30 19:36:06 PDT 2008
 msgHold: T22592 idling Wed Apr 30 19:36:06 PDT 2008
 msgHold: T22592 idling Wed Apr 30 19:36:45 PDT 2008
 msgHold: T22592 idling Wed Apr 30 19:36:45 PDT 2008
 msgHold: T22592 idling Wed Apr 30 19:36:45 PDT 2008
 msgHold: T22592 idling Wed Apr 30 19:36:45 PDT 2008
 msgHold: T22592 idling Wed Apr 30 19:36:45 PDT 2008
 msgHold: T22592 idling Wed Apr 30 19:36:45 PDT 2008
 msgHold: T22592 idling Wed Apr 30 19:36:45 PDT 2008
 msgHold: T22592 idling Wed Apr 30 19:36:45 PDT 2008
 msgHold: T22592 idling Wed Apr 30 19:36:45 PDT 2008
 msgHold: T22592 hit maxwait Wed Apr 30 19:36:45 PDT 2008
 msgHold: T22592 idling Wed Apr 30 19:37:34 PDT 2008
 msgHold: T22592 idling Wed Apr 30 19:37:53 PDT 2008
 msgHold: T22592 idling Wed Apr 30 19:37:53 PDT 2008
 msgHold: T22592 idling Wed Apr 30 19:37:53 PDT 2008
 msgHold: T22592 idling Wed Apr 30 19:37:53 PDT 2008
 msgHold: T22592 idling Wed Apr 30 19:37:53 PDT 2008
 msgHold: T22592 idling Wed Apr 30 19:37:53 PDT 2008
 msgHold: T22592 idling Wed Apr 30 19:37:53 PDT 2008
 msgHold: T22592 idling Wed Apr 30 19:37:53 PDT 2008
 msgHold: T22592 idling Wed Apr 30 19:37:53 PDT 2008
 msgHold: T22592 hit maxwait Wed Apr 30 19:37:53 PDT 2008


This shows that CLIENT and the reconnection process are also running at
the wrong run levels above idle:

 RTC_CONNECT: received on Wed Apr 30 19:39:48 PDT 2008
 LOCK: locked on run level 12
 Jmp table at lev 12
  lev   ret  typ  Lib:lib
  12     5    1     0:CLIENT
  11     2    1     0:CLIENT
  10     2    1     0:server_connect
   9     1    2     0:RTC_CONNECT
   8     2    1     0:RTC_CONNECT
   7     5    1     0:idle
   6     2    1     0:idle
   5     2    1     0:msgHold
   4     2    1     0:msgComm
   3     2    1     0:msgGet
   2     2    1     0:RTC_CONNECT
   1     1    2     0:DATA__
 LOCK: flag to unlock received on run level 12
 LOCK: unlock run level 12

This is a place in LOCK1() where it should know it is in trouble:

 LOCK: still locked at k+1: 7
 Jmp table at lev 11
  lev   ret  typ  Lib:lib
  11     2    1     0:CLIENT
  10     2    1     0:server_connect
   9     1    2     0:RTC_CONNECT
   8     2    1     0:RTC_CONNECT
   7     5    1     0:idle
   6     2    1     0:idle
   5     2    1     0:msgHold
   4     2    1     0:msgComm
   3     2    1     0:msgGet
   2     2    1     0:RTC_CONNECT
   1     1    2     0:DATA__

In spite of all this, the connection succeeds:
 server_connect: connected to XXX.XX.148.191:9881 Wed Apr 30 19:39:49 PDT 2008

time_sync fails because no bytes can get through to the deaf server:

 time_sync: failed to obtain time from remote
 server_connect: time sync with remote failed

But the connection has succeeded, and here it is, from here to there
(topsdog):

 Server local is listening on port 9882
 Clients:
  socket 2, port  9881, conn C>S, XXX.XX.148.191 dale topsdog


And send_files continues to work, except that socket_ack always fails
and D2 is running at the wrong run level, above idle.  It is amazing
that this bad state continues to operate for a long time:

 send_files begin: Wed Apr 30 19:39:58 PDT 2008:
   1080501_NG.bin
   1080501_US.bin
   1080501_W.bin
   1080501_SP.bin
   1080501_HU.bin
   1080501_SF.bin
   1080501_TN.bin
   1080501_BO.bin
   1080501_SM.bin
   1080501_CL.bin
   1080501_S.bin
   1080501_C.bin
 Server local is listening on port 9882
 Clients:
  socket 2, port  9881, conn C>S, XXX.XX.148.191 dale topsdog
 send_files: sending 2905 byte archive of 12 files
 send_files: socket_ack on socket 2 failed
 Jmp table at lev 10
  lev   ret  typ  Lib:lib
  10     2    1     0:send_files
   9     2    1     0:dir_monitor
   8     2    1     0:D2
   7     5    1     0:idle
   6     2    1     0:idle
   5     2    1     0:msgHold
   4     2    1     0:msgComm
   3     2    1     0:msgGet
   2     2    1     0:RTC_CONNECT
   1     1    2     0:DATA__
 send_files: to socket 2 Wed Apr 30 19:40:04 PDT 2008
 send_files end: Wed Apr 30 19:40:04 PDT 2008
 msgHold: T22592 idling Wed Apr 30 19:40:50 PDT 2008
 msgHold: T22592 idling Wed Apr 30 19:40:50 PDT 2008


Wed Apr 30 09:44:02 PDT 2008

The work below seemed to fix the problem, but it did not.

Simulating the hangup of tops_rtcmon

It is believed that the hangup of tops_rtcmon, noted first when it
would not accept CLIENT commands on its SERVER and then later when it
bombed, was due to multiple calls to word idle that jumped the program
to higher run levels, coupled with words running in the multitasker.

The multiple calls to idle, causing higher run levels, probably came
from word msgHold.  Then while at a higher idle run level, TASK word
D2 happened to run, embedding itself above the idle run levels.

When D2 finished, the program remained at the higher idle run levels.

A fix has been made to word idle to not allow it to run again if the
program is already idling.

Output below shows running "4 idle1."

The program continually runs D2, as it is supposed to, but idle1 has
not properly returned, so D2 is running at too high a run level.

In tops_rtcmon, it is believed that msgHold idle calls produced this
multiplicity of idle run levels.

To avoid this problem, a BUSY flag has been added to idle to not allow
idle to be run if the program is already idling.  Logically, if the
program is idling it should not be doing anything, even running word
idle.  The flaw was in having a multitasker word running idle again
while idle was still busy.

The line "BUSY IF drop return THEN" in idle1 fixes the problem for
this demo.  It has been incorporated into the program's word idle in
boot.v.
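
The effect of the BUSY guard can be sketched outside tops.  The Python
model below is only an illustration under stated assumptions: run_idle,
idle and task are hypothetical names, the nesting counter stands in for
the run level, and nothing here reflects how tops implements idle or
its multitasker.  It shows that when a task calls idle while idle is
still running, the nesting grows by one per re-entry, and that an early
return on BUSY holds it at one.

```python
# Hypothetical model of re-entrant idling; not tops internals.
def run_idle(guard, rounds):
    """Return the deepest nesting reached when a task re-enters idle."""
    state = dict(busy=False, depth=0, deepest=0)

    def idle(remaining):
        # The fix: refuse to idle again if already idling.
        if guard and state["busy"]:
            return
        state["busy"] = True
        state["depth"] += 1
        state["deepest"] = max(state["deepest"], state["depth"])
        if remaining > 0:
            task(remaining - 1)  # a task fires while idle is running
        state["depth"] -= 1
        state["busy"] = False

    def task(remaining):
        # Stands in for a word such as msgHold that calls idle itself.
        idle(remaining)

    idle(rounds)
    return state["deepest"]
```

In this model, run_idle(False, 5) nests six deep, while
run_idle(True, 5) never goes past one.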

Running D2
 Jmp table at lev 6
  lev   ret  typ  Lib:lib
   6     2    1     0:D2
   5     5    1     0:idle1
   4     2    1     0:idle1
   3     1    2     0:DATA__
   2     2    1     0:console
   1     1    2     0:DATA__
Running D2
 Jmp table at lev 6
  lev   ret  typ  Lib:lib
   6     2    1     0:D2
   5     5    1     0:idle1
   4     2    1     0:idle1
   3     1    2     0:DATA__
   2     2    1     0:console
   1     1    2     0:DATA__
}
   inline: D2 ( --- )
      "Running D2" . nl jmptable
      "IDLE" "SEC" yank 2 / "D2" ALARM \ set to run again
   end

   inline: IDLE ( --- )
      [ 0 "SEC" book no "BUSY" book ]
      BUSY IF return THEN
      "running IDLE" . nl
      yes "BUSY" book

    \ Set word D2 to run in SEC/2:
      "IDLE: set ALARM to run word D2 in " SEC 2 / 0.5 + intstr + . nl
      SEC 2 / "D2" ALARM

    \ Idle for SEC:
      SEC idle1
   end

   inline: idle1 (secs --- ) \ idle for secs
      [ no "BUSY" book ]
      no NUM stkok not IF "idle" stknot return THEN

      BUSY
      IF "idle1: second entry from IDLE" . nl THEN

      yes "BUSY" book

    \ If IDLE is not BUSY, set an ALARM to run it in less than
    \ idle secs, and at that time to also idle for secs:
      "IDLE" "BUSY" yank not
      IF dup (secs) "IDLE" "SEC" bank
         dup 2 / (secs/2) "IDLE" ALARM
         "idle1: set ALARM to run IDLE in " over 2 / 0.5 + intstr + . nl
      THEN

    \ This line fixes the hangup (comment it out to cause the hangup):
      BUSY IF drop return THEN

      "UNLOCK" swap (secs) LOCK
      no "BUSY" book
   end

{
Wed Apr 30 07:41:39 PDT 2008
Cases below show the jmptable as the problem with tops_rtcmon develops.


This is normal:

 send_files: sending 3971 byte archive of 4 files
 send_files: socket_ack on socket 2 ok
 Jmp table at lev 4
  lev   ret  typ  Lib:lib
   4     2    1     0:send_files
   3     2    1     0:dir_monitor
   2     2    1     0:D2
   1     1    2     0:DATA__

First time socket_ack failed (msgPoll, which is RTC_CONNECT, puts
itself to sleep while it runs msgGet; this does not seem to change
anything):

 Server local is listening on port 9887
 Clients:
  socket 2, port  9881, conn C>S, XXX.XX.148.191 dale topsdog
 send_files: sending 2359 byte archive of 4 files
 send_files: socket_ack on socket 2 ok
 send_files: to socket 2 Wed Apr 30 05:28:52 PDT 2008
 send_files end: Wed Apr 30 05:28:53 PDT 2008
 msgHold: T10575 idling Wed Apr 30 05:28:54 PDT 2008
 msgHold: T10575 idling Wed Apr 30 05:30:02 PDT 2008
 send_files begin: Wed Apr 30 05:30:12 PDT 2008:
   1080430_TN.bin
   1080430_CL.bin
   1080430_HO.bin
   1080430_HG.bin
 Server local is listening on port 9887
 Clients:
  socket 2, port  9881, conn C>S, XXX.XX.148.191 dale topsdog
 send_files: socket_ack on socket 2 failed
 Jmp table at lev 10
  lev   ret  typ  Lib:lib
  10     2    1     0:send_files
   9     2    1     0:dir_monitor
   8     2    1     0:D2
   7     5    1     0:idle
   6     2    1     0:idle
   5     2    1     0:msgHold
   4     2    1     0:msgComm
   3     2    1     0:msgGet
   2     2    1     0:RTC_CONNECT
   1     1    2     0:DATA__


Getting worse:

 send_files: socket_ack on socket 2 failed
 Jmp table at lev 16
  lev   ret  typ  Lib:lib
  16     2    1     0:send_files
  15     2    1     0:dir_monitor
  14     2    1     0:D2
  13     5    1     0:idle
  12     2    1     0:idle
  11     2    1     0:msgHold
  10     2    1     0:msgComm
   9     2    1     0:msgGet
   8     2    1     0:RTC_CONNECT
   7     5    1     0:idle
   6     2    1     0:idle
   5     2    1     0:msgHold
   4     2    1     0:msgComm
   3     2    1     0:msgGet
   2     2    1     0:RTC_CONNECT
   1     1    2     0:DATA__
 send_files: to socket 2 Wed Apr 30 09:03:27 PDT 2008
 send_files end: Wed Apr 30 09:03:27 PDT 2008


Going out of control:

...

 send_files begin: Wed Apr 30 09:04:38 PDT 2008:
   1080430_BO.bin
   1080430_SB.bin
   1080430_SM.bin
   1080430_S.bin
 Server local is listening on port 9882
 Clients:
  socket 2, port  9881, conn C>S, XXX.XX.148.191 dale topsdog
 send_files: sending 2399 byte archive of 4 files
 send_files: socket_ack on socket 2 failed
 Jmp table at lev 112
  lev   ret  typ  Lib:lib
 112     2    1     0:send_files
 111     2    1     0:dir_monitor
 110     2    1     0:D2
 109     5    1     0:idle
 108     2    1     0:idle
 107     2    1     0:msgHold
 106     2    1     0:msgComm
 105     2    1     0:msgGet
 104     2    1     0:RTC_CONNECT
 103     5    1     0:idle
 102     2    1     0:idle
 101     2    1     0:msgHold
 100     2    1     0:msgComm
  99     2    1     0:msgGet
  98     2    1     0:RTC_CONNECT

...

   7     5    1     0:idle
   6     2    1     0:idle
   5     2    1     0:msgHold
   4     2    1     0:msgComm
   3     2    1     0:msgGet
   2     2    1     0:RTC_CONNECT
   1     1    2     0:DATA__
 send_files: to socket 2 Wed Apr 30 09:04:45 PDT 2008
 send_files end: Wed Apr 30 09:04:45 PDT 2008
 tasker: busy task RTC_CONNECT,0:CODE__ interrupted Wed Apr 30 09:05:15 PDT 2008
 msgHold: T4940 idling Wed Apr 30 09:05:15 PDT 2008
 msgHold: T4940 idling Wed Apr 30 09:05:16 PDT 2008
 msgHold: T4940 idling Wed Apr 30 09:05:16 PDT 2008
 tasker: busy task RTC_CONNECT,0:CODE__ interrupted Wed Apr 30 09:05:16 PDT 2008
 msgHold: T4940 idling Wed Apr 30 09:05:16 PDT 2008
 msgHold: T4940 idling Wed Apr 30 09:05:17 PDT 2008
 msgHold: T4940 idling Wed Apr 30 09:05:17 PDT 2008
 bufup: maximum run level exceeded
 bufup: maximum run level exceeded
 file: old file not found: forn
 fget: invalid file handle
 strmatch: expect string or volume on stack
 fclose: invalid file handle
 tasker: busy task RTC_CONNECT,0:CODE__ interrupted Wed Apr 30 09:05:17 PDT 2008
 msgHold: T4940 idling Wed Apr 30 09:05:17 PDT 2008
 bufup: maximum run level exceeded
 bufup: maximum run level exceeded
 file: old file not found: forn

 msgHold: T4940 idling Wed Apr 30 09:05:32 PDT 2008
 bufup: maximum run level exceeded
 bufup: maximum run level exceeded

...

 runaway detected: execution halted on run level 127
 faulty phrase: "*" def_port nextport DSERVER
 HALT at Wed Apr 30 09:05:32 PDT 2008
 RTC_CONNECT: received on Wed Apr 30 09:05:32 PDT 2008
 server_close: closing socket 2 Wed Apr 30 09:05:32 PDT 2008
 server_close: closing socket 2 Wed Apr 30 09:05:32 PDT 2008
 Server local is listening on port 9882
 No clients
 Server local is listening on port 9882
 No clients
 server_connect: connected to XXX.XX.148.191:9881 Wed Apr 30 09:05:33 PDT 2008
 bufup: maximum run level exceeded
 grepe: expect string or volume on stack
 bufup: maximum run level exceeded
 time_sync: failed to obtain time from remote
 server_connect: time sync with remote failed
 Server local is listening on port 9882
 Clients:
  socket 2, port  9881, conn C>S, XXX.XX.148.191 dale topsdog
 fault at word: server_connect
 run: return stack not empty, onbuf: 119, onrun: 100
 run: return stack not empty, onbuf: 118, onrun: 99
 msgHold: T4940 idling Wed Apr 30 09:05:34 PDT 2008
 tasker: busy task RTC_CONNECT,0:CODE__ interrupted Wed Apr 30 09:05:35 PDT 2008
 msgHold: T4940 idling Wed Apr 30 09:05:35 PDT 2008
 msgHold: T4940 idling Wed Apr 30 09:05:35 PDT 2008
 msgHold: T4940 hit maxwait Wed Apr 30 09:05:35 PDT 2008

}

