#!/usr/local/bin/tops -p -s /usr/local/tops/sys -u /opt/mytops/usr/
{  File tops_rtc  January 2008

   Copyright (C) 2008-2012  Dale R. Williamson

   Real time collection (rtc) from a remote machine

   This script runs as a daemon connected to a remote machine that is 
   collecting data.  To become connected, the daemon causes a comple-
   mentary daemon on the remote machine to connect to this daemon's 
   server, and the connection is kept open to receive an asynchronous 
   flow of data in real time.

   To make a permanent connection to the remote machine, this script
   runs msgPutIP() to put an "RTC_CONNECT" message on the remote's in-
   terprocess communication file (file dog.v).

   Message RTC_CONNECT is the agreed-to message that initiates the con-
   nection from the complementary daemon on the remote machine to this 
   one.  The RTC_CONNECT message placed on the remote machine contains 
   this machine's IP address and this daemon's listening port within a 
   phrase that fires the remote daemon's word server_connect() when the
   remote daemon runs it.

   The remote daemon will be polling for such a message (running word 
   msgPoll(), looking for message RTC_CONNECT) and will shortly find it
   and make a connection to the server here using its word server_con-
   nect().  This makes the permanent connection through which real time
   data will flow.

   Script tops_rtcmon is an example of what the remote daemon server
   is running, and it contains word server_connect() mentioned above.
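
   As an illustration only, the phrase carried by RTC_CONNECT can be
   sketched in Python (this is not Tops code, and the function name
   connect_phrase is hypothetical).  It follows the template used by
   word CONN below: an optional "yes ssl_connect " prefix when argv
   -ssl is present, then the quoted IP address, the listening port and
   the word server_connect:

```python
def connect_phrase(ip, port, ssl=False):
    """Build the phrase that the remote daemon will run (sketch only)."""
    # Optional prefix, as sent when argv -ssl is present
    prefix = "yes ssl_connect " if ssl else ""
    # Same shape as the template 'ip' PORT server_connect in word CONN
    return prefix + "'" + ip + "' " + str(port) + " server_connect"


if __name__ == "__main__":
    print(connect_phrase("71.106.241.234", 9884, ssl=True))
    print(connect_phrase("71.107.4.6", 9886))
```

   The two calls reproduce the shape of the phrases that appear in the
   msgPutIP log lines and in the comments of word CONN in this file.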

------------------------------------------------------------------------

   Contents:

      "tops_rtc" asciiload this " inline:" grepr reach dot

   inline: collect0 ( --- ) \ set up collector 0
   inline: collect1 ( --- ) \ set up collector 1
   inline: collect2 ( --- ) \ set up collector 2

   inline: CLOSE ( --- ) \ this program will close
   inline: CONN ( --- ) \ leave a message on the remote to connect here
   inline: CONN_CLS (nS --- ) \ action when connection on S has closed
   inline: CONN_RECON ( --- ) \ reconnect to collector
   inline: CONN_RESTART ( --- ) \ restart this program
   inline: CONN_SET (nS --- ) \ set up for collector just connected on S
   inline: current ( --- ) \ local files brought current to remote ones
   inline: current_for (nYYYMMDD --- ) \ local files current for YYYMMDD
   inline: extract_files (hT --- ) \ extract files contained in volume T
   inline: get_archive (hFiles --- hT) \ get Files archive from remote
   inline: local_files ( --- hT) \ list of all local files
   inline: local_for (nYYYMMDD --- hT) \ list of local files for YYYMMDD
   inline: msgID ( --- qM) \ string for CONN message on msgcomm
   inline: rtc_close ( --- ) \ close connection to remote
   inline: rtc_files ( --- hT) \ list of all remote files
   inline: rtc_for (nYYYMMDD --- hT) \ list of remote files for YYYMMDD

------------------------------------------------------------------------

   The Appendix below shows the log file during a period of connection 
   problems with the remote server.

------------------------------------------------------------------------

   Update notes.

   Fri May  6 13:29:54 PDT 2011.
      Running SERVER is a bad idea.  Revert to DSERVER and revise word
      CONN_RESTART to run an updated version of tops_rtccon that will
      kill collector N and start a new collector N.

   Tue May  3 20:34:49 PDT 2011.  This script now runs SERVER instead
      of DSERVER, since the server runs in a process within a window
      when it is started by tops_rtccon from the Root Menu selection 
      "Data Collectors."  Running in a window may mean that technically
      a daemon should not be used.  The code for DSERVER (term.c) does 
      forking at start up (SERVER does not), and that may have caused 
      the problem with restart described below in Tue May  3 17:00:01
      PDT 2011. 

      This restart problem is recent because restart is a new approach 
      to replace the former reconnection procedure that was prone to 
      error.  In retrospect, perhaps the old procedure suffered from 
      the same problem, but less often, and now it would be ok under
      SERVER.  There is no plan to go back and explore this possibility.

      In limited restarts (3) no error has occurred with SERVER, while 
      two failures in a row were previously obtained with DSERVER dur-
      ing fruitless tests with delay.

   Tue May  3 20:27:44 PDT 2011.  Using a delay in assigning listening
      port does not help solve the problem with restart noted below.  
      Failure with errno 88 is obtained when CONN_RESTART runs the re-
      start command, but pasting the restart command at a Unix command 
      prompt succeeds.  When there is a problem, socket 1 is obtained
      by the system instead of socket 0.  Perhaps the problem is related
      to running DSERVER inside a window, as it is when tops_rtccon 
      starts everything on Sunday.  There really is no reason to run 
      DSERVER in this case.  Try instead running SERVER, which does not
      fork.

   Tue May  3 17:00:01 PDT 2011.  When CONN_RESTART is run to restart 
      the program, running Accept() (term.c) with the same port some-
      times produces errno 88 (socket operation on non-socket).  It is
      possible that a delay will remove the problem if it is due to 
      insufficient wait time.

   Sun Apr 17 19:49:39 PDT 2011.  The local receiver of data (tops_rtc)
      from collector 2 has been set up to run as an SSL server if argv 
      -ssl is present in the command line, as in:

         % /usr/local/tops/usr/tops_rtc -ssl -collect 2

      Trying this with the collector 1 receiver will not work until
      remote collector 1 is running OpenSSL and tops_https.

      Below is shown log file tops_rtc2.log here on plunger as tops_rtc 
      starts up to receive collector 2 data from fortycoupe.com.  On 
      fortycoupe, a tops_https secure server is running to receive the
      encrypted RTC_CONNECT message from here, decrypt it and write it 
      to the interprocess message file, msgcomm.  

      The remote daemon (tops_rtcmon) that polls for and receives the 
      RTC_CONNECT message will connect here as CLIENT_SSL due to the 
      added string "yes ssl_connect" in the message sent (see yes>> in 
      log below).  No change to remote program tops_rtcmon was made.

         PID 19491 Sun Apr 17 19:48:53 PDT 2011
          SFLUSHX on, Sun Apr 17 19:48:53 PDT 2011
          NIST_DELTA: time sync with NIST, -3 sec, Sun Apr 17 19:48:53
          tops_rtc: starting DSERVER_SSL on port 9884 Sun Apr 17 19:48:5
          DSERVER_SSL: secure daemon server on port 9884, socket 0
          CONN sending connect msg to 205.134.240.76 Sun Apr 17 19:49:02
          msgPutIP: connected to 205.134.240.76
    yes>> msgPutIP OK: "yes ssl_connect '71.106.241.234' 9884 
             server_connect" "RTC_CONNECT" msgPut
          msgPutIP: connection closed

         Sun Apr 17 19:49:05 PDT 2011 SERVER: 205.134.240.76 connect
          1496 bytes delta: memprobe socket 2 connect
          CONN_SET: connection on socket 2 Sun Apr 17 19:49:05 PDT 2011
    651>> extract_files: to /mdat/edat2/ 651 bytes Sun Apr 17 19:49:46 
          extract_files: to /mdat/edat2/ 435 bytes Sun Apr 17 19:50:56 

      Below is shown the collector 2 log file (tops_rtcmon.log) on re-
      mote fortycoupe.com as it receives the new message with "yes 
      ssl_connect" and connects here to plunger.  It shows socket 2s 
      in the Clients table, where "s" denotes secure connection, due to
      proper response to "yes ssl_connect."

      Note that its local server listening on port 9879 is not secure
      (or it would say "Secure server local is listening"), and is only 
      for local service connections through remoteprompt.

         No clients
         RTC_CONNECT: received on Mon Apr 18 02:49:03 UTC 2011
         server_connect: connected to 71.106.241.234:9884 Mon Apr 18 02
         Server local is listening on port 9879
         Clients:
          socket 2s, port  9884, conn C>S, 71.106.241.234 dale plunger

         send_files begin: Mon Apr 18 02:49:45 UTC 2011:
           1110418_HG.bin
           1110418_NQ.bin
           1110418_SP.bin
           1110418_DJ.bin
           1110418_JY.bin
           1110418_EU.bin
           1110418_SF.bin
         Server local is listening on port 9879
         Clients:
          socket 2s, port  9884, conn C>S, 71.106.241.234 dale plunger
   651>> send_files: sending 651 byte archive of 7 files
         socket_ack1: 263 msec wait for ack from socket 2
         socket_ack1: ok
         send_files: to socket 2 Mon Apr 18 02:49:45 UTC 2011
         send_files end: Mon Apr 18 02:49:45 UTC 2011

      This shows a remoteprompt connection to local program tops_rtc 
      (the one that is writing the tops_rtc2.log output shown above).
      We see that it is running a secure server and has a secure socket
      connection 2s to fortycoupe; socket 3s is us on the remoteprompt
      connection at remote interactive prompt tops@socket3:

         [dale@plunger] /home/dale > tops
                  Tops 3.2.0
         Sun Apr 17 20:37:42 PDT 2011
         [tops@plunger] ready > IPloop 9884 CLIENT_SSL remoteprompt
         tops@socket3 > clients
          Secure server local is listening on port 9884
          Clients:
           socket 2s, port 43912, conn S<C, 205.134.240.76 LOGIN 
              dale fortycoupe
           socket 3s, port 33556, conn S<C, 127.0.0.1 LOGIN 
              dale plunger
         tops@socket3 > 

   Thu Mar 24 08:06:41 PDT 2011.  With new output from socket_ack1 on
      the remote program connected to this one, it was learned that 
      typical wait time for acknowledgment (ack) from this program 
      (tops_rtc) to web machine fortycoupe (running tops_rtcmon) is 
      150 to 170 msec.  When the connection goes bad, tops_rtcmon hits
      the max wait time for socket ack of 12000 msec, and it closes 
      the connection to this program.  

      If the connection is already bad before the remote machine closes
      its socket, then this program will not know that the remote has 
      closed its end.  But this program will eventually sense that too 
      much time has passed between updates and it will close the socket
      on this end and initiate reconnection through the "RTC_CONNECT"
      message approach described at the top of this file.
     
      This shows a period from the log of tops_rtcmon running on forty-
      coupe and doing socket ack when the connection to tops_rtc on 
      plunger became poor (printed lines are about 3 or 4 minutes apart
      unless there is trouble with the connection):

         socket_ack1: 158 msec wait for ack from socket 2  [typical]
         socket_ack1: 604 msec wait for ack from socket 2  [slow]
         socket_ack1: 986 msec wait for ack from socket 2  [slow]
         socket_ack1: 3192 msec wait for ack from socket 2 [slow]
         socket_ack1: 613 msec wait for ack from socket 2  [slow]
         socket_ack1: 1005 msec wait for ack from socket 2 [slow]
         socket_ack1: 600 msec wait for ack from socket 2  [slow]

      The connection was closed in the following case when 12 seconds
      was reached for socket ack; no data was sent to plunger:

         socket_ack1: 12023 msec wait for ack from socket 2 [failed]

      Eventually plunger sensed it wasn't getting data and sent a mes-
      sage to reconnect.  After reconnection, data got through once 
      before socket ack again failed at 12 seconds:

         socket_ack1: 1184 msec wait for ack from socket 2  [slow]
         socket_ack1: 12008 msec wait for ack from socket 2 [failed]

      Raising the max time to 24 seconds did no good; the higher max
      time was hit and then the connection was closed:

         socket_ack1: 24065 msec wait for ack from socket 2 [failed]

      Other failed socket acks:

         socket_ack1: 12010 msec wait for ack from socket 2 [failed]
         socket_ack1: 11990 msec wait for ack from socket 2 [failed]

      The tops_rtc job on plunger was restarted to get a new listening
      socket, and performance became good again:

         socket_ack1: 161 msec wait for ack from socket 2 [typical]
         socket_ack1: 165 msec wait for ack from socket 2 [typical]
         socket_ack1: 164 msec wait for ack from socket 2 [typical]
         socket_ack1: 159 msec wait for ack from socket 2 [typical]
         socket_ack1: 161 msec wait for ack from socket 2 [typical]
         socket_ack1: 163 msec wait for ack from socket 2 [typical]

      From these results, it was decided to make the tops_rtc job re-
      start itself--to get a new port--rather than try to reconnect
      to the remote tops_rtcmon job using the old port (and socket).

      New word CONN_RESTART is run when CONN_RECON senses a bad socket 
      (due to too much time and no new data, or due to the remote clos-
      ing the socket).
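
      The test made by CONN_RECON can be sketched roughly in Python
      (illustration only, not Tops code).  The function name and its
      arguments are hypothetical; the values of SEC and SMAX are the
      ones booked in word CONN_RECON below, and the sketch assumes 
      that closing the connection leaves SOCK invalid (-1):

```python
SEC = 180   # seconds between checks, as booked in CONN_RECON
SMAX = 600  # max seconds allowed between received files


def needs_restart(now, t_extract, sock, sock_is_open, locked=False):
    """Return True when this sketch would go on to run CONN_RESTART."""
    if locked:
        return False   # program is locked: only check again in SEC
    if now - t_extract > SMAX:
        return True    # too much time since the last extraction
    if sock > -1 and not sock_is_open:
        return True    # SOCK is no longer an open client
    return sock < 0    # SOCK was already invalidated


if __name__ == "__main__":
    print(needs_restart(now=1000, t_extract=900, sock=6, sock_is_open=True))
    print(needs_restart(now=1000, t_extract=300, sock=6, sock_is_open=True))
```

      The sketch folds the close and the restart into one decision; in
      the script the two steps are separate words.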

      CONN_RESTART starts a new tops_rtc job and then kills itself.  
      The new job will immediately connect to remote tops_rtcmon on a
      new port, ready to receive collected data.
      
      To test CONN_RESTART, connection through remoteprompt was made to 
      the running tops_rtc job and the following was run to force word
      CONN_RESTART to run in 10 seconds:

         tops@socket3 > 10 "CONN_RESTART" ALARM

      before exiting remoteprompt and closing the remoteprompt connec-
      tion.  After 10 seconds, CONN_RESTART ran and the following out-
      put from log file tops_rtc2.log shows the new program starting 
      and the old one closing:

[Begin output from log file tops_rtc2.log]

 extract_files: to /mdat/edat2/ 2877 bytes Thu Mar 24 07:09:16 PDT 2011
 extract_files: to /mdat/edat2/ 3244 bytes Thu Mar 24 07:10:56 PDT 2011
 CONN_RESTART: restarting the program Thu Mar 24 07:12:01 PDT 2011
 rtc_close: closing socket 6 to collector Thu Mar 24 07:12:01 PDT 2011

PID 21461 Thu Mar 24 07:12:04 PDT 2011
 No tasks defined
 NIST_DELTA: time sync with NIST, -16 sec, Thu Mar 24 07:12:04 PDT 2011
 DSERVER: daemon server on port 9885, socket 1
 msgPutIP: connected to 205.134.240.76

Thu Mar 24 07:12:14 PDT 2011 SERVER: 205.134.240.76 connect
 24 bytes delta: memprobe socket 7 connect
 msgPutIP OK: "'71.106.236.39' 9885 server_connect" "RTC_CONNECT" msgPut
 msgPutIP: connection closed
 CONN_SET: connection on socket 7 Thu Mar 24 07:12:15 PDT 2011
 CONN_RESTART: the old program is closing Thu Mar 24 07:12:16 PDT 2011
 killmy: killing pid 21025 Thu Mar 24 07:12:16 PDT 2011
 extract_files: to /mdat/edat2/ 2808 bytes Thu Mar 24 07:14:17 PDT 2011
 extract_files: to /mdat/edat2/ 2803 bytes Thu Mar 24 07:16:28 PDT 2011
 extract_files: to /mdat/edat2/ 3148 bytes Thu Mar 24 07:18:09 PDT 2011
 extract_files: to /mdat/edat2/ 3072 bytes Thu Mar 24 07:22:50 PDT 2011

[End output from log file tops_rtc2.log]

------------------------------------------------------------------------

   Interactive testing.

   If running the collector 0 test on plunger, start a dummy collector
   job to which this test will connect (Mar 2010):
      [dale@plunger] /home/dale > tops_rtcmon

   To test this file interactively, start the program with an argv for 
   a running collector, then source this file and start a SERVER on 
   PORT.  

   This starts the program with argv for collector 1:
      [dale@plunger] /home/dale > tops -collect 1
               Tops 3.0.1
      Thu Apr 10 16:00:27 PDT 2008
   This is needed if testing with argv -collect 0 on plunger (Mar 2010):
      [tops@plunger] ready > yes "TESTING" book

      [tops@plunger] ready > "tops_rtc" source

      [tops@plunger] ready > "" PORT SERVER

      [tops@plunger] ready > clients
       Server local is listening on port 9879
       No clients

   Run CONN to make the connection with remote collector 1.  

   This shows msgPutIP connecting to the HTTP server at XXX.XX.48.191,
   leaving the message server_connect('YYY.YYY.244.138', 9879) and 
   closing the connection.  

      [tops@plunger] ready > (ntrace) CONN \ use ntrace for more output
       msgPutIP: connected to XXX.XX.48.191
       msgPutIP OK: "'YYY.YYY.244.138' 9879 server_connect"
                    "RTC_CONNECT" msgPut
       msgPutIP: connection closed

   A few moments later, XXX.XX.48.191 connects here to port 9879, as
   shown by the CONN_SET log entry at 16:01:05.  

      Thu Apr 10 16:01:05 PDT 2008 SERVER: XXX.XX.48.191 connect
       -512 bytes delta: memprobe socket 6 connect
       CONN_SET: connection on socket 6 Thu Apr 10 16:01:05 PDT 2008

   The clients list shows "S<C, XXX.XX.48.191" indicating that client
   (C) XXX.XX.48.191 has connected to the server (S) here (S<C):

      [tops@plunger] ready > clients
       Server local is listening on port 9879
       Clients:
        socket 6, port  4188, conn S<C, XXX.XX.48.191 LOGIN dale topsdog

   This multitasker task checks every 180 seconds that the connection 
   is still intact:

      [tops@plunger] ready > tasks
       Multitasker tasks:
        CONN_RECON,0:CODE__ alarm period 180 seconds; remaining 131

   Exiting closes the connection and causes CONN_CLS to run:

      [tops@plunger] ready > bye
       CONN_CLS: connection on 6 is closed
      59 keys
                Good-bye
      Thu Apr 10 16:04:03 PDT 2008
      [dale@plunger] /home/dale > 


   These lines in a script are handy to see what tops_rtcmon and
   tops_rtc jobs are running:

      #File rtc
      ps -Af --cols 512 | grep tops_rtc
      ps -Af --cols 512 | grep collect
 
------------------------------------------------------------------------
}
   CATMSG push no catmsg

\  Network setup.

   "IPlocal" "IP" macro                  \ this machine's IP address 
   def_port nextport intstr "PORT" macro \ this machine's listening port

{  Argv -collect equal to 0, 1 or 2 defines which collector will be 
   sending files and where they will be written; see usr/uboot.v for
   host-specific definitions of the following words:
      IPcol0, PORTcol0, epath0
      IPcol1, PORTcol1, epath1
      IPcol2, PORTcol2, epath2
}
   inline: collect0 ( --- ) \ set up collector 0
    \ If running on plunger, tops_rtccon will start topse to collect
    \ data like a remote collector, and this script is never run for
    \ collector 0.  

    \ But to test this word on plunger, start a dummy remote collector
    \ by running tops_rtcmon, and run yes "TESTING" book at the ready
    \ prompt before sourcing this file with "tops_rtc" source.

      "'TESTING' exists?" main 
      IF "TESTING" main ELSE no THEN "TESTING" book

      TESTING not IF host "plunger" = IF exit THEN THEN
 
      "HOME" env "tops_rtc0.log" catpath "LOG_RTC" mainbook
      "HOME" env "tops_rtc0mem.log" catpath "LOG_MEM" mainbook

    \ These macros defined when this word runs go into the main
    \ library for all to see:
      "epath0"   "DIR" macro     \ local dir receiving remote files
      "IPcol0"   "IPcol" macro   \ collector machine IP address

      "-ssl" argv chars 0=
      IF "PORTcol0" \ HTTP port on remote collector machine
      ELSE "443"    \ HTTPS port on remote collector machine
      THEN "PORTcol" macro

      "RTC0_SERVER" (qM) dup msgGet drop       \ remove old message
      PORT "-ssl" argv chars 0> IF negate THEN \ set sign bit if SSL
      (nPORT) intstr swap (qM nPORT) msgPut    \ put local port number

      TESTING \ override these macros:
      IF "IPloop" "IP"    macro \ this machine's IP address 
         "IPloop" "IPcol" macro \ collector machine IP address
      THEN
   end

   inline: collect1 ( --- ) \ set up collector 1
      "HOME" env "tops_rtc1.log" catpath "LOG_RTC" mainbook
      "HOME" env "tops_rtc1mem.log" catpath "LOG_MEM" mainbook

    \ These macros defined when this word runs go into the main 
    \ library for all to see:
      "epath1"   "DIR"     macro \ local dir receiving remote files
      "IPcol1"   "IPcol"   macro \ collector machine IP address

    \ Sun Apr 17 19:26:57 PDT 2011.  The remote machine is being used
    \ to test the use of SSL encrypted connections:
      "-ssl" argv chars 0=
      IF "PORTcol1" \ HTTP port on remote collector machine
      ELSE "443"    \ HTTPS port on remote collector machine
      THEN "PORTcol" macro

      "RTC1_SERVER" (qM) dup msgGet drop       \ remove old message
      PORT "-ssl" argv chars 0> IF negate THEN \ set sign bit if SSL
      (nPORT) intstr swap (qM nPORT) msgPut    \ put local port number
   end

   inline: collect2 ( --- ) \ set up collector 2
      "HOME" env "tops_rtc2.log" catpath "LOG_RTC" mainbook
      "HOME" env "tops_rtc2mem.log" catpath "LOG_MEM" mainbook

    \ These macros defined when this word runs go into the main 
    \ library for all to see:
      "epath2"   "DIR"     macro \ local dir receiving remote files
      "IPcol2"   "IPcol"   macro \ collector machine IP address

    \ Sun Apr 17 19:26:57 PDT 2011.  The remote machine is being used 
    \ to test the use of SSL encrypted connections:
      "-ssl" argv chars 0= 
      IF "PORTcol2" \ HTTP port on remote collector machine
      ELSE "443"    \ HTTPS port on remote collector machine
      THEN "PORTcol" macro 

      "RTC2_SERVER" (qM) dup msgGet drop       \ remove old message
      PORT "-ssl" argv chars 0> IF negate THEN \ set sign bit if SSL
      (nPORT) intstr swap (qM nPORT) msgPut    \ put local port number
   end
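
{  Illustration only (Python, not Tops).  Words collect0, collect1 and
   collect2 each store the local port number as a message, negated 
   when argv -ssl is present ("set sign bit if SSL").  The sketch below
   shows that encoding under the assumption that the stored text is a
   plain signed integer.  Both function names are hypothetical, and
   the reader side (decode_port) is a guess at how a consumer of the
   message might recover the port and the SSL flag:

```python
def encode_port(port, ssl=False):
    """Return the message text: the port number, negated when SSL is on."""
    return str(-port if ssl else port)


def decode_port(text):
    """Hypothetical reader: recover (port, ssl) from the message text."""
    value = int(text)
    return abs(value), value < 0


if __name__ == "__main__":
    print(encode_port(9884, ssl=True))
    print(decode_port(encode_port(9879)))
```
}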

   "-collect" argv chars 0= 
   IF " tops_rtc: collector must use argv -collect" . nl HALT THEN

   "-collect" argv "0" = IF collect0 THEN
   "-collect" argv "1" = IF collect1 THEN
   "-collect" argv "2" = IF collect2 THEN

   IPcol ALLOW \ SERVER allow connections from IPcol

\  The number for SOCK is set when the remote collector causes the
\  phrase
\     remotefd CONN_SET
\  to be run here.  See tops_rtcmon, word server_connect().
   -1 "SOCK" book \ will be valid when remote collector connects

\-----------------------------------------------------------------------

\  Words.

   "msgPut" missing IF "dog.v" source THEN

   inline: CLOSE ( --- ) \ this program will close
      nl " This program is closing " date + . nl
      remotesockets sclose
      5 "exit" ALARM
   end

   inline: CONN ( --- ) \ leave a message on the remote to connect here
{     This word, through word msgPutIP() below, makes a connection to 
      the remote's HTTP server and leaves a message for the companion 
      script to this one, running on the remote machine, to connect
      here, to IP address and listening PORT. 

      When this script is first started, this word CONN is run on an 
      ALARM that delays until SERVER on IP:PORT is ready and listening
      for the upcoming connection that will establish socket SOCK.
}
      [ 30 "TIMEOUT" book \ seconds until remote connects
      ]

      " CONN sending connect msg to " . IPcol .i sp date . nl

      rtc_close \ make sure connection is closed
{
      Make a command string to run on the remote, for example
         "'71.107.4.6' 9886 server_connect" "RTC_CONNECT" msgPut,
      that will cause the remote machine running such a phrase to con-
      nect to listening port 9886 at IP address 71.107.4.6.
}
      "-ssl" argv chars 0>
      IF "yes ssl_connect " dup main ELSE "" THEN (qSSL)

      (qSSL)
      "'ip' PORT server_connect" + \ template; replace ip and PORT below

      (hM) "ip" IP strp (hM)  \ replace string ip with IP address
      "PORT" PORT intstr strp \ replace string PORT with PORT num

      (qS) "RTC_CONNECT"      \ S is an RTC_CONNECT message 
      IPcol PORTcol           \ sending to machine at IPcol:PORTcol 
      msgPutIP \ goes to remote machine's interprocess message list

    \ The remote collector should connect to the server here in a short
    \ time, and when it does CONN_SET() will be run and socket SOCK will
    \ be defined.
      TIMEOUT WAIT_ALARM \ time limit for connection
      WAIT_BEGIN         \ wait for connection through CONN_SET

    \ Turn off the WAIT_END alarm started by WAIT_ALARM
      "WAIT_END" -ALARM

    \ Set the alarm for reconnection:
      "CONN_RECON" "SEC" yank (nSec)
      (nSec) "CONN_RECON" ALARM \ set reconnection ALARM
   end

   inline: CONN_CLS (nS --- ) \ action when connection on S has closed
      " CONN_CLS: connection on " swap intstr + " is closed " + 
      date + . nl
      -1 "SOCK" mainbook \ invalidate SOCK
 
    \ xx \ clear the stack; there may be items from aborted connection
{     NEVER CLEAR THE STACK.  I SHOULD KNOW BETTER.  

      HTTPget was returning a stack item when this word ran and cleared
      off the item (and whatever was below it: items that other words 
      may have been waiting for), causing the program to run away and
      then exit.  

      This is an event-driven system, words are recursive and can run 
      at any time, so the integrity of the stack must be maintained.

      It pays to understand a problem before attempting to fix it.

      I had a theory that the problem was due to the remote machine
      closing the socket on socket_ack, and if socket_ack was just
      discontinued, the problem would go away.  

      It has taken a long time, but by now I know it is wise to under-
      stand a problem first, and not just make blind changes and hope
      things are fixed.  

      So debug was added and after a few days the problem occurred but
      the debug was not informative enough and more had to be added.

      Finally after another day the problem occurred again, and the 
      unexpected has shown up as the culprit: the thoughtless clearing
      of the stack by "xx" above.  My shoot-from-the-hip theory about
      socket_ack was just plain wrong.

      Word CONN_CLS is called whenever the connection closes.  I guess
      that it usually runs after HTTPget has returned, because this
      error occurs infrequently.  The log below shows the case where 
      it ran at the worst time and nailed the stack:

      Here is word CONN running to make a connection to 205.134.240.76:
         Top of CONN Mon Jul 20 15:18:58 PDT 2009
         CONN message: '71.107.6.154' 9887 server_connect
         CONN RTC_CONNECT IPcol: 205.134.240.76
         CONN RTC_CONNECT PORTcol: 80
         Top of msgPutIP
         REMOTE_CONNECT: calling HTTPget: 205.134.240.76 clientIPs 
            remotefd clientindex quote 9887 CLIENT 'S' book "remotefd 
            WAIT_END" S remoterun (plunger)
         HTTPget: host 205.134.240.76
         HTTPget: connected to 205.134.240.76 on socket 2
         HTTPget: clientIPs remotefd clientindex quote 9887 CLIENT 'S' 
            book "remotefd WAIT_END" S remoterun (plunger)
         HTTPget: receiving bytes ...
         HTTPget: received 118 bytes at 1.10 Mbytes/sec
         HTTPget: closing connection

      Word CONN_CLS is called whenever the connection closes.  Here is 
      where the stack got cleared, removing the stack item that HTTPget
      was returning:
         CONN_CLS: connection on 2 is closed Mon Jul 20 15:18:58 PDT 200
         << xx cleared the stack here >>
         End of CONN_CLS

      This is debug print placed in REMOTE_CONNECT, and it verifies that
      the stack is empty following CONN_CLS:
         REMOTE_CONNECT: stack after HTTPget:
         stack is empty 

      This is the mess that followed:
         textget: expect string or volume on stack
         dup: empty stack
         cannot pop empty stack
         gt: stack items not as expected
         over: expect two items on stack
         quote: expect string or volume on stack
         strchop: expect string on stack
         cannot pop empty stack
         faulty phrase: "*" PORT DSERVER
         runaway detected: HALT on run level 6  Mon Jul 20 15:18:58 PDT

      Here is a repeat the next day, after removing "xx" and with the 
      same debug print.  It shows three stack items, obviously crucial
      for running, that "xx" had cleared off.  Remoteprompt to the pro-
      gram on IPloop, port 9885 showed the stack was empty following all
      this, as it should be, so the machinery is working ok:
         Top of CONN Tue Jul 21 15:19:33 PDT 2009
         CONN message: '71.107.6.154' 9885 server_connect
         CONN RTC_CONNECT IPcol: 205.134.240.76
         CONN RTC_CONNECT PORTcol: 80
         Top of msgPutIP
         REMOTE_CONNECT: calling HTTPget: 205.134.240.76 clientIPs
            remotefd clientindex quote 9885 CLIENT 'S' book "remotefd 
            WAIT_END" S remoterun (plunger)
         HTTPget: host 205.134.240.76
         HTTPget: connected to 205.134.240.76 on socket 2
         HTTPget: clientIPs remotefd clientindex quote 9885 CLIENT 'S'
            book "remotefd WAIT_END" S remoterun (plunger)
         HTTPget: receiving bytes ...
         HTTPget: received 118 bytes at 1.12 Mbytes/sec
         HTTPget: closing connection
         CONN_CLS: connection on 2 is closed Tue Jul 21 15:19:33 PDT 200
         End of CONN_CLS

      This is debug print placed in REMOTE_CONNECT, showing three stack
      items that CONN_CLS had previously cleared:
         REMOTE_CONNECT: stack after HTTPget:
         stack elements:
               0 string:  REQUESTrun: clientIPs remotefd cli...  118 
                  characters
               1 string: RTC_CONNECT  11 characters
               2 string: '71.107.6.154' 9885 server_connect  34 
                  characters
         [3] ok!

        Tue Jul 21 15:19:33 PDT 2009 SERVER: 205.134.240.76 connect
         104 bytes delta: memprobe socket 2 connect
         REMOTE_CONNECT1: connected to 205.134.240.76:80
         msgPutIP socket: 2
         msgPutIP: connected to 205.134.240.76
         msgPutIP running T: "'71.107.6.154' 9885 server_connect" 
            "RTC_CONNECT" msgPut
         msgPutIP: connection closed
         CONN WAIT_BEGIN Tue Jul 21 15:19:34 PDT 2009
         CONN_SET: connection on socket 3 Tue Jul 21 15:19:34 PDT 2009
         CONN WAIT_END Tue Jul 21 15:19:34 PDT 2009
}
      "CONN_RECON" "SEC" yank (nSec) 
      (nSec) 10 / \ next reconnection sooner than SEC
      (nSec/10) "CONN_RECON" ALARM \ set reconnection ALARM
   end

   inline: CONN_RECON ( --- ) \ reconnect to collector
\     This word runs on an alarm that it continuously resets, to see if
\     it is necessary to connect again by running CONN_RESTART.

      [ 180 "SEC" book  \ test for reconnect every SEC seconds
        600 "SMAX" book \ max seconds between received files
      ]

      msgID msgGet drop
      "CONN_RECON " time intstr spaced + date + msgID msgPut

      LOCKED not
      IF time "extract_files" "t_extract" yank - SMAX >
         IF " CONN_RECON: too much time since last extraction" . nl
            rtc_close
         ELSE
          \ July 2009: run rtc_close if SOCK is not an open client:
            "SOCK" main -1 >
            IF "SOCK" main client_open not
               IF  " CONN_RECON: SOCK is not a client"
                  ", closing connection" + . nl rtc_close 
               THEN
            THEN
         THEN

         "SOCK" main 0< (f) 
         IF CONN_RESTART THEN \ Thu Mar 24 06:06:41 PDT 2011

      ELSE " CONN_RECON: returning, program is locked " date + . nl
      THEN
      SEC "CONN_RECON" ALARM \ check again in SEC 
   end

   inline: CONN_RESTART ( --- ) \ restart this program
{     Thu Mar 24 06:17:38 PDT 2011

      Start a new instance of this program to get a new listening port,
      then kill the old one.  The restarted program, in word CONN, will
      send a message to the remote program to connect on the new listen-
      ing port.

      This word starts a script (see start1 and start2 below) that will
      run tops_rtccon for this collector.  The job run by tops_rtccon 
      will connect to this running job and make it exit, then it will 
      start a new one.
}
      [ usrpath "tops_rtccon -start1 &" + "start1" book
        usrpath "tops_rtccon -start2 &" + "start2" book
      ]
      " CONN_RESTART: restarting the program " date + . nl

      depth push

    \ Find a program to start:
      "-collect" argv "1" = IF start1 (qS) THEN
      "-collect" argv "2" = IF start2 (qS) THEN

      (qS) depth pull - 0> (f)
{
      Thu Nov 17 12:06:05 PST 2011.  Running tops_rtccon can cause
      many jobs to spawn if CONN_RESTART fails.  
 
      An error in C function _sflush() has recently been fixed that may 
      account for numerous problems that fixes like CONN_RESTART were
      written to get around.

      Try the old way with CONN by forcing the ELSE branch below:
}     (f) drop no

      IF (qS) " CONN_RESTART command: " . nl dup 3 indent . nl
         tasks_omit \ stop all multitasker tasks
         (qS) shell \ run start up program

      ELSE \ run CONN and return:
       \ " CONN_RESTART: cannot find a program to start" . nl
         " CONN_RESTART: running CONN to reconnect " date + . nl 
         CONN  
      THEN
   end

   inline: CONN_SET (nS --- ) \ set up for collector just connected on S
\     When the remote collector connects, it runs this word so this end
\     of its connection can be set up.
      " CONN_SET: connection on socket " over intstr + spaced 
      date + . nl
      (nS) dup "SOCK" mainbook                \ connected on SOCK
      "CONN_CLS" ptr swap (ptr nS) ptrCls_upd \ set clientclose function
   end 

   inline: current ( --- ) \ local files brought current to remote ones
      rtc_files any?
      IF (hT2) 1st word drop (hT2)
         local_files 1st word drop (hT1) 
         (hT2 hT1) nomatch1 any?
         IF (hFiles) get_archive any?
            IF (hT) extract_files THEN
         THEN
      THEN
   end

   inline: current_for (nYYYMMDD --- ) \ local files current for YYYMMDD
      "DATE" book
      DATE rtc_for any?
      IF (hT2) 1st word drop (hT2) 
         DATE local_for (hT1) 1st word drop (hT1)
         (hT2 hT1) nomatch1 any?
         IF (hFiles) get_archive any?
            IF (hT) extract_files THEN
         THEN
      THEN
   end

   inline: extract_files (hT --- ) \ extract files contained in volume T
{     Volume T on the stack is a tar file archive.  Save T to FILE and 
      then extract the files of FILE into DIR.

      When the remote machine sends file archive T to this machine, it
      follows it with a string to run this word.  

      For example, the remote machine might run remoterun2() to send T 
      from its stack to here and then run this word, extract_files(), 
      on this machine.  

      Here is a phrase run on the remote machine to do this, showing
      T on its stack ready to be sent here:

         (hT) "extract_files" S remoterun2
}
      [ INF "t_extract" book ]

      time "t_extract" book \ time for elapsed test in CONN_RECON

    \ Write a line to the log file:
      " extract_files: to " DIR + spaced that sizeof intstr + 
      " bytes " + date + . nl 

      ftempsys "FILE" book
      FILE old binary "BIN" file \ open handle to old FILE
      (hT) BIN fput              \ bytes on stack to FILE
      BIN fclose                 \ close FILE handle
      DIR FILE xtar              \ extract tar files from FILE into DIR
{
      FILE should always exist, but when using delete the following 
      was obtained once (the program recovered and continued as the 
      last line shows):

         extract_files: to /home/dale/mdat/edat1/ 4433 bytes 
            Wed May 14 18:46:48 UTC 2008
         delete: file not found: /tmp/T3494_Wm8x37
         faulty phrase: extract_files
         faulty phrase: "*" PORT DSERVER
         extract_files: to /home/dale/mdat/edat1/ 4309 bytes 
            Wed May 14 18:48:28 UTC 2008
}
    \ Switch to deleteif:
      FILE deleteif              \ delete FILE
    \ FILE delete                \ delete FILE

      "/bin/touch " DIR + shell  \ so filetime will show change
   end

   inline: get_archive (hFiles --- hT) \ get Files archive from remote
      "SOCK" main "S" book

      S -1 =
      S socket_open not or
      IF " get_archive: socket to remote is not open" . nl
         drop VOL tpurged
      ELSE
       \ Files are in DIR on remote; run word archive on the remote
       \ and have an archive of Files sent here (note that DIR on
       \ the remote is where collected files are placed; it probably
       \ is a different name than DIR here):
         (hFiles) "DIR archive (hT) remotefd remoteput" (hT2)
         (hFiles hT2) S remoterun2
         S 40 (nS nSec) BLOCK
      THEN
   end

   inline: local_files ( --- hT) \ list of all local files
\     Volume T contains a list of file names and times for remote files
\     that have been downloaded to directory DIR.
      DIR dirfiles (hNames hTimes) " %0.0f" format park
   end

   inline: local_for (nYYYMMDD --- hT) \ list of local files for YYYMMDD
\     Volume T contains a list of file names and times for files that
\     have been downloaded for YYYMMDD to directory DIR.
      DIR dirfiles (hNames hTimes) " %0.0f" format park
      dup rot intstr grepr any?
      IF reach ELSE drop VOL tpurged THEN
   end

   inline: msgID ( --- qM) \ string for CONN message on msgcomm
      "LOG_RTC" main -path -ext 
   end

   inline: rtc_close ( --- ) \ close connection to remote
      "SOCK" main "S" book

      S -1 >
      IF " rtc_close: closing socket " S intstr + " to collector " + 
         date + . nl 

         0 S ptrCls_upd \ essential to avoid endless loop with CONN_CLS

       \ Mon May  9 05:31:43 PDT 2011.  Do not try to connect to the
       \ remote server.  Why was this being done anyway, since the 
       \ idea was to communicate through the message file?
       \ "remotefd server_close" S remoterun

         S sclose
         -1 "SOCK" mainbook
      ELSE " rtc_close: socket is already closed " date + . nl
      THEN
   end

   inline: rtc_files ( --- hT) \ list of all remote files
      "SOCK" main "S" book

      S -1 =
      S socket_open not or
      IF " rtc_files: socket to remote is not open" . nl 
         VOL tpurged
      ELSE
         "rtc_files remotefd remoteput" S remoterun1
      THEN
   end

   inline: rtc_for (nYYYMMDD --- hT) \ list of remote files for YYYMMDD
      "SOCK" main "S" book

      S -1 =
      S socket_open not or
      IF " rtc_for: socket to remote is not open" . nl 
         drop VOL tpurged
      ELSE
         intstr (hT1) "main (nYYMMDD) rtc_for remotefd remoteput" (hT2)
         (hT1 hT2) S remoterun2
         S 20 (nS nSec) BLOCK
      THEN
   end

   pull catmsg

   keys? IF halt THEN \ interactive testing, do not start server

\-----------------------------------------------------------------------

\  Start a multitasker job to track memory usage:
\  July 2009: Memory has looked fine for months, no leaks.  Discontinue
\  this:
\  LOG_MEM "memlog" "LOG" bank 
\  1 900 / "memlog" PLAY \ every 15 minutes

\-----------------------------------------------------------------------

\  This section makes the connection to the complementary daemon on the
\  remote machine.

\  This line sets SYSOUT to the log file name defined above:
   LOG_RTC set_sysout \ SYSOUT will be LOG_RTC

\  Write the first lines in LOG_RTC file:
   "-" 72 cats nl dot nl

   "PID " getpid intstr + " begin" + dot nl \ process ID

   ontheweb
   IF NIST_SYNC 
   ELSE 14400 "NIST_DELTA" "SEC" bank NIST_DELTA
   THEN

   " SFLUSHX on, " date + . nl yes SFLUSHX

   12 new_client_timeout \ time allowed for remote to make connection

\  Run CONN on an ALARM that gives DSERVER time to start:
   20 (seconds) "CONN" ALARM

   "-ssl" argv chars 0=
   IF " tops_rtc: starting DSERVER on port "
   ELSE " tops_rtc: starting DSERVER_SSL on port "
   THEN (qS) PORT intstr + . sp date . nl

\  Start the server, running forever.  The remote collection machine 
\  will connect shortly after CONN runs: 
   "*" PORT "-ssl" argv chars 0=
   IF DSERVER ELSE DSERVER_SSL THEN

\-----------------------------------------------------------------------

;  Appendix

   Problem, July 2009.

      After readn1 reported errno 4 (a probable alarm interrupt), the
      program hung, although it remained connected to the collection
      site:
         extract_files: to /mdat/edat1/ 1519 bytes Thu Jul  2 21:02:22 
         extract_files: to /mdat/edat1/ 1513 bytes Thu Jul  2 21:05:13 
         connect_alarm: signum 14  Thu Jul  2 21:07:30 PDT 2009
         readn1: probable alarm interrupt, errno: 4

      The collection site kept delivering files to socket 2, the one
      that was hung up.  There socket_ack failed every time, but the
      site did nothing about it:

         Server local is listening on port 9879
         Clients:
          socket 2, port  9882, conn C>S, 71.106.247.190 dale plunger
          socket 3, port  9879, conn C>S, 64.62.148.191 dale topsdog
         send_files: sending 4008 byte archive of 4 files
         send_files: socket_ack on socket 3 ok
         send_files: to socket 3 Fri Jul  3 17:04:10 UTC 2009
         send_files: socket_ack on socket 2 failed
          Jmp table at lev 4
           lev   ret  typ  Lib:lib
            4     2    1     0:send_files
            3     2    1     0:dir_monitor
            2     2    1     0:D2
            1     1    2     0:DATA__
          send_files: to socket 2 Fri Jul  3 17:04:17 UTC 2009
          send_files end: Fri Jul  3 17:04:17 UTC 2009

      What if the collection site, on seeing socket_ack fail, simply
      closed the socket?  Would the receiver then cease to be hung up?

      Changes to tops_rtcmon and tops_rtc have been made to address 
      this problem.
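
      The idea in the question above (close the socket as soon as the
      acknowledgement fails) can be sketched in Python.  This is only
      an illustration, not the change that was made to tops_rtcmon:
      the name send_with_ack, the three-byte b"ACK" reply and the
      timeout value are assumptions made for the example.

```python
import socket

def send_with_ack(sock, payload, timeout=5.0, ack=b"ACK"):
    """Send payload and wait for ack; close sock if no ack arrives.

    Returns True when the peer acknowledged, False when the socket
    was closed because the acknowledgement failed or timed out.
    """
    try:
        sock.settimeout(timeout)
        sock.sendall(payload)
        reply = sock.recv(len(ack))   # a short ack is assumed to arrive whole
    except OSError:                   # includes socket.timeout
        reply = b""
    if reply == ack:
        return True
    sock.close()                      # do not keep feeding a dead connection
    return False
```

      Closing on the first failed acknowledgement lets the receiving
      end see the close and start its normal reconnection, instead of
      staying attached to a socket that no longer works.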

Example of connecting during a noisy period, October 2008.

This shows the tops_rtc log file during a period when the remote tops_rtcmon
server had TCP/IP connection problems, making operation on this end very rocky.

Times like this are a pain, but they offer the opportunity to make changes
that improve reliability.  Lines below show program tops_rtc detecting bad
connections and continually reconnecting.

Communication to make a remote connection is through the remote's interprocess
communication system (file dog.v), and not through direct connection to server
tops_rtcmon (see documentation at the top of file tops_rtc).  This program sends 
a message to the remote's msgcomm file (see "msgPutIP OK:" below), while the 
remote daemon, tops_rtcmon, is polling for such a message.  When received, the 
remote makes a new connection to here.  This turns out to be a key feature in 
robust reconnection, in effect using a neutral or third party.  
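
The rendezvous described above can be sketched in Python.  This is a minimal
illustration under stated assumptions, not the tops implementation: a plain
dict stands in for the remote's msgcomm file, and the function names are
hypothetical.  Only the message name RTC_CONNECT and the phrase layout
('IP' PORT server_connect) are taken from the log lines in this appendix.

```python
import shlex

MSG_NAME = "RTC_CONNECT"

def put_connect_request(mailbox, ip, port):
    """Requesting side: leave a connect-back phrase in the mailbox."""
    mailbox[MSG_NAME] = "'%s' %d server_connect" % (ip, port)

def poll_connect_request(mailbox):
    """Polling side: take the request, if any, and return (ip, port).

    Returns None when no request is waiting.  The message is removed
    when it is read, so each request is acted on once.
    """
    phrase = mailbox.pop(MSG_NAME, None)
    if phrase is None:
        return None
    ip, port, word = shlex.split(phrase)
    if word != "server_connect":
        raise ValueError("unexpected phrase: " + phrase)
    return ip, int(port)
```

Because the request travels through the mailbox and not through the data
connection, it can still be delivered when that connection is the thing
that has failed.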

Below is the excerpt from the tops_rtc log file, with comments inserted.

Connect to tops_rtcmon server and start receiving files:

Fri Oct 31 15:10:24 UTC 2008 SERVER: YY.XXX.ZZ.76 connect
 8 bytes delta: memprobe socket 3 connect
 msgPutIP OK: "'XXX.XX.148.191' 9879 server_connect" "RTC_CONNECT" msgPut
 msgPutIP: connection closed
 CONN_SET: connection on socket 3 Fri Oct 31 15:10:25 UTC 2008
 extract_files: to /home/dale/mdat/edat1/ 12311 bytes Fri Oct 31 15:10:41 UTC 2008
 extract_files: to /home/dale/mdat/edat1/ 2709 bytes Fri Oct 31 15:11:53 UTC 2008
 extract_files: to /home/dale/mdat/edat1/ 2794 bytes Fri Oct 31 15:13:04 UTC 2008
 extract_files: to /home/dale/mdat/edat1/ 3107 bytes Fri Oct 31 15:14:05 UTC 2008
 extract_files: to /home/dale/mdat/edat1/ 3786 bytes Fri Oct 31 15:15:58 UTC 2008

This shows the socket_ack phrase from the remote server trying to run word
remoterun here.  But writen1 finds that the socket to the remote is now closed,
and word remoterun fails.  Word CONN_CLS officially closes the socket:

 writen1: socket 3 is not open, client closed
 CONN_CLS: connection on 3 is closed Fri Oct 31 15:27:25 UTC 2008
 fault at word: remoterun
 faulty phrase: "'remoteack' 'pile_ACK' localrun" remotefd remoterun

CONN_RECON, which runs periodically, detects that too much time has passed
and initiates another connection:

 CONN_RECON: too much time since last extraction
 CONN_RECON: running CONN to reconnect Fri Oct 31 15:30:24 UTC 2008
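
The test that word CONN_RECON makes on each alarm, as it is written in the
current file tops_rtc, can be restated in Python as a sketch.  The function
name and argument names are hypothetical; the default of 600 seconds is SMAX
from the script.  The 2008 version that produced this log may have differed.

```python
def conn_recon_check(now, t_extract, sock, client_open, locked, smax=600):
    """Return (close, reconnect) for one pass of the periodic check.

    now, t_extract: times in seconds; t_extract is the last extraction
    sock:           socket number, -1 when there is no connection
    client_open:    True if sock is still an open client
    locked:         True while the program is locked (check is skipped)
    """
    if locked:
        return (False, False)
    too_quiet = (now - t_extract) > smax        # no files for too long
    dead_client = sock > -1 and not client_open # socket no longer a client
    close = too_quiet or dead_client
    # After a close the socket number is -1, so a reconnect follows.
    reconnect = close or sock < 0
    return (close, reconnect)
```

In the script t_extract starts out as INF, so the elapsed-time test cannot
trip before the first archive has been extracted; float("inf") behaves the
same way in this sketch.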

The connection initiated by CONN_RECON succeeds:

Fri Oct 31 15:30:25 UTC 2008 SERVER: YY.XXX.ZZ.76 connect
 -152 bytes delta: memprobe socket 2 connect
 CONN_SET: connection on socket 2 Fri Oct 31 15:30:26 UTC 2008

but after about 20 seconds socket_ack fails again:

 writen1: socket 2 is not open, client closed
 CONN_CLS: connection on 2 is closed Fri Oct 31 15:30:46 UTC 2008
 fault at word: remoterun
 faulty phrase: "'remoteack' 'pile_ACK' localrun" remotefd remoterun

and CONN_RECON again detects that (still) too much time has passed and 
starts another connection:

 CONN_RECON: too much time since last extraction
 CONN_RECON: running CONN to reconnect Fri Oct 31 15:33:45 UTC 2008
 msgPutIP: connected to YY.XXX.ZZ.76

Connection succeeds and a couple of files are received but socket_ack to
the server again fails and CONN_RECON starts another connection:

Fri Oct 31 15:33:46 UTC 2008 SERVER: YY.XXX.ZZ.76 connect
 -32 bytes delta: memprobe socket 3 connect
 msgPutIP OK: "'XXX.XX.148.191' 9879 server_connect" "RTC_CONNECT" msgPut
 msgPutIP: connection closed
 CONN_SET: connection on socket 3 Fri Oct 31 15:33:47 UTC 2008
 extract_files: to /home/dale/mdat/edat1/ 8933 bytes Fri Oct 31 15:33:55 UTC 2008
 extract_files: to /home/dale/mdat/edat1/ 2900 bytes Fri Oct 31 15:35:07 UTC 2008
 writen1: socket 3 is not open, client closed
 CONN_CLS: connection on 3 is closed Fri Oct 31 15:37:40 UTC 2008
 fault at word: remoterun
 faulty phrase: "'remoteack' 'pile_ACK' localrun" remotefd remoterun
 CONN_RECON: running CONN to reconnect Fri Oct 31 15:40:39 UTC 2008
 msgPutIP: connected to YY.XXX.ZZ.76

Connection succeeds, and receipt of files is going more smoothly:

Fri Oct 31 15:40:40 UTC 2008 SERVER: YY.XXX.ZZ.76 connect
 -56 bytes delta: memprobe socket 3 connect
 msgPutIP OK: "'XXX.XX.148.191' 9879 server_connect" "RTC_CONNECT" msgPut
 msgPutIP: connection closed
 CONN_SET: connection on socket 3 Fri Oct 31 15:40:43 UTC 2008
 extract_files: to /home/dale/mdat/edat1/ 9132 bytes Fri Oct 31 15:40:44 UTC 2008
 extract_files: to /home/dale/mdat/edat1/ 3653 bytes Fri Oct 31 15:41:55 UTC 2008
 extract_files: to /home/dale/mdat/edat1/ 4748 bytes Fri Oct 31 15:43:28 UTC 2008
 extract_files: to /home/dale/mdat/edat1/ 3407 bytes Fri Oct 31 15:44:39 UTC 2008
 extract_files: to /home/dale/mdat/edat1/ 2913 bytes Fri Oct 31 15:45:50 UTC 2008
 extract_files: to /home/dale/mdat/edat1/ 3688 bytes Fri Oct 31 15:46:45 UTC 2008
 extract_files: to /home/dale/mdat/edat1/ 2090 bytes Fri Oct 31 15:50:23 UTC 2008
 extract_files: to /home/dale/mdat/edat1/ 4040 bytes Fri Oct 31 15:51:33 UTC 2008
 extract_files: to /home/dale/mdat/edat1/ 3700 bytes Fri Oct 31 15:52:36 UTC 2008
 extract_files: to /home/dale/mdat/edat1/ 4807 bytes Fri Oct 31 15:54:06 UTC 2008
 extract_files: to /home/dale/mdat/edat1/ 3143 bytes Fri Oct 31 15:55:43 UTC 2008
