@q Copyright 2012-2014, Alexander Shibakov@>
@q This file is part of SPLinT@>

@q SPLinT is free software: you can redistribute it and/or modify@>
@q it under the terms of the GNU General Public License as published by@>
@q the Free Software Foundation, either version 3 of the License, or@>
@q (at your option) any later version.@>

@q SPLinT is distributed in the hope that it will be useful,@>
@q but WITHOUT ANY WARRANTY; without even the implied warranty of@>
@q MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the@>
@q GNU General Public License for more details.@>

@q You should have received a copy of the GNU General Public License@>
@q along with SPLinT.  If not, see <http://www.gnu.org/licenses/>.@>

@** Forcing \bison\ and \flex\ to output \TeX.
Instead of implementing a \bison\ (or \flex) `plugin' for outputting
\TeX\ parser, the code that follows produces a separate executable
that outputs all the required tables after the inclusion of an
ordinary \Cee\ parser produced by \bison\ (or a scanner produced by
\flex). The actions in both \bison\ parser and \flex\ scanner are
assumed to be merely |printf()| statements that output the `real'
\TeX\ actions. The code below simply cycles through all such actions to
output an `action switch' appropriate for use with \TeX. In every
other respect, the included parser or scanner can use any features
allowed in `real' parsers and scanners.

@*1 Common routines. 
The `top' level of the scanner and parser `drivers' is very similar,
and is therefore separated into a few sections that are common to both
drivers. The layout is fairly typical and follows a standard
`initialize-input-process-output-clean up' scheme. The logic behind each
section of the program will be explained in detail below.

The section below is called |@<\Cee\ postamble@>| because the output of
the tables can happen only after the \bison\ (or \flex) generated
\.{.c} file is included and all the data structures are known.

The actual `assembly' of each driver has to be done separately due to
some `singularities' of the \CWEB\ system and the design of this
software. All the essential routines are presented in the sections
below, though. 
@<\Cee\ postamble@>=

@<Outer definitions@>;
@<Global variables and types@>@;
@<Auxiliary function declarations@>@;
@<Auxiliary function definitions@>@;

int main( int argc, char **argv ) {

  @<Local variable and type declarations@>@;
  @<Establish defaults@>@;
  @<Command line processing variables@>@;
  @<Process command line options@>@;

  switch( mode ) {

      @<Various output modes@>@;

      default:
          break;

  }

  if ( tables_out ) {

      @<Perform output@>@;
      @<Output action switch, if any@>@;

  } else {

      fprintf( stderr, "No output, exiting\n" );
      exit(0);

  }

  @<Clean up@>@;

  return 0;

}

@ Not all the code can be supplied at this stage (most of the routines
here are at the `top' level so the specifics have to be `filled-in' by
each driver), so many of the sections
above are placeholders for the code provided by a specific
driver. However, we still need to supply a trivial definition here to
placate \CWEAVE\ whenever this portion of the code is used isolated in
documentation.
@<Various output modes@>=

@ Standard library declarations for memory management routines, some
syntactic sugar, command line processing, and variadic functions are
all that is needed.
@<Outer definitions@>=
#include <stdlib.h>
#include <stdbool.h>
#include <stdarg.h>
#include <assert.h>
#include <string.h>

@ This code snippet is a payment for some poor (in my view) philosophy
on the part of the \bison\ and \flex\ developers.  There used to be an
option in \bison\ to output just the tables and the action code but it
had never worked correctly and it was simply dropped in the latest
version. Instead, one can only get access to \bison's goodies as part
of a tangled mess of |@[#define@]|'s and error processing code. Had the
tables and the parser function itself been considered separate, well
isolated sections of \bison's output, there would simply be no reason
for dirty tricks like the one below, one would be able to write custom
error processing functions, unicorns would roam the Earth and pixies
would hand open sourced tablets to everyone. At a minimum, it would
have been a much cleaner, modular approach. There is also strange
reluctance on the part of the \gcc\ team to output any intermediate
code other than the results of preprocessing and assembly. I have seen
an argument that involves some sort of appeal to making the code
difficult to close source but the logic of it escaped me completely.

Ideally, there should be no such thing as a parser generator, or a
compiler, for that matter: all of these are just basic table driven
rewriting routines. Tables are hard but table driven code should not
be. If one had access to the tables themselves, and some canonical
examples of code driven by such tables, like |yyparse()| and
|yylex()|, the flexibility of these tools would improve
tremendously. Barring that, this is what we have to do {\it now}.

There are several ways to gain write access to the data declared |const|
in \Cee, like passing its address to a function with no prototype. All
these methods have one drawback: loopholes that make them possible
have been steadily getting on the chopping block of the \Cee\
standards committee. Indeed, |const| data should be constant. Even if
one succeeds in getting access, there is no reason to believe that the
data is not allocated in a write-only region of the memory. The
cleanest way to get write access then is to eliminate |const|
altogether. The code should have the same semantics after that, and
the trick is only marginally bad.

The last two definitions are less innocent (and, at least the
second one, are prohibited by the ISO standard (clause 6.10.8(2), 
see~\cite[ISO/C11])) but
\gcc\ does not seem to mind, and it gets rid of warnings about
dropping a |const| qualifier whenever an |assert| is
encountered. Since the macro is not recursively expanded, this will
only work if $\ldots$|FUNCTION__| is treated as a pseudo-variable, as
it is in \gcc, not a macro.

@d const
@d __PRETTY_FUNCTION__ @[(char *)__PRETTY_FUNCTION__@]
@d __FUNCTION__ @[(char *)__FUNCTION__@]

@ The output file has to be known to both parts of the code, so it is
declared at the very beginning of the program. We also add some
syntactic sugar for loops.
@q The line below is simply irresistible, one should put it on a T-shirt@>
@s FOREVER TeX

@d FOREVER for(;;)

@<Common code for \Cee\ preamble@>=
#include <stdio.h>
  FILE *tables_out;

@ The clean-up portion of the code can be left empty, as all it does
is close the output file, which can be left to the operating system but
we take care of it ourselves to keep out code `clean'\footnote{In case
the reader has not noticed yet, this is a weak attempt at humor to
break the monotony of going through the lines of \CTANGLE'd code}.
@<Clean up@>=
  fclose( tables_out );

@ There is a descriptor controlling the output of the program as
a whole. The code below is an example of a literate programming
technique that will be used repeatedly to maintain large structures
that can grow during the course of the program design. Note that the
name of each table is only mentioned once, the rest of the code is
generic. 

Technically speaking, all of this can be done with \Cee\ preprocessor
macros of moderate complexity, taking advantage of its expansion rules
but it is not nearly as transparent as the \CWEB\ approach.
@<Local variable and type declarations@>=
  struct output_d {

      @<Output descriptor fields@>@;

  };

  struct output_d output_desc = { @<Default outputs@> };

@ To declare each table field in the global output descriptor, all one has to do
is to provide a general pattern.
@<Output descriptor fields@>=
#define _register_table_d(name) @[bool output_##name:1;@]
  @<Table names@>@;
#undef _register_table_d

@ Same for assigning default values to each field.
@<Default outputs@>=
#define _register_table_d(name) @[.output_##name = 0,@] /* do not output any tables by default */
  @<Table names@>@;
#undef _register_table_d

@ Each descriptor is populated using the same approach.
@<Local variable and type declarations@>=
#define _register_table_d(name) @[struct table_d name##_desc = {0};@]
  @<Table names@>@;
#undef _register_table_d

@ The reason to implement the table output routine as a macro is to avoid
writing separate functions for tables of different types of data
(stings as well as integers). The output is controlled by each table's
{\it descriptor\/} defined below. A more sophisticated approach is
possible but this code is merely a `patch' so we are not after full
generality\footnote{A somewhat cleaner way to achieve the same effect
is to use the \.{\_Generic} facility of \Cee11.}.

@d output_table(table_desc, table_name, stream)
  if ( output_desc.output_##table_name ) {

      int i, j = 0;
  
      fprintf( stream, table_desc.preamble, table_desc.name );

      for( i = 0; i < sizeof(table_name)/sizeof(table_name[0]) - 1; i++) {

          if ( table_desc.formatter ) {
                 
              j += table_desc.formatter( stream, i );

          } else {

              if ( table_name[i] ) {

                  j += fprintf( stream, table_desc.separator, table_name[i] );

              } else {

                  j += fprintf( stream, "%s", table_desc.null );
    
              }

          }

          if ( j > MAX_PRETTY_LINE && table_desc.prettify ) {

              fprintf( stream, "\n" );
              j = 0;

          }
      }

      if ( table_desc.formatter ) {

          table_desc.formatter( stream, -i );

      } else {

          if ( table_name[i] ) {

              fprintf( stream, table_desc.postamble, table_name[i] );

          } else {

              fprintf( stream, "%s", table_desc.null_postamble ); 

          }

      }

      if ( table_desc.cleanup ) {

          table_desc.cleanup( &table_desc );

      }
  }

@<Global variables and types@>=
  struct table_d {

    @<Generic table desciptor fields@>@;

  };

@ @<Generic table desciptor fields@>=
      char *name;
      char *preamble;
      char *separator;
      char *postamble;
      char *null_postamble;
      char *null;
      bool prettify;
      int (*formatter)( FILE *, int );
      void (*cleanup)( struct table_d * );

@ Tables are output first. The action output code must come last since
it changes the values of the tables to achieve its goals. Again, a
different approach is possible, that saves the data first but
simplicity was deemed more important than total generality at this
point.
@<Perform output@>=
  @<Output all tables@>@;

@ One more application of `gather the names first then process' technique.
@<Output all tables@>=
#define _register_table_d(name) @[output_table(name##_desc, name, tables_out);@]
  @<Table names@>@;
#undef _register_table_d    

@ Tables will be output by each driver. Placeholder here, for
\CWEAVE's piece of mind.
@<Table names@>=

@ Action output invokes a totally new level of dirty code. If tables,
constants, and tokens are just data structures, actions are actually
code. We can only hope to cycle through all the actions which is
enough to use \bison\ or \flex\ successfully with \TeX. The |switch| statement
containing the actions is embedded in the parser code so to get access
to each action we have to coerce |yyparse()| to jump to each
case. This is where we need the table manipulation. This code is
highly specific to the program used (since \bison\ and \flex\ code
have to be `reverse engineered' to make the parser and scanner
functions do what we want), here we only declare the options controlling
the level of detail and the type of actions output.
@<Global variables...@>=
  static int bare_actions = 0; /* (|static| for local variables) and |int| to pacify the compiler
                                  (for a constant initializer and compatible type) */
  static int optimize_actions = 0;

@ The first of the following options allows one to output an action switch without the
actions themselves. It is useful when one needs to output a \TeX\
parser for a grammar file that is written in \Cee. In this case it
will be impossible to cycle through actions (as no setup code has been
executed), so the parser invocation is omitted.

The second option splits the action switch into several macros to speed up
the processing of the action code.

The last argument of the `flexible' macro below is supposed to be an
extended description of each option which can be later utilized by a
|usage()| function.
@<Raw option list@>=
  _register_option("bare-actions", no_argument, &bare_actions, 1, "")@;
  _register_option("optimize-actions", no_argument, &optimize_actions, 1, "")@;

@ The rest of the action output code mimics that for table output, starting with
the descriptor. To make the output format more flexible, this
descriptor should probably be turned into a specialized routine.
@<Global variables and types@>=
  struct action_d {

      char *preamble;
      char *act_setup;
      char *act_suffix;
      char *action1;
      char *actionn;
      char *postamble;
      void (*print_rule)( int );
      void (*cleanup)( struct action_d * );

  };

@ @<Output descriptor fields@>=
  bool output_actions:1;

@ Nothing is output by default, including actions.
@<Default outputs@>=
  @[@].output_actions = 0,@[@]@;

@ @<Local variable and type declarations@>=
  struct action_d action_desc = {0};

@ The function below outputs the \TeX\ code of each action when the
appropriate action is `run' by the action output switch. 
The main concern in designing these functions is to make the
code easier to look at. Further explanation is given in the grammar
file. If the parser is doing its job, this is the only place where one
would actually see these as functions (or, rather, macros).

In compliance with paragraph 6.10.8(2)\footnote{[$\ldots$] {\it Any
other predefined macro names shall begin with a leading underscore
followed by an uppercase letter or a second underscore.}} of the \ISO\
\Cee11 standard the names of these macros do not start with an
underscore, since the first letter of \.{TeX} is
uppercase\footnote{One might wonder why one of these functions is
defined as a \CWEB\ macro while the other is put into the preamble `by
hand'. It really makes no difference, however, the reason the second
macro is defined explicitly is \CWEB's lack of awareness of `variadic'
macros which produces undesirable typesetting artefacts.}.
\def\TeXx{\hbox{\.{TeX\_}}}
\def\TeXxx{\hbox{\.{TeX\_\_}}}
@s TeX__ TeX
@d TeX_( string ) fprintf( tables_out, "          %s%%\n", string )

@q \CWEB\ is not aware of variadic macros, so this has to be done the old way@>
@<\Cee\ preamble@>=
#define TeX__( string, ... ) @[fprintf( tables_out, "      " string "%s\n", __VA_ARGS__, "%" )@]

@*2 \TeX\ tables. We begin with a few macros to facilitate the output
of tables in the format that \TeX\ can understand. There is really no
good way to represent an array in \TeX\ so a rather weak compromise
was chosen. Further explanation of this choice is given in the \TeX\
file that implements the \TeX\ parser for the \bison\ input grammar.
Some tables require name adjustments due to \TeX's reluctance to treat
digits as part of a name.
 
@d tex_table_generic(table_name)
  table_name##_desc.preamble = "\\newtable{%s}{%%\n";
  table_name##_desc.separator = "%d\\or ";
  table_name##_desc.postamble = "%d}%%\n";
  table_name##_desc.null_postamble = "0}%%\n";
  table_name##_desc.null = "0\\or ";
  table_name##_desc.prettify = true;
  table_name##_desc.formatter = NULL;
  table_name##_desc.cleanup = NULL;
  output_desc.output_##table_name = 1;

@d tex_table(table_name)
  tex_table_generic(table_name);
  table_name##_desc.name = #table_name;

@ Outputting constants. An approach paralleling the table output
scheme is taken with constants. Since constants are \Cee\ {\it
macros\/} one has to be careful to avoid the temptation of using
constant {\it names\/} directly as names for fields in
structures. They will simply be replaced by the constants'
values. When the names are concatenated with other tokens, however,
the \Cee\ preprocessor postpones the macro expansion until the
concatenation is complete (see clauses 6.10.3.1, 6.10.3.2, and
6.10.3.3 of the \ISO\ \Cee\ Standard, \cite[ISO/C11]). Unless the result of the
concatenation is still expandable, the expansion will halt.
@<Global variables and types@>=
  struct const_d {

      char *format;
      char *name;

  };

@ @<Local variable and type declarations@>=
#define _register_const_d(c_name) @[struct const_d c_name##_desc;@] 
  @<Constant names@>@;
#undef _register_const_d

@ @<Output descriptor fields@>=
#define _register_const_d(c_name) @[bool output_##c_name:1;@]
  @<Constant names@>@;
#undef _register_const_d

@ @<Default outputs@>=
#define _register_const_d(c_name) @[@[@].output_##c_name = 0,@[@]@]
  @<Constant names@>@;
#undef _register_const_d

@ @<Perform output@>=
  fprintf( tables_out, "%%\n%% constant definitions\n%%\n" );
  @<Output constants@>@;

@ @<Output constants@>=
{

  int any_constants = 0;
#define _register_const_d(c_name)                                \
                                                                 \
  if ( output_desc.output_##c_name ) {                           \
      const_out( tables_out, c_name##_desc, c_name)@;            \
      any_constants = 1;                                         \
  }

  @<Constant names@>@;

#undef _register_const_d

  if ( any_constants ); /* this is merely a placeholder statement */

}

@ Constants are very driver specific, so to make \CWEAVE\ happy $\ldots$
  @<Constant names@>=

@ A macro to help with constant output. 
@d const_out(stream, c_desc, c_name) fprintf(stream, c_desc.format, c_desc.name, c_name);

@ Action switch output routines modify the automata tables and
therefore have to be output last. Since action output is highly
automaton specific, we leave this section blank here, to pacify
\CWEAVE\ in case this file is typeset by itself.
@<Output action switch, if any@>=

@*2 Error codes.

@<Global variables and types@>=
  enum err_codes{ @<Error codes@>@, LAST_ERROR };

@ @<Error codes@>=
  NO_MEMORY, BAD_STRING, BAD_MIX_FORMAT,@[@]

@ A lot more care is necessary to output the token table. A number of precautions are taken
to ensure that a maximum possible range of names can be passed safely to \TeX. This involves some 
manipulation of \.{\\catcode}'s and control characters. The
complicated part is left to \TeX\ so the output code can be kept
simple. The helper function below is used to `combine' two strings.

@d MAX_PRETTY_LINE 100

@<Auxiliary function declarations@>=
  char *mix_string( char *format, ... );

@ @<Auxiliary function definitions@>=
  char *mix_string( char *format, ... ) {

        char *buffer;
        size_t size = 0;
        int length = 0;
        int written = 0;
        char *formatp = format;
        va_list ap, ap_save;

        va_start( ap, format );
        va_copy( ap_save, ap );

        size = strnlen( format, MAX_PRETTY_LINE * 5 );

        if ( size >=  MAX_PRETTY_LINE * 5 ) {
     
            fprintf( stderr, "%s: runaway string?\n", __func__ );
            exit(BAD_STRING);
              
        }

        while ( (formatp = strstr( formatp, "%" )) ) {

            switch( formatp[1] ) {

                case 's':@;              
                    length = strnlen( va_arg( ap, char * ), MAX_PRETTY_LINE * 5 );
            
                    if ( length >=  MAX_PRETTY_LINE * 5 ) {
      
                        fprintf( stderr, "%s: runaway string?\n", __func__ );
                        exit(BAD_STRING);
              
                    }
     
                    size += length;
                    size -=2;
                    formatp++;
                    break;

                 case '%':@;
                    size--;
                    formatp +=2;

                 default:

                    printf( "%s: cannot handle %%%c in mix string format\n", __func__, formatp[1] );
                    exit( BAD_MIX_FORMAT );

            }

        }

        buffer = (char *)malloc( sizeof(char) * size + 1 );

        if ( buffer ) {

            written = vsnprintf( buffer, size + 1, format, ap_save );

            if ( written < 0 || written > size ) {

                fprintf( stderr, "%s: runaway string?\n", __func__ );
                exit(BAD_STRING);

            }

        } else {
   
            fprintf( stderr, "%s: failed to allocate memory for the output string\n", __func__ );
            exit(NO_MEMORY);

        }

        va_end( ap );
        va_end( ap_save );

        return buffer;

  }

@*2Initial setup. Depending on the output mode (right now only \TeX\
and `tokens only' (in the \bison\ `driver') are supported) the format of each table, action
field and token has to be set up.

@<Local variable and type declarations@>=
  enum output_mode {@<Output modes@>@, LAST_OUT};

@ And to calm down \CWEAVE\ $\ldots$
@<Output modes@>=

@ \TeX\ is the main output mode.
@<Establish defaults@>=
  enum output_mode mode = TEX_OUT;

@*2Command line processing. This program uses a standard way of parsing the command
line, based on |getopt_long|. At the heart of the setup are the array below with a
couple of supporting variables.

@<Outer definitions@>=
#include <unistd.h>
#include <getopt.h>
#include <string.h>

@ @<Local variable and type declarations@>=
  const char *usage = "%s [options] output_file\n";

@ @<Command line processing variables@>=
  int c, option_index = 0;@#

  enum higher_options{NON_OPTION = 0xFF, @<Higher index options@>@, LAST_HIGHER_OPTION};  

  static struct option long_options[] = { @/@[
    @<Long options array@>@;@/
    {0, 0, 0, 0} @]
  };@#

@ The main loop of the command line option processing follows. This
can be used as a template for setting up the option processing. The
specific cases are added to in the course of adding new features. 

@<Process command line options@>=
  opterr = 0; /* we do our own error reporting */

  FOREVER {
    
    c = getopt_long (argc, argv, ":" @<Short option list@>, long_options, &option_index);

    if (c == -1) break;
    
    switch (c) {

    case 0:@; /* it is a flag, the name is kept in |long_options[option_index].name|,
        and the value can be found in |long_options[option_index].val| */
      break;

    @t}\4{@>@<Cases affecting the whole program@>;
    @t}\4{@>@<Cases involving specific modes@>;
          
    case '?':@;
      fprintf (stderr, "Unknown option: `%s', see `Usage' below\n\n", argv[optind - 1]);
      fprintf(stderr, usage, argv[0]);
      exit(1);
      break;
      
    case ':':@;
      fprintf (stderr, "Missing argument for `%s'\n\n", argv[optind - 1]);
      fprintf(stderr, usage, argv[0]);
      exit(1);
      break;     
      
    default:@;
      printf ("warning: feature `%c' is not yet implemented\n", c);
    } 
    
  }
     
  if (optind >= argc)
  {

      fprintf( stderr, "No output file specified!\n" );

  } else {

      tables_out = fopen( argv[optind++], "w" );

  }

  if (optind < argc)
  {

      printf ("script files to be loaded: ");
      while (optind < argc)
	printf ("%s ", argv[optind++]);
      putchar ('\n');

  }

@ @<Long options array@>=
#define _register_option(name, arg_flag, loc, val, exp) @[{name, arg_flag, loc, val},@[@]@]
  @<Raw option list@>@;
#undef _register_option

@ In addition to spelling out the full command line option name (such
as \.{--help}) |getopt_long| gives the user a choice of using a
shortcut (say, \.{-h}). As individual options are treated in drivers
themselves, there are no shortcuts to supply at this point. We leave
this section (and a number of others) empty to be filled in with the
driver specific code to pacify \CWEAVE.

@<Short option list@>=

@ Some options have one-letter `shortcuts', whereas others only exist
in `fully spelled-out' form. To easily keep track of the latter, a
special enumerated list is declared. To add to this list, simply add
to the \CWEB\ section below.
@<Higher index options@>=

@ @<Cases affecting the whole program@>=

@ @<Cases involving specific modes@>=

