
A Simple Yacc/Lex Processor for Sweb, an SGML Tag Set for Literate Programming


C. M. Sperberg-McQueen

5 February 1996

This unpublished document is distributed privately for comment by friends and colleagues; it is not now a formal publication and should not be quoted in published material.

Abstract

This document describes the implementation of a simple processor for Sweb, an SGML tag set for literate programming. The processor is written with the aid of yacc and lex, in several stages. The first stage merely recognizes the <scrap> elements in the input stream; the next also parses their attributes. The next uses their attribute values to build linked lists of scraps and of files of code to be written to disk at the end of the run. After several such stages, a complete working version of the program is finished; later stages add new features. Each stage is kept simple to write and thus simple to understand.

The following versions are currently defined by this document: 0.01 (recognizing scraps), 0.02 (parsing scrap attributes), 0.03 (scraps and files), 0.04 (scrap contents), 0.05 (linked list of scrap contents), and 0.06 (deciding which scraps are in which file).


1 Overall Organization

The program itself has two main parts: a yacc grammar for Sweb documents and a lexical scanner which identifies the salient markup in those documents. (For our purposes, the grammar and lexical scanner need to deal only with the contents of scraps; the rest of the document is passed through unchanged.) Some auxiliary files will also be needed, and we may create free-standing files for sets of related routines which might conceivably be reused, such as the implementation of skip-lists we take from William Pugh.

This document itself is organized more or less chronologically, and reflects a conventional process of software development by both stepwise refinement and successive approximation. By stepwise refinement I mean the systematic replacement of more abstract descriptions of what a program should do with more concrete descriptions (refinements), just as it's described by Edsger Dijkstra and other proponents of structured programming. Literate programming makes it a bit easier to develop programs this way, and a lot easier for later readers of the code to follow what is happening.

By successive approximation I mean the systematic development of a program in stages, starting from a very simple program that may not look much like the desired final product, in such a way that each stage is easy to implement, given the preceding stages, and each stage comes a little closer to the final version. Ideally, each stage is useful in itself, even the early ones, but this may require a gift for design and specification that we don't all have. Successive approximation appears not to be a common term in the literature, but the practice is fairly common; it may be seen in Nancy Ide's textbook Pascal for the Humanities, in Kernighan and Pike's extended example of program development (of the hoc or `higher order calculator' processor) in their book on The Unix Programming Environment, and in Melinda Varian's tutorial on CMS Pipelines programming. Melinda suggests that the best way to start developing a new pipeline stage is usually to copy the program NULL REXX -- which does nothing but copy its input into its output -- into a new file with the desired name, and modify from there. It works for me, and it works for my colleagues. Not everyone works this way, of course. If his essays on the development of TeX can be believed, Donald Knuth wrote the entire program out, in Web form, before he tried to compile it the first time. [I tried to do that with the first version of the Sweb processor, but eventually gave up and did it in six stages with about forty compiled versions and runs over the test data.]

The successive approximations or stages of development for the Sweb processor are these (some are done, others are only planned):

* 0.01: recognize the <scrap> elements in the input stream
* 0.02: parse the attributes of scraps
* 0.03: keep track of the scraps seen and of the files to be written
* 0.04: parse the contents of scraps, including <ptr> and <ref> elements
* 0.05: build a linked list of the contents of each scrap
* 0.06: record which scraps belong in which file
* 0.1: the sum of the preceding stages: tangle the scraps into
  compilable files
* 0.4: support multiple versions of the program in one document (planned)
* 0.5: handle SGML comments, marked sections, entity references, and
  webs spread over multiple files (planned)

When we reach version 0.4, with support for multiple versions, we will be in a better position to handle the gradual accretion of features and code stage by stage. Until then, I'll probably just break old versions of the code as needed, in order to get the new versions to work. I'll try not to render the document unreadable in the process, but I hope the reader will bear with me.

When all of these stages have been completed, we'll have something ready for formal release as version 1.0. (Earlier versions are likely to be made available on the network beginning about version 0.5.)

Here, in the context of the features still to be added, is probably as good a place as any to list some of the known bugs and shortcomings currently present in the code:

< 1 Shortcomings of the current code > =

 
* IDs are case-sensitive; this should be user controllable, and
  case-insensitive by default.
* attribute values are now all single quoted in the output.  They
  should be quoted or unquoted as in the input.  This means moving
  the unquoting code from lex into yacc and invoking it only after
  the attribute value has been written to the output file.
  (I resist the idea of making attribute-value quoting user
  controllable with a run-time flag. SWeb is not a general-purpose
  normalizer; if one is desired, it should be a separate program.)
* left angle brackets in source code currently need manual post-editing
  (a global change, selective if necessary, of '&lt;' to '<').  The
  program should be able to handle CDATA marked sections and entity
  references as alternative workarounds; it should also be able to
  suppress SGML comments in code output with languages other than
  SGML.  Ultimately, a task for version 0.4, but a quick fix
  (recognition of comments as distinct scrap-content fragments)
  can go into version 0.04.

We can also say, quickly, that we don't know of any actual bugs:

< 2 Known bugs > =

 
None that I know of.

1.1 Yacc Grammar

The C implementation of Sweb will use yacc and lex to simplify the organization of the program.

The overall organization of the yacc program is:

< 3 sweb.y ver. 0.01 >(sweb.y) =

 
/* sweb.y:  Reads Sweb files, produces compilable code */

/*
** Revisions:
** 1996-02-09 resume work, using successive approximation.  Ver. 0.01
** 1996-01-14 begun work on sweb.tei
*/

/* Known bugs:
< Known bugs 2 >
*/
/* Known non-bug shortcomings:
* not finished
< Shortcomings of the current code 1 >
*/

%{
< C declarations, etc. 4 >
%}
< Declare yylex return type 10 >
< Yacc/Lex terminal token types 29 >
< Data types of terminals and non-terminals 30 >
%%
< Yacc grammar 27 >
%%
< Define main function 11 >
< Declare miscellaneous C functions 15 >

The initial part of the yacc file contains C declarations and the like. We know we'll need at least the following:

< 4 C declarations, etc. > =

 
< Include files for yacc 5 >
< Macro and type definitions for yacc 6 >
< C declarations for scrap and file records 59 >
< Global variables, etc. 8 >
< Function prototypes 9 >

< 5 Include files for yacc > =

 
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
#include <assert.h>
#include <ctype.h>   /* for toupper() in SUpperS() */

(This version of the code doesn't use assertions, but we'll go ahead and add it in now; it's worth having.)

< 6 Macro and type definitions for yacc > =

 
#define YYDEBUG 1
#define YYERROR_VERBOSE 1

If we are compiling under DOS with Zortech, we need to define some macros to allow the distributed Bison code to compile. Thanks to Bob Goldstein for figuring these out, long ago. I certainly could not have done so.

< 7 Macro and type definitions for yacc 6 (cont'd) > =

 
#ifdef __ZTC__
#define bcopy(A,B,C) memcpy((B),(A),(C))
#define alloca(A)    malloc(A)
#define max(A,B)  ((A) > (B) ? (A) : (B))
#endif

Global variables ...

< 8 Global variables, etc. > =

 

Functions we need to pre-declare:

< 9 Function prototypes > =

 
void yyerror(char* s);

The second section of the yacc file contains yacc declarations for the datatype of the lexical values returned by the lexical scanner, the token names defined as macros in the C output from yacc, and the datatypes of the various non-terminal symbols in the grammar. For the most part (exclusively?), our lexical scanner will return strings as the lexical value of tokens. If it's possible to define the type of yylval as just a string, I don't know how to do it. The only way I know is to use the %union keyword and define it as a union of string and integer.

< 10 Declare yylex return type > =

 
%union {
         int   i;
         char* s;
}

The tokens and their types, and the types of non-terminals, we can worry about later, when we work on the grammar itself.

We can go ahead and get the main function out of the way, though, since it doesn't do much but read command-line options and call the parser.

< 11 Define main function > =

 
int main(int argc, char *argv[]) {
  int i;
  extern int fDebug;
  char c;

  < Process command-line arguments 12 >
  yyparse();
  < Write scraps out to files 108 >
}

Command-line arguments take the only complicated processing. There are two cases: flags and other stuff.

< 12 Process command-line arguments > =

 
< Command-line flags 13 >
< Other command-line arguments 14 >

First, the flags:

< 13 Command-line flags > =

 
  /* get arguments and set things up ... */
  while (--argc > 0 && ((*++argv)[0] == '-' || (*argv)[0] == '/'))
    while (c = *++argv[0])
      switch (c) {
        case 't': fTrace = fDebug = fVerbose = 1;
                  iMsglevel = msgTRACE;
                  yydebug = 1;
                  break;
        case 'd': fDebug = fVerbose = 1;
                        iMsglevel = msgDEBUG;
                        break;
        case 'v': fVerbose = 1;
                        iMsglevel = msgVERBOSE;
                        break;
        default:
                   fprintf(stderr,"sweb:  unknown option
%c\n",c);
                   argc = 0;
                   break;
      }

But in fact, for now we don't handle file-name arguments.

< 14 Other command-line arguments > =

 
  if (argc > 0) {     /* we have filename args, process them */
    fprintf(stderr,
            "sweb:  Sorry, filename arguments"
            " not implemented yet\n");
  }

While we're here, let's take care of yyerror, too.

< 15 Declare miscellaneous C functions > =

 
void yyerror(char* s) {
    fprintf(stdout,"L%d <" "-- %s -->\n",cLinecount,s);
    fprintf(stderr,"! ! ! line %d of input:  %s ! ! !\n",
              cLinecount,s);
}

1.2 Lexical Scanner

The overall organization of the lexical scanner is:

< 16 sweblex.l >(sweblex.l) =

 
%{
/* SWebLex.L:  lex input for scanner for SWeb literate programming */
/*  To do:
*/
/*  Revisions:
** 1996-02-09 resume work, using successive approximation.  Ver. 0.01
** 1996-01-14 begun work on sweb.tei
*/

< C code for top of lex file 17 >
%}
< Lex macro definitions 37 >
< Lexical scanner start states 34 >
%%
< Rules for lex 35 >
%%
< Miscellaneous C functions for lex 43 >

The preliminary section of the lex file contains C declarations, mostly #include directives for header files and so on.

< 17 C code for top of lex file > =

 
< C header files for lex code 18 >
< C macro definitions for lex code 19 >
< Global variables and function prototypes for lex 22 >

The only header files we need are sweb.h, with yacc's declarations of token types, etc., and mycat.h, with my own functions for string concatenation.

< 18 C header files for lex code > =

 
#include "sweb.h"
#include "mycat.h"

Since we'll need to return a lot of `tokens' which are strings of various types, we define a macro to copy a string value from yytext to newly allocated storage, and place the relevant pointer into yylval.s. (The variable yytext is volatile, and can change if yylex is called for lookahead, so the strings have to be copied if they are to have much of a lifetime.)

< 19 C macro definitions for lex code > =

 
#define SAVEVALUE yylval.s = mycopy1(yytext)

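(The header mycat.h and its functions are not part of this web. The sketch below shows what mycopy1 is assumed to do -- duplicate a string into newly allocated storage, strdup-style; it is my assumption for illustration, not the actual mycat.h code.)

/* Assumed behavior of mycopy1 (mycat.h is not shown here):
** duplicate a string into fresh storage, strdup-style. */
char * mycopy1(char * pc) {
  char * pcNew = (char *) malloc(strlen(pc) + 1);
  if (pcNew != NULL)
    strcpy(pcNew,pc);
  return pcNew;
}
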
A couple of things appear to be necessary only because different compilers provide different sets of native functions.

< 20 C macro definitions for lex code 19 (cont'd) > =

 
/* Zortech thinks fileno is not ANSI C, Vern Paxson thinks it is */
#define fileno(fp)      ((fp)->_file)
#define max(A,B)  ((A) > (B) ? (A) : (B))

The lexical scanner from which I copied this shell (that for the DTD parser dpp) redefines the size of the standard lex buffer. I think this is to allow many entities to nest without running out of memory under DOS (a buffer is set up for each entity), but (a) I don't know for sure, and (b) we're not handling SGML entities at the moment, so (c) I'm putting this in an unreachable scrap, so it's not included now but is here if we need it later.

< 21 Redefine size of Lex buffers > =

 
#undef YY_READ_BUF_SIZE
#undef YY_BUF_SIZE
#define YY_READ_BUF_SIZE 1024
#define YY_BUF_SIZE (YY_READ_BUF_SIZE * 2) /* size of default input buffer */

The only global variable we will need for now is cLinecount, which contains a positive integer giving the line number of the current input. (This is used by msg.c -- should perhaps be defined there, not here?) We also define fnCurrent, the current file name, which for now is hard-coded to stdin.

< 22 Global variables and function prototypes for lex > =

 
int cLinecount = 1;
char * fnCurrent = "<stdin>";

The rules and definitions sections of the lex file will be described in the next section.

1.3 Messages

For debugging and tracing, we'll need more messages than for normal production work. The level of message tracing provided is controlled by global variables set from command-line parameters and passed to the procedure MsgKwS which (as the name implies) accepts as arguments a keyword indicating the message level and one or more strings containing the message. If the level of the message meets the message threshold set at run time, MsgKwS writes the message to stderr; otherwise it's ignored.

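(The file msg.c itself is not part of this web. The following sketch shows one shape MsgKwS might take, given this description -- a level keyword followed by a NULL-terminated list of strings; it is an assumption for illustration, not the actual msg.c code.)

#include <stdio.h>
#include <stdarg.h>

extern int iMsglevel;   /* message threshold set at run time */

void MsgKwS(int kwLevel, ...) {   /* hypothetical sketch */
  va_list ap;
  char * pc;
  if (kwLevel < iMsglevel)        /* below the threshold:  ignore */
    return;
  va_start(ap,kwLevel);
  while ((pc = va_arg(ap,char *)) != NULL)
    fputs(pc,stderr);
  fputc('\n',stderr);
  va_end(ap);
}
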
First, we embed the header file for the message function:

< 23 Include files for yacc 5 (cont'd) > =

 
#include "msg.h"

Next, we define a macro to allow us to embed debugging messages with reckless abandon, and define the enumerated type for message-level settings:

< 24 Macro and type definitions for yacc 6 (cont'd) > =

 
#define DEBUGMSG(S) if (fDebug) MsgKwS(msgDEBUG,S)
enum MSGTYPES  { msgTRACE   = 0,
                 msgDEBUG   = 1,
                 msgVERBOSE = 3,
                 msgINFORM  = 5,
                 msgWARNING = 7,
                 msgERROR   = 10
};

Finally, we define the global variables indicating whether we have set various flags for tracing, debugging, verbose logging, and the message level.

< 25 Global variables, etc. 8 (cont'd) > =

 
int fTrace = 0;
int fDebug = 0;
int fVerbose = 0;
int iMsglevel = msgINFORM;

Strictly speaking, I suppose the declarations for fTrace, etc., should be in the section on command-line arguments, below. But let's try it this way and see how it looks.

N.B. msg.c embeds the old dppflags.h file, which declares the trace flags, etc., as externals. This reflects the origins of msg.c in the DTD pre-processor code, and should probably be re-thought with msg.c as a library function. (When this happens, it should probably be rewritten to use vfprintf, to avoid requiring a terminating NULL argument.) For now, I just leave it, and with it I leave the definition of cLinecount as line counter in the lex file. We embed sweblex.h to give ourselves access to it in the yacc grammar:

< 26 Include files for yacc 5 (cont'd) > =

 
#include "sweblex.h"

2 Version 0.1: Tangling Scraps

The most basic function of Sweb is to tangle the code into a compilable program. Version 0.1 of the program will do nothing more than that. What we do is this: recognize the <scrap> elements and their attributes, collect the contents of each scrap, keep track of which scraps belong in which output files, and at the end of the run write each file out, expanding embedded scraps and continuations as we go.

For the moment, we simplify life by ignoring SGML comments, marked sections, and entity references. This limits us to webs contained in single files, which don't contain examples of scrap markup which could be mistaken for real scraps. We'll fix these limitations later (version 0.5).

2.1 Version 0.01: Recognizing Scraps

We begin very simply: just recognize the beginning and ending of scraps.

2.1.1 File Structure

We define an input file as a series of scraps interspersed with other material (strings). When we see a string, we write it to stdout. In production, we'll write it out unchanged; for debugging during development, we want to label each bit. So (for now, at least) we use the macro WRITE to allow us to change the definition back and forth more easily.

< 27 Yacc grammar > =

 
file    : /* nil */
        | file scrap
        | file STRING {
            WRITE("-C",$2);
            free($2);
        }
        ;

For now, we define the macro to label each bit we write with a prefix (-C, meaning "not-in-a-scrap, Content", plus the line number) so we can tell whether we are successfully distinguishing scraps from non-scraps.

< 28 Define WRITE macro > =

 
#define WRITE(A,B) if (strcmp(B,"\n") == 0) {              \
               fprintf(stdout,A " %d:\\n\n",cLinecount);   \
            } else {                                       \
               fprintf(stdout,A " %d:%s\n",cLinecount,B);  \
            }
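
For example, with cLinecount at 12 (hypothetical values, for illustration only):

WRITE("-C","hello");   /* prints:  -C 12:hello */
WRITE("-C","\n");      /* prints:  -C 12:\n    */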

These rules require us to define STRING as a token type:

< 29 Yacc/Lex terminal token types > =

 
%token STRING

< 30 Data types of terminals and non-terminals > =

 
%type <s> STRING

We recognize the scraps by their start- and end-tags. (Both are required; simplifying this code is one reason why.)

< 31 Yacc grammar (version 0.01) > =

 
scrap   : startscrap content endscrap
        ;
startscrap : STAGSCRAP   { WRITE("+S",$1); }
        ;
endscrap : ETAGSCRAP     { WRITE("+S",$1); }
        ;
content : /* nil */ {
              fprintf(stdout,"+S %d:\n",cLinecount);
        }
        | content STRING {
            WRITE("+S",$2);
        }
        ;

These rules introduce two new types:

< 32 Terminals (version 0.01) > =

 
%token STAGSCRAP ETAGSCRAP

< 33 Types (version 0.01) > =

 
%type <s> STAGSCRAP ETAGSCRAP

The lexical scanner needs to recognize the beginning of a scrap and go into a scrap-recognition mode. Outside that mode, it needs to recognize everything as a string. To reduce the chance of buffer overflow, we'll break the strings at newline; to ensure that scraps don't get misread as strings, we'll also break at open angle bracket. First, we declare the mode:

< 34 Lexical scanner start states > =

 
%x SC

Outside of scraps, lex will match entire lines at a time; within scraps, smaller parts of lines must be matched. So we define SC (scrap content) as an exclusive start state.

Outside of scraps, we just scan for newlines and the start of scraps.

< 35 Rules for lex > =

 
< Recognize SGML comments 96 >
    /* General string recognition */
[^\n\r<]+               { SAVEVALUE; return(STRING); }
\n                      { SAVEVALUE; cLinecount++; return(STRING); }
\r                      { ; }
{Stago}                 { SAVEVALUE; return(STRING); }

Note that when we see a newline, we increment cLinecount. We can thus use cLinecount to keep track of what line we're on in the input -- it's handy for informational messages and warnings. Carriage returns (which lex is recognizing separately from newlines in MS-DOS) are ignored, since when the newline is written out, the C compiler will automatically add the carriage return. (I hope.)

When newlines appear within tokens like the white space before or within attribute value specifications, we have to increment cLinecount appropriately. We define a simple macro for this purpose, and use it whenever the token might contain a newline.

< 36 C macro definitions for lex code 19 (cont'd) > =

 
#define NLCHECK { char * pc; \
        for (pc = yytext; *pc != '\0'; pc++) \
          if (*pc=='\n') cLinecount++; }

(The pointer pc is not otherwise used in this program, so we make it local to this block. I didn't realize this was even legal in C until very recently: Kernighan and Ritchie don't talk about it when they discuss the scope of variables. But it's handy to be able to have very local throw-away variables like this, as in a Scheme let.)
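
(A minimal free-standing illustration of such a block-local declaration, analogous to a Scheme let; my example, not part of the program:)

#include <stdio.h>

int main(void) {
  int n = 7;
  { int iSquare = n * n;   /* local to this block, like a Scheme let */
    printf("%d squared is %d\n",n,iSquare);
  }
  /* iSquare is no longer in scope here */
  return 0;
}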

2.1.2 Scraps in the Scanner

A scrap is a sequence of character data and cross references to embedded scraps. Later, we will store the sequence as a linked list, and represent the scrap as a structure with information about the scrap. For now, all we want to do is recognize it and write it back out.

To recognize start-tags of scraps, it is convenient to have some simple lex macro definitions. First, the SGML delimiters:

< 37 Lex macro definitions > =

 
Stago   "<"
Etago   "</"
Tagc    ">"

Then, macros useful in defining the shape of attribute value specifications: white space, attribute name, value indicator, and attribute value (lexically, just a few possible types):

< 38 Lex macro definitions 37 (cont'd) > =

 
Namechar [A-Za-z0-9.\-]
Name    [A-Za-z]{Namechar}*
Number  [0-9]+
Numtok  [0-9]+[A-Za-z.\-]{Namechar}*
S       [ \t\n\r]
Vi      (({S}*)"="({S}*))
Lit     "\""
Lita    "'"
Litval  ({Lit}[^"]*{Lit})
Litaval ({Lita}[^']*{Lita})
Attvalue ({Name}|{Litval}|{Litaval})
AVLong  {S}+{Name}{Vi}{Attvalue}
AVShort {S}+{Name}
AVS     {AVLong}|{AVShort}

And finally, a full definition of a start-tag or end-tag in general (we don't use these now, but we will when we start recognizing the contents of <ref> elements in scraps) and start-tags and end-tags for <scrap> elements:

< 39 Lex macro definitions 37 (cont'd) > =

 
Stag        {Stago}{Name}({AVS}*)({S}*){Tagc}
ScrapStag   {Stago}"scrap"({AVS}*)({S}*){Tagc}
Etag        {Etago}{Name}({S}*){Tagc}
ScrapEtag   {Etago}"scrap"({S}*){Tagc}
STS         [ \t]*((\n\r?)|(\r\n?))?

The macro STS (start-tag space) will be used to detect newlines immediately following the end of a start-tag. I think ISO 8879 requires the newline to follow the start-tag immediately, but I allow blanks and tabs before it, since (a) they are invisible and not allowing them will confuse users including myself, and (b) some editors insert a trailing blank when the file is saved. The complications in the newline are an attempt to make the code work on all systems: it will recognize a carriage-return and linefeed in either order, or either one alone, as a newline. If vendors would implement the standards for a change, this kind of variation wouldn't be the headache it is.
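
For instance, each of the following matches STS (illustrative examples, not from the original):

"  \n"      two blanks, then a linefeed
"\t\r\n"    a tab, then carriage return and linefeed
""          the empty string (every part of the pattern is optional)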

Now we're ready to recognize the beginning of a scrap, which puts us into scrap-content mode:

< 40 Recognition of scrap start-tags > =

 
{ScrapStag}{STS}        { BEGIN SC;
                          SAVEVALUE;
                          NLCHECK;
                          return(STAGSCRAP);
                        }

And the end of a scrap, which puts us back into the initial mode:

< 41 Rules for lex 35 (cont'd) > =

 
    /* Scrap Contents */
<SC>"\n"?{ScrapEtag}    {
                          BEGIN 0;
                          SAVEVALUE;
                          NLCHECK;
                          return(ETAGSCRAP);
                        }

We recognize an immediately following newline as if it were part of the start-tag; we do the same for a newline immediately before the end-tag. This is an incomplete but useful nod in the direction of support for the newline rules of ISO 8879.

Within scraps, we will eventually be scanning for <ref> and <ptr> elements, but for now we can use the same rules as in the initial mode:

< 42 Rules for lex 35 (cont'd) > =

 
    /* Scrap contents */
<SC>[^\n\r<]+           { SAVEVALUE; return(STRING); }
<SC>\n                  { SAVEVALUE; cLinecount++; return(STRING); }
<SC>\r                  { ; }
<SC>{Stago}             { SAVEVALUE; return(STRING); }

We don't need any C functions in the lexer at the moment:

< 43 Miscellaneous C functions for lex > =

 
/* */

2.1.3 Summary of Version 0.01

The rest of version 0.1 will just be a series of modifications and enhancements to the framework we've just set up in the two files sweb.y and sweblex.l. With all the scraps we've seen thus far expanded, this is how they look:

And the lex file:

2.2 Version 0.02: Parsing Scrap Attributes

First of all, we replace the simple-minded code for matching the start of a scrap with slightly more complex code which matches it and also allows us to recognize its important attributes:

First, we revise the yacc grammar for files:

< 44 Yacc grammar 27 (cont'd) > =

 
scrap   : startscrap scrapcontent endscrap
        ;
startscrap : GISCRAP { WRITE("+S",$1); }
             scrapatts
             TAGC {
                WRITE("+S",$4);
                free($1); free($4);
        }
        ;
endscrap : ETAGSCRAP     { WRITE("+S",$1); free($1); }
        ;
< Yacc rules for attributes of scraps 60 >
< Handle content of scrap, ver. 0.05 89 >

The code for handling scrap content will change in version 0.04, but for now we just want very simple processing, so we can confirm we are reading the scraps right.

< 45 Handle content of scrap, ver. 0.02 > =

 
scrapcontent : /* nil */ {
              fprintf(stdout,"+S %d:\n",cLinecount);
        }
        | scrapcontent STRING {
            WRITE("+S",$2);
        }
        ;

This requires redefining the terminal tokens (replacing the old STAGSCRAP with GISCRAP):

< 46 Yacc/Lex terminal token types 29 (cont'd) > =

 
%token GISCRAP ETAGSCRAP TAGC

< 47 Data types of terminals and non-terminals 30 (cont'd) > =

 
%type <s> GISCRAP ETAGSCRAP TAGC

Scraps can take a wide variety of attributes (TEI Lite has a lot of global attributes), but we are interested only in a few of them. For those we need to act on, we have separate rules; for all the rest, we have a single miscellaneous rule:

< 48 Yacc rules for attributes of scraps > =

 
scrapatts : /* nil */
        | scrapatts avsid
        | scrapatts avsfile
        | scrapatts avsprev
        | scrapatts avsname
        | scrapatts avscorr
        | scrapatts avsmisc
        ;
avsid   : SPACE KWID VI attvalue       { WRATT("+S","id  ",$4); }
        ;
avsfile : SPACE KWFILE VI attvalue     { WRATT("+S","file",$4); }
        ;
avsprev : SPACE KWPREV VI attvalue     { WRATT("+S","prev",$4); }
        ;
avsname : SPACE KWNAME VI attvalue     { WRATT("+S","name",$4); }
        ;
avscorr : SPACE KWCORRESP VI attvalue  { WRATT("+S","corr",$4); }
        ;
avsmisc : SPACE NAME VI attvalue       { WRATT("+S",$2,$4); }
        ;
attvalue : LIT
        | LITA
        | NAME
        | NUMBER
        | NUTOKEN
        ;

We introduce a second macro for writing to stdout:

< 49 Define WRITE macro 28 (cont'd) > =

 
#define WRATT(A,B,C) fprintf(stdout,"%s %d %s=%s\n",A,cLinecount,B,C);

These rules require declarations for SPACE, VI, the various attribute names serving as keywords, and the various forms of attribute values, whether literals or token types: NAME, NUMBER, NUTOKEN, LIT, and LITA:

< 50 Yacc/Lex terminal token types 29 (cont'd) > =

 
%token KWID KWFILE KWPREV KWNAME KWCORRESP
%token SPACE NAME VI LIT LITA NUMBER NUTOKEN

< 51 Data types of terminals and non-terminals 30 (cont'd) > =

 
%type <s> KWID KWFILE KWPREV KWNAME KWCORRESP
%type <s> SPACE NAME VI LIT LITA NUMBER NUTOKEN

The lexical scanner changes, too.

First, we add a rule for recognizing GISCRAP and shifting to the attribute-recognition mode:

< 52 Rules for lex 35 (cont'd) > =

 
{Stago}"scrap"          { BEGIN SA;
                          SAVEVALUE;
                          return(GISCRAP);
                        }

Then we add rules for recognizing the scrap attributes:

< 53 Rules for lex 35 (cont'd) > =

 
    /* Scrap Attributes */
<SA>{S}+                { SAVEVALUE; NLCHECK; return(SPACE);   }
<SA>"id"                { SAVEVALUE; return(KWID);    }
<SA>"file"              { SAVEVALUE; return(KWFILE);  }
<SA>"prev"              { SAVEVALUE; return(KWPREV);  }
<SA>"name"              { SAVEVALUE; return(KWNAME);  }
<SA>"corresp"           { SAVEVALUE; return(KWCORRESP); }
<SA>{Name}              { SAVEVALUE; return(NAME);    }
<SA>{Vi}                { SAVEVALUE; NLCHECK; return(VI); }
<SA>{Litval}            { SAVEVALUE; return(LIT);     }
<SA>{Litaval}           { SAVEVALUE; return(LITA);    }
<SA>{Number}            { SAVEVALUE; return(NUMBER);  }
<SA>{Numtok}            { SAVEVALUE; return(NUTOKEN); }
<SA>{S}*{Tagc}{STS}     { SAVEVALUE;
                          NLCHECK;
                          BEGIN SC;
                          return(TAGC);
                        }

Since (at least for now) TAGC is recognized only at the end of start-tags, we recognize an immediately following newline at the same time. If we didn't do this (and if we didn't eat the newline before the end-tag of the scrap) we'd have a lot of extraneous blank lines in the programs we write out.

See the appendix for some notes on newline handling and its implications for getting the code to print correctly.

< 54 Lexical scanner start states 34 (cont'd) > =

 
%x SA

2.3 Version 0.03: Scraps and Files

In version 0.03, we begin to prepare for processing the scraps into files. The first step is to keep track of what scraps there are, and what files we are supposed to write out.

We describe each scrap with a scraprecord structure: for each scrap we record the line on which it starts (strictly speaking, this will be the line on which its start-tag ends -- if that's a problem we need to change the processing of scrap start-tags), the values if any of its id, file, prev, name, and corresp attributes, and the scraps which name this one in their prev or corresp attributes (i.e. its continuations and equivalents). We also point to (the beginning and the end of) a linked list of strings containing the contents of this scrap. (The string record and its linked list are defined in section Version 0.05: Linked List of Scrap Contents.)

< 55 Define scrap-record structure > =

 
struct scraprecord {
        /* for debugging:  where did this guy start? */
        int  cLine;
        /* attribute values */
        char *pcId;
        char *pcFile;
        char *pcPrev;
        char *pcName;
        char *pcCorresp;
        /* linked nodes */
        struct scraplist * pslContins;
        struct scraplist * pslEquivs;

        /* content */
        struct stringrec * psrFirst;
        struct stringrec * psrLast;
};

Files have a simpler record type: we record their name and the scraps which point at them:

< 56 Define file-record structure > =

 
struct filerecord {
        char *pcFile;
        struct scraplist * pslScraps;
};

Both the file record and the scrap record require linked lists of scraps for various purposes. The scrap list resembles a list of cons cells: each record has a pointer to a scrap record (the car) and to the next record in the list (the cdr).

< 57 Define scrap-list record structure > =

 
struct scraplist {
        struct scraprecord * pscrapThis;
        struct scraplist   * pslNext;
};

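(To make the cons analogy concrete, here is a hypothetical cons-style constructor that prepends a scrap to a list; it is only an illustration -- the program itself builds its lists with PslMakeScraplist(), defined later in this document.)

/* Hypothetical:  prepend a scrap to a list, cons-style. */
struct scraplist * PslCons(struct scraprecord * pscrap,
                           struct scraplist * pslTail) {
  struct scraplist * psl =
    (struct scraplist *) malloc(sizeof(struct scraplist));
  if (psl != NULL) {
    psl->pscrapThis = pscrap;    /* the car */
    psl->pslNext    = pslTail;   /* the cdr */
  }
  return psl;
}
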
Eventually, scraps and files will both be represented in skip lists, so that (a) we can have arbitrary numbers of each and (b) we can make alphabetical lists of them for indices or verbose logging. For now, though, we avoid the complexity of skip lists by allocating global arrays of scrap records (rgScraps) and file records (rgFiles). The absolute maximum number of each is defined with a macro; the current high-water mark is recorded in a global variable (macScraps, macFiles); the prefix mac, meaning current maximum, is borrowed from standard Hungarian notation (need bibliographic reference to the article in Byte). For both files and scraps, we also define a global pointer to the `current' file or scrap, i.e. the one currently being defined; these should always be the same as &rgFiles[macFiles - 1] or &rgScraps[macScraps - 1].

< 58 Define arrays of file and scrap records > =

 
#define maxFiles 100
#define maxScraps 200
int macFiles = 0;       /* max=absolute max, mac=current high-water */
int macScraps = 0;

struct filerecord rgFiles[maxFiles];
struct filerecord * pfileCur = rgFiles;
struct scraprecord rgScraps[maxScraps];
struct scraprecord * pscrapCur = rgScraps;

These scraps all belong in the declarations section:

< 59 C declarations for scrap and file records > =

 
< Define string-record structure 86 >
< Define scrap-record structure 55 >
< Define file-record structure 56 >
< Define scrap-list record structure 57 >
< Define arrays of file and scrap records 58 >

The most complicated processing goes into the handling of the attributes on the <scrap> element. (This scrap should probably be broken up a bit.)

< 60 Yacc rules for attributes of scraps > =

 
scrapatts : /* nil */ {
            < Initialize scrap record 64 >
        }
        | scrapatts avsid
        | scrapatts avsfile
        | scrapatts avsprev
        | scrapatts avsname
        | scrapatts avscorresp
        | scrapatts avsmisc
        ;
avsid   : SPACE KWID VI attvalue {
            WRATT("+S","id  ",$4); WRITEATT($1,$2,$3,$4);
            pscrapCur->pcId = SUpperS(SUnquoteS($4));
            free($1); free($2); free($3); /* don't free $4 */
        }
        ;
avsfile : SPACE KWFILE VI attvalue {
            WRATT("+S","file",$4); WRITEATT($1,$2,$3,$4);
            pscrapCur->pcFile = SUnquoteS($4);
            < Add or update entry in files list 66 >
            free($1); free($2); free($3); /* don't free $4 */
        }
        ;
avsprev : SPACE KWPREV VI attvalue {
            WRATT("+S","prev",$4); WRITEATT($1,$2,$3,$4);
            pscrapCur->pcPrev = SUpperS(SUnquoteS($4));
            /* add/update entry in pslContins */
            < Add or update continuations list 101 >
            free($1); free($2); free($3); /* don't free $4 */
        }
        ;
avsname : SPACE KWNAME VI attvalue {
            WRATT("+S","name",$4); WRITEATT($1,$2,$3,$4);
            pscrapCur->pcName = SUnquoteS($4);
            free($1); free($2); free($3); /* don't free $4 */
        }
        ;
avscorresp : SPACE KWCORRESP VI attvalue {
            WRATT("+S","corr",$4); WRITEATT($1,$2,$3,$4);
            pscrapCur->pcCorresp = SUpperS(SUnquoteS($4));
            /* add/update entry in pslEquivs */
            free($1); free($2); free($3); /* don't free $4 */
        }
        ;
avsmisc : SPACE NAME VI attvalue {
            WRATT("++",$2,$4);     WRITEATT($1,$2,$3,$4);
        }
        ;
attvalue : LIT
        | LITA
        | NAME
        | NUMBER
        | NUTOKEN
        ;

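(The macro WRITEATT used above is not defined in this section of the web. Since its job, as described below, is to echo the attribute-value specification to the output unchanged, a minimal sketch -- my assumption, not the author's definition -- would be:)

/* Assumed shape of WRITEATT:  echo space, name, value indicator,
** and value verbatim, so the output matches the input. */
#define WRITEATT(A,B,C,D) fprintf(stdout,"%s%s%s%s",(A),(B),(C),(D))
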
When we read attribute values, we want to write them to output unchanged, but before we store their values in the scrap records, we need to unquote them and (for SGML ID values) do case folding. The function SUnquoteS() takes a string as argument and returns a pointer to the same string without the quotes. The modification is done in place, and the old string is lost.

< 61 Declare miscellaneous C functions 15 (cont'd) > =

 
char * SUnquoteS(char * ps) {
  char * pcC;  /* current character */
  char * pcN;  /* next character */
  char cT;
  if ((*ps == '\'') || (*ps == '\"')) {
     cT = *ps;
     pcN = ps;
     for (pcC = pcN++; *pcN != cT; pcC = pcN++)
        pcC[0] = *pcN;
     pcC[0] = '\0';
  }
  return ps;
}

If the first character is not a single or double quote, nothing happens and the pointer is returned without change to the string.
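
For example (a hypothetical call):

char rgc[] = "'main.c'";
char * pc = SUnquoteS(rgc);   /* rgc now holds "main.c"; pc == rgc */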

We do case folding with SUpperS(), which also works destructively.

< 62 Declare miscellaneous C functions 15 (cont'd) > =

 
char * SUpperS(char * ps) {
  char * pcC;  /* current character */
  for (pcC = ps; *pcC != '\0'; pcC++)
        *pcC = toupper(*pcC);
  return ps;
}

We need prototypes for these functions:

< 63 Function prototypes 9 (cont'd) > =

 
char * SUnquoteS(char * ps);
char * SUpperS(char * ps);

N.B. both functions belong in a library, not here.

When we start a new <scrap> element, we need to initialize it; the starting rule for scrapatts is a convenient time for the initialization:

< 64 Initialize scrap record > =

 
            /* initialize the scrap record */
            /* eventually, allocate a new one for insertion into list */
            /* for now, just let it be */
            pscrapCur = rgScraps + macScraps;
            macScraps++;

            pscrapCur->cLine      = cLinecount;
            pscrapCur->pcId       = nullstring;
            pscrapCur->pcFile     = nullstring;
            pscrapCur->pcPrev     = nullstring;
            pscrapCur->pcName     = nullstring;
            pscrapCur->pcCorresp  = nullstring;
            pscrapCur->pslContins = NULL;
            pscrapCur->pslEquivs  = NULL;
            pscrapCur->psrFirst   = NULL;
            pscrapCur->psrLast    = NULL;

We've used a character pointer to the null string to initialize various of these pointers, rather than using the null pointer itself.

< 65 Global variables, etc. 8 (cont'd) > =

 
char* nullstring = "";

When we encounter the file attribute, we need to make a record for the file, with a pointer to this scrap. If there's already a record for the file, we need to append this scrap to its scraplist field (actually, we don't do that until version 0.06: for now, we just want to get the lists of scraps and files working right).

< 66 Add or update entry in files list > =

 
            { int iT = 0;
              int fFilefound = 0;
              for (iT = 0; ((fFilefound==0) && (iT < macFiles)); iT++) {
                  if (strcmp($4,rgFiles[iT].pcFile) == 0) {
                    fFilefound = 1;
                  } /* if */
              } /* for */

              if (!fFilefound) {
                < If file not found, add it 99 >
              } else {
                < If file is found, add scrap to its list 100 >
              }
            }

If the file is not found, we add it to the file array:

< 67 If file not found, add it > =

 
                pfileCur = rgFiles + macFiles;
                macFiles++;
                pfileCur->pcFile = $4;

If the file is found, we need to add this scrap to the list of scraps attached to this file. For now, though, we do nothing.

< 68 If file is found, add scrap to its list > =

 
                /* add this to the list of scraps for this file ... */

These rules imply we are giving attvalue a string value; we need to declare it.

< 69 Data types of terminals and non-terminals 30 (cont'd) > =

 
%type <s> attvalue

As an interim check on our work, we add the following code to the main function, to print lists of file and scrap names encountered:

< 70 Dump lists > =

 
  if (macFiles == 0) {
    fprintf(stdout,"No files recorded.\n");
  }
  for (i = 0; i < macFiles; i++) {
    < Describe file 103 >
  } /* for each file */
  if (macScraps == 0) {
    fprintf(stdout,"No scraps recorded.\n");
  }
  for (i = 0; i < macScraps; i++) {
    < Describe scrap 94 >
  } /* for each scrap */

For now, the act of reporting on a file or a scrap is fairly simple; they will get more complex in version 0.05 and version 0.06.

< 71 Describe scrap > =

 
    fprintf(stdout,"Scrap %d on line %d, id = %s\n",
            i,rgScraps[i].cLine,rgScraps[i].pcId);

< 72 Describe file > =

 
    fprintf(stdout,"File %d:  %s\n",i,rgFiles[i].pcFile);

Version 0.03 requires no changes to the lexical scanner.

2.4 Version 0.04: Scrap Contents

In version 0.04, we begin parsing the contents of each scrap.

We don't do anything with this information yet. We just make sure we can recognize it properly.

To begin with, we rewrite the code for handling scrap content. In addition to strings, we recognize <ref> and <ptr> elements:

< 73 Handle content of scrap, ver. 0.04 > =

 
scrapcontent : /* nil */ {
              fprintf(stdout,"+S %d:\n",cLinecount);
        }
        | scrapcontent STRING {
            WRITE("+S",$2);
        }
        | scrapcontent ref
        | scrapcontent ptr
        ;
< Pointers to nested scraps 91 >
< Ref-element content 76 >

The pointer elements have their own attributes, and <ref> elements also have content, which can contain arbitrary phrase-level SGML tags, which we don't have to recognize. (Since a <ref> can be nested within a <ref>, though, we do have to keep track of nesting.)

< 74 Pointers to nested scraps ver. 0.04 > =

 
ref     : GIREF {
                WRITE("+R",$1);
          }
          ptratts TAGC {
                WRITE("+R",$4);
          }
          refcontent ETAGREF {
                WRITE("+R",$7);
                free($1); free($4); free($7);
        }
        ;
ptr     : GIPTR {
                WRITE("+P",$1);
          }
          ptratts TAGC {
                WRITE("+R",$4);
                free($1); free($4);
        }
        ;

The only pointer attribute we need to take care of is target: the others can all be handled by the existing avsmisc code.

< 75 Pointers to nested scraps ver. 0.04 74 (cont'd) > =

 
ptratts : /* nil */
        | ptratts avstarget
        | ptratts avsmisc
        ;
avstarget : SPACE KWTARGET VI attvalue {
            WRATT("+Ptr/Ref","target",$4);
            WRITEATT($1,$2,$3,$4);
            /* do something with the value */
            free($1); free($2); free($3); /* don't free $4 */
        }
        ;

A <ref> element has content, which in TEI-tagged documents can contain arbitrary phrase-level elements. That's not too hard to handle, but life is complicated a little bit by the possibility of a nested <ref> element. We have to recognize nested <ref>s, so we can be sure, when we see an end-tag for a <ref>, whether it's the one we need or not.

< 76 Ref-element content > =

 
refcontent : /* nil */
        | refcontent STRING   { WRITE("+R",$2); free($2);
        }
        < Comment in reference content 90 >
        | refcontent STAGMISC { WRITE("+R",$2); free($2);
        }
        | refcontent ETAGMISC { WRITE("+R",$2); free($2);
        }
        | refcontent STAGREF  { WRITE("+R",$2);
          }
          refcontent
          ETAGREF {
                WRITE("+R",$5);
                free($2); free($5);
        }
        ;

Having added these rules to the grammar, we have to add appropriate declarations for the new terminal token types:

< 77 Yacc/Lex terminal token types 29 (cont'd) > =

 
%token GIPTR GIREF STAGREF ETAGREF KWTARGET
%token STAGMISC ETAGMISC

And they need to be declared as strings:

< 78 Data types of terminals and non-terminals 30 (cont'd) > =

 
%type <s> GIPTR GIREF STAGREF ETAGREF KWTARGET
%type <s> STAGMISC ETAGMISC

In the lexical scanner, we need to add two new modes: PA for recognizing the pointer attributes (mostly just target), and RC for recognizing the content of <ref> elements.

< 79 Lexical scanner start states 34 (cont'd) > =

 
%x PA RC

Recognizing the relevant start- and end-tags is easier if we define them as patterns:

< 80 Lex macro definitions 37 (cont'd) > =

 
RefStag     {Stago}"ref"({AVS}*)({S}*){Tagc}
RefEtag     {Etago}"ref"({S}*){Tagc}

And, finally, we need to add the recognition rules themselves. First, to recognize pointer elements within the scrap, and start the pointer-attribute recognition mode:

< 81 Rules for lex 35 (cont'd) > =

 
<SC>{Stago}"ptr"        { BEGIN PA; SAVEVALUE; return(GIPTR); }
<SC>{Stago}"ref"        { BEGIN PA;
                          SAVEVALUE;
                          ++fRef;
                          return(GIREF); }

The variable fRef is described below.

The pointer attributes are similar to those for scraps. Note that some rules (e.g. for S, Name, etc.) are the same as for mode SA.

< 82 Rules for lex 35 (cont'd) > =

 
<PA>{S}+                { SAVEVALUE; NLCHECK; return(SPACE);   }
<PA>"target"            { SAVEVALUE; return(KWTARGET); }
<PA>{Name}              { SAVEVALUE; return(NAME);    }
<PA>{Vi}                { SAVEVALUE; NLCHECK; return(VI);      }
<PA>{Litval}            { SAVEVALUE; return(LIT); }
<PA>{Litaval}           { SAVEVALUE; return(LITA); }
<PA>{Number}            { SAVEVALUE; return(NUMBER);  }
<PA>{Numtok}            { SAVEVALUE; return(NUTOKEN); }

When we encounter a tagc delimiter while recognizing pointer attributes, we have to decide whether to go into scrap-contents mode, or into reference-contents mode. We keep a flag, fRef, showing the number of <ref> elements currently open; if it's greater than zero, then we're recognizing a <ref> start-tag, not a <ptr> start-tag.

< 83 Global variables and function prototypes for lex 22 (cont'd) > =

 
int fRef = 0;

< 84 Rules for lex 35 (cont'd) > =

 
<PA>{S}*{Tagc}{STS}     { if (fRef > 0) {
                             BEGIN RC;
                          } else {
                             BEGIN SC;
                          }
                          SAVEVALUE;
                          NLCHECK;
                          return(TAGC);
                        }

In reference-content mode, the only nested elements we need to recognize are recursively nested <ref> elements; otherwise, it's just the same as the initial mode (mode 0):

< 85 Rules for lex 35 (cont'd) > =

 
<RC>{RefStag}           { SAVEVALUE;
                          NLCHECK;
                          ++fRef;
                          return(STAGREF);
                        }
<RC>{RefEtag}           { SAVEVALUE;
                          NLCHECK;
                          --fRef;
                          if (fRef==0) { BEGIN SC; }
                          return(ETAGREF);
                        }
    /* within refs, arbitrary start- and end-tags are possible */
    /* but we don't have to recognize them all.  Just REFs. */
    /* Which we just did.  So we act just as we do in mode 0. */
<RC>[^\n\r<]+           { SAVEVALUE; return(STRING); }
<RC>\n                  { SAVEVALUE; cLinecount++; return(STRING); }
<RC>\r                  { ; }
<RC>{Stago}             { SAVEVALUE; return(STRING); }

2.5 Version 0.05: Linked List of Scrap Contents

Version 0.05 of the program will build a linked list for each scrap, with a list item for each portion of the scrap returned to yacc from lex, so the scraps can be written out to files. The linked list must distinguish character data in the scrap, pointer elements, and SGML comments. (Full SGML support is not going to be built in until later, but we need a convenient way to allow the user to embed standard C header files, which requires that we handle empty comments correctly.)

We define our linked list of strings with the stringrec structure: each string has a type (literal, pointer, or SGML comment -- perhaps it would be more useful to put any type of SGML markup there, including unrecognized start-tags and end-tags. Hmmm.). We also declare a function for making string records.

< 86 Define string-record structure > =

 
enum scrapfragtype { kwLIT, kwPTR, kwCOM };
struct stringrec {
        enum scrapfragtype kwType;
        char *pcS;
        struct stringrec * psrNext;
};

< 87 Function prototypes 9 (cont'd) > =

 
struct stringrec * PsrMakeStringrec(char *pc, enum scrapfragtype kw);

The function PsrMakeStringrec() is actually pretty simple:

< 88 Declare miscellaneous C functions 15 (cont'd) > =

 
struct stringrec * PsrMakeStringrec(char *pc, enum scrapfragtype kw) {
   struct stringrec * p;
   p = (struct stringrec *) malloc(sizeof(struct stringrec));
   if (p == NULL) {
        yyerror("Failure allocating memory for scrap fragment.");
        return NULL;  /* don't fall off the end without a value */
   }
   p->kwType = kw;
   p->pcS = pc;
   p->psrNext = NULL;
   return p;
}

All of our changes come in the yacc grammar for scrap contents.

< 89 Handle content of scrap, ver. 0.05 > =

 
scrapcontent : /* nil */ {
            WRITE("+S","");
            /* fprintf(stdout,"+S %d:\n",cLinecount); */
            /* start linked list of scrap bits (with null string) */
            { struct stringrec * psrTemp;
              psrTemp = PsrMakeStringrec(nullstring,kwLIT);
              pscrapCur->psrFirst    = psrTemp;
              pscrapCur->psrLast     = psrTemp;
            }
        }
        | scrapcontent STRING {
            WRITE("+S",$2);
            /* make a new stringrec, add it to the list */
            { struct stringrec * psrTemp;
              psrTemp = PsrMakeStringrec($2,kwLIT);
              pscrapCur->psrLast->psrNext = psrTemp;
              pscrapCur->psrLast          = psrTemp;
            }
        }
        | scrapcontent ptrelement {
            { struct stringrec * psrTemp;
              psrTemp = PsrMakeStringrec($2,kwPTR);
              pscrapCur->psrLast->psrNext = psrTemp;
              pscrapCur->psrLast          = psrTemp;
            }
        }
        | scrapcontent COMMENT {
            WRITE("+S",$2);
            { struct stringrec * psrTemp;
              psrTemp = PsrMakeStringrec($2,kwCOM);
              pscrapCur->psrLast->psrNext = psrTemp;
              pscrapCur->psrLast          = psrTemp;
            }
        }
        ;
ptrelement : ptr
        | ref
        ;
< Pointers to nested scraps 91 >
< Ref-element content 76 >

Apart from the code to handle the construction of the lists, note the addition of COMMENT as a possible type of scrap content. We need to make the same addition to reference content:

< 90 Comment in reference content > =

 
        | refcontent COMMENT  { WRITE("+R",$2); free($2);
        }

The pointer elements have only one important piece of information for us: the id pointed at by their target attribute. We process them exactly as before, but we give the yacc non-terminals ref and ptr the value of their relevant id.

< 91 Pointers to nested scraps > =

 
ref     : GIREF {
                WRITE("+R",$1);
          }
          ptratts TAGC {
                WRITE("+R",$4);
          }
          refcontent ETAGREF {
                WRITE("+R",$7);
                free($1); free($4); free($7);
                $$ = $3;
        }
        ;
ptr     : GIPTR {
                WRITE("+P",$1);
          }
          ptratts TAGC {
                WRITE("+P",$4);
                free($1); free($4);
                $$ = $3;
        }
        ;

Our only change to the pointer attribute handling is to make the non-terminal ptratts remember the value of the target attribute.

< 92 Pointers to nested scraps 91 (cont'd) > =

 
ptratts : /* nil */         { $$ = nullstring; }
        | ptratts avstarget { $$ = $2;         }
        | ptratts avsmisc
        ;
avstarget : SPACE KWTARGET VI attvalue {
            WRATT("+X","target",$4);
            WRITEATT($1,$2,$3,$4);
            free($1); free($2); free($3); /* don't free $4 */
            $$ = SUpperS(SUnquoteS($4));
        }
        ;

These changes to the grammar require us to inform yacc that ptratts have string-typed values:

< 93 Data types of terminals and non-terminals 30 (cont'd) > =

 
%type <s> ref ptr ptratts refcontent avstarget ptrelement

Now that we have these linked lists attached to each scrap, we need to dump them out at the end of the run, so we are sure we got it right.

< 94 Describe scrap > =

 
    { struct stringrec * psrTemp;
      fprintf(stdout,"\nScrap %d on line %d, id = %s\n",
              i,rgScraps[i].cLine,rgScraps[i].pcId);
      for (psrTemp = rgScraps[i].psrFirst;
           psrTemp != NULL;
           psrTemp = psrTemp->psrNext) {
        if (strcmp(psrTemp->pcS,"\n") != 0) {
          fprintf(stdout,"%s:  %s\n",
                  (psrTemp->kwType == kwLIT) ? "scrap"
                      : ((psrTemp->kwType == kwPTR) ? "ptr" : "com"),
                  psrTemp->pcS);
        }
      }
    }

Version 0.05 makes one change to the lexical scanner. We now recognize SGML comments in scrap contents (where they may be needed to allow statements like "#include <stdio.h>" in C), and outside scraps (where they may be commenting out a scrap we don't want to see processed just now).

< 95 Lex macro definitions 37 (cont'd) > =

 
mdo           "<!"
mdc           ">"
com           "--"
comchar       [^-]
comstring     ({comchar}*("-"{comchar}+)*)
comdecl       {mdo}{com}{comstring}{com}{mdc}
emptycomment  {mdo}{mdc}

Inside scrap content or reference content, these are recognized as comments; outside, they are just strings. (We've accomplished what we want, namely suppressing the recognition of scrap start-tags inside the comment, by recognizing the comment as a whole. We don't need to distinguish comments from other strings.)

< 96 Recognize SGML comments > =

 
<SC,RC>{comdecl}        { SAVEVALUE; NLCHECK; return(COMMENT); }
<SC,RC>{emptycomment}   { SAVEVALUE; return(COMMENT); }
{comdecl}               { SAVEVALUE; NLCHECK; return(STRING); }
{emptycomment}          { SAVEVALUE; return(STRING); }

For full comments, we need the macro NLCHECK to ensure that the line number is kept current; empty comments cannot contain newlines, so the macro isn't needed.
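
For instance (illustrative examples, not from the original):

<!-- include stdio here -->    matches comdecl
<!>                            matches emptycomment
<!-- one -- two -->            matches neither, since comstring may
                               not contain a "--" pair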

We have to add a new token type to yacc and lex for this. If we forget, yacc will remind us. I did; it did.

< 97 Yacc/Lex terminal token types 29 (cont'd) > =

 
%token COMMENT

< 98 Data types of terminals and non-terminals 30 (cont'd) > =

 
%type <s> COMMENT

2.6 Version 0.06: Which Scraps are in Which File?

Version 0.06 will keep a linked list of the top-level scraps which belong in each file: i.e. those which name the file in their file attribute.

The main changes come in the treatment of the file attribute. For new files, our basic work is simple: claim the next free slot in the file array, record the file name there, and attach a one-element scrap list pointing at the current scrap.

This is just error-prone enough that I found it very helpful to smother the code with assertions, which I am leaving in for now.

< 99 If file not found, add it > =

 
              /* assert:  new file, this is first scrap */
              pfileCur = rgFiles + macFiles;
              macFiles++;
              /* assert:  *pfileCur is empty, i.e. random garbage */
              pfileCur->pcFile = $4;
              { struct scraplist *pslTemp = PslMakeScraplist(pscrapCur);

                assert(pslTemp->pslNext == NULL);
                assert(pslTemp->pscrapThis != NULL);
                assert(pslTemp->pscrapThis == pscrapCur);
                assert(strcmp(pslTemp->pscrapThis->pcFile,$4) == 0);

                pfileCur->pslScraps = pslTemp;
              }

              assert(&(rgFiles[macFiles - 1]) == pfileCur);
              assert(strcmp(rgFiles[macFiles - 1].pcFile,$4) == 0);
              assert(strcmp(pfileCur->pcFile,$4) == 0);
              assert(strcmp(pfileCur->pslScraps->pscrapThis->pcFile,$4)==0);
              assert(rgFiles[macFiles - 1].pslScraps->pslNext == NULL);

If we found a file by this name in the file array, we need to walk to the end of its scrap list and append a new entry for the current scrap.

Again, bugs became much easier to find when I added a lot of obvious assertions to the code and discovered that they weren't always as obviously true as I had thought.

< 100 If file is found, add scrap to its list > =

 
              /* assert:  file exists, this is not first scrap */
              --iT;
              assert(strcmp(rgFiles[iT].pcFile,$4) == 0);
              assert(rgFiles[iT].pslScraps != NULL);
              assert(rgFiles[iT].pslScraps->pscrapThis != NULL);
              assert(strcmp(rgFiles[iT].pslScraps->pscrapThis->pcFile,$4) == 0);
              { struct scraplist * pslTemp = rgFiles[iT].pslScraps;
                while (pslTemp->pslNext != NULL) {
                  assert(strcmp(pslTemp->pscrapThis->pcFile,$4) == 0);
                  pslTemp = pslTemp->pslNext;
                }
                assert(pslTemp != NULL);
                assert(pslTemp->pslNext == NULL);
                pslTemp->pslNext = PslMakeScraplist(pscrapCur);
              }

< 101 Add or update continuations list > =

 
          { struct scraprecord * pscrapTemp;
            struct scraplist   * pslTemp;
            pscrapTemp = PscrapFindScrap($4);
            if (pscrapTemp == NULL) {
              MsgKwS(msgERROR,"PREV attribute points to non-existent "
                   "scrap (%s)",$4);
              /*
              yyerror("Prev attribute points to non-existent scrap");
              */
            } else {
              assert(pscrapTemp != NULL);
              assert(strcmp(pscrapTemp->pcId,$4) == 0);
              if (pscrapTemp->pslContins == NULL) {
                 /* pslContins == NULL, this is first continuation */
                 pslTemp = PslMakeScraplist(pscrapCur);
                 pscrapTemp->pslContins = pslTemp;
              } else {
                 /* if pslContins != NULL, find last and add */
                 pslTemp = pscrapTemp->pslContins;
                 assert(pslTemp != NULL);
                 assert(pslTemp->pscrapThis != NULL);
                 assert(strcmp(pslTemp->pscrapThis->pcPrev,$4) == 0);
                 while (pslTemp->pslNext != NULL) {
                   assert(strcmp(pslTemp->pscrapThis->pcPrev,$4) == 0);
                   pslTemp = pslTemp->pslNext;
                }
                assert(pslTemp != NULL);
                assert(pslTemp->pslNext == NULL);
                pslTemp->pslNext = PslMakeScraplist(pscrapCur);
              }
            }
          }

We want an auxiliary function to initialize a scrap list: PslMakeScraplist().

< 102 Declare miscellaneous C functions 15 (cont'd) > =

 
struct scraplist * PslMakeScraplist(struct scraprecord * pscrapT) {
  struct scraplist * pslNew;
  pslNew = (struct scraplist *) malloc(sizeof(struct scraplist));
  if (pslNew == NULL) {
    yyerror("Failure allocating memory for scraplist.");
  } else {
    pslNew->pscrapThis = pscrapT;
    pslNew->pslNext = NULL;
  }
  return pslNew;
}

At the end of the main routine, we write out each scrap belonging to the file, recursively including each embedded scrap.

< 103 Describe file > =

 
  { struct scraplist * pslTemp;
    fprintf(stdout,"\nFile %d:  %s.  Scraps are:\n",
            i,rgFiles[i].pcFile);
    for (pslTemp = rgFiles[i].pslScraps;
         pslTemp != NULL;
         pslTemp = pslTemp->pslNext) {
      fprintf(stdout,"     %s (line %d)\n",
              pslTemp->pscrapThis->pcId,
              pslTemp->pscrapThis->cLine);
    } /* for scraps */
    fprintf(stdout,"\nOr again:\n\n");
    for (pslTemp = rgFiles[i].pslScraps;
         pslTemp != NULL;
         pslTemp = pslTemp->pslNext) {
           WriteScrapFile(pslTemp->pscrapThis,stdout);
    } /* for scraps */
  }

The auxiliary function PscrapFindScrap() finds a (pointer to a) scrap, given its SGML identifier.

< 104 Declare miscellaneous C functions 15 (cont'd) > =

 
struct scraprecord * PscrapFindScrap(char * pcKey) {
  struct scraprecord * psrFound;
  int i;
  int fFound = 0;

  if ((pcKey == NULL) || (strcmp(pcKey,"") == 0)) return NULL;

  for (i = 0; ((fFound==0) && (i < macScraps)); ++i) {
    if (strcmp(rgScraps[i].pcId,pcKey) == 0) {
      fFound = 1;
    } /* if */
  } /* for */
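  /* the loop increments i once more after fFound is set,
     so the matching record is rgScraps[i - 1] */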
  if (fFound) return rgScraps + i - 1; else return NULL;
}

The function WriteScrapFile() writes a scrap to a file, together with embedded scraps and continuations. Before each continuation, it inserts a newline, to ensure that each continuation begins on a new line. (For further discussion, see the appendix Notes on Newline Handling in Scraps.)

< 105 Declare miscellaneous C functions 15 (cont'd) > =

 
void WriteScrapFile(struct scraprecord * pscrapCur, FILE * pfile) {
  struct stringrec * psrTemp;
  struct scraplist * pslTemp;
  struct scraprecord * pscrapTemp;

  /* write out the contents and embedded scraps */
  for (psrTemp = pscrapCur->psrFirst;
       psrTemp != NULL;
       psrTemp = psrTemp->psrNext) {
         if (psrTemp->kwType == kwLIT) {
           fprintf(pfile,"%s",psrTemp->pcS);
         } else if (psrTemp->kwType == kwPTR) {
           WriteScrapidFile(psrTemp->pcS,pfile);
         } else if (psrTemp->kwType == kwCOM) {
           ; /* do absolutely nothing */
         } else {
           yyerror("Unknown string-record type,"
                   " neither LIT nor PTR nor COM");
         }
  }
  /* write out the continuations */
  for (pslTemp = pscrapCur->pslContins;
       pslTemp != NULL;
       pslTemp = pslTemp->pslNext) {
      fprintf(pfile,"\n");
      WriteScrapFile(pslTemp->pscrapThis,pfile);
  }
}

The function WriteScrapidFile() finds a scrap with a given identifier and writes that scrap to a file.

< 106 Declare miscellaneous C functions 15 (cont'd) > =

 
void WriteScrapidFile(char * pcId, FILE * pfile) {
  struct scraprecord * pscrapCur;
  pscrapCur = PscrapFindScrap(pcId);
  if (pscrapCur == NULL) {
    return;
  } else {
    WriteScrapFile(pscrapCur,pfile);
  }
}

We need prototypes for all these functions:

< 107 Function prototypes 9 (cont'd) > =

 
struct scraplist * PslMakeScraplist(struct scraprecord * pscrapT);
struct scraprecord * PscrapFindScrap(char * pcKey);
void WriteScrapFile(struct scraprecord * pscrapCur, FILE * pfile);
void WriteScrapidFile(char * pcId, FILE * pfile);

Version 0.06 makes no changes to the lexical scanner.

2.7 Version 0.07: Write Input to Output

Version 0.07 parses everything in the same way as version 0.06; the only difference is that it writes its input to its output without the additional debugging information used during development. At the moment, there is no point in writing any output at all; we are just getting ready for work on later versions, which will write a `normalized' form of their input: <ptr> elements will be changed to <ref> elements, scraps will be wrapped in scrap wrappers, etc.

The only thing we have to do is replace the old code at the end of main() for writing scraps out to files with new code that actually opens the files. We walk through the array of file records, writing out all the scraps which belong to each.

< 108 Write scraps out to files > =

 
  if (macFiles == 0) {
    fprintf(stderr,"No output files named in the web.\n");
  }
  for (i = 0; i < macFiles; i++) {
    < Write rgFiles[i] to disk 109 >
  } /* for files */
  if (macScraps == 0) {
    fprintf(stderr,"No scraps encountered in web.\n");
  }
  /* here we would go through scraps, checking that all are reached */

If a file has not changed, of course, writing it out again will change the system date on the file, misleading the programmer and -- more importantly for some purposes -- misleading make into recompiling it. We solve this problem with an idea, and some code, borrowed from Preston Briggs, who does this in his nuweb system. (Thanks!)

< 109 Write rgFiles[i] to disk > =

 
    /* version 0.07:  write out to file:  open file */
    { FILE * pfTempfile;
      char * pcTempname;

      < Open temporary file as pfTempfile 110 >
      < Write scraps for this file out to temporary file 112 >
      < Close the temporary file 111 >
      < If file changed, rename temp file, if not, delete it 113 >
    } /* block to write out scraps */

Opening the file is simple; we call the standard function tmpnam to generate a name guaranteed not to collide with any existing file, and open it. (The temporary file always seems to land in the current directory; this may cause inconvenience in some contexts.)

< 110 Open temporary file as pfTempfile > =

 
      pcTempname = tmpnam(NULL);
      pfTempfile = fopen(pcTempname,"w");
      if (pfTempfile == NULL) {
        fprintf(stderr,"Could not open temp file %s for %s\n",
                pcTempname,rgFiles[i].pcFile);
      }
      assert(pfTempfile != NULL);

Closing the file takes less trouble:

< 111 Close the temporary file > =

 
      fclose(pfTempfile);

Once the file is open, we write out each scrap in its scrap list, using the function WriteScrapFile, which has already been written. After each scrap, we append a newline, to ensure that the next scrap in the series will begin on a new line. (For further discussion, see the appendix Notes on Newline Handling in Scraps.)

< 112 Write scraps for this file out to temporary file > =

 
      { struct scraplist * pslTemp;
        for (pslTemp = rgFiles[i].pslScraps;
             pslTemp != NULL;
             pslTemp = pslTemp->pslNext) {
               WriteScrapFile(pslTemp->pscrapThis,pfTempfile);
               fprintf(pfTempfile,"\n");
        } /* for scraps */
      }

Once the temp file is written out and closed, we want to test to see whether it has changed. If it has changed, we delete the old version (in a later revision of the program, we should probably rename the old version as a backup file) and give the temporary file the proper name. If the old version of the file and the new version in the temp file are identical, we can just delete the temp file and the old one remains on disk, unchanged. If there is no old version, we need only rename the temp file with the correct name.

< 113 If file changed, rename temp file, if not, delete it > =

 
      {
        FILE * pfOldfile = fopen(rgFiles[i].pcFile,"r");
        if (pfOldfile == NULL) {
          rename(pcTempname,rgFiles[i].pcFile);
        } else {
          /* old file exists:  compare! */
          < Compare the old file with the temp file 114 >

          if (cX == cY) {
            /* files are identical */
            remove(pcTempname);
          } else {
            remove(rgFiles[i].pcFile);
            rename(pcTempname,rgFiles[i].pcFile);
          } /* if cX != cY */
        } /* if pfOldfile != NULL */
      } /* block to compare files */

The code to compare files is very simple, because we only need to know whether they are the same. We do not need to call diff or reimplement it: as soon as we find a difference we can stop right away, with no need to resynchronize the two files.

< 114 Compare the old file with the temp file > =

 
          int cX, cY;
          pfTempfile = fopen(pcTempname,"r");
          do {
            cX = getc(pfOldfile);
            cY = getc(pfTempfile);
          } while ((cX == cY) && (cX != EOF));
          fclose(pfOldfile);
          fclose(pfTempfile);

Version 0.07 makes no changes to the lexical scanner.

Our final change is to redefine the WRITE and WRATT macros to make the program write its input to its output without change. The new version of the latter macro ignores its arguments entirely and relies on the fact that it is called always and only from rules with four arguments which need to be written out in order. Hmm. That didn't work, so we've added a new macro, WRITEATT, which takes four arguments and writes them out; the old macro we redefine as a null statement.

< 115 Macro and type definitions for yacc 6 (cont'd) > =

 
#define WRITE(A,B) fprintf(stdout,"%s",B)
#define WRATT(A,B,C) ;
#define WRITEATT(A,B,C,D) fprintf(stdout,"%s%s%s%s",A,B,C,D);

3 Minor Enhancements and Conveniences

After completing version 0.1, we add some simple enhancements to make the program a little easier to use during development, and to provide hooks for later changes. The most important are described in the sections that follow.

3.1 Version 0.11: Command-Line Arguments

3.2 Version 0.12: Change Checking on Main Input

3.3 Version 0.15: Skip Lists and Dynamic Storage

4 Version 0.2: Normalizing Scrap References

In version 0.2, the program will begin normalizing its input instead of writing it out unchanged, most notably by translating <ptr> elements into <ref> elements and performing similar kinds of text mirroring. The result will be an Sweb file (still legal as input to the program) which supports nicer display by browsers like Panorama which don't perform text mirroring, or nicer translation by transducers like tf, which don't handle forward references easily.
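
The translation itself has not been written yet, but its core might look something like the sketch below. The helper name WriteRefForPtr() and the pcName member (recording the scrap's name attribute) are assumptions made here for illustration, not part of the present program; the sketch simply looks up the target scrap and emits a <ref> element carrying the scrap's name as its content.

void WriteRefForPtr(char * pcTarget, FILE * pfile) {
  /* sketch only:  emit a ref element in place of a ptr element */
  struct scraprecord * pscrapT = PscrapFindScrap(pcTarget);
  if (pscrapT == NULL) {
    MsgKwS(msgERROR,"TARGET attribute points to non-existent "
           "scrap (%s)",pcTarget);
    return;
  }
  /* pcName is hypothetical:  the scrap's NAME attribute value */
  fprintf(pfile,"<ref target=%s>%s</ref>",pcTarget,pscrapT->pcName);
}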

5 Version 0.3: Recapitulations

In version 0.3, the program will support the <recap> element, which allows the document to review the contents of scraps already seen. We'll also add some cosmetic changes to cause the program to indent code properly in both <recap> and the tangled output files: if the reference to a scrap was indented (e.g. because it was in an if statement), then the scrap itself should be indented in the tangled file. This idea comes from nuweb and it makes for much nicer output files.
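
As a rough illustration of the nuweb approach (a sketch only, with names invented for the purpose): at the moment a <ptr> is scanned, record the column in which it occurs; when tangling, write that many blanks after every newline inside the embedded scrap, so that each of its lines lines up under the reference.

/* Sketch only:  write the string pcS to pfile, inserting cIndent
   blanks after each newline, so that every line of an embedded
   scrap is indented as far as the <ptr> which referenced it */
void WriteIndented(char * pcS, int cIndent, FILE * pfile) {
  char * pc;
  int i;
  for (pc = pcS; *pc != '\0'; pc++) {
    putc(*pc,pfile);
    if (*pc == '\n')
      for (i = 0; i < cIndent; i++)
        putc(' ',pfile);
  }
}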

5.1 The Recapitulation Tag

[To be supplied.]

5.2 Version 0.35: Indentation of Code

[To be supplied.]

6 Version 0.4: Multiple Versions of Code

In version 0.4, the program will begin to support the specification of multiple versions of a program in the same input file; this will require the creation of equivalence classes for scraps, modifications to the following of pointers and prev attribute values, and handling for a command-line flag specifying which version to generate. (There should probably also be a command-line flag indicating that the only thing to be done is to give a list of the known versions.)
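
One piece of this is easy to sketch now, under stated assumptions: suppose each scrap record carried a (hypothetical) blank-delimited pcVersions string naming the versions it belongs to, and the command line supplied a global pcVersionWanted. A predicate for the tangling code might then look like this:

extern char * pcVersionWanted;  /* hypothetical:  set from the command line */

/* Sketch only:  does this scrap belong to the version being built? */
int FScrapInVersion(struct scraprecord * pscrapT) {
  char * pc;
  size_t cch = strlen(pcVersionWanted);
  if (pscrapT->pcVersions == NULL)       /* pcVersions is hypothetical */
    return 1;                            /* unmarked scraps are in every version */
  for (pc = pscrapT->pcVersions; *pc != '\0'; ) {
    if (strncmp(pc,pcVersionWanted,cch) == 0
        && (pc[cch] == ' ' || pc[cch] == '\0'))
      return 1;
    while (*pc != '\0' && *pc != ' ') pc++;   /* skip this version name */
    while (*pc == ' ') pc++;                  /* and the following blanks */
  }
  return 0;
}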

7 Version 0.5: Fuller SGML Support

In version 0.5 the program will begin to handle SGML properly. If not all constructs, then at least comments, marked sections, and external entities will be recognized and processed properly. (I hope, before I get to this version of Sweb, to have written a conforming or mostly conforming parser for SGML in yacc and lex, which should make it a lot easier to ensure that Sweb doesn't unintentionally do anything non-conforming.) Outside of scraps, comments should be passed through without change; inside scraps, they should be passed through weave without change (and without recognition of ptr tags inside them), but omitted without trace in the tangle process. External entities should use the SGML Open catalog, and should be rewritten only if changed. DTD files, though opened and read, should probably not be written out.
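
The tangle side is actually ready for part of this already: WriteScrapFile silently drops string records of type kwCOM. What is missing is reliable recognition in the scanner. As a deliberately naive first sketch (my assumption, not the final design), a lex pattern for the common case of a comment declaration containing a single comment might be:

"<!--"([^-]|"-"[^-])*"-->"  { /* record the text with type kwCOM:
                                 weave echoes it, tangle drops it */ }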

7.1 Version 0.51: Comments

7.2 Version 0.52: Marked Sections

7.3 Version 0.53: External Entities

7.4 Version 0.54: Support for SGML Open Entity Catalog

8 Version 0.6: Nuweb-Style Manual Indexing

In version 0.6, Sweb will begin generating indices of files, macros, and identifiers. The latter will have to be tagged by hand, in the same way as required by nuweb. This will involve either accepting <index> elements within scraps or the extension of the Sweb tag set by adding a scrap wrapper, containing the scrap together with its associated indexing elements as a single logical unit.

It's not clear how the logical scrap unit should interact with multiple versions.

9 Version 0.7: Generating Reference Documentation

In version 0.7, we extend the short indices to begin generating fuller reference documentation in the style of the Odd system, which generated the reference material at the back of the TEI Guidelines. This will involve identifying some parts of the running prose as containing the canonical short descriptions of objects formally defined in the scraps (be they SGML elements and their attributes, functions in a programming language, or something else). I also expect to write a translator from Sweb into LaTeX for the production of printed documentation.

10 Version 0.8: Sample Language-Specific Processor

In version 0.8, I hope to define a simple free-standing processor (probably as a tf or dw filter) to do special processing on scraps written in a particular language, probably SGML. This will be an illustration of how Sweb processing can include language-specific processing.

11 Version 0.9: Prefix-String Matching

In version 0.9, I expect to add support for prefix-string matching in scrap references, unless I've decided by then that no one likely to use Sweb will be likely to care.
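
If and when it is added, the lookup itself is not hard; here is a minimal sketch under the obvious interpretation (a reference matches if it is a prefix of exactly one scrap identifier). The name PscrapFindPrefix() is invented for the illustration:

/* Sketch only:  find the unique scrap whose ID begins with pcKey.
   Returns NULL if no scrap matches, or if the prefix is ambiguous. */
struct scraprecord * PscrapFindPrefix(char * pcKey) {
  struct scraprecord * psrFound = NULL;
  size_t cch;
  int i;

  if ((pcKey == NULL) || (strcmp(pcKey,"") == 0)) return NULL;
  cch = strlen(pcKey);

  for (i = 0; i < macScraps; ++i) {
    if (strncmp(rgScraps[i].pcId,pcKey,cch) == 0) {
      if (psrFound != NULL) return NULL;  /* second match:  ambiguous */
      psrFound = rgScraps + i;
    }
  }
  return psrFound;
}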


Notes on Newline Handling in Scraps

The newline rules of ISO 8879 are sometimes confusing, and when the roles of comments, processing instructions, subelements which are legal only because of an inclusion exception, etc., are all taken into account, they seem to me to become positively baroque. Sweb attempts to implement at least the most prominent of the newline rules, though, since ignoring the parts of a standard you don't like seems to be a recipe for problems down the road.

At the moment, we implement two basic rules: a newline immediately following a start-tag is ignored, and a newline immediately preceding an end-tag is ignored.

By ignored, I mean the line break is not written out when the scrap is written out to its output file. It is, however, written to the standard output. After a start-tag, the term `immediately' is interpreted loosely: blanks and tabs between the start-tag and the newline are ignored. This is, I think, not conforming, but since some editors (including the one I am using at the moment) insert blanks at the end of each line, it seems a necessary concession to reality. It will also make things less confusing, I hope, for people using ASCII editors.

Even these two rules are enough to confuse a programmer trying to get a C program to look right; at least, they were enough to confuse me. So the following paragraphs offer some simple rules for the common cases of scrap construction, and for avoiding problems.

Scraps go into output files under three circumstances: as the first scrap of an output file (a scrap naming the file in its file attribute), as the continuation of an earlier scrap (via its prev attribute), or embedded in another scrap by a <ptr> or <ref> reference.

In the first and second cases, it seems plausible to assume that each scrap is a coherent sequence of statements in the programming language, and there should be a newline between any two scraps being written out to the same output file. The code in WriteScrapFile, or the code which calls it, should be modified to ensure that this happens.

In the third case, our ultimate goal is to support the kind of pretty-printing used in nuweb: if the pointer to a scrap is indented, each line in the scrap will be indented the same amount in the output file. This allows the scraps themselves to be written flush left, which makes a lot of sense. Scraps may be embedded in places where no newline is desirable before or after the scrap -- or, put another way, if newlines are required or desired before or after the scrap, it's better to place them around the pointer, so the programmer has complete control over whether and where they occur. The scrap itself won't need to know, then, whether it should or should not begin or end with a newline, and can be written in the normal way, with single newlines (which will be suppressed by the SGML parser) after the start-tag and before the end-tag.

So the only place where particular care need be taken is before and after scrap-embedding <ptr> and <ref> elements. In particular, the pointer elements are apt to cause trouble, because a newline immediately following them will be suppressed. To prevent it from being suppressed, use an empty SGML comment between the tagc and the newline, thus:

 
<p>The initial part of the yacc file contains C declarations
and the like.  We know we'll need at least the following:
<<!>scrap id=yaccCprolog name='C declarations, etc.'>
<<!>ptr target=yaccCheaders ><!>
<<!>ptr target=yaccCmacros  ><!>
<<!>ptr target=yaccCtypes   ><!>
<<!>ptr target=yaccCglobals ><!>
<<!>ptr target=yaccCdecls   ><!>
<<!>/scrap>
And let's hope that serious study of 8879 doesn't reveal that newlines immediately following empty comments should also be suppressed.