[This local archive copy mirrored from the canonical site: http://www.uic.edu/~cmsmcq/tech/sweb/sweb.html; links may not have complete integrity, so use the canonical document at this URL if possible. Referenced by permission of the author.]

SWEB: an SGML Tag Set for Literate Programming

C. M. Sperberg-McQueen

Version 0.5

25 September 1993

August 1994

Lightly revised
March 1995

revised and extended
January - March 1996

19 March 1996

This incomplete, unpublished document is distributed privately for comment by friends and colleagues; it is not now a formal publication and should not be quoted in published material.

Table of Contents


This document describes an SGML tag set for literate programming. First, markup is provided for embedding fragments of programming-language code into SGML documents in arbitrary order, to be recombined before compilation into the order required by the programming language's syntax. Next, tags are defined for identifiers, keywords, code fragments, and literal values occurring as phrase-level elements in the prose documentation. Finally, tags for indexing and for a general structure for reference documentation (alphabetical lists of functions and identifiers, etc.) are defined. For each type of markup, the document gives examples and describes how the markup should be processed by conventional literate-programming weave and tangle processes.

Revisions still needed in this document:

Sweb is a simple SGML tag set for simple literate programming. It is designed to be used as an added tag set in conjunction with the base tag set for prose defined by the Text Encoding Initiative (TEI). It could be combined equally well with other standard tag sets, such as the DocBook tag set developed by the Davenport Group of software vendors, or the AAP tag set developed by the Association of American Publishers and later revised as ISO 12083. I will assume the TEI DTD is being used as the base here, since I understand the TEI extension mechanisms best.[1]

The Sweb tag set includes just a few extra tags for use in the text of the documentation itself: they mark scraps of code and references to identifiers and code fragments in text. In addition, there are two tags which support the specification of multiple versions of a program in the same Sweb document, and there is a larger tag set for reference listings of the sort found in the second half of the TEI Guidelines, or in the reference manuals for programming-language function libraries. These parts of Sweb are discussed in turn, after an introduction which describes the rationale for Sweb itself.

1 Introduction

1.1 The Basics of Literate Programming

In `literate programming', a programmer constructs documentation for a program together with the program code itself, both kinds of material being included in a single document; by tying the program code intimately to the documentation, literate programming tools make it substantially easier to keep the program and the documentation consistent with each other. It is also more embarrassing if one fails to do so.

In the words of Donald Knuth, who formulated the idea and used it in the construction of his programs TeX and MetaFont:[2]

Let us change our traditional attitude to the construction of programs. Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.

Among the benefits attributed to literate programming by its practitioners and others are: better (and usually fuller) explanation of the program's organization and algorithm, better indexing and cross-indexing of identifiers, easier maintenance, and easier modification of the program by the original author and others.

In Knuth's Web system, and the several similar systems inspired by it, the programmer prepares a single document, called the web, which contains both fragments of program code and prose explanation of the program structure. The code fragments may be introduced in any order suited to their explanation; they do not need to obey the restrictions imposed by the programming language on the sequence of constructs. This freedom of ordering allows global variables to be introduced together with the code which manipulates them, and for the main program and its subroutines to be declared and described in whatever order most clarifies the exposition, even in programming languages which require a rhetorically less effective order for compilation.

The web is processed in two different ways: a weave processor reads the web and produces program documentation (typically in the form of a file in a formatting language such as TeX or troff); a tangle processor rearranges the scraps of program code in the order required by a compiler, expands macros if necessary, and produces a compilable program in the programming language. In some systems, e.g. nuweb, a single program both weaves and tangles the input. Different web systems use different document languages and different programming languages. Knuth's original system used TeX for documentation and Pascal for program code; other systems have been developed for other programming languages and (less frequently) other document languages. In Knuth's system and many others, the weave component typesets the program code; such systems must thus endow weave with knowledge of the programming language being used.

Some newer systems, notably Norman Ramsey's noweb and Preston Briggs's nuweb, are programming-language independent: they have no hard-coded knowledge of the programming languages used, and thus can be used to write literate programs in any programming language, or in several languages. Without knowledge of the programming-language syntax, of course, elaborate typesetting is not feasible; language-independent web systems typically eschew typesetting entirely, reproducing scraps in a monospaced font and following verbatim the layout of the source. Sweb follows the practice of these language-independent systems: it performs no typesetting of the source code and requires no particular programming language.

1.2 Literate Programming and SGML

Like literate programming tools, SGML systems are designed to make it possible to process a single input document in very different ways with different programs. The notion of creating an SGML-based tool for literate programming is thus an obvious one; Sweb is an attempt at such a tool. In discussing Sweb and similar tools, however, it will be useful to distinguish two areas of concern: first, the type of processing to be performed on the SGML-encoded webs, and as a separate issue the markup (i.e. the set of SGML elements) required to make that processing possible.

The processing to be performed has already been described in general terms: the user must be able to weave the web into well-formatted documentation, and to tangle it into executable programs. In the interests of generality, I will abstract away from the details of this processing and where possible formulate our markup language in declarative, rather than procedural terms. That is, I wish to follow the general principle of SGML, which is that process-specific information belongs in a style-sheet or other mechanism for controlling the process, and not in the SGML document.

So, first of all, this document defines an SGML tag set for literate programming. Documents tagged this way can be translated automatically into the specialized markup used by virtually any existing literate-programming tool, or processed directly by SGML-aware application software. They can also be operated upon directly by SGML-aware literate-programming tools.

Documents using these tags can be:

As a secondary matter, this document will also describe some aspects of the processing of documents marked up with these tags. Implementations of software to perform these processes will be described in separate documents; many details of processing will vary from implementation to implementation. At the time of writing, the only implementation is a partial implementation of tangle functionality using yacc and lex; weave functionality is provided by direct display of the Sweb document in an SGML browser such as Panorama or by translation into HTML for delivery on the World Wide Web.

2 Encoding Code Scraps

In literate programs, the program text itself is included in the document, typically in small pieces which I call scraps. Knuth refers to them as modules, but this term derives from the rather specialized text structure imposed on webs by his original system, and may mislead some readers into thinking that a program is more modular in construction than is actually so. So I prefer the informal term scrap, which I adopt from the nuweb system written by Preston Briggs of Rice University.

2.1 The Tags

Scraps of code in a document are encoded using the <scrap> element; they are embedded in other scraps using the standard TEI cross-reference elements <ref> and <ptr>:[3]

By convention, the scrap identified as `previous' should precede its continuations in the document --- but this is not formally required.

Like any TEI element, the <scrap> element can bear the global lang attribute, which in this case records the programming language of the scrap. Since programming languages are not listed in ISO 639, the values for programming languages must be improvised by the author.

For example, here is a scrap of code from Knuth and Levy's rewrite of the Unix utility wc:[4]

<scrap id=wrstat
       name='Write statistics for file'>
wc_print(which, char_count,word_count,line_count);
if (file_count)
   printf(" %s\n", *argv); /* not stdin */
   printf("\n");           /* stdin */

Scraps can be referred to from other scraps, using the standard TEI elements <ptr> (which uses the SGML identifier of the scrap being embedded as the value of its target attribute) or <ref> (which can use its target attribute, or which can use its content to match the name of the target scrap). Using the SGML ID/IDREF mechanism, Knuth and Levy's rewrite of wc would have the following top-level scrap; it has an empty string as its name value, since they make this top-level scrap anonymous.

<scrap name=''>
<ptr target=hdrfiles>
<ptr target=globals>
<ptr target=functns>
<ptr target=main>
The other scraps involved would have the following overall structure:
<scrap id=hdrfiles name='Header files to include'>
#include <<!>stdio.h>

<!-- ... -->

<scrap id=globals>
int status = OK; /* exit status of command, initially OK */
char *prog_name; /* who we are */

<!-- ... -->
<scrap id=functns>  <!-- ... --> </scrap>
<!-- ... -->
<scrap id=main>  <!-- ... --> </scrap>
<!-- ... -->

The formal declaration of <scrap> is as follows:

< 1 Code scraps > =

<!ENTITY  % model.scrap  '(#PCDATA | ptr | ref)*'               >
<!ELEMENT scrap         - -  (%model.scrap)                     >
<!ATTLIST scrap              %a.global
          name               CDATA               #IMPLIED
          file               CDATA               #IMPLIED
          version            IDREFS              #IMPLIED
          index              CDATA               'auto'         >

I use the entity model.scrap as the content model for <scrap> element in order to make it easier to customize if desired. It also makes it easier to see that the <scrap> and <recap> elements are declared identically.

2.2 Matching Continuations and Embedded Scraps

2.2.1 ID/IDREF Matching

The usual method of matching cross references with their targets in SGML is to use the SGML ID/IDREF mechanism: the target is given an ID attribute, which is named in an idref attribute in the cross reference. That method can be used to indicate the embedding and continuation of code scraps, thus:

2.2.2 String-Prefix Matching

Most literate programming tools inherit from Knuth's WEB system, however, the convention that code scraps and references to them can be matched by the names given to each. That is, a code fragment first referred to thus:[5]

@<Program to print the first thousand
prime numbers@>
can later be referred to (or, as here, defined) thus:
@<Program to print...@>=

In other words, any name ending in an ellipsis is considered to match any name beginning the same way. (The prefix given before the ellipsis must be unique.) This method, which I call the string-prefix match, has the convenience of allowing the web to be slightly more readable for users, and eliminating the requirement that the programmer look up, or keep notes on, the SGML identifiers used for each scrap. It has the inconvience, of course, of not being validatable by the SGML parser.

When the ID/IDREF mechanisms mentioned above (the target attribute on <ref> elements, and the prev attribute on <scrap> elements) are not used, then a string-prefix match is performed, matching the name attribute of a scrap with either the name attribute of a continuation or the parsed-character data content of a <ref> element.

Using string-prefix matching under Sweb, the example from Knuth's prime-number program would read:

<!-- top-level scrap refers to 'Program ...' -->
<scrap name=''>
<ref>Program to print the first thousand prime numbers</ref>

<!-- ... -->
<!-- This scrap defines that program. -->
<scrap name='Program to print...'>
  <!-- ... -->

The top level of Knuth and Levy's wc would be encoded this way:

<scrap name=''>
<ref>Header files to include</>
<ref>Global variables</>
<ref>The main program</>
The two methods of reference can be combined, for legibility. In this case, weave and tangle will obey the SGML IDREF and ignore the name string.
<scrap name=''>
<ref target=hdrfiles>Header files to include</>
<ref target=globals >Global variables</>
<ref target=functns >Functions</>
<ref target=main    >The main program</>

2.3 Processing Scrap References

Within a code scrap, references to other code scraps serve as cross references or hyperlinks to those other code scraps. In a printed copy of the Sweb document, such cross-references will normally contain the name of the embedded scrap, with a scrap number or page number; by convention these appear together in Roman type enclosed in angle brackets. In an online browser, the cross references should display in much the same way, but should also be live hypertext links: the user should be able to click on the name and traverse the link automatically, so that the embedded scrap appears on the screen for consultation.

The live-link behavior is easy to accomplish if ID/IDREF linking is used, since SGML browsers conventionally interpret ID/IDREF links this way. If string-prefix matching is used, however, very few online browsers will be able to treat scrap references as live hyperlinks. It is desirable, therefore, for a weave process to normalize string-prefix references by transforming them to normal ID/IDREF references. It will simplify the task for formatters and display programs — not to mention later readers of the web — if any abbreviated forms of the name are replaced, at the same time, with the full name. After such normalization, the first example given in the preceding section would look like this:

<!-- top-level scrap refers to 'Program ...' -->
<scrap name=''>
<ref target=program>Program to print the first thousand prime numbers

<!-- ... -->
<!-- This scrap defines that program. -->
<scrap id=program
       name='Program to print the first thousand prime numbers'>
  <!-- ... -->

A similar normalization is useful for <ptr> elements. Some SGML browsers have no difficulty in following the pointer, finding the name of the scrap pointed to, and displaying that name as part of the normal rendition of the pointer; such text reflection, however, is beyond the current grasp of other browsers. Such browsers will give better results if text is copied (duplicated) in the source rather than being reflected at display time; in Sweb, this effectively means replacing <ptr> elements with equivalent <ref> elements. The target and all other attributes of the <ref> are the same as those of the <ptr> element; the content of the <ref> is the name value of the <scrap> element being pointed to.

A full Sweb implementation, therefore, performs all of the following normalizations on its input unless the user indicates by means of run-time flags that they should be suppressed:

Full Sweb implementations should also identify blind references and unreachable scraps and issue warning messages for them. In cases where blind references are not an error, the warnings can be suppressed by removing the target attribute from the <ref> element in question and processing the file with a flag suppressing string-prefix matching; this will indicate to the weave processor that no target should be sought for the <ref> element. In cases where scraps are known to be unreachable, warnings may be suppressed by giving the value unreachable to the scrap's rend attribute, in effect saying "I know this scrap is unreachable. In this case that is not an error. Don't bother me." [Is this necessary? Is there a better way?]

2.4 Output to Files

The file attribute on the <scrap> element is used to specify that a given scrap is to be written out to the file named. If a scrap tagged with a file name refers to a nested scrap, both scraps will be written out to the same file; a scrap referred to by several others will be written out several times, once for each reference.

Any scrap bearing a value for the file attribute will be written out to the file indicated, together with any scraps it embeds, followed by any continuations. This will happen even if the scrap has already been written out to some other file, or has been referred to by another scrap in the same file. Embedded scraps and continuations should therefore not bear the file attribute, unless it is intended that they appear in the output twice.

If output files are rewritten every time an Sweb document is reprocessed, the resulting changes to the operating system time-stamp on the files are likely to confuse make and similar software development utilities. They rely on the time stamp to indicate the last time changes were made to a file, and will be confused if the time stamp changes even when the file contents do not. (Changes in the Sweb source may have affected only other output files.)

To avoid this problem, full Sweb implementations should replace output files only if the new version will be different from the existing version. The nuweb system demonstrates a simple method of accomplishing this:

Other algorithms are possible, but this one seems to work quite well and to have few system dependencies.

2.5 A Simple Example

As an example, consider the sample input file included in the one-page summary for nuweb, a literate programming tool written by Preston Briggs of Rice University.[6] As a nuweb file, it looks like this:

Here is the overall structure of an example program:

@d Sample Program
@<Main Program@>

Some declarations are always necessary:

@d Declarations
@{Easy things
Real things
Complex things
@| Things Complex declarations

The main program is quite straightforward:

@d Main Program
@{Call @@subroutine

as is the subroutine

@d Subroutine
Useful Job With Things

Finally, write the entire program out to a file:

@o sample.prg -it

The files I've generated are:
the macros defined are:
and the terms I've asked to be indexed are:

An SGML user could define many of the special @-sequences as SGML short references for tags, but to make the SGML structure of this example clearer, I will use explicit SGML tags. An Sweb document equivalent to the nuweb document above is:

<p>Here is the overall structure of an example program:

<scrap name='Sample Program'>
<ref>Main Program</>

<p>Some declarations are always necessary:

<scrap name='Declarations'>
Easy things
Real things
Complex things
<index level1='Things'>
<index level1='Complex'>
<index level1='declarations'>

The main program is quite straightforward:

<scrap name='Main Program'>
Call @subroutine

as is the subroutine

<scrap name='Subroutine'>
Useful Job With Things

Finally, write the entire program out to a file:

<scrap file='sample.prg' rend='noindent keeptabs'>

The files I've generated are:
<divGen type='filenames-index'>
the macros defined are:
<divGen type='scrapnames-index'>
and the terms I've asked to be indexed are:
<divGen type='index'>

3 Phrase-Level Elements

References to identifiers, and discussions of program statements, in the text require the special material to be specially marked. For this, the following elements are provided:

Some elements are defined solely as syntactic sugar for the <ident> and <code> elements, for use in describing SGML tag sets.

The <ident> element allows single identifiers to be marked as such, so that they can be formatted, and possibly indexed, appropriately. The <code> and <eg> elements allow larger code fragments (statements, expressions, even code blocks) to be referred to in-line without having to be included in the web. Within <code> elements, the token-level tags <ident>, <kw>, etc. may or may not be used. If they are used, the rendering engine can display each in a characteristic font or color; if they are not used, a uniform monospaced font may be used.[7]

The following examples illustrate their use:[8] The first example refers to both identifiers (p) and keywords (do while):

The first prime number being given (= 2), the thousandth being
assumed unknown to the programmer, the most natural order of
filling the elements of array <ident>p</ident> is in order of
increasing subscript value, and if we express just that (with a
simple repetitive <kw>while do</kw> clause) we come to ...

The second example, from the same source, shows code fragments within the prose discussion (here with tagging of some but not all internal components):


<p><hi rend=sc>Remark.</hi> Here <code><kw>while</kw>
<ident>p[ord]</ident> &uarr 2 &le <ident>j</> <kw>do</></code>
can be replaced by
<code><kw>if</kw> <ident>p[ord]</> &uarr 2 &le <ident>j</>
<kw>then</></code>, but to my taste the marginal gain in
efficiency is not worth the intellectual effort to prove its
validity.  A programmer should learn to be lazy at the right
moment and to let the principle <q>Safety First</q> prevail!

The formal declaration of these elements is this: [It might plausibly be argued that some of these, at least tag and att belong in TEI Lite, not here; I agree, but until TEI Lite incorporates them, the easiest way to make them accessible in Sweb documents is to include them here.]

< 2 Define phrase-level elements > =

<!ENTITY % x.tokentypes ''                                      >
<!ENTITY % m.tokentypes '%x.tokentypes ident | kw | lit |
comment | gi | att | val | ent'>
<!ELEMENT code          - O  (%paraContent;)                    >
<!ATTLIST code               %a.global;                         >
<!ELEMENT ident         - O  (#PCDATA)                          >
<!ATTLIST ident              %a.global;                         >
<!ELEMENT kw            - O  (#PCDATA)                          >
<!ATTLIST kw                 %a.global;                         >
<!ELEMENT lit           - O  (#PCDATA)                          >
<!ATTLIST lit                %a.global;                         >
<!ELEMENT delim         - O  (#PCDATA)                          >
<!ATTLIST delim              %a.global;                         >
<!ELEMENT comment       - O  (#PCDATA | %m.tokentypes)*         >
<!ATTLIST comment            %a.global;                         >
<!ELEMENT delim         - O  (#PCDATA)                          >
<!ATTLIST delim              %a.global;                         >
<!ELEMENT gi            - O  (#PCDATA)                          >
<!ATTLIST gi                 %a.global;                         >
<!ELEMENT att           - O  (#PCDATA)                          >
<!ATTLIST att                %a.global;                         >
<!ELEMENT val           - O  (#PCDATA)                          >
<!ATTLIST val                %a.global;                         >
<!ELEMENT ent           - O  (#PCDATA)                          >
<!ATTLIST ent                %a.global;
          type               (ge | pe)           'ge'           >
<!ELEMENT tag           - O  (#PCDATA)                          >
<!ATTLIST tag                %a.global;
          type               (stag | etag)       'stag'         >

4 Multiple Versions

A single Sweb document may need to be used to produce multiple versions of a given file, e.g. to support different platforms, or to create a test version with a new algorithm, while retaining the old algorithm for the production version. It may also be desirable for local documentation of modifications to a tag set to be integrated with the base documentation for that tag set: some sections of the local documentation should replace corresponding sections of the base document, while other sections are purely supplemental.

In general, local modifications, whether they represent omissions, changes, or additions, can and should be handled as additional material documenting a new (local) version, and tagged using the techniques described here for multiple versions.[9]

Three approaches may be used to generate multiple versions of a program:

  1. an external pre-processor (e.g. the C pre-processor) may be used, and the pre-processor directives define, ifdef, and endif may be used in the code to control the action of the pre-processor
  2. SGML marked sections may be used, which are controlled by parameter entity references and included or excluded based on the values declared for those parameter entities
  3. the exclude and version attributes may be used on the <scrap> element, to indicate the equivalence of different scraps, and to indicate which versions use which scrap.

The first solution has a natural appeal in many situations, since in environments with such a preprocessor it is probably the most common method of producing multiple versions from the same code. If the code has many small conditional sections, however, this approach can make it hard to focus on any one version of the program; if the small conditional sections are merged to produce longer parallel sections, it can become hard to keep different versions synchronized.

The second approach (marked sections) has a similar appeal, since version control is one of the intended uses of conditional marked sections. When more than a few versions of the material exist, however, the marked-section technique may become clumsy. Above all, it is hard to know how to express, with marked sections, the relations among versions, especially dependencies, or to capture the idea that if version 3 is not mentioned explicitly in a particular scrap, then version 3 is identical, for that scrap, to version 2. Most SGML users with requirements for industrial-strength version control use specialized elements for the job, not marked sections.[10]

The third approach allows the explicit claim that two scraps of code are, in some sense, equivalent, which is useful for documentation purposes. In this approach, a scrap may be tagged as representing an alternative version of some other scrap. Since they are alternative realizations of the same specification they are in some sense equivalent. The equivalence of two scraps is indicated by the exclude attribute on one scrap using the SGML ID of the other as its value:

A set of scraps connected by such equivalence relations is an equivalence class.[11]

4.1 Scrap References in Multi-Version Webs

When a web defines only one version of a web, a reference to another scrap is interpreted, as described above, as an indication that that scrap belongs logically at the point of reference. When multiple versions are present in the same web, the reference is interpreted as a reference not to one particular scrap, but as a reference to the equivalence class: any scrap in the equivalence class might potentially be substituted for the reference. Which scrap is selected depends on which version is being generated; this in turn is controlled by the user interface to the tangle processor (e.g. by selection from a menu, or by command-line option).

In the simple case, one scrap in the equivalence class identifies itself as belonging to the version being generated (by giving the version identifier on its version attribute); that is the scrap used by tangle. If two scraps in the same equivalence class identify themselves as belonging to the version being generated, it is a fatal error. If no scrap is identified as belonging to the version in question, a fallback must be found.

4.2 Version Information and Fallbacks

The fallback information is given by the <version> element, which declares one version of the program or package, and identifies a fallback version to use when necessary. Versions are declared in a <versionList> element, which should normally appear in the front matter or introduction to the document (e.g. in a section that describes which versions of the program exist).

The fallback version is typically the one from which a new version was derived. If no scrap in an equivalence class identifies itself as belonging to the version being generated, the tangle processor will find the scrap belonging to the fallback version. If there is none, the fallback's fallback will be sought, and so on until there are no further fallbacks, at which point a scrap with no version information at all will be used (which will normally be the first, unmarked version); if none or more than one is found, it is a fatal error.

The formal declaration of the <version> and <versionList> elements is as follows:

< 3 Declare version element > =

<!-- Version declarations                                     -->
<!ELEMENT version       - O  EMPTY                              >
<!ATTLIST version
          lang               IDREF               %INHERITED
          rend               CDATA               #IMPLIED
          id                 ID                  #REQUIRED
          n                  CDATA               #IMPLIED
          fallback           IDREF               #IMPLIED       >
<!ELEMENT versionList   - O  (version*)                         >
<!ATTLIST versionList        %a.global;                         >

4.3 Example: Dijkstra on Prime Numbers

For example, Dijkstra's classic article on "Stepwise Program Construction" presents a program to fill an array with the first thousand prime numbers. He presents a scrap of code (scrap 0), which defines the overall program and refers to another scrap (scrap 1), which refines the problem and refers to a third scrap (scrap 2), which does a portion of the work and refers the rest to a fourth scrap (scrap 3). After presenting a first version of scraps 1 and 2, however, Dijkstra observes that they can be made substantially more efficient by only looking at odd numbers when seeking primes greater than 2, and presents alternate versions of these two scraps (scraps 1b and 2b). Further improvements result in versions c and d; and version d is the first to define a scrap 3. A final revision results in scraps 1e, 2e, and 3e, which represent the final form of the program.

The versions of this program can be declared thus:

<version id=A>
<version id=B fallback=A>
<version id=C fallback=B>
<version id=D fallback=C>
<version id=E fallback=D>

The first few code fragments would look something like this; note that Dijkstra uses the term version in a rather loose sense when labeling his code scraps. The root scrap has no version information, because it is the same for all versions: if we use the ID/IDREF mechanism to match scraps and references to them it will look like this:[12]

<p>We shall now give the coarsest version
of the program, viz.
<scrap id=program name='version 0'>
begin <ptr target=assign> end
<p>When this action ...
The first version of scraps 1 and 2 follows:
<p>The first prime number ... we come to

<scrap id=assign
       name='assign to the array p
       the prime table as described'
begin integer k, j: k := 1; j := 1;
    while k <= 1000 do
    begin <ptr target=increase>;
         p[k] := j; k := k + 1;

<p>Version 1a is a perfect program when ...
With a simple repetitive <kw>repeat until</kw>
clause (which may act upon a sequence of statements)
we come for <q>increase <ident>j</ident> until
the next prime number</q> to

<scrap id=increase
       name='increase j until the next prime number'
begin boolean jprime;
    repeat j := j + 1; <ref>give to <ident>jprime</>
         the meaning: <ident>j</> is a
         prime number</ref>;
    until jprime

<p>If we substitute version 2a for the appropriate
operation in version 1a our resulting program is
undoubtedly correct. ...
Scrap 2a refers to a scrap ("give to jprime the meaning: j is a prime number") which is never actually written, because at this point Dijkstra stops and considers an obvious optimization: checking only odd numbers for primality. Since there is no such scrap, there is no SGML ID to point at, and we render the reference using the <ref> element. (This also shows, in passing, how identifiers can be tagged in scrap references; working out the details is left to the reader as an exercise.)[13]

Dijkstra's second cut at scraps 1 and 2 follows:

So we come to

<scrap exclude=assign
       name='assign to the array p the
       prime table as described'
begin integer k, j: p[1] := 2; k := 2; j := 1;
    while k <= 1000 do
    begin <ptr target=increase>;
         p[k] := j; k := k + 1;

where the analogous elaboration of the
operation between quotes leads to

<scrap exclude=increase
       name='increase odd j until the next
       odd prime number'
begin boolean jprime;
    repeat j := j + 2; <ptr target=jprime-odd>;
    until jprime

Dijkstra's example raises an important question. Since his scraps 1a and 1b have exactly the same function, as do 2a and 2b, they are shown here as members of the same equivalence class. What should be done, however, with the various versions of the scrap "give to jprime [for odd j] the meaning: j is a prime number"? The variations Dijkstra introduces in its name imply variations in its function; mathematically, as it turns out, this implication is false, in this case, but it is easy to imagine cases where analogous scraps do have different functionality, reflecting different assumptions in the contexts of their use. The rule of Sweb is simple: analogous scraps, called from similar places in equivalent code, which perform mathematically different functions, are not equivalent, and should not be placed in equivalence classes. Instead, they should be treated as different scraps entirely. (It is possible to imagine cases where this will require more verbose encoding of the scraps, but they are all rather far fetched, so this should not be burdensome in practice.) Code called from the same location is normally equivalent and can normally be tagged as a single equivalence class.

[Discussion ... possibly better example, with branching]

5 Recapitulations

When introducing new code to a reader, an author may desire to present first a very simple skeleton of the code, then develop some of the details of the code, with discussion, and finally present the fleshed-out result. For example, in Dijkstra's article on "Stepwise Program Construction" mentioned above, the final revision of the program is followed by a listing showing the entire text, with all substitutions (scrap references) resolved. The final listing, which is identical in content to scraps 0, 1e, 2e, and 3e, is a recapitulation of the work done so far. Most literate programming systems don't support recapitulations at all.[14]

Recapitulations may be indicated in Sweb documents by means of the <recap> element:

Dijkstra's conclusion, for example, would be encoded thus:

<p>Finally, we perform all substitutions to construct
a single statement.
<recap scrap=program
<p>We could have made the inner blocks into compound statements
by moving the declarations for <ident>jprime</> and <ident>n</>
to the outside.  We have not done so:  clarity does not gain by
it and whether there is a point in doing it is rather dependent
on the implementation.
<p>Thus ends the treatment of the first example.</p>

The formal declaration for <recap> is this:

< 4 Declare recapitulation element > =

<!-- Recapitulations                                          -->
<!ELEMENT recap         - -  (%model.scrap)                     >
<!ATTLIST recap              %a.global;
          scrap              IDREF               #REQUIRED
          version            IDREFS              #IMPLIED       >

6 Scrap Wrappers: Encoding Scraps Revisited

The tags described so far are fine for tangling the web into compilable code, and for basic printing and display. The weave processor, however, can do only limited work in improving the online or printed display of the web. The <ident> and similar tags cannot appear in the names of scraps, because the names are attribute values rather than elements; if one scrap is continued by several others, weave cannot generate and display a list of the continuation scraps; it is not possible to indicate, after each scrap, where it was used.

A better weave processor is possible, but it requires that there be an element related to a scrap, in which to give pointers to continuation scraps, to equivalent scraps, and to scraps which embed the current scrap. We could give these as notes adjacent to the scrap itself, without extending the current tag set. But it is clearly preferable to bind them together with the scrap using normal SGML methods: I mean, by wrapping them together with the scrap into some larger element. For now, I call this element <scrapInfo>.[16]

The <scrapInfo> element is declared thus:

< 5 Declare scrapInfo element > =

<!ELEMENT scrapInfo     - -  (head?, scrap, scrapDefs?,
                             scrapEquivs?, scrapRefs?,
                             indexDefs?, indexRefs?)            >
<!ATTLIST scrapInfo          %a.global                          >

<!ENTITY % model.scrapPtrs '(#PCDATA | ptr | ref | xptr |
xref)*'                                                         >
<!ENTITY % model.scrapDefs   '%model.scrapPtrs'                 >
<!ENTITY % model.scrapEquivs '%model.scrapPtrs'                 >
<!ENTITY % model.scrapRefs   '%model.scrapPtrs'                 >

<!ELEMENT scrapDefs     - -  (%model.scrapDefs)                 >
<!ATTLIST scrapDefs          %a.global                          >
<!ELEMENT scrapEquivs   - -  (%model.scrapEquivs)               >
<!ATTLIST scrapEquivs        %a.global                          >
<!ELEMENT scrapRefs     - -  (%model.scrapRefs)                 >
<!ATTLIST scrapRefs          %a.global                          >

<!ENTITY % x.chunk 'scrapInfo |' >

Any <scrap> element can be wrapped inside a <scrapInfo> element unless it is already within one; the scrap processing is exactly the same, except that:

In normal operation, the <scrapDefs>, <scrapEquivs>, and <scrapRefs> elements will be generated by the weave processor. Typically, they will contain a sequence of <ptr> or <ref> elements, optionally preceded by fixed text. (In some cases, it may be preferable to generate the fixed text in the SGML rendering software. I leave open the possibility of including it in the Sweb document to allow for (a) manual intervention and (b) really, really stupid SGML software.) For example, in normalized form, the scrap already seen from Knuth and Levy's wc might read:

It's convenient to output the statistics by defining
a new function <ident>wc_print</ident>; then the same
function can be used for the totals.  Additionally
we must decide here if we know the name of the file
we have processed or if it was just <ident>stdin</ident>.
<head>Write statistics for file</head>
<scrap id=wrstat
wc_print(which, char_count,word_count,line_count);
if (file_count)
   printf(" %s\n", *argv); /* not stdin */
   printf("\n");           /* stdin */
<scrapRefs>This code is used in
<ref target='process-all'>section 8</ref>.</scrapRefs>

The <indexDefs> and <indexRefs> elements are described elsewhere.

7 Indices

Knuth's original Web system not only generates the compilable form of the program and typesets the documentation, but also generates indices of variables, function and procedure names, etc., showing where they are defined and where they are used. It can do this because it understands enough Pascal to parse the Pascal code and index it properly. In later literate programming systems, it has become conventional to generate cross-reference listings of various types: identifiers in the program, file names, names of modules or program scraps, etc.

Language-independent literate-programming systems generally find it hard to generate such intelligent indices without help, since without understanding the program syntax it is difficult to distinguish identifiers in the program, which should be indexed, from comments, literals, or keywords which normally should not be indexed. Indices are so useful, though, that language-independent systems have developed methods of generating them even without hard-coded knowledge of the programming language, generally by using a mixture of automatic processing and manual assistance. The nuweb system, for example, allows the user to specify what identifiers are defined in each scrap, and indexes all occurrences of those identifiers in all scraps. Or rather, it indexes occurrences of those strings in all scraps. A simple string-matching algorithm is used, which does no tokenization; the lack of tokenization means identifiers like i are matched and indexed if the letter i appears anywhere in the scrap. [Add discussion of indexing in other systems (Web, noweb, Fweb, ...) here.]

Sweb allows several approaches to indexing:

Each approach is described more fully in the following sections.

7.1 Normal Indexing

Sweb systems normally use a kind of semi-automatic indexing similar to that performed by nuweb. The author of the web can indicate, in the <indexDefs> element, which identifiers are defined or declared in the associated scrap. These identifiers are indexed as being defined in that scrap, and as being used in any other scrap within which their names occur.[18] The author can also indicate, by means of the <divGen> element, where the generated index should be inserted in the output document.

The <indexDefs> element is used to indicate identifiers defined in the current scrap, or more generally to indicate any index strings for which the current scrap should be marked as a primary index entry. The standard TEI elements <index> and <divGen> are used in the normal way to control indexing.

All weave processors for Sweb will accept simple blank-delimited strings as input; it is expected, though not required, that they will replace the input strings in their output with equivalent <index> entries. This will allow human editors to use the simpler form for the first cut at an index, and then edit the output from weave if necessary, to get better results.

In normal processing, a weave processor will

For example, one of the scraps of Dijkstra's prime-numbers program might look like this in the first draft of a document, with an index entry specified for jprime:

<head>increase <ident>j</ident> until the next prime number</head>
<scrap id=increase
begin boolean jprime;
    repeat j := j + 1; <ref>give to <ident>jprime</>
         the meaning: <ident>j</> is a
         prime number</ref>;
    until jprime
After weave processing, the <indexDefs> element would contain <index> elements, not just raw tokens:
<index level1='jprime' index='identifiers'>
At the end of the document, the identifier index would include an entry something like this:
defined in
<ref target=increase>increase <ident>j</ident>
until the next prime number</ref>,
used in ...

In some cases, it may be desirable to use the more structured form of an <index> element. If we wanted to use second-level index entries to specify the type of each identifier, for example, we might substitute an <indexDefs> entry like the following:

      rend ='defined'
In this case, the weave processor would leave the <indexDefs> element alone, and the entry in the identifier index would look something like this:
<list type=simple>
defined in
<ref target=increase>increase <ident>j</ident>
until the next prime number</ref>,
used in ...

For further discussion of the back-of-the-book index, see section The Back of the Book Index.

The formal definition of <indexDefs> is as follows:

< 6 Define indexDefs > =

<!ELEMENT indexDefs     - O  (#PCDATA | index | %m.tokentypes)* >
<!ATTLIST indexDefs          %a.global                          >

7.2 Special Cases: Fully Automatic Indexing

7.2.1 ScrapIndexer

It may be convenient at times to have a fully automatic method of generating index entries; it will be even more convenient if the index entries thus generated can be used as they are, or edited manually to remove any dross. What we want is a mechanism to take an Sweb input file without index elements, and produce as output a second Sweb input file, the same as the first except that it does have index elements, which the user can delete or change if they are not quite right.

Really good automatic indexing requires language-specific processing, but even relatively simple language-independent index-entry generators can be good enough to be useful. It's more important, in such simple indexers, to ensure that all identifiers are indexed, than to ensure that all non-identifiers are not indexed: during development of a program, extra index entries do no serious harm, and it is easier to delete unwanted index entries than to type in index entries for identifiers the automatic processor missed.

The Sweb system defines a free-standing language-independent processor to generate index entries for scraps, which currently bears the name ScrapIndexer. It behaves as follows:

The default language specification

For example, the following scrap

<head>Command-line flags</head>
<scrap id=clflags lang=C>
  /* get arguments and set things up ... */
  while (--argc > 0 && ((*++argv)[0] == '-' || (*argv)[0] == '/'))
    while (c = *++argv[0])
      switch (c) {
        case 't': fTrace = fDebug = fVerbose = 1;
                  iMsglevel = msgTRACE;
                  yydebug = 1;
        case 'd': fDebug = fVerbose = 1;
                        iMsglevel = msgDEBUG;
        case 'v': fVerbose = 1;
                        iMsglevel = msgVERBOSE;
                   fprintf(stderr,"sweb:  unknown option %c\n",c);
                   argc = 0;
would look something like this after processing by ScrapIndexer:
<head>Command-line flags</head>
<scrap id=clflags lang=C>
argc argv break c case default fDebug fprintf fTrace fVerbose iMsglevel
msgDEBUG msgTRACE msgVERBOSE stderr switch while yydebug
After running ScrapIndexer, it is easy to edit the results by deleting the tokens which should not be indexed (e.g. because they are keywords of the language):
argc argv c fDebug fprintf fTrace fVerbose iMsglevel
msgDEBUG msgTRACE msgVERBOSE stderr yydebug

Depending on the run-time options chosen, the <indexRefs> section produced by ScrapIndexer might be more like this:

<index index='identifiers' level1='argc' level2='(C)'>
<index index='identifiers' level1='argv' level2='(C)'>
<index index='identifiers' level1='break' level2='(C)'>
<index index='identifiers' level1='c' level2='(C)'>
<index index='identifiers' level1='case' level2='(C)'>
This more elaborate output makes it easier for the user to distinguish, if desired, among variables, functions, etc., by editing the value given for the level2 attribute. Since weave processing does not alter these elements if they contain <index> elements, the user's manual improvements won't be overwritten by later runs of Sweb. (If the scrap changes radically, it may be necessary to run ScrapIndexer again, indicating by means of run-time flags which scraps should be left alone and which reindexed from scratch.)

In some cases, the default language specification is radically unsuitable; the next sections describe methods, one simple and one more complex, of dealing with such cases.

7.2.2 Generating Indexers

The simple automatic index generation just described is probably useful in many cases, but it is inadequate in others. Sometimes, it's inadequate because its model of programs ignores function and module boundaries or because the index needs to contain more information about the constructs being described in the source code (e.g. whether they are variables, functions, data structures, elements, attributes, or something else). In such cases, special-purpose language-specific indexing routines may be needed; for discussion, see below, section Language-Specific Indexing.

In other cases, the simple-minded flat model of ScrapIndexer is not a problem, but the default language specification used by ScrapIndexer does an unacceptably poor job of sifting out the identifiers of a programming language from the other stuff. In this case, it would be nice if ScrapIndexer could read a language specification for each language in the web, and use them in its processing. (Let us give the name SmartIndexer to this putative enhanced version of ScrapIndexer.) Unfortunately, I expect the first version of ScrapIndexer to be written in yacc and lex, and it seems unlikely that I will be able to figure out how to make yacc and lex do this right.

If we cannot have a SmartIndexer program, it would be nice to have a program to read a language specification and generate a new version of ScrapIndexer, with the new language specification hard-coded into the yacc and lex. This would require the user to have yacc and lex available, but not to know very much about how to use them. I think I know how to write such a ScrapIndexer-generator, but it's a low priority project for me, since if the default language specification of ScrapIndexer is no good, I can always change it myself, and I don't need a clever automatic way to do it.

If we can't write a smart indexer, and can't automatically generate new versions of ScrapIndexer, the next best thing would be to have an easy way of making a new version of ScrapIndexer by hand. Since making new versions of software requires nothing more arcane than Sweb's own version-control tags, and a version of Sweb which supports multiple versions, this seems relatively straightforward. This section describes, in general terms, what is involved.

It also defines the DTD for a language-specification document, and the behavior prescribed for an indexer-generator, so that any readers with time on their hands can try those hands at an automatic generator for new versions of ScrapIndexer.

The ScrapIndexer program, needless to say, is written using Sweb, and all the assumptions built into it may be changed by adding new versions of the relevant scraps to that document. The yacc grammar embodies the program's knowledge of Sweb document structure, such as the fact that ScrapIndexer only has to worry about scraps. It also embodies assumptions about the nature of the source code: no structure, just a sequence of keywords, comments, strings, identifiers, etc. The lexical scanner -- more specifically, the lex definition section -- makes explicit all the default assumptions about what strings represent what kinds of tokens. All that's necessary, to make a new version of the program, is:

For example, let us assume we want a simple indexer for Scheme or Lisp. We would add a new <version> element to the <versionList>:

<version id=Scheme1.0>Scheme ver. 1.0</version>

Then, we would add a new section to the document:

<div1 id=lisp><head>Alternate Version for Lisp</head>
within which we would provide an alternate scrap for the lex definitions:[20]
<scrap id="lisprules"
       name="Lex rules for Lisp tokenization">
Flatcomment     (";".*)
Nestcommstart   '\0'
Nestcommend     '\0'
Delims          [\'()\[\] \t]
Sign            [-+]
Integer         {Sign}?[0-9]+
Rational        [0-9]+"/"[0-9]+
Fixpoint        {Sign}?[0-9]+"."[0-9]+
Scientific      ({Integer}|{Fixpoint})("e"{Sign}?[0-9]+)
Real            {Fixpoint}|{Scientific}
Number          {Integer}|{Real}|{Rational}
String          "\""[^\n\"]"\""
Namedchar       backspace|return|page|linefeed|newline|rubout|space|tab
Character       ("#\"[^\n\t ])|("#\"{Namedchar})
Constant        {String}|{Number}|{Character}
StandardNames   "+"|"-"|"*"|"/"|"car"|"cdr"|"c"[ad][ad]+"r"

Then we would:

Care must be taken, of course, not to overwrite the default version unintentionally. The resulting indexer should accept a run-time indication of what languages it is responsible for, and should not run on scraps identified as being in other languages. Scraps without values for the lang attribute may or may not be processed; this should depend on a run-time flag.

A tool for the automatic generation of indexers in the style of ScrapIndexer might work as follows:

The language specification provides the following elements for defining the syntax of a language and modifying the default behavior.

The regular expressions should follow the conventions of lex.

If <name> is supplied, only tokens matching the given regular expression will be indexed. Otherwise, all tokens not matching any of the other token types will be indexed.

Multiple occurrences of each element type may be given; the token type concerned will be defined as a logical alternation of all of them. For example, some of the keywords of Scheme might be defined thus:


Formally, the languageSpecification DTD contains the following declarations.

< 7 DTD for language specifications > =

<!ELEMENT languageSpecification - -
                             (delimiter | flatComment
                             | nestingComment | string
                             | constant | keyword | name)*      >
<!ATTLIST languageSpecification
          languages          NAMES               #REQUIRED      >
<!ELEMENT delimiter     - O  (#PCDATA)                          >
<!ELEMENT flatComment   - O  (#PCDATA)                          >
<!ELEMENT nestingComment - - (start, end)                       >
<!ELEMENT start         - O  (#PCDATA)                          >
<!ELEMENT end           - O  (#PCDATA)                          >
<!ELEMENT string        - O  (#PCDATA)                          >
<!ELEMENT constant      - O  (#PCDATA)                          >
<!ELEMENT keyword       - O  (#PCDATA)                          >
<!ELEMENT name          - O  (#PCDATA)                          >

7.2.3 Language-Specific Indexing

For more sophisticated indexing, an indexer must have better knowledge of the language than is possible with ScrapIndexer and its derivatives. For most languages, such knowledge can readily be supplied in the form of free-standing filters similar to ScrapIndexer which read Sweb documents as input and produce equivalent Sweb documents as output, differing only in the addition or alteration of indexing information.

Such filters must:

Given a proper grammar for the language in question, it should be possible for language-specific indexers to distinguish functions from variables, and declarations or definitions from uses of those functions and variables. A sample language-specific program is described in the next section.

Language-dependent filters might also be used to typeset code in specific languages, adding formatting or structural tags to the code. As defined, <scrap> elements can contain only character data, <ref> elements, and <ptr> elements, however, so typesetting filters must either translate the document into a new DTD, or modify the standard Sweb DTD by redefining model.scrap. One obvious possibility (with which, however, I have not yet experimented) is to redeclare scraps, adding to the content model

Such a redeclaration might make it possible to support type-setting of source code in Sweb without losing the declarative nature of SGML tagging, and without binding the user to any particular typesetting system. Further work is needed in this area.

7.2.4 SGML-Specific Indexing

A language-specific processor for SGML will be supplied, which incorporates a DTD processor which recognizes element and attribute-list declarations and indexes generic identifiers of elements being declared, generic identifiers of elements named in the content model, and attribute names and values. In entity declarations, the SGML indexer will recognize and index the name of the entity being declared. When the DTDs being processed follow TEI naming conventions, entity declarations for model- and attribute-classes will be recognized and the entity text will be analysed correctly.

For example, when processing the scrap in this document, in which the <scrap> element is defined, the SGML indexer would produce the following output:

<head>Code scraps</head>
<scrap id=dScrap>
<!<!>ENTITY  % model.scrap  '(#PCDATA | ptr | ref)*'               >
<!<!>ELEMENT scrap         - -  (%model.scrap)                     >
<!<!>ATTLIST scrap              %a.global
          name               CDATA               #IMPLIED
          file               CDATA               #IMPLIED
          version            IDREFS              #IMPLIED
          index              CDATA               'auto'         >

  <index level1='model.scrap (SGML entity)'>
  <index level1='scrap (SGML element)'>
  <index level1='name (attribute)    ' level2='on scrap'>
  <index level1='file (attribute)    ' level2='on scrap'>
  <index level1='version (attribute) ' level2='on scrap'>
  <index level1='index (attribute)   ' level2='on scrap'>
  <index level1='ptr (SGML element)' level2='in model.scrap entity'>
  <index level1='ref (SGML element)' level2='in model.scrap entity'>

7.3 Special Cases: Fully Manual Indexing

Automatic index processing can be supressed entirely, to allow the author of the web complete control over the indexing of scraps. Suppression of automatic index processing also allows manual tweaking of the index, after an automatic process has generated a rough draft.

To suppress automatic index generation entirely, it is necessary only to avoid running any of the automatic index generators. Automatic index generation can be suppressed for individual scraps, by setting their index values to manual.

To suppress the normal index processing behavior (as described above in section Normal Indexing), it is necessary to suply an appropriate run-time flag.

7.4 The Back-of-the-Book Index

Normal index processing produces a full index of identifiers at the location specified by the user by means of a <divGen> element. Its contents are very simple: a <list> of type index. Each index entry is an item; each item consists of the entry form itself, followed by a series of <ref> elements pointing at the scraps in which the identifier in question is defined or used. The point of definition will normally be marked with an asterisk, but this may vary from implementation to implementation.

A fragment of the print index from a literate program might look something like this:[21]

<list type=index>
<!-- ... -->
<item><ident>nactive_heaps</ident> *75c, 76a, 76b </item>
<item><ident>NEQ</ident> *24a </item>
<item><ident>newcell</ident> 41a, 52c, 70a, 84c</item>
<item><ident>nheaps</ident> 19b, 21b, 22b, 62c, 69, 73b, 74, 75c, 76a,
76b </item>
<item><ident>NIL</ident> 16b, 20a, *24a, 31a,
<!-- ... -->
<!-- ... -->

A version of this index intended for online use might look slightly different: the page numbers might better be replaced with <ref> elements containing the names of the scraps, with the scrap identifiers in the target attributes:

<list type=index>
<!-- ... -->
<ref target="gcroutines75c">*Define nactive_heaps()</ref>,
<ref target="gc76a">define gc_status()</ref>,
<ref target="gc76b">Define gc_info()</ref>
<ref target="type-predicates">*Define Lisp type predicates</ref>
<ref target="newcell">Define newcell</ref>,
<ref target="fopen">Define fopen</ref>,
<ref target="gc-for-newcell">Define gc_for_newcell</ref>,
<ref target="lmfuncs">Declare list-management functions</ref>
<ref target="heapvars">Heap variables</ref>,
<ref target="process-cla">Define process_cla()</ref>,
<ref target="print-hs-1">Define print_hs_1</ref>,
<ref target="init-storage-1">Define init_storage_1</ref>,
<ref target="allocate-aheap">Define allocate_aheap()</ref>,
<ref target="looks-pointerp">Define looks_pointerp</ref>,
<ref target="gcsweek">Define gc_sweep</ref>,
<ref target="gcroutines75c">Define nactive_heaps()</ref>,
<ref target="gc76a">Define gc_status()</ref>,
<ref target="gc76b">Define gc_info()</ref>
<ref target="globals">Declarations for siod.c</ref>,
<ref target="lisp-obj">Lisp objects</ref>,
<ref target="type-predicates">*Define Lisp type predicates</ref>,
<ref target="leval-args">Define leval_args</ref>,
<!-- ... -->
<!-- ... -->

7.5 Using Sweb Indexing Facilities in Practice

The markup and indexing processes described in this document can be used in a variety of ways. In a `normal' case, if any exist, the programmer might index all scraps manually, allowing Sweb to generate an index of identifiers.

In some cases, it might be more convenient to run ScrapIndexer periodically during development, so as to get a rough index automatically. Since ScrapIndexer does not distinguish between definitions and mere uses of variables, the resulting index is not so useful, but it also requires less work to generate. When the program is complete or nearly complete, it would be sensible to run ScrapIndexer(or some other automatic index generator) a final time, and then to edit the results, moving <index> elements as needed into the <indexDefs> element from the <indexRefs> element. The actual index can be generated by the default behavior of Sweb, or the index processing can be set to fully manual operation.

The general rule is simple: use the automatic index generators to generate rough indices during development, and use their output as a rough first draft, to be manually edited and tweaked as the document approaches completion.

8 Generating Reference Material

In some cases, it is not enough to generate an index of identifiers. When defining a library, or an SGML tag set -- and even when defining a relatively complex program -- it is useful to provide reference material for various constructs of the library, tag set, or program.

For a function library, reference material might list the name and arguments of each function, with a description of the function's purpose. For SGML document type definitions, reference material might list, for each element type, the generic identifer, attributes, and formal declaration of the element, with a prose description of its usage. (Several DTDs exist for DTD documentation; perhaps the most accessible, and certainly the one I know best, is the Tag Set Documentation DTD defined by the Text Encoding Initiative.)[22]

Some complications await us if we wish to support such reference materials; more complications arise if we want to generate them semi-automatically.

One complication is that the constructs of the formal system, be they functions or SGML element types, should normally be described in the prose documentation, too. Multiple descriptions, of course, risk inconsistency. Sweb provides a method of controlling redundancy and reducing inconsistency. A glossary list (i.e. a <list> or <glossary>) in the text with a type of desclist will be recognized by Sweb as containing descriptions of important items in the formal system being defined, and reference material will be generated for those items, if it doesn't already exist.

The reference material for a given item includes at least the following SGML elements:[23]

contains the reference documentation itself
contains the name of the the variable, element type, function, etc.
contains a description, copied from the <item> or <gloss> portion of the description list
With sufficiently intelligent language-specific processors, and manual editing of the results, or just with patient manual labor, the reference material may also include:
contains the full name of the item, if the formal identifier is an abbreviation
description of subordinate items (<attributes> documents the attributes of an element type, <arguments> the arguments of a function, <members> the members of a structure or class, <range> the range of values of a data type)
contains an example of the item (tagged with <eg>), together with commentary (tagged as a series of <p> elements)
contains remarks on the scope, usage, or other interesting aspects of the item
indicates what module or part of the system the item is part of
indicates what file(s) the item is defined or declared in
contains pointers to other related items, whatever the relationship (parents, children, superclasses, subclasses, etc.)
contains the formal declaration or definition of the item
contains an informal paraphrase of the declaration or definition
gives a cross reference to a section in the document which describes or mentions this item
identifies an equivalent construct in another system

For example, here is a sample description for an imaginary SGML element named bl (for blort):

  <refDoc type=tag>
    <ident type=gi>bl</>
    <rs           >blort</>
    <desc         >Marks any sequence of granfalloons.</desc>
       <refDoc type=att>
         <ident type=gi>id</>
         <rs           >identifier</>
         <desc         >Gives a unique ID to this blort.</desc>
         <informal     >any unique SGML name</>
         <formal       >id ID #REQUIRED</>
    <exemplum><p>This is how it works:  ...</p>
       <eg> ... </eg>
    <remarks><p> ... </remarks>
    <module> ... </module>
    <files> ... </files>
    <ptrList type='classes' targets='clspqr chunk'>chunk, Rahtz-class</>
    <ptrList type='children' targets='granf'>granfalloon</>
    <ptrList type='parents' targets='&amp;divs front body back'>back body
        div div0 div1 div2 div3 div4 div5 div6 div7 front</>
    <informal>Contains a series of granfalloon elements, optionally
        preceded by a heading.</>
    <formal><![ CDATA [
       <!ELEMENT blort - O (head?, granfalloon*) >
       <!ATTLIST blort
                 id        ID     #REQUIRED      >
    <ptr type='doc' target=chap3>
    <equiv scheme=TEI>seg type=blort</>

In describing the handling of <refDoc> elements, we can distinguish two main cases: either no <refDoc> element exists for a given item, or one does.

If no <refDoc> element exists for a given item, then unless the user indicates (by means of run-time flags) that this action should be suppressed, one is generated by the Sweb processor for each item found in a <label> element within a <list> element of type desclist. The generated <refDoc> contains only an <ident> element (taken from the <label> element of the description list) and a <desc> element (taken from the corresponding <item> element). If the appropriate run-time flag is given, the Sweb processor will also generate empty elements for the rest of the reference documentation. The <refDoc> element is inserted into the document wherever an appropriate <divGen> element is encountered in the input.

Language-specific processors may also be written to parse the various languages used in the web, recognize the important constructs, and copy them as appropriate into the reference documentation. The preparation of such language-specific processors is not discussed further here.

When it generates reference documentation, Sweb creates redundancies in the text, between descriptions and code scraps in the prose and the <desc> elements and code scraps in the reference documentation. Whenever Sweb copies material from one location in the input to multiple locations in the output, one location in the output is marked as the original, the other as a copy. When on later runs the redundant material is encountered in the input, the copy is ignored and a new copy is made.

Flags on the Sweb processor can be used to control the actions desired:

In all cases, the copy element is identified by having a copyOf or sameAs attribute pointing at the original. (In the latter case, the copy has a redundant copy of the material; in the former case, the copy is empty.) The original may be identified merely by being the target of a sameAs or copyOf attribute; optionally, it may also receive a select attribute whose value is its own ID. This is strictly speaking unnecessary, but helpful in flagging clearly its role as the master copy, for human editing.

The formal declarations of these elements are these:

< 8 Declare refDoc > =

<!ENTITY % m.subs 'attributes | arguments | members | range' >
<!ELEMENT refDoc        - -  (ident, rs?, desc, (%m.subs)?,
                             exemplum*, remarks?, module?,
                             files?, ptrList*, informal?, formal?,
                             ptr*, equiv*)                      >
<!ELEMENT (attributes | arguments | members) - - (refDoc*) >
<!ELEMENT range - - (code, desc)* >

9 Technical Issues

9.1 Processing Webs

Sweb is a single-DTD system, designed to be suitable for handling by a single processor rather than by separate weave and tangle processors. By single-DTD system, I mean a system whose output conforms to the same markup language as the input. This distinguishes it from virtually all of its predecessors, whether single-processor or double-processor systems:[24] they universally produce, as output, files which cannot be read again as input.

In practice, however, a double-DTD system has the undesirable effect that everyone reads the output form of the document and no one ever sees the input form.[25] A single-DTD system eliminates this problem: anyone interested in using the system can imitate the file format they see in the finished products.

The single-DTD system also makes it much easier to do a first sketch of a program or SGML DTD, without the reference material.

A single process which both weaves and tangles the input must:

These are listed in what seems to me a natural order for gradual implementation.

9.2 Invocation / Selection

The tag set declared here functions as a user-supplied extension to the host TEI DTD; apart from being supplied by the user instead of by the TEI, it works the same way the TEI `additional tag sets' work. The formal declarations of the SGML elements will be contained in one file, while the SGML glue needed to integrate the Sweb tags with the native TEI tag sets will be included in another. The system names of these are sweb.dtd and sweb.ent, respectively.

Normally, literate programming documents will use the TEI base tag set for prose (verse and drama would also work, but might seem excessively `literate' and frighten later maintenance programmers who have to work with the web). The tag set described here also requires portions of the additional tag set for linking. The document type declaration for a literate programming document must therefore select the TEI tag sets for prose and linking, and identify the files sweb.dtd and sweb.ent as containing extensions we wish to use. In the simple case, when we are not using any other extensions, the document type declaration can look like this:

  <!DOCTYPE tei.2 SYSTEM 'tei2.dtd' [
    <!ENTITY % TEI.prose   'INCLUDE' >
    <!ENTITY % TEI.linking 'INCLUDE' >
    <!ENTITY % TEI.extensions.dtd SYSTEM 'sweb.dtd' >
    <!ENTITY % TEI.extensions.ent SYSTEM 'sweb.ent' >
In the more complex case, we have to combine the Sweb additional tag set with other extensions, for example, those which define TEI Lite. In this case, we need only: [example needed.]

9.3 Specialized Handling of Standard Elements

Some fundamental literate programming facilities can be handled by standard elements and require no specialized tags. It may be necessary, however, for these standard elements to be processed in specialized ways. This section describes the special processing required for standard elements when used in literate programs.

The <ref> element and <ptr> elements are used in program scraps, to indicate embedding of other scraps. Tangle processors should follow the link recursively and embed the other scrap at the location indicated; weave processors should print the name of the other scrap, followed by a module or scrap number, section number, or page number, all enclosed within angle brackets, or using whatever typographic convention is adopted for module references.

The <divGen> element is used to signal that a text-division (<div>) element should be generated, containing an index of a kind specified in the type attribute. Special values which should be recognized by Sweb processors are:

index of files generated
index of scrap names
index of version names
index of identifiers (manually indexed)

[Idea: make two-level convention, so <divGen type='index filenames'> dumps the index created by <index index='filenames' level1='myfile.dat'> and so on. Then it's simple to allow the user to specify whether the identifiers belong in the main index, or in an index of their own. Ditto for everything else: filenames can go in their own index, or they can be put into the main index, depending on one's preferences.]

9.4 Integrating the Sweb Tags into the Base Tag Set

In order to integrate the tags defined here into the TEI base, we need to include them into the TEI element-class system. I do this by defining some specialized parameter entities:

< 9 > =

<!ENTITY % x.tokentypes ''                                      >
<!ENTITY % m.tokentypes '%x.tokentypes ident | kw | lit |
comment | gi | att | val | ent'>
<!ENTITY x.phrase '%m.tokentypes |'>
<!ENTITY x.chunk 'scrap | versionList |'>

[This section needs work.]

9.5 Implementing Language-Specific Processing

Knuth's original WEB system, and many of its offshoots, typeset the program code in a readable style, formatting code blocks appropriately (with indentation showing code structure), printing keywords in bold, literals in a monospaced font, and identifiers in italics, and using special characters in the standard TeX fonts for important operators such as assignment, logical-and, and logical-or. This requires substantial knowledge of the programming-language syntax to be hard-coded into the program or located in a table somewhere for consultation by the program.

Hard-coding the program's linguistic knowledge also has the drawback of limiting a web to the language(s) known to the weave and tangle processors; in an attempt to evade this drawback, some tools simply print all program code in a monospaced typewriter font, reserving normal typesetting for module references and comments.

In Sweb, program scraps and inline <code> elements are processed, by default, as verbatim text using a typewriter font. Language-specific processing can be achieved by writing special routines to parse program fragments and insert the proper formatting codes; these special routines should be introduced into the style-sheet controlling the weaving process. In this way, special indexing, etc., can also be introduced.

[Specific examples needed here.]

[N.B. the discussion above is silently assuming that Sweb will be implemented by an SGML-aware filter program driven by a style sheet, or by a pipeline of such filters, implemented with tf, CoST, OmniMark, or similar programs. Since Sweb's output is also an SGML document, then language-specific processors can be run on the Sweb output. They need only distinguish scraps and inline code fragments properly, and identify their language correctly.]

9.6 Tables of Correspondences

As an aid in understanding the Sweb tag set, and the expected processing, I provide the following partial lists of commmands in some widely used literate programming tools, together with the equivalent in Sweb.

For WEB:

major module: use <div1>
module: use <div2>, <div3>, etc., or <p>.
Pascal program: use <scrap>
module definition: use <scrap>
module reference: use <ref> or <ptr>
index entry (in normal font): use <index> with an appropriate level2 value
index entry (in typewriter font): use <index> with an appropriate level2 value
end of argument: this is handled differently in SGML; mark the end of an SGML element with the appropriate end-tag, the end of an attribute value with a closing quotation mark.
Pascal-within-text: use <code> or <ident>
mark next identifier as defining occurrence: not implemented
format like Pascal semicolon: not implemented
define a macro: not implemented

N.B. in this version of Sweb, a number of WEB facilities have no SGML analogues; these may eventually be added, but this has not yet been done. In addition to those already mentioned, these constructs without equivalents include:

The special markup for overriding the default formatting depends critically upon the specific algorithms used to typeset the program code, and has a very unSGML-like feel to it, so it may never be added; however, it may be essential to really satisfactory results, so it may be necessary to define it in the form of a set of declarative hinting elements for use by any formatter.

For nuweb:

@o filename flags @{ ... @}
write the scrap out to the named file, allowing no page breaks within the scrap: <scrap file=fn rend=flags> ... </scrap>
@O file flags @{ ... @}
ditto, but allow page breaks within the scrap: <scrap file=fn rend='loose flags'> ... </scrap>
@d scrapname @{ ... @}
define a scrap with the name given; allow no page breaks: <scrap name='scrapname'> ... </scrap>
@D scrapname @{ ... @}
define a scrap with the name given; allow page breaks: <scrap name='scrapname' rend='loose'> ... </scrap>
@{ ... @}
delimit a scrap (this should always be preceded by @d or @o): <scrap> ... </scrap>
@< name@>
invokes the scrap named. <ptr target=scrapid> or <ref>name</ref> or <ref target=scrapid>name</ref>
generate index of filenames created: <divGen type='filenames-index'>
generate index of macro names (scrap names): <divGen type='scraps-index'>
generate index of identifiers or other terms: <divGen type='terms-index'> or <divGen type='index'>

10 Alphabetical List of Tags

[To be supplied ... ]


[1] The ideas in this paper derive in part from several years' experience with the ODD system of tag-set documentation, and from discussions with friends and colleagues. In particular, I thank Lou Burnard of Oxford University Computing Services, with whom I designed the ODD system, and Bob Hyman of the University of Illinois at Chicago Academic Computer Center, who has taken a collegial interest in these ideas and their potential implementation.
[return to text]

[2] Donald E. Knuth, "Literate Programming," The Computer Journal 27 (1984): 97-111, rpt. [rev.] in his Literate Programming, "CSLI Lecture Notes" Number 27 ([Stanford, California]: Center for the Study of Language and Information, 1992), pp. 99-136, here p. 99.
[return to text]

[3] The tags here are all that are necessary for Sweb input, but output may include further tags, described below in section Scrap Wrappers.
[return to text]

[4] Silvio Levy and Donald E. Knuth, "An Example of CWEB," in Knuth's collection Literate Programming, "CSLI Lecture Notes" Number 27 ([Stanford, California]: Center for the Study of Language and Information, 1992), pp. 341-348. This is scrap 17, on p. 345.
[return to text]

[5] The example is from Knuth's revision of a famous program by Edsger Dijkstra, in Knuth's "Literate Programming," pp. 113, 115; see also pp. 103-104 for a formatted copy. Dijkstra's original prime-number program is discussed further below.
[return to text]

[6] The one-page summary was prepared by Adrian F. Clark for nuweb version 0.6. The subroutine has an at-sign in its name, presumably to illustrate how literal at-signs are coded in nuweb documents.
[return to text]

[7] Language-specific processors might be used to parse <code>, <eg>, or <scrap> elements and tag the component parts of the code appropriately.
[return to text]

[8] Edsger W. Dijkstra, "EWD227: Stepwise Program Construction", in his Selected Writings on Computing: A Personal Perspective (New York, Heidelberg, Berlin: Springer-Verlag, 1982), pp. 1-14, here pp. 3 and 7. This is the program rewritten by Knuth as an example in his original article on literate programming.
[return to text]

[9] Where it is desired to produce local documentation which completely replaces selected sections of the base documentation, the techniques described here won't quite do the job; they can, however, be extended in an obvious way, particularly by use of the exclude attribute to indicate which <div> elements in the local documentation should replace which <div> elements in the base documentation.
[return to text]

[10] Examples include the Text Encoding Initiative and the Davenport Group.
[return to text]


The corresp attribute, which is also declared globally when the TEI additional tag set for linking is in use, may be preferable for this purpose; its name at least seems less confusing. The exclude attribute, however, seems to capture the semantics of the situation more precisely; scraps in an equivalence class are not only equivalent, but mutually exclusive alternants. If left completely on my own, I'd use an equiv attribute, but there seems no point in adding a new attribute which differs from exclude only in name.

The real problem is that only linguists have really internalized the notion that when two items are in complmentary distribution, and thus effectively exclude each other, they are variant forms of the same thing. For the rest of us, being variant forms and exclusion seem somehow contradictory. (The linguist in me, of course, observes triumphantly that this means they are variant forms of the same phenomenon ...)

[return to text]

[12] I evade, for the moment, some questions raised by Dijkstra's sample program, notably his use of identifiers and keywords within scrap names, which cannot be represented with the tags introduced so far; for a solution, see below, section Scrap Wrappers. I also ignore, for now, the variation in scrap names in Dijkstra's text.
[return to text]

[13] It does not show, of course, how identifiers can be tagged in scrap names, which are SGML attributes, not content. For a discussion of this and related issues, see the section on Scrap Wrappers.
[return to text]

[14] I am unaware, in fact, of any which support recapitulations.
[return to text]

[15] It seems unnecessary to allow the user to specify, through special evaluate and quote attributes, or rule (with values of evaluate or quote) and exceptions, which scrap references should be expanded (evaluated) and which should be left alone (quoted, taking quoted as an antonym of evaluated). If any users of the tag set actually need this capability, I hope they'll let me know. For now, it seems likely that <recap> should almost always expand only the references to scraps already seen, leaving forward references unexpanded.
[return to text]

[16] Other names once or still under active consideration are <wrappedScrap>, <wrap>, <scrapWrap>, <sWrap>, and <sw> (which can mean either scrap wrap or Sweb). I like <scrapwrap> or <sWrap> myself, but the moans of my friends were too audible. Thanks to Bob Hyman for suggesting the name <scrapInfo>, which has the merit of eliciting no moans at all.
[return to text]

[17] For now, I am resisting the temptation to add the rule:

It does seem natural to want to assign an ID to the entire scrap wrapper, but it seems simpler to forbid the use of such alternate identifiers as aliases.
[return to text]

[18] In some implementations, as in nuweb, very simple string matching may be used when searching for other scraps which refer to the identifier; in some, there may be a less error-prone approach. Consult the implementation-specific documentation.
[return to text]

[19] Individual implementations may also choose to append (or to supply as a level2 value) a string consisting of the name of the programming language and the string identifier, so that the identifier blort in a fragment written in Pascal would be replaced by <index level1='blort (Pascal identifier)'> or <index level1='blort' level2='Pascal identifier'>. An implementation might equally well prefer to leave such refinements to specialized language-specific indexing tools, on which see below, section Special Cases: Fully Automatic Indexing.
[return to text]

[20] The syntactic details approximate those given in R. Kent Dybvig, The Scheme Programming Language (Englewood Cliffs: Prentice-Hall, 1987). Some details have been changed for convenience (e.g. the number 100.000 is classed as a real, not an integer); other details may have been changed inadvertently. No effort is made to list all the predefined functions of Scheme.
[return to text]

[21] This fragment is transcribed (manually!) from a partially indexed copy of the Scheme interpreter siod.
[return to text]

[22] The TSD is based on the Odd system for DTD generation designed by Lou Burnard and myself -- the tags described here are a generalization of the Odd and TSD tag sets.
[return to text]

[23] The tags used are generalizations of tags in the TEI auxiliary tag set for Tag Set Documentation, with the following equivalences:

[return to text]

[24] Or even triple-processor systems, like the TEI Odd system, in which oddp2x performs weave functions, odddtd performs tangle functions, and oddref generates reference material.
[return to text]

[25] In the case of Odd, the input format remained, in effect, a secret known only to the editors of the TEI and those few members of work groups willing and able to learn it successfully from our examples. Everyone else produced drafts which, at best, attempted to look like P2X files. Since these had to be reduced manually to legal Odd format, I had plenty of time to reflect on how useful a single-DTD design would be.
[return to text]