This technical paper presents the information necessary for implementing a gateway from a Common Gateway Interface (CGI) compliant WWW server to PAT, Open Text's text search engine. It presents several variant implementations, including an Oxford English Dictionary (OED) lookup facility, a book browsing facility using the TEI Guidelines for Electronic Text Encoding and Interchange (P3), and a KWIC result generator for literary analysis in a text collection. One of the problems of the Web, and of HTML in particular, is its limited use of markup and limited structure recognition: only a small set of characteristics can be represented, and few of these have any functional value beyond presentation. While one would like to be able to deliver components of the text as delimited by the markup (e.g., a single glossary entry from a collection of several thousand), only whole-file transfer is possible without resorting to programs external to HTML and the server. Despite the anchor element, <A>, with its ability to provide hypertext links within and beyond the text, it is still necessary to return the entire file containing the link to the user, even when only a small portion is required. The gateway capability discussed in this technical paper documents a method by which a Web administrator with access to PAT can use any number of richer SGML DTDs and begin to provide the user with access to that richer set of tags and structural retrieval possibilities.
Standards and open systems approaches must be a defining part of library efforts to provide large-scale wide-area access to textual resources. It is not enough to say that access is improved if a major investment is made in textual resources that will be unusable in two years: the texts must be reusable. Because of the cost of creating the texts, it must be possible to use the texts in a variety of settings with a variety of tools. To that end, a standards-based encoding scheme must be at the foundation of text creation. Standard Generalized Markup Language (SGML), an international standard, is such an encoding scheme and has proven extremely valuable in effecting an open systems approach with text.[1] This paper is not the place for an argument for the value of SGML, especially when that argument has been made so effectively elsewhere.[2] Still, in addition to its value as an internationally approved standard, SGML is ideally suited to supporting textual retrieval because it is a descriptive rather than a procedural markup language. That is, it is a language designed to reflect the structure or function of text, rather than simply its typography or layout. Thus, in a text retrieval system, portions of the document can be searched and retrieved, and functionally different elements can be displayed differently depending on their function.
The application of SGML using the Text Encoding Initiative (TEI) Guidelines will play a central role in ensuring that textual resources -- particularly those important to textual studies -- are produced in a way that makes them flexible and of continuing value. The challenge of designing an implementation of SGML to meet a broad range of text processing needs in the humanities has been met by the Text Encoding Initiative in its Guidelines for the Encoding and Interchange of Machine-Readable Texts.[3] The TEI itself is a collaborative project of the Association for Computers and the Humanities, the Association for Computational Linguistics, and the Association for Literary and Linguistic Computing. Its purpose is the promulgation of guidelines for the markup of electronic text for a variety of disciplines involved in the study of text. In mid-1994 a comprehensive and detailed two-volume set of guidelines was published. While the print version is an essential acquisition for libraries, an electronic version (discussed here) was also made available by the author of this article.
A central feature of SGML is the DTD. The DTD, or Document Type Definition, is a codification of the possible textual characteristics in a given document or set of documents. SGML expresses the organization of a document without necessarily resorting to the paradigm of the file system -- discrete files representing the organizational components of a document. It expresses textual features such as footnotes, tables, and headings, and the building blocks of content such as paragraphs, using a descriptive language focusing on the role of the element rather than some presumed display value. In fact, SGML is not a tag set, but a grammar, with the ``vocabulary'' or tags of an individual document being articulated in its DTD. Using this fairly rigorous grammar, SGML can both declare information about the document in a way that can be transported with the document and enforce some rigor in the application of markup by aiding in ``parsing'' the document.
Hypertext Markup Language, or HTML, is a form of SGML expressed by its own unique DTD. The shape of that DTD has changed significantly since it was first articulated by researchers at CERN, and it continues to change with the demands of the WWW.[4] HTML was designed to facilitate making documents available on the World Wide Web and expresses a variety of features such as textual characteristics and hypertext links. These hypertext links are perhaps its most useful offering because they allow one to link documents to other resources throughout the Internet, effectively making the Internet a large hypertext document.
The World Wide Web is far more than a server protocol for the transfer of HTML documents, far more than a sort of ``gopher on steroids.'' Among the many resources it offers in facilitating sophisticated retrieval of information is the Common Gateway Interface, or CGI. Like HTML, CGI is in transition. However, in its current state it offers a set of capabilities that allow us to use the WWW to support much more complex documents and retrievals. The Common Gateway Interface is a set of specifications for external gateway programs to speak to the WWW's server protocol, HTTP. It allows the administrator to run external programs from the server in such a way that information requests return a desired document to the user or, more typically, generate a document on the fly. This capability makes it possible to provide uniform access to data structures or servers that are completely independent of the HTTP, including structures such as those represented in SGML documents or servers such as Z39.50 servers. The CGI specification is detailed online on the NCSA documentation server.[5]
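The CGI contract is simple enough to sketch. The fragment below (in Python for illustration; the implementations discussed in this paper use shell and perl) shows the two essentials: the user's submission arrives in the QUERY_STRING environment variable, and the program must emit a Content-type header before the document it generates on the fly.

```python
from urllib.parse import parse_qs

def handle_request(environ):
    """Build an HTML document on the fly from a CGI-style request.
    The server passes the user's submission in QUERY_STRING."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    query = params.get("query", [""])[0]
    body = f"<HTML><BODY>Results for: {query}</BODY></HTML>"
    # A CGI program must emit the Content-type header, a blank line,
    # and then the document itself.
    return "Content-type: text/html\n\n" + body
```

The variable name query follows the FORMs described later in this paper.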
Closely associated with the CGI is the FORMs specification first introduced with NCSA's Mosaic. This feature is a client-independent mechanism to submit complex queries, usually through a graphical user interface. FORMs-compliant interfaces such as Mosaic, lynx (a UNIX vt100 client), and OmniWeb (a NeXTStep client) use a complex offering of fill-out forms, check boxes, and lists to mediate queries between the user and the CGI resource. Users respond by making selections that qualify submissions to the server, for example checking a box to qualify a search as an author search, thereby making a complex command-line syntax unnecessary.[6]
CGI programs can be written in a variety of languages, including UNIX shell scripts, C programs, and perl. In fact, there are few limitations on the type of language that can be used. Perl is foremost among the options available to most WWW administrators. Largely the work of a single person, Larry Wall, perl can be used to create extremely fast and flexible programs with no practical limits on the size of the material it can treat. Perl also has outstanding support for the UNIX ``regular expression'', making it ideal for text systems such as those documented here where one form of markup must be translated to another.[7]
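As a small illustration of the kind of regular-expression translation at which perl excels (sketched here in Python), a filter can map a descriptive tag to a presentational HTML one. The element names follow the OED examples later in this paper; the rendering choices are illustrative.

```python
import re

def sgml_to_html(line):
    """Translate one form of markup to another with regular
    expressions, in the manner of the perl filters discussed here."""
    # Render the OED quoted-author element <A>...</A> as italics.
    line = re.sub(r"<A>(.*?)</A>", r"<I>\1</I>", line)
    # Render the headword lemma <HL>...</HL> as bold (illustrative).
    line = re.sub(r"<HL>(.*?)</HL>", r"<B>\1</B>", line)
    return line
```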
The approach taken in these examples separates the operations of retrieval to allow one component (e.g., a filter) to be upgraded without affecting other components. It should be emphasized that this separation of operations grew out of local needs and that other approaches, including an approach that combines all operations in a single program, are possible. The four steps are: (1) submission of the query by the user, typically through an HTML FORM; (2) handling of the query by a CGI program and its submission to PAT; (3) preliminary processing of PAT's results; and (4) filtering of the resulting SGML to HTML for presentation to the user.
This multi-stage approach has many advantages. For example, it is possible to use different types of programs for the different stages, tailoring the selection of programs to the strengths each might have for that function or to local needs. In the approach documented here, HTML FORMs, shell programs, C programs, and perl are used for the four operations. Separating the functions also allows persons with different responsibilities, skills, or interests to manage the different processes. For example, a system administrator might manage the second and third stages, while someone responsible for more aesthetic issues in the delivery might manage parts of the first and the fourth. At the University of Virginia Library, filters from richer SGML to HTML continue to be enhanced by staff from the Library's Electronic Text Center, in a process completely separate from the development of other parts of the interface. Other approaches are certainly possible, and an effort will be made to improve the efficiency of the current implementation.
FORMs to handle submission of the query may be simple or complex. The three examples given here demonstrate that range, with the Middle English FORM supporting word and phrase searches, the OED search providing a great deal of information about the areas to be searched and information to be retrieved, and the TEI browse including information about levels and ID values embedded in a collection of URLs. In each case, the contents of the search are registered in the variable query. So that the user or system is not overwhelmed by large result sets, the size of result sets is limited to 100 items, and an additional FORM option (registering the variable size) is included to help the user move subsequently through the results 100 items at a time, or to sample 100 items from the entire result set.
In the OED search, searching within specific elements such as the quoted author is possible, making possible such queries as ``give me quotations authored by Chaucer.'' This process includes the following: the contents of the search are registered in the variable query; and an option, period, is offered to allow users to limit quotation searches by century.
The TEI browse begins with an HTML page listing the largest structural components of the Guidelines (i.e., <DIV0>), and each of these is linked to an HTML page containing the titles of subsidiary (i.e., <DIV1> through <DIV4>) structures.[9] This organization is presented in two files: TEI.html, the top-level HTML page, and tei-tocs1.html, the secondary HTML page for ``Part 1.'' The URL for each list item in the subsidiary page contains the information necessary to conduct a search and retrieve the structural component being selected. For example, to retrieve the section ``Structure and Notational Conventions of this Document,'' the first sub-section of the first chapter in Part 1, the URL points to the component extraction program tei-tocs and specifies that this is ID ``struct'' and is of the level <DIV2> (e.g., the section is bounded by <DIV2 ID="struct"> and </DIV2>).
The CGI query handling operation accepts the search or action initiated by the user and prepares it for submission to PAT.[10] Information from the user is submitted to the CGI program specified in the FORM or URL (e.g., <FORM ACTION=/bin/tei.sh>
). A wide range of queries can be supported using this strategy. For example, a single word or phrase search can be submitted; it is also possible to ask for two words within a specified proximity to each other, two structures (e.g., stanzas) that contain one or more words or phrases, or to retrieve structural units such as chapters. Each of the instances in this implementation uses UNIX (Bourne) shell scripts to negotiate interaction with PAT. The program begins by testing to see if a query has been made and, especially in searches like those in the Middle English collection, a test is made of preliminary PAT results to determine whether a valid (e.g., correct syntax or punctuation) search was performed.
For example, the ID JefLett is assigned to Jefferson's Letters, and on page 234 the ID will be JefLett234. This ID value becomes a hypertext link to viewing a larger context in the subsequent processing of results. The PAT communication illustrated in me-kwic consists of commands that:
put PAT in ``quiet mode'' ({QuietMode raw}); and set the print mode to return each result with its ID and page ({PrintMode 3 ID page}).
docs C16 incl ("bailey" within docs A)
docs E incl ("photo" within docs HG)
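Queries of this shape can be assembled mechanically. A sketch (the function name is ours; the region names C16, for a century, and A, for the quoted author, follow the examples above):

```python
def build_pat_query(word, period=None, author_only=False):
    """Assemble a PAT region query of the shapes shown above,
    e.g. docs C16 incl ("bailey" within docs A)."""
    q = f'"{word}"'
    if author_only:
        # Restrict the word to the quoted-author region A.
        q = f'({q} within docs A)'
    if period:
        # Restrict the result to quotations in the given century.
        q = f'docs {period} incl {q}'
    return q
```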
http://etext.virginia.edu/bin/tei-tocs?div=DIV3&id=ABTEI2
This URL passes the level of the <DIV> (i.e., <DIV3>) and the ID value (i.e., ABTEI2) to the CGI script seen in tei.sh. This is used to formulate the PAT query:
docs DIV3 incl "<DIV3 ID=ABTEI2 "
and a print command (i.e., pr.docs.DIV3) to retrieve the relevant portion.
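A sketch of what tei.sh does with the URL's div and id parameters (in Python for illustration; the output strings are the query and print command shown above):

```python
from urllib.parse import parse_qs

def tei_tocs_commands(query_string):
    """Turn the div/id pair in a tei-tocs URL into the PAT search
    and print commands that retrieve the selected component."""
    params = parse_qs(query_string)
    div = params["div"][0]   # e.g. "DIV3"
    id_ = params["id"][0]    # e.g. "ABTEI2"
    search = f'docs {div} incl "<{div} ID={id_} "'
    print_cmd = f"pr.docs.{div}"
    return search, print_cmd
```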
Two stages of result handling, each relying on PAT's ``Quiet mode,'' take place before pages or dictionary entries are presented to the user. In its ``Quiet mode,'' PAT produces very helpful results with all information tagged for more reliable processing. For example, the size of the result set is marked with <SSize> tags, as in <SSize>23</SSize>, and every result can be made to begin with a <Hdr>, as in <Hdr>AusEmma234</Hdr>.[11] The first of the two stages produces only minor transformations, primarily displaying the number of results retrieved and separating each result into a separate line. The results of the first stage are piped to the second stage to produce a view for users.
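The first stage can be sketched as follows (in Python rather than the perl of me-parse.pl; the <SSize> and <Hdr> tags are those shown above):

```python
import re

def first_stage(raw):
    """First-stage handling of PAT ``Quiet mode'' output: report the
    result count from <SSize> and separate the <Hdr>-led results."""
    m = re.search(r"<SSize>(\d+)</SSize>", raw)
    count = int(m.group(1)) if m else 0
    # Each result begins with a <Hdr>; split on that boundary.
    results = re.findall(r"<Hdr>.*?(?=<Hdr>|$)", raw, re.S)
    return count, [r.strip() for r in results]
```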
In the perl code for the Middle English ``Quiet mode'' translation (me-parse.pl), the <SSize> tags are used to highlight the number of results. The OED preprocessing (oed-parse.pl) simply ensures that each result is on a separate line, deferring issues such as using the <SSize> tags to highlight the number of results until a later stage. The ``Quiet mode'' filter used for the TEI is essentially the same as that used for the Middle English collection (teiquiet.pl).
The second stage presents each element to the user in a sort of ``index'' view so that a broader display can be produced. The perl code for this stage usually produces a KWIC view, with each line being a hypertext link to an expanded view of the result. The user sees the <Hdr> content as the link, but the byte offset -- the number of characters into the file, which PAT uses to locate results -- is the actual search component of the link. The first line below, with bracketed information as a hypertext link, is presented to the user. The second line is the actual HTML.

[AB] entations for hypertextual links and other non-hierarchic struc
[<A href=/bin/tei-1500?id=39493>AB</A>] entations for hypertextual links and other non-hierarchic struc

The perl code for the Middle English (me-kwic.pl) is indicative of that needed to produce a KWIC view. Results from the OED are presented to the user as a list of dictionary headwords, from which the entire entry can be displayed; this perl code is available as oedkwic.pl.
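Producing such a line is simple string assembly. A sketch (the function name is ours; the CGI path follows the example above):

```python
def kwic_line(header, offset, context, cgi="/bin/tei-1500"):
    """Build one line of the KWIC ``index'' view: the <Hdr> content
    is what the user sees; the byte offset is what the link submits
    to the context-printing program."""
    return f"[<A href={cgi}?id={offset}>{header}</A>] {context}"
```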
The results of searches, each in a complex SGML designed to support retrievals such as those discussed here, are prepared by the final stage for presentation to a WWW client such as Mosaic by being filtered to HTML. Filtering, in these examples, is again achieved by perl. Most tags from the originating files are specialized and have no corresponding HTML tag. For example, the quoted author element in the OED, <A>, has no corresponding HTML tag, but one might render it as italicized text. Because of this lack of correspondence and the limited number of HTML tags, decisions are largely arbitrary and draw on presentational or aesthetic needs.
The filter, oed.pl, is used to filter OED tags to HTML. This filter and the resulting output demonstrate the challenge of filtering rich, heterogeneous text to HTML. A sample tagged OED entry is included in Figure 1 to illustrate the problems encountered. Information can be presented attractively to a user with a GUI client, despite many compromises in mapping complex tags to simple presentation characteristics. Future improvements in HTML and the ability of clients to interpret a variety of tags should enhance the display of OED entries.
Subsequent to the KWIC view for TEI and Middle English, more thorough transformations are performed on the texts. For example, upon being selected, a result's byte offset is sent to a program that uses PAT to print a 1,500 character context for the result. The results of this 1,500 character view may contain any of the possible tags in the DTD, so a filter that represents all possible element values is created. At this time, the filters fall short in many areas, in particular in their ability to express complex relationships made possible in the DTD. For example, an element, <LIST>, may have an attribute value TYPE="gloss" that suggests it should be converted to the HTML <DL>. Before closing, the element <LIST> may contain other <LIST>s with other TYPE= values. At the present time, these simple stream-oriented filters cannot differentiate between tags with the same name but different functions in this type of complex nesting relationship. Two examples of filters are included here, a filter for the Oxford Text Archive DTD in ota.pl and a filter for the TEI P3 DTD in tei.pl. Currently (August 1994) both represent local display concerns and continue to be enhanced.
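A nesting-aware filter needs only a small stack to pair each </LIST> with its opener, which is precisely what a stream-oriented, line-at-a-time filter lacks. A sketch (in Python; the gloss-to-<DL> mapping follows the text, while the fallback to <UL> is an assumption of ours):

```python
import re

def convert_lists(text):
    """Convert possibly nested <LIST> elements to HTML, using a
    stack so each closing tag matches its own opener's TYPE."""
    out, stack = [], []
    for tok in re.split(r"(<LIST[^>]*>|</LIST>)", text):
        if tok.startswith("<LIST"):
            # TYPE="gloss" maps to <DL>; other lists to <UL> (assumed).
            html = "DL" if 'TYPE="gloss"' in tok else "UL"
            stack.append(html)
            out.append(f"<{html}>")
        elif tok == "</LIST>":
            out.append(f"</{stack.pop()}>")
        else:
            out.append(tok)
    return "".join(out)
```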
Very little preparation of SGML texts is necessary to be able to implement this strategy with PAT. In order to provide a useful indicator of location for KWIC views, we typically add an <ID> element with concise positional information. For example, in our Modern English texts, we add a combination of author, title, and pagination to create an <ID> such as JefLett244 (i.e., Thomas Jefferson, Letters, page 244). These <ID> values are limited to ten characters.
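The composition of these values can be sketched as follows (the abbreviation scheme shown is illustrative; only the author-title-page pattern and the ten-character limit come from the text):

```python
def make_id(author, title, page):
    """Compose a concise positional <ID> value, e.g. JefLett244
    for Jefferson's Letters, page 244, capped at ten characters."""
    stem = author[:3] + title[:4]   # illustrative abbreviation
    return (stem + str(page))[:10]
```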
The TEI Guidelines are complex and make liberal use of minimization (i.e., omission of end tags where they are clearly implied by context), making other steps necessary. These included:
End tags implied by context were made explicit: a chapter that begins with <div1 id=AB> and whose end tag is implied by the beginning of a new chapter is changed to <DIV1 ID="AB"> and </DIV1>.
Some <DIV1> elements did not have id attributes. These were added to aid in retrieval, and included attributes such as id="bibliog" for the bibliography.
An <ID> was created for each <DIV1>, with the value of the id attribute being copied to the <ID>. So, for example, the chapter called ``A Gentle Introduction to SGML,'' <DIV1 id="SG">, now has <ID>SG</ID>. This makes it possible to display the ID value for each <DIV1> in the KWIC views.
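This preparation step is itself a one-line regular-expression transformation. A sketch (in Python; the lowercase id attribute follows the example above):

```python
import re

def add_id_elements(text):
    """Copy each <DIV1> id attribute into an <ID> element, so that
    <DIV1 id="SG"> gains <ID>SG</ID> immediately after it."""
    return re.sub(r'(<DIV1 id="([^"]+)">)', r"\1<ID>\2</ID>", text)
```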
I believe that the strategy described here is an effective method for access to text collections and that it suggests important possibilities for access to other types of resources.[12] The University of Virginia provides access to many collections of resources with clients (i.e., interfaces) designed to support complex analysis, where users can create sets, combine them, and use a range of operations facilitated through a wide array of menus. These clients are frequently much more complicated than is desirable for simple operations such as word lookups in the OED. The strategy outlined here allows users to do simple word lookups in the OED or to formulate simple queries in the text collections without needing to understand PAT syntax or the organization of the collections. However, there is much in this strategy that is suggestive of other possibilities for providing access to collections.
Journal literature in SGML may be successfully accessed through this sort of strategy. For example, a journal run marked up according to the more elaborate AAP (Association of American Publishers) DTD, ISO 12083, could return articles to the user using simple PAT queries. However, this example only scratches the surface of the sorts of strategies that might be possible with the journal collection. Another strategy might set up CGI scripts to facilitate browsing where, for example, a user selecting Browse by author/title would be taken through a series of selections, each mediated by PAT queries.
These strategies have been employed for access to collections at the University of Virginia and the University of Chicago. Open Text has generously allowed the University of Virginia to provide non-UVa access to five of its collections or resources. (For more information on this, see the announcement from March 1994.) To examine most of the strategies discussed here (though not access to the OED), select examples from the test page at the University of Virginia.
LIST elements containing LIST elements will be differentiated both by the nesting and by the attribute values. Until a filter can be written to exploit these characteristics, the conversion will fall short of what is possible.
Executables are declared in the conf area of the NCSA httpd, in the srm.conf. In the srm.conf, create a line for pat executables. For example, if we were to call this area patbin, that entry might read:
ScriptAlias /patbin/ /usr/local/httpd/patbin/
Access can be restricted by creating a file called .htaccess in the patbin directory and declaring an acceptable set of IP addresses. Please see the httpd documentation for further information, but note that it is not important to restrict access to the HTML FORM oed.html, but rather the executable files.
In the OED DTD, the element Entry is abbreviated to E.
<E><HG><HL><LF>debug</LF><SF>debug</SF><MF>debug</MF></HL><MPR> d<i>i&mac.</i>b<i>&reva.</i>&sd.g</MPR><IPR><IPH>di&lm.&sm.b& revv.g</IPH></IPR>, <PS>v.</PS></HG><ET>f. <XR><XL>de-</XL> <SN>II</SN>. <SN>2</SN></XR> +<XR><XL>bug</XL><PS>sb.</PS> <HO>2</HO></XR></ET><p><S4><#>1</#><S6><DEF><PS>trans.</PS> = <XR><XL>delouse</XL><PS>v.</PS></XR></p></DEF><QP><Q><D>1960</D> <A>J. Stroud</A><W>Shorn Lamb</W> vi. 70 <T>We'll..take them round to the Clinic, and..get them debugged there.</T></Q></Q> </S6></S4><p><S4><#>2</#><S6><DEF><LB>slang.</LB>To remove faults from (a machine, system,etc.).</p></DEF><QP><EQ><Q> <D>1945</D> <W>Jrnl. R. Aeronaut. Soc.</W> XLIX. 183/2 <T>It ranged from the pre-design development of essential components, through the stage of type test and flight test and `debugging' right through to later development of the engine.</T></Q></EQ><Q><D>1959</D><W>New Scientist</W> 26 Mar. 674/1 <T>The `debugging' time spent in perfecting a non-automatic programme.</T></Q><Q><D>1964</D> <W>Discovery</W> Oct. 51/3 <T>This failure report plays a vital role in the process by which the scientist corrects or de-bugs his programme.</T></Q><Q><D>1964</D> <A>T. W. McRae</A> <W>Impact of Computers on Accounting</W> iv. 99 <T>Once we have `debugged' our information system. </T></Q><Q><D>1970</D> <A>A. Cameron</A> et al. <W>Computers &. O.E. Concordances</W> 49 <T>Program translation, debugging, and trial runs of the concordance were performed at the University of Michigan Computer Center.</T></Q><Q><D>1970</D> <A>A. Cameron</A> et al. <W>Computers &. O.E. Concordances</W>, 49 <T>By Christmas the program was debugged.</T></Q></QP></S6> </S4><p><S4><#>3</#> <S6><DEF>To remove a concealed microphone or microphones from (a room, etc.); to free of such listening devices by electronically rendering them inoperative. Cf. <XR><XL>bug</XL><PS>sb.</PS><HO>2</HO><SN>3</SN><SN>f</SN></XR>. orig.<LB>U.S.</LB></p></DEF><QP><Q><D>1964</D> <W>Business Week</W> 31 Oct. 
154 (<W>heading</W>) <T>When walls have ears, call a debugging man.</T></Q><Q><D>1964</D><W>Business Week</W> 31 Oct. 154 (<W>heading</W> 158/2 )<T>He quotes high fees for his work, saying that debugging equipment is expensive.</T></Q> <Q><D>1966</D> in Random House Dict. </Q><Q><D>1969</D><W>New Scientist</W> 16 Jan. 128/3<T>`Debugging' the boardroom and the boss's telephone may become as common in industry as in the unreal world of the super-spy. </T></Q><Q><D>1976</D><A>M. Machlin</A> <W>Pipeline</W> xxxi. 353 <T>The room..had steel walls and had been rigorously de-bugged.</T></Q><Q><D>1978</D> <W>Sunday Mail Mag.</W> (Brisbane) 9 Apr. 3/6 <T>Jamil, America's leading `debugging' expert, discovered the secret of an exported `bug' which should not have worked.</T></Q> <Q><D>1987</D><W>Daily Tel.</W> 3 Apr. 1/8 <T>American officials are scrambling to `de-bug' their embassy in Moscow before the arrival of Mr Shultz, Secretary of State, on Monday week.</T></Q></QP></S6></S4><p><S4><SE>Also <BL> <LF>debugging</LF><SF>de&sm.bugging</SF><MF>debugging</MF></BL> <DEF><PS>vbl. sb.</PS> (see senses 2, 3 above).</DEF> </SE></p></S4></E>