This technical paper presents the information necessary for implementing a gateway from a Common Gateway Interface (CGI) compliant WWW server to PAT, Open Text's text search engine. It presents several variant implementations, including an Oxford English Dictionary (OED) lookup facility, a book browsing facility using the TEI Guidelines for Electronic Text Encoding and Interchange (P3), and a KWIC result generator for literary analysis in a text collection. One of the problems of the Web, and of HTML in particular, is its limited use of markup and limited structure recognition: only a small set of characteristics can be represented, and few of these have any functional value beyond presentation. While one would like to be able to deliver components of the text as delimited by the markup (e.g., a single glossary entry from a collection of several thousand), only whole-file transfer is possible without resorting to programs external to HTML and the server. Despite the anchor element, <A>, with its ability to provide hypertext links within and beyond the text, it is still necessary to return the entire file containing the link to the user, even when only a small portion is required. The gateway capability discussed in this technical paper documents a method by which a Web administrator with access to PAT can use any number of richer SGML DTDs and begin to provide the user with access to that richer set of tags and structural retrieval possibilities.
Standards and open systems approaches must be a defining part of library efforts to provide large-scale wide-area access to textual resources. It is not enough to say that access is improved if a major investment is made in textual resources that will be unusable in two years: the texts must be reusable. Because of the cost of creating the texts, it must be possible to use the texts in a variety of settings with a variety of tools. To that end, a standards-based encoding scheme must be at the foundation of text creation. Standard Generalized Markup Language (SGML), an international standard, is such an encoding scheme and has proven extremely valuable in effecting an open systems approach with text.[1] This paper is not the place for an argument for the value of SGML, especially when that argument has been made so effectively elsewhere.[2] Still, in addition to its value as an internationally approved standard, SGML is ideally suited to supporting textual retrieval because it is a descriptive rather than a procedural markup language. That is, it is a language designed to reflect the structure or function of text, rather than simply its typography or layout. Thus, in a text retrieval system, portions of the document can be searched and retrieved, and functionally different elements can be displayed differently depending on their function.
The application of SGML using the Text Encoding Initiative (TEI) Guidelines will play a central role in ensuring that textual resources -- particularly those important to textual studies -- are produced in a way that makes them flexible and of continuing value. The challenge of designing an implementation of SGML to meet a broad range of text processing needs in the humanities has been met by the Text Encoding Initiative in its Guidelines for the Encoding and Interchange of Machine-Readable Texts.[3] The TEI itself is a collaborative project of the Association for Computers and the Humanities, the Association for Computational Linguistics, and the Association for Literary and Linguistic Computing. Its purpose is the promulgation of guidelines for the markup of electronic text for a variety of disciplines involved in the study of text. In mid-1994 a comprehensive and detailed two-volume set of guidelines was published. While the print version is an essential acquisition for libraries, an electronic version (discussed here) was also made available by the author of this article.
A central feature of SGML is the DTD. The DTD, or Document Type Definition, is a codification of the possible textual characteristics in a given document or set of documents. SGML expresses the organization of a document without necessarily resorting to the paradigm of the file system -- discrete files representing the organizational components of a document. It expresses textual features such as footnotes, tables, and headings, and the building blocks of content such as paragraphs, using a descriptive language focusing on the role of the element rather than some presumed display value. In fact, SGML is not a tag set, but a grammar, with the ``vocabulary'' or tags of an individual document being articulated in its DTD. Using this fairly rigorous grammar, SGML can both declare information about the document in a way that can be transported with the document and enforce some rigor in the application of markup by aiding in ``parsing'' the document.
Hypertext Markup Language, or HTML, is a form of SGML expressed by its own unique DTD. The shape of that DTD has changed significantly since it was first articulated by researchers at CERN, and it continues to change with the demands of the WWW.[4] HTML was designed to facilitate making documents available on the World Wide Web and expresses a variety of features such as textual characteristics and hypertext links. These hypertext links are perhaps its most useful offering because they allow one to link documents to other resources throughout the Internet, effectively making the Internet a large hypertext document.
The World Wide Web is far more than a server protocol for the transfer of HTML documents, far more than a sort of ``gopher on steroids.'' Among the many resources it offers in facilitating sophisticated retrieval of information is the Common Gateway Interface, or CGI. Like HTML, CGI is in transition. However, in its current state it offers a set of capabilities that allow us to use the WWW to support much more complex documents and retrievals. The Common Gateway Interface is a set of specifications for external gateway programs to speak to the WWW's server protocol, HTTP. It allows the administrator to run external programs from the server in such a way that information requests return a desired document to the user or, more typically, generate a document on the fly. This capability makes it possible to provide uniform access to data structures or servers that are completely independent of the HTTP, including structures such as those represented in SGML documents or servers such as Z39.50 servers. The CGI specification is detailed online on the NCSA documentation server.[5]
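The CGI contract is simple enough to sketch. The fragment below (in Python for illustration; the implementations discussed in this paper use shell and perl) shows the two essentials: the user's submission arrives in the QUERY_STRING environment variable, and the program must emit a Content-type header before the document it generates on the fly.

```python
from urllib.parse import parse_qs

def handle_request(environ):
    """Build an HTML document on the fly from a CGI-style request.
    The server passes the user's submission in QUERY_STRING."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    query = params.get("query", [""])[0]
    body = f"<HTML><BODY>Results for: {query}</BODY></HTML>"
    # A CGI program must emit the Content-type header, a blank line,
    # and then the document itself.
    return "Content-type: text/html\n\n" + body
```

The variable name query follows the FORMs described later in this paper.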
Closely associated with the CGI is the FORMs specification first introduced with NCSA's Mosaic. This feature is a client-independent mechanism to submit complex queries, usually through a graphical user interface. FORMs-compliant interfaces such as Mosaic, lynx (a UNIX vt100 client), and OmniWeb (a NeXTStep client) use a complex offering of fill-out forms, check boxes, and lists to mediate queries between the user and the CGI resource. Users respond by making selections that qualify submissions to the server, for example checking a box to qualify a search as an author search, thereby making a complex command-line syntax unnecessary.[6]
CGI programs can be written in a variety of languages, including UNIX shell scripts, C programs, and perl. In fact, there are few limitations on the type of language that can be used. Perl is foremost among the options available to most WWW administrators. Largely the work of a single person, Larry Wall, perl can be used to create extremely fast and flexible programs with no practical limits on the size of the material it can treat. Perl also has outstanding support for the UNIX ``regular expression'', making it ideal for text systems such as those documented here where one form of markup must be translated to another.[7]
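As a small illustration of the kind of regular-expression translation at which perl excels (sketched here in Python), a filter can map a descriptive tag to a presentational HTML one. The element names follow the OED examples later in this paper; the rendering choices are illustrative.

```python
import re

def sgml_to_html(line):
    """Translate one form of markup to another with regular
    expressions, in the manner of the perl filters discussed here."""
    # Render the OED quoted-author element <A>...</A> as italics.
    line = re.sub(r"<A>(.*?)</A>", r"<I>\1</I>", line)
    # Render the headword lemma <HL>...</HL> as bold (illustrative).
    line = re.sub(r"<HL>(.*?)</HL>", r"<B>\1</B>", line)
    return line
```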
The approach taken in these examples separates the operations of retrieval to allow one component (e.g., a filter) to be upgraded without affecting other components. It should be emphasized that this separation of operations grew out of local needs and that other approaches, including an approach that combines all operations in a single program, are possible. The four steps are: (1) submission of the query by the user, typically through an HTML FORM; (2) handling of the query by a CGI program and its submission to PAT; (3) preliminary processing of PAT's results; and (4) filtering of the resulting SGML to HTML for presentation to the user.
This multi-stage approach has many advantages. For example, it is possible to use different types of programs for the different stages, tailoring the selection of programs to the strengths each might have for that function or to local needs. In the approach documented here, HTML FORMs, shell programs, C programs, and perl are used for the four operations. Separating the functions also allows persons with different responsibilities, skills, or interests to manage the different processes. For example, a system administrator might manage the second and third stages, while someone responsible for more aesthetic issues in the delivery might manage parts of the first and the fourth. At the University of Virginia Library, filters from richer SGML to HTML continue to be enhanced by staff from the Library's Electronic Text Center, in a process completely separate from the development of other parts of the interface. Other approaches are certainly possible, and an effort will be made to improve the efficiency of the current implementation.
FORMs to handle submission of the query may be simple or complex. The three examples given here demonstrate that range, with the Middle English FORM supporting word and phrase searches, the OED search providing a great deal of information about the areas to be searched and information to be retrieved, and the TEI browse including information about levels and ID values embedded in a collection of URLs. In each case, the contents of the search are registered in the variable query. So that the user or system is not overwhelmed by large result sets, the size of result sets is limited to 100 items, and an additional FORM option (registering the variable size) is included to help the user move subsequently through the results 100 items at a time, or to sample 100 items from the entire result set.
In the OED search, searching within specific elements such as the quoted author is possible, making possible such queries as ``give me quotations authored by Chaucer.'' This process includes the following: the contents of the search are registered in the variable query; and an option, period, is offered to allow users to limit quotation searches by century.
The TEI browse begins with an HTML page listing the largest structural components of the Guidelines (i.e., <DIV0>), and each of these is linked to an HTML page containing the titles of subsidiary (i.e., <DIV1> through <DIV4>) structures.[9] This organization is presented in two files: TEI.html, the top-level HTML page, and tei-tocs1.html, the secondary HTML page for ``Part 1.'' The URL for each list item in the subsidiary page contains the information necessary to conduct a search and retrieve the structural component being selected. For example, to retrieve the section ``Structure and Notational Conventions of this Document,'' the first sub-section of the first chapter in Part 1, the URL points to the component extraction program tei-tocs and specifies that this is ID ``struct'' and is of the level <DIV2> (e.g., the section is bounded by <DIV2 ID="struct"> and </DIV2>).
The CGI query handling operation accepts the search or action initiated by the user and prepares it for submission to PAT.[10] Information from the user is submitted to the CGI program specified in the FORM or URL (e.g., <FORM ACTION=/bin/tei.sh>
). A wide range of queries can be supported using this strategy. For example, a single word or phrase search can be submitted; it is also possible to ask for two words within a specified proximity to each other, two structures (e.g., stanzas) that contain one or more words or phrases, or to retrieve structural units such as chapters. Each of the instances in this implementation uses UNIX (Bourne) shell scripts to negotiate interaction with PAT. The program begins by testing to see if a query has been made and, especially in searches like those in the Middle English collection, a test is made of preliminary PAT results to determine whether a valid (e.g., correct syntax or punctuation) search was performed.
For example, the ID JefLett is assigned to Jefferson's Letters, and on page 234 the ID will be JefLett234. This ID value becomes a hypertext link to viewing a larger context in the subsequent processing of results. The PAT communication illustrated in me-kwic consists of commands that:
put PAT in ``quiet mode'' ({QuietMode raw}); and set the print mode to return each result with its ID and page ({PrintMode 3 ID page}).
docs C16 incl ("bailey" within docs A)
docs E incl ("photo" within docs HG)
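Queries of this shape can be assembled mechanically. A sketch (the function name is ours; the region names C16, for a century, and A, for the quoted author, follow the examples above):

```python
def build_pat_query(word, period=None, author_only=False):
    """Assemble a PAT region query of the shapes shown above,
    e.g. docs C16 incl ("bailey" within docs A)."""
    q = f'"{word}"'
    if author_only:
        # Restrict the word to the quoted-author region A.
        q = f'({q} within docs A)'
    if period:
        # Restrict the result to quotations in the given century.
        q = f'docs {period} incl {q}'
    return q
```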
http://etext.virginia.edu/bin/tei-tocs?div=DIV3&id=ABTEI2
This URL passes the level of the <DIV> (i.e., <DIV3>) and the ID value (i.e., ABTEI2) to the CGI script seen in tei.sh. This is used to formulate the PAT query:
docs DIV3 incl "<DIV3 ID=ABTEI2 "
and a print command (i.e., pr.docs.DIV3) to retrieve the relevant portion.
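A sketch of what tei.sh does with the URL's div and id parameters (in Python for illustration; the output strings are the query and print command shown above):

```python
from urllib.parse import parse_qs

def tei_tocs_commands(query_string):
    """Turn the div/id pair in a tei-tocs URL into the PAT search
    and print commands that retrieve the selected component."""
    params = parse_qs(query_string)
    div = params["div"][0]   # e.g. "DIV3"
    id_ = params["id"][0]    # e.g. "ABTEI2"
    search = f'docs {div} incl "<{div} ID={id_} "'
    print_cmd = f"pr.docs.{div}"
    return search, print_cmd
```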
Two stages of result handling, each relying on PAT's ``Quiet mode,'' take place before pages or dictionary entries are presented to the user. In its ``Quiet mode,'' PAT produces very helpful results with all information tagged for more reliable processing. For example, the size of the result set is marked with <SSize> tags, as in <SSize>23</SSize>, and every result can be made to begin with a <Hdr>, as in <Hdr>AusEmma234</Hdr>.[11] The first of the two stages produces only minor transformations, primarily displaying the number of results retrieved and separating each result into a separate line. The results of the first stage are piped to the second stage to produce a view for users.
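The first stage can be sketched as follows (in Python rather than the perl of me-parse.pl; the <SSize> and <Hdr> tags are those shown above):

```python
import re

def first_stage(raw):
    """First-stage handling of PAT ``Quiet mode'' output: report the
    result count from <SSize> and separate the <Hdr>-led results."""
    m = re.search(r"<SSize>(\d+)</SSize>", raw)
    count = int(m.group(1)) if m else 0
    # Each result begins with a <Hdr>; split on that boundary.
    results = re.findall(r"<Hdr>.*?(?=<Hdr>|$)", raw, re.S)
    return count, [r.strip() for r in results]
```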
In the perl code for the Middle English ``Quiet mode'' translation (me-parse.pl), the <SSize> tags are used to highlight the number of results. The OED preprocessing (oed-parse.pl) simply ensures that each result is on a separate line, deferring issues such as using the <SSize> tags to highlight the number of results until a later stage. The ``Quiet mode'' filter used for the TEI is essentially the same as that used for the Middle English collection (teiquiet.pl).
The second stage presents each element to the user in a sort of ``index'' view so that a broader display can be produced. The perl code for this stage usually produces a KWIC view, with each line being a hypertext link to an expanded view of the result. The user sees the <Hdr> content as the link, but the byte offset -- the number of characters into the file, which PAT uses to locate results -- is the actual search component of the link. The first line below, with bracketed information as a hypertext link, is presented to the user. The second line is the actual HTML.

[AB] entations for hypertextual links and other non-hierarchic struc
[<A href=/bin/tei-1500?id=39493>AB</A>] entations for hypertextual links and other non-hierarchic struc

The perl code for the Middle English (me-kwic.pl) is indicative of that needed to produce a KWIC view. Results from the OED are presented to the user as a list of dictionary headwords, from which the entire entry can be displayed; this perl code is available as oedkwic.pl.
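Producing such a line is simple string assembly. A sketch (the function name is ours; the CGI path follows the example above):

```python
def kwic_line(header, offset, context, cgi="/bin/tei-1500"):
    """Build one line of the KWIC ``index'' view: the <Hdr> content
    is what the user sees; the byte offset is what the link submits
    to the context-printing program."""
    return f"[<A href={cgi}?id={offset}>{header}</A>] {context}"
```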
The results of searches, each in a complex SGML designed to support retrievals such as those discussed here, are prepared by the final stage for presentation to a WWW client such as Mosaic by being filtered to HTML. Filtering, in these examples, is again achieved by perl. Most tags from the originating files are specialized and have no corresponding HTML tag. For example, the quoted author element in the OED, <A>, has no corresponding HTML tag, but one might render it as italicized text. Because of this lack of correspondence and the limited number of HTML tags, decisions are largely arbitrary and draw on presentational or aesthetic needs.
The filter, oed.pl, is used to filter OED tags to HTML. This filter and the resulting output demonstrate the challenge of filtering rich, heterogeneous text to HTML. A sample tagged OED entry is included in Figure 1 to illustrate the problems encountered. Information can be presented attractively to a user with a GUI client, despite many compromises in mapping complex tags to simple presentation characteristics. Future improvements in HTML and the ability of clients to interpret a variety of tags should enhance the display of OED entries.
Subsequent to the KWIC view for TEI and Middle English, more thorough transformations are performed on the texts. For example, upon being selected, a result's byte offset is sent to a program that uses PAT to print a 1,500 character context for the result. The results of this 1,500 character view may contain any of the possible tags in the DTD, so a filter that represents all possible element values is created. At this time, the filters fall short in many areas, in particular in their ability to express complex relationships made possible in the DTD. For example, an element, <LIST>, may have an attribute value TYPE="gloss" that suggests it should be converted to the HTML <DL>. Before closing, the element <LIST> may contain other <LIST>s with other TYPE= values. At the present time, these simple stream-oriented filters cannot differentiate between tags with the same name but different functions in this type of complex nesting relationship. Two examples of filters are included here, a filter for the Oxford Text Archive DTD in ota.pl and a filter for the TEI P3 DTD in tei.pl. Currently (August 1994) both represent local display concerns and continue to be enhanced.
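A nesting-aware filter needs only a small stack to pair each </LIST> with its opener, which is precisely what a stream-oriented, line-at-a-time filter lacks. A sketch (in Python; the gloss-to-<DL> mapping follows the text, while the fallback to <UL> is an assumption of ours):

```python
import re

def convert_lists(text):
    """Convert possibly nested <LIST> elements to HTML, using a
    stack so each closing tag matches its own opener's TYPE."""
    out, stack = [], []
    for tok in re.split(r"(<LIST[^>]*>|</LIST>)", text):
        if tok.startswith("<LIST"):
            # TYPE="gloss" maps to <DL>; other lists to <UL> (assumed).
            html = "DL" if 'TYPE="gloss"' in tok else "UL"
            stack.append(html)
            out.append(f"<{html}>")
        elif tok == "</LIST>":
            out.append(f"</{stack.pop()}>")
        else:
            out.append(tok)
    return "".join(out)
```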
Very little preparation of SGML texts is necessary to be able to implement this strategy with PAT. In order to provide a useful indicator of location for KWIC views, we typically add an <ID> element with concise positional information. For example, in our Modern English texts, we add a combination of author, title, and pagination to create an <ID> such as JefLett244 (i.e., Thomas Jefferson, Letters, page 244). These <ID> values are limited to ten characters.
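The composition of these values can be sketched as follows (the abbreviation scheme shown is illustrative; only the author-title-page pattern and the ten-character limit come from the text):

```python
def make_id(author, title, page):
    """Compose a concise positional <ID> value, e.g. JefLett244
    for Jefferson's Letters, page 244, capped at ten characters."""
    stem = author[:3] + title[:4]   # illustrative abbreviation
    return (stem + str(page))[:10]
```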
The TEI Guidelines are complex and make liberal use of minimization (i.e., omission of end tags where they are clearly implied by context), making other steps necessary. These included:
End tags implied by context were made explicit: a chapter that begins with <div1 id=AB> and whose end tag is implied by the beginning of a new chapter is changed to <DIV1 ID="AB"> and </DIV1>.
Some <DIV1> elements did not have id attributes. These were added to aid in retrieval, and included attributes such as id="bibliog" for the bibliography.
An <ID> was created for each <DIV1>, with the value of the id attribute being copied to the <ID>. So, for example, the chapter called ``A Gentle Introduction to SGML,'' <DIV1 id="SG">, now has <ID>SG</ID>. This makes it possible to display the ID value for each <DIV1> in the KWIC views.
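This preparation step is itself a one-line regular-expression transformation. A sketch (in Python; the lowercase id attribute follows the example above):

```python
import re

def add_id_elements(text):
    """Copy each <DIV1> id attribute into an <ID> element, so that
    <DIV1 id="SG"> gains <ID>SG</ID> immediately after it."""
    return re.sub(r'(<DIV1 id="([^"]+)">)', r"\1<ID>\2</ID>", text)
```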
I believe that the strategy described here is an effective method for access to text collections and that it suggests important possibilities for access to other types of resources.[12] The University of Virginia provides access to many collections of resources with clients (i.e., interfaces) designed to support complex analysis, where users can create sets, combine them, and use a range of operations facilitated through a wide array of menus. These clients are frequently much more complicated than is desirable for simple operations such as word lookups in the OED. The strategy outlined here allows users to do simple word lookups in the OED or to formulate simple queries in the text collections without needing to understand PAT syntax or the organization of the collections. However, there is much in this strategy that is suggestive of other possibilities for providing access to collections.
Journal literature in SGML may be successfully accessed through this sort of strategy. For example, a journal run marked up according to the more elaborate AAP (Association of American Publishers) DTD, ISO 12083, could return articles to the user using simple PAT queries. However, this example only scratches the surface of the sorts of strategies that might be possible with the journal collection. Another strategy might set up CGI scripts to facilitate browsing where, for example, a user selecting Browse by author/title would be taken through a series of selections, each mediated by PAT queries.
These strategies have been employed for access to collections at the University of Virginia and the University of Chicago. Open Text has generously allowed the University of Virginia to provide non-UVa access to five of its collections or resources. (For more information on this, see the announcement from March 1994.) To examine most of the strategies discussed here (though not access to the OED), select examples from the test page at the University of Virginia.
LIST elements containing LIST elements will be differentiated both by the nesting and by the attribute values. Until a filter can be written to exploit these characteristics, the conversion will fall short of what is possible.
Executables are declared in the conf area of the NCSA httpd, in the srm.conf. In the srm.conf, create a line for pat executables. For example, if we were to call this area patbin, that entry might read:
ScriptAlias /patbin/ /usr/local/httpd/patbin/
Access can be restricted by creating a file called .htaccess in the patbin directory and declaring an acceptable set of IP addresses. Please see the httpd documentation for further information, but note that it is not important to restrict access to the HTML FORM oed.html, but rather the executable files.
In the OED DTD, the element Entry is abbreviated to E.
<E><HG><HL><LF>debug</LF><SF>debug</SF><MF>debug</MF></HL><MPR> d<i>i&mac.</i>b<i>&reva.</i>&sd.g</MPR><IPR><IPH>di&lm.&sm.b& revv.g</IPH></IPR>, <PS>v.</PS></HG><ET>f. <XR><XL>de-</XL> <SN>II</SN>. <SN>2</SN></XR> +<XR><XL>bug</XL><PS>sb.</PS> <HO>2</HO></XR></ET><p><S4><#>1</#><S6><DEF><PS>trans.</PS> = <XR><XL>delouse</XL><PS>v.</PS></XR></p></DEF><QP><Q><D>1960</D> <A>J. Stroud</A><W>Shorn Lamb</W> vi. 70 <T>We'll..take them round to the Clinic, and..get them debugged there.</T></Q></Q> </S6></S4><p><S4><#>2</#><S6><DEF><LB>slang.</LB>To remove faults from (a machine, system,etc.).</p></DEF><QP><EQ><Q> <D>1945</D> <W>Jrnl. R. Aeronaut. Soc.</W> XLIX. 183/2 <T>It ranged from the pre-design development of essential components, through the stage of type test and flight test and `debugging' right through to later development of the engine.</T></Q></EQ><Q><D>1959</D><W>New Scientist</W> 26 Mar. 674/1 <T>The `debugging' time spent in perfecting a non-automatic programme.</T></Q><Q><D>1964</D> <W>Discovery</W> Oct. 51/3 <T>This failure report plays a vital role in the process by which the scientist corrects or de-bugs his programme.</T></Q><Q><D>1964</D> <A>T. W. McRae</A> <W>Impact of Computers on Accounting</W> iv. 99 <T>Once we have `debugged' our information system. </T></Q><Q><D>1970</D> <A>A. Cameron</A> et al. <W>Computers &. O.E. Concordances</W> 49 <T>Program translation, debugging, and trial runs of the concordance were performed at the University of Michigan Computer Center.</T></Q><Q><D>1970</D> <A>A. Cameron</A> et al. <W>Computers &. O.E. Concordances</W>, 49 <T>By Christmas the program was debugged.</T></Q></QP></S6> </S4><p><S4><#>3</#> <S6><DEF>To remove a concealed microphone or microphones from (a room, etc.); to free of such listening devices by electronically rendering them inoperative. Cf. <XR><XL>bug</XL><PS>sb.</PS><HO>2</HO><SN>3</SN><SN>f</SN></XR>. orig.<LB>U.S.</LB></p></DEF><QP><Q><D>1964</D> <W>Business Week</W> 31 Oct. 
154 (<W>heading</W>) <T>When walls have ears, call a debugging man.</T></Q><Q><D>1964</D><W>Business Week</W> 31 Oct. 154 (<W>heading</W> 158/2 )<T>He quotes high fees for his work, saying that debugging equipment is expensive.</T></Q> <Q><D>1966</D> in Random House Dict. </Q><Q><D>1969</D><W>New Scientist</W> 16 Jan. 128/3<T>`Debugging' the boardroom and the boss's telephone may become as common in industry as in the unreal world of the super-spy. </T></Q><Q><D>1976</D><A>M. Machlin</A> <W>Pipeline</W> xxxi. 353 <T>The room..had steel walls and had been rigorously de-bugged.</T></Q><Q><D>1978</D> <W>Sunday Mail Mag.</W> (Brisbane) 9 Apr. 3/6 <T>Jamil, America's leading `debugging' expert, discovered the secret of an exported `bug' which should not have worked.</T></Q> <Q><D>1987</D><W>Daily Tel.</W> 3 Apr. 1/8 <T>American officials are scrambling to `de-bug' their embassy in Moscow before the arrival of Mr Shultz, Secretary of State, on Monday week.</T></Q></QP></S6></S4><p><S4><SE>Also <BL> <LF>debugging</LF><SF>de&sm.bugging</SF><MF>debugging</MF></BL> <DEF><PS>vbl. sb.</PS> (see senses 2, 3 above).</DEF> </SE></p></S4></E>