Capitalizing on Text Structures

Frank Wm. Tompa

University of Waterloo
fwtompa@uwaterloo.ca

Keywords: structured text databases

Extended Abstract

Scholarship increasingly depends on electronic document repositories and the growth of digital libraries. As in physical libraries, the documents to be housed in scholarly collections include historical documents, literary works, reference texts, and government publications. Even more apparent in computer-readable form are collections of business documents (from annual reports and customer literature to procedures manuals and internal communications) and linguistic corpora (collections of spoken and written communication assembled to reflect the uses of language). Gray literature, including technical reports, personal communications, and online help information, also constitutes a growing text resource.

SGML provides a method to describe the structure of a complex document in which components, layout, or other chosen features of the text are indicated through markup. The TEI Guidelines use SGML to define a set of comprehensive conventions for representing documents, and thus they establish a basis for scholarly communications. HTML defines another set of tags to delineate text structures. Beyond text representation, however, communications support also requires mechanisms for querying and manipulating structured documents.

The UW Centre for the New OED and Text Research was founded in 1985 to apply database technology to the management of structured text. As part of the project, we developed flexible and efficient search and display software, embodied in the Pat text search engine and the Lector general-purpose browser. Each entry in the search index designates a "semi-infinite" string that starts at a critical point in the text (e.g., at a word start) and continues without interruption to the end of the text; this provides the basis for full-text search. A separate index for each interesting set of text regions is also maintained, using a pair of semi-infinite strings to designate the extent of a region. Such regions typically, but not necessarily, coincide with structural units identified by matching pairs of start and end tags. Any stream of tagged text can be formatted to the screen by the browser, using the tags to dictate a typography that illustrates the text's structure. Style sheets can be defined using a specially designed formatting, or display-specification, language, and thus the choice of typographical strategies is user-selectable. Used together, Pat and Lector form a powerful query facility for text-dominated databases.
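The indexing idea described above can be illustrated with a minimal sketch in Python. It is not the Pat implementation: the function names (build_index, search, region_index, within) and the sample text are assumptions made for illustration only. The sketch sorts word-start offsets by the semi-infinite string beginning at each offset, finds prefix matches by binary search, and records each region as a pair of offsets.

```python
import re
from bisect import bisect_left, bisect_right

def build_index(text):
    """Sort word-start offsets by the semi-infinite string that begins at each one."""
    starts = [m.start() for m in re.finditer(r"\b\w", text)]
    return sorted(starts, key=lambda i: text[i:])

def search(text, index, prefix):
    """Return, in text order, the indexed offsets whose string begins with prefix."""
    # Truncating sorted strings to a fixed length keeps them in sorted order,
    # so binary search applies. A real index would compare lazily rather than
    # materialise this list on every query.
    keys = [text[i:i + len(prefix)] for i in index]
    lo = bisect_left(keys, prefix)
    hi = bisect_right(keys, prefix)
    return sorted(index[lo:hi])

def region_index(text, tag):
    """Record each <tag>...</tag> region as a (start, end) pair of offsets."""
    pattern = re.compile(rf"<{tag}>(.*?)</{tag}>", re.S)
    return [(m.start(1), m.end(1)) for m in pattern.finditer(text)]

def within(matches, regions):
    """Keep only the match offsets that fall inside some region."""
    return [m for m in matches if any(s <= m < e for s, e in regions)]

# Hypothetical sample text, invented for this sketch.
sample = "<s>the cat</s> <s>sat on the mat</s>"
index = build_index(sample)
hits = search(sample, index, "the")
sentences = region_index(sample, "s")
print(hits, sentences, within(hits, sentences[1:]))
```

Because regions are stored apart from the word index, a new set of regions can be added without rebuilding the index of semi-infinite strings.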

Neither Pat nor Lector requires regions of two types to be consistently related: some occurrences of X may include occurrences of Y, some Ys may include Xs, and some X regions can overlap Y regions arbitrarily within a single document. Tree-based models and corresponding software can be defined to provide simpler access to hierarchically constrained texts, such as those more typically defined using SGML. The Goedel database system, also defined at Waterloo, provided such software, which was effective in programming the conversion of the text of the OED into a draft for the Shorter OED. More recently, we have created software that, like Goedel, is based on a tree model of text but, like Pat, allows trees to be dynamically described rather than requiring them to be fixed in the text.
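The possible relationships between two regions can be made concrete with a short Python sketch. It assumes regions are half-open (start, end) offset pairs; the names relate and is_hierarchical are hypothetical and do not come from Pat, Lector, or Goedel. A tree view can be imposed on a set of regions only when no two of them overlap partially.

```python
def relate(x, y):
    """Classify how region x relates to region y (half-open offset pairs)."""
    (xs, xe), (ys, ye) = x, y
    if xe <= ys or ye <= xs:
        return "disjoint"
    if xs <= ys and ye <= xe:
        return "includes"
    if ys <= xs and xe <= ye:
        return "included"
    return "overlaps"

def is_hierarchical(regions):
    """True if no pair of regions overlaps partially, so the set nests as a tree."""
    return all(
        relate(a, b) != "overlaps"
        for i, a in enumerate(regions)
        for b in regions[i + 1:]
    )

print(relate((0, 10), (2, 5)), relate((0, 6), (4, 10)))
print(is_hierarchical([(0, 10), (2, 5), (6, 9)]))
```

Checking the constraint at query time, rather than requiring it of the stored text, is one way to describe a tree dynamically over regions that are otherwise unconstrained.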

One popular data model for domains other than text is based on tables rather than on trees, and many texts contain within them tabular, or nearly tabular, data. For example, a simple bibliography may be interpreted as a table with columns for author, title, date, publisher, location, etc. Since most text does not have such regular structure, however, texts cannot naturally be described using such a model. But just as sections of texts can be viewed as trees, tabular views can be dynamically imposed on sections of texts to provide simpler access to those related fields. Thus, even though a text does not have a regular and consistent tabular structure, local regularity can be temporarily imposed in order to form a response to a query.
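As a rough illustration of imposing a tabular view on part of a text, the Python sketch below extracts rows from lines that happen to follow a regular pattern and ignores the rest. The entry format ("Author. Title. Publisher, Year."), the name tabular_view, and the sample text are all assumptions invented for this example.

```python
import re

# Assumed entry shape: "Author. Title. Publisher, Year."
ENTRY = re.compile(
    r"^(?P<author>[^.]+)\. (?P<title>[^.]+)\. "
    r"(?P<publisher>[^,]+), (?P<date>\d{4})\.$"
)

def tabular_view(text):
    """Return one row (a dict) per line that fits the pattern; skip other lines."""
    rows = []
    for line in text.splitlines():
        m = ENTRY.match(line.strip())
        if m:
            rows.append(m.groupdict())
    return rows

# Hypothetical sample: one line of prose followed by two regular entries.
sample = (
    "Some running prose that has no tabular shape.\n"
    "Smith. Text Models. Acme Press, 1990.\n"
    "Jones. Indexing. Example Books, 1994.\n"
)
rows = tabular_view(sample)
print(rows)
```

The table exists only for the duration of the query; the text itself is left unchanged, and lines that lack the regular structure simply contribute no rows.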

A major advantage of providing tabular views of text is that software can be defined to access text using SQL, the data manipulation language commonly used in relational database management systems such as those provided by Oracle, IBM, Sybase, and Microsoft. As a result, tables of components can be extracted from texts and subsequently counted, aggregated, and rearranged before being reassembled for display. The Text/Relational Database Management System we have implemented at Waterloo provides simultaneous access to text and to other relational databases, and provides powerful operators to query the contents of texts.
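The following Python sketch shows, in a generic way, how components extracted from a text might be counted and aggregated with ordinary SQL. It uses the standard sqlite3 module as a stand-in and does not reproduce the Waterloo Text/Relational Database Management System or its proposed SQL extensions. The table name, column names, and sample rows are assumptions invented for illustration.

```python
import sqlite3

def count_by_publisher(rows):
    """Load extracted rows into an in-memory table and aggregate them with SQL."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE entry (author TEXT, title TEXT, publisher TEXT, date INTEGER)"
    )
    con.executemany(
        "INSERT INTO entry VALUES (:author, :title, :publisher, :date)", rows
    )
    result = con.execute(
        "SELECT publisher, COUNT(*) FROM entry GROUP BY publisher ORDER BY publisher"
    ).fetchall()
    con.close()
    return result

# Hypothetical rows, as might be produced by a tabular view of a bibliography.
extracted = [
    {"author": "Smith", "title": "Text Models", "publisher": "Acme Press", "date": 1990},
    {"author": "Jones", "title": "Indexing", "publisher": "Example Books", "date": 1994},
    {"author": "Lee", "title": "Markup", "publisher": "Acme Press", "date": 1995},
]
counts = count_by_publisher(extracted)
print(counts)
```

Once the components are in tabular form, they can be joined with other relational data using the same query language.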

The talk will be illustrated by examples chosen to demonstrate these various data models, which have been found to be useful in describing texts, and the associated operators that allow applications to be built to identify and extract interesting components from structured text.

Related publications

A.Salminen and F.W.Tompa, "Grammars++ for Modelling Information in Text," 1996. (ftp://cs-archive.uwaterloo.ca/cs-archive/CS-96-40/CS-96-40.ps.Z)

D.R.Raymond, F.W.Tompa, and D.Wood, "From Data Representation to Data Model: Meta-Semantic Issues in the Evolution of SGML," Computer Standards & Interfaces, Vol. 18 (1996) 25-36.

G.E.Blake, M.P.Consens, I.J.Davis, P.Kilpelainen, E.Kuikka, P.-A.Larson, T.Snider, and F.W.Tompa, "Text / Relational Database Management Systems: Overview and Proposed SQL Extensions," 1995. (see http://solo.uwaterloo.ca/trdbms/)

A.Salminen and F.W.Tompa, "PAT Expressions: an algebra for text search," Acta Linguistica Hungarica, Vol. 41, 1-4 (1992-93) 1994, 277-306.

G.E.Blake, T.Bray, and F.W.Tompa, "Shortening the OED: Experience with a Grammar-Defined Database," Trans. on Information Systems, Vol. 10, No. 3 (July 1992) 213-232.

Frank Wm. Tompa, "An Overview of Waterloo's Database Software for the OED," Proceedings of the Symposium on Historical Dictionary Databases and Data Retrieval Requirements (Toronto, October 1991), in CCH (Centre for Computing in the Humanities) Working Papers 2 (1992) pp.123-143.

D.R.Raymond and F.W.Tompa, "Hypertext and the Oxford English Dictionary," Comm. ACM, Vol. 31, No. 7 (July 1988), 871-879.