Lessons for the World Wide Web
from the Text Encoding Initiative

David T. Barnard
Lou Burnard
Steven J. DeRose
David G. Durand
C.M. Sperberg-McQueen

Abstract:
Although HTML is widely used, it suffers from a serious limitation: it does not clearly distinguish between structural and typographical information. In fact, it is impossible to have a single simple standard for document encoding that can effectively satisfy the needs of all users of the World Wide Web. Multiple views of data, and thus multiple DTDs, are needed. The Text Encoding Initiative (TEI) has produced a complex and sophisticated DTD that makes contributions both in terms of the content that it allows to be encoded, and in the way that the DTD is structured. In particular, the TEI DTD provides a mechanism for describing hypertextual links that balances power and simplicity; it also provides the means for including information that can be used in resource description and discovery. The TEI DTD is designed as a number of components that can be assembled using standard SGML techniques, giving an overall result that is modular and extensible.
Keywords:
SGML, modular DTDs, extensible DTDs, linking mechanisms, header

1. Introduction

The World Wide Web is growing with amazing rapidity, and thus HTML (Hypertext Markup Language) document encoding is in widespread use. Even in the presence of this apparent success, however, there is an underlying confusion of categories that leads to abuse and misuse of HTML--or, perhaps more correctly, to different uses and interpretations of HTML, since HTML itself embodies this ambiguity at a fundamental level.

Although at its inception this was not true, HTML is now defined as an application of SGML (Standard Generalized Markup Language). SGML is a metalanguage for defining document markup. It is defined by an international standard [8], and there is a handbook that interprets the standard [6]. Robin Cover maintains a World Wide Web page that is an excellent source for information about SGML [4]. SGML allows the definition of a set of documents by specifying the components that the documents will contain, the ways in which components can be combined to make larger components and entire documents, and the ways in which the boundaries of components will be indicated in the document. This delineation of components is called markup. The various parts of the formal specification of a document class are gathered together in a document type definition (DTD). For example, a simple DTD for office memoranda might include definitions for a heading and a body, with the heading including to, from, date and subject components and the body containing paragraph components. A component is (usually) delineated by preceding it with its name in angle brackets and following it with its name, preceded by a slash, in angle brackets, as in

<heading> ... <subject>Salary Policy</subject> ... </heading>
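
A DTD for this class of memoranda might be sketched as follows (a minimal illustration with invented element names, not the text of any published DTD):

<!ELEMENT memo    - - (heading, body)>
<!ELEMENT heading - - (to, from, date, subject)>
<!ELEMENT body    - - (paragraph+)>
<!ELEMENT (to | from | date | subject | paragraph) - - (#PCDATA)>

The "- -" pairs indicate that neither the start tag nor the end tag of each element may be omitted.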

Because HTML is an application of SGML, a DTD defines the components of HTML documents and their possible hierarchical relationships [2]. Future versions of HTML promise to be tied ever more explicitly to this formal SGML setting.

Although it makes concessions for the encoding of processing information--such as layout commands--SGML is designed to allow systems to focus on the structure of documents, to describe what is present rather than how it will be processed. In the document processing model adopted by SGML, the description of document formatting (or any other processing) is consciously and explicitly separated from the description of document structure.

The same claim cannot be made for HTML. It contains structural concepts, such as the <P> tag to describe a paragraph. However, the paragraph in HTML is not, strictly speaking, a structural unit that is contained in some units and can contain others. Instead, the paragraph tag indicates a point at which specific processing is to occur as network browsers process HTML files. An even more extreme example of non-structural encoding in some network documents is the indication that text is to blink when presented on the screen--an indication that does not even have a meaning when the document is to be printed.

Of course, the most obvious, perhaps most frequent, and design-anticipated use of documents encoded in HTML is to display them on a screen with a network browser. It is not surprising that this intended application should be--or, at least, should still be--implicit in the document encoding. But this means that even users who would prefer to use a structural encoding cannot do so. Absent (at the moment) style sheets for mapping structural categories to display characteristics, users frequently resort to "tag abuse"--using existing tags for their typographical effects rather than for their structural significance, if any.

The Text Encoding Initiative (TEI) is a large international project sponsored by the Association for Computers and the Humanities, the Association for Computational Linguistics, and the Association for Literary and Linguistic Computing. The inception of the project was at a planning meeting held late in 1987 which was attended by researchers involved in encoding texts for various research purposes (such as the production of critical editions and linguistic analysis) and in producing software to deal with encoded texts. There was agreement among the participants that the diversity of encoding techniques in use was detrimental to sharing of texts, software, and research results.

The three organizations agreed to sponsor a project to develop a common standard for encoding texts of interest to the communities they represented (humanistic researchers, linguists, and others involved in "language industries"). The organizations would support the project by providing members for a Steering Committee and by attempting to raise funds for the development work. Over the next several years, funds were provided by the U.S. National Endowment for the Humanities, Directorate General XIII of the Commission of the European Communities, the Andrew W. Mellon Foundation, and the Social Sciences and Humanities Research Council of Canada.

The project adopted a number of design goals for the Guidelines, and these goals in turn led to a number of important design decisions; both the goals and the decisions are discussed in detail in [11].

The work of the project was carried out by scholars at institutions in North America and in Europe. The main result of the project is a document entitled Guidelines for Electronic Text Encoding and Interchange (TEI P3), edited by Sperberg-McQueen and Burnard [10]. This large document (almost 1300 pages) describes a set of SGML tag sets that together comprise a modular and extensible DTD for encoding a wide range of documents.

The Guidelines can be found online in several places. The official project repository, containing the Guidelines and other project documents, is at ftp://ftp-tei.uic.edu/pub/tei (for users in North America), with mirror sites at ftp://ftp.ifi.uio.no/pub/SGML/TEI (for users in Europe) and ftp://TEI.IPC.Chiba-u.ac.jp/TEI/P3 (for users in Asia); copies are also available at ftp://info.ex.ac.uk/pub/SGML/tei and, in a form that can be searched, at http://etext.virginia.edu/TEI.html. The entire volume of Computers and the Humanities for 1995 is devoted to the TEI; the papers in that volume contain references to other TEI-related articles. In particular, the general papers in that volume are a good introduction to the project [7,11], and there is an introduction to SGML from the perspective of the project [3]. Although SGML served the TEI well, we have identified some ways in which SGML could be improved [1].

In the remainder of this paper we show how some of the results obtained by the TEI can provide insight into the use of documents and document encoding standards on the World Wide Web.

2. Using Multiple DTDs

It became clear early in the work of the TEI that a single comprehensive DTD that could encode every feature of interest to the communities contributing to the project would be so large as to be impossible to understand, and doubtless impossible to design. Further, users of TEI documents are often interested in several views of a document at the same time, so that in effect multiple DTDs were required in any case.

As a result, the TEI DTD has been designed in a modular fashion. A particular document will use only those pieces of the DTD that apply to it. The selection of pieces to include is done using standard SGML mechanisms, so it can be specified to an SGML parser.

Further, the TEI DTD is extensible. Users can add other modules to it, again using standard SGML mechanisms. These extensions can be communicated to an SGML parser--and thus also to other users--in a formal manner, so that the extensions can be as explicitly specified and documented as the basic DTD. The need for extensibility is a direct consequence of the richness and open-endedness of the application areas for electronic documents. In spite of the considerable amount of effort that has gone into designing the TEI DTD, there will inevitably be uses for which it is not well suited, and forms of information that cannot conveniently be encoded using its structures. Our approach to dealing with this has been to provide an explicit extension mechanism.

The modular structure of the TEI DTD is achieved by grouping tag sets into three categories: core tag sets, which are available to all TEI documents; base tag sets, from which each document selects the one appropriate to its basic form (prose, verse, drama, transcribed speech, dictionaries, and so on); and additional tag sets, which can be included as needed for specialized purposes such as linking or analysis.

In a TEI document, there is an explicit indication of which base tag set is being used, and of which additional tag sets are being used. The core tag sets are implicitly present, because they are included in each base.

Documents explicitly indicate which extensions to the TEI DTD they use (technically, in the document type declaration subset). A parser is therefore able to check modifications to the DTD using standard SGML mechanisms, and the notation provides inline documentation of the changes. The modifications are made possible by maintaining two versions of the DTD. There is a version for people to read; this is the version documented in the Guidelines. There is also a version for parsers to read, derived programmatically from the first by the introduction of parameter entities for various purposes. Modifications to the DTD are made by changing the values of these parameter entities, thus changing the DTD that the parser expands.
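
For example, a TEI document for prose that also uses a linking tag set might begin roughly as follows (a sketch following our reading of the TEI P3 conventions; the exact entity names and system identifier should be checked against the Guidelines):

<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
<!ENTITY % TEI.prose   'INCLUDE'>
<!ENTITY % TEI.linking 'INCLUDE'>
]>

Setting these parameter entities causes the parser to expand the corresponding parts of the DTD, so the selection of tag sets is both machine-checkable and self-documenting.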

The modifications that are possible are these: elements can be renamed; elements can be deleted; new elements can be added to existing element classes; and the content models of existing elements can be modified.
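
As an illustration of renaming, TEI P3 reserves a parameter entity for each element's generic identifier, so a rename is a one-line redefinition in the declaration subset (a sketch of the convention as we recall it; the Guidelines give the authoritative details):

<!ENTITY % n.p 'para'>

With this redefinition, the paragraph element is tagged <para>...</para> rather than <p>...</p>, while its content model and attributes are unchanged.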

It would be possible to use the parameter entity mechanism for other purposes as well: changing attribute names, redefining existing attributes, changing the inclusion and exclusion exceptions for an element, and so on. The set of modification possibilities given here was considered sufficient for most of the things that users said they needed to do.

The experience of the TEI in designing a complex DTD leads to several conclusions that are of interest to the World Wide Web community. First, a single fixed DTD, no matter how well it is designed, can never serve all users equally well. There need to be ways to specify structures not yet anticipated. Second, it is possible to design DTDs--or DTD families--that are modular and extensible. The TEI tag sets provide a model for how this can be done. Third, a rich set of structures can already be described with the existing TEI DTD, and it can thus already be used for a rich variety of applications. We encourage readers to consider it for their applications.

We now turn to two specific content areas addressed by the TEI DTD that demonstrate helpful ways to use SGML for encoding information of value in World Wide Web applications. These are the specification of hypertext links, and the description of documents and their contents.

3. Linking Mechanisms

The World Wide Web has grown because of its simplicity. In particular, the concept of a Uniform Resource Locator (URL) is a simple one: a text string provides an address of a location in a file on a machine on the network. However, the simplicity that contributes to rapid growth is limiting. URLs cannot locate a portion of text or a substructure in a document, they cannot easily specify how links might be related in sets, and they cannot specify any semantics to be associated with a link.

Another approach for specifying hypertext links is to use the HyTime standard [9] (the book by DeRose and Durand contains a description of HyTime [5]). HyTime does not suffer from being too simple. It is, in fact, very powerful; it allows for very general cases of hypermedia links to be specified. Links can be separated from objects (documents), complex relationships can be specified, coordinate systems can be defined and parts of documents selected based on those coordinate systems, and so on.

In our view, URLs are too simple to encode many of the structures that are common in and among documents on the World Wide Web, and HyTime provides (and requires) a more powerful mechanism than many applications will need. The TEI DTD provides a mechanism that attempts to balance power and complexity.

The TEI DTD provides linking mechanisms for building several different kinds of structure. Simple links within a document are formed using the SGML "id" and "idref" mechanism. Links between documents are provided through extended pointers, which exist in two forms: an empty element (<xptr>) that encodes a pointer pure and simple, and an element with content (<xref>) that supplies the text of a cross-reference.

These extended pointers build on the SGML id and idref mechanism. The extensions are specified by giving strings as the values of SGML attributes. These strings must be interpreted by application software that understands their significance.

The definition of extended pointers uses a sequence of location terms, each of which refines the location identified by its predecessors: for example, by selecting the element with a given identifier, a child by position and type, or a span of text matching a pattern.
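
For example, a simple internal link and an extended pointer into another document might be encoded roughly as follows (a hedged sketch: <anchor>, <ptr>, and <xptr> are TEI elements, but the identifiers are invented and the location-term syntax shown is our reading of the Guidelines):

<anchor id="n12">
<ptr target="n12">
<xptr doc="report2" from="ID (sec2) CHILD (3 P)">

The first two lines use the ordinary id/idref mechanism; the extended pointer addresses the third paragraph child of the element with identifier sec2 in the document named report2. It is the application, not the SGML parser, that evaluates the location terms.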

We will not give the details of extended pointers here; they can be found in the Guidelines. What is of interest here is the kinds of structures that can easily be encoded using the mechanisms provided by the TEI DTD. Here are some examples.

A segment is a portion of a document. It can be used as the point of attachment of a link. An arbitrary structure can be defined as a segment.

An anchor is an arbitrary point in a document. It can be used as the point of attachment of a link. (This is similar to the definition of a name on an anchor in HTML.)
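
Both are encoded quite simply; a minimal sketch with invented identifiers:

<seg id="s9">a span of running text</seg>
<anchor id="a4">

Either identifier can then serve as the target of a pointer or an extended pointer.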

A correspondence can be established between one span of content and another. For example, there might be a correspondence between a fragment of a document and someone's comments on that fragment.

An alignment shows how two documents (or fragments) are related. For example, there could be an alignment between a document in one language and another document that is the translation into a second language. An alignment can be specified outside the two documents (or fragments) that are aligned.
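
For example, a sentence-by-sentence alignment of a text and its translation might be recorded in a free-standing group of links, roughly as follows (a sketch using the TEI <linkGrp> and <link> elements; the identifiers are invented, and for alignment across separate documents the targets would themselves be extended pointers):

<linkGrp type="translation">
<link targets="en.s1 fr.s1">
<link targets="en.s2 fr.s2">
</linkGrp>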

A synchronization is a relationship that represents temporal rather than textual correspondence. For example, it is often necessary to synchronize overlapping text segments in a representation of speech where several speakers can be talking at the same time, or at least in overlapping speeches.

An aggregation is a collection of fragments into a single logical whole. For example, the set of passages in a document relating to a specific topic, such as the set of paragraphs that discuss indexing in a paper on information retrieval, would be an aggregate.
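
The TEI <join> element encodes such an aggregate by listing the identifiers of its parts (a minimal sketch with invented identifiers):

<join id="ix1" targets="p12 p31 p40">

Software can then treat the three paragraphs bearing those identifiers as a single virtual element.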

Multiple hierarchies occur, essentially, when more than one tree is to be considered as being built over the same textual frontier. For example, the logical structure of a document (chapters, sections, paragraphs) and its physical structure (pages, lines) are two different hierarchies over the same frontier. Although the SGML CONCUR feature can be used to specify structures of this sort, it has a number of associated problems: when a document is changed by the addition of a new view, it may be necessary to change existing markup (by adding a prefix indicating the view to which the existing tags belong); the coding of tags becomes more verbose than it would otherwise be; and many SGML applications at present do not implement the feature. The TEI DTD instead provides tags to mark page and line boundaries, and thus, in a rudimentary way, provides for this second commonly required hierarchy. The more general approach is to mark the boundaries of the elements in the multiple hierarchies and to reconstitute each view, essentially by using aggregates.
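
The page and line boundary tags just mentioned are empty "milestone" elements; because they mark points rather than enclosing content, they can cut across the logical hierarchy freely. A minimal sketch:

<p>The paragraph begins on one page<pb n="38">and ends on the next,
<lb>with a line boundary marked as well.</p>

An application can reconstruct the physical view (pages and lines) from these markers while the SGML element structure records only the logical view.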

These structures, identified by participants in the TEI as useful for encoding documents for research purposes, seem to us to be useful in many other contexts on the World Wide Web as well. The TEI DTD provides mechanisms for encoding these structures in relatively straightforward ways, and they can be used without building into Web application software all of the processing power that HyTime requires.

4. Resource Identification and Discovery

The World Wide Web contains many documents in many locations. One of the major challenges in a complex distributed environment like this is the identification and discovery of documents that are relevant to some task. In a traditional library, resources are identified by the preparation of catalog information in a restricted but rich and dynamic domain of categories. Identifying relevant resources often involves the expertise of the person who needs information, various programs that have access to catalogs for relatively simple searches, and experts in the domain of interest (subject librarians). While the search techniques applied to catalogs are relatively simple, the catalogs contain explicitly coded information about subject areas so that searches are usually able to identify a useful collection of materials.

Information retrieval in collections of electronic documents involves the expertise of the person who needs information, sophisticated search programs, and sometimes experts in the domain (subject librarians). Information can be labeled with various category attributes, but larger amounts of text (abstracts, and perhaps complete documents) can be searched. Because there is little or no explicit encoding of the information in the text, sophisticated algorithms are often used to attempt judgements about relevance of a document based on the occurrences of patterns in the text.

Identifying relevant resources on the World Wide Web can take several forms. It can involve searching through structured subject indexes as in traditional library access, as well as searching through the text of documents as in traditional information retrieval.

But because the Web contains so many documents--orders of magnitude more than most databases used with traditional search strategies--identifying relevant resources can be difficult. It would seem attractive to allow documents to describe themselves so that a rich domain of categories can be used, and so that judgements about relevance do not need to be restricted to algorithmic approximations.

Documents encoded according to the TEI DTD must include a TEI header that contains information about the electronic document. The information in the header can be used to facilitate the identification of resources and their discovery by search programs and by manual browsing.

The header has four major parts.

A file description contains a full bibliographical description of the electronic document. A standard bibliographic citation can be derived from this information, so it could be used to make a standard library catalog record. This part of the header also includes information about the source of the electronic document (for example, the document may appear originally in electronic form, or it may have been transcribed from a printed source).

An encoding description describes the relationship between the source and the electronic document. This part of the header can describe any normalizations applied to the text, the specific kinds of analytic encoding that have been used, and so on.

A text profile contains information that classifies the text and establishes its context. This part of the header describes the subjects addressed, the situation in which the text was produced, those involved in producing it, and so on. This part can be used with a fixed vocabulary of subjects, for example, to catalog texts into some predefined subject structure; or it can be used more freely to allow a dynamic subject universe.

A revision history allows the encoding of a history of changes made to the electronic document. This part of the header is useful for the identification and control of versions of a document.

Each part of the header is potentially complex and can contain extensive amounts of information. Most parts of the header are optional, though, so they need be used only where document developers consider them useful or necessary. A minimal header contains a file description including a title, publication statement, and source, together with a text profile identifying the language in which the document is written.
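
A minimal header along these lines might look roughly as follows (a sketch: the element names follow TEI P3, the content is invented, and the precise required structure should be checked against the Guidelines):

<teiHeader>
<fileDesc>
<titleStmt><title>Salary Policy: an electronic edition</title></titleStmt>
<publicationStmt><p>Distributed by the author.</p></publicationStmt>
<sourceDesc><p>Born-digital; no printed source exists.</p></sourceDesc>
</fileDesc>
<profileDesc>
<langUsage><language id="en">English</language></langUsage>
</profileDesc>
</teiHeader>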

To take best advantage of the mass of information that is available on the Web, users must be able to find the documents that are relevant when they are looking for information. The best way to facilitate this is to have documents identify and describe themselves.

The TEI header is an example of how documents can be made to be self-identifying. Documents with a header can be indexed in ways that are considered to be appropriate by their developers. Information that is provided can be used by readers of Web documents, and by programs that search the Web to identify relevant resources for readers.

5. Conclusion

The World Wide Web is based on a set of simple tools and concepts, including HTML, that have made possible a phenomenal rate of acceptance and growth. These simple notions, though, will not be sufficient to support continued growth and a diversity of applications.

There are various ways in which full SGML can be provided on the Web, including varieties of server-side processing (such as mapping more complex structures to HTML for delivery to clients) and client-side processing (such as spawning applications that are capable of dealing with general SGML DTDs or a specific DTD).

The Text Encoding Initiative has developed a comprehensive specification for a DTD that provides a richer set of structures in a modular extensible framework. The DTD itself, together with its structuring principles and the specific contributions for hypertext links and for resource description, suggest fruitful approaches to developing and enhancing the World Wide Web.

References

1. Barnard, David T., Burnard, Lou and Sperberg-McQueen, C.M., Lessons Learned from Using SGML in the Text Encoding Initiative, Computer Standards and Interfaces, (accepted February 1995). Also appeared as Technical Report 95-375, Department of Computing and Information Science, Queen's University (1995).

2. Berners-Lee, T., and Connolly, D., Hypertext Markup Language - 2.0, <draft-ietf-html-spec-06.txt>, Boston, HTML Working Group, September 1995.

3. Burnard, Lou, What is SGML and How Does It Help?, Computers and the Humanities 29,1, 1995, 41-50.

4. Cover, Robin, SGML Web Page, http://www.sil.org/sgml/sgml.html, 1994.

5. DeRose, Steven J., and Durand, David G., Making Hypermedia Work: A User's Guide to HyTime, Boston/Dordrecht/London, Kluwer Academic Publishers, 1994, xxii + 384 pages.

6. Goldfarb, Charles, The SGML Handbook, Oxford, Oxford University Press, 1990, 688 pages. Contains the full annotated text of ISO 8879 (with amendments).

7. Ide, Nancy, and Sperberg-McQueen, C.M., The Text Encoding Initiative: Its History, Goals, and Future Development, Computers and the Humanities 29,1, 1995, 5-15.

8. ISO (International Organization for Standardization), ISO 8879-1986 (E) Information Processing--Text and Office Systems--Standard Generalized Markup Language (SGML), Geneva, International Organization for Standardization, 1986.

9. ISO (International Organization for Standardization) ISO/IEC 10744:1992 Information Technology--Hypermedia/Time-based Structuring Language (HyTime), Geneva, International Organization for Standardization, 1992.

10. Sperberg-McQueen, C.M., and Burnard, Lou, (eds.), Guidelines For Electronic Text Encoding and Interchange (TEI P3), Chicago and Oxford, ACH-ACL-ALLC Text Encoding Initiative, May 1994, 1290 pages.

11. Sperberg-McQueen, C.M., and Burnard, Lou, The Design of the TEI Encoding Scheme, Computers and the Humanities 29,1, 1995, 17-39.

About the Authors

David T. Barnard http://www.qucis.queensu.ca/home/barnard/info.html
Queen's University, Kingston, Canada
David T. Barnard joined the Department of Computing and Information Science at Queen's University in 1977, having studied at the University of Toronto. He is now Professor in that Department. His research applies formal language analysis to documents, treating them as members of a formal language, and to compiling programming languages, with a focus on parallel machines. He chaired one of the working committees of the Text Encoding Initiative and is now a member of the Steering Committee of the project.

Lou Burnard
Oxford University Computing Services, Oxford University, England
Lou Burnard is Humanities Computing Manager at Oxford University Computing Services. His responsibilities include the Oxford Text Archive, which he founded in 1976, and the British National Corpus. He is also European editor of the Text Encoding Initiative, and co-author of a report proposing the establishment of a networked UK Arts and Humanities Data Service.

Steven J. DeRose
Senior Systems Architect, Electronic Book Technologies, Inc.

David G. Durand http://cs-www.bu.edu:80/students/grads/dgd/
Computer Science Department, Boston University

C.M. Sperberg-McQueen http://www-tei.uic.edu/~cmsmcq/
University of Illinois at Chicago
C. M. Sperberg-McQueen is a senior research programmer at the computer center of the University of Illinois at Chicago. He currently works in the Network Information Services group. He was trained in Germanic philology in the U.S. and Germany, and is a member of the Association for Computers and the Humanities, the Association for Literary and Linguistic Computing, and the Association for Computational Linguistics. Since 1988 he has been editor in chief of the ACH/ACL/ALLC Text Encoding Initiative.