[This local archive copy is from the official and canonical URL, http://lingua.arts.klte.hu/allcach98/abst/abs47.htm; please refer to the canonical source document if possible.]
Concurrent Document Hierarchies in MECS and SGML
C. M. Sperberg-McQueen, Claus Huitfeldt
It appears to be generally agreed, by those applying computers to problems of humanistic research, that a well-designed markup language is central to the task of representing textual materials in an electronic form adequate to the needs of research. Current interest focuses on declarative markup languages based on the Standard Generalized Markup Language (SGML), or on its subset the Extensible Markup Language (XML). One of the most persistent challenges for the application of SGML to existing texts is finding suitable representations, in SGML's tree-based data model, for textual features which overlap each other. The most persistent complaint of SGML's critics among humanists is that SGML simply cannot handle such overlapping features. In the general form stated, this claim is untrue, but it is fair to say that handling overlap requires some substantial extensions to what is otherwise a rather simple data model. Overlapping features, however, are common enough in existing texts that almost every system designed for scholarly text processing has some facility for handling overlap, either in the form of dual logical and physical hierarchies (as in John B. Smith's interactive concordance program ARRAS) or in the form of non-hierarchical coding (as in the COCOA tagging supported by the Oxford Concordance Program and other systems). Overlappingtextual features are an inescapable fact of textual life.
This paper will first describe the fundamentals of what we will call the `basic' SGML data model1 and explain how overlap presents a problem for this basic SGML model. It will then discuss two notations invented to overcome that problem: the Multi-Element Code System (MECS) developed by the Wittgenstein Archive at the University of Bergen and the CONCUR feature of SGML itself, and explore some problems which arise in translating between these two notations. In conclusion, it will describe some possible avenues for future work on overlap and related markup problems.
The SGML Basic Model, and Why Overlap is a Problem
In its simplest forms, SGML markup lends itself to a straightforward model for markup interpretation and processing: the features of the text, as the encoder understands them, are represented by SGML elements and by attributes on those elements. Since elements nest within each other and attributes apply to whole elements, an SGML document has a natural representation as a tree structure whose nodes represent elements and whose leaves represent the characters of the text. Each node is labeled with the generic identifier of its element type, and decorated with a set of attribute-value pairs representing its attributes. The legal forms of the document tree may be restricted using a document type definition, which provides a restricted form of context-free grammar; this grammar may be checked by a validating SGML parser.
Tagging a passage of text as an element of a particular type, and assigning particular values to its attributes, is interpreted, according to this model, as a claim that the passage thus tagged exhibits the textual features associated with those element types and attribute values. To discover the features predicated of any given passage, one need in the common case merely consider only the ancestors of the passage in question, which means a traversal from the leaf to the root, noting all the nodes encountered along the way. In the general case, some complications may arise.2
Even allowing for the complications just mentioned, the basic model of SGML semantics associates single occurrences of some feature with single SGML elements; there are one-to-one relationships between individual textual features and individual SGML element types and attributes, and between occurrences of those features and instances of those element types. Overlap presents a problem in this model because while the two textual features may overlap, two nodes in the document tree, and thus two SGML elements, may not.
A variety of methods have been developed to register overlapping features nevertheless. In the TEI Guidelines, for example, textual features may be marked by `milestone' elements, in which a feature is predicated not of the contents of an element but of the span of text between one milestone element and the next. The TEI defines several other techniques, which all rely on the fragmentation of one `logical' element into multiple SGML elements, relying on DTD-specific semantics to instruct an application to knit the fragments into a virtual whole; this is the method used in the part attribute of the TEI <l> (verse-line)element, or more generally with the attributes next and prev and the virtual aggregation element<join>.3 Such application-specific semantics run a higher risk of application failure and complicate the semantic model of the TEI.
A different approach is taken by the Multi-Element Code System (MECS) developed at the Wittgenstein Archives at the University of Bergen. In MECS, as in SGML, a textual feature is marked by one or more tags. MECS defines several types of code: no-element codes, which correspond to SGML elements declared empty; single-element codes, which resemble normal SGML elements, and multi-element codes, which are used both for structures such as lists, with a single homogeneous sequence of substructures, and for structures such as corrections or substitutions, which have a fixed or variable number of subcomponents distinguished positionally.
In MECS, any two codes may overlap.4 This allows for a simple and convenient data notation for overlapping features, which is precisely the same as for non-overlapping features. In a linear scan of the text, it is easy to detect, at any location in the text, what features are predicated of that location. The potential for overlap does, however, make it impossible to represent MECS documents using a conventional tree structure; as a result, it is not simple to find the predicates true of a given location after a random access to the document.5 Unclosed elements can also only be detected at the end of the document.
Perhaps the biggest drawback of allowing unlimited overlap is that MECS is incapable of constraining document structures using a conventional context-free grammar.6
In the data model of MECS, the document is a sequence of characters, each of which can carry any number of features; each feature is represented by a particular element type. Features are atomic; there are no attributes in MECS.
The simplicity of the MECS approach, and the uniformity of its interpretive rules, make it very attractive; the absence of a document grammar, and with it the loss of document validation, make it less attractive. A system which could combine the simplicity of overlap handling in MECS with the document grammar of SGML could be very useful. One might create a system by inventing a new markup formalism, by extending MECS to handle document grammars, or by seeking ways of providing MECS-like behavior in SGML. We here consider the third of these possibilities.
The structural rules of MECS can be represented in SGML by using the SGML feature concur. This feature allows an SGML document to be marked up concurrently using more than one DTD. Each tag is labeled with the name of the DTD to which it pertains; a tag may pertain to multiple DTDs. The document can be parsed and validated according to each DTD independently, or according to some set of DTDs if the implementation supports single-pass multi-DTD parsing; some popularizations claim that only one DTD can be validated at a time, but the text of ISO 8879 makes no such restriction.
[In the full paper, a fuller overview of CONCUR will be given,with examples.]
Bergen DTDs and Tag Set Partitions
An algorithm may be given for generating a set of concurrent DTDs from any MECS tag set. Its essence consists in creating a separate DTD for each type of MECS code in the tag set.7 Such a set of DTDs we call, in honor of the home of MECS, a set of Bergen DTDs.The DTD for a no-element code a will take the form
<!DOCTYPE A [
<!ELEMENT A O O (#PCDATA | a)* >
<!ELEMENT a - O EMPTY >
while that for a one-element code <em>b</em> will take the form
<!DOCTYPE B [
<!ELEMENT B O O (#PCDATA | b)* >
<!ELEMENT b - - (#PCDATA) >
[N.B. these examples are written for an XML system, or for an SGML system set up for case-sensitive names.] The DTDs for multi-element codes are substantially similar.
While such a translation loses no information from the MECS equivalent, it would be more useful if we could reduce the number of DTDs to some minimum number, by merging the DTDs for two elements if they do not overlap each other. The paper will describe semi-automatic and fully automatic methods of partitioning the set of element types so that suitable document grammars can be written.
The advantages and disadvantages of concurrent DTDs for handling textual phenomena will be discussed. The principal advantage is that this method provides both a natural method of handling concurrent structures and the possibility of a document grammar. The principal disadvantages are practical: the notation of concur is not particularly attractive (though it is not much uglier than the current alternatives in the TEI scheme), and the feature is seldom implemented. One disadvantage, however, is theoretical: concur provides no method of specifying constraints on the interrelationship of separate DTDs. In encoding plays like those described by Barnard et al., 1988, one might wish to require that <div> elements in the dramaturgical structure be identical to <div> elements in the metrical structure. The advent of the Extensible Markup Language (XML), however, may change the practical situation: it is much easier to implement concur for the fully normalized documents prescribed by XML than to handle all of its complex interactions with features present in SGML but omitted in XML.8
By using concur, the simplicity of interpretational rules found in MECS (and in the basic SGML model) can be combined with a powerful language for expressing constraints on document structures (the DTD). The theoretical and practical advantages outweigh the practical disadvantages and the humanities computing community should begin serious experimentation concur.
1By the basic data model of SGML we mean a reference model for interpretation and processing of SGML documents, which is often assumed by users and implementors, but which is not itself explicit in or required by the SGML standard. We call it the `basic' model because most SGML systems and DTDs rely heavily on it, but SGML-based markup schemes are not limited to this basic model, as may be seen from the TEI examples mentioned below, which require a more complicated model.
2Descendant nodes may override, rather than supplementing, the meaning of their ancestors. Some information relevant to processing, such as the sequential position of a node among its siblings, may require consideration of siblings. The meaning of some element types may depend in part upon their children or descendants (this is a popular suggestion in discussions of how to rewrite an SGML document type definition [DTD] without attributes); discovering the claims made by such elements about a given passage may require a full traversal of some subtrees.
3For a brief survey of the predication methods used in TEI, see Sperberg-McQueen and Burnard, "The Design of the TEI Encoding Scheme".
4Actually, in the current implementation of MECS, multi-element codes may not overlap, but this has no theoretical justification; instead it is a pragmatic rule made ad hoc to enforce certain constraints of the MECS code set used at the Wittgenstein Archives. Also, the current version of MECS provides no mechanism for co-indexing start- and end-tags, so no MECS element can overlap another element of the same type.
5A simple implementation may simply scan from that location to either the beginning or the end of the document. Other implementations may simplify the operation by using more complex data structures.
6Whether a context-sensitive grammar or some other formalism which augments the power of context-free grammars could be used to constrain document structure in MECS remains an open research question, which the authors intend to address within the framework of their current project.
7Actually, since multi-element codes cannot overlap, a single DTD can accommodate all multi-element codes in the code set.
8N.B. XML does not include the concur feature, and there is no real prospect that it will. But the normalization and well-formedness constraints XML defines can be used to simplify the definition of an XML-like language which does include concur.