The OASIS Cover Pages: The Online Resource for Markup Language Technologies
Last modified: December 17, 2008
Markup Languages and (Non-) Hierarchies

[August 05, 2002] Bracketed (braced) markup languages use hierarchical 'containment' or 'nesting' structures to organize pieces of information. The hierarchy can be mapped to nodes in a tree/graph, optimized for several kinds of processing. While hierarchical representation is well-suited to serialization, storage, and other processing goals, it is not necessarily a good match for modeling information about tangible objects, procedures, events, and processes — as conceptualized by users. Few things in the universe are quintessentially hierarchical in nature. Arguably, a document does not have a "logical structure" that can be directly modeled by a hierarchical representation. Most naturally occurring physical objects and human-made artifacts (like texts) are susceptible to many kinds of analysis, suggesting multiple (simultaneous, overlapping) hierarchies, and typically involve non-hierarchical relationships.

David Durand wrote in 1996: "... Methods of tagging all of these currently exist in the TEI, in the form of particular tags for particular perspectives. But since we now know that the breaking of strict hierarchies is the rule, rather than the exception, it is time to determine what additional features are required from markup systems to make the formal description of such non-hierarchical phenomena straightforward..."

Note: The following reference list was only lightly maintained after July 2005. See the References section in the Balisage 2008 paper of Sperberg-McQueen and Huitfeldt for additional/recent citations, and hints in the program listing for the 2008-12 Goddag Workshop.


  • TEI SIG:Overlap. "The goal of the TEI Overlapping Markup SIG is to bring together users of the TEI who are acutely interested in issues of multiple hierarchies and in particular handling those in XML."

  • [December 17, 2008] "Essential Hierarchy." By Jeni Tennison. Blog. "In [a recent] post I discussed the kinds of situations where overlapping markup can appear in documents, and the distinction between containment, when one element happens to contain another, and dominance, where the relationship between the two elements is more meaningful. Here I'll expand a bit more on the issue of whether dominance relationships are or should be part of the essential information in the document... Overlap is arguably the main remaining problem area for markup technologists. Capturing and analysing the overlap between poetic and syntactic structures in poems and plays helps academics gain a deeper understanding of the ways poetic technique has changed over time. And the complexities of structures in documents such as the Bible simply cannot be represented without allowing overlap to happen. But academic study aside, overlap is a really important problem because whenever we collaborate on documents and whenever we change documents, we create overlapping structures. One of the major projects that I've worked on at TSO deals with publishing consolidated legislation, showing the places where 'current' legislation was amended over time from its original, enacted state. The authors of legislation care little for document structures, and amendments often overlap document structures such as paragraphs and list items, and each other... When you're talking about overlapping structures, it's useful to make the distinction between structures that contain each other and structures that dominate each other. Containment is a happenstance relationship between ranges while dominance is one that has a meaningful semantic. A page may happen to contain a stanza, but a poem dominates the stanzas that it contains. In LMNL, we view a document as consisting of a sequence of atoms, usually characters, and ranges over those characters.
But the model makes no assertions about dominance relationships between the ranges. This document model is easy to construct from a serialised document like the one above. Conversely, GODDAG document models are directed acyclic graphs (DAGs): the nodes within those graphs have children and parents, with leaf nodes containing characters, and the parent-child relationship implies dominance. This is a useful model for processing, and particularly querying. Navigating a DAG is a lot like navigating a tree, just one that represents multiple hierarchies. But it isn't possible to construct a DAG from a serialised document like the one above without extra information about which containment relationships are actually dominance relationships, and which mere happenstance. [...] James Clark commented: ... 'I would be inclined to start by designing the information model first and then figure out a syntax to represent that information model. Maybe I'm just brainwashed by too much XML/SGML, but the hierarchical relationships seem like a fundamental aspect of the information about the document which the markup should be capturing explicitly.' [...] As well as overlap, LMNL has weird things like structured and ordered annotations, atoms, and anonymous ranges. In that spirit, I want to see if we can get away with not having hierarchy as a fundamental part of the information model. Does this allow us to do things that we couldn't otherwise do, or is it a burden? I don't know yet... If the syntax for expressing hierarchies is that verbose and difficult to use, people won't use it, and we'll have to find a way to add dominance relationships programmatically. We might as well start from that point. But perhaps someone out there can come up with a clean, elegant syntax for expressing dominance within overlapping markup?"
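The containment/dominance distinction described in the entry above can be illustrated with a minimal sketch. It is not LMNL or GODDAG code: the `Range` class, the helper functions, and the explicit `DOMINANCE` set are all hypothetical names chosen for illustration. The point is only that containment can be computed from offsets, while dominance has to be stated as extra information.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """A named range over a character sequence (hypothetical illustration)."""
    name: str
    start: int  # inclusive offset
    end: int    # exclusive offset


def contains(outer: Range, inner: Range) -> bool:
    """Containment: purely positional, derivable from the offsets alone."""
    return outer.start <= inner.start and inner.end <= outer.end


def overlaps(a: Range, b: Range) -> bool:
    """True when two ranges share text but neither contains the other."""
    shares_text = a.start < b.end and b.start < a.end
    return shares_text and not contains(a, b) and not contains(b, a)


# Dominance is NOT derivable from offsets; here it is declared explicitly
# as (dominating name, dominated name) pairs. This set is an assumption.
DOMINANCE = {("poem", "stanza")}


def dominates(outer: Range, inner: Range) -> bool:
    """Dominance: a declared, meaningful relation that also implies containment."""
    return (outer.name, inner.name) in DOMINANCE and contains(outer, inner)


poem = Range("poem", 0, 10)
page = Range("page", 0, 8)
stanza = Range("stanza", 0, 6)
sentence = Range("sentence", 4, 10)

print(contains(page, stanza))    # True: the page happens to contain the stanza
print(dominates(page, stanza))   # False: no declared dominance
print(dominates(poem, stanza))   # True: declared and positionally consistent
print(overlaps(page, sentence))  # True: neither contains the other
```

A serialized document with only start and end positions supplies enough to evaluate `contains` and `overlaps`; the `DOMINANCE` set stands in for the "extra information" the entry says is needed to build a DAG.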

  • [August 15, 2008] "Markup Discontinued: Discontinuity in TexMecs, Goddag Structures, and Rabbit/Duck Grammars." By C. M. Sperberg-McQueen and Claus Huitfeldt. Paper presented at Balisage: The Markup Conference (12-15 August 2008). "This paper describes work in progress on the problem of discontinuous structures undertaken in the context of the project Markup Languages for Complex Documents (MLCD) at the University of Bergen. The first section describes the problem in simple terms. Subsequent sections describe the problem as it relates to other work of the MLCD project: the TexMecs notation for marked up documents in serial form, to the Goddag data structure, and to validation with rabbit/duck grammars. We consider the problem of discontinuity in the light of a recent proposal to treat containment and dominance as distinct relations. We suggest that this proposal allows a solution (or dissolution) of the problem, at least for Goddag structures. In the context of validation, a number of problems remain for future work... In its most general form, the problem of discontinuity is the question 'How can we represent this passage, and others like it, in a satisfactory way?' This general problem takes more specific forms in the context of specific notations, data structures, and validation tools... The problem encountered here is not new; it has been recognized for some years... Mechanisms like those defined by the TEI are not, of course, understood or supported by generic XML tools; they require vocabulary-specific support by a different layer of code. One goal of the TexMecs notation first specified in 2001 and revised in 2003 is to move such information about virtual elements and element discontinuity out of the application level and into the level of basic syntax... In the MLCD project, efforts to define that more abstract layer have focused on the definition of generalized ordered-descendant directed acyclic graphs (Goddags)...
In recent work reported elsewhere, we have for other reasons begun to contemplate distinguishing more sharply than heretofore between dominance (regarded as the transitive closure of the parent/child relation) and containment (regarded as superset/subset relation on the leaf nodes reachable from a node by following parent-child arcs)... The MLCD project has proposed rabbit/duck grammars as a possible mechanism for validating overlapping structures... LMNL, the layered markup and annotation language, cannot readily make a distinction between containment and dominance (we are indebted to a reviewer of this paper for this observation), and so seems unlikely to satisfy the competing demands which suggest to us that both concepts are needed... XCONCUR and similar mechanisms already incorporate the containment/dominance distinction to a certain degree. The usual rules for hierarchical markup ensure that in a set of concurrent trees, dominance entails containment, but the fact that two elements in a containment relation may be in different trees ensures that in systems with concurrent markup, the relations of containment and dominance are systematically separate and distinct... Mechanisms like Trojan Horse markup can be used to serialize discontinuous elements, provided the empty XML elements used to mark the boundaries of virtual elements have ways to signal that they record a resumption, rather than only a beginning or end, of the virtual element... [So] we have proposed that a graph structure which more nearly reflects our intuitions about the document can be constructed if we retain the principle that parent/child and ancestor/descendant relations imply that the ancestor contain the descendant, but jettison the converse principle that any element properly contained by another element is necessarily a descendant of (dominated by) that other element..."

  • [August 2007] "International Workshop on Markup of Overlapping Structures." Jean Carletta, Steven DeRose, Patrick Durusau, Wendell Piez, C.M. Sperberg-McQueen, Jeni Tennison, and Andreas Witt. "XML and SGML have revolutionized the representation of structured information, but not all information structures map easily into systems of hierarchically nested elements. Markup of overlapping structures is a perennially hot topic, reinvented and reimagined as often as it is solved. This full-day workshop brought together the proponents of some of the major proposals for markup, representation, extraction, display, and validation of semantic overlap to summarize the systems they are developing and discuss topics of common interest. A morning of formal presentations was followed by an afternoon of free-ranging discussion..." See the Extreme index for "Concurrent Markup/Overlap."

  • [August 2007] "On the Lossless Transformation of Single-File, Multi-Layer Annotations into Multi-Rooted Trees." By Andreas Witt, Oliver Schonefeld, Georg Rehm, Jonathan Khoo, and Kilian Evang. Extreme Markup Languages 2007 (Montréal, Québec). "The Generalised Architecture for Sustainability (GENAU) provides a framework for the transformation of single-file, multi-layer annotations into multi-rooted trees. By employing constraints expressed in XCONCUR-CL, this procedure can be performed lossless, i.e., without losing information, especially with regard to the nesting of elements that belong to multiple annotation layers. This article describes how different types of linguistic corpora can be transformed using specialised tools, and how constraint rules can be applied to the resulting multi-rooted trees to add an additional level of validation... This contribution touches upon several different topics with regard to using XML-based markup languages for linguistic research. Our overall goal is the sustainable archiving of linguistic data. Corpora usually contain multiple annotation layers (morphology, part-of-speech, syntax, semantics, information structure, etc.). We devised a generalised architecture that, among other aspects, requires individual conceptual annotation layers contained in linguistic corpora to be separated from one another. To meet this prerequisite, we have to transform a linguistic corpus (normally represented by a single XML file, i.e., a single-rooted tree) into several XML files (i.e., a multi-rooted tree) so that each file contains one specific annotation layer... XCONCUR [Mirco Hilbert, Oliver Schonefeld, Andreas Witt: 'Making CONCUR work', Proceedings of Extreme Markup Languages, Montréal, 2005] provides authors who are familiar with XML a standardized and intuitive way to write documents with overlapping markup. This is achieved by reviving SGML's CONCUR option and by introducing this option into XML...
XCONCUR's document syntax is very similar to SGML with the CONCUR option set to YES. Each element is prefixed with an annotation layer id. This annotation layer id is used to assign the specific element to a distinct annotation layer. A generic identifier in terms of XCONCUR is the combination of annotation layer id and element name...

  • [July 28, 2005] "Making CONCUR Work." By Mirco Hilbert (Justus-Liebig-University Giessen), Oliver Schonefeld (Bielefeld University), and Andreas Witt (Bielefeld University). From the Proceedings of Extreme Markup Languages 2005. See Extreme Markup Languages 2005, August 1-5, 2005 Montréal, Canada. "This paper describes ways to approach the functionality of the SGML-feature CONCUR. This work is based on two Master's Theses. The origin of these theses was a paper originally given at Extreme Markup Languages 2004, where it was argued that the redundant encoding of information in multiple forms, as described by the TEI-Guidelines, has a lot of advantages over other methods of encoding multiply structured text. The multiple encoding of text allows to use all the available techniques and software products for XML documents. The MuLaX-Format was developed as an integrated format for editing. This format is strongly influenced by the SGML option CONCUR. The XML-conformance is achieved by the processing model: The processing is conservative because the concurrent annotations are kept separately. In other words, a non XML-syntax as a sequentialization of multiple incompatible hierarchies is used on the one hand, and on the other hand, all the processing is done on (hierarchical/XML conform) trees..."

  • [November 2004] "The Extended XPath Language (EXPath) for Querying Concurrent Markup Hierarchies." From the Website of Ionut Emil Iacob (Research in Computing for Humanities, University of Kentucky). [Publications listing] "This document provides semantics of the Extended XPath language (EXPath) for Concurrent Markup Hierarchies (CMH). Extended XPath (EXPath) is an extension of regular XPath to provide selection of nodes in a GODDAG. One key difference between EXPath and XPath is the return type of a location step evaluation: in EXPath a location step is evaluated to a node-set-collection: a node-set per each hierarchy. Consequently, the context of an expression evaluation is the same as the context of an XPath expression, with the following amendments: (1) context position is the position of the context node in the node-set corresponding to the context node hierarchy; (2) context size is the size of the node-set in the context node hierarchy... [Data Model, GODDAG]: For representing a distributed XML document we use the General Ordered-Descendant Directed Acyclic Graph (GODDAG) data structure proposed by Sperberg-McQueen and Huitfeldt. Informally, a GODDAG for a distributed XML document can be thought of as the graph that unites the DOM trees of individual components, by merging the root node and the text nodes. However, because of possible overlap in the scopes of XML elements from different component documents, GODDAGs will feature one more node type, that we call here leaf node, not found in DOM trees. In a GODDAG, leaf nodes are children of the text nodes, and they represent a consecutive sequence of content characters that is not broken by an XML tag in any of the components of the distributed XML document. While each CMH component will have its own text nodes in a GODDAG, the leaf nodes will be shared among all of them..." See also "XPath Extension for Querying Concurrent XML Markup", by Ionut E. 
Iacob, Alex Dekhtyar, and Wenzhong Zhao (University of Kentucky, Department of Computer Science), Technical Report TR394-04, February 2004.
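The shared leaf nodes described in the entry above ("a consecutive sequence of content characters that is not broken by an XML tag in any of the components") can be sketched as follows. This is not the authors' GODDAG implementation: the function name, the offset-pair representation of element scopes, and the sample data are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # half-open character offsets [start, end)


def shared_leaves(text: str, hierarchies: Dict[str, List[Span]]) -> List[Tuple[int, int, str]]:
    """Split the text at every tag boundary found in any hierarchy.

    Each resulting piece is a maximal run of characters that no hierarchy
    interrupts with a tag, i.e. a candidate shared leaf.
    """
    cuts = {0, len(text)}
    for spans in hierarchies.values():
        for start, end in spans:
            cuts.update((start, end))
    points = sorted(cuts)
    return [(a, b, text[a:b]) for a, b in zip(points, points[1:])]


TEXT = "abcdefghij"
HIERARCHIES = {
    "physical": [(0, 6), (6, 10)],  # e.g. two lines on a page
    "logical": [(0, 4), (4, 10)],   # e.g. two sentences
}

leaves = shared_leaves(TEXT, HIERARCHIES)
print(leaves)  # [(0, 4, 'abcd'), (4, 6, 'ef'), (6, 10, 'ghij')]
```

In this sketch the middle leaf `'ef'` belongs to the first physical unit and the second logical unit at once, which is the situation the shared-leaf design is meant to capture.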

  • [November 24, 2004] "A Framework for Management of Concurrent XML Markup." By Alex Dekhtyar and Ionut Emil Iacob (Department of Computer Science, University of Kentucky, Lexington, KY, USA). [To appear in] Special Issue of Data and Knowledge Engineering (2004). "The problem of concurrent markup hierarchies in XML encodings of documents has attracted attention of a number of humanities researchers in recent years. The key problem with using concurrent hierarchies to encode documents is that markup in one hierarchy is not necessarily well-formed with respect to the markup in another hierarchy. Previously proposed solutions to this problem rely on the XML expertise of the editors and their ability to maintain correct DTDs for complex markup languages. In this paper, we approach the problem of maintenance of concurrent XML markup from the Computer Science perspective. We propose a framework that allows the editors to concentrate on the semantic aspects of the encoding, while leaving the burden of maintaining XML documents to the software. The paper describes the formal notion of the concurrent markup languages and the algorithms for automatic maintenance of XML documents with concurrent markup... The ultimate goal of the proposed framework is to free the human editor from the effort of dealing with the validity and well-formedness issues of document encoding and to allow him or her to concentrate on the meaning of the encoding. In the proposed approach, the editor describes a collection of simple DTDs, one for each hierarchy, without having to worry about the need to build and maintain a 'master' DTD. Existence of such 'concurrent' DTDs introduces the need for specialized software to support the editorial process and drive it by the semantics of the markup. This software must allow the editor to indicate the positions in the text where the markup is to be inserted, select the desired markup, and record the results...
In Section 2 we present a motivating example based on the ARCHway project, in which the authors are currently involved. Section 3 formally defines the notion of a collection of concurrent markup languages. In Section 4 we present three key algorithms for the manipulation of concurrent XML markup. The MERGE algorithm builds a single master XML document from several XML encodings of the same text in concurrent markup. The FILTER algorithm outputs an XML encoding of the text for an individual markup hierarchy, given the master XML document. The UPDATE algorithm incrementally updates the master XML document given an atomic change in the markup. Finally, in Section 5 we study the performance of our implementation of the MERGE algorithm..." [cache]

  • [November 24, 2004] "Report of TEI Overlapping Markup SIG meeting in Baltimore, 2004-10." By Dot Porter. Posted to the TEI Overlapping Markup SIG discussion list. See also the Web version. "At last year's meeting, we had discussed creating a web site to explain in some detail many different approaches to overlapping markup. This year, we discussed some approaches that are currently in use by those of us in the SIG: the use of milestone elements and Just In Time Trees (JITTS). The traditional problem with using milestone elements extensively to deal with overlapping markup is that existing support languages (XPath, XSLT) cannot deal with non-content (that is, text between two milestone elements acting as the beginning and end tags). However, Alex Dekhtyar and Emil Iacob at the University of Kentucky have been working on an extension of XPath (Extended XPath, or EXPath) that can search overlapping encodings represented in a GODDAG. The GODDAG can be stored in an XML file with milestones (plus a set of DTDs, one per hierarchy) or in separate files: one XML file per hierarchy. In fact, the storage method is not important as long as there are parsers for GODDAG. The GODDAG implementation provides DOM-like API which can be used as well by an XML editor. They have begun working on an extension of XSLT as well. Patrick Durusau also cited a paper presented at the 2004 Extreme Markup Conference by Steven DeRose, Markup Overlap: A Review and a Horse. In this paper, DeRose outlines a system of milestone elements similar to that already implemented at the University of Kentucky, which he calls clix (not to be confused with Constraint Language in XML, CLIX). The SIG proposes to investigate the possibility of implementing within TEI a system for dealing with overlapping markup through a system of milestone elements based on clix, JITTS, and the EXPath and EXSLT support being developed at the University of Kentucky..."

  • [August 2004] "Markup Overlap: A Review and a Horse." By Steven DeRose (Bible Technologies Group). In Proceedings of Extreme Markup Languages 2004 Conference (August 2-6, 2004, Montréal, Canada). "Overlap describes cases where some markup structures do not nest neatly into others, such as when a quotation starts in the middle of one paragraph and ends in the middle of the next. OSIS, a standard XML schema for Biblical and related materials, has to deal with extreme amounts of overlap. The simplest is book/chapter/verse and book/story/paragraph hierarchies that pervasively diverge; but many types of overlap are more complicated than this. The basic options for dealing with overlap in the context of SGML (ISO 8879) or XML are described in the TEI Guidelines. I summarize these with their strengths and weaknesses. Previous proposals for expressing overlap, or at least kinds of overlap, don't work well enough for the severe and frequent cases found in OSIS. Thus, I present a variation on TEI milestone markup that has several advantages, though it is not a panacea. This is now the normative way of encoding non-hierarchical structures in OSIS documents... I propose a syntax known as 'Trojan milestones', which is highly readable, easy to learn, and makes milestone forms of elements always permit the right attributes. In short, it uses the same element type both as a normal element, and as empty start- and end-milestones. This maximizes consistency between overlap and non-overlap cases, such as attribute declarations, GIs, and so on. I also develop the use of Trojan milestones to represent LMNL documents in XML, a syntax I call 'CLIX'. And I discuss the problems of ordering for co-terminous LMNL ranges..." [PDF format, cache]
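The milestone idea in the entry above, where one element type serves both as a normal element and as empty start and end markers, can be sketched with Python's standard `xml.etree.ElementTree`. The attribute names `sID` and `eID` and the sample document are assumptions made for this illustration and are not taken from the quoted abstract; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# A quotation that starts inside one paragraph and ends inside the next.
# The empty <q sID=.../> and <q eID=.../> elements mark its boundaries.
DOC = (
    "<div>"
    '<p>He said, <q sID="q1"/>Go now.</p>'
    '<p>And hurry.<q eID="q1"/> Then he left.</p>'
    "</div>"
)


def milestone_ranges(root):
    """Return (plain_text, {id: (tag, start, end)}) using character offsets."""
    pieces = []    # text chunks in document order
    offset = 0     # running length of the text collected so far
    open_at = {}   # milestone id -> start offset
    ranges = {}    # milestone id -> (tag, start, end)

    def add_text(chunk):
        nonlocal offset
        if chunk:
            pieces.append(chunk)
            offset += len(chunk)

    def walk(elem):
        if "sID" in elem.attrib:
            open_at[elem.attrib["sID"]] = offset
        elif "eID" in elem.attrib:
            mid = elem.attrib["eID"]
            ranges[mid] = (elem.tag, open_at.pop(mid), offset)
        else:
            add_text(elem.text)
            for child in elem:
                walk(child)
                add_text(child.tail)  # text following the child, inside elem

    walk(root)
    return "".join(pieces), ranges


text, ranges = milestone_ranges(ET.fromstring(DOC))
tag, start, end = ranges["q1"]
print(text[start:end])  # Go now.And hurry.
```

The document stays well-formed XML because the paragraphs nest normally; the overlapping quotation is recovered afterwards by pairing the two milestones. An unmatched `eID` raises `KeyError` in this sketch.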

  • [June 24, 2004] TEI Overlapping Markup SIG [Report]. TEI Overlapping Markup Special Interest Group, first convened at the TEI 2003 meeting (Nancy, France on 08-November-2003), led by Dorothy Porter, Peter Robinson, and Patrick Durusau. [With fourteen persons in attendance] the group "first discussed some of the problems involved with overlapping markup. We agreed that the first thing we need to do is document the problems of existing approaches, of which there are many (both problems and approaches). Patrick Durusau volunteered to put together a web page, to be placed within the TEI website, outlining various existing solutions and briefly describe the general pros and cons to each approach. We would then like to solicit comments on these solutions from the TEI community in general, and ask community members which of these approaches might address problems specific to their projects. This would give us an overview of the problems of overlapping markup, and will also give us an indication if an existing approach (or approaches) could be endorsed by TEI, or if we need to find a new approach (or approaches) that will suit the needs of the community. We hope that the work of the SIG will help to inform the TEI's existing Stand-Off Markup working group. In discussing the problems of overlapping markup, it was agreed that the most common problem is that of the markup of text (paragraphs, sentences, chapters, words) vs. the markup of the physical object (page, page or folio line, quire)... David Durand, member of the Stand-Off Markup Workgroup (which is currently charged with issues of overlapping markup), has agreed to create a listserv/email list for all those interested in overlapping markup. The initial group will include all those in attendance at the SIG. David Durand explained the current approach TEI is taking to overlapping markup, which involves separate XML files for each hierarchy, using XPointer and XInclude to map between files.
It was agreed generally that this approach may be too dependent on heavy encoding to be useful..." See TEI Workgroup on Stand-Off Markup, XLink and XPointer. A 2004-06-24 posting from Dorothy Carr [Dot] Porter to TEI-L mentions a mailing list for overlapping markup, 'TEI-OL-SIG' list. [cache]

  • [July 15, 2003] "Testing Structural Properties in Textual Data: Beyond Document Grammars." By Felix Sasaki and Jens Pönninghaus (Universität Bielefeld). [Pre-publication draft of paper published] in Literary and Linguistic Computing Volume 18, Issue 1 (April 2003), pages 89-100. "Schema languages concentrate on grammatical constraints on document structures, i.e., hierarchical relations between elements in a tree-like structure. In this paper, we complement this concept with a methodology for defining and applying structural constraints from the perspective of a single element. These constraints can be used in addition to the existing constraints of a document grammar. There is no need to change the document grammar. Using a hierarchy of descriptions of such constraints allows for a classification of elements. These are important features for tasks such as visualizing, modelling, querying, and checking consistency in textual data. A document containing descriptions of such constraints we call a 'context specification document' (CSD). We describe the basic ideas of a CSD, its formal properties, the path language we are currently using, and related approaches. Then we show how to create and use a CSD. We give two example applications for a CSD. Modelling co-referential relations between textual units with a CSD can help to maintain consistency in textual data and to explore the linguistic properties of co-reference. In the area of textual, non-hierarchical annotation, several annotations can be held in one document and interrelated by the CSD. In the future we want to explore the relation and interaction between the underlying path language of the CSD and document grammars..." 
See: (1) the abstract for LitLin; (2) the research group's publication list; (3) the related paper "Co-reference annotation and resources: a multilingual corpus of typologically diverse languages", in Proceedings of the Third International Conference on Language Resources and Evaluation (LREC-2002). [source PDF]

  • [October 02, 2002] New Website for Layered Markup and Annotation Language (LMNL). A communiqué from Jeni Tennison announces an online collection of resources for the Layered Markup and Annotation Language (LMNL), first introduced at the Extreme Markup Languages Conference 2002 in Montréal. Project principals include Jeni Tennison, Gavin Thomas Nicol, and Wendell Piez. LMNL, pronounced 'liminal', "is an experimental approach to digital text encoding that supports, in SGML/XML terms, overlapping elements (ranges in LMNL) and structured attributes (annotations in LMNL)." The Extreme paper by Tennison and Piez presented LMNL as a solution to the challenge of representing multiple hierarchies within a single document and annotating existing tree structures with type information (as in the PSVI). The layered data model is based on the Core Range Algebra investigated by Gavin Nicol; this data model views documents as strings over which span a number of named ranges, each of which can themselves have associated metaranges with their own internal structure. The development team has now published a simple tutorial for LMNL and continues to address the "interesting challenges of extracting tree models, writing schema, query, and transformation languages." Initial online specifications cover: (1) the core LMNL Data Model, (2) a Reified Data Model which is used to describe physical documents that represent LMNL documents, and (3) a draft LMNL Object Model (LOM) API which specifies an object-oriented API for the LMNL data model. A public mailing list is dedicated to the discussion of LMNL and its applications. [Full context]

  • [August 21, 2002] "Meaning and Interpretation of Concurrent Markup." By Andreas Witt (Universität Bielefeld). Presented at ALLC/ACH 2002, Joint International Conference of the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities, July 24 - 28, 2002, University of Tübingen, Germany. "The difficulty of annotating not hierarchically structured text with SGML-based mark-up languages is a problem that has often been addressed... In general, annotated text consists of content and annotations. Annotations are used on a syntactical level. Therefore they are used for assigning a meaning to (parts of) a document. While developing a document grammar the focus should be centred on the content. This point of view is expressed by Sperberg-McQueen, Huitfeldt and Renear (2000). They show how knowledge which is syntactically coded into text by annotations can be extracted by knowledge inference. After summarizing this approach, it will be shown, how this technique can be expanded so that it can be used for inferences of separately annotated and implicitly linked documents - documents marked-up according to different document grammars... The described model of knowledge representation can only be used for single documents. However, it will be shown, that this model can easily be expanded, so that it is applicable for the inference of relations between several separately annotated XML-documents with the same primary data. ... The outlined architecture has many advantages. The model allows for structuring text according to multiple concurrent document grammars without workarounds. Furthermore additional annotations can be subsequently included, without changing already established annotations. The annotations are on the one hand independent of each other, on the other hand they are interrelated via the text, allowing for the inference of relations between different levels of annotation. 
The final advantage to be mentioned is that the compatibility of several or all annotations used can be proven automatically. This can be done using a technique originally developed within linguistics, namely unification." See Witt's reference page, with the slide presentation and Python code implemented by Daniel Naber. [cache]

  • [August 15, 2002] "Multiple Informationsstrukturierung mit Auszeichnungssprachen. XML-basierte Methoden und deren Nutzen für die Sprachtechnologie." By Andreas Witt. Dissertation zur Erlangung des akademischen Grades. Doctor philosophiae (Dr. phil.) eingereicht an der Fakultät für Linguistik und Literaturwissenschaft, Universität Bielefeld. 217 pages. Contains a discussion of the SGML CONCUR feature. [cache]

  • [August 23, 2002] "The Layered Markup and Annotation Language (LMNL)." By Jeni Tennison (Jeni Tennison Consulting) and Wendell Piez (Mulberry Technologies). Extended abstract of the presentation at Extreme Markup 2002. [LMNL 'Layered Markup and aNnotation Language' is pronounced "liminal".] "In document-oriented XML development, there's frequently a requirement for several views of the same document to coexist. For example, one view might represent the logical structure of a document into chapters, sections and paragraphs, while another represents the physical manifestation of that document in a particular book, maintaining page and even line breaks. The structures in these different views often overlap -- a page might start in the middle of one paragraph and end after another, for example -- and this makes it difficult for a simple hierarchical structure, such as XML, to represent... All the [previous] approaches have their strengths and weaknesses. Interestingly, they all assume a DAG (directed acyclic graph) as a primary data model, sometimes enhancing it with metainformation at a different level. (Even in SGML CONCUR, this metainformation is provided by the DTD. TexMECS assumes a more complex graph structure, dubbed GODDAG, allowing elements to have multiple parentage.) Recognizing the difficulties of this approach, however (for example, in XSLT it is quite challenging and processor-intensive, albeit not impossible, to perform the splicing and segmenting operations typically required to transform between concurrent structures), the authors postulated it might be worthwhile to start by parsing markup into an entirely different data model. 
Since XML and XSLT already provide us with a strong technology for processing trees (we reasoned), we could always have a tree when we needed one; so we opted to concentrate on a more rudimentary data model that would capture the information we needed, while not itself trying to assert containment or sibling relations (that are too simple to apply to overlapping structures). These could be inferred or specified in other layers of the system. The Core Range Algebra presented by Gavin Nicol at this year's Extreme conference proposes a data model that supports overlapping structures by viewing documents as sequences of characters over which named ranges are defined. To represent more fully the range of document structures encountered in real documents, we have extended this data model to include the concepts of 'layers', which are ranges that fully contain all the ranges that start or end within them, and 'metaranges', which are layers that can be associated with ranges to provide meta-information about their content. This data model can be represented in XML, using any of the methods outlined above, but for ease of writing, we have developed a specialised syntax, the Layered Markup and Annotation Language (LMNL)..."

  • [August 05, 2002] "The Layered Markup and Annotation Language (LMNL)." By Jeni Tennison (Jeni Tennison Consulting) and Wendell Piez (Mulberry Technologies). Presented at the Extreme Markup Languages Conference 2002. "Representing multiple hierarchies within a single document has always been a problem for XML. To try to address the problems of representing multiple hierarchies and of annotating existing tree structures with type information (as in the PSVI), we have developed a layered data model based on the Core Range Algebra presented at Extreme 2002 by Gavin Nicol. This data model views documents as strings over which span a number of named ranges, each of which can themselves have associated metaranges with their own internal structure. To aid our experimentation with this data model, we developed a markup notation to reflect it, the Layered Markup and Annotation Language (LMNL), and have constructed several prototype applications to facilitate the extraction of single views, as XML structures, from LMNL documents. This paper outlines LMNL and discusses how its development has made us reflect on the nature of XML, schema and query languages..." On "Core Range Algebra," by Gavin Nicol (Red Bridge Interactive): "If various portions of an information resource must be addressed for various reasons, the portions may overlap or otherwise conflict with any hierarchical structure that could be reasonably reflected by its own inherent structure. In such cases, it doesn't help to embed (more) markup in the data. A proposed 'Core Range Algebra' can be used to address and sequence ranges, unions, intersections, and concatenations of arbitrary portions of information resources..."

  • [August 23, 2002] "Declaring Trees: The Future of the Evolution of Markup? Just-In-Time-Trees (JITTs)." By Patrick Durusau and Matthew Brook O'Donnell. Based on a paper presented at Extreme Markup 2002. See the slides in PPT and PDF format. "Just-In-Time-Trees (JITTs) arose from the realization that SGML/XML syntax had been overburdened with the tasks of declaring a document root (singular) and the markup to be recognized in the document. (Not actually true for markup recognition in SGML with its concur feature but that was seldom implemented and so not generally available.) We propose moving the declaration of the document root and the markup to be recognized from that root from markup syntax to pre-processing before passing the document instance to an SGML or XML parser. The move from syntax to processing for these declarations may seem like a small one but it has important implications for the development of markup syntax. Freed from the strictures of the current model, markup syntax can be used to record any number of arbitrary or overlapping structures in a document instance. It is true that processing eventually requires declaration of a traditional tree, but there are many cases (presentation vs. logical layout) where the storage of overlapping hierarchies in a single document instance will be of immediate use. We are exploring ways to relate information from overlapping trees that are found on successive parses of a document instance... We propose that the declaration of the document root and the markup to be recognized should be moved from the syntax layer and made a part of the processing of a text. That change in the model for handling markup removes the various problems with overlapping markup that have been the subject of numerous proposals but few widespread implementations since the rise of SGML. 
Our latest proposal differs from all prior ones in that it allows the use of standard XML software for the processing of texts, while allowing extensive experimentation with markup languages for the encoding of texts. Our argument for markup recognition is grounded in the text of ISO 8879 (concur) and extends that concept to XML by the use of filters to declare the document root and markup to be recognized..."

  • [August 05, 2002] "Coming Down From the Trees: Next Step in The Evolution of Markup?" By Patrick Durusau (Society of Biblical Literature). Presented at the Extreme Markup Languages Conference 2002. Thursday, August 8, 2002. "The syntax of SGML/XML presumes a tree structure and consequently imposes that presumption upon texts. From DTDs and XML schemas to research on the nature of markup and well-formedness, that presumption about syntax has informed such efforts. However useful that view may be for some purposes, it has artificially limited and impoverished markup for texts. Markup under any recognized SGML/XML convention is unable to easily represent arbitrary structures in a text or to represent overlapping or concurrent structures in a text. Those limitations spring directly from the presumptions about the nature of markup syntax. It is argued that the presumption of a tree for syntax and thereby for the structure of texts is both unnecessary and only a claim about texts and not an inherent property of texts. These difficulties are further compounded in the rich set of XML processing technologies, such as DOM, XPath, XSLT and XQuery, all of which rely upon the tree based syntax and the consequent presumption about the structure of texts. Markup syntax without the presumption of a tree structure, allows the recording of arbitrary structures in a text (including a tree if so desired) as well as the overlapping structures that have so far eluded a commonly used solution..." See also "Implementing Concurrent Markup in XML."

  • [September 16, 1999] "Concurrent Document Hierarchies in MECS and SGML." [Abstract] By C. Michael Sperberg-McQueen and Claus Huitfeldt. In Literary and Linguistic Computing [Journal of the Association for Literary and Linguistic Computing, ISSN 0268-1145] Volume 14, Number 1 (April 1999), pages 29-42, with 16 references. "Applications of computers to humanistic research rely increasingly on SGML or XML markup; it is a persistent challenge to find suitable representations, in these tree-based formalisms, for the overlap of textual features. [The most persistent complaint of SGML's critics among humanists is that SGML simply cannot handle such overlapping features. In the general form stated, this claim is untrue, but it is fair to say that handling overlap requires some substantial extensions to what is otherwise a rather simple data model. Overlapping features, however, are common enough in existing texts that almost every system designed for scholarly text processing has some facility for handling overlap, either in the form of dual logical and physical hierarchies (as in John B. Smith's interactive concordance program ARRAS) or in the form of non-hierarchical coding (as in the COCOA tagging supported by the Oxford Concordance Program and other systems). Overlapping textual features are an inescapable fact of textual life.] SGML lends itself to a straightforward data model with a simple relationship between markup (elements and attributes) and features or structures in the text. Barnard et al. (Computers and the Humanities, 22:265-276, 1988; 29:211-231, 1995) and the TEI (Guidelines for Electronic Text Encoding and Interchange, 1994) have presented methods for registering the existence of overlap using SGML notations, but these methods are often felt to be unsatisfactory, in part (we argue) because they complicate the otherwise straightforward SGML data model. 
[This paper will first describe the fundamentals of what we will call the 'basic' SGML data model and explain how overlap presents a problem for this basic SGML model. It will then discuss two notations invented to overcome that problem: the Multi-Element Code System (MECS) developed by the Wittgenstein Archive at the University of Bergen and the CONCUR feature of SGML itself, and explore some problems which arise in translating between these two notations. In conclusion, it will describe some possible avenues for future work on overlap and related markup problems.] The MECS and CONCUR notations described here allow the straightforward markup of overlapping textual features. CONCUR further allows the formulation of useful document grammars for concurrent hierarchies of textual features. The theoretical and practical advantages outweigh the practical disadvantages, and the humanities computing community should begin serious experimentation with CONCUR. . . The advent of the Extensible Markup Language (XML), however, may change the practical situation: it is much easier to implement CONCUR for the fully normalized documents prescribed by XML than to handle all of its complex interactions with features present in SGML but omitted in XML. [N.B. XML does not include the CONCUR feature, and there is no real prospect that it will. But the normalization and well-formedness constraints XML defines can be used to simplify the definition of an XML-like language which does include CONCUR.]" On MECS, see the publications page of Claus Huitfeldt, and the Wittgenstein Archives at the University of Bergen.

  • "GODDAG: A Data Structure for Overlapping Hierarchies." By C. M. Sperberg-McQueen and Claus Huitfeldt. [To be] Presented at ACH-ALLC '99. See: the ACH-ALLC '99 program. See the abstract [local archive copy]; alt URL.

  • "GODDAG: A Data Structure for Overlapping Hierarchies." By C.M. Sperberg-McQueen and Claus Huitfeldt. To be published in the Proceedings of PODDP'00 and DDEP'00, edited by Anne Brüggemann-Klein and Ethan Munson. New York: Springer, 2001. See the web site for Principles of Digital Document Processing - PODDP00, Fifth International Workshop on Principles of Digital Document Processing (September 15, 2000, Munich, Germany).

  • [May 10, 2001] Markup Language for Complex Documents (MLCD). Project Description. Project Host: Center for Humanities Information Technology, University of Bergen. Principal investigator: Associate Prof. Claus Huitfeldt, HIT Center. "The goal of MLCD is to lay the theoretical foundation for a better system of representation for complex textual phenomena than can be found in today's SGML- and XML-based systems. The project will also lay a foundation for software development, with an eye to web-based delivery. All SGML- and XML-based systems, however, have problems with the representation of a variety of phenomena which are essential for the acceptable representation and processing of text. These problems are in large part solved by MECS, a coding system developed by Claus Huitfeldt in work at the Wittgenstein Archive at the University of Bergen. MECS, however, has no well defined data structure and no notion of document grammar. The goal of the MLCD project is to define a system which combines the best of SGML/XML and MECS. A notation for such a system has already been designed, and a data structure has been sketched out. The project will work to complete the specification of the data structure and to develop some method of specifying document grammars. Within the framework of this project, prototype software will also be developed for experimental use and for verifying the coherence of the system design. MLCD will work first and foremost to establish a theoretical foundation for better analysis of text structures and to extend the development of systems for the electronic representation of text. The success of the project can contribute a great deal to the work of developing web-based standards for the next generation..." See the main entry.

  • "Concurrent Document Hierarchies in MECS and SGML." By C. M. Sperberg-McQueen and Claus Huitfeldt. Presented at ACH-ALLC 1998. "This paper will first describe the fundamentals of what we will call the 'basic' SGML data model and explain how overlap presents a problem for this basic SGML model. It will then discuss two notations invented to overcome that problem: the Multi-Element Code System (MECS) developed by the Wittgenstein Archive at the University of Bergen and the CONCUR feature of SGML itself, and explore some problems which arise in translating between these two notations. In conclusion, it will describe some possible avenues for future work on overlap and related markup problems." [local archive copy]

  • [September 2000] "GODDAG: A Data Structure for Overlapping Hierarchies." By C.M. Sperberg-McQueen and Claus Huitfeldt. Pages 139-160 in Digital Documents: Systems and Principles: Eighth International Conference on Digital Documents and Electronic Publishing, DDEP 2000; 5th International Workshop on the Principles of Digital Document Processing, PODDP 2000, Munich, Germany, September 13-14, 2000. Revised Papers. Lecture Notes in Computer Science. ISSN: 0302-9743; ISBN: 3-540-21070-9. Springer-Verlag Heidelberg (2003/2004). "Notations like SGML and XML represent document structures using tree structures; while this is in general a step forward from earlier systems, it creates certain difficulties for the representation of documents in which the structures of interest are not properly nested. Overlapping structures, discontinuous structures, and material which occurs in different orders in different parts, views, or versions of a document are all problems for SGML and XML. Overlapping structures have received attention from a variety of authors on SGML and XML, who have proposed various solutions including the use of non-SGML notations with translation into SGML for processing, the use of the concur feature of SGML, exploitation of conditional marked sections in the DTD and document instance, the imposition of various kinds of unusual interpretations on SGML/XML elements as milestones or as fragments of some larger 'virtual' element, or the use of detailed annotation separate from the base text being annotated. An alternative is the use of a non-SGML/XML notation which does not require that elements form a hierarchical structure. One such notation, MECS, was developed by one of the authors and has been used in practice for over a decade. 
Unfortunately, the element structure of a MECS document cannot conveniently be represented as a tree, so that MECS processors lack the assistance provided to SGML/XML processors by the unifying assumption of a simple standard data structure for the document. We propose a data structure for representing documents with overlapping structures (including MECS documents). As in the conventional tree representation of SGML and XML, elements are represented by nodes in a graph, and the character data content of the document by labels on the leaves of the graph. We use a directed acyclic graph in which an arc a → b indicates that node b is a child of node a. Unlike SGML/XML trees, our graph structure allows children to have multiple parents. In the general form of the data structure, an ordering is imposed on the children of each node; this gives the data structure its name: general ordered-descendant directed acyclic graph (GODDAG). A restricted form of GODDAG, in which an ordering is imposed on the leaves of the graph, cannot handle multiple orderings of the same material but can represent any legal MECS document. The data structure here proposed should be useful in the representation of naturally occurring documents with complex structures; it may also be useful in other applications..."

  • [July 17, 2001] "Implementing Concurrent Markup in XML." By Patrick Durusau (Society of Biblical Literature) and Matthew Brook O'Donnell (University of Surrey). Paper [to be] presented at Extreme Markup Languages 2001, August 12-17, 2001, Montréal, Canada. "Texts in the humanities -- as well as other texts -- often exhibit or their users wish to encode multiple overlapping hierarchies using descriptive markup, e.g., by marking physical page features as well as textual and linguistic structures. The optional CONCUR feature of SGML has seldom been implemented and is not present in XML. Relying upon XPath expressions, the authors have implemented concurrent markup in standard XML. XSLT scripts are used to build and query across concurrent hierarchies. The authoring, validation and processing of the document instances required for this technique use standard XML software."

  • Technical Topics: Multiple Hierarchies. Chapter 31 in the TEI Guidelines for Electronic Text Encoding and Interchange. "At various points in these Guidelines, the discussion has mentioned the problems which arise when using SGML to encode textual features which do not take a strictly hierarchical form: features, that is, which do not necessarily nest within other features. This chapter provides an overview of the techniques defined in these Guidelines for handling such problems, and should be consulted when deciding how to deal with them." See also Chapter 31 online from HTI. [local archive copy]

  • Annex 10. Overlapping hierarchies. From the CES documentation. "This problem of overlapping hierarchies is a common one when applying SGML to certain complex descriptive situations, because the data model provided by SGML is that of an ordered labeled tree. When the phenomena to be recognized are independent of each other, they generally fail to nest regularly in a single hierarchy, requiring additional representations to be layered on top of SGML's basic structures. This occurs when there are multiple hierarchies, each to be applied to the same data, but where there are a well defined set of independent and hierarchical information types to be represented (as in the example above). Other common examples are the conflicts between typographic features (e.g., highlighting) and linguistic features such as sentence and word boundaries; and variant annotations (e.g., segmentations), which are generally non-hierarchical..." [local archive copy]

  • "Hyperlink semantics for standoff markup of read-only documents." By Henry S. Thompson and David McKelvie (Language Technology Group, HCRC, University of Edinburgh). Presented at "SGML Europe '97: The Next Decade - Pushing the Envelope" (Barcelona, Spain, May 1997). "There are at least three reasons why separating markup from the material marked up ('standoff annotation') may be an attractive proposition: (1) The base material may be read-only and/or very large, so copying it to introduce markup may be unacceptable; (2) The markup may involve multiple overlapping hierarchies; (3) Distribution of the base document may be controlled, but the markup is intended to be freely available. In this paper we introduce two kinds of semantics for hyperlinks to facilitate this type of annotation, and describe the LT NSL toolset which supports these semantics. The two kinds of hyperlink semantics which we describe are: (1) inclusion, where one includes a sequence of SGML elements from the base file; (2) replacement, where one provides a replacement for material in the base file, incorporating everything else." See also the bibliographic entry.

  • David McKelvie, Chris Brew, Henry S. Thompson. "Using SGML as a Basis for Data-Intensive NLP [Natural Language Processing]." In Computers and the Humanities 31/5 (1998), pages 367-388. Discusses how LT NSL copes with overlapping hierarchies. "We use stand-off markup for a wide range of overlapping hierarchy cases. . . support for this approach is provided by our LT XML toolkit (which also has legacy SGML support)."

  • "Concurrent Structures." Section 6.2 in "A Gentle Introduction to SGML," which is Chapter 2 of the TEI Guidelines for Electronic Text Encoding and Interchange. [local archive copy]

  • "Refining our Notion of What Text Really Is: The Problem of Overlapping Hierarchies." By Allen Renear, Elli Mylonas, and David Durand. In Research in Humanities Computing, 1993/1995. See the bibliographic record. [local archive copy]

  • What Should Markup Really Be? Applying Theories of Text to the Design of Markup Systems. Paper presented at ALLC/ACH '96 (June 25 - 29, 1996. University of Bergen, Norway). By David G. Durand, Steven J. DeRose, and Elli Mylonas. After presenting a list of seven kinds of common hierarchy-breaking textual structures, the authors write: "Methods of tagging all of these currently exist in the TEI, in the form of particular tags for particular perspectives. But [emphasis added] since we now know that the breaking of strict hierarchies is the rule, rather than the exception, it is time to determine what additional features are required from markup systems to make the formal description of such non-hierarchical phenomena straightforward. We propose that it is better to integrate the formal properties of these recurring non-hierarchical phenomena into markup systems themselves, rather than re-inventing them tag-by-tag. Their explicit representation will enable more perspicuous, explicit, and consistent descriptions of nonhierarchical tag-relationships and constraints, in the same way as the formal definitions of content models in SGML do for hierarchical documents." See the online version. [cache]

  • David Barnard, Lou Burnard, Jean-Pierre Gaspart, Lynne A. Price, Michael Sperberg-McQueen, and Giovanni Battista Varile. "Hierarchical Encoding of Text: Technical Problems and SGML Solutions." In The Text Encoding Initiative: Background and Contents. Guest Editors: Nancy Ide and Jean Véronis. Computers and the Humanities 29/3 (1995), pages 211-231.

  • Edited Version of Renear's Target Paper. In Monist 80:3 (1997), Interactive Issue, Edited by Barry Smith and Herbert Hrachovec. By Michael Biggs and Claus Huitfeldt: "Philosophy and Electronic Publishing. Theory and Metatheory in the Development of Text Encoding." [local archive copy], .ZIP format

  • "Overlapping Hierarchies." In Introduction to Encoding. A Tutorial For New Encoders, by Carole E. Mah. November 19, 1998. [local archive copy]

  • Multiple Hierarchic Structures - Michael Kay and Richard Tobin, 1999-06.
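
Several of the entries above describe the same underlying idea in different terms: LMNL views a document as a string over which named ranges span, and GODDAG lets a piece of text have more than one parent element. The following is a minimal sketch of that idea only, not an implementation of LMNL, the Core Range Algebra, MECS, or GODDAG. The names Range, overlaps, and leaves_with_parents are hypothetical and chosen for illustration, and the sample text and offsets are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """A named span over the base text: characters start..end (end exclusive)."""
    name: str
    start: int
    end: int


def overlaps(a: Range, b: Range) -> bool:
    """True when a and b intersect but neither one contains the other."""
    intersect = a.start < b.end and b.start < a.end
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return intersect and not (a_in_b or b_in_a)


def leaves_with_parents(text, ranges):
    """Cut the text at every range boundary and pair each resulting leaf
    with the names of all ranges that cover it (possibly several)."""
    cuts = sorted({0, len(text)}
                  | {r.start for r in ranges}
                  | {r.end for r in ranges})
    result = []
    for lo, hi in zip(cuts, cuts[1:]):
        parents = [r.name for r in ranges if r.start <= lo and hi <= r.end]
        result.append((text[lo:hi], parents))
    return result


# Invented sample: two paragraphs and two pages over the same characters.
TEXT = "one two three four"
RANGES = [
    Range("para1", 0, 7),    # "one two"
    Range("para2", 8, 18),   # "three four"
    Range("page1", 0, 13),   # "one two three"
    Range("page2", 14, 18),  # "four"
]

if __name__ == "__main__":
    # para2 starts on page1 and ends on page2, so it cannot nest in page1.
    print(overlaps(RANGES[1], RANGES[2]))
    for segment, parents in leaves_with_parents(TEXT, RANGES):
        print(repr(segment), parents)
```

In this sketch the leaf "three" is covered by both para2 and page1, which is the situation a single XML tree cannot express directly. Because the ranges are stored as offsets apart from the text, the same arrangement also resembles the standoff annotation discussed by Thompson and McKelvie.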

Notes and personal opinions

  • "Logical structure." Suppose we admit that a US twenty-dollar bill is a "document." Why not? And is any coin used for currency less a document because it's not printed on paper? What is the "logical structure" of a twenty-dollar bill? That is: according to the perspective of the engraver or printer at the mint, according to the expert in counterfeit money, according to the vending machine scanner, according to the numismatic enthusiast, according to the epigraphist, according to the origamist? Surely the "logical structure" of a coin is different for the metallurgist than for the referee managing the coin-toss at the football game. For more common documents (print books, magazines, printed telephone bills, street signs, lapidary monuments), there are typically many legitimate analytical perspectives giving rise to multiple "logical structure" descriptions which may or may not be represented optimally in some hierarchical markup notation. See related comments in a paper extract explaining why SGML design got away with the sale of some simplistic working assumptions.

Hosted By
OASIS - Organization for the Advancement of Structured Information Standards

Robin Cover, Editor: