The Layered Markup and Annotation Language (LMNL)



By Jeni Tennison (Jeni Tennison Consulting) and Wendell Piez (Mulberry Technologies).

Extended abstract of the presentation at Extreme Markup Languages 2002.

LMNL (Layered Markup and aNnotation Language) is pronounced "liminal".

In document-oriented XML development, there's frequently a requirement for several views of the same document to coexist. For example, one view might represent the logical structure of a document as chapters, sections and paragraphs, while another represents the physical manifestation of that document in a particular book, maintaining page and even line breaks. The structures in these different views often overlap (a page might start in the middle of one paragraph and end after another, for example), and this makes them difficult to represent in a simple hierarchical structure such as XML.

There have been many attempts to get around this problem in the past, falling into five categories:

  • SGML's CONCUR (concurrent markup), which was never widely implemented (and whose application for overlapping structures would break XML well-formedness rules).

  • Having one primary, hierarchical, view, and using empty elements (sometimes called "milestones") or processing instructions to mark the start and end points of the structures from other views, for example in TEI [1].

  • Having one primary, hierarchical, view, and using external pointers (e.g., XPointers [2]) to indicate the ranges covered by the structures from other views, again used in TEI [1].

  • Breaking down the document into words or even individual characters, and indicating the elements to which these individual atoms belong through containment [3].

  • Creating a new markup language that supports overlapping elements (e.g., TexMECS [4]).
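
As an illustration of the second approach (milestones), the following is a minimal sketch rather than TEI's actual encoding. It assumes paragraphs form the primary hierarchy and pages are the secondary view; the element names `doc`, `para`, `page-start` and `page-end`, the function name and the sample text are all hypothetical.

```python
from xml.sax.saxutils import escape


def with_milestones(text, paras, pages):
    """Serialize `text` with paragraphs as real elements and pages as
    empty milestone elements. `paras` and `pages` are lists of half-open
    (start, end) character offsets; paragraphs must not overlap each other.
    All element names here are hypothetical."""
    para_open = {start for start, _ in paras}
    para_close = {end for _, end in paras}
    page_open = {start for start, _ in pages}
    page_close = {end for _, end in pages}

    out = ["<doc>"]
    for pos in range(len(text) + 1):
        # Close before opening, so a boundary shared by two ranges nests cleanly.
        if pos in para_close:
            out.append("</para>")
        if pos in page_close:
            out.append("<page-end/>")
        if pos in page_open:
            out.append("<page-start/>")
        if pos in para_open:
            out.append("<para>")
        if pos < len(text):
            out.append(escape(text[pos]))
    out.append("</doc>")
    return "".join(out)


text = "One two. Three four five."
paras = [(0, 8), (9, 25)]    # two paragraphs
pages = [(0, 14), (15, 25)]  # the first page ends inside the second paragraph
xml = with_milestones(text, paras, pages)
```

The result is well-formed XML, but the page structure is only implied by pairs of empty elements: a processor has to match each `page-start` with its `page-end` itself, which is one of the weaknesses of this approach.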

All these approaches have their strengths and weaknesses. Interestingly, they all assume a DAG (directed acyclic graph) as a primary data model, sometimes enhancing it with metainformation at a different level. (Even in SGML CONCUR, this metainformation is provided by the DTD. TexMECS assumes a more complex graph structure, dubbed GODDAG [7], allowing elements to have multiple parentage.)

This approach has its difficulties, however. In XSLT, for example, the splicing and segmenting operations typically required to transform between concurrent structures are quite challenging and processor-intensive, albeit not impossible. Recognizing this, the authors postulated that it might be worthwhile to start by parsing markup into an entirely different data model.

Since XML and XSLT already provide us with a strong technology for processing trees (we reasoned), we could always have a tree when we needed one. We therefore opted to concentrate on a more rudimentary data model that would capture the information we needed without itself trying to assert containment or sibling relations, which are too simple to apply to overlapping structures. These relations could be inferred or specified in other layers of the system.

The Core Range Algebra presented by Gavin Nicol at this year's Extreme conference proposes a data model that supports overlapping structures by viewing documents as sequences of characters over which named ranges are defined. To represent more fully the variety of document structures encountered in real documents, we have extended this data model with two concepts: "layers", which are ranges that fully contain all the ranges that start or end within them, and "metaranges", which are layers that can be associated with ranges to provide meta-information about their content. This data model can be represented in XML, using any of the methods outlined above, but for ease of writing we have developed a specialised syntax, the Layered Markup and Annotation Language (LMNL). The paper will describe the data model and syntax in detail, but a small example of a LMNL document is as follows:


[book [title [lang}en{lang]}Genesis{title]}
[chapter}
[section [title}The creation of the world.{title]}
[para}
[v}[s}[note}In the beginning of creation, when God made heaven and
earth,{note [alt}In the beginning God created heaven and
earth.{alt]]{v] [v}the earth was without form and void, with darkness
over the face of the abyss, [note}and a mighty wind that swept{note [alt}and
the spirit of God hovering{alt]] over the surface of the waters.{s]{v]
[v}[s}God said, [quote}[s}Let there be a light{s]{quote], and there
was light;{v] [v}and God saw that the light was good, and he separated
the light from darkness.{s]{v] [v}[s}He called the light day, and the
darkness night. So evening came, and morning came, the first
day.{s]{v]
{para]
...{chapter]...{section]...{book]

This example demonstrates overlapping ranges (verses and sentences overlap, for example), the use of metaranges in both the start and end tags of ranges, and the annotation of metaranges with further metaranges (here to indicate that the title of the book is specified in English).
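
To make the data model more concrete, here is a rough sketch of how text, ranges and metaranges might be held in memory. It is not the authors' implementation: the class names, the half-open character offsets, the simplified treatment of a metarange as a name plus text, and the sample text are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Metarange:
    """Meta-information attached to a range. It carries its own text and
    can itself be annotated (hypothetical, simplified representation)."""
    name: str
    text: str
    metaranges: list = field(default_factory=list)  # list of Metarange


@dataclass
class Range:
    """A named span over the document text, as half-open offsets [start, end)."""
    name: str
    start: int
    end: int
    metaranges: list = field(default_factory=list)  # list of Metarange


def overlaps(a, b):
    """True when a and b share text but neither fully contains the other."""
    return a.start < b.start < a.end < b.end or b.start < a.start < b.end < a.end


# Hypothetical sample text with two sentences broken across two lines.
text = "One two. Three four five."
sentence1 = Range("sentence", 0, 8)
sentence2 = Range("sentence", 9, 25)
line1 = Range("line", 0, 14)   # the first line ends inside the second sentence
line2 = Range("line", 15, 25)

# A range annotated with a title, which is itself annotated with a language,
# mirroring the shape of [book [title [lang}en{lang]}Genesis{title]} above.
book = Range("book", 0, len(text),
             [Metarange("title", "Genesis", [Metarange("lang", "en")])])
```

Here `overlaps(line1, sentence2)` is true, while `overlaps(line1, sentence1)` is false because the first sentence sits entirely inside the first line. Nothing in the structure ranks lines above sentences or the reverse.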

Using a layered data model as the basis of document markup allows a very different markup style from the tree model, because no one view of the document has to be given a higher priority than another, and it means that all the information, about all the views, is held within a single document. Another feature of the data model is that metaranges can themselves have metaranges, such that every piece of information can be further annotated, for example with language or data type information, with no artificially imposed limit.

An immediate challenge, then, is to extract information and individual hierarchical (XML) views from the LMNL document. So far, we have constructed three prototype applications to facilitate this:

  • An analyser that identifies those range types that overlap with each other, to help users work out which XML hierarchies can be created.

  • A filter that can extract particular range types to create an XML document.

  • An adapted XPath processor to pull out information from a LMNL document using XPath syntax.
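
The first two applications can be approximated in a few lines. The sketch below is not the authors' prototype code; it assumes ranges are given as half-open character offsets over a plain string, that range names are valid XML element names, and that ranges are non-empty. The names `Range`, `overlapping_types` and `to_xml` are hypothetical.

```python
from collections import namedtuple
from itertools import combinations
from xml.sax.saxutils import escape

# Hypothetical minimal range: a name plus half-open offsets [start, end).
Range = namedtuple("Range", "name start end")


def overlapping_types(ranges):
    """Analyser: return the set of range-type pairs that overlap somewhere."""
    pairs = set()
    for a, b in combinations(ranges, 2):
        if a.start < b.start < a.end < b.end or b.start < a.start < b.end < a.end:
            pairs.add(frozenset((a.name, b.name)))
    return pairs


def to_xml(text, ranges, names):
    """Filter: serialize only the range types in `names` as an XML fragment."""
    chosen = [r for r in ranges if r.name in names]
    if overlapping_types(chosen):
        raise ValueError("selected range types overlap; they do not form a tree")
    opens, closes = {}, {}
    indexed = list(enumerate(chosen))
    # Start tags: outermost (longest) range first; input order breaks ties.
    for i, r in sorted(indexed, key=lambda p: (p[1].start, -p[1].end, p[0])):
        opens.setdefault(r.start, []).append(r.name)
    # End tags: innermost range first, mirroring the order used for start tags.
    for i, r in sorted(indexed, key=lambda p: (p[1].end, -p[1].start, -p[0])):
        closes.setdefault(r.end, []).append(r.name)
    out = []
    for pos in range(len(text) + 1):
        out.extend(f"</{n}>" for n in closes.get(pos, []))
        out.extend(f"<{n}>" for n in opens.get(pos, []))
        if pos < len(text):
            out.append(escape(text[pos]))
    return "".join(out)


text = "One two. Three four five."
ranges = [
    Range("doc", 0, 25),
    Range("sentence", 0, 8), Range("sentence", 9, 25),
    Range("line", 0, 14), Range("line", 15, 25),
]
conflicts = overlapping_types(ranges)
sentence_view = to_xml(text, ranges, {"doc", "sentence"})
```

In this sample the analyser reports that `line` and `sentence` overlap, so they cannot share one XML hierarchy, while `doc` with `sentence` (or `doc` with `line`) can each be extracted as a separate view.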

The development of the layered data model and applications to process LMNL leads to some interesting questions that, since XML's tree model can be viewed as a simplified version of a layered data model, also reflect on XML.

For example, what would schemas for LMNL look like and what does validation mean? Both the roles and the functions of schemas (assert conformity to constraints; present an enhanced infoset etc.) have to be considered carefully once we have lost the "safety net" of XML well-formedness.

Just for validation purposes, schemas that performed "lax" validation could be constructed for each hierarchy present within the document, and applied consecutively.

For authoring support, though, a schema needs to be able to tell what ranges are permissible at a given point in the document, something that might depend on a whole range of contextual information rather than simply the covering ranges for the point. This means that a rule-based schema language such as Schematron [5] might be more appropriate than a grammar-based schema language.

Also, how can XML schema languages assist in extracting a specific tree structure from a layered data model? A list of range/element names is a simple solution, but does not reflect desired structures, nor the fact that some ranges can be split or recombined to create a tree structure while others cannot (depending on whether the semantics of the range is distributed or not [6]).
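
Splitting a range so that it fits inside another hierarchy can be sketched as follows. This is an illustrative assumption about what such a split might look like, not a procedure taken from the paper; `Range` and `split_range` are hypothetical names, and offsets are half-open.

```python
from collections import namedtuple

# Hypothetical minimal range: a name plus half-open offsets [start, end).
Range = namedtuple("Range", "name start end")


def split_range(r, containers):
    """Cut `r` at every container boundary that falls strictly inside it,
    so that no resulting segment overlaps any container."""
    cuts = sorted({p for c in containers
                   for p in (c.start, c.end)
                   if r.start < p < r.end})
    edges = [r.start, *cuts, r.end]
    return [Range(r.name, a, b) for a, b in zip(edges, edges[1:])]


# A sentence over "Three four five." that crosses a line break.
lines = [Range("line", 0, 14), Range("line", 15, 25)]
sentence = Range("sentence", 9, 25)
segments = split_range(sentence, lines)
```

The sentence becomes three segments: one inside each line and one covering the single character between them. Whether such segments still mean "a sentence" is exactly the question of distributed versus non-distributed semantics raised above; the code can perform the split but cannot decide whether it is legitimate.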

These issues will be explored in the full paper.

[1] Text Encoding Initiative. The TEI Guidelines. See http://www.tei-c.org/Guidelines2/index.html.

[2] DeRose, Steven J., Eve Maler and Ron Daniel, eds. XML Pointer Language (XPointer) Version 1.0. W3C Candidate Recommendation 11 September 2001. See http://www.w3.org/TR/xptr/.

[3] Durusau, P. & Brook O'Donnell, M. (2001) Implementing concurrent markup. Extreme Markup Languages 2001. See http://www.extrememarkup.com/extreme/2001/ExtremeWeb/author-pkg/p0021.zip.

[4] Huitfeldt, Claus, and C. M. Sperberg-McQueen. TexMECS: An experimental markup meta-language for complex documents. See http://www.hit.uib.no/claus/mlcd/papers/texmecs.html. Rev. 17 February 2001.

[5] Jelliffe, Rick. The Schematron: An XML Structure Validation Language using Patterns in Trees. See http://www.ascc.net/xml/resource/schematron/schematron.html.

[6] Sperberg-McQueen, C. M., Huitfeldt, C. & Renear, A. (2000) Meaning and Interpretation of Markup: not as simple as you think. Extreme Markup Languages 2000. See http://www.gca.org/attend/2000_conferences/Extreme_2000/Papers/Sperberg-MccQueen/mimslides.htm.

[7] Sperberg-McQueen, C. M., and Claus Huitfeldt. GODDAG: A Data Structure for Overlapping Hierarchies. See http://www.hit.uib.no/claus/goddag.html.


Jeni Tennison
http://www.jenitennison.com/


Abstract provided by Jeni Tennison. Prepared by Robin Cover for The XML Cover Pages archive. For general references: "Markup Languages and (Non-) Hierarchies."



Document URL: http://xml.coverpages.org/LMNL-Abstract.html