[Cache from http://www.hit.uib.no/claus/mlcd/papers/texmecs.html 2001-05-10; please use this canonical URL/source for later/revised/authoritative versions if possible.]

TexMECS

An experimental markup meta-language for complex documents

Claus Huitfeldt

C. M. Sperberg-McQueen

25 January 2001

rev. 17 February 2001

1. General design principles
2. Grammar
- 2.1. Delimiters
- 2.2. Syntax
  - 2.2.1 Documents
  - 2.2.2 Empty and virtual elements
  - 2.2.3 Start- and end-tags
  - 2.2.4 Interrupted elements
  - 2.2.5 Entities
  - 2.2.6 CDATA marked sections
  - 2.2.7 Comments
  - 2.2.8 Internal structure of tags
  - 2.2.9 Basic tokens
- 2.3. Lexical forms and tokenizing
- 2.4. Levels
  - 2.4.1 Levels of abstraction
  - 2.4.2 Core language features and convenience features
3. Examples
- 3.1. John loves Mary
- 3.2. Peer Gynt I.i 1-4
- 3.3. Hughie, Louis, and Dewey
- 3.4. Bloody Mary
- 3.5. Des Minnesangs Frühling

This document sketches the outlines of TexMECS,

[The x in the name is pronounced the same way as the cs at the end; the word is thus a homonym of Tex/Mex, a term applied to cuisine and music of the area on the border of the U.S. and Mexico. Theoretically, the name stands for ‘trivially extended MECS’, but this claim is not wholly convincing even to the authors, given that the language described here is not really an extension to MECS, and that the things included here which go beyond MECS are not really trivial.]

a markup language (or, more precisely, a markup meta-language or family of markup languages) intended for experimental work in dealing with complex documents.

TexMECS was developed at the Center for Humanistic Information Technology at the University of Bergen as part of the project Markup Languages for Complex Documents, with support from the Lauritz Meltzer Høyskolefond.

This document assumes some familiarity with XML, SGML, and MECS encoding of documents, with the problems posed for these systems by complex documents, and with the Goddag structure proposed by the authors for representing document structures.

Status: The grammar given in this document reflects the version of TexMECS presented in Bergen the first week of February 2001. The grammar has not yet been proofread to make sure all non-terminals are defined, spelled correctly, etc. The examples are not yet complete and may still include relics of design alternatives recently rejected.

1. General design principles

The basic principles of the design are:

For documents that exhibit a straightforward hierarchical structure, TexMECS should be isomorphic to XML.
For documents that exhibit a suitable structure, TexMECS should be isomorphic to MECS.
Ideally, the syntax should be distinct both from that of XML and from that of MECS. We did consider, at one time, allowing user choice of delimiters, or perhaps just dual syntaxes, one XML-like and one MECS-like. We have decided against that, however, in the interests of simplicity (though CH is suspected of retaining a hidden leaning toward allowing existing MECS and XML documents to be legal TexMECS just as they stand).
Every TexMECS document should be translatable into a GODDAG structure without reference to application-specific semantics. (I.e. the rules for translation into GODDAG structures should be standard for all TexMECS documents.) Every GODDAG structure should be representable as a TexMECS document. It might be desirable to minimize the number of ways a given GODDAG structure can be serialized in TexMECS, but at the moment such minimization doesn't appear possible without arbitrary restrictions.
In general, the syntax should be as simple to define and process as possible; syntactic constructs which do not extend the expressive power of the language but only make it more convenient should be avoided. Where we are unable to avoid or resist them entirely, the definition of TexMECS (this document) should distinguish clearly between core features of the syntax and convenience features, and implementors should implement convenience features only when doing so does not materially complicate the implementation.

2. Grammar

2.1. Delimiters

We have been inspired in part by a proposal made during the development of XML by William D. Lindsey under the subject line "Stupid NET tricks". This proposal was further developed into Lindsey's Lambda Markup Language, which is well worth reading, but for various reasons we don't follow him in all details of that proposal.

In fact, after further deliberation (and meditation on the cost in convenience of using parentheses for markup delimiters), we have diverged from him almost completely. But we still find his proposal useful in brushing out some cobwebs.

In TexMECS, the only characters which have to be escaped in running text are left and right brace; all delimiters defined here are strings beginning or ending with one of these characters, which allows the other characters in the delimiters to be interpreted informally by the user (and by some simple software) as decorations of the tag which help distinguish its function. We distinguish:

empty elements marked by sole-tags: {e id="foo" lang="no"}
normal elements with start- and end-tags: {e id="foo" lang="no"{Peer, du lyver!}e}
interrupted elements, with start-, suspend-, resume-, and end-tags: {e id="foo" lang="no"{Peer,}-e} says Aase in the first scene, {+e{du lyver!}e}
elements with children whose order has no significance and which can therefore be reordered on serialization without loss of information: {|e id="foo" lang="no"|{...}|e|}. For now, we assume that these cannot be interrupted, so we don't foresee special suspend- and resume-tags for these elements.
virtual elements, which have a generic identifier and attributes, and which share children with another element in the document, whose ID is given in the virtual-element tag: {^e^id3 id="foo" lang="no"}
[Note that virtual elements are handled in a way slightly different from the TEI's <join> element. The TEI <join> element corresponds roughly to a TexMECS element whose children are all virtual. The virtual elements we define correspond to the IDREF values in the targets attribute of the <join> element. Each virtual element can point to exactly one other element in the document. In this way we avoid the complex mechanism TEI must provide in order to allow the user to specify whether the elements pointed at themselves become virtual children, or whether their children become virtual children.]
self-overlapping elements, which use a simple co-indexing scheme: tags are co-indexed by a tilde and a suffix of numbers and letters. The suffixes need not be unique: they only mean that the corresponding end- or start-tag must have the same suffix. The normal case of overlap: {e~1{...{e~2{...}e~1}...}e~2}. An abnormal case: if a second start-tag with the same suffix is encountered before the end-tag, then the two elements nest normally. So {e~1{...{e~1{...}e~1}...}e~1} means the same as {e{...{e{...}e}...}e}. Other suffixes can confuse the issue, but don't change the nesting effect of identical suffixes: {e~1{...{e~2{...{e~1{...}e~2}...}e~1}...}e~1}.
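
The way the decorations distinguish the element tags listed above can be made concrete with a small classifier. The following Python sketch is illustrative only: the function name classify_tag and the label strings are assumptions of this example, not part of TexMECS, and it looks only at the opening and closing delimiters of a single, already isolated element tag (entity references, comments, and CDATA sections are not handled).

```python
def classify_tag(tag: str) -> str:
    """Guess the kind of one isolated TexMECS element tag from its delimiters."""
    if len(tag) < 3:
        raise ValueError(f"too short to be a tag: {tag!r}")
    # Decorated delimiters are tested before the bare braces,
    # because every decorated tag also begins and ends with a brace.
    if tag.startswith("{|") and tag.endswith("|{"):
        return "start-tag-set"
    if tag.startswith("}|") and tag.endswith("|}"):
        return "end-tag-set"
    if tag.startswith("{^") and tag.endswith("}"):
        return "virtual-element"
    if tag.startswith("{+") and tag.endswith("{"):
        return "resume-tag"
    if tag.startswith("}-") and tag.endswith("}"):
        return "suspend-tag"
    if tag.startswith("{") and tag.endswith("{"):
        return "start-tag"
    if tag.startswith("}") and tag.endswith("}"):
        return "end-tag"
    if tag.startswith("{") and tag.endswith("}"):
        return "sole-tag"
    raise ValueError(f"not a recognizable tag: {tag!r}")
```

For instance, classify_tag('}-e}') yields "suspend-tag" and classify_tag('{e id="foo" lang="no"}') yields "sole-tag".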

N.B. there is no direct TexMECS equivalent to MECS multi-element codes; by convention, these should be translated into one parent and a sequence of child elements; we use the MECS name for the children, and decorate it with an underscore for the parent. In this case, the underscore is decoration, not a special delimiter. {_e{{e{...}e}{e{...}e}{e{...}e}{e{...}e}}_e}

Entities and character references also use curly braces as delimiters:

Numeric character references, using decimal ({#d228}) or hexadecimal ({#xE2}) references to UCS-4 code points
Internal entities: {&eacute}
Structured internal entities: {&dot . fullstop} vs. {&dot . decimal}.
External entities: {<url>}, e.g. {<vw117-a>} or {<http://www.w3.org/XML>}

External entities and structured internal entities are classed as convenience features; in practice references to internal entities for named characters will be preferred to numeric character references.

Some miscellaneous features must also be listed:

Comments: {* ... *}. Note that comments can nest. In this they resemble conditional sections.
We reserve the delimiters {! ... !} for declarations.
CDATA sections: {#CDATA{ ... }#CDATA}. A convenience feature.

2.2. Syntax

2.2.1. Documents

A document is a sequence of character data, tags of various kinds, and entity references of various kinds.

document ::= /* */
           | document chunk
chunk    ::= sole-tag
           | start-tag | end-tag 
           | suspend-tag | resume-tag
           | start-tag-set | end-tag-set
           | virtual-element
           | internal-entity | external-entity 
           | character-ref
           | cdata-section | comment 
           | datacharacter

2.2.2. Empty and virtual elements

There are two kinds of tags without content (or with empty content, if one prefers that formulation): ‘normal’ ones, and ‘virtual’ elements. The use of the caret for the latter is, yes, inspired in part by the Pascal notation for pointers.

sole-tag        ::= '{' eid atts '}'
virtual-element ::= '{^' eid '^' idref atts '}'  /* WFC: idref OK */

WFC: idref OK. The idref value in a virtual element must appear on some element in the document as the value of an id.
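
A rough check of this constraint can be sketched in Python. Everything named here (dangling_idrefs, the regular expressions) is an assumption of the example, not part of TexMECS; the sketch works directly on the character stream, takes IDs to be introduced by "@" inside an opening tag, and makes no attempt to skip comments, CDATA sections, or escaped braces.

```python
import re

NAME = r"[A-Za-z_][A-Za-z0-9_:.\-]*"
NAMECHARS = r"[A-Za-z0-9_:.\-]*"
# A generic identifier: NAME with optional ~suffix, or a bare ~suffix.
GI = "(?:" + NAME + "(?:~" + NAMECHARS + ")?|~" + NAMECHARS + ")"

# An ID is declared as gi@id right after '{', '{|' or '{^'.
ID_DEF = re.compile(r"\{[|^]?" + GI + r"?@(" + NAME + r")")
# A virtual element is '{^' eid '^' idref; the idref is captured.
VIRTUAL = re.compile(r"\{\^" + GI + r"?(?:@" + NAME + r")?\^(" + NAME + r")")


def dangling_idrefs(text):
    """Return the sorted idrefs of virtual elements that match no declared id."""
    ids = set(ID_DEF.findall(text))
    return sorted({ref for ref in VIRTUAL.findall(text) if ref not in ids})
```

An empty result means that every virtual element in the text points at an ID that is declared somewhere in it.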

2.2.3. Start- and end-tags

Start- and end-tags must match, roughly as in XML, but they need not nest. For every start-tag, a matching end-tag must follow before the end of the TexMECS document; every end-tag must be preceded by a matching start-tag. To match, start- and end-tags must have the same generic identifier and the same suffix, and there must be no closer matching tag. Since the start- and end-tags can come in arbitrary orders, we express the match rule as a well-formedness constraint (WFC), not in the grammar.

start-tag     ::= '{' eid atts '{'   /* WFC: end-tag match */
end-tag       ::= '}' gi '}'         /* WFC: start-tag match */
start-tag-set ::= '{|' eid atts '|{' /* WFC: end-tag-set match */
end-tag-set   ::= '}|' gi '|}'       /* WFC: start-tag-set match */

WFC: end-tag match Each start-tag must be paired with a matching end-tag appearing later in the data stream. Two tags match if and only if they have the same generic identifier (gi) including any suffix. A start-tag s and an end-tag e are paired

[This definition of pairing is borrowed with modifications from the documentation for sgrep.]

if and only if:

s precedes e,
s is not paired with any end-tag earlier than e, and
e is not paired with any start-tag later than s.
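
One reading of this pairing rule is that each end-tag is paired with the closest preceding matching start-tag that is still unpaired. The Python sketch below implements that reading with one stack per generic identifier; the function name pair_tags and the (kind, gi) tuple representation are assumptions made for this example only.

```python
from collections import defaultdict


def pair_tags(tags):
    """Pair start- and end-tags given as (kind, gi) tuples.

    kind is "start" or "end"; gi includes any suffix, e.g. "e~1".
    Returns a sorted list of (start_index, end_index) pairs and raises
    ValueError if some tag cannot be paired.
    """
    open_starts = defaultdict(list)  # gi -> indices of unpaired start-tags
    pairs = []
    for index, (kind, gi) in enumerate(tags):
        if kind == "start":
            open_starts[gi].append(index)
        elif kind == "end":
            if not open_starts[gi]:
                raise ValueError(f"end-tag {gi!r} at {index} has no start-tag")
            # The closest unpaired matching start-tag is paired with this end-tag.
            pairs.append((open_starts[gi].pop(), index))
        else:
            raise ValueError(f"unknown tag kind {kind!r}")
    leftover = sorted(i for stack in open_starts.values() for i in stack)
    if leftover:
        raise ValueError(f"start-tags without end-tags at {leftover}")
    return sorted(pairs)
```

Applied to the tags of {e~1{...{e~2{...}e~1}...}e~2}, the sketch pairs tag 0 with tag 2 and tag 1 with tag 3 (overlap); with identical suffixes throughout, the same four tags nest.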

WFC: start-tag match Each end-tag must be paired with a matching start-tag appearing earlier in the data stream.

WFC: end-tag-set match Each start-tag-set must be paired with a matching end-tag-set appearing later in the data stream.

WFC: start-tag-set match Each end-tag-set must be paired with a matching start-tag-set appearing earlier in the data stream.

2.2.4. Interrupted elements

Elements may be interrupted using suspend-tags, and resumed using resume-tags:

suspend-tag ::= '}-' gi '}' /* WFC: suspend-tag OK */
resume-tag  ::= '{+' gi '{' /* WFC: resume-tag OK */

WFC: suspend-tag OK Each suspend-tag must be paired both with (a) a matching start-tag or resume-tag appearing earlier in the data stream, and (b) a matching resume-tag appearing later in the data stream. A suspend-tag t is paired with a preceding start- or resume-tag rs if and only if:

t and rs match (gi)
rs precedes t,
rs is not paired with any suspend- or end-tag earlier than t, and
t is not paired with any start- or resume-tag later than rs.

A suspend-tag t is paired with a following resume-tag r if and only if:

t and r match
t precedes r,
t is not paired with any resume-tag earlier than r, and
r is not paired with any suspend-tag later than t.

WFC: resume-tag OK Each resume-tag must be paired with a matching suspend-tag appearing earlier in the data stream, and with a matching end- or suspend-tag appearing later in the data stream. A resume-tag is paired with a preceding suspend-tag as described in WFC: suspend-tag OK. A resume-tag r is paired with a following end- or suspend-tag es if and only if:

r and es match
r precedes es,
r is not paired with any suspend- or end-tag earlier than es, and
es is not paired with any resume-tag later than r.

[In cases where several discontinuous elements with the same base generic identifier overlap each other, these rules may not always produce intuitive results — if it can be said that users actually have intuitions about the proper behavior here. The generation of a suitably large set of examples for discussion (and software testing) remains a requirement for future work.]
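
The effect of suspend- and resume-tags on the content of an element can be sketched as follows. This is a minimal Python illustration under stated assumptions: tags carry no attributes or suffixes, the input is well-formed, and a resume-tag continues the most recently suspended element with the same generic identifier. The names logical_contents, active, and suspended are hypothetical.

```python
import re

NAME = r"[A-Za-z_][A-Za-z0-9_:.\-]*"
# Alternative 1: start-tag '{gi{' or resume-tag '{+gi{'.
# Alternative 2: end-tag '}gi}' or suspend-tag '}-gi}'.
TAG = re.compile(r"\{(\+?)(" + NAME + r")\{|\}(-?)(" + NAME + r")\}")


def logical_contents(text):
    """Return (gi, character data) for each element, in end-tag order."""
    active = {}      # gi -> stack of open, non-suspended elements
    suspended = {}   # gi -> stack of suspended elements
    finished = []
    pos = 0
    for match in TAG.finditer(text):
        data = text[pos:match.start()]
        pos = match.end()
        # Character data belongs to every element that is currently active.
        for stack in active.values():
            for pieces in stack:
                pieces.append(data)
        plus, open_gi, minus, close_gi = match.groups()
        if open_gi is not None and plus == "":
            active.setdefault(open_gi, []).append([])          # start-tag
        elif open_gi is not None:
            resumed = suspended[open_gi].pop()                 # resume-tag
            active.setdefault(open_gi, []).append(resumed)
        elif minus == "-":
            paused = active[close_gi].pop()                    # suspend-tag
            suspended.setdefault(close_gi, []).append(paused)
        else:
            finished.append((close_gi, "".join(active[close_gi].pop())))
    return finished
```

Data that occurs while an element is suspended, such as a stage direction between the two halves of a verse line, is not added to that element's content.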

2.2.5. Entities

Entities may be internal or external. The replacement text of internal entities may have character references and entity references, but not tags. Internal entities may have two names; the second identifies the semantic function, if the first is ambiguous.

[We currently have no rules about how entities and identifiers for semantic functions are used or declared.]

External entities are not declared; they are just URLs embedded in the document. They may be relative (and if so, they are interpreted using any xml:base information in an ancestor; if conflicting xml:base information is given, the results are undefined).

internal-entity ::= '{&' NAME '}'
                  | '{&' NAME S '.' S NAME '}' 
                    /* CF: structured internal entities */
external-entity ::= '{<' URL '>}'
                    /* CF: external entities */

Convenience feature Structured internal entities are a convenience feature and may not be implemented by all TexMECS software.

Convenience feature External entities are a convenience feature and may not be implemented by all TexMECS software.

We also allow numeric character references, using a syntax similar to XML's.

character-ref ::= '{#' [dD] [0-9]+ '}'
                | '{#' [xX] [A-Fa-f0-9]+ '}'
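
As an illustration, the two productions above can be recognized with a single regular expression. This Python sketch assumes that the references have already been located in running text and simply replaces each with the character it denotes; the names CHAR_REF and expand_char_refs are hypothetical, and no check is made that the code point is a legal CHAR.

```python
import re

# '{#' [dD] digits '}'  or  '{#' [xX] hex-digits '}'
CHAR_REF = re.compile(r"\{#(?:[dD]([0-9]+)|[xX]([0-9A-Fa-f]+))\}")


def expand_char_refs(text):
    """Replace every numeric character reference with its character."""
    def replace(match):
        decimal, hexadecimal = match.groups()
        if decimal is not None:
            code_point = int(decimal)
        else:
            code_point = int(hexadecimal, 16)
        # chr() raises ValueError for values beyond U+10FFFF.
        return chr(code_point)
    return CHAR_REF.sub(replace, text)
```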

2.2.6. CDATA marked sections

CDATA marked sections are part of the convenience layer, not part of the core. They can nest, and differ in this way from the corresponding feature of SGML and XML. Implementors of translation software should beware.

cdata-section ::= '{#CDATA{' cdchars '}#CDATA}'
                  /* CF: CDATA sections */
cdchars       ::= CHAR* - (CHAR* ('{#CDATA{' | '}#CDATA}') CHAR*)

Convenience feature CDATA sections are a convenience feature and may not be implemented by all TexMECS software.

2.2.7. Comments

Comments can also nest.

comment     ::= '{*' commcontent '*}'
commcontent ::= /* */
              | commcontent commentdata
              | commcontent comment
commentdata ::= CHAR+ - (CHAR* ('{*' | '*}') CHAR*)
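
Because comments nest, a scanner cannot simply search for the first '*}' after a '{*'; it has to keep a depth count. A minimal Python sketch, assuming that a backslash-escaped brace never opens or closes a comment (strip_comments is a hypothetical name):

```python
def strip_comments(text):
    """Remove all (possibly nested) comments from text."""
    out = []
    depth = 0
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in ("\\{", "\\}"):
            # An escaped brace is ordinary data and never a delimiter.
            if depth == 0:
                out.append(pair)
            i += 2
        elif pair == "{*":
            depth += 1
            i += 2
        elif pair == "*}" and depth > 0:
            depth -= 1
            i += 2
        else:
            if depth == 0:
                out.append(text[i])
            i += 1
    if depth != 0:
        raise ValueError("unterminated comment")
    return "".join(out)
```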

2.2.8. Internal structure of tags

ID and IDREF have uses on several levels. We expect users to be able to use IDs and IDREFs in the usual ways familiar from SGML and XML, but we want to allow TexMECS processors to construct the appropriate GODDAG structure without reference to declarations, so we specify a form for IDs which is syntactically evident even without declarations.

eid   ::= gi ("@" id)? | "@" id
gi    ::= NAME SUFFIX? | SUFFIX
id    ::= NAME                   /* WFC: unique ID */
idref ::= NAME                   /* WFC: idref OK */

Attributes are defined much the same way they are in XML:

atts ::= avs* S?
avs  ::= S NAME S? '=' S? QUOTED

(Note: avs is short for ‘attribute-value specification’.)

2.2.9. Basic tokens

The tokens CHAR, NAME, QUOTED, and S deserve a bit more comment.

Every TexMECS processor will accept ASCII data streams; processors may also accept UTF-8 and UTF-16 data streams.

In an ASCII data stream, from the point of view of the parser, the lexer can return as a CHAR any ASCII character:

CHAR ::= [#x00-#x7F]

From the point of view of the lexical scanner, the data stream contains either single characters which are legal characters, or else it contains curly braces which have been escaped with a backslash.

CHAR ::= ([#x00-#x7F] - [{}]) /* braces cannot occur 'naturally' */
       | '\{'                 /* but may occur if escaped */
       | '\}'
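
The escaping convention can be sketched as a pair of helper functions. The names escape_data and unescape_data are hypothetical; the sketch assumes, as stated above, that the two braces are the only characters needing an escape.

```python
def escape_data(text):
    """Prefix every brace with a backslash so it can appear in running text."""
    return text.replace("{", "\\{").replace("}", "\\}")


def unescape_data(text):
    """Undo escape_data: drop the backslash in front of each escaped brace."""
    return text.replace("\\{", "{").replace("\\}", "}")
```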

In a Unicode/ISO 10646 data stream, similarly, braces must be escaped, but the range of legal characters includes all the characters in the basic multilingual plane or any of the 16 other planes. Surrogate characters (each of which encodes one half of a character in one of the 16 planes beyond the Basic Multilingual Plane) are not visible to the parser:

CHAR ::= [#x00-#xD7FF]
       | [#xE000-#xFFFD]
       | [#x10000-#x10FFFF] /* any Unicode character, 
                               excluding the surrogate blocks, 
                               FFFE, and FFFF. */

(Note, however, that if a system were to cheat and have the lexical scanner simply return 16-bit characters, including surrogates, it would make no difference for TexMECS parsing.)

Whether the datastream is ASCII or Unicode/ISO 10646, data characters are defined in terms of the primitive CHAR:

datacharacter ::= CHAR

The NAME token is returned only within markup. A NAME is restricted to ASCII characters (as a reminder that this is an experimental system, not a production system):

NAME     ::= Nameinit Namechar*
Nameinit ::= [a-zA-Z_]
Namechar ::= Nameinit | [0-9] | ':' | '.' | '-'
SUFFIX   ::= '~' Namechar*

Note that while formally names can begin with underscores, it is recommended that underscores be used only in the translation of MECS poly-element codes.

Implementors might also consider recognizing tildes as name characters, and thus allowing GIs with suffixes to be passed as single tokens. Any implementor taking this approach should be sure to enforce the rule that there may be only one tilde in a name token.
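
A scanner that passes a GI with its suffix as a single token might split and validate it along the following lines. This Python sketch is an assumption about one possible implementation; the names GI_TOKEN and split_gi are hypothetical.

```python
import re

# Optional NAME, then an optional suffix introduced by a single tilde.
GI_TOKEN = re.compile(
    r"(?P<name>[A-Za-z_][A-Za-z0-9_:.\-]*)?"
    r"(?:~(?P<suffix>[A-Za-z0-9_:.\-]*))?"
)


def split_gi(token):
    """Split a GI token into (name, suffix); either part may be None."""
    match = GI_TOKEN.fullmatch(token)
    if token == "" or match is None:
        raise ValueError(f"not a legal generic identifier: {token!r}")
    return match.group("name"), match.group("suffix")
```

A token with two tildes, such as e~1~2, is rejected because the tilde is not among the characters allowed in the suffix.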

Quoted strings are used only for attribute values. Internal entity references are legal within them, but not tags.

QUOTED   ::= '"' dqstring '"' 
           | "'" sqstring "'"
dqstring ::= ((CHAR - '"') | internal-entity | character-ref)*
sqstring ::= ((CHAR - "'") | internal-entity | character-ref)*

S (space) is recognized as a token only within markup.

S ::= (#x20 | #x9 | #xD | #xA | #x85 | #x2028 | #x2029)+

URLs are, for our purposes, just strings of characters.

URL ::= (CHAR - '>')*

2.3. Lexical forms and tokenizing

We distinguish various kinds of tokens and give them names:

EETAGO = "{"            // empty-element tag open and close
EETAGC = "}"
STAGO = "{"             // start-tag
STAGC = "{"
ETAGO = "}"             // end-tag
ETAGC = "}"
ITAGO = "}-"            // suspend (interrupt) tag
ITAGC = "}"
RTAGO = "{+"            // resume-tag
RTAGC = "{"
STAGSO = "{|"           // start-tag for set
STAGSC = "|{"
ETAGSO = "}|"           // end-tag for set
ETAGSC = "|}"
VTAGO = "{^"            // virtual element tag open
VTAGC = "}"
VSEP = "^"              // virtual-element gi/target separator
SUFFIXSEP = "~"         // suffix separator (for self-overlap)
ERO = "{&"              // entity-reference open
ENVS = " . "            // entity name-value separator
ERC = "}"
EERO = "{<"             // external-entity reference open, close
EERC = ">}"
COMO = "{*"             // comment open, close
COMC = "*}"
MDO = "{!"              // markup declaration open, close
MDC = "!}"
CDATAMSO = "{#CDATA{"   // CDATA marked-section start, close
CDATAMSC = "}#CDATA}"

Since the user is not allowed to redefine these delimiters, this list and the names are provided primarily as a reference for programmers and out of habit; in the actual grammar above, we use the literals rather than these names.

The lexical scanner will need to have different modes:

One mode for content, in which it returns
- '{' : the start of a start-tag (STAGO) or sole-tag (EETAGO)
- '{|' : the start of a start-tag for an unordered element (STAGSO)
- '{+' : the start of a resume-tag for an interrupted element (RTAGO)
- '{^' : the start of a tag for a virtual element (VTAGO)
- '{&' : the start of an internal entity reference (ERO)
- '{#d' : the start of a decimal character reference (DCRO)
- '{#x' : the start of a hexadecimal character reference (HCRO)
- '{<' : the start of an external entity reference (EERO)
- '{*' : the start of a comment (COMO)
- '{!' : the start of a declaration (reserved for future use) (MDO)
- '{#CDATA{' : the start of a CDATA marked section (CDATAMSO)
- '}' : the start of an end-tag (ETAGO)
- '}|' : the start of an end-tag for an unordered element (ETAGSO)
- '}-' : the start of a suspend-tag for an interrupted element (ITAGO)
- CHAR : a data character (see below)
One mode for markup, in which it returns
- '{' : the end of a start-tag or resume-tag (STAGC, RTAGC)
- '|{' : the end of a start-tag for an unordered element (STAGSC)
- '|}' : the end of an end-tag for an unordered element (ETAGSC)
- '}' : the end of a virtual-element tag, sole-tag, suspend-tag, internal entity reference, character-reference, end-tag (VTAGC, EETAGC, ITAGC, ERC, CRC, ETAGC)
- '>}' : the end of an external entity reference (EERC)
- NAME : an identifier (see below)
- SUFFIX : an identifier suffix (see below)
- '=' : an attribute-value separator
- '.' : an entity-name-value separator
- QUOTED : a quoted string (for attribute values)
- S : white space (for some purposes need not be passed on, can be consumed by lexical scanner)
One mode for comments, in which it either scans to the end of the comment or returns
- '{*' : start of a new comment
- '*}' : end of a comment
One mode for CDATA sections, in which it returns
- '{#CDATA{' : start of a nested CDATA section
- '}#CDATA}' : end of a CDATA section
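
In content mode the scanner has to prefer the longest delimiter that matches at the current position, since every delimiter begins with a bare brace. The Python sketch below shows only that longest-match step for a single position; it does not switch modes, it leaves out character references, and the names CONTENT_DELIMITERS and content_token are assumptions of this example.

```python
CONTENT_DELIMITERS = {
    "{#CDATA{": "CDATAMSO",
    "{|": "STAGSO",
    "{+": "RTAGO",
    "{^": "VTAGO",
    "{&": "ERO",
    "{<": "EERO",
    "{*": "COMO",
    "{!": "MDO",
    "}|": "ETAGSO",
    "}-": "ITAGO",
    "{": "STAGO",   # also EETAGO; only the parser can tell which
    "}": "ETAGO",
}
# Longest delimiters first, so that '{|' is tried before '{'.
BY_LENGTH = sorted(CONTENT_DELIMITERS, key=len, reverse=True)


def content_token(text, pos=0):
    """Return (token_name, lexeme) for the content-mode token at pos."""
    if pos >= len(text):
        raise ValueError("no token at end of input")
    if text[pos] == "\\" and text[pos + 1:pos + 2] in ("{", "}"):
        return ("CHAR", text[pos:pos + 2])   # escaped brace is data
    for delimiter in BY_LENGTH:
        if text.startswith(delimiter, pos):
            return (CONTENT_DELIMITERS[delimiter], delimiter)
    return ("CHAR", text[pos])
```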

We do not define a mode for markup declarations. We simply rely on the rule that curly braces are not legal unless used as a delimiter or escaped.

2.4. Levels

2.4.1. Levels of abstraction

Since one of the goals of TexMECS is to devise a linear notation for non-linear objects (the GODDAG structure), we are apt to fall prey to confusion if we don't carefully distinguish the linear features of the markup language from the non-linear features of the implicit data structure.

We can identify the following levels:

octet stream: this is what comes from the disk, or from the port. It is a stream of bits, segmented into groups of eight.
character stream: this is the level at which the TexMECS grammar describes the TexMECS format. In theory, it doesn't matter whether the octet stream encodes the characters in Unicode, in ASCII, or in EBCDIC.
parse tree: the literal parse tree over the character stream which results from applying the grammar given above; this is not quite what we want to work with
abstract syntax graph (ASG): this structure corresponds fairly closely to the literal parse tree, but abstracts away some details. For each pair of start- and end-, start- and suspend-, resume- and suspend-, or resume- and end-tags, it has a node. (Interrupted elements, therefore, may have multiple nodes in the ASG.) We call the nodes of this graph the overt elements of the document. (If we actually rely on any of these details, then we should try to figure out whether this should be a tree or a more general graph.)
GODDAG structure: this is constructed from the ASG by performing various operations such as merging overt nodes which belong, according to rules of TexMECS, to the same logical nodes, erasing sequence information from children of unordered elements, etc.
application-specific structure: this may be constructed from the GODDAG structure by applications following application-specific rules. It is none of our business, but it's worth noting that applications have the right to have a separate layer here if they wish.
the ‘ideal’ structure: the structure as it exists in some human mind; a reflection of the textual structure as someone understands it. We mention this level only because it may be useful, from time to time, to talk about markup error. An encoder wishes to mark up the text in such a way as to generate a GODDAG structure, or an application-specific structure, which agrees with an ideal structure in salient ways. This may not always work, either because of encoder error or because the markup system cannot generate a structure which agrees with the ideal structure in the required ways.
the text: an abstract object, which we agree by convention to regard in some ways as a single thing. Different readers may associate different ideal structures with a text; when they argue about which one is right, they are arguing, perhaps, about which ideal structure most closely approximates the text itself. Or perhaps they are arguing about which ideal structure most closely approximates the view which should be taken of the text itself for some particular purpose.

2.4.2. Core language features and convenience features

This specification distinguishes between core language features and so-called ‘convenience’ features.

Since TexMECS is intended as an experimental, not a production system, there is some reason to ignore questions of practical convenience in order to simplify the task of describing the language. The practical experience of the authors, however, has rendered them incapable of taking such an austere view.

As a compromise, we have limited the convenience features to those which we think are likely to have a relatively low implementation cost and a relatively high benefit to users (in the foreseeable future, this means: to ourselves and our collaborators). And we have explicitly labeled the convenience features as a signal to implementors that these features should normally be implemented last.

The convenience features of TexMECS are:

structured internal entities
external entities
CDATA marked sections

Other features of TexMECS are classed as core language features.

3. Examples

In this section, we illustrate some of the constructs in the language.

3.1. John loves Mary

The trivial example used in our GODDAG paper is the sentence John loves Mary, with an s element dominating the entire sentence, an a element dominating the first two words, and a b element dominating the last two words.

In TexMECS:

{s{{a{ John {b{ loves }a} Mary }b}}s}
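
That this encoding gives each element the intended words can be checked mechanically. The following Python sketch handles only attribute-free start- and end-tags without suffixes and assumes well-formed input; element_words is a hypothetical helper, not part of any TexMECS software.

```python
import re

NAME = r"[A-Za-z_][A-Za-z0-9_:.\-]*"
# Group 1 is set for a start-tag '{gi{', group 2 for an end-tag '}gi}'.
TAG = re.compile(r"\{(" + NAME + r")\{|\}(" + NAME + r")\}")


def element_words(text):
    """Return (gi, words dominated by the element) in end-tag order."""
    plain = ""          # character data seen so far, with tags removed
    open_offsets = {}   # gi -> stack of offsets into plain
    result = []
    pos = 0
    for match in TAG.finditer(text):
        plain += text[pos:match.start()]
        pos = match.end()
        start_gi, end_gi = match.groups()
        if start_gi is not None:
            open_offsets.setdefault(start_gi, []).append(len(plain))
        else:
            begin = open_offsets[end_gi].pop()
            result.append((end_gi, plain[begin:].split()))
    return result
```

For the sentence above, the sketch reports John and loves for a, loves and Mary for b, and all three words for s.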

3.2. Peer Gynt I.i 1-4

The first few lines of Peer Gynt illustrate standard multiple-hierarchy overlap, with interruptions and fragmentary elements. Our first version of this uses speaker elements; we have to interrupt the verse line elements for these, and as a result, this version has a normal tree structure at the abstract-syntax level (i.e. the overt elements here form a tree).

{sp{{speaker{AASE}speaker}{l{Peer, you're lying!}-l}}sp}
{sp{{speaker{PEER GYNT }speaker} 
{stage{without stopping}stage}{+l{No, I'm not!}l}}sp}
{sp{{speaker{AASE}speaker}{l{Well then, swear to me it's true.}l}}sp}
{sp{{speaker{PEER GYNT}speaker}{l{Swear?  why should I?}-l}}sp}
{sp{{speaker{AASE}speaker}{+l{See, you dare not!}l}
{l{Every word of it's a lie.}l}}sp}

Our second version uses attributes for the speaker identifications, and allows stage directions to be embedded in lines (which requires application-level semantics, if it is desired to allow users to believe, in some way, that words in stage directions are not “in” the verse lines).

{sp who="AASE"{{l{Peer, you're lying!}sp}
{sp who="PEER GYNT"{
{stage{without stopping}stage}No, I'm not!}l}}sp}
{sp who="AASE"{{l{Well then, swear to me it's true.}l}}sp}
{sp who="PEER GYNT"{{l{Swear?  why should I?}sp}
{sp who="AASE"{See, you dare not!}l}
{l{Every word of it's a lie.}l}}sp}

Our third version uses attributes for the speaker identifications, but does not allow stage directions to be embedded in lines.

{sp who="AASE"{{l{Peer, you're lying!}sp}
{sp who="PEER GYNT"{
}-l}{stage{without stopping}stage}{+l{No, I'm not!}l}}sp}
{sp who="AASE"{{l{Well then, swear to me it's true.}l}}sp}
{sp who="PEER GYNT"{{l{Swear?  why should I?}sp}
{sp who="AASE"{See, you dare not!}l}
{l{Every word of it's a lie.}l}}sp}

3.3. Hughie, Louis, and Dewey

This example shows a simple re-ordering. We imagine Hughie, Louis, and Dewey trying to remember a famous haiku, remembering the lines out of order, and a markup using virtual elements which reconstructs the poem in its normal order (e.g. in order that proximity searches on the words of the poem can find it, even though it is only virtually present in the document).

{sp who="HUGHIE"{{p{How did that translation go?}p}
{lg type="haiku"{{l{da de dum de dum,}l}
{l@frog{gets a new frog,}l}
{l{...}l}}lg}
}sp}
{sp who="LOUIS"{{p{Er ...}p}
{lg{{l@new{it's a new pond.}l}}lg}
}sp}
{sp who="DEWEY"{
{p{Ah ...}p}
{lg{{l@pond{When the old pond}l}}lg}
{p{Right.  That's it.}p}
}sp}
{lg{{^l^pond}{^l^frog}{^l^new}}lg}

3.4. Bloody Mary

In their musical South Pacific, Richard Rodgers and Oscar Hammerstein II include a song about a Polynesian woman named Bloody Mary, in which a chorus of sailors sings her virtues, or rather her qualities which do not normally qualify as virtues, and declares that nevertheless “Bloody Mary is the girl I love.”

One line of this song appears in three different forms in the book as published, in the Broadway cast recording, and in the commercially published sheet music for the song.

[ Unfortunately, we are reproducing this from memory and so although we are confident of the various readings we can't remember which came from where.]

One (let us call it version A) reads “Her skin is tender as DiMaggio's glove.” By the principle of lectio difficilior we can infer that this is the reading of the line from which the other two are derived: source B reads: “Her skin is tender as a baseball glove.” And the third source (C) reads: “Her skin is tender as a leather glove.”

One possible TexMECS encoding of this might use text-critical apparatus of the kind provided by the TEI.

{lg{
{l{Bloody Mary is the girl I love.}l}
{l{Bloody Mary is the girl I love.}l}
{l{Bloody Mary is the girl I love.}l}
{l{Now ain't that too damn bad.}l}
}lg}
{lg{
{l{Bloody Mary's chewing betel nuts.}l}
{l{She is always chewing betel nuts.}l}
{l{Bloody Mary's chewing betel nuts}l}
{l{And she don't use Pepsodent.}l}
}lg}
{lg{
{l{Her skin is tender as
{app{
{rdg wit="A"{DiMaggio's}rdg}
{rdg wit="B C"{a
{app{{rdg wit="B"{baseball}rdg}
{rdg wit="C"{leather}rdg}}app}
}rdg}
}app}
glove.}l}
...
{l{Now ain't that too damn bad.}l}
}lg}

Another might use virtual elements for alternate readings.

{* Witness A *}
{lg@stanza-1{
{l{Bloody Mary is the girl I love.}l}
{l{Bloody Mary is the girl I love.}l}
{l{Bloody Mary is the girl I love.}l}
{l{Now ain't that too damn bad.}l}
}lg}
{lg@stanza-2{
{l{Bloody Mary's chewing betel nuts.}l}
{l{She is always chewing betel nuts.}l}
{l{Bloody Mary's chewing betel nuts}l}
{l{And she don't use Pepsodent.}l}
}lg}
{lg{
{l@l3.1{{@L3.1a{Her skin is tender as}} {@dm{DiMaggio's}} {@L3.1z{glove.}}}l}
{^l@l3.2^l3.1}
{^l@l3.3^l3.1}
{l@l3.4{Now ain't that too damn bad.}l}
}lg}
...
{* Witness B *}
{^lg^stanza-1}
{^lg^stanza-2}
{lg{
{l@B-l3.1{{^L3.1a} a baseball {^L3.1z}}l}
{^l^B-l3.1}
{^l^B-l3.1}
{^l^l3.4}
}lg}
...
{* Witness C *}
{^lg^stanza-1}
{^lg^stanza-2}
{lg{
{l@C-l3.1{{^L3.1a} a leather {^L3.1z}}l}
{^l^C-l3.1}
{^l^C-l3.1}
{^l^l3.4}
}lg}

Yet another encoding exploits discontiguous elements to encode the text without any virtual elements:

{lg wit="A B C"{
{l{Bloody Mary is the girl I love.}l}
{l{Bloody Mary is the girl I love.}l}
{l{Bloody Mary is the girl I love.}l}
{l{Now ain't that too damn bad.}l}
}lg}
{lg wit="A B C"{
{l{Bloody Mary's chewing betel nuts.}l}
{l{She is always chewing betel nuts.}l}
{l{Bloody Mary's chewing betel nuts}l}
{l{And she don't use Pepsodent.}l}
}lg}
{lg wit="A B C"{
{l~A@A3.1 wit="A"{{l~B@B3.1 wit="B"{{l~C@C3.1 wit="C"{
Her skin is tender as
}-l~B}}-l~C}DiMaggio's}-l~A}
{+l~B{a baseball}-l~B}
{+l~C{a leather{+l~A{{+l~B{
glove.
}l~A}}l~B}}l~C}
{^l@A3.2^A3.1 wit="A"}
{^l@A3.3^A3.1 wit="A"}
{^l@B3.2^B3.1 wit="B"}
{^l@B3.3^B3.1 wit="B"}
{^l@C3.2^C3.1 wit="C"}
{^l@C3.3^C3.1 wit="C"}
{l@l3.4 wit="A B C"{Now ain't that too damn bad.}l}
}lg}

3.5. Des Minnesangs Frühling

In many cases, different witnesses to a literary work exhibit different orderings for the material. Here is an example from the collection of Middle High German love poetry compiled by Karl Lachmann (159,1 Reinmar “Ich wirbe umb allez daz ein man”).

We first show an encoding of the entire poem (from Kraus's 1950 edition); an encoding for serious use would also include apparatus criticus, but that would only distract from our main point here, so we omit that material.

{text{{body{
{lg@Sic{
{l{Ich wirbe umb allez daz ein man}l}
{l{ze wereltlîchen fröiden iemer haben sol.}l}
{l{daz ist ein wîp der ich enkan}l}
{l{nâch ir vil grôzen werdekeit gesprechen wol.}l}
{l{lob ich si sô man ander frowen tuot,}l}
{l{dazn nimet eht si von mir niht für guot.}l}
{l{do swer ich des, sist an der stat}l}
{l{dâz ûz wîplîchen tugenden nie fuoz getrat.}l}
{l{daz ist in mat.}l}
}lg}
{lg@Ssi{
{l{Si ist mir liep, und dunket mich}l}
{l{daz ich ir volleclîche gar unmære sî.}l}
{l{nu waz dar umbe? daz lîd ich,}l}
{l{und bin ir doch mit triuwen stæteclîchen bî.}l}
{l{waz obe ein wunder lîhte an mir geschiht,}l}
{l{daz si mich eteswenne gerne siht?}l}
{l{sâ denne lâze ich âne haz,}l}
{l{swer giht daz ime an fröiden sî gelungen baz:}l}
{l{der habe im daz.}l}
}lg}
{lg@Sal{
{l{Als eteswenne mir der lîp}l}
{l{dur sîne bœse unstæte râtet daz ich var}l}
{l{und mir gefriunde ein ander wîp,}l}
{l{sô wil iedoch daz herze niender wan dar.}l}
{l{wol ime des deiz sô reine welen kan}l}
{l{und mir der süezen arbeite gan.}l}
{l{des hân ich mir ein liep erkorn}l}
{l{dem ich ze dienste, und wære ez al der werlte zorn,}l}
{l{muoz sîn geborn.}l}
}lg}
{lg@Ssw{
{l{Swaz jâre ich noch ze lebenne hân,}l}
{l{swie vil der wære, irn wurde ir niemer tac genommen.}l}
{l{sô gar bin ich ir undertân}l}
{l{daz ich unsanfte ûz ir genâden möhte komen.}l}
{l{ich fröu mich des daz ich ir dienen sol.}l}
{l{si gelônet mir mit lîhten dingen wol:}l}
{l{geloube eht mir, swenn ich ir sage}l}
{l{die nôt diech inme herzen von ir schulden trage}l}
{l{dick inme tage.}l}
}lg}
{lg@Sun{
{l{Und ist daz mirs mîn sælde gan}l}
{l{deich abe ir redendem munde ein küssen mac versteln,}l}
{l{gît got deichz mit mir bringe dan,}l}
{l{sô wil ichz tougenlîche tragen und iemer heln.}l}
{l{und ist daz siz für grôze swære hât}l}
{l{und vêhet mich dur mîne missetât,}l}
{l{waz tuon ich danne, unsælic man?}l}
{l{dâ heb i'z ûf und legez hin wider dâ ichz dâ nan,}l}
{l{als ich wol kan.}l}
}lg}
}body}
}text}

Next, we show how the differences in stanza sequence among the manuscripts and editors can be recorded. Lachmann, Kraus, and manuscript E give the stanzas in the order shown above:

{* Lachmann, Kraus, and E *}
{text{{body{
{^lg^Sic}{^lg^Ssi}{^lg^Sal}{^lg^Ssw}{^lg^Sun}
}body}}text}

Manuscripts b and C give the stanzas in a different order:

{* b and C *}
{text{{body{
 {^lg^Sic}{^lg^Sal}{^lg^Sun}
 {^lg^Ssi}{^lg^Ssw}
}body}}text}

Manuscript A, as so often, stands isolated:

{* A *}
{text{{body{
 {^lg^Ssw}{^lg^Sic}{^lg^Ssi}
 {^lg^Sal}{^lg^Sun}
}body}}text}
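
The three orderings can be checked mechanically. The following Python sketch extracts the stanza identifiers from each body and confirms that every witness points at the same five stanzas, differing only in sequence. The regular expression, the dictionary ORDERS, and the function names are assumptions made for this illustration; the witness labels are taken from the comments in the encodings above.

```python
import re

# Virtual lg element pointing at a stanza id, e.g. {^lg^Sic} (simplified).
REF = re.compile(r"\{\^lg\^(\w+)\}")

# Bodies of the three orderings shown above, keyed by witness group.
ORDERS = {
    "Lachmann, E": "{^lg^Sic}{^lg^Ssi}{^lg^Sal}{^lg^Ssw}{^lg^Sun}",
    "b, C": "{^lg^Sic}{^lg^Sal}{^lg^Sun} {^lg^Ssi}{^lg^Ssw}",
    "A": "{^lg^Ssw}{^lg^Sic}{^lg^Ssi} {^lg^Sal}{^lg^Sun}",
}

# Identifiers assigned to the stanzas in the full encoding of the poem.
DEFINED = {"Sic", "Ssi", "Sal", "Ssw", "Sun"}


def stanza_order(body):
    """Return the stanza ids in the order the body refers to them."""
    return REF.findall(body)


def inconsistent_witnesses(orders, defined):
    """List witnesses whose references are not a permutation of the stanzas."""
    return [
        witness
        for witness, body in orders.items()
        if sorted(stanza_order(body)) != sorted(defined)
    ]


for witness, body in ORDERS.items():
    print(witness, stanza_order(body))
print("inconsistent:", inconsistent_witnesses(ORDERS, DEFINED))
```

A check of this kind also catches a reference to an identifier that no stanza declares, since such a witness would not be a permutation of the defined set.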

1. Notation

The notation used in the formal grammar here is based on that used in the XML 1.0 specification.

[We considered using Niklaus Wirth's notation, but decided against it on the grounds that its use of curly braces would be confusing in the context of TexMECS.]

The grammar is a set of rules, each consisting of a non-terminal symbol (the ‘left-hand side’), a separator (::=), and an expression (the ‘right-hand side’). The right-hand side consists of a set of alternatives, each alternative consisting of a sequence of terminal symbols (quoted), non-terminal symbols, character-range references, character references, or sub-expressions.

Symbols and expressions may be suffixed with an asterisk, plus sign, or question mark (‘occurrence indicators’) to indicate that they can occur zero or more times, one or more times, or zero or one time, respectively.

This notation can of course be used to define itself:

Grammar    ::= Rule*
Rule       ::= Nonterminal '::=' Expression
Expression ::= Seq ('|' Seq)* | Diff
Seq        ::= Term+ 
Term       ::= Factor [*+?]? 
Factor     ::= Nonterminal | Literal | Charref | Charset | '(' Expression ')'
Diff       ::= Term '-' Term

Nonterminal ::= [a-zA-Z][a-zA-Z0-9-]*                /* any simple name */
Literal     ::= ('"' [^"]* '"') | ("'" [^']* "'")    /* any quoted string */
Charref     ::= '#x'[a-zA-Z0-9]+                     /* a hex reference */
Charset     ::= '[' '^'? (Charspec)+ ']'             /* a bracketed group */
Charspec    ::= (SChar ('-' SChar)?) | (Charref ('-' Charref)?)
SChar       ::= Char - ('^' | '[' | ']' | '-')       /* no []^- in Charsets! */
Char        ::= [#x20-#x7E]

Comments take the form “/* ... */”. Comments which begin “WFC” name well-formedness constraints which conforming processors are required to check. Comments beginning with “CF” identify convenience features. Note that the bracketed character sets are a simplified form of those familiar from many regular-expression utilities.
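
For readers who want to experiment with the notation, the lexical rules above translate fairly directly into regular expressions. The following Python sketch splits a single rule into tokens. It is an approximation written for this illustration, not part of the specification: the token names and the function tokenize are assumptions, and comments in either the /* ... */ or the // form are simply skipped.

```python
import re

# One alternative per token class; order matters (literals before charsets).
TOKEN = re.compile(r"""
    (?P<comment>/\*.*?\*/|//[^\n]*)
  | (?P<defines>::=)
  | (?P<charref>\#x[0-9A-Za-z]+)
  | (?P<literal>"[^"]*"|'[^']*')
  | (?P<charset>\[\^?[^\]]+\])
  | (?P<nonterminal>[A-Za-z][A-Za-z0-9-]*)
  | (?P<op>[|()*+?-])
  | (?P<space>\s+)
""", re.VERBOSE | re.DOTALL)


def tokenize(rule):
    """Return (kind, text) pairs for one grammar rule, dropping blanks and comments."""
    tokens = []
    pos = 0
    while pos < len(rule):
        m = TOKEN.match(rule, pos)
        if m is None:
            raise ValueError(f"unexpected character {rule[pos]!r} at offset {pos}")
        if m.lastgroup not in ("space", "comment"):
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


for kind, text in tokenize("Term ::= Factor [*+?]?"):
    print(kind, text)
```

Treating a whole bracketed group as one token keeps the sketch short; a fuller implementation would go on to parse the Charspec entries inside it.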

2. References

Bray, Tim, Jean Paoli, and C. M. Sperberg-McQueen, ed. Extensible Markup Language (XML) 1.0. W3C Recommendation. [Cambridge, Sophia-Antipolis, Tokyo]: World Wide Web Consortium, 8 February 1998. Second edition 6 October 2000 ed. Eve L. Maler. http://www.w3.org/TR/REC-xml

Jaakkola, Jani, and Pekka Kilpeläinen. sgrep: search a file for a structured pattern. (Man page for sgrep). http://www.cs.helsinki.fi/u/jjaakkol/sgrepman.html

[Kraus, Carl von, ed.] Des Minnesangs Frühling. Nach Karl Lachmann, Moriz Haupt und Friedrich Vogt neu bearbeitet von Carl von Kraus. 30. Auflage. Leipzig: S. Hirzel, 1950.

Lindsey, William D. Stupid NET Tricks. Posting to w3c-sgml-wg@w3.org, 14 September 1996. http://lists.w3.org/Archives/Public/w3c-sgml-wg/1996Sep/0139.html

Lindsey, William D. lml -- Lambda Markup Language: A.K.A. "Stupid NET Tricks". 15 August 1999. http://www.blnz.com/lml/index.html

Marsh, Jonathan, ed. XML Base. W3C Proposed Recommendation 20 December 2000. [Cambridge, Sophia-Antipolis, Tokyo]: World Wide Web Consortium, 2000. http://www.w3.org/TR/xmlbase/

Sperberg-McQueen, C. M., and Claus Huitfeldt. GODDAG: A Data Structure for Overlapping Hierarchies. Paper presented at the conference Principles of Digital Document Processing, Munich, September 2000.

3. List of productions

For convenience, we append a list of non-terminals with the sections in which they are defined.

atts 2.2.8 Internal structure of tags
cdata-section 2.2.6 CDATA marked sections
cdchars 2.2.6 CDATA marked sections
CHAR 2.2.9 Basic tokens
character-ref 2.2.5 Entities
chunk 2.2.1 Documents
comment 2.2.7 Comments
commentdata 2.2.7 Comments
datacharacter 2.2.9 Basic tokens
document 2.2.1 Documents
eid 2.2.8 Internal structure of tags
end-tag 2.2.3 Start- and end-tags
end-tag-set 2.2.3 Start- and end-tags
external-entity 2.2.5 Entities
gi 2.2.8 Internal structure of tags
id 2.2.8 Internal structure of tags
IDENTIFIER 2.2.9 Basic tokens
idref 2.2.8 Internal structure of tags
internal-entity 2.2.5 Entities
NAME 2.2.9 Basic tokens
resume-tag 2.2.4 Interrupted elements
S 2.2.9 Basic tokens
sole-tag 2.2.2 Empty and virtual elements
start-tag 2.2.3 Start- and end-tags
start-tag-set 2.2.3 Start- and end-tags
suspend-tag 2.2.4 Interrupted elements
URL 2.2.9 Basic tokens
virtual-element 2.2.2 Empty and virtual elements

4. Virtual elements

We appear to have three choices to make regarding virtual elements. A virtual element is an element whose serialization form is in some way incomplete, and the properties of which are (on the logical level) to be supplemented by information taken from elsewhere in the serialized file.

On the Goddag level, there is no distinction between virtual and other elements.

The three choice points are:

Should the virtual element carry one IDREF, or a series of IDREFs?
Should it specify a generic identifier and attributes, or take them over from the item(s) pointed at? (Or, possibly, specify some and take some over?)
Should the children of the virtual element be the nodes it points at? Or should they be the children of the nodes it points at?

Number of IDREFs	Specify or take GI and attributes	Take children or nodes as children	Comments
one	specify	children	This is what MSM has been tending toward.
one	specify	nodes	Pointless? Leads to chains.
one	take	children	Does not allow for retagging (use of virtual elements to represent alternative interpretations of the material, cf. TEI <certainty>)
one	take	nodes	No good: leads to chains, and reduplicates nodes on them.
n	specify	children	Cf. TEI <join> with `scope="branches"`
n	specify	nodes	Cf. TEI <join> with `scope="root"`
n	take	children	Logical problem: can't take one GI and attribute set if there are n places to take it from.
n	take	nodes	Logical problem: can't take one GI and attribute set if there are n places to take it from.

This version of this specification chooses the first row.
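
The choice made in the first row (one IDREF, generic identifier and attributes specified on the virtual element, children taken from the children of the node pointed at) can be pictured with a small data-structure sketch. The Python below is only an illustration under those assumptions: the class Node, the function resolve_virtual, and the id index are hypothetical names, not part of TexMECS or of the Goddag definition.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    gi: str                                       # generic identifier
    attrs: dict = field(default_factory=dict)     # attribute name -> value
    children: list = field(default_factory=list)  # child nodes or character data


def resolve_virtual(gi, attrs, idref, index):
    """Build the node for a virtual element.

    The virtual element specifies its own generic identifier and attributes
    and carries exactly one IDREF; its children are the children of the node
    that the IDREF points at (shared, not copied).
    """
    target = index[idref]
    return Node(gi=gi, attrs=dict(attrs), children=target.children)


# A line with an id, and two virtual elements that point at it.
line = Node("l", {"id": "l3.1"}, ["Her skin is tender as DiMaggio's glove."])
index = {"l3.1": line}

repeat = resolve_virtual("l", {"id": "l3.2"}, "l3.1", index)
retagged = resolve_virtual("quote", {}, "l3.1", index)

print(repeat.gi, repeat.attrs, repeat.children)
print(retagged.gi, retagged.children is line.children)
```

Because the virtual element supplies its own generic identifier, the same children can be retagged, and because it shares the children rather than the pointed-at node itself, resolution does not produce chains of virtual elements.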

5. Open questions

This section lists questions that have arisen during completion of this paper.

Should suspend- and resume-tags, and tags for virtual elements, have distinctive closing delimiters (i.e. {+e+{ ... }-e-}), as tags for unordered elements do? Or conversely should unordered elements be distinctive only at the beginning of the tag? ({|e{ ... }|e})
Should suffixes without base names be allowed? We have introduced them in order to allow anonymous spans to overlap; perhaps that is unnecessary, since anonymous spans are important only for segmenting character data in order to allow virtual elements to point at it, and anything we can express with virtual elements and overlapping anonymous spans we can also express with virtual elements and anonymous nonoverlapping spans.
Should S be allowed between any two tokens in markup? For example, should { p { be a legal start-tag? In XML, the first set of blanks is not allowed; the second is. SGML imposes this rule in order to allow stago followed by blanks to be used without being escaped in any way; since XML and TexMECS require the delimiter to be escaped regardless, the rule appears to serve no function anymore.