[Cache from http://www.hit.uib.no/claus/mecs/sum.htm; please use this canonical URL/source if possible.]


Excerpt from
MECS - A Multi-Element Code System
by
Claus Huitfeldt
forthcoming in
   Working Papers from the Wittgenstein Archives at the University of Bergen, No 3

   ISBN 82-91071-02-0
   ISSN 0803-3137
   Copyright: Claus Huitfeldt

First version: 1992.
This version: October 1998


CONTENTS

1 Introduction: MECS and SGML
1.1 Why not SGML?
1.2 MECS Syntax
1.3 MECS Program Package
1.4 The Future
 


1 INTRODUCTION: MECS AND SGML

The Norwegian Wittgenstein Project (NWP), which started in 1980, aimed at producing a machine-readable version of Ludwig Wittgenstein's Nachlass. Like many similar projects at the time, the NWP developed its own markup system. And like most projects that did so, the NWP enjoyed not only the many advantages of explicitly marked up texts, but also the severe disadvantage of having to develop its own specialized software, even for trivial, non-specialized tasks.

The NWP was discontinued in 1987, and during the preparation for its continuation, which was later (1990) to become the Wittgenstein Archives at the University of Bergen, I set out to improve the markup system. It had turned out that the system suffered from certain deficiencies. Any revision of the markup system necessitated adjustment of the software, which had in the course of several years of ad hoc revisions grown quite complicated. The pause in the project activities was therefore well spent looking for a more viable and flexible solution.

1.1 Why not SGML?

Standard Generalized Markup Language (SGML) was adopted as an international standard for text encoding by the International Organization for Standardization (ISO) in 1986, so at that time (i.e. 1988-1989), SGML was the most natural candidate for consideration. However, despite its many strengths and potential advantages, I found SGML unsuited to our needs. Among the reasons were:

My conclusion therefore was that I had to develop a different system altogether for the Wittgenstein Archives, a system which had to be considerably less demanding concerning software development, to answer the specific needs of our project, and yet be general and flexible enough to allow for extensive revision of the registration system during the course of future work without necessitating revision of application software.

At roughly the same time the Text Encoding Initiative (TEI) had just started (1987). The TEI based itself on SGML. Many of the issues TEI was expected to address were relevant to the problems listed above. Although we could not wait for TEI to be completed, it was therefore also an obvious consideration for my development work to keep as close to SGML as possible.

1.2 MECS Syntax

Consequently, MECS is in many respects similar to SGML. Like SGML, MECS is not itself a markup scheme, but a set of rules for the design of markup schemes. MECS may be accommodated to conform to SGML's reference concrete syntax. SGML documents are MECS-conforming, provided that they do not make use of markup reduction or minimization.

MECS markup schemes may be declared in separate "document definitions", similar to the SGML DTDs. Because they lack most of the expressive power of SGML's DTDs, I have chosen a different term: Where SGML speaks of Document Type Definitions (DTDs), MECS speaks of Code Declaration Tables (CDTs). Basically, a CDT is a declaration listing delimiters, other character sets, and codes (tags for elements and entities) to be used in a document. MECS documents may be validated for conformance with a particular CDT. But unlike SGML, no CDT is required in MECS (cf. below).

MECS includes equivalents to SGML's elements and internal entities. In addition, MECS includes syntactical means for the representation of structures which in SGML are treated in a different way. There are seven syntactically distinct types of codes (examples are given in MECS's default character set):

No-element codes:                 <tag>

One-element codes:                <tag/   ... /tag>

Poly-element codes:               [tag/2| ... /tag| ... /tag]

N-element codes:                  [tag/2\ ... /tag| ... /tag]

Character representation codes:   {tag}

Character disambiguation codes:   {...\tag}

Comments:                         <| ... |>

All delimiters may be redefined, and tags may be reduced or minimized (though not omitted) according to specific rules.

No-element codes correspond to SGML's empty elements, and mark points within the text. One-element codes correspond to ordinary SGML elements, and mark spans of text.

Multi-element codes, i.e. poly-element and N-element codes, have no obvious parallel in SGML. Poly-element codes mark two or more consecutive spans of text (typically indicating that they stand in a specific relationship to each other, e.g. that of substitution or counterposition). N-element codes are similar to poly-element codes. But whereas the number of spans (elements) marked by a poly-element code may vary from token to token, the number of elements in an N-element code is fixed.

Character representation codes correspond roughly to SGML's internal entities. Character disambiguation codes, which have no direct equivalent in SGML, are used in conjunction with character representation codes, typically to disambiguate homographic graphemes (e.g. characters which in one context may be punctuation marks, in another context logical operators).

In MECS (just as in SGML) parts of a document which should be ignored by the parser are marked as comments.

MECS has no direct parallel to SGML attributes, external entities and declarations. However, multi-element codes may be used for some of the same purposes as attributes, and the MECS Program Package supports a file inclusion mechanism which performs some of the work that SGML external entities do. MECS has corollaries to SGML comments and to marked sections with keyword CDATA, but not to the other SGML declaration types.

MECS documents contain text interspersed with codes. MECS does not presuppose any hierarchical document structure - elements may appear in any order and nest arbitrarily deeply. Multi-element codes may not overlap each other, but one-element codes may overlap all other codes without restriction.

The basic syntactic features of all tags occurring in a MECS-conforming document are directly deducible from their delimiters, even if markup is reduced to its minimum. This has at least three important consequences:

First, it increases human readability of documents. Even if heavily marked-up documents are notoriously difficult to read for the human eye, MECS at least has the advantage that you may e.g. tell a no-element from a one-element tag immediately. (In SGML you do not know whether a start tag is associated with an end tag or not, i.e. whether it marks an empty element or not, until you have either inspected the DTD or scanned to the end of the current element, which in the worst case means the rest of the entire document instance.)

Second, the same point applies to software development. There is no need for look-ahead to identify the basic syntactic features of a MECS tag. Therefore, as long as a MECS document includes a one-line header declaring its delimiters, the entire document can be parsed and validated for basic syntax conformance without recourse to any CDT.
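
The point about look-ahead can be illustrated with a minimal sketch in Python (not part of the MECS Program Package, which is written in Pascal). It assumes that tag tokens have already been isolated, that the default delimiters from the list of code types above are in force, and that markup is in its standard, unreduced form. The function name classify_tag and the category labels are hypothetical.

```python
def classify_tag(token):
    """Classify one isolated tag token by its delimiters alone.

    Assumption: default MECS delimiters, unreduced markup. No other
    token and no CDT is consulted, i.e. there is no look-ahead.
    """
    if token.startswith("<|"):
        return "comment start"
    if token.endswith("|>"):
        return "comment end"
    if token.startswith("<") and token.endswith(">"):
        return "no-element code"
    if token.startswith("<") and token.endswith("/"):
        return "one-element start"
    if token.startswith("/") and token.endswith(">"):
        return "one-element end"
    if token.startswith("[") and token.endswith("|"):
        return "poly-element start"
    if token.startswith("[") and token.endswith("\\"):
        return "N-element start"
    if token.startswith("/") and token.endswith("|"):
        return "multi-element separator"
    if token.startswith("/") and token.endswith("]"):
        return "multi-element end"
    if token.startswith("{") and token.endswith("}"):
        # A backslash inside the braces marks a disambiguation code.
        if "\\" in token:
            return "character disambiguation code"
        return "character representation code"
    return "text"


for tok in ["<pb>", "<s/", "/s>", "[sub/2|", "[sub/2\\", "{dot}"]:
    print(tok, "->", classify_tag(tok))
```

Every branch inspects only the first and last characters of the token itself, which is the property the paragraph above describes. Redefined delimiters and minimized tags are outside the scope of this sketch.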

Third, this means that MECS documents are in a certain fundamental sense self-documenting: If a MECS document includes a header, which is a one-line declaration of its delimiters, then a CDT to which that document conforms may be deduced directly from the document alone. The CDT thus deducible from the document is called the document's minimal CDT.

Although it is unequivocally decidable whether any particular document conforms to any particular CDT, an indefinite number of documents conform to any particular CDT, and any particular document conforms to an indefinite number of CDTs. In this respect the relationship between MECS documents and CDTs is the same as the relationship between SGML document instances and DTDs - it is a many-to-many relationship. What is special about the relationship between a document and its minimal CDT is that it is a many-to-one relationship: any particular MECS-conforming document conforms to one and only one minimal CDT.
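
The idea of deducing declarations from a document alone can be sketched as follows. This is an illustration only: it assumes the default delimiters and unreduced markup, the result is a plain Python dictionary and not the actual CDT format (which this excerpt does not show), and the name minimal_declarations is hypothetical. The digit after the slash in multi-element start tags is matched but not interpreted.

```python
import re

# One pattern per code type, derived from the default-delimiter
# examples in the list of code types (an assumption of this sketch).
PATTERNS = {
    "no-element": re.compile(r"<(\w+)>"),
    "one-element": re.compile(r"<(\w+)/"),
    "poly-element": re.compile(r"\[(\w+)/\d+\|"),
    "N-element": re.compile(r"\[(\w+)/\d+\\"),
    "character representation": re.compile(r"\{(\w+)\}"),
    "character disambiguation": re.compile(r"\{[^{}\\]*\\(\w+)\}"),
}
COMMENT = re.compile(r"<\|.*?\|>", re.DOTALL)


def minimal_declarations(document):
    """Return, for each code type, the sorted tag names used in the document."""
    text = COMMENT.sub("", document)  # comments are ignored by the parser
    return {kind: sorted(set(pattern.findall(text)))
            for kind, pattern in PATTERNS.items()}


doc = "<pb> <s/ He [sub/2| came /sub| went /sub] home{dot} /s> <| <x> |>"
for kind, tags in minimal_declarations(doc).items():
    print(kind, tags)
```

Because the function depends on nothing but the document text, the same document always yields the same result, which mirrors the many-to-one relationship described above.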

1.3 MECS Program Package

The MECS Program Package contains programs for the creation, validation, formatting, reformatting, analysis, element extraction and spell checking of MECS-conforming documents, as well as programs for translation between MECS and SGML. All programs in the package run under MS-DOS.

MECSVAL is an interactive, validating parser-editor. MECSVAL checks CDTs and documents for MECS conformance, and may either deduce minimal CDTs from MECS-conforming documents or check that documents conform to particular CDTs.

MECSFORM formats or regularizes MECS-conforming documents by either reducing markup to its minimum or extending it to its standard form, wrapping lines to a user-specified maximum length, removing trailing blanks and trailing blank lines, optionally indenting specified elements and/or inserting reference codes in specified locations etc.

MECSPRES outputs text in various formats (HTML, WordPerfect, Folio Flat File, so-called "plain ASCII", and others). The program offers a number of options for the layout and formatting of elements (margins and marginalia, indentation, tables, columns, notes, section headers etc.; features like bold, italics, single and double underline, capitalization, letter-spacing; markers and special characters; links and anchors etc.). MECSPRES may also reformat text to other MECS-encoded formats, and to formats required by the programs ALPHATXT and BETATXT (cf. below). With MECSPRES the user may not only define stylesheets, but also format, layout and style specifications.

MECSLYSE analyzes relationships between the encoded elements of a document and allows the user to define breakpoints at which to display the code stack, list all recursive or overlapping elements, and create a tabulated list displaying the sequence and nesting level of all elements occurring in a document.

MECSGRAB extracts specified elements from a document and prints them and/or their line and column reference numbers in a separate file. This file may, under certain conditions, itself be a MECS-conforming document subject to further processing by MECSGRAB or other MECS programs.

ALPHATXT may be used for interactive spell checking in general, and spell checking of MECS-encoded documents in particular. The program may also perform a number of other tasks, such as the production of word lists sorted according to user-defined character sort criteria, frequency word lists, and simple statistical analyses.

BETATXT computes and displays all possible combinations of single elements of multi-element codes within segments of a document. For example, if sentences are marked and alternative readings are encoded with multi-element codes, then BETATXT may compute and display all the alternative readings of sentences containing substitutions.
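
The combinatorics involved can be sketched in Python. This is not BETATXT's actual algorithm, which this excerpt does not describe. The sketch assumes default delimiters, unreduced markup, and multi-element codes that are not nested inside one another. The function name readings and the tag name sub are hypothetical.

```python
import itertools
import re

# A complete multi-element code: start tag, body, end tag with the same name.
MULTI = re.compile(r"\[(\w+)/\d+[|\\](.*?)/\1\]", re.DOTALL)


def readings(sentence):
    """Return every reading obtained by picking one element per multi-element code."""
    parts = []  # each entry is a list of alternatives for one stretch of text
    pos = 0
    for match in MULTI.finditer(sentence):
        parts.append([sentence[pos:match.start()]])  # plain text: one choice
        tag, body = match.group(1), match.group(2)
        parts.append(body.split("/" + tag + "|"))    # the alternative elements
        pos = match.end()
    parts.append([sentence[pos:]])
    # Cartesian product over all choice points, with whitespace normalized.
    return [" ".join("".join(combo).split())
            for combo in itertools.product(*parts)]


for reading in readings("He [sub/2| came /sub| went /sub] [sub/2| home /sub| away /sub]"):
    print(reading)
```

A sentence with k substitutions of two elements each yields 2^k readings, so the output grows quickly with the number of encoded alternatives.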

MECSSGML converts MECS-conforming documents to SGML-conforming documents. The conversion may or may not lead to a certain loss or distortion of information, depending on the degree to which the document in question includes features specific to MECS, whether or not overlapping elements are retained, etc. (It is, however, possible to restrict MECS so as not to allow features which cannot be translated to SGML without loss of information.)

SGMLMECS converts SGML documents to MECS-conforming documents. Although a number of SGML features will be converted to a form in which they are ignored by other MECS software, in a certain sense the conversion does not lead to loss or distortion of information: Documents converted to MECS with SGMLMECS may always be converted back again to their exact original SGML form with MECSSGML.

Except for MECSVAL and ALPHATXT, none of the programs in the package are interactive. However, Peter Cripps has written a menu-driven user interface, MECSPAC, for interactive use of the program package.

The lack of a rigorously defined document structure (a DTD) and the lack of restrictions against overlapping elements have been taken by some to suggest that writing programs for MECS would be more complicated than writing programs for SGML.

One difference is that where SGML programs may keep track of the document structure by means of a "last in first out" stack, MECS programs have to maintain a doubly linked list. Admittedly, this is a bit more complicated. On the other hand, the fact that the basic syntactical role of each and every tag can be inferred directly from its delimiters without look-ahead serves to simplify other matters considerably.
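
Why a stack does not suffice can be seen in a small sketch. It tracks only one-element codes and assumes default delimiters and unreduced markup. A plain Python list stands in for the doubly linked list mentioned above; what matters is that an entry can be removed from the middle and not only from the top. The function name find_overlaps is hypothetical.

```python
import re

# One-element start tag <tag/ or one-element end tag /tag>.
TAG = re.compile(r"<(\w+)/|/(\w+)>")


def find_overlaps(document):
    """Return (closed, still_open) pairs for one-element codes that overlap."""
    open_elements = []  # in order of opening; not restricted to last-in-first-out
    overlaps = []
    for match in TAG.finditer(document):
        start, end = match.group(1), match.group(2)
        if start:
            open_elements.append(start)
        elif end in open_elements:
            # Locate the most recently opened element with this name.
            idx = len(open_elements) - 1 - open_elements[::-1].index(end)
            # Anything opened after it and still open overlaps with it.
            for other in open_elements[idx + 1:]:
                overlaps.append((end, other))
            del open_elements[idx]  # removal from the middle, not a pop
    return overlaps


print(find_overlaps("<a/ one <b/ two /a> three /b>"))
print(find_overlaps("<a/ one <b/ two /b> three /a>"))
```

In the first call the element a is closed while b is still open, so the entry for a is removed from below b. A last-in-first-out stack could only have removed b at that point.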

Another difference is that whereas SGML programs may build internal tree representations of documents to facilitate manipulation of them, no such internal representations are built by MECS programs - because of the occurrence of overlapping elements this has so far seemed too complicated. 1 Therefore, all MECS programs read the entire document from its beginning in order to perform operations on it.

The MECS Program Package does not live up to standards of professional software. But the fact that it was possible for a sheer amateur to write the bulk of these programs as a side-activity during a couple of years indicates that programming for MECS is easy. Altogether the package comprises approximately 13,000 lines of Pascal code (excluding the editor). It is assumed that similar programs for SGML would demand code far in excess of this.

1.4 The Future

I once said 2 that when it comes to document structure, one of the main differences between SGML and MECS is that in SGML everything is forbidden unless it is explicitly permitted or mandatory, while in MECS everything is permitted unless it is explicitly forbidden.

In retrospect I realize that this is grossly unfair: SGML does after all admit quite permissive DTDs, and MECS does not have any means of forbidding or demanding particular document structures. 3 Still, the formulation points to a difference of emphasis: SGML provides strong mechanisms for exerting control over document structure, whereas MECS sacrifices such control in favor of free overlap and simplified or in-line declaration of elements.

Nine years have passed since the development of MECS started, and it has been used in the encoding of several thousand manuscript pages. The TEI guidelines have been available for quite some time, and have been discussed and used extensively by a large number of projects. The amount and range of SGML-based software has increased considerably.

Is there still a need for MECS? Despite the fact that SGML is a far more sophisticated markup language, I believe that the considerations which led me to dismiss SGML nine years ago still apply. 4 MECS is therefore in my eyes still the preferred choice for a project like the Wittgenstein Archives. However, MECS also has obvious shortcomings. If not all, then at least a number of these shortcomings are eliminated in SGML. Unfortunately, conversion from MECS to SGML without loss of information is notoriously difficult, so we cannot have the best of both worlds.

One recent development (1997) within the SGML area is particularly interesting. Extensible Markup Language (XML), which has received much attention lately, shares a couple of features with MECS: In XML, empty elements are visibly different from elements with content, and tag omission is not allowed. Consequently, a DTD is not required in XML, and a distinction is made between well-formedness ("valid" without DTD) and validity (valid according to some specific DTD) of documents. In all these respects, XML is therefore closer to MECS than SGML is. 5 (It is also interesting to note that one of the arguments often made in favor of XML is that it is easier to write programs for than SGML is.) However, in one important respect XML poses even greater difficulties than SGML: XML does not include SGML's CONCUR feature. And without CONCUR the conversion of MECS documents seems even more difficult.

Some work has been done in order to create a bridge from MECS to SGML. Sunniva Solstrand has developed a method (and a program) for automatically "deducing" DTDs from document instances converted from MECS to SGML ([Solstrand 1994]). Sascha Djuric has proposed a convention for automatically converting elements with overlap to hierarchical structures in a controlled manner ([Djuric 199?]). What remains in particular is a method for converting MECS documents with overlap to concurrent hierarchies by using the SGML CONCUR feature, and SGML software which implements this feature. Methods for MECS to SGML conversion are among the concerns of an ongoing cooperation between [C. Michael Sperberg-McQueen] and myself.
 

1 C. Michael Sperberg-McQueen and I are working on a data structure for MECS similar to that of SGML. This work has been promising, but it is too early to make any final judgement, and no implementation exists.
2 [Huitfeldt 1995, p 238]
3 The few things which are forbidden or mandatory in MECS are implicitly required by the basic syntax, not explicitly stated.
4 Even though the amount of SGML software has increased, no SGML software to my knowledge yet replicates the functionality of the MECS Program Package. In particular, there is still little software that supports the CONCUR feature.
5 Strictly speaking, XML is a subset of SGML. Therefore, the comparisons made here are not between a non-SGML system and SGML, but between the SGML subset XML and "full" or "unrestricted" SGML.