[May 10, 2001] The Center for Humanities Information Technology at the University of Bergen has announced support for a two-year research project focused on markup for "complex" documents. The main goal of the MLCD Project (Markup Language for Complex Documents) is to "lay the theoretical foundation for a better system of representation for complex textual phenomena than can be found in today's SGML- and XML-based systems; the project will also lay a foundation for software development, with an eye to web-based delivery." Among the issues dealt with in MLCD: the well-attested phenomenon of "overlap" (non-hierarchical structures), handled by various methods in SGML and XML, e.g., 'store both structures, filter into SGML; use the SGML CONCUR Feature; use marked sections and entity declarations; use [TEI] milestones [= empty, asynchronous elements]; model as fragmentation; use stand-off markup'; etc. The MLCD project builds upon published research from C. M. Sperberg-McQueen and Claus Huitfeldt, especially GODDAG, MECS, and TexMECS (an experimental markup meta-language for complex documents). MECS (Multi-Element Code System) "was developed by Claus Huitfeldt in connection with the work of the Wittgenstein Archive at the University of Bergen. MECS has many similarities to SGML-based systems, but distinguishes itself from them in that it has a simpler notation and a well-defined concept of well-formedness as a property separate from that of validity. MECS thus anticipates many of the ways in which XML has modified the rules of SGML; in addition, MECS allows non-hierarchical structures in the form of overlapping elements. The MLCD project intends to define a system which combines the best of SGML/XML and MECS. A notation for such a system has already been designed, and a data structure has been sketched out. The project will work to complete the specification of the data structure and to develop some method of specifying document grammars."
[May 10, 2001] "TexMECS: An experimental markup meta-language for complex documents." By Claus Huitfeldt and C. M. Sperberg-McQueen. 25-January-2001, draft/incomplete 17-February-2001 (2001-05-10). Excerpts from work-in-progress: "This document sketches the outlines of TexMECS, a markup language (or, more precisely, a markup meta-language or family of markup languages) intended for experimental work in dealing with complex documents. TexMECS was developed at the Center for Humanistic Information Technology at the University of Bergen as part of the project Markup Languages for Complex Documents, with support from the Lauritz Meltzer Høyskolefond. This document assumes some familiarity with XML, SGML, and MECS encoding of documents, with the problems posed for these systems by complex documents, and with the Goddag structure proposed by the authors for representing document structures. Status: The grammar given in this document reflects the version of TexMECS presented in Bergen the first week of February 2001. The grammar has not yet been proofread to make sure all non-terminals are defined, spelled correctly, etc. The examples are not yet complete and may still include relics of design alternatives recently rejected... The basic principles of the design are: For documents that exhibit a straightforward hierarchical structure, TexMECS should be isomorphic to XML; for documents that exhibit a suitable structure, TexMECS should be isomorphic to MECS; ideally, the syntax should be distinct both from that of XML and from that of MECS; every TexMECS document should be translatable into a GODDAG structure without reference to application-specific semantics; in general, the syntax should be as simple to define and process as possible. 
We distinguish: (1) empty elements marked by sole-tags; (2) normal elements with start- and end-tags; (3) interrupted elements, with start-, suspend-, resume-, and end-tags; (4) elements with children whose order has no significance and which can therefore be reordered...; (5) virtual elements, which have a generic identifier and attributes, and who share children with another element in the document...; (6) self-overlapping elements, which use a simple co-indexing scheme: tags are co-indexed by a tilde and a suffix of numbers and letters... Start- and end-tags must match, roughly as in XML, but they need not nest. For every start-tag, a matching end-tag must follow before the end of the TexMECS document; every end-tag must be preceded by a matching start-tag. To match, start- and end-tags must have the same generic identifier and the same suffix, and there must be no closer matching tag. Since the start- and end-tags can come in arbitrary orders, we express the match rule as a well-formedness constraint (WFC), not in the grammar... [The x in the name is pronounced the same way as the cs at the end; the word is thus a homonym of Tex/Mex, a term applied to cuisine and music of the area on the border of the U.S. and Mexico. Theoretically, the name stands for 'trivially extended MECS', but this claim is not wholly convincing even to the authors, given that the language described here is not really an extension to MECS, and that the things included here which go beyond MECS are not really trivial.]"
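The match rule quoted above can be illustrated with a small Python sketch. It is not taken from the TexMECS draft: the function name `pair_tags` and the `(kind, gi, suffix)` tuples are hypothetical stand-ins for already-tokenized tags, and no concrete TexMECS tag syntax is assumed. The sketch also assumes that "no closer matching tag" means an end-tag pairs with the nearest preceding unmatched start-tag that has the same generic identifier and suffix. Only start- and end-tags are handled; suspend-, resume-, and sole-tags are left out.

```python
from collections import defaultdict


def pair_tags(tags):
    """Pair start- and end-tags that share a generic identifier and suffix.

    `tags` is a sequence of (kind, gi, suffix) tuples, where kind is
    "start" or "end". This tuple form is a hypothetical, pre-tokenized
    representation, not real TexMECS syntax.

    Returns a sorted list of (start_position, end_position) pairs.
    Raises ValueError if the matching constraint is violated.
    """
    # One stack of open start-tag positions per (gi, suffix) key.
    # Separate stacks let differently named elements overlap freely.
    open_starts = defaultdict(list)
    pairs = []

    for pos, (kind, gi, suffix) in enumerate(tags):
        key = (gi, suffix)
        if kind == "start":
            open_starts[key].append(pos)
        elif kind == "end":
            if not open_starts[key]:
                raise ValueError(
                    f"end-tag at position {pos} for {key} "
                    "has no preceding matching start-tag"
                )
            # Assumption: the closest unmatched start-tag is the match.
            pairs.append((open_starts[key].pop(), pos))
        else:
            raise ValueError(f"unknown tag kind {kind!r} at position {pos}")

    unclosed = sorted(p for stack in open_starts.values() for p in stack)
    if unclosed:
        raise ValueError(f"start-tags at positions {unclosed} are never closed")

    return sorted(pairs)


# Two different elements that overlap without nesting.
overlap = [("start", "a", ""), ("start", "b", ""), ("end", "a", ""), ("end", "b", "")]
print(pair_tags(overlap))

# Self-overlap: the same generic identifier, kept apart by co-index suffixes.
self_overlap = [("start", "a", "1"), ("start", "a", "2"), ("end", "a", "1"), ("end", "a", "2")]
print(pair_tags(self_overlap))
```

Keeping one stack per (generic identifier, suffix) pair, rather than a single global stack as an XML parser would, is what allows overlap between different elements while still resolving repeated tags of the same name to their nearest partner.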
From the MLCD project summary and background description:
"The prevailing standard for text encoding today is SGML, which is the basis of HTML, the text format most commonly used today on the World Wide Web. SGML is also the basis of XML, which is a simplified subset of SGML which has been used to reformulate HTML (as XHTML) and which is in the process of becoming the standard form for Web publishing as well as for other applications. All SGML- and XML-based systems, however, have problems with the representation of a variety of phenomena which are essential for the acceptable representation and processing of text. These problems are in large part solved by MECS, a coding system developed by Claus Huitfeldt in work at the Wittgenstein Archive at the University of Bergen. MECS, however, has no well-defined data structure and no notion of document grammar.
A system which combined the best features of SGML and MECS, providing a simple notation combined with a powerful grammatical formalism and a data model capable of representing non-hierarchical structures in a natural way, would represent a considerable step forward. Laying the groundwork for such a system has been the goal of a collaboration between Huitfeldt and C. M. Sperberg-McQueen which began in 1997-98, when Dr. Sperberg-McQueen was a visiting researcher at the historical-philosophical faculty in Bergen. Some results of that collaborative work have been published in the form of two articles; a third is in press. In brief, the work indicates that a notation and well-formedness rules based on MECS should be usable for the purpose. A suitable data model for MECS has also been successfully sketched. The relationship between SGML and this new system has been studied, with emphasis on possibilities of automatic conversion. What is lacking is a grammatical formalism which allows the expression of validity conditions... MLCD thus consists, for now, in the direct continuation of the theoretical work already begun with the establishment of a notation and a data structure. The next stage will be the working out of a grammatical formalism. In connection with that work, it will be desirable to implement a system prototype in the form of a validating parser and simple experimental conversion and analysis tools..."
References:
- MLCD Overview
- MLCD Project web site
- MLCD Project Description [EN] (also in Norwegian)
- "TexMECS: An experimental markup meta-language for complex documents." By Claus Huitfeldt and C. M. Sperberg-McQueen. 25-January-2001, draft/incomplete 17-February-2001 (2001-05-10). [cache 2001-05-10]
- "Markup Languages and Complex Documents: Data Structure, Linear Form, Constraint Language." By C. M. Sperberg-McQueen and Claus Huitfeldt. Bergen, 6 February 2001.
- "Markup Languages and Complex Documents: Software Options and Practical Work." By C. M. Sperberg-McQueen and Claus Huitfeldt. Bergen, 7 February 2001.
- "GODDAG: A Data Structure for Overlapping Hierarchies." By C. M. Sperberg-McQueen and Claus Huitfeldt. [To be] Presented at ACH-ALLC '99. See the ACH-ALLC '99 program. See the abstract. [local archive copy]
- MECS - A Multi-Element Code System. By Claus Huitfeldt. Excerpted from the longer document; see following reference. See also bibliography. [cache]
- MECS - A Multi-Element Code System. By Claus Huitfeldt. [Forthcoming in] Working Papers from the Wittgenstein Archives at the University of Bergen, No 3. ISBN 82-91071-02-0, ISSN 0803-3137. Copyright: Claus Huitfeldt. First version: 1992. This version: October 1998. "The subject matter of this document is text encoding. It presents what I have called the Multi-Element Code System, MECS. This is a working paper in the full sense of the term, i.e., a report on work in progress..." [cache]
- "Meaning and Interpretation of Markup." By C. M. Sperberg-McQueen, Claus Huitfeldt, and Allen Renear. Foils from presentation at ALLC/ACH 2000 (Glasgow) and Extreme Markup Languages 2000 (Montréal). Article to appear in Markup Languages: Theory & Practice 2.3 (2000).
- Seminar presentations in the HIT Center, Bergen, 5-7 February 2001. By Claus Huitfeldt and C. M. Sperberg-McQueen. "A seminar on the project Markup Languages and Complex Documents was given in the HIT Center, Alligaten 27, Bergen, Norway, on 5-7 February 2001. The purpose of this page is to provide access to the slides used in that seminar. The goal was to provide an overview of the project for the information of collaborators, other projects potentially interested in our work, and others... Monday 5 February, 9:15 - 11:00. Introduction and background; Tuesday 6 February 11:15 - 1:00. Data Structure, Linear Form, Constraint Language [XML]; Wednesday 7 February 1:15 - 3:00. Software Options and Practical Work" [XML].
- Humanities Information Technologies Research Programme [Forskningsprogram for humanistisk informasjonsteknologi], University of Bergen
- Contact: Claus Huitfeldt (HIT Director of Research)
- SGML/XML and (Non-) Hierarchy - Main reference section with previous/related work.