The OASIS Cover Pages: The Online Resource for Markup Language Technologies
Last modified: May 10, 2001
Markup Language for Complex Documents (Bergen MLCD Project)

[May 10, 2001] The Center for Humanities Information Technology at the University of Bergen has announced support for a two-year research project focused on markup for "complex" documents. The main goal of the MLCD Project (Markup Language for Complex Documents) is to "lay the theoretical foundation for a better system of representation for complex textual phenomena than can be found in today's SGML- and XML-based systems; the project will also lay a foundation for software development, with an eye to web-based delivery." Among the issues MLCD addresses is the well-attested phenomenon of "overlap" (non-hierarchical structures), which SGML and XML handle by various methods, e.g., 'store both structures, filter into SGML; use the SGML CONCUR Feature; use marked sections and entity declarations; use [TEI] milestones [= empty, asynchronous elements]; model as fragmentation; use stand-off markup'; etc. The MLCD project builds upon published research by C. M. Sperberg-McQueen and Claus Huitfeldt, especially on GODDAG, MECS, and TexMECS (an experimental markup meta-language for complex documents). MECS (Multi-Element Code System) "was developed by Claus Huitfeldt in connection with the work of the Wittgenstein Archive at the University of Bergen. MECS has many similarities to SGML-based systems, but distinguishes itself from them in that it has a simpler notation and a well-defined concept of well-formedness as a property separate from that of validity. MECS thus anticipates many of the ways in which XML has modified the rules of SGML; in addition, MECS allows non-hierarchical structures in the form of overlapping elements. The MLCD project intends to define a system which combines the best of SGML/XML and MECS. A notation for such a system has already been designed, and a data structure has been sketched out. The project will work to complete the specification of the data structure and to develop some method of specifying document grammars."
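
To make the notion of "overlap" concrete, the following minimal sketch records elements as stand-off character ranges (one of the workarounds listed above) and tests whether two of them overlap, that is, share text without either containing the other. The `Span` type, the element names, and the offsets are illustrative assumptions and do not come from the MLCD materials.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-off annotation: element name plus a half-open range."""
    name: str
    start: int  # offset of the first character covered
    end: int    # offset just past the last character covered


def overlaps(a: Span, b: Span) -> bool:
    """Return True if a and b share text but neither contains the other."""
    intersect = a.start < b.end and b.start < a.end
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return intersect and not (a_in_b or b_in_a)


# Illustrative offsets only: a verse line and a sentence that cross each other,
# plus a phrase that nests cleanly inside the line.
line = Span("line", 0, 6)
sentence = Span("sentence", 4, 10)
phrase = Span("phrase", 0, 3)
```

Here `overlaps(line, sentence)` holds, so no single tree can contain both elements intact, while `phrase` nests inside `line` and poses no problem for a hierarchical encoding.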

[May 10, 2001] "TexMECS: An experimental markup meta-language for complex documents." By Claus Huitfeldt and C. M. Sperberg-McQueen. 25-January-2001, draft/incomplete 17-February-2001 (2001-05-10). Excerpts from work-in-progress: "This document sketches the outlines of TexMECS, a markup language (or, more precisely, a markup meta-language or family of markup languages) intended for experimental work in dealing with complex documents. TexMECS was developed at the Center for Humanistic Information Technology at the University of Bergen as part of the project Markup Languages for Complex Documents, with support from the Lauritz Meltzer Høyskolefond. This document assumes some familiarity with XML, SGML, and MECS encoding of documents, with the problems posed for these systems by complex documents, and with the Goddag structure proposed by the authors for representing document structures. Status: The grammar given in this document reflects the version of TexMECS presented in Bergen the first week of February 2001. The grammar has not yet been proofread to make sure all non-terminals are defined, spelled correctly, etc. The examples are not yet complete and may still include relics of design alternatives recently rejected... The basic principles of the design are: For documents that exhibit a straightforward hierarchical structure, TexMECS should be isomorphic to XML; for documents that exhibit a suitable structure, TexMECS should be isomorphic to MECS; ideally, the syntax should be distinct both from that of XML and from that of MECS; every TexMECS document should be translatable into a GODDAG structure without reference to application-specific semantics; in general, the syntax should be as simple to define and process as possible. 
We distinguish: (1) empty elements marked by sole-tags; (2) normal elements with start- and end-tags; (3) interrupted elements, with start-, suspend-, resume-, and end-tags; (4) elements with children whose order has no significance and which can therefore be reordered...; (5) virtual elements, which have a generic identifier and attributes, and which share children with another element in the document...; (6) self-overlapping elements, which use a simple co-indexing scheme: tags are co-indexed by a tilde and a suffix of numbers and letters... Start- and end-tags must match, roughly as in XML, but they need not nest. For every start-tag, a matching end-tag must follow before the end of the TexMECS document; every end-tag must be preceded by a matching start-tag. To match, start- and end-tags must have the same generic identifier and the same suffix, and there must be no closer matching tag. Since the start- and end-tags can come in arbitrary orders, we express the match rule as a well-formedness constraint (WFC), not in the grammar... [The x in the name is pronounced the same way as the cs at the end; the word is thus a homonym of Tex/Mex, a term applied to cuisine and music of the area on the border of the U.S. and Mexico. Theoretically, the name stands for 'trivially extended MECS', but this claim is not wholly convincing even to the authors, given that the language described here is not really an extension to MECS, and that the things included here which go beyond MECS are not really trivial.]"
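
The matching rule quoted above can be sketched as a check over a sequence of tags. This is a rough illustration, not the TexMECS parser: it does not read TexMECS syntax, it covers only start- and end-tags (not sole-, suspend-, or resume-tags), and the tuple representation and the function name `check_tag_matching` are assumptions made here. It also assumes that "no closer matching tag" means an end-tag pairs with the nearest unmatched start-tag that has the same generic identifier and suffix.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

# Hypothetical model of a tag: (kind, generic_identifier, suffix),
# where kind is "start" or "end" and suffix is "" when no co-index is used.
Tag = Tuple[str, str, str]


def check_tag_matching(tags: Iterable[Tag]) -> List[str]:
    """Return a list of well-formedness violations (empty if there are none)."""
    # (generic identifier, suffix) -> positions of start-tags not yet matched
    open_tags = defaultdict(list)
    errors = []
    for pos, (kind, gi, suffix) in enumerate(tags):
        key = (gi, suffix)
        if kind == "start":
            open_tags[key].append(pos)
        elif kind == "end":
            if open_tags[key]:
                # Assumption: pair with the closest unmatched start-tag.
                open_tags[key].pop()
            else:
                errors.append(
                    f"end-tag {gi!r} (suffix {suffix!r}) at position {pos} "
                    "has no preceding matching start-tag"
                )
        else:
            raise ValueError(f"unknown tag kind: {kind!r}")
    for (gi, suffix), positions in open_tags.items():
        for pos in positions:
            errors.append(
                f"start-tag {gi!r} (suffix {suffix!r}) at position {pos} "
                "has no following matching end-tag"
            )
    return errors


# Tags match but do not nest: acceptable under the rule sketched here.
overlapping = [("start", "a", ""), ("start", "b", ""),
               ("end", "a", ""), ("end", "b", "")]
# Self-overlap, told apart by co-index suffixes.
self_overlap = [("start", "q", "1"), ("start", "q", "2"),
                ("end", "q", "1"), ("end", "q", "2")]
# An end-tag with no start-tag before it, and a start-tag never closed.
broken = [("end", "a", ""), ("start", "b", "")]
```

Because matching is tracked per (generic identifier, suffix) pair rather than on one shared stack, tags of different elements may interleave freely, which is the point at which this rule departs from XML's nesting requirement.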

From the MLCD project summary and background description:

"The prevailing standard for text encoding today is SGML, which is the basis of HTML, the text format most commonly used today on the World Wide Web. SGML is also the basis of XML, a simplified subset of SGML which has been used to reformulate HTML (as XHTML) and which is in the process of becoming the standard form for Web publishing as well as for other applications. All SGML- and XML-based systems, however, have problems with the representation of a variety of phenomena which are essential for the acceptable representation and processing of text. These problems are in large part solved by MECS, a coding system developed by Claus Huitfeldt in work at the Wittgenstein Archive at the University of Bergen. MECS, however, has no well-defined data structure and no notion of document grammar.

A system which combined the best features of SGML and MECS, providing a simple notation combined with a powerful grammatical formalism and a data model capable of representing non-hierarchical structures in a natural way, would represent a considerable step forward. Laying the groundwork for such a system has been the goal of a collaboration between Huitfeldt and C. M. Sperberg-McQueen which began in 1997-98, when Dr. Sperberg-McQueen was a visiting researcher at the historical-philosophical faculty in Bergen. Some results of that collaborative work have been published in the form of two articles; a third is in press. In brief, the work indicates that a notation and well-formedness rules based on MECS should be usable for the purpose. A suitable data model for MECS has also been successfully sketched. The relationship between SGML and this new system has been studied, with emphasis on possibilities of automatic conversion. What is lacking is a grammatical formalism which allows the expression of validity conditions... MLCD thus consists, for now, in the direct continuation of the theoretical work already begun with the establishment of a notation and a data structure. The next stage will be the working out of a grammatical formalism. In connection with that work, it will be desirable to implement a system prototype in the form of a validating parser and simple experimental conversion and analysis tools..."


Robin Cover, Editor: