[Cache from http://helmer.hit.uib.no/claus/mecs/mecs.htm; please use this canonical URL/source if possible.]
ISBN 82-91071-02-0
ISSN 0803-3137
Copyright: Claus Huitfeldt
First version: 1992.
This version: October 1998
1 Introduction: MECS and SGML
1.1 Background
1.2 SGML
1.3 MECS Syntax
1.4 MECS Program Package
1.5 Conclusion
2 PART I MECS - A Multi-Element Code System
Version 2.00, August 1993
2.1 Summary
2.2 Basic Code Syntax, Code Systems and Documents
2.3 Codes, Tags and Elements
2.4 Code Types
2.4.1 No-element Codes
2.4.2 One-element Codes
2.4.3 Poly-element Codes
2.4.4 N-element Codes
2.4.5 Character Representation Codes
2.4.6 Character Disambiguation Codes
2.4.7 MECS Comments
2.5 Generic identifiers and attribute strings
2.6 Markup Reduction
2.7 Document Structure
2.8 Classification of MECS systems
2.9 Character sets
2.9.1 Delimiters
2.9.1.1 Code delimiters
2.9.1.2 String delimiter
2.9.2 Nil character
2.9.3 Free characters
2.9.4 Tag characters
2.9.5 Default character sets
2.9.6 Examples of alternative character sets
2.10 Code Declaration Table (CDT)
2.11 Deducing a Minimal CDT from an Encoded Document
2.12 SGML Compatibility
2.12.1 Some general observations
2.12.2 From SGML to MECS
2.12.3 From MECS to SGML
2.13 Revision History
2.14 Plans for MECS Version 3
3 PART II MECS Program Package
Version 2, August 1994
User Guide
3.1 Installation and System Requirements
3.2 A Note to SGML Users
3.3 Creating and Validating Documents and CDTs
3.4 Formatting Documents
3.5 Reformatting Documents
3.6 Analyzing Documents
3.6.1 Code Status Report
3.6.2 Document Structure and Overlapping Elements
3.6.3 Breakpoints and Recursion
3.6.4 Betatexts (Substitutions)
3.6.5 Spell Checking
3.6.6 Frequency Word Lists and Simple Statistical Analyses
3.6.7 Extracting Elements
3.7 Processing SGML Documents in MECS
3.7.1 Validating SGML documents for MECS Conformance
3.7.2 Converting SGML files to MECS
3.7.3 Converting MECS documents to SGML
3.8 Project management
4 Reference Guide
4.1 General Features and Command Line Parameters
4.2 MECSVAL
4.2.1 Interactive Mode
4.2.2 Command Line Parameters
4.2.3 Examples
4.2.4 MECSVAL Editor Commands
4.3 MECSFORM
4.4 MECSLYSE
4.5 MECSGRAB
4.6 MECSPRES
4.6.1 Profile Definition Table (PDT)
4.6.1.1 Overall Structure
4.6.1.2 Code Declarations
4.6.1.3 Position
4.6.1.4 Mode
4.6.1.5 MarkIn, MarkDel and MarkOut
4.6.1.6 NoteNumber and NoteType
4.6.2 Declaration of Codes of Different Types
4.6.2.1 No-element Codes
4.6.2.2 One-element Codes
4.6.2.3 Multi-element Codes
4.6.2.4 Character Codes
4.6.3 Layout and Format
4.6.3.1 Layout
4.6.3.2 Format
4.6.4 Command Line Parameters
4.6.5 Examples
4.7 MECSBETA
4.8 BETATXT
4.9 MECSSPEL
4.10 ALPHATXT
4.10.1 Command Line Parameters
4.10.2 Defining an Alphabetic Sort Order
4.10.3 Frequency Word Lists and Simple Statistical Analyses
4.10.4 Spell Checking
4.10.5 Working with Marked-up Documents
4.11 MECSSGML
4.12 SGMLVAL
4.13 SGMLMECS
Appendix A: About the MECS Program Package
Appendix B: MECSPRES PDT Declaration Parameters
Appendix C: MECSPRES Predefined Layouts, Formats and Styles
Appendix D: MECSPRES User-Defined Layouts, Formats and Styles
The subject matter of this document is text encoding. It presents what
I have called the Multi-Element Code System, MECS.
Today, text encoding is more or less synonymous with SGML
(Standard Generalized Markup Language). Chapter 1 is an introduction
summarising the rest of the document by way of comparing MECS to SGML. 1
Chapter 2 provides a full description of MECS.
It may be read independently of the rest of the document.
Chapter 3 is a user guide and Chapter 4
a technical documentation of the MECS Program Package, a program package
for the validation, manipulation and analysis of MECS documents.
This is a working paper in the full sense of the term, i.e. a report on work in progress. I have wanted to publish it for a long time, but a new and better version of MECS or the MECS Program Package has always seemed to be around the next corner. 2 Planned changes to MECS are described in 2.14.
Readers who intend to use this document primarily as a practical guide
to the MECS Program Package are advised to start with the Summary in 2.1,
and then proceed directly to the User Guide in Chapter 3.
The rest of Chapter 2 provides reference material and
information of relevance to readers interested in technical aspects of
the MECS syntax, e.g. with a view to redefining the delimiter set or to
finding out whether a given markup syntax is MECS-conforming.
Readers who intend to use the MECS Program Package for
processing of SGML documents are strongly recommended to read the following
sections carefully: 2.12, 3.2, 3.7 and 4.11-13.
No detailed knowledge of any particular text encoding
system is required. But it is presupposed that readers have some acquaintance
with text encoding, or are familiar at least with the rationale behind
text encoding systems in general. 3
What is published here is for the most part a result of work in the
years from 1985 to 1987 for the Norwegian Wittgenstein Project and,
later (since 1990), for the Wittgenstein Archives at the University
of Bergen. I thank these projects for having given me the opportunity
to pursue my work on text encoding, and I thank the Norwegian Research
Council for Science and the Humanities for having permitted me to spend
part of my time as Research Fellow in philosophy on this work.
MECS started as an attempt to revise the code system of
the Norwegian Wittgenstein Project, "CosyTrawma", which was originally
developed by Associate Professor Asbjørn Brændeland
(Huitfeldt and Rossvær 1989, pp. 177-200). Brændeland, in turn,
had drawn on work done by the Tübingen Project in Germany.
The result of the revision turned out to be an entirely
different system, the earliest drafts of which were presented in Huitfeldt
and Rossvær 1989, pp. 51-54 and 201-236, and in a number of unpublished
working papers since 1989. I am particularly indebted to Senior Executive
Officer Øystein Reigem at the Norwegian Computing Centre
for the Humanities for comments and criticism of these drafts, as well
as to Professor Stig Johansson at the University of Oslo, who gave
many helpful comments and suggestions.
Much work is currently under way in text encoding. The most important
contribution in the Humanities is arguably that of the Text Encoding Initiative
(TEI). This major international cooperation project has set a standard
for text encoding in the Humanities for a long time to come.
My own participation in TEI has provided me with many
opportunities to learn from discussions with colleagues. In particular,
discussions with Lou Burnard, Michael Sperberg-McQueen, Allen
Renear and Peter Robinson have been a recurrent source of inspiration.
I am also indebted to my colleagues at the Wittgenstein
Archives for criticism, help and encouragement. In later years, Peter
Cripps has been a particularly rich source of constructive criticism
and inspiring enthusiasm.
I hereby thank the above-mentioned persons and institutions for their help and assistance, criticism and comments. Remaining errors and deficiencies are entirely my responsibility.
Bergen, September 1998
Claus Huitfeldt
The Norwegian Wittgenstein Project (NWP), which started in 1980, aimed at producing a machine-readable version of Ludwig Wittgenstein's Nachlass. Like many similar projects at the time, the NWP developed its own markup system. And like most projects that did so, the NWP enjoyed not only the many advantages of explicitly marked up texts, but also the severe disadvantage of having to develop its own specialized software, even for trivial, non-specialized tasks.
The NWP was discontinued in 1987, and during the preparation for its continuation, which was later (1990) to become the Wittgenstein Archives at the University of Bergen, I set out to improve the markup system. It had turned out that the system suffered from certain deficiencies. Any revision of the markup system necessitated adjustment of the software, which had in the course of several years of ad hoc revisions grown quite complicated. The pause in the project activities was therefore well spent looking for a more viable and flexible solution.
Standard Generalized Markup Language (SGML) was adopted as an international standard for text encoding by the International Organization for Standardization (ISO) in 1986, so at that time (i.e. 1988-1989), SGML was the most natural candidate for consideration. However, despite its many strengths and potential advantages, I found SGML unsuited to our needs. Among the reasons were:
My conclusion therefore was that I had to develop a different system altogether for the Wittgenstein Archives, a system which had to be considerably less demanding concerning software development, to answer the specific needs of our project, and yet be general and flexible enough to allow for extensive revision of the registration system during the course of future work without necessitating revision of application software.
At roughly the same time the Text Encoding Initiative (TEI) had just started (1987). The TEI based itself on SGML. Many of the issues TEI was expected to address were relevant to the problems listed above. Although we could not wait for TEI to be completed, it was therefore also an obvious consideration for my development work to keep as close to SGML as possible.
Consequently, MECS is in many respects similar to SGML. Like SGML, MECS is not itself a markup scheme, but a set of rules for the design of markup schemes. MECS may be accommodated to conform to SGML's reference concrete syntax. SGML documents are MECS-conforming, provided that they do not make use of markup reduction or minimization.
MECS markup schemes may be declared in separate "document definitions", similar to the SGML DTDs. Because they lack most of the expressive power of SGML's DTDs, I have chosen a different term: Where SGML speaks of Document Type Definitions (DTDs), MECS speaks of Code Declaration Tables (CDTs). Basically, a CDT is a declaration listing delimiters, other character sets, and codes (tags for elements and entities) to be used in a document. MECS documents may be validated for conformance with a particular CDT. But unlike SGML, no CDT is required in MECS (cf. below).
MECS includes equivalents to SGML's elements and internal entities. In addition, MECS includes syntactical means for the representation of structures which in SGML are treated in a different way. There are seven syntactically distinct types of codes (examples are given in MECS's default character set):
No-element codes: <tag>
One-element codes: <tag/ ... /tag>
Poly-element codes: [tag/2| ... /tag| ... /tag]
N-element codes: [tag/2\ ... /tag| ... /tag]
Character representation codes: {tag}
Character disambiguation codes: {...\tag}
Comments: <| ... |>
All delimiters may be redefined, and tags may be reduced or minimized (though
not omitted) according to specific rules.
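The seven code types listed above can be told apart purely by their delimiters. The following is a minimal Python sketch of that idea, not code from the MECS Program Package. It assumes the default delimiters, full (unreduced) markup, generic identifiers made of ASCII letters and digits, and text that contains no delimiter characters; the names `TAG_PATTERNS` and `tokenize` are hypothetical.

```python
import re

# Assumption: a generic identifier is a letter followed by letters or digits.
GI = r"[A-Za-z][A-Za-z0-9]*"

# One pattern per kind of tag, written for the default delimiters only.
TAG_PATTERNS = [
    ("comment",             r"<\|.*?\|>"),
    ("no_element",          r"<" + GI + r">"),
    ("one_element_start",   r"<" + GI + r"/"),
    ("one_element_end",     r"/" + GI + r">"),
    ("poly_element_start",  r"\[" + GI + r"/\d+\|"),
    ("n_element_start",     r"\[" + GI + r"/\d+\\"),
    ("multi_separator",     r"/" + GI + r"\|"),
    ("multi_end",           r"/" + GI + r"\]"),
    # A disambiguation code is matched together with the representation
    # code (or quoted string) it qualifies, e.g. {a\a} or {"---"\a}.
    ("char_disambiguation", r"\{(?:" + GI + r'|"[^"]*")\\' + GI + r"\}"),
    ("char_representation", r"\{" + GI + r"\}"),
]
TAG_RE = re.compile("|".join("(?P<%s>%s)" % p for p in TAG_PATTERNS), re.S)


def tokenize(doc):
    """Split a document into (kind, lexeme) pairs; untagged runs are 'text'."""
    tokens, pos = [], 0
    for match in TAG_RE.finditer(doc):
        if match.start() > pos:
            tokens.append(("text", doc[pos:match.start()]))
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    if pos < len(doc):
        tokens.append(("text", doc[pos:]))
    return tokens
```

Each tag is classified from its own characters only; no declaration table is consulted and nothing beyond the tag's closing delimiter is read.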
No-element codes correspond to SGML's empty elements, and mark points within the text. One-element codes correspond to ordinary SGML elements, and mark spans of text.
Multi-element codes, i.e. poly-element and N-element codes, have no obvious parallel in SGML. Poly-element codes mark two or more consecutive spans of text (typically indicating that they stand in a specific relationship to each other, e.g. that of substitution or counterposition). N-element codes are similar to poly-element codes. But whereas the number of spans (elements) marked by a poly-element code may vary from token to token, the number of elements in an N-element code is fixed.
Character representation codes correspond roughly to SGML's internal entities. Character disambiguation codes, which have no direct equivalent in SGML, are used in conjunction with character representation codes, typically to disambiguate homographic graphemes (e.g. characters which in one context may be punctuation marks, in another context logical operators).
In MECS (just as in SGML) parts of a document which should be ignored by the parser are marked as comments.
MECS has no direct parallel to SGML attributes, external entities and declarations. However, multi-element codes may be used for some of the same purposes as attributes, and the MECS Program Package supports a file inclusion mechanism which performs some of the work that SGML external entities do. MECS has corollaries to SGML comments and to marked sections with keyword CDATA, but not to the other SGML declaration types.
MECS documents contain text interspersed with codes. MECS does not presuppose any hierarchical document structure - elements may appear in any order and nest arbitrarily deeply. Multi-element codes may not overlap each other, but one-element codes may overlap all other codes without restriction.
The basic syntactic features of all tags occurring in a MECS-conforming document are directly deducible from their delimiters, even if markup is reduced to its minimum. This has at least three important consequences:
First, it increases human readability of documents. Even though heavily marked-up documents are notoriously difficult for the human eye to read, MECS at least has the advantage that you can tell, for example, a no-element tag from a one-element tag immediately. (In SGML you do not know whether a start tag is associated with an end tag or not, i.e. whether it marks an empty element or not, until you have either inspected the DTD or scanned to the end of the current element, which in the worst case means the rest of the entire document instance.)
Second, the same point applies to software development. There is no need for look-ahead to identify the basic syntactic features of a MECS tag. Therefore, as long as a MECS document includes a one-line header declaring its delimiters, the entire document can be parsed and validated for basic syntax conformance without recourse to any CDT.
Third, this means that MECS documents are in a certain fundamental sense self-documenting: If a MECS document includes a header, which is a one-line declaration of its delimiters, then a CDT to which that document conforms may be deduced directly from the document alone. The CDT thus deducible from the document is called the document's minimal CDT.
Although it is unequivocally decidable whether any particular document conforms to any particular CDT, an indefinite number of documents conform to any particular CDT, and any particular document conforms to an indefinite number of CDTs. In this respect the relationship between MECS documents and CDTs is the same as the relationship between SGML document instances and DTDs - it is a many-to-many relationship. What is special about the relationship between a document and its minimal CDT is that it is a many-to-one relationship: any particular MECS-conforming document conforms to one and only one minimal CDT.
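As a rough illustration of how the code declarations of a minimal CDT might be collected from a document alone, the Python sketch below gathers the generic identifiers found in a document, grouped by code type. It is only a sketch under stated assumptions: default delimiters, generic identifiers made of ASCII letters and digits, and a plain dictionary as output. The real CDT format (see 2.10) is not reproduced here, and the names `START_TAGS` and `declared_codes` are hypothetical.

```python
import re

# Assumption: a generic identifier is a letter followed by letters or digits.
GI = r"([A-Za-z][A-Za-z0-9]*)"

# Patterns that recognise where each code type introduces its identifier,
# for the default delimiters. The element number is optional so that
# reduced start tags such as [a| are also recognised.
START_TAGS = {
    "no-element":   r"<" + GI + r">",
    "one-element":  r"<" + GI + r"/",
    "poly-element": r"\[" + GI + r"(?:/\d+)?\|",
    "N-element":    r"\[" + GI + r"(?:/\d+)?\\",
    "character representation": r"\{" + GI + r"[}\\]",
    "character disambiguation": r"\\" + GI + r"\}",
}


def declared_codes(doc):
    """Return the generic identifiers used in doc, grouped by code type."""
    # Comments are not part of the code structure, so drop them first.
    doc = re.sub(r"<\|.*?\|>", "", doc, flags=re.S)
    return {kind: sorted(set(re.findall(pattern, doc)))
            for kind, pattern in START_TAGS.items()}
```

Because the result depends only on the document, any given document yields exactly one such listing, which mirrors the many-to-one relationship described above.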
The MECS Program Package contains programs for the creation, validation, formatting, reformatting, analysis, element extraction and spell checking of MECS-conforming documents, as well as programs for translation between MECS and SGML. All programs in the package run under MS-DOS.
MECSVAL is an interactive, validating parser-editor. MECSVAL checks CDTs and documents for MECS conformance, and may either deduce minimal CDTs from MECS-conforming documents or check that documents conform to particular CDTs.
MECSFORM formats or regularizes MECS-conforming documents by either reducing markup to its minimum or extending it to its standard form, wrapping lines to a user-specified maximum length, removing trailing blanks and trailing blank lines, optionally indenting specified elements and/or inserting reference codes in specified locations etc.
MECSPRES outputs text in various formats (HTML, WordPerfect, Folio Flat File, so-called "plain ASCII", and others). The program offers a number of options for the layout and formatting of elements (margins and marginalia, indentation, tables, columns, notes, section headers etc.; features like bold, italics, single and double underline, capitalization, letter-spacing; markers and special characters; links and anchors etc.) MECSPRES may also reformat text to other MECS-encoded formats, and to formats required by the programs ALPHATXT and BETATXT (cf. below). With MECSPRES the user may not only define stylesheets, but also format, layout and style specifications.
MECSLYSE analyzes relationships between the encoded elements of a document and allows the user to define breakpoints at which to display the code stack, list all recursive or overlapping elements, and create a tabulated list displaying the sequence and nesting level of all elements occurring in a document.
MECSGRAB extracts specified elements from a document and prints them and/or their line and column reference numbers in a separate file. This file may, under certain conditions, itself be a MECS-conforming document subject to further processing by MECSGRAB or other MECS programs.
ALPHATXT may be used for interactive spell checking in general, and spell checking of MECS-encoded documents in particular. The program may also perform a number of other tasks, such as the production of word lists sorted according to user-defined character sort criteria, frequency word lists, and simple statistical analyses.
BETATXT computes and displays all possible combinations of single elements of multi-element codes within segments of a document. For example, if sentences are marked and alternative readings are encoded with multi-element codes, then BETATXT may compute and display all the alternative readings of sentences containing substitutions.
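The kind of computation described for BETATXT can be sketched in a few lines of Python. This is not BETATXT's implementation or input format. The sketch assumes that a sentence has already been split into fixed stretches of text and lists of alternative readings (the coded elements of a multi-element code); the function name `readings` and this representation are hypothetical.

```python
from itertools import product


def readings(segments):
    """Return every reading of a sentence.

    segments is a list whose items are either a string (fixed text) or a
    list of strings (the alternative elements of one multi-element code).
    """
    choices = [[s] if isinstance(s, str) else list(s) for s in segments]
    return ["".join(combination) for combination in product(*choices)]
```

A sentence with two substitutions of two alternatives each, for example `readings(["I ", ["saw", "met"], " him ", ["today", "then"]])`, yields four readings.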
MECSSGML converts MECS-conforming documents to SGML-conforming documents. The conversion may or may not lead to a certain loss or distortion of information, depending on the degree to which the document in question includes features specific to MECS, whether or not overlapping elements are retained, etc. (Though it is possible to restrict MECS so as not to allow features which cannot be translated to SGML without loss of information.)
SGMLMECS converts SGML-documents to MECS-conforming documents. Although a number of SGML features will be converted to a form in which they are ignored by other MECS software, in a certain sense the conversion does not lead to loss or distortion of information: Documents converted to MECS with SGMLMECS may always be converted back again to their exact original SGML form with MECSSGML.
Except for MECSVAL and ALPHATXT, none of the programs in the package are interactive. However, Peter Cripps has written a menu-driven user interface, MECSPAC, for interactive use of the program package.
The lack of a rigorously defined document structure (a DTD) and the lack of restrictions against overlapping elements have been taken by some to suggest that writing programs for MECS would be more complicated than writing programs for SGML.
One difference is that where SGML programs may keep track of the document structure by means of a "last in first out" stack, MECS programs have to maintain a doubly linked list. Admittedly, this is a bit more complicated. On the other hand, the fact that the basic syntactical role of each and every tag can be inferred directly from its delimiters without look-ahead serves to simplify other matters considerably.
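To make the difference concrete, here is a minimal Python sketch of tracking open elements when overlap is allowed. It is an illustration only, not the package's Pascal code. A plain Python list stands in for the doubly linked list, tags are assumed to arrive as `("start", gi)` or `("end", gi)` pairs, and the function name `find_overlaps` is hypothetical. Unlike a stack, the structure must allow an element to be removed from the middle when its end tag arrives before those of elements opened after it.

```python
def find_overlaps(events):
    """Return (closed, still_open) pairs for every overlap encountered.

    events is a sequence of ("start", gi) and ("end", gi) pairs.
    """
    open_elements = []   # in order of opening; removal may happen anywhere
    overlaps = []
    for kind, gi in events:
        if kind == "start":
            open_elements.append(gi)
            continue
        # Search backwards for the most recently opened element with this gi.
        for i in range(len(open_elements) - 1, -1, -1):
            if open_elements[i] == gi:
                break
        else:
            raise ValueError("end tag without start tag: " + gi)
        # Elements opened later and still open overlap the one being closed.
        overlaps.extend((gi, later) for later in open_elements[i + 1:])
        del open_elements[i]
    return overlaps
```

With strictly nested input the element being closed is always the last one in the list, and the procedure degenerates to ordinary stack behaviour.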
Another difference is that whereas SGML programs may build internal tree representations of documents to facilitate manipulation of them, no such internal representations are built by MECS programs - because of the occurrence of overlapping elements this has so far seemed too complicated. 4 Therefore, all MECS programs read the entire document from its beginning in order to perform operations on it.
The MECS Program Package does not live up to standards of professional software. But the fact that it was possible for a sheer amateur to write the bulk of these programs as a side-activity during a couple of years indicates that programming for MECS is easy. Altogether the package comprises approximately 13,000 lines of Pascal code (excluding the editor). It is assumed that similar programs for SGML would demand code far in excess of this.
I once said 5 that when it comes to document structure, one of the main differences between SGML and MECS is that in SGML everything is forbidden unless it is explicitly permitted or mandatory, while in MECS everything is permitted unless it is explicitly forbidden.
In retrospect I realize that this is grossly unfair: SGML does after all admit quite permissive DTDs, and MECS does not have any means of forbidding or demanding particular document structures. 6 Still, the formulation points to a difference of emphasis: SGML provides strong mechanisms for exerting control over document structure, whereas MECS sacrifices such control in favor of free overlap and simplified or in-line declaration of elements.
Nine years have passed since the development of MECS started, and it has been used in the encoding of several thousand manuscript pages. The TEI guidelines have been available for quite some time, and have been discussed and used extensively by a large number of projects. The amount and range of SGML-based software has increased considerably.
Is there still a need for MECS? Despite the fact that SGML is a far more sophisticated markup language, I believe that the considerations which led me to dismiss SGML nine years ago still apply. 7 MECS is therefore in my eyes still the preferred choice for a project like the Wittgenstein Archives. However, MECS also has obvious shortcomings. If not all, then at least a number of these shortcomings are eliminated in SGML. Unfortunately, conversion from MECS to SGML without loss of information is notoriously difficult, so we cannot have the best of both worlds.
One recent development (1997) within the SGML area is particularly interesting. Extensible Markup Language (XML), which has received much attention lately, shares a couple of features with MECS: In XML, empty elements are visibly different from elements with content, and tag omission is not allowed. Consequently, a DTD is not required in XML, and a distinction is made between well-formedness ("valid" without DTD) and validity (valid according to some specific DTD) of documents. In all these respects, XML is therefore closer to MECS than SGML is. 8 (It is also interesting to note that one of the arguments often made in favor of XML is that it is easier to write programs for than SGML is.) However, in one important respect XML poses even greater difficulties than SGML: XML does not include SGML's CONCUR feature. And without CONCUR the conversion of MECS documents seems even more difficult.
Some work has been done in order to create a bridge from MECS to SGML. Sunniva Solstrand has developed a method (and a program) for automatically "deducing" DTDs from document instances converted from MECS to SGML (Solstrand 1994). Sascha Djuric has proposed a convention for automatically converting elements with overlap to hierarchical structures in a controlled manner (Djuric 199?). What remains in particular is a method for converting MECS documents with overlap to concurrent hierarchies by using the SGML CONCUR feature, and SGML software which implements this feature. MECS-to-SGML conversion is one of the concerns of an ongoing cooperation between C. Michael Sperberg-McQueen and myself.
MECS is a syntax for the design of text encoding systems. Documents which conform to this syntax consist of text interspersed with codes, of which there may be seven syntactically distinct types:
No-element codes: <s>
One-element codes: <a/ ... /a>
Poly-element codes: [a/2| ... /a| ... /a]
N-element codes: [s/2\ ... /s| ... /s]
Character representation codes: {a}
or {"---"\a}
Character disambiguation codes: {a\a}
or {"---"\a}
Comments: <| xxx |>
In these examples '...' indicates coded elements, i.e. character strings
which may or may not contain further codes. 's' and 'a' exemplify generic
identifiers, i.e. names of individual codes.
The first four types of codes are sometimes referred to jointly as element codes; poly-element and N-element codes are sometimes referred to jointly as multi-element codes; while character representation and disambiguation codes are sometimes referred to jointly as character codes.
The examples above are given in MECS's default character set. However, all character sets in MECS may be redefined, and there are no restrictions on which characters may be used as code delimiters or which as free characters or tag characters.
Strictly speaking, MECS is therefore not in itself a code system, but a general-purpose set of rules for the design of such systems. MECS specifies how to assign specific syntactic roles to characters and character sets, how to declare generic identifiers for codes, how to use these codes in documents, etc., - in short, how to define and use a code system conforming to the basic code syntax specified by MECS.
This definition is given in the form of a Code Declaration Table (CDT). The CDT starts with a MECS header. The header assigns values to the code delimiters, which decide the most basic general features of any MECS code system. The rest of the CDT declares free characters, tag characters and generic identifiers.
Thus, a MECS-conforming document is a document conforming to a MECS CDT. The document itself may also start with a MECS header. If it does, its minimal CDT can be reconstructed on the basis of the encoded document alone.
The default MECS header is:
£ < > < / / > [ / | \ / | / ] { " \ }
Any order and nesting level of codes in documents is allowed. Codes may
be contained within each other wholly (hierarchically) or only partly (overlapping
each other). However, there is one restriction against overlapping: multi-element
codes may nest hierarchically, but they may not overlap other multi-element
codes.
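The single restriction on overlap can be checked with an ordinary stack, since it concerns multi-element codes only. The Python sketch below is an illustration under assumptions, not the MECSVAL algorithm: tags are assumed to be available as `(kind, gi)` pairs with full markup, and the kind names and the function name `multi_element_codes_nest` are hypothetical.

```python
def multi_element_codes_nest(tags):
    """Return True if multi-element codes nest without overlapping each other.

    tags is a sequence of (kind, gi) pairs. Only the kinds "multi_start"
    and "multi_end" are examined; all other tags, including one-element
    tags that overlap multi-element codes, are ignored.
    """
    open_multi = []
    for kind, gi in tags:
        if kind == "multi_start":
            open_multi.append(gi)
        elif kind == "multi_end":
            # The code being closed must be the innermost open multi-element code.
            if not open_multi or open_multi[-1] != gi:
                return False
            open_multi.pop()
    return not open_multi
```

One-element codes are deliberately left out of the check, because they may overlap any other code.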
Codes belonging to different code types may have identical generic identifiers, with one exception: neither no-element and one-element codes nor poly-element and N-element codes may share the same generic identifier.
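The exception stated above amounts to two set comparisons. A minimal Python sketch follows. It assumes the declared generic identifiers are available as a mapping from code type to a collection of identifiers, which is an illustrative representation and not the CDT format; the names `EXCLUSIVE_PAIRS` and `shared_identifiers` are hypothetical.

```python
# The two pairs of code types that may not share a generic identifier.
EXCLUSIVE_PAIRS = [("no-element", "one-element"), ("poly-element", "N-element")]


def shared_identifiers(codes):
    """Return the generic identifiers that violate the sharing rule.

    codes maps a code type name to a collection of generic identifiers.
    The result maps each offending pair of code types to the sorted list
    of identifiers that both declare.
    """
    conflicts = {}
    for first, second in EXCLUSIVE_PAIRS:
        shared = set(codes.get(first, ())) & set(codes.get(second, ()))
        if shared:
            conflicts[(first, second)] = sorted(shared)
    return conflicts
```

An identifier used both as a one-element code and as a poly-element code is not reported, since that combination is permitted.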
Character disambiguation codes may be used in conjunction with character representation codes only. The generic identifier of the associated character representation code may be replaced by a string of free characters enclosed by character quote delimiters.
Comments may occur anywhere in a document, and they may contain any sequence of legal characters. The contents of a comment are not regarded as part of the code structure of a document.
According to the general rules for markup reduction one-element codes, poly-element codes and N-element codes may be reduced:
Full markup                        Reduced markup
<a/ ... /a>                        <a/ ... >
[a/2| ... /a| ... /a]              [a| ... | ... ]
[a/3| ... /a| ... /a| ... /a]      [a/3| ... | ... | ... ]
[s/2\ ... /s| ... /s]              [s\ ... | ... ]
[t/3\ ... /t| ... /t| ... /t]      [t/3\ ... | ... | ... ]
SGML documents are MECS-conforming, provided that they do not make use of tag minimization or end tag omission 9. Some MECS documents will be well-formed SGML documents, others may easily be converted to SGML, yet others may only be converted to SGML with a certain distortion or loss of information.
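The full and reduced forms of one-element, poly-element and N-element codes shown in this section can be reproduced with two substitutions. The Python sketch below is inferred from those examples alone and is not MECSFORM's algorithm. It assumes the default delimiters, that the generic identifier is dropped from end and separator tags, and that an element number of 2 may be left out of a start tag. It ignores the detailed rules of 2.6 and should not be applied to documents with overlapping elements. The function name `reduce_markup` is hypothetical.

```python
import re

# Assumption: a generic identifier is a letter followed by letters or digits.
GI = r"[A-Za-z][A-Za-z0-9]*"


def reduce_markup(doc):
    """Rewrite full markup in its reduced form (default delimiters only)."""
    # [a/2| -> [a|  and  [s/2\ -> [s\   (other element numbers are kept)
    doc = re.sub(r"(\[" + GI + r")/2([|\\])", r"\1\2", doc)
    # /a> -> >,  /a| -> |,  /a] -> ]    (identifier dropped from closing tags)
    doc = re.sub(r"/" + GI + r"([>|\]])", r"\1", doc)
    return doc
```

Since a generic identifier is assumed to start with a letter, the second substitution leaves element numbers such as `/3|` in start tags untouched.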
2.2 Basic Code Syntax, Code Systems and Documents
MECS is a basic code syntax for the design and specification of code systems for markup of electronic documents. Strictly speaking, MECS is therefore not in itself a code system, but a general-purpose set of rules for the design of such systems.
MECS specifies how to assign specific syntactic roles to characters and character sets, how to declare generic identifiers for codes, how to use these codes in documents, etc., - in short, how to define and use a code system conforming to the basic code syntax specified by MECS.
These assignments and declarations are listed in a Code Declaration Table - a CDT. Strictly speaking, again, it is only when adding a CDT to the basic code syntax of MECS that we have a code system. Adding a CDT to the basic code syntax of MECS is like adding an alphabet and a vocabulary to a formal grammar.
The values assigned to the code delimiters decide the most general basic features of any MECS code system. These values are declared in the MECS header. The MECS header is the very first part of the CDT and may also be included as the first part of MECS documents. Any MECS document which contains such a header is self-documenting in the sense that a minimal CDT may be reconstructed on the basis of the document alone.
An electronic document adhering to the specifications of a specific MECS code system, e.g. MECS-XXX, may be called a MECS-XXX-conforming or a MECS system-conforming document. A document adhering to the specifications of some MECS code system or other will be called a MECS-conforming document. All MECS system-conforming documents are MECS-conforming documents, but not vice versa.
In our context, a computerized text is regarded as a stream of characters.
A MECS document is a string of free characters and codes.
A code is an ordered sequence of tags and (optionally) elements. A code may consist of one single tag, or it may consist of several tags and one or more elements included between the tags.
An element is a string of free characters and tags.
An element occurring between the tags of one and the same code is called the code's coded element.
A tag consists of code delimiter(s) and/or tag characters. More specifically, a tag may consist of a tag open delimiter, a string of tag characters constituting a generic identifier, possibly followed by an attribute string, and a tag close delimiter. Or a tag may consist of a tag close delimiter only.
There are seven types of codes. Using the MECS default delimiters (cf. 2.9.5), examples of these code types will appear as follows:
No-element code: <s>
One-element code: <a/ ... /a>
Poly-element code: [a/2| ... /a| ... /a]
N-element code: [s/2\ ... /s| ... /s]
Character representation code: {a}
or {"---"\a}
Character disambiguation code: {a\a}
or {"---"\a}
Comment: <| xxx |>
In these examples 's' and 'a' are generic identifiers, '...' indicates elements,
'---' indicates strings of free characters, and 'xxx' is any sequence of
legal characters. The element(s) occurring between the tags of a code are
called its coded element(s). Thus, one-element codes have one coded element,
poly-element and N-element codes have several coded elements, while the
other code types have no coded elements.
The first four types of codes are sometimes referred to jointly as element codes. Poly-element and N-element codes are sometimes referred to jointly as multi-element codes. The last two types of codes are sometimes referred to jointly as character codes.
A no-element code consists of one single tag, which is called the no-element
tag.
The no-element tag consists of a no-element code open
delimiter (NCO), a generic identifier (optionally followed by an attribute
string) and a no-element code close delimiter (NCC).
A one-element code consists of a one-element start tag, a coded element
and a one-element end tag.
The one-element start tag consists of a one-element start
tag open delimiter (OSO), a generic identifier (optionally followed by
an attribute string) and a one-element start tag close delimiter (OSC).
The one-element end tag consists of a one-element end
tag open delimiter (OEO), the same generic identifier as the start tag
and a one-element end tag close delimiter (OEC).
A poly-element code consists of a poly-element start tag, one or more
coded elements separated by multi-element separator tags and a multi-element
end tag.
The poly-element start tag consists of a multi-element
start tag open delimiter (MSO), a generic identifier (optionally followed
by an attribute string), a multi-element number delimiter (MNC), an element
number and a poly-element start tag close delimiter (PSC).
The multi-element separator tag consists of a multi-element
separator tag open delimiter (MDO), the same generic identifier as the
poly-element start tag and a multi-element separator tag close delimiter
(MDC).
The multi-element end tag consists of a multi-element
end tag open delimiter (MEO), the same generic identifier as the poly-element
start tag and a multi-element end tag close delimiter (MEC).
The number of coded elements contained by a poly-element
code is indicated by the element number. The number of multi-element separator
tags contained by a particular poly-element code token equals the number
of coded elements minus one.
Poly-element codes may contain two or more elements and
the number of elements contained by different tokens of the same
poly-element code in a document may vary from token to token.
2.4.4 N-element Codes
An N-element code consists of an N-element start tag, one or more coded
elements separated by multi-element separator tags and a multi-element
end tag.
N-element codes are syntactically identical to poly-element
codes, except that: (1) the start tag close delimiter is an N-element start
tag close delimiter (NSC); and (2) the number of elements contained by
different tokens of the same N-element code in a document may not vary
from token to token.
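As a rough illustration, the tag structure of one-element, poly-element and N-element codes can be sketched in Python. The sketch assumes the default delimiters (cf. 9.5); the function names are hypothetical and are not part of MECS or of the MECS Program Package.

```python
# Sketch only: builds unreduced MECS codes with the default delimiters.
# Function names are illustrative assumptions, not part of MECS.

def one_element_code(gi, element):
    """OSO gi OSC element OEO gi OEC, e.g. <a/ ... /a>."""
    return f"<{gi}/ {element} /{gi}>"


def multi_element_code(gi, elements, n_element=False):
    """Poly-element code (PSC '|') or N-element code (NSC '\\')."""
    if len(elements) < 2:
        # Assumption: multi-element codes contain two or more elements.
        raise ValueError("multi-element codes contain two or more elements")
    close = "\\" if n_element else "|"
    start_tag = f"[{gi}/{len(elements)}{close}"   # MSO gi MNC number PSC/NSC
    separator = f" /{gi}| "                        # MDO gi MDC
    end_tag = f"/{gi}]"                            # MEO gi MEC
    return f"{start_tag} {separator.join(elements)} {end_tag}"


print(one_element_code("a", "..."))
print(multi_element_code("a", ["...", "..."]))
print(multi_element_code("s", ["...", "..."], n_element=True))
```

The only syntactic difference between the two multi-element code types is the close delimiter of the start tag, which is why a single function with a flag is enough here.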
2.4.5 Character Representation Codes
A character representation code consists of one single tag, which is
called the character representation tag.
The character representation tag consists of a character
representation code open delimiter (CRO), a generic identifier and either
a character code close delimiter (CCC) or a character disambiguation code
open delimiter (CDO).
If the generic identifier is followed by a character disambiguation code open delimiter (CDO), the character representation code is used in conjunction with a character disambiguation code immediately succeeding it, like this:
{a\a}
where 'a' is the generic identifier of a character representation code
and also of a character disambiguation code. If accompanied by a character
disambiguation code, the character representation code may, instead of
a generic identifier, contain a string of free characters, enclosed by
character quote delimiters (CQDs) - cf. 4.6 for further explanation of
this feature.
2.4.6 Character Disambiguation Codes
A character disambiguation code consists of one single tag, which is
called the character disambiguation tag.
The character disambiguation tag consists of a character
disambiguation code open delimiter (CDO), a generic identifier and a character
code close delimiter (CCC).
A character disambiguation code can only be used in conjunction with a character representation code immediately preceding it. The close delimiter of the character representation code is then replaced by the open delimiter of the character disambiguation code.
The preceding character representation code may, instead of a generic identifier, contain a string of free characters, enclosed by character quote delimiters (CQD), like this:
{"---"\a}
where 'a' is the generic identifier of a character disambiguation code
and '---' is a string of free characters.
2.4.7 MECS Comments
A MECS comment may contain free characters, tag characters and code delimiters, i.e. any legal characters, in any order. The contents of a comment are not regarded as part of the code structure of a document.
A comment starts with a one-element start tag open delimiter (OSO) immediately followed by a poly-element start tag close delimiter (PSC), and ends with a poly-element start tag close delimiter (PSC) immediately followed by a one-element end tag close delimiter (OEC). Thus, in MECS' default character set a comment looks like this:
<| xxx |>
where 'xxx' stands for any sequence of legal characters.
<!-- xxxxxxxx -->
or other SGML declarations like e.g.
<![PCDATA[ >> < < </xxx &]]>
to be valid in SGML-like MECS documents.
2.5 Generic identifiers and attribute strings
A MECS code system may include any number of generic identifiers 10 for each of the different code types except for comments, which do not have generic identifiers.
Neither no-element and one-element codes nor poly-element and N-element codes may share the same generic identifier. Apart from this, codes belonging to different code types may have identical generic identifiers.
Thus, the following examples would be legal and might all be included in one and the same document conforming to a MECS code system:
(1) <s>
(2) <a/ ... /a>
(3) [a/2| ... /a| ... /a]
(4) [s/2\ ... /s| ... /s]
(5) {a}
(6) {a\a}
MECS also allows for the use of numerals as identifiers of one-element
codes, so that codes may be used with natural numbers in the place of generic
identifiers, e.g.
(7) <1/ ... /1>
(8) <2/ ... /2>
etc.
Start tags of element codes may contain attribute strings: if the tag's
generic identifier is followed by a string delimiter or a nil character
(i.e. in the normal position of the tag close delimiter), then the rest
of the tag may contain any sequence of free characters (except for any
that might be identical to code delimiters), ending with the tag close
delimiter.
I.e., given the examples (1) and (2) above, the following
examples would also be legal:
(9) <s This is an attribute string>
(10) <a attribute=value n=1/ ... /a>
2.6 Markup Reduction
The implications of these rules for each specific code type are as follows:
No-element codes cannot be reduced.
A one-element end tag with a generic identifier identical to the generic identifier of the last preceding unterminated one-element start tag may be reduced to the one-element end tag close delimiter.
A poly-element start tag may, if the code has 2 elements, be reduced to multi-element start tag open, generic identifier and poly-element start tag close.
An N-element start tag may, if the code has 2 elements, be reduced to multi-element start tag open, generic identifier and N-element start tag close.
A multi-element separator tag with a generic identifier
identical to the last preceding unterminated multi-element start tag or
separator tag may be reduced to the multi-element separator tag close.
A multi-element end tag with a generic identifier identical
to the last preceding unterminated multi-element separator tag may be reduced
to the multi-element end tag close delimiter.
Character representation codes, character disambiguation codes and comments cannot be reduced.
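The reduction rules for one-element and multi-element codes can be sketched as follows. This is a minimal sketch which assumes the default delimiters (cf. 9.5) and a code which does not overlap any other code; the function names are hypothetical.

```python
# Sketch only: fully reduced forms under the default delimiters,
# assuming the code does not overlap any other code.

def reduced_one_element(gi, element):
    # The end tag is reduced to the one-element end tag close delimiter (OEC).
    return f"<{gi}/ {element} >"


def reduced_multi_element(gi, elements, n_element=False):
    close = "\\" if n_element else "|"
    # The start tag may drop MNC and the element number only for 2 elements.
    number = "" if len(elements) == 2 else f"/{len(elements)}"
    start_tag = f"[{gi}{number}{close}"
    # Separator tags reduce to MDC ('|'), the end tag reduces to MEC (']').
    return f"{start_tag} {' | '.join(elements)} ]"


print(reduced_one_element("a", "..."))
print(reduced_multi_element("a", ["...", "..."]))
print(reduced_multi_element("s", ["...", "..."], n_element=True))
```

With more than two elements the element number has to stay in the start tag, so only the separator and end tags are shortened.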
Thus, the examples (2)-(4) above may be reduced to
(11) <a/ ... >
(12) [a| ... | ... ]
(13) [s\ ... | ... ]
2.7 Document Structure
A MECS document may include the header of a MECS system to which it conforms (cf. 10). If so, the first character of the header must also be the very first character of the document.
Apart from this optional header, a MECS document consists of codes and elements appearing in any order.
That a code A contains a code B means that one or more of B's tags are
contained in A's coded element(s).
B is hierarchically nested within A if A contains B, and
B does not contain A.
A and B overlap if A contains B, and B contains A.
Any order and nesting level of codes is allowed, with one exception: no multi-element code may overlap any other multi-element code 11.
This means that the following examples are all legal:
(14) <a/ /a> <b/ /b>
(15) <a/ <a/ /a> /a>
(16) <a/ <b/ /b> /a>
(17) <a/ <b/ /a> /b>
(18) <a/ [a/2| /a| /a> /a]
(19) [a/2| [s/2\ /s| /s] /a| /a]
(20) [a/2| [a/3| /a| /a| /a] /a| [t/3\ /t| /t| /t] /a]
(21) [a/2| <a/ <s> {a\a} [b/2| [s/2\ {a\a} /s| {b} /s]
<b/ {b\a} /b| /a> /b] <s> /a| /b> /a]
However, the following example is illegal:
[s/2\ [b/2| /s| /b| /s] /b]
It should be noted that overlapping reduces the possibilities for markup reduction. For example, (21) above reduces to:
(22) [a| <a/ <s> {a\a} [b| [s\ {a\a} | {b} ]
<b/ {b\a} /b| /a> ] <s> /a| /b> ]
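The definitions of containment, hierarchical nesting and overlap given above can be sketched in Python. In this sketch a code is modelled simply by the positions of its tags in the document; the class and function names are hypothetical, and no actual parsing of MECS text is attempted.

```python
# Sketch only: codes are modelled by tag positions, not parsed from text.
from dataclasses import dataclass


@dataclass
class Code:
    tags: list  # positions of the code's tags, in document order

    def elements(self):
        # Each coded element lies between two consecutive tags of the code.
        return list(zip(self.tags, self.tags[1:]))


def contains(a, b):
    """A contains B if one or more of B's tags lie in A's coded element(s)."""
    return any(lo < t < hi for lo, hi in a.elements() for t in b.tags)


def nested(a, b):
    """B is hierarchically nested within A."""
    return contains(a, b) and not contains(b, a)


def overlap(a, b):
    return contains(a, b) and contains(b, a)


# Example (16): <a/ <b/ /b> /a>   tags of a at 0 and 3, tags of b at 1 and 2
print(nested(Code([0, 3]), Code([1, 2])))
# Example (17): <a/ <b/ /a> /b>   tags of a at 0 and 2, tags of b at 1 and 3
print(overlap(Code([0, 2]), Code([1, 3])))
```

The restriction that no multi-element code may overlap any other multi-element code would then amount to checking `overlap` for every pair of multi-element codes in a document.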
2.8 Classification of MECS systems
Any MECS system is either complete or partial.
A complete MECS system contains all the seven code types
described above (cf. 4).
A partial MECS system lacks one or more of the code types
of a complete system. Partial systems are called N-type systems, where
N is the number of code types contained by the system.
Any MECS system is either reduced, reducible, or irreducible.
A reduced MECS system demands full reduction of all start
tags and separator tags.
A reducible MECS system permits but does not require reduction
of start and separator tags.
An irreducible MECS system requires that no tags are reduced.
Any MECS system is either restricted or unrestricted.
In a restricted MECS system, no codes may overlap.
A system which is not restricted, is unrestricted.
A reduced system is necessarily a restricted system, but not vice versa.
2.9 Character sets
A MECS document consists of legal characters only. The legal characters are the delimiters, free characters and tag characters.
There are several subsets of legal characters, whereof some sets may overlap and others not. MECS assigns a number of different roles to these character sets and to individual characters, and includes rules concerning the relationships between these sets and between particular members of particular sets.
2.9.1 Delimiters
2.9.1.1 Code delimiters
There are 18 code delimiters. They correspond to the six first types of codes (cf. 4) as indicated below.
Assigning values to the code delimiters is one of the most basic operations in the definition of a MECS code system. The values assigned to the code delimiters decide many of the basic syntactical features of the code system (cf. 8, 9.5 and 9.6).
If a code delimiter is assigned the value nil, the delimiter itself
is said to be nil, or undeclared.
That a character which is the value of a code delimiter
is a reserved delimiter value means that it can not belong to the free
characters of the code system defined. A delimiter is said to be reserved
if its value is a reserved delimiter value.
Values are assigned to code delimiters according to the following rules:
2.9.1.2 String delimiter
The string delimiter (SD) may occur anywhere in elements. In tags, SD separates generic identifiers from attribute strings.
SD may not be nil. Its value may not be identical to the value of any code delimiter. SD is always a free character.
If SD is assigned a blank character, line endings and start and end of file will be regarded as equivalents to SD.
2.9.2 Nil character
The nil character may occur anywhere in a document. In tags, the nil character separates generic identifiers from attribute strings.
Its value may not be identical to any of the code delimiters. The nil character is always a free character.
If the nil character is identical to SD, the character value in question will be interpreted as SD, and the system defined is said to contain no nil character.
2.9.3 Free characters
Free characters may occur anywhere in elements.
All legal characters except those which are reserved delimiter values may be included in the set of free characters.
The string delimiter and the nil character always belong to the free characters.
2.9.4 Tag characters
Generic identifiers consist of tag characters.
All legal characters except those which are values of code delimiters may be included in the set of tag characters.
2.9.5 Default character sets
Default code delimiters
The default MECS code delimiters define a complete, unrestricted, reducible system, with 10 different, whereof 8 reserved, delimiter values.
Header: £ < > < / / > [ / | \ / | / ] { " \ }
Code delimiters: < / > [ | ] { \ } "
Reserved code delimiters: < / > [ | ] { }
Default string delimiter: (blank)
Default nil character: (blank)
Default free characters:
a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 1 2 3 4 5 6 7 8 9 0 , ; . : - ( ) ! ? " ' * % & = + (blank)
Default tag characters:
a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 1 2 3 4 5 6 7 8 9 0 . _ -
2.9.6 Examples of alternative character sets
Example 1
A complete, unrestricted, irreducible system is defined by the following header:
£ * * * | * @ | | * \ | @ @ * @ " / @
The system has 6 code delimiters, whereof 3 are reserved.
Code delimiters: * | @ \ " /
Reserved code delimiters: * | @
The code syntax of the system is:
No-element code: *s*
One-element code: *a| ... *a@
Poly-element code: |a|2* ... |a@ ... @a*
N-element code: |s|2\ ... |s@ ... @s*
Character representation code: @a@
or @"---"/a@
Character disambiguation code: @a/a@
or @"---"/a@
Comment: ** xxx *@
Example 2
A complete, unrestricted, reducible system is defined by the following header:
£ * * * < * > * % [ \ * | * ] * " _ /
The system has 11 code delimiters, whereof 4 are reserved.
Code delimiters: * < > % [ \ | ] " _ /
Reserved code delimiters: * > | ]
The code syntax of the system is:
No-element code: *s*
One-element code: *a< ... *a>
Poly-element code: *a%2[ ... *a| ... *a]
N-element code: *s%2\ ... *s| ... *s]
Character representation code: *a/
or *"---"_a/
Character disambiguation code: *a_a/
or *"---"_a/
Comment: *[ xxx [>
Example 3
A partial (4-type), restricted, reduced system is defined by the following header:
£ < > < / £ > £ £ / £ £ £ £ £ / £ £ /
The system has 3 code delimiters, whereof all are reserved.
Code delimiters: < > /
Reserved code delimiters: < > /
The code syntax of the system is:
No-element code: <s>
One-element code: <a/ ... >
Character representation code: /a/
Comment: </ --- />
Example 4
A partial (4-type), unrestricted, reducible system is defined by the following header:
£ < > < > </ > [ £ ! £ £ £ £ ] & £ £ ;
The system has 8 code delimiters, whereof 6 are reserved.
Code delimiters: < > </ [ ! ] & ;
Reserved code delimiters: < > [ ! ] &
The code syntax of the system is:
No-element code: <s>
One-element code: <a> ... </a>
Character representation code: &a;
Comment: <! xxx [ xxx ] xxx >
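The relation between a header's delimiter values and the resulting code syntax can be sketched as follows. The sketch assumes a complete system (no nil delimiters), blank-separated delimiter strings, and the delimiter order NCO ... CCC given in 2.10; the function name and the dictionary keys are hypothetical.

```python
# Sketch only: derives code syntax patterns from 18 delimiter values.
# Assumes a complete system, i.e. none of the delimiters is nil.
DELIMITER_NAMES = ("NCO NCC OSO OSC OEO OEC MSO MNC PSC NSC "
                   "MDO MDC MEO MEC CRO CQD CDO CCC").split()


def code_syntax(delimiter_strings):
    d = dict(zip(DELIMITER_NAMES, delimiter_strings.split()))
    return {
        "no-element": f"{d['NCO']}s{d['NCC']}",
        "one-element": f"{d['OSO']}a{d['OSC']} ... {d['OEO']}a{d['OEC']}",
        "poly-element": (f"{d['MSO']}a{d['MNC']}2{d['PSC']} ... "
                         f"{d['MDO']}a{d['MDC']} ... {d['MEO']}a{d['MEC']}"),
        "n-element": (f"{d['MSO']}s{d['MNC']}2{d['NSC']} ... "
                      f"{d['MDO']}s{d['MDC']} ... {d['MEO']}s{d['MEC']}"),
        # A comment is delimited by OSO PSC ... PSC OEC.
        "comment": f"{d['OSO']}{d['PSC']} xxx {d['PSC']}{d['OEC']}",
    }


# Delimiter strings of Example 1 and Example 2 (without SD, nil indicator
# and nil character, which precede them in the header).
example_1 = code_syntax('* * * | * @ | | * \\ | @ @ * @ " / @')
example_2 = code_syntax('* * * < * > * % [ \\ * | * ] * " _ /')
for name, pattern in example_1.items():
    print(f"{name}: {pattern}")
```

Partial systems such as Examples 3 and 4 would need extra handling of nil delimiters, which this sketch leaves out.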
2.10 Code Declaration Table (CDT)
It is only when adding a Code Declaration Table (CDT) to MECS' basic code syntax that we have a MECS code system. The CDT assigns values to the delimiters and other character sets, and declares the actual codes of the system. The CDT itself is a file of characters.
The very first character of the CDT declares the system's string delimiter.
The second character of the CDT declares the system's nil indicator, which in the rest of the CDT indicates an assignment of the value nil.
The third character of the CDT declares the system's nil character.
In the rest of the CDT, all character strings delimited by string delimiters are strings.
The first 18 strings of the table declare the system's code delimiters, in the following order:
NCO NCC OSO OSC OEO OEC MSO MNC PSC NSC MDO MDC MEO MEC CRO CQD CDO CCC
Together, the first three characters and the first 18 strings of the CDT form a string defining the system's header. The default MECS header is:
£ < > < / / > [ / | \ / | / ] { " \ }
(Note that the very first character of the header is a blank.)
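The layout of the header can be illustrated by a small Python sketch. It assumes a header held on a single line and a single-character string delimiter; the function name is hypothetical, and nil values are represented by `None` purely as a convention of this sketch.

```python
# Sketch only: reads a MECS header laid out as described above.
DELIMITER_NAMES = ("NCO NCC OSO OSC OEO OEC MSO MNC PSC NSC "
                   "MDO MDC MEO MEC CRO CQD CDO CCC").split()


def parse_header(header):
    """Return (SD, nil indicator, nil character, delimiter values)."""
    sd, nil_indicator, nil_char = header[0], header[1], header[2]
    # Everything after the first three characters is a run of strings
    # delimited by SD; the first 18 of them are the code delimiters.
    strings = [s for s in header[3:].split(sd) if s]
    if len(strings) < len(DELIMITER_NAMES):
        raise ValueError("a header declares 18 code delimiters")
    delimiters = {
        name: None if value == nil_indicator else value  # None stands for nil
        for name, value in zip(DELIMITER_NAMES, strings)
    }
    return sd, nil_indicator, nil_char, delimiters


# The default MECS header; note the leading blank, which declares SD.
default_header = ' £ < > < / / > [ / | \\ / | / ] { " \\ }'
sd, nil_indicator, nil_char, delimiters = parse_header(default_header)
distinct_values = {v for v in delimiters.values() if v is not None}
print(delimiters["OSO"], delimiters["OSC"], len(distinct_values))
```

For the default header the number of distinct delimiter values comes out as 10, which agrees with the figure given for the default code delimiters (cf. 9.5).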
The next five strings (i.e. strings nos 19, 20, 21, 22 and 23) declare the system's free characters.
Strings number 24 and 25 declare the system's tag characters.
The next six strings (i.e. strings nos 26, 27, 28, 29, 30 and 31) declare the CDT's code type indicators, which in the rest of the CDT indicate which code type a generic identifier belongs to.
String no 26 declares the no-element code indicator. String no 27 declares the one-element code indicator. String no 28 declares the numeric indicator. String no 29 declares the poly-element code indicator. String no 30 declares the character representation code indicator. String no 31 declares the character disambiguation code indicator.
The first part of the CDT, i.e. the part beginning with SD and including the first 31 strings, is the code syntax part of the CDT.
The rest of the CDT is the code inventory part. This part consists of pairs of strings, the first of which is a code type indicator and the second a generic identifier. Each pair declares a code of the indicated type and assigns a generic identifier to it.
If the numeric indicator is not nil, numbers indicate N-element codes in the rest of the table, and one-element codes with numeric identifiers (cf. 5) are declared by replacing the generic identifier by a numeric indicator.
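Reading the code inventory part can be sketched as follows. The sketch assumes that the six code type indicators and the inventory strings have already been split into lists, and that the numeric indicator is not nil; the function name, the type labels and the `<numeric>` placeholder are hypothetical.

```python
# Sketch only: interprets the (indicator, identifier) pairs of a CDT's
# code inventory part. Assumes the numeric indicator is not nil.
TYPE_NAMES = ["no-element", "one-element", "numeric", "poly-element",
              "character representation", "character disambiguation"]


def read_inventory(indicator_strings, inventory_strings):
    indicators = dict(zip(indicator_strings, TYPE_NAMES))
    numeric_indicator = indicator_strings[2]
    codes = []
    pairs = zip(inventory_strings[0::2], inventory_strings[1::2])
    for indicator, identifier in pairs:
        if indicator.isdigit():
            # A number declares an N-element code with that many elements.
            codes.append((f"{indicator}-element", identifier))
        elif identifier == numeric_indicator:
            # E.g. "one num": one-element codes with numeric identifiers.
            codes.append((indicators[indicator], "<numeric>"))
        else:
            codes.append((indicators[indicator], identifier))
    return codes


codes = read_inventory(
    "no one num poly rep dis".split(),
    "no s one a one b one num poly a poly b 2 s 3 t rep a rep b dis a".split(),
)
for code_type, identifier in codes:
    print(code_type, identifier)
```

The indicator and inventory strings used here are those of the example CDT (23) below.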
The following is an example of a CDT defining a complete, unrestricted, reducible system with the MECS default character sets defined above (cf. 9.5).
(23)
+---------------------------------------+-----------+---------+
| £ < > < / / > [ / | \ / | / ] { " \ } | Header | |
+---------------------------------------+-----------+ |
| | | |
|abcdefghijklmnopqrstuvwxyz | Free | |
|ABCDEFGHIJKLMNOPQRSTUVWXYZ | char- | |
|1234567890 | acters | |
|,;.:-()!?"' | |Code |
|*%&=+ | |Syntax |
+---------------------------------------+-----------+Part |
|abcdefghijklmnopqrstuvwxyz | Tag char- | |
|ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890._-| acters | |
+---------------------------------------+-----------+ |
|no one num poly rep dis | Code type +---------+
| | indicators| |
+---------------------------------------+-----------+ |
|no s | | |
|one a | | |
|one b | | |
|one num | | |
|poly a | |Code |
|poly b | |Inventory|
|2 s | |Part |
|3 t | | |
|rep a | | |
|rep b | | |
|dis a | | |
+---------------------------------------+-----------+---------+
The following document, which contains the examples (1)..(22) above, conforms
to the above CDT.
(24)
(1) <s>
(2) <a/ ... /a>
(3) [a/2| ... /a| ... /a]
(4) [s/2\ ... /s| ... /s]
(5) {a}
(6) {a\a}
(7) <1/ ... /1>
(8) <2/ ... /2>
(9) <s This is an attribute string>
(10) <a attribute=value n=1/ ... /a>
(11) <a/ ... >
(12) [a| ... | ... ]
(13) [s\ ... | ... ]
(14) <a/ /a> <b/ /b>
(15) <a/ <a/ /a> /a>
(16) <a/ <b/ /b> /a>
(17) <a/ <b/ /a> /b>
(18) <a/ [a/2| /a| /a> /a]
(19) [a/2| [s/2\ /s| /s] /a| /a]
(20) [a/2| [a/3| /a| /a| /a] /a| [t/3\ /t| /t| /t] /a]
(21) [a/2| <a/ <s> {a\a} [b/2| [s/2\ {a\a} /s| {b} /s] <b/
{b\a} /b| /a> /b] <s> /a| /b> /a]
(22) [a| <a/ <s> {a\a} [b| [s\ {a\a} | {b} ] <b/
{b\a} /b| /a> ] <s> /a| /b> ]
2.11 Deducing a Minimal CDT from an Encoded Document
Although it is unequivocally decidable whether any particular document conforms to the MECS code system defined by any particular CDT, an indefinite number of documents conform to any particular CDT, and any particular document conforms to an indefinite number of CDTs.
However, if a document contains the header of any CDT to which it conforms (cf. 9.1.1), one particular CDT to which the document conforms may be deduced directly from the document alone.
In virtue of the rules for assigning values to code delimiters (cf. 9.1.1), the basic syntactic features of all tags occurring in a MECS-conforming document are directly deducible from their code delimiters. The deduction can be done without look-ahead, unless the delimiter pairs NCO - NCC and OSO - OSC are identical (cf. 9.1.1, exception to rule no 4).
This holds true for all MECS-conforming documents, whether the system to which they conform is partial or complete, whether it is restricted or unrestricted, and whether it is reduced, reducible, or irreducible.
The CDT thus deducible from the document is called the document's minimal CDT. Any particular MECS-conforming document has one and only one minimal CDT 12.
It is therefore recommended that all MECS documents contain a header. Some examples follow below.
Document (24) has the following minimal CDT:
(25)
£ < > < / / > [ / | \ / | / ] { " \ }
T
abeghilnrstuv
0123456789
.
()=
abst
12
n o # p r m
n s
o #
o a
o b
p a
p b
2 s
3 t
r a
r b
m a
The following document
_ [ ] [ | _ ] _ _ | _ _ _ _ _ _ _ _ _
This document contains no-element [NO] and [ONE| one-element ] codes, and [|comments|]. No more. Its minimal CDT will define a system which is partial, restricted (since there is no one-element end tag open delimiter), and reduced (for the same reason). This document contains no-element [NO] and [ONE| one-element ] codes, and [|comments|]. No fun.
has the following minimal CDT
_ [ ] [ | _ ] _ _ | _ _ _ _ _ _ _ _ _
CDINT
acdefghilmnoprstuwyz
_
,.
()-
_
ENO
n o _ _ _ _
n NO
o ONE
2.12 SGML Compatibility
2.12.1 Some general observations
As a first and very rough approximation, it may be said that:
1. SGML documents are MECS-conforming, provided they do not make use of tag minimization or end tag omission 13.
2. Some MECS documents are well-formed SGML documents, others may easily be converted to SGML, yet others may only be converted to SGML with a certain distortion or loss of information.
MECS no-element codes correspond to SGML empty elements (so-called milestones). MECS one-element codes correspond to SGML elements. MECS character representation codes correspond roughly to SGML internal entities. There is nothing in SGML which corresponds directly to the MECS poly-element codes, N-element codes and character disambiguation codes. MECS-aware software will accept, but ignore, SGML attributes and declarations, and interpret SGML entity references differently from SGML applications.
MECS markup reduction rules differ from SGML markup reduction or minimization rules.
Functionally, a MECS document, if stripped of its optional MECS header, corresponds to the SGML document instance. But while the MECS CDT corresponds roughly to the SGML Document Type Definition (DTD), there are fundamental differences between MECS CDTs and SGML DTDs.
The following features of MECS are exceptions and deviations from the main outline of the basic code syntax which have been made in order to enhance SGML compatibility.
An SGML document is a MECS-conforming document if no tag minimization or end tag omission has been used 14.
However, MECS software will interpret attributes, declarations and entity references differently from the way they are interpreted by an SGML application. In MECS, SGML attributes will be regarded as attribute strings and ignored. All SGML declarations, including the DTD as well as marked sections and comments, will be regarded simply as MECS comments and thus also ignored. SGML entities will be interpreted as character representation codes.
It is possible to define MECS code delimiters so that they agree closely with the corresponding parts of the SGML concrete reference syntax (cf. 9.6, example 4.)
For example, the following SGML document 15
<!DOCTYPE TEI.1 SYSTEM "c:\tei\public\tei1.dtd" [
<!ENTITY tla "Three Letter ACROnym">
<!ELEMENT my.tag - - (#PCDATA)>
<!-- following line added by C.H. -->
<!ELEMENT my.stone - o EMPTY>
<!-- any other special-purpose declarations or
re-definitions go in here -->
]>
<tei.1>
This is an instance of a modified TEI.1 type document,
which may contain <my.tag>my special tags</my.tag>,
<!-- following line added by C.H. -->
including milestones such as <my.stone>,
and references to my usual entities such as &tla;.
</tei.1>
is a MECS-conforming document, with the following minimal CDT
£ < > < > </ > [ £ ! £ £ £ £ ] & £ £ ;
EIT
acdefghilmnoprstuwy
1
,.
£
aegilmnosty
.1
n o £ £ r £
n my.stone
o my.tag
o tei.1
r tla
2.12.3 From MECS to SGML
A MECS document is a well-formed SGML document provided that:
The conversion of MECS documents not satisfying conditions 1 to 7 to SGML-conforming document instances is a straightforward process, provided they satisfy condition 8.
The conversion of MECS documents not satisfying condition 8 to SGML-conforming documents is likely to be a rather complicated process, and may lead to distortion or loss of information. There are two ways in which such documents can be converted: 1) either all occurrences of overlap can be eliminated (cf. Part II, ##); 2) or one has to identify sets of codes in the document which do not overlap, and define concurrent DTDs for each of these sets.
SGML applications are likely to interpret attribute strings and character disambiguation codes differently from the way they are interpreted by MECS software.
2.13 Revision History
A preliminary version of MECS was drafted in February 1990 16. Version 1.00 was finished in February 1991 17. Version 1.01, of June 1992 18, consisted in a slight revision of the CDT format. The revision did not necessitate changes to version 1.00 documents.
Version 2.00, which was finished in August 1993 19, includes minor changes both in the CDT format, the MECS header, and the basic syntax of one of the code types, i.e. the N-element codes. Therefore, transition from earlier versions to version 2.00 necessitates changes to CDTs and may also require changes to MECS documents encoded according to these versions.
2.14 Plans for MECS Version 3
MECS version 3 will represent a simplification of the structures already present in earlier versions. At the same time, version 3 will offer new capabilities and new and more powerful mechanisms.
MECS version 2 no-element codes, one-element codes and N-element codes have one thing in common: their number of elements is fixed. In version 3, therefore, they will all be subsumed under one category, which will be called N-element codes. Poly-element codes will be retained, with the modification that they may contain any number of elements including 0 or 1 (i.e., the number of elements no longer has to be higher than 1).
Character representation codes and character disambiguation codes will be retained.
The four remaining code types of version 3 may be exemplified as follows, in default notation:
                      Full markup                   Reduced markup
N-element codes:
                      <tag>
                      <tag/ ... /tag>               <tag/ ... >
                      <tag/ ... /tag| ... /tag>     <tag/ ... | ... >
Poly-element codes:
                      <tag_0>
                      <tag_1/ ... /tag>             <tag< ... >
                      <tag_2/ ... /tag| ... /tag>   <tag| ... | ... >
Character codes:
{tag}
{tag_tag}
The character code close delimiter may be left out if immediately followed
by a string delimiter or a reserved code delimiter, as follows:
{tag} {tag
{tag}< {tag<
{tag}/ {tag/
{tag}| {tag|
{tag}> {tag>
{tag}{ {tag{
Inclusion of mechanisms similar to those of SGML external entities will
be considered.
Comments and marked sections
Comments will be similar to version 2 comments, but the syntax will be changed so as to facilitate processing of SGML documents:
<|-- ... --|>
Inclusion of mechanisms similar to the SGML marked sections (with keywords IGNORE, CDATA, RCDATA and INCLUDE) will be considered:
<|[IGNORE[ ... ]]|>
<|[CDATA[[ ... ]]|>
<|[RCDATA[ ... ]]|>
<|[INCLUDE[ ... ]]|>
Attributes
In earlier versions, a tag consists of a generic identifier which may be followed by an attribute string. The attribute strings play no role in the earlier versions, except to increase compatibility with SGML by leaving a space open for SGML attributes. Version 3 will follow up this strategy by incorporating most or all the syntactical features of SGML attributes.
In addition, a syntax for structured attributes proposed by Peter Cripps will be considered for inclusion in MECS (cf. Cripps 1996).
Discontinuation
An element opened by a start tag or delimiter tag may be discontinued by
_tag|
and then resumed again by
|tag_
e.g. like this:
<tag/ ... _tag| --- |tag_ ... /tag>
(In the example, the element indicated by '---' does not belong to the code's coded elements.)
Overlap
In version 3 codes of all types may overlap codes of any type (whereas in version 2 multi-element codes cannot overlap each other).
One problem with earlier versions is that tokens of the same code type cannot overlap:
<s/ <s/ /s> /s>
will necessarily be interpreted as two hierarchically nested codes. In MECS version 3, a tag may include a special code token identifier which serves to overcome this limitation, e.g. as follows:
<s #1/ <s #2/ /s #1> /s #2>
Document Structure
In version 3, there will be no restrictions on combinations of codes whatsoever: all codes may nest arbitrarily deep and codes of all types may overlap with each other.
Master Documents
As a result of the changes described above the MECS header format will be simplified.
As in earlier versions, the syntactical role of every tag can be deduced directly from its delimiters. If a document includes a MECS header it will therefore still be possible to deduce a document's entire code syntax, including its code inventory, from the encoded document itself.
Unlike earlier versions, however, the formal specification of the code system will not be contained in a Code Declaration Table (CDT), but in a Master Document, which is itself a well-formed MECS document. Correspondingly, the master document deducible from any well-formed MECS document is called its Minimal Master Document. It also follows that any Minimal Master Document is its own Minimal Master Document.
In addition, Format Master Documents will allow for the inclusion of element format declarations. A Format Master Document specifies for each of the codes in a code system whether its coded element should correspond to some specific format such as free text, numeric characters only, a date in some standard format, a closed list of string values, and so on.
As with version 2, all SGML documents will be formally MECS-conforming documents. However, the functional compatibility of version 3 with SGML will be improved.
This User Guide is meant as a help to a quick start for use of the program package. It does not cover all aspects or details of the programs. For more detailed technical information, cf. 3 below. Some knowledge of the basic MECS syntax is presupposed - cf. Part I section 1 for a brief introduction.
Peter Cripps of the Wittgenstein Archives has written a menu-driven user interface integrating all aspects of the MECS Program Package. This user interface will be documented separately and made available later.
3.1 Installation and System Requirements
All programs in the package run on IBM PCs with DOS version 3.x or later, and compatibles.
Users will normally receive a copy of the package on a floppy disk or
as a zip archive containing a directory called MECS. To install the package,
copy all files to a separate directory on your hard disk, e.g. 'C:\MECS',
and add its full path name to your path string, e.g. by adding 'C:\MECS'
to the PATH command in your AUTOEXEC.BAT file as follows:
PATH=C:\;C:\DOS;...;C:\MECS
In all examples given below it will be assumed that this installation
procedure has been followed.
The package occupies less than 700 Kb of disk space. A hard disk is recommended, although the package will also run on a floppy disk system. Memory requirements depend on the size of your documents. It is possible to run the programs with less than 200 Kb available memory, but in most cases you will need more. The programs do not make use of extended or expanded memory.
3.2 A Note to SGML Users
Users with no intention to use the MECS Program Package for processing of SGML documents or conversion of MECS documents to SGML may skip this section.
Users who do have such intentions will find the rest of this User Guide to be of help as an introduction to the MECS Program Package, even though the examples discussed here are not SGML examples.
Roughly, all SGML documents are MECS-conforming, and all documents created or modified in MECS can be converted to SGML.
However, this requires some qualification: SGML documents are MECS-conforming only provided that they do not make use of tag minimization or end tag omission 20. The MECS Program Package provides tools which ensure that any document you create in MECS either is or can be converted to an SGML-conforming document instance. The Program Package also allows you to take steps to ensure that such conversion leads to no loss or distortion of information.
You can test your SGML documents for MECS conformance with the program MECSVAL. SGML documents are MECS-conforming in virtue of certain exceptions and deviations from the main outline of the basic syntax of MECS which have been made precisely in order to enhance SGML compatibility (cf. Part I, ##) 21.
MECSVAL is the only program in the MECS Program Package which takes all of these exceptions and deviations fully into consideration 22.
Therefore, if you intend to do any serious work at all with SGML documents by means of the MECS Program Package, it is highly recommended that you first use the program SGMLMECS to convert them to a format accepted also by the other MECS programs. You may convert your documents back to SGML again with the program MECSSGML.
SGML has features and capabilities which MECS does not have, and vice versa. But while MECS 'knows' at least something about SGML, SGML does not 'know about' MECS at all. Features and capabilities of MECS which are not shared by SGML may create SGML syntax errors. MECS, on the other hand, is designed simply to accept and ignore those features of SGML which it does not share with MECS.
If you want to use MECS primarily as a tool to process SGML documents, you should be aware that certain features of SGML, though accepted, are not supported by MECS.
If you use MECS to create SGML documents, or want to be able to convert your MECS documents to SGML, you should either avoid MECS features which have no counterpart in SGML or be aware of the consequences of using them.
3.3 Creating and Validating Documents and CDTs
To create your first MECS document, type the command
MECSVAL
at the DOS prompt and press return. The following menu will be displayed
in the upper part of the screen:
+------------------------------------------------------------+
|C:\MYDIR           Mem: 433018          MECSVAL version 2.01|
+------------------------------------------------------------+
|L LOG:              I Info                1 List directory  |
|C CDT:              S Switches            2 Change directory|
|T TXT:              M Create Minimal CDT  3 Copy file       |
|E EDT:              D Check CDT           4 Print file      |
|Q Quit              V Check CDT and TXT   5 Delete file     |
+------------------------------------------------------------+
Press 'E', and MECSVAL will prompt you for a file name. Type a file name, e.g. 'DOC1', and press Enter to activate the MECSVAL editor.
The first thing you need to do is to include a MECS header at the very beginning of your document. We will assume that you intend to use the MECS default delimiters. To save yourself some typing, you may include the default MECS header by pressing Ctrl+K, then R. When prompted for a file name, type 'C:\MECS\HEADMECS' (assuming that you installed the MECS Program Package in a directory called 'C:\MECS'), and press Enter. The top of your screen will now look like this:
+------------------------------------------------------------+
|+----------------------------------------------------------+|
||DOC1.     Line 1    Col 1    Byte 1    Insert Indent  Save||
|+----------------------------------------------------------+|
| £ < > < / / > [ / | \ / | / ] { " \ }                      |
|                                                            |
|                                                            |
Go to the line below the header and include the file C:\MECS\EX1 (or, alternatively,
type the text below in). Your document should now look like this:
£ < > < / / > [ / | \ / | / ] { " \ }
<|From EX1: |>
[dmi\0|6]
<paragraph/<title/Sample MECS Document>>>
<intro/<paragraph/<indent/3>This is a sample
<b/MECS> document which is intended to demonstrate
the use of currently available <b/MECS>
software./paragraph>/intro>
Press F2 to store the text and exit the editor. Note that on the main menu
DOC1 is now indicated as the current editor (EDT) file. If you need to
review DOC1 again before proceeding, press 'E' and then Enter. To exit
the editor and save, press F2. To exit without saving, press Ctrl+K, then
Q.
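Among the checks applied to a text is that every one-element end tag closes a code which is still active. The following Python sketch illustrates the idea on the subset of MECS used in DOC1. It is not the MECSVAL implementation: the function name, the token pattern and the rule that a bare '>' closes the most recently opened code are assumptions made for illustration, and the first line is skipped on the assumption that it is the MECS header.

```python
import re

# Token pattern for the small subset of MECS used in DOC1 (an assumption, not
# the full syntax): comments, start tags, named end tags and bare end tags.
TOKEN = re.compile(r"<\|.*?\|>|<(\w+)/|/(\w+)>|(>)")


def check_one_element_codes(text):
    """Return (line, column, message) for each end tag with no code to close."""
    errors = []
    active = []  # generic identifiers of the codes that are currently open
    for line_no, line in enumerate(text.splitlines(), start=1):
        if line_no == 1:
            continue  # assumed to be the MECS header, which lists the delimiters
        for match in TOKEN.finditer(line):
            start_gi, end_gi, bare = match.groups()
            column = match.start() + 1
            if start_gi:
                active.append(start_gi)
            elif end_gi:
                if end_gi in active:
                    # Close the most recent code of that name; elements may overlap.
                    del active[len(active) - 1 - active[::-1].index(end_gi)]
                else:
                    # Message wording is illustrative only.
                    errors.append((line_no, column, "No active code named " + end_gi))
            elif bare:
                if active:
                    active.pop()  # assumption: '>' closes the most recent code
                else:
                    errors.append((line_no, column, "No one-element code active"))
    return errors


DOC1 = r'''£ < > < / / > [ / | \ / | / ] { " \ }
<|From EX1: |>
[dmi\0|6]
<paragraph/<title/Sample MECS Document>>>
<intro/<paragraph/<indent/3>This is a sample
<b/MECS> document which is intended to demonstrate
the use of currently available <b/MECS>
software./paragraph>/intro>'''

print(check_one_element_codes(DOC1))
# [(4, 41, 'No one-element code active')]
```

Run on DOC1 as typed above, the sketch reports one problem on line 4, at the third '>' after 'Document'. Once that '>' is deleted it reports none.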
You need to check your text for coding errors, but you have not yet created any Code Declaration Table (CDT). You may do both these things in one operation: press 'M' and type 'DOC1' when prompted for a text (TXT) file name. Because the text contained an error, you will get an error message and an indication of the line and column number where the error was detected:
+------------------------------------------------------------+
|C:\MYDIR           Mem: 433018          MECSVAL version 2.01|
+------------------------------------------------------------+
|L LOG:              I Info                1 List directory  |
|C CDT:              S Switches            2 Change directory|
|T TXT:              M Create Minimal CDT  3 Copy file       |
|E EDT: DOC1         D Check CDT           4 Print file      |
|Q Quit              V Check CDT and TXT   5 Delete file     |
+------------------------------------------------------------+
|Text file: C:\MYDIR\DOC1                                    |
|Writing code declaration table: DOC1.CDT                    |
|            Report from MECSVAL 25.8.1994, 22:30            |
|                                                            |
|                                                            |
|                                                            |
|Errors in DOC1:                                             |
|                                                            |
|    3 [dmi\0|6]                                             |
|    4 <paragraph/<title/Sample MECS Document>>>             |
|                                              ^             |
|Error 68: No one-element code active                        |
|                                                            |
|1 errors encountered in DOC1                                |
|WARNING: CDT file may contain ERRORS                        |
|Press Q to quit, any key to edit                            |
+------------------------------------------------------------+
The error message 'No one-element code active' indicates that you have included a superfluous one-element end tag close delimiter, i.e. one '>' too many. Press any key (except 'Q'), and the editor will be activated with the cursor positioned at the exact location of the error. Correct the error (by deleting the superfluous '>'), exit and save by pressing F2. Repeat this process until you get the message 'No errors' on pressing 'M' at the main menu. At this stage, your screen should look like this:
+------------------------------------------------------------+
|C:\MYDIR           Mem: 430456          MECSVAL version 2.01|
+------------------------------------------------------------+
|L LOG:              I Info                1 List directory  |
|C CDT: DOC1.CDT     S Switches            2 Change directory|
|T TXT: DOC1         M Create Minimal CDT  3 Copy file       |
|E EDT: DOC1         D Check CDT           4 Print file      |
|Q Quit              V Check CDT and TXT   5 Delete file     |
+------------------------------------------------------------+
|Text file: C:\MYDIR\DOC1                                    |
|Writing code declaration table: DOC1.CDT                    |
|            Report from MECSVAL 25.8.1994, 22:30            |
|                                                            |
|                                                            |
|                                                            |
|Errors in DOC1:                                             |
|                                                            |
|No errors encountered in text file DOC1                     |
|                                                            |
|                                                            |
|                                                            |
|                                                            |
|                                                            |
|                                                            |
|                                                            |
+------------------------------------------------------------+
You have now created a (minimal) CDT called DOC1.CDT on the basis of DOC1. Review your minimal CDT by pressing 'E' and entering 'DOC1.CDT'. It will look like this:
£ < > < / / > [ / | \ / | / ] { " \ }
CDEMST
abcdefhilmnoprstuvwy
036
.
£
abdeghilmnoprt
£
n o £ p r m
o b
o indent
o intro
o paragraph
o title
2 dmi
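The way such a minimal CDT follows from the document can be sketched in Python. The sketch below is an illustration only, not the procedure used by MECSVAL: it handles just the constructs that occur in DOC1 (comments, one-element codes and the dmi code), and it assumes that the number in front of dmi is the number of elements in the code and that the line following the first '£' lists the characters used in generic identifiers. All function and variable names are hypothetical.

```python
import re

# Patterns for the constructs that occur in DOC1 only (a simplification).
COMMENT = re.compile(r"<\|.*?\|>", re.S)
N_ELEMENT = re.compile(r"\[(\w+)\\([^\]]*)\]")   # e.g. [dmi\0|6]
START_TAG = re.compile(r"<(\w+)/")               # e.g. <title/
END_TAG = re.compile(r"/\w+>|>")                 # e.g. /intro> or a bare >


def minimal_declarations(body):
    """Collect free characters, tag characters and codes used in `body`.

    `body` is the document text without its header line.
    """
    one_element = set()
    n_element = {}  # generic identifier -> number of elements (assumed meaning)

    text = COMMENT.sub("", body)  # comments contribute nothing

    def note_n_element(match):
        elements = match.group(2).split("|")
        n_element[match.group(1)] = len(elements)
        return "".join(elements)  # element contents count as free characters

    text = N_ELEMENT.sub(note_n_element, text)

    def note_start_tag(match):
        one_element.add(match.group(1))
        return ""

    text = START_TAG.sub(note_start_tag, text)
    text = END_TAG.sub("", text)

    free = {ch for ch in text if not ch.isspace()}
    groups = {
        "upper": "".join(sorted(ch for ch in free if ch.isupper())),
        "lower": "".join(sorted(ch for ch in free if ch.islower())),
        "digits": "".join(sorted(ch for ch in free if ch.isdigit())),
        "other": "".join(sorted(ch for ch in free if not ch.isalnum())),
    }
    tag_chars = "".join(sorted(set("".join(one_element) + "".join(n_element))))
    return groups, tag_chars, sorted(one_element), n_element


# DOC1 after the superfluous '>' has been deleted, without the header line.
DOC1_BODY = r'''<|From EX1: |>
[dmi\0|6]
<paragraph/<title/Sample MECS Document>>
<intro/<paragraph/<indent/3>This is a sample
<b/MECS> document which is intended to demonstrate
the use of currently available <b/MECS>
software./paragraph>/intro>'''

groups, tag_chars, one_element, n_element = minimal_declarations(DOC1_BODY)
print(groups)
print(tag_chars, one_element, n_element)
```

For the corrected DOC1 this gives the free characters CDEMST, abcdefhilmnoprstuvwy, 036 and '.', the tag characters abdeghilmnoprt, the five one-element codes and dmi with two elements, in agreement with DOC1.CDT.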
As you can see, all the free characters and codes you used in DOC1 have
been declared. Exit by pressing Ctrl+K, then Q. Assuming that you conclude
from this inspection that you need to declare additional characters and
codes used in the rest of this example, it is suggested that you extend
the minimal CDT. An example CDT is supplied with the Program Package under
the file name EX.CDT, so in this case you may save some typing by simply
copying C:\MECS\EX.CDT. Normally, however, you would have to work some
other way to create your extended CDT, e.g.: press '3' on the main menu
and copy DOC1.CDT to a file called EX.CDT, and then edit EX.CDT to suit.
The example EX.CDT looks like this:
£ < > < / / > [ / | \ / | / ] { " \ }
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890
,;.:-()!?"'
*%&=+ß
abcdefghijklmnopqrstuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890._-
n o # p r d
o b
o indent
o intro
o paragraph
o title
2 dmi
o REF
n ind
n l
o example
o i
o note
o s
o u
p s
r reverse_E
r reverse_A
d exist
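Each declaration in the lower part of the CDT pairs a code type with a generic identifier. As a rough illustration of one check that can be made on such lines, the Python sketch below tests whether every identifier consists only of the characters listed on the seventh and eighth lines of EX.CDT. That these two lines declare the permitted tag characters is an assumption, as are the function name and the way the column is counted; this is not MECSVAL's own code.

```python
import string

# Assumption: the seventh and eighth lines of EX.CDT list the characters
# that may appear in a generic identifier.
TAG_CHARACTERS = set(string.ascii_letters + string.digits + "._-")


def check_declarations(lines, first_line_no=1):
    """Return (line, column, message) for identifiers with undeclared characters."""
    errors = []
    for offset, line in enumerate(lines):
        code_type, _, identifier = line.partition(" ")
        for pos, ch in enumerate(identifier):
            if ch not in TAG_CHARACTERS:
                # Column counted from 1, including the code type and the space.
                column = len(code_type) + 1 + pos + 1
                errors.append((first_line_no + offset, column,
                               "Illegal character in generic identifier"))
                break  # report only the first offending character per line
    return errors


# The declarations of EX.CDT (which start on its line 10), with a stray '/'
# added after 'note' for the sake of the example.
DECLARATIONS = [
    "o b", "o indent", "o intro", "o paragraph", "o title", "2 dmi",
    "o REF", "n ind", "n l", "o example", "o i", "o note/", "o s", "o u",
    "p s", "r reverse_E", "r reverse_A", "d exist",
]

print(check_declarations(DECLARATIONS, first_line_no=10))
# [(21, 7, 'Illegal character in generic identifier')]
```

With a stray '/' added to the declaration of note, the sketch flags line 21 of the CDT; with the declarations exactly as listed above it flags nothing.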
Press F2 to store and exit EX.CDT. You may check EX.CDT for errors by pressing
'C' at the main menu, entering 'EX.CDT' when prompted for a file name.
Then press 'D'. If errors are encountered, an error message will be displayed
in the lower part of the screen:
+------------------------------------------------------------+
|C:\MYDIR           Mem: 430456          MECSVAL version 2.01|
+------------------------------------------------------------+
|L LOG:              I Info                1 List directory  |
|C CDT: EX.CDT       S Switches            2 Change directory|
|T TXT: DOC1         M Create Minimal CDT  3 Copy file       |
|E EDT: EX.CDT       D Check CDT           4 Print file      |
|Q Quit              V Check CDT and TXT   5 Delete file     |
+------------------------------------------------------------+
|Reading code declaration table: EX.CDT                      |
|Text file: C:\MYDIR\DOC1                                    |
|            Report from MECSVAL 25.8.1994, 22:33            |
|                                                            |
|                                                            |
|   20 o i                                                   |
|   21 o note/                                               |
|            ^                                               |
|Error 39: Illegal character in generic identifier           |
|                                                            |
|1 errors encountered in code declaration table EX.CDT       |
|Press Q to quit, any key to edit                            |
|                                                            |
|                                                            |
|                                                            |
|                                                            |
+------------------------------------------------------------+
In this case, you had mistakenly included the tag close delimiter in the declaration of a generic identifier. Press any key (except 'Q'), and the editor will be activated with the cursor positioned at the location in the file where the error was detected. Correct the error, store and exit, and press 'D' at the main menu again. Repeat this process until you get no error messages.
Edit DOC1 again and add the following text (by typing it in, or by including the file C:\MECS\EX2) 23:
<|From EX2: |>
<paragraph/<s/We <s/will see <s/some examples>
of recursive codes, of <b/elements <u/which/b>
overlap/u>, of/s> special characters<ind> like
{reverse_A}, {reverse_E}, {reverse_E\exist},
and {"E"\exist}, and of<i> substitutions in
<note/simplified/note>/s> MECS-WIT style:/paragraph>
<exmple/<paragraph/
<ind><s/Ich besuche gern das alte <i/kleine>
[s|Schloß|<i/Haus>] meines [s|Onkels.|<i/Vaters.>]/s>
<ind><s/Ich besuche gern das
[s|alte Schloß meines Onkels|<i/kleine Haus> meines
<i/Vaters>]/s>/paragraph>
/example>
<paragraph/This is the end of our
<note/very artificial> example./paragraph>
Press F2 to store and exit DOC1. Instead of creating yet another minimal
CDT on the basis of the new version of DOC1, you may check it against EX.CDT
by pressing 'V' on the main menu. An error message will be displayed in
the lower part of the screen:
+------------------------------------------------------------+
|C:\MYDIR           Mem: 433018          MECSVAL version 2.01|
+------------------------------------------------------------+
|L LOG:              I Info                1 List directory  |
|C CDT: EX.CDT       S Switches            2 Change directory|
|T TXT: DOC1         M Create Minimal CDT  3 Copy file       |
|E EDT: DOC1         D Check CDT           4 Print file      |
|Q Quit              V Check CDT and TXT   5 Delete file     |
+------------------------------------------------------------+
|Reading code declaration table: EX.CDT                      |
|Text file: C:\MYDIR\DOC1                                    |
|            Report from MECSVAL 25.8.1994, 22:34            |
| &nbs