[Mirrored from: http://www.sgmlopen.org/sgml/docs/author.htm]

Concepts Behind SGML-aware Authoring

The SGML standard, ISO 8879:1986, defines a meta-language for defining markup languages. The standard primarily addresses the issue of the syntactic interpretation of a complete, valid SGML document by a conforming SGML parser. It does not define how an application, such as an authoring tool, should behave. Authoring requires the solution of various problems that do not exist when only the parsing of complete documents is considered.

Authoring in SGML

The SGML standard, ISO 8879:1986, defines an abstract syntax for defining markup languages. Since SGML is really a method for defining many languages--each DTD potentially defines a different markup language--SGML itself is often called a meta-language. The standard primarily addresses the issue of the syntactic interpretation of a complete, valid SGML document by a conforming SGML parser. It does not define how an application, such as an authoring tool, should behave. Authoring requires the solution of various problems that do not exist when only the parsing of complete documents is considered.

There are various ways of authoring SGML including using a standard character-based editor such as vi, emacs, or some word processor in "ASCII" mode to author both the text and the markup; using some sort of conversion program to add SGML markup to existing content (including cases where the existing file may include some sort of non-SGML codes or "markup"); using an SGML-aware editor that provides--automatically or through some non-character based interface--the proper markup; and recreating an SGML document from a repository of existing SGML (or SGML-equivalent) fragments.

Neither the character-based editor nor the conversion approach really addresses the issue of authoring in SGML, and the repository reassembly approach (which is very interesting in its own right) assumes pre-authored SGML. This paper concentrates on the concepts behind non-character based, SGML-aware authoring interfaces, of which there are several implementations on the market today.

The parser and the editing application

It is useful to make a clear distinction between what is commonly called an SGML parser and an SGML application such as an editor. A parser is a program that reads its input and interprets it according to the rules of the language it is written to "recognize." (In the technical sense, "recognize" means to be able to process the syntax of the language and does not imply any understanding of any application-level meaning.) According to ISO 8879, an SGML parser is one that recognizes markup in SGML documents. It forms only part of an SGML system, since a system is incomplete without an application. The parser turns a character stream into tokenized SGML constructs, but only the application can associate meaning--such as formatting specifications--to the constructs. ISO 8879 defines what is valid input to the parser--and, as such, defines what an SGML parser must do--but it does not define what an application must do with what the parser gives it.
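The division of labor between parser and application can be sketched roughly in Python. This is an illustrative toy, not a conforming SGML parser: the names `tokenize`, `render`, and `STYLES`, the event tuples, and the sample markup are all assumptions made for the example, and only fully tagged, attribute-free markup is handled.

```python
import re

# Toy "parser": recognizes start tags, end tags, and character data.
# It knows nothing about what any element means.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)>|([^<]+)")

def tokenize(text):
    """Yield ("start", name), ("end", name), or ("data", chars) events."""
    for match in TOKEN.finditer(text):
        closing, name, data = match.groups()
        if data is not None:
            yield ("data", data)
        elif closing:
            yield ("end", name.lower())
        else:
            yield ("start", name.lower())

# Toy "application": a hypothetical formatter that attaches meaning
# (here, a styling function) to element names.
STYLES = {"title": str.upper, "para": str.strip}

def render(events):
    """Apply the style of the innermost open element to each run of data."""
    stack, lines = [], []
    for kind, value in events:
        if kind == "start":
            stack.append(value)
        elif kind == "end":
            stack.pop()
        elif stack:
            style = STYLES.get(stack[-1], str)
            lines.append(style(value))
    return lines

events = list(tokenize("<doc><title>Intro</title><para> Some text. </para></doc>"))
print(render(events))
```

The tokenizer would produce the same events for any formatter, indexer, or editor; only `render` and its style table are specific to one application.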

When the SGML standard was developed, there were no SGML-aware editors, so it was anticipated that markup might have to be inserted manually. Therefore, a fair amount of effort was put into describing abbreviations that would minimize the effort of explicitly inserting markup. These "minimization" techniques included features such as short tagging, tag omission, short references, and others. The idea was that the program interpreting the SGML file--the parser--would "fill out" the abbreviated forms to produce for the application--the formatter or editor or whatever--the logical equivalent of the unabbreviated form.
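As a rough illustration of "filling out" an abbreviated form, the following Python sketch restores omitted end tags in a small fragment. It is only a toy: real tag omission is governed by the DTD's content models and omission indicators, whereas the hypothetical `CLOSED_BY` table here simply lists which start tags imply the end of an open element. The function name `normalize` and the element names are likewise assumptions.

```python
import re

TAG = re.compile(r"<(/?)([a-z]+)>|([^<]+)")

# Hypothetical rule table: for each element, the start tags that cannot
# occur inside it and therefore imply its end tag.
CLOSED_BY = {"item": {"item"}, "para": {"para", "item"}}

def normalize(text):
    """Return the fragment with every omitted end tag written out."""
    out, stack = [], []
    for closing, name, data in TAG.findall(text):
        if data:
            out.append(data)
        elif closing:
            # Close any still-open descendants, then the named element.
            # (Assumes the end tag names an element that is actually open.)
            while stack and stack[-1] != name:
                out.append(f"</{stack.pop()}>")
            if stack:
                out.append(f"</{stack.pop()}>")
        else:
            # A start tag may imply the end of one or more open elements.
            while stack and name in CLOSED_BY.get(stack[-1], set()):
                out.append(f"</{stack.pop()}>")
            stack.append(name)
            out.append(f"<{name}>")
    while stack:
        out.append(f"</{stack.pop()}>")
    return "".join(out)

print(normalize("<list><item>one<item>two</list>"))
```

The application downstream sees the same structure whether or not the author typed the end tags, which is the sense in which the two forms are logically equivalent.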

In the years since the standard was published, the market has seen the development of various SGML applications, including several non-character based SGML editors, that have obviated the need for most of the minimization techniques. In fact, certain minimization features of SGML may be conceptually incompatible with the basic concepts behind structurally aware, non-character based SGML editors.

The parser itself is often used by an SGML-aware editor to read a document into the editor as well as to perform various editing transformations that require the detailed processing of a parser (such as converting a CDATA marked section into an "included" one). Furthermore, some parsing products have an interface that allows the application to make use of state information that the parser maintains. In this case, it is often said that the parsing product is being used as a context engine by the application. How much of the context checking is done by the parser and how much by the application itself varies from product to product.

A conforming SGML parser is necessary to read arbitrary and possibly minimized SGML. However, if a document is authored in a non-character based SGML editor, or once an SGML document is read into such an editor, many of the SGML "features" are irrelevant because they relate to the character stream representation of an SGML document rather than its logical structure. For example, details of angle brackets and record ends and short references and other minimization are not an issue in an SGML editor where the user is provided with a flexible, configurable, user-friendly interface to allow the editing of "SGML structure" at a level that avoids the details of markup syntax. SGML features such as SHORTREF, OMITTAG, and SHORTTAG are meaningless in an editor where tags may not need to appear in the interface at all and such keyboarding shortcuts are irrelevant.

The SGML-aware editing application

An SGML editor is an application. While it may use a parser as one of its components, it is not just an interface to a parser. A non-character based SGML editor is optimized to author structured documents that represent valid SGML document instances. That is, while such an editor provides an interface that transcends the specifics of markup such as angle brackets, equal signs, and other syntactic details, it should represent the document using internal data structures that are isomorphic to the basic constructs of SGML.
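One way to picture an internal representation that mirrors SGML constructs, rather than the character stream, is a small tree of typed nodes. The Python sketch below is an assumption about how such a structure might look, not a description of any particular product: the `Element` and `Text` classes and the `serialize` function are hypothetical, and attribute values are written out without any escaping.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

@dataclass
class Text:
    chars: str

@dataclass
class Element:
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)
    children: List[Union["Element", Text]] = field(default_factory=list)

def serialize(node):
    """Write a node back out as fully tagged markup (no minimization)."""
    if isinstance(node, Text):
        return node.chars
    # Simplification: attribute values are assumed to contain no quotes.
    attrs = "".join(f' {k}="{v}"' for k, v in node.attributes.items())
    body = "".join(serialize(child) for child in node.children)
    return f"<{node.name}{attrs}>{body}</{node.name}>"

para = Element(
    "para",
    {"id": "p1"},
    [Text("Hello, "), Element("emph", children=[Text("world")])],
)
markup = serialize(para)
print(markup)
```

The editor manipulates elements and attributes directly; angle brackets and equal signs appear only when the tree is exported.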

Another way of explaining this is to consider that there are three levels of "understanding" an SGML document. The lowest level is the primary job of the parser which scans each input character to recognize the markup and to tokenize the various pieces (e.g., to recognize the individual characters in the string as the ASCII representation of the end tag of a "para" element). The highest level is the primary job of the application such as a composition system that determines, for example, how to format a paragraph element or character entity.

The low level can roughly be termed the recognition of SGML syntax, and the high level is the association of application dependent meaning to the structures (otherwise known as the attachment of application semantics). But there is a level of "understanding" between that of SGML syntax and application semantics which can be termed the SGML semantics. When it is said that SGML defines no semantics, what is meant is that SGML has no inherent high level application dependent semantics such as that needed to format a document. But the SGML standard does define, implicitly if not explicitly, what it means to be an element, an attribute, an entity, a marked section, and what it means to have a content model, a declared value, a default. All non-character based editors that accept SGML must convert the SGML-conforming character input stream into an internal representation, but an SGML editor that inherently understands SGML semantics can provide much greater benefit to the end user than an editor--even a structured one--that "imports" and "exports" SGML by converting it into an alternate view that does not maintain a real-time comprehension of and compliance with SGML semantics.

Another issue that is not a concern for a batch parser but that an SGML editor must address is that of the partial or incomplete document. For example, if a DTD specifies that a chapter consists of two or more sections, an editor application must allow the insertion of the chapter element and the first section element even though the document does not conform to the DTD until the second section element is entered. Note that the paradigm of inserting both (empty) section elements automatically doesn't solve the problem in another situation where some required content consists of a choice between two or more options, since the editor cannot automatically determine which choice to make. Furthermore, even in non-ambiguous cases, it is more user-friendly to allow a user to create a temporarily incomplete document while doing operations such as cutting out the second section to paste it in front of the first one.

Note that, for the purposes of this discussion, "incomplete" is distinct from "invalid." For example (unless context rules are turned off) it is not possible with some SGML editors to produce a document that is structurally invalid from an SGML point of view due to inherent, automatic real-time context checking. That is, one cannot insert something into a document that shouldn't be inserted there. However, it is possible to have an incomplete--as opposed to invalid--document in cases where the DTD calls for some required structure that has not yet been edited into the document instance. In such SGML editors, it is common to have a way to request a check of the completeness of the document when the author believes they have finally inserted all the required structure (or automatically, for example, when there is a request to save the document).
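The distinction between "incomplete" and "invalid" can be made concrete with a small content-model checker. The Python sketch below uses regular-expression derivatives, which is a technique chosen here for illustration and not necessarily how any SGML editor works. The class names, the `classify` function, and the sample models are all assumptions; the SGML "&" connector, inclusions, and exclusions are not handled.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Never:            # matches no sequence at all
    pass

@dataclass(frozen=True)
class Empty:            # matches only the empty sequence
    pass

@dataclass(frozen=True)
class Name:             # matches exactly one element with this name
    name: str

@dataclass(frozen=True)
class Seq:              # first followed by rest (the "," connector)
    first: object
    rest: object

@dataclass(frozen=True)
class Alt:              # either left or right (the "|" connector)
    left: object
    right: object

@dataclass(frozen=True)
class Star:             # zero or more repetitions (the "*" indicator)
    inner: object

def nullable(m):
    """True if the model accepts the empty sequence (nothing more required)."""
    if isinstance(m, (Empty, Star)):
        return True
    if isinstance(m, Seq):
        return nullable(m.first) and nullable(m.rest)
    if isinstance(m, Alt):
        return nullable(m.left) or nullable(m.right)
    return False

def derive(m, name):
    """Return the model for what may follow after seeing element `name`."""
    if isinstance(m, Name):
        return Empty() if m.name == name else Never()
    if isinstance(m, Seq):
        step = Seq(derive(m.first, name), m.rest)
        return Alt(step, derive(m.rest, name)) if nullable(m.first) else step
    if isinstance(m, Alt):
        return Alt(derive(m.left, name), derive(m.right, name))
    if isinstance(m, Star):
        return Seq(derive(m.inner, name), m)
    return Never()  # Never and Empty accept no further elements

def dead(m):
    """True if no continuation can ever satisfy the model."""
    if isinstance(m, Never):
        return True
    if isinstance(m, Seq):
        return dead(m.first) or dead(m.rest)
    if isinstance(m, Alt):
        return dead(m.left) and dead(m.right)
    return False

def classify(model, children):
    """Classify a child sequence as "invalid", "incomplete", or "complete"."""
    for child in children:
        model = derive(model, child)
        if dead(model):
            return "invalid"
    return "complete" if nullable(model) else "incomplete"

section = Name("section")
chapter = Seq(section, Seq(section, Star(section)))   # (section, section+)

print(classify(chapter, ["section"]))              # one section so far
print(classify(chapter, ["section", "section"]))   # two sections
print(classify(chapter, ["section", "para"]))      # para is not allowed
```

An editor built along these lines could refuse any insertion that makes the model dead, while tolerating states that are merely not yet nullable until the author asks for a completeness check or saves the document.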

Measuring an SGML editor's conformance

Clause 15 and Annex G of ISO 8879 define various parts of conformance to the standard. However, conformance is mostly specified in terms of the parsing component. For example, the conformance classification code (also called the "FSV" since it indicates a conformance level in the areas of features, syntax, and validation) and System declaration are more relevant for an SGML parser than for an SGML application such as an editor.

Conformance for an editor application is often measured by examining what it can import and export. Of course, it should be able to read a wide range of valid SGML and produce valid SGML for output. But also important is what it does to guide the author in creating the SGML document and how the SGML editor ensures the validity of the document. One that allows invalidity that can only be discovered at export (or another batch verification phase) will make the authoring of valid SGML documents much more difficult than one that constantly maintains validity and provides real-time contextual aids to authoring.

Several SGML products support arbitrary SGML declarations. By modifying the various parts of the SGML declaration, different features and syntaxes may be supported by the authoring system. An SGML system is conforming if it properly handles all documents that conform to the given SGML declaration. Some systems can properly handle a broader scope of SGML declarations than others. The SGML standard defines a "System declaration" that identifies the set of SGML declarations that the given system can handle.

Any full explanation of conformance must necessarily involve the concept of ESIS (Element Structure Information Set) defined in the SGML Conformance Standard (and listed in Appendix B of The SGML Handbook). In general, ESIS is the minimum set of information that must be passed along by a conforming SGML parser. This includes such information as element boundaries, attributes and their values, and the location of external and SDATA entity references. The SGML Conformance Standard uses the concept of ESIS to define conformance for a parser. However, the definition of ESIS is not inclusive enough to describe all that an SGML editor must do. For example, the set of ESIS information passed on to an application by a parser does not include such things as comments, ignored marked sections, or--for internal general text entities--any indication that the content was the replacement text of an entity reference. Yet, a recognition and understanding of all these constructs should be part of a complete SGML authoring product.

Handling non-ESIS constructs

The way ESIS was defined is most compatible with the concept of batch parsing rather than SGML authoring. The key point in relating ESIS with the practicalities of SGML authoring is that, to be of greatest practicality and usefulness, an SGML editor must be aware of more than just ESIS.

In addition to the information passed along by a parser as defined in ISO 8879, an SGML editor must maintain some further set of information about the SGML markup in an SGML document instance. For example, some SGML editors maintain comments, general text entities, and the location of ignored marked sections. ESIS does not require maintenance of this information, and so not all SGML conforming systems will handle these constructs. We can call the set of logical information considered by an SGML editor "ESIS+." Note that the set of information chosen to be maintained and edited by a given system will vary from SGML system to SGML system. It will almost certainly never represent the universe of all possible pieces of markup information found in an SGML document (this has been called MSIS, or Markup Sensitive Information Set, by some SGML experts).
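The relationship between ESIS and an editor's "ESIS+" can be sketched as a projection: the editor keeps a richer node list, and the ESIS view is what remains after the extra constructs are dropped or flattened. The Python below is a simplified illustration under stated assumptions. The `Node` class, the kind names, and `esis_view` are hypothetical, and real ESIS also carries attributes, external entity references, and more.

```python
from dataclasses import dataclass

@dataclass
class Node:
    kind: str       # "start", "end", "data", "comment", "ignored", "entity"
    value: str      # element name, text, or entity name
    text: str = ""  # replacement text, used only when kind == "entity"

# Kinds that this hypothetical ESIS+ model keeps but ESIS does not report.
ESIS_PLUS_ONLY = {"comment", "ignored"}

def esis_view(nodes):
    """Project an ESIS+ node list down to roughly what a parser reports."""
    view = []
    for node in nodes:
        if node.kind in ESIS_PLUS_ONLY:
            continue  # comments and ignored marked sections are dropped
        if node.kind == "entity":
            # Only the replacement text survives, not the reference itself.
            view.append(("data", node.text))
        else:
            view.append((node.kind, node.value))
    return view

document = [
    Node("start", "para"),
    Node("comment", " check this wording "),
    Node("data", "Made by "),
    Node("entity", "company", "Example Corp"),
    Node("ignored", "draft-only text"),
    Node("end", "para"),
]

print(esis_view(document))
```

An editor that stored only the projected view could not write the comment, the entity reference, or the ignored section back out on export, which is why the richer list is the one worth maintaining.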

A non-character based SGML editor does not need to provide authoring interfaces to all the details of MSIS since the non-character based interface raises the end user above the level of SGML syntax. However, the set of SGML semantics that a complete SGML authoring system understands should go beyond ESIS. In general, an SGML editor's definition of ESIS+ guides what information is maintained when an SGML marked-up document is brought into the SGML system as well as what possible information can be represented in an SGML marked-up document produced by the system. Furthermore, the SGML editor should provide a user interface to most of the ESIS and ESIS+ level features of SGML. Some of the non-ESIS features to which an authoring system might want to provide an interface include SGML comments, the structure of internal general text entities, marked sections, and subdocuments (Subdoc).

The interface to handling comments and internal general text entities is straightforward (though the implementation has its subtleties). Ignored marked sections can have an interface much like comments, but marked sections in general are more complex. A complete implementation that allows for the switching of marked sections from ignored to included and that handles the marked section status keyword specifications properly even when specified using a parameter entity reference requires a more sophisticated interface. It must allow for marked sections with unbalanced markup (e.g., one that includes the start tag of an element but not its end tag) and for changing the values of parameter entities so that some marked sections switch from ignored to included while others switch from included to ignored while making sure that the final result is still valid SGML even though the intermediate state (after one status is changed but before the other is) may be invalid. A subdocument is basically an external SGML entity with its own DTD. The authoring interface to a subdoc can be similar to that for a regular external SGML entity, but there are issues such as the different ID name space, a different entity name space, and the need for a potentially different presentation.
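The effect of changing parameter entity values on marked section status can be sketched as follows. This Python fragment is a simplified illustration: the function `effective_status`, the entity names, and the section names are assumptions, nested parameter entities are not expanded, and the priority order used (IGNORE over CDATA over RCDATA over INCLUDE) reflects the usual reading of ISO 8879 and should be checked against the standard.

```python
# Assumed priority when several status keywords apply to one marked section.
PRIORITY = ["IGNORE", "CDATA", "RCDATA", "INCLUDE"]

def effective_status(spec, parameter_entities):
    """Resolve a status keyword specification such as '%draft; TEMP'."""
    keywords = []
    for token in spec.split():
        if token.startswith("%") and token.endswith(";"):
            # Expand a parameter entity reference into its keyword(s).
            keywords.extend(parameter_entities[token[1:-1]].upper().split())
        else:
            keywords.append(token.upper())
    for status in PRIORITY:
        if status in keywords:
            return status
    return "INCLUDE"  # no status keyword given (TEMP alone changes nothing)

# Hypothetical document: two marked sections keyed on two parameter entities.
entities = {"draft": "INCLUDE", "final": "IGNORE"}
sections = {"draft-notes": "%draft;", "release-text": "%final;"}

def statuses():
    return {name: effective_status(spec, entities)
            for name, spec in sections.items()}

before = statuses()
entities["draft"] = "IGNORE"    # intermediate state: both sections ignored
middle = statuses()
entities["final"] = "INCLUDE"   # final state: the two sections have swapped
after = statuses()

print(before, middle, after, sep="\n")
```

In the intermediate state neither section contributes content, so a document whose DTD requires one of them would be temporarily invalid. An editor would need to defer its checks until both values have changed.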

Summary

SGML-aware authoring addresses issues that extend beyond markup. Authors benefit from using an SGML editor that delivers real-time comprehension of and compliance with SGML semantics. An SGML editor can be measured both by the guidance it gives the author in the creation of the document and by the extent to which the editor ensures the document's validity. A good SGML editor constantly maintains validity and provides real-time contextual aids to authoring.

Concepts Behind SGML-aware Authoring was written by Paul Grosso of ArborText, Inc., a member of SGML Open.

An international consortium, SGML Open is dedicated to accelerating the widespread adoption of ISO 8879, the Standard Generalized Markup Language. Members include vendors providing a broad range of SGML software and services, augmented by an advisory board of industry leaders and analysts and liaison relationships with customer user groups.