Arguments against SGML

[Archive copy mirrored from: http://www.naggum.no/~erik/sgml/against.html]

The Standard Generalized Markup Language [SGML, ISO 8879] is a standard that was intended to provide users with an application-independent language for their information, and so to protect them against the vagaries of vendors who, through their desire to hold on to the customer and make money off their software, sought to limit the usefulness of the customers' information, in practice if not in intent. However, SGML fails to protect as much as it is claimed to do. In fact, using SGML blindly in the belief that it protects your information may endanger it and thus place your investments at risk.

Using SGML wisely requires that you think through a number of hard issues, and apply much intelligence and effort to address the needs of your information. (I refer to "the needs of the information", as opposed to somebody's information needs, because information has a life of its own, and it is to the extent to which we keep it alive that it may become able to satisfy our needs. It is therefore of fundamental importance not only to look at the result of using information productively, but to study the premises that make information able to produce those results.)

If you have already thought through a number of hard issues when deciding whether to use SGML or not, I trust that you will have no problems and much enthusiasm for following me in more and harder issues.

Features and Syntax

A proper SGML document starts with an SGML declaration, detailing such things as the document character set, the syntax used, and the system capacity and the SGML features required to process the document. Such a multitude of information in a single "declaration" is a sign that the complexity of the system is not sufficiently understood. When we look closer, we also find serious flaws in the foundation.

Features in SGML change the syntax of the language!

This means that the information is safe only as long as the same set of features is used when processing it as when it was created. That introduces application and processing dependencies in a strong sense, although SGML was created to remove exactly such dependencies. Since it should not matter whether we can predict the uses to which an SGML document may be put, we must find ways around this flaw. To find them, we must know the impact of each feature.

Markup Minimization Features

SHORTTAG

When enabled, allows the following to be omitted:

the tagc in tags
attributes that may be defaulted
the name and value indicator of attributes whose value is a member of a name group
the lit or lita delimiters surrounding attribute values that consist only of name characters
the generic identifier in tags (i.e., allowing empty tags)

In addition, the net-enabling start-tag and the net are allowed and recognized.

The impact of the feature is somewhat more important than this, however. A stago or etago may be followed by tagc, which causes a sequence that was data when this feature was not enabled to be recognized as markup, causing changes to the contents of the document, introducing errors or changes in the structure.
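
The change can be made concrete with a toy scanner. This is only a sketch, not a parser: it assumes the delimiters of the reference concrete syntax ("<", "</", ">") and ASCII letters as name start characters, and the function name tag_opens is made up for the illustration. It reports where a tag open would be recognized with and without SHORTTAG.

```python
import string

# Assumption: name start characters of the reference concrete syntax.
NAME_START = set(string.ascii_letters)

def tag_opens(text, shorttag):
    """Return the offsets at which "<" or "</" would be recognized as a tag open.

    Toy model of recognition in context: stago/etago count as markup only
    when followed by a name start character or, with SHORTTAG enabled,
    by tagc (">"). Everything else stays data.
    """
    found = []
    for i, ch in enumerate(text):
        if ch != "<":
            continue
        # Skip the "/" of an etago so that we look at what follows the delimiter.
        j = i + 2 if text.startswith("</", i) else i + 1
        following = text[j:j + 1]
        if following in NAME_START or (shorttag and following == ">"):
            found.append(i)
    return found

sample = "a <> b <p>x</p>"
print(tag_opens(sample, shorttag=False))
print(tag_opens(sample, shorttag=True))
```

The same bytes yield one more tag open when the feature is on: the "<>" that was two data characters has become an empty start-tag.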

Turning this feature off has the very undesirable consequence of requiring that every attribute be specified in full, and that attributes whose value is a member of a name group be given with their attribute name.

OMITTAG

When enabled, allows some tags to be omitted completely.

The impact of the feature is somewhat more important than that, however. In an element declaration, the omitted tag minimization flags are required unless this feature is off, meaning that element declarations without these flags cannot be used as a part of an unknown document type definition.

SHORTREF

When enabled in the syntax (not the FEATURES section of the SGML declaration), enables a mapping from short strings to entities.

However, this has serious ramifications for both document type definition and document instance. The short reference mappings are specified in short reference mapping declarations (SHORTREF) and maps are used in elements via the short reference map use declaration (USEMAP). These declarations are illegal in the document type definition if the feature is not enabled, making it necessary for a document type definition to "know" whether the feature is in use.

Worse, the short reference strings are specified in the SGML declaration. It is an error for a SHORTREF declaration to specify a string that is not declared in the syntax, which creates a strong binding between the SGML declaration and the SHORTREF declaration. Moreover, a longer short reference string takes precedence over a shorter one even if the longer one is not mapped, in which case its constituent characters are parsed as data. This makes extensive use of short references hazardous.
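
The precedence rule is easy to underestimate, so here is a minimal sketch of it. It is an illustration under assumptions, not a conforming implementation: the function name scan_short_refs, the token shapes, and the entity name "dash" are all invented for the example, and only the longest-match rule described above is modelled.

```python
def scan_short_refs(text, declared, current_map):
    """Toy scanner: at each position the longest declared short reference wins.

    declared    -- strings declared as short reference delimiters in the syntax
    current_map -- dict from a declared string to an entity name (the map in use)
    Returns a list of ("entity", name) and ("data", characters) tokens.
    """
    # Try longer strings first; ignore empty strings, which could never advance.
    by_length = sorted((s for s in declared if s), key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        match = next((s for s in by_length if text.startswith(s, i)), None)
        if match is None:
            tokens.append(("data", text[i]))
            i += 1
        elif match in current_map:
            tokens.append(("entity", current_map[match]))
            i += len(match)
        else:
            # The longest string is unmapped: its characters are plain data, and
            # the shorter mapped string inside it never gets a chance to match.
            tokens.append(("data", match))
            i += len(match)
    return tokens

declared = ["-", "--"]
current_map = {"-": "dash"}  # hypothetical entity name
print(scan_short_refs("a-b", declared, current_map))
print(scan_short_refs("a--b", declared, current_map))
```

In the second call the mapping for "-" never fires, although the text contains two hyphens: merely declaring "--" in the syntax changes how a document that maps only "-" is read.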

DATATAG

When enabled, makes the parser recognize certain data characters as both terminating element contents and as data in the containing element.

The content model of the element is considered to be elements and PCDATA. If this feature is used, the interpretation of an instance with fully specified tags is still subject to special parsing in their contents, which may lead to serious problems in reuse of information.

RANK

When enabled, allows a generic identifier in a tag to omit its rank.

This is the only feature that is done right. It is recognized in the element declaration even though the feature is not used, and the rules for generic identifiers are such that a rank stem cannot also be a generic identifier by itself, thereby removing the possibility that a document may be parsed differently with and without this feature.

Link and Concurrent Document Features

CONCUR: When enabled, a tag may include a document type specification.
This has the unfortunate side effect of making a stago or etago followed by a grpo into markup instead of the data it would have been had CONCUR not been enabled, and then either causing an error or causing the markup to be completely ignored. There are so many constraints on CONCUR that it is useless in any case, but the problems that apply to LINK also apply to CONCUR.
SIMPLE
IMPLICIT
EXPLICIT: When enabled, these introduce a document type specification in entity references.
Unfortunately, this means that an ero or pero followed by a grpo is recognized as markup, and may cause the entity reference to be completely ignored, as opposed to being data as it would have been had these features not been enabled. The net effect of this is that a document cannot have LINK applied to it, because the document may change meaning in ways hard to detect.

Other Features

FORMAL: When enabled, a public identifier changes syntax.
Although technically the same category of change as the others, nobody in their right mind has "FORMAL NO" in the SGML declaration.
SUBDOC: When enabled, entities may be subdocuments.
If the SUBDOC feature flag in the SGML declaration had related to references to subdocument entities, all would be well, but it is the very declaration of subdocument entities that is prohibited if SUBDOC is not enabled, in addition to any references. This differs from the usual way unused elements or entities are treated, which is to ignore them.

Direct Changes to the Syntax

The syntax may also be changed directly in many ways. Some of these appear to be innocuous, but are far from it, such as making more characters into name start characters, which has the dangerous effect of turning an ero, pero, stago or etago into a delimiter in contexts where it used to be data.
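
A small sketch shows how little it takes. It assumes the reference concrete syntax ("&" as ero, ASCII letters as name start characters) and, purely for illustration, that a revised syntax adds "_" as a name start character; the function name entity_reference_opens is likewise made up.

```python
import string

# Assumption: name start characters of the reference concrete syntax.
REFERENCE_NAME_START = set(string.ascii_letters)

def entity_reference_opens(text, name_start=REFERENCE_NAME_START):
    """Return the offsets where "&" (ero) is followed by a name start character.

    Only in that context is ero treated as a delimiter here; elsewhere the
    "&" is ordinary data. Toy model: it ignores "&#" and every other context.
    """
    return [i for i, ch in enumerate(text)
            if ch == "&" and text[i + 1:i + 2] in name_start]

sample = "mask = flags &_low"
print(entity_reference_opens(sample))
print(entity_reference_opens(sample, REFERENCE_NAME_START | {"_"}))
```

Under the first syntax the ampersand is data; under the second it opens a reference to an entity that the author never declared and never meant.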

Character Sets and Encoding

The SGML declaration contains a specification of the document character set as an implicit mapping from characters in a base character set to the integer representation of their encoding, then to described character set integers, creating a character set with an encoding unique to each document. The base character set is named only by a parser-specific formal public identifier, where conventions for non-ASCII sets vary greatly. In practice, these described character sets are fairly regular mappings of common character sets, such as ISO 646 IRV (ASCII) or ISO 8859-1 (Latin 1). In theory, anything is possible.
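
To see what such a mapping amounts to, the following sketch expands a description given as ranges into a per-number table. The triple layout (first described number, count, first base number or nothing for unused positions) and the sample ranges are assumptions modelled on the common ASCII-style description; the function name expand_descset is invented.

```python
def expand_descset(ranges):
    """Expand (described_start, count, base_start) triples into a lookup table.

    base_start is the first character number in the base character set, or
    None where the described numbers are declared unused.
    """
    table = {}
    for described_start, count, base_start in ranges:
        for offset in range(count):
            base = None if base_start is None else base_start + offset
            table[described_start + offset] = base
    return table

# Assumed sample: control positions unused except tab, line feed and
# carriage return; the printable range mapped one-to-one onto the base set.
ascii_like = [
    (0, 9, None), (9, 2, 9), (11, 2, None), (13, 1, 13),
    (14, 18, None), (32, 95, 32), (127, 1, None),
]
table = expand_descset(ascii_like)
print(table[65], table[0], len(table))
```

Note what the table contains: numbers pointing at numbers. Which character number 65 is depends entirely on the base set named by the public identifier, and nothing in the table itself says.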

However, parsing SGML is not only a question of the meaning of a character in the document character set, but of the meaning of each character to SGML. For this purpose, the SGML declaration defines a syntax-reference character set with the same two-stage mapping as above. Normally, the syntax-reference character set is some well-known and stable character set, such as ISO 646 IRV or ISO 8859-1, but it, too, could be anything. The syntax-reference character set is used to define the roles of various delimiters as strings of characters, or, actually, vectors of integers. These, however, are represented in the most counter-intuitive way possible: the literal strings assigned to delimiters in the SGML declaration are read in the system character set, and the resulting numbers used with respect to the syntax-reference character set.

To complicate this further, the SGML parser has its own system character set used to read files and other input sources. Now, for this to work at all, the document character set must be the same as the system character set, otherwise the parser would not be able to read the SGML declaration! However, conversion of SGML documents between character sets is a very complex task, and one that has not generally been performed.

Now consider the paradox that is introduced by using integers as the means of representing characters. The base character sets of both the document and the syntax-reference character sets exist only as reference points as far as SGML is concerned. No attempt is made to represent characters other than as "character number N of base character set X". Fundamentally, therefore, SGML has no mechanism to match characters between the syntax-reference character set and the document character set!

How does an SGML parser manage to parse SGML documents when the meaning of characters is left in a limbo? *TODO*

If SGML had avoided the numbers game, and dealt with characters as characters, it would have been possible to create much more advanced encodings, with no loss in flexibility. The idea that characters are integers because a bunch of bits can be regarded both as an integer and as a character is limiting more languages than SGML; C suffers from the same general problems.

Entity-level Encoding Specification

The complexity in the above paragraph is exacerbated by the fact that an entity is not under control of the document which incorporates it.

Conclusion

SGML as a language has a large number of serious problems with respect to its design goals, which SGML does more to hurt than to help. All of the features and facilities which were created to satisfy specific needs are so ill-designed as to interfere with each other, creating an unknown set of features for anything smaller than an entire document. This actually means that to obtain interoperability and reuse, general agreement must be obtained on a specific set of features, precisely what SGML was trying to avoid.

If SGML is to be used for information that is expected to achieve application and system independence, a simple application-level convention must be adopted: never let a character that starts any delimiter string be used as a data character. If violations of this were to be reported as errors, we would have a much, much simpler language, we would have an opening for extensions to SGML using more delimiter strings, and we would ensure that the information would be safe in all kinds of reuses and environments. However, this would violate the "principle" in the revision process that no valid SGML document should be invalid under the revised edition of SGML. Thus, both the information and SGML are endangered unless that "principle" is rescinded.

Erik Naggum