[Unofficial mirror copy from: http://www.textuality.com/sgml-erb/kimber/index.html, November 5, 1996 ]
by Eliot Kimber, Passage Systems.
September 1996
It seems clear that the world needs a version of SGML that can be parsed quickly and easily, particularly for the network delivery of SGML but also for the general purpose of enabling small and easy-to-implement SGML processors. This requirement has been stated in a number of ways by a number of people, most often as "lex-able SGML", i.e., an SGML for which parsers can be created using the lex/yacc parser generation tool set, or as "ad-hoc parsable documents".
In addition, there is a clear requirement for documents in which the element types are not explicitly declared.
Both of these requirements can be met by the same general solution.
It appears to be possible to satisfy these requirements by refining the set of optional SGML features such that a minimal set of features can be specified that avoids those aspects of SGML that make parsing difficult.
The specific changes required are:
There are many situations where either there is no compelling need for explicit document type declarations or there is a compelling need for implied document type declarations (such as environments where every document will have new or different element types, possibly determined by the data the documents contain or are derived from). When markup minimization is not used, there are no inclusion exceptions, no EMPTY elements or #CONREF attributes, and no #CURRENT attributes, and the document can be assumed to be valid, the structure of the document can be inferred from the element markup itself. Parsers that do not need to first parse a document type declaration and then use it to drive parsing can be created very easily by any reasonably competent programmer.
To make ad-hoc SGML parsing possible, it must be possible to disallow the use of inclusion exceptions, EMPTY elements and #CONREF attributes, and #CURRENT attributes. Hence the proposal to make these currently required aspects of SGML optional features, so that they can be turned off and disallowed by system declarations.
Given these restrictions, some or all of the element and attribute declarations could be omitted from a document type declaration without loss of parsing ability. Obviously, most contextual constraints cannot be validated under these circumstances. Validation, if needed, can always be provided by a separate validating parser and a completely specified set of element type declarations.
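To illustrate how little machinery such a parser needs, here is a minimal sketch in Python. It is not part of the proposal. It assumes a document instance only (no declarations, comments, or entity references), quoted attribute values containing no ">" characters, and the restrictions above: no markup minimization, no EMPTY elements, and no inclusions, so every start-tag has a matching end-tag. The function name parse and the tuple layout are hypothetical choices made for this illustration.

```python
import re

# A token is a start-tag, an end-tag, or a run of character data.
# Assumption: attribute values are quoted and contain no ">" characters.
TOKEN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.\-]*)([^>]*)>|([^<]+)")
ATTR = re.compile(r"""([A-Za-z][A-Za-z0-9.\-]*)\s*=\s*("[^"]*"|'[^']*')""")


def parse(text):
    """Build a tree of (name, attrs, children) tuples from fully tagged markup.

    No declarations are consulted: structure is inferred purely from the
    pairing of start-tags and end-tags.
    """
    root = ("#document", {}, [])
    stack = [root]
    for match in TOKEN.finditer(text):
        closing, name, rest, data = match.groups()
        if data is not None:
            # Character data belongs to the currently open element.
            stack[-1][2].append(data)
        elif closing:
            # An end-tag must close the currently open element.
            if len(stack) == 1 or stack[-1][0] != name.upper():
                raise ValueError("unexpected end tag: " + name)
            stack.pop()
        else:
            # Names are folded to upper case, as under the reference concrete syntax.
            attrs = {k.upper(): v[1:-1] for k, v in ATTR.findall(rest)}
            node = (name.upper(), attrs, [])
            stack[-1][2].append(node)
            stack.append(node)
    if len(stack) != 1:
        raise ValueError("unclosed element: " + stack[-1][0])
    return root
```

Because no content models are available, the only checking the sketch can do is that tags nest properly. Contextual constraints would have to be checked by a separate validating parser, as noted above.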
Most modern systems provide, and most modern users therefore expect, name lengths of at least 32 characters for things like files, variables, and unique identifiers. Increasing the name length of the reference concrete syntax (RCS) to at least 32 would remove the need for explicit SGML declarations that do nothing other than increase the name length, and with it the need for simple parsers to support parsing of SGML declarations.
While this change would, ostensibly, increase the base memory requirements for parsers limited to the RCS by a factor of 4 (the RCS name length is currently 8), this does not seem like a serious concern: modern computers have ample capacity, few actual parsers are limited to the reference concrete syntax, and modern techniques of dynamic memory allocation largely remove the need to be concerned about capacity limits in general.
As changes 1-3 enable the ad-hoc parsing of SGML documents in the absence of explicit element type declarations, it seems reasonable to codify the omission of element type declarations for some or all of the element types in a document.
The feature must provide at least this option:
A. Complete omission of all element and attribute declarations
Option A can be thought of as "DOCTYPE #IMPLIED", where the document type declaration is completely implied by the element markup in the document instance. This option is necessary to enable the parsing of documents without any reference to explicit content models.
The exact syntax would be something like:
<!DOCTYPE DOCTYPENAME #IMPLIED [ <!-- Internal subset containing only NOTATION and ENTITY declarations --> ]>
There would be no external doctype subset and only NOTATION and ENTITY declarations would be allowed within the internal doctype declaration subset.
If change (6) is accepted, any declarations resulting from the change (6) design would also be allowed.
NOTATION declarations are required to enable the use of architectures and ENTITY declarations are required to enable the use of entities (which cannot be an optional feature of SGML).
Option A could be indicated in the SYSTEM or SGML declaration with a new optional feature, e.g. "IMDOCTYP" or something.
In addition, it might make sense to provide an additional option:
B. Omission of some (but not necessarily all) element and attribute declarations.
This option would be useful with parsers that are capable of working with explicit content models but allow documents where some or all of the element types are not explicitly declared. There are several possibilities for how an element type could be processed when its declaration is omitted, but an obvious behavior would be to use the meta content model of the architectural form to which it conforms, if any, and otherwise, in the absence of a mapping to any architectural form, to treat it as having a content model of "ANY".
The availability of option B could be treated as a new optional feature, e.g., OMITELTP YES or something.
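The fallback behavior suggested for option B can be sketched as a simple lookup. This is only an illustration of one of the "several possibilities" mentioned above. The function name and the three dictionaries (declared content models, element-to-architectural-form mapping, and meta content models) are hypothetical, and content models are represented as plain strings.

```python
def effective_content_model(name, declared, arch_form_of, meta_models):
    """Return the content model to apply to element type `name`.

    declared:     element type name -> explicitly declared content model
    arch_form_of: element type name -> architectural form it conforms to
    meta_models:  architectural form name -> meta content model
    """
    # 1. An explicit declaration always wins.
    if name in declared:
        return declared[name]
    # 2. Otherwise use the meta content model of the architectural form, if any.
    form = arch_form_of.get(name)
    if form is not None and form in meta_models:
        return meta_models[form]
    # 3. Otherwise treat the element type as having a content model of ANY.
    return "ANY"
```

For example, with declared = {"PARA": "(#PCDATA)"} and no architectural mappings, effective_content_model("NOTE", declared, {}, {}) falls through to "ANY".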
Even when element type declarations are omitted, it is still necessary to process ID and entity references. Therefore, it should be possible to indicate in some simple way which attributes in a document should be treated as IDs, ID references, or entity references.
One simple mechanism is to define conventional attribute names, e.g., "ID" for ID attributes, "REFID" and "REFIDS" for ID references, and "ENTITY" for entity references. The behavior could be such that when the element type declaration is omitted, attributes with these names are treated as described. The main problem with this approach is that it encroaches on the ID name space by essentially dictating attribute names.
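A sketch of how a parser might apply this naming convention when no declarations are present is given below. It assumes attribute names have already been folded to upper case, that ID values are likewise case-folded (as they would be under the reference concrete syntax), and that a "REFIDS" value is a space-separated list. The function name check_references is hypothetical, and "ENTITY" attributes are left out for brevity.

```python
def check_references(elements):
    """Collect IDs and find dangling ID references by attribute name alone.

    elements: iterable of attribute dictionaries, one per element,
              with upper-cased attribute names.
    Returns (ids, unresolved) where unresolved lists references to
    IDs that occur nowhere in the document.
    """
    ids = set()
    refs = []
    for attrs in elements:
        if "ID" in attrs:
            value = attrs["ID"].upper()
            if value in ids:
                raise ValueError("duplicate ID: " + value)
            ids.add(value)
        if "REFID" in attrs:
            refs.append(attrs["REFID"].upper())
        # REFIDS holds zero or more names separated by white space.
        refs.extend(name.upper() for name in attrs.get("REFIDS", "").split())
    # References are resolved only after the whole document has been seen,
    # because an ID may be referenced before it is defined.
    unresolved = [ref for ref in refs if ref not in ids]
    return ids, unresolved
```

Note that the sketch treats every attribute named "ID" as an ID regardless of element type, which is exactly the kind of dictated naming the paragraph above identifies as the drawback of this mechanism.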
The use of architectures can provide a solution as architectures can define architectural attribute declarations that define attribute names and data types.
In addition, it might be possible to provide a new declaration or a modification to the ATTLIST declaration to enable the declaration of attribute names globally for a whole document. Such a proposal is part of the changes stemming from the general architecture support requirements.
Included subelements, in addition to being included in the content models of all the descendant element types of the element type of which they are an inclusion, also have a different record end behavior. Namely, the record ends caused solely by included subelements are not taken as data.
This record end behavior reflects the implied semantic of included subelements as being "invisible" to the elements around them (the typical example being index entry markers). This semantic and the record end behavior are important features of SGML and a hard requirement given the record-sensitive nature of SGML. In other words, because SGML allows documents to be organized into multiple records and treats some record ends as data content, there must be a way to allow elements whose presence or absence will not affect the data content of the elements in which they appear.
However, the inclusion feature of SGML imposes some serious parsing difficulties. In particular, the use of inclusions prevents ad-hoc parsing of documents with omitted element type declarations. It also makes the automatic creation of subdocuments (or the accurate parsing of document fragments, which comes to the same thing) difficult if not impossible.
Therefore, there is a requirement to provide a mechanism for indicating that particular element types have the semantic of "invisibility" with respect to record ends that does not also make them inclusions in the current sense.
There are at least two possible ways to solve this problem. Both are proposed here for consideration and further analysis.
Proposal 1: New Token in ELEMENT Declarations
The invisibility semantic could be indicated by adding some new token to ELEMENT declarations, e.g., "<!ELEMENT #INVISIBL MyElementType".
This approach has the advantage that it directly binds the semantic of invisibility to the element type. It has the disadvantage that the ELEMENT declaration must be provided for each invisible element, which affects ad-hoc parsing.
Proposal 2: New Declaration for Global Declaration of Invisible Element Types
Define a new declaration, e.g., "<!INVISIBL", whose value is a list of element type names to be treated as invisible elements with respect to data record ends.
This approach has the advantage that it can be used with completely omitted document types to better support ad-hoc parsing.
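The effect of such a list on record end handling can be sketched as follows. This is a deliberately simplified model, not the full SGML record end rules: it covers only the case of a record end "caused solely" by invisible elements and ignores the separate rules for the first and last record end in an element. The function name and the representation of a record as a list of ("data", text) and ("element", NAME) pairs are hypothetical, and the set of names stands in for whatever the proposed declaration would list.

```python
def record_end_is_data(record, invisible):
    """Decide whether the record end closing `record` counts as data.

    record:    list of ("data", text) or ("element", NAME) items that
               appeared on one record, in order
    invisible: set of element type names declared invisible
    """
    if not record:
        # An empty record: nothing caused it, so its record end is data.
        return True
    # The record end is data if anything on the record is real content:
    # character data or an element that is not invisible.
    return any(kind == "data" or name not in invisible for kind, name in record)
```

For instance, with invisible = {"INDEXTERM"}, a record holding only an INDEXTERM element yields False (its record end is dropped), while a record that also holds character data yields True. Because the decision depends only on the list of names and not on any content model, it works equally well when all element type declarations are omitted.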