The eXtensible Markup Language (XML)

[Mirrored from: http://www.u-net.com/~sgml/xml.htm], November 07, 1996 12:14:02 CST

The eXtensible Markup Language (XML) -- A report on the work to date of the SGML on the Web ERB

Martin Bryan, The SGML Centre

At the end of August a new ERB was set up by W3C to discuss the role of SGML on the Web. A series of discussion questions was posted and observations were requested. Internationalization was one of the first subjects raised in the discussions, and it was soon resolved that the UCS-2 code set should be the basis of both the data and the markup for the new language.

The initial discussion period came to a close on October 8th 1996, and the SGML ERB held an editorial meeting on October 9th and voted that:

XML will have only one concrete syntax, fixed at XML specification time, not document-instance parse time
All or virtually all the information provided by a normal SGML declaration will be fixed for all documents; no SGML declaration will be necessary. (Possible exception: character-set information may vary document to document, but will be conveyed in other ways.)
XML will have no OMITTAG, DATATAG, SHORTREF, LINK, CONCUR, RANK, or SUBDOC features
XML will make only partial use of the SHORTTAG feature:
- no minimized start-tags or, probably, end-tags
- no omission of attribute name and value indicator
but
- omission of attribute-value specification will be legal for attributes not declared REQUIRED (applies only when DTD is supplied and used)
XML will have no quantities or capacities
XML will not allow asynchronous marked sections -- marked sections must begin and end in the same element.
XML will have CDATA marked sections, which must begin with the 9-character literal string <![CDATA[ and end with the 3-character literal string ]]>. XML will not have RCDATA or TEMP marked sections.
XML will not have INCLUDE or IGNORE marked sections in document instances
XML will have no CDATA or RCDATA elements
CDATA marked sections are to be used for blocks of text
XML will retain the distinction between element content and mixed content. (Applies only if DTD supplied and used.)
XML will require all attribute-value specifications to take the form of attribute-value literals
XML will not allow RE to end an entity or character reference; an explicit refc must be provided, and it must be a semicolon
XML will have declarations for elements and attributes, but not for short-references or links
XML will retain fundamentally the same parsing rules as SGML, though they may be expressed differently. (N.B. There is some sentiment for making XML's rules more restrictive than SGML's.)
Like SGML, XML will forbid empty strings as attribute values for non-CDATA attributes, require FIXED attributes to take their default values, and distinguish implied values from null-string values
XML will have no CURRENT attributes, but it will have FIXED, REQUIRED, and IMPLIED attributes, and attributes with explicit defaults.
Unlike SGML, XML will not allow direct references to external data entities from within parsed character data
Like SGML, XML will forbid recursive entity reference
Like SGML, XML will allow elements to be declared ANY
XML will behave like SGML as regards behavior and precedence of occurrence indicators and connectors in content models.
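
To illustrate the CDATA marked-section decision above, here is a minimal Python sketch that splits a string into ordinary text and CDATA section content using the two literal delimiters. The function name `split_cdata` and the tuple-based output are assumptions made for this example only; the ERB has not prescribed any processing interface.

```python
CDATA_OPEN = "<![CDATA["   # the 9-character opening literal
CDATA_CLOSE = "]]>"        # the 3-character closing literal


def split_cdata(text):
    """Split text into ("text", s) and ("cdata", s) pieces.

    Hypothetical helper: the content of a CDATA marked section is
    returned verbatim, with no markup recognition applied to it.
    """
    pieces = []
    pos = 0
    while True:
        start = text.find(CDATA_OPEN, pos)
        if start == -1:
            # No further marked sections: the rest is ordinary text.
            if pos < len(text):
                pieces.append(("text", text[pos:]))
            return pieces
        if start > pos:
            pieces.append(("text", text[pos:start]))
        body_start = start + len(CDATA_OPEN)
        end = text.find(CDATA_CLOSE, body_start)
        if end == -1:
            raise ValueError("unterminated CDATA marked section")
        pieces.append(("cdata", text[body_start:end]))
        pos = end + len(CDATA_CLOSE)
```

For example, `split_cdata("a<![CDATA[<b>&]]>c")` yields a text piece, a CDATA piece holding the characters `<b>&` untouched, and a final text piece.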

On October 16th the following decisions were made:

XML will retain the notion and syntax of comments, i.e. 8879 'comment declarations', but comment declarations will contain at most one comment. Empty comments (<!>) will not be allowed in XML.
The character repertoire of XML documents is that of ISO 10646. Conforming XML documents may be in UTF-8 or UCS-2 form. All XML processors must accept documents in UTF-8 and UCS-2 (or optionally UTF-16) form. XML processors may provide a user option which allows them to accept documents in other coded character sets (e.g. ISO 8859 or JIS 0208) or other encodings of 10646 or other coded character sets (e.g. Extended Unix Code) -- this behavior must be optional (i.e. the user must be able to turn it off, so that documents not in UTF-8 or UCS-2 raise errors).
XML will not require each document instance to have a DTD.
Assuming that a satisfactory RE rule can be agreed on, XML will not forbid comments and processing instructions in mixed content.
XML will restrict PCDATA to models of the form (#PCDATA | x ... | z)*
XML will not use MSOCHAR, MSSCHAR, and MSICHAR strings.
XML will forbid empty end-tags.
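
The restriction on PCDATA content models can be sketched as a simple pattern check. The Python below is only an illustration: the name syntax (a letter followed by letters, digits, "." or "-") and the acceptance of `(#PCDATA)*` with no element names are assumptions of this sketch, and `is_allowed_pcdata_model` is a hypothetical helper name.

```python
import re

# Assumed name syntax for this sketch only.
_NAME = r"[A-Za-z][A-Za-z0-9.\-]*"

# (#PCDATA | x | ... | z)* with optional white space around the tokens.
_MIXED = re.compile(r"\(\s*#PCDATA(\s*\|\s*" + _NAME + r")*\s*\)\*")


def is_allowed_pcdata_model(model):
    """Return True if a model mentioning #PCDATA has the restricted form."""
    return _MIXED.fullmatch(model.strip()) is not None
```

Under these assumptions `(#PCDATA | em | strong)*` is accepted, while `(#PCDATA, em)` and `(em | #PCDATA)*` are rejected.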

The following points were agreed on October 19th:

XML will have external NDATA entities.
XML will have internal text entities.
Version 1.0 XML will not have public identifiers, only system identifiers.
Version 1.0 system identifiers will be URLs.
Version 1.0 URLs need not carry the FSI-style <url> label.

The following points were agreed on October 23rd:

XML will require all entities to be synchronous with the document's element structure
XML will not prescribe any particular method of handling entity ends
XML will retain SGML's prohibition on ENTITY attributes referring to SGML text entities
Reference to an entity not declared and not included in the list of 'automatic' declarations is a reportable error. (No particular error recovery strategy will be prescribed.)
XML will define automatically the entities lt, gt, amp, and two entities for double and single quotation (for use in attribute value literals), names to be determined in separate discussion
XML documents may refer to characters in ISO 10646 using the form &u- or &U- followed by four hexadecimal digits, followed by semicolon
XML will retain SGML's prohibition on multiple declarations for the same element
XML 1.0 will prohibit the use of inclusion and exclusion exceptions in element declarations
XML will allow content-model references to undeclared elements
XML will forbid use of the & connector in content models
XML will retain SGML's prohibition on multiple attribute-list declarations for the same element or on multiple declarations for the same attribute
XML should have the following attribute types: ID, IDREF, IDREFS, ENTITY, ENTITIES, CDATA, enumerated attribute types, NOTATION attribute type, NMTOKEN and NMTOKENS. The types NUMBER(S), NUTOKEN(S), and NAME(S) are to be dropped.
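
The reference rules agreed on October 23rd (automatic entities, the &u- character reference form, a mandatory semicolon, and an error for undeclared entities) can be sketched in Python as follows. This is a sketch under stated assumptions: `expand_references` is a hypothetical name, the name syntax in the pattern is assumed, and the two quotation-mark entities are omitted because their names had not yet been decided.

```python
import re

# Entities the report says will be declared automatically. The two
# quotation-mark entities are left out: their names were still undecided.
AUTOMATIC_ENTITIES = {"lt": "<", "gt": ">", "amp": "&"}

# Either &u-XXXX; / &U-XXXX; (four hex digits) or &name; with an
# assumed name syntax. The closing semicolon is always required.
_REFERENCE = re.compile(
    r"&(?:[uU]-([0-9A-Fa-f]{4})|([A-Za-z][A-Za-z0-9.\-]*));"
)


def expand_references(text, declared=None):
    """Expand character and entity references in text (illustrative only)."""
    entities = dict(AUTOMATIC_ENTITIES)
    entities.update(declared or {})

    def replace(match):
        hex_digits, name = match.groups()
        if hex_digits is not None:
            return chr(int(hex_digits, 16))
        if name not in entities:
            # A reportable error; no recovery strategy is prescribed.
            raise KeyError("reference to undeclared entity: " + name)
        return entities[name]

    return _REFERENCE.sub(replace, text)
```

Replacement text is not rescanned in this sketch, so `&amp;lt;` expands to the four characters `&lt;` rather than to `<`.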

The following points were agreed on October 26th:

XML will add "." to the set of legal name-start characters, and will reserve the portion of all name-spaces beginning ".XML." for the purposes of the language
Note: Tim Bray has since discovered that beginning GI's with .XML. may cause incompatibility with CSS -- the SGML-ERB may end up having to reserve "-XML-".
There will be a mechanism, using a reserved attribute, to toggle, per element, between two modes of white-space handling. In "White Space Preservation" mode, all white space including RE is passed through to the application, with the exception of a single leading and trailing RE *if* they are alone on a line with the start- or end-tag. Note: "alone on a line with" assumes that comments have already been stripped. In "White Space Collapse" mode, all initial and trailing white space in an element is eaten by the parser, and all internal white space, including successive blank lines, is replaced by a single space character before passing to the application.
The setting of this toggle is by default inherited from the parent element. The root element of any document, by default, has the toggle set to "White Space Collapse" mode.
The White Space mode is orthogonal to the use of CDATA marked sections; that is to say, CDATA marked sections will still ignore markup delimiters, but will respect the current White Space mode.
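
The two white-space modes and the inheritance of the toggle can be sketched in Python as below. Several simplifications are assumed: RE is represented by "\n", white space means space, tab, carriage return and line feed, comments are taken to be already stripped, and a newline directly after the start-tag or directly before the end-tag is treated as being "alone on a line" with that tag. The function names and the mode strings "preserve" and "collapse" are hypothetical; the name of the reserved attribute has not been fixed.

```python
import re

_WHITE_SPACE = re.compile(r"[ \t\r\n]+")


def collapse(content):
    """White Space Collapse: drop leading and trailing white space and
    replace each internal run of white space with a single space."""
    return _WHITE_SPACE.sub(" ", content).strip(" ")


def preserve(content):
    """White Space Preservation: pass everything through except one
    leading and one trailing record end next to the tags."""
    if content.startswith("\n"):
        content = content[1:]
    if content.endswith("\n"):
        content = content[:-1]
    return content


def effective_mode(own_setting, parent_mode=None):
    """An element's own setting wins; otherwise the mode is inherited
    from the parent; the root element defaults to collapse."""
    if own_setting is not None:
        return own_setting
    if parent_mode is not None:
        return parent_mode
    return "collapse"
```

With these definitions, `collapse("  one\n\n\n two\tthree \n")` gives `"one two three"`, while `preserve` applied to the same kind of content keeps indentation and internal line breaks intact.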

The following points were agreed on October 30th:

Support for external text entities will be an optional feature of XML 1.0
XML will use the string '/>' as the NET delimiter in the SGML declaration for XML documents
XML will allow the form <e/> to be used to indicate the presence of empty elements, with or without element declarations.
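
As an illustration of the `<e/>` form, the following Python sketch finds empty-element tags in a string. It assumes a simple name syntax and, in line with the earlier decision that attribute-value specifications must be literals, only recognizes quoted attribute values. `find_empty_elements` is a hypothetical helper, not a proposed parser.

```python
import re

# Assumed name syntax for this sketch only.
_NAME = r"[A-Za-z][A-Za-z0-9.\-]*"

# <name attr="value" ... /> where every attribute value is a quoted literal.
_EMPTY_TAG = re.compile(
    r"<(" + _NAME + r")"
    r"(?:\s+" + _NAME + r"\s*=\s*(?:\"[^\"]*\"|'[^']*'))*"
    r"\s*/>"
)


def find_empty_elements(text):
    """Return the element names of all <e/>-style tags found in text."""
    return [match.group(1) for match in _EMPTY_TAG.finditer(text)]
```

Because the closing delimiter is the two-character string `/>`, such a tag can be recognized as empty without consulting any element declaration.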

The following points were agreed by November 7th:

XML will change PIC to be ?>; this will allow a lot of things to fit into PIs that currently can't (most notably some proposed server-side scripting languages)
XML will have no CONREF attributes
XML will allow more than one enumerated type (name-group declared value) to contain the same possible value
XML will support only the <e/> syntax for EMPTY elements
XML will specify that the <img> form of HTML empty elements will be allowed for compatibility reasons.
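
The effect of changing PIC to ?> can be seen in a small Python sketch that extracts processing instructions with a configurable closing delimiter. The helper name `extract_pis`, its `pic` parameter, and the sample "script" processing instruction are all hypothetical and serve only to show why a bare ">" closer truncates content containing ">".

```python
def extract_pis(text, pic="?>"):
    """Return the content of each processing instruction in text.

    pic is the closing delimiter: "?>" as agreed for XML, or ">" to
    mimic the SGML reference concrete syntax for comparison.
    """
    pis = []
    pos = 0
    while True:
        start = text.find("<?", pos)
        if start == -1:
            return pis
        end = text.find(pic, start + 2)
        if end == -1:
            raise ValueError("unterminated processing instruction")
        pis.append(text[start + 2:end])
        pos = end + len(pic)
```

With the sample input `<p><?script if (a > b) emit("x") ?></p>`, the default delimiter returns the whole instruction, whereas `pic=">"` cuts it off at the comparison operator.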

A number of outstanding questions still need to be resolved before a draft specification of XML can be issued. The draft is scheduled for completion before a meeting to be held in Boston on Sunday 17th November, immediately before SGML '96. The first formal announcements regarding the new standard will be made at SGML '96, followed immediately by a statement of what has been achieved to date at the i18n conference in Seville.

Martin Bryan
10th November 1996