[Mirrored from: http://www.u-net.com/~sgml/xml.htm], November 07, 1996 12:14:02 CST
The eXtensible Markup Language (XML) --
A report on the work to date of the SGML on the Web ERB
Martin Bryan, The SGML Centre
At the end of August a new ERB was set up by W3C to discuss the role of
SGML on the Web. A series of discussion questions was posted and
observations were requested. The subject of internationalization was
one of the first ones to be brought up in the discussions, and it was soon
resolved that the UCS-2 code set should be the basis of both the data and
the markup for the new language.
The initial discussion period came to a close on October 8th 1996, and the
SGML ERB held an editorial meeting on October 9th and voted that:
- XML will have only one concrete syntax, fixed at XML
specification time, not document-instance parse time
- All or virtually all the information provided by a normal
SGML declaration will be fixed for all documents; no SGML declaration
will be necessary. (Possible exception: character-set information may
vary document to document, but will be conveyed in other ways.)
- XML will have no OMITTAG, DATATAG, SHORTREF, LINK, CONCUR,
RANK, or SUBDOC features
- XML will make only partial use of the SHORTTAG feature: there will be
no minimized start-tags or, probably, end-tags, and no omission of
attribute name and value indicator, but omission of the attribute-value
specification will be legal for attributes not declared REQUIRED
(applies only when a DTD is supplied and used)
- XML will have no quantities or capacities
- XML will not allow asynchronous marked sections -- marked
sections must begin and end in the same element.
- XML will have CDATA marked sections, which must begin with the 9-character
literal string <![CDATA[ and end with the 3-character literal string ]]>.
XML will not have RCDATA or TEMP marked sections.
- XML will not have INCLUDE or IGNORE marked sections in document
instances
- XML will have no CDATA or RCDATA elements
- CDATA marked sections are to be used for blocks of
text
- XML will retain the distinction between element content and
mixed content. (Applies only if DTD supplied and used.)
- XML will require all attribute-value specifications to take
the form of attribute-value literals
- XML will not allow RE to end an entity or character
reference; an explicit refc must be provided, and it must be a semicolon
- XML will have declarations for elements, and attributes, but not for
short-references or links
- XML will retain fundamentally the same parsing rules as
SGML, though they may be expressed differently. (N.B. There is some
sentiment for making XML's rules more restrictive than SGML's.)
- Like SGML, XML will forbid empty strings as attribute values
for non-CDATA attributes, require FIXED attributes to take their default
values, and distinguish implied values from null-string values
- XML will have no CURRENT attributes, but it will have FIXED,
REQUIRED, and IMPLIED attributes, and attributes with explicit
defaults.
- Unlike SGML, XML will not allow direct references to
external data entities from within parsed character data
- Like SGML, XML will forbid recursive entity reference
- Like SGML, XML will allow elements to be declared ANY
- XML will behave like SGML as regards behavior and precedence
of occurrence indicators and connectors in content models.
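The CDATA marked-section rule in the list above can be illustrated with a short Python sketch. It splits a string into ordinary text and CDATA spans using only the two literal delimiters. The function name split_cdata is hypothetical, and the sketch assumes that the first ]]> after an opening delimiter closes the section; it is not taken from any XML draft.

```python
CDATA_OPEN = "<![CDATA["   # the 9-character opening literal
CDATA_CLOSE = "]]>"        # the 3-character closing literal


def split_cdata(text):
    """Split text into (kind, content) pairs; kind is 'text' or 'cdata'.

    Hypothetical helper for illustration only.
    """
    parts = []
    pos = 0
    while True:
        start = text.find(CDATA_OPEN, pos)
        if start == -1:
            # No further marked sections: the remainder is ordinary text.
            if pos < len(text):
                parts.append(("text", text[pos:]))
            return parts
        if start > pos:
            parts.append(("text", text[pos:start]))
        body_start = start + len(CDATA_OPEN)
        end = text.find(CDATA_CLOSE, body_start)
        if end == -1:
            raise ValueError("unterminated CDATA marked section")
        # Markup delimiters inside the section are kept as literal data.
        parts.append(("cdata", text[body_start:end]))
        pos = end + len(CDATA_CLOSE)
```

For example, split_cdata("a<![CDATA[<b>&]]>c") separates the literal data "<b>&" from the surrounding text "a" and "c".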
On October 16th the following decisions were made:
- XML will retain the notion and syntax of comments, i.e. 8879
'comment declarations', but comment declarations will contain
at most one comment. Empty comments (<!>) will not be allowed in XML.
- the character repertoire of XML documents is that of ISO 10646.
Conforming XML documents may be in UTF-8 or UCS-2 form.
All XML processors must accept documents in UTF-8 and UCS-2 (or
optionally UTF-16) form. XML processors may provide a user option that
allows them to accept documents in other coded character sets (e.g.
ISO 8859 or JIS 0208) or other encodings of 10646 or other coded
character sets (e.g. Extended Unix Code) -- this behavior must be
optional (i.e. the user must be able to turn it off, so that documents
not in UTF-8 or UCS-2 raise errors).
- XML will not require each document instance to have a DTD.
- Assuming that a satisfactory RE rule can be agreed on, XML will not
forbid comments and processing instructions in mixed content.
- XML will restrict PCDATA to models of the form
(#PCDATA | x ... | z)*
- XML will not use MSOCHAR, MSSCHAR, and MSICHAR strings.
- XML will forbid empty end-tags.
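The restriction on PCDATA models listed above lends itself to a mechanical check. The Python sketch below tests whether a content model has the form (#PCDATA | x ... | z)*. The function name and the simplified name pattern (ASCII letters, digits, "." and "-") are assumptions made for illustration, since the report does not fix the name rules.

```python
import re

# Simplified element-name pattern; an assumption for this sketch only.
_NAME = r"[A-Za-z][A-Za-z0-9.\-]*"

# '(' #PCDATA, then zero or more '| name' alternatives, then ')*'.
_MIXED_MODEL = re.compile(r"\(\s*#PCDATA(\s*\|\s*" + _NAME + r")*\s*\)\*")


def is_allowed_mixed_model(model):
    """Return True if model has the form (#PCDATA | x | ... | z)*."""
    return _MIXED_MODEL.fullmatch(model.strip()) is not None
```

Under these assumptions, "(#PCDATA | em | strong)*" is accepted, while "(title, #PCDATA)" and "(#PCDATA | em)+" are rejected.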
The following points were agreed on October 19th:
- XML will have external NDATA entities.
- XML will have internal text entities.
- Version 1.0 XML will not have public identifiers, only system identifiers.
- Version 1.0 system identifiers will be URLs.
- Version 1.0 URLs need not carry the FSI-style <url> label.
The following points were agreed on October 23rd:
- XML will require all entities to be synchronous with the document's element structure
- XML will not prescribe any particular method of handling entity ends
- XML will retain SGML's prohibition on ENTITY attributes referring to SGML text entities
- reference to an entity not declared and not included in the list of
'automatic' declarations is a reportable error. (No particular error recovery strategy will be prescribed.)
- XML will define automatically the entities lt, gt, amp,
and two entities for double and single quotation (for use in attribute
value literals), names to be determined in separate discussion
- XML documents may refer to characters in ISO 10646 using the form
&u- or &U- followed by four hexadecimal digits, followed by a semicolon
- XML will retain SGML's prohibition on multiple declarations for
the same element
- XML 1.0 will prohibit the use of inclusion and exclusion exceptions in
element declarations
- XML will allow content-model references to undeclared elements
- XML will forbid use of the & connector in content models
- XML will retain SGML's prohibition on multiple attribute-list
declarations for the same element or on multiple declarations
for the same attribute
- XML should have the following attribute types:
ID, IDREF, IDREFS, ENTITY, ENTITIES,
CDATA, enumerated attribute types, NOTATION attribute type,
NMTOKEN and NMTOKENS. The types NUMBER(S), NUTOKEN(S), and NAME(S)
are to be dropped.
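The character-reference form agreed above (&u- or &U-, four hexadecimal digits, then a semicolon) can be expanded with a single regular expression. The following Python sketch covers only the form described in this report. The function name is hypothetical, and leaving malformed references untouched is a choice made for the sketch rather than something the ERB specified.

```python
import re

# '&u-' or '&U-', exactly four hexadecimal digits, then a semicolon.
_CHAR_REF = re.compile(r"&[uU]-([0-9A-Fa-f]{4});")


def expand_char_refs(text):
    """Replace each &u-XXXX; reference with the ISO 10646 character it names.

    Anything that does not match the pattern exactly is left unchanged.
    """
    return _CHAR_REF.sub(lambda m: chr(int(m.group(1), 16)), text)
```

With this sketch, "caf&u-00E9;" expands to "café", while "&u-41;" (too few digits) and "&u-00E9" (no semicolon) pass through unchanged.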
The following points were agreed on October 26th:
- XML will add "." to the set of legal name-start characters and will
reserve the portion of all name-spaces beginning ".XML." for the
purposes of the language
Note: Tim Bray has since discovered that beginning GI's with .XML. may
cause incompatibility with CSS -- the SGML-ERB may end up having to
reserve "-XML-".
- There will be a mechanism, using a reserved attribute, to toggle, per
element, between two modes of white-space handling. In "White Space
Preservation" mode, all white space including RE is passed through to
the application, with the exception of a single leading and trailing RE
*if* they are alone on a line with the start- or end-tag. Note:
"alone on a line with" assumes that comments have already been stripped.
In "White Space Collapse" mode, all initial and trailing white space in an
element is eaten by the parser, and all internal white space, including
successive blank lines, is replaced by a single space character before
passing to the application.
The setting of this toggle is by default inherited from the
parent element. The root element of any document, by default, has
the toggle set to "White Space Collapse" mode.
The White Space mode is orthogonal to the use of CDATA marked
sections; that is to say, CDATA marked sections will still ignore
markup delimiters, but will respect the current White Space mode.
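The two white-space modes and the inheritance rule described above can be sketched in Python as follows. This is a simplification under stated assumptions: the mode names "preserve" and "collapse" and both function names are hypothetical, RE is modelled as a newline character, and the function sees only an element's character content, so it treats a leading or trailing newline as being alone on the line with the tag without checking what else shares that line.

```python
def effective_mode(own_setting, parent_mode=None):
    """Resolve an element's mode: explicit setting, else inherited,
    else the root default of 'collapse'."""
    if own_setting is not None:
        return own_setting
    if parent_mode is not None:
        return parent_mode
    return "collapse"


def apply_white_space_mode(content, mode):
    """Return element content as it would be passed to the application."""
    if mode == "collapse":
        # Drop leading and trailing white space and replace every
        # internal run (including blank lines) with a single space.
        return " ".join(content.split())
    if mode == "preserve":
        # Remove at most one leading and one trailing RE (newline).
        if content.startswith("\n"):
            content = content[1:]
        if content.endswith("\n"):
            content = content[:-1]
        return content
    raise ValueError("unknown white-space mode: " + repr(mode))
```

Given the content "\n  a\n\n  b  \n", collapse mode yields "a b", while preserve mode drops only the first and last newline and keeps the rest intact.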
The following points were agreed on October 30th:
- Support for external text entities would be an optional feature of XML
1.0
- XML will use the string '/>' as the NET delimiter in the SGML declaration
for XML documents
- XML will allow the form <e/> to be used to indicate the presence of empty
elements, with or without element declarations.
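Because the <e/> form marks an element as empty in the tag itself, a processor can recognize empty elements without consulting any element declarations. The Python sketch below classifies a single tag by its delimiters alone. The function name is hypothetical, and the sketch assumes its input is one complete element tag (not a comment, declaration, or processing instruction).

```python
def classify_tag(tag):
    """Classify one element tag as 'start', 'end', or 'empty'.

    Assumes tag is a single complete element tag such as <e>, </e>, or <e/>.
    """
    if not (tag.startswith("<") and tag.endswith(">")):
        raise ValueError("not a tag: " + repr(tag))
    if tag.startswith("</"):
        return "end"
    if tag.endswith("/>"):
        # The '/>' close marks the element as empty; no DTD lookup needed.
        return "empty"
    return "start"
```

For example, classify_tag("<e/>") returns "empty" and classify_tag("<e>") returns "start", with no DTD involved in either decision.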
The following points were agreed by November 7th:
- XML will change PIC to be ?> -- this will allow
a lot of things to fit into PI's that currently can't (most notably some
proposed server-side scripting languages)
- XML will have no CONREF attributes
- XML will allow more than one enumerated type (name-group
declared value) to contain the same possible value
- XML will support only the <e/> syntax for EMPTY elements
- XML will specify that the <img> form of HTML empty elements will be
allowed for compatibility reasons.
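The effect of changing PIC to ?> can be seen in a small Python sketch: with a two-character close delimiter, a processing instruction may contain a bare ">" character, as script code often does. The function name read_pi is hypothetical, and the sketch assumes the processing instruction opens with "<?", as in the SGML reference concrete syntax.

```python
def read_pi(text, start=0):
    """Return (content, end) for the processing instruction at text[start:].

    content is the text between '<?' and the first following '?>';
    end is the index just past the closing '?>'.
    """
    if not text.startswith("<?", start):
        raise ValueError("no processing instruction at this position")
    close = text.find("?>", start + 2)
    if close == -1:
        raise ValueError("unterminated processing instruction")
    return text[start + 2:close], close + 2
```

Applied to "<?script if (a > b) go(); ?><p/>", the sketch returns the whole script text, including the ">" comparison, and an end index pointing at "<p/>".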
A number of outstanding questions still need to be resolved before a
draft specification of XML is issued. The draft is scheduled to be
completed before a meeting to be held in Boston on Sunday 17th November,
immediately prior to SGML '96.
The first formal announcements regarding the new standard will be made
at SGML '96, followed immediately by a statement of what has been achieved
to date at the i18n conference in Seville.
Martin Bryan
10th November 1996