XML 1.0 Specification Errata

This document:: http://www.w3.org/XML/xml-19980210-errata
Last revised:: $Date: 2000/01/14 15:20:27 $

This document records known errors in the document:: http://www.w3.org/TR/1998/REC-xml-19980210
The latest version of the XML 1.0 specification:: http://www.w3.org/TR/REC-xml

Abstract

This document records all known errors in the XML 1.0 specification, http://www.w3.org/TR/1998/REC-xml-19980210. The errata are numbered, classified as Substantive, Editorial or Clarification and listed in reverse chronological order of their date of publication. Early errata (1999-02-17 and before) are neither numbered, classified nor dated.

Please email error reports to xml-editor@w3.org.

Known Errors

Errata as of 2000-01-06.

E50 Substantive Source: minutes XML-Syntax 1999-05-25 E73

Section 3.2.1: Change the grammar for 'choice' in production [49] from:
"choice ::= '(' S? cp ( S? '|' S? cp )* S? ')'"
to:
"choice ::= '(' S? cp ( S? '|' S? cp )+ S? ')'"
(which amounts to changing the * into a +).
Rationale:: Eliminate unnecessary ambiguity in the grammar for cp and children, which serves no purpose and confuses some implementors.

E49 Substantive Source: minutes XML-Syntax 1999-05-19 E67

Section 4.2.2: Amend the paragraph beginning "An XML processor should handle a non-ASCII character..." to read as follows: "Some URIs may contain characters that are either reserved (see [IETF RFC2396], section 2.2) or non-ASCII. An XML processor should handle such a character in a URI by representing the character in UTF-8 as one or more bytes, and then escaping these bytes with the URI escaping mechanism (i.e., by converting each byte to %HH, where HH is the hexadecimal notation of the byte value)."
Rationale:: The original only discussed non-ASCII characters; the amended text also covers reserved characters.
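The escaping mechanism described in E49 might be sketched as follows. This is an illustration only: the function names escape_uri_char and escape_non_ascii are hypothetical, the use of upper-case hexadecimal digits is an assumption (the erratum only says "hexadecimal notation"), and the second helper escapes non-ASCII characters only, leaving the decision about reserved characters to the caller.

```python
def escape_uri_char(ch: str) -> str:
    """Represent ch in UTF-8 as one or more bytes, then write each byte as %HH."""
    return "".join(f"%{byte:02X}" for byte in ch.encode("utf-8"))


def escape_non_ascii(uri: str) -> str:
    """Escape every non-ASCII character; ASCII characters pass through unchanged.

    Assumption: reserved ASCII characters are left alone here. A caller that
    wants one of them escaped can apply escape_uri_char to it directly.
    """
    return "".join(c if ord(c) < 128 else escape_uri_char(c) for c in uri)
```

For instance, escape_uri_char applied to the single character U+00E9 yields the two escaped bytes %C3%A9, because the UTF-8 form of that character is two bytes long.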

E48 Clarification Source: minutes XML-Syntax 1999-02-24 E31 and minutes XML-Syntax 1999-05-19 E65

Appendix F

Modify the text from the paragraph beginning "The second possible case occurs when the XML entity..." to the end of the appendix to read:

The second possible case occurs when the XML entity is accompanied by encoding information, as in some file systems and some network protocols. When multiple sources of information are available, their relative priority and the preferred method of handling conflict should be specified as part of the higher-level protocol used to deliver XML. In particular, please refer to [IETF RFC2376] "XML Media Types" which defines the text/xml and application/xml MIME types and provides some useful guidance. In the interests of interoperability, however, the following rule is recommended.

If an XML entity is in a file, the Byte-Order Mark and encoding-declaration PI are used (if present) to determine the character encoding. All other heuristics and sources of information are solely for error recovery.

Appendix A.2

Add a non-normative reference:

IETF RFC2376: IETF (Internet Engineering Task Force). RFC 2376: XML Media Types, ed. E. Whitehead, M. Murata. 1998.

Rationale:

Take RFC 2376 into account.

E47 Substantive Source: minutes XML-Syntax 1999-05-19 E62

Section 4.3.3: Prepend the following to the last sentence of the paragraph immediately preceding production [80]: "In the absence of external character encoding information (such as MIME headers), ".
Rationale:: The original only covered the case where no external information was available.

E46 Clarification Source: minutes XML-Syntax 1999-05-19 E60

Section 3.1: Append the following to the first paragraph after production [41]: "Note that the order of attribute specifications in a start-tag or empty-element tag is not significant."

E45 Substantive Source: minutes XML-Syntax 1999-05-19 E58

Section 3.1: Change the second sentence of the paragraph right after production [44] to read: "For interoperability, the empty-element tag should be used, and should only be used, for elements which are declared EMPTY."
Rationale:: "For interoperability" and "must" were combined inappropriately.

E44 Substantive Source: minutes XML-Syntax 1999-05-19 E56 and E64

Appendix F

Append the following to the second paragraph: "The notation ## is used to denote any byte value except 00." Adjust the itemized list of detection cases to read as follows:

With a Byte Order Mark:
 00 00 FE FF: UCS-4, big-endian machine (1234 order)
 FF FE 00 00: UCS-4, little-endian machine (4321 order)
 FE FF 00 ##:  UTF-16, big-endian
 FF FE ## 00:  UTF-16, little-endian
 EF BB BF: UTF-8
Without a Byte Order Mark:
 00 00 00 3C: UCS-4, big-endian machine (1234 order)
 3C 00 00 00: UCS-4, little-endian machine (4321 order)
 00 00 3C 00: UCS-4, unusual octet order (2143)
 00 3C 00 00: UCS-4, unusual octet order (3412)
 00 3C ## ##, 
 00 25 ## ##,
 00 20 ## ##,
 00 09 ## ##,
 00 0D ## ## or
 00 0A ## ##: Big-endian UTF-16 or ISO-10646-UCS-2. Note that, absent
              an encoding declaration, these cases are strictly
              speaking in error.
 3C 00 ## ##,
 25 00 ## ##,
 20 00 ## ##,
 09 00 ## ##,
 0D 00 ## ## or
 0A 00 ## ##: Little-endian UTF-16 or ISO-10646-UCS-2. Note that, absent
              an encoding declaration, these cases are strictly
              speaking in error.
 3C 3F 78 6D: UTF-8, ISO 646, ASCII, some part of ISO 8859, Shift-JIS,
              EUC, or any other 7-bit, 8-bit, or mixed-width encoding
              which ensures that the characters of ASCII have their
              normal positions, width, and values; the actual encoding
              declaration must be read to detect which of these
              applies, but since all of these encodings use the same
              bit patterns for the ASCII characters, the encoding
              declaration itself may be read reliably  
 4C 6F A7 94: EBCDIC (in some flavor; the full encoding declaration
              must be read to tell which code page is in use)
 other: UTF-8 without an encoding declaration, or else the data stream
        is corrupt, fragmentary, or enclosed in a wrapper of some kind

Add the following to the second paragraph after the list (this also takes care of the previous erratum on UTF-7): "Note: Since external parsed entities in UTF-16 may begin with any character, this autodetection does not always work. Also, because of the overloaded usage it makes of ASCII-valued bytes, the UTF-7 encoding may fail to be reliably detected."

Rationale:

Original version did not distinguish UCS-2, cases without Byte Order mark, UTF-8 with BOM, etc.
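As an illustration of the detection cases listed in E44, the table can be transcribed into a small lookup, shown here as a Python sketch. The names NZ, PATTERNS and guess_encoding_family are hypothetical, and the returned labels are shortened paraphrases of the list entries. The sketch follows the list literally, including the ## convention (any byte value except 00), and only identifies an encoding family; the encoding declaration would still have to be read to settle the actual encoding.

```python
NZ = None  # stands for "##" in the list: any byte value except 00

# Each entry pairs a leading byte pattern with a short label for the list item.
PATTERNS = [
    # With a Byte Order Mark
    ((0x00, 0x00, 0xFE, 0xFF), "UCS-4 big-endian (1234), BOM"),
    ((0xFF, 0xFE, 0x00, 0x00), "UCS-4 little-endian (4321), BOM"),
    ((0xFE, 0xFF, 0x00, NZ), "UTF-16 big-endian, BOM"),
    ((0xFF, 0xFE, NZ, 0x00), "UTF-16 little-endian, BOM"),
    ((0xEF, 0xBB, 0xBF), "UTF-8, BOM"),
    # Without a Byte Order Mark
    ((0x00, 0x00, 0x00, 0x3C), "UCS-4 big-endian (1234)"),
    ((0x3C, 0x00, 0x00, 0x00), "UCS-4 little-endian (4321)"),
    ((0x00, 0x00, 0x3C, 0x00), "UCS-4 unusual octet order (2143)"),
    ((0x00, 0x3C, 0x00, 0x00), "UCS-4 unusual octet order (3412)"),
]

# First characters named in the list: '<', '%', space, tab, CR, LF.
for first in (0x3C, 0x25, 0x20, 0x09, 0x0D, 0x0A):
    PATTERNS.append(((0x00, first, NZ, NZ), "UTF-16 or UCS-2 big-endian, no BOM"))
    PATTERNS.append(((first, 0x00, NZ, NZ), "UTF-16 or UCS-2 little-endian, no BOM"))

PATTERNS.append(((0x3C, 0x3F, 0x78, 0x6D), "ASCII-compatible (read the declaration)"))
PATTERNS.append(((0x4C, 0x6F, 0xA7, 0x94), "EBCDIC (read the declaration)"))


def _matches(pattern, data: bytes) -> bool:
    """True if data starts with pattern, where NZ matches any non-zero byte."""
    if len(data) < len(pattern):
        return False
    return all((d != 0) if p is NZ else (d == p) for p, d in zip(pattern, data))


def guess_encoding_family(data: bytes) -> str:
    """Return the label of the first matching pattern, or the 'other' case."""
    for pattern, label in PATTERNS:
        if _matches(pattern, data):
            return label
    return "other"
```

The "other" result corresponds to the last list item: UTF-8 without an encoding declaration, or a data stream that is corrupt, fragmentary, or wrapped.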

E43 Editorial and Clarification Source: minutes XML-Syntax 1999-05-12 and 1999-05-19 E55

Appendix C: Change "conformant SGML document" to "conforming SGML documents". Delete the word "valid" from the first sentence, since XML documents that are well-formed but not valid are also conforming SGML documents.
Appendix A: Add reference to WebSGML amendment (Annex K of ISO 8879)

E42 Clarification Source: minutes XML-Syntax 1999-05-12 E52

Section 6: Change the first sentence of the second paragraph (after the "symbol ::= expression" example) to read: "Symbols are written with an initial capital letter if they are the start symbol of a regular language, otherwise with an initial lower case letter."

E41 Substantive Source: minutes XML-Syntax 1999-05-12 E51

Section 4.4: Change the definition corresponding to "Reference in DTD" to read: "as a reference within either the internal or external subsets of the DTD, but outside of an EntityValue, AttValue, PI, Comment, SystemLiteral or PubidLiteral." (with suitable links).
Rationale:: "PI, Comment, SystemLiteral or PubidLiteral" added to maintain compatibility with SGML.

E40 Substantive Source: minutes XML-Syntax 1999-05-12 E50

Section 3.1: In the first sentence of the paragraph immediately following production 43, change "must" to "should".
Rationale:: For an element containing only white space, "must" is unenforceable by a processor that doesn't know the content model of the element.

E39 Clarification Source: minutes XML-Syntax 1999-05-12 E49

Section 2.4: Add the following to the second paragraph: "Note that text that matches the nonterminal S (production [3]) is markup, not character data".
Section 2.10: In the first sentence of the first paragraph, remove the phrase ", denoted by the nonterminal S in this specification" from within the parentheses.
Rationale:: Clarify the distinction between white space corresponding to production [3] and other white space.

E38 Substantive Source: minutes XML-Syntax 1999-05-12 E48

Section 2.12: Add a paragraph immediately after production [38]: "The following is a non-normative summary of the definition of language codes in RFC 1766."
Appendix A: Move the references to ISO 639 and ISO 3166 from A.1 (normative) to A.2 (other).
Rationale:: Makes clear the original intent of having RFC 1766 normative and the rest (prose, ISO 639, ISO 3166) informative.

E37 Editorial Source: minutes XML-Syntax 1999-05-12 E47

Section 4.3.3: In the first sentence of the second paragraph, correct the reference to 10646 to "ISO/IEC 10646 Annex F" (instead of Annex E) and the reference to Unicode to "Unicode Section 2.4" (instead of Appendix B).

E36 Clarification Source: minutes XML-Syntax 1999-05-12 E46

Section 4.3.3: Correct the previous erratum to read "It is a fatal error for a TextDecl to occur other than at the beginning of an external entity." (It is a fatal error, not merely an error).

E35 Editorial Source: minutes XML-Syntax 1999-05-12 E72

Section 2.2: In the first paragraph, remove the word "graphic" from the third sentence (beginning with "Legal characters are tab, carriage return...").

E34 Substantive Source: minutes XML-Syntax 1999-05-12 E42a

Section 4.1: In the first sentence of the definition of the "Entity Declared" WFC, change the phrase "the Name given in the entity reference must match that in an entity declaration" to "for an entity reference that does not occur within the external subset or a parameter entity, the Name given in the entity reference must match that in an entity declaration that does not occur within the external subset or a parameter entity".
Rationale:: Consider a standalone document containing an entity reference, the entity being declared in the external subset. Without this change, a processor that doesn't read the external subset would find a violation of the WFC, whereas a processor that does read it wouldn't. The change ensures that all processors are able to determine whether the WFC is met for standalone documents.

E33 Substantive Source: minutes XML-Syntax 1999-02-24 E45

Section 5.1: Amend the last sentence of the last paragraph to read: "Except when "standalone='yes'", they must not process entity declarations or attribute-list declarations encountered after a reference to a parameter entity that is not read, since the entity may have contained overriding declarations."
Rationale:: Without the addition of 'Except when "standalone='yes'"', there is no guarantee that making a document standalone will cause all XML processors to report the same results to the application.

E32 Editorial Source: minutes XML-Syntax 1999-02-24 E44

Section 2.8: In the second sentence of the paragraph after production [27], change "document type definition" to "document type declaration".

E31 Substantive Source: minutes XML-Syntax 1999-02-24 E43

Section 3.1: Add a validity constraint to production [41] as follows: "Validity Constraint: Valid xml:lang: if the Name in an attribute specification is xml:lang, then the value, after normalization as an NMTOKEN, must match production [33]".
Rationale:: Despite a very clear intention, expressed by a full page of prose, there was nothing in the spec to enforce the validity of xml:lang.

E30 Editorial Source: minutes XML-Syntax 1999-02-24 E41

Section 2.3: Reword the second sentence of the paragraph after production [3] as follows: "A letter consists of an alphabetic or syllabic base character or an ideographic character."
Appendix B: Remove "; these classes combine to form the class of letters" from the first sentence.
Rationale:: The text was in contradiction with production [84].

E29 Substantive Source: minutes XML-Syntax 1999-02-24 E38

Section 4.1: From the definition of the "Entity Declared" WFC, remove the sentence "The declaration of a parameter entity must precede any reference to it."
Rationale:: This WFC does not apply to production [69] PEReference, so the offending sentence is a non sequitur. The sentence is present in the text of the Entity Declared VC, which does apply to [69].

E28 Clarification Source: minutes XML-Syntax 1999-02-24 E34

Section 3: To item number 2 of the list of valid cases for the "Element Valid" VC, add the following: "Note that a CDATA section containing only white space does not match the nonterminal S, and hence cannot appear between pairs of child elements."

E27 Clarification Source: minutes XML-Syntax 1999-02-24 E33

Section 2.5: After the example, add a paragraph reading "Note that the grammar does not allow a comment ending in '--->'. The following example is not well-formed." and an example: "<!-- B+, B, or B--->"
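The point of E27 can be checked mechanically against production [15], Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'. The following Python sketch does so with a regular expression. The name is_wellformed_comment is hypothetical, and the sketch assumes every character in the input already matches Char, a restriction the pattern itself does not enforce.

```python
import re

# Production [15]: the body is a run of non-hyphen characters, or a single
# hyphen followed by a non-hyphen; the comment then closes with '-->'.
_COMMENT = re.compile(r"<!--(?:[^-]|-[^-])*-->", re.DOTALL)


def is_wellformed_comment(text: str) -> bool:
    """True if text, taken as a whole, matches the Comment production."""
    return _COMMENT.fullmatch(text) is not None
```

Under this pattern a comment whose text ends with a hyphen, and which therefore closes with '--->', is rejected, as is any comment containing '--' in its body.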

E26 Clarification Source: minutes XML-Syntax 1999-02-24 E32

Section 4.2.2: Modify the second sentence of the paragraph immediately following the "Notation Declared" VC to read as follows: "It is a URI, meant to be dereferenced to obtain input for the XML processor to construct the entity's replacement text."
Rationale:: It wasn't clear to some that the URI should be dereferenced and the resulting byte stream treated as input to the XML processor to construct the entity's replacement text.

E25 Clarification Source: minutes XML-Syntax 1999-02-24 E30

Section 4: Amend the first sentence of the third paragraph to read: "An unparsed entity is a resource whose contents may or may not be text, and if text, may be other than XML."
Rationale:: The original text could be interpreted as saying that unparsed entities which are text can't be XML, which is wrong.

E24 Clarification Source: minutes XML-Syntax 1999-02-24 E28 and E37

Section 3.3.3

Replace the first paragraph, the itemized list of steps and the following paragraph with the following:

Before the value of an attribute is passed to the application or checked for validity, but after the end-of-line normalization described in section 2.11 has been performed, the XML processor must normalize the attribute value as follows:

Begin with a normalized value consisting of the empty string.
For each character, entity reference, or character reference in the unnormalized attribute value, beginning with the first and continuing to the last, do the following:
- For a character reference, append the referenced character to the normalized value.
- For an entity reference, recursively process the replacement text of the entity.
- For a #xD#xA sequence in an external parsed entity or in the literal entity value of an internal parsed entity, append a space character (#x20) to the normalized value.
- For a whitespace character (#x20, #xD, #xA, #x9), append a space character (#x20) to the normalized value.
- For another character, append the character to the normalized value.

If the attribute type is not CDATA, then the XML processor must further process the normalized attribute value by discarding any leading and trailing space (#x20) characters, and by replacing sequences of space (#x20) characters by a single space (#x20) character.

Rationale:

The fact that the existing text describes an algorithm for filling in an initially empty string with the normalized value was widely misunderstood. There was also confusion regarding white space treatment.
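The normalization steps in E24 might be sketched in Python as follows. This is a simplified illustration, not a conforming implementation: the function name normalize_attribute and its parameters are hypothetical, entity replacement texts are supplied by the caller as a plain dictionary, a #xD#xA pair is collapsed to one space wherever it appears rather than only in the contexts the erratum names, and neither well-formedness nor recursive entity definitions are checked.

```python
import re

# One token per step: a hex character reference, a decimal character
# reference, an entity reference, a CR LF pair, or any single character.
_TOKEN = re.compile(
    r"&#x([0-9a-fA-F]+);|&#([0-9]+);|&([^;&\s]+);|(\r\n)|(.)", re.DOTALL
)


def normalize_attribute(raw: str, entities: dict, is_cdata: bool = True) -> str:
    out = []  # begin with a normalized value consisting of the empty string

    def process(text: str) -> None:
        for m in _TOKEN.finditer(text):
            hex_ref, dec_ref, name, crlf, ch = m.groups()
            if hex_ref is not None:
                out.append(chr(int(hex_ref, 16)))  # referenced character, as is
            elif dec_ref is not None:
                out.append(chr(int(dec_ref)))
            elif name is not None:
                process(entities[name])  # recurse into the replacement text
            elif crlf is not None:
                out.append(" ")  # a CR LF pair becomes one space
            elif ch in " \r\n\t":
                out.append(" ")  # any white space character becomes a space
            else:
                out.append(ch)

    process(raw)
    value = "".join(out)
    if not is_cdata:
        # Non-CDATA types: trim leading and trailing #x20, collapse runs of #x20.
        value = re.sub(" +", " ", value.strip(" "))
    return value
```

Note that a character produced by a character reference is appended unchanged, so a value written as x&#xA;y keeps its line feed, whereas a literal line feed in the same position becomes a space.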

E23 Editorial Source: minutes XML-Syntax 1999-02-24 E26

Section 4.3: Amend the last paragraph to read: "Examples of text declarations containing encoding declarations:"

E22 Substantive Source: minutes XML-Syntax 1999-02-24 E25

Section 4.7: Add a Validity Constraint to production [82] as follows:
"Validity Constraint: Unique Notation Name: only one notation declaration can declare a given Name."
Rationale:: The spec as written allows multiple declarations of NOTATIONs with the same name, which is wrong.

E21 Substantive Source: minutes XML-Syntax 1999-02-24 E23

Section 5.1: Change the third paragraph to read: "Validating processors must, at user option, report violations of the constraints expressed by...".
Rationale:: "at user option" was missing, in contradiction with section 1.2.

E20 Clarification Source: minutes XML-Syntax 1999-02-17 E20

Section 6: To the item about A B add the sentence: "Concatenation has higher precedence than alternation; thus A B | C D is identical to (A B) | (C D)." To the items about A+ and A* add analogous sentences.

E19 Clarification Source: minutes XML-Syntax 1999-02-17 E19

Section 3.2.1

Change the sentence

"For interoperability, if a parameter-entity reference appears in a choice, seq, or Mixed construct, its replacement text should not be empty, and neither the first nor last non-blank character of the replacement text should be a connector (| or ,)."

to:

"For interoperability, if a parameter-entity reference appears in a choice, seq, or Mixed construct, its replacement text must contain at least one non-blank character, and neither the first nor last non-blank character of the replacement text should be a connector (| or ,)."

Rationale:

Per 4.4.8, parameter entities are always padded with one space at each end, so the replacement text is never empty. Interoperability thus requires that there be at least one non-blank character.

E18 Clarification Source: minutes XML-Syntax 1999-02-17 E18

Section 2.4: Delete the second sentence of the third paragraph, which reads: "They are also legal within the literal entity value of an internal entity declaration; see "4.3.2 Well-Formed Parsed Entities". "
Rationale:: This sentence is bogus. When & or < are in a literal entity value they are being used as a markup delimiter, thus the whole second sentence is just confusing static.

Errata as of 1999-02-17.

Section 2.1: In the list headed "Matching the document production implies that:", the second list item has forward references to "start-tag" and "end-tag" which need to be marked as defined terms.
Section 2.2: The sentence "Legal characters are tab, carriage return, line feed, and the legal graphic characters of Unicode and ISO/IEC 10646" is somewhat in conflict with production [2] for Char, whose ranges include many character positions that are not yet defined by the Unicode/ISO 10646 standards. Change the text to make it clear that production [2] is normative; in practical terms this means that newly-added characters such as the Euro (&#x20AC; or &#8364;) are legal in XML documents.
Section 2.8: In production [24], the quotation-mark literals aren't quoted. Should be ("'" VersionNum "'" | '"' VersionNum '"')
Section 2.8: Just before production [28], the word "fuller" should be "further".
Section 2.8: To the paragraph before production [30], add "The external subset and any external parameter entities referred to in the DTD must match the production for extPE. See 4.3.2 Well-Formed Parsed Entities".
Section 3.2.1: The term "element content" needs to be marked as a formally defined term.
Section 3.2.1: Change the word "parenthetized" to "parenthesized."
Section 3.2.2: The name of the literal token PCDATA has no justification. Add to the paragraph after production [51]: 'The keyword PCDATA derives historically from the term "parsed character data."'
Section 3.3: The text "For interoperability, writers of DTDs may choose to provide at most one attribute-list declaration for a given element type, at most one attribute definition for a given attribute name, and at least one attribute definition in each attribute-list declaration." could be read as forbidding more than one attribute of the same name in a DTD. Change it to read "For interoperability, writers of DTDs may choose to provide at most one attribute-list declaration for a given element type, at most one attribute definition for a given attribute name in an attribute-list declaration, and at least one attribute definition in each attribute-list declaration."
Section 3.3.1: Immediately before production [54], delete ', as noted:' and add '. The validity constraints noted in the grammar are applied after the attribute value has been normalized as described in 3.3 Attribute-List Declarations'.
Section 3.3.1: The spec as written allows multiple attributes of type NOTATION on a single element, which defeats the purpose; add a Validity Constraint to production [58] as follows: "Validity Constraint: One Notation per Element Type: No element type may have more than one NOTATION attribute specified."
Section 4.: In the first sentence, the phrase "identified by name" should be replaced by "identified by entity name". Also, the phrase "see below" should be removed, and the phrase "document entity" made into a term-definition link.
Section 4.3.3: In the paragraph beginning "In the absence of information", delete the phrase "for an encoding declaration to occur other than at the beginning of an external entity". Add a new paragraph after that one reading "It is an error for a TextDecl to occur other than at the beginning of an external entity."
Section 4.4.5: The first example has "&YN;" - it should be "%YN;".
Section 6.: The notation used in Productions [13] ([-'()+,./:=?;!*#@$_%]) and [26] ([a-zA-Z0-9_.:]) is not described in the notation section, although the semantics are obvious. Need to add descriptions to the first definition-list in the Notation section.
Appendix A.2: The citations for the papers by Anne Brüggemann-Klein need improving: 'A. Brüggemann-Klein und D. Wood. Deterministic Regular Languages. Extended abstract in A. Finkel, M. Jantzen, Hrsg., STACS 1992, S. 173-184. Springer-Verlag, Berlin 1992. Lecture Notes in Computer Science 577. Full version titled "One-Unambiguous Regular Languages" in Information and Computation 140 (2): 229--253, February 1998.' and (to replace the Regular Expressions into Finite Automata) 'A. Brüggemann-Klein. Formal Models in Document Processing. Habilitationsschrift. Faculty of Mathematics at the University of Freiburg, 1993, available at ftp://ftp.informatik.uni-freiburg.de/documents/papers/brueggem/habil.ps.'
Appendix F: Add a note that the algorithm given here does not work for UTF-7.