XML 1.0 Specification Errata

Abstract

This document records all known errors in the Extensible Markup Language (XML) 1.0 Specification (W3C Recommendation 10 Feb 1998); for updates see the latest version.

The errata are numbered, classified as Substantial, Editorial or Clarification and listed in reverse chronological order of their date of publication. Early errata (1999-02-17 and before) are neither numbered, classified nor dated.

Please email error reports to xml-editor@w3.org.

Known Errors

Errata as of 2000-04-09.

E62 Substantial Source: XML Core WG list

Section 2.3: Change productions [6] Names and [8] Nmtokens to use #x20 (a single space character) instead of S:
"[6] Names ::= Name (#x20 Name)*"
"[8] Nmtokens ::= Nmtoken (#x20 Nmtoken)*"
Rationale: This change is necessary to preserve SGML compatibility. In principle it makes previously valid documents invalid, but it is believed that only contrived documents, not real ones, are affected.
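
The effect of the revised productions can be sketched with regular expressions. This is an illustration only: the NAME pattern is a deliberately simplified, ASCII-only stand-in for production [5] (the real production admits many more characters), and the function and variable names are hypothetical.

```python
import re

# Simplified, ASCII-only stand-in for production [5] Name (assumption).
NAME = r"[A-Za-z_:][A-Za-z0-9._:-]*"

# Production [6] as revised by E62: names separated by exactly one #x20.
NAMES_REVISED = re.compile(rf"{NAME}(?:\x20{NAME})*")

# Production [6] as originally published: names separated by S,
# that is, one or more of #x20, #x9, #xD, #xA.
NAMES_ORIGINAL = re.compile(rf"{NAME}(?:[\x20\x09\x0D\x0A]+{NAME})*")


def matches_names(value: str, revised: bool = True) -> bool:
    """Return True if value matches the (simplified) Names production."""
    pattern = NAMES_REVISED if revised else NAMES_ORIGINAL
    return pattern.fullmatch(value) is not None
```

Under this sketch, a value with two spaces between names matches only the original production, which is the kind of contrived document the rationale refers to.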

Errata as of 2000-02-17.

E61 Clarification Source: Richard Tobin

Section 3.3.3: Further clarify E24 by adding a new paragraph after the paragraph following the bulleted list (beginning "If the attribute type is not CDATA, then the XML processor must..."):
Note that if the unnormalized attribute value contains a character reference to a whitespace character other than space (#x20), the normalized value contains the referenced character itself (#xD, #xA or #x9). This contrasts with the case where the unnormalized value contains a whitespace character (not a reference), which is replaced with a space character (#x20) in the normalized value and also contrasts with the case where the unnormalized value contains an entity reference whose replacement text contains a whitespace character; being recursively processed, the whitespace character is replaced with a space character (#x20) in the normalized value.
Rationale: It was not completely clear how an attribute value containing a character reference to a whitespace character other than space is supposed to be normalized.

Errata as of 2000-01-31.

E60 Editorial Source: XML Core WG list

Section 2.12: Change the first sentence of the paragraph following the bullet list following production [38] to read:
There may be any number of Subcode segments; if the Langcode is an ISO639Code, and if the first subcode segment exists and consists of two letters, then it must be a country code from [ISO 3166], "Codes for the representation of names of countries."

Errata as of 2000-01-25.

E59 Editorial Source: xml-editor list

Obsoletes E28

Section 3

Change item number 2 of the list of valid cases for the "Element Valid" VC to read:

The declaration matches children and the sequence of child elements belongs to the language generated by the regular expression in the content model, with optional white space (characters matching the nonterminal S) between the start tag and the first child element, between child elements or between the last child element and the end tag. Note that a CDATA section containing only white space does not match the nonterminal S, and hence cannot appear in these positions.

E58 Editorial Source: xml-editor list

Appendix A.1

Rename the existing IANA entry IANA-CHARSETS (but keep the "IANA" link target to avoid breaking external links).

Section 4.3.3

Adjust the [IANA] reference in 4.3.3 accordingly.

Appendix A.2

Add a new entry IANA-LANGCODES as follows:

IANA-LANGCODES: (Internet Assigned Numbers Authority) Registry of language tags. See http://www.isi.edu/in-notes/iana/assignments/languages/.

Section 2.12

Change the [IANA] reference to point to this new entry.

E57 Clarification Source: xml-editor list

Section 4.3.3: Amend the second paragraph after production [81] to read (the first sentence is actually unchanged):
In an encoding declaration, the values "UTF-8", "UTF-16", "ISO-10646-UCS-2", and "ISO-10646-UCS-4" should be used for the various encodings and transformations of Unicode / ISO/IEC 10646, the values "ISO-8859-1", "ISO-8859-2", ... "ISO-8859-9" should be used for the parts of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS", and "EUC-JP" should be used for the various encoded forms of JIS X-0208-1997. It is recommended that character encodings registered (as charsets) with the Internet Assigned Numbers Authority [IANA], other than those just listed, should be referred to using their registered names; other encodings should use names starting with an "x-" prefix. XML processors should match character encoding names in a case-insensitive way and should either interpret an IANA-registered name as the encoding registered at IANA for that name or treat it as unknown (processors are of course not required to support all IANA-registered encodings).
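
The case-insensitive matching recommended above might be sketched as follows. The table holds only the names listed in the amended paragraph (with "ISO-8859-1" through "ISO-8859-9" expanded from the ellipsis); the function name and the "private" and None return values are assumptions made for illustration, and a real processor would also consult the IANA registry, which this sketch does not include.

```python
# Encoding names listed in the amended paragraph of section 4.3.3.
LISTED_NAMES = (
    ["UTF-8", "UTF-16", "ISO-10646-UCS-2", "ISO-10646-UCS-4"]
    + [f"ISO-8859-{part}" for part in range(1, 10)]
    + ["ISO-2022-JP", "Shift_JIS", "EUC-JP"]
)

# Map an upper-cased name to the spelling used in the specification.
_CANONICAL = {name.upper(): name for name in LISTED_NAMES}


def classify_encoding_name(declared: str):
    """Match a declared encoding name without regard to case.

    Returns the canonical spelling for a listed name, "private" for an
    "x-" prefixed name, and None for anything else (treated as unknown
    here because this sketch carries no IANA registry data).
    """
    key = declared.upper()
    if key in _CANONICAL:
        return _CANONICAL[key]
    if key.startswith("X-"):
        return "private"
    return None
```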

E56 Clarification Source: xml-editor list

Section 4.3.3: Amend the second sentence of the first paragraph to read "All XML processors must be able to read entities in both the UTF-8 and UTF-16 encodings."

Errata as of 2000-01-17.

E55 Editorial Source: minutes XML-Syntax 1999-02-10 E10

Obsoletes E12

Section 3.2.1: The term "content model" needs to be marked as a formally defined term.

E54 Editorial Source: minutes XML-Syntax 1999-02-24 E40

XML version of the spec: There are a couple of entities declared in the internal subset with names starting with 'xml' or 'XML', which is against the statement in 2.3 that such names are reserved for future standardization.

E53 Editorial Source: xml-editor list

Section 4: In the last paragraph before section 4.1, "Parameter entities" should be a defined term (bold).

E52 Editorial Source: xml-editor list

Section 3.2.1: In each of productions [49] and [50], the first instance of cp should be a hyperlink like the second instance.

E51 Editorial Source: xml-editor list

Section 4.4: There are incorrect hyperlinks in the row identified by "Occurs as Attribute Value" of the first table of section 4.4.
In the column headed "Character", "Not recognized" hyperlinks to "#not recognized" instead of "#not-recognized" (missing dash). In the columns headed "Internal General" and "External Parsed General", "Forbidden" hyperlinks to "#not-recognized" instead of "#forbidden".

Errata as of 2000-01-06.

E50 Substantial Source: minutes XML-Syntax 1999-05-26 E73

Section 3.2.1: Change the grammar for 'choice' in production [49] from:
"choice ::= '(' S? cp ( S? '|' S? cp )* S? ')'"
to:
"choice ::= '(' S? cp ( S? '|' S? cp )+ S? ')'"
(which amounts to changing the * into a +).
Rationale: Eliminate unnecessary ambiguity in the grammar for cp and children, which serves no purpose and confuses some implementors.
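
The ambiguity can be illustrated with a small sketch. It assumes a simplified cp (an ASCII name with an optional occurrence indicator, no nesting); the pattern names are hypothetical. With the old production, a single parenthesized particle such as "(a)" matched choice as well as seq; with the new one it no longer matches choice.

```python
import re

S_OPT = r"[\x20\x09\x0D\x0A]*"             # S? (optional white space)
CP = r"[A-Za-z_:][A-Za-z0-9._:-]*[?*+]?"   # simplified cp (assumption)

# Production [49] before E50: zero or more "| cp" alternatives.
CHOICE_OLD = re.compile(rf"\({S_OPT}{CP}(?:{S_OPT}\|{S_OPT}{CP})*{S_OPT}\)")

# Production [49] after E50: at least one "| cp" alternative.
CHOICE_NEW = re.compile(rf"\({S_OPT}{CP}(?:{S_OPT}\|{S_OPT}{CP})+{S_OPT}\)")


def is_choice(text: str, revised: bool = True) -> bool:
    """Return True if text matches the (simplified) choice production."""
    pattern = CHOICE_NEW if revised else CHOICE_OLD
    return pattern.fullmatch(text) is not None
```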

E49 Substantial Source: minutes XML-Syntax 1999-05-19 E67

Section 4.2.2: Amend the paragraph beginning "An XML processor should handle a non-ASCII character..." to read as follows: "Some URIs may contain characters that are either reserved (see [IETF RFC2396], section 2.2) or non-ASCII. An XML processor should handle such a character in a URI by representing the character in UTF-8 as one or more bytes, and then escaping these bytes with the URI escaping mechanism (i.e., by converting each byte to %HH, where HH is the hexadecimal notation of the byte value)."
Rationale: The original discussed only non-ASCII characters; the amended text also covers reserved characters.
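
A minimal sketch of the escaping step described in the amended paragraph: each affected character is encoded as UTF-8 and every resulting byte is written as %HH. The function name, the also_escape parameter and the example URI are assumptions; deciding which reserved characters need escaping in a given URI is left to the caller.

```python
def escape_uri(uri: str, also_escape: str = "") -> str:
    """Escape non-ASCII characters, plus any characters in also_escape.

    Each such character is represented in UTF-8 and every byte is
    converted to %HH, where HH is the hexadecimal byte value.
    """
    pieces = []
    for ch in uri:
        if ord(ch) > 0x7F or ch in also_escape:
            pieces.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
        else:
            pieces.append(ch)
    return "".join(pieces)
```

For example, escape_uri("http://example.org/caf\u00e9") yields "http://example.org/caf%C3%A9", because U+00E9 is the two bytes C3 A9 in UTF-8.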

E48 Clarification Source: minutes XML-Syntax 1999-02-24 E31 and minutes XML-Syntax 1999-05-19 E65

Appendix F

Modify the text from the paragraph beginning "The second possible case occurs when the XML entity..." to the end of the appendix to read:

The second possible case occurs when the XML entity is accompanied by encoding information, as in some file systems and some network protocols. When multiple sources of information are available, their relative priority and the preferred method of handling conflict should be specified as part of the higher-level protocol used to deliver XML. In particular, please refer to [IETF RFC2376] "XML Media Types" which defines the text/xml and application/xml MIME types and provides some useful guidance. In the interests of interoperability, however, the following rule is recommended.

If an XML entity is in a file, the Byte-Order Mark and encoding-declaration PI are used (if present) to determine the character encoding. All other heuristics and sources of information are solely for error recovery.

Appendix A.2

Add a non-normative reference:

IETF RFC2376: IETF (Internet Engineering Task Force). RFC 2376: XML Media Types, ed. E. Whitehead, M. Murata. 1998.

Rationale:

Take RFC 2376 into account.

E47 Substantial Source: minutes XML-Syntax 1999-05-19 E62

Section 4.3.3: Prepend the following to the last sentence of the paragraph immediately preceding production [80]: "In the absence of external character encoding information (such as MIME headers), ".
Rationale: The original only covered the case where no external information was available.

E46 Clarification Source: minutes XML-Syntax 1999-05-19 E60

Section 3.1: Append the following to the first paragraph after production [41]: "Note that the order of attribute specifications in a start-tag or empty-element tag is not significant."
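
As a small illustration, using Python's standard xml.etree.ElementTree module (the element and attribute names are made up for the example), two tags that differ only in attribute order yield the same attribute set:

```python
import xml.etree.ElementTree as ET

# Two empty-element tags with the same attributes in a different order.
first = ET.fromstring('<item id="1" lang="en"/>')
second = ET.fromstring('<item lang="en" id="1"/>')

# The attribute mappings compare equal regardless of specification order.
same_attributes = first.attrib == second.attrib
```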

E45 Substantial Source: minutes XML-Syntax 1999-05-19 E58

Section 3.1: Change the second sentence of the paragraph right after production [44] to read: "For interoperability, the empty-element tag should be used, and should only be used, for elements which are declared EMPTY."
Rationale: "For interoperability" and "must" were combined inappropriately.

E44 Substantial Source: minutes XML-Syntax 1999-05-19 E56 and E64

Appendix F

Append the following to the second paragraph: "The notation ## is used to denote any byte value except 00." Adjust the itemized list of detection cases to read as follows:

With a Byte Order Mark:
 00 00 FE FF: UCS-4, big-endian machine (1234 order)
 FF FE 00 00: UCS-4, little-endian machine (4321 order)
 FE FF 00 ##:  UTF-16, big-endian
 FF FE ## 00:  UTF-16, little-endian
 EF BB BF: UTF-8
Without a Byte Order Mark:
 00 00 00 3C: UCS-4, big-endian machine (1234 order)
 3C 00 00 00: UCS-4, little-endian machine (4321 order)
 00 00 3C 00: UCS-4, unusual octet order (2143)
 00 3C 00 00: UCS-4, unusual octet order (3412)
 00 3C ## ##, 
 00 25 ## ##,
 00 20 ## ##,
 00 09 ## ##,
 00 0D ## ## or
 00 0A ## ##: Big-endian UTF-16 or ISO-10646-UCS-2. Note that, absent
              an encoding declaration, these cases are strictly
              speaking in error.
 3C 00 ## ##,
 25 00 ## ##,
 20 00 ## ##,
 09 00 ## ##,
 0D 00 ## ## or
 0A 00 ## ##: Little-endian UTF-16 or ISO-10646-UCS-2. Note that, absent
              an encoding declaration, these cases are strictly
              speaking in error.
 3C 3F 78 6D: UTF-8, ISO 646, ASCII, some part of ISO 8859, Shift-JIS,
              EUC, or any other 7-bit, 8-bit, or mixed-width encoding
              which ensures that the characters of ASCII have their
              normal positions, width, and values; the actual encoding
              declaration must be read to detect which of these
              applies, but since all of these encodings use the same
              bit patterns for the ASCII characters, the encoding
              declaration itself may be read reliably  
 4C 6F A7 94: EBCDIC (in some flavor; the full encoding declaration
              must be read to tell which code page is in use)
 other: UTF-8 without an encoding declaration, or else the data stream
        is corrupt, fragmentary, or enclosed in a wrapper of some kind

Add the following to the second paragraph after the list (this also takes care of the previous erratum on UTF-7): "Note: Since external parsed entities in UTF-16 may begin with any character, this autodetection does not always work. Also, because of the overloaded usage it makes of ASCII-valued bytes, the UTF-7 encoding may fail to be reliably detected."

Rationale:

Original version did not distinguish UCS-2, cases without Byte Order mark, UTF-8 with BOM, etc.
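
The detection cases in the list above can be sketched as a lookup on the first four bytes. This is a hedged illustration, not a definitive implementation: the function name and the returned labels are hypothetical, and for the "## ##" pairs in the cases without a Byte Order Mark the sketch assumes "not both 00" rather than "neither byte is 00", since a literal reading would reject a UTF-16 entity that begins with "<?" (00 3C 00 3F).

```python
# Exact four-byte signatures taken from the list above.
EXACT = {
    b"\x00\x00\xFE\xFF": "UCS-4, big-endian (BOM)",
    b"\xFF\xFE\x00\x00": "UCS-4, little-endian (BOM)",
    b"\x00\x00\x00\x3C": "UCS-4, big-endian",
    b"\x3C\x00\x00\x00": "UCS-4, little-endian",
    b"\x00\x00\x3C\x00": "UCS-4, unusual octet order (2143)",
    b"\x00\x3C\x00\x00": "UCS-4, unusual octet order (3412)",
    b"\x3C\x3F\x78\x6D": "ASCII-compatible; read the encoding declaration",
    b"\x4C\x6F\xA7\x94": "EBCDIC; read the encoding declaration",
}

# First characters listed for UTF-16 without a BOM: < % space tab CR LF.
FIRST_CHARS = {0x3C, 0x25, 0x20, 0x09, 0x0D, 0x0A}


def detect_encoding_family(head: bytes) -> str:
    """Guess the encoding family from the first four bytes of an entity."""
    b = bytes(head[:4])
    if b[:3] == b"\xEF\xBB\xBF":
        return "UTF-8 (BOM)"
    if len(b) < 4:
        return "other"
    if b in EXACT:
        return EXACT[b]
    if b[:3] == b"\xFE\xFF\x00" and b[3] != 0:
        return "UTF-16, big-endian (BOM)"
    if b[:2] == b"\xFF\xFE" and b[2] != 0 and b[3] == 0:
        return "UTF-16, little-endian (BOM)"
    # Assumption: "## ##" is read as "the two bytes are not both 00".
    tail_not_zero = b[2:4] != b"\x00\x00"
    if b[0] == 0 and b[1] in FIRST_CHARS and tail_not_zero:
        return "UTF-16 or ISO-10646-UCS-2, big-endian"
    if b[1] == 0 and b[0] in FIRST_CHARS and tail_not_zero:
        return "UTF-16 or ISO-10646-UCS-2, little-endian"
    return "other"
```

The exact signatures are checked before the UTF-16 patterns so that the UCS-4 cases are never misread as UTF-16.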

E43 Editorial and Clarification Source: minutes XML-Syntax 1999-05-12 and minutes XML-Syntax 1999-05-19 E55

Appendix C: Change "conformant SGML document" to "conforming SGML documents". Delete the word "valid" from the first sentence, since XML documents that are well-formed but not valid are also conforming SGML documents.
Appendix A: Add a reference to the WebSGML amendment (Annex K of ISO 8879).

E42 Clarification Source: minutes XML-Syntax 1999-05-12 E52

Section 6: Change the first sentence of the second paragraph (after the "symbol ::= expression" example) to read: "Symbols are written with an initial capital letter if they are the start symbol of a regular language, otherwise with an initial lower case letter."

E41 Substantial Source: minutes XML-Syntax 1999-05-12 E51

Section 4.4: Change the definition corresponding to "Reference in DTD" to read: "as a reference within either the internal or external subsets of the DTD, but outside of an EntityValue, AttValue, PI, Comment, SystemLiteral or PubidLiteral." (with suitable links).
Rationale: "PI, Comment, SystemLiteral or PubidLiteral" added to maintain compatibility with SGML.

E40 Substantial Source: minutes XML-Syntax 1999-05-12 E50

Section 3.1: In the first sentence of the paragraph immediately following production [43], change "must" to "should".
Rationale: For an element containing only white space, "must" is unenforceable by a processor that doesn't know the content model of the element.

E39 Clarification Source: minutes XML-Syntax 1999-05-12 E49

Section 2.4: Add the following to the second paragraph: "Note that text that matches the nonterminal S (production [3]) is markup, not character data".
Section 2.10: In the first sentence of the first paragraph, remove the phrase ", denoted by the nonterminal S in this specification" from within the parentheses.
Rationale: Clarify the distinction between white space corresponding to production [3] and other white space.

E38 Substantial Source: minutes XML-Syntax 1999-05-12 E48

Section 2.12: Add a paragraph immediately after production [38]: "The following is a non-normative summary of the definition of language codes in RFC 1766."
Appendix A: Move the references to ISO 639 and ISO 3166 from A.1 (normative) to A.2 (other).
Rationale: Makes clear the original intent of having RFC 1766 normative and the rest (prose, ISO 639, ISO 3166) informative.

E37 Editorial Source: minutes XML-Syntax 1999-05-12 E47

Section 4.3.3: In the first sentence of the second paragraph, correct the reference to 10646 to "ISO/IEC 10646 Annex F" (instead of Annex E) and the reference to Unicode to "Unicode Section 2.4" (instead of Appendix B).

E36 Clarification Source: minutes XML-Syntax 1999-05-12 E46

Section 4.3.3: Correct the previous erratum to read "It is a fatal error for a TextDecl to occur other than at the beginning of an external entity." (It is a fatal error, not merely an error).

E35 Editorial Source: minutes XML-Syntax 1999-05-12 E72

Section 2.2: In the first paragraph, remove the word "graphic" from the third sentence (beginning with "Legal characters are tab, carriage return...").

E34 Substantial Source: minutes XML-Syntax 1999-05-12 E42a

Section 4.1: In the first sentence of the definition of the "Entity Declared" WFC, change the phrase "the Name given in the entity reference must match that in an entity declaration" to "for an entity reference that does not occur within the external subset or a parameter entity, the Name given in the entity reference must match that in an entity declaration that does not occur within the external subset or a parameter entity".
Rationale: Suppose a standalone document containing an entity reference, the entity being declared in the external subset. Without this change, a processor that doesn't read the external subset would find a violation of the WFC, whereas a processor that does read it wouldn't. The change ensures that all processors are able to determine whether the WFC is met for standalone documents.

E33 Substantial Source: minutes XML-Syntax 1999-02-24 E45

Section 5.1: Amend the last sentence of the last paragraph to read: "Except when "standalone='yes'", they must not process entity declarations or attribute-list declarations encountered after a reference to a parameter entity that is not read, since the entity may have contained overriding declarations."
Rationale: Without the addition of 'Except when "standalone='yes'"', there is no guarantee that making a document standalone will cause all XML processors to report the same results to the application.

E32 Editorial Source: minutes XML-Syntax 1999-02-24 E44

Section 2.8: In the second sentence of the paragraph after production [27], change "document type definition" to "document type declaration".

E31 Substantial Source: minutes XML-Syntax 1999-02-24 E43

Section 3.1: Add a validity constraint to production [41] as follows: "Validity Constraint: Valid xml:lang: if the Name in an attribute specification is xml:lang, then the value, after normalization as an NMTOKEN, must match production [33]".
Rationale: Despite a very clear intention, expressed by a full page of prose, there was nothing in the spec to enforce the validity of xml:lang.

E30 Editorial Source: minutes XML-Syntax 1999-02-24 E41

Section 2.3: Reword the second sentence of the paragraph after production [3] as follows: "A letter consists of an alphabetic or syllabic base character or an ideographic character."
Appendix B: Remove "; these classes combine to form the class of letters" from the first sentence.
Rationale: The text was in contradiction with production [84].

E29 Substantial Source: minutes XML-Syntax 1999-02-24 E38

Section 4.1: From the definition of the "Entity Declared" WFC, remove the sentence "The declaration of a parameter entity must precede any reference to it."
Rationale: This WFC does not apply to production [69] PEReference, so the offending sentence is a non sequitur. The sentence is present in the text of the Entity Declared VC, which does apply to [69].

E28 Clarification Source: minutes XML-Syntax 1999-02-24 E34

Obsoleted by E59

Section 3: To item number 2 of the list of valid cases for the "Element Valid" VC, add the following: "Note that a CDATA section containing only white space does not match the nonterminal S, and hence cannot appear between pairs of child elements."

E27 Clarification Source: minutes XML-Syntax 1999-02-24 E33

Section 2.5: After the example, add a paragraph reading "Note that the grammar does not allow a comment ending in '--->'. The following example is not well-formed." and an example: "<!-- B+, B, or B--->"
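
A quick way to observe this rule is to feed a comment ending in '--->' to a parser. The sketch below uses Python's standard xml.etree.ElementTree module; the helper name, the wrapping doc element and the comment text are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document as well-formed."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


# A comment closed by "-->" is accepted; one ending in "--->" is rejected,
# because "--" may not occur inside a comment.
accepted = is_well_formed("<doc><!-- B+, B, or B--></doc>")
rejected = is_well_formed("<doc><!-- B+, B, or B---></doc>")
```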

E26 Clarification Source: minutes XML-Syntax 1999-02-24 E32

Section 4.2.2: Modify the second sentence of the paragraph immediately following the "Notation Declared" VC to read as follows: "It is a URI, meant to be dereferenced to obtain input for the XML processor to construct the entity's replacement text."
Rationale: It wasn't clear to some that the URI should be dereferenced and the resulting byte stream treated as input to the XML processor to construct the entity's replacement text.

E25 Clarification Source: minutes XML-Syntax 1999-02-24 E30

Section 4: Amend the first sentence of the third paragraph to read: "An unparsed entity is a resource whose contents may or may not be text, and if text, may be other than XML."
Rationale: The original text could be interpreted as saying that unparsed entities which are text can't be XML, which is wrong.

E24 Clarification Source: minutes XML-Syntax 1999-02-24 E28 and E37

Further clarified by E61.

Section 3.3.3

Replace the first paragraph, the itemized list of steps and the following paragraph with the following:

Before the value of an attribute is passed to the application or checked for validity, but after the end-of-line normalization described in section 2.11 has been performed, the XML processor must normalize the attribute value as follows:

Begin with a normalized value consisting of the empty string.
For each character, entity reference, or character reference in the unnormalized attribute value, beginning with the first and continuing to the last, do the following:
- For a character reference, append the referenced character to the normalized value.
- For an entity reference, recursively process the replacement text of the entity.
- For a #xD#xA sequence in an external parsed entity or in the literal entity value of an internal parsed entity, append a space character (#x20) to the normalized value.
- For a whitespace character (#x20, #xD, #xA, #x9), append a space character (#x20) to the normalized value.
- For another character, append the character to the normalized value.

If the attribute type is not CDATA, then the XML processor must further process the normalized attribute value by discarding any leading and trailing space (#x20) characters, and by replacing sequences of space (#x20) characters by a single space (#x20) character.

Rationale:

The fact that the existing text describes an algorithm for filling in an initially empty string with the normalized value was widely misunderstood. There was also confusion regarding white space treatment.
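
The normalization steps above, together with the clarification in E61, can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, the entities argument maps entity names to their replacement text, names are matched with an ASCII-only pattern, and there is no guard against recursive entities.

```python
import re

# One token per step: hex character reference, decimal character reference,
# entity reference, a #xD#xA pair, or any other single character.
_TOKEN = re.compile(
    r"&#x([0-9A-Fa-f]+);|&#([0-9]+);|&([A-Za-z_:][A-Za-z0-9._:-]*);|\r\n|(.)",
    re.DOTALL,
)


def _process(text, entities, out):
    for match in _TOKEN.finditer(text):
        hex_ref, dec_ref, name, char = match.groups()
        if hex_ref is not None:
            out.append(chr(int(hex_ref, 16)))      # referenced character itself
        elif dec_ref is not None:
            out.append(chr(int(dec_ref)))          # referenced character itself
        elif name is not None:
            _process(entities[name], entities, out)  # recurse into replacement text
        elif char is None or char in " \r\n\t":
            out.append(" ")                        # #xD#xA pair or literal white space
        else:
            out.append(char)


def normalize_attribute_value(raw, entities=None, cdata=True):
    """Normalize an attribute value; set cdata=False for non-CDATA types."""
    out = []
    _process(raw, entities or {}, out)
    value = "".join(out)
    if not cdata:
        # Discard leading and trailing spaces, collapse runs of spaces.
        value = re.sub(" +", " ", value.strip(" "))
    return value
```

In this sketch a literal tab becomes a space, a character reference to a tab stays a tab, and a tab inside an entity's replacement text becomes a space, which are the three cases E61 contrasts.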

E23 Editorial Source: minutes XML-Syntax 1999-02-24 E26

Section 4.3: Amend the last paragraph to read: "Examples of text declarations containing encoding declarations:"

E22 Substantial Source: minutes XML-Syntax 1999-02-24 E25

Section 4.7: Add a Validity Constraint to production [82] as follows:
"Validity Constraint: Unique Notation Name: only one notation declaration can declare a given Name."
Rationale: The spec as written allows multiple declarations of NOTATIONs with the same name, which is wrong.

E21 Substantial Source: minutes XML-Syntax 1999-02-24 E23

Section 5.1: Change the third paragraph to read: "Validating processors must, at user option, report violations of the constraints expressed by...".
Rationale: "at user option" was missing, in contradiction with 1.2.

E20 Clarification Source: minutes XML-Syntax 1999-02-17 E20

Section 6: To the item about A B add the sentence: "Concatenation has higher precedence than alternation; thus A B | C D is identical to (A B) | (C D)." To the items about A+ and A* add analogous sentences.

E19 Clarification Source: minutes XML-Syntax 1999-02-17 E19

Section 3.2.1

Change the sentence

"For interoperability, if a parameter-entity reference appears in a choice, seq, or Mixed construct, its replacement text should not be empty, and neither the first nor last non-blank character of the replacement text should be a connector (| or ,)."

to

"For interoperability, if a parameter-entity reference appears in a choice, seq, or Mixed construct, its replacement text must contain at least one non-blank character, and neither the first nor last non-blank character of the replacement text should be a connector (| or ,)."

Rationale:

Per 4.4.8, parameter entities are always padded with one space at each end, so the replacement text is never empty. Interoperability thus requires that there be at least one non-blank character.
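
The amended interoperability guideline can be expressed as a small check. The function name is hypothetical, and "blank" is assumed here to mean the four white space characters matched by S.

```python
XML_BLANKS = " \t\r\n"  # assumed meaning of "blank": #x20, #x9, #xD, #xA


def pe_replacement_text_is_interoperable(replacement_text: str) -> bool:
    """Check the E19 guideline for a parameter entity's replacement text.

    The text must contain at least one non-blank character, and neither
    its first nor its last non-blank character may be a connector.
    """
    core = replacement_text.strip(XML_BLANKS)
    if not core:
        return False
    return core[0] not in "|," and core[-1] not in "|,"
```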

E18 Clarification Source: minutes XML-Syntax 1999-02-17 E18

Section 2.4: Delete the second sentence of the third paragraph, which reads: "They are also legal within the literal entity value of an internal entity declaration; see "4.3.2 Well-Formed Parsed Entities". "
Rationale: This sentence is bogus. When & or < are in a literal entity value they are being used as a markup delimiter, thus the whole second sentence is just confusing static.

Errata as of 1999-02-17.

E17

Section 2.1: In the list headed "Matching the document production implies that:", the second list item has forward references to "start-tag" and "end-tag" which need to be marked as defined terms.

E16

Section 2.2: The sentence "Legal characters are tab, carriage return, line feed, and the legal graphic characters of Unicode and ISO/IEC 10646" is somewhat in conflict with production [2] for Char, whose ranges include many character positions that are not yet defined by the Unicode/ISO 10646 standards. Change the text to make it clear that production [2] is normative; in practical terms this means that newly-added characters such as the Euro (&#8364; or &#x20AC;) are legal in XML documents.

E15

Section 2.8: In production [24], the quotation-mark literals aren't quoted. Should be ("'" VersionNum "'" | '"' VersionNum '"')

E14

Section 2.8: Just before production [28], the word "fuller" should be "further".

E13

Section 2.8: To the paragraph before production [30], add "The external subset and any external parameter entities referred to in the DTD must match the production for extPE. See 4.3.2 Well-Formed Parsed Entities".

E12

Obsoleted by E55

Section 3.2.1: The term "element content" needs to be marked as a formally defined term.

E11

Section 3.2.1: Change the word "parenthetized" to "parenthesized."

E10

Section 3.2.2: The name of the literal token PCDATA has no justification. Add to the paragraph after production [51]: 'The keyword PCDATA derives historically from the term "parsed character data."'

E9

Section 3.3: The text "For interoperability, writers of DTDs may choose to provide at most one attribute-list declaration for a given element type, at most one attribute definition for a given attribute name, and at least one attribute definition in each attribute-list declaration." could be read as forbidding more than one attribute of the same name in a DTD. Change it to read "For interoperability, writers of DTDs may choose to provide at most one attribute-list declaration for a given element type, at most one attribute definition for a given attribute name in an attribute-list declaration, and at least one attribute definition in each attribute-list declaration."

E8

Section 3.3.1: Immediately before production [54], delete ', as noted:' and read '. The validity constraints noted in the grammar are applied after the attribute value has been normalized as described in 3.3 Attribute-List Declarations'.

E7

Section 3.3.1: The spec as written allows multiple attributes of type NOTATION on a single element, which defeats the purpose; add a Validity Constraint to production [58] as follows: "Validity Constraint: One Notation per Element Type: No element type may have more than one NOTATION attribute specified."

E6

Section 4: In the first sentence, the phrase "identified by name" should be replaced by "identified by entity name". Also, the phrase "see below" should be removed, and the phrase "document entity" made into a term-definition link.

E5

Section 4.3.3: In the paragraph beginning "In the absence of information", delete the phrase "for an encoding declaration to occur other than at the beginning of an external entity". Add a new paragraph after that one reading "It is an error for a TextDecl to occur other than at the beginning of an external entity."

E4

Section 4.4.5: The first example has "&YN;" - it should be "%YN;".

E3

Section 6: The notation used in productions [13] ([-'()+,./:=?;!*#@$_%]) and [26] ([a-zA-Z0-9_.:]) is not described in the notation section, although the semantics are obvious. Descriptions need to be added to the first definition-list in the Notation section.

E2

Appendix A.2: The citations for the papers by Anne Brüggemann-Klein need improving: 'A. Brüggemann-Klein und D. Wood. Deterministic Regular Languages. Extended abstract in A. Finkel, M. Jantzen, Hrsg., STACS 1992, S. 173-184. Springer-Verlag, Berlin 1992. Lecture Notes in Computer Science 577. Full version titled "One-Unambiguous Regular Languages" in Information and Computation 140 (2): 229--253, February 1998.' and (to replace the Regular Expressions into Finite Automata) 'A. Brüggemann-Klein. Formal Models in Document Processing. Habilitationsschrift. Faculty of Mathematics at the University of Freiburg, 1993, available at ftp://ftp.informatik.uni-freiburg.de/documents/papers/brueggem/habil.ps.'

E1

Appendix F: Add a note that the algorithm given here does not work for UTF-7.

Last updated $Date: 2000/04/10 11:41:40 $ by $Author: fyergeau $