This is a W3C Working Draft for review by W3C members and other
interested parties. It is a draft document and may be updated,
replaced or obsoleted by other documents at any time. It is
inappropriate to use W3C Working Drafts as reference material or to
cite them as other than "work in progress". A list of current W3C
working drafts can be found at http://www.w3.org/TR.
Note: Since working drafts are subject to frequent
change, you are advised to reference the above URL, rather than the
URLs for working drafts themselves.
This work is part of the W3C SGML
Activity (for current status, see
http://www.w3.org/MarkUp/SGML/Activity).
Extensible Markup Language (XML) is an extremely simple dialect of
SGML which is completely described in this document. The goal is to
enable generic SGML to be served, received, and processed on the Web
in the way that is now possible with HTML. XML has been designed for
ease of implementation and for interoperability with both SGML and
HTML.
Extensible Markup Language
Chicago, Vancouver, Mountain View, et al.:
World-Wide Web Consortium, XML Working Group, 1996, 1997.
Created in electronic form.
English
Extended Backus-Naur Form (formal grammar)
1997-07-24 : CMSMcQ : correct error (lost *) in definition
of ignoreSectContents (thanks to Makoto Murata)
Allow all empty elements to have end-tags, consistent with
SGML TC (as per JJC).
1997-07-23 : CMSMcQ : pre-emptive strike on pending corrections:
introduce the term 'empty-element tag', note that all empty elements
may use it, and elements declared EMPTY must use it.
Add WFC requiring encoding decl to come first in an entity.
Redefine notations to point to PIs as well as binary entities.
Change autodetection table by removing bytes 3 and 4 from
examples with Byte Order Mark.
Add content model as a term and clarify that it applies to both
mixed and element content.
1997-06-30 : CMSMcQ : change date, some cosmetic changes,
changes to productions for choice, seq, Mixed, NotationType,
Enumeration. Follow James Clark's suggestion and prohibit
conditional sections in internal subset. TO DO: simplify
production for ignored sections as a result, since we don't
need to worry about parsers which don't expand PErefs finding
a conditional section.
1997-06-29 : TB : various edits
1997-06-29 : CMSMcQ : further changes:
suppress old FINAL EDIT comments and some dead material.
revise occurrences of % in grammar to exploit Henry Thompson's pun,
especially markupdecl and attdef
remove RMD requirement relating to element content (?)
1997-06-28 : CMSMcQ : Various changes for 1 July draft:
add text for draconian error handling (introduce
the term Fatal Error)
RE deleta est (changing wording from
original announcement to restrict the requirement to validating
parsers)
tag definition of validating processor and link to it
add colon as name character
change def of %operator
change standard definitions of lt, gt, amp
strip leading zeros from #x00nn forms.
1997-04-02 : CMSMcQ : final corrections of editorial errors
found in last night's proofreading. Reverse course once more on
well-formed: Webster's Second hyphenates it, and that's enough
for me.
1997-04-01 : CMSMcQ : corrections from JJC, EM, HT, and self
1997-03-31 : Tim Bray : many changes
1997-03-29 : CMSMcQ : some Henry Thompson (on entity handling),
some Charles Goldfarb, some ERB decisions (PE handling in miscellaneous
declarations). Changed Ident element to accept def attribute.
Allow normalization of Unicode characters. move def of systemliteral
into section on literals.
1997-03-28 : CMSMcQ : make as many corrections as possible, from
Terry Allen, Norbert Mikula, James Clark, Jon Bosak, Henry Thompson,
Paul Grosso, and self. Among other things: give in on "well formed"
(Terry is right), tentatively rename QuotedCData as AttValue
and Literal as EntityValue to be more informative, since attribute
values are the only place QuotedCData was used, and
vice versa for entity text and Literal. (I'd call it Entity Text,
but 8879 uses that name for both internal and external entities.)
1997-03-26 : CMSMcQ : resynch the two forks of this draft, reapply
my changes dated 03-20 and 03-21. Normalize old 'may not' to 'must not'
except in the one case where it meant 'may or may not'.
1997-03-21 : TB : massive changes on plane flight from Chicago
to Vancouver
1997-03-21 : CMSMcQ : correct as many reported errors as possible.
1997-03-20 : CMSMcQ : correct typos listed in CMSMcQ hand copy of spec.
1997-03-20 : CMSMcQ : cosmetic changes preparatory to revision for
WWW conference April 1997: restore some of the internal entity
references (e.g. to docdate, etc.), change character xA0 to &nbsp;
and define nbsp as &#160;, and refill a lot of paragraphs for
legibility.
1996-11-12 : CMSMcQ : revise using Tim's edits.
Add list type of NUMBERED and change most lists either to
BULLETS or to NUMBERED.
suppress QuotedNames, Names (not used)
correct trivial-grammar doc type decl
rename 'marked section' as 'CDATA section' passim
Also edits from James Clark:
define the set of characters from which [^abc] subtracts
charref should use just [0-9] not Digit
location info needs cleaner treatment: remove? (ERB
question)
one example of a PI has wrong pic.
clarify discussion of encoding names
encoding failure should lead to unspecified results; don't
prescribe error recovery
don't require exposure of entity boundaries
ignore white space in element content
reserve entity names of the form u-NNNN
clarify relative URLs
And some of my own:
correct productions for content model: model cannot
consist of a name, so "elements ::= cp" is no good.
1996-11-11 : CMSMcQ : revise for style.
Add new rhs to entity declaration, for parameter entities.
1996-11-10 : CMSMcQ : revise for style.
Fix / complete section on names, characters.
Add sections on parameter entities, conditional sections.
Still to do: Add compatibility note on deterministic content models.
Finish stylistic revision.
1996-10-31 : TB : Add Entity Handling section
1996-10-30 : TB : Clean up term & termdef. Slip in
ERB decision re EMPTY.
1996-10-28 : TB : Change DTD. Implement some of Michael's
suggestions. Change comments back to //. Introduce language for
XML namespace reservation. Add section on white-space handling.
Lots more cleanup.
1996-10-24 : CMSMcQ : quick tweaks, implement some ERB
decisions. Characters are not integers. Comments are /* */ not //.
Add bibliographic refs to 10646, HyTime, Unicode.
Rename old Cdata as MsData since it's only seen
in marked sections. Call them attribute-value pairs not
name-value pairs, except once. Internal subset is optional, needs
'?'. Implied attributes should be signaled to the app, not
have values supplied by processor.
1996-10-16 : TB : track down & excise all DSD references;
introduce some EBNF for entity declarations.
1996-10-?? : TB : consistency check, fix up scraps so
they all parse, get formatter working, correct a few productions.
1996-10-10/11 : CMSMcQ : various maintenance, stylistic, and
organizational changes:
replace a few literals with xmlpio and
pic entities, to make them consistent and ensure we can change pic
reliably when the ERB votes
drop paragraph on recognizers from notation section
add match, exact match to terminology
move old 2.2 XML Processors and Apps into intro
mention comments, PIs, and marked sections in discussion of
delimiter escaping
streamline discussion of doctype decl syntax
drop old section of 'PI syntax' for doctype decl, and add
section on partial-DTD summary PIs to end of Logical Structures
section
revise DSD syntax section to use Tim's subset-in-a-PI
mechanism
1996-10-10 : TB : eliminate name recognizers (and more?)
1996-10-09 : CMSMcQ : revise for style, consistency through 2.3
(Characters)
1996-10-09 : CMSMcQ : re-unite everything for convenience,
at least temporarily, and revise quickly
1996-10-08 : TB : first major homogenization pass
1996-10-08 : TB : turn "current" attribute on div type into
CDATA
1996-10-02 : TB : remould into skeleton + entities
1996-09-30 : CMSMcQ : add a few more sections prior to exchange
with Tim.
1996-09-20 : CMSMcQ : finish transcribing notes.
1996-09-19 : CMSMcQ : begin transcribing notes for draft.
1996-09-13 : CMSMcQ : made outline from notes of 09-06,
do some housekeeping
Extensible Markup Language Version &XML.version;
Document W3C-SGML-ERB DD-1996-0004
&doc.date;
This draft is intended for public discussion.
It is subject to approval by the XML Working Group.
Introduction
The Extensible Markup Language, abbreviated XML, describes a class of
data objects called XML documents and partially
describes the behavior of
computer programs which process them. XML is an application profile or
restricted form of SGML, the Standard Generalized Markup Language [ISO 8879].
XML documents are made up of storage units called entities, which contain either text or binary data.
Text is made up of characters, some
of which form the character data in the
document, and some of which form markup.
Markup encodes a description of the document's storage layout and
logical structure. XML provides a mechanism to impose constraints on
the storage layout and logical structure.
A software module
called an XML processor is used to read XML documents
and provide access to their content and structure. It is assumed that an XML processor is
doing its work on behalf of another module, referred to as the
application. This specification describes the
required behavior of an XML processor in terms of how it must read XML
data and the information it must provide to the application.
Origin and Goals
XML was developed by an XML Working Group (originally known as the
SGML Editorial Review Board) formed under the auspices of the World
Wide Web Consortium (W3C) in 1996 and chaired by Jon Bosak of Sun
Microsystems with the very active participation of an XML Special
Interest Group (previously known as the SGML Working Group) also
organized by the W3C. The membership of the XML Working Group is given
in an appendix. Dan Connolly served as the WG's contact with the W3C.
The design goals for XML are:
XML shall be straightforwardly usable over the
Internet.
XML shall support a wide variety of applications.
XML shall be compatible with SGML.
It shall be easy to write programs which process XML
documents.
The number of optional features in XML is to be kept to the
absolute minimum, ideally zero.
XML documents should be human-legible and reasonably
clear.
The XML design should be prepared quickly.
The design of XML shall be formal and concise.
XML documents shall be easy to create.
Terseness in XML markup is of minimal importance.
This specification, together with the associated standards, provides
all the information necessary to understand XML version &XML.version;
and construct computer programs to process it.
This version of the XML specification (&doc.date;)
is for &doc.audience;.
It &doc.distribution;.
Relationship to Existing Standards
Standards relevant to users and implementors of XML include:
SGML (ISO 8879:1986). By definition, valid XML documents are conformant SGML
documents in the sense described in ISO standard 8879. The current
draft of this specification
presupposes the successful completion of the current
work on a technical corrigendum to ISO 8879 now being prepared
by ISO/IEC JTC1/SC18/WG8. If the corrigendum is not
adopted in the expected form, some clauses of this specification
may change, and some
recommendations now labeled for
interoperability will become requirements labeled
for compatibility.
Unicode and ISO/IEC 10646. This specification depends on the
international
standard ISO/IEC
10646 (with amendments AM 1 through AM 5)
and the
Unicode Standard, Version
2.0, which define the encodings and meanings of the
characters which make up XML text
data. All the characters in ISO/IEC 10646
are present, at the same code points, in Unicode.
IETF RFC 1738 and RFC 1808.
RFC 1738 and RFC 1808 define the syntax and semantics of Uniform Resource
Locators, or URLs.
Terminology
Some terms used with special meaning in this specification are:
may
Conforming data and XML processors are permitted to, but need not,
behave as described.
must
Conforming data and XML processors are required to
behave as described; otherwise
they are in error.
error
A violation of the rules of this
specification; results are
undefined. Conforming software may detect and report an error and may
recover from it.
fatal error
An error
which conforming software must detect and report to the application.
After encountering a fatal error, an XML processor may continue
processing the data to search for further errors and may report such
errors to the application. In order to support correction of errors,
the processor may make unprocessed text from the document (with
intermingled character data and markup) available to the application.
Once a fatal error is detected, however, the processor must not
continue normal processing (i.e. it must not
continue to pass character data and information about the document's
logical structure to the application in the normal way).
validity constraint
A rule which applies to all valid XML
documents.
Violations of
validity constraints are errors; they must, at user option, be reported
by validating XML processors.
well-formedness constraint
A rule which applies to all well-formed XML documents.
Violations of well-formedness constraints are
fatal errors.
at user option
Conforming software may or must (depending on the modal verb in the
sentence) behave as described; if it does, it must
provide users a means to enable or disable the behavior
described.
match
(Of strings or names:)
Case-insensitive match: two strings or names being
compared match if they are identical after case-folding.
(Of strings and rules in the grammar:)
A string matches a grammatical production if it belongs to the
language generated by that production.
(Of content and content models:)
The content of a parent element
in a document matches the content model
for that element if (a) the content model matches the rule for
Mixed and the content consists of
character data and elements whose names match names in the
content model, or if (b) the content model matches the rule for
elements, and the sequence of
child elements
belongs to the language generated by the regular expression in
the content model.
case-folding
A process applied
to a sequence of characters, in which those identified as
non-uppercase (in scripts which have case distinctions) are replaced
by their uppercase equivalents, as specified in The Unicode Standard, Version 2.0,
section 4.1.
Note that Unicode recommends folding to lowercase; for compatibility reasons, XML processors must
fold to uppercase. Case-folding, as described here, neither requires
nor forbids the normalization of Unicode character sequences into
canonical form (e.g. as described in The Unicode Standard, section 5.9).
exact match
Case-sensitive
string match: two strings or names being compared must be identical.
Characters with multiple possible representations in ISO/IEC 10646 (e.g.
characters with
both precomposed and base+diacritic forms) match only if they have the
same representation in both strings.
At user option, processors may normalize such characters to their
canonical form.
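The two kinds of string match can be illustrated with a minimal Python sketch. It is not a conforming implementation: Python's str.upper() applies the case mappings of whatever Unicode version the interpreter ships with, not those of Unicode 2.0, and the function names and the choice of NFC as the "canonical form" are assumptions made for illustration.

```python
import unicodedata


def case_insensitive_match(a: str, b: str) -> bool:
    # Fold both strings to uppercase before comparing, as the
    # definition of case-folding requires (assumption: str.upper()
    # stands in for the Unicode 2.0 mapping).
    return a.upper() == b.upper()


def exact_match(a: str, b: str, normalize: bool = False) -> bool:
    # Case-sensitive comparison. Precomposed and base+diacritic forms
    # differ unless the user opts into normalization (NFC is an
    # assumed stand-in for "canonical form").
    if normalize:
        a = unicodedata.normalize("NFC", a)
        b = unicodedata.normalize("NFC", b)
    return a == b
```

With normalization left off, a precomposed "é" and the sequence "e" plus combining acute accent do not match; with it switched on, they do.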
for compatibility
A feature of
XML included solely to ensure that XML remains compatible with SGML.
for interoperability
A
non-binding recommendation included to increase the chances that XML
documents can be processed by the existing installed base of SGML
processors which predate the
technical corrigendum to ISO 8879 now in the process of preparation
by ISO/IEC JTC1/SC18/WG8.
Notation
The formal grammar of XML is given using a simple Extended
Backus-Naur Form (EBNF) notation. Each rule in the grammar defines one
symbol, in the form
symbol ::= expression
Symbols are written with an initial capital letter if they are
defined by a regular expression, or with an initial lowercase letter if
a recursive grammar is required for recognition.
Literal strings are quoted; unless otherwise noted
they are case-insensitive.
The distinction between symbols which can and cannot
be recognized using simple regular expressions may be used to set the
boundary between an implementation's lexical
scanner and its parser, but this specification neither constrains the
placement of that boundary nor presupposes that all implementations
will have one.
Within the expression on the right-hand side of a rule, the
meaning of symbols is as shown below.
#xN
where N is a hexadecimal integer, the
expression represents the character in ISO/IEC 10646 whose canonical
(UCS-4) bit string, when interpreted as an unsigned binary number, has
the value indicated. The number of leading zeroes in the
#xN form is insignificant; the number of leading
zeroes in the corresponding bit string is governed by the character
encoding in use and is not significant for XML.
[a-zA-Z], [#xN-#xN]
represents any character
with a value in the range(s) indicated (inclusive).
[^a-z], [^#xN-#xN]
represents any character
with a value outside the
range indicated.
[^abc], [^#xN#xN#xN]
represents any character
with a value not among the characters given.
"string"
represents a literal string matching that
given inside the double quotes.
'string'
represents a literal string matching that
given inside the single quotes.
a b
a followed by b.
a | b
a or b but not both.
a - b
the set of strings represented by
a but not represented by
b.
a?
a or nothing; optional a.
a+
one or more occurrences of a.
a*
zero or more occurrences of a.
%a
specifies that in the external DTD subset a
parameter entity may occur in the
text at the position where a may occur; if so, its
replacement text must match S? a S?. If
the expression a is governed by a suffix operator, then
the suffix operator determines both the maximum number of parameter-entity
references allowed and the number of occurrences of a
in the replacement text of the parameter entities: %a*
means that a must occur zero or more times, and
that some of its occurrences may be replaced by references to
parameter entities whose replacement text must contain zero or
more occurrences of a; it is thus a more compact way
of writing %(a*)*.
Similarly, %a+ means that a
must occur one or more times, and may be replaced by
parameter entities with replacement text matching
S? (a S?)+.
The recognition of parameter entities in the internal subset is much more
highly constrained.
(expression)
expression is treated as a unit, and
may carry the % prefix operator, or a suffix operator:
?, *, or +.
/* ... */
comment.
[ WFC: ... ]
Well-formedness check; this identifies by name a check for
well-formedness associated with
a production.
[ VC: ... ]
Validity check; this identifies by name a check for
validity associated with
a production.
Common Syntactic Constructs
This section defines some symbols used widely in the grammar.
S (white space) consists of one or more space (#x20)
characters, carriage returns, line feeds, or tabs.
S ::= (#x20 | #x9 | #xd | #xa)+
Legal
characters are tab, carriage return, line feed, and the legal graphic
characters of Unicode and ISO/IEC 10646.
Char ::= #x9 | #xA | #xD | [#x20-#xFFFD]
       | [#x00010000-#x7FFFFFFF]   /* any ISO/IEC 10646 UCS-4 code, FFFE and FFFF excluded */
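As an illustration, the S and Char productions can be transcribed directly into predicates over code points. The following Python sketch follows the productions exactly as written in this draft; the function names are hypothetical.

```python
def is_xml_char(cp: int) -> bool:
    """Return True if the code point cp matches the Char production."""
    return (
        cp in (0x9, 0xA, 0xD)             # tab, line feed, carriage return
        or 0x20 <= cp <= 0xFFFD           # FFFE and FFFF are excluded
        or 0x10000 <= cp <= 0x7FFFFFFF    # remaining UCS-4 codes
    )


def is_white_space(text: str) -> bool:
    """Return True if text matches S: one or more of space, tab, CR, LF."""
    return len(text) > 0 and all(ch in "\x20\x09\x0d\x0a" for ch in text)
```

Note that the empty string is not white space in the sense of S, since the production requires at least one character.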
Characters are classified for convenience as letters, digits, or other
characters. Letters consist of an alphabetic or syllabic
base character possibly
followed by one or more combining characters, or of an ideographic
character. Certain layout and format-control characters defined by
ISO/IEC 10646 should be ignored
when recognizing identifiers; these are defined by the
classes Ignorable and Extender.
Full definitions of the specific characters in each class
are given in the appendix on character
classes.
A Name is a token
beginning with a letter or underscore character and continuing with
letters, digits, hyphens, underscores, or full stops (together known
as name characters).
Names beginning with the string XML are
reserved for standardization in this or future versions of this
specification.
Note: the colon character is also
allowed within XML names; it is reserved for experimentation with
name spaces and schema scoping. Its meaning is expected to be
standardized at some future point, at which point those documents
using colon for experimental purposes will need to be updated.
(Note: there is no guarantee that any name-space mechanism
adopted for XML will in fact use colon as a name-space delimiter.)
In practice, this means that authors should not use colon in XML
names except as part of name-space experiments, but that implementors
should accept colon as a name character.
An
Nmtoken (name token) is any mixture of
name characters.
MiscName ::= '.' | '-' | '_' | ':'
           | CombiningChar
           | Ignorable
           | Extender
NameChar ::= Letter
           | Digit
           | MiscName
Name ::= (Letter | '_' | ':') (NameChar)*
Names ::= Name (S Name)*
Nmtoken ::= (NameChar)+
Nmtokens ::= Nmtoken (S Nmtoken)*
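The Name and Nmtoken productions might be checked along the following lines. This Python sketch is only an approximation: the authoritative Letter, Digit, and CombiningChar classes are those listed in the appendix on character classes, whereas the sketch substitutes the general categories reported by Python's unicodedata module (an assumption) and omits the Ignorable and Extender classes altogether.

```python
import unicodedata


def _is_letter(ch: str) -> bool:
    # Assumption: general category L* approximates the Letter class.
    return unicodedata.category(ch).startswith("L")


def _is_name_char(ch: str) -> bool:
    cat = unicodedata.category(ch)
    return (
        ch in ".-_:"              # the punctuation listed in MiscName
        or cat.startswith("L")    # approximates Letter
        or cat == "Nd"            # approximates Digit
        or cat.startswith("M")    # approximates CombiningChar
    )


def is_name(s: str) -> bool:
    """Name ::= (Letter | '_' | ':') (NameChar)*"""
    return (
        len(s) > 0
        and (_is_letter(s[0]) or s[0] in "_:")
        and all(_is_name_char(ch) for ch in s[1:])
    )


def is_nmtoken(s: str) -> bool:
    """Nmtoken ::= (NameChar)+"""
    return len(s) > 0 and all(_is_name_char(ch) for ch in s)
```

The only difference between the two checks is the first character: "1abc" is an Nmtoken but not a Name.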
Literal data is any quoted string not containing
the quotation mark used as a delimiter for that string; different
forms of literal data may or may not contain
angle brackets, entity references, and character references. Literals are used
for specifying the replacement text of internal entities
(EntityValue),
the values of attributes (AttValue),
and external identifiers
(SystemLiteral); for some
purposes, the entire literal can be skipped without scanning for
markup within it (SkipLit):
EntityValue ::= '"' ([^%&"] | PEReference | Reference)* '"'
              | "'" ([^%&'] | PEReference | Reference)* "'"
AttValue ::= '"' ([^<&"] | Reference)* '"'
           | "'" ([^<&'] | Reference)* "'"
SystemLiteral ::= '"' URLchar* '"'
                | "'" (URLchar - "'")* "'"
URLchar ::= /* See RFC 1738 and 1808 */
PubidLiteral ::= '"' PubidChar* '"'
               | "'" (PubidChar - "'")* "'"
PubidChar ::= #x20 | #x9 | #xd | #xa | #x&IDEOSPACE;
            | [a-zA-Z0-9]
            | [-'()+,./:=?]
SkipLit ::= ('"' [^"]* '"')
          | ("'" [^']* "'")
Note that entity references and
character references are recognized and
processed within EntityValue and AttValue, but not within
SystemLiteral.
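The SkipLit production is simple enough to express as a regular expression. The Python sketch below shows how a scanner might skip over a quoted literal without looking for markup inside it; the function name and the returned pair are assumptions made for illustration.

```python
import re

# SkipLit ::= ('"' [^"]* '"') | ("'" [^']* "'")
SKIP_LIT = re.compile(r'"[^"]*"|\'[^\']*\'')


def skip_literal(text: str, pos: int = 0):
    """Skip the quoted literal starting at pos.

    Returns (content without the quotes, index just past the literal).
    Raises ValueError if no literal starts at pos.
    """
    m = SKIP_LIT.match(text, pos)
    if m is None:
        raise ValueError("no literal at position %d" % pos)
    return m.group()[1:-1], m.end()
```

Because the literal is delimited only by its opening quotation mark, a left angle bracket or the other kind of quotation mark inside it is passed over untouched.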
Documents
A textual object is an
XML document if it is
either valid or
well-formed, as
defined in this specification.
Logical and Physical Structure
Each XML document has both a logical and a physical structure.
Physically, the document is composed of units called entities. An entity may refer to other entities to cause their
inclusion in the document. A document begins in a "root" or document entity.
The logical structure contains declarations, elements,
comments,
character references, and
processing
instructions, all of which are indicated in the document by explicit
markup.
The two structures must be synchronous:
see section 4.1.
Well-Formed XML Documents
A textual object is
said to be a well-formed XML document if, first, it
matches the production labeled document, and if for
each entity reference which appears in
the document, either the entity has been declared in the document type declaration or the entity name is
one of: &magicents;.
Matching the document production
implies that:
It contains one or more
elements.
It meets all the well-formedness constraints (WFCs) given
in the grammar.
There is exactly
one element,
called the root, or document element, for
which neither the start-tag nor the
end-tag is in
the content of any other element.
For
all other
elements,
if the start-tag is in the content of another element, the end-tag is in
the content of the same element. More simply stated, the elements,
delimited by start- and end-tags, nest within each other.
As a consequence
of this,
for each non-root
element
C in the document, there is one other element P
in the document such that
C is in the content of P, but is not in
the content of any other element that is in the content of
P. Then P is referred to as the
parent of C, and C as a
child of P.
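The nesting requirement and the resulting parent/child relation can be sketched with a stack. The Python fragment below assumes, purely for illustration, that the tags have already been recognized and are supplied as a list of ("start", name) and ("end", name) pairs; the function name is hypothetical.

```python
def parent_map(events):
    """Map each element (numbered in start-tag order) to its parent.

    The root maps to None. Raises ValueError if the tags do not nest
    properly or if there is not exactly one root element.
    """
    stack = []      # (index, name) of the currently open elements
    parents = {}
    count = 0
    roots = 0
    for kind, name in events:
        if kind == "start":
            if not stack:
                roots += 1
            parents[count] = stack[-1][0] if stack else None
            stack.append((count, name))
            count += 1
        elif kind == "end":
            # An end-tag must close the most recently opened element.
            if not stack or stack[-1][1] != name:
                raise ValueError("improperly nested end-tag: " + name)
            stack.pop()
        else:
            raise ValueError("unknown event kind: " + kind)
    if stack or roots != 1:
        raise ValueError("document must have exactly one root element")
    return parents
```

For a root element a containing children b and c, the result maps a to None and both b and c to a.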
Characters
The data stored in an XML entity is
either text or binary. Binary data has an associated notation, identified by name; beyond a
requirement to make available the notation's name and the
associated system
identifier, XML places no constraints on the contents or use of binary
entities. So-called binary data might in fact be textual; its
identification as binary means that an XML processor need not parse
it in the fashion described by this specification.
XML text data is a sequence of characters.
A character is an atomic unit of
text;
valid
characters
are specified by ISO/IEC 10646.
Users may extend the ISO/IEC 10646 character repertoire by exploiting the
private use areas.
The mechanism for encoding character values into bit patterns may
vary from entity to entity. All XML processors must accept the UTF-8
and UCS-2 encodings of 10646; the mechanisms for signaling which of
the two are in use, or for bringing other encodings into play, are
discussed later, in the discussion of character encodings.
Regardless of the specific encoding used, any character in the ISO/IEC
10646 character set may be referred to by the decimal or hexadecimal
equivalent of its bit string.
Character Data and Markup
XML text consists of intermingled character
data and markup.
Markup takes the form of
start-tags,
end-tags,
empty elements,
entity references,
character references,
comments,
CDATA sections,
document type declarations, and
processing instructions.
All text that is not markup
constitutes the character data of
the document.
The ampersand character (&) and the left angle bracket (<)
may appear in their literal form only when used as markup
delimiters, or within comments, processing instructions, or CDATA sections. If they are needed elsewhere,
they must be escaped
using either numeric character references or the strings
"&amp;" and "&lt;". The right angle
bracket (>) may be represented using the string
"&gt;", and must, for
compatibility, be so represented when it appears in the string
"]]>", when that string is not marking the end of
a CDATA section.
In the content of elements, character data
is any string of characters which does
not contain the start-delimiter of any markup.
In a CDATA section, character data
is any string of characters not including the CDATA-section-close
delimiter, "]]>".
To allow attribute values to contain both single and double quotes, the
apostrophe or single-quote character (') may be represented as
"&apos;", and the double-quote character (") as
"&quot;".
PCData ::= [^<&]*
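A program that writes XML must apply these escaping rules before emitting character data or attribute values. The following Python sketch shows one way to do so, using the escape strings "&amp;", "&lt;", "&gt;", "&apos;", and "&quot;"; the function names are hypothetical, and the sketch escapes more than the minimum only in the "]]>" case.

```python
def escape_char_data(text: str) -> str:
    """Escape text for use as character data in element content."""
    # Ampersand first, so the ampersands introduced by the other
    # replacements are not escaped a second time.
    text = text.replace("&", "&amp;").replace("<", "&lt;")
    # For compatibility, ">" must be escaped where it would otherwise
    # complete the string "]]>".
    return text.replace("]]>", "]]&gt;")


def escape_att_value(value: str, quote: str = '"') -> str:
    """Escape value and wrap it in the given quotation mark."""
    if quote not in ('"', "'"):
        raise ValueError("quote must be a single or double quote")
    value = value.replace("&", "&amp;").replace("<", "&lt;")
    if quote == '"':
        value = value.replace('"', "&quot;")
    else:
        value = value.replace("'", "&apos;")
    return quote + value + quote
```

Only the quotation mark actually used as the delimiter needs escaping inside an attribute value, which is why the second function takes the delimiter as a parameter.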
Comments
Comments may appear anywhere
except in a CDATA section, i.e. within
element content, in
mixed content, or in a DTD. They must
not occur within declarations or tags.
They are not part of the document's character
data; an XML
processor may, but need not, make it possible for an application to
retrieve the text of comments.
For compatibility, the string
-- (double-hyphen) must not occur within
comments.
Comment ::= '<!&como;'
            (Char* - (Char* '&comc;' Char*))
            '&comc;>'
An example of a comment:
<!&como; declarations for <head> & <body> &comc;>
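The comment production can be checked with simple string operations. The Python sketch below assumes that the comment delimiters are the SGML ones, "<!--" and "-->" (the entity references in the production are not expanded in this text, so this is an assumption), and that the body is rejected exactly when it contains a double hyphen; the function name is hypothetical.

```python
COMMENT_OPEN = "<!--"    # assumed expansion of the opening delimiter
COMMENT_CLOSE = "-->"    # assumed expansion of the closing delimiter


def is_comment(text: str) -> bool:
    """Return True if text is a complete comment per the production."""
    # The length test prevents the two delimiters from overlapping,
    # as they would in the five-character string "<!-->".
    if len(text) < len(COMMENT_OPEN) + len(COMMENT_CLOSE):
        return False
    if not (text.startswith(COMMENT_OPEN) and text.endswith(COMMENT_CLOSE)):
        return False
    body = text[len(COMMENT_OPEN):-len(COMMENT_CLOSE)]
    # For compatibility, "--" must not occur within the comment.
    return "--" not in body
```

Left angle brackets and ampersands in the body are accepted as they stand, since they need no escaping inside comments.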
Processing Instructions
Processing
instructions (PIs) allow documents to contain instructions
for applications.
PI ::= '<?' Name S
       (Char* - (Char* &pic; Char*))
       &pic;
PIs are not part of the document's character
data, but must be passed through to the application. The
Name is called the PI target; it is used
to identify the application to which the instruction is directed. XML
provides an optional mechanism, NOTATION, for
formal declaration of such names.
PI targets with names beginning with the string "XML"
are reserved for standardization in this or future versions of
this specification.
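A processor might separate the PI target from the rest of the instruction as sketched below in Python. Several assumptions are made for illustration: the PI close delimiter is taken to be "?>" (the entity reference in the production is not expanded in this text), names are restricted to ASCII rather than the full Name production, and the reserved-target test compares the prefix "XML" exactly as written. The function name is hypothetical.

```python
import re

# '<?' Name S (Char* - (Char* '?>' Char*)) '?>', with ASCII-only names.
PI_RE = re.compile(
    r"<\?([A-Za-z_:][A-Za-z0-9._:\-]*)"   # PI target
    r"[ \t\r\n]+"                         # S, required by the production
    r"((?:(?!\?>).)*)"                    # data that never contains "?>"
    r"\?>",
    re.DOTALL,
)


def parse_pi(text: str):
    """Return (target, data, reserved), or None if text is not a PI."""
    m = PI_RE.fullmatch(text)
    if m is None:
        return None
    target, data = m.group(1), m.group(2)
    reserved = target.startswith("XML")   # reserved for standardization
    return target, data, reserved
```

The target and data are handed back separately because the application, not the processor, decides what the instruction means.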
CDATA Sections
CDATA sections can occur
anywhere character data may occur; they are
used to escape blocks of text containing characters which would
otherwise be recognized as markup. CDATA sections begin with the
string <![CDATA[ and end with the string
]]>:
CDSect ::= CDStart CData CDEnd
CDStart ::= '<![CDATA['
CData ::= (Char* - (Char* ']]>' Char*))
CDEnd ::= ']]>'
Within a CDATA section, only the CDEnd string is
recognized, so that left angle brackets and ampersands may occur in
their literal form; they need not (and cannot) be escaped using
&lt; and &amp;. CDATA sections
cannot nest.
An example of a CDATA section:
<![CDATA[<greeting>Hello, world!</greeting>]]>
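Since CDATA sections cannot nest and cannot contain the string "]]>", text that includes that string has to be divided between two adjacent sections. The Python sketch below shows one common way of doing this; the function names are hypothetical, and the reader function assumes its input consists of CDATA sections only.

```python
import re

CD_START = "<![CDATA["
CD_END = "]]>"


def to_cdata(text: str) -> str:
    """Wrap text in one or more CDATA sections."""
    # Split every "]]>" so that "]]" ends one section and ">" begins
    # the next; the delimiter then never appears inside a section.
    safe = text.replace(CD_END, "]]" + CD_END + CD_START + ">")
    return CD_START + safe + CD_END


def from_cdata(markup: str) -> str:
    """Concatenate the character data of adjacent CDATA sections."""
    return "".join(re.findall(r"<!\[CDATA\[(.*?)\]\]>", markup, re.DOTALL))
```

Applying from_cdata to the output of to_cdata returns the original text, including any "]]>" it contained.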
White Space Handling
In editing XML documents, it is often convenient to use "white space"
(spaces, tabs, and blank lines, denoted by the nonterminal S in
this specification) to
set apart the markup for greater readability. Such white space is typically
not intended for inclusion in the delivered version of the document.
On the other hand, "significant" white space that must be retained in the
delivered version is common, for example in poetry and
source code.
An XML processor
must always pass all characters in a document that are not
markup through to the application. A
validating XML processor must distinguish white space
in element content from other non-markup
characters and signal
to the application that white space in element content is not
significant.
A special attribute, XML-SPACE, may be inserted in
documents to signal an intention that the element to which this attribute
applies requires all white space to be treated as
significant by applications.
In valid documents, this attribute must be
declared as follows, if used:
The value DEFAULT signals that applications'
default white-space processing modes are acceptable for this element; the
value PRESERVE indicates the intent that applications preserve
all the white space.
This declared intent is considered to apply to all elements within the content
of this element, unless overridden with another instance of the
XML-SPACE attribute.
The root element of any document
is considered to have signaled no intentions as regards application space
handling, unless it provides a value for
this attribute or the attribute is declared with a default value.
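The scoping rule for the XML-SPACE attribute amounts to taking the nearest specified value on the path from the root to the element in question. A minimal Python sketch, assuming the attribute values along that path have already been collected into a list (with None where an element neither specifies nor defaults the attribute); the function name is hypothetical.

```python
def effective_space(chain):
    """Return the white-space intent in force for the last element.

    chain lists the XML-SPACE values from the root element down to the
    element of interest. The result is "DEFAULT", "PRESERVE", or None
    when no intent has been signaled anywhere on the path.
    """
    mode = None
    for value in chain:
        if value is None:
            continue              # inherit from the enclosing element
        if value not in ("DEFAULT", "PRESERVE"):
            raise ValueError("illegal XML-SPACE value: %r" % (value,))
        mode = value              # a nearer declaration overrides
    return mode
```

For example, a chain of [None, "PRESERVE", None] yields "PRESERVE", while ["PRESERVE", "DEFAULT"] yields "DEFAULT".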
Prolog and Document Type Declaration
XML documents
may, and should,
begin with an XML declaration which specifies, among other
things, the version of
XML being used.
The function of the markup in an XML document is to describe its
storage and logical structures, and associate attribute-value pairs with the
logical structure.
XML provides a
mechanism, the document type declaration, to
define constraints on that logical structure and to support the use of
predefined storage units.
An XML document is said to be
valid if there is an associated document type
declaration and if the document
complies with the constraints expressed in it.
The document type declaration must appear before
the first start-tag in the document.
document ::= prolog element Misc*
prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?
XMLDecl ::= &xmlpio; VersionInfo EncodingDecl? RMDecl? S? &pic;
VersionInfo ::= S 'version' Eq ('"&XML.version;"' | "'&XML.version;'")
Misc ::= Comment | PI | S
The identification of the XML version as "1.0" does not indicate a
commitment to produce any future versions of XML, nor if any are produced, to
use any particular numbering scheme.
Since future versions are not ruled out, this construct is provided
as a means to allow the possibility of automatic version recognition, should
it become necessary.
For example, the following is a complete XML document, well-formed but not
valid:
<?XML version="1.0"?>
<greeting>Hello, world!</greeting>
and so is this:
<greeting>Hello, world!</greeting>
The XML
document type declaration may include a pointer to an
external entity containing a subset of
the necessary markup declarations, and may also directly include
another, internal, subset.
These two subsets make up the
document type definition, abbreviated DTD.
The DTD, in effect, provides a grammar which defines a class of documents.
Properly speaking, the DTD consists of both subsets taken together,
but it is a common practice for the bulk of the markup
declarations to appear in the external subset, and for this
subset, usually contained in a file, to be referred to as "the DTD"
for a class of documents.
doctypedecl ::= '<!DOCTYPE' S Name (S ExternalID)?
                S? ('[' %markupdecl* ']' S?)? '>'   [Root Element Type] [Non-null DTD]
markupdecl ::= ( %elementdecl
               | %AttlistDecl
               | %EntityDecl
               | %NotationDecl
               | %PI
               | %S
               | %Comment
               | InternalPERef
               )*
InternalPERef ::= PEReference   [Integral Declarations]
Root Element Type
The Name in the document-type declaration must
match the element type of the root element.
Non-null DTD
The internal and external subsets of the DTD must not both
be empty.
Integral Declarations
A parameter-entity
reference recognized in this context must have replacement
text consisting of zero or more complete declarations,
i.e. matching the production for the non-terminal
markupdecl.
The external subset must obey substantially
the same grammatical constraints
as the internal subset; i.e. it must match the production for the
non-terminal symbol
markupdecl.
In the external subset, however, parameter-entity references can
be used to replace constructs prefixed by % in a production of
the grammar, and conditional sections
may occur.
In the internal subset, by contrast, conditional sections may not
occur and the only parameter-entity references
allowed are those which match the non-terminal
InternalPERef
within the rule for markupdecl.
extSubset ::= ( %markupdecl | %conditionalSect )*
For example:
<?XML version="1.0"?>
<!DOCTYPE greeting SYSTEM "hello.dtd">
<greeting>Hello, world!</greeting>
The system identifier hello.dtd indicates
the location of a DTD for the document.
The declarations can also be given locally, as in this
example:
<?XML version="1.0"?>
<!DOCTYPE greeting [
  <!ELEMENT greeting (#PCDATA)>
]>
<greeting>Hello, world!</greeting>
If both the external and internal subsets are used,
an XML processor must read the internal subset first,
then the external subset.
This has the effect that entity and attribute declarations in the
internal subset take precedence over those in the external subset.
Required Markup Declaration
In some cases, an XML processor can
read an XML document and accomplish useful tasks without having first
processed the entire DTD. However, certain
declarations can substantially affect the actions of an XML processor.
It is desirable, therefore, to be able to specify whether a
document contains any such declarations.
A document author can communicate whether or not DTD processing is
necessary using a required markup declaration
(abbreviated RMD), which appears as a component of the XML
declaration:
RMDecl ::= S 'RMD' Eq "'" ('NONE' | 'INTERNAL' | 'ALL') "'"
         | S 'RMD' Eq '"' ('NONE' | 'INTERNAL' | 'ALL') '"'
In an RMD, the value NONE indicates that an XML
processor can parse the containing document correctly without first
reading any part of the DTD. The value INTERNAL
indicates that the XML processor must read and process the internal subset of the DTD, if provided, to
parse the containing document correctly. The value ALL
indicates that the XML processor must read and process the
declarations in both the subsets of the DTD, if provided, to parse the
containing document correctly.
The RMD must indicate that the entire DTD is required if the
external subset contains any declarations of:
attributes with default values, if elements to which these attributes
apply appear in the document without specifying values for these attributes, or
entities (other than amp, lt, gt, apos, quot), if references to those
entities appear in the document, or
element types with element content, if white space occurs
directly within any instance of those types.
If such declarations occur in the internal but not the external
subset, the RMD must take the value INTERNAL. It is an
error to specify INTERNAL if the external subset is
required, or to specify NONE if the internal or
external subset is required.
If no RMD is provided, an XML processor must behave as though
an RMD had been provided with the value ALL.
An example XML declaration with an RMD:
<?XML version="1.0" RMD='INTERNAL'?>
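The defaulting rule for the RMD can be illustrated with a short Python sketch. It is not part of this specification: the function name required_markup and the regular expression are assumptions made for illustration, and the pattern only approximates the RMDecl production rather than parsing the whole XML declaration.

```python
import re

# Approximation of the RMDecl production (an assumption for illustration):
# the keyword RMD, an equals sign, and a quoted value with matching quotes.
_RMD_PATTERN = re.compile(r"\sRMD\s*=\s*(['\"])(NONE|INTERNAL|ALL)\1")


def required_markup(xml_decl):
    """Return the RMD value of an XML declaration, or 'ALL' if none is given."""
    match = _RMD_PATTERN.search(xml_decl)
    if match is None:
        # No RMD: behave as though an RMD with the value ALL had been provided.
        return "ALL"
    return match.group(2)
```

A processor built along these lines would skip the external subset only when the value returned is NONE or INTERNAL.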
Logical Structures
Each XML document contains one or more
elements, the boundaries of which are
either delimited by start-tags
and end-tags, or, for empty elements, by an empty-element tag. Each element has a type,
identified by name (sometimes called its generic
identifier or GI), and may have a set of
attributes. Each attribute has a name and a value.
This specification does not constrain the semantics, use, or (beyond
syntax) names of the elements and attributes, except that names
beginning with the string XML
are reserved for standardization in this or future versions of this
specification.
Start-Tags, End-Tags, and Empty-Element Tags
The beginning of every
non-empty XML element is marked by a start-tag.
STag ::= '<' Name (S Attribute)* S? '>'   [constraint: Unique Att Spec]
Attribute ::= Name Eq AttValue
              [constraints: Attribute Value Type; No External Entity References]
Eq ::= S? '=' S?
The Name in the start- and end-tags gives the
element's type.
The Name-AttValue pairs are referred to as
the attribute specifications of the element,
with the Name
referred to as the attribute name and
the content of the
AttValue (the characters between the ' or
" delimiters)
as the attribute value.
No attribute may appear more than once in the same start-tag
or empty-element tag.
The attribute must have been declared; the value must be of the type
declared for it.
(For attribute types, see the discussion of attribute
declarations.)
Attribute values cannot contain entity references to
external entities.
An example of a start-tag:
<termdef id="dt-dog" term="dog">
The end of every element
may (for elements which are not
empty, must) be marked by an end-tag
containing a name that echoes the element's type as given in the
start-tag:
ETag ::= '</' Name S? '>'
An example of an end-tag:
</termdef>
The text between the start-tag and
end-tag is called the element's
content:
content ::= (element | PCData | Reference | CDSect | PI | Comment)*
            [constraint: Content]
element ::= EmptyElement | STag content ETag   [constraint: GI Match]
Each element type used must be declared.
The content of an element instance must match the content model declared
for that element type.
The Name in an element's end-tag must match that in
the start-tag.
If an element is empty,
it may be represented either by a start-tag immediately followed
by an end-tag, or by an empty-element tag.
An empty-element tag takes a special form:
EmptyElement ::= '<' Name (S Attribute)* S? '/>'   [constraint: Unique Att Spec]
Empty-element tags may be used for any element which has no
content, whether or not it is declared using the keyword
EMPTY.
Examples of empty elements:
<IMG align="left"
src="http://www.w3.org/Icons/WWW/w3c_home" />
<br></br>
<br/>
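The interplay of start-tags, end-tags, and empty-element tags can be sketched in Python as a stack check. This is a simplified illustration, not a conforming processor: the pattern _TAG and the function tags_balanced are hypothetical names, and the pattern ignores comments, processing instructions, CDATA sections, and any '>' inside attribute values.

```python
import re

# Simplified tag pattern (an assumption for illustration only).
# Groups: optional '/' of an end-tag, the Name, optional '/' of an
# empty-element tag.
_TAG = re.compile(r"<(/?)([A-Za-z_][\w.\-]*)[^<>]*?(/?)>")


def tags_balanced(text):
    """Check that each end-tag names the most recently opened element."""
    open_elements = []
    for closing, name, empty in _TAG.findall(text):
        if closing:
            # The Name in an end-tag must match that in the start-tag.
            if not open_elements or open_elements.pop() != name:
                return False
        elif not empty:
            open_elements.append(name)
        # An empty-element tag opens and closes at once: nothing to push.
    return not open_elements
```

Both forms of an empty element, a start-tag immediately followed by an end-tag and an empty-element tag, leave the stack unchanged.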
Element Declarations
The element structure of an
XML document may, for
validation purposes,
be constrained
using element and attribute declarations.
An element declaration constrains the element's
type and its
content.
Element declarations often constrain which element types can
appear as children of the element.
At user option, an XML processor may issue a warning
when a reference is made to an element type for which no declaration
is provided, but this is not an error.
An element
declaration takes the form:
elementdecl ::= '<!ELEMENT' S %Name S (%S S)? %contentspec S? '>'
                [constraint: Unique Element Declaration]
contentspec ::= 'EMPTY' | 'ANY' | Mixed | elements
where the Name gives the type of the
element.
No element type may be declared more than once.
An element can be declared
using a content model, in which case
its content can be categorized as element content or mixed content,
as explained below.
An element declared using the keyword EMPTY must be empty and may be tagged using an
empty-element tag
when it appears in the document.
If an element type is declared using the keyword ANY, then
there are no validity constraints on its content: it may
contain child elements of
any type and
number, interspersed with character data.
Examples of element declarations:
<!ELEMENT br EMPTY>
<!ELEMENT %name.para; %content.para; >
<!ELEMENT container ANY>
Element Content
An element type may be declared to have
element content, which means that elements of that
type may only contain other elements (no character data).
In this case, the
constraint includes a content model, a simple grammar governing
the allowed types of the child
elements and the order in which they appear. The grammar is built on
content particles (CPs), which consist of names,
choice lists of content particles, or
sequence lists of content particles:
elements ::= (choice | seq) ('?' | '*' | '+')?
cp ::= (Name | choice | seq) ('?' | '*' | '+')?
cps ::= S? %cp S?
choice ::= '(' S? %ctokplus (S? '|' S? %ctoks)* S? ')'
ctokplus ::= cps ('|' cps)+
ctoks ::= cps ('|' cps)*
seq ::= '(' S? %stoks (S? ',' S? %stoks)* S? ')'
stoks ::= cps (',' cps)*
where each Name gives the type of an element which may
appear as a child. Any content
particle in a choice list may appear in the element content at the appropriate
location; content particles occurring in a sequence list must each
appear in the element content in the
order given. The optional character following a name or list governs
whether the element or the content particles in the list may occur one
or more (+), zero or more (*), or zero or
one times (?). The syntax
and meaning are identical to those used in the productions in this
specification.
The content of an element matches a content model if and only if it is
possible to trace out a path through the content model, obeying the
sequence, choice, and repetition operators and matching each element in
the content against an element name in the content model. For compatibility reasons, it is an error
if an element in the document can
match more than one occurrence of an element name in the content model.
More formally: a finite state automaton may be constructed from the
content model using the standard algorithms, e.g. algorithm 3.5
in section 3.9
of Aho, Sethi, and Ullman.
In many such algorithms, a follow set is constructed for each
position in the regular expression (i.e., each leaf
node in the
syntax tree for the regular expression);
if any position has a follow set in which
more than one following position is
labeled with the same element type name,
then the content model is in error
and may be reported as an error.
For more information, see the appendix on
deterministic content models.
Examples of element-content models:
<!ELEMENT spec (front, body, back?)>
<!ELEMENT div1 (head, (p | list | note)*, div2*)>
<!ELEMENT head (%head.content; | %head.misc;)*>
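Because the operators of a content model have the same meaning as in regular expressions, matching can be sketched in Python by translating the model into a pattern over the sequence of child element types. The names content_model_regex and matches_model are hypothetical, parameter-entity references are assumed to have been expanded already, and the backtracking matcher used here does not enforce the determinism requirement described above.

```python
import re

_NAME = re.compile(r"[A-Za-z_][\w.\-]*")


def content_model_regex(model):
    """Translate an element-content model into a compiled regular expression."""
    body = re.sub(r"\s+", "", model)
    # Each element name becomes a group matching that name plus a separator.
    body = _NAME.sub(lambda m: "(?:" + re.escape(m.group()) + " )", body)
    # ',' (sequence) is plain concatenation; '|', '?', '*', '+' and the
    # parentheses already carry the same meaning in regular expressions.
    return re.compile(body.replace(",", ""))


def matches_model(model, child_types):
    """Check a list of child element types against a content model."""
    children = "".join(name + " " for name in child_types)
    return content_model_regex(model).fullmatch(children) is not None
```

For instance, matches_model("(front, body, back?)", ["front", "body"]) accepts the children, while the same children in the order body, front are rejected.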
Mixed Content
An element type may be
declared to contain
mixed content, that is, text comprising character
data optionally interspersed with
child elements.
In this case, the types of the child elements are
constrained, but not their order nor their number of occurrences:
Mixed ::= '(' S? %( %'#PCDATA' (S? '|' S? %Mtoks)* ) S? ')*'
        | '(' S? %('#PCDATA') S? ')'
Mtoks ::= %Name (S? '|' S? %Name)*
where the Names give the types of elements
that may appear as children.
The same name must not appear more than once in a single mixed-content
declaration.
Examples of mixed content declarations:
<!ELEMENT p (#PCDATA|a|ul|b|i|em)*>
<!ELEMENT p (#PCDATA | %font; | %phrase; | %special; | %form;)* >
<!ELEMENT b (#PCDATA)>
Attribute-List Declarations
Attributes are used to associate
name-value pairs with elements.
Attributes may appear only within start-tags; thus, the productions used to
recognize them appear in the discussion of
start-tags. Attribute-list
declarations may be used:
To define the set of attributes pertaining to a given
element type.
To establish a set of type constraints on these
attributes.
To provide default values
for attributes.
Attribute-list declarations specify the name, data type, and default
value (if any) of each attribute associated with a given element type:
AttlistDecl ::= '<!ATTLIST' S %Name S? %AttDef+ S? '>'
AttDef ::= S %Name S %AttType S %Default
The Name in the
AttlistDecl rule is the type of an element. At user
option, an XML processor may issue a warning if attributes are
declared for an element type not itself declared, but this is not an
error. The Name in the AttDef rule is
the name of the attribute.
When more than one AttlistDecl is provided for a given
element type, the contents of all those provided are merged. When
more than one definition is provided for the same attribute of a
given element type, the first declaration is binding and later
declarations are ignored.
For interoperability, writers of DTDs
may choose to provide at most one attribute-list declaration
for a given element type, and at most one attribute definition
for a given attribute name.
For interoperability, an XML processor may at user option
issue a warning when more than one attribute-list declaration is
provided for a given element type, or more than one attribute definition
for a given attribute, but this is not an error.
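The merging rule can be sketched in Python as follows. The representation of a declaration as an (element type, attribute name, definition) triple and the function name merge_attlists are assumptions made for illustration.

```python
def merge_attlists(declarations):
    """Merge attribute definitions given in document order.

    `declarations` is an iterable of (element_type, attr_name, definition)
    triples, one per AttDef encountered (a hypothetical representation).
    """
    merged = {}
    for element_type, attr_name, definition in declarations:
        attrs = merged.setdefault(element_type, {})
        # The first declaration is binding; later ones are ignored.
        attrs.setdefault(attr_name, definition)
    return merged
```

A processor wishing to issue the optional interoperability warning could do so at the point where setdefault finds an existing entry.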
Attribute Types
XML attribute types are of three kinds: a string type, a
set of tokenized types, and enumerated types. The string type may take
any literal string as a value; the tokenized types have varying lexical
and semantic constraints, as noted:
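AttType ::= StringType | TokenizedType | EnumeratedType
StringType ::= 'CDATA'
TokenizedType ::= 'ID'         [constraint: ID]
                | 'IDREF'      [constraint: Idref]
                | 'IDREFS'     [constraint: Idref]
                | 'ENTITY'     [constraint: Entity Name]
                | 'ENTITIES'   [constraint: Entity Name]
                | 'NMTOKEN'    [constraint: Name Token]
                | 'NMTOKENS'   [constraint: Name Token]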
Values of this type must be valid
Name symbols. A name must not appear more than once in
an XML document as a value of this type; i.e., ID values must uniquely
identify the elements which bear them.
Values of this type must match
the Name
(for IDREFS, the Names) production;
each Name must match the value of an ID attribute on
some element in the XML document; i.e. IDREF values must
match some ID.
Values of this type
must match the production for Name
(for ENTITIES, Names);
each Name must
exactly match the
name of an external binary general entity declared in the DTD.
Values of this type
must consist of a string matching the Nmtoken nonterminal
(for NMTOKENS, the Nmtokens nonterminal) of the grammar defined
in this specification.
The XML processor must normalize attribute values before
passing them to the application, as described in the section
on attribute-value normalization.
Enumerated attributes can take one of a list of values provided in
the declaration; there are two types:
EnumeratedType ::= NotationType | Enumeration
NotationType ::= %'NOTATION' S '(' S? %Ntoks (S? '|' S? %Ntoks)* S? ')'
                 [constraint: Notation Attributes]
Ntoks ::= %Name (S? '|' S? %Name)*
Enumeration ::= '(' S? %Etoks (S? '|' S? %Etoks)* S? ')'
                [constraint: Enumeration]
Etoks ::= %Nmtoken (S? '|' S? %Nmtoken)*
The names in
the declaration of NOTATION attributes must be names of
declared notations (see the discussion of notations). Values of this type must match
one of the notation names included in the declaration.
Values of this type
must match one of the Nmtoken tokens in the declaration.
For interoperability, the same
Nmtoken should not occur more than once in the enumerated
attribute types of a single element type.
Attribute Defaults
An attribute declaration provides
information on whether
the attribute's presence is required, and if not, how an XML processor should
react if a declared attribute is absent in a document:
Default ::= '#REQUIRED' | '#IMPLIED' | ((%'#FIXED' S)? %AttValue)
            [constraint: Attribute Default Legal]
#REQUIRED means that
the document is invalid
should the processor
encounter a
start-tag
for the element type in question which specifies no value for
this attribute.
#IMPLIED means that if the attribute is omitted
from an element of this type,
the XML processor must inform the application
that no value was specified; no constraint is placed on the behavior
of the application.
If the attribute
is neither #REQUIRED nor #IMPLIED, then the
AttValue value contains the declared
default value. If the #FIXED is present,
the document is invalid
if the attribute
is present with a different value from the default. If a default value
is declared, when an XML processor encounters an omitted attribute, it
is to behave as though the attribute were present with its value being
the declared default value.
The
declared
default value must meet the constraints of the declared attribute type.
Examples of attribute-list declarations:
<!ATTLIST termdef
id ID #REQUIRED
name CDATA #IMPLIED>
<!ATTLIST list
type (bullets|ordered|glossary) "ordered">
<!ATTLIST form
method CDATA #FIXED "POST">
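The treatment of omitted attributes can be sketched in Python. The data layout is an assumption made for illustration: each definition is a (kind, default) pair, where kind is '#REQUIRED', '#IMPLIED', '#FIXED', or the hypothetical marker 'DEFAULT' for a plain default value; the function name apply_defaults is likewise hypothetical.

```python
def apply_defaults(specified, definitions):
    """Apply declared defaults to the attributes specified in a start-tag.

    Returns the resulting attribute values and a list of attribute names
    that make the document invalid.
    """
    values = dict(specified)
    invalid = []
    for name, (kind, default) in definitions.items():
        if name in specified:
            # A #FIXED attribute may be present only with its default value.
            if kind == "#FIXED" and specified[name] != default:
                invalid.append(name)
        elif kind == "#REQUIRED":
            invalid.append(name)
        elif kind in ("#FIXED", "DEFAULT"):
            # Behave as though the attribute were present with the default.
            values[name] = default
        # '#IMPLIED' and omitted: no value is specified for the application.
    return values, invalid
```

With the declarations in the examples above, a list element written without a type attribute would be reported with type set to ordered.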
Attribute-Value Normalization
Before the value of an attribute is passed to the application, the
XML processor must normalize it as follows:
Line-end characters (or, on some systems, record boundaries)
must be replaced by single space (#x20) characters.
Character references and references to internal text
entities must be expanded. References to external entities
are an error.
If the attribute is not of type CDATA, all strings
of white space must be normalized to single space characters (#x20),
and leading and trailing white space must be removed.
Values of type ID, IDREF, IDREFS, NMTOKEN, NMTOKENS, or of enumerated
or notation types, must be folded to uppercase.
If no DTD is present, attributes should be treated as CDATA.
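The normalization steps listed above can be sketched in Python. This is an illustration under stated assumptions, not a definitive implementation: the function name normalize_attribute, the type labels 'ENUMERATION' and 'NOTATION', and the entities mapping (name to replacement text of internal text entities) are hypothetical, and references inside replacement text are not expanded further.

```python
import re

# Attribute types whose values are folded to uppercase (labels assumed).
_FOLDED_TYPES = {"ID", "IDREF", "IDREFS", "NMTOKEN", "NMTOKENS",
                 "ENUMERATION", "NOTATION"}
_REFERENCE = re.compile(r"&(#x[0-9a-fA-F]+|#[0-9]+|[A-Za-z_][\w.\-]*);")


def normalize_attribute(value, att_type="CDATA", entities=None):
    """Normalize an attribute value before it is passed to the application."""
    entities = entities or {}

    def expand(match):
        ref = match.group(1)
        if ref.startswith("#x"):
            return chr(int(ref[2:], 16))
        if ref.startswith("#"):
            return chr(int(ref[1:]))
        return entities[ref]  # KeyError for an undeclared entity

    # 1. Line ends become single space characters.
    value = re.sub(r"\r\n|\r|\n", " ", value)
    # 2. Character references and internal text entities are expanded.
    value = _REFERENCE.sub(expand, value)
    # 3. For types other than CDATA, white space is collapsed and trimmed.
    if att_type != "CDATA":
        value = re.sub(r"[ \t\r\n]+", " ", value).strip(" ")
    # 4. Name-like and enumerated values are folded to uppercase.
    if att_type in _FOLDED_TYPES:
        value = value.upper()
    return value
```

When no DTD is present, the default att_type of CDATA applies, so only the first two steps have any effect.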
Conditional Sections
Conditional sections are portions of the
document type declaration external subset which are
included in, or excluded from, the logical structure of the DTD based on
the keyword which governs them.
conditionalSect ::= includeSect | ignoreSect
includeSect ::= '<![' %'INCLUDE' '[' (%markupdecl*)* ']]>'
ignoreSect ::= '<![' %'IGNORE' '[' ignoreSectContents* ']]>'
ignoreSectContents ::= ((SkipLit | Comment | PI) - (Char* ']]>' Char*))
                     | ('<![' ignoreSectContents* ']]>')
                     | (Char - (']' | [<'"]))
                     | ('<!' (Char - ('-' | '[')))
Like the internal and external DTD subsets, a conditional section
may contain one or more complete declarations,
comments, processing instructions,
or nested conditional sections, intermingled with white space.
If the keyword of the conditional section is
INCLUDE, then the conditional section is read and
processed in the normal way. If the keyword is
IGNORE, then the declarations within the conditional
section are ignored; the processor must read the conditional section to
detect nested conditional sections and ensure that the end of the
outermost (ignored) conditional section is properly detected.
If a conditional section with a
keyword of INCLUDE occurs within a larger conditional
section with a keyword of IGNORE, both the outer and the
inner conditional sections are ignored.
If the keyword of the conditional section is a parameter
entity reference, the parameter entity is replaced by its value
before the processor decides whether to
include or ignore the conditional section.
An example:
<!ENTITY % draft 'INCLUDE' >
<!ENTITY % final 'IGNORE' >
<![%draft;[
<!ELEMENT book (comments*, title, body, supplements?)>
]]>
<![%final;[
<!ELEMENT book (title, body, supplements?)>
]]>
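The requirement that a processor read an ignored section in order to find its true end can be sketched in Python. The function name end_of_conditional is hypothetical, and the sketch is deliberately simplified: unlike the ignoreSectContents production, it does not skip literals, comments, or processing instructions that might themselves contain the delimiters.

```python
def end_of_conditional(text, start):
    """Return the index just past the ']]>' that closes the conditional
    section opening at text[start], taking nested sections into account."""
    depth = 0
    i = start
    while i < len(text):
        if text.startswith("<![", i):
            depth += 1          # a nested conditional section opens
            i += 3
        elif text.startswith("]]>", i):
            depth -= 1          # the innermost open section closes
            i += 3
            if depth == 0:
                return i
        else:
            i += 1
    raise ValueError("unterminated conditional section")
```

Everything between start and the returned index belongs to the outermost section, so an INCLUDE section nested inside an IGNORE section is skipped together with it.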
Physical Structures
An XML document may consist
of one or many virtual storage units. These are called
entities; they are identified by name and have
content.
An entity may be stored in,
but need not comprise the whole of,
a single physical storage object such as a file or
database field.
Each XML document has one entity
called the document entity, which serves
as the starting point for the XML
processor (and may contain the whole document).
Entities may be either binary or text.
A text entity contains
text data which is considered as an
integral part of the document.
A binary entity contains
binary data with an associated notation.
Only text entities may be referred to using entity references;
only the names of binary entities may be given as the value of ENTITY
attributes.
Logical and Physical Structures
The logical and physical structures (elements and entities)
in an XML document must
be synchronous.
Tags and elements must
each begin and end in the same entity, but may
refer to other
entities internally; comments,
processing instructions,
character
references, and
entity references must each be contained entirely
within a single entity. Entities must each contain an integral number
of elements, comments, processing instructions, and references,
possibly together with character data not contained within any element
in the entity, or else they must contain non-textual data, which by
definition contains no elements.
Character and Entity References
A character reference refers to a specific character in the
ISO/IEC 10646 character set, e.g. one not directly accessible from
available input devices:
CharRef ::= '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';'
An entity
reference refers to the content of a named entity.
General entities are
text entities for use within the document itself; references to them
use ampersand (&) and semicolon (;) as
delimiters. In this specification,
general entities are sometimes referred to with the
unqualified term entity when this leads
to no ambiguity.
Parameter entities are text entities for use within the DTD,
or to control processing of
conditional sections;
references to them use percent-sign (%) and semicolon
(;) as delimiters.
Reference ::= EntityRef | CharRef
EntityRef ::= '&' Name ';'
              [constraints: Entity Declared; Text Entity; No Recursion]
PEReference ::= '%' Name ';'
                [constraints: Entity Declared; Text Entity; No Recursion; In DTD]
The Name given in the entity reference must exactly match the name given in the declaration
of the entity, except that well-formed documents need not declare
any of the following entities: amp, lt, gt, apos, quot. In valid
documents, these entities must be declared, in the form
specified in the section on
predefined entities.
In the case of parameter entities, the declaration
must precede the reference.
An entity reference must not contain the name of a binary entity. Binary entities may be referred
to only in attribute values declared to
be of type ENTITY or ENTITIES.
A text or parameter entity must not contain a recursive reference to itself,
either directly or indirectly.
In the external DTD subset, a parameter-entity reference is
recognized only at the locations where
the nonterminal PEReference or the
special operator % appears in a production of the
grammar. In the internal subset, parameter-entity references
are recognized only when they match
the InternalPERef non-terminal
in the production for markupdecl.
Examples of character and entity references:
Type <key>less-than</key> (&#x3C;) to save options.
This document was prepared on &docdate; and
is classified &security-level;.
Example of a parameter-entity reference:
<!ENTITY % ISOLat2
SYSTEM "http://www.xml.com/iso/isolat2-xml.entities" >
%ISOLat2;
Entity Declarations
Entities are declared thus:
EntityDecl ::= '<!ENTITY' S %Name S %EntityDef S? '>'         /* General entities */
             | '<!ENTITY' S '%' S %Name S %EntityDef S? '>'   /* Parameter entities */
EntityDef ::= EntityValue | ExternalDef
The Name is that by which the entity is invoked by
exact match in an entity
reference.
If the same entity is declared more than once, the first declaration
encountered is binding; at user option, an XML processor may issue a
warning if entities are declared multiple times.
Internal Entities
If the entity definition is an EntityValue, the
defined entity is
called an internal entity. There is no separate physical
storage object, and the replacement text of the entity is given in the
declaration. Within the EntityValue,
parameter-entity references and character references are recognized
and expanded immediately. General-entity references within the
replacement text are not recognized
at the time the entity declaration is parsed, though they may be
recognized when the entity itself is referred to.
An internal entity is a text entity.
Example of an internal entity declaration:
<!ENTITY Pub-Status "This is a pre-release of the
specification.">
External Entities
If the entity is not
internal, it is an external
entity, declared as follows:
ExternalDef ::= ExternalID %NDataDecl?
ExternalID ::= 'SYSTEM' S SystemLiteral
             | 'PUBLIC' S PubidLiteral S SystemLiteral
NDataDecl ::= S %'NDATA' S %Name   [constraint: Notation Declared]
If the NDataDecl is present, this is a binary
data entity, otherwise a text entity.
The Name must match the declared name of a
notation.
The
SystemLiteral that follows the keyword SYSTEM
is called the entity's system identifier. It is a URL,
which may be used to retrieve the entity.
Unless otherwise provided by information outside the scope of this
specification (e.g. a special XML element defined by a particular
DTD, or a processing instruction defined by a particular application
specification), relative URLs are relative to the location of the
entity or file within which the entity declaration occurs. Relative
URLs in entity declarations within the internal DTD subset are thus
relative to the location of the document; those in entity declarations
in the external subset are relative to the location of the files
containing the external subset.
In addition to a system literal, an external identifier may
include a public identifier.
An XML processor may use the public
identifier to try to generate an alternative URL. If the processor
is unable to do so, it must use the URL specified in the system
literal.
Examples of external entity declarations:
<!ENTITY open-hatch
SYSTEM "http://www.textuality.com/boilerplate/OpenHatch.xml">
<!ENTITY open-hatch
PUBLIC "-//Textuality//TEXT Standard open-hatch boilerplate//EN"
"http://www.textuality.com/boilerplate/OpenHatch.xml">
<!ENTITY hatch-pic
SYSTEM "../grafix/OpenHatch.gif"
NDATA gif >
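Resolution of a relative system identifier can be sketched in Python with the standard urljoin function. The function name resolve_system_identifier and the base URL used in the usage note are assumptions made for illustration; the base is the location of the entity in which the declaration occurs.

```python
from urllib.parse import urljoin


def resolve_system_identifier(declaring_entity_url, system_literal):
    """Resolve a system identifier against the location of the entity
    (document or external subset) that contains the declaration."""
    return urljoin(declaring_entity_url, system_literal)
```

If the hatch-pic declaration above occurred in a document at the hypothetical location http://www.textuality.com/docs/hatch.xml, the identifier ../grafix/OpenHatch.gif would resolve to http://www.textuality.com/grafix/OpenHatch.gif. An absolute URL is returned unchanged.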
Character Encoding in Entities
Each external text entity in an XML document may use a different
encoding for its characters. All XML processors must be able to read
entities in either UTF-8 or UCS-2.
It is recognized that for some purposes, the use of additional
ISO/IEC 10646 planes other than the Basic Multilingual Plane
may be required.
A facility for handling characters in these planes is therefore a
desirable characteristic in XML processors and applications.
Entities encoded in UCS-2 must
begin with the Byte Order Mark described by ISO/IEC 10646 Annex E and
Unicode Appendix B (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF).
This is an encoding signature, not part of either the markup or
character data of the XML document.
XML processors must be able to use this character to
differentiate between UTF-8 and UCS-2 encoded documents.
Although an XML processor is only required to read entities in
the UTF-8 and UCS-2 encodings, it is recognized that many other encodings are in daily
use around the world, and it may be advantageous for XML processors to read
entities that use these encodings.
For this purpose, XML provides an encoding declaration
processing instruction, which, if it occurs,
must appear at the
beginning of a system entity, before any
other character data or markup. In the document entity, the encoding
declaration is part of the XML declaration;
in other entities, it is part of an encoding processing instruction:
EncodingDecl ::= S 'encoding' Eq QEncoding   [constraint: Start of System Entity]
EncodingPI ::= '<?XML' S 'encoding' Eq QEncoding S? '?>'
QEncoding ::= '"' Encoding '"' | "'" Encoding "'"
Encoding ::= LatinName
LatinName ::= [A-Za-z] ([A-Za-z0-9._] | '-')*   /* Name containing only Latin characters */
An XML encoding declaration may occur at the beginning of a
system entity; it must
not occur within the body of any entity.
The values
UTF-8,
UTF-16,
ISO-10646-UCS-2, and
ISO-10646-UCS-4 should be
used for the various encodings and transformations of Unicode /
ISO/IEC 10646, the values
ISO-8859-1,
ISO-8859-2, ...
ISO-8859-9 should be used for the parts of ISO 8859, and
the values
ISO-2022-JP,
Shift_JIS, and
EUC-JP
should be used for the various encoded forms of JIS X-0208. XML
processors may recognize other encodings; it is recommended that
character encodings registered (as charsets)
with the Internet Assigned Numbers
Authority (IANA), other than those just listed, should be referred to
using their registered names.
It is an error for an entity including
an encoding declaration to be presented to the XML processor
in an encoding other than that
named in the declaration.
An entity which begins with neither a Byte Order Mark nor an encoding
declaration must be in the UTF-8 encoding.
While XML provides mechanisms for distinguishing encodings,
it is recognized that in a heterogeneous networked environment,
it may be difficult to signal the encoding of an entity reliably.
Errors in this area fall into two categories:
failing to read an entity because of inability to recognize
its actual encoding, and
reading an entity incorrectly because
of an incorrect guess of its proper encoding.
The first class of error is extremely damaging, and given a
correct encoding declaration, the second class is
extremely unlikely.
For these reasons, XML processors should make an effort to use all available
information, internal and external, to aid in detecting an entity's correct
encoding. Such information may include, but is not limited to:
Using information from an HTTP header
Using a MIME header obtained other than through HTTP
Metadata provided by the native OS file system or by document
management software
Analysing the bit patterns at the front of an entity to determine if
the application of any known encoding yields a valid encoding
declaration. See the appendix on
autodetection of character sets
for a fuller description.
If an XML processor encounters an entity with an encoding that it is
unable to process, it may
inform the application of this fact and
may
allow the application to
request either that the entity should be treated as a binary entity, or that processing should
cease.
Examples of encoding declarations:
<?XML ENCODING='UTF-8'?>
<?XML ENCODING='EUC-JP'?>
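The rules for determining an entity's encoding from its first bytes can be sketched in Python. This is an illustration only: the function name detect_encoding is hypothetical, external information such as HTTP headers is not consulted, and the declaration is assumed to be readable as ASCII-compatible bytes. The keyword is matched without regard to case, an assumption made because the production spells it 'encoding' while the examples above spell it ENCODING.

```python
import re

# Approximation of an encoding declaration at the very start of an entity.
_ENCODING_DECL = re.compile(
    rb"<\?XML[^>]*?\sencoding\s*=\s*(['\"])([A-Za-z][A-Za-z0-9._\-]*)\1",
    re.IGNORECASE,
)


def detect_encoding(data):
    """Guess the encoding of an entity from its leading bytes."""
    # A Byte Order Mark (#xFEFF in either byte order) signals UCS-2.
    if data[:2] in (b"\xfe\xff", b"\xff\xfe"):
        return "UCS-2"
    match = _ENCODING_DECL.match(data)
    if match is not None:
        return match.group(2).decode("ascii")
    # Neither a Byte Order Mark nor an encoding declaration: UTF-8.
    return "UTF-8"
```

The appendix on autodetection of character sets describes a fuller procedure, including encodings in which the declaration itself is not ASCII-compatible.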
Document Entity
The document
entity serves as the root of the entity
tree and a starting-point for an XML
processor.
This specification does
not specify how the document entity is to be located by an XML
processor; unlike other entities, the document entity might well
appear on an input stream of the processor
without any identification at all.
XML Processor Treatment of Entities
XML allows character and general-entity references in two places:
the content of elements (content) and
attribute values (AttValue).
When an XML processor encounters
such a reference, or the name of an external binary entity as the value
of an ENTITY or ENTITIES attribute, then:
In all cases, the XML processor may
inform the application of the reference's occurrence and its identifier
(for an entity reference, the name; for a character
reference,
the character number in decimal, hexadecimal, or binary form).
For both character and entity references, the processor must
remove the reference itself from the text data
before passing the data to the application.
For character references, the processor must
pass the character indicated
to the application in
place of the reference.
For an external entity, the processor must inform the
application of the entity's system
identifier and public identifier
if any.
If the external entity is binary, the processor must inform the
application of the associated notation name,
and the notation's associated system and public (if any)
identifiers.
For an internal
(text) entity, the processor must include the
entity; that is, retrieve its replacement text
and process it as a part of the document
(i.e. as content or AttValue, whichever was being processed when
the reference was recognized), passing the result to the application
in place of the reference. The replacement text may contain both text
and markup, which must be recognized in
the usual way, except that the replacement text of entities used to escape
markup delimiters (the entities amp, lt, gt, apos, quot) is always treated as
data. (The string AT&amp;T expands to
AT&T; the remaining ampersand is not recognized
as an entity-reference delimiter.)
Since the entity may contain other entity references,
an XML processor may have to repeat the inclusion process recursively.
If the entity is an external text entity, then in order to
validate the XML document, the processor must
include the content of the
entity.
If the entity is an external text entity, and the processor is not
attempting to validate the XML document, the
processor may, but need not, include the entity's content.
This rule is based on the recognition that the automatic inclusion
provided by the SGML and XML text entity mechanism, primarily designed
to support modularity in authoring, is not necessarily
appropriate for other applications, in particular document browsing.
Browsers, for example, when encountering an external text entity reference,
might choose to provide a visual indication of the entity's
presence and retrieve it for display only on demand.
Entity and character
references can both be used to escape the left angle bracket,
ampersand, and other delimiters. A set of general entities
(amp, lt, gt, apos, quot) is specified for this purpose.
Numeric character references may also be used; they are
expanded immediately when recognized, and must be treated as
character data, so the numeric character references
&#60; and &#38; may be used to
escape < and & when they occur
in character data.
XML allows parameter-entity references in a variety of places
within the DTD. Parameter-entity references are always expanded
immediately upon being recognized, and the DTD must match the relevant
rules of the grammar after all parameter-entity references have been
expanded. In addition, parameter entities referred to in specific
contexts are required to satisfy certain constraints in their
replacement text; for example, a parameter entity referred to within
the internal DTD subset must match the rule for markupdecl.
Implementors of XML processors need to know the rules for
expansion of references in more detail. These rules only come into
play when the replacement text for an internal entity itself contains
other references.
In the replacement text of an internal entity, parameter-entity
references and character references
are recognized and resolved
when the entity declaration is parsed,
before the replacement text is stored in
the processor's symbol table.
General-entity references in the replacement text are not
resolved when the entity declaration is parsed.
In the document, when a general-entity reference is
resolved, its replacement text is parsed. Character references
encountered in the replacement text are
resolved immediately; general-entity references encountered in the
replacement text may be resolved or left unresolved, as described
above.
Character and general-entity references must be
contained entirely within the entity's replacement text.
Simple character references do not suffice to escape delimiters
within the replacement text of an internal entity: they will be
expanded when the entity declaration is parsed, before the replacement
text is stored in the symbol table. When the entity itself is
referred to, the replacement text will be parsed again, and the
delimiters (no longer character references)
will be recognized as delimiters. To escape the
characters for which the entities amp, lt, gt, apos, and quot are
defined in an entity's replacement text, use
a general-entity reference or a doubly-escaped character reference.
See the appendix on expansion
of entity references
for detailed examples.
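The two parsing passes described above can be sketched in Python. This is a minimal illustration under stated assumptions, not a conforming processor: the helper name resolve_char_refs is hypothetical, it handles only decimal character references, and each call stands in for one parsing pass over the text.

```python
import re

def resolve_char_refs(text: str) -> str:
    """Replace each decimal character reference (&#N;) with its character.

    Hypothetical helper: one call models one parsing pass. The text that
    results from a replacement is not rescanned within the same pass.
    """
    return re.sub(r"&#([0-9]+);", lambda m: chr(int(m.group(1))), text)

# Pass 1: the entity declaration is parsed and the result is stored
# in the symbol table.
singly_escaped = resolve_char_refs("&#60;")      # stored as a bare "<"
doubly_escaped = resolve_char_refs("&#38;#60;")  # stored as "&#60;"

# Pass 2: the entity is referenced and its replacement text is parsed again.
# The stored "&#60;" is now a character reference and yields "<" as data,
# whereas the bare "<" stored in the first case would be seen as a delimiter.
reparsed = resolve_char_refs(doubly_escaped)
```

The singly-escaped form loses its protection at declaration time, which is why the doubly-escaped form (or a general-entity reference) is needed inside an entity's replacement text.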
Predefined Entities
As mentioned in the discussion of
Character Data and Markup, the characters used as markup
delimiters by XML may all be escaped using entity references
(for the entities amp, lt, gt, apos, quot).
All XML processors must recognize these entities whether they
are declared or not. Valid XML documents must declare these
entities, like any others, before using them.
If the entities in question are declared, they must be declared
as internal entities whose replacement text is the single
character being escaped.
Notation Declarations
Notations identify by
name the format of external binary
entities, or the application to which
processing instructions are addressed.
Notation declarations
provide a name for the notation, for use in
entity and attribute-list declarations and in attribute-value specifications,
and an external identifier for the notation which may allow an XML
processor or its client application to locate a helper application
capable of processing data in the given notation.
NotationDecl ::= '<!NOTATION' S %Name S %ExternalID S? '>'
XML processors must provide applications with the name and external
identifier of any notation declared and referred to in an attribute
value, attribute definition, or entity declaration. They may
additionally resolve the external identifier into the
system identifier,
file name, or other information needed to allow the
application to call a processor for data in the notation described. (It
is not an error, however, for XML documents to declare and refer to
notations for which notation-specific applications are not available on
the system where the XML processor or application is running.)
Conformance
Conforming XML processors fall into two
classes: validating and non-validating.
Validating and non-validating systems alike must report
violations of the well-formedness constraints
given in this specification.
Validating processors must report
locations in which the document
does not comply with
the constraints expressed by the declarations in the
DTD.
They must also report all failures to fulfill the validity constraints given
in this specification.
XML and SGML
XML is designed to be a subset of SGML, in that every
valid XML document should also be a conformant
SGML
document, using
the same DTD, and that the parse trees produced by
an SGML parser
and an XML processor should be the same.
To achieve this, XML was defined by removing features and options from the
specification of SGML.
The following list describes syntactic characteristics which XML does not
allow but which are legal in SGML. The list may not be complete.
No tag omission.
Special tag-form for empty elements.
Comment declarations must have the delimiters <!-- comment text
-->
and can't have spaces within the markup of <!-- or
-->.
No comments (-- ... --)
inside markup declarations.
Comment declarations can't jump in and out of comments with
-- --.
No name groups for declaring multiple elements or making
a single ATTLIST declaration apply to multiple elements.
No RANK feature.
No CDATA or RCDATA declared content in element declarations.
(Use CDATA sections instead.)
No exclusions or inclusions on content models.
No minimization parameters on element declarations.
Mixed content models must be optional-repeatable OR-groups, with
#PCDATA first.
No AND (&) content model groups.
No NAME[S], NUMBER[S], or NUTOKEN[S]
declared values for attributes. (Use NMTOKEN[S] or
CDATA with application-specific validation instead.)
No #CURRENT or #CONREF declared values for attributes.
Attribute default values must be quoted.
Marked sections can't have spaces within the markup of
<![keyword[ or
]]>.
No RCDATA, TEMP, IGNORE, or INCLUDE
marked sections in document instances.
Marked sections in document instances must use the CDATA
keyword literally, not a
parameter entity.
No CDATA, RCDATA, or TEMP
marked sections in the DTD.
Some restrictions on the content of ignored marked-sections:
comments, literals, and processing instructions in ignored sections
may not contain the delimiter string ]]>; this helps
ensure that the end-point of the conditional section does not
change when the section is changed from IGNORE to
INCLUDE.
No SDATA, CDATA, or bracketed internal entities.
No SUBDOC, CDATA, or SDATA external entities.
External entities must have a system identifier.
Parameter-entity references in the internal DTD subset may occur
only between declarations.
Parameter-entity references in the external DTD subset are
restricted to certain positions in the grammar, and must replace
whole non-terminals of the grammar; this ensures that all valid XML
documents are valid SGML, and makes the restrictions on parameter-entity
replacement easier to understand and implement.
No data attributes on NOTATIONs or
attribute value specifications on ENTITY
declarations.
No SHORTREF declarations.
No USEMAP declarations.
No LINKTYPE declarations.
No LINK declarations.
No USELINK declarations.
No IDLINK declarations.
No SGML declarations.
In most current SGML systems, XML documents should be able
to use the following SGML declaration.
Some systems (those which take the document character set to
be a description of the input stream) may require different
declarations, depending on the character set and the capacities
and quantities required.
"
PIC "?>"
SHORTREF NONE
NAMES
SGMLREF
This appendix contains some examples illustrating the
sequence of entity- and character-reference recognition and
expansion.
If the DTD contains the declaration
<!ENTITY example "<p>An ampersand (&#38;#38;) may be escaped
numerically (&#38;#38;#38;) or with a general entity
(&amp;amp;).</p>" >
then the XML processor will recognize the character references
when it parses the entity declaration, and resolve them before
storing the following string as the
value of the entity example:
<p>An ampersand (&#38;) may be escaped
numerically (&#38;#38;) or with a general entity
(&amp;amp;).</p>
A reference in the document to &example;
will cause the text to be reparsed, at which time the
start- and end-tags of the p element will be recognized
and the three references will be recognized and expanded,
resulting in a p element with the following content
(all data, no delimiters or markup):
An ampersand (&) may be escaped
numerically (&#38;) or with a general entity
(&amp;).
A more complex example will illustrate the rules and their
effects fully. In the following example, the line numbers are
solely for reference.
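1 <?XML version="1.0"?>
2 <!DOCTYPE test [
3 <!ELEMENT test (#PCDATA) >
4 <!ENTITY % xx '&#37;zz;'>
5 <!ENTITY % zz '&#60;!ENTITY tricky "error-prone" >' >
6 %xx;
7 ]>
8 <test>This sample shows a &tricky; method.</test>
This produces the following behavior: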
in line 4, the reference to character 37 is expanded immediately,
and the parameter entity xx is stored in the symbol
table with the value %zz;. Since the replacement text
is not rescanned, the reference to parameter entity zz
is not recognized. (And it would be an error if it were, since
zz is not yet declared.)
in line 5, the character reference &#60; is
expanded immediately and the parameter entity zz is
stored with the replacement text
<!ENTITY tricky "error-prone" >,
which is a well-formed entity declaration.
in line 6, the reference to xx is recognized,
and the replacement text of xx (namely
%zz;) is parsed. The reference to zz
is recognized in its turn, and its replacement text
(<!ENTITY tricky "error-prone" >) is parsed.
The general entity tricky has now been
declared, with the replacement text error-prone.
in line 8, the reference to the general entity tricky is
recognized, and it is expanded, so the full content of the
test element is the self-describing (and ungrammatical) string
This sample shows a error-prone method.
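The steps just described can be traced with a short Python sketch. It is only an illustration: the entity literals "&#37;zz;" and '&#60;!ENTITY tricky "error-prone" >' are inferred from the description above (character 37 is the percent sign, character 60 the left angle bracket), the helper names are hypothetical, and regular expressions stand in for a real DTD parser.

```python
import re

def resolve_char_refs(text):
    """Expand decimal character references once, as when a declaration is parsed."""
    return re.sub(r"&#([0-9]+);", lambda m: chr(int(m.group(1))), text)

# Lines 4 and 5: character references are expanded when each declaration is
# parsed; the result is stored without being rescanned.
parameter_entities = {}
parameter_entities["xx"] = resolve_char_refs("&#37;zz;")
parameter_entities["zz"] = resolve_char_refs('&#60;!ENTITY tricky "error-prone" >')

def expand_parameter_refs(text):
    """Replace %name; references, parsing each replacement text in turn."""
    return re.sub(
        r"%([A-Za-z]+);",
        lambda m: expand_parameter_refs(parameter_entities[m.group(1)]),
        text,
    )

# Line 6: %xx; yields %zz;, which in turn yields an entity declaration.
declaration = expand_parameter_refs("%xx;")
match = re.fullmatch(r'<!ENTITY (\w+) "([^"]*)" >', declaration)
general_entities = {match.group(1): match.group(2)}

# Line 8: the general-entity reference in the document is expanded.
content = re.sub(
    r"&(\w+);",
    lambda m: general_entities[m.group(1)],
    "This sample shows a &tricky; method.",
)
```

Note that xx can be declared before zz only because its stored value, %zz;, is not interpreted until xx is itself referenced in line 6.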
Deterministic Content Models
For compatibility, it is
required
that content models in element declarations be deterministic. SGML
requires deterministic content models (it calls them
`unambiguous'); XML processors built using SGML systems may
flag non-deterministic content models as errors.
For example, the content model ((b, c) | (b, d)) is
non-deterministic, because given an initial b the parser
cannot know which b in the model is being matched without
looking ahead to see which element follows the b.
In this case, the two references to
b can be collapsed
into a single reference, making the model read
(b, (c | d)). An initial b now clearly
matches only a single name in the content model. The parser doesn't
need to look ahead to see what follows; either c or
d would be accepted.
Algorithms exist which allow many but not all non-deterministic
content models to be reduced automatically to equivalent deterministic
models; see Brüggemann-Klein 1991.
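Whether a given content model is deterministic can be tested mechanically. The Python sketch below uses the standard first/follow position sets of a regular expression: a model is treated as deterministic when no set of candidate next positions contains the same element name twice. It only detects non-determinism; it does not perform the reduction mentioned above, and it is not claimed to be the algorithm of the cited report. The tuple representation of content models and the function name are assumptions made for the example.

```python
from itertools import count

def is_deterministic(model):
    """Return True if no lookahead is needed to match names in the model.

    Models are nested tuples: ("name", n), ("seq", ...), ("choice", ...),
    ("opt", m), ("star", m), ("plus", m).
    """
    positions = count()
    name_at = {}   # position -> element name
    follow = {}    # position -> positions that may match the next element

    def walk(node):
        """Return (nullable, first positions, last positions) for a node."""
        kind = node[0]
        if kind == "name":
            pos = next(positions)
            name_at[pos] = node[1]
            follow[pos] = set()
            return False, {pos}, {pos}
        if kind == "seq":
            nullable, first, last = True, set(), set()
            for child in node[1:]:
                c_null, c_first, c_last = walk(child)
                for p in last:
                    follow[p] |= c_first
                if nullable:
                    first |= c_first
                last = c_last | (last if c_null else set())
                nullable = nullable and c_null
            return nullable, first, last
        if kind == "choice":
            parts = [walk(child) for child in node[1:]]
            return (any(p[0] for p in parts),
                    set().union(*(p[1] for p in parts)),
                    set().union(*(p[2] for p in parts)))
        if kind in ("opt", "star", "plus"):
            c_null, c_first, c_last = walk(node[1])
            if kind != "opt":            # repetition loops back to the start
                for p in c_last:
                    follow[p] |= c_first
            return (c_null or kind != "plus"), c_first, c_last
        raise ValueError(f"unknown node kind: {kind}")

    def clash(candidates):
        names = [name_at[p] for p in candidates]
        return len(names) != len(set(names))

    _, first, _ = walk(model)
    return not clash(first) and not any(clash(f) for f in follow.values())

# ((b, c) | (b, d)) and its rewritten form (b, (c | d))
ambiguous = ("choice", ("seq", ("name", "b"), ("name", "c")),
                       ("seq", ("name", "b"), ("name", "d")))
rewritten = ("seq", ("name", "b"), ("choice", ("name", "c"), ("name", "d")))
```

Here is_deterministic(ambiguous) is False because both occurrences of b can match the first element, while is_deterministic(rewritten) is True.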
Autodetection of Character Encodings
The XML encoding declaration functions as an internal label on each
entity, indicating which character encoding is in use. Before an XML
processor can read the internal label, however, it apparently has to
know what character encoding is in use—which is what the internal label
is trying to indicate. In the general case, this is a hopeless
situation. It is not entirely hopeless in XML, however, because XML
limits the general case in two ways: each implementation is assumed
to support only a finite set of character encodings, and the XML
encoding declaration is restricted in position and content in order to
make it feasible to autodetect the character encoding in use in each
entity in normal cases.
Because each XML entity not in UTF-8 or UCS-2 format must
begin with an XML encoding declaration, in which the first characters
must be '<?XML', any conforming processor can detect,
after two to four octets of input, which of the following cases apply (in
reading this list, it may help to know that in UCS-4, '<' is
#x0000003C and '?' is #x0000003F, and the Byte
Order Mark required of UCS-2 data streams is #xFEFF):
00 3C 00 3F: UCS-2, big-endian, no Byte Order Mark
(and thus, strictly speaking, in error)
3C 00 3F 00: UCS-2, little-endian, no Byte Order Mark
(and thus, strictly speaking, in error)
3C 3F 58 4D: UTF-8, ISO 646, ASCII, some part of ISO 8859,
Shift-JIS, EUC, or any other 7-bit, 8-bit, or mixed-width encoding
which ensures that the characters of ASCII have their normal positions,
width,
and values; the actual encoding declaration must be read to
detect which of these applies, but since all of these encodings
use the same bit patterns for the ASCII characters, the encoding
declaration itself may be read reliably
4C 6F E7 D4: EBCDIC (in some flavor; the full
encoding declaration must be read to tell which code page is in
use)
other: UTF-8 without an encoding declaration, or else
the data are corrupt, fragmentary, or enclosed in
a wrapper of some kind
This level of autodetection is enough to read the XML encoding
declaration and parse the character-encoding identifier, which is
still necessary to distinguish the individual members of each family
of encodings (e.g. to tell UTF-8 from 8859, and the parts of 8859
from each other, or to distinguish the specific EBCDIC code page in
use, and so on).
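The cases listed above amount to a small lookup on the first octets of the entity. The following Python sketch shows one way to write it; the function name and the returned labels are hypothetical. The two Byte Order Mark cases are an assumption added for the example: they follow from the statement that the Byte Order Mark is #xFEFF, which appears as FE FF in big-endian and FF FE in little-endian UCS-2.

```python
def detect_encoding_family(head: bytes) -> str:
    """Classify an entity by its first two to four octets.

    Returns a descriptive label (hypothetical), not an encoding name;
    the encoding declaration must still be read to pick the exact encoding.
    """
    # Assumed cases: a Byte Order Mark decides after two octets.
    if head.startswith(b"\xFE\xFF"):
        return "UCS-2, big-endian"
    if head.startswith(b"\xFF\xFE"):
        return "UCS-2, little-endian"

    # Cases from the list above: the characters '<?XM' in each family.
    signatures = {
        b"\x00\x3C\x00\x3F": "UCS-2, big-endian, no Byte Order Mark",
        b"\x3C\x00\x3F\x00": "UCS-2, little-endian, no Byte Order Mark",
        b"\x3C\x3F\x58\x4D": "ASCII-compatible; read the encoding declaration",
        b"\x4C\x6F\xE7\xD4": "EBCDIC; read the encoding declaration",
    }
    return signatures.get(
        head[:4],
        "UTF-8 without an encoding declaration, or unreadable data",
    )
```

For example, detect_encoding_family(b"<?XML version") reports the ASCII-compatible family, after which the declaration itself can be read to choose among UTF-8, the parts of ISO 8859, and so on.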
Because the contents of the encoding declaration are restricted to
ASCII characters, a processor can reliably read the entire encoding
declaration as soon as it has detected which family of encodings is in
use. Since in practice, all widely used character encodings fall into
one of the categories above, the XML encoding declaration allows
reasonably reliable in-line labeling of character encodings, even when
external sources of information at the operating-system or
transport-protocol level are unreliable.
Once the processor has detected the character encoding in use, it can
act appropriately, whether by invoking a separate input routine for
each case, or by calling the proper conversion function on each
character of input.
Like any self-labeling system, the XML encoding declaration will
not work if any software changes the entity's character set or encoding
without updating the encoding declaration. Implementors of
character-encoding routines should be careful to ensure the accuracy
of the internal and external information used to label the
entity.
A Trivial Grammar for XML Documents
The grammar given in the body of this specification is relatively
simple, but for some purposes it is convenient to have an even simpler
one.
A very simple, though non-conforming, XML
processor could parse a well-formed XML document using the
following simplified grammar, recognizing all element boundaries
correctly, though not expanding entity references and not detecting
all errors:
simpleDoc  ::= (SimpleData | Markup)*
SimpleData ::= [^<&]*    /* cf. PCData */
SimpleLit  ::= ('"' [^"]* '"') | ("'" [^']* "'")    /* cf. SkipLit */
Markup     ::= '<' Name (S Name S? '=' S? SimpleLit)* S? '>'    /* start-tags */
             | '<' Name (S Name S? '=' S? SimpleLit)* S? '/>'    /* empty elements */
             | '</' Name S? '>'    /* end-tags */
             | '&' Name ';'    /* entity references */
             | '&#' [0-9]+ ';'    /* decimal character references */
             | '&#x' [0-9a-fA-F]+ ';'    /* hexadecimal character references */
             | '<!--' (Char* - (Char* '--' Char*)) '-->'    /* comments */
             | '<?' (Char* - (Char* '?>' Char*)) '?>'    /* processing instructions */
             | '<![CDATA[' (Char* - (Char* ']]>' Char*)) ']]>'    /* CDATA sections */
             | '<!DOCTYPE' (Char - ('[' | ']'))+ ('[' simpleDTD* ']')? '>'    /* doc type declaration */
simpleDTD  ::= '<!--' (Char* - (Char* '--' Char*)) '-->'    /* comment */
             | '<?' (Char* - (Char* '?>' Char*)) '?>'    /* processing instruction */
             | SimpleLit
             | (Char - (']' | '<' | '"' | "'"))+
             | '<!' (Char - ('-'))+    /* declarations other than comment */
Most processors will require the more complex
grammar given in the body of this specification.
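As a rough illustration of how little machinery the simplified grammar needs, the Python sketch below turns most of its Markup alternatives into a single regular expression and splits a document into tokens. It is a sketch under stated assumptions: Name is approximated with ASCII characters only, the doc type declaration alternative is omitted, comments are ended at the first "-->" without rejecting an embedded "--", and the names TOKEN and tokenize are hypothetical.

```python
import re

NAME = r"[A-Za-z_:][A-Za-z0-9._:-]*"       # ASCII-only stand-in for Name
LIT = r"(?:\"[^\"]*\"|'[^']*')"            # SimpleLit
ATTRS = rf"(?:\s+{NAME}\s*=\s*{LIT})*\s*"  # (S Name S? '=' S? SimpleLit)* S?

TOKEN = re.compile(rf"""
    (?P<comment> <!--.*?--> )
  | (?P<cdata>   <!\[CDATA\[.*?\]\]> )
  | (?P<pi>      <\?.*?\?> )
  | (?P<end>     </{NAME}\s*> )
  | (?P<empty>   <{NAME}{ATTRS}/> )
  | (?P<start>   <{NAME}{ATTRS}> )
  | (?P<ref>     &\#?[0-9A-Za-z_:.-]+; )
  | (?P<data>    [^<&]+ )
""", re.VERBOSE | re.DOTALL)

def tokenize(document: str):
    """Split a document into (kind, text) pairs; entities are not expanded."""
    tokens, pos = [], 0
    while pos < len(document):
        m = TOKEN.match(document, pos)
        if m is None:
            raise ValueError(f"unrecognized markup at offset {pos}")
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

sample = '<a x="1"><b/>text &amp; more<!-- <c> --></a>'
kinds = [kind for kind, _ in tokenize(sample)]
```

For the sample string, kinds lists a start-tag, an empty element, data, an entity reference, more data, a comment, and an end-tag; the "<c>" inside the comment is not mistaken for a tag.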
References
Aho, Alfred V.,
Ravi Sethi, and Jeffrey D. Ullman.
Compilers: Principles, Techniques, and Tools.
Reading: Addison-Wesley, 1986, rpt. corr. 1988.
Brüggemann-Klein, Anne.
Regular Expressions into Finite Automata.
Universität Freiburg, Institut für Informatik,
Bericht 33, Juli 1991.
Brüggemann-Klein, Anne,
and Derick Wood.
Deterministic Regular Languages.
Universität Freiburg, Institut für Informatik,
Bericht 38, Oktober 1991.
ISO
(International Organization for Standardization).
ISO/IEC 8879-1986 (E). Information processing — Text and Office
Systems — Standard Generalized Markup Language (SGML). First
edition — 1986-10-15. [Geneva]: International Organization for
Standardization, 1986.
ISO
(International Organization for Standardization).
ISO/IEC 10646-1993 (E). Information technology — Universal
Multiple-Octet Coded Character Set (UCS) — Part 1:
Architecture and Basic Multilingual Plane.
[Geneva]: International Organization for
Standardization, 1993 (plus amendments AM 1 through AM 5).
ISO
(International Organization for Standardization).
ISO/IEC 10744-1992 (E). Information technology —
Hypermedia/Time-based Structuring Language (HyTime).
[Geneva]: International Organization for
Standardization, 1992.
Extended Facilities Annexe.
[Geneva]: International Organization for
Standardization, 1996.
IETF (Internet Engineering Task Force).
RFC 1738: Uniform Resource Locators.
1994.
The Unicode Consortium.
The Unicode Standard, Version 2.0.
Reading, Mass.: Addison-Wesley Developers Press, 1996.
W3C XML Working Group
This specification was prepared and approved for publication by the
W3C XML Working Group (WG). WG approval of this specification does
not necessarily imply that all WG members voted for its approval. At
the time it approved this specification, the XML WG comprised the
following members:
Jon Bosak, Sun (Chair)
James Clark (Technical Lead)
Tim Bray, Textuality and Netscape (XML Co-editor)
Jean Paoli, Microsoft (XML Co-editor)
C. M. Sperberg-McQueen, U. of Ill. (XML Co-editor)
Steve DeRose, INSO
Dave Hollander, HP
Eliot Kimber, Highland
Tom Magliery, NCSA
Eve Maler, ArborText
Murray Maloney, Grif
Peter Sharpe, SoftQuad