[Mirrored from: http://www.ucc.ie/xml/, March 31, 1997]
Maintained on behalf of the W3C SGML Working Group by Peter Flynn (Silmaril Consultants), with the collaboration of Terry Allen (Fujitsu, Inc), Tom Borgman (Interleaf, Inc), Tim Bray (Textuality, Inc), Robin Cover (Summer Institute of Linguistics), Christopher Maden (EBT, Inc), Eve Maler (Arbortext, Inc), Peter Murray-Rust (Nottingham University), Liam Quin (Softquad, Inc), Michael Sperberg-McQueen (University of Illinois at Chicago), Joel Weber (MIT), and many other members of the SGML Working Group of the W3C
0.1 (31 January 1997) PF (First draft. Sample questions devised by participants.)
0.2 (3 February 1997) PF (Revised draft. Additional questions and answers.)
0.3 (17 February 1997) PF (Extensive revision following comments from the group. Changes to markup and organisation.)
0.4 (23 February 1997) PF (Minor editorial changes)
0.5 (1 April 1997) PF (Added Multidoc Pro as SGML browser; question on XML math; fixed ambiguity in explanation of NETs; added JUMBO; ERB changes of March 26; more details of linking and tools; added element declaration minimization to the forbidden list.)
Paragraphs which have been added since the last version are shown in magenta and prefixed with a pilcrow (¶). Paragraphs which have been changed since the last version are also shown in magenta but are prefixed with a section sign (§). Paragraphs marked for future deletion but retained at the moment for information are shown in pale gray and are prefixed with a plus/minus sign (±).
This document contains the most frequently-asked questions (with answers) about XML, the Extensible Markup Language. It is intended as a guide for users, developers, and the interested reader, and should not be regarded as a part of the XML Draft Specification.
The FAQ is divided into four parts: a) General, b) User, c) Author, and d) Developer. The questions are numbered independently within each section. As the numbering may therefore change with each version, comments and suggestions should refer to the version number (see Revision History above) as well as the part and question number.
There is a form at the end of this document which you can use to submit bug reports, suggestions for improvement, and other comments relating to this FAQ only. Comments about the XML Draft Specification itself should be sent to the W3C.
§ The SGML file for use with any conforming system is available at http://www.ucc.ie/xml/faq.sgml (this can also be used with SGML browsers like Panorama or Multidoc Pro by downloading the DTD and stylesheet). An HTML version of the same text, for use with an HTML browser (eg Netscape Navigator, Microsoft Internet Explorer, Spry Mosaic, NCSA Mosaic, Lynx, etc), is at http://www.ucc.ie/xml/. A plaintext (ASCII) version is available from the Web or by anonymous FTP to one of several FAQ repositories. The versions above are also available by electronic mail to the WebMail server (for users with email-only access). For printed copies there is a PostScript(TM) version, or you can have it on flattened dead trees by sending $10 (or equivalent) to the editor (email first to check currency and postal address).
A regular markup language defines what you can do (or what you have done) in the way of describing information for a fixed class of documents. XML goes beyond this and allows you to define your own customized markup language. It can do this because it's an application profile of SGML, a metalanguage, a language for describing languages.
XML is intended to be a standard `to make it easy and straightforward to use SGML on the Web: easy to define document types, easy to author and manage SGML-defined documents, and easy to transmit and share them across the Web.'
It defines `an extremely simple dialect of SGML which is completely described in the Draft XML Specification (DXS). The goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML.'
`For this reason, XML has been designed for ease of implementation, and for interoperability with both SGML and HTML' [quotes from the DXS].
Not quite. SGML is the `mother tongue', used for describing thousands of different types of document.
HTML is just one document type, used in the Web. It defines a fixed type of document with markup to let you describe a common class of simple office-style report, with headings, paragraphs, lists, illustrations, etc, and some provision for hypertext and multimedia.
XML is an abbreviated version of SGML, to make it easier for you to define your own document types, and to make it easier for programmers to write programs to handle them.
XML is a project of the World-Wide Web Consortium (W3C), and the development of the specification is being supervised by their SGML Editorial Review Board (ERB). The work of definition and specification is being done by a Working Group appointed by the ERB, with co-opted contributors and experts from various fields.
It removes two constraints which are holding back Web developments: dependence on a single document type (see HTML above), and the complexity of full SGML, whose syntax allows many powerful but complex options. XML both simplifies the levels of optionality in SGML, and allows the development of user-defined document types on the Web.
XML redefines some of SGML's internal qualities and quantities, and removes a large number of the more complex and sometimes less-used features which made it harder to write processing programs (see list).
It also introduces a new class of document which does not require the formal declaration of a predefined document type. See the questions about document type declarations, `valid' vs `well-formed' documents, and how to define your own document types in the Developers' Section.
HTML is already overburdened with dozens of interesting but often incompatible inventions from different manufacturers, because it provides only one way of describing your information.
XML will allow groups of people or organisations to create their own customised markup languages for exchanging information in their domain (music, chemistry, electronics, hill-walking, finance, surfing, linguistics, knitting, history, engineering, rabbit-keeping etc).
HTML is at the limit of its usefulness as a way of describing information, and while it will continue to play an important role for the content it currently represents, many new applications require a more robust and flexible infrastructure.
Information on a network which connects many different types of computer has to be usable on all of them. Public information cannot afford to be restricted to one make or model or manufacturer, or to cede control of its data format to private hands. It is also helpful for such information to be in a form that can be reused in many different ways, as this can minimize wasted time and effort.
SGML is the international standard which is used for defining this kind of application, but those who need an alternative based on different software are entirely free to implement similar services using such a system, especially if it is for non-public use.
XML is a subset of SGML: it omits some more complex and some less-used parts in return for the benefits of being easier to write applications for, easier to understand, and more suited to delivery and interoperability over the Web. But it is still SGML, and XML files can be parsed and validated the same as any other SGML file (see the question on XML software).
XML processing software (browsers, formatters, search engines, etc) will be able to handle all sorts of new applications, not just HTML.
Programmers may find it useful to think of XML as being SGML-- rather than HTML++.
Online, there's the XML Draft Specification available from the W3C; a brief summary of XML with an extensive list of online reference material in Robin Cover's SGML pages; and a summary and condensed FAQ from Tim Bray.
± The GCA and SGML Open are conducting a three-day conference on The new publishing business case: XML, SGML, and the Internet in San Diego, California, on 10-12 March 1997. Further details from Julie Morrison Desmond or from the GCA's Web site.
Technology Appraisals Ltd are holding a three-day conference in London, England, on XML and network publishing technologies, XML ready for prime time?, and The XML Technology Bootstrap on 21-23 April 1997. They are also running a three-day seminar on business on the Internet and the Web, including material on XML, on 27 April-1 May (also in London). Details from David Hitchcock at TAL.
The Sixth International World Wide Web Conference (WWW '97) is being held on 7-12 April 1997 in Santa Clara, California.
¶ The Irish SGML Users Group are holding a meeting entitled XML - Extending the World Wide Web with guest speaker Jon Bosak (Sun) on April 30, 2-5pm in Dublin (location to be announced).
There is a mailing list called xml-dev for those committed to developing components for XML. You can subscribe by sending a 1-line mail message to
subscribe xml-dev yourname@yoursite
The list is hypermailed for online reference at http://www.lists.ic.ac.uk/hypermail/xml-dev/.
Note that this list is for those people actively involved in developing resources for XML. It is not for general information about XML (see this FAQ and other sources) or for general discussion about SGML implementation and resources (see comp.text.sgml).
§ XML is still being developed, and there are already some pilot browsers, so you can experiment with them. When the specification is more complete, more browsers should start to appear, and you may be able to download them and use them to browse the Web much as you do with current software now.
¶ You can use the browsers to look at some of the emerging XML material, such as Jon Bosak's Shakespeare plays.
If you want to start preparations for writing XML, see the questions in the Authors' Section.
Because authors and providers can design their own document types, browser presentation will be able to benefit from greatly improved facilities, both for graphical display and for performance.
Document types can be explicitly tailored to an audience, so the cumbersome fudging that has to take place with HTML to achieve special effects should become a thing of the past: authors and designers will be free to invent their own markup elements.
Information content can be richer and easier to use, because the hypertext linking abilities of XML are much greater than those of HTML.
Because XML removes many of the underlying complexities of SGML in favor of a more flexible model, writing programs to handle XML will be much easier than doing the same for full SGML.
Information will be much more accessible and reusable, because the more flexible markup of XML can be utilized by any XML software instead of being partly restricted to specific manufacturers as has become the case with HTML.
XML files remain fully conformant SGML, so they can be used outside the Web as well, in any normal SGML environment.
§ There are already some browsers emerging (see below), but the XML specification is still under development. As with HTML, there won't be just one browser, but many. However, because the potential number of different XML applications is not limited, no single browser should be expected to handle 100% of everything.
Expect to see the generic parts of XML (eg parsing, tree management, searching, formatting, and the use of architectural forms) combined into a general-purpose browser library or toolkit to make it easier for developers to take a consistent line when writing XML applications. Such applications could then be customised by adding semantics for specific markets, or using languages like Java to develop plugins for generic browsers and have the specialist modules delivered transparently over the Web.
¶ JUMBO is a prototype GUI browser/editor/search/rendering tool for the output of XML parsers. It displays the abstract document tree which can be queried and edited in limited fashion. Java classes can be dynamically loaded for the current DTD and allow complex transformation and rendering. The emphasis is on the import of legacy files into structured documents, and the management of non-textual data, including common data structures (trees, tables, lists, etc). Currently JUMBO parses a subset of XML files (ie only elements and their attributes) and will be grafted onto other parsers as soon as possible. The software and a wide range of XML demo files, including Jon Bosak's PLAY, can be downloaded for any Java-enabled browser from http://www.venus.co.uk/omf/cml/
¶ Inso Corporation (owners of EBT) are reported to have demonstrated pilot XML software using DynaWeb at the GCA's XML Conference in San Diego (March 1997). [Details do not appear to be available on the Web: if anyone has information on this, please use the form at the end of this FAQ.]
No, existing SGML and HTML applications software will continue to work with existing files. But as with any enhanced facility, if you want to view or download and use XML files, you will need to use XML-aware software.
Authors should also read the Developers' Section, which contains further information about the internals of XML files.
No, XML itself does not replace HTML: instead, it complements it by allowing you to define your own alternative to HTML (or extension to it).
HTML is expected to remain in common use for some time to come.
XML documents can be very simple, with no formal document type declaration, and straightforward nested markup of your own design:
<?XML VERSION="1.0" RMD="NONE"?>
<conversation>
  <greeting n="1">Hello, world!</greeting>
  <response>Stop the planet, I want to get off!</response>
</conversation>
Or they can be more complicated, with a specific DTD and maybe a local subset, and a more complex structure:
<?XML VERSION="1.0" RMD="ALL" ENCODING="UTF-8"?>
<!doctype titlepage system "typo.dtd"
  [<!entity % active.links "INCLUDE">]>
<titlepage>
  <whitespace type="vertical" amount="36"/>
  <title font="Baskerville" size="24/30" alignment="centered">Hello, world!</title>
  <whitespace type="vertical" amount="12"/>
  <!--* In some copies the following decoration is hand-colored, presumably by the author *-->
  <image location="http://www.foo.bar/fleuron.eps" type="URL" alignment="centered"/>
  <whitespace type="vertical" amount="24"/>
  <author font="Baskerville" size="18/22" style="italic">Munde Salutem</author>
</titlepage>
Or they can be anywhere between: a lot will depend on how you want to define your document type and what it will be used for. See the question on valid and well-formed files.
In both cases there are some simple rules to follow, and you may need to use a Document Type Definition (DTD) which will both guide you in creating the document and guide the user's software in reading it. If there isn't a suitable DTD in existence for your type of document, you can write one of your own.
Unlike some current ways of using HTML, you can't just make it up as you go along and hope that any old tag will do: for software to make sense of it, you need to follow a pattern or plan. In the case of a valid file, this is the DTD; in the case of a well-formed file, the structure has to be implicit in your markup.
Many HTML authoring tools already produce almost well-formed XML document instances, and some of them may also be nearly valid if there is a Document Type Definition (DTD).
¶ The differences are small but significant (see the question on XML document classes). Existing HTML browsers are tolerant of invalid markup, so they will probably display XML files which use an XML version of a HTML DTD, or which are simply well-formed HTML, even though there are some slight differences. See the question on how to make existing HTML files work in XML.
Yes, this is what the `well-formed' document class is there for:
<?XML VERSION="1.0" RMD="NONE"?>
<FAQ>
  <Q><IMAGE XML-LINK="ques.gif"/>Can I create my own XML documents without a DTD?</Q>
  <A>Yes. This is an example of a well-formed document, which can be parsed by any XML-compliant parser. However, it won't know how to display it unless you supply a stylesheet.</A>
</FAQ>
¶ Properly balanced, nested elements; all start-tags and end-tags always present for elements which contain text data; a trailing slash on elements defined
Yes, if they are unambiguous. XML processors have to allow for the fact that in some applications precise white-space is critical, while in others it needs to be collapsed: there are different rules for white-space handling in valid and well-formed documents, as well as ways to affect it.
<chapter>
  <section>
    <title>
      My title for Section 1.
    </title>
    <para>
      ...
    </para>
  </section>
</chapter>
§ In other words, do you really want those linebreaks and spaces before, after, and in the title, or are they just there to make it easier to edit or because it was machine-generated? Should it say
<title>My title for Section 1.</title>
with no linebreaks or space? There are requirements and recommendations for applications on the detection and handling of white-space in the Draft XML Specification. XML instances may not rely on whitespace being ignored in element content.
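As a rough illustration of the choice an application faces, the Python sketch below parses the chapter example and shows the title text both exactly as written and with its white-space collapsed. It relies on Python's standard xml.etree.ElementTree, which follows present-day XML 1.0 rather than the draft described here, and the helper name collapse is an assumption made for this example only.

```python
import xml.etree.ElementTree as ET

SOURCE = """<chapter>
  <section>
    <title>
      My title for
      Section 1.
    </title>
    <para>...</para>
  </section>
</chapter>"""


def collapse(text):
    """Collapse runs of white-space to single spaces and trim both ends."""
    return " ".join(text.split())


title = ET.fromstring(SOURCE).find("./section/title")
raw_title = title.text              # white-space kept exactly as typed
clean_title = collapse(raw_title)   # what a formatter would probably want

print(repr(raw_title))
print(repr(clean_title))
```

Which of the two values is appropriate depends on the application, which is why the markup itself should not depend on white-space being discarded.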
Element names (start-tags and end-tags) are case-insensitive: you can use upper- or lower-case (or even MiXeD). Attribute names are also case-insensitive. Attribute values, however, may be case-sensitive, depending on context: you can specify which in a Document Type Definition. All entity names, and your data content (the text), are case-sensitive.
¶ You need to include the Document Type Declaration:
<?XML VERSION="1.0" RMD="ALL" ENCODING="UTF-8"?>
<!DOCTYPE HTML SYSTEM "http://www.foo.com/myfiles/html3x.dtd">
¶ Any DTD you reference must be an XML version, as must any other entities the DTD refers to (eg character entities like ISO Latin-1). They must all be accessible either through the network or from the user's local disk (eg by supplying a URL or filename for each in their
¶ The file itself must be well-formed (see below).
§ If your file does not conform to any of the available DTDs, then you can make it well-formed. You must make sure it follows the rules for well-formed files, by editing the file and making the necessary changes. Then place an XML Declaration containing a Required Markup Declaration at the top:
<?XML VERSION="1.0" RMD="NONE"?>
<HTML>
  <HEAD><TITLE>Test file</TITLE></HEAD>
  <BODY><BLINK>Test text <IMG SRC="foo.gif" alt="A foo"/></BLINK></BODY>
</HTML>
This lets you omit the DTD so long as the file is well-formed.
§ Yes, provided a) the document has a valid Document Type Definition (DTD) which you can use; and b) the files are valid, not just well-formed. But at the moment there are few tools which handle XML files unchanged because of the format of EMPTY elements. This is expected to change soon.
§ Yes, but at the moment there is still a need for tutorials, simple tools, and more examples of XML documents. Well-formed XML documents may look similar to HTML except for some small but very important points of syntax.
As every user community can have their own document type defined, it should be much easier to learn, because element names can be picked for relevance.
Yes, the XML Draft Specification explicitly makes reference to ISO 10646 and says that users `may extend the ISO 10646 character repertoire, in the rare cases where this is necessary, by making use of the private use areas[ . . . ]all XML processors must accept the UTF-8 and UCS-2 encodings of 10646; the mechanisms for signalling which of the two are in use, and for bringing other encodings into play, are[ . . . ]in the discussion of character encodings. Regardless of the specific encoding used, any character in the ISO 10646 character set may be referred to by the decimal or hexadecimal equivalent of its bit string':
&#u-hhhh; [from the
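To make the terms in the quotation concrete, the following Python sketch takes one character and prints its ISO 10646 code point in decimal and hexadecimal, together with its UTF-8 and UCS-2 byte sequences. The sample character is an arbitrary choice, and the sketch does not attempt to reproduce the draft's reference syntax; it only shows the numbers such a reference would carry.

```python
# LATIN SMALL LETTER E WITH ACUTE, chosen only as an example character.
ch = "\u00e9"

code_point = ord(ch)
decimal_form = str(code_point)            # decimal equivalent
hex_form = format(code_point, "04X")      # four-digit hexadecimal equivalent

utf8_bytes = ch.encode("utf-8")
# For characters in the Basic Multilingual Plane, big-endian UTF-16
# produces the same two bytes as UCS-2.
ucs2_bytes = ch.encode("utf-16-be")

print(decimal_form, hex_form, utf8_bytes.hex(), ucs2_bytes.hex())
```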
A DTD is usually a file (or several files together) which contains a formal definition of a particular type of document. This sets out what names can be used for elements, where they may occur (for example, <ITEM> might only be allowed inside <LIST>), and how they all fit together. It lets processors parse a document and identify where each element comes, so that stylesheets, navigators, search engines, and other applications can be used.
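The kind of rule a DTD expresses can be imitated in a few lines of Python. The sketch below is not DTD validation: it is a toy check, using a hypothetical table called ALLOWED_PARENTS, that reports any ITEM found outside a LIST. The element names DOC and PARA, and the function name misplaced_elements, are assumptions made for this illustration, and the parser used is Python's standard present-day XML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table: element name -> names of parents it may occur in.
ALLOWED_PARENTS = {"ITEM": {"LIST"}}


def misplaced_elements(xml_text):
    """Return descriptions of elements found under a parent the table forbids."""
    root = ET.fromstring(xml_text)
    problems = []
    for parent in root.iter():
        for child in parent:
            allowed = ALLOWED_PARENTS.get(child.tag)
            if allowed is not None and parent.tag not in allowed:
                problems.append(f"{child.tag} inside {parent.tag}")
    return problems


good = "<DOC><LIST><ITEM>one</ITEM><ITEM>two</ITEM></LIST></DOC>"
bad = "<DOC><PARA><ITEM>stray</ITEM></PARA></DOC>"

print(misplaced_elements(good))
print(misplaced_elements(bad))
```

A real DTD covers far more than parentage (order, repetition, attributes, entities), so a validating parser is the right tool for actual checking.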
There are thousands of DTDs already in existence in all kinds of areas (see the SGML Web pages for examples). Many of them can be downloaded and used freely; or you can write your own. As with any language, you need to learn some of it [SGML] to do this: but XML is much simpler, see the list of restrictions which shows what has been cut out.
DTDs specifically for use on the Web may become commonplace, and people in different areas of interest may write their own for their own purposes: this is what XML is for.
§ The linking abilities of XML systems are much more powerful than those of HTML. Existing links will remain usable, but new linking technology is based on the lessons learned in the development of other hypertext standards, such as HyTime, which will let you manage bidirectional and multi-way links, as well as links to a span of text (within your own or other documents) rather than to a single point. This is already implemented for SGML in browsers like Panorama and Multidoc Pro.
¶ At the time of compiling this version, some of the linking facilities are still under discussion. Current discussions center around the use of HTML-style links according to the URL and URI specifications (RFCs 1738 and 1630) with the TEI's Extended Pointer Notation. This would allow references such as
§ There are already three XML parsers (written in Java) which can be used to check that your files conform to the Draft XML Specification:
Norbert Mikula's NXP at http://www.edu.uni-klu.ac.at/~nmikula/NXP/
Tim Bray's Lark at http://www.textuality.com/Lark/
Sean Russell's kernel at http://jersey.uoregon.edu/ser/software/XML.tar.gz
¶ Yes, if the document type you use provides for math. The long-expired HTML3 could be used, or HTML Pro, or ISO12083 Math, or the developments of the OpenMath or HTML-Math projects, or one of your own making. Browsers to display rudimentary math embedded in SGML already exist (eg Panorama, Multidoc Pro), and the mathematics-using communities may develop their own software for XML.
¶ The sophistication could vary from math expressions like [expression not reproduced] through simple inline equations such as [equation not reproduced] to display equations like [equation not reproduced].
(If you are using an HTML browser to read this, the above equations may not be rendered correctly unless you have a math plugin for Netscape like IBM's TechExplorer which reads the embedded TeX equivalent [use the source, Luke!].)
Right here (http://www.w3.org/pub/WWW/TR/). Includes the EBNF.
§ A valid file begins like any normal SGML file with a Document Type Declaration, but may have an optional XML Declaration prepended:
<?XML VERSION="1.0"?>
<!doctype foo system "http://www.foo.org/bar.dtd">
<foo>
  <bar>...<blort/>...</bar>
</foo>
An XML version of the specified DTD must be accessible to the XML processor, either by being available locally (ie the user already has a copy on disk), or by being retrievable via the network (with the SYSTEM identifier set to a URL).
<?XML VERSION="1.0" RMD="INTERNAL"?>
<!doctype foo [
  <!element guff (#PCDATA)>
]>
<foo>
  <bar>...<blort/>...</bar>
  <guff>...</guff>
</foo>
The default, when no XML Declaration is
§ Well-formed XML files can be used without a DTD, but they must follow some simple rules to enable a browser to parse the file correctly (so that it can apply your stylesheet, enable linking, etc). Valid files must also be well-formed.
<?XML VERSION="1.0" RMD="NONE"?>
<foo>
  <bar>...<blort/>...</bar>
</foo>
all tags must be balanced: that is, all elements must have both start- and end-tags present (omission is not allowed, with one exception, see `Empty elements' below)
all attribute values must be in quotes (the single-quote character [the apostrophe] may be used if the value contains a double-quote character, and vice versa)
<BR> would become either
§ there must not be any markup characters (< or &) in the character data (ie they must be escaped as &lt; and &amp;)
elements must nest inside each other properly (no overlapping markup, same rule as for regular SGML).
¶ Well-formed XML files are considered to have the character entity references for markup characters (eg &amp;) predefined and thus available for use even without a DTD. Valid XML files must declare them explicitly if they use them. A revised version of the XML Specification will give a precise definition of what that declaration must be.
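Several of the well-formedness rules above can be tried out with any non-validating parser. The Python sketch below feeds a handful of small samples to the standard xml.etree.ElementTree parser and reports which ones it accepts. This parser implements present-day XML 1.0, not the draft discussed in this FAQ, so the draft-style XML Declaration is left out of the samples; the function name is_well_formed and the sample texts are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return (True, None) if the text parses, else (False, error message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, None


samples = {
    "balanced": "<foo><bar>text<blort/></bar></foo>",
    "missing end-tag": "<foo><bar>text</foo>",
    "unquoted attribute": "<foo n=1>text</foo>",
    "raw ampersand": "<foo>fish & chips</foo>",
    "overlapping": "<foo><b><i>text</b></i></foo>",
}

results = {name: is_well_formed(text)[0] for name, text in samples.items()}
for name, ok in results.items():
    print(f"{name}: {'well-formed' if ok else 'not well-formed'}")
```

Only the first sample is accepted; each of the others breaks one of the rules listed above.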
Processing Instructions are SGML's way of adding `what to do' and `how to do it' details to a file. In XML every file can begin with an XML Declaration which starts with `<?' and the keyword XML, and ends with `?>' (slightly different from plain SGML, which omits the final question mark).
The XML Declaration must include the version number of XML being followed, and may include a Required Markup Declaration and an Encoding Declaration:
<?XML VERSION="1.0" RMD="ALL" ENCODING="UTF-8"?>
¶ The XML Declaration is optional, and defaults to the values given here: if other values are needed, the declaration must be included at the top of the file.
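A developer might read the draft-style XML Declaration along the following lines. This Python sketch is only one possible approach: it uses regular expressions rather than a real parser, handles double-quoted values only, and takes its defaults from the declaration shown above. The names DEFAULTS and read_declaration are assumptions made for this example.

```python
import re

# Defaults taken from the declaration shown above; the layout is an assumption.
DEFAULTS = {"VERSION": "1.0", "RMD": "ALL", "ENCODING": "UTF-8"}

DECLARATION = re.compile(r'^\s*<\?XML\b(.*?)\?>', re.DOTALL)
PSEUDO_ATTRIBUTE = re.compile(r'(\w+)\s*=\s*"([^"]*)"')


def read_declaration(document):
    """Return the declaration settings, falling back to DEFAULTS when omitted."""
    settings = dict(DEFAULTS)
    match = DECLARATION.match(document)
    if match:
        for name, value in PSEUDO_ATTRIBUTE.findall(match.group(1)):
            settings[name.upper()] = value
    return settings


with_declaration = read_declaration('<?XML VERSION="1.0" RMD="NONE"?><foo/>')
without_declaration = read_declaration('<foo/>')

print(with_declaration)
print(without_declaration)
```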
The principal changes are in what you can do in writing a Document Type Definition (DTD). To simplify the syntax and make it easier to write processing software, the following markup declaration restrictions have been picked for XML:
No comments (
*--) inside markup declarations
Comment declarations can't have spaces within the
Comment declarations can't jump in and out of comments
No name groups for declaring multiple elements or element attlists
No exclusions or inclusions on content models
¶ No minimization parameters on element declarations
Mixed content models must be optional-repeatable ORs,
No AND (&) content model groups
Attribute default values must be quoted
Marked sections can't have spaces within the markup of
INCLUDE marked sections
Marked sections in instance must have
keyword, not parameter entity
or bracketed internal entities
SDATA external entities
No data attributes on
attribute value specifications on
<!--* comment text *-->
The `asterisk-double-dash' sequence is therefore illegal in comment text as it is the terminator. Spaces are not allowed between either of the angle brackets, the exclamation mark, or any of the dashes or asterisks: they are only valid within the comment text.
¶ As noted in the question on source formatting, instances may not rely on whitespace being ignored in element content.
If you want to use existing SGML DTDs and entity files for XML, they will need to be edited to conform to the above requirements, but this only has to be done once. When the list has been finalized, it is likely that suitably-modified versions of the popular DTDs and character entity sets (eg the `ISO' files like ISOlat1) will be made available for use with XML.
Only to serve up .xml files as the correct MIME type. The XML project is submitting a MIME type of text/xml for approval, so for serving XML documents all that is needed is to edit the mime-types file (or its equivalent) and add the line
text/xml xml XML
However, more sophisticated applications may require HTTP content negotiation to determine what tools the client has for display. Also, since XML is designed to support stylesheets and sophisticated hyperlinking, XML documents may be accompanied by ancillary files such as DTDs, entity files, catalogs, stylesheets, etc, which may need their own MIME entry, and which require placing in the appropriate directories.
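Where files are served by a program rather than by a server with a mime-types file, the same mapping can be registered in code. The Python sketch below uses the standard mimetypes module to associate the .xml extension with text/xml for the running process; the filename faq.xml is only an example, and how a particular server consults this table is outside the scope of the sketch.

```python
import mimetypes

# Register the proposed type for the .xml extension in this process only.
mimetypes.add_type("text/xml", ".xml")

guessed_type, _encoding = mimetypes.guess_type("faq.xml")
print(guessed_type)
```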
If you run scripts generating HTML, which you wish to work with XML, they will need to be modified to produce the relevant document type.
INCLUDEs (`part-generated' content)?
However, some files containing embedded calls to external procedures which get invoked before transmission, such as the NCSA's `special' HTML, need to be checked carefully, to make sure that they do not contain raw markup characters (ie angle brackets and ampersands) which might confuse editors and other processing software. For example,
<!-- #exec cmd="tr '\012' '\040' <foo.bar"-->
which in a .shtml file embeds the contents of foo.bar in the output stream, with all newlines changed to spaces, needs to be written as
<!-- #exec cmd="tr '\012' '\040' &lt;foo.bar"-->
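If such directives are generated by a script, the escaping can be done automatically. The following Python sketch uses the standard library function xml.sax.saxutils.escape, which replaces ampersands and angle brackets with entity references, to build the directive from the raw command. The variable names are assumptions for illustration, and whether a given server decodes the reference before running the command would need to be checked.

```python
from xml.sax.saxutils import escape

# The raw shell command, containing a markup character (<).
command = r"tr '\012' '\040' <foo.bar"

# escape() replaces &, < and > with &amp;, &lt; and &gt;.
escaped = escape(command)
directive = f'<!-- #exec cmd="{escaped}"-->'

print(directive)
```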
The same rule applies, so you need to ensure that any embedded code which gets passed to a third-party engine (eg SDQL writes, LiveWire requests, etc) does not contain any characters which might be misinterpreted as XML markup (ie no angle brackets or ampersands): either use a marked section to avoid your XML application parsing the embedded code, or use the standard character entity references (such as &lt; and &amp;).
For implementation to succeed, the terminology needs to be precise (for example `element' and `tag' are not synonymous: an element is a whole unit of markup, and may consist of a start-tag alone (as in HTML's <BR>) or a start-tag and an end-tag and the content which goes between them; tags are simply the markers at the start and end of elements). Sloppy terminology
Not yet, although many aspects of development software are being worked on: see the question on XML software.