[Mirrored from: http://www.ucc.ie/xml/], February 23, 1997
Maintained on behalf of the W3C SGML Working Group by Peter Flynn (Silmaril Consultants), with the collaboration of Terry Allen (Fujitsu, Inc), Tom Borgman (Interleaf, Inc), Tim Bray (Textuality, Inc), Robin Cover (Summer Institute of Linguistics), Christopher Maden (EBT, Inc), Eve Maler (Arbortext, Inc), Peter Murray-Rust (Nottingham University), Liam Quin (Softquad, Inc), Michael Sperberg-McQueen (University of Illinois at Chicago), Joel Weber (MIT), and many other members of the SGML Working Group of the W3C
Revision history
0.1 (31 January 1997) PF (First draft. Sample questions devised by participants.)
0.2 (3 February 1997) PF (Revised draft. Additional questions and answers.)
0.3 (17 February 1997) PF (Extensive revision following comments from the group. Changes to markup and organisation.)
Summary
This document contains the most frequently-asked questions (with answers) about XML, the Extensible Markup Language. It is intended as a guide for users, developers, and the interested reader, and should not be regarded as a part of the XML Draft Specification.
Organization
The FAQ is divided into four parts: a) General, b) User, c) Author, and d) Developer. The questions are numbered independently within each section. As the numbering may therefore change with each version, comments and suggestions should refer to the version number (see Revision History above) as well as the part and question number.
There is a form at the end of this document which you can use to submit bug reports, suggestions for improvement, and other comments relating to this FAQ only. Comments about the XML Draft Specification itself should be sent to the W3C.
Availability
The SGML file for use with any conforming system is available at http://www.ucc.ie/xml/faq.sgml (can also be used with the Panorama browser). The same text is available in an HTML version for use with an HTML browser (eg Netscape Navigator, Microsoft Internet Explorer, Spry Mosaic, NCSA Mosaic, Lynx, etc) at http://www.ucc.ie/xml/. A plaintext (ASCII) version is available from the Web or by anonymous FTP from one of several FAQ repositories. The versions above are also available by electronic mail from the WebMail server (for users with email-only access). For printed copies there is a PostScript(TM) version, or you can have it on flattened dead trees by sending $10 (or equivalent) to the editor (email first to check currency and postal address).
A.5 Aren't XML, SGML, and HTML all the same thing?
A.6 Who is responsible for XML?
A.7 Why is XML such an important development?
A.8 How does XML make SGML simpler and still let you define your own document types?
A.9 Why not just carry on extending HTML?
A.10 Why do we need all this SGML stuff? Why not just use Word?
A.11 What's the relationship between XML and SGML and HTML?
A.12 Where do I find more about XML?
A.13 Where can I discuss implementation and development of XML?
B.1 Do I have to do anything to use XML?
B.2 What can XML offer that current Web technology can't?
C.2 What does an XML document look like inside?
C.3 If I can define it all myself, how does XML know what it all means?
C.4 Is an HTML document legal XML?
C.5 Can I create my own XML documents without explicitly defining a document type?
C.6 Can I prettyprint my XML documents?
C.7 Which parts of an XML document are case-sensitive?
C.8 How can I make my existing HTML files work in XML?
C.9 If XML is just a `subset' of SGML, can I use XML files directly with SGML tools?
C.10 I'm used to authoring and serving HTML. Can I learn XML easily?
C.11 Will XML be able to use non-Latin characters?
C.12 What's a Document Type Definition (DTD) and where do I get one?
D.2 What's the difference between `valid' and `well-formed'?
D.3 What are all these Processing Instructions?
D.4 What else has changed between SGML and XML?
D.5 Do I have to change any of my server software to work with XML?
D.6 Can I still use server-side INCLUDEs (`part-generated' content)?
D.7 Can I (and my authors) still use client-side INCLUDEs?
D.8 I'm trying to understand the XML Spec: why does SGML (and XML) have such difficult terminology?
XML stands for `Extensible Markup Language'. It's extensible because it is not a fixed format like HTML.
A regular markup language defines what you can do (or what you have done) in the way of describing information. XML goes beyond this and allows you to define your own customized markup language. It can do this because it's an application profile of SGML.
XML is designed to be a standard `to make it easy and straightforward to use SGML on the Web: easy to define document types, easy to author and manage SGML-defined documents, and easy to transmit and share them across the Web'.
It defines `an extremely simple dialect of SGML which is completely described in the Draft XML Specification (DXS). The goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML.'
`For this reason, XML has been designed for ease of implementation, and for interoperability with both SGML and HTML' [quotes from the DXS].
SGML is the Standard Generalized Markup Language (ISO 8879), the international standard system for defining and using document formats.
HTML is the HyperText Markup Language (RFC 1866), a specific application of SGML used in the World-Wide Web.
Not quite. SGML is the `mother tongue', used for describing thousands of different types of document.
HTML is just one document type, used in the Web. It defines a fixed type of document with markup to let you describe a common class of simple office-style report, with headings, paragraphs, lists, illustrations, etc, and some provision for hypertext and multimedia.
XML is an abbreviated version of SGML, to make it easier for you to define your own document types, and to make it easier for programmers to write programs to handle them.
XML is a project of the World-Wide Web Consortium (W3C), and the development of the specification is being supervised by their SGML Editorial Review Board (ERB). The work of definition and specification is being done by a Working Group appointed by the ERB, with co-opted contributors and experts from various fields.
It removes two constraints which are holding back Web developments: dependence on a single document type (see HTML above), based on a system (SGML) whose syntax allows many powerful but complex options. XML both simplifies the levels of optionality in SGML, and allows the development of user-defined document types on the Web.
XML redefines some of SGML's internal qualities and quantities, and removes a large number of the more complex and sometimes less-used features which made it harder to write processing programs (see list).
It also introduces a new class of document which does not require the formal declaration of a predefined document type. See the questions about document type declarations, `valid' vs `well-formed' documents, and how to define your own document types in the Developers' Section.
HTML is already overburdened with dozens of interesting but often incompatible inventions from different manufacturers, because it provides only one way of describing your information.
XML will allow a group of people or organisations to create their own customised markup languages for exchanging information in their domain (music, chemistry, electronics, finance, linguistics, history, engineering, etc).
Information on a network which connects many different types of computer has to be usable on all of them, and cannot afford to be restricted to one make or model or manufacturer. It is also helpful for such information to be in a form that can be reused in many different ways, as this can minimize wasted time and effort. SGML is the international standard which is used for defining this kind of application, but those who need an alternative based on different software are entirely free to implement similar services using such a system.
XML is a subset of SGML: it omits some more complex and some less-used parts in return for the benefits of being easier to write applications for, easier to understand, and more suited to delivery and interoperability over the Web. But it is still SGML, and XML files can be parsed and validated just as any other SGML files can.
XML processing software (browsers, formatters, search engines, etc) will be able to handle all sorts of new applications, not just HTML.
Programmers may find it useful to think of XML as being SGML-- rather than HTML++.
Online, there's the XML Draft Specification available from the W3C, and a brief summary of XML with an extensive list of online reference material in Robin Cover's SGML pages.
The GCA and SGML Open are conducting a three-day conference on The new publishing business case: XML, SGML, and the Internet in San Diego, California, on 10-12 March 1997. Further details from Julie Morrison Desmond or from the GCA's Web site.
Technology Appraisals Ltd are holding a three-day conference in London, England, on XML and network publishing technologies, XML ready for prime time?, and The XML Technology Bootstrap on 21-23 April 1997. They are also running a three-day seminar on business on the Internet and the Web, including material on XML, on 27 April-1 May (also in London). Details from David Hitchcock at TAL.
The Sixth International World Wide Web Conference (WWW '97) is being held on 7-12 April 1997 in Santa Clara, California.
There is a mailing list for those committed to developing components for XML, which is hypermailed at http://www.lists.ic.ac.uk/hypermail/xml-dev/. You can subscribe by sending a 1-line mail message to majordomo@ic.ac.uk saying:
subscribe xml-dev yourname@yoursite
Note that this list is for those people actively involved in developing resources for XML. It is not for general information about XML (see this FAQ) or for general discussion about SGML implementation and resources (see comp.text.sgml).
While XML is still being developed, there are no browsers, so right now you can't do anything. When browsers start to appear, you will be able to download them and use them to browse the Web much as you do with current software now.
If you want to start preparations for writing XML, see the questions in the Authors' Section.
Because authors and providers can design their own document types, browser presentation will be able to benefit from greatly improved facilities, both for graphical display and for performance.
Document types can be explicitly tailored to an audience, so the cumbersome fudging that has to take place with HTML to achieve special effects should become a thing of the past: authors and designers will be free to invent their own markup elements.
Information content can be richer and easier to use, because the hypertext linking abilities of XML are much greater than those of HTML.
Because XML removes many of the underlying complexities of SGML in favor of a more flexible model, writing programs to handle XML will be much easier than doing the same for full SGML.
Information will be much more accessible and reusable, because the more flexible markup of XML can be utilized by any XML software instead of being partly restricted to specific manufacturers as has become the case with HTML.
XML files remain fully conformant SGML, so they can be used outside the Web as well, in any normal SGML environment.
There aren't any yet, because the specification is still under development. As with HTML, there probably won't be just one browser, but many. Because the potential number of different XML applications is not limited, no single browser should be expected to handle 100% of everything.
Expect to see the generic parts of XML (eg parsing, tree management, searching, formatting, and the use of architectural forms) combined into a general-purpose browser library or toolkit to make it easier for developers to take a consistent line when writing XML applications.
Such applications could then be customised by adding semantics for specific markets, or using languages like Java to develop plugins for generic browsers and have the specialist modules delivered transparently over the Web.
No, existing SGML and HTML applications software will continue to work with existing files. But as with any enhanced facility, if you want to view or download and use XML files, you need to use XML-aware software.
Authors should also read the Developers' Section, which contains further information about the internals of XML files.
No, XML itself does not replace HTML: instead, it complements it by allowing you to define your own alternative to HTML (or extension to it).
HTML is expected to remain in common use for a long time to come.
XML documents can be very simple, with no formal document type declaration, and straightforward nested markup:
<?XML VERSION="1.0" RMD="NONE"?>
<conversation>
  <greeting n="1">Hello, world!</greeting>
  <response>Stop the planet, I want to get off!</response>
</conversation>
Or they can be more complicated, with a specific DTD and a local subset, and a more complex structure:
<?XML VERSION="1.0" RMD="ALL" ENCODING="UTF-8"?>
<!doctype titlepage system "typo.dtd"
  [<!entity % active.links "INCLUDE">]>
<titlepage>
  <whitespace type="vertical" amount="36">
  <title font="Baskerville" size="24/30" alignment="centered">Hello, world!</title>
  <whitespace type="vertical" amount="12">
  <!--* In some copies the following decoration is hand-colored, presumably by the author *-->
  <image location="http://www.foo.bar/fleuron.eps" type="URL" alignment="centered">
  <whitespace type="vertical" amount="24">
  <author font="Baskerville" size="18/22" style="italic">Munde Salutem</author>
</titlepage>
Or they can be anywhere between: a lot will depend on how you want to define your document type and what it will be used for.
You can use a stylesheet, like you can with HTML, to define the appearance, but the way in which an XML processor recognizes your markup depends on whether your file is valid or well-formed.
In both cases there are some simple rules to follow, and you may need to use a Document Type Definition which will both guide you in creating the document and guide the user's software in reading it.
Unlike some current ways of using HTML, you can't just make it up as you go along: for software to make sense of it, you need to follow a pattern or plan. In the case of a valid file, this is the DTD; in the case of a well-formed file, it has to be implicit in your markup.
Yes, there are two ways an HTML document can be legal XML: it can either be valid or it can be well-formed. See the question on XML document classes.
Many HTML authoring tools already produce well-formed XML document instances, and some of them may also be valid if there is a Document Type Definition (DTD) that can be fetched.
Yes, this is what the `well-formed' document class is there for:
<?XML VERSION="1.0" RMD="NONE"?>
<FAQ>
  <Q><IMAGE XML-LINK="ques.gif"/>Can I create my own XML documents without a DTD?</Q>
  <A>Yes. This is an example of a well-formed document, which can be parsed by any XML-compliant parser. However, it won't know how to display it unless you supply a stylesheet.</A>
</FAQ>
Yes, if they are unambiguous. XML processors will realize that in some applications precise white-space is critical, but in others that they need to collapse it: there are different rules for white-space handling in valid and well-formed documents, as well as ways to affect it.
<chapter>
  <section>
    <title>
      My title for Section 1.
    </title>
    <para>
      ...
    </para>
  </section>
</chapter>
In other words, do you really want those linebreaks and spaces before, after, and in the title, or are they just there to make it easier to edit or because it was machine-generated? Should it say <title>My title for Section 1.</title> with no breaks or space? There are requirements and recommendations for applications on the detection and handling of white-space in the Draft XML Specification.
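As a rough illustration of what `collapsing' white-space means for the title above, here is a minimal Python sketch. The function name collapse_whitespace is an assumption made for this example; it is not taken from the Draft XML Specification, whose own rules for valid and well-formed documents are what a real processor would follow.

```python
def collapse_whitespace(text: str) -> str:
    """Collapse every run of spaces, tabs, and line breaks to one space and trim both ends."""
    return " ".join(text.split())


# The title content as it might arrive with line breaks added by an editor.
raw_title = "\n      My title for Section 1.\n    "
print(collapse_whitespace(raw_title))
```

An application in which precise white-space is critical would simply skip this step and keep the text as it was received.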
Element names (start-tags and end-tags) are case-insensitive: you can use upper- or lower-case (or even MiXeD). Attribute names are also case-insensitive. Attribute values, however, may be case-sensitive, depending on context: you can specify which in a Document Type Definition. All entity names, and your data content (the text), are case-sensitive.
XML defines two levels of conformance, valid and well-formed.
If your file already conforms to one of the HTML Document Type Definitions (DTDs), then it is probably already valid, and you just need to state this at the top of the file:
<?XML VERSION="1.0" RMD="ALL"?>
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN" "http://www.foo.com/dtds/html2.dtd">
by placing an XML Declaration and Required Markup Declaration before the DOCTYPE declaration, and making sure an XML version of the DTD file and any others it refers to (eg a file of character entities like ISO Latin-1) are accessible via the DOCTYPE declaration (eg by supplying a URL for each as a SYSTEM identifier).
If your file does not conform to any of the available DTDs, then you need to make sure it follows the rules for well-formed files, by editing the file and placing an XML Declaration and Required Markup Declaration at the top:
<?XML VERSION="1.0" RMD="NONE"?>
<HTML><HEAD><TITLE>Test file</TITLE></HEAD>
<BODY><BLINK>Test text</BLINK></BODY></HTML>
Yes, provided a) the document has a valid Document Type Definition (DTD) which you can use; and b) the files are valid, not just well-formed.
Yes, but at the moment there is still a need for tutorials, simple tools, and examples of XML documents. Well-formed XML documents may look very similar to HTML except for some small but important points of syntax.
As every user community can have their own document type defined, it should be much easier to learn, because element names can be picked for relevance.
Yes, the XML Draft Specification explicitly makes reference to ISO 10646 and says that users `may extend the ISO 10646 character repertoire, in the rare cases where this is necessary, by making use of the private use areas [ . . . ] all XML processors must accept the UTF-8 and UCS-2 encodings of 10646; the mechanisms for signalling which of the two are in use, and for bringing other encodings into play, are [ . . . ] in the discussion of character encodings. Regardless of the specific encoding used, any character in the ISO 10646 character set may be referred to by the decimal or hexadecimal equivalent of its bit string': &#dddd; or &#u-hhhh; [from the DXS].
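To make the two reference forms concrete, the following Python sketch builds them for a single character. The function names are invented for this example, and padding the hexadecimal form to four digits is an assumption read off the `hhhh' pattern quoted above; the Draft XML Specification remains the authority on the exact syntax.

```python
def decimal_ref(ch: str) -> str:
    """Return the decimal form &#dddd; for a single character."""
    return f"&#{ord(ch)};"


def draft_hex_ref(ch: str) -> str:
    """Return the hexadecimal form &#u-hhhh; (four-digit padding is an assumption)."""
    return f"&#u-{ord(ch):04X};"


ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE
print(decimal_ref(ch))      # decimal reference
print(draft_hex_ref(ch))    # hexadecimal reference in the form quoted from the DXS
print(ch.encode("utf-8"))   # the same character in the UTF-8 encoding
```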
A DTD is usually a file (or several files together) which contains a formal definition of a particular type of document. This sets out what names can be used for elements, where they may occur (for example, <ITEM> might only be meaningful inside <LIST>), and how they all fit together. It lets processors parse a document and identify where each element comes, so that stylesheets, navigators, search engines, and other applications can be used.
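The <ITEM>-inside-<LIST> rule can be sketched in a few lines of Python. This is only a hand-written check of that single containment rule, not DTD validation: the element names DOC and PARA are hypothetical, and the parser used here is a present-day XML parser from the Python standard library rather than a processor for the Draft XML Specification.

```python
import xml.etree.ElementTree as ET


def misplaced_items(xml_text: str) -> list[str]:
    """Return the names of parent elements that wrongly contain an ITEM."""
    root = ET.fromstring(xml_text)
    problems = []
    if root.tag == "ITEM":
        problems.append("(document root)")
    for parent in root.iter():      # every element, including the root
        for child in parent:        # direct children only
            if child.tag == "ITEM" and parent.tag != "LIST":
                problems.append(parent.tag)
    return problems


good = "<DOC><LIST><ITEM>one</ITEM><ITEM>two</ITEM></LIST></DOC>"
bad = "<DOC><PARA><ITEM>stray</ITEM></PARA></DOC>"
print(misplaced_items(good))
print(misplaced_items(bad))
```

A real DTD expresses many such rules at once (permitted names, order, repetition, attributes), which is why a validating parser is driven by the DTD rather than by code like this.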
There are thousands of DTDs already in existence in all kinds of areas (see the SGML Web pages for examples). Many of them can be downloaded and used freely; or you can write your own. As with any language, you need to learn some of it [SGML] to do this: but XML is much simpler, see the list of restrictions which shows what has been cut out.
DTDs specifically for use on the Web may become commonplace, and different areas of interest may write their own for their own purposes: this is what XML is for.
The linking abilities of XML systems are much more powerful than those of HTML. Existing HREF-style links will remain usable, but new linking technology (similar to that proposed by the TEI) will let you manage bidirectional and multi-way links, and link to a span of text within your own or other documents rather than to a single point.
At the time of this version, many of the linking facilities are still under discussion.
There are already two XML parsers (written in Java) which can be used to check that files conform to the Draft XML Specification:
Norbert Mikula's NXP at http://www.edu.uni-klu.ac.at/~nmikula/NXP/
Tim Bray's Lark at http://www.textuality.com/Lark/
Right here (http://www.w3.org/pub/WWW/TR/). Includes EBNF.
Valid XML files are those which have a Document Type Definition (DTD) like all other SGML applications, and which adhere to it.
A valid file begins like any normal SGML file with a Document Type Declaration, but has an XML Declaration prepended:
<?XML VERSION="1.0"?>
<!doctype foo system "http://www.foo.org/bar.dtd">
<foo>
  <blort> . . . </blort>
</foo>
and the XML version of the specified DTD must be accessible to the XML processor, either known locally, or retrievable via the network.
If there is an internal DTD subset, this may be referenced by an `INTERNAL' Required Markup Declaration (RMD) in the XML Declaration:
<?XML VERSION="1.0" RMD="INTERNAL"?>
<!doctype foo [
  <!element blort - - (#PCDATA)>
]>
<foo>
  <blort> . . . </blort>
</foo>
Well-formed XML files can get away without a DTD, but in that case they must follow some simple rules to enable a browser to parse the file (so that it can apply your stylesheet, enable linking, etc):
the file must start with a Required Markup Declaration, saying that there is no DTD:
<?XML VERSION="1.0" RMD="NONE"?>
<foo>
  <blort> . . . </blort>
</foo>
all tags must be balanced: that is, all elements must have both start- and end-tags (with one exception, see below)
all attribute values must be in quotes (the single-quote character [the apostrophe] may be used if the value contains a double-quote character, and vice versa)
any EMPTY elements (eg those with no end-tag like HTML's <IMG>, <HR>, and <BR> and others) must either end with `/>' (eg <img src="foo.gif"/>) or you have to make them non-EMPTY by adding a real end-tag. Example: <BR> would become either <BR/> or <BR></BR>.
there must not be any markup characters (< or &) in the character data (ie they must be escaped as &lt; and &amp;)
elements must nest inside each other properly (no overlapping markup, same rule as for regular SGML).
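The rules above can be tried out with a short Python sketch. It relies on a present-day XML parser from the Python standard library, which implements XML as later standardized and does not understand the draft XML Declaration with its RMD, so the samples leave the declaration out; the function name and sample strings are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts the text, False if it reports a parse error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


samples = {
    "balanced tags": "<foo><blort>text</blort></foo>",
    "empty element closed with />": "<p>line<BR/>break</p>",
    "empty element with no end-tag": "<p>line<BR>break</p>",
    "unquoted attribute value": "<img src=foo.gif/>",
    "raw ampersand in character data": "<p>fish & chips</p>",
    "overlapping elements": "<p><b>bold<i>both</b>italic</i></p>",
}
for label, text in samples.items():
    print(label, is_well_formed(text))
```

Only the first two samples pass; each of the others breaks exactly one of the rules listed above.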
Processing Instructions are SGML's way of adding `what to do' and `how to do it' details to a file. In XML every file must begin with an XML Declaration which starts with `<?' and the keyword XML, and ends with `?>' (slightly different from plain SGML, which omits the final question-mark).
The XML Declaration must include the version number of XML being followed, and may include a Required Markup Declaration and an Encoding Declaration:
<?XML VERSION="1.0" RMD="ALL" ENCODING="UTF-8"?>
The principal changes are in what you can do in writing a Document Type Definition (DTD). To simplify the syntax and make it easier to write processing software, the following markup declaration restrictions have been picked for XML:
No comments (--* *--) inside markup declarations
Comment declarations can't have spaces within the markup of <!--* or *-->
Comment declarations can't jump in and out of comments with *-- --*
No name groups for declaring multiple elements or element attlists
No CDATA or RCDATA declared content
No exclusions or inclusions on content models
Mixed content models must be optional-repeatable ORs, with #PCDATA first
No AND (&) content model groups
No NAME[S], NUMBER[S], or NUTOKEN[S] declared values
No #CURRENT or #CONREF declared values
Attribute default values must be quoted
Marked sections can't have spaces within the markup of <![keyword[ or ]]>
No RCDATA, TEMP, IGNORE, or INCLUDE marked sections in instance
Marked sections in instance must have CDATA keyword, not parameter entity
No SDATA, CDATA, or bracketed internal entities
No SUBDOC, CDATA, or SDATA external entities
No public identifiers in entity and notation declarations (this is being changed to permit them)
No data attributes on NOTATIONs or attribute value specifications on ENTITY declarations
No SHORTREF declarations
No USEMAP declarations
No LINKTYPE declarations
No LINK declarations
No USELINK declarations
No IDLINK declarations
No SGML declarations
The astute reader will have noticed a change in the syntax for a comment. XML comments add an asterisk after the opening double dash and before the closing one, so they are in the form:
<!--* comment text *-->
The `asterisk-double-dash' sequence is therefore illegal in comment text as it is the terminator. Spaces are not allowed between either of the angle brackets, the exclamation mark, or any of the dashes or asterisks: they are only valid within the comment text.
If you want to use existing SGML DTDs and entity files for XML, they will need to be edited to conform to the above requirements, but this only has to be done once. When the list has been finalized, it is likely that suitably-modified versions of the popular DTDs and character entity sets (eg the `ISO' files like ISOlat1) will be made available for use with XML.
Only to serve up .xml files as the correct MIME type. The XML project is submitting a MIME type of text/xml for approval, so for serving XML documents all that is needed is to edit the mime-types file (or its equivalent) and add the line
text/xml xml XML
However, more sophisticated applications may require HTTP content negotiation to determine what tools the client has for display. Also, since XML is designed to support stylesheets and sophisticated hyperlinking, XML documents may be accompanied by ancillary files such as DTDs, entity files, catalogs, stylesheets, etc, which may need their own MIME entry, and which require placing in the appropriate directories.
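For a server or script written in Python, the equivalent of the mime-types entry can be sketched with the standard mimetypes module. This is only an illustration under assumptions: the file name faq.xml and the helper content_type_for are invented for the example, and a real server would normally take its mapping from its own configuration file.

```python
import mimetypes

# A private table, so the example does not depend on system-wide mime-types files.
db = mimetypes.MimeTypes()
db.add_type("text/xml", ".xml")  # same mapping as the mime-types line above


def content_type_for(filename: str) -> str:
    """Return the MIME type to send for a file, falling back to a generic binary type."""
    guessed, _encoding = db.guess_type(filename)
    return guessed or "application/octet-stream"


print(content_type_for("faq.xml"))
```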
If you run scripts generating HTML, which you wish to work with XML, they will need to be modified to produce the relevant document type.
Yes, so long as what they generate ends up as part of an XML-conformant file (ie either valid or well-formed).
However, some files containing embedded calls to external procedures which get invoked before transmission, such as the NCSA's `special' HTML, need to be checked carefully, to make sure that they do not contain raw markup characters (ie angle brackets and ampersands) which might confuse editors and other processing software. For example,
<!-- #exec cmd="tr '\012' '\040' <foo.bar"-->
which in a .shtml file embeds the contents of foo.bar in the output stream, with all newlines changed to spaces, needs to be written as
<!-- #exec cmd="tr '\012' '\040' &lt;foo.bar"-->
The same rule applies as for server-side INCLUDEs, so you need to ensure that any embedded code which gets passed to a third-party engine (eg SDQL enquiries, Java writes, LiveWire requests, etc) does not contain any characters which might be misinterpreted as XML markup (ie no angle brackets or ampersands): either use a CDATA marked section to avoid your XML application parsing the embedded code, or use the standard &lt;, &gt;, and &amp; character entity references instead.
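Both approaches (entity references or a CDATA marked section) can be sketched in Python for a script that generates the file. The escape function is part of the standard xml.sax.saxutils module; the helper as_cdata and the sample code string are assumptions made for this example.

```python
from xml.sax.saxutils import escape


def as_cdata(code: str) -> str:
    """Wrap code in a CDATA marked section; the terminator ]]> must not occur inside it."""
    if "]]>" in code:
        raise ValueError("code contains the CDATA terminator ]]>")
    return "<![CDATA[" + code + "]]>"


embedded = "if (a < b && b > c) run();"
print(escape(embedded))    # angle brackets and ampersands become entity references
print(as_cdata(embedded))  # or the code is left untouched inside a marked section
```

Escaping keeps the file readable by any processor, while the marked section keeps the embedded code readable by people; which is preferable depends on who maintains the file.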
For implementation to succeed, the terminology needs to be precise (for example `element' and `tag' are not synonymous: an element is a whole unit of markup, and may consist of a start-tag alone (as in HTML's <BR>) or a start-tag and an end-tag and the content which goes between them; tags are simply the markers at the start and end of elements). Sloppy terminology causes misunderstandings.
Those new to SGML may want to read something like the Gentle Introduction to SGML chapter of the TEI.
Not yet, although many aspects of development software are being worked on: see the question on XML software.