[Mirrored from: http://www.ucc.ie/xml/, March 31, 1997]

Commonly Asked Questions about the Extensible Markup Language

The XML FAQ

Draft Version 0.5 (1 April 1997)

Maintained on behalf of the W3C SGML Working Group by Peter Flynn (Silmaril Consultants), with the collaboration of Terry Allen (Fujitsu, Inc), Tom Borgman (Interleaf, Inc), Tim Bray (Textuality, Inc), Robin Cover (Summer Institute of Linguistics), Christopher Maden (EBT, Inc), Eve Maler (Arbortext, Inc), Peter Murray-Rust (Nottingham University), Liam Quin (Softquad, Inc), Michael Sperberg-McQueen (University of Illinois at Chicago), Joel Weber (MIT), and many other members of the SGML Working Group of the W3C

Revision history

0.1 (31 January 1997) PF (First draft. Sample questions devised by participants.)
0.2 (3 February 1997) PF (Revised draft. Additional questions and answers.)
0.3 (17 February 1997) PF (Extensive revision following comments from the group. Changes to markup and organisation.)
0.4 (23 February 1997) PF (Minor editorial changes)
0.5 (1 April 1997) PF (Added Multidoc Pro as SGML browser; question on XML math; fixed ambiguity in explanation of NETs; added JUMBO; ERB changes of March 26; more details of linking and tools; adding element declaration minimization to the forbidden list.)

Paragraphs which have been added since the last version are shown in magenta and prefixed with a pilcrow (¶). Paragraphs which have been changed since the last version are also shown in magenta but are prefixed with a section sign (§). Paragraphs marked for future deletion but retained at the moment for information are shown in pale gray and are prefixed with a plus/minus sign (±).

Summary

This document contains the most frequently-asked questions (with answers) about XML, the Extensible Markup Language. It is intended as a guide for users, developers, and the interested reader, and should not be regarded as a part of the XML Draft Specification.

Organization

The FAQ is divided into four parts: a) General, b) User, c) Author, and d) Developer. The questions are numbered independently within each section. As the numbering may therefore change with each version, comments and suggestions should refer to the version number (see Revision History above) as well as the part and question number.

There is a form at the end of this document which you can use to submit bug reports, suggestions for improvement, and other comments relating to this FAQ only. Comments about the XML Draft Specification itself should be sent to the W3C.

Availability

§ The SGML file for use with any conforming system is available at http://www.ucc.ie/xml/faq.sgml (this can also be used with SGML browsers like Panorama or Multidoc Pro by downloading the DTD and stylesheet). The same text in an HTML version, for use with an HTML browser (eg Netscape Navigator, Microsoft Internet Explorer, Spry Mosaic, NCSA Mosaic, Lynx, etc), is at http://www.ucc.ie/xml/. A plaintext (ASCII) version is available from the Web or by anonymous FTP to one of several FAQ repositories. The versions above are also available by electronic mail to the WebMail server (for users with email-only access). For printed copies there is a PostScript™ version, or you can have it on flattened dead trees by sending $10 (or equivalent) to the editor (email first to check currency and postal address).


The Questions

A. General questions

A.1   What is XML?

A.2   What is XML for?

A.3   What is SGML?

A.4   What is HTML?

A.5   Aren't XML, SGML, and HTML all the same thing?

A.6   Who is responsible for XML?

A.7   Why is XML such an important development?

A.8   How does XML make SGML simpler and still let you define your own document types?

A.9   Why not just carry on extending HTML?

A.10   Why do we need all this SGML stuff? Why not just use Word or Notes?

A.11   What's the relationship between XML and SGML and HTML?

A.12   Where do I find more about XML?

A.13   Where can I discuss implementation and development of XML?

B. Users of SGML (including browsers of HTML)

B.1   Do I have to do anything to use XML?

B.2   What can XML offer that current Web technology can't?

B.3   Where can I get an XML browser?

B.4   Do I have to switch from SGML or HTML to XML?

C. Authors of SGML (including writers of HTML)

C.1   Does XML replace HTML?

C.2   What does an XML document look like inside?

C.3   If I can define it all myself, how does XML know what it all means?

C.4   Is an HTML document legal XML?

C.5   Can I create my own XML documents without explicitly defining a document type?

C.6   Can I prettyprint my XML documents?

C.7   Which parts of an XML document are case-sensitive?

C.8   How can I make my existing HTML files work in XML?

C.9   If XML is just a `subset' of SGML, can I use XML files directly with SGML tools?

C.10   I'm used to authoring and serving HTML. Can I learn XML easily?

C.11   Will XML be able to use non-Latin characters?

C.12   What's a Document Type Definition (DTD) and where do I get one?

C.13   How will XML affect my document links?

C.14   What XML software can I use today?

C.15   Can I do mathematics using XML?

D. Developers and Implementors (including WebMasters and server operators)

D.1   Where's the spec?

D.2   What's this difference between `valid' and `well-formed'?

D.3   What are all these Processing Instructions?

D.4   What else has changed between SGML and XML?

D.5   Do I have to change any of my server software to work with XML?

D.6   Can I still use server-side INCLUDEs (`part-generated' content)?

D.7   Can I (and my authors) still use client-side INCLUDEs?

D.8   I'm trying to understand the XML Spec: why does SGML (and XML) have such difficult terminology?

D.9   Is there a Developer's API kit for XML?


The Answers


A. General questions

A.1   What is XML?

XML stands for `Extensible Markup Language' (extensible because it is not a fixed format like HTML) and it is designed to enable the use of SGML on the World-Wide Web.

A regular markup language defines what you can do (or what you have done) in the way of describing information for a fixed class of documents. XML goes beyond this and allows you to define your own customized markup language. It can do this because it's an application profile of SGML, a metalanguage, a language for describing languages.

A.2   What is XML for?

XML is intended to be a standard `to make it easy and straightforward to use SGML on the Web: easy to define document types, easy to author and manage SGML-defined documents, and easy to transmit and share them across the Web.'

It defines `an extremely simple dialect of SGML which is completely described in the Draft XML Specification (DXS). The goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML.'

`For this reason, XML has been designed for ease of implementation, and for interoperability with both SGML and HTML' [quotes from the DXS].

A.3   What is SGML?

SGML is the Standard Generalized Markup Language (ISO 8879), the international standard system for defining, identifying and using the structure and content of documents.

A.4   What is HTML?

HTML is the HyperText Markup Language (RFC 1866), a specific application of SGML used in the World-Wide Web.

A.5   Aren't XML, SGML, and HTML all the same thing?

Not quite. SGML is the `mother tongue', used for describing thousands of different types of document.

HTML is just one document type, used in the Web. It defines a fixed type of document with markup to let you describe a common class of simple office-style report, with headings, paragraphs, lists, illustrations, etc, and some provision for hypertext and multimedia.

XML is an abbreviated version of SGML, to make it easier for you to define your own document types, and to make it easier for programmers to write programs to handle them.

A.6   Who is responsible for XML?

XML is a project of the World-Wide Web Consortium (W3C), and the development of the specification is being supervised by their SGML Editorial Review Board (ERB). The work of definition and specification is being done by a Working Group appointed by the ERB, with co-opted contributors and experts from various fields.

A.7   Why is XML such an important development?

It removes two constraints which are holding back Web developments: dependence on a single document type (see HTML above), and the complexity of the system on which that type is based (SGML), whose syntax allows many powerful but complex options. XML both simplifies the levels of optionality in SGML, and allows the development of user-defined document types on the Web.

A.8   How does XML make SGML simpler and still let you define your own document types?

XML redefines some of SGML's internal qualities and quantities, and removes a large number of the more complex and sometimes less-used features which made it harder to write processing programs (see the list in the question on what else has changed between SGML and XML).

It also introduces a new class of document which does not require the formal declaration of a predefined document type. See the questions about document type declarations, `valid' vs `well-formed' documents, and how to define your own document types in the Developers' Section.

A.9   Why not just carry on extending HTML?

HTML is already overburdened with dozens of interesting but often incompatible inventions from different manufacturers, because it provides only one way of describing your information.

XML will allow groups of people or organisations to create their own customised markup languages for exchanging information in their domain (music, chemistry, electronics, hill-walking, finance, surfing, linguistics, knitting, history, engineering, rabbit-keeping etc).

HTML is at the limit of its usefulness as a way of describing information, and while it will continue to play an important role for the content it currently represents, many new applications require a more robust and flexible infrastructure.

A.10   Why do we need all this SGML stuff? Why not just use Word or Notes?

Information on a network which connects many different types of computer has to be usable on all of them. Public information cannot afford to be restricted to one make or model or manufacturer, or to cede control of its data format to private hands. It is also helpful for such information to be in a form that can be reused in many different ways, as this can minimize wasted time and effort.

SGML is the international standard which is used for defining this kind of application, but those who need an alternative based on different software are entirely free to implement similar services using such a system, especially if it is for non-public use.

A.11   What's the relationship between XML and SGML and HTML?

XML is a subset of SGML: it omits some more complex and some less-used parts in return for the benefits of being easier to write applications for, easier to understand, and more suited to delivery and interoperability over the Web. But it is still SGML, and XML files can be parsed and validated the same as any other SGML file (see the question on XML software).

XML processing software (browsers, formatters, search engines, etc) will be able to handle all sorts of new applications, not just HTML.

Programmers may find it useful to think of XML as being SGML-- rather than HTML++.

A.12   Where do I find more about XML?

Online, there's the XML Draft Specification available from the W3C; a brief summary of XML with an extensive list of online reference material in Robin Cover's SGML pages; and a summary and condensed FAQ from Tim Bray.

± The GCA and SGML Open are conducting a three-day conference on The new publishing business case: XML, SGML, and the Internet in San Diego, California, on 10-12 March 1997. Further details from Julie Morrison Desmond or from the GCA's Web site.

Technology Appraisals Ltd are holding a three-day conference in London, England, on XML and network publishing technologies, XML ready for prime time?, and The XML Technology Bootstrap on 21-23 April 1997. They are also running a three-day seminar on business on the Internet and the Web, including material on XML, on 27 April-1 May (also in London). Details from David Hitchcock at TAL.

The Sixth International World Wide Web Conference (WWW '97) is being held on 7-12 April 1997 in Santa Clara, California.

The Irish SGML Users Group are holding a meeting entitled XML - Extending the World Wide Web with guest speaker Jon Bosak (Sun) on April 30, 2-5pm in Dublin (location to be announced).

A.13   Where can I discuss implementation and development of XML?

There is a mailing list called xml-dev for those committed to developing components for XML. You can subscribe by sending a 1-line mail message to majordomo@ic.ac.uk saying:
subscribe xml-dev yourname@yoursite
The list is hypermailed for online reference at http://www.lists.ic.ac.uk/hypermail/xml-dev/.

Note that this list is for those people actively involved in developing resources for XML. It is not for general information about XML (see this FAQ and other sources) or for general discussion about SGML implementation and resources (see comp.text.sgml).


B. Users of SGML (including browsers of HTML)

B.1   Do I have to do anything to use XML?

§ XML is still being developed, but there are already some pilot browsers, so you can experiment with them. When the specification is more complete, more browsers should start to appear, and you may be able to download them and use them to browse the Web much as you do with current software now.

You can use the browsers to look at some of the emerging XML material, such as Jon Bosak's Shakespeare plays.

If you want to start preparations for writing XML, see the questions in the Authors' Section.

B.2   What can XML offer that current Web technology can't?

Because authors and providers can design their own document types, browser presentation will be able to benefit from greatly improved facilities, both for graphical display and for performance.

Document types can be explicitly tailored to an audience, so the cumbersome fudging that has to take place with HTML to achieve special effects should become a thing of the past: authors and designers will be free to invent their own markup elements.

Information content can be richer and easier to use, because the hypertext linking abilities of XML are much greater than those of HTML.

Because XML removes many of the underlying complexities of SGML in favor of a more flexible model, writing programs to handle XML will be much easier than doing the same for full SGML.

Information will be much more accessible and reusable, because the more flexible markup of XML can be utilized by any XML software instead of being partly restricted to specific manufacturers as has become the case with HTML.

XML files remain fully conformant SGML, so they can be used outside the Web as well, in any normal SGML environment.

B.3   Where can I get an XML browser?

§ There are already some browsers emerging (see below), but the XML specification is still under development. As with HTML, there won't be just one browser, but many. However, because the potential number of different XML applications is not limited, no single browser should be expected to handle 100% of everything.

Expect to see the generic parts of XML (eg parsing, tree management, searching, formatting, and the use of architectural forms) combined into a general-purpose browser library or toolkit to make it easier for developers to take a consistent line when writing XML applications. Such applications could then be customised by adding semantics for specific markets, or using languages like Java to develop plugins for generic browsers and have the specialist modules delivered transparently over the Web.

  • JUMBO is a prototype GUI browser/editor/search/rendering tool for the output of XML parsers. It displays the abstract document tree which can be queried and edited in limited fashion. Java classes can be dynamically loaded for the current DTD and allow complex transformation and rendering. The emphasis is on the import of legacy files into structured documents, and the management of non-textual data, including common data structures (trees, tables, lists, etc). Currently JUMBO parses a subset of XML files (ie only elements and their attributes) and will be grafted onto other parsers as soon as possible. The software and a wide range of XML demo files, including Jon Bosak's PLAY, can be downloaded for any Java-enabled browser from http://www.venus.co.uk/omf/cml/

  • Inso Corporation (owners of EBT) are reported to have demonstrated pilot XML software using DynaWeb at the GCA's XML Conference in San Diego (March 1997). [Details do not appear to be available on the Web: if anyone has information on this, please use the form at the end of this FAQ.]

B.4   Do I have to switch from SGML or HTML to XML?

No, existing SGML and HTML applications software will continue to work with existing files. But as with any enhanced facility, if you want to view or download and use XML files, you will need to use XML-aware software.


C. Authors of SGML (including writers of HTML)

Authors should also read the Developers' Section, which contains further information about the internals of XML files.

C.1   Does XML replace HTML?

No, XML itself does not replace HTML: instead, it complements it by allowing you to define your own alternative to HTML (or extension to it).

HTML is expected to remain in common use for some time to come.

C.2   What does an XML document look like inside?

XML documents can be very simple, with no formal document type declaration, and straightforward nested markup of your own design:

<?XML VERSION="1.0" RMD="NONE"?>
<conversation>
  <greeting n="1">Hello, world!</greeting>
  <response>Stop the planet, I want to get off!</response>
</conversation>

Or they can be more complicated, with a specific DTD and maybe a local subset, and a more complex structure:

<?XML VERSION="1.0" RMD="ALL" ENCODING="UTF-8"?>
<!doctype titlepage system "typo.dtd" 
[<!entity % active.links "INCLUDE">]>
<titlepage>
  <whitespace type="vertical" amount="36"/>
  <title font="Baskerville" size="24/30" 
         alignment="centered">Hello, world!</title>
  <whitespace type="vertical" amount="12"/>
  <!--* In some copies the following decoration is 
        hand-colored, presumably by the author *-->
  <image location="http://www.foo.bar/fleuron.eps" type="URL" alignment="centered"/>
  <whitespace type="vertical" amount="24"/>
  <author font="Baskerville" size="18/22" style="italic">Munde Salutem</author>
</titlepage>

Or they can be anywhere between: a lot will depend on how you want to define your document type and what it will be used for. See the question on valid and well-formed files.
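
As a rough illustration of how software sees such a file, here is a minimal Python sketch that reads the first example above with the standard-library xml.etree.ElementTree module. That parser follows the later XML 1.0 syntax rather than this draft, so the draft-style XML Declaration is left out of the sample text (an assumption made only to keep the example runnable). The function name outline is invented for this sketch.

```python
import xml.etree.ElementTree as ET

# The first example from this answer, minus the draft-style XML Declaration
# (the standard-library parser does not accept that draft syntax).
DOC = """<conversation>
  <greeting n="1">Hello, world!</greeting>
  <response>Stop the planet, I want to get off!</response>
</conversation>"""

def outline(xml_text):
    """Return (element name, attributes, text) for each child of the root element."""
    root = ET.fromstring(xml_text)
    return [(child.tag, dict(child.attrib), child.text) for child in root]

for name, attrs, text in outline(DOC):
    print(name, attrs, text)
```

The parser knows nothing about what a greeting or a response means: it only reports the names, attributes, and text in the order and nesting in which they were written.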

C.3   If I can define it all myself, how does XML know what it all means?

You can use a stylesheet, like you can with HTML, to define the appearance, but the way in which an XML processor recognizes your markup depends on whether your file is valid or well-formed.

In both cases there are some simple rules to follow, and you may need to use a Document Type Definition (DTD) which will both guide you in creating the document and guide the user's software in reading it. If there isn't a suitable DTD in existence for your type of document, you can write one of your own.

Unlike some current ways of using HTML, you can't just make it up as you go along and hope that any old tag will do: for software to make sense of it, you need to follow a pattern or plan. In the case of a valid file, this is the DTD; in the case of a well-formed file, the structure has to be implicit in your markup.

C.4   Is an HTML document legal XML?

§ Yes, almost. There are two ways an HTML document can be made legal XML: it can either be valid or it can be well-formed.

Many HTML authoring tools already produce almost well-formed XML document instances, and some of them may also be nearly valid if there is a Document Type Definition (DTD).

The differences are small but significant (see the question on XML document classes). Existing HTML browsers are tolerant of invalid markup, so they will probably display XML files which use an XML version of an HTML DTD, or which are simply well-formed HTML, even though there are some slight differences. See the question on how to make existing HTML files work in XML.

C.5   Can I create my own XML documents without explicitly defining a document type?

Yes, this is what the `well-formed' document class is there for:

<?XML VERSION="1.0" RMD="NONE"?>
<FAQ>
<Q><IMAGE XML-LINK="ques.gif"/>Can I create my own XML documents 
without a DTD?</Q>
<A>Yes.  This is an example of a well-formed document, which can be
parsed by any XML-compliant parser. However, it won't know how to
display it unless you supply a stylesheet.</A>
</FAQ>

Note the features that make this file well-formed: properly balanced, nested elements; all start-tags and end-tags present for elements which contain text data; and a trailing slash on elements defined EMPTY.

C.6   Can I prettyprint my XML documents?

Yes, if they are unambiguous. In some applications precise white-space is critical, while in others it needs to be collapsed, and XML processors have to tell the two apart: there are different rules for white-space handling in valid and well-formed documents, as well as ways to affect it.

<chapter>
  <section>
    <title>
      My title for Section 
1.
    </title>
    <para>
      ...
    </para>
  </section>
</chapter>

§ In other words, do you really want those linebreaks and spaces before, after, and in the title, or are they just there to make it easier to edit or because it was machine-generated? Should it say <title>My title for Section 1.</title> with no linebreaks or space? There are requirements and recommendations for applications on the detection and handling of white-space in the Draft XML Specification. XML instances may not rely on whitespace being ignored in element content.
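
To make the question concrete, the following Python sketch (standard library only) shows that a parser hands the application every linebreak and space in the title, and how an application which decides that the white-space is not significant might collapse it. The helper name collapse is an invention for this example, not something defined by the Draft XML Specification, and the parser used follows the later XML 1.0 rules rather than this draft.

```python
import xml.etree.ElementTree as ET

DOC = """<chapter>
  <section>
    <title>
      My title for Section
1.
    </title>
  </section>
</chapter>"""

def collapse(text):
    """Collapse each run of white-space to one space and trim both ends."""
    return " ".join(text.split())

title = ET.fromstring(DOC).find("section/title")
raw = title.text                 # as written: linebreaks and indentation included
print(repr(raw))
print(repr(collapse(raw)))
```

Whether collapsing is the right thing to do is a decision for the application (or the stylesheet), which is exactly why the markup should not leave it ambiguous.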

C.7   Which parts of an XML document are case-sensitive?

Element names (start-tags and end-tags) are case-insensitive: you can use upper- or lower-case (or even MiXeD). Attribute names are also case-insensitive. Attribute values, however, may be case-sensitive, depending on context: you can specify which in a Document Type Definition. All entity names, and your data content (the text), are case-sensitive.

C.8   How can I make my existing HTML files work in XML?

XML defines two levels of conformance, valid and well-formed.

§ If your file already conforms to one of the HTML Document Type Definitions (DTDs), then it may be close to being valid. Three things need changing:

  • You need to include the Document Type Declaration:

    <?XML VERSION="1.0" RMD="ALL" ENCODING="UTF-8"?>
    <!DOCTYPE HTML SYSTEM "http://www.foo.com/myfiles/html3x.dtd">

    The first line shown is the optional XML Declaration with a Required Markup Declaration, which may precede the Document Type Declaration (the values given here are the defaults).

  • Any DTD you reference must be an XML version, as must any other entities the DTD refers to (eg files of character entities like ISO Latin-1). They must all be accessible either through the network or from the user's local disk (eg by supplying a URL or filename for each in their SYSTEM identifiers).

  • The file itself must be well-formed (see below).

§ If your file does not conform to any of the available DTDs, then you can make it well-formed. You must make sure it follows the rules for well-formed files, by editing the file and making the necessary changes. Then place an XML Declaration containing a Required Markup Declaration at the top:

<?XML VERSION="1.0" RMD="NONE"?>
<HTML><HEAD><TITLE>Test file</TITLE></HEAD>
<BODY><BLINK>Test text <IMG SRC="foo.gif" alt="A foo"/></BLINK>
</BODY></HTML>

This lets you omit the DTD so long as the file is well-formed.
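
One of the necessary changes, adding the trailing slash to EMPTY elements, is mechanical enough to sketch in a few lines of Python. This is only an illustration under stated assumptions: the list of element names is a small sample, the function name close_empty_elements is invented here, and the pattern does not cope with a `>' inside an attribute value. It does nothing about the other rules (omitted end-tags, unquoted attribute values), and the XML Declaration is left off the test text because the standard-library parser used for checking follows the later XML 1.0 syntax.

```python
import re
import xml.etree.ElementTree as ET

# HTML elements that never have an end-tag (an illustrative subset, not a complete list).
EMPTY = ("IMG", "HR", "BR")

_EMPTY_TAG = re.compile(r"<(%s)\b([^<>]*?)\s*/?>" % "|".join(EMPTY), re.IGNORECASE)

def close_empty_elements(html_text):
    """Rewrite <BR> or <IMG ...> as <BR/> or <IMG .../>; tags already closed stay as they are."""
    return _EMPTY_TAG.sub(lambda m: "<%s%s/>" % (m.group(1), m.group(2)), html_text)

html = '<HTML><BODY><P>Line one<BR>Line two <IMG SRC="foo.gif" ALT="A foo"></P></BODY></HTML>'
fixed = close_empty_elements(html)
print(fixed)
ET.fromstring(fixed)   # raises ParseError if the result is still not well-formed
```

For anything beyond a quick experiment, a tool that really parses the HTML is a safer route than pattern matching.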

C.9   If XML is just a `subset' of SGML, can I use XML files directly with SGML tools?

§ Yes, provided a) the document has a valid Document Type Definition (DTD) which you can use; and b) the files are valid, not just well-formed. But at the moment there are few tools which handle XML files unchanged because of the format of EMPTY elements. This is expected to change soon.

C.10   I'm used to authoring and serving HTML. Can I learn XML easily?

§ Yes, but at the moment there is still a need for tutorials, simple tools, and more examples of XML documents. Well-formed XML documents may look similar to HTML except for some small but very important points of syntax.

As every user community can have their own document type defined, it should be much easier to learn, because element names can be picked for relevance.

C.11   Will XML be able to use non-Latin characters?

Yes, the XML Draft Specification explicitly makes reference to ISO 10646 and says that users `may extend the ISO 10646 character repertoire, in the rare cases where this is necessary, by making use of the private use areas[ . . . ]all XML processors must accept the UTF-8 and UCS-2 encodings of 10646; the mechanisms for signalling which of the two are in use, and for bringing other encodings into play, are[ . . . ]in the discussion of character encodings. Regardless of the specific encoding used, any character in the ISO 10646 character set may be referred to by the decimal or hexadecimal equivalent of its bit string': &#dddd; or &#u-hhhh; [from the DXS].
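
The decimal form of such a reference is easy to try out. The Python sketch below builds a &#dddd; reference for one non-Latin character and checks that a parser turns it back into that character. The function name decimal_reference is invented for this example. Only the decimal form is exercised, because the standard-library parser follows the later XML 1.0 syntax and should not be assumed to accept the draft's hexadecimal spelling.

```python
import xml.etree.ElementTree as ET

def decimal_reference(char):
    """Return the decimal character reference (&#dddd;) for a single character."""
    return "&#%d;" % ord(char)

alpha = "\u03b1"                       # GREEK SMALL LETTER ALPHA
ref = decimal_reference(alpha)
doc = "<word>%s</word>" % ref          # a document containing only ASCII characters
print(ref)
print(ET.fromstring(doc).text == alpha)
```

Because the reference itself is plain ASCII, the file survives any transport that handles ASCII, whatever encoding the rest of the system uses.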

C.12   What's a Document Type Definition (DTD) and where do I get one?

A DTD is usually a file (or several files together) which contains a formal definition of a particular type of document. This sets out what names can be used for elements, where they may occur (for example, <ITEM> might only be meaningful inside <LIST>), and how they all fit together. It lets processors parse a document and identify where each element comes, so that stylesheets, navigators, search engines, and other applications can be used.

There are thousands of DTDs already in existence in all kinds of areas (see the SGML Web pages for examples). Many of them can be downloaded and used freely; or you can write your own. As with any language, you need to learn some of it [SGML] to do this: but XML is much simpler (see the list of restrictions, which shows what has been cut out).

DTDs specifically for use on the Web may become commonplace, and people in different areas of interest may write their own for their own purposes: this is what XML is for.
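
To give a feel for the kind of rule a DTD expresses, here is a small Python sketch of the <ITEM>-inside-<LIST> example mentioned above. It is emphatically not a DTD validator: the rule table ALLOWED_PARENTS and the function misplaced are hand-rolled stand-ins invented for this illustration, and a real DTD also governs order, repetition, attributes, and entities.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table standing in for a DTD: element name -> permitted parent names.
ALLOWED_PARENTS = {"ITEM": {"LIST"}}

def misplaced(xml_text, allowed_parents=ALLOWED_PARENTS):
    """Return the names of elements found under a parent which the table does not permit."""
    root = ET.fromstring(xml_text)
    problems = []
    for parent in root.iter():
        for child in parent:
            permitted = allowed_parents.get(child.tag)
            if permitted is not None and parent.tag not in permitted:
                problems.append(child.tag)
    return problems

good = "<DOC><LIST><ITEM>one</ITEM><ITEM>two</ITEM></LIST></DOC>"
bad = "<DOC><PARA><ITEM>stray</ITEM></PARA></DOC>"
print(misplaced(good))
print(misplaced(bad))
```

Both sample documents are well-formed; only the first follows the rule, which is the difference a validating processor working from a DTD would report.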

C.13   How will XML affect my document links?

§ The linking abilities of XML systems are much more powerful than those of HTML. Existing HREF-style links will remain usable, but new linking technology is based on the lessons learned in the development of other hypertext standards, such as TEI and HyTime, which will let you manage bidirectional and multi-way links, as well as links to a span of text (within your own or other documents) rather than to a single point. This is already implemented for SGML in browsers like Panorama and Multidoc Pro.

At the time of compiling this version, some of the linking facilities are still under discussion. Current discussions center around the use of HTML-style links according to the URL and URI specifications (RFCs 1738 and 1630) with the TEI's Extended Pointer Notation. This would allow references such as

 http://www.uic.edu/x/y/z.xml?/tei/id(p23)child(1,emph)
 http://www.uic.edu/x/y/z.xml#/tei/id(p23)child(1,emph)
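
The two forms differ only in the separator, but in generic URL terms that matters: what follows `?' is a query, which is sent to the server, while what follows `#' is a fragment, which the client resolves after fetching the file. The Python sketch below simply splits the two sample references with the standard library to show where the pointer ends up. The function name pointer_parts is invented here, and how XML linking will finally use the distinction is, as noted above, still under discussion.

```python
from urllib.parse import urlsplit

def pointer_parts(url):
    """Return (path, query, fragment) for a URL, using generic URL syntax only."""
    parts = urlsplit(url)
    return parts.path, parts.query, parts.fragment

for url in (
    "http://www.uic.edu/x/y/z.xml?/tei/id(p23)child(1,emph)",
    "http://www.uic.edu/x/y/z.xml#/tei/id(p23)child(1,emph)",
):
    path, query, fragment = pointer_parts(url)
    print(path, "| query:", query, "| fragment:", fragment)
```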

C.14   What XML software can I use today?

§ There are already three XML parsers (written in Java) which can be used to check that your files conform to the Draft XML Specification:

See also the question on XML Browsers and the details of the xml-dev mailing list for software developers.

C.15   Can I do mathematics using XML?

Yes, if the document type you use provides for math. The long-expired HTML3 could be used, or HTML Pro, or ISO12083 Math, or the developments of the OpenMath or HTML-Math projects, or one of your own making. Browsers to display rudimentary math embedded in SGML already exist (eg Panorama, Multidoc Pro), and the mathematics-using communities may develop their own software for XML.

The sophistication could vary from math expressions like $x_i$ through simple inline equations such as $E = mc^2$ to display equations like

$$\sum_{i=1}^{n} (x_i - p)^2 / n$$

(If you are using an HTML browser to read this, the above equations may not be rendered correctly unless you have a math plugin for Netscape like IBM's TechExplorer which reads the embedded TeX equivalent [use the source, Luke!].)


D. Developers and Implementors (including WebMasters and server operators)

D.1   Where's the spec?

Right here (http://www.w3.org/pub/WWW/TR/). Includes the EBNF.

D.2   What's this difference between `valid' and `well-formed'?

Valid XML files are those which have a Document Type Definition (DTD) like all other SGML applications, and which adhere to it. They must also be well-formed (see below).

§ A valid file begins like any normal SGML file with a Document Type Declaration, but may have an optional XML Declaration prepended:

<?XML VERSION="1.0"?>
<!doctype foo system "http://www.foo.org/bar.dtd">
<foo>
  <bar>...<blort/>...</bar>
</foo>

An XML version of the specified DTD must be accessible to the XML processor, either by being available locally (ie the user already has a copy on disk), or by being retrievable via the network (with the SYSTEM identifier set to a URL).

If there is an internal DTD subset, this may be referenced by an `INTERNAL' Required Markup Declaration (RMD) in the XML Declaration:

<?XML VERSION="1.0" RMD="INTERNAL"?>
<!doctype foo [
<!element guff (#PCDATA)>
]>
<foo>
  <bar>...<blort/>...</bar>
  <guff>...</guff>
</foo>

The default, when no XML Declaration is present, is VERSION="1.0" RMD="ALL" ENCODING="UTF-8".

§ Well-formed XML files can be used without a DTD, but they must follow some simple rules to enable a browser to parse the file correctly (so that it can apply your stylesheet, enable linking, etc). Valid files must also be well-formed.

  • the file must start with a Required Markup Declaration, saying that there is no DTD (a different rule applies for valid files which do have a DTD):

    <?XML VERSION="1.0" RMD="NONE"?>
    <foo>
      <bar>...<blort/>...</bar>
    </foo>

  • all tags must be balanced: that is, all elements must have both start- and end-tags present (omission is not allowed, with one exception, see `Empty elements' below)

  • all attribute values must be in quotes (the single-quote character [the apostrophe] may be used if the value contains a double-quote character, and vice versa)

  • any EMPTY elements (eg those with no end-tag, like HTML's <IMG>, <HR>, <BR>, and others) must either end with `/>' or be made non-EMPTY by adding a real end-tag.

    Example: <BR> would become either <BR/> or <BR></BR>.

  • § there must not be any markup characters (< or &) in the character data (ie they must be escaped as &lt; and &amp;)

  • elements must nest inside each other properly (no overlapping markup, same rule as for regular SGML).

Well-formed XML files are considered to have &lt;, &gt;, &apos;, &quot;, and &amp; predefined and thus available for use even without a DTD. Valid XML files must declare them explicitly if they use them. A revised version of the XML Specification will give a precise definition of what that declaration must be.
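
Most of the rules above can be tried out with any parser that reports errors. The Python sketch below runs a handful of small samples through the standard-library xml.etree.ElementTree parser and records which ones it accepts. Note the assumptions: that parser implements the later XML 1.0 syntax rather than this draft, so the rule about the Required Markup Declaration is not exercised and the samples carry no XML Declaration; the function name is_well_formed and the sample texts are invented for this illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the standard-library parser accepts the text, otherwise False."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

samples = {
    "balanced, quoted, nested": '<foo><bar n="1">text<blort/></bar></foo>',
    "missing end-tag":          "<foo><bar>text</foo>",
    "unquoted attribute value": "<foo><bar n=1>text</bar></foo>",
    "EMPTY element not closed": "<foo>one<BR>two</foo>",
    "unescaped ampersand":      "<foo>fish & chips</foo>",
    "overlapping elements":     "<foo><b><i>text</b></i></foo>",
    "predefined entities":      "<foo>&lt;tag&gt; &amp; &apos;a&apos; &quot;b&quot;</foo>",
}
for label, text in samples.items():
    print("%-26s %s" % (label, is_well_formed(text)))
```

Only the first and last samples are accepted; each of the others breaks exactly one of the rules listed above.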

D.3   What are all these Processing Instructions?

§ Processing Instructions are SGML's way of adding `what to do' and `how to do it' details to a file. In XML every file can begin with an XML Declaration which starts with `<?' and the keyword XML, and ends with `?>' (slightly different from plain SGML, which omits the final question-mark).

The XML Declaration must include the version number of XML being followed, and may include a Required Markup Declaration and an Encoding Declaration:

<?XML VERSION="1.0" RMD="ALL" ENCODING="UTF-8"?>

The XML Declaration is optional, and defaults to the values given here: if other values are needed, the declaration must be included at the top of the file.
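
The defaulting behavior can be sketched in a few lines of Python. This is a hand-rolled reader for the declaration exactly as it is written in this FAQ's examples (upper-case keyword, double-quoted values), not an implementation of the Draft XML Specification: the names DEFAULTS and declaration_settings are invented here, and the sketch does not enforce the rule that VERSION must be present.

```python
import re

# Defaults as stated in this FAQ for a file with no XML Declaration.
DEFAULTS = {"VERSION": "1.0", "RMD": "ALL", "ENCODING": "UTF-8"}

_DECL = re.compile(r'^<\?XML\b(.*?)\?>', re.DOTALL)
_PAIR = re.compile(r'(\w+)\s*=\s*"([^"]*)"')

def declaration_settings(file_text):
    """Return VERSION, RMD and ENCODING for a file, falling back to the defaults."""
    settings = dict(DEFAULTS)
    match = _DECL.match(file_text)
    if match:
        for name, value in _PAIR.findall(match.group(1)):
            settings[name.upper()] = value
    return settings

print(declaration_settings('<?XML VERSION="1.0" RMD="NONE"?>\n<foo/>'))
print(declaration_settings("<foo/>"))
```

A file with no declaration comes back with all three default values, and a declaration that sets only some of them overrides just those.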

D.4   What else has changed between SGML and XML?

The principal changes are in what you can do in writing a Document Type Definition (DTD). To simplify the syntax and make it easier to write processing software, the following markup declaration restrictions have been picked for XML:

  • No comments (--* *--) inside markup declarations

  • Comment declarations can't have spaces within the markup of <!--* or *-->

  • Comment declarations can't jump in and out of comments with *-- --*

  • No name groups for declaring multiple elements or element attlists

  • No CDATA or RCDATA declared content

  • No exclusions or inclusions on content models

  • No minimization parameters on element declarations

  • Mixed content models must be optional-repeatable ORs, with #PCDATA first

  • No AND (&) content model groups

  • No NAME[S], NUMBER[S], or NUTOKEN[S] declared values

  • No #CURRENT or #CONREF declared values

  • Attribute default values must be quoted

  • Marked sections can't have spaces within the markup of <![keyword[ or ]]>

  • No RCDATA, TEMP, IGNORE, or INCLUDE marked sections in instance

  • Marked sections in instance must have CDATA keyword, not parameter entity

  • No SDATA, CDATA, or bracketed internal entities

  • No SUBDOC, CDATA, or SDATA external entities

  • § No public identifiers in entity and notation declarations (this may be changed to permit them at a later stage once a resolution mechanism has been identified)

  • No data attributes on NOTATIONs or attribute value specifications on ENTITY declarations

  • No SHORTREF declarations

  • No USEMAP declarations

  • No LINKTYPE declarations

  • No LINK declarations

  • No USELINK declarations

  • No IDLINK declarations

  • No SGML declarations

The astute reader will have noticed a change in the syntax for a comment. XML comments add an asterisk after the opening double dash and before the closing one, so they are in the form:

<!--* comment text *-->

The `asterisk-double-dash' sequence is therefore illegal in comment text as it is the terminator. Spaces are not allowed anywhere within the delimiters themselves (between the angle brackets, the exclamation mark, the dashes, and the asterisks): they are only valid within the comment text.
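
A small Python sketch of the comment rule follows. The name draft_comment is invented for this example, and the check reflects only the rule as stated in this FAQ (no `asterisk-double-dash' inside the comment text), so it should be read as an illustration of the draft syntax and nothing more.

```python
def draft_comment(text):
    """Wrap text in the draft comment delimiters '<!--*' and '*-->'.

    `draft_comment` is a hypothetical helper. It refuses text containing
    the terminator sequence '*--', which the rule above makes illegal.
    """
    if "*--" in text:
        raise ValueError("comment text must not contain the terminator '*--'")
    # The delimiters are emitted without internal spaces; the spaces here
    # belong to the comment text, where they are allowed.
    return f"<!--* {text} *-->"


print(draft_comment("comment text"))
```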

As noted in the question on source formatting, instances may not rely on whitespace being ignored in element content.

If you want to use existing SGML DTDs and entity files for XML, they will need to be edited to conform to the above requirements, but this only has to be done once. When the list has been finalized, it is likely that suitably-modified versions of the popular DTDs and character entity sets (eg the `ISO' files like ISOlat1) will be made available for use with XML.

D.5   Do I have to change any of my server software to work with XML?

Only to serve up .xml files as the correct MIME type. The XML project is submitting a MIME type of text/xml for approval, so for serving XML documents all that is needed is to edit the mime-types file (or its equivalent) and add the line

text/xml	xml XML

However, more sophisticated applications may require HTTP content negotiation to determine what tools the client has for display. Also, since XML is designed to support stylesheets and sophisticated hyperlinking, XML documents may be accompanied by ancillary files such as DTDs, entity files, catalogs, stylesheets, etc, which may need their own MIME entry, and which require placing in the appropriate directories.
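
For servers or scripts written in Python, the same mapping from the .xml and .XML extensions to text/xml can be sketched with the standard mimetypes module. The function names xml_aware_types and content_type_for are invented for this example, and the fallback type for unknown files is an assumption, not something the XML project specifies.

```python
import mimetypes


def xml_aware_types():
    """Build a private MIME table that maps .xml and .XML to text/xml."""
    table = mimetypes.MimeTypes()
    for extension in (".xml", ".XML"):
        table.add_type("text/xml", extension)
    return table


def content_type_for(path, table=None):
    """Return the Content-Type to send for a file path.

    `content_type_for` is a hypothetical helper; the octet-stream
    fallback is an assumption made for this sketch.
    """
    table = table or xml_aware_types()
    guessed, _encoding = table.guess_type(path)
    return guessed or "application/octet-stream"


print(content_type_for("faq.xml"))
```

Using a private table rather than the module-wide one keeps the mapping from being affected by whatever the local system's mime-types file says about .xml.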

If you run scripts generating HTML, which you wish to work with XML, they will need to be modified to produce the relevant document type.

D.6   Can I still use server-side INCLUDEs (`part-generated' content)?

Yes, so long as what they generate ends up as part of an XML-conformant file (ie either valid or well-formed).

However, some files containing embedded calls to external procedures which get invoked before transmission, such as the NCSA's `special' HTML, need to be checked carefully, to make sure that they do not contain raw markup characters (ie angle brackets and ampersands) which might confuse editors and other processing software. For example,

<!-- #exec cmd="tr '\012' '\040' <foo.bar"-->

which in a .shtml file embeds the contents of foo.bar in the output stream, with all newlines changed to spaces, needs to be written as

<!-- #exec cmd="tr '\012' '\040' &lt;foo.bar"-->

D.7   Can I (and my authors) still use client-side INCLUDEs?

The same rule applies as for server-side INCLUDEs: any embedded code which gets passed to a third-party engine (eg SDQL enquiries, Java writes, LiveWire requests, etc) must not contain any characters which might be misinterpreted as XML markup (ie no angle brackets or ampersands). Either use a CDATA marked section to stop your XML application parsing the embedded code, or use the standard &lt;, &gt;, and &amp; character entity references instead.
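
The two options can be sketched in Python as follows. The function names as_cdata and as_escaped are invented for this example, and the sample script line is hypothetical. The CDATA variant simply refuses code containing `]]>', since that sequence would end the marked section early.

```python
from xml.sax.saxutils import escape


def as_cdata(code):
    """Wrap embedded code in a CDATA marked section, leaving it unescaped.

    `as_cdata` is a hypothetical helper. Code containing ']]>' is
    rejected because that sequence would close the section early.
    """
    if "]]>" in code:
        raise ValueError("embedded code must not contain ']]>'")
    return f"<![CDATA[{code}]]>"


def as_escaped(code):
    """Replace &, < and > in embedded code with entity references."""
    return escape(code)


script = "if (a < b && c) run();"  # hypothetical embedded code
print(as_cdata(script))
print(as_escaped(script))
```

The CDATA form keeps the embedded code readable for authors, while the escaped form requires the receiving engine to see the text only after the entity references have been expanded.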

D.8   I'm trying to understand the XML Spec: why does SGML (and XML) have such difficult terminology?

For implementation to succeed, the terminology needs to be precise. For example, `element' and `tag' are not synonymous: an element is a whole unit of markup, and may consist of a start-tag alone (as in HTML's <BR>) or of a start-tag, an end-tag, and the content which goes between them; tags are simply the markers at the start and end of elements. Sloppy terminology causes misunderstandings.

Those new to SGML may want to read something like the Gentle Introduction to SGML chapter of the TEI.

D.9   Is there a Developer's API kit for XML?

Not yet, although many aspects of development software are being worked on: see the question on XML software.

