[This local archive copy is from the official and canonical URL, http://www.ucc.ie/xml/, 1999-05-27; please refer to the canonical source document if possible.]
Maintained on behalf of the World Wide Web Consortium's XML Special Interest Group by Peter Flynn, (University College Cork), with the collaboration of Terry Allen, Tom Borgman, (Harlequin Ltd), Tim Bray, (Textuality, Inc), Robin Cover, (Isogen, Inc), Bob DuCharme, (Moody's), Christopher Maden, (O'Reilly & Associates), Eve Maler, (Arbortext, Inc), Peter Murray-Rust, (Nottingham University), Liam Quin, (Groveware), Michael Sperberg-McQueen, (University of Illinois at Chicago), Joel Weber, (MIT), Murata, Makoto (Fuji Xerox Information Systems), and many other members of the XML Special Interest Group of the W3C as well as FAQ readers around the world. Please use the form at the end for any corrections or additions.
Recent changes
June 1999
- Added new XML mailing lists in Italian and in French
- Added details of developer resources in Chinese
- Two more translations under way (Chinese and Czech)
- Updated links to the question on DTDs
- Added question on the use of Java to generate and manage XML
- Added question on when to use attributes and when to use element markup
- Added question on the use of XML syntax to describe DTD data (schemata)
- Expanded on the explanation of the use of formal language in the spec
- Added question on the difference between XML and C++
- Separated information on XML versions of HTML into a separate question
- Typos and minor corrections
Paragraphs which have been added since the last version are indicated with a pilcrow (¶). Paragraphs which have been changed since the last version are indicated with a section sign (§). Paragraphs marked for future deletion but retained at the moment for information are indicated with a plus/minus sign (±).
Summary
This document contains the most frequently-asked questions (with answers) about XML, the Extensible Markup Language. It is intended as a first resource for users, developers, and the interested reader, and should not be regarded as a part of the XML Specification.
Organization
The FAQ is divided into four parts: a) General, b) User, c) Author, and d) Developer. The questions are numbered independently within each section. As the numbering may therefore change with each version, comments and suggestions should refer to the version number (see Recent changes above) as well as the Part and Question Number.
There is a form at the end of this document which you can use to submit bug reports, suggestions for improvement, and other comments relating to this FAQ only (or you can mail the maintainer direct at pflynn@imbolc.ucc.ie). Comments about the XML Specification itself should be sent to the W3C.
Availability
The master file for use with any conforming SGML system is available at http://www.ucc.ie/xml/faq.sgml (this can also be used online with SGML browsers like Panorama or Multidoc Pro; you can also download the DTD and stylesheet installation self-extractor for faster local access with these browsers, or the DTD list as ASCII files).
There is an HTML version for use with any HTML browser (eg Netscape Navigator, Microsoft Internet Explorer, Spry Mosaic, NCSA Mosaic, Lynx, Opera, GNUscape Navigator etc) at http://www.ucc.ie/xml/.
An XML version will be produced when DTDs and browsers are more widely available to handle it.
A plaintext (ASCII) version is available from the Web and (eventually) by anonymous FTP to one of several FAQ repositories. The versions above are also available by electronic mail to the WebMail server (for users with email-only access). The plaintext version is posted to comp.text.xml monthly for the archives.
For printed copies there are PostScript™ versions for A4 and Letter paper sizes. PDF will be available soon for A4 and Letter as well.
§ The document is also available in oil-based toner on flattened dead trees by sending US$10 (or equivalent) to the editor (email first to check currency and postal address).
§ Thanks to Murata Makoto for making this document available in Japanese (see http://www.fxis.co.jp/DMS/sgml/xml/xmlfaq.html); to Jaime Sagarduy for the translation into Spanish (see http://slug.ctv.es/~olea/sgml-esp/xfaq15.html); to Kangchan Lee for the Korean version (see http://xml.t2000.co.kr/faq/index.html); to Jiang Luqin for one Chinese version (in preparation) and to Neko for another (at http://zxd.webjump.com/xml.html); to Miloslav Nic for a Czech version (in preparation); and to Tim Bray for the Annotated Spec at http://www.xml.com/axml/testaxml.htm. A Greek version is also in preparation.
§ You can download the XML logo as a GIF or in EPS format; and an icon for your files in ICO (Microsoft Windows), Mac, or XPM (X Window system) format.
A.5 Aren't XML, SGML, and HTML all the same thing?
A.6 What is the difference between SGML/XML and C or C++?
A.7 Who is responsible for XML?
A.8 Why is XML such an important development?
A.9 How can XML make SGML simpler and still let you define your own document types?
A.10 Why not just carry on extending HTML?
A.11 Why do we need all this SGML stuff? Why not just use Word or Notes?
A.12 Where do I find more information about XML?
A.13 Where can I discuss implementation and development of XML?
B.1 What do I have to do to use XML?
B.2 Why should I use XML instead of HTML?
C.2 What does an XML document look like inside?
C.3 How does XML handle white-space in my documents?
C.4 Which parts of an XML document are case-sensitive?
C.5 How can I make my existing HTML files work in XML?
C.6 Is there an XML version of HTML?
C.7 If XML is just a subset of SGML, can I use XML files directly with SGML tools?
C.8 I'm used to authoring and serving HTML. Can I learn XML easily?
C.9 Will XML be able to use non-Latin characters?
C.10 What's a Document Type Definition (DTD) and where do I get one?
C.11 I keep hearing about alternatives to DTDs. What's a schema?
C.12 How will XML affect my document links?
C.13 Can I do mathematics using XML?
C.14 How does XML handle metadata?
C.15 Can I use Java, ActiveX, etc in XML files?
C.16 Can I use Java to create or manage XML files?
D.2 What are these terms `DTDless', `valid', and `well-formed'?
D.3 Which should I use in my DTD, attributes or elements?
D.4 What else has changed between SGML and XML?
D.5 What XML software can I use today?
D.6 Do I have to change any of my server software to work with XML?
D.7 Can I still use server-side INCLUDEs?
D.8 Can I (and my authors) still use client-side INCLUDEs?
D.9 I'm trying to understand the XML Spec: why does SGML (and XML) have such difficult terminology?
D.10 Is there a Developer's API kit for XML?
D.11 How does XML fit with the DOM?
D.12 Is there a conformance test suite for XML processors?
D.13 How do I include one DTD (or fragment) in another?
D.14 I've already got SGML DTDs: how do I convert them for use with XML?
XML is the `Extensible Markup Language' (extensible because it is not a fixed format like HTML). It is designed to enable the use of SGML on the World Wide Web.
§ XML is not a single, predefined markup language: it's a metalanguage -- a language for describing other languages -- which lets you design your own markup. (A predefined markup language like HTML defines a way to describe information in one specific class of documents: XML lets you define your own customized markup languages for different classes of document.) It can do this because it's written in SGML, the international standard metalanguage for markup.
XML is designed to make it easy and straightforward to use SGML on the Web: easy to define document types, easy to author and manage SGML-defined documents, and easy to transmit and share them across the Web.
It defines an extremely simple dialect of SGML which is completely described in the XML Specification. The goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML.
For this reason, XML has been designed for ease of implementation, and for interoperability with both SGML and HTML [quotes from the XML spec].
SGML is the Standard Generalized Markup Language (ISO 8879), the international standard for defining descriptions of the structure and content of different types of electronic document. There is an SGML FAQ at http://www.infosys.utas.edu.au/info/sgmlfaq.txt which is posted every month to the comp.text.sgml newsgroup, and the SGML Web pages are at http://www.oasis-open.org/cover/sgml-xml.html.
ISO standards are governed by the International Organization for Standardization in Geneva, Switzerland, and voted into or out of existence by representatives from every country's national standards body.
If you have a query about an international standard, you should contact your national standards body for the name of your country's representative on the relevant ISO committee or working group.
If you have a query about your country's representation in Geneva or about the conduct of your national standards body, you should contact the relevant government department in your country, or speak to your public representative.
The representation of countries at the ISO is not a matter for this FAQ. Please do not submit queries to the maintainer about how or why your ISO representatives have or have not voted.
HTML is the HyperText Markup Language (RFC 1866), a specific application of SGML used on the World Wide Web.
Not quite. SGML is the `mother tongue', used for describing thousands of different document types in many fields of human activity, from transcriptions of ancient Irish manuscripts to the technical documentation for stealth bombers, and from patients' clinical records to musical notation.
§ HTML is just one of these document types, the one most frequently used in the Web. It defines a simple, fixed type of document with markup designed for a common class of office or technical report, with headings, paragraphs, lists, illustrations, etc, and some provision for hypertext and multimedia.
XML is an abbreviated version of SGML, to make it easier for you to define your own document types, and to make it easier for programmers to write programs to handle them. It omits the more complex and less-used parts of SGML in return for the benefits of being easier to write applications for, easier to understand, and more suited to delivery and interoperability over the Web. But it is still SGML, and XML files may still be parsed and validated the same as any other SGML file (see the question on XML software).
Programmers may find it useful to think of XML as being SGML-- rather than HTML++.
C and C++ (and others like Fortran, or Pascal, or Basic, or Java or dozens more) are programming languages with which you specify calculations, actions, and decisions to be carried out:
do when @front(@date,6) is equal "01-Apr"
   print "April Fool!\n"
else
   print @days(@datesub("25-Dec",@date)),\
         " shopping days to Christmas\n"
done
SGML and XML are markup specification languages with which you can design ways of describing information, usually for storage, transmission, or processing by a program:
<para>It was the week after <event class="festival">Christmas</event> but <name class="person">Max</name>'s mind was still running on the prank he had played on <name class="person">Louise</name> the previous <name class="month">April</name>.</para>
On its own, a file of SGML or XML text (including HTML) doesn't do anything: you have to have a program to do something with it.
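As a minimal sketch of that last point, the fragment below reads the marked-up paragraph and pulls out the names by class. It assumes Python and its standard xml.etree.ElementTree module, neither of which this FAQ prescribes; the function name names_of_class is made up for the illustration. The markup only describes the text: it is the program that decides what to do with the description.

```python
import xml.etree.ElementTree as ET

# The sample paragraph from the answer above.
PARA = (
    '<para>It was the week after <event class="festival">Christmas</event> '
    'but <name class="person">Max</name>\'s mind was still running on the '
    'prank he had played on <name class="person">Louise</name> the previous '
    '<name class="month">April</name>.</para>'
)

def names_of_class(xml_text, wanted):
    """Return the text of every <name> element whose class attribute equals `wanted`."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter("name") if el.get("class") == wanted]

people = names_of_class(PARA, "person")
months = names_of_class(PARA, "month")
print(people, months)
```

A different program could use exactly the same file to build an index of festivals or to typeset the paragraph; nothing in the markup ties it to one use.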
XML is a project of the World Wide Web Consortium (W3C), and the development of the specification is being supervised by their XML Working Group. A Special Interest Group of co-opted contributors and experts from various fields contributed comments and reviews by email.
XML is a public format: it is not a proprietary development of any company. The v1.0 specification was accepted by the W3C as a Recommendation on Feb 10, 1998.
It removes two constraints which are holding back Web developments:
dependence on a single, inflexible document type (HTML);
the complexity of full SGML, whose syntax allows many powerful but hard-to-program options.
XML simplifies the levels of optionality in SGML, and allows the development of user-defined document types on the Web.
To make SGML simpler, XML redefines some of SGML's internal values and parameters, and removes a large number of the more complex and sometimes less-used features which made it harder to write processing programs (see http://www.w3.org/TR/NOTE-sgml-xml-971215).
§ Although it retains all of SGML's structural abilities which let you define and manage your own document types, XML introduces a new class of document which does not require you to use a predefined document type description (basically you can make up your own markup so long as you stick strictly to the syntactic rules). See the questions about valid and well-formed documents, and how to define your own document types in the Developers' Section.
HTML is already overburdened with dozens of interesting but often incompatible inventions from different manufacturers, because it provides only one way of describing your information.
XML will allow groups of people or organizations to create their own customized markup languages for exchanging information in their domain (music, chemistry, electronics, hill-walking, finance, surfing, petroleum geology, linguistics, cooking, knitting, stellar cartography, history, engineering, rabbit-keeping, mathematics, etc).
HTML is at the limit of its usefulness as a way of describing information, and while it will continue to play an important role for the content it currently represents, many new applications require a more robust and flexible infrastructure.
§ Information on a network which connects many different types of computer has to be usable on all of them. Public information cannot afford to be restricted to one make or model or manufacturer, or to cede control of its data format to private hands. It is also helpful for such information to be in a form that can be reused in many different ways, as this can minimize wasted time and effort. Proprietary data formats, no matter how well documented or publicized, are simply not an option: their control still resides in private hands and they can be changed or withdrawn arbitrarily without notice.
§ SGML is the international standard for defining this kind of application, but those who need an alternative based on different software for other purposes are entirely free to implement similar services using such a system, especially if they are for private use.
Online, there's the XML Specification and ancillary documentation available from the W3C; the XML Web pages with an extensive list of online reference material in Robin Cover's SGML pages; and a summary and condensed FAQ from Tim Bray.
The items listed below are the ones I have been told about: please mail me if you come across others.
The annual XML Conference is run by the Graphic Communications Association. XML99 is being held in Philadelphia on December 5-9 and consists, as last year, of two conferences in one: the XML Conference 99 and Markup Technologies 99.
SGML/XML Asia/Pacific is in Sydney on October 18-21.
Further details of these are on the GCA's Web site.
There are lists of books, articles, and software for XML in Robin Cover's SGML and XML pages. That site should always be your first port of call: please look there first before using the form in this FAQ to ask about software or documentation.
Note: Please Read The Fine Documentation which you will be sent when you join a mailing list, as it contains important information, particularly about what to do when your email address changes.
There is a mailing list called xml-dev for those committed to developing components for XML. You can subscribe by sending a 1-line mail message to majordomo@ic.ac.uk saying:
subscribe xml-dev your@email.address
(substituting your correct email address). To unsubscribe, send a 1-line message to the same address saying
unsubscribe xml-dev your@email.address
The list is hypermailed for online reference at http://www.lists.ic.ac.uk/hypermail/xml-dev/. Note that this list is for those people actively involved in developing resources for XML. It is not for general information about XML (see this FAQ and other sources) or for general discussion about SGML implementation and resources (see below).
§ There is a general-purpose mailing list called XML-L for public discussions: to subscribe, send a 1-line mail message to LISTSERV@listserv.heanet.ie saying
subscribe XML-L forename surname
(substituting your own forename and surname). To unsubscribe, send a 1-line message to the same address saying
unsubscribe XML-L
(Note that LISTSERV lists like XML-L don't need you to give your email address: they read it from your email headers.) You can access XML-L and its archives, as well as subscribe and unsubscribe interactively, from http://listserv.heanet.ie/xml-l.html.
Note: Please note that there is a lot of inaccurate and misleading information published in print and on the Web about subscribing to mailing lists. The information given here is correct -- use it.
¶ There are mailing lists being set up in other languages:
Gianni Rubagotti writes: A new Italian mailing list about XML is born: to subscribe, send a mail message without a subject line but with text saying
subscribe XML-IT
to majordomo@ananas.usr.dsi.unimi.it. Send discussion messages to: xml-it@ananas.usr.dsi.unimi.it (only subscribers may send messages). Everyone, Italian or not, who wants to debate about XML in our tongue is welcome.
JP Theberge writes: A French mailing list about XML has been created. To subscribe, send subscribe to xml-request@trisome.com. Then post to xml@trisome.com.
§ The Usenet newsgroup comp.text.xml is for discussions of XML. If this is not available on your local news server, ask your Internet Provider to add it, or use a Web interface like DéjàNews.
§ For the average user of the Web, nothing except use a browser which works with XML (see the question about browsers). Remember XML is still being implemented, so some features are still either undefined or have yet to be written.
You can use XML browsers to look at some of the emerging XML material, such as Jon Bosak's Shakespeare plays and the molecular experiments of the Chemical Markup Language (CML). There are some more example sources listed at http://www.oasis-open.org/cover/xml.html#examples.
If you want to start preparations for creating your own XML, see the questions in the Authors' Section and the Developers' Section.
Authors and providers can design their own document types using XML, instead of being stuck with HTML. Document types can be explicitly tailored to an audience, so the cumbersome fudging that has to take place with HTML to achieve special effects should become a thing of the past: authors and designers will be free to invent their own markup elements;
Information content can be richer and easier to use, because the hypertext linking abilities of XML are much greater than those of HTML.
XML can provide more and better facilities for browser presentation and performance;
It removes many of the underlying complexities of SGML in favor of a more flexible model, so writing programs to handle XML will be much easier than doing the same for full SGML.
Information will be more accessible and reusable, because the more flexible markup of XML can be used by any XML software instead of being restricted to specific manufacturers as has become the case with HTML.
Valid XML files are kosher SGML, so they can be used outside the Web as well, in an SGML environment.
Remember the XML specification is still new, so a lot of what you see now is experimental. As with HTML, there won't be just one browser, but many. However, because the potential number of different XML applications is not limited, no single browser can be expected to handle 100% of everything.
§ Some of the generic parts of XML (eg parsing, tree management, searching, formatting, etc) are being combined into general-purpose browser libraries or toolkits to make it easier for developers to take a consistent line when writing XML applications. Such applications can then be customized by adding semantics for specific markets, or using languages like Java to develop plugins for generic browsers and have the specialist modules delivered transparently over the Web.
§ MSIE5 handles XML but currently still renders it via CSS, using a largely HTML-derived model, so not all the stylesheet options work. Microsoft are also the architects of a hybrid solution in which you can embed fragments of XML in HTML files because current HTML-only browsers simply ignore element markup which they don't recognize.
§ The publicly-released Netscape code (Mozilla) has resulted in a test XML implementation including an application of RDF plus James Clark's expat XML parser.
The authors of the MultiDoc Pro SGML browser, CITEC, have joined forces with Mozilla to produce a multi-everything browser called DocZilla, which reads HTML, XML, and SGML, with XSL and CSS stylesheets. It runs under NT and Linux; see http://www.doczilla.org for details. This is early alpha, but is by far the most ambitious, and the only one so far backed by solid SGML expertise.
See also the notes on software for authors and developers, and the more detailed list on the XML pages in the SGML Web site at http://www.oasis-open.org/cover/xml.html.
No, existing SGML and HTML applications software will continue to work with existing files. But as with any enhanced facility, if you want to view or download and use XML files, you will need to add XML-aware software as it becomes available.
Authors should also read the Developers' Section, which contains further information about the internals of XML files.
§ No. XML itself does not replace HTML: instead, it provides an alternative which allows you to define your own set of markup elements. HTML is expected to remain in common use for some time to come, and Document Type Definitions for HTML will be available in XML versions as well as in original SGML. XML is designed to make the writing of DTDs much simpler than with full SGML. (See the question on DTDs for what one is and why you'd want one.)
Work is going on to produce XML versions of HTML and other popular existing DTDs, but this may not take off until more stable software is available. Watch comp.text.sgml, comp.text.xml, XML-L, and xml-dev for announcements.
The basic structure is very similar to most other applications of SGML, including HTML. XML documents can be very simple, with no document type declaration, and straightforward nested markup of your own design:
<?xml version="1.0" standalone="yes"?>
<conversation>
  <greeting>Hello, world!</greeting>
  <response>Stop the planet, I want to get off!</response>
</conversation>
Or they can be more complicated, with a DTD specified (see the question on DTDs), and maybe an internal subset, and a more complex structure:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE titlepage SYSTEM "http://www.frisket.org/dtds/typo.dtd"
  [<!ENTITY % active.links "INCLUDE">]>
<titlepage>
  <white-space type="vertical" amount="36"/>
  <title font="Baskerville" size="24/30" alignment="centered">Hello, world!</title>
  <white-space type="vertical" amount="12"/>
  <!-- In some copies the following decoration is hand-colored, presumably by the author -->
  <image location="http://www.foo.bar/fleuron.eps" type="URL" alignment="centered"/>
  <white-space type="vertical" amount="24"/>
  <author font="Baskerville" size="18/22" style="italic">Vitam capias</author>
</titlepage>
Or they can be anywhere between: a lot will depend on how you want to define your document type (or whose you use) and what it will be used for. See the question on valid and well-formed files.
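To illustrate that the simple form needs no declarations at all, here is a small sketch which hands the `conversation' document to a parser and walks the result. Python's standard xml.etree.ElementTree module is an assumption of this example (any XML parser would serve), and the variable names are purely illustrative.

```python
import xml.etree.ElementTree as ET

# The DTD-less example from above: no document type declaration,
# just nested markup of the author's own design.
DOC = (
    '<?xml version="1.0" standalone="yes"?>\n'
    '<conversation>\n'
    '  <greeting>Hello, world!</greeting>\n'
    '  <response>Stop the planet, I want to get off!</response>\n'
    '</conversation>\n'
)

root = ET.fromstring(DOC)

# Each child element arrives with its element type name and its text.
turns = [(child.tag, child.text) for child in root]
for tag, text in turns:
    print(tag, "->", text)
```

The parser learns the element type names from the document itself, which is all a well-formed, DTD-less file requires.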
§ The SGML rules regarding white-space have been changed for XML, so all white-space, including linebreaks, TAB characters, and regular spaces, even between elements where no text can appear, is passed by the parser unchanged to the application (browser, formatter, viewer, etc). This means:
`insignificant' white-space between structural elements (those which appear where only element content is allowed, ie between other elements, without text data) will get passed to the application (under standard SGML this white-space gets suppressed, which is why you can put all that extra space in HTML documents and not worry about it. This is not so in XML);
`significant' white-space within elements which can contain text and markup mixed together (mixed content or PCDATA [parsed character data]) will still get passed to the application exactly as under regular SGML.
<chapter>
  <section>
    <title>
      My title for Section 1.
    </title>
    <para>
      ...
    </para>
  </section>
</chapter>
§ The parser must, however, still inform the application that white-space has occurred in element content, if it can detect it. (Users of standard SGML may recognize that this information was not in the ESIS, but it is in the grove.) In the above example, the application will receive all the pretty-printing linebreaks, TABs, and spaces between the elements as well as those embedded in the section title. It is the function of the application (browser, formatter, viewer, etc) to decide which type of white-space to discard and which to retain.
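A short sketch may make the division of labour concrete. It assumes Python's standard xml.etree.ElementTree module standing in for `the application'; the decision to trim the title is this example's own choice, not a rule from the Specification.

```python
import xml.etree.ElementTree as ET

# A pretty-printed document along the lines of the example above.
DOC = (
    "<chapter>\n"
    "  <section>\n"
    "    <title>\n"
    "      My title for Section 1.\n"
    "    </title>\n"
    "    <para>...</para>\n"
    "  </section>\n"
    "</chapter>"
)

root = ET.fromstring(DOC)
title = root.find("section").find("title")

# White-space between elements (element content) still reaches the application:
between = root.text      # the linebreak and indent before <section>

# White-space inside the title is delivered untouched as well:
raw_title = title.text

# It is the application that decides what to keep; here it trims the title.
clean_title = " ".join(raw_title.split())
print(repr(between), repr(raw_title), repr(clean_title))
```

A formatter might discard the first kind of white-space and normalize the second, as here; an editor that wants to preserve the author's layout would keep both.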
All of it, both markup and text. This is significantly different from HTML and most other SGML document types. It was introduced to allow markup in non-Latin-alphabet languages and to obviate problems with case-folding in scripts which are caseless.
Element type names (used in start-tags and end-tags) are case-sensitive: you must stick with whatever combination of upper- or lower-case you use to define them (either by usage or in a DTD).
So you can't say <BODY>...</body>: upper- and lower-case must match; thus <IMG/> and <img/> are two different element types;
For well-formed files with no DTD, the first occurrence of an element type name defines the casing;
Attribute names are also case-sensitive, on a per-element basis: for example <PIC width="7in"/> and <PIC WIDTH="6in"/> in the same file exhibit two separate attributes, because the different casings of width and WIDTH distinguish them;
Attribute values are also case-sensitive. Character data values (eg HRef="MyFile.SGML") always have been, but ID and IDREF attributes are now case-sensitive as well and no longer get folded to uppercase for comparisons;
All entity names (&Aacute;), and your data content (your text), are case-sensitive, exactly as before.
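The rules for element type names and attribute names can be checked with any XML parser. The following is a sketch only, assuming Python's standard xml.etree.ElementTree module; the helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if `text` parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Start-tag and end-tag must match exactly, including case.
mixed_case_ok = is_well_formed("<BODY>text</body>")
same_case_ok = is_well_formed("<body>text</body>")

# <IMG/> and <img/> are two different element types.
doc = ET.fromstring('<doc><IMG/><img/><PIC width="7in"/><PIC WIDTH="6in"/></doc>')
upper_imgs = doc.findall("IMG")
lower_imgs = doc.findall("img")

# width and WIDTH are two separate attributes.
first_pic, second_pic = doc.findall("PIC")
print(mixed_case_ok, same_case_ok, first_pic.get("WIDTH"), second_pic.get("WIDTH"))
```

The parser rejects the first string, accepts the second, and treats each casing of IMG and of width as a distinct name.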
§ Make them well-formed DTD-less documents (see below) and write a stylesheet. A DTD (Document Type Definition) is optional in XML, but HTML files converted to XML format currently have to be DTDless because there are few working XML versions of the current SGML-based HTML DTDs yet (they need to be substantially edited to remove their dependence on those features of SGML which are excluded from XML).
It is necessary to convert existing HTML files to be well-formed because XML does not allow end-tag minimization (missing </p>, etc) which is allowed in most HTML DTDs. Many HTML authoring tools already produce almost (but not quite) well-formed XML. As a preparation for XML, the W3C's HTML Tidy program can clean up some of the formatting mess left behind by inadequate HTML editors.
If you want to move your files out of HTML into some other DTD entirely, there is a pilot site run by CommerceNet (http://www.xmlx.com/) for the exchange of XML DTDs, and a pilot FPI server at http://www.ucc.ie/cgi-bin/public with several common SGML DTDs to start from.
If you have created your HTML files conforming to one of the several HTML Document Type Definitions (DTDs), and they validate OK, then they can be converted as follows:
replace the DOCTYPE declaration and any internal subset (basically everything within the first set of angled brackets <!DOCTYPE HTML...>) with the XML Declaration <?xml version="1.0" standalone="yes"?>;
change any EMPTY elements (eg <ISINDEX>, <BASE>, <META>, <LINK>, <NEXTID> and <RANGE> in the header, and <IMG>, <BR>, <HR>, <FRAME>, <WBR>, <BASEFONT>, <SPACER>, <AUDIOSCOPE>, <AREA>, <PARAM>, <KEYGEN>, <COL>, <LIMITTEXT>, <SPOT>, <TAB>, <OVER>, <RIGHT>, <LEFT>, <CHOOSE>, <ATOP>, and <OF> in the body) so that they end with />, for example <IMG SRC="mypic.gif" alt="Picture"/>;
§ ensure there are correctly-matched explicit end-tags for all non-empty elements; eg every <P> must have a </P>, etc. If your HTML was created by a conformant editor, this process can be automated by a normalizer program like sgmlnorm (part of SP) or the sgml-normalize function in an editor like Emacs/psgml;
escape all < and & non-markup (ie literal) characters as &lt; and &amp; respectively;
ensure all attribute values are in quotes;
ensure all occurrences of all element names in start-tags and end-tags match with respect to upper- and lower-case and that they are consistent throughout the file;
ensure all attribute names are similarly in a consistent case throughout the file.
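Several of these steps are mechanical enough to sketch in a few lines. The example below is an illustration, not a recommended converter: it assumes Python and its standard html.parser and xml.sax.saxutils modules, and the names EMPTY, XMLifier, and html_to_xml are invented for it. It closes EMPTY elements with />, quotes attribute values, escapes literal < and & characters, and settles on lower-case names throughout (one consistent casing, though not the upper-case used in the examples above). It does not supply missing end-tags, so the input is assumed to have been normalized first.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# A few of the EMPTY elements named above; extend the set as needed.
EMPTY = {"img", "br", "hr", "meta", "link", "base", "area", "param", "col"}

class XMLifier(HTMLParser):
    """Re-emit already-balanced HTML as well-formed XML (illustrative only)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases names, which gives one consistent casing.
        # Attributes given without a value are written as name="name".
        text = "".join(
            " %s=%s" % (name, quoteattr(name if value is None else value))
            for name, value in attrs
        )
        self.out.append("<%s%s%s>" % (tag, text, "/" if tag in EMPTY else ""))

    def handle_endtag(self, tag):
        # EMPTY elements were already closed by the trailing slash.
        if tag not in EMPTY:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Literal < and & in the text become &lt; and &amp;.
        self.out.append(escape(data))

def html_to_xml(html_text):
    parser = XMLifier()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)

converted = html_to_xml('<P>Fish & chips<BR><IMG SRC=pic.gif ALT="A picture"></P>')
print(converted)

# The result should now be acceptable to an XML parser.
root = ET.fromstring(converted)
```

Files which are not already balanced would still need a normalizer such as sgmlnorm, or manual repair, before a pass like this could be trusted.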
Be aware that many HTML browsers may not accept XML-style EMPTY elements with the trailing slash, so the above changes may not be backwards-compatible. An alternative is to add a dummy end-tag to all EMPTY elements, so <IMG src="foo.gif"> becomes <IMG src="foo.gif"></IMG>.
§ If you have a lot of valid HTML files, you could write a script to do this in a programming language which understands SGML/XML markup (such as Omnimark, Balise, SGMLC, or a system using one of the SGML libraries for Perl, Python, or Tcl), or you could even use editor macros if you know what you're doing.
If your HTML files are invalid (HTML created by most WYSIWYG editors is invalid) then they will almost certainly have to be converted manually, although if the deformities are regular and carefully constructed, the files may actually be almost well-formed, and you could write a program or script to do as described above. To test for invalidity and non-conformance, check the following:
do the files contain markup syntax errors? For example, are there any backslashes instead of forward slashes on end-tags; or elements which nest incorrectly (eg <B>an element which starts <I>inside one element</B> but ends outside it</I>)?
do the files contain markup which conflicts with the HTML DTDs, such as headings inside paragraphs, list items outside list environments?
do the files use elements which are not in any DTD? Although this is easy to transform to a DTDless well-formed file (because you don't have to define elements in advance) most proprietary [browser-specific] extensions have never been formally defined, so it is often impossible to work out where they can meaningfully be used.
Markup which is valid but which is meaningless or void may need to be edited out before conversion (such as repeated empty paragraphs or linebreaks, empty tables, invisible `spacing' GIFs etc: XML uses stylesheets, so you won't need any of these).
See the rules for `well-formed' XML files for details of what you need to check in XML when converting.
There are XML versions of the HTML DTD in preparation but none ready yet:
Ben Trafford is developing an XML version of HTML 3.2
I have started work on an XML version of HTML Pro, but it's not easy, and I need convincing it's worth doing.
The Extensible HyperText Markup Language (XHTML) is a W3C project: This specification defines XHTML 1.0, a reformulation of HTML 4.0 as an XML 1.0 application, and three DTDs corresponding to the ones defined by HTML 4.0. The semantics of the elements and their attributes are defined in the W3C Recommendation for HTML 4.0. These semantics provide the foundation for future extensibility of XHTML. Compatibility with existing HTML user agents is possible by following a small set of guidelines.
§ Yes, provided you use SGML software which knows about the new WebSGML Adaptations to ISO 8879 (features needed to support XML, such as the special form for EMPTY elements; some aspects of the SGML Declaration such as NAMECASE GENERAL NO; multiple attribute declarations, etc).
¶ An alternative is to use an SGML DTD to let you create an SGML file, but one which does not use empty elements; and then remove the DocType Declaration so it becomes a well-formed DTDless XML file.
§ At the moment there are few tools which handle XML files unchanged because of the format of these EMPTY elements, but this is changing. The nsgmls parser has an XML conformance switch, introduced for use with Jade, and the first XML-specific editors and parsers are in use (see the question on software).
§ Yes, very easily, but at the moment there is still a need for tutorials, simple tools, and more examples of XML documents. Well-formed XML documents may look similar to HTML except for some small but very important points of syntax.
¶ The big practical difference is that XML has to stick to the rules. HTML browsers let you create broken HTML because they elide all the broken bits: with XML your files have to be correct or they simply won't work.
Yes, the XML Specification explicitly says XML uses ISO 10646, the international standard 31-bit character repertoire which covers most human (and some non-human) languages. This is currently congruent with Unicode.
The spec says (2.2): All XML processors must accept the UTF-8 and UTF-16 encodings of ISO 10646.... UTF-8 is an encoding of Unicode into 8-bit characters: the first 128 are the same as ASCII, the rest are used to encode the rest of Unicode into sequences of between 2 and 6 bytes. UTF-8 in its single-octet form is therefore the same as ISO 646 IRV (ASCII), so you can continue to use ASCII for English or other unaccented languages using the Latin alphabet. Note that UTF-8 is incompatible with ISO 8859-1 (ISO Latin-1) after code point 127 decimal (the end of ASCII). UTF-16 is like UTF-8 but with a scheme to represent the next 16 planes of 64k characters as two 16-bit characters.
...the mechanisms for signalling which of the two are in use, and for bringing other encodings into play, are [...] in the discussion of character encodings. The XML Specification explains how to specify in your XML file which coded character set you are using.
Use of UCS-4 can only legally be specified in SGML or XML when the WebSGML Adaptations to ISO 8879 are implemented: this enables numbers longer than eight digits to be used in the SGML Declaration.
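The byte-level behaviour of UTF-8 described above can be seen in a few lines of Python. This is only an illustrative sketch: the sample characters and the helper name utf8_bytes are chosen for the example and do not come from the XML Specification.

```python
# Compare how a few characters are stored in UTF-8 and in ISO 8859-1.
samples = ["A", "\u00e9", "\u20ac"]  # A, e-acute, euro sign (illustrative choices)


def utf8_bytes(ch: str) -> bytes:
    """Return the UTF-8 byte sequence for a single character."""
    return ch.encode("utf-8")


for ch in samples:
    print("U+%04X" % ord(ch), utf8_bytes(ch).hex())

# Plain ASCII text is byte-for-byte identical in ASCII and UTF-8.
ascii_same = "XML".encode("ascii") == "XML".encode("utf-8")

# Above the ASCII range the two encodings diverge:
latin1 = "\u00e9".encode("iso-8859-1")  # a single byte
utf8 = utf8_bytes("\u00e9")             # a two-byte sequence
```

A file saved as ISO Latin-1 but left undeclared (and therefore read as UTF-8) will usually be rejected at the first accented character for exactly this reason.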
Regardless of the specific encoding used, any character in the ISO 10646 character set may be referred to by the decimal or hexadecimal equivalent of its bit string: so no matter which character set you personally use, you can still refer to specific individual characters from elsewhere in the encoded repertoire by using &#dddd; (decimal character code) or &#xHHHH; (hexadecimal character code; the digits may be in upper or lower case, but the x must be lowercase).
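As a small illustration of numeric character references, the following Python sketch parses the same character written in decimal and in hexadecimal. It uses the standard library's ElementTree parser; the element name word is invented for the example.

```python
import xml.etree.ElementTree as ET

# The same character (e-acute, code point 233 decimal, E9 hexadecimal)
# referenced once in decimal and once in hexadecimal.
doc = "<word>caf&#233; and caf&#xE9;</word>"

text = ET.fromstring(doc).text
decimal_form, hex_form = text.split(" and ")

# Both references resolve to the same character, whatever encoding
# the file itself happens to be stored in.
print(decimal_form == hex_form)
```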
The terminology can get confusing, as can the numbers: see the ISO 10646 Concept Dictionary. Rick Jelliffe has XML-ized the ISO character entity sets.
§ A DTD is a file (or several files to be used together), written in XML's declaration syntax, which contains a formal definition of a particular type of document. It sets out what names can be used for element types, where they may occur, and how they all fit together. For example, if you want a document type to be able to describe <List>s which contain <Item>s, part of your DTD would contain something like
<!ELEMENT List (Item)+>
<!ELEMENT Item (#PCDATA)>
This fragment defines a list as an element type containing one or more items (that's the plus sign), and items as element types containing just text. XML is the formal specification language which processors read to automatically parse the DTD and then use that information to identify where every element type comes and how each relates to the other, so that stylesheets, navigators, browsers, search engines, databases, printing routines, and other applications can be used. The above fragment lets you create lists which get stored as:
<List><Item>Chocolate</Item><Item>Music</Item><Item>Surfing</Item></List>
How the list appears in print or on the screen depends on your stylesheet: you do not normally need to put anything in the XML to affect formatting in the way that had to be done with HTML before stylesheets.
In effect, a DTD provides applications with advance notice of what names and structures can be used in a particular document type. Using a DTD means you can be certain that all documents which belong to a particular type will be constructed and named in a conformant manner.
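The stored list shown earlier can be read back by any XML parser. The sketch below uses Python's standard ElementTree, which does not validate against a DTD, so the check named conforms is only a hand-written stand-in for the content model (Item)+ and is an assumption of this example, not part of any XML tool.

```python
import xml.etree.ElementTree as ET

stored = "<List><Item>Chocolate</Item><Item>Music</Item><Item>Surfing</Item></List>"

root = ET.fromstring(stored)

# Collect the character data of every Item child of List.
items = [item.text for item in root.findall("Item")]

# Crude imitation of the content model (Item)+ :
# at least one child, and every child is an Item.
conforms = len(root) >= 1 and all(child.tag == "Item" for child in root)

print(items, conforms)
```

A validating parser would make the same decision from the DTD itself, without any hand-written check.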
§ There are thousands of SGML DTDs already in existence in all kinds of areas (see the SGML Web pages for examples). Many of them can be downloaded and used freely; or you can write your own. As with any language, you need to learn it to do this (see for example Developing SGML DTDs by Maler and el Andaloussi, Prentice Hall, 1997, 0-13-309881-8): but XML is much simpler than full SGML: see the list of restrictions which shows what has been cut out. Existing SGML DTDs do need to be converted to XML for use with XML systems: read the question on converting SGML DTDs to XML, and expect to see announcements of popular DTDs eventually becoming available in XML format.
Bob DuCharme writes: Many XML developers are dissatisfied with the syntax of the markup declarations described in the XML spec for two reasons. First, they feel that if XML documents are so good at describing structured information, then the description of a document type's structure (its `schema') should be in an XML document instead of written with its own special syntax. In addition to being more consistent, this would make it easier to edit and manipulate the schema with regular document manipulation tools. Secondly, they feel that traditional DTD notation doesn't allow schema designers the power to impose enough constraints on the data -- for example, the ability to say that a certain element type must always have a positive integer value, that it may not be empty, or that it must be one of a list of possible choices. This would ease the development of software using that data because the developer would have less error-checking code to write.
Note: Users from a database or computer science background should be aware that SGML systems -- and that includes XML -- are not database management systems: they are text markup systems. While there are many similarities, such as the ones described here, some of the concepts of one are simply non-existent in the other: XML does not possess some database-like features in the same way that DBMSs do not possess markup-like ones.
Several groups have submitted proposals to the W3C for alternative ways to express document type schemata. In addition to offering schema constraints like data typing and the others described here, many take advantage of other current trends in software development such as object-oriented methodologies. The W3C Schema Working Group is currently reviewing these proposals and developing their own proposal based on the most useful features suggested by the existing proposals and the members of the Working Group.
§ The linking abilities of XML systems are much more powerful than those of HTML, so you'll be able to do much more with them. Existing HREF-style links will remain usable, but the new linking technology is based on the lessons learned in the development of other standards involving hypertext, such as TEI and HyTime, which let you manage bidirectional and multi-way links, as well as links to a span of text (within your own or other documents) rather than to a single point. These features have been available to standard SGML users in browsers like DynaText, Panorama and Multidoc Pro for many years, so there is considerable experience and expertise available in using them.
The XML Linking Specification (XLink) and XML Extended Pointer Specification (XPointer) documents contain a detailed draft specification. An XML link can be either a URL or a TEI-style Extended Pointer (XPointer), or both. A URL on its own is assumed to be a resource (as with HTML); if an XPointer follows it, it is assumed to be a sub-resource of that URL; an XPointer on its own is assumed to apply to the current document.
An XPointer is always preceded by one of #, ?, or |. The # and ? mean the same as in HTML applications; the | means the sub-resource can be found by applying the XPointer to the resource, but the method of doing this is left to the application.
The TEI Extended Pointer Notation (EPN) is much more powerful than the `fragment address' on the end of some URLs, as it allows you to specify the location of a link end using the structure of the document as well as (or in addition to) known, fixed points like IDs. For example, the linked second occurrence of the word `XPointer' two paragraphs back could be referred to as http://www.ucc.ie/xml/faq.sgml#ID(faq-hypertext)CHILD(2,*)(6,*), meaning the sixth child object within the second child object after the element whose ID is faq-hypertext. Count the objects from the start of this question in the SGML version (which has the ID faq-hypertext):
the title of the question;
<SECT2 ID="faq-hypertext"> <TITLE>How will XML affect my document links?</TITLE>
the second paragraph:
the character data from the start of the paragraph to the first item of markup:
<PARA>The
the markup item:
<ULINK URL="http://www.w3.org/TR/WD-xlink">XML Linking Specification (XLink)</ULINK>
the text item:
and
the markup item:
<ULINK URL="http://www.w3.org/TR/WD-xptr">XML Extended Pointer Specification (XPointer)</ULINK>
the next stretch of character data:
documents contain a detailed specification. An XML link can be either a URL or a TEI-style Extended Pointer (
and the next markup item:
<LINK ID="loc1" LINKEND="loc2">XPointer</LINK>
If you view this file with Panorama or MultiDoc Pro you can click on the highlighted cross-reference button at the start of the example sentence, and it will display the locations in Extended Pointer Notation of all the links to it, including the word XPointer mentioned. (Doing this in an HTML browser is not meaningful, as they do not support bidirectional linking or EPN.) David Megginson has produced an additional function for Emacs/psgml which will deduce an XPointer for any location in an SGML or XML file.
Yes, if the document type you use provides for math. The mathematics-using community is developing software, and there is a MathML proposal at the W3C, which is a native XML application. It would also be possible to make XML fragments from the long-expired HTML3, HTML Pro, or ISO 12083 Math, or OpenMath, or one of your own making. Browsers which display some math embedded in SGML already exist (eg DynaText, Panorama, Multidoc Pro).
The sophistication could vary from math expressions like [expression not reproduced], through simple inline equations such as [equation not reproduced], to display equations like [equation not reproduced].
If you are using an HTML browser to read this, of course, the above equations may not be rendered correctly. The Techexplorer plugin from IBM can be used with regular HTML browsers to render TeX math, and the Amaya testbed browser at the W3C has an experimental MathML display.
§ Because XML lets you define your own markup language, you can make full use of the extended hypertext features (see the question on Links) of XML to store or link to metadata in any format (eg ISO 11179, Dublin Core, Warwick Framework, Resource Description Framework (RDF), and Platform for Internet Content Selection (PICS)).
There are no predefined elements in XML, because it is an architecture, not an application, so it is not part of XML's job to specify how or if authors should or should not implement metadata. You are therefore free to use any suitable method from simple attributes to the embedding of entire Dublin Core/Warwick Framework metadata records. Browser makers may also have their own architectural recommendations or methods to propose.
This depends on what facilities the browser makers implement. XML is about describing information; scripting languages and languages for embedded functionality are software which enables the information to be manipulated at the user's end.
XML itself provides a way to define the markup needed to implement scripting languages: as a neutral standard it neither encourages nor discourages their use, and does not favour one language over another, so the field is wide open.
Scripting languages are provided for in a proposal for an Extensible Style Language, XSL (see question on Stylesheets).
Yes, any programming language can be used to output data from any source in XML format. There is a growing number of front-ends and back-ends for programming environments and data management environments to automate this.
Mark Watson writes in article 344c3443.4494773@news.infonex.net: I posted the spec to a Java toolkit for creating XML documents from relational database queries, and for save/loading XML documents to local files, and for transport via sockets, RMI, and CORBA IIOP. The spec is at: www.markwatson.com/XMLdb_0_1.htm.
There is a suite of Java tutorials (with source code and explanation) available at http://developerlife.com. These tutorials show the Java2 developer how to use the IBM, Sun and OpenXML Java parsers to write Java programs that use XML.
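Java is only one option. As a minimal sketch of the same idea in another language, the Python below turns some rows of data into XML using the standard ElementTree module. The rows, the element names records, record and title, and the function name rows_to_xml are all invented for the illustration; a real application would take them from its own database and document type.

```python
import xml.etree.ElementTree as ET

# Hypothetical rows, standing in for the result of a database query.
rows = [
    {"id": "1", "title": "Chocolate & cocoa"},
    {"id": "2", "title": "Music <live>"},
]


def rows_to_xml(rows):
    """Build a small XML document from a list of dictionaries."""
    root = ET.Element("records")
    for row in rows:
        record = ET.SubElement(root, "record", id=row["id"])
        ET.SubElement(record, "title").text = row["title"]
    # The serializer escapes & and < in character data for us.
    return ET.tostring(root, encoding="unicode")


xml_text = rows_to_xml(rows)
print(xml_text)
```

Building the tree through a library rather than by pasting strings together means the markup-start characters in the data are escaped automatically, so the output stays well-formed.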
§ The use of a stylesheet is required for XML. Some browsers may possibly provide simple default styles for popular elements like <Para>, or <List> containing <Item>, but in general a stylesheet gives the author much better control of the layout. But as with any system where files can be viewed at random by arbitrary users, the author cannot know what resources (such as fonts) are on the user's system, so care is needed.
§ The international standard for stylesheets for SGML documents is DSSSL, the Document Style and Semantics Specification Language (ISO 10179). This provides Scheme-like languages for stylesheets and document conversion, and is implemented in the Jade formatter.
The Cascading Stylesheet Specification (CSS) provides a simple syntax for assigning styles to elements, and has been part-implemented in some HTML browsers.
A new Extensible Style Language (XSL) has been drafted for use specifically with XML. This uses XML syntax (an XSL stylesheet is actually an XML file) but combines formatting features from both DSSSL and CSS (HTML) and has already attracted support from several major vendors.
Arbortext's experimental XML Styler has details of how to use it with XSL. You will also need the ActiveX controls and XSL codebase.
There are also many pre-existing proprietary stylesheet systems and implementations, many of which are deeply embedded in the technical documentation community (and thus heavily supported by one or more products):
Inso Corp's DynaText and DynaWeb browser and server products (their forerunner company, EBT, was where much of today's stylesheet technology was invented);
The Synex stylesheet DTD as used in Panorama and MultiDoc Pro;
The US military standard FOSI (Formatted Output Specification Instance) is implemented in Arbortext's ADEPT*Editor (and elsewhere);
SoftQuad's Author/Editor uses stylesheets controllable by the user.
§ Most browser and editor vendors appear to be committing to a move to XSL but with a large installed user base for their existing systems this will probably not occur quickly.
§ Graphics are just links which happen to have a picture file at the end rather than another piece of text, so they can be done in any way supported by the XLink and XPointer specifications (see earlier question), including using similar syntax to existing HTML images. They can also be done using XML's built-in NOTATION and ENTITY mechanism in a similar way to standard SGML. The linking specifications, however, give you much better control over the traversal and activation of links, so an author can specify, for example, whether or not to have an image appear when the page is loaded, or on a click from the user, or in a separate window, without having to resort to scripting. Which graphic file formats will be supported is a matter for the browser makers: XML itself doesn't predict or restrict you. GIF, JPG, TIFF, PNG, and CGM at a minimum would seem to make sense: there are moves towards creating a networked vector graphics standard (see next paragraph).
Peter Murray-Rust writes: GIFs and JPEGs cater for bitmaps (pixel representations of images). Vector graphics (scaleable) are being addressed in the W3C's graphics activity (see http://www.w3.org/Graphics/Activity). When a consensus is reached it will be possible to transmit the graphics representation within the XML file. For many graphics objects this will mean greatly decreased download time and scaling without loss of detail.
Note: You cannot embed a raw graphics file (or any other binary [non-text] data) directly into an XML file because any bytes resembling markup would get misinterpreted: you must refer to it by linking (see below).
Bob DuCharme adds: All the data in an XML document entity must be parseable XML. You can define an external entity as either a parsed entity (parseable XML) or an unparsed entity (anything else). Unparsed entities can be used for picture files, sound files, movie files, or whatever you like. They can only be referenced from within a document as the value of an attribute (much like a bitmap picture on an HTML Web page is the value of the img element's src attribute) and not part of the actual document. In an XML document, this attribute must be declared to be of type ENTITY, and the entity's declaration must specify a declared NOTATION, because if the entity isn't XML, the XML processor needs to know what it is. For example, in the following document, the colliepic entity is declared to have a JPEG notation, and it's used as the value of the empty dog element's picfile attribute.
<?xml version="1.0"?>
<!DOCTYPE dog [
<!NOTATION JPEG SYSTEM "Joint Photographic Experts Group">
<!ENTITY colliepic SYSTEM "lassie.jpg" NDATA JPEG>
<!ELEMENT dog EMPTY>
<!ATTLIST dog picfile ENTITY #REQUIRED>
]>
<dog picfile="colliepic"/>
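To show how an application might follow the picfile attribute back to the file it names, here is a rough Python sketch using the Expat bindings in the standard library (xml.parsers.expat). The handler function names and the two dictionaries are assumptions of the example; EntityDeclHandler and StartElementHandler are the parser's own callback slots.

```python
import xml.parsers.expat

document = """<?xml version="1.0"?>
<!DOCTYPE dog [
<!NOTATION JPEG SYSTEM "Joint Photographic Experts Group">
<!ENTITY colliepic SYSTEM "lassie.jpg" NDATA JPEG>
<!ELEMENT dog EMPTY>
<!ATTLIST dog picfile ENTITY #REQUIRED>
]>
<dog picfile="colliepic"/>"""

unparsed = {}    # entity name -> (system identifier, notation name)
attributes = {}  # attributes found on the dog element


def on_entity_decl(name, is_parameter, value, base, system_id, public_id, notation):
    # A notation name is reported only for unparsed (NDATA) entities.
    if notation is not None:
        unparsed[name] = (system_id, notation)


def on_start_element(name, attrs):
    if name == "dog":
        attributes.update(attrs)


parser = xml.parsers.expat.ParserCreate()
parser.EntityDeclHandler = on_entity_decl
parser.StartElementHandler = on_start_element
parser.Parse(document, True)

# Resolve the attribute value through the entity declaration.
system_id, notation = unparsed[attributes["picfile"]]
print(system_id, notation)
```

The parser never opens lassie.jpg itself: it only reports the name, the system identifier and the notation, and leaves it to the application to decide what to do with a JPEG.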
The XLink and XPointer linking specifications describe other ways to point to a non-XML file such as a graphic. These offer more sophisticated control over the external entity's position, handling, and appearance within the XML document.
(It would, however, be possible to include a text-encoded transformation of a binary file as a CDATA marked section, using something like UUencode with the markup characters ] and > removed from the map so that they could not occur and be misinterpreted.)
Right here (http://www.w3.org/TR/REC-xml). Includes the EBNF. There are also versions in Japanese (http://www.fxis.co.jp/DMS/sgml/xml/); Spanish (http://www.ucc.ie/xml/faq-es.html); Korean (http://xml.t2000.co.kr/faq/index.html) and a Java-ised annotated version at http://www.xml.com/axml/testaxml.htm.
Eve Maler has released the DTD and documentation used for the spec itself: this is a new version that was used to encode the XML, XLink, XPointer, DOM, etc specifications. Be aware that this version is no longer compatible with the version that XML 1.0 uses; please send any comments or questions to Eve.
§ Full SGML uses a Document Type Definition (DTD) to describe the markup (elements) available in any specific type of document. However, the design and construction of a DTD can be a complex and non-trivial task, so XML has been designed so it can be used either with or without a DTD. DTDless operation means you can invent markup without having to define it formally, at the penalty of losing automated control over the structuring of additional documents of the same type.
To make this work, a DTDless file in effect defines its own markup informally, by the simple existence and location of elements where you create them. But when an XML application such as a browser encounters a DTDless file, it needs to be able to understand the document structure while it reads it, because it has no DTD to tell it what to expect, so some changes have been made to the rules.
§ For example, HTML's <IMG> element is defined as EMPTY: it doesn't have an end-tag. An XML application reading a file without a DTD and encountering <IMG> would have no way to know whether or not to expect an end-tag, so the concept of `well-formed' files has become necessary. This makes the start and end of every element, and the occurrence of EMPTY elements completely unambiguous.
All XML documents, both DTDless and valid, must be well-formed:
if there is no DTD in use, the document must start with a Standalone Document Declaration (SDD) saying so:
<?xml version="1.0" standalone="yes"?>
<foo>
<bar>...<blort/>...</bar>
</foo>
David Brownell notes: XML that's `just' well-formed doesn't need to use a Standalone Document Declaration at all. Such declarations are there to permit certain speedups when processing documents while ignoring external parameter entities -- basically, you can't rely on external declarations in standalone documents. The types that are relevant are entities and attributes. Standalone documents must not require any kind of attribute value normalization or defaulting, otherwise they are invalid.
all tags must be balanced: that is, all elements which may contain character data must have both start- and end-tags present (omission is not allowed except for empty elements, see below);
all attribute values must be in quotes (the single-quote character [the apostrophe] may be used if the value contains a double-quote character, and vice versa): if you need both, use &apos; or &quot;, and declare them in the internal subset;
any EMPTY element tags (eg those with no end-tag like HTML's <IMG>, <HR>, and <BR> and others) must either end with /> or you have to make them appear non-EMPTY by adding a real end-tag; Example: <BR> would become either <BR/> or <BR></BR>.
there must not be any isolated markup-start characters (< or &) in your text data (ie they must be given as &lt; and &amp;), and the sequence ]]> must be given as ]]&gt; if it does not occur as the end of a CDATA marked section;
elements must nest inside each other properly (no overlapping markup, same rule as for all SGML);
Well-formed files with no DTD may use attributes on any element, but the attributes must all be of type CDATA by default.
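The rules above can be tried out against any non-validating parser. The following Python sketch feeds a handful of fragments to the standard ElementTree parser and records which ones it accepts; the function name is_well_formed and the sample fragments are invented for the illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


samples = [
    "<p>one<BR>two</p>",                     # empty element with no end-tag
    "<p>one<BR/>two</p>",                    # empty-element form
    "<p>one<BR></BR>two</p>",                # real end-tag added
    "<p><B>an <I>overlap</B> here</I></p>",  # elements which nest incorrectly
    "<p class=note>text</p>",                # attribute value not in quotes
    "<p>fish & chips</p>",                   # isolated ampersand
    "<p>fish &amp; chips</p>",               # ampersand given as an entity
]

results = {sample: is_well_formed(sample) for sample in samples}
for sample, ok in results.items():
    print("ok " if ok else "BAD", sample)
```

Unlike an HTML browser, the parser does not try to repair the bad fragments: it simply stops at the first error.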
§ XML files with no DTD are considered to have &lt;, &gt;, &apos;, &quot;, and &amp; predefined and thus available for use even without a DTD. Valid XML files must declare them explicitly if they use them. If you want to use more than these five default character entities, but you want to avoid having to write a full DTD, it is possible to declare just character entities on their own in the internal subset of a standalone XML file (thanks to Richard Lander for this):
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE example [
<!ENTITY nbsp "&#160;">
]>
<example>Three blanks.</example>
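As a quick check that an entity declared only in the internal subset really is usable, the Python sketch below parses such a file with the standard ElementTree module. It assumes the declaration is written with the uppercase keywords DOCTYPE and ENTITY (XML is case-sensitive) and that nbsp stands for character 160; the sentence inside the example element is made up so that the entity is actually referenced.

```python
import xml.etree.ElementTree as ET

document = """<?xml version="1.0" standalone="yes"?>
<!DOCTYPE example [
<!ENTITY nbsp "&#160;">
]>
<example>Three&nbsp;non-breaking&nbsp;blanks&nbsp;here.</example>"""

# The parser expands each &nbsp; reference using the declaration
# in the internal subset; no external DTD is needed.
text = ET.fromstring(document).text
blanks = text.count("\u00a0")
print(blanks)
```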
§ Valid XML files are those which have a Document Type Definition (DTD) like other SGML applications, and which adhere to it. They must already be well-formed.
A valid file begins like any other SGML file with a Document Type Declaration, but may have an optional XML Declaration prepended:
<?xml version="1.0"?>
<!DOCTYPE advert SYSTEM "http://www.foo.org/ad.dtd">
<advert>
<headline>...<pic/>...</headline>
<text>...</text>
</advert>
The XML Specification defines an SGML Declaration for XML which is fixed for all instances (the declaration has been removed from the text of the Specification and is now in a separate document). An XML version of the specified DTD must be accessible to the XML processor, either by being available locally (ie the user already has a copy on disk), or by being retrievable via the network. You can specify this by supplying the URL for the DTD in a System Identifier (as in the example above). It is possible (many people would say preferable) to supply a Formal Public Identifier, but if used, this must precede the System Identifier, which must still be given (and only the PUBLIC keyword is used):
<!DOCTYPE advert PUBLIC "-//Foo, Inc//DTD Advertisements//EN" "http://www.foo.org/ad.dtd">
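To see how a processor picks these two identifiers apart, here is a small Python sketch using the Expat bindings in the standard library. The handler name on_doctype and the dictionary seen are assumptions of the example; by default this parser only reports the identifiers and does not fetch the DTD from the network.

```python
import xml.parsers.expat

document = """<?xml version="1.0"?>
<!DOCTYPE advert PUBLIC "-//Foo, Inc//DTD Advertisements//EN"
  "http://www.foo.org/ad.dtd">
<advert><headline>...<pic/>...</headline><text>...</text></advert>"""

seen = {}


def on_doctype(name, system_id, public_id, has_internal_subset):
    # Called once, when the Document Type Declaration is read.
    seen.update(name=name, system_id=system_id, public_id=public_id)


parser = xml.parsers.expat.ParserCreate()
parser.StartDoctypeDeclHandler = on_doctype
parser.Parse(document, True)

print(seen["name"], seen["public_id"], seen["system_id"])
```

An application with a local catalog could look up the Formal Public Identifier first and fall back to the URL in the System Identifier only if no local copy is found.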
The defaults for the other attributes of the XML Declaration are version="1.0" and encoding="UTF-8".
There is no single answer to this: a lot depends on what you are designing the document type for. The two extremes are best illustrated with examples.
`Traditional' textual practice is to put the `real' text (what would be printed) as character data content, and keep the metadata (like line numbers) in attributes, from where they can more easily be isolated for analysis or special treatment like display in the margin or in a mouseover:
<l n="184"><sp>Portia</sp><text>The quality of mercy is not strain'd,</text></l>
But from the system's point of view, there is nothing `wrong' with storing the data the other way round, especially where the volume of text data on each occasion is relatively small:
<line speaker="Portia" text="The quality of mercy is not strain'd,">184</line>
A lot will depend on what you want to do with the information and which bits of it are easiest accessed by each method. A rule of thumb for conventional textual documents is that if the markup were all stripped away, the bare text should still be readable and usable, even if inconvenient. For database output, however, or other machine-generated documents, `reading' may not be meaningful, so it is perfectly possible to have documents where all the data is in attributes, and the document contains no character data in content models at all. See http://www.oasis-open.org/cover/elementsAndAttrs.html for more.
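The two styles shown above carry the same information, as this Python sketch tries to show using the standard ElementTree module. The function names from_content and from_attributes are invented for the illustration.

```python
import xml.etree.ElementTree as ET

content_style = """<l n="184"><sp>Portia</sp><text>The quality of mercy is not strain'd,</text></l>"""
attribute_style = """<line speaker="Portia" text="The quality of mercy is not strain'd,">184</line>"""


def from_content(xml_text):
    """Read the line where the text is element content."""
    el = ET.fromstring(xml_text)
    return {"n": el.get("n"), "speaker": el.findtext("sp"), "text": el.findtext("text")}


def from_attributes(xml_text):
    """Read the line where the text is held in attributes."""
    el = ET.fromstring(xml_text)
    return {"n": el.text, "speaker": el.get("speaker"), "text": el.get("text")}


# The rule of thumb: strip the markup and see what is left.
bare_content = "".join(ET.fromstring(content_style).itertext())
bare_attribute = "".join(ET.fromstring(attribute_style).itertext())

print(from_content(content_style) == from_attributes(attribute_style))
print(repr(bare_content), repr(bare_attribute))
```

With the markup stripped, the first style still yields the speaker and the line, while the second leaves only the line number, which is the difference the rule of thumb is pointing at.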
The principal changes are in what you can do in writing a Document Type Definition (DTD). To simplify the syntax and make it easier to write processing software, a large number of SGML markup declaration options have been suppressed (see the list of omitted features).
An extra delimiter is permitted in Names (the colon) for use in experiments with namespaces (enabling DTDs to distinguish element source, ownership, or application). A colon may only appear in mid-name, though, not at the start or the end. Work is ongoing to define how these can be declared and referenced using element and attribute markup.
Details are no longer in this FAQ as they are now changing too rapidly to be kept up to date: see the XML pages at http://www.oasis-open.org/cover/xml.html.
¶ For a detailed guide to examples of SGML and XML programs and the concepts behind them, see the editor's book Understanding SGML and XML Tools (Kluwer, 1998, 0-7923-8169-6).
For browsers see the question on XML Browsers and the details of the xml-dev mailing list for software developers. Bert Bos keeps a list of some XML developments in bison, flex, perl and Python.
¶ Information for developers of Chinese XML systems can be found at the Chinese XML Now! website of Academia Sinica: http://www.ascc.net/xml/. This site includes an FAQ and test files.
§ Only to serve up .xml files as the correct MIME type (application/xml, see RFC2376), so for serving XML documents all that is needed is to edit the mime-types file (or its equivalent) and add the line
application/xml xml XML
In some servers (eg Apache), users can change the MIME type for specific file types from their own directories by using directives in a .htaccess file. The MIME content-type text/xml must only be applied to pure ASCII files (ISO 646 IRV) because of a character-set restriction in the RFC: for all normal use, application/xml is the one to go for.
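The effect of such a mapping can be imitated with Python's standard mimetypes module, which keeps the same kind of table of file extensions and content types. This is only a sketch of the lookup, not a server configuration: the variable name registry and the file names are invented for the example.

```python
import mimetypes

# A separate registry, so the interpreter's global table is left untouched.
registry = mimetypes.MimeTypes()

# The equivalent of adding lines to a server's mime-types file.
registry.add_type("application/xml", ".xml")
registry.add_type("text/css", ".css")

xml_type, _ = registry.guess_type("faq.xml")
css_type, _ = registry.guess_type("faq.css")

print(xml_type, css_type)
```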
§ Since XML is designed to support stylesheets and sophisticated hyperlinking, XML documents may be accompanied by ancillary files in the same way that SGML files are: DTDs, entity files, catalogs, stylesheets, etc, which may need other MIME Content-Type entries, such as text/css for CSS stylesheets. XUA (XML User Agent), which is one of the planned deliverables of the XML WG, might provide a mechanism for packaging XML documents and XSL styles into a single message.
If you run scripts generating HTML, which you wish to work with XML, they will need to be modified to produce the relevant document type.
INCLUDEs? Yes, so long as what they generate ends up as part of an XML-conformant file (ie either valid or just well-formed).
INCLUDEs? The same rule applies as for server-side INCLUDEs, so you need to ensure that any embedded code which gets passed to a third-party engine (eg SDQL enquiries, Java writes, LiveWire requests, streamed content, etc) does not contain any characters which might be misinterpreted as XML markup (ie no angle brackets or ampersands): either use a CDATA marked section to avoid your XML application parsing the embedded code, or use the standard &lt;, &gt;, and &amp; character entity references instead.
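Both ways of protecting embedded code can be sketched in a few lines of Python. The script text and the element name script are invented for the example; escape comes from the standard xml.sax.saxutils module.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical embedded code containing angle brackets and ampersands.
script = "if (a < b && c > d) { run(); }"

# Option 1: replace the markup characters with entity references.
escaped = "<script>" + escape(script) + "</script>"

# Option 2: wrap the code in a CDATA marked section.  This only works
# if the code does not itself contain the sequence ]]> .
cdata = "<script><![CDATA[" + script + "]]></script>"

# Either way the parser hands the original code back unchanged.
from_escaped = ET.fromstring(escaped).text
from_cdata = ET.fromstring(cdata).text
print(from_escaped == script, from_cdata == script)
```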
§ For implementation to succeed, the terminology needs to be precise. Design goal 8 of the specification tells us that the design of XML shall be formal and concise. To describe XML in formal terms, the specification uses the concise language of Computer Science, which is often confusing to non-CS people because it uses well-known English words in a specialised sense which can be very different from their commonly understood meanings -- for example, `grammar', `production', `token', or `terminal'.
The specification rarely explains these terms because of the other part of this design goal: the specification should be concise. It doesn't repeat explanations that are available elsewhere. In essence this means that to grok the fullness of the spec, you need foreknowledge of computer science and SGML.
Sloppy terminology in specifications causes misunderstandings, so formal standards have to be phrased in formal terminology. This FAQ is not a formal document, and the astute reader may already have noticed it refers to `element names' where `element type names' is more correct; but the former is more widely understood.
Those new to SGML may want to read something like the Gentle Introduction to SGML chapter of the TEI Guidelines.
Thanks to Bob DuCharme for suggestions and some bits from his book on the XML Spec.
Several are available or under development. Details of these and other XML software are held on the SGML/XML Web pages.
The big conversion and application development engines like Balise, Omnimark, and SGMLC are all working on adding XML. Details of SGML software of all kinds are on the SGML Web pages.
The Document Object Model (DOM) (http://www.w3.org/TR/PR-DOM-Level-1) provides an abstract API for constructing, accessing, and manipulating XML and HTML documents. A binding of the DOM to a particular programming language provides a concrete API.
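As an illustration of what a language binding looks like in practice, the sketch below uses xml.dom.minidom, a small DOM implementation in Python's standard library. The list document is borrowed from the DTD question; which binding you use will depend on your own programming language and tools.

```python
from xml.dom.minidom import parseString

doc = parseString("<List><Item>Chocolate</Item><Item>Music</Item></List>")

# Accessing: find elements by name and read their character data.
items = [node.firstChild.data for node in doc.getElementsByTagName("Item")]

# Constructing and manipulating: create a new element and append it.
new_item = doc.createElement("Item")
new_item.appendChild(doc.createTextNode("Surfing"))
doc.documentElement.appendChild(new_item)

serialized = doc.documentElement.toxml()
print(items)
print(serialized)
```

The method names (getElementsByTagName, createElement, appendChild) are those of the DOM itself, so the same steps read much the same in any other binding.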
James Clark has a collection of test cases for testing XML parsers at http://www.jclark.com/xml/ which includes a conformance test.
This works exactly the same as for regular SGML. First you declare the entity you want to include, and then you reference it by name:
<!ENTITY % mylists PUBLIC "-//Foo, Inc//ENTITIES Common list structures//EN" "dtds/listfrag.ent">
...
%mylists;
Such declarations traditionally go all together towards the top of the main DTD file, where they can be managed and maintained, but this is not essential so long as they are declared before they are used. You use Parameter Entity syntax for this (the percent sign) because the file is to be included at DTD compile time, not when the document instance itself is parsed.
Note that a URL is compulsory in XML for all external file references: standard rules for dereferencing URLs apply (assume the same method, server, and directory as the containing document). The URL can be supplied either as a System Identifier alone:
<!ENTITY mydtd SYSTEM "http://www.foo.bar/~blort/my.dtd">
or as a second parameter to a formal Public Identifier as in the earlier example.
There are numerous projects being started to convert common or popular SGML DTDs to XML format (for example Patrice Bonhomme is working on an unofficial XML version of the TEI Lite DTD: details of that are discussed on the TEI-L mailing list).
The following checklist comes courtesy of Seán McGrath (author of XML By Example, Prentice Hall, 1998) [my italics]:
- No equivalent of the SGML Declaration, so keywords, character set, etc are essentially fixed;
- Tag minimization is not allowed, so <!ELEMENT x - O (A,B)> becomes <!ELEMENT x (A,B)> and <!ELEMENT x - O EMPTY> becomes <!ELEMENT x EMPTY>;
- #PCDATA must only occur at the extreme left of an OR model, and in XML such a mixed-content group must also end with an asterisk, eg <!ELEMENT x (A|B|#PCDATA|C)> becomes <!ELEMENT x (#PCDATA|A|B|C)*> and <!ELEMENT x (A,#PCDATA)> is illegal;
- No CDATA or RCDATA elements [declared content];
- Some SGML attribute types are not allowed in XML, eg NUTOKEN. Also there are no NOTATION attributes (data attributes);
- Some SGML attribute defaults are not allowed in XML, eg CONREF;
- Comments cannot be inline to declarations [they can in standard SGML], like <!ELEMENT x (A,B) -- this is an SGML comment in a declaration -->;
- A whole bunch of SGML optional features are not present in XML: a) all forms of tag minimization (OMITTAG, DATATAG, SHORTREF, etc); b) Link Process Definitions; c) multiple DTDs per document; and many more: see the question on the bits of SGML that were removed for XML for a reference to the complete list;
- And last but not least, CONCUR!
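Several items on this checklist are syntax errors in XML, so even a non-validating parser will reject them. The sketch below feeds a few SGML-style declarations, and an XML rewrite of each, to the Expat parser bundled with Python. The `accepts` helper, the attribute name `n`, and the XML rewrites of the attribute declarations are choices made for this illustration, not part of the checklist.

```python
from xml.parsers import expat


def accepts(declaration):
    """Return True if Expat accepts the declaration inside an internal subset."""
    document = "<!DOCTYPE x [\n" + declaration + "\n]>\n<x/>"
    parser = expat.ParserCreate()
    try:
        parser.Parse(document, True)
    except expat.ExpatError:
        return False
    return True


# (SGML-style declaration, XML equivalent)
CASES = [
    ("<!ELEMENT x - O (A,B)>", "<!ELEMENT x (A,B)>"),
    ("<!ELEMENT x (A|B|#PCDATA|C)>", "<!ELEMENT x (#PCDATA|A|B|C)*>"),
    ("<!ELEMENT x (A,B) -- an SGML comment -->", "<!ELEMENT x (A,B)>"),
    ("<!ATTLIST x n NUTOKEN #IMPLIED>", "<!ATTLIST x n NMTOKEN #IMPLIED>"),
    ("<!ATTLIST x n CDATA #CONREF>", "<!ATTLIST x n CDATA #IMPLIED>"),
]

results = [(accepts(sgml), accepts(xml)) for sgml, xml in CASES]
for (sgml, xml), (sgml_ok, xml_ok) in zip(CASES, results):
    print("SGML form accepted:", sgml_ok, "| XML form accepted:", xml_ok, "|", sgml)
```

Because Expat does not validate, this only checks the syntax of the declarations; whether a document instance conforms to them is a separate question.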
There are some important differences between the internal and external subset portions of a DTD in XML: a) marked sections can only occur in the external subset; b) in the internal subset, Parameter Entities can only be used to replace entire declarations, eg the following is invalid XML:
<!DOCTYPE x [ <!ENTITY % modelx "(A|B)*"> <!ELEMENT x %modelx;> ]> <x></x>
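The restriction on Parameter Entities in the internal subset can be checked with the Expat parser bundled with Python. In this sketch the first document is the invalid example from the text; the second, in which a Parameter Entity stands for a whole declaration, is an assumed rewrite for comparison, and the entity name `declx` is invented.

```python
from xml.parsers import expat

# The example from the text: the reference sits inside a declaration.
INVALID = '<!DOCTYPE x [ <!ENTITY % modelx "(A|B)*"> <!ELEMENT x %modelx;> ]> <x></x>'

# Assumed rewrite: the entity replaces an entire declaration.
VALID = '<!DOCTYPE x [ <!ENTITY % declx "<!ELEMENT x (A|B)*>"> %declx; ]> <x></x>'


def try_parse(document):
    """Return (accepted, names of element declarations seen by the parser)."""
    names = []
    parser = expat.ParserCreate()
    # Expand parameter entity references rather than skipping them.
    parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)
    parser.ElementDeclHandler = lambda name, model: names.append(name)
    try:
        parser.Parse(document, True)
    except expat.ExpatError:
        return False, names
    return True, names


invalid_ok, invalid_names = try_parse(INVALID)
valid_ok, valid_names = try_parse(VALID)
print(invalid_ok, invalid_names)
print(valid_ok, valid_names)
```

The first form would be acceptable if the declarations were moved to an external subset, where Parameter Entity references may appear within declarations.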
Electronic Data Interchange (EDI) has been used in e-commerce for many years to exchange documents between the commercial partners in a transaction. It has required special proprietary software, but there are now moves to enable EDI data to travel inside XML. Details of developments are at http://www.xmledi.com/ and there is a guideline document at http://www.geocities.com/WallStreet/Floor/5815/guide.htm.
Illustration from Dale Dougherty's article in Web Review (courtesy of the publishers).