[This local archive copy mirrored from the canonical site: http://www.personal.u-net.com/~sgml/xmlintro.htm; links may not have complete integrity, so use the canonical document at this URL if possible.]

An Introduction to the Extensible Markup Language (XML)

by Martin Bryan of The SGML Centre

This file gives a very brief overview of the most commonly used components of the World Wide Web Consortium's (W3C) Extensible Markup Language (XML), as specified in the Proposed Recommendation dated 8th December 1997.

What is XML?

XML is a subset of the Standard Generalized Markup Language (SGML) defined in ISO standard 8879:1986 that is designed to make it easy to interchange structured documents over the Internet. XML files always clearly mark where the start and end of each of the component parts of an interchanged document occur. XML restricts the use of SGML constructs to ensure that fallback options are available when access to certain components of the document is not currently possible over the Internet. It also defines how Internet Uniform Resource Locators can be used to identify component parts of XML data streams.

By defining the role of each element of text in a formal model, known as a Document Type Definition (DTD), users of XML can check that each component of a document occurs in a valid place within the interchanged data stream. An XML DTD allows computers to check, for example, that users do not accidentally enter a third-level heading without first having entered a second-level heading, something that cannot be checked using the HyperText Markup Language (HTML) previously used to code documents that form part of the World Wide Web (WWW) of documents accessible through the Internet.

However, unlike SGML, XML does not require the presence of a DTD. If no DTD is available, either because all or part of it is not accessible over the Internet or because the user failed to create it, an XML system can assign a default definition for undeclared components of the markup.

XML allows users to:

bring multiple files together to form compound documents
identify where illustrations are to be incorporated into text files, and the format used to encode each illustration
provide processing control information to supporting programs, such as document validators and browsers
add editorial comments to a file.

It is important to note, however, that XML is not:

a predefined set of tags, of the type defined for HTML, that can be used to mark up documents
a standardized template for producing particular types of documents.

XML was not designed to be a standardized way of coding text: in fact it is impossible to devise a single coding scheme that would suit all languages and all applications. Instead XML is a formal language that can be used to pass information about the component parts of a document to another computer system. XML is flexible enough to be able to describe any logical text structure, whether it be a form, memo, letter, report, book, encyclopedia, dictionary or database.

The components of XML

XML is based on the concept of documents composed of a series of entities. (`Entity' is the English spelling of the French word `entité', the Teutonic equivalent of which is `thing'. Those familiar with modern programming techniques will probably be more comfortable using the word `object'. All these terms are synonymous.) Each entity can contain one or more logical elements. Each of these elements can have certain attributes (properties) that describe the way in which it is to be processed. XML provides a formal syntax for describing the relationships between the entities, elements and attributes that make up an XML document, which can be used to tell the computer how it can recognize the component parts of each document.

XML differs from other markup languages in that it does not simply indicate where a change of appearance occurs, or where a new element starts. XML sets out to clearly identify the boundaries of every part of a document, whether it be a new chapter, a piece of boilerplate text, or a reference to another publication.

To allow the computer to check the structure of a document users must provide it with a document type definition that declares each of the permitted entities, elements and attributes, and the relationships between them.

How is XML used?

To use a set of markup tags that has been defined by a trade association or similar body, users need to know how the markup tags are delimited from normal text and in which order the various elements should be used. Systems that understand XML can provide users with lists of the elements that are valid at each point in the document, and will automatically add the required delimiters to the name to produce a markup tag. Where the data capture system does not understand XML, users can enter the XML tags manually for later validation. Elements and their attributes are entered between matched pairs of angle brackets (<...>) while entity references start with an ampersand and end with a semicolon (&...;).

Because XML tag sets are based on the logical structure of the document they are somewhat easier to understand, and remember, than physically based markup schemes of the type typically provided by word processors. An XML memo might be coded as:

<memo>
<to>All staff</to>
<from>Martin Bryan</from>
<date>5th November</date>
<subject>Cats and Dogs</subject>
<text>Please remember to keep all cats and dogs indoors tonight.</text>
</memo>

This form of the file is ideal for a computer to follow, and therefore to process. The start and end of each logical element of the file have been clearly identified by entry of a start-tag (e.g. <to>) and an end-tag (e.g. </to>).
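As a rough illustration of how a program can follow this structure, the sketch below parses the memo and lists each element with its content. The choice of Python, its standard xml.etree.ElementTree module, and the variable names are assumptions made for this example; the text itself does not prescribe any particular tool.

```python
import xml.etree.ElementTree as ET

# The memo from the text, held as a string for the purposes of this sketch.
MEMO = """<memo>
<to>All staff</to>
<from>Martin Bryan</from>
<date>5th November</date>
<subject>Cats and Dogs</subject>
<text>Please remember to keep all cats and dogs indoors tonight.</text>
</memo>"""

root = ET.fromstring(MEMO)

# Each child of <memo> is one logical element; its tag names its role.
fields = [(child.tag, child.text) for child in root]

for name, content in fields:
    print(f"{name}: {content}")
```

Because every element is explicitly delimited, the program needs no knowledge of memos to split the file into its named parts.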

Notice that at this point nothing has been said about the format of the final document. From the neutral format provided by XML users can choose to display the memo on a screen, whose size can be varied to suit user preferences, to print the text onto a pre-printed form, or to generate a completely new form, positioning each element of the document where needed.
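To make the idea of a neutral format concrete, the following sketch derives two different presentations from the same memo. It is only an illustration: the function names as_screen_text and as_form_fields are hypothetical, and Python's xml.etree.ElementTree is assumed as the parser.

```python
import xml.etree.ElementTree as ET

MEMO = (
    "<memo><to>All staff</to><from>Martin Bryan</from>"
    "<date>5th November</date><subject>Cats and Dogs</subject>"
    "<text>Please remember to keep all cats and dogs indoors tonight.</text>"
    "</memo>"
)

def as_screen_text(memo):
    """Lay the memo out as labelled lines, e.g. for display on a screen."""
    return "\n".join(f"{child.tag.upper():<8} {child.text}" for child in memo)

def as_form_fields(memo):
    """Return a name-to-content mapping that a form-filling program
    could use to position each element wherever the form requires."""
    return {child.tag: child.text for child in memo}

memo = ET.fromstring(MEMO)
print(as_screen_text(memo))
print(as_form_fields(memo)["subject"])
```

The markup is untouched in both cases; only the program consuming it decides how the result looks.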

Defining your own tag sets

To define tag sets users must create a Document Type Definition that formally identifies the relationships between the various elements that form their documents. For a simple memo the XML DTD might take the form:

<!DOCTYPE memo [
<!ELEMENT memo    (to, from, date, subject?, para+) >
<!ELEMENT para    (#PCDATA) >
<!ELEMENT to      (#PCDATA) >
<!ELEMENT from    (#PCDATA) >
<!ELEMENT date    (#PCDATA) >
<!ELEMENT subject (#PCDATA) >
]>

This model tells the computer that a memo consists of a sequence of header elements, <to>, <from>, <date> and, optionally, <subject>, which must be followed by the contents of the memo. The contents of the memo defined in this simple example are made up of a number of paragraphs, at least one of which must be present (this is indicated by the + immediately after para). In this simplified example a paragraph has been defined as a leaf node that can contain parsed character data (#PCDATA), i.e. data that has been checked to ensure that it contains no unrecognized markup strings. In a similar way the <to>, <from>, <date> and <subject> elements have been declared to be leaf nodes in the document structure tree.
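The content model (to, from, date, subject?, para+) behaves much like a regular expression over the names of the child elements. The sketch below makes that analogy explicit. It is not a validating parser (Python's standard library does not include one); it is a hand translation of this single model, and the names MEMO_MODEL and matches_memo_model are invented for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of (to, from, date, subject?, para+). Each child name
# is followed by a space so that the pattern matches whole names only.
MEMO_MODEL = re.compile(r"to from date (subject )?(para )+")

def matches_memo_model(xml_text):
    """Return True if the children of <memo> occur in the declared order."""
    root = ET.fromstring(xml_text)
    if root.tag != "memo":
        return False
    child_names = "".join(child.tag + " " for child in root)
    return MEMO_MODEL.fullmatch(child_names) is not None

good = "<memo><to>a</to><from>b</from><date>c</date><para>d</para></memo>"
no_para = "<memo><to>a</to><from>b</from><date>c</date></memo>"
print(matches_memo_model(good), matches_memo_model(no_para))
```

A real validating processor performs this kind of check for every element, using the declarations in the DTD rather than a pattern written by hand.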

Where the position of an element in the model is variable the element can be defined as part of a repeatable choice of elements. For example, to allow references to books or figures to occur anywhere in the text of a paragraph, but not in the heading, the model definition for the <para> element could be modified to read:

<!ELEMENT para (#PCDATA|citation|figref)* >

where the added elements are defined as:

<!ELEMENT citation (#PCDATA) >
<!ELEMENT figref (#PCDATA) >
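A paragraph that mixes character data with embedded elements can be walked piece by piece, as in the following sketch. The sample paragraph and the variable names are made up for illustration, and Python's xml.etree.ElementTree is assumed; in that module the text following a child element is held in the child's tail attribute.

```python
import xml.etree.ElementTree as ET

PARA = (
    "<para>See <citation>the style guide</citation> and "
    "<figref>Figure 1</figref> for details.</para>"
)

para = ET.fromstring(PARA)

# Collect the paragraph as an ordered list of (kind, content) pieces.
pieces = [("text", para.text)]
for child in para:
    pieces.append((child.tag, child.text))   # the embedded element
    pieces.append(("text", child.tail))      # character data after it

for kind, content in pieces:
    print(f"{kind:<9}{content!r}")
```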

Some elements do not require any contents as such. They are simply placeholders that indicate where a certain process is to take place. A special form of tag is used in XML to indicate empty elements that do not have any contents, and therefore have no end-tag. For example, a <graphic/> element is typically an empty element that acts as a place holder for the graphical part of a figure while an optional <caption> element identifies any text associated with the illustration. Together the <graphic> and <caption> make up a <figure>, which would typically be placed at the same level as a text paragraph. The following element declarations can be used to extend the model for a <memo> to allow it to include figures as well as text:

<!ELEMENT memo    (to, from, date, subject?, (para|figure)+) >
<!ELEMENT figure  (graphic, caption?) >
<!ELEMENT graphic EMPTY >
<!ELEMENT caption (#PCDATA) >
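The sketch below shows how an application might recognize an empty element after parsing. The sample figure, the caption text and the helper name is_empty are assumptions for this example, as is the use of Python's xml.etree.ElementTree. Note that this parser reports a start-tag immediately followed by an end-tag in the same way as the special empty-element form.

```python
import xml.etree.ElementTree as ET

FIGURE = "<figure><graphic/><caption>Site plan</caption></figure>"

def is_empty(element):
    """True if the element has neither child elements nor character data."""
    return len(element) == 0 and not element.text

figure = ET.fromstring(FIGURE)
graphic = figure.find("graphic")
caption = figure.find("caption")

print(is_empty(graphic), is_empty(caption))
```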

Defining the attributes of elements

Where elements can have variable forms, or need to be linked together, they can be given suitable attributes to specify the properties to be applied to them. For example, it might be decided that the <subject> field of a memo could optionally be printed in bold or italics. A suitable attribute list declaration might, in this case, be:

<!ATTLIST subject form (bold|italic|normal) "normal" >

This tells the computer that the <subject> start-tag can be amended to read <subject form="bold"> or <subject form="italic"> if a variant font is required. If no such change is requested the program is to use the default value to make the tag read <subject form="normal">.
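The effect of a default value can be observed with a parser that reads the attribute list declaration. The sketch below assumes Python's xml.etree.ElementTree, whose underlying expat parser reports attributes defaulted from declarations in the internal subset; the helper name subject_form and the cut-down memo are hypothetical.

```python
import xml.etree.ElementTree as ET

DTD = """<!DOCTYPE memo [
<!ATTLIST subject form (bold|italic|normal) "normal" >
]>
"""

def subject_form(memo_xml):
    """Return the form attribute the application sees on <subject>."""
    subject = ET.fromstring(DTD + memo_xml).find("subject")
    return subject.get("form")

# No attribute entered: the declared default is supplied by the parser.
defaulted = subject_form("<memo><subject>Cats and Dogs</subject></memo>")
# Attribute entered explicitly: the entered value is used.
explicit = subject_form('<memo><subject form="bold">Cats and Dogs</subject></memo>')
print(defaulted, explicit)
```

Note that this parser does not validate, so it would not reject a value outside the declared list; it only supplies the default.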

One especially important type of attribute is the unique identifier. Because it is unique it can be used to provide a cross reference between two points in the document. For example, you can ensure that a unique identifier is assigned to each figure by adding an attribute list declaration of the following form to the DTD:

<!ATTLIST figure id ID #REQUIRED >

This tells the computer that every <figure> element must be entered with a unique identifier within the start-tag, e.g. as <figure id="fig1"> rather than just <figure>.

Unique identifiers can be referred to within the text by use of attributes that form identifier references. Typically a figure reference element might have its attribute declaration list defined as:

<!ATTLIST figref refid IDREF #IMPLIED >

The keyword #IMPLIED indicates that it is permissible to omit the attribute in some instances of the <figref> element. For example, this might need to be done if the reference was to a figure in another publication. (Unique identifiers only apply to the current XML document instance - they are not necessarily unique across document sets.)
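The checks implied by ID and IDREF attributes can be sketched as follows. This is a simplified illustration rather than full validation: it looks only at the id attribute of <figure> and the refid attribute of <figref>, whereas a validating processor checks every attribute declared as ID or IDREF. The function name check_references and the sample documents are invented for the example.

```python
import xml.etree.ElementTree as ET

def check_references(xml_text):
    """Return a list of problems with figure ids and figref refids."""
    root = ET.fromstring(xml_text)
    problems = []
    seen = set()
    for figure in root.iter("figure"):
        fig_id = figure.get("id")
        if fig_id is None:
            problems.append("figure without id")       # id is #REQUIRED
        elif fig_id in seen:
            problems.append(f"duplicate id: {fig_id}")  # ids must be unique
        else:
            seen.add(fig_id)
    for ref in root.iter("figref"):
        refid = ref.get("refid")  # may be absent, because it is #IMPLIED
        if refid is not None and refid not in seen:
            problems.append(f"unresolved refid: {refid}")
    return problems

ok = ('<memo><para>See <figref refid="fig1">Figure 1</figref></para>'
      '<figure id="fig1"><graphic/></figure></memo>')
bad = ('<memo><para><figref refid="fig9">Figure 9</figref></para>'
       '<figure id="fig1"><graphic/></figure>'
       '<figure id="fig1"><graphic/></figure></memo>')
print(check_references(ok))
print(check_references(bad))
```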

Incorporating standard and non-standard text elements

XML also contains techniques for adding standard (boilerplate) text to a file, and for handling characters that are outside the standard character set, but which are available on certain output devices.

Commonly used text can be declared within the DTD as a text entity. A typical text entity definition could take the form:

<!ENTITY company "The SGML Centre" >

Once such a declaration has been made in the DTD users can use an entity reference of the form &company; in place of the full name of the company. An advantage of using this technique is that, should the name of the company referred to by the mnemonic change later, only the entry in the DTD needs to be changed as the entity reference will automatically call in the current definition.
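A short sketch of this substitution, assuming Python's xml.etree.ElementTree (whose expat parser expands internal text entities declared in the internal subset). The sample sentence is made up for illustration.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE para [
<!ENTITY company "The SGML Centre" >
]>
<para>This overview was written at &company;.</para>"""

para = ET.fromstring(DOC)

# The application sees the replacement text, not the entity reference.
print(para.text)
```

Changing the quoted string in the entity declaration changes every place where &company; is used, without touching the text itself.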

Text stored in another file can also be incorporated into a file using entity references. In this case the entity declaration in the DTD identifies the location of the file containing the text to be referenced, e.g.:

<!ENTITY appendix SYSTEM "http://www.myco.com/pub/book4/appendix.xml" >

and the entity reference (&appendix;) shows where the file is to be added to the main text stream.

Where non-standard characters are required special system-dependent entities can be declared to show how the characters can be generated. A typical entry might read:

<!ENTITY eacute "&#233;" >

When the string &eacute; is encountered in the text the computer will replace it by the character whose decimal code is 233.

Alternatively the decimal character number, or its hexadecimal equivalent preceded by x, can be used directly as part of a character reference, e.g. &#233; or &#xE9; to generate é.
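The equivalence of the decimal and hexadecimal forms can be checked with any XML parser. The sketch below assumes Python's xml.etree.ElementTree; the element name word and the sample text are hypothetical.

```python
import xml.etree.ElementTree as ET

# Decimal 233 and hexadecimal E9 identify the same character.
decimal = ET.fromstring("<word>caf&#233;</word>").text
hexadecimal = ET.fromstring("<word>caf&#xE9;</word>").text

print(decimal == hexadecimal)
print(ord(decimal[-1]))
```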

Illustrations, tables and other special elements

XML provides a number of techniques for handling non-standard document elements. Where the coding scheme of an element of the file such as an illustration differs from that used for normal text the contents of the element can be treated as an entity with a special notation, e.g.:

<!ENTITY fig1 SYSTEM "http://www.myco.com/book12/figures/fig1" NDATA GIF >

Alternatively details of the relevant notation can be defined as an attribute of an element, e.g.:

<!ATTLIST graphic
          source  %URL;    #REQUIRED
          type    NOTATION (GIF|PNG|JPEG) "JPEG" >

To identify where the figure is to be positioned in the text you would either enter an entity reference such as &fig1; or an empty element such as:

<graphic source="http://www.myco.com/figures/fig1.gif" type="GIF"/>

In both these situations a notation declaration is required to tell the program what to do with the unparsed data that is contained in the referenced file. Typically this takes the form of a call to a program module, e.g.:

<!NOTATION GIF SYSTEM "c:\windows\system\gif.dll" >

Where text, such as computer code, has been created in a form designed to be output on a line-by-line basis, with the original line breaks retained, it can be flagged as a special type of parsed character data by declaring a special reserved attribute, xml:space, for the element:

<!ELEMENT code (#PCDATA) >
<!ATTLIST code xml:space (default|preserve) #FIXED "preserve" >

where preserve means preserve the line breaks rather than use the default of replacing line breaks by spaces before justifying the contents of the element.
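The parser itself passes all character data, including line breaks, through to the application; xml:space simply tells the application what to do with it. The sketch below illustrates one way an application might honour the attribute. It assumes Python's xml.etree.ElementTree, which reports the attribute under the reserved xml namespace name, and for simplicity it writes the attribute in the start-tag rather than relying on the declared default. The helper name layout_lines is hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the reserved xml: prefix as this namespace name.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"

def layout_lines(element):
    """Keep line breaks if xml:space is 'preserve', else collapse spaces."""
    text = element.text or ""
    if element.get(XML_NS + "space") == "preserve":
        return text.split("\n")
    return [" ".join(text.split())]

preserved = ET.fromstring('<code xml:space="preserve">line one\n  line two</code>')
collapsed = ET.fromstring("<para>line one\n  line two</para>")

print(layout_lines(preserved))
print(layout_lines(collapsed))
```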

Using XML coded text

An XML file normally consists of three types of markup, the first two of which are optional:

An XML processing instruction identifying the version of XML being used, the way in which it is encoded, and whether it references other files or not, e.g.:
```
<?xml version="1.0" encoding="UCS2" standalone="yes"?>
```
A document type declaration that either contains the formal markup declarations in its internal subset (between square brackets) or references a file containing the relevant markup declarations (the external subset), e.g.:
```
<!DOCTYPE memo SYSTEM "http://www.myco.com/dtds/memo.dtd">
```
A fully-tagged document instance which consists of a root element, whose element type name must match that assigned as the document type name in the document type declaration, within which all other markup is nested.

If all three components are present, and the document instance conforms to the rules defined in the document type definition, the document is said to be valid. If only the last component is present, and no formal model is present, all the XML processor can do is to check that the document instance is well-formed, i.e. that each element is properly nested within its parent elements, and that each attribute is specified as an attribute name followed by a value indicator (=) and a quoted string.
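A well-formedness check of this kind can be sketched with a non-validating parser. The example below assumes Python's xml.etree.ElementTree, which raises ParseError for input that is not well-formed; the function name is_well_formed is invented for the example. It says nothing about validity, since no DTD is consulted.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True

print(is_well_formed("<memo><to>All staff</to></memo>"))   # properly nested
print(is_well_formed("<memo><to>All staff</memo></to>"))   # overlapping tags
print(is_well_formed("<memo><subject form=bold/></memo>")) # unquoted value
```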

XML-coded files are, by their nature, ideal for storing in databases. Because XML files are both object-orientated and hierarchical in nature they can be adapted to virtually any type of database, though care sometimes needs to be taken to ensure that enough structural data is retained in the database to reconstruct the original file. A standardized interface to XML data is defined through W3C's Document Object Model (DOM), which provides a CORBA IDL interface between applications exchanging XML data.
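As an indication of what access through a DOM-style interface looks like, the sketch below uses xml.dom.minidom, the DOM implementation that ships with Python. This is an assumption made for illustration: it is a language binding of the DOM interfaces, not the CORBA interface mentioned above, and the cut-down memo is made up.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<memo><to>All staff</to><from>Martin Bryan</from></memo>"
)

# The document element is the root of the tree of nodes.
root = doc.documentElement

# Elements are located by name; their character data is held in text nodes.
recipient = doc.getElementsByTagName("to")[0].firstChild.data
sender = doc.getElementsByTagName("from")[0].firstChild.data

print(root.tagName, recipient, sender)
```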

Data stored using non-XML notations will need appropriate application software to process it, but the XML-coded file will correctly identify where each piece of such data belongs in the completed document and where it has been stored prior to use.

By storing data in the clearly defined format provided by XML you can ensure that your data will be transferable to a wide range of hardware and software environments. New techniques in programming and processing data will not affect the logical structure of your document's message. If more detail needs to be added to the file all you need to do is to update the model and then add new markup tags where required in the document instance. If a completely new style is required then the existing document model can be linked to the new one to provide automatic updating of document structures.

Webmaster: mtbryan@sgml.u-net.com