[Archive copy mirrored from: http://www.textuality.com/xml/typing.html, May 21, 1997]

Adding Strong Data Typing to SGML and XML

Tim Bray
May 21, 1997
This draft is intended for public discussion.

Table of Contents

1. Introduction
2. Related Standards
    2.1 XML, SGML, and HTML
    2.2 SQL Data Typing
    2.3 ISO Date/Time Representations
3. Associating Types with XML Elements
    3.1 Primary Type: The XML-SQLTYPE Attribute
    3.2 Value Ranges: the XML-SQLMIN and XML-SQLMAX Attributes
    3.3 #PCDATA Only
4. Type Semantics
    4.1 The Meaning of Type Declarations
    4.2 Counting Characters
    4.3 XML-SQLTYPE="CHAR"
    4.4 XML-SQLTYPE="VARCHAR"
    4.5 XML-SQLTYPE="INTEGER"
    4.6 XML-SQLTYPE="DECIMAL"
    4.7 XML-SQLTYPE="FLOAT"
    4.8 XML-SQLTYPE="DATE"
    4.9 XML-SQLTYPE="TIME"
    4.10 XML-SQLTYPE="TIMESTAMP"
5. Examples

1. Introduction

SGML and XML (referred to jointly as "XML" from here on) provide facilities for declaring document structures. However, they offer very limited support for data typing as a database person would understand it. This deficiency will become more serious as XML is increasingly used for electronic data interchange and database-related applications.

This note proposes a mechanism for attaching strong type declarations to XML elements using reserved attributes. While this is similar to HyTime's "architectural form" mechanism, this note neither assumes an understanding of that mechanism nor discusses it.

2. Related Standards

2.1 XML, SGML, and HTML

This specification is designed for use with XML documents as described in the document Extensible Markup Language. The special terms element, attribute, and character data have the meanings defined in that document, and the syntax and usage of the element and attribute declarations are those described in that document.

However, the mechanisms described here may be applied to SGML documents (given appropriate declarations) and perhaps to HTML documents.

2.2 SQL Data Typing

SQL, as defined in International Standard ISO/IEC 9075:1992, is a language designed for use in defining and accessing structured data repositories. It includes a comprehensive selection of data types: see 6.1 <data type>. This selection has been proven effective in practice.

This note provides XML mechanisms for declaring elements to be one of a subset of these SQL types, and for restricting the range of allowed values for numeric types.

2.3 ISO Date/Time Representations

The ISO standard 8601:1988, which supersedes ISO standards 2014, 2015, 2711, 3307 and 4031, describes numerical date and time interchange formats. Dates are those used in the Gregorian calendar.

Each datum is fixed in size (leading zeroes used as necessary) and presented in decreasing order of significance (year, month, day, hour, minute, second, fractional second). All characters are taken from the ASCII repertoire.

In this specification, only the complete date and time values described in ISO 8601:1988 are used, with none of the truncation, reduced-precision, or current-century types of omission.

3. Associating Types with XML Elements

3.1 Primary Type: The XML-SQLTYPE Attribute

SQL data types may be associated with XML elements by means of the reserved attribute XML-SQLTYPE, whose value is the name of a SQL data type. The declaration for this attribute is:

<!ELEMENT AnyElement (#PCDATA)>
<!ATTLIST AnyElement
          XML-SQLTYPE ( CHAR|VARCHAR
                       |INTEGER|DECIMAL|FLOAT
                       |DATE|TIME|TIMESTAMP)  #IMPLIED >

In some cases, a single attribute is not sufficient to provide all the required typing constraints. When this is the case, other attributes may be used to control quantities such as the size and scale of the data item. These attributes, all of whose names begin with XML-SQL, are described in the sections below that discuss the details associated with each possible value of XML-SQLTYPE.

In the declaration above, XML-SQLTYPE is #IMPLIED; in practice, one would expect this to be given a #FIXED default in the DTD, so that all instances of some element would have the same type. When the XML-SQLTYPE attribute is not provided for some element, this simply means that no assertion is made concerning the data type of that element.

3.2 Value Ranges: the XML-SQLMIN and XML-SQLMAX Attributes

Elements for which the type is constrained with XML-SQLTYPE may have ranges of validity declared using the attributes XML-SQLMIN and XML-SQLMAX. These have no defaults of any kind; if not provided, no range constraint is placed on the content.

<!ATTLIST AnyElement XML-SQLMIN CDATA #IMPLIED
                     XML-SQLMAX CDATA #IMPLIED >

In all cases, the values of XML-SQLMIN and XML-SQLMAX must meet the constraints expressed by XML-SQLTYPE and any other parameterizing attributes.

For numeric, date, and time data types, the ordering is unambiguous and the interpretation of XML-SQLMIN and XML-SQLMAX is obvious. For the CHAR and VARCHAR data types, the lexical ordering of strings is often implementation dependent. While the ordering of strings made up of characters from the ASCII and ISO-Latin character sets is well-understood, this is not the case with Unicode characters representing the glyphs of many Asian languages.

Confusion is also possible because the direction in which characters are visually rendered into strings varies from language to language (Arabic runs right-to-left) and even within languages (Chinese may be rendered validly in many different directions).

To avoid ambiguity, for XML-SQLMIN and XML-SQLMAX range checking of CHAR and VARCHAR elements, lexical comparison of strings must always be done using the numeric values of the Unicode encoding of the characters in the string, in increasing order of the address at which they are stored.
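As a rough illustration of this rule, the following Python sketch checks a CHAR or VARCHAR value against XML-SQLMIN and XML-SQLMAX. It relies on the fact that Python compares str values code point by code point, in storage order. The function name in_char_range is hypothetical and not part of this note, and treating the bounds as inclusive is an assumption.

```python
def in_char_range(value, sql_min=None, sql_max=None):
    """Return True if value lies within the optional bounds.

    Python orders str objects by the numeric value of each code point,
    taken in storage order, with no locale or rendering direction involved.
    Bounds are treated as inclusive (an assumption; the note does not say).
    """
    if sql_min is not None and value < sql_min:
        return False
    if sql_max is not None and value > sql_max:
        return False
    return True


print(in_char_range("B", "A", "F"))  # a seat letter inside A..F
print(in_char_range("G", "A", "F"))  # a seat letter outside A..F
```

Because the comparison is purely numeric, an accented letter such as U+00E9 sorts after "z" (U+007A), regardless of how a particular language would collate it.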

3.3 #PCDATA Only

The attributes described in this note may only be attached to elements with #PCDATA content; that is, those which have no child elements.

4. Type Semantics

In this section, the term content refers to the character data contained in an element.

4.1 The Meaning of Type Declarations

A type declaration of the form described in this note asserts that the content of some element should meet the constraints (described herein) expressed by that declaration.

4.2 Counting Characters

Several of the type declarations constrain the allowed length of the content. In this case, the length is in characters, and should be evaluated after all entity and character references have been processed; i.e. the count applies to the content as received by an application, not as encoded in the containing entity.
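The counting rule can be sketched in Python using the standard xml.etree.ElementTree module, which hands the application text in which the predefined entities and character references have already been replaced. The helper name content_length is hypothetical.

```python
import xml.etree.ElementTree as ET


def content_length(element_source):
    """Parse a single element and count the characters an application receives."""
    element = ET.fromstring(element_source)
    # element.text is None for an empty element, so fall back to "".
    return len(element.text or "")


# "&amp;" and "&#233;" each contribute one character once processed,
# even though they occupy several characters in the containing entity.
sample = "<datum>A&amp;B&#233;</datum>"
print(content_length(sample))
```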

4.3 XML-SQLTYPE="CHAR"

This section and each one following begin with an example of the declaration and use of an SQL-typed element.

<!ELEMENT CHAR-datum (#PCDATA)>
<!ATTLIST CHAR-datum XML-SQLTYPE CDATA #FIXED "CHAR"
                     XML-SQLSIZE CDATA #REQUIRED >
...
<CHAR-datum XML-SQLSIZE="6">Hello!</CHAR-datum>

The content is fixed in length; the length, in characters, is given by XML-SQLSIZE.

4.4 XML-SQLTYPE="VARCHAR"

<!ELEMENT VARCHAR-datum (#PCDATA)>
<!ATTLIST VARCHAR-datum XML-SQLTYPE CDATA #FIXED "VARCHAR"
                        XML-SQLSIZE CDATA #REQUIRED >
...
<VARCHAR-datum XML-SQLSIZE="144">Hello!</VARCHAR-datum>

The content is variable in length up to a fixed maximum, given in characters by XML-SQLSIZE.
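The length rules for CHAR (section 4.3) and VARCHAR might be checked as in the Python sketch below. The function names are hypothetical, and reading "fixed in length" as "exactly XML-SQLSIZE characters, with no padding applied" is an assumption. The content is taken to be the text as received by the application, per section 4.2.

```python
def check_char(content, size):
    """CHAR: the content must be exactly `size` characters long."""
    return len(content) == int(size)


def check_varchar(content, size):
    """VARCHAR: the content may be any length up to `size` characters."""
    return len(content) <= int(size)


print(check_char("Hello!", "6"))       # the CHAR-datum example
print(check_varchar("Hello!", "144"))  # the VARCHAR-datum example
```

The size is accepted as a string because attribute values declared as CDATA reach the application as text.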

4.5 XML-SQLTYPE="INTEGER"

<!ELEMENT INTEGER-datum (#PCDATA)>
<!ATTLIST INTEGER-datum XML-SQLTYPE CDATA #FIXED "INTEGER" >
...
<INTEGER-datum>153125</INTEGER-datum>

The content represents a decimal integer number.
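A minimal Python sketch of an INTEGER check, combined with the optional XML-SQLMIN and XML-SQLMAX range of section 3.2, follows. The function name parse_integer is hypothetical. Allowing an optional leading sign, rejecting surrounding white space, and treating the bounds as inclusive are all assumptions that this note does not settle.

```python
import re

# Optional sign followed by one or more ASCII digits (an assumed lexical form).
INTEGER_PATTERN = re.compile(r"[+-]?[0-9]+")


def parse_integer(content, sql_min=None, sql_max=None):
    """Return the integer value, or None if the content fails the checks."""
    if INTEGER_PATTERN.fullmatch(content) is None:
        return None
    value = int(content)
    # The bounds arrive as attribute text and must themselves be integers.
    if sql_min is not None and value < int(sql_min):
        return None
    if sql_max is not None and value > int(sql_max):
        return None
    return value


print(parse_integer("153125"))
print(parse_integer("37", sql_min="1", sql_max="36"))
```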

4.6 XML-SQLTYPE="DECIMAL"

<!ELEMENT DECIMAL-datum (#PCDATA)>
<!ATTLIST DECIMAL-datum XML-SQLTYPE  CDATA #FIXED "DECIMAL"
                        XML-SQLSCALE CDATA #REQUIRED >
...
<DECIMAL-datum XML-SQLSCALE='2'>42.00</DECIMAL-datum>

The content represents a fixed-point decimal number; the number of digits after the decimal point is fixed and is given by XML-SQLSCALE.
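One way to check the scale is sketched below in Python, using the standard decimal module so that no precision is lost. The function name parse_decimal is hypothetical. Allowing an optional sign and omitting the decimal point when the scale is zero are assumptions.

```python
import re
from decimal import Decimal


def parse_decimal(content, scale):
    """Return a Decimal if content has exactly `scale` fractional digits, else None."""
    digits = int(scale)
    if digits > 0:
        # For scale 2 this becomes: optional sign, digits, ".", exactly 2 digits.
        pattern = r"[+-]?[0-9]+\.[0-9]{%d}" % digits
    else:
        pattern = r"[+-]?[0-9]+"
    if re.fullmatch(pattern, content) is None:
        return None
    return Decimal(content)


print(parse_decimal("42.00", "2"))  # the DECIMAL-datum example
print(parse_decimal("42.0", "2"))   # wrong number of fractional digits
```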

4.7 XML-SQLTYPE="FLOAT"

<!ELEMENT FLOAT-datum (#PCDATA)>
<!ATTLIST FLOAT-datum XML-SQLTYPE CDATA #FIXED "FLOAT" >
...
<FLOAT-datum>6.02e23</FLOAT-datum>

The content represents a floating-point number.
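A floating-point check might look like the following Python sketch. This note does not define the lexical form of a FLOAT, so the pattern below (optional sign, decimal digits, optional exponent) is an assumption, as is the function name parse_float. The explicit pattern is used because Python's float() alone would also accept forms such as "nan" and "inf".

```python
import re

# Assumed lexical form: sign, digits with optional fraction, optional exponent.
FLOAT_PATTERN = re.compile(
    r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)([eE][+-]?[0-9]+)?"
)


def parse_float(content):
    """Return the value as a Python float, or None if the form is not accepted."""
    if FLOAT_PATTERN.fullmatch(content) is None:
        return None
    return float(content)


print(parse_float("6.02e23"))  # the FLOAT-datum example
print(parse_float("nan"))      # rejected by the assumed pattern
```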

4.8 XML-SQLTYPE="DATE"

<!ELEMENT DATE-datum (#PCDATA)>
<!ATTLIST DATE-datum XML-SQLTYPE CDATA #FIXED "DATE" >
...
<DATE-datum>1997-05-21</DATE-datum>

The content represents a date, provided in the order Year, Month, Day.
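The complete ISO 8601 calendar date form used here (section 2.3) can be checked as in this Python sketch. The function name parse_date is hypothetical. The pattern enforces the fixed field widths with leading zeroes, and datetime.date, which implements the Gregorian calendar extended in both directions, rejects impossible dates. Requiring the hyphen separators shown in the example is an assumption.

```python
import datetime
import re

# Four-digit year, two-digit month, two-digit day, separated by hyphens.
DATE_PATTERN = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")


def parse_date(content):
    """Return a datetime.date for a complete YYYY-MM-DD value, else None."""
    match = DATE_PATTERN.fullmatch(content)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return datetime.date(year, month, day)
    except ValueError:
        # Raised for values such as month 13 or February 30.
        return None


print(parse_date("1997-05-21"))  # the DATE-datum example
print(parse_date("1997-5-21"))   # missing leading zero
```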

4.9 XML-SQLTYPE="TIME"

<!ELEMENT TIME-datum (#PCDATA)>
<!ATTLIST TIME-datum XML-SQLTYPE CDATA    #FIXED "TIME"
                     XML-SQLTZ   (YES|NO) "NO" >
...
<TIME-datum XML-SQLTZ="YES">12:25:03.45+1</TIME-datum>

The content represents a time of day.

4.10 XML-SQLTYPE="TIMESTAMP"

<!ELEMENT TIMESTAMP-datum (#PCDATA)>
<!ATTLIST TIMESTAMP-datum XML-SQLTYPE CDATA    #FIXED "TIMESTAMP"
                          XML-SQLTZ   (YES|NO) "NO" >
...
<TIMESTAMP-datum>2000-01-01T00:00:00</TIMESTAMP-datum>

The content represents a timestamp, including both date and time.

5. Examples

For a bank loan: balance, interest rate, and maturity date:

<!ELEMENT BALANCE  (#PCDATA) >
<!ATTLIST BALANCE  XML-SQLTYPE  CDATA #FIXED "DECIMAL"
                   XML-SQLSCALE CDATA #FIXED "2" 
                   XML-SQLMIN   CDATA #FIXED "0.00" >
<!ELEMENT INTEREST (#PCDATA)>
<!ATTLIST INTEREST XML-SQLTYPE CDATA #FIXED "FLOAT" 
                   XML-SQLMIN  CDATA #FIXED "0.0" >
<!ELEMENT MATURITY (#PCDATA)>
<!ATTLIST MATURITY XML-SQLTYPE CDATA #FIXED "DATE" >
...
<BALANCE>44378.06</BALANCE>
<INTEREST>9.125</INTEREST>
<MATURITY>2005-01-31</MATURITY>

For an airline departure: passenger name, seat number, and departure time:

<!ELEMENT LAST-NAME (#PCDATA)>
<!ATTLIST LAST-NAME XML-SQLTYPE CDATA #FIXED "VARCHAR"
                    XML-SQLSIZE CDATA #FIXED "20" >
<!ELEMENT FIRST-INITIAL (#PCDATA)>
<!ATTLIST FIRST-INITIAL XML-SQLTYPE CDATA #FIXED "CHAR"
                        XML-SQLSIZE CDATA #FIXED "1" >
<!ELEMENT SEAT-ROW (#PCDATA)>
<!ATTLIST SEAT-ROW XML-SQLTYPE CDATA #FIXED "INTEGER"
                   XML-SQLMIN  CDATA #FIXED "1"
                   XML-SQLMAX  CDATA #FIXED "36" >
<!ELEMENT SEAT-LETTER (#PCDATA)>
<!ATTLIST SEAT-LETTER XML-SQLTYPE CDATA #FIXED "CHAR"
                      XML-SQLSIZE CDATA #FIXED "1" 
                      XML-SQLMIN  CDATA #FIXED "A"
                      XML-SQLMAX  CDATA #FIXED "F" >
<!ELEMENT DEPARTURE (#PCDATA)>
<!ATTLIST DEPARTURE XML-SQLTYPE CDATA #FIXED "TIMESTAMP" 
                    XML-SQLTZ   (YES|NO) #FIXED "YES" >
...
<LAST-NAME>Bray</LAST-NAME><FIRST-INITIAL>T</FIRST-INITIAL>
<SEAT-ROW>36</SEAT-ROW><SEAT-LETTER>B</SEAT-LETTER>
<DEPARTURE>1997-05-24T07:55:00+1</DEPARTURE>