XML Schema Part 2: Datatypes

This document specifies a language for defining datatypes to be used in XML Schemas and possibly elsewhere.

This is a W3C Working Draft for review by members of the W3C and other interested parties in the general public.

It has been reviewed by the XML Schema Working Group and the Working Group has agreed to its publication. Note that not all sections of the draft represent the current consensus of the WG. Different sections of the specification may well command different levels of consensus in the WG. Public comments on this draft will be instrumental in the WG's deliberations.

The facilities described herein are in a preliminary state of design. The Working Group anticipates substantial changes, both in the mechanisms described herein, and in additional functions yet to be described. The present version should not be implemented except as a check on the design and to allow experimentation with alternative designs. The Schema WG will not allow early implementation to constrain its ability to make changes to this specification prior to final release.

A list of current W3C working drafts can be found at http://www.w3.org/TR. They may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use W3C Working Drafts as reference material or to cite them as other than "work in progress".

1. Introduction

1.1 Purpose

The [XML] specification defines limited facilities for applying datatypes to document content in that documents may contain or refer to DTDs that assign types to elements and attributes. However, document authors, including authors of traditional documents and those transporting data in XML, often require a high degree of type checking to ensure robustness in document understanding and data interchange.

The table below offers two typical examples of XML instances in which datatypes are implicit: the instance on the left represents a billing invoice, the instance on the right a memo or perhaps an email message in XML.

The invoice contains several dates and telephone numbers, the postal abbreviation for a state (which comes from an enumerated list of sanctioned values), and a ZIP code (which takes a definable regular form). The memo contains many of the same types of information: a date, telephone number, email address and an "importance" value (which undoubtedly comes from an enumerated list, such as "low", "medium" or "high"). Applications which process invoices and memos need to raise exceptions if something that was supposed to be a date or telephone number did not conform to the rules for valid dates or telephone numbers.

In both cases, validity constraints exist on the content of the instances that are not expressible in XML DTDs. The limited datatyping facilities in XML have prevented validating XML processors from supplying the rigorous type checking required in these situations. The result has been that individual applications writers have had to implement type checking in an ad hoc manner. This specification addresses the need of both document authors and applications writers for a robust, extensible datatype system for XML which could be incorporated into XML processors. As discussed below, these datatypes could be used in other XML-related standards as well.

1.2 Requirements

The [XML Schema Requirements] document spells out concrete requirements to be fulfilled by this specification, which state that the XML Schema Language must:

1.3 Scope

This portion of the XML Schema Language discusses datatypes that can be used in an XML Schema. These datatypes can be specified for element content that would be specified as #PCDATA and attribute values of various types in a DTD. It is the intention of this specification that it be usable outside of the context of XML Schemas for a wide range of other XML-related activities such as [XSL] and [RDF Schema].

For the most part, this specification discusses what are sometimes referred to as scalar datatypes in that they constrain the lexical representation of a single literal. In some cases, as for example in IDREFS (§3.3.5), ENTITIES (§3.3.7) and NMTOKENS (§3.2.2), the value may consist of a list or set of literals separated by spaces. This is an example of what is called an aggregate datatype. Future versions of this specification will contain a more general mechanism for aggregate (collection) datatypes such as sets, bags and records.

1.4 Terminology

The terminology used to describe XML Schema Datatypes is defined in the body of this specification. The terms defined in the following list are used in building those definitions and in describing the actions of a datatype processor:

2. Type System

This section describes the conceptual framework behind the type system defined in this specification. The framework has been influenced by the [ISO 11404] standard on language-independent datatypes as well as the datatypes for [SQL] and for programming languages such as Java.

The datatypes discussed in this specification are computer representations of well known abstract concepts such as integer and date. It is not the place of this specification to define these concepts; many other publications provide excellent definitions.

Two concepts are essential for an understanding of datatypes as they are discussed here: a value space is an abstract collection of permitted values for a datatype. Each datatype also has a space of valid lexical representations or literals, each of which denotes a single value. A single value in the value space may be denoted by several distinct valid literals.

2.1 Datatype

[Definition:] In this specification, a datatype is defined as a 3-tuple, consisting of a) a set of distinct values, called its value space, b) a set of lexical representations, called its lexical space, and c) a set of facets that characterize properties of the value space, the lexical space or of individual values or lexical items.

2.2 Value space

A value space is a set of permitted values for a datatype. Value spaces have certain properties. For example, they always have the property of cardinality and some definition of equality and may have the concept of order by which individual values within the value space can be compared to one another.

[Definition:] A value space is the set of permitted values for a given datatype. The value space of a given datatype can be defined in one of the following ways:

In addition to the value space, each datatype has a space of valid lexical representations or literals. A single value in the value space may be denoted by several valid literals. For example, "100" and "1.0E2" are two different representations for the same value. Depending on the situation, either or both of these representations might be acceptable. The type system defined in this specification provides a mechanism for schema designers to control both the set of values and the set of acceptable lexical representations of those values for a datatype.
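As a small illustration of the distinction between a value and the literals that denote it, the following Python sketch maps the two literals mentioned above to values and compares them. The use of Python's decimal module as a stand-in for an abstract numeric value space is an assumption made for this example only; it is not part of this specification.

```python
from decimal import Decimal

# Two distinct literals (members of the lexical space).
literal_a = "100"
literal_b = "1.0E2"

# Map each literal to the value it denotes. Decimal is used here only as a
# convenient stand-in for an abstract numeric value space.
value_a = Decimal(literal_a)
value_b = Decimal(literal_b)

same_literal = literal_a == literal_b  # False: the literals differ as strings
same_value = value_a == value_b        # True: both denote the same value
```

A schema designer who wants to accept only one of the two forms would constrain the lexical space; one who wants to accept only some numbers would constrain the value space.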

2.3 Lexical Space

In addition to its value space, each datatype also has a lexical representation space. [Definition:] The lexical space consists of a set of valid literals for a datatype. Each value in the datatype's value space is denoted by one or more literals in its lexical space. Each Primitive datatypes (§3.2) definition includes a detailed description of the default lexical space.

2.4 Facets

[Definition:] A facet is a single defining aspect of a concept or an object. Generally speaking, each facet characterizes a concept or object along independent aspects or dimensions.

The facets of a datatype serve to distinguish those aspects of one datatype which differ from other datatypes. Rather than being defined solely in terms of a prose description, the datatypes in this specification are defined in terms of the synthesis of facet values which together determine the value space and properties of the datatype.

Facets are of two types: fundamental facets that define the datatype and non-fundamental or constraining facets that constrain the permitted values of a datatype.

2.4.1 Fundamental facets

Datatypes are characterized by properties of their value spaces. These optional properties are discussed in this section. Each of these properties gives rise to a facet that serves to characterize the datatype.

2.4.1.1 Order

[Definition:] A value space, and hence a datatype, is said to be ordered if there exists an order relation defined for that value space. Order relations have the following rules:

There may exist several possible order relations for a given value space. Additionally, there may exist multiple datatypes with the same value space. In such cases, each datatype will define a different order relation on the value space.

2.4.1.2 Bounds

Some ordered value spaces, and hence some datatypes, are said to be bounded. [Definition:] A value space is bounded above if there exists a unique value U in the value space such that, for all values v in the value space, v ≤ U. The value U is said to be an upper bound of the value space. [Definition:] A value space is bounded below if there exists a unique value L in the space such that, for all values v in the value space, L ≤ v. The value L is then said to be a lower bound of the value space.

[Definition:] A datatype is bounded if its value space has both an upper and a lower bound.

2.4.1.3 Cardinality

[Definition:] Every value space has associated with it the concept of cardinality. Some value spaces are finite, some are countably infinite, while still others are uncountably infinite. A datatype is said to have the cardinality of its value space. It is sometimes useful to categorize value spaces (and hence, datatypes) as to their cardinality. There are three significant cases:

Every conceptually finite value space is necessarily exact. No computational datatype is uncountably infinite.

2.4.1.4 Exact and Approximate

The computational representation of a datatype may limit the degree to which values of the datatype can be distinguished. If every value in the value space of the conceptual datatype is distinguishable in the computational representation from every other value in the value space, then the datatype is said to be exact.

Certain mathematical datatypes with very large or infinite value spaces have representations which are said to be approximate in that multiple values in the conceptual value space map to single values in the value space of the representation. In this specification, all approximate datatypes have computational models which specify, via parametric values, a degree of approximation, that is, they require a certain minimum set of values of the mathematical datatype to be distinguishable in the computational datatype. Further, each value in the conceptual value space must be capable of being represented in the representational value space within a certain distance, i.e. the difference between the conceptual value and the representational value must not exceed some agreed upon value.
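The difference between an exact and an approximate representation can be sketched in Python as follows. The choice of Fraction as the exact model, of the binary float as the approximate model, and of the bound 10^-17 as the "agreed upon value" are all assumptions made for this illustration.

```python
from fractions import Fraction

# Two distinct conceptual values, written as exact decimal literals.
literal_a = "0.1"
literal_b = "0.1000000000000000055511151231257827"

# Exact model: distinct conceptual values remain distinguishable.
exact_equal = Fraction(literal_a) == Fraction(literal_b)

# Approximate model: both conceptual values map to one binary float.
approx_equal = float(literal_a) == float(literal_b)

# The representational value lies within a fixed distance of the
# conceptual value (the bound used here is illustrative only).
error = abs(Fraction(float(literal_a)) - Fraction(literal_a))
within_bound = error <= Fraction(1, 10**17)
```

In the exact model the two values compare as different; in the approximate model they collapse onto the same representational value, while the error of that representation stays below the chosen bound.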

2.4.1.5 Numeric

A datatype is said to be numeric if its values are conceptually quantities (in some mathematical number system). A datatype whose values do not have this property is said to be non-numeric.

2.4.2 Constraining or Non-fundamental facets

Constraining facets are optional properties that can be applied to a datatype to (further) constrain its value space. Constraining the value space consequently constrains the allowed lexical representations. Adding constraining facets to a Base type (§2.5.2.1) is used in Defining Generated Datatypes (§4).

2.4.2.1 length

[Definition:] For the string (§3.2.3) datatype, length specifies the number of allowable characters in the string. For the binary (§3.2.9) datatype it specifies the length in bits. The value of the length facet must be a positive integer.

2.4.2.2 maximum length

[Definition:] The maxlength facet indicates the maximum length, in characters, of a string (§3.2.3) datatype for which the length (§2.4.2.1) facet is not specified. For the binary (§3.2.9) datatype it specifies the maximum length in bits if the length (§2.4.2.1) facet is not specified. The value of the maximum length facet must be a positive integer.
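A minimal sketch of how a processor might apply the length and maxlength facets to a string value is given below. The function name and its keyword parameters are hypothetical, and counting characters with Python's len (which counts code points) is an assumption about how "characters" are measured.

```python
def check_string_length(value, length=None, maxlength=None):
    """Return True if `value` satisfies the given length facets.

    `length` fixes the exact number of characters; `maxlength` gives an
    upper bound and is only meaningful when `length` is not specified.
    """
    if length is not None and maxlength is not None:
        raise ValueError("length and maxlength cannot both be specified")
    for facet in (length, maxlength):
        if facet is not None and (not isinstance(facet, int) or facet <= 0):
            raise ValueError("facet values must be positive integers")

    count = len(value)  # number of characters (code points) in the string
    if length is not None:
        return count == length
    if maxlength is not None:
        return count <= maxlength
    return True  # no length facet present: no restriction
```

For instance, a two-letter state abbreviation could be checked with check_string_length("CA", length=2).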

2.4.2.3 lexical representation

The datatypes defined in this specification are defined in terms of abstract value spaces and their properties as opposed to how values are lexically represented in XML instances. However, the lexical representation of values is of prime importance in many applications. Because of this importance, each Primitive datatypes (§3.2) definition includes a detailed description of its default Lexical Space (§2.3). [Definition:] The lexical representation facet can be used to constrain the allowable representations, or literals, for values of a datatype. The meaning of the lexical representation facet depends on the datatype to which it is applied.

2.4.2.4 enumeration

[Definition:] Presence of an enumeration facet constrains the value space of the datatype to the specified list. The enumeration facet can be applied to any datatype. No order or any other relationship is implied between the elements of the enumeration list.

2.4.2.5 minAbsoluteValue

[Definition:] The minAbsoluteValue facet specifies the minimum absolute value of the value space for generated datatypes whose basetype is real (§3.2.5). This facet (together with maxAbsoluteValue (§2.4.2.6)) can be used to generate subtypes of real (§3.2.5) which correspond to common floating point representations.

2.4.2.6 maxAbsoluteValue

[Definition:] The maxAbsoluteValue facet specifies the maximum absolute value of the value space for generated datatypes whose basetype is real (§3.2.5). This facet (together with minAbsoluteValue (§2.4.2.5)) can be used to generate subtypes of real (§3.2.5) which correspond to common floating point representations.

2.4.2.7 maxInclusive

[Definition:] The maxInclusive facet determines the upper bound of the value space for a datatype with the Order (§2.4.1.1) property. The maximum value specified with this facet is inclusive in the sense that the value specified for the facet is itself included in the value space for the datatype.

2.4.2.8 maxExclusive

[Definition:] The maxExclusive facet determines the upper bound of the value space for a datatype with the Order (§2.4.1.1) property. The maximum value specified with this facet is exclusive in the sense that the value specified for the facet is itself excluded from the value space for the datatype.

2.4.2.9 minInclusive

[Definition:] The minInclusive facet determines the lower bound of the value space for a datatype with the Order (§2.4.1.1) property. The minimum value specified with this facet is inclusive in the sense that the value specified for the facet is itself included in the value space for the datatype.

2.4.2.10 minExclusive

[Definition:] The minExclusive facet determines the lower bound of the value space for a datatype with the Order (§2.4.1.1) property. The minimum value specified with this facet is exclusive in the sense that the value specified for the facet is itself excluded from the value space for the datatype.
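The four bounding facets can be illustrated with a short Python sketch. The function and parameter names are hypothetical, and the values are assumed to belong to an ordered datatype for which Python's comparison operators implement the order relation.

```python
def within_bounds(value, min_inclusive=None, min_exclusive=None,
                  max_inclusive=None, max_exclusive=None):
    """Return True if `value` satisfies every bounding facet that is set."""
    if min_inclusive is not None and value < min_inclusive:
        return False  # below an inclusive lower bound
    if min_exclusive is not None and value <= min_exclusive:
        return False  # at or below an exclusive lower bound
    if max_inclusive is not None and value > max_inclusive:
        return False  # above an inclusive upper bound
    if max_exclusive is not None and value >= max_exclusive:
        return False  # at or above an exclusive upper bound
    return True
```

With min_inclusive=1 and max_exclusive=13, the values 1 through 12 are accepted while 0 and 13 are rejected, which shows the difference between an inclusive and an exclusive bound.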

2.4.2.11 precision

[Definition:] The precision facet, which only applies to the decimal (§3.3.9) datatype, refers to the total number of decimal digits in the number. Its value must be a positive integer.

2.4.2.12 scale

[Definition:] The scale facet, which only applies to the decimal (§3.3.9) datatype, refers to the total number of decimal digits to the right of the decimal point. Its value must be a positive number less than or equal to the precision.
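One possible reading of the precision and scale facets is sketched below in Python. The function name is hypothetical. Treating both facets as upper limits, ignoring leading zeroes, and counting trailing zeroes that are written in the literal are assumptions of this sketch, not statements of this specification.

```python
from decimal import Decimal

def fits_precision_scale(literal, precision, scale):
    """Check a decimal literal against precision and scale limits."""
    _sign, digits, exponent = Decimal(literal).as_tuple()
    if not isinstance(exponent, int):
        raise ValueError("not a finite decimal literal: %r" % literal)

    # Digits to the right of the decimal point, as written in the literal.
    fraction_digits = max(0, -exponent)
    # Digits to the left of the decimal point (leading zeroes are dropped).
    integer_digits = max(0, len(digits) + exponent)

    return (fraction_digits <= scale
            and integer_digits + fraction_digits <= precision)
```

Under these assumptions the literal 12678967.543233 has fourteen digits in total, six of them to the right of the decimal point, so it fits a precision of 14 and a scale of 6.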

2.4.2.13 encoding

2.4.2.14 period

2.5 Datatype dichotomies

It is useful to categorize the datatypes defined in this specification along various dimensions, forming a set of characterization dichotomies.

2.5.1 Atomic vs. aggregate datatypes

The first distinction to be made is that between atomic and aggregate datatypes.

For example, a date that is represented as a single character string could be the value of an atomic date datatype; while another date represented as separate "month", "day" and "year" elements would be the value of an aggregate date datatype. Not surprisingly, the distinction is analogous to that between an XML element whose content model is #PCDATA and one with element content.

As discussed above, this specification focuses mainly on atomic datatypes. Later versions will address aggregate datatypes in more detail. Note that the legacy XML attribute types IDREFS (§3.3.5), ENTITIES (§3.3.7) and NMTOKENS (§3.2.2) can be thought of as aggregate (list) types.

A datatype which is atomic in this specification need not be an "atomic" datatype in any programming language used to implement this specification.

2.5.2 Primitive vs. generated datatypes

The datatypes defined by this specification fall into both the primitive and the generated categories. It is felt that a judiciously chosen set of primitive datatypes will serve the widest possible audience by providing a set of convenient datatypes that can be used as is, as well as providing a rich enough base from which the variety of datatypes needed by schema designers can be generated.

A datatype which is primitive in this specification need not be a "primitive" datatype in any programming language used to implement this specification.

2.5.2.1 Base type

[Definition:] Every generated datatype is defined in terms of an existing datatype, referred to as the base type. Base types may be either primitive or generated.

2.5.3 Built-in vs. user-generated datatypes

Conceptually there is no difference between the built-in generated datatypes included in this specification and the user-generated datatypes which will be created by individual schema designers. The built-in generated datatypes are those which are believed to be so common that if they were not defined in this specification many schema designers would end up "reinventing" them. Furthermore, including these generated datatypes in this specification serves to demonstrate the mechanics and utility of the datatype generation facilities of this specification.

A datatype which is built-in in this specification need not be a "built-in" datatype in any programming language used to implement this specification.

3. Built-in datatypes

3.1 Namespace considerations

The built-in datatypes defined by this specification are designed so that systems other than the XML Schema Definition Language may access them. To facilitate such usage, the built-in datatypes in this specification come from the XML Datatype Language namespace, the specific namespace defined by this specification. This applies to both built-in primitive and built-in generated datatypes.

Each user-generated datatype is also associated with a unique namespace. However, user-generated datatypes do not come from the XML Datatype Language namespace; rather, they come from the namespace of the schema in which they are defined. Note that associating a namespace with a user-generated datatype is not a general purpose extensibility mechanism and does not apply to primitive datatypes. Suppose a schema author wanted to introduce a new set of primitive datatypes, say a core set of mathematical datatypes not based on the Number datatype defined as a built-in primitive by this specification. Such a schema author might try to define those datatypes, associate a unique namespace with them and expect schema processors to understand them. Unfortunately, such a scenario would not work. Each such datatype would need specialized validation code and there are still many unresolved issues regarding standard mechanisms for sharing such code.

As described in more detail in Defining Generated Datatypes (§4), each user-generated datatype must be defined in terms of a base type included in this specification, by assigning facets which serve to constrain the value set of the user-generated datatype to a subset of the base type. Such a mechanism works because all schema processors are required to be able to validate datatypes defined by subsetting the value space of a datatype included in this specification.
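The idea of generating a datatype by subsetting the value space of a base type can be sketched in Python as follows. The helper name restrict and the representation of a datatype as a membership test are assumptions of this illustration; the concrete mechanism is the one described in Defining Generated Datatypes (§4).

```python
def restrict(base, *facets):
    """Return a membership test for a datatype generated from `base`.

    `base` is the membership test of the base type; each facet is a
    predicate that further constrains the value space.
    """
    def accepts(value):
        return base(value) and all(facet(value) for facet in facets)
    return accepts

def is_integer(value):
    # Stand-in membership test for an integer base type.
    return isinstance(value, int) and not isinstance(value, bool)

# Each generated datatype accepts a subset of what its base type accepts.
non_negative_integer = restrict(is_integer, lambda v: v >= 0)
positive_integer = restrict(non_negative_integer, lambda v: v >= 1)
```

Because every generated test first defers to its base type, a processor that can validate the base type can validate any datatype derived from it in this way.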

3.2 Primitive datatypes

The primitive datatypes are described below. For each primitive datatype we discuss the fundamental facets, if any, and the constraining facets, if any.

3.2.1 NMTOKEN

[Definition:] The NMTOKEN datatype represents the NMTOKEN attribute type from [XML]. The value space of NMTOKEN is the set of all tokens that match the Nmtoken production in [XML]. The lexical space of NMTOKEN is the set of all strings that match the Nmtoken production in [XML].

For compatibility (see Terminology (§1.4)) this datatype should be used only on attributes.

3.2.2 NMTOKENS

[Definition:] The NMTOKENS datatype represents the NMTOKENS attribute type from [XML]. It consists of a whitespace-separated list of NMTOKENs. The value space of NMTOKENS is the set of all tokens that match the Nmtokens production in [XML]. The lexical space of NMTOKENS is the set of all strings that match the Nmtokens production in [XML].

NMTOKENS has no fundamental or constraining facets. For compatibility (see Terminology (§1.4)) this datatype should be used only on attributes.

3.2.3 string

[Definition:] The string datatype represents character strings in XML. The value space of the string datatype is the set of finite sequences of UCS characters ([ISO 10646] and [Unicode]). A UCS character (or just character, for short) is an atomic unit of communication; it is not further specified except to note that every UCS character has a corresponding UCS code point, which is an integer.

3.2.3.1 Lexical Representation

The string datatype has an optional constraining facet called lexical representation (§2.4.2.3). The value of this facet is a regular expression. Regular expression constraints are discussed in Appendix Regular Expressions (§D). If this facet is not present, there is no restriction on the lexical representation.
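As an illustration of a regular expression acting as a lexical representation constraint, the Python sketch below checks the ZIP code mentioned in the introduction. The pattern (five digits with an optional four-digit extension) is an assumed form of a ZIP code, and the syntax is that of Python's re module, which need not coincide with the regular expression language of Appendix Regular Expressions (§D).

```python
import re

# Assumed lexical form of a ZIP code: five digits, optionally followed
# by a hyphen and four more digits.
ZIP_PATTERN = re.compile(r"[0-9]{5}(-[0-9]{4})?")

def matches_lexical_representation(literal, pattern=ZIP_PATTERN):
    """Return True if the whole literal matches the pattern."""
    return pattern.fullmatch(literal) is not None
```

The whole literal must match, so a string that merely contains five digits somewhere is rejected.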

3.2.3.2 Length

The string datatype has an optional constraining facet called length (§2.4.2.1). If length is specified we have a fixed length character string, where length is measured in the number of characters in the string. If length is not specified we have a variable length character string. The value of the length facet must be a positive integer.

3.2.3.3 Maximum Length

The string datatype has an optional constraining facet called maximum length (§2.4.2.2). If maxlength is specified for a variable length string it represents an upper bound of the length of the string. The value of the maxlength facet must be a positive integer. The length (§2.4.2.1) and maximum length (§2.4.2.2) facets cannot both be specified for the same datatype. The absolute maximum length of variable length character strings depends on the XML parser implementation.

3.2.3.4 Maximum and Minimum Values

Clearly, the effect of these constraining facets depends on the collating sequence used to define the Order (§2.4.1.1) property for strings.

3.2.4 boolean

[Definition:] The boolean datatype has the value space required to support the mathematical concept of binary-valued logic: {true, false}.

3.2.4.1 Lexical Representation

An instance of a datatype that is defined as boolean can have the following legal lexical values {true, false}. The lexical representation is fixed and cannot be changed. The lexical representation facet is not supported.

3.2.5 real

[Definition:] The real datatype represents the standard mathematical concept of the real numbers.

3.2.5.1 Lexical representation

real values have a single standard lexical representation consisting of a mantissa followed, optionally, by the character "E" followed by an exponent. The exponent must be an integer. The mantissa must be a decimal number. The representations for exponent and mantissa must follow the default lexical rules for integer (§3.3.10) and decimal (§3.3.9) numbers. If the "E" and the following exponent are omitted, an exponent value of 0 is assumed. For example: -1E4, 1267.43233E12, 12.78E-2, 12.
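The lexical rule for real can be sketched as a small Python parser. The function name and the exact pattern are assumptions of this example; in particular, a literal without an "E" part is taken to denote its mantissa unchanged, so that "12" denotes twelve. Decimal is used only so that the result can be compared exactly.

```python
import re
from decimal import Decimal

# Mantissa: a decimal number; exponent: an integer introduced by "E".
_REAL = re.compile(r"([+-]?[0-9]+(?:\.[0-9]+)?)(?:E([+-]?[0-9]+))?")

def parse_real(literal):
    """Map a real literal to the value it denotes."""
    match = _REAL.fullmatch(literal)
    if match is None:
        raise ValueError("not a valid real literal: %r" % literal)
    mantissa = Decimal(match.group(1))
    # A missing exponent leaves the mantissa unscaled.
    exponent = int(match.group(2)) if match.group(2) is not None else 0
    return mantissa.scaleb(exponent)  # mantissa * 10 ** exponent
```

Applied to the examples above, parse_real("12.78E-2") yields the value 0.1278 and parse_real("-1E4") yields -10000.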

3.2.6 timeInstant

[Definition:] The timeInstant datatype represents a combination of date and time values representing a single instant of time, encoded as a single string. A single lexical representation, which is a subset of the lexical representations allowed by [ISO 8601], is allowed for timeInstant.

3.2.6.1 Lexical Representation

The lexical representation for timeInstant is the [ISO 8601] representation CCYYMMDDThhmmss.sss where "CC" represents the century, "YY" the year, "MM" the month and "DD" the day. The letter "T" is the date/time separator and "hh", "mm", "ss.sss" represent hour, minute and second respectively. Note that this representation allows for fractional seconds.

This representation can be immediately followed by a "Z" to indicate Coordinated Universal Time. To indicate the time zone, i.e. the difference between the local time and Coordinated Universal Time, the difference immediately follows the time and consists of a sign, + or -, followed by hhmm.

For example, to indicate 1:20 pm on May the 31st, 1999 for Eastern Standard Time, which is 5 hours behind Coordinated Universal Time, one would write: 19990531T132000-0500.
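A minimal Python sketch of a parser for this lexical representation follows. The function name is hypothetical, fractional seconds are truncated to microseconds because of the limits of Python's datetime type, and a literal without a time zone indicator is returned as a naive (zone-less) value; these are all choices of the sketch rather than requirements of this specification.

```python
import re
from datetime import datetime, timedelta, timezone

# CCYYMMDDThhmmss with optional fraction and optional "Z" or +hhmm / -hhmm.
_TIME_INSTANT = re.compile(
    r"([0-9]{4})([0-9]{2})([0-9]{2})T([0-9]{2})([0-9]{2})([0-9]{2})"
    r"(?:\.([0-9]+))?(Z|[+-][0-9]{4})?"
)

def parse_time_instant(literal):
    """Map a timeInstant literal to a datetime value."""
    match = _TIME_INSTANT.fullmatch(literal)
    if match is None:
        raise ValueError("not a valid timeInstant literal: %r" % literal)
    year, month, day, hour, minute, second = (
        int(group) for group in match.groups()[:6]
    )
    # Pad or truncate the fraction to exactly six digits (microseconds).
    fraction = match.group(7) or ""
    microsecond = int((fraction + "000000")[:6])

    zone = match.group(8)
    if zone is None:
        tzinfo = None
    elif zone == "Z":
        tzinfo = timezone.utc
    else:
        sign = -1 if zone[0] == "-" else 1
        offset = timedelta(hours=int(zone[1:3]), minutes=int(zone[3:5]))
        tzinfo = timezone(sign * offset)
    return datetime(year, month, day, hour, minute, second,
                    microsecond, tzinfo=tzinfo)
```

Two literals that differ only in how they express the time zone, such as 19990531T132000-0500 and 19990531T182000Z, parse to values that compare as equal, which is another instance of several literals denoting one value.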

3.2.7 timeDuration

[Definition:] The timeDuration datatype represents a combination of year, month, day and time values representing a single duration of time, encoded as a single string. A single lexical representation, which is a subset of the lexical representations allowed by [ISO 8601], is allowed for timeDuration.

3.2.7.1 Lexical Representation

The lexical representation for timeDuration is the [ISO 8601] representation CCYYMMDDThhmmss.sss, preceded by an optional sign (+ or -), where "CC" represents the number of centuries, "YY" the number of years, "MM" the number of months and "DD" the number of days. The letter "T" is the date/time separator and "hh", "mm", "ss.sss" represent number of hours, minutes and seconds respectively. Note that this representation allows for fractional seconds.

For example, to indicate a duration of 1 year, 2 months, 3 days, 10 hours, and 30 minutes, one would write: 00010203T103000.
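The decomposition of a timeDuration literal into its components can be sketched in Python as follows. The tuple type and function name are hypothetical, and the "CCYY" digits are read together as a single count of years (centuries times one hundred plus years), which is an assumption of this sketch.

```python
import re
from collections import namedtuple
from decimal import Decimal

TimeDuration = namedtuple(
    "TimeDuration", "sign years months days hours minutes seconds"
)

# Optional sign, then CCYYMMDDThhmmss with optional fractional seconds.
_DURATION = re.compile(
    r"([+-])?([0-9]{4})([0-9]{2})([0-9]{2})"
    r"T([0-9]{2})([0-9]{2})([0-9]{2}(?:\.[0-9]+)?)"
)

def parse_time_duration(literal):
    """Split a timeDuration literal into its numeric components."""
    match = _DURATION.fullmatch(literal)
    if match is None:
        raise ValueError("not a valid timeDuration literal: %r" % literal)
    sign = -1 if match.group(1) == "-" else 1
    return TimeDuration(
        sign,
        int(match.group(2)),      # CCYY read as one number of years
        int(match.group(3)),      # months
        int(match.group(4)),      # days
        int(match.group(5)),      # hours
        int(match.group(6)),      # minutes
        Decimal(match.group(7)),  # seconds, possibly fractional
    )
```

For the literal 00010203T103000 this yields 1 year, 2 months, 3 days, 10 hours, 30 minutes and 0 seconds with a positive sign.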

Time periods, i.e. specific durations of time, can be represented by supplying two items of information: a start instant and a duration or a start instant and an end instant or an end instant and a duration.

3.2.8 recurringInstant

[Definition:] The recurringInstant datatype represents an instant of time that recurs with a specific timeDuration (§3.2.7). Note that we do not attempt to support general recurring instants of time, just those that are needed to support the generated date (§3.3.15) and time (§3.3.16) datatypes and those that arise from truncated and reduced lexical representations of timeInstant (§3.2.6).

The period (§2.4.2.14) facet can be used to constrain the frequency of recurrence. Values of the period facet must be of type timeDuration (§3.2.7).

3.2.8.1 Lexical Representation

The lexical representation for recurringInstant is the left truncated [ISO 8601] representation for timeInstant (§3.2.6). For example, if the century "CC" is omitted from the timeInstant representation it means a timeInstant that recurs every hundred years. Similarly, if "CCYY" is omitted it designates a time instant that recurs every year.

Every two character "unit" of the representation that is omitted is indicated by a single hyphen "-". For example, to indicate 1:20 pm on May the 31st every year, one would write: --0531T132000-0500.
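The hyphen convention can be illustrated with the Python sketch below, which reports which two-character units were omitted from a left-truncated literal. The function name and the returned dictionary are hypothetical, and the sketch only handles truncation within the date part (at most the "CC", "YY" and "MM" units omitted); how truncation beyond the date part is written is not assumed here.

```python
import re

_DATE_UNITS = ("CC", "YY", "MM", "DD")

# Up to three leading hyphens, the remaining date digits, "T", the time,
# and an optional time zone indicator.
_RECURRING = re.compile(
    r"(-{0,3})([0-9]+)T([0-9]{6}(?:\.[0-9]+)?)(Z|[+-][0-9]{4})?"
)

def split_recurring_instant(literal):
    """Split a left-truncated literal into omitted and present units."""
    match = _RECURRING.fullmatch(literal)
    if match is None:
        raise ValueError("not a supported recurringInstant: %r" % literal)
    hyphens, date_digits, time_part, zone = match.groups()

    omitted = _DATE_UNITS[:len(hyphens)]   # one hyphen per omitted unit
    present = _DATE_UNITS[len(hyphens):]
    if len(date_digits) != 2 * len(present):
        raise ValueError("wrong number of date digits in %r" % literal)

    date = {unit: date_digits[2 * i:2 * i + 2]
            for i, unit in enumerate(present)}
    return {"omitted": omitted, "date": date,
            "time": time_part, "zone": zone}
```

For the literal --0531T132000-0500 the omitted units are "CC" and "YY", which corresponds to an instant that recurs every year.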

3.2.9 binary

[Definition:] The binary datatype represents strings (blobs) of binary data. It has three constraining facets. The optional length (§2.4.2.1) facet specifies the length of the data in bits. This defines a datatype with a fixed length. If the length is not specified, a datatype with variable length is specified. In this case, the optional maximum length (§2.4.2.2) facet specifies the maximum length of the data in bits. If the maximum length is not specified the default is unlimited length. The optional "encoding" facet specifies the encoding which may be "hex" for hexadecimal digits or "base64" for MIME style Base64 data.
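A sketch of how the encoding, length and maximum length facets of binary might interact is given below in Python. The function name and parameters are hypothetical, and the sketch assumes that decoded data always consists of whole octets, so that the length in bits is eight times the number of bytes.

```python
import base64

def decode_binary(literal, encoding, length=None, maxlength=None):
    """Decode a binary literal and check its length facets (in bits)."""
    if encoding == "hex":
        data = bytes.fromhex(literal)
    elif encoding == "base64":
        data = base64.b64decode(literal, validate=True)
    else:
        raise ValueError("unsupported encoding: %r" % encoding)

    bits = 8 * len(data)
    if length is not None and bits != length:
        raise ValueError("expected %d bits, got %d" % (length, bits))
    # maxlength is only consulted when length is not specified.
    if length is None and maxlength is not None and bits > maxlength:
        raise ValueError("more than %d bits" % maxlength)
    return data
```

The same five octets can be written as 48656c6c6f with the "hex" encoding or as SGVsbG8= with the "base64" encoding; both literals decode to the same value.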

3.2.10 uri

[Definition:] The uri datatype represents a Uniform Resource Identifier (URI) Reference as defined in [RFC 2396]. It has no fundamental or constraining facets.

3.2.11 language

[Definition:] The language datatype represents natural language identifiers as defined by [RFC 1766]. The value space of language is the set of all tokens that match the LanguageID production in [XML]. The lexical space of language is the set of all strings that match the LanguageID production in [XML].

3.3 Generated datatypes

3.3.1 Name

[Definition:] The Name datatype represents XML Names. The value space of this datatype is the set of all tokens which match the Name production of [XML]. The lexical space of this datatype is the set of all strings which match the Name production of [XML]. The basetype of Name is NMTOKEN (§3.2.1).

3.3.2 NCName

[Definition:] The NCName datatype represents XML "non-colonized" Names. The value space of this datatype is the set of all tokens which match the NCName production of [Namespaces in XML]. The lexical space of this datatype is the set of all strings which match the NCName production of [Namespaces in XML]. The basetype of NCName is Name (§3.3.1).

3.3.3 ID

[Definition:] The ID datatype represents the ID attribute type from [XML]. The value space of ID is the set of all tokens that match the Name production in [XML] and have been used in an XML document. The lexical space of ID is the set of all strings that match the Name production in [XML]. The basetype of ID is Name (§3.3.1).

ID has no fundamental or constraining facets. For compatibility (see Terminology (§1.4)) this datatype should be used only on attributes.

3.3.4 IDREF

[Definition:] The IDREF datatype represents the IDREF attribute type from [XML]. The value space of IDREF is the set of all tokens that match the Name production in [XML] and have been used in an XML document as the value of an element or attribute of type ID. The lexical space of IDREF is the set of all strings that match the Name production in [XML]. The basetype of IDREF is ID (§3.3.3).

IDREF has no fundamental or constraining facets. For compatibility (see Terminology (§1.4)) this datatype should be used only on attributes.

3.3.5 IDREFS

[Definition:] The IDREFS datatype represents the IDREFS attribute type from [XML]. It consists of a whitespace-separated list of IDREFs. The value space of IDREFS is the set of all tokens that match the Names production in [XML] and have been used in an XML document as the value of an element or attribute of type ID. The lexical space of IDREFS is the set of all strings that match the Names production in [XML].

IDREFS has no fundamental or constraining facets. For compatibility (see Terminology (§1.4)) this datatype should be used only on attributes.

3.3.6 ENTITY

[Definition:] The ENTITY datatype represents the ENTITY attribute type from [XML]. The value space of ENTITY is the set of all tokens that match the Name production in [XML] and have been declared as an Unparsed Entity in a schema. The lexical space of ENTITY is the set of all strings that match the Name production in [XML]. The basetype of ENTITY is Name (§3.3.1).

ENTITY has no fundamental or constraining facets. For compatibility (see Terminology (§1.4)) this datatype should be used only on attributes.

3.3.7 ENTITIES

[Definition:] The ENTITIES datatype represents the ENTITIES attribute type from [XML]. It consists of a whitespace-separated list of ENTITYs. The value space of ENTITIES is the set of all tokens that match the Names production in [XML] and have been declared as an Unparsed Entity in a schema. The lexical space of ENTITIES is the set of all strings that match the Names production in [XML].

ENTITIES has no fundamental or constraining facets. For compatibility (see Terminology (§1.4)) this datatype should be used only on attributes.

3.3.8 NOTATION

For compatibility (see Terminology (§1.4)) this datatype should be used only on attributes.

3.3.8.1 enumeration

3.3.9 decimal

[Definition:] The decimal datatype restricts allowable values to real numbers with an exact fractional part. The basetype of decimal is real (§3.2.5).

3.3.9.1 Lexical representation

Decimal values have a single standard lexical representation. This consists of a string of digits separated by a period as a decimal indicator, in accordance with the scale and precision facets, with an optional leading sign to indicate a negative number. If the sign is omitted, "+" is assumed. Leading and trailing zeroes are optional. For example: -1.23, 12678967.543233, 100000.00.
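The lexical rule above can be sketched as a small checker. This is only an illustration under stated assumptions: the function name and the pattern are hypothetical, the fractional part is treated as optional, and forms such as "5." or ".5" are rejected because the prose does not settle whether they are legal. The scale and precision facets are not checked.

```python
import re
from decimal import Decimal

# Assumed pattern: optional sign, digits, optional period followed by digits.
DECIMAL_LEXICAL = re.compile(r"[+-]?[0-9]+(\.[0-9]+)?")

def parse_decimal(literal):
    """Return the exact Decimal value of a literal, or None if it is malformed."""
    if DECIMAL_LEXICAL.fullmatch(literal) is None:
        return None
    # Decimal keeps the fractional part exact, unlike a binary float.
    return Decimal(literal)
```

With this sketch, 100000.00 and 100000 denote the same value, which is one way to read "leading and trailing zeroes are optional".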

3.3.10 integer

[Definition:] The integer datatype is the standard mathematical concept of the integer numbers. The basetype of integer is decimal (§3.3.9). The value space of the integer datatype is the infinite set {-∞,...,-2,-1,0,1,2,...,∞} although computer implementations restrict this to a finite set.

3.3.10.1 Lexical representation

Integer values have a single, standard lexical representation. This consists of a string of digits with an optional leading sign. If the sign is omitted, "+" is assumed. For example: -1, 0, 12678967543233, +100000.

3.3.11 non-negative-integer

[Definition:] The non-negative-integer datatype is the standard mathematical concept of the non-negative integers. The value space of the non-negative-integer datatype is the infinite set {0,1,2,...,∞} although computer implementations restrict this to a finite set. The basetype of non-negative-integer is integer (§3.3.10).

3.3.11.1 Lexical representation

Non-negative-integer values have a single, standard lexical representation. This consists of a string of digits with an optional leading "+" sign. If the sign is omitted, "+" is assumed. For example: 1, 0, 12678967543233, +100000.

3.3.12 positive-integer

[Definition:] The positive-integer datatype is the standard mathematical concept of the positive integers. The value space of the positive-integer datatype is the infinite set {1,2,...,∞} although computer implementations restrict this to a finite set. The basetype of positive-integer is non-negative-integer (§3.3.11).

3.3.12.1 Lexical representation

Positive-integer values have a single, standard lexical representation. This consists of a string of digits with an optional leading "+" sign. For example: 1, 12678967543233, +100000.

3.3.13 non-positive-integer

[Definition:] The non-positive-integer datatype is the standard mathematical concept of the non-positive integers. The value space of the non-positive-integer datatype is the infinite set {-∞,...,-2,-1,0} although computer implementations restrict this to a finite set. The basetype of non-positive-integer is integer (§3.3.10).

3.3.13.1 Lexical representation

Non-positive-integer values have a single, standard lexical representation. This consists of a string of digits with a leading "-" sign; as the examples show, zero is written without a sign. For example: -1, 0, -12678967543233, -100000.

3.3.14 negative-integer

[Definition:] The negative-integer datatype is the standard mathematical concept of the negative integers. The value space of the negative-integer datatype is the infinite set {-∞,...,-2,-1} although computer implementations restrict this to a finite set. The basetype of negative-integer is non-positive-integer (§3.3.13).

3.3.14.1 Lexical representation

Negative-integer values have a single, standard lexical representation. This consists of a string of digits with a leading "-" sign. For example: -1, -12678967543233, -100000.

3.3.15 date

3.3.15.1 Lexical Representation

The lexical representation for date is the reduced (right truncated) lexical representation for recurringInstant (§3.2.8): CCYYMMDD. For example, to indicate May the 31st, 1999, one would write: 19990531.

Left truncated representations can be used to represent recurring dates. If the CC is omitted it signifies a date that occurs every century. If the YY is omitted it signifies a date every year and so on. Every two character "unit" of the representation that is omitted is indicated by a single hyphen "-". For example, ---05 signifies the fifth day of every month.
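The left truncation rule can be illustrated with a short parser. This is a sketch, not a normative algorithm: the function name and the returned dictionary keys are hypothetical, right truncated forms are not handled, and month and day values are not range-checked.

```python
import re

UNITS = ("century", "year", "month", "day")

def parse_date(literal):
    """Split a CCYYMMDD literal, possibly left truncated, into its units.

    Omitted units are returned as None. Returns None for malformed input.
    """
    m = re.fullmatch(r"(-{0,3})([0-9]{2,8})", literal)
    if m is None:
        return None
    hyphens, digits = m.groups()
    omitted = len(hyphens)
    # Each hyphen stands for one omitted two-digit unit, so the digits
    # must fill exactly the remaining units.
    if len(digits) != 2 * (4 - omitted):
        return None
    present = [int(digits[i:i + 2]) for i in range(0, len(digits), 2)]
    return dict(zip(UNITS, [None] * omitted + present))
```

Under these assumptions, 19990531 yields all four units, while ---05 yields only the day, matching the reading "the fifth day of every month".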

3.3.16 time

[Definition:] The time datatype represents a recurring instant of time that recurs every day. The basetype of time is recurringInstant (§3.2.8). The time datatype can be considered to be a shorthand to designate a specific truncated representation for recurringInstant (§3.2.8). time is generated from recurringInstant (§3.2.8) by setting the value of the period facet equal to 24 hours.

3.3.16.1 Lexical Representation

The lexical representation for time is the left truncated lexical representation for timeInstant (§3.2.6): hhmmss.sss. For example, to indicate 1:20 pm for Eastern Standard Time, which is 5 hours behind Coordinated Universal Time, one would write: 132000-0500.
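To make the role of the zone offset concrete, the following sketch converts a time literal to a time of day in Coordinated Universal Time. The function name and pattern are hypothetical; it assumes the offset has the form of a sign followed by hhmm, that the fractional seconds are optional, and that a literal without an offset is returned unchanged. None of these points is fixed by the text above.

```python
import re
from datetime import timedelta

TIME_LEXICAL = re.compile(
    r"([0-9]{2})([0-9]{2})([0-9]{2})(\.[0-9]+)?"  # hhmmss with optional fraction
    r"(?:([+-])([0-9]{2})([0-9]{2}))?"            # optional zone offset, e.g. -0500
)

def time_to_utc(literal):
    """Return the time of day as a timedelta since midnight UTC, or None."""
    m = TIME_LEXICAL.fullmatch(literal)
    if m is None:
        return None
    hh, mm, ss, frac, sign, zh, zm = m.groups()
    local = timedelta(hours=int(hh), minutes=int(mm),
                      seconds=int(ss) + float(frac or 0))
    if sign is None:
        return local  # assumption: no offset means no conversion
    offset = timedelta(hours=int(zh), minutes=int(zm))
    # A zone behind UTC carries a "-" offset, so UTC is local time plus the offset.
    utc = local + offset if sign == "-" else local - offset
    # time recurs every day, so wrap the result into a single 24-hour period.
    return timedelta(seconds=utc.total_seconds() % 86400)
```

Applied to the example literal 132000-0500, the sketch yields 18:20 UTC.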

4. Defining Generated Datatypes

A generated datatype can be defined from a primitive datatype (or another generated datatype) by adding optional constraining facets. For example, it may be useful to define a datatype called i4 (signed 4-byte integer) from the built-in datatype integer by supplying maxInclusive and minInclusive facets. In this case, i4 is the name of the new user-generated datatype, integer is its base type and maxInclusive and minInclusive are the constraining facets.
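The i4 example can be modelled in a few lines. This is a sketch under assumptions: the Datatype class and its field names are hypothetical, only the minInclusive and maxInclusive facets are represented, and the signed 4-byte range is assumed to be the usual two's-complement range.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Datatype:
    """A generated datatype: a base type plus optional constraining facets."""
    name: str
    base: Optional["Datatype"] = None
    min_inclusive: Optional[int] = None
    max_inclusive: Optional[int] = None

    def accepts(self, value: int) -> bool:
        # A value must satisfy this type's facets and those of every base type.
        if self.min_inclusive is not None and value < self.min_inclusive:
            return False
        if self.max_inclusive is not None and value > self.max_inclusive:
            return False
        return self.base is None or self.base.accepts(value)

integer = Datatype("integer")
# Assumed two's-complement range for a signed 4-byte integer.
i4 = Datatype("i4", base=integer, min_inclusive=-2**31, max_inclusive=2**31 - 1)
```

Because accepts() also consults the base type, a type derived from i4 could only narrow the range further, never widen it.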

This section defines the abstract syntax used for defining generated datatypes. This abstract syntax is used for defining both Generated datatypes (§3.3) and user-generated datatypes; the only difference between the built-in and user-generated datatypes being that the datatype definitions for built-in generated datatypes are included in the Schema for Datatype Definitions (normative) (§A) while the datatype definitions for user-generated datatypes appear in schemas written by users.

[Definition:] An abstract syntax provides a formal specification of the information provided for each generated datatype definition. The abstract syntax is presented using a simplified BNF. Defined terms are to the left. Their components are to the right, with a small amount of meta-syntax: ()s for grouping, | to separate alternatives, ? for optionality, * and + for iteration.

[Definition:] The concrete syntax for generated datatype definitions is the exact element and attribute names used in definitions. The concrete syntax is a key feature of the proposed design: it is the form in which the schema language is used by datatype designers. Though its elements and attributes are often different from the terms of the abstract syntax BNF, the features and expressive power of the two are congruent.

The following is the definition for a possible built-in generated datatype "currency". This datatype definition would appear in the schema which defines datatypes for XML Schemas and shows that a generated datatype can have the same value space as its basetype, which might mean that it is just an "alias" or "renaming" of basetype. In this case, the specification would probably also define some "semantics" for currency which went beyond those of decimal.

The following is the definition of a user-generated datatype which could be used to represent monetary amounts, such as in a financial management application which generally does not have figures above $1M and allows only whole cents. This definition would appear in a schema authored by an "end-user" and shows how to define a datatype by specifying facet values which constrain the range of the basetype in a manner specific to the basetype (different from specifying max/min values as before).

The following example is a datatype definition for a user-generated datatype which limits the possible literal values of dates to the four US holidays enumerated. This datatype definition would appear in a schema authored by an "end-user" and shows how to define a datatype by enumerating the values in its value space. The enumerated values must be type-valid literals for the basetype.

5. Conformance

The XML specification [XML] defines two levels of conformance. Well-formed documents conform to XML syntax but may or may not obey the constraints defined by a DTD. Valid XML documents conform to the structure laid down in a DTD. Thus, if a DTD defines an attribute as an ID, instances of XML documents conforming to the DTD can only be valid if the values of such attributes are valid XML names and are unique in the document. By introducing additional datatypes to XML, this specification extends the notion of validity in the sense that values defined to have a certain datatype in the schema must conform to the lexical representations allowed for that datatype. Values that do not conform to the datatype defined for them in the schema raise a conformance error; an example is the appearance of a letter in a value defined as "integer". Similarly, for a value defined as string with length facet equal to 5, a value of "ABC" would raise an error -- length too short -- as would a value of "abcdefgh" -- length too long.

Since the datatypes discussed in this document can be used independently of XML Schema it is desirable that datatype conformance be specified as an independent, optional piece that other processors can use as they see fit. To this end, we define the datatypes processor as a separate abstract interface. Processors can call this interface with the value to be validated and the datatype, along with all its facets, that it should be validated against. The processor will return a boolean value which will be true or false depending on whether the value is valid for the datatype or not. If the value is valid it may also return a canonical representation for the value. If the value is invalid the processor will return error information including the facets that caused the value to be declared invalid.

User-generated datatypes are defined by giving values to certain optional facets. For example, an integer within a certain range could be defined by giving values to maxInclusive and minInclusive facets. A switch on the datatypes processor could be used to turn validation off for these facets. A processor that uses the datatypes processor could use this switch to eliminate validation of user-generated datatypes.

If a particular processor, for reasons of speed or size, decides not to validate datatypes, it can use a default stub interface, which always returns true.
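One possible shape for the datatypes processor interface, the facet switch, and the default stub is sketched below. All names here (Result, validate, stub_validate, check_facets) are hypothetical and not defined by this specification. Only the string datatype and three of its facets are covered, and the canonical representation is assumed to be the value itself.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

@dataclass
class Result:
    valid: bool
    canonical: Optional[str] = None                          # may be supplied when valid
    failed_facets: List[str] = field(default_factory=list)   # supplied when invalid

# Hypothetical facet checks for string values; unknown facet names are ignored.
FACET_CHECKS: Dict[str, Callable[[str, object], bool]] = {
    "length": lambda value, n: len(value) == n,
    "maxLength": lambda value, n: len(value) <= n,
    "enumeration": lambda value, literals: value in literals,
}

def validate(value: str, datatype: str, facets: Dict[str, object],
             check_facets: bool = True) -> Result:
    """Validate a value against a datatype and its facets."""
    if datatype != "string":
        raise NotImplementedError("only 'string' is sketched here")
    if not check_facets:
        return Result(valid=True)  # the switch that turns facet validation off
    failed = [name for name, arg in facets.items()
              if name in FACET_CHECKS and not FACET_CHECKS[name](value, arg)]
    if failed:
        return Result(valid=False, failed_facets=failed)
    return Result(valid=True, canonical=value)

def stub_validate(value: str, datatype: str, facets: Dict[str, object]) -> Result:
    """Default stub for processors that skip datatype validation: always valid."""
    return Result(valid=True)
```

With the length facet set to 5, validate("ABC", "string", {"length": 5}) reports the length facet as the cause of failure, while the stub accepts the same value.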

It also needs to be said that there are no expressions on datatypes; neither are there operations on datatypes.

If we decide to allow datatype specification or specialization in instance documents (see issue "definition-overriding" above) then validating XML processors should be able to validate the format of values in XML documents in these cases as well by using the datatypes processor.

A. Schema for Datatype Definitions (normative)

Ed. Note: This section (both its abstract content and its concrete wording) has not yet garnered consensus among WG members.

<?xml version='1.0'?>
<!-- $Id: Overview.html,v 1.4 1999/09/24 22:47:18 connolly Exp $ -->
<!DOCTYPE schema PUBLIC "-//W3C//DTD XMLSCHEMA 19990923//EN" "../structures/structures.dtd" >
<schema xmlns='http://www.w3.org/1999/09/23-xmlschema/' targetNS='http://www.w3.org/1999/09/23-xmlschema/datatypes/' version='0.4'>

  <element name='datatype'>
     <archetype order='all'>
        <element ref='basetype'/>
        <element archRef='maxBound' minOccurs='0'/>
        <element archRef='minBound' minOccurs='0'/>
        <element ref='minAbsoluteValue' minOccurs='0'/>
        <element ref='maxAbsoluteValue' minOccurs='0'/>
        <element ref='maxInclusive' minOccurs='0'/>
        <element ref='minInclusive' minOccurs='0'/>
        <element ref='precision' minOccurs='0'/>
        <element ref='scale' minOccurs='0'/>
        <element ref='length' minOccurs='0'/>
        <element ref='maxLength' minOccurs='0'/>
        <element ref='enumeration' minOccurs='0'/>
        <element ref='lexicalRepresentation' minOccurs='0'/>
        <element ref='encoding' minOccurs='0'/>
        <attribute name='name' type='NMTOKEN' minOccurs='1'/>
        <attribute name='export' type='boolean' default='true'/>
     </archetype>
  </element>

  <element name='basetype'>
     <archetype content='empty'>
        <attribute name='name' type='NMTOKEN' minOccurs='1'/>
        <attribute name='schemaAbbrev' type='NMTOKEN'/>
        <attribute name='schemaName' type='uri'/>
     </archetype>
  </element>

  <!-- these are here to bridge between the content model above
       and the elements below -->
  <archetype name='minBound'/>
  <archetype name='maxBound'/>

  <!-- these can only be applied when the base type is 'real'
       and must be used in concert with one another -->
  <element name='minAbsoluteValue' type='real'/>
  <element name='maxAbsoluteValue' type='real'/>
  
  <!-- the true datatype of the four following depends on the basetype -->
  <element name='maxExclusive' type='string'>
     <archetype>
       <refines name='maxBound'/>
     </archetype>
  </element>
  <element name='maxInclusive' type='string'>
     <archetype>
       <refines name='maxBound'/>
     </archetype>
  </element>
  <element name='minExclusive' type='string'>
     <archetype>
       <refines name='minBound'/>
     </archetype>
  </element>
  <element name='minInclusive' type='string'>
     <archetype>
       <refines name='minBound'/>
     </archetype>
  </element>

  <element name='precision' type='integer'/>
  <element name='scale' type='integer'/>

  <element name='length' type='integer'/>
  <element name='maxLength' type='integer'/>

  <!-- the following datatype is used to limit the
       possible values for the encoding facet on
       the binary datatype -->
  <datatype name='encodings'>
     <basetype name='NMTOKEN'/>
     <enumeration>
        <literal>hex</literal>
        <literal>base64</literal>
     </enumeration>
  </datatype>
  <element name='encoding' type='encodings'/>

  <element name='period' type='timeDuration'/>

  <element name='enumeration'>
    <archetype>
       <element ref='literal' minOccurs='1' maxOccurs='*'/>
    </archetype>
  </element>
  <!-- the true datatype of the following depends on the basetype -->
  <element name='literal' type='string'/>

  <element name='lexicalRepresentation'>
     <archetype>
        <element ref='lexical' minOccurs='1' maxOccurs='*'/>
     </archetype>
  </element>
    <!-- the true datatype of the following depends on the basetype -->
  <element name='lexical' type='string'/>

<!-- built-in generated datatypes -->
<!-- only has a few for now, eventually needs to have all of them -->

  <datatype name='integer'>
    <basetype name='decimal'/>
    <scale>0</scale>
  </datatype>
	
  <datatype name='non-negative-integer'>
    <basetype name='integer'/>
    <minInclusive>0</minInclusive>
  </datatype>

  <datatype name='positive-integer'>
    <basetype name='non-negative-integer'/>
    <minInclusive>1</minInclusive>
  </datatype>

  <datatype name='non-positive-integer'>
    <basetype name='integer'/>
    <maxInclusive>0</maxInclusive>
  </datatype>

  <datatype name='negative-integer'>
    <basetype name='non-positive-integer'/>
    <maxInclusive>-1</maxInclusive>
  </datatype>

  <datatype name='date'>
    <basetype name='recurringInstant'/>
    <period>000000T2400</period>
  </datatype>

  <datatype name='time'>
    <basetype name='recurringInstant'/>
    <period>000000T2400</period>
  </datatype>
</schema>

B. DTD for Datatype Definitions (normative)

Ed. Note: This section (both its abstract content and its concrete wording) has not yet garnered consensus among WG members.

<!-- Note that the expansion of 'facets' below is less
     restrictive than that imposed by the XML Schema schema for
     datatypes:  There should in fact be no more than one of each of
     minInclusive, minExclusive, maxInclusive, maxExclusive,
     precision, scale, lexicalRepresentation, enumeration,
     length, maxLength within datatype -->
<!ENTITY % minBound '(minInclusive | minExclusive)'>
<!ENTITY % maxBound '(maxInclusive | maxExclusive)'>
<!ENTITY % bounds '%minBound; | %maxBound;'>
<!ENTITY % numeric '(maxAbsoluteValue, minAbsoluteValue)? | precision | scale'>
<!ENTITY % ordered '%bounds; | %numeric;'>   
<!ENTITY % unordered
   'lexicalRepresentation | enumeration | length | maxLength | encoding'>   
<!ENTITY % facets '%ordered; | %unordered;'>
<!ELEMENT datatype (basetype, (%facets;)*)>   
<!ATTLIST datatype   
    name NMTOKEN #REQUIRED   
    export (true|false) 'true'>   
   
<!ELEMENT basetype EMPTY>   
<!ATTLIST basetype   
    name NMTOKEN #REQUIRED   
    schemaAbbrev NMTOKEN #IMPLIED   
    schemaName CDATA #IMPLIED>   

<!ELEMENT minAbsoluteValue (#PCDATA)>
<!ELEMENT maxAbsoluteValue (#PCDATA)>

<!ELEMENT maxExclusive (#PCDATA)>   
<!ELEMENT minExclusive (#PCDATA)>   
<!ELEMENT maxInclusive (#PCDATA)>   
<!ELEMENT minInclusive (#PCDATA)>   
   
<!ELEMENT precision (#PCDATA)>   
<!ELEMENT scale (#PCDATA)>   
   
<!ELEMENT length (#PCDATA)>   
<!ELEMENT maxLength (#PCDATA)>   
<!ELEMENT enumeration (literal)+>   
<!ELEMENT literal (#PCDATA)>   
<!ELEMENT lexicalRepresentation (lexical)+>   
<!ELEMENT lexical (#PCDATA)>
<!ELEMENT encoding (#PCDATA)>

C. Datatypes and Facets

Ed. Note: This section (both its abstract content and its concrete wording) has not yet garnered consensus among WG members.

The following table shows the values of the fundamental facets for each built-in datatype.

Ed. Note: (PVB 1999-07-09) Some entries in this table might conflict with what it says elsewhere in this draft, as creating this table pointed out to me some problems with the way some of the fundamental facets are defined (not to mention any transcription errors on my part in creating the table).
We obviously need more introductory text here explaining this table to the reader

	Datatype	Order (§2.4.1.1)	Bounds (§2.4.1.2)	Cardinality (§2.4.1.3)	Exact and Approximate (§2.4.1.4)	Numeric (§2.4.1.5)
Primitive	NMTOKEN (§3.2.1)	no	none	countably infinite	exact	no
	string (§3.2.3)	yes	none	countably infinite	exact	no
	boolean (§3.2.4)	no	none	finite	exact	no
	real (§3.2.5)	yes	none	uncountably infinite	approximate	yes
	timeInstant (§3.2.6)	yes	no	uncountably infinite	approximate	no
	timeDuration (§3.2.7)	yes	no	uncountably infinite	approximate	no
	recurringInstant (§3.2.8)	yes	no	uncountably infinite	approximate	no
	binary (§3.2.9)	no	no	?	?	no
	uri (§3.2.10)	no	no	uncountably infinite	exact	no
	language (§3.2.11)	no	no	countably infinite	exact	no

Generated	Name (§3.3.1)	no	no	countably infinite	exact	no
	NCName (§3.3.2)	no	no	countably infinite	exact	no
	ID (§3.3.3)	no	no	countably infinite	exact	no
	IDREF (§3.3.4)	no	no	countably infinite	exact	no
	IDREFS (§3.3.5)	no	no	countably infinite	exact	no
	ENTITY (§3.3.6)	no	no	countably infinite	exact	no
	ENTITIES (§3.3.7)	no	no	countably infinite	exact	no
	NMTOKENS (§3.2.2)	no	no	countably infinite	exact	no
	NOTATION (§3.3.8)	no	no	countably infinite	exact	no
	decimal (§3.3.9)	yes	no	countably infinite	exact	yes
	integer (§3.3.10)	yes	no	countably infinite	exact	yes
	non-negative-integer (§3.3.11)	yes	yes	countably infinite	exact	yes
	positive-integer (§3.3.12)	yes	yes	countably infinite	exact	yes
	non-positive-integer (§3.3.13)	yes	yes	countably infinite	exact	yes
	negative-integer (§3.3.14)	yes	yes	countably infinite	exact	yes
	date (§3.3.15)	yes	no	countably infinite	exact	no
	time (§3.3.16)	yes	no	uncountably infinite	approximate	no

The following table shows the constraining facets which apply to each built-in datatype.

Ed. Note: Some entries in this table might conflict with what it says elsewhere in this draft, as creating this table pointed out to me some problems with the way some of the constraining facets and datatypes are defined (not to mention any transcription errors on my part in creating the table).
We obviously need more introductory text here explaining this table to the reader (especially since this one table is broken into three pieces so that it will print nicely)

	Datatype	length (§2.4.2.1)	maximum length (§2.4.2.2)	lexical representation (§2.4.2.3)	enumeration (§2.4.2.4)
Primitive	NMTOKEN (§3.2.1)	?	?		X
	string (§3.2.3)	X	X	X	X
	boolean (§3.2.4)
	real (§3.2.5)				X
	timeInstant (§3.2.6)				X
	timeDuration (§3.2.7)				X
	recurringInstant (§3.2.8)				X
	binary (§3.2.9)	X	?	?
	uri (§3.2.10)				X
	language (§3.2.11)	?			X

Generated	Name (§3.3.1)	?	?		X
	NCName (§3.3.2)	?	?		X
	ID (§3.3.3)	?	?		X
	IDREF (§3.3.4)	?	?		X
	IDREFS (§3.3.5)	?	?		X
	ENTITY (§3.3.6)	?	?		X
	ENTITIES (§3.3.7)	?	?		X
	NMTOKENS (§3.2.2)	?	?		X
	NOTATION (§3.3.8)	?	?		X
	decimal (§3.3.9)				X
	integer (§3.3.10)				X
	non-negative-integer (§3.3.11)				X
	positive-integer (§3.3.12)				X
	non-positive-integer (§3.3.13)				X
	negative-integer (§3.3.14)				X
	date (§3.3.15)				X
	time (§3.3.16)				X

constraining facets table, cont.

	Datatype	maxInclusive (§2.4.2.7)	maxExclusive (§2.4.2.8)	minInclusive (§2.4.2.9)	minExclusive (§2.4.2.10)
Primitive	NMTOKEN (§3.2.1)
	string (§3.2.3)	X	X	X	X
	boolean (§3.2.4)
	real (§3.2.5)	X	X	X	X
	timeInstant (§3.2.6)	X	X	X	X
	timeDuration (§3.2.7)	X	X	X	X
	recurringInstant (§3.2.8)	X	X	X	X
	binary (§3.2.9)
	uri (§3.2.10)
	language (§3.2.11)

Generated	Name (§3.3.1)
	NCName (§3.3.2)
	ID (§3.3.3)
	IDREF (§3.3.4)
	IDREFS (§3.3.5)
	ENTITY (§3.3.6)
	ENTITIES (§3.3.7)
	NMTOKENS (§3.2.2)
	NOTATION (§3.3.8)
	decimal (§3.3.9)	X	X	X	X
	integer (§3.3.10)	X	X	X	X
	non-negative-integer (§3.3.11)	X	X	X	X
	positive-integer (§3.3.12)	X	X	X	X
	non-positive-integer (§3.3.13)	X	X	X	X
	negative-integer (§3.3.14)	X	X	X	X
	date (§3.3.15)	X	X	X	X
	time (§3.3.16)	X	X	X	X

constraining facets table, cont.

	Datatype	precision (§2.4.2.11)	scale (§2.4.2.12)	encoding (§2.4.2.13)	period (§2.4.2.14)
Primitive	NMTOKEN (§3.2.1)
	string (§3.2.3)
	boolean (§3.2.4)
	real (§3.2.5)	?	?
	timeInstant (§3.2.6)
	timeDuration (§3.2.7)
	recurringInstant (§3.2.8)				X
	binary (§3.2.9)			X
	uri (§3.2.10)
	language (§3.2.11)

Generated	Name (§3.3.1)
	NCName (§3.3.2)
	ID (§3.3.3)
	IDREF (§3.3.4)
	IDREFS (§3.3.5)
	ENTITY (§3.3.6)
	ENTITIES (§3.3.7)
	NMTOKENS (§3.2.2)
	NOTATION (§3.3.8)
	decimal (§3.3.9)	X	X
	integer (§3.3.10)	?	?
	non-negative-integer (§3.3.11)	?	?
	positive-integer (§3.3.12)	?	?
	non-positive-integer (§3.3.13)	?	?
	negative-integer (§3.3.14)	?	?
	date (§3.3.15)
	time (§3.3.16)

D. Regular Expressions

Ed. Note: This section (both its abstract content and its concrete wording) has not yet garnered consensus among WG members.

Ed. Note: The following description of regular expressions is copied (with slight modification) by permission from the documentation of the [Perl] programming language. This entire section should probably be replaced by something derived from the Unicode Regex TechReport [Unicode Regular Expression Guidelines] and the ECMAScript Regex proposal [ECMAScript Regex].

Issue (perl-regex): Should the final recommendation use Perl's regular expression "extensions"?

[Definition:] Regular expressions, similar to those in [Perl], can be used to constrain the format of strings. A regular expression is an alphanumeric string consisting of character symbols. Each symbol, which is usually one character but may be two characters, is a placeholder that stands for a set of characters.

Any single character matches itself, unless it is a metacharacter with a special meaning described here or above. You can cause characters that normally function as metacharacters to be interpreted literally by prefixing them with a "\" (e.g., "\." matches a ".", not any character; "\\" matches a "\"). A series of characters matches that series of characters in the target string, so the pattern blurfl would match "blurfl" in the target string.

You can specify a character class, by enclosing a list of characters in [], which will match any one character from the list. If the first character after the "[" is "^", the class matches any character not in the list. Within a list, the "-" character is used to specify a range, so that a-z represents all characters between "a" and "z", inclusive. If you want "-" itself to be a member of a class, put it at the start or end of the list, or escape it with a backslash. (The following all specify the same class of three characters: [-az], [az-], and [a\-z]. All are different from [a-z], which specifies a class containing twenty-six characters.)
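The character class rules can be checked mechanically. The sketch below uses Python's re module as a stand-in for the Perl-style syntax described here, which is an assumption: the final regular expression language for this specification is not settled. The helper name class_members is hypothetical.

```python
import re

def class_members(pattern, alphabet):
    """Return the characters of `alphabet` that a one-character class matches."""
    compiled = re.compile(pattern)
    return {ch for ch in alphabet if compiled.fullmatch(ch)}

ALPHABET = "abcdefghijklmnopqrstuvwxyz-"

# Three spellings of the same three-character class.
three = [class_members(p, ALPHABET) for p in (r"[-az]", r"[az-]", r"[a\-z]")]
# A range, and its negation within the test alphabet.
lowercase = class_members(r"[a-z]", ALPHABET)
negated = class_members(r"[^a-z]", ALPHABET)
```

Here all three entries of three contain exactly "-", "a" and "z", lowercase contains the twenty-six letters, and negated contains only "-".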

Certain characters are used as metacharacters. The following list contains all of the metacharacters and their meanings.

\: Quote the next metacharacter
^: Match the beginning of the line
.: Match any character (except newline)
$: Match the end of the line (or before newline at the end)
|: Alternation
(): Grouping
[]: Character class

Within a regular expression, the following standard quantifiers are recognized:

*: Match 0 or more times
+: Match 1 or more times
?: Match 1 or 0 times
{n}: Match exactly n times
{n,}: Match at least n times
{n,m}: Match at least n but not more than m times

The following character sequences also have special meaning within a regular expression.

\t: tab
\n: newline
\r: return
\033: octal char 033
\x1B: hex char 1B
\w: Match a "word" character (alphanumeric plus "_")
\W: Match a non-word character
\s: Match a whitespace character
\S: Match a non-whitespace character
\d: Match a digit character
\D: Match a non-digit character

Ed. Note: we should probably define XML-specific character sequences for things like Nmtoken, Name, etc., as well as ones for the character classes listed in XML 1.0 Appendix B. Character Classes

Regular expressions may also contain the following zero-width assertions:

\b: Match a word boundary
\B: Match a non-(word boundary)

A word boundary (\b) is defined as a spot between two characters that has a \w on one side of it and a \W on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a \W.
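The word boundary definition can be illustrated as follows, again using Python's re module as an assumed stand-in for the Perl-style syntax. The sample text and variable names are hypothetical.

```python
import re

text = "cat concat cat's category"

# \bcat\b: "cat" with a boundary on both sides. The apostrophe counts as \W.
whole_words = [m.start() for m in re.finditer(r"\bcat\b", text)]

# \Bcat: "cat" preceded by a word character, that is, not at a boundary.
inside_words = [m.start() for m in re.finditer(r"\Bcat", text)]
```

The first pattern finds the standalone "cat" at offset 0 and the one before the apostrophe at offset 11. The second finds only the "cat" inside "concat" at offset 7.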

Example

   
  555-1212     is matched by \d{3}-\d{4}           (phone number)   
  888-555-1212 is matched by (\d{3}-)?\d{3}-\d{4}  (phone number with optional area code)   
  $123,45.90   is matched by \$\d{3},\d{2}\.\d{2}   
  123-45-5678  is matched by \d{3}-?\d{2}-?\d{4}   (Social Security Number)
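The four examples above can be verified with a few lines of code. This sketch uses Python's re module, whose syntax agrees with the Perl-style constructs used in these patterns. The helper name matches is hypothetical, and the whole value is required to match, which is an assumption about how a pattern would be applied to a value.

```python
import re

EXAMPLES = [
    ("555-1212", r"\d{3}-\d{4}"),
    ("888-555-1212", r"(\d{3}-)?\d{3}-\d{4}"),
    ("$123,45.90", r"\$\d{3},\d{2}\.\d{2}"),
    ("123-45-5678", r"\d{3}-?\d{2}-?\d{4}"),
]

def matches(pattern, value):
    """True if the whole value matches the pattern (anchored at both ends)."""
    return re.fullmatch(pattern, value) is not None

results = [matches(pattern, value) for value, pattern in EXAMPLES]
```

Because the area code group is optional, the second pattern also accepts 555-1212, and because the hyphens in the last pattern are optional, it also accepts 123455678.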

E. References

Ed. Note: This section (both its abstract content and its concrete wording) has not yet garnered consensus among WG members.

E.1 Normative

ECMAScript Regex: ECMAScript v2 Draft. Regular Expressions. See http://www2.hursley.ibm.com/tc39/regexp30.pdf
ISO 10646: ISO (International Organization for Standardization). ISO/IEC 10646-1993 (E). Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane. [Geneva]: International Organization for Standardization, 1993 (plus amendments AM 1 through AM 7).
ISO 8601: Representations of dates and times. Available from http://www.iso.ch/markete/8601.pdf
A draft revision is also available from http://www.cl.cam.ac.uk/~mgk25/8601v04.pdf
Namespaces in XML: Namespaces in XML, Tim Bray et al. W3C, 1998 Available at: http://www.w3.org/TR/REC-xml-names/
Perl: The Perl Programming Language. See http://www.perl.org
RFC 1766: H. Alvestrand, ed. RFC 1766: Tags for the Identification of Languages 1995. Available at: http://www.ietf.org/rfc/rfc1766.txt
RFC 2396: Tim Berners-Lee, et al. RFC 2396: Uniform Resource Identifiers (URI): Generic Syntax. 1998. Available at: http://www.ietf.org/rfc/rfc2396.txt
SQL: SQL Standard. See http://www.jcc.com/SQLPages/jccs_sql.htm
Unicode: The Unicode Consortium. The Unicode Standard, Version 2.0. Reading, Mass.: Addison-Wesley Developers Press, 1996.
Unicode Regular Expression Guidelines: Mark Davis. Unicode Regular Expression Guidelines, 1998. Available at: http://www.unicode.org/unicode/reports/tr18/
XML: XML Standard. See http://www.w3.org/TR/REC-xml
XML Schema Part 1: Structures: XML Schema Part 1: Structures. Available at: http://www.w3.org/TR/xmlschema-1/
XML Schema Requirements: XML Schema Requirements. Available at: http://www.w3.org/TR/NOTE-xml-schema-req

E.2 Non-normative

ISO 11404: Language-independent Datatypes. Available from http://www.iso.ch/cate/d19346.html
RDF Schema: RDF Schema Specification. See http://www.w3.org/TR/PR-rdf-schema
XSL: XSL Working Draft. See http://www.w3.org/TR/WD-xsl/

F. Grateful Acknowledgments (non-normative)

The editors acknowledge the members of the W3C XML Schema Working Group, the members of other W3C Working Groups, and industry experts in other forums who have contributed directly or indirectly to the process or content of creating this document.

G. Open Issues

non-gregorian-dates
application-specific-binary-formats
binary-mime-type
binary-value-space
uri-scheme-facet
better-reference-mechanisms
definition-overriding
non-positive-integer-literal
perl-regex

H. Revisions from Previous Draft

19990521: PVB: corrected definition of length and maxlengths facet for strings to be in terms of characters not bytes
19990521: PVB: removed issue "other-date-representations". We don't want other separators, left mention of aggregate reps for dates as an ednote.
19990521: PVB: fixed "holidays" example, "-0101" ==> "==0101" (where == in the correction should be two hyphens, but that would not allow us to comment out this item)
19990521: PVB: fixed "common date" example, lexicalRepresenation ==> lexicalRepresentation
19990521: PVB: added note that -YY-MM-DD style dates are deprecated
19990521: PVB: added termdef element around definition of subtype
19990521: PVB: removed "whose basetype is a built-in" from definition of "user-generated" datatype
19990521: PVB: clarified that the length facet for binary datatype is length in bytes
19990521: PVB: fixed weird spacing problems introduced by ArborText
19990521: PVB: fixed references to non-terminals in productions
19990524: AM: changed "boolean" to have a single lexical representation.
19990524: AM: added issue: "should we add a facet to restrict a binary datatype to a user-defined format such as audio, image, etc."
19990524: AM: corrected reference to SQL standard.
19990524: AM: corrected definition of length and maximum length facets to be a positive integer.
19990524: AM: corrected default format for integer, decimal and real.
19990524: AM: rewrote issue definition-overriding.
19990524: AM: edited Conformance section to add example of lexical errors and fix reference to above issue.
19990601: PVB: changed date formats in examples of Section 1 to be conformant with the date datatype
19990601: PVB: added a "for compatibility" terminology entry
19990601: PVB: added a Name datatype and redefined the XML 1.0 attribute types in terms of it
19990601: PVB: remove "for attributes only" restriction on XML 1.0 attribute types. Added a "for compatibility" clause for attribute values.
19990601: PVB: added language datatype
19990602: PVB: added uuid datatype
19990602: PVB: added NCName datatype
19990604: AM: changed date and time formats to allow only ISO 8601 extended format. Impacted sections on the date, time datatypes, section 4, Appendix C.
19990604: AM: added ednote to string datatype saying we need to harmonize with I18N character model.
19990604: PVB: added "Revisions from previous draft" appendix
19990604: PVB: moved "built-in generated" datatype definitions into the schema for datatype definitions (instead of it being in its own appendix).
19990604: PVB: updated the schema for datatype definitions to point to the correct (per xmlschema-1) DTD and schema
19990623: AM: added paragraph to conformance section which begins to be more precise about how conforming processors should behave
19990623: AM: removed confusing statement from conformance section which said that " checking for lexical conformance is all that is expected of an XML processor."
19990623: PVB: removed section on "Characterizing Operations" and all references to it (or its content) in the rest of the draft.
19990623: PVB: removed uuid datatype
19990623: PVB: made NMTOKEN a primitive datatype and Name a subtype of NMTOKEN.
19990623: PVB: corrected the basetypes of the following XML-related generated datatypes: IDREFS (from ID to IDREF), ENTITY (from ID to Name), ENTITIES (from ID to ENTITY), NMTOKENS (from Name to NMTOKEN).
19990623: PVB: changed name of section "User-Generated Datatypes" to the more correct "Defining Generated Datatypes". Also added some explanatory text to the beginning of that section which explains that the abstract syntax there is used both for defining built-in and user-generated datatypes.
19990623: PVB: added explanations of abstract and concrete syntax (mostly borrowed from the structural draft) to section "Defining Generated Datatypes".
19990623: PVB: separated references into those that are normative and those that are non-normative
19990623: PVB: added a pointer to the draft revision of ISO 8601 in its bib entry
19990623: PVB: added "no-consensus" issues to all sections except "Type System" and "Built-in datatypes", stating that no WG consensus has been reached on the section (those two sections are excluded because they were granted consensus status at the Ann Arbor f2f)
19990623: PVB: cleaned up productions for numeric literals
19990624: PVB: excluded subsections 1.1 and 1.2 from the "no-consensus" issue for section 1
19990630: PVB: removed number datatype, made real into a built-in primitive, changed the basetype of decimal to real and the basetype of integer to decimal. Also added NaN, INF and -INF to the lexical space of all numeric types.
19990630: PVB: added 2 new subtypes of integer: non-positive-integer and non-negative-integer, each of which has 1 subtype: negative-integer and positive-integer, respectively. Added generated datatype definitions for these to the schema for datatypes.
19990630: PVB: fixed typos in definition of IDREF and IDREFS (was "the lexical space of ID is .." now "the lexical space of IDREF is ...")
19990630: PVB: added issue(non-negative-integer-literals)
19990630: PVB: added links to known subtypes in all datatype descriptions
19990630: PVB: changed "no-consensus" issues to "no-consensus" ednotes
19990630: PVB: changed "no-consensus" ednote for section 1 to exclude subsection 1.3, as voted on during the telcon today
19990630: PVB: corrected several internal cross-references: from termref's to specref's
19990630: PVB: added all previous drafts (internal as well as public WDs) to the "Previous Versions" section. In future public WDs only those "previous versions" which were public WDs will display
19990630: PVB: changed "collection" to "set" in definition of "value space" (thought this had been changed long ago, sorry)
19990708: PVB: removed section 1.5 "Organization", per WG vote on telcon
19990708: PVB: removed "no-consensus" ednote from section 1
19990709: PVB: added (stub) subsections on "Precision", "Scale" and "Encoding" to section 2.4.2 "Constraining Facets". All facets mentioned in all datatype definitions in section 3 should be listed in 2.4.2. (this is not intended to address the standing issue constraining-facet-definitions, but was needed for the next revision item)
19990709: PVB: added "Datatypes and Facets" appendix which consists of several tables which attempt to show which facets apply to which datatypes
19990713: PVB: fixed bug in schema for datatypes regarding modelGroup vs. elementType Refs in unordered modelGroup
19990726: AM: Changed example of user-generated datatype from heightInInches to i4.
19990726: AM: Rewrote "Exact and Approximate".
19990812: PVB: Removed all mention of picture constraints as lexical-representations for strings
19990819: AM: Amended Ed. Note on a URL for the datatypes namespace referring to Dan Connolly's note "make up your own".
19990819: AM: Removed issue on NULLS -- 2 occurrences.
19990819: AM: Changed Ed. Note on "Better Ref Mechs" associated with IDREFS to "issue".
19990819: AM: Removed issue on measurement units as WG decided to defer to version 2.
19990919: HT: modified abstract syntax to better reflect intent?
19990923: HT: modified schema for schemas to conform to the concrete syntax in the latest Structures draft
19990923: PVB: added minAbsoluteValue and maxAbsoluteValue facets to real; their intent is to allow generation of subtypes of real whose value spaces correspond to common floating-point representations. Added examples to section 4 to show how to generate IEEE 32-bit, etc.
19990923: PVB: replaced dateTime, date, time and timePeriod with all new date/time related types: timeInstant, timeDuration, recurringInstant, date and time. Additionally, limited the lexical representations of each of the new types to a single form (w/ the exception of still allowing both left truncation and reduced [i.e., right truncated] representations). Changed all examples which used date/time to use the new lexical representations
19990923: PVB: modified the abstract syntax, schema for datatypes and DTD for datatypes to bring them in line with above changes.
19990924: HST: link housekeeping before publication

XML Schema Part 2: Datatypes

W3C Working Draft 24 September 1999

Abstract

Status of this document

Table of contents

Appendices

1. Introduction

1.1 Purpose

1.2 Requirements

1.3 Scope

1.4 Terminology

2. Type System

2.1 Datatype

2.2 Value space

2.3 Lexical Space

2.4 Facets

2.4.1 Fundamental facets

2.4.1.1 Order

2.4.1.2 Bounds

2.4.1.3 Cardinality

2.4.1.4 Exact and Approximate

2.4.1.5 Numeric

2.4.2 Constraining or Non-fundamental facets

2.4.2.1 length

2.4.2.2 maximum length

2.4.2.3 lexical representation

2.4.2.4 enumeration

2.4.2.5 minAbsoluteValue

2.4.2.6 maxAbsoluteValue

2.4.2.7 maxInclusive

2.4.2.8 maxExclusive

2.4.2.9 minInclusive

2.4.2.10 minExclusive

2.4.2.11 precision

2.4.2.12 scale

2.4.2.13 encoding

2.4.2.14 period

2.5 Datatype dichotomies

2.5.1 Atomic vs. aggregate datatypes

2.5.2 Primitive vs. generated datatypes

2.5.2.1 Base type

2.5.3 Built-in vs. user-generated datatypes

3. Built-in datatypes

3.1 Namespace considerations

3.2 Primitive datatypes

3.2.1 NMTOKEN

3.2.2 NMTOKENS

3.2.3 string

3.2.3.1 Lexical Representation

3.2.3.2 Length

3.2.3.3 Maximum Length

3.2.3.4 Maximum and Minimum Values

3.2.4 boolean

3.2.4.1 Lexical Representation

3.2.5 real

3.2.5.1 Lexical representation

3.2.6 timeInstant

3.2.6.1 Lexical Representation

3.2.7 timeDuration

3.2.7.1 Lexical Representation

3.2.8 recurringInstant

3.2.8.1 Lexical Representation

3.2.9 binary

3.2.10 uri

3.2.11 language

3.3 Generated datatypes

3.3.1 Name

3.3.2 NCName

3.3.3 ID

3.3.4 IDREF

3.3.5 IDREFS

3.3.6 ENTITY

3.3.7 ENTITIES

3.3.8 NOTATION

3.3.8.1 enumeration

3.3.9 decimal

3.3.9.1 Lexical representation

3.3.10 integer

3.3.10.1 Lexical representation

3.3.11 non-negative-integer