Copyright © 1999 W3C (MIT, INRIA, Keio), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply.
This document specifies a language for defining datatypes to be used in XML Schemas and possibly elsewhere.
This is a W3C Working Draft for review by members of the W3C and other interested parties in the general public.
It has been reviewed by the XML Schema Working Group and the Working Group has agreed to its publication. Note that not all sections of the draft represent the current consensus of the WG. Different sections of the specification may well command different levels of consensus in the WG. Public comments on this draft will be instrumental in the WG's deliberations.
Please review and send comments to www-xml-schema-comments@w3.org (archive).
The facilities described herein are in a preliminary state of design. The Working Group anticipates substantial changes, both in the mechanisms described herein, and in additional functions yet to be described. The present version should not be implemented except as a check on the design and to allow experimentation with alternative designs. The Schema WG will not allow early implementation to constrain its ability to make changes to this specification prior to final release.
A list of current W3C working drafts can be found at http://www.w3.org/TR. They may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use W3C Working Drafts as reference material or to cite them as other than "work in progress".
Ed. Note: Several "note types" are used throughout this draft:
- issue [Issue (issue-name): ]: something on which the editors are seeking comment.
- editorial note [Ed. Note: ]: something the editors wish to call to the attention of the reader. To be removed prior to the recommendation becoming final.
- note [Note: ]: something the editors wish to call to the attention of the reader. To remain in the final recommendation.
The [XML] specification defines limited facilities for applying datatypes to document content in that documents may contain or refer to DTDs that assign types to elements and attributes. However, document authors, including authors of traditional documents and those transporting data in XML, often require a high degree of type checking to ensure robustness in document understanding and data interchange.
The table below offers two typical examples of XML instances in which datatypes are implicit: the instance on the left represents a billing invoice, the instance on the right a memo or perhaps an email message in XML.
Data oriented | Document oriented
---|---
[billing invoice instance] | [memo instance]
The invoice contains several dates and telephone numbers, the postal abbreviation for a state (which comes from an enumerated list of sanctioned values), and a ZIP code (which takes a definable regular form). The memo contains many of the same types of information: a date, telephone number, email address and an "importance" value (which undoubtedly comes from an enumerated list, such as "low", "medium" or "high"). Applications which process invoices and memos need to raise exceptions if something that was supposed to be a date or telephone number did not conform to the rules for valid dates or telephone numbers.
In both cases, validity constraints exist on the content of the instances that are not expressible in XML DTDs. The limited datatyping facilities in XML have prevented validating XML processors from supplying the rigorous type checking required in these situations. The result has been that individual applications writers have had to implement type checking in an ad hoc manner. This specification addresses the need of both document authors and applications writers for a robust, extensible datatype system for XML which could be incorporated into XML processors. As discussed below, these datatypes could be used in other XML-related standards as well.
The [XML Schema Requirements] document spells out concrete requirements to be fulfilled by this specification, which state that the XML Schema Language must:
This portion of the XML Schema Language discusses datatypes that can be used in an XML Schema. These datatypes can be specified for element content that would be specified as #PCDATA and attribute values of various types in a DTD. It is the intention of this specification that it be usable outside of the context of XML Schemas for a wide range of other XML-related activities such as [XSL] and [RDF Schema].
For the most part, this specification discusses what are sometimes referred to as scalar datatypes in that they constrain the lexical representation of a single literal. In some cases, as for example in IDREFS (§3.3.5), ENTITIES (§3.3.7) and NMTOKENS (§3.2.2), the value may consist of a list or set of literals separated by spaces. This is an example of what is called an aggregate datatype. Future versions of this specification will contain a more general mechanism for aggregate (collection) datatypes such as sets, bags and records.
The terminology used to describe XML Schema Datatypes is defined in the body of this specification. The terms defined in the following list are used in building those definitions and in describing the actions of a datatype processor:
This section describes the conceptual framework behind the type system defined in this specification. The framework has been influenced by the [ISO 11404] standard on language-independent datatypes as well as the datatypes for [SQL] and for programming languages such as Java.
The datatypes discussed in this specification are computer representations of well known abstract concepts such as integer and date. It is not the place of this specification to define these concepts; many other publications provide excellent definitions.
Two concepts are essential for an understanding of datatypes as they are discussed here: a value space is an abstract collection of permitted values for a datatype. Each datatype also has a space of valid lexical representations or literals, each of which denotes a single value. A single value in the value space may be denoted by several distinct valid literals.
[Definition:] In this specification, a datatype is defined as a 3-tuple, consisting of a) a set of distinct values, called its value space, b) a set of lexical representations, called its lexical space, and c) a set of facets that characterize properties of the value space, the lexical space or of individual values or lexical items.
A value space is a set of permitted values for a datatype. Value spaces have certain properties. For example, they always have the property of cardinality and some definition of equality and may have the concept of order by which individual values within the value space can be compared to one another.
[Definition:] A value space is the set of permitted values for a given datatype. The value space of a given datatype can be defined in one of the following ways:
In addition to the value space, each datatype has a space of valid lexical representations or literals. A single value in the value space may be denoted by several valid literals. For example, "100" and "1.0E2" are two different representations for the same value. Depending on the situation, either or both of these representations might be acceptable. The type system defined in this specification provides a mechanism for schema designers to control both the set of values and the set of acceptable lexical representations of those values for a datatype.
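The distinction between a value and its literals can be illustrated with a short Python sketch. It is not part of this specification: Python's decimal module merely stands in for a datatype processor, and the helper name same_value is hypothetical.

```python
from decimal import Decimal


def same_value(literal_a: str, literal_b: str) -> bool:
    """Return True when two literals denote the same value."""
    return Decimal(literal_a) == Decimal(literal_b)


# Two distinct literals (members of the lexical space) ...
print("100" == "1.0E2")            # False: the character strings differ
# ... that denote a single member of the value space.
print(same_value("100", "1.0E2"))  # True: both denote one hundred
```

A schema designer who wants only the first form to be acceptable would constrain the lexical space while leaving the value space unchanged.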
In addition to its value space, each datatype also has a lexical representation space. [Definition:] The lexical space consists of a set of valid literals for a datatype. Each value in the datatype's value space is denoted by one or more literals in its lexical space. Each Primitive datatypes (§3.2) definition includes a detailed description of the default lexical space.
[Definition:] A facet is a single defining aspect of a concept or an object. Generally speaking, each facet characterizes a concept or object along independent aspects or dimensions.
The facets of a datatype serve to distinguish those aspects of one datatype which differ from other datatypes. Rather than being defined solely in terms of a prose description the datatypes in this specification are defined in terms of the synthesis of facet values which together determine the value space and properties of the datatype.
Facets are of two types: fundamental facets that define the datatype and non-fundamental or constraining facets that constrain the permitted values of a datatype.
Datatypes are characterized by properties of their value spaces. These optional properties are discussed in this section. Each of these properties gives rise to a facet that serves to characterize the datatype.
[Definition:] A value space, and hence a datatype, is said to be ordered if there exists an order relation defined for that value space. Order relations have the following rules:
There may exist several possible order relations for a given value space. Additionally, there may exist multiple datatypes with the same value space. In such cases, each datatype will define a different order relation on the value space.
Ed. Note: Currently, no order relations are defined on the built-in datatypes provided by this specification; additionally, there is no means to specify an order relation on user-generated datatypes. This will be addressed in a future draft.
Some ordered value spaces, and hence some datatypes, are said to be bounded. [Definition:] A value space is bounded above if there exists a unique value U in the value space such that, for all values v in the value space, v ≤ U. The value U is said to be an upper bound of the value space. [Definition:] A value space is bounded below if there exists a unique value L in the space such that, for all values v in the value space, L ≤ v. The value L is then said to be a lower bound of the value space.
[Definition:] A datatype is bounded if its value space has both an upper and a lower bound.
[Definition:] Every value space has associated with it the concept of cardinality. Some value spaces are finite, some are countably infinite while still others are uncountably infinite. A datatype is said to have the cardinality of its value space. It is sometimes useful to categorize value spaces (and hence, datatypes) as to their cardinality. There are three significant cases:
Every conceptually finite value space is necessarily exact. No computational datatype is uncountably infinite.
Ed. Note: Currently, cardinality is not specified for the built-in datatypes provided by this specification; additionally, there is no means to specify a cardinality on user-generated datatypes. This will be addressed in a future draft.
The computational representation of a datatype may limit the degree to which values of the datatype can be distinguished. If every value in the value space of the conceptual datatype is distinguishable in the computational representation from every other value in the value space, then the datatype is said to be exact.
Certain mathematical datatypes with very large or infinite value spaces have representations which are said to be approximate in that multiple values in the conceptual value space map to single values in the value space of the representation. In this specification, all approximate datatypes have computational models which specify, via parametric values, a degree of approximation, that is, they require a certain minimum set of values of the mathematical datatype to be distinguishable in the computational datatype. Further, each value in the conceptual value space must be capable of being represented in the representational value space within a certain distance, i.e., the difference between the conceptual value and the representational value must not exceed some agreed upon value.
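As an illustration only, the following Python sketch uses IEEE 754 single precision as one possible approximate representation of the real numbers. The choice of single precision and the helper name to_float32 are assumptions of the example, not something this specification prescribes.

```python
import struct


def to_float32(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# Two distinct conceptual values (2**24 and 2**24 + 1) ...
a, b = 16777216.0, 16777217.0
# ... map to a single value in the representation.
print(to_float32(a) == to_float32(b))            # True

# Each conceptual value is still represented within a known distance:
# for values near 0.1 the rounding error is at most 2**-28.
print(abs(to_float32(0.1) - 0.1) <= 2.0 ** -28)  # True
```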
Ed. Note: Currently, exactness is not specified for the built-in datatypes provided by this specification; additionally, there is no means to specify an exactness for user-generated datatypes. This will be addressed in a future draft.
A datatype is said to be numeric if its values are conceptually quantities (in some mathematical number system). A datatype whose values do not have this property is said to be non-numeric.
Constraining facets are optional properties that can be applied to a datatype to (further) constrain its value space. Constraining the value space consequently constrains the allowed lexical representations. Adding constraining facets to a Base type (§2.5.2.1) is used in Defining Generated Datatypes (§4).
[Definition:] For the string (§3.2.3) datatype, length specifies the number of allowable characters in the string. For the binary (§3.2.9) datatype it specifies the length in bits. The value of the length facet must be a positive integer.
Ed. Note: We need to ultimately reconcile the notion of string length with the resolution of the i18n issues around character, indexing, etc. I18N recommends that length and maxLength be a "character count" and do not indicate storage requirements.
[Definition:] The maxlength facet indicates the maximum length, in characters, of a string (§3.2.3) datatype for which the length (§2.4.2.1) facet is not specified. For the binary (§3.2.9) datatype it specifies the maximum length in bits if the length (§2.4.2.1) facet is not specified. The value of the maximum length facet must be a positive integer.
The datatypes defined in this specification are defined in terms of abstract value spaces and their properties as opposed to how values are lexically represented in XML instances. However, the lexical representation of values is of prime importance in many applications. Because of this importance, each Primitive datatypes (§3.2) definition includes a detailed description of its default Lexical Space (§2.3). [Definition:] The lexical representation facet can be used to constrain the allowable representations, or literals, for values of a datatype. The meaning of the lexical representation facet depends on the datatype to which it is applied.
For example, for string (§3.2.3), values for the lexical representation facet are Regular Expressions (§D).
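A minimal sketch of such a constraint follows, using the ZIP code mentioned in the introduction. Python's re syntax stands in for the regular expressions of Appendix D, which may differ in detail; the pattern itself and the names ZIP_CODE and is_valid_literal are hypothetical.

```python
import re

# Hypothetical lexical representation facet for a ZIP code datatype:
# five digits, optionally followed by a hyphen and four more digits.
ZIP_CODE = re.compile(r"[0-9]{5}(-[0-9]{4})?")


def is_valid_literal(literal: str, pattern: "re.Pattern[str]" = ZIP_CODE) -> bool:
    """Return True when the whole literal matches the facet's pattern."""
    return pattern.fullmatch(literal) is not None


print(is_valid_literal("02139"))       # True
print(is_valid_literal("02139-4307"))  # True
print(is_valid_literal("2139"))        # False: too few digits
```

The whole literal is matched rather than a substring, since the facet constrains the complete lexical representation.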
[Definition:] Presence of an enumeration facet constrains the value space of the datatype to the specified list. The enumeration facet can be applied to any datatype. No order or any other relationship is implied between the elements of the enumeration list.
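For instance, the "importance" value of the memo in the introduction could be constrained by an enumeration. The Python sketch below is illustrative only; the names IMPORTANCE and in_enumeration are hypothetical, and an unordered set is used because the facet implies no order among its elements.

```python
# Hypothetical enumeration facet for an "importance" datatype.
IMPORTANCE = frozenset({"low", "medium", "high"})


def in_enumeration(value: str, allowed: frozenset = IMPORTANCE) -> bool:
    """Return True when the value is one of the enumerated values."""
    return value in allowed


print(in_enumeration("medium"))  # True
print(in_enumeration("urgent"))  # False
```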
[Definition:] The minAbsoluteValue facet specifies the minimum absolute value of the value space for generated datatypes whose basetype is real (§3.2.5). This facet (together with maxAbsoluteValue (§2.4.2.6)) can be used to generate subtypes of real (§3.2.5) which correspond to common floating point representations.
[Definition:] The maxAbsoluteValue facet specifies the maximum absolute value of the value space for generated datatypes whose basetype is real (§3.2.5). This facet (together with minAbsoluteValue (§2.4.2.5)) can be used to generate subtypes of real (§3.2.5) which correspond to common floating point representations.
[Definition:] The maxInclusive facet determines the upper bound of the value space for a datatype with the Order (§2.4.1.1) property. The maximum value specified with this facet is inclusive in the sense that the value specified for the facet is itself included in the value space for the datatype.
[Definition:] The maxExclusive facet determines the upper bound of the value space for a datatype with the Order (§2.4.1.1) property. The maximum value specified with this facet is exclusive in the sense that the value specified for the facet is itself excluded from the value space for the datatype.
[Definition:] The minInclusive facet determines the lower bound of the value space for a datatype with the Order (§2.4.1.1) property. The minimum value specified with this facet is inclusive in the sense that the value specified for the facet is itself included in the value space for the datatype.
[Definition:] The minExclusive facet determines the lower bound of the value space for a datatype with the Order (§2.4.1.1) property. The minimum value specified with this facet is exclusive in the sense that the value specified for the facet is itself excluded from the value space for the datatype.
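The difference between the inclusive and exclusive forms of the four bounding facets can be sketched in Python as follows. The function within_bounds and its parameter names are hypothetical, and Python's built-in comparison operators stand in for the order relation of the datatype.

```python
def within_bounds(value, min_inclusive=None, min_exclusive=None,
                  max_inclusive=None, max_exclusive=None) -> bool:
    """Return True when value satisfies every bounding facet that is given."""
    if min_inclusive is not None and value < min_inclusive:
        return False   # values below the inclusive lower bound are rejected
    if min_exclusive is not None and value <= min_exclusive:
        return False   # the exclusive lower bound itself is rejected too
    if max_inclusive is not None and value > max_inclusive:
        return False   # values above the inclusive upper bound are rejected
    if max_exclusive is not None and value >= max_exclusive:
        return False   # the exclusive upper bound itself is rejected too
    return True


print(within_bounds(1, min_inclusive=1, max_exclusive=5))  # True
print(within_bounds(5, min_inclusive=1, max_exclusive=5))  # False
```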
[Definition:] The precision facet, which only applies to the decimal (§3.3.9) datatype, refers to the total number of decimal digits in the number. Its value must be a positive integer.
[Definition:] The scale facet, which only applies to the decimal (§3.3.9) datatype, refers to the total number of decimal digits to the right of the decimal point. Its value must be a positive number less than or equal to the precision.
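The following Python sketch shows one way the two facets could be checked against a literal. It rests on assumptions the definitions above do not settle: digits are counted as written (so "100" has three digits), and a value such as "0.05" is taken to have two digits, both to the right of the decimal point. The function names are hypothetical.

```python
from decimal import Decimal


def digits_and_scale(literal: str):
    """Return (total digits, digits right of the decimal point) for a literal."""
    _sign, digits, exponent = Decimal(literal).as_tuple()
    scale = max(0, -exponent)
    # Digits left of the point plus digits right of it; a value such as
    # "0.05" is counted as having `scale` digits (an assumption).
    total = max(len(digits) + max(0, exponent), scale)
    return total, scale


def satisfies(literal: str, precision: int, scale: int) -> bool:
    """Return True when the literal fits the given precision and scale."""
    total, actual_scale = digits_and_scale(literal)
    return total <= precision and actual_scale <= scale


print(digits_and_scale("123.45"))                   # (5, 2)
print(satisfies("123.45", precision=5, scale=2))    # True
print(satisfies("123.456", precision=5, scale=2))   # False
```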
Ed. Note: need to fill out definition of this facet, which applies (currently) only to binary (§3.2.9)
Ed. Note: need to fill out definition of this facet, which applies (currently) only to recurringInstant (§3.2.8)
It is useful to categorize the datatypes defined in this specification along various dimensions, forming a set of characterization dichotomies.
The first distinction to be made is that between atomic and aggregate datatypes.
For example, a date that is represented as a single character string could be the value of an atomic date datatype; while another date represented as separate "month", "day" and "year" elements would be the value of an aggregate date datatype. Not surprisingly, the distinction is analogous to that between an XML element whose content model is #PCDATA and one with element content.
As discussed above, this specification focuses mainly on atomic datatypes. Later versions will address aggregate datatypes in more detail. Note that the legacy XML attribute types IDREFS (§3.3.5), ENTITIES (§3.3.7) and NMTOKENS (§3.2.2) can be thought of as aggregate (list) types.
A datatype which is atomic in this specification need not be an "atomic" datatype in any programming language used to implement this specification.
For example, a real (§3.2.5) is a well defined mathematical concept that cannot be defined in terms of other datatypes while a date (§3.3.15) is a special case of the more general datatype recurringInstant (§3.2.8).
The datatypes defined by this specification fall into both the primitive and the generated categories. It is felt that a judiciously chosen set of primitive datatypes will serve the widest possible audience by providing a set of convenient datatypes that can be used as is, as well as providing a rich enough base from which the variety of datatypes needed by schema designers can be generated.
A datatype which is primitive in this specification need not be a "primitive" datatype in any programming language used to implement this specification.
[Definition:] Every generated datatype is defined in terms of an existing datatype, referred to as the base type. Base types may be either primitive or generated.
[Definition:] In the example above, date (§3.3.15) is referred to as a subtype of the base type recurringInstant (§3.2.8). The value space of a subtype is a subset of the value space of the base type.
Conceptually there is no difference between the built-in generated datatypes included in this specification and the user-generated datatypes which will be created by individual schema designers. The built-in generated datatypes are those which are believed to be so common that if they were not defined in this specification many schema designers would end up "reinventing" them. Furthermore, including these generated datatypes in this specification serves to demonstrate the mechanics and utility of the datatype generation facilities of this specification.
A datatype which is built-in in this specification need not be a "built-in" datatype in any programming language used to implement this specification.
The built-in datatypes defined by this specification are designed so that systems other than the XML Schema Definition Language may access them. To facilitate such usage, the built-in datatypes in this specification come from the XML Datatype Language namespace, the specific namespace defined by this specification. This applies to both built-in primitive and built-in generated datatypes.
Ed. Note: The exact URLs for the namespace(s) defined by this W3C specification is still an open issue. This issue has been raised with the XML Coordination Group (issue 1999-0201-07 Standardizing W3C namespace URIs) for general coordination and resolution. On August 11, Dan Connolly recommended we make up our own URL for datatypes. See http://lists.w3.org/Archives/Member/w3c-xml-schema-ig/1999Aug/0060.html.
Each user-generated datatype is also associated with a unique namespace. However, user-generated datatypes do not come from the XML Datatype Language namespace; rather, they come from the namespace of the schema in which they are defined. Note that associating a namespace with a user-generated datatype is not a general purpose extensibility mechanism and does not apply to primitive datatypes. Suppose a schema author wanted to introduce a new set of primitive datatypes, say a core set of mathematical datatypes not based on the Number datatype defined as a built-in primitive by this specification. Such a schema author might try to define those datatypes, associate a unique namespace with them and expect schema processors to understand them. Unfortunately, such a scenario would not work. Each such datatype would need specialized validation code and there are still many unresolved issues regarding standard mechanisms for sharing such code.
As described in more detail in Defining Generated Datatypes (§4), each user-generated datatype must be defined in terms of a base type included in this specification, by assigning facets which serve to constrain the value set of the user-generated datatype to a subset of the base type. Such a mechanism works because all schema processors are required to be able to validate datatypes defined by subsetting the value space of a datatype included in this specification.
The primitive datatypes are described below. For each primitive datatype we discuss the fundamental facets, if any, and the constraining facets, if any.
[Definition:] The NMTOKEN datatype represents the NMTOKEN attribute type from [XML]. The value space of NMTOKEN is the set of all tokens that match the Nmtoken production in [XML]. The lexical space of NMTOKEN is the set of all strings that match the Nmtoken production in [XML].
For compatibility (see Terminology (§1.4)) this datatype should be used only on attributes.
NMTOKEN has the following subtypes:
[Definition:] The NMTOKENS datatype represents the NMTOKENS attribute type from [XML]. It consists of a white-space-separated list of NMTOKENs. The value space of NMTOKENS is the set of all tokens that match the Nmtokens production in [XML]. The lexical space of NMTOKENS is the set of all strings that match the Nmtokens production in [XML].
NMTOKENS has no fundamental or constraining facets. For compatibility (see Terminology (§1.4)) this datatype should be used only on attributes.
[Definition:] The string datatype represents character strings in XML. The value space of the string datatype is the set of finite sequences of UCS characters ([ISO 10646] and [Unicode]). A UCS character (or just character, for short) is an atomic unit of communication; it is not further specified except to note that every UCS character has a corresponding UCS code point, which is an integer.
Ed. Note: We need to harmonize this definition with the I18N character model.
The string datatype has an optional constraining facet called lexical representation (§2.4.2.3). The value of this facet is a regular expression. Regular expression constraints are discussed in Appendix Regular Expressions (§D). If this facet is not present, there is no restriction on the lexical representation.
The string datatype has an optional constraining facet called length (§2.4.2.1). If length is specified we have a fixed length character string, where length is measured in the number of characters in the string. If length is not specified we have a variable length character string. The value of the length facet must be a positive integer.
The string datatype has an optional constraining facet called maximum length (§2.4.2.2). If maxlength is specified for a variable length string it represents an upper bound of the length of the string. The value of the maxlength facet must be a positive integer. The length (§2.4.2.1) and maximum length (§2.4.2.2) facets cannot both be specified for the same datatype. The absolute maximum length of variable length character strings depends on the XML parser implementation.
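The interaction of the two facets can be sketched in Python as follows. The function check_string_length is hypothetical, and Python's len, which counts Unicode code points, is only one possible reading of "number of characters" pending resolution of the I18N issues noted earlier.

```python
def check_string_length(value: str, length=None, maxlength=None) -> bool:
    """Return True when value satisfies the length or maxlength facet."""
    if length is not None and maxlength is not None:
        raise ValueError("length and maxlength cannot both be specified")
    for name, facet in (("length", length), ("maxlength", maxlength)):
        if facet is not None and facet < 1:
            raise ValueError(name + " must be a positive integer")
    if length is not None:
        return len(value) == length      # fixed length string
    if maxlength is not None:
        return len(value) <= maxlength   # variable length with upper bound
    return True                          # unconstrained variable length


print(check_string_length("MA", length=2))          # True
print(check_string_length("Mass.", length=2))       # False
print(check_string_length("Boston", maxlength=10))  # True
```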
The string datatype also has the following constraining facets:
Clearly, the effect of these constraining facets depends on the collating sequence used to define the Order (§2.4.1.1) property for strings.
Ed. Note: The issue of collating sequences for strings is complex. It will be discussed in detail in a subsequent version of this specification.
[Definition:] The boolean datatype has the value space required to support the mathematical concept of binary-valued logic: {true, false}.
An instance of a datatype that is defined as boolean can have the following legal lexical values {true, false}. The lexical representation is fixed and cannot be changed. The lexical representation facet is not supported.
[Definition:] The real datatype represents the standard mathematical concept of the real numbers.
real has the following constraining facets:
real has the following subtype:
real values have a single standard lexical representation consisting of a mantissa followed, optionally, by the character "E" followed by an exponent. The exponent must be an integer. The mantissa must be a decimal number. The representations for exponent and mantissa must follow the default lexical rules for integer and decimal numbers discussed above. If the "E" and the following exponent are omitted, an exponent value of 0 is assumed. For example: -1E4, 1267.43233E12, 12.78E-2, 12.
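A minimal Python sketch of reading such a literal is given below. The regular expression is an assumption about the lexical rules for the mantissa and exponent, the name parse_real is hypothetical, and an omitted exponent is taken to leave the mantissa unchanged, so that "12" denotes twelve.

```python
import re
from decimal import Decimal

# Assumed shape: signed decimal mantissa, then an optional "E" and a
# signed integer exponent.
REAL_LITERAL = re.compile(r"([+-]?[0-9]+(?:\.[0-9]+)?)(?:E([+-]?[0-9]+))?")


def parse_real(literal: str) -> Decimal:
    """Return the value mantissa * 10**exponent denoted by the literal."""
    match = REAL_LITERAL.fullmatch(literal)
    if match is None:
        raise ValueError("not a valid real literal: " + repr(literal))
    mantissa = Decimal(match.group(1))
    exponent = int(match.group(2)) if match.group(2) is not None else 0
    return mantissa.scaleb(exponent)  # multiply by 10**exponent exactly


for literal in ("-1E4", "1267.43233E12", "12.78E-2", "12"):
    print(literal, "denotes", parse_real(literal))
```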
[Definition:] The timeInstant datatype represents a combination of date and time values representing a single instant of time, encoded as a single string. A single lexical representation, which is a subset of the lexical representations allowed by [ISO 8601], is allowed for timeInstant.
Issue (non-gregorian-dates): As an internationalization issue, do we want support for non-gregorian dates? This issue also applies to timeDuration (§3.2.7), date (§3.3.15) and time (§3.3.16).
The lexical representation for timeInstant is the [ISO 8601] representation CCYYMMDDThhmmss.sss where "CC" represents the century, "YY" the year, "MM" the month and "DD" the day. The letter "T" is the date/time separator and "hh", "mm", "ss.sss" represent hour, minute and second respectively. Note that this representation allows for fractional seconds.
Ed. Note: We need a more complete description of the lexical space, which, for instance, makes it clear that seconds can be represented to any precision desired, not just thousandths of a second. This note also applies to the lexical representations of timeDuration (§3.2.7), recurringInstant (§3.2.8) and time (§3.3.16).
This representation can be immediately followed by a "Z" to indicate Coordinated Universal Time. To indicate the time zone, i.e. the difference between the local time and Coordinated Universal Time, the difference immediately follows the time and consists of a sign, + or -, followed by hhmm.
For example, to indicate 1:20 pm on May the 31st, 1999 for Eastern Standard Time, which is 5 hours behind Coordinated Universal Time, one would write: 19990531T132000-0500.
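The lexical representation described above can be read with a short Python sketch such as the following. The regular expression and the name parse_time_instant are hypothetical; fractional seconds are truncated to microseconds, a limitation of Python's datetime type and not of this specification.

```python
import re
from datetime import datetime, timedelta, timezone

# CCYYMMDDThhmmss, optional fractional seconds, optional "Z" or +hhmm/-hhmm.
TIME_INSTANT = re.compile(
    r"([0-9]{4})([0-9]{2})([0-9]{2})T([0-9]{2})([0-9]{2})([0-9]{2})"
    r"(?:\.([0-9]+))?(Z|[+-][0-9]{4})?"
)


def parse_time_instant(literal: str) -> datetime:
    """Return the instant denoted by a timeInstant literal."""
    match = TIME_INSTANT.fullmatch(literal)
    if match is None:
        raise ValueError("not a valid timeInstant literal: " + repr(literal))
    year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
    fraction = match.group(7) or ""
    microsecond = int(fraction[:6].ljust(6, "0")) if fraction else 0
    zone = match.group(8)
    if zone is None:
        tz = None                      # no time zone given
    elif zone == "Z":
        tz = timezone.utc              # Coordinated Universal Time
    else:
        sign = 1 if zone[0] == "+" else -1
        tz = timezone(sign * timedelta(hours=int(zone[1:3]),
                                       minutes=int(zone[3:5])))
    return datetime(year, month, day, hour, minute, second, microsecond, tzinfo=tz)


instant = parse_time_instant("19990531T132000-0500")
print(instant.isoformat())  # 1999-05-31T13:20:00-05:00
```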
[Definition:] The timeDuration datatype represents a combination of year, month, day and time values representing a single duration of time, encoded as a single string. A single lexical representation, which is a subset of the lexical representations allowed by [ISO 8601], is allowed for timeDuration.
The lexical representation for timeDuration is the [ISO 8601] representation CCYYMMDDThhmmss.sss, preceded by an optional sign (+ or -), where "CC" represents the number of centuries, "YY" the number of years, "MM" the number of months and "DD" the number of days. The letter "T" is the date/time separator and "hh", "mm", "ss.sss" represent number of hours, minutes and seconds respectively. Note that this representation allows for fractional seconds.
For example, to indicate a duration of 1 year, 2 months, 3 days, 10 hours, and 30 minutes, one would write: 00010203T103000.
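The example literal can be decomposed with a Python sketch along the following lines. The Duration record and the function parse_time_duration are hypothetical, and the four "CCYY" digits are read together as a single count of years.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Duration:
    negative: bool
    years: int
    months: int
    days: int
    hours: int
    minutes: int
    seconds: float


# Optional sign, then CCYYMMDDThhmmss with optional fractional seconds.
TIME_DURATION = re.compile(
    r"([+-])?([0-9]{4})([0-9]{2})([0-9]{2})T"
    r"([0-9]{2})([0-9]{2})([0-9]{2}(?:\.[0-9]+)?)"
)


def parse_time_duration(literal: str) -> Duration:
    """Split a timeDuration literal into its components."""
    match = TIME_DURATION.fullmatch(literal)
    if match is None:
        raise ValueError("not a valid timeDuration literal: " + repr(literal))
    sign, years, months, days, hours, minutes, seconds = match.groups()
    return Duration(sign == "-", int(years), int(months), int(days),
                    int(hours), int(minutes), float(seconds))


print(parse_time_duration("00010203T103000"))
```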
Time periods, i.e. specific durations of time, can be represented by supplying two items of information: a start instant and a duration or a start instant and an end instant or an end instant and a duration.
[Definition:] The recurringInstant datatype represents an instant of time that recurs with a specific timeDuration (§3.2.7). Note that we do not attempt to support general recurring instants of time, just those needed to support the generated date (§3.3.15) and time (§3.3.16) datatypes and those that arise from truncated and reduced lexical representations of timeInstant (§3.2.6).
recurringInstant has a single constraining facet, period, which can be used to constrain the frequency of recurrence. Values of the period facet must be of type timeDuration (§3.2.7).
recurringInstant has the following subtypes:
The lexical representation for recurringInstant is the left truncated [ISO 8601] representation for timeInstant (§3.2.6). For example, if the century "CC" is omitted from the timeInstant representation it means a timeInstant that recurs every hundred years. Similarly, if "CCYY" is omitted it designates a time instant that recurs every year.
Every two-character "unit" of the representation that is omitted is indicated by a single hyphen "-". For example, to indicate 1:20 pm on May the 31st every year, one would write: --0531T132000-0500.
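The hyphen convention can be sketched in Python as follows. Only truncation of the date part is handled, which is an assumption of the example, and the names DATE_UNITS and omitted_units are hypothetical.

```python
# Two-character units of the date part, in left-to-right order.
DATE_UNITS = ("CC", "YY", "MM", "DD")


def omitted_units(literal: str):
    """Return the units replaced by leading hyphens in a literal."""
    hyphens = len(literal) - len(literal.lstrip("-"))
    if hyphens > len(DATE_UNITS):
        raise ValueError("too many leading hyphens: " + repr(literal))
    return DATE_UNITS[:hyphens]


print(omitted_units("--0531T132000-0500"))  # ('CC', 'YY'): recurs every year
print(omitted_units("-990531T132000"))      # ('CC',): recurs every hundred years
```

Only leading hyphens are counted, so the sign of a trailing time zone difference such as -0500 is not mistaken for an omitted unit.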
[Definition:] The binary datatype represents strings (blobs) of binary data. It has three fundamental facets. The optional length (§2.4.2.1) facet specifies the length of the data in bits. This defines a datatype with a fixed length. If the length is not specified, a datatype with variable length is specified. In this case, the optional maximum length (§2.4.2.2) facet specifies the maximum length of the data in bits. If the maximum length is not specified the default is unlimited length. The optional "encoding" facet specifies the encoding, which may be "hex" for hexadecimal digits or "base64" for MIME style Base64 data.
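The effect of the "encoding" facet can be sketched in Python as follows. The function names are hypothetical, and the sketch takes the decoded octets as the value, which is one of the two readings left open by the issue on the value space of this datatype.

```python
import base64


def decode_binary(literal: str, encoding: str) -> bytes:
    """Decode a binary literal according to its "encoding" facet."""
    if encoding == "hex":
        return bytes.fromhex(literal)
    if encoding == "base64":
        return base64.b64decode(literal, validate=True)
    raise ValueError("unsupported encoding: " + repr(encoding))


def length_in_bits(data: bytes) -> int:
    """Return the length of the decoded data in bits."""
    return 8 * len(data)


from_hex = decode_binary("48656c6c6f", "hex")
from_base64 = decode_binary("SGVsbG8=", "base64")
print(from_hex == from_base64)   # True: two encodings of the same octets
print(length_in_bits(from_hex))  # 40
```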
Issue (application-specific-binary-formats): Should we add a facet to allow a binary datatype to be restricted to an application-specific format such as video, audio, image?
Issue (binary-mime-type): should we add facet(s) for mime-type/subtype?
Issue (binary-value-space): Is this really a datatype? What is the value space of this datatype: the set of encoded strings or the set of binary streams after decoding?
[Definition:] The uri datatype represents a Uniform Resource Identifier (URI) Reference as defined in [RFC 2396]. It has no fundamental or constraining facets.
Issue (uri-scheme-facet): should we have a facet to allow a limitation to a specific scheme? It might be useful to be able to say that something was not only a URI, but that it was a "mailto" and not a "http://...".
[Definition:] The language datatype represents natural language identifiers as defined by [RFC 1766]. The value space of language is the set of all tokens that match the LanguageID production in [XML]. The lexical space of language is the set of all strings that match the LanguageID production in [XML].
This section gives conceptual definitions for all built-in generated datatypes defined by this specification, including a description of the facets which apply to each datatype. The abstract syntax used to define generated datatypes (whether built-in or user-generated) is given in section Defining Generated Datatypes (§4) and the complete definitions of the built-in generated datatypes (written in the concrete syntax based on that abstract syntax given in Appendix Schema for Datatype Definitions (normative) (§A)) are provided in Appendix Schema for Datatype Definitions (normative) (§A).
[Definition:] The Name datatype represents XML Names. The value space of this datatype is the set of all tokens which match the Name production of [XML]. The lexical space of this datatype is the set of all strings which match the Name production of [XML]. The basetype of Name is NMTOKEN (§3.2.1).
Name has the following subtypes:
[Definition:] The NCName datatype represents XML "non-colonized" Names. The value space of this datatype is the set of all tokens which match the NCName production of [Namespaces in XML]. The lexical space of this datatype is the set of all strings which match the NCName production of [Namespaces in XML]. The basetype of NCName is Name (§3.3.1).
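Note: The difference between Name and NCName can be sketched in Python as follows. This is an illustration only: the patterns are ASCII-only approximations (an assumption made for brevity), whereas the productions in [XML] and [Namespaces in XML] also admit many non-ASCII letters and digits. The names is_name and is_ncname are hypothetical.

```python
import re

# ASCII-only approximations of the Name and NCName productions.
# A Name may start with a letter, "_" or ":"; an NCName may not contain ":".
NAME_ASCII = re.compile(r"[A-Za-z_:][A-Za-z0-9._:-]*")
NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def is_name(text: str) -> bool:
    """True if the whole string matches the approximate Name pattern."""
    return NAME_ASCII.fullmatch(text) is not None


def is_ncname(text: str) -> bool:
    """True if the whole string matches the approximate NCName pattern."""
    return NCNAME_ASCII.fullmatch(text) is not None
```

Under this approximation "xsl:key" is a Name but not an NCName, which is what "non-colonized" means.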
[Definition:] The ID datatype represents the ID attribute type from [XML]. The value space of ID is the set of all tokens that match the Name production in [XML] and have been used in an XML document. The lexical space of ID is the set of all strings that match the Name production in [XML]. The basetype of ID is Name (§3.3.1).
ID has no fundamental or constraining facets. For compatibility (see Terminology (§1.4)) this datatype should be used only on attributes.
Schema-validity Constraint: ID Unique
An ID must not appear
more than once in an XML document as a value of this type; i.e.,
ID values must uniquely identify the elements which bear them.
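Note: The ID Unique constraint amounts to a duplicate check over all values of type ID collected from a document. A minimal Python sketch, assuming the values have already been gathered into a list (how they are gathered is outside this sketch, and the name duplicate_ids is hypothetical):

```python
from collections import Counter
from typing import Iterable, List


def duplicate_ids(id_values: Iterable[str]) -> List[str]:
    """Return the ID values that appear more than once, sorted."""
    counts = Counter(id_values)
    return sorted(value for value, n in counts.items() if n > 1)
```

A document satisfies the constraint when duplicate_ids returns an empty list.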
ID has the following subtypes:
Issue (better-reference-mechanisms): There are several situations in which we need better reference mechanisms than those provided by ID and IDREF/IDREFS. For example, it would be desirable to extend IDs and IDREFs to be typed and scoped to better represent primary key/foreign key relationships in a database. XSL has recently introduced the concept of xsl:key and xsl:keyref whereby a single property of an element can be used as a key. We need such a mechanism for XML as a whole and it would be nice if this were extended to support multi-part keys.
[Definition:] The IDREF datatype represents the IDREF attribute type from [XML]. The value space of IDREF is the set of all tokens that match the Name production in [XML] and have been used in an XML document as the value of an element or attribute of type ID. The lexical space of IDREF is the set of all strings that match the Name production in [XML]. The basetype of IDREF is ID (§3.3.3).
IDREF has no fundamental or constraining facets. For compatibility (see Terminology (§1.4)) this datatype should be used only on attributes.
[Definition:] The IDREFS datatype represents the IDREFS attribute type from [XML]. It consists of a null-separated list of IDREFs. The value space of IDREFS is the set of all tokens that match the Names production in [XML] and have been used in an XML document as the value of an element or attribute of type ID. The lexical space of IDREFS is the set of all strings that match the Names production in [XML].
IDREFS has no fundamental or constraining facets. For compatibility (see Terminology (§1.4)) this datatype should be used only on attributes.
[Definition:] The ENTITY datatype represents the ENTITY attribute type from [XML]. The value space of ENTITY is the set of all tokens that match the Name production in [XML] and have been declared as an Unparsed Entity in a schema. The lexical space of ENTITY is the set of all strings that match the Name production in [XML]. The basetype of ENTITY is Name (§3.3.1).
ENTITY has no fundamental or constraining facets. For compatibility (see Terminology (§1.4)) this datatype should be used only on attributes.
[Definition:] The ENTITIES datatype represents the ENTITIES attribute type from [XML]. It consists of a null-separated list of ENTITYs. The value space of ENTITIES is the set of all tokens that match the Name production in [XML] and have been declared as an Unparsed Entity in a schema. The lexical space of ENTITIES is the set of all strings that match the Name production in [XML].
ENTITIES has no fundamental or constraining facets. For compatibility (see Terminology (§1.4)) this datatype should be used only on attributes.
[Definition:] The NOTATION datatype represents the NOTATION attribute type from [XML]. The value space of NOTATION is the set of all notations declared in a schema. The lexical space of NOTATION is the set of all strings that match the Name production in [XML]. The basetype of NOTATION is Name (§3.3.1).
For compatibility (see Terminology (§1.4)) this datatype should be used only on attributes.
This required facet is used to specify the list of notations.
Ed. Note: (PVB 19990601) this definition is NOT correct
[Definition:] The decimal datatype restricts allowable values to real numbers with an exact fractional part. The basetype of decimal is real (§3.2.5).
Decimal has the following required fundamental facets:
Decimal has the following constraining facets:
decimal has the following subtypes:
Decimal values have a single standard lexical representation. This consists of a string of digits separated by a period as a decimal indicator, in accordance with the scale and precision facets, with an optional leading sign to indicate a negative number. If the sign is omitted, "+" is assumed. Leading and trailing zeroes are optional. For example: -1.23, 12678967.543233, 100000.00.
[Definition:] The integer datatype is the standard mathematical concept of the integer numbers. The basetype of integer is decimal (§3.3.9). The value space of the integer datatype is the infinite set {-∞,...,-2,-1,0,1,2,...,∞} although computer implementations restrict this to a finite set.
Integer has the following constraining facets:
integer has the following subtypes:
Integer values have a single, standard lexical representation. This consists of a string of digits with an optional leading sign. If the sign is omitted, "+" is assumed. For example: -1, 0, 12678967543233, +100000.
[Definition:] The non-negative-integer datatype is the standard mathematical concept of the non-negative integers. The value space of the non-negative-integer datatype is the infinite set {0,1,2,...,∞} although computer implementations restrict this to a finite set. The basetype of non-negative-integer is integer (§3.3.10).
non-negative-integer has the following constraining facets:
non-negative-integer has the following subtypes:
Non-negative-integer values have a single, standard lexical representation. This consists of a string of digits with an optional leading "+" sign. If the sign is omitted, "+" is assumed. For example: 1, 0, 12678967543233, +100000.
[Definition:] The positive-integer datatype is the standard mathematical concept of the positive integers. The value space of the positive-integer datatype is the infinite set {1,2,...,∞} although computer implementations restrict this to a finite set. The basetype of positive-integer is non-negative-integer (§3.3.11).
positive-integer has the following constraining facets:
positive-integer values have a single, standard lexical representation. This consists of a string of digits with an optional leading "+" sign. For example: 1, 12678967543233, +100000.
[Definition:] The non-positive-integer datatype is the standard mathematical concept of the non-positive integers. The value space of the non-positive-integer datatype is the infinite set {-∞,...,-2,-1,0} although computer implementations restrict this to a finite set. The basetype of non-positive-integer is integer (§3.3.10).
non-positive-integer has the following constraining facets:
non-positive-integer has the following subtypes:
Non-positive-integer values have a single, standard lexical representation. This consists of a string of digits with a leading "-" sign; as the example 0 shows, zero may be written without a sign. For example: -1, 0, -12678967543233, -100000.
[Definition:] The negative-integer datatype is the standard mathematical concept of the negative integers. The value space of the negative-integer datatype is the infinite set {-∞,...,-2,-1} although computer implementations restrict this to a finite set. The basetype of negative-integer is non-positive-integer (§3.3.13).
negative-integer has the following constraining facets:
negative-integer values have a single, standard lexical representation. This consists of a string of digits with a leading "-" sign. For example: -1, -12678967543233, -100000.
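Note: The lexical representations of integer and its four subtypes can be checked with a short Python sketch. It is illustrative only. It assumes that the sign rule is decided by the resulting value rather than by the characters used, so that, for example, "-0" would be accepted as a non-negative-integer; this draft does not settle that case. The name is_valid_integer is hypothetical.

```python
import re
from typing import Callable, Dict

# A string of ASCII digits with an optional leading sign.
_INTEGER = re.compile(r"[+-]?[0-9]+")

# Value-space restriction for each datatype in the integer family.
_RANGE: Dict[str, Callable[[int], bool]] = {
    "integer": lambda v: True,
    "non-negative-integer": lambda v: v >= 0,
    "positive-integer": lambda v: v >= 1,
    "non-positive-integer": lambda v: v <= 0,
    "negative-integer": lambda v: v <= -1,
}


def is_valid_integer(text: str, datatype: str = "integer") -> bool:
    """Check the lexical form first, then the value range of the datatype."""
    if _INTEGER.fullmatch(text) is None:
        return False
    return _RANGE[datatype](int(text))
```

All of the example values listed above for each datatype pass this check, while a value such as "12a" fails for every datatype.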
[Definition:] The date datatype represents a timeDuration (§3.2.7) that starts at midnight of a specified day and lasts for 24 hours. The basetype of date is recurringInstant (§3.2.8). date is generated from recurringInstant (§3.2.8) by setting the value of the period facet equal to 24 hours.
The lexical representation for date is the reduced (right truncated) lexical representation for recurringInstant (§3.2.8): CCYYMMDD. For example, to indicate May the 31st, 1999, one would write: 19990531.
Left truncated representations can be used to represent recurring dates. If the CC is omitted it signifies a date that occurs every century. If the YY is omitted it signifies a date every year and so on. Every two character "unit" of the representation that is omitted is indicated by a single hyphen "-". For example, ---05 signifies the fifth day of every month.
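Note: The left-truncation rule can be sketched in Python as follows. This is an illustration, not a normative algorithm. It assumes that hyphens may only replace units at the left of the representation and that at least one unit is present. It does not check that the month and day are in range. The name parse_date is hypothetical.

```python
from typing import Dict, Optional

UNITS = ("CC", "YY", "MM", "DD")


def parse_date(text: str) -> Dict[str, Optional[str]]:
    """Split CCYYMMDD into units; an omitted leading unit is None."""
    fields: Dict[str, Optional[str]] = {}
    pos = 0
    seen_digits = False
    for unit in UNITS:
        chunk = text[pos:pos + 2]
        if not seen_digits and chunk[:1] == "-":
            fields[unit] = None      # one hyphen stands for one omitted unit
            pos += 1
        elif len(chunk) == 2 and chunk.isascii() and chunk.isdigit():
            fields[unit] = chunk
            pos += 2
            seen_digits = True
        else:
            raise ValueError(f"bad {unit} unit in {text!r}")
    if pos != len(text) or not seen_digits:
        raise ValueError(f"not a date representation: {text!r}")
    return fields
```

With this sketch, "19990531" yields all four units, while "---05" yields only DD = "05", the fifth day of every month.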
[Definition:] The time datatype represents a recurring instant of time that recurs every day. The basetype of time is recurringInstant (§3.2.8). The time datatype can be considered to be a shorthand to designate a specific truncated representation for recurringInstant (§3.2.8). time is generated from recurringInstant (§3.2.8) by setting the value of the period facet equal to 24 hours.
The lexical representation for time is the left truncated lexical representation for timeInstant (§3.2.6): hhmmss.sss. For example, to indicate 1:20 pm for Eastern Standard Time, which is 5 hours behind Coordinated Universal Time, one would write: 132000-0500.
A generated datatype can be defined from a primitive datatype (or another generated datatype) by adding optional constraining facets. For example, it may be useful to define a datatype called i4 (signed 4-byte integer) from the built-in datatype integer by supplying maxInclusive and minInclusive facets. In this case, i4 is the name of the new user-generated datatype, integer is its base type and maxInclusive and minInclusive are the constraining facets.
Example
<datatype name="i4">
  <basetype name="integer"/>
  <minInclusive> -2147483648 </minInclusive>
  <maxInclusive> 2147483647 </maxInclusive>
</datatype>
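Note: The effect of the minInclusive and maxInclusive facets can be sketched in Python. A signed 4-byte integer ranges from -2^31 to 2^31 - 1, and the sketch uses those bounds. The names within_bounds and is_i4 are hypothetical and the code is an illustration only.

```python
from typing import Optional

I4_MIN = -2**31        # -2147483648
I4_MAX = 2**31 - 1     #  2147483647


def within_bounds(value: int,
                  min_inclusive: Optional[int] = None,
                  max_inclusive: Optional[int] = None) -> bool:
    """Apply the optional minInclusive and maxInclusive facets."""
    if min_inclusive is not None and value < min_inclusive:
        return False
    if max_inclusive is not None and value > max_inclusive:
        return False
    return True


def is_i4(value: int) -> bool:
    """True if the value fits in a signed 4-byte integer."""
    return within_bounds(value, I4_MIN, I4_MAX)
```

Both bounds are inclusive, so the two extreme values themselves are accepted and their immediate neighbours outside the range are rejected.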
This section defines the abstract syntax used for defining generated datatypes. This abstract syntax is used for defining both Generated datatypes (§3.3) and user-generated datatypes; the only difference between the built-in and user-generated datatypes being that the datatype definitions for built-in generated datatypes are included in the Schema for Datatype Definitions (normative) (§A) while the datatype definitions for user-generated datatypes appear in schemas written by users.
[Definition:] An abstract syntax provides a formal specification of the information provided for each generated datatype definition. The abstract syntax is presented using a simplified BNF. Defined terms are to the left. Their components are to the right, with a small amount of meta-syntax: ()s for grouping, | to separate alternatives, ? for optionality, * and + for iteration.
[Definition:] The concrete syntax for generated datatype definitions is the exact element and attribute names used in definitions. The concrete syntax is a key feature of its proposed design. The concrete syntax is the form in which the schema language is used by datatype designers. Though its elements and attributes are often different from the terms of the abstract syntax BNF, the features and expressive power of the two are congruent.
We include a preliminary concrete syntax in this draft, via examples, as well as in Schema for Datatype Definitions (normative) (§A) (defined using the schema language of [XML Schema Part 1: Structures]) and DTD for Datatype Definitions (normative) (§B). The emphasis in this version has been to stay quite close to the abstract syntax.
Ed. Note: The abstract syntax proposed here (and hence the concrete syntax) is preliminary, as it allows datatype definitions which are logically inconsistent (e.g., it allows numeric facets on non-numeric datatypes). This will be corrected in future drafts, as the XML Schema language comes to allow the specification of tighter constraints.
Ed. Note: This section needs more explanatory text describing the productions and their relationship to the conceptual framework described in sections Type System (§2) and Built-in datatypes (§3).
[Productions: Datatype definitions]
The following is the definition for a possible built-in generated datatype "currency". This datatype definition would appear in the schema which defines datatypes for XML Schemas and shows that a generated datatype can have the same value space as its basetype, which might mean that it is just an "alias" or "renaming" of its basetype. In this case, the specification would probably also define some "semantics" for currency which went beyond those of decimal.
Example
<datatype name="currency">
  <basetype name="decimal"/>
</datatype>
Constraint on Schemas: Unique datatype definitions
The name of the datatype being defined must be unique among the datatypes
defined in the containing schema.
Constraint on Schemas: Appropriate facets
Ordered facets may appear in a datatype definition only if the
value space of the basetype is ordered.
[Productions: Datatype names]
NOTE: The datatypename production above is not to be confused with that labeled datatypeName in [XML Schema Part 1: Structures].
Constraint on Schemas: Datatype name
The name specified must be the name of a datatype defined in
the schema in which the user-generated datatype is defined.
[Productions: Facets]
[Productions: Ordered facets]
Constraint on Schemas: Literal type
The literal value given must be of the same type as the basetype
given in the datatype definition in which this facet
appears.
[Productions: Numeric facets]
Constraint on Schemas: minMaxAbsoluteValue
In a generated subtype of real (§3.2.5), if a value is specified
for the minAbsoluteValue facet a value must also be specified
for the maxAbsoluteValue facet.
The following is the definition of a user-generated datatype which could be used to represent monetary amounts, such as in a financial management application which generally does not have figures above $1M and allows only whole cents. This definition would appear in a schema authored by an "end-user" and shows how to define a datatype by specifying facet values which constrain the range of the basetype in a manner specific to the basetype (different from specifying max/min values as before).
Example
<datatype name="ieee32">
  <basetype name="real"/>
  <minAbsoluteValue> 1.40129846e-45 </minAbsoluteValue>
  <maxAbsoluteValue> 3.40282347e38 </maxAbsoluteValue>
</datatype>
The above subtype of real represents an IEEE 32-bit floating point number. While the explanation is beyond the scope of this specification, the above minimum and maximum absolute values correspond to values which are representable with the IEEE 32-bit floating point format, which has 1 bit for sign, 8 bits of exponent and 23 bits of mantissa.
Example
<datatype name="ieee64">
  <basetype name="real"/>
  <minAbsoluteValue> 4.94065645841246544e-324 </minAbsoluteValue>
  <maxAbsoluteValue> 1.79769313486231570e308 </maxAbsoluteValue>
</datatype>
The above subtype of real represents an IEEE 64-bit floating point number. While the explanation is beyond the scope of this specification, the above minimum and maximum absolute values correspond to values which are representable with the IEEE 64-bit floating point format, which has 1 bit for sign, 11 bits of exponent and 52 bits of mantissa.
Example
<datatype name="ibmhex32">
  <basetype name="real"/>
  <minAbsoluteValue> 5.2e-85 </minAbsoluteValue>
  <maxAbsoluteValue> 7.2e75 </maxAbsoluteValue>
</datatype>
The above subtype of real represents an IBM 32-bit hexadecimal floating point number. While the explanation is beyond the scope of this specification, the above minimum and maximum absolute values correspond to values which are representable with the IBM 32-bit hexadecimal floating point format, which has 1 bit for sign, 7 bits of exponent and 24 bits of mantissa.
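Note: The minAbsoluteValue and maxAbsoluteValue facets bound the magnitude of a value. The Python sketch below is an illustration only. It derives the IEEE 32-bit limits from the format's bit layout (smallest subnormal 2^-149, largest finite value (2 - 2^-23) x 2^127) and then applies a magnitude check. Admitting zero regardless of the minimum is an assumption of the sketch, since this section does not say how zero is treated. The names are hypothetical.

```python
import math

FLOAT32_MIN_ABS = 2.0 ** -149                        # smallest subnormal
FLOAT32_MAX_ABS = (2.0 - 2.0 ** -23) * 2.0 ** 127    # largest finite value


def within_absolute_bounds(value: float, min_abs: float, max_abs: float) -> bool:
    """Check min_abs <= |value| <= max_abs; zero is admitted (assumption)."""
    if value == 0.0:
        return True
    # NaN and infinities fail this comparison and are rejected.
    return min_abs <= abs(value) <= max_abs


def is_ieee32_magnitude(value: float) -> bool:
    """True if the magnitude lies within the IEEE 32-bit range."""
    return within_absolute_bounds(value, FLOAT32_MIN_ABS, FLOAT32_MAX_ABS)
```

The check concerns range only. It does not test whether a value is exactly representable in the 32-bit format.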
Example
This type could just as well have been defined with the potential built-in generated type "currency" (defined above) as its basetype.
<datatype name="amount">
  <basetype name="decimal"/>
  <precision> 8 </precision>
  <scale> 2 </scale>
</datatype>
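Note: The effect of precision and scale on the amount datatype can be sketched in Python. The facets themselves are defined outside this section, so the sketch assumes that precision is the maximum total number of digits and scale the maximum number of fractional digits, leaving precision minus scale digits for the integer part. The name fits_precision_scale is hypothetical.

```python
from decimal import Decimal


def fits_precision_scale(value: Decimal, precision: int, scale: int) -> bool:
    """Check digit counts of a decimal value against precision and scale."""
    _sign, digits, exponent = value.as_tuple()
    if not isinstance(exponent, int):      # NaN or Infinity
        return False
    frac_digits = max(-exponent, 0)
    int_digits = max(len(digits) + exponent, 0)
    return frac_digits <= scale and int_digits <= precision - scale
```

With precision 8 and scale 2, the largest accepted value is 999999.99, which matches the stated intent of amounts below $1M in whole cents.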
[Productions: Unordered facets]
Constraint on Schemas: Lexical specification
The lexical specification must be of the "correct" kind, i.e.,
a real lexical specification for datatypes generated from
real (§3.2.5).
The following example is a datatype definition for a user-generated datatype which limits the possible literal values of dates to the four US holidays enumerated. This datatype definition would appear in a schema authored by an "end-user" and shows how to define a datatype by enumerating the values in its value space. The enumerated values must be type-valid literals for the basetype.
Example
<datatype name="holidays">
  <basetype name="date"/>
  <enumeration>
    <literal> --0101 <!-- New Year's day --> </literal>
    <literal> --0704 <!-- 4th of July --> </literal>
    <literal> --1125 <!-- Thanksgiving --> </literal>
    <literal> --1225 <!-- Christmas --> </literal>
  </enumeration>
</datatype>
[Productions: Literals]
[Productions: Numeric Literals]
[Productions: Date and Time Literals]
Issue (definition-overriding): In some cases it may be desirable to specify datatype constraints in instance documents rather than in a schema. Should this be allowed? If the document does not have a schema then, clearly, the only possibility of adding datatype constraints is in the document instance. Even if the document has a schema, the document instance may want to further restrict the content. For example, the schema may specify a value to be a string but the instance may want to impose a particular regex constraint on it. If we decide to allow datatype specification or specialization in instance documents, what syntax should be used? This needs to be coordinated with the structural schema editorial team.
Issue (non-positive-integer-literal): Do we need productions for the literals of non-negative-integer, positive-integer, non-positive-integer and negative-integer?
Ed. Note: This section (both its abstract content and its concrete wording) has not yet garnered consensus among WG members.
The XML specification [XML] defines two levels of conformance. Well-formed documents conform to valid XML syntax but may or may not obey the constraints defined by a DTD. Valid XML documents conform to the structure laid down in a DTD. Thus, if a DTD defines an attribute as an ID, instances of XML documents conforming to the DTD can only be valid if the values of such attributes are valid XML names and are unique in the document. By introducing additional datatypes to XML, this specification extends the notion of validity in the sense that values defined to have a certain datatype in the schema must conform to the lexical representations allowed for that datatype. Values that do not conform to the datatype defined for them in the schema raise a conformance error. For example, the appearance of a letter in a value defined as "integer" raises such an error. Similarly, for a value defined as string with length facet equal to 5, a value of "ABC" would raise an error -- length too short -- as would a value of "abcdefgh" -- length too long.
Since the datatypes discussed in this document can be used independently of XML Schema it is desirable that datatype conformance be specified as an independent, optional piece that other processors can use as they see fit. To this end, we define the datatypes processor as a separate abstract interface. Processors can call this interface with the value to be validated and the datatype, along with all its facets, that it should be validated against. The processor will return a boolean value which will be true or false depending on whether the value is valid for the datatype or not. If the value is valid it may also return a canonical representation for the value. If the value is invalid the processor will return error information including the facets that caused the value to be declared invalid.
User-generated datatypes are defined by giving values to certain, optional facets. For example, an integer within a certain range could be defined by giving values to maxInclusive and minInclusive facets. A switch on the datatypes processor could be used to turn validation off for these facets. A processor built on the datatypes processor could use this switch to skip validation of user-generated datatypes.
If a particular processor, for reasons of speed or size, decides not to validate datatypes, it can use a default stub interface that always returns true.
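Note: One possible shape for the abstract datatypes processor interface is sketched below in Python. It is an illustration only, since this draft does not define a concrete API. All names (ValidationResult, DatatypesProcessor, StubProcessor, check_user_facets) are hypothetical, and only the string datatype with the length and maxLength facets is handled, with lengths counted in characters.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ValidationResult:
    valid: bool
    canonical: Optional[str] = None                          # set when valid
    failed_facets: List[str] = field(default_factory=list)   # set when invalid


class DatatypesProcessor:
    def __init__(self, check_user_facets: bool = True):
        # The "switch" that turns facet validation on or off.
        self.check_user_facets = check_user_facets

    def validate(self, value: str, datatype: str,
                 facets: Dict[str, int]) -> ValidationResult:
        if datatype != "string":
            raise NotImplementedError("this sketch handles only 'string'")
        failed: List[str] = []
        if self.check_user_facets:
            if "length" in facets and len(value) != facets["length"]:
                failed.append("length")
            if "maxLength" in facets and len(value) > facets["maxLength"]:
                failed.append("maxLength")
        if failed:
            return ValidationResult(False, failed_facets=failed)
        return ValidationResult(True, canonical=value)


class StubProcessor(DatatypesProcessor):
    """Default stub for processors that do not validate datatypes."""

    def validate(self, value: str, datatype: str,
                 facets: Dict[str, int]) -> ValidationResult:
        return ValidationResult(True)
```

With this sketch, validating "ABC" as a string with length 5 reports the length facet as the cause of failure, while the stub accepts every value.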
It also needs to be said that there are no expressions on datatypes; neither are there operations on datatypes.
If we decide to allow datatype specification or specialization in instance documents (see issue "definition-overriding" above) then validating XML processors should be able to validate the format of values in XML documents in these cases as well by using the datatypes processor.
Ed. Note: This section (both its abstract content and its concrete wording) has not yet garnered consensus among WG members.
<?xml version='1.0'?>
<!-- $Id: Overview.html,v 1.4 1999/09/24 22:47:18 connolly Exp $ -->
<!DOCTYPE schema PUBLIC "-//W3C//DTD XMLSCHEMA 19990923//EN" "../structures/structures.dtd" >
<schema xmlns='http://www.w3.org/1999/09/23-xmlschema/'
        targetNS='http://www.w3.org/1999/09/23-xmlschema/datatypes/'
        version='0.4'>

  <element name='datatype'>
    <archetype order='all'>
      <element ref='basetype'/>
      <element archRef='maxBound' minOccurs='0'/>
      <element archRef='minBound' minOccurs='0'/>
      <element ref='minAbsoluteValue' minOccurs='0'/>
      <element ref='maxAbsoluteValue' minOccurs='0'/>
      <element ref='maxInclusive' minOccurs='0'/>
      <element ref='minInclusive' minOccurs='0'/>
      <element ref='precision' minOccurs='0'/>
      <element ref='scale' minOccurs='0'/>
      <element ref='length' minOccurs='0'/>
      <element ref='maxLength' minOccurs='0'/>
      <element ref='enumeration' minOccurs='0'/>
      <element ref='lexicalRepresentation' minOccurs='0'/>
      <element ref='encoding' minOccurs='0'/>
      <attribute name='name' type='NMTOKEN' minOccurs='1'/>
      <attribute name='export' type='boolean' default='true'/>
    </archetype>
  </element>

  <element name='basetype'>
    <archetype content='empty'>
      <attribute name='name' type='NMTOKEN' minOccurs='1'/>
      <attribute name='schemaAbbrev' type='NMTOKEN'/>
      <attribute name='schemaName' type='uri'/>
    </archetype>
  </element>

  <!-- these are here to bridge between the content model above and the elements below -->
  <archetype name='minBound'/>
  <archetype name='maxBound'/>

  <!-- these can only be applied when the base type is 'real' and must be used in concert with one another -->
  <element name='minAbsoluteValue' type='real'/>
  <element name='maxAbsoluteValue' type='real'/>

  <!-- the true datatype of the four following depends on the basetype -->
  <element name='maxExclusive' type='string'>
    <archetype>
      <refines name='maxBound'/>
    </archetype>
  </element>
  <element name='maxInclusive' type='string'>
    <archetype>
      <refines name='maxBound'/>
    </archetype>
  </element>
  <element name='minExclusive' type='string'>
    <archetype>
      <refines name='minBound'/>
    </archetype>
  </element>
  <element name='minInclusive' type='string'>
    <archetype>
      <refines name='minBound'/>
    </archetype>
  </element>

  <element name='precision' type='integer'/>
  <element name='scale' type='integer'/>
  <element name='length' type='integer'/>
  <element name='maxLength' type='integer'/>

  <!-- the following datatype is used to limit the possible values for the encoding facet on the binary datatype -->
  <datatype name='encodings'>
    <basetype name='NMTOKEN'/>
    <enumeration>
      <literal>hex</literal>
      <literal>base64</literal>
    </enumeration>
  </datatype>

  <element name='encoding' type='encodings'/>
  <element name='period' type='timeDuration'/>

  <element name='enumeration'>
    <archetype>
      <element ref='literal' minOccurs='1' maxOccurs='*'/>
    </archetype>
  </element>

  <!-- the true datatype of the following depends on the basetype -->
  <element name='literal' type='string'/>

  <element name='lexicalRepresentation'>
    <archetype>
      <element ref='lexical' minOccurs='1' maxOccurs='*'/>
    </archetype>
  </element>

  <!-- the true datatype of the following depends on the basetype -->
  <element name='lexical' type='string'/>

  <!-- built-in generated datatypes -->
  <!-- only has a few for now, eventually needs to have all of them -->
  <datatype name='integer'>
    <basetype name='decimal'/>
    <scale>0</scale>
  </datatype>
  <datatype name='non-negative-integer'>
    <basetype name='integer'/>
    <minInclusive>0</minInclusive>
  </datatype>
  <datatype name='positive-integer'>
    <basetype name='non-negative-integer'/>
    <minInclusive>1</minInclusive>
  </datatype>
  <datatype name='non-positive-integer'>
    <basetype name='integer'/>
    <maxInclusive>0</maxInclusive>
  </datatype>
  <datatype name='negative-integer'>
    <basetype name='non-positive-integer'/>
    <maxInclusive>-1</maxInclusive>
  </datatype>
  <datatype name='date'>
    <basetype name='recurringInstant'/>
    <period>000000T2400</period>
  </datatype>
  <datatype name='time'>
    <basetype name='recurringInstant'/>
    <period>000000T2400</period>
  </datatype>
</schema>
Ed. Note: This section (both its abstract content and its concrete wording) has not yet garnered consensus among WG members.
<!-- Note that the expansion of 'facets' below is less restrictive than
     that imposed by the XML Schema schema for datatypes: There should
     in fact be no more than one of each of minInclusive, minExclusive,
     maxInclusive, maxExclusive, precision, scale, lexicalRepresentation,
     enumeration, length, maxLength within datatype -->
<!ENTITY % minBound '(minInclusive | minExclusive)'>
<!ENTITY % maxBound '(maxInclusive | maxExclusive)'>
<!ENTITY % bounds '%minBound; | %maxBound;'>
<!ENTITY % numeric '(maxAbsoluteValue, minAbsoluteValue)? | precision | scale'>
<!ENTITY % ordered '%bounds; | %numeric;'>
<!ENTITY % unordered 'lexicalRepresentation | enumeration | length | maxLength | encoding'>
<!ENTITY % facets '%ordered; | %unordered;'>

<!ELEMENT datatype (basetype, (%facets;)*)>
<!ATTLIST datatype name   NMTOKEN      #REQUIRED
                   export (true|false) 'true'>

<!ELEMENT basetype EMPTY>
<!ATTLIST basetype name         NMTOKEN #REQUIRED
                   schemaAbbrev NMTOKEN #IMPLIED
                   schemaName   CDATA   #IMPLIED>

<!ELEMENT minAbsoluteValue (#PCDATA)>
<!ELEMENT maxAbsoluteValue (#PCDATA)>
<!ELEMENT maxExclusive (#PCDATA)>
<!ELEMENT minExclusive (#PCDATA)>
<!ELEMENT maxInclusive (#PCDATA)>
<!ELEMENT minInclusive (#PCDATA)>
<!ELEMENT precision (#PCDATA)>
<!ELEMENT scale (#PCDATA)>
<!ELEMENT length (#PCDATA)>
<!ELEMENT maxLength (#PCDATA)>
<!ELEMENT enumeration (literal)+>
<!ELEMENT literal (#PCDATA)>
<!ELEMENT lexicalRepresentation (lexical)+>
<!ELEMENT lexical (#PCDATA)>
<!ELEMENT encoding (#PCDATA)>
Ed. Note: This section (both its abstract content and its concrete wording) has not yet garnered consensus among WG members.
The following table shows the values of the fundamental facets for each built-in datatype.
Ed. Note: (PVB 1999-07-09) Some entries in this table might conflict with what it says elsewhere in this draft, as creating this table pointed out to me some problems with the way some of the fundamental facets are defined (not to mention any transcription errors on my part in creating the table). We obviously need more introductory text here explaining this table to the reader.
Kind | Datatype | Order (§2.4.1.1) | Bounds (§2.4.1.2) | Cardinality (§2.4.1.3) | Exact and Approximate (§2.4.1.4) | Numeric (§2.4.1.5)
---|---|---|---|---|---|---
Primitive | NMTOKEN (§3.2.1) | no | none | countably infinite | exact | no
Primitive | string (§3.2.3) | yes | none | countably infinite | exact | no
Primitive | boolean (§3.2.4) | no | none | finite | exact | no
Primitive | real (§3.2.5) | yes | none | uncountably infinite | approximate | yes
Primitive | timeInstant (§3.2.6) | yes | no | uncountably infinite | approximate | no
Primitive | timeDuration (§3.2.7) | yes | no | uncountably infinite | approximate | no
Primitive | recurringInstant (§3.2.8) | yes | no | uncountably infinite | approximate | no
Primitive | binary (§3.2.9) | no | no | ? | ? | no
Primitive | uri (§3.2.10) | no | no | uncountably infinite | exact | no
Primitive | language (§3.2.11) | no | no | countably infinite | exact | no
Generated | Name (§3.3.1) | no | no | countably infinite | exact | no
Generated | NCName (§3.3.2) | no | no | countably infinite | exact | no
Generated | ID (§3.3.3) | no | no | countably infinite | exact | no
Generated | IDREF (§3.3.4) | no | no | countably infinite | exact | no
Generated | IDREFS (§3.3.5) | no | no | countably infinite | exact | no
Generated | ENTITY (§3.3.6) | no | no | countably infinite | exact | no
Generated | ENTITIES (§3.3.7) | no | no | countably infinite | exact | no
Generated | NMTOKENS (§3.2.2) | no | no | countably infinite | exact | no
Generated | NOTATION (§3.3.8) | no | no | countably infinite | exact | no
Generated | decimal (§3.3.9) | yes | no | countably infinite | exact | yes
Generated | integer (§3.3.10) | yes | no | countably infinite | exact | yes
Generated | non-negative-integer (§3.3.11) | yes | yes | countably infinite | exact | yes
Generated | positive-integer (§3.3.12) | yes | yes | countably infinite | exact | yes
Generated | non-positive-integer (§3.3.13) | yes | yes | countably infinite | exact | yes
Generated | negative-integer (§3.3.14) | yes | yes | countably infinite | exact | yes
Generated | date (§3.3.15) | yes | no | countably infinite | exact | no
Generated | time (§3.3.16) | yes | no | uncountably infinite | approximate | no
The following table shows the constraining facets which apply to each built-in datatype.
Ed. Note: Some entries in this table might conflict with what it says elsewhere in this draft, as creating this table pointed out to me some problems with the way some of the constraining facets and datatypes are defined (not to mention any transcription errors on my part in creating the table). We obviously need more introductory text here explaining this table to the reader (especially since this one table is broken into three pieces so that it will print nicely).
Datatype | length (§2.4.2.1) | maximum length (§2.4.2.2) | lexical representation (§2.4.2.3) | enumeration (§2.4.2.4) | |
---|---|---|---|---|---|
Primitive | NMTOKEN (§3.2.1) | ? | ? | X | |
string (§3.2.3) | X | X | X | X | |
boolean (§3.2.4) | |||||
real (§3.2.5) | X | ||||
timeInstant (§3.2.6) | X | ||||
timeDuration (§3.2.7) | X | ||||
recurringInstant (§3.2.8) | X | ||||
binary (§3.2.9) | X | ? | ? | ||
uri (§3.2.10) | X | ||||
language (§3.2.11) | ? | X | |||
Generated | Name (§3.3.1) | ? | ? | X | |
NCName (§3.3.2) | ? | ? | X | ||
ID (§3.3.3) | ? | ? | X | ||
IDREF (§3.3.4) | ? | ? | X | ||
IDREFS (§3.3.5) | ? | ? | X | ||
ENTITY (§3.3.6) | ? | ? | X | ||
ENTITIES (§3.3.7) | ? | ? | X | ||
NMTOKENS (§3.2.2) | ? | ? | X | ||
NOTATION (§3.3.8) | ? | ? | X | ||
decimal (§3.3.9) | X | ||||
integer (§3.3.10) | X | ||||
non-negative-integer (§3.3.11) | X | ||||
positive-integer (§3.3.12) | X | ||||
non-positive-integer (§3.3.13) | X | ||||
negative-integer (§3.3.14) | X | ||||
date (§3.3.15) | X | ||||
time (§3.3.16) | X |
constraining facets table, cont.
Kind | Datatype | maxInclusive (§2.4.2.7) | maxExclusive (§2.4.2.8) | minInclusive (§2.4.2.9) | minExclusive (§2.4.2.10)
---|---|---|---|---|---
Primitive | NMTOKEN (§3.2.1) | | | |
Primitive | string (§3.2.3) | X | X | X | X
Primitive | boolean (§3.2.4) | | | |
Primitive | real (§3.2.5) | X | X | X | X
Primitive | timeInstant (§3.2.6) | X | X | X | X
Primitive | timeDuration (§3.2.7) | X | X | X | X
Primitive | recurringInstant (§3.2.8) | X | X | X | X
Primitive | binary (§3.2.9) | | | |
Primitive | uri (§3.2.10) | | | |
Primitive | language (§3.2.11) | | | |
Generated | Name (§3.3.1) | | | |
Generated | NCName (§3.3.2) | | | |
Generated | ID (§3.3.3) | | | |
Generated | IDREF (§3.3.4) | | | |
Generated | IDREFS (§3.3.5) | | | |
Generated | ENTITY (§3.3.6) | | | |
Generated | ENTITIES (§3.3.7) | | | |
Generated | NMTOKENS (§3.2.2) | | | |
Generated | NOTATION (§3.3.8) | | | |
Generated | decimal (§3.3.9) | X | X | X | X
Generated | integer (§3.3.10) | X | X | X | X
Generated | non-negative-integer (§3.3.11) | X | X | X | X
Generated | positive-integer (§3.3.12) | X | X | X | X
Generated | non-positive-integer (§3.3.13) | X | X | X | X
Generated | negative-integer (§3.3.14) | X | X | X | X
Generated | date (§3.3.15) | X | X | X | X
Generated | time (§3.3.16) | X | X | X | X
constraining facets table, cont.
Ed. Note: This section (both its abstract content and its concrete wording) has not yet garnered consensus among WG members.
Ed. Note: The following description of regular expressions is copied (with slight modification) by permission from the documentation of the [Perl] programming language. This entire section should probably be replaced by something derived from the Unicode Regex TechReport [Unicode Regular Expression Guidelines] and the ECMAScript Regex proposal [ECMAScript Regex].
Issue (perl-regex): Should the final recommendation use Perl's regular expression "extensions"?
[Definition:] Regular expressions, similar to those in [Perl], can be used to constrain the format of strings. A regular expression is an alphanumeric string consisting of character symbols. Each symbol, which is usually one character but may be two characters, is a placeholder that stands for a set of characters.
Any single character matches itself, unless it is a metacharacter with a special meaning described here or above. You can cause characters that normally function as metacharacters to be interpreted literally by prefixing them with a "\" (e.g., "\." matches a ".", not any character; "\\" matches a "\"). A series of characters matches that series of characters in the target string, so the pattern blurfl would match "blurfl" in the target string.
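The escaping rules above can be tried out with any Perl-style regular expression engine. The following is a minimal sketch, assuming Python's `re` module as a stand-in for the Perl syntax this draft borrows; the helper names `matches_whole` and `found_in` are illustrative and not part of this specification.

```python
import re

def matches_whole(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

def found_in(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

# An unescaped "." is a metacharacter and matches any single character.
print(matches_whole(r".", "a"))    # True
# An escaped "\." matches only a literal period.
print(matches_whole(r"\.", "."))   # True
print(matches_whole(r"\.", "a"))   # False
# An escaped "\\" matches a single backslash character.
print(matches_whole(r"\\", "\\"))  # True
# A series of ordinary characters matches that same series in the target string.
print(found_in(r"blurfl", "a blurfl appears here"))  # True
```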
You can specify a character class, by enclosing a list of characters in [], which will match any one character from the list. If the first character after the "[" is "^", the class matches any character not in the list. Within a list, the "-" character is used to specify a range, so that a-z represents all characters between "a" and "z", inclusive. If you want "-" itself to be a member of a class, put it at the start or end of the list, or escape it with a backslash. (The following all specify the same class of three characters: [-az], [az-], and [a\-z]. All are different from [a-z], which specifies a class containing twenty-six characters.)
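The equivalence of the three-character classes can be checked mechanically. This is an illustrative sketch only, again assuming Python's `re` module as a stand-in engine; the helper `class_members` and the test alphabet (lowercase ASCII letters plus "-") are assumptions made for the example.

```python
import re
import string

# Hypothetical test alphabet: the lowercase ASCII letters plus the hyphen.
ALPHABET = string.ascii_lowercase + "-"

def class_members(char_class, alphabet=ALPHABET):
    """Return the set of characters in alphabet matched by the character class."""
    return {ch for ch in alphabet if re.fullmatch(char_class, ch)}

# The three spellings below all denote the same class of three characters.
print(sorted(class_members(r"[-az]")))    # ['-', 'a', 'z']
print(class_members(r"[-az]") == class_members(r"[az-]") == class_members(r"[a\-z]"))  # True

# A range, by contrast, covers every letter from "a" to "z" inclusive.
print(len(class_members(r"[a-z]")))       # 26

# A leading "^" negates the class: within this alphabet only "-" remains.
print(sorted(class_members(r"[^a-z]")))   # ['-']
```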
Certain characters are used as metacharacters. The following list contains all of the metacharacters and their meanings.
Within a regular expression, the following standard quantifiers are recognized:
The following character sequences also have special meaning within a regular expression.
Ed. Note: We should probably define XML-specific character sequences for things like Nmtoken, Name, etc., as well as ones for the character classes listed in XML 1.0 Appendix B, Character Classes.
Regular expressions may also contain the following zero-width assertions:
A word boundary (\b) is defined as a spot between two characters that has a \w on one side of it and a \W on the other side of it (in either order), counting the imaginary characters off the beginning and end of the string as matching a \W.
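To make the definition concrete, the sketch below lists every position in a string at which the word-boundary assertion succeeds. It assumes Python's `re` module, whose `\b` follows the same definition (including treating the string's ends as non-word characters); the function name `word_boundaries` is hypothetical.

```python
import re

def word_boundaries(text):
    """Return the offsets in text at which the word-boundary assertion matches."""
    # The assertion is zero-width, so each match is empty and only its
    # starting offset is of interest.
    return [m.start() for m in re.finditer(r"\b", text)]

# In "ab cd" the boundaries sit before "a", after "b", before "c", and after "d".
print(word_boundaries("ab cd"))  # [0, 2, 3, 5]

# A hyphen is not a word character, so it is bounded on both sides.
print(word_boundaries("a-b"))    # [0, 1, 2, 3]

# A string with no word characters has no boundaries at all.
print(word_boundaries(" - "))    # []
```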
Example
- 555-1212 is matched by \d{3}-\d{4} (phone number)
- 888-555-1212 is matched by (\d{3}-)?\d{3}-\d{4} (phone number with optional area code)
- $123,45.90 is matched by \$\d{3},\d{2}\.\d{2}
- 123-45-5678 is matched by \d{3}-?\d{2}-?\d{4} (Social Security Number)
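The four example patterns can be exercised as a check on the design. The following sketch assumes Python's `re` module and assumes that a value conforms only when the pattern matches the whole value; this draft does not state whether such anchoring is implied. The names `EXAMPLES` and `conforms` are illustrative.

```python
import re

# (value, pattern) pairs taken from the examples in this section.
EXAMPLES = [
    ("555-1212", r"\d{3}-\d{4}"),
    ("888-555-1212", r"(\d{3}-)?\d{3}-\d{4}"),
    ("$123,45.90", r"\$\d{3},\d{2}\.\d{2}"),
    ("123-45-5678", r"\d{3}-?\d{2}-?\d{4}"),
]

def conforms(value, pattern):
    """Return True if the pattern matches the entire value (assumed semantics)."""
    return re.fullmatch(pattern, value) is not None

for value, pattern in EXAMPLES:
    print(value, "is matched by", pattern, "->", conforms(value, pattern))

# The optional group and the optional hyphens also admit shorter forms.
print(conforms("555-1212", r"(\d{3}-)?\d{3}-\d{4}"))   # True
print(conforms("123455678", r"\d{3}-?\d{2}-?\d{4}"))   # True
# A value missing the required hyphen is rejected by the first pattern.
print(conforms("5551212", r"\d{3}-\d{4}"))             # False
```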
Ed. Note: This section (both its abstract content and its concrete wording) has not yet garnered consensus among WG members.
The editors acknowledge the members of the W3C XML Schema Working Group, the members of other W3C Working Groups, and industry experts in other forums who have contributed directly or indirectly to the process or content of creating this document.