[Cache from http://www.personal.u-net.com/~sgml/DSDL-Datatypes-v04.htm; see the canonical source if possible. See also http://dsdl.org/.]

Datatypes for document content validation

Part 5 of the Document Schema Description Language (DSDL) defines a set of primitive datatypes, a set of DSDL datatypes, a set of commonly required derived datatypes and a method for defining customized datatypes that can be used to validate the contents of specific elements and attributes within DSDL document instances. The specification also includes a set of constraints that can be used to limit the range of primitive datatypes and their derivatives.

The following goals have influenced the way in which datatypes are defined in this specification:

  1. Minimize the number of primitive datatypes
  2. Provide datatypes for constructs defined in other parts of DSDL
  3. Define an extensive set of derived datatypes within the standard
  4. Allow users to define customized datatypes based on primitive or derived datatypes, or by extending existing customized definitions
  5. Allow derived datatypes to be defined by assigning limits to existing datatypes
  6. Allow datatypes to be defined through matching values to patterns
  7. Make it possible to define valid subsets of permitted values of datatypes
  8. Make it possible to restrict the use of specific values within a datatype

1. Primitive Datatypes

The following minimal set of primitive datatypes can be used to derive other datatypes:

[Issue 1-6: Should the language-independent datatypes defined in ISO 11404 (http://std.dkuug.dk/jtc1/sc22/wg11/docs/iso11404.pdf) that are not incorporated into XML Schema Part 2 be considered, given that ISO 11404 is currently under review by SC22, having first been published in 1996?

NB: ISO 11404 primitives are:

primitive-type = boolean-type | state-type | enumerated-type | character-type | ordinal-type | time-type | integer-type | rational-type | scaled-type | real-type | complex-type | void-type 

The types that specifically need to be considered for inclusion are "state-type", "ordinal-type", "scaled-type" and "void-type". The "boolean-type", "character-type", "time-type" and "real-type" can be equated to existing definitions. Whether "complex-type" and "enumerated-type" are true datatypes or expressions of ways in which datatypes can be created from basic types is questionable.

Other ISO 11404 types of possible interest include:

generated-type = pointer-type | procedure-type | choice-type | aggregate-type 
aggregate-type = record-type | set-type | sequence-type | bag-type | array-type | table-type
]

2. DSDL Datatypes

The following datatypes are used to identify constructs that conform to DSDL constraints:

NB: This list will need to be revised in the light of the development of DSDL.

[Issue 2-1: At what point is datatype validation applied? Do we really need all of these?]

[Issue 2-6: What other types of DSDL datatypes will there be?]

[Issue 2-7: Should we include DSSSL types such as quantity, pair, real (if floating point is not defined in an acceptable form for DSSSL) and number (if not directly mappable to the fixed point number construct)?]

3. Commonly Required Derived Datatypes

The following commonly used datatypes can be derived from primitive datatypes:

[Issue 3-10: Do we need to define all the alternative ISO 8601 date variants (e.g. without hyphens and colons, etc), or will the limited dateTime primitive definitions shown above be sufficient?]

[Issue 3-11: What other options do we need to allow for? Should ISO 639 language be included (if so what about the IETF rules re extensions)? What about currency as used in XForms, in support of ISO 4217 (or is this simply an application of decimal)?]

4. Constraining Properties

The following properties can be used to constrain datatypes that are derived from strings:

or

The following properties can be used to constrain datatypes that are derived from fixed point and floating point numbers:

The following additional properties can be used to constrain fixed point numbers:

The following properties can be used to constrain any datatype:

[Issue 4.2: Should more than one pattern grammar be allowed for by applying a pointer to the relevant grammar at some higher level in the syntax?]
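
Appendix A names four range properties for numeric datatypes: minInclusive, minExclusive, maxInclusive and maxExclusive. The following is a minimal Python sketch of how such bounds might be checked. It assumes the properties carry the meanings their names suggest (as the same-named facets do in W3C XML Schema Part 2); the function name in_range and its keyword parameters are hypothetical and are not part of this specification.

```python
from decimal import Decimal


def in_range(value, min_inclusive=None, min_exclusive=None,
             max_inclusive=None, max_exclusive=None):
    """Check a decimal literal against whichever bounds are supplied.

    Assumption: "inclusive" bounds admit the bound itself and
    "exclusive" bounds do not.
    """
    v = Decimal(value)
    if min_inclusive is not None and v < Decimal(min_inclusive):
        return False
    if min_exclusive is not None and v <= Decimal(min_exclusive):
        return False
    if max_inclusive is not None and v > Decimal(max_inclusive):
        return False
    if max_exclusive is not None and v >= Decimal(max_exclusive):
        return False
    return True
```

Decimal is used rather than float so that fixed point literals compare exactly.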

5. Deriving Customized Datatypes

Each type defined in a DSDL schema must be assigned a unique identifier as its name, which must not be identical to any of the names assigned to primitive or derived datatypes defined in this standard, or to the name of any customized datatype imported into the schema. The datatype name must conform to the rules for defining DSDL names.
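
The naming rule above amounts to a membership test against three sets of names. The following is a minimal Python sketch under stated assumptions: the name sets are hypothetical placeholders (the standard's own lists of primitive and derived datatype names are not reproduced here), and conformance to the rules for DSDL names is not checked.

```python
# Hypothetical name sets, for illustration only. The real sets come from
# the standard and from whatever the schema imports.
STANDARD_NAMES = {"string", "integer"}
IMPORTED_NAMES = {"partNumber"}


def name_conflicts(new_name, defined_names):
    """Return the reasons why new_name may not be used (empty if none).

    defined_names holds the names already defined in the current schema.
    """
    problems = []
    if new_name in STANDARD_NAMES:
        problems.append("clashes with a datatype defined in the standard")
    if new_name in IMPORTED_NAMES:
        problems.append("clashes with an imported customized datatype")
    if new_name in defined_names:
        problems.append("already defined in this schema")
    return problems
```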

[Issue 5-1: Should datatype names be DSDL unique identifiers or keys? How can we ensure that imported datatypes do not share the same name?]

Each set of constraints defined for a datatype shall be based on either a primitive datatype, a derived datatype defined in this standard or a customized datatype defined in, or imported into, the same schema. The unique identifier of the base datatype must be specified as the base attribute of the constraints element. Where more than one constraints element is defined for a datatype, the constraints are applied sequentially to create a "compound datatype" made up of components of different datatypes.

Note: It is an error if two consecutive constraint elements have the same base type, or have base types derived from the same primitive datatype.
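
The error condition in the note can be checked by comparing the primitive datatype behind each pair of adjacent constraint sets. The following is a minimal Python sketch; the PRIMITIVE_OF table and the function name are hypothetical, and a real processor would resolve each base name through the schema rather than a fixed table.

```python
# Hypothetical mapping from a datatype name to the primitive datatype it
# is ultimately derived from.
PRIMITIVE_OF = {
    "string": "string",
    "integer": "integer",
    "type-a": "string",  # a customized datatype based on string
}


def consecutive_base_errors(bases):
    """Return index pairs (i, i + 1) of adjacent constraint sets whose
    base types are derived from the same primitive datatype.

    Identical base types share a primitive, so this one comparison
    covers both conditions in the note.
    """
    errors = []
    for i in range(len(bases) - 1):
        if PRIMITIVE_OF[bases[i]] == PRIMITIVE_OF[bases[i + 1]]:
            errors.append((i, i + 1))
    return errors
```

Only adjacent pairs are compared, so a sequence such as string, integer, string is not reported.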

[Issue 5-6: Do we need elements around sets of constraints to allow the definition of structured constructs, such as arrays and tables? If so, do we also need sets, bags and repeatable sequences to be definable?]

Constraining properties shall be entered as subelements of the constraints definition, using elements whose names are shown in brackets after the name of each property. Only those properties relevant to the primitive datatype from which the datatype is derived may be defined. The only property that can be specified more than once in each set of constraints is the pattern property, which can be repeated as many times as necessary to indicate all the known patterns for the datatype. (Patterns are checked in the order in which they are listed in the instance.)
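
The rule that only the pattern property may be repeated can be checked by counting the children of a constraints element. The following is a minimal Python sketch using the standard library's xml.etree.ElementTree; the function name is hypothetical, and the pattern contents p1 and p2 are placeholders because the pattern grammar is still an open issue.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def repeated_properties(constraints_xml):
    """Return the sorted names of child elements, other than pattern,
    that occur more than once in a constraints element."""
    root = ET.fromstring(constraints_xml)
    counts = Counter(child.tag for child in root)
    return sorted(tag for tag, n in counts.items()
                  if n > 1 and tag != "pattern")


# Two patterns are allowed; p1 and p2 are placeholder contents.
repeated_pattern = """<constraints base="string">
  <pattern>p1</pattern>
  <pattern>p2</pattern>
  <fixedLength>3</fixedLength>
</constraints>"""

# Two minLength properties are not allowed.
repeated_min_length = """<constraints base="string">
  <minLength>1</minLength>
  <minLength>2</minLength>
</constraints>"""
```

Here repeated_properties(repeated_pattern) returns an empty list, while repeated_properties(repeated_min_length) reports minLength.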

[Issue 5-2: How will DSDL allow us to manage the fact that models of the constraints element will be dependent on the value given to the base attribute, which may in its turn be derived from a customized datatype rather than a primitive?]

[Issue 5-3a: How will patterns be specified? (Is this a good name, given its application in Part 2, or should it be renamed datatypePattern?) How can we restrict the number of times a part of a pattern is repeated? Should users be able to use the asDefinedIn attribute to reference external definitions of datatypes?]

Where only specified values are to be permitted, a list of validValues may be specified. The list can be assigned a rule attribute value of either "no-others", to indicate that only the specified values are valid, or "with-others", to indicate that the list contains only the currently known values, which the user can extend by entering any other value that conforms to the constraints assigned to the datatype. Where no value is specified for the rule attribute, the default value of "no-others" applies. Optionally, a statement of meaning can be assigned to each value. The contents of all values entered within a single accept element must be unique. Where the datatype of a valid value differs from that assigned as the base datatype of the containing constraints element, the optional datatype attribute must be used to indicate the datatype of the entered value.

Note: The mixing of datatypes within lists is deprecated, even though it has been enabled.
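
The effect of the rule attribute can be summarized as a small decision function. The following is a minimal Python sketch; the function and parameter names are hypothetical, and the satisfies_constraints flag stands in for the outcome of the datatype's other constraints, which are not modelled here.

```python
def value_permitted(value, listed_values, rule=None,
                    satisfies_constraints=True):
    """Decide whether value passes a validValues list.

    rule is the value of the rule attribute; None means the attribute
    was omitted, in which case the default "no-others" applies.
    """
    if rule is None:
        rule = "no-others"
    if rule not in ("no-others", "with-others"):
        raise ValueError("unknown rule: " + rule)
    if not satisfies_constraints:
        return False
    if rule == "no-others":
        return value in listed_values
    # with-others: the list is open, so any conforming value is accepted.
    return True
```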

The invalidValues element can be used to identify specific values that may not be used as valid entries. Optionally a statement of the reason why the value is invalid can be assigned to each value. The contents of all values entered within a single reject element must be unique.
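
The uniqueness requirement on the values within a single accept or reject element can be checked in the same way for both. The following is a minimal Python sketch; the function name is hypothetical, and comparing values as whitespace-trimmed strings is an assumption, since the specification does not yet say how values are compared.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_values(container_xml):
    """Return the values that occur more than once among the value
    children of a single accept or reject element.

    Assumption: values are compared as whitespace-trimmed strings.
    """
    root = ET.fromstring(container_xml)
    counts = Counter((v.text or "").strip() for v in root.findall("value"))
    return sorted(text for text, n in counts.items() if n > 1)
```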

[Issue 5-4: Should datatypes be allowed to have structured values? (NB: XML Schema defines permitted enumeration values as attributes, which allows the element itself to consist solely of annotation.)]

[Issue 5-5: Given that RelaxNG has exceptPatterns, do we also need the ability to inhibit values as part of the datatype definition? What is the relationship between patterns declared to be invalid for the whole datatype and those declared to be invalid for a specific element?]

The following example shows how customized datatypes can be expressed using the elements defined in Appendix A.

<datatype name="type-a">
 <constraints base="string">
  <fixedLength>3</fixedLength>
  <pattern>[a-fA-F(3)]</pattern>
  <validValues rule="no-others">
   <accept>
    <value>abc</value>
    <meaning>Latin alphabet</meaning>
   </accept>
   <accept>
    <value>def</value>
    <meaning>Braille alphabet</meaning>
   </accept>
   ...
  </validValues>
  <invalidValues>
   <reject>
    <value>bad</value>
    <reason>Can be confused with bed.</reason>
   </reject>
  </invalidValues>
 </constraints>
</datatype>
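
To make the example concrete, the following Python sketch reads a cut-down copy of the type-a definition and tests candidate values against it. Several things here are assumptions rather than part of the specification: the function name check, the order in which the properties are tested, and the omission of the pattern property (its grammar is still an open issue), of the meaning elements and of the elided entries.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the example: no pattern, no meaning elements, and
# only the two accept entries that the example spells out.
TYPE_A = """<datatype name="type-a">
 <constraints base="string">
  <fixedLength>3</fixedLength>
  <validValues rule="no-others">
   <accept><value>abc</value></accept>
   <accept><value>def</value></accept>
  </validValues>
  <invalidValues>
   <reject>
    <value>bad</value>
    <reason>Can be confused with bed.</reason>
   </reject>
  </invalidValues>
 </constraints>
</datatype>"""


def check(value, datatype_xml):
    """Return (accepted, explanation) for value against one datatype."""
    constraints = ET.fromstring(datatype_xml).find("constraints")
    length = constraints.findtext("fixedLength")
    if length is not None and len(value) != int(length):
        return False, "length is not " + length
    for reject in constraints.iter("reject"):
        if value in [v.text for v in reject.findall("value")]:
            return False, reject.findtext("reason") or "listed as invalid"
    valid = constraints.find("validValues")
    if valid is not None and valid.get("rule", "no-others") == "no-others":
        if value not in [v.text for v in valid.iter("value")]:
            return False, "not in validValues"
    return True, "ok"
```

With this definition, "abc" is accepted, "abcd" fails the length test, "bad" is rejected with the stated reason, and "xyz" fails because the list is closed.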

Appendix A: DSDL Schema for the Specification of Datatypes

The following (yet-to-be-validated) DSDL schema can be used to validate datatype definitions:

[Issue A.1: Can we validly use DSDL datatypes to define the specification for DSDL datatypes?]

<grammar
 datatypeLibrary="http://www.iso.ch/jtc1/sc34/iso19757/Part5.dsdl"
 ns="http://relaxng.org/ns/structure/1.0"
 xmlns="http://relaxng.org/ns/structure/1.0">
<start>
 <ref name="datatypeLibrary"/>
</start>
<element name="datatypeLibrary">
 <optional>
  <attribute name="name">
   <data type="ID"/>
  </attribute>
 </optional>
 <choice>
  <group>
   <oneOrMore>
    <ref name="importedDefinitions"/>
   </oneOrMore>
   <zeroOrMore>
    <ref name="datatype"/>
   </zeroOrMore>
  </group>
  <oneOrMore>
   <ref name="datatype"/>
  </oneOrMore>
 </choice>
</element>
<element name="importedDefinitions">
 <attribute name="source">
  <data type="URI"/>
 </attribute>
 <empty/>
</element> 
<element name="datatype">
 <attribute name="name">
  <data type="localName"/>
 </attribute>
 <oneOrMore>
  <ref name="constraints"/>
 </oneOrMore>
</element>
<element name="constraints">
 <attribute name="base">
  <data type="datatypeName"/>
 </attribute>
 <zeroOrMore>
  <choice>
   <!-- Only the pattern element may be repeated -->
   <ref name="pattern"/>
   <ref name="validValues"/>
   <ref name="invalidValues"/>
   <ref name="minExclusive"/>
   <ref name="maxExclusive"/>
   <ref name="minInclusive"/>
   <ref name="maxInclusive"/>
   <ref name="fixedLength"/>
   <ref name="minLength"/>
   <ref name="maxLength"/>
   <ref name="totalDigits"/>
   <ref name="decimalDigits"/>
  </choice>
 </zeroOrMore>
</element>
<element name="pattern">
 <optional>
  <attribute name="asDefinedIn">
   <data type="URI"/>
  </attribute>
  <attribute name="notation">
   <data type="NOTATION"/>
  </attribute>
 </optional>
 <data type="string"/>
</element>
<element name="validValues">
 <optional>
  <attribute name="rule">
   <choice>
    <value>with-others</value>
    <value>no-others</value>
   </choice>
  </attribute>
 </optional>
 <oneOrMore>
  <ref name="accept"/>
 </oneOrMore>
</element>
<element name="value">
  <data type="string"/>
  <optional>
   <attribute name="datatype">
    <data type="datatypeName"/>
   </attribute>
  </optional>
</element>
<element name="accept">
 <oneOrMore>
  <ref name="value"/>
  <zeroOrMore>
   <element name="meaning">
    <oneOrMore>
     <anyName>
      <except>
       <nsName/>
      </except>
     </anyName>
     <text/>
    </oneOrMore>
   </element>
  </zeroOrMore>
 </oneOrMore>
</element>  
<element name="invalidValues">
  <ref name="reject"/>
</element>
<element name="reject">
 <oneOrMore>
  <ref name="value"/>
  <zeroOrMore>
   <element name="reason">
    <oneOrMore>
     <anyName>
      <except>
       <nsName/>
      </except>
     </anyName>
     <text/>
    </oneOrMore>
   </element>
  </zeroOrMore>
 </oneOrMore>
</element>
<element name="fixedLength">
  <data type="integer"/>
</element> 
<element name="minLength">
  <data type="integer"/>
</element> 
<element name="maxLength">
  <data type="integer"/>
</element> 
<element name="minExclusive">
  <data type="integer"/>
</element> 
<element name="maxExclusive">
  <data type="integer"/>
</element> 
<element name="minInclusive">
  <data type="integer"/>
</element> 
<element name="maxInclusive">
  <data type="integer"/>
</element> 
<element name="totalDigits">
  <data type="integer"/>
</element> 
<element name="decimalDigits">
  <data type="integer"/>
</element>  
</grammar>