Datatypes for document content validation
Part 5 of the Document Schema Description Language (DSDL) defines a set of
primitive datatypes, a set of DSDL datatypes, a set of commonly required
derived datatypes and a method for defining customized datatypes that can be
used to validate the contents of specific elements and attributes within DSDL
document instances. The specification also includes a set of constraints that
can be used to limit the range of primitive datatypes and their derivatives.
1. Primitive Datatypes
The following primitive datatypes can be used to derive other datatypes:
- String (string)
A contiguous sequence of parsed characters that conform to a specified
character set, or to the default ISO 10646 Universal
Character Set (UCS) if no character set has been specified during
validation. Every character in the string has a
corresponding UCS code point, which is an integer.
Constraints: fixedLength; minLength; maxLength; pattern
- Boolean (boolean)
A binary value. Can be expressed using the strings "true" or
"false" or the integers 0 and 1.
- Fixed point number (decimal)
A sequence of digits which can optionally contain a decimal point
(expressed either as a period or a comma) separating a sequence of one
or more integer digits on the left from one or more decimal digits on
the right. The integer digits can optionally be preceded by a plus sign
(+) that confirms that the number has a positive value (the default) or
a hyphen (-) that indicates that it has a negative value.
Note: Commas may not be used as digit-group (thousands) separators within
the integer digits. Values between 1 and -1 must have an integer part of 0
(for example, 0.5 rather than .5).
Constraints: totalDigits; decimalDigits; minInclusive;
maxInclusive; minExclusive; maxExclusive; pattern
Derived units: integer; positiveInteger; negativeInteger;
nonPositiveInteger; nonNegativeInteger; long; int; short; unsignedLong;
unsignedInt; unsignedShort
[Issue 1-1: Do we need to take the Fract and Accum fixed-point
datatypes being proposed for C in ISO WDTR 18037 (http://std.dkuug.dk/JTC1/SC22/WG14/www/docs/n972.pdf)
into account?]
- Floating point number (real)
A number consisting of a mantissa expressed as a decimal followed,
optionally, by the character "E" or "e", followed by
an exponent expressed as an integer. Both the mantissa and the exponent
can optionally be preceded by a plus sign (+) that confirms that the
following number has a positive value (the default) or a hyphen (-) that
indicates it has a negative value.
Constraints: minInclusive; maxInclusive; minExclusive;
maxExclusive; pattern
[Issue 1-2: Should floating point numbers be constrained as far as
maximum length is concerned?]
[Issue 1-3: Should special patterns be defined to identify positive
and/or negative infinity?]
Derived units: double
- Date and/or Time (dateTime)
A specific instance of Gregorian time, defined using the ISO 8601
extended format CCYY-MM-DDThh:mm:ss.sss±hh:mm where "CC"
represents the century (optionally preceded by a hyphen to identify
dates preceding the Gregorian calendar start point), "YY" the
year, "MM" the month and "DD" the day. The letter
"T" is the date/time separator and "hh",
"mm", "ss.sss" represent hours, minutes and seconds
(including fractional seconds following a period) respectively. Where
the time is specified for a timezone other than the Coordinated
Universal Time (UTC) zone the relevant time offset must be entered after
either a plus sign (+) or a hyphen (-) to specify the number of hours and
minutes difference from UTC/GMT. The letter Z may be used in place of
the plus or minus, without following numbers, to confirm that
the default Coordinated Universal Time zone has been used to specify the
time. Both the timezone and the time data are optional.
Note: The CCYY value may not be 0000. Months are defined using a pair
of digits in the range 01 to 12. Days are defined using a pair of digits
in the range 01 to 31, with certain values being forbidden in combination
with specific values for the month. Hours (hh) are defined as a pair of
digits in the range 00 to 23, minutes (mm) are defined as a pair of
digits in the range 00 to 59 and seconds are defined as a decimal number
in the range 00.000 to 59.999.
[Issue 1-4: Do we need a mechanism to allow people to specify that
the last day of the month should be the one that applies, irrespective
of its number?]
Derived units: Gregorian date (CCYY-MM-DD); Gregorian year (CCYY);
recurring month (--MM--, expressed as a pair of digits in the range 01
to 12); recurring day (----DD, expressed as a pair of digits in the
range 01 to 31); time (Thh:mm:ss.sss±hh:mm where hour is expressed as a
pair of digits in the range 00 to 23, minute is expressed as a pair of
digits in the range 00 to 59 and seconds are expressed as a decimal number
in the range 00.000 to 59.999)
- Period of time (period)
Period of Gregorian time defined using the ISO 8601 PnYnMnDTnHnMnS
format, where nY represents the number of years, nM the
number of months, nD the number of days, 'T' is the date/time
separator, nH the number of hours, nM the number of
minutes and nS the number of seconds. Any numeric value (n)
can be expressed as a decimal of arbitrary precision, provided that it is
the last value given, with all subsequent numbers and their letters omitted
(for example, P1.5Y or PT36.5H).
[Issue 1-5: The XML Schema representation of period only allows decimals
to be used to qualify seconds. Should DSDL be similarly constrained, or
should it remain more compatible with ISO 8601?]
- Hexadecimal binary sequence (hexBinary)
Sequence of hexadecimal numbers (in the range 00 to FF) that encode a
finite length sequence of binary octets.
Constraints: fixedLength; minLength; maxLength; pattern
- Base64 binary sequence (base64Binary)
A binary stream encoded using the Base64 Content-Transfer-Encoding
defined in Section 6.8 of IETF RFC 2045.
Constraints: fixedLength; minLength; maxLength; pattern
- Resource Identifier (URI/IRI)
A sequence of characters that forms a Uniform Resource Identifier (URI)
as defined in IETF RFC 2396, as amended by IETF RFC 2732, or an
Internationalized Resource Identifier (IRI) once that specification is
formally approved as an IETF standard. Values can be absolute or relative,
and may have an optional fragment identifier.
Constraints: fixedLength; minLength; maxLength; pattern
- Notation Type Name (notation)
A string that matches the name assigned to one of the notations defined
in a DSDL schema.
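As an informal illustration of the lexical forms described above, the following fragment shows one sample value for several of the primitives. The element names (sample, flag, price and so on) are hypothetical and are not defined by this Part; only the values are of interest.

```xml
<!-- Hypothetical instance fragment; element names are illustrative only -->
<sample>
  <flag>true</flag>                              <!-- boolean -->
  <price>-12.50</price>                          <!-- decimal -->
  <ratio>6.02E+23</ratio>                        <!-- real -->
  <stamp>2002-10-10T12:00:00.000+01:00</stamp>   <!-- dateTime -->
  <length>P1Y2M3DT4H5M6.5S</length>              <!-- period -->
  <octets>0FB7</octets>                          <!-- hexBinary -->
</sample>
```

Each value follows the description of its primitive given above: the decimal has an integer part and a sign, the real has a signed exponent, the dateTime carries a timezone offset, and the period uses a decimal only in its final (seconds) component.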
[Issue 1-6: Should the language-independent datatypes defined in ISO 11404
(http://std.dkuug.dk/jtc1/sc22/wg11/docs/iso11404.pdf) that have not been
incorporated into XML Schema Part 2 be considered, given that ISO 11404 is
currently under review by SC22, having first been published in 1996?
NB: ISO 11404 primitives are:
primitive-type = boolean-type | state-type | enumerated-type |
character-type | ordinal-type | time-type | integer-type |
rational-type | scaled-type | real-type | complex-type | void-type
The types that specifically need to be considered for
inclusion are "state-type", "ordinal-type",
"scaled-type" and "void-type". The "boolean-type",
"character-type", "time-type" and "real-type"
can be equated to existing definitions. Whether "complex-type" and
"enumerated-type" are true datatypes or expressions of ways in which
datatypes can be created from basic types is questionable.
Other ISO 11404 types of possible interest include:
generated-type = pointer-type | procedure-type | choice-type | aggregate-type
aggregate-type = record-type | set-type | sequence-type | bag-type |
array-type | table-type]
2. DSDL Datatypes
The following datatypes are used to identify constructs that conform to DSDL constraints:
NB: This list will need to be revised in the light of the development of
DSDL.
[Issue 2-1: At what point is datatype validation applied? Do we really need
all of these?]
- String with no whitespace that conforms to DSDL naming rules (name)
- Name where namespace prefix has been replaced by URI mapped to
namespace prefix (qualifiedName)
[Issue 2-2: Is qualifiedName really a datatype? Am I right in recording
it as the URI+LocalName rather than Prefix+LocalName as is done in XML?]
[Issue 2-3: Is Prefix also required to record the namespace prefix
associated with the URI?]
- String with no whitespace containing name characters without
restriction on Letter for first character (NMTOKEN)
[Issue 2-4: Should we continue to use SGML-style names that are all
caps, or require the use of nameToken, etc, to be more conformant with
other datatype names?]
- Tokenized string containing only valid name tokens (NMTOKENS)
- Tokenized string containing any sequence of characters other than
spaces, tabs and control characters (tokenizedString)
[Issue 2-5: Do we need to allow for tokenized strings that contain
punctuation other than that which is valid in names? For example, tokens
such as name+, name? and M&S would not be valid name tokens but
would be valid within a tokenized string.]
- Name used as unique identifier (ID) [May need to be generalized to
cope with DSDL keys.]
- Name used to reference a unique identifier (IDREF) [May need to be
generalized to cope with DSDL key references.]
- Tokenized string containing names that will be used to reference unique identifiers
(IDREFS)
- Name used to identify formally defined DSDL entity (ENTITY)
- Tokenized string containing names that identify formally defined
DSDL entities (ENTITIES)
[Issue 2-6: What other types of DSDL data types will there be?]
[Issue 2-7: Should we include DSSSL types such as quantity, pair, real (if
floating point is not defined in an acceptable form for DSSSL) and number (if
not directly mappable to the fixed point number construct)?]
3. Commonly Required Derived Datatypes
The following commonly used datatypes can be derived from primitive datatypes:
- Integer (integer)
Fixed point number with no decimal point
<datatype name="integer">
<constraints base="decimal">
<decimalDigits>0</decimalDigits>
</constraints>
</datatype>
[Issue 3-1: Do we need the different bit length and
positive/negative variants for integers to be defined as separate
datatypes as shown in the following entries?]
- Positive integer (positiveInteger)
Integer greater than 0
<datatype name="positiveInteger">
<constraints base="integer">
<minExclusive>0</minExclusive>
</constraints>
</datatype>
- Negative integer (negativeInteger)
Integer less than 0
<datatype name="negativeInteger">
<constraints base="integer">
<maxExclusive>0</maxExclusive>
</constraints>
</datatype>
- Non-positive integer (nonPositiveInteger)
Integer less than or equal to 0
<datatype name="nonPositiveInteger">
<constraints base="integer">
<maxInclusive>0</maxInclusive>
</constraints>
</datatype>
- Non-negative integer (nonNegativeInteger)
Integer greater than or equal to 0
<datatype name="nonNegativeInteger">
<constraints base="integer">
<minInclusive>0</minInclusive>
</constraints>
</datatype>
- 64-bit integer (long)
Integer in range -9223372036854775808 to 9223372036854775807
<datatype name="long">
<constraints base="integer">
<minInclusive>-9223372036854775808</minInclusive>
<maxInclusive>9223372036854775807</maxInclusive>
</constraints>
</datatype>
- 32-bit integer (int)
Integer in range -2147483648 to 2147483647
<datatype name="int">
<constraints base="integer">
<minInclusive>-2147483648</minInclusive>
<maxInclusive>2147483647</maxInclusive>
</constraints>
</datatype>
- 16-bit integer (short)
Integer in range -32768 to 32767
<datatype name="short">
<constraints base="integer">
<minInclusive>-32768</minInclusive>
<maxInclusive>32767</maxInclusive>
</constraints>
</datatype>
[Issue 3-2: Do we need 8-bit integers (bytes) as well, given
that DSDL is based on a 16-bit syntax, UTF16?]
- 64-bit unsigned integer (unsignedLong)
Integer in range 0 to 18446744073709551615
<datatype name="unsignedLong">
<constraints base="integer">
<minInclusive>0</minInclusive>
<maxInclusive>18446744073709551615</maxInclusive>
</constraints>
</datatype>
- 32-bit unsigned integer (unsignedInt)
Integer in range 0 to 4294967295
<datatype name="unsignedInt">
<constraints base="integer">
<minInclusive>0</minInclusive>
<maxInclusive>4294967295</maxInclusive>
</constraints>
</datatype>
- 16-bit unsigned integer (unsignedShort)
Integer in range 0 to 65535
<datatype name="unsignedShort">
<constraints base="integer">
<minInclusive>0</minInclusive>
<maxInclusive>65535</maxInclusive>
</constraints>
</datatype>
[Issue 3-3: Do we need 8-bit bytes as well, given that DSDL is
based on a 16-bit syntax, UTF16?]
- 64-bit floating point number (double)
IEEE double-precision 64-bit floating point number conforming to IEEE
754-1985.
[Issue 3-4: Do we need separate datatypes for 16 and 32 bit
floating numbers?]
<datatype name="double">
<constraints base="real">
<minInclusive>-1.7976931348623157e308</minInclusive>
<maxInclusive>1.7976931348623157e308</maxInclusive>
</constraints>
</datatype>
[Issue 3-5: How can we constrain the exponent to be in the
range -1075 to 970, and ensure the mantissa does not exceed 2^53?]
- Specific Time (time)
Thh:mm:ss.sss±hh:mm subset of the dateTime primitive
<datatype name="time">
<constraints base="dateTime">
<pattern>T[0-2][0-9]:[0-5][0-9]:[0-5][0-9](.[0-9]([0-9]([0-9])?)?)?
([+-][0-5][0-9]:[0134][05])?</pattern>
</constraints>
</datatype>
[Issue 3-6: How can the constraints on the values of time be
accurately expressed? For example, the pattern suggested above does
not restrict hours to the range 00 to 23 or timezones to 00, 15, 30 or
45.]
- Gregorian Date (gDate)
CCYY-MM-DD subset of the dateTime primitive
<datatype name="gDate">
<constraints base="dateTime">
<pattern>[0-9](4)-[0-1][0-9]-[0-3][0-9]</pattern>
</constraints>
</datatype>
[Issue 3-7: How can the constraints on the values of date be
accurately expressed?]
- Gregorian Recurring Month (gMonth)
Subset of the dateTime primitive where CCYY and DD are replaced by - to
give --MM--, with no time specified.
<datatype name="gMonth">
<constraints base="dateTime">
<pattern>--[0-1][0-9]--</pattern>
</constraints>
</datatype>
- Gregorian Recurring Day Of Month (gDay)
Subset of the dateTime primitive where CCYY and MM are replaced by - to
give ----DD, with no time specified.
<datatype name="gDay">
<constraints base="dateTime">
<pattern>----[0-3][0-9]</pattern>
</constraints>
</datatype>
[Issue 3-8: Should Day Within Week be a derived datatype?]
- Gregorian Recurring Day in Year (gMonthDay)
Subset of the dateTime primitive where CCYY is replaced by - to
give --MM-DD, with no time specified.
<datatype name="gMonthDay">
<constraints base="dateTime">
<pattern>--[0-1][0-9]-[0-3][0-9]</pattern>
</constraints>
</datatype>
- Gregorian Month In Year (gYearMonth)
Subset of the dateTime primitive consisting solely of CCYY-MM.
<datatype name="gYearMonth">
<constraints base="dateTime">
<pattern>[0-9](4)-[0-1][0-9]</pattern>
</constraints>
</datatype>
- Gregorian Duration (gDuration)
Start and end dates expressed as two dateTime primitives separated
by a slash (/) (CCYY-MM-DDThh:mm:ss.sss±hh:mm/CCYY-MM-DDThh:mm:ss.sss±hh:mm)
[Issue 3-9: How can this constraint be expressed? (Should duration be
a primitive?)]
[Issue 3-10: Do we need to define all the alternative ISO 8601 date variants
(e.g. without hyphens and colons, etc), or will the limited dateTime primitive
definitions shown above be sufficient?]
[Issue 3-11: What other options do we need to allow for? Should ISO 639
language be included (if so what about the IETF rules re extensions)? What
about currency as used in XForms, in support of ISO 4217 (or is this simply an
application of decimal)?]
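If 8-bit integers (see Issues 3-2 and 3-3) were to be added, they could presumably be derived in the same way as the other bit-length variants. The following is only a sketch: the names byte and unsignedByte are assumptions borrowed from XML Schema Part 2 and are not defined in this Part.

```xml
<!-- Hypothetical 8-bit signed integer, following the pattern used for short -->
<datatype name="byte">
  <constraints base="integer">
    <minInclusive>-128</minInclusive>
    <maxInclusive>127</maxInclusive>
  </constraints>
</datatype>

<!-- Hypothetical 8-bit unsigned integer, following the pattern used for unsignedShort -->
<datatype name="unsignedByte">
  <constraints base="integer">
    <minInclusive>0</minInclusive>
    <maxInclusive>255</maxInclusive>
  </constraints>
</datatype>
```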
4. Constraining Properties
The following properties can be used to constrain datatypes that are derived from strings:
- fixed length (fixedLength)
Integer defining number of UCS characters that must be contained in a
valid string
or
- maximum length (maxLength)
Integer defining the maximum number of UCS characters that can occur in
a valid string
- minimum length (minLength)
Integer defining the minimum number of UCS characters that must occur in
a valid string
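As a minimal sketch of how the length properties might be combined, the following definition limits a string to between 2 and 8 UCS characters. The datatype name shortCode is hypothetical.

```xml
<!-- Hypothetical datatype: a string of between 2 and 8 UCS characters -->
<datatype name="shortCode">
  <constraints base="string">
    <minLength>2</minLength>
    <maxLength>8</maxLength>
  </constraints>
</datatype>
```

Under this definition "ab" and "abcdefgh" would be valid, while "a" and "abcdefghi" would not.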
The following properties can be used to constrain datatypes that are
derived from fixed point and floating point numbers:
- maximum value: inclusive (maxInclusive)
The highest value that an entered number is permitted to have
- maximum value: exclusive (maxExclusive)
A value that the entered number must be less than
- minimum value: inclusive (minInclusive)
The lowest value that an entered number is permitted to have
- minimum value: exclusive (minExclusive)
A value that the entered number must be greater than
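The difference between the inclusive and exclusive forms can be sketched as follows. Both datatype names (percentage and openUnitInterval) are hypothetical.

```xml
<!-- Hypothetical datatype: 0 and 100 are themselves valid values -->
<datatype name="percentage">
  <constraints base="decimal">
    <minInclusive>0</minInclusive>
    <maxInclusive>100</maxInclusive>
  </constraints>
</datatype>

<!-- Hypothetical datatype: values must lie strictly between 0 and 1 -->
<datatype name="openUnitInterval">
  <constraints base="real">
    <minExclusive>0</minExclusive>
    <maxExclusive>1</maxExclusive>
  </constraints>
</datatype>
```

A value of 100 would satisfy percentage, whereas a value of 1 would not satisfy openUnitInterval because the bound itself is excluded.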
The following additional properties can be used to constrain fixed point
numbers:
- maximum number of digits, including any decimal point (totalDigits)
- maximum number of digits that can follow the decimal point (decimalDigits)
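A possible use of the two digit-count properties is sketched below. The datatype name amount is hypothetical, and the sample values are chosen so that they give the same result whether or not the decimal point itself is counted by totalDigits.

```xml
<!-- Hypothetical datatype: at most 7 digits in total, at most 2 after the decimal point -->
<datatype name="amount">
  <constraints base="decimal">
    <totalDigits>7</totalDigits>
    <decimalDigits>2</decimalDigits>
  </constraints>
</datatype>
```

Here 1234.56 would be accepted, 1.234 would be rejected because it has three decimal digits, and 123456789.1 would be rejected because it exceeds the total digit limit.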
The following properties can be used to constrain any datatype:
- validation pattern (pattern)
[Issue 4-2: Should more than one pattern grammar be allowed for by applying a
pointer to the relevant grammar at some higher level in the syntax?]
5. Deriving Customized Datatypes
Each type defined in a DSDL schema must be assigned a unique identifier as
its name, which must not be identical to any of the names assigned to primitive
or derived datatypes defined in this standard, or to the name of any customized
datatype imported into the schema. The datatype name must conform to the rules
for defining DSDL names.
[Issue 5-1: Should datatype names be DSDL unique identifiers or keys? How can
we ensure that imported datatypes do not share the same name?]
Each set of constraints defined for a datatype shall be based on either a
primitive datatype, a derived datatype defined in this standard or a
customized datatype defined in, or imported into, the same schema. The unique
identifier of the base datatype must be specified as an attribute of the
constraints element.
Constraining properties shall be entered as subelements of the constraints
definition, using the element names shown in brackets after each property
name in Section 4. Only those properties relevant for the primitive datatype
from which the datatype is derived may be defined.
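The following sketch shows one customized datatype being used as the base of another. Both names (percentage and wholePercentage) are hypothetical, and the enclosing schema element is omitted because Appendix A is not yet available.

```xml
<!-- Hypothetical customized datatype based directly on the decimal primitive -->
<datatype name="percentage">
  <constraints base="decimal">
    <minInclusive>0</minInclusive>
    <maxInclusive>100</maxInclusive>
  </constraints>
</datatype>

<!-- Hypothetical datatype based on the customized datatype above.
     decimalDigits is permitted because the underlying primitive is decimal. -->
<datatype name="wholePercentage">
  <constraints base="percentage">
    <decimalDigits>0</decimalDigits>
  </constraints>
</datatype>
```

On this reading, wholePercentage would inherit the 0 to 100 range from percentage and add the restriction that no decimal digits are allowed.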
[Issue 5-2: How will DSDL allow us to manage the fact that models of the
constraints element will be dependent on the value given to the base
attribute, which may in its turn be derived from a customized datatype rather
than a primitive?]
[Issue 5-3: How will patterns be specified? How can we restrict the number
of times a part of a pattern is repeated?]
Where only specified values are to be permitted a list of validValues may
be specified. The rule attribute of the list can be set to
either "no-others" to indicate that only specified values are valid,
or "with-others" to indicate that the list of validValues
only indicates currently known values, which the user can extend by entering
any other value that conforms to the constraints assigned to the datatype.
Where no value is specified for the rule attribute the default value of
"no-others" applies. Optionally a statement of meaning can be
assigned to each value.
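An open-ended list of the kind described above might be sketched as follows. The datatype name colourName and its values are hypothetical; the element names are those used in the example later in this clause.

```xml
<!-- Hypothetical datatype: known values are listed, but others are allowed -->
<datatype name="colourName">
  <constraints base="string">
    <maxLength>20</maxLength>
    <validValues rule="with-others">
      <accept>
        <value>red</value>
        <meaning>Primary colour</meaning>
      </accept>
      <accept>
        <value>green</value>
        <meaning>Primary colour</meaning>
      </accept>
    </validValues>
  </constraints>
</datatype>
```

Because the rule is "with-others", a value such as "teal" would still be acceptable provided it satisfies the maxLength constraint; with "no-others" only "red" and "green" would be valid.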
The invalidValues element can be used to identify specific values that may
not be used as valid entries. Optionally a statement of the reason why the value
is invalid can be assigned to each value.
[Issue 5-4: Should datatypes be allowed to have structured values? (NB: XML
Schema defines permitted enumeration values as attributes, which allows the
element itself to consist solely of annotation.)]
The following example shows how customized datatypes can be expressed using
the elements defined in Appendix A.
<datatype name="type-a">
<constraints base="string">
<fixedLength>3</fixedLength>
<pattern>[a-fA-F](3)</pattern>
<validValues rule="no-others">
<accept>
<value>abc</value>
<meaning>Latin alphabet</meaning>
</accept>
<accept>
<value>def</value>
<meaning>Braille alphabet</meaning>
</accept>
...
</validValues>
<invalidValues>
<reject>
<value>bad</value>
<reason>Can be confused with bed.</reason>
</reject>
</invalidValues>
</constraints>
</datatype>
Appendix A: DSDL Schema for the Specification of Datatypes
To follow