Taskforce members: D. Beech, P. Biron, A. Brown, P. Chen, D. Fallside
(ed), M. Fuchs, M. Murata, J. Robie
This paper describes the recommendation of the "Simple Syntax" Taskforce for simplifying the syntax of element definitions, type definitions, and element usage within content models. It provides a syntax description, examples, and a summary of the alternatives explored. A companion document, XML Schema Part 1: Structures - Proposed Updates from Simplification Taskforce, proposes new text for the Structures working draft.
A similar mechanism is used with attributes.
In this paper, we accomplish the distinction between content model types
and element or attribute names using only two items: one establishes a
content model's name (the Type GI in the example below), the other (the
element GI) declares the appearance of an element or attribute conforming
to that content model and having a specific name. These changes in the
abstract syntax suggest a unification of attribute and element declarations
and a change in some concrete terms.
XML Schema syntax (6/23/99) | Simple syntax |
<archetype name="Address"> .... <elementType name="name"> <datatypeRef name="string"/> </elementType> .... </archetype> .... | <type name="Address"> .... <element name="name" type="dt:string"/> .... </type> .... |
In this paper, we have a uniform approach to types, spanning both structured and textual content, which is a natural outgrowth of the approach taken to tags and types. In particular, all references to types are uniform; a type is an extended content model; some extended content models contain strictly text and textual content models may be lexically and value constrained. This is illustrated in the above example with a single mechanism, "type=", to refer to both datatype and structured content.
Note that what we are proposing pertains only to the mechanism for referencing
types. That is, the scope of the proposal is limited to the use of "type="
for referencing both structured content and datatypes. Under this proposal,
it would be possible for a schema author to define a type with the same
name, say "date", as a type defined in the XML-Schema: Datatypes specification.
The current proposal does not in any way address the relation between such
types.
<PurchaseOrder>
<shipTo>
<name>Alice Smith</name>
<street>123 Maple Street</street>
<city>Mill Valley</city>
<state>CA</state>
<zip>90952</zip>
</shipTo>
<orderDate>1999-05-20</orderDate>
<shipDate>1999-05-25</shipDate>
<comments>
Get these things to me in
a hurry, my lawn is going wild!
</comments>
<Items>
<Item>
<product>Lawnmower, model BUZZ-1</product>
<quantity>1</quantity>
<price>148.95</price>
</Item>
<Item>
<product>Baby Monitor, model SNOOZE-2</product>
<quantity>1</quantity>
<price>39.98</price>
</Item>
</Items>
</PurchaseOrder>
<type name="Address">
  <element name="name" type="dt:String" />
  <element name="street" type="dt:String" />
  <element name="city" type="dt:String" />
  <element name="state" type="dt:String" />
  <element name="zip" type="dt:Integer" />
</type>
An Address type consists of five elements. Although each has a distinct name, four of them simply contain a string in a document instance, while one contains a number. Each of the basic types (string, integer, etc.) is defined in another schema, whose namespace is indicated by the "dt" prefix.
Note that the use of a namespace qualified value in the "type" attribute is proposed as a general mechanism for referring to any type that is defined in another namespace. As such, the proposal implies removing the schemaName and schemaAbbrev mechanisms that are proposed in the current XML-Schema: Structures specification.
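The lookup implied by a namespace-qualified "type" value can be sketched as follows. This is only an illustration under stated assumptions: the function name resolve_type_ref and both namespace URIs are hypothetical and do not come from the proposal, which says only that a prefix such as "dt" identifies the namespace of another schema.

```python
from typing import Dict, Tuple


def resolve_type_ref(type_ref: str, prefixes: Dict[str, str],
                     local_namespace: str) -> Tuple[str, str]:
    """Return (namespace, local name) for a value of the "type" attribute."""
    prefix, sep, local = type_ref.partition(":")
    if not sep:
        # No prefix: the type is defined in the current schema.
        return (local_namespace, type_ref)
    if prefix not in prefixes:
        raise KeyError(f"undeclared namespace prefix: {prefix!r}")
    return (prefixes[prefix], local)


# Hypothetical namespace URIs, chosen only for illustration.
PREFIXES = {"dt": "http://example.org/datatypes"}
LOCAL_NS = "http://example.org/purchase-order"

print(resolve_type_ref("dt:String", PREFIXES, LOCAL_NS))
print(resolve_type_ref("Address", PREFIXES, LOCAL_NS))
```

An unprefixed value such as "Address" resolves to the current schema, which matches the way the examples in this paper refer to locally defined types.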
Given a definition of Address as above, we can then define a PurchaseOrder as:
<type name="PurchaseOrder">
  <element name="shipTo" type="Address" />
  <element name="orderDate" type="dt:Date" />
  <element name="shipDate" type="dt:Date" />
  <element name="comments" type="dt:String" />
  <element name="Items" type="Items" />
</type>
Several of the elements of the PurchaseOrder have types defined in the datatypes namespace we saw earlier; others, such as Address and Items, are types defined in the current schema, and hence are not explicitly namespace-qualified.
Finally, we will declare an element named "PurchaseOrder" whose type is also "PurchaseOrder".
<element name="PurchaseOrder" type="PurchaseOrder" />
Summing up so far, within the schema, the element type definitions and element declarations for the example are:
......
<element name="PurchaseOrder" type="PurchaseOrder" />
<type name="PurchaseOrder">
  <element name="shipTo" type="Address" />
  <element name="orderDate" type="dt:Date" />
  <element name="shipDate" type="dt:Date" />
  <element name="comments" type="dt:String" />
  <element name="Items" type="Items" />
</type>
<type name="Address">
  <element name="name" type="dt:String" />
  <element name="street" type="dt:String" />
  <element name="city" type="dt:String" />
  <element name="state" type="dt:String" />
  <element name="zip" type="dt:Integer" />
</type>
<type name="Items">
  <element name="Item" type="Item" collection="list" minOccurs="1" />
</type>
<type name="Item">
  <element name="product" type="dt:String" />
  <element name="quantity" type="dt:Integer" />
  <element name="price" type="dt:Decimal" />
</type>
.....
<type name="Items">
  <element name="Item" collection="list" minOccurs="1">
    <type>
      <element name="product" type="dt:String" />
      <element name="quantity" type="dt:Integer" />
      <element name="price" type="dt:Decimal" />
    </type>
  </element>
</type>
Instead of the element declaration for name="Item" containing a type="Item" attribute, it is followed by a <type> ... </type> element for the anonymous type, and then the end-tag </element> for the declaration of the "Item" element as a whole.
This can help avoid a proliferation of named types. It is also
similar to a style of declaration available in several languages, and hence
can be natural in mapping to and from such languages. For example, in SQL, a
relational table definition for Items would specify a collection of rows but
would not introduce a type name for the structure of a row.
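To illustrate that the named and inline styles describe the same content, the following sketch finds the type that governs an element declaration in either style. It is a minimal illustration, not part of the proposal: the <schema> wrapper element, the type names ItemsNamed and ItemsInline, and the helper functions type_of and child_names are all hypothetical.

```python
import xml.etree.ElementTree as ET

# A hypothetical <schema> wrapper holds both styles side by side.
SCHEMA = """
<schema>
  <type name="Item">
    <element name="product" type="dt:String" />
    <element name="quantity" type="dt:Integer" />
    <element name="price" type="dt:Decimal" />
  </type>
  <type name="ItemsNamed">
    <element name="Item" type="Item" collection="list" minOccurs="1" />
  </type>
  <type name="ItemsInline">
    <element name="Item" collection="list" minOccurs="1">
      <type>
        <element name="product" type="dt:String" />
        <element name="quantity" type="dt:Integer" />
        <element name="price" type="dt:Decimal" />
      </type>
    </element>
  </type>
</schema>
"""


def type_of(decl, named_types):
    """Return the <type> element that governs an element declaration."""
    ref = decl.get("type")
    inline = decl.find("type")
    if ref is not None and inline is not None:
        raise ValueError("declaration has both type= and an inline <type>")
    if inline is not None:
        return inline          # anonymous type, written in place
    if ref is None:
        raise ValueError("declaration has neither type= nor an inline <type>")
    return named_types[ref]    # named type, looked up in the schema


def child_names(type_def):
    """List the names of the elements declared directly in a type."""
    return [e.get("name") for e in type_def.findall("element")]


root = ET.fromstring(SCHEMA)
# Only top-level <type> elements carry names that can be referenced.
named = {t.get("name"): t for t in root.findall("type")}
named_decl = named["ItemsNamed"].find("element")
inline_decl = named["ItemsInline"].find("element")

print(child_names(type_of(named_decl, named)))
print(child_names(type_of(inline_decl, named)))
```

Both declarations lead to a type with the same three element declarations; only the named style adds an entry to the table of type names.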
<street> </street>
By a type we mean a description of a set of attributes and a pattern of elements and contained characters. For example the type "dt:String" means that text may appear, but no subelements, while "dt:Date" means that the content is limited to strings in a particular format (an ISO 8601 profile) representing dates. In XML technical terms, the name is the Generic Identifier of an element while the type identifies the (extended) Content Model.
A definition creates a new type; a declaration enables the appearance in a document instance of an element with a specific name and type. In the schema, we see both the definition of several types, and also several elements declared as usages of these types. For example, Address is defined to be a type, while within the definition of Address we see five declarations of elements. These declarations are not themselves types. They are accessor names to content of a specific type such as dt:String, etc.
An element declaration's name is local to the containing type, in exactly the same way that an attribute declared in a DTD is local to its type. For example, the declaration of street within Address is local to Address and has no definition in common with any other declaration named "street" elsewhere.
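One way to picture this locality is a lookup table keyed by the pair (containing type, element name) rather than by element name alone. The sketch below assumes a hypothetical <schema> wrapper and a hypothetical second type, Intersection, neither of which appears in the proposal.

```python
import xml.etree.ElementTree as ET

# Two types that each declare an element named "street".
# The <schema> wrapper and the Intersection type are hypothetical.
SCHEMA = """
<schema>
  <type name="Address">
    <element name="street" type="dt:String" />
  </type>
  <type name="Intersection">
    <element name="street" type="dt:String" collection="list" minOccurs="2" />
  </type>
</schema>
"""


def local_declarations(schema_xml):
    """Map (type name, element name) to the attributes of the declaration."""
    root = ET.fromstring(schema_xml)
    table = {}
    for type_def in root.findall("type"):
        for decl in type_def.findall("element"):
            key = (type_def.get("name"), decl.get("name"))
            table[key] = dict(decl.attrib)
    return table


declarations = local_declarations(SCHEMA)
for key, attributes in sorted(declarations.items()):
    print(key, attributes)
```

The two "street" entries share a name but are separate declarations, so changing one has no effect on the other.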
The relation between an element's usage and the valid contents is indicated by the "type" attribute. Suppose we want to make it clear that an element contains text but no subelements. We then say something like
<element name="street" type="dt:String" />
At the opposite extreme, we can limit an element to only contain subelements, and specific ones at that, with a declaration like
<element name="shipTo" type="Address" />
Because Address is defined in the schema to have certain elements as its content, any shipTo element appearing in an instance must have those elements.
<element name="Item" type="Item" collection="list" minOccurs="1" />
The minOccurs attribute is a constraint rule, saying that at least one "Item" element must appear. When you specify collection="list" (for an ordered collection of sibling Item elements), maxOccurs defaults to unbounded, which you could write explicitly as maxOccurs="*". Alternatively, you can set a specific maximum by using a number, as in maxOccurs="20". Similarly, you can specify any minimum (greater than or equal to zero, and less than or equal to the value of maxOccurs) by using minOccurs. When collection="list", the default for minOccurs is 0 (i.e., the pair of defaults for minOccurs and maxOccurs is equivalent to * in a DTD).
Whenever you don't want to allow repetition of an element, you can write collection="no", although the examples so far have never done this because it is the default for the collection attribute. Whenever collection="no", the default for both minOccurs and maxOccurs is 1. If you want to make the element optional, you write minOccurs="0".
The relationship to the cardinalities expressible in a DTD is shown
in the table below:
DTD Cardinality | Simple Cardinality | Simple Cardinality Defaults |
nothing, e.g. [!ELEMENT foo] | nothing, e.g. <element name="foo" type="bar"/> | minOccurs=maxOccurs=1, collection="no" |
foo? | <element name="foo" type="bar" minOccurs="0"/> | maxOccurs=1, collection="no" |
foo* | <element name="foo" type="bar" collection="list"/> | minOccurs=0, maxOccurs="*" |
foo+ | <element name="foo" type="bar" collection="list" minOccurs="1"/> | maxOccurs="*" |
no simple representation | <element name="foo" type="bar" collection="list" minOccurs="3" maxOccurs="45"/> |
Note: We have provided syntax for explicit numerical cardinalities
with minOccurs and maxOccurs, although we defer the decision of whether
or not to allow numerical arguments for min/maxOccurs to the XML-Schema
working group (see Main Issue 6).
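The defaulting rules and the table above can be restated as a short sketch. It assumes that numerical arguments are allowed (the open question in Main Issue 6); the function names resolve_occurs and dtd_suffix, and the use of None to stand for an unbounded maximum, are choices made for this illustration only.

```python
from typing import Optional, Tuple

UNBOUNDED = None  # stands for maxOccurs="*"


def resolve_occurs(collection: str = "no",
                   min_occurs: Optional[str] = None,
                   max_occurs: Optional[str] = None
                   ) -> Tuple[int, Optional[int]]:
    """Apply the collection-dependent defaults to minOccurs/maxOccurs."""
    if collection == "list":
        default_min, default_max = 0, UNBOUNDED
    elif collection == "no":
        default_min, default_max = 1, 1
    else:
        raise ValueError(f"unknown collection value: {collection!r}")

    low = default_min if min_occurs is None else int(min_occurs)
    if max_occurs is None:
        high = default_max
    elif max_occurs == "*":
        high = UNBOUNDED
    else:
        high = int(max_occurs)

    # minOccurs must be >= 0 and must not exceed maxOccurs.
    if low < 0 or (high is not None and low > high):
        raise ValueError("minOccurs must satisfy 0 <= minOccurs <= maxOccurs")
    return low, high


def dtd_suffix(low: int, high: Optional[int]) -> Optional[str]:
    """Return the DTD occurrence indicator, or None if there is none."""
    table = {(1, 1): "", (0, 1): "?", (0, UNBOUNDED): "*", (1, UNBOUNDED): "+"}
    return table.get((low, high))


print(dtd_suffix(*resolve_occurs()))                          # plain foo
print(dtd_suffix(*resolve_occurs(min_occurs="0")))            # foo?
print(dtd_suffix(*resolve_occurs("list")))                    # foo*
print(dtd_suffix(*resolve_occurs("list", min_occurs="1")))    # foo+
print(dtd_suffix(*resolve_occurs("list", "3", "45")))         # no DTD form
```

The last call corresponds to the final table row: the pair (3, 45) has no simple DTD representation, so the sketch returns None for it.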
<type name="Address" order="seq">
The value "seq" is actually the default, and so such a statement is not strictly necessary.
If one wants to say that a type may have one choice of subelement from among several, the value to use is "choice". For example, if an Address type may contain either an element named "zip" or an element named "postalCode", but not both, then one would write
<type name="Address" order="choice">
<element name="zip" type="dt:Integer" />
<element name="postalCode" type="dt:String"/>
</type>
<type name="Address">
  <element name="name" type="dt:String" />
  <element name="street" type="dt:String" />
  <element name="city" type="dt:String" />
  <element name="state" type="dt:String" />
  <group order="choice">
    <element name="zip" type="dt:Integer" />
    <element name="postalCode" type="dt:String"/>
  </group>
</type>
On a group element one may put order, minOccurs and maxOccurs attributes controlling in what order and in what quantity the contained elements may appear. One may also put an optional name attribute, though this is used only for intra-schema bookkeeping and has no effect on instances.
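The effect of "seq" and "choice", including a nested group, can be sketched with a small recursive matcher over the names of an element's children. This is an illustration under simplifying assumptions: every particle occurs exactly once (minOccurs and maxOccurs are ignored), and the tuple representation and the names ends and is_valid are hypothetical.

```python
from typing import List, Set, Tuple, Union

# A particle is an element name, or a pair (order, list of particles).
Particle = Union[str, Tuple[str, list]]


def ends(particle: Particle, names: List[str], start: int) -> Set[int]:
    """Return the positions at which particle can finish matching."""
    if isinstance(particle, str):
        matched = start < len(names) and names[start] == particle
        return {start + 1} if matched else set()
    order, parts = particle
    if order == "choice":
        # Exactly one of the alternatives must match.
        result: Set[int] = set()
        for part in parts:
            result |= ends(part, names, start)
        return result
    if order == "seq":
        # Every part must match, one after another.
        positions = {start}
        for part in parts:
            positions = {e for s in positions for e in ends(part, names, s)}
        return positions
    raise ValueError(f"unknown order value: {order!r}")


def is_valid(model: Particle, names: List[str]) -> bool:
    """True if the whole list of child names matches the model."""
    return len(names) in ends(model, names, 0)


# The Address type above: a sequence ending in a choice group.
ADDRESS = ("seq", ["name", "street", "city", "state",
                   ("choice", ["zip", "postalCode"])])

print(is_valid(ADDRESS, ["name", "street", "city", "state", "zip"]))
print(is_valid(ADDRESS, ["name", "street", "city", "state", "postalCode"]))
print(is_valid(ADDRESS, ["name", "street", "city", "state", "zip", "postalCode"]))
```

The first two child lists are accepted and the third is rejected, because the choice group admits either "zip" or "postalCode" but not both.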
The value "empty" means that all instances of the type must have no contents.
The value "textOnly" means that an instance of the type may only contain text but no subelements.
The value "elemOnly" requires that all instances contain only subelements but no free text. For example, to say that an Address type must contain the five listed elements but no intermixed text, we would say
<type name="Address" content="elemOnly">
  <element name="name" type="dt:String" />
  <element name="street" type="dt:String" />
  <element name="city" type="dt:String" />
  <element name="state" type="dt:String" />
  <element name="zip" type="dt:Integer" />
</type>
Of the choices for content, "elemOnly" is the default, that is, it is assumed if no content value is given explicitly.
The value "mixed" allows both subelements and interspersed text. That is, it allows whatever subelements would be allowed by a content value of "elemOnly", in whatever sequence and quantity would otherwise be permitted, but also allows free text to appear in between subelements.
<type name="Address" content="mixed">
  <element name="name" type="dt:String" />
  <element name="street" type="dt:String" />
  <element name="city" type="dt:String" />
  <element name="state" type="dt:String" />
  <element name="zip" type="dt:Integer" />
</type>
The value "any" means that an instance of the type may contain any subelements interspersed with free text.
<type name="Address" content="any" />
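The five content values described above can be sketched as a check on the kind of content an instance element has. The sketch tests only for the presence of subelements and free text; it does not check which subelements appear, and it treats whitespace-only text as no text, which is an assumption of this illustration. The function name content_kind_ok is hypothetical.

```python
import xml.etree.ElementTree as ET


def content_kind_ok(elem: ET.Element, content: str = "elemOnly") -> bool:
    """Check an instance element against a value of the content attribute."""
    has_children = len(elem) > 0
    # Text directly inside elem is its own .text plus each child's .tail.
    pieces = [elem.text or ""] + [child.tail or "" for child in elem]
    has_text = any(piece.strip() for piece in pieces)

    if content == "empty":
        return not has_children and not has_text
    if content == "textOnly":
        return not has_children
    if content == "elemOnly":
        return not has_text
    if content in ("mixed", "any"):
        # Both allow subelements interspersed with text.
        return True
    raise ValueError(f"unknown content value: {content!r}")


street = ET.fromstring("<street>123 Maple Street</street>")
ship_to = ET.fromstring("<shipTo><name>Alice Smith</name></shipTo>")
note = ET.fromstring("<shipTo>Deliver to <name>Alice Smith</name> only</shipTo>")

print(content_kind_ok(street, "textOnly"))
print(content_kind_ok(ship_to))            # default is "elemOnly"
print(content_kind_ok(note, "elemOnly"))
print(content_kind_ok(note, "mixed"))
```

The default argument mirrors the rule that "elemOnly" is assumed when no content value is given.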
<element name="InvoiceNum" type="dt:Integer"/>
....
<type name="PurchaseOrder">
....
<element ref="InvoiceNum"/>
....
</type>
.....
<element name="PurchaseOrder" type="PurchaseOrder" />
<element name="InvoiceNum" type="dt:Integer"/>
<type name="PurchaseOrder">
  <element ref="InvoiceNum"/>
  <element name="shipTo" type="Address" />
  <element name="orderDate" type="dt:Date" />
  <element name="shipDate" type="dt:Date" minOccurs="0" />
  <element name="comments" type="dt:String" minOccurs="0" />
  <element name="Items" type="Items" />
</type>
<type name="Address">
  <element name="name" type="dt:String" />
  <element name="street" type="dt:String" />
  <element name="city" type="dt:String" />
  <element name="state" type="dt:String" />
  <group order="choice">
    <element name="zip" type="dt:Integer" />
    <element name="postalCode" type="dt:String"/>
  </group>
</type>
<type name="Items">
  <element name="Item" collection="list" minOccurs="1">
    <type>
      <element name="product" type="dt:String" />
      <element name="quantity" type="dt:Integer" />
      <element name="price" type="dt:Decimal" />
    </type>
  </element>
</type>
....
The mechanism presented in this report comes from this view. The "type" attribute of the element type "element" captures types, while the "name" attribute captures instance variables (or member names).
The second view (which I subscribe to) comes from the document world. This view can be called "symbols and production rules" or grammatical. In this view, symbols are first-class citizens, while production rules are second-class citizens, and are used merely for validation.
However, the mechanism in this report can more or less mimic the second
view. That is, the "name" attribute captures symbols. Content models
capture production rules, and <element name="foo" type="bar"/> mimics
a non-terminal that always leads to a symbol "foo".
The problems caused by allowing both type definitions and element declarations to appear at the top level are thought to be obviated if all top level type definitions can also be interpreted as GI declarations.
<list>
<element name="foo" type = "bar"/>
</list>
would be used instead of the proposed:
<element name="foo" type="bar" collection="list"/>.
The <list> alternative is clearly more verbose.
Another opinion is that the collection attribute is unnecessarily complex, especially because the value of one attribute is able to change the default value of another attribute.
* Instead of using "*", it was proposed to use "INF". Advantages claimed for "INF" include the ability to type it as a "non-negative-integer".
<element name="foo">
<element name="bar" type="dt:String" />
</element>
Compare with the proposed form:
<element name="foo">
  <type>
    <element name="bar" type="dt:String" />
  </type>
</element>
Proponents of <type> note that in cases containing nested inline
definitions, it is easy for a user to confuse an element declaration with
an inline element definition because the differences (lack of a trailing
"/", no "type=" attribute) are not very salient. Hence, <type> was proposed
to clearly signal an upcoming inline type definition.
<element name="foo" type="dt:String" />
and
<element name="foo">
<type content="textOnly" />
</element>
Aug 3, 1999: DF. Added An Alternative View to the Alternatives Considered section; changed <Type ....> to <type ....>; in Alternatives Considered section added Types and Elements describing issue on allowing both element declarations and type definitions at top level; added main issue #12; fixed typos
Aug 2, 1999: DF. Fixed typo in Item definition of final example; in Types & Elements section, changed Items definition to use collection attribute; changed example used in Inline Types section to agree with inline type used in Putting it Together section, changed accompanying text accordingly; added Alternatives Considered section including subsections on min/maxOccurs (Collection Attb and "*"), Referencing elements, Inline Definitions; added paragraph to Integrated Type Model section stating the intended scope of a common "type=" mechanism; added issue #10; removed from Content section a reference to a non-existent "datatypes" section; added issue #11; reworded issue #6 to indicate that whether or not min/maxOccurs can take numerical arguments is a decision to be taken by the Schema WG (corrects ed. error); collapsed "minOccurs and maxOccurs" and "Changing min/maxOccurs Defaults" sections largely as per DavidB's suggested revision
July 29, 1999: DF. Changed doc <title> to agree with <h1> title of document; in "Putting It Together" example, updated definition of type Items to use an inline type definition and the collection attribute; added change history section
July 28, 1999: DF. Changed <h1> title of document to "Simple Syntax"; extended issue list; shortened Tag/Type section and added XML Schema syntax/Simple syntax comparison table with associated text in Tag/Type and Integrated Type Model sections; added Inline type section; added Changing min/maxOccurs Defaults section; added Referencing Elements section; modified example in Putting It Together section
July 16, 1999: DF. First draft.