Proposal by the "Simple Syntax" Taskforce

August 13, 1999


Taskforce members: D. Beech, P. Biron, A. Brown, P. Chen, D. Fallside (ed), M. Fuchs, M. Murata, J. Robie

This paper describes the recommendation of the "Simple Syntax" Taskforce for simplifying the syntax of element definitions, type definitions, and element usage within content models. It provides a syntax description, examples, and a summary of the alternatives explored. A companion document, "XML Schema Part 1: Structures - Proposed Updates from Simplification Taskforce", proposes new text for the Structures working draft.

Issues Considered

Here is a list of the major issues that the Tf has considered:
  1. Simplifying syntax of tag/type. This issue is a primary motivation for this taskforce. Issue disposition: This proposal contains a simplified syntax based in part on a proposal from A. Layman, see http://lists.w3.org/Archives/Member/w3c-xml-schema-wg/1999Jul/0001
  2. Simplifying syntax dealing with structured content (content models) and text content (datatypes). This issue is a primary motivation for this taskforce. Issue disposition: This proposal contains a simplified syntax based in part on a proposal from A. Layman, see http://lists.w3.org/Archives/Member/w3c-xml-schema-wg/1999Jul/0001
  3. How to reference an existing top-level element definition from within a content model. Issue disposition: Use an element declaration with a "ref=" attribute to indicate the element being referenced.
  4. Relax the existing WD constraint that disallows multiple occurrences of the same element name (each having a different type) within a single content model. Issue disposition: Abide by the WD for the moment.
  5. How to handle Open and Closed Content Models. Issue disposition: Awaiting work from Refinement TF.
  6. Whether or not maxOccurs can take a numerical argument (in addition to *, etc). Issue disposition: Defer to XML-Schema WG.
  7. Whether or not to allow inline type declarations, using a mechanism such as DavidB's <type> wrapper. The NG Guide included such declarations, the July 16 version of the Simple TF Proposal did not. Issue disposition: Inline type definitions are included as of the July 28 version.
  8. Whether or not to introduce a collection attribute that controls the default values of minOccurs and maxOccurs. Issue disposition: Collection attribute is part of this proposal.
  9. Whether or not to allow top level element declarations or to require all element declarations to (at least) appear within a top level type definition. Issue disposition: TBD.
  10. Recommendation by the Tf that the XML-Schema: Datatypes spec be amended such that "<datatype name= ...." is replaced by "<type name= ....". Issue disposition: To be decided by XML-Schema WG.
  11. This proposal uses a namespace qualified value to reference types defined outside a particular schema instance. This has implications for the XML-Schema: Structures specifications. Issue disposition: To be decided by XML-Schema WG.
  12. Whether or not to allow both type definitions and element declarations to appear at the top level of a schema document. Issue disposition: As written, the proposal allows both type definitions and element declarations to appear at the top level. However, the issue is still open within the taskforce.
  13. There needs to be a reconciliation between, or at least a better explanation of the difference between elements that are defined in terms of content models containing character data, and elements that are specified in terms of datatypes such as String. Issue disposition: to be resolved, see example.
  14. Should we use <choice> instead of <group order="choice">, etc.? And if so, do we still allow <type ... order="choice"> at top level, or use <type ...> <choice> ...? Issue disposition: Pending. Current proposal uses the attribute "order=". Issue has not been discussed by TF.
  15. The current proposal document does not explicitly describe how attributes are specified and used. Issue disposition: pending. Some description is provided in the XML Schema Part 1: Structures - Proposed Updates from Simplification Taskforce companion document.

Tag/Type

In the current working draft for structural schemas, content models are separated from tag names by having three items: one establishes a content model's name, another associates an element name with that content model, and a third declares the appearance of such an element. In the example below, the archetype GI establishes a content model called "Address", and the elementType and archetypeRef GIs declare the appearance of an element called "shipTo" that is associated with Address.

A similar mechanism is used with attributes.

In this paper, we accomplish the distinction between content model types and element or attribute names using only two items: one establishes a content model's name (the type GI in the example below), the other (the element GI) declares the appearance of an element or attribute conforming to that content model and having a specific name. These changes in the abstract syntax suggest a unification of attribute and element declarations and a change in some concrete terms.
 
XML Schema syntax (6/23/99):

<archetype name="Address">
....
  <elementType name="name">
    <datatypeRef name="string"/>
  </elementType>
....
</archetype>

....
  <elementType name="shipTo">
    <archetypeRef name="Address"/>
  </elementType>
....

Simple syntax:

<type name="Address">
....
  <element name="name" type="dt:string"/>
....
</type>

....
  <element name="shipTo" type="Address"/>
....

Integrated Type Model

The current working drafts are separated into two parts, one describing schema facilities for structured content and the other describing schema facilities for text content, the latter called "datatypes." Each has very similar requirements, yet achieves them using different facilities. This is illustrated in the example above with the use of 2 mechanisms, datatypeRef and archetypeRef, to refer to datatype and structured content respectively.

In this paper, we have a uniform approach to types, spanning both structured and textual content, which is a natural outgrowth of the approach taken to tags and types. In particular, all references to types are uniform; a type is an extended content model; some extended content models contain strictly text and textual content models may be lexically and value constrained. This is illustrated in the above example with a single mechanism, "type=", to refer to both datatype and structured content.

Note that what we are proposing pertains only to the mechanism for referencing types. That is, the scope of the proposal is limited to the use of "type=" for referencing both structured content and datatypes. Under this proposal, it would be possible for a schema author to define a type with the same name, say "date", as a type defined in the XML-Schema: Datatypes specification. The current proposal does not in any way address the relation between such types.
 

Example

Throughout this proposal, features of schema are illustrated in the context of the following XML document instance:

<PurchaseOrder>
 <shipTo>
  <name>Alice Smith</name>
  <street>123 Maple Street</street>
  <city>Mill Valley</city>
  <state>CA</state>
  <zip>90952</zip>
 </shipTo>
 <orderDate>1999-05-20</orderDate>
 <shipDate>1999-05-25</shipDate>
 <comments>
  Get these things to me in a hurry, my lawn is going wild!
 </comments>
 <Items>
  <Item>
   <product>Lawnmower, model BUZZ-1</product>
   <quantity>1</quantity>
   <price>148.95</price>
  </Item>
  <Item>
   <product>Baby Monitor, model SNOOZE-2</product>
   <quantity>1</quantity>
   <price>39.98</price>
  </Item>
 </Items>
</PurchaseOrder>

Types and Elements [Alternatives]

The purchase order consists of a main element with several subordinate elements. Most of the subelements have simple atomic types such as "string" or "date," but some are complex. Type is the mechanism for defining a complex element structure. For example, we can define a type such as Address as follows:

<type name="Address" >
    <element name="name"   type="dt:String" />
    <element name="street" type="dt:String" />
    <element name="city"   type="dt:String" />
    <element name="state"  type="dt:String" />
    <element name="zip"    type="dt:Integer" />
</type>

An Address type consists of five elements. Though each has a distinct name, four of the elements will simply contain a string in a document instance while one will contain a number. Each of the basic types (String, Integer, etc.) is defined in another schema, whose namespace is indicated by the "dt" prefix.

Note that the use of a namespace qualified value in the "type" attribute is proposed as a general mechanism for referring to any type that is defined in another namespace. As such, the proposal implies removing the schemaName and schemaAbbrev mechanisms that are proposed in the current XML-Schema: Structures specification.
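
As a minimal sketch of how the "dt" prefix could be bound, assuming the usual XML Namespaces mechanism is used: the enclosing element name "schema" and the namespace URI below are placeholders invented for illustration and are not part of this proposal.

```xml
<!-- Hypothetical wrapper: the element name "schema" and the URI are assumptions -->
<schema xmlns:dt="http://example.org/hypothetical/datatypes">
    <type name="Address">
        <!-- "dt:String" names the type String in the namespace bound to "dt" -->
        <element name="name" type="dt:String" />
        <!-- an unqualified value such as type="Address" would name a type
             defined in the current schema -->
    </type>
</schema>
```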

Given a definition of Address as above, we can then define a PurchaseOrder as

<type name="PurchaseOrder">
    <element name="shipTo"    type="Address" />
    <element name="orderDate" type="dt:Date" />
    <element name="shipDate"  type="dt:Date" />
    <element name="comments"  type="dt:String" />
    <element name="Items"     type="Items" />
</type>

Several of the elements of the PurchaseOrder have types defined in the datatypes namespace we saw earlier; others, such as Address and Items, are types defined in the current schema, and hence are not explicitly namespace-qualified.

Finally, we will declare an element named "PurchaseOrder" whose type is also "PurchaseOrder".

<element name="PurchaseOrder" type="PurchaseOrder" />

Summing up so far, within the schema, the type definitions and element declarations for the example are:

......
    <element name="PurchaseOrder" type="PurchaseOrder" />
    <type name="PurchaseOrder">
        <element name="shipTo"    type="Address" />
        <element name="orderDate" type="dt:Date" />
        <element name="shipDate"  type="dt:Date" />
        <element name="comments"  type="dt:String" />
        <element name="Items"     type="Items" />
    </type>
    <type name="Address" >
        <element name="name"      type="dt:String" />
        <element name="street"    type="dt:String" />
        <element name="city"      type="dt:String" />
        <element name="state"     type="dt:String" />
        <element name="zip"       type="dt:Integer" />
    </type>
    <type name="Items">
        <element name="Item" type="Item" collection="list" minOccurs="1" />
    </type>
    <type name="Item">
        <element name="product"   type="dt:String" />
        <element name="quantity"  type="dt:Integer" />
        <element name="price"     type="dt:Decimal" />
    </type>
.....

Inline Types [Alternatives]

Sometimes it is convenient to specify an element declaration that has an inline type definition instead of referencing an existing named type definition. Such inline definitions create anonymous types. For example, the type Items from our previous section might use an inline type definition for its elements named "Item" as follows:

   <type name="Items">
        <element name="Item" collection="list" minOccurs="1" >
            <type>
               <element name="product"   type="dt:String" />
               <element name="quantity"  type="dt:Integer" />
               <element name="price"     type="dt:Decimal" />
            </type>
        </element>
    </type>

Instead of the element declaration for name="Item" containing a type="Item" attribute, it is followed by a <type> ... </type> element for the anonymous type, and then the end-tag </element> for the declaration of the "Item" element as a whole.

This can help avoid a proliferation of named types. It is also similar to a style of declaration available in several languages, and hence can be natural in mapping to and from such languages. For example, in SQL a relational table definition for Items would specify a collection of rows but would not introduce a type name for the structure of a row.
 

Type/Content Model

What we mean by an element's name is the tag name appearing in the document instance, for example

<street> </street>

By a type we mean a description of a set of attributes and a pattern of elements and contained characters. For example the type "dt:String" means that text may appear, but no subelements, while "dt:Date" means that the content is limited to strings in a particular format (an ISO 8601 profile) representing dates. In XML technical terms, the name is the Generic Identifier of an element while the type identifies the (extended) Content Model.

A definition creates a new type; a declaration enables the appearance in a document instance of an element with a specific name and type. In the schema, we see both the definition of several types, and also several elements declared as usages of these types.  For example, Address is defined to be a type, while within the definition of Address we see five declarations of elements.  These declarations are not themselves types.  They are accessor names to content of a specific type such as dt:String, etc.

An element declaration's name is local to the containing type, in exactly the same way that an attribute declared in a DTD is local to its type.  For example, the declaration of street within Address is local to Address and has no definition in common with any other declaration named "street" elsewhere.
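
To illustrate the locality rule, the sketch below declares "street" inside two different types. The Intersection and StreetSegment types are hypothetical and appear nowhere else in this proposal; Address is abbreviated to the one relevant declaration.

```xml
<type name="Address">
    <!-- here "street" holds plain text -->
    <element name="street" type="dt:String" />
</type>

<!-- hypothetical types, for illustration only -->
<type name="StreetSegment">
    <element name="label" type="dt:String" />
</type>
<type name="Intersection">
    <!-- an unrelated "street": local to Intersection, with a different type -->
    <element name="street" type="StreetSegment" />
</type>
```

Because each declaration is local to its containing type, the two "street" declarations do not conflict and share nothing beyond the name.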

The relation between an element's usage and the valid contents is indicated by the "type" attribute. Suppose we want to make it clear that an element contains text but no subelements. We then say something like

<element name="street" type="dt:String" />

At the opposite extreme, we can limit an element to only contain subelements, and specific ones at that, with a declaration like

<element name="shipTo" type="Address" />

Because Address is defined in the schema to have certain elements as its content, any shipTo element appearing in an instance must have those elements.

MinOccurs and MaxOccurs [Alternatives]

If you look at the schema above, you will see that we allow our customers to order more than one item at a time, indicating this by:

<element name="Item"   type="Item"  collection="list" minOccurs="1" />

The minOccurs attribute is a constraint rule, saying that at least one "Item" element must appear. When you specify collection="list" (for an ordered collection of sibling Item elements), maxOccurs defaults to unbounded, which you could write explicitly as maxOccurs="*". Alternatively, you can set a specific maximum by using a number, as in maxOccurs="20". Similarly, you can specify any minimum (greater than or equal to zero, and less than or equal to the value of maxOccurs) by using minOccurs. When collection="list", the default for minOccurs is 0 (i.e., the pair of defaults for minOccurs and maxOccurs is equivalent to * in a DTD).

Whenever you don't want to allow for repetition of an element, you can write collection="no", although we have never done this because it is the default for the collection attribute. Whenever collection="no", the default for both minOccurs and maxOccurs is "1". If you want to make the element optional, you write minOccurs="0".

The relationship to the cardinalities expressible in a DTD is shown in the table below:
 
DTD Cardinality | Simple Cardinality | Simple Cardinality Defaults
nothing, e.g. [!ELEMENT foo] | nothing, e.g. <element name="foo" type="bar"/> | minOccurs=maxOccurs=1, collection="no"
foo? | <element name="foo" type="bar" minOccurs="0"/> | maxOccurs=1, collection="no"
foo* | <element name="foo" type="bar" collection="list"/> | minOccurs=0, maxOccurs="*"
foo+ | <element name="foo" type="bar" collection="list" minOccurs="1"/> | maxOccurs="*"
no simple representation | <element name="foo" type="bar" collection="list" minOccurs="3" maxOccurs="45"/> | (none)

Note: We have provided syntax for explicit numerical cardinalities with minOccurs and maxOccurs, although we defer the decision of whether or not to allow numerical arguments for min/maxOccurs to the XML-Schema working group (see Main Issue 6).
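
The defaults can be read against a small content model. The sketch below restates three declarations from the running example with every default written out (PurchaseOrder is abbreviated to two elements), followed in a comment by the DTD content models they would roughly correspond to. The DTD lines are an interpretation of the cardinality rules above, not text from the working draft.

```xml
<type name="PurchaseOrder">
    <!-- exactly one: same as omitting all three attributes -->
    <element name="shipTo"   type="Address"   collection="no" minOccurs="1" maxOccurs="1" />
    <!-- optional: only minOccurs differs from the defaults -->
    <element name="comments" type="dt:String" collection="no" minOccurs="0" maxOccurs="1" />
</type>
<type name="Items">
    <!-- one or more: maxOccurs="*" is already the default once collection="list" is given -->
    <element name="Item" type="Item" collection="list" minOccurs="1" maxOccurs="*" />
</type>

<!-- Roughly, in DTD terms:
     <!ELEMENT PurchaseOrder (shipTo, comments?)>
     <!ELEMENT Items (Item+)>
-->
```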
 

Order

If subelements or groups must appear in a certain order, or if only one of a set may appear, this can be indicated using the order attribute.  To say explicitly that subelements must appear in the order listed in the schema, one uses the value "seq" as in:

<type name="Address" order="seq">

The value "seq" is actually the default, and so such a statement is not strictly necessary.

If one wants to say that a type may have one choice of subelement from among several, the value to use is "choice".  For example, if an Address type may contain either an element named "zip" or an element named "postalCode", but not both, then one would write

<type name="Address" order="choice">
    <element name="zip" type="dt:Integer" />
    <element name="postalCode"  type="dt:String"/>
</type>
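
As a sketch of what this definition admits, assume a hypothetical declaration <element name="billTo" type="Address"/> that uses the choice-only Address type above. The element name billTo and the postal code value are invented for illustration.

```xml
<!-- conforms: exactly one of the alternatives appears -->
<billTo>
    <zip>90952</zip>
</billTo>

<!-- also conforms -->
<billTo>
    <postalCode>SW1A 1AA</postalCode>
</billTo>

<!-- does not conform: both alternatives appear -->
<billTo>
    <zip>90952</zip>
    <postalCode>SW1A 1AA</postalCode>
</billTo>
```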
 

Group

Sometimes constraints do not apply to all the subelements of a type, but only to some of them.  To indicate this, one uses the group element.  For example, to indicate that an Address type must contain a name, street, city and state, but then may contain either an element named "zip" or an element named "postalCode", we would write

<type name="Address">
    <element name="name"   type="dt:String" />
    <element name="street" type="dt:String" />
    <element name="city"   type="dt:String" />
    <element name="state"  type="dt:String" />
    <group order="choice">
        <element name="zip"         type="dt:Integer" />
        <element name="postalCode"  type="dt:String"/>
    </group>
</type>

On a group element one may put order, minOccurs and maxOccurs attributes controlling in what order and in what quantity the contained elements may appear. One may also put an optional name attribute, though this is used only for intra-schema bookkeeping and has no effect on instances.
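
A minimal sketch of occurrence attributes on a group, assuming they behave on a group as they do on an element declaration: here the zip/postalCode choice becomes optional. The group name "postcode" is a made-up bookkeeping label.

```xml
<type name="Address">
    <element name="name"   type="dt:String" />
    <element name="street" type="dt:String" />
    <element name="city"   type="dt:String" />
    <element name="state"  type="dt:String" />
    <!-- minOccurs="0": an instance may omit both zip and postalCode -->
    <group name="postcode" order="choice" minOccurs="0">
        <element name="zip"         type="dt:Integer" />
        <element name="postalCode"  type="dt:String" />
    </group>
</type>
```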

Content [Alternatives]

When defining a type, we may say that it may contain simply text, simply other elements, a mixture of text and elements, or nothing at all (it may be empty).  We specify this by using the content attribute.

The value "empty" means that all instances of the type must have no contents.

The value "textOnly" means that an instance of the type may only contain text but no subelements.

The value "elemOnly" requires that all instances contain only subelements but no free text. For example, to say that an Address type must contain the five listed elements but no intermixed text, we would say

<type name="Address" content="elemOnly" >
    <element name="name"   type="dt:String" />
    <element name="street" type="dt:String" />
    <element name="city"   type="dt:String" />
    <element name="state"  type="dt:String" />
    <element name="zip"    type="dt:Integer" />
</type>

Of the choices for content, "elemOnly" is the default, that is, it is assumed if no content value is given explicitly.

The value "mixed" allows both subelements and interspersed text. That is, it allows whatever subelements would be allowed by a content value of "elemOnly", in whatever sequence and quantity would otherwise be permitted, but also allows free text to appear between subelements.

<type name="Address" content="mixed" >
    <element name="name"   type="dt:String" />
    <element name="street" type="dt:String" />
    <element name="city"   type="dt:String" />
    <element name="state"  type="dt:String" />
    <element name="zip"    type="dt:Integer" />
</type>

The value "any" means that an instance of the type may contain any subelements interspersed with free text.

<type name="Address" content="any" />
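
The "empty" and "textOnly" values are not illustrated above. The sketch below shows how they might be written, by analogy with the content="any" example, together with an instance fragment that the "mixed" Address definition would appear to permit. The type names Separator and Note are hypothetical.

```xml
<!-- hypothetical type names, for illustration only -->
<type name="Separator" content="empty" />
<type name="Note"      content="textOnly" />

<!-- an instance fragment allowed by content="mixed" on Address:
     the five subelements in sequence, with free text between them -->
<shipTo>
    Deliver to <name>Alice Smith</name>
    at <street>123 Maple Street</street>,
    <city>Mill Valley</city> <state>CA</state> <zip>90952</zip>
</shipTo>
```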
 

Referencing Elements [Alternatives]

A useful function provided in DTDs is the ability to create a top-level element declaration, and then reference it within one or more different content models. Such references are provided for through a "ref=" attribute whose value is the name of a top-level element declaration. For example:

<element name="InvoiceNum" type="dt:Integer"/>
....
<type name="PurchaseOrder">
....
  <element ref="InvoiceNum"/>
....
</type>
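
A sketch of the reuse this enables: one top-level declaration referenced from two content models. The Invoice type and its "total" element are hypothetical, and PurchaseOrder is abbreviated.

```xml
<element name="InvoiceNum" type="dt:Integer"/>

<type name="PurchaseOrder">
    <element ref="InvoiceNum"/>
    <element name="shipTo" type="Address" />
</type>

<!-- hypothetical second type reusing the same top-level declaration -->
<type name="Invoice">
    <element ref="InvoiceNum"/>
    <element name="total" type="dt:Decimal" />
</type>
```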
 

Putting It Together

If we take each of the ideas we saw above and add them to our initial schema, we have:

.....

    <element name="PurchaseOrder" type="PurchaseOrder" />
    <element name="InvoiceNum" type="dt:Integer"/>

    <type name="PurchaseOrder">
        <element ref="InvoiceNum"/>
        <element name="shipTo"    type="Address" />
        <element name="orderDate" type="dt:Date" />
        <element name="shipDate"  type="dt:Date" minOccurs="0" />
        <element name="comments"  type="dt:String" minOccurs="0" />
        <element name="Items"     type="Items" />
    </type>

    <type name="Address">
        <element name="name"      type="dt:String" />
        <element name="street"    type="dt:String" />
        <element name="city"      type="dt:String" />
        <element name="state"     type="dt:String" />
        <group order="choice">
            <element name="zip"         type="dt:Integer" />
            <element name="postalCode"  type="dt:String"/>
        </group>
    </type>

    <type name="Items">
        <element name="Item" collection="list" minOccurs="1">
            <type>
               <element name="product"   type="dt:String" />
               <element name="quantity"  type="dt:Integer" />
               <element name="price"     type="dt:Decimal" />
            </type>
        </element>
    </type>

....
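
For comparison, here is a sketch of an instance that should conform to the schema above, assuming the defaults described earlier (an element reference with no occurrence attributes must appear exactly once). The invoice number is an invented value; shipDate and comments are omitted because they are declared with minOccurs="0".

```xml
<PurchaseOrder>
    <InvoiceNum>1001</InvoiceNum>
    <shipTo>
        <name>Alice Smith</name>
        <street>123 Maple Street</street>
        <city>Mill Valley</city>
        <state>CA</state>
        <zip>90952</zip>
    </shipTo>
    <orderDate>1999-05-20</orderDate>
    <Items>
        <Item>
            <product>Lawnmower, model BUZZ-1</product>
            <quantity>1</quantity>
            <price>148.95</price>
        </Item>
    </Items>
</PurchaseOrder>
```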


Alternatives Considered

Until this section, we have described a single syntax for defining elements, their types and usage. The XML-Schema working group has directed that taskforces "show their work", and so the purpose of this section is to describe any significant alternatives that were considered by the Tf. Alternatives are listed under the same subheadings used to describe the proposed syntax.
 

An alternative view (MURATA Makoto)

In my opinion, there are two views in the simple task force.  The first view (which I do not subscribe to) comes from the database and programming language world.  It can be called "types and instance variables (member names)" or object-oriented.

The mechanism presented in this report comes from this view.  The "type" attribute of the element type "element" captures types, while the "name" attribute captures instance variables (or member names).

The second view (which I subscribe to) comes from the document world.  This view can be called "symbols and production rules" or grammatical.  In this view, symbols are first-class citizens, while production rules are second-class citizens, and are used merely for validation.

However, the mechanism in this report can more or less mimic the second view. That is, the "name" attribute captures symbols. Content models capture production rules, and <element name="foo" type="bar"/> mimics a non-terminal that always leads to a symbol "foo".
 

Types & Elements

Several reasons have been given against allowing both type definitions and element declarations to appear at the top level of a schema document.  In particular, 1) an element of a particular type, such as Address from the example, cannot by itself be valid without a second element declaration in the schema (declaring the type to also be a valid element), and 2) if a snippet from an instance document conforming to such a schema is "cut" from the document, it may be difficult to determine the type of the outermost GI in the snippet without full context information from where the snippet was cut.

The problems caused by allowing both type definitions and element declarations to appear at the top level are thought to be obviated if all top level type definitions can also be interpreted as GI declarations.

MinOccurs and MaxOccurs

Collection Attribute: A <list> element type was proposed as an alternative to the collection attribute. Hence:

 <list>
   <element name="foo" type = "bar"/>
 </list>

would be used instead of the proposed:

 <element name="foo" type="bar" collection="list"/>

The <list> alternative is clearly more verbose.

Another opinion is that the collection attribute is unnecessarily complex, especially because the value of one attribute is able to change the default value of another attribute.

"*": Instead of using "*", it was proposed to use "INF". Advantages claimed for "INF" include the ability to type it as a "non-negative-integer".

Referencing Elements

It was suggested to use "ref=" instead of  "like=" on the grounds that the attribute is making reference to a top level element and not introducing a new element that is "like" the top level one. This suggestion has been incorporated.
 

Inline Types

A criticism of the <type> construct was that it introduces unnecessary verbosity. A less verbose alternative uses the element declaration without a type name to signal that an inline type definition follows, for example:

 <element name="foo">
    <element name="bar" type="dt:String" />
 </element>

Compare with the proposed form:

 <element name="foo">
    <type>
         <element name="bar" type="dt:String" />
    </type>
 </element>

Proponents of <type> note that in cases containing nested inline definitions, it is easy for a user to confuse an element declaration with an inline type definition because the differences (lack of a trailing "/", no "type=" attribute) are not very salient. Hence, <type> was proposed to clearly signal an upcoming inline type definition.
 

Content

A comment was received that the distinction is unclear between what is implied by using the textOnly content attribute value and a String datatype. For example, what is the difference between:

<element name="foo" type="dt:String" />

and

<element name="foo">
    <type content="textOnly" />
</element>
 


Change History

Aug 13, 1999: DF. Changed "like=" to "ref=" on basis of comments received and no objections; added text to Order section saying that "seq" is the default; added issue #13 plus some explanatory text in a new "Alternatives Considered: Content" section; added issue #14 regarding a suggestion to use a <choice> tag rather than a choice= attribute; added issue #15 regarding silence on attributes; added reference to the "XML Schema Part 1: Structures - Proposed Updates from Simplification Taskforce" document

Aug 3, 1999: DF. Added An Alternative View to the Alternatives Considered section; changed <Type ....> to <type ....>; in Alternatives Considered section added Types and Elements describing issue on allowing both element declarations and type definitions at top level; added main issue #12; fixed typos

Aug 2, 1999: DF. Fixed typo in Item definition of final example; in Types & Elements section, changed Items definition to use collection attribute; changed example used in Inline Types section to agree with inline type used in Putting it Together section, changed accompanying text accordingly; added Alternatives Considered section including subsections on min/maxOccurs (Collection Attb and "*"), Referencing elements, Inline Definitions; added paragraph to Integrated Type Model section stating the intended scope of a common "type=" mechanism; added issue #10; removed from Content section a reference to a non-existent "datatypes" section; added issue #11; reworded issue #6 to indicate that whether or not min/maxOccurs can take numerical arguments is a decision to be taken by the Schema WG (corrects ed. error); collapsed "minOccurs and maxOccurs" and "Changing min/maxOccurs Defaults" sections largely as per DavidB's suggested revision

July 29, 1999: DF. Changed doc <title> to agree with <h1> title of document; in "Putting It Together" example, updated definition of type Items to use an inline type definition and the collection attribute; added change history section

July 28, 1999: DF. Changed <h1> title of document to "Simple Syntax"; extended issue list; shortened Tag/Type section and added XML Schema syntax/Simple syntax comparison table with associated text in Tag/Type and Integrated Type Model sections; added Inline type section; added Changing min/maxOccurs Defaults section; added Referencing Elements section; modified example in Putting It Together section

July 16, 1999: DF. First draft.