Elements or attributes?


Subject:      Re: Designing a DTD: Elements or attributes?
From:         "W. Eliot Kimber" <eliot@isogen.com>
Date:         1997/11/18
Message-ID:   <3471DE7E.B7AB2156@isogen.com>
Newsgroups:   comp.text.sgml

Nik Clayton wrote:
> However, in designing the DTD I'm running into what I suspect might be
> a purely stylistic choice -- how do you decide when a value should be an
> attribute and when it should be an element within another element?

This is a basic question that all DTD designers face, and there are many schools of thought on the issue.

My approach is as follows:

1. Determine if the data in question is fundamentally metadata or content. Metadata is information that describes the container, while content is the information the container conveys. For example, an ID attribute is clearly metadata because it describes the containing element (by giving it a unique name). The author of a document is also metadata, as is the title (in my view). One way to distinguish metadata from content is to ask the question "if I removed this data, would my understanding of or ability to comprehend the content change?" If the answer is no, it's metadata; if the answer is yes, it's content (or annotation, which is the third fundamental class of information). For example, knowing or not knowing the author of some information doesn't affect your ability to understand the content in normal practice (you can always think of weird cases where knowledge of the author is required, but these are games, not workaday information objects).

2. Determine the structural requirements for the data: does the data item syntactically conform to the rules for attributes or does it require more structuring?

If the value can be used as an attribute value, then I prefer attributes. If the value cannot be used as an attribute value, then it must be a subelement, either contained by the element it describes or used by reference (see below).

Many DTD designers provide a metadata container that distinguishes metadata for an element from its content, e.g.:

<division>
 <divmetadata>
  <title>This is the division title</title>
  <author>W. Eliot Kimber</author>
 </divmetadata>
 <divbody>
 <para>This is the content</para>
 </divbody>
</division>

By clearly identifying the metadata, systems that want to find metadata (either to ignore it because they only want content or to extract it for some reason) can look at both attributes and the contents of the metadata container and know that whatever they get is the metadata for the container. You can also provide attributes of individual elements that identify them as metadata independent of their containers, e.g.:

<title my-arch="metadata.item">This is the title</title>

The attribute "my-arch" indicates that the element title is derived from the architectural form "metadata.item", which in "my architecture" is defined to be a metadata item. Processors can look for my-arch attributes with a value of "metadata.item" and know that they've gotten all the metadata that's held in elements.
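As a rough illustration of the processor side of this, here is a minimal Python sketch that gathers metadata for an element from all three places described above: its attributes, its metadata container, and any child flagged with my-arch="metadata.item". It assumes an XML-style rendering of the markup that Python's standard xml.etree.ElementTree can parse; a real SGML system would use an SGML parser and an architecture-aware processor. The function name collect_metadata and the sample document are hypothetical, and "divmetadata" and "my-arch" are only the illustrative names used in this post.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample combining both conventions from the post.
SAMPLE = """
<division id="d1">
 <divmetadata>
  <author>W. Eliot Kimber</author>
 </divmetadata>
 <title my-arch="metadata.item">This is the title</title>
 <divbody>
  <para>This is the content</para>
 </divbody>
</division>
"""

def collect_metadata(elem, container_tag="divmetadata",
                     arch_attr="my-arch", arch_value="metadata.item"):
    """Return (name, value) pairs describing elem (assumed conventions)."""
    found = []
    # 1. Attributes of the element itself are treated as metadata.
    for name, value in elem.attrib.items():
        found.append((name, value))
    # 2. Everything inside the metadata container is metadata.
    container = elem.find(container_tag)
    if container is not None:
        for child in container:
            found.append((child.tag, (child.text or "").strip()))
    # 3. Direct children flagged with the architectural attribute.
    for child in elem:
        if child.get(arch_attr) == arch_value:
            found.append((child.tag, (child.text or "").strip()))
    return found

division = ET.fromstring(SAMPLE)
metadata = collect_metadata(division)
for name, value in metadata:
    print(name, "=", value)
```

The same three checks can be inverted to strip metadata and keep only content.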

It is also possible to have it both ways. The "value reference facility" of ISO/IEC 10744:1997 (HyTime) (see http://www.ornl.gov/sgml/wg8/docs/n1920/html/clause-6.7.html#clause-6.7.1) lets you state that the effective value of an attribute is provided by data held elsewhere, such as by another element. You do this by associating the attribute for which you want to define the value with an attribute (or content) that addresses the value. The two attributes may be the same (an attribute may address its own effective value).

For example, say I want to have a "name" attribute but also want to have highly structured names (maybe because I have a database of people where the names are already defined and structured). I can use the value reference facility to provide a "name" attribute that gets its effective value from the database.

Say I have "Person" elements like this:

<Person id="p123-45-678">
 <name id="p123-45-678.name">
  <last>Kimber</last><first>William</first><middle>Eliot</middle>
 </name>
</Person>

And somewhere else I want to use the name above as the value of a "name" attribute. First, I declare the element with the name attribute and use the valueref attribute to indicate that the effective value of the name attribute is by reference:

<!ATTLIST Employee
  name    -- Name of the employee.  Effective value is a Name
             element used by reference from the Person database
          --
    IDREF -- Reference to Name element that holds effective value -- 
    #REQUIRED 
  grade   
    CDATA 
    #REQUIRED
  department
    CDATA
    #REQUIRED

  valueref -- Attribute defined by the HyTime value reference
              facility.  Attribute value is a pair of attribute
              names.  First name is attribute whose effective value
              is to be addressed, second name is attribute that
              addresses the value.  May be the same attribute.
           -- 
    CDATA #FIXED
    "name name"  -- "name" attribute addresses its effective value --
  HyTime   -- Element must be derived from HyTime architecture
              so that a HyTime-aware processor will recognize the
              valueref attribute as the one defined by the HyTime
              architecture. 
           --
    NAME #FIXED "hybrid"
>

An instance of the Employee element would look like this:

<Employee name="p123-45-678.name" grade="E12" department="A34"/>

By "effective value" I mean the value that the processing application will use, as opposed to the value seen by the parser. In other words, after parsing the document and building some in-memory data structure or abstraction of the document, the effective value of the name attribute will be the data derived from the name element, not the ID reference used to address the name element.
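To make the parser-value versus effective-value distinction concrete, here is a toy Python sketch, not a HyTime engine. It assumes an XML-style rendering of the Person and Employee examples, with the valueref attribute written out explicitly because no DTD is read to supply the #FIXED default. The function effective_value and the wrapping doc element are hypothetical, and the pair-wise reading of valueref follows the comment in the ATTLIST above.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance; valueref is spelled out since no DTD supplies it.
SAMPLE = """
<doc>
 <Person id="p123-45-678">
  <name id="p123-45-678.name">
   <last>Kimber</last><first>William</first><middle>Eliot</middle>
  </name>
 </Person>
 <Employee name="p123-45-678.name" grade="E12" department="A34"
           valueref="name name"/>
</doc>
"""

def effective_value(root, elem, attr):
    """Return the element addressed by attr if valueref covers it,
    otherwise the literal attribute string the parser saw."""
    names = (elem.get("valueref") or "").split()
    # valueref is read as pairs: (attribute to resolve, addressing attribute).
    for target, addresser in zip(names[0::2], names[1::2]):
        if target == attr:
            ref = elem.get(addresser)
            for candidate in root.iter():
                if candidate.get("id") == ref:
                    return candidate
            raise LookupError("no element with id " + repr(ref))
    return elem.get(attr)

root = ET.fromstring(SAMPLE)
employee = root.find("Employee")

parsed = employee.get("name")                      # what the parser sees
resolved = effective_value(root, employee, "name")  # what the application uses
print(parsed)
print(resolved.findtext("first"), resolved.findtext("last"))
```

Attributes not named in valueref, such as grade, come back as their literal strings.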

For more on the use of the value reference facility, see the paper "HyTime Valueref in Aircraft Manual Authoring Management" at http://www.isogen.com/papers/valueref/valueref.html.

Finally, there may be overriding requirements that either force everything to be in attributes or nothing to be in attributes. For example, when developing a DTD for representing VRML documents, I made the decision not to use attributes at all so that the markup was very consistent, which I thought would make it easier for people without SGML knowledge to grok (my intended audience being VRML people; see http://www.drmacro.com/vrml).

There may also be tool limitations that influence your decision one way or another. For example, some simple tools provide no way to suppress the content of an element while others provide no good way to present the values of attributes. However, such limitations should be rare or occur in a place in the process that doesn't affect how the primary or authoritative versions of the documents are structured (because the input to the tool in question is the result of a transform applied to the source documents).