[This local archive copy mirrored from the canonical site: http://www.sgml.saic.com/html/paper3.html; links may not have complete integrity, so use the canonical document at this URL if possible.]

The Power of Using Content Tagging and Attributes with Your Data

Chris Wheedleton
Science Applications International Corporation (SAIC)
1710 Goodridge Drive, M/S T2-5-1
McLean, VA  22102
Phone: (703) 821-4475 Fax: (703) 883-9042


The use of SGML attributes to represent complex tabular data can help authors create and maintain large volumes of data. Smart use of attributes, combined with the functionality of today's SGML processing tools, can make the management and distribution of this type of data simpler, more effective, and more usable. SAIC has recently implemented attributes in some unique SGML applications. We consider SGML attributes a useful extension of the "content tagging" approach that is commonly implemented with SGML elements. This paper describes one such application, which used attributes to store up to 250 pages of tabular records, each with up to 70 repetitive content descriptors, and explains the rationale for selecting an attribute solution.


Mr. Wheedleton is an Information Technology Engineer for SAIC specializing in SGML/HTML and Multimedia technologies. His experience is in the research, design, and application of these technologies for uses such as on-line publishing and computer-based training (CBT). He has performed systems analysis, designed information architectures, and reviewed technology standards, in addition to actual application development through COTS integration and computer programming. Mr. Wheedleton has worked on program areas related to SGML and other advanced electronic publishing systems, interactive training courseware, and CALS. As a member of SAIC's SGML Consulting team, Mr. Wheedleton has developed and implemented SGML applications for U.S. military and intelligence organizations, and has represented SAIC at conferences, including as a speaker at SGML '95.

1.     Element or Attribute

As many government and commercial organizations begin to assess their information assets, it becomes clear that not all data sets are obvious candidates for SGML processing. The SGML community is familiar with the recent advances in SGML data management systems, which have begun to blur the distinction between pure document storage and true database management. As these systems mature, we must begin to develop document structures that are conducive to easy document production, as well as to intelligent storage, retrieval, and distribution.

There are two basic ways to support data in SGML: elements and attributes. Over the past couple of years we have seen many SGML applications that use elements as "content tags" to identify critical document components. Instead of using a generic book metaphor for tagging, more specific structures can be developed that have useful element names and a fine level of information granularity. The more specific the document structure, or the more exact the content tagging scheme, the more effective one can be in retrieving and managing the document contents at a lower component level. These types of intelligent models will transition more easily into a true component management system. However, while elements provide the most effective way to enforce document hierarchy, SGML does not provide a mechanism to enforce character or text content restrictions on elements.

Attributes are the second way to capture data in SGML. Typically, attributes are used to describe the contents of an element much the way an adjective describes a noun. Attributes, such as Classification, carry additional information that can be used by SGML processing systems to provide stylesheet or hyperlinking functionality or to carry hidden metadata about the document or text component. One of the most important benefits of attributes is that they provide character-level enforcement of content, very similar to database field enforcement. An attribute's declared value can restrict it to a list of value choices, to letters or numbers, or allow any string value. In addition, SGML processing systems enforce attribute values using the DTD, thereby providing a built-in content checker.
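
The contrast can be sketched in a few declarations. This is a minimal illustration, not taken from the application described in this paper; the element and attribute names (status, item, qty, note) are hypothetical, and the tag minimization parameters assume the default OMITTAG YES.

```sgml
<!-- Element approach: any character data is accepted, so the parser
     cannot tell "Y" from a mistyped "Yse". -->
<!ELEMENT status  - - (#PCDATA)>

<!-- Attribute approach: the declared value restricts what may be entered. -->
<!ELEMENT item    - O EMPTY>
<!ATTLIST item
          status  (Y|N|U)  #REQUIRED   -- one of three value choices --
          qty     NUMBER   #IMPLIED    -- digits only --
          note    CDATA    #IMPLIED    -- any string value --
>
```

Under these declarations a validating parser would report <item status="maybe"> as an error, while <status>maybe</status> would be accepted without complaint.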

When performing data analysis on a document set, the DTD designer must consider using elements or attributes for capturing document content. Both methods have their own advantages and disadvantages which are based on the capabilities or limitations of both the SGML standard, as defined by ISO 8879, and the software tools that process SGML. Areas of interest to the DTD designer include: content creation or management; special stylesheet requirements for authoring or composition; on-line hyperlinking behaviors; and fielded search and retrieval. To determine if document content is best suited for elements or attributes, the designer needs to ask the following questions:

  • Does the content have fixed components with repetitive values or is it free-text entry?
  • Are the content values specifically defined down to the character level?
  • Does the component have a nested hierarchy with a mixed content model?
  • Does a fielded form entry authoring interface suit the production staff?
  • Is the content the primary data source or is it supportive data that is to be hidden?
  • Are there multiple output style requirements, such as tabular for hardcopy and manuscript for on-line?

2.     What To Do With All This Data

Our customer recently presented us with a large document that consisted of one large multiple-page running table. Each row of the table spanned two printed pages and contained content descriptors for the particular item mentioned in columns 1 and 2 of the row. Across the two pages were 70 unique descriptors. Forty of the descriptors were subordinate to nine higher level descriptors. These nine descriptors did not have specific values, but were descriptive headings that categorized the subordinate descriptors. A page snapshot of the table and its organization can be seen in Figure 1 (This figure is not intended to be readable, but instead to show the size and complexity of the tabular information).

The values for the 70 descriptors ranged from free-text entry to very controllable content values. Many of the descriptors used one or two character representations of the actual values. Provided at the top of every page in the book were six table legends that provided value equivalences for each descriptor column. These legends can also be seen at the top of Figure 1. It was immediately evident to us that this book took a great deal of time to produce and was moderately difficult for the reader to use. The reader had to follow the rows of small print across two pages and then refer to the legends to decipher the actual values. Improving readability was one area on which we would concentrate.


Figure 1 - Sample Snapshot of 79 Column Descriptor Table

Each row spans two facing pages. The columns represent seventy unique content descriptors. Forty columns are subordinate to nine additional content descriptors.

The entire table uses short, abbreviated content that represents longer values that can be found in six table legends.

The original file format for the data was Microsoft Word. The data was broken into individual pages within one file. The descriptors were delimited with tabs or spaces depending on what part of the table was being examined. The table legends and column headers were produced on the printed page master. The data was overlaid onto the master pages during the production process. We wanted to improve both the authoring and the maintainability of the data, while providing an easier production mechanism. While this effort was to prove the concept of electronic maintenance and distribution, we also had to take into account the continuing need for hardcopy production.

In designing this application, we tried many different alternatives. Due to the hierarchical appearance of the nine higher level descriptors and our experience using content element tagging schemes, we immediately tried an element model solution. While the data was presently managed and printed in a table format, we decided that representing the data in an SGML table model would require too much overhead and would not offer us flexibility for different types of distribution methods that would be easier for the reader to use. We did not want to recreate the same problems that presently existed. At the same time we evaluated the form that the content would take in the SGML model. We had to determine if the data should contain the short character representation or the fully resolved legend value.

After trying different element combinations and authoring interfaces, we decided that the element method resulted in cumbersome interfaces and generally made it too difficult for the author to deal with each of the 79 sequential elements. More importantly, the element method did not offer any character-level enforcement. Since 90% of the descriptors had a specific value choice, we found that using elements posed too many opportunities for erroneous entries. Additionally, having the short character values replaced with fully resolved text values increased the possibility of erroneous entries. Expecting authors to type the valid choice correctly every time was unrealistic.

The better solution was to use attributes to store the values of most of the 79 descriptors. Six of the descriptors were best suited to remain as free-text entries modeled using elements. The nine heading descriptors did have the appearance of hierarchical containment, but were really more for visual separation than for true content enforcement. Therefore the nine heading descriptors and the 64 value descriptors were all modeled using attributes. Each attribute was declared either as CDATA, for free-text entry, or as a list of value choices, with defaults in some cases. The only remaining questions were how to split the attributes up and how different authoring interfaces would work with a very lengthy attribute list.

The data model was quite simple. Each table row item had its own content model that captured the data from the first six descriptors in elements. Figures 2, 3, and 4 show the final breakout of the attributes. For example, Index Number 10 had six elements of content followed by three empty data elements called <DATA.1>, <DATA.2>, and <DATA.3>. These data elements carried all of the attributes broken down 1-17, 18-44, 45-73, respectively. The attributes were divided into these three elements based on a natural subject distinction and the division in the existing page placement. The nine heading descriptors were given attribute values that were simply placeholders to show separation of the attributes in the authoring tool.


Figure 2 - <DATA.1> Attributes

Figure 3 - <DATA.2> Attributes

Figure 4 - <DATA.3> Attributes

The SGML instance for Index Number 10 looked like the following with the attributes hidden:
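
A minimal sketch of the shape such an instance could take, assuming hypothetical names (ROW, DESC.1 through DESC.6) for the row container and the six content elements. Only the three empty data elements are named in this paper, and the placeholder text stands in for the free-text content.

```sgml
<ROW>
<DESC.1>(free text)</DESC.1>
<DESC.2>(free text)</DESC.2>
<DESC.3>(free text)</DESC.3>
<DESC.4>(free text)</DESC.4>
<DESC.5>(free text)</DESC.5>
<DESC.6>(free text)</DESC.6>
<!-- The three empty elements carry attributes 1-17, 18-44 and 45-73.
     The attribute specifications are not shown here, so these tags
     would not validate exactly as written. -->
<DATA.1>
<DATA.2>
<DATA.3>
</ROW>
```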


Each of the 73 attribute values was defined according to the appropriate choices that existed for that value. These values were derived by referring to the supporting legend for the possible values. For simplicity, we decided to use the short character representation of the value so the author could continue using the same reference mechanism that was already in place. By providing a list of choices, we were able to eliminate mistyped and out-of-range entries.

All of the attributes were required or defaulted. By requiring the attributes, we were able to automatically enforce a certain level of content structure for an entire string of data. When the data elements are first inserted, the attribute values appear as "unspecified" in the authoring tool's attribute editing window. Because the attributes are required, the SGML file cannot be validated until a value is selected for each, even if the value is U for unknown. Working with a list of required attributes is much simpler than having to input the same number of required elements within a substructure. At the same time, the SGML parser and the authoring tool enforce the attribute content. A small excerpt from the DTD for <DATA.1> is shown below:

   <!ATTLIST DATA.1
   attr1  (L1|M1|S1|V1|U1)  #REQUIRED
   attr2  (CN2|CB2|CT2|RN2|RB2|RT2|LC2|OR2|TH2|U2)  #REQUIRED
   attr3  (E3|G3|F3|P3|N3|U3)  #REQUIRED
   attr4  (attr4)  "attr4"
   attr5  (Y5|N5|U5)  #REQUIRED
   attr6  (Y6|N6|U6)  #REQUIRED
   attr7  (Y7|N7|U7)  #REQUIRED
   attr8  (Y8|N8|U8)  #REQUIRED
   attr9  (Y9|N9|U9)  #REQUIRED
   attr10 (A10|B10|C10|D10|E10|F10|G10|H10|J10|K10|L10|M10|N10|O10|P10|Q10|U10)  #REQUIRED
   attr11 (A11|B11|C11|D11|E11|F11|G11|H11|J11|K11|L11|M11|N11|O11|P11|Q11|U11)  #REQUIRED
   attr12 (A12|B12|C12|D12|E12|F12|G12|H12|J12|K12|L12|M12|N12|O12|P12|Q12|U12)  #REQUIRED
   attr13 (A13|B13|C13|D13|E13|F13|G13|H13|J13|K13|L13|M13|N13|O13|P13|Q13|U13)  #REQUIRED
   attr14 CDATA  "U"   -- number or U --
   attr15 (L15|M15|U15)  #REQUIRED
   attr16 (Y16|N16|U16)  #REQUIRED
   attr17 (Y17|N17|U17)  #REQUIRED >

Note: In the original data many descriptor entries were left blank to represent a value of unknown. Blanks were represented by two consecutive tabs or by two consecutive spaces, depending on the column(s). Instead of a blank entry we chose to use the letter U to represent unknown, which simplifies the examination and use of the SGML data. During conversion we replaced each blank with the U value. This process proved difficult since both tab-delimited and space-delimited conventions were used to represent the original data. Multiple conversion passes were needed to place each known and unknown value into the appropriate attribute descriptor.

Attributes 1, 2, and 3 all have different value choices based on the values provided in the legend. Attribute 4 is the first heading descriptor; its only value choice is the attribute name, which is also its default. Attributes 5-8 are described by the Attribute 4 heading descriptor. Attributes 10-13 offer the largest selection of values, with 17 choices each, represented by the characters A-H, J-Q, and U. Each of these letters corresponds to a range of numeric values described in the legend. Attribute 14 is CDATA and can contain a numeric value or the letter U, with the letter U as the default selection. Attributes 15-17 offer unique selections, including the most common value choices throughout the entire attribute list: (Y)es, (N)o, or (U)nknown. The value choices, regardless of their element container <DATA.1>, <DATA.2>, or <DATA.3>, have a unique identifier included in the value, which in this example application is the attribute number. Some type of unique identifier had to be included because SGML does not allow the same value choice to appear in the value lists of two attributes declared for the same element. While these unique identifiers make the data look more complex, the true value can be found in the letters that precede the number. Dealing with this SGML limitation is a common situation when using many attributes in an SGML application.
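
The restriction can be sketched with a two-attribute list. The element name sample and the attribute names flagA and flagB are hypothetical. The rule applies to the tokens in value-choice groups: a token may occur only once in an element's attribute definition list, which is what lets SGML's SHORTTAG feature accept a bare value with the attribute name omitted. A CDATA attribute such as attr14 is not affected, which is why its default can be a plain U.

```sgml
<!-- Not valid: the tokens Y, N and U each occur in two groups of the
     same attribute definition list.

     <!ATTLIST sample
               flagA  (Y|N|U)  #REQUIRED
               flagB  (Y|N|U)  #REQUIRED>
-->

<!-- Valid: a suffix makes every token unique within the list. -->
<!ELEMENT sample  - O EMPTY>
<!ATTLIST sample
          flagA  (YA|NA|UA)  #REQUIRED
          flagB  (YB|NB|UB)  #REQUIRED>
```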

3.     Attribute Authoring

Every SGML authoring tool provides different interfaces for editing attribute values. Most of the tools provide a special attribute window that pops up when requested or when a required attribute value must be entered. This window normally contains an entry field for each of the attributes attached to that particular element. By selecting the attribute field or tabbing through the fields an author can enter a value or can select a value from a list of choices, depending on how the attribute has been defined. In our application, we wanted to take advantage of the attribute editing capabilities provided by COTS SGML tools. The result was an entry screen that resembled a fully-functional, content checking database entry form.

We tried out our attribute application with two or three authoring tools. Each had its pros and cons, but one of the easiest to use is SoftQuad A/E. The attribute window works as described above and is even smart enough to build the screen differently when the application has many attributes. A screen shot of A/E displaying the <DATA.2> attribute list is shown in Figure 5. All of the attributes are laid out in two columns with a pull-down list for each attribute. The figure shows the value list for one of the attributes inside the circle.


Figure 5 - A/E Attribute Editing for <DATA.2> Element

Notice the appearance of Attributes 21, 28, 32, and 39. These attributes are four of the nine heading descriptors, whose only value choice is the attribute name, which is also the default. In the attribute window, each of these attributes has its name filled in automatically because of the default value. Using a heading attribute in this way helps the author see the separation between groups of attributes, giving an implied visual hierarchy to the attribute fields. Additionally, when the SGML is exported, default values are not written into the SGML instance, providing a cleaner and smaller instance file.
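
The effect of a defaulted heading attribute can be sketched as follows. The declaration for attr21 follows the attr4 pattern in the <DATA.1> excerpt; the value choices shown for attr22 and attr23 are hypothetical, the real <DATA.2> list covers attributes 18-44, and the tag minimization parameters assume the default OMITTAG YES.

```sgml
<!ELEMENT DATA.2  - O EMPTY>
<!ATTLIST DATA.2
          attr21  (attr21)       "attr21"   -- heading: single choice, defaulted --
          attr22  (Y22|N22|U22)  #REQUIRED  -- hypothetical value choices --
          attr23  (Y23|N23|U23)  #REQUIRED  -- hypothetical value choices --
>

<!-- An exported start-tag can leave attr21 out; the parser supplies the
     default when the instance is read back in:

     <DATA.2 attr22="Y22" attr23="U23">
-->
```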

Even though the entry fields look cluttered with the attribute number included in the value, the author was still able to concentrate on the first character of the value to pick out the important information. When a field is selected and the pull-down list appears, a single keystroke of the letter will automatically select the desired value choice. The UP and DOWN arrow keys can also be used to cycle through the available values. When multiple choices are available, these built-in entry conventions speed the process of filling out or updating the form.

4.     Output Styling

One of our goals was to make the output styling process as automated as possible. Since we were focusing only on output for electronic delivery, we were able to be more creative. The EBT DynaText SGML browser and related styling tools provided a great deal of functionality, allowing the output to be modified on the fly and providing hyperlinking behaviors that would help us integrate this electronic book with other data sets. In the electronic delivery medium we were able to customize the data for easy reading and functional use without being restricted by the original tabular print delivery format. A sample output of the attribute data is shown in Figure 6.


Figure 6 - Sample Output of Attribute Data Designed for On-line Delivery

First, we changed the output to a manuscript format that was easier to read, with recognizable descriptor headings and database-like content formatting. Attribute names are commonly abbreviated into somewhat cryptic forms because of the naming rules imposed by the SGML standard. In the style sheet, each abbreviated attribute name was mapped to a more descriptive title that had specific meaning to the reader. The attributes that referred to table legends had their fully resolved values inserted automatically by the style sheet as well. Printing the fully resolved legend value helps the reader speed to the appropriate descriptor name and immediately see the value. Since the descriptor values are the most important piece of information on the screen, the values have a slightly larger font size and are bolded for emphasis. Unknown values and their descriptors' headings are dimmed so as not to distract from the known values. Color is also used to help attract the reader's eyes to important information.

Some of the content elements refer to other documents. These elements were given hyperlinking behaviors that, when clicked, take the reader to a section of another document. By virtue of moving to an electronic delivery method, the reader automatically gains the ability to search through the data set quickly and to have only the pertinent data displayed on screen. In addition to improved display formatting, the document indexes and tables of contents are reconstructed to give the reader hierarchical access to data items as well as alphabetical listings by item name.

5.     Technical Considerations of Attribute Use

SGML imposes some rules on how attributes can be used. First of all, the SGML declaration controls the attribute name length and the number of attributes. For the application described above, we had to modify the declaration to allow an attribute count greater than 75. Since attribute names, like element names, are usually more useful if they can be longer than eight characters, the name length was also increased. One of the major limitations of SGML is that it does not allow the same value choice to appear in the value lists of different attributes attached to the same element. To deal with this rule, some type of modifier can be added to distinguish one attribute's value choices from another's.
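
A sketch of the kind of change involved, shown as the quantity set fragment of an SGML declaration. The numbers 256 and 32 are illustrative assumptions, not the values used in the application. In the reference quantity set, ATTCNT (default 40) counts both the attribute names and the value-choice tokens declared for one element, so a list with many enumerated values can need a limit well above the number of attributes; NAMELEN (default 8) bounds the length of names and name tokens.

```sgml
QUANTITY SGMLREF
         ATTCNT   256   -- reference value is 40 --
         NAMELEN   32   -- reference value is 8 --
```

Every tool in the processing chain has to read the same modified declaration, since a parser that still uses the reference quantities will reject the DTD.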

SGML processing tools deal with attributes in many different ways. While SGML tools all follow the SGML standard in what they allow for attributes, their interfaces for editing values and their styling functions based on attribute values may handle attributes very differently. Some authoring tools have very flexible interfaces for editing values or selecting value choices. Some attribute input screens may change to accommodate different numbers of attributes on the screen at once, while some may only allow one attribute to be edited at a time. Some tools may use keyboard shortcuts to speed data entry, while others may act as dumb entry fields. Output styling tools may use attributes to control text-before and text-after character output; to populate variables that can be used in page headers; to control hyperlinking behavior; or to suppress data based on a specific attribute value. It is very important when evaluating the use of attributes to assess the authoring environment and distribution requirements.

6.     Conclusion

Attributes have a very important place in the world of SGML. Many times people shy away from using attributes because they are sometimes cumbersome to use, may be missed during authoring, or the data is better served by using elements. However, we all recognize that attributes serve many useful purposes. If your data does have fixed components with repetitive value choices, then attributes may be useful. If your data components have content values that are defined and must be managed down to the character level, then attributes may be useful. If your data component is very flat with no mixed content models, then attributes may be useful. If your authors desire a simple database-like fielded entry screen, then attributes may be useful. If your data is supportive information that is not to be printed or viewed, but used for controlling print or view functions, then attributes may be useful.

As SGML application developers, let's begin to push SGML into places where, before now, it has not made sense. SGML processing tools have improved over the past few years and have given us the power to deal with parts of the SGML standard that have been shunned. This isn't magic, it's just using SGML smartly!

© Copyright 1995-1997 by Science Applications International Corporation