[Mirrored from: http://www.eccnet.com/sgmlug/johnrice.html]

SGML Fundamentals and Design Issues

An overview of process and technique

John D. Rice

4901 7th Street North

Arlington, Virginia 22203

(703) 351-7203

Presented

February 15, 1996

Foreword

"The fundamental concept of design is to create a two- or three-dimensional world with conscious efforts of organization of various elements... to establish visual harmony and order."
- Wucius Wong (paraphrased)

The fundamental precept for an SGML application is to define a set of rules to describe a continuum of data in a harmonious and carefully governed manner.

1. Conceptual Fundamentals

The Document-centric Universe

A data collection may be seen as a continuum. That is, it grows, changes, evolves, yet remains united by some common thread. Typically, the document is described as the fundamental unit in that continuum.

Define "Document"...

In the past, the document was perceived as a concrete, physical object - a page or collection of pages containing some amount of information.

Information within the document was arranged to visually convey structure and order, demonstrate relationships between data, and communicate semantics (do NOT press button 13). In this way, document structure was implied through appearance.

In contrast, SGML defines documents explicitly in terms of their structure - not what data says or looks like, but what it represents. The document is broken down into individual data units, whose relationships are carefully defined and governed. SGML recognizes the generics: types of "things", their order, structure, and context.

Thus, the concept of "document" becomes something of an abstraction. A document might then be defined as a specifically delimited collection of interrelated data units, which for purposes of human consumption might be arranged, organized, combined, and displayed in different ways according to different requirements.

This perceptual shift is fundamental to the success of any SGML application development.

A document is a subset of a larger collection of information

A document is itself composed of distinct units of information

"document" is just a word to express a boundary
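
To make the shift concrete, here is a minimal sketch of the button warning from the earlier paragraph, marked up by what it is rather than by how it looks. The element names (task, title, step, warning) are hypothetical and not taken from any particular DTD.

```sgml
<task>
<title>Starting the unit</title>
<step>Switch on the main power.</step>
<warning>Do not press button 13.</warning>
</task>
```

Nothing here says "bold" or "boxed". Whether the warning is printed in capitals, shown in red on a screen, or pulled into a separate list of hazards is left to whatever processes the data.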

2. Composition Fundamentals

In its most basic form, an SGML application is composed of the following:

A DTD or set of DTDs which identify the structure of distinct document types existing within a data continuum;
An SGML parser which verifies that the DTD complies with ISO 8879, and which validates that tagged documents comply with the DTD; and
A processing system for manipulating and presenting your data in a variety of formats.
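
As a rough illustration of the first two pieces, the sketch below shows a tiny DTD (held in the internal subset of the document type declaration) followed by a tagged document that complies with it. The memo document type and all of its element names are invented for this example.

```sgml
<!DOCTYPE memo [
<!ELEMENT memo    - - (to, from, body) >
<!ELEMENT to      - - (#PCDATA) >
<!ELEMENT from    - - (#PCDATA) >
<!ELEMENT body    - - (para+) >
<!ELEMENT para    - - (#PCDATA) >
]>
<memo>
<to>Editorial staff</to>
<from>Production</from>
<body>
<para>The new DTD goes into use on Monday.</para>
</body>
</memo>
```

A validating parser should accept this instance, and should report an error for one that left out the from element or placed body before to, since the content model for memo requires all three in that order.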

Processing System?

The components of your processing system (context-sensitive editors, databases, text formatters, etc.) are the tools with which you manipulate and process your data.

SGML is designed to be used in a modular fashion. Just like Lego bricks, you're expected to attach other things to it.

This does not happen by magic.

Functionality is something designed into the DTD. Attributes, entities, notations: these structures can act as attachment points.

An important part of DTD design is to anticipate the needs of your processing and delivery systems and to incorporate the necessary "tabs and slots" into your DTD.

At the core of every SGML application lies the DTD

You don't attach an ID attribute to an element just because it seems like the thing to do.
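
A minimal sketch of what such attachment points might look like follows. Every name in it (section, xref, picture, logo, tiff, and the file name) is hypothetical; the point is only that each declaration exists because some downstream process needs it.

```sgml
<!-- Notation: tells the processing system how external data is encoded -->
<!NOTATION tiff      SYSTEM "TIFF" >

<!-- Entity: names an external file that a formatter can fetch -->
<!ENTITY   logo      SYSTEM "logo.tif" NDATA tiff >

<!ELEMENT  (title | para)  - - (#PCDATA) >

<!-- The ID is here because cross-references must be able to point to sections -->
<!ELEMENT  section   - - (title, para+) >
<!ATTLIST  section
           id        ID       #IMPLIED >

<!-- The element that uses the ID: a link a browser or formatter can resolve -->
<!ELEMENT  xref      - o EMPTY >
<!ATTLIST  xref
           target    IDREF    #REQUIRED >

<!-- The attribute value must be a declared entity, such as logo above -->
<!ELEMENT  picture   - o EMPTY >
<!ATTLIST  picture
           file      ENTITY   #REQUIRED >
```

In this sketch the id attribute on section is justified by the xref element that points at it. If no process ever needs to link to a section, the attribute is dead weight.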

3. Design Fundamentals - Objective

Thoughtful application design begins with a well defined set of goals.

How do we plan to use this data?
Who needs to work with it (authors, editors, users)?
How will they need to work with it (different classes of users)?
In what formats do we plan to present it?

Thoughtful DTD design is based upon comprehensive document analysis.

The goal of document analysis is to:

Divide your data continuum into distinct document types;
Identify and name all the parts of each document;
Understand the relationships between those parts; and
Describe those parts and their relationships in terms of your goals for their use.

I pressed the feeder bar, now where's my food pellet?

Traditionally, document analysis is performed by an SGML Expert with relatively limited user input. The Expert gathers samples, talks to authors and editors, then goes away and draws conclusions based on this information. The resulting DTD is then a product of the Expert's interpretations.

SGML Experts know SGML

Your people know your data

What's wrong with this picture?

An SGML implementation is likely to represent a significant cultural disruption. Furthermore, your authors will be forced to look at their data in new (and initially uncomfortable) ways. It will introduce new and frightening tools and practices which alter the status quo.

This is a relatively common dynamic:

NEW = DIFFERENT = BAD

Ignoring potential conflicts can only make your job more difficult. To ignore your staff as a resource is foolish.

Drawing your staff into the process of document analysis can drastically improve the functionality of your DTD. No less important, it is a way to empower them - to invest them with the feeling of involvement.

Document analysis as an act of catharsis

More modern document analysis techniques bring the SGML Expert together with authors, editors, and other key people who work with the data. By key, I mean the people who are working with your data on a daily basis. This type of analysis is performed as a facilitated discussion led by the SGML Expert. It is a process of consensus building. It is also a process of discovery.

Sometimes the truth hurts

Unfortunately, many people find that their data is not nearly as organized as they would like to believe.

Practices vary by author
"The book says we do it this way." "Yah, well..."
Concessions are made for purposes of print
What about that old box of documents in the back corner?

It is best to drag out such skeletons in the early stages. It is certainly better than finding out somewhere down the road.

4. Design Fundamentals - the DTD

ISO 8879 is quite strict about the proper use of SGML syntax. However, it is somewhat less concerned with how that syntax appears in the DTD. Data format and organization is not a concern if you are a parser. However, DTDs are written, used, and maintained by people.

All irony aside, DTD formatting pays.

Organize structures in a logical manner

DTD meta data

Include a header with your DTD. Header information should include things like the author's name, the creation date, specifics about the document type described by the DTD, etc. Don't forget to include a change history detailing modifications made, by whom, and when.
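
One possible layout for such a header is sketched below. The field labels and column headings are only a suggestion, and the parenthesized items are placeholders to be filled in, not real values.

```sgml
<!-- ============================================================= -->
<!-- DTD name:      (name of the document type)                    -->
<!-- Author:        (author's name)                                -->
<!-- Created:       (creation date)                                -->
<!-- Describes:     (what kind of document this DTD covers)        -->
<!-- ============================================================= -->
<!-- Change history                                                -->
<!--   Version   Date      By         Change                       -->
<!--   (n.nn)    (date)    (who)      (what was modified)          -->
<!-- ============================================================= -->
```

One practical caution: a pair of adjacent hyphens ends an SGML comment, so avoid writing two hyphens in a row anywhere inside the header text.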

Logical organization

Declare all parameter entities together at the top

Elements should be declared in the order in which they appear in the data. Define the contents of the larger structures as they are called in the content models.

Group low-level structures like emphasis and keyword together at the end of the DTD.
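
The ordering described above might produce a skeleton along the following lines. This is a sketch only: the manual document type, the inline parameter entity, and the element names are invented for illustration.

```sgml
<!-- ====  Parameter entities ==================================== -->
<!ENTITY % inline    "#PCDATA | emphasis | keyword" >

<!-- ====  Top-level structure =================================== -->
<!ELEMENT  manual    - - (title, chapter+) >

<!-- ====  Structures, in the order they appear in the data ====== -->
<!ELEMENT  title     - - (%inline;)* >
<!ELEMENT  chapter   - - (title, para+) >
<!ELEMENT  para      - - (%inline;)* >

<!-- ====  Low-level structures ================================== -->
<!ELEMENT  (emphasis | keyword)  - - (#PCDATA) >
```

A reader can follow the content model of manual downward and find each called element declared shortly after it is first used, with the small inline elements collected in one predictable place at the end.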

DTDs can be readable. Honest!

Using formatting strategies such as aligning declaration starts and ends, employing negative space (white space), and following a design strategy not only makes your work more readable, it makes it easier to maintain.

Comment, comment, comment

Remember, a DTD will be used by many people. Presumably, it will be updated or modified at some point. Liberal comments add another degree of functionality to your work.

Example 1.

                       <!-- First level list -->
<!-- V1.11, added optional title to all levels of lists -->
<!-- V1.11, added optional symbol before each item -->
<!-- V1.11, added figure, graphic to list1 through list5 content -->
<!ELEMENT list1        - - (title?, (figure | figureref | graphic)*,
                            symbol*, item, ((symbol*, item) | list2)*) >

                       <!-- Second level list -->
<!ELEMENT list2        - - (title?, (figure | figureref | graphic)*,
                            symbol*, item, ((symbol*, item) | list3)*) >

                       <!-- Third level list -->
<!ELEMENT list3        - - (title?, (figure | figureref | graphic)*,
                            symbol*, item, ((symbol*, item) | list4)*) >

                       <!-- Fourth level list -->
<!ELEMENT list4        - - (title?, (figure | figureref | graphic)*,
                            symbol*, item, ((symbol*, item) | list5)*) >

                       <!-- Fifth level list -->
<!ELEMENT list5        - - (title?, (figure | figureref | graphic)*,
                            symbol*, item)+ >

<!-- V1.11, make type and enumtype attributes optional on list -->
<!ATTLIST (list1 |
           list2 |
           list3 |
           list4 |
           list5)        type        (ordered | unordered)   #IMPLIED
                         %enum; >

Example 2.

<!-- ====  Graphic (the actual image file) ======================= -->
<!-- ============================================================= -->
<!ELEMENT  graphic         - o  EMPTY                                >
<!ATTLIST  graphic
           color                %yesorno;                         "0"
           height               NUMBER                      #REQUIRED
           width                NUMBER                      #REQUIRED
           type                 (BMP|CGM|EPS|GIF|
                                MIF|PCX|TIFF|WMF)            #IMPLIED
           graphicnum           CDATA                       #REQUIRED>

<!--       color................Is the graphic in color?
                                The default value is "0" (no)
           height...............Height of the graphic, in points
           width................Width of the graphic, in points
           type.................What type of graphic is this?
           graphicnum...........Unique identifier used to refer to
                                the graphic as a part of the
                                database                           -->

<!-- ============================================================= -->
<!-- =====================  Text Level  =========================  -->
<!-- ============================================================= -->

<!-- ====  Caution =============================================== -->
<!-- ============================================================= -->
<!ELEMENT  caution         - -  (%text.level;)+                      >

<!-- ====  Reference to a Chapter Number ========================= -->
<!-- ============================================================= -->
<!ELEMENT  chapter.num     - -  (#PCDATA)                            >
<!--           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                          Developer's Note:
                          ~~~~~~~~~~~~~~~~~
                 The chapter number should be
                 displayed with surrounding angle
                 brackets.  For instance "<26>"
               ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~              -->
