In this paper we outline how the use of design principles had a beneficial impact on the modelling of hierarchy, object information, meta-information and shape information in electronic documents.
In contrast to the design of relational database systems, there is no such thing as "normalization", or any other generally accepted design principle, in the field of document databases.
Yes, there is Microstar's CADE course. CADE stands for Computer Aided Document Engineering. The summary of this so-called methodology is: gather all information elements, throw them into Microstar's Near & Far DTD design tool and then make 101 reports for your supervisor.
Yes, there is an upcoming book by Jeanne El Andaloussi from Bull and Eve Maler from Digital, who worked together on the development of the Open Software Foundation DTD. But it is not yet available.
So, what is left for the (aspiring) information architect? Looking at and studying other people's DTDs, such as the TEI, IBMIDDOC, OSF, DOCBOOK and other public domain DTDs, and being prepared for a lot of learning by trial and error.
The essence of our own experiences will be covered in the next sections.
Although you are sure you have the best tool to describe structure and content in documents (i.e. SGML of course), the raw data you have to work with at a customer's site is sometimes so badly and inconsistently structured that you end up with very loose DTDs.
It is our opinion that SGML-ing doesn't make sense if the legacy data are themselves too badly structured.
But how do you improve the raw material?
I remembered the term "Information Mapping" from my cognitive science background. We tried to find some books on the subject, ... and we ended up taking an IMAP course. Besides teaching us a framework to structure information, the IMAP methodology itself was just asking to be SGMLed.
The SGML application had to facilitate:
writing and editing of information, using the known IMAP terms and principles
publishing to different media: hard copy, disk, CD-ROM, World Wide Web, Braille, large print, etc.
information retrieval
reuse of information
Therefore our guiding design principle was: "Name everything by name".
In a DTD you describe the structure and content of your information.
This can be divided into four classes:
hierarchy
objects
meta-information
shape of information
For the hierarchy, you have to define the high-level organizing containers within which the other elements are contained. In other words, you describe the organizational structure of your information. Typical containers are: book, part, chapter, section, etc.
For objects, you describe the structures and properties of real-world objects. E.g. a motor has different parts. Each part has a number, a name, a description, a function, etc.
It is clear that the more you can describe the information in object terms, the better your chances of supporting efficient information retrieval.
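The retrieval benefit of object terms can be illustrated with a minimal sketch in Python. The `Part` record mirrors the properties named above; the sample parts and the helper `find_parts` are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass

@dataclass
class Part:
    # Properties of a real-world object, as named in the text.
    number: str
    name: str
    description: str
    function: str

# Hypothetical sample data for a motor.
PARTS = [
    Part("P-001", "rotor", "rotating core", "turns the shaft"),
    Part("P-002", "stator", "fixed outer ring", "creates the field"),
]

def find_parts(parts, **criteria):
    """Return the parts whose named properties equal the given values."""
    return [
        part for part in parts
        if all(getattr(part, key) == value for key, value in criteria.items())
    ]

for part in find_parts(PARTS, name="rotor"):
    print(part.number, part.function)
```

Because each property is named, a query can ask for "the part called rotor" directly, which is not possible when the same data sit in anonymous table cells or paragraphs.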
Besides the information itself, most of the time you also want to describe information about this information (meta-information). E.g. who wrote it, which version is it, what security level does it have, for which operating system was it created, etc.
Finally, you can describe the presentation form or appearance (the shape) of information, e.g. paragraph, list, table, emphasis, etc.
The hierarchy was first described as follows:
a book starts with a booktitle, followed by more than one part
a part starts with a parttitle, followed by more than one chapter
a chapter starts with a chaptertitle, followed by more than one section
a section ....
Is it necessary to name each title explicitly?
No. Because you are working in a hierarchy, you and all SGML-aware software know what the context of a title is; you know the containing element. It is obvious that the first element with generic identifier "title" inside a chapter is a chaptertitle.
Nor does this explicit naming add retrieval value. A user neither knows nor cares whether what he is searching for is in a sectiontitle or in a subsectiontitle.
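That the context of a title is always recoverable can be shown with a small sketch. It is written in Python with the standard `xml.etree.ElementTree` module and uses XML syntax as a stand-in for SGML; the sample instance and the helper `title_roles` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample instance; XML stands in for SGML here.
SAMPLE = """
<book>
  <title>Design Principles</title>
  <part>
    <title>Foundations</title>
    <chapter>
      <title>Hierarchy</title>
    </chapter>
  </part>
</book>
"""

def title_roles(root):
    """Return (role, text) pairs, deriving each title's role from its parent."""
    roles = []
    for parent in root.iter():
        title = parent.find("title")  # first direct child named "title"
        if title is not None:
            roles.append((parent.tag + "title", title.text))
    return roles

root = ET.fromstring(SAMPLE)
for role, text in title_roles(root):
    print(role, "->", text)
```

The explicit names (booktitle, parttitle, chaptertitle) are computed from the containing element, so nothing is lost by using a single title element in the DTD.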
The hierarchy, second version:
a book starts with a title, followed by more than one part
a part starts with a title, followed by more than one chapter
a chapter starts with a title, followed by more than one section
a section ....
But one of our design goals was the reuse of information. What happens if I want to use a chapter of my first book as a section in another one? It doesn't fit hierarchically.
The hierarchy made flexible:
a book starts with a title, followed by more than one division
a division starts with a title, followed by more than one division OR "divisioncontent" (ESCAPE from the recursive definition of a division).
Leave it to the processing afterwards to give the division a name.
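As a minimal sketch of naming divisions at processing time, the following Python fragment walks nested division elements and names each one by its depth. XML syntax stands in for SGML; the level names, the sample instances and the helper `name_divisions` are assumptions made for illustration, and the samples are not validated against our DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical names for each nesting depth, assigned at processing time.
LEVEL_NAMES = ["part", "chapter", "section", "subsection"]

def name_divisions(element, depth=0, out=None):
    """Walk nested <division> elements and name each one by its depth."""
    if out is None:
        out = []
    for division in element.findall("division"):
        level = LEVEL_NAMES[min(depth, len(LEVEL_NAMES) - 1)]
        out.append((level, division.findtext("title")))
        name_divisions(division, depth + 1, out)
    return out

BOOK = ET.fromstring(
    "<book><title>Guide</title>"
    "<division><title>Basics</title>"
    "<division><title>Setup</title><divisioncontent/></division>"
    "</division></book>"
)

# The same "Basics" division, reused one level deeper in another book.
OTHER = ET.fromstring(
    "<book><title>Manual</title><division><title>Intro</title>"
    "<division><title>Basics</title>"
    "<division><title>Setup</title><divisioncontent/></division>"
    "</division></division></book>"
)

print(name_divisions(BOOK))
print(name_divisions(OTHER))
```

The reused division is a "part" in the first book and a "chapter" in the second, without any change to its markup.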
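In this same context of reusing containers of information, be careful when using the "#CURRENT" keyword as an attribute default. With #CURRENT, an omitted attribute takes the most recently specified value, so a reused container can silently pick up a value from its new surroundings.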
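The risk can be made concrete with a small simulation in Python. It assumes the usual reading of #CURRENT, namely that an omitted attribute takes the value most recently specified for that attribute; the attribute values and the helper `resolve_current` are hypothetical.

```python
def resolve_current(values):
    """Resolve #CURRENT-style defaults for one attribute.

    `values` lists the attribute as written on successive elements;
    None means the attribute was omitted and inherits the most
    recently specified value.
    """
    resolved = []
    current = None
    for value in values:
        if value is not None:
            current = value
        elif current is None:
            raise ValueError("the first occurrence must specify a value")
        resolved.append(current)
    return resolved

# Hypothetical security attribute on four divisions of the first book.
first_book = ["public", None, "secret", None]
print(resolve_current(first_book))

# The second division (value omitted) is reused after a "secret" one.
second_book = ["secret", None]
print(resolve_current(second_book))
```

In the first book the division with the omitted value resolves to "public"; reused in the second book, the same markup resolves to "secret".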
Name it by name, but not the hierarchical containers.
This is what Information Mapping is all about. A map describes the concept and/or the structure, the facts, etc. of an object.
Moreover, for every information type the method gives you all the elements needed. E.g. for the description of a structure you can describe in text what it looks like, you must have a diagram of the object, and you can have a table enumerating all the parts, with subelements such as partnumber, partname, partdescription, partfunction, etc.
What you see in some recent DTDs is that the higher-level container elements start with an optional subdivision in which meta-information can be stored. E.g. the OSF DTD starts with a meta-element for describing document information, product information and a variety of notices.
Figure 1. a meta-element instance
On a lower level an element's meta-information is stored in attributes. E.g. the IBMIDDOC DTD.
Figure 2. attributes on an element of the IBMIDDOC DTD
Some considerations in this context:
Content processing is easier on element content than on attribute content.
Do not overdo it. The information architect can have the most brilliant ideas; don't forget it is the poor editor who must enter all the required data.
We implemented meta-information only on a limited scale, because we believe that meta-information must be customized to the needs, the wishes and, most of all, the budget of the client.
So, on the highest level we have only an optional element to store independent hypertext links (HyTime).
And on the element level we use only a few attributes, mainly for the purpose of hypertext linking.
Since information shape elements do not add any information retrieval value, their use must be postponed until no other way of describing the element is available.
Use the paragraph element only if you have more than one visual block inside the container element. In several DTDs the para element is used redundantly, which only frustrates the editor.
Figure 3. the DTD used for SGML BeLux
Elements which can't be described any further as objects can have as content:
#PCDATA
#PCDATA with in-line object information
more than one paragraph or stemmed list
#PCDATA with in-line object information, or more than one paragraph or stemmed list
Although the IMAP methodology makes heavy use of tables, we didn't use a generic table shape element.
Since one of our design principles was to guide the writer/editor completely according to the principles and terms of the IMAP methodology, we always used the IMAP names.
E.g. a decision table consists of a list of "rules". Each rule has an if-part and a then-part. The writer will be using these terms while editing the information. It is by the use of architectural forms that the if-part and the then-part are mapped to a cell element and the rules are mapped to a row element.
This advantage on the editing and retrieval side can have a drawback on the formatting side when the processing software doesn't have facilities to do some formatting on the basis of architectural forms. Then you have to format all those named elements with a table/row/cell-shape instead of one generic shape element.
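When the processing software has no support for architectural forms, the mapping can be applied in a separate step. The following Python sketch assumes a simple lookup table in place of real architectural form declarations; the element names (`decisiontable`, `rule`, `if`, `then`), the sample instance and the helper `to_generic` are hypothetical, and XML syntax stands in for SGML.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping of IMAP-named elements to generic table forms.
ARCH_FORMS = {
    "decisiontable": "table",
    "rule": "row",
    "if": "cell",
    "then": "cell",
}

def to_generic(element):
    """Return a copy of the tree with named elements replaced by their form."""
    tag = ARCH_FORMS.get(element.tag, element.tag)
    generic = ET.Element(tag, dict(element.attrib))
    generic.text = element.text
    generic.tail = element.tail
    for child in element:
        generic.append(to_generic(child))
    return generic

DECISION = ET.fromstring(
    "<decisiontable>"
    "<rule><if>the part is worn</if><then>replace it</then></rule>"
    "<rule><if>the part is dirty</if><then>clean it</then></rule>"
    "</decisiontable>"
)

print(ET.tostring(to_generic(DECISION), encoding="unicode"))
```

After this step a formatter only needs one set of table/row/cell rules, while the writer and the retrieval software keep working with the named elements.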
We never felt the need to use inclusions. Furthermore, recent postings on comp.text.sgml have highlighted some severe drawbacks of the use of inclusions.
Normally one uses the processing software to autonumber elements, so you don't need to tag, e.g., a separate step number for each step in a procedure.
We did, because in the IMAP methodology a procedure is explained in a procedure table. So for each step (row) we used an empty step.number element, both to map it to the cell element of the architectural form and, of course, to do the autonumbering inside this cell.
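A minimal sketch of this autonumbering in Python follows. Only the `step.number` element name comes from our DTD; the surrounding element names, the sample instance and the helper `autonumber` are assumptions, and XML syntax stands in for SGML.

```python
import xml.etree.ElementTree as ET

# Hypothetical procedure table with an empty step.number per step (row).
PROCEDURE = ET.fromstring(
    "<proceduretable>"
    "<step><step.number/><action>Open the cover</action></step>"
    "<step><step.number/><action>Remove the filter</action></step>"
    "</proceduretable>"
)

def autonumber(table):
    """Fill each empty step.number element with its 1-based position."""
    for position, step in enumerate(table.findall("step"), start=1):
        # iter() compares tag names directly, so the dot needs no escaping.
        number = next(step.iter("step.number"), None)
        if number is not None:
            number.text = str(position)
    return table

autonumber(PROCEDURE)
print(ET.tostring(PROCEDURE, encoding="unicode"))
```

The writer never types a number; the empty element only reserves the cell in which the processing software places it.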
The first IMAP DTD was developed in 1990. It reflects the spirit of that time, using SGML more as a system-independent format description tool.
This DTD explicitly names structure containers and uses solely shape elements.
Pay attention to:
the recursive structure,
the object descriptions,
the use of shape elements,
the enhanced retrieval.
We would like to acknowledge:
Wayne Wohler (IBM)
Eliot Kimber (Passage Systems)
Erik Naggum (badly himself)
Erik Skinner (Exoterica)
the TEI-work
and most of all ExoteRika