SGML Encoding for Technical Reports

James David Mason

Advanced Publishing Technology Section

Publications Division

Oak Ridge National Laboratory

As DOE facilities move toward incorporating SGML into their processes for generating and distributing technical reports, we need to ask ourselves many questions about how this is to be done. To be sure, many of our facilities will have to begin by simply making the acquaintance of SGML before we can think about advanced applications. However, good planning for the implementation of SGML includes consideration of future uses of documents. Today, that consideration should include far more than just generating paper documents; indeed, the greatest potential benefits of SGML implementation lie in the area of non-traditional applications, largely centered around data base retrieval systems, including hypertext and multimedia systems. As we approach future applications, we need to consider several aspects of the task, including the structures encoded, the flow of the information, and the data base tools available for retrieval.

In the past, many people thought of SGML as simply an extension of the way they used style sheets or macros in a word processor or desktop publishing or typesetting system. That is, they performed a document analysis that covered the traditional elements of publishing on a printed page. The elements they recognized were those that required some sort of formatting: reports were broken down into chapters, sections, and subsections; then those units were broken down into smaller text units like titles, lists made up of items, and then just plain paragraphs. Some of the well-known SGML applications went further and examined approaches to tables and equations. All of this analysis is valuable, because the elements that a print-based publishing approach has addressed since long before the introduction of computers-and before printing itself replaced the hand copying of manuscripts-are generally useful clues to the structure of the information they represent. All the conventional rules of typography, such as the selection of display fonts for titles or the indentation at the start of a paragraph, are intended to guide the reader through the document at hand.

Today, however, we have access to alternative delivery systems as well as to print, and many of these systems enable a nonlinear delivery of information that is quite different from what we have associated with print and its predecessors for the past several thousand years. Although "hypertext" and "multimedia" have achieved the status of buzzwords, there is more to them than simple trendiness or the tendency to do something new just because it has suddenly become possible. We might consider, for example, the growth of the hypertext paradigm for delivery of computer documentation. When online documentation started to become available, except for the inclusion of an occasional keyboard chart, it tended to be a replication of the printed manuals, which the user could page through as though it were paper. Current systems, such as Microsoft Windows, are increasingly turning to hypertext-based help, which allows the user to navigate through online documents that are completely unlike the printed manuals. Although Microsoft assists third-party software vendors in creating manuals for their products, the vehicle is still a proprietary one. The Open Software Foundation, in contrast, has chosen to go one step further and build future documentation for its version of UNIX in a non-proprietary hypertext format using HyTime (ISO 10744), a new standard based on SGML.

The significance of the growth of hypertext for delivery of scientific and technical information is not just that it provides a new, high-tech interface for information delivery but that it reflects the need to rethink the process of information delivery itself. Hypertext may in some cases be an appropriate delivery mechanism for technical data. What HyTime calls the "bibliographic model" for information is already implicit in much technical literature: researchers have always included references to previous publications. The hypertext model simply implies that in a total electronic library those documents would be directly accessible from within a publication, rather than by locating a separate physical document. But the hypertext model also suggests that other information might be accessible, only a mouse click or two away. Should a reader be able to look at the whole corpus of experimental data that lies behind an article, and not just what the author chose to print? That is, should a mouse click on an on-screen graphic lead down a hyperpath that ends ultimately at the researcher's electronic laboratory notebook?

Without getting enmeshed in the legal, ethical, and cultural issues implicit in future electronic systems for delivering information, we need to ask some basic questions about what gets captured electronically and in what form before we go too far in building our systems. SGML, HyTime, and other standards for document management and interchange do not work by magic. Before either the traditional publications structures for paper documents or the hypertext wormholes through the black holes of laboratory notebooks can be created, someone must build the Document Type Definitions (DTDs) that establish the structures for coding documents.

What do we code?

In an SGML system of any sort, the DTDs are the keys to all other functions. What the user can edit, print, retrieve, or otherwise access is guided by the DTD. In the view that many users in the past have had of SGML, the DTD establishes the "tag set" or list of things that can be identified. The DTD does that by creating the generic identifiers for elements, but it also can define the relationships among those things, some of their properties, the types of non-SGML content that can be processed, boilerplate text, external references, special-character repertoires, and processing interrelationships among elements. In short, the DTD must prepare for all future environments in which the document must live.
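
A brief sketch suggests the range of declarations involved; every name below is invented for illustration and is drawn from no actual DOE application:

    <!-- A minimal, hypothetical report DTD fragment -->
    <!ELEMENT report  - - (title, abstract?, section+) >
    <!ELEMENT section - - (title, (para | list)*, section*) >
    <!ELEMENT list    - - (item+) >
    <!ELEMENT (title | abstract | para | item) - O (#PCDATA) >
    <!-- Attributes capture properties beyond the element's name -->
    <!ATTLIST report  security  (unclassified | secret) unclassified
                      docnumber CDATA #REQUIRED >
    <!-- An entity can pull boilerplate text in from elsewhere -->
    <!ENTITY disclaimer SYSTEM "disclaimer.sgm" >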

Identifying general classes of data

The questions raised by hypertext are just one example of the issues that arise when an implementer starts document analysis and design of a DTD. Hypertext raises questions of special structures as well as the identification of content. But similar questions can be raised without getting into nonlinear documents.

Because SGML can enable identification, or "tagging", of elements that are not visually identified in print-based publishing systems, it leads to the identification of a variety of things that range from pure ideas (such as labeling the logical parts of argumentation-an experimental hypothesis or a corollary conclusion-or a special piece of information-a patent claim that is subsidiary to the primary publication of research) to connections with external processes. These latter connections tend to fall into the realm of "metadata", or information about information. What programmatic source funded this research? What data is available for a hazardous-materials audit? What software release was running on the data-collection-and-analysis system? Who signed off on the internal review forms before the report was released to the journal? At the current level of development in Oak Ridge, we are still just beginning to ask what questions our DTDs should raise.
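
A hedged sketch of what declarations for such metadata might look like follows; the element and attribute names are invented, not taken from any existing Oak Ridge DTD:

    <!-- Hypothetical metadata answering the questions above -->
    <!ELEMENT metadata  - - (funding*, swrelease?, review*) >
    <!ELEMENT funding   - O (#PCDATA) > <!-- programmatic source -->
    <!ELEMENT swrelease - O (#PCDATA) > <!-- software release on the
                                             data-collection system -->
    <!ELEMENT review    - O EMPTY >     <!-- internal review sign-off -->
    <!ATTLIST review    signedby CDATA #REQUIRED
                        date     CDATA #REQUIRED >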

A second class of questions about identification of information involves the semantics of tagging. SGML itself identifies no semantics for tags. So long as the tags in a DTD follow SGML's restrictions on the composition and lengths of names, an SGML processor is just as happy seeing an object called "N42Z971" as it is when it gets "TITLE". The semantic association between a generic identifier and a perceived meaning on the part of the user is largely in the user's mind, unless a secondary standard like HyTime, which defines semantics for some SGML constructs, is also implemented. In most cases, an important consideration in choosing tag names is convenience for the end user, whether that user is the person who creates and tags the document in the beginning or someone who later tries to extract the document from a retrieval system.
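
The two declarations below, for instance, are identical in every way that matters to a parser; only the human reader finds the second more meaningful:

    <!ELEMENT N42Z971 - O (#PCDATA) >
    <!ELEMENT TITLE   - O (#PCDATA) >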

Special notations

If tags are meant to capture the actual ideas in documents, the DTD designer can quickly get into areas of great ambiguity. For example, our general approach to mathematics is compromised by the notations we have selected to convey ideas. Mathematical concepts themselves are independent of any notation, but the formula notation most mathematicians use is considerably more compact and easy (for them) to follow than spelling out concepts entirely in words. But how do we capture this notation in SGML? The notation is highly structured, but it is structured as much by graphical conventions as by the ideas it conveys. The languages that have been used to capture mathematical notation for printing systems (e.g., TeX and eqn) use graphical functions, such as superscripting, to represent many different concepts. Shall we follow that convention in defining a mathematical DTD (as most of the DTDs to date have done), or shall we identify all the individual concepts, like exponentiation, that superscripting can represent? If we choose the latter, we may enable more precise searches of the data base, but we also run the risk of creating a tag set so large that it becomes unusable. So long as we have programs that require a user to type in codes, such as any of the various programs that have adopted some dialect of eqn (e.g., Interleaf, Ventura, WordPerfect) or the various SGML schemes that follow this model of coding mathematics, we are expecting users to make explicit identifications of functions. We can devise a DTD that calls for specific semantic functions, like EXPONENT or UPPERLIMIT. But what if we assume a purely graphical editor for mathematics, a point-and-click entry system? Some of the graphical formula-editing programs, like Expressionist and MathType, can enforce the hierarchical structure of mathematics as presented in images just as certainly as the internal parser in TeX. But what would enable such a program in an SGML environment to help the user select, among the many possible reasons a character may be graphically represented in a superior position, the one actually meant?
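
The choice can be sketched with two codings of the same expression, x squared; all element names here are invented for illustration:

    <!-- Graphical coding: records only the presentation -->
    x<sup>2</sup>

    <!-- Semantic coding: records the concept the superscript represents -->
    <power><base>x</base><exponent>2</exponent></power>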

Tables present further possibilities for rethinking how graphical presentation has shaped our thinking about the underlying data we wish to convey. We tend to put data into tables when we have large quantities of data, or data with complex relationships, or simply an assortment of data whose relationships cannot easily be expressed in words. We have learned to let the presentation of data in a graphical array serve as a means for showing relationships we would have difficulty expressing otherwise. Any one of these conditions that cause us to present data in tabular form can make coding mechanisms-whether in SGML or some other notation-hard to design.

Essentially, a good table is data driven. That is, the nature of the data contained in the table should determine how many columns and rows a table has, what sorts of headings or labels appear on them, whether a table has subsidiary headings, and, indirectly, whether there is a need for artifacts of presentation like continuation headings for long tables. A purist might argue that each potential table deserves a separate process of analysis, which would result in an individualized DTD reflecting a specific set of information relationships. In practice, individualized DTDs are unlikely, not only because of the lack of human resources for the analysis process but also because SGML production systems do not encourage the practice (or in some cases do not support the features it would require).

Because we have grown so accustomed to thinking about the presentation of tables, we have let that presentation shape how we think about tools for encoding tables. Most presentation systems, whether they drive a display screen or a printing engine, not only are conceived of in terms of two dimensions, but also have a definite orientation. That orientation is reflected in coding that tends to set tables as a series of rows, with entries moving across the rows before moving down the page. Although a table editor (like that in Ventura) may present the user with a rectangular array in which any cell can be only a click away, the coding behind it is quite probably row oriented. Most of the well-known public SGML DTDs that include tables also use a row-oriented approach.
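
A typical row-oriented coding, sketched here with invented names and placeholder data, makes the orientation plain:

    <table>
      <row><entry>Run</entry><entry>Temp (K)</entry><entry>Yield</entry></row>
      <row><entry>1</entry><entry>300</entry><entry>0.82</entry></row>
      <row><entry>2</entry><entry>350</entry><entry>0.91</entry></row>
    </table>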

The data in many tables, however, are not actually row oriented or two dimensional. In many experimental environments, the data that have traditionally been collected in tables reflect more than two independent experimental variables. The person converting these data to a table generally has to consider how complex the table should be-will it have a hierarchy of spanned column headings, or will it have intermediate titles, or will it perhaps be presented as a series of related tables? Another way of considering the data in a table is to consider plotting the information as a graph: in many cases, it will then appear that the best way to read the data is down the columns, rather than across the rows.

Considering the possibilities that tables may present, the document analyst may conclude that a table is a kind of multidimensional array and that what typography treats as column, stub, and internal headings are actually the verbal analogs of the numeric indices a mathematician associates with an array. A DTD that reflects this view of tables may indeed be derived from a DTD designed for mathematical notation. It may not mention presentation properties such as rows or columns at all. Processing a "table" from this array would be completely independent of how the data had been recorded. On the one hand, such an array might be easier to fill from computer-generated experimental data than typographically encoded tables often are. On the other hand, separating the encoding of the data content from the processing would make it easier to alter the processing according to immediate needs. Such a table could easily be rotated, for example, or indeed turned into a graphic rather than printed in tabular form at all.
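
Under the array view, the placeholder data from the earlier sketch might be coded with no mention of rows or columns at all; the names below are one hypothetical possibility:

    <!-- Each cell is addressed by variable values, not by position -->
    <dataarray vars="run temperature">
      <cell run="1" temperature="300"><yield>0.82</yield></cell>
      <cell run="2" temperature="350"><yield>0.91</yield></cell>
    </dataarray>

A formatter could render such an array across rows, down columns, or as a graph, without any change to the coded data.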

How does information flow?

The questions of how information flows through a collection of data that may be presented as a table, and what that flow means for presentation, are but a part of the question of how scientific and technical information flows from the original experimenter to the eventual reader of the report that results from the experimentation. If we accept the SGML model that the information needs to be tagged by someone, who-or what-is that someone? And who plays what role in deciding what gets tagged and how?

At one extreme, is it appropriate or even cost effective for the researcher to do the coding? One can argue that only the experimenter really understands the content of the experiment and its results. We generally expect the researcher to write the report of the experiment, but can we expect the researcher to learn the coding system? Is the researcher, who is generally not trained in typography or graphics, the one to decide how a table should be designed or whether the data would better be presented as a graphic? Although researchers write up the original findings, abstracting and keywording are in many cases done by some other subject-matter specialist. How much tagging should that specialist do? Not all reports pass through a technical editor, and editors have traditionally been involved with copymarking. It will be interesting to see how an editor who was an English or journalism major in college accepts the idea of having to identify the technical concepts that will allow a report to be retrieved from some data base.

So long as production of publications is confined to a single group or institution, how those publications are actually coded (whether in SGML or some typographic system) is a matter for internal concern only. When we begin to deliver reports from many contractors to OSTI in SGML, the concerns become mutual ones. At the very least, we will have to deliver our DTDs as well as our documents before OSTI can process them.

What can we do with the finished report-besides print it?

What OSTI, as DOE's central collecting and redistribution agent, as well as its cataloging and abstracting service, does with a contractor's report is an issue that has implications all the way back to the creator of the report, whoever that might be. Because SGML coding has no intrinsic semantics, interchange of electronic documents cannot occur in a vacuum. That is to say, SGML itself does not define a set of tags or the meaning of any tags generated using the language's facilities. Not only must the DTD either be transmitted with a document or the document refer to a preexisting DTD; there must also be some agreement about the significance of the information captured by the coding. Furthermore, the coding must be designed to support the eventual users of the data. If all that OSTI did with reports involved images-print, copy, microfilm, image-based CD-ROM-traditional presentation structures would be all that needed tagging. But if OSTI wants, for example, to extract bibliographic data directly from the data file, that file would need to be coded with that purpose in mind. When the question turns to making data bases for end users to query, deciding what is tagged and how becomes even more critical. In addition to considering the tags and who does the tagging-the author or someone later in the publications pipeline-we must consider the availability of query mechanisms.
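
If bibliographic extraction is the goal, for example, the front matter must carry elements a program can find; the sketch below uses invented names:

    <!-- Hypothetical front matter coded for bibliographic extraction -->
    <front>
      <title>SGML Encoding for Technical Reports</title>
      <author>James David Mason</author>
      <org>Oak Ridge National Laboratory</org>
      <reportnum><!-- report number --></reportnum>
    </front>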

In the world of traditional data bases, the concept of a query language has long been understood, and, for many applications, a consensus has evolved around a standard called SQL-Structured Query Language. Although SQL provides a starting point for discussing queries to a text data base, it was designed for record-oriented systems and so is not wholly applicable. The committee responsible for SQL (ISO/IEC JTC1/SC21/WG3) has also been working on a language for manipulating full-text data bases, FSQL. However, this language concentrates on manipulation of the text itself and does not fully address the manipulation of SGML coding. For example, one might want to retrieve structures according to an attribute value (e.g., all the elements in whose sub-elements the secure attribute is never "secret") or a combination of structures (e.g., all the elements whose content contains &alpha;). Such queries go beyond the range of FSQL.

Perhaps the topic attracting the most attention in the area of SGML-related standardization is the creation of SGML query languages. One such language, HyQ, has just been standardized as a part of HyTime. A second language is being standardized as part of DSSSL (Document Style Semantics and Specification Language, ISO/IEC DIS 10179). Two separate languages are being created because HyTime and DSSSL are partially complementary standards. Both standards include an approach to navigation of SGML structures, but their location models do not wholly overlap. DSSSL is designed to manipulate only SGML data (e.g., to describe the formatting of the first text paragraph after the title in a third-order section). HyTime itself is based on SGML structures, but the data controlled by a HyTime document may not be in SGML (e.g., a scanned piece of color artwork presented on a screen five seconds after the beginning of a musical selection). Both query languages address certain basic SGML issues (each could express the sample queries in the previous paragraph), and they will most likely have simple intermappings in the areas of common interest. However, HyTime query engines will not be commercially available for some months, and DSSSL is not due to be completed until mid-1993. There are nonetheless already some commercial products that implement individualized approaches to SGML queries. More such products are expected, and perhaps by the time DOE is ready to implement large-scale queries there will be HyTime and DSSSL products.

Like other SGML processing systems, query systems require preplanning. Even more than in the case of printing, it is important to see the tagging scheme in relationship to the eventual end product-and the end product is even less precisely defined at this point. Most formatting systems, from Microsoft Word to LaTeX, can be approached with the assumption that they have provisions for some simple text structure, such as paragraphs and footnotes and headings. Whatever tags a DTD uses for such elements, it is not likely to be hard to find or create something to map them onto among the functions of a formatter. The flow of processing moves one way, from the creator to the presentation system, and the latter is at least a familiar set of targets. But in a query system, the end user is the captive of the creator, who has much less of an idea of what will be asked for at the end. A researcher at LLNL, for instance, cannot retrieve information about RCRA sites in an ORNL report except by full-text pattern matching (perhaps with a very large thesaurus) unless there are tags in the file that provide a clue. That means that DTD writers need to be thinking about potential query topics from the very beginning of document analysis.
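
A single content-identifying tag in the running text would be enough to support such a search; the tag name and the sample sentence are invented for illustration:

    <para>Groundwater sampling at the <rcrasite>Waste Area Grouping 5</rcrasite>
    site continued throughout the fiscal year.</para>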

Not only must there be preparation for queries; there must also be coordination between the delivery agent, such as OSTI, and the creators of data at the remote sites. The host of a query engine is very much concerned both with what tags are to be expected in files (which can be determined by pre-transmission of DTDs or use of common DTDs and can be to some degree automated) and with the semantics associated with those tags (which largely must be negotiated among humans rather than between computers). There are some forms of data transformation that can be managed from the receiving site. A simple SGML parser application, such as the DSSSL General Language Transformation Process, can map different tags with the same semantics onto a single harmonized result. For example, a tag for superscripts and a tag for exponents could both become the same input to a TeX-based system that has a common "^" function-or feed a corresponding query topic. (Transformation to neutral tags has its own set of problems: will a researcher/author who is used to tagging a mathematical structure with one tag know that the query system expects another?) But the semantics of queries about other types of content are potentially much more ambiguous than differing opinions on how to name the large-but still limited-functionality of mathematical notation. How can the designer of the user interface for the query system predict more about graphics than that, for example, x-y plots will probably have legends on both axes, or more about tables than that, to return to the presentation-oriented model, there will be headings on columns? Even queries such as those are only partially predictable. But queries about tagged content, as opposed to full-text searches of content, mean that there must be coordination between the query designer and the DTD designer.
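
A sketch of such a harmonizing map follows; the two source tags are invented, and both are shown mapping onto TeX's common superscript function:

    <!-- Two source codings with the same semantics: exponentiation -->
    <sup>2</sup>            becomes   ^{2}
    <exponent>2</exponent>  becomes   ^{2}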

What can we do now?

SGML systems among the DOE sites are still comparatively rare. Few DTDs have been written. We can hope that nothing has had a chance to become fossilized. We have a chance to do some planning.

We can expect that we will have a significant number of DTDs. Almost every site generates a variety of documents that will require some degree of customization. A generic technical report at LANL is likely to have some differences from one at BNL. Just within Oak Ridge we are already writing DTDs not only for technical reports but also for such varied things as quality-assurance certifications and manufacturing assembly procedures. It is easy to think of other sorts of documents, such as DOE Orders, environmental-impact statements, and EDI transactions, that might call for DTDs.

The structural elements of such documents mostly require the expenditure of time to identify. Structural elements are somewhat predictable, and we are used to dealing with them for presentation systems. Even though we may wind up identifying semantically similar elements with different tags, editors and others involved with DTD creation at our sites should need little assistance, other than from experts in SGML coding, to create our early DTDs.

Planning DTDs for queries about content, however, will require assistance from other sources. Feedback from end users of documents is particularly important. Participants at InfoTech, as representatives of a segment of that user community, can make a significant contribution to the planning process. As our capability to expand the content of data bases grows, what else should we prepare to query? Today data bases contain mostly bibliographic data, keywords, and abstracts. Abstracts are intentionally restricted in bulk, and brute-force full-text queries are perhaps adequate for searching them. As the complex texts of reports join the data bases, we will need more ideas about what researchers, administrators, auditors, and other users will need to find, so that we can build coordinated query and tagging systems. We should plan to coordinate efforts on a DOE-wide basis so that we have a unified strategy for delivering our retrieval systems to the users. Then we can look forward to a rewarding future of even better service to our community.

Oak Ridge National Laboratory is operated by Martin Marietta Energy Systems for the U.S. Department of Energy under contract DE-AC05-84OR21400.

By acceptance of this article the publisher or recipient acknowledges the U.S. Government's right to retain a non-exclusive, royalty-free license in and to any copyright covering the article.