Steven J. DeRose
Sr. System Architect
Electronic Book Technologies
Structured information is information that is analyzed. Not in the sense that a Sherlock Holmes should peer at it and discern hidden truth (although for some information, such as ancient texts, something much like that may happen), but rather in the sense that the information is divided into component parts, which in turn have components, and so on.
Only when information has been divided up by such an analysis, and the parts and relationships have been identified, can computers process it in useful ways. The choices made during this analysis are crucial; the most crucial point I want to emphasize today is that how you divide up your data matters.
There are many models of analysis. Among the most trivial, and in my opinion least useful, is this: "a document is a list of pages." Moving in both directions from this, utility increases. If we move upward in scope we enter the domain of most concern here: the division and organization of recorded knowledge; without such organization, our libraries would become mere collections, inaccessible and in the end unusable.
If we move downward in scope, a very similar phenomenon occurs: As progressively finer levels of analysis are made conscious, explicit and accessible, the range of things one can do with a document increases.
So this is what we mean by "structured documents" and "structured information:" information whose parts identify themselves, making themselves accessible to human and computer processing.
One can choose parts to identify on many bases. Perhaps the first important choice is whether the goal is to represent the form of some data or information, or some particular "meta-" information about the content, or the content itself. This choice is fundamental, and has radical consequences for what you can do with the resulting structured information.
On paper, form and content are partly intertwined (or as Ted Nelson said in Computer Lib, "intertwingled"). The typographic conventions of our culture, added to our knowledge of the natural language of documents (sometimes properly linguistic, sometimes graphical or otherwise semiotic), permit us to identify the content parts of books.
Computer tools are notoriously bad at identifying content given form, while being superb at the opposite transformation. For example, it is trivial for a computer to render both book titles and emphasis using italics; this makes for no conflict and requires no artificial intelligence. On the other hand, given two italic portions of a text, a computer will fail miserably at telling us which is a book title and which is emphasis.
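The asymmetry can be put in a few lines of code. This is a sketch of my own (the element-type names are invented), not a description of any particular system:

```python
# Content -> form: a trivial lookup. Both element types render as italics.
RENDERING = {"book-title": "italic", "emphasis": "italic"}

def render(element_type: str) -> str:
    """Map a content type to its typographic form."""
    return RENDERING[element_type]

# Form -> content: the inverse mapping is not a function at all.
# Given only "italic", there is no way to decide which type produced it.
def recover_type(form: str) -> list:
    """Return every content type that could have produced this form."""
    return [t for t, f in RENDERING.items() if f == form]

assert render("book-title") == render("emphasis") == "italic"
assert len(recover_type("italic")) == 2   # ambiguous: two candidates remain
```

Going from content to form loses information; going back requires information that is simply no longer there.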
Because of this inherent asymmetry, moving to a world of computerized information requires us to undertake the work of making content structures explicit. Without this step, we have not in fact moved the information to a new medium; we have merely made an electronic photocopy.
Now photocopiers are immensely useful beasts; but my point is that you can do no more with a photocopy than you can do with the original. You can make certain important gains, such as increasing access while preserving the original from excessive handling, or creating multiple copies for safety, accessibility, or even providing disposable copies for special uses. But once someone obtains the copy, they can do no more with it than with the original.
In our move to the future we must make it possible to do more. Only by representing the structure of the content, not merely the form of its expression in a prior medium, can we achieve the level of function we must have to manage the exponential growth of information we face.
Information professionals of all kinds have names for the many parts that make up document and other information structures. As one example, a quick examination of the Chicago Manual of Style will reveal many, because much of the goal there is to explain how to represent the components of structured content by using typographic form.
When creating new information it is relatively easy to identify the types of content objects. An author can state authoritatively what their intent is as they place a given content object such as a paragraph, a quotation, a line of poetry, or an axiom. Indeed, the author must make such a choice before they can possibly choose to do some word-processor action to express it. At times an author may be unconscious of these choices, and that is fine--literature is often held, in retrospect, to be most significant and meaningful at levels the author may never have seen. Still, authors' choices of structure are our key source of information about their work, and this holds at all levels from the phonological and grammatical, up to the placement and order of chapters, indexes and the like.
When dealing with pre-existing information we do not have the luxury of being the author: we can only do our best to discern structure and meaning from what we have. We can look for clues to structure in typography, and these are often very clear; but we may also wish to find structures that are completely implicit or are obscured by neutralization. For example, when the Oxford English Dictionary was converted into a structured electronic document, researchers found about 20 distinct uses for italics; only by the painstaking task of teasing etymologies apart from Latin cognates apart from literary examples, and so on, was the result made truly useful. At a subtler level, one may wish to explicate structures that are hypotheses: a literary critic may claim that some passage constitutes an allusion to Paradise Lost. The validity of such a claim is normally debatable; but explicit structure is a way to express the claim itself.
The key innovation required to move forward is that we must choose truly useful structures and make them explicit. The structure will be there anyway, but using it must remain a purely manual task unless the structure is made explicit.
Structure is in our documents. We cannot avoid it, though we can choose what kind to use in any situation. Authors clearly think in terms of linguistic discourse and other structures while writing, though much of this activity becomes automatic with practice. We also use structure constantly in navigating the information we have. Finding aids also have a great deal of structure, created through careful design.
Often we make use of structure without thinking. Open any document and structure leaps off the page: lists, figures, footnotes and the like are all over. And as documents grow larger, explicit structure-aware tools start to appear: Indexes reflect the thematic or topical structure, while tables of contents reflect the broad-stroke discourse or organizational structure, and bibliographies reveal something of the referential or link structure.
In reference works, such as those of particular interest in this forum, structure is if anything even more important. Without carefully designed subject categories, levels of organization and description, etc., navigation in large information spaces obviously bogs down.
One great advantage of structured information is that raising the component parts of information to the level of explicit representation often leads to giving them names. As Ursula Le Guin reminds us, "the name is the thing, and the true name is the true thing. To speak the name is to control the thing." Nowhere is this truer than in the realm of information.
Navigation requires naming, as does access whether by database, catalog, finding aid, or hypertext link. Choosing the right names for information units is perhaps the most crucial issue facing the electronic document community today.
We have spoken already of type-names, which say what manner of thing some thing is. But now we turn to instance-names, which pick out specific individuals: not X is a book or quotation or word or link, but X is that book or that quotation or that word or that link.
Imagine for a moment we lacked such names for our information: what if there were no chapter, section, or at least page divisions authors could cross-reference to? Cross-reference would become impossible. This is almost inconceivable at the level of whole documents; a book without a title will be given one or die a quiet death. But what of those internal components we have been discussing? Ancient texts lacked internal names; the important ones have been forced to acquire them. One can hardly find a modern Bible printed without chapter and verse divisions, and the same is true at least for scholarly editions of most classical works. Manuscripts often lack such internal cues, making the texts before us that much more complex.
For recent works we resort to page numbers for cross-reference: "see page 37 of Smith (1995)." This is possible because the proportion of copies whose pagination matches is very high; many books never achieve a second edition, or even a second printing. But for those that do, the use of page numbers poses a problem that brings us back to structure: page numbers break. Any new edition, typesetting, or format can change the pagination. This is obvious, but easily forgotten.
Why do these things happen? Simply because pages are not structural units in literature. They are certainly "structural units" in the far different domain of typography, but typography is not document structure in the sense of interest. A book is "the same" if reprinted from quarto to octavo and from Garamond 24 to Times 12 in all but a few senses.
Precisely the same issue affects reference tools such as finding aids. What if the only names for things were chosen from a space that itself had little structure? For example, say that libraries were organized and accessed solely by ISBN or acquisition number, or that there were no levels of organization in a finding aid, but merely prose, perhaps with markup for font changes and the like. While the presence of names would at least make access possible, there would be a radical loss in functionality.
The careful choice of structures, and the careful assignment of systematic names to them, provide the tools required to navigate through the vast information-spaces that are just around the corner.
Many proposals have been made to instead copy the notion of pages into this newer electronic world: "Just scan the LC and drop it on the net." A few years ago one could hear the same theory, but suggesting optical disk jukeboxes; and before that, microfilm. As I mentioned earlier, this approach is not truly a new medium, but merely a new kind of papyrus on which to store a copy of the original medium: highly useful but purely a quantitative, incremental change. This path can never lead to the new world of navigable, accessible information space we hope to reach.
This is because a scanned image does not contain explicit structural information that can be used to support such processes. It is exactly as if one converted to an "electronic catalog" by scanning all the 3x5 cards and doing no OCR. I suppose such a catalog would be "online," and it would have the advantage of being easily copied, backed up, and transported. But imagine using it!
The next step up from pictures of information is very popular right now: the "plain ASCII text file." It sings the Siren song of portability, and has become popular for several reasons. First, it is vastly more amenable to machine processing than a bitmapped page: you can search it at least for words, you can mail it around, and any old software can at least display it. This is a good reason, as half a glass of water is better than none. But the other reasons are poor. We limit ourselves to "ASCII" because our networks won't take anything else without running uuencode or pkzip or binhex first, and none of those is commonplace on all computer platforms. Also, this is all the information we can get for no effort: a scanner, OCR software, and an automated spelling checker will get you to "plain ASCII," and no further.
Consider some of the things that cannot be represented in "plain ASCII": accented and non-Latin characters, italics and other typographic distinctions, mathematics, illustrations, and the structural divisions we have been discussing.
Beyond these obvious limitations there is a subtler problem: such files often use conventions to represent information about structure. For example, block quotes may be indented by adding spaces before each line, or titles may be centered by adding enough spaces to approximately center them (but centered relative to what?).
To the extent files use such conventions they at least potentially gain useful functionality, but aren't "plain ASCII" anymore. Some of the characters are not just characters, they have become markup, giving information about the text. The main difference between such conventions and true markup is that the conventions are inconsistent and undocumented.
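The fragility of such conventions is easy to demonstrate. Here is a sketch of the kind of guesswork "plain ASCII" forces on software; the assumed line width of 65 is my own invention, and that arbitrariness is exactly the problem:

```python
# A heuristic that tries to recover structure from spacing conventions.
# Different files assume different widths, so this guesswork cannot be
# made reliable -- the "markup" is inconsistent and undocumented.
ASSUMED_WIDTH = 65

def guess_structure(line: str) -> str:
    stripped = line.strip()
    if not stripped:
        return "blank"
    indent = len(line) - len(line.lstrip())
    # Roughly centered? Then perhaps a title -- relative to a width we guessed.
    if indent > 0 and abs(indent - (ASSUMED_WIDTH - len(stripped)) / 2) <= 2:
        return "title?"
    # Deeply indented? Then perhaps a block quote -- or poetry, or a table.
    if indent >= 4:
        return "block-quote?"
    return "prose"

assert guess_structure(" " * 29 + "A TALE") == "title?"
assert guess_structure("    And this is quoted text that was indented.") == "block-quote?"
assert guess_structure("Ordinary prose at the left margin.") == "prose"
```

Note the question marks: every answer is a guess, where explicit markup would have made it a fact.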
I've downloaded many interesting and desirable e-texts from the network, often ones that boasted of being "plain ASCII." The problem is, they lied to me about the text. I was sold (or in some cases given) a file that purports to contain "the text, the whole text, and nothing but the text." But what I actually found often fell far short of that promise, with no record anywhere of what had been omitted or changed.
Pity the scholar who analyzes such a text, or the cataloger who tries to identify it. The names we need are missing. In Le Guin's terms we do not know the true name, and so cannot control the thing. And if as in her story we should magically learn the true name, we find to our pain that the thing we name is not what we thought--not an unassuming local wizard, but a dragon in disguise.
My final point about the need for structure is that structure facilitates searching. Only if the component parts are explicitly identified can we search for information in some particular part. This is why a database of personnel records is better than a list typed into a word processor. You can search for "Jones" as a name and not a street, or "401" as an area code and not a street number, or in my favorite example from one online library catalog, search for the journal titled simply "Linguistics" without getting all the subject entries.
Imagine querying a personnel database for numbers ">10" without being able to specify that you want a "salary" as opposed to a "month of hire." This seems obviously absurd. Likewise, everyone here knows why a catalog entry would be almost (almost) useless if the many MARC fields were not distinguished, or were distinguished inaccurately.
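In code the difference is plain. A sketch with invented records:

```python
# Records with explicit fields, versus the same data flattened to text.
# All names and values here are invented for illustration.
records = [
    {"name": "Jones", "street": "14 Azalea Ave", "area_code": "401"},
    {"name": "Smith", "street": "37 Jones St",   "area_code": "212"},
    {"name": "Brown", "street": "401 Main St",   "area_code": "617"},
]

# Fielded search: ask for "Jones" *as a name* and get exactly one record.
by_name = [r for r in records if r["name"] == "Jones"]
assert len(by_name) == 1

# Unstructured search: a flat substring match over the whole record
# cannot tell a name from a street, or an area code from a house number.
flat = ["  ".join(r.values()) for r in records]
by_substring = [line for line in flat if "Jones" in line]
assert len(by_substring) == 2   # Jones the person and "Jones St" both match
```

The structure does not add information; it merely keeps distinct the information that was always distinct.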
These cases are so obvious we may hardly think of them as "structure." But as documents go online in their entirety the same issues and tradeoffs apply, albeit in less obvious forms. If we do not represent structure within documents we will not be able to do the things we increasingly want to do with them.
Many finding aids seem to me to occupy a typological middle ground between databases at one end (especially the simple flat-form sort, and less so the more complex and heterogeneous MARC sort), and typical documents at the other. This makes them, if anything, more complex and more needful of careful design than other data. This continuum from simple flat databases to highly structured document bases brings us to the issue of what kinds of structure to represent. As we move from catalogs and abstracts on toward finding aids and eventually full content, correlating the levels of information and using them to increase ease of use will continue to grow in importance.
I'd like to suggest a few basic kinds of structured information, ranging from forms at one extreme to document materials at the other, and then to argue that certain reference materials ranging from MARC to finding aids fall along the continuum in between. I do not think the materials we are considering fall cleanly into either extreme, and I think that because of their intermediate nature they have both advantages and difficulties not present at either extreme.
First let us consider forms, the sort of thing we all fill out from time to time on a sheet of paper with little boxes. Form data has these central characteristics:
- The set of fields is fixed and known in advance.
- The order of fields and of records carries no meaning: a form reordered is the same form.
- Each field holds a small, atomic value: a number, a date, a short string.
Now let us leap to the other extreme case, namely documents. They have quite a different pattern when we look at the same characteristics so central to forms:
- The set of component types is open-ended, varying by genre and even by individual work.
- The order of components is essential: shuffle the paragraphs and you destroy the text.
- Components are anything but atomic: they nest within one another to arbitrary depth, and range in size from a single character to an entire book.
So in documents, order matters. This second issue poses an inherent performance problem in the relational model. An RDB must store each paragraph (or section, or whatever) as a record in some kind of element table. To produce the correct order, serial numbers must be added to every record (this para is para 1, etc.). To retrieve and display a section, the RDB must select all paras with serial numbers in a certain range (likely a slow operation), and then sort the results by serial number. This is wasted effort, because normally only one basic order is ever needed, yet that same order must be reconstructed over and over. A database model that preserves order saves all this work.
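A small sketch makes the contrast concrete (the table and column names are invented):

```python
import sqlite3

# The relational approach: document order must be stored explicitly as a
# serial number, and reconstructed by filtering and sorting on every access.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE element (serial INTEGER, section INTEGER, text TEXT)")
paras = [(1, 1, "First para"), (2, 1, "Second para"), (3, 2, "Third para")]
db.executemany("INSERT INTO element VALUES (?, ?, ?)", paras)

# Retrieving section 1 in reading order: select, then sort -- work repeated
# on every retrieval, for the one basic order we ever need.
rows = db.execute(
    "SELECT text FROM element WHERE section = 1 ORDER BY serial"
).fetchall()
assert [r[0] for r in rows] == ["First para", "Second para"]

# An order-preserving model: document order is simply the storage order,
# and a section is a contiguous slice -- no serial numbers, no sorting.
document = ["First para", "Second para", "Third para"]
assert document[0:2] == ["First para", "Second para"]
```

The relational version is not wrong; it merely pays, over and over, for a fact the order-preserving version gets for free.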
Some time ago a query language was proposed for documents that lacked this key feature: a query for all occurrences of the word "sower" would get exactly them: sower, sower, sower, .... What one must have is rather different: the list of where "sower" occurs, so as to navigate to those places and examine the context. This differs from getting 100 copies of a 5-letter string, which is no more useful than one copy.
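In a modern scripting language the distinction looks like this (a sketch of mine, not the proposed query language itself):

```python
import re

# findall() hands back identical copies of the matched string;
# finditer() hands back *where* each occurrence is -- which is what
# navigation needs.
text = "The sower went out to sow; the sower sowed the seed."

copies = re.findall(r"\bsower\b", text)
assert copies == ["sower", "sower"]   # two copies, no more useful than one

locations = [m.start() for m in re.finditer(r"\bsower\b", text)]
assert locations == [4, 31]           # places one can actually go to
```

The copies tell us only that the word exists; the locations let us go and read it in context.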
So on all these fundamental axes forms and documents differ radically. My conclusion is that different tools and methods must be applied in the two domains. So where do finding aids fit in? I believe they share some characteristics of both categories and this may make them particularly complex. A finding aid must include a great deal of information about content, since that is what one is trying to find.
Some meta-information can be reduced to something resembling forms; in one sense a finding aid is similar to a MARC record: a large though typically sparse list of fields. But there is more going on. Those fields do have interdependencies; they do have levels (a colleague working on the John Carter Brown Library's exhaustive bibliography European Americana ended up dividing author names into something like 20 sub-components and 3 or 4 levels). But finding aids must go even further.
A finding aid must provide access based not only on bibliographic information--author, title, edition, imprint, subjects, added entries, and a host of fields you all know far better than I do--but also on characteristics of the content itself: what is this thing about? What school of thought within its discipline does it represent? What does it relate to in other disciplines?
There would be especial benefit in being able to get hints at relationships as yet unnoticed or unremarked. Markup that identifies relevant content and structure facilitates such a discovery process, by making explicit many of the basic facts upon which conclusions about relationships are based.
One approach to this is the preparation of abstracts, and this has proven very useful. Another is the application of statistical methods to vocabularies and word frequencies, now well understood. But the ultimate answer, I believe, comes from making the whole documents available with as much structure as possible explicitly represented. This is the true information, labeled by true names, from which abstracts and statistics come.
How then do we represent useful structure? Many parts are obvious, and, within the hard constraints of time and budgets we should represent as many of them as possible.
First, almost all documents include various generic component parts: PARAGRAPH, LIST-ITEM, QUOTE, TITLE, EMPHASIS, FOREIGN, IMAGE, and the like. Even rudimentary software can help locate these, because they map almost one-to-one to word-processor or scanner objects. Reasonably skilled yet not scholarly workers can identify them quite reliably even when software cannot.
In much the same category are various generic aggregates: BOOK, CHAPTER, SECTION, FRONT-MATTER, LIST, TABLE. However, for historical reasons typical software gives us no help with these. We have all suffered from word processors not knowing what a "list" is, and so failing to number items right or keep numbers up to date, or forcing us to re-select each list each time it changes so the software will know what to renumber. We also suffer the pain when we want to move, delete, or otherwise deal with sections and chapters. Add-on outliners help, but because few word processors truly represent any structural unit larger than a paragraph, they must use heuristics (such as "Find the next paragraph of type HEADING-2, and assume everything between is the current SECTION") that are both slow and unreliable.
Each genre, from poetry to manuals to finding aids, requires specialized objects: STANZA, REPAIR-PROCEDURE, AXIOM, PART-NUMBER, CATALOG-CODE. Identifying the right ones for finding aids is a crucial step, requiring ongoing research. It cannot be established once and for all, just as the list of defined subject headings for literature cannot be defined once and for all. The Berkeley Finding Aid Project has undertaken this task with zeal, and I expect its already fine results to improve even further as more important components emerge in the course of addressing a growing sample of finding aids.
Another major kind of component is access tools: these range from the ubiquitous footnote and sidebar, through cross-references, bibliographies, and the like. Paper necessitated other navigation tools as well, such as indexes and tables of contents. Of course these components should be represented.
The automation of cross-references is the hypertext link: one ought to be able to click on any such reference and have it work. Less obvious but quite similar is the quotation: any quotation should work to access the quoted document. If that document is undergoing change, such as critical editing or a rewrite by a living author, then the user may also wish the quotation to be dynamically updated.
Many phenomena that are evident in printed texts are not structural units that need to be identified for most purposes. Line breaks, discretionary hyphens, font and other typographic choices, and the like are usually not structural except insofar as they may serve to communicate some other structures.
When planning an encoding project, the two primary questions are what structures are of interest and which are to be encoded. How they are encoded is important, but strictly less so than the fact that they are encoded. Any encoding project faces economic as well as intellectual decisions, and I will not be talking about how to decide which things not to encode when finances are running out; this depends on the goals and usage scenarios envisioned for the data. But my normal advice on the subject is that, within the constraints of budget, you should encode anything that you think will be of independent use later. Here are a few specific diagnostic questions to ask:
- Will anyone want to search for this unit, or to search within it?
- Will anyone want to navigate to it, cite it, or link to it?
- Will it ever need to be displayed or processed differently from its surroundings?
The list obviously leads one strongly toward conceptual units, at the expense of the merely typographic. That is, for most purposes the placement of line breaks and discretionary hyphens, the choices of font, and so on do not require encoding except insofar as they communicate some other structures.
The final point about deciding what structure to encode is that one should study existing standards first. Encoding is not a nascent field, nor is the use of SGML. Excellent advice is available, and can save a great deal of time and help one avoid backtracking later.
SGML is the best choice for encoding these structures. It has two truly crucial advantages: First, it imposes no fixed set of component types. You can define the structures however you want for the task at hand. At the CETH Workshop on Documenting Electronic Texts one speaker expressed some doubt about whether SGML was flexible enough to provide a complete equivalent to MARC (that is, an alternative representation of all the same data). By the next coffee break, three SGML experts in the room had written out DTDs to compare (hardly polished, of course).
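To give the flavor of that flexibility, here is a minimal DTD sketch of my own; the element names are invented, and it is of course nothing like MARC-complete:

```
<!ELEMENT record   - - (title, author+, subject*, note?)>
<!ELEMENT title    - - (#PCDATA)>
<!ELEMENT author   - - (surname, forename?)>
<!ELEMENT surname  - - (#PCDATA)>
<!ELEMENT forename - - (#PCDATA)>
<!ELEMENT subject  - - (#PCDATA)>
<!ELEMENT note     - - (#PCDATA)>
```

A few lines establish a vocabulary of named parts, their required order, and which may repeat (+, *) or be omitted (?); nothing in SGML itself dictated any of those choices.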
Second, SGML is a public, non-proprietary standard that will not change with each new release of some company's software. Software vendors conform to it, rather than it and your data conforming to software vendors. This is what justifies confidence that SGML data will survive for the long term, beyond any current software used on it.
So are there any downsides to SGML? Only a few. One seeming downside is that SGML requires more thought about the data. As mentioned earlier, OCR is no longer the end of the story; the work must continue into making sometimes-hard decisions about the nature of your data. I myself consider this an upside: it does require extra effort, but the effort pays off.
The main downside to SGML is that it provides too large a wealth of options: alternative syntaxes, abbreviatory conventions, and the like; few people bother to learn them all. Fortunately such options are just that: options. Many add no functionality or capability, merely alternative methods, and so any project can avoid them simply by deciding to. Thus most SGML experts have adopted what has come to be called a "monastic" approach: "just say no" to any features you don't need.
Before closing I should mention two other ISO standards that relate closely to SGML. The first is HyTime, which provides a set of extensions to SGML for representing hypermedia linking. All of the extensions are backward-compatible, in the sense that a regular SGML parser will still parse a HyTime document correctly. But the special semantics of certain data are extended, so that HyTime knows more than SGML about links between documents, chains of indirect links, pointers into non-SGML data such as graphics and video, and so on.
HyTime's reference mechanisms work in three basic ways: by specifying a unique name from some particular name space (such as formal IDs in SGML documents, document names, etc.); by counting along some particular axis (such as picking a rectangle out of a graphic, or a time range in music); and by retrieving objects based on some property they have (such as searching for a contained string, or for elements of a particular type).
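The three strategies can be sketched in miniature (plain data structures of my own, not HyTime syntax; all names are invented):

```python
# A toy document: a flat list of elements with ids, types, and content.
doc = [
    {"id": "intro", "type": "section", "text": "In the beginning..."},
    {"id": "ch1",   "type": "chapter", "text": "The sower went out."},
    {"id": "ch2",   "type": "chapter", "text": "A grain of mustard seed."},
]

def by_name(name):
    """Addressing by unique name within a name space."""
    return next(e for e in doc if e["id"] == name)

def by_count(axis, n):
    """Addressing by counting along some axis (here, an element type)."""
    return [e for e in doc if e["type"] == axis][n]

def by_query(predicate):
    """Addressing by retrieving objects that have some property."""
    return [e for e in doc if predicate(e)]

assert by_name("ch1")["text"] == "The sower went out."
assert by_count("chapter", 1)["id"] == "ch2"
assert len(by_query(lambda e: "seed" in e["text"])) == 1
```

Each strategy has its place: names survive rearrangement, counting needs no prior labeling, and queries find what no one thought to name.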
Like SGML, HyTime has an elegant and very powerful set of core features, surrounded by an enormous selection of options and alternatives. A monastic approach to HyTime has the same benefits as for SGML, only more so.
The Web has popularized a different way of linking, based on URLs. As noted before, there is grave danger in building on the foundation of document locations rather than document identifiers. A Web URL by definition breaks with even the most minor changes of environment: renaming the file, re-arranging directory structures, switching to a new hard disk. This is just like the case of citing books by shelf or accession number rather than ISBN or LCCN. Fortunately some in the Web community are aware of this and working diligently on a "URN" or "Uniform Resource Name" specification. But until it arrives and is widely used a link on the Web adds something to the Internet deficit, and the price of fixing them will have to be paid, whether by us or our virtual children.
Finding aids are the next logical step in progressing from information about the form of documents, through information about documents, to documents themselves. At all these stages, what the computer can do with data depends most importantly on the model applied to the data. A simple facsimile of a manuscript or other object is useful, but does not enable qualitatively new processing, just as a microfilm copy of a card catalog is useful, but not revolutionary.
In designing new models for electronic data it is important to consider whether traditional models such as the relational database really fit. In examining several basic properties of relational data versus documents in general, it becomes clear that the fit is questionable. Newer technologies are needed, and new design questions need to be researched and solved.
SGML provides a generic way of representing certain models about document structure, and of representing documents given those models. Because it is a formal international standard and has achieved very wide and diverse use, it is a safe long-term vessel for important data. As with many standards, a monastic approach to SGML enhances portability, durability, and interoperability.