Structured Information

Navigation, Access, and Control

(presented at the Berkeley Finding Aid Conference, April 4-6, 1995)

Steven J. DeRose
Sr. System Architect
Electronic Book Technologies

What is structure, anyway?

Structured information is information that is analyzed. Not in the sense that a Sherlock Holmes should peer at it and discern hidden truth (although for some information, such as ancient texts, something much like that may happen), but rather in the sense that the information is divided into component parts, which in turn have components, and so on.

Only when information has been divided up by such an analysis, and the parts and relationships have been identified, can computers process it in useful ways. The choices made during this analysis are crucial; the most crucial point I want to emphasize today is that how you divide up your data matters.

There are many models of analysis. Among the most trivial, and in my opinion least useful, is this: "a document is a list of pages." Moving in both directions from this, utility increases. If we move upward in scope we enter the domain of most concern here: the division and organization of recorded knowledge; without such organization, our libraries would become mere collections, inaccessible and in the end unusable.

If we move downward in scope, a very similar phenomenon occurs: As progressively finer levels of analysis are made conscious, explicit and accessible, the range of things one can do with a document increases.

So this is what we mean by "structured documents" and "structured information:" information whose parts identify themselves, making themselves accessible to human and computer processing.

Form vs. content

One can choose parts to identify on many bases. Perhaps the first important choice is whether the goal is to represent the form of some data or information, or some particular "meta-" information about the content, or the content itself. This choice is fundamental, and has radical consequences for what you can do with the resulting structured information.

On paper, form and content are partly intertwined (or as Ted Nelson said in Computer Lib, "intertwingled"). The typographic conventions of our culture, added to our knowledge of the natural language of documents (sometimes properly linguistic, sometimes graphical or otherwise semiotic), permit us to identify the content parts of books.

Computer tools are notoriously bad at identifying content given form, while being superb at the opposite transformation. For example, it is trivial for a computer to render both book titles and emphasis using italics--this makes for no conflict; it requires no artificial intelligence. On the other hand, given two italic portions of a text, computers will fail miserably in telling us which is a book title and which is emphasis.
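The asymmetry can be made concrete with a minimal sketch in Python. The content-type names and the style table here are hypothetical, chosen only for illustration; they come from no particular system.

```python
from typing import List

# Hypothetical content types mapped to typographic form (illustrative only).
STYLE_SHEET = {
    "book-title": "italic",
    "emphasis": "italic",
    "paragraph": "roman",
}

def render(content_type: str) -> str:
    """Content -> form: an unambiguous table lookup."""
    return STYLE_SHEET[content_type]

def possible_content_types(form: str) -> List[str]:
    """Form -> content: the best a program can do is list the candidates."""
    return sorted(t for t, f in STYLE_SHEET.items() if f == form)

print(render("book-title"))              # italic
print(render("emphasis"))                # italic
print(possible_content_types("italic"))  # ['book-title', 'emphasis']
```

Going from content to form always yields one answer; going from form to content yields a list of candidates, and nothing in the italics themselves says which one was meant.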

Because of this inherent asymmetry, moving to a world of computerized information requires us to undertake the work of making content structures explicit. Without this step, we have not in fact moved the information to a new medium; we have merely made an electronic photocopy.

Now photocopiers are immensely useful beasts; but my point is that you can do no more with a photocopy than you can do with the original. You can make certain important gains, such as increasing access while preserving the original from excessive handling, or creating multiple copies for safety, accessibility, or even providing disposable copies for special uses. But once someone obtains the copy, they can do no more with it than with the original.

In our move to the future we must make it possible to do more. Only by representing the structure of the content, not merely the form of its expression in a prior medium, can we achieve the level of function we must have to manage the exponential growth of information we face.

Identifying component parts and relationships

Information professionals of all kinds have names for the many parts that make up document and other information structures. As one example, a quick examination of the Chicago Manual of Style will reveal many, because much of the goal there is to explain how to represent the components of structured content by using typographic form.

When creating new information it is relatively easy to identify the types of content objects. An author can state authoritatively what their intent is as they place a given content object such as a paragraph, a quotation, a line of poetry, or an axiom. Indeed, the author must make such a choice before they can possibly choose to do some word-processor action to express it. At times an author may be unconscious of these choices, and that is fine--literature is often held, in retrospect, to be most significant and meaningful at levels the author may never have seen. Still, authors' choices of structure are our key source of information about their work, and this holds at all levels from the phonological and grammatical, up to the placement and order of chapters, indexes and the like.

When dealing with pre-existing information we do not have the luxury of being the author: we can only do our best to discern structure and meaning from what we have. We can look for clues to structure in typography, and these are often very clear; but we may also wish to find structures that are completely implicit or are obscured by neutralization. For example, when the Oxford English Dictionary was converted into a structured electronic document, researchers found about 20 distinct uses for italics; only by the painstaking task of teasing etymologies apart from Latin cognates apart from literary examples, and so on, was the result made truly useful. At a subtler level, one may wish to explicate structures that are hypotheses: a literary critic may claim that some passage constitutes an allusion to Paradise Lost. The validity of such a claim is normally debatable; but explicit structure is a way to express the claim itself.

The key innovation required to move forward is that we must choose truly useful structures and make them explicit. The structure will be there anyway, but using it must remain a purely manual task unless the structure is made explicit.

Why do we need structure?

Structure is really there

Structure is in our documents. We cannot avoid it, though we can choose what kind to use in any situation. Authors clearly think in terms of linguistic discourse and other structures while writing, though much of this activity becomes automatic with practice. We also use structure constantly in navigating the information we have. Finding aids also have a great deal of structure, created through careful design.

Often we make use of structure without thinking. Open any document and structure leaps off the page: lists, figures, footnotes and the like are all over. And as documents grow larger, explicit structure-aware tools start to appear: Indexes reflect the thematic or topical structure, while tables of contents reflect the broad-stroke discourse or organizational structure, and bibliographies reveal something of the referential or link structure.

In reference works, such as those of particular interest in this forum, structure is if anything even more important. Without carefully designed subject categories, levels of organization and description, etc., navigation in large information spaces obviously bogs down.

Structure provides a way to name things

One great advantage of structured information is that raising the component parts of information to the level of explicit representation often leads to giving them names. As Ursula LeGuin reminds us, "the name is the thing, and the true name is the true thing. To speak the name is to control the thing." Nowhere is this truer than in the realm of information.

Navigation requires naming, as does access whether by database, catalog, finding aid, or hypertext link. Choosing the right names for information units is perhaps the most crucial issue facing the electronic document community today.

We have spoken already of type-names, which say what manner of thing some thing is. But now we turn to instance-names, which pick out specific individuals: not X is a book or quotation or word or link, but X is that book or that quotation or that word or that link.

Imagine for a moment we lacked such names for our information: what if there were no chapter, section, or at least page divisions authors could cross-reference to? Cross-reference would become impossible. This is almost inconceivable at the level of whole documents; a book without a title will be given one or die a quiet death. But what of those internal components we have been discussing? Ancient texts lacked internal names; the important ones have been forced to acquire them. One can hardly find a modern Bible printed without chapter and verse divisions, and the same is true at least for scholarly editions of most classical works. Manuscripts often lack such internal cues, making the texts before us that much more complex.

For recent works we resort to page numbers for cross-reference: "see page 37 of Smith (1995)." This is possible because the number of copies whose pagination matches is very high; many books never achieve a second edition, or even a second printing. But for those that do, the use of page numbers poses a problem that brings us back to structure: page numbers break. This is obvious, but easily forgotten:

A large-print edition cannot be done without either making the pages physically huge and unwieldy, or making the page numbers useless. This problem is inherent with pre-formatted data, and appears in most word-processors: one cannot narrow the window without clipping off the end of every line.
Even a tiny change to the content may break all later page numbers, and such effects are cumulative.

Why do these things happen? Simply because pages are not structural units in literature. They are certainly "structural units" in the far different domain of typography, but typography is not document structure in the sense of interest. A book is "the same" if reprinted from quarto to octavo and from Garamond 24 to Times 12 in all but a few senses.
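A small Python sketch may make the point tangible. The toy document, its paragraph lengths, and the two page sizes are all invented for illustration; the only claim is that a structural name stays fixed while the page number follows the layout.

```python
from typing import List, Tuple

# A toy document: (chapter, paragraph within chapter, length in lines).
# All values are made up for illustration.
DOCUMENT: List[Tuple[int, int, int]] = [
    (1, 1, 12), (1, 2, 30), (1, 3, 8),
    (2, 1, 25), (2, 2, 14), (2, 3, 40),
]

def page_of(chapter: int, para: int, lines_per_page: int) -> int:
    """Return the 1-based page on which the given paragraph starts."""
    lines_so_far = 0
    for ch, p, length in DOCUMENT:
        if (ch, p) == (chapter, para):
            return lines_so_far // lines_per_page + 1
        lines_so_far += length
    raise KeyError((chapter, para))

# The structural name "chapter 2, paragraph 2" is the same in both layouts;
# the page number is a property of the layout, not of the text.
print(page_of(2, 2, lines_per_page=40))  # 2
print(page_of(2, 2, lines_per_page=25))  # 4
```

A cross-reference stored as (2, 2) survives the change to the smaller page; a cross-reference stored as "page 2" does not.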

Precisely the same issue affects reference tools such as finding aids. What if the only names for things were chosen from a space that itself had little structure? For example, say that libraries were organized and accessed solely by ISBN or acquisition number, or that there were no levels of organization in a finding aid, but merely prose, perhaps with markup for font changes and the like. While the presence of names would at least make access possible, there would be a radical loss in functionality.

Structure versus the alternatives

The careful choice of structures, and the careful assignment of systematic names to them, provide the tools required to navigate through the vast information-spaces that are just around the corner.

Many proposals have been made to instead copy the notion of pages into this newer electronic world: "Just scan the LC and drop it on the net." A few years ago one could hear the same theory, but suggesting optical disk jukeboxes; and before that, microfilm. As I mentioned earlier, this approach is not truly a new medium, but merely a new kind of papyrus on which to store a copy of the original medium: highly useful but purely a quantitative, incremental change. This path can never lead to the new world of navigable, accessible information space we hope to reach.

This is because a scanned image does not contain explicit structural information that can be used to support such processes. It is exactly as if one converted to an "electronic catalog" by scanning all the 3x5 cards and doing no OCR. I suppose such a catalog would be "online," and it would have the advantage of being easily copied, backed up, and transported. But imagine using it!

The next step up from pictures of information is very popular right now: the "plain ASCII text file"--this sings the Siren song of portability, and has become popular for several reasons: First, it is vastly more amenable to machine processing than a bitmapped page. You can search it at least for words, you can mail it around, and any old software can at least display it. This is a good reason, as a half glass of water is better than none. But the other reasons are poor. We limit ourselves to "ASCII" because our networks won't take anything else without running uuencode or pkzip or binhex first, and none of those are commonplace on all computer platforms. Also, this is all the information we can get for no effort: a scanner, OCR software, and automated spelling checker will get you to "plain ASCII", and no further.

Consider some of the things that cannot be represented in "plain ASCII":

Any characters not in the very restricted set used by English, such as French accented vowels, not to mention the deeper difficulties of Greek, Hebrew, and Japanese.
Footnotes: where do you put them, at their reference or at the bottom of the page or at the end of the book? how do you know they're footnotes in the first place?
Running headers and footers.

Beyond these obvious limitations there is a subtler problem: such files often use conventions to represent information about structure. For example, block quotes may be indented by adding spaces before each line, or titles may be centered by adding enough spaces to approximately center them (but, center relative to what?).

To the extent files use such conventions they at least potentially gain useful functionality, but aren't "plain ASCII" anymore. Some of the characters are not just characters, they have become markup, giving information about the text. The main difference between such conventions and true markup is that the conventions are inconsistent and undocumented.
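To see how fragile such undocumented conventions are, consider a sketch of a program trying to recover "this line is a centered title" from leading spaces alone. The function, its tolerance, and the candidate line widths are assumptions made for this illustration.

```python
def looks_centered(line: str, assumed_width: int, tolerance: int = 2) -> bool:
    """Guess whether leading spaces were meant to center the text.

    The answer depends on assumed_width, which the file never states.
    """
    text = line.strip()
    if not text:
        return False
    indent = len(line) - len(line.lstrip(" "))
    ideal_indent = (assumed_width - len(text)) // 2
    return abs(indent - ideal_indent) <= tolerance

# 30 spaces followed by a 22-character title.
title = " " * 30 + "Structured Information"

print(looks_centered(title, assumed_width=80))  # True
print(looks_centered(title, assumed_width=72))  # False
```

The same line is a centered title under one guess about line width and merely an indented line under another. Explicit markup would simply say which it is.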

I've downloaded many interesting and desirable e-texts from the network, often ones that boasted of being "plain ASCII." The problem is, they lied to me about the text. I was sold (or in some cases given) a file that purports to contain "the text, the whole text, and nothing but the text." But here are some things I found:

Often all but a chapter would be there (possibly this is to accommodate a copyright, but that is another set of issues).
Usually the source edition is not identified. Such identification can be a formidable task in documenting archival materials, but should not be made into a problem by throwing away provenance information when it does exist.
Footnotes may be there, but placed where an end-of-page had occurred in the (presumed) source edition. There is often no way to tell that some few lines are a footnote, and not just part of the text.
The corresponding footnote references might or might not be there, but in any case are indistinguishable from content (there being no superscripts in ASCII).
Any accents are merely dropped, so such text collections are completely Anglo-centric, not to mention that they misrepresent or delete foreign interpolations such as non-English names.
Typographic nuances such as emphasis are gone even when they are crucial. For example, on the flight out here I read a magazine article on world hunger. There was a sentence "world hunger is not a problem"--the "a" in italics. The point of course was that hunger is a complex of many problems, from the biological to the political. Keep the "a" and delete the italics as "plain ASCII" must do, and the meaning changes quite radically: "world hunger is not a problem."
Many important older texts have reference information, verse or line numbers, scenes and acts in plays, titles, attributions and the like. Newer texts have bibliography, section references, and other phenomena. If included at all, these things cannot be found, and any general process must treat them as indistinguishable from content.

Pity the scholar who analyzes such a text, or the cataloger who tries to identify it. The names we need are missing. In LeGuin's terms we do not know the true name, and so cannot control the thing. And if as in her story we should magically learn the true name, we find to our pain that the thing we name is not what we thought--not an unassuming local wizard, but a dragon in disguise.

Structure provides handles for searching

My final point about the need for structure is that structure facilitates searching. Only if the component parts are explicitly identified can we search for information in some particular part. This is why a database of personnel records is better than a list typed into a word processor. You can search for "Jones" as a name and not a street, or "401" as an area code and not a street number, or in my favorite example from one online library catalog, search for the journal titled simply "Linguistics" without getting all the subject entries.

Imagine querying a personnel database for numbers ">10" without being able to specify that you want a "salary" as opposed to "month of hire." This seems obviously absurd. Likewise, everyone here knows why a catalog entry would be almost (almost) useless if the many MARC fields were not distinguished, or were often distinguished inaccurately.
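The difference between the two kinds of search can be sketched in a few lines of Python. The records and field names below are invented for illustration and represent no real system.

```python
from typing import Dict, List

# Made-up personnel records; field names are illustrative assumptions.
RECORDS: List[Dict[str, object]] = [
    {"name": "Jones", "street": "12 Elm Street",
     "salary_thousands": 42, "month_of_hire": 3},
    {"name": "Smith", "street": "401 Jones Avenue",
     "salary_thousands": 9, "month_of_hire": 11},
]

def flat_search(term: str) -> List[str]:
    """Unstructured search: match the term anywhere in the record."""
    return [str(r["name"]) for r in RECORDS
            if any(term in str(v) for v in r.values())]

def field_search(field: str, term: str) -> List[str]:
    """Structured search: match the term only within the named field."""
    return [str(r["name"]) for r in RECORDS if term in str(r[field])]

def greater_than(field: str, limit: int) -> List[str]:
    """A numeric query only means something once a field is named."""
    return [str(r["name"]) for r in RECORDS if int(str(r[field])) > limit]

print(flat_search("Jones"))                  # ['Jones', 'Smith']
print(field_search("name", "Jones"))         # ['Jones']
print(greater_than("salary_thousands", 10))  # ['Jones']
print(greater_than("month_of_hire", 10))     # ['Smith']
```

The flat search returns Smith because Smith lives on Jones Avenue; the ">10" query returns different people depending on which field is meant.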

These cases are so obvious we may hardly think of them as "structure." But as documents go online in their entirety the same issues and tradeoffs apply, albeit in less obvious forms. If we do not represent structure within documents we will not be able to do the things we increasingly want to do with them.

Many finding aids seem to me to occupy a typological middle ground between databases at one end (especially the simple flat-form sort, and less so the more complex and heterogeneous MARC sort), and typical documents at the other. This makes them, if anything, more complex and more needful of careful design than other data. This continuum from simple flat databases to highly structured document bases brings us to the issues of what kinds of structure to represent. As we move from catalogs and abstracts on toward finding aids and eventually full content, correlating the levels of information and using it to increase ease of use will continue to grow in importance.

What kinds of structure are needed?

Basic kinds of data

I'd like to suggest a few basic kinds of structured information, ranging from forms at one extreme to document materials at the other, and then to argue that certain reference materials ranging from MARC to finding aids fall along the continuum in between. I do not think the materials we are considering fall cleanly into either extreme, and I think that because of their intermediate nature they have both advantages and difficulties not present at either extreme.

First let us consider forms, the sort of thing we all fill out from time to time on a sheet of paper with little boxes. Form data has these central characteristics:

One expects many instances of the same group of information items--many copies of the same form type. Although particular instances of a form may leave a few items blank, if there are many such items we suspect a bad form. And although some instances of a form may require surprising explanatory notes beyond what was given space in a box, this again is a sign of a bad form.
Although the information is inevitably presented in some order on a form, the order is not important to the meaning, and more importantly the order of instances of a form is irrelevant. For example, if you and I fill out employment applications the order in which they appear in some file (paper or computer) is irrelevant. Perhaps their chronological order--which of us filled out the form first--should matter, but that is quite a different piece of information.
A form's context is not part of its meaning. More concretely, taking one instance of a form out from among its fellows does not change its meaning in any way. This is crucial to the way in which form-databases work: A report or the result of a search is a list of form instances isolated from their fellows, each of which makes full sense independently.
Related to this, a subtle point regarding the identity of information: if two forms are filled out exactly the same, they are for all processing purposes indistinguishable: they are the same. This problem can only be addressed by adding arbitrary information to distinguish cases. This is why the companies we deal with assign us numbers, and why it is so troublesome when they accidentally assign the same number twice.
Items on a form have little hierarchy: there is not much in the way of item/subitem relationships. One may have a home and business address each with several parts, and it would be wrong to mix the street address of one's home with the zip code of one's business; but such examples are few and are provided explicitly on the form. There cannot be unbounded repetitions of structured sub-parts.
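The point about identity, that two identically filled-out forms are indistinguishable until an arbitrary number is added, can be sketched in Python. The class names, fields, and the counter used to hand out numbers are hypothetical.

```python
from dataclasses import dataclass
from itertools import count

@dataclass(frozen=True)
class Application:
    """A form instance: nothing but its filled-in fields."""
    name: str
    position: str

@dataclass(frozen=True)
class NumberedApplication:
    """The same form with an arbitrary identifier added."""
    applicant_id: int
    name: str
    position: str

a = Application("Pat Jones", "Archivist")
b = Application("Pat Jones", "Archivist")
print(a == b, len({a, b}))  # True 1

ids = count(1)  # hypothetical scheme for assigning numbers
c = NumberedApplication(next(ids), "Pat Jones", "Archivist")
d = NumberedApplication(next(ids), "Pat Jones", "Archivist")
print(c == d, len({c, d}))  # False 2
```

Two applicants with the same name and position collapse into one value in the first case. Only the assigned number keeps them apart in the second, which is also why assigning the same number twice causes so much trouble.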

Now let us leap to the other extreme case, namely documents. They have quite a different pattern when we look at the same characteristics so central to forms:

One expects few instances of a given sequence of pieces of information--it is pure coincidence if two books have the same number of chapters and sections. It is odd to think of a book or article leaving certain items blank, such as chapter one--we find even a non-structural unit such as a page amusing if it is "intentionally left blank" on paper, and absurd online: we suspect a bad document or at least a bad compositor. Likewise "surprising explanatory notes" are normal in documents: we call them footnotes, sidebars, digressions, and so on; these are indeed the norm.
Unlike information on forms, information in a document is in some order that matters to the meaning. It matters rather a lot which paragraph comes first.

<digression>
Since I've just mentioned surprising digressions, I'll enter upon one and talk for just a moment about hypertext theory. When I said that the order of information in documents matters, many of you probably thought of Ted Nelson's now-traditional definition of hypertext as "non-sequential writing." It sounds as if we have a contradiction. But I take Ted's definition to mean writing that is not strictly sequential; in which a single, lock-step sequence imposed by the paper or film medium would restrict the rhetoric of authors and the choices of readers. And indeed truly going to the new medium of hypertext means that we must go beyond such sequential writing.
Authors must give their readers many choices, thinking ahead about their chosen audience(s) and what they may want "next" at any given point. They must also develop new rhetorical devices because they have less control over what the reader has already seen--poor hypertexts become littered with links labeled "If you haven't read x, or if you don't know about y, click here." Implementors must provide tools that make it easy to create and to exercise those choices, and to escape, backtrack, or at least re-orient when the user becomes lost in a maze of twisty little hypertext links, all the same. Readers in their turn must learn to notice the signposts that tell them a choice is possible, and brave the new rhetoric and the loss of a dependable "this then that then that" rule for reading. In some ways it is like looking into a scene through a keyhole.
This said, I note that even the most labyrinthine hypertext is highly sequential. One would hardly consider breaking the words of a paragraph apart into all their possible orders. Many times an author, even Faulkner, must make one passage prerequisite to another, or must lock the user out of one passage until another has been seen. Perhaps the hardest genre of all for which to make a hypertext would be murder mysteries. So, while hypertext radically takes us out beyond the idea of a single sequence or even a small number, it does not overcome time, language, and cognition. While it becomes false that all components of a document are fully or strictly ordered, most paragraphs and other rhetorical components of a document continue to have deep and significant relationships of order and precedence. This is why we must craft hypertext links rather than merely having a computer draw and quarter our texts for us.
</digression>

So in documents, order matters. This second issue poses an inherent performance problem in the relational model. An RDB must store each paragraph (or section, or whatever) as a record in some kind of element table. To produce the correct order, serial numbers must be added to every record (this para is para 1, etc.). To retrieve and display a section, the RDB must thus select all paras with serial numbers in a certain range (likely a slow operation), and then sort the results by serial number. This is wasted effort, because normally only one basic order is ever needed but that same order must be reconstructed over and over. A database model that preserves order saves all this work.
Our third characteristic has to do with taking one instance of data out from among its fellows. While for form data this does not change its meaning, the opposite is clearly true for document data and this is crucial to the way in which document-bases work: The result of a search is not a list of small components isolated from their fellows, but a component in its context.
Some time ago a query language was proposed for documents that lacked this key feature: A query for all occurrences of the word "sower" would get them: sower, sower, sower,.... What one must have is rather different: the list of where "sower" occurs, so as to navigate to those places and examine the context. This differs from getting 100 copies of a 5-letter string, which is no more useful than one copy.
This brings us to the fourth point, namely that two identical objects in a document are not the same. First, it is possible that a word, sentence, or even paragraph be repeated in a document; and second, if this should happen the repetition matters. The instances are not the same thing.
Lastly, while forms have little hierarchy, layers upon layers of substructure are the hallmark of documents.
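The contrast between a query that returns copies of a string and one that returns places in a document can be sketched as follows. The two-paragraph text is made up, and paragraph numbers with character offsets stand in for whatever structural addresses a real document base would use.

```python
import re
from typing import List, Tuple

# A made-up text; paragraph numbers stand in for real structural addresses.
PARAGRAPHS: List[str] = [
    "A sower went out to sow his seed.",
    "The sower sows the word, and the sower waits.",
]

def _pattern(word: str) -> "re.Pattern[str]":
    """Match the word only as a whole word."""
    return re.compile(r"\b" + re.escape(word) + r"\b")

def strings_only(word: str) -> List[str]:
    """The unhelpful result: one copy of the string per occurrence."""
    return [m.group(0) for text in PARAGRAPHS
            for m in _pattern(word).finditer(text)]

def occurrences(word: str) -> List[Tuple[int, int]]:
    """The useful result: (paragraph number, character offset) per hit."""
    return [(number, m.start())
            for number, text in enumerate(PARAGRAPHS, start=1)
            for m in _pattern(word).finditer(text)]

def in_context(hit: Tuple[int, int], word: str, width: int = 12) -> str:
    """Show a hit together with some of the text around it."""
    number, start = hit
    text = PARAGRAPHS[number - 1]
    return text[max(0, start - width):start + len(word) + width]

print(strings_only("sower"))  # ['sower', 'sower', 'sower']
print(occurrences("sower"))   # [(1, 2), (2, 4), (2, 33)]
for hit in occurrences("sower"):
    print(hit, repr(in_context(hit, "sower")))
```

The three strings are identical and so carry no more information than one; the three locations are all different, and each leads back to its own context.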

So on all these fundamental axes forms and documents differ radically. My conclusion is that different tools and methods must be applied in the two domains. So where do finding aids fit in? I believe they share some characteristics of both categories and this may make them particularly complex. A finding aid must include a great deal of information about content, since that is what one is trying to find.

Some meta-information can be reduced to something resembling forms; in one sense a finding aid is similar to a MARC record: a large though typically sparse list of fields. But there is more going on. Those fields do have interdependencies; they do have levels (a colleague working on the John Carter Brown Library's exhaustive bibliography of European Americana ended up dividing author names into something like 20 sub-components and 3 or 4 levels). But finding aids must go even further.

A finding aid must provide access based not only on demographic information--author, title, edition, imprint, subjects, added entries, and a host of fields you all know far better than I do--in addition it must make use of characteristics of the content itself: what is this thing about? What school of thought from the discipline does it represent? What does it relate to in other disciplines?

There would be especial benefit in being able to get hints at relationships as yet unnoticed or unremarked. Markup that identifies relevant content and structure facilitates such a discovery process, by making explicit many of the basic facts upon which conclusions about relationships are based.

One approach to this is the preparation of abstracts, and this has proven very useful. Another is the application of statistical methods to vocabularies and word frequencies, now well understood. But the ultimate answer, I believe, comes from making the whole documents available with as much structure as possible explicitly represented. This is the true information, labeled by true names, from which abstracts and statistics come.

Particulars of document structure

How then do we represent useful structure? Many parts are obvious, and, within the hard constraints of time and budgets we should represent as many of them as possible.

First, almost all documents include various generic component parts: PARAGRAPH, LIST-ITEM, QUOTE, TITLE, EMPHASIS, FOREIGN, IMAGE, and the like. Even rudimentary software can help locate these, because they map almost one-to-one to word-processor or scanner objects. Reasonably skilled yet not scholarly workers can identify them quite reliably even when software cannot.
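As a rough illustration of what explicitly identified generic components buy us, here is a sketch using Python's standard XML parser. The XML syntax is only a stand-in for SGML (the Python standard library has no SGML parser), and the sample sentence is invented; the element names are taken from the list above.

```python
import xml.etree.ElementTree as ET

# An invented sentence, marked up with element names from the list above.
SAMPLE = (
    "<PARAGRAPH>The phrase <FOREIGN>sine qua non</FOREIGN> is "
    "<EMPHASIS>not</EMPHASIS> the title of a book.</PARAGRAPH>"
)

root = ET.fromstring(SAMPLE)

# Each kind of component can be asked for by name...
foreign = [e.text for e in root.iter("FOREIGN")]
emphasized = [e.text for e in root.iter("EMPHASIS")]

# ...and the plain text is still recoverable by ignoring the markup.
plain = "".join(root.itertext())

print(foreign)     # ['sine qua non']
print(emphasized)  # ['not']
print(plain)       # The phrase sine qua non is not the title of a book.
```

On paper both marked phrases would probably appear in italics; here each says what it is, and the plain text can still be had for the asking.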

In much the same category are various generic aggregates: BOOK, CHAPTER, SECTION, FRONT-MATTER, LIST, TABLE. However, for historical reasons typical software gives us no help with these. We have all suffered from word processors not knowing what a "list" is, and so failing to number items right or keep numbers up to date, or forcing us to re-select each list each time it changes so the software will know what to renumber. We also suffer the pain when we want to move, delete, or otherwise deal with sections and chapters. Add-on outliners help, but because few word processors truly represent any structural unit larger than a paragraph, they must use heuristics (such as "Find the next paragraph of type HEADING-2, and assume everything between is the current SECTION") that are both slow and unreliable.
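The outliner heuristic just quoted can be sketched in Python to show why it is fragile. The style names and sample paragraphs are assumptions for illustration; no particular word processor works exactly this way.

```python
from typing import List, Tuple

Para = Tuple[str, str]  # (style name, text); style names are assumptions

FLAT: List[Para] = [
    ("HEADING-2", "Form vs. content"),
    ("BODY", "One can choose parts to identify on many bases."),
    ("BODY", "On paper, form and content are partly intertwined."),
    ("HEADING-2", "Why do we need structure?"),
    ("BODY", "Structure is in our documents."),
]

def section_at(paras: List[Para], index: int) -> List[str]:
    """Heuristic: scan back to the previous HEADING-2 and forward to the
    next one, and assume everything between is the current SECTION."""
    start = index
    while start > 0 and paras[start][0] != "HEADING-2":
        start -= 1
    end = index + 1
    while end < len(paras) and paras[end][0] != "HEADING-2":
        end += 1
    return [text for _, text in paras[start:end]]

print(len(section_at(FLAT, 2)))  # 3

# If an author typed the second heading as bold body text, the heuristic
# silently merges the two sections.
MISTYPED = list(FLAT)
MISTYPED[3] = ("BODY", "Why do we need structure?")
print(len(section_at(MISTYPED, 2)))  # 5
```

Every request rescans the paragraphs, and a single mislabeled heading changes the answer without any warning. A representation in which the SECTION is itself an object needs neither the scan nor the guess.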

Each genre, from poetry to manuals to finding aids, requires specialized objects: STANZA, REPAIR-PROCEDURE, AXIOM, PART-NUMBER, CATALOG-CODE. Identifying the right ones for finding aids is a crucial step, requiring ongoing research. It cannot be established once and for all, just as the list of defined subject headings for literature cannot be defined once and for all. The Berkeley Finding Aid Project has undertaken this task with zeal, and I expect its already fine results to improve even further as more important components emerge in the course of addressing a growing sample of finding aids.

Another major kind of component is access tools: these range from the ubiquitous footnote and sidebar, through cross-references, bibliographies, and the like. Paper necessitated other navigation tools as well, such as indexes and tables of contents. Of course these components should be represented.

The automation of cross-references is the hypertext link: one ought to be able to click on any such reference and have it work. Less obvious but quite similar is the quotation: any quotation should work to access the quoted document. If that document is undergoing change, such as critical editing or a rewrite by a living author, then the user may also wish the quotation to be dynamically updated.

Many phenomena that are evident in printed texts are not structural units that need to be identified for most purposes. Line breaks, discretionary hyphens, font and other typographic choices, and the like are usually not structural except insofar as they may serve to communicate some other structures.

How to decide what's structure?

When planning an encoding project, two primary questions are what structures are of interest and which are to be encoded. How they are encoded is important, but strictly less so than the fact that they are encoded. Any encoding project faces economic as well as intellectual decisions, and I will not be talking about how to decide which things not to encode when finances are running out. This depends on the goals and usage scenarios envisioned for the data. But my normal advice on the subject is that within the constraints of budget, encode anything that you think will be of independent use later. Here are a few specific diagnostic questions to ask:

Does the component under consideration survive re-laying out the document?
Is it useful for multiple purposes?
Would an author or reader have a name for the thing (individual or kind)?
Might someone want to search for it specifically (as opposed to just text)?
Does it have other things it surrounds, fills, or is associated with?

The list obviously leads one strongly toward conceptual units, at the expense of the merely typographic. That is, for most purposes the placement of line breaks and discretionary hyphens, the choices of font, and so on do not require encoding except insofar as they communicate some other structures.
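The question about searching can be made concrete with a minimal sketch in Python. XML is used here as a convenient stand-in for SGML, and the element names (findingaid, series, title, item, date, desc) are invented for illustration; they are not taken from any actual finding-aid DTD. The sketch contrasts a plain-text search with a search restricted to one kind of component.

```python
import xml.etree.ElementTree as ET

# Hypothetical finding-aid fragment; all element names are assumptions.
SAMPLE = """
<findingaid>
  <series>
    <title>Correspondence</title>
    <item><date>1906</date><desc>Letter describing the Berkeley campus</desc></item>
    <item><date>1912</date><desc>Letter sent from Berkeley</desc></item>
  </series>
  <series>
    <title>Berkeley photographs</title>
    <item><date>1906</date><desc>Views of the campus</desc></item>
  </series>
</findingaid>
"""

root = ET.fromstring(SAMPLE)


def text_search(root, word):
    """Plain-text search: tags of all elements whose own text mentions the word."""
    return [el.tag for el in root.iter() if el.text and word in el.text]


def titles_containing(root, word):
    """Structure-aware search: only title components that mention the word."""
    return [t.text for t in root.iter("title") if word in (t.text or "")]


plain_hits = text_search(root, "Berkeley")
title_hits = titles_containing(root, "Berkeley")
print(plain_hits)
print(title_hits)
```

The plain search cannot tell a series about Berkeley from an item that merely mentions it; the structure-aware search can, but only because the title was identified as a component when the data was encoded.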

The final point about deciding what structure to encode is that one should study existing standards first. Encoding is not a nascent field, nor is the use of SGML. Excellent advice is available, and can save a great deal of time and help one avoid backtracking later.

How does SGML fit the bill?

SGML is the best choice for encoding these structures. It has two truly crucial advantages. First, it imposes no fixed set of component types: you can define the structures as you want them for the task at hand. At the CETH Workshop on Documenting Electronic Texts one speaker expressed some doubt about whether SGML was flexible enough to provide a complete equivalent to MARC (that is, an alternative representation of all the same data). By the next coffee break three SGML experts in the room had written out DTDs to compare (hardly polished, of course).

Second, SGML is a public, non-proprietary standard that will not change with each new release of some company's software. Software vendors conform to it, rather than it and your data conforming to software vendors. This is what justifies confidence that SGML data will survive for the long term, beyond any current software used on it.

So are there any downsides to SGML? Only a few. One seeming downside is that SGML requires more thought about the data. As mentioned earlier, OCR is no longer the end of the story; the work must continue with sometimes-hard decisions about the nature of your data. I myself consider this an upside: it does require extra effort, but the effort pays off.

The main downside to SGML is that it provides too large a wealth of options: alternative syntax, abbreviatory conventions, and the like; few people bother learning them all. Fortunately such options are just that: options. Many do not add functionality or capability, merely alternative methods, and so any project can avoid them simply by deciding to. Thus most SGML experts have adopted what has come to be called a "monastic" approach: "just say no" to any features you don't need.

Related Standards

Before closing I should mention two other ISO standards that relate closely to SGML. The first is HyTime, which provides a set of extensions to SGML for representing hypermedia linking. All of the extensions are backward-compatible, in the sense that a regular SGML parser will still parse a HyTime document correctly. But the special semantics of certain data are extended, so that HyTime knows more than SGML about links between documents, chains of indirect links, pointers into non-SGML data such as graphics and video, and so on.

HyTime's reference mechanisms work in three basic ways: by specifying a unique name from some particular name space (such as formal IDs in SGML documents, document names, etc.); by counting along some particular axes (such as picking out a rectangle from a graphic, or a time-range in music); and by retrieving objects based on some property they have (such as searching for a contained string, or for elements of a particular type).
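The three kinds of reference can be illustrated by analogy with a minimal Python sketch. This is not HyTime syntax and does not use any HyTime software; XML stands in for SGML, and the document, its element names, and the function names by_name, by_count, and by_property are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; element names and IDs are assumptions.
DOC = """
<doc>
  <sec id="intro"><p>Structure matters.</p><p>Form differs from content.</p></sec>
  <sec id="sgml"><p>SGML is a standard.</p></sec>
</doc>
"""

root = ET.fromstring(DOC)


def by_name(root, name):
    """Name-space reference: find the element carrying a given unique ID."""
    for el in root.iter():
        if el.get("id") == name:
            return el
    return None


def by_count(root, path):
    """Counting reference: follow 1-based child positions down the tree."""
    el = root
    for n in path:
        el = list(el)[n - 1]
    return el


def by_property(root, tag, substring):
    """Property reference: elements of one type whose text contains a string."""
    return [el for el in root.iter(tag) if substring in (el.text or "")]


named = by_name(root, "sgml")            # the section whose ID is "sgml"
counted = by_count(root, (1, 2))         # second child of the first section
queried = by_property(root, "p", "standard")
```

Each style has its own failure mode: a name survives rearrangement but must be assigned in advance, a count needs no preparation but breaks when siblings are inserted, and a query finds whatever currently matches.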

Like SGML, HyTime has an elegant and very powerful set of core features, surrounded by an enormous selection of options and alternatives. A monastic approach to HyTime has the same benefits as for SGML, only more so.

The Web has popularized a different way of linking, based on URLs. As noted before, there is grave danger in building on the foundation of document locations rather than document identifiers. A Web URL by definition breaks with even the most minor changes of environment: renaming the file, re-arranging directory structures, switching to a new hard disk. This is just like the case of citing books by shelf or accession number rather than ISBN or LCCN. Fortunately some in the Web community are aware of this and working diligently on a "URN" or "Uniform Resource Name" specification. But until it arrives and is widely used, every link on the Web adds something to the Internet deficit, and the price of fixing those links will have to be paid, whether by us or our virtual children.
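The difference between linking by location and linking by identifier can be sketched in a few lines of Python. This is only an illustration of the indirection involved; the Resolver class, the name "urn:example:finding-aid:0001", and the host names are all hypothetical, and nothing here reflects the syntax or resolution mechanism of any actual URN specification.

```python
class Resolver:
    """Hypothetical lookup table from persistent names to current locations."""

    def __init__(self):
        self._locations = {}

    def register(self, name, location):
        # Registering an existing name again records its new location.
        self._locations[name] = location

    def resolve(self, name):
        return self._locations[name]


resolver = Resolver()
resolver.register("urn:example:finding-aid:0001",
                  "http://host-a.example/aids/0001.sgm")

# Two documents cite the same finding aid in two different ways.
link_by_location = "http://host-a.example/aids/0001.sgm"
link_by_name = "urn:example:finding-aid:0001"

# The file moves to another machine; only the resolver entry is updated.
resolver.register("urn:example:finding-aid:0001",
                  "http://host-b.example/archive/0001.sgm")

current = resolver.resolve(link_by_name)
location_link_is_stale = link_by_location != current
```

After the move the name-based link still resolves, at the cost of one update in one place, while every copy of the location-based link must be found and repaired individually.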

Summary

Finding aids are the next logical step in progressing from information about the form of documents, through information about documents, to documents themselves. At all these stages what the computer can do with data depends most importantly on the model applied to the data. A simple facsimile of a manuscript or other object is useful, but does not enable qualitatively new processing, just as a microfilm copy of a card catalog is useful, but not revolutionary.

In designing new models for electronic data it is important to consider whether traditional models such as the relational database really fit. In examining several basic properties of relational data versus documents in general, it becomes clear that the fit is questionable. Newer technologies are needed, and new design questions need to be researched and solved.

SGML provides a generic way of representing certain models about document structure, and of representing documents given those models. Because it is a formal international standard and has achieved very wide and diverse use, it is a safe long-term vessel for important data. As with many standards, a monastic approach to SGML enhances portability, durability, and interoperability.

Bibliography

Bush, Vannevar. 1945. "As We May Think." Atlantic Monthly 176 (July): 101-108.

Coombs, James H., Allen H. Renear, and Steven J. DeRose. 1987. "Markup Systems and the Future of Scholarly Text Processing." Communications of the Association for Computing Machinery 30 (11): 933-947.

DeRose, Steven J. and David G. Durand. 1994. Making Hypermedia Work: A User's Guide to HyTime. Boston: Kluwer Academic Publishers.

DeRose, Steven J., David G. Durand, Elli Mylonas, and Allen H. Renear. 1990. "What is Text, Really?" Journal of Computing in Higher Education 1 (2): 3-26.

Herwijnen, Eric. 1990. Practical SGML. The Netherlands: Kluwer Academic Publishing Group.

Horowitz, Lisa R. 1994. CETH Workshop on Documenting Electronic Texts, May 16-18, Somerset, NJ. Technical Report #2: Center for Electronic Texts in the Humanities, Rutgers and Princeton Universities.

International Organization for Standardization. 1992. ISO/IEC IS 10744: Hypermedia/Time-based Structuring Language: HyTime.

International Organization for Standardization. 1986. ISO 8879:1986(E). Information Processing--Text and Office Information Systems--Standard Generalized Markup Language.

Nelson, Ted. 1987. Computer Lib. Redmond, Washington: Tempus Books of Microsoft Press.

Sperberg-McQueen, C. Michael and Lou Burnard. 1989. Guidelines for the Encoding of Machine-readable Texts. Also known as Text Encoding Initiative document P1. Chicago: Text Encoding Initiative.

Tompa, Frank W. 1989. "What is (tagged) text?" Dictionaries in the Electronic Age: Proceedings of the Fifth Annual Conference of the UW Centre for the New Oxford English Dictionary: 81-93.