Access to Digital Representations of Archival Materials

The Berkeley Finding Aid Project


Daniel V. Pitti
Librarian for Advanced Technologies Projects

The Library
University of California, Berkeley

Published in RLG Digital Image Access Project: Proceedings from an RLG Symposium (Palo Alto: The Research Libraries Group, 1995), pp. 73-81.
(Paper prepared for the RLG DIAP Workshop, March 1995, Palo Alto, California)


The Library at the University of California at Berkeley is currently engaged in three complementary research and demonstration projects that have as their ultimate goal a comprehensive standards-based digital archive and library system. Among other planned features, this system will provide access to digital representations of selected primary source materials, including pictorial materials. As currently conceived, the system will provide standards-based hierarchical access to collections through USMARC collection-level records linked to SGML-based finding aids linked to digital representations of the primary source materials. The first of these initiatives, and the primary focus of today's presentation, is the Berkeley Finding Aid Project. Its main objective is to provide the archive and library communities with a foundation upon which they can construct content and encoding standards for finding aids. This Project is currently entering its final phase. The second initiative, the California Heritage Digital Image Access Project, began in January 1995. It will build on the work of the first project by linking USMARC collection-level records to SGML-based finding aids and the finding aids to digital representations of 25,000 pictorial items on California history from The Bancroft Library collections. The third initiative, the American Heritage Virtual Archive Project, is currently being planned. This project will involve collaboration between several repositories of different sizes and types that will explore intellectual, political, and technological issues associated with building a single virtual archive of primary sources documenting American history.

The Berkeley Finding Aid Project is a collaborative endeavor to test the feasibility and desirability of developing an encoding standard for archive, museum, and library finding aids. The Project is funded in part by a grant from the United States Department of Education, Higher Education Act Title IIA Research and Demonstration Grant program. It is also supported by generous software grants from Electronic Book Technologies and ArborText. Work on the Project began in October 1993 and will be completed in September 1995. The Project involves two interrelated activities. The first task entails the design and creation of a prototype encoding standard for finding aids. Building a prototype database of finding aids based on the encoding scheme is the second objective of the Project.

Why Do We Need an Encoding Standard for Archival Finding Aids?

There are many reasons why an encoding standard for archival finding aids is needed. Finding aids, like the collections described in them, are extremely valuable. The information contained in finding aids is the product of countless hours of painstaking analysis and research by archivists, embodying knowledge and experience possessed by no others. Finding aids as such are unique intellectual documents providing windows into collections. In addition to their intellectual value, they represent a substantial investment of money, usually public money. We thus have every reason to preserve them, just as we have an obligation to preserve the collections they describe. Unfortunately, the digital finding aids that we have created to date are in danger of being digitally stranded by ongoing, rapid changes in hardware and software. Finding aids created using proprietary word processing software such as WordPerfect and Microsoft Word, or database software such as dBase, will remain usable only if we reformat them each time the manufacturers update the software. Of course, for such reformatting to work properly, software manufacturers must provide translation programs that provide 100% fidelity in the data migration process. We all know from unpleasant experience that this is not always the case, and so information can be lost or garbled. As finding aid collections grow, such reformatting through successive software versions will become more burdensome. Even if the institution can survive the ordeal of eternally updating software, another concern has to be the durability of the software firm itself, and the durability of its interest in the software. If the firm goes out of business, or if it no longer finds the product profitable, an archive can find itself with a database of finding aids stranded in time. If finding aids are encoded using an independent standard, then their survival will not be contingent on a particular hardware or software configuration.
A standard thus offers us reasonable assurance that our electronic finding aids will survive changing technology, that we will be able to communicate with the future.

Building virtual archives of digital representations of selected materials from our archive and manuscript collections, as we are now beginning to do, provides us with another motivation for developing an encoding standard for finding aids. We must intellectually and physically manage and control the objects in digital archives, just as we have managed and controlled the traditional materials that frequently serve as their source. Creating digital repositories will require not only a rigorous definition of the intellectual access and description found in traditional registers and inventories, but also precise definitions of the mechanisms needed to locate and present computer files in a variety of notations. We will need to create complex, compound archival information constellations comprised of finding aids, digital representations of the primary sources, and information that maintains the integrity and interrelation of the two across different hardware and software platforms. Just as is the case in a traditional repository, if we fail to manage the digital treasures in our care, we will lose them for all time. If we are to build digital archives in a rational and responsible manner, then we must put in place durable, reliable intellectual and physical control. The only means to assure the enduring access and control required is through standards. We cannot afford to build digital archives based on proprietary hardware and software that will make our collections hostage to the vagaries of the computer industry.

Used in conjunction with the ubiquitous connectivity of the Internet, a standard for encoding finding aids will enable libraries and archives to easily communicate and share information about their collections with one another and with users. It will be possible for users and staff to easily locate and gather information about related but different collections. We will be able to restore the unity and integrity of collections that have been dispersed among two or more repositories, creating virtual collections. Direct access to information about one another's collections will facilitate inter-institutional cooperation in collection development and preservation, where knowledge of the holdings of other institutions can help curators make difficult decisions about how to spend scarce dollars developing, managing, and preserving their own collections.

Project Assumptions and Methodology

Developing an encoding standard for finding aids has presented, and continues to present, many intellectual, technical, and political challenges. Project staff assumed that converting traditional finding aids into machine-readable form would immediately transform their nature and utility, opening up new realms of functionality. At the same time they realized that they had no experience creating, reading, and using online finding aids, and so were uncertain of what the new possibilities would be, let alone how they ought to be exploited. They also realized that even if they had the experience, reinventing finding aids was properly an undertaking for the archival and library community, not a group of researchers at Berkeley working in splendid isolation. As complicated as this scenario was, it was made even more so by the fact that no content standard for finding aids existed (or exists). An encoding scheme by its very nature must assume content to be encoded. The project staff decided to address these practical, intellectual, and political issues by working from the present into the future. Instead of attempting a wholesale re-invention of finding aids, they would begin by developing a content-based model of finding aids using representative paper finding aids gathered from a cross section of the archive and library community. The representative pool of guides would ensure that the model of finding aid content and structure would reflect widespread practice. It would also ensure that the model would accommodate a wide variety of finding aids. Based on current community practice and understanding, the model would provide a familiar point of departure for exploring and experiencing the new digital world. The model would serve as the foundation of the prototype encoding scheme, and the encoding scheme itself would serve as the technical infrastructure of the machine-readable finding aids and prototype database.
The transformed and digitized finding aids based on the model would then provide the community with the experience it needed to understand the new possibilities, and, so equipped, to engage in a community-wide effort to debate and reinvent the nature and purpose of finding aids. Project staff, after considering the various possible ways of encoding electronic text, chose Standard Generalized Markup Language as the best available basis for the prototype standard.

What is Standard Generalized Markup Language?

Readers familiar with USMARC, or MARC in any of its varieties, will find that this familiarity both helps and hinders acquiring an understanding of Standard Generalized Markup Language (SGML). Knowledge of MARC helps because MARC prescribes content designation and not presentation formatting, indexing, or any other use of the content itself. MARC records can be used to print catalog cards or for online screen displays. Content designation is very much what SGML intends. Thus both MARC and SGML are about what is in documents, and not what you do with that content. MARC hinders an understanding of SGML in that MARC is dedicated to one kind of document, the bibliographic record--efforts to expand it to other categories notwithstanding. From this it follows that MARC software is dedicated to one document type, the MARC record. SGML differs markedly (no pun intended) from MARC in this regard.

While Standard Generalized Markup Language is both standard (ISO 8879) and generalized, it does not provide an off-the-shelf markup language that one can simply take home and apply to a letter, a novel, an article, a software manual, or a catalog record. What it really is, in fact, is a markup language meta-standard, or in simpler words, a standard for constructing markup languages. SGML provides a syntax and a meta-language for defining and expressing the logical structure of documents, and conventions for naming the components or elements of documents. One can think of SGML as a set of formal rules for defining specific markup languages for individual kinds of documents. Using these formal rules, a community sharing a particular kind of document can get together and create a markup language specific to that kind of document.

The specific markup languages written in compliance with formal SGML requirements are called Document Type Definitions, or DTDs. For example, the Association of American Publishers with OCLC has developed a set of three DTDs: one for books, one for journals, and one for journal articles. A consortium of software developers and producers is developing a DTD for computer manuals. A colleague of mine at Berkeley has developed a USMARC DTD for use in a prototype bibliographic catalog employing advanced retrieval technology. Document Type Definitions shared and followed by a community are themselves standards. The Association of American Publishers DTD is registered as ANSI/NISO Z39.59-1988, and after substantial revision, was approved just last year as an international standard, ISO 12083.
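To make the mechanism concrete, the following is a hypothetical fragment of a DTD. It is not drawn from any of the DTDs mentioned above; the element names are invented for illustration. It shows the formal element declarations SGML prescribes:

```sgml
<!-- Hypothetical illustration: a miniature DTD for a simple report.  -->
<!-- Each declaration names an element and defines what it contains;  -->
<!-- "+" means one or more, and the "- -"/"- o" codes state whether   -->
<!-- start- and end-tags may be omitted.                              -->
<!ELEMENT report  - - (title, section+)>
<!ELEMENT title   - o (#PCDATA)>
<!ELEMENT section - - (heading, para+)>
<!ELEMENT heading - o (#PCDATA)>
<!ELEMENT para    - o (#PCDATA)>
```

An SGML parser can validate any document claiming to be a report against these declarations, which is what allows general-purpose SGML software to adapt itself to any compliant DTD.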

SGML is thus very general and abstract. It exists formally over and above individual markup languages for specific document classes. It is also a standard, which is to say, a formal set of conventions in the public domain, not owned by and thus not dependent on any hardware or software producer. That SGML is a standard offers its users reasonable assurance that the information created will not become obsolete because of hardware and software developments.

The formality and generality of SGML have very important implications. Because SGML syntax and rules are formal and precise, it is possible to write software that can be easily adjusted to work with any compliant Document Type Definition. Typically, an SGML software product has a toolkit that allows the user to adapt its functionality to his or her Document Type Definition. As a result, the market driving SGML software development is in principle everyone. As I mentioned above, this is very different from the MARC software market, which consists almost exclusively of libraries and a few archives and museums. Libraries, archives, and museums make up a very small, cash-poor community. And while I do not want to insult the MARC-oriented software developers--they do the best they can with limited resources--the products available reflect the limited resources. On the other hand, the SGML market includes virtually everyone. The Department of Defense now requires contractors to supply technical information using four standards. SGML is one of them. SGML is now being used by several software producers; airline, automobile, and tractor manufacturers; the Department of Energy and other government agencies; a wide variety of print and electronic publishers; and the Text Encoding Initiative, an international project to provide encoding standards to support linguistic and literary research.

To give you an idea of how broad and varied the potential users of SGML are, let me cite just a few of the affiliations of the people who have subscribed to a listserv devoted to SGML-related activities in northern California: the Research Libraries Group, Lockheed Space and Missile, Silicon Graphics, Berkeley Department of East Asian Languages, the Institute of Forestry Genetics, Lawrence Berkeley Laboratory, UC Berkeley Library, Dialog Information Services, Berkeley Department of Slavic Languages and Literatures, and many, many more. The list of SGML-related software developers reflects confidence in the potential of this market: WordPerfect, Microsoft, Xerox, Frame, Electronic Book Technologies, Avalanche, ArborText, SoftQuad, AutoGraphics, Open Text, Information Dimensions Inc., Exoterica, Object Design, and a host of others. Firms such as WordPerfect and Microsoft are not interested in little markets. Products on the market or under development include Z39.50-compliant client/server databases and object-oriented databases; a wide variety of authoring applications; conversion software; and electronic multimedia and paper-based publishing tools.

In order to understand why SGML has generated such broad interest from both users and developers, it is useful to consider the nature of markup and what kind of markup SGML promotes. In an article now considered by many to be a classic presentation of document markup theory, James Coombs, Allen Renear, and Steven DeRose distinguished six kinds of markup, three of which I would like to discuss briefly: procedural, descriptive, and referential.

In the last few years, through the use of word processing systems, we have become familiar with procedural markup. Procedural markup consists of processing instructions to the computer. It tells the computer what to do with specified components of the text. For example, the title of a major section might have instructions that tell the printer to center the text, use a font of a certain size, and perhaps print it in bold italics. Most procedural markup is characterized by being paper directed; that is, it tells the printer how to put the text on paper. If you want to do anything else with the text, the markup is not of much help. If you want to search for the initialism "SGML" in the machine-readable version of a book, but only where it occurs as a chapter or section title, the procedural markup provides no assistance. Nor does it help if you want to display the text on a computer screen, since paper presentation and monitor presentation are quite different. Finally, procedural markup is characterized by a further limitation: to date, all procedural markup has been proprietary. This means, for example, that documents created in WordPerfect cannot be processed flawlessly in Microsoft Word and vice versa. Each word processing software package uses its own markup. In this environment, the future of the document is tied to the future of the software.

A second type of markup mentioned by Coombs, Renear, and DeRose is descriptive markup. With descriptive markup, we arrive at the form of markup recommended by SGML. Descriptive markup identifies the logical components of documents. While procedural markup specifies a particular procedure to be applied to a document component, descriptive markup indicates what the component is. Examples are chapter, chapter title, section, paragraph, author, publisher, and cataloging-in-publication data. None of these gives any indication of what procedures are to be applied to these components. But, if you know the elements in a document, then you can have processors do whatever you want with them. Descriptive markup liberates the document for multiple uses. It is possible, for example, to use one and the same source document to produce printed, electronic, Braille, and voice-synthesized versions, and, for good measure, to produce HTML/Mosaic and Gopher versions. Of course the down side of liberty is that it can be abused, but that is another matter. The fact that descriptive markup can be used in so many different ways is one of its important characteristics. It escapes the single-use trap of procedural markup.
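A sketch of what descriptive markup looks like in practice may help; the element names here are invented for illustration rather than taken from any particular DTD:

```sgml
<!-- Descriptive markup: the tags say what each component IS, not how -->
<!-- it should look. A stylesheet or processing program decides       -->
<!-- separately how to render, index, or convert each element.        -->
<chapter>
<title>Access to Digital Representations</title>
<para>The markup identifies the logical components of the
document; printed, screen, Braille, and HTML versions can all
be generated from this single source.</para>
</chapter>
```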

It is useful to distinguish two kinds of descriptive markup: structural and nominal. Descriptive structural markup identifies document components and their logical relationship. Structural elements are components that you usually want to present visually in some distinct manner. Examples are chapter titles, paragraphs, block quotes, and the like. Descriptive nominal markup, as you might expect, identifies named entities, both concrete and abstract. Examples are corporate names, personal names, topical subjects, genres, and geographic names. While you may want to visually present these names online or on paper in some particular manner, you usually want to index them in particular ways, to use them to provide access to the source or subject matter of the document.
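The distinction can be sketched as follows, again with invented element names: the structural element carries the document's logical shape, while the nominal elements tag named entities for indexing and access:

```sgml
<!-- The para element is structural; persname, corpname, and geogname -->
<!-- are nominal. A processor can present the paragraph in any medium -->
<!-- while indexing the nominal elements to provide name and subject  -->
<!-- access to the document.                                          -->
<para>The papers were assembled by
<persname>Hubert Howe Bancroft</persname> and acquired by
<corpname>The Bancroft Library</corpname>, documenting the early
history of <geogname>California</geogname>.</para>
```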

Referential markup, as its name suggests, refers to information that is not present. It is markup in the third person, so to speak. There are different kinds of referential markup and different ways one might use it, but I would like to focus on the kind of referential markup that enables something about which most of you have heard, and with which perhaps many of you have some experience, namely, hypertext and hypermedia. In addition to supporting text, SGML also makes provision for using text to refer to other text, and to refer to other kinds of digital information derived from the full array of native formats: photographs (color as well as black and white); sound motion pictures; drawings; paintings; audio recordings; three-dimensional objects of all kinds, shapes, and sizes; maps; manuscripts; typescripts; printed pages; mathematical data; financial data; diagrams; musical notation; choreographic notation; and anything else open to being digitally captured and rendered in some useful form. It is possible not only to refer to or point at this other digital information from within SGML-based documents, but also to control the notation information needed to launch the devices necessary for rendering the various objects into humanly intelligible forms. It is thus possible to use electronic text to control and manage extra-SGML information objects of all kinds, as well as to provide access to and navigation through them.
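In SGML terms, pointing at non-SGML data is done through external entities and notations; the following hypothetical fragment (all identifiers and element names invented) sketches the mechanism:

```sgml
<!-- Referential markup via an external entity. The NOTATION names    -->
<!-- the data format, the entity points at the external file, and an  -->
<!-- entity-valued attribute refers to it from within the text. An    -->
<!-- application uses the notation to launch a suitable viewer.       -->
<!NOTATION tiff SYSTEM "TIFF 6.0 raster image">
<!ENTITY fig01 SYSTEM "waterfront-01.tif" NDATA tiff>
...
<figure entityref="fig01">
<caption>San Francisco waterfront, ca. 1851</caption>
</figure>
```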

While the markup system chosen for finding aids is very important, more important is the intellectual content of finding aids. The markup, after all, exists to serve the content, to enable us to use software to exploit it fully for maximum utility, and to wrap it in a form that makes it portable and durable. A descriptive markup system presupposes a document comprised of a particular content and logical structure to mark up; that is, it presupposes a content standard.

Finding Aid Content Standard

No formal content standard defining the logical elements and their relations exists for finding aids. In order to develop a prototype encoding scheme, Project staff postulated that there was an implicit content standard embodied in the collective practice of a professional community that has formally and informally shared ideas about the proper method of processing, describing, and controlling collections. Project staff further postulated that this implicit standard could be discovered and rendered explicit through an analysis of representative finding aids. They have discovered over the course of analyzing and encoding over 5000 pages of finding aids from more than twenty repositories that archivists appear by and large to agree on the content of finding aids. Most differences have to do with the order in which information is presented and with presentation formatting, and not with the content as such.

Several factors appear to account for this implicit community agreement. First, the aforementioned informal and formal sharing of ideas about archival practice has served to normalize practice to some extent. Second, over the past ten years, archivists have shared a uniform practice for cataloging collections using the finding aids as the information source for the cataloging. The catalog record and the rules governing its content assume that specific kinds of information will be found in finding aids.

There is another regularizing factor specific to and resulting from the methodology of the Berkeley Finding Aid Project: the impact of publicly sharing finding aids. When the project staff requested representative finding aids from various institutions to be used in the development of the content data model and in the prototype database, they were enthusiastically assured that samples would be forthcoming. Despite this enthusiastic response, however, project staff found itself waiting while only a few of the promised finding aids arrived. Our growing perplexity at this disparity between the apparent interest in the project, on the one hand, and the failure to deliver the needed finding aids, on the other, was cleared up by further inquiries, which led to the discovery that archivists were reluctant to expose their finding aids to the scrutiny of their peers, fearing that their practices might be criticized or, worse yet, ridiculed. The result was a case of the archival equivalent of stage fright. Eventually contributors carefully culled out those finding aids that they thought were idiosyncratic or peculiar. The resulting pool of finding aids used by the project for developing the model thus was not truly representative of the entire range of existing finding aids, from the ridiculous to the sublime. Instead it represented what the archivists self-consciously thought would stand up to the scrutiny of their peers. And the obvious lesson to be learned from this is that 'going public' will naturally cause archivists to accommodate their practice to what they take to be the standard accepted by the community.

Where are we now, and where do we go from here?

The Berkeley Finding Aid Project, a two-year undertaking, is entering its final six months. In the last eighteen months, the project has accomplished a great deal. In the first six months of the Project, research staff solicited representative finding aids from the archive and library community, and analyzed and developed a data model of archive and library finding aids. In the second six months of the Project, staff installed and were trained on SGML authoring, editing, and online publishing software; developed the first iteration of a prototype encoding scheme in the form of an SGML DTD; and began testing and refining it. The third quarter of the Project has been devoted to building the prototype database of finding aids encoded according to the scheme. The database currently contains over 5000 pages of finding aids from over twenty repositories, including finding aids from the National Library of Australia, the Library of Congress, and the National Archives. Refinement and development of the data model and the DTD continue as more and more finding aids are analyzed and encoded. The final six months of the Project will be devoted to studying staff and patron response to the database, especially response to its utility.

The object of the Berkeley Finding Aid Project was to demonstrate the desirability and feasibility of an encoding standard for finding aids by creating and implementing a prototype. The library at Berkeley has consulted with a broad cross section of the library and archival community, and has especially sought guidance and input from experts in archival processing and cataloging. An ever-increasing number of archives and libraries, representing a wide cross section of the community, has asked to experiment with the Finding Aid DTD. This cross section of the archival community has indicated to us that we have accomplished our limited goal of demonstrating the desirability and feasibility of a standard. We now hope that the experimental database being created will provide the community the experience it needs to understand what is possible in the digital environment, and, building on this understanding, to engage in an informed debate on just exactly what kind of standard it wants.

While there is an implicit de facto standard for finding aids shared by the archival community, there is still a need for a public, community-wide discussion of the nature and function of finding aids, and, if a rough consensus can be achieved, the formal adoption of a finding aid content standard. There is enough inconsistency in the logical order of finding aids, and in the relations of the various elements to one another, that it was necessary to make the first iteration of the Finding Aid DTD far more complicated and amorphous than it would need to be if there were more structural consistency and agreement on logical order. It therefore will be necessary for the community to agree upon how the DTD might be simplified by imposing more uniform practices. From the vantage point of this stage of the Project, it is also clear that several interesting and important development problems loom on the electronic finding aid design horizon. Particularly difficult is the challenge of deciding how the community will deal with the wide variety of lists found in finding aids. Lists by their very nature are complicated structures. Those created by archivists are even more so. The problems are not only structural, but have to do with content, specifically the descriptive elements used to describe folder contents and individual items representing various genres and formats. The challenge presented by lists is made even more difficult by the emergence in the last ten years of two very different methods of creating finding aids, one using word processing software and the other using database software. This is not a simple matter of different means to the same end. This bifurcation in current practice goes to the heart of the tension, inherent in the nature of the finding aid itself, between the finding aid as a document and the finding aid as a database.
How the community chooses to deal with this issue when it designs its finding aid standard will have profound implications for the future of archival access and control. It is imperative that these and a number of other important design issues be thoroughly studied, fully discussed, widely understood, and finally resolved by the community as a whole.
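To illustrate the structural problem that lists present, the following is a hypothetical encoding of a container-list fragment; the element names are invented and are not drawn from the Project's DTD:

```sgml
<!-- Hypothetical container-list fragment. Each entry mixes physical  -->
<!-- location (box, folder) with description, and entries group under -->
<!-- series headings -- the mixture of document-like and database-    -->
<!-- like structure that makes lists hard to model.                   -->
<series>
<heading>Series 1: Correspondence, 1848-1890</heading>
<entry><box>1</box><folder>1</folder>
<desc>Letters to family, 1848-1855</desc></entry>
<entry><box>1</box><folder>2</folder>
<desc>Letters to business associates, 1856-1890</desc></entry>
</series>
```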