The Berkeley Finding Aid Project

Standards in Navigation

Daniel V. Pitti
Advanced Technologies Projects Librarian
The Library
University of California, Berkeley

("The Berkeley Finding Aid Project: Standards in Navigation," paper presented at the Association of Research Libraries/Association of American University Presses 4th Symposium on Electronic Publishing on the Network, November 1994, Washington, D.C. Paper published in Filling the Pipeline and Paying the Piper (Washington, D.C.: Association of Research Libraries, 1995), pp. 161-166)

Abstract:: The archival community should develop and embrace an encoding standard for archive, museum, and library finding aids. Such a standard would ensure that Internet communication of finding aid data is effective, and that the data endures independently of the computer hardware and software used to create and use it. This is the underlying premise of the Berkeley Finding Aid Project. Some have attempted to encode finding aids in the MARC format, but with mixed results. Standard Generalized Markup Language (SGML) is recommended as a better vehicle, as it has the flexibility to handle the complex hierarchical structure of finding aids and to capture the individuality of unique collections. SGML can also facilitate access to digital surrogates of items in our collections. The availability on the Internet of fully functional collection information and item surrogates promises to dramatically alter access and preservation.

About the author:: Daniel V. Pitti is Librarian for Advanced Technologies Projects in The Library at the University of California, Berkeley. This paper was written at the beginning of a Department of Education Title IIA funded project in September 1993, and presented at the Meeting of the Society of American Archivists in New Orleans. It was revised a year later and presented at the Scholarly Publishing on the Electronic Networks symposium sponsored by the Association of Research Libraries and the Association of American University Presses in Washington, D.C. The author is grateful for the assistance of Tim Hoyer and Jack von Euw of The Bancroft Library, Berkeley, California, and Jackie Dooley of The Getty Center for the History of Art and the Humanities, Santa Monica, California.

Introduction and Project Overview.

The advent and rapid growth of the Internet is transforming scholarly communication worldwide. To take full advantage of the opportunities this development offers, the archive, museum, and library communities will need to develop and embrace standards to ensure that the communication is both useful and enduring. An encoding standard for finding aids is one such standard.

Finding aids are documents used to describe, control, and provide access to collections of related materials in libraries, archives, and museums. The materials in these collections are the natural by-products of the activities of individuals, families, and organizations. Many different kinds of materials are represented in these collections: manuscripts, correspondence, legal records and papers, photographs, tape recordings, video recordings, and more.

The Berkeley Finding Aid Project is a collaborative effort to test the feasibility and desirability of developing an encoding standard for archive, museum, and library finding aids. The Project is funded in part by a grant from the United States Department of Education Title IIA program. The Project began in October 1993, and will be completed in September 1995. The Project involves two interrelated activities. The first of these was creating a prototype encoding standard for finding aids. This prototype standard is in the form of a Standard Generalized Markup Language (ISO 8879) Document Type Definition (SGML DTD). Researchers at the University of California, Berkeley have developed the encoding standard in collaboration with leading experts in collection processing, collection cataloging, text encoding, system design, network communication, authority control, text retrieval, text navigation, and computer imaging. Project participants have analyzed the structure and function of representative finding aids. The basic elements occurring in finding aids have been isolated and their logical interrelationships defined. The DTD has been developed based on the results of this analysis.

Building a prototype database of finding aids is the second activity of the Project. Available hardware and software have been evaluated. We have selected ArborText's AdeptEditor as the SGML-based authoring and editing software, and Electronic Book Technologies' DynaText for providing networked access to the database. Encoding the finding aids and building the database are giving the developers of the encoding scheme practical application experience with which to refine the DTD. The network-accessible database will provide, in a later phase of the Project, a means for public and staff end users to evaluate the utility and desirability of encoded finding aids. These end users will then be able to offer new ideas and suggestions to the developers of the encoding standard.

The success of any standard depends upon broad participation in its development and, after it is developed, widespread recognition of its utility. Standards are the products of communities, not individuals working in splendid isolation. As the Project has unfolded, we have kept the community aware of its progress, and, in the near future, will solicit critical evaluations.

Before proceeding with a detailed description of the Berkeley Finding Aid Project, I would like to comment briefly on the archival community's interest in standards.

Finding Aid Standards and the Archive and Museum Communities.

For a variety of reasons the archive and museum communities have not been motivated to develop standards to govern the intellectual content and structure of finding aids. The economic benefits of sharing cataloging that motivated the catalogers of published materials are not available to the archive and museum community because items in their collections are mostly unique. Another possible motive for normalizing finding aid content and structure would be to make them more familiar and thus more immediately intelligible to the uninitiated. Charles Jewett, one of the earliest advocates of standardizing descriptive cataloging, addressed this issue of intelligibility in his Smithsonian Report of 1853. He said of cataloging rules:

: Now, even if the one [system] adopted were that of the worst of our catalogues, if it were strictly followed in all alike, their uniformity would render catalogues, thus made, far more useful than the present chaos of irregularities.1

Increasing mutual intelligibility, though, has not in itself been a sufficient impetus to overcome the countervailing tendency to fashion the content, structure, and aesthetics to fit the local institution and users, and the characteristics of the collection being described.

With the emergence of the Internet, we must confront the need for greater intelligibility of information that can be readily shared and interpreted by the world intellectual community. In this context, the importance of mutually comprehensible finding aids and the benefits to be derived from them take on new urgency. The Internet has the potential to provide immediate access to information about our collections and even to computer surrogates of items in those collections. And this access can be available anywhere at any time. For those institutions choosing to take advantage of the power of the Internet, users will no longer be exclusively the familiar faces of the dedicated researchers. In order to provide the access described, the archive and museum communities must develop or embrace an array of standards. The means are there to communicate with one another, but to make this communication mutually understandable, we will need to work together.

In order for archivists and curators to communicate with users and one another in the network environment, our computers must be able to communicate with one another. We need to be able to move information freely from computer to computer over the network. An encoding standard will enable institutions to create finding aids that are independent of proprietary hardware and software. This will allow the data to be freely interchanged across platforms and applications, and still be instantly usable without time-consuming modification or adjustment. Such interoperability will support contemporaneous interchange of collection data, and it will ensure that our investment in the data will survive over time.

Finding aids created using proprietary word processing software such as WordPerfect and Microsoft Word, or database software such as dBase, will remain usable only if they are reformatted each time the authoring software is updated. Of course, for this to work properly, software manufacturers must provide translation programs that migrate the data with complete fidelity. This is not always the case, and so information can be lost or garbled. As finding aid collections grow, such reformatting through successive software versions will become more burdensome. Even if the institution can survive the ordeal of eternally updating software, another concern has to be the durability of the software firm itself, and the durability of its interest in the software. If the firm goes out of business, or if it no longer finds the product profitable, an archive can find itself with a database of finding aids stranded in time. If finding aids are encoded using a standard, then their survival is not contingent on a particular hardware or software configuration. An encoding standard would guarantee that the machine-readable finding aids created today would be usable tomorrow!

This and other benefits would result from making finding aids interchangeable through a standard encoding scheme. It would be feasible for users and institutions to have fully functional access to finding aids in real time over the network. Remote access to finding aids would enable researchers to make direct use of collection information without the expensive mediation of reference staff, and mutual access to current collection information would have a major, positive impact on the management of archive and manuscript repositories. Direct access would give researchers more autonomy and control over their research, and it would facilitate inter-institutional cooperation in collection development and preservation where knowledge of the holdings of other institutions can help curators make difficult decisions about how to spend scarce dollars developing and preserving their own collections. The list of institutional benefits derived from the use of a standard, in fact, exceeds the simple but profound benefits that individual scholars would enjoy. It would also allow inter-institutional cooperation in the description of and access to dispersed collections and to independent but related collections. Perhaps even more important in the short term, a standard would make it feasible for collection-holding institutions, library-oriented vendors, and supporting federal and state agencies to develop the human and material resources needed to convert existing paper finding aids into machine-readable form. Without a standard for finding aids, it is difficult to convince agencies to fund conversion because there is no benchmark by which they can evaluate how the money is spent.

MARC versus SGML.

In order for an encoding standard to provide the infrastructure to support a full array of access, control, navigation, and print functions and uses, it must be well designed and constructed. It is the quality of the standard as an information infrastructure that will enable maximum exploitation of the encoded information.

In the early stage of developing this project, we considered using the MARC format as the basis for the standard. We did this because MARC was familiar, and because we had heard that many institutions were attempting to use it for encoding finding aids. We quickly decided that it was not the best available scheme. We had three principal reasons for making this decision.

First, we found MARC inadequate because records are limited to a maximum length of just under 100,000 characters (the record length is carried in a five-digit field, so the ceiling is 99,999). This represents approximately 30 pages of 8-1/2 x 11 inch, 10 pitch, unformatted text stored in ASCII. Since many finding aids are longer than this, the size restriction is a prohibitive obstacle.
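The page estimate can be checked with a little arithmetic. The following is only a rough sketch: the 99,999 ceiling follows from the five-digit record length field in the MARC leader, but the line spacing and margins are assumptions chosen for illustration, and the names are hypothetical.

```python
# Rough check of the page estimate. Only the five-digit record length is a
# property of MARC; the page geometry below is assumed for illustration.
MAX_RECORD_LENGTH = 99_999   # largest value a five-digit length field can hold

CHARS_PER_INCH = 10          # "10 pitch"
LINES_PER_INCH = 6           # assumed typewriter-style line spacing
PAGE_WIDTH_IN = 8.5
PAGE_HEIGHT_IN = 11.0
MARGIN_IN = 1.0              # assumed margin on every side


def chars_per_page() -> int:
    """Characters that fit on one page under the assumptions above."""
    chars_per_line = int((PAGE_WIDTH_IN - 2 * MARGIN_IN) * CHARS_PER_INCH)
    lines_per_page = int((PAGE_HEIGHT_IN - 2 * MARGIN_IN) * LINES_PER_INCH)
    return chars_per_line * lines_per_page


def pages_in_record(record_length: int = MAX_RECORD_LENGTH) -> float:
    """Pages of plain text that a record of the given length could hold."""
    return record_length / chars_per_page()


if __name__ == "__main__":
    print(chars_per_page())              # characters per page
    print(round(pages_in_record(), 1))   # pages in a maximum-length record
```

Under these assumptions a full page holds 3,510 characters, so a maximum-length record carries a little under 30 pages. The usable text is somewhat less in practice, since the leader and directory count toward the record length.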

Our second difficulty with MARC is its limited accommodation of hierarchically structured information. Since finding aids are hierarchically structured documents, the flat structure of MARC makes it unsatisfactory. As archivists are painfully aware, MARC was primarily designed to capture description and access information applying to a discrete bibliographic item. Describing and providing access to complex collections through descending levels of analysis quickly overburdens the MARC structure. At most, a second level of analysis can be accommodated, but the kind of information so supplied is limited. One possible way around this problem is to employ multiple, hierarchically interrelated and interlinked records at varying levels of analysis: collection-level, subunit, and item. The use of multiple records, though, introduces extremely difficult inter- and intra-system control problems that have never been adequately addressed in the format or by MARC based software developers. Even if the control issues were adequately addressed in the format, the control required to make multiple record expression of hierarchy succeed would entail prohibitive human maintenance.
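The difference between a nested description and a set of interlinked flat records can be sketched in a few lines. This is a minimal illustration, not a model of MARC or of any existing system: the `Component` class, the level names, and the sample collection are all invented for the example.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List, Optional


@dataclass
class Component:
    """One level of description; the level names are illustrative only."""
    level: str                      # e.g. "collection", "series", "folder", "item"
    title: str
    children: List["Component"] = field(default_factory=list)


def flatten(root: Component) -> List[dict]:
    """Express the hierarchy as separate records joined by explicit links."""
    ids = count(1)
    records: List[dict] = []

    def visit(node: Component, parent_id: Optional[int]) -> None:
        record_id = next(ids)
        records.append({"id": record_id, "parent": parent_id,
                        "level": node.level, "title": node.title})
        for child in node.children:
            visit(child, record_id)

    visit(root, None)
    return records


# An invented collection, described as a single nested document.
papers = Component("collection", "Example family papers", [
    Component("series", "Correspondence", [
        Component("folder", "Letters, 1890-1895", [
            Component("item", "Letter to a publisher"),
        ]),
    ]),
    Component("series", "Photographs"),
])

linked_records = flatten(papers)

if __name__ == "__main__":
    for record in linked_records:
        print(record)
```

In the nested form the hierarchy is carried by the document itself. In the flattened form every record depends on its `parent` link, and each link must be kept correct whenever a record is added, moved, or exchanged between systems, which is the control burden described above.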

The third reason for not using MARC for finding aids involves the marketplace. It is a gross understatement to say that libraries, archives, and museums are generally not resource rich institutions. To put it into perspective, the price of one B-2 bomber would fund the Library of Congress for over three years. Lacking large amounts of capital, the MARC user community has been incapable of driving state-of-the-art hardware and software development.

Standard Generalized Markup Language or SGML is a promising framework or model for developing an encoding scheme for finding aids for a number of reasons. First, it has none of the problems associated with MARC mentioned above. SGML will accommodate hierarchically interrelated information at as many levels as needed. Furthermore, there are no inherent size restrictions on SGML-based documents. SGML is an international standard embraced by an ever growing list of government, educational, research, and industrial institutions. Through the Computer-aided Acquisition and Logistic Support or CALS initiative, the US Department of Defense has mandated that all contractors doing business with DOD must use four specific standards in communications. One of the four is SGML. Because the DOD budget is immense even if shrinking, this mandate has motivated intensive development of SGML-oriented software. Besides not sharing the weaknesses of MARC, SGML has a host of other features recommending it to us.

SGML is not itself a text encoding standard, but a standard for uniformly developing standards for definable kinds or classes of documents. In SGML, a standard for a particular kind or class of document is called a Document Type Definition or DTD. Each community that uses and shares a particular type of document must assume responsibility for developing an encoding scheme specific to the type.
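What it means for a DTD to define a class of documents can be suggested with a toy content model. The sketch below is written in Python rather than SGML, and the element names are invented; they are not the tags of the Berkeley prototype. A real DTD also governs the order and number of elements, which this sketch ignores.

```python
# Hypothetical content model: which child elements each element may contain.
# The element names are invented and are not those of the prototype DTD.
CONTENT_MODEL = {
    "findaid": {"title", "scope", "series"},
    "series": {"title", "series", "folder"},
    "folder": {"title", "item"},
    "item": {"title"},
    "title": set(),
    "scope": set(),
}


def validate(element: tuple, path: str = "") -> list:
    """Return a list of problems; an empty list means the tree conforms.

    An element is a (name, children) pair, where children is a list of
    elements. Text content is left out to keep the sketch short.
    """
    name, children = element
    here = f"{path}/{name}"
    if name not in CONTENT_MODEL:
        return [f"{here}: unknown element"]
    problems = []
    for child in children:
        if child[0] not in CONTENT_MODEL[name]:
            problems.append(f"{here}: <{child[0]}> not allowed here")
        else:
            problems.extend(validate(child, here))
    return problems


good = ("findaid", [
    ("title", []),
    ("series", [("title", []), ("folder", [("item", [("title", [])])])]),
])
bad = ("findaid", [("item", [])])   # an item may not sit directly in a finding aid

if __name__ == "__main__":
    print(validate(good))
    print(validate(bad))
```

Because the rules live in one shared definition rather than in each application, any program that reads the definition can tell whether a given finding aid conforms to it.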

SGML is concerned with designating the logical elements that serve as the building blocks of documents and the interrelationships of these elements. The prototype SGML Document Type Definition for finding aids developed at Berkeley specifies what logical elements can be present, and, with varying degrees of specificity, how the elements interrelate. In this regard, it is similar to MARC and dissimilar to word processing applications. MARC captures description and access information. It does not specify how to index, display, or print the information. Markup that provides output specifications such as print or display formatting is called procedural markup. It is dedicated to a single end-use or application. Markup designating the elements constituting a kind of document is called descriptive markup. Both SGML and MARC are descriptive markup languages. Text encoded using descriptive markup is frequently called "structured text," while ASCII text is called "plain text."
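The practical consequence of descriptive markup is that one encoded source can serve several applications. The following sketch assumes a record reduced to a list of tagged pieces of text, with invented tag names and sample data; the two functions stand in for a display application and an indexing application.

```python
# One descriptively tagged source, two independent uses of it.
# Tag names and sample text are invented for illustration.
RECORD = [
    ("title", "Example family papers"),
    ("dates", "1890-1920"),
    ("name", "Example, Jane"),
    ("scope", "Correspondence and photographs."),
]


def render_display(record: list) -> str:
    """A display application decides the formatting, not the markup."""
    styles = {"title": str.upper, "name": lambda s: f"Creator: {s}"}
    return "\n".join(styles.get(tag, str)(text) for tag, text in record)


def build_index(record: list, indexed=("title", "name")) -> dict:
    """An indexing application picks out elements by name."""
    index: dict = {}
    for tag, text in record:
        if tag in indexed:
            index.setdefault(tag, []).append(text)
    return index


if __name__ == "__main__":
    print(render_display(RECORD))
    print(build_index(RECORD))
```

Had the source recorded only that the first line was to be printed in capitals, the indexing function would have had nothing to work from. Procedural markup commits the text to one use; descriptive markup leaves the choice to each application.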

Descriptive markup based on SGML enables maximum flexibility in the use of the text. Indexing, and display and print formatting can all be precisely controlled by the user of the text. SGML structured text facilitates sophisticated database and document indexing and searching such as document and document component specific keyword Boolean, word adjacency, word proximity, relevance ranking, and relevance feedback. Such "smart" indexing eases much of the inherent tension between recall and relevance that plagues most large database searching systems. SGML structured text also supports advanced online text navigation. It is possible to juxtapose a dynamically generated table of contents and accompanying text to provide context to enhance reading comprehension and provide random, informed access to text. Structured SGML-based text facilitates complete flexibility in automated production of printed finding aids and related print products essential to processing, control, curatorial functions, and donor relationships. The flexibility of SGML-based structured text renders it unquestionably superior to hierarchically flat MARC structured text, plain text, and procedurally marked text.
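Two of the search features named above, component-specific keyword searching and word proximity, can be sketched briefly. This is an illustration only, not a description of DynaText or any other product; the function names, element names, and sample text are assumptions.

```python
import re

# Hypothetical encoded components: (element name, text). Sample data is invented.
COMPONENTS = [
    ("scope", "Letters concerning the gold rush and early mining camps."),
    ("folder", "Mining claims, 1851"),
    ("folder", "Photographs of San Francisco"),
]


def words(text):
    """Lower-cased word tokens, with punctuation stripped."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(components, term, within=None):
    """Keyword search, optionally restricted to one kind of element."""
    term = term.lower()
    return [text for tag, text in components
            if (within is None or tag == within) and term in words(text)]


def near(text, first, second, max_gap=3):
    """True if the two words occur within max_gap word positions of each other."""
    tokens = words(text)
    first_at = [i for i, w in enumerate(tokens) if w == first.lower()]
    second_at = [i for i, w in enumerate(tokens) if w == second.lower()]
    return any(abs(i - j) <= max_gap for i in first_at for j in second_at)


if __name__ == "__main__":
    print(search(COMPONENTS, "mining"))                   # any element
    print(search(COMPONENTS, "mining", within="folder"))  # folder titles only
    print(near(COMPONENTS[0][1], "gold", "rush"))
```

Restricting a search to a named element narrows the results without any change to the text itself, which is the sense in which structure eases the tension between recall and relevance.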

The Information Future and Archives.

While the Project is presently concerned only with finding aid text, it is looking to an imminent information future in which collection-level records lead to finding aids and finding aids lead to computer surrogates of primary source materials that exist in a variety of native formats: pictorial materials, graphics, three-dimensional objects, manuscripts, typescripts, printed text, sound recordings, motion pictures, and so on. The intersections between these various forms of information will be traversed by the click of a mouse or by entering a simple command. Information interlinked in this manner is called hypermedia.

It is critical, I believe, that we move toward providing network access to computer surrogates of items in our collections as we make information about our collections more accessible through network access to finding aids. Some believe that making MARC AMC records available increases requests for use of collections. Increasing the availability of the more detailed information in finding aids will perhaps have the same effect. Such increased demand and use will justifiably raise curatorial concerns. But by making surrogates of the most used portions of our collections available, we can simultaneously increase access and limit physical handling of endangered collections. Access and preservation need not be in conflict.

The Berkeley Finding Aid Project envisions an information future in which serious scholars and the casually curious alike can easily find the cultural treasures they seek. In this future, information seekers follow clearly marked paths through catalogs to finding aids and from finding aids to a wealth of information in a multitude of computer and traditional formats ... and back. Developing a standard encoding scheme for finding aids is essential to the realization of this future.

By making primary source materials accessible anywhere, anytime, the Internet will challenge basic assumptions about the nature of archives, libraries, and museums. The Internet makes it possible to transcend the physical limits of our information environment. I would like to conclude by dwelling, for a brief moment, on the notions of physical "absence" and "presence."

In an information world dominated by physical media, the absence of the items has dictated that we create synoptic surrogates to represent them. These include both catalog records and finding aids. While each provides access mechanisms, it is the descriptive component that represents the absent item. Using the description, we are, in principle, able to identify the remote object, and decide whether it is what we want. The Internet, by rendering the absent present, will no doubt significantly alter the descriptive role of catalog records and finding aids.

Archives, museums, and libraries are situated in specific geographic locations, and generally serve the needs of the surrounding community. They are very present for the community in which they reside. The Bancroft Library at the University of California, for example, is located on the campus of the University, in the city of Berkeley, in the state of California in the Far West of the United States. The Bancroft Library, its collections, its staff, its mission, all reflect this presence in the community. But if the most sought after collections of The Bancroft Library are available on the Internet, anytime and anywhere, it becomes in a certain sense universally present, present in the World Community.

Universality requires effective communication and effective communication requires standards. Standardization necessarily creates tension with the particularity of individual collections and repositories. Developing a standard that enables effective communication and that also allows curators and collection managers to adequately represent the individuality of the collection and institution is the challenge before us.

For scholars, making the absent present over the Internet offers new possibilities of scholarly communication. Scholars inclined to make their interpretations of source materials available in a standards-based, machine-readable form on the Internet will be able to make both the source materials and their interpretations of them simultaneously available to their peers and students. And it will not matter where they or the materials physically reside. Furthermore, it will be feasible to link the finding aids that lead to the source materials directly to interpretations of them. Over time, arrays of competing interpretations will accumulate around source materials, and provide alternative forms of access to and views of them.

The prospect of bringing research materials to the scholar over the Internet in lieu of the scholar having to travel to them reveals yet another possibility. Agencies that fund research involving primary source materials should finance digitizing rather than traveling. A collection, once digitized, would be available to all scholars for all time.

Footnotes

1 Charles C. Jewett, Smithsonian Report on the Construction of Catalogues of Libraries and Their Publication by Means of Separate, Stereotyped Titles (Washington, D.C.: Smithsonian Institution, 1853), 9.