[Archive copy mirrored from: http://sunsite.berkeley.edu/amher/proj.html. See the document on the home site if possible.]

AMHER Virtual Archive Project

Project Manager:

Tim Hoyer
The Bancroft Library
University of California, Berkeley
Berkeley, CA 94720
Phone: (510) 643-3202
Email: thoyer@library.berkeley.edu

Institutional Participants:

  • Duke University
  • Stanford University
  • The University of California, Berkeley
  • The University of Virginia

Summary of Project

The American Heritage Virtual Archive Project will investigate one of the most serious problems facing knowledge seekers everywhere: the geographic distribution of both collections of primary source material and the written guides describing and providing access to them. We propose to solve this problem by creating a prototype "virtual archive" that integrates, into a single source, hundreds of archival finding aids describing and providing access to a large body of primary source materials from collections documenting American culture and history held by four major academic research libraries. The project will demonstrate the feasibility of providing both scholars and average American citizens with user-friendly, universal Internet access to the research collections of the world.

The design of the proposed "virtual archive" is informed by the U.S. Department of Education Title IIA-funded Berkeley Finding Aid Project and its follow-up project, the NEH-funded California Heritage Digital Image Access Project. The American Heritage Virtual Archive Project extends these projects by combining finding aids from several different institutions into a single database, which will provide the scholar or member of the general public with seamless, user-friendly access to information about physically distributed archives. In this project, finding aids, created and maintained locally by the repository holding the collections, will be contributed via the network to the union database. The union database will allow the user to search a bibliographic catalog, display a collection-level record, and, from within the bibliographic record, click on a user interface "button" that will launch a browser used to navigate the collection's finding aid.
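
The search-then-navigate sequence described above can be illustrated with a minimal Python sketch. It is not the project's software: the class names, the keyword matching, the repository names, and the example URLs are all assumptions introduced for illustration, and a plain data class stands in for a USMARC collection-level record.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CollectionRecord:
    """Collection-level catalog record (a stand-in for a USMARC record)."""
    repository: str        # institution holding the physical collection
    title: str
    summary: str
    finding_aid_url: str   # hypothetical pointer to the locally maintained finding aid


class UnionCatalog:
    """Toy union database: each repository contributes its own records."""

    def __init__(self) -> None:
        self._records: List[CollectionRecord] = []

    def contribute(self, record: CollectionRecord) -> None:
        """Add a record supplied by a participating repository."""
        self._records.append(record)

    def search(self, term: str) -> List[CollectionRecord]:
        """Case-insensitive keyword match against title and summary."""
        needle = term.lower()
        return [r for r in self._records
                if needle in r.title.lower() or needle in r.summary.lower()]


def open_finding_aid(record: CollectionRecord) -> str:
    """Return what the interface "button" would hand to a finding aid browser."""
    return record.finding_aid_url


catalog = UnionCatalog()
catalog.contribute(CollectionRecord(
    repository="Repository A",
    title="Example family papers",
    summary="Letters and diaries, 1850-1900.",
    finding_aid_url="http://repository-a.example/findaids/family-papers.sgm",
))
catalog.contribute(CollectionRecord(
    repository="Repository B",
    title="Example company records",
    summary="Ledgers and letters, 1870-1920.",
    finding_aid_url="http://repository-b.example/findaids/company-records.sgm",
))

# One search spans records contributed by both repositories.
hits = catalog.search("letters")
```

In this sketch a single query returns records from both hypothetical repositories, and each record carries the pointer needed to open the finding aid that its home repository maintains.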

The American Heritage Virtual Archive Project extends the research of the earlier projects upon which it is based beyond the confines of the single institution to the new world of integrated access to many, distributed collections. To make this possible, the project's participants have agreed to adhere strictly to standards. The collection-level records use the USMARC format. The finding aids are encoded with the Standard Generalized Markup Language Document Type Definition (SGML DTD) developed in the Berkeley Finding Aid Project. The pointing information uses Uniform Resource Name (URN) and Uniform Resource Locator (URL). And the client that searches the bibliographic catalog is ANSI/NISO Z39.50 compliant.

The primary goal of this project will be the development of a demonstration system, which will serve as a test bed to evaluate both the effectiveness of the prototype's "virtual archive" in providing access to distributed digital library resources, and the feasibility of the decentralized "real world" production methods that the project will use to create it. To achieve the goal the project will pursue three major objectives:

  • To develop effective mechanisms to link and to integrate related collections contributed by different institutions so that they can be navigated as if they were part of a single, virtual archive;

  • To develop policies and procedures for decentralized creation and maintenance of the collection-level catalog records and finding aids in order to assure that this metadata can be integrated into the virtual archive;

  • To explore intellectual and technical access, description, and control issues that arise when finding aids representing collections with related subject material from different institutions are combined and used in the same database.

This project will make an ambitious start on the road to using network resources to create a truly comprehensive digital archive in the field of American history. A rigorous, analytic evaluation will be included as part of the project.

Table of Contents

I. The Need for the Project

II. The Expected Results and Dissemination

III. The Project's Plan of Work

IV. The Background of the Applicant

V. The Project's Staff

I. Need for the Project

The American Heritage Virtual Archive Project seeks to demonstrate the feasibility of overcoming a longstanding, serious obstacle facing those who seek access to the primary source materials held in the nation's archives and manuscript repositories: the geographic distribution of the collections themselves as well as of the full-text analytic guides, or finding aids, that describe these materials and enable scholars to identify and locate them for use. This obstacle has become significantly less daunting over the course of the last ten years because of the development of the USMARC Archive and Manuscript Control (AMC) format and its use to create hundreds of thousands of collection-level summary descriptions in the nation's bibliographic utilities. The finding aids, however, of which the collection-level records are but a summary, and which are the primary mechanism for describing, controlling, and providing access to our cultural and historical heritage, have remained largely in print form, accessible only onsite at the repository or, upon request, by mail. As positive a development as the AMC format has been, it has only enabled delivery of a surrogate of a surrogate in the network environment. Because of USMARC limitations, collection-level bibliographic records provide a brief and therefore limited description of collections. Much of what is in any given collection is represented not in the bibliographic record but in the full-text inventories and registers on which the record is based and from which it is derived. The next logical step in helping scholars locate and identify primary source materials is the creation of an online union database of the nation's archive and manuscript finding aids, accessible over the Internet.

The need to take this step has been widely recognized. Participants in the recent Research Libraries Group (RLG) forum on primary source materials recommended that RLG develop such a finding aid database for the nation's libraries and archives, and a similar recommendation was overwhelmingly supported at a SOLINET-sponsored meeting of archivists. Both groups clearly expressed the urgent need to have the problems associated with the creation of such a union database systematically explored and solved in a manner that will not only provide the nation's libraries and archives with a union database but also enable its archivists and librarians to maintain and use it in the collaborative spirit that such an enterprise calls for.

Building on the work of the Berkeley Finding Aid Project and the California Heritage Digital Image Access Project, the American Heritage Virtual Archive Project proposes to demonstrate the feasibility and desirability of such a "virtual archive" by building a large union database of hundreds of finding aids, comprising thousands of pages of text and lists, that document nineteenth- and early twentieth-century American culture and history. The prototype "virtual archive" created in this project will demonstrate, for the first time, the technical feasibility of providing unified intellectual access to physically distributed collections through the creation of a centralized database of metadata (i.e., catalog records and finding aids) that indexes and describes the actual geographically distributed collections. This virtual archive will also provide researchers with vastly improved bibliographic and physical access to significant collections of documents that bear witness to the nation's past.

While building on Berkeley's previous research, the American Heritage Virtual Archive Project takes perhaps the single most important step forward yet attempted in efforts to provide network access to primary source materials. It offers a solution to the problem of providing integrated access on the Internet to distributed data by creating a "virtual archive." This approach offers four important benefits:

  • A single place for researchers to search for detailed information about archival primary source materials, regardless of where the originals reside;

  • The potential for integrating related collections through their finding aids, regardless of where the collections reside or the finding aids were created;

  • A standards-based, scalable technical architecture that supports the "virtual archive" model;

  • A model national training program for finding aid encoding that can be proposed to the Society of American Archivists Curriculum Development Committee.

1. Brief History of the Universal Access Problem

One of the most vexing problems confronting those in search of information is the variety and number of places that they must look to discover whether what they want exists, and, if so, where it is located and how they can gain access to it. It has long been recognized that integrating the catalog records for distributed collections into one catalog and placing it in a central location greatly simplifies the task for the researcher, saving valuable time and greatly enhancing the chances of finding relevant materials.

In the middle of the nineteenth century, the librarian at the Smithsonian Institution, Charles Jewett, planned to use the cutting edge technology of his day to create that "cherished dream of scholars, the universal catalogue."

If the system should be successful, in this country, it may eventually be so in every country in Europe. When all shall have adopted and carried out the plan, each for itself, the aggregate of general catalogues, thus formed--few in number--will embrace the whole body of literature extant, and from them, it will be no impossible task to digest and publish a universal bibliography. How much this would promote the progress of knowledge, by showing, more distinctly, what has been attempted and accomplished, and what yet remains to be achieved, and thus indicating the path of useful effort; how much, by rebuking the rashness which rushes into authorship, ignorant of what others have written, and adding to the mass of books, without adding to the sum of knowledge ....(1)

For both technological and political reasons his plan failed. It would be several decades before another attempt was made. In 1909, the Library of Congress began a catalog card exchange arrangement with several major libraries. Herbert Putnam, then Librarian of Congress, described the plan and its purpose as follows:

The Library of Congress expects to place in each great center of research in the United States a copy of every card which it prints for its own catalogues; these will form there a statement of what the National Library contains. It hopes to receive a copy of every card printed by the New York Public Library, the Boston Library, the Harvard University Library, the John Crerar Library, and several others. These it will arrange and preserve in a card catalogue of great collections outside of Washington.(2)

These were the first tentative steps towards what would eventually become the National Union Catalog. Other libraries joined the effort. By 1926, the Library of Congress had compiled a file of nearly two million cards. In 1948 the file was officially named the National Union Catalog (NUC), and the libraries who had been only selectively reporting acquisitions were asked to report all of them.

Gathering the titles together was but the first step in creating a useful union listing. In order for it to be universally useful, it needed to be universally accessible. It would take until 1956 for the Library to develop a solution to this problem. For many years the Library of Congress had created "depository sets" of its card catalog and distributed them to over 100 libraries throughout the United States. This system was expensive to maintain for both the Library of Congress and the depository libraries, and the 100 depository sites failed to provide universal access. The Library decided to solve both of these problems by reviving the book catalog, which had not been used by most libraries for fifty years. In 1946, the Library published A Catalog of Books Represented by Library of Congress Printed Cards Issued to July 31, 1942. Ten years later, at the urging of the American Library Association, the Library of Congress applied this approach to the National Union Catalog and began issuing in book form the titles acquired by the reporting libraries. This eventually led to the publication of the over 600 volumes of The National Union Catalog, Pre-1956 Imprints, the largest single publication ever produced.(3) For the first time, the library world and the public it served had a system for building a national union catalog and making it universally available. But this union catalog, significant as it was, provided access only to published materials, and not to the nation's rich collections of primary source materials.

In 1951 the National Historical Publications and Records Commission (NHPRC; named at the time the National Historical Publications Commission) began to compile a union register of archive and manuscript collections held by the nation's repositories. The objective was to provide central, intellectual access to the nation's primary source materials. The effort initially focused on collection-level description rather than on the sub-collection or item-level descriptions. Large archive, manuscript, and museum collections have historically presented complex access and control problems. Chief among these is that limited human and material resources prohibit the expensive option of cataloging each item individually. The Bancroft Library's Sierra Club archives, consisting of nine related collections comprised of over one million items, well illustrates this basic economic problem. The traditional means of overcoming this problem is to describe a large collection of related items in one brief descriptive record. A collection of one thousand related items, for example, may be cataloged in one record rather than in one thousand. After gathering collection-level data from 1,300 repositories nationwide in the 1950's, the Commission published A Guide to Archives and Manuscripts in the United States in 1961.(4) The Commission decided to revise the directory in 1974, but after assessing the situation found that the number of repositories and records had increased so dramatically in the thirteen years that had elapsed from the publication of the first directory that compiling collection-level descriptions would be prohibitively expensive. The Commission decided to change the focus to repository-level information and therefore to provide a coarser level of access. Despite this shift in focus, the Commission continued to envision a "national collection-level data base on archives and manuscripts."(5) For a variety of reasons, the Commission abandoned the idea in 1982.

In 1951, the same year that the National Historical Publications and Records Commission began planning the Directory, the Library of Congress began to actively plan the National Union Catalog of Manuscript Collections (NUCMC).(6) NUCMC was intended to be for manuscripts and manuscript collections what the NUC was for printed works. Winston Tabb, Assistant Librarian for Collections Services, describes a major factor in the decision to develop NUCMC:

Scholars, particularly in the field of American history, were instrumental in urging the establishment of a center for locating, recording, and publicizing the holdings of manuscript collections available for research. They had long been frustrated by the difficulties of locating specific manuscripts and even of identifying repositories possibly containing primary-source materials.(7)

It was not until late in 1958 that the Library began to implement its plans with a grant from the Council on Library Resources. In 1959, the Manuscript Division was established in the Descriptive Cataloging Division of the Library, and given responsibility for initiating and maintaining the NUCMC program. The union manuscript catalog would provide collection-level description for collections held in US repositories, and, for particularly important manuscripts, item-level descriptions. Like the NUC, the catalog would consist of catalog cards, and, again following the NUC model, a decision was made to publish the catalog in book form and offer subscriptions. The first volume of NUCMC was published in 1962, one year after the NHPRC's A Guide to Archives and Manuscripts in the United States. The Library announced in 1994 that volume 29 would be the last print publication of the NUCMC. The future of the NUCMC is currently under review.

The suspension of the NUCMC print publication should not be interpreted as questioning the value and utility of building union catalogs to our intellectual and cultural resources and making them universally available. Instead we should understand the suspension and review of the program as the logical and prudent response to asking whether the objective might be better served by using powerful networked computer technology to deliver the union list instead of print technology.

The advent of machine-readable catalog records, coupled with the emergence of nationally networked computer databases, has now provided us with the means to build union catalogs that can be made available everywhere (or at least everywhere the network is available, which is rapidly becoming everywhere). The technology greatly facilitates the compiling of union databases and also dramatically overcomes the problem of universal access. While the Library of Congress continues to compile the NUC and publish it in book form, over the course of the 1980s the OCLC and RLG databases have emerged as de facto union catalogs not only to the nation's bibliographic holdings but to a good share of the world's as well.

Eleven years ago the records in the national utilities almost exclusively represented published print materials. The primary source materials in the nation's archives and manuscript repositories were not represented. This was all to change with the work of the National Information Systems Task Force (NISTF) of the Society of American Archivists. From 1981 to 1984 NISTF paved the way, both intellectually and politically, for the development of the USMARC Archive and Manuscript Control (AMC) format.(8) The AMC format made it feasible for archives and manuscript repositories to provide brief, synoptic surrogates for collections in their care in bibliographic catalogs. The AMC format by itself, however, only specified content encoding standards. It did not provide standards for the actual content of the records themselves. Without such standards, the format was "simply an empty vessel."(9) The archives and manuscripts community found the Anglo-American Cataloging Rules, Second edition (AACR2) inadequate. The AACR2 chapter on manuscript cataloging abandoned longstanding archival descriptive principles. In response, Steven L. Hensen, then working at the Library of Congress, developed an alternative set of rules that was to complement the encoding standards. These rules, entitled in their second edition Archives, Personal Papers, and Manuscripts (APPM), coupled with the USMARC AMC format, enabled the archives and manuscripts community to contribute over 420,000 collection-level records to the Research Libraries Group's RLIN database.(10) Through the utilities, scholars now have access to a growing accumulation of brief descriptions of the nation's archive and manuscript collections.

As important and revolutionary as this accomplishment has been, however, it represents but the first limited step in making it possible for scholars to easily locate the primary source materials they need to perform their research. The generalized descriptions found in AMC records will only lead a researcher to a collection of items which may have individual items relevant to his research. In order to help the researcher determine the relevance of items in a collection, libraries and museums have provided an assortment of additional inventories, registers, indexes, and guides, which are generally called finding aids. Finding aids are hierarchically structured documents, proceeding in defined stages, from the general to the specific. At the most general level, they roughly correspond in scope to collection-level catalog records. At the most specific level, they briefly identify individual items in the collection. In between, they describe subsets or series of related items. Though possible, description of and access to individual items is frequently not provided. Decisions concerning depth of analysis are based on a variety of intellectual and economic factors.
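
The hierarchical structure just described, running from the collection through series to individual items, can be sketched as a small tree. This is an illustrative assumption rather than the project's data model: the `Component` class, its field names, and the sample titles are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Component:
    """One level of description: a collection, a series, or an item."""
    level: str
    title: str
    children: List["Component"] = field(default_factory=list)


def walk(node: Component, depth: int = 0) -> Iterator[Tuple[int, Component]]:
    """Yield components from the general to the specific (pre-order)."""
    yield depth, node
    for child in node.children:
        yield from walk(child, depth + 1)


finding_aid = Component("collection", "Example family papers", [
    Component("series", "Correspondence", [
        Component("item", "Letter to a publisher, 1885"),
        Component("item", "Letter from a relative, 1890"),
    ]),
    # A series described only in aggregate: no item-level entries.
    Component("series", "Photographs"),
])

# Indent each entry by its depth to show the stages of description.
outline = [f"{'  ' * depth}{c.level}: {c.title}" for depth, c in walk(finding_aid)]
print("\n".join(outline))
```

The second series carries no item-level children, which mirrors the point that description of individual items, though possible, is frequently not provided.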

USMARC AMC collection-level records and finding aids are intended to work together in the hierarchical archival access and navigation model. The AMC record occupies the top place in the model, and leads, through a note, to the detailed collection information in the finding aid. The finding aid, in turn, leads to the materials in the collection. The descriptive information in the collection-level record is based on and derived from the collection's finding aid. In cataloging terms, the finding aid is the "chief source of information" for cataloging the collection. Only a very small portion of the information contained in the registers and inventories finds its way into the bibliographic record. A dramatic example of the summary nature of the collection-level record is provided by the finding aid and catalog record for the National Municipal League Records, 1890-1991 (bulk 1929-1988) in the Auraria Library in Denver, Colorado. The finding aid comprises over 1,400 pages and 30,000 personal names. By comparison, the AMC record for this collection is approximately two pages long and has nine personal names as access points!

Clearly, the next logical step in the evolution of union access to primary source materials is a union database of the nation's archive and manuscript finding aids accessible over the Internet, anywhere in the world, at any time of the day or night. Such a database will give both researchers and archive and library staff access to the enormous wealth of information contained in finding aids. Moreover, development and adoption of standards for encoding and navigating finding aids, creation of inter-institutional protocols for cooperative database construction, and demonstration of the technical capability to create, in a cost-effective manner, a national union database of finding aids are necessary preconditions for creating an operational national library of digital surrogates, including digital preservation copies, of primary source materials. Fortunately, the technological and intellectual foundation for such a database has been built in earlier research projects at Berkeley.

2. The Berkeley Finding Aid Project

Recognizing that the archival finding aid rendered in a standard, platform-independent, electronic form would provide a key and currently missing layer of collection access and control in the complex environment of the Internet, researchers at Berkeley developed the Berkeley Finding Aid Project, funded by the Department of Education under Title IIA of the Higher Education Act. In the fall of 1993, they began developing a prototype standard for encoding finding aids (for a full discussion of Berkeley's research strategy see Appendix B). It was a propitious time to undertake such a project. Standard Generalized Markup Language (SGML), an international standard since 1986, had become firmly established in a wide array of government and private enterprises, and a robust and rapidly growing market of software tools to support it was emerging. Politically, the archival community, at first resistant to technology, had warmed to it as a result of its positive experience with MARC AMC, and was ready to embrace the idea of encoding and content standards for finding aids. For the first time there was a viable technological basis and receptive political climate for developing an encoding scheme for full-text archival finding aids.

An encoding scheme assumes content to be encoded, and this presented the Berkeley researchers with a fundamental problem: no finding aid content standard comparable to APPM existed or exists. Before developing a prototype encoding system, the researchers first had to develop a prototype content or data model defining the structure and content of finding aids. Working with representative finding aids contributed by the archive and library community, Berkeley researchers analyzed them, and isolated and identified their basic elements and the logical relationships of the elements. They then proceeded to develop a data or prototype content model. The encoding scheme is in the form of an SGML Document Type Definition (DTD) and is based on the content model. To test the DTD, project staff developed a prototype database to evaluate search, navigation, and display of electronic finding aids (for detailed descriptions of the project and copies of the model and DTD, see Appendix B).
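
The relationship between a content model and an encoded finding aid can be sketched in Python as follows. This is only an illustration under stated assumptions: the element names (`findaid`, `collection`, `series`, `item`, `title`) are hypothetical and are not taken from the Berkeley DTD, the sample is written as well-formed XML so that the standard library parser can read it in place of an SGML parser, and the dictionary-based check is a toy stand-in for validation against a DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical encoded finding aid (XML used here as a stand-in for SGML).
SAMPLE = """
<findaid>
  <collection>
    <title>Example family papers</title>
    <series>
      <title>Correspondence</title>
      <item>Letter to a publisher, 1885</item>
      <item>Letter from a relative, 1890</item>
    </series>
    <series>
      <title>Photographs</title>
    </series>
  </collection>
</findaid>
"""

# Toy content model: which child elements each element may contain.
CONTENT_MODEL = {
    "findaid": {"collection"},
    "collection": {"title", "series"},
    "series": {"title", "item"},
    "title": set(),
    "item": set(),
}


def violations(element):
    """Return (parent, child) pairs that the toy content model does not allow."""
    found = []
    allowed = CONTENT_MODEL.get(element.tag, set())
    for child in element:
        if child.tag not in allowed:
            found.append((element.tag, child.tag))
        found.extend(violations(child))
    return found


root = ET.fromstring(SAMPLE)
series_titles = [s.findtext("title") for s in root.iter("series")]
problems = violations(root)
```

Because the markup names the logical parts of the document, software can both check a contributed finding aid against the agreed model and pull out specific elements, such as series titles, for search and display.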

Two developments in early 1995 served to transfer ownership of the Finding Aid Project to the archive and library communities, where the finding aid standards development effort rightfully belongs. In February 1995, the Bentley Historical Library awarded a fellowship to Daniel Pitti and a team comprised of leading national archivists and a prominent international SGML expert to evaluate formally the finding aid DTD and data model, and, based on the evaluation, develop a DTD and content encoding guidelines to be formally proposed to the communities as a standard (for a copy of the fellowship proposal, see Appendix E). In April of 1995, the Commission on Preservation and Access (CPA) sponsored a conference in Berkeley intended to gather a cross section of the archive and library communities together to provide a preliminary evaluation of the results of the Berkeley Finding Aid Project, and to make recommendations concerning next steps (for the agenda of the conference, a list of attendees, and the CPA newsletter article on the conference, see Appendix E). The general consensus of those gathered was that Berkeley had achieved its limited objective of demonstrating the desirability and feasibility of an encoding standard for finding aids. While endorsing the Bentley research team, conference attendees also recommended that interested institutions in the community should begin testing the DTD and use the experience gained thereby to inform the ongoing standards development process. Researchers plan to have proposed encoding and content standards ready for community evaluation in early 1996.

3. The California Heritage Digital Image Access Project

While the Berkeley Finding Aid Project has focused only on the finding aid text itself, it clearly set the stage for a future in which collection-level records lead to finding aids and finding aids, through the agency of hypermedia links, lead to digital surrogates of primary source materials that exist in a variety of native formats: pictorial materials, graphics, three-dimensional objects, manuscripts, typescripts, printed text, sound recordings, motion pictures, and so on. The NEH-funded California Heritage Digital Image Access Project, currently under way, is linking collection-level records in the Berkeley catalog to the finding aid texts, and the finding aid texts to digital representations of primary source materials documenting California culture and history. The primary objectives of the California Heritage Project are twofold: first, to implement in the networked computer environment the three-tiered archival access model first conceived in a mixed machine-paper environment, in order to determine whether it effectively achieves its purpose of describing, controlling, and providing access to primary source materials; second, to extend the finding aid DTD development initiated in the earlier Berkeley Finding Aid Project to encompass prototype standards for controlling and accessing digital representations of primary source materials.

4. The Next Step: The Virtual Archive

With the success of the Berkeley Finding Aid Project and with the work of the California Heritage Project to make the finding aid the fulcrum of a dynamic electronic archival access system, it has become clear that it will be necessary to build a vast national union repository of electronic finding aids, first centrally and later, as technology matures, in a decentralized environment. The achievement of such a union database will be a watershed in the study of American history and culture because it will make it possible for the first time for researchers to find and navigate among the important documents that define our cultural identity as if they were part of a single universal archive of the nation's history. It is now possible to see the way in which such a virtual archive of the American cultural heritage can be achieved. The proposed project will demonstrate this by creating a prototype of such an archive. Through this prototype, the nation's archivists, librarians and researchers will experience the potential of the virtual archive. But perhaps more importantly, the prototype will be a foundation upon which can be built a production system that realizes the vision of a universal archive of American history and culture appearing on the desktops of scholars and citizens alike. Before the nation can embark on realizing this vision, there is a clear need to demonstrate that it is intellectually, politically, technologically, and economically feasible to build a large database of hundreds of thousands of finding aids contributed by thousands of repositories.(11)

The American Heritage Virtual Archive Project extends the research of the earlier projects upon which it is based beyond the confines of the single institution to the new world of integrated access to many, distributed collections. To make this possible, the project's participants have agreed to adhere strictly to standards. The collection-level records use the USMARC format. The finding aids are encoded with the Standard Generalized Markup Language Document Type Definition (SGML DTD) developed in the Berkeley Finding Aid Project. The pointing information uses Uniform Resource Name (URN) and Uniform Resource Locator (URL). And the client that searches the bibliographic catalog is ANSI/NISO Z39.50 compliant.
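
The division of labor between a persistent name (URN) and a current location (URL) can be sketched as a simple lookup. The sketch is an assumption for illustration only: the name syntax, the resolver table, and the example addresses are hypothetical and do not describe the project's actual resolution mechanism.

```python
from typing import Dict

NAME = "urn:x-example:repository-a:family-papers"  # hypothetical persistent name

# Hypothetical resolver table: persistent names mapped to current locations.
RESOLVER: Dict[str, str] = {
    NAME: "http://repository-a.example/findaids/family-papers.sgm",
}


def resolve(urn: str) -> str:
    """Return the current URL registered for a persistent name."""
    try:
        return RESOLVER[urn]
    except KeyError:
        raise LookupError(f"no location registered for {urn}") from None


def relocate(urn: str, new_url: str) -> None:
    """Record a new location; catalog records that cite the name need no change."""
    RESOLVER[urn] = new_url


before = resolve(NAME)
relocate(NAME, "http://archive.repository-a.example/fa/family-papers.sgm")
after = resolve(NAME)
```

A collection-level record that stores the name rather than the address continues to lead to the finding aid after the contributing repository moves the file, which matters when many institutions maintain their own files.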

The primary goal of this project will be the development of the demonstration system, which will provide a test bed to evaluate both the effectiveness of the prototype's "virtual archive" in providing access to distributed digital library resources, and the feasibility of the decentralized "real world" production methods that the project will use to create it. To achieve its goal, the project will explore four types of issues: intellectual, political, technical, and economic.

Intellectual. Project participants will examine potential finding aid content standards that will be necessary for finding aids from diverse institutions to coexist harmoniously in the same database. They will also develop effective mechanisms to link and to integrate related collections contributed by different institutions so that they can be navigated as if they were part of a single, virtual collection. For example, the University of Virginia and Berkeley will explore different ways to integrate their separate collections of Mark Twain letters. They will investigate rational and appropriate use of hypertext mechanisms for linking and navigating both within individual collections and across intellectually related collections. Berkeley, with feedback from the collaborators, will develop a curriculum and manual to support finding aid encoding training and establish a networked "help desk" to provide ongoing support to the collaborators.

Political. Participants will develop policies and procedures for decentralized creation and maintenance of the collection-level catalog records and finding aids in order to assure that the metadata will be consistent and can be successfully integrated into the virtual archive. Issues concerning authority and management of the virtual archive will be systematically reviewed. Potential problems associated with data ownership and responsibility will be explored. The dynamics of training, user support, and documentation will be studied.

Technical. Participants will investigate technical issues associated with access, description, and control that arise when finding aids representing collections with related subject material from different institutions are combined and used in the same database. Additionally, participants will study various methods for remote creation and maintenance of finding aids that are made available in a remote central database. Collaborating institutions will also experiment with the union database using different software than that used at Berkeley, in order to help separate, by comparison and contrast, application-related issues from encoding and intellectual content issues. Berkeley staff will also work with Professor Ray Larson, of the School of Information Management and Systems, to experiment with applying advanced natural language retrieval technology to the union database. The project team will also investigate issues of replicability of the model and system, platform and software independence, and scalability.

Economic. Participants will continue the investigation of finding aid conversion costs begun in the Berkeley Finding Aid Project, and will study the cost of converting card-based finding aids. They will also investigate costs of data input and database maintenance, staff training and documentation, and system maintenance, including equipment renewal, database refreshing and migration, and telecommunications. The economic aspects of this demonstration project will be as important as the intellectual, political, and technical questions because the cost-effectiveness of the finding aid model will be a key factor in its long-term acceptability as a strategy for creating large-scale digital libraries of primary resources.

The participants will test the feasibility of the virtual archive by constructing a database of hundreds of finding aids comprised of thousands of pages of text. In order for a relatively small-scale system to realistically simulate a full-scale union archival database, it needs to contain a high concentration of records documenting related subjects. The collaborating institutions will contribute finding aids for collections documenting American Heritage, with a special focus on the nineteenth and first half of the twentieth centuries. By concentrating the finding aids on a specific subject area and historical period, the database will enable researchers to study the encoding and content issues associated with combining information on related subject matter from different institutions in the same database. It will also provide a significant amount of information on a topic to attract significant numbers of researchers, permitting us to perform accurate user studies.

5. Related Work on the Problem

Although there has been considerable recent effort in the archival community to provide network access to machine-readable finding aids, these efforts have been unilateral, and not based on standards. From the beginning of the Berkeley Finding Aid Project, research staff have understood that archive and library community collaboration and cooperation on the Internet would depend upon standards-based communication. The finding aid encoding and content standards development currently well underway will provide the standards base needed to realize the long sought dream of not only centralized access to collection-level summary records but also to the hierarchical and detailed descriptions found in finding aids.

National Inventory of Documentary Sources in the United States. NIDS represents the only effort to date to provide union access to finding aids representing collections in US repositories. NIDS is a for-profit publication compiled and produced by Chadwyck-Healey Inc. Finding aids are contributed by participating archives and libraries, and are then filmed and reproduced on microfiche. An accompanying index provides access to the microfilm images. NIDS is divided into three parts: part one includes finding aids in the National Archives, seven presidential libraries, and the Smithsonian Institution Archives; part two contains registers for collections in the Manuscript Division of the Library of Congress; part three consists of finding aids for state archives, state libraries, state historical societies, academic libraries and other repositories. To date, 327 repositories, including 210 American institutions, have contributed almost 52,000 finding aids.

National Union Catalog of Manuscript Collections. First published in 1962, NUCMC provides collection-level summary descriptions of manuscript collections in US repositories. It is published in book format. From 1962 to 1991, 65,325 records were contributed by 1,369 repositories. Recent issues are listed by repository name, subdivided by main entry of the catalog record. Accompanying cumulative indices provide access through names, places, and subjects. Publication of the printed serial will cease with the 1993 issue. The Library of Congress concluded that the RLIN machine-readable database of collection-level AMC records made the print publication obsolete. The NUCMC program will continue at the Library of Congress, with records being contributed to the RLIN database.

National Historical Publications and Records Commission. In 1961 the Commission published A Guide to Archives and Manuscripts in the United States, a collection-level guide to manuscripts held in 1,300 repositories nationwide. For economic reasons, the Commission changed the focus of its reference work from collection-level to repository-level, and published the Directory of Archives and Manuscript Repositories in the United States in 1978, and a second edition in 1988. The Directory provides very brief, general descriptions of collections in each repository listed.

Research Libraries Group. In March of 1995, the Research Libraries Group hosted a Primary Sources Forum to discuss the future of access to primary source materials. There was a consensus among the representatives of the archival community that RLG should begin testing the Berkeley Finding Aid Project hierarchical access model and methods. This recommendation was sent to RLG's Advisory Council. As a result of this recommendation, RLG submitted a proposal to the National Telecommunications and Information Administration, Telecommunications Information Infrastructure Assistance Program (NTIA/TIIAP). RLG proposed a two-year project to create an initial test database of 50 machine-readable finding aids contributed by 10 RLG members. Finding aids chosen to be encoded would be representative of a wide range of archival collections. If this project is funded, Berkeley will work closely with RLG staff to carry it out. Among other assistance, Berkeley will provide finding aid encoding training for RLG staff. The RLG project will help Berkeley continue the evaluation and ongoing development of the finding aid DTD, and will provide independent testing and verification of the conclusions of the Berkeley Finding Aid and California Heritage Digital Image Access Projects.

The recommendation to RLG's Advisory Council also has had an impact on the Research Libraries Group Primary Sources in American Literature Task Force (PSALT). In March 1993, members of RLG with the aid of funding from NEH began a year long project to survey the collections of America's primary source materials with a long term goal of finding a way to provide the nation's scholars with full access to all collections of primary source materials which document American literature. The Resources for Research in American Literature Project (RRAL) carried out an intensive survey of some of the collections of a number of major repositories, and although the project succeeded in gathering a great deal of useful information about the collections it surveyed it also discovered that it is difficult to get institutions to carry out intensive surveying and recataloging projects. In following up on the work of the RRAL project, PSALT has recognized that an approach must be used that will make use of existing finding aids and minimal "gateway" collection-level records, rather than intensive survey/cataloging projects. Following up on the work of the Berkeley Finding Aid Project, PSALT is looking at the possibility of creating a 'location register' (similar to a British resource of the same name) comprised of very brief collection level records containing fields noting the existence of finding aids and making it possible to link with machine readable versions of them (as is being done in the California Heritage Project).

In February 1995, with funding from NTIA/TIIAP, SOLINET hosted a planning workshop for the Southeast Special Collections Access Project (SESCA). Held in Atlanta and attended by archivists from throughout the southeastern United States, the workshop's purpose was to develop a plan of action for building a regional finding aid database. SOLINET invited Daniel Pitti from the Berkeley Finding Aid Project to speak and to demonstrate SGML-based machine-readable finding aids. As an outgrowth of the planning workshop, SOLINET in partnership with the Southern Growth Policies Board (SGPB) has submitted a proposal to NTIA/TIIAP for a two-year demonstration project to link and integrate distributed regional information resources. The project will focus on library special collections and public and government information. Berkeley will maintain contact with SOLINET, and will provide training and other assistance as requested. The SOLINET/SGPB project, like the RLG project discussed above, will assist Berkeley in its continuing evaluation and development of the finding aid DTD, and will provide independent testing and verification of Berkeley Finding Aid Project and California Heritage Digital Image Access Project conclusions.

National Digital Library Initiative. While the Library of Congress' National Digital Library Initiative will attempt to digitize a wide range of material, a prominent feature of the database will be the digitizing of primary source materials documenting the American cultural and historical experience. Library of Congress staff from the National Digital Library Initiative, the Manuscripts Division, the Prints and Photographs Division, and Information Technology Services have been working closely on the finding aid DTD and data content model with Daniel Pitti and other Berkeley Finding Aid Project and California Heritage Digital Image Access Project staff. Additionally, Helena Zinkham, Head, Processing Section, and Janice Ruth, the editor of the registers in the Manuscripts Division, are on the Bentley Finding Aid Team reviewing and revising the DTD and data content model. Berkeley staff will continue to work closely with the Library of Congress.

6. American Heritage

The technical problems addressed in this Project are highly significant and the solutions it will advance are, we think, compelling. Nevertheless, as mindful as we are of this aspect of our work, we are even more mindful of what it is we are trying to provide access to using this technology. Across the country the new information technology is stimulating something far more interesting than its own development. Projects such as American Memory and The Making of America are taking the first steps in a vast, collaborative effort to extend our national self-knowledge. Creating and connecting the sources of our knowledge is a grand American enterprise. It is fitting that it begins, but does not end, with a machine, a technology. Providing free and universal access to the sources of information that define us is the really exciting thing about this Project and the many like it that are springing up so exuberantly. Finding a way to bring the fullest record of the American Heritage to the scholars and citizens of America is the true heart of this Project.

The Project will create a rich and varied database of finding aids for primary source materials that document and illustrate the American heritage with a focus on the nineteenth and early twentieth centuries. It is important to the success of the Project that the database in the prototype system provide scholars with information about a body of inherently interesting material, which will serve as a rich and coherent intellectual resource of enduring scholarly interest. The database will be constructed to exploit the power of SGML to improve and to aid in the creation of new knowledge through its support of sophisticated search and navigational capabilities. The collections included in the Project have been selected carefully to provide subject matter that will encourage new ways of searching, manipulating, and understanding digital information. This selection process will be further refined as the Project progresses in order that the collections of all of the collaborating institutions, when taken together, constitute a significant body of research resources. In addition, collections will be selected for the diversity of the formats they include and the variations in age, quality and extensiveness of their finding aids.

For descriptions of the collections we expect to include, please see Appendix A.

II. The Expected Results and Dissemination

1. Project Objectives

The virtual archive approach will have tremendous potential to improve service at the nation's, and indeed the world's, libraries, archives and museums. Remote, real-time access to full archival collection description of primary source materials from archives and special collections libraries all over the world will provide a level of service that in the past could only be achieved through extensive travelling and years of arduous and expensive scholarship. This service will be provided without recourse to the costly mediation of reference staff. The direct access of the virtual archive will give researchers more autonomy and greater control over their research. More dramatically, the integration of disparate collections achieved by the virtual archive coupled with the ability to search and navigate across them will allow researchers to see the primary materials of their work in entirely new ways. And use of digital surrogates instead of the items themselves will aid in the preservation of the originals.

The virtual archive will also advance resource sharing among the libraries of the country thus allowing them to provide better service with smaller budgets without inconveniencing the public. The access to current collection information and digital surrogates of primary sources that it will provide will facilitate inter-institutional cooperation in collection development and preservation where knowledge of the holdings of other institutions can help curators make difficult decisions about how to spend scarce dollars developing and preserving their own collections.

The key objectives of the project are:

  • To develop effective mechanisms to link and to integrate related collections contributed by different institutions so that they can be navigated as if they were part of a single, virtual archive;

  • To develop policies and procedures for decentralized creation and maintenance of the collection-level catalog records and finding aids in order to assure that this metadata can be integrated into the virtual archive;

  • To explore intellectual and technical access, description, and control issues that arise when finding aids representing collections with related subject material from different institutions are combined and used in the same database.

2. Plans to Disseminate Results of the Project

Project results will be made widely available. The Library of Congress, Duke University, University of Virginia, Stanford University and UC Berkeley will be used as test and evaluation sites for the American Heritage Virtual Archive. Each site will make client machines available for both staff and public users to conduct ongoing demonstration and evaluation of the system using the project's evaluation methods and tools. In addition, all archives and libraries with access to the Internet will have access to the virtual archive in one of two ways. First, using X-Server software on networked workstations they will be able to search, browse, and navigate the finding aid database directly using DynaText software, or by launching DynaText from within the catalog record. Second, using a World Wide Web browser (for example, Mosaic or Netscape), they will have full searching and navigating access to the full DynaText database using Electronic Book Technologies' new product DynaWeb as a filtering device for converting the SGML to HTML. The virtual archive will thus be widely and publicly available.

A detailed report will be submitted to the National Endowment for the Humanities, the Commission on Preservation and Access, the National Historical Publications and Records Commission (NHPRC), the National Archives and Records Administration, the Society of American Archivists Committee on Archival Information Exchange, the Coalition for Networked Information (CNI), Computer Interchange of Museum Information (CIMI), American Society for Information Science (ASIS), Association for Information and Image Management (AIIM), and the administrations of all of Berkeley's collaborators and expert advisors (see Appendix F).

The prototype American Heritage Virtual Digital Archive system -- its catalog records and finding aids -- will be disseminated throughout the world on the Internet. It will be demonstrated in exhibitions and presentations at a minimum of six major professional conferences and meetings, for example, the American Library Association Annual Meeting (ALA); the Rare Books and Manuscripts Section of the Association of College and Research Libraries (ACRL RBMS pre-conference); the Society of American Archivists Annual Meeting (SAA); the American Society for Information Science (ASIS) Annual Meeting; the Association of Research Libraries (ARL) Seminar on Scholarly Publication; and the ARL/AAUP Seminar on Electronic Publishing on the Network.

Papers will be contributed to at least six major professional conferences and meetings, for example, the American Library Association Annual Meeting (ALA); the Rare Books and Manuscripts Section of the Association of College and Research Libraries (ACRL RBMS pre-conference); the Society of American Archivists Annual Meeting (SAA); the American Society for Information Science (ASIS) Annual Meeting; the Western History Association Annual Meeting; the Organization of American Historians Annual Meeting; and the American Historical Association, Pacific Coast Branch Annual Meeting.

Papers will be submitted to at least one manuscripts and archives journal (e.g., The American Archivist); two library automation journals (e.g., LITA's Information Technology and Libraries, Educom Review, and Academic and Library Computing); and two history journals (e.g., Western Historical Quarterly and Pacific Historical Review). The Library will publicize the results of the project through news releases, campus publications, electronic bulletin board announcements (for example, Archives Listserv, and FindAid Listserv), and in Current Cites.

The project's participating institutions will have a considerable stake in making sure the research in this project is successfully carried out and that its results are followed up with further research and development. The participating institutions will use the project results to stimulate discussion on and consideration of establishing a virtual national archive by the Library of Congress, the National Archives and Records Administration, the leading research libraries and archives in the United States, RLG and OCLC, American Library Association, Society of American Archivists, and interested scholarly associations and societies. The participating institutions will work towards national acceptance of the virtual archive approach as the standard method for integrating access to decentralized collections.

III. The Project's Plan of Work

The project is complex and its plan of operation calls for a tightly scheduled sequence of events occurring over twelve months and requiring the close cooperation of many staff members in a number of different departments at the collaborating institutions. Because of its experience with the technology, the access model, operations, and production, Berkeley will take the lead among the collaborators and offer its own proven methods (developed in the Title IIA Berkeley Finding Aid Project and in Berkeley's NEH California Heritage Digital Image Access Project) as guidelines for the other collaborators, but strict adherence to Berkeley's procedures will not be required. Flexibility is an important aspect of the production methodology this project wishes to demonstrate because in the "real world," production of data is decentralized, and individual institutions have varying internal procedures and work flows for creating their catalog records and finding aids. The different methods used by participants will be documented and analyzed as part of the evaluation of the project. Project participants will require strict adherence to standards governing the encoding, pointing/linking, and quality of data capture (USMARC for collection-level records; the Berkeley SGML DTD for finding aids; and Uniform Resource Names (URN) and Uniform Resource Locators (URL) for naming). The use of standards is the key element that will allow the project to succeed in bringing such a diverse group of collections and institutions together into one coherent access system.

The plan of operation is detailed below in five steps. For the sake of clarity, the first step outlines how the collaborators will coordinate their work. The production steps Berkeley will follow (and offer to its collaborators as guidelines) appear after Step I, as the normative plan of operation. Although the plan suggests a linear progression, many of its activities will overlap and a number of teams will be working concurrently on the different aspects of the project at each of the participating institutions. The project timeline follows the narrative of the plan of operation.

Step I: Coordination of Participants: Training, Production and Integration of Data

The American Heritage Virtual Archive Project was designed in a highly decentralized manner in order to model a real life production system employing staff in many institutions and including diverse types of collections. Participants will create their own data locally, adhering to the project's standards and following the same production timetable.

Training and coordination of the participants will be handled by Berkeley's Database Design Coordinator, who will also monitor the production schedule and the quality of the data produced by all participants during the course of the project. The participants' standardized metadata will be integrated on a central server at Berkeley. Berkeley will maintain a technical support "Help Desk" and consulting services for participants throughout the project.

1. Training: At the beginning of the project, a one-week intensive training workshop covering all aspects of the project will be given at Berkeley for the participants' Project Coordinators. The Database Design Coordinator, with the assistance of staff from Berkeley's Electronic Text Unit (see Appendix C), will be the instructor for the course. The workshop curriculum will include detailed instruction on standards, markup of finding aids, SGML authoring from scratch, SGML conversion, possible conversion methods for different source documents (including scanning, OCR, and database conversion), creation of standard pointing and linking information (URN/URL) and a thorough discussion of various production methods (the Berkeley project team will consult closely with the other collaborating institutions to help them find the most efficient and cost effective production solutions to match their individual requirements). After the completion of the workshop, the Database Design Coordinator will follow up with a visit to each of the individual institutions in order to monitor progress and give further instruction. Close contact via project listserv, telephone and electronic mail will be maintained throughout the project by the Database Design Coordinator and the Project Coordinators at the participating institutions.

2. Catalog Records: Participants' USMARC collection-level records will be created in local cataloging systems, in OCLC or RLIN, following national standards.

3. Finding Aids: Participants' finding aids will be created (authored or converted) in compliance with the SGML DTD for finding aids created in the Berkeley Finding Aid Project. Decentralized authoring will be done using any of several commercially available SGML authoring tools. Berkeley will provide software recommendations to collaborators who want to use locally mounted software. At the present time, the most viable options are ArborText's Adept Editor, SoftQuad's Author/Editor, and WordPerfect SGML Editor for Windows.

4. Assembling Metadata at Berkeley: USMARC collection-level records and SGML finding aids are a form of metadata because they describe, control, and provide access to the primary source materials. The metadata created by the collaborating institutions will be transferred to Berkeley by FTP. The USMARC records will be loaded into the GLADIS and MELVYL databases, which will serve as the project's online catalogs for the purposes of research and demonstration (this does not preclude the participants from mounting their USMARC records on other bibliographic systems). The finding aids will be loaded into DynaText, an SGML browser/database system which will run on a Sun SPARCcenter 2000 server, located in Berkeley. The metadata will be loaded, indexed and archived by Berkeley operations staff (Programmer/Analyst II).

5. The Prototype Software: The work of both the Title IIA Berkeley Finding Aid Project and the NEH California Heritage Digital Image Access Project has formed the foundation for the access tools to be used in this project. The primary goal of the project is to demonstrate that USMARC collection-level records and SGML encoded finding aids, used together in a network environment, can provide description, access and control of decentralized collections belonging to multiple institutions. The prototype system will link together two different systems: an online catalog containing the USMARC collection-level records; and a finding aid browser/database containing the SGML encoded finding aids, which provide detailed description of and access to the collections. The navigation tools required by this prototype system will be linked in a client/server environment using the ANSI/NISO Z39.50 standard information retrieval protocol. The researcher will use a Z39.50 client to search a bibliographic catalog such as UC Berkeley's GLADIS or UC's MELVYL. When a collection-level record is retrieved, a "button" will appear on the client's user interface telling him/her there is a related finding aid. If the button is clicked, the client will extract the finding aid's URL from the collection-level record and launch the SGML browser, passing it the URL as an argument. The browser will then locate and display the finding aid. The researcher will navigate through the finding aid until he/she finds a reference to a relevant item or group of items (he/she will also have the option of performing another search within either the finding aid or the entire database of finding aids).
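The hand-off from catalog record to finding aid browser described above can be sketched roughly as follows. This is an illustration only, not the project's client: the record is a simplified in-memory structure rather than parsed USMARC, the choice of field 856 subfield u as the place where the URL is stored is an assumption (the proposal says only that the URL is carried in the collection-level record), and the browser name, sample title, and sample URL are hypothetical.

```python
from typing import Optional

# Hypothetical in-memory form of a collection-level record:
# a list of (tag, subfields) pairs. Parsing real USMARC is out of scope.
Record = list[tuple[str, dict[str, str]]]


def finding_aid_url(record: Record) -> Optional[str]:
    """Return the first URL found in an 856 $u subfield, or None.

    Assumption: the finding aid's URL is stored in field 856, subfield u.
    """
    for tag, subfields in record:
        if tag == "856" and "u" in subfields:
            return subfields["u"]
    return None


def browser_command(record: Record,
                    browser: str = "sgml-browser") -> Optional[list[str]]:
    """Build the command that would launch the browser with the URL as its
    argument. Returns None when the record has no finding aid link, in
    which case the client would show no "button"."""
    url = finding_aid_url(record)
    if url is None:
        return None
    return [browser, url]


# Hypothetical sample record with a title field and a link field.
record: Record = [
    ("245", {"a": "Sample family papers"}),
    ("856", {"u": "http://example.org/findaids/sample.sgm"}),
]

command = browser_command(record)
```

A real client would pass the resulting command to the operating system (for example with `subprocess.run`) instead of only constructing it.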

The client software used in providing access will be the Z39.50 client developed for the NEH California Heritage Digital Image Access Project. The project's SGML browser and database will be EBT's DynaText system, which is a sophisticated and versatile SGML search and navigation tool with a wide array of native internal inline and popup image support as well as the capability of incorporating external rendering devices for a variety of media. Finally, the project will also use EBT's DynaWeb software which will allow any World Wide Web Browser (e.g., Netscape, Mosaic, etc.) to search and display finding aids from the DynaText database.

DynaText and DynaWeb will make it possible for the project to explore a number of new features, including hypertext links to other collections, in-line graphics, in-line notes (from researchers/curators, etc.), and new searching techniques. The use of commercially-available software for document creation, the database, and browsing, ensures that the project funds are used in a cost-effective manner and that the system is scalable and replicable.

Technical Note: Centralization vs. Decentralization of Metadata

The project's planners decided to centralize the metadata based on the availability of high performance, centralized database and indexing tools for both USMARC and SGML metadata. Current technology does not adequately allow for searching multiple independent USMARC bibliographic or SGML text databases in a single transaction. Since the ability to provide powerful searching of the metadata is much more important to the evaluation of this project than the decentralization of the metadata itself, a centralized approach has been selected. However, it should be noted that when the technology to search multiple, decentralized USMARC bibliographic and/or SGML text databases matures, this technology can be substituted for the project's centralized approach because there is nothing in the design of the software architecture that precludes the use of decentralized metadata.
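As a rough illustration of why a central database makes single-transaction searching straightforward, the sketch below builds one in-memory word index over finding aid text contributed by several institutions, so that a single query covers every contributor. It is a toy model under stated assumptions, not a description of DynaText or of the catalog systems: the function names, identifiers, and sample texts are all hypothetical.

```python
from collections import defaultdict


def build_union_index(finding_aids):
    """Map each lowercase word to the set of (institution, aid_id)
    pairs whose text contains it."""
    index = defaultdict(set)
    for institution, aid_id, text in finding_aids:
        for word in text.lower().split():
            token = word.strip(".,;:()")
            if token:
                index[token].add((institution, aid_id))
    return index


def search(index, query):
    """Return the finding aids that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    hits = [index.get(word, set()) for word in words]
    return set.intersection(*hits)


# Hypothetical contributions from three institutions.
contributed = [
    ("Berkeley", "aid-1", "Letters and notebooks, 1870-1900"),
    ("Virginia", "aid-2", "Letters to publishers, 1880-1895"),
    ("Duke", "aid-3", "Plantation account books"),
]

index = build_union_index(contributed)
letters = search(index, "letters")
```

With decentralized metadata, the same query would have to be sent to each repository's database and the results merged, which is the capability the note describes as not yet mature.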

6. Creation of Virtual Collections within the Virtual Archive: The collaborating institutions will experiment with different ways in which virtual collections can be created by intellectually gathering, through the text of finding aids, distributed collections that are intellectually related or related by provenance. One approach to linking related collections is familiar to users of the World Wide Web, namely hypertext and hypermedia links between documents. While this method will be used within the scope of this project, collaborators will also experiment with other methods. For example, for collections that have been dispersed among two or more institutions (such as the Mark Twain collections at Virginia and Berkeley), collaborators will experiment with cooperatively creating a single finding aid, in which separate components are used to describe each of the separate collections held at separate repositories. For intellectually related collections, collaborators will experiment with "meta-finding aids" that describe and interrelate two or more collections, and provide hypertext links to the finding aids for each collection.

Step II: Assembly of Metadata and the Preparation for SGML Authoring

A Library Assistant (LA) IV workleader will direct a student assistant in the assembly of cataloging records and finding aids for the project's databases. An inventory of the cataloging records and finding aids, recording their current state, will be compiled under the LA IV's supervision. This will be completed in the first several weeks of the project (May 1996).

Berkeley's team will prepare for the encoding and loading of the collection-level records and finding aids for the collections that will be included in the prototype access system. Machine-readable finding aids that exist in word processing or database formats will be stored on a server in the UC Berkeley Library. Most of Berkeley's paper finding aids will have to be scanned and converted using Optical Character Recognition (OCR) software. Some of these will also require upgrading with additional information such as brief scope, biography, and history notes, and series descriptions. The Curator and the Head of Bancroft Technical Services will review all of the converted finding aids. These tasks will be completed by the end of the first nine months (May 1996-January 1997). The participants will use their own procedures but will follow roughly the same timetable.

Step III: SGML Authoring of Machine-Readable Finding Aids

Berkeley's Library Database Design Coordinator has researched SGML-based pointing conventions and worked out SGML-compliant architectures to ensure that finding aid data are standard. He will assure that all of the project's participants provide data that conform to the necessary standards through the project's training and monitoring program (see Step I above). Participants will use SGML authoring software of their own choosing and will convert on the same timetable as Berkeley. Berkeley's Database Design Coordinator, the staff of ETU, and SGML trainers will be available for consultation with the other collaborators throughout the SGML authoring and linking processes.

Methods for creating the SGML-compliant finding aids will vary depending upon whether the finding aid is already in machine-readable form but not marked up, or exists only on paper or cards. If paper finding aids are of exceptionally poor quality, it may be necessary to rekey all or portions of them. In that case, the finding aids will be keyed directly into context-governed, SGML-based authoring and editing software. A variety of commercial options are available (ArborText, SoftQuad, and WordPerfect, to name a few, have promising authoring tools). Methods for converting existing finding aids will vary according to the native format. Berkeley staff have identified, and have experience converting, finding aids from a variety of word processing programs (WordPerfect and Word, primarily) and database programs (dBase and Microsoft Access, primarily). Even within finding aids, methods vary. Textual areas composed of large blocks of text can be efficiently converted using "cut-and-paste." For container lists and lists in general, which are characterized by repetition and high-density tagging, staff have efficiently used WordPerfect macros and Perl scripts. Finding aids that exist only in paper format are scanned and OCR'd, and the resulting errors are corrected. The resulting word processing files are then converted like other word processing files.
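As a rough illustration of the script-based approach to container lists described above, the following minimal sketch tags a tab-delimited list of box, folder, and title entries. It is written in Python rather than Perl or WordPerfect macros, and the input layout, the function names, and the element names (containerlist, item, box, folder, title) are placeholders assumed for this example; they are not taken from the project's DTD or from Berkeley's actual scripts.

```python
# Illustrative sketch only. The element names used here are hypothetical
# placeholders, not the elements defined in the project's finding aid DTD.

def escape_sgml(text):
    """Escape characters that would otherwise be read as markup."""
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def encode_container_list(lines):
    """Wrap 'box<TAB>folder<TAB>title' rows in repetitive list markup."""
    entries = []
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank rows left over from the word processing file
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 tab-separated fields, "
                f"got {len(fields)}"
            )
        box, folder, title = (escape_sgml(field.strip()) for field in fields)
        entries.append(
            f"<item><box>{box}</box><folder>{folder}</folder>"
            f"<title>{title}</title></item>"
        )
    return "<containerlist>\n" + "\n".join(entries) + "\n</containerlist>"


if __name__ == "__main__":
    sample = [
        "1\t1\tCorrespondence, 1870-1875",
        "1\t2\tNotes & drafts",
    ]
    print(encode_container_list(sample))
```

Rejecting malformed rows, rather than guessing at their structure, reflects the assumption that container lists converted from OCR output will still be reviewed by staff.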

Using encoding and maintenance guidelines and procedures developed in the Berkeley Finding Aid Project and the California Heritage Project, an Electronic Publishing Assistant III in the Electronic Text Unit (ETU) will supervise two Electronic Publishing Assistants II in adding URNs to related USMARC records and encoding finding aids with the SGML DTD. ETU staff will encode Bancroft Library finding aids. SGML authoring will begin in June 1996 and be completed by April 1997.

Step IV: Release of the Prototype to the Public

During the first six months of the project, Berkeley's Library Systems Office (LSO) will install workstations on the Berkeley campus. In the seventh and eighth months of the project, staff at various sites on campus will be trained and the prototype will be released for public access.

Step V: Evaluation and Dissemination of Project Results

Evaluation of the prototype database: Evaluation of the prototype database will address the intellectual coherence of the proposed system and the usability of the navigation system. As the metadata database is developed, project evaluators will prepare a set of guidelines and an online user satisfaction questionnaire adapted from the model proposed in Chin and Diehl's "Development of an instrument measuring user satisfaction of the human-computer interface" (for greater detail on evaluation objectives and methods see Appendix I) and contact the collaborators with the evaluation timetable. This first phase of the evaluation development will take place between May 1996 and July 1996.

During the second phase (August 1996-December 1996), the project evaluation team will distribute guidelines and questionnaires to the collaborators and arrange for Internet access to the database under construction. Via a pre-test of the survey instrument, they will also solicit feedback for database refinement and process survey data during this phase.

In the third and final phase of the evaluation process (January 1997-April 1997), the evaluation team will conduct the survey and recruit interns from Berkeley's Graduate School of Information Management and Systems to assist in evaluation. In addition to the questionnaire, the team will conduct follow-up individual and group interviews to solicit further feedback.

At the end of the final phase of the evaluation, the project evaluators will process survey data and summarize and present the results of each of the three phases of the database evaluation process in the final report of the project.

In addition to evaluating the user-perspective of the database and navigational design, the collaborators will evaluate various strategies for creating the intellectual coherence of the database. Specifically, they will evaluate the linking of like collections through hyperlinks between and among finding aids in comparison to the creation of a more hierarchical structure in which a higher-level finding aid combining the content of the various collections into a single document contains hyperlinks to the individual collections. The participants will also evaluate the extent to which ownership of the primary source materials themselves must be made explicit in the navigational process, and to what extent finding aids from individual institutions should be kept in individual "sub-databases", or otherwise separate from the finding aids of other participants.

Political Evaluation: Throughout the project, the experts in the collaborating institutions will be developing policies and procedures to ensure that local data are protected, and that there is also an adequate ability for the collaborators to maintain the union database. Prototype policies concerning authority to input and maintain data, security, and management of the virtual archive will be developed and systematically reviewed by the participants. These prototype agreements will be distributed widely to others in the field for comment. An evaluative report of the success of the development of a collaborative database of this type will be disseminated.

Technical Evaluation: The foundation of this project is the set of standards to which it adheres and which it develops. Of primary importance is further testing and refinement of the DTD in a multi-institutional environment. The results of the experience of the participants will be regularly integrated into revisions of the DTD, and the final version will be published and recommended for adoption.

In addition, the participants will be testing various methods for remote creation and maintenance of the finding aids which are included in the union database. The participants will compare their experiences, evaluate the problems and advantages of the various options, and issue a report recommending those approaches that appear most promising for future production phases of the construction of the American Heritage Virtual Digital Archive.

The training which is provided at the beginning of the project will be evaluated by the attendees at the end of the session and again mid-way through the project, when Berkeley staff visit each site. A prototype training curriculum will be developed based on the training given, and its evaluation. This curriculum will form the basis for future training for production systems.

Economic Evaluation: Several economic aspects of the project will be tested. Because the project is a research and demonstration project, the amount of usable data that will pertain to production systems will be limited. Nevertheless, the major components of the costs of the various operations required to create and maintain the archive will be collected and analyzed in order to attempt to predict the ongoing costs to sustain the prototype into production. The following are the key costs to be monitored:

Training: Costs for initial training of new participants, including classroom time, instruction time, and development of documentation will be collected, as will the costs of continuing training and revision of documentation as technology, standards, and policies change. These latter costs may be reflective of the ongoing costs of a production system.

Scanning/OCR/editing of finding aids: Berkeley has a great deal of experience in scanning, OCR, and editing of manual finding aids. It will take primary responsibility for evaluating ongoing costs for retrospective conversion of this type. The other participants will be able to gather start-up costs which will be of use to new institutions as they join in future production projects. The cost-effectiveness of the scanning/OCR/editing portion of the project will be key to its extension to additional participants, so collecting, analyzing, and reporting these costs is an important part of the evaluation plan.

Costs of maintaining the database, including keying, editing, system backup, technology renewal, network costs, etc. will be collected, analyzed, and reported.

The project will be completed by April 30, 1997 and project results will be disseminated.

May 1996-April 1997
5/96         Collaborators' Training at Berkeley (one week).
5/96         Assemble cataloging and finding aids.
5/96-7/96    Evaluation Team (EVAL) prepares guidelines and user satisfaction questionnaire.
8/96         EVAL sends questionnaire to collaborators for comments.
9/96-12/96   EVAL revises questionnaire in light of collaborators' comments, devises online version, pre-tests survey.
5/96-11/96   Convert finding aid files to ASCII text.
6/96-4/97    Encode finding aids with SGML and add finding aid "locations" to collection-records.
1/97-3/97    EVAL works with faculty to incorporate evaluation into seminars; installs evaluation workstations in The Bancroft Heller Reading Room; selects appropriate software application to record and process evaluation data; distributes evaluation guidelines to collaborators.
4/97         EVAL analyzes completed questionnaires; writes interim evaluation report; recruits, trains, and supervises evaluation interns for on-site evaluation; coordinates on-site and classroom evaluation; analyzes evaluation data and writes final evaluation report.
9/96-4/97    Project staff write and deliver papers; demonstrate prototype at conferences and meetings; final project report written.
5/97         Staff will continue to demonstrate prototype at conferences and meetings.

Management Plan

The project will be managed by the UC Berkeley Library through the campus Sponsored Projects Office in accordance with all applicable Federal and University guidelines. Coordination of the participating institutions will be shared by the Project Director, Bernard J. Hurley, the Management Coordinator, Susan Rosenblatt, and the Database Design Coordinator, Daniel Pitti. Mr. Hurley will oversee the development of the prototype. Ms. Rosenblatt will assure that the participants devote sufficient resources to successfully complete the project. Mr. Pitti will train and advise the participating institutions on project standards and production techniques. He will monitor the quality of the data. Each collaborating institution has assigned a project leader who is an expert in archival and manuscript processing, cataloging, and automation. All institutions have experience in carrying out successful grant-funded projects. Berkeley will regularly monitor the production and quality of the work of all of the participants, including its own staff, and will establish benchmarks to ensure that project commitments are met on time and within budget. The following description provides details about the responsibilities of each of the key personnel.

1. Principal Investigator: Dr. Peter Lyman will provide administrative oversight of the project, will coordinate its activities at the policy level with the other participating institutions' administrators, and will represent the project to national bodies such as the Association of Research Libraries, the Commission on Preservation and Access, and the Coalition for Networked Information. He will assure the continued development of the prototype access system and the maintenance of the data created in the project. He will participate in disseminating the project results.

2. Project Director: Mr. Bernard J. Hurley will assist the Database Design Coordinator in designing the prototype access system. He will direct the evaluation of software and hardware and oversee the development of the prototype system. He will assist in the design of the user questionnaire to evaluate the prototype. He will consult with the Project Coordinators from the collaborating institutions, and will hire and supervise the project's Programmer Analyst III. He will participate in efforts to disseminate the project's results.

3. Management Coordinator: Ms. Susan F. Rosenblatt will provide general administrative support; consult with the collaborating institutions on data standards and policy governing the integration of data in the virtual archive; assure that adequate resources are allocated to support the successful completion of the project; and participate in efforts to disseminate the project results.

4. Database Design Coordinator: Mr. Daniel V. Pitti will design the prototype access system with the assistance of the Project Director. He will coordinate relations between the participating institutions. Mr. Pitti will develop the training curriculum and carry out the training of the Project Coordinators of the collaborating institutions. Following training, he will visit each of the collaborators to provide follow-up support and consultation. Mr. Pitti will serve as an SGML technical consultant to the project and he will also coordinate and supervise Electronic Text Unit staff mounting and publishing the marked up texts. He will monitor the quality of all project finding aids. He will participate in efforts to disseminate the results of the project.

5. Archival Control Coordinator: Mr. Jack von Euw will assist in designing the prototype. He will oversee Bancroft Technical Services' selection of cataloging records and finding aids. He will consult with collaborating institutions concerning finding aids and the integration of related collections in the prototype. He will assist in designing the user questionnaire and in the evaluation of the prototype. He will participate in efforts to disseminate the results of the project.

6. Evaluation Coordinators: Ms. Teri Rinne and Dr. John Ober will develop and carry out the evaluation of the prototype system. They will develop the evaluation guidelines and questionnaires. They will work with the collaborators to make sure that these reflect their concerns about performance of the prototype. They will supervise the implementation of the evaluation procedures, including distributing evaluation materials. They will compile and analyze the evaluation data. They will write the evaluation report for inclusion in the project's final report. They will participate in the dissemination of the project's results.

7. Participating Institutions Project Coordinators: For resumes of project directors at the participating institutions see Appendix G.

8. Technical Staff:

The Production Controller (Programmer/Analyst II) will be responsible for receiving (via FTP) the metadata from all participants and loading/indexing it into DynaText on a regular, frequent schedule. S/he will also arrange for technical support to all grant participants for the installation and maintenance of PC-based software (X window system client, Z39.50 client, SGML authoring tools, and SGML browsing software). In addition, the Production Controller will arrange for the Systems Administrator at Berkeley to install and support Unix-based central software (DynaText, ArborText, etc.), configure the Unix kernel to support new hardware (e.g., optical disk), and tune and monitor the system's performance.

Technical Plan:

The following software and hardware will be used in this project:

1. Software: SGML Authoring Tools (ArborText's Adept Editor, SoftQuad's Author/Editor, WordPerfect, etc.); a Z39.50 bibliographic catalog client developed at Berkeley; Electronic Book Technology's DynaText SGML browser and database system; and DynaWeb, a WWW/HTML interface to the DynaText database. Whenever possible, commercially or freely available software will be employed to reduce the costs of the project and to ensure its replicability and continuing compliance with industry standards.

2. Central Hardware: The metadata will be loaded on Berkeley's Sun Sparcstation 2000 -- a gift from Sun Microsystems to Berkeley, which will host a Sunsite for Digital Libraries.

3. Hardware, Clients: Five 486-class PCs and laser printers will be used to create and maintain the metadata.

IV. The Background of the Applicant

Berkeley has an excellent background to qualify it to carry out a project like this one and is able to assemble an extraordinary range of expertise in support of it. The Library will draw upon the combined expertise of the Library Systems Office, the campus Museum Informatics Project (MIP is responsible for assisting all campus museum and non-book collections in planning, testing, and implementing computerized systems), the Library's Electronic Text Unit, The Bancroft Library, the Library's Conservation Department, and the Library Photographic Services. The Project will also actively consult with the expert staffs of the University of California System-wide Division of Library Automation (DLA is the division responsible for the University of California's MELVYL online library catalog), and Information Systems and Technology (IS&T is Berkeley's central campus computing organization). Finally, the Project will work closely with University of California, Berkeley School of Information Management and Systems Professor Ray Larson, developer of the Cheshire System. Professor Larson will be consulted on issues concerning SGML and the design of the prototype. Professor Larson, a noted expert on computer access systems, is currently working to integrate SGML into the design of the Cheshire search and retrieval system.

The Library has had a long-standing commitment to the preservation of and access to its collections, document delivery, national and regional resource sharing, and research and development of advanced technologies. That it has the background to develop advanced computer access systems is demonstrated by its development of the GLADIS System, the online catalog and processing system at Berkeley. The Library is deeply involved in initiatives to create and to provide access to electronic texts. It is one of the collaborators in the TULIP Project and is working with UC Press on a Mellon Foundation grant proposal for a five-year project to create a network-accessible database of electronic texts of scholarly monographs and journals in the humanities. The Library will provide SGML expertise for this project and mount the electronic texts on its server.

The Library and The Bancroft Library have been planning and working to provide enhanced access to non-print materials for more than three years. The need to automate The Bancroft's finding aids and Analytic Index, a guide to The Bancroft's finding aids, was identified during the course of the planning and writing of a Department of Education HEA Title IIC grant proposal to convert USMARC AMC collection-level cataloging records for The Bancroft's Manuscripts Collections. Susan Rosenblatt, the Associate University Librarian for Technical Services, and Peter Hanff, Acting Director of The Bancroft Library, formed the Berkeley Library's Catalog Design Liaison Group and charged it with studying the problems associated with accessing materials in non-print formats and with recommending solutions to them. The Group formed a Task Force on Finding Aids specifically to address the problems connected with computerizing the Library's finding aids. This Task Force was asked to determine the current state of finding aid computerization nationally and to investigate technology that would allow finding aids to be accessed in an online network environment. The result of the Task Force's work and of the further collaborative efforts among Bancroft staff, the Library Systems Office, the Conservation Department, the campus MIP, and IS&T has been three electronic text and imaging projects: the Berkeley Finding Aid Project (see Appendix B), funded by a research and demonstration grant under the College Library Technology and Cooperation Grants Program of the Higher Education Act, Title II-A; the RLG Digital Image Access Project, which explored collaboratively the provision of network access to digital images of urban scenes; and the NEH-funded California Heritage Digital Image Access Project. Each of these projects has demonstrated Berkeley's considerable interest and expertise in and commitment to the use of electronic technologies to improve access to and preservation of primary resource materials in special collections. In order to ensure the success of its various digital library projects, The Library has created its Electronic Text Unit (see Appendix C), which is a production unit, and the Digital Libraries Group, which consists of project leaders for a wide variety of digital library initiatives.

V. The Project's Staff

Project Director: Bernard J. Hurley has been the Director of the Berkeley Library Systems Office and a member of the Library Administrative Group since August 1981. Mr. Hurley has been working in the field of library automation for the last fourteen years. While at Berkeley he has played a central role in developing the GLADIS System, Berkeley's online catalog, catalog maintenance, authority control, and circulation system, and its access to the Berkeley Campus Information Network. He serves as the Project Director for both the Berkeley Finding Aid Project and the California Heritage Digital Image Access Project. Mr. Hurley will spend ten hours per week on the project.

Principal Investigator: Dr. Peter Lyman is University Librarian of the University of California, Berkeley. Before coming to Berkeley, he served as University Librarian and Dean at the University of Southern California. He has written and consulted widely on the library of the future, particularly on issues related to the use of digital and networked information to create uniquely appropriate and powerful tools and collections for research, teaching, and learning. Dr. Lyman will spend approximately two hours per week on the project.

Berkeley Library Management Coordinator: Susan F. Rosenblatt, Deputy University Librarian, has been at UC Berkeley since 1984. She is responsible for guiding the Library's digital access projects. She has chaired the Library's Digital Projects Group, which has directed the Library's research strategy and planned the development of the use of electronic text to provide access to primary source materials. Ms. Rosenblatt will spend two hours per week on the project.

Berkeley Library Database Design Coordinator: Daniel V. Pitti is Advanced Technologies Projects Librarian and Head of the Electronic Text Unit in the Library, University of California, Berkeley. Mr. Pitti is currently serving as the Library Database Design Coordinator on the Berkeley Finding Aid Project and the California Heritage Digital Image Access Project. As such, he has coordinated the analysis of the finding aid document architecture, and is the author of the Finding Aid Data Model and prototype SGML DTD for finding aids. Mr. Pitti is supervising the building of the prototype finding aid database. Mr. Pitti also has analyzed a wide variety of SGML-based tools and has expert knowledge of software developments in this area. In the first year of the Berkeley Finding Aid Project, he gave seventeen presentations describing the project and demonstrating sample electronic finding aids. Presentation venues include the Library of Congress, the National Archives and Records Administration, Museum Computer Network, American Library Association, and the Society of American Archivists. Mr. Pitti serves on the Board of Directors of the Northern California SGML User Group. He also maintains close ties with the Text Encoding Initiative (TEI) and the Center for Electronic Text in the Humanities (CETH). Mr. Pitti is one of two designers of the authority control module in The Library's online bibliographic catalog and maintenance system. He is chair of the American Library Association (ALA) ALCTS CCS/LITA Interest Group on Authority Control in the Online Environment, and has served on a number of ad hoc ALA cataloging committees, including currently serving on the Committee on Cataloging: Description and Access Ad Hoc Task Force on a Natural History Cataloging Forum. Mr. Pitti also is a member of the joint Museum Informatics Project/Library Committee exploring integrated approaches to access, navigation, and control of library, archive, and museum resources on the University of California campus. Mr. Pitti will spend twenty hours per week on the project.

Archival Control Coordinator: Jack von Euw, Head of The Bancroft Library Technical Services, is responsible for the reorganization and management of the newly unified technical services. His previous assignment included managing The Bancroft Library Manuscripts Retrospective Conversion Project and the processing phase of the Preservation and Improved Access of the C. Hart Merriam Papers Project. He is Archival Coordinator for both the Berkeley Finding Aid Project and the California Heritage Project. Mr. von Euw will spend four hours per week on the project.

Evaluation Coordinators: Teri Rinne is Head of The Bancroft Library Public Services and is responsible for managing and reorganizing the Division. She is responsible for evaluation of the Berkeley Finding Aid Project and the California Heritage Digital Image Access Project. Ms. Rinne will spend two hours on the project.

Dr. John Ober has served as Network Resources Librarian and Assistant Professor at the University of California, Berkeley. He was a recent recipient of an ALA Library/Book Fellowship under which he served five months in Africa. He consults and teaches extensively on Internet resources and is co-author of "Crossing the Internet Threshold: An Instructional Handbook." He is responsible for evaluation of the Berkeley Finding Aid Project and the California Heritage Digital Image Access Project, and is project manager for The Library's portion of the "Scholarship from California Net" project, a collaboration between the UC Berkeley Library and the University of California Press, funded by the Mellon Foundation. Dr. Ober will spend two hours per week on the project.

This project assembles a team of collaborators that is highly knowledgeable in the areas of cataloging, bibliographic control, and bibliographic standards; of management and control of library, archive, museum, manuscript, and pictorial collections; and of advanced library technology and system design. Key members of the project teams are experts in their respective fields. Resumes for key personnel, including those in the collaborating institutions, are contained in Appendix G.


  1. Charles C. Jewett, Smithsonian Report on the Construction of Catalogues of Libraries and Their Publication by Means of Separate, Stereotyped Titles (Washington: The Smithsonian Institution, 1853), p. 9. Stereotype printing was the technological development behind Jewett's plan. Instead of metal plates, Jewett intended to use clay. When the plan failed, it was derisively referred to as "Jewett's Mud Catalog."

  2. The National Union Catalog, Pre-1956 Imprints (London: Mansell, 1968), vol. 1, p. vii.

  3. Ibid., p. x.

  4. This account is based on Richard A. Noble's article "The NHPRC Data Base Project: Building the 'Interstate Highway System'" in The American Archivist (The Society of American Archivists), vol. 51, nos. 1-2 (Winter 1988), pp. 98-105.

  5. Ibid., p. 99.

  6. This account is based on the "Foreword" to the Library of Congress National Union Catalog of Manuscript Collections: Catalog 1991 (Washington, D.C.: Cataloging Distribution Service, Library of Congress, 1993), p. vii-ix.

  7. Ibid., p. vii.

  8. For a short history and evaluation of the work of NISTF, see David Bearman, Towards National Information Systems for Archives and Manuscript Repositories: The National Information System Task Force (NISTF) Papers 1981-1984 (Chicago: The Society of American Archivists, 1987).

  9. Quoted from p. 6 of an unpublished paper, entitled "NISTF II: The Berkeley Finding Aids Projects and New Paradigms of Archival Description and Access," presented by Steven L. Hensen at the Berkeley Finding Aid Conference, April 4-6, 1995.

  10. RLIN database statistics were provided by Michael Carroll at the Research Libraries Group and reflect the RLIN database as of May 30, 1995.

  11. Citing a 1980 NHPRC report, Richard Noble reports that Commission staff projected that 20,000 repositories and over 700,000 collection descriptions would be included in a national database. See Richard A. Noble's article "The NHPRC Data Base Project: Building the 'Interstate Highway System'" in The American Archivist (The Society of American Archivists), vol. 51, nos. 1-2 (Winter 1988), p. 100. The finding aids in the Berkeley database average 27 pages in length. If this average is representative, then 700,000 finding aids would amount to nearly 19 million pages of text (700,000 x 27 = 18.9 million). It is worth noting again that after only eleven years there are over 400,000 collection-level records in the RLIN database. Since many of the nation's archival collections have never been processed, arranged, and described, 700,000 may be a conservative estimate.



Copyright © 1997 UC Regents. All rights reserved.
Document maintained at http://sunsite.berkeley.edu/amher by gmontoya@library.berkeley.edu
Graphics Credit: Image Map designed and created by Mary Scott.
Last update 04/07/97. SunSITE Manager: manager@sunsite.berkeley.edu