SGML: California Heritage Digital Image Access Project




                            California Heritage
                      Digital Image Access Project
                                    
                                    
                                    
                   University of California, Berkeley
                               The Library        
                          SUMMARY OF THE PROJECT

The underlying assumption of the California Heritage Digital Image Access Project is that MARC
collection-level cataloging records and standardized, electronic versions of archival finding aids,
used together in the network environment, can provide description, access, and control of
digitized images, thus responding to a pressing need in the library, museum, and archival
communities. The project will demonstrate this by creating and testing a prototype digital image
access system, available on the Internet, based on the Standard Generalized Markup Language
(SGML) finding aid technology developed in the Berkeley Finding Aid Project. The project will
also demonstrate the effectiveness of the advanced search and navigation tools which the SGML
encoded finding aids make it possible to use in the prototype access system. Most importantly, the
California Heritage Digital Image Access Project will create a rich new resource for the scholar
interested in California history and it will strive to serve as a national model for state-based digital
image archive projects. 

The Project will develop its navigation tools in a client/server environment. ANSI/NISO Z39.50,
a standard information retrieval protocol, will be used to facilitate client/server communications.
The Project will provide navigation tools that will allow users to move from USMARC collection-
level records in the online catalog, to SGML-encoded finding aids, and, finally, to a rich database
of 25,000 digital image surrogates of primary source materials documenting the California
Heritage. The images will be selected from the Manuscript and Pictorial Collections of The
Bancroft Library, captured on 35mm film, and then scanned to Kodak Photo CD. The 1024 x
1536 greyscale images will be pulled from the Photo CDs and placed on optical disk for use in
demonstrating and evaluating the project's navigation tools.  Finding aids will be encoded using
commercial SGML authoring tools.

Patrons will use a graphical user interface (GUI) based client to search GLADIS, Berkeley's
online public access catalog (OPAC).  When the user encounters a collection-level record that has
a related finding aid, she will be able to retrieve the finding aid by clicking on a button.  The client
will then launch an SGML browser that will allow the user to navigate through the related finding
aid.  Icons or in-line thumbnail images which represent full images, or groups of images, will be
included in the SGML browser display.  Clicking on the icon or in-line image will launch an image
browser that provides for a full display of the related image or images.  
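
The three-step path described above (catalog record, then finding aid, then image) can be pictured as a chain of lookups. The following is only a minimal sketch under stated assumptions: the class names, field names, and identifiers are hypothetical, and it models neither GLADIS, Z39.50, nor any actual SGML browser.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FindingAid:
    aid_id: str
    title: str
    # Each icon or thumbnail stands for one image or a group of images.
    thumbnails: Dict[str, List[str]] = field(default_factory=dict)


@dataclass
class CollectionRecord:
    record_id: str
    title: str
    finding_aid_id: Optional[str] = None  # None: no related finding aid


def open_finding_aid(record: CollectionRecord,
                     aids: Dict[str, FindingAid]) -> Optional[FindingAid]:
    """Catalog record -> finding aid (the 'button' on the record display)."""
    if record.finding_aid_id is None:
        return None
    return aids.get(record.finding_aid_id)


def open_images(aid: FindingAid, thumbnail: str) -> List[str]:
    """Finding aid thumbnail -> identifiers of the full image or images."""
    return aid.thumbnails.get(thumbnail, [])


# Hypothetical sample data, invented for illustration only.
aids = {"fa-001": FindingAid("fa-001", "Sample pictorial collection",
                             {"thumb-a": ["img-0001", "img-0002"]})}
record = CollectionRecord("rec-001", "Sample pictorial collection", "fa-001")

aid = open_finding_aid(record, aids)
if aid is not None:
    print(open_images(aid, "thumb-a"))
```

A record without a finding aid simply yields no second step, which mirrors the condition in the text that the button appears only when a related finding aid exists.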

The performance of the prototype access system will be evaluated by several different user
groups: a group selected from the foremost research institutions with notable collections of
Californiana, Berkeley's collaborators in the RLG Digital Image Access Project (See Appendix E
below), graduate students in two Berkeley graduate seminars, one in Information Management
and one in the Humanities, and, finally, a group of regular patrons randomly selected in The
Bancroft. The project will evaluate whether SGML-based finding aids are an effective way to
control and to provide access to digital images of pictorial material. It will evaluate the searching,
navigation, pointing, linking, and mapping issues, and the control and display capabilities. Finally,
the project will evaluate the degree to which browsing digitized images can serve as an effective
alternative to browsing original images.


                     T A B L E   O F   C O N T E N T S

                                  NARRATIVE

I. The Need for the Project

[NOT INCLUDED]
II. The Expected Results and Dissemination

III. The Project's Plan of Work

IV. The Background of the Applicant

V. The Project's Staff

Appendices
  Appendix A: Museum Informatics Project
  Appendix B: The Electronic Text Unit
  Appendix C: The Bancroft Library
  Appendix D: The Berkeley Finding Aid Project
  Appendix E: The RLG Digital Image Access Project
  Appendix F: List of Collaborators & Experts
  Appendix G: Letters of Support
  Appendix H: Resumes of Key Personnel
  Appendix I: Samples: Finding Aid, Collection-level Record,
          Search & Navigation Environment



                                 NARRATIVE

I. The Need for the Project.

In recent years the availability of scholarly resources on computer networks has exploded.  A
significant and growing part of this extraordinary growth has been the result of the increasing use
of digital imaging technology to produce electronic surrogates and facsimiles of primary source
materials. In particular, more and more research libraries are experimenting with digitization as a
preservation tool. Some of these experimental projects have made considerable strides toward
showing the value of digital image technology in preserving endangered text and images.  Still,
important questions concerning the effectiveness of digital technology as a preservation tool
remain unanswered (questions of standards for image capture, for instance).  Be this as it may,
the future outlook for the work of Cornell and others seems extraordinarily promising and, as a
member of the Digital Preservation Consortium, Berkeley is looking with a great deal of interest
at the techniques and technology being
developed by Cornell and other leaders in the field.  What has been accomplished in this area has
been sufficient to point the way to a possible digital preservation future but further work on
standards and on enhancements to existing technology is necessary before the digital image can be
established as a primary medium for preservation. Nevertheless, digital preservation activities
promise to guarantee an ever-increasing flow of digital imagery into the nation's computer
networks. It is clear that these developments, in conjunction with others, have brought us to the
brink of some very serious and far reaching questions on how access and control can be achieved
economically and effectively in the digital and network environments. Navigation in the vast and
growing sea of digital information is still in an exceedingly primitive condition, Gopher, WAIS,
and World Wide Web notwithstanding.  This juxtaposition between the explosion of digital
resources, on the one hand, and the backward state of access, on the other, has only served to
heighten the growing debate in the library community concerning the principles of access that are
currently in use in libraries throughout the world.  The access and control of digital computer
facsimiles and surrogates is, in our opinion, the term of the preservation and access equation most
in need of solution at the present time. 

Researchers at Berkeley have been exploring one of the possible avenues for gaining access to and
control of digital images in the network environment, one which exploits both proven traditional
practices and a platform-independent, leading-edge technological application. Much of the
imagery that will be converted to digital form in the future is currently housed in archival
collections around the world. Access to and control of these collections is provided using
collection-level cataloging records and archival finding aids. Archival cataloging resources should
not be overlooked in the quest for electronic access and control on the world's computer
networks.  The Berkeley Finding Aid Project (See Appendix D) has been working to demonstrate
that it is
feasible and useful to create a standard for electronic archival finding aids and it has been
successful in developing both a data model and a Standard Generalized Markup Language
Document Type Definition (SGML DTD) for finding aids. The underlying assumption of the
current proposal is that USMARC collection-level cataloging records and standardized, electronic
versions of archival finding aids, used together, can provide control, access, description, and
advanced search and navigation tools for collections of digitized images on the network, thus
responding to a pressing need in the library, museum, and archival communities. 

The California Heritage Digital Image Access Project will address digital image access and
control issues by creating, testing and making available on the Internet a prototype digital image
access system based on the SGML finding aid technology developed in the Berkeley Finding Aid
Project. This prototype will use USMARC AMC collection-level records and Standard
Generalized Markup Language (SGML) encoded finding aids to provide access to and control of
a rich database of digital images of primary source materials documenting the California Heritage.
The images making up the California Heritage database will be taken from the Manuscript and
Pictorial Collections of The Bancroft Library. The performance of the prototype access system
will be rigorously evaluated by three distinct user groups (See Section III, Step 6). This project
will not only offer a compelling solution to the access and control problems it identifies but also
serve as a national model for the construction of the kind of state historical digital image database
it will create.  It will also serve as a demonstration system for archival control professionals
interested in state-of-the-art solutions to the problems associated with network access to digital
data.  Most importantly, the California Heritage Digital Image Access Project will create a rich
new resource for the scholar interested in California history.


    1. The Problem of Access: The Economic Argument

In the brave new world of networked computer catalogs and digital access to pictorial images, a
number of librarians and archivists have seen an opportunity to extend the power of USMARC
beyond its application to printed materials to pictures and artifacts.  The attractions inherent in
the provision of item-level access to non-print media, with the full complement of subject and
thesaurus term access, are undeniable, and certainly, for selected items, this kind of access will
continue to prove an invaluable aid to advanced scholarly research. It is generally recognized,
however, that one of the greatest obstacles to accessing images, digital or otherwise, is the sheer
volume of images waiting for access of any kind. The Bancroft Library estimates that its Pictorial
Collections contain approximately 3.5 million images. Virtually none of these can be considered
fully cataloged by existing standards for bibliographic control (AACR2, USMARC VM). To
consider accessing them all at the item-level in the electronic environment using the existing
bibliographic standards is practically unthinkable.  It has been estimated that to catalog all of The
Bancroft's pictorial holdings at the item-level using existing standards would take the entire
original cataloging staff of the University Library at Berkeley, if they did nothing else, something
approaching 400 years and $400 million. This is not to say that anyone in The Bancroft is calling
for item-level cataloging for every image and every sheet of paper in its collections. What such
figures make clear to all of us is that a number of options beyond full USMARC item-level
cataloging are needed in dealing with such a monumental access and control problem or we will
find our cataloging practices at odds with the developing technology. The Library at Berkeley has
looked at several image access systems which offer their own versions of logging photographs at
the item-level and is currently participating in a project where one such commercially devised
system, among other access solutions, is being tested as the primary mode of access to digital
images (For a description of Berkeley participation in the RLG Digital Image Access Project see
Appendix E). Our experience here has suggested that the amount of labor involved in even the
simplest item-level access system seems prohibitive. What is needed is a flexible approach that
allows for the use of a number of different modes of access integrated together in the network
environment. 
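
The scale of the estimate quoted above can be made concrete with simple division. The sketch below uses only the three figures given in the text (3.5 million images, 400 years, $400 million); the per-image and per-year rates are derived values, not figures stated in the estimate itself.

```python
# Figures quoted in the text above.
total_images = 3_500_000
total_years = 400
total_cost_dollars = 400_000_000

# Derived rates (simple division; not stated in the original estimate).
cost_per_image = total_cost_dollars / total_images
images_per_year = total_images / total_years
cost_per_year = total_cost_dollars / total_years

print(f"about ${cost_per_image:.2f} per image")
print(f"{images_per_year:,.0f} images cataloged per year")
print(f"${cost_per_year:,.0f} per year")
```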


    2. The Marriage of the Archival and the Bibliographic Models

Fortunately such an approach exists and has been in use in archives the world over for a long
time. This is the archival, collection-level cataloging model (along with archival finding aids),
which is described in Archives, Personal Papers, and Manuscripts, 2nd edition, by Steven L.
Hensen (APPM). Large archive, manuscript, and museum collections have historically presented
complex access and control problems. Chief among these is that limited human and material
resources prohibit the expensive option of cataloging each item individually. The Bancroft
Library's Sierra Club archives, consisting of nine related collections comprising over one million
items, well illustrates this basic economic problem. The traditional means of overcoming this
problem is to describe a large collection of related items in a single bibliographic record. A
collection of one thousand related items, for example, may be cataloged in one bibliographic
record rather than in one thousand. Such generalized description, however, will only lead a
researcher to a collection of items which may have individual items relevant to his research. In
order to help the researcher determine the relevance of items in a collection, libraries and
museums have provided an assortment of additional inventories, registers, indexes, and guides,
which are generally called finding aids. These finding aids provide detailed description of
collections, their intellectual organization and, at varying levels of analysis, of individual items in
the collections.
 
Finding aids are hierarchically structured, proceeding in defined stages, from the general to the
specific. At the most general level, they roughly correspond in scope to collection-level catalog
records. At the most specific level, they briefly identify individual items in the collection. In
between, they describe subsets or series of related items. Though possible, description of and
access to individual items is frequently not provided. Decisions concerning depth of analysis are
based on a variety of intellectual and economic factors. Prior to the advent of computers, the
traditional media for finding aids have been typescript, printed book, and microfilm. Typed and
printed finding aids vary in length from a few pages to a thousand or more pages, though they
average under a hundred pages in length. Researchers are led to finding aids through traditional
catalog records which briefly describe collections as a whole and note the availability of more
detailed finding aids. Access to the finding aid is essential for understanding the true content of a
collection and for determining whether it is likely to satisfy the scholar's research needs (i.e.
whether a visit to the holding library to consult the collection directly is justified). In short, the
finding aid itself is an important information resource for scholarly research (For a sample of a
finding aid, see Appendix I). 
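
The general-to-specific structure described above is essentially a tree. As a minimal sketch, assuming illustrative level labels ("collection", "series", "item") and an invented sample collection, it might be represented and traversed as follows; this is not the Finding Aid Data Model of Appendix D.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Component:
    level: str   # illustrative labels: "collection", "series", "item"
    title: str
    children: List["Component"] = field(default_factory=list)


def walk(node: Component, depth: int = 0) -> Iterator[Tuple[int, Component]]:
    """Yield (depth, component) pairs from the general to the specific."""
    yield depth, node
    for child in node.children:
        yield from walk(child, depth + 1)


# Hypothetical finding aid, invented for illustration only.
aid = Component("collection", "Sample family papers", [
    Component("series", "Correspondence", [
        Component("item", "Letter, undated"),
    ]),
    # A series described as a whole, with no item-level entries.
    Component("series", "Photographs"),
])

for depth, comp in walk(aid):
    print("  " * depth + f"[{comp.level}] {comp.title}")
```

The second series carries no children, reflecting the point that item-level description, though possible, is frequently not provided.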
 
In archival cataloging, as described in APPM, the whole is greater than its parts. Content and
context are more important than the physical description of items but the possibility of providing
item-level description judiciously is not eliminated. Collection-level cataloging records and finding
aids, each at its own level, provide a mapping of intellectual content, not a strict, physical
inventory of minutely described objects.  This conceptual orientation makes archival cataloging
principles singularly appropriate for the new electronic environment.  What is needed from an
electronic cataloging system, when it contains digital facsimiles and surrogates of the items
themselves, is the ability to provide a context for the items and to point to collections of
intellectually related items. This is precisely what archival collection-level records and finding aids
have always been designed to do.  It is for this reason that researchers working on the Berkeley
Finding Aid Project have recognized that the archival finding aid rendered in a standard, platform-
independent, electronic form provides a key, and currently missing, layer of flexible access and
control for digital imagery in the complex environment of the Internet.

    "For all their usefulness, it is nonetheless a fact that MARC-based online library catalogs are
    based on a database structure that is almost 30 years old... In reaction to this concept of the
    catalog evolving into a window or gateway to other more detailed sources of information 
    ...the MARC/AACR2 catalog record... instead can serve as simply one level of descriptive
    detail in a system of hierarchical pointers that leads inexorably from index terms to full text.  In
    such a system the user would approach it in the usual way, looking for information on material
    on a given subject, person, etc. Moving from subject headings or index terms to full MARC
    cataloging records to ever increasing levels of detail, this user would then be faced with such
    additional options as moving from the catalog record to abstracts, other indexes, tables of
    contents, etc. and eventually to digitized images of the collection material itself--all without
    leaving the computer terminal.

    This is a particularly happy development for archivists who, with their guides, finding  aids,
    and records series descriptions, already have in place the sort of hierarchically intermediate
    descriptions that would logically sit in such a system between catalog records and full text....It
    is from this idea that the concept of the 'virtual library' has sprung."

Collection-level cataloging records and finding aids, then, are generally accepted in the national
and international archival community as the best, least costly method of providing description and
access to large collections of materials. In the electronic environment, the collection-level record
in the OPAC can truly act, as Hensen says, as a gateway to another level of description and
access, if it can be linked with an encoded version of a finding aid, which, in turn, is linked to
images of items it describes.  The California Heritage Digital Image Access Project, taking up
where the Berkeley Finding Aid Project left off, will demonstrate that the full application of this
archival access model in the network environment -- that is the systematic combination of the
USMARC collection-level catalog record with the SGML encoded finding aid -- can provide
access to and control of digital surrogates of endangered pictures and manuscripts that meets all
of the demands of the network environment by providing sophisticated indexing and navigation
and by being flexible, functional, durable, and, most important of all, economical. 


    3. The Berkeley Finding Aid Project.

On October 1, 1993, a Berkeley library research team began work on a research and
demonstration project funded by a grant under the College Library Technology and Cooperation
Grants Program of the Higher Education Act, Title II-A. The Berkeley Finding Aid Project grew
out of the recognition that to take full advantage of the opportunities presented by the rapid
growth of the Internet, the archive, museum, and library communities needed to develop and
embrace standards to ensure that scholarly communication is both useful and enduring. With a
standard for collection-level cataloging records already developed and widely accepted, the next
logical step in providing full online description and access to research collections was the
development of an encoding standard for the finding aid. The Project involved two interrelated
activities: creating a prototype encoding standard for finding aids and building a prototype finding
aid database.

The prototype encoding standard now being tested takes the form of a Standard Generalized
Markup Language (ISO 8879) Document Type Definition (SGML DTD), which has been
developed in collaboration with leading experts in collection processing, collection cataloging,
text encoding, system design, network communication, authority control, text retrieval, text
navigation, and computer imaging (For a full discussion of the reasons why SGML was chosen
see Daniel Pitti's paper, Sharing the Wealth: A Description of the Berkeley Finding Aid Project in
Appendix D below). Project participants have analyzed the structure and function of
representative finding aids. They have isolated the basic elements occurring in them and defined
their logical interrelationships. (For copies of the Finding Aid Data Model and the Finding Aid
SGML DTD see Appendix D below) 


    4. The Next Step: The Need to Move Beyond The Berkeley Finding Aid Project.

While the Berkeley Finding Aid Project has focused only on creating a prototype encoding
standard for machine-readable finding aid text, it clearly sets the stage for a future in which
collection-level USMARC catalog records provide hypertext links to finding aids, and finding
aids, in turn, provide not merely access and control of off-line archival items, but also direct online
hypermedia links to computer surrogates or representations of primary source materials that
currently exist in a variety of native formats: pictorial materials, graphics, three-dimensional
objects, manuscripts, typescripts, printed text, sound recordings, motion pictures, and so on. The
intersections between these various forms of information will be traversed by the click of a mouse
or by entering a simple command. Such an information environment will enable users to satisfy
their information needs in a coherent, integrated, easily navigated information environment.
 
Before digital representations can serve as preservation surrogates, there is a clear need to
demonstrate that effective and persistent access and control are attainable. It is the goal of the
California Heritage Digital Image Access Project to demonstrate an affordable and technologically
feasible solution to this need. 
 
The undoubtedly attractive though complex information scenario sketched above presents a
number of technical challenges and problems that must be overcome in order to demonstrate its
effectiveness for providing access, navigation, and control of preservation digital surrogates. The
California Heritage Digital Image Access Project will thoroughly investigate the following
challenges, and provide prototype solutions. Where different solutions are found feasible, the
Project will select what appears to be the best solution, while documenting the alternatives.
 
    i) Using an ANSI/NISO Z39.50 Client/Server network environment, demonstrate the
    capability of a system of USMARC collection-level records to provide direct online access to
    SGML encoded finding aids, and the capability of the finding aids to provide direct online
    access, navigation, and control of digital preservation surrogates and representations of
    primary source materials existing in a variety of native formats.
 
    ii) Demonstrate the feasibility of using Uniform Resource Names (URNs) and Uniform Resource
    Locators (URLs) in USMARC records for providing persistent direct access to and retrieval of
    SGML encoded finding aids.
 
    iii) Investigate the relationship and compatibility of emerging URN and URL standards with
    SGML "entity declarations and references," and, what relationship, if any, the emerging de
    facto standards for "entity management" in the SGML software development community have
    with proposed URN/URL solutions. "Entity declarations and references," along with
    "notation" declarations, are the means prescribed in SGML for pointing to non-SGML digital
    objects. This investigation will lead to designing a standard, uniform approach to providing
    persistent links or pointers from SGML encoded electronic text finding aids to digital entities
    in a wide variety of notations. Such links will need to be designed in a manner ensuring durable
    or persistent control and access; which is to say, control and access that can survive migration
    across hardware and software platforms. In addition to addressing issues connected with
    identifying, locating, and retrieving digital objects, the Project will study the need to locate and
    launch software that will render the objects into a form intelligible to human beings. Project
    designers will also look closely at what role, if any, HyTime (Hypermedia/Time-based
    Document Structuring Language (ISO/IEC 10744)) might play in solving these problems. 
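
To illustrate the entity and notation mechanism named in item iii, the sketch below generates SGML NOTATION and external NDATA ENTITY declarations from a small table. The entity name, file locations, and notation name are hypothetical, and the sketch says nothing about how the Project's DTD or any entity manager actually organizes such declarations.

```python
from typing import Dict, Tuple


def notation_declaration(name: str, system_id: str) -> str:
    """Declare a non-SGML data notation, e.g. an image format."""
    return f'<!NOTATION {name} SYSTEM "{system_id}">'


def entity_declaration(name: str, system_id: str, notation: str) -> str:
    """Declare an external, non-SGML (NDATA) entity in a given notation."""
    return f'<!ENTITY {name} SYSTEM "{system_id}" NDATA {notation}>'


def declarations(entities: Dict[str, Tuple[str, str]],
                 notations: Dict[str, str]) -> str:
    """Emit notation declarations first, then the entity declarations."""
    lines = [notation_declaration(n, s) for n, s in sorted(notations.items())]
    lines += [entity_declaration(e, loc, nota)
              for e, (loc, nota) in sorted(entities.items())]
    return "\n".join(lines)


# Hypothetical names and locations, invented for illustration only.
notations = {"photocd": "Kodak Photo CD image"}
entities = {"img0001": ("images/img0001.pcd", "photocd")}
print(declarations(entities, notations))

# After a migration only the system identifier changes; the entity name
# that the finding aid text refers to stays the same.
entities["img0001"] = ("archive/vol2/img0001.pcd", "photocd")
print(declarations(entities, notations))
```

Because the encoded text refers to the entity by name, moving the file requires changing one declaration rather than every reference, which is one reading of what a durable link would require.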
 
A possible by-product of this development will be an opportunity to study the following
hypothesis, which is widely held to be true but is as yet empirically untested: By making surrogates
of the most used portions of our collections available in digital form on the Internet, we can
simultaneously increase access while limiting the actual need to physically handle endangered
collections. In the future electronic environment, access and preservation need not be, as they
have been in the past, in conflict.


    5. World Wide Web, HTML, and Mosaic: Can They Provide Access and Control?

In the last few months, many people have had the pleasure of experiencing hypertext and
hypermedia through Mosaic, a public domain World Wide Web client developed at the National
Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-
Champaign. Mosaic/WWW uses a text encoding system called Hypertext Markup Language
(HTML). Mosaic, WWW, and HTML represent major steps forward in network information
access and navigation. As important as Mosaic, WWW, and HTML have become, however, we
do not believe that HTML will serve as a viable long-term encoding standard for library, museum,
and archival finding aids.
  
Mosaic and HTML, the software and the underlying data structure to which it is bound, represent
a limited implementation of the SGML standard. Mosaic's HTML is an SGML DTD. The HTML
DTD tag set is small, and is primarily intended to serve online display and networked hypertext
links. HTML, contrary to the spirit of SGML, is thus more procedural than descriptive. Hence the
markup will not support a wide range of applications, many of which are essential to providing
control of, access to, and navigation of our nation's cultural treasures. 

Control, access, and navigation of library, museum, and archival collections in the large-scale
information environment of tomorrow's network will require meta-data that is descriptively rather
than procedurally marked. In order for information to control and provide access to other
information, we need to be able to identify key control data for computer manipulation. This is a
lesson clearly demonstrated by large bibliographic databases of millions of USMARC encoded
records. Our automated systems require structured information to facilitate sophisticated indexing
and access, inter- and intra-document navigation, unique identification of information bearing
objects, location information, and the like. In a database of hundreds of thousands of finding aids
providing access to millions of images, we will need to identify subject and genre terms related to
the images, as well as associated personal, corporate, and geographic names. Such structure will
be necessary if we hope to link subjects and names to national and international thesauri and name
authority files. Reliable and effective name and subject access in the large-scale, integrated
information environment of the Internet will surely be necessary for successful searching and
navigation. 
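
The value of descriptive markup for indexing can be shown with a small sketch. It assumes a well-formed, XML-style fragment (so that a standard XML parser can stand in for an SGML parser) and hypothetical element names such as persname, geogname, and subject; the sample names and identifiers are invented. Procedural markup that merely set the same names in bold type would give a program nothing to distinguish a person from a place.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict, Set, Tuple

# Hypothetical markup and data, invented for illustration only.
SAMPLE = """
<findaid>
  <series title="Portraits">
    <item id="img-0001"><persname>Doe, Jane</persname> at
      <geogname>Sample Valley</geogname></item>
    <item id="img-0002"><persname>Doe, Jane</persname>,
      <subject>Mining</subject></item>
  </series>
</findaid>
"""

ACCESS_POINTS = ("persname", "geogname", "subject")


def build_index(xml_text: str) -> Dict[Tuple[str, str], Set[str]]:
    """Map (element name, heading) to the ids of the items it describes."""
    index = defaultdict(set)
    root = ET.fromstring(xml_text)
    for item in root.iter("item"):
        for el in item.iter():
            if el.tag in ACCESS_POINTS and el.text:
                index[(el.tag, el.text.strip())].add(item.get("id"))
    return dict(index)


index = build_index(SAMPLE)
print(sorted(index[("persname", "Doe, Jane")]))
```

Headings gathered this way could then be matched against name authority files or thesauri, which is the kind of linking the paragraph above has in view.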

The recent departure of the entire NCSA Mosaic development team for a new commercial
corporation headed by the president of Silicon Graphics Incorporated (SGI) highlights another
major concern; namely, the durability or persistence of our encoded finding aids. This
development raises serious questions about the future of the public domain status of Mosaic
software, and the future development of the current software. The library, museum, and archival
communities have learned that creation of meta-data is an essential and very expensive
undertaking. Once the meta-data is captured, we must be as assured as possible that the
information we have created is not dependent on the existence of any one piece of hardware or
software. We cannot afford to recreate the information.

The California Heritage Digital Image Access Project will use the proven structure of USMARC records
and an SGML Document Type Definition devoted exclusively to library, museum, and archival
finding aids as well as emerging and established standards for linking complexes of digital text and
extra-textual information (URN/URL, SGML entity management, HyTime) to create the
necessary architecture to support sophisticated navigation of text and images in a Z39.50
client/server network environment. We believe that this commitment will ensure both the long-
term utility and the durability of the finding aids we create. In order to ensure the broadest access
possible in the immediate future, we will map the finding aid information into HTML as well.
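
As a rough illustration of what mapping finding aid information into HTML could involve, the sketch below renders a hierarchical description as nested HTML lists. The (title, children) representation and the sample titles are assumptions made for the example, not the Project's mapping.

```python
from html import escape
from typing import List, Tuple

Node = Tuple[str, List["Node"]]  # (title, children); hypothetical structure


def to_html(node: Node) -> str:
    """Render one component and its descendants as a nested list item."""
    title, children = node
    html = f"<li>{escape(title)}"
    if children:
        html += "<ul>" + "".join(to_html(c) for c in children) + "</ul>"
    return html + "</li>"


# Hypothetical finding aid, invented for illustration only.
sample: Node = ("Sample family papers", [
    ("Correspondence", [("Letter, undated", [])]),
    ("Photographs & prints", []),
])

print("<ul>" + to_html(sample) + "</ul>")
```

The output preserves the hierarchy for display but not the descriptive distinctions of the source, which is why such a mapping is treated here as a derived view and not as the encoding of record.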

    6. Related Work on the Problem

Although there has been considerable recent effort in the archival community to provide network
access to archival finding aids, Berkeley was the first to investigate the use of SGML to encode
finding aids so that they can function as part of a dynamic, layered archival access system and, for
the moment anyway, is doing the only systematic study of this SGML application. There has been
a great deal of interest in the archival community in this work, especially since Daniel Pitti
delivered his paper on the Berkeley Finding Aid Project at the September 1993 Annual Meeting of
the Society of American Archivists held in New Orleans. This interest has grown more intense
recently following the successful demonstration, at the May 10-11, 1994 meeting of the RLG
Digital Image Access Project in Austin, Texas, of one of the Project's first sample SGML encoded
finding aids as an access tool for actual digital images. 

Most of the activity in providing access to finding aids over the network seems to be limited at the
present time to attempts to mount them on Gophers. The results of a poll on "Finding aids online"
sent out by La Vonne Gallo over the RLGAMSC Listserv showed that in 27 major university and
research libraries the overwhelming activity in finding aid automation has focused on this effort. A
number of the respondents (Dartmouth, University of Michigan, Rutgers, and Stanford's Hoover
Institution) professed interest in the SGML approach being taken in the Berkeley Finding Aid
Project. A few others, notably Stanford and the University of Chicago (both of which have
recently joined The Berkeley Finding Aid Project's list of collaborators), had projects to access
finding aids using something other than Gopher. 

By far the most interesting archival access project we are aware of, and the one that is most
closely related to the approach suggested in this proposal, is the University of Maine System
Libraries' Multimedia Access to Archival Resources Project. In October 1992 University of Maine
System Libraries began a project to enhance the traditional automated library system by providing
multimedia access to three major collections of archival research materials: The Maine Folklife
Center and Fogler Library Special Collections of the University of Maine Orono campus, and the
Acadian Archives/Archives acadiennes at the Fort Kent campus. The purpose of the project was
to explore the use of digital image technology to expand access to, and preserve from
deterioration, unique resources of three prominent archival collections on Maine and the Canadian
Maritime Provinces.

Electronic access via local and wide area networks to the collections selected for the project has
been enhanced and expanded by applying optical scanning technology to capture text and images
of collection finding aids and, when appropriate, original source materials in digital form. These
facsimile documents have been added to the Archives Image Database which contains both bit
mapped images and ASCII text files, and are stored on optical disc.  Associated text documents
(ASCII files) are linked to collection-level bibliographic records in the online catalog and may be
retrieved and displayed from any point on the Internet. Once a document is selected, choosing the
option to display text retrieves the ASCII file of the image document for perusal in the online
catalog. Network access to the image and text database is provided via a client/server architecture
that enables the transmission of images to common computer platforms, including the Apple
Macintosh and the IBM PC (with proprietary software).

The archives database consists of collection-level records using the Archives and Manuscripts
Control format (AMC format). At the present time, associated text documents are primarily
finding aids which are linked to the collection-level bibliographic records and may be retrieved
and displayed at online catalog terminals from any point on the network. The Archives Image
Database is one of the first such efforts, especially in the use of full text retrieval software.  The
project is particularly innovative in its approach of linking the bibliographic record in the online
catalog to associated finding aids or other document texts in the image database. 

We believe that the client/server approach and the linking of finding aids to collection-level
records represent the right path for the development of the archival cataloging system of the
future. The Maine project has clearly made an important contribution in this area. The California
Heritage Digital Image Access Project will pay close attention to Maine's experience with its own
client/server system. Nevertheless, what is being proposed here advances beyond the Maine
project in two important ways. First, Maine system planners chose not to structure their finding
aid text using SGML. Finding aids are meant to provide access and control for collections and the
electronic environment makes it possible to link finding aids directly to the materials they
describe. But in the Maine project flat ASCII finding aids are not coded for use as meta-data for
the control of the materials they describe. Flat ASCII files linked to page images of finding aids
cannot provide access to and control of exogenous digital representations of primary source
materials. The second key difference is related to the first. Without SGML encoding, finding aids
cannot be used as enhanced searching and navigation tools in the archival database as a whole.
The Berkeley approach emphasizes finding aids as a dynamic part of a layered access system and
exploits their true relationship to both the collection-level records and the images of items within
archival collections to make it possible to move between them with ease. 
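
The difference can be illustrated with a minimal sketch in Python. It is not part of the
Project's software. The element names (findaid, series, item, caption, image), the sample
captions and file names, and the helper function find_images are all hypothetical and are
not taken from the Berkeley Finding Aid DTD. Well-formed XML stands in for SGML here only
so that a standard parser can read the fragment.

```python
import xml.etree.ElementTree as ET

# Hypothetical finding-aid fragment. Element and attribute names are
# illustrative only; they are NOT those of the Berkeley Finding Aid DTD.
FINDING_AID = """
<findaid collection="Example Pictorial Collection">
  <series title="San Francisco">
    <item id="i1">
      <caption>Market Street after the 1906 earthquake</caption>
      <image href="images/i1.tif"/>
    </item>
    <item id="i2">
      <caption>Ferry Building, circa 1900</caption>
      <image href="images/i2.tif"/>
    </item>
  </series>
  <series title="Yosemite">
    <item id="i3">
      <caption>Half Dome from Glacier Point</caption>
      <image href="images/i3.tif"/>
    </item>
  </series>
</findaid>
"""


def find_images(root, term):
    """Return (series title, caption, image reference) for every item
    whose caption contains the search term (case-insensitive)."""
    hits = []
    for series in root.findall("series"):
        for item in series.findall("item"):
            caption = item.findtext("caption", default="")
            if term.lower() in caption.lower():
                href = item.find("image").get("href")
                hits.append((series.get("title"), caption, href))
    return hits


root = ET.fromstring(FINDING_AID)
results = find_images(root, "earthquake")
for series_title, caption, href in results:
    print(series_title, "|", caption, "|", href)
```

Because each entry carries both its place in the hierarchy and a pointer to the image, a
search result can lead down to the image and up to the series that gives it context. A flat
ASCII file offers no comparable handles.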

Another project, which will address a number of the network access issues that interest us, is one
currently being proposed to NEH by Stanford University's Green Library. The Telaraña Project
will develop network access tools from the logical structure resulting from the arrangement and
description of primary source materials. It will use these tools to build finding aids and cataloging
records and to retrieve and navigate digital preservation copies of the materials on World Wide
Web. Among other things, the project will look at how to structure and encode information for
multiple uses. The emphasis on archival access, digital imagery, and the structure of data that goes
into finding aids are all of great interest to us. Where we differ in our approaches to a number of
issues (such as structuring data, pointing to digital images, and use of SGML, WWW, HTML,
and Mosaic), we can learn from one another. Stanford is collaborating in the Berkeley Finding Aid
Project and will collaborate on the California Heritage Digital Image Access Project. Berkeley is
reciprocating by supporting the work of the Telaraña Project. Both Stanford and Berkeley will
assist one another in evaluating their respective projects and will also share project results.

A major initiative in the area of standardization of DTD's for types of SGML structured texts is
the Text Encoding Initiative (TEI). The Berkeley Finding Aid Project developed a collaborative
relationship with TEI during the period when the Finding Aid DTD was being developed. C.
Michael Sperberg-McQueen, editor-in-chief of TEI, has expressed interest in the possibility of
incorporating the Project's Finding Aid DTD into the TEI suite of DTD's. Daniel Pitti recently
demonstrated a sample finding aid linked to digital images at a Center for Electronic Texts in the
Humanities (Rutgers University/Princeton University) Workshop on Documenting Electronic
Texts in the Humanities held in Somerset, New Jersey on May 16-18, 1994. The Workshop
brought together representatives from the TEI, academic publishers, academic computing centers,
and research libraries to discuss the relationship between TEI guidelines for electronic text
documentation and USMARC catalog records. Workshop participants showed considerable
interest in the demonstration of Berkeley's approach to image access.

While we have so far been unable to obtain detailed information, we have learned from Jim Bower
of The Getty Art History Information Program that the Art Information Task Force (AITF) has
developed "Categories for the Description of Works of Art." The AITF has hired a consultant at
Boston University to develop a prototype SGML DTD based on the model. We have examined an
early version of the DTD distributed at a recent meeting of Computer Interchange of Museum
Information (CIMI). At the lower levels of its structure, it strives to be TEI compliant. We intend
to follow this development closely, and, if possible, to ensure that our development activities are
complementary.

An electronic publication that is currently under production at Luna Imaging Inc. does touch on a
number of access and navigational issues that concern us in this project, though within the strict
limitation of a CD-ROM electronic "book" format.  In a recent visit to Berkeley to deliver a
lecture, Michael Ester, late of the Getty Art History Information Program (AHIP) and currently
President of Luna Imaging Inc., described the Frank Lloyd Wright electronic book (5,000
drawings) which Luna is doing. The model for this project is the reference book, to be used by
one person at a time. But the "book" employs the archival cataloging and finding aids from the
Frank Lloyd Wright Collections in the form of an SGML DTD to create a textual "wrapper" that
provides description and access for the images from the collections. This allows for exploitation
of the archival cataloging information to provide sophisticated searching and navigation. The
"book" is a commercial publication, which Luna expects libraries and scholars to buy at the
projected $1,000 price. It will not be available on the network. Nevertheless, it represents a very
positive development in digital access to archival materials. It was also heartening to hear an
authority like Michael Ester tout, in his lecture, SGML encoded text as the best way to provide
access to digital images, particularly when the source of the encoded text is existing non-MARC
archival description and access tools.

In yet another related effort, Professor Ray Larson, of the School of Library and Information
Science at the University of California, Berkeley, is developing, in a project supported by a
Department of Education Title IIA grant, a second generation of Cheshire, a prototype library
catalog featuring advanced natural language retrieval. Cheshire features, among other things,
probabilistic weighting and relevance ranking of subject retrieval sets. In this iteration of Cheshire,
Professor Larson and his graduate assistants are mapping USMARC bibliographic records into an
SGML USMARC DTD and are developing an SGML aware, Z39.50 compliant client/server
database. Professor Larson has been a collaborator of the Berkeley Finding Aid Project. The
Library Systems Office will share information and, if possible and necessary, programming code
with Professor Larson and his assistants.
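
How relevance ranking differs from simple Boolean matching can be suggested with a minimal
sketch in Python. It uses a generic term-frequency and inverse-document-frequency weighting,
which is a stand-in and not the probabilistic model that Cheshire implements. The sample
records and the function name rank are invented for illustration.

```python
import math
from collections import Counter

# Invented sample records standing in for catalog subject descriptions.
RECORDS = {
    "r1": "gold rush mining camps california",
    "r2": "san francisco earthquake and fire photographs",
    "r3": "california railroads and mining transportation",
}


def rank(query, records):
    """Return record ids ordered by descending score, where the score is
    the sum over query terms of (term frequency * inverse document
    frequency). Records that match no query term are omitted."""
    tokens = {rid: Counter(text.split()) for rid, text in records.items()}
    n = len(records)
    scores = {}
    for rid, counts in tokens.items():
        score = 0.0
        for term in query.split():
            # Number of records in which the term occurs at all.
            df = sum(1 for c in tokens.values() if term in c)
            if df == 0:
                continue
            # Terms found in fewer records receive a larger weight.
            score += counts[term] * math.log(1 + n / df)
        if score > 0:
            scores[rid] = score
    # Highest score first; ties are broken by record id.
    return sorted(scores, key=lambda rid: (-scores[rid], rid))


ranking = rank("gold mining", RECORDS)
print(ranking)
```

A Boolean search for "gold AND mining" would return only the first record, and "gold OR
mining" would return two records in no particular order. A ranked search returns both
matching records with the closer match first.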

Although it is remote in intent and design from what is being proposed here, we have watched
with interest Cornell University's collaborative efforts to provide enhanced access to library
materials using digital technology. In the highly successful "Preserving Archival Material Through
Digital Technology," Cornell University in collaboration with Xerox Corporation and the Eleven
Comprehensive Research Libraries demonstrated that it could cost-effectively produce high
quality replacement copy for brittle books, in print and microfilm media, from digital versions
created by scanning the originals and storing the images in its CLASS System. Although the
preservation capability achieved in this very interesting project is astonishingly good, what
interests us most in the context of our own research is the way Cornell has used the electronic
versions of the preserved books to create a digital library accessible on the Internet.  Though
SGML is not being employed in it, Cornell's work recognizes the importance of defining the
internal structure of documents in order to make them fully navigable.  Nevertheless, the structure
of Cornell's electronic documents is limited by the mode of their production; that is, preservation
reproduction of printed books. A user can search a book by author, title, or catalog identification
number. Once he or she has selected a book, he or she can retrieve the book and navigate through
hierarchically structured paths modeled on the logical structure of most books (table of contents,
chapter, page).  The user is limited, almost exclusively, to known item searching, since actual
searchable text as opposed to page images is very limited. The system is designed primarily to
support production of printed facsimiles rather than the navigation of the electronic versions of
the text. Cornell planners astutely recognize the importance of structure in electronic texts, and
are working, in each new phase of their project, to provide more of it. As impressive as the entire
Cornell preservation and access enterprise is, it is clear that it is just beginning to address the
problems connected with sophisticated searching and access of digital surrogates and facsimiles
on the network. In concentrating on preserving books, it has not addressed the access and control
problems associated with archival collections. Nor has it investigated the role of SGML in
providing platform independent structure to documents. These are precisely the things that
concern us in this proposal.

These projects suggest a number of avenues to take in solving the problems of control and access
in the network environment but, with the exceptions of the University of Maine and Stanford
University, none of them has recognized the potential of traditional archival description and
control tools to play a key role in helping users navigate the oceans of electronic data becoming
available on computer networks. By and large, we have found that interest in SGML, as a means
for providing platform independent access and navigation tools to primary source materials,
though growing rapidly, is still in its infancy. Most of the SGML activity, not surprisingly, is in the
area of electronic publishing. Currently, the Berkeley Library and the University of California
Press are involved in a collaborative effort to mount SGML encoded journals and monographs on
the network. In addition to the Berkeley Finding Aid Project, the Library is also actively involved
with other campus and regional communities interested in SGML. The Library maintains a listserv
to disseminate information concerning SGML related activities in the northern California area.
The listserv is used by the Northern California SGML User Group (NCSGMLUG) and the
Northern California CALS Interest Group. Daniel Pitti sits on the Board of the NCSGMLUG.
The Library has also actively and successfully sought software and technical support from SGML
oriented vendors. Noteworthy in this regard are Xerox, Electronic Book Technologies
(Providence, Rhode Island), and ArborText (Ann Arbor, Michigan). 
 

    7. California Heritage

The technical problems addressed in this Project are highly significant and the solutions it will
advance are, we think, compelling. Nevertheless, as mindful as we are of this aspect of our work,
we are even more mindful of what it is we are trying to provide access to, using this technology.
Across the country the new information technology is stimulating something far more interesting
than its own development. Projects with names like American Memory and The Making of
America are taking the first steps in a vast, collaborative effort to extend our national self-
knowledge. Creating and connecting the sources of our knowledge is a grand American
enterprise. It is fitting that it begins, but does not end, with a machine, a technology. Providing
free and universal access to the sources of information that define us is the really exciting thing
about this Project and the many like it that are springing up so exuberantly. Finding a way to add
the fullest record of the California Heritage to the greater enterprise of recording the American
Heritage, which is beginning to take shape around us, is the true heart of this Project. 

The Project will create a rich and varied database of digital imagery of primary source materials
that document and illustrate the heritage of California. It is important to the success of the Project
that the image database in the prototype system provides scholars with a body of inherently
interesting material, which will serve as a rich and coherent intellectual resource of enduring
scholarly interest. The database will be constructed to exploit the power of SGML to improve and
to aid in the creation of new knowledge through its support of sophisticated search and
navigational capabilities. The collections included in the Project have been selected carefully to
provide subject matter that will encourage new ways of searching, manipulating, and
understanding digital information. This selection process will be further refined at the beginning of
the Project when a plan for selecting individual images and groups or "clusters" of images based
on subject and genre will be developed by a curatorial selection team (See the Project's Work
Plan, Section III, Step 1 below). These subjects and genres will be included with descriptive
caption titles in entries that will be added to the SGML encoded finding aids that will provide
access to the images. The following is a partial list of subjects that will be included in the Project's
selection plan:

    1. Indians. 2. Early Exploration. 3. Missions. 4. Pre-statehood California. 5. San Francisco. 6.
    Gold Rush. 7. Donner Party. 8. Mining. 9. Transportation, Railroads. 10. Chinese in
    California. 11. National Parks (Yosemite, Lassen, etc.). 12. 1906 Earthquake and Fire. 13.
    World War II. 14. Japanese Relocation. 15. Immigrant labor. 16. Portraits. 
 
Since the California Heritage Digital Image Access Project will select its materials from the
collections of The Bancroft Library, it is important to recognize the preeminence of The
Bancroft's collections in the field of California history (See The Bancroft brochure and "The
Bancroft Library's Pictorial and Manuscripts Collections" in Appendix C below). Each of the subjects
listed above is well represented in the Bancroft's Pictorial and Manuscripts Collections, from
which images will be selected. Early exploration of California will be represented by images from
the pictorial and manuscript collections of the La Perouse Expedition (1769), the Malaspina
documents dating from 1789-1792, and the Langsdorf expedition (1792). The Patrick Breen
Diary describing his ordeal as a member of the Donner party will provide a significant example of
the numerous documents relating to pioneer exploration and settlement of California and the
West. The changing natural landscape of California will be documented in the photographs by
Carleton Watkins, Joseph LeConte, Cecil Wright, and Ansel Adams that will be selected from the
over 44,500 images of the Sierra Club Pictorial Collections. As the official archives of the Sierra
Club, The Bancroft houses one of the most comprehensive collections of environmental records in
the United States. Important images will be selected from other collections documenting
California's historical landscape, notably the more than 1,700 stereophotographs of Yosemite, Mammoth
Trees, San Francisco, and the Pacific Coast taken by Eadweard Muybridge in the 1860s and 1870s. 
Vivid early representations of California will also be selected from The Robert B. Honeyman,
Jr. Pictorial Collection of Western Americana, which includes numerous unique materials
pertaining to pre-statehood Hispanic California and the Gold Rush.  William H. Meyers served as
a gunner with the Pacific Squadron on the U.S. Sloop of War, Cyane.  The Project database will
provide access to his 1842 journal, which contains detailed watercolor illustrations and a
description of the seizure of Monterey.  Images of Native Americans will be selected from the C.
Hart Merriam papers, which contain over 3,000 photographs documenting California Indian
tribes. After having served for 20 years as the founding director of the United States Biological
Survey,  C. Hart Merriam turned his attention to recording the languages and habitat of California
Indian tribes.  In the decades between 1910 and 1942, Merriam spent five to six months each year
traversing the countryside, interviewing aged Indians.  His papers and photographs of California
Indian tribes represent one of the most significant remaining primary sources on Native American
culture, in many cases providing the last remaining links for many Native Americans to their
ancestors.  Pictorial collections of early San Francisco history, from which imagery will be
selected, include George R. Fardon's 1856 San Francisco Album, possibly the earliest compilation
of photographs of any American city, The Wyland Stanley Collection of photographs of San
Francisco in the 1890's and early 1900's, and Isaiah West Taber's photograph albums of "Business
Houses and California Scenery and Industries," containing detailed information on the types of
businesses, products, and on the locations of commercial establishments in San Francisco in the
1880's.  Images of the manuscripts from The Jack London Papers will be included in the Project.
Selection will also be made from The Roy D. Graves Pictorial Collections. These will offer choice
images from among the thousands of photographs of all varieties of steam transportation in
California and the West, from photographs that document the Chinese in California, and from
Graves' extensive collection of images of San Francisco just prior to and immediately following
the 1906 earthquake and fire. Selection for the database will also be made from among the
thousands of photographs from the Henry J. Kaiser Papers and from the Japanese War Relocation
Records, both of which document important facets of California life during World War II. Kaiser
photographs will be selected that dramatically illustrate the wartime industrial boom in California.
The Japanese War Relocation Records document one of the dark periods in the troubled history
of ethnic assimilation in California, and are among the most heavily used materials in The
Bancroft. A substantial selection from this collection will be included in the California Heritage
image database. All of these materials will combine to create an extraordinarily rich resource,
which will be of interest to scholars and to the general public alike. 

The complex but coherent body of the digitized imagery that this selection process will create will
also make it possible to thoroughly test the capabilities of the access system and to evaluate how
digital images are used both by scholars and by the general public. This may allow us to study the
degree to which the browsing of digitized images can serve as an effective substitute for handling
rare and endangered originals.  Thus in the process of testing the access mechanisms needed to
make digital images readily available to scholars, the project will also test the use of digital
surrogates in a real research environment and demonstrate how access to surrogates might be
used to preserve originals.

Collaboration will also be an important feature of this project. We will ask the following major
research institutions with significant collections of Californiana to serve as evaluators of the
prototype because of their understanding, based on long experience with similar collections of
research materials, of the access issues involved in the Project (See Project Work Plan, Section
III, Step 6 below):

    1) University of California Libraries (UCLA, San Diego, Irvine, Santa Barbara). 2) Getty
    Museum and Art Gallery. 3) California Historical Society. 4) Oakland Museum. 5) San
    Francisco Museum of Modern Art. 6) Hearst Museum (on the Berkeley campus). 7) Yale
    University's Beinecke Library. 8) The Smithsonian Institution. 9) The National Museum of
    American History. 10) Stanford University. 11) California State Library. 12) The Huntington.
    13) University of Southern California. 

We will also include, as evaluators, our collaborators in the RLG Digital Image Access Project
and the Berkeley Finding Aid Project (See Appendices D, E, and F below).