SGML: The Berkeley Finding Aid Project

                     The Berkeley Finding Aid Project
                          Standards in Navigation

                             Daniel V. Pitti
                Advanced Technologies Projects Librarian
                               The Library
                   University of California, Berkeley
                 386 Library, Berkeley, California 94720

Abstract: The archival community should develop and embrace an
encoding standard for archive, museum, and library finding aids.
Such a standard would ensure that Internet communication of
finding aid data is effective, and that the data endures
independently of the computer hardware and software used to
create and use it. This is the underlying premise of the Berkeley
Finding Aid Project. Some have attempted to encode finding aids
in the MARC format, but with mixed results. Standard Generalized
Markup Language (SGML) is recommended as a better vehicle, as it
has the flexibility to handle the complex hierarchical structure
of finding aids and to capture the individuality of unique
collections. SGML can also facilitate access to digital
surrogates of items in our collections. The availability on the
Internet of fully functional collection information and item
surrogates promises to dramatically alter access and

About the author: Daniel V. Pitti is Librarian for Advanced
Technologies Projects in The Library at the University of
California, Berkeley. This paper was written at the beginning of
a Department of Education Title IIA funded project in September
1993, and presented at the Meeting of the Society of American
Archivists in New Orleans. It was revised a year later and
presented at the Scholarly Publishing on the Electronic Networks
symposium sponsored by the Association of Research Libraries and
the Association of American University Presses in Washington,
D.C. The author is grateful for the assistance of Tim Hoyer and
Jack von Euw of The Bancroft Library, Berkeley, California, and
Jackie Dooley of The Getty Center for the History of Art and the
Humanities, Santa Monica, California.

Introduction and Project Overview.

     The advent and rapid growth of the Internet is transforming
scholarly communication worldwide. To take full advantage of the
opportunities this development offers, the archive, museum, and
library communities will need to develop and embrace standards to
ensure that the communication is both useful and enduring. An
encoding standard for finding aids is one such standard.
     Finding aids are documents used to describe, control, and
provide access to collections of related materials in libraries,
archives, and museums. The materials in these collections are the
natural by-products of the activities of individuals, families,
and organizations. Many different kinds of materials are
represented in these collections: manuscripts, correspondence,
legal records and papers, photographs, tape recordings, video
recordings, and more. 
     The Berkeley Finding Aid Project is a collaborative effort
to test the feasibility and desirability of developing an
encoding standard for archive, museum, and library finding aids.
The Project is funded in part by a grant from the United States
Department of Education Title IIA program. The Project began in
October 1993, and will be completed in September 1995. The
Project involves two interrelated activities. The first of these
is to create a prototype encoding standard for finding aids. This
prototype standard is in the form of a Standard Generalized
Markup Language (ISO 8879) Document Type Definition (SGML DTD).
Researchers at the University of California, Berkeley have
developed the encoding standard in collaboration with leading
experts in collection processing, collection cataloging, text
encoding, system design, network communication, authority
control, text retrieval, text navigation, and computer imaging.
Project participants have analyzed the structure and function of
representative finding aids. The basic elements occurring in
finding aids have been isolated and their logical
interrelationships defined. The DTD has been developed based on
the results of this analysis.
     Building a prototype database of finding aids is the second
objective of the Project. Available hardware and software have
been evaluated. We have selected ArborText's AdeptEditor as the
SGML-based authoring and editing software, and Electronic Book
Technologies' DynaText for providing networked access to the
database. Encoding the finding aids and building the database is
providing the encoding scheme developers with computer
application experience with which to refine and inform the DTD
development process. The network accessible database will
provide, in a later phase of the Project, a means for public and
staff end users to evaluate the utility and desirability of
encoded finding aids. The public and staff end users then will be
able to provide new ideas and suggestions to the encoding
standard developers.
     The success of any standard depends upon broad
participation in its development and, after it is developed,
widespread recognition of its utility. Standards are the products
of communities, not individuals working in splendid isolation.
As the Project has unfolded, we have kept the community aware of
its progress, and, in the near future, will solicit critical
     Before proceeding with a detailed description of the
Berkeley Finding Aid Project, I would like to comment briefly on
the archival community's interest in standards. 

Finding Aid Standards and the Archive and Museum Communities.

     For a variety of reasons the archive and museum communities
have not been motivated to develop standards to govern the
intellectual content and structure of finding aids. The economic
benefits of sharing cataloging that motivated the catalogers of
published materials are not available to the archive and museum
community because items in their collections are mostly unique.
Another possible motive for normalizing finding aid content and
structure would be to make them more familiar and thus more
immediately intelligible to the uninitiated. Charles Jewett, one
of the earliest advocates of standardizing descriptive
cataloging, addressed this issue of intelligibility in his
Smithsonian Report of 1853. He said of cataloging rules:

     Now, even if the one [system] adopted were that of the worst
     of our catalogues, if it were strictly followed in all
     alike, their uniformity would render catalogues, thus made,
     far more useful than the present chaos of irregularities.1

1Charles C. Jewett, Smithsonian Report on the Construction of
Catalogues of Libraries and Their Publication by Means of
Separate, Stereotyped Titles (Washington, D.C.: Smithsonian
Institution, 1853), 9.

Increasing mutual intelligibility, though, has not in itself been
a sufficient impetus to overcome the countervailing tendency to
fashion the content, structure, and aesthetics to fit the local
institution and users, and the characteristics of the collection
being described. 
     With the emergence of the Internet, we must confront the
need for greater intelligibility of information that can be
readily shared and interpreted by the world intellectual
community. In this context, the importance of mutually
comprehensible finding aids and the benefits to be derived from
them take on new urgency. The Internet has the potential to
provide immediate access to information about our collections and
even to computer surrogates of items in those collections. And
this access can be available anywhere at any time. For those
institutions choosing to take advantage of the power of the
Internet, users will no longer be exclusively the familiar faces
of the dedicated researchers. In order to provide the access
described, the archive and museum communities must develop and/or
embrace an array of standards. The means are there to communicate
with one another, but to make this communication mutually
understandable, we will need to work together. 
     In order for archivists and curators to communicate with
users and one another in the network environment, our computers
must be able to communicate with one another. We need to be able
to move information freely from computer to computer over the
network. An encoding standard will enable institutions to create
finding aids that are independent of proprietary hardware and
software. This will allow the data to be freely interchanged
across platforms and applications, and still be instantly usable
without time-consuming modification or adjustment. Such
interoperability will support contemporaneous interchange of
collection data, and it will ensure that our investment in the
data will survive over time.
     Finding aids created using proprietary word processing
software such as WordPerfect and Microsoft Word, or database
software such as dBase, will remain usable only if you reformat
them each time you update the authoring software. Of course, for
this to work properly, software manufacturers must provide
translation programs that provide 100% fidelity in the data
migration process. This is not always the case, and so
information can be lost or garbled. As finding aid collections
grow, such reformatting through successive software versions will
become more burdensome. Even if the institution can survive the
ordeal of eternally updating software, another concern has to be
the durability of the software firm itself, and the durability of
its interest in the software. If the firm goes out of business,
or if it no longer finds the product profitable, an archive can
find itself with a database of finding aids stranded in time. If
finding aids are encoded using a standard, then their survival is
not contingent on a particular hardware and/or software
configuration. An encoding standard would guarantee that the
machine-readable finding aids created today would be usable
     This and other benefits would result from making finding
aids interchangeable through a standard encoding scheme. It would
be feasible for users and institutions to have fully functional
access to finding aids in real-time over the network. Remote
access to finding aids would enable researchers to make direct
use of collection information without the expensive mediation of
reference staff, and mutual access to current collection
information would have a major, positive impact on the management
of archive and manuscript repositories. Direct access would give
researchers more autonomy and control over their research, and it
would facilitate inter-institutional cooperation in collection
development and preservation where knowledge of the holdings of
other institutions can help curators make difficult decisions
about how to spend scarce dollars developing and preserving their
own collections. The list of institutional benefits derived from
the use of a standard, in fact, exceeds the simple but profound
benefits that individual scholars would enjoy. A standard would also
allow inter-institutional cooperation in the description of and
access to dispersed collections and to independent but related
collections. Perhaps even more important in the short-term, a
standard would make it feasible for collection holding
institutions, library oriented vendors, and supporting federal
and state agencies to develop the human and material resources
needed to convert existing paper finding aids into machine-
readable form. Without a standard for finding aids, it is
difficult to convince agencies to fund conversion because there
is no benchmark by which they can evaluate how the money is

MARC versus SGML.

     In order for an encoding standard to provide the
infrastructure to support a full array of access, control,
navigation, and print functions and uses, it must be well
designed and constructed. It is the quality of the standard as an
information infrastructure that will enable maximum exploitation
of the encoded information. 
     In the early stage of developing this project, we considered
using the MARC format as the basis for the standard. We did this
because MARC was familiar, and because we had heard that many
institutions were attempting to use it for encoding finding aids.
We quickly decided that it was not the best available scheme. We
had three principal reasons for making this decision. 
     First, we found MARC inadequate because records are limited
to a maximum length of 100,000 characters. This represents
approximately 30 8-1/2 X 11 pages of 10 pitch unformatted text
stored in ASCII. Since many finding aids are longer than this,
the size restriction is a prohibitive obstacle. 
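     As a rough check on the page estimate above, the following
sketch converts the record length limit into typed pages. The
1-inch margins and 6 lines per inch are assumptions made here
for illustration; they are not figures from the Project.

```python
# Rough page estimate for a maximum-length MARC record.
# Assumed (not from the paper): 1-inch margins, 6 lines per inch.
MAX_RECORD_CHARS = 100_000       # record length limit cited in the text
PITCH = 10                       # characters per horizontal inch ("10 pitch")
LINES_PER_INCH = 6               # assumed line spacing
PAGE_WIDTH_IN, PAGE_HEIGHT_IN = 8.5, 11.0
MARGIN_IN = 1.0                  # assumed margin on every side


def chars_per_page(margin_in: float = MARGIN_IN) -> int:
    """Characters that fit on one page of unformatted text."""
    columns = int((PAGE_WIDTH_IN - 2 * margin_in) * PITCH)
    rows = int((PAGE_HEIGHT_IN - 2 * margin_in) * LINES_PER_INCH)
    return columns * rows


def pages_for(chars: int, margin_in: float = MARGIN_IN) -> float:
    """Number of pages needed to hold the given number of characters."""
    return chars / chars_per_page(margin_in)


if __name__ == "__main__":
    print(chars_per_page())                        # characters per page
    print(round(pages_for(MAX_RECORD_CHARS), 1))   # pages per full record
```

     Under these assumptions a page holds 3,510 characters, so a
full-length record comes to a little under 30 pages, which is in
line with the estimate given above.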
     Our second difficulty with MARC is its limited accommodation
of hierarchically structured information. Since finding aids are
hierarchically structured documents, the flat structure of MARC
makes it unsatisfactory. As archivists are painfully aware, MARC
was primarily designed to capture description and access
information applying to a discrete bibliographic item. Describing
and providing access to complex collections through descending
levels of analysis quickly overburdens the MARC structure. At
most, a second level of analysis can be accommodated, but the
kind of information so supplied is limited. One possible way
around this problem is to employ multiple, hierarchically
interrelated and interlinked records at varying levels of
analysis: collection-level, subunit, and item. The use of
multiple records, though, introduces extremely difficult inter-
and intra-system control problems that have never been adequately
addressed in the format or by MARC based software developers.
Even if the control issues were adequately addressed in the
format, the control required to make multiple record expression
of hierarchy succeed would entail prohibitive human maintenance.
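     The control problem can be illustrated with a minimal
sketch. The record layout, field names, and sample titles below
are hypothetical and are not MARC tags or Project data; the point
is only that a hierarchy spread across separate flat records
exists solely in the links between them, and every link must be
kept valid by hand or by software.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FlatRecord:
    """One flat record; all field names here are hypothetical."""
    record_id: str
    level: str                 # "collection", "subunit", or "item"
    title: str
    parent_id: Optional[str]   # link to the record one level up, if any


def broken_links(records: List[FlatRecord]) -> List[str]:
    """Return ids of records whose parent link points at no known record."""
    known = {r.record_id for r in records}
    return [r.record_id for r in records
            if r.parent_id is not None and r.parent_id not in known]


def path_from_root(record_id: str, records: List[FlatRecord]) -> List[str]:
    """Rebuild the hierarchy above a record by following parent links."""
    by_id: Dict[str, FlatRecord] = {r.record_id: r for r in records}
    path: List[str] = []
    seen = set()
    current = by_id.get(record_id)
    while current is not None and current.record_id not in seen:
        seen.add(current.record_id)      # guard against circular links
        path.append(current.title)
        current = by_id.get(current.parent_id) if current.parent_id else None
    return list(reversed(path))


RECORDS = [
    FlatRecord("c1", "collection", "Example Family Papers", None),
    FlatRecord("s1", "subunit", "Correspondence", "c1"),
    FlatRecord("i1", "item", "Letter, 1901", "s1"),
]

if __name__ == "__main__":
    print(path_from_root("i1", RECORDS))
    # Deleting the subunit record silently orphans the item below it.
    remaining = [r for r in RECORDS if r.record_id != "s1"]
    print(broken_links(remaining))
```

     Removing or renumbering a single intermediate record leaves
the item record intact but cut off from its context. In a nested
document the same item cannot lose its place in this way.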
     The third reason for not using MARC for finding aids
involves the marketplace. It is a gross understatement to say
that libraries, archives, and museums are generally not resource
rich institutions. To put it into perspective, the price of one
B-2 bomber would fund the Library of Congress for over three
years. Lacking large amounts of capital, the MARC user community
has been incapable of driving state-of-the-art hardware and
software development.
     Standard Generalized Markup Language or SGML is a promising
framework or model for developing an encoding scheme for finding
aids for a number of reasons. First, it has none of the problems
associated with MARC mentioned above. SGML will accommodate
hierarchically interrelated information at as many levels as
needed. Furthermore, there are no inherent size restrictions on
SGML based documents. SGML is an international standard embraced
by an ever growing list of government, educational, research, and
industrial institutions. Through the Computer-aided Acquisition
and Logistic Support or CALS initiative, the US Department of
Defense has mandated that all contractors doing business with DOD
must use four specific standards in communications. One of the
four is SGML. Because the DOD budget is immense even if
shrinking, this mandate has motivated intensive development of
SGML oriented software. Besides not sharing the weaknesses of
MARC, SGML has a host of other features recommending it to us.
     SGML is not itself a text encoding standard, but a standard
for uniformly developing standards for definable kinds or classes
of documents. In SGML, a standard for a particular kind or class
of document is called a Document Type Definition or DTD. Each
community that uses and shares a particular type of document must
assume responsibility for developing an encoding scheme specific
to the type. 
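     The idea of a Document Type Definition can be sketched, in
much simplified form, as a table of which elements may appear
inside which others. The element names below are hypothetical and
are not taken from the Berkeley DTD, and a real SGML DTD also
expresses order, repetition, and attributes, which this sketch
leaves out.

```python
# Hypothetical content model: element name -> permitted child elements.
CONTENT_MODEL = {
    "findaid": {"title", "series"},
    "series": {"title", "series", "item"},   # series may nest to any depth
    "item": {"title"},
    "title": set(),
}


def validate(node, model=CONTENT_MODEL):
    """Return a list of violations for a (name, children) tuple tree."""
    name, children = node
    if name not in model:
        return [f"unknown element: {name}"]
    errors = []
    for child in children:
        child_name = child[0]
        if child_name not in model[name]:
            errors.append(f"{child_name} not allowed in {name}")
        errors.extend(validate(child, model))
    return errors


VALID = ("findaid", [
    ("title", []),
    ("series", [
        ("title", []),
        ("series", [("item", [("title", [])])]),
    ]),
])

# An item placed directly under the top level breaks the model.
INVALID = ("findaid", [("item", [("title", [])])])

if __name__ == "__main__":
    print(validate(VALID))
    print(validate(INVALID))
```

     Because the rules live in one shared definition rather than
in any one program, every institution that adopts the definition
can check its documents against the same model.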
     SGML is concerned with designating the logical elements that
serve as the building blocks of documents and the
interrelationships of these elements. The prototype SGML Document
Type Definition for finding aids developed at Berkeley specifies
what logical elements can be present, and, with varying degrees
of specificity, how the elements interrelate. In this regard, it
is similar to MARC and dissimilar to word processing
applications. MARC captures description and access information.
It does not specify how to index, display, or print the
information. Markup that provides output specifications such as
print or display formatting is called procedural markup. It is
dedicated to a single end-use or application. Markup designating
the elements constituting a kind of document is called
descriptive markup. Both SGML and MARC are descriptive markup
languages. Text encoded using descriptive markup is frequently
called "structured text," while ASCII text is called "plain
     Descriptive markup based on SGML enables maximum flexibility
in the use of the text. Indexing, and display and print
formatting can all be precisely controlled by the user of the
text. SGML structured text facilitates sophisticated database and
document indexing and searching, such as keyword and Boolean
searching specific to a document or document component, word
adjacency, word proximity, relevance ranking, and relevance
feedback. Such "smart" indexing eases much of the inherent tension
between recall and relevance that plagues most large database
searching systems. SGML structured text also supports advanced
online text
navigation. It is possible to juxtapose a dynamically generated
table of contents and accompanying text to provide context to
enhance reading comprehension and provide random, informed access
to text. Structured SGML based text facilitates complete
flexibility in automated production of printed finding aids and
related print products essential to processing, control,
curatorial functions, and donor relationships. The flexibility of
SGML based structured text renders it unquestionably superior to
hierarchically flat MARC structured text, plain text, and
procedurally marked text.
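     Two of the uses described above, a generated table of
contents and a search confined to one kind of document component,
can be sketched as follows. The sample is written as well-formed
XML-style markup only so that a standard Python parser can read
it; it is a stand-in for SGML, and the element names and titles
are hypothetical rather than drawn from the Berkeley DTD or from
any actual finding aid.

```python
import xml.etree.ElementTree as ET

# Hypothetical finding aid using descriptive markup only.
SAMPLE = """
<findaid>
  <title>Example Family Papers</title>
  <series>
    <title>Correspondence</title>
    <series>
      <title>Outgoing letters</title>
      <item><title>Letter concerning land grants</title></item>
    </series>
  </series>
  <series>
    <title>Photographs</title>
    <item><title>Portrait of land office staff</title></item>
  </series>
</findaid>
"""


def table_of_contents(root):
    """Return indented headings built from nested series titles."""
    lines = []

    def walk(element, depth):
        for series in element.findall("series"):
            lines.append("  " * depth + series.findtext("title", default=""))
            walk(series, depth + 1)

    walk(root, 0)
    return lines


def search_item_titles(root, keyword):
    """Keyword search restricted to item titles, ignoring series titles."""
    needle = keyword.lower()
    return [t.text for t in root.findall(".//item/title")
            if t.text and needle in t.text.lower()]


ROOT = ET.fromstring(SAMPLE.strip())

if __name__ == "__main__":
    print("\n".join(table_of_contents(ROOT)))
    print(search_item_titles(ROOT, "land"))
```

     The same encoded text serves both functions without change,
because the markup records what each element is rather than how
it should look.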

The Information Future and Archives.

     While the Project is presently concerned only with finding
aid text, it is looking to an imminent information future in
which collection-level records lead to finding aids and finding
aids lead to computer surrogates of primary source materials that
exist in a variety of native formats: pictorial materials,
graphics, three-dimensional objects, manuscripts, typescripts,
printed text, sound recordings, motion pictures, and so on. The
intersections between these various forms of information will be
traversed by the click of a mouse or by entering a simple
command. Information interlinked in this manner is called
     It is critical, I believe, that we move toward providing
network access to computer surrogates of items in our collections
as we make information about our collections more accessible
through network access to finding aids. Some believe that making
MARC AMC records available increases requests for use of
collections. Increasing the availability of the more detailed
information in finding aids will perhaps have the same effect.
Such increased demand and use will justifiably raise curatorial
concerns. But by making surrogates of the most used portions of
our collections available, we can simultaneously increase access
and limit physical access to endangered collections. Access and
preservation need not be in conflict.
     The Berkeley Finding Aid Project envisions an information
future in which serious scholars and the casually curious alike
can easily find the cultural treasures they seek. In this future,
information seekers follow clearly marked paths through catalogs
to finding aids and from finding aids to a wealth of information
in a multitude of computer and traditional formats ... and back.
Developing a standard encoding scheme for finding aids is
essential to the realization of this future.
     By making primary source materials accessible anywhere,
anytime, the Internet will challenge basic assumptions about the
nature of archives, libraries, and museums. The Internet makes it
possible to transcend the physical limits of our information
environment. I would like to conclude by dwelling, for a brief
moment, on the notions of physical "absence" and "presence." 
     In an information world dominated by physical media, the
absence of the items has dictated that we create synoptic
surrogates to represent them. These include both catalog records
and finding aids. While each provides access mechanisms, it is
the descriptive component that represents the absent item. Using
the description, we are, in principle, able to identify the
remote object, and decide whether it is what we want. The
Internet, by rendering the absent present, will no doubt
significantly alter the descriptive role of catalog records and
finding aids.
     Archives, museums, and libraries are situated in specific
geographic locations, and generally serve the needs of the
surrounding community. They are very present for the community in
which they reside. The Bancroft Library at the University of
California, for example, is located on the campus of the
University, in the city of Berkeley, in the state of California
in the Far West of the United States. The Bancroft Library, its
collections, its staff, its mission, all reflect this presence in
the community. But if the most sought after collections of The
Bancroft Library are available on the Internet, anytime and
anywhere, it becomes in a certain sense universally present,
present in the World Community.
     Universality requires effective communication and effective
communication requires standards. Standardization necessarily
creates tension with the particularity of individual collections
and repositories. Developing a standard that enables effective
communication and that also allows curators and collection
managers to adequately represent the individuality of the
collection and institution is the challenge before us.
     For scholars, making the absent present over the Internet
offers new possibilities of scholarly communication. Scholars
inclined to make their interpretations of source materials
available in a standards-based, machine-readable form on the
Internet will be able to make both the source materials and their
interpretations of them simultaneously available to their peers and
students. And it will not matter where they or the materials
physically reside. Furthermore, it will be feasible to link the
finding aids that lead to the source materials directly to
interpretations of them. Over time, arrays of competing
interpretations will accumulate around source materials, and
provide alternative forms of access to and views of them. 
     The prospect of bringing research materials to the scholar
over the Internet in lieu of the scholar having to travel to them
reveals yet another possibility. Agencies that fund research
involving primary source materials should finance digitizing
rather than traveling. A collection, once digitized, would be
available to all scholars for all time.