SGML for Cultural Heritage Information

Joseph A. Busch, Getty Art History Information Program, Santa Monica, California, USA

Abstract

The Consortium for the Computer Interchange of Museum Information (known as CIMI) develops cultural heritage community standards to preserve digital museum information and facilitate its exchange. This paper discusses the CIMI Information Model, steps in the development of the CIMI document type definition (DTD) which uses Standard Generalized Markup Language (SGML) for content designation from a data model point of view, and issues related to linking SGML information objects. The paper was originally presented at the ASIS Mid-Year Meeting in Minneapolis, Minnesota on May 24, 1995.

What is Cultural Heritage Information?

Cultural heritage information is not only beautiful paintings on white walls, which is how museums are usually imagined. The real treasure troves, where most cultural heritage information exists, are the archives and special collections. Almost every museum includes this kind of information. There are also many free-standing cultural heritage information archives. Cultural heritage information -- broadly conceived -- is the information that is held by museums, archives, and the special collections in libraries.

For example, a typical photographic archive exists as part of the University of London's Warburg Institute. What you see when you walk into such a photographic archive are the ubiquitous red, green, black, or natural cardboard archive boxes, and filing cabinets full of black and white photographs for study. At the Warburg, each item is filed according to a local classification system based on the principal iconographic feature represented by the object depicted in the photograph. Another example of a cultural heritage archive is the Witt Library at the Courtauld Institute in London. The Witt Library is a famous repository with more than six million visual representations of works of art. Each item is filed alphabetically by artist in boxes arranged by the artist's country of birth. It is very difficult to get access to this information unless the name of the artist and the artist's country of birth are known. 1

The amount of digital information in the cultural heritage community is increasing exponentially, while the consistency of that information is decreasing exponentially as its volume grows. Figures 1a and 1b illustrate these trends. Until recently, much of this digital information existed in structured databases, for example, in library catalogs, museum collections management systems, and in a few cases custom designed idiosyncratic retrieval systems such as those at the Warburg and Witt. But unlike the situation in the library community, where the MARC format has been in use for many years, there is currently no agreement on data structures or data values in the cultural heritage community. Digital information exists in many disconnected databases, and there is no coherent way to search these multiple electronic resources located at many separate institutions. Each database has a different interface and command structure, and it is difficult to consolidate and analyze the results of searches.

Inexpensive desktop microcomputers have begun to be widely used during the past few years in the cultural heritage community. With this explosive growth, an enormous amount of relatively unstructured text, as opposed to structured databases, has begun to be generated. Exhibition catalogs, gallery guides, wall labels, slide labels, educational materials, brochures, and finding aids are more frequently being generated electronically. Many institutions are beginning to experiment with small scale digital imaging projects, but in terms of the total number of objects that are being documented, digital images remain (and are likely to remain) a relatively small component of the total number of information objects generated. Increasingly, museums are mounting these digital files on World Wide Web sites at very low cost and without adding much value to them other than minimal hypertext linking. Figure 2 illustrates the relative amounts of such digital information by various types and their projected growth. While museums are quickly creating a public face, the problem of searching these multiple, distributed resources is not being solved simply by putting more "stuff" up on the Web. Integrating such disparate, heterogeneous information sources is a difficult problem.

Cultural Heritage Data Standards

The cultural heritage user and vendor communities have resisted attempts to gain acceptance of the USMARC format and to extend it with the types of data attributes that this community requires, such as those needed to describe and control museum objects. This situation can be contrasted with the archives community, which developed MARC format extensions to accommodate the special attributes and functions needed to describe and control archival materials. The USMARC format for Archival and Manuscripts Control (USMARC AMC) was developed by the archives and manuscripts community and is being widely adopted and used there.2

But in the cultural heritage community, broadly speaking, there is currently no base of agreed information and no generally accepted standards for structuring data. There are many separate databases being developed to inventory and manage the objects in museum collections. There are many forms of museum documentation, ranging from wall labels -- the labels located on the wall next to or near the objects in a museum, which are similar to object records -- to exhibition catalogs and other types of publications which are much more complex. Such published (and unpublished) documents are very complex, including a lot of textual information broken up into discrete sections, as well as figures, charts, references, and often many black and white illustrations and color plates. For such a diverse set of materials to be transformed into a coherent information resource, a cultural heritage database, the cultural heritage community must first solve a basic underlying problem: developing a common way to talk about and think about these diverse types of materials.

The Consortium for the Computer Interchange of Museum Information

The Consortium for the Computer Interchange of Museum Information (known as CIMI) was founded as an operating project of the Museum Computer Network (MCN) in 1990 to develop community standards, preserve digital museum information, and facilitate the exchange of information. The original CIMI project received an NEH grant to develop a standards framework (known as the CIMI Framework)3 for interchanging digital museum information. The CIMI Framework recommended the use of several standards for encoding different types of cultural heritage information including ISO 2709 (or MARC) and ISO 8879 (or Standard Generalized Markup Language - SGML)4. In 1993, the Canadian Heritage Information Network (CHIN), Getty Art History Information Program (AHIP), and Research Libraries Group (RLG) agreed to sponsor CIMI for three years, during which a consortium of participating institutions would be formed and additional grant funding sought to implement projects that would demonstrate the application of the CIMI Framework.

In 1994, CIMI was a successful applicant for a Telecommunications and Information Infrastructure Assistance Program (TIIAP) grant administered by the U.S. Commerce Department National Telecommunications and Information Administration (NTIA); and CIMI was awarded an NEH grant in 1995. The overall goal of the project which is known as Cultural Heritage Information Online, or CHIO, is to develop and demonstrate the application of standards for distributed access to heterogeneous cultural heritage information over the Internet. The CHIO project will demonstrate the application and use of SGML and Z39.50 in the cultural heritage community, and build an integrated multimedia cultural heritage information resource to demonstrate and test networked access to it.

The current CIMI consortium is a multi-faceted group of institutions ranging from single museums such as the National Museum of American Art (NMAA), to consortia of archives at major European museums such as the Remote Access to Museum Archives (RAMA), a consortium of the Ashmolean, Museon, Musée d'Orsay, Prado, Pergamon, Goulandris, Museo Archeologico Nacional, and Uffizi, to commercial organizations such as Corbus. CIMI members also include the Canadian Museum of Civilization, CHIN, Coalition for Networked Information (CNI), Eastman Kodak Company, Getty AHIP, Museum Computer Network (MCN), Museum Documentation Association (MDA), National Gallery of Art (NGA) in Washington, Philadelphia Museum of Art, RLG, University of California at Berkeley Museum Informatics Project, University of California Division of Library Automation, and Victoria & Albert Museum (V&A).

The purpose of the TIIAP-funded CHIO Structure project is to demonstrate the application and use of SGML to build an integrated multimedia cultural heritage information resource accessible over the Internet. Figure 3 is an overview of the major steps in the CHIO Structure project. From July to December 1994 project working groups worked on the development of an overall information model. From January to March 1995 samples of different kinds of cultural heritage documentary materials were then collected. From April to June 1995 the structure and content of each of these "document" genres were analyzed. In June and July 1995 this analysis was used as the basis for developing an SGML document type definition (or DTD). Document encoding using the CIMI DTD was begun in August 1995. A database for holding SGML objects was designed and the SGML-encoded documents were mounted as a prototype in October 1995. The database was made available for public access over the Internet in March 1996. Testing and evaluation of the CHIO database is ongoing. This paper focuses on the CIMI Information Model and the process used to develop the CIMI DTD.

A Framework for Organizing Cultural Heritage Information

CIMI has taken a fresh look at the types of cultural heritage information and spent considerable time developing a model for representing and organizing them. The model which has been developed differs somewhat from the model that has evolved in the bibliographic community. The CIMI model for integrating cultural heritage information is illustrated in Figure 4. This simplified representation of the CIMI information model has three components -- Sources, Authorities, and Points of View.

Sources are actual documents, both published and unpublished, as well as other source materials that exist as many different genres. For example, exhibition catalogs are a genre of museum publications. Critical essays and educational materials are genres of museum publications with different structure and content characteristics from exhibition catalogs. Genres of archival materials include sketch books, letters, photographs, manuscripts, etc., as well as the aids developed by archivists for finding them, or finding aids. Other genres of source materials include wall labels, collections management records, and images themselves which may or may not have text attached to them.

Authorities include materials that in the library community have been called authorities, as well as other data that has been analyzed and structured according to community data standards. Authorities are generally surrogates for source materials, but do not include all surrogates. For example, museum collections management records (or library circulation records) are not authorities because such transaction records record information related to the life cycle of an object. Collections management information is a genre of business records rather than an authority. A database that analyzes such business records, particularly across institutions and over time might be considered an authority. This category includes traditional kinds of authorities such as authority files for personal and institutional names, geographic locations, object names, and topics; as well as bibliographic databases such as abstracting and indexing services, and certain other databases that contain highly structured secondary and tertiary material. The CIMI information model recognizes that such data records have a special meaning and usefulness as pointers to sources more than in themselves, thus they are more tightly bound conceptually to source materials than in the library community model.

Points of View represent the ways that users access cultural heritage information. Users may access the information through authorities that point to sources through a sort of mediated query, or they may access the sources themselves through a full-text search. Those interested in accessing cultural heritage information also have many different intellectual points of view. A school teacher looking for classroom materials for fifth graders, a weekend museum visitor, an armchair network surfer, a museum curator, or a PhD candidate working on a dissertation have very different interests in and expectations from a distributed multimedia cultural heritage information resource. Project CHIO will focus on two points of view -- one for art historical, research-oriented users and one for the general public. The project will investigate the differences and coincidences between those two points of view.
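The two access paths described above, a mediated query through authorities and a direct full-text search of sources, can be sketched in a few lines of Python. This is only an illustration of the idea, not the CHIO implementation: the Source and Authority record layouts, the function names, and the sample data are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    """A source document of some genre (hypothetical record layout)."""
    source_id: str
    genre: str
    text: str


@dataclass
class Authority:
    """A structured record whose main value is as a pointer to sources."""
    heading: str
    variants: list = field(default_factory=list)
    source_ids: list = field(default_factory=list)


def mediated_query(term, authorities, sources):
    """Find sources by way of an authority heading or one of its variants."""
    wanted = term.lower()
    hits = []
    for authority in authorities:
        names = [authority.heading] + authority.variants
        if any(wanted == name.lower() for name in names):
            hits.extend(sid for sid in authority.source_ids if sid not in hits)
    return [sources[sid] for sid in hits if sid in sources]


def full_text_query(term, sources):
    """Find sources whose own text contains the term."""
    wanted = term.lower()
    return [s for s in sources.values() if wanted in s.text.lower()]


# Invented sample data, for illustration only.
sources = {
    "cat-1": Source("cat-1", "exhibition catalog", "Portraits by E. S. Field ..."),
    "lbl-7": Source("lbl-7", "wall label", "Erastus Salisbury Field, oil on canvas"),
}
authorities = [
    Authority(
        "Field, Erastus Salisbury",
        variants=["Erastus Salisbury Field", "E. S. Field"],
        source_ids=["cat-1", "lbl-7"],
    ),
]

via_authority = mediated_query("Erastus Salisbury Field", authorities, sources)
via_full_text = full_text_query("Erastus Salisbury Field", sources)
```

In this toy data the mediated query reaches both sources because the authority record collects the name variants, while the full-text search finds only the source that happens to spell the name the same way as the query.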

Translating points of view into queries that can be processed by information servers, pointing queries to authorities and sources and between them, and presenting a usable response back to the user require a common glue to interface between them, that is, an exhaustive data model. The Documentation Committee of the International Council of Museums (CIDOC), a UNESCO affiliate, has developed and maintains such a cultural heritage information data model (called the CIDOC Data Model)5. Unlike MARC, the CIDOC Data Model is an entity-relationship data model. Ultimately it will be used as the glue to link the various types of resources together, and to link the points of view for accessing that data to the resources.

In summary, the CIMI Information Model contains the following components -- very structured types of information called Authorities, less structured types of information that exist in various media called Sources, Points of View for art historians and the general public to access the information resources, and the CIDOC Data Model to provide the glue for interfacing among these components.

The CIMI SGML Document Type Definition (DTD)

This section of the paper discusses the process for decomposing genres of source documents, such as those that will be included in the CHIO Project -- exhibition catalogs and wall labels -- into their constituent elements. These elements need not be limited to the actual material contained in the source document; the sources of that material need to be considered as well. For example, an illustration may be derived from a digital image file which in turn may be derived from a real object, for example, a painting in a museum.

Figure 5 illustrates the major components of a complete publication such as an exhibition catalog. Generally, there are three parts of such a publication -- the front matter, the body of the publication, and the back matter. The front matter generally consists of the title page and verso, table of contents, introduction, and acknowledgements. The front matter is the principal source for the bibliographic description of the overall document.6 The bibliographic description could be enhanced to include a named location such as a Uniform Resource Locator (URL) for linking to or from the entire digital representation of the document. The body of the publication contains the primary content of the document. From a structural perspective, the body of a document generally consists of sections, subsections and paragraphs. The body may also contain specially formatted text such as tables, as well as illustrations and other visual materials. The back matter generally contains the bibliography and index, materials that can provide additional, enhanced access to parts of the document, or between this and other related source materials.
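The three-part structure just described can be modeled as a simple tree. The following Python sketch builds such a tree and prints it as SGML-style markup. The element names (catalog, front, body, back, and so on) are invented for illustration and are not taken from the CIMI DTD, and the sketch does not escape markup characters in text.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """One node in a decomposed document (element names are illustrative)."""
    name: str
    text: str = ""
    children: list = field(default_factory=list)


def render(element, depth=0):
    """Return SGML-style markup for an element tree as a list of lines."""
    pad = "  " * depth
    lines = [f"{pad}<{element.name}>"]
    if element.text:
        lines.append(f"{pad}  {element.text}")
    for child in element.children:
        lines.extend(render(child, depth + 1))
    lines.append(f"{pad}</{element.name}>")
    return lines


# A skeleton exhibition catalog: front matter, body, back matter.
catalog = Element("catalog", children=[
    Element("front", children=[
        Element("titlepage"),
        Element("contents"),
        Element("introduction"),
        Element("acknowledgements"),
    ]),
    Element("body", children=[
        Element("section", children=[
            Element("subsection", children=[
                Element("para", "Text of a paragraph."),
            ]),
        ]),
    ]),
    Element("back", children=[
        Element("bibliography"),
        Element("index"),
    ]),
])

markup = "\n".join(render(catalog))
```

Printing markup shows the nesting directly, which is a convenient way to check a proposed decomposition against sample documents before committing it to a DTD.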

Figure 6 illustrates some of the components that might be found in a paragraph. These include footnotes or endnotes, generally linked to the bibliography. Within the text, there may be quotations with citations to their sources attached, and each citation may have a bibliographic link attached to it. Several types of illustrative material may be referred to in the text within a paragraph, including specially formatted text such as lists or tables, as well as illustrations. Illustrations also generally have captions. Illustrations might also be linked to external sources such as a digital image, or the data file used to generate a table or chart.

Figure 7 illustrates some of the various types of information or data types that might be found within the text of a paragraph. Encoding text can be done manually or with computer assistance. Algorithms are becoming available that can reliably identify names.7 Not all text will contain content that could be meaningfully encoded, but the more text that can be encoded, the more access points that can be provided to that text. For example, within text there may be personal and institution names, locations, topics, and dates, all of which might be identified as such. In some cases it would be useful to add information about the context of an encoded text that may not be explicit in the text. Using SGML, such added information can be encoded as attributes on the element that surrounds a section of text, adding information such as access points to it. For example, the string "Erastus Salisbury Field," the name of a famous New England itinerant artist, may appear within a piece of text as the general topic of that text, or the string may refer to the creator of a particular work of art. Adding a role attribute to the text encoding would be helpful in distinguishing between Erastus Salisbury Field as a subject and Erastus Salisbury Field as an artist/creator. It would also be useful to specify a standard form of the artist name explicitly as an artist name attribute, or to link, using a link attribute, this encoded element to a name authority file of artistic personalities that collects together variations in the spelling and format of the name. Encoding the context for named events by adding location, topic, and date attributes is another example of information that could be added to enhance access to and meaning of text. Appendix 1 is a list of attribute names, the Categories for the Description of Works of Art, that might be applied.8
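As a rough illustration of the role, standard-form, and link attributes discussed above, the following Python sketch wraps a name found in running text in an SGML-style element. The element name (name), the attribute names (role, standard, link), the authority identifier, and the sample sentences are all assumptions made for this example; none of them is taken from the CIMI DTD.

```python
def encode_name(text, name, role, standard_form=None, authority_link=None):
    """Wrap the first occurrence of `name` in `text` in a <name> element.

    Attribute values containing a double quote are rejected rather than
    escaped, to keep the sketch simple.
    """
    if name not in text:
        raise ValueError(f"{name!r} does not occur in the text")
    attributes = [("role", role)]
    if standard_form:
        attributes.append(("standard", standard_form))
    if authority_link:
        attributes.append(("link", authority_link))
    for key, value in attributes:
        if '"' in value:
            raise ValueError(f"attribute {key} contains a double quote")
    attribute_text = " ".join(f'{key}="{value}"' for key, value in attributes)
    tagged = f"<name {attribute_text}>{name}</name>"
    return text.replace(name, tagged, 1)


# The same string encoded once as a creator and once as a subject.
as_creator = encode_name(
    "A portrait by Erastus Salisbury Field.",
    "Erastus Salisbury Field",
    role="creator",
    standard_form="Field, Erastus Salisbury",
    authority_link="names/field-erastus-salisbury",  # hypothetical identifier
)
as_subject = encode_name(
    "An essay about Erastus Salisbury Field.",
    "Erastus Salisbury Field",
    role="subject",
)
```

The visible text is unchanged in both cases; only the attributes differ, which is what lets a retrieval system distinguish the artist as creator from the artist as topic.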

Figure 8 illustrates the types of information of which a name authority generally consists. These fields could be represented as a group of attributes to add context and consistency to the encoding, as discussed above. Similar attribute groups could be developed for encoding other authorities such as for locations, topics, events, or dates. There might also be attribute groupings between authorities. For example, biographical information encoded in a name authority might add attributes to make an event more explicit. A biographical event could be further encoded as a location, a date and a role for that person. It might also be linked to a USMARC authority record or even an image such as a portrait of that person. Related events might even be linked to each other.

In summary, this section described how a large information object, such as an exhibition catalog, can be taken apart or decomposed into its explicit and implicit constituent parts. Potentially, it is possible to break a large document into a great many parts, and to add a great amount of implicit information to it. In practice, the level of detail of analysis of such documents will vary depending on the resources available and the purposes for encoding them. The goal of the CIMI project is to specify enough detail, or granularity for mark up, so that in creating new documents and as special projects, people can begin to add the kinds of attributes that provide more access or add value to the source materials. This section of the paper has also presented a description of a methodology that can be applied to other types of information objects such as object records, wall labels, and bibliographic citations. Each genre of information will decompose in somewhat different ways, but the information elements tend to converge at the level of attributes for encoding content such as names, locations, and events.

Link Naming Mechanism

Implementing the CIMI Information Model requires a robust linking mechanism to connect the parts of the distributed public information resource that it is building. The CIMI implementation must provide meaningful ways to query heterogeneously structured target databases and then present and display results from the pieces of different types of sources and authorities documents.

Uniform Resource Locators, or URL's, are the way that addresses are named and accessed by the hypertext transfer protocol or HTTP used on the World Wide Web (WWW). One of the big problems frequently experienced when browsing on the Web is that URL's are not very robust locators. They often are not adequately maintained. Generally, there is no integrity checking within the Web to ensure that URL's point to existing locations. When a location that is pointed to ceases to exist temporarily or forever, or when its location or name changes, there is no mechanism to ensure that all pointers are updated. URL's are not like bar codes that are physically attached to objects. Bar codes can be sequenced one for each item with controls and procedures to ensure that the same bar code identification occurs uniquely within a particular application domain for purposes such as controlling the circulation of items in a library. URL's are not like international standard book numbers (ISBN's) that uniquely identify publication units through a formal, community- and market-mediated process.
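The kind of integrity checking that the Web lacks is easy to state for a closed collection in which every location and its outgoing links are known. The Python sketch below reports links whose targets are missing from such an inventory. The function name, the inventory layout, and the example addresses are hypothetical, and the sketch assumes a complete inventory, which is exactly what the open Web does not provide.

```python
def find_dangling_links(documents):
    """Return sorted (source, target) pairs whose target is not a known location.

    `documents` maps each known location to the list of locations it links to.
    """
    known = set(documents)
    return sorted(
        (source, target)
        for source, targets in documents.items()
        for target in targets
        if target not in known
    )


# Invented inventory of a small collection and its links.
documents = {
    "http://museum-a.example/catalog.html": [
        "http://museum-a.example/labels.html",
        "http://archive-b.example/finding-aid.html",
    ],
    "http://museum-a.example/labels.html": [
        "http://museum-a.example/catalog.html",
    ],
}

dangling = find_dangling_links(documents)
```

Here the only dangling pointer is the link from the catalog to the finding aid, whose location is not in the inventory. A publisher that maintained such an inventory could run this check whenever a resource was moved or withdrawn.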

In an environment consisting of interconnected public resources of which the Web is a prototype, there is a real need to improve the link naming mechanism with some sort of formal public identifiers (FPI's). There are two aspects to this infrastructure problem -- one is technical and one is operational.

Implementing an FPI mechanism requires the commercial sector to develop a standard addressing protocol for encoding unique public identifiers. For example, ISBN's have been defined to consist of a standard number of alphanumeric characters. ISBN's consist of a sequence of alphanumeric characters identifying the publisher which is assigned by an agreed upon maintenance agency, a sequence of alphanumeric characters assigned by the publisher, and a check digit to ensure that the sequence has a valid ISBN syntax. The maintenance agency is responsible for assigning unique identifications for publication agencies, and publication agencies are responsible for assigning unique identifications for each of their publication units. FPI's might work the same way. A maintenance agency would assign unique identifications for resource publication agencies, and resource publication agencies would assign unique identifications for public servers, resources mounted on those servers, and perhaps items within those resources. The technical requirements that need to be addressed for resolving name locations down to the item level in a heterogeneous network environment are complicated, but the concept and model for such functionality are not.
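The check digit mentioned above can be made concrete with the ten-character ISBN. In that form the first nine characters are digits and the last is a check character, either a digit or X standing for ten. The characters are weighted from 10 down to 1, and the weighted sum must be divisible by 11. The Python sketch below tests that rule; the function name is an assumption of this example, and an FPI scheme would need to define its own check rule.

```python
def isbn10_is_valid(isbn):
    """Check the ISBN-10 check character; hyphens and spaces are ignored."""
    chars = [c for c in isbn.upper() if c not in "- "]
    if len(chars) != 10:
        return False
    total = 0
    for position, char in enumerate(chars):
        if char == "X" and position == 9:
            value = 10  # X is allowed only as the final check character
        elif char in "0123456789":
            value = int(char)
        else:
            return False
        # Weights run from 10 for the first character down to 1 for the last.
        total += (10 - position) * value
    return total % 11 == 0


valid_plain = isbn10_is_valid("0-306-40615-2")
valid_with_x = isbn10_is_valid("0-8044-2957-X")
altered = isbn10_is_valid("0-306-40615-3")
```

A check of this kind catches a mistyped identifier before any lookup is attempted, but it says nothing about whether the identified item still exists. That guarantee has to come from the assignment and maintenance procedures described above.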

From a practical perspective, implementing FPI's requires organizations that mount public information resources on servers to recognize the responsibilities that come along with providing public information resources, and to adopt operational modes that support a more robust and responsible role as publishers. On the Web, home pages come and go and they are almost always under construction. Network-based electronic publishing on the Web and in other environments must adopt controls to ensure reliability and longevity for access to resources that are made publicly available in a formal way. For example, a model could be adopted with various levels of integrity that differentiate between formal public electronic publications such as an electronic journal, and more casual resources such as Jane Smith's home page.

Conclusion

This paper has focussed on the CIMI Information Model, document analysis methodology, and proposed linking mechanisms. Implementing and accessing SGML databases present another range of technical challenges that are beyond the scope of this paper. Briefly, the CHIO Structure project database will be accessed over the Internet using generic WWW browsers. The current generation of web client software requires either an SGML viewer or that SGML data objects be converted to HTML. The CHIO Structure project is using SoftQuad's Panorama, a software product that converts SGML to HTML on the fly. By March 1996, the CHIO Structure project plans to have implemented multiple databases on a single server located at the Canadian Heritage Information Network that begin to demonstrate and implement the principles of the CIMI Information Model.

The second part of the CHIO project is called CHIO Access. Beginning in September 1995, CHIO Access will develop a Z39.50 (ISO 10162/63) attribute set for museums, and then implement it on the CHIO test bed database being built as part of the CHIO Structure project. By the end of this project in March 1997, the CHIO databases will be implemented on multiple servers. By this time, CIMI plans to demonstrate interoperability with the BIB-1 attribute set, demonstrating access to cultural heritage information that is integrated with information accessed on library systems world wide.

More Information

Public CIMI documents are available via file transfer protocol on the CNI server at ftp.cni.org/pub/CIMI.

1 A full description of the Census of Antique Art and Architecture Known to the Renaissance and Witt Computer Index can be found in Joseph A. Busch, "Thinking Ambiguously: Organizing Source Materials for Historical Research." In: Challenges in Indexing Electronic Text and Images, edited by Raya Fidel, ... et al., Medford, NJ: Learned Information, Inc., 1994.

2 See Standards for Archival Description: a Handbook, compiled by Victoria Irons Wallach for the Working Group on Standards for Archival Description, Chicago: The Society of American Archivists, 1994.

3 David Bearman and John Perkins, Standards Framework for the Computer Interchange of Museum Information, Silver Spring, MD: Museum Computer Network, May 1993. The CIMI Framework is available for purchase from the Museum Computer Network, 8720 Georgia Ave., Suite 501, Silver Spring, MD, 20910-3602. Text and Word for Macintosh versions of the CIMI Framework are available via ftp from the Coalition for Networked Information at ftp.cni.org/pub/CIMI/framework, or from the CNI home page at http://www.cni.org/CNI.homepage.html by selecting the link to FTP services.

4 ISO (International Organization for Standardization). ISO 8879: 1986/ A1: 1988 (E). Information Processing -- Text and Office Systems -- Standard Generalized Markup Language (SGML), Amendment 1. Published 1988-07-01. [Geneva]: International Organization for Standardization, 1988.

5 The CIDOC Data Model is available via ftp from the International Council of Museums Documentation Committee at http://www.icom.nrm.se/archives/ICOM/CIDOC/MODEL/Relational.Model/

6 The verso of the title page often contains a preliminary bibliographic record for the document in the form of Cataloging in Publication (CIP) data. With the increasing availability of electronic versions of documents, now principally for typesetting purposes, the bibliographic description currently found in CIP data could be generated automatically by processing a document's front matter with relatively simple algorithms.

7 The Berkeley Finding Aids Project (BFAP) -- recently renamed Encoded Archival Description (EAD) -- has developed word processing macros and Perl scripts to automatically encode personal names within text files of finding aids. Contact Daniel V. Pitti, Advanced Technologies Projects Librarian, University of California, Berkeley, 386 Library, Berkeley, California 94720 dpitti@library.berkeley.edu for information about the Finding Aid project. The Finding Aids list server can be subscribed to at LISTSERV@library.berkeley.edu by sending the message SUB FINDAID <your name>.

8 The Categories for the Description of Works of Art are content guidelines developed by the Art Information Task Force (AITF), an initiative sponsored by the Getty Art History Information Program (AHIP) and College Art Association (CAA). The Categories enhance compatibility between systems containing art information by providing consistent access points. The Categories are available as a hypertext document from the Getty Art History Information Program, 401 Wilshire Blvd., Suite 1100, Santa Monica, CA, 90401, or by sending a message to AITF@Getty.edu.