SGML-MARC: Incorporating Library Cataloging into the TEI Environment (Text Version)

SGML-MARC:
Incorporating Library Cataloging into the TEI Environment

Stephen Paul Davis, Columbia University

This presentation was originally delivered on March 23, 1996 at the workshop on "The Text Encoding Initiative Guidelines and Their Application to Building Digital Libraries," held in conjunction with the First ACM International Conference on Digital Libraries, Bethesda, Maryland.

INTRODUCTION
In many respects the invention of the MARC record and related standards has been the most important event in librarianship and bibliography since the Library of Congress began its catalog card distribution service early in this century. It has enabled the creation of immense multinational bibliographic databases for scholars and researchers; it has allowed libraries to make use of automated support for most basic library functions, such as cataloging, acquisitions, and online public access catalogs. It proved the value of standard protocols and content guidelines in promoting the sharing and processing of information. And it put libraries, archives, and others at the forefront of the electronic information revolution.
But, in most respects, libraries are no longer at the forefront of that revolution. The electronic information environment has exploded outside of libraries in ways that we are all too familiar with, yet whose full dimensions we are incapable of really understanding. New information technology is now happening, for the most part, in the burgeoning private information industry, in computer science departments, and in scientific research centers.
That this has happened is of course overwhelmingly positive. Looked at one way, librarians will no longer need to continue to invent all our own standards and protocols and database systems from scratch. Better capitalized and far more innovative groups are now taking care of that for us. How true this is can be illustrated by considering the database search and retrieval systems that libraries and the undercapitalized library automation industry have created for us and our patrons. We have been working on our library public access catalogs for more than twenty years now, and what search and retrieval techniques have we implemented?

[OVERHEAD 2]
- Simple author searching
- Simple title searching
- Combined partial author-title searching
- Simple subject searching
- Keyword-Boolean searching
Yes, after twenty years, the major new retrieval technique we've made available to users of library catalogs, at least, is keyword-Boolean searching. (But at least we've made that complicated!) One might say that since much library cataloging data has the great advantage of controlled name and subject vocabularies--created at great expense--that little more in the way of retrieval technology was needed.
Unfortunately, there has been ample evidence in the literature and in practice that shows that we have not made it nearly easy enough in our online systems to use our own subject thesauri and classification schemes; nor have we created the functionality that would truly allow our data to be used interactively to excavate all the "intelligence" we have built into our databases.
In recent years, our ability to conceive, agree on, and implement new approaches to retrieval and presentation has been limited by time, money, expertise, dis-economies of scale, available technologies and--I would have to say--vision.
By contrast, consider the extraordinary creativity and implementation savvy that has occurred on the World Wide Web in a little less than two years.

[OVERHEAD 3]
There are over a dozen major indexing servers, such as AltaVista, Excite, Lycos, InfoSeek, Inktomi and others. Some of these services have implemented weighted searching, relevancy feedback based on location and frequency of words, automatic stemming, "more of the same" algorithms, natural language query systems, concept-based searching, semantic trees, etc. Not that all of these are highly successful. And keep in mind that different techniques are required to gather and parse free text data than structured and normalized data. But here's a great case of necessity being the mother of invention. And all this has happened within two years!
The task before us, as the traditional conservers and secondary suppliers of scholarly information, is to make every effort to take advantage of the intensive innovation going on in what was once chiefly our own preserve. At the same time, we have a professional obligation to begin migrating our extensive databases and systems into the new information environment, namely the Internet and the World Wide Web and their successor systems. Unfortunately, our MARC data is still--for the most part--locked up tight in systems designed to support traditional library automation. The major escape route MARC data has, at present, is ANSI/NISO Z39.50 (the protocol for intersystem search and retrieval), which is still--and may always be--a very narrow funnel into the larger information world.
At the same time that the overall information environment has become more complex, the rules and standards libraries use to describe information have become less and less adequate. Just as we need to migrate our existing data onto a new platform, we need to continue to evolve our practices of description and access to remedy known flaws, to make use of new retrieval techniques, and to accommodate novel types of information objects.
The work now proceeding to develop an SGML version of MARC is the most obvious and elegant way to accomplish a number of important tasks:

[OVERHEAD 4]
- bring our "legacy" data along into the Web and its successors
- continue to be able to use the MARC format and our existing MARC-based systems without necessarily pushing them--at great expense--to solve problems for which they were never intended
- provide a more flexible context for experimentation in our cataloging practices
- take advantage of newer retrieval tools
- facilitate the creation of broader composite "metadata" systems
THE SGML-MARC PROJECT
Those involved in the process of creating a MARC DTD have included the following individuals, firms and institutions:
- SoftQuad Inc., a software developer in the area of SGML, and the designer of Panorama, virtually the only widely available Web-based SGML viewer
- Jerome McDonough at the Berkeley School of Library and Information Science
- The Library of Congress's Network Development and MARC Standards Office
- An LC advisory group consisting of:
  - Randall Barry, Library of Congress
  - Gary Gilliam, Autographics
  - Stephen Davis, Columbia University
  - Steve Rose, Electronic Book Technologies
  - Debora Lapeyre, ATLIS
  - Ralph Levan, OCLC
  - Sally McCallum, Library of Congress
  - Jerome McDonough, UC Berkeley
  - Daniel Pitti, UC Berkeley
  - C. Michael Sperberg-McQueen, University of Illinois
A brief history of this project should note the following milestones:
- In 1990 SoftQuad Inc. developed a prototype SGML Document Type Definition (DTD) for the MARC record;
- In 1994, Jerome McDonough at Berkeley prepared another prototype MARC DTD;
- In October, 1995, the Library of Congress called an advisory group (above) to help decide the design principles for an "official" MARC DTD, to be developed by the Library of Congress and an outside consultant.
- In March 1996, the Library of Congress contracted with ATLIS, a firm specializing in SGML work, to create the MARC-DTD, so work is well underway.
- Some time in 1996, the Library of Congress will issue a second contract for the conversion programs to and from MARC-SGML; when these are completed they will be made freely available.
- Beginning in 1996, building on the work done by Berkeley and the principles articulated at the October 1995 meeting, Columbia University began experimenting with extending the proposed MARC DTD for the application of new cataloging approaches, specifically for complex electronic entities, with the goal of making these SGML-MARC records directly available within Web-based delivery systems.
SGML-MARC DESIGN PRINCIPLES

[OVERHEAD 5]
Although no set of design principles has been officially adopted, the LC advisory group's working principles include the following:
- the MARC DTD should enable 100% convertibility of actual MARC records into SGML without loss of data; similarly it should allow 100% convertibility from SGML to MARC, also without loss of data. I.e., full reversibility.
- the MARC DTD should correspond to the current MARC standard and should be maintained in parallel with that standard
- MARC record-structure elements (e.g., the length of record & the directories) do not need to be preserved in SGML-MARC, since they can be recalculated at the time an SGML-MARC record is converted into MARC.
- software utilities, to be created as part of this project, will be capable of performing these conversions.
- the SGML-MARC record should be able to reside either independently as metadata or embedded with the SGML document which it describes.
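The reversibility and record-structure principles above can be illustrated with a minimal sketch. It is not the MARC DTD under development: the element names (`marcrecord`, `field`, `subfield`) are hypothetical, XML parsed by Python's standard library stands in for SGML, the sample record is invented, and the directory calculation assumes one byte per character.

```python
import xml.etree.ElementTree as ET

# Invented sample record: (tag, indicators, [(subfield code, value), ...]).
RECORD = [
    ("100", "1 ", [("a", "Author, A.")]),
    ("245", "10", [("a", "Sample title /"), ("c", "A. Author.")]),
]

FIELD_END = "\x1e"      # MARC field terminator
SUBFIELD_MARK = "\x1f"  # MARC subfield delimiter


def to_markup(record):
    """Serialize the fields as markup; element names are hypothetical."""
    root = ET.Element("marcrecord")
    for tag, indicators, subfields in record:
        field = ET.SubElement(root, "field", tag=tag, ind=indicators)
        for code, value in subfields:
            ET.SubElement(field, "subfield", code=code).text = value
    return ET.tostring(root, encoding="unicode")


def from_markup(text):
    """Rebuild the field list from the markup produced by to_markup."""
    record = []
    for field in ET.fromstring(text):
        subfields = [(sf.get("code"), sf.text) for sf in field]
        record.append((field.get("tag"), field.get("ind"), subfields))
    return record


def directory(record):
    """Recompute directory entries: 3-char tag, 4-digit length, 5-digit offset."""
    entries, offset = [], 0
    for tag, indicators, subfields in record:
        body = indicators + "".join(SUBFIELD_MARK + c + v for c, v in subfields)
        length = len(body + FIELD_END)  # assumes one byte per character
        entries.append(f"{tag}{length:04d}{offset:05d}")
        offset += length
    return entries


# Full reversibility: the record survives a round trip unchanged.
assert from_markup(to_markup(RECORD)) == RECORD
print(directory(RECORD))
```

Note that the markup carries no lengths or offsets at all; the directory is derived from the field content only at the moment the record is written back out as MARC.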
DESIGN PROBLEMS & CONSIDERATIONS
A number of problems and issues have already arisen.

[OVERHEAD 6]
- MARC is largely and increasingly a "data dictionary" of discrete elements; their order may or may not be prescribed, but their meaning is not dependent on their order; by contrast, SGML is inherently more hierarchical, and data element definition is closely dependent upon context.
- MARC purposely includes many so-called "obsolete" data elements, which continue to exist in MARC records stored in large and small bibliographic databases; should these be included in the SGML-MARC standard, or should only the currently valid version of MARC be supported?
- MARC explicitly accommodates the addition of locally defined data elements; what accommodation should be made for these elements within SGML-MARC, where the DTD must pre-define all valid data elements in all conformant records?
- Should SGML-MARC be entirely bounded by the MARC standard, or can it even from the outset include additional features?
- How should character sets be handled? MARC has its own specified character set; the SGML TEI world typically uses a somewhat different character set. Yet other character sets are now becoming available.
- Should SGML Entity-type links be supported, or should the increasingly prevalent URL, URN, and PURL conventions developed for the WWW be accommodated, as they are in the MARC 856 field?
RELATIONSHIP OF THE TEI HEADER TO THE CATALOG RECORD
By way of review, the TEI header has four major components:

[OVERHEAD 7]
- File Description: a combination of the title page and the verso of the title page
- Encoding Description: an explication of the relationship--if any--of the encoded text to its source or sources
- Profile Description: a content description beyond basic bibliographic identification
- Revision Description: a detailed change log of any and all changes to the document
Looked at closely, the TEI header is both less than a catalog record and much more. It is less in the sense that it does not accommodate most of the data elements required by AACR2/LC cataloging nor those additional elements accommodated in MARC but not specified by the cataloging rules. For instance, TEI properly has no data element for "main entry," since this is a purely cataloging-based construct.
[OVERHEAD 8]
On the other hand, it does indeed include many of the data elements specified by the cataloging rules and, in fact, by major bibliography style guides such as the Chicago Style Manual and the MLA Style Sheet. This includes elements such as title, statement of responsibility, publication statement, etc.
Insofar as purely bibliographic elements are concerned, the TEI Header resembles either a highly structured citation or a sparsely structured preliminary catalog record.
In fact, one of the hopes of the original designers of the TEI Header was to make it possible to generate preliminary MARC catalog records from the data present in the Header. In cases where no catalog record is to be generated, the TEI Header could theoretically provide excellent "self-documentation" for the entity to which it pertains. (On the other hand, the actual use of the TEI header in reality is quite variable, and at times even slightly embarrassing, at least to the cataloging world. It seems likely to me that a content standard or guideline for the Header will probably be needed quite soon.)
In short, there is no necessary relationship between the TEI Header and a catalog record for the same item; however, the TEI Header can be of great help for a cataloger preparing an actual catalog record and, in some cases perhaps, be used to generate a preliminary MARC bibliographic record. Thus the TEI Header and the catalog record serve two different--though related--functions. In the future, perhaps, we may have highly intelligent software "agents" to analyze and catalog electronic entities, and the TEI Header may then be sufficient raw material for the agent to create cataloging, without human intervention. Alternatively, our cataloging standards may decline sufficiently that they prescribe records roughly equivalent to TEI Headers, in which case, too, the full current MARC-AACR2 catalog record would be unnecessary.
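As a rough illustration of how a preliminary record might be generated from a Header, the sketch below maps a few File Description elements to MARC fields. The element names come from the TEI Guidelines, but the header is written as XML so that Python's standard library can parse it, its content is invented, and the crosswalk table is an assumption made for this example rather than any agreed mapping.

```python
import xml.etree.ElementTree as ET

# Invented sample header; only the File Description is filled in.
HEADER = """
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>Sample title</title>
      <author>Author, A.</author>
    </titleStmt>
    <publicationStmt>
      <publisher>Example Press</publisher>
      <pubPlace>New York</pubPlace>
      <date>1996</date>
    </publicationStmt>
    <sourceDesc><p>Transcribed from a printed edition.</p></sourceDesc>
  </fileDesc>
</teiHeader>
"""

# Hypothetical crosswalk: header path -> (MARC tag, subfield code).
CROSSWALK = [
    ("fileDesc/titleStmt/title", "245", "a"),
    ("fileDesc/titleStmt/author", "245", "c"),
    ("fileDesc/publicationStmt/pubPlace", "260", "a"),
    ("fileDesc/publicationStmt/publisher", "260", "b"),
    ("fileDesc/publicationStmt/date", "260", "c"),
]


def preliminary_record(header_text):
    """Collect mapped header values into {tag: [(code, value), ...]}."""
    header = ET.fromstring(header_text)
    fields = {}
    for path, tag, code in CROSSWALK:
        element = header.find(path)
        if element is not None and element.text:
            fields.setdefault(tag, []).append((code, element.text.strip()))
    return fields


record = preliminary_record(HEADER)
for tag in sorted(record):
    print(tag, " ".join(f"${code} {value}" for code, value in record[tag]))
```

The sketch deliberately produces no 100 (main entry) field: choosing a main entry and establishing the controlled form of the name remain cataloging decisions, which is why the result is only a preliminary record.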
Beyond the TEI's File Description element, a great deal of information can be included in the Header that would not typically be found in a bibliographic record; for example, detailed characterization of the item's content or comprehensive information about the process of encoding the item.

[OVERHEAD 9]
In artifactual terms the TEI Header--if it is included--should actually be considered an intrinsic part of the electronic item in which it resides. Under no circumstances would one delete a TEI Header simply because a completed catalog record was available and perhaps embedded in the item--any more than one would tear the title page, acknowledgments, and author's note out of a book after cataloging it.
A pragmatic argument for keeping the two types of meta-information distinct is to help preserve the authoritativeness of the catalog record--created and encoded according to national and international standards--and not intermingle it with less consistent and less uniformly-applied descriptive data supplied by a publisher or distributor.
It is also likely that, as time goes by, other schemes of metadata will be devised for inclusion within electronic publications, not replacing any embedded TEI Header or MARC catalog record, but supplementing them or supporting entirely different functionality.
ADDITIONAL TASKS RELATED TO SGML-MARC & DIGITAL LIBRARY CATALOGING

[OVERHEAD 10]
Assuming that the project to create and formalize a MARC DTD is successful, a number of other related tasks await us.
- We must devise ways to keep SGML MARC files in synch with corresponding MARC files, e.g., with respect to authorities and other "catalog maintenance."
- We must continue to investigate needed changes in cataloging practice both to be able to describe and "bibliographically control" new types of documents and publications; and also to make use of new indexing, retrieval and display tools and concepts.
- We must organize a timely MARC-SGML maintenance process that will allow the flexible testing of new approaches beyond those feasible in MARC. Something like the Z39.50 Implementors Group and the Testbed Group may be in order. It is essential that we not constrain the MARC-SGML DTD by what can be accommodated in MARC nor by the sometimes overly-deliberate standards process we currently use with the MARC formats. (This kind of formalized standards activity may well be needed eventually; but there needs to be a substantial period of experimentation before then.)
- We must develop retrieval and control systems in which SGML-MARC can, when necessary, be merged into other systems of metadata, and where varying kinds of metadata can be linked and synchronized with the actual electronic (and non-electronic) publications to which they relate.
- We must create national and international systems and protocols for ensuring that electronic publications are "authenticated" with regard to accuracy, completeness and intellectual responsibility.
- We must continue to work toward a national & international metadata plan that will encourage the creation of standard "self-documenting" digital objects by information providers.
COLUMBIA'S WORK WITH HIERARCHICAL SGML-MARC RECORDS
A project currently under way at Columbia University Libraries attempts to address some of this next set of tasks. Specifically, as part of our digital library cataloging of collections of digital images and of electronic reproductions of printed texts, we are beginning to catalog directly into an SGML-MARC record. If this project is successful, the "master records" for much of our digital library will be created and stored in SGML-MARC; from that master record we will derive both a MARC record for loading into local and national MARC databases, as well as an HTML record that can be readily used in our campus Web information systems.
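The idea of deriving two outputs from one master record can be sketched as follows. This is not Columbia's actual system: the element names, the display labels, the sample content and the example.edu address are all invented for illustration, and XML stands in for SGML so that Python's standard library can parse it.

```python
import html
import xml.etree.ElementTree as ET

# Invented master record; element names are hypothetical.
MASTER = """<marcrecord>
  <field tag="245"><subfield code="a">Sample image collection</subfield></field>
  <field tag="520"><subfield code="a">Photographs &amp; drawings.</subfield></field>
  <field tag="856"><subfield code="u">http://www.example.edu/images/</subfield></field>
</marcrecord>"""

# Display labels chosen for this example only.
LABELS = {"245": "Title", "520": "Summary", "856": "Online access"}


def to_marc_fields(master_text):
    """Derive a flat (tag, subfields) list for loading into a MARC system."""
    return [
        (field.get("tag"), [(sf.get("code"), sf.text) for sf in field.iter("subfield")])
        for field in ET.fromstring(master_text).iter("field")
    ]


def to_html(master_text):
    """Derive a labeled HTML display, turning the 856 address into a link."""
    rows = []
    for field in ET.fromstring(master_text).iter("field"):
        tag = field.get("tag")
        value = " ".join(sf.text or "" for sf in field.iter("subfield"))
        cell = html.escape(value)
        if tag == "856":
            cell = f'<a href="{cell}">{cell}</a>'
        rows.append(f"<dt>{html.escape(LABELS.get(tag, tag))}</dt><dd>{cell}</dd>")
    return "<dl>\n" + "\n".join(rows) + "\n</dl>"


print(to_marc_fields(MASTER))
print(to_html(MASTER))
```

Because both outputs are computed from the same source, a correction made once in the master record reaches the MARC database and the Web display alike.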

[OVERHEAD 11] -- [OVERHEAD 12]
Our objective in beginning this effort was not simply to rush into the SGML world, which is still very much in its infancy and even somewhat uncertain as to its long-term prospects. Instead, we were faced with trying to catalog complex electronic entities which could not adequately be handled in MARC or our local MARC-based system. At the same time, we wanted to make this complex metadata easily available on our campus WWW-based system.
In order to achieve this, we have extended the prototype SGML-MARC DTD to accommodate one or more subrecords within a single SGML-MARC record, for the purpose of describing multiple versions of the same image or individual pieces of a large collection, while at the same time being able to link directly to the digital object.
[OVERHEAD 13]
As an example, consider a digital image record for
- a collection of images
- arranged into subcollections
- with individual item records
- each with multiple linked electronic images, e.g., different sizes, resolutions
- descriptions of intermediary reproductions of originals, from which the electronic images were scanned, e.g., slides, transparencies, photographs
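A record of this shape might be sketched as nested subrecords. The element and attribute names below (`subrecord`, `level`, `link`) are hypothetical and are not taken from Columbia's extended DTD; the titles and example.edu addresses are invented, and XML stands in for SGML.

```python
import xml.etree.ElementTree as ET


def subrecord(parent, level, title, links=()):
    """Attach a nested subrecord with an optional list of (note, url) links."""
    node = ET.SubElement(parent, "subrecord", level=level)
    ET.SubElement(node, "title").text = title
    for note, url in links:
        ET.SubElement(node, "link", note=note, href=url)
    return node


# One record holding a collection, a subcollection, an item with two
# electronic versions, and the intermediary slide the scans were made from.
record = ET.Element("record")
collection = subrecord(record, "collection", "Sample image collection")
group = subrecord(collection, "subcollection", "Building exteriors")
item = subrecord(group, "item", "North facade", links=[
    ("thumbnail", "http://www.example.edu/img/0001-small.gif"),
    ("full size", "http://www.example.edu/img/0001-large.jpg"),
])
subrecord(item, "intermediary", "35 mm slide from which the image was scanned")


def outline(node, depth=0):
    """Walk the hierarchy and return one indented line per subrecord."""
    lines = []
    for child in node.findall("subrecord"):
        count = len(child.findall("link"))
        lines.append("  " * depth
                     + f"{child.get('level')}: {child.findtext('title')} ({count} links)")
        lines.extend(outline(child, depth + 1))
    return lines


print("\n".join(outline(record)))
```

The point of the nesting is that each level inherits its context from the levels above it, so the item-level description does not have to repeat the collection-level data, as separate linked MARC records would.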
[OVERHEAD 14]
Here's another example of the complexity we must accommodate. Consider:
- a printed book with folded maps, color plates, B&W drawings and 300 pages of text.
- a microfilm made of the entire book, including illustrations
- a set of 300 dpi bitmapped images of the whole book, excepting the color illustrations and maps
- a TEI-encoded version of the document, derived from an OCR scan of the bitmaps.
- a high resolution electronic image scanned from the original oversized maps
- digital reproductions of the color illustrations at multiple resolutions, made from slides of the originals.
In short, the among-librarians-at-least infamous "multiple version problem" and "component part (analytic) problem" have become so significant they simply can no longer be side-stepped.
Columbia's subrecord technique allows for multiple levels of hierarchy in the same record, along with multiple component parts. The approach takes advantage of SGML's more "narrative" approach to description, rather than the atomized, separate-record technique that has been typical of MARC and AACR2.
In developing this approach, Columbia is watching carefully as new proposals for cataloging complex digital objects are put forward. We hope to use our SGML-MARC system as a testbed as work proceeds.
CONCLUSION
In conclusion, it must be said that the project to create a MARC DTD is coming none too soon. Events in the new electronic information world are developing quickly and unpredictably. Libraries and archives have an opportunity now to migrate their metadata (and, over time, large portions of their actual collections) which must not be missed.
The current rapid innovation in and creative approaches to metadata, search and retrieval software, and presentation technology are a great advantage for us; but only if we are prepared to move our metadata onto new platforms where these innovations can be brought to bear. We also need to evolve our own cataloging practices both to respond to and to take advantage of new forms of electronic information.
In many ways we are finally reaching the end of cataloging strategies and techniques elaborated over a century ago; but we are only at the very beginning of working out what comes next. It would be precipitous to imagine that we can plunge ahead now and invent the new principles and strategies we'll need for the new "information age," still in its infancy. More importantly, we need to put ourselves in the position, technically and organizationally, where we can experiment, test and let our strategies evolve, while continuing to bring along the rich intellectual resources libraries were creating long before they even knew they were in the business of "metadata."

Stephen Paul Davis, -- Last update: 3/25/96