SGML-MARC:
Incorporating Library Cataloging into the TEI Environment

Stephen Paul Davis, Columbia University


This presentation was originally delivered on March 23, 1996, at the workshop on "The Text Encoding Initiative Guidelines and Their Application to Building Digital Libraries," held in conjunction with the First ACM International Conference on Digital Libraries, Bethesda, Maryland.
  1. INTRODUCTION

    In many respects the invention of the MARC record and related standards has been the most important event in librarianship and bibliography since the Library of Congress began its catalog card distribution service early in this century. It has enabled the creation of immense multinational bibliographic databases for scholars and researchers; it has allowed libraries to make use of automated support for most basic library functions, such as cataloging, acquisitions, and online public access catalogs. It proved the value of standard protocols and content guidelines in promoting the sharing and processing of information. And it put libraries, archives, and others at the forefront of the electronic information revolution.

    But, in most respects, libraries are no longer at the forefront of that revolution. The electronic information environment has exploded outside of libraries in ways that we're all too familiar with, even though we cannot yet really grasp their dimensions. New information technology is now happening, for the most part, in the burgeoning private information industry, in computer science departments, and in scientific research centers.

    That this has happened is of course overwhelmingly positive. Looked at one way, librarians will no longer need to continue to invent all our own standards and protocols and database systems from scratch. Better capitalized and far more innovative groups are now taking care of that for us. How true this is can be illustrated by considering the database search and retrieval systems that libraries and the undercapitalized library automation industry have created for us and our patrons. We have been working on our library public access catalogs for more than twenty years now, and what search and retrieval techniques have we implemented?

    [OVERHEAD 2]

    Yes, after twenty years, the major new retrieval technique we've made available to users of library catalogs, at least, is keyword-Boolean searching. (But at least we've made that complicated!) One might say that, since much library cataloging data has the great advantage of controlled name and subject vocabularies--created at great expense--little more in the way of retrieval technology was needed.

    Unfortunately, there is ample evidence in the literature and in practice that we have not made it nearly easy enough in our online systems to use our own subject thesauri and classification schemes; nor have we created the functionality that would truly allow our data to be used interactively to excavate all the "intelligence" we have built into our databases.

    In recent years, our ability to conceive, agree on, and implement new approaches to retrieval and presentation has been limited by time, money, expertise, dis-economies of scale, available technologies and--I would have to say--vision.

    By contrast, consider the extraordinary creativity and implementation savvy that have appeared on the World Wide Web in a little less than two years.

    [OVERHEAD 3]

    There are over a dozen major indexing servers, such as AltaVista, Excite, Lycos, InfoSeek, Inktomi and others. Some of these services have implemented weighted searching, relevancy feedback based on location and frequency of words, automatic stemming, "more of the same" algorithms, natural language query systems, concept-based searching, semantic trees, etc. etc. Not that all of these are highly successful. And keep in mind that different techniques are required to gather and parse free text data than structured and normalized data. But here's a great case of necessity being the mother of invention. And all this has happened within two years!
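
    To make the idea of ranking by "location and frequency of words" concrete, here is a minimal sketch in Python. It is not how any of the services named above actually work; the function names, the sample documents, and the weight of 3.0 for title hits are all assumptions chosen purely for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def score(query, title, body, title_weight=3.0):
    """Score one document by term frequency, counting title hits more heavily.

    The title_weight value is an arbitrary assumption, not a published figure.
    """
    title_counts = Counter(tokenize(title))
    body_counts = Counter(tokenize(body))
    total = 0.0
    for term in set(tokenize(query)):
        # "Location": a hit in the title is worth more than a hit in the body.
        # "Frequency": every occurrence adds to the score.
        total += title_weight * title_counts[term] + body_counts[term]
    return total


def rank(query, docs):
    """Return ids of matching documents, best score first (ties by id)."""
    scored = [(score(query, d["title"], d["body"]), doc_id)
              for doc_id, d in docs.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for s, doc_id in ordered if s > 0]


# Hypothetical sample documents.
DOCS = {
    "a": {"title": "MARC records", "body": "The MARC format encodes catalog records."},
    "b": {"title": "Web indexing", "body": "Robots gather pages; MARC is rarely used."},
    "c": {"title": "Gardening", "body": "Roses and tulips."},
}

if __name__ == "__main__":
    print(rank("marc", DOCS))
```

    Even a toy like this goes beyond plain keyword-Boolean matching, because it orders the hits rather than merely returning them.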

    The task before us, as the traditional conservers and secondary suppliers of scholarly information, is to make every effort to take advantage of the intensive innovation going on in what was once chiefly our own preserve. At the same time, we have a professional obligation to begin migrating our extensive databases and systems into the new information environment, namely the Internet and the World Wide Web and their successor systems. Unfortunately, our MARC data is still--for the most part--locked up tight in systems designed to support traditional library automation. The major escape route MARC data has, at present, is ANSI/NISO Z39.50 (the protocol for intersystem search and retrieval), which is still--and may always be--a very narrow funnel into the larger information world.

    At the same time that the overall information environment has become more complex, the rules and standards libraries use to describe information have become less and less adequate. Just as we need to migrate our existing data onto a new platform, we need to continue to evolve our practices of description and access to remedy known flaws, to make use of new retrieval techniques, and to accommodate novel types of information objects.

    The work now proceeding to develop an SGML version of MARC is the most obvious and elegant way to accomplish a number of important tasks:

    [OVERHEAD 4]

  2. THE SGML-MARC PROJECT

    Those involved in the process of creating a MARC DTD have included the following individuals, firms and institutions:

    A brief history of this project should note the following milestones:

  3. SGML-MARC DESIGN PRINCIPLES

    [OVERHEAD 5]

    Although no set of design principles has been officially adopted, the LC advisory group's working principles include the following:

  4. DESIGN PROBLEMS & CONSIDERATIONS

    A number of problems and issues have already arisen.

    [OVERHEAD 6]

  5. RELATIONSHIP OF THE TEI HEADER TO THE CATALOG RECORD

    By way of review, the TEI header has four major components:

    [OVERHEAD 7]

    Looked at closely, the TEI header is both less than a catalog record and much more. It is less in the sense that it does not accommodate most of the data elements required by AACR2/LC cataloging nor those additional elements accommodated in MARC but not specified by the cataloging rules. For instance, TEI properly has no data element for "main entry," since this is a purely cataloging-based construct.

    [OVERHEAD 8]

    On the other hand, it does indeed include many of the data elements specified by the cataloging rules and, in fact, by major bibliography style guides such as the Chicago Style Manual and the MLA Style Sheet. This includes elements such as title, statement of responsibility, publication statement, etc.

    Insofar as purely bibliographic elements are concerned, the TEI Header resembles either a highly structured citation or a sparsely structured preliminary catalog record.

    In fact, one of the hopes of the original designers of the TEI Header was to make it possible to generate preliminary MARC catalog records from the data present in the Header. In cases where no catalog record is to be generated, the TEI Header could theoretically provide excellent "self-documentation" for the entity to which it pertains. (On the other hand, the use of the TEI header in practice is quite variable, and at times even slightly embarrassing, at least to the cataloging world. It seems likely to me that a content standard or guideline for the Header will be needed quite soon.)

    In short, there is no necessary relationship between the TEI Header and a catalog record for the same item; however, the TEI Header can be of great help for a cataloger preparing an actual catalog record and, in some cases perhaps, be used to generate a preliminary MARC bibliographic record. Thus the TEI Header and the catalog record serve two different--though related--functions. In the future, perhaps, we may have highly intelligent software "agents" to analyze and catalog electronic entities, and the TEI Header may then be sufficient raw material for the agent to create cataloging, without human intervention. Alternatively, our cataloging standards may decline sufficiently that they prescribe records roughly equivalent to TEI Headers, in which case, too, the full current MARC-AACR2 catalog record would be unnecessary.
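
    As a rough illustration of generating a preliminary record from a Header, the following Python sketch reads a few File Description elements and maps them onto MARC-style fields. Several things here are assumptions: the header is treated as well-formed XML so that a standard parser can read it, the sample content is invented, and the mapping (title and author to 245 $a and $c; place, publisher, and date to 260 $a, $b, and $c) is an illustrative choice rather than an agreed crosswalk.

```python
import xml.etree.ElementTree as ET

# Invented sample header, written as well-formed XML (an assumption).
SAMPLE_HEADER = """
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>An Example Text: an electronic edition</title>
      <author>Doe, Jane</author>
    </titleStmt>
    <publicationStmt>
      <publisher>Example University Libraries</publisher>
      <pubPlace>New York</pubPlace>
      <date>1996</date>
    </publicationStmt>
  </fileDesc>
</teiHeader>
"""


def text_of(root, path):
    """Return the whitespace-normalized text at path, or None if absent."""
    node = root.find(path)
    if node is None or node.text is None:
        return None
    return " ".join(node.text.split())


def preliminary_marc(header_xml):
    """Build a list of (tag, [(subfield code, value), ...]) from a header.

    The field mapping is a hypothetical one chosen for this sketch.
    """
    root = ET.fromstring(header_xml.strip())
    candidates = [
        ("245", [("a", text_of(root, "fileDesc/titleStmt/title")),
                 ("c", text_of(root, "fileDesc/titleStmt/author"))]),
        ("260", [("a", text_of(root, "fileDesc/publicationStmt/pubPlace")),
                 ("b", text_of(root, "fileDesc/publicationStmt/publisher")),
                 ("c", text_of(root, "fileDesc/publicationStmt/date"))]),
    ]
    fields = []
    for tag, subfields in candidates:
        present = [(code, value) for code, value in subfields if value]
        if present:  # skip a field entirely when the header supplies nothing
            fields.append((tag, present))
    return fields


def format_field(tag, subfields):
    """Render one field as a single display line, e.g. '260 $a ... $b ...'."""
    return tag + " " + " ".join(f"${code} {value}" for code, value in subfields)


if __name__ == "__main__":
    for tag, subfields in preliminary_marc(SAMPLE_HEADER):
        print(format_field(tag, subfields))
```

    Note what the sketch cannot supply: there is no main entry, no controlled name or subject heading, and no ISBD punctuation. Those would still have to come from a cataloger, which is why the result is only a preliminary record.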

    Beyond the TEI's File Description element, a great deal of information can be included in the Header that would not typically be found in a bibliographic record; for example, detailed characterization of the item's content or comprehensive information about the process of encoding the item.

    [OVERHEAD 9]

    In artifactual terms the TEI Header--if it is included--should actually be considered an intrinsic part of the electronic item in which it resides. Under no circumstances would one delete a TEI Header simply because a completed catalog record was available and perhaps embedded in the item--any more than one would tear the title page, acknowledgments, and author's note out of a book after cataloging it.

    A pragmatic argument for keeping the two types of meta-information distinct is to help preserve the authoritativeness of the catalog record--created and encoded according to national and international standards--and not intermingle it with less consistent and less uniformly-applied descriptive data supplied by a publisher or distributor.

    It is also likely that, as time goes by, other schemes of metadata will be devised for inclusion within electronic publications, not replacing any embedded TEI Header or MARC catalog record, but supplementing them or supporting entirely different functionality.

  6. ADDITIONAL TASKS RELATED TO SGML-MARC & DIGITAL LIBRARY CATALOGING

    [OVERHEAD 10]

    Assuming that the project to create and formalize a MARC DTD is successful, a number of other related tasks await us.

  7. COLUMBIA'S WORK WITH HIERARCHICAL SGML-MARC RECORDS

    A project currently under way at Columbia University Libraries attempts to address some of this next set of tasks. Specifically, as part of our digital library cataloging of collections of digital images and of electronic reproductions of printed texts, we are beginning to catalog directly into an SGML-MARC record. If this project is successful, the "master records" for much of our digital library will be created and stored in SGML-MARC; from that master record we will derive both a MARC record for loading into local and national MARC databases and an HTML record that can be readily used in our campus Web information systems.
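
    The "one master record, two derived records" idea can be sketched in a few lines of Python. This is not Columbia's implementation: the element and attribute names (record, field, sf, tag, code) are hypothetical stand-ins for whatever the MARC DTD defines, the master is assumed to be well-formed XML, and the display labels are invented.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical master record; element names are NOT taken from the MARC DTD.
MASTER = """
<record>
  <field tag="245"><sf code="a">Views of the campus &amp; city</sf></field>
  <field tag="260"><sf code="b">Example Library</sf><sf code="c">1996</sf></field>
</record>
"""

# Invented display labels for the Web view.
LABELS = {"245": "Title", "260": "Publication"}


def read_master(xml_text):
    """Parse the master into a list of (tag, [(subfield code, value), ...])."""
    root = ET.fromstring(xml_text.strip())
    return [(f.get("tag"), [(s.get("code"), s.text or "") for s in f.findall("sf")])
            for f in root.findall("field")]


def to_marc_lines(fields):
    """Derive MARC-style display lines (not the binary exchange format)."""
    return [tag + " " + " ".join(f"${code} {value}" for code, value in subs)
            for tag, subs in fields]


def to_html(fields):
    """Derive a labeled HTML definition list for a Web display."""
    rows = []
    for tag, subs in fields:
        label = LABELS.get(tag, tag)
        value = " ".join(value for _, value in subs)
        rows.append(f"<dt>{html.escape(label)}</dt><dd>{html.escape(value)}</dd>")
    return "<dl>" + "".join(rows) + "</dl>"


if __name__ == "__main__":
    parsed = read_master(MASTER)
    print("\n".join(to_marc_lines(parsed)))
    print(to_html(parsed))
```

    The point of the design is that both outputs are generated views: corrections are made once, in the master, and the MARC and HTML versions are simply regenerated.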

    [OVERHEAD 11] -- [OVERHEAD 12]

    Our objective in beginning this effort was not simply to rush into the SGML world, which is still very much in its infancy and even somewhat uncertain as to its long-term prospects. Instead, we were faced with trying to catalog complex electronic entities which could not adequately be handled in MARC or our local MARC-based system. At the same time, we wanted to make this complex metadata easily available on our campus WWW-based system.

    In order to achieve this, we have extended the prototype SGML-MARC DTD to accommodate one or more subrecords within a single SGML-MARC record. These subrecords can describe multiple versions of the same image or individual pieces of a large collection, while at the same time linking directly to the digital object.

    [OVERHEAD 13]

    As an example, consider a digital image record for

    [OVERHEAD 14]

    Here's another example of the complexity we must accommodate. Consider:

    In short, the "multiple version problem" and the "component part (analytic) problem," infamous among librarians at least, have become so significant that they simply can no longer be side-stepped.

    Columbia's subrecord technique allows for multiple levels of hierarchy in the same record, along with multiple component parts. The approach takes advantage of SGML's more "narrative" approach to description, rather than the atomized, separate-record technique that has been typical of MARC and AACR2.
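
    A minimal Python sketch of such a hierarchy follows. It only models the shape of the idea, a record that contains subrecords that may themselves contain subrecords, each optionally linked to a digital object. The class name, field names, sample titles, and example.org URLs are all hypothetical and do not reflect the actual DTD extension.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Record:
    """One descriptive unit; subrecords nest to any depth (hypothetical model)."""
    title: str
    url: Optional[str] = None  # link to the digital object, if there is one
    subrecords: List["Record"] = field(default_factory=list)


def walk(record: Record, depth: int = 0) -> Iterator[Tuple[int, Record]]:
    """Yield (depth, record) for the record and all its descendants, parents first."""
    yield depth, record
    for sub in record.subrecords:
        yield from walk(sub, depth + 1)


def outline(record: Record) -> List[str]:
    """Render the hierarchy as indented lines, two spaces per level."""
    lines = []
    for depth, rec in walk(record):
        link = f" <{rec.url}>" if rec.url else ""
        lines.append("  " * depth + rec.title + link)
    return lines


# Invented example: a collection, its component parts, and multiple versions.
COLLECTION = Record("Photograph collection", subrecords=[
    Record("Photograph 1", subrecords=[
        Record("Thumbnail version", url="http://example.org/1-thumb.gif"),
        Record("Full-size version", url="http://example.org/1-full.jpg"),
    ]),
    Record("Photograph 2"),
])

if __name__ == "__main__":
    print("\n".join(outline(COLLECTION)))
```

    Here component parts (the individual photographs) and multiple versions (thumbnail and full-size) live in one record at different levels, instead of being split across separately linked catalog records.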

    In developing this approach, Columbia is watching carefully as new proposals for cataloging complex digital objects are put forward. We hope to use our SGML-MARC system as a testbed as work proceeds.

  8. CONCLUSION

    In conclusion, it must be said that the project to create a MARC DTD is coming none too soon. Events in the new electronic information world are developing quickly and unpredictably. Libraries and archives now have an opportunity to migrate their metadata (and, over time, large portions of their actual collections), and that opportunity must not be missed.

    The current rapid innovation in and creative approaches to metadata, search and retrieval software, and presentation technology are a great advantage for us, but only if we are prepared to move our metadata onto new platforms where these innovations can be brought to bear. We also need to evolve our own cataloging practices both to respond to and to take advantage of new forms of electronic information.

    In many ways we are finally reaching the end of cataloging strategies and techniques elaborated over a century ago; but we are only at the very beginning of working out what comes next. It would be precipitous to imagine that we can plunge ahead now and invent the new principles and strategies we'll need for the new "information age," still in its infancy. More importantly, we need to put ourselves in the position, technically and organizationally, where we can experiment, test and let our strategies evolve, while continuing to bring along the rich intellectual resources libraries were creating long before they even knew they were in the business of "metadata."


Stephen Paul Davis -- Last update: 3/25/96