SGML: Standard Generalized Markup Language and the Transformation of Cataloging

                  Standard Generalized Markup Language
                                 and the
                      Transformation of Cataloging


                              Daniel V. Pitti
              Librarian for Advanced Technologies Projects
                               The Library
                               386 Library
                   University of California, Berkeley
                     Berkeley, California 94720-6000

Note: Paper presented at the annual conference of the North American Serials Interest Group,
June 1994, Vancouver; publication forthcoming from Haworth Press.

     Last fall I attended the Association of Research Libraries / Association of American
University Presses third seminar on electronic publishing on the network. Speaker after speaker,
some teachers, some publishers, some librarians, expressed doubt and uncertainty about where
"it" was all leading. The publishers seemed to be an especially anxious lot, though here
and there a trace of excitement was evident. From the increasingly familiar litany of issues,
concerns, and problems mentioned by many of the speakers -- copyright, cost recovery, access,
various forms of control, peer review, preservation, authenticity -- one theme clearly emerged:
whatever else happens on the Internet, the expertise of the professional cataloger will be essential
to ensure that people will be able to find and retrieve the information they seek. In fact, most of
the speakers really said "librarians," but I, rather we, know what they really meant. The cataloger's
experience-honed ability to distill, organize, interrelate, and integrate information is badly needed
on the network. To build shamelessly on a bad metaphor, I like to think of today's professional
catalogers as tomorrow's information superhighway patrol persons. If there is to be more than
data carnage on the information superhighway, mangled bits and bytes strewn among delinquent
digital images and unruly, riotous texts, we need the civilizing touch of the preeminent law and
order librarian, the cataloger. The day is rapidly approaching when the serial cataloger will be
able to fix her steely, pitiless glare on the publisher, who, in a sick abuse of illicitly obtained inside
knowledge of AACR2R Rule 21.2, changes the electronic journal title issue by issue in some
intellectually trivial but bibliographically significant way, and say to him, "Make my day!"

     Another topic that arose again and again at the ARL/AAUP seminar was SGML, or
Standard Generalized Markup Language (ISO 8879). If widely adopted, and it appears to be
heading in that direction, SGML will help tomorrow's catalogers civilize the network. One
speaker at the seminar noted that at the first seminar in 1991, SGML was barely mentioned, and
only a few people had any idea what it was. The situation has changed significantly in the last two
years. Today I want to explain why most of you have at least heard of SGML, and I want to say a
word or two about how it might have an impact on your work. 

Introduction to Standard Generalized Markup Language.

     While Standard Generalized Markup Language is both standard (ISO 8879) and
generalized, it is not really a markup language as such. SGML does not provide an off the shelf
markup language that one can simply take home and apply to a letter, a novel, an article, a
software manual, or a catalog record. What it really is, in fact, is a markup language meta-
standard, or in simpler words, a standard for constructing markup languages. SGML provides a
syntax and a meta-language for defining and expressing the logical structure of documents, and
conventions for naming the components or elements of documents. One can think of SGML as a
set of formal rules for defining specific markup languages for individual kinds of documents.
Using these formal rules, a community sharing a particular kind of document can get together and
create a markup language specific to that document type. These specific markup languages,
written in compliance with formal SGML requirements, are called Document Type Definitions,
or DTDs. For example, the Association of American Publishers with OCLC has developed a set
of three DTDs: one for books, one for journals, and one for journal articles. A consortium of
software developers and producers is developing a DTD for computer manuals. I am working on
a DTD for library, museum, and archival finding aids. A colleague of mine at Berkeley has
developed a USMARC DTD for use in a prototype bibliographic catalog employing advanced
retrieval technology. Document Type Definitions shared and followed by a community are
themselves standards. The Association of American Publishers DTD is registered as ANSI/NISO
Z39.59-1988, and after substantial revision, has been approved just this year as an international
standard, ISO 12083. If I design the finding aid DTD well, I hope to have it serve as the basis for
a standard in the library and archival world. Hence, as you can see, SGML is very general and
abstract. It exists formally over and above individual markup languages. It is also a standard,
which is to say, a formal set of conventions in the public domain, not owned by and thus not
dependent on any hardware or software producer. That SGML is a standard offers its users
reasonable assurance that the information we create will not become obsolete because of
hardware and software developments. This is not true of proprietary data formats.
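
     To make the idea of a document type as a set of formal rules concrete, here is a minimal
sketch in Python. It is an illustration only, not SGML software. The element names, the
ARTICLE_TYPE table, and the validate function are all hypothetical; the rules are written as
regular expressions rather than in SGML declaration syntax; and the sample documents are fully
tagged so that Python's standard XML parser can read them.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical document type for a short article. Each element name maps to a
# regular expression over the sequence of its child element names (each name
# followed by one space). An empty pattern means "text only, no child elements".
ARTICLE_TYPE = {
    "article": r"(title )(author )+(section )+",
    "section": r"(heading )(para )+",
    "title": r"",
    "author": r"",
    "heading": r"",
    "para": r"",
}


def validate(element, doc_type):
    """Return a list of rule violations found in element and its descendants."""
    model = doc_type.get(element.tag)
    if model is None:
        return [f"<{element.tag}> is not declared in this document type"]
    errors = []
    children = "".join(child.tag + " " for child in element)
    if re.fullmatch(model, children) is None:
        found = children.strip() or "no child elements"
        errors.append(f"<{element.tag}> breaks its content rule; found: {found}")
    for child in element:
        errors.extend(validate(child, doc_type))
    return errors


# A document that follows the rules.
GOOD = ET.fromstring(
    "<article><title>Markup</title><author>A. Writer</author>"
    "<section><heading>One</heading><para>Text.</para></section></article>"
)
# The same content with the author missing and an undeclared element added.
BAD = ET.fromstring(
    "<article><title>Markup</title>"
    "<section><heading>One</heading><para>Text.</para>"
    "<footer>x</footer></section></article>"
)

print(validate(GOOD, ARTICLE_TYPE))
for problem in validate(BAD, ARTICLE_TYPE):
    print(problem)
```

     Because the rules are data rather than program logic, the same validate function would
work unchanged with a different table, say one describing finding aids.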

     The formality and generality of SGML have very important implications. Because SGML
syntax and rules are formal and precise, it is possible to write software that can be easily adjusted
to work with any compliant Document Type Definition. Typically, SGML software has a toolkit
that allows the user to adapt its functionality to his or her Document Type Definition. As a result,
the market driving SGML software development is in principle everyone. This is very different
from the MARC software market, which consists primarily of libraries and a few archives and
museums. Libraries, archives, and museums are a very small, cash-poor community. And while I
do not want to insult the MARC-oriented software developers--they do the best they can with
limited resources--the products available reflect the limited resources. On the other hand, the
SGML market includes virtually everyone. The Department of Defense now requires contractors
to supply technical information using four standards. SGML is one of the four. This requirement 
is called the CALS initiative. SGML is now being used by several software producers, airline,
automobile, and tractor manufacturers, the Department of Energy and other government agencies,
a wide variety of print and electronic publishers, and the Text Encoding Initiative, an international
project to provide encoding standards to support linguistic research and the study of literature. To
give you an idea of how broad and varied are the potential users of SGML, let me cite just a few
of the affiliations of the people who have subscribed to a listserv devoted to SGML related
activities in northern California: the Research Libraries Group, Lockheed Space and Missile,
Silicon Graphics, Berkeley Department of East Asian Languages, the Institute of Forestry
Genetics, Lawrence Berkeley Laboratory, UC Berkeley Library, Dialog Information Services,
Berkeley Department of Slavic Languages and Literatures, and many, many more. The list of
SGML related software developers reflects confidence in the potential of this market:
WordPerfect, Microsoft, Xerox, Frame, Electronic Book Technologies, Avalanche, ArborText,
SoftQuad, AutoGraphics, Open Text, Information Dimensions Inc., Exoterica, Object Design, and
a host of others. Firms such as WordPerfect and Microsoft are not interested in little markets.
Products on the market or under development include Z39.50 compliant client/server databases
and object oriented databases; a wide variety of authoring applications; conversion software; and
electronic multimedia and paper based publishing tools. In order to understand why SGML has
generated such broad interest from both users and developers, let us now turn to discussing the
nature of markup and what kind of markup SGML promotes.

     In an article now considered by many to be a classic presentation of document markup
theory, James Coombs, Allen Renear, and Steven DeRose distinguished five kinds of markup,
three of which I would like to discuss briefly today: procedural, descriptive, and referential. In
the last few years, through the use of word processing systems, we have become familiar with
procedural markup. Procedural markup consists of processing instructions to the computer. It
tells the computer what to do with specified components of the text. For example, in the
WordPerfect copy of the paper I am reading, there are embedded commands to center the title on
the first page horizontally, and, since the title is long, there are several hard returns segmenting it.
Most procedural markup is further characterized by being paper directed, that is, it tells the
printer how to put the text on paper. If you want to do anything else with the text, the markup is
not of much help. If you want to search for the initialism "SGML" in the machine-readable version
of this paper, but only where it occurs in a section title, the procedural markup provides no
assistance. Nor does it help if you want to display the text on a computer screen, since paper
presentation and monitor presentation are quite different. And finally, procedural markup is
characterized by a further limitation: to date, all procedural markup has been proprietary. This
means, for example, that the documents created on WordPerfect cannot be processed flawlessly
on Microsoft Word and vice versa. Each word processing software package uses its own
markup. In this environment, the future of the document is tied to the future of the software.

     A second type of markup mentioned by Coombs, Renear, and DeRose is descriptive
markup. With descriptive markup, we arrive at the form of markup recommended by SGML.
Descriptive markup identifies the logical components of documents. While procedural markup
specifies a particular procedure to be applied to a document component, descriptive markup
indicates what the component is. Examples are chapter, chapter title, section, paragraph, author,
publisher, and cataloging-in-publication data. None of these gives any indication of what
procedures are to be applied to these components. But, if you know the elements in a document,
then you can have processors do whatever you want with them. Descriptive markup liberates the
document for multiple uses. It is possible, for example, to use one and the same source document
to produce printed, electronic, Braille, and voice-synthesized versions, and, for good measure,
to produce HTML/Mosaic and Gopher versions. Of course the down side of liberty is that it can
be abused, but that is another matter. The fact that descriptive markup can be used in so many
different ways is one of its important characteristics. It escapes the single-use trap of procedural
markup.
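
     The multiple uses of a descriptively marked-up document can be sketched in a few lines of
Python. This is a rough illustration under stated assumptions: the element names (paper, section,
heading, para) are hypothetical, the sample is fully tagged so that Python's standard XML parser
can read it, and the two renderings merely stand in for paper and screen presentation. The first
function performs the kind of search that procedural markup cannot support, a word sought only
where it occurs in a section heading.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical source document: the markup says what each component is,
# not how it should look.
SOURCE = ET.fromstring(
    "<paper>"
    "<title>Markup and Cataloging</title>"
    "<section><heading>Introduction to SGML</heading>"
    "<para>SGML is a standard for constructing markup languages.</para></section>"
    "<section><heading>Cataloging</heading>"
    "<para>SGML may help catalogers.</para></section>"
    "</paper>"
)


def headings_mentioning(doc, word):
    """Search for a word, but only inside section headings."""
    return [h.text for h in doc.iter("heading") if word in h.text]


def to_plain_text(doc):
    """One processor: underlined title, upper-case headings, indented paragraphs."""
    title = doc.findtext("title")
    lines = [title, "=" * len(title), ""]
    for section in doc.iter("section"):
        lines.append(section.findtext("heading").upper())
        for para in section.iter("para"):
            lines.append("    " + para.text)
        lines.append("")
    return "\n".join(lines)


def to_html(doc):
    """A second processor over the same source: a simple HTML rendering."""
    parts = [f"<h1>{escape(doc.findtext('title'))}</h1>"]
    for section in doc.iter("section"):
        parts.append(f"<h2>{escape(section.findtext('heading'))}</h2>")
        for para in section.iter("para"):
            parts.append(f"<p>{escape(para.text)}</p>")
    return "\n".join(parts)


print(headings_mentioning(SOURCE, "SGML"))
print(to_plain_text(SOURCE))
print(to_html(SOURCE))
```

     Nothing in the source document changes between the two renderings; only the processor
applied to it does.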

     It is useful to distinguish two kinds of descriptive markup: structural and nominal.
Descriptive structural markup identifies document components and their logical relationship.
Structural elements are components that you usually want to present visually in some distinct
manner. Examples are chapter titles, paragraphs, block quotes, and the like. Descriptive nominal
markup, as you might expect, identifies named entities, both concrete and abstract. Examples are
corporate names, personal names, topical subjects, genres, and geographic names. While you may
want to visually present these names online or on paper in some particular manner, you usually
want to index them in particular ways, to use them to provide access to the source or subject
matter of the document. When the data that is being marked up is meta-data or cataloging, then
nominal markup frequently identifies data components used to control and provide access to other
information.
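
     As a small illustration of how nominal markup supports indexing, the following Python
sketch gathers the tagged names in a passage into separate access points. The element names
(persname, corpname, geogname), the sample passage, and the build_index function are all
hypothetical, and the passage is fully tagged so that Python's standard XML parser can read it.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical nominal elements and the index each one feeds.
NOMINAL_ELEMENTS = {
    "persname": "personal names",
    "corpname": "corporate names",
    "geogname": "geographic names",
}

# Invented passage in which the names are identified by markup.
RECORD = ET.fromstring(
    "<findaid>"
    "<para>Correspondence of <persname>Jane Example</persname> with the "
    "<corpname>Example Press</corpname>, written in <geogname>Berkeley</geogname> "
    "and <geogname>Vancouver</geogname>.</para>"
    "</findaid>"
)


def build_index(doc):
    """Collect the text of every nominal element under its index heading."""
    index = defaultdict(list)
    for element in doc.iter():
        kind = NOMINAL_ELEMENTS.get(element.tag)
        if kind is not None:
            index[kind].append(element.text)
    return {kind: sorted(names) for kind, names in index.items()}


for kind, names in build_index(RECORD).items():
    print(kind, names)
```

     The indexing software needs no knowledge of names as such; it relies entirely on the
markup to tell a personal name from a place.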

     Referential markup, the last type of markup identified by Coombs, Renear, and DeRose, as
its name suggests, refers to information that is not present. It is, so to speak, markup in the third
person. There are different kinds of referential markup and different ways to use it, but I would like
to focus on the kind of referential markup that enables something about which most of you have
heard, and perhaps with which many of you have some experience, namely, hypertext and
hypermedia. In addition to supporting text, SGML also makes provisions for using text to refer
to other text, and to refer to other kinds of digital information derived from the full array of native
formats: photographs (color as well as black and white); sound motion pictures; drawings;
paintings; audio recordings; three dimensional objects of all kinds, shapes, and sizes; maps;
manuscripts; typescripts; printed pages; mathematical data; financial data; diagrams; musical
notation; choreographic notation; and anything else open to digital capture and being digitally
rendered in some useful form. It is possible not only to refer to or point at this other digital
information from within SGML based documents, but also to control the notation information
needed to launch the devices necessary for rendering the various objects into humanly intelligible
forms. It is thus possible to use electronic text to control and manage extra-SGML information
objects of all kinds, as well as to provide access to and navigation through them. Let us now turn
our attention to a brief discussion of SGML and cataloging.
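
     The referential mechanism just described can be sketched roughly in Python. SGML records
this information in entity and notation declarations; here, as a simplifying assumption, two
dictionaries stand in for those declarations. The entity names, file paths, renderer names, the
ref element, and the resolve_references function are all hypothetical, and the sample document
is fully tagged so that Python's standard XML parser can read it.

```python
import xml.etree.ElementTree as ET

# Stand-in for notation declarations: each notation names the kind of
# device or program needed to render data in that format (hypothetical).
NOTATIONS = {"tiff": "image viewer", "wav": "audio player"}

# Stand-in for external entity declarations: each entity names a file
# outside the document and the notation its data is in (hypothetical).
ENTITIES = {
    "letter-p1": ("images/letter-p1.tiff", "tiff"),
    "interview": ("audio/interview.wav", "wav"),
}

# The text points at the external objects by entity name only.
DOC = ET.fromstring(
    "<findaid>"
    "<item>Letter, page 1 <ref entity='letter-p1'/></item>"
    "<item>Oral history interview <ref entity='interview'/></item>"
    "</findaid>"
)


def resolve_references(doc):
    """Return (entity name, file, renderer) for each reference in the document."""
    resolved = []
    for ref in doc.iter("ref"):
        name = ref.get("entity")
        location, notation = ENTITIES[name]
        resolved.append((name, location, NOTATIONS[notation]))
    return resolved


for name, location, renderer in resolve_references(DOC):
    print(f"{name}: open {location} with the {renderer}")
```

     Because the text refers to objects by name, a file can be moved or a renderer replaced by
changing one declaration, without touching the text itself.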

SGML and Cataloging

     Cataloging is an activity in which we use information to help us create information about
still other information. In the first category are descriptive cataloging rules, rule interpretations,
subject cataloging rules, the beloved Conser Manual, and much, much more. In the second
category are authority and catalog records. Published and unpublished items comprise the third
category. Thus we have cataloging, the catalog, and the cataloged. Currently, to the extent that all
three kinds of information are in electronic form, they exist in a fragmented information
environment, based primarily on proprietary markup and proprietary software. The main
exception is the MARC based catalog. But, as we have seen, MARC is used by a limited
community, more or less in isolation from other information formats and information processing
systems. Into this fragmented world enters SGML, which, if widely adopted, offers the prospect
of a fully integrated, interoperable information environment.

     As we have seen, SGML is a general standard capable of embracing a wide variety of text
documents, and of using that text to provide access to and control of a multitude of online
information formats. Hence SGML can serve as the basis of a comprehensive, integrated,
multimedia, text-based information environment. In essence, it would be possible to use SGML as
the general underlying standard for the information that we use to catalog, the catalog records we
create, and the electronic texts we catalog; and further yet, we could use the text in any of these
domains to provide entry to and control of extra-textual digital objects. 

     There are already developments underway or being contemplated that point to an SGML
based integrated information environment. John Duke at Virginia Commonwealth University has
been encoding the Anglo-American Cataloging Rules, Second Edition, using the Association of
American Publishers Document Type Definition. While the Library of Congress has made no
official announcements, I have been informally told that they are contemplating using SGML to
mark up various cataloging tools. If, for example, they were to mark up the LC Rule
Interpretations and the MARC formats, then it would be possible to create hypertext links
between the AACR2 rules, Rule Interpretations, and MARC formats, in effect, creating a virtual
cataloging tool that integrates what is now inconveniently dispersed. With respect to the catalog,
Professor Ray Larson at the University of California, Berkeley, School of Library and Information
Science, is creating the second generation of his prototype catalog Cheshire. In its second
incarnation, Cheshire will be a Z39.50 compliant, SGML based client/server database. Professor
Larson and his graduate assistants have created a USMARC Document Type Definition into
which to map the catalog records. Cheshire is a catalog that employs probabilistic retrieval
software to support natural language subject searching. With the collaboration and assistance of a
wide variety of experts from numerous libraries, archives, and museums, the Berkeley Finding Aid
Project is developing a Document Type Definition for finding aids. Finding aids are documents
used in libraries, archives, and museums to provide access to and control of unpublished
collections of primary source materials. In the hierarchical structure of collection-level
information access and navigation, finding aids reside between collection-level catalog records
and primary source materials. Catalog records lead to finding aids, and finding aids lead to
primary source materials. The Berkeley Finding Aid Project envisions a future in which
information seekers follow clearly marked paths through library catalogs to finding aids and from
finding aids to cultural treasures in a multitude of computer and traditional formats ... and back.
To complete the circle, we need to look at developments in electronic publishing.

     There are many activities in the area of electronic publishing and text encoding. The Text
Encoding Initiative is an international project to develop a suite of Document Type Definitions for
texts used in linguistic, literary, and historical studies. The TEI guidelines, numbering some 1300
pages, were published in May 1994. Also in May, the Center for Electronic Text in the
Humanities at Princeton and Rutgers Universities sponsored a workshop on documenting
electronic texts, with a focus on the TEI header and the MARC record. The TEI header is that
portion of a TEI compliant text that functions as the chief source of information. The workshop
brought together scholars, publishers, computer scientists, and librarians. Those gathered were in
general agreement that the TEI header and MARC records should be symmetrical with respect to
content designation. In essence, a TEI compliant text would come self-described. The description
would migrate into a MARC record. A cataloger would then integrate this description into the
target catalog by performing the authority work, subject analysis, and classification. The cataloger
supplied information could then be mapped back into the TEI header for use by other libraries, if
so desired. In the area of electronic journals, OCLC, an early supporter of SGML and a co-
developer of the Association of American Publishers suite of Document Type Definitions, has
been a pioneer in SGML based networked publishing. OCLC currently publishes the Online
Journal of Current Clinical Trials, Electronics Letters Online (a publication of the Institution of
Electrical Engineers in Great Britain), and Online Journal of Knowledge Synthesis for Nursing.
These journals are available to subscribers over the Internet using Guidon, client software
developed by OCLC. OCLC, in collaboration with the American Chemical Society, Bellcore and
Cornell University, is also involved in an electronic journal experiment called CORE. This
experiment is focusing on a large number of journals in the chemistry subject area, and involves
both page images and structured SGML based text. OCLC is also talking to a number of
publishers about networked, electronic publishing of journals. Internationally, more and more
publishers are beginning to use SGML and contemplate network publishing. In my own
neighborhood, the University of California Office of the Continuing Education of the Bar, and the
University of California Press are both in the process of acquiring the expertise to convert their
publishing activities to SGML based operations. 
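
     The round trip between the TEI header and the MARC record described earlier in this
section can be sketched in Python. The crosswalk below is my own illustration and an assumption,
not an official mapping: it pairs a few TEI header elements with the MARC fields a cataloger
would expect them to feed (245 for title, 100 for personal author, 260 for publisher and date).
The sample header is invented and fully tagged so that Python's standard XML parser can read it,
and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative crosswalk (an assumption, not an official mapping):
# path within the header, MARC tag, MARC subfield code.
CROSSWALK = [
    ("fileDesc/titleStmt/title", "245", "a"),
    ("fileDesc/titleStmt/author", "100", "a"),
    ("fileDesc/publicationStmt/publisher", "260", "b"),
    ("fileDesc/publicationStmt/date", "260", "c"),
]

# Invented header for a self-described electronic text.
HEADER = ET.fromstring(
    "<teiHeader><fileDesc>"
    "<titleStmt><title>An Example Text</title>"
    "<author>Example, Jane</author></titleStmt>"
    "<publicationStmt><publisher>Example Press</publisher>"
    "<date>1994</date></publicationStmt>"
    "</fileDesc></teiHeader>"
)


def header_to_marc(header):
    """Copy header elements into (tag, subfield code, value) triples."""
    fields = []
    for path, tag, code in CROSSWALK:
        value = header.findtext(path)
        if value:
            fields.append((tag, code, value))
    return fields


def add_subjects_to_header(header, subjects):
    """Map cataloger-supplied subject headings back into the header."""
    profile = ET.SubElement(header, "profileDesc")
    keywords = ET.SubElement(ET.SubElement(profile, "textClass"), "keywords")
    for heading in subjects:
        ET.SubElement(keywords, "term").text = heading
    return header


for tag, code, value in header_to_marc(HEADER):
    print(f"{tag} ${code} {value}")

add_subjects_to_header(HEADER, ["Electronic publishing"])
print(ET.tostring(HEADER, encoding="unicode"))
```

     The sketch leaves out everything that requires judgment. Authority work, subject analysis,
and classification remain the cataloger's contribution; the mapping only carries the results in
each direction.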

     Clearly there is now an opportunity to build an integrated information environment in
which the catalog provides clearly marked paths leading to both traditional and electronic
information formats. That a civilized environment will emerge from the current Internet
development frenzy is far from certain. With the exception of the Text Encoding Initiative, no one else
in publishing, to the best of my knowledge, has approached the library community about ensuring
compatibility between the encoding of the chief source of information and the MARC format. And
in addition to the publishers there are a host of other commercial and private information
producers that are overwhelming the Internet with a chaotic assortment of the good, the bad, and
the ugly. SGML, I believe, as a general standard that allows us to structure text, and to interrelate
many different kinds of information, offers us an opportunity to make the Internet a coherent,
standards-based information whole, an orderly information universe. I believe librarians, and in
particular, catalogers, have a professional obligation to actively assert themselves in the creation
of this information universe. If librarians sit back and wait to be asked, the disparate and all too
shortsighted forces developing the Internet will not think to ask them to participate in the planning
and development until it is too late.

     At the beginning of my presentation, I indulged in building on a bad metaphor, the so-
called information superhighway. A moment ago I used quite a different metaphor, one that I
think is far superior, namely, the information universe. At once we elevate the discussion from the
earth-bound arena of transportation to the unbounded heavens above. In this view of things, it
will be the responsibility of catalogers to ensure that order emerges from chaos. Since this activity
is clearly divine, the full assumption of it will allow catalogers, at last, to take their rightful place
in relation to the information mortals.