[This local archive copy is from the official and canonical URL, http://gopher.lib.virginia.edu/speccol/scdc/articles/alcts_brief.html; please refer to the canonical source document if possible.]
The previous CFCC briefing papers examined the development of online library catalogs; Peter Graham and Michael Buckland reviewed the development and design of library systems and suggested evolutionary paths for library catalogs. Larry Dixson reviewed Z39.50 and its importance in the continuing development of information retrieval. This paper addresses markup language--specifically SGML--and its relationship to library catalogs.
Traditionally, markup is instruction given to a typesetter about how to lay out text--what style and size of fonts to use (e.g. bold, 24-point Garamond), special typographical elements (ornaments, bullets, etc.), and the like. The markup has nothing to do with the content of the document, and everything to do with its physical appearance. Markup of this kind is known as procedural markup. Most electronic publishing and word processing packages use procedural markup that is restricted to that system. That is, the system has its own set of codes (frequently embedded in the text of a document) that carry out processes--bolding, font scaling--only within that system or with related software running on a particular platform. Generally, markup of this sort is designed with a single result in mind, such as the production of a printed page in a particular style. If the content of the document needs to be re-used in a different style or format, one must remove the first markup codes and add new codes for the new formatting. As any librarian who has lived through the implementation of more than one online system can attest, when system software (and sometimes hardware) changes, data translation from one system to another can be an expensive, time-consuming, and potentially heartbreaking endeavor. The exchange of documents based on procedural markup works well only if both sender and receiver use the same system.
There is, however, another type of textual markup--descriptive markup. This generic markup describes the structure and/or content of a document, rather than its physical appearance on the page or screen. Thus, the content of a document is separated from the style of presentation. Elements within a document (a chapter, a stanza, a footnote, a bibliography) are categorized using codes that tell what the element is, but not what its physical appearance must be. The content of documents that are marked up descriptively may be re-used for many different purposes and presented in many different styles.
SGML (Standard Generalized Markup Language) was first developed in 1970 as GML (Generalized Markup Language) and evolved into both a national and international standard (Bradley, 272). SGML has been an international standard since October 1986 (ISO Standard 8879); it is widely accepted in the United States, Western Europe, and Japan and is used for a variety of business, industrial and academic applications (Adler, 556). SGML is frequently referred to as a metalanguage. This means that SGML is not a single markup language, but a language for describing other markup languages; that is, SGML supplies the rules, or framework, for defining particular markup languages.
SGML provides for descriptive, as opposed to procedural, markup; that is, it simply states names to categorize parts of a document instead of specifying processes to be carried out.
SGML permits the description of structured information independent of its processing by providing a standard syntax for defining descriptions of classes of documents. These descriptions are called document type definitions (DTDs). DTDs define types of documents and their structures by stating what elements are required in a particular type, and what elements may be present in the document. The structure of a document can be marked up and checked against a DTD--using a special program called a parser--to ensure that it is valid, and that it conforms to the structure of the document type defined by the DTD. DTDs are almost always written with specific processing or results in mind but can easily be repurposed. Because SGML markup is application-independent, documents that conform to a particular DTD can be re-used in a variety of different ways. Knowing that a document is structured in a specific way makes subsequent reuse of the document easier.
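The checking step described above can be illustrated with a small sketch. It is not an SGML parser: it assumes Python, a sample document written with fully closed tags so that the standard library's XML reader can load it, and a hypothetical "memo" document type whose rules are held in an ordinary dictionary rather than in real DTD syntax. A real DTD also constrains the order and repetition of elements, which this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a DTD: for each element type, the child
# elements that are required and those that are merely permitted.
MEMO_RULES = {
    "memo": {"required": {"to", "from", "body"}, "optional": {"subject"}},
}

def validate(element, rules):
    """Return a list of problems; an empty list means the document conforms."""
    rule = rules.get(element.tag)
    if rule is None:
        # No rule recorded: treat the element as a plain text container.
        return []
    problems = []
    present = {child.tag for child in element}
    for name in sorted(rule["required"] - present):
        problems.append(f"<{element.tag}> is missing required <{name}>")
    allowed = rule["required"] | rule["optional"]
    for child in element:
        if child.tag not in allowed:
            problems.append(f"<{child.tag}> is not allowed inside <{element.tag}>")
        else:
            problems.extend(validate(child, rules))
    return problems

good = ET.fromstring("<memo><to>A</to><from>B</from><body>Text</body></memo>")
bad = ET.fromstring("<memo><to>A</to><body>Text</body><price>3</price></memo>")
print(validate(good, MEMO_RULES))
print(validate(bad, MEMO_RULES))
```

The point of the sketch is only that conformance is a property that software can test mechanically once the document type has been written down.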
There are three different functional parts of an SGML document. The first specifies the character set of the document; the second part names the document type and, thus, the specific tags that can be used in the document. The third part of an SGML document is the actual text or content, marked with SGML tags.
For example, a "poem" could be defined as a document type. The DTD for "poem" could define lines and stanzas (tags: <line> and <stanza>) as the required elements. Optional elements could include <couplets>, <octaves>, <sestets>, <footnotes>, and <author>; these elements may be present in a document defined as a poem, but need not be. The purpose of the DTD and SGML coding might be to print all the tagged poems with octaves in bold and sestets in italics, using a translation program of some sort. There might be a need later to print the same tagged poems with only octaves in italics; it would not be necessary to re-tag each poem, but only to alter the translation program that processed the tagged texts. Indeed, the possible uses of the tagged poems are virtually limitless, since almost any processing instructions can be applied. Different sorts of processing instructions can be associated with the same parts of the file. For example, a program for textual analysis could ignore footnotes in a tagged poem that a program for printing might collect for printing at the end of each poem. The same print program might extract authors' names from the poems to print at the top of each poem; the textual analysis program could use the authors' names to create a searchable database. The combination of descriptive markup and document type definitions allows SGML-encoded documents to be processed by many different pieces of software with many different results in mind. Because SGML is non-proprietary (not owned by a single computer hardware or software manufacturer) and is an international standard, data remains independent of any particular hardware or software configuration, making SGML and SGML-conformant applications extremely flexible.
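The poem example can be sketched in a few lines. The sketch below assumes Python and a tagged poem written with fully closed tags so that the standard XML reader can load it; the singular tag names (<octave>, <sestet>, <footnote>), the sample text, and the use of * and _ to stand in for bold and italics are all illustrative assumptions. One tagged text is processed three ways without being re-tagged.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagged poem; the markup says what each part is, not how it looks.
POEM = """<poem>
<author>Anonymous</author>
<stanza>
<octave><line>Octave line</line></octave>
<sestet><line>Sestet line<footnote>A gloss.</footnote></line></sestet>
</stanza>
</poem>"""

def render(poem, styles):
    """Print-style processing: author first, styled lines, footnotes at the end."""
    out = [poem.findtext("author", default="")]
    notes = []
    for part in poem.find("stanza"):
        mark = styles.get(part.tag, "")
        for line in part.iter("line"):
            text = (line.text or "").strip()
            out.append(f"{mark}{text}{mark}")
            notes.extend(note.text for note in line.iter("footnote"))
    out.extend(f"[{i}] {note}" for i, note in enumerate(notes, 1))
    return "\n".join(out)

def words(poem):
    """Analysis-style processing: the words of the verse only, footnotes ignored."""
    found = []
    for line in poem.iter("line"):
        found.extend((line.text or "").lower().split())
    return found

poem = ET.fromstring(POEM)
first_printing = render(poem, {"octave": "*", "sestet": "_"})
second_printing = render(poem, {"octave": "_"})
print(first_printing)
print(second_printing)
print(words(poem))
```

Changing the printed style means changing only the small table passed to render; the tagged poem itself is untouched.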
What then is the relationship of SGML to an online library system? Three projects and initiatives that are currently underway can serve as examples of the usefulness of markup to library information retrieval and selection. These three projects all make use of SGML in providing structure and access to bibliographical information.
The Text Encoding Initiative (TEI) is an international, cooperative project to develop guidelines for the preparation and interchange of electronic texts for scholarly research. TEI prepared the Guidelines for Electronic Text Encoding and Interchange (Sperberg-McQueen, http://etext.virginia.edu/TEI.html) to develop
"a common encoding scheme for complex textual structures in order to reduce the diversity of existing encoding practices, simplify processing by machine, and encourage the sharing of electronic texts." (Sperberg-McQueen, preface).
The TEI header, a mandatory part of TEI-conformant texts, has four major parts: a file description, which contains a full bibliographical description of the computer file itself and includes information about the source or sources from which the electronic text was derived; an encoding description, which describes the relationship between an electronic text and its source or sources, and allows for detailed description of whether (or how) the text was normalized during transcription, how the encoder resolved ambiguities in the source, and the like; a text profile, which contains classificatory and contextual information about the text, such as its subject matter, the situation in which it was produced, the individuals described by or participating in producing it, and so forth; and a revision history, which allows the encoder to provide a history of changes made during the development of the electronic text (Sperberg-McQueen, 5.1.1). The guidelines for the TEI header include discussion of the header's relationship to the MARC record. The guidelines state that TEI's goal in creating the header is
"to ensure that the information required for a catalogue record be retrievable from the TEI file header, and moreover that the mapping from the one to the other be as simple and straightforward as possible." (Sperberg-McQueen, 24.1).
However, the guidelines go on to state that
[t]he most important difference between the MARC record and the TEI header is the function of each. Despite the efforts and claims of some members of the library community, the MARC record remains fundamentally an electronic version of the catalogue card, with the limitations of its model. The primary function of the MARC record when it was first designed in the mid-1960s was to allow for the electronic distribution of cataloguing records in support of card production ... The catalogue card is a unitary record for a physical object containing complex bibliographic data of varying sorts. The catalogue card points to the physical object. The TEI header provides full bibliographic information (as would a card), as well as documentary non-bibliographic information that supports the analysis, either by humans or machines, of the electronic text documented by header. Most of this analytical information, which is found in profile description, encoding description, and revision history, has little direct provision for it in the MARC record, and if retained must be recorded as unstructured notes (5XX) fields. Notes fields usually do not have the structure to support machine retrieval and analysis, while properly formatted profile, encoding, and revision descriptions lend themselves to retrieval, can support machine processing (including analysis), and point directly to the electronic text attached to the header. Moreover, the electronic text points back to the relevant elements in the header. (Sperberg-McQueen, 24.3)
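The contrast drawn in the passage above can be made concrete with a short sketch. It assumes Python and a header written with fully closed tags; the four top-level element names (fileDesc, encodingDesc, profileDesc, revisionDesc) follow the TEI guidelines, but the nested content is simplified sample data that has not been checked against the TEI DTD, and the mapping to MARC tags 245, 100, 260 and 500 is a hypothetical illustration rather than the mapping given in the guidelines.

```python
import xml.etree.ElementTree as ET

# Simplified, illustrative header; only the four top-level parts matter here.
HEADER = """<teiHeader>
  <fileDesc>
    <titleStmt><title>Sample poems: an electronic edition</title><author>Anonymous</author></titleStmt>
    <publicationStmt><publisher>Example Library</publisher></publicationStmt>
  </fileDesc>
  <encodingDesc><p>Spelling was not normalized.</p></encodingDesc>
  <profileDesc><p>Subject: poetry.</p></profileDesc>
  <revisionDesc><p>First encoded version.</p></revisionDesc>
</teiHeader>"""

def catalog_fields(header):
    """Derive a small MARC-like record (tag -> value) from a parsed header."""
    file_desc = header.find("fileDesc")
    record = {
        "245": file_desc.findtext("titleStmt/title"),
        "100": file_desc.findtext("titleStmt/author"),
        "260": file_desc.findtext("publicationStmt/publisher"),
        "500": [],
    }
    # The three analytical parts have no structured home in this record,
    # so they are flattened into free-text notes.
    for part in ("encodingDesc", "profileDesc", "revisionDesc"):
        text = " ".join("".join(header.find(part).itertext()).split())
        record["500"].append(f"{part}: {text}")
    return record

record = catalog_fields(ET.fromstring(HEADER))
print(record)
```

The bibliographic fields map across cleanly, while the encoding, profile, and revision information survives only as unstructured notes, which is the loss the guidelines describe.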
The Finding Aids for Archival Collections is a collaborative project to develop an encoding standard for archive, museum, and library finding aids. "Finding aids are documents used to describe, control, and provide access to collections of related materials. In the hierarchical structure of collection-level information access and navigation, finding aids reside between bibliographic records and the primary source materials. Bibliographic records lead to finding aids, and finding aids lead to primary source materials." (Finding Aids for Archival Collections). The goals of the project are twofold: first, to create a prototype encoding standard for finding aids in the form of an SGML DTD; second, to build a prototype database of finding aids. The database is to serve two purposes.
First, it provides the encoding standard developers with computer application experience with which to refine and inform the development process. Second, it provides a means for end users to evaluate the utility and desirability of encoded finding aids, which, in turn, enables them to provide new ideas and suggestions to the encoding standard developers. (Finding Aids for Archival Collections)
SGML was chosen over MARC for encoding the records because of MARC's limited accommodation of hierarchically structured information. Since finding aids are hierarchically structured documents, the flat structure of MARC makes it unsatisfactory. As archivists are painfully aware, MARC was primarily designed to capture description and access information applying to a discrete bibliographic item. Describing and providing access to complex collections through descending levels of analysis quickly overburdens the MARC structure. At most, a second level of analysis can be accommodated, but the kind of information so supplied is limited. One possible way around this problem is to employ multiple, hierarchically interrelated and interlinked records at varying levels of analysis: collection-level, subunit, and item. The use of multiple records, though, introduces extremely difficult inter- and intra-system control problems that have never been adequately addressed in the format or by MARC based software developers. (Pitti, section "MARC vs. SGML")
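The difference between a nested description and a set of interlinked flat records can be seen in a small sketch. It assumes Python, a finding aid written with fully closed tags, and hypothetical element names (collection, series, item) and sample titles; these are not the names used by the Berkeley project's DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical finding aid: the hierarchy is carried by the nesting itself.
FINDING_AID = """<collection title="Smith Family Papers">
  <series title="Correspondence">
    <item title="Letter to J. Smith, 1861"/>
    <item title="Letter to A. Smith, 1863"/>
  </series>
  <series title="Diaries">
    <item title="Diary, 1860-1862"/>
  </series>
</collection>"""

def flatten(element, parent_id=None, records=None):
    """Turn the nested description into flat records linked by parent ids."""
    if records is None:
        records = []
    record_id = len(records) + 1
    records.append({
        "id": record_id,
        "level": element.tag,
        "title": element.get("title"),
        "parent": parent_id,
    })
    for child in element:
        flatten(child, record_id, records)
    return records

records = flatten(ET.fromstring(FINDING_AID))
for record in records:
    print(record)
```

One nested document becomes six separate records, and the hierarchy now exists only in the parent links, which every system that stores or exchanges the records must keep consistent.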
The cooperative Research Libraries Group initiative on access to digital images has prompted the Columbia University Libraries to suggest a new model for housing and access to bibliographic and analytical data on digital images. Columbia's DIAP team suggests that data could readily be "housed" in an SGML-encoded bibliographic (metadata) record that encapsulates both summary bibliographic information along with detailed hierarchical and version-related data, when such data is appropriate and considered useful to record. The record would also include links to the actual digital items, to other related bibliographic records or, in fact, to diverse, related digital objects (such as external electronic publications, databases, numeric files, etc.) The working designation SGML Catalog Record (SCR) is proposed for this new type of record. The SCR would, by flexibly incorporating data-element "clusters," allow a more narrative approach to the recording and presentation of complex bibliographic information than is practical in the current AACR2/USMARC model, which requires the fragmentation of hierarchically-related components and version information into separate, discrete records. That the current USMARC model serves libraries and users poorly--most notably in the cataloging of microfilm reproductions of printed texts and with complex serial publications--has been widely discussed. Attempts to rectify the situation have been unsuccessful, in large part because of the intrinsically flat structure of USMARC and the automated library systems that have been designed around it. (Davis, introd.)
The Columbia proposal also suggests that libraries "would for the present, and perhaps indefinitely," continue to create summary MARC records--with pointers to the SCRs--in local online systems and national utilities. (Davis, introd.)
[t]he single-tiered approach to catalogs is already breaking down as tens of millions of items become accessible over the network and as the effects of linking online bibliographies to catalog records begins to extend the bibliographic power of the catalog beyond the dreams of catalog code compilers. To cope with the sheer scale of the first objective [knowing if a particular work is "in" a library] in a networked electronic library environment and to address the second objective [knowing which works are "in" a library] in any catalog, a hierarchical approach is needed so that users can easily move into the level of detail they want: work; variant versions of the work; parts of the work; the full text of the work; kinds of related works; and so on ... Since a single-tiered approach will not do, the future catalog will have to be multitiered and flexible and adaptive in operation. (Buckland, p. C)
How then to make the future catalog multitiered, flexible and adaptive? MARC is clearly unable to provide the flexible, adaptive operation that emerging library systems will require. Does this mean that the MARC format is no longer useful, and will disappear in two, or five, or ten years? Of course not. There are billions and billions of MARC records in library online systems and national utilities; the cost, in time and computer resources alone, of converting them to SGML would be staggering. I believe it does mean that MARC as the library world knows it must evolve to allow libraries to make the most effective use of their very limited resources. This evolution might well mean that MARC will no longer be the single format for encoding bibliographic data in library systems.
HTML (Hypertext Markup Language) should be familiar to most librarians as the markup language, since 1990, of the World Wide Web (WWW). HTML is an SGML application, complete with DTD and numerous specifications (see HyperText Markup Language (HTML) from the World Wide Web Consortium at http://www.w3.org/pub/WWW/MarkUp/). HTML is descriptive markup that is interpreted procedurally by different WWW browsers, such as Netscape or Mosaic. (Though in development, SGML-aware WWW browsers are not fully operational; see Panorama--SGML on the Web.) HTML allows users to embed images, sounds, and video in documents, and provides hyperlinks from one spot in a document to another, or to a separate document, image or sound file. There are limitations to HTML, however; the markup is essentially very simple and defines a document's structure only at a very basic level.
Library catalogs have already taken a step forward onto the WWW. Systems such as SIRSI's WebCat (http://www.sirsi.com/webcattoc.html) and Data Research Associates' DRAWeb (http://www.dra.com/products/draweb/draweb.HTM) offer WWW interfaces for library catalogs.
These web-based catalogs offer a WWW client that can access and search MARC databases, create HTML on the fly, and return the results to the user's desktop. Some clients are Z39.50-compliant.
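Creating HTML "on the fly" from a catalog record amounts to wrapping stored field values in display markup at the moment of the request. The sketch below assumes Python; the dictionary is a hypothetical, much simplified stand-in for a MARC record (tag to display text, using an entry from this paper's reference list), and the function is not drawn from WebCat, DRAWeb, or any other product.

```python
from html import escape

# Hypothetical, simplified stand-in for a MARC record: tag -> display text.
RECORD = {
    "100": "Van Herwijnen, Eric",
    "245": "Practical SGML",
    "260": "Dordrecht: Kluwer Academic Publishers, 1990",
}

# Display labels for the handful of tags this sketch knows about.
LABELS = {"100": "Author", "245": "Title", "260": "Published"}

def record_to_html(record):
    """Build a small HTML definition list from the labelled fields of a record."""
    rows = [
        f"<dt>{LABELS[tag]}</dt><dd>{escape(text)}</dd>"
        for tag, text in sorted(record.items())
        if tag in LABELS
    ]
    return "<dl>" + "".join(rows) + "</dl>"

print(record_to_html(RECORD))
```

The stored record is unchanged; only its presentation is generated for the browser, and a different display would need only a different function.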
SGML, in combination with other developments, offers some additional solutions. SGML is application-independent, non-proprietary, and extremely flexible; as such, it offers a viable alternative and/or adjunct for the encoding of bibliographical information. As the projects mentioned above demonstrate, it is already possible to encode data in an SGML format. SGML and MARC are still separate formats that do not interact; MARC as it currently exists does not appear to be flexible enough to allow libraries to take full advantage of the ever-developing information retrieval technology, especially the World Wide Web. A USMARC DTD has, however, been developed by Jerome McDonough of the University of California, Berkeley and is available for anonymous ftp at ftp://library.berkeley.edu/pub/sgml/marcdtd/. The DTD is "designed for use in an on-line catalog employing SGML as its underlying record format" and is intended to "identify (tag) content-bearing elements down to the subfield level ... allow automatic conversion of USMARC records into this SGML format and back out again with as little loss of the original content as possible." (USMARC DTD, Readme file). The ability to transport bibliographic data from SGML to MARC and back again is likely the first step toward the development of a WWW-based catalog, with SGML as its underlying record structure. In such a system, clients could, and should, be both Z39.50-compliant and SGML-aware; clients would be able to take full advantage of the more robust SGML markup and might, for example, offer content searching based on structural elements. An SGML-based catalog would allow additional development of the projects mentioned earlier; for example, a TEI-conformant text would no longer need a separate MARC bibliographic record to describe it; its accompanying header would serve as the descriptive entity and could be formatted and displayed in whatever fashion a user might choose.
Search hierarchies could look first for descriptive information packets (such as headers), and bring those to the user's desktop while retaining a link to the full text, image, or digital surrogate, which could also be searched, displayed, saved and manipulated as the user wished, and which, of course, could be linked to other versions, images, or analytical files.
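The round trip between MARC and SGML mentioned above can be sketched as follows. The sketch assumes Python and markup with fully closed tags; the element and attribute names (record, field, subfield, tag, code) are hypothetical and are not those of the USMARC DTD, and the leader, indicators, and control fields of a real MARC record are left out.

```python
import xml.etree.ElementTree as ET

# Simplified MARC-like record: a list of (tag, [(subfield code, value), ...]).
RECORD = [
    ("100", [("a", "Bryan, Martin.")]),
    ("245", [("a", "SGML :"),
             ("b", "an authors' guide to the standard generalized markup language")]),
]

def to_markup(record):
    """Tag every field and subfield so the structure is explicit in the text."""
    root = ET.Element("record")
    for tag, subfields in record:
        field = ET.SubElement(root, "field", tag=tag)
        for code, value in subfields:
            subfield = ET.SubElement(field, "subfield", code=code)
            subfield.text = value
    return ET.tostring(root, encoding="unicode")

def from_markup(markup):
    """Rebuild the list-of-fields form from the tagged text."""
    root = ET.fromstring(markup)
    return [
        (field.get("tag"), [(sub.get("code"), sub.text) for sub in field])
        for field in root
    ]

markup = to_markup(RECORD)
print(markup)
print(from_markup(markup) == RECORD)
```

Because every subfield keeps its tag and code, nothing in this simplified record is lost in either direction, and a client that understands the markup could search on a particular field or subfield rather than on undifferentiated text.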
"Without greater flexibility in the cataloging and encoding of digital documents, library-generated bibliographic data will not be easily integrated into the developing local and national information environment as effective inventories of and indexes to the electronic holdings of libraries." (Davis, introd.)
Adler, Sharon. 1992. The birth of a standard. Journal of the American Society for Information Science 43, 8: 556-558.
Finding Aids for Archival Collections
(http://sunsite.berkeley.edu/FindingAids/)
Bradley, Neal. 1992. SGML concepts. ASLIB proceedings 44, no. 7/8 (July/August): 271-274.
Bryan, Martin. 1988. SGML: an authors' guide to the standard generalized markup language. Wokingham, Eng.: Addison-Wesley.
Buckland, Michael. 1994. From Catalog to Selecting Aid. From Catalog to Gateway, Briefings from the CFCC, no. 2, 1994 in ALCTS Newsletter 5, 1994.
Cover, Robin. 1991. Bibliography on SGML (Standard Generalized Markup Language) and related issues. Kingston, Ontario : Dept. of Computing and Information Science, Queen's University at Kingston.
Cover, Robin. 1994. SGML Web Page (http://www.sil.org/sgml/sgml.html)
Davis, Stephen Paul. 1995. Digital Image Collections: Cataloging Data Model & Network Access. (http://www.cc.columbia.edu/cu/libraries/inside/projects/diap/paper.html)
Hypertext Markup Language - 2.0
(http://www.w3.org/hypertext/WWW/MarkUp/html-spec)
Panorama--SGML on the Web
(http://www.oclc.org:5047/oclc/research/panorama/)
Pitti, Daniel V. [1994?]. The Berkeley Finding Aid Project, Standards in Navigation. (ftp://library.berkeley.edu/pub/sgml/findaid/arlpap.txt)
Sperberg-McQueen, C.M. and Burnard, Lou. 1994. Guidelines for Electronic Text Encoding and Interchange. Chicago, Ill. and Oxford, England: Text Encoding Initiative. (http://etext.virginia.edu/TEI.html)
TEI Home Page, 1994-
(http://www.uic.edu/orgs/tei/)
USMARC DTD
(ftp://library.berkeley.edu/pub/sgml/marcdtd/)
Van Herwijnen, Eric. 1990. Practical SGML. Dordrecht: Kluwer Academic Publishers.
The World Wide Web Consortium
(http://www.w3.org/pub/WWW/MarkUp/)
Edward Gaynor is Associate Director of Special Collections and Coordinator of the Special Collections Digital Center at the University of Virginia Library. Thanks to Christie Stephenson, Coordinator of the Digital Image Center at the University of Virginia Library, for helpful comments on a previous draft of this article.
Edward Gaynor,
Associate Director of Special Collections
University of Virginia Library
Charlottesville, Virginia 22903
voice: (804) 924-3138
email: gaynor@virginia.edu