MARC Data in an SGML Structure

[This local archive copy is from the official and canonical URL, http://www.acctbief.org/avenir/marcsgml.htm#par1; please refer to the canonical source document if possible.]

THE FUTURE OF COMMUNICATION FORMATS

MARC Data in an SGML Structure

BY
Sally H. McCallum
Chief, Network Development and MARC Standards Office
Library of Congress, Washington DC, 20540 USA
smcc@loc.gov
October 9th 1996

MARC and SGML

Why Create a MARC Data DTD?

Development of the MARC Data DTD

Plans

Conclusion

It is difficult to know where to begin as the SGML environment is so multifaceted. Fortunately, the specific topic of this paper is the SGML DTD that has been developed for the MARC record. The following will first give brief information about the characteristics of SGML, comparing and contrasting the SGML data structure and markup with the MARC structure and markup, and then discuss why the DTD is being developed, indicate some of our design choices, and finally give the plans for this activity.

MARC and SGML

Vocabulary

Before describing some general characteristics of MARC and SGML, clarification of the vocabulary used is needed. SGML employs a very different vocabulary from MARC to describe its parts and products, although most concepts are roughly comparable. Since many readers will be more familiar with the MARC nomenclature for describing machine-readable bibliographic records, the following will relate the vocabularies but then primarily use the MARC terms in order to simplify the presentation.

SGML is actually only a generalized data structure, formally called ISO 8879, Standard Generalized Markup Language (SGML). The comparable structure standard for MARC is ISO 2709, Format for Information Exchange (INEX). Neither SGML nor INEX can be used in an implementation until tags and rules are established in conformance with these ISO structure standards. A tag set for the INEX structure is called a "format", whereas one under the SGML structure is called a "document type definition" or "DTD". So formats and DTDs are comparable concepts. Finally, if a bibliographic description is marked up using tags specified in a format or a DTD, the result is called a "record" in the INEX environment and a "document instance" in the SGML environment.

The obvious question is, then, what is MARC? MARC actually refers to a certain subset of formats that have the INEX structure -- for example Can/MARC, InterMARC, CCF, and USMARC are all MARC formats that have the INEX structure. They all essentially use the same options allowed by INEX and all are intended to encode bibliographic data.

In the following the more specific terminology "MARC", "MARC format" and "MARC record" is used, rather than the strictly correct "INEX", "INEX format" and "INEX record". The comparable SGML concepts will be called "SGML", "SGML DTD", and "SGML record". This is a slight departure from orthodox SGML vocabulary. For this paper the SGML "document instance" concept will be split into "SGML record", if the content is bibliographic, and "SGML document", if it is the full text of an item. Marked up bibliographic data will always be called a "record". Although different MARC formats have different tags, they all have a commonality in the data they encode. That bibliographic content of records will be referred to by various commonly used terms: cataloging data, metadata, bibliographic data, or MARC data, regardless of whether it is marked up using MARC format tagging or SGML DTD tagging. The terms "tagging" and "marking up" will be used interchangeably.

Characteristics of SGML and MARC

ISO 2709 or MARC takes a data dictionary approach to markup. The characteristics of the MARC data structure focus on allowing detailed markup of variable length (but not extremely long) data elements to facilitate complex indexing and sorting requirements. It allows efficient access to the data in order to manipulate and use various subset combinations and to index selected parts. The field tags, indicators, subfields, and particularly the directory structure specified in ISO 2709 support these requirements.
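
The role of the directory can be illustrated with a short sketch. The Python below is not part of the original paper; it assumes the usual USMARC settings for ISO 2709 (a 24-character leader, 12-character directory entries, two indicators, one-character subfield codes) and an ASCII-only sample record. The function names `parse_record` and `build_sample` are hypothetical.

```python
FT = b"\x1e"  # field terminator
RT = b"\x1d"  # record terminator
SF = b"\x1f"  # subfield delimiter


def parse_record(raw: bytes) -> dict:
    """Split one ISO 2709 record into its leader and fields via the directory."""
    leader = raw[:24].decode("ascii")
    base = int(leader[12:17])  # base address of data
    # The directory sits between the leader and the data and ends with FT.
    directory = raw[24:base - 1].decode("ascii")
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[:3]
        length = int(entry[3:7])   # field length, terminator included
        start = int(entry[7:12])   # offset relative to the base address
        body = raw[base + start:base + start + length - 1]  # drop terminator
        if tag < "010":
            # Control fields carry no indicators or subfields.
            fields.append((tag, body.decode("ascii")))
            continue
        indicators = body[:2].decode("ascii")
        subfields = [
            (chunk[:1].decode("ascii"), chunk[1:].decode("utf-8"))
            for chunk in body[2:].split(SF)[1:]
        ]
        fields.append((tag, indicators, subfields))
    return {"leader": leader, "fields": fields}


def build_sample() -> bytes:
    """Assemble a one-field sample record so the parser has something to read."""
    field = (b"10" + SF + b"aSGML :" + SF + b"ban author's guide"
             + SF + b"cMartin Bryan" + FT)
    directory = b"245%04d00000" % len(field) + FT
    base = 24 + len(directory)
    total = base + len(field) + len(RT)
    leader = b"%05d" % total + b"nam  22" + b"%05d" % base + b"   4500"
    return leader + directory + field + RT


if __name__ == "__main__":
    print(parse_record(build_sample())["fields"])
```

Because every field is located through its directory entry, a program can jump straight to the 245 field without scanning the fields before it, which is the efficient access described above.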

The data structure specified in ISO 8879 or SGML is targeted at encoding documents that may be any length, but will usually be very long. The content of these documents must maintain its linear order, but hierarchies inherent in the data also need to be preserved. The method for establishing tags and attributes in SGML supports these requirements. SGML has no directory access to the data of the kind found in MARC, which suits the often great (yet variable) length of the data segments. The emphasis in tagging is on structural and presentational characteristics rather than retrieval or sorting characteristics.

As was mentioned above, MARC and SGML are both formal standards that specify data structures. The formats based on those structures, that specify the tags used for the data and the semantics for using them, must be established by the user communities. For MARC, there is strong industry standardization on an implementation schema for cataloging and related data, the various specific MARC formats -- Can/MARC, USMARC, UNIMARC, CCF, etc.

SGML is newer than MARC and is still in a period of proliferation of implementations for textual data. Two leading tagging schemes are ISO 12083, Electronic Manuscript Preparation and Markup, and TEI, the Guidelines for Electronic Text Encoding and Interchange. ISO 12083 was targeted for modern publications, with emphasis on hierarchical structures, while the TEI was developed initially for conversion of older text into electronic form. Older documents often offer an array of structural characteristics that are not compatible with a hierarchical tagging schema.

There are a number of other SGML document markups that were developed for special purposes, such as CALS (Computer-aided Acquisition and Logistics Support Program), a markup designed for manuals and other technical documentation; HTML (Hypertext Markup Language), a very simple markup designed for use in marking documents for World Wide Web (WWW) servers; and MAJOUR, a markup targeted for journal articles. Individual publishers that have adopted SGML structures for publications have tended to define their own proprietary SGML tagging schemes. There are thus many, many unique SGML-based markups due to differences in publications and viewpoints. The programming language nature of the basic standard and the development of markup-creation tools have encouraged experimentation and invention.

Why Create a MARC Data DTD?

Assumptions

I want to state three assumptions as a basis for the following discussion. The assumptions indicate continuing need for catalogs of bibliographic records, even though new, non-catalog uses for cataloging data may have given rise to this MARC Data DTD effort.

First, in the future there will continue to be both electronic publications and items in other media, including an enormous number of print materials.

I say this because when working in the SGML environment, it is easy to begin to focus on electronic items as if everything is or will soon be available to end users electronically. That is not the case, although the number of important (and unimportant) electronic documents certainly will increase. SGML markup is used to create documents to be disseminated electronically as well as those that will be distributed only in print form.

The second assumption is that in the future catalogs that bring together data about items, i.e., cataloging or metadata, will continue to be essential. These catalogs will continue to contain metadata about items in all media, including print and electronic.

This point is made because, when working with SGML, the distinction between the item itself and metadata about the item becomes blurred. It is easy to begin to assume that if parts of an electronic item, such as the title, are tagged, or if metadata is tagged and appended to an item, then there is no need to also pull that information into a catalog. That is not the case. The concept of the catalog of only metadata in which all items in all media are described will continue to be essential for information access.

The third assumption is that cataloging data will continue to consist of 1) information directly from the item, 2) subject information about the item that is supplied through document analysis, and 3) information that has been normalized for amalgamation and subsequent retrieval with cataloging data from the past and the future.

While derivation of cataloging data automatically from the piece is an ideal that may some day be possible, the requirements for a catalog are still such that there are human contributions necessary to the process.

Proposed Uses of a MARC Data DTD

There are increasing numbers of electronic bibliographic items being produced. Thus one use of a MARC Data DTD is for enclosing the MARC record with the electronic document. If MARC data had a standard representation in SGML, then for electronic documents the MARC data could be sent along with the document and a standard utility could process the data into a MARC structure for addition to the catalog.

For electronic documents, there is also a potential for full text retrieval but experience has shown that such retrieval is very inexact. There is interest in making MARC data, with its controlled access points, available with the full text to the retrieval programs that process the documents. The record could then help refine full text searching and reduce false hits. This is of special interest to the Library of Congress in its digital library program.

It could be especially valuable to include the catalog record with electronic documents that are digital images. The MARC data that accompanies the image could be converted into a MARC record for a MARC-based catalog and could also reside in the SGML-based file for electronic documents.

Experiments are also taking place using SGML as the format for bibliographic citation files. While most of this work thus far has been for the construction of bibliographies, which are static bibliographic files, experimentation is also taking place with dynamic catalog files, especially in instances where the application deals only or mostly with electronic documents. The Cheshire Project at Berkeley is such an experiment.

Development of the MARC Data DTD

In the past few years, there have been several experiments in creating SGML access to MARC records. Among them was one by Michel Vulpe who, in 1990 while working for Softquad, Inc., wrote a MARC data DTD for experimental purposes. He revised it several times to experiment with different tagging schemes -- which highlights the fact that there are many different ways that a MARC record can be treated in a DTD.

Another experiment began in 1994, when Jerome McDonough at the University of California at Berkeley School of Information Management and Systems, wrote a DTD for MARC records to support the Cheshire catalog research project mentioned above. Version 8 of this DTD is still being used in the Cheshire project. This DTD carries the MARC coded data fields as strings, and includes all MARC leader values and the record directory in the SGML record. Reversibility of records back to the USMARC format was one of the basic criteria of the DTD.

In the fall of 1995, the Library of Congress Network Development and MARC Standards Office, where the USMARC format is maintained, began a project to develop a MARC Data DTD that could be made widely available and lead to a standard approach. There had been various requests that this be done, since otherwise the community would have many different MARC DTDs and experimentation and interchange would be adversely affected.

A MARC/SGML working group consisting of several SGML experts and several MARC experts was brought together. The MARC-oriented members all had automation experience and were working with electronic text issues. One member of the group was Jerome McDonough who had written the Cheshire project DTD, since one option was to consider whether the MARC Data DTD could simply be an extension of the Berkeley work. In addition the working group had members who had worked on the ISO 12083 DTD and the TEI DTD.

After intense discussion, the group established a set of general requirements and specifications that they thought should be followed for the DTD. It was decided that the differences were such that the Cheshire project DTD could not be used, although it had a strong influence on the principles developed. At the present time the two DTDs are quite different in certain areas, but they may work for different types of applications.

The MARC Data DTD was then written by a contractor, ATLIS Systems, in the spring of 1996 and an alpha-test version was completed in June of this year. It took another couple of months to test the parsing and verify that all the required external files were working correctly with the DTD, but the DTD is now ready for an alpha test phase.

Since SGML is really a high-level programming language, a DTD may be viewed as a program. This MARC DTD is composed of over 20,000 lines of code and references 12 external entity sets. The DTD has a large number of comments, making it easy to understand and implement.

Characteristics of the MARC Data DTD

The following are some of the key principles around which the DTD was built, established largely by the working group.

The DTD must support convertibility between the USMARC structure and the SGML encoding without loss of MARC data.
SGML is flexible, so accommodating all MARC data, including the indicators and the coded data fields, was not difficult. It should be noted that it will be possible to create a record using either a MARC format system or a MARC Data DTD system. If the record is created in the SGML environment, it is fully convertible to the MARC structure.
Only two DTDs are required. The "bibliographic" DTD will accommodate the three USMARC formats for Bibliographic, Holdings and Community Information data. The "Authority" DTD will accommodate USMARC name and subject Authority and Classification schedule data.
There are five USMARC formats, Bibliographic, Holdings, Community Information, Authority and Classification. The first three are, with a few exceptions for Community Information, one format split into three and the last two are one split into two. This decision to reduce the number of DTDs to two was actually made by my Office when we were working with the contractor. It is an interesting approach but will need to be reviewed during the testing phase.
The structural elements of MARC do not need to be carried into the SGML-structured MARC record, as they can be set and calculated when the MARC record is converted from the SGML one.
For example, the MARC record has some values in its leader area that help define the record structure, such as the number of indicator positions in each field. The DTD is defined for USMARC, where there are always 2 indicator positions, so that value can be omitted in the SGML record. The record directory is another structure that will not be carried into SGML. When MARC data is converted from SGML to MARC, the converter will be expected to generate the directory.
Tagging should reflect the MARC tags rather than names of data elements.
Field tag: mrcb245 -- "Title Statement" --
There was much discussion on this point. The use of mnemonics or word names was considered as they could be more usable to those unfamiliar with MARC. The number of data elements, however, makes it difficult to devise meaningful brief names, and the numeric naming is a well established language to those familiar with USMARC. Numeric tagging also transcends language differences and there are many users in non-English-speaking countries.
It should be noted that the alphabetic prefix to the tag is needed since SGML requires that tags begin with alphabetics. For the MARC Data DTD the alpha characters were used to indicate context: "mrcb" for "MARC Bibliographic DTD" and "mrca" for "MARC Authority DTD."
Subfield identifiers should be defined as unique elements.
Subfield tag: mrcb245-a -- "Title" --
MARC format subfield tagging was another difficult decision. The original suggestion from the working group was to establish a generic pool of SGML elements without specific names. The tags could have been as simple as "a" through "z" and "zero" through "nine". In the end we did in fact define each subfield as a unique element so that a name could be associated with it. This will facilitate the use of SGML tools for record creation. It will also make conversion to a MARC format record easier since the SGML subfield tags contain context. It results, however, in a very large number of SGML elements. This is a choice that will need to be carefully examined during the test period.
The optionality of data elements needs to be given in preference to repeatability.
It is difficult to make items optional and not repeatable in SGML without forcing an order on them. MARC takes a more data dictionary approach to the data in the records. It clusters data elements into fields, but the order of data in the subfields is not mandated by the format but is controlled by the data itself. While there may be a "usual" order, MARC allows an "unusual" order also. In the DTD, the order of fields is specified, but the order of subfields is flexible. All fields and subfields are optional. Repeatability was treated as an attribute.
Obsolete MARC data should be accommodated, but flagged.
Obsolete data elements in USMARC are a problem because record creation now spans over 25 years and there are millions of records in thousands of systems. There are not a large number of these elements, and many were very early experimental tags and thus were little used. To preserve upward compatibility, however, MARC systems are not required to remove these tags when they become obsolete, but are barred from using them in new input. Since an older MARC record with such data could well be subject to conversion into SGML, these data elements need to be defined. A flag in the DTD will enable applications to block input of obsolete tags when new input is being carried out in an SGML system. This arrangement is the norm for MARC systems.
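
The principle that structural elements can be calculated at conversion time can be sketched in a few lines. The Python below is an illustration only, not the conversion program planned by the Network Development and MARC Standards Office. It assumes that fields have already been extracted from the SGML record as (tag, indicators, subfields) tuples, that all fields carry indicators, and that the data is ASCII. The function name `to_iso2709` and the default leader codes are hypothetical.

```python
FT, RT, SF = b"\x1e", b"\x1d", b"\x1f"  # field, record, subfield delimiters


def to_iso2709(fields, leader_codes=b"nam  ", tail=b"   "):
    """Serialize (tag, indicators, subfields) tuples into one ISO 2709 record.

    Nothing structural comes from the input: field lengths, starting
    positions, the base address, the record length, and the fixed
    indicator and subfield-code counts ("22") are all generated here.
    """
    directory = b""
    data = b""
    for tag, indicators, subfields in fields:
        body = indicators.encode("ascii")
        for code, value in subfields:
            body += SF + code.encode("ascii") + value.encode("utf-8")
        body += FT
        # Directory entry: 3-character tag, 4-digit length, 5-digit offset.
        directory += b"%s%04d%05d" % (tag.encode("ascii"), len(body), len(data))
        data += body
    directory += FT
    base = 24 + len(directory)             # base address of data
    total = base + len(data) + len(RT)     # total record length
    leader = (b"%05d" % total + leader_codes + b"22"
              + b"%05d" % base + tail + b"4500")
    return leader + directory + data + RT


if __name__ == "__main__":
    record = to_iso2709([("245", "10", [("a", "SGML :"), ("c", "Martin Bryan")])])
    print(record)
```

Since the directory and the leader lengths are derived entirely from the field data, leaving them out of the SGML record loses no information.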

Plans

The working group also agreed upon several steps that will be necessary for the DTD to really become useful for experimental work, and have the desired standardizing impact. These are to be worked out during the alpha and beta test phases of the DTD.

The DTD should be maintained in synch with the USMARC formats so that convertibility is upheld.
My office, the Network Development and MARC Standards Office, has taken responsibility for maintaining and making available the DTD and assuring that it is updated when the MARC formats are.
Freeware conversion software should be available for easy conversion between the two structures for the MARC data.
The Network Development and MARC Standards Office plans to develop a conversion program and make it available for ftp download. It will be accessible from the MARC home page on the Web. The DTD itself and test files will be available in the same manner.
The DTD should be defined so that it can stand alone as a MARC record document in SGML, or be embedded into a document DTD, such as ISO 12083 or TEI.
The DTD is currently written as a stand-alone DTD, although it can obviously accompany any type of document. Consideration was given to the possibility of embedding when the tagging was established. The SGML community is currently working on rules for joining DTDs. We hope to have this requirement satisfied during the development period.
Both of the document DTDs mentioned above, ISO 12083 and TEI, allow for including and tagging some bibliographic information in the SGML document instance. In the case of ISO 12083 it is in the "front matter" area, and for TEI it is in the "title page" or the header. The MARC Data DTD is not intended to be a substitute for these areas or for the TEI header. An SGML encoded MARC record may be attached to an ISO 12083 or a TEI document where it may repeat or augment data already in the document header or front matter sections.

Examples

The following title area fragments from the USMARC format and the MARC Data DTD indicate the differences in specification and approach and the similarity in outcome of the two document structures.

MARC Format fragment, the title field:

 245   TITLE STATEMENT  (NR)
        Indicators
          First   Title added entry
              0   No title added entry
              1   Title added entry
          Second  Nonfiling characters
              0-9 Number of nonfiling characters present
        Subfield Codes
             $a Title  (NR)
             $b Remainder of title  (NR)
             $c Remainder of title page transcription/statement of responsibility  (NR)

MARC Data DTD fragment:

<!ELEMENT mrcb245 - - ((mrcb245-a | mrcb245-b | mrcb245-c | mrcb245-d
                      | mrcb245-e | mrcb245-f | mrcb245-g | mrcb245-h
                      | mrcb245-k | mrcb245-n | mrcb245-p | mrcb245-s
                      | mrcb245-6)*)                                >
<!ATTLIST mrcb245
       name        CDATA   #FIXED   "Title Statement"
       obsolete    CDATA   #FIXED   "no"
       repeatable  CDATA   #FIXED   "no"
       i1     (i10 | i11 | i1fill)              #REQUIRED
       i2     (i20 | i21 | i22 | i23 | i24 |
               i25 | i26 | i27 | i28 | i29 |
               i2fill)                          #REQUIRED          >
<!ELEMENT mrcb245-a - - (#PCDATA)                                   >
<!ATTLIST mrcb245-a
       name        CDATA   #FIXED   "Title"
       obsolete    CDATA   #FIXED   "no"
       repeatable  CDATA   #FIXED   "no"                            >
<!ELEMENT mrcb245-b - - (#PCDATA)                                   >
<!ATTLIST mrcb245-b
       name        CDATA   #FIXED   "Remainder of title"
       obsolete    CDATA   #FIXED   "no"
       repeatable  CDATA   #FIXED   "no"                            >
<!ELEMENT mrcb245-c - - (#PCDATA)                                   >
<!ATTLIST mrcb245-c
       name        CDATA   #FIXED   "Remainder of title page transcription..."
       obsolete    CDATA   #FIXED   "no"
       repeatable  CDATA   #FIXED   "no"                            >

Marked up fragment of a record:

MARC:

[245] 10$aSGML :$ban author's guide to the standard generalized markup language /$cMartin Bryan

SGML:

<mrcb245 i1="i11" i2="i20"><mrcb245-a>SGML :</mrcb245-a><mrcb245-b>an author's guide to the standard generalized markup language /</mrcb245-b><mrcb245-c>Martin Bryan</mrcb245-c></mrcb245>
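
The mapping from a MARC field to SGML markup is mechanical enough to sketch in code. The Python below is a hedged illustration rather than an official converter. It uses the hyphenated subfield element names of the marked-up record above (mrcb245-a) and the prefixed indicator tokens declared in the ATTLIST of the DTD fragment (i10, i11, i1fill). It writes explicit end tags because the element declarations (`- -`) do not allow tag omission. It assumes that "|" is the fill character and that subfield values contain no SGML markup delimiters. The function name `field_to_sgml` is hypothetical.

```python
def field_to_sgml(tag, indicators, subfields, prefix="mrcb"):
    """Render one MARC variable data field as SGML markup.

    tag        -- three-character MARC tag, e.g. "245"
    indicators -- two-character string, e.g. "10"
    subfields  -- list of (code, value) pairs in record order
    prefix     -- "mrcb" (Bibliographic DTD) or "mrca" (Authority DTD)
    """
    def indicator_token(position, value):
        # Enumerated values are prefixed (i11, i20, ...) in the DTD fragment.
        if value.isdigit():
            return f"i{position}{value}"
        if value == "|":  # assumed fill character
            return f"i{position}fill"
        raise ValueError(f"no token known for indicator value {value!r}")

    name = f"{prefix}{tag}"
    i1 = indicator_token(1, indicators[0])
    i2 = indicator_token(2, indicators[1])
    parts = [f'<{name} i1="{i1}" i2="{i2}">']
    for code, value in subfields:
        # Each subfield is its own element, so the tag carries its context.
        parts.append(f"<{name}-{code}>{value}</{name}-{code}>")
    parts.append(f"</{name}>")
    return "".join(parts)


if __name__ == "__main__":
    print(field_to_sgml("245", "10", [("a", "SGML :"), ("c", "Martin Bryan")]))
```

Because each subfield element name embeds the field tag, the reverse conversion only needs to read the characters after the prefix to recover the MARC tag and subfield code.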

Conclusion

This work is an interesting development for bibliographic control, driven by the revolution taking place in the accessibility of electronic documents. Over the next year we expect to see several experiments started and await the results to see how the DTD will need to evolve. The digital library "movement" itself is in its infancy and may go in any number of directions, but the cataloging tools built by librarians, who have been maintaining electronic catalogs for over 20 years, will still be a cornerstone of the new mixed environment. Our formats and cataloging data will be a major contribution to the digital environment.

References:

Larson, Ray, et al. "Cheshire II: Designing a Next-Generation Online Catalog." Journal of the American Society for Information Science, v.47, no.7, p.555-567.

Cole, Timothy W. and Kazmer, Michelle M. "SGML as a Component of the Digital Library." Library Hi-Tech, v.13, no.4, p.75-90.
