[This local archive copy is from the official and canonical URL, http://www.acctbief.org/avenir/marcsgml.htm#par1; please refer to the canonical source document if possible.]
MARC Data in an SGML Structure
October 9th 1996
MARC and SGML | Why Create a MARC Data DTD? | Development of the MARC Data DTD | Plans | Conclusion
Vocabulary
Before describing some general characteristics of MARC and SGML, clarification of the vocabulary used is needed. SGML employs a very different vocabulary from MARC to describe its parts and products, although most concepts are roughly comparable. Since many readers will be more familiar with the MARC nomenclature for describing machine-readable bibliographic records, the following will relate the vocabularies but then primarily use the MARC terms in order to simplify the presentation.
SGML is actually only a generalized data structure, formally called ISO 8879, Standard Generalized Markup Language (SGML). The comparable structure standard for MARC is ISO 2709, Format for Information Exchange (INEX). Neither SGML nor INEX can be used in an implementation until tags and rules are established in conformance with these ISO structure standards. A tag set for the INEX structure is called a "format", whereas one under the SGML structure is called a "document type definition" or "DTD". So formats and DTDs are comparable concepts. Finally, if a bibliographic description is marked up using tags specified in a format or a DTD, the result is called a "record" in the INEX environment and a "document instance" in the SGML environment.
The obvious question is, then, what is MARC? MARC actually refers to a certain subset of formats that have the INEX structure -- for example Can/MARC, InterMARC, CCF, and USMARC are all MARC formats that have the INEX structure. They all essentially use the same options allowed by INEX and all are intended to encode bibliographic data.
In the following the more specific terminology "MARC", "MARC format" and "MARC record" is used, rather than the strictly correct "INEX", "INEX format" and "INEX record". The comparable SGML concepts will be called "SGML", "SGML DTD", and "SGML record". This is a slight departure from orthodox SGML vocabulary. For this paper the SGML "document instance" concept will be split into "SGML record", if the content is bibliographic, and "SGML document", if it is the full text of an item. Marked up bibliographic data will always be called a "record". Although different MARC formats have different tags, they all have a commonality in the data they encode. That bibliographic content of records will be referred to by various commonly used terms: cataloging data, metadata, bibliographic data, or MARC data, regardless of whether it is marked up using MARC format tagging or SGML DTD tagging. The terms "tagging" and "marking up" will be used interchangeably.
Characteristics of SGML and MARC
ISO 2709 or MARC takes a data dictionary approach to markup. The characteristics of the MARC data structure focus on allowing detailed markup of variable length (but not extremely long) data elements to facilitate complex indexing and sorting requirements. It allows efficient access to the data in order to manipulate and use various subset combinations and to index selected parts. The field tags, indicators, subfields, and particularly the directory structure specified in ISO 2709 support these requirements.
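The directory-based access described above can be sketched in a few lines of Python. This is only an illustration, not part of the MARC Data DTD work: it assumes the usual ISO 2709 layout as used by USMARC (a 24-character leader, 12-character directory entries, and the control characters 1E, 1D and 1F hex as field terminator, record terminator and subfield delimiter), and it decodes the data as UTF-8 purely for convenience. The function names and the sample record are invented for the example.

```python
FT = b"\x1e"  # field terminator
RT = b"\x1d"  # record terminator
SD = b"\x1f"  # subfield delimiter


def parse_record(raw):
    """Return (leader, [(tag, field_bytes), ...]) for one ISO 2709 record."""
    leader = raw[:24].decode("ascii")
    base = int(leader[12:17])        # base address of the data area
    directory = raw[24:base - 1]     # directory without its terminator
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12].decode("ascii")
        tag, length, start = entry[:3], int(entry[3:7]), int(entry[7:12])
        data = raw[base + start: base + start + length]
        fields.append((tag, data.rstrip(FT)))
    return leader, fields


def split_data_field(data):
    """Split a variable data field into indicators and (code, value) pairs.

    Assumption: two indicators and one-character subfield codes, as in USMARC.
    """
    text = data.decode("utf-8")
    subfields = [(c[0], c[1:]) for c in text[2:].split("\x1f")[1:] if c]
    return text[:2], subfields


# Build a one-field sample record so the offsets are computed, not hand-counted.
field = b"10" + SD + b"aSGML :" + SD + b"ban author's guide" + FT
directory = b"245" + b"%04d" % len(field) + b"00000" + FT
base = 24 + len(directory)
leader = b"%05d" % (base + len(field) + 1) + b"nam  22" + b"%05d" % base + b"   4500"
SAMPLE = leader + directory + field + RT

leader_text, fields = parse_record(SAMPLE)
for tag, data in fields:
    print(tag, split_data_field(data))
```

Because every field is located through the directory, a program can pull out a single field such as 245 without scanning the rest of the record, which is the efficiency the directory structure is meant to provide.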
The data structure specified in ISO 8879 or SGML is targeted at encoding documents that may be any length, but will usually be very long. The content of these documents must maintain its linear order, but hierarchies inherent in the data also need to be preserved. The method for establishing tags and attributes in SGML supports these requirements. The absence of directory access to the data, as found in MARC, is suitable for the often great (yet variable) length of the data segments. The emphasis in tagging is on structural and presentational characteristics rather than retrieval or sorting characteristics.
As was mentioned above, MARC and SGML are both formal standards that specify data structures. The formats based on those structures, that specify the tags used for the data and the semantics for using them, must be established by the user communities. For MARC, there is strong industry standardization on an implementation schema for cataloging and related data, the various specific MARC formats -- Can/MARC, USMARC, UNIMARC, CCF, etc.
SGML is newer than MARC and is still in a period of proliferation of implementations for textual data. Two leading tagging schemes are ISO 12083, Electronic Manuscript Preparation and Markup, and TEI, the Guidelines for Electronic Text Encoding and Interchange. ISO 12083 was targeted for modern publications, with emphasis on hierarchical structures, while the TEI was developed initially for conversion of older text into electronic form. Older documents often offer an array of structural characteristics that are not compatible with a hierarchical tagging schema.
There are a number of other SGML document markups that were developed for special purposes, such as CALS (Computer-aided Acquisition and Logistics Support Program), a markup designed for manuals and other technical documentation; HTML (Hypertext Markup Language), a very simple markup designed for use in marking documents for World Wide Web (WWW) servers; and MAJOUR, a markup targeted for journal articles. Individual publishers that have adopted SGML structures for publications have tended to define their own proprietary SGML tagging schemes. There are thus many, many unique SGML-based markups due to differences in publications and viewpoints. The programming language nature of the basic standard and development of markup-creation tools has encouraged experimentation and invention.
Assumptions
I want to state three assumptions as a basis for the following
discussion. The assumptions indicate continuing need for catalogs
of bibliographic records, even though new, non-catalog uses for
cataloging data may have given rise to this MARC Data DTD effort.
First, in the future there will continue to be both electronic
publications and items in other media, including an enormous number
of print materials.
I say this because when working in the SGML environment, it is
easy to begin to focus on electronic items as if everything is
or will soon be available to end users electronically. That is
not the case, although the number of important (and unimportant)
electronic documents certainly will increase. SGML markup is
used to create documents to be disseminated electronically as
well as those that will be distributed only in print form.
The second assumption is that in the future catalogs that bring
together data about items, i.e., cataloging or metadata, will
continue to be essential. These catalogs will continue to contain
metadata about items in all media, including print and electronic.
This point is made because, when working with SGML, the distinction
between the item itself and metadata about the item becomes blurred.
It is easy to begin to assume that if parts of an electronic
item, such as the title, are tagged, or if metadata is tagged
and appended to an item, then there is no need to also pull that
information into a catalog. That is not the case. The concept
of the catalog of only metadata in which all items in all media
are described will continue to be essential for information access.
The third assumption is that cataloging data will continue to
consist of 1) information directly from the item, 2) subject information
about the item that is supplied through document analysis, and
3) information that has been normalized for amalgamation and subsequent
retrieval with cataloging data from the past and the future.
While derivation of cataloging data automatically from the piece
is an ideal that may some day be possible, the requirements for
a catalog are still such that there are human contributions necessary
to the process.
Proposed Uses of a MARC Data DTD
There are increasing numbers of electronic bibliographic items
being produced. Thus one use of a MARC Data DTD is for enclosing
the MARC record with the electronic document. If MARC data had
a standard representation in SGML, then for electronic documents
the MARC data could be sent along with the document and a standard
utility could process the data into a MARC structure for addition
to the catalog.
For electronic documents, there is also a potential for full
text retrieval but experience has shown that such retrieval is
very inexact. There is interest in making MARC data, with its
controlled access points, available with the full text to the
retrieval programs that process the documents. The record could
then help refine full text searching and reduce false hits. This
is of special interest to the Library of Congress in its digital
library program.
It could be especially valuable to include the catalog record
with electronic documents that are digital images. The MARC data
that accompanies the image could be converted and derived into
a MARC-based catalog and also reside in the SGML-based file for
electronic documents.
Experiments are also taking place using SGML as the format for
bibliographic citation files. While most of this work thus far
has been for the construction of bibliographies, which are static
bibliographic files, experimentation is also taking place with
dynamic catalog files, especially in instances where the
application deals only or mostly with electronic documents. The
Cheshire Project at Berkeley is such an experiment.
In the past few years, there have been several experiments in
creating SGML access to MARC records. One was carried out by Michel
Vulpe who, in 1990 while working for SoftQuad, Inc., wrote a MARC data
DTD for experimental purposes. He revised it several times to
experiment with different tagging schemes, which highlights
the fact that there are many different ways that a MARC record
can be treated in a DTD.
Another experiment began in 1994, when Jerome McDonough, at the
University of California at Berkeley School of Information Management
and Systems, wrote a DTD for MARC records to support the Cheshire
catalog research project mentioned above. Version 8 of this DTD
is still being used in the Cheshire project. This DTD carries
the MARC coded data fields as strings, and includes all MARC leader
values and the record directory in the SGML record. Reversibility
of records back to the USMARC format was one of the basic criteria
of the DTD.
In the fall of 1995, the Library of Congress Network Development
and MARC Standards Office, where the USMARC format is maintained,
began a project to develop a MARC Data DTD that could be made
widely available and lead to a standard approach. There had been
various requests that this be done, or else the community would
have many different MARC DTDs and experimentation and interchange
would be adversely affected.
A MARC/SGML working group consisting of several SGML experts
and several MARC experts was brought together. The MARC-oriented
members all had automation experience and were working with electronic
text issues. One member of the group was Jerome McDonough who
had written the Cheshire project DTD, since one option was to
consider whether the MARC Data DTD could simply be an extension
of the Berkeley work. In addition the working group had members
who had worked on the ISO 12083 DTD and the TEI DTD.
After intense discussion, the group established a set of general
requirements and specifications that they thought should be followed
for the DTD. It was decided that the differences were such that
the Cheshire project DTD could not be used, although it had a
strong influence on the principles developed. At the present
time the two DTDs are quite different in certain areas, but they
may work for different types of applications.
The MARC Data DTD was then written by a contractor, ATLIS Systems,
in the spring of 1996 and an alpha-test version was completed
in June of this year. It took another couple of months to test
the parsing and verify that all the required external files were
working correctly with the DTD, but the DTD is now ready for an
alpha test phase.
Since SGML is in effect a formal language for declaring markup, a
DTD may be viewed much as a program is. This MARC DTD is composed
of over 20,000 lines of code and references 12 external entity sets.
The DTD has a large number of comments, making it easy to understand
and implement.
Characteristics of the MARC Data DTD
The following are some of the key principles around which the
DTD was built, established largely by the working group.
SGML is flexible, so accommodating all MARC data, including the
indicators and the coded data fields, was not difficult. It should
be noted that it will be possible to create a record using either a
MARC format system or a MARC Data DTD system. If the record is created
in the SGML environment, it is fully convertible to the MARC structure.
There are five USMARC formats: Bibliographic, Holdings, Community
Information, Authority, and Classification. The first three are,
with a few exceptions for Community Information, one format split
into three, and the last two are one split into two. The DTDs follow
that grouping, so there are two DTDs rather than five. This decision
to reduce the number of DTDs to two was actually made by my office
when we were working with the contractor. It is an interesting
approach but will need to be reviewed during the testing phase.
Some MARC structural data need not be carried into SGML. For example,
the MARC record has some values in its leader area that help define
the record structure, such as the number of indicator positions in
each field. The DTD is defined for USMARC, where the indicator count
is always 2, so that value can be omitted in the SGML
record. The record directory is another structure that will not
be carried into SGML. When MARC data is converted from SGML to
MARC, the converter will be expected to generate the directory.
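What a converter has to regenerate can be sketched as follows. This
is a minimal illustration in Python, not the conversion program
planned by the Office: it assumes the standard USMARC leader layout
(indicator count 2, subfield code length 2, entry map "4500") and
the function name build_record is invented for the example.

```python
FT = b"\x1e"  # field terminator
RT = b"\x1d"  # record terminator


def build_record(fields, status="n", rec_type="a", bib_level="m"):
    """Assemble leader + directory + data from (tag, field_bytes) pairs.

    Each field_bytes value holds the indicators and subfields of one
    field, without the trailing field terminator.
    """
    directory = b""
    data = b""
    for tag, content in fields:
        content += FT
        # Directory entry: 3-char tag, 4-digit length, 5-digit start offset.
        directory += tag.encode("ascii") + b"%04d%05d" % (len(content), len(data))
        data += content
    directory += FT
    base = 24 + len(directory)        # base address of the data area
    total = base + len(data) + 1      # +1 for the record terminator
    leader = (
        b"%05d" % total
        + (status + rec_type + bib_level).encode("ascii")
        + b"  22"                     # indicator count 2, subfield code length 2
        + b"%05d" % base
        + b"   4500"
    )
    return leader + directory + data + RT


record = build_record([("245", b"10\x1faSGML :")])
print(record[:24])
```

The indicator count and the directory never appear in the input to
build_record; both are derived from the field data, which is why
they can be left out of the SGML record without losing information.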
Field tag: mrcb245 -- "Title Statement" --
There was much discussion of how field tags should be named. The
use of mnemonics or word names was considered, as they could be
more usable to those
unfamiliar with MARC. The number of data elements, however, makes
it difficult to devise meaningful brief names, and the numeric
naming is a well established language to those familiar with USMARC.
Numeric tagging also transcends language differences and there
are many users in non-English-speaking countries.
It should be noted that the alphabetic prefix to the tag is needed
since SGML requires that element names begin with a letter. For the
MARC Data DTD the alpha characters were used to indicate context:
"mrcb" for "MARC Bibliographic DTD" and "mrca"
for "MARC Authority DTD."
Subfield tag: mrcb245-a -- "Title" --
MARC format subfield tagging was another difficult decision.
The original suggestion from the working group was to establish
a generic pool of SGML elements without specific names. The tags
could have been as simple as "a" through "z"
and "zero" through "nine". In the end we
did in fact define each subfield as a unique element so that a
name could be associated with it. This will facilitate the use
of SGML tools for record creation. It will also make conversion
to a MARC format record easier since the SGML subfield tags contain
context. It results, however, in a very large number of SGML
elements. This is a choice that will need to be carefully examined
during the test period.
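The naming convention for fields and subfields can be illustrated
with a short Python sketch. It is an assumption-laden example, not
the Office's conversion utility: the function names are invented,
the input uses the "$" display convention for subfield delimiters,
the indicator attributes follow the form shown in the marked up
fragment in the Examples section, end tags are omitted as they are
there, and no escaping of markup characters in the data is attempted.

```python
def element_name(tag, subfield=None, prefix="mrcb"):
    """Return the SGML element name for a MARC field or subfield.

    prefix is "mrcb" for the bibliographic DTD, "mrca" for authority.
    """
    name = f"{prefix}{tag}"
    return f"{name}-{subfield}" if subfield else name


def field_to_sgml(tag, content, prefix="mrcb"):
    """Render one data field, e.g. '10$aSGML :$b...', as SGML markup."""
    i1, i2 = content[0], content[1]
    parts = [f"<{element_name(tag, prefix=prefix)} i1={i1} i2={i2}>"]
    for chunk in content[2:].split("$")[1:]:
        code, text = chunk[0], chunk[1:]
        parts.append(f"<{element_name(tag, code, prefix)}>{text}")
    return "".join(parts)


print(field_to_sgml("245", "10$aSGML :$ban author's guide /$cMartin Bryan"))
```

Because each subfield element name carries its field tag, a
converter going back to MARC can recover both the tag and the
subfield code from the element name alone.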
It is difficult to make items optional and not repeatable in
SGML without forcing an order on them. MARC takes a more data
dictionary approach to the data in the records. It clusters data
elements into fields, but the order of data in the subfields is
not mandated by the format but is controlled by the data itself.
While there may be a "usual" order, MARC allows the "unusual"
order also. In the DTD, the order of fields is specified, but
the order of subfields is flexible. All fields and subfields
are optional. Repeatability was treated as an attribute.
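Since repeatability is recorded as an attribute rather than in the
content model, it seems likely that an application, not the SGML
parser, would have to check it. The Python sketch below shows one
way such a check might look. The table of "repeatable" values is
illustrative only and would in practice be read from the DTD; the
function name is invented for the example.

```python
from collections import Counter

# Illustrative values of the "repeatable" attribute for a few subfield
# elements; an assumption for this example, not copied from the DTD.
REPEATABLE = {
    "mrcb245-a": "no",
    "mrcb245-b": "no",
    "mrcb245-c": "no",
    "mrcb245-n": "yes",
}


def repeat_violations(subfield_elements, repeatable=REPEATABLE):
    """Return names that occur more than once but are not repeatable.

    Order is deliberately ignored, since the DTD leaves subfield
    order flexible.
    """
    counts = Counter(subfield_elements)
    return sorted(
        name for name, n in counts.items()
        if n > 1 and repeatable.get(name, "no") != "yes"
    )


found = repeat_violations(["mrcb245-a", "mrcb245-n", "mrcb245-n", "mrcb245-a"])
print(found)
```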
Obsolete data elements in USMARC are a problem because record
creation now spans over 25 years and there are millions of records
in thousands of systems. There are not a large number of these
elements, and many were very early experimental tags and thus
were little used. To preserve upward compatibility, however,
MARC systems are not required to remove these tags when they become
obsolete, but are barred from using them in new input. Since
an older MARC record with such data could well be subject to conversion
into SGML these data elements need to be defined. A flag in the
DTD will enable applications to block input of obsolete tags when
new input is being carried out in an SGML system. This arrangement
is the norm for MARC systems.
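The behavior described for the obsolete flag might be applied along
the following lines. This Python sketch assumes the flag is exposed
to the application as "yes" or "no" per element, in the style of the
"obsolete" attribute in the DTD fragment; the flag table and the
function name are hypothetical.

```python
# Hypothetical flag values, standing in for what an application would
# read from the "obsolete" attribute of each element in the DTD.
OBSOLETE_FLAGS = {
    "mrcb245-a": "no",
    "mrcb245-b": "no",
    "mrcb245-d": "yes",
    "mrcb245-e": "yes",
}


def check_obsolete(element_names, new_input, flags=OBSOLETE_FLAGS):
    """Return the obsolete elements found in a record.

    Converted legacy records may carry them; new input may not, so
    in that case a ValueError is raised instead.
    """
    found = [n for n in element_names if flags.get(n, "no") == "yes"]
    if new_input and found:
        raise ValueError("obsolete elements in new input: " + ", ".join(found))
    return found


legacy = ["mrcb245-a", "mrcb245-d"]
print(check_obsolete(legacy, new_input=False))
```

The same record passes when it comes from conversion of an older
MARC record and is rejected when keyed as new input, which mirrors
the arrangement described above for MARC systems.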
The working group also agreed upon several steps that will be
necessary for the DTD to really become useful for experimental
work, and have the desired standardizing impact. These are to
be worked out during the alpha and beta test phases of the DTD.
My office, the Network Development and MARC Standards Office,
has taken responsibility for maintaining and making available
the DTD and assuring that it is updated when the MARC formats
are.
The Network Development and MARC Standards Office plans to develop
a conversion program and make it available for ftp download.
It will be accessible from the MARC home page on the Web. The
DTD itself and test files will be available in the same manner.
The DTD is currently written as a stand-alone although it can
obviously accompany any type of document. Consideration was given
when establishing the tagging to the possibility of embedding.
The SGML community is currently working on rules for joining
DTDs. We hope to have this requirement satisfied during the development
period.
Both of the document DTDs mentioned above, ISO 12083 and TEI,
allow for including and tagging some bibliographic information
in the SGML document instance. In the case of ISO 12083 it is
in the "front matter" area, and for TEI it is in the
"title page" or the header. The MARC Data DTD is not
intended to be a substitute for these areas or for the TEI header.
An SGML encoded MARC record may be attached to an ISO 12083 or
a TEI document where it may repeat or augment data already in
the document header or front matter sections.
Examples
The following title area fragments from the USMARC format and
the MARC Data DTD indicate the differences in specification and
approach and the similarity in outcome of the two document structures.
MARC Format fragment, the title field:
    245 TITLE STATEMENT (NR)
      Indicators
        First    Title added entry
          0      No title added entry
          1      Title added entry
        Second   Nonfiling characters
          0-9    Number of nonfiling characters present
      Subfield Codes
        $a  Title (NR)
        $b  Remainder of title (NR)
        $c  Remainder of title page transcription/statement of responsibility (NR)
MARC Data DTD fragment:
    <!ELEMENT mrcb245 - - ((mrcb245a | mrcb245b | mrcb245c | mrcb245d
        | mrcb245e | mrcb245f | mrcb245g | mrcb245h | mrcb245k | mrcb245n
        | mrcb245p | mrcb245s | mrcb2456)*) >
    <!ATTLIST mrcb245
        name       CDATA #FIXED "Title Statement"
        obsolete   CDATA #FIXED "no"
        repeatable CDATA #FIXED "no"
        i1         (i10 | i11 | i1fill) #REQUIRED
        i2         (i20 | i21 | i22 | i23 | i24 |
                    i25 | i26 | i27 | i28 | i29 |
                    i2fill) #REQUIRED >
    <!ELEMENT mrcb245a - - (#PCDATA) >
    <!ATTLIST mrcb245a
        name       CDATA #FIXED "Title"
        obsolete   CDATA #FIXED "no"
        repeatable CDATA #FIXED "no" >
    <!ELEMENT mrcb245b - - (#PCDATA) >
    <!ATTLIST mrcb245b
        name       CDATA #FIXED "Remainder of title"
        obsolete   CDATA #FIXED "no"
        repeatable CDATA #FIXED "no" >
    <!ELEMENT mrcb245c - - (#PCDATA) >
    <!ATTLIST mrcb245c
        name       CDATA #FIXED "Remainder of title page transcription..."
        obsolete   CDATA #FIXED "no"
        repeatable CDATA #FIXED "no" >
Marked up fragment of a record:
MARC:
    [245] 10$aSGML :$ban author's guide to the standard generalized markup language/$cMartin Bryan
SGML:
    <mrcb245 i1=1 i2=0><mrcb245-a>SGML : <mrcb245-b>an author's guide to the standard generalized markup language / <mrcb245-c>Martin Bryan
Conclusion
This work is an interesting development for bibliographic control,
driven by the revolution taking place in the accessibility of
electronic documents. Over the next year we expect to see several
experiments started and await the results to see how the DTD will
need to evolve. The digital library "movement" itself
is in its infancy and may go in any number of directions, but
the cataloging tools built by librarians, who have been maintaining
electronic catalogs for over 20 years, will still be a cornerstone
of the new mixed environment. Our formats and cataloging data
will be a major contribution to the digital environment.
References:
Larson, Ray, et al. "Cheshire II: Designing a Next-Generation
Online Catalog." Journal of the American Society for
Information Science, v.47, no.7, p.555-567.
Cole, Timothy W. and Kazmer, Michelle M. "SGML as a Component
of the Digital Library." Library Hi-Tech, v.13, no.4,
p.75-90.