This is the title slide of my presentation using XML (the
eXtensible Markup Language). It looks a
lot like HTML, doesn't it? I had
considered starting with some bad puns, such as "I may be missing the mark
today," or something about "being out of my element" but decided
I really shouldn't.
In May 1999, at the Medical Library Association's annual meeting
here in Chicago, I identified these frustrations, which we encounter regularly
at Lane Medical Library, in trying to cope with proliferating digital
resources. With burgeoning web
development, we felt that our 'library information' was under-utilized due to
its segregation from mainstream web resources, and in danger of becoming
marginalized. We are keenly aware of
users' reluctance to search multiple resources, and in medical libraries, that
puts our catalogs after Medline, fulltext databases, etc. Many integrated library systems, even those
which claim to be open, have limited interface flexibility, which discourages
re-utilization of catalog information in other web contexts.  Efforts to overcome such limitations often
involve redundant work, particularly in bibliographic control of web resources.
While the proprietary solutions of many library systems can be
faulted, more fundamentally, the MARC formats hamper bibliographic data's
effective integration in web interfaces.
Ever-inventive librarians have made progress in managing digital
material, but despite this, a de facto dual system exists. Lincoln's Biblical reference to "a
house divided" comes to mind.
We cannot long sustain dual systems for handling bibliographic
access to library resources, which differ only in format. Not only is it unaffordable, it is a
disservice to our users, who will increasingly ignore resources which are not
directly available in ordinary web interfaces.
Currently, segregated catalog information is less amenable to integrated
web access than other types of database information, now considered dark data
on the web.
Business interests recognize that users prefer to search a single
resource, and are working 'round the clock to prepare enticing information portals
complete with their "brands" of information. Libraries, however, have the unique
advantage of well-known and long-held values of impartiality, trust,
confidentiality, thoroughness, and lack of commercial interest, and are thus
well-positioned to add the technical infrastructure of XML to their arsenal,
far more easily than business interests can convince their customers that they
too share our values and high standards.
XML's growth is phenomenal, especially in e-commerce, science and
computing. The trend is clearly permeating
the library world as well. Questia
uses XML. I doubt that all medical librarians are aware that, as a part of
its modernization program, the National Library of Medicine chose XML as the
format for disseminating MEDLINE citation data. Much of this year is being spent converting
more than 11 million records, and beginning in 2001, XML will be the only
distribution format available.
Could Chicken Little be right?
Projections at Lane Library, based on foot traffic and reshelving data,
indicate a dramatic decline in print utilization, while digital resource
utilization shows an even more dramatic increase. At the rate digital materials are becoming available, we cannot
afford to be reactive, when the sky is falling, so to speak.
In May 1999, I identified MARC as the chief impediment to
effective integration of library
resources with web resources, and said that it needed to be replaced with XML
as soon as possible. This referred to
the development of a suite of XML Document Type Definitions, or schemas, for
the same data defined by MARC.
Redefinition provides an opportunity to review the structure and
organization of bibliographic data.
Trying to illuminate this position necessarily involves correlation of
XML, MARC and cataloging-- rather a daunting task for 30 minutes, so, forgive
me if I am ignorant (especially of scale), naive, or preach to the
converted. First, XML:
XML is becoming the de facto standard for representation of
content, optimized for web delivery.
Technically, it is a 1998 recommendation of the World Wide Web
Consortium. It is a metalanguage,
permitting the definition of an unlimited number of specific markup languages,
each of which may contain an unlimited number of tags, hence extensible. Most succinctly, you might think of it as
simplified SGML, using HTML syntax, with added web efficiencies. Please refer to my article in the current
NetConnect supplement of Library Journal for an introduction to its features
and potential for libraries.
The most significant aspect of XML may be its separation of
content, presentation, and linking, so that each may be handled optimally.  Its computer platform and software
application neutrality supports interoperability by design.  Of special interest to libraries, XML's
fixed character set is Unicode (and its extensions), which allows diacritics,
special characters, and non-Roman data to be handled like ordinary text. Due to Unicode and platform neutrality, XML
offers the greatest promise of data longevity (or future-proofing), as
hardware, software, and network protocols continue to change.  And XML provides for the unambiguous
identification of complex data structures that can be treated as objects, which is well
suited to bibliographic data.
Basic XML is deceptively simple.
Physically, entities allow components of a document, or record, to be
named and stored separately, permitting information reuse, and non-XML data
referencing, such as for images.
Logically, XML documents
consist of a hierarchy of named elements, which may be likened to
fields, with nested elements akin to subfields. Each instance of a document has a single root element, to which
the other elements are subordinate.
Container elements may contain text or other elements.
A DTD declares each of the permitted entities, elements, and
attributes, and the relationships among them, basically forming a template for
the logical structure of associated XML documents. It expresses the hierarchy and granularity of data, allowable
attribute values, and whether elements are optional, repeatable, etc.
Since a DTD defines a single namespace, a suite of DTDs can be
defined to permit elements from different DTDs to occur in one document.  Thus, a namespace guarantees that element
names are unique across the suite.
The separation of content, presentation, and linking, that I
mentioned, refers to XML's adjunct standards.
For display, XSL, the Extensible Stylesheet Language, permits the same data to
be displayed in as many different formats as there are stylesheets defined for
various purposes.  XLink accommodates
hypertext linking, but goes beyond simple hotlinking, by permitting a single
link to reference multiple related documents.
This diagram illustrates XML's structure, which can be
characterized as an inverted tree, with data values occupying the leaves (at
the bottom). Here, a name element
consists of personal name elements, which in turn consist of surname, forename,
and dates elements. Name and person
represent container elements, as they contain other elements. Conveying that Dr. Brodman is the main
entry, and Dr. Billings is the subject would be handled by XML attributes.
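As a minimal sketch of the tree just described, the following Python builds the name/person/surname/forename/dates hierarchy with the standard library. The element and attribute names ("name", "person", "role") and the forenames and dates are supplied here for illustration; they are assumptions, not the markup on the slide or Lane's DTDs.

```python
import xml.etree.ElementTree as ET

def build_name(role, surname, forename, dates):
    """Build a <name> container holding one <person> with three leaf elements."""
    # The role (main entry, subject, ...) is carried as an attribute,
    # not as part of the name data itself. The attribute name is hypothetical.
    name = ET.Element("name", {"role": role})
    person = ET.SubElement(name, "person")          # second container level
    ET.SubElement(person, "surname").text = surname  # leaves hold the data values
    ET.SubElement(person, "forename").text = forename
    ET.SubElement(person, "dates").text = dates
    return name

# Illustrative values only.
main_entry = build_name("main-entry", "Brodman", "Estelle", "1914-")
subject = build_name("subject", "Billings", "John Shaw", "1838-1913")

for element in (main_entry, subject):
    print(ET.tostring(element, encoding="unicode"))
```

The same person structure serves both names; only the attribute on the container changes, which is the point of keeping the role out of the data.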
This fragment shows how two names in an authority record might
appear in XML. Note that attributes are
embedded in an element's start tag. The
first version has a type attribute 'heading' and the second 'variant'. In XML, names of elements and attributes are
almost arbitrary, but start/end tags are required. The blue markup, with red element and attribute names, and black
data values, is conventional.
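To make the role of attributes concrete, here is a small sketch that parses a fragment of this general shape and reads the type attribute from each start tag. The fragment is a stand-in written for this example (the wrapper element "names" and the name forms are assumptions), not the actual record shown on the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical authority fragment: one heading and one variant form.
FRAGMENT = """
<names>
  <name type="heading"><surname>Billings</surname><forename>John Shaw</forename></name>
  <name type="variant"><surname>Billings</surname><forename>J. S.</forename></name>
</names>
"""

root = ET.fromstring(FRAGMENT.strip())

# The attribute lives in the start tag; the data lives in the child elements.
by_type = {name.get("type"): name.findtext("forename") for name in root.findall("name")}
print(by_type)
```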
In this example, the container element 'concept' has three
attributes (scheme, type, and level), and two subordinate elements, each with
an 'id' attribute.  The 'descriptor'
and 'qualifier' elements are treated as inheriting the attributes of the enclosing 'concept' element.
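A general-purpose XML parser reports only the attributes written on each element, so the inheritance described here would be applied by the software that reads the record. The sketch below shows one way that might be done; the attribute values, the ids, and the helper name effective_attributes are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: the values of scheme, type, level and id are placeholders.
RECORD = (
    '<concept scheme="example-scheme" type="topical" level="major">'
    '<descriptor id="d1">Libraries</descriptor>'
    '<qualifier id="q1">history</qualifier>'
    '</concept>'
)

def effective_attributes(parent, child):
    """Merge the container's attributes with the child's own; the child wins on conflict."""
    merged = dict(parent.attrib)
    merged.update(child.attrib)
    return merged

concept = ET.fromstring(RECORD)
descriptor = concept.find("descriptor")

print(descriptor.attrib)                          # only the attribute written on the element
print(effective_attributes(concept, descriptor))  # container attributes applied as well
```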
Now that I've eliminated all the mysteries of marking up content
in XML, we can combine a stylesheet,
represented here, with an XML document, to organize and characterize how our
content should be displayed on the web.
This full display of an enhanced authority record illustrates
applying the XSL-defined style to the XML-delineated content-- divide and conquer if you will.  This is not a mock-up; any other of Lane's
authority records could be substituted.
Although it's hard to believe, XML is not flawless. This list reflects what has come to my
attention. XML schemas, which use XML
syntax, and support data typing, could potentially eliminate the first two
problems.  The third refers to angle
brackets, ampersands, etc., which must be escaped.  Element names can't begin with numbers, and reaching agreement on
standard names could be a challenge.
Let me know if you're aware of a font for handling ALA characters in
Unicode.  Not limited to XML is the
difficulty of analyzing data and establishing logical relationships.
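Two of these points can be illustrated briefly: escaping reserved characters in data, and the rule that an element name cannot begin with a digit (which matters when MARC tags such as 700 are candidates for names). This is a sketch only; the "f" prefix and both helper functions are assumptions made for the example, and the name check is deliberately simplified to ASCII rather than the full XML name rules.

```python
import re
from xml.sax.saxutils import escape, unescape

# Simplified, ASCII-only approximation of a legal XML element name.
NAME_OK = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*\Z")

def is_simple_xml_name(name):
    """True if the name starts with a letter or underscore (simplified check)."""
    return bool(NAME_OK.match(name))

def element_name_for_tag(marc_tag, prefix="f"):
    """Prefix a numeric MARC tag so it can serve as an element name (hypothetical convention)."""
    return marc_tag if is_simple_xml_name(marc_tag) else prefix + marc_tag

title = "Diagnosis & treatment of <rare> disorders"
escaped = escape(title)   # ampersand and angle brackets become entity references
print(escaped)
print(unescape(escaped) == title)
print(element_name_for_tag("700"))
```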
Here are a few more issues.
Since XML is so flexible, standards are critical to maximize its
benefit. XML is verbose, but compactness
is less of an issue nowadays. Lane's
raw XML is ugly, because we paid little attention to order since stylesheets
cover appearance.  Potentially, an XML
editor for library data could include an interactive data entry
"stylesheet," mostly obviating the need to view raw XML.
Now just a few words about Lane's software. I must stress that it was developed as a
feasibility study-- and we believe it has shown that. To take advantage of XML's features, we necessarily had to look
at bibliographic structure as well.
Please be aware that our DTDs were developed expediently, in less than 6
weeks, covering many fields we don't use.
Currently, we are exploring indexing, search access, and presentation.
Our Java code converts MARC records to XML documents, one-way, utilizing
DTDs for bibliographic and authority data. Equivalence maps, also in XML,
provide flexibility. The DTDs
provided can be changed, or other ones substituted, as long as the correlation
with a map is maintained.  We intend
to improve the documentation, and
would appreciate learning whether anyone has found anything in MARC which cannot
be converted.  So far, we have about
330 licensees from 41 countries, but have heard little of what users are doing.
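The idea of an equivalence map, itself expressed in XML, that drives a one-way conversion can be sketched in a few lines. This is not the Lane software (which is written in Java) and does not reproduce its map format; the map layout, the element names, and the functions load_map and convert_field are assumptions made for the example, written in Python for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical equivalence map: MARC tag and subfield codes -> XML element names.
MAP_XML = """
<map>
  <field tag="245" element="title">
    <subfield code="a" element="main"/>
    <subfield code="b" element="remainder"/>
  </field>
</map>
"""

def load_map(map_xml):
    """Read the map into {tag: (element name, {subfield code: element name})}."""
    table = {}
    for field in ET.fromstring(map_xml.strip()).findall("field"):
        subs = {s.get("code"): s.get("element") for s in field.findall("subfield")}
        table[field.get("tag")] = (field.get("element"), subs)
    return table

def convert_field(tag, subfields, table):
    """Convert one parsed MARC field, given as (code, value) pairs, to an element."""
    element_name, sub_names = table[tag]
    element = ET.Element(element_name)
    for code, value in subfields:
        if code in sub_names:   # unmapped subfields are skipped in this sketch
            ET.SubElement(element, sub_names[code]).text = value
    return element

table = load_map(MAP_XML)
converted = convert_field("245", [("a", "Sample title :"), ("b", "a subtitle")], table)
print(ET.tostring(converted, encoding="unicode"))
```

Because the element names come from the map rather than the code, a different DTD can be targeted by editing the map alone, which is the flexibility described above.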
Well, that was the easy part.
Characterizing MARC in this venue requires some ginger. These qualities seem to best extol MARC for
me, although I had to ignore the multiple flavors, mixed success of format
integration, and a steep learning curve.
Like XML, MARC is extensible in that new formats, and local fields can
be defined. I hope I don't end up over
there on the right, in the doghouse after this!
MARC might be characterized as a big, old, rambling,
comfortable house.  It helped me to prepare for today by
thinking in terms of whether we should just remodel, or build anew.
Although MARC technically permits Unicode, I didn't try to assess
this. The length of time it took for
underscores and spacing tildes, occurring in URLs, to be supported in MARC,
"underscores" the need for Unicode, the character set of XML.
MARC's economy of expression stems from its origin in card
printing. This has led to
inflexibility, especially with gradual additions over the years. For example, in the fixed fields there isn't
room for a Y2K-compliant entry date.
There are, however, four different ways to enter dates, and it's curious
that create and update dates have different formats. The fixed fields also illustrate the difficulty in changing
overlapping values during format integration.
Here, I would also note the inter-mixing of content and control data.
Handling bibliographic format in the fixed fields also limits
MARC's flexibility. Repeating variable
field values is interesting to consider in contrast to format integration's
temporary solution of adding repeatable 006s, since many of these represent
form/genre. Further, pre-determined
fixed fields are disproportionately prominent, in that they must be coded
routinely, rather than only when needed.
A familiar feature of MARC is the provision of multiple areas to
record the same, or same type of data.
Often we can rely on coding language once, although there are three
places in addition to subfield 'l' in uniform titles. Form comes in on the extreme side with 7 possibilities. There are also cases of redundancy across
formats.
Perhaps the most exotic MARC complexity occurs when fixed-length
fields are embedded in subfields of variable length fields, I assume because the indicators were
already taken. (I learned recently that
more than two indicators were possible, and recalled having seen alpha values
in position three in RLIN.) XML's separation
of elements and their attributes, as well as treating display separately,
greatly simplifies addressing the situation that led to this coding.
Mixing elements and their properties makes information management
more difficult. In this example, tag
700, which actually resides in the directory, identifies a personal name element.
In the actual data area, the surname, forename and dates in the middle
can be interpreted as XML elements, while the indicators at the beginning,
and defamed subfield 'e' at the end would better be interpreted as attributes.
The 700 also conveys an attribute of the name, added entry. The delimiters themselves are equivalent to
markup. Incidentally, the comma before
subfield 'e' illustrates conditional punctuation, a problematic feature of ISBD.
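The reading of a 700 field suggested here can be sketched as follows: name parts become elements, while the tag's meaning, the first indicator, and subfield 'e' become attributes, and the conditional punctuation is dropped. The sample data is invented, and the element and attribute names ("entry", "form", "role") are assumptions; only the first-indicator values follow MARC 21.

```python
import xml.etree.ElementTree as ET

# MARC 21 first-indicator values for personal name fields.
NAME_FORMS = {"0": "forename", "1": "surname", "3": "family"}

def personal_name_from_700(indicators, subfields):
    """Turn an already-parsed 700 field into a <name> element (illustrative only)."""
    attrs = {"entry": "added"}                      # tag 700 itself means added entry
    form = NAME_FORMS.get(indicators[:1])
    if form:
        attrs["form"] = form                        # indicator becomes an attribute
    relator = subfields.get("e")
    if relator:
        attrs["role"] = relator.rstrip(".")         # subfield 'e' becomes an attribute
    name = ET.Element("name", attrs)
    person = ET.SubElement(name, "person")
    heading = subfields.get("a", "").rstrip(",")    # drop conditional punctuation
    surname, _, forename = heading.partition(", ")
    ET.SubElement(person, "surname").text = surname
    if forename:
        ET.SubElement(person, "forename").text = forename
    dates = subfields.get("d", "").rstrip(",")
    if dates:
        ET.SubElement(person, "dates").text = dates
    return name

# Invented sample field: 700 1_ $a Doe, Jane, $d 1950-, $e ed.
sample = personal_name_from_700("1 ", {"a": "Doe, Jane,", "d": "1950-,", "e": "ed."})
print(ET.tostring(sample, encoding="unicode"))
```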
Generally, the practice of using coded forms of information is no
longer justified in terms of saving space, or avoiding additional keying.  This false economy impedes searching and/or
adds overhead to presentation. Data
entry and user retrieval could be improved by inclusion of the abbreviated
versions as variants of the full forms in authority records.
MARC is flat, with limited support for hierarchy, whereas XML is
inherently hierarchical. MARC has some
provision for linkage, while XML's support is advanced and web-oriented. Field length limitations are mostly
system-specific, although I believe a government agency encountered truncation
of some geospatial data mapped into MARC.
I'm unsure of the status of a proposed solution for nonfiling indicator
inconsistency. Lastly, I'm perplexed by
subfielding differences referencing the same data.
Now, cataloging. It is
difficult for me to separate MARC issues from cataloging issues, although
I'm presently acutely aware that separate committees coordinate them. Cataloging was more difficult for me to approach.
It is very rule-intensive (one cataloger reminded me not to forget
that the LCRI and the NACO Participants' Manual are needed in addition to AACR2
and MARC21), and some simplification efforts strike me as doing less than
is desirable. I also wasn't sure what
to make of an 806-page explanation of the logical structure of the 677-page
AACR2, although from skimming the summary, it seemed a necessary undertaking
prior to AACR3.
Currently, I perceive an overemphasis on description (especially
considering the growing availability of fulltext), gradual improvement in access, and a decided underemphasis on
relationships.  I believe that access
is paramount, relationships critical, and description largely optional-- for
rare books or to provide otherwise lacking clarity. First, access and identity issues:
I thought about calling this slide "Do not strain for
consistency." It illustrates how
reliance on description can lead to inconsistency in access, and possibly
unnecessary work in establishing cross references.
Mixing languages in headings appears to me of dubious value, and
actually impedes the internationalization of cataloging rules. The 410 in this example tacitly recognizes
that phrase access under the operative term, is more important than access by
an initial generic word. Phone books
have long recognized this, hence, we don't find subordinate entries under
"Dept. of ...", regardless of usage.
Similar to my comments about MARC, bibliographic access is
complicated not only by the mixture of elements and attributes, but also by mixing
different elements in the same heading, and by unnecessary dispersion of the
"same" element throughout a record.
Three other subfields further complicate the picture.
Such pre-coordination permeates cataloging. Dealing with fundamental bibliographic
elements discretely, and explicitly coding which are actually present, can
result in greater clarity for user and cataloger alike. Permutation software could support options
for user-specified subarrangements.
Such joint headings parallel pairs of fields in works, essentially
establishing a dependency in order to identify the referenced work. This is
further exacerbated in cases of generic titles (which elementally reflect type
as well), in that the title is entirely dependent on a field, or a subfield of
a different element. It is interesting
to consider whether qualifiers in uniform titles (which are oddly not
subfielded in MARC) could be supplied automatically in XML.  The archives folks seem to understand the
value of titles standing alone.
I couldn't resist including the further complication of title
changes here. It seems that a rule for
when each treatment is appropriate
would be better than an option for using either, which can result in
overlapping identities for the same works.
Giving titles bibliographic precedence requires that they serve
as visual identifiers for the works represented, as well as visual anchors
for relationships among works. Because
identity can be problematic in cases of multiple manifestations and title
conflict, uniform title headings were developed. I realize that the parallelism illustrated here is not accidental.
However, pondering why series are treated as uniform titles led me
to consider whether all titles should reside in bibliographic records, in
the German manner, rather than in authority records. To represent a special class of fabricated
bibliographic records for some cases, I chose the term formal title.
To expand slightly, we should consider at what level such formal
title records are really necessary. It
seems that we have too many "empty" authority records representing
instances, which could be more economically handled by solo formal titles,
existing only in ordinary bibliographic records. The current approach also relies on the resulting co-occurrence
of titles in an index. This, and the
issue of versions leads nicely into relationships.
The simple separation of discrete bibliographic entities, and provision of links between them, is very
powerful. Cataloging is laden with
relationships not covered in linking entry fields. In XML, the related
record need not even reside in the same system to be directly linkable.
Far too often we, and our users, must tease out why various hits
occur in their result sets, by delving deeply into notes, added entries,
etc. RLIN's very powerful clustering
approach provides an accessible example, of how linked records can be retrieved
from terms occurring in any of the associated records. This same concept has been demonstrated with
linked XML records.
My dissection revealed three fundamental types of
relationships: bibliographic to
bibliographic; bibliographic to authority; and authority to authority. I
indicate a separate box for the relationship itself as a potential home for the
type, aspects, and duration of the relationship.
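One way to picture the separate box for the relationship is as a small record of its own, sitting between the two things it connects. The sketch below is an assumption about how that might look in code; the class, its field names, and the identifiers are hypothetical, and only the three kinds and the type/aspects/duration slots come from the slide.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# The three fundamental kinds of relationship named above.
KINDS = {"BIB-BIB", "BIB-AUT", "AUT-AUT"}

@dataclass
class Relationship:
    """A relationship held apart from the two records it links."""
    source_id: str
    target_id: str
    kind: str                                  # one of KINDS
    rel_type: str                              # e.g. "translator"
    aspects: List[str] = field(default_factory=list)
    duration: Optional[str] = None             # free-text span, if known

    def __post_init__(self):
        if self.kind not in KINDS:
            raise ValueError(f"unknown relationship kind: {self.kind}")

# Hypothetical identifiers for a bibliographic record and an authority record.
link = Relationship("bib:1", "aut:7", "BIB-AUT", "translator")
print(link)
```

Keeping type, aspects, and duration on the link means neither record has to be edited to say more about how the two are connected.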
Cataloging provides too many treatments of BIB-BIB relationships,
many of them implied, such as series linking analytics to a host record,
reference notes linking a work to a bibliography, and title as subject. Some systems, notably Ex Libris, compensate
for linking inadequacies to a degree.
XML's flexible linking features could handle the omitted link from a
serial to its analytics, a one to many relationship, by employing a single link
which includes a script.
Types of relationships have received much attention from Goossens,
Tillett, Leazer, and others.  Their
realization in fulltext referencing, notably by HighWire Press, makes their
underemphasis in cataloging even more noticeable.  Save equivalence, relationships seem to fit nicely into five
categories, making for a simple navigational tool.
BIB-AUT relationships are simplified by handling titles (and their
variance and relationships) separately.
By also ignoring description and control, a group of fundamental
bibliographic aspects remains. These follow
the pattern of BIB-BIB relationships in having potential types, aspects, and
duration. For example, a person may
serve as translator, organizations as sponsors, etc. Precedence comes into play, when considering a person as a
subject, which would better be handled as an attribute in XML. Further analysis is needed to optimize
access for each of these categories and deal with precedence issues.
I've included the quasi- "Word / Phrase" category
because despite the prevalence of this form of searching, particularly on the
web, it is not well-supported by our bibliographic apparatus.
These examples provide a hint of how thesaural relationships, and
variant forms, might be automatically included, or user-selectable, along with
links to dictionary entries, to enhance browse word searches. Time saved with a sleeker cataloging system
would permit more strategic allocation of effort. This needn't be an all manual task, as data files exist which
could be used to "seed" a smart word index. Here I refer to the Specialist Lexicon component of NLM's Unified
Medical Language System.
AUT-AUT relationships appear to follow the same pattern. While we can record a person to person
relationship such as Plato to Socrates, the nature of the relationship
(student) would be left to notes.
Further analysis is needed to consider issues such as whether French, a
language, is different from French Language, a topic, or whether it's a matter
of precedence.  Analyzing elements and
their relationships is an inherent part of creating any bibliographic
structure, but especially important in XML.
Type has only recently come into its own. Treating type (including form, genre, and
perhaps physical format) as one of the basic, repeatable elements, eliminates
the problem of bibliographic formats.
Monograph, serial and collection appear to be pure types, whereas the
others indicated here represent relationships, but could be treated
additionally as types. Recent proposals
relating to integrating resources, introduce further complexities in this area,
where the existing "collection" appears to cover websites and
databases adequately, as they too are types.
Hard-coding bibliographic types invites difficulties with inevitable
reinterpretations. Have
"Relationship Authorities" been considered?
Now, just a word on description.
Are these introductory author strings any less statements of
responsibility than trailing ones? I
couldn't locate an example I've seen where four authors introduce a title. Identity comes into play in description as
well as in access and relationships.
Authority records are underutilized. Unnecessary duplication could be eliminated, by including the
place of publication once in a publisher's authority record, and not in
bibliographic records. Problems such as
O'Reilly being coded as China, since Beijing now appears first alphabetically
on their US imprints, could be eliminated by allowing repeatable places of
publication in the authority record, furthering international accommodation as
well.
I was caught somewhat off-guard, when asked to back up my
statements about replacing MARC and restructuring bibliographic data. My dissection is undoubtedly inadequate, but
it led me to think of this analytical framework to use in trying to conceive a
simple, and more elegant bibliographic structure.
The equivalence I skipped over earlier, might be handled by
version extensions to the title to represent the rough equivalence of print,
microform and digital versions, for example, on the same record. Such extended titles could be included as
separate entries in a title index, and discrete bibliographic elements applying
to a specific version could be handled by making version an XML attribute. XSL stylesheets could display records
differently depending on the index entry chosen.
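A rough sketch of version extensions follows: one record carries print, microform, and digital details, each version-specific element is marked with a version attribute, and the title index gains one extended entry per version. The record, the element names, and both functions are invented for illustration. The selection step stands in for what a stylesheet might do and is plain Python, not XSL.

```python
import xml.etree.ElementTree as ET

# Invented record: elements without a version attribute apply to every version.
RECORD = """
<record>
  <title>Journal of Examples</title>
  <extent version="print">12 v.</extent>
  <extent version="microform">40 reels</extent>
  <location version="digital">http://example.org/joe</location>
</record>
"""

def title_index_entries(record):
    """Return the plain title plus one extended title per version present."""
    title = record.findtext("title")
    versions = sorted({e.get("version") for e in record.iter() if e.get("version")})
    return [title] + [f"{title} ({v})" for v in versions]

def elements_for_version(record, version):
    """Select the shared elements and those marked for the chosen version."""
    return [e for e in record if e.get("version") in (None, version)]

record = ET.fromstring(RECORD.strip())
print(title_index_entries(record))
print([e.tag for e in elements_for_version(record, "print")])
```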
Although I don't know what you think of this, to me it is not so
far removed from current practice, as it might seem on first inspection. While XML is a radically different structure
from MARC, restructuring cataloging takes advantage of this opportunity by
using XML's features to seek improvements, while preserving content and
previous effort.
XML-based library systems would put libraries in the Web mainstream,
and foster, rather than impede, our ability to provide new and improved user
services in this exciting environment.  This diagram is meant merely to suggest some of the routes and means
by which data might flow more freely, into and out of library systems.
XML offers librarians an enhanced role in the Information Age by allowing
our eXpertise to be more broadly applied than currently.
Based on my current understanding of XML, MARC and cataloging, I
hoped to provide some insight and synthesis of complex issues, and although I
may have included something to offend everyone, all was meant in the spirit of
inquiry.
Although I come here as an ordinary librarian, you may think I'm a
crackpot, especially if I were to discuss my concept of organic
bibliography. For that I'll refer you
to an upcoming issue of CCQ, which covers many of my leanings toward ideas
expressed here today.
I hope the third option isn't your reaction. With the world changing as rapidly as it is,
we cannot afford to pass up the opportunity to seriously consider XML's
potential in dealing with previously intractable bibliographic problems.
The suggestions I would offer are: do more analysis, build a model, and consider transitional
strategies. We also need to find a
speedier way of developing standards.
I would like to thank our Project staff, past and present,
especially Prisdha Dharma who wrote the XMLMARC software. If you like what I said, they deserve credit
as well; if you don't, please blame me alone.
Thank you.