XML and MARC: a choice or replacement?

[Title Slide]

This is the title slide of my presentation using XML (the eXtensible Markup Language). It looks a lot like HTML, doesn't it? I had considered starting with some bad puns, such as "I may be missing the mark today," or something about "being out of my element" but decided I really shouldn't.

[Slide 2] [Frustrations]

In May 1999, at the Medical Library Association's annual meeting here in Chicago, I identified these frustrations, which we encounter regularly at Lane Medical Library in trying to cope with proliferating digital resources. With burgeoning web development, we felt that our 'library information' was under-utilized due to its segregation from mainstream web resources, and in danger of becoming marginalized. We are keenly aware of users' reluctance to search multiple resources, and in medical libraries, that puts our catalogs after Medline, fulltext databases, etc. Many integrated library systems, even those which claim to be open, have limited interface flexibility, which discourages re-utilization of catalog information in other web contexts. Efforts to overcome such limitations often involve redundant work, particularly in bibliographic control of web resources.

[Slide 3] [Bibliographic apartheid]

While the proprietary solutions of many library systems can be faulted, more fundamentally, the MARC formats hamper bibliographic data's effective integration in web interfaces. Ever inventive librarians have made progress in managing digital material, but despite this, a de facto dual system exists. Lincoln's Biblical reference to "a house divided" comes to mind.

[Slide 4] [Shift happens!]

We cannot long sustain dual systems for handling bibliographic access to library resources which differ only in format. Not only is it unaffordable, it is a disservice to our users, who will increasingly ignore resources which are not directly available in ordinary web interfaces. Currently, segregated catalog information is less amenable to integrated web access than other types of database information now considered dark data on the web.

Business interests recognize that users prefer to search a single resource, and are working 'round the clock to prepare enticing information portals complete with their "brands" of information. Libraries, however, have the unique advantage of well-known and long-held values of impartiality, trust, confidentiality, thoroughness, and lack of commercial interest. They are thus well-positioned to add the technical infrastructure of XML to their arsenal, far more easily than business interests can convince their customers that they too share our values and high standards.

XML's growth is phenomenal, especially in e-commerce, science and computing. The trend is clearly permeating the library world as well. Questia uses XML. I doubt that all medical librarians are aware that, as a part of its modernization program, the National Library of Medicine chose XML as the format for disseminating MEDLINE citation data. Much of this year is being spent converting more than 11 million records, and beginning in 2001, XML will be the only distribution format available.

Could Chicken Little be right? Projections at Lane Library, based on foot traffic and reshelving data, indicate a dramatic decline in print utilization, while digital resource utilization shows an even more dramatic increase. At the rate digital materials are becoming available, we cannot afford to be reactive, when the sky is falling, so to speak.

In May 1999, I identified MARC as the chief impediment to effective integration of library resources with web resources, and said that it needed to be replaced with XML as soon as possible. This referred to the development of a suite of XML Document Type Definitions, or schemas, for the same data defined by MARC. Redefinition provides an opportunity to review the structure and organization of bibliographic data. Trying to illuminate this position necessarily involves correlation of XML, MARC and cataloging-- rather a daunting task for 30 minutes, so, forgive me if I am ignorant (especially of scale), naive, or preach to the converted. First, XML:

[Slide 5] [XML: universal web data format]

XML is becoming the de facto standard for representation of content, optimized for web delivery. Technically, it is a 1998 recommendation of the World Wide Web Consortium. It is a metalanguage, permitting the definition of an unlimited number of specific markup languages, each of which may contain an unlimited number of tags, hence extensible. Most succinctly, you might think of it as simplified SGML, using HTML syntax, with added web efficiencies. Please refer to my article in the current NetConnect supplement of Library Journal for an introduction to its features and potential for libraries.

The most significant aspect of XML may be its separation of content, presentation, and linking, so that each may be handled optimally. Its computer platform and software application neutrality supports interoperability by design. Of special interest to libraries, XML's fixed character set is Unicode (and its extensions), which allows diacritics, special characters, and non-Roman data to be handled like ordinary text. Due to Unicode and platform neutrality, XML offers the greatest promise of data longevity (or future-proofing), as hardware, software, and network protocols continue to change. And XML provides for the unambiguous identification of complex data structures that can be treated as objects, well suited for bibliographic data.

[Slide 6] [XML in a nutshell]

Basic XML is deceptively simple. Physically, entities allow components of a document, or record, to be named and stored separately, permitting information reuse, and non-XML data referencing, such as for images.

Logically, XML documents consist of a hierarchy of named elements, which may be likened to fields, with nested elements akin to subfields. Each instance of a document has a single root element, to which the other elements are subordinate. Container elements may contain text or other elements.

A DTD declares each of the permitted entities, elements, and attributes, and the relationships among them, basically forming a template for the logical structure of associated XML documents. It expresses the hierarchy and granularity of data, allowable attribute values, and whether elements are optional, repeatable, etc.

Since a DTD defines a single namespace, a suite of DTDs can be defined to permit elements from different DTDs to occur in one document. Thus, a namespace guarantees that element names are unique across the suite.
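
As a rough illustration of the uniqueness point, here is a minimal sketch in Python, using the standard library's ElementTree parser. It assumes the W3C namespace mechanism, in which a prefix is bound to a URI; the URIs, prefixes, and element names are hypothetical and are not taken from Lane's DTDs.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs and element names, for illustration only.
DOC = (
    '<record xmlns:bib="http://example.org/bib" '
    'xmlns:aut="http://example.org/aut">'
    "<bib:date>2000</bib:date>"
    "<aut:date>1838-1913</aut:date>"
    "</record>"
)

root = ET.fromstring(DOC)

# ElementTree expands each prefix to its full URI, so the two "date"
# elements carry different qualified names and cannot be confused.
qualified_names = [child.tag for child in root]

# Look each one up by its own namespace.
ns = {"bib": "http://example.org/bib", "aut": "http://example.org/aut"}
bib_date = root.find("bib:date", ns).text
aut_date = root.find("aut:date", ns).text
```

Both elements are called "date" locally, yet each is retrieved separately because its full name includes the namespace it came from.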

The separation of content, presentation, and linking that I mentioned refers to XML's adjunct standards. For display, XSL, the Extensible Stylesheet Language, permits the same data to be displayed in as many different formats as stylesheets are defined for various purposes. XLink accommodates hypertext linking, but goes beyond simple hotlinking by permitting a single link to reference multiple related documents.

[Slide 7] [XML: hierarchical structure]

This diagram illustrates XML's structure, which can be characterized as an inverted tree, with data values occupying the leaves (at the bottom). Here, a name element consists of personal name elements, which in turn consist of surname, forename, and dates elements. Name and person represent container elements, as they contain other elements. Conveying that Dr. Brodman is the main entry, and Dr. Billings is the subject would be handled by XML attributes.
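
The inverted tree can be sketched in a few lines of Python. The slide itself is not reproduced in these notes, so the element names and the "role" attribute below are assumptions, and the forenames and dates are filled in only to give each leaf a value.

```python
import xml.etree.ElementTree as ET

# Element and attribute names are assumptions, not the slide's actual markup.
DOC = """<name>
  <person role="main">
    <surname>Brodman</surname>
    <forename>Estelle</forename>
    <dates>1914-</dates>
  </person>
  <person role="subject">
    <surname>Billings</surname>
    <forename>John Shaw</forename>
    <dates>1838-1913</dates>
  </person>
</name>"""


def leaves(element, path=""):
    """Yield (path, text) for every element that has no child elements."""
    here = f"{path}/{element.tag}"
    children = list(element)
    if not children:
        yield here, (element.text or "").strip()
    for child in children:
        yield from leaves(child, here)


root = ET.fromstring(DOC)

# Data values sit at the leaves; "name" and "person" are only containers.
leaf_values = list(leaves(root))

# The role of each person is carried by an attribute, not by an element.
roles = {p.findtext("surname"): p.get("role") for p in root.findall("person")}
```

Walking the tree from the root reaches the data values only at the bottom, while the main entry and subject roles are read from attributes on the container elements.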

[Slide 8] [XML fragment: name authority]

This fragment shows how two names in an authority record might appear in XML. Note that attributes are embedded in an element's start tag. The first version has a type attribute 'heading' and the second 'variant'. In XML, names of elements and attributes are almost arbitrary, but start/end tags are required. The blue markup, with red element and attribute names, and black data values, is conventional.
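
A small Python sketch of the same idea follows. The element name "name", the enclosing "authority" element, and the variant form of the heading are illustrative assumptions; only the 'heading' and 'variant' attribute values come from the slide as described.

```python
import xml.etree.ElementTree as ET

# Hypothetical authority fragment; the variant form is made up for illustration.
FRAGMENT = """<authority>
  <name type="heading">Billings, John Shaw</name>
  <name type="variant">Billings, J. S.</name>
</authority>"""

root = ET.fromstring(FRAGMENT)

# Attributes live in the start tag and are read separately from the content.
by_type = {n.get("type"): n.text for n in root.iter("name")}


def is_well_formed(text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Unlike much HTML in the wild, a missing end tag is an error in XML.
missing_end_tag = '<name type="heading">Billings, John Shaw'
```

Here `is_well_formed(FRAGMENT)` is true, while `is_well_formed(missing_end_tag)` is false, which is the sense in which start and end tags are required.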

[Slide 9] [XML fragment: topic]

In this example, the container element 'concept' has three attributes (scheme, type, and level), and two subordinate elements, each with an 'id' attribute. The 'descriptor' and 'qualifier' elements inherit the attributes of the 'concept' element.
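
One caution, with a sketch: a general-purpose parser such as Python's ElementTree reports only the attributes written on each element, so the inheritance described here is something the application or stylesheet supplies. The attribute values, the ids, and the helper `effective_attributes` below are hypothetical.

```python
import xml.etree.ElementTree as ET

# Attribute values and ids are placeholders, not real identifiers.
FRAGMENT = """<concept scheme="MeSH" type="topic" level="major">
  <descriptor id="D000000">Libraries, Medical</descriptor>
  <qualifier id="Q000000">history</qualifier>
</concept>"""


def effective_attributes(parent, child):
    """Combine a container's attributes with a child's own attributes.

    The child's own values win if the same attribute appears on both.
    """
    merged = dict(parent.attrib)
    merged.update(child.attrib)
    return merged


concept = ET.fromstring(FRAGMENT)
descriptor = concept.find("descriptor")

# What the parser reports for the descriptor on its own.
own = dict(descriptor.attrib)

# What the application treats as applying to the descriptor.
inherited = effective_attributes(concept, descriptor)
```

Stating scheme, type, and level once on the container avoids repeating them on every subordinate element.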

[Slide 10] [XSL]

Now that I've eliminated all the mysteries of marking up content in XML, we can combine a stylesheet, represented here, with an XML document, to organize and characterize how our content should be displayed on the web.

[Slide 11] [Full display]

This full display of an enhanced authority record illustrates applying the XSL-defined style to the XML-delineated content-- divide and conquer if you will. This is not a mock-up; any other of Lane's authority records could be substituted.
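
The divide-and-conquer idea can be sketched without XSL at all. In the Python below, two plain functions stand in for two stylesheets applied to the same content; this is not XSLT and not Lane's stylesheet, and the record and function names are assumptions.

```python
import xml.etree.ElementTree as ET

# The content: a hypothetical authority record with no display information.
RECORD = ET.fromstring(
    "<authority>"
    '<name type="heading">Billings, John Shaw</name>'
    '<name type="variant">Billings, J. S.</name>'
    "</authority>"
)


def brief_display(record):
    """Stand-in for a 'brief' stylesheet: show only the heading."""
    return record.find("name[@type='heading']").text


def full_display(record):
    """Stand-in for a 'full' stylesheet: label every form of the name."""
    lines = []
    for name in record.findall("name"):
        lines.append(f"{name.get('type').upper()}: {name.text}")
    return "\n".join(lines)


brief = brief_display(RECORD)
full = full_display(RECORD)
```

The record is never altered; only the presentation function changes, and any other record with the same structure could be passed in.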

[Slide 12] [XML: limitations?]

Although it's hard to believe, XML is not flawless. This list reflects what has come to my attention. XML schemas, which use XML syntax and support data typing, could potentially eliminate the first two problems. The third refers to angle brackets, ampersands, etc., which must be disguised. Element names can't begin with numbers, and reaching agreement on standard names could be a challenge. Let me know if you're aware of a font for handling ALA characters in Unicode. Not limited to XML is the difficulty of analyzing data and establishing logical relationships.
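
Two of these limitations are easy to demonstrate with Python's standard library. The sample string and the element names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, unescape

# Markup characters in data must be disguised before they go into XML.
raw = "Smith & Jones <2nd ed.>"
escaped = escape(raw)

# A parser restores the original characters when reading the element back.
round_trip = ET.fromstring(f"<title>{escaped}</title>").text


def parses(text):
    """Return True if the text is well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# A MARC tag cannot be used directly as an element name, since names may
# not begin with a digit; a letter prefix is one possible workaround.
numeric_name_ok = parses("<245>text</245>")
prefixed_name_ok = parses("<f245>text</f245>")
```

The escaped form is "Smith &amp;amp; Jones &amp;lt;2nd ed.&amp;gt;" as it would appear in the raw file, and the parser hands back the original string.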

[Slide 13] [XML: other issues?]

Here are a few more issues. Since XML is so flexible, standards are critical to maximize its benefit. XML is verbose, but compactness is less of an issue nowadays. Lane's raw XML is ugly, because we paid little attention to order since stylesheets cover appearance. Potentially, an XML editor for library data could include an interactive data entry "stylesheet," mostly obviating the need to view raw XML.

[Slide 14] [XMLMARC software]

Now just a few words about Lane's software. I must stress that it was developed as a feasibility study-- and we believe it has shown that. To take advantage of XML's features, we necessarily had to look at bibliographic structure as well. Please be aware that our DTDs were developed expediently, in less than 6 weeks, covering many fields we don't use. Currently, we are exploring indexing, search access, and presentation.

Our Java code converts MARC records to XML documents, one-way, utilizing DTDs for bibliographic and authority data. Equivalence maps, also in XML, provide flexibility. The DTDs provided can be changed, or other ones substituted, as long as the correlation with a map is maintained. We intend to improve the documentation, and would appreciate learning if anyone has found anything in MARC which cannot be converted. So far, we have about 330 licensees from 41 countries, but have heard little of what users are doing.
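
To give a feel for map-driven conversion, here is a minimal sketch in Python. It is not the XMLMARC software: the map is a Python dictionary rather than an XML file, the MARC record is a simplified list of tags and subfields, and the element names and sample values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical equivalence map: (MARC tag, subfield code) -> XML element name.
EQUIVALENCE_MAP = {
    ("245", "a"): "title",
    ("245", "b"): "subtitle",
    ("260", "b"): "publisher",
}


def marc_to_xml(fields, mapping, root_name="record"):
    """Convert simplified MARC fields to an XML element, one-way.

    fields is a list of (tag, [(subfield_code, value), ...]).
    Returns the element plus a list of (tag, code) pairs the map did not
    cover, so that unconverted data is reported rather than silently lost.
    """
    root = ET.Element(root_name)
    unmapped = []
    for tag, subfields in fields:
        for code, value in subfields:
            name = mapping.get((tag, code))
            if name is None:
                unmapped.append((tag, code))
                continue
            ET.SubElement(root, name).text = value
    return root, unmapped


# Placeholder record, not real bibliographic data.
SAMPLE = [
    ("245", [("a", "Sample title"), ("b", "a subtitle")]),
    ("260", [("b", "Sample Press")]),
    ("500", [("a", "A note the map does not cover")]),
]

record, unmapped = marc_to_xml(SAMPLE, EQUIVALENCE_MAP)
xml_text = ET.tostring(record, encoding="unicode")
```

Changing the target structure means editing the map, not the conversion code, which is the flexibility the equivalence maps are meant to provide.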

[Slide 15] [MARC]

Well, that was the easy part. Characterizing MARC in this venue requires some ginger. These qualities seem to best extol MARC for me, although I had to ignore the multiple flavors, mixed success of format integration, and a steep learning curve. Like XML, MARC is extensible in that new formats, and local fields can be defined. I hope I don't end up over there on the right, in the doghouse after this!

[Slide 16] [MARC]

MARC might be characterized as a big, old, rambling, comfortable house. In preparing for today, it helped me to think in terms of whether we should just remodel, or build anew.

[Slide 17] [MARC: character sets]

Although MARC technically permits Unicode, I didn't try to assess this. The length of time it took for underscores and spacing tildes, occurring in URLs, to be supported in MARC, "underscores" the need for Unicode, the character set of XML.

[Slide 18] [MARC: flexibility / accretion]

MARC's economy of expression stems from its origin in card printing. This has led to inflexibility, especially with gradual additions over the years. For example, in the fixed fields there isn't room for a Y2K-compliant entry date. There are, however, four different ways to enter dates, and it's curious that create and update dates have different formats. The fixed fields also illustrate the difficulty in changing overlapping values during format integration. Here, I would also note the inter-mixing of content and control data.

[Slide 19] [MARC: formats / repeatability]

Handling bibliographic format in the fixed fields also limits MARC's flexibility. Repeating variable field values is interesting to consider in contrast to format integration's temporary solution of adding repeatable 006s, since many of these represent form/genre. Further, pre-determined fixed fields are disproportionately prominent, in that they must be coded routinely, rather than only when needed.

[Slide 20] [MARC: redundancy]

A familiar feature of MARC is the provision of multiple areas to record the same, or same type of data. Often we can rely on coding language once, although there are three places in addition to subfield 'l' in uniform titles. Form comes in on the extreme side with 7 possibilities. There are also cases of redundancy across formats.

[Slide 21] [MARC: complexity]

Perhaps the most exotic MARC complexity occurs when fixed-length fields are embedded in subfields of variable length fields, I assume because the indicators were already taken. (I learned recently that more than two indicators were possible, and recalled having seen alpha values in position three in RLIN.) XML's separation of elements and their attributes, as well as treating display separately, greatly simplifies addressing the situation that led to this coding.

[Slide 22] [MARC: elements / attributes]

Mixing elements and their properties makes information management more difficult. In this example, tag 700, which actually resides in the directory, identifies a personal name element. In the actual data area, the surname, forename and dates in the middle can be interpreted as XML elements, while the indicators at the beginning, and defamed subfield 'e' at the end would better be interpreted as attributes. The 700 also conveys an attribute of the name, added entry. The delimiters themselves are equivalent to markup. Incidentally, the comma before subfield 'e' illustrates conditional punctuation, problematic of ISBD.
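
A sketch of that reinterpretation in Python follows. The field is supplied as a tag, an indicator string, and a list of subfields; the attribute names, the relator value, and the punctuation handling are assumptions made for illustration, and only the first-indicator values for a personal name follow MARC 21.

```python
import xml.etree.ElementTree as ET

# First indicator of a personal name field in MARC 21.
NAME_TYPE = {"0": "forename", "1": "surname", "3": "family"}


def personal_name_to_xml(tag, indicators, subfields):
    """Split a personal name field into XML elements and attributes.

    Properties of the name (indicator, added entry, relator) become
    attributes; the name data itself becomes child elements.
    """
    sub = dict(subfields)
    attributes = {"nametype": NAME_TYPE.get(indicators[0], "unknown")}
    if tag == "700":
        attributes["entry"] = "added"
    if "e" in sub:
        attributes["relator"] = sub["e"].strip()
    person = ET.Element("person", attributes)

    # Drop trailing punctuation, then split surname from forename.
    surname, _, forename = sub["a"].rstrip(" ,").partition(", ")
    ET.SubElement(person, "surname").text = surname
    if forename:
        ET.SubElement(person, "forename").text = forename
    if "d" in sub:
        # The comma before subfield 'e' is conditional punctuation, not data.
        ET.SubElement(person, "dates").text = sub["d"].rstrip(" ,")
    return person


person = personal_name_to_xml(
    "700",
    "1 ",
    [("a", "Brodman, Estelle,"), ("d", "1914-,"), ("e", "ed.")],
)
child_names = [child.tag for child in person]
```

Once the properties are attributes, the punctuation that existed only to separate subfields can be dropped from the data and regenerated at display time.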

[Slide 23] [MARC: encoded values]

Generally, the practice of using coded forms of information is no longer justified in terms of saving space or avoiding additional keying. This false economy impedes searching and/or adds overhead to presentation. Data entry and user retrieval could be improved by inclusion of the abbreviated versions as variants of the full forms in authority records.
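
As a minimal sketch of that last suggestion, the Python below treats three MARC language codes as variants of full-form headings. The record layout and the function names are hypothetical.

```python
# Hypothetical authority records: the full form is the heading, and the
# MARC language code is recorded as a variant.
LANGUAGE_AUTHORITIES = [
    {"heading": "English", "variants": ["eng"]},
    {"heading": "French", "variants": ["fre"]},
    {"heading": "German", "variants": ["ger"]},
]


def build_lookup(authorities):
    """Index every heading and variant, ignoring case, to its heading."""
    lookup = {}
    for record in authorities:
        for form in [record["heading"], *record["variants"]]:
            lookup[form.casefold()] = record["heading"]
    return lookup


LOOKUP = build_lookup(LANGUAGE_AUTHORITIES)


def resolve(term):
    """Return the full form for a code or a full form, or None if unknown."""
    return LOOKUP.get(term.strip().casefold())
```

A cataloger could still key "fre", and a user could search "French"; both resolve to the same heading, and the display needs no separate code table.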

[Slide 24] [MARC: miscellany]

MARC is flat, with limited support for hierarchy, whereas XML is inherently hierarchical. MARC has some provision for linkage, while XML's support is advanced and web-oriented. Field length limitations are mostly system-specific, although I believe a government agency encountered truncation of some geospatial data mapped into MARC. I'm unsure of the status of a proposed solution for nonfiling indicator inconsistency. Lastly, I'm perplexed by subfielding differences referencing the same data.

[Slide 25] [Cataloging / AACR2R]

Now, cataloging. It is difficult for me to separate MARC issues from cataloging issues, although I'm presently acutely aware that separate committees coordinate them. Cataloging was more difficult for me to approach. It is very rule-intensive (one cataloger reminded me to not forget that LCRI and the NACO participants manual are needed in addition to AACR2 and MARC21), and some simplification efforts strike me as doing less than is desirable. I also wasn't sure what to make of an 806-page explanation of the logical structure of the 677-page AACR2, although from skimming the summary, it seemed a necessary undertaking prior to AACR3.

[Slide 26] [Relative value / blurring]

Currently, I perceive an overemphasis on description, especially in considering the growing availability of fulltext, gradual improvement in access, and a decided underemphasis on relationships. I believe that access is paramount, relationships critical, and description largely optional-- for rare books or to provide otherwise lacking clarity. First, access and identity issues:

[Slide 27] [Access: consistency]

I thought about calling this slide "Do not strain for consistency." It illustrates how reliance on description can lead to inconsistency in access, and possibly unnecessary work in establishing cross references.

[Slide 28] [Access: language / sequence]

Mixing languages in headings appears to me of dubious value, and actually impedes the internationalization of cataloging rules. The 410 in this example tacitly recognizes that phrase access under the operative term, is more important than access by an initial generic word. Phone books have long recognized this, hence, we don't find subordinate entries under "Dept. of ...", regardless of usage.

[Slide 29] [Access: complexity / dispersion]

Similar to my comments about MARC, bibliographic access is complicated not only by the mixture of elements and attributes, but also by mixing different elements in the same heading, and by unnecessary dispersion of the "same" element throughout a record. Three other subfields further complicate the picture.

[Slide 30] [Access: pre-coordination]

Such pre-coordination permeates cataloging. Dealing with fundamental bibliographic elements discretely, and explicitly coding which are actually present, can result in greater clarity for user and cataloger alike. Permutation software could support options for user-specified subarrangements.

[Slide 31] [Identity: dependency]

Such joint headings parallel pairs of fields in works, essentially establishing a dependency in order to identify the referenced work. This is further exacerbated in cases of generic titles (which elementally reflect type as well), in that the title is entirely dependent on a field, or a subfield of a different element. It is interesting to consider whether qualifiers in uniform titles (which are oddly not subfielded in MARC) could be supplied automatically in XML. The archives folks seem to understand the value of titles standing alone.

[Slide 32] [Identity: change vs. variation]

I couldn't resist including the further complication of title changes here. It seems that a rule for when each treatment is appropriate would be better than an option for using either, which can result in overlapping identities for the same works.

[Slide 33] [Identity: optional formal title?]

Giving titles bibliographic precedence requires that they serve as visual identifiers for the works represented, as well as visual anchors for relationships among works. Because identity can be problematic in cases of multiple manifestations and title conflict, uniform title headings were developed. I realize that the parallelism illustrated here is not accidental. However, pondering why series are treated as uniform titles led me to consider whether all titles should reside in bibliographic records, in the German manner, rather than in authority records. To represent a special class of fabricated bibliographic records for some cases, I chose the term formal title.

[Slide 34] [Identity: hierarchy of formal related titles]

To expand slightly, we should consider at what level such formal title records are really necessary. It seems that we have too many "empty" authority records representing instances, which could be more economically handled by solo formal titles, existing only in ordinary bibliographic records. The current approach also relies on the resulting co-occurrence of titles in an index. This, and the issue of versions leads nicely into relationships.

[Slide 35] [Relationships: discreteness]

The simple separation of discrete bibliographic entities, and provision of links between them, is very powerful. Cataloging is laden with relationships not covered in linking entry fields. In XML, the related record need not even reside in the same system to be directly linkable.

Far too often we, and our users, must tease out why various hits occur in their result sets, by delving deeply into notes, added entries, etc. RLIN's very powerful clustering approach provides an accessible example, of how linked records can be retrieved from terms occurring in any of the associated records. This same concept has been demonstrated with linked XML records.
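
The clustering idea can be sketched briefly in Python. This is not RLIN's method, nor the XML demonstration mentioned above; it simply groups records that are connected by links and returns the whole group when a term occurs in any member. The record ids, terms, and links are made up.

```python
from collections import defaultdict


def build_clusters(record_ids, links):
    """Map each record id to the set of records reachable through links."""
    neighbours = defaultdict(set)
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    cluster_of = {}
    for start in record_ids:
        if start in cluster_of:
            continue
        # Walk outward from this record until no new linked records appear.
        members = set()
        stack = [start]
        while stack:
            rid = stack.pop()
            if rid in members:
                continue
            members.add(rid)
            stack.extend(neighbours[rid] - members)
        frozen = frozenset(members)
        for rid in members:
            cluster_of[rid] = frozen
    return cluster_of


def search(term, records, cluster_of):
    """Return every record in a cluster where any member contains the term."""
    hits = set()
    for rid, terms in records.items():
        if term in terms:
            hits |= cluster_of[rid]
    return hits


# Hypothetical records (id -> index terms) and one explicit link.
RECORDS = {
    "serial": {"journal", "medicine"},
    "analytic": {"cholera"},
    "unrelated": {"botany"},
}
LINKS = [("serial", "analytic")]

CLUSTERS = build_clusters(RECORDS, LINKS)
cholera_hits = search("cholera", RECORDS, CLUSTERS)
```

A search on "cholera" returns both the analytic and its host serial, and because the link is explicit, the reason for each hit is evident without delving into notes.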

[Slide 36] [Relationships: uniformity]

My dissection revealed three fundamental types of relationships: bibliographic to bibliographic; bibliographic to authority; and authority to authority. I indicate a separate box for the relationship itself as a potential home for the type, aspects, and duration of the relationship.

[Slide 37] [Relationships: BIB-BIB]

Cataloging provides too many treatments of BIB-BIB relationships, many of them implied, such as series linking analytics to a host record, reference notes linking a work to a bibliography, and title as subject. Some systems, notably Ex Libris, compensate for linking inadequacies to a degree. XML's flexible linking features could handle the omitted link from a serial to its analytics, a one to many relationship, by employing a single link which includes a script.

Types of relationships have received much attention from Goossens, Tillett, Leazer, and others. Their realization in fulltext referencing, notably by Highwire Press, makes their underemphasis in cataloging even more noticeable. Save equivalence, relationships seem to fit nicely into five categories, making for a simple navigational tool.

[Slide 38] [Relationships: BIB-AUT]

BIB-AUT relationships are simplified by handling titles (and their variance and relationships) separately. By also ignoring description and control, a group of fundamental bibliographic aspects remains. These follow the pattern of BIB-BIB relationships in having potential types, aspects, and duration. For example, a person may serve as translator, organizations as sponsors, etc. Precedence comes into play, when considering a person as a subject, which would better be handled as an attribute in XML. Further analysis is needed to optimize access for each of these categories and deal with precedence issues.

I've included the quasi- "Word / Phrase" category because despite the prevalence of this form of searching, particularly on the web, it is not well-supported by our bibliographic apparatus.

[Slide 39] [Relationships: BIB-AUT word / phrase]

These examples provide a hint of how thesaural relationships, and variant forms, might be automatically included, or user-selectable, along with links to dictionary entries, to enhance browse word searches. Time saved with a sleeker cataloging system would permit more strategic allocation of effort. This needn't be an all manual task, as data files exist which could be used to "seed" a smart word index. Here I refer to the Specialist Lexicon component of NLM's Unified Medical Language System.

[Slide 40] [Relationships: AUT-AUT]

AUT-AUT relationships appear to follow the same pattern. While we can record a person to person relationship such as Plato to Socrates, the nature of the relationship (student) would be left to notes. Further analysis is needed to consider issues such as whether French, a language, is different from French Language, a topic, or whether it's a matter of precedence. Analyzing elements and their relationships is an inherent part of creating any bibliographic structure, but especially important in XML.
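
The separate box for the relationship itself could be sketched as a small data structure. The Python below is only one possible shape; the field names are assumptions, and the Plato to Socrates example is the one just mentioned.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Relationship:
    """A relationship held as its own object rather than buried in a note."""

    source: str
    target: str
    category: str                  # "BIB-BIB", "BIB-AUT" or "AUT-AUT"
    rel_type: str                  # the nature of the relationship
    aspects: Tuple[str, ...] = ()  # optional further qualification
    duration: Optional[str] = None


RELATIONSHIPS = [
    Relationship("Plato", "Socrates", "AUT-AUT", "student of"),
]


def relationships_for(record_id, relationships):
    """Return every relationship in which the record takes part."""
    return [r for r in relationships if record_id in (r.source, r.target)]


def describe(relationship):
    """Render a relationship as a short navigational label."""
    return f"{relationship.source} [{relationship.rel_type}] {relationship.target}"


labels = [describe(r) for r in relationships_for("Socrates", RELATIONSHIPS)]
```

With the nature of the relationship recorded as data, it can be searched, displayed, or followed from either end instead of being left to a note.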

[Slide 41] [Relationships: bibliographic type]

Type has only recently come into its own. Treating type (including form, genre, and perhaps physical format) as one of the basic, repeatable elements, eliminates the problem of bibliographic formats. Monograph, serial and collection appear to be pure types, whereas the others indicated here represent relationships, but could be treated additionally as types. Recent proposals relating to integrating resources, introduce further complexities in this area, where the existing "collection" appears to cover websites and databases adequately, as they too are types. Hard-coding bibliographic types invites difficulties with inevitable reinterpretations. Have "Relationship Authorities" been considered?

[Slide 42] [Description: discreteness]

Now, just a word on description. Are these introductory author strings any less statements of responsibility than trailing ones? I couldn't locate an example I've seen where four authors introduce a title. Identity comes into play in description as well as in access and relationships.

[Slide 43] [Description: enhanced authorities]

Authority records are underutilized. Unnecessary duplication could be eliminated, by including the place of publication once in a publisher's authority record, and not in bibliographic records. Problems such as O'Reilly being coded as China, since Beijing now appears first alphabetically on their US imprints, could be eliminated by allowing repeatable places of publication in the authority record, furthering international accommodation as well.

[Slide 44] [Analytical framework]

I was caught somewhat off-guard when asked to back up my statements about replacing MARC and restructuring bibliographic data. My dissection is undoubtedly inadequate, but it led me to think of this analytical framework to use in trying to conceive a simpler, more elegant bibliographic structure.

The equivalence I skipped over earlier, might be handled by version extensions to the title to represent the rough equivalence of print, microform and digital versions, for example, on the same record. Such extended titles could be included as separate entries in a title index, and discrete bibliographic elements applying to a specific version could be handled by making version an XML attribute. XSL stylesheets could display records differently depending on the index entry chosen.

Although I don't know what you think of this, to me it is not so far removed from current practice, as it might seem on first inspection. While XML is a radically different structure from MARC, restructuring cataloging takes advantage of this opportunity by using XML's features to seek improvements, while preserving content and previous effort.

[Slide 45] [Integration / continuum]

XML-based library systems would put libraries in the Web mainstream, and foster, rather than impede, our ability to provide new and improved user services in this exciting environment. This diagram is meant to merely suggest some of the routes and means by which data might flow more freely, into and out of library systems. XML offers librarians an enhanced role in the Information Age by allowing our eXpertise to be more broadly applied than currently.

[Slide 46] [Strategic opportunity]

Based on my current understanding of XML, MARC and cataloging, I hoped to provide some insight and synthesis of complex issues, and although I may have included something to offend everyone, all was meant in the spirit of inquiry.

Although I come here as an ordinary librarian, you may think I'm a crackpot, especially if I were to discuss my concept of organic bibliography. For that I'll refer you to an upcoming issue of CCQ, which covers many of my leanings toward ideas expressed here today.

I hope the third option isn't your reaction. With the world changing as rapidly as it is, we cannot afford to pass up the opportunity to seriously consider XML's potential in dealing with previously intractable bibliographic problems.

The suggestions I would offer are: do more analysis, build a model, and consider transitional strategies. We also need to find a speedier way of developing standards.

[Slide 47] [Acknowledgements]

I would like to thank our Project staff, past and present, especially Prisdha Dharma who wrote the XMLMARC software. If you like what I said, they deserve credit as well; if you don't, please blame me alone.

Thank you.