Created: May 30, 2003.
NLM Releases XML Tagset and DTDs for Journal Publishing, Archiving, and Interchange.

An announcement from the US National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM) describes the release of a Tagset and two XML DTDs designed to "simplify journal publishing and increase the accuracy of the archiving and exchange of scholarly journal articles. The Journal Publishing DTD and the Archiving and Interchange DTD have been created from the Archiving and Interchange Tagset, a set of XML elements and attributes that can be used to define many other types of documents, including textbooks and online documentation. The Tagset provides a set of XML modules that defines elements and attributes for describing the textual and graphical content of journal articles as well as some nonarticle material such as letters, editorials, and book reviews. The purpose of the Tagset is to preserve the intellectual content of journals independently of the form in which that content was originally created. The Tagset has been written as a set of XML DTD modules, each of which is a separate file. No module is a complete DTD by itself, but these modules can be combined to create any number of new DTDs." The NLM Tagset represents an open specification: the DTDs and the Tagset are in the public domain so that any organization wishing to create its own DTD from the Tagset may do so without permission from NLM. NLM is forming an XML Interchange Structure Advisory Board to assist in development and maintenance of the Tagset. An Archiving and Interchange Tagset Secretariat will collect feedback and will physically maintain the files and documentation.

Overview of NLM Journal Archiving and Interchange DTD Tag Library

The intent of this DTD Suite is to 'preserve the intellectual content of journals independent of the form in which that content was originally delivered'. The tags defined here will be used to describe journal articles that originate with many publishers and societies but whose content will be stored in repositories, such as the NLM PubMed Central repository. Therefore, the Suite has been optimized for conversion from a variety of journal source DTDs, with the intent of providing a single format in which publishers can deliver their content to a wide range of archives. There are so many journal DTDs currently in use by publishers, repositories, content-aggregators, scientific societies, and compositors that this Suite cannot possibly incorporate all the variation to be found in such diverse models. But a wide variety of structures can be accommodated, because the content models for the elements have been made very flexible, including a wide range of elements with nearly all structures optional.

The conversion focus also means that this is a larger, more inclusive DTD than might have been necessary if the intent had been, for example, to create only a journal-authoring DTD. Many elements have been created explicitly so that information tagged by publishers would not be discarded when they converted material from another DTD to an archival interchange or repository DTD created from this Suite. Because of the broad scope of the several proposed electronic archives, this Suite contains elements and attributes that may occur in only a very few journals. Attribute values that a particular DTD would restrict to a list of options were declared as character data values so that all options could be accepted. Care has been taken to provide several mechanisms (frequently information classing attributes) to preserve the intellectual content of a document structure when that structure is converted from another DTD or schema to this one, even if there is no exact element equivalent of the structure.
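
The trade-off between an attribute restricted to a list of options and one declared to accept any character data can be sketched as follows. This is a minimal illustration in Python rather than DTD syntax, and the value list is hypothetical; none of the names below are taken from the Tagset.

```python
# Hypothetical illustration: none of these names come from the NLM Tagset.

# A source DTD might restrict an attribute to an enumerated list of options.
ENUMERATED_VALUES = {"book", "product"}


def accepts_enumerated(value: str) -> bool:
    """Mimic an enumerated attribute: only the listed values are valid."""
    return value in ENUMERATED_VALUES


def accepts_character_data(value: str) -> bool:
    """Mimic a character-data attribute: any string value is valid."""
    return isinstance(value, str)


# A value used by some other publisher's DTD (assumed for the example).
incoming = "software"
print(accepts_enumerated(incoming))      # False
print(accepts_character_data(incoming))  # True
```

The permissive declaration gives up validation of the value in exchange for never rejecting, and therefore never losing, a value that a source DTD allowed.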

Modular DTD Design: The Archiving and Interchange DTD Suite has been written as a set of XML DTD 'modules', each of which is a separate physical file. No module is an entire DTD by itself, but these modules can be combined into a number of different DTDs, for example, both an Archival and Interchange DTD and an Archival Repository DTD. Modules are primarily intended for maintenance; all the elements of the same 'type' (class) are stored together... There are many advantages to such a modular approach. The smaller units are written once, maintained in one place, and used in many different DTDs. This makes it much easier to keep lower-level structures consistent across document types, while allowing for any real differences that analysis identifies. A DTD for a new function (such as an authoring DTD) or a new publication type can be built quickly, because most of the necessary components will already be defined in the DTD Suite. Editorial and production personnel can bring the experience gained on one tagging project directly to the next with very little loss or retraining. Customized software (including authoring, typesetting, and electronic display tools) can be written once, shared among projects, and modified only for real distinctions... [from the Introduction]
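
The "written once, used in many different DTDs" idea can be sketched in a few lines. In DTD syntax, modules are normally pulled into a driver file through external parameter entity references; the Python below only imitates that by joining strings. The module names, the element declarations, and the `build_dtd` helper are all hypothetical and far simpler than anything in the Suite.

```python
from typing import Sequence

# Hypothetical module contents; real Tagset modules are separate files
# with their own names and many more declarations.
MODULES = {
    "para": "<!ELEMENT p (#PCDATA)>",
    "section": "<!ELEMENT sec (title, p*)>\n<!ELEMENT title (#PCDATA)>",
}


def build_dtd(root_declaration: str, module_names: Sequence[str]) -> str:
    """Combine a root declaration with the named shared modules."""
    parts = [root_declaration] + [MODULES[name] for name in module_names]
    return "\n".join(parts)


# Two different DTDs reuse the same paragraph module.
article_dtd = build_dtd("<!ELEMENT article (sec+)>", ["section", "para"])
letter_dtd = build_dtd("<!ELEMENT letter (p+)>", ["para"])

print(article_dtd)
print(letter_dtd)
```

Because both DTDs take the paragraph declaration from one place, a change to that module reaches every DTD built from it.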

Overview of NLM Journal Publishing Tag Library

The Journal Publishing DTD defines a document type for journal articles and some non-article journal material such as product and book reviews, editorials, and letters to the editor. The DTD was written to describe both the metadata for a journal article and the content of the article, but it can also describe just the article header metadata. This is a prescriptive DTD, optimized for the authoring and initial XML tagging of journal material. Although designed for biomedical journals, this DTD should be sufficiently general to describe not only STM journals but technical journals in any field.

The DTD was constructed using the modules of the Archiving and Interchange DTD Suite and has been modeled along the same philosophical lines as the Journal Archiving and Interchange DTD, which is a DTD for interchange and storage of journal material. However, because this is a publishing DTD optimized for the creation of new material, the DTD is far smaller (fewer elements, and fewer choices in many contexts) than was the full Journal Archiving and Interchange DTD. Where, in the interchange DTD, there may have been several ways to express the same information, only one way is provided for this publishing DTD. It was not the intention to limit the expressive power licensed by this DTD but rather to limit the meaningless choices that a full interchange DTD needs to make conversion from a wide variety of formats as easy as possible. The philosophy for the interchange DTD was to accept as many varied forms of many structures as possible. The philosophy of this DTD is to prefer a single structural form, or at least a single style of tagging. [from the Introduction]

From the Announcement

Built using the same [common] set of elements, the Archiving and Interchange DTD also defines journal articles, but it has a more open structure [than the Publishing DTD]. Where the Publishing DTD defines the content and order of most of its elements (which eases content creation), the Archiving DTD is less strict about required elements and their order. The Archiving DTD defines a target content model for the conversion of any sensibly structured journal article and provides a common format in which publishers, aggregators, and archives can exchange journal content.
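
The contrast between a model that fixes content and order and one that is less strict can be sketched by treating each content model as a regular expression over the sequence of child element names. This is only an illustration: the element names and both models are simplified assumptions, not the declarations used in either DTD.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical content models for the children of <article>, written as
# regular expressions over child names, each name followed by one space.
STRICT_MODEL = re.compile(r"front body (back )?")    # fixed order, front and body required
OPEN_MODEL = re.compile(r"((front|body|back) )*")    # any order, everything optional


def child_names(xml_text: str) -> str:
    """Return the root's child tag names, each followed by a space."""
    root = ET.fromstring(xml_text)
    return "".join(f"{child.tag} " for child in root)


def conforms(model: "re.Pattern[str]", xml_text: str) -> bool:
    """Check whether the whole child sequence matches the content model."""
    return model.fullmatch(child_names(xml_text)) is not None


# A converted article that arrives without a <front> element.
converted = "<article><body/><back/></article>"
print(conforms(STRICT_MODEL, converted))  # False
print(conforms(OPEN_MODEL, converted))    # True
```

A stricter model helps authors produce consistent documents, while the open model accepts whatever structure a conversion from another DTD happens to yield.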

The Archiving and Publishing DTDs may be used as is, or the Tagset can be used to construct DTDs for authoring and archiving journal articles, as well as DTDs for transferring journal articles from publishers to archives and between archives.

NCBI will encourage the use of the Publishing DTD to define the incoming data for PubMed Central (PMC) for journals that do not already have content in SGML or XML. PMC is NLM's digital archive of life sciences journal literature.

"We didn't start out to create a standardized archiving format for articles," says Jeff Beck of the NCBI. "We were starting a major revision to our PMC DTD at the same time that Inera was working on the E-Journal Archival DTD Feasibility Study' for the Harvard University E-Journal Archiving Project."

"The study concluded that a common format for archiving was possible, but that it hadn't been defined yet. We shared our revised DTD with Inera, and it seemed like we almost had it."

In April 2002, representatives from NCBI, Mulberry Technologies, Inc. (Rockville, MD), Inera, Inc. (Newton, MA), the Harvard University E-Journal Archiving Project, and the Mellon Foundation (supporting the Harvard project and Inera) met in Bethesda, MD to discuss what changes needed to be made to the PMC DTD to reach the target of the common DTD format for archiving.

The conclusion was that a modular DTD library (the Tagset) should be created, and archiving, interchange, and authoring (publishing) DTDs could be created from that. Mulberry Technologies and Inera examined thousands of articles from hundreds of journals and dozens of journal DTDs to be sure that the content models being defined by the Tagset were comprehensive. After this extensive modeling exercise, the consultants worked with NCBI to create the Archiving and Interchange DTD as a general archiving DTD. NCBI and Mulberry then created the Journal Publishing DTD to help publishers who had not yet selected a format for their electronic content.

NLM is planning to create other DTDs from the Tagset, including one for textbooks and one for online documentation. Because all of these types of publications will be tagged using the same elements and attributes, publishing tools created for the Tagset will be applicable to all of these document types. This confluence of tagging models will greatly simplify the publication and archiving of content at the National Library of Medicine and in the journal publishing industry in general.

