NLM XML DTDs for Archiving and Publishing of Journal Articles
New XML DTD Describes Standard Content Model for Electronic Archiving and Publishing of Journal Articles
Bethesda, MD, USA. May 27, 2003.
Many journals today create their own XML/SGML or have the files created for them by content aggregators for online publishing and archiving purposes. "This XML is created with DTDs that were written to meet the needs of the online publishing world -- usually without much thought given to long-term archiving of the content," says Dr. David Lipman, Director of the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM). "Today we release two Document Type Definitions (DTDs) that will simplify journal publishing and increase the accuracy of the archiving and exchange of scholarly journal articles." The Journal Publishing DTD and the Archiving and Interchange DTD were both created from the Archiving and Interchange Tagset, a set of XML elements and attributes that can be used to define many other types of documents, including textbooks and online documentation.
NCBI created the Journal Publishing DTD to define a common format for the creation of journal content in XML. The advantages of a common format are portability, reusability, and the creation and use of standard tools. Although the Publishing DTD was created for electronic production, the structures are robust enough to support print publication as well.
Built using the same set of elements, the Archiving and Interchange DTD also defines journal articles, but it has a more open structure. Where the Publishing DTD defines the content and order of most of its elements (which eases content creation), the Archiving DTD is less strict about required elements and their order. The Archiving DTD defines a target content model for the conversion of any sensibly structured journal article and provides a common format in which publishers, aggregators, and archives can exchange journal content.
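The contrast between the two content models can be sketched in DTD syntax. The fragment below is only an illustration: the element names are hypothetical and are not copied from the NLM DTDs. It shows one declaration that fixes the presence and order of its children and one that leaves both open.

```xml
<!-- Hypothetical element names, for illustration only; not taken from the NLM DTDs. -->

<!-- Strict model, in the spirit of a publishing DTD:
     all three children are required and must appear in this order. -->
<!ELEMENT strict-article (front, body, back)>

<!-- Looser model, in the spirit of an archiving DTD:
     the same children may appear in any order, any number of times, or not at all. -->
<!ELEMENT loose-article (front | body | back)*>

<!-- Placeholder declarations so the fragment is complete. -->
<!ELEMENT front (#PCDATA)>
<!ELEMENT body  (#PCDATA)>
<!ELEMENT back  (#PCDATA)>
```

Any document that is valid against the strict declaration would also satisfy the loose one, which is why a looser model is a practical conversion target for content from many sources.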
The Tagset provides a set of XML modules that defines elements and attributes for describing the textual and graphical content of journal articles as well as some nonarticle material such as letters, editorials, and book reviews. The purpose of the Tagset is to preserve the intellectual content of journals independently of the form in which that content was originally created. The Tagset has been written as a set of XML DTD modules, each of which is a separate file. No module is a complete DTD by itself, but these modules can be combined to create any number of new DTDs.
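A modular DTD of this kind is typically assembled by a small driver file that pulls in each module through a parameter entity. The sketch below assumes hypothetical file names, entity names, and element names; it does not reproduce the actual Tagset modules.

```xml
<!-- Hypothetical file, entity, and element names, for illustration only. -->

<!-- Declare each module file as a parameter entity, then reference it
     so that its declarations are included at this point. -->
<!ENTITY % para.module SYSTEM "para-module.ent">
%para.module;

<!ENTITY % list.module SYSTEM "list-module.ent">
%list.module;

<!-- The driver declares a top-level element using what the modules define.
     Assumption: para-module.ent declares <p> and list-module.ent declares <list>. -->
<!ELEMENT article (p | list)+>
```

A different driver could reference the same module files and declare a different top-level element, which is how one set of modules can yield several DTDs.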
The Archiving and Publishing DTDs may be used as is, or the Tagset can be used to construct DTDs for authoring and archiving journal articles as well as DTDs for transferring journal articles from publishers to archives and between archives.
These DTDs and the Tagset are in the public domain. Any organization that wants to create its own DTD from the Tagset may do so without permission from NLM. Complete information and documentation can be found at http://dtd.nlm.nih.gov.
To keep the DTDs and the Tagset relevant to the publishing and archiving communities, NLM is creating the XML Interchange Structure Advisory Board. This board will advise NLM on changes and additions to the Tagset.
NCBI will encourage the use of the Publishing DTD to define the incoming data for PubMed Central (PMC) for journals that do not already have content in SGML or XML. PMC is NLM's digital archive of life sciences journal literature.
"We didn't start out to create a standardized archiving format for articles," says Jeff Beck of the NCBI. "We were starting a major revision to our PMC DTD at the same time that Inera was working on the 'E-Journal Archival DTD Feasibility Study' for the Harvard University E-Journal Archiving Project.
"The study concluded that a common format for archiving was possible, but that it hadn't been defined yet. We shared our revised DTD with Inera, and it seemed like we almost had it."
In April 2002, representatives from NCBI, Mulberry Technologies, Inc. (Rockville, MD), Inera, Inc. (Newton, MA), the Harvard University E-Journal Archiving Project, and the Mellon Foundation (supporting the Harvard project and Inera) met in Bethesda, MD, to discuss what changes the PMC DTD needed in order to reach the target of a common DTD format for archiving.
The conclusion was that a modular DTD library (the Tagset) should be created, from which archiving, interchange, and authoring (publishing) DTDs could then be built. Mulberry Technologies and Inera examined thousands of articles from hundreds of journals and dozens of journal DTDs to be sure that the content models being defined by the Tagset were comprehensive. After this extensive modeling exercise, the consultants worked with NCBI to create the Archiving and Interchange DTD as a general archiving DTD. NCBI and Mulberry then created the Journal Publishing DTD to help publishers who had not yet selected a format for their electronic content.
NLM is planning to create other DTDs from the Tagset, including one for textbooks and one for online documentation. Because all of these types of publications will be tagged using the same elements and attributes, publishing tools created for the Tagset will be applicable to all of these document types. This confluence of tagging models will greatly simplify the publication and archiving of content at the National Library of Medicine and in the journal publishing industry in general.
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health
Building 45, 5th Floor, Room 5an36A
45 Center Drive
Bethesda, Maryland 20892
Prepared by Robin Cover for The XML Cover Pages archive. See other details in the news story "NLM Releases XML Tagset and DTDs for Journal Publishing, Archiving, and Interchange."