The OASIS Cover Pages: The Online Resource for Markup Language Technologies
Last modified: May 30, 2003
Harvard University E-Journal Archive Project

[January 07, 2002] The Harvard University E-Journal Archive Project was initiated through an October 2000 request to the Andrew W. Mellon Foundation for funding to create a plan for the archiving of electronic journals. In May 2001, the Harvard University Library and three major publishers of scholarly journals (Blackwell Publishing; John Wiley & Sons, Inc.; the University of Chicago Press) announced an agreement to work together on a plan to develop an experimental archive for electronic journals. In December 2001, the Harvard E-Journal Archive project team published a Version 1.0 draft Submission Information Package (SIP) Specification which "defines acceptable data formats, file naming conventions, bibliographic and technical metadata, etc." The SIP Normative Data Formats are included in appendices A and B of this document, including several XML DTDs and Schemas. An E-Journal Archival DTD Feasibility Study commissioned by Harvard University has also been released. It recommends the creation of an XML DTD or Schema which "can be developed, allowing successful conversion of significant intellectual content from publisher SGML and XML files into a common format for archival purposes."

E-Journal Archive description from the Draft Submission Information Package (SIP) Specification [2001-12-19]:

The purpose of the Harvard University E-Journal Archive is to preserve the significant intellectual content of journals independent of the form in which that content was originally delivered in order to assure that this content will be available to the scholarly community for the indefinite future. Functionally, the archive is designed to render text and still images and other formats as practical with no significant loss in intellectual content. The archive reserves the right to freely manipulate the internal format of the manifestation over time as long as the plain meaning of the intellectual content is preserved.

The framework for discussing the architecture and operation of the archive is provided by the Open Archival Information System (OAIS) Reference Model. Under the OAIS model, material from a content provider is transmitted to the archive in a form called a Submission Information Package (SIP). The format of the SIP acceptable to the Harvard archive is described normatively by this specification... The archive Ingest function accepts the SIP and potentially transforms its contents into an internal form called an Archival Information Package (AIP) for long-term preservation.

[May 30, 2003] NLM Releases XML Tagset and DTDs for Journal Publishing, Archiving, and Interchange. An announcement from the US National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM) describes the release of a Tagset and two XML DTDs designed to "simplify journal publishing and increase the accuracy of the archiving and exchange of scholarly journal articles. In April 2002, representatives from NCBI, Mulberry Technologies, Inc., Inera, Inc., the Harvard University E-Journal Archiving Project, and the Mellon Foundation (supporting the Harvard project and Inera) met in Bethesda, MD to discuss what changes needed to be made to the PMC DTD to reach the target of the common DTD format for archiving. The [resulting] Journal Publishing DTD and the Archiving and Interchange DTD have been created from the Archiving and Interchange Tagset, a set of XML elements and attributes that can be used to define many other types of documents, including textbooks and online documentation. The Tagset provides a set of XML modules that defines elements and attributes for describing the textual and graphical content of journal articles as well as some nonarticle material such as letters, editorials, and book reviews. The purpose of the Tagset is to preserve the intellectual content of journals independently of the form in which that content was originally created. The Tagset has been written as a set of XML DTD modules, each of which is a separate file. No module is a complete DTD by itself, but these modules can be combined to create any number of new DTDs." The NLM Tagset represents an open specification: the DTDs and the Tagset are in the public domain so that any organization wishing to create its own DTD from the Tagset may do so without permission from NLM. NLM is forming an XML Interchange Structure Advisory Board to assist in development and maintenance of the Tagset. An Archiving and Interchange Tagset Secretariat will collect feedback and will physically maintain the files and documentation.

December 5, 2001 update from Marilyn Geller, Project Manager for the Harvard University E-Journal Archive: "Harvard has completed a first round of business meetings and technical meetings with our publisher-partners, Blackwell, John Wiley, and University of Chicago Press. We have also received a report from Inera, Inc. on the feasibility of developing a common archival article DTD... The project's technical team has met with each of the publishers regarding the principles of technical development and the specifications for ingesting content. The most significant technical development in the last few months has been the delivery of the Inera study on the feasibility of creating a common archival DTD that would allow the archive to receive material from all publishing partners tagged in the same manner. Ten publishers participated in this study by contributing their DTDs, documentation, and samples for review. The significant conclusions drawn from this study are that it is possible to create a common archival article DTD that would represent the intersection and the union of several existing publisher DTDs and that thorough documentation and quality assurance tools would be essential to insure that conversion is successful. Because this study has so much potential for resolving ingest, storage and delivery issues, it is being made available to the entire scholarly communications community. We are optimistic that this will encourage discussion and progress in the technical aspects of e-journal preservation... In the coming months, we hope to finalize the conceptual agreement with our publishing partners, document technical development, operations, and staffing of the archive, and refine the business model that will sustain this archive over time."

Project description from DLF:

Harvard is basing its archive on the architectural framework provided by the Open Archival Information System (OAIS) Reference Model. Under the OAIS model, material from a content producer is transmitted to the archive in a form called a Submission Information Package, or SIP. We have put together a tentative draft proposal for the technical specifications of the SIP that defines acceptable data formats, file naming conventions, bibliographic and technical metadata, and so forth. We are scheduling a round of meetings with technical representatives of our publishing partners to discuss and refine this proposal.

One of the key ideas we are exploring on the technical side is whether it is practical to design a common XML DTD that will reasonably represent the intellectual content of archival e-journal articles. Such a common DTD would simplify the work of gathering content from a variety of publishers using different DTDs. In this study, we have contracted with Inera because of their substantial background in this area and will look at the article DTDs being used by our publishing partners as well as a sampling of other DTDs representing large volumes of content and interesting elements. After determining the common elements of these DTDs, we hope to analyze the usefulness of this approach paying attention to what information is common to all DTDs and what information may be lost by using this common DTD. [description from August 31, 2001]
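The comparison described here, finding what is common to all DTDs and what might be lost, can be pictured as set arithmetic over element inventories. The following is only a minimal sketch of that idea, not the method used in the Inera study: the publisher names and element lists are hypothetical, and a real analysis would also have to compare content models and attributes, not just element names.

```python
# Hypothetical element inventories for three publisher DTDs.
# Real article DTDs declare far more elements than this.
publisher_elements = {
    "publisher_a": {"article", "title", "author", "abstract", "para", "figure"},
    "publisher_b": {"article", "title", "author", "para", "table", "footnote"},
    "publisher_c": {"article", "title", "author", "abstract", "para", "table"},
}


def common_and_at_risk(inventories):
    """Return (intersection, union, elements each publisher has outside the intersection)."""
    sets = list(inventories.values())
    common = set.intersection(*sets)   # elements every DTD declares
    union = set.union(*sets)           # elements any DTD declares
    # Elements a common DTD built only from the intersection could not represent.
    at_risk = {name: sorted(elems - common) for name, elems in inventories.items()}
    return common, union, at_risk


common, union, at_risk = common_and_at_risk(publisher_elements)
print("common to all:", sorted(common))
print("union:", sorted(union))
print("outside the intersection:", at_risk)
```

A common DTD limited to the intersection is easy to convert into but loses the elements listed as outside it; one built from the union keeps everything at the cost of a larger vocabulary.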

In the Harvard University draft Submission Information Package (SIP) Specification, the issue-level metadata file (issue-md.xml) is XML-encoded according to the METS XML schema. This file contains descriptive, administrative, and structural metadata related to the issue and all issue-level SIP components... "METS (Metadata Encoding and Transmission Standard) is an XML-formatted metadata framework for encoding descriptive, administrative, and structural metadata of digital library objects. It was developed as an initiative of the Digital Library Federation, and is built upon work previously performed for the DLF-funded Making of America II project coordinated at the University of California, Berkeley. The METS mechanisms for defining structural metadata and synchronization were derived in part from TEI and SMIL. METS is a metadata framework capturing structural relationships and providing containers for descriptive and administrative metadata encoded according to standards external to METS itself." See "Metadata Encoding and Transmission Standard (METS)."
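To make the three kinds of metadata concrete, here is a rough skeleton of a METS-style document built with Python's standard library. It is an illustrative sketch only: the namespace is the one published with the Library of Congress METS schema and is assumed, not confirmed, to match the draft SIP specification; the identifiers are hypothetical; and the output is not validated against any schema.

```python
import xml.etree.ElementTree as ET

# Namespace of the Library of Congress METS schema (assumption for the SIP draft).
METS_NS = "http://www.loc.gov/METS/"


def q(name):
    """Qualify a local element name with the METS namespace."""
    return f"{{{METS_NS}}}{name}"


def build_issue_metadata(issue_id):
    """Build a bare METS-style skeleton for an issue-md.xml file."""
    ET.register_namespace("mets", METS_NS)
    root = ET.Element(q("mets"), {"OBJID": issue_id})
    # Containers for descriptive and administrative metadata,
    # which METS expects to be encoded in external schemes.
    ET.SubElement(root, q("dmdSec"), {"ID": "DMD1"})
    ET.SubElement(root, q("amdSec"), {"ID": "AMD1"})
    # Structural metadata: one division standing for the issue,
    # pointing back at the two metadata sections by ID.
    struct_map = ET.SubElement(root, q("structMap"))
    ET.SubElement(struct_map, q("div"),
                  {"TYPE": "issue", "DMDID": "DMD1", "ADMID": "AMD1"})
    return root


def section_names(root):
    """Return the local names of the top-level sections, in document order."""
    return [child.tag.split("}", 1)[1] for child in root]


issue = build_issue_metadata("hypothetical-issue-001")
print(ET.tostring(issue, encoding="unicode"))
print(section_names(issue))
```

The structural map carries the relationships, while the descriptive and administrative sections are empty containers that a real SIP would fill with metadata in other schemas.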

SIP XML Schemas and DTDs: Appendix B of the draft SIP specification documents the SIP Normative Data Formats for representing content components within the SIP. XML 1.0 is identified for use in the Metadata, Issue and Item-Level Text, and Item-Level Linkage; beyond conformance to the XML 1.0 standard, certain SIP components must also conform to specific XML schemas. For example, (1) Structural Metadata is governed by the METS XML schema with namespace []; (2) Descriptive and Administrative Metadata use the EJAR schema with namespace []; (3) Issue-level data conforms to the EJAR-ISSUE schema with namespace []; (4) Item-level data conforms to the EJAR-ITEM schema with namespace []; (5) Item reference links use the EJAR-LINKS schema with namespace []. The document also specifies the use of W3C SVG and the DTD for MathML, Version 2.0.
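The pairing of SIP components with their governing schemas can be summarized as a lookup table. The sketch below takes the five schema names from the paragraph above; the component-kind keys, the function names, and the file name item-001.xml are hypothetical, and the namespaces are left out because they are not reproduced in this text.

```python
# Schema names come from the list above; the keys are illustrative labels.
GOVERNING_SCHEMA = {
    "structural-metadata": "METS",
    "descriptive-administrative-metadata": "EJAR",
    "issue-level-data": "EJAR-ISSUE",
    "item-level-data": "EJAR-ITEM",
    "item-reference-links": "EJAR-LINKS",
}


def schema_for(component_kind):
    """Return the name of the schema that governs a kind of SIP component."""
    try:
        return GOVERNING_SCHEMA[component_kind]
    except KeyError:
        raise ValueError(
            f"no governing schema known for {component_kind!r}"
        ) from None


def schemas_for_manifest(manifest):
    """Map each file name in a manifest (file name -> component kind) to its schema."""
    return {name: schema_for(kind) for name, kind in manifest.items()}


manifest = {
    "issue-md.xml": "structural-metadata",  # file name taken from the SIP draft
    "item-001.xml": "item-level-data",      # hypothetical file name
}
print(schemas_for_manifest(manifest))
```

An ingest tool could use such a table to decide which schema to validate each incoming file against, rejecting components whose kind it does not recognize.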


Robin Cover, Editor