Digital Library Federation (DLF) Report on Electronic Journal Archiving
"Harvard University Library: A Study of Electronic Journal Archiving"
From a Digital Library Federation (DLF) 'Summary of the Projects and their Progress'. See: "Harvard University Library: A Study of Electronic Journal Archiving."
The Project as Planned
Can a major research library arrange with multiple publishers to archive many of the varied journals and databases that it provides in its electronic gateway? The Harvard University Library's list of such resources exceeds 2,000. Harvard also receives paper copies, primarily for preservation, but the Library does not regard this costly duplication as sustainable. It will now plan an archive to preserve journals electronically, building on the infrastructure for the creation, storage, and delivery of digital library collections in which it has invested heavily over the past two years.
In its planning, Harvard will analyze a two-part question: which journals, and which components of them, will it archive? Answering will involve arranging with at least one journal publisher to provide a significant volume of material to test the scaling of the archive, working with that publisher (and possibly also with an actively publishing scholarly society) to develop a model for an archiving relationship, and selecting titles to archive from the list of journals that Harvard now acquires only in digital copies.
Harvard's plan includes drafting a policy on the components part of the question: will the archive contain only article texts from journals, or also their covers, ads, letters to the editor, book reviews, and digital links? The project also will investigate technical requirements for accession automation, archival formatting, on-going validation, bibliographic control, naming systems, access management, storage strategy, and output facilities.
The project will not negotiate archiving licenses at this stage, but will explore what publishers are willing to provide and under what arrangements. A major concern is cost: designing the archiving process to minimize marginal costs, developing a model for cost distribution, and exploring long-term options for financial support.
Project Developments as of 5 December 2001
Marilyn Geller, project manager of the Harvard project, provides the following report: Since the last update this summer, Harvard has completed a first round of business meetings and technical meetings with our publisher-partners, Blackwell, John Wiley, and University of Chicago Press. We have also received a report from Inera, Inc. on the feasibility of developing a common archival article DTD [document type definition].
Our business meetings have helped us refine the mission of the archive as a set of services and a logical organization for the preservation of significant intellectual content of the journal independent of the form in which that content was originally delivered. Substantive discussions have also taken place around the issue of the archive's stakeholders including researchers, authors, societies, publishers, and subscribers as represented by libraries. This stakeholder community, however it is organized, would have the opportunity to review and comment on policies and procedures for the development, administration, ongoing maintenance, and financing of the archive. Policies regarding access and financing of the archive continue to evolve.
The project's technical team has met with each of the publishers regarding the principles of technical development and the specifications for ingesting content. The most significant technical development in the last few months has been the delivery of the Inera study on the feasibility of creating a common archival DTD that would allow the archive to receive material from all publishing partners tagged in the same manner. Ten publishers participated in this study by contributing their DTDs, documentation, and samples for review. The significant conclusions drawn from this study are that it is possible to create a common archival article DTD that would represent the intersection and the union of several existing publisher DTDs, and that thorough documentation and quality assurance tools would be essential to ensure that conversion is successful. Because this study has so much potential for resolving ingest, storage, and delivery issues, it is being made available to the entire scholarly communications community. We are optimistic that this will encourage discussion and progress in the technical aspects of e-journal preservation.
In the coming months, we hope to finalize the conceptual agreement with our publishing partners, document technical development, operations, and staffing of the archive, and refine the business model that will sustain this archive over time.
Project Developments as of 31 August 2001
In the past few months, both the Steering Committee and the Technical Team of the Harvard E-Journal Archiving Project have made significant progress in refining their broad understanding of the research topic and exploring the detailed implications of this understanding. Project Manager Marilyn Geller reports that the project has selected three publishers as partners, Blackwell, University of Chicago Press, and Wiley, and has begun to discuss the business and technical models with them.
Discussions of the business model have centered on the nature of access to the archive; specifically, the project and the publisher partners are exploring who should have access to the archives, when, under what circumstances, and how. Initially, the project proposed three access "trigger" events: (1) when the content is no longer available on-line, (2) when the title ceases to be published, and (3) after a defined amount of time has passed. The third type of trigger event is the one generating comment and being refined.
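The three proposed trigger events can be read as a simple rule check on each archived title. The following Python sketch is illustrative only: the record fields, the function name, and the idea of expressing the "defined amount of time" as a number of years since deposit are assumptions made here, not part of the project's proposal, which leaves the third trigger still under discussion.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ArchivedTitle:
    """Hypothetical record for one archived journal title."""
    name: str
    deposited_on: date
    available_online: bool = True      # is the publisher still serving the content?
    ceased_on: Optional[date] = None   # date the title stopped publishing, if any


def triggered_events(title: ArchivedTitle, today: date, delay_years: int) -> List[str]:
    """Return the names of the access triggers that currently apply."""
    events = []
    # Trigger 1: the content is no longer available on-line.
    if not title.available_online:
        events.append("content_offline")
    # Trigger 2: the title has ceased to be published.
    if title.ceased_on is not None and title.ceased_on <= today:
        events.append("title_ceased")
    # Trigger 3: a defined amount of time has passed.
    # Measuring from the deposit date in 365-day years is an assumption.
    if (today - title.deposited_on).days >= delay_years * 365:
        events.append("time_elapsed")
    return events


title = ArchivedTitle("Example Journal", deposited_on=date(1996, 1, 1))
print(triggered_events(title, today=date(2001, 8, 31), delay_years=5))
```

Returning a list rather than a single yes/no keeps the triggers independent, so that a policy on who gains access under each one can be refined separately.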
The project also has delved into the issue of costs to understand what elements of the process of building and maintaining the archive are sensitive to size or quantity and how this might influence a model for sustainable financing of the archive. Project staff envision that both the number and kind of digital objects to be deposited will increase over time and may be difficult to estimate. To a lesser extent, the size of the archived content will have an effect on storage costs. Additionally, the cost of migrating formats will be dependent on the number of digital objects to be migrated, the frequency of migration, and the technology available to accomplish the migration.
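The cost sensitivities described above can be sketched as a small calculation: storage cost scales with the size of the content, while migration cost scales with the number of objects, the frequency of migration, and a per-object cost that stands in for the available technology. All names and figures below are hypothetical placeholders chosen for illustration; the project has not published a cost formula.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class CostAssumptions:
    """Hypothetical unit costs; none of these figures come from the project."""
    storage_per_gb_year: float    # cost of storing one gigabyte for one year
    migration_per_object: float   # depends on the migration technology available
    migrations_per_year: float    # e.g. 0.25 means one migration every four years


def annual_cost(num_objects: int, total_gb: float, a: CostAssumptions) -> Dict[str, float]:
    """Split the yearly cost into its size-driven and count-driven parts."""
    storage = total_gb * a.storage_per_gb_year
    migration = num_objects * a.migrations_per_year * a.migration_per_object
    return {"storage": storage, "migration": migration, "total": storage + migration}


assumptions = CostAssumptions(storage_per_gb_year=2.0,
                              migration_per_object=0.5,
                              migrations_per_year=0.25)
print(annual_cost(num_objects=1000, total_gb=50.0, a=assumptions))
```

Keeping the two components separate makes it easy to see which one dominates as the number and kind of deposited objects grow.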
Harvard is basing its archive on the architectural framework provided by the Open Archival Information System (OAIS) Reference Model. Under the OAIS model, material from a content producer is transmitted to the archive in a form called a Submission Information Package, or SIP. We have put together a tentative draft proposal for the technical specifications of the SIP that defines acceptable data formats, file naming conventions, bibliographic and technical metadata, and so forth. We are scheduling a round of meetings with technical representatives of our publishing partners to discuss and refine this proposal.
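To make the idea of a SIP specification concrete, the sketch below checks a submission against three kinds of rules named above: acceptable data formats, file naming conventions, and required metadata. The specific formats, the naming pattern, the metadata field names, and the dictionary layout of the package are all assumptions invented for this example; the draft specification itself is not reproduced here.

```python
import re
from typing import Dict, List

# All three rule sets below are hypothetical stand-ins for the draft SIP spec.
ACCEPTED_FORMATS = {"xml", "pdf", "tiff"}
FILENAME_PATTERN = re.compile(r"^[a-z0-9_-]+\.[a-z0-9]+$")
REQUIRED_METADATA = {"journal_title", "issn", "volume", "issue"}


def validate_sip(sip: Dict) -> List[str]:
    """Return a list of problems found in a submission; empty means it passes."""
    problems = []
    # Bibliographic and technical metadata: every required field must be present.
    missing = REQUIRED_METADATA - set(sip.get("metadata", {}))
    for field in sorted(missing):
        problems.append(f"missing metadata field: {field}")
    for name in sip.get("files", []):
        # File naming convention.
        if not FILENAME_PATTERN.match(name):
            problems.append(f"bad file name: {name}")
            continue
        # Acceptable data formats, judged here by file extension only.
        extension = name.rsplit(".", 1)[1]
        if extension not in ACCEPTED_FORMATS:
            problems.append(f"unaccepted format: {name}")
    return problems


sip = {
    "metadata": {"journal_title": "Example Journal", "issn": "0000-0000", "volume": "1"},
    "files": ["art_001.xml", "Fig 1.gif", "cover.bmp"],
}
for problem in validate_sip(sip):
    print(problem)
```

Collecting every problem instead of stopping at the first one suits a deposit workflow in which a publisher would want to correct a whole package in a single pass.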
One of the key ideas we are exploring on the technical side is whether it is practical to design a common XML DTD that will reasonably represent the intellectual content of archival e-journal articles. Such a common DTD would simplify the work of gathering content from a variety of publishers using different DTDs. For this study, we have contracted with Inera because of their substantial background in this area. The study will look at the article DTDs being used by our publishing partners as well as a sampling of other DTDs representing large volumes of content and interesting elements. After determining the common elements of these DTDs, we hope to analyze the usefulness of this approach, paying attention to what information is common to all DTDs and what information may be lost by using this common DTD.
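A first approximation of "common elements" and "information that may be lost" can be computed by comparing the element names each DTD declares. The Python sketch below is a deliberate simplification and not the method used in the study: it assumes that equivalent elements carry identical names across publishers, it ignores attributes, content models, and parameter entities, and the two sample DTDs are invented for illustration.

```python
import re
from typing import Dict, Set, Tuple


def element_names(dtd_text: str) -> Set[str]:
    """Collect the names declared by <!ELEMENT ...> in a DTD (simplified parse)."""
    return set(re.findall(r"<!ELEMENT\s+([^\s>]+)", dtd_text))


def compare_dtds(dtds: Dict[str, Set[str]]) -> Tuple[Set[str], Set[str], Dict[str, Set[str]]]:
    """Return (common elements, all elements, elements each publisher would lose).

    Expects at least one DTD; an empty dictionary raises TypeError.
    """
    common = set.intersection(*dtds.values())
    union = set.union(*dtds.values())
    # Elements a publisher uses that a strictly common DTD could not represent.
    lost = {publisher: names - common for publisher, names in dtds.items()}
    return common, union, lost


# Two invented publisher DTD fragments, for illustration only.
dtd_a = """
<!ELEMENT article (title, author+, abstract?, para+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT abstract (#PCDATA)>
<!ELEMENT para (#PCDATA)>
"""
dtd_b = """
<!ELEMENT article (title, author+, para+, footnote*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT para (#PCDATA)>
<!ELEMENT footnote (#PCDATA)>
"""

common, union, lost = compare_dtds({"A": element_names(dtd_a), "B": element_names(dtd_b)})
print(sorted(common))
print({publisher: sorted(names) for publisher, names in lost.items()})
```

A DTD built only from the intersection is the easiest to map into but drops the most information, while one built from the union preserves everything at the cost of many optional elements; the "lost" sets show where that trade-off bites for each publisher.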
Prepared by Robin Cover for The XML Cover Pages archive. See the source document and the news item of 2002-01-04: "Harvard University Library Feasibility Study Recommends XML DTD/Schema for E-Journal Archives."