Last modified: October 13, 2006
Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)

[August 23, 2000] The 'Open Archives Metadata Set' describes a collection of metadata elements used in the Santa Fe Convention. The Santa Fe Convention [Santa Fe, New Mexico, October 21-22, 1999], adapted for use by the Open Archives Initiative "presents a technical and organizational framework designed to facilitate the discovery of content stored in distributed e-print archives. It makes easy-to-implement technical recommendations for archives that will allow data from e-print archives to become widely available via its inclusion in a variety of end-user services such as search engines, recommendation services and systems for interlinking documents. The Open Archives Initiative aims to support archives, both those focused on e-prints (e.g., preprints and reprints, often connected with journals and conferences) and those representing a wide variety of other content types (e.g., theses and dissertations, Web log files, and educational resources). The emphasis has been on allowing harvesting of metadata that describes diverse "records" of content, stored in managed repositories. As of June 2000, there were six (6) conforming archives with content available for harvesting." XML is used as the transfer syntax for the Open Archives Metadata Set, per a consensus agreement in the Santa Fe meeting that the participants "would use a common syntax, XML, for representing and transporting both the OAMS and archive-specific metadata sets." The semantics of the Open Archives Metadata Set "has purposely been kept simple in the interest of easy creation and widest applicability. The expectation is that individual archives will maintain metadata with more expressive semantics and the Open Archives Dienst Subset provides the mechanism for retrieval of this richer metadata."

A number of different metadata formats are used by data providers in the context of the Open Archives Initiative. For example, the Virginia Tech Digital Library Research Laboratory "has undertaken to create an XML DTD to support wider distribution of MARC records within the Open Archives community." [The researchers now] "provide a freely available set of Java classes to handle translations between MARC tape format and OAi XML. The design provides two layers of classes: a MarcRecord class that can read and write both MARC tape format and the OAi MARC XML format, and a MarcDocument subclass that can provide additional translations, for instance to Open Archives Metadata Standard (OAMS) records and to pretty-printing HTML. As of 4-July-2000, the Java MarcRecord object can read and write both MARC tape format and OAi MARC XML. The program has been tested on over 4,000 MARC records, moving from tape format to XML and back to tape format without losing a character. The MarcDocument object can now produce short and long description in ASCII or ANSEL, long descriptions in HTML, and something approximating OAMS metadata records in the XML transport defined in the Santa Fe Convention." See description in "MARC as an Open Archives Metadata Standard." Similarly, an XML DTD has been prepared to represent the elements of the RFC 1807 Metadata Set (Format for Bibliographic Records). Another XML DTD has been constructed for the Dublin Core Metadata Set.

Why a 'common metadata format'? "Mapping among multiple metadata formats would place a considerable burden on service providers, who harvest the metadata and use it to build higher level services. While there is research work on creating services such as common search interfaces across heterogeneous metadata formats, a less burdensome and ultimately more deployable solution is to require repositories to map to a simple and common metadata format. The fifteen-element Dublin Core has over the past several years developed as a de facto standard for simple cross-discipline metadata and is thus the appropriate choice for a common metadata set. The metadata harvesting protocol supports the notion of parallel metadata sets, allowing communities to expose metadata in formats that are specific to their applications and domains. The technical framework places no limitations on the nature of such parallel sets, other than that the metadata records be structured as XML data, which have a corresponding XML schema for validation."
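
As an illustration of what "mapping to a simple and common metadata format" can look like on the repository side, the following minimal Python sketch converts a locally defined record into unqualified Dublin Core wrapped in the oai_dc container. The native field names (headline, author, abstract, issued) and the FIELD_MAP table are hypothetical and not taken from any archive described here; only the two namespace URIs and the fifteen element names come from the Dublin Core and OAI specifications.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("oai_dc", OAI_DC_NS)
ET.register_namespace("dc", DC_NS)

# The fifteen elements of unqualified Dublin Core.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)

# Hypothetical mapping from one archive's native fields to Dublin Core.
FIELD_MAP = {
    "headline": "title",
    "author": "creator",
    "abstract": "description",
    "issued": "date",
}


def to_oai_dc(native):
    """Build an oai_dc element from a dict of native fields.

    Native fields with no entry in FIELD_MAP are skipped; they would be
    exposed through a richer, parallel metadata format instead.
    """
    root = ET.Element("{%s}dc" % OAI_DC_NS)
    for field, value in native.items():
        target = FIELD_MAP.get(field)
        if target is None:
            continue
        if target not in DC_ELEMENTS:
            raise ValueError("not a Dublin Core element: %s" % target)
        values = value if isinstance(value, list) else [value]
        for item in values:
            ET.SubElement(root, "{%s}%s" % (DC_NS, target)).text = str(item)
    return root


record = to_oai_dc({
    "headline": "An e-print on metadata harvesting",
    "author": ["A. Author", "B. Author"],
    "shelf": "X-12",  # local-only field, dropped by the mapping
})
print(ET.tostring(record, encoding="unicode"))
```

Repeated native values become repeated Dublin Core elements, which unqualified Dublin Core permits for every element.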

  • [October 13, 2006] "Open Archives Initiative Announces Object Reuse and Exchange (ORE)." — The Open Archives Initiative (OAI), with the generous support of the Andrew W. Mellon Foundation, announces a new effort as part of its mission to develop and promote interoperability standards that aim to facilitate the efficient dissemination of content. Object Reuse and Exchange (ORE) will develop specifications that allow distributed repositories to exchange information about their constituent digital objects. These specifications will include approaches for representing digital objects and repository services that facilitate access and ingest of these representations. The specifications will enable a new generation of cross-repository services that leverage the intrinsic value of digital objects beyond the borders of hosting repositories. The goals of ORE are inspired by advances in scholarly communication and the growth of scholarly material that is available in scholarly repositories including institutional repositories, discipline-oriented repositories, dataset warehouses, and online journal repositories. This growth is significant by itself. However, its real importance lies in the potential for these distributed repositories and their contained objects to act as the foundation of a new digitally-based scholarly communication framework. Such a framework would permit fluid reuse, refactoring, and aggregation of scholarly digital objects and their constituent parts — including text, images, data, and software. This framework would include new forms of citation, allow the creation of virtual collections of objects regardless of their location, and facilitate new workflows that add value to scholarly objects by distributed registration, certification, peer review, and preservation services. Although scholarly communication is the motivating application, we imagine that the specifications developed by ORE may extend to other domains. 
ORE is funded by Mellon for two years beginning October 2006. It is coordinated by Carl Lagoze of Cornell University Information Science and Herbert Van de Sompel of the Los Alamos Research Library. The ORE two-year work plan includes: (1) Formation of an international advisory committee, consisting of leaders in e-Science, institutional repositories, publishing, library, and educational technology communities. (2) Formation of an international working group that will meet over the two year period and develop the set of ORE specifications. (3) Establishment and management of an experimental deployment community that will exercise the developed standards in a variety of contexts. (4) Establishment of a sustainable community to support the widespread deployment and management of the standards fabric. OAI-ORE will co-exist within the Open Archives Initiative with the Protocol for Metadata Harvesting (OAI-PMH), the widely deployed standard for exchange of metadata..."

  • [May 06, 2005] Open Archives Initiative Releases Specification for Conveying Rights Expressions. The Open Archives Initiative has published an Implementation Guideline specification for Conveying Rights Expressions About Metadata in the OAI-PMH Framework. This specification defines mechanisms for data providers to associate XML-based rights expressions with harvested metadata that is queried and delivered via service providers using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). The rights expressions indicate how data may be used, shared, and modified after it has been harvested. Detailed examples are provided in the specification for declaring rights using the Creative Commons and GNU licenses; however, the rights expression mechanism under the OAI-PMH data model is agnostic as to the particular rights expression language used by the data provider. The OAI Protocol for Metadata Harvesting (OAI-PMH) has now become the most widely adopted approach for publication of both "data" and "metadata" by online digital libraries and archive centers. A list of registered OAI conforming repositories (perhaps not current) identifies some 286 digital library projects using this federated database approach, and the OAIster Digital Library Production Service at the University of Michigan stores some 5,366,375 records of digital resources from 472 institutions. The essence of the open archives approach is "to enable access to Web-accessible material through interoperable repositories for (meta-)data sharing, publishing and archiving. OAI develops and promotes a low-barrier interoperability framework and associated standards based upon open protocols. In the OAI model, a data provider maintains one or more repositories (web servers) that support the OAI-PMH as a means of exposing metadata; a service provider issues OAI-PMH requests to data providers and uses the metadata as a basis for building value-added services."
According to the online tutorial, OAI-PMH provides "a simple technical option for data providers to make their metadata available to services, based on the open standards HTTP (Hypertext Transport Protocol) and XML (Extensible Markup Language). The metadata that is harvested may be in any format that is agreed by a community (or by any discrete set of data and service providers), although unqualified Dublin Core is specified to provide a basic level of interoperability. MARCXML, METS, and OLAC are also popular supported formats. The OAI-PMH protocol is based on HTTP with support for flow control. Request arguments are issued as GET or POST parameters. OAI-PMH supports six request types, known as verbs; responses are encoded in XML syntax. OAI-PMH supports any metadata format encoded in XML, and OAI-PMH defines a single XML Schema to validate responses to all OAI-PMH requests."
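
The request/response pattern described in the tutorial excerpt can be sketched in a few lines of Python. This is a minimal illustration rather than a harvester: the base URL repository.example.org and the sample response are made up, no network call is made, and error handling is omitted. The six verb names, the OAI-PMH 2.0 namespace, and the resumptionToken element used for flow control are those of the protocol specification.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# The six OAI-PMH request types ("verbs").
VERBS = ("Identify", "ListMetadataFormats", "ListSets",
         "ListIdentifiers", "ListRecords", "GetRecord")


def build_request(base_url, verb, args=None):
    """Return the HTTP GET URL for an OAI-PMH request."""
    if verb not in VERBS:
        raise ValueError("unknown OAI-PMH verb: %s" % verb)
    params = {"verb": verb}
    params.update(args or {})
    return base_url + "?" + urlencode(params)


def parse_list_identifiers(xml_text):
    """Extract identifiers and the flow-control token from a response."""
    root = ET.fromstring(xml_text)
    identifiers = [header.findtext(OAI_NS + "identifier")
                   for header in root.iter(OAI_NS + "header")]
    token = root.findtext(".//" + OAI_NS + "resumptionToken")
    # An absent or empty token means the list is complete.
    return identifiers, (token or None)


# Made-up response fragment, trimmed to the parts the parser reads.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:repository.example.org:1</identifier>
      <datestamp>2005-01-02</datestamp>
    </header>
    <header>
      <identifier>oai:repository.example.org:2</identifier>
      <datestamp>2005-01-03</datestamp>
    </header>
    <resumptionToken>abc123</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>
""".strip()

url = build_request("http://repository.example.org/oai", "ListIdentifiers",
                    {"metadataPrefix": "oai_dc", "from": "2005-01-01"})
identifiers, token = parse_list_identifiers(SAMPLE_RESPONSE)
```

A harvester would reissue the request with only the verb and the returned resumptionToken until no token comes back.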

  • [February 2005] "A Metadata Search Engine for Digital Language Archives." By Baden Hughes and Amol Kamat (Department of Computer Science and Software Engineering, University of Melbourne). From D-Lib Magazine Volume 11 Number 2 (February 2005). "In this article we describe the design and implementation of a full-featured metadata search engine within the Open Language Archives Community (OLAC). Unlike many digital library search engines, this particular implementation has a high degree of affinity with web search engines in terms of reasoning and results display, and presumes no knowledge of the underlying metadata or database structures on behalf of the user. The Open Language Archives Community (OLAC) is a consortium of linguistic data archives, consisting of 31 archives and a corresponding catalogue of more than 28,000 objects described by metadata. Derived from the model adopted within the OAI, the OLAC model has a two-tiered approach to implementation. Data providers are the institutional language archives that publish their XML-based metadata according to the OAI Static Repository standard. Individual archives use a variety of software to manage their catalogues internally. Service providers leverage the OAI Protocol for Metadata Harvesting to harvest the XML expressions of metadata catalogues. Within the OLAC community, typical practice is to aggregate these into an SQL database using the OLAC Harvester and Aggregator. Service providers can then build services that utilise the union catalogue of OLAC metadata. As a metadata community and a virtual digital library, OLAC has motivated a number of developments at the OAI level, notably the need for supporting static repositories, the development of virtual service providers, and personal metadata creation and management tools..." See also the OLAC web site.

  • [February 2005] "The Extensible Past: The Relevance of the XML Data Format for Access to Historical Datasets and a Strategy for Digital Preservation." By Annelies van Nispen, Rutger Kramer and René van Horik (Netherlands Institute for Scientific Information Services, Amsterdam, The Netherlands). From D-Lib Magazine Volume 11 Number 2 (February 2005). This article reports on the X-past project carried out by the Netherlands Historical Data Archive (NHDA). The main goal of the project has been to investigate how the XML data format can improve the durability of and access to historical datasets. The X-past project furthermore investigated whether it would be possible to provide access to historical datasets by means of the 'Open Archives Initiative - Protocol for Metadata Harvesting' (OAI-PMH). Within the framework of the X-past project a prototype information system has been developed and a number of users have been asked to report on usability issues concerning this system... Transparent access to research data is one of the main tasks of a data archive. The OAI-PMH is a very promising protocol for creating an open archive that can act as a solution for durable access to datasets. The X-past project incorporated the OAI-PMH for the dissemination of metadata from a repository of historical datasets. A Data Provider as well as a Service Provider is implemented in order to enable web access to this metadata repository. The advantages of using this approach are: (1) The storage of the repository (at the Data Provider side) is independent of the end-user interface (at the Service Provider). This way, changes in the storage structure or other re-factoring efforts on the side of the Data Provider will have no effect on the accessibility of the collections on the Service Provider side. (2) Third parties can also act as a Service Provider and harvest the X-past repository with minimal additional effort.
(3) The X-past Service Provider will also be able to disseminate collections from other related repositories that may be built in the future, offering the end user access to a whole range of research databases worldwide. In conclusion, the use of the OAI-PMH will make the implementation flexible, scalable, and easy to maintain and manage. Moreover, it will enable the NHDA to join future international initiatives for the interchange of research datasets. The flexibility of OAI-PMH will make decentralized and virtual data archiving possible..." See also the project description.

  • [December 2004] "Resource Harvesting within the OAI-PMH Framework." By Herbert Van de Sompel (Los Alamos National Laboratory, Research Library), Michael L. Nelson (Old Dominion University, Computer Science Department), Carl Lagoze (Cornell University, Computing and Information Science) and Simeon Warner (Cornell University). From D-Lib Magazine Volume 10 Number 12 (December 2004). "Motivated by preservation and resource discovery, we examine how digital resources, and not just metadata about resources, can be harvested using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). We review and critique existing techniques for identifying and gathering digital resources using metadata harvested through the OAI-PMH. We introduce an alternative solution that builds on the introduction of complex object formats that provide a more accurate way to describe digital resources. We argue that the use of complex object formats as OAI-PMH metadata formats results in a reliable and attractive approach for incremental harvesting of resources using the OAI-PMH... Such [complex] formats exist, and are generally referred to as complex object formats; examples include MPEG-21 DIDL, METS, and SCORM. The combination of these complex object formats with the OAI-PMH results in a framework that allows for reliable harvesting of digital resources... Complex object formats allow the unambiguous distinction between an identifier of the resource and the location of a resource, and as such alleviate the lack of expressiveness that Dublin Core provides with that respect. Also, the correct interpretation of the notion of the OAI-PMH datestamp to complex object representations yields a datestamp that changes whenever a constituent of the represented resource changes. The result is a reliable trigger for incremental harvesting of resources."
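
One way to read the datestamp rule described in this abstract is sketched below in Python: the datestamp of a complex object is taken to be the latest modification date of any of its constituents, so that a change to any part makes the whole object eligible for the next incremental harvest. The class and field names are hypothetical and are not drawn from MPEG-21 DIDL, METS, or the article itself; the identifier and location fields are kept separate to mirror the distinction the authors emphasize.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Constituent:
    location: str      # where the bitstream can be fetched
    modified: date     # last change to this constituent


@dataclass
class ComplexObject:
    identifier: str    # names the resource, independent of any location
    constituents: List[Constituent]

    @property
    def datestamp(self):
        """Latest change to any constituent (assumed interpretation)."""
        return max(part.modified for part in self.constituents)


def changed_since(objects, since):
    """Identifiers a harvester would receive for from=<since> (inclusive)."""
    return [obj.identifier for obj in objects if obj.datestamp >= since]


objects = [
    ComplexObject("info:example/report-1", [
        Constituent("http://repository.example.org/r1/text.pdf", date(2004, 3, 1)),
        Constituent("http://repository.example.org/r1/data.csv", date(2004, 11, 20)),
    ]),
    ComplexObject("info:example/report-2", [
        Constituent("http://repository.example.org/r2/text.pdf", date(2004, 2, 10)),
    ]),
]

# Only report-1 is returned: its dataset changed after the last harvest.
updated = changed_since(objects, date(2004, 6, 1))
```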

  • [December 2004] "A Repository of Metadata Crosswalks." By Carol Jean Godby, Jeffrey A. Young, and Eric Childress. From D-Lib Magazine Volume 10 Number 12 (December 2004). This paper proposes a model for metadata crosswalks that associates three pieces of information: the crosswalk, the source metadata standard, and the target metadata standard, each of which may have a machine-readable encoding and human-readable description. The crosswalks are encoded as METS records that are made available to a repository for processing by search engines, OAI harvesters, and custom-designed Web services. We define a data model that expresses a crosswalk not as a single file but as a complex object representing six pieces of information: the table of equivalences, the source metadata standard and the target metadata standard, each of which may have a machine-processable encoding and a human-readable description. We support the current interest in XML processing by creating XML-encoded metadata records for crosswalk objects, linking them to XML-encoded versions of the relevant standards as well as XSLT expressions of crosswalk tables, and making this data available in a repository that [...] Open Archives Initiative (OAI). Encoded in the XML schema defined by the METS sponsors, the crosswalk object produces a relatively simple but not a trivial record. We show an essential fragment, which depicts a crosswalk from MARC XML to an OAI encoding of Unqualified Dublin Core. See also METS.

  • [September 2004] "Experiences of Educators Using a Portal of Aggregated Metadata." By Sarah Shreeves and Christine Kirkham. From Journal of Digital Information Volume 5, Issue 3 (September 2004). "The University of Illinois at Urbana-Champaign Open Archives Initiative Metadata Harvesting Project sought to test the viability of a search portal containing aggregated metadata for cultural heritage resources harvested using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). Metadata was collected from 39 providers, including museums, archives, libraries, historical societies, consortiums, and digital libraries. Some resources existed in digital formats, such as .JPG images. Other resources were analog objects and were represented digitally through the metadata. The paper documents a pilot user test with a small group of K-12 teachers-in-training. The OAI Protocol for Metadata Harvesting (PMH) is now well established as an important tool for building aggregations of metadata from dispersed collections. The OAI-PMH is technically a 'low-barrier' protocol to implement that relies primarily on HTTP and XML, and has been particularly successful for: (1) sharing metadata describing resources not readily available to current Web search engines, such as those within databases or with non-HTML content; (2) allowing participation by content developers who may be unable to participate in other methods for federated searching, such as Z39.50, due to technical or other limitations..."

  • [June 2004] "The Multi-Faceted Use of the OAI-PMH in the LANL Repository." By Henry N. Jerez, Xiaoming Liu, Patrick Hochstenbach, and Herbert Van de Sompel (Los Alamos National Laboratory, Los Alamos, NM). In Proceedings of the 4th ACM/IEEE-CS Joint Conference on Digital Libraries [Tucson, AZ, USA; June 07-11, 2004]. "This paper focuses on the multifaceted use of the OAI-PMH in a repository architecture designed to store digital assets at the Research Library of the Los Alamos National Laboratory (LANL), and to make the stored assets available in a uniform way to various downstream applications. In the architecture, the MPEG-21 Digital Item Declaration Language is used as the XML-based format to represent complex digital objects. Upon ingestion, these objects are stored in a multitude of autonomous OAI-PMH repositories. An OAI-PMH compliant Repository Index keeps track of the creation and location of all those repositories, whereas an Identifier Resolver keeps track of the location of individual objects. An OAI-PMH Federator is introduced as a single-point-of-access to downstream harvesters. It hides the complexity of the environment to those harvesters, and allows them to obtain transformations of stored objects. While the proposed architecture is described in the context of the LANL library, the paper will also touch on its more general applicability... The question of more general applicability of the proposed architecture becomes harder to answer when loosely-structured or unmanaged federations are considered. Consider, for example, the collection of all public OAI-PMH repositories as a federation. This is not a federation of OAI-PMH repositories that expose complex objects, but rather one in which more regular metadata formats (such as DC and MARCXML) are supported. This does not really influence the nature of the architecture.
In this federation, an interesting parallel can be drawn between the Repository Index and the registries operated by the OAI and UIUC, as the latter list the baseURLs of all repositories in this loose federation. Also, the ERRoLs service capable of resolving oai-identifiers in a sense resembles the Identifier Resolver, although it uses business rules rather than data collected from individual repositories to resolve identifiers. This suggests that an OAI-PMH Federator might potentially be added to this federation as a single point of access to all public OAI-PMH repositories. However, since neither the synchronicity requirement nor the high uptime of repositories seems straightforward to implement in such a loosely-structured federation, further research would be required to determine the usability of the proposed solution in that realm."
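
The division of labour among the three components named in this abstract can be pictured with the toy Python sketch below. It is an in-memory illustration only, under the assumption that the Repository Index maps repository names to baseURLs and the Identifier Resolver maps object identifiers to repository names; the class and method names are invented here and do not describe the LANL implementation.

```python
from urllib.parse import quote


class RepositoryIndex:
    """Keeps track of the baseURL of each autonomous repository."""

    def __init__(self):
        self.base_urls = {}

    def register(self, name, base_url):
        self.base_urls[name] = base_url


class IdentifierResolver:
    """Keeps track of which repository holds each object."""

    def __init__(self):
        self.locations = {}

    def record(self, identifier, repository_name):
        self.locations[identifier] = repository_name

    def locate(self, identifier):
        return self.locations.get(identifier)


class Federator:
    """Single point of access that hides the individual repositories."""

    def __init__(self, index, resolver):
        self.index = index
        self.resolver = resolver

    def list_identifiers(self):
        return sorted(self.resolver.locations)

    def get_record_url(self, identifier, metadata_prefix="oai_dc"):
        name = self.resolver.locate(identifier)
        if name is None:
            return None
        return "%s?verb=GetRecord&metadataPrefix=%s&identifier=%s" % (
            self.index.base_urls[name], metadata_prefix,
            quote(identifier, safe=""))


index = RepositoryIndex()
index.register("repo-a", "http://a.example.org/oai")
index.register("repo-b", "http://b.example.org/oai")

resolver = IdentifierResolver()
resolver.record("oai:a.example.org:7", "repo-a")
resolver.record("oai:b.example.org:3", "repo-b")

federator = Federator(index, resolver)
```

A downstream harvester talks only to the federator; which repository answers a given GetRecord request is decided by the resolver and the index.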

  • [February 2004] "Using MPEG-21 DIP and NISO OpenURL for the Dynamic Dissemination of Complex Digital Objects in the Los Alamos National Laboratory Digital Library." By Jeroen Bekaert, Lyudmila Balakireva, Patrick Hochstenbach, and Herbert Van de Sompel (Los Alamos National Laboratory, Research Library). From D-Lib Magazine Volume 10 Number 2 (February 2004). "This paper focuses on the use of NISO OpenURL and MPEG-21 Digital Item Processing (DIP) to disseminate complex objects and their contained assets, in a repository architecture designed for the Research Library of the Los Alamos National Laboratory. In the architecture, the MPEG-21 Digital Item Declaration Language (DIDL) is used as the XML-based format to represent complex digital objects. Through an ingestion process, these objects are stored in a multitude of autonomous OAI-PMH repositories. An OAI-PMH compliant Repository Index keeps track of the creation and location of all those repositories, whereas an Identifier Resolver keeps track of the location of individual complex objects and contained assets. An MPEG-21 DIP Engine and an OpenURL Resolver facilitate the delivery of various disseminations of the stored objects..."

  • [December 2003] "Open Archives Data Service Prototype and Automated Subject Indexing Using D-Lib Archive Content As a Testbed." By Larry Mongin, Yueyu Fu, and Javed Mostafa (Indiana University School of Library and Information Science). From D-Lib Magazine Volume 9 Number 12 (December 2003). "The purpose of the new Indiana University School of Library and Information Science Information Processing Laboratory is to facilitate collaboration between scientists in the department in the areas of information retrieval (IR) and information visualization (IV) research. We are using the D-Lib Magazine archives as a dataset for our prototype; since March 1999, D-Lib has created an XML metadata file associated with each article published in the magazine. A harvester provides the means for collecting metadata from repositories: we wanted a harvester that would be easy to install and we wanted the harvester to be open source and implemented in Java. Among the many existing OAI-PMH harvesting tools, we chose OAIHarvester from OCLC. We developed an article browser with a search service using as data the D-Lib Magazine articles and metadata. The browser is now running on data in our OAI-PMH repository. The Apache Tomcat (an open source Java servlet engine) search servlet queries the database using SQL commands; results from user queries are written to an HTML file and returned to the user..."

  • [September 26, 2003] RoMEO and OAI-PMH Teams Develop Rights Solution Using ODRL and Creative Commons Licenses.   Project RoMEO (Rights Metadata for Open Archiving) has completed its first year of operation with funding from the Joint Information Systems Committee (JISC) and has published a rights solution report. A sixth interim Study and the Final Report describe an XML-based system for the expression of rights and permissions governing metadata and resources in institutional repositories. A principal goal of RoMEO, like that of the Creative Commons, is to neutralize the negative effects of (default) copyright law and controlling intermediaries in order to facilitate easy, open access to protected digital works. On this model, consumers do not need to ask permission for use of resources because permission in various forms has already been granted. The RoMEO Project team sought to develop an interoperable set of metadata elements and methods of incorporating the rights elements into document metadata processed by the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). The goal is to protect research papers and other digital resources in an open-access environment. The project team has developed an XML metadata notation using the Open Digital Rights Language (ODRL) and Creative Commons licenses for disclosure of the rights expressions under the OAI-PMH. The markup model covers both individual digital resources and collections of metadata records. A new 'OAI-RIGHTS' Technical Committee has been formed by members of the RoMEO and OAI project teams to further develop the proposals and to publish generic guidelines for disclosing rights expressions. Note: See now the specification: "Open Archives Initiative Releases Specification for Conveying Rights Expressions."

  • [July 15, 2003] "Using the OAI-PMH ... Differently." By Herbert Van de Sompel (Digital Library Research and Prototyping, Los Alamos National Laboratory), Jeffrey A. Young (OCLC Office of Research), and Thomas B. Hickey (OCLC Office of Research). In D-Lib Magazine Volume 9, Number 7/8 (July/August 2003). ISSN: 1082-9873. "The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) was created to facilitate discovery of distributed resources. The OAI-PMH achieves this by providing a simple, yet powerful framework for metadata harvesting. Harvesters can incrementally gather records contained in OAI-PMH repositories and use them to create services covering the content of several repositories. The OAI-PMH has been widely accepted, and until recently, it has mainly been applied to make Dublin Core metadata about scholarly objects contained in distributed repositories searchable through a single user interface. This article describes innovative applications of the OAI-PMH that we have introduced in recent projects. In these projects, OAI-PMH concepts such as resource and metadata format have been interpreted in novel ways. The result of doing so illustrates the usefulness of the OAI-PMH beyond the typical resource discovery using Dublin Core metadata. Also, through the inclusion of XSL stylesheets in protocol responses, OAI-PMH repositories have been directly overlaid with an interface that allows users to navigate the contained metadata by means of a Web browser. In addition, through the introduction of PURL partial redirects, complex OAI-PMH protocol requests have been turned into simple URIs that can more easily be published and used in downstream applications... Through the creative interpretation of the OAI-PMH notions of resource and metadata format, repositories with rather unconventional content, such as Digital Library usage logs, can be deployed. 
These applications further strengthen the suggestion that the OAI-PMH can effectively be used as a mechanism to maintain state in distributed systems. [We] show that simple user interfaces can be implemented by the mere use of OAI-PMH requests and responses that include stylesheet references. For certain applications, such as the OpenURL Registry, the interfaces that can be created in this manner seem to be quite adequate, and hence the proposed approach is attractive if only because of the simplicity of its implementation. The availability of an increasing amount of records in OAI-PMH repositories generates the need to be able to reference such records in downstream applications, through URIs that are simpler to publish and use than the OAI-PMH HTTP GET requests used to harvest them from repositories. This article shows that PURL partial redirects can be used to that end..."
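
The two techniques mentioned in this article can be approximated in Python as follows. The sketch assumes that a partial redirect works by matching a registered path prefix and appending the remainder of the requested path to a target URL, and that a stylesheet is attached with a standard xml-stylesheet processing instruction. The prefix /oai/dc/, the host repository.example.org, and the stylesheet path are invented for illustration, and percent-encoding the identifier is a choice made in this sketch.

```python
from urllib.parse import quote

# Hypothetical partial-redirect table: path prefix -> target URL prefix.
PARTIAL_REDIRECTS = {
    "/oai/dc/": ("http://repository.example.org/oai"
                 "?verb=GetRecord&metadataPrefix=oai_dc&identifier="),
}


def resolve(path):
    """Turn a short, publishable path into a full OAI-PMH GET request."""
    for prefix, target in PARTIAL_REDIRECTS.items():
        if path.startswith(prefix):
            remainder = path[len(prefix):]
            return target + quote(remainder, safe="")
    return None


def with_stylesheet(xml_body, href):
    """Prefix a response with an XSL stylesheet reference for browsers."""
    return '<?xml-stylesheet type="text/xsl" href="%s"?>\n%s' % (href, xml_body)


short_uri = "/oai/dc/oai:repository.example.org:42"
full_request = resolve(short_uri)
styled = with_stylesheet("<OAI-PMH/>", "/oai2.xsl")
```

The short path is what would be cited or linked downstream, while the full request remains an ordinary OAI-PMH GetRecord call.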

  • [February 14, 2003] "Exposing Information Resources for E-Learning. Harvesting and Searching IMS Metadata Using the OAI Protocol for Metadata Harvesting and Z39.50." By Andy Powell (UKOLN, University of Bath) and Steven Richardson (UMIST). In Ariadne Issue 34 (January 14, 2003). "IMS is a global consortium that develops open specifications to support the delivery of e-learning through Learning Management Systems (LMS) -- or 'Virtual Learning Environment (VLE)' as used in the UK. IMS activities cover a broad range of areas including accessibility, competency definitions, content packaging, digital repositories, integration with 'enterprise' systems, learner information, metadata, question & test and simple sequencing. Of particular relevance to this article is the work of the IMS Digital Repositories Working Group (DRWG). The DRWG is working to define a set of interfaces to repositories (databases) of learning objects and/or information resources in order to support resource discovery from within an LMS. In particular, the specifications currently define mechanisms that support distributed searching of remote repositories, harvesting metadata from repositories, depositing content with repositories and delivery of content from the repository to remote systems. Future versions of the specifications will also consider alerting mechanisms, for discovering new resources that have been added to repositories. Note that, at the time of writing, the DRWG specifications are in draft form. Two broad classes of repository are considered: (1) Native learning object repositories containing learning objects; (2) Information repositories containing information resources (documents, images, videos, sounds, datasets, etc.). In the former, it is assumed that, typically, the learning objects are described using the IMS metadata specification and packaged using the IMS content packaging specification. 
The latter includes many existing sources of information including library OPACs, bibliographic databases and museum catalogues where metadata schemas other than IMS are in use. In both cases it is assumed that the repository may hold both assets and metadata or just metadata only. Both the example implementations described below fall into the second category of repository. The DRWG specifications describe the use of XQuery over SOAP to query 'native' repositories of learning objects. This usage is not discussed any further in this article. The specifications also describe how to search and harvest IMS metadata from 'information' repositories using the OAI Protocol for Metadata Harvesting (OAI-PMH) and Z39.50..."

  • [December 16, 2002] "A Quantitative Analysis of Dublin Core Metadata Element Set (DCMES) Usage in Data Providers Registered with the Open Archives Initiative (OAI)." By Jewel Ward (Graduate Student, The School of Information and Library Science, Univ of North Carolina at Chapel Hill; WWW). November, 2002. 68 pages. A Master's paper for the MS degree in I.S. Abstract: "This research describes an empirical study of how the Dublin Core Metadata Element Set (DCMES) is used by 100 Data Providers (DPs) registered with the Open Archives Initiative (OAI). The research was conducted to determine whether or not the DCMES is used to its full capabilities. Eighty-two of 100 DPs have metadata records available for analysis. DCMES usage varies by type of DP. The average number of Dublin Core elements per record is eight, with an average of 91,785 Dublin Core elements used per DP. Five of the 15 elements of the DCMES are used 71% of the time. The results show the DCMES is not used to its fullest extent within DPs registered with OAI..."
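
The kind of measurement reported in this abstract (elements per record, and the share of usage taken by the most frequent elements) can be reproduced on harvested oai_dc records with a short Python sketch. The two sample records are invented and the function names are hypothetical; the study's own tooling and method are not described in the abstract.

```python
from collections import Counter
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"


def element_usage(records_xml):
    """Count Dublin Core elements across records; return (counts, average)."""
    counts = Counter()
    n_records = 0
    for xml_text in records_xml:
        n_records += 1
        for element in ET.fromstring(xml_text).iter():
            if element.tag.startswith(DC):
                counts[element.tag[len(DC):]] += 1
    total = sum(counts.values())
    average = total / n_records if n_records else 0.0
    return counts, average


def top_share(counts, k=5):
    """Fraction of all element occurrences taken by the k most used."""
    total = sum(counts.values())
    if not total:
        return 0.0
    return sum(c for _, c in counts.most_common(k)) / total


WRAPPER = ('<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" '
           'xmlns:dc="http://purl.org/dc/elements/1.1/">%s</oai_dc:dc>')

records = [
    WRAPPER % "<dc:title>A</dc:title><dc:creator>X</dc:creator><dc:date>2002</dc:date>",
    WRAPPER % "<dc:title>B</dc:title><dc:creator>Y</dc:creator>",
]

counts, average = element_usage(records)
share = top_share(counts, k=2)
```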

  • [June 14, 2002] "Federated Searching Interface Techniques for Heterogeneous OAI Repositories." By Xiaoming Liu, Kurt Maly, Mohammad Zubair, Qiaoling Hong (Old Dominion University, Norfolk, Virginia USA); Michael L. Nelson (NASA Langley Research Center, Hampton, Virginia USA); Frances Knudson and Irma Holtkamp (Los Alamos National Laboratory, Los Alamos, New Mexico USA). In Journal of Digital Information Volume 2 Issue 4 (May 2002). "Federating repositories by harvesting heterogeneous collections with varying degrees of metadata richness poses a number of challenging issues: (1) how to address the lack of uniform control for various metadata fields in terms of building a rich unified search interface, and (2) how easily new collections and freshly harvested data in existing repositories can be incorporated into the federation supporting a unified interface? This paper focuses on the approaches taken to address these issues in Arc, an Open Archives Initiative-compliant federated digital library. At present Arc contains over 1M metadata records from 75 data providers from various subject domains. Analysis of these heterogeneous collections indicates that controlled vocabularies and values are widely used in most repositories. Usage is extremely variable, however. In Arc we solve the problem by implementing an advanced searching interface that allows users to search and select in specific fields with data we construct from the harvested metadata, and also by an interactive search for the subject field. As the metadata records are incrementally harvested we address how to build these services over frequently-added new collections and harvested data. The initial result is promising, showing the benefits of immediate feedback to the user in enhancing the search experience as well as in increasing the precision of the user's search..."
See also "Arc - An OAI Service Provider for Digital Library Federation," published in D-Lib Magazine; 'The Open Archive Initiative (OAI) is one major effort to address technical interoperability among distributed archives. The objective of OAI is to develop a framework to facilitate the discovery of content in distributed archives.'

  • [January 04, 2002] "DP9 Service Provider for Web Crawlers." By Xiaoming Liu (Computer Science Department, Old Dominion University, Norfolk, Virginia, USA). In D-Lib Magazine Volume 7 Number 12 (December 2001). ISSN: 1082-9873. The Open Archive Initiative (OAI) team (K. Maly, M. Zubair, M. Nelson, X. Liu) of the Old Dominion University (ODU) Digital Library group has "announced DP9 -- a new OAI service provider for web crawlers. DP9 is an open source gateway service that allows general search engines, (e.g., Google, Inktomi, etc.) to index OAI-compliant archives. DP9 does this by providing a persistent URL for repository records and converting this to an OAI query against the appropriate repository when the URL is requested. This allows search engines that do not support the OAI protocol to index the "deep web" contained within OAI-compliant repositories. Indexing OAI collections via an Internet search engine is difficult because web crawlers cannot access the full contents of an archive, are unaware of OAI, and cannot handle XML content very well. DP9 solves these problems by defining persistent URLs for all OAI records and dynamically creating a series of HTML pages according to a crawler's requests. DP9 provides an entry page, and if a web crawler finds this entry page, the crawler can follow the links on this page and index all records in an OAI data provider. DP9 also supports a simple name resolution service: given an OAI Identifier, it responds with an HTML page, a raw XML file, or forwards the request to the appropriate OAI data provider. DP9 consists of three main components: a URL wrapper, an OAI handler and an XSLT processor. The URL wrapper accepts the persistent URL and calls the internal JSP/Servlet applications. The OAI handler issues OAI requests on behalf of a web crawler. The XSLT processor transforms the XML content returned by the OAI archive to an HTML format suitable for a web crawler.
XSLT allows DP9 to support any XML metadata format simply by adding an XSL file. DP9 is based on Tomcat/Xalan/Xtag technology from Apache... The DP9 code is available for installation by any interested OAI-compliant repository." See also "Repositories Open Up to Web Crawlers," by Scott Wilson [CETIS], November 28, 2001.
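
The URL-wrapper idea described in this entry (a persistent URL converted into an OAI query) can be sketched roughly as follows. This is not DP9's code, which is Java/JSP based; the path layout `/getrecord/<prefix>/<repo>/<identifier>`, the `REPOSITORIES` registry, and the function name are all hypothetical. Only the `GetRecord` verb and its `identifier` and `metadataPrefix` arguments come from the OAI protocol.

```python
from urllib.parse import urlencode

# Hypothetical registry: repository nickname -> OAI base URL.
REPOSITORIES = {"example": "http://repository.example.org/oai"}


def persistent_url_to_oai_request(path, repositories=REPOSITORIES):
    """Map '/getrecord/<prefix>/<repo>/<identifier>' to a GetRecord request URL."""
    # Split into at most four parts so an identifier containing '/' stays intact.
    parts = path.strip("/").split("/", 3)
    if len(parts) != 4 or parts[0] != "getrecord":
        raise ValueError("unrecognised persistent URL: " + path)
    _, prefix, repo, identifier = parts
    if repo not in repositories:
        raise ValueError("unknown repository: " + repo)
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": prefix,
    })
    return repositories[repo] + "?" + query


url = persistent_url_to_oai_request("/getrecord/oai_dc/example/oai:example.org:1234")
print(url)
```

The XML returned for such a request would then go through an XSLT step to produce HTML; that step is omitted here because the Python standard library does not include an XSLT processor.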

  • [July 02, 2001] "The Open Archives Initiative Protocol for Metadata Harvesting." Protocol Version 1.1 of 2001-07-02. Edited by Herbert Van de Sompel (Cornell University, Computer Science) and Carl Lagoze (Cornell University, Computer Science). "The goal of the Open Archives Initiative Protocol for Metadata Harvesting (referred to as the OAI protocol in the remainder of this document) is to supply and promote an application-independent interoperability framework that can be used by a variety of communities who are engaged in publishing content on the Web. The OAI protocol described in this document permits metadata harvesting. The result is an interoperability framework with two classes of participants: Data Providers administer systems that support the OAI protocol as a means of exposing metadata about the content in their systems; Service Providers issue OAI protocol requests to the systems of data providers and use the returned metadata as a basis for building value-added services. A record is an XML-encoded byte stream that is returned by a repository in response to an OAI protocol request for metadata from an item in that repository..." See Appendix 1 for Sample XML Schemas for metadata formats: "Each metadata format that is included in records disseminated by the OAI protocol is identified within the repository by a metadata prefix and across multiple repositories by the URL of a metadata schema. The metadata schema is an XML schema that may be used as a test of conformance of the metadata included in the record. XML Schemas for three metadata formats are given here (1) An XML Schema for the mandatory unqualified Dublin Core metadata format; (2) An XML Schema for the rfc1807 metadata format; (3) An XML Schema to represent MARC21 records in an XML format..."
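
To illustrate the pairing of a metadata prefix with a schema URL, here is a minimal sketch that reads a simplified list of metadata formats. The sample XML is made up and omits the namespaces and envelope of a real response; the element names `metadataFormat`, `metadataPrefix`, and `schema` follow the protocol as best understood here and should be checked against the specification. The schema URLs are placeholders.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up response body; real responses carry namespaces and more fields.
SAMPLE_FORMATS = """<ListMetadataFormats>
  <metadataFormat>
    <metadataPrefix>oai_dc</metadataPrefix>
    <schema>http://example.org/schemas/dc.xsd</schema>
  </metadataFormat>
  <metadataFormat>
    <metadataPrefix>oai_marc</metadataPrefix>
    <schema>http://example.org/schemas/oai_marc.xsd</schema>
  </metadataFormat>
</ListMetadataFormats>"""


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def formats_by_prefix(xml_text):
    """Return {metadata prefix: schema URL} for each metadataFormat element."""
    formats = {}
    for fmt in ET.fromstring(xml_text).iter():
        if local_name(fmt.tag) != "metadataFormat":
            continue
        fields = {local_name(child.tag): (child.text or "").strip() for child in fmt}
        if "metadataPrefix" in fields:
            formats[fields["metadataPrefix"]] = fields.get("schema")
    return formats


formats = formats_by_prefix(SAMPLE_FORMATS)
print(formats)
```

A harvester could use such a mapping to recognise the same format across repositories by schema URL even when the local prefixes differ.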

  • [October 26, 2001] "Subject Portals." In Ariadne Issue 29 (September 2001). Published by UKOLN. ['Judith Clark describes a 3-year project to develop a set of subject portals - part of the Distributed National Electronic Resource (DNER) development programme.'] "The RDN portals are primarily concerned with technologies that broker subject-oriented access to resources. Effective cross-searching depends on consistent metadata standards, but these are still under development and although the RDN's collection is governed by sophisticated metadata schemas, this is not the case for many of the other resources targeted by the portals. Z39.50 is the standard that has been adopted for the preliminary cross-search functionality. Further portal functionality is being developed using RSS (Rich Site Summary) and OAI (Open Archives Initiative). Other standards applications that underpin the portals are notably Dublin Core and a variety of subject-specific thesauri such as the CAB Thesaurus and MeSH... [The OAI protocol provides a mechanism for sharing metadata records between services. Based on HTTP and XML, the protocol is very simple, allowing a client to ask a repository for all of its records or for a sub-set of all its records based on a date range. An OAI record is an XML-encoded byte stream that is returned by a repository in response to an OAI protocol request for metadata from an item in that repository.] The RDN's services today fulfil an expressed goal of the eLib programme, which in 1994 funded a series of demonstrator services designed to create a national infrastructure capable of generating significantly more widespread use of networked information resources. Those services, then known as subject gateways, developed in response to community interests specific to each gateway. The RDN itself was established in 1999 to bring the gateways together under a federated structure.
The Hubs are based around faculty-level subject groupings, chosen with a view to potential for partnership, sustainability, and growth, while preserving legacy investments. The RDN's internet resource catalogues include records describing almost 40,000 web sites and are growing steadily as new subject areas are encompassed..." See technical details in "The DNER Technical Architecture: scoping the information environment" [also PDF]
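
The date-range request mentioned in this entry can be sketched as a small URL builder. The base URL and function name are hypothetical. The `ListRecords` verb and the `from`, `until`, and `metadataPrefix` arguments are taken from the OAI protocol, and the `YYYY-MM-DD` date form is an assumption that suits day-level harvesting.

```python
from datetime import date
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix, from_date=None, until_date=None):
    """Build a ListRecords request, optionally restricted to a date range."""
    if from_date and until_date and from_date > until_date:
        raise ValueError("from_date must not be later than until_date")
    args = [("verb", "ListRecords"), ("metadataPrefix", metadata_prefix)]
    if from_date is not None:
        args.append(("from", from_date.isoformat()))
    if until_date is not None:
        args.append(("until", until_date.isoformat()))
    return base_url + "?" + urlencode(args)


# All records, then only those from the first half of 2001.
everything = list_records_url("http://repository.example.org/oai", "oai_dc")
first_half = list_records_url(
    "http://repository.example.org/oai", "oai_dc",
    from_date=date(2001, 1, 1), until_date=date(2001, 6, 30),
)
print(everything)
print(first_half)
```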

  • [April 21, 2001] "Arc - An OAI Service Provider for Digital Library Federation." By Xiaoming Liu, Kurt Maly, and Mohammad Zubair (Old Dominion University, Norfolk, Virginia USA) and Michael L. Nelson (NASA Langley Research Center Hampton, Virginia USA). In D-Lib Magazine [ISSN: 1082-9873] Volume 7, Number 4 (April, 2001). "The usefulness of the many on-line journals and scientific digital libraries that exist today is limited by the inability to federate these resources through a unified interface. The Open Archive Initiative (OAI) is one major effort to address technical interoperability among distributed archives. The objective of OAI is to develop a framework to facilitate the discovery of content in distributed archives. In this paper, we describe our experience and lessons learned in building Arc, the first federated searching service based on the OAI protocol. Arc harvests metadata from several OAI compliant archives, normalizes them, and stores them in a search service based on a relational database (MySQL or Oracle). At present we have over 320,000 metadata records from 18 data providers from various subject domains. We have also implemented an OAI layer over Arc, thus making hierarchical harvesting possible. The experiences described within should be applicable to others who seek to build an OAI service provider... Bulk harvesting is ideal because of its simplicity for both the service provider and data provider. It collects the entire data set through a single http connection, thus avoiding a great deal of network traffic. However, bulk harvesting has two problems. First, the data provider may not implement the resumptionToken flow control mechanism of the OAI metadata harvesting protocol, and thus may not be able to correctly process large (but partial) data requests. Secondly, XML syntax errors and character-encoding problems -- these were surprisingly common -- can invalidate entire large data sets... 
During the testing of data harvesting from OAI data providers, numerous problems were found. We discovered that not all archives strictly follow the OAI protocol; many have XML syntax and encoding problems; and some data providers are periodically unavailable. Many OAI responses were not well-formatted XML files. Sometimes foreign language and other special characters were not correctly encoded. XML syntax errors and character-encoding problems were surprisingly common and could invalidate entire large data sets. Incremental harvesting proved beneficial as a work-around. The OAI website validates registered data providers for protocol compliance. It uses XML schemas to verify the standard conformance. However, this verification is not complete; it does not cover the entire harvesting scenario and does not verify the entire data set. Additionally, such verification cannot detect semantic errors in the protocol implementation, such as misunderstanding of DC fields. For certain XML encoding errors, an XML parser can help avoid common syntax and encoding errors... The contribution of Arc is to prove not only that an OAI-compliant service provider can be built, but also that one can be built at a scale previously unrealized within the e-print community. The Open Archives Initiative has been successful in getting data providers to adopt the protocol and provide an OAI layer to their repositories..."
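
The harvesting problems described in this entry (resumptionToken flow control, and one malformed response spoiling a large data set) can be illustrated with a minimal harvesting loop. This is not Arc's implementation. The `fetch` callable, the canned pages, and the simplified XML are hypothetical stand-ins so the sketch runs without a network; sending only the verb and the token on follow-up requests is an assumption about how the token is used.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def harvest(fetch, first_args):
    """Follow resumption tokens until the list ends or a response is malformed.

    `fetch` takes a dict of request arguments and returns the response text.
    Returns (records, error); error is None when the harvest completed.
    """
    records, args = [], dict(first_args)
    while True:
        try:
            root = ET.fromstring(fetch(args))
        except ET.ParseError as exc:
            # A malformed page hides its token, so the remainder is unreachable.
            return records, "malformed response for %r: %s" % (args, exc)
        token = None
        for elem in root.iter():
            name = local_name(elem.tag)
            if name == "record":
                records.append(elem)
            elif name == "resumptionToken":
                token = (elem.text or "").strip()
        if not token:
            return records, None
        args = {"verb": first_args["verb"], "resumptionToken": token}


# Canned pages keyed by resumption token; the last one has an unescaped '&'.
PAGES = {
    None: "<ListRecords><record>r1</record><record>r2</record>"
          "<resumptionToken>t1</resumptionToken></ListRecords>",
    "t1": "<ListRecords><record>r3</record>"
          "<resumptionToken>t2</resumptionToken></ListRecords>",
    "t2": "<ListRecords><record>r4 & broken</record></ListRecords>",
}


def fake_fetch(args):
    return PAGES[args.get("resumptionToken")]


records, error = harvest(fake_fetch, {"verb": "ListRecords", "metadataPrefix": "oai_dc"})
print(len(records), error)
```

Harvesting in smaller date windows limits how much is lost when one response fails to parse, which is in line with the entry's remark that incremental harvesting proved beneficial as a work-around.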

  • [February 17, 2001] Open Archives Initiative Publishes XML Schemas for the OAI Metadata Harvesting Protocol. Version 1.0 of The Open Archives Initiative Protocol for Metadata Harvesting has been published with appendices documenting XML schemas for metadata representation (e.g., Dublin Core metadata format, RFC1807 metadata format, MARC21 records in an XML format). The Open Archives Initiative, with funding from the Digital Library Federation and the Coalition for Networked Information, "develops and promotes interoperability standards" for efficient dissemination of content on the Web. The published OAI protocol defines a mechanism for harvesting records containing metadata from repositories, where metadata records are structured as XML data. The OAI protocol was extensively tested by a variety of alpha testers before its public release and is now widely implemented. The OAI metadata harvesting protocol supports an "interoperability framework with two classes of participants: (1) Data Providers administer systems that support the OAI protocol as a means of exposing metadata about the content in their systems; (2) Service Providers issue OAI protocol requests to the systems of data providers and use the returned metadata (XML-encoded byte stream) as a basis for building value-added services." An online 'Open Archives Initiative Repository Explorer' supports interactive testing of archives for conformance with the OAI protocol. [Full context]

  • [February 17, 2001] Open Archives Initiative Repository Explorer v1.0. "This site presents an interface to interactively test archives for compliancy with the Open Archives Initiative Protocol version 1.0. It works by interrogating the Open Archive using the service requests defined by the protocol. The results in XML are then parsed and interpreted to present the user with a navigable interface." Several predefined archives are listed with URLs.

  • Open Archives Metadata Set DTD. [cache]

  • Sample OAMS document; [cache]

  • Santa Fe Convention. "The meeting was supported by the Council on Library and Information Resources, the Digital Library Federation, the Scholarly Publishing & Academic Resources Coalition, the Association of Research Libraries and the Los Alamos National Laboratory. The first meeting, held in Santa Fe in October 1999, brought together individuals representing a variety of organizations, many of them managing existing EPrint initiatives (e.g., CogPrints, NDLTD, RePEc, EconWPA, NCSTRL, NTRS). The result of that meeting was the so-called Santa Fe Convention, a set of technical and organizational specifications to permit cross-archive metadata harvesting."

  • Summary of the second workshop. June 3, 2000. By Edward A. Fox, Professor, Department of Computer Science, Virginia Tech. "At the Second OAi meeting, 43 people assembled from 5 countries. Eleven of those had attended the Santa Fe meeting, and many others were closely affiliated with an organization or group that participated in the Santa Fe meeting. Among those were the workshop organizing committee: Edward Fox, Carl Lagoze, Clifford Lynch, and Hussein Suleman -- each of whom gave brief presentations..." See the proceedings. [cache]

  • [February 17, 2001] "The Open Archives Initiative Protocol for Metadata Harvesting." Edited by Herbert Van de Sompel (Cornell University - Computer Science) and Carl Lagoze (Cornell University - Computer Science). Protocol Version 1.0. Document Version 2001-01-21. "The goal of the Open Archives Initiative Protocol for Metadata Harvesting is to supply and promote an application-independent interoperability framework that can be used by a variety of communities who are engaged in publishing content on the Web. The OAI protocol described in this document permits metadata harvesting. The result is an interoperability framework with two classes of participants: (1) Data Providers administer systems that support the OAI protocol as a means of exposing metadata about the content in their systems; (2) Service Providers issue OAI protocol requests to the systems of data providers and use the returned metadata as a basis for building value-added services. A repository is a network accessible server to which OAI protocol requests, embedded in HTTP, can be submitted. The OAI protocol provides access to metadata from OAI-compliant repositories. This metadata is output in the form of a record. A record is the result of a protocol request issued to the repository to disseminate metadata from an item. A record is an XML-encoded byte stream that is returned by a repository in response to an OAI protocol request for metadata from an item in that repository." Appendix 1 supplies 'Sample XML Schema for metadata formats': Each metadata format that is included in records disseminated by the OAI protocol is identified within the repository by a metadata prefix and across multiple repositories by the URL of a metadata schema. The metadata schema is an XML schema that may be used as a test of conformance of the metadata included in the record.
XML Schemas for three metadata formats are provided: (1) An XML Schema for the mandatory unqualified Dublin Core metadata format; (2) An XML Schema for the RFC1807 metadata format; (3) An XML Schema to represent MARC21 records in an XML format. Appendix 2 supplies 'Sample XML Schemas for the description part of a reply to Identify request': The response to an Identify request may contain a list of description containers, which provide an extensible mechanism for communities to describe their repositories. Each description container must be accompanied by the URL of an XML schema, which provides the semantics of the container. XML Schemas for two examples of description containers are provided. See also the XML Schema for the Response Format [source] and related schemas. [cache]
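
The description-container mechanism in an Identify response can be illustrated by listing each container together with the schema URL that accompanies it. The sample response, its namespace URIs, and the function names below are made up for illustration, and the sketch assumes the schema URL travels in a `schemaLocation` attribute on the container element.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up Identify response with one description container.
SAMPLE_IDENTIFY = """<Identify xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <repositoryName>Example Archive</repositoryName>
  <baseURL>http://repository.example.org/oai</baseURL>
  <description>
    <eprints xmlns="http://example.org/ns/eprints"
             xsi:schemaLocation="http://example.org/ns/eprints
                                 http://example.org/schemas/eprints.xsd">
      <content><text>Working papers</text></content>
    </eprints>
  </description>
</Identify>"""


def local_name(name):
    """Strip a '{namespace}' prefix from a tag or attribute name, if present."""
    return name.rsplit("}", 1)[-1]


def description_schemas(xml_text):
    """Return [(container name, schema URL or None)] for each description container."""
    result = []
    for elem in ET.fromstring(xml_text).iter():
        if local_name(elem.tag) != "description":
            continue
        for container in elem:
            schema_url = None
            for attr, value in container.attrib.items():
                if local_name(attr) == "schemaLocation":
                    # The attribute holds "namespace URL" pairs; keep the last URL.
                    parts = value.split()
                    if parts:
                        schema_url = parts[-1]
            result.append((local_name(container.tag), schema_url))
    return result


containers = description_schemas(SAMPLE_IDENTIFY)
print(containers)
```

A harvester that does not recognise a container's schema URL can skip that container, which is what makes the mechanism extensible.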

  • Summary of the First meeting of the Open Archives initiative

  • "The Santa Fe Convention of the Open Archives Initiative." By Herbert Van de Sompel and Carl Lagoze. In D-Lib Magazine, February 2000.

  • XML DTD for RFC 1807 Metadata; [cache]

  • OAi MARC XML format DTD; [cache]

  • XML DTD for the Dublin Core Metadata Set (CIMI Report version); [cache]

  • [The eprints project] is "dedicated to the freeing of the refereed research literature online through author auto-archiving. Auto-archiving software, eprints, is currently under development at the Electronics and Computer Science Department of the University of Southampton. eprints is already running in the form of the CogPrints Cognitive Sciences Eprint Archive, a JISC-funded open archive for research literature, pre- and post-refereeing. The software is designed to be as flexible and adaptable as possible so that universities can adopt and configure it with minimal effort for all disciplines. The generic version of eprints is fully interoperable with other open archives participating in the Open Archives Initiative."

  • "Open Archives: A Key Convergence." By Roy Tennant. In Library Journal February 15, 2000. "In October 1999, several organizations -- including the Digital Library Federation, Association of Research Libraries, and Los Alamos National Laboratory -- recruited a group of experts 'to work towards achieving a universal service for author self-archived scholarly literature.' Self-archiving denotes the process of authors depositing their own papers into an archive. A common practice among scientists is to make preliminary drafts of their papers (or 'preprints') available to colleagues prior to publication. Preprints can subsequently undergo peer review and be published in a professional journal. One outcome of the meeting, which took place in Santa Fe, NM, was the establishment of the Open Archives initiative (formerly known as the Universal Preprint Service initiative). The initiative aims to develop an open architecture that supports simultaneous searching and retrieval of papers from disparate archives. This is the logical next step after several separate projects have successfully archived papers of various kinds (such as technical reports, theses, dissertations, preprints, working papers, and conference papers). There are lessons to learn from each of the projects...The Open Archives initiative aims to specify the methods by which these various individual archives can interoperate. Such interoperability will largely be achieved by specifying first a protocol for 'harvesting' (gathering) metadata from participating archives; then criteria that can be used to selectively harvest metadata; and lastly, a common metadata format for archives to use in responding to harvesting requests. At first, the initiative will use a modified version of the Dienst protocol that comes out of the NCSTRL effort as the harvesting protocol. 
Dienst is well established for this kind of activity, having supported the same kind of work on behalf of computer science technical reports for some years. Accession date was thought by the Santa Fe meeting attendees to be the most important criteria for selective harvesting, with author affiliation, subject, and publication type also being deemed important. For the metadata component, a minimal set of the Dublin Core elements will be used. An early experimental implementation of such a service is the Universal Preprint Service (UPS) prototype server. Additional participants in the Open Archives effort include the California Digital Library of the University of California, through its eScholarship initiative, CogPrints (Cognitive Sciences archive), RePEc (Research Papers in Economics), and EconWPA (Economics Working Papers Archive)..."

  • The Open Citation Project. "The Open Citation Project is working towards becoming a registered service provider with the Open Archives Initiative and complying with the Santa Fe Convention..."

