UKOLN, University of Bath, UK
Stuart L. Weibel
OCLC Office of Research, Dublin, Ohio, USA
The first week of April 1996 found fifty representatives of libraries, Internet standards, text markup, and digital libraries projects converging at Warwick University to discuss advancing prospects for network resource description. The conferees came from three continents, eleven countries, and many perspectives in an effort to apply their collective experience to the clarification of issues surrounding the effective deployment of metadata for networked information resources.
The meeting was a follow-on to the previous year's OCLC/NCSA Metadata Workshop that convened a similarly diverse collection of stakeholders, and which resulted in consensus on a simple resource description record that has come to be known as the Dublin Core. The most important deliverable of that first workshop was the consensus that was achieved among the groups represented. The thirteen elements of a Dublin Core record contain few surprises, focussing largely on what might be thought of as network resource bibliography and a little bit more [Weibel et al., 1995].
The idea received considerable attention in the year since the first meeting, but while the first workshop helped to focus discussion of the topic in many communities, the implementation of such a description record requires a formal syntax and deployment strategy that were beyond the scope of that first meeting.
Planning for the second workshop began with informal discussions between the UK Office for Library and Information Networking (UKOLN) and OCLC's Office of Research in the summer of 1995, and crystallized around the theme of identifying and resolving impediments to deployment of a Dublin Core style record for resource description. The expectations of the organisers and participants were exceeded as conferees worked towards a number of related conclusions about the Dublin Core Metadata Element Set, about the need for a wider set of metadata types, and about an extensible framework for interchange of metadata of different types. A consensus about this central set of issues emerged from the workshop and, more importantly, a set of concrete proposals for moving forward has been produced. These include:
Guide to Creation and Maintenance of Metadata
The Dublin Core Metadata Element Set is a set of thirteen metadata elements proposed by the first workshop as a core description record to facilitate discovery of document-like objects in a networked environment. To facilitate progress, a number of constraints were imposed on the discussion:
|Table 1. The Dublin Core Elements|
|Subject|The topic addressed by the work.|
|Title|The name of the object.|
|Author|The person(s) primarily responsible for the intellectual content of the object.|
|Publisher|The agent or agency responsible for making the object available in its current form.|
|Other Agent|The person(s), such as editors, transcribers, and illustrators who have made other significant intellectual contributions to the work.|
|Date|The date of publication.|
|Object type|The genre of the object, such as novel, poem or dictionary.|
|Form|The physical manifestation of the object, such as PostScript file or Windows executable file.|
|Identifier|String or number used to uniquely identify the object.|
|Relation|Relationship to other objects.|
|Source|Objects, either print or electronic, from which this object is derived, if applicable.|
|Language|Language of the intellectual content.|
|Coverage|The spatial location and/or temporal duration characteristics of the object.|
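The element set in Table 1 is small enough to be modelled as a plain attribute-value record. The following Python sketch is offered only as an illustration: the names `DUBLIN_CORE_ELEMENTS` and `make_record`, the sample values, and the rule that every element is optional and may be repeated are assumptions made for this example, not something prescribed by the workshop.

```python
# The thirteen element names listed in Table 1.
DUBLIN_CORE_ELEMENTS = (
    "Subject", "Title", "Author", "Publisher", "Other Agent", "Date",
    "Object type", "Form", "Identifier", "Relation", "Source",
    "Language", "Coverage",
)


def make_record(pairs):
    """Build a record from (element, value) pairs.

    Assumption for this sketch: every element is optional and may repeat,
    so each element present in the record maps to a list of values.
    """
    record = {}
    for element, value in pairs:
        if element not in DUBLIN_CORE_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {element!r}")
        record.setdefault(element, []).append(value)
    return record


# Hypothetical sample description of a document-like object.
example = make_record([
    ("Title", "A hypothetical technical report"),
    ("Author", "A. N. Author"),
    ("Author", "B. Coauthor"),
    ("Form", "PostScript file"),
    ("Language", "en"),
])
```

Keeping each element as a list of values means a second author is added in the same way as the first, with no special case.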
The Dublin Metadata Workshop is described in greater detail in:
[Weibel et al., 1995] and [Weibel, 1995].
The reference description of the element set can be found at:
The development of the Dublin Core is motivated by several intended uses:
It is clear from early implementation experience that projects have employed Dublin Core semantics to develop simple resource description formats. The Dublin Core has suited those who need a format which is positioned between the terseness of the web crawler indexes and the fuller description of particular domain-specific formats (MARC, for example). It is full enough to support retrieval by a number of core attributes and to allow human users to make judgements about the likely utility of a resource before requesting it. At the same time, it is simple enough not to require specialist expertise or extended manual effort to create.
This latter feature is especially important in the context of the second target use mentioned here. Conferees recognized the importance of richer metadata embedded in Web documents to be harvested by software robots. The use of the Dublin Core as the basis for such data is seen as a critical success factor in its adoption. The ability to embed data in other objects was also seen as essential.
Future applications will have to work with different types of metadata from different sources. The first workshop identified a need for a generic semantics which could act as the basis for semantic interoperability between multiple description schemas. The Dublin Core was positioned to provide a unifying semantics across description models. The NDIS application (described below) is one early example of such a use.
Even absent a clearly defined syntax, the Dublin Core element set attracted the interest of a number of early adopters who developed projects that built on the consensus that emerged from the Dublin Metadata Workshop:
To be provided
The URN Interoperability Project (TURNIP), initiated by the DSTC in Australia, has produced a URN Resolution service that utilises the Dublin Core element set for URC metadata. The Dublin Core elements are used to describe DSTC Technical reports and are supplemented with Administrative metadata elements (e.g., URC-Type, Date-Creation, Owner). Three main issues arose from this deployment of the Dublin Core: the need to group elements together, the need for a common syntax for the exchange of URCs, and the need for standards for element qualifiers. More information on the TURNIP project can be found at:
Research involving the Dublin Core Element Set at OCLC includes interfaces into online databases, systems for user-based resource description, and issues associated with automatically generated resource description and metadata harvesting based on the Dublin Core.
A preliminary evaluation of the Dublin Core Element Set as a search interface into complex information was conducted utilizing OCLC's WorldCat Database. Similar experiments in a distributed environment are planned between OCLC's map collections and UCSB's Project Alexandria Database.
The Spectrum project is exploring various user interface issues associated with user-based resource description of electronic information based on the Dublin Core element set. This project, in part, is exploring various issues associated with the extensibility framework of the Dublin Core and the Warwick Framework for local, genre-specific descriptive information.
The Scorpion Project is an OCLC research project involved in automatic subject assignment based on the Dewey Decimal Classification system. A Spectrum-Scorpion dovetail provides both user-described and automatically generated classification metadata based on the Dublin Core Element Set. An extension of this project involves the automatic harvesting of internet resources to explore the feasibility of using the Dublin Core element set as an effective descriptive framework for distributed indexing of networked information.
The NDIS project (National Document and Information Service) is a joint development of the National Library of Australia and National Library of New Zealand aimed at providing a sophisticated search service to Australian and overseas databases, collection management services and state of the art document delivery services. The first phase of the project will implement a search and document request service across an integrated information resource of MARC based bibliographic data and a suite of indexing, directory and thesauri databases in a variety of encoding formats. Further information about the project can be located at: http://www.nla.gov.au/2/NDIS.
The NDIS project used the Dublin Core as a tool in determining generic metadata for bibliographic data, with extensions of the core element set or adoption of other metadata standards for non-bibliographic data. The creation of additional metadata can be viewed as extensions or separate core element sets.
The Dublin Core serves as a useful model for the generic storage and access requirements of cross-database searching. Its concept of qualification offers a model both for normalising disparate data types and for achieving search precision at the individual database level via specific schemas or types. The NDIS implementation utilises many principles of the Dublin Core, such as extensibility and modifiability, but differs on optionality, as only those metadata elements that intersect across data types are core information resource elements. Metadata intersecting a grouping of data or item types are considered "common" metadata element sets.
To be provided
The Alexandria Digital Library (ADL) is one of six NSF/NASA/ARPA-funded Digital Library Initiatives. ADL focuses on online access to spatial data; given that an estimated 90% of all spatial data is available only in hard-copy form, metadata is of prime importance. At the same time, ADL recognizes that a full cataloging record is not needed by the vast majority of general users. ADL therefore decided to translate Dublin Core fields into ADL fields, and to add fields required specifically for spatial data and specifically for hard-copy items. This set of fields is the default display set that general users see when they perform a search and then call up metadata.
It is important to note that there are simple resource description models other than the Dublin Core that were discussed at the Warwick Workshop. Indeed, among the factors that motivated the Warwick Framework described later in this paper is the principle that there will be a variety of resource description models that emerge from different communities, and such models should be able to coexist.
Two such models that were discussed at the workshop are described below:
This RFC [http://ds.internic.net/rfc/rfc1807.txt] defines a format for bibliographic records describing technical reports. This format is used by the Cornell University Dienst protocol and the Stanford University SIFT system.
RFC 1807 is a bibliographic record tailored to the needs of the NCSTRL project [http://www.ncstrl.org.]. It is targeted specifically to the description of computer science technical reports. As such it has many characteristics one would like to see in a resource description record for document-like objects.
ROADS [http://ukoln.bath.ac.uk/roads/]
ROADS is an eLib funded project to implement software for resource organisation and discovery in subject-based services. The aim is to develop a sharable resource discovery system and, in particular, to fulfil the requirements of the eLib subject based services. The intention is to involve information providers in resource description as this is viewed as essential to a sustainable service. There are already two subject services in production, OMNI and SOSIG, using a prototype version of ROADS.
The choice of standards for ROADS was based on the criteria of simplicity and availability, to allow for speedy start-up of the subject services. To this end we chose to use a simple attribute-value record structure based on the IAFA/whois++ template definition; and to use the whois++ directory service protocol for search and retrieval. A later version of ROADS will pilot implementation of the common indexing protocol (CIP) to allow for a distributed system of shared indexing. Our initial experience of deployment of the IAFA/whois++ template has allowed us to collect statistical information on the frequency of use of both bibliographic and administrative attributes. We hope this will provide useful feedback for the development of the whois++ template structure.
Specification of a Transfer Syntax
Discussions of syntax are often difficult, burdened as they are with the biases of familiarity and competing methodologies. The Dublin Workshop made progress partly because such discussions were ruled out of scope. However, consensus concerning semantics cannot be deployed without a concrete syntax (or syntaxes). In pilot implementations, the absence of a common model led to different syntax and structuring choices. Clearly, any widespread deployment of Dublin Core (or any similar description scheme) hinges on reaching consensus about a transfer syntax.
Given that the Web is the primary medium of the electronic milieu, it was further recognized that deployment of metadata in the Web is the primary strategic application; successful deployment of metadata in HTML is necessary, though almost certainly not sufficient.
A working group on syntax formed around this issue and this group has elaborated a position paper describing a formal syntax for Dublin Core Metadata. A Syntax for Dublin Core Metadata (Burnard, Miller, Quin, and Sperberg-McQueen) includes:
In related developments, a convention for embedding metadata in HTML was proposed in a break-out group at the W3C Distributed Indexing and Searching Workshop, May 28-29, 1996 [LINK TO DIST-INDEX WORKSHOP]. This break-out group included representatives of the Dublin Core/Warwick Framework Metadata meetings, representatives of several major Web search vendors (Lycos, Microsoft, WebCrawler), various other software vendors, and the W3 Consortium.
The problem is to identify a simple means of embedding metadata within HTML documents without requiring additional tags or changes to browser software, and without unnecessarily compromising current practices for robot collection of data.
While metadata is intended for display in some situations, it is judged undesirable for such embedded metadata to display on browser screens as a side effect of displaying a document. Therefore, any solution requires encoding information in tag attributes rather than as container element content.
The goal was to agree on a simple convention for encoding structured metadata information of a variety of types (which may or may not be registered with a central registry analogous to the MIME type registry). It was judged that a registry may be a necessary feature of the metadata infrastructure as alternative schemata are elaborated, but that deployment in the short term could go forward without such a registry, especially in light of the proposed use of the LINK tag to link descriptions to a standard schema description as described below.
The solution agreed upon is to encode schema elements in META tags, one element per META tag, and as many META tags as are necessary. Grouping of schema elements is achieved by a prefix schema identifier associated with each schema element.
A convention for linking resource description tags to the reference definition of the metadata schema (or schemata) used in a document was also proposed. Doing so serves as a primitive registration mechanism for metadata schemata, and lays the foundation for a more formal, machine-readable linkage mechanism in the future. The proposed conventions are described more fully in [LINK TO HTML-META CONVENTION].
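The shape of this convention can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the text of the proposal: the function name `meta_tags`, the dot used to join the schema prefix to the element name, the `SCHEMA.` value in the LINK tag's REL attribute, and the example URL are all hypothetical. Only the general pattern comes from the discussion above: one META tag per element, a schema prefix on each element name, values carried in attributes, and a LINK tag pointing to the schema's reference definition.

```python
from html import escape


def meta_tags(schema_prefix, elements, schema_url=None):
    """Render (name, value) pairs as HTML header tags, one META tag per element.

    The attribute layout used here is an assumption made for illustration.
    """
    lines = []
    if schema_url is not None:
        # Hypothetical LINK form tying the prefix to a schema definition.
        lines.append(
            f'<LINK REL="SCHEMA.{escape(schema_prefix)}" HREF="{escape(schema_url)}">'
        )
    for name, value in elements:
        # Values live in attributes, so nothing is displayed by a browser.
        lines.append(
            f'<META NAME="{escape(schema_prefix)}.{escape(name)}" CONTENT="{escape(value)}">'
        )
    return "\n".join(lines)


# Hypothetical usage: two elements grouped under the prefix "DC".
header = meta_tags(
    "DC",
    [("title", 'Metadata "in" HTML'), ("author", "A. N. Author")],
    schema_url="http://example.org/schemas/dublin-core",
)
```

Because each element carries its own prefix, elements from several schemata can share one document header and a harvester can still sort them by schema.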
Development of User Guides
Resource descriptions might be created by a number of actors in the metadata use chain: authors (embedded HTML tags), site and collection administrators, and third-party 'cataloguers'. Guidelines for the creation of metadata are needed. A guide for authors themselves would be especially useful in supporting a move to document-embedded descriptions, and at least one producer of HTML authoring tools (SoftQuad, Ltd.) has committed to embedding Dublin Core resource description templates in their products when the syntax and guidelines are sufficiently stable.
[Need elaboration on the focus of the User Guides and where/when they will be published.]
Extensibility -- Mixing and Matching Metadata
The Dublin Core addresses one particular niche of the metadata ecology. It is a simple resource description format that is intended to be extensible in at least two ways. As its name implies, it is intended to provide a commonly understandable core of elements that will help unify different models of resource description. Its simplicity is among its major virtues, but users may well wish to augment description of their resources with additional data.
Original concepts of extensibility for the Dublin Core assumed a mechanism for local extensions -- additional elements added at the discretion of authors or collection maintainers. Such local information may be critical to the effective use of a particular collection, though the local character of such elements may not be of general interest or usefulness.
Of perhaps greater importance is the need to link Dublin Core records to other, richer description schemes (for example, MARC, or FGDC). The ability to link a simple description record to a richer description model provides a means to promote one record type to a more complete description as warranted, and also affords a more continuous axis of resource description (from simple to complex) to suit a variety of user or system needs.
Additionally, Dublin Core data address only one slice of the metadata pie (resource description for search and retrieval). Other types of description are desired as well: terms and conditions (who must pay what to whom, for example), archival status, administrative metadata, and others.
Finally, there are competing models of resource description that overlap the Dublin Core to one degree or another. The IAFA document template is an example of one such format, USMARC another, the TEI header a third. RFC 1807 [NEED LINK] is a bibliographic description format developed by Rebecca Lasher as part of the NCSTRL [NEED LINK] project, an electronic library for technical reports in computer science.
Workshop discussions on extensibility merged with the common recognition of multiple models of description, some of which would be complementary, some of which would be overlapping, some of which would be competing. No single format for resource description would fill all the needs, nor could such a monolithic model be maintained or managed. The consensus of the workshop converged on a need for an architecture that would accommodate the diversity of models and levels of description complexity that characterize the chaotic world of electronic resources.
The proposal that emerged from these discussions is known as the Warwick Framework. It is a container architecture for the aggregation and interchange of discrete metadata packages. Such an architecture will afford the opportunity to mix and match metadata sets, allowing rational deployment of many existing and emergent description models. The following section describes the essential features of the Warwick Framework in greater detail.
No single element set will satisfy all metadata requirements. Different communities of users or different application areas will require data of different elements and levels of complexity. The Workshop took as its starting point the Dublin Core, a simple scheme for what might be thought of as electronic bibliography. However, other application areas might require the fullness and structure provided by a MARC-type record, for example, or might have domain-specific descriptive requirements not addressed in the Dublin Core. At the same time, other types of data exist that were outside the scope of the Dublin Core, such as terms and conditions and evaluative data.
Satisfying the need for competing, overlapping, and complementary metadata models requires an architecture that will accommodate a wide variety of separately maintained metadata models. It was concluded that a container architecture for the interchange of metadata packages was required. A package is conceived as a metadata object specialized for a particular purpose. A Dublin Core based record might be one package, a MARC record another, terms and conditions another, and so on. This architecture should be modular, to allow for differently typed metadata objects; extensible, to allow for new metadata types; distributed, to allow external metadata objects to be referenced; and recursive, to allow metadata objects to be treated as 'information content' and have metadata objects associated with them.
Packages are typed objects. They may be primitive (a package is one of a number of separately defined metadata formats); indirect (a package is a reference to an external object); or a container (a package is itself a container, which may in turn hold further packages).
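The three kinds of package can be pictured as a small recursive data structure. The Python sketch below is illustrative only: the class names, field names, type labels such as "dublin-core", and the example URL are assumptions made for this example and are not taken from any Warwick Framework specification.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class PrimitivePackage:
    """A metadata set in one separately defined format."""
    metadata_type: str  # hypothetical type label, e.g. "dublin-core"
    content: str


@dataclass
class IndirectPackage:
    """A reference to a metadata object held elsewhere."""
    reference: str  # e.g. a URL


@dataclass
class Container:
    """Holds packages; since it is also a package, containers can nest."""
    packages: List["Package"] = field(default_factory=list)


Package = Union[PrimitivePackage, IndirectPackage, Container]


def primitive_types(container):
    """Walk a container recursively and list the types of its primitive packages."""
    found = []
    for package in container.packages:
        if isinstance(package, PrimitivePackage):
            found.append(package.metadata_type)
        elif isinstance(package, Container):
            found.extend(primitive_types(package))
        # Indirect packages are only references; resolving them is out of scope here.
    return found


# Hypothetical example: a Dublin Core record, a reference to an external
# MARC record, and a nested container holding terms and conditions.
nested = Container([PrimitivePackage("terms-and-conditions", "...")])
outer = Container([
    PrimitivePackage("dublin-core", "..."),
    IndirectPackage("http://example.org/records/1.marc"),
    nested,
])
```

Treating the container as one more package type is what makes the structure recursive: any code that walks a container also handles containers nested inside it.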
Several benefits flow from this approach:
The Warwick Framework is a high level container architecture: it makes no assumptions about the contents of the packages. Nor can it be assumed that clients (or agents) will be able to interpret all packages. To ensure such ability will require prior agreement. Conferees agreed that packages should be strongly typed and that a registry for metadata types will probably be required, perhaps along the same lines as the IANA registry for Internet Media Types (also known as MIME types).
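One practical consequence is that a client must be able to pass over packages whose type it does not recognise. The Python sketch below illustrates that behaviour with a purely local lookup table. It is not a design for the registry itself, and every name in it (`MetadataTypeRegistry`, `parse_attribute_value`, the type labels) is hypothetical.

```python
class MetadataTypeRegistry:
    """Maps a package type name to a function that can interpret its content."""

    def __init__(self):
        self._handlers = {}

    def register(self, type_name, handler):
        self._handlers[type_name] = handler

    def interpret(self, type_name, content):
        """Return the interpreted content, or None if the type is unknown."""
        handler = self._handlers.get(type_name)
        if handler is None:
            return None  # the client cannot interpret this package and skips it
        return handler(content)


def parse_attribute_value(text):
    """Parse 'Name: value' lines into a list of (name, value) pairs."""
    pairs = []
    for line in text.splitlines():
        if ":" in line:
            name, value = line.split(":", 1)
            pairs.append((name.strip(), value.strip()))
    return pairs


# Hypothetical client that understands only one package type.
registry = MetadataTypeRegistry()
registry.register("dublin-core", parse_attribute_value)

packages = [
    ("dublin-core", "Title: Example\nLanguage: en"),
    ("marc", "<opaque record>"),
]
results = [registry.interpret(type_name, content) for type_name, content in packages]
```

Strong typing is what makes this dispatch possible: the client decides from the type label alone whether to process a package or skip it, without inspecting the content.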
The requirements for an architecture and the architecture itself are described more fully in a companion article in this issue, The Warwick Framework -- A Container Architecture for Aggregating Metadata Packages.
The architecture needs to be realised in one or more concrete implementations. Proposals for MIME- and SGML-based implementations have been prepared, as well as a discussion of the architecture in a distributed object environment based on CORBA.
A registry agency for metadata object types needs to be established. Early implementation pilot projects should not be hampered by the lack of such an agency, but as more metadata sets are elaborated by various stakeholders, a formal means for managing changes will be important.
The Warwick Framework was enthusiastically welcomed at the workshop as a practical approach to the effective integration of metadata into a global information infrastructure. The realization of such an architecture will require great effort on many fronts, in many communities. The great hope is that the consensus achieved at this meeting will have provided the foundation for coordination, and sufficient freedom in the proposed architecture to allow progress without an undue burden of close coordination.
The following working papers address aspects of the Warwick Framework more fully:
Conferees left Warwick convinced that significant progress had been made in important areas. This conviction is corroborated by the rapid appearance of a number of documents supporting key decisions and recommendations.
The consensus concerning embedding metadata in HTML reached at the W3C workshop on Distributed Indexing and Searching [LINK] provides an encouraging impetus to rapid deployment of richer resource description techniques on the Web along the lines developed in the Warwick Workshop.
The recent appearance of a Dublin Core implementation based on these developments [LINK TO A.P. Miller's Archeology Project] is a promising indicator of the need and demand for better resource description on the Net, and of the speed with which such ideas can be promulgated with the strength of community consensus and a clear direction for development.
It is hoped that the Warwick Workshop will prove to have galvanized such a consensus and provided an important signpost for the development of more effective networked resource description.