The Warwick Metadata Workshop:

A framework for the deployment of resource description

Lorcan Dempsey
UKOLN, University of Bath, UK

Stuart L. Weibel
OCLC Office of Research, Dublin, Ohio, USA

June 30, 1996

Introduction
Moving the Dublin Core Forward
The Warwick Framework: an Architecture for Metadata
Proposals and Progress

1. Introduction

The first week of April 1996 found fifty representatives of libraries, Internet standards, text markup, and digital libraries projects converging at Warwick University to discuss advancing prospects for network resource description. The conferees came from three continents, eleven countries, and many perspectives in an effort to apply their collective experience to the clarification of issues surrounding the effective deployment of metadata for networked information resources.

The meeting was a follow-on to the previous year's OCLC/NCSA Metadata Workshop, which convened a similarly diverse collection of stakeholders and resulted in consensus on a simple resource description record that has come to be known as the Dublin Core. The most important deliverable of that first workshop was the consensus achieved among the groups represented. The thirteen elements of a Dublin Core record contain few surprises, focussing largely on what might be thought of as network resource bibliography and a little bit more [Weibel et al., 1995].

The idea received considerable attention in the year since the first meeting, but while the first workshop helped to focus discussion of the topic in many communities, the implementation of such a description record requires a formal syntax and deployment strategy that were beyond the scope of that first meeting.

Planning for the second workshop began with informal discussions between the UK Office for Library and Information Networking (UKOLN) and OCLC's Office of Research in the summer of 1995, and crystallized around the theme of identifying and resolving impediments to deployment of a Dublin Core style record for resource description. The expectations of the organisers and participants were exceeded as conferees worked towards a number of related conclusions about the Dublin Core Metadata Element Set, about the need for a wider set of metadata types, and about an extensible framework for interchange of metadata of different types. A consensus about this central set of issues emerged from the workshop, and more importantly, a set of concrete proposals for moving forward has been produced. These include:

Dublin Core

A concrete syntax for the Dublin Core expressed as an SGML DTD.
A mapping of this syntax to existing HTML tags, giving authors a consistent way to describe their own web documents through embedded descriptive metadata. Other mappings will be carried out in the future (to enable Dublin Core descriptions to be embedded in various image file formats and in PostScript's Structured Comments, for example).

Warwick Framework

The Warwick Framework: a container architecture for aggregating metadata objects for interchange.
Descriptions of how to implement this architecture in MIME, SGML and CORBA environments.

Guide to Creation and Maintenance of Metadata

Guide to authors for generating resource descriptions.
Guide to administrators of collections.

This paper provides a high-level overview of the issues discussed at the workshop. It brings together descriptions of the above outcomes, and places them in context. Section 2 discusses the Dublin Core and the proposals for taking it forward; Section 3 discusses the rationale for the Warwick Framework; Section 4 draws together the concrete proposals and actions which were the workshop's outcome.

2. Moving the Dublin Core forward

2.1 The Dublin Core

The Dublin Core Metadata Element Set is a set of thirteen metadata elements proposed by the first workshop as a core description record to facilitate discovery of document-like objects in a networked environment. To facilitate progress, a number of constraints were imposed on the discussion:

Only descriptive data elements required to support resource discovery were considered: the goal was to develop something very much like a bibliographic description for electronic resources, a simple resource discovery record that could be generated by authors of the data without extensive training. Data elements covering terms and conditions, archival status, and other varieties of metadata were not included.
Discussion was restricted to elements required for the discovery of document-like objects (DLOs), largely understood by example (for example, an electronic version of a newspaper article is a DLO, while an unannotated collection of 2x2 slides is not).
A widely understandable semantics was the goal; syntax was left deliberately unspecified to avoid becoming bogged down in the tar pits of implementation minutiae.
Extensibility was judged a key characteristic. The Dublin Core was not intended to replace other well-known resource description sets, but rather to act as a simple record with elements of commonly understood meaning that could help unify other more complex description schemes. Thus, it was judged essential to develop means for extension of Dublin Core elements and for linking it to other, richer description models.
Elements were defined to be optional, repeatable and modifiable. Elements can be modified by qualifiers (for example, an element can include a specified schema to identify a controlled vocabulary or rule set governing the element).

Table 1. The Dublin Core Elements
Subject	The topic addressed by the work.
Title	The name of the object.
Author	The person(s) primarily responsible for the intellectual content of the object.
Publisher	The agent or agency responsible for making the object available in its current form.
Other Agent	The person(s), such as editors, transcribers, and illustrators who have made other significant intellectual contributions to the work.
Date	The date of publication.
Object type	The genre of the object, such as novel, poem or dictionary.
Form	The physical manifestation of the object, such as PostScript file or Windows executable file.
Identifier	String or number used to uniquely identify the object.
Relation	Relationship to other objects.
Source	Objects, either print or electronic, from which this object is derived, if applicable.
Language	Language of the intellectual content.
Coverage	The spatial location and/or temporal duration characteristics of the object.
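
As an illustration only, the properties described above (elements that are optional, repeatable, and modifiable by qualifiers) can be sketched as a small data structure. The following Python sketch is not one of the workshop outcomes: the class names, the `scheme` qualifier field, and the sample values are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# The thirteen element names from Table 1.
DUBLIN_CORE_ELEMENTS = (
    "Subject", "Title", "Author", "Publisher", "Other Agent", "Date",
    "Object type", "Form", "Identifier", "Relation", "Source",
    "Language", "Coverage",
)


@dataclass
class ElementValue:
    """One occurrence of an element, optionally modified by a qualifier."""
    value: str
    scheme: Optional[str] = None  # hypothetical qualifier naming a vocabulary or rule set


@dataclass
class DublinCoreRecord:
    """Every element is optional and repeatable, so each name maps to a list."""
    elements: Dict[str, List[ElementValue]] = field(default_factory=dict)

    def add(self, name: str, value: str, scheme: Optional[str] = None) -> None:
        if name not in DUBLIN_CORE_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        self.elements.setdefault(name, []).append(ElementValue(value, scheme))

    def get(self, name: str) -> List[ElementValue]:
        # An absent element yields an empty list, reflecting optionality.
        return self.elements.get(name, [])


record = DublinCoreRecord()
record.add("Title", "The Warwick Metadata Workshop")
record.add("Author", "Lorcan Dempsey")
record.add("Author", "Stuart L. Weibel")  # the same element, repeated
record.add("Subject", "Metadata", scheme="example-vocabulary")  # illustrative qualifier
```

Extension elements or links to richer description models are deliberately left out of this sketch.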

The Dublin Metadata Workshop is described in greater detail in:

[Weibel et al., 1995] and [Weibel, 1995].

The reference description of the element set can be found at:

http://purl.org/metadata/dublin_core_elements

2.2 Target uses for the Dublin Core

The development of the Dublin Core is motivated by several intended uses:

A simple interchange format for descriptive metadata
Content self-description for networked objects
Semantic interoperability across domains

It is clear from early implementation experience that projects have employed Dublin Core semantics to develop simple resource description formats. The Dublin Core has suited those who need a format which is positioned between the terseness of the web crawler indexes and the fuller description of particular domain-specific formats (MARC, for example). It is full enough to support retrieval by a number of core attributes and to allow human users to make judgements about the likely utility of a resource before requesting it. At the same time, it is simple enough not to require specialist expertise or extended manual effort to create.

This latter feature is especially important in the context of the second target use mentioned here. Conferees recognized the importance of richer metadata embedded in Web documents to be harvested by software robots. The use of the Dublin Core as the basis for such data is seen as a critical success factor in its adoption. The ability to embed data in other objects was also seen as essential.

Future applications will have to work with different types of metadata from different sources. The first workshop identified a need for a generic semantics which could act as the basis for semantic interoperability between multiple description schemas. The Dublin Core was positioned to provide a unifying semantics across description models. Early implementations such as the NDIS application (described below) are examples of such a use.

2.3 Early Pilot Projects

Even absent a clearly defined syntax, the Dublin Core element set attracted the interest of a number of early adopters who developed projects that built on the consensus that emerged from the Dublin Metadata Workshop:

The Nordic Core

To be provided

TURNIP

The URN Interoperability Project (TURNIP), initiated by the DSTC in Australia, has produced a URN resolution service that utilises the Dublin Core element set for URC metadata. The Dublin Core elements are used to describe DSTC technical reports and are supplemented with administrative metadata elements (e.g., URC-Type, Date-Creation, Owner). Three main issues arose from this deployment of the Dublin Core: the need to group elements together, the need for a common syntax for the exchange of URCs, and the need for standards for element qualifiers.
More information on the TURNIP project can be found at:
http://www.dstc.edu.au/RDU/TURNIP/

OCLC

Research involving the Dublin Core Element Set at OCLC includes interfaces into online databases, systems for user-based resource description, and issues associated with automatically generated resource description and metadata harvesting based on the Dublin Core.

A preliminary evaluation of the Dublin Core Element Set as a search interface into complex information was conducted utilizing OCLC's WorldCat database. Similar experiments in a distributed environment are planned between OCLC's map collections and UCSB's Project Alexandria Database.

The Spectrum project is exploring various user interface issues associated with user-based resource description of electronic information based on the Dublin Core element set. This project, in part, is exploring various issues associated with the extensibility framework of the Dublin Core and the Warwick Framework for local, genre-specific descriptive information.

The Scorpion Project is an OCLC research project involved in automatic subject assignment based on the Dewey Decimal Classification system. A Spectrum-Scorpion dovetail provides both user-described and automatically generated classification metadata based on the Dublin Core Element Set. An extension of this project involves the automatic harvesting of internet resources to explore the feasibility of using the Dublin Core element set as an effective descriptive framework for distributed indexing of networked information.

The National Document and Information Service

The NDIS project (National Document and Information Service) is a joint development of the National Library of Australia and National Library of New Zealand aimed at providing a sophisticated search service to Australian and overseas databases, collection management services and state of the art document delivery services. The first phase of the project will implement a search and document request service across an integrated information resource of MARC based bibliographic data and a suite of indexing, directory and thesauri databases in a variety of encoding formats. Further information about the project can be located at: http://www.nla.gov.au/2/NDIS.

The NDIS project used the Dublin Core as a tool in determining generic metadata for bibliographic data, with extensions of the core element set or adoption of other metadata standards for non-bibliographic data. The creation of additional metadata can be viewed as extensions or separate core element sets.

The Dublin Core serves as a useful model for the generic storage and access requirements in cross-database searching. Its concept of qualification offers a model for normalising disparate data types, and for achieving search precision at the individual database level via specific schemas or types. The NDIS implementation utilises many principles of the Dublin Core, such as extensibility and modifiability, but differs on optionality, as only those metadata elements that intersect across data types are core information resource elements. Metadata intersecting a grouping of data or item types are considered "common" metadata element sets.

Mapping between the Dublin Core and MARC records

To be provided

Deployment of Dublin Core records in the Alexandria Project

The Alexandria Digital Library (ADL) is one of six NSF/NASA/ARPA-funded Digital Library Initiatives. ADL focuses on online access to spatial data; given that an estimated 90% of all spatial data is available only in hard-copy form, metadata is of prime importance. At the same time, ADL recognizes that a full cataloging record is not needed by the vast majority of general users. ADL therefore decided to translate Dublin Core fields into ADL fields, and to add fields required specifically for spatial data and specifically for hard-copy items. This set of fields is the default display set that general users see when they perform a search and then call up metadata.

2.4 Other Simple Resource Description Models

It is important to note that there are simple resource description models other than the Dublin Core that were discussed at the Warwick Workshop. Indeed, among the factors that motivated the Warwick Framework described later in this paper is the principle that there will be a variety of resource description models that emerge from different communities, and such models should be able to coexist.

Two such models that were discussed at the workshop are described below:

RFC 1807 (A Format for Bibliographic Records by R. Lasher and D. Cohen.)
This RFC [http://ds.internic.net/rfc/rfc1807.txt] defines a format for bibliographic records describing technical reports. This format is used by the Cornell University Dienst protocol and the Stanford University SIFT system.

RFC 1807 is a bibliographic record tailored to the needs of the NCSTRL project [http://www.ncstrl.org.]. It is targeted specifically to the description of computer science technical reports. As such it has many characteristics one would like to see in a resource description record for document-like objects.

IAFA templates and ROADS (Resource Organisation and Discovery in Subject-based services)

ROADS [http://ukoln.bath.ac.uk/roads/] is an eLib-funded project to implement software for resource organisation and discovery in subject-based services. The aim is to develop a sharable resource discovery system and, in particular, to fulfil the requirements of the eLib subject-based services. The intention is to involve information providers in resource description, as this is viewed as essential to a sustainable service. There are already two subject services in production, OMNI and SOSIG, using a prototype version of ROADS. The choice of standards for ROADS was based on the criteria of simplicity and availability, to allow for speedy start-up of the subject services. To this end we chose to use a simple attribute-value record structure based on the IAFA/whois++ template definition, and to use the whois++ directory service protocol for search and retrieval. A later version of ROADS will pilot implementation of the common indexing protocol (CIP) to allow for a distributed system of shared indexing. Our initial experience of deployment of the IAFA/whois++ template has allowed us to collect statistical information on the frequency of use of both bibliographic and administrative attributes. We hope this will provide useful feedback for the development of the whois++ template structure.

2.5 Impediments to Wider Deployment

Among the major goals of the Warwick Workshop was the identification of impediments to successful deployment of a simple Internet resource description format such as the Dublin Core. Early workshop discussions identified four areas requiring substantive progress:

Specification of a transfer syntax,
Development of user guides,
Identification of extensibility mechanisms, and
Specification of a framework to accommodate different varieties of metadata

Specification of a Transfer Syntax

Discussions of syntax are often difficult, burdened as they are with the biases of familiarity and competing methodologies. The Dublin Workshop made progress partly because such discussions were ruled out of scope. However, consensus concerning semantics cannot be deployed without a concrete syntax (or syntaxes). In pilot implementations, the absence of a common model led to different syntax and structuring choices. Clearly, any widespread deployment of Dublin Core (or any similar description scheme) hinges on reaching consensus about a transfer syntax.

Given that the Web is the primary medium of the electronic milieu, it was further recognized that deployment of metadata in the Web is the primary strategic application; successful deployment of metadata in HTML is necessary, though almost certainly not sufficient.

A working group on syntax formed around this issue and this group has elaborated a position paper describing a formal syntax for Dublin Core Metadata. A Syntax for Dublin Core Metadata (Burnard, Miller, Quin, and Sperberg-McQueen) includes:

A concrete syntax expressed as an SGML DTD
A mapping of this DTD into existing HTML tags using the META element of HTML 2.0
A proposal for 'keeping the metadata at arm's length' by allowing metadata consumers to recognise references to external metadata using the LINK element.

In related developments, a convention for embedding metadata in HTML was proposed in a break-out group at the W3C Distributed Indexing and Searching Workshop, May 28-29, 1996 [LINK TO DIST-INDEX WORKSHOP]. This break-out group included representatives of the Dublin Core/Warwick Framework Metadata meetings, representatives of several major Web search vendors (Lycos, Microsoft, WebCrawler), various other software vendors, and the W3 Consortium.

The problem is to identify a simple means of embedding metadata within HTML documents without requiring additional tags or changes to browser software, and without unnecessarily compromising current practices for robot collection of data.

While metadata is intended for display in some situations, it is judged undesirable for such embedded metadata to display on browser screens as a side effect of displaying a document. Therefore, any solution requires encoding information in tag attributes rather than as element content.

The goal was to agree on a simple convention for encoding structured metadata information of a variety of types (which may or may not be registered with a central registry analogous to the MIME type registry). It was judged that a registry may be a necessary feature of the metadata infrastructure as alternative schemata are elaborated, but that deployment in the short term could go forward without such a registry, especially in light of the proposed use of the LINK tag to link descriptions to a standard schema description as described below.

The solution agreed upon is to encode schema elements in META tags, one element per META tag, and as many META tags as are necessary. Grouping of schema elements is achieved by a prefix schema identifier associated with each schema element.

A convention for linking resource description tags to the reference definition of the metadata schema (or schemata) used in a document was also proposed. Doing so serves as a primitive registration mechanism for metadata schemata, and lays the foundation for a more formal, machine-readable linkage mechanism in the future.
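
The two conventions just described can be illustrated with a short Python sketch that generates the tags. It is a hedged illustration, not the text of the agreed convention: the function names, the `DC` prefix, the dot separating prefix from element name, and the `SCHEMA.` relation value are assumptions made for the example, and the convention document remains the authoritative source for the exact form.

```python
from html import escape
from typing import Iterable, List, Tuple


def render_meta_tags(schema_prefix: str,
                     elements: Iterable[Tuple[str, str]]) -> List[str]:
    """Emit one META tag per schema element.

    The schema prefix on each NAME groups the elements of one schema,
    and all data is carried in attributes so nothing displays in a browser.
    """
    tags = []
    for name, value in elements:
        qualified = escape(f"{schema_prefix}.{name}", quote=True)
        tags.append(f'<META NAME="{qualified}" CONTENT="{escape(value, quote=True)}">')
    return tags


def render_schema_link(schema_prefix: str, schema_url: str) -> str:
    """Emit a LINK tag tying the prefix to its reference definition."""
    rel = escape(f"SCHEMA.{schema_prefix}", quote=True)
    return f'<LINK REL="{rel}" HREF="{escape(schema_url, quote=True)}">'


head_lines = render_meta_tags("DC", [
    ("title", "The Warwick Metadata Workshop"),
    ("author", "Lorcan Dempsey"),
    ("author", "Stuart L. Weibel"),  # repeated element: one META tag each
])
head_lines.append(
    render_schema_link("DC", "http://purl.org/metadata/dublin_core_elements")
)
print("\n".join(head_lines))
```

Escaping attribute values keeps quotation marks and ampersands in titles from breaking the surrounding HTML.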

The proposed conventions are described more fully in [LINK TO HTML-META CONVENTION].

Development of User Guides

Resource descriptions might be created by a number of actors on the metadata use chain: authors (embedded HTML tags), site and collection administrators, third-party 'cataloguers'. Guidelines for the creation of metadata are needed. A guide for authors themselves would be especially useful in supporting a move to document-embedded descriptions, and at least one producer of HTML authoring tools (SoftQuad, Ltd.) has committed to embedding Dublin Core resource description templates in their products when the syntax and guidelines are sufficiently stable.

[NEEDS ELABORATION: the focus of the User Guides and where/when they will be published]

Extensibility -- Mixing and Matching Metadata

The Dublin Core addresses one particular niche of the metadata ecology. It is a simple resource description format that is intended to be extensible in at least two ways. As its name implies, it is intended to provide a commonly understandable core of elements that will help unify different models of resource description. Its simplicity is among its major virtues, but users may well wish to augment description of their resources with additional data.

Original concepts of extensibility for the Dublin Core assumed a mechanism for local extensions -- additional elements added at the discretion of authors or collection maintainers. Such local information may be critical to the effective use of a particular collection, though the local character of such elements may not be of general interest or usefulness.

Of perhaps greater importance is the need to link Dublin Core records to other, richer description schemes (for example, MARC, or FGDC). The ability to link a simple description record to a richer description model provides a means to promote one record type to a more complete description as warranted, and also affords a more continuous axis of resource description (from simple to complex) to suit a variety of user or system needs.

Additionally, Dublin Core data address only one slice of the metadata pie (resource description for search and retrieval). Other types of description are desired as well: terms and conditions (who must pay what to whom, for example), archival status, administrative metadata, and others.

Finally, there are competing models of resource description that overlap the Dublin Core to one degree or another. The IAFA document template is an example of one such format, USMARC another, the TEI header a third. RFC 1807 [NEED LINK] is a bibliographic description format developed by Rebecca Lasher as part of the NCSTRL [NEED LINK] project, an electronic library for technical reports in computer science.

Workshop discussions on extensibility merged with the common recognition of multiple models of description, some of which would be complementary, some of which would be overlapping, some of which would be competing. No single format for resource description would fill all the needs, nor could such a monolithic model be maintained or managed. The consensus of the workshop converged on a need for an architecture that would accommodate the diversity of models and levels of description complexity that characterize the chaotic world of electronic resources.

The proposal that emerged from these discussions is known as the Warwick Framework. It is a container architecture for the aggregation and interchange of discrete metadata packages. Such an architecture will afford the opportunity to mix and match metadata sets, allowing rational deployment of many existing and emergent description models. The following section describes the essential features of the Warwick Framework in greater detail.

3. An Architecture for Metadata: the Warwick Framework

3.1 The Need for the Warwick Framework

No single element set will satisfy all metadata requirements. Different communities of users or different application areas will require data of different elements and levels of complexity. The Workshop took as its starting point the Dublin Core, a simple scheme for what might be thought of as electronic bibliography. However, other application areas might require the fullness and structure provided by a MARC-type record, for example, or might have domain specific descriptive requirements not addressed in the Dublin Core. At the same time other types of data exist which were outside the scope of the Dublin Core: terms and conditions, evaluative data, for example.

Satisfying the need for competing, overlapping, and complementary metadata models requires an architecture that will accommodate a wide variety of separately maintained metadata models. It was concluded that a container architecture for the interchange of metadata packages was required. A package is conceived as a metadata object specialized for a particular purpose. A Dublin Core based record might be one package, a MARC record another, terms and conditions another, and so on. This architecture should be modular, to allow for differently typed metadata objects; extensible, to allow for new metadata types; distributed, to allow external metadata objects to be referenced; and recursive, to allow metadata objects to be treated as 'information content' and have metadata objects associated with them.

Packages are typed objects. A package may be primitive (one of a number of separately defined metadata formats); indirect (a reference to an external object); or a container (itself a container, holding further packages).
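
The three kinds of package can be sketched as a small recursive data model. This Python sketch is an illustration under stated assumptions, not the architecture's specification: the class names, the type labels, and the `example.org` reference are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class PrimitivePackage:
    """Metadata carried inline in one separately defined format."""
    package_type: str  # hypothetical label, e.g. "dublin-core" or "marc"
    data: str


@dataclass
class IndirectPackage:
    """A reference to a metadata object held elsewhere."""
    package_type: str
    reference: str  # location of the external object


@dataclass
class Container:
    """Holds packages; because it is itself a package, containers nest."""
    packages: List["Package"] = field(default_factory=list)


Package = Union[PrimitivePackage, IndirectPackage, Container]


def package_types(container: Container) -> List[str]:
    """Walk a container recursively, listing the type of each leaf package."""
    found: List[str] = []
    for package in container.packages:
        if isinstance(package, Container):
            found.extend(package_types(package))
        else:
            found.append(package.package_type)
    return found


container = Container([
    PrimitivePackage("dublin-core", "Title: The Warwick Metadata Workshop"),
    IndirectPackage("marc", "http://example.org/records/123"),  # hypothetical URL
    Container([PrimitivePackage("terms-and-conditions", "Access: unrestricted")]),
])
print(package_types(container))
```

Because a container is one of the package kinds, the same walk handles any depth of nesting without special cases.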

Several benefits flow from this approach:

It allows the solution space to be partitioned in a way that more closely resembles the problem space than if all requirements were to be met by a single format. [NEEDS CLARIFICATION]
It provides a framework in which metadata objects can be aggregated and exchanged in a consistent way.
It avoids the need to reinvent wheels or do redundant design work: the modular approach means that packages can be specialised for their particular function and that existing formats and best practice can be readily accommodated.
The particular aggregation of metadata objects can be optimised for particular content types. It can also be optimised for particular user groups: the user as client or agent, the user as end-user, the user as customer, and so on.
The architecture is extensible and can accommodate unanticipated requirements. It allows metadata objects to be treated as information objects with associated metadata, allowing, for example, terms and conditions to be applied to some or all of the metadata packages.

The Warwick Framework is a high level container architecture: it makes no assumptions about the contents of the packages. Nor can it be assumed that clients (or agents) will be able to interpret all packages. To ensure such ability will require prior agreement. Conferees agreed that packages should be strongly typed and that a registry for metadata types will probably be required, perhaps along the same lines as the IANA registry for Internet Media Types (also known as MIME types).
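
The consequence of strong typing for a client can be sketched as follows: the client acts on the package types it understands and passes over the rest. This is a minimal Python sketch; the type labels, the handler table, and the function name are assumptions for illustration and do not describe any actual registry.

```python
from typing import Callable, Dict, List, Tuple

Handler = Callable[[str], str]

# Hypothetical table of the package types this particular client understands.
KNOWN_TYPES: Dict[str, Handler] = {
    "dublin-core": lambda data: f"indexed {len(data.splitlines())} Dublin Core line(s)",
}


def process_packages(packages: List[Tuple[str, str]],
                     handlers: Dict[str, Handler]) -> Tuple[List[str], List[str]]:
    """Handle packages of known type; report the types that were skipped."""
    handled: List[str] = []
    skipped: List[str] = []
    for package_type, data in packages:
        handler = handlers.get(package_type)
        if handler is None:
            skipped.append(package_type)  # not interpretable by this client
        else:
            handled.append(handler(data))
    return handled, skipped


handled, skipped = process_packages(
    [
        ("dublin-core", "Title: The Warwick Metadata Workshop\nLanguage: en"),
        ("terms-and-conditions", "Access: unrestricted"),
    ],
    KNOWN_TYPES,
)
print(handled, skipped)
```

A shared registry of type names would let independent clients agree on what a label such as `dublin-core` means before they meet it in a container.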

The requirements for an architecture and the architecture itself are described more fully in a companion article in this issue, The Warwick Framework -- A Container Architecture for Aggregating Metadata Packages.

3.2 Impediments to Implementation

Concrete implementations

The architecture needs to be realised in one or more concrete implementations. Proposals for MIME- and SGML-based implementations have been prepared, as well as a discussion of the architecture in a distributed object environment based on CORBA.
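
To give a feel for what a MIME-based container might look like, the following Python sketch wraps two packages as parts of a single multipart message using the standard library's email package. It is not the proposal prepared for the workshop: the `x-dublin-core` and `x-terms-and-conditions` subtypes and the function names are hypothetical, and indirect (external) packages are not covered.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage
from typing import Iterable, List, Tuple


def build_container(packages: Iterable[Tuple[str, str]]) -> bytes:
    """Serialise (subtype, body) pairs as the parts of one multipart/mixed message."""
    container = EmailMessage()
    for subtype, body in packages:
        # add_attachment converts the message to multipart/mixed on first use
        # and labels the new part text/<subtype>.
        container.add_attachment(body, subtype=subtype)
    return container.as_bytes()


def list_package_types(raw: bytes) -> List[str]:
    """Parse a serialised container and report the content type of each part."""
    message = message_from_bytes(raw, policy=policy.default)
    return [part.get_content_type() for part in message.iter_parts()]


raw = build_container([
    ("x-dublin-core", "Title: The Warwick Metadata Workshop\n"),
    ("x-terms-and-conditions", "Access: unrestricted\n"),
])
print(list_package_types(raw))
```

Each part's content type plays the role of the package type, which is one reason a registry along the lines of the Internet Media Type registry is a natural fit.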

Registration

A registry agency for metadata object types needs to be established. Early implementation pilot projects should not be hampered by the lack of such an agency, but as more metadata sets are elaborated by various stakeholders, a formal means for managing changes will be important.

3.3 Moving Forward

The Warwick Framework was enthusiastically welcomed at the workshop as a practical approach to the effective integration of metadata into a global information infrastructure. The realization of such an architecture will require great effort on many fronts, in many communities. The great hope is that the consensus achieved at this meeting will have provided the foundation for coordination, and sufficient freedom in the proposed architecture to allow progress without an undue burden of close coordination.

The following working papers address aspects of the Warwick Framework more fully:

The Warwick Framework: a container architecture for aggregating metadata objects (Carl Lagoze, Clifford Lynch, and Ron Daniel)
An overview of the Warwick Framework Architecture.
A discussion of a distributed object implementation of the architecture.
A MIME implementation for the Warwick Framework (Jon Knight and Martin Hamilton)
A proposal for a concrete MIME implementation of the container architecture.
A Syntax for Dublin Core Metadata (Lou Burnard, Eric Miller, Liam Quin, C.M. Sperberg-McQueen)
Includes a proposal for a concrete SGML implementation of the Warwick Framework.

4. Moving Forward: Proposals and Actions

Conferees left Warwick convinced that significant progress had been made in important areas. This conviction is corroborated by the rapid appearance of a number of documents supporting key decisions and recommendations.

The consensus concerning embedding metadata in HTML reached at the W3C workshop on Distributed Indexing and Searching [LINK] provides an encouraging impetus to rapid deployment of richer resource description techniques on the Web along the lines developed in the Warwick Workshop.

The recent appearance of a Dublin Core implementation based on these developments [LINK TO A.P. Miller's Archeology Project] is a promising indicator of the need and demand for better resource description on the Net, and of the speed with which such ideas can be promulgated with the strength of community consensus and a clear direction for development.

It is hoped that the Warwick Workshop will prove to have galvanized such a consensus and provided an important signpost for the development of more effective networked resource description.

5. References and Bibliography

A Syntax for Dublin Core Metadata (Lou Burnard, Eric Miller, Liam Quin, C.M. Sperberg-McQueen)
A proposal for a concrete SGML implementation of the Warwick Framework.
On Information factoring in Dublin metadata records (C.M. Sperberg-McQueen)
The Warwick Framework: a container architecture for aggregating metadata objects (Carl Lagoze, Clifford Lynch, and Ron Daniel)
A MIME implementation for the Warwick Framework (Jon Knight and Martin Hamilton)
Report on Distributed Indexing Workgroup....Embedding Metadata
W3C Distributed Indexing Workshop Link
Guidelines for the preparation of Dublin Core metadata / Warwick Framework containers (John Kunze and Others)

Acknowledgements

The authors are indebted to many organizations and individuals that paved the way for this work and contributed substantively to the success achieved.

Hazel Gott, whose able organizational skills provided a superb working environment conducive to our task, and whose amiable hospitality made us all feel more at home.
UKOLN and OCLC, for providing staff time for organization and travel support for many attendees.
JISC for their support of UKOLN's MODELS project through which UK conferee attendance was supported.
CNRI and ERCIM, whose contribution of staff time and effort was a key factor in bringing together the ideas and people that made the workshop a success.
Finally, and most importantly, the attendees of this workshop, whose good faith and commitment to progress during and after the workshop are the bedrock on which this effort is founded.