[Archive copy mirrored from: http://ota.ahds.ac.uk/reports/metarep1.html]

Oxford Text Archive

Metadata for Electronic Texts Workshop Report

A report of the results of the metadata workshop organized by the Oxford Text Archive in Oxford, 2nd May 1997


This workshop was one of several organized by the five Arts and Humanities Data Service (AHDS) Service Providers, under the auspices of the AHDS and UKOLN. The aim of this particular workshop was to identify the metadata essential to finding electronic texts of interest to those working in the fields of literary and linguistic studies, encompassing texts of every type and period. We worked with a broad definition of what might constitute a text, so as to allow us to consider various forms of text collection (e.g. collected works, anthologies etc.), linguistic corpora, and other works (e.g. dictionaries, reference works, and so on).

The Oxford Text Archive (OTA) is already strongly committed to following the recommendations set out in the TEI Guidelines (TEI-P3) concerning the creation of electronic texts intended for interchange, especially with regard to the production and use of essential "header" information (which provides a mechanism for fully documenting the creation of an electronic text, details of any non-electronic source, and the processes of creation and revision). It is from these headers that the OTA is able to automatically generate subsets of metadata information, for interchange in whatever format might be required (e.g. Dublin Core, MARC, etc.). Since this workshop focused on the metadata essential for initial resource discovery, we did not expect to identify any major omissions from the large, extensible, and flexible set of (meta)data that it is possible to store in the form of a TEI header, although all comments were welcomed. However, it was still important to confirm that the OTA can readily provide users with the kinds of metadata that they would need in order to identify materials likely to be of relevance to their work.

Prior to the workshop, invitees were asked to read Paul Miller's short guidelines for workshop participants. Particular attention was drawn to the final statement in this document, namely that the workshop should:

attempt a mapping of these [subject-specific] concepts and processes to a Dublin Core-like model, deciding which are essential for initial resource discovery and which may be relegated to lower levels of metadata

This statement underpinned all of the discussions at the workshop, and is thus reflected in the contents of this report. It is important that readers of this report should keep this objective in mind, and remember that this workshop was not an attempt to identify all the possible metadata items which might play a part in resource discovery. Discrepancies arose concerning people's understanding of what constituted "initial" resource discovery, and the amount and types of metadata that they would like in response to an initial search query.


The Need for Resource Discovery

A fundamental issue emerged at the very start of this workshop, concerning the likely (discipline-specific) requirements of anyone who might want to use an electronic text. Although this workshop was intended to reflect the needs and desires of people working in "literary and linguistic studies", those present quickly agreed that it was not practical to regard this as a single discipline (as no-one felt in a position to comment on the needs of a community as diverse as, say, scholars working with modern English language texts, medieval Japanese texts, and ancient Greek texts, let alone the variety of approaches to linguistic scholarship). It was also very difficult to say what scholars might want to do with an electronic text, as this might range from simply reading some or all of it, through to performing sophisticated textual or linguistic analysis. It was therefore agreed that everyone would contribute on the basis of his/her own area of expertise, in the hope that this would be sufficient to identify the key issues in initial resource discovery.

Another difficulty which arose very early on in discussions was the problem of scope. For example, in an anthology of verse or the collected works of an individual playwright, should the metadata relate only to description at the collection level, or should each individual work within a collection also have its own descriptive metadata? If the latter, then in certain circumstances (e.g. a collection of works by the same author), perhaps certain metadata could be inherited from the collection-level description by each of the works that constituted the collection? Similarly, the collection-level metadata description should perhaps be sufficient to convey basic information about each of the individual works within the collection (but would this be feasible in the case of, say, an anthology of 500 poems produced by different authors?). These issues are of even greater concern when considering large-scale literary or linguistic corpora, which may contain many thousands of individual texts. The concept of scope also raised a number of related issues, such as the possible requirement to identify discrete resources (e.g. a number of specific texts within a corpus), and the need to know whether a resource was static or dynamic (i.e. liable to change), as knowing such information might aid initial resource discovery when searching across large volumes of material.
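The idea of inheriting collection-level metadata can be sketched in a few lines. The following Python fragment is an illustration only: the record layout (plain dictionaries keyed by Dublin Core-style labels), the function name effective_metadata, and all of the sample values are assumptions made for this sketch and were not proposed at the workshop.

```python
# A minimal sketch of collection-to-item metadata inheritance.
# The dict-based record layout and all sample values are hypothetical.

def effective_metadata(collection, item):
    """Return the item's metadata, falling back to collection-level values."""
    merged = dict(collection)  # start from the collection-level description
    merged.update(item)        # item-level values override inherited ones
    return merged


collection = {
    "TITLE": "Collected Plays",
    "CREATOR": "Example Playwright",
    "LANGUAGE": "eng",
}
item = {"TITLE": "An Example Play"}

record = effective_metadata(collection, item)
print(record["TITLE"], "/", record["CREATOR"])
```

This simple override rule suits a collection by a single author. For an anthology by many hands, inheriting CREATOR from the collection would be misleading, so a fuller scheme would need some way of marking which elements may be inherited.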

When considering existing electronic and non-electronic procedures for initial resource discovery, it is perhaps reasonable to say that the needs of most academics working in the fields of literary and linguistic studies are satisfied by, say, a conventional online library catalogue which provides basic search facilities for author/title, keyword, and subject. The initial enquiry might then be followed by calling up the complete library catalogue record and/or consulting a copy of the printed work. Having found the location of a particular printed work within a library, many academics also browse the surrounding books and shelves in an attempt to identify other works which might be of interest (which in some respects represents a somewhat more focused attempt at initial resource discovery). For the purposes of discussion at this workshop, the use of a library catalogue combined with a certain amount of focused browsing would appear to model most academics' approach to initial resource discovery. In light of the fact that online library catalogues are intended for use with texts and text-like objects, it ought to be possible (through appropriate cataloguing techniques) to integrate records of electronic texts into those for non-electronic sources, although it would be strongly desirable to have a simple means of distinguishing between the two types of text for the purposes of resource discovery. In all other respects, the basic information necessary for the successful discovery of non-electronic resources in literary and linguistic studies would also appear to be sufficient for discovering electronic texts.

The specific hurdles confronting the integration of electronic texts with the other kinds of resource of interest to those working in literary and linguistic studies are, by now, well-known. In the field of literary studies, many good editions of texts by canonical authors are readily available at relatively low cost, so it is likely to be some time before scholars will choose an electronic version of a text simply to read. Although electronic texts open up new avenues of scholarly enquiry (as well as easier ways to carry out certain types of conventional research), apocryphal wisdom suggests that many academics are not yet comfortable working with these types of resource and the necessary tools. The same is probably much less true of those working in linguistic studies (especially corpus linguistics), where a familiarity with computers, computer-based tools, and electronic resources would appear to be becoming increasingly commonplace. Concerns about issues of copyright and other types of right ownership are shared by both literary and linguistic studies, but such concerns are also felt in many other disciplines.


Current Initiatives

There are numerous existing standards (both de jure and de facto) that potentially relate to the use of electronic texts, although they tend to fall into two distinct types: those intended for resource discovery, and those meant to aid resource use. As mentioned above, many of the existing standards for resource discovery relate to electronic texts or text-like objects (e.g. library catalogue records), but from the point of view of the Oxford Text Archive the most interesting and relevant standards are MARC records, TEI headers, and the Dublin Core (for more information, see Appendix 2 of this report, which lists the papers that were pre-circulated for this meeting). Of these three, MARC is probably the best-known approach for integrating electronic text resources with catalogues of more conventional materials.

With regard to the use of electronic text resources there are many potentially relevant standards, although these are very much tied to how one wishes to use an electronic text resource (e.g. SGML applications for those concerned with explicitly identifying certain structures or features of a text, PDF for preserving the appearance, proprietary wordprocessing formats to assist reusability, other types of application-specific formats for text analysis, hypermedia presentations etc., page-description standards to standardize printing, and so on). The OTA favours the use of SGML because it is designed to facilitate the free movement of data between different types of application -- wordprocessing, text analysis, printing, database loading, etc. -- and in particular that application of SGML produced by the Text Encoding Initiative (TEI), which is intended for the preparation, interchange, and use of scholarly electronic text materials.

Given that the TEI offers good support for both resource discovery (via TEI headers), and resource use (via marked-up body text), then at the moment this would appear to be the most important standard within the fields of literary and linguistic studies. That it is also possible to map from a TEI header to produce MARC records or Dublin Core elements is also of particular interest to the OTA.
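A mapping from a TEI header to Dublin Core elements might be sketched as follows. This is an illustration under stated assumptions, not the OTA's actual procedure: it assumes the header has been serialized as well-formed XML (TEI P3 itself is an SGML application), the crosswalk table is hypothetical, and the header content is invented. The element names used (fileDesc, titleStmt, publicationStmt, sourceDesc) follow the file description of the TEI header.

```python
import xml.etree.ElementTree as ET

# Invented header fragment, assumed to be available as well-formed XML.
HEADER = """
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>An Example Text: an electronic edition</title>
      <author>Example, Author</author>
    </titleStmt>
    <publicationStmt>
      <publisher>Example Archive</publisher>
      <date>1997</date>
    </publicationStmt>
    <sourceDesc>
      <bibl>An Example Text. Example Press, 1850.</bibl>
    </sourceDesc>
  </fileDesc>
</teiHeader>
"""

# Hypothetical crosswalk: Dublin Core label -> path within the header.
CROSSWALK = {
    "TITLE": "fileDesc/titleStmt/title",
    "CREATOR": "fileDesc/titleStmt/author",
    "PUBLISHER": "fileDesc/publicationStmt/publisher",
    "DATE": "fileDesc/publicationStmt/date",
    "SOURCE": "fileDesc/sourceDesc/bibl",
}


def header_to_dc(header_xml):
    """Extract a Dublin Core-like record (label -> list of values)."""
    root = ET.fromstring(header_xml)
    record = {}
    for label, path in CROSSWALK.items():
        # Every matching element contributes one value, so repeats survive.
        values = [el.text.strip() for el in root.findall(path) if el.text]
        if values:
            record[label] = values
    return record


dc_record = header_to_dc(HEADER)
for label, values in dc_record.items():
    print(label, "=", "; ".join(values))
```

Elements with no counterpart in the header are simply omitted from the record. The mapping runs in one direction only: the flat record keeps no trace of the header's internal structure, so a header could not be rebuilt from it.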


Dublin Core

During discussions at the workshop, it was apparent that most people were happy to consider the Dublin Core as a basis for initial resource discovery. However, two concerns were frequently cited throughout the day -- exactly what is meant by "initial" resource discovery, and how much metadata should be returned as the result of any such initial enquiry? The consensus was that the more information that could be fed back to a user in response to an enquiry, the easier it would be for that person to identify the resources which are likely to be of interest.

The only alternative to the Dublin Core considered during the workshop was use of the TEI header. It was felt by many present that the TEI header provides an acceptable method of recording all sorts of (metadata) information about an electronic text, which can then be mapped into whatever information structure might be required (e.g. Dublin Core, MARC records etc.), although the reverse may not be equally true. This essentially endorses the current strategy and practice of the OTA. However, a number of people expressed the belief that TEI headers are too difficult to expect the average (i.e. non-expert) user to create, but it was subsequently agreed that this was a matter which fell outside the remit of this workshop, and was a topic which could usefully be addressed elsewhere and by other means (e.g. dedicated TEI header workshops, instructional literature, demonstration sample headers, and so on).

A number of extensions to the Dublin Core element set were discussed (see below), but many took the form of attributes added to a Dublin Core element which could carry values drawn from a pre-defined set of options (possibly based, for example, on the sorts of authority files and controlled vocabulary lists used to create MARC records). In the following discussion of the Dublin Core elements, the definition of each element supplied by Paul Miller is presented as emphasized text.


1 Title Label: TITLE
The name given to the resource by the CREATOR or PUBLISHER.

Two key issues were identified: the need for the contents of the TITLE element to be carefully checked against a list of uniform titles (in accordance with conventional library cataloguing practice), and the need to support searching on sub- or alternate-titles. This element also raised the issue of scoping mentioned above, as there was some debate about how to handle text collections (e.g. anthologies and corpora), for if the TITLE element were to be repeated or contain all the titles of the individual texts within a large collection, this might become unmanageable. It was felt that this element should be considered essential for initial resource discovery in the discipline area of literary and linguistic studies.


2 Author or Creator Label: CREATOR
The person(s) or organization(s) primarily responsible for the intellectual content of the resource. For example, authors in the case of written documents, artists, photographers, or illustrators in the case of visual resources.

It was agreed that this element should support a controlled list of responsibility statements, which could be used to explicitly identify the roles of all the persons to be identified as, in some capacity, a CREATOR of the resource (also note that this would be in addition to any persons identified via the CONTRIBUTOR label). Again, this raised the issue of the scope of any such statement when dealing with anthologies etc. This element should be considered essential for initial resource discovery.


3 Subject and Keywords Label: SUBJECT
The topic of the resource, or keywords or phrases that describe the subject or content of the resource. The intent of the specification of this element is to promote the use of controlled vocabularies and keywords. This element might well include scheme-qualified classification data (for example, Library of Congress Classification Numbers or Dewey Decimal numbers) or scheme-qualified controlled vocabularies (such as Medical Subject Headings or Art and Architecture Thesaurus descriptors) as well.

Discussions suggested that this element would be difficult to satisfy for certain kinds of purely literary resource, although it might well be more applicable for secondary sources (for example, there are many potential keywords for Shakespeare's play "Hamlet", but a text about the play might require only a handful of SUBJECT keywords). Linguistic resources are often described in terms of an extensive set of descriptive keywords, and so this element is clearly relevant and useful to this discipline.


4 Description Label: DESCRIPTION
A textual description of the content of the resource, including abstracts in the case of document-like objects or content descriptions in the case of visual resources. Future metadata collections might well include computational content description (spectral analysis of a visual resource, for example) that may not be embeddable in current network systems. In such a case this field might contain a link to such a description rather than the description itself.

As for SUBJECT, it was generally felt that this element would be far more relevant and applicable to those searching for linguistic resources, rather than those looking for purely literary materials (for the same reason: that it is difficult to see how one could provide a satisfactory DESCRIPTION of a complex literary work that would be useful to a broad range of potentially interested users).


5 Publisher Label: PUBLISHER
The entity responsible for making the resource available in its present form, such as a publisher, a university department, or a corporate entity. The intent of specifying this field is to identify the entity that provides access to the resource.

This element was felt to be desirable, but not essential, for initial resource discovery in both literary and linguistic areas of study. It was certainly believed to be an important piece of information that should be returned in response to any enquiry (whereas the occasions upon which it would play a crucial part in an initial resource discovery enquiry were generally felt to be rather rare).


6 Other Contributors Label: CONTRIBUTORS
Person(s) or organization(s) in addition to those specified in the CREATOR element who have made significant intellectual contributions to the resource but whose contribution is secondary to the individuals or entities specified in the CREATOR element (for example, editors, transcribers, illustrators, and convenors).

It was agreed that this element should support a controlled list of responsibility statements, which could be used to explicitly identify the roles of all the persons to be identified as, in some capacity, a CONTRIBUTOR to the creation of the resource (also note that this would be in addition to any persons identified via the CREATOR label). Again, this raised the issue of the scope of any such statement when dealing with anthologies etc. This element should be considered essential for initial resource discovery.


7 Date Label: DATE
The date the resource was made available in its present form. The recommended best practice is an 8 digit number in the form YYYYMMDD as defined by ANSI X3.30-1985. In this scheme, the date element for the day this is written would be 19961203, or December 3, 1996. Many other schema are possible, but if used, they should be identified in an unambiguous manner.

For the purposes of identifying literary resources the group felt it would be essential to be able to explicitly identify three dates: the original creation date of a work, the publication date of the relevant printed edition of that work, and the release date of the electronic version of the printed edition. In each case it would also be important to be able to identify date ranges, for example, in the case of an original work for which the creation date was uncertain. For linguistic materials such as language corpora, it would also be useful to know relevant date information which related to the scope of the collection (e.g. to capture the information contained in a conventional prose description, such as "Transcriptions of spoken British English collected between 1945 and 1975").
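The three dates and the requirement for date ranges could be modelled along the following lines. This Python sketch is offered only as an illustration: the class and field names are hypothetical, the sample values are invented, and the YYYYMMDD form is borrowed from the Dublin Core definition quoted above. A single known date is treated as a range whose start and end are equal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DateRange:
    """Closed range of YYYYMMDD strings; a single date has start == end."""
    start: str
    end: str

    def __post_init__(self):
        for value in (self.start, self.end):
            if len(value) != 8 or not value.isdigit():
                raise ValueError(f"not a YYYYMMDD date: {value!r}")
        if self.start > self.end:
            raise ValueError("range start is later than range end")

    def overlaps(self, other):
        # Fixed-width YYYYMMDD strings sort chronologically as plain strings.
        return self.start <= other.end and other.start <= self.end


@dataclass(frozen=True)
class TextDates:
    """The three dates identified by the group (field names are hypothetical)."""
    created: DateRange              # original creation of the work
    print_published: DateRange      # printed edition used as the source
    electronic_released: DateRange  # release of the electronic version


# Invented example: creation date uncertain, so it is recorded as a range.
dates = TextDates(
    created=DateRange("15990101", "16011231"),
    print_published=DateRange("19050101", "19051231"),
    electronic_released=DateRange("19961203", "19961203"),
)

query = DateRange("16000101", "16001231")  # "works created in 1600"
print(dates.created.overlaps(query))
print(dates.print_published.overlaps(query))
```

Searching on an overlap rather than an exact match means that a work of uncertain date is still found by any enquiry whose period touches the recorded range.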


8 Resource Type Label: TYPE
The category of the resource, such as home page, novel, poem, working paper, technical report, essay, dictionary. It is expected that RESOURCE TYPE will be chosen from an enumerated list of types. A preliminary set of such types can be found at the following URL: http://www.roads.lut.ac.uk/Metadata/DC-ObjectTypes.html

The group were somewhat sceptical about the usefulness of the proposed Dublin Core object types mentioned above, and instead recommended the use of one of the many existing controlled vocabulary lists, such as those used by conventional library cataloguing staff to describe genres of literary resources. For the purposes of initial resource discovery the group felt that this would probably be useful, but not essential, to those working with literary materials -- however, it would be essential for those working in linguistics, who might wish to identify particular types/genres of works as the initial step in building a corpus of works for further study.


9 Format Label: FORMAT
The data representation of the resource, such as text/html, ASCII, Postscript file, executable application, or JPEG image. The intent of specifying this element is to provide information necessary to allow people or machines to make decisions about the usability of the encoded data (what hardware and software might be required to display or execute it, for example). As with RESOURCE TYPE, FORMAT will be assigned from enumerated lists such as registered Internet Media Types (MIME types). In principle, formats can include physical media such as books, serials, or other non-electronic media.

The group agreed that this information should be considered essential for initial resource discovery.


10 Resource Identifier Label: IDENTIFIER
String or number used to uniquely identify the resource. Examples for networked resources include URLs and URNs (when implemented). Other globally-unique identifiers, such as International Standard Book Numbers (ISBN) or other formal names would also be candidates for this element.

The group agreed that this information should be considered essential for initial resource discovery, on the understanding that additional attributes would be available to make explicit which identifying scheme was being used to identify a resource.
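One way of making the identifying scheme explicit is to carry it alongside the value. The sketch below is an assumption-laden illustration: the "SCHEME:value" text convention, the names Identifier and parse_identifier, and the sample identifiers are all invented for this example and do not come from the Dublin Core definition or from the workshop.

```python
from typing import NamedTuple


class Identifier(NamedTuple):
    scheme: str  # e.g. "URL" or "ISBN"; the set of schemes is an assumption
    value: str


def parse_identifier(text):
    """Split a hypothetical 'SCHEME:value' string on its first colon only."""
    scheme, sep, value = text.partition(":")
    if not sep or not scheme.strip() or not value.strip():
        raise ValueError(f"no scheme given in {text!r}")
    return Identifier(scheme.strip().upper(), value.strip())


# Invented sample identifiers.
ids = [
    parse_identifier("URL:http://example.org/texts/123"),
    parse_identifier("isbn:0-00-000000-0"),
]
for ident in ids:
    print(ident.scheme, ident.value)
```

Splitting on the first colon only keeps a URL intact, since everything after the scheme label is treated as the value.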


11 Source Label: SOURCE
The work, either print or electronic, from which this resource is derived, if applicable. For example, an html encoding of a Shakespearean sonnet might identify the paper version of the sonnet from which the electronic version was transcribed.

This was felt to be essential for both literary and linguistic studies. Throughout the course of discussions, it was apparent that the group would like to know as much as possible about the SOURCE of any resource (see the remarks about DATE, above), and that this information should be as comprehensive and well-structured as the metadata for the resource itself. Perhaps in practice this could only be achieved by having some sort of pointer to where the metadata for the SOURCE is held?


12 Language Label: LANGUAGE
Language(s) of the intellectual content of the resource. Where practical, the content of this field should coincide with the Z39.53 three character codes for written languages. See: http://www.sil.org/sgml/nisoLang3-1994.html

The group felt that this information would be very useful for literary studies, but essential for linguistic research. However, we were unclear about what would happen in the case of, say, a work or collection of works that contained electronic texts in more than one language.


13 Relation Label: RELATION
Relationship to other resources. The intent of specifying this element is to provide a means to express relationships among resources that have formal relationships to others, but exist as discrete resources themselves. For example, images in a document, chapters in a book, or items in a collection. A formal specification of RELATION is currently under development. Users and developers should understand that use of this element should be currently considered experimental.

The group felt that it would be useful to have this kind of information reported back as part of the results of any enquiry, but that it was not of itself vital to initial resource discovery. The group was not certain if this would be the appropriate place to express other kinds of relations familiar to those studying literary materials (e.g. an adaptation by X of Y's translation of a work by Z), or whether this would be conveyed in (an extended version of) SOURCE.


14 Coverage Label: COVERAGE
The spatial locations and temporal durations characteristic of the resource. Formal specification of COVERAGE is currently under development. Users and developers should understand that use of this element should be currently considered experimental.

The group were unclear about how this label might be usefully applied to literary materials, especially if DATE was extended in the manner described above. It would clearly be of relevance and use to people looking for linguistic materials (e.g. "samples of spoken British English collected in Newcastle and the north east of England"), and would probably be considered essential for initial resource discovery by those working in this discipline area.


15 Rights Management Label: RIGHTS
The content of this element is intended to be a link (a URL or other suitable URI as appropriate) to a copyright notice, a rights-management statement, or perhaps a server that would provide such information in a dynamic way. The intent of specifying this field is to allow providers a means to associate terms and conditions or copyright statements with a resource or collection of resources. No assumptions should be made by users if such a field is empty or not present.

The group agreed that this information should be considered essential to initial resource discovery. There was some brief discussion about the potential usefulness of having some sort of simple coding or classification scheme, such as that already employed by the OTA, which could be used to identify the status of a resource (e.g. public domain, only to be used for private research etc.) in a straightforward manner. However, it was agreed that this would probably be extremely difficult to implement in practice, as information providers may wish to impose all sorts of subtle constraints and conditions on the use of their materials. It was also felt that it might be necessary to allow for the inclusion of multiple RIGHTS statements, as different conditions may pertain to an electronic text, the illustrations it contains, the way it is distributed and used, and so on.


Further Work

The group agreed that there was probably very little work (on behalf of the OTA) that would need to be undertaken in the short term. It would be useful for the OTA to describe which elements in a TEI header could usefully be mapped to an extended/adapted Dublin Core framework, and this could be done by drawing on some exploratory work already undertaken on the OTA's behalf.


Towards a Model

In order to meet the needs of the communities of literary and linguistic scholars served by the OTA, an AHDS resource discovery system should consider two perspectives. From the point of view of end-users, if the proposed AHDS system can support initial resource discovery on the basis of the framework of the Dublin Core (preferably extended in light of the comments made above), and particularly those elements identified as essential, then that would appear to be satisfactory. With regard to the OTA's position as an AHDS Service Provider, it would be necessary to document the relationship between elements of the TEI header and the appropriate elements of the Dublin Core. The OTA would also need to know how much of the information that might be contained within a TEI header could, or should, be mapped into a Dublin Core framework for the purposes of satisfying an initial resource discovery enquiry (for example, a TEI header allows for extensive documentation of the rights and responsibilities of both rights holders and end-users, far more information than would probably be required or expected as the result of an initial enquiry).



Readers should consult Appendix 2 - Pre-circulated papers.



Appendix 1 - List of Participants

The following persons attended the workshop:

The staff of the Oxford Text Archive were also present: Michael Popham (Head of the OTA), Alan Morrison (OTA Information Officer), and Jakob Fix (OTA Computing Officer).


Appendix 2 - Pre-circulated Papers


Appendix 3 - Feedback on this Report

None to date.


Appendix 4 - Sample Records

20 examples applying your Dublin Core-like model to real or imaginary records relating to Charles Dickens, as discussed at the AHDS Catalogue working group meeting, 19 March 1997.