[Mirrored from: http://www.ukoln.ac.uk/metadata/review]
This is a 'pre-publication draft' version of the following paper:
Review of Metadata Formats. Program Vol 30, Issue no. 4, October 1996.
In any citation please refer to the printed version.
Copyright to this work is retained by the author. Permission is granted for non-commercial reproduction of this work provided full acknowledgement of authorship and copyright is given.
Increasing use of the Internet has heightened awareness among the information community of the need to provide user-friendly searching and navigation tools that lead to quality information. An essential part of gaining effective access to Internet resources is to provide an index of available items, saving users time and reducing network load. Discussions on metadata are focused on the format of the record used as the basis for the index. Control of the vast number of resources on the Internet requires an appropriate record format (or formats) which will enable the resource to be adequately described and easily located; records must be compatible with an appropriate search engine, which in turn would ideally be compatible with an Internet search and retrieval protocol (e.g. Z39.50, whois++), and all components should conform to international standards. At present there are a number of formats which meet at least some of these criteria, each of which has its own strengths.
This paper intends to review a number of metadata formats in order to highlight their characteristics. The comparison will be done in the context of the requirements of bibliographic control, with reference to the suitability of the various record formats for this purpose. The author is a researcher working at UKOLN on the ROADS (Resource Organisation and Discovery in Subject-based services) project Ref 1, part of the eLib Electronic Libraries Programme, and a special concern is to establish a comparative context in which to discuss the IAFA template which is being used in that project. The choice of formats for comparison has been limited due to practical considerations. The formats chosen for consideration (MARC, IAFA templates, TEI headers and URCs) were chosen for their particular relevance to those working within the UK eLib projects. Other formats such as GILS (US Government Information Locator Service) and Harvest SOIF (Summary Object Interchange Format) would also merit investigation in the future.
Metadata in its broadest sense is data about data. The familiar library catalogue record could be described as metadata in that the catalogue record is 'data about data'. Similarly, database records from abstracting and indexing services are metadata (with a different variation on location data). However, the term metadata is increasingly being used in the information world to specify records which refer to digital resources available across a network, and this is the definition used within this paper. By this definition a metadata record refers to another piece of information capable of existing in a separate physical form from the metadata record itself. Metadata also differs from traditional catalogue data in that the location information is held within the record in such a way as to allow direct document delivery from appropriate application software; in other words, the record may well contain detailed access information and the network address(es).
There is the possibility that a particular field within a metadata record could be regarded as metadata in relation to the record itself. This would be the case if a field described characteristics of the record, e.g. information that the record is in USMARC format and has a unique record number. Particularly within discussions on the role of URCs (Uniform Resource Characteristics), this sort of data is referred to as meta-metadata, and there have been tentative proposals that this meta-metadata be stored as the URC. Ref 2
Caplan Ref 3 points out the benefits of using a 'new' term to describe internet resource records. There is no residual meaning attached to the term 'metadata' as opposed to the traditional connotations of 'catalogue record'. Coining a new term emphasises the differences inherent in records describing network resources and indicates that these records will be used outside the library cataloguing tradition.
It is worth considering the characteristics of network resources and how they differ from traditional 'hard' resources (physical items held within an institution) or electronic resources existing over a LAN (e.g. CDs).
Typically a catalogue record as it exists on a library OPAC refers to locations associated with that institution. A metadata record will refer to remote locations, often in no way associated with the institution. Details will be required regarding available access modes (e.g. whether FTP or HTTP) as well as access restrictions (e.g. passwords). Often the networked resource will reside at several locations on the Internet, and this is likely to increase with a growing number of mirror sites. In this way a metadata record resembles a union catalogue record, as it may refer to several locations.
The same document can exist in different formats e.g. Postscript, ASCII text. Should the metadata record regard this as one document or two? In general metadata formats allow for different versions of the same document to be described in the one record, whereas in the hard print version these would be regarded as different editions.
Data is often short lived on the Internet. Files are moved around on web servers and the original URL pointing to the resource becomes obsolete. In other instances authors change and develop documents with an existing URL, so that in effect many documents on the Internet are working documents.
Can metadata inform users of the status of the document in terms of quality, or is this a function of the collection of records in which any particular record appears? Within the book world, messages about the quality of a book are implied by the fact that it is contained within a particular collection, e.g. whether it is found in a second-hand bookshop as compared with the LSE library.
Documents remain on the Internet long after their sell-by date. Old versions of documents are not removed, or the information becomes out of date. Some metadata records contain record maintenance data to enable records to be reviewed, and to give the reader information on updates.
The indexer or cataloguer must decide at what level to analyse any traditional document, whether as a book or chapter, an article or a journal. In traditional cataloguing records the decision is bound up with the bibliographic record format. MARC records traditionally equate to 'complete' physical items e.g. books, CDs, videos, journal titles; and library OPACs traditionally contain records catalogued at this level. Opportunities to describe the library contents in more detail using Table of Contents or abstracts (BookData) have existed for some time but are not widespread for monograph material.
A similar decision has to be made with networked resources: do you index at the database level, or should each item within a database be included? How can web pages best be described when, for example, a single project is described but there are separate sections referring to people, technical details, and objectives? Is that one document or three?
There are also important issues here as regards duplication of indexing effort (why spend time describing the items in a database already searchable elsewhere? why replicate all the levels that exist on a series of inter-related web pages?). The emerging directory services attempt to address this problem by sharing indexing, thus a whois++ mesh shares indexing 'centroids'. Ref 4
In order to use networked resources other access information may be required in addition to the network address (URL), such as access restrictions or support contacts. This non-bibliographic information does not equate with the location details held in traditional record formats. In the UK no standards have been developed for location details for physical items although in the US there are agreed formats for holdings information. As for electronic resources, apart from the URL, there is little consensus on the description of access.
The following criteria have been chosen to compare record characteristics:
Who is actively using this record format? Is it associated with a particular professional interest or academic discipline?
The complexity of record creation will be reviewed. Are special skills required to formulate the records? Are the records designed to be created by the information 'publisher' or centrally by service providers?
The content of different metadata record formats can be compared from the aspects of structure and syntax, but perhaps most important is an evaluation of the usefulness and purpose of the information within them. The end user will judge on the basis of the usefulness of the record content.
In the UK MARC manual (1989) MARC records are described as involving three elements:
A similar split is used by Gredley and Hopkinson Ref 6 who describe bibliographic data formats as consisting of a defined physical structure, content designators, and content as governed by rules for formulation of the different data elements. Following these models metadata formats will be compared by:
In addition, other aspects of content particularly relevant to networked resources will be examined. It is useful to highlight the difference in emphasis on non-bibliographic as opposed to bibliographic data in the various formats; similarly, the formats vary in the manner in which location details are designated and the way electronic access methods are described.
This section will outline whether particular metadata formats can be carried by existing internet protocols; and whether a database of that metadata format can be searched using existing internet protocols.
Metadata is being actively explored in several standard making communities. At present there is no international standard for metadata (even within the USMARC environment the fields referring to networked resources are still under discussion). This section will review each metadata format as regards progress towards ratification as a standard.
Internet Anonymous Ftp Archive (IAFA) templates were designed to facilitate effective access to ftp (file transfer protocol) archives by means of describing the contents and services available from the archive. Over the last few years many organisations wanting to allow access to their data, whether documents, datasets, images or software, have made them available as archives accessed by anonymous ftp. The original IAFA template format has been developed for use with the whois++ protocol, chiefly through the instigation of Bunyip who are developing directory service software conformant to this protocol.
IAFA templates were designed by the IAFA working group of the IETF (Internet Engineering Task Force) and guidelines were published in the form of an internet draft in July 1995. Ref 7 Much of the driving force came from private companies, in particular from Bunyip as part of their development of internet navigational tools and directory services, and from Martijn Koster at Nexor as a personal initiative. The aim of the IAFA template designers was to construct a record format which could be used by ftp archive administrators to describe the various resources available from their own archives. Template formats were drawn up for the various categories of information present on ftp archives: images, documents, sounds; services such as mailing lists and databases; as well as mailing list archives, usenet archives, datasets and software packages.
The original intention was that each ftp site administrator would be responsible for ensuring that IAFA templates were available for each file on their archive. This information would be available for individuals visiting the archive and also, if ftp archive sites followed a common set of indexing and cataloguing guidelines, then it would be possible for software (such as Harvest) to automatically pick up the records. This is in fact happening in some implementations of the IAFA/whois++ templates, although in others records are being created centrally. The recently developed directory service software, whois++ Ref 8, allows search and retrieval of databases created in this way, and also offers the possibility of searching across multiple databases. Experimental work is being done using the Common Indexing Protocol (CIP), which gathers together a 'centroid' or summary from a number of databases to form an 'index server'. Ref 9 The index server contains an index of all unique attribute values contributed by the centroids, and searches can be referred from one index server to another by interlinking the servers in a mesh. Ref 10 Supporters of IAFA templates have widened the original aim, and the intention now is to devise a record format simple enough to be generated by the wide variety of individuals and organisations involved with creating resources on the Internet, whether on web servers or ftp archives. The underlying philosophy is that it must be the information providers who create metadata records if indexing of the Internet is to be a viable proposition. Given the instability of network resources, the alternative of centrally creating records would be a high cost option.
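The centroid mechanism can be illustrated with a minimal sketch. It assumes that a centroid is simply the set of unique word tokens found under each attribute, and that an index server maps each attribute/token pair to the servers that contributed it. The function names and host names are hypothetical and are not taken from the whois++ or CIP specifications.

```python
from collections import defaultdict


def build_centroid(records):
    """Summarise a database as the unique tokens seen under each attribute."""
    centroid = defaultdict(set)
    for record in records:
        for attribute, value in record.items():
            for token in value.lower().split():
                centroid[attribute].add(token)
    return centroid


def build_index(centroids_by_server):
    """Merge centroids into an index of (attribute, token) -> contributing servers."""
    index = defaultdict(set)
    for server, centroid in centroids_by_server.items():
        for attribute, tokens in centroid.items():
            for token in tokens:
                index[(attribute, token)].add(server)
    return index


def refer(index, attribute, token):
    """Return the servers to which a query on attribute=token should be referred."""
    return sorted(index.get((attribute, token.lower()), set()))


# Two hypothetical databases, each holding one record.
CENTROIDS = {
    "sosig.example": build_centroid([{"Title": "Social science gateway"}]),
    "omni.example": build_centroid([{"Title": "Medical gateway"}]),
}
INDEX = build_index(CENTROIDS)
print(refer(INDEX, "Title", "gateway"))
print(refer(INDEX, "Title", "medical"))
```

The index server in this sketch holds no records of its own; it only knows which servers are worth querying, which is what allows indexing effort to be shared across a mesh.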
There are now several implementations using IAFA/whois++ templates. The ALIWEB Ref 11 search system was the first to implement IAFA templates and it did so in the context for which they were originally designed. ALIWEB was set up as an experimental approach to providing access to ftp archives. Although technically successful, the effort required to encourage ftp administrators to create records describing their archives could not be sustained, and ALIWEB was integrated into the already established CUI W3 Catalog in order to encourage information providers. The future of ALIWEB remains uncertain but at present it is operational and is mirrored at various sites world-wide.
Bunyip are now leading development of a whois++ White Pages directory system, Digger, which uses whois++ templates, a variation on the IAFA templates. Ref 12 Within the eLib framework, so far two projects, SOSIG Ref 13 (Social Science Information Gateway) and OMNI Ref 14 (Medical Information Gateway), have launched their services using the ROADS software. ROADS uses IAFA templates for description of resources, and the next release in May 1996 will incorporate the whois++ protocol. Within the UK there is also an implementation at the HENSA Unix Archive Ref 15, University of Kent, which uses IAFA templates for a database containing information on parallel computing. At the University of Manchester, a volunteer effort, NetEc Ref 16, provides a database of resources in economics using the IAFA template as the basis for the record structure.
The main advantage of the IAFA templates is that they are easy to create. IAFA templates are designed for use in a distributed system of record creation and storage, so the simplicity of the records has been an underlying criterion in the design. Also, they have been designed in relation to the objects they are trying to describe and are not hidebound by practices relating to non-electronic data.
Records are held in simple ASCII text format and the syntax and semantics of data element names and values have been restricted to facilitate automated collection and indexing. Data elements are defined as attribute/value pairs and are of variable length. Attributes, record start and finish, and continuation lines are recognised by the structure of the text and by insertion of defined 'special' characters. So, for example, continuation lines are signified by the first character being '+' or '-', and records are delimited by blank lines. The simplicity of the record structure is paramount: there is no allowance for identification of subfields, nor for 'qualifiers' to be attached to attributes.
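The record syntax just described is simple enough to parse in a few lines. The following is a minimal sketch, not an implementation of the guidelines: it assumes each data element is written as `Name: value` on one line, treats '+' and '-' continuation lines identically (the guidelines may distinguish them), and uses illustrative attribute names and sample data.

```python
def parse_templates(text):
    """Parse blank-line-delimited records of attribute/value pairs.

    Each record becomes a dict mapping an attribute name to a list of
    values, since data elements may be repeated within a record.
    """
    records = []
    current = {}
    last_attribute = None
    for raw_line in text.splitlines():
        line = raw_line.rstrip()
        if not line:
            # A blank line closes the current record.
            if current:
                records.append(current)
            current, last_attribute = {}, None
            continue
        if line[0] in "+-" and last_attribute is not None:
            # Continuation line: append to the most recent value.
            current[last_attribute][-1] += " " + line[1:].strip()
            continue
        name, separator, value = line.partition(":")
        if not separator:
            continue  # not an attribute/value line; ignored in this sketch
        last_attribute = name.strip()
        current.setdefault(last_attribute, []).append(value.strip())
    if current:
        records.append(current)
    return records


SAMPLE = """Template-Type: DOCUMENT
Title: Review of Metadata Formats
Description: A comparison of record formats
+ for describing network resources
Keywords: metadata
Keywords: cataloguing

Template-Type: SERVICE
Title: Example gateway
"""

RECORDS = parse_templates(SAMPLE)
for record in RECORDS:
    print(record["Template-Type"], record["Title"])
```

Because every element is a flat attribute/value pair, the parser needs no knowledge of individual template types, which is the low entry cost the design aims for.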
The IAFA guidelines state that text is assumed to be in English using the standard ASCII character set although, using the whois++ protocol, it is possible to change character set within a template for a particular attribute pair by means of a system message. In the European context a more sophisticated means of character set negotiation is needed, but this could be overcome by having an agreed character set between particular clients and servers not subject to on-line negotiation.
The simplicity of the record structure leads to a 'flat' file structure, and the only way to identify data more closely is to proliferate the number of attributes. There is no means of indicating a parent/child relation between documents (analytics), nor to link documents with 'continued as' or 'replaced by'. However it could be argued that these links should be provided by the keyword and subject descriptor searching and should not be built into the record structure.
There are a number of different template types defined within the guidelines to describe the variety of network resources available:
Other template types are designed for use in the context of ftp archives to provide information about a particular ftp site:
The configuration files would be relevant for the automatic collection of records, and in a broader context, the service template would be used to describe free-standing resources.
Each record can only have one template type, but any of the other data elements can be repeated. It is intended that template types and data elements should be extensible, although extensions would not be inter-operable unless agreed between implementations.
Effort has been made to ensure the templates are 'human readable' which means less processing is required to make the data understandable. This helps to ensure there is a low entry cost to implement the templates. Attribute names are therefore written in full.
The content is deliberately limited in detail in order to ensure the record is simple to create. The content includes simplified bibliographic fields (title, author, publisher, language) but in addition IAFA records are characterised by containing a large amount of non-bibliographic information. The content of the record is designed to take advantage of the context in which the record will be used, so URL and e-mail links to authors and publishers are included.
The content includes detailed record management information including the date a record was created, the date for review as well as details of the creator of the record. This allows for the development of automated record maintenance procedures. It allows system administrators to keep track of rapidly changing resources and allows for quality checks to be carried out at regular intervals.
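An automated maintenance procedure of this kind might look like the sketch below. It assumes each record carries a review date written in the RFC 822 style (with four-digit years, as amended by RFC 1123); the attribute name `Record-Review-Date` is hypothetical and chosen only for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

REVIEW_ATTRIBUTE = "Record-Review-Date"  # hypothetical attribute name


def due_for_review(records, now):
    """Return the records whose review date is on or before `now`."""
    due = []
    for record in records:
        raw_date = record.get(REVIEW_ATTRIBUTE)
        if raw_date is None:
            continue  # no review date recorded; skipped in this sketch
        if parsedate_to_datetime(raw_date) <= now:
            due.append(record)
    return due


RECORDS = [
    {"Title": "Stable resource", REVIEW_ATTRIBUTE: "1 Dec 1996 00:00:00 +0000"},
    {"Title": "Volatile resource", REVIEW_ATTRIBUTE: "1 Apr 1996 00:00:00 +0000"},
    {"Title": "No review date"},
]

TODAY = datetime(1996, 6, 1, tzinfo=timezone.utc)
DUE_TITLES = [record["Title"] for record in due_for_review(RECORDS, TODAY)]
print(DUE_TITLES)
```

A service could run such a check at regular intervals and pass the resulting list to whoever is responsible for verifying that the resources still exist and are still accurately described.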
Every time an individual or organisation occurs in a record there are a number of common data elements required to describe them e.g. name, address, telephone number, e-mail address. These logically grouped data elements are termed clusters in the guidelines and can be used to save indexing time by creating the details once then referring to them by a unique handle. The IAFA guidelines define the content of clusters for both individuals and organisations. Clusters of data elements can be identified by a unique handle although it is dependent on the implementation how the cluster information is incorporated into the record. It would be possible to repeat the cluster data in each record containing the cluster, or to use the clusters in a more sophisticated way to enable the clustered data to be stored once only in the database.
Further proposals to extend the use of clusters have been circulated by Bunyip Ref 17 as part of the development of more detailed White Pages whois++ templates for use with the whois++ protocol. These proposals suggest definitions of further clusters at a lower level for names, phone numbers and addresses. In addition it is proposed that the user and organization cluster would be broken down to include record management details. This reflects the need to keep record maintenance information at the cluster level as well as the resource level.
Each record and cluster within the database is identified by a string of characters and/or digits unique within the system on which it resides. If the record or cluster is removed to another context (e.g. referred to within a centroid index) then handles must be adjusted to retain their uniqueness in the new system. This problem will remain until a universal URI scheme is agreed.
The IAFA templates identify roles of persons and organisations by prefixing the name of the cluster (or other attribute name) by the role definition e.g.
It would be possible to maintain a controlled list of terms defining roles for individuals and organisations, although this has not been developed in the original guidelines.
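The interaction between clusters, handles and role prefixes can be sketched as follows. This assumes a record refers to a cluster through an attribute of the form `<Role>-Handle`, and that an implementation chooses to expand the cluster into role-prefixed attributes when the record is displayed; the handle, names and addresses are invented for illustration.

```python
# Cluster data stored once, keyed by a handle unique within this system.
CLUSTERS = {
    "jsmith01": {"Name": "J. Smith", "Email": "j.smith@example.org"},
}


def expand_clusters(record, clusters):
    """Replace each '<Role>-Handle' reference with role-prefixed cluster data."""
    expanded = {}
    for attribute, value in record.items():
        role, separator, last_part = attribute.rpartition("-")
        if separator and last_part == "Handle" and value in clusters:
            for field, field_value in clusters[value].items():
                expanded[f"{role}-{field}"] = field_value
        else:
            expanded[attribute] = value
    return expanded


RECORD = {
    "Template-Type": "DOCUMENT",
    "Title": "Sample report",
    "Author-Handle": "jsmith01",
}

EXPANDED = expand_clusters(RECORD, CLUSTERS)
for attribute, value in EXPANDED.items():
    print(f"{attribute}: {value}")
```

Storing the cluster once means a change of e-mail address is made in one place, whereas repeating the cluster data in every record keeps each record self-contained at the cost of more maintenance.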
The guidelines set down that different versions of the same resource are described as variants. If a resource has 'the same intellectual content' it is taken to be the same resource regardless of language or text format (ASCII, Adobe, Postscript etc.).
The guidelines acknowledge that the content of particular fields must be standardised to allow for effective indexing and retrieval. The following data elements have rules defined for the form of content as specified:
e-mail addresses: RFC 822 Ref 18
host names: RFC 1034 Ref 19
host IP addresses: defined in guidelines
numeric values: defined in guidelines
dates/times: RFC 822 Ref 20 amended by RFC 1123 Ref 21
telephone numbers: defined in guidelines
latitude/longitude: defined in guidelines
personal names: BibTeX Ref 22
formats of resource: RFC 1521 Ref 23
The diverse locations of these rules, and the relative lack of detail compared to traditional cataloguing manuals, will inevitably lead to inconsistencies in practice. It remains to be seen whether the indexing and retrieval software can ameliorate the inconsistencies or whether 'simplified cataloguing rules' will need to be drawn up.
IAFA/whois++ templates are associated with the whois++ directory service protocol. Ref 24 This protocol fits closely with the IAFA template structure in that it passes attribute/value pairs and allows limits on search by template type, attribute, value or handle.
At present records based on IAFA templates do not fit well with the Z39.50 protocol, although databases of these records could be 'shoe-horned' into compliance. It would be possible to define an attribute set in Z39.50 for IAFA templates, although it would mean a change in the sort of data held in Z39.50 attribute sets. Existing attribute sets do not contain access and version (e.g. PostScript, Adobe) information as these are considered part of the 'document delivery' transaction, and are therefore not searchable. Alternatively, IAFA template fields could be mapped to a small number of bib-1 use attributes. The records could be delivered in SUTRS (Simple Unstructured Text Record Syntax) format. In addition, the Z39.50 protocol splits location information into a separate record (OPAC record) rather than combining location and resource description as in the IAFA template. This means the IAFA record would need to be split to conform to the Z39.50 protocol.
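The second option, mapping template fields to a small number of bib-1 use attributes, might be sketched as below. The choice of which template attribute maps to which use attribute is an illustrative assumption rather than an agreed profile; the numeric values are the bib-1 use attributes for title (4), author (1003), subject heading (21), abstract (62) and any (1016).

```python
# Illustrative mapping from template attribute names to bib-1 use attributes.
BIB1_USE = {
    "Title": 4,            # Title
    "Author-Name": 1003,   # Author
    "Keywords": 21,        # Subject heading
    "Description": 62,     # Abstract
}
BIB1_ANY = 1016            # fallback for attributes with no counterpart
USE_ATTRIBUTE_TYPE = 1     # attribute type 1 is 'Use' in bib-1


def to_bib1(attribute, term):
    """Express a search on a template attribute as a bib-1 attribute/term pair."""
    return {
        "attribute_type": USE_ATTRIBUTE_TYPE,
        "attribute_value": BIB1_USE.get(attribute, BIB1_ANY),
        "term": term,
    }


TITLE_QUERY = to_bib1("Title", "metadata")
FORMAT_QUERY = to_bib1("Format", "PostScript")
print(TITLE_QUERY)
print(FORMAT_QUERY)
```

The fallback to 'any' shows the cost of this approach: access and format details that are first-class attributes in the template lose their identity once squeezed into a bibliographic attribute set.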
The main reference is the Internet Draft 'Publishing Information on the Internet with Anonymous FTP'. This document is a working draft which has no status as a standard, however it is a well developed exploration of a metadata record format specifically designed for internet use. Both Bunyip and ROADS project workers are putting effort into editing and re-submitting the draft. Convergence of the IAFA and whois++ template structures is likely as more implementations interoperate in a whois++ mesh. Implementation of services using the templates should also provide impetus to further development and modification of the template, and will also provide justification for progress along the standards track.
As yet there is no agreed mechanism for controlling amendments and additions to the template structure. Establishing a means to communicate and control changes to the templates would be an essential step in the move towards a standard. Until then the tendency is for attributes to proliferate and for the overall structure to remain unstable.
The MARC (machine readable catalogue) format is not a single format but a family of formats, all with a similar record structure and similar method of tagging data, but with significant differences in the manner of implementation.
MARC is the most long-lived and highly developed of the metadata formats we are examining. It originated in the late 1960s as a response to the opportunities offered by computerisation of libraries and printing. MARC was a means to allow the exchange of catalogue records between co-operating libraries; it was a format for national bibliographies to use for printed bibliographic records; and it was used by bibliographic agencies for their supply of records to libraries. As library systems became computerised, MARC was used in library automation software as the basis for manipulating library records for display and indexing. There has always been some tension in MARC design between its various uses. The requirement for uniformity for exchange purposes conflicts with the requirement of individual libraries and national bibliographic services who wish to include specific local requirements (e.g. to facilitate bi-lingual descriptions, or to include local numbering schemes). This tension is manifest in the proliferation of MARC formats on a national basis; and within each national format further variations have emerged, e.g. in the UK, the SLS and BLCMP library automation systems use different variants of UK MARC. Some rationalisation is to be attempted by a three year programme for convergence of UK, US and Canadian MARC, with a target date of 1st January 1999.
Development of the MARC format reflects its use within libraries and national bibliographies. It has developed to deal with the structure of material typically held within a library i.e. it is focused on whole items held at a local level. It is oriented to documents, videos, sound-recordings at the 'object' level. Individual libraries add location details for their own holdings but there is no tradition of including service details, or contact details for remote resources. In addition each national format tends to deal with the material handled by the national bibliography of that country. For example UK MARC has tended to concentrate on monographs, whereas US MARC reflects the wider interest of the Library of Congress cataloguing policy and originally included separate treatment of non-book material such as maps, serials, and music, although a programme to integrate these formats is now progressing. The MARC format has not been adapted for secondary services (abstract and indexing records), nor has it been used to describe services or systems.
Over the last five years within the USMARC community there have been discussions and proposals considering changes in the format to enable cataloguing of networked resources. Both the US and UKMARC formats were amended for cataloguing computer files in the early 1980s, but even so the formats became increasingly inadequate as a means of describing networked resources, which needed to include details of access methods and addresses. After considerable debate within the USMARC community a new field, 856, has been adopted for the location of electronic resources. Guidelines for its use have been issued by the Library of Congress. The new field is now being used on an experimental basis in the Intercat project for cataloguing Internet resources. This project, scheduled to run over the period July 1995 to March 1996, is led by OCLC with partial funding from the US Department of Education. The project involves the participation of 200 libraries, more than 60% of which are academic libraries, with almost all active participants being in the US. There has been little involvement from UK libraries, despite the agreement to work towards convergence between US and UKMARC.
The Intercat project aims to build up a searchable catalogue of Internet resources on an experimental basis. The records will be integrated into individual library catalogues but are also made accessible by OCLC through a web-based search service. Ref 25 The Intercat catalogue contained a total of 2000 resources after a period of three months. This relatively small number (as a comparison, at that time OCLC's NetFirst database contained 40,000 records) partly reflects the co-operating libraries' selection criteria. Libraries select only those resources they wish to integrate with their MARC library catalogue, and they select resources only if they are of sufficient quality and stability to warrant the effort of cataloguing.
MARC format provides a means of integrating metadata into existing systems. National bibliographies, bibliographic record supply agencies, and individual libraries all have large collections of existing MARC records and want to integrate 'internet description' into their systems. For them MARC is the obvious choice as it means their basic retrieval software can still be used to offer an integrated solution. For example LC has a requirement to include electronic resources in its existing cataloguing services; OCLC needs to keep in line with the requirements of its members in order to provide distribution and updating of internet records; and individual libraries wish to include reference to electronic resources in their OPACs.
MARC records have become increasingly complex over the years as new requirements have been bolted on to the original format. The content of MARC records is strictly formulated and, in order to use it uniformly, cataloguers require training and experience. Typically cataloguers specialise in their own national version of MARC and may even specialise by type of material or subject content. A good quality MARC catalogue represents a significant investment and the level of investment required reflects the expectation that the record (and the resource to which it refers) will have a long lifetime. The complexity of editing also reflects the expectation that the location will be reasonably stable.
All MARC formats conform to ISO 2709 'Format for bibliographic information interchange on magnetic tape.' The original version was ratified in 1973 and revised in 1981. This standard defines a record structure but does not prescribe the content of the record within that structure. ISO 2709 states that a MARC record must consist of variable length fields with content designators; the record should have a record label, a directory, data field separators and record separators. Within these constraints different implementations of the standard have used different numbered tags and different subfield codes to identify the same type of bibliographic data.
The standard allows for optional indicators to appear after the tag and these are used in many MARC formats to qualify the tag. Within the text of a record, embedded subfield identifiers further identify data elements. ISO allows for fixed length data as part of the record label and this is used to store codes relating to material type, language, dates and so on. Optionally subrecords can be used for analytics; this has been implemented in UKMARC whereas USMARC uses tagged links to indicate relationships between parts of a collected work.
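The record label, directory and separators can be made concrete with a small round-trip sketch. It assumes the layout commonly used by MARC implementations of ISO 2709: a 24-character label, 12-character directory entries (3-character tag, 4-digit field length, 5-digit starting position), and the control characters 0x1E, 0x1D and 0x1F as field terminator, record terminator and subfield delimiter. It works on ASCII strings only, so character counts equal byte counts; the sample field content is invented.

```python
FIELD_END, RECORD_END, SUBFIELD = "\x1e", "\x1d", "\x1f"
LABEL_LENGTH = 24
ENTRY_LENGTH = 12


def build_record(fields):
    """Serialise (tag, data) pairs into a label, a directory and data fields."""
    directory, body = "", ""
    for tag, data in fields:
        data += FIELD_END
        # Directory entry: tag, field length, offset from the base address.
        directory += f"{tag}{len(data):04d}{len(body):05d}"
        body += data
    directory += FIELD_END
    base_address = LABEL_LENGTH + len(directory)
    total_length = base_address + len(body) + len(RECORD_END)
    # Label: record length, status/type codes, base address of data, entry map.
    label = f"{total_length:05d}nam  22{base_address:05d}   4500"
    return label + directory + body + RECORD_END


def parse_record(record):
    """Recover (tag, data) pairs by following the directory."""
    base_address = int(record[12:17])
    directory = record[LABEL_LENGTH:base_address - 1]  # drop the terminator
    fields = []
    for i in range(0, len(directory), ENTRY_LENGTH):
        entry = directory[i:i + ENTRY_LENGTH]
        tag, length, start = entry[:3], int(entry[3:7]), int(entry[7:12])
        begin = base_address + start
        fields.append((tag, record[begin:begin + length - 1]))
    return fields


FIELDS = [
    ("001", "12345"),
    ("245", "10" + SUBFIELD + "aReview of metadata formats"),
]
RECORD = build_record(FIELDS)
print(RECORD[:LABEL_LENGTH])
print(parse_record(RECORD))
```

In the 245 field the two characters before the first subfield delimiter are the indicators, and the character after the delimiter is the subfield code; nothing in the structure itself says what any tag, indicator or code means, which is why different national formats can share ISO 2709 while assigning different tags to the same data.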
The record structure defined in ISO 2709 largely reflects the requirements of computers in the early 1970s, when systems were tape based with a need to minimise data storage. The inclusion of fixed length coded data in the MARC record label was a space saving feature. In the same way the inclusion of a record directory, in effect duplicating the tagged field identifiers, was designed to save time locating data on tape. Although use of tapes is now rare, the record structure has not developed to take account of new exchange and storage media or other changes in technology. Coded data, unreadable by the user, and record directories, rarely used by programmers, would not be included by choice in a present day record structure. The MARC tags themselves are not user friendly and many library automation systems translate the tags to labelled displays for cataloguing input.
As stated by Gredley and Hopkinson Ref 26
"An important question is whether a highly formalised, bibliographically detailed record, with structured access points and precise division into fields and subfields, is appropriate or necessary for the practical applications of the majority of MARC records, especially in on-line catalogues."
Although it could be argued the MARC record structure is stultified, the high investment in present systems means there is a reluctance to change.
MARC records are designed for detailed bibliographic information. Although it is difficult to strictly demarcate between bibliographic and non-bibliographic information (such as price and classification number), the MARC record is highly developed for bibliographic and bibliographic-like data. Non-bibliographic data is unstructured and tends to be placed in notes fields.
The developments in USMARC to catalogue network resources are centred round the 856 field which is used to describe the location of the resource. The 856 field is designed to contain sufficient information to locate a networked resource and retrieve it. The field can also be used to link a MARC record to a networked electronic finding aid, in other words additional information about a hard copy resource such as an electronic Table of Contents. In addition the 856 field can be used for identifying and retrieving 'partial' resources where only a specified part of a complete resource is available in electronic form. The 856 field is a structured field with subfields describing method of access, and it can be repeated to allow for different access methods. The 856 indicator details the mode of access over the network (e-mail, ftp, Telnet, dial up) or if none of these (e.g. http, gopher, wais, prospero) then the mode can be defined in a subfield.
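As a rough sketch of how the indicator and subfields work together, the following derives an 856 field from a URL. The first indicator values (0 for e-mail, 1 for ftp, 2 for Telnet, 7 for a method named in a subfield) and the subfield codes ($2 for access method, $u for the URL) are given as I understand the USMARC definition and should be checked against the format documentation; the dictionary representation, the function name and the example host names are assumptions made for illustration. Dial up access is omitted because it has no URL scheme.

```python
from urllib.parse import urlsplit

# Assumed first-indicator values for access methods that have their own code.
INDICATOR_BY_SCHEME = {"mailto": "0", "ftp": "1", "telnet": "2"}


def make_856(url):
    """Build one 856 field describing a single access method for a resource."""
    scheme = urlsplit(url).scheme.lower()
    indicator = INDICATOR_BY_SCHEME.get(scheme, "7")
    subfields = []
    if indicator == "7":
        # No dedicated indicator value: name the access method in a subfield.
        subfields.append(("2", scheme))
    subfields.append(("u", url))
    return {"tag": "856", "ind1": indicator, "ind2": " ", "subfields": subfields}


# The field is repeatable: one occurrence per access method.
locations = [
    make_856("ftp://ftp.example.org/pub/review.txt"),
    make_856("http://www.ukoln.ac.uk/metadata/review"),
]
```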
USMARC is significantly different from other metadata formats in that it allows description of a hard copy and electronic resource within the same record. In practice some libraries are choosing to make separate records to avoid confusion.
In order to fully describe electronic resources other existing fields need to be used in addition to the new 856 address field. The MARBI Discussion Paper No. 49 Ref 27 presented a preliminary list of data elements to describe an online information resource which is developed in MARBI Discussion Paper No 54 (Providing Access to Online Information Resources). Ref 28 These papers map the required data elements onto USMARC fields and subfields. For example:
Type of resource 256$a File characteristics
Frequency of update 310$a Current frequency
Other providers of a database 582$a Related Computer File Note
The Intercat project has also involved discussion on its mailing list of the use of various other fields in this context. For example:
Detailed contents e.g. list of web links 505 $a Contents note
Access restriction notes 506 $a Restrictions on access note
Mode of connection and resource address 538 $a Technical Details note
The project is addressing the need to provide detailed information about resources without duplication of information in different fields.
As yet the 856 field has not been incorporated into UKMARC, nor has there been significant discussion regarding cataloguing of internet resources using UKMARC. At present there is an outstanding proposal from the British Library to adopt the USMARC 856 field as part of the convergence between UK and USMARC; this proposal will be considered in the first instance by the Book Industry Commission (BIC) Bibliographic Standards Working Party Technical Subgroup (see below).
The Intercat project has encountered the need to choose the level at which to catalogue resources, and there has been some discussion on the Intercat mailing list as to the best solution. Because of practical constraints on time, and the traditional ethos of MARC, the choice has tended to be to describe resources at a high level. However there have been some attempts to include hierarchies of web pages in MARC records by listings in the notes fields.
The bibliographic descriptions within the MARC fields strictly follow the rules set out in AACR2 Ref 29 and ISBD(G) Ref 30 and this is reflected in the structure of tags and subfields. However as the requirement for access gains in importance over description, the justification for strict adherence to AACR2 is becoming increasingly problematic. The increasing affordability of computing power and storage means more access points can be made available in each record and more descriptive content can be included. It could be argued that this lessens the need for strictly controlled description and display. In addition a record describing network resources might usefully contain details of several locations, versions and editions. Heaney's argument for adoption of object-oriented cataloguing proposes a move away from AACR2:
" .....as far as library catalogs are concerned, AACR2 is not so much a plateau of maturity as an evolutionary blind alley. It is a code of the age of the catalog card and printed bibliography. While such purposes and products still exist, AACR2 still has a place; however, for computerized library catalogs and in the world of the virtual library, it is time to retreat to first principles and initiate the development of AACR3 as an object-oriented cataloging code....." Ref 31
Those fields not controlled by AACR2 are formulated by other schemes: classification rules such as Dewey and UDC, and controlled lists of subject headings e.g. LC, National Library of Medicine. Within the 856 field, description of access methods, other than those taken from the indicator, follows the controlled vocabulary for internet media types (also known as MIME types) in RFC 1738: URLs. Ref 32 The content of coded fields is controlled by authorised lists contained in the national MARC manuals e.g. for lists of language codes.
The Z39.50 protocol which enables search and retrieval of bibliographic information over the internet is particularly designed to accommodate the search and retrieval of MARC records. The protocol can be used to pass searches of MARC fields from a Z39.50 client to a Z39.50 server fronting a database of MARC records; and retrieved records can be returned in MARC format. The Z39.50 protocol uses attributes to identify how search terms should be treated by the server in a search. The bib-1 attribute set is defined in the standard and within that set the 'Use Attributes' were designed to map onto bibliographic records such as MARC. The bib-1 Use Attribute set does not contain any location or other non-bibliographic data so it is not possible to search on these fields. The protocol does allow for delivery of MARC records in full or abridged versions. There is no attempt in the standard to identify whether records searched or delivered are in the US or UK or other MARC formats. This can cause problems for interoperability in, for example, author personal name searches where the name is stored differently in US and UK MARC; similarly because the 'flavour' of MARC format is not identified it is not easy for the client to vary the display depending on the MARC format of the retrieved records. The standard allows for delivery of holdings information in OPAC records. At present the electronic address (and other non-bibliographic information) is not part of the bib-1 attribute set and is not searchable, however it would be displayed in a retrieved MARC record.
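The role of the Use Attributes can be sketched as a lookup from an access point to a numeric attribute value. The values shown (1 personal name, 4 title, 7 ISBN, 21 subject heading, 1003 author) are a small selection from the bib-1 set as I understand it. The "@attr 1=..." text form is a prefix query notation used by some Z39.50 toolkits and is not part of the protocol itself, which carries queries in an encoded form; the function name and the access point labels are assumptions made for illustration.

```python
# A partial table of bib-1 Use attribute values (attribute type 1).
BIB1_USE = {
    "personal_name": 1,
    "title": 4,
    "isbn": 7,
    "subject": 21,
    "author": 1003,
}


def pqf(access_point, term):
    """Render one search term as: @attr 1=<use value> "<term>"."""
    if access_point not in BIB1_USE:
        raise ValueError(f"no bib-1 Use attribute known for {access_point!r}")
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return f'@attr 1={BIB1_USE[access_point]} "{escaped}"'


query = pqf("title", "metadata formats")
```

Because the table has no entry for the electronic address, a call such as pqf("location", "ukoln.ac.uk") raises an error in this sketch, which mirrors the point that such data can be delivered in a retrieved record but cannot be searched.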
The majority of library automation systems allow for input and retrieval in MARC format, even if the records are stored internally in another format. Any changes in MARC, particularly regarding the display of the 856 field, will need to be reflected in OPAC software.
The common feature of all MARC formats is that the record structure adheres to ISO 2709: 1981. In the US and UK, along with many other countries, national standards exist for MARC record structure which adhere to ISO 2709. In the UK the national standard appears as BS 4748: 1982 Ref 33 and in the US as Z39.2. Ref 34 The data content within MARC records as embodied in national formats is not covered by any internationally recognised standards but by 'de facto' standards controlled by national libraries. These take the form of cataloguing manuals outlining the formats and offering guidelines for their use.
The USMARC manual is issued by the Library of Congress Ref 35 and additions and amendments are controlled by the Library of Congress on the advice of the US MARC Advisory Group. This Group is made up of MARBI (The American Library Association's Machine-Readable Bibliographic Information Committee) and representatives from the US National Libraries, the National Library of Canada, the National Library of Australia, large bibliographic utilities (OCLC and RLIN), special library associations and library system vendors. The Library of Congress regularly publishes discussion documents and proposals for comment which are considered at the twice yearly MARC Advisory Group meetings and, if agreed, are published occasionally as updates to the MARC format.
The UKMARC Manual Ref 36 is published by the British Library and consists of a loose-leaf publication with several updates. The British Library National Bibliographic Service (NBS) is responsible for the UKMARC format. There are less complex procedures for agreeing amendments than in the US. The British Library (BL) introduced consultation procedures in 1992 Ref 37 whereby BL's proposals initially go for comment to the Book Industry Commission (BIC) Bibliographic Standards Working Party Technical Subgroup. The Subgroup is made up of UK representatives of the different library sectors, book suppliers, bibliographic utilities who are also library system vendors, as well as the NBS. This is followed by a period of public consultation with proposals included in the BL Interface NBS Technical Bulletin. After a period for comment the proposals may be adopted according to the final decision of the BL.
The Text Encoding Initiative Guidelines were published in 1994 as a result of an international research project which started in 1987. Burnard Ref 38 describes the goal of the project as
" to define a set of generic guidelines for the representation of textual materials in electronic form, in such a way as to enable researchers in any discipline to interchange and re-use resources, independently of software, hardware, and application area."
TEI is a joint project sponsored by three professional bodies: the Association for Computers and the Humanities, the Association for Computational Linguistics, and the Association for Literary and Linguistic Computing. The project was funded jointly from the US National Endowment for the Humanities and the European Union 3rd Framework Programme for Linguistic Research and Engineering. At present the project has two years more funding from the US for tutorial and dissemination work. The academic community in the US and Europe have been involved in the project forming a number of committees to consider different aspects of the encoding guidelines.
The TEI initiative aimed to reach agreement on encoding text across a range of disciplines. According to Giordano Ref 39
"It represents a major milestone - before the TEI it had not been possible to reach consensus among research communities about encoding conventions to support the interchange of electronic texts."
The TEI Guidelines, despite their origins in the humanities and linguistics, were designed to form an extensible framework which could be used to describe all kinds of texts. Burnard Ref 40 says
"... the word text should not be read too literally. The TEI is equally concerned with both textual and non-textual resources in electronic form, whether as constituents of a research database or components of non-paper publications."
The TEI Guidelines specify that every TEI text must be preceded by a TEI header that describes the text. The header specification was formulated as part of the project by the Committee on Text Documentation comprising librarians and archivists from Europe and North America and the overall layout is grounded in a cataloguing tradition.
The TEI header can be used in different operational settings. Firstly it can exist as part of a conformant text. In this context the header might be created by the author or publisher as part of the original encoding; or it might be created during the TEI encoding of an existing document when it is used in a research or archival environment. Researchers can use the header in the process of textual analysis or, as is the case in a growing number of text archives, TEI headers are used as a means of bibliographic control.
The TEI Guidelines suggest that headers can be used in a second way by those libraries, research sites and indeed text archives who wish to build up databases of records referring to TEI encoded text held at remote sites. The Guidelines lay down a framework for 'independent headers', that is headers that can be stored separately from the text to which they refer. Independent headers are free-standing TEI headers which can be used in catalogues or databases to refer to a remote TEI encoded text.
A third possibility, not outlined in the Guidelines, is that independent headers could be used to describe networked resources which are not necessarily themselves TEI encoded. It is in this third context that independent headers could be described as metadata in the sense defined in this review. (It is assumed that metadata should be capable of describing any networked resource, not that there must be a necessary relation between the structure of the electronic data in the resource and the metadata format.)
Nevertheless the majority of present implementations are in humanities archives e.g. Oxford Text Archive and the Electronic Text Centre at the University of Virginia. These implementations of TEI are at present involved with archives consisting of a small selection of texts chosen on particular 'quality' criteria.
The level of difficulty in creating TEI headers depends on the amount of detailed information entered in the header, and the conformance of the content to external rules such as AACR2. If an independent header is to be created which contains the same content as a MARC record with the same adherence to cataloguing practice then the same level of skill would be required as for library cataloguing. If the header is to include details on encoding, profile and revision (see below) then this also requires detailed knowledge of the text. However the ethos of the TEI Guidelines is flexibility: the level of encoding detail can suit the requirements of the situation. Thus it would be possible for an author or 'publisher' of an electronic text to create a simple TEI header. This header could then be elaborated if required by an archive administrator.
Although the Guidelines recommend that TEI independent headers should be detailed, this recommendation is in the context of an archive. It would be possible for metadata records to be created using simplified content not in conformance to AACR2. Indeed the need for a simplified version of the full guidelines has been recognised. A subset comprising a 'manageable selection' of the full DTD has now been issued as TEI Lite. Ref 41 This subset includes the majority of the TEI core tag set and is designed to be sufficient to handle most texts to a reasonable level of detail. TEI Lite is in use by the Oxford Text Archive for the encoding of its own texts.
The TEI Guidelines define textual features in terms of Standard Generalized Markup Language (SGML) elements and attributes, grouped into sets of tags. SGML is specified in an international standard ISO 8879-1986. SGML allows for a family of encoding schemes each with their own document type definition (DTD). Within TEI it is possible to build a customised DTD, appropriate to the document being encoded, by declaration of tag sets being used. The independent header has its own auxiliary DTD set out in the Guidelines.
SGML provides a framework for defining an encoding scheme in terms of elements and attributes (note that in SGML schemes these terms have particular meanings different from usage in other metadata). An element is a textual unit such as a paragraph; within the header an element would be a unit such as a title or author. An attribute gives information about a particular occurrence of an element and would be structured as an attribute/value pair e.g. in the Profile Description there is a 'textClass' element to identify the subject headings for a text. If a controlled vocabulary is used to identify the subject keywords then the scheme is identified by an attribute 'keywords scheme=LCSH'; for classification numbers schemes are identified in a similar way e.g. 'classcode scheme=DDC19'
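The distinction between elements and attributes may be easier to see in a small example. The fragment below is a simplified, hypothetical header written so that it is also well-formed XML and can be read with a standard XML parser; the Guidelines define the header in SGML, the nesting inside the keywords element is reduced here, and the titles, terms and class number are invented. The 'scheme' attribute identifies the controlled vocabulary in the way described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified header; all content values are invented.
HEADER = """
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>An example text</title>
      <author>A. N. Author</author>
    </titleStmt>
    <publicationStmt><p>Unpublished example</p></publicationStmt>
    <sourceDesc><p>Created in electronic form</p></sourceDesc>
  </fileDesc>
  <profileDesc>
    <textClass>
      <keywords scheme="LCSH"><term>Metadata</term><term>Cataloging</term></keywords>
      <classCode scheme="DDC19">025.3</classCode>
    </textClass>
  </profileDesc>
</teiHeader>
"""

root = ET.fromstring(HEADER)


def subject_terms(header):
    """Group keyword terms under the vocabulary named by the scheme attribute."""
    found = {}
    for keywords in header.iter("keywords"):
        terms = [term.text for term in keywords.iter("term")]
        found.setdefault(keywords.get("scheme"), []).extend(terms)
    return found


title = root.findtext("fileDesc/titleStmt/title")  # element content
class_scheme = root.find(".//classCode").get("scheme")  # attribute value
```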
The various elements in TEI are grouped into tag sets:
The tag sets are extensible to enable mark up of new sorts of material.
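The TEI header forms one of the two core tag sets available by default to all TEI DTDs. Presence of the TEI header is mandatory in a TEI encoded text. The TEI header is made up of four parts: the File Description, the Encoding Description, the Profile Description and the Revision Description.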
Within the header, elements may be indicated as being in free prose, or as being structured statements.
The role of the TEI and independent header is so flexible that it can include large amounts of detail to enable analysis of text or it can be used in a simplified version to provide a known audience with bibliographic access to a collection of documents. The independent header has the same structure as the TEI header but more guidelines on content. The independent header has more mandatory and recommended elements and the Guidelines recommend it should contain structured information rather than unstructured prose. Within this specification there is still a large amount of flexibility which might well lead to difficulties. It is desirable that all the headers in a particular database should have a comparable level of detail. Unless there is uniformity in the level of detail across the database, retrieval will suffer. This difficulty in controlling the level of detail would increase in a distributed environment and could lead to problems with interoperability and record sharing.
The File Description is the only mandatory part of the header and it contains a standard set of bibliographic elements: title, author, publisher etc. Within each element there is detailed bibliographic information e.g. the title statement includes information on intellectual responsibility specifying author, sponsor, funder, principal researcher, other contributions. Within the File Description the title, publication and source are mandatory for all TEI headers, but more elements are recommended as mandatory for independent headers.
The file description contains detailed structured information in the library cataloguing tradition
" The file description element of the TEI header has therefore been closely modelled on existing standards in library cataloguing; it should thus provide enough information to allow users to give standard bibliographic references to the electronic text; and to allow cataloguers to catalogue it." Ref 42
Within the Guidelines there is consideration of the conversion of TEI headers to USMARC records. The TEI Guidelines suggest the File Description could be used
" to generate bibliographic records..............whereas the profile, encoding and revision history could either be incorporated into the bibliographic record or be used as an attached codebook." Ref 43
The Guidelines include detailed suggestions for mapping particular TEI elements to USMARC tags, as does Giordano. Ref 44 But the Guidelines acknowledge that human intervention would be required to create a quality MARC record. There is no attempt in TEI markup to identify the author 'main entry', neither is the personal name format prescribed. Much of the non-bibliographic information would have no definitive resting place in MARC and would need to be moved to Notes fields.
In the independent header the usefulness of the profile, encoding and revision descriptions would be limited for analysis purposes unless the text was TEI encoded. Much of their usefulness depends on pointers in the electronic text to the header, relating information together. In the metadata context the file description would be most useful.
As with MARC the content is oriented to the description of 'physical objects' and there is no consideration within the Guidelines of the description of services. There is no provision for including location information within the header, and no consideration of library call numbers nor electronic addresses. However the flexible nature of TEI means that the tag sets could be extended to include this information.
Where structured information is included in appropriate elements then the Guidelines give rules which follow AACR2 and ISBD. Those elements that are unstructured contain free text.
Independent headers can be manipulated, searched and retrieved by any software that deals with SGML records e.g. Panorama, but as yet there is no provision within internet search and retrieve protocols for TEI headers. Some research work is proposed to incorporate SGML DTDs into the experimental URCs.
TEI headers are conformant to the international SGML standard.
'Dublin Core' is shorthand for the Dublin Metadata Core Element Set which is a core list of metadata elements agreed at the OCLC/NCSA Metadata Workshop in March 1995. The workshop was organised by OCLC and the National Center for Supercomputing Applications (NCSA) to progress development of a metadata record to describe networked electronic information. This workshop followed on from joint meetings and discussions of the American Library Association. The goals of the workshop are described by Weibel Ref 45 as
"(1) fostering a common understanding of the needs, strengths, shortcomings, and solutions of the stakeholders; and (2) reaching consensus on a core set of metadata elements to describe networked resources."
The workshop brought together a range of interested parties from different professional backgrounds and subject disciplines, all of whom had been involved with metadata issues. Weibel describes the attendees
"...fifty two librarians, archivists, humanities scholars and geographers as well as standards makers in the Internet, Z39.50 and Standard Generalised Markup Language (SGML) communities......" Ref 46
The Dublin Core workshop recognised that widespread indexing and bibliographic control of internet resources depends on the existence of a simple record to describe networked resources. The objective was to define a simple set of data elements so that authors and publishers of internet documents could create their own metadata records in a distributed way. The Dublin Core approach is to have the level of bibliographic control midway between the detailed approaches of MARC and 'structured' TEI, and the automatic indexing of locator services such as Lycos.
"Another solution, not yet implemented, that promises to mediate these extremes involves the manual creation of a record that is more informative than an index entry but is less complete than a formal cataloguing record." Ref 47
The Dublin Core is a set of elements that can be used to describe a resource but there is no attempt to prescribe a record structure. During the workshop there was an explicit decision taken not to define syntax at this stage.
Two particular constraints on the design of the Dublin Core element set were accepted by the workshop participants. Firstly the object of the element set is to describe 'document like objects' (DLOs), although the definition of what constitutes a DLO was left vague, as Caplan Ref 48 explains
"To me, an electronic text, map, or image would be a DLO. To some participants, only textual material qualified, while to others, computer systems or even people could be DLOs."
The second constraint accepted by the participants was that extrinsic data such as cost and details of access methods would be excluded from the element set. It was accepted that only elements for resource discovery would be included, not retrieval or request.
MARBI Discussion Paper No 86 (Mapping the Dublin Core elements to USMARC) looks at options and problems in matching Dublin Core to USMARC. Because Dublin Core elements are less specific than MARC, some fields cannot be sufficiently identified to tag them correctly. For example the author field in MARC is identified as being personal or corporate name, whereas Dublin Core does not make this differentiation.
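The loss of specificity can be shown with a small sketch. It assumes the familiar USMARC tags 245 (title statement), 100 (main entry, personal name) and 110 (main entry, corporate name); the function, its 'name_type' hint and the lower case element names are invented for illustration and do not reproduce the mapping proposed in the discussion paper. The choice between main and added entry is ignored.

```python
def map_element(element, name_type=None):
    """Return a USMARC tag for a Dublin Core element, or None if undecidable."""
    if element == "title":
        return "245"
    if element == "author":
        # Dublin Core does not say whether the author is a person or a
        # corporate body, so the tag depends on information from outside.
        return {"personal": "100", "corporate": "110"}.get(name_type)
    return None


unaided = map_element("author")                 # cannot be tagged
with_hint = map_element("author", "corporate")  # tagged once a cataloguer decides
```

Without the hint the author element cannot be assigned a tag, which is the kind of case where human intervention or a less specific field would be needed.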
Initial attempts to include consideration of Dublin Core elements as part of an IETF working group were not taken forward, on the grounds that the content of metadata records is outside the scope of IETF standards. However the Dublin Core elements have been considered by USMARC as central to their development of the USMARC record so the impact has already been seen in the formation of other metadata.
A second workshop is planned in the UK in April 1996 sponsored by UKOLN and OCLC to carry forward work on reaching a consensus of core data elements, and to increase international involvement.
URCs have a potential role as metadata formats but as yet they are at the pre-concept stage. The history of URCs is very much within the world of the Internet Engineering Task Force (IETF). The IETF Working Group on Uniform Resource Identifiers developed the concepts of Uniform Resource Names (URNs) as well as URCs, and looked at the way these could be used with Uniform Resource Locators (URLs) to form an effective means of locating resources on the Internet. Together the three devices for locating, naming, and resolving name to location (URL, URN and URC) are referred to as Uniform Resource Identifiers (URIs).
The IETF Working Group on URIs was disbanded in 1995 as the IETF felt their remit had become diffused. At present there are moves from Ron Daniel to form a new IETF Working Group on URCs and the proposed charter is under discussion on the URC mailing list. Daniel has issued various Internet drafts outlining possible scenarios for the role and content of URCs. Ref 49, Ref 50 Although some Internet experts see URCs as a solution to the addressing problem, there are others who regard metadata structures and content as outside the scope of IETF standards and are therefore wary of moving forward with URCs.
The proposed role of the URN is to provide a unique 'persistent' identification for a resource that is not dependent on location. There are now various pilot implementations of URNs (e.g. handles, PURLs) and some prospect of use in an operational setting. The URL though is being used in production services as both the naming and location device. The URL gives the address of a resource which includes (as a prefix) the access method and together this enables software to locate that resource. URLs are in use by all web browsers and are covered by the IETF standard RFC 1738. Whereas the URL and URN have moved from concept stage to implementation, the URC is still at a theoretical stage.
The Internet community has yet to reach consensus on the role of URCs. It is generally agreed that, if the URC is to be developed, then it must contain some level of descriptive data about the resource it identifies in order to assist with the resolution of URN to URL. But there is debate as to how much descriptive data should be included. On the one hand the URC could include detailed bibliographic information, as well as access data for a resource. On the other hand it could be used solely as a means of connecting a URN and URL together, and only contain sufficient meta-information to achieve this function e.g. the Internet media type of various instances of the resource access mechanism.
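The minimal role, in which a URC only connects a URN to its URLs, can be pictured with a purely hypothetical sketch. Since no URC structure has been agreed, the classes, field names, URN and URLs below are all invented; the only meta-information held is the Internet media type of each instance.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    url: str          # where this copy of the resource can be retrieved
    media_type: str   # Internet media type of this copy


@dataclass
class URC:
    urn: str
    instances: list = field(default_factory=list)


def resolve(urcs, urn, accept):
    """Return the URL whose media type comes earliest in `accept`, else None."""
    for urc in urcs:
        if urc.urn != urn:
            continue
        for media_type in accept:
            for instance in urc.instances:
                if instance.media_type == media_type:
                    return instance.url
    return None


# Hypothetical catalogue holding one resource in two formats.
catalogue = [
    URC("urn:example:review-1996", [
        Instance("ftp://ftp.example.org/pub/review.ps", "application/postscript"),
        Instance("http://www.example.org/review.html", "text/html"),
    ]),
]
url = resolve(catalogue, "urn:example:review-1996", ["text/html"])
```

A URC in the fuller, descriptive role would add bibliographic elements to the same record, which is where the debate about scope arises.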
The URC could be viewed as becoming a standard for the structural content of metadata, which would be used by different providers of metadata. Others see URCs rather differently, as a transport mechanism for existing metadata structures. Depending on which of these roles it fulfils, the URC would either be structured to contain the various bibliographic and non-bibliographic 'core' data elements itself, or it would become the carrier for other metadata formats and would itself only have sufficient structure to contain MIME type and record management information.
A further possibility is that there is no place for the URC, and that the URL can be further developed to include additional meta-information. Alternatively there could be a means to resolve URN to URL without the need for URCs as in the experimental PURLs.
The need for simplified record structure has been acknowledged, and there are metadata formats which offer possible solutions (IAFA templates, TEI Lite), but there is as yet little progress towards simplification of the rules for content. The options for content at present offer two extremes: to follow the AACR2 and ISBD cataloguing rules, or to use a poorly defined set of ad hoc rules. Although ad hoc rules may work in the short term for small collections, and even for large discrete databases, inconsistent usage will not favour cross database searching, nor the interoperability of 'centroid' style indexes.
There is little consensus on the level of complexity of semantic structure required in metadata. There is a need for metadata to cope with the different levels at which a network resource can be described (the granularity problem), with different versions in terms of 'editions' and 'formats', and different linking mechanisms to other records. Development of an effective syntax depends on some stability and agreement on these semantic structures.
A successful metadata format must allow for changes in record content to match the inevitable changes in the resources themselves. A metadata format needs to incorporate changes in addressing, as well as allowing for changes in the form of the resources themselves. This means there must be flexibility in the change control for the format.
It is unlikely that one format will satisfy the requirements of all stakeholders. Different disciplines and professional backgrounds favour different approaches to resource description, and different navigation and searching tools favour different record formats. So it would seem that, at least in the short term, there will be an increasing need for interoperability between different systems based on different metadata. In addition it is necessary to integrate 'new' systems describing electronic resources with legacy systems dealing with hard copy material.
In a distributed networking environment there is every possibility that metadata itself will become distributed. The simple barebones of resource identification and location may be held in one system (possibly the URC) with detailed bibliographic information held elsewhere (MARC records?); different versions and 'editions' of the resource might be described in yet a third record (IAFA template?). And for those interested in textual analysis, a TEI header would enable detailed searching of textual features.