[Mirrored from: http://www.uic.edu/~cmsmcq/tech/metadata.syntax.html]
This document summarizes a set of recommendations concerning the representation of metadata, derived from discussion within the syntax working group which met at the second Metadata Workshop, held at Warwick University in April 1996. The discussion begun in Warwick has been continued electronically by the current authors, and this paper presents both the recommendations agreed on by the syntax working group in Warwick and some further developments for which the authors alone are responsible.
In brief, the syntax working group recommended:
Discussions in Warwick also led to an informal demonstration of how SGML could be used as the mechanism for encoding the containers and metadata packages foreseen in the Warwick Framework. A sample DTD for such packages is given in section 4 The Warwick Framework DTD fragment.
The following criteria were advanced as desirable features in whatever syntax is to be used in defining a standard format for metadata:
The following functional requirements were identified for the syntax:
The ability to carry out down-translation to the proposed scheme from existing metadata schemes (specifically richer formats such as MARC, TEI, IAFA templates) was assumed.
There was no discussion of how the following additional requirements might be achieved, though there was a general feeling that they were all highly desirable:
The problem of grouping, inheritance, and their meaning is discussed further in a paper by C. M. Sperberg-McQueen, "On Information Factoring in Dublin Metadata Records," which is accessible on the World-Wide Web at http://www.uic.edu/~cmsmcq/tech/metadata.factoring.html.
This section presents the various possible approaches discussed at the workshop, whether actively recommended by the syntax working group or not. A fuller treatment of some of them is also presented in a paper written by Eric Miller following the first Metadata Workshop (see Issues of Document Description in HTML, available at http://www.oclc.org:5046/~emiller/publications/metadata/issues.html).
The syntax working group recommended that authors, publishers, and site managers be encouraged to provide metadata in HTML documents by means of HTML <meta> elements embedded in their documents. More elaborate metadata can be provided if the metadata records are external to the HTML document, as described below, but for information providers with limited ambitions, the method described here is recommended.
The assumption here is that existing browsers and search engines cannot be expected to accommodate any variation from current practice. Any additional features must be transparent to existing software and authoring practices.
The <meta> element of HTML2 should be used, with name and content attributes set to the metadata element's name and value respectively.
<html>
<head>
<title>On the pulse of the morning</title>
<meta name="title" content="On the pulse of the morning">
<meta name="publisher" content="University of Virginia Electronic Text Center">
<meta name="otheragent:transcriber" content="University of Virginia Electronic Text Center">
<meta name='date(ISO)' content="1993-01-23">
<meta name="objectType" content="poem">
<meta name="form" content="1 ASCII file">
<meta name="form(IMT)" content="text/ASCII">
<meta name="source" content="Newspaper stories and oral performance of text at the Presidential inauguration of Bill Clinton">
<meta name="language(ISO 639)" content="en">
...
</head>
<body>
<h1>On the pulse of the morning</h1>
...
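As an illustration of how a metadata consumer might read such a document, the following minimal sketch collects the name and content pairs from the <meta> elements using Python's standard html.parser module. The class and function names (MetaCollector, collect_meta) and the shortened sample are inventions for this illustration, not part of the working group's recommendation.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect (name, content) pairs from <meta> elements."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name is not None and content is not None:
            self.pairs.append((name, content))


def collect_meta(html_text):
    """Return the metadata pairs found in html_text, in document order."""
    parser = MetaCollector()
    parser.feed(html_text)
    parser.close()
    return parser.pairs


# Shortened, hypothetical version of the example document.
SAMPLE = """<html> <head> <title>On the pulse of the morning</title>
<meta name="title" content="On the pulse of the morning">
<meta name='date(ISO)' content="1993-01-23">
<meta name="language(ISO 639)" content="en">
</head> <body> <h1>On the pulse of the morning</h1> </body> </html>"""

pairs = collect_meta(SAMPLE)
```

Software that does not know about these conventions simply ignores the <meta> elements, which is the transparency this approach relies on.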
Advantages: No change is needed to existing browsers or search engines. Any set of attribute-value pairs can be represented.
Disadvantages: No constraint can be imposed on the semantics of the attribute names used, and name clashes may occur. Other, possibly inconsistent, conventions are already established for use of the <meta> elements by other agents. This could however be overcome by using a prefix such as "DC:", e.g.
<meta name='DC:date(ISO)' content="1993-01-23">
The order of <meta> elements within the <head> element is not significant, and elements cannot be grouped, though a sufficiently determined imagination might conceive of something like the following:
<meta name='DC:groupStart' content='group number 42'>
<meta name='DC:something' content="something else">
<!-- more metas here -->
<meta name='DC:groupEnd' content='group number 42'>
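A consumer of such a convention would have to reconstruct the groups from the flat sequence of pairs. The sketch below shows one way this might be done, assuming the pairs have already been extracted in document order; the function name group_pairs is hypothetical, and the groupStart/groupEnd names are only the imagined convention shown above, not an agreed one.

```python
def group_pairs(pairs):
    """Split (name, content) pairs into ungrouped pairs and labelled groups.

    Assumes the hypothetical DC:groupStart / DC:groupEnd markers and
    relies on the pairs arriving in document order.
    """
    ungrouped = []
    groups = {}
    current = None  # label of the group being collected, if any
    for name, content in pairs:
        if name == "DC:groupStart":
            if current is not None:
                raise ValueError("nested groups are not supported in this sketch")
            current = content
            groups[current] = []
        elif name == "DC:groupEnd":
            if content != current:
                raise ValueError(f"unexpected end of group {content!r}")
            current = None
        elif current is None:
            ungrouped.append((name, content))
        else:
            groups[current].append((name, content))
    if current is not None:
        raise ValueError(f"group {current!r} was never closed")
    return ungrouped, groups


example = [
    ("DC:title", "On the pulse of the morning"),
    ("DC:groupStart", "group number 42"),
    ("DC:something", "something else"),
    ("DC:groupEnd", "group number 42"),
]
ungrouped, groups = group_pairs(example)
```

The sketch works only because it depends on the order of the elements, which is exactly what HTML does not guarantee to be significant; this is the weakness of the approach.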
Miller's paper, referred to above, also suggests prefixing a group of <meta> tags which together make up a metadata description of this kind with a particular labelling <meta> tag such as
<meta name='citation' content="Dublin Core">
Without additional attributes such as source and type, considerable `overloading' of the attribute values is necessary to contain all the information available in the Dublin core. Even in this trivial example, it has been necessary to introduce some arbitrary syntax (the use of the colon and parentheses) to distinguish parts of the name attribute.
Furthermore, attribute values are limited in length by the value of LITLEN (1024 according to the official SGML declaration for HTML2), or by other arbitrary limits imposed by particular browsers. A literal cannot contain any tags which a browser might recognize, so another syntax must be invented if subfields of Dublin core elements are required.
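The overloading of the name attribute means that every consumer must agree on how to take the names apart again. As a rough sketch, the colon and parenthesis conventions used in the example above might be decoded as follows. The grammar assumed here (an optional "DC:" prefix, an element name, an optional ":type" and an optional "(scheme)") is inferred from the examples only and is not a defined standard; the names NAME_PATTERN and split_name are hypothetical.

```python
import re

# Assumed grammar, inferred from the examples: [DC:]element[:type][(scheme)]
NAME_PATTERN = re.compile(
    r"(?:DC:)?"                      # optional namespace-like prefix
    r"(?P<element>[A-Za-z]+)"        # Dublin core element name
    r"(?::(?P<type>[^()]+))?"        # optional type after a colon
    r"(?:\((?P<scheme>[^()]*)\))?"   # optional scheme in parentheses
)


def split_name(name):
    """Return (element, type, scheme); missing parts are None."""
    match = NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognised metadata name: {name!r}")
    return match.group("element"), match.group("type"), match.group("scheme")
```

For instance, split_name("date(ISO)") yields the element date with the scheme ISO and no type. Any change to the punctuation conventions would require changing every such decoder, which is why distinct attributes are preferable.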
For more complex metadata records, an unstructured series of <meta> elements will not suffice; the syntax working group recommended, therefore, that metadata consumers recognize references to external metadata from within the HTML <head> element.
This approach involves keeping the metadata in a distinct document. Because the metadata is independent of the form of the data proper, free-standing metadata can describe with equal facility documents in HTML, ASCII, SGML, PDF, or proprietary formats, as well as images, sound files, maps, etc. Clear endorsement of free-standing metadata, and the construction of metadata catalogs, is thus important for ensuring that metadata is usable for objects on the net which are not also objects on WWW.
Two variants of the encoding syntax for metadata were discussed at the meeting: in the first, the metadata document uses existing HTML elements. In the second it uses some other syntax better suited to the requirements listed above. At the Warwick meeting, the workgroup agreed that this syntax should be expressed using an SGML DTD, and this is the approach which has been followed below. However, there is no reason why some other syntax that meets the functional requirements outlined above could not be invented for this purpose.
A one-way linkage in HTML documents, for example, is effected using the <link> element:
<html> <head> <link rel='metadata' href='pulse.meta'> </head> ... </html>
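A Dublin core-aware client would need to find this link and resolve it against the address of the document. The following sketch shows one possible way of doing so with Python's standard library; the class and function names and the base URL http://example.org/poems/pulse.html are assumptions made for the illustration, and the sketch stops short of actually fetching the metadata.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetadataLinkFinder(HTMLParser):
    """Collect the href of every <link> whose rel includes 'metadata'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated link types.
        rel = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "metadata" in rel and href:
            self.hrefs.append(href)


def metadata_urls(html_text, base_url):
    """Return absolute URLs of the external metadata for a document."""
    finder = MetadataLinkFinder()
    finder.feed(html_text)
    finder.close()
    return [urljoin(base_url, href) for href in finder.hrefs]


PAGE = "<html> <head> <link rel='metadata' href='pulse.meta'> </head> </html>"
urls = metadata_urls(PAGE, "http://example.org/poems/pulse.html")
```

Here urls holds the single absolute address http://example.org/poems/pulse.meta.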
Separating the metadata from the document makes it easy for existing browsers and search engines to ignore it if they wish, while those which are Dublin core-aware can access and process it effectively with no additional cost. On the other hand, there may be significant additional costs in ensuring that metadata and data are kept in step and consistent.
The next two sections discuss what exactly might be the contents of the object referenced by pulse.meta.
The attribute-value-class triples needed for the Dublin core can be mapped on to any appropriate HTML element. At the meeting, the <DL> element was suggested, as in the following example:
<html>
<head><title>Metadata for the Nice Pome</title></head>
<body>
<dl>
<dt>title</dt> <dd>On the pulse of the morning</dd>
<dt>publisher</dt> <dd>University of Virginia Electronic Text Center</dd>
<dt>otheragent:transcriber</dt> <dd>University of Virginia Electronic Text Center</dd>
<dt>date:created/ISO</dt> <dd>1993-01-23</dd>
<dt>objectType</dt> <dd>poem</dd>
<dt>form</dt> <dd>1 ASCII file</dd>
<dt>form/IMT</dt> <dd>text/ASCII</dd>
<dt>source</dt> <dd>Newspaper stories and oral performance of text at the Presidential inauguration of Bill Clinton</dd>
<dt>language/ISO 639</dt> <dd>en</dd>
</dl>
</html>
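To show what reading such a record would involve, here is a minimal sketch that pairs each <dt> with the <dd> that follows it. It assumes, as in the example, that every <dt> and <dd> carries an explicit end tag; the names DefinitionListReader and read_definition_list are hypothetical.

```python
from html.parser import HTMLParser


class DefinitionListReader(HTMLParser):
    """Pair each <dt> term with the text of the following <dd>."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._field = None  # "dt" or "dd" while inside one of them
        self._text = []     # text chunks of the current field
        self._term = None   # most recent <dt> text awaiting its <dd>

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._field = tag
            self._text = []

    def handle_data(self, data):
        if self._field is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != self._field:
            return
        text = " ".join("".join(self._text).split())  # normalise whitespace
        if tag == "dt":
            self._term = text
        elif self._term is not None:
            self.pairs.append((self._term, text))
            self._term = None
        self._field = None


def read_definition_list(html_text):
    reader = DefinitionListReader()
    reader.feed(html_text)
    reader.close()
    return reader.pairs


SAMPLE = """<dl> <dt>title</dt> <dd>On the pulse of the morning</dd>
<dt>date:created/ISO</dt> <dd>1993-01-23</dd> </dl>"""
records = read_definition_list(SAMPLE)
```

Nothing in HTML itself tells the reader that this particular <dl> holds metadata rather than an ordinary glossary, so the interpretation rests entirely on convention.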
Advantages: Metadata is cleanly separated from the data. Problems consequent on using attribute values to represent element content are no longer a concern and more powerful structuring abilities (e.g. nesting, repetition) are potentially available.
Disadvantages: Almost anything can go into a metadata description. (Unenforceable) conventions need to be established about how the metadata descriptions are to be mapped to HTML elements. It is not clear, for example, how to represent the SOURCE and TYPE attributes of the Dublin Core without extending HTML2.
This suggested approach did not gain much support from the syntax working group and is not recommended.
The syntax working group recommended the preparation of an SGML DTD for Dublin-Core metadata records; one such DTD is described below in section 3 The Dublin Core DTD fragment.
The Dublin DTD defines specific elements for the 13 core elements, each of which bears attributes for type and scheme.
Using this syntax, the above example might look like this:
<!DOCTYPE dublinCore PUBLIC '-//OCLC//DTD Dublin core v.1//EN'>
<dublinCore>
<title>On the Pulse of Morning</title>
<author>Maya Angelou</author>
<publisher>University of Virginia Electronic Text Center</publisher>
<otherAgent name='transcriber'>University of Virginia Electronic Text Center</otherAgent>
<date name='created' scheme='ISO'>1993-01-23</date>
<objectType>poem</objectType>
<form>1 ASCII file</form>
<form scheme='IMT'>text/ASCII</form>
<source>Newspaper stories and oral performance of text at the Presidential inauguration of Bill Clinton</source>
<language name='ISO 639'>en</language>
</dublinCore>
Advantages: The syntax makes explicit the semantics of each Dublin core element. Distinct attributes can be defined for scheme and type. Element content could include other tags if subfields are required.
Disadvantages: Only Dublin core elements are provided (but there is an extension field). Discrete packages of metadata cannot be identified and the semantics of repeated elements are not specified.
At the workshop, the authors suggested applying SGML not only to the encoding of Dublin-Core records but also to the creation of metadata packages and containers, as defined in the architectural proposals for the Warwick Framework. This section summarizes the relevant points.
The Warwick Framework DTD builds on the notion of discrete packages of metadata elements discussed at the Warwick Workshop. One such package might contain Dublin core elements; others might contain specialised elements appropriate to other kinds of metadata, or references to other components using other (possibly non-SGML) notations.
No specific package types additional to the Dublin core were discussed in any detail, though it seems likely that other groups will wish to define them. This can be done relatively easily by defining an additional DTD fragment (along the lines of that discussed below). Alternatively, new package types can be created by using a generic package type called a <package>, composed of typed <metaData> or nestable <metaGroup> elements. This may be easier to define, and it avoids possible namespace clashes. Full details of these are given below in section 4 The Warwick Framework DTD fragment.
This approach was explored in some detail by the authors in order to demonstrate that the additional syntax and functionality required by the `container' approach could be supported directly by SGML, with no need to invent a new syntax and consequently additional ad hoc software.
A collection of packages of the same or different types makes up a <container> element. This could be linked to from an HTML document in the same way as in the preceding examples (using a <link> element in the HTML document), or form a part of a multipart MIME message along with the document itself. An example might look like the following:
<!DOCTYPE container PUBLIC '-//OCLC//DTD Warwick Framework Demo v.1//EN'>
<container>
<dublinCore>
<!-- etc. as in preceding example -->
</dublinCore>
<package URI='hdl:oclc:repository/tc'
         name='OCLC Standard Terms and Conditions / Set FPC45'
         version='1.0'>
<metadata name='permit'>www.oclc.org
<metadata name='permit'>rsch.oclc.org
<metadata name='permit'>firstname.lastname@example.org
<metadata name='deny'>dev.oclc.org
<metadata name='inquiries'>email@example.com
</package>
</container>
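A full SGML parser would use the DTD to infer the omitted </metadata> end tags in a container like this one. Purely as an illustration, the sketch below approximates that behaviour with Python's tolerant html.parser, treating the start of any new element as the end of the preceding <metadata>. It is not a conforming SGML parser: it ignores the DTD, does not handle <metadata> nested inside <metadata>, and assumes that </package> end tags are present. The class name ContainerReader and the shortened sample are hypothetical.

```python
from html.parser import HTMLParser


class ContainerReader(HTMLParser):
    """Read <package> elements whose <metadata> children omit end tags."""

    def __init__(self):
        super().__init__()
        self.packages = []  # every package seen, in document order
        self._open = []     # stack of packages not yet closed
        self._entry = None  # the <metadata> element currently being read

    def _close_entry(self):
        if self._entry is not None:
            chunks = self._entry.pop("chunks")
            self._entry["value"] = "".join(chunks).strip()
            self._entry = None

    def handle_starttag(self, tag, attrs):
        self._close_entry()  # a new element ends any open <metadata>
        if tag == "package":
            # html.parser lower-cases attribute names, so URI becomes "uri".
            package = {"attrs": dict(attrs), "metadata": []}
            self.packages.append(package)
            self._open.append(package)
        elif tag == "metadata" and self._open:
            self._entry = {"name": dict(attrs).get("name"), "chunks": []}
            self._open[-1]["metadata"].append(self._entry)

    def handle_endtag(self, tag):
        self._close_entry()
        if tag == "package" and self._open:
            self._open.pop()

    def handle_data(self, data):
        if self._entry is not None:
            self._entry["chunks"].append(data)


SAMPLE = """<container>
<dublinCore> <!-- etc. --> </dublinCore>
<package URI='hdl:oclc:repository/tc' name='OCLC Standard Terms and Conditions / Set FPC45' version='1.0'>
<metadata name='permit'>www.oclc.org
<metadata name='permit'>rsch.oclc.org
<metadata name='deny'>dev.oclc.org
</package>
</container>"""

reader = ContainerReader()
reader.feed(SAMPLE)
reader.close()
packages = reader.packages
```

With real SGML software none of this hand-written inference is needed, which is the point of using SGML for the container syntax.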
This DTD fragment, prepared by the authors, is a slightly simplified version of that proposed in the paper by Miller cited above. It defines the following metadata elements, one for each of the components of the Dublin Core, as defined at the first Metadata Workshop:
These elements all share the following attributes:
No closed set of values is defined for either of these attributes in the present proposal, though some suggestive examples are to be found in Miller's paper cited above.
These elements and attributes are formally defined as follows:
<!ENTITY % a.global '
          type   CDATA #IMPLIED
          scheme CDATA "uncontrolled"'>
<!ELEMENT title      - O (#PCDATA) >
<!ATTLIST title      %a.global >
<!ELEMENT author     - O (#PCDATA) >
<!ATTLIST author     %a.global >
<!ELEMENT otherAgent - O (#PCDATA) >
<!ATTLIST otherAgent %a.global >
<!ELEMENT publisher  - O (#PCDATA) >
<!ATTLIST publisher  %a.global >
<!ELEMENT date       - O (#PCDATA) >
<!ATTLIST date       %a.global >
<!ELEMENT subject    - O (#PCDATA) >
<!ATTLIST subject    %a.global >
<!ELEMENT objectType - O (#PCDATA) >
<!ATTLIST objectType %a.global >
<!ELEMENT form       - O (#PCDATA) >
<!ATTLIST form       %a.global >
<!ELEMENT identifier - O (#PCDATA) >
<!ATTLIST identifier %a.global >
<!ELEMENT relation   - O (#PCDATA) >
<!ATTLIST relation   %a.global >
<!ELEMENT source     - O (#PCDATA) >
<!ATTLIST source     %a.global >
<!ELEMENT language   - O (#PCDATA) >
<!ATTLIST language   %a.global >
<!ELEMENT coverage   - O (#PCDATA) >
<!ATTLIST coverage   %a.global >
<!ELEMENT metadata   - O (#PCDATA) >
<!ATTLIST metadata   %a.global >

The <metadata> element is described in more detail below, in section 4 The Warwick Framework DTD fragment.
Any number of any of the above elements may be grouped together to form a single Dublin core metadata description. Such a description is contained by a single <dublinCore> element, which bears an attribute version to indicate its version status.
This element is formally defined as follows:
<!ELEMENT dublinCore - O (title | author | otherAgent | publisher |
                          date | subject | objectType | form |
                          identifier | relation | source | language |
                          coverage | metadata)* >
<!ATTLIST dublinCore
          version CDATA #IMPLIED >
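The content model above allows any number of the listed elements in any order. For readers without SGML software to hand, the following sketch imitates the two things a validating parser would do with this declaration: reject children that are not in the content model, and supply the declared default for the scheme attribute. It assumes an XML-style rendering of the record with explicit end tags and no DOCTYPE declaration, since Python's xml.etree module is an XML parser rather than an SGML one; element names are also matched case-sensitively, which SGML by default does not require. The names CORE_ELEMENTS and check_record are hypothetical.

```python
import xml.etree.ElementTree as ET

# Element names taken from the content model of <dublinCore>.
CORE_ELEMENTS = {
    "title", "author", "otherAgent", "publisher", "date", "subject",
    "objectType", "form", "identifier", "relation", "source",
    "language", "coverage", "metadata",
}
DEFAULT_SCHEME = "uncontrolled"  # default declared in %a.global


def check_record(text):
    """Return (version, fields) for a <dublinCore> record, or raise ValueError."""
    root = ET.fromstring(text)
    if root.tag != "dublinCore":
        raise ValueError(f"expected <dublinCore>, found <{root.tag}>")
    fields = []
    for child in root:
        if child.tag not in CORE_ELEMENTS:
            raise ValueError(f"<{child.tag}> is not allowed in <dublinCore>")
        fields.append({
            "element": child.tag,
            "type": child.get("type"),                      # #IMPLIED
            "scheme": child.get("scheme", DEFAULT_SCHEME),  # defaulted
            "value": (child.text or "").strip(),
        })
    return root.get("version"), fields


RECORD = """<dublinCore version='1'>
  <title>On the Pulse of Morning</title>
  <form>1 ASCII file</form>
  <form scheme='IMT'>text/ASCII</form>
</dublinCore>"""
version, fields = check_record(RECORD)
```

Repeated elements such as <form> are accepted without complaint, since the content model places no limit on repetition.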
Several alternative methods have been proposed for defining the scheme and type attributes for various elements, in order to combine the virtues of a controlled vocabulary with the flexibility of an uncontrolled vocabulary:
When this version of this document was prepared, no final consensus had been reached.
This DTD, prepared by the authors, is intended to support the following three objectives:
A document conforming to this DTD is represented by a <container> element. Each <container> element consists of a sequence of one or more of the following package-level elements:
Other package-level elements may be defined at a later date: to facilitate this, the contents of the <container> element are defined indirectly using a parameter entity (see DTD below).
Package-level elements all share the following attributes:
Of these attributes, name is required, while the other two are optional. All three have CDATA content.
These elements and attributes are formally defined as follows:
<!ENTITY % packageType 'package | dublinCore | packageRef' >
<!ELEMENT container  - O (%packageType)* >
<!ELEMENT package    - O (metadata | metaGroup | %packageType)* >
<!ATTLIST package
          name    CDATA #REQUIRED
          URI     CDATA #IMPLIED
          version CDATA #IMPLIED >
<!ELEMENT packageRef - O EMPTY >
<!ATTLIST packageRef
          URI     CDATA #IMPLIED
          name    CDATA #REQUIRED
          version CDATA #IMPLIED >
Note that a <package> may also contain nested <package>, <dublinCore> or <packageRef> elements. This allows considerable flexibility in structuring metadata.
The components of the <dublinCore> element were defined above in section 3 The Dublin Core DTD fragment. The <package> element may contain a sequence of any number of the following sub-elements:
The above elements all share the following attributes:
These elements and attributes are formally defined as follows:
<!ELEMENT metaGroup - O (#PCDATA | metadata | metaGroup)* >
<!ELEMENT metadata  - O (#PCDATA | metadata)* >
<!ATTLIST metadata
          type    CDATA                        #REQUIRED
          scheme  CDATA                        'uncontrolled'
          show    (show | noshow | inherit)    inherit
          sortkey CDATA                        #IMPLIED
          index   (index | noindex | asparent) asparent >
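The effect of the attribute list for <metadata> can be sketched in a few lines: a required type, declared defaults for scheme, show and index, and enumerated values for show and index. The function below only fills in defaults and checks values as a validating parser would; it does not attempt to resolve what the inherit or asparent values mean for a given element. The name complete_attributes is hypothetical.

```python
SHOW_VALUES = {"show", "noshow", "inherit"}
INDEX_VALUES = {"index", "noindex", "asparent"}
KNOWN_ATTRIBUTES = {"type", "scheme", "show", "sortkey", "index"}


def complete_attributes(attrs):
    """Return a copy of attrs with the ATTLIST defaults filled in."""
    unknown = set(attrs) - KNOWN_ATTRIBUTES
    if unknown:
        raise ValueError(f"undeclared attributes: {sorted(unknown)}")
    if "type" not in attrs:
        raise ValueError("the type attribute is #REQUIRED")
    # Declared defaults; sortkey is #IMPLIED and so is left absent.
    completed = {"scheme": "uncontrolled", "show": "inherit", "index": "asparent"}
    completed.update(attrs)
    if completed["show"] not in SHOW_VALUES:
        raise ValueError(f"invalid show value: {completed['show']!r}")
    if completed["index"] not in INDEX_VALUES:
        raise ValueError(f"invalid index value: {completed['index']!r}")
    return completed


defaults_applied = complete_attributes({"type": "permit"})
```

An element written simply as <metadata type='permit'> is therefore treated as if scheme='uncontrolled', show='inherit' and index='asparent' had been specified.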