[Mirrored from: http://www.uic.edu/~cmsmcq/tech/metadata.factoring.html]

On Information Factoring in Dublin Metadata Records


C. M. Sperberg-McQueen

17 April 1996


This document describes some problems in the interpretation of metadata records which contain repeated fields, repeated field groups, or references to other metadata records. The semantics of repeated fields and groups (and, equivalently, of references to other metadata records) are described using sentential logic, and a proposal is made to specify the interpretation of repeating groups using the disjunctive normal form of corresponding logical expressions. From this proposal, requirements for grouping elements and inheritance are derived. The semantic principles involved may be of wider applicability, but all examples are from the so-called `Dublin Core' of metadata elements, described in the paper OCLC/NCSA Metadata Workshop Report, by Stuart Weibel, Jean Godby, Eric Miller, and Ron Daniel, available on the World Wide Web at http://www.oclc.org:5046/conferences/metadata/dublin_core_report.html.

1 The Problem

The data elements defined for a metadata record by the `Dublin Core' are all optional and all repeatable, and have no prescribed order. Some (e.g. author, title) relate to the intellectual content of an object (the work), while others (e.g. form) relate to particular realizations or instantiations of that intellectual content. Some (e.g. identifier, terms and conditions) may apply to all forms taken by a given item, or only to some forms and not others.

For example, consider the documentation for the TEI Lite SGML tag set. As a work, it may be described by the following metadata:

Title: TEI Lite: An Introduction to Text Encoding for Interchange
Author: Lou Burnard
Author: C. M. Sperberg-McQueen

It has, however, three realizations with distinct URLs, one for the TEI version:

Form: TEI Lite
Identifier (URL): http://www-tei.uic.edu/orgs/tei/intros/teiu5.tei

and two for two different HTML versions:

Form: HTML
Identifier (URL): http://www-tei.uic.edu/orgs/tei/intros/teiu5.html
Identifier (URL): http://www-tei.uic.edu/orgs/tei/intros/teiu5.split.html

How should, could, or must metadata for such items be represented?

At the Warwick meeting, Dan LaLiberte argued that in Dublin it was agreed that a given metadata record should describe only a single realization of an intellectual object; this would help ensure that metadata records are unambiguous. I don't find this explicit in the Dublin conference report, but that report does say explicitly that multiple versions may require multiple records. Redundancy may be controlled by factoring common information (e.g. work-related information) into separate records and `inheriting' it in the records for specific realizations. On this view, the three instantiations of the TEI Lite documentation will each require a separate metadata record.

Reports at the Warwick meeting (April, 1996) from users of the Dublin Core, however, make clear that in practice, there is a strong desire to put metadata for a given work in a single record, using some mechanism such as repeating groups to describe multiple realizations. The TEI Lite documentation, for example, might be represented thus with repeating groups (I use the DTD described in Eric Miller's paper Issues of Document Description in HTML, available at http://www.oclc.org:5046/~emiller/tmp/paper.html):

 
<citation>
<title>TEI Lite: An Introduction to 
Text Encoding for Interchange</title>
<author>Lou Burnard</author>
<author>C. M. Sperberg-McQueen</author>
<form>TEI Lite</form>
<identifier scheme='URL'>
http://www-tei.uic.edu/orgs/tei/intros/teiu5.tei
</identifier>
<form>HTML</form>
<identifier scheme='URL'>
http://www-tei.uic.edu/orgs/tei/intros/teiu5.html 
</identifier>
<identifier scheme='URL'>
http://www-tei.uic.edu/orgs/tei/intros/teiu5.split.html
</identifier>
</citation>

The only problem with this method is that it requires a lot of intelligence in the reader or user of the metadata to interpret the meaning of fields which occur more than once. A human may easily realize that the first form (TEI Lite) applies only to the first identifier, and that the second and third identifiers are for objects in the second form (HTML); software will realize it only if suitably instructed. A human will realize, perhaps even without conscious thought, that the two <author> elements both apply, at the same time, to all instantiations of the paper, because there are two authors for the paper, while the two <form> elements each relate to separate and distinct instantiations of the paper. Software is unlikely to realize this critical difference without help.

The association of form and identifier information can be made explicit, using the <instance> element of Eric Miller's DTD:

 
<citation>
<title>TEI Lite: An Introduction to 
Text Encoding for Interchange</title>
<author>Lou Burnard</author>
<author>C. M. Sperberg-McQueen</author>
<instance>
  <form>TEI Lite</form>
  <identifier scheme='URL'>
  http://www-tei.uic.edu/orgs/tei/intros/teiu5.tei
  </identifier>
</instance>
<instance>
  <form>HTML</form>
  <identifier scheme='URL'>
  http://www-tei.uic.edu/orgs/tei/intros/teiu5.html 
  </identifier>
  <identifier scheme='URL'>
  http://www-tei.uic.edu/orgs/tei/intros/teiu5.split.html
  </identifier>
</instance>
</citation>
This is an improvement, but not a full solution (note, for now, that the two HTML identifiers still require a different interpretation from the two <author> elements).

If common information is factored out into other records, we may be able to escape some of these logical difficulties, but we need a clear explanation of how information in the local record and the inherited information imported from an external record are to be related: are they always additive, even if the same field appears in both records? Or does a local field `override' the inherited value for that field?

2 Semantic Models

We can get a better grip on the problem if we apply some principles of formal logic. The simplest way to formalize the semantics of a Dublin metadata record, it seems to me, is using sentential logic. Existential quantifiers may also be used, and I describe that possibility briefly, enough to persuade myself that the more complex formalism does not require a more complex syntax for metadata records. Either approach allows us first to express more clearly the types of ambiguity arising from repeated fields or groups, and second to see what sorts of mechanisms might suffice to disambiguate them.

2.1 Sentential Logic

Let us consider first why a simple record like the following seems less problematic than the sample given above:

 
<citation>
  <title>On the Pulse of Morning
  <author>Maya Angelou
  <publisher>University of Virginia Library Electronic Text Center
  <otherAgent>Transcribed by the University of Virginia Electronic Text Center
  <date>1993
  <object>Poem
  <form>1 ASCII file
  <source>Newspaper stories and oral performance of text at the presidential inauguration of Bill Clinton
  <language>English
</citation>

The key difference, I believe, is that all of the metadata in this record unambiguously applies all the time, while some elements of the previous record apply only in conjunction with certain other elements.

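If we express each element as a logical proposition, the simple record has a correspondingly simple logical form. For convenience, let us give each proposition a short name:

T: the item has the title On the Pulse of Morning
A: the item was written by Maya Angelou
P: the item was published by the University of Virginia Library Electronic Text Center
D: the item is dated 1993
OA: the item was transcribed by the University of Virginia Electronic Text Center
Ob: the item is a poem
F: the item takes the form of 1 ASCII file
S: the item's source is newspaper stories and the oral performance of the text at the presidential inauguration of Bill Clinton
L: the item is in English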

Then the metadata record as a whole can be expressed formulaically: (T & A & P & D & OA & Ob & F & S & L), or "The item has the title On the Pulse of Morning and the item was written by Maya Angelou and ...".

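The more complex record has a more complex logical structure. If we name the propositions thus:

T: the item has the title TEI Lite: An Introduction to Text Encoding for Interchange
A1: the item was written by Lou Burnard
A2: the item was written by C. M. Sperberg-McQueen
F1: the item is in TEI Lite form
I1: the item has the URL http://www-tei.uic.edu/orgs/tei/intros/teiu5.tei
F2: the item is in HTML form
I2: the item has the URL http://www-tei.uic.edu/orgs/tei/intros/teiu5.html
I3: the item has the URL http://www-tei.uic.edu/orgs/tei/intros/teiu5.split.html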

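then the record as a whole has the following formulaic interpretation: (T & A1 & A2 & ((F1 & I1) | (F2 & (I2 | I3)))), which can be paraphrased in English roughly thus: the item has the title TEI Lite: An Introduction to Text Encoding for Interchange, and it was written by Lou Burnard, and it was written by C. M. Sperberg-McQueen, and either (it is in TEI Lite form and has the URL .../teiu5.tei) or (it is in HTML form and has either the URL .../teiu5.html or the URL .../teiu5.split.html).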
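
The reading of this formula can be made concrete with a small sketch in Python. It is only an illustration, not part of the proposal: the nested-tuple representation and the function name `holds` are assumptions introduced here. It checks whether a set of propositions known to be true of an item in hand satisfies the formula.

```python
# A formula is either a proposition name (a string) or a tuple whose first
# member is "and" or "or" and whose remaining members are sub-formulas.
# This representation is an assumption made for illustration only.
RECORD = ("and", "T", "A1", "A2",
          ("or",
           ("and", "F1", "I1"),
           ("and", "F2", ("or", "I2", "I3"))))


def holds(formula, true_props):
    """Return True if the formula is satisfied when exactly the
    propositions in true_props are true."""
    if isinstance(formula, str):
        return formula in true_props
    op, *args = formula
    if op == "and":
        return all(holds(arg, true_props) for arg in args)
    if op == "or":
        return any(holds(arg, true_props) for arg in args)
    raise ValueError(f"unknown operator: {op!r}")


# An HTML copy fetched from the third URL fits the record ...
print(holds(RECORD, {"T", "A1", "A2", "F2", "I3"}))   # True
# ... but a TEI Lite form paired with an HTML URL does not.
print(holds(RECORD, {"T", "A1", "A2", "F1", "I2"}))   # False
```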

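Each individual instance can be described (as Dan LaLiberte points out) with a simple metadata record, which translates into a simple formula:

(T & A1 & A2 & F1 & I1)
(T & A1 & A2 & F2 & I2)
(T & A1 & A2 & F2 & I3)

i.e. each instance has the title and both authors, together with one form and one identifier.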

I believe that this simple form of expression, in which the only connector is and, corresponds to the class of metadata records which are unambiguous and easy to interpret. The problem of interpreting complex metadata records (ones with repeating fields or groups) can thus be paraphrased: how do we derive a set of simple and-expressions from the logical expression representing a complex metadata record?

Fortunately, the answer is simple.

If we combine the three simple expressions into a single formula, we get a paraphrase of the metadata record as a whole:

 
( (T & A1 & A2 & F1 & I1)
| (T & A1 & A2 & F2 & I2)
| (T & A1 & A2 & F2 & I3)
)
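which can be paraphrased roughly thus:
(if you have an item in hand described by this metadata record, then one of these three things is true: it has the given title and both authors, is in TEI Lite form, and has the URL .../teiu5.tei; or it has the given title and both authors, is in HTML form, and has the URL .../teiu5.html; or it has the given title and both authors, is in HTML form, and has the URL .../teiu5.split.html.)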

The salient point (and the only interesting or new claim in this entire paper) is that this expression is logically equivalent to the original formula for the example, but unlike the original this one is in disjunctive normal form.[1] It is fortunately not hard to generate the disjunctive normal form of arbitrary logical expressions, particularly when (as here) the only operators allowed are and and or.
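
As a hedged illustration of how easily the disjunctive normal form can be generated when only and and or occur, here is a minimal Python sketch. The tuple representation and the names `to_dnf` and `holds` are assumptions made for this example, not anything defined by the Dublin Core or by Eric Miller's DTD. The sketch distributes and over or recursively, then confirms by brute force over all truth assignments that the result is equivalent to the original formula.

```python
from itertools import product

# A formula is a proposition name (str) or ("and", ...) / ("or", ...).
record = ("and", "T", "A1", "A2",
          ("or",
           ("and", "F1", "I1"),
           ("and", "F2", ("or", "I2", "I3"))))


def to_dnf(formula):
    """Return the DNF as a list of terms; each term is a list of
    proposition names that are and-ed together."""
    if isinstance(formula, str):
        return [[formula]]
    op, *args = formula
    parts = [to_dnf(arg) for arg in args]
    if op == "or":
        # A disjunction of DNFs is simply all of their terms together.
        return [term for part in parts for term in part]
    if op == "and":
        # Distribute and over or: choose one term from each operand.
        return [[p for term in combo for p in term]
                for combo in product(*parts)]
    raise ValueError(f"unknown operator: {op!r}")


def holds(formula, true_props):
    """Evaluate a formula given the set of propositions that are true."""
    if isinstance(formula, str):
        return formula in true_props
    op, *args = formula
    check = all if op == "and" else any
    return check(holds(arg, true_props) for arg in args)


terms = to_dnf(record)
for term in terms:
    print(" & ".join(term))

# Brute-force equivalence check over every truth assignment.
names = sorted({p for term in terms for p in term})
dnf_formula = ("or", *[("and", *term) for term in terms])
equivalent = all(
    holds(record, chosen) == holds(dnf_formula, chosen)
    for bits in product([False, True], repeat=len(names))
    for chosen in [{n for n, b in zip(names, bits) if b}]
)
print(equivalent)
```

The three printed terms correspond to the three simple and-expressions above, one per realization.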

We can then describe the semantics of metadata records thus:

We do need, however, a way to make explicit not only the parenthetical groupings in the formula (<instance> does this) but also which propositions in the formula are joined by and (&) and which by or (|). We can see therefore that proposals calling for a single grouping element (such as that made by Eric Miller in the paper already mentioned, or by myself in informal DTD sketches) will not suffice to solve the problem. We need not one but two distinct types of group. Miller's <citation> element already serves as an and-group, since simple citations are interpreted as the and-ing together (formally, the conjunction) of their elements. It will have to be able to nest recursively, however, if we want to handle all cases of shared metadata. And we will need a second grouping element, to serve as an or-group. For examples, see the section Groups, below.

2.2 Existential Quantifiers

Some readers may resist the use of sentential logic as a formalism for representing the meanings of metadata records in general, since the meaning of

 
<citation>
<title>On the pulse of morning</title>
</citation>
is not, in general, merely "The title is On the pulse of morning" but something more like "(There is an object, described by this record, and) the title (of the object described by this record) is On the pulse of morning." That is, there is an implied existential quantifier inherent in the existence of a metadata record, and there is an implied argument for each metadata element, viewed as a logical function.

Paraphrasing records at this level of detail would make it easier to capture the semantics of work and realization more clearly. Represented in first-order predicate calculus, our example might look like this:

 
(E w)(E lb)(E cmsmcq)(E i1)(E i2)(E i3)
     ( work(w)
     & title(w,"TEI Lite ...") 
     & name(lb,"Lou Burnard")
     & name(cmsmcq,"C. M. Sperberg-McQueen")
     & author(w,lb) & author(w,cmsmcq)
     & instance(w,i1)
     & form(i1,teilite)
     & url(i1,".../teiu5.tei")
     & instance(w,i2)
     & form(i2,html)
     & url(i2,".../teiu5.html")
     & instance(w,i3)
     & form(i3,html)
     & url(i3,".../teiu5.split.html")
     & (i1 != i2) & (i1 != i3) 
     )
which we might paraphrase as:

Note, in passing, that from the <form> elements we can infer that the first instantiation is not identical to the second or third, but the second and third instantiations, both being in HTML, could conceivably be identical. Hence there is no claim that (i2 != i3).
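
The inference about distinctness can be sketched in a few lines of Python. This is an illustration only: the dictionary `forms` and the function `provably_distinct` are hypothetical names, and the sketch assumes that each instantiation carries a single form() fact.

```python
from itertools import combinations

# form() facts taken from the formula above, one per instantiation
# (an assumption of this sketch).
forms = {"i1": "teilite", "i2": "html", "i3": "html"}


def provably_distinct(forms):
    """Pairs of instantiations whose differing forms show that they
    cannot be identical."""
    return {(a, b)
            for a, b in combinations(sorted(forms), 2)
            if forms[a] != forms[b]}


print(sorted(provably_distinct(forms)))
# [('i1', 'i2'), ('i1', 'i3')]
```

The pair (i2, i3) is absent from the result, which matches the absence of any claim that (i2 != i3).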

If we are willing to assume that different instantiations are the only possible causes of or-groups in metadata records, then we may plausibly believe (a) that complex metadata records can all be described with a single and-group, if instantiations are given identifiers (such as the i1, i2, i3 of the example) and the identifiers are used to associate the metadata elements applying to each instantiation, and (b) that Eric Miller's <instance> element suffices, after all, since all instances are implicitly or-ed with each other, and nothing else will cause an or-group.

I'm reluctant to accept this logic, first because while many (all?) examples of logical complication in metadata records do involve multiple instantiations, I certainly haven't seen any argument that proves this is a logical necessity. Second, tempting though this argument is, I still don't know how to derive the formula just given systematically from the metadata record itself. The formula has three instantiations, and three form() predicates, while the metadata record itself has three <identifier> elements, but only two <instance> elements, and only two <form> elements.

3 Markup Solutions

3.1 Groups

We saw earlier, when we used sentential logic to say what metadata records mean, that we need both a grouping element meaning and and one meaning or. The or-group we must invent. For now, let's call it <or>. The and-group we already have, in the citation element. The only drawback is that the term citation seems to imply that its contents constitute a complete citation, which will not always be the case. For purposes of illustration, therefore, let's invent a second new grouping element called <and>.

If we augment Eric Miller's DTD with <or> and <and>, our example record will look like this (I augment the <or> and <and> elements with identifiers, so I can refer to them in later discussion):

 
<citation>
<title>TEI Lite: An Introduction to 
Text Encoding for Interchange</title>
<author>Lou Burnard</author>
<author>C. M. Sperberg-McQueen</author>
<or id=O1>
  <and id=A1>
    <form>TEI Lite</form>
    <identifier scheme='URL'>
    http://www-tei.uic.edu/orgs/tei/intros/teiu5.tei
    </identifier>
  </and>
  <and id=A2>
    <form>HTML</form>
    <or id=O2>
      <identifier scheme='URL'>
      http://www-tei.uic.edu/orgs/tei/intros/teiu5.html 
      </identifier>
      <identifier scheme='URL'>
      http://www-tei.uic.edu/orgs/tei/intros/teiu5.split.html
      </identifier>
    </or>
  </and>
</or>
</citation>
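
To show that such a record maps mechanically onto the formula given earlier, here is a minimal Python sketch. It rests on several assumptions that are not part of the proposal: the record is rewritten as well-formed XML (with quoted attribute values) so that the standard `xml.etree.ElementTree` parser can read it, <citation> is treated as an and-group, and propositions are named by the initial letter of the element plus a running number.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The sample record, rewritten as XML for this sketch (an assumption;
# the original is SGML with unquoted id values).
RECORD = """
<citation>
  <title>TEI Lite: An Introduction to Text Encoding for Interchange</title>
  <author>Lou Burnard</author>
  <author>C. M. Sperberg-McQueen</author>
  <or id="O1">
    <and id="A1">
      <form>TEI Lite</form>
      <identifier scheme="URL">http://www-tei.uic.edu/orgs/tei/intros/teiu5.tei</identifier>
    </and>
    <and id="A2">
      <form>HTML</form>
      <or id="O2">
        <identifier scheme="URL">http://www-tei.uic.edu/orgs/tei/intros/teiu5.html</identifier>
        <identifier scheme="URL">http://www-tei.uic.edu/orgs/tei/intros/teiu5.split.html</identifier>
      </or>
    </and>
  </or>
</citation>
"""

# Grouping elements and the connective each one stands for.
GROUPS = {"citation": " & ", "and": " & ", "or": " | "}


def formula(root):
    """Render a record as a formula, naming each leaf element by its
    initial letter (numbered when the element occurs more than once)."""
    totals = Counter(e.tag for e in root.iter() if e.tag not in GROUPS)
    seen = Counter()

    def walk(elem):
        if elem.tag in GROUPS:
            return "(" + GROUPS[elem.tag].join(walk(c) for c in elem) + ")"
        seen[elem.tag] += 1
        initial = elem.tag[0].upper()
        if totals[elem.tag] == 1:
            return initial
        return f"{initial}{seen[elem.tag]}"

    return walk(root)


print(formula(ET.fromstring(RECORD)))
# (T & A1 & A2 & ((F1 & I1) | (F2 & (I2 | I3))))
```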

3.2 Implicit Anding and Oring

It might be suggested (I suggested it myself, in the first draft of these notes) that we don't really need the <or> element everywhere it occurs in the example just given. It would be clear enough simply to write

 
  <and id=A2>
    <form>HTML</form>
    <identifier scheme='URL'>
      http://www-tei.uic.edu/orgs/tei/intros/teiu5.html 
    </identifier>
    <identifier scheme='URL'>
      http://www-tei.uic.edu/orgs/tei/intros/teiu5.split.html
    </identifier>
  </and>
since the URL is a characteristic of the realization, not the work, and two identifiers in the same scheme will in general refer to distinct realizations. They thus might as well be regarded as forming a sort of automatic, implicit or-group.

On this view, some elements are implicitly and-ed together when they repeat: author, for example. Some elements (e.g. identifier, form) are intrinsically incapable of being and-ed together and thus form implicit or-groups. Some elements can go either way: multiple titles may all apply, or they might apply each to one particular instantiation of the work (the French title applies to the French version, the English title to the English version of a European regulation, which might need juridically to be treated as a single work, since all national-language versions have equal authority).

On the whole, it seems better not to make too much of this generalization -- though it might be a useful heuristic for plausibility checking. Since some elements can go either way, we will need <and> and <or> (or rather, their logical equivalents: I am not proposing actual elements here, just pointing out the need for elements with conjunctive and disjunctive meaning) regardless, and using such elements explicitly seems simpler and less confusing than hard-wiring so much intelligence into software.

It also is worth pointing out that the initial premise of this section is false: two URLs may very easily point to the same object, and it is easy to imagine methods of describing formats which would allow multiple names to be applied to the same format (just as in some programming languages the same data type can be referred to by multiple names).
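
The plausibility heuristic mentioned above might be sketched as follows in Python. Everything here is an assumption made for illustration: the table `REPEAT_DEFAULT` merely records the tendencies described in this section (author usually and-ed, identifier and form usually or-ed, title either way), and, since those tendencies are not reliable, the function only produces warnings and never rejects a record.

```python
from collections import Counter

# Usual reading of an element when it repeats (illustrative only).
REPEAT_DEFAULT = {
    "author": "and",
    "identifier": "or",
    "form": "or",
    "title": "either",
}


def plausibility_warnings(group_kind, tags):
    """group_kind is "and" or "or"; tags lists the names of the elements
    directly inside the group. Returns a list of warning strings."""
    warnings = []
    for tag, count in Counter(tags).items():
        expected = REPEAT_DEFAULT.get(tag, "either")
        if count > 1 and expected != "either" and expected != group_kind:
            warnings.append(
                f"{count} <{tag}> elements in an {group_kind}-group; "
                f"repeated <{tag}> usually reads as {expected}"
            )
    return warnings


# The and-group A2 without its inner <or> draws one warning.
for message in plausibility_warnings("and", ["form", "identifier", "identifier"]):
    print(message)
```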

3.3 Inheritance

There are three ways to treat inheritance of metadata from other records. We can insist that the inherited metadata never include the same elements as are present locally, or we can specify that locally specified elements override inherited elements of the same name, or we can attempt to specify some method of merging the two records so as to keep all the information from both records, by and-ing or or-ing corresponding elements together. In the first case, it may be overkill to speak of `inheritance'; in the last, we may be reintroducing all the problems of repeating groups.
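
The three treatments can be contrasted in a short Python sketch. It is an illustration under stated assumptions, not a specification: records are modelled as dictionaries mapping an element name to a list of values, the policy names "disjoint", "override", and "merge" are invented here, and the merge policy simply and-s inherited and local values together.

```python
def inherit(local, inherited, policy):
    """Combine a local record with an inherited one. Each record is a
    dict mapping an element name to a list of values."""
    shared = sorted(set(local) & set(inherited))
    if policy == "disjoint":
        # First approach: no element may appear in both records.
        if shared:
            raise ValueError(f"elements present in both records: {shared}")
        return {**inherited, **local}
    if policy == "override":
        # Second approach: a local element replaces the inherited
        # element of the same name.
        return {**inherited, **local}
    if policy == "merge":
        # Third approach, with the simple rule that inherited values
        # are and-ed together with local values.
        merged = {name: list(values) for name, values in inherited.items()}
        for name, values in local.items():
            merged.setdefault(name, []).extend(values)
        return merged
    raise ValueError(f"unknown policy: {policy!r}")


work = {
    "title": ["TEI Lite: An Introduction to Text Encoding for Interchange"],
    "author": ["Lou Burnard", "C. M. Sperberg-McQueen"],
}
tei_version = {
    "form": ["TEI Lite"],
    "identifier": ["http://www-tei.uic.edu/orgs/tei/intros/teiu5.tei"],
}

print(inherit(tei_version, work, "disjoint"))
```

With these two records the three policies give the same result, because no element occurs in both; they diverge only when a name such as title appears locally as well.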

If we take the first or the second approach (or even the third approach, as long as we provide a simple rule, such as "All inherited data is and-ed together with local data"), we will be able to interpret a reference to external metadata fairly rigorously:

By making liberal use of references to other records, we can do without <and> and <or> elements. We can demonstrate this by giving a method of transforming records with <and> and <or> into sets of records linked by reference. For each SGML element in the source record, we do the following:

Or perhaps it would be clearer to put it this way: We begin by copying the entire <citation> element and all its children into a new record, which we then process as follows:

The sample record from the previous section would turn into the following set of records:

More work is needed here, I think, both to specify how to interpret the record when the same element occurs both locally and in the referenced object, and to specify what constitutes the same element.

4 Conclusion

An adequate syntax for multiple versions (realizations) of the same work (intellectual content) requires an explicit semantic interpretation, to avoid hopeless ambiguity. If we provide our syntax with mechanisms for both disjunctive and conjunctive groupings (and-groups and or-groups), we can provide simple rules for interpreting complex records in terms of their disjunctive normal form.

More complex semantic formalisms, using existential quantifiers, may also be defined, but do not require any syntax more elaborate than the simpler semantics.

Notes

[1] A formula in sentential logic is in disjunctive normal form if it is a disjunction (or alternation, or or-group) of one or more terms, and if each term is a conjunction of one or more primitive sentences or their negations. No nested expressions are allowed. For a fuller discussion, any book on formal logic may be consulted, but perhaps the best discussion of disjunctive normal form and the algebraic manipulations used to achieve it may be found in W. V. Quine, Methods of Logic, 4th ed. (Cambridge, Mass.: Harvard University Press, 1982).