[Mirrored from: http://www.uic.edu/~cmsmcq/tech/metadata.factoring.html]
This document describes some problems in the interpretation of metadata records which contain repeated fields, repeated field groups, or references to other metadata records. The semantics of repeated fields and groups (and, equivalently, of references to other metadata records) are described using sentential logic, and a proposal is made to specify the interpretation of repeating groups using the disjunctive normal form of corresponding logical expressions. From this proposal, requirements for grouping elements and inheritance are derived. The semantic principles involved may be of wider applicability, but all examples are from the so-called `Dublin Core' of metadata elements, described in the paper OCLC/NCSA Metadata Workshop Report, by Stuart Weibel, Jean Godby, Eric Miller, and Ron Daniel, available on the World Wide Web at http://www.oclc.org:5046/conferences/metadata/dublin_core_report.html.
The data elements defined for a metadata record by the `Dublin Core' are all optional and all repeatable, and have no prescribed order. Some (e.g. author, title) relate to the intellectual content of an object (the work), while others (e.g. form) relate to particular realizations or instantiations of that intellectual content. Some (e.g. identifier, terms and conditions) may apply to all forms taken by a given item, or only to some forms and not others.
For example, consider the documentation for the TEI Lite SGML tag set. As a work, it may be described by the following metadata:
How should, could, or must metadata for such items be represented?
At the Warwick meeting, Dan LaLiberte argued that in Dublin it was agreed that a given metadata record should describe only a single realization of an intellectual object; this would help ensure that metadata records are unambiguous. I don't find this explicit in the Dublin conference report, but that report does say explicitly that multiple versions may require multiple records. Redundancy may be controlled by factoring common information (e.g. work-related information) into separate records and `inheriting' it in the records for specific realizations. On this view, the three instantiations of the TEI Lite documentation will each require a separate metadata record.
Reports at the Warwick meeting (April, 1996) from users of the Dublin Core, however, make clear that in practice, there is a strong desire to put metadata for a given work in a single record, using some mechanism such as repeating groups to describe multiple realizations. This paper, for example, might be represented thus with repeating groups (I use the DTD described by Eric Miller's paper Issues of Document Description in HTML, available at http://www.oclc.org:5046/~emiller/tmp/paper.html):
<citation>
  <title>TEI Lite: An Introduction to Text Encoding for Interchange</title>
  <author>Lou Burnard</author>
  <author>C. M. Sperberg-McQueen</author>
  <form>TEI Lite</form>
  <identifier scheme='URL'>http://www-tei.uic.edu/orgs/tei/intros/teiu5.tei</identifier>
  <form>HTML</form>
  <identifier scheme='URL'>http://www-tei.uic.edu/orgs/tei/intros/teiu5.html</identifier>
  <identifier scheme='URL'>http://www-tei.uic.edu/orgs/tei/intros/teiu5.split.html</identifier>
</citation>
The only problem with this method is that it requires a lot of intelligence in the reader or user of the metadata to interpret the meaning of fields which occur more than once. A human may easily realize that the first form (TEI Lite) applies only to the first identifier, and that the second and third identifiers are for objects in the second form (HTML); software will realize it only if suitably instructed. A human will realize, perhaps even without conscious thought, that the two <author> elements both apply, at the same time, to all instantiations of the paper, because there are two authors for the paper, while the two <form> elements each relate to separate and distinct instantiations of the paper. Software is unlikely to realize this critical difference without help.
The association of form and identifier information can be made explicit, using the <instance> element of Eric Miller's DTD:
<citation>
  <title>TEI Lite: An Introduction to Text Encoding for Interchange</title>
  <author>Lou Burnard</author>
  <author>C. M. Sperberg-McQueen</author>
  <instance>
    <form>TEI Lite</form>
    <identifier scheme='URL'>http://www-tei.uic.edu/orgs/tei/intros/teiu5.tei</identifier>
  </instance>
  <instance>
    <form>HTML</form>
    <identifier scheme='URL'>http://www-tei.uic.edu/orgs/tei/intros/teiu5.html</identifier>
    <identifier scheme='URL'>http://www-tei.uic.edu/orgs/tei/intros/teiu5.split.html</identifier>
  </instance>
</citation>
This is an improvement, but not a full solution (note, for now, that the two HTML identifiers still require a different interpretation from the two <author> elements).
If common information is factored out into other records, we may be able to escape some of these logical difficulties, but we need a clear explanation of how information in the local record and the inherited information imported from an external record are to be related: are they always additive, even if the same field appears in both records? Or does a local field `override' the inherited value for that field?
We can get a better grip on the problem if we apply some principles of formal logic. The simplest way to formalize the semantics of a Dublin metadata record, it seems to me, is using sentential logic. Existential quantifiers may also be used, and I describe that possibility briefly, enough to persuade myself that the more complex formalism does not require a more complex syntax for metadata records. Either approach allows us first to express more clearly the types of ambiguity arising from repeated fields or groups, and second to see what sorts of mechanisms might suffice to disambiguate them.
Let us consider first why a simple record like the following seems less problematic than the sample given above:
<citation>
  <title>On the Pulse of Morning
  <author>Maya Angelou
  <publisher>University of Virginia Library Electronic Text Center
  <otherAgent>Transcribed by the University of Virginia Electronic Text Center
  <date>1993
  <object>Poem
  <form>1 ASCII file
  <source>Newspaper stories and oral performance of text at the presidential inauguration of Bill Clinton
  <language>English
</citation>
The key difference, I believe, is that all of the metadata in this record unambiguously applies all the time, while some elements of the previous record apply only in conjunction with certain other elements.
If we express each element as a logical proposition, the simple record has a correspondingly simple logical form. For convenience, let us give each proposition a short name:
(T & A & P & D & OA & Ob & F & S & L)
or, in words, "The item has the title On the Pulse of Morning and the item was written by Maya Angelou and ...".
The more complex record has a more complex logical structure. If we name the propositions thus:
(T & A1 & A2 & ((F1 & I1) | (F2 & (I2 | I3))))
which can be paraphrased in English roughly thus:
- The item is called "TEI Lite: ..."
- and it was written by Lou Burnard
- and it was (also) written by C. M. Sperberg-McQueen
- and it is
  - either in TEI Lite as .../teiu5.tei
  - or in HTML
    - either as .../teiu5.html
    - or as .../teiu5.split.html.
Each individual instance can be described (as Dan LaLiberte points out) with a simple metadata record, which translates into a simple formula:
T & A1 & A2 & F1 & I1
T & A1 & A2 & F2 & I2
T & A1 & A2 & F2 & I3
I believe that this simple form of expression, in which the only connector is `and', corresponds to the class of metadata records which are unambiguous and easy to interpret.
The problem of interpreting complex metadata records (ones with repeating fields or groups) can thus be paraphrased: how do we derive a set of simple and-expressions from the logical expression representing a complex metadata record?
Fortunately, the answer is simple.
If we combine the three simple expressions into a single formula, we get a paraphrase of the metadata record as a whole:
( (T & A1 & A2 & F1 & I1) | (T & A1 & A2 & F2 & I2) | (T & A1 & A2 & F2 & I3) )
which can be paraphrased roughly thus:
(if you have an item in hand described by this metadata record, then one of these three things is true:)
- either the title is TEI Lite ... and the author(s) are LB and CMSMcQ and the form is TEI Lite and the URL is .../teiu5.tei
- or the title is TEI Lite ... and the author(s) are LB and CMSMcQ and the form is HTML and the URL is .../teiu5.html
- or the title is TEI Lite ... and the author(s) are LB and CMSMcQ and the form is HTML and the URL is .../teiu5.split.html
The salient point (and the only interesting or new claim in this entire paper) is that this expression is logically equivalent to the original formula for the example, but unlike the original this one is in disjunctive normal form.[1]
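The claimed equivalence can be checked mechanically by brute force over all truth assignments. The following is only a sketch in Python (not part of the original proposal); the function names `original`, `normal_form`, and `equivalent` are invented for illustration, and the proposition names are those used in the formulas above.

```python
from itertools import product

# Proposition names taken from the formulas in the text.
ATOMS = ["T", "A1", "A2", "F1", "F2", "I1", "I2", "I3"]

def original(v):
    """(T & A1 & A2 & ((F1 & I1) | (F2 & (I2 | I3))))"""
    return v["T"] and v["A1"] and v["A2"] and (
        (v["F1"] and v["I1"]) or (v["F2"] and (v["I2"] or v["I3"])))

def normal_form(v):
    """The same record written as a disjunction of three simple records."""
    return ((v["T"] and v["A1"] and v["A2"] and v["F1"] and v["I1"]) or
            (v["T"] and v["A1"] and v["A2"] and v["F2"] and v["I2"]) or
            (v["T"] and v["A1"] and v["A2"] and v["F2"] and v["I3"]))

def equivalent(f, g, atoms):
    """True if f and g agree under every assignment of truth values to atoms."""
    for values in product([False, True], repeat=len(atoms)):
        assignment = dict(zip(atoms, values))
        if f(assignment) != g(assignment):
            return False
    return True

print(equivalent(original, normal_form, ATOMS))  # True
```

With eight propositions there are only 256 assignments to test, so exhaustive checking is entirely practical for records of this size.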
It is fortunately not hard to generate the disjunctive normal form of arbitrary logical expressions, particularly when (as here) the only operators allowed are `and' and `or'.
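As a rough illustration of how little machinery is needed, here is a minimal Python sketch of the conversion for expressions built only from `and' and `or'. The nested-tuple representation and the function name `dnf` are assumptions made for this example; they are not taken from any existing metadata software.

```python
from itertools import product

def dnf(expr):
    """Return the disjunctive normal form of expr as a list of terms.

    An expression is either a proposition name (a string) or a tuple
    ("and", e1, e2, ...) or ("or", e1, e2, ...).  Each term of the
    result is a list of proposition names joined implicitly by 'and'.
    """
    if isinstance(expr, str):
        return [[expr]]
    op, *args = expr
    if op == "or":
        # A disjunction simply pools the terms of its alternatives.
        return [term for arg in args for term in dnf(arg)]
    if op == "and":
        # A conjunction distributes over its arguments: choose one term
        # from each argument and concatenate the choices.
        return [[atom for term in combo for atom in term]
                for combo in product(*(dnf(arg) for arg in args))]
    raise ValueError(f"unknown operator: {op!r}")

# (T & A1 & A2 & ((F1 & I1) | (F2 & (I2 | I3))))
EXAMPLE = ("and", "T", "A1", "A2",
           ("or", ("and", "F1", "I1"),
                  ("and", "F2", ("or", "I2", "I3"))))

for term in dnf(EXAMPLE):
    print(" & ".join(term))
# T & A1 & A2 & F1 & I1
# T & A1 & A2 & F2 & I2
# T & A1 & A2 & F2 & I3
```

Each printed line corresponds to one of the three simple records given earlier. The sketch does not handle negation, which the records discussed here never need.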
We can then describe the semantics of metadata records thus:
- A simple record is the and-ing together (conjunction) of its sub-elements.
- A complex record is the or-ing together (disjunction) of several simple records, each represented by one term in the complex record's disjunctive normal form.
We do need, however, a way to make explicit not only the parenthetical groupings in the formula (<instance> does this) but also which propositions in the formula are joined by `and' (&) and which by `or' (|). We can see therefore that proposals calling for a single grouping element (such as that made by Eric Miller in the paper already mentioned, or by myself in informal DTD sketches) will not suffice to solve the problem. We need not one but two distinct types of group. Miller's <citation> element already serves as an and-group, since simple citations are interpreted as the and-ing together (formally, the conjunction) of their elements. It will have to be able to nest recursively, however, if we want to handle all cases of shared metadata. And we will need a second grouping element, to serve as an or-group.
For examples, see the section Groups, below.
Some readers may resist the use of sentential logic as a formalism for representing the meanings of metadata records in general, since the meaning of
<citation>
  <title>On the pulse of morning</title>
</citation>
is not, in general, merely "The title is On the pulse of morning" but something more like "(There is an object, described by this record, and) the title (of the object described by this record) is On the pulse of morning." That is, there is an implied existential quantifier inherent in the existence of a metadata record, and there is an implied argument for each metadata element, viewed as a logical function.
Paraphrasing records at this level of detail would make it easier to capture the semantics of work and realization more clearly. Represented in first-order predicate calculus, our example might look like this:
(E w)(E lb)(E cmsmcq)(E i1)(E i2)(E i3)
( work(w)
  & title(w,"TEI Lite ...")
  & name(lb,"Lou Burnard")
  & name(cmsmcq,"C. M. Sperberg-McQueen")
  & author(w,lb)
  & author(w,cmsmcq)
  & instance(w,i1) & form(i1,teilite) & url(i1,".../teiu5.tei")
  & instance(w,i2) & form(i2,html) & url(i2,".../teiu5.html")
  & instance(w,i3) & form(i3,html) & url(i3,".../teiu5.split.html")
  & (i1 != i2)
  & (i1 != i3) )
which we might paraphrase as:
Note, in passing, that from the <form> elements we can infer that the first instantiation is not identical to the second or third, but the second and third instantiations, both being in HTML, could conceivably be identical. Hence there is no claim that (i2 != i3).
If we are willing to assume that different instantiations are the only possible causes of or-groups in metadata records, then we may plausibly believe (a) that complex metadata records can all be described with a single and-group, if instantiations are given identifiers (such as the i1, i2, i3 of the example) and the identifiers are used to associate the metadata elements applying to each instantiation, and (b) that Eric Miller's <instance> element suffices, after all, since all instances are implicitly or-ed with each other, and nothing else will cause an or-group.
I'm reluctant to accept this logic, first because while many (all?) examples of logical complication in metadata records do involve multiple instantiations, I certainly haven't seen any argument that proves this is a logical necessity. Second, tempting though this argument is, I still don't know how to derive the formula just given systematically from the metadata record itself. The formula has three instantiations, and three form() predicates, while the metadata record itself has three <identifier> elements, but only two <instance> elements, and only two <form> elements.
We saw earlier, when we used sentential logic to say what metadata records mean, that we need both a grouping element meaning `and' and one meaning `or'. The or-group we must invent. For now, let's call it <or>. The and-group we already have, in the citation element. The only drawback is that the term citation seems to imply that its contents constitute a complete citation, which will not always be the case. For purposes of illustration, therefore, let's invent a second new grouping element called <and>.
If we augment Eric Miller's DTD with <or> and <and>, our example record will look like this (I augment the <or> and <and> elements with identifiers, so I can refer to them in later discussion):
<citation>
  <title>TEI Lite: An Introduction to Text Encoding for Interchange</title>
  <author>Lou Burnard</author>
  <author>C. M. Sperberg-McQueen</author>
  <or id=O1>
    <and id=A1>
      <form>TEI Lite</form>
      <identifier scheme='URL'>http://www-tei.uic.edu/orgs/tei/intros/teiu5.tei</identifier>
    </and>
    <and id=A2>
      <form>HTML</form>
      <or id=O2>
        <identifier scheme='URL'>http://www-tei.uic.edu/orgs/tei/intros/teiu5.html</identifier>
        <identifier scheme='URL'>http://www-tei.uic.edu/orgs/tei/intros/teiu5.split.html</identifier>
      </or>
    </and>
  </or>
</citation>
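To suggest how software might read such a record, the following Python sketch expands a record with and-groups and or-groups into the simple records of its disjunctive normal form. It rests on several assumptions that go beyond the text: attribute values are quoted so that the sample is well-formed XML that Python's standard parser accepts, <citation> and <and> are both treated as conjunctions, and the function name `simple_records` is invented for illustration.

```python
import xml.etree.ElementTree as ET
from itertools import product

# The sample record, with attribute values quoted (an assumption made so
# that an XML parser, rather than an SGML parser, can read it).
RECORD = """
<citation>
  <title>TEI Lite: An Introduction to Text Encoding for Interchange</title>
  <author>Lou Burnard</author>
  <author>C. M. Sperberg-McQueen</author>
  <or id="O1">
    <and id="A1">
      <form>TEI Lite</form>
      <identifier scheme="URL">http://www-tei.uic.edu/orgs/tei/intros/teiu5.tei</identifier>
    </and>
    <and id="A2">
      <form>HTML</form>
      <or id="O2">
        <identifier scheme="URL">http://www-tei.uic.edu/orgs/tei/intros/teiu5.html</identifier>
        <identifier scheme="URL">http://www-tei.uic.edu/orgs/tei/intros/teiu5.split.html</identifier>
      </or>
    </and>
  </or>
</citation>
"""

def simple_records(elem):
    """Return the simple records for elem, each a list of (name, value) pairs."""
    if elem.tag == "or":
        # Disjunction: pool the alternatives offered by every child.
        return [rec for child in elem for rec in simple_records(child)]
    if elem.tag in ("and", "citation"):
        # Conjunction: take one alternative from each child and concatenate.
        alternatives = [simple_records(child) for child in elem]
        return [[pair for part in combo for pair in part]
                for combo in product(*alternatives)]
    # Any other element is a single metadata field.
    return [[(elem.tag, (elem.text or "").strip())]]

records = simple_records(ET.fromstring(RECORD))
for record in records:
    print(record)
```

Run on the sample, this yields three simple records, one per identifier, each carrying the title and both authors.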
It might be suggested (I suggested it myself, in the first draft of these notes) that we don't really need the <or> element everywhere it occurs in the example just given. It would be clear enough simply to write
<and id=A2>
  <form>HTML</form>
  <identifier scheme='URL'>http://www-tei.uic.edu/orgs/tei/intros/teiu5.html</identifier>
  <identifier scheme='URL'>http://www-tei.uic.edu/orgs/tei/intros/teiu5.split.html</identifier>
</and>
since the URL is a characteristic of the realization, not the work, and in general two identifiers in the same scheme will always refer to distinct realizations. They thus might as well be regarded as forming a sort of automatic, implicit or-group.
On this view, some elements are implicitly and-ed together when they repeat: author, for example. Some elements (e.g. identifier, form) are intrinsically incapable of being and-ed together and thus form implicit or-groups. Some elements can go either way: multiple titles may all apply, or they might apply each to one particular instantiation of the work (the French title applies to the French version, the English title to the English version of a European regulation, which might need juridically to be treated as a single work, since all national-language versions have equal authority).
On the whole, it seems better not to make too much of this generalization -- though it might be a useful heuristic for plausibility checking. Since some elements can go either way, we will need <and> and <or> (or rather, their logical equivalents: I am not proposing actual elements here, just pointing out the need for elements with conjunctive and disjunctive meaning) regardless, and using such elements explicitly seems simpler and less confusing than hard-wiring so much intelligence into software.
It also is worth pointing out that the initial premise of this section is false: two URLs may very easily point to the same object, and it is easy to imagine methods of describing formats which would allow multiple names to be applied to the same format (just as in some programming languages the same data type can be referred to by multiple names).
There are three ways to treat inheritance of metadata from other records. We can insist that the inherited metadata never include the same elements as are present locally, or we can specify that locally specified elements override inherited elements of the same name, or we can attempt to specify some method of merging the two records so as to keep all the information from both records, by and-ing or or-ing corresponding elements together. In the first event, it may be overkill to speak of `inheritance'; in the latter, we may be reintroducing all the problems of repeating groups.
If we take the first or the second approach (or even the third approach, as long as we provide a simple rule, such as "All inherited data is and-ed together with local data"), we will be able to interpret a reference to external metadata fairly rigorously:
By making liberal use of references to other records, we can do without <and> and <or> elements. We can demonstrate this by giving a method of transforming records with <and> and <or> into sets of records linked by reference. For each SGML element in the source record, we do the following:
inheritance
Or perhaps it would be clearer to put it this way: We begin by copying the entire <citation> element and all its children into a new record, which we then process as follows:
The sample record from the previous section would turn into the following set of records:
(http://www.meta.org/catalog/c):
<citation>
  <title>TEI Lite: An Introduction to Text Encoding for Interchange</title>
  <author>Lou Burnard</author>
  <author>C. M. Sperberg-McQueen</author>
</citation>
(http://www.meta.org/catalog/a1):
<citation id=A1>
  <relation scheme='URL' type='OtherType' othertype='inherits'>http://www.meta.org/catalog/c</relation>
  <form>TEI Lite</form>
  <identifier scheme='URL'>http://www-tei.uic.edu/orgs/tei/intros/teiu5.tei</identifier>
</citation>
(http://www.meta.org/catalog/a2):
<citation id=A2>
  <relation scheme='URL' type='OtherType' othertype='inherits'>http://www.meta.org/catalog/c</relation>
  <form>HTML</form>
</citation>
(http://www.meta.org/catalog/h1):
<citation>
  <relation scheme='URL' type='OtherType' othertype='inherits'>http://www.meta.org/catalog/a2</relation>
  <identifier scheme='URL'>http://www-tei.uic.edu/orgs/tei/intros/teiu5.html</identifier>
</citation>
(http://www.meta.org/catalog/h2):
<citation>
  <relation scheme='URL' type='OtherType' othertype='inherits'>http://www.meta.org/catalog/a2</relation>
  <identifier scheme='URL'>http://www-tei.uic.edu/orgs/tei/intros/teiu5.split.html</identifier>
</citation>
More work is needed here, I think, both to specify how to interpret the record when the same element occurs both locally and in the referenced object, and to specify what constitutes the same element.
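Under the simplest of the rules mentioned above ("All inherited data is and-ed together with local data"), resolving a chain of references is straightforward. The Python sketch below illustrates only that one rule; it does not settle the open questions about overriding or about what counts as the same element. The dictionary layout, the `inherits` key, and the function name `resolve` are assumptions made for this example.

```python
# The linked records from the example, keyed by their (example) URLs.
# "inherits" holds the URL named in the record's <relation> element, if any.
RECORDS = {
    "http://www.meta.org/catalog/c": {
        "inherits": None,
        "fields": [
            ("title", "TEI Lite: An Introduction to Text Encoding for Interchange"),
            ("author", "Lou Burnard"),
            ("author", "C. M. Sperberg-McQueen"),
        ],
    },
    "http://www.meta.org/catalog/a2": {
        "inherits": "http://www.meta.org/catalog/c",
        "fields": [("form", "HTML")],
    },
    "http://www.meta.org/catalog/h2": {
        "inherits": "http://www.meta.org/catalog/a2",
        "fields": [
            ("identifier",
             "http://www-tei.uic.edu/orgs/tei/intros/teiu5.split.html"),
        ],
    },
}

def resolve(url, records):
    """Return the simple record for url, and-ing inherited and local fields.

    Inherited fields are placed before local ones; nothing is overridden.
    """
    fields = []
    seen = set()
    while url is not None:
        if url in seen:
            raise ValueError(f"circular inheritance at {url}")
        seen.add(url)
        record = records[url]
        fields = record["fields"] + fields
        url = record["inherits"]
    return fields

for name, value in resolve("http://www.meta.org/catalog/h2", RECORDS):
    print(name, "=", value)
```

Resolving the last record in the chain reproduces the simple record T & A1 & A2 & F2 & I3 from the earlier discussion. An override rule would instead drop inherited fields whose names also occur locally, which is exactly the point at which a definition of "the same element" becomes necessary.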
An adequate syntax for multiple versions (realizations) of the same work (intellectual content) requires an explicit semantic interpretation, to avoid hopeless ambiguity. If we provide our syntax with mechanisms for both disjunctive and conjunctive groupings (and-groups and or-groups), we can provide simple rules for interpreting complex records in terms of their disjunctive normal form.
More complex semantic formalisms, using existential quantifiers, may also be defined, but do not require any syntax more elaborate than the simpler semantics.
[1] A formula in sentential logic is in disjunctive normal form if it is a disjunction (or alternation, or or-group) of one or more terms, and if each term is a conjunction of one or more primitive sentences or their negations. No nested expressions are allowed. For a fuller discussion, any book on formal logic may be consulted, but perhaps the best discussion of disjunctive normal form and the algebraic manipulations used to achieve it may be found in W.V. Quine, Methods of Logic, 4th ed. (Cambridge, Mass.: Harvard University Press, 1982).