[This local archive copy is from the official and canonical URL, http://www.ornl.gov/sgml/wg8/document/1998.htm; please refer to the canonical source document if possible.]
Schloss/Newcomb Correspondence on Metadata

SOURCE: Steve Newcomb, with comments from Robert Schloss
PROJECT: Metadata Workshop, Paris
PROJECT EDITOR:
STATUS: Summary of e-mail conversations
ACTION: For information
DATE: 29 June 1998
DISTRIBUTION: SC34 and Liaisons
REFER TO:
REPLY TO: Dr. James David Mason

Date: Fri, 26 Jun 1998 14:44:33 -0500
From: "Steven R. Newcomb" <srn@techno.com>
Subject: Schloss/Newcomb correspondence
To: metadata@gca.org
Message-id: <199806261944.OAA03963@bruno.techno.com>
X-UIDL: 68e30dfb18505b0cf56129bd1c4e9a1e
Dear Paris Metadata Summit Participants,
Once this list was set up (sorry about the long delay), I wrote to Bob
and asked whether I should send the fruit of our labor to understand
each other to the list. Bob responded:
> I am now convinced that there are situations where AFs should be
> used and others where namespace prefixes are better. I was hoping
> to write all this out to share with you and others, but that has not
> happened. And in 30 minutes I disappear for one month of vacation
> in Israel with my wife and son. I know that when I come back, I
> won't have a lot of time to pursue this until mid-August at the
> earliest.
> If you wish to post our correspondence, that is okay, but you should
> probably add a note that says "Bob has done some additional thinking
> but was unable to continue the discussion before he left to go out
> of town until the end of July".
So, please consider the above notes added. Maybe Bob will find the
time to share his further thoughts with us in the relatively near future.
-Steve
--
Steven R. Newcomb, President, TechnoTeacher, Inc.
srn@techno.com http://www.techno.com ftp.techno.com
voice: +1 972 231 4098 (at ISOGEN: +1 214 953 0004 x137)
fax +1 972 994 0087 (at ISOGEN: +1 214 953 3152)
3615 Tanner Lane
Richardson, Texas 75082-2618 USA
********************************************************************************
This first installment is what I was trying to articulate at the end
of the meeting in Paris. It has been edited according to Bob's
instructions. (I'm sparing you all the correspondence that went into
preparing this.)
There is more correspondence between Bob and me to share with you, but
I don't want to send it unless there is some expression of interest in
it. The rest of it is pretty much devoted to an explanation of how
architectural forms work, in the form of answers to Bob's pointed
questions about them. --SRN
********************************************************************************
Some Observations Apropos the Metadata Summit in Paris, May 22, 1998
Steven R. Newcomb
As currently drafted, RDF uses a single standard algorithm to convert
metadata represented in an XML document (using a vocabulary from a
number of declared namespaces) into a queryable resource (tuples/property
graphs). The fact that there is a single algorithm for generating, in
effect, an API to the metadata objects, imposes many constraints on
both the interchange architecture of the metadata (the DTDs or other
schema representations of the structure of metadata information in the
form in which it is normally interchanged, as XML instances) and also
on the API to the information being interchanged by that architecture.
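To make the idea of "a single standard algorithm" concrete, here is a toy sketch in Python. It is not the algorithm in the RDF draft. It is only an illustration of what any one-size-fits-all conversion from XML into tuples looks like. The conversion rule, the function name to_triples, and the "_:n1"-style node labels are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET

def to_triples(xml_text):
    """Flatten an XML document into (subject, property, value) tuples.

    Hypothetical rule, applied identically to every vocabulary:
    a leaf child becomes a property whose value is its text; a
    non-leaf child becomes a property whose value is a new node.
    """
    root = ET.fromstring(xml_text)
    triples = []
    counter = 0

    def visit(elem, subject):
        nonlocal counter
        for child in elem:
            if len(child) == 0:
                # Leaf element: its text is the property value.
                triples.append((subject, child.tag, (child.text or "").strip()))
            else:
                # Compound element: invent a node and recurse into it.
                counter += 1
                node_id = f"_:n{counter}"
                triples.append((subject, child.tag, node_id))
                visit(child, node_id)

    visit(root, "_:root")
    return triples

doc = ("<doc><Creator><Name>Bob Schloss</Name>"
       "<Email>schloss@watson.ibm.com</Email></Creator></doc>")
for triple in to_triples(doc):
    print(triple)
```

Because the rule is fixed, the shape of the tuples (the effective API) is dictated entirely by the shape of the markup, which is the constraint described above.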
The conversion of XML instances into what are effectively APIs to
their information content is something that every XML application must
do, in one way or another. However, because there is no limit to the
variability of the kinds of information conveyed in XML documents,
there can never be a single algorithm that will convert instances
conforming to every interchange architecture (DTD) into the most
useful and minimal API to the meaning of instances conforming to that
architecture. As an example, consider the XLink interchange
architecture. Any really useful API to the meaning of an XLink would
be able to provide, among other things, reports as to the anchor
status of information nodes that might not even be in the same
document as the XLinks. Therefore, a special API, written to provide
useful access to the meaning of XLinks, is needed. It is hard to
imagine how any algorithm could generate such an API, given only the
schema of the XLink interchange architecture or an XLink element type
definition. (Masatomo Goto of Fujitsu Labs developed a Property Set
for XLink in order to build the XLink engine he demonstrated at SGML
Europe '98. A diagram of that Property Set exists; see
"References", below.)
At this time, RDF's designers are working to the requirements of a few
popular metadata architectures. It is expected that these metadata
architectures can be constrained in such a way that a single algorithm
can generate a useful API directly from the architectures.
Although there is nothing fundamentally wrong with the current RDF
approach, given its limited requirements, that approach is
profoundly suboptimal when considered in the larger context. To
understand the larger context, we must recognize that practically
everything that will be done with XML, including all of its draft and
proposed semantic enhancements (including XLink and XPointers), is
best realized as a pair of distinct formal expressions:
(1) the document type definition (DTD) or other schema that is the
formal expression of the *interchange syntax* of the architecture,
and
(2) the Property Set or other schema that is the formal expression of
the *abstract API to the information conveyed by instances that
conform to the architecture*. (If you can imagine having the
ability to add a module to the DOM for each interchange
architecture, so that there are now additional objects that
reflect the semantic phenomena expressible in that interchange
architecture, you can understand what a Property Set is.
Unfortunately, the current draft of the DOM is not set up to
support this, but it could be, while still meeting or exceeding
all of its current objectives and requirements. First of all,
there needs to be a Property Set for XML itself, and, not
surprisingly, such a Property Set is now being developed. But
that's another story.)
In general, interchange architectures for conveying rich semantics
really need both a DTD and a Property Set, because in the general case
it is not possible to generate useful Property Sets from DTDs using
any single algorithm. In the case of RDF, the limitations of the
algorithm used (to implicitly generate what amounts to a Property Set)
impose constraints on the complexity of the metadata information that
can be interchanged, and they also impose inconvenience on developers
of software intended to make use of the information thus conveyed.
Here are the implications for RDF of adopting the twin notions of
Architectural Forms and Property Sets as its new basis:
* The structure and complexity of interchange architectures used for
metadata will no longer necessarily be constrained. (I, Steve,
think this flexibility is goodness -- that there should not be any
constraints except for those which were consciously and voluntarily
designed into each architecture to meet its own interchange,
software reuse, learnability, reliability, and other requirements.
Some of the RDF folks think that the blanket constraints on metadata
structure now provided by RDF will maximize software reuse (at
search engines, for example) and learning among Web users,
programmers, and content developers. This idea certainly has merit.
However, I would prefer simply to allow information architects the
flexibility to maximize the naturalness of the expression of
metadata. For me, naturalness is simplicity. Blanket structural
constraints usually have the effect of requiring architects to
employ greater complexity in order to express the same information,
and this added complexity often decreases learnability and increases
the difficulty of implementation.)
* The complexity of the semantics of supportable metadata will no
longer necessarily be constrained. (I, Steve, think this
flexibility is goodness -- that the full unboundedness of the set of
possible metadata semantics should be supportable at some level.
Some of the RDF folks think constraining metadata semantics will
maximize software reuse (at search engines, for example) and
learning among Web users, programmers, and content developers.
These desirable goals seem to me better served by limiting the scope
of RDF to a certain list of metadata semantics. Among other things,
such a list could be an invaluable resource for implementers by
clarifying which RDF architectures (vocabularies) share which
equivalent semantics. A good way to express such a list of
semantics is to create a property set for RDF.)
* Reusable software engines for the semantic processing of instances
conforming to particular interchange architectures become practical
and extremely cost-effective. Each such engine is responsible for
processing the interchangeable XML form of the information in such a
way as to generate a "grove" (an object graph whose schema is the
relevant Property Set) from any XML instance that conforms to the
architecture. The fact that the engine is reusable means that it
can mature and offer reliable semantic services in a variety of
application contexts. The cost of developing applications is
reduced, as is their time-to-market, and their reliability improves.
* The design of any given metadata architecture will require more work
and more careful thought. (I, Steve, think this is goodness. The
W3C people believe that the RDF data model will require slightly
less work and less thought when a new metadata schema is defined,
and this reduction in effort is beneficial. I, by contrast, believe
that each interchange architecture should maximize the
appropriateness of its design to the nature of the information it
models, and that each Property Set should maximize the convenience
of applications developers. This way, the semantic processing that
is common to all applications of a given architecture is supportable
by a reusable engine. I believe the distinct formal expressions
(both schemas/DTDs and Property Sets) that result from the added
design effort will pay handsome rewards in terms of increased
reliability of applications and decreased cost of information
interchange.)
* RDF's supporting formalisms and application integration mechanisms
need not differ from those used to support any other information
interchange architecture, for any purpose. Less is more.
* The overhead of supporting any given application's use of any
metadata architecture need never be more than it would have been
under the current RDF proposal. It may often turn out to be less,
because the popularity of certain interchange architectures may
encourage the development of highly specialized and/or optimized
engines for supporting them.
* Software vendors will be able to demonstrate conformance to the
semantics of metadata architectures, and purchasers will be able to
verify that conformance.
* "Namespaces" are entirely replaced by the use of architectural
forms. (An "architectural form" is an element type definition in a
DTD used as what RDF now calls a "namespace" or maybe a "vocabulary
resource". In ISO jargon, an element that conforms to an
architectural form is said to be a "client element" of the
referenced DTD resource or "namespace".)
* At least some known problems with the current design of RDF will be
resolved. One of these is that when an application expects that the
value of a particular property is a simple string, but the metadata
instance received actually has a compound expression using tags from
another vocabulary, RDF is as yet unclear about how the compound
expression will be manipulated in order to supply a simple string.
Please see the attached discussion entitled: "A Known Problem with
RDF, Resolved by Architectural Forms"
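The "grove" mentioned in the list above (an object graph whose schema is the relevant Property Set) can be pictured with a small Python sketch. The node classes "description" and "person", their property names, and the helper make_node are all invented for illustration. They are not taken from any actual Property Set.

```python
from dataclasses import dataclass, field

# Hypothetical Property Set: node class -> names of the properties it exhibits.
PROPERTY_SET = {
    "description": {"creator"},
    "person": {"name", "email"},
}

@dataclass
class Node:
    node_class: str
    properties: dict = field(default_factory=dict)

def make_node(node_class, **properties):
    """Create a grove node, rejecting anything the Property Set does not define."""
    if node_class not in PROPERTY_SET:
        raise ValueError(f"unknown node class: {node_class}")
    unknown = set(properties) - PROPERTY_SET[node_class]
    if unknown:
        raise ValueError(f"{node_class} has no properties {sorted(unknown)}")
    return Node(node_class, dict(properties))

# A two-node grove: a description whose creator property points at a person node.
person = make_node("person", name="Bob Schloss", email="schloss@watson.ibm.com")
description = make_node("description", creator=person)
print(description.properties["creator"].properties["name"])
```

An application written against such a schema addresses "the creator's name" directly, without knowing which element type names or attribute names carried that information in the interchanged document.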
In the larger context, assuming that architectural forms and property
sets are widely used with XML, there will be the following additional
consequences:
* Metadata queries can occur inside any other kind of query, and any
other kind of query can occur inside a metadata query. There is
already a query language, SDQL (Standard Document Query Language)
that will work for all architectures, and not just metadata
architectures. Since it conceptually queries groves, and since
groves can be generated from any notation for which there is a
property set, the same query language can be used to provide
addressing and linking services to non-XML notations. In other
words, by making everything appear to conform to the grove/property
set object model, everything becomes addressable in its own most
convenient terms, including the things that were only implicit in
their interchangeable forms. [Note: The primary significance of
Eliot Kimber's "PHyLIS" demo during the meeting was that this idea
of groves and property sets actually works. In that demo, we saw a
totally grove-based integration of XML documents and CGM documents,
with XLink-style extended links providing traversal services between
the objects in the groves of both kinds of documents.]
* XML documents will be able to contain elements that can be processed
in accordance with several interchange architectures simultaneously.
Such elements can be said to exhibit the semantic equivalent of
multiple inheritance. Information interchange architectures that
overlap semantically can nonetheless be harmonized in an instance
that uses all of them, without repeating any information, even if
they use conflicting element type names and attribute names. The
implications of this harmonizability are enormously beneficial for
E-commerce, among other things.
********************************************************************************
Some notes:
********************************************************************************
What ISO's SGML Extended Facilities calls...
a "base architecture", or
an "(information-interchange-)enabling architecture", or
(when referring to the formal machine-processable model of the
interchange syntax of a base architecture) a "meta-DTD",
...is meant to fulfill the same roles and requirements (and more) as
what the RDF draft calls...
a "namespace", or
a "vocabulary", or
a "Scheme".
Similarly, what ISO calls...
a "property set"
...the RDF draft calls...
a "Scheme".
What ISO calls...
a "grove" (acronym: Graph Representation Of property ValuEs),
...the RDF draft calls...
a set of "3-tuples", or
a graph.
Within the conceptual frameworks of ISO "groves" and RDF "graphs," the
terms "node", "property", and "arc" appear to have the same meanings
in both the SGML Extended Facilities and in the RDF draft, at least
for purposes of this discussion.
RDF has no *general* element subtyping or "semantic load inheritance"
facility, but RDF *does* provide a facility called "namespaces" which
allows a (metadata) element to declare that it should be considered to
convey the same kind of information as an element of a certain type in
one of several popular DTDs (or other schema-like things) for metadata
documents. The sets of names that are referenceable in the schema-like
things that contain the names of the inherited element types are
called "vocabularies". (A vocabulary need not be a DTD, because there
is no actual architectural subtyping or checking. In the current
draft of RDF, a tag set is really all that is required.)
*******************************************************************
"A Known Problem with RDF, Resolved by Architectural Forms"
*******************************************************************
At least some known problems with the current design of RDF are
readily resolved by the architectural forms paradigm. One of these is
that when an application expects that the value of a particular
property is a simple string, but the metadata instance received
actually has a compound expression using tags from another vocabulary,
RDF is as yet unclear about how the compound expression will be
manipulated in order to supply a simple string.
For example, in the fragment below, if the content of
<RDF:Description> is supposed to be a simple string, what does that
string turn out to be?
<DC:Creator>
  <RDF:Description>
    <IBMPerson:Name>Bob Schloss</IBMPerson:Name>
    <IBMPerson:Email>schloss@watson.ibm.com</IBMPerson:Email>
  </RDF:Description>
</DC:Creator>
Another, perhaps more general, way to put the problem is this: "What
do we do about the content of an element whose semantic is borrowed
from one namespace, when its content's semantics are borrowed from one
or more other namespaces?"
In order to understand the several solutions that the "architectural
form paradigm" brings to the above puzzle, it is first necessary to
understand that a single instance of an element can conform to several
architectural forms. Since an element instance can have only one
generic identifier, it is impractical to use the generic identifier to
specify all the architectures (such as DC, RDF and IBMPerson) to which
that single element conforms.
The biggest syntactic difference between architectural forms and
namespaces is in their use of the generic identifier (the "generic
identifier" is the name of the element type that always appears as the
first string found in any element instance's start tag). Namespaces
use the generic identifier to specify both the architecture and a
particular semantic-laden name within the architecture. Because there
can be only one generic identifier in any element instance, the syntax
of namespaces effectively prohibits a single element instance from
declaring its author's intention that it be processable in terms of
more than one namespace. By contrast, the syntax of architectural
forms does not constrain the generic identifier in any way; indeed,
the generic identifier is pretty much ignored, for purposes of
architectural processing. As far as architectural processing is
concerned, the main purpose of the generic identifier is to provide a
hook for markup minimization. The generic identifier is relegated to
a role in which it serves as a kind of macro call: it brings in the
default values of all the attributes declared in the DTD, if any, for
that element type, as we'll see shortly.
The syntax of architectural forms is actually simpler than the syntax
of namespaces; there is no new syntactic separator (":") required,
generic identifiers are not split up into fields, and there are no new
constraints on generic identifiers at all. Each architecture
is referenced by means of an attribute name, and the value of that
attribute is the name of the element type within the architecture to
which the element is claiming both syntactic conformance and semantic
equivalence. In other words, what in namespace syntax would be
expressed as:
<DC:Creator>...</DC:Creator>
might become:
<foo DC="Creator">...</foo>
Therefore, it becomes possible for a single element to claim
conformance with more than one architecture:
<foo DC="Creator" LCCC="Author">...</foo>
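As a rough sketch of how software might read these claims, the Python fragment below collects the architectural forms that one element declares. It assumes that the attribute names "DC" and "LCCC" have already been declared as architectural form attributes for the document. The function name claimed_forms is made up for this example.

```python
import xml.etree.ElementTree as ET

# Assumption: these attribute names were declared as architectural
# form attributes, one per architecture used by the document.
ARCHITECTURES = ("DC", "LCCC")

def claimed_forms(elem, architectures=ARCHITECTURES):
    """Map each architecture to the form this element claims in it."""
    return {
        arch: elem.get(arch)
        for arch in architectures
        if elem.get(arch) is not None
    }

elem = ET.fromstring('<foo DC="Creator" LCCC="Author">...</foo>')
print(elem.tag, claimed_forms(elem))
```

Note that the generic identifier ("foo") plays no part in the lookup. Only the attributes matter.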
The following is a digression (but nonetheless a significant
digression) about markup minimization: the above looks pretty verbose,
and, given the reasonable expectation of decentralized control over
metadata architectures, and the increasing need for documents to be
useful in a variety of contexts, verbosity may get a lot worse. For
example:
<foo DC="Creator" LCCC="Author" DEA="Officer" NAWCAD="TextAuth"
USGS="Surveyor" Ford="ietmAuthor" Paramount="Creator">...</foo>
We can completely conquer this verbosity by using a DTD to cause all
the architectural form attributes to be present and to have the
necessary values by default, for all instances of the element type
"foo":
<!ELEMENT foo - - ( whatever )>
<!ATTLIST foo
    DC        NAME "Creator"
    LCCC      NAME "Author"
    DEA       NAME "Officer"
    NAWCAD    NAME "TextAuth"
    USGS      NAME "Surveyor"
    Ford      NAME "ietmAuthor"
    Paramount NAME "Creator"
>
Now the same element instance can be expressed as:
<foo>...</foo>
and still be processed in terms of all those different architectures
in exactly the same way, because all the architectural form attributes
are still implicitly present, and they will be reported by the parser
as if they were explicit.
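The defaulting behavior can be observed with an ordinary XML parser. The sketch below uses Python's xml.etree.ElementTree, relying on the underlying expat parser applying attribute defaults declared in the internal DTD subset. The declarations are rewritten in XML syntax (NMTOKEN in place of the SGML NAME type, and no element declaration), which is an adaptation made for this example and not the form shown above.

```python
import xml.etree.ElementTree as ET

# The DTD fragment above, adapted to XML syntax and placed in the
# internal subset so that a non-validating parser will read it.
DOC = """<!DOCTYPE foo [
  <!ATTLIST foo
    DC        NMTOKEN "Creator"
    LCCC      NMTOKEN "Author"
    DEA       NMTOKEN "Officer"
    NAWCAD    NMTOKEN "TextAuth"
    USGS      NMTOKEN "Surveyor"
    Ford      NMTOKEN "ietmAuthor"
    Paramount NMTOKEN "Creator">
]>
<foo>...</foo>"""

elem = ET.fromstring(DOC)

# The start tag carries no attributes, yet the parser reports the
# defaulted architectural form attributes as if they were explicit.
for name, value in sorted(elem.attrib.items()):
    print(name, "=", value)
```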
(Note: XML documents that do not have DTDs cannot take advantage of
this technique, but they can still take full advantage of the
architectural form paradigm. The only difference is that such
documents must specify, in each element instance, all the
architectural form attributes needed to process that element in
terms of all the desired architectures. As we have just seen,
doing without a DTD can make documents that use architectural forms
extremely verbose. It's exactly like the question of whether
(a) to store a PostScript document with fonts that describe each
glyph's curve set, and then reference the glyphs whenever they
are to be used, or
(b) to store each glyph as an explicit set of curves.
If the document contains only a dozen characters, it may be more
sensible not to include the font(s) from which they were selected,
and simply to be explicit about the curves that make up each glyph.
If the document contains many characters, a huge efficiency
advantage is gained by including the font and referencing the
glyphs in the font by means of the characters. Similarly, if we
include a DTD with our document, we can, in effect, reference any
number of attributes and their default values simply by uttering an
element's generic identifier (<foo>, in the example above). If we
have a lot of elements in our document, using a DTD offers a big
efficiency advantage. But it's not strictly necessary to use a
DTD.
It should also be noted that it's not strictly necessary to include
a DTD with every document, even if you wish to use one. It's only
necessary that the recipient of your document also have a copy of
the same DTD (or something with equivalent ability to drive the
parsing process) that you intend the document to be used with.
Again, it's exactly like the situation with fonts in PostScript:
you don't have to include the font in a PostScript document if you
know that the recipient's printer has that font already inside it
(or can load it).)
It is also not always necessary to be explicit, even in a DTD, about
all the architectures to which an element conforms, if one
architectural form is already a subtype of another. For example, we
can take advantage of the fact that the NAWCAD architecture declares
its "TextAuth" architectural form (remember that "architectural form"
== "element type") as a subtype of the "Creator" architectural form in
the "DC" architecture. Assuming that the NAWCAD architecture's DTD
contains:
<!ELEMENT TextAuth - - (whatever)>
<!ATTLIST TextAuth
    DC NAME #FIXED "Creator"
>
...then every NAWCAD <TextAuth> is by definition also a DC <Creator>.
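One way to picture this derivation is as a lookup that expands the forms an element claims explicitly with the forms implied by #FIXED attributes in the architectures' own DTDs. The Python sketch below is only a model of that reasoning. The table FIXED_FORMS and the function all_forms are hypothetical and do not correspond to any facility defined in the standard.

```python
# Hypothetical summary of #FIXED architectural form attributes found
# in the architectures' DTDs:
# (architecture, form) -> {supertype architecture: supertype form}
FIXED_FORMS = {
    ("NAWCAD", "TextAuth"): {"DC": "Creator"},
}

def all_forms(explicit):
    """Expand explicitly claimed forms with those implied by subtyping."""
    result = dict(explicit)
    pending = list(explicit.items())
    while pending:
        arch, form = pending.pop()
        for sup_arch, sup_form in FIXED_FORMS.get((arch, form), {}).items():
            if sup_arch not in result:
                result[sup_arch] = sup_form
                # A supertype may itself be a subtype of something else.
                pending.append((sup_arch, sup_form))
    return result

# An element that mentions only NAWCAD is nonetheless a DC Creator.
print(all_forms({"NAWCAD": "TextAuth"}))
```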
********************************************************************
********************************************************************
** In the architectural form paradigm, the rule is: An instance **
** of an element that claims conformance to any architectural **
** form may not violate any of the constraints on the **
** architectural form to which it presumably conforms. **
********************************************************************
********************************************************************
(Note: This simple rule helps to dramatize the differences between
W3C Namespaces, as presently constituted, and the architectural
forms paradigm.
* The rule applies to an element's *context*, in that no element
can appear where its architectural context (the architectural
forms of its surrounding elements) would not allow it to appear.
By contrast, the W3C Namespace paradigm does not constrain a
namespace-referencing element's context to make sense in terms of
the architecture's (or, rather, the Namespace's) constraints.
* The rule applies to an element's *content,* in that no
architectural elements can appear inside it unless those
architectural elements are permitted by the architecture. Again,
in the Namespace paradigm, no such constraints are placed on
namespace-referencing element types.
(Note: the above are two aspects of the same idea: that an
element's content must be consistent with all of the
architectural forms to which it declares conformance. If you have
guessed by now that the document element must always conform to
the architectural forms of the document elements of all the
architectures used in the document, you guessed correctly.)
* The rule also applies to the element's *attributes,* in that any
attributes that are required by the architecture must be present
in the element instance, and if they are not required and not
present, they are assumed to have their architecturally-defined
default values and/or #IMPLIED effects on applications of that
architecture. If there are attributes present that do not appear
in the architecture, they are ignored. The presence of such
non-architecturally-defined attributes is regarded as implying
additional constraints, but not as violating any existing
constraints. No architecture has the authority to prevent
additional, non-architectural attributes from appearing on
elements. From each architecture's perspective, the attributes
that are present but not defined by the given architecture are
invisible.
* Finally, the rule applies to any *other constraints* on element
content and attributes, even if they cannot necessarily be
detected by a generic parser. These are detectable by any
validating semantic processor engine for that architecture. For
example, the HyTime varlink architecture (from which XLink was
derived) does not allow the number of anchors to exceed 2 unless
the "manyanch" option is supported and is specified with no value
or a value greater than "2". No generic parser can check the
conformance of an element to this constraint, but a validating
XLink or varlink architecture processing engine can. When we
consider the boundless variety of architectures, we must admit
that there is probably a boundless variety of such constraints,
and the best way to handle them is to relegate all
architecture-specific constraint checking to a re-usable engine
for that architecture.)
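The attribute part of the rule can be sketched in a few lines of Python. The attribute names "role" and "lang", the REQUIRED marker, and the function architectural_attributes are invented for illustration. A real architecture's DTD would supply the actual declarations.

```python
# Hypothetical attribute declarations for one architectural form:
# attribute name -> default value, or REQUIRED if there is none.
REQUIRED = object()
FORM_ATTRIBUTES = {"role": REQUIRED, "lang": "en"}

def architectural_attributes(instance_attrs, declared=FORM_ATTRIBUTES):
    """Return the attributes of an element as one architecture sees them."""
    view = {}
    for name, default in declared.items():
        if name in instance_attrs:
            view[name] = instance_attrs[name]
        elif default is REQUIRED:
            raise ValueError(f"required architectural attribute {name!r} is missing")
        else:
            # Absent and not required: the architectural default applies.
            view[name] = default
    # Attributes the architecture does not define are never copied,
    # so they are invisible from this architecture's perspective.
    return view

print(architectural_attributes({"role": "author", "color": "red"}))
```

Constraints that go beyond attribute presence (such as a limit on the number of anchors) would need the same kind of architecture-specific code, which is the argument for packaging it in a reusable engine.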
Since the above NAWCAD DTD fragment constrains all NAWCAD <TextAuth>
elements to conform to all the constraints and requirements of DC
<Creator> elements, it is therefore unnecessary to mention the "DC"
architectural form attribute in the <foo> element, because it is
already there! By definition, a subtype always conforms to the
constraints and requirements of its supertype(s).
In a NAWCAD-oriented application, <foo>'s "NAWCAD" attribute means not
only that our <foo> element can be extracted into a valid NAWCAD
document as a valid <TextAuth> element, but also that it can be
extracted into a valid DC document as a valid <Creator> element. (In
the jargon of the SGML Extended Facilities, we say that the output of
the parser, conceptually speaking, includes a "grove" -- a parse tree
-- for each of the architectures used by the document. There is no
requirement that any application actually produce groves; groves are a
concept developed to explain, in abstract terms, the effects of
parsing, processing, and component addressing.)
Now, having accumulated the necessary background information, let's go
back to the original question that necessitated all the above
explanation: "What do we do about the content of an element whose
semantic is borrowed from one namespace, when its content's semantics
are borrowed from one or more other namespaces?" In the architectural
forms paradigm, this question really should become, "What is the
containing element's role in the contained elements' architecture,
and/or what are the contained elements' roles, if any, in the
containing element's architecture?" In the architectural forms
paradigm, any element instance can play several distinct and
unambiguous roles in as many distinct architectures, so it becomes
possible for the contained elements not only to have IBMPerson-defined
semantics, but also RDF semantics, and DC semantics, too. In fact,
all the elements can have a role to play in every architecture,
provided that when, conceptually speaking, each architectural instance
is extracted from the document, it meets the structural and semantic
constraints imposed by its architecture.
There is more than one way to handle the puzzle, but first, let's see
what happens if we don't take advantage of any of the special
facilities of architectural forms. In the following example:
<auth DC="Creator">
  <authInfo RDF="Description">
    <persName IBMPerson="Name">Bob Schloss</persName>
    <email IBMPerson="Email">schloss@watson.ibm.com</email>
  </authInfo>
</auth>
the <persName> and <email> elements are not architectural with respect
to the RDF architecture. From the RDF architecture's perspective,
therefore, the <authInfo> element looks like this:
<Description>Bob Schlossschloss@watson.ibm.com</Description>
In other words, the markup of the contained non-architectural elements
has been deleted altogether, leaving Bob Schloss with a very strange
surname, indeed.
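The derivation just described can be sketched in Python. This is a simplified model, not the processing defined in the SGML Extended Facilities. It assumes that the architectural form attribute for each architecture has the same name as the architecture, and it treats whitespace-only text between elements as insignificant. The function names arch_view and _data are made up for this example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

DOC = """<auth DC="Creator">
  <authInfo RDF="Description">
    <persName IBMPerson="Name">Bob Schloss</persName>
    <email IBMPerson="Email">schloss@watson.ibm.com</email>
  </authInfo>
</auth>"""

def _data(text):
    """Keep character data; drop whitespace-only text (a simplification)."""
    return escape(text) if text and text.strip() else ""

def arch_view(elem, arch):
    """Serialize elem as seen from the architecture named arch."""
    parts = [_data(elem.text)]
    for child in elem:
        parts.append(arch_view(child, arch))
        parts.append(_data(child.tail))
    inner = "".join(parts)
    form = elem.get(arch)
    if form is None:
        # Non-architectural element: its markup vanishes, its data stays.
        return inner
    return f"<{form}>{inner}</{form}>"

print(arch_view(ET.fromstring(DOC), "RDF"))
```

Run against the example, the sketch yields the same run-together string as the RDF view shown above, because nothing tells the RDF architecture where one contained element ends and the next begins.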
(Digression: Why does it work that way? It's because, in the case
of mixed content (which is not the situation in our puzzle
example), the deletion of non-architectural markup still leaves the
data in pretty good shape. For example:
<authInfo RDF="Description">Bob Schloss's e-mail address is
<email>schloss@watson.ibm.com</email>, but you can also use
<email>rschloss@us.ibm.com</email>.</authInfo>
becomes, from RDF's perspective:
<Description>Bob Schloss's e-mail address is
schloss@watson.ibm.com, but you can also use
rschloss@us.ibm.com.</Description>
To handle cases other than mixed content, such as our puzzling
example, there is no one algorithm that can be automatically
applied in such a way as to give universally acceptable results.
In any case, no such algorithms are built into the SGML Extended
Facilities.)
Probably the best way to handle the puzzle of how to make the
<RDF:Description> element get back a simple string is *not* to give it
a simple string, but instead to make the contained elements meaningful
in RDF terms, as well as IBMPerson terms. For example:
<auth DC="Creator">
  <authInfo RDF="Description">
    <persName IBMPerson="Name" RDF="PersonName">Bob Schloss</persName>
    <email IBMPerson="Email" RDF="PersonEmail">schloss@watson.ibm.com</email>
  </authInfo>
</auth>
Note that in the above example, I've taken the liberty of equipping
the RDF architecture with the architectural forms <PersonName> and
<PersonEmail>. Obviously, very few people will have the authority to
do any such thing, so I'm assuming that the creators of RDF
anticipated this particular need and provided these architectural
forms, and all I needed to do was reference them. I can do that
without affecting the usefulness of my references to the <Name> and
<Email> forms of the IBMPerson architecture; again, in the
architectural forms paradigm, any element instance can conform
explicitly to architectural forms in more than one architecture.
Now let's imagine that the RDF architecture provides a <PersonName>
architectural form, but not a <PersonEmail> form. We're still ok,
because now, from an RDF architectural perspective:
<authInfo RDF="Description">
  <persName IBMPerson="Name" RDF="PersonName">Bob Schloss</persName>
  <email IBMPerson="Email">schloss@watson.ibm.com</email>
</authInfo>
becomes:
<Description><PersonName>Bob Schloss</PersonName>
schloss@watson.ibm.com</Description>
... and this leaves our RDF engine in a position to at least
distinguish between some well-understood data and some raw data, in
mixed content. At the very least, the boundary between the data
contents of the two contained elements has been preserved.
Now let's imagine that there is neither a <PersonName> nor a
<PersonEmail> in the RDF architecture, and that the string
Bob Schlossschloss@watson.ibm.com
is unacceptably Delphic as the content of an RDF <Description>. What
can we do?
One way to handle the problem is to ignore, from an RDF perspective,
the data content of all but one of the contained elements. For
this, we must turn to one of the deeper facilities of the AFDR: the
"ArcIgnD" (architecture ignore data) architectural control attribute,
which allows us to prevent the data content of an element (i.e., the
data consisting of all of its leaves in the parse tree) from being
considered to be part of the document, from the perspective of any
particular architecture. If, for example, we wanted to ignore the
<persName> element's content for all purposes of RDF processing, we
could say:
<authInfo RDF="Description">
<persName IBMPerson="Name" RDFIgDat="ArcIgnD">Bob Schloss</persName>
<email IBMPerson="Email">schloss@watson.ibm.com</email>
</authInfo>
From an RDF perspective, the above looks like this:
<Description>schloss@watson.ibm.com</Description>
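A small Python sketch may make the effect concrete. It collects only
the character data that the RDF architecture would still see inside
<authInfo>. The function name "seen_data" is invented for this
illustration, and the sketch assumes that an ignore-data setting
applies to an element's descendants until one of them overrides it; it
models only the "ArcIgnD" value and nothing else of the AFDR.

```python
import xml.etree.ElementTree as ET

WITH_IGNORE = (
    '<authInfo RDF="Description">'
    '<persName IBMPerson="Name" RDFIgDat="ArcIgnD">Bob Schloss</persName>'
    '<email IBMPerson="Email">schloss@watson.ibm.com</email>'
    '</authInfo>'
)

# The same markup with the ignore-data attribute removed.
WITHOUT_IGNORE = WITH_IGNORE.replace(' RDFIgDat="ArcIgnD"', '')


def seen_data(elem, ign_att, ignoring=False):
    """Yield the pieces of data under elem that the architecture sees.

    ign_att is the document-chosen name of the architecture's
    ignore-data attribute ("RDFIgDat" in the example). Hypothetical
    helper; inheritance of the setting is an assumption of this sketch.
    """
    setting = elem.get(ign_att)
    if setting is not None:
        ignoring = (setting == "ArcIgnD")
    if elem.text and not ignoring:
        yield elem.text
    for child in elem:
        yield from seen_data(child, ign_att, ignoring)
        # A child's tail is data of elem itself, so elem's setting applies.
        if child.tail and not ignoring:
            yield child.tail


print("".join(seen_data(ET.fromstring(WITH_IGNORE), "RDFIgDat")))
print("".join(seen_data(ET.fromstring(WITHOUT_IGNORE), "RDFIgDat")))
```

The first call prints only the e-mail address; the second prints the
run-together string that made the ignore-data attribute necessary in
the first place.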
To explain the above example, the following is a digression about
"architectural control attributes" and how that example uses them.
The names of all "architectural control attributes" used to control
architectural processing in any document instance are declared in
certain special processing instructions (see "References" below).
There is one processing instruction per architecture. Each such
processing instruction identifies the architecture, and provides,
among other things, the names of the architectural control attributes
whose values will control the architectural processing of each
element. The most basic attribute is the "Architectural Form
Attribute", instances of which have appeared in most of the above
examples (as the "DC", "RDF" and "IBMPerson" attributes). We have
been assuming, in the above examples, that in our document, the name
of the RDF architecture's architectural form attribute is "RDF".
However, it could have been any XML name. Similarly, we have been
assuming that the Dublin Core architecture's architectural form
attribute name is "DC", and the IBMPerson architecture's is
"IBMPerson".
Another architectural control attribute that can be declared in the
same processing instruction is the "Architecture Ignore Data"
attribute. In our above example, we are assuming that for the RDF
architecture, in this document, the name of the "Architecture Ignore
Data" attribute has been declared in the relevant processing
instruction to be "RDFIgDat". In the above example, the value
"ArcIgnD" is an ISO-defined string that means "data is always
ignored."
(Note: The other possibilities are:
"nArcIgnD", which means that data is not ignored, and it is an
error if data occurs where the architecture does not
allow it, and
"cArcIgnD", which means that data is conditionally ignored (data
will be ignored only when it occurs where the architecture
does not allow it).)
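The three values can be summarized as a small decision function. This
is a sketch in Python, not anything defined by the standard: the names
"Outcome" and "data_outcome" are invented here, and whether the
architecture allows data at a given point is reduced to a single
boolean that a real engine would have to work out from the
architecture's content models.

```python
from enum import Enum


class Outcome(Enum):
    KEEP = "data is part of the architectural document"
    IGNORE = "data is silently dropped"
    ERROR = "data is reported as an error"


def data_outcome(setting, data_allowed):
    """Decide what happens to data under one ignore-data setting.

    setting      -- "ArcIgnD", "cArcIgnD" or "nArcIgnD"
    data_allowed -- True if the architecture allows data at this point
                    (an assumption supplied by the caller)
    """
    if setting == "ArcIgnD":
        # Data is always ignored.
        return Outcome.IGNORE
    if setting == "cArcIgnD":
        # Data is ignored only where the architecture does not allow it.
        return Outcome.KEEP if data_allowed else Outcome.IGNORE
    if setting == "nArcIgnD":
        # Data is never ignored; misplaced data is an error.
        return Outcome.KEEP if data_allowed else Outcome.ERROR
    raise ValueError("unknown ignore-data value: %r" % (setting,))


for value in ("ArcIgnD", "cArcIgnD", "nArcIgnD"):
    print(value, data_outcome(value, True).name, data_outcome(value, False).name)
```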
If all this seems rather complex, please remember that the problem of
reliably and smoothly meshing the semantics of multiple namespaces in
a single document is a complex one. Indeed, it is a problem which the
present simplicity of namespaces is unable to cope with, at least in
the general case. There is no requirement that anyone use the
"architecture ignore data" attribute, but it's nice that it's there
when it's really needed and nothing less will do.
There are other architectural control attributes, and there are still
other things that can be declared in the processing instructions that
define architectural control attributes.
(Here ends the digression about architectural control attributes.)
********************************************************************************
Some references
********************************************************************************
Architectural Forms / (Multiple) Inheritance ("Architectural Form
Definition Requirements" or "AFDR"):
http://www.ornl.gov/sgml/wg8/docs/n1920/html/clause-A.3.html
This standard is being amended to provide for XML's use of
architectural forms by means of processing instructions (which XML
supports) instead of #NOTATION attributes (which XML does not
support). See http://www.ornl.gov/sgml/wg8/document/1957.htm for
the details of this amendment.
Property Sets ("Property Set Definition Requirements" or "PSDR"):
http://www.ornl.gov/sgml/wg8/docs/n1920/html/clause-A.4.html
HyTime Property Set (just a good example of a full-featured property
set) http://www.ornl.gov/sgml/wg8/docs/n1920/html/clause-B.html
A Property Set for XLink (in the form of a diagram in PostScript)
ftp://ftp.techno.com/TechnoTeacher/MISC/xllprops2.ps
********************************************************************************
Acknowledgement
********************************************************************************
As the reader may have guessed, this paper would not have been
possible without the patient substantive help of Robert J. "Bob"
Schloss of IBM's Thomas J. Watson Research Center.
- 30 -