[Archive copy mirrored from: http://developer.netscape.com/mcf.html, June 12, 1997. See also: http://www.textuality.com/sgml-erb/w3c-mcf.html.]
R.V. Guha (Netscape Communications)
Tim Bray (Textuality)
This document provides the specification for a data model for describing
information organization structures (metadata) for collections of networked
information. It also provides a syntax for the representation of instances
of this data model using XML, the Extensible Markup Language.
The need for machine-usable descriptions of collections of distributed
information is increasing rapidly. There have been a number of proposals
in the recent past that have made significant steps toward this goal, including
HotSauce MCF, CDF, PICS, and WebCollections.
The existence of multiple proposals reflects the fact that this type
of information is needed for multiple purposes, and that there are many
groups interested in its availability and use. This diversity of effort
is reflected in a diversity of terminology; discussions have been couched
in terms of "metadata", "typing", "schemas", "labels", and "collections,"
while all in fact dealing with the same underlying constructs and problems.
We believe the following principles to be central to making progress
in this area:
1. There is no useful distinction between data and metadata. Every item of
information, without exception, is likely to be regarded by some applications
as ancillary and never to be displayed, and by others as core content that
needs to be formatted, printed, or searched.
2. For interoperability and efficiency, schemata designed to serve different
applications should share as much as possible in the way of data structures,
syntax, and vocabulary.
The consequence of the first principle is that it is simply incorrect to
reserve any special syntax for use just in "metadata".
The second principle is what really drives this proposal. It is inevitable
that there will be a plethora of classes of information about information;
note some of the examples listed above. If they share a common syntax,
this is good, but it is not enough. For example, suppose a mature commercial
word processor package were to offer a "save as XML" format, which exported
an XML representation of its internal document data structures and attributes.
While marginally more open than the processor's native format, this would
not be of any substantial use, because to operate on this file would de
facto require the use of the program which generated it.
To a certain extent this is inevitable - in many cases, data created
for the purposes of a particular application will contain items that are
only meaningful to that application. But the situation can be greatly improved.
If information about information can share a common data model and vocabulary,
it will be possible to query and manage metadata to some degree, even without
fully understanding it.
In this document, we draw upon the features provided in the other proposals
mentioned above, and on other work in this area, to develop a single data
model and corresponding interchange format which can be used for many purposes,
including, for example:
describing the structure of web sites or a set of channels
distributed annotation and authoring
exchanging commerce-related information such as prices, inventories, and
Meta Content Framework (referred to henceforth as MCF) is a structure description
language. The field of structure description languages is well understood
and it is not our desire to reinvent any of it. Our goal is to select the
portions of it that are required for our task. One benefit of this approach
is the ready availability of tools and algorithms for manipulating MCF.
We abstract an information organization structure as a Directed Labelled
Graph (DLG). DLGs are well understood and as far as possible, we will use
the terminology that is standard to the treatment of DLGs. In MCF, relationships
between objects are represented in an unsurprising way by DLG arcs. DLG
arc labels are themselves objects which participate in relationships.
New kinds of data appear on the web routinely. It should be possible
to extend MCF dynamically to accommodate them. Furthermore, the list of
potential applications for MCF is open-ended and each application might
wish to add and use its own kinds of metadata. Though an application might
associate arbitrary semantics with the new labels, it would be highly desirable
if some significant portion of these semantics could itself be expressed
with MCF. In light of these requirements, using DLGs, we include a simple,
extensible type system as part of MCF.
An MCF database is a set of Directed Labelled Graphs, comprising:
a set of labels (often referred to as properties)
a set of nodes
a set of arcs where each arc is a triple consisting of two nodes (the origin
and destination) and a label.
In MCF, nodes can represent things like web pages, images, subject categories,
channels, and sites. They can also represent "real-world" objects such
as people, places, and events. The labels are nodes that correspond to
properties such as size or lastRevisionDate used to describe
web pages, subject categories, etc., and also to relations (such as hyperlinks,
authorship or parenthood) between these things.
Each label/property is a node (but not all nodes are properties). So,
if we had a label pageSize that is used to specify the basic size
of documents, we would also have a pageSize node. This node could
itself participate in relationships that help constrain and therefore specify
the semantics of pageSize. We would for example specify that the
domain of pageSize is Document and its range
is SizeInBytes and that a document has exactly one pageSize.
We would also use a Property to provide human readable documentation of
the intended semantics of pageSize.
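As an illustration only (it is not part of this proposal), the data model described above might be sketched in Python as follows. The names Arc, Graph, add, labels, and values, the page URL, and the size 2048 are all hypothetical; the pageSize, domain, range, Document, and SizeInBytes terms come from the paragraph above.

```python
from typing import NamedTuple


class Arc(NamedTuple):
    """One arc of the graph: two nodes and a label (hypothetical helper type)."""
    origin: object
    label: str
    destination: object  # a unit identifier or a primitive value


class Graph:
    """A minimal Directed Labelled Graph; an illustrative sketch only."""

    def __init__(self):
        self.nodes = set()
        self.arcs = set()

    def add(self, origin, label, destination):
        # Every label is itself a node, so all three parts are registered.
        self.nodes.update((origin, label, destination))
        self.arcs.add(Arc(origin, label, destination))

    def labels(self):
        # The set of labels (properties) actually used on arcs.
        return {arc.label for arc in self.arcs}

    def values(self, origin, label):
        # Destinations of all arcs with the given origin and label.
        return {a.destination for a in self.arcs
                if a.origin == origin and a.label == label}


PAGE = "http://www.acme.com/index.html"  # hypothetical unit identifier

g = Graph()
g.add(PAGE, "pageSize", 2048)                # a fact about a document
g.add("pageSize", "domain", "Document")      # facts about the label itself
g.add("pageSize", "range", "SizeInBytes")
```

Because pageSize is both a label and a node, the same lookup serves facts about documents and facts about the property: g.values(PAGE, "pageSize") and g.values("pageSize", "domain") use the same machinery.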
A node can either be a primitive data type or a "Unit". The primitive data
types are the same as the Java primitive data types. In addition, a DATE
type should be supported by the low-level MCF machinery, because it is
tricky to implement (beyond the reach of regexps, for example) and yet
commonly available in operating system and compiler libraries, e.g. java.util.
The concept of "Unit" corresponds loosely to the Java concept of "Object".
Every unit has a unique identifying string, called its unique identifier.
To simplify syntactic expression of MCF, the unique identifiers of Category
and Property units (defined below) are constrained to be valid XML Names.
For objects that are addressable on the Web and have a canonical URI, it is
expected to be common practice to use the URI as the unique identifier.
A small set of units with predefined semantics are assumed to exist in
order to bootstrap the type system. Specifically, these are:
typeOf: this is the Property used to specify that the given object is of a certain
type. A node can be the origin of multiple typeOf arcs; for example,
the node for a person can simultaneously be typeOf Person, typeOf
Golfer, and typeOf Doctor. Every unit has (at least implicitly)
a typeOf Property, since Unit is a type.
Category: this corresponds to the concept of Class. The destination of a typeOf
arc has a typeOf arc which ends at Category (with the single exception
of the node for "Category" itself).
Unit: this is the most general Category. It is implicitly or explicitly the super
class of all Categories (with the single exception of the node for "Unit"
domain: this Property is used to specify the type constraints on the use of a Property,
in particular of its origin node; the range of domain is Category.
range: this Property is used to specify the type constraints on the use of a Property,
in particular of its destination node. The range of range is Category.
superType: this Property is used to indicate the superset relation between Categories.
Property: this is the typeOf of all properties/arcs/relations.
FunctionalProperty: certain properties behave like functions, i.e., there can be at most one
arc of that type originating from a given node and every object in their
domain has one, e.g., lastRevisionDate. Such properties are typeOf
superProperty: a relation between two properties. If s1 is a superProperty of s2,
then the existence of an s2 arc between nodes A and B implies that there
is also an s1 arc between A and B. E.g., biologicalParent is a superProperty
mutuallyDisjoint: a symmetric relation between two categories which implies that nothing
can be an element of both these categories simultaneously. For example,
the categories for the built-in types (int, float, etc.) are all mutuallyDisjoint.
name: this can be used to provide a string which names the object. An object
may or may not have a name, though for convenience, it is assumed that
properties and categories will be given names.
description: a descriptive string used for human consumption.
parent: this is the most generic relation. The domain and range are
Sequence: this category is a special convenience used to express sequences. It is
normally expected to source a number of arcs whose labels are natural numbers
sequentially increasing from 1; the targets of these arcs are the nodes
which are to be considered sequenced.
ord (short for ordinal): a property not actually used in MCF, but which is
reserved because the label is needed for the syntactic expression of MCF
As a convention, the identifying strings and names of the commonly used
terms (including those listed above) are the same. Also, properties are
named beginning with a lower-case letter and non-properties with an upper-case
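The implication carried by superProperty can be made concrete with a small sketch. It is offered as one possible reading, not as a prescribed algorithm. Arcs are plain (origin, label, destination) tuples, and we assume that the statement "s1 is a superProperty of s2" is stored as the arc (s2, superProperty, s1), by analogy with the way superType is used elsewhere in this document. The function name and the sample identifiers are hypothetical.

```python
def close_under_super_property(arcs):
    """Return the arcs plus every arc implied by superProperty declarations.

    Assumption: (s2, "superProperty", s1) means "s1 is a superProperty of s2".
    """
    arcs = set(arcs)

    # Map each property to the properties declared as its superProperty.
    supers = {}
    for origin, label, destination in arcs:
        if label == "superProperty":
            supers.setdefault(origin, set()).add(destination)

    # Repeat until no new arc appears, so that chains of superProperty
    # declarations are followed all the way up.
    changed = True
    while changed:
        changed = False
        for origin, label, destination in list(arcs):
            for sup in supers.get(label, ()):
                implied = (origin, sup, destination)
                if implied not in arcs:
                    arcs.add(implied)
                    changed = True
    return arcs


# Hypothetical facts: parent is a superProperty of subject.
facts = {
    ("subject", "superProperty", "parent"),
    ("http://www.acme.com/apples.html", "subject", "Yahoo/Arts"),
}
closed = close_under_super_property(facts)
```

With these facts, the closed set also contains a parent arc from the page to the subject category, which is what lets a program that only knows about parent make some use of data expressed with a more specialized property.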
Though it is possible for a source of MCF to only assume the basic bootstrapping
vocabulary and define everything else it needs dynamically (i.e., as part
of the MCF database), for purposes of interoperability, it would be good
to standardize the vocabulary for commonly used terms. This will also reduce
the amount of information that needs to be transmitted. An appendix to
this document proposes some items for this vocabulary (largely derived
from existing standards such as the Dublin Core) for describing web content.
Our goal is to provide an XML based syntax for representing MCF. XML aims
to serve as a general purpose data representation language. One of the
components of any adequate data representation language is a type system;
MCF attempts to provide such a type system for XML.
MCF is expressed using XML syntax with a few conventions provided by
this specification. The entire MCF (which may occur as a separate file
or be embedded within HTML) is wrapped inside a block. All MCF blocks are
Given XML's flexibility, a number of strategies could serve for expressing
MCF structures in terms of elements and attributes; all would be essentially
isomorphic. However, it seems likely that it will be common practice to
use MCF to express a series of facts about some object, framed as arcs
with that object as the source. Thus, the source is expressed as a container
element, with a series of child elements each representing a Property,
or arc with that source. The destination of the arc is represented using
one of the attributes VALUE or UNIT attached to the "Property"
element. The use of UNIT indicates that the destination is another
unit, and the value of UNIT is its unique identifier. The use of VALUE
indicates that the node represents a datum of a primitive type; for example,
a date would be given with VALUE=, but a more sophisticated type
such as MONTHLY would be a Unit.
The unit description element uses the typeOf of the unit (its category) as the
element name. The unit description element must have an attribute ID
which specifies the unique identifier. If the unit is an element of more
than one category, the additional categories can be specified using typeOf
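To make these conventions concrete, the following sketch reads one unit description element and produces its arcs, using Python's standard xml.etree.ElementTree module. It is an illustration under stated assumptions: the function name unit_to_arcs, the Page category, and the sample identifiers are hypothetical, attribute names are matched without regard to case because both spellings appear in this document, and destinations are tagged as "unit" or "value" purely for the purposes of the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical unit description; nextUpdateTime and author appear in the
# examples of this document, while Page and the identifiers are made up.
SAMPLE = """
<Page ID="http://www.acme.com/apples.html">
  <nextUpdateTime VALUE="June 1 1997"/>
  <author UNIT="mailto:jane@acme.example"/>
</Page>
"""


def unit_to_arcs(xml_text):
    """Turn one unit description element into (origin, label, destination) arcs."""
    root = ET.fromstring(xml_text)
    attrs = {key.upper(): value for key, value in root.attrib.items()}
    unit_id = attrs["ID"]  # the ID attribute carries the unique identifier

    # The element name gives the category, i.e. a typeOf arc.
    arcs = [(unit_id, "typeOf", ("unit", root.tag))]

    for child in root:
        child_attrs = {key.upper(): value for key, value in child.attrib.items()}
        if "UNIT" in child_attrs:
            # The destination is another unit, named by its identifier.
            arcs.append((unit_id, child.tag, ("unit", child_attrs["UNIT"])))
        elif "VALUE" in child_attrs:
            # The destination is a datum of a primitive type.
            arcs.append((unit_id, child.tag, ("value", child_attrs["VALUE"])))
    return arcs


arcs = unit_to_arcs(SAMPLE)
```

Keeping the unit/value distinction in the destination mirrors the choice between the UNIT and VALUE attributes, so the arcs could be written back out without loss.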
Beyond the above, there are several special XML idioms available for convenience
in representing certain Properties.
The simple parent relationship may be expressed simply by inclusion. That
is to say, a source container element may contain not only Property elements
but also other source container elements; the effect is exactly the same
as if the contained source container were standing alone and contained
a parent property pointing at the containing element.
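The inclusion idiom might be processed along the following lines. This is a sketch only: we assume that a nested element is a source container exactly when it carries an ID attribute (written in upper case here), and that any other child is an ordinary Property element. The function name parent_arcs and the second identifier are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical nesting of one subject inside another.
SAMPLE = """
<Subject ID="acc.com/Fruit Pictures">
  <name VALUE="Fruit Pictures"/>
  <Subject ID="acc.com/Apples and Oranges">
    <name VALUE="Pictures of Apples and Oranges"/>
  </Subject>
</Subject>
"""


def parent_arcs(element, container_id=None):
    """Collect the parent arcs implied by nesting of source containers."""
    unit_id = element.get("ID")
    if unit_id is None:
        # No ID: treated as a Property element, which implies no parent arc.
        return []

    arcs = []
    if container_id is not None:
        # Same effect as a parent property pointing at the containing element.
        arcs.append((unit_id, "parent", container_id))
    for child in element:
        arcs.extend(parent_arcs(child, unit_id))
    return arcs


nested = parent_arcs(ET.fromstring(SAMPLE))
```

Here the inner Subject yields a single parent arc pointing at the outer one, just as if it had stood alone with an explicit parent property.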
The range of a description is expected to be a potentially
lengthy piece of free text, which might even include markup. For this reason,
the value of the description property is provided in the content
of the description Property element.
A Sequence node may have Properties whose labels are just numbers, sequentially
increasing from 1, whose range is the sequenced nodes. These are expressed
in XML simply by replacing the numbers with the reserved property ord;
the order in which these Property nodes appear in the XML entity corresponds
to the numeric labels.
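A reader of MCF could recover the numeric labels from document order roughly as follows. This is a minimal sketch, assuming upper-case SEQUENCE and ORD element names and a UNIT attribute on each ORD element, as in the DTD example later in this document. The function name and the HEAD-SEQ identifier are hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = '<SEQUENCE ID="HEAD-SEQ"><ORD UNIT="FROM"/><ORD UNIT="TO"/></SEQUENCE>'


def sequence_arcs(xml_text):
    """Replace each ord Property by its position, counting from 1."""
    sequence = ET.fromstring(xml_text)
    sequence_id = sequence.get("ID")
    # Only ord elements take part in the numbering; document order is kept.
    ords = [child for child in sequence if child.tag.upper() == "ORD"]
    return [(sequence_id, position, child.get("UNIT"))
            for position, child in enumerate(ords, start=1)]


ordered = sequence_arcs(SAMPLE)
```

The result is the pair of arcs labelled 1 and 2 that the Sequence node would carry in the data model.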
The sharing and re-use of schemata is uncontroversially good. In order
to avoid duplication, we propose use of the XML Hyperlink machinery to
refer to externally-stored MCF blocks. While details of this syntax will
have to wait for that specification to stabilize, the following examples
contain references which should be at least suggestive.
Of course, when multiple schemata are in use, a namespace problem occurs.
In the following examples, we use the syntax of the recent Layman/Bray
proposal; but the namespace resolution mechanism is an orthogonal problem.
For HTML pages, presumably the HTML LINK element would be used
to associate MCF files.
If a program reading an MCF block encounters a semantic contradiction,
the entire MCF block is to be considered as unreliable and information
from it is not to be used. An example of such a contradiction would be
two arcs originating from the same node, labelled with a Property that
has been declared a FunctionalProperty, or for example, assertions that
some node is both typeOf float and typeOf character.
Note, however, that different MCF blocks, obtained from different sources,
describing the same object, may be inconsistent. The decision as to how this
should be handled is highly application-dependent.
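The two kinds of contradiction mentioned above can be detected with a check such as the one below. It is a sketch under assumptions rather than a required procedure: arcs are (origin, label, destination) tuples, a functional property is recognised by a typeOf arc ending at FunctionalProperty, and disjointness is read from mutuallyDisjoint arcs between categories. The function names and the sample identifier doc are hypothetical.

```python
from collections import defaultdict


def find_contradictions(arcs):
    """Return a list of human-readable problems found in one MCF block."""
    arcs = list(arcs)
    functional = {o for o, label, d in arcs
                  if label == "typeOf" and d == "FunctionalProperty"}
    disjoint = {frozenset((o, d)) for o, label, d in arcs
                if label == "mutuallyDisjoint" and o != d}

    # Group destinations by (origin, label); identical arcs collapse.
    destinations = defaultdict(set)
    for origin, label, destination in arcs:
        destinations[(origin, label)].add(destination)

    problems = []
    for (origin, label), dests in destinations.items():
        if label in functional and len(dests) > 1:
            problems.append(f"{origin} has {len(dests)} arcs labelled {label}")
        if label == "typeOf":
            for pair in disjoint:
                if pair <= dests:
                    names = " and ".join(sorted(pair))
                    problems.append(f"{origin} is typeOf both {names}")
    return problems


def load_block(arcs):
    """Use the block only if it is free of contradictions; otherwise discard it all."""
    arcs = list(arcs)
    return [] if find_contradictions(arcs) else arcs


bad_block = [
    ("lastRevisionDate", "typeOf", "FunctionalProperty"),
    ("doc", "lastRevisionDate", "June 1 1997"),
    ("doc", "lastRevisionDate", "June 2 1997"),
]
```

load_block(bad_block) returns an empty list, reflecting the rule that no information from a contradictory block is to be used. Reconciling blocks from different sources is deliberately left out, since that decision is application-dependent.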
The following example uses MCF to describe a range of information about
the website of the Acme Content Company.
The following segment contains information that can be used for diverse
purposes. For example,
a robot could use it to determine which portions of the site to index.
a browser could use it to present a site map.
a channel client could use it to periodically download portions of the
the rich information here could be used by a search engine to provide better
search (filters, concept based searches, etc.)
<!--- WebSiteVocab and PersonVocab define some of the
terms we will use to describe
web sites and people
respectively. AcmeVocab are
some extensions defined by
the folks at Acme Content
<NBLK XML-LINK="SIMPLE" ROLE="XML-MCF-BLOCK"
<NBLK XML-LINK="SIMPLE" ROLE="XML-MCF-BLOCK"
<NBLK XML-LINK="SIMPLE" ROLE="XML-MCF-BLOCK"
Content Company Website Table of Contents"/>
value="ACME Content Company Web Site"/>
<!--- we are using the email address as the unique identifier
Brown, who amongst other things, takes care of
ACME web site</XML-MCF:description>
value="The Acme Content Company"/>
value="Wild Life Pictures taken in the Sahara"/>
Brown believes that this Subject should belong
to the Yahoo arts category, irrespective of its listing in
value="Still Life Pictures"/>
<author unit="firstname.lastname@example.org" />
<XML-MCF:name value="Pictures of Apples and Oranges"/>
<nextUpdateTime value="June 1 1997"/>
<superTopic unit="acc.com/Fruit Pictures"/>
<!--- superTopic is a more specialized relation than just
<!--- so that a smart browser can use the mirror if
the server is too loaded --->
entries under this subject can be found at the above url --->
The following describes the schema extensions made by the Acme Content
Company that are available from http://www.acme.com/AcmeVocab.mc. This is
a very small extension, but it illustrates the concept of how MCF can be
used to extend itself:
we have declared a new property called
accDeptOfPage which applies
to web pages and whose entry is an ACCDepartment.
We have also said
that there may be at most one department
responsible for each page and
that the department is also the contactAgent
for the page
Every page has a department associated with it (at ACC).
This property is
used to specify the ACC department associated with the page.</description>
in the Acme Content Company"/>
department number associated with an ACC department"/>
Consider the following DTD:
<!ELEMENT EMAIL (HEAD, BODY)>
<!ELEMENT HEAD (FROM, TO, CC*, SUBJECT)>
<!ELEMENT BODY (P+,SIG?)>
<!ELEMENT FROM #PCDATA>
<!ELEMENT TO #PCDATA>
<!ELEMENT CC #PCDATA>
<!ELEMENT P #PCDATA>
<!ELEMENT SIG #PCDATA>
The XML expression of the MCF version makes heavy use of the Sequence
<ORD UNIT="HEAD"/><ORD UNIT="BODY"/>
<ORD UNIT="FROM"/><ORD UNIT="TO"/><ORD
<ELEMENT ID="FROM"><TYPEOF VALUE="#PCDATA"/></ELEMENT>
<ELEMENT ID="TO"><TYPEOF VALUE="#PCDATA"/></ELEMENT>
<SEQUENCE ID="CC-STAR-SEQ"><ORD UNIT="CC"/></SEQUENCE>
<ELEMENT ID="CC"><TYPEOF VALUE="#PCDATA"/></ELEMENT>
<ELEMENT ID="SUBJECT"><TYPEOF VALUE="#PCDATA"/></ELEMENT>
<ORD UNIT="P-PLUS"/><ORD UNIT="SIG-QM"/></SEQUENCE>
<SEQUENCE ID="P-PLUS-SEQ"><ORD UNIT="P"/></SEQUENCE>
<ELEMENT ID="P"><TYPEOF VALUE="#PCDATA"/></ELEMENT>
<SEQUENCE ID="SIG-QM-SEQ"><ORD UNIT="SIG"/></SEQUENCE>
<ELEMENT ID="SIG"><TYPEOF VALUE="#PCDATA"/></ELEMENT>
In addition to the basic bootstrapping terms (typeOf, Category,
etc.) specified earlier, in order to promote interoperability, we also
propose some standard vocabulary that can be used for purposes of describing
the kinds of content typically found on the web.
Such standard schemata are very important, but are separate from the
data model and the transfer syntax. The purpose of this section of the
proposal is to initiate a discussion. There is significant work to do in
this area, but it should be started now.
Though the following can easily be specified in MCF itself, for purposes
of readability, we provide the following description in English. The MCF
specification will however be made available for authors.
An author can use this vocabulary as the schema for their MCF (by using
XML-transclusion) and make further modifications and additions to it as
As a convention, Categories are in the singular. So, the category of all
people is called Person and of all organizations is called Organization.
Also, even though MCF is case insensitive, for purposes of human readability,
as a convention, categories start with a capital letter and properties
start with a lower case letter.
The name and identifier for all of the following are the same.
Includes everything from websites and web pages to legacy databases and
file folders. Its superType is Unit.
A collection of information. Includes subject categories, file folders,
channels, etc. Its superType is Content. There are no constraints on the
items belonging to a container. The items in a container could themselves
be containers. The relation between an item belonging to a container and
the container is just parent (though we might want to eventually introduce
a more specialized relation.) The distinction between a container and non-container
is one of convenience. There will be cases where we want to consider a
single page as a container and in other cases, we might want to consider
the same page as an atomic entity. The flexibility of MCF allows us this
The category of subjects. An example is the Arts category in Yahoo! or
the portion of the Developer portion of the Netscape Website. Its superType
A web site. Its superType is ContentContainer.
A document. Could be a WordPerfect document on a PC or a web page or even
a FileMaker database. Its superType is Content.
The concept of an Agent is a general one intended to cover people, robots,
organizations, etc. Its superType is Unit.
Examples include Apple Computer, United States and the Peace Corps. Organizations
are mutually disjoint with people. Its superType is Agent.
The category of people. Its superType is Agent.
The table of contents for any Content (could be for a web site, page, ...)
Its superType is Content.
Examples include English, French, etc. Its superType (for now) is Unit.
This category is used to specify information like the periodicity with
which content is updated, when it should be pulled down, etc. The range
includes both simple instances like Hourly or Daily to instances with intermediate
complexity like daily at eight am to more complex instances (such as that
proposed by CDF) like hourly between eight am and six pm on weekdays...
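Since superType expresses the superset relation, a unit that is typeOf some category also belongs to every category above it. The sketch below shows one way this could be computed; it is illustrative only. The table lists only those superType statements from this section whose category names are given in the text, and the function name all_categories is hypothetical.

```python
# superType statements taken from the descriptions in this section.
SUPER_TYPE = {
    "Content": {"Unit"},
    "ContentContainer": {"Content"},
    "Agent": {"Unit"},
    "Person": {"Agent"},
    "Organization": {"Agent"},
}


def all_categories(direct_types, super_type=SUPER_TYPE):
    """Return the direct categories plus everything reachable via superType."""
    result = set()
    pending = list(direct_types)
    while pending:
        category = pending.pop()
        if category not in result:
            result.add(category)
            # Follow superType arcs upwards; unknown categories have none.
            pending.extend(super_type.get(category, ()))
    return result


person_categories = all_categories({"Person"})
```

For a unit that is typeOf Person, the result contains Person, Agent, and Unit, so a query phrased in terms of Agent would also find people and organizations.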
A.2.1 PROPERTIES USED TO DESCRIBE
There has been much work in standardizing vocabularies for describing agents,
most notably vcard, and we hope to adopt those standards as applicable.
In addition, we should also provide standard properties for describing
the location, hobbies, etc. of agents.
A string representing the email address of an agent.
The URL of the home page(s) of an agent.
A string representing how the person can be contacted.
A.2.2 PROPERTIES USED TO DESCRIBE
Existing standards that these draw from (and will rely upon even more in
the future) include the Dublin Core, Z39.50 and of course, the rich body
of work in Library Science.
The individual person(s) who is(are) the authors of the content object.
The entries are not names of the authors but references to objects corresponding
to the authors. The name, email address, etc. of the author can be specified
on that object.
The organization which is the author of the content object.
The generalization of the previous two properties. It is a superProperty
of both of them.
The agent that is the editor of the content object.
The agent that is the publisher of the content object.
The agent who is the "contact" for that piece of content. Typically the
person behind "email@example.com".
The copyright declarations. The range is a string.
RELATED TO THE SIZE OF THE
The size of a content object in bytes. Represented using an integer. This
is the size of the object alone and does not represent the size of its
inclusions (like in-line images).
The total number of bytes, including inline images, plugins, etc. of a
PROPERTIES OF THE
Some more temporal properties appear under Schedules.
The date on which a content object was first published.
The date on which the content object was last modified.
The date until which the information in this content object is valid.
The frequency with which this is typically updated. The range is a Schedule
(which includes Hourly, Daily, etc. and also more complex Schedules.)
The version number of this content object or subject category. A string.
This is to be used if the content is to be proactively downloaded to the
user's computer. It specifies the download schedule, and the entry is a Schedule.
The next time that this piece of content is scheduled to be updated.
This is also to be used if the content is to be proactively downloaded
to the user's computer. It specifies the next time this piece of content
should be redownloaded. More often than not, this will suffice in lieu of
a full blown schedule and will default to the nextUpdateTime.
The subject categories that this content object falls under. parent is
a superProperty of subject. Using this, an author could for example suggest
that his/her page belongs to a certain Yahoo! subject category.
The language(s) (typically a natural language such as English or French)
in which the content is primarily encoded.
One or more tables of contents of which this content object is a part.
The home page for the site of which this content object is a part.
The page at which help can be found regarding this content object.
The content objects that a content object has hyperlinks to. parent is
a superProperty of linksTo.
To be used when one content object includes another (such as an HTML page
including an image or a poem). This is useful when we want to distinctly
identify a certain piece of a page, such as a table, as a first class unit
and specify the relation between the enclosing page and table.
The MIME type of the content.
A convenience predicate for specifying the MIME types of all the included
A relation between two subject categories such as Yahoo Arts and Yahoo
Arts Museums which states that the latter is a more specific subject category
of the former. parent is a superProperty of superTopic.
An icon that can be used to represent the object. The value is typically
the object corresponding to a GIF or JPEG, but could also be a platform
specific encoding. Preferably, it will be one object with several different
encodings being available.
One or more URIs from which the object's content may be obtained.
Mirror URIs for this content object. Mirrors are assumed to be secondary
sources of the content, which might potentially be stale. The distinction
between mirrors and location is subtle at best.
This Property can be used to specify information like whether the server
is down, the last time the content was accessible, etc. This meta-content
is typically furnished not by the content provider himself, but by indexers
This is used to specify whether the content is to be accessed via the traditional
Web pull mechanism, via email (e.g., InBox Direct), via channels, etc.
The intent of this Property is to contain the information that would be
contained in a PICS-like rating. The range is Rating.
The cost of this content. The range is a Cost, which could be as simple
as "5 US Dollars" or something much more complex. The more complex specification
is beyond the scope of this proposal.
This is the day upon which the schedule will start to apply.
This is the day upon which the schedule expires and no longer applies.
The interval of time that the schedule should repeat over.
Earliest time during the schedule interval that the schedule applies to.
A very large number of people have contributed to the material in this
proposal. It draws heavily from the knowledge representation work in AI.
It owes a lot to the MCF project at Apple and we would like to thank the
folks who made that happen, including Alan Kay, Don Norman, Jed Harris
and Larry Tesler. We would also like to thank Edwin Aoki, Tom Paquin, Phil
Karlton, Tim Hickman and Mike McCue of Netscape for the comments and feedback
on this draft.
Netscape Communications Corporation
Last Updated: 06/10/97 21:32:15