[Archive copy mirrored from: http://www.textuality.com/mcf/NOTE-MCF-XML.html, June 22, 1997]
Meta Content Framework Using XML
- Editors:
- R.V. Guha (Netscape Communications) <guha@netscape.com>
- Tim Bray (Textuality) <tbray@textuality.com>
Abstract
This document provides the specification for a data model for describing information organization structures (metadata) for collections of networked information. It also provides a syntax for the representation of instances of this data model using XML, the Extensible Markup Language.
Table of Contents
1. Introduction
1.1 History and Motivation
1.2 The Basis of Meta Content Framework
2. The MCF Data Model
2.1 Labels, Nodes, and Arcs (or PropertyTypes, Nodes and Properties)
2.2 Units and Primitive Data Types
2.3 The Set of Bootstrap Nodes
3. Representation of MCF
3.1 Syntax
3.2 Linking to Schemata
3.3 Processing MCF Blocks
3.4 Special Idioms
3.4.1 Unit Identifiers
3.4.2 parent
3.4.3 Sequence
3.4.4 Namespace Prefixes
3.4.5 Structured Values
3.4.6 Inheritance
3.5 HTML and MCF
4. Examples
4.1 The Acme Content Company Web Site
4.2 Example 1
4.3 Example 2
4.4 Example Three
4.4.1 Schema Extensions for Acme
4.4.2 Structured Values
Appendices
A. Standard Vocabulary
A.1 Categories
A.2 Property Types
A.2.1 Property Types used to describe Agents
A.2.2 property types used to describe Content
A.2.2.1 Authorship Related property types
A.2.2.2 property types related to the size of the object
A.2.2.3 Temporal property types of the content
A.2.2.4 property types about the content itself
A.2.2.5 property types about content access
A.2.2.6 Other property types about content
A.2.3 property types related to schedules.
B. Acknowledgements
This document gives a complete description of the MCF data model and its syntactic expression. It is not tutorial in nature; a companion document, An MCF Tutorial, exists to serve that purpose.
The need for machine-usable descriptions of collections of distributed information is increasing rapidly. There have been a number of proposals in the recent past that have made significant steps toward this goal, including HotSauce MCF, CDF, PICS, and WebCollections.
The existence of multiple proposals reflects the fact that this type of information is needed for multiple purposes, and that there are many groups interested in its availability and use. This diversity of effort is reflected in a diversity of terminology; discussions have been couched in terms of "metadata", "typing", "schemata", "labels", and "collections," while all in fact dealing with the same underlying constructs and problems.
We believe the following principles to be central to making progress in this area:
- There is no useful distinction between the representational needs of data and metadata. The kinds of information that need to be represented in metadata and data are very similar. Furthermore, every item of information, without exception, is likely to be regarded by some applications as ancillary and never to be displayed, and by others as core content that needs to be formatted, printed, or searched.
- For interoperability and efficiency, schemata designed to serve different applications should share as much as possible in the way of data structures, syntax, and vocabulary.
The consequence of the first principle is that it is simply incorrect to reserve any special representation for use just in "metadata".
The second principle is what really drives this proposal. It is inevitable that there will be a plethora of classes of information about information; note some of the examples listed above. If they share a common syntax, this is good, but it is not enough. For example, suppose a mature commercial word processor package were to offer a "save as XML" format, which exported an XML representation of its internal document data structures and attributes. While marginally more open than the processor's native format, this would not be of any substantial use, because to operate on this file would de facto require the use of the program which generated it.
To a certain extent this is inevitable - in many cases, data created for the purposes of a particular application will contain items that are only meaningful to that application. But the situation can be greatly improved. If information about information can share a common data model and vocabulary, it will be possible to build software to query and manage it without knowing all the schemata in advance.
In this document, we draw upon the features provided in the other proposals mentioned above, and on other work in this area, to develop a single data model and corresponding interchange format which can be used for many purposes, including for example
- describing the structure of web sites or a set of channels
- threading email
- PIM functions
- distributed annotation and authoring
- exchanging commerce-related information such as prices, inventories, and delivery dates
Meta Content Framework (henceforth referred to as MCF) is a structure description language. The field of structure description languages is well understood and it is not our desire to reinvent any of it. Our goal is to select the portions of it that are required for our task. One benefit of this approach is the ready availability of tools and algorithms for manipulating MCF.
We abstract an information organization structure as a Directed Labelled Graph (DLG). DLGs are well understood and as far as possible, we will use the terminology that is standard to the treatment of DLGs. In MCF, relationships between objects are represented in an unsurprising way by DLG arcs. DLG arc labels are themselves objects which participate in relationships.
New kinds of data appear on the web routinely. It should be possible to extend MCF dynamically to accommodate them. Furthermore, the list of potential applications for MCF is open-ended and each application might wish to add and use its own kinds of metadata. Though an application might associate arbitrary semantics with the new labels, it would be highly desirable if some significant portion of these semantics could itself be expressed with MCF. In light of these requirements, using DLGs, we include a simple, extensible type system as part of MCF.
An MCF database is a set of Directed Labelled Graphs, comprising:
- a set of labels, also referred to as property types
- a set of nodes
- a set of arcs where each arc is a triple consisting of two nodes (the source and target) and a label. Arcs are also referred to as properties. Often, we will refer to an arc with a certain source as a property of that source. Similarly we will refer to the target of the arc as the value of the property.
In MCF, nodes can represent things like web pages, images, subject categories, channels, and sites. They can also represent "real-world" objects such as people, places, and events.
The arcs (properties) can represent characteristics such as size or lastRevisionDate of web pages, subject categories, etc., and also their relationships (such as hyperlinks, authorship or parenthood) to other objects.
Each property type is a node (but not all nodes are property types). So, if we had a property type pageSize that is used to specify the basic size of documents, we would also have a pageSize node. This node could itself participate in properties that help constrain and therefore specify the semantics of pageSize. We would for example specify that the domain of pageSize is Document and its range is SizeInBytes and that a document has exactly one pageSize. It could also have a property to provide human readable documentation of the intended semantics of pageSize.
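The pageSize discussion above can be made concrete with a small sketch. It models an MCF database as a plain set of (source, label, target) triples; the `Arc` type, the helper `properties_of`, the sample page URL, and the use of FunctionalPropertyType to approximate "exactly one pageSize" are all illustrative assumptions, not part of MCF.

```python
from typing import NamedTuple, Union

Value = Union[str, int, float]  # unit identifiers are modelled as strings here

class Arc(NamedTuple):
    source: str    # node the property belongs to
    label: str     # property type; also the name of a node
    target: Value  # value of the property

arcs = {
    # An ordinary property of a document node (hypothetical data).
    Arc("http://www.acc.com/scorpions.html", "pageSize", 2000),
    # Properties whose source is the pageSize node itself.
    Arc("pageSize", "typeOf", "FunctionalPropertyType"),
    Arc("pageSize", "domain", "Document"),
    Arc("pageSize", "range", "SizeInBytes"),
    Arc("pageSize", "description", "Basic size of a document"),
}

def properties_of(node: str, arcs: set) -> dict:
    """Collect the arcs leaving `node`, grouped by label."""
    found = {}
    for arc in arcs:
        if arc.source == node:
            found.setdefault(arc.label, []).append(arc.target)
    return found

# Every label in use; pageSize appears both here and as a source node.
labels = {arc.label for arc in arcs}
```

Because pageSize is both a label and a node, the same lookup that answers "what is the size of this page?" also answers "what is the range of pageSize?".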
The figure below illustrates some simple nodes (including some property types) and properties, showing how properties can be attached to property types.
This self-description allows MCF to be its own schema definition language. This in turn allows MCF to be dynamically extended by an author or application.
A node can either be a primitive data type or a "Unit". The primitive data types are the same as the Java primitive data types. In addition, a DATE type should be supported by the low-level MCF machinery, because it is tricky to implement (beyond the reach of regexps, for example) and yet commonly available in operating system and compiler libraries, e.g. java.util.
The concept of "Unit" corresponds loosely to the Java concept of "Object".
A small set of units with predefined semantics are assumed to exist in order to bootstrap the type system. These names are reserved in MCF and may not be used for any purpose other than that given here. Specifically, these are,
typeOf
- this is the PropertyType used to specify that the given object is of a certain type. A node can be the origin of multiple typeOf arcs; for example, the node for a person can simultaneously be typeOf Person, typeOf Golfer, and typeOf Doctor. Every unit has (at least implicitly) a typeOf property, since Unit is a type.
Category
- This corresponds to the concept of Class. The destination of a typeOf arc has a typeOf arc which ends at Category (with the single exception of the node for "Category" itself).
Unit
- This is the most general Category. It is implicitly or explicitly the super class of all Categories (with the single exception of the node for "Unit" itself).
domain
- this PropertyType is used to specify the type constraints on a property, in particular of its origin node; the range of domain is Category.
range
- this PropertyType is used to specify the type constraints on a property, in particular of its destination node. The range of range is Category.
superType
- this PropertyType is used to indicate the superset relation between Categories. If A is the superType of B and X is typeOf B, then X is also implicitly typeOf A.
PropertyType
- this is the typeOf all property types/labels.
FunctionalPropertyType
- Certain property types behave like functions, i.e., there can be at most one arc of that type originating from a given node. e.g., lastRevisionDate. Such properties are typeOf FunctionalPropertyType.
superPropertyType
- a relation between two property types. If s1 is a superPropertyType of s2, then the existence of an s2 arc between nodes A and B implies that there is also an s1 arc between A and B. E.g., biologicalParent is a superPropertyType of biologicalFather.
mutuallyDisjoint
- a symmetric relation between two categories which implies that nothing can be an element of both these categories simultaneously. For example, the categories for the built-in types (int, float, etc.) are all mutuallyDisjoint.
name
- this can be used to provide a string which names the object. An object may or may not have a name, but it is necessary for property types and categories to have names. Furthermore, the names of Category and PropertyType units are constrained to be valid XML Names. The tokens "typeOf", "PropertyType", etc. are the names of the corresponding units listed here.
description
- a descriptive string used for human consumption.
parent
- this is the most generic relation. The domain and range are Units.
Sequence
- This category is a special convenience used to express sequences. It is normally expected to source a number of arcs whose labels are natural numbers sequentially increasing from 1; the targets of these arcs are the nodes which are to be considered sequenced.
ord
- (short for ordinal) is a property type not actually used in MCF, but which is reserved because the label is needed for the syntactic expression of MCF in XML.
Property
- is a reserved term to be used in future versions of this document as a Category that will enable us to optionally treat certain properties as first-class units. This will allow us to represent meta-meta content such as the volatility of a certain piece of meta content. The terms source and target are also reserved for property types that will apply to Property to specify the source and target of the arc/property.
As a convention, property types are named beginning with a lower-case letter and other units with an upper-case letter.
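The implication rules stated for superType and superPropertyType can be sketched as two small closure computations over (source, label, target) triples. The function names and the sample nodes ("marketing", "child", "father") are hypothetical; the arc direction follows the XML examples in this document, where the subtype is the source of the superType arc.

```python
def all_types(node, arcs):
    """Return the explicit typeOf targets of `node` plus every implied superType."""
    direct = {t for (s, l, t) in arcs if s == node and l == "typeOf"}
    # (B, "superType", A) reads "A is the superType of B".
    result, frontier = set(), list(direct)
    while frontier:
        category = frontier.pop()
        if category in result:
            continue
        result.add(category)
        frontier.extend(t for (s, l, t) in arcs
                        if s == category and l == "superType")
    return result

def implied_arcs(arcs):
    """Add the s1 arc implied by each s2 arc, where s1 is a superPropertyType of s2."""
    closed = set(arcs)
    changed = True
    while changed:
        changed = False
        for (s2, l, s1) in list(closed):
            if l != "superPropertyType":
                continue
            for (a, label, b) in list(closed):
                if label == s2 and (a, s1, b) not in closed:
                    closed.add((a, s1, b))
                    changed = True
    return closed

arcs = {
    ("Department", "superType", "Organization"),
    ("Organization", "superType", "Unit"),
    ("marketing", "typeOf", "Department"),
    ("biologicalFather", "superPropertyType", "biologicalParent"),
    ("child", "biologicalFather", "father"),
}
```

Here `all_types("marketing", arcs)` contains Department, Organization, and Unit, and `implied_arcs(arcs)` contains a biologicalParent arc from "child" to "father".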
Though it is possible for a source of MCF to only assume the basic bootstrapping vocabulary and define everything else it needs dynamically, for purposes of interoperability, it would be good to standardize the vocabulary for commonly used terms. This will also reduce the amount of information that needs to be transmitted. An appendix to this document proposes some items for this vocabulary (largely derived from existing standards such as the Dublin Core) for describing web content.
Our goal is to provide an XML based syntax for representing MCF. XML aims to serve as a general purpose data representation language. One of the components of any adequate data representation language is a type system; MCF attempts to provide such a type system for XML.
MCF is expressed using XML syntax with a few conventions provided by this specification. The XML text describing the MCF (which may occur as a separate file or be embedded within HTML) is wrapped inside a block tagged <XML-MCF> and </XML-MCF>. All MCF blocks are well-formed XML.
Given XML's flexibility, a number of strategies could serve for expressing MCF structures in terms of elements and attributes; all would be essentially isomorphic. However, it seems likely that it will be common practice to use MCF to express a series of facts about some object, framed as arcs with that object as the source.
Thus, the source is expressed as a container element, with a series of child elements each representing an arc with that source (i.e., a property of that source). The element type of the source element is the name of the unit's category. If the unit is an element of more than one category, the additional categories can be specified using typeOf property elements. The container element may be given a unique identifier, which is a string provided in the ID attribute of this container element.
The element type of each of the child elements is the PropertyType associated with that arc. If the destination of the arc is a primitive type, it is represented as the content of the element. If the destination is a unit, it is represented by using the attribute UNIT attached to the element. The value of the UNIT attribute (i.e., the reference to the unit) must match the unique identifier of a container element representing that object (please see section 3.4 where this constraint is somewhat loosened.)
If the direction of the arc needs to be reversed, i.e., the container element is the target of the arc (the value of the property) and the unit referred to in the UNIT attribute is the source of the arc, this can be done by using the attribute inverse with the value "true". The default value for this attribute is "false".
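As a rough illustration of these rules, the following sketch reads one source container element and emits (source, label, target) triples. It is only an assumption of how a reader might work: it spells the attributes in lower case as the examples in this document do, ignores nested containers, and the identifier "AfricaCreatures" is made up for the example.

```python
import xml.etree.ElementTree as ET

def arcs_from_container(xml_text):
    """Turn one source container element into (source, label, target) triples."""
    container = ET.fromstring(xml_text)
    source = container.get("id")  # None for a unit without an identifier
    # The element type of the container names the unit's category.
    arcs = [(source, "typeOf", container.tag)]
    for child in container:
        unit = child.get("unit")
        if unit is None:
            # Primitive value: carried as the content of the element.
            arcs.append((source, child.tag, (child.text or "").strip()))
        elif child.get("inverse", "false") == "true":
            # Reversed arc: the referenced unit is the source.
            arcs.append((unit, child.tag, source))
        else:
            arcs.append((source, child.tag, unit))
    return arcs

example = """
<Subject id="AfricaCreatures">
  <description>Dangerous Creatures in Africa</description>
  <subject unit="http://mcf.yahoo.com/mcf/Recreation/Animals.mcf"/>
  <parent unit="LivingDesert" inverse="true"/>
</Subject>
"""
```

With `inverse="true"`, the last child yields an arc from LivingDesert to AfricaCreatures rather than the other way round.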
It is legal for a unit to not have any unique identifier. In this case, it is not possible for any element representing a property to reference it.
The unique identifier for a unit is just that and does not have any binding semantics about locations on the web. There may be many different locations associated with a unit. For example, a unit representing a web page could have different locations for its mcf block, the actual content, ratings, etc. A unit representing a person could have locations for her home page, email address, etc. These locations can be expressed by using the appropriate properties. However, we do allow for certain defaults (see section 3.4) that enable more compact representations.
Uses of unique unit identifiers (i.e., as the value of the UNIT attribute within property elements and as the value of the ID attribute within the container element) within an MCF block follow the rules of URLs, and so they may be either absolute or relative to the baseURL of the MCF block within which they occur.
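A minimal sketch of this resolution step, using the standard URL-joining routine from Python's library. The base URL shown is a hypothetical location for the MCF block; the identifiers are taken from the examples later in this document.

```python
from urllib.parse import urljoin

# Hypothetical base URL of the MCF block being read.
base_url = "http://www.acc.com/site.mcf"

def resolve(identifier, base=base_url):
    """Resolve a unit identifier against the base URL of its MCF block."""
    return urljoin(base, identifier)

absolute = resolve("http://www.acc.com/cobra.html")  # already absolute
rooted = resolve("/people/jb.html")                  # relative to the host
sibling = resolve("copyright.html")                  # relative to the block
```

Two blocks with the same base URL therefore agree on what "copyright.html" denotes, while blocks with different base URLs may not.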
The sharing and re-use of schemata is uncontroversially good. In order to avoid duplication, we propose use of the XML Hyperlink machinery to refer to externally-stored MCF blocks. While details of this syntax will have to wait for that specification to stabilize, the following examples contain references which should be at least suggestive.
Of course, when multiple schemata are in use, a namespace problem occurs. In the following examples, we use the syntax of the recent Layman/Bray proposal; but the namespace resolution mechanism is an orthogonal problem.
If a program reading an MCF block encounters a semantic contradiction, the entire MCF block is to be considered as unreliable and information from it is not to be used. An example of such a contradiction would be two arcs originating from the same node, labelled with a PropertyType that has been declared a FunctionalPropertyType, or for example, assertions that some node is both typeOf float and typeOf character.
Note, however, that different MCF blocks, obtained from different sources and describing the same object, may be inconsistent. The decision as to how this should be handled is highly application-dependent.
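The two contradictions mentioned above (a repeated functional property, and membership in two disjoint categories) could be detected along the following lines. This is a sketch under simplifying assumptions: arcs are (source, label, target) triples, only explicitly stated typeOf arcs are examined, and the names `find_contradictions` and `usable_arcs` are invented for illustration.

```python
from collections import defaultdict

def find_contradictions(arcs, disjoint_pairs=()):
    """List semantic contradictions found within a single MCF block."""
    functional = {s for (s, l, t) in arcs
                  if l == "typeOf" and t == "FunctionalPropertyType"}
    targets = defaultdict(set)
    for (s, l, t) in arcs:
        targets[(s, l)].add(t)
    problems = []
    for (source, label), values in targets.items():
        # A functional property may have at most one value per source.
        if label in functional and len(values) > 1:
            problems.append(f"{source}: {len(values)} values for {label}")
        # A node may not belong to two mutuallyDisjoint categories.
        if label == "typeOf":
            for a, b in disjoint_pairs:
                if a in values and b in values:
                    problems.append(f"{source}: typeOf both {a} and {b}")
    return problems

def usable_arcs(arcs, disjoint_pairs=()):
    """Return the arcs unchanged, or nothing if the block contradicts itself."""
    return set() if find_contradictions(arcs, disjoint_pairs) else set(arcs)

bad_block = {
    ("lastRevisionDate", "typeOf", "FunctionalPropertyType"),
    ("page", "lastRevisionDate", "1997-06-01"),
    ("page", "lastRevisionDate", "1997-06-02"),
}
good_block = {
    ("lastRevisionDate", "typeOf", "FunctionalPropertyType"),
    ("page", "lastRevisionDate", "1997-06-01"),
}
typed_block = {("x", "typeOf", "float"), ("x", "typeOf", "character")}
```

Note that the whole block is discarded, not just the offending arcs; reconciling different blocks is left to the application.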
Beyond the above, there are several special XML idioms available for convenience and compactness in representing certain properties.
We mentioned earlier that the unique identifier for a unit is just that and does not have any binding semantics about locations on the web. Having said that, it would be desirable to have a set of default rules that enable more compact representations. So, as a default, unless explicit values for the corresponding properties are provided, for objects addressable on the Web and which have a canonical URL, it is expected to be common practice to use the URL as the unique identifier.
One of the implications of this default is that not all the units referred to in an MCF block need to have unit descriptor containers in that block or even in blocks included in that block. For example, a web page might not have any explicit MCF unit container corresponding to it, and yet, by using the URL as a unique identifier, a table of contents could refer to the unit that denotes the page.
For implementation considerations, we impose the constraint that property types and categories should have explicit descriptors that occur in either the MCF block, or more typically, in an included block, before their first use.
The parent property may be expressed by element inclusion. That is to say, a source container element may contain not only property elements but also other source container elements; the effect is exactly the same as if the contained source container were standing alone and contained a parent property pointing at the containing element.
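The inclusion idiom can be sketched as a recursive walk that emits one parent arc per nested container. To tell nested containers from property elements, this sketch leans on the naming convention given earlier (property types begin with a lower-case letter, other units with an upper-case letter); a real reader would consult the schema instead, so treat that test as an assumption.

```python
import xml.etree.ElementTree as ET

def parent_arcs(container):
    """Return (contained_id, "parent", containing_id) for nested containers."""
    arcs = []
    for child in container:
        if child.tag[:1].isupper():  # naming convention: a unit, not a property
            arcs.append((child.get("id"), "parent", container.get("id")))
            arcs.extend(parent_arcs(child))
    return arcs

toc = ET.fromstring("""
<TableOfContents id="acctoc">
  <Subject id="LivingDesert">
    <name>Living Desert</name>
    <Page id="http://www.acc.com/scorpions.html">
      <description>Scorpions in the sun</description>
    </Page>
  </Subject>
</TableOfContents>
""")
```

Each arc produced this way is what an explicit `<parent unit="..."/>` child of the contained element would have stated.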
A Sequence node may have Properties whose labels are just numbers, sequentially increasing from 1, whose range is the sequenced nodes. These are expressed in XML simply by replacing the numbers with the reserved property ord; the order in which these Property nodes appear in the XML entity corresponds to the numeric labels.
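A sketch of how a reader might recover the numeric labels from document order. The XML fragment is an assumed illustration of the ord idiom (this document gives no worked Sequence example), and the identifier "tour" is hypothetical.

```python
import xml.etree.ElementTree as ET

def sequence_arcs(xml_text):
    """Replace each ord element by an arc labelled with its 1-based position."""
    sequence = ET.fromstring(xml_text)
    members = [child for child in sequence if child.tag == "ord"]
    return [(sequence.get("id"), position, member.get("unit"))
            for position, member in enumerate(members, start=1)]

example = """
<Sequence id="tour">
  <ord unit="http://www.acc.com/scorpions.html"/>
  <ord unit="http://www.acc.com/cobra.html"/>
  <ord unit="http://www.acc.com/anaconda.html"/>
</Sequence>
"""
```

The label of each arc is the position of its ord element, so reordering the elements in the XML reorders the sequence.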
We would like the most common case to be very simple. In the most common case, there will be exactly one schema used and since there will not be any schema ambiguities, the author should not have to do any extra work related to namespaces. Furthermore, even if additional schemata are introduced, if there is a primary schema, the additional work should only be proportional to the extent to which the additional schema is used. To enable this, we allow for the first of the imported schemata to not have any associated prefix. Top level unit description elements that do not explicitly use any namespace prefix are assumed to use this schema.
Concretely, names which are taken from the first schema referenced (via XML-link) in an MCF block do not require prefixes; names from all others do.
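The prefix rule amounts to a simple lookup, sketched below. The function `resolve_name` and its argument layout are assumptions made for illustration; the schema URLs and the "acme" prefix are those used in the third example of this document.

```python
BASIC = "http://www.standards.org/BasicVocab.mcf"
ACME = "http://www.acc.com/accExtensions.mcf"

def resolve_name(name, primary_schema, prefixed_schemata):
    """Map an element name to a (schema, local name) pair."""
    prefix, separator, local = name.partition(":")
    if not separator:
        # Unprefixed names belong to the first schema referenced.
        return (primary_schema, name)
    return (prefixed_schemata[prefix], local)

plain = resolve_name("Page", BASIC, {"acme": ACME})
prefixed = resolve_name("acme:Department", BASIC, {"acme": ACME})
```

An unknown prefix raises a KeyError in this sketch; how a real processor should react is not specified here.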
There are many cases where the value of a certain property that a node has (e.g., address) is the concatenation of the values of a set of other properties of that node (e.g., streetAddress, city, state, zip). It would be convenient to not have to repeat these values. To enable this, we allow such values to be nested, as illustrated in example 3.
One of the most common uses of MCF will involve a publishing agent describing the organizational structure and other metadata about its web site. Many of these pages will share a lot of common properties (such as their table of contents, authorship, copyright and legal notices, etc.) It would be highly desirable not to have to repeat these. To enable this, we tentatively provide a simple inheritance mechanism.
The inheritance is accomplished in the XML representation using the inherits element. This appears as a child of an element representing a Category; it has an attribute named propertytype, and a value, provided in the usual way either with a unit attribute or in the inherits element content. The effect is that all nodes with typeOf the Category are considered to have a property whose source is that node, whose associated property type is the value of the propertytype attribute of the inherits element, and whose value is the value of that inherits element. Please see example 3 for an illustration of the use of this feature.
There is no direct analogue in the DLG representation; the XML expression asserts the existence of a (potentially large) number of arcs in the DLG.
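The expansion from inherits elements to arcs might look like the following sketch. It assumes arcs are (source, label, target) triples and that each inherits element has already been read into a (category, property type, value) tuple; it considers only explicitly stated typeOf arcs, and it does not address whether an explicit property on a node should override an inherited one.

```python
def expand_inherits(arcs, inherits):
    """Assert one arc per node of each category that carries an inherits rule."""
    expanded = set(arcs)
    for (category, property_type, value) in inherits:
        for (node, label, target) in arcs:
            if label == "typeOf" and target == category:
                expanded.add((node, property_type, value))
    return expanded

arcs = {
    ("http://www.acc.com/scorpions.html", "typeOf", "AcmePage"),
    ("http://www.acc.com/cobra.html", "typeOf", "AcmePage"),
    ("copyright.html", "typeOf", "Page"),
}
inherits = [
    ("AcmePage", "copyright", "copyright.html"),
    ("AcmePage", "cost", "$ 0"),
]
expanded = expand_inherits(arcs, inherits)
```

Two inherits rules applied to two AcmePage nodes assert four new arcs; the plain Page node is untouched.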
For HTML pages, presumably the HTML LINK element would be used to associate MCF blocks that provide metadata about that page.
The following examples contain information about the Acme Content Company web site that can be used for diverse purposes. For example,
- a robot could use it to determine which portions of the site to index.
- a browser could use it to present a site map.
- a push client could use it to periodically download portions of the site.
- the rich information here could be used by a search engine to provide better search (filters, concept based searches, etc.)
Given below is a sequence of three examples, each building on the previous one.
- The first example provides a very simple table of contents for the website of the Acme Content Company. The example does not contain anything other than a very simple table of contents, and the MCF representation is therefore very similar to a nested HTML list.
- The second example introduces Acme Company and its webmaster as units and also provides a lot more interesting information about the pages on the site.
- The third example illustrates several concepts, such as namespaces, structured values and inheritance.
<xml-mcf>
  <!-- BasicVocab defines some basic vocabulary that can be used to describe the structure of web sites. -->
  <MFC-REF XML-LINK="SIMPLE" ROLE="XML-MCF-BLOCK" href="http://www.standards.org/BasicVocab.mcf"/>
  <TableOfContents>
    <description>Acme Content Company Website Table of Contents</description>
    <Subject>
      <name>Living Desert</name>
      <description>Wild Life Pictures taken in the Sahara</description>
      <Page id="http://www.acc.com/scorpions.html">
        <description>Scorpions in the sun</description>
      </Page>
      <Page id="http://www.acc.com/Cactus.html">
        <description>Photographs of a lone cactus</description>
      </Page>
    </Subject>
    <Subject>
      <description>Dangerous Creatures</description>
      <Subject>
        <description>Dangerous Creatures in Africa</description>
        <parent unit="http://www.acc.com/scorpions.html" inverse="true" />
      </Subject>
      <Subject>
        <description>Dangerous Creatures in South America</description>
        <Page id="http://www.acc.com/anaconda.html">
          <description>Pictures of Anacondas</description>
        </Page>
        <Page id="http://www.acc.com/NinjaPenguins.html">
          <description>The Mythical Ninja Penguins</description>
        </Page>
      </Subject>
    </Subject>
  </TableOfContents>
</xml-mcf>
The above example corresponds to the following nested list.
- Living Desert
- Dangerous Creatures
  - Dangerous Creatures in Africa
  - Dangerous Creatures in South America
In this example, we repeat most of what we had in the previous example and, in addition, for each of the pages, we specify information such as the size of the page, its update schedule, and its author. To help with this, we also introduce the Acme Company and the webmaster as units.
Please note that this structure is a little more complex than that in the previous example and cannot be represented using simple HTML lists.
<xml-mcf>
  <!-- BasicVocab defines some basic vocabulary that can be used to describe the structure of web sites. -->
  <MFC-REF XML-LINK="SIMPLE" ROLE="XML-MCF-BLOCK" href="http://www.standards.org/BasicVocab.mcf"/>
  <WebSite id="AcmeContentCompanyWebsite">
    <name>ACME Content Company Web Site</name>
    <siteHomePage unit="http://www.acc.com/"/>
    <helpPage unit="http://www.acc.com/help.html"/>
    <lastRevisionDate>today</lastRevisionDate>
    <toc unit="acctoc"/>
    <contactAgent unit="jb@acc.com"/>
    <objectIcon unit="http://www.acc.com/ACCLogo.jpg"/>
  </WebSite>
  <Person id="jb@acc.com">
    <name>John Brown</name>
    <description>John Brown, who amongst other things, takes care of the ACME web site</description>
    <contactInformation>415-937-2607</contactInformation>
    <email>jb@acc.com</email>
    <homePage unit="/people/jb.html"/>
    <employeeOf unit="AcmeContentCompany"/>
  </Person>
  <Organization id="AcmeContentCompany">
    <name>The Acme Content Company</name>
    <homePage unit="http://www.acme.com"/>
  </Organization>
  <TableOfContents id="acctoc">
    <description>Acme Content Company Website Table of Contents</description>
    <Subject id="LivingDesert">
      <name>Living Desert</name>
      <description>Wild Life Pictures taken in the Sahara</description>
      <nextUpdateTime>June 1 1997</nextUpdateTime>
      <Page id="http://www.acc.com/scorpions.html">
        <description>Scorpions in the sun</description>
        <authorIndividual unit="jb@acc.com"/>
        <copyright unit="copyright.html"/>
        <toc unit="acctoc"/>
        <authorOrganization unit="AcmeContentCompany"/>
        <size>2000</size>
        <loadSize>35000</loadSize>
      </Page>
      <Page id="http://www.acc.com/cobra.html">
        <description>Photographs of a lone cobra</description>
        <authorIndividual unit="jb@acc.com"/>
        <copyright unit="copyright.html"/>
        <toc unit="acctoc"/>
        <authorOrganization unit="AcmeContentCompany"/>
      </Page>
    </Subject>
    <Subject>
      <description>Dangerous Creatures</description>
      <subject unit="http://mcf.yahoo.com/mcf/Recreation/Animals.mcf"/>
      <Subject>
        <description>Dangerous Creatures in Africa</description>
        <parent unit="LivingDesert" inverse="true"/>
        <!-- we are incorporating the living desert sub-tree under here as well -->
      </Subject>
      <Subject>
        <description>Dangerous Creatures in South America</description>
        <Page id="http://www.acc.com/anaconda.html">
          <description>Pictures of Anacondas</description>
          <authorIndividual unit="jb@acc.com"/>
          <copyright unit="copyright.html"/>
          <toc unit="acctoc"/>
          <authorOrganization unit="AcmeContentCompany"/>
          <includesContent unit="/images/anaconda.jpg"/>
        </Page>
        <Page id="http://www.acc.com/NinjaPenguins.html">
          <description>The Mythical Ninja Penguins</description>
          <authorIndividual unit="jb@acc.com"/>
          <copyright unit="copyright.html"/>
          <toc unit="acctoc"/>
          <authorOrganization unit="AcmeContentCompany"/>
        </Page>
      </Subject>
    </Subject>
    <Page id="copyright.html">
      <description>Copyright and Other Legal Notices</description>
      <authorOrganization unit="AcmeContentCompany"/>
      <contentUpdateSchedule unit="NeverUpdated"/>
      <language unit="USEnglish"/>
    </Page>
  </TableOfContents>
</xml-mcf>
In this example, we introduce several advanced features. Specifically, we illustrate,
- Schema additions (i.e., new categories and property types) made by Acme
- The use of namespaces
- The use of structured values
- Inheritance of certain properties that are true for a lot of the Acme pages.
We first provide the schema additions and then get into the description of the site.
The following describes the schema extensions made by the Acme Content Company that are available from http://www.acme.com/AcmeVocab.mcf. This is a very small extension, but it illustrates how MCF can be used to extend itself:
<xml-mcf>
  <!-- The contents of the MCF block that appear at http://www.acme.com/AcmeVocab.mcf -->
  <MFC-REF XML-LINK="SIMPLE" ROLE="XML-MCF-BLOCK" href="http://www.standards.org/BasicVocab.mcf"/>
  <!-- we have declared a new property called deptOfPage which applies to Acme web pages and whose value is a Department. We have also said that there may be at most one department responsible for each page and that the department is also the contactAgent for the page -->
  <FunctionalPropertyType id="deptOfPage">
    <description>Every page has a department associated with it (at ACC). This property is used to specify the ACC department associated with the page.</description>
    <domain unit="AcmePage"/>
    <name>deptOfPage</name>
    <range unit="Department"/>
    <superPropertyType unit="contactAgent"/>
  </FunctionalPropertyType>
  <Category id="Department">
    <name>Department</name>
    <superType unit="Organization"/>
    <description>Departments in the Acme Content Company</description>
  </Category>
  <FunctionalPropertyType id="departmentNumber">
    <name>departmentNumber</name>
    <description>The ACC department number associated with an ACC department</description>
    <domain unit="Department"/>
    <range unit="Integer"/>
  </FunctionalPropertyType>
  <Category id="AcmePage">
    <name>AcmePage</name>
    <superType unit="Page"/>
    <description>All the acme web pages</description>
    <inherits unit="copyright.html" propertytype="copyright"/>
    <!-- this is equivalent to adding <copyright unit="copyright.html"/> as part of every unit whose typeOf is AcmePage -->
    <inherits unit="jb@acc.com" propertytype="authorIndividual"/>
    <inherits unit="AcmeContentCompany" propertytype="authorOrganization"/>
    <inherits unit="acctoc" propertytype="toc"/>
    <inherits unit="help.html" propertytype="helpPage"/>
    <inherits propertytype="cost">$ 0</inherits>
    <!-- this is equivalent to adding <cost>$ 0</cost> as part of every unit whose typeOf is AcmePage -->
  </Category>
</xml-mcf>
Next comes the actual description of the web site, which uses the above schema extensions. (The base URL of the above MCF block and that of the following MCF block are the same. If they were not, references in the above block to content units defined in the following block might need to be adjusted.)
<xml-mcf>
  <MFC-REF XML-LINK="SIMPLE" ROLE="XML-MCF-BLOCK" href="http://www.standards.org/BasicVocab.mcf"/>
  <!-- include the previously defined schema so that it is available here -->
  <MFC-REF XML-LINK="SIMPLE" ROLE="XML-MCF-BLOCK" href="http://www.acc.com/accExtensions.mcf" prefix="acme"/>
  <WebSite id="AcmeContentCompanyWebsite">
    <name>ACME Content Company Web Site</name>
    <siteHomePage unit="http://www.acc.com/"/>
    <helpPage unit="http://www.acc.com/help.html"/>
    <lastRevisionDate>today</lastRevisionDate>
    <toc unit="acctoc"/>
    <contactAgent unit="jb@acc.com"/>
    <objectIcon unit="http://www.acc.com/ACCLogo.jpg"/>
  </WebSite>
  <Person id="jb@acc.com">
    <name>John Brown</name>
    <description>John Brown, who amongst other things, takes care of the ACME web site</description>
    <contactInformation>415-937-2607</contactInformation>
    <email>jb@acc.com</email>
    <homePage unit="/people/jb.html"/>
    <employeeOf unit="AcmeContentCompany"/>
  </Person>
  <Organization id="AcmeContentCompany">
    <name>The Acme Content Company</name>
    <homePage unit="http://www.acme.com"/>
    <contactInformation>
      <address>
        <streetAddress>17 Loop Drive.</streetAddress>
        <cityAddress>Alto Palo</cityAddress>
        <stateAddress>CA</stateAddress>
        <zip>95014</zip>
      </address>
      <phoneNumber>
        <areaCode>415</areaCode>
        <phoneNumberBody>965-1279</phoneNumberBody>
      </phoneNumber>
    </contactInformation>
  </Organization>
  <acme:Department id="acc.com/accEMarketingDept.mcf">
    <departmentNumber>32</departmentNumber>
  </acme:Department>
  <!-- note that we don't have to specify the author, etc. on the pages in this example. All of that is inherited by virtue of the pages being AcmePage -->
  <TableOfContents id="acctoc">
    <description>Acme Content Company Website Table of Contents</description>
    <Subject id="LivingDesert">
      <name>Living Desert</name>
      <description>Wild Life Pictures taken in the Sahara</description>
      <nextUpdateTime>June 1 1997</nextUpdateTime>
      <AcmePage id="http://www.acc.com/scorpions.html">
        <description>Scorpions in the sun</description>
        <size>2000</size>
        <loadSize>35000</loadSize>
      </AcmePage>
      <AcmePage id="http://www.acc.com/cobra.html">
        <description>Photographs of a lone cobra</description>
      </AcmePage>
    </Subject>
    <Subject>
      <description>Dangerous Creatures</description>
      <subject unit="http://mcf.yahoo.com/mcf/Recreation/Animals.mcf"/>
      <Subject>
        <description>Dangerous Creatures in Africa</description>
        <parent unit="LivingDesert" inverse="true"/>
        <!-- we are incorporating the living desert sub-tree under here as well -->
      </Subject>
      <Subject>
        <description>Dangerous Creatures in South America</description>
        <AcmePage id="http://www.acc.com/anaconda.html">
          <description>Pictures of Anacondas</description>
          <acme:deptOfPage unit="acc.com/accEMarketingDept.mcf"/>
          <includesContent unit="/images/anaconda.jpg"/>
        </AcmePage>
        <AcmePage id="http://www.acc.com/NinjaPenguins.html">
          <description>The Mythical Ninja Penguins</description>
        </AcmePage>
      </Subject>
    </Subject>
    <Page id="copyright.html">
      <description>Copyright and Other Legal Notices</description>
      <authorOrganization unit="AcmeContentCompany"/>
      <contentUpdateSchedule unit="NeverUpdated"/>
      <language unit="USEnglish"/>
    </Page>
  </TableOfContents>
</xml-mcf>
In the previous example, we had:
  <Organization id="AcmeContentCompany">
    ...
    <contactInformation>
      <address>
        <streetAddress>17 Loop Drive.</streetAddress>
        <cityAddress>Alto Palo</cityAddress>
        <stateAddress>CA</stateAddress>
        <zip>95014</zip>
      </address>
      <phoneNumber>
        <areaCode>415</areaCode>
        <phoneNumberBody>965-1279</phoneNumberBody>
      </phoneNumber>
    </contactInformation>
  </Organization>
This is equivalent to:
  <Organization id="AcmeContentCompany">
    ...
    <contactInformation>17 Loop Drive. Alto Palo CA 95014 415 965-1279</contactInformation>
    <address>17 Loop Drive. Alto Palo CA 95014</address>
    <streetAddress>17 Loop Drive.</streetAddress>
    <cityAddress>Alto Palo</cityAddress>
    <stateAddress>CA</stateAddress>
    <zip>95014</zip>
    <phoneNumber>415 965-1279</phoneNumber>
    <areaCode>415</areaCode>
    <phoneNumberBody>965-1279</phoneNumberBody>
  </Organization>
I.e., the value of the enclosing property (such as contactInformation) is the concatenation of the included values. The interior properties (such as address or streetAddress), just like the exterior properties, apply to the same container.
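The flattening rule can be sketched in a few lines of Python. This is only an illustration, not part of the proposal: the function name flatten is hypothetical, the input is an abbreviated fragment of the Organization example (street, city, state and phone number only), and joining values with a single space is an assumption inferred from the example above rather than a stated rule.

```python
import xml.etree.ElementTree as ET

def flatten(prop):
    """Return (value, pairs) for one property element.

    `value` is the property's own value; `pairs` lists (name, value) for the
    property itself and every property nested inside it, in document order.
    Assumption: nested values are joined with a single space.
    """
    children = list(prop)
    if not children:
        # Leaf property: its value is simply its text content.
        value = (prop.text or "").strip()
        return value, [(prop.tag, value)]
    parts, nested = [], []
    for child in children:
        child_value, child_pairs = flatten(child)
        parts.append(child_value)
        nested.extend(child_pairs)
    # Enclosing property: concatenation of the included values.
    value = " ".join(parts)
    return value, [(prop.tag, value)] + nested

SNIPPET = """
<contactInformation>
  <address>
    <streetAddress>17 Loop Drive.</streetAddress>
    <cityAddress>Alto Palo</cityAddress>
    <stateAddress>CA</stateAddress>
  </address>
  <phoneNumber>
    <areaCode>415</areaCode>
    <phoneNumberBody>965-1279</phoneNumberBody>
  </phoneNumber>
</contactInformation>
"""

value, pairs = flatten(ET.fromstring(SNIPPET))
for name, text in pairs:
    print(f"{name}: {text}")
```

Every pair in the result would then be asserted on the same container (here, the Organization), which is what makes the nested and flat forms equivalent.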
Appendices
In addition to the basic bootstrapping terms (typeOf, Category, etc.) specified earlier, in order to promote interoperability, we also propose some standard vocabulary that can be used for purposes of describing the kinds of content typically found on the web.
Such standard schemata are very important, but are separate from the data model and the transfer syntax. The purpose of this section of the proposal is to initiate a discussion. There is significant work to do in this area, but it should be started now.
Although this vocabulary can easily be specified in MCF itself, we describe it in English here for readability. The MCF specification will, however, be made available to authors.
An author can use this vocabulary as the schema for their MCF (by using XML-transclusion) and make further modifications and additions to it as they need.
As a convention, Categories are in the singular. So, the category of all people is called Person and of all organizations is called Organization.
Also, even though MCF is case insensitive, for purposes of human readability, as a convention, categories start with a capital letter and properties start with a lower case letter.
The name and identifier for all of the following are the same.
Content
- Includes everything from websites and web pages to legacy databases and file folders. Its superType is Unit.
ContentContainer
- A collection of information. Includes subject categories, file folders, channels, etc. Its superType is Content. There are no constraints on the items belonging to a container. The items in a container could themselves be containers. The relation between an item belonging to a container and the container is just parent (though we might want to eventually introduce a more specialized relation.) The distinction between a container and non-container is one of convenience. There will be cases where we want to consider a single page as a container and in other cases, we might want to consider the same page as an atomic entity. The flexibility of MCF allows us this freedom.
Subject
- The category of subjects. An example is the Arts category in Yahoo! or the Developer portion of the Netscape Website. Its superType is ContentContainer.
WebSite
- A web site. Its superType is ContentContainer.
Page
- A document. Could be a WordPerfect document on a PC or a web page or even a FileMaker database. Its superType is Content.
Agent
- The concept of an Agent is a general one intended to cover people, robots, organizations, etc. Its superType is Unit.
Organization
- Examples include Apple Computer, the United States and the Peace Corps. Organizations and people are mutually disjoint. Its superType is Agent.
Person
- The category of people. Its superType is Agent.
TableOfContents
- The table of contents for any Content (could be for a web site, page, ...) Its superType is Content.
NaturalLanguage
- Examples include English, French, etc. Its superType (for now) is Unit.
Schedule
- This category is used to specify information like the periodicity with which content is updated, when it should be pulled down, etc. Its instances range from simple ones like Hourly or Daily, through ones of intermediate complexity like daily at eight am, to more complex ones (such as that proposed by CDF) like hourly between eight am and six pm on weekdays...
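The superType relations listed above form a small hierarchy rooted at Unit. As an informal sketch (not part of the proposal), it can be written down as a lookup table and walked upward. The names SUPER_TYPE, ancestors and is_a are hypothetical, and the table assumes exactly one superType per category, which is all the listing above states; Schedule is left out because no superType is given for it.

```python
# superType of each standard category, as stated in the descriptions above.
# Assumption: one superType per category. Schedule is omitted because the
# text does not state its superType.
SUPER_TYPE = {
    "Content": "Unit",
    "ContentContainer": "Content",
    "Subject": "ContentContainer",
    "WebSite": "ContentContainer",
    "Page": "Content",
    "Agent": "Unit",
    "Organization": "Agent",
    "Person": "Agent",
    "TableOfContents": "Content",
    "NaturalLanguage": "Unit",
}

def ancestors(category):
    """Return the chain of superTypes from `category` up to the root."""
    chain = []
    while category in SUPER_TYPE:
        category = SUPER_TYPE[category]
        chain.append(category)
    return chain

def is_a(category, other):
    """True if `category` is `other` or has `other` among its superTypes."""
    return category == other or other in ancestors(category)

print(ancestors("Subject"))            # ['ContentContainer', 'Content', 'Unit']
print(is_a("Person", "Agent"))         # True
print(is_a("Organization", "Person"))  # False
```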
There has been much work in standardizing vocabularies for describing agents, most notably vCard, and we hope to adopt those standards as applicable. In addition, we should also provide standard property types for describing the location, hobbies, etc. of agents.
emailAddress
- A string representing the email address of an agent.
homePage
- The URL of the home page(s) of an agent.
contactInformation
- A string representing how the person can be contacted.
Existing standards that these draw from (and will rely upon even more in the future) include the Dublin Core, Z39.50 and of course, the rich body of work in Library Science.
authorIndividual
- The individual person or persons who authored the content object. The entries are not names of the authors but references to objects corresponding to the authors. The name, email address, etc. of the author can be specified on that object.
authorOrganization
- The organization which is the author of the content object.
author
- The generalization of the previous two property types. It is a superPropertyType of both of them.
editor
- The agent that is the editor of the content object.
publisher
- The agent that is the publisher of the content object.
contactAgent
- The agent who is the "contact" for that piece of content. Typically the person behind "webmaster@xyz.com".
copyright
- The copyright declarations. The range is a page addressing copyright and other legal issues.
size
- The size of a content object in bytes. Represented using an integer. This is the size of the object alone and does not represent the size of its inclusions (like in-line images).
loadSize
- The total number of bytes, including inline images, plugins, etc. of a content object.
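The difference between size and loadSize can be illustrated with a short sketch. It assumes that loadSize is the unit's own size plus the load size of everything it includes, and that inclusions do not form a cycle. The function name load_size, the unit names and the byte counts are all hypothetical.

```python
def load_size(unit, sizes, includes):
    """Own size of `unit` plus the load size of everything it includes.

    Assumption: inclusions form a tree (no cycles).
    """
    total = sizes[unit]
    for part in includes.get(unit, []):
        total += load_size(part, sizes, includes)
    return total

# Hypothetical units: one page with two in-line images.
sizes = {"page.html": 2000, "photo.jpg": 30000, "logo.gif": 3000}
includes = {"page.html": ["photo.jpg", "logo.gif"]}

print(load_size("page.html", sizes, includes))  # 35000
print(load_size("logo.gif", sizes, includes))   # 3000
```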
Some more temporal property types appear under Schedules.
publicationDate
- The date on which a content object was first published.
lastRevisionDate
- The date on which the content object was last modified.
expires
- The date until which the information in this content object is valid.
contentUpdateSchedule
- The frequency with which this is typically updated. The range is a Schedule (which includes Hourly, Daily, etc. and also more complex Schedules.)
versionNumber
- The version number of this content object or subject category. A string.
contentDownloadSchedule
- This is to be used if the content is to be proactively downloaded to the user's computer. It specifies the download schedule and the entry is a Schedule.
nextUpdateTime
- The next time that this piece of content is scheduled to be updated.
nextDownloadTime
- This is also to be used if the content is to be proactively downloaded to the user's computer. It specifies the next time this piece of content should be redownloaded. More often than not, this will suffice in lieu of a full-blown schedule and will default to the nextUpdateTime.
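The defaulting of nextDownloadTime to nextUpdateTime might be handled by a consumer roughly as follows. This is a sketch under stated assumptions: the properties of a content object are held in a plain dictionary keyed by property type name, and the function name next_download_time is hypothetical.

```python
def next_download_time(properties):
    """Return nextDownloadTime, falling back to nextUpdateTime when absent."""
    if "nextDownloadTime" in properties:
        return properties["nextDownloadTime"]
    # Returns None when neither property has been specified.
    return properties.get("nextUpdateTime")

# Only nextUpdateTime is given, as for the LivingDesert subject in the example.
living_desert = {"nextUpdateTime": "June 1 1997"}
print(next_download_time(living_desert))  # June 1 1997
```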
subject
- The subject categories that this content object falls under. parent is a superProperty of subject. Using this, an author could for example suggest that his/her page belongs to a certain Yahoo! subject category.
language
- The language(s) (typically a natural language such as English or French) in which the content is primarily encoded.
toc
- One or more tables of contents of which this content object is a part.
siteHomePage
- The home page for the site of which this content object is a part.
helpPage
- The page at which help can be found regarding this content object.
linksTo
- The content objects that a content object has hyperlinks to. parent is a superProperty of linksTo.
includesContent
- To be used when one content object includes another (such as an HTML page including an image or a poem). This is useful when we want to distinctly identify a certain piece of a page, such as a table, as a first class unit and specify the relation between the enclosing page and table.
contentMimeType
- The MIME type of the content.
contentPartMimeTypes
- A convenience predicate for specifying the MIME types of all the included content.
superTopic
- A relation between two subject categories, such as Yahoo Arts and Yahoo Arts Museums, which states that the latter is a more specific subject category of the former. parent is a superPropertyType of superTopic.
objectIcon
- An icon that can be used to represent the object. The value is typically the object corresponding to a GIF or JPEG, but could also be a platform specific encoding. Preferably, it will be one object with several different encodings being available.
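Several of the property types above are declared to have a more general property type above them (parent for subject, linksTo and superTopic; author for authorIndividual and authorOrganization). The sketch below assumes the usual reading of that relation, namely that an assertion made with the specific property type also implies the same assertion with the general one. The names SUPER_PROPERTY_TYPE and with_implied are hypothetical, and the sample assertion is invented for illustration.

```python
# General property type for each specific one, as stated in this appendix.
SUPER_PROPERTY_TYPE = {
    "subject": "parent",
    "linksTo": "parent",
    "superTopic": "parent",
    "authorIndividual": "author",
    "authorOrganization": "author",
}

def with_implied(assertions):
    """Return the given (source, property, target) assertions plus implied ones.

    Assumption: an assertion with a specific property type implies the same
    assertion with its superPropertyType.
    """
    result = set(assertions)
    pending = list(assertions)
    while pending:
        source, prop, target = pending.pop()
        general = SUPER_PROPERTY_TYPE.get(prop)
        if general is not None:
            implied = (source, general, target)
            if implied not in result:
                result.add(implied)
                pending.append(implied)  # follow longer chains, if any
    return result

# Hypothetical assertion: a page placed under a Yahoo! subject category.
facts = {("http://www.acc.com/cobra.html", "subject",
          "http://mcf.yahoo.com/mcf/Recreation/Animals.mcf")}
for fact in sorted(with_implied(facts)):
    print(fact)
```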
location
- One or more URIs from which the object's content may be obtained.
contentMirror
- Mirror URIs for this content object. Mirrors are assumed to be secondary sources of the content, which might potentially be stale. The distinction between mirrors and location is subtle at best.
contentAvailabilityStatus
- This Property can be used to specify information like whether the server is down, the last time the content was accessible, etc. This meta-content is typically furnished not by the content provider himself, but by indexers like Yahoo!
accessMode
- This is used to specify whether the content is to be accessed via the traditional Web pull mechanism, via email (e.g., InBox Direct), via channels, etc.
contentRating
- The intent of this Property is to contain the information that would be contained in a PICS-like rating. The range is Rating. The schema for Ratings is beyond the scope of this document.
contentCost
- The cost of this content. The range is a Cost, which could be as simple as "5 US Dollars" or something much more complex. The more complex specification is beyond the scope of this proposal.
scheduleStartDate
- This is the day upon which the schedule will start to apply.
scheduleEndDate
- This is the day upon which the schedule expires and no longer applies.
scheduleIntervalTime
- The interval of time that the schedule should repeat over.
scheduleEarliestTime
- Earliest time during the schedule interval that the schedule applies to.
A very large number of people have contributed to the material in this proposal. It draws heavily from the knowledge representation work in AI. It owes a lot to the MCF project at Apple and we would like to thank the folks who made that happen, including Jed Harris, Alan Kay, Don Norman and Larry Tesler. We would also like to thank Edwin Aoki, Tim Craycroft, Tim Hickman, Phil Karlton, Mike McCue and Tom Paquin of Netscape for the comments and feedback on this draft. External feedback from Jon Bosak and Mark Walter was also of great help.