Web Collections-IBM revision

[Mirrored from: http://www-ee.technion.ac.il/W3C/WebCollection.html; or try: http://www.w3.org/pub/WWW/MarkUp/Group/WD-webmap-961018.html]

W3C WD-webmap-130297

Web Collections: A mechanism for grouping Web documents and site mapping

W3C Working Draft 13-Feb-97

Revision : 1.4
This version:
Latest version:

Yoelle Maarek, IBM Haifa Research Lab, yoelle@haifa.vnet.ibm.com

Dror Zernik, IBM Haifa Research Lab, zernik@haifa.vnet.ibm.com

Contributors (author status pending, upon approval of this version):

Scott Berkun, Microsoft, scottber@microsoft.com

Eric Brown, IBM T.J. Watson Research Center, brown@watson.ibm.com

Dan Connolly, W3C, connolly@w3.org

George Hatoun, Microsoft, georgeh@microsoft.com

Yaron Goland, Microsoft, yarong@microsoft.com

R. V. Guha, Apple, guha@apple.com

Murray Maloney, SoftQuad, murray@sq.com

Rory Stark, Netcarta, rory@netcarta.com

Liam Quin, SoftQuad, lee@sq.com

Status of this document

This is a revision of the W3C Working Draft that was distributed on 10/18/96. It contains a great deal of material from the original document, but it has a different focus and a more refined notion of a collection. We suggest that this revised draft become the current working draft.

It also integrates most of "the robust object model" document authored by Yaron Goland and George Hatoun of Microsoft (derived from the same working draft), which defines the syntax aspect of Web Collections.

This draft is a reunified version of the "forked" versions.


1. Introduction

Robust document types, such as newspapers and books, provide several different mechanisms for organizing content. In most books or magazines, a table of contents presents a roadmap of the book. An index provides the location of specific references within the book. A book bibliography specifies information about data external to the book. In contrast, the World Wide Web presently has no universal mechanism for encoding information about objects of any type, including Web-indigenous objects. "Web Collections" is an HTML-compliant data structure which can specify information about objects both Web-based and not.

For many tasks on the Web, it would be useful to transfer and manipulate groups of Web documents as if they were a single document, an ordered list of documents, or a hierarchical structure of documents. Collections can be used, for instance, for:

The above examples indicate that a collection should include, in addition to the set of Web pages, a definition of the relation between these pages.

A collection provides a new high-level abstraction of the already existing Web structure. The notion that is introduced by the collection relation is independent of both the hypertext organization and the physical site organization. It thus provides a new conceptual model.

As a model it is format independent, and there may exist different notations for representing it. This document will provide one possible (standard?) notation for this model.

The collection object is a first order member of the Web, that is, a collection may refer to collections. However, the information that is included in the collection represents meta data, in the sense that it provides an additional organization perspective on the Web pages.

Accordingly, a particular Web page could be contained in one, or many collections. Since there are many ways to traverse a Web space, and equally many ways to organize it, there is no restriction on how many collections a particular Web page might be associated with. Collections are abstractions - they provide a convenient way to manipulate groupings of content, but they do not impose any significant restrictions on the actual contents of the Web pages themselves.

2. Terminology in this working draft

Before going any further, the key terminology for discussing collections should be made clear. The table below lists the key conceptual terms in this specification. The names of these terms may change as this draft is reviewed, but the meanings will remain stable, so readers should recheck this section in future revisions of this document.

Term: Objects
Meaning: Web units of three types:
  • pages to which URLs point
  • links
  • sets of objects
Example: Page: HTML document, GIF, etc. Link: HREF.

Term: Set
Meaning: An unordered list of objects of a single type, or of sets of the same type.
Example: list of pages; list of links; search results.

Term: Collection
Meaning: A heterogeneous set of sets. Typically, a set of pages and the set of cross references (links). A collection may represent a new granularity of Web content, that is, between a Web site and a Web page, or may span beyond the boundaries of a single site and form a "subject-based organization". Collections will have many different purposes, such as printing or off-line reading. However, it is not intended that collections provide or replace indexing services.
Example: A set of pages and the set of HREFs that link them.

Term: Operations
Meaning: Functions on objects of the same type. Operations on objects and collections are crucial, for instance, for building a high-level model for Web site management.
Example: make empty set, add an element to a set, union of two sets, append of two sets (lists), etc.

Term: "meta data", attributes
Meaning: Properties describing information about an object. Attributes can be content-dependent (keywords) or content-independent (size).
Example: For a page: size, author. For a link: string, number of references. For a collection: whether the collection is ordered or not.

Term: Interpretation
Meaning: A collection provides an abstract model of Web pages and their relations. Different tools can provide different interpretations of this model, by extracting more of the meta-data or by displaying it in different ways.
Example: Table of Contents (TOC), Webmap.

Term: Webmap
Meaning: A specific interpretation of a collection for the purpose of displaying a user interface for the organization of a site. The Webmap is a special case of a collection.
Example: (none given)

Term: Table of Contents
Meaning: Yet another specific interpretation of a collection, one that imitates the order relation of a book.
Example: organization of a site.

3. Usage Scenarios

A collection can be stored in a file (in a default location) to reflect the meta-data that a Webmaster wishes to expose to external tools, such as robots. Thus, instead of telling robots which pages and documents not to visit (negative approach) via "robots.txt", the HTTP server can suggest to robots which pages they should visit (positive approach) via "sitemap.txt". This would save robots time and avoid overloading HTTP servers. Note that such collections can be manually generated (via the Webmaster's favorite authoring/publishing tool) or automatically generated by local crawling. Therefore, the generation of a collection is not necessarily a single, central, atomic event. This implies that IDs for sets and for collections are required, and that operations between sets must be supported. (Note that links and pages are already uniquely identified.)

When reading the data that is stored in a collection, it is vital that the size of this data is minimized. Minimizing the cost of using the meta-data can be achieved by transferring only the relevant information.

We give below some examples of the data that must be represented in a collection, according to various usage scenarios.

The variety of possible uses requires that a mechanism for defining meta-data is supported and that there is no dependency on the completeness of the data. In other words, a collection can be accessed in pieces, and an incremental definition of a collection can be desirable.

4. Requirements

Backward compatibility
- The HTML syntax for collections will be based mostly on existing HTML constructions, to enable backward compatibility of Web browsers.

Light weight
- The total collection representation should be small. The collection size is necessarily proportional to the number of objects it represents, but each object should require no more than several hundred bytes. The more data is stored for each object, the more costly the collection becomes to use.

Optional (incomplete)
- No portion of the data is compulsory.

Non-central (non-contiguous)
- Elements of a list can be added in pieces (using append).
- Meta-data values can be overridden (last/max value remains).
Non-central means that the collection as an object does not necessarily reside in a single file. It may be composed of several pieces of messages/files that are collected by agents. Non-contiguous means that if there are several agents that generate the collection, they may have to generate each set in pieces. This can work as long as they have a common naming procedure.

Flexible data
- Support of meta-data definition mechanism for tool-builders.

Unrestricted consistency update policy
The relationship between a Web Collection and the repository (the site, or list of sites it refers to) is such that the Web Collection is the reflection of the state of the repositories containing the resources. That is, the repository is always the actual world, which a Web Collection may or may not correctly represent. There may be a variety of ways to enforce, or guarantee the consistency of the collection, such as: automatic tools which scan the site periodically, manual re-writing rules, or robots. In any case, it is the responsibility of the collection user to make sure that the collection truly reflects the structure and data (as well as meta-data) that exists in the repositories.

5. Predefined attributes (Meta-Data)

The following list of attributes of the elements is provided so that readers can compare it against a typical scenario they have in mind. We refer the reader to the Dublin Core (DC) list of attributes as a common ground for meta-data attributes (13 attributes); see the Dublin Core Metadata Workshop or the Dublin Core schema.

However, we would like to emphasize that even the DC attribute list is simply a recommendation, that is, the attributes should not become a part of the collection standard.

It is suggested that each attribute of the above list be defined and associated with a type. This will allow for some minimal type checking. For example, although "# of references" could be written either in words or as a string of digits, only one representation should be selected. Thus "# of references" should be of a type that admits numerical digits only.
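The minimal type checking suggested above might look like the following sketch. The registry `ATTRIBUTE_TYPES`, the type names "digits" and "text", and the function `check_attribute` are hypothetical and chosen only for illustration; this draft does not define them.

```python
# Hypothetical registry mapping attribute names to a declared type.
# Neither the names nor the type vocabulary come from this draft.
ATTRIBUTE_TYPES = {
    "# of references": "digits",
    "author": "text",
}


def check_attribute(name: str, value: str) -> bool:
    """Return True if the value is acceptable for the attribute's declared type."""
    kind = ATTRIBUTE_TYPES.get(name)
    if kind is None:
        # Undeclared attributes are not checked, since no attribute is compulsory.
        return True
    if kind == "digits":
        # Only one representation is allowed: ASCII digits, never words.
        return value.isascii() and value.isdigit()
    return True  # "text" accepts any string


ok = check_attribute("# of references", "12")
rejected = check_attribute("# of references", "twelve")
```

Attributes without a declared type pass unchecked here, which keeps the check compatible with tool-specific attributes.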

6. Examples

Recall that sets, in this context, are typed objects. Each set is of type pages, links or collection attributes. A collection consists of exactly three sets, one of each type. Some of the sets may be empty, or ordered.
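As a rough sketch of this structure, a collection can be modeled as a record with exactly three typed sets. The class name `Collection`, its field names, and the sample values are assumptions made for illustration and are not part of this draft.

```python
from dataclasses import dataclass, field


@dataclass
class Collection:
    """Exactly three typed sets; any of them may be empty."""
    attributes: dict = field(default_factory=dict)  # collection attribute set
    pages: list = field(default_factory=list)       # page URLs (a list keeps order when the set is ordered)
    links: list = field(default_factory=list)       # (source URL, target URL) pairs


# A collection carrying only attributes: both other sets are empty.
empty = Collection(attributes={"ordered": "no"})

# The same attributes plus a single page and no links.
single_page = Collection(
    attributes=dict(empty.attributes),
    pages=["http://www.foo.org/bar.html"],
)
```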

Consider the following examples:

Example 1: Using search results as a collection

Collections are well suited to store search results. The attributes of a search result collection include:

These attributes form the collection attribute set. One can always modify this set by providing new attributes (such as owner), by overwriting a previous attribute (for example, re-issuing the search at a later date changes the date attribute), or by deleting an attribute (the use case for deletion is not yet clear).

Before the search is issued, the collection contains the attribute set and two empty sets for pages and links. If the search fails for some reason, this is the collection that will be returned, possibly with an error message.

If, however, the search succeeds and the site is found, but it contains only a single page with no other references, the resulting collection will contain the above set of attributes (with possible updates, such as transfer rate), a set of pages (in this case a single-element set), and a still-empty set of links. With each page (and later also with each link), meta-data (more attributes) can be associated. In this example, the PICS rating of the page, keywords, author, title, last update, etc., can be associated with the page.

In a different scenario, the search can return a page that does have links, none of which is reachable for some reason: they are references to places outside the search scope (on an Intranet, for example), to pages that are protected (by password or credit card), or to pages that simply do not exist. In this case the set of links becomes non-empty, and with each link we associate -

Example 2: Using a collection for guiding a search

Collections are particularly well-suited for guiding searches. A search engine can receive a collection as its input argument. The collection can be used to enable searching over more than a single site and at the same time, to specify priorities and order of searching. Furthermore, for explicit guidance of the search, the search should accept two input collection arguments: the GO, and DON'T_GO collections.

The GO collection includes an ordered list of sites and for each site it also includes an ordered list of links to follow. The DON'T_GO collection includes a list of pages that should not be visited. These can be pages that the user has already visited, or sites that are too expensive (time and money) to access. For each page in the GO collection, there may be different links indicating the preferred access to it.
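One way a search engine might combine the two collections is sketched below. The function `plan_visits`, the shape of its arguments, and the sample URLs are hypothetical; this draft does not define how a search engine consumes the GO and DON'T_GO collections.

```python
def plan_visits(go, dont_go):
    """Return (site, link) pairs in GO order, skipping anything listed in DON'T_GO."""
    blocked = set(dont_go)
    plan = []
    for site, links in go:  # go: ordered list of (site, ordered list of links)
        if site in blocked:
            continue  # the whole site is excluded
        for link in links:
            if link not in blocked:
                plan.append((site, link))
    return plan


go = [
    ("http://www.foo.org/", ["http://www.foo.org/a.html", "http://www.foo.org/b.html"]),
    ("http://www.bar.org/", ["http://www.bar.org/index.html"]),
]
dont_go = ["http://www.foo.org/b.html"]  # e.g. a page the user has already visited
plan = plan_visits(go, dont_go)
```

Because both inputs are ordered lists, the priorities expressed by the GO collection survive in the resulting plan.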

7. Operations on Attributes and Sets

It should be clear from the usage scenarios that collections are not data structures in themselves; they are carriers of information, that is, the means of communication between agents. Accordingly, the operations discussed here are not operations on the collections themselves; they are optional, additional meta-data that provides a means of expressing relations between collections.

Consider for example two search agents that return collections which include the results of their search. Some items in the page list may be overlapping, that is, may be included in both collections. In this case, we want to be able to indicate to the receiver of the collections that the relation of the two collections to the desired search result is a union. It is the responsibility of each of the agents to provide a proper, unique name to the subcollection it returns, so that it can be referenced by other agents. As can be seen from this example, all that is needed is a naming convention, and a definition of the desired semantics of the relation between collections.
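A receiver that is told the relation is a union might merge the page lists as in the sketch below. The function `union_pages` and the subcollection names such as "agent-a/result-1" are hypothetical; the draft requires only that each agent supply a unique name.

```python
def union_pages(*named_collections):
    """Union the page lists of named subcollections.

    Keeps first-seen order and records which subcollections reported each page.
    """
    merged = {}
    for name, pages in named_collections:
        for url in pages:
            merged.setdefault(url, []).append(name)
    return merged


agent_a = ("agent-a/result-1", ["http://www.foo.org/a.html", "http://www.foo.org/b.html"])
agent_b = ("agent-b/result-1", ["http://www.foo.org/b.html", "http://www.foo.org/c.html"])
result = union_pages(agent_a, agent_b)
```

The overlapping page appears once in the result, tagged with both subcollection names.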

As mentioned earlier, it may be necessary to perform operations on the attributes of each of the objects that the model supports. For a "simple" attribute (an attribute-value pair), such an operation modifies the attribute's value. For sets, it provides a mechanism for building new sets.

A relevant reference in this context is the implementation of collections in Java (Java-collections). Recall that collections in the current context are not data structures or containers, whereas Java-collections are. Nevertheless, it will probably be natural for an application to store the input data provided by one of our collection messages in a Java-collection of some type. The operations and the semantics of the manipulations of the Java data structure may, in some cases, be included in the meta-data. Therefore, collections may associate objects with operators for this purpose. This should be discussed; an example is still being built. Operations on attributes include:

Operations on sets include:

8. Syntax

8.1 Introduction

This section describes in detail information about the syntax of a Web Collection object and provides a few illustrative examples. However, since the uses for Web Collections will vary significantly, extensive examples for specific applications are not provided. It is expected that future documents will address standards for applications of Web Collections.

The collection needs to contain meta-data on three objects: pages (URL), links, and sets. (Sets of pages, sets of links, and sets of sets). For pages, there exists the meta-data mechanism. For links it seems that with minor changes to the tag-id this can be done. For sets (such as lists in HTML), there is no mechanism for attaching meta-data, (as far as the authors have observed) and therefore it looks like we will have to define one. This is an undesirable situation, in which each object is associated with meta-data using a different mechanism. We therefore are considering extending the meta mechanism for all types of objects.


8.2 Semantic and syntactic information

The Web Collection object model consists of objects that have two parts, attributes and data. Attributes are meta-information about the current object that can link the object to another using directional links. Data, on the other hand, is information about the current object per se, rather than about its relationship with other objects.

There are four components of the Web Collection syntax. The first is the Web Collection object itself, delimited by <WC> ... </WC> tags. This is a package within which information (data and attributes) pertaining to the object can be written. This package can appear anywhere within the <BODY> ... </BODY> tags of an HTML document. Within the opening <WC> tag, the defined attributes are VERSION, TYPE, and NAME. VERSION is a number in #.# form which corresponds to the version number of the Web Collection spec with which the Web Collection objects are compliant. (This spec is version 0.1.) The TYPE attribute is a name token or string literal (as defined by HTML) which corresponds to the type of information represented by (or the use of) the Web Collection object. The NAME attribute is used to give a name to a Web Collection object. The value of this attribute can also be either a name token or a literal string; the function of the name is not defined by this specification. In addition, arbitrary attributes can be defined (as needed) within the <WC> tags in accordance with the standard HTML attribute assignment syntax.

The Attributes of the Web Collection object are identified by the <WCAT> </WCAT> tags, which must be fully enclosed by the object to which they pertain; they apply only to the object which most tightly surrounds them, and not to objects that might also surround them but are higher in the hierarchy. Three attributes are defined for the opening <WCAT> tag: REL, REV, and HREF. Please refer to the <A> definition in the HTML specification for authoritative information about the legal syntax of these attributes. The REL attribute describes a relationship between the current object and the object being pointed to while the REV attribute describes the relationship of the target object to the current one. HREF identifies the target object. One particular instantiation of these attributes that is defined here is REL=WC_LINK, HREF="Name Anchored URI". This pair means to treat the contents of the name anchored URI (<A NAME="foo"> </A>) as if it were inserted immediately after the closing </WCAT> tag.

The <WCAT> tag may also have other attributes expressing other types of relationships, though these attributes are not defined in this document. In addition, relationships can be expressed within the <WCAT> </WCAT> tags. Further, an unlimited number of <WCAT> </WCAT> pairs may exist and be at any point(s) within a <WC> </WC> object. Embedding <WCAT> tags within other <WCAT> tags does not make sense and is therefore disallowed.

The <WCDATA> </WCDATA> tags are intended to encode or point to the object's actual data, if any. <WCDATA> tags, like <WCAT> tags, can appear anywhere within a <WC> object, though they cannot be nested directly inside one another. They can have REL=WC_LINK HREF="Name Anchored URI" attributes (as defined for <WCAT> above) and HDATA attributes. HDATA attributes are either HTML name tokens or string literals which contain data which, by its placement within a tag, will remain hidden from a user viewing the Web Collection in a downstream browser. In addition, data may also be placed between <WCDATA> </WCDATA> tags.

Finally, there is a <COLLECTION> tag, which can appear anywhere within a document and applies to all <WC> </WC> objects which begin linearly (not necessarily hierarchically) after it in the HTML document. This tag serves to set default attribute values for the <WC> objects. The <COLLECTION> tags are cumulative, but where an attribute is redefined, only the last definition is used (an attribute cannot have multiple values at the same time). All of the attributes that are legal within the <WC> tag, including arbitrary attributes, are allowed. Attribute values set in a <WC> </WC> object will override the values in the <COLLECTION> tag (where they overlap), but for the current collection object only.
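The defaulting rules can be sketched with Python's standard `html.parser` module, which reports tag and attribute names in lower case. The class `DefaultsResolver` and the TYPE values "sitemap" and "toc" are hypothetical, and the sketch is one possible reading of the rules rather than a reference implementation.

```python
from html.parser import HTMLParser


class DefaultsResolver(HTMLParser):
    """Compute the effective attributes of each <WC> object in document order."""

    def __init__(self):
        super().__init__()
        self.defaults = {}   # cumulative <COLLECTION> defaults seen so far
        self.resolved = []   # one dict of effective attributes per <WC>

    def handle_starttag(self, tag, attrs):
        if tag == "collection":
            # Cumulative: new attributes are added, redefined ones are replaced.
            self.defaults.update(attrs)
        elif tag == "wc":
            effective = dict(self.defaults)
            effective.update(attrs)  # override applies to this object only
            self.resolved.append(effective)


doc = """
<COLLECTION VERSION="0.1" TYPE="sitemap">
<WC NAME="first"></WC>
<COLLECTION TYPE="toc">
<WC NAME="second" VERSION="0.2"></WC>
<WC NAME="third"></WC>
"""
resolver = DefaultsResolver()
resolver.feed(doc)
resolver.close()
```

Here the second object overrides VERSION for itself only, so the third object falls back to the default set by the first <COLLECTION> tag while picking up the TYPE redefined by the second.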

Web Collection objects may be embedded within other Web Collection objects so long as they are completely encapsulated within or between elements. This draft does not address the full implications of nested Web Collections. Simply put, you can nest a Web Collection object at any place inside another object, so long as the nesting constitutes a full embedding of the subsidiary object inside the encapsulating object or subpart.

For example, it is legal to embed a new <WC> </WC> object inside the <WCDATA> </WCDATA> object, so long as the object is fully contained within the <WCDATA> </WCDATA> subpart. In addition, the <WC> </WC> object embedded within the data tags is both an object in its own right and part of the data of the first object (some applications of Web Collections may redefine this meaning or further restrict nesting, but we will not do this here).
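A reader for nested objects could keep a stack of open <WC> objects so that each <WCAT> is attached to the object that most tightly surrounds it. The sketch below uses Python's standard `html.parser`; the class `WCReader`, the REL values, and the URLs are hypothetical, and the sketch makes no claim about the semantics of nesting, which this draft leaves open.

```python
from html.parser import HTMLParser


class WCReader(HTMLParser):
    """Attach each <WCAT> to the <WC> object that most tightly encloses it."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open <WC> objects, innermost last
        self.objects = []   # every <WC> seen, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "wc":
            obj = {"name": dict(attrs).get("name"), "wcat": []}
            self.objects.append(obj)
            self.stack.append(obj)
        elif tag == "wcat" and self.stack:
            # Appending in tag order preserves the order written by the author.
            self.stack[-1]["wcat"].append(dict(attrs))

    def handle_endtag(self, tag):
        if tag == "wc" and self.stack:
            self.stack.pop()


doc = """
<WC VERSION="0.1" TYPE="example" NAME="outer">
<WCAT REL="TOP" HREF="http://www.foo.org/bar.html#bob"></WCAT>
<WCDATA>
<WC NAME="inner">
<WCAT REL="CHILD" HREF="http://www.foo.org/baz.html#amy"></WCAT>
</WC>
</WCDATA>
<WCAT REL="NEXT" HREF="http://www.foo.org/qux.html#sue"></WCAT>
</WC>
"""
reader = WCReader()
reader.feed(doc)
reader.close()
outer, inner = reader.objects
```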


Web Collections and HTML

 Arbitrary HTML can be inserted anywhere in a web collection and between a WC tag pair without affecting the web collection. Anything inserted between <WCDATA></WCDATA> or <WCAT></WCAT> tags takes on dual characteristics: it is at once HTML to be presented to the user and data/attributes for the web collection.


Example 1: A basic web collection

 (Web collection elements are indicated by bold.)

<TITLE>Web Collection Example</TITLE>
A Web Collection object:
A basic attribute tag:
<WCAT REL="TOP" REV="CHILD" HREF="http://www.foo.org/bar.html#bob">
Relationship information inside
An attribute tag which points to a name-anchored URI; these tags are to be treated as if the information inside the anchor were written in the document at this point:
<WCAT REL="WC_Link" HREF="../index.html#foo">
An attribute tag for including arbitrary name-value pairs:
A basic data tag set:
<WCDATA HDATA="Hidden text which is somebody's data">
Data can also go inside here
A data tag set referring to a name-anchored URI; these tags are to be treated as if the information inside the anchor were written in the document at this point:
<WCDATA REL="WC_LINK" HREF="../somelocation#anc">




Example 2: A nested web collection

 (Web collection elements are indicated by bold.)

<TITLE>Web Collection Example</TITLE>
A Web Collection object:
A basic attribute tag:
<WCAT REL="TOP" REV="CHILD" HREF="http://www.foo.org/bar.html#bob">
An attribute tag for including arbitrary name-value pairs:

A nested WC could go here
A basic data tag set:
<WCDATA HDATA="Hidden text which is somebody's data">
Data can also go inside here

A nested WC could go here
Another web collection could go here.



Syntax definition

 The HTML syntax for Collections requires the addition of a number of new tags, but is based on many existing HTML constructions. The following HTTP Extended BNF tokens describe Web Collections: 

WebCollection = *(Collection | WC)
Collection = "<COLLECTION" [Version] [Type] [Name] *Attribute ">"
WC = WC_Open WC_Data_Attrib_Coll WC_Close
WC_Open = "<WC" [Version] [Type] [Name] *Attribute ">"
Version = "VERSION =" 1*Digit "." 1*Digit
Type = "TYPE =" HTML-string-literal
Name = "NAME =" HTML-string-literal
WC_Data_Attrib_Coll = *(WC | WCdata | WCat | Collection)
WCdata = "<WCDATA" (*Attribute | Generic_Link | Hdata) ">" (NULL | Legal-HTML) "</WCDATA>"
Hdata = "HDATA =" HTML-string-literal
WCat = "<WCAT" (*Attribute | Generic_Link) ">" (NULL | *Attribute) "</WCAT>"
Generic_Link = ((Rel | Rev | Rel Rev) Href | WC_Link)
Attribute = HTML-attribute-name "=" (HTML-name-token | HTML-string-literal)
WC_Link = "REL = WC_LINK HREF =" Anchored_URI
Anchored_URI = URI pointing to Named_Anchor in HTML file
Named_Anchor = "<A NAME=" Text ">" WC_Data_Attrib_Coll "</A>"
WC_Close = "</WC>"
HTML-attribute-name = (See definition in HTML 2.0 spec)
HTML-string-literal = (See definition in HTML 2.0 spec)
HTML-name-token = (See definition in HTML 2.0 spec)
Rel = (See definition under <A> in HTML 2.0 spec)
Rev = (See definition under <A> in HTML 2.0 spec)
Href = (See definition under <A> in HTML 2.0 spec)


Collection tags set default attributes for all WC tags. However, a WC tag can override a collection attribute by including the same attribute with a different value. The collection tag affects all WC tags that come after it in the document. The effects of collection tags within a document are cumulative.

<WC> </WC>

The <WC></WC> tags wrap a web collection. The syntax is such that a single document can contain multiple collections. Collections can also be nested within each other. This specification does not define the semantics of a nested collection.

VERSION = "#.#"

The version attribute belongs to the <WC> tag. It is used to identify the version of the web collection specification the <WC> tag encloses. This specification defines version 0.1. A version attribute must be assigned to every web collection either directly in each <WC> tag or in a <COLLECTION> tag. A user agent may not process a web collection with a version it does not support.

Type = ""

The type attribute belongs to the <WC> tag. It is used to identify the type of web collection that <WC> tag encloses. A type attribute must be assigned to every web collection either directly in each <WC> tag or in a <COLLECTION> tag. A user agent may not process a web collection with a type it does not understand.
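A user agent's decision whether to process a web collection might be sketched as follows. The function `should_process`, its parameters, and the type name "sitemap" are hypothetical; the attribute dictionary is assumed to hold the effective values after any <COLLECTION> defaults have been applied.

```python
import re

# VERSION is a number in "#.#" form: digits, a dot, digits.
VERSION_PATTERN = re.compile(r"[0-9]+\.[0-9]+")


def should_process(attrs, supported_versions, known_types):
    """Return True only if both version and type are present and understood."""
    version = attrs.get("version")
    wc_type = attrs.get("type")
    if version is None or wc_type is None:
        return False  # both attributes must be assigned to every web collection
    if not VERSION_PATTERN.fullmatch(version):
        return False  # not in "#.#" form
    return version in supported_versions and wc_type in known_types


accepted = should_process({"version": "0.1", "type": "sitemap"}, {"0.1"}, {"sitemap"})
skipped = should_process({"version": "0.2", "type": "sitemap"}, {"0.1"}, {"sitemap"})
```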

Name = ""

The name attribute belongs to the <WC> tag. It is used to give a name to a web collection. The use of this attribute is not defined by this specification.


<WCAT> </WCAT>

The <WCAT> tag provides attribute information about the web collection. This specification only defines the REL/REV/HREF and HDATA attributes. Multiple <WCAT></WCAT> tags may be enclosed between <WC> tags and freely intermixed with <WCDATA></WCDATA> tag pairs. The only rule is that WC* open tags cannot be mixed with WC* closing tags. Note that HTML is defined such that one cannot rely upon the order of attributes within a tag; however, one can rely upon the relative ordering of tags. Thus the only way to be sure that the user agent receives <WCAT> attributes in a specific order is to put each attribute in a separate tag and then allow the user agent to read all the tags and concatenate the enclosed attributes. Attribute information can be provided in three different ways: it can be included between the <WCAT></WCAT> tags, it can be included in the tag itself, or it can be referred to using the REL = "WC_Link"/HREF attribute pair.


Because HTML does not allow for ordering among attributes within a tag, multiple REL/REV/HREF triples may not be combined in a single tag.

REL = "WC_Link" HREF = Named Anchored URI

This specification defines the REL/HREF combination where REL = "WC_Link". This link points to a series of tags including <WC></WC>, <WCAT></WCAT>, <WCDATA></WCDATA>, and <COLLECTION> tags. Full tag pairs must be included. These tags should be treated as if they had been inserted in the document just after the <WCAT> tag in which the link is included. The URI must contain sufficient information to uniquely identify which tags to include. This specification defines only one example, a URL that points to an HTML file. In this case the URL must include a reference to a named anchor in the HTML file. The tags to be included will be completely contained between the opening <A Name = ""> tag and the matching closing </A> tag.
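Resolving such a link could be sketched as below, using Python's standard `html.parser` and `urllib.parse.urldefrag`. The names `AnchorExtractor` and `extract_linked_tags` and the sample HTML are hypothetical. The sketch takes the target file's text as an argument instead of fetching it, does not handle an <A> element nested inside the named anchor, and emits closing tags in lower case because the parser normalizes them.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag


class AnchorExtractor(HTMLParser):
    """Collect the markup found between <A NAME="..."> and its closing </A>."""

    def __init__(self, anchor_name):
        super().__init__()
        self.anchor_name = anchor_name
        self.inside = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("name") == self.anchor_name:
            self.inside = True
        elif self.inside:
            self.parts.append(self.get_starttag_text())  # original tag text

    def handle_endtag(self, tag):
        if tag == "a" and self.inside:
            self.inside = False
        elif self.inside:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.inside:
            self.parts.append(data)


def extract_linked_tags(href, html_text):
    """Return the tags to be treated as inserted after the linking tag."""
    anchor_name = urldefrag(href).fragment
    extractor = AnchorExtractor(anchor_name)
    extractor.feed(html_text)
    extractor.close()
    return "".join(extractor.parts)


target_html = (
    '<HTML><BODY>'
    '<A NAME="foo"><WCAT REL="TOP" HREF="http://www.foo.org/bar.html#bob"></WCAT></A>'
    '<A NAME="other"><WCDATA>unrelated</WCDATA></A>'
    '</BODY></HTML>'
)
fragment = extract_linked_tags("../index.html#foo", target_html)
```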


<WCDATA> </WCDATA>

The <WCDATA> tag provides the data of the web collection. This specification does not provide any definitions regarding the content or format of this data. In the simplest form, data is inserted between the <WCDATA></WCDATA> tags. If web collection tags are inserted between <WCDATA></WCDATA> tags, then that information becomes at once data and web collection information. The semantics of this situation are defined by the type attribute. However, the default behavior is to not allow any WC* tags between <WCDATA></WCDATA> tags.

The <WCDATA> tag also supports two attributes, HDATA and REL=WC_LINK HREF=AnchoredURI. While it is possible to use the previous attributes in a single tag there is no guarantee that their order in the tag will be retained. This is the same problem with attribute ordering as explained in the section on the <WCAT> tag. Thus, to guarantee ordering, it will be necessary to use multiple <WCDATA></WCDATA> pairs.

HDATA = ""
The HDATA attribute is used to provide web collection data in a manner that will be hidden from HTML readers.

REL = WC_LINK HREF = Name Anchored URI
This attribute pair should be treated exactly as a REL = WC_LINK HREF = Name Anchored URI pair is treated under the <WCAT> tag.


8.3 Future add-ons to the syntax

The proposed syntax lacks at least two pieces. The first is a standard naming mechanism. Such a mechanism is essential for:

The second missing piece in the proposed syntax is the ability to re-generate data values. This is required in order to update variable values and re-send values. A solution to this problem can be found in the "multi-part-mime-type" type, which is currently supported only by Netscape. (Missing explanation about the "multi-part-mime-type".)

The multi-part mime type support is an easy way to achieve the desired dynamic nature of attribute values, but there may be alternative solutions that rely on detailed naming conventions and a more elaborate definition of operations. For example, if each attribute of each object is uniquely identified, then each new value for that attribute can be associated with the type of operation it requires. One example is a crawler that reports the number of pages visited. As pages are visited, the counter accumulates at an agent, which then transmits a value. The retrieved value must be added to the existing value, since the crawler should not be aware of other agents' counters. In this example, when a counter value is transmitted, the operation on this counter, "add", should be transmitted as well. Another example is a search engine that finds two copies of the same page at different locations. Both agents then transfer the data about the found pages. The date of the document is the later of the two. In this example the operation on the value should be "max". Similarly, "append", "union", and "intersect" should be defined for lists.
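The idea of transmitting an operation together with each value can be sketched as follows. The table `OPERATIONS`, the function `apply_update`, and the attribute names are hypothetical; dates are written as ISO strings here only so that "max" selects the later one.

```python
# Hypothetical operation table; the five names follow the examples in the text.
OPERATIONS = {
    "add": lambda old, new: old + new,
    "max": max,
    "append": lambda old, new: old + new,
    "union": lambda old, new: old + [x for x in new if x not in old],
    "intersect": lambda old, new: [x for x in old if x in new],
}


def apply_update(state, attribute, operation, value):
    """Merge a transmitted value into the receiver's state using the named operation."""
    if attribute not in state:
        state[attribute] = value  # first report: nothing to combine with
    else:
        state[attribute] = OPERATIONS[operation](state[attribute], value)
    return state


state = {}
# Two crawler agents report their own page counters.
apply_update(state, "pages_visited", "add", 40)
apply_update(state, "pages_visited", "add", 25)
# Two agents found copies of the same page with different dates.
apply_update(state, "date", "max", "1997-01-05")
apply_update(state, "date", "max", "1996-12-20")
# Page lists reported by two agents, combined as a union.
apply_update(state, "pages", "union", ["a.html", "b.html"])
apply_update(state, "pages", "union", ["b.html", "c.html"])
```

Each agent needs to know only the attribute's unique name and the operation; it never sees the other agents' values.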

Appendix A: Hierarchical Collections