Canonical XTM - a proposal
Date: 20 Feb 2001 10:25:01 +0100
From: Lars Marius Garshol <larsga@garshol.priv.no>
Reply-To: xtm-wg@yahoogroups.com
To: xtm-wg@yahoogroups.com
Subject: [xtm-wg] Canonical XTM - a proposal

I've now written up a proposal for a Canonical XTM specification,
which is appended here. It is submitted for the consideration of
topicmaps.org, in the hope that it may be useful. It has already been
implemented and is now used internally by Ontopia for testing
purposes.

--Lars M.


CANONICAL XTM
A canonical serialization format for XML topic maps
Version 0.1

Lars Marius Garshol <larsga@ontopia.net>, with contributions by
Geir Ove Grønmo <grove@ontopia.net>.

(Please note: this text is just a contribution to the XTM process
and has _no_ official standing.)

$Id: cxtm-proposal.txt,v 1.1 2001/02/20 09:24:41 larsga Exp $


PRELIMINARIES
===============

This specification describes a serialization format for XML topic
maps which has the property that all logically equivalent topic maps
have the exact same byte-by-byte representation in this format. This
can be used to test the conformance of XTM processors.

The specification describes the serialization of a topic map into an
output document, but does not concern itself with where that topic
map came from.

It is NOT a goal to ensure that the canonical topic map can be
successfully read into an XTM processor, but merely to confirm that
all processing defined by the XTM 1.0 specification has been
performed correctly.

Before serialization, the topic map must be processed into a
consistent topic map, as defined by XTM 1.0.

When applying canonicalization to XTM documents, string
normalization, such as Unicode canonical decomposition, must not be
performed.

The output document must be a canonical XML document, as specified in
<URL: http://www.w3.org/TR/xml-c14n >. In addition, a line feed
(U+000A) must be inserted after every end tag and likewise after
every start tag of elements that have element content or are empty.
(This means baseNameString, variantName, resourceData, topicRef,
instanceOf, resourceRef, subjectIndicatorRef.)

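As a rough illustration of two of the points above, here is a minimal
Python sketch of the character escaping that Canonical XML prescribes
for text and attribute values, together with a check that composed
and decomposed Unicode strings stay distinct when no normalization is
applied. The function names are made up for this example and are not
part of the proposal.

```python
import unicodedata


def escape_text(value: str) -> str:
    """Escape character data as Canonical XML does for text nodes."""
    return (value.replace("&", "&amp;")
                 .replace("<", "&lt;")
                 .replace(">", "&gt;")
                 .replace("\r", "&#xD;"))


def escape_attribute(value: str) -> str:
    """Escape an attribute value as Canonical XML does."""
    return (value.replace("&", "&amp;")
                 .replace("<", "&lt;")
                 .replace('"', "&quot;")
                 .replace("\t", "&#x9;")
                 .replace("\n", "&#xA;")
                 .replace("\r", "&#xD;"))


# No Unicode normalization is applied, so a precomposed and a
# decomposed spelling of the same name yield different bytes.
composed = "Gr\u00f8nmo caf\u00e9"
decomposed = unicodedata.normalize("NFD", composed)
different_bytes = composed.encode("utf-8") != decomposed.encode("utf-8")
```

Because the two spellings differ at the byte level, a processor that
silently normalized names would produce output that does not match
the expected canonical document.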
FIXME: URI normalization
FIXME: sorting of topics that have no characteristics
FIXME: class-instance topic relationships with scope


SERIALIZATION
===============

The document element must be a <topicMap> element with these
attribute value assignments:

  xmlns    http://www.topicmaps.org/cxtm/1.0/

The topic map is serialized by first writing out all topics, and then
writing out all associations. Since only one topic map is output,
there is no mergemap information to serialize.

<topic>
---------

Topics are sorted by their sort keys (see the Ordering principles
section) and then serialized in that order. All <topic> elements must
have an id attribute, set to the value 'idN', where N is the number
of the topic in sort order, starting with 1.

Topics are serialized by first writing out all class-instance
relationships as <instanceOf> elements, then the <subjectIdentity>
element, then all <baseName>s, then all <occurrence>s. The
<instanceOf>, <baseName> and <occurrence> elements are ordered
according to the rules in the 'Ordering principles' section.

<instanceOf>
--------------

A class-instance relationship is serialized as an <instanceOf>
element, with the 'href' attribute set to the ID of the <topic>
element representing the class topic, with the character '#'
prepended.

Note that the <instanceOf> element is an empty element, and so,
according to the Canonical XML specification, must be serialized with
both a start and an end tag, with nothing between the tags.

<subjectIdentity>
-------------------

If the topic has no addressable subject, nor any known subject
indicators, this element is not output at all.

If the topic has an addressable subject, that is output first using a
<resourceRef> element. For each subject indicator the topic has, a
<subjectIndicatorRef> element is output. The elements must be ordered
according to the ordering principles.

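To illustrate the ID assignment and the <instanceOf> rule described
above, a minimal Python sketch follows. It assumes that topics are
represented by hashable values (plain strings here, purely
hypothetical) and that they have already been sorted by their sort
keys. The names assign_ids and instance_of are invented for the
example, and the line feeds required by the Preliminaries section are
left out for brevity.

```python
def assign_ids(sorted_topics):
    """Map each topic, taken in sort order, to 'id1', 'id2', ..."""
    return {topic: "id%d" % n
            for n, topic in enumerate(sorted_topics, start=1)}


def instance_of(class_topic, ids):
    """Serialize a class-instance relationship.

    The href is '#' plus the assigned ID of the class topic. The
    element is empty, so it is written with both a start tag and an
    end tag and nothing in between.
    """
    return '<instanceOf href="#%s"></instanceOf>' % ids[class_topic]


# Hypothetical topics, already in sort order.
ids = assign_ids(["composer", "puccini", "tosca"])
element = instance_of("composer", ids)
```

With these inputs, element holds the string
<instanceOf href="#id1"></instanceOf>.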
<resourceRef>
---------------

The <resourceRef> element is an empty element, holding the reference
to the resource in its 'href' attribute. FIXME: uri norm!

<subjectIndicatorRef>
-----------------------

The <subjectIndicatorRef> element is an empty element, holding the
reference to the subject indicator in its 'href' attribute.
FIXME: uri norm!

<baseName>
------------

Each topic name is serialized using a <baseName> element. First the
scope is written out using the <scope> element, then the base name
value in the <baseNameString> element and finally the variant names
using <variant> elements. The variant names must be ordered according
to the ordering principles.

<scope>
---------

If the scoped topic map construct has an empty scope, this element is
not output at all. If it has a non-empty scope, references to the
topics making up that scope are written out using <topicRef> elements
in the order defined by the ordering principles.

Note that in all cases the scope that is output must consist of the
scope resulting from inheriting the scope of any parent elements that
have scope. The scope of variant names therefore consists of the
union of their own scope and the scopes of all their ancestors.

<baseNameString>
------------------

Contains the base name value.

<variant>
-----------

Each variant name is serialized using a <variant> element. First its
parameters are written out using the <scope> element, then the
variant name value in the <variantName> element and finally any child
variant names using <variant> elements. The variant names must be
ordered according to the ordering principles.

<variantName>
---------------

Contains the variant name value.

<occurrence>
--------------

Each occurrence is written out using an <occurrence> element.
If the occurrence is an instance of a class, an <instanceOf> element
is output, followed by a <scope> element representing the scope of
the occurrence (provided it is non-empty), and finally by a
<resourceRef> element if the occurrence is an external resource or a
<resourceData> element if the occurrence is an internal resource.

FIXME: this is probably too vague

<resourceData>
----------------

Contains the resource inline.

<association>
---------------

Associations are serialized using <association> elements, which first
contain an <instanceOf> element (if the association is an instance of
a class), a <scope> element (unless the association is in the
unconstrained scope), and finally a <member> element for each
participating topic in the association. The <member> elements must be
ordered according to the ordering principles.


ORDERING PRINCIPLES
=====================

This section establishes how to determine the sort key value of each
topic map element that is written out. This is used to ensure that
all elements are serialized in a specific order. That order is
obtained by sorting the elements according to their sort keys in
lexicographical order, based on UCS code point values.

Topics
--------

If the topic has an addressable subject, the URI of that resource is
the sort key.

Failing that, if the topic has a subject indicator, the URI of the
first subject indicator (as ordered according to these rules) is the
sort key.

Failing that, if the topic has base names, the sort key of the first
base name (as ordered according to these rules) is the sort key.

Failing that, if the topic has occurrences, the URI of the first
occurrence (as ordered according to these rules) is the sort key.

<instanceOf>, <topicRef>, <member>
------------------------------------

The sort key is the ID of the topic element referred to.

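The fallback chain in the Topics rules above could be sketched
roughly as follows in Python. The Topic class and its field names are
assumptions made for this example only, and base name sort keys are
taken as precomputed strings. The proposal leaves open how to sort
topics that have no characteristics (see the FIXME list), so the
sketch simply raises an error in that case. Python compares strings
by code point, which matches the ordering required here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Topic:
    # Hypothetical in-memory model, not defined by the proposal.
    subject_address: Optional[str] = None
    subject_indicators: List[str] = field(default_factory=list)
    base_name_keys: List[str] = field(default_factory=list)
    occurrence_uris: List[str] = field(default_factory=list)


def topic_sort_key(topic: Topic) -> str:
    """Return the sort key of a topic, trying each rule in turn."""
    if topic.subject_address is not None:
        return topic.subject_address
    if topic.subject_indicators:
        # "First" in sort order is the smallest by code point.
        return min(topic.subject_indicators)
    if topic.base_name_keys:
        return min(topic.base_name_keys)
    if topic.occurrence_uris:
        return min(topic.occurrence_uris)
    # Not covered by the proposal (open FIXME).
    raise ValueError("topic has no characteristics to sort by")


topics = [
    Topic(base_name_keys=["Tosca|"]),
    Topic(subject_indicators=["http://example.org/b",
                              "http://example.org/a"]),
    Topic(subject_address="http://example.org/page"),
]
ordered = sorted(topics, key=topic_sort_key)
```

Note that a literal reading of the code point ordering also applies
to assigned IDs used as sort keys, so 'id10' sorts before 'id2'.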
Topic names, base names
-------------------------

The sort key is constructed by appending the following into a string:
the base name value, followed by a '|' character, followed by the
assigned IDs of all topics in the scope of the topic name separated
by spaces and ordered according to these principles.

Occurrences, subject indicators
---------------------------------

The sort key is the URI of the resource. FIXME: resourceData!

Variant names
---------------

The sort key is constructed by appending the following into a string:
the variant name value, followed by a '|' character, followed by the
assigned IDs of all topics in the scope of the variant name separated
by spaces and ordered according to these principles.

Associations
--------------

The sort key is the sort keys of all its members in sort order,
separated by '|' characters. If the association is an instance of a
class, a '$' character is appended, followed by the assigned ID of
the topic representing that class.

Association members
---------------------

The sort key is the ID of the topic element referred to by its
<topicRef> child if the member has no specified role. If it does, a
space and the assigned ID of the topic defining the role are
appended.

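The string-building rules above for names, association members and
associations might look like this in Python. This is only a sketch
under stated assumptions: the function names are invented, topics are
referred to by their already assigned IDs, and IDs are ordered as
plain strings by code point.

```python
def scoped_name_key(value, scope_ids):
    """Key for a base name or variant name: value, '|', scope IDs."""
    return value + "|" + " ".join(sorted(scope_ids))


def member_key(player_id, role_id=None):
    """Key for an association member, with the role ID if present."""
    if role_id is None:
        return player_id
    return player_id + " " + role_id


def association_key(member_keys, class_id=None):
    """Key for an association: member keys in sort order joined by
    '|', then '$' and the class topic ID if the association is typed.
    """
    key = "|".join(sorted(member_keys))
    if class_id is not None:
        key += "$" + class_id
    return key


name_key = scoped_name_key("Puccini", ["id7", "id3"])
assoc_key = association_key(
    [member_key("id4", "id6"), member_key("id2", "id5")],
    class_id="id9",
)
```

Here name_key is 'Puccini|id3 id7' and assoc_key is
'id2 id5|id4 id6$id9'. A name in the unconstrained scope gets a key
that ends in the '|' character.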

DTD
=====

<!ELEMENT topicMap (topic*, association*)>
<!ATTLIST topicMap xmlns CDATA #FIXED "http://www.topicmaps.org/cxtm/1.0/">

<!ELEMENT topic (instanceOf*, subjectIdentity?, baseName*, occurrence*)>
<!ATTLIST topic id ID #REQUIRED>

<!ELEMENT instanceOf EMPTY>
<!ATTLIST instanceOf href CDATA #REQUIRED>

<!ELEMENT subjectIdentity (resourceRef?, subjectIndicatorRef*)>

<!ELEMENT resourceRef EMPTY>
<!ATTLIST resourceRef href CDATA #REQUIRED>

<!ELEMENT subjectIndicatorRef EMPTY>
<!ATTLIST subjectIndicatorRef href CDATA #REQUIRED>

<!ELEMENT baseName (scope?, baseNameString, variant*)>

<!ELEMENT scope (topicRef+)>

<!ELEMENT topicRef EMPTY>
<!ATTLIST topicRef href CDATA #REQUIRED>

<!ELEMENT baseNameString (#PCDATA)>

<!ELEMENT variant (scope, variantName, variant*)>

<!ELEMENT variantName (#PCDATA)>

<!ELEMENT occurrence (instanceOf?, scope?, (resourceRef | resourceData))>

<!ELEMENT association (instanceOf?, scope?, member+)>

<!ELEMENT member (instanceOf?, topicRef)>

Prepared by Robin Cover for The XML Cover Pages archive. See: "(XML) Topic Maps."