[Mirrored from: http://www.w3.org/XML/9708/ADT]

1997]

ADT and marshalling for XML

Abstract

This is an attempt at defining XML (i.e., XML proper, not the meta-grammar for defining XML applications, nor the hyperlinking conventions) in a formal, mathematical way, that hopefully takes away all ambiguity as to what an XML document contains, or what the relation is between the document and its linearized, textual representation. Please report errors or suggest improvements.

Intro

An ADT (Abstract Data Type) is a mathematical formalism for defining the characteristics of a certain data type, without revealing the underlying model, let alone any concrete implementation in a computer program. An ADT defines exactly what is required of an implementation, and, by omission, also what is not required.

Marshalling is the process of linearizing a data structure into a sequence of bytes in a certain, previously agreed-upon way, so that one process can pass its internal data structure to another process, which will then be able to build an equivalent data structure. Since the marshalling algorithm presented here is based on the ADT, and not on an actual model, or even implementation, the two processes are free to use different internal memory structures. As a consequence, the marshalled data contains exactly enough information that a round-trip is possible, but no more.

XML is a very rich data type that contains Strings and Identifiers, partly ordered in a sequence, partly hierarchical, partly in an unordered database-like keyword/value system.

Fundamental functions

The functions defined here together define the ADT XML. The ADT makes use of previously defined types Identifier, URL, char, int, boolean, and nil, as well as of powersets and sequences of these. For example, the set consisting of all possible sequences of chars is denoted as char[] and the collection consisting of all possible sets of Identifiers is denoted as 2Identifier. The empty set is written as {}.

ADT XMLNode

import Identifier, URL, char, int, boolean, nil

The functions and the invariants that define them are as follows (`x' is the Cartesian product; `A -> B' means the function is from A to B; `=>' stands for `implies'; `+' is the set union operator for sets, and the normal addition operator for ints; `<>' means `not equal'; `\' is set difference):

create : Identifier -> XML
setSchema : XML x URL -> XML
getSchema : XML -> URL + nil

getSchema(create(S)) = nil

getSchema(setSchema(X, U)) = U

setAttribute : XML x Identifier x char[] -> XML
getAttribute : XML x Identifier -> char[] + nil
getAttributes : XML -> 2Identifier

getAttribute(create(S), I) = nil

getAttribute(setAttribute(X, I, S), I) = S

I1 <> I2 => getAttribute(setAttribute(X, I1, S), I2) = getAttribute(X, I2)

getAttributes(create(S)) = {}

getAttributes(setAttribute(X, I, S)) = getAttributes(X) + {I}

getAttribute(X, I) = getAttribute(setSchema(X, U), I)

getAttributes(X) = getAttributes(setSchema(X, U))

getSchema(X) = getSchema(setAttribute(X, I, S))

addNode : XML x XML + char -> XML + error
setNode : XML x int x XML + char -> XML + error
getNode : XML x int -> XML + char + nil
nrNodes : XML -> int
isNode : XML x XML -> boolean
isDescendant : XML x XML -> boolean
isFamily : XML x XML -> boolean

isNode(X1, X2) <=> E N : getNode(X1, N) = X2

isDescendant(X1, X2) <=> isNode(X1, X2) | (E X3 : isNode(X3, X2) & isDescendant(X1, X3))

isFamily(X1, X2) <=> X1 = X2 | isDescendant(X1, X2) | isDescendant(X2, X1)

isFamily(X1, X2) => addNode(X1, X2) = error

isFamily(X1, X2) => setNode(X1, N, X2) = error

N < 1 => setNode(X1, N, X2) = error

N > nrNodes(X1) => setNode(X1, N, X2) = error

N < 1 => getNode(X, N) = nil

N > nrNodes(X) => getNode(X, N) = nil

nrNodes(create(I)) = 0

¬isFamily(X1, X2) => nrNodes(addNode(X1, X2)) = nrNodes(X1) + 1

¬isFamily(X1, X2) & 0 < N <= nrNodes(X1) => getNode(setNode(X1, N, X2), N) = X2

¬isFamily(X1, X2) & N = nrNodes(X1) + 1 => getNode(addNode(X1, X2)), N) = X2

¬isFamily(X1, X2) => getNode(X1, N) = getNode(addNode(X1, X2), N)

¬isFamily(X1, X2) & 0 < M <= nrNodes(X1) & N <> M => getNode(setNode(X1, M, X2), N) = getNode(X1, N)

getNode(setSchema(X1, U), N) = getNode(X1, N)

getNode(setAttribute(X1, I, S), N) = getNode(X1, N)

¬isFamily(X1, X2) => getAttribute(addNode(X1, X2), I) = getAttribute(X1, I)

¬isFamily(X1, X2) & 0 < N <= nrNodes(X1) => getAttribute(setNode(X1, N, X2), I) = getAttribute(X1, I)

¬isFamily(X1, X2) => getAttributes(addNode(X1, X1)) = getAttributes(X1)

¬isFamily(X1, X2) & 0 < N <= nrNodes(X1) => getAttributes(setNode(X1, N, X2)) = getAttributes(X1)

¬isFamily(X1, X2) => getSchema(addNode(X1, X2)) = getSchema(X1)

¬isFamily(X1, X2) & 0 < N <= nrNodes(X1) => getSchema(setNode(X2, N, X2)) = getSchema(X1)

getName : XML -> Identifier

getName(create(I)) = I

getName(setSchema(X, U)) = getName(X)

getName(setAttribute(X, I, S)) = getName(X)

¬isFamily(X1, X2) => getName(addNode(X1, X2)) = getName(X1)

¬isFamily(X1, X2) & 0 < N <= nrNodes(X1) => getName(setNode(X1, N, X2)) = getName(X1)

deleteAttribute : XML x Identifier -> XML

deleteAttribute(create(I1), I2) = create(I1)

getSchema(deleteAttribute(X, I)) = getSchema(X)

getAttribute(deleteAttribute(X, I), I) = nil

getAttributes(deleteAttribute(X, I)) = getAttributes(X) \ {I}

getNode(deleteAttribute(X, I), N) = getNode(X, N)

getName(deleteAttribute(X, I)) = getName(X)

popNode : XML -> XML

popNode(create(I)) = create(I)

¬isFamily(X1, X2) => popNode(addNode(X1, X2)) = X1

¬isFamily(X1, X2) & 0 < N < nrNodes(X1) => popNode(setNode(X1, N, X2)) = setNode(popNode(X1), N, X2)

¬isFamily(X1, X2) & N = nrNodes(X1) => popNode(setNode(X1, N, X2)) = popNode(X1)

nrNodes(X) > 0 => nrNodes(popNode(X)) = nrNodes(X) - 1

N < nrNodes(X) => getNode(popNode(X), N) = getNode(X, N)

getSchema(popNode(X)) = getSchema(X)

getAttribute(popNode(X), I) = getAttribute(X)

getAttributes(popNode(X)) = getAttributes(X)

getName(popNode(X)) = getName(X)

Marshalling function

The linearization function makes use of a linearization function for Identifiers. We assume that function doesn't produce any characters from the set {'<', '>', '"', "'", '=', '!', '?', <space>, <tab>, <CR>, <LF>}.

`writeURL' (URL x char -> char[]) is a function that writes a URL as a sequence of characters, making sure that the character given as its second argument doesn't appear in the output, but is replaced, nondeterministically, by one of its escape codes (see below)

`+' is the concatenation operator for character sequences (+ : char[] x char[] -> char[])

import writeIdentifier

whitespace : int -> char[]

whitespace(n) is a nondeterministic function that produces an arbitrary sequence of an arbitrary mix of whitespace characters (<space>, <tab>, <CR>, <LF>), of length at least n.

writeXML : XML -> char[]

This also a nondeterministic function, partly because it uses whitespace(), partly because it nondeterministically chooses one of two mappings in the case nrNodes(X) = 0.

For any value of nrNodes(X), including 0: writeXML(X) = writeSchema(getSchema(X)) + '<' + writeIdentifier(getName(X)) + whitespace(1) + writeAttributes(getAttributes(X) + '>' + writeNodes(X) + '</' + writeIdentifier(getName) + whitespace(0) + '>'

For nrNodes(X) = 0: writeXML(X) = writeSchema(getSchema(X)) + '<' + writeIdentifier(getName(X)) + writeAttributes(X) + whitespace(0) + '/>'

writeSchema : URL -> char[]

Again nondeterministic selection of two mappings:

writeSchema(U) = '<!namespace' + whitespace() + '"' + writeURL(U, '"') + '"' + whitespace(0) + '>'

writeSchema(U) = '<!namespace' + whitespace() + "'" + writeURL(U, "'") + "'" + whitespace(0) + '>'

writeAttributes : XML -> char[]

Nondeterministic, because it prints the attributes in the set in unspecified order:

writeAttributes(X) = +I in getAttributes(X) writeAttribute(X, I)

writeAttribute : XML x Identifier -> char[]

writeAttribute(X, I) = writeIdentifier(I) + whitespace(0) + '=' + whitespace(0) + writeString(getAttribute(X, I)) + whitespace(0)

writeString : char[] -> char[]

Select one:

writeString(c1c2...cn) = '"' + writeChar(c1, '"') + writeChar(c2, '"') + ... + writeChar(cn, '"')

writeString(c1c2...cn) = "'" + writeChar(c1, "'") + writeChar(c2, "'") + ... + writeChar(cn, "'")

writeChar : char x char -> char[]

Select one of those that apply:

writeChar(c, c) = '&#' + decimal(c) + ';'

writeChar(c, c) = '&#x' + hex(c) + ';'

writeChar('"', d) = '&quot;'

writeChar("'", d) = '&squot;'

writeChar('<', d) = '&lt;'

writeChar('>', d) = '&gt;'

writeChar('&', d) = '&amp;'

¬(c in {'<', '>', '&', d}) & writeChar(c, d) = c

Character encoding

The marshalling function above is incomplete: it stops at the level of characters, when it should have specified bytes. The final step is to choose an encoding for characters. The chosen encoding has to be passed to the receiver of the byte sequence in some way, but this sepcification doesn't say how (`out of band').


Bert Bos <bert@w3.org>
9 Aug 1997