[Mirrored from: http://www.w3.org/XML/9708/ADT]
1997]This is an attempt at defining XML (i.e., XML proper, not the meta-grammar for defining XML applications, nor the hyperlinking conventions) in a formal, mathematical way, that hopefully takes away all ambiguity as to what an XML document contains, or what the relation is between the document and its linearized, textual representation. Please report errors or suggest improvements.
An ADT (Abstract Data Type) is a mathematical formalism for defining the characteristics of a certain data type, without revealing the underlying model, let alone any concrete implementation in a computer program. An ADT defines exactly what is required of an implementation, and, by omission, also what is not required.
Marshalling is the process of linearizing a data structure into a sequence of bytes in a certain, previously agreed-upon way, so that one process can pass its internal data structure to another process, which will then be able to build an equivalent data structure. Since the marshalling algorithm presented here is based on the ADT, and not on an actual model, or even implementation, the two processes are free to use different internal memory structures. As a consequence, the marshalled data contains exactly enough information that a round-trip is possible, but no more.
XML is a very rich data type that contains Strings and Identifiers, partly ordered in a sequence, partly hierarchical, partly in an unordered database-like keyword/value system.
The functions defined here together define the ADT XML. The ADT makes use of previously defined types Identifier, URL, char, int, boolean, and nil, as well as of powersets and sequences of these. For example, the set consisting of all possible sequences of chars is denoted as char[] and the collection consisting of all possible sets of Identifiers is denoted as 2Identifier. The empty set is written as {}.
ADT XMLNode
import Identifier, URL, char, int, boolean, nil
The functions and the invariants that define them are as follows (`x' is the Cartesian product; `A -> B' means the function is from A to B; `=>' stands for `implies'; `+' is the set union operator for sets, and the normal addition operator for ints; `<>' means `not equal'; `\' is set difference):
getSchema(create(S)) = nil
getSchema(setSchema(X, U)) = U
getAttribute(create(S), I) = nil
getAttribute(setAttribute(X, I, S), I) = S
I1 <> I2 => getAttribute(setAttribute(X, I1, S), I2) = getAttribute(X, I2)
getAttributes(create(S)) = {}
getAttributes(setAttribute(X, I, S)) = getAttributes(X) + {I}
getAttribute(X, I) = getAttribute(setSchema(X, U), I)
getAttributes(X) = getAttributes(setSchema(X, U))
getSchema(X) = getSchema(setAttribute(X, I, S))
isNode(X1, X2) <=> E N : getNode(X1, N) = X2
isDescendant(X1, X2) <=> isNode(X1, X2) | (E X3 : isNode(X3, X2) & isDescendant(X1, X3))
isFamily(X1, X2) <=> X1 = X2 | isDescendant(X1, X2) | isDescendant(X2, X1)
isFamily(X1, X2) => addNode(X1, X2) = error
isFamily(X1, X2) => setNode(X1, N, X2) = error
N < 1 => setNode(X1, N, X2) = error
N > nrNodes(X1) => setNode(X1, N, X2) = error
N < 1 => getNode(X, N) = nil
N > nrNodes(X) => getNode(X, N) = nil
nrNodes(create(I)) = 0
¬isFamily(X1, X2) => nrNodes(addNode(X1, X2)) = nrNodes(X1) + 1
¬isFamily(X1, X2) & 0 < N <= nrNodes(X1) => getNode(setNode(X1, N, X2), N) = X2
¬isFamily(X1, X2) & N = nrNodes(X1) + 1 => getNode(addNode(X1, X2)), N) = X2
¬isFamily(X1, X2) => getNode(X1, N) = getNode(addNode(X1, X2), N)
¬isFamily(X1, X2) & 0 < M <= nrNodes(X1) & N <> M => getNode(setNode(X1, M, X2), N) = getNode(X1, N)
getNode(setSchema(X1, U), N) = getNode(X1, N)
getNode(setAttribute(X1, I, S), N) = getNode(X1, N)
¬isFamily(X1, X2) => getAttribute(addNode(X1, X2), I) = getAttribute(X1, I)
¬isFamily(X1, X2) & 0 < N <= nrNodes(X1) => getAttribute(setNode(X1, N, X2), I) = getAttribute(X1, I)
¬isFamily(X1, X2) => getAttributes(addNode(X1, X1)) = getAttributes(X1)
¬isFamily(X1, X2) & 0 < N <= nrNodes(X1) => getAttributes(setNode(X1, N, X2)) = getAttributes(X1)
¬isFamily(X1, X2) => getSchema(addNode(X1, X2)) = getSchema(X1)
¬isFamily(X1, X2) & 0 < N <= nrNodes(X1) => getSchema(setNode(X2, N, X2)) = getSchema(X1)
getName(create(I)) = I
getName(setSchema(X, U)) = getName(X)
getName(setAttribute(X, I, S)) = getName(X)
¬isFamily(X1, X2) => getName(addNode(X1, X2)) = getName(X1)
¬isFamily(X1, X2) & 0 < N <= nrNodes(X1) => getName(setNode(X1, N, X2)) = getName(X1)
deleteAttribute(create(I1), I2) = create(I1)
getSchema(deleteAttribute(X, I)) = getSchema(X)
getAttribute(deleteAttribute(X, I), I) = nil
getAttributes(deleteAttribute(X, I)) = getAttributes(X) \ {I}
getNode(deleteAttribute(X, I), N) = getNode(X, N)
getName(deleteAttribute(X, I)) = getName(X)
popNode(create(I)) = create(I)
¬isFamily(X1, X2) => popNode(addNode(X1, X2)) = X1
¬isFamily(X1, X2) & 0 < N < nrNodes(X1) => popNode(setNode(X1, N, X2)) = setNode(popNode(X1), N, X2)
¬isFamily(X1, X2) & N = nrNodes(X1) => popNode(setNode(X1, N, X2)) = popNode(X1)
nrNodes(X) > 0 => nrNodes(popNode(X)) = nrNodes(X) - 1
N < nrNodes(X) => getNode(popNode(X), N) = getNode(X, N)
getSchema(popNode(X)) = getSchema(X)
getAttribute(popNode(X), I) = getAttribute(X)
getAttributes(popNode(X)) = getAttributes(X)
getName(popNode(X)) = getName(X)
The linearization function makes use of a linearization function for Identifiers. We assume that function doesn't produce any characters from the set {'<', '>', '"', "'", '=', '!', '?', <space>, <tab>, <CR>, <LF>}.
`writeURL' (URL x char -> char[]) is a function that writes a URL as a sequence of characters, making sure that the character given as its second argument doesn't appear in the output, but is replaced, nondeterministically, by one of its escape codes (see below)
`+' is the concatenation operator for character sequences (+ : char[] x char[] -> char[])
import writeIdentifier
whitespace(n) is a nondeterministic function that produces an arbitrary sequence of an arbitrary mix of whitespace characters (<space>, <tab>, <CR>, <LF>), of length at least n.
This also a nondeterministic function, partly because it uses whitespace(), partly because it nondeterministically chooses one of two mappings in the case nrNodes(X) = 0.
For any value of nrNodes(X), including 0: writeXML(X) = writeSchema(getSchema(X)) + '<' + writeIdentifier(getName(X)) + whitespace(1) + writeAttributes(getAttributes(X) + '>' + writeNodes(X) + '</' + writeIdentifier(getName) + whitespace(0) + '>'
For nrNodes(X) = 0: writeXML(X) = writeSchema(getSchema(X)) + '<' + writeIdentifier(getName(X)) + writeAttributes(X) + whitespace(0) + '/>'
Again nondeterministic selection of two mappings:
writeSchema(U) = '<!namespace' + whitespace() + '"' + writeURL(U, '"') + '"' + whitespace(0) + '>'
writeSchema(U) = '<!namespace' + whitespace() + "'" + writeURL(U, "'") + "'" + whitespace(0) + '>'
Nondeterministic, because it prints the attributes in the set in unspecified order:
writeAttributes(X) = +I in getAttributes(X) writeAttribute(X, I)
writeAttribute(X, I) = writeIdentifier(I) + whitespace(0) + '=' + whitespace(0) + writeString(getAttribute(X, I)) + whitespace(0)
Select one:
writeString(c1c2...cn) = '"' + writeChar(c1, '"') + writeChar(c2, '"') + ... + writeChar(cn, '"')
writeString(c1c2...cn) = "'" + writeChar(c1, "'") + writeChar(c2, "'") + ... + writeChar(cn, "'")
Select one of those that apply:
writeChar(c, c) = '&#' + decimal(c) + ';'
writeChar(c, c) = '&#x' + hex(c) + ';'
writeChar('"', d) = '"'
writeChar("'", d) = '&squot;'
writeChar('<', d) = '<'
writeChar('>', d) = '>'
writeChar('&', d) = '&'
¬(c in {'<', '>', '&', d}) & writeChar(c, d) = c
The marshalling function above is incomplete: it stops at the level of characters, when it should have specified bytes. The final step is to choose an encoding for characters. The chosen encoding has to be passed to the receiver of the byte sequence in some way, but this sepcification doesn't say how (`out of band').