The XML Meta-Architecture
Henry S. Thompson
HCRC Language Technology Group
University of Edinburgh
World Wide Web Consortium
Presented at XML DevCon, London, 2001-02-21
© 2001 Henry S. Thompson
XML has grown
XML the language
Namespaces
A great success
As long as you keep your expectations suitably low
XSLT/XPath/DOM
_____________________
XLink/XPointer
XML Schema
Canonical XML/XML Signatures
_____________________
XML Query/XML Protocols
What’s missing?
In the interests of time, XML 1.0 did not define its own data model
So XPath had to define it
And XLink had to define it
And the DOM had to define it
Finally, later than we’d have liked, we’re about to get
The XML Information Set
Or Infoset
(now in Last Call)
What’s the Infoset?
The XML 1.0 plus Namespaces abstract data model
What’s an ‘abstract data model’?
The thing that a sequence of start tags and attributes and character data represents
A formalization of our intuition of what it means to “be the same document”
The thing that’s common to all the uninterestingly different ways of representing it
Single or double quotes
Whitespace inside tags
General entity and character references
Alternate forms of empty content
Specified vs. defaulted attribute values
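A minimal sketch of that equivalence, using Python’s standard xml.etree library: two character streams that differ only in these uninteresting ways parse to the same tree.

import xml.etree.ElementTree as ET

doc_a = '<doc id="1"><empty></empty></doc>'    # double quotes, explicit end tag
doc_b = "<doc  id='1' ><empty/></doc>"         # single quotes, extra whitespace, empty-element tag

def same(e1, e2):
    # Compare the parsed trees property by property, ignoring serialization details
    return (e1.tag == e2.tag
            and e1.attrib == e2.attrib
            and (e1.text or '') == (e2.text or '')
            and len(e1) == len(e2)
            and all(same(c1, c2) for c1, c2 in zip(e1, e2)))

print(same(ET.fromstring(doc_a), ET.fromstring(doc_b)))   # True: one and the same Infoset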
What does it mean to be ‘abstract’?
The Infoset is a description of the information in a document
It’s a vocabulary for expressing requirements on XML applications
It’s a bit like numbers
As opposed to numerals
If you’re a type theorist
It’s just the definition of the XML Document type
What the Infoset isn’t
It’s not the DOM
Much higher level
It’s not about implementation or interfacing at all
But you can think of it as a kind of fuzzy data structure if that helps
It’s not an SGML property set/grove
But it’s close
Infoset details
Defines a modest number of information items
Element, attribute, namespace declaration, comment, processing instruction, document ...
Each one is composed of properties
Which in turn may have information items as values
Both element and attribute information items have [local name] and [namespace URI] properties
Element information items have [children] and [attributes]
Attribute information items have a [normalized value]
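To make the properties concrete, a small sketch with Python’s xml.etree (the example namespace URI is made up): the parser exposes the [namespace URI], [local name], [attributes] and [children] properties of an element information item.

import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<p:order xmlns:p="http://example.org/po" id="42"><p:item/></p:order>')

# ElementTree stores namespace URI and local name together in Clark notation: {uri}local
uri, local = root.tag[1:].split('}')
print(uri)                              # http://example.org/po  -> [namespace URI]
print(local)                            # order                  -> [local name]
print(root.attrib)                      # {'id': '42'}           -> [attributes]
print([child.tag for child in root])    # the [children] property: one item, p:item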
For more details, see my colleague Richard Tobin’s talk on Thursday
He’s the editor of the Infoset spec.
The Infoset Revolution
We’ve sort of understood that XML is special because of its universality
Schemas and stylesheets and queries and … are all notated in XML
But now we can understand this in a deeper way
The Infoset is the common currency of all the XML specs and languages
XML applications can best be understood as Infoset pipelines
Angle brackets and equal signs are just an Infoset’s way of perpetuating itself
The Infoset Pipeline begins
An XML parser builds an Infoset from a character stream
A streaming parser gives only a limited view of it
A validating parser builds a richer Infoset than a non-validating one
Defaulted values
Whitespace normalisation
Ignorable whitespace
If a document isn’t well-formed, or is invalid, or isn’t Namespace-conformant
It doesn’t have an Infoset!
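In Python terms (a sketch with the standard xml.etree parser): a character stream that is not even well-formed yields no tree at all, only a parse error.

import xml.etree.ElementTree as ET

try:
    ET.fromstring('<order><item></order>')    # end-tag mismatch: not well-formed
except ET.ParseError as err:
    print('no Infoset here:', err)            # there is no tree to return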
The XML Schema comes next
Validity and well-formedness are XML 1.0 concepts
They are defined over character sequences
Namespace-conformance is a Namespace concept
It’s defined over character sequences too
Schema-validity is the XML Schema concept
It is defined over Infosets
The Schema and the Infoset
So crucially, schemas are about Infosets, not character sequences
You could schema-validate a DOM tree you built by hand!
Using a schema which exists only as data structures, ditto
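A sketch of that claim, assuming the third-party lxml library: both the schema and the instance exist only as in-memory trees, and validation never sees a character sequence.

from lxml import etree

schema_tree = etree.XML(b'''
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="quantity" type="xs:integer"/>
</xs:schema>''')
schema = etree.XMLSchema(schema_tree)

instance = etree.Element('quantity')    # a tree built by hand, never serialized
instance.text = '12'

print(schema.validate(instance))        # True: schema-validity is defined over the tree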
The Infoset grows
Crucially, schemas are about much more than validation
They tell you much more than ‘yes’ or ‘no’
They assign types to every element and attribute information item they validate
This is done by adding properties to the Infoset
To produce what’s called the post-schema-validation Infoset (or PSVI)
So schema-aware processing is a mapping from Infosets to Infosets
The Infoset is transformed
XSLT 1.0 defined its own data model
And distinguished between source and result models
XSLT 2.0 will unify the two
And make use of the Infoset abstraction to describe them
So XSLT will properly be understood as mapping from one Infoset to another
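Another hedged sketch, again assuming lxml: the transformation consumes one tree and produces another; angle brackets appear only if we choose to serialize the result.

from lxml import etree

stylesheet = etree.XML(b'''
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="name">
    <greeting>Hello, <xsl:value-of select="."/></greeting>
  </xsl:template>
</xsl:stylesheet>''')

transform = etree.XSLT(stylesheet)
source = etree.XML(b'<name>XML DevCon</name>')
result = transform(source)              # one Infoset in, another Infoset out
print(result.getroot().tag)             # greeting: the result is a tree, not a string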
The Infoset is composed
XLink resources (the things pointed to by XPointers) can now be understood as items in Infosets
The XInclude proposal in particular fits into my story
It provides for the merger of (parts of) one Infoset into another
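A sketch of such a merger with the standard library’s XInclude support; the loader below is a stand-in for actually fetching the included document.

import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

def loader(href, parse, encoding=None):
    # A real loader would fetch href; here it returns a canned fragment
    return ET.fromstring('<chapter>Included content</chapter>')

root = ET.fromstring(
    '<book xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<xi:include href="chapter1.xml"/></book>')

ElementInclude.include(root, loader=loader)    # merge one Infoset into another
print(ET.tostring(root))                       # the xi:include item has been replaced by the chapter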
The Infoset is accessed
XML Query of course provides for more sophisticated access to the Infoset
It also allows structuring of the results into new Infoset items
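XML Query itself is still being designed, so as a rough stand-in this sketch uses ElementTree’s limited XPath support: select items from one Infoset and structure them into new ones.

import xml.etree.ElementTree as ET

orders = ET.fromstring(
    '<orders>'
    '<order status="open"><id>1</id></order>'
    '<order status="closed"><id>2</id></order>'
    '<order status="open"><id>3</id></order>'
    '</orders>')

# Select element information items by a query ...
open_orders = orders.findall(".//order[@status='open']")

# ... and structure the results into new Infoset items
summary = ET.Element('open-orders', count=str(len(open_orders)))
summary.extend(open_orders)
print(ET.tostring(summary))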
The Infoset is transmitted
And finally XML Protocol can best be understood as parcelling up information items and shipping them out to be reconstructed elsewhere
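A hedged sketch of the protocol idea (the stockQuote item is made up): parcel an information item up as characters, ship it, and reconstruct an equivalent item at the other end.

import xml.etree.ElementTree as ET

item = ET.Element('stockQuote', symbol='W3C')
item.text = '42.0'

wire = ET.tostring(item)           # angle brackets: the Infoset perpetuating itself
received = ET.fromstring(wire)     # the receiver rebuilds an equivalent information item

print(wire)                                          # b'<stockQuote symbol="W3C">42.0</stockQuote>'
print(received.tag, received.attrib, received.text)  # stockQuote {'symbol': 'W3C'} 42.0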
A big step forward
This is so much better than the alternative
Either
Pretending to talk about character sequences all the time
Or
Requiring each member of the XML standards family to define its own data model
Schemas at the heart
I would say that, wouldn’t I :-)
Seriously, schema processing can be integrated into this story in a way DTDs could not
You may want to schema-process both before and after XInclude
Or between every step in a sequence of XSLT transformations
We actually are missing a piece of the XML story
How do we describe Infoset pipelines?
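Today such a pipeline lives in ad hoc code; here is a hand-rolled sketch (assuming lxml) of one plausible ordering, to show the kind of thing a declarative description would have to capture.

from lxml import etree

def pipeline(char_stream, schema, transform):
    doc = etree.ElementTree(etree.fromstring(char_stream))   # characters -> Infoset
    doc.xinclude()                       # compose: merge included Infosets
    schema.assertValid(doc)              # schema-process the composed tree
    return transform(doc)                # transform: one Infoset to another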
Types and the Infoset
The most important contribution to the PSVI
Every element and attribute information item is labelled with its type
Integer, date, boolean, …
Address, employee, purchaseOrder
XPath 2.0 and XML Query will be type-aware
Types will play a key role in the next generation of XML applications
XML is ASCII for the 21st century
ASCII (ISO 646) solved a fundamental interchange problem for flat text documents
What bits encode what characters
(For a pretty parochial definition of 'character')
Unicode/ISO 10646 extends that solution to the whole world
XML thought it was doing the same for simple tree-structured documents
The emphasis in the XML design was on simplifying SGML to move it to the Web
XML didn't touch SGML's architectural vision
A flexible linearisation/transfer syntax for tree-structured prose documents with internal links
The alternative take on XML?
It's a markup language used for transferring data
It is concerned with data models
to convert between application-appropriate and transfer-appropriate forms
It is not concerned with human beings
It's produced and consumed by programs
Application data
Structured markup
<POORDERHDR>
  <DATETIME qualifier="DOCUMENT">
    <YEAR>1996</YEAR>
    <MONTH>06</MONTH>
    <DAY>30</DAY>
    <HOUR>23</HOUR>
    <MINUTE>59</MINUTE>
    <SECOND>59</SECOND>
    <SUBSECOND>0000</SUBSECOND>
    <TIMEZONE>+0100</TIMEZONE>
  </DATETIME>
  <OPERAMT qualifier="EXTENDED" type="T">
    <VALUE>670000</VALUE>
    <NUMOFDEC>2</NUMOFDEC>
    <SIGN>+</SIGN>
    <CURRENCY>USD</CURRENCY>
    . . .
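A hedged Python sketch of what "application data" means for the DATETIME fragment above: hand-written conversion of exactly the kind a later slide calls writing lots of script (the timezone handling is simplified to suit this one fragment).

from datetime import datetime, timedelta, timezone
import xml.etree.ElementTree as ET

dt = ET.fromstring(
    '<DATETIME qualifier="DOCUMENT">'
    '<YEAR>1996</YEAR><MONTH>06</MONTH><DAY>30</DAY>'
    '<HOUR>23</HOUR><MINUTE>59</MINUTE><SECOND>59</SECOND>'
    '<SUBSECOND>0000</SUBSECOND><TIMEZONE>+0100</TIMEZONE>'
    '</DATETIME>')

def field(name):
    return dt.find(name).text

tz = timezone(timedelta(hours=int(field('TIMEZONE')) // 100))   # '+0100' -> +1 hour
when = datetime(int(field('YEAR')), int(field('MONTH')), int(field('DAY')),
                int(field('HOUR')), int(field('MINUTE')), int(field('SECOND')),
                tzinfo=tz)
print(when.isoformat())   # 1996-06-30T23:59:59+01:00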
What just happened!?
The whole transfer syntax story just went meta, that's what happened!
XML has been a runaway success, on a much greater scale than its designers anticipated
Not for the reason they had hoped
Because separation of form from content is right
But for a reason they barely thought about
Data must travel the web
Tree-structured documents (Infosets) are a useable transfer syntax for just about anything
So data-oriented web users think of XML as a transfer mechanism for their data
The new challenge
So how do we get back and forth between application data and the Infoset?
Old answer
Write lots of script
New answer
Exploit schemas and types
A type may be either
simple, for constraining string values
complex, for constraining elements which contain other elements
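A hand-rolled sketch of the new answer: let declared types, not ad hoc script, drive the mapping. The PurchaseOrder class and the simple-type table below are hypothetical stand-ins for what a schema would declare.

from dataclasses import dataclass
from datetime import date
import xml.etree.ElementTree as ET

@dataclass
class PurchaseOrder:            # the application-appropriate form
    orderDate: date
    quantity: int
    comment: str

SIMPLE_TYPES = {                # what the schema declares for each child element
    'orderDate': date.fromisoformat,
    'quantity': int,
    'comment': str,
}

def unmarshal(elem):
    # Convert the transfer-appropriate form (an element) into application data
    fields = {child.tag: SIMPLE_TYPES[child.tag](child.text) for child in elem}
    return PurchaseOrder(**fields)

po = unmarshal(ET.fromstring(
    '<purchaseOrder><orderDate>2001-02-21</orderDate>'
    '<quantity>12</quantity><comment>rush</comment></purchaseOrder>'))
print(po)    # PurchaseOrder(orderDate=datetime.date(2001, 2, 21), quantity=12, comment='rush')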
Mapping between layers
We can think of this in two ways
In terms of abstract data modelling languages
Entity-Relationship
UML
RDF
In concrete implementation terms
Tables and rows
Class instances and instance variables
The first is more portable
The second more immediately useful
Mapping between layers 2
Regardless of what approach we take, we need
A vocabulary of data model components
An attachment of that vocabulary to types
Sample vocabularies
entity, relationship, collection
table, row, column
instance, variable, list, dictionary
Where should attachment be specified?
In the schema (convenient)
Outside it (modular)
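A purely illustrative sketch of the "outside the schema" option, in plain Python data: a concrete vocabulary, and an attachment of that vocabulary to hypothetical schema type names, held apart from the schema itself.

CONCRETE_VOCABULARY = {'table', 'row', 'column'}

ATTACHMENT = {
    # schema type      data-model component
    'purchaseOrder':   'row',      # a complex type maps to a row ...
    'orderDate':       'column',   # ... its simple-typed children to columns
    'quantity':        'column',
    'comment':         'column',
}

assert set(ATTACHMENT.values()) <= CONCRETE_VOCABULARY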
Overall Conclusion
Think about things in terms of Infosets and Infoset pipelines
Modular
Powerful
Scalable
Use XML Schema and its type system to facilitate mapping
Unmarshalling is easy
Marshalling takes a little longer