Extreme 2007: Markup of Overlapping Structures

International Workshop on Markup of Overlapping Structures

Venue

Extreme Markup Languages 2007 Conference

When

Monday, 6 August 2007, 9:30am - 6pm (one day before the Extreme 2007 Conference)

Where

Hotel Europa, Extreme 2007 Conference Hotel, Montréal, Canada

XML and SGML have revolutionized the representation of structured information, but not all information structures map easily into systems of hierarchically nested elements. Markup of overlapping structures is a perennially hot topic, reinvented and reimagined as often as it is solved.

This full-day workshop will bring together the proponents of some of the major proposals for markup, representation, extraction, display, and validation of semantic overlap to summarize the systems they are developing and discuss topics of common interest. A morning of formal presentations will be followed by an afternoon of free-ranging discussion. Confirmed speakers and topics include:

Alexander Dekhtyar, Center of Excellence in Traceability, on Concurrent XML and the ARCHway Project
Steven DeRose, National Center for Biotechnology Information (National Institutes of Health), on Trojan Markup and other XML milestone-tagging techniques
Patrick Durusau, Snowfall Software, on Topic Mapping overlap
Wendell Piez, Mulberry Technologies, on LMNL (Layered Markup and Annotation Language)
C. M. Sperberg-McQueen, World Wide Web Consortium, on TexMecs and Goddag structures
Andreas Witt, University of Tübingen, on Multiple Annotations and XConcur

Discussions of the similarities, differences, unresolved gaps and problems, and potential symbiosis between and among these approaches will involve both the speakers and the audience.

Overlapping structures are ubiquitous, appearing in applications of textual markup as varied as aircraft maintenance manuals and ancient scriptural and liturgical works. The "overlap issue" raises its ugly head whenever text encoding looks beyond the snapshot view of a particular hierarchy to represent and process multiple concurrent aspects of a text, including features that reflect the text's evolution across multiple versions and variants whether typographic or presentational, structural, annotational or referential, taxonomic or topical.

Overlap is a problem in texts as diverse as technical documents and product manuals (versioning), legal codes (effectivity), literary works (prosodic versus dramatic structure, rhetorical structures, annotation), sacred texts (chapter plus verse reference versus sentence structure and commentary), and language corpora (multiple layers of linguistic annotation).

While many approaches to the representation and processing of overlap and multiple concurrent hierarchies in digitally encoded text have been proposed, to date no single effort has demonstrated a general and widely replicable solution. This is a hard problem. Is a single solution possible? Is one necessary?

Extreme 2007 Program

Monday, August 6, 2007

9:30 - 6:00

International Workshop on Markup of Overlapping Structures

Alexander Dekhtyar, California Polytechnic State University
Steven DeRose, National Center for Biotechnology Information, NIH
Patrick Durusau, Snowfall Software
Wendell Piez, Mulberry Technologies
C. M. Sperberg-McQueen, World Wide Web Consortium &
Andreas Witt, University of Tübingen

This full-day interactive workshop on markup of overlapping structures (led by proponents of the major proposals for recording, extracting, and validating semantic overlap) will be of interest to people working in effectivity and change control, literary analysis, markup of scriptures, and historical and linguistic analysis. After an introduction to the problem of overlap, several methods for working with overlap will be discussed, including: Goddag structures, LMNL, Trojan Horses and other XML milestone-tagging techniques, Topic Mapping Overlap, and Multiple Annotations and XConcur. Discussions of the similarities, differences, unresolved gaps and problems, and potential symbiosis between and among these approaches will involve both the speakers and the audience.

Tuesday, August 7, 2007

9:15 - 9:45

Conference Opening

Steven J. Newcomb, Coolheads
C. M. Sperberg-McQueen, World Wide Web Consortium &
B. Tommie Usdin, Mulberry Technologies

Welcome to Extreme Markup Languages 2007, including "Guide to the Conference"

9:45 - 10:30

Riding the wave, riding for a fall, or just along for the ride?

B. Tommie Usdin, Mulberry Technologies

XML is taking over the world! XML is ubiquitous! XML wins! The work is over. Or is the work just starting? Are we at Extreme Markup Languages riding the wave of XML success? Or riding for a fall as we learn that our comfortable world is not the whole world? Have we created the success of XML, or are we just along for the ride?

11:00 - 11:45

Easy RDF for real-life system modeling

Thomas B. Passin, Noblis

Can RDF be used to capture system modeling information, even while the concepts being captured are fluid and evolving, and even when the modelers are not RDF specialists? It can. And it can even be easy. One of the keys is to use a subset of RDF that is sufficient for only a few of complex RDF's variations. The author describes the application of such a subset to enterprise architecture (EA) system modeling, along with simple notational and renditional techniques (indented lists, XSLT) developed and applied in the course of the development of an actual system. The techniques made the evolution of the system's design easy to follow. XSLT transformations allow the RDF data to be presented as data collection templates and class hierarchies, and to convert it as needed to XMI (UML) and to relational schemas.

11:45 - 12:30

Writing an XSLT optimizer in XSLT

Michael Kay, Saxonica

In principle, XSLT is ideally suited to the task of writing an XSLT or XQuery optimizer. After all, optimizers consist of a set of rules for rewriting a tree representation of the query or stylesheet, and XSLT is specifically designed as a language for rule-based tree rewriting. The paper illustrates how the abstract syntax tree representing a query or stylesheet can be expressed as an XML data structure making it amenable to XSLT processing, and shows how a selection of rewrites can be programmed in XSLT. The key question determining whether the approach is viable in practice is performance. Some simple measurements suffice to demonstrate that there is a significant performance penalty, but not an insurmountable one: further work is needed to see whether it can be reduced to an acceptable level.
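
The idea of an optimizer as a set of tree-rewriting rules can be illustrated outside XSLT. The following is a minimal sketch in Python, not the paper's XSLT implementation; the `Node` type, the two rules, and the toy expression language are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Node:
    """A node of a toy abstract syntax tree (hypothetical, for illustration)."""
    kind: str
    value: Optional[object] = None
    children: Tuple["Node", ...] = ()


# A rule returns a replacement node, or None when it does not apply.
Rule = Callable[[Node], Optional[Node]]


def fold_constant_addition(node: Node) -> Optional[Node]:
    """Rewrite num + num into a single num."""
    if (node.kind == "add" and len(node.children) == 2
            and all(c.kind == "num" for c in node.children)):
        return Node("num", node.children[0].value + node.children[1].value)
    return None


def drop_add_zero(node: Node) -> Optional[Node]:
    """Rewrite x + 0 and 0 + x into x."""
    if node.kind == "add" and len(node.children) == 2:
        left, right = node.children
        if right.kind == "num" and right.value == 0:
            return left
        if left.kind == "num" and left.value == 0:
            return right
    return None


RULES: Tuple[Rule, ...] = (fold_constant_addition, drop_add_zero)


def rewrite(node: Node, rules: Tuple[Rule, ...] = RULES) -> Node:
    """Rewrite children first, then apply rules at this node until none matches."""
    node = Node(node.kind, node.value,
                tuple(rewrite(child, rules) for child in node.children))
    changed = True
    while changed:
        changed = False
        for rule in rules:
            result = rule(node)
            if result is not None:
                node, changed = result, True
                break
    return node


# (1 + 2) + (x + 0) is rewritten to 3 + x.
tree = Node("add", children=(
    Node("add", children=(Node("num", 1), Node("num", 2))),
    Node("add", children=(Node("var", "x"), Node("num", 0))),
))
optimized = rewrite(tree)
```

Both rules here return either a leaf or an already-rewritten child, so a single bottom-up pass is enough; a rule that builds new subtrees would need its result rewritten again.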

2:00 - 2:45

From Word to XML to mobile devices

David Lee, Epocrates

The clinical content delivered by Epocrates to the mobile devices used by medical practitioners is sent to the mobile devices in an efficient form of binary XML. But much of the content was created in Microsoft Word; Epocrates does not control the creators' authoring infrastructures. The task of 'matching the impedances' of the Word sources to the delivered XML has been approached multiple times in multiple ways, including ways that use Word macros (VB scripts), RTF, HTML, Word 2003 XML, and a variety of tools. An XML-oriented pipeline process has evolved that makes heavy use of the Saxon XQuery implementation; it has proven to have great advantages and may serve as a design pattern for others facing similar problems.

2:00 - 2:45

On the lossless transformation of single-file, multi-layer annotations into multi-rooted trees

Andreas Witt, Oliver Schonefeld, Georg Rehm, Jonathan Khoo, & Kilian Evang, University of Tübingen

Language corpora routinely incorporate multiple layers of annotation: phonetic and orthographic transcription, morphological analysis, translation equivalents, sentence intonation, and many more. The linguistic structures described frequently overlap each other. Some corpora use a hierarchical data model and record overlapping structures of annotation in a single complex XML tree; others use annotation graphs, in which points on an abstract timeline are used to synchronize different layers. In order to archive such heterogeneous material sustainably, the authors have developed a document architecture which allows both styles of corpus to be translated without loss of information into sets of multi-rooted trees sharing the same primary data, which can be searched, manipulated, and validated with a common tool set. A simple Web interface allows the user to specify how the annotation in a corpus is to be factored into layers and transformed into single-layer trees. New work on concurrent validation of multi-rooted trees allows inter-tree constraints to be formulated and checked.
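
One way to picture annotation layers that share the same primary data is as sets of character-offset spans over a single text. The sketch below is an assumption-laden illustration in Python, not the authors' document architecture; the `Span` type, the layer names, and the sample text are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Span:
    """An annotation over the shared primary data (hypothetical model)."""
    layer: str
    label: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset


def overlaps(a: Span, b: Span) -> bool:
    """True when two spans intersect but neither contains the other."""
    intersect = a.start < b.end and b.start < a.end
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return intersect and not (a_in_b or b_in_a)


def is_hierarchical(spans: List[Span]) -> bool:
    """True when the spans nest cleanly, so one tree can hold them all."""
    return not any(overlaps(a, b)
                   for i, a in enumerate(spans) for b in spans[i + 1:])


def split_into_layers(spans: List[Span]) -> Dict[str, List[Span]]:
    """Factor a mixed annotation set into one span list per layer."""
    layers: Dict[str, List[Span]] = {}
    for span in spans:
        layers.setdefault(span.layer, []).append(span)
    return layers


PRIMARY = "one two three four"
ANNOTATIONS = [
    Span("syntax", "phrase", 0, 7),    # "one two"
    Span("syntax", "phrase", 8, 18),   # "three four"
    Span("layout", "line", 0, 13),     # "one two three"
    Span("layout", "line", 14, 18),    # "four"
]
LAYERS = split_into_layers(ANNOTATIONS)
```

Taken together the four spans do not nest, because the second phrase straddles the line break; each layer on its own does nest and could be serialized as a separate tree over `PRIMARY`.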

2:45 - 3:30

MYCAREVENT: OWL and the automotive repair information supply chain

Martin Bryan & Jay Cousins, CSW Informatics

As part of its Information Society Technology program, the European Commission has funded the MYCAREVENT (Mobility and Collaborative Work in European Vehicle Emergency Networks) project from October 2004 to September 2007. Involving research institutes, auto manufacturers, device manufacturers, roadside assistance organizations, IT suppliers, and others, it seeks to generate new business models based on innovative services for automobile drivers. MYCAREVENT's ontology is expressed in OWL (W3C Web Ontology Language). It is derived from a Generic and Integrated Information Reference Model (GIIRM) and terminology for repair information, symptoms, and faults. The ontology is described, along with problems encountered in its development, and the ways in which OWL answers problems encountered when integrating information from the diverse participants in MYCAREVENT.

2:45 - 3:30

Representation of overlapping structures

C. M. Sperberg-McQueen, World Wide Web Consortium

Markup of overlapping structures is, depending on your point of view, either a perpetual hot topic or a trivial edge case in the study of markup. Starting from the belief that overlapping structures are not just common but important, not only in the analysis of literary works but also in the management of changing content, we explore ways to represent overlapping structures in tractable ways. The requirements for representing overlapping structures differ from those for storing simple tree-structured information. An exploration of these requirements is followed by a detailed description of one way to represent Goddag structures in relational form.

4:00 - 4:45

Active Tags: Mastering XML with XML

Philippe Poulard, Inria

Active Tags is a language-independent, general-purpose native XML programming system that provides a generic runtime container for multiple executable markup languages. Active Tags is not a markup language but a set of specifications that describe the system and its libraries (called "modules") and the catalog-like structure that holds it all together. Active Tag programs ("active sheets") are XPath-centric, like XSLT, and (also like XSLT) may contain both instructions and XML literals. Unlike XSLT, a single sheet may contain several instruction sets, each bound to a namespace, and the sheet may be procedural, declarative, or both. After explaining enough of the architecture to enable us all to follow the logic, the author illustrates active sheets that filter SAX streams using XPath patterns, utilize XML macro-tags, and browse non-XML data.

4:45 - 5:30

Advanced approaches to XML document validation

Petr Nalevka & Jirka Kosek, University of Economics, Prague

Relaxed is an open-source automated tool that provides support for validation of XML documents using predefined or custom compound languages. For example, it can validate documents that use combinations such as XHTML 1.0 + MathML 2.0 + SVG 1.1, testing for conformance with constraints expressed in Relax NG, Schematron, and other schema languages, under the control of NVDL (Namespace-based Validation Dispatching Language). In addition to the maintenance of these schemas, the Relaxed project also includes an extensible validation engine written in Java. Examples that demonstrate the usefulness and practicality of combining multiple kinds of validation and constraint expressions in a convenient automated framework are discussed.

Wednesday, August 8, 2007

9:00 - 9:45

On-the-fly validation of XML markup languages using off-the-shelf tools

Mikko Saesmaa & Pekka Kilpeläinen, University of Kuopio

Validation of XML documents is often treated as a major operation, performed only at major transitions in the document's life cycle, after it has been created or when it enters some new stage of processing. Users editing XML documents, on the other hand, would appreciate instantaneous feedback on the correctness of the document each time anything changes. Such on-the-fly validation can be implemented in an XML editor using the current version of Java and freely available XML tools. The experience of the authors is that on-the-fly validation can be implemented easily without introducing observable delays even on relatively large documents. To demonstrate this, the authors have built an experimental XML editor which validates documents on-the-fly after every modification. The editor supports editing of DTDs and validation according to DTDs and according to schemas written in W3C XML Schema and Relax NG.

9:45 - 10:30

Streaming validation of schemata: the Lazy Typing discipline

Paolo Marinelli, Fabio Vitali, & Stefano Zacchiroli, University of Bologna

Assertions, identity constraints, and conditional type assignments are (planned) features of XML Schema which rely on XPath evaluation. The XPath subset exploitable in those features is limited, for several reasons, including (apparently) to avoid buffering in evaluation of an expression. We divide XPath into subsets with varying streamability characteristics. We also identify the larger XPath subset which is compatible with the typing discipline we believe underlies some of the choices currently present in the XML Schema specification. Such a discipline requires that the type of an element be decided when its start tag is encountered and its validity when its end tag is encountered. An alternative "lazy typing" discipline is proposed in which both type assignment and validity assessment are fired as soon as they are available. Our approach is more flexible, giving schema authors control over the trade-off between using larger XPath subsets (and thus increasing buffering requirements) and expeditiousness.

11:00 - 11:45

A more canonical form of content MathML to facilitate math search

Moody E. Altamimi & Abdou S. Youssef, The George Washington University

MathML, XML, XPath, and XQuery have together opened up math search as a new and challenging area of research. One key challenge is handling notational equivalences. Content MathML enables consistent and less ambiguous representation of math expressions, but different yet equivalent encodings for the same expression can still be used. Content MathML's provisions for author-defined functions and symbols and variable type declarations also complicate matters. Notational differences that should be nonsignificant to searching operations can be made transparent by transforming instances of Content MathML to a normalized (or "canonical") form prior to search. Such a canonical form is described, beginning with a discussion of Content MathML and a survey of related work. Techniques for normalizing various kinds of expressions are proposed, along with some extensions to Content MathML itself.
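
As a rough illustration of why normalization helps search, the sketch below sorts the operands of commutative operators so that equivalent encodings compare equal. It is only one small, assumed normalization step written in Python with the standard library; it is not the canonical form proposed in the paper, and it ignores the MathML namespace for brevity.

```python
import xml.etree.ElementTree as ET

# Operators whose operands may be reordered freely (an illustrative subset).
COMMUTATIVE = {"plus", "times"}


def canonicalize(elem: ET.Element) -> ET.Element:
    """Return a copy of elem with operands of commutative operators sorted."""
    children = [canonicalize(child) for child in elem]
    if elem.tag == "apply" and children and children[0].tag in COMMUTATIVE:
        operator, operands = children[0], children[1:]
        # Sort operands by their serialized form to get a stable order.
        operands.sort(key=lambda e: ET.tostring(e, encoding="unicode"))
        children = [operator] + operands
    copy = ET.Element(elem.tag, dict(elem.attrib))
    # Drop insignificant whitespace around token content.
    copy.text = (elem.text or "").strip() or None
    copy.extend(children)
    return copy


def canonical_string(xml_text: str) -> str:
    """Serialize the normalized form of a Content MathML fragment."""
    return ET.tostring(canonicalize(ET.fromstring(xml_text)), encoding="unicode")


b_plus_a = "<apply><plus/><ci>b</ci><ci>a</ci></apply>"
a_plus_b = "<apply><plus/><ci> a </ci><ci>b</ci></apply>"
b_minus_a = "<apply><minus/><ci>b</ci><ci>a</ci></apply>"
a_minus_b = "<apply><minus/><ci>a</ci><ci>b</ci></apply>"
```

With this step, `b + a` and `a + b` serialize identically, while `b - a` and `a - b` stay distinct because subtraction is not in the commutative set.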

11:00 - 11:30

Localization of schema languages

Felix Sasaki, World Wide Web Consortium

Internationalization is the process of making a product ready for global use. Localization is the adaptation of a product to a specific locale (e.g., country, region, or market). Localization of XML schemas (XSD, DTD, Relax NG) can include translation of element and attribute names, modification of data types, and content or locale-specific modifications such as currency and dates. Combining the TEI ODD (One Document Does it all) approach for renaming and adaptation of documentation, the Common Locale Data Repository (CLDR) for the modification of data types, and the new Internationalization Tag Set (W3C 2007), the authors have produced an implementation that will take as input a schema without any localization and some external localization parameters (such as the locale, the schema language, any localization annotations, and the CLDR data) and produce a localized schema for XSD and Relax NG. For a DTD, the implementation produces a Schematron document for validation of the modified data types that can be used with a separate renaming stylesheet to generate a localized DTD.

11:45 - 12:30

Exploring intertextual semantics: A reflection on attributes and optionality

Yves Marcoux & Élias Rizkallah, Université de Montréal

Intertextual semantics (IS) is an approach to structured-document modeling, based on natural language and intended both to communicate the semantic intentions of modelers more effectively to human authors and to encourage modelers to develop models amenable to plain understanding. It uses peritexts — texts in natural language that assist understanding of instances of a given element type by imaginarily being prepended and appended to them. IS was first introduced at Extreme in 2006 and has since been tested on a few existing models (such as RSS and Atom). This more recent work has led to explicit formulation of two hypotheses on which IS is based, to an extension of IS to allow smooth handling of attributes, and to an approach within IS for dealing with the difficulties posed by optional elements and attributes.

11:45 - 12:30

Applying structured content transformation techniques to software source code

Roy Amodeo, Stilo International

In structured content processing, benefits of modeling information content rather than presentation include the ability to automate the publication of information in many formats, tailored for different audiences. Software programs are a form of content, usually authored by humans and "published" by compilers to the computer that runs these programs. However, programs are not written solely for use by machines. If they were, programming languages would have no need for comments or programming style guidelines. The application developers and maintainers themselves are also an audience. Modeling software programs as XML instances is not a new idea. This paper takes a fresh look at the challenge of producing XML markup from programming languages by recasting it as a content processing problem using tools developed in the same way as any other content-processing application. The XML instances we generate can be used to craft transformation and analysis tools useful for software engineering by leveraging the marked up structure of the program rather than the native syntax.

2:45 - 3:30

Characterizing XQuery implementations: Categories and key features

Liam Quin, World Wide Web Consortium

XQuery 1.0 was published as a W3C Recommendation in January 2007, and there are fifty or more XQuery implementations. The XQuery Public Web page at W3C lists them but gives little or no guidance about choosing among them. The author proposes a simple ontology (taxonomy) to characterize XQuery implementations based on emergent patterns of the features appearing in implementations and suggests some ways to choose among those implementations. The result is a clearer view of how XQuery is being used and also provides insights that will help in designing system architectures that incorporate XQuery engines. Although specific products are not endorsed in this paper, actual examples are given. With XML in use in places as diverse as automobile engines and encyclopedias, the most important part of investigating an XML tool's suitability to task is often the tool's intended usage environment. It is not unreasonable to suppose that most XQuery implementations are useful for something. Let's see!

4:00 - 4:45

Modeling questions: Experiences from the dbGaP project

Kimberly A. Tryka, Jeff Beck, & Matt Mailman, National Institutes of Health (NIH)

dbGaP (database of Genotype and Phenotype) is a project of the National Center for Biotechnology Information (National Library of Medicine [NLM], U.S. National Institutes of Health) to archive and distribute the results of studies that have investigated the interaction of genotype (genetic expression) and phenotype (physical manifestation). Unusually, dbGaP collects not only result data from these studies but also documents such as questionnaires and research protocols used in the studies. Areas in a document (sections, paragraphs, questions) may be associated with a particular phenotype in the database, and users need to move easily between the data and references to the data in the documents. The documents have been marked up using an extended version of the NLM DTD. One outstanding challenge is modeling variant question types to create web-based versions of the questionnaires that accurately reflect the intent of the original questionnaire, even if it will not be possible to reproduce the layout.

4:45 - 5:30

Doing better "on the wire" with HL7v3 without losing rigour: Is it possible to be simple without being stupid?

Ann Wrightson, CSW Group

Model-driven XML such as HL7v3 messages tends to be verbose and to take serious effort both to understand and to process. Over the last couple of years the problem of opacity has been tackled fairly successfully, but verbosity and complexity have proved much more resistant. The HL7 SOA technical committee started developing a radical alternative approach, with much simpler models, but as soon as that work progressed from simple information such as attributes of individuals to more complex information such as extracts from electronic health records, exactly the same problems arose again. While moving from model-driven XML to old-fashioned hand-designed XML is tempting, it is not practical. Ever more complex models need to be shared to provide open standards in the context of progressive and ubiquitous adoption of automated information processing, so we must harness the full expressive power of XML to these needs. We need to make all the capability of XML for concise, meaningful, and semantically rigorous expression available to domain-specific data-oriented applications.

Thursday, August 9, 2007

9:00 - 9:45

Building a C++ XSLT processor for large documents and high performance

Kevin Jones, Jianhui Li, & Lan Yi, Intel

Some current XML users require an XSLT processor capable of handling documents up to 2 gigabytes. To produce a high-speed processor for such large documents, the authors employed a data representation that supports minimal inter-record linking to provide a small, in-memory representation. XML documents are represented as a sequence of records; these records can be viewed as binary encodings of events produced by an XML parser based on the XPath data model. The format is designed to support documents in excess of the 32-bit boundary; its current theoretical limit is 32 gigabytes. To offset the slower navigation speed for a records-based data format, the processor uses a new Path Map algorithm for simultaneous XPath processing. The authors carried out a series of experiments comparing their newly constructed XSLT processor to an object-model-based XSLT processor (the Intel® XSLT Accelerator Software library).

9:00 - 9:45

Principles, patterns, and procedures of XML schema design; Reporting from the XBlog project

Anne Brüggemann-Klein, Thomas Schöpf, & Karlheinz Toni, Technische Universität München

Our weblog system, XBlog, is being built using run-of-the-mill XML-based publishing technology. We propose principles, patterns, and procedures that we discovered when translating from the conceptual model of XBlog articles into a schema language for XML. The project is part of a larger endeavor in which we explore to what extent novel publishing applications such as weblog systems can be composed from appropriately configured XML software with a minimum of programming. Our goal is to discover principles, patterns, and procedures that reduce complexity and ensure sustainability when developing and maintaining Web applications. Thus, the project serves as a case study for research in document engineering. In addition, XBlog is being used to teach XML technology to undergraduate and graduate students.

9:45 - 10:30

Using XML compression to increase efficiency of P2P messaging in JXTA-based environments

Brian Demmings & Tomasz Müldner, Acadia University &
Gregory Leighton, University of Calgary &
Andrew Young, Acadia University (formerly)

JXTA (Juxtapose) is a P2P (peer-to-peer) framework that allows collaboration among diverse (and diversely connected) peers; the performance of such P2P systems depends on efficient messaging. The authors show how the use of XML-aware compression can increase the efficiency of P2P messaging in JXTA-based environments which rely on XML. An API called the Compression Pipe Adapter (CPA) manages both XML-aware and standard compression techniques, focusing at the Application Layer. The API permits the same techniques to be used for other purposes, such as encryption; it can be used selectively, and permits the compression techniques to be chosen in light of other requirements, such as the need for searching within messages without decompressing first. Tests over DSL and Fast Ethernet connections, using three different compression techniques, have confirmed that message compression can offer different performance advantages under different conditions. Furthermore, they show that the performance and resource-consumption advantages of message compression can be significant.

9:45 - 10:30

Enhancing AIML Bots Using Semantic Web Technologies

Eric Freese, LexisNexis

The Artificial Intelligence Markup Language (AIML) is an XML language that enables pattern-based, stimulus-response knowledge content (questions/answers) to be served, received, and processed on the Web. The units of knowledge, AIML categories, consist of a pattern (what a user may say or type) and a template (the response to the user input). An AIML bot includes an AIML interpreter and a responder module that handles the human-to-bot or bot-to-bot interface between an AIML interpreter and its AIML objects. When RDF triples (subject/object/property) are converted into AIML categories, they can then be used in an AIML-based bot. The responder's handling of a conversation based on RDF triples will be demonstrated.
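
The conversion of a triple into a category can be pictured with a small sketch. The Python below is a hypothetical mapping, not the author's converter: the question wording, the response wording, and the function name `triple_to_category` are assumptions; only the `category`, `pattern`, and `template` element names come from AIML.

```python
import xml.etree.ElementTree as ET


def triple_to_category(subject: str, predicate: str, obj: str) -> ET.Element:
    """Build one AIML category that answers a question about a single triple.

    The pattern is upper-cased, following the usual AIML convention for
    normalized input; the phrasing of both pattern and template is assumed.
    """
    category = ET.Element("category")
    pattern = ET.SubElement(category, "pattern")
    pattern.text = f"WHAT IS THE {predicate} OF {subject}".upper()
    template = ET.SubElement(category, "template")
    template.text = f"The {predicate} of {subject} is {obj}."
    return category


category_xml = ET.tostring(
    triple_to_category("France", "capital", "Paris"), encoding="unicode"
)
```

Here the triple (France, capital, Paris) yields a category whose pattern is `WHAT IS THE CAPITAL OF FRANCE` and whose template is the sentence stating the answer.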

11:00 - 11:45

Converting into pattern-based schemas: A formal approach

Antonina Dattolo, University of Napoli Federico II
Angelo Di Iorio, Silvia Duca, Antonio Angelo Feliziani, & Fabio Vitali, University of Bologna

A traditional distinction among markup languages is how descriptive or prescriptive they are. We identify six levels along the descriptive/prescriptive spectrum. Schemas at a specific level of descriptiveness that we call "Descriptive No Order" (DNO) specify a list of allowable elements, their number and requiredness, but do not impose any order upon them. We have defined a pattern-based model based on a set of named patterns, each of which is an object and its composition rule (content model); we show that any schema can be converted into a pattern-based schema without loss of information at the DNO level. We present a formal analysis of lossless conversions of arbitrary schemas as a demonstration of the correctness and completeness of our pattern model. Although all examples are given in DTD syntax, the results should apply equally to XSD, Relax NG, or other schema languages.

11:45 - 12:30

Composable templates

Mario Blažević, Stilo International

A template language is a domain-specific programming language that is, syntactically speaking, a superset of its output. (Smarty is a template language; XSLT is not.) When a template language is used as a functional programming language, each template should be a pure function that generates its output value from a restricted set of input values, using a set of predefined text filters that are also pure functions. This principle guided a rewrite of the OMDE, the microdocument-based OmniMark Document Environment originally developed in 1998. A stream of SGML-tagged topics passes through various templates and results in a collection of HTML files. The templates act as filters whose inputs and outputs are markup streams; the modularity of the templates yields a powerful composable design.
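
The notion of templates as pure, composable stream filters can be sketched in a few lines. The example below is in Python rather than OmniMark and does not reproduce the OMDE; the filter names and the `compose` helper are hypothetical, and the HTML filter does no escaping.

```python
from functools import reduce
from typing import Callable, Iterable, Iterator

# A filter is a pure function from one stream of text items to another.
Filter = Callable[[Iterable[str]], Iterator[str]]


def strip_blank(lines: Iterable[str]) -> Iterator[str]:
    """Drop empty items and trim surrounding whitespace."""
    for line in lines:
        if line.strip():
            yield line.strip()


def as_paragraphs(lines: Iterable[str]) -> Iterator[str]:
    """Wrap each item in an HTML paragraph (no escaping, for brevity)."""
    for line in lines:
        yield f"<p>{line}</p>"


def compose(*filters: Filter) -> Filter:
    """Chain filters left to right into a single filter."""
    def pipeline(stream: Iterable[str]) -> Iterator[str]:
        return reduce(lambda acc, f: f(acc), filters, stream)
    return pipeline


to_html = compose(strip_blank, as_paragraphs)
result = list(to_html(["  first ", "", "second"]))
```

Because each filter depends only on its input stream, filters can be reordered, reused, or tested in isolation, which is the composability the abstract describes.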

2:00 - 2:45

Organized mapping: Documenting a complex musical system

James David Mason, American Guild of Organists, Knoxville Chapter

Demonstration Topic Maps are generally unsatisfactory because the information being mapped is so simple that the overhead of a topic map tool seems unnecessary. Topic Maps shine when used to manage very complex information with multiple intricate interrelationships. The example used here, documentation of a pipe organ, may seem simple to the uninitiated, but in fact even a mid-sized instrument may be comparable in complexity to an airliner. Traditional means of documentation have been incomplete and left the craft of organbuilding an esoteric mystery. With even a small instrument, a thorough topic map is large enough to serve as a dynamic demonstration of the information technology; when more than one instrument is involved, the topic map enables investigations not otherwise possible even to the initiates of the craft.

2:45 - 3:30

Relational database preservation through XML modeling

José Carlos Ramalho & Miguel Ferreira, Univ. Minho &
Luís Francisco da Cunha Cardoso de Faria & Rui Castro, Portuguese National Archives

Digital archives are complex structures of human resources, policies, and information. Their social responsibility is to safeguard the intellectual heritage of humanity by holding, preserving, and making accessible to users the information in their custody. RODA (Repository of Authentic Digital Objects) is a joint venture between public administration and academic researchers that aims to build a repository that can preserve the authenticity of digital objects. The first prototype handles three kinds of digital objects: text documents, still images, and relational databases. This paper focuses on the relational databases component. Relational databases are ingested by migrating the original database to an XML representation. The Database Markup Language (DBML) was defined to meet specific requirements; it preserves database content, database structure, and database attributes. The authors will discuss the creation of DBML and report its application to some real case studies.

4:00 - 4:45

Mind the Gap: Seeking holes in the markup-related standards suite

Chris Lilley, World Wide Web Consortium
James David Mason, Consultant &
Mary McRae, OASIS

The XML 1.0 specification was admired for many things, among them its simplicity and brevity. The XML spec has since been joined by many other associated specifications and standards, not all of them simple, few of them short. There are specifications for vocabularies, constraint languages, transformation and formatting languages, data models and functions, packaging and encrypting specifications, and for a wide variety of other things one can do with, to, or about XML. Many of us think there are too many XML-related specifications. However, we continue to identify holes in the markup-related suite of standards: areas in which new specifications would be useful. In this session the audience will suggest (to representatives or members of several organizations that develop and/or promulgate XML-related specifications) areas in which new specifications would be useful. This will be an information gathering activity, not an evaluative process; all suggestions and heresies welcome.

Friday, August 10, 2007

9:00 - 9:45

Declarative specification of XML document fixup

Henry S. Thompson, University of Edinburgh

The historical and social complications of the development of the HTML family of languages defy easy analysis. In the recent discussion of the future of the family, one question has stood out: should 'the next HTML' have a schema or indeed any form of formal definition? One major constituency has vocally rejected the use of any form of schema, maintaining that the current behavior of deployed HTML browsers cannot usefully be described in any declarative notation. But a declarative approach, based on the Tag Soup work of John Cowan, proves capable of specifying the repair of ill-formed HTML and XHTML in a way that approximates the behavior of existing HTML browsers. A prototype implementation named PYXup demonstrates the capability; it operates on the PYX output produced by the Tag Soup scanner and fixes up well-formedness errors and some structural problems commonly found in HTML in the wild based on an easily understood declarative specification.

9:45 - 10:30

Sometimes a table is only a table: And sometimes a row is a column

David J. Birnbaum, University of Pittsburgh

The dominant XML table models (HTML, OASIS CALS, TEI, et al.) are presentationally oriented and do not reflect the meaning of the table. Semantically, tables are complex data structures that (at a minimum) associate a two-dimensional matrix (row and column position) with a value (cell content). Nothing inherent in most table data requires that one dimension be forever expressed in rows and the other in columns, but models prioritize rows and columns, as if one presentation were inherently more natural or correct than another. For years, some have argued that an information-based table model would overcome these limitations. But the early discussions were before XSLT, before nearly-ubiquitous presentational transformations, and before spreadsheets that "save as XML". Is it time to reopen the table-as-data debate? The author presents a table model that resembles spreadsheet models but is sensitive to the needs of XML authoring.
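
The claim that neither dimension need be tied to rows or columns can be made concrete by storing cells against a pair of keys and choosing the orientation only at rendering time. This is a minimal sketch in Python under assumed names (`render`, the sample data); it is not the table model the author presents.

```python
from typing import Dict, List, Tuple

# Each value is keyed by its position on two named dimensions.
Cells = Dict[Tuple[str, str], str]


def render(cells: Cells, first_dimension_as_rows: bool = True) -> List[List[str]]:
    """Lay the same cells out as a grid, with either dimension down the side."""
    firsts = sorted({a for a, _ in cells})
    seconds = sorted({b for _, b in cells})
    if first_dimension_as_rows:
        header = [""] + seconds
        body = [[a] + [cells.get((a, b), "") for b in seconds] for a in firsts]
    else:
        header = [""] + firsts
        body = [[b] + [cells.get((a, b), "") for a in firsts] for b in seconds]
    return [header] + body


cells: Cells = {
    ("2006", "north"): "3", ("2006", "south"): "5",
    ("2007", "north"): "4", ("2007", "south"): "6",
}
by_year = render(cells)
by_region = render(cells, first_dimension_as_rows=False)
```

The data never changes; `by_year` puts years down the side and `by_region` puts regions down the side, so the row/column choice becomes a presentation decision.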

11:45 - 12:30

Topic maps, RDF, and mushroom lasagne

C. M. Sperberg-McQueen, World Wide Web Consortium

Kitchen wisdom sheds some unexpected light on technical issues.

— There is nothing so practical as a good theory

Prepared by Robin Cover for The XML Cover Pages archive.