[This local archive copy mirrored from the canonical site: http://www.w3.org/MarkUp/future/papers/rkhare-1998-501.html; links may not have complete integrity, so use the canonical document at this URL if possible.]

Requirements for Interactive Access to HTML and XML Documents

Position Paper for the W3C Workshop on the Future of HTML

Rohit Khare, rohit@uci.edu, UC Irvine, April 30, 1998

Abstract

As we promote the use of markup languages for representing the entire range of digital data, reusing a common format requires adaptable access patterns on the wire. Different network users may emphasize early access to parts of the structure, or only require access to parts of the structure, or perhaps just the structure and not the contents. Time affects the representation of markup instances in three ways: bandwidth, latency, and throughput. We can envision optimizing access along each of these axes by compression, reordering, and tokenization, respectively. This position paper argues for investigating "wire protocols" that optimize access to parsed markup instances in the hope of forestalling fragmentation into "compiled" archival formats.

Introduction

Part of the reason the World Wide Web has succeeded so expansively is the ease with which developers can represent structured data in HTML, and now XML. Indeed, new tools have arisen solely to extract structured data back out of HTML forms and tables, such as webMethod's WIDL. To date, though, there has been an implicit expectation that Web information is presented to human users. The true promise of Web automation is for the same data stores to be useful as a human interface and a machine interface. To that end, as larger and more complex structures are pickled into markup languages, new users need to be able to access subparts of a document in a principled manner. In particular, network users will be acutely sensitive to the fundamental limits of time: the rate at which the document is delivered (bandwidth), the delay in accessing relevant information (latency), and the computational overhead of communication (throughput).

Consider a personnel database represented as markup tree: perhaps as <PERSON> elements in XML, or stylized as unordered lists of class PERSON in HTML. The complete description may have entries for NAME, PHONE, HOME-ADDRESS, and so on. A Rolodex-browser at the client might want to support interactive browsing of that data store by searching as the user types in successive characters in any of these fields. In the context of a relational database, it's easy enough to express the requirement that it be searchable on any of these keys, and that the client could request delivery of any one of these tables separately. In the case of a markup document, it isn't even as simple as "rotating" the table to sort by ZIP, say: these records could be interspersed with arbitrary other markup, like biographies or organizational relationship information or any one of the other wonderful ways in which the Web encourages composing data sources into human-relevant documents. In other words, how could the Rolodex-browser extract the same information from a personal home page, or purchase order, or ...?

This is a larger problem than navigating an unknown markup tree (whether using class attributes or well-formed XML). XPointers, for example, address that problem to a degree. The additional issue is delivering only that part of a markup tree. The idea behind the pipe connector in XML linking is for a server to only deliver that part of a document, rather than providing the whole and letting the client resolve it (the 'traditional' hash connector). Similarly, in HTML applications, there may be reason to deliver the outline of the tree structure to accelerate visual layout, and then flow in the text and other contents. Finally, interactive access moves in both directions: it would be ideal to allow incremental updates of a markup tree, too. Consider the plight of a word-processor developer considering moving from a proprietary, indexed binary format to HTML: they may lose 'fast-save' and 'partial-load' functionality.

Rather than encourage alternative purpose-built archival formats -- 'binary HTML' or 'compiled HTML' -- we hope to forestall fragmentation by discussing a 'wire protocol' which properly emphasizes performance enhancements in interactive access (where time is a fundamental limit) rather than mere compact representation (for convenience's sake).

Bandwidth

XML has been accused of a great many sins, but brevity is not among them. From the 'waste' of </SPELLED-OUT-END-TAGS> to non-normative whitespace, many a character can be considered extraneous. However, rather than fragment off into new archival formats using compact end-tags, rediscovering markup minimization, or paren-based S-expressions, we're best off approaching the Shannon limit directly rather than hacking around it.

Compression is a very reasonable counterargument to the 'leaner, meaner XML' camp. Deflate, for example, fits right into the HTTP Content-Encoding: mechanism and comes quite close to optimal compression just training itself on the source text. It's fast enough, and disk/network bandwidth is so much lower than compute bandwidth, that the total time to receive and inflate may be less than receiving the plaintext. In fact, researchers at UC Irvine have demonstrated that it's faster to load compressed syntax trees and decompress them and compile them than to load the resulting object code from disk on last year's Macs.

Server designers, though, may legitimately wonder about context-switching between thousands of different open connections' separate dictionaries. It may be more effective to store a single dictionary for HTML in on-chip cache (or publishing purpose-built dictionaries for other XML DTDs). After all, anyone could publish a new compression table by giving it an URL; performance might even increase because there's no 'training lag' to find common bytes like < and >.

Latency

Starting up a compression process, filling its buffers, and chunking out blocks all add latency to that first byte in attempting to increase effective bandwidth. On the other hand, it can fit more into that all-important first packet. The latency of the first byte may be high on the public Internet, but the second packet is just as much with TCP slow-start.

At the next layer, though, we are worried about the delays in getting at the information needed for the task at hand. It may be skipping ahead to the first <PHONE> tag; it may be just parsing out the top-level presentation tree so a browser can allocate real estate; it may be enumerating all the values of any CLASS attribute; it may be all of the italic phrases; it may be jus the definition of "cat" out of the dictionary; it may be drawing a pick list widget while waiting to populate its thousand entries.

Interactive access to existing HTML and XML document instances requires navigating the document tree to extract relevant subsets -- even when that's not the serialization order the information provider planned on. A servers might know what's the most "important" stock of the day within the table; there should be a way to deliver leaves of the parse tree out-of-order.

That is to say: wire transmissions of markup instances should be able to deliver elements and character data in (arbitrary?) order while preserving reliable reassembly -- and further, there may be control mechanisms for the reader to control that prioritization.

Throughput

A naïve first cut at such a solution might assign a serial number to every node and leaf in the parse tree, thus transmitting (serial#, type, length, data) tuples. That would be a poor solution because it increases latency at the server side as it buffers up the entire stream to relabel all the "internal" pointers with those serial numbers. Furthermore, it decreases the throughput, since it requires processors to grovel over the data stream several times as it marks all the serial numbers, sweeps the tree to relabel it, and copies over the text into new tuple-buffers for transmission.

It should be possible to offer streaming delivery of markup instances: each byte should hit the wire with a minimum of reprocessing. This shouldn't just be limited to incremental data sources that only append to the end, like ticker-tape, but also to parse trees that branch out. Consider an in-out board that's published as an ordered list; as the server finds out more from pinging and fingering those hosts, it should be able to incrementally fill in the details on any workstation in the list.

More concretely, it would also improve throughput to vend tokenized markup to reduce parsing overhead. Ambiguous or unnecessary whitespace, entities, and unbounded-length character data could be pre-parsed out using a simpler escaping routine and length-encoded strings. For XML applications in particular, it should be possible to include arbitrary length-encoded binary data (and chunks), since escaping out a marked-section delimiter ("]]>") is more troublesome than "thunking" control of the channel to another data-vending process.

Recommendation

As often as developers accept the battle cry to represent once-proprietary data in marked up form on the Web, there is a complaint "But this application has performance constraints..." The wireless market could be considered a poster child of this form of recalcitrance. Plagued by low-bandwidth, high-latency, high-error links and special-purpose, low-throughput devices, initiative after initiative begins by positing a gateway that transforms Web content to wireless-ML and wireless-TP. They are not alone, though.

If we want to follow through on the promise of marked up data as a canonical form for processing information across the Internet, we need credible solutions to the three challenge areas outlined in this position paper. We need these solutions to forestall fragmentation of the concrete syntax of HTML and XML documents.

Finally, these are not merely challenges for Internet-scale compositional systems: as CPUs get faster and the memory hierarchy steeper, we need a "wire protocol" for ML transmission not just between machines, but within machines: magnetic bits on disk are as far away from the CPU as a transcontinental network hop.