[This archive copy mirrored from the canonical site: http://www.cs.caltech.edu/~adam/papers/xml/xml-for-archiving.html; see the canonical version if possible.]

Capturing the State of Distributed Systems with XML

by Rohit Khare and Adam Rifkin

$Id: xml-for-archiving.html,v 1.46 1997/10/26 02:08:18 adam Exp $


This paper discusses the challenges of capturing the state of distributed systems across time, space, and communities, and looks to XML as an effective solution. First, when recording a data structure for future reuse, XML format storage is self-descriptive enough to extract its schema and verify its validity. Second, when transferring data structures between different machines, XML's link model in conjunction with Web transport protocols reduces the burden of marshaling entire data sets. Third, when sharing collaborative data structures between disparate communities, it is easier to compose new systems and convert data definitions to the degree that XML documents are adopted for the World Wide Web. Just as previous generations of distributed system architectures emphasized relational databases or object-request brokers, the Web generation has good reason to adopt XML as its common archiving tool, because XML's sheer generic power has value in knowledge representation across time, space, and communities.

1. Distributing Data Across Time, Space, and Communities

Twenty-four thousand miles of adventures reduced to six letters. A journey across eleven flight segments, seven countries, three carriers, and one planet --- and all that the airline flight reservation center can be moved to speak is the cryptic code "NQSS5A." A Passenger-Name Record (PNR) is the quintessential digital age artifact, unlocking yet more data stored in tickets, tariff filings, baggage claim checks, catering orders, security profiles, credit records, and myriad other components of this half-real, half-virtual distributed system called an airline.

Throughout a PNR's lifecycle, from the first call to a travel agent to the final posting of frequent-flier miles, these complex data structures face three challenges: to be distributed across time, to both past and future readers; to be distributed across space, to other machines; and to be distributed across communities, to other organizations and applications. The first challenge calls for a stable data format, since itineraries have to be updated consistently by reservationists, ticket agents, gate agents, flight crews, database engineers, accountants, and others. In the second case, there needs to be a stable grain of exchange, to share records and commit transactions between a bevy of information systems. Finally, to communicate across organizations, there have to be common definitions: agreements between airlines, hotels, rental car agencies, travel agents, and passengers about the interpretation of dates, locations, flights, prices, and so on.

In each of these situations, system designers can leverage several strategies to manage distributed data cost-effectively. File formats, for example, must be machine-readable, but can be more future-proof if they are also human-readable and use self-describing schema. When packaging related objects together to exchange with other machines, finer-grained marshaling strategies are more flexible than integrating systems through a handful of fixed report formats. Finally, industry-wide coordination has been notoriously difficult to design by committee. Instead of fixing protocols and data dictionaries, the best strategy may be to collaborate through conventional "documents" --- for example, purchase orders instead of Electronic Data Interchange (EDI) records.

All too commonly, though, the actual decisions of system designers fall short of these measures. Proprietary, underdocumented, binary file formats are not merely quick hacks; they are strategic decisions to lock in users. Concurrent systems almost immediately retreat to a unified system image --- so instead of marshaling only relevant data, the entire database needs to be shared. The result is horrifyingly black-box legacy systems that are rarely shared within a community, much less among suppliers, vendors, and other outside users.

In this paper, we argue that the eXtensible Markup Language (XML) [Bray, Paoli, and Sperberg-McQueen, 1997] and its companion eXtensible Linking Language (XLL) [Bray and DeRose, 1997] can together provide an effective solution for capturing the state of distributed systems, particularly on the World Wide Web. XML was designed to provide a subset of the Standard Generalized Markup Language (SGML) that is easy to write, interpret, and implement [Connolly, Khare, and Rifkin, 1997]. Since XML allows extensible markup while preserving rigorous validation, we advocate storing information in XML, sharing it according to XLL's link model, and weaving XML-enhanced data structures into Web documents.

2. Capturing State

The airline system is constantly in motion, but suppose it were possible to hold back the rush of transactions momentarily to take a global snapshot. The resulting tower of storage media would be almost useless to re-animate the system a few years hence. The bits can surely be preserved, but the file formats slowly fade into gibberish as the applications evolve. Worse, the applications might not evolve at all: that hasty decision to use a slightly more compact two-digit year field may come crashing down decades hence (to the tune of $600 billion in North America alone, according to an oft-quoted Gartner Group study).

The point of reducing some complex multidimensional data structure to a bitstream is ultimately to allow some future user to reconstitute that same data structure and manipulate it accurately. The key is enforcing a schema for these transformations. In this section, we will explore the tensions that lead to brittle data formats (Section 2.1), three strategies for future-proofing data formats (Section 2.2), and how XML-based data formats execute those strategies (Section 2.3).

2.1 Data Archaeology

"Data Archaeologist" smacks of postmodernism gone awry, but the business of rummaging through now-forgotten tapes of health-care records or satellite observations for archival data is already a viable industry. Even mission-critical systems can remain underdocumented as a result. The tendency toward proprietary schema and binary encodings also increases the fragility of file formats. There are several seductive reasons why designers argue extensible, self-describing formats are not cost-effective, including:

"We have been using our house style for generations!"
"That format is too bloated and inefficient to parse!"
"The tools are too immature!"
"Besides, that data isn't for outside consumption..."

Paradoxically, the quest for "high-fidelity" archiving tools that directly map data structure in memory to the bit stream leads to more brittle, less flexible formats. Object systems that helpfully default to recording all instance data force file formats to evolve as rapidly as the code. Separate versions of an application cannot exchange files without littering the parsing logic with evidence of previous generations' data structures. The archives generated by operating system or language tools are often inextensible: there is no way to gently add new fields to a record, much less to indicate if comprehension is optional or mandatory.

Type-equivalence problems in a language can spread to the archives, too, like the impedance mismatch between Java's (and its Serialization's) int type and Integer class [Java RMI, 1997]. Each system establishes its own set of canonical primitives such as character, string, integer, and float, and its own encodings, leading to yet more conversion challenges --- both on the wire level (for example, COM [Chappell, 1996]) and on the interface level (for example, CORBA IDL [Siegel, 1996]). Abstract Syntax Notation (ASN.1) [ISO 8825:1987] encoding rules, for example, specify the type, length, and value of each datum in the stream --- as well as the type, length, and value of the type and the length...

Human-readable formats have their own traps. Many Unix system databases, for example, embrace the need for extensibility, manual editability, and comments [Leffler et al., 1989]. Yet the many colon-separated flat-file databases for users, groups, email aliases, and so on are still cryptic, not automatically validatable, and not self-documenting. As the system grows, some databases need to be replaced wholesale by incompatible binary forms updated by distributed directory protocols.

In short, data storage formats are difficult to "future-proof." It takes care and effort to design extensible, editable, scalable, and correct formats and the parsers, generators, and Application Programming Interfaces (APIs) that implement them. Instead, designers face immediate concerns about:

Inertia ---
Applications and programmers each have habits for recording state, sometimes dating back to the particular layout of punch cards or drum memory. There may also be resistance to formats "Not Invented Here."
Efficiency ---
A primary rationalization is that accommodating extensibility and human-readability wastes parsing effort and storage space. However, the further into the future this data will be used, the cheaper these resources and the more valuable flexibility will become.
Tools ---
Tools can be the root of the two earlier objections. The archiving tools already support the status quo, and new tools to support a more future-proof format are immature or expensive. ASN.1 parsers, for example, are too resource-intensive for small smart cards.
Openness ---
Sometimes file formats are exempt from any analysis because that data is expected to be used only within the context of this system. Each open interface to the state of the system increases the complexity of the overall design, after all.

Underdocumentation, though, is still a greater flaw than whether the file format is technically well-suited to the task or not. Without preserving a description of the data --- much less self-descriptive data --- there can be no communication across time to future readers and writers of that data.

2.2 Binding Metadata to Meatdata

We posit three strategic requirements for future-proof data formats:

Machine-Readable ---
Ultimately, computers have to manipulate these bitstreams, so consider the space, speed, and accuracy of the parsers and generators; and to a lesser extent, the size of the bitstream.
Human-Readable ---
The flexibility of human-readable and human-editable formats requires robust error handling and simpler document structure.
Self-Descriptive ---
It is not enough merely to use a rigorous schema for the data contained within a bitstream; the actual stream should include enough information to extract out the schema, and to validate its contents.

The natural tension between these three strategies inspires a delicate balancing act: mechanical logic and human fuzziness can only be reconciled in a format that can be dynamically learned by each.

Successfully machine-readable formats are measured by the logic required to extract and manipulate them. Rigorous enforcement of syntax rules simplifies parsing logic at the expense of robust error handling. Direct projection of the data-representation in memory simplifies parsing and generation at the expense of human-readability and cross-platform support. For example, capturing numeric data in binary form is simple and potentially compact, but unreadable and dependent on the endianness of the CPU architecture. Mission-specific grammars can be more compact than adapting general-purpose encodings (e.g. ASN.1). Turing-complete formats, representing state as executable program text, inflate parser and generator size while reducing the fidelity of the data manipulation. For example, an airline ticket as PostScript requires executing a large program and even then ending up with strokes and arcs instead of cities and flights.
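The tradeoff between binary and textual numbers can be made concrete with a minimal Python sketch. It is our illustration only: the value 1997 and the function names are arbitrary assumptions, not part of any system discussed here. The same integer is written as little-endian binary, big-endian binary, and decimal text, and then the little-endian bytes are decoded by a reader that wrongly assumes the other byte order.

```python
import struct


def encode_variants(value):
    """Return one integer as little-endian binary, big-endian binary, and decimal text."""
    return {
        "little": struct.pack("<i", value),  # 4-byte signed int, least significant byte first
        "big": struct.pack(">i", value),     # same width, most significant byte first
        "text": str(value).encode("ascii"),  # human-readable decimal digits
    }


def misread_as_big_endian(little_endian_bytes):
    """Decode little-endian bytes the way a reader assuming big-endian order would."""
    return struct.unpack(">i", little_endian_bytes)[0]


variants = encode_variants(1997)
wrong = misread_as_big_endian(variants["little"])  # silently yields a different number
right = int(variants["text"].decode("ascii"))      # text survives any CPU architecture

if __name__ == "__main__":
    print(variants, wrong, right)
```

The binary forms are fixed-width and cheap to parse, but nothing in the four bytes says which byte order the writer used; the decimal text costs a conversion and carries no such ambiguity.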

Successfully human-readable formats, by contrast, are measured by the cognitive effort to extract and manipulate information [Schriver, 1997]. In this case, flexible enforcement of syntax rules makes it easier to edit and read. Data representations need to be translated to accessible forms, potentially at the expense of fidelity. For example, integers can be represented accurately in decimal, but inaccuracies can crop up for floating-point. Data presentations also need to be accessible: a Portable Network Graphics (PNG) picture is "human-readable" when presented as an image. A spreadsheet presented as a table, though, loses the equations and symbolic logic behind the numbers in the process. The benefit of all of these tradeoffs is increased reusability, which will increase the viability and investment in maintaining that format. Conversely, when human-readability is reduced to an afterthought as a companion "import/export" format, the canonical binary format may still not become future-proof.

Successfully self-describing formats are measured by how much can be discovered dynamically about their mechanical structure and semantics. The first test is simple identification. The file should contain some type signature, perhaps even a revision number, or at least a filename extension --- enough to characterize the format. Leveraging that identity to define the provenance of the data and its definitions is the next step. A typical UNIX system configuration file, for example, at least refers to the section of the manual that defines its entries. The third test is whether that definition is sufficient to dynamically extract and manipulate the information within both structural and presentational guides. These kinds of metadata can future-proof a format, preserving machine-readability and human-readability.

2.3 Capturing Database Schema as DTDs

XML strikes an appealing balance among the three strategic goals laid out in Section 2.2. The very constraints applied to SGML in specifying XML make it well-suited to dynamically generating new data formats.

First, each specific XML-based file format is based on a separate, explicit Document Type Definition (DTD). Each DTD defines the names of new tags, their structure, and their content model. More to the point, XML files are required to disclose their respective DTDs in their headers, or include the entire DTD within the XML file itself, neatly enforcing self-description. The DTD functions analogously to an Interface Definition Language (IDL) specification or a relational database schema.

Second, XML parsers can process files with or without the DTD: a parser can check that a file is well-formed without consulting the DTD, and can validate it when the DTD is available. The implicit grammar rules define a hybrid machine-readable and human-readable text format that can represent numbers, strings, and even escaped binary content. The tools themselves can be built small and run quickly, as described elsewhere in this issue. The resulting files are probably larger than alternative formats, but XML markup should compress effectively for tightly constrained environments. XML-formatted metadata can also be stored alongside legacy files as appropriate.

Furthermore, using XML unlocks other opportunities. DTDs can be cascaded to represent compound data types [Bray, 1997]. A TravelAuthorization record, for example, could combine an Itinerary record and an EmployeeAccount. DTDs can also be hosted on the Web, allowing users to dynamically learn about new formats. Style sheets can be applied to the tagged XML data, garnering all of the formatting abilities that application entails.
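As a rough sketch of what self-description buys, the Python fragment below reads the element declarations back out of a document that carries its DTD in its own header. The itinerary vocabulary (element names, the locator and carrier attributes, the sample values) is invented for illustration and is not an airline format. The parser is Python's standard xml.parsers.expat, which reports declarations but does not validate, so the final comparison of declared and used element names is only a crude stand-in for real validation.

```python
import xml.parsers.expat

# Hypothetical itinerary document with its DTD embedded in the header.
PNR_DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE itinerary [
  <!ELEMENT itinerary (passenger, segment+)>
  <!ELEMENT passenger (#PCDATA)>
  <!ELEMENT segment (#PCDATA)>
  <!ATTLIST itinerary locator CDATA #REQUIRED>
  <!ATTLIST segment carrier CDATA #REQUIRED>
]>
<itinerary locator="NQSS5A">
  <passenger>A. Traveler</passenger>
  <segment carrier="XX">LAX-NRT</segment>
  <segment carrier="YY">NRT-SIN</segment>
</itinerary>
"""


def extract_schema(document):
    """Return (element names declared in the DTD, element names used in the body)."""
    declared = []
    used = []
    parser = xml.parsers.expat.ParserCreate()
    # Called once per <!ELEMENT ...> declaration in the internal DTD subset.
    parser.ElementDeclHandler = lambda name, model: declared.append(name)
    # Called once per start tag in the document body.
    parser.StartElementHandler = lambda name, attrs: used.append(name)
    parser.Parse(document, True)
    return declared, used


if __name__ == "__main__":
    declared, used = extract_schema(PNR_DOCUMENT)
    print("declared:", declared)
    print("undeclared elements in use:", set(used) - set(declared))
```

A future reader holding only this file can recover the vocabulary it was written in, which is the property argued for above; a validating parser would go further and check the content models as well.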

3. Exchanging State

Although an airline system could be simulated as one massively centralized application, it is distributed across multiple subsystems for scale and redundancy (by multiplying the number of parallel instances) and to manage complexity (by dividing the problem). The flight dispatcher is an example of the latter: as the delays propagate across the country each day, it decides to schedule or scrub flights and allocate planes. Each decision to fly is based on a number of factors: the reservations for each flight, to estimate the size of the plane; the fares at stake, to determine profitability; and the itineraries of each passenger, to predict missed connections. The PNR database has all that information and more: seat assignments, VIP status flags, baggage claims... so much data that the challenge becomes how to select and transfer just the relevant portions.

While every system needs to ensure its input and output formats can withstand the test of time, distributed systems need to share knowledge between different physical locations at the same time. Protocols must be established for excerpting relevant parts of the workload and shipping data between subsystems, either across a network of separate computers or using interprocess communication within the same computer. In this section, we explain some of the tradeoffs of marshaling data (Section 3.1), a strategy to use the network to defer marshaling decisions (Section 3.2), and how the Extensible Linking Language improves upon the Web's hypertext semantics to match (Section 3.3).

3.1 Distributed Systems With Centralized Data

The simplest way to cope with the complexity of data flowing back and forth intermittently in varying packages is not to; "distributed" does not imply "decentralized." Instead of deciding how to export part of an interlinked data structure and how often, it can be considered cost-effective to either unify the entire state of the system in a central relational database [Flemming and Vonhalle, 1988], or partition that whole state between isolated subsystems.

Consider the challenge of exchanging state between the flight dispatcher and another critical resource, the crew dispatcher. Once a flight has been scheduled onto a plane, it still needs pilots and attendants certified to operate that plane in that departure city. It is an especially complex space-time chess game because people, unlike planes, need to return home soon. Optimization algorithms manipulate all of these records simultaneously, producing a complex, connected graph of Employees, Flights, and Planes in memory. The results need to be shared with yet other subsystems in operations and human resources: reports summarizing the activity of each Plane and Employee.

To "pickle" the state of a Plane, we can write down its particulars, but then there are pointers to the several Flights it will take that day which in turn point to several crews. Extracting that report requires marking all the records that plane depends on, then cutting that subgraph out of the larger database by replacing pointers with internal references in the archive. Of course, it is not just a simple spreading tree: pickling an Employee requires enumerating the Flights and Planes it is linked to, and the recursive, tangled mess could easily expand to encompass the entire database.

The system designer has to break this cycle, literally. Decisions must be made either to include a linked record in the archive, or else to replace the pointer with a symbolic name. For example, the daily roster for a Plane can terminate by recording only the Employee name and ID, eliding other details that can be reconstructed by dereferencing that ID. The Employee schedule need only list Flight numbers, rather than include the full details of the flight's passengers, meals, and revenue.
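The cut described above can be sketched in a few lines of Python. The classes, field names, and sample records are our assumptions, chosen only to mirror the Plane, Flight, and Employee example; the policy (Flights inline in a Plane roster, Employees reduced to name and ID, Employee schedules reduced to Flight numbers) is the one stated in the text.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Employee:
    emp_id: str
    name: str
    flights: list = field(default_factory=list)  # back-pointers: this is what creates cycles


@dataclass
class Flight:
    number: str
    crew: list = field(default_factory=list)


@dataclass
class Plane:
    tail: str
    flights: list = field(default_factory=list)


def pickle_plane(plane):
    """Daily roster for a Plane: Flights inline, each Employee cut down to ID and name."""
    return {
        "plane": plane.tail,
        "flights": [
            {
                "number": flight.number,
                # Symbolic reference instead of the full Employee record and its Flights.
                "crew": [{"id": e.emp_id, "name": e.name} for e in flight.crew],
            }
            for flight in plane.flights
        ],
    }


def pickle_employee(employee):
    """Schedule for an Employee: Flight numbers only, no passengers, meals, or revenue."""
    return {
        "id": employee.emp_id,
        "name": employee.name,
        "flights": [flight.number for flight in employee.flights],
    }


# A tiny cyclic graph: the flight points at its crew, the crew points back at the flight.
pilot = Employee("E42", "A. Pilot")
flight = Flight("XX101", crew=[pilot])
pilot.flights.append(flight)
plane = Plane("N123XX", flights=[flight])

if __name__ == "__main__":
    print(json.dumps(pickle_plane(plane), indent=2))
    print(json.dumps(pickle_employee(pilot), indent=2))
```

Both archives are finite even though the in-memory graph is cyclic, because each function stops at a name or ID that the reader can dereference later.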

In the geography of distributed systems, distance is (the inverse of) bandwidth, which constrains both the size and frequency of messages. Designers also have to enforce policies about how often to update the system. At one extreme, all data can be stored in atomically small records in one high-performance database. Even if that database is hosted on several computers running in parallel, it is essentially a centralized philosophy at work. It can be made to scale: today's airline reservation systems pool dozens of mainframes in massive hardened data centers into some of the highest-throughput transaction networks in the world. At the other extreme, all related data can be isolated within one system that emits a batch-processed summary of the entire set every so often.

3.2 Linking Instead Of Marshaling

Marshaling is an expensive strategy because the prerogative always lies with the sender. First, the sender has to decide the marshaling policy for each data structure when the application is written. Second, at runtime the marshaling process has to mark all of the records to be written and resolve all of the internal pointers before the first byte hits the wire.

The Web solves this problem rather differently. A page can include many subsidiary resources, some of which load other subparts in turn. Different pages can also share common resources. Web servers do not transmit a single neat package, though: each resource is transferred in a separate HTTP request-response pair [Fielding et al., 1997].

The key observation is that the links between resources already have names. Instead of pointers that can only be interpreted in the sender's context (like memory addresses), relative and absolute Uniform Resource Locators [Berners-Lee et al., 1994] can be interpreted by any recipient. Instead of expensive marshaling burdens on the server (writer), the client (reader) can incrementally fetch the desired resources as needed.
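A minimal sketch of this reader-driven model follows, in Python. The host name airline.example, the resource paths, and the in-memory RESOURCES table standing in for a Web server are all hypothetical. Only urljoin, from the standard library, is real machinery: it resolves a relative reference against the URL of the document that contains it, which is exactly what lets any recipient interpret the link.

```python
from urllib.parse import urljoin

# Hypothetical resources; each lists the links it contains, some relative, some absolute-path.
RESOURCES = {
    "http://airline.example/planes/n123.xml": {"links": ["../flights/xx101.xml"]},
    "http://airline.example/flights/xx101.xml": {"links": ["/crew/e42.xml"]},
    "http://airline.example/crew/e42.xml": {"links": []},
}


def fetch(url, log):
    """Stand-in for one HTTP request-response pair; records what was transferred."""
    log.append(url)
    return RESOURCES[url]


def absolute_links(base_url, resource):
    """Resolve each link in the recipient's context, relative to the containing document."""
    return [urljoin(base_url, link) for link in resource["links"]]


def crawl(start_url, depth, log):
    """Fetch a resource, then follow its links only as deep as the reader asks."""
    resource = fetch(start_url, log)
    if depth > 0:
        for link in absolute_links(start_url, resource):
            crawl(link, depth - 1, log)


if __name__ == "__main__":
    start = "http://airline.example/planes/n123.xml"
    shallow, deep = [], []
    crawl(start, 0, shallow)  # the reader wants only the plane record
    crawl(start, 2, deep)     # the reader drills down to the crew
    print(shallow)
    print(deep)
```

The sender never computes a closure over the graph; the cost of each extra record is paid only by a reader who actually asks for it.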

Separating each transaction does not necessarily compromise consistency. At first it might seem that since each resource is exchanged at a different point in time, the entire set could change in the middle. That race condition can be prevented by incorporating state into the URL (for example, a version indicator a la Web Distributed Authoring and Versioning (WebDAV) [Slein et al., 1997]) or into the protocol (for example, an HTTP Cookie [Kristol and Montulli, 1997]).

Separating each transaction can hamper performance, though. HTTP's strictly synchronous model implies a round-trip delay for each resource, even if the sender already knows what dependent resources should be marshaled together. HTTP caching or the future evolution of HTTP to allow "push" responses both can address this limitation.

Neither of these engineering concerns dilutes the lesson of linking with names, since URLs are designed to assimilate new naming schemes and access protocols. The strategy of linking resources together with names defers both of the costs associated with marshaling: the prerogative to drill down shifts to the recipient, and the sender does not have to map out an entire report.

3.3 Leveraging the XML Link Model

XML-formatted data is particularly well-suited to this strategy because of the rich new linking model in the complementary XLL specification. The W3C Generic SGML activity charter aimed for protocols supporting interactive access to structured documents: fetching just a single definition from a dictionary (instead of retrieving the entire dictionary), for example. The addressing model, which allows source documents to select any span of elements in another SGML or XML document, can also serve the needs of distributed system designers: accessing one baggage record out of a manifest, for example.

XLL can indicate whether each linked resource should be interpreted within the same context or a new one (the SHOW axis) and suggest whether to access it in parallel or in series (the ACTUATE axis).

Since the actual link address format is just a URL, it can point at any named span in the target document. Fragment identifiers like "document#label" behave just as they do for HTML. First, they load and parse the entire document, and then they search for the anchor element the original author so labeled. Unlike HTML, though, URLs referring to XML documents can use an extended pointer (XPTR) syntax developed by the Text Encoding Initiative (TEI). An XPTR identifier such as "document|ID(label),CHILD(2,*)" points to the second element below the labeled anchor; there are many other operators for navigating the parse tree, counting characters, matching strings, and indicating spans. XLL deliberately leaves it unspecified who dereferences an XPTR identifier, so the dictionary server can indeed return just matching definitions.
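To show the flavor of such a pointer, here is a deliberately tiny Python interpreter for just the two operators in the example above, ID(label) and CHILD(n,*). It is a sketch under several assumptions and not an implementation of the TEI extended pointer syntax: the baggage manifest and its element names are invented, the ID-typed attribute is assumed to be spelled "id", and the element-type argument of CHILD is ignored as if it were always "*".

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical baggage manifest; a real one would declare "id" as an ID attribute in its DTD.
MANIFEST = """<manifest>
  <bag id="b1"><tag>0001</tag><owner>Passenger A</owner></bag>
  <bag id="b2"><tag>0002</tag><owner>Passenger B</owner></bag>
</manifest>"""


def split_locator(locator):
    """Split 'document|pointer' into its document part and its pointer part."""
    document, _, pointer = locator.partition("|")
    return document, pointer


def resolve_pointer(root, pointer):
    """Walk the parse tree following ID(...) and CHILD(n,*) steps; None if a step fails."""
    node = root
    for op, args in re.findall(r"(ID|CHILD)\(([^)]*)\)", pointer):
        if op == "ID":
            # Jump to the element labeled with this identifier, anywhere in the tree.
            node = next((e for e in root.iter() if e.get("id") == args), None)
        else:
            # CHILD(n,*): the n-th child element, counting from 1.
            index = int(args.split(",")[0])
            children = list(node)
            node = children[index - 1] if 0 < index <= len(children) else None
        if node is None:
            return None
    return node


if __name__ == "__main__":
    root = ET.fromstring(MANIFEST)
    document, pointer = split_locator("manifest.xml|ID(b2),CHILD(2,*)")
    target = resolve_pointer(root, pointer)
    print(document, ET.tostring(target, encoding="unicode"))
```

Because the function needs only the parse tree and the pointer, it could run on either side of the connection, which is the point of leaving the dereferencing party unspecified: a server that evaluates the pointer can return one record instead of the whole manifest.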

The latter development is perhaps the most significant for XML's future as an archiving format. Portions of the state within a structure can be named, linked to, and even excerpted without modifying the source. Even state-of-the-art object-oriented serialization services for Objective-C and Java can only archive an entire stream all at once [Cox and Novobilski, 1991; Java RMI, 1997]. XML's well-formedness requirements produce structured documents that can be correctly manipulated, even without the entire contents of the document at hand.

4. Sharing Meaning

The airline system does an admirable job of abstraction for its passengers, hiding almost all of the machinations discussed in the previous two sections. For the traveler, black-box reuse means the only interface necessary is to specify the origin, destination, time, and then pay the fare. Black-box reuse also means that when a traveler submits an expense report, the only documentation left is the mere image of a ticket stub. A PNR unlocks no data for an outsider.

The information to fill out an expense report certainly exists within the airline's databases. That information was even collated into a self-contained document. When that ticket changed hands from agent to passenger, though, it was ripped out of context. The point of preparing a report should be to come to enough ontological agreement to allow an outsider to reconstitute its context, and hence its meaning. In this section, we explore the challenges of interorganizational collaboration (Section 4.1), document-centered integration strategy (Section 4.2), and how XML-enhanced documents can provide a usable face to structured data (Section 4.3).

4.1 Coordinating Tasks Across Organizations

A traveler planning a meeting of his clients has to make air, car, and hotel arrangements and obtain consensus from his clients for a schedule. Ideally, once reservations have been made, the traveler would like to link all these affairs together. For example, if the flight is delayed, each participant's calendar will show an alternate meeting time. Or, if the meeting is rescheduled to two days, the travel arrangements are extended to match. In reality, these four systems (air, car, hotel, and company meeting) are so loosely coupled that there is a vibrant market for a fifth organization, the travel agency.

This is not a technology problem. It is not a matter of wiring up all the players with email and websites. It is an ontological problem where no two vocabulary sets quite line up. For example, if a meeting slips from the afternoon to the next morning, it is one extra hotel "night" (which are calculated as solar days), and zero extra car rental "days" (which are calculated as 24-hour blocks) and possibly even a more expensive airline ticket (if the fare had a "maximum-stay" limit, which would be measured in the originating timezone).
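The mismatch is easy to reproduce with arithmetic. The Python sketch below encodes the two counting rules exactly as loosely as they are stated above, which is an assumption on our part: hotel nights are calendar dates crossed, and car rental days are 24-hour blocks rounded up. Real tariffs are more intricate, and the dates and times are invented.

```python
import math
from datetime import datetime


def hotel_nights(check_in, check_out):
    """Hotel 'nights' counted as calendar dates crossed (simplified rule)."""
    return (check_out.date() - check_in.date()).days


def car_rental_days(pickup, dropoff):
    """Car rental 'days' counted as 24-hour blocks, rounded up (simplified rule)."""
    hours = (dropoff - pickup).total_seconds() / 3600
    return math.ceil(hours / 24)


# Hypothetical trip: arrive at 10:00, originally leave after an afternoon meeting at 18:00
# the next day; the meeting then slips to the following morning, so departure moves to 10:00.
arrival = datetime(1997, 10, 20, 10, 0)
original_departure = datetime(1997, 10, 21, 18, 0)
slipped_departure = datetime(1997, 10, 22, 10, 0)

extra_nights = hotel_nights(arrival, slipped_departure) - hotel_nights(arrival, original_departure)
extra_car_days = car_rental_days(arrival, slipped_departure) - car_rental_days(arrival, original_departure)

if __name__ == "__main__":
    print("extra hotel nights:", extra_nights)
    print("extra car rental days:", extra_car_days)
```

The same sixteen-hour slip is one unit in the hotel's vocabulary and zero in the rental agency's, so no single "duration" field could be exchanged between the two systems without also exchanging the rule that interprets it.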

Understanding these varying bits of jargon for marking time confers membership in each industry. Organizations can be defined by their language: ontology recapitulates community. Coordinating tasks across organizations ineluctably requires adapting to local conventions and stipulating agreements through legal contracts and other documents such as trust agreements [Khare and Rifkin, 1997]. They also require prying the information out of the several distributed systems: each of the travel planners hides behind a reservation code, to say nothing of the chaos in calendaring standards.

4.2 Collaborating Through Documents

The best remedy to conflicting speech is yet more speech. The conventional practice for forestalling such confusion is to 'put it in writing' through legal contracts and other trust agreements [Khare and Rifkin, 1997]. From the Warsaw Convention fine print on luggage loss liability limits to credit card charge slips, business proceeds by paperwork. Such documents have two roles in brokering interorganizational collaboration: the document itself becomes the concrete face of the task, and it defines its own ontology for the task.

Consider a bank check. Legally, a demand deposit account can be used with a signed napkin, but the U.S. Federal Reserve's clearance policies set out the physical dimensions, layout, and magnetic-ink encoding of a check. As the check moves from bank to bank, there is no confusion as to the exact interpretation of accounts, amounts, and dates, because the check incorporates its own legal conventions. At another end of the spectrum, a forty-thousand page New Drug Application to the US Food and Drug Administration still has the same roles. The application is the one artifact which represents years of negotiations, carefully logged. The application sets out its own drug-specific scientific terms and tests, negotiated by both sides' analysts.

Documents in cyberspace assume the same roles: embodying the user interface to a task and defining its terms.

The document metaphor has a long pedigree in user interface research, far predating the web. Taligent, arguably the most sophisticated multiuser collaborative document toolkit to date, strongly endorsed the convergence of application-as-document and "collaborative places" [Cotter and Potel, 1995]. Concurrent document views were assembled from active components consulting a shared structured storage model while interaction could pass from user to user. While such peer-to-peer collaboration may be several generations ahead of current Web client technology, server-based coordination of Web pages with forms and active content is a sufficient simulacrum. The broader lesson is that an intelligent "purchase order" document can be a more usable representation of the collaborative process than a traditional application. Web technology accelerates the development cycle by dramatically lowering the threshold for creating document interfaces. A form and a CGI interface to the Shipping Department can put a business online faster than an army of EDI consultants, because the Web's markup format is so accessible.

The logic embedded in a collaborative document also defines the ontology for that task. Within a community, understanding the semantics of a document is a matter of identifying its format (Section 2.2). An outsider has to understand the ontology behind the format, well enough perhaps to translate it into locally-meaningful terms. A calendar developer has to build a lot of shared context with an airline reservation structure to extract facts like "the user will be on a plane and inaccessible during each flight; during a flight, the calendar's time zone should be reset; and the user will not be available for meetings at the office." On the other hand, instead of waiting for an industrywide or international standards process to deliberate over the canonical meaning of "place" and "time," developers can at least knit one-to-one mappings. Popular ontologies can emerge organically, like well-trodden paths in a field.

4.3 Weaving XML Into The Web

XML promises to realize the vision of interorganizational collaboration through Web documents. As discussed in Section 2.3, XML will "let a thousand DTDs bloom," making it cost-effective to capture community-specific semantics in XML-formatted data. Several further developments, taken together, will enable XML to assume the roles discussed above: including XML-formatted data structures in HTML documents; building usable interfaces using forms and style sheets; naming and locating DTDs on the Web; and automated processing and conversion of XML-formatted data.

Electronic commerce on the Web is already big business, but its HTML form-based infrastructure is not enough to "become the concrete face of the task." For example, two sites selling books and flowers will both inevitably ask for a shipping address, but without a structured container for addresses, there is no way to automatically fill in the order page at either site, much less using a shared address format. As XML-savvy tools become more popular, web developers will be able to publish and receive XML street addresses within HTML web pages [Connolly, Khare, and Rifkin, 1997]. Forms extensions could specify the DTD of input data. Style sheets will format the appearance of embedded data structures.
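As a sketch of what a structured address container would enable, the Python fragment below fills in the order forms of two sites from one XML address. Everything specific in it is assumed for illustration: the address element names, the sample address, and the form field names of the hypothetical bookseller and florist. No existing address DTD or forms extension is implied.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured address, as a user agent might store it once.
ADDRESS_XML = """<address>
  <name>A. Traveler</name>
  <street>1 Example Way</street>
  <city>Pasadena</city>
  <region>CA</region>
  <postcode>91125</postcode>
</address>"""

# Each site names its form fields differently; a one-to-one mapping bridges the two.
FIELD_MAPS = {
    "bookseller": {"name": "ship_to", "street": "addr1", "city": "city",
                   "region": "state", "postcode": "zip"},
    "florist": {"name": "recipient", "street": "street_line", "city": "town",
                "region": "province", "postcode": "postal_code"},
}


def fill_form(address_xml, field_map):
    """Return form field values for one site, taken from the tagged address elements."""
    root = ET.fromstring(address_xml)
    return {form_field: root.findtext(tag) for tag, form_field in field_map.items()}


if __name__ == "__main__":
    for site, field_map in FIELD_MAPS.items():
        print(site, fill_form(ADDRESS_XML, field_map))
```

With untagged HTML form fields, the mapping tables would have to be guessed from labels on the page; with a published DTD for the address, each site could declare its mapping once.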

The technology to manipulate the ontology of XML documents is a little further off. The key is XML's hooks for identifying DTDs. The Formal Public Identifier for a document type can now be associated with a URL. XML processing tools could expect to dereference that address and discover not only a DTD file, but also metadata about the meaning of each tag, default style sheets, and possibly even mobile code resources for manipulating such data. With this kind of documentation, automated translation tools might be able to associate an airline's <location> tag, which refers to airport codes, with an atlas's latitude and longitude entries.
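
A rough sketch of that last translation follows. It assumes a hypothetical atlas table keyed by airport code, standing in for whatever metadata a tool might discover by dereferencing the DTD's address. The <segment> wrapper element is invented for the example, and the coordinates are approximate.

```python
import xml.etree.ElementTree as ET

# Hypothetical atlas entries keyed by airport code: (latitude, longitude)
# in decimal degrees, rounded and approximate.
ATLAS = {
    "LAX": (33.94, -118.41),
    "JFK": (40.64, -73.78),
}


def locate(segment_xml):
    """Pair each <location> airport code with its atlas entry, if any."""
    root = ET.fromstring(segment_xml)
    pairs = []
    for location in root.iter("location"):
        code = (location.text or "").strip().upper()
        pairs.append((code, ATLAS.get(code)))  # None when the atlas lacks the code
    return pairs


# Illustrative itinerary fragment; the <segment> wrapper is an assumption.
SEGMENT = "<segment><location>LAX</location><location>JFK</location></segment>"
positions = locate(SEGMENT)
```

Codes missing from the table map to None rather than raising an error, since an automated translator would routinely meet tags it cannot yet interpret.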

In the interim, several exciting tools are already focusing on this vision. WebMethods' Web Interface Definition Language (WIDL) can extract structured data from HTML and XML Web pages, invoke processing on Web servers using forms, and collate reports harvested from multiple sites in a single format [Allen, 1997]. Many developers have rallied to the motto "XML gives Java something to chew on" [Bosak, 1997], referring to the synergy between XML and mobile code embedded together in Web pages. All of these trends are narrowing the gap between human-readable and machine-readable documents.

5. Knowledge Representation Across Time, Space, and Communities

In this paper, we have spoken of distributing data across time, space, and community as though they are impossible chasms. We believe they are. The arrow of time points opposite to the arrow of memory. Separation in space is separation in time; thus, latency compels brevity. The gulf between communities is the root of communication.

We have tried to set the challenges facing distributed system designers in these contexts. We argue that XML can effectively future-proof data formats, exchange data structures, and enhance Web documents into robust platforms for system integration. It is not the first, last, or universal solution, but it does accelerate the continuing evolution of the Web. As the Web assimilates "the universe of all network-accessible information" [Berners-Lee, 1996], and as XML adds the metadata to define that universe, at some point information transubstantiates into knowledge.

A modern airline can no more take flight without its information systems than without jet fuel. At some point, the distributed system no longer models reality; it becomes reality. As David Gelernter predicted in his 1991 book, when the image in the machine corresponds to the real world, in both directions, we have built a Mirror World [Gelernter, 1991]. Today, these only exist in limited domains at vast expense: transportation systems, telecommunications systems, military operations. Soon, to the degree that the Web continues to evolve toward richer data representation and proprietary systems gain Web interfaces, XML will mediate the recreation of reality in cyberspace.


Acknowledgments

This paper is based on our experiences over several years working with the Web community. Particular plaudits go to our colleagues at the World Wide Web Consortium, including Dan Connolly and Tim Berners-Lee; and the teams at MCI Internet Architecture and Caltech Infospheres. We also thank Ron Resnick, Mark Baker, and Doug Lea for their helpful comments.


References

  1. Charles Allen. WIDL: Automating the Web with XML, World Wide Web Journal, Volume 2, Number 4, Autumn 1997. Available at http://www.webmethods.com/technology/widl.html
  2. Tim Berners-Lee, Larry Masinter, and Mark McCahill. Uniform Resource Locators (URL), RFC 1738, December 1994. Available at http://www.w3.org/Addressing/rfc1738.txt
  3. Tim Berners-Lee. WWW: Past, Present, and Future, IEEE Computer, Volume 29, Number 10, October 1996.
  4. Jon Bosak. XML, Java, and the Future of the Web, 1997. Available at http://sunsite.unc.edu/pub/sun-info/standards/xml/why/xmlapps.htm
  5. Tim Bray, Jean Paoli, and C.M. Sperberg-McQueen. eXtensible Markup Language (XML): Part I. Syntax, World Wide Web Consortium Working Draft (Work in Progress), August 1997. Available at http://www.w3.org/TR/WD-xml-lang.html
  6. Tim Bray and Steve DeRose. eXtensible Markup Language (XML): Part II. Linking, World Wide Web Consortium Working Draft (Work in Progress), July 1997. Available at http://www.w3.org/TR/WD-xml-link.html
  7. Tim Bray. Adding Strong Data Typing to SGML and XML, May 1997. Available at http://www.textuality.com/xml/typing.html
  8. David Chappell. Understanding ActiveX and OLE, Microsoft Press, 1996.
  9. Dan Connolly, Rohit Khare, and Adam Rifkin. The Evolution of Web Documents: The Ascent of XML, World Wide Web Journal, Volume 2, Number 4, Autumn 1997. Available at http://www.cs.caltech.edu/~adam/papers/xml/ascent-of-xml.html
  10. Sean Cotter and Mike Potel. Inside Taligent Technology, Addison-Wesley, 1995. Available at http://www.taligent.com/itt.html
  11. Brad Cox and Andrew Novobilski. Object-Oriented Programming: An Evolutionary Approach, Second Edition, Addison-Wesley, 270 pages, 1991.
  12. Roy Fielding, Jim Gettys, Jeff Mogul, Henrik Frystyk, and Tim Berners-Lee. Hypertext Transfer Protocol -- HTTP/1.1, RFC 2068, January 1997. Available at http://www.w3.org/Protocols/rfc2068/rfc2068
  13. Candace Fleming and Barbara von Halle. Handbook of Relational Database Design, Addison-Wesley, 605 pages, 1988.
  14. David Gelernter. Mirror Worlds: The Day Software Puts the Universe in a Shoebox... How it Will Happen and What it Will Mean, Oxford University Press, 1991.
  15. ISO 8825:1987. Information Processing Systems - Open Systems Interconnection - Specification of Basic Encoding Rules for Abstract Syntax Notation One (ASN.1), 1987.
  16. Javasoft Java RMI Team. Java Remote Method Invocation and Object Serialization Specification, Sun Microsystems, 1997. Available at http://www.javasoft.com/products/jdk/rmi/index.html
  17. Rohit Khare and Adam Rifkin. Weaving a Web of Trust, World Wide Web Journal special issue on security, Volume 2, Number 3, pages 77-112, Summer 1997. Available at http://www.cs.caltech.edu/~adam/papers/trust.html
  18. David Kristol and Lou Montulli. HTTP State Management Mechanism, RFC 2109 (Work in Progress), February 1997. Available at http://www.internic.net/rfc/rfc2109.txt
  19. Samuel J. Leffler, Marshall Kirk McKusick, Michael J. Karels, and John S. Quarterman. The Design and Implementation of the 4.3BSD UNIX Operating System, Addison-Wesley, 1989.
  20. Karen A. Schriver. Dynamics in Document Design: Creating Texts for Readers, John Wiley and Sons, 1997.
  21. Jon Siegel. CORBA Fundamentals and Programming, John Wiley and Sons, 693 pages, 1996.
  22. J.A. Slein, F. Vitali, E. Jim Whitehead Jr., and D.G. Durand. Requirements for Distributed Authoring and Versioning on the World Wide Web, Internet Draft (Work in Progress), May 1997. Available at ftp://ietf.org/internet-drafts/draft-ietf-webdav-requirements-00.txt

Author Addresses

Rohit Khare, khare@alumni.caltech.edu

Rohit Khare served as a member of the MCI Internet Architecture staff in Boston, MA in the summer of 1997, when this paper was written. He was previously on the technical staff of the World Wide Web Consortium at MIT, where he focused on security and electronic commerce issues. He has been involved in the development of cryptographic software tools and Web-related standards. Rohit received a B.S. in Engineering and Applied Science and in Economics from the California Institute of Technology in 1995. He joined the Ph.D. program in computer science at the University of California, Irvine in Fall 1997.

Adam Rifkin, adam@cs.caltech.edu

Adam Rifkin received his B.S. and M.S. in Computer Science from the College of William and Mary. He is presently pursuing a Ph.D. in computer science at the California Institute of Technology, where he works with the Caltech Infospheres Project on the composition of distributed active objects. His efforts with infospheres have won best paper awards both at the Fifth IEEE International Symposium on High Performance Distributed Computing in August 1996, and at the Thirtieth Hawaii International Conference on System Sciences in January 1997. He has done Internet consulting and performed research with several organizations, including Canon, Hewlett-Packard, Reprise Records, Griffiss Air Force Base, and the NASA-Langley Research Center.
