

Managing Names and Ontologies: An XML Registry and Repository

By Robin Cover (OASIS)


Introduction

By any reasonable measure, the W3C's Extensible Markup Language (XML) qualifies as a major Internet success story for 1998. Over 75 XML-related specifications or new XML-based markup languages have been announced in the short time since XML was released as a W3C Recommendation, and new XML applications emerge almost daily. XML has thus catalyzed industry-wide collaborative efforts to create interoperable solutions, accelerating the rate at which open, sharable XML designs are emerging. XML markup provides a very accessible notation for expressing the key elements in a conceptual model for an application domain, and this simplicity of the XML format allows a larger number of persons, whether specialists or non-specialists, to collaborate in the design of new markup languages.

Early adoption of XML by a host of industry partners is thus creating a wealth of opportunity for information reuse and collaborative distributed network computing over the Web. At the same time, the rapid emergence of XML DTDs and vocabularies from industry and government sectors has focused public attention upon issues of resource identification, classification, cataloging, and delivery that hinder reuse and interoperability. The results of new collaborative endeavors are not necessarily easy to identify and access on the Internet. Simply put: XML resources are not nearly as discoverable and reusable as they deserve to be.

While some of the challenges to interoperability relate to fundamental features of Internet infrastructure, other issues relate to the intelligent and responsible management of information resources within application domains, and to the sharing of intellectual constructs across enterprise domains. This document summarizes challenges to interoperability under five categories:

  • Discovery,
  • Authenticity,
  • Naming,
  • Access, and
  • Semantic Transparency
and then examines one possible strategy for improved resource management: the XML registry and repository.

XML No Panacea: Challenges to Information Sharing

As a means of introducing and illustrating the impediments to XML resource sharing, consider a scenario in which a European project team wants to create a generalized XML design for expert system logic and the corresponding knowledge base. The project team first endeavors to discover related work, and to determine whether their local project goals make it possible to reuse some core components in related designs. Unfortunately, no central agency tracks the development of new XML markup languages, so the researchers are forced to comb the entire Internet using crude string-based search tools. They uncover a single oblique reference to an ESML 'expert system markup language' in an archived email message (or perhaps 'modeling language' -- the author was unsure), but the URL to the ESML Web site is now broken. Further investigation uncovers a draft XML DTD for ESML (dating to November 1997) which promises enhanced versions, but it provides 'ANY' and '#PCDATA' content models for key element types, making it difficult for the researchers to determine the real extent of overlap between the two projects. The more recent DTD of an Asian-based workgroup goes entirely undiscovered, even though the core features in its associated 'Wizard Markup Language' closely match those of the European project team. Sound familiar?
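To illustrate the problem, a DTD fragment along the following lines (the element names are hypothetical) is syntactically valid yet reveals almost nothing about the intended content of its key element types:

    <!-- Hypothetical ESML fragment: 'ANY' and '#PCDATA' content models
         admit nearly arbitrary content, so the declarations reveal little
         about what a rule or condition is actually meant to contain. -->
    <!ELEMENT rule ANY>
    <!ELEMENT condition (#PCDATA)>
    <!ELEMENT action (#PCDATA)>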

Discovery

Online discovery tools currently represented by Internet guides and text-based search engines provide valuable assistance, but do not adequately support the particular needs of XML developers and users who desire to reuse information modules and design components. A resource discovery facility providing search-and-browse access to XML-related resources should be supported by a database in which the XML modules and components have been classified, annotated, cataloged, cross-referenced, and indexed. In the case of a large vertical-industry XML DTD, for example, dozens of small schemas for general and specialized objects might be defined and carefully documented. While these schemas for primitive real-world objects are natural candidates for reuse, the current generation of Internet search tools is inadequate to reveal them in response to a user's query.

Within the sphere of traditional publication formats (books, journals, audio recordings), our university libraries, library consortia, and supporting government agencies follow a standard model for classification and annotation of resources so that information can be found by tools that understand the taxonomic schemes. But who is to manage cataloging tasks appropriate to the XML-based electronic resources we wish to promote for standards-based computing?

Authenticity

Users of industry-standard resources currently face a difficult dilemma when they are able to locate multiple versions of an online document, but are unable to verify that a particular copy from a particular Web site is authentic. Likewise, they may be unable to determine whether a particular version is indeed the latest (e.g., amended) version. Similarly, a resource may purport to be conformant to an industry standard, but will not have been assigned a conformance level or merit status by any recognized authority.

Authentication of XML resources represents a problem that might be addressed through the agency of a recognized clearinghouse. This administrative body could maintain current information about key computing resources, including version information, canonical Web site locations, and "magic number" checksums which permit a user to test the integrity of a particular resource regardless of its origin on the network or CDROM.
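As a sketch only (no registry record format has yet been standardized, and the element and attribute names below are purely illustrative), a clearinghouse entry might pair a resource's canonical location and version with a published checksum that users can compare against a locally computed digest:

    <!-- Hypothetical registry record; names and values are illustrative,
         not a published schema. The digest value is a placeholder. -->
    <resource name="ESML Core DTD" version="1.2">
      <location href="http://www.example.org/dtds/esml-core.dtd"/>
      <checksum algorithm="MD5">00000000000000000000000000000000</checksum>
    </resource>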

Naming

The rapid emergence of XML resources has created concern about naming these electronic resources within the public sphere. Public identifiers for XML resources should be unique (within managed namespaces), canonical, persistent, and meaningful. However, it is difficult to enforce all of these constraints within the current cultural environment, where users do not expect a URL to be persistent or canonical, and where managing a name inventory independently from a location inventory is largely a foreign concept. An XML resource declared as an external entity may be assigned a permanent, "locationless" name by using an ISO/IEC 9070 Formal Public Identifier as its public identifier. But the weak status of the public identifier in XML, together with established location-addressing practices, renders this more robust naming strategy somewhat ineffectual. The W3C Working Draft "Namespaces in XML" provides a simple mechanism for declaring a URI as a namespace name -- but since the functionality specified in this W3C document is very minimal, there is no clear direction on how the URIs and namespaces are to be designed and managed.
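By way of example (the identifiers shown here are hypothetical), a formal public identifier can be attached to an external entity declaration alongside the location-dependent system identifier, and the namespace draft allows a URI to serve as a namespace name -- while saying nothing about how that URI should be designed, resolved, or managed:

    <!-- An external DTD subset named by an ISO/IEC 9070-style formal
         public identifier (registered Internet domain name as owner),
         with a URL retained as the fallback system identifier. -->
    <!DOCTYPE ruleset PUBLIC
      "+//IDN example.org//DTD ESML Core Module V1.0//EN"
      "http://www.example.org/dtds/esml-core.dtd">

    <!-- A namespace declaration per the "Namespaces in XML" draft:
         the URI functions purely as a name. -->
    <esml:ruleset xmlns:esml="http://www.example.org/ns/esml">
      <esml:rule name="r1"> ... </esml:rule>
    </esml:ruleset>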

The absence of a global architecture for resource naming also encompasses the problem of canonical names. Canonical naming means that a resource has no more than one official name. Canonicity helps reduce confusion and unnecessary duplication of resources, but without an agency to register and maintain canonical names, and to promote the importance of controlled names, the use of canonical names remains elusive. Humans also prefer names that are meaningful (viz., names which are easy to remember and which sensibly identify something using natural language), but providing human-readable names is currently somewhat at variance with convenience and efficiency from the perspective of programmer and computer. Symbolic names like "www.sun.com" which resolve to numbers are taken for granted within the Internet domain name namespace, but it remains unclear whether meaningful names for important XML resources will become the rule, apart from the agency of an administrative body like the InterNIC. Finally, best-practice guidelines need to be published on the topic of name reuse: the authorities managing unique names and namespaces should draft principles determining which kinds of names should die, which may be reassigned to entirely different resources, and which may be assigned successively to variant versions of the "same" resource.

Access

XML was designed for the Internet. But delivery of XML-encoded resources over the Web is now only as reliable and robust as "the Web" itself -- which one could argue is profoundly broken. Users who have acquired the canonical identifier for an electronic resource might still be unable to access it predictably. Resources identified simply by URL often disappear overnight (yielding "404 File not found") or may become inaccessible at a critical moment due to network failure. Others may be locked behind passwords. URLs handled by HTTP requests nominate a single network location (without any fallback or redundancy), and latency under average conditions reveals an evident lack of network robustness.

Some partial and stopgap solutions to enhance access are now available (e.g., indirection through OCLC's PURL system; redundant resource locations through the use of cascading SGML Open CATALOGs), but their lack of scalability and software support renders them inadequate as generalized strategies, leaving XML in no better shape than HTML with respect to network delivery.
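As a sketch of the catalog approach (the entries and file names are hypothetical), an SGML Open catalog maps a public identifier to a local or mirrored copy, and a CATALOG entry can chain to a further catalog, providing a crude form of redundancy:

    -- Hypothetical SGML Open (TR9401) catalog entries --
    PUBLIC "+//IDN example.org//DTD ESML Core Module V1.0//EN"
           "dtds/esml-core.dtd"
    -- Consult a second, mirrored catalog for entries not resolved here --
    CATALOG "mirror/catalog"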

Semantic Transparency

Within the context of interoperable XML-based information processing, "semantic transparency" means that machines and humans are presented with information that is both unambiguous (having a precise, predictably interpreted meaning) and meaningfully correct (simultaneously satisfying a number of integrity constraints). Computer agents, in particular, must exchange well-defined data in order to calculate and pass along "the correct answer." Semantic transparency first requires that small information objects, as well as large information objects built from smaller ones, be formally specified at a detailed level in terms of their fundamental characteristics, relationships, and natural integrity constraints, such that validation tools can apply heuristics to test information correctness. Given unambiguous semantic specification, both computing agents and humans can verify that XML-encoded information is meaningful and trustworthy.

Despite the apparent 'semantic' clarity, flexibility, and extensibility of XML encoding vis-à-vis HTML, we must reckon with the cold fact that XML itself does not address semantic transparency, and that it does not enable blind interchange. XML may help humans predict what information might lie "between the tags" in the case of <trunk></trunk>, but XML can only help. To an XML processor, <trunk> and <i> and <bookTitle> are all equally and totally "meaningless." Just like its parent metalanguage (SGML), XML has no formal mechanism to support the declaration of integrity constraints upon primitive (ontological, relational) semantics, and XML processors have no means of validating semantics even if they are declared informally in an XML DTD. XML governs syntax only.
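To make the point concrete, consider a hypothetical DTD declaration for the element above: the content model constrains syntax only, and nothing in it tells a processor (or a trading partner) whether a "trunk" belongs to a tree, a car, or an elephant, what units or value ranges are legal, or how the element relates to anything else:

    <!-- Hypothetical declarations: syntactically complete, semantically
         silent. Nothing constrains what 'trunk' denotes, or what units
         and ranges are valid for 'diameter'. -->
    <!ELEMENT trunk (#PCDATA)>
    <!ATTLIST trunk diameter CDATA #IMPLIED>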

Ancillary W3C specifications like DCD, RDF syntax and schemas, and XML-Data, as well as deliverables from the XML Schema Working Group, promise to make up for some of the recognized limitations of XML. Still, it appears that other industry initiatives will be required to coordinate the results of these efforts -- to enable the meaningful sharing of XML-based schemas and information units at the semantic level. Within the realm of electronic commerce, Ontology.Org now plays a preeminent role in highlighting the critical need to provide unambiguous semantic specification for XML-encoded objects. In order for XML to achieve its full potential, this principle must now be recognized and democratized across industry domains -- but with whose help?

Managing XML Resources through Industry Consortia

XML and its satellite standards represent key technology solutions that partially enable cross-domain, platform-independent information sharing. But additional activity is now needed to maximize the potential for interoperable XML-based computing solutions. Given that the base network protocols lie within the purview of W3C, IANA, IETF, ISO, and similar agencies (each with its own political inertia and operational requirements), what can industry partners do now to capitalize on the openness, flexibility, accessibility, and extensibility of XML as a new network lingua franca?

An emergent strategy -- evidenced recently by initiatives within NIST, the XML/EDI Repository Working Group, the EEMA EDI/EC Work Group, CommerceNet's XML Exchange, and the eCo Framework Project and Working Group -- involves horizontal and vertical industry consortia taking charge within their own intellectual domains. Repository and registry facilities are being announced as key administrative components in XML-based resource management solutions. Having both centralized and distributed (redundant) aspects, an XML registry and repository facility can directly address the challenges to interoperability we have outlined: standards-based descriptive cataloging, namespace management, name registration, resource authentication, robust name resolution, distributed resource maintenance, and education.

Building an XML registry and repository facility to provide these services will require significant capital: human resources, funding, and political consensus. Network design issues and professional cataloging services involve domain expertise in several specialty areas. Paying for this expertise and for the computer facilities will necessitate considerable funding. Many perceive that significant power will accrue to an enterprise which gains public recognition as a central agency administering canonical names and cataloging resources, so an influential administrative role will not be surrendered readily to any single commercial entity. In order to satisfy the political requirement for an industry-neutral registration authority, most of the XML registry and repository initiatives announced so far have a consortial structure, distributing the authority among the stakeholders.

Within several vertical industry areas, XML applications being designed as vocabularies or DTDs already show promise in terms of publishing the semantics of reusable object models. But these models can be judged applicable only within narrow analytical perspectives which reflect processing requirements demanded by the particular application domains. Broader initiatives will be required to make these XML designs and encoded information objects usable across enterprise domains. Who should coordinate this broader effort?

An ideal solution might involve one or more recognized industry consortia (e.g., OASIS - Organization for the Advancement of Structured Information Standards, W3C, GCA, X-ACT) serving as host for an XML registry and repository. This broad-based consortium could work in collaboration (1) with a major university or library agency, which would provide expertise in the area of resource classification/cataloging, and (2) with an industry partner, which would provide expertise in network architecture issues. OASIS might be an ideal candidate, to the extent that a central feature in the OASIS mission is to support interoperable computing solutions in industry settings where standards overlap or leave troublesome gaps. The goal of this XML registry and repository would be to enhance and democratize network access to high-quality "standard" XML-related resources that are legally unencumbered, thus creating a more interoperable network of reusable electronic information.