Cover Pages: XML Daily Newslink: Wednesday, 03 September 2008

A Cover Pages Publication http://xml.coverpages.org/
Provided by OASIS and Sponsor Members
Edited by Robin Cover

This issue of XML Daily Newslink is sponsored by:
Primeton http://www.primeton.com

Headlines

UOML (Unstructured Operation Markup Language) Part 1 Version 1.0
First Public Working Draft: Efficient XML Interchange (EXI) Impacts
Mark Logic Response to 'XML: Good, Bad, ... Bloated?'
XRX: Making OOP Work in XQuery
Updated Version of Unicode Regular Expressions
Orchestration vs. Choreography: Debate Over Definitions
Questions and Answers on Messaging, Real-time, and Grid (MRG)
IETF Updates Sieve Mail Filtering Language (SIEVE) Working Group Charter
Altova Releases New XMLSpy Standard Edition for Basic XML Editing Tasks

UOML (Unstructured Operation Markup Language) Part 1 Version 1.0
Staff, OASIS Announcement

OASIS announced that members of the Unstructured Operation Markup Language eXtended (UOML-X) Technical Committee have submitted an approved Committee Specification for "UOML (Unstructured Operation Markup Language) Part 1 Version 1.0" to be considered as an OASIS Standard. The UOML-X TC was chartered to develop and maintain an XML-based operation interface standard for unstructured documents. The Unstructured Operation Markup Language specification defines an XML schema for universal document operations. The schema is suitable for operating on printable documents (creating, viewing, modifying, and querying information), that is, documents that can be printed on paper, such as books, magazines, newspapers, office documents, maps, drawings, and blueprints, but it is not restricted to these kinds of documents. Several commercial and free applications are already available based on the submitted draft of UOML. OASIS members Primeton Technologies Ltd., Redflag 2000, and Sursen have submitted Statements of Use indicating successful use of the UOML specification. UOML is an interface standard for processing unstructured documents; it plays a role similar to that of SQL (Structured Query Language) for structured data. UOML is expressed in standard XML, featuring compatibility and openness. UOML deals with layout-based documents and their related information (such as metadata, rights, etc.). A layout-based document is two-dimensional, static, paged information, i.e., information that can be recorded on traditional paper. The software that implements the functions defined by UOML is called a DCMS; applications process documents by sending UOML instructions to the DCMS. UOML first defines an abstract document model, then the operations on that model. Those operations include read/write, edit, display/print, query, and security control; they cover the operations required by all different kinds of application software to process documents.
UOML is based on XML description, and is platform-independent, application-independent, programming language-independent, and vendor neutral. This standard does not prevent manufacturers from implementing a DCMS in their own specific way. This specification is the first part of UOML; it defines the operations used to read/write, edit, and display/print layout-based documents.

First Public Working Draft: Efficient XML Interchange (EXI) Impacts
Jaakko Kangasharju (ed), W3C Technical Report

Members of W3C's Efficient XML Interchange Working Group have published a First Public Working Draft for "Efficient XML Interchange (EXI) Impacts." EXI defines a new representation for the Extensible Markup Language (XML) Information Set. While the introduction of EXI has the potential to bring XML to new communities, it can also have adverse effects on the existing XML community. The precise scope of these effects may not be fully knowable in advance, but based on experience with existing binary formats, educated estimates can be made. The main goals of EXI with regard to existing systems are to provide maximally seamless compatibility with XML and to avoid disruption of existing XML technologies and specifications. In particular, EXI should not require modifications to existing XML systems, unless these systems are extended to adopt EXI. The purpose of this document is to identify any immediate impacts that require changes to existing XML-based specifications or XML-using applications. It also identifies cases where changes to existing specifications or applications are not required, but might be desirable to increase efficiency. EXI offers two in-band means to distinguish it from other formats: the mandatory Distinguishing Bits and the optional EXI Cookie. In particular, either of these is sufficient to distinguish EXI from XML when using any conventional character encoding. Assuming such a conventional character encoding, the first octet of an EXI document, whether it is the octet that includes the Distinguishing Bits or the first octet of the EXI Cookie, cannot appear as the first octet of a well-formed XML document. Therefore, an XML processor is required by the XML specification to reject any EXI document immediately upon reading that first octet. XML is often used in conjunction with other protocols and technologies.
In some such cases, in particular the World Wide Web and Web services where HTTP is common, the protocol supports content negotiation to allow applications to indicate which content types and encodings they are prepared to handle. The "EXI Best Practices" document describes how such support can be used to introduce EXI to such an environment with no impact on applications that have not adopted EXI... In processing of incoming transmissions, an application adopting EXI will need to implement an internal mechanism for routing the incoming content to the appropriate processor (XML or EXI), but a non-EXI-aware application can continue using its XML processor for everything. If the communication protocol does not offer any method for content negotiation, it may be that a non-EXI-aware application occasionally gets sent EXI content. In such cases, the aforementioned immediate rejection should be communicated to the sender so that it can avoid sending EXI content to that receiver in the future.
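
The routing step described above can be illustrated with a minimal Python sketch. It is not taken from the Working Draft: the function names `looks_like_exi` and `route` are hypothetical, and the byte values are assumptions based on the EXI format specification, where the EXI Cookie is the four octets "$EXI" and the Distinguishing Bits are the two high-order bits "10" of the first header octet.

```python
def looks_like_exi(data: bytes) -> bool:
    """Guess from the leading octets whether a payload is EXI rather than XML."""
    # Assumption: the optional EXI Cookie is the ASCII string "$EXI".
    if data[:4] == b"$EXI":
        return True
    # Assumption: without a cookie, the first octet starts with the
    # Distinguishing Bits "10", i.e. it lies in the range 0x80-0xBF.
    return len(data) > 0 and (data[0] & 0xC0) == 0x80


def route(data: bytes) -> str:
    """Name the processor ("exi" or "xml") that should receive the payload."""
    return "exi" if looks_like_exi(data) else "xml"
```

Under these assumptions, a payload beginning with `<`, whitespace, or a UTF-8 or UTF-16 byte order mark is routed to the XML processor, because none of those octets has "10" as its two high-order bits and none is "$".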

Mark Logic Response to 'XML: Good, Bad, ... Bloated?'
Dave Kellogg, Blog

GCN ran an article last month, entitled "XML: The Good, The Bad, and the Bloated", about which I wanted to share a few thoughts. [The GCN article noted that] "...XML's flexibility and cross-platform capabilities far outshine any negatives. But if XML files are not properly planned and managed, there is a good possibility that you could experience XML bloat..." Kellogg: "I'd agree that the blind marking of everything in XML can be wasteful. That's why I've long advocated a 'lazy' approach where you first decide application requirements and then create XML tags in order to support them, iterating over time on both the application requirements and the sophistication of the XML to support them. This is opposite to a far-too-common 'big-bang' approach whereby you design 'the ultimate schema,' which can answer virtually any possible application requirement, and then spend enormous time and money first designing it, and then trying to migrate your data/content to it. The problems with the big-bang approach are many: designing the ultimate schema is a Sisyphean task; you spend money investing in XML richness which has no short-term return (i.e., you over-design for the short term); and you lose your budget mid-term because while you're designing perfection, the business has seen no value and loses faith in the project. [...] At Mark Logic, we're trying to [address bloat] in three ways: (1) By delivering a forgiving XML system that accepts content in a rather ragged form, enabling you to ingest XML immediately and begin delivering value against it. (2) By evangelizing a lazy XML enrichment and migration approach that delivers business value faster than big-bang approaches. (3) By delivering a high-performance XML server that ingests and indexes XML in a very efficient way... XML is naturally tree-structured and XML documents are stored as trees. What's more, the element names (i.e., the tags) are typically hashed.
So the 20-character 'publication-author' element name gets hashed to 64 bits once, and every time the tag appears in the corpus only the hash value is stored. So it's not 41K of overhead to 3K of content in the preceding example, it's more 2K to 3K. In fact, by Mark Logic rules of thumb, the picture often looks like: [a] 1MB of text source content, which becomes - [b] 3MB of XML, which becomes - [c] 300K of compressed XML in MarkLogic, which becomes - [d] 1MB of compressed XML + indexes in MarkLogic. Simply put, it's often the case that the content blows up a bit in XML only to be compressed to 1/10th its size, only to be re-inflated through indexing back to its original size..."
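
The tag-hashing idea can be sketched in a few lines of Python. This is only an illustration of the general technique, not MarkLogic's implementation: the helper names `tag_hash` and `stored_tag_bytes` are hypothetical, and the BLAKE2b digest is an arbitrary stand-in for whatever 64-bit hash a real server might use.

```python
import hashlib


def tag_hash(name: str) -> bytes:
    """Map an element name to a fixed 8-byte (64-bit) value."""
    return hashlib.blake2b(name.encode("utf-8"), digest_size=8).digest()


def stored_tag_bytes(tags: list[str]) -> tuple[int, int]:
    """Return (raw_bytes, hashed_bytes) for a sequence of tag occurrences.

    raw_bytes counts every occurrence spelled out in full; hashed_bytes
    counts one 8-byte hash per occurrence plus each distinct name stored once.
    """
    raw = sum(len(t.encode("utf-8")) for t in tags)
    names_by_hash = {tag_hash(t): t for t in tags}  # each name kept once
    dictionary = sum(len(n.encode("utf-8")) for n in names_by_hash.values())
    return raw, 8 * len(tags) + dictionary
```

Calling `stored_tag_bytes(["publication-author"] * 1000)` compares spelling the name out a thousand times against storing a thousand 8-byte hashes plus a single copy of the name; the saving grows with the length of the element name and the number of times it repeats.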

XRX: Making OOP Work in XQuery
Kurt Cagle, O'Reilly Technical

Traditionally, there has been a significant dividing line between object-oriented programming and database development. The concept of object-oriented programming (OOP) was at a fever pitch during the 1980s when SQL was under development, but perhaps because of the influence of languages like COBOL, perhaps because SQL was considered a set language that didn't lend itself well to OOP, or because OOP was still controversial without a sufficient track record, the developers of SQL's core functional language, stored procedures, chose to keep those functions very procedural rather than attempting to put an OOP wrapper around them, in essence making all stored procedures a single library. One of the central problems that this approach has had, however, is that such libraries tend to become quite bulky and disorganized over time, as people add stored procedures for handling just their test case, even when such procedures may already exist... The concept of using XQuery as a mechanism for generating web pages is a comparatively new one in the XML Database and XQuery engine world, but the benefits of doing so should be fairly obvious. Indeed, there's been a new meme that's begun appearing under the heading XRX, which stands for XQuery, REST, and XForms, though that last particular X could also stand, just as effectively, for the XMLHttpRequest object, the central component in the AJAX world. XRX design is obviously quite potent: in essence you are replacing complex database calls using SQL or LDAP with XQuery calls talking to XML collections (either "real" or virtualized from relational databases), then using the filter/sort/page/render pipeline to generate pages and portions of pages. Object Oriented XQuery, as this section points out, does not necessarily provide a way of writing that much less code in the short term, but it does significantly open up ways of reducing the amount of code written for larger projects by more effectively modularizing the code.
Moreover, it provides a very useful "integration point" between people working with SQL, with XML and with OOP languages by providing at least some of the strengths of OOP without necessarily tying it into some of the major weaknesses.

See also: W3C XML Query (XQuery)

Updated Version of Unicode Regular Expressions
Rick McGowan, Unicode Consortium Announcement

Unicode Technical Standard (UTS) #18 "Unicode Regular Expressions" has been published in revised form as of 2008-08-29. Edited by Mark Davis and Andy Heninger, this specification provides guidelines for how to adapt regular expression engines to use Unicode. The changes in this update include: improved end-of-line treatment; clarifying the use of grapheme clusters in regular expressions; a clearer discussion of the importance of the different regex features; updated syntax; and other improvements. Unicode is a large character set, and regular expression engines that are only adapted to handle small character sets will not scale well. Unicode encompasses a wide variety of languages which can have very different characteristics than English or other western European text. So there are three fundamental levels of Unicode support that can be offered by regular expression engines: (1) Level 1: Basic Unicode Support. At this level, the regular expression engine provides support for Unicode characters as basic logical units. This is independent of the actual serialization of Unicode as UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE. This is a minimal level for useful Unicode support. It does not account for end-user expectations for character support, but does satisfy most low-level programmer requirements. (2) Level 2: Extended Unicode Support. At this level, the regular expression engine also accounts for extended grapheme clusters (what the end-user generally thinks of as a character), better detection of word boundaries, and canonical equivalence. This is still a default level [independent of country or language] but provides much better support for end-user expectations than the raw level 1, without the regular-expression writer needing to know about some of the complications of Unicode encoding structure. (3) Level 3: Tailored Support. 
At this level, the regular expression engine also provides for tailored treatment of characters, including country- or language-specific behavior... Level 1 is the minimally useful level of support for Unicode. All regex implementations dealing with Unicode should be at least at Level 1. One of the most important requirements for a regular expression engine is to document clearly what Unicode features are and are not supported. Even if higher-level support is not currently offered, provision should be made for the syntax to be extended in the future to encompass those features.
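
The gap between code-point matching and the canonical equivalence mentioned under Level 2 can be seen in a small Python sketch. It relies only on the standard `re` and `unicodedata` modules; the helper names are hypothetical, and normalizing to NFC before matching is a workaround chosen for illustration, not something UTS #18 prescribes and not a Level 2 implementation.

```python
import re
import unicodedata

COMPOSED = "\u00e9"      # "é" as a single code point
DECOMPOSED = "e\u0301"   # "e" followed by a combining acute accent


def code_point_fullmatch(pattern: str, text: str):
    """Match code point by code point, as Python's re module does."""
    return re.fullmatch(pattern, text)


def nfc_fullmatch(pattern: str, text: str):
    """Normalize pattern and text to NFC first, so that canonically
    equivalent spellings of a character compare equal."""
    return re.fullmatch(
        unicodedata.normalize("NFC", pattern),
        unicodedata.normalize("NFC", text),
    )
```

With plain code-point matching, the pattern `.` matches `COMPOSED` but not `DECOMPOSED`, even though a reader sees the same character in both. After normalization both spellings match. Grapheme clusters that have no precomposed form still span several code points, so this approach only approximates what an engine with extended grapheme cluster support would do.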

Orchestration vs. Choreography: Debate Over Definitions
Boris Lublinsky, InfoQ

With increasing attention given to SOA, it becomes more and more important to standardize (give precise meaning to) the terminology used. An interesting discussion on the Yahoo SOA Tech Forum illuminates that point. The discussion was started by Michael Poulin, who asked about the difference between "orchestration" and "choreography", hoping to be pointed to a place where the difference had been clearly articulated. Although the Merriam-Webster dictionary definition does not really help to clarify the differences between orchestration and choreography in IT, it was implicitly used by many participants in the discussion. Anne Thomas Manes continued her explanation by referring to the meanings used in existing WS-* specifications, namely Business Process Execution Language (BPEL) and Web Services Choreography Description Language (WS-CDL): 'Orchestration refers to automated execution of a workflow, i.e., you define workflow using an execution language such as BPEL, and you have an orchestration engine to execute the workflow at runtime. An orchestrated workflow is typically exposed as a service that can be invoked through an API. It does not describe a coordinated set of interactions between two or more parties. Choreography refers to a description of coordinated interactions between two or more parties. For example, you request a bid, I return a quote, you submit a purchase order, I send you the goods...' A slightly different point of view is expressed by John Evdemon, classifying the two terms based on their visibility: 'Orchestrations describe what an overall process appears to do without specifying how any of it is implemented. I view choreography as a form of peer-to-peer interaction because there is no "conductor". The choreography is an agreed-upon model for interactions that may consist of a series of orchestrations. From a B2B perspective, orchestrations are intra-organization while choreographies are inter-organization.
Put more simply, one organization does not orchestrate another...' This discussion is just one example of a situation that is becoming more and more common in SOA and in IT in general: people use the same words while really meaning different things, or keep arguing because they use different words although in reality they are in complete agreement.

Questions and Answers on Messaging, Real-time, and Grid (MRG)
Bascha Harris and Brian Che, Red Hat Magazine

Red Hat announced the release of a product called MRG — a computing platform that features high-speed messaging and allows high-throughput computing, realtime transactions, and workload management. In this interview, Brian Che, the project manager for MRG, answers a few questions. "Messaging is at the heart of enterprise computing. We had needs for messaging infrastructure at Red Hat—for building out our own capabilities around things like virtualization management. Many of Red Hat's customers were asking us to provide an open source messaging offering. So, we started working on the AMQP specification and our messaging implementation, even though we didn't know it was going to end up in something called 'Red Hat Enterprise MRG'... We saw significant opportunities for building out fundamentally new capabilities by integrating messaging, realtime, and grid into one platform. And so, MRG was born. We released MRG v1 at the Red Hat Summit on June 19, 2008. MRG v1 offers support for messaging and realtime, and grid is in Technology Preview. We'll release a 1.1 update to MRG that will bring grid into full support as well... JP Morgan Chase, like other investment banks, uses messaging for everything from executing stock trades to providing feeds of market data to internal data distribution. They currently send over 1 billion AMQP messages per day. Realtime provides deterministic performance. The US Navy is deploying realtime in its DDG 1000 naval destroyers. Realtime is critical in this environment, because the ships' computers have to respond precisely without ever pausing, freezing, or getting out of sync with other events. Otherwise, the results could be disastrous. One of our large manufacturing customers has been working with Red Hat to build an on-demand grid in Amazon's EC2 cloud environment for the times it needs access to a grid for calculations. 
Because this customer isn't able to fully utilize a dedicated grid, having the option to deploy a grid in the cloud provides them significant cost savings and flexibility... Many of our largest customers are MRG early adopters, such as investment banks like JP Morgan Chase, telco companies like Alcatel-Lucent, and multiple agencies in the US Government. We are also working across oil/gas, animation studios, Internet, shipping, stock exchanges, defense, travel, and so on... Advanced Message Queuing Protocol (AMQP) seems to be an important standard for bearing data quickly, and its terms indicate that it is an open standard, much like ODF... One of the significant things about AMQP is that it is the first protocol standard for business messaging. All other standards, like JMS, aren't comprehensive enough and don't specify down to the wire level to provide true interoperability and an open ecosystem. So, I'm not concerned about competing standards; there aren't really any right now. That's why there is so much interest in AMQP. I think that most big businesses will understand and appreciate what AMQP has to offer. Notably, many of the big businesses driving AMQP are not vendors but users. Eventually, if you want to work with these users, you're going to have to adopt AMQP..."

See also: AMQP references

IETF Updates Sieve Mail Filtering Language (SIEVE) Working Group Charter
Staff, IETF Announcement

IETF has announced the re-charter of the SIEVE Working Group in order to finish work on existing in-progress Working Group documents ('draft-ietf-sieve-notify-mailto', 'draft-ietf-sieve-mime-loop', 'draft-ietf-sieve-refuse-reject') as well as to finalize and publish several SIEVE extensions as proposed standards. The principal RFC for "Sieve: An Email Filtering Language" (RFC 5228) describes a language for filtering email messages at time of final delivery. It is designed to be implementable on either a mail client or mail server. It is meant to be extensible, simple, and independent of access protocol, mail architecture, and operating system. It is suitable for running on a mail server where users may not be allowed to execute arbitrary programs, such as on black box Internet Message Access Protocol (IMAP) servers, as the base language has no variables, loops, or ability to shell out to external programs. One of the extensions to be completed is "Sieve Email Filtering - Sieves and Display Directives in XML." This Internet Draft describes a way to represent Sieve email filtering language scripts in XML. Representing sieves in XML is intended not as an alternate storage format for Sieve but rather as a means to facilitate manipulation of scripts using XML tools. Some user interface environments have extensive existing facilities for manipulating material represented in XML. While adding support for alternate data syntaxes may be possible in most if not all of these environments, it may not be particularly convenient to do so. The obvious way to deal with this issue is to map sieves into XML, possibly on a separate backend system, manipulate the XML, and convert it back to normal Sieve format. The fact that conversion into and out of XML may be done as a separate operation on a different system argues strongly for defining a common XML representation for Sieve. 
This way different front end user interfaces can be used with different back end mapping and storage facilities... This specification defines an XML representation for sieve scripts and explains how the conversion process to and from XML works. The XML representation is capable of accommodating any future Sieve extension as long as the underlying Sieve grammar remains unchanged. Furthermore, code that converts from XML to the normal Sieve format requires no changes to accommodate extensions, while code used to convert from normal Sieve format to XML only requires changes when new control commands are added—a rare event. An XML Schema and sample code to convert to and from XML format are also provided in the appendices.

Altova Releases New XMLSpy Standard Edition for Basic XML Editing Tasks
Staff, Altova Software Announcement

Altova, maker of an industry-leading XML editor and other popular XML, data management, UML, and Web services tools, has announced the availability of Altova XMLSpy 2008 Standard Edition. The new entry-level XML editor allows users to view and validate XML, DTD, XML Schema, XSLT, XQuery and other files, as well as perform basic editing tasks. A 30-day free trial version of Altova XMLSpy is available for download. XMLSpy 2008 Standard Edition provides numerous helpful features for viewing files and performing lightweight editing on XML documents as well as HTML and CSS files. Advanced Text View supports pretty printing, syntax coloring, line numbering, source folding, code completion, and more, to help users easily navigate and edit XML files. Strong well-formedness checking and validation features, as well as context-sensitive entry helpers, guide users and ensure valid edits. When users need to start a project from scratch, XMLSpy Standard Edition can also generate a sample XML instance document from a DTD or XML Schema. Users may also view XML, DTD, and XML Schema files in Grid View and/or XMLSpy's famous graphical XML Schema View to visualize data according to the document's hierarchical structure... XMLSpy 2008r2 contains a number of advanced optimizations for working with very large files. These result in a reduction of memory consumption by up to 75-80% compared to the previous version when opening and validating XML documents in text view. This means that you can now open and work with files that are about 4-5 times larger than those supported in the past. Enhanced support for large files is especially helpful for developers working with large amounts of data in the context of database applications, financial services, data gathering, enterprise data integration, and so on.
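
The link between the two figures in the announcement (75-80% less memory, files 4-5 times larger) follows from simple arithmetic, sketched below. The sketch assumes that memory use grows linearly with file size, which the announcement does not state, and the function name `size_factor` is hypothetical.

```python
def size_factor(reduction_percent: int) -> float:
    """How much larger a file fits in the same memory after the reduction.

    Assumes memory use is proportional to file size: if each byte of input
    needs only (100 - reduction_percent)% of the memory it needed before,
    the same memory holds 100 / (100 - reduction_percent) times as much input.
    """
    return 100 / (100 - reduction_percent)
```

Under that assumption, `size_factor(75)` gives 4.0 and `size_factor(80)` gives 5.0, matching the "about 4-5 times larger" claim.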

See also: XML Schema Languages

