Summary: This document, created in 2004, maintains a reference list of technical contributions on "XML and (Data) Compression." The original summary read: 'XML's verbosity is not its most significant "flaw," but it is a legitimate concern in some XML applications. Here are some references to compression techniques optimized for XML-encoded data.'
As of late 2007, research was still being conducted on the topic, including work done by members of the W3C Efficient XML Interchange (EXI) Working Group, successor to the W3C XML Binary Characterization Working Group. The EXI Working Group was chartered "to define an alternative encoding of the XML Information Set that addresses the requirements identified by the XML Binary Characterization Working Group, while maintaining the existing interoperability between XML applications and XML specifications." Details:
The XML Binary Characterization Working Group conducted work from March 2004 to March 2005, gathering information about use cases where the overhead of generating, parsing, transmitting, storing, or accessing XML-based data might be deemed too great within the context of those use cases. XBC also characterized the properties that XML provides, as well as those required by the use cases, and established measurements to help judge whether XML 1.x and alternate encodings provided the required properties. In its XML Binary Characterization document, the XBC Working Group recommended future work.
The task of this [EXI] Working Group is to establish and optimize the performance of an alternate, binary encoding of XML. At the same time, disruption to existing processors and impact on the complex real-world uses of XML must be minimized. The Working Group started by considering existing solutions and evaluated each in terms of implementability and performance against the requirements produced by the XBC Working Group. We gathered a test data set of more than 10,000 documents in roughly 30 XML vocabularies, drawn from a broad range of use case groups such as Scientific, Financial, Electronic (intended for human consumption), Storage (intended as data stores), etc. The existing solutions, and candidate base technologies for a potential EXI format, were then measured against a number of merit criteria within a benchmark framework based on Japex. The first draft of the measurements was presented in July 2006, in the Efficient XML Interchange Measurements Note and Analysis of the EXI Measurements reports... [September 2007 description]
See the XML Compression Bibliography maintained by Greg Leighton (Web Data Management Lab, Department of Computer Science, University of Calgary), particularly for years 2004-current.
[September 2007] "An Analysis of XML Compression Efficiency." By Chris Augeri, Barry E. Mullins, Leemon C. Baird III, Dursun A. Bulutoglu, and Rusty O. Baldwin. In Proceedings of the 2007 Workshop on Experimental Computer Science (ExpCS '07). San Diego, CA, 13-14 June 2007. 12 pages, with 51 references. "XML simplifies data exchange among heterogeneous computers, but it is notoriously verbose and has spawned the development of many XML-specific compressors and binary formats. We present an XML test corpus and a combined efficiency metric integrating compression ratio and execution speed. We use this corpus and linear regression to assess 14 general-purpose and XML-specific compressors relative to the proposed metric. We also identify key factors when selecting a compressor. Our results show XMill or WBXML may be useful in some instances, but a general-purpose compressor is often the best choice." Note: Additional information about the study, including links to the XML corpus used in the paper, is available from Chris Augeri. [This paper was authored by employees of the U.S. Government and is in the public domain; the research is supported in part by the Air Force Communications Agency; cache.]
[August 2007] "Using XML Compression to Increase Efficiency of P2P Messaging in JXTA-based Environments." By Brian Demmings, Tomasz Müldner, Gregory Leighton, and Andrew Young. Proceedings of Extreme Markup Languages 2007 (Montréal, Québec). "P2P (Peer-to-Peer) systems use messaging for communication amongst peers, and therefore the efficiency of messaging is a key concern for any P2P environment, particularly environments with a potentially large number of peers. One popular implementation of a P2P system is JXTA, which uses XML-based messaging. In this paper, we describe how the use of XML compression can increase the efficiency of P2P messaging in JXTA-based environments. In the proposed solution, message elements containing XML data are compressed using an XML-aware compressor. Our design can be used not only for compression but also for other kinds of encodings of XML data, such as encryption. Experimental results demonstrate that our compression technique results in a substantial decrease in message transport time along with a corresponding decrease in the size of messages. Therefore, the application of XML compression for messaging in JXTA-based P2P environments results in an increase in the efficiency of messaging and a decrease in network traffic..." Also in PDF (cache).
[April 09, 2004] "Tip: Compress XML Files for Efficient Transmission. Who Needs Binary XML When We Have Good Compression?" By Uche Ogbuji (Principal Consultant, Fourthought, Inc.). From IBM developerWorks (April 09, 2004). ['Binary XML has generated a lot of talk, and one of the motivators is the need for a less verbose transfer format, especially for use with Web services. One solution that is already at hand is data compression. This tip shows you how to use compression to prepare XML for transmission over Web services.'] "The idea of binary XML has always hung around the margins of XML discourse. XML is very verbose because of its textual heritage and the many rules it imposes for friendliness toward internationalized text. An equivalent binary syntax would be much more compact. In an early (2000) article, 'XML: The future of EDI?', I demonstrated a translation of part of an ANSI EDI X12 purchase order transaction (which is binary) into XML. The XML result was more than eight times the size of the original EDI message; some other XML/EDI pilots saw expansions of only around three times. This verbosity is of some concern for storage of XML, but at least storage is cheap these days. Transmission capacity is usually more limited, and the loudest calls for binary XML have been among those using XML for message transport formats, including some Web services users. One approach to achieving XML compression is to adopt a format that is designed for binary encoding from the start. The leading candidate is ISO/ITU ASN.1, a data transmission standard that predates XML. ASN.1 is being updated with several XML-related capabilities that allow XML formats to be reformulated into specialized forms such as ASN.1 Packed Encoding Rules, which define a very compact binary encoding. OASIS UBL is an example of an XML initiative that has taken the ASN.1 approach to XML data compression...
In general, after gzip is applied to an XML file and the compressed result is then encoded with Base64 for delivery inline in SOAP, the result is often half its original size. This may be enough to meet your needs for space savings in XML Web services. If not, do take a good look at ASN.1..."
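The tip's recipe can be sketched in a few lines of Python; `pack_for_soap` and `unpack_from_soap` are hypothetical helper names, not code from the article. Note that Base64 re-inflates the gzipped bytes by about a third, which is why the end-to-end saving is closer to half than the raw gzip ratio:

```python
import base64
import gzip

def pack_for_soap(xml_text: str) -> str:
    """Gzip an XML document, then Base64-encode it for inline SOAP delivery."""
    compressed = gzip.compress(xml_text.encode("utf-8"))
    return base64.b64encode(compressed).decode("ascii")

def unpack_from_soap(payload: str) -> str:
    """Reverse the pipeline: Base64-decode, then gunzip back to XML text."""
    return gzip.decompress(base64.b64decode(payload)).decode("utf-8")

# A repetitive sample document; real-world savings depend on the data.
xml = "<orders>" + "".join(
    f"<order id='{i}'><sku>ABC-123</sku><qty>2</qty></order>" for i in range(200)
) + "</orders>"

packed = pack_for_soap(xml)
assert unpack_from_soap(packed) == xml
print(f"original: {len(xml)} chars, gzip+base64: {len(packed)} chars")
```

On highly repetitive markup like this the packed form is a small fraction of the original; on already-dense text the Base64 overhead eats more of the gain.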
[June 10, 2003] "Design and Test of the Cross-Format Schema Protocol (XFSP) for Networked Virtual Environment." By Ekrem Serin (Lieutenant Junior Grade, Turkish Navy; B.S., Turkish Naval Academy, 1997). Thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science from the Naval Postgraduate School. March 2003. 149 pages. Referenced by Don Brutzman in a posting to the W3C 'www-tag' list. The thesis presents a "description, design and implementation detail of Cross Format Schema Protocol (XFSP)... With this work we show that a networked simulation can work 24 hours a day and 7 days a week with an extensible, schema-based networking protocol, and that it is not necessary to hard-code and compile the protocols into the networked virtual environments. Furthermore, this thesis presents a general automatic protocol handler for any schema-defined XML document or message. Additionally, this work concludes with the idea that protocols can be loaded and extended at runtime, and can be created with different-fidelity resolutions, resulting in swapping at runtime based on distributed state... This general XML compression scheme can be used for automatic production of binary network protocols, and binary file formats, for any data structures specified in an XML Schema... Ordinarily, XML is not a compact way to express data. Messages written in XML are much larger than a binary equivalent. The technique used to overcome this problem is replacing tags with binary tokens. When an XML tree is parsed for serialization into an output stream, the tags that mark up the data are replaced with their binary equivalents. The end result is a more compact serialized XML tree. As discussed before, the basic idea behind XFSP was XML serialization.
With this approach, XFSP can be used in any application which needs transactions via XML documents, such as XML-RPC (XML Remote Procedure Call), XKMS (XML Key Management Services), XML-DSig (XML Digital Signatures) and XML-Enc (XML Encryption). XFSP can present those transactions in a more compact way... For the XFSP project, semantics is not targeted to be solved; it is generally considered NP-Hard, because semantic definition needs a knowledge domain and AI generation. As described before, run-time extensible syntax is the research question targeted here. To solve this problem, XML Schema is used to define the application-layer protocol between users, and serialized XML data is sent as the payload..." See also the postings of Don Brutzman on the CAD3D Group in Web3D Consortium and the X3D group, which "recommends consideration of technical requirements in the original X3D Binary Compression Requirements and Request for Proposals." [cache]
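The tag-for-token substitution the thesis describes can be illustrated with a short Python sketch. The record markers and the token table here are invented for illustration; XFSP derives its table from the XML Schema:

```python
import struct
import xml.etree.ElementTree as ET

# Record markers, invented for this sketch.
TAG, TEXT, END = 0x01, 0x02, 0x03

def serialize(elem, table, out):
    """Depth-first walk that replaces each element tag with a one-byte token."""
    out += struct.pack("BB", TAG, table[elem.tag])
    if elem.text and elem.text.strip():
        data = elem.text.strip().encode("utf-8")
        out += struct.pack("BB", TEXT, len(data)) + data
    for child in elem:
        out = serialize(child, table, out)
    return out + struct.pack("B", END)

doc = ET.fromstring("<msg><id>42</id><body>hello</body></msg>")
table = {"msg": 0, "id": 1, "body": 2}  # in XFSP this comes from the schema
binary = serialize(doc, table, b"")
print(len(binary), "bytes vs", len(ET.tostring(doc)), "bytes of XML")
```

Even this naive version roughly halves the message size, because every tag name (appearing twice per element in text form) shrinks to one byte.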
[March 25, 2003] "Squeezing SOAP. GZIP Enabling Apache Axis." By Brian D. Goodman (IT Architect, IBM Intranet Technology, IBM). From IBM developerWorks, Web services. ['GZIP encoding over HTTP is pretty much old school. "Been there, done that" is the attitude of most. However, if you have been working with a few of the current SOAP implementations, you'll find that they don't take advantage of it. While knowing they will eventually come around, if you are building real-world Web service solutions and want a performance boost, GZIP is for you.'] "GZIP encoding over an HTTP transport is a well-known technique for improving the performance of Web applications. High-traffic Web sites use GZIP to make their users' experience faster, and compression is widely used to make files smaller for download and exchange. In fact, as far as XML goes, GZIP is not even the latest, cool thing to be doing. Newer technology, like AT&T's XMill, claims twice the compression of GZIP in roughly the same amount of time. GZIP, however, is a core component of the Java platform, and many Web servers have the ability to compress content independent of the files or applications they serve up. For that reason, this article looks at what it takes to use GZIP in conjunction with the Axis SOAP implementation. This has proven useful for projects and solutions that need the extra performance now and for which you are willing to sacrifice time spent integrating with follow-on releases of SOAP implementations later. Furthermore, this article looks at encoding at the servlet level, which will enable you to implement a different content encoding scheme... GZIP encoding over HTTP is part of Web technologies as we know them. Using it in the existing Web service framework is a logical next step. However, solutions are being designed, built, and deployed every day on these SOAP implementations.
In many instances, being able to GZIP encode the SOAP envelope results in faster transaction times with a relatively low overhead. This performance upgrade can be realized today with some simple code modifications. Enabling GZIP encoding in your SOAP environment lets you take advantage of compression today, while patiently waiting for the integration into our favorite implementations..."
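As a rough illustration (in Python rather than Axis's Java, with an invented `gzip_soap_request` helper), the pattern is simply to compress the envelope and flag it with the standard `Content-Encoding: gzip` header, skipping payloads too small to be worth the CPU cost:

```python
import gzip

def gzip_soap_request(envelope: str, threshold: int = 200):
    """Return (headers, body) for a SOAP POST, gzipping large envelopes.

    Hypothetical helper sketching the technique; Axis does the equivalent
    at the servlet/handler level.
    """
    headers = {"Content-Type": "text/xml; charset=utf-8"}
    body = envelope.encode("utf-8")
    if len(body) >= threshold:
        headers["Content-Encoding"] = "gzip"
        body = gzip.compress(body)
    return headers, body

small = "<soap:Envelope/>"
large = "<soap:Envelope>" + "<item>value</item>" * 100 + "</soap:Envelope>"
h1, b1 = gzip_soap_request(small)   # below threshold: sent as-is
h2, b2 = gzip_soap_request(large)   # compressed, header set
print("Content-Encoding" in h1, "Content-Encoding" in h2)
```

The receiving side inspects `Content-Encoding` (and, on responses, honors the client's `Accept-Encoding`) to decide whether to gunzip before parsing.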
[February 2003] SourceForge Project: XML Compression Tools (XMLPPM). See the home page for description: "XMLPPM is a data compression program that compresses XML files from 5 to 30% better than any existing text or XML-specific compressors. It is a combination of the well-known Prediction by Partial Match (PPM) algorithm for text compression, first described by Cleary and Witten in 1984, and an approach to modeling tree-structured data called Multiplexed Hierarchical Modeling (MHM) that I have developed. The XMLPPM source code is part of a project at Sourceforge on XML compression... XMLPPM is based on the expat XML parser library for C and Bill Teahan's C implementation of the PPMD+ algorithms..."
[February 19, 2001] "Intelligent Compression Technologies Releases XML Compressor, XML-Xpress." - "Intelligent Compression Technologies, Inc. (ICT) announced [January 30, 2001] the release of their newest compression software, XML-Xpress. When run head-to-head, XML-Xpress compression software can radically reduce file size over other competing XML compression schemes. As an example, XML-Xpress software achieved lossless compression as high as 34:1 at throughputs up to 9 Mbytes/sec on a test database. In comparison to competitive systems on this same database, XML-Xpress achieved about 81% greater compression than XMill (AT&T's XML compression solution) and 3.14 times that of Zip, while obtaining throughputs 3% faster than Zip and 55% faster than XMill. XML-Xpress is the premier solution for software developers and companies who are distributing information over the Internet via XML (eXtensible Markup Language). XML is rapidly becoming the industry standard format for the exchange and sorting of data between different software applications and platforms. XML uses tags to identify data items. These tags allow users to search, sort, identify and extract data as desired. While the XML format makes the use and interchange of data easier and more user-configurable, XML substantially increases the size of these files over the size when the same data is represented in its raw format. This increase in file size or 'bloat' can represent an average size increase of 400%. This inherent inflation of file sizes is a critical problem when data has to be transmitted quickly or stored compactly. Bill Sebastian, ICT's President, stated, 'We began this project at the request of an application service provider who wanted their on-line service to operate as if it resided on the client's local disk drive.
While their decision to implement XML within their application increased the size of their raw files, it actually made the underlying data more compressible and, as a result of using XML-Xpress, made the overall throughput faster. The same solution is applicable to many other situations. By inserting a transparent compressor between the server and client, you can support the rich feature set of XML in your applications and still achieve a higher transfer rate than if XML were not implemented'."
[February 15, 2001] XML Compression. From Andrew Morrow: "With respect to the compression aspect, you may be interested in looking at Millau, at http://www9.org/w9cdrom/154/154.html. Millau was designed as a binary XML compression method that is schema-aware, and therefore can achieve superior compression to ZIP for XML documents smaller than 5 KB. Their paper also demonstrates parsing performance improvements, and they experimented with an XML-RPC server exchanging Millau-encoded XML." See below.
[February 15, 2001] "Millau: An Encoding Format for Efficient Representation and Exchange of XML over the Web." By Marc Girardot (Institut Eurécom, Sophia Antipolis, France) and Neel Sundaresan (IBM Almaden Research Center, San Jose, California, USA). Presented at WWW9. "XML is poised to take the World Wide Web to the next level of innovation. XML data, large or small, with or without associated schema, will be exchanged between an increasing number of applications running on diverse devices. Efficient storage and transportation of such data is an important issue. We have designed a system called Millau for efficient encoding and streaming of XML structures. In this paper we describe the Millau algorithms for compression of XML structures and data. Millau compression algorithms, in addition to separating structure and text for compression, take advantage of the associated schema (if available) in compressing the structure. Millau also defines a programming model corresponding to XML DOM and SAX APIs for Millau streams of XML documents. Our experiments have shown significant performance gains of our algorithms and APIs. We describe some of these results in this paper. We also describe some applications of XML-based remote procedure calls and client-server applications based on Millau that take advantage of the compression and streaming technology defined by the system... The Millau encoding format is an extension of the WAP Binary XML format. The WBXML (Wireless Application Protocol Binary XML) Content Format Specification defines a compact binary representation of XML. This format is designed to reduce the transmission size of XML documents with no loss of functionality or semantic information. For example, WBXML preserves the element structure of XML, allowing a browser to skip unknown elements or attributes. More specifically, WBXML content encodes the tag names and the attribute names and values with tokens (a token is a single byte).
In WBXML format, tokens are split into a set of overlapping 'code spaces'. The meaning of a particular token is dependent on the context in which it is used. There are two classifications of tokens: global tokens and application tokens. Global tokens are assigned a fixed set of codes in all contexts and are unambiguous in all situations. Global codes are used to encode inline data (e.g., strings, entities, opaque data, etc.) and to encode a variety of miscellaneous control functions. Application tokens have a context-dependent meaning and are split into two overlapping 'code spaces', the 'tag code space' and the 'attribute code space'..."
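A minimal sketch of this tokenization in Python (`END = 0x01` and `STR_I = 0x03` are real WBXML global tokens; the tag code page values and the flat event model are simplified for illustration):

```python
# Global tokens have a fixed meaning in every context; application tag tokens
# draw their meaning from the current code page.
END, STR_I = 0x01, 0x03
TAG_PAGE = {"card": 0x27, "p": 0x20, "br": 0x21}  # illustrative tag code space

def encode(events):
    """Encode a flat list of (kind, value) SAX-like events into WBXML-style bytes."""
    out = bytearray()
    for kind, value in events:
        if kind == "start":
            out.append(TAG_PAGE[value] | 0x40)      # 0x40 bit: element has content
        elif kind == "text":
            out.append(STR_I)                        # inline string follows
            out += value.encode("utf-8") + b"\x00"   # null-terminated
        elif kind == "end":
            out.append(END)
    return bytes(out)

events = [("start", "card"), ("text", "Hello"), ("end", None)]
print(encode(events).hex())
```

Each start tag costs one byte regardless of the element name's length, which is where WBXML's savings on markup-heavy documents come from.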
[February 16, 2001] From Claude Seyrat: About XML Compression and MPEG-7. "Regarding the compression aspect of XML, I would like to mention that MPEG experts are currently working on a standard (MPEG-7) for audiovisual metadata carriage and processing. In MPEG-7, metadata are expressed in XML, and XML Schema is used to define data structures. A large library of types is defined to describe many aspects of a multimedia document, ranging from low-level features (colors, movement, etc.) to high-level description (RDF-like). XML descriptions can be sent as a whole or synchronously streamed within the AV stream (MPEG-2 or MPEG-4). In MPEG-7, XML can be easily encoded, streamed, and filtered. For that purpose MPEG-7 designed a binary encoding format very similar to what you are suggesting and to ASN.1 (but far more XML-oriented). The codec is dynamically generated based on the schema expressed in XML Schema. It accepts every XML Schema feature, such as substitutionGroups, subtyping, choices, sequences, and so on. Compression ratios show that elements can be represented by a few bits on average (sometimes less than one bit); values are coded using specific datatype encodings."
XMILL: An Efficient Compressor for XML Data. By Hartmut Liefke (U. Pennsylvania) and Dan Suciu (AT&T Labs). ['XMILL is an XML compressor achieving twice or better compression ratios than gzip, at about the same speed.'] "We describe a tool for compressing XML data, called XMILL, that usually achieves about twice the compression ratio of gzip at roughly the same speed. The intended applications are XML data exchange and archiving. XMILL does not need schema information (such as a DTD or an XML Schema), but can exploit hints about such a schema in order to further improve the compression ratio. XMILL incorporates and combines existing compressors in order to compress heterogeneous XML data: it uses zlib, the library function for gzip, as well as a collection of datatype-specific compressors. XMILL can be extended with new specialized compressors: this is useful in applications managing XML data with highly specialized data types, such as DNA sequences, images, etc. The paper presents a theoretical justification for the method used, XMILL's architecture and implementation, a new language for expressing the hints about the XML schema, a new technique for path expression evaluation based on "reversed data guides", and a series of experiments validating XMILL on several real data sets." Paper in Postscript. See the web site.
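XMILL's core move, separating the structure stream from the text and grouping similar values into containers before compressing each with zlib, can be caricatured in Python. The `xmill_like` function is a hypothetical sketch; real XMILL groups containers by path and plugs in datatype-specific compressors:

```python
import re
import zlib

def xmill_like(xml_text: str) -> int:
    """Split flat leaf elements into a structure stream plus one text
    container per tag, zlib-compress each, and return the total size."""
    structure, containers = [], {}
    for tag, text in re.findall(r"<(\w+)>([^<]*)</\1>", xml_text):
        structure.append(tag)                       # structure stream
        containers.setdefault(tag, []).append(text) # per-tag text container
    blobs = [zlib.compress("\n".join(structure).encode("utf-8"))]
    blobs += [zlib.compress("\n".join(v).encode("utf-8"))
              for v in containers.values()]
    return sum(len(b) for b in blobs)

xml = ("<row>"
       + "".join(f"<id>{i}</id><name>user{i}</name>" for i in range(500))
       + "</row>")
split_size = xmill_like(xml)
whole_size = len(zlib.compress(xml.encode("utf-8")))
print(split_size, "vs", whole_size, "vs", len(xml), "uncompressed")
```

Grouping similar values pays off most on data-heavy, record-like documents, where each container becomes a highly self-similar stream; on small or irregular documents the per-container overhead can dominate.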
[February 15, 2001] From Tom Bradford: "You may want to look at the compression system I designed for dbXML. The goal behind it is not necessarily to produce incredible compression ratios, but to get the XML into a state where it's in a tokenized tree form, doesn't lose any of the original document structure, can be traversed without having to decompress unneeded nodes, and doesn't have to be parsed. The stream can be used to generate lazy DOM trees and SAX events. It utilizes external symbol tables to perform compression/decompression. These symbol tables are also XML files, and can be compressed using a hardcoded symbol table. It was designed for use with dbXML, but I tried to architect it such that it could easily be extracted and used on its own. The license is LGPL."
XMLZip (from XML Solutions). "XMLZip is a unique XML file compression tool that compresses XML documents based on node levels within the document object model (DOM). Large XML files consume bandwidth on the Internet and large amounts of storage on both the client and server machines. A product catalog represented in XML can easily consume 10 to 100 megabytes of characters. The obvious solution is to compress the entire file through traditional file compression utilities. This approach reduces transmission time and lowers storage requirements, but it also renders the file's DOM API inaccessible. XMLZip reduces the size of XML files while retaining the accessibility of the DOM API, thus allowing applications to access the data in its compressed form. In addition, XMLZip is capable of selective compression and decompression of the files, allowing users to determine the DOM level at compression time." Available for download.
[November 24, 2000] "Compressing XML with Multiplexed Hierarchical PPM Models." By James Cheney (Cornell University, Ithaca, NY). 2000-11-24 or later. "In this paper, we will describe alternative approaches to XML compression that illustrate other tradeoffs between speed and effectiveness. We describe experiments using a variety of text compressors and XMILL to compress a variety of XML documents. Using these as a benchmark, we describe our two main results: an online binary encoding for XML called _Encoded SAX_ (ESAX) that compresses better and faster than existing methods; and an online, adaptive, XML-conscious encoding based on Prediction by Partial Match (PPM) called 'Multiplexed Hierarchical Modeling' (MHM) that compresses up to 35% better than any existing method but is fairly slow. First, of course, we need to describe XML in more detail... [Conclusion:] We established a working XML compression benchmark based on text compression, and found that 'bzip2' compresses XML best, albeit more slowly than 'gzip'. Experiments using the XMILL transformation verified that XMILL speeds up and improves compression using 'gzip' and bounded-context 'ppmd+' by up to 15%, but worsens compression for unbounded-context compressors such as 'bzip2' and 'ppm*'. We presented an alternative, Encoded SAX, which speeds up and improves compression for all compressors, compresses 2-4% better than 'bzip2' does on text XML, and which has the additional advantage of allowing incremental transmission. Finally, we described a new technique called multiplexed hierarchical modeling that combines existing text compressors with knowledge of XML structure. Using the PPMD+ and PPM* models as components, our MHM and MHM* models compress textual XML data about 5% better, and structured data 10-25% better, than the best existing method..."
[March 23, 2000] "XML-Deviant: Good Things Come In Small Packages." By Leigh Dodds. From XML.com. March 22, 2000. [This article has to do with compression and efficient storage/transmission of XML.] "One of XML's strengths is its human-readability. But the consequent verbosity is also one of its weaknesses, according to a growing number of XML developers... Like any textual markup language, XML is verbose. There is a lot of 'redundant' data in an XML document, including white space and element and attribute names. XML documents are therefore a prime candidate for compression. [...] However tiny XML-based traffic is today, if the current rate of adoption continues, XML transmission will be ubiquitous before long. Now may be the best time to consider some wider architectural problems: perhaps it's time to take a break from producing the unceasing flow of new standards. Considering how these standards fit together will reinforce our efforts toward the holy grail of interoperability. Experiences from organizations like MITRE, as well as feedback from developers 'on the factory floor,' will be vital."
[December 17, 1999] "Transcoding on the Fly for the Web." By Nancy E. Dunn and Chris Rumble. From IBM developerWorks (November 1999). "IBM technology (now in beta) serves as a Web intermediary platform for XML and graphics conversions on the fly. A demonstration application shows how this technology makes it possible to convert Web pages (or other files) from one format to another in real time -- without changing the original pages on the Web server. Content providers or conversion service providers can use the technology for adapting Web pages for hand-held devices, for transforming XML data, and for dozens of other applications. In this interview, two IBM researchers explain how Web intermediary technology supports the conversions and provides a rich vein for more Web and XML conversion on the fly. Paul Maglio and Rob Barrett, researchers at the IBM Almaden Research Center near San Jose, have prepared a demonstration of transcoding on the fly. The initial conversions offered include graphics bit-depth and format conversion and XML conversions. Maglio and Barrett talked with developerWorks staff recently to explain how they used Web intermediary technology to support the Web-based conversion application... In the demonstration, we provide a service where people can perform some sample conversions of their data or graphics, either by inputting URLs or uploading content, and then selecting from a short list of transcoding options. We can convert graphics in a few ways -- compression, bit-depth, file format. We also demonstrate XML transcoding via XSL."
[May 04, 1998] "Web Designers Eye XML Data Compression." By Jeff Walsh. In InfoWorld [Electric] Volume 20, Issue 18 (May 4, 1998). Posted at 6:08 PM PT, April 30, 1998. Summary: "With Microsoft and others viewing the Extensible Markup Language (XML) as a way to enable richer client-side processing, many Web designers are concerned with how users will react to additional XML data being sent via the Internet, or if they will notice at all."
Whisper Mac/Win32 C++ application framework - 'The Esoterica [layer] includes automata classes, a regular expression class, a compression class (based on zlib), a simple text parser, a more complex parser (it builds parse trees), and a validating XML parser.'
Receive daily news updates from Managing Editor, Robin Cover.