This issue of XML Daily Newslink is sponsored by:
- Apache FOP (Formatting Objects Processor) Stable Release Version 0.95
- W3C Last Call Review: Content Transformation Guidelines 1.0
- Open Architecture at REST
- Thinking Differently About XML
- The Challenge of Validating XML-in-ZIP File in Place: Use Schematron
- Namespace Documents (Kudos to XHTML)
- Discussion Document: Revising the OAI-ORE Profile of Atom
- Google Upgrades Search Appliance: Bigger Index, New Enterprise Features
Apache FOP (Formatting Objects Processor) Stable Release Version 0.95
Jeremias Maerki, Apache XML Graphics Team Announcement
Developers in the Apache FOP Project have announced the release of FOP 0.95 (stable: 05-August-2008). Changes in FOP 0.95 include: changes to the End-User API; changes to the Code Base; changes to the Font Subsystem; changes to the Image Support; changes to the Layout Engine; changes to Renderers (Output Formats). Apache FOP (Formatting Objects Processor) is a print formatter driven by XSL formatting objects (XSL-FO) and an output independent formatter. It is a Java application that reads a formatting object (FO) tree and renders the resulting pages to a specified output. Output formats currently supported include PDF, PS, PCL, AFP, XML (area tree representation), Print, AWT and PNG, and to a lesser extent, RTF and TXT. The primary output target is PDF. Apache FOP is Open Source software and published under the Apache License v2.0. FOP is the result of a significant redesign effort and implements a large subset of the XSL-FO Version 1.1 W3C Recommendation ("Extensible Stylesheet Language (XSL) Version 1.1"). XSL (FO) "is a language for expressing stylesheets. Given a class of arbitrarily structured XML documents or data files, designers use an XSL stylesheet to express their intentions about how that structured content should be presented; that is, how the source content should be styled, laid out, and paginated onto some presentation medium, such as a window in a Web browser or a hand-held device, or a set of physical pages in a catalog, report, pamphlet, or book." Apache FOP is part of the Apache XML Graphics Project. The Apache XML Graphics Project consists of several sub-projects, each focused on a different aspect of XML Graphics: (1) Apache Batik: A toolkit for Scalable Vector Graphics (SVG), based in Java; (2) Apache FOP: A print formatter and renderer for XSL-FO (FO=formatting objects), based in Java; (3) Apache XML Graphics Commons: A library with various components used by Apache Batik and Apache FOP, written in Java.
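The XSL-FO input that FOP consumes is ordinary XML. As a rough illustration of the structure (built here with Python's standard library rather than FOP's own Java API; the master name "simple" and the page dimensions are arbitrary), the smallest useful document contains one page master, one page sequence, and one block:

```python
import xml.etree.ElementTree as ET

FO = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO)

def minimal_fo(text):
    """Build a minimal XSL-FO tree: one simple-page-master, one
    page-sequence referencing it, and a single fo:block of text."""
    root = ET.Element(f"{{{FO}}}root")
    layout = ET.SubElement(root, f"{{{FO}}}layout-master-set")
    spm = ET.SubElement(layout, f"{{{FO}}}simple-page-master",
                        {"master-name": "simple",
                         "page-height": "29.7cm", "page-width": "21cm"})
    ET.SubElement(spm, f"{{{FO}}}region-body")
    seq = ET.SubElement(root, f"{{{FO}}}page-sequence",
                        {"master-reference": "simple"})
    flow = ET.SubElement(seq, f"{{{FO}}}flow",
                         {"flow-name": "xsl-region-body"})
    block = ET.SubElement(flow, f"{{{FO}}}block")
    block.text = text
    return root

fo_doc = ET.tostring(minimal_fo("Hello, XSL-FO"), encoding="unicode")
```

A file like this, saved as hello.fo, can then be rendered with FOP's command line (e.g., `fop -fo hello.fo -pdf hello.pdf`).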
See also: the FOP version 0.95 release notes
W3C Last Call Review: Content Transformation Guidelines 1.0
Jo Rabin (ed), W3C Technical Report
W3C's Mobile Web Best Practices Working Group has published a Last Call Working Draft for "Content Transformation Guidelines 1.0." The document was produced as part of the Mobile Web Initiative. Since its publication as a First Public Working Draft on 14 April 2008, the "Content Transformation Guidelines 1.0" document has been almost entirely re-written. The guidelines were extended, made more precise, and re-worded for clarity. In particular: (1) The Conformance section was added; (2) The statements were simplified, clarified, and now follow the same normative pattern; (3) The choice of the "X-Device-" + original header name HTTP header naming convention was confirmed; (4) The possible uses of the link element were detailed; (5) The notion of Web Site was introduced. Content Transformation is the manipulation in various ways, by proxies, of requests made to and content delivered by an origin server, with a view to making it more suitable for mobile presentation. The W3C Mobile Web Best Practices Working Group neither approves nor disapproves of Content Transformation, but recognizes that it is being deployed widely across mobile data access networks. The deployments are widely divergent from one another, with many non-standard HTTP implications, and there is no well-understood means either of identifying the presence of such transforming proxies or of controlling their actions. This document establishes a framework to allow such identification and control. The overall objective of this document is to provide a means, as far as is practical, for users to be provided with at least a "functional user experience" of the Web, when mobile, taking into account the fact that an increasing number of content providers create experiences specially tailored to the mobile context which they do not wish to be altered by third parties. Equally, it takes into account the fact that there remain a very large number of Web sites that do not provide a functional user experience when viewed on many mobile devices.
The recommendations in this document refer only to "Web browsing", i.e., access by user agents that are intended primarily for interaction by users with HTML Web pages (Web browsers) using HTTP. Clients that interact with proxies using mechanisms other than HTTP (and that typically involve the download of a special client) are out of scope, and are considered to be a distributed user agent. Proxies operated by, or under the direction of, the operator of an origin server are similarly considered to be a distributed origin server and hence out of scope.
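The "X-Device-" naming convention mentioned in the guidelines lets a transforming proxy forward the original header values alongside the ones it rewrites. A minimal sketch of that copying logic (the function name and the simple dict representation of headers are illustrative, not from the specification):

```python
def preserve_original_headers(headers, modified):
    """Sketch of the X-Device- convention: before a transforming proxy
    rewrites a request header (e.g. User-Agent), it copies the original
    value into X-Device-<original header name> so the origin server can
    still see what the device actually sent."""
    out = dict(headers)
    for name, new_value in modified.items():
        if name in out and out[name] != new_value:
            out["X-Device-" + name] = out[name]  # preserve the original
        out[name] = new_value
    return out

request = {"User-Agent": "SomePhone/1.0", "Accept": "text/html"}
proxied = preserve_original_headers(
    request, {"User-Agent": "DesktopBrowser/5.0"})
```

An origin server that understands the convention can then decide, from X-Device-User-Agent, whether to serve its mobile-tailored experience despite the rewritten User-Agent.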
Open Architecture at REST
Roy T. Fielding, OSCON 2008 Presentation
The presentation is not about open exoskeleton buildings, open sourced architecture, open systems, or even about personal computer open architecture. It talks about principles discussed in a doctoral dissertation written by Peyman Oreizy (2000): "Open Architecture Software: A Flexible Approach to Decentralized Software Evolution." Collaborative open source development emphasizes community and takes advantage of the scalability obtainable through Internet-based virtual organizations. It adapts to the volunteer nature of developers. Open Development + Conway's Law: Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization's communication structure. True open development (a.k.a. Community-driven Design) will only occur when the design of your system reflects the organizational structure of open development... But change is inevitable. Peyman's approach to Open Architecture: (1) Expose the application's architecture to third parties; (2) Allow third parties to evolve the application by changing its architecture; (3) Verify changes against the semantic annotations on the system model. Closed Source Examples: Adobe Plugin Architecture, Apple iPhone Ecosystem. What is common to all of the largest and most successful open source projects? A software architecture designed to promote anarchic collaboration through extensions, while preserving control over the core interfaces. Open Source Examples: Emacs, Apache HTTP Server, Linux, Mozilla Firefox, Eclipse, Apache Sling. The best open source projects have learned the importance of designing an open architecture and the value of decentralized software evolution. [From the dissertation: "Open Architecture Software: A Flexible Approach to Decentralized Software Evolution" describes a novel way of constructing software systems that are easy for other developers to change.
The crux of the technique is to expose the architectural model of a software system (i.e., its component parts, their interconnections, and the implementation mapping) as an explicit and malleable part of the software package. By directly manipulating this architecture, third-party developers can make novel changes to the software system. I also describe a framework for comparing these kinds of decentralized change techniques, including open-source software, software plug-ins, and event-based systems. My dissertation committee consisted of Richard N. Taylor (Chair, UC Irvine), David S. Rosenblum (University College London), and David Notkin (University of Washington).]
Thinking Differently About XML
Norm Walsh, Blog
Having an XML server at my disposal is making me think about XML applications differently. I've been writing XML applications for a long time. [Over the] years, I'd grown to think of XML applications as being primarily operations on some principal document: a book, a web page, an Atom feed, what have you. I'm not saying all XML applications are like this, I'm just saying this is how I tended to think about them. Individual documents were sometimes composed from several files (via entities or XInclude) and some applications operated on a small number of files, but there was always at least some logical sense in which there was 'the main file' and its ancillary files. When my application involved a potentially large number of files, I usually massaged them into a single file and used that as one of my small number of files. All of the many and varied sources of information used to present essays in this weblog, for example, are aggregated into a honking big RDF/XML document, and that document is used as an ancillary resource when formatting the XML for each essay, the 'main file'. One of the demos I constructed to learn more about Mark Logic Server was a 'W3C Spec Explorer'. I took all of the W3C specs and poured them into the server, then set out to write some XQuery code that would allow me to view the specifications by date, by working group, by editor, and through full-text search—or any combination of those options simultaneously. My starting point for building the sort of faceted navigation that I had in mind was the RDF metadata that the W3C provides. Not only was I interested in having better full-text search of the specs for myself, I was also interested in exploring RDF in the server... [But] I had gone out and asked the server to give me a fairly big document and now I was trying to grub around inside it to find stuff. Instead, [someone suggested] I should 'push the constraints to the database'.
Don't just ask the server (the database) for the file; ask it for the actual elements I care about. This did two things: first, it made my searches instantaneous, or so nearly so as to make no difference. Second, it made me start to think very differently about XML applications. The server's universal index over all the content in the database makes it practical (often blindingly fast) to ask questions over an enormous number of documents... [Note: The Mark Logic Server was built from the ground up to operate on XML, so its indexing strategies are designed explicitly to work well with the full richness of mixed content. It's definitely the universal index that makes it possible for the server to provide immediate answers to a large number of query expressions (and different kinds of query expressions simultaneously) that would otherwise be slow to compute.]
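Walsh's "push the constraints to the database" point is independent of any particular server: ask the store directly for the elements you care about rather than fetching a whole document and grubbing through it by hand. A small stdlib illustration of the two styles (the spec markup below is invented for the example; in XQuery against an XML server, the second style would be a path expression such as //editor answered from the index):

```python
import xml.etree.ElementTree as ET

SPEC = """<spec date="2008-08-05">
  <head><editor>A. Editor</editor><editor>B. Editor</editor></head>
  <body><p>Lots of prose irrelevant to the query...</p></body>
</spec>"""

doc = ET.fromstring(SPEC)

# Grubbing: pull the whole document, then walk every node by hand.
editors_by_hand = [e.text for e in doc.iter() if e.tag == "editor"]

# Pushing the constraint down: state which elements you want and let
# the query engine find them (here an XPath-style findall).
editors_by_query = [e.text for e in doc.findall(".//editor")]
```

In a client/server setting the difference is not just stylistic: the declarative query can be answered from the server's index without shipping the whole document to the client at all.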
See also: Dave Kellogg's Blog
The Challenge of Validating XML-in-ZIP File in Place: Use Schematron
Rick Jelliffe, O'Reilly Technical
The issue of how to bundle up and transport XML document sets, with their accompanying stylesheets and media, has a good solution with the XML-in-ZIP approach used by ODF, OOXML, and SCORM. But it was not always clear that this was a good approach: the horrible XML format Microsoft adopted for Word 2003 is a good (I mean a bad) example, one big XML file with media converted to base64 characters and embedded directly—a fat, fragile file that could test the resources of conventional XML implementations. Water under the bridge now. But one of the remaining problems, and one that is very interesting to me, is the issue of validation. When the basic information is kept in a single XML file, validation is reasonably straightforward: structures, ID or key referential integrity, datatyping, co-occurrence constraints, and so on. The current range of schema tools supports these kinds of intra-document invariants quite well... In the processing model [above], the driver code needs to go through the document and either extract all XML files by brute force, or be smart enough to identify just the files of interest for validating, using whatever combination of namespace, relationship type, and phase. This is the kind of thing that, in the DSDL model, we are hoping the validation processing standard will handle: we will be looking at W3C XProc for this. But it may well be that, regardless of how well XProc handles OPC, XLink, XInclude, XForms-in-ODF and other scenarios, Schematron may need some nicer declarative mechanism to handle better access to files in the same ZIP archive and better iteration over a set of links (one and two step), so that the link target becomes a target for validation, perhaps in a different phase. At the moment, the context node for a rule element in Schematron has to be in the XML document provided at invocation time, and I expect that we can do something better.
I think it would be really useful if we could run a Schematron validation on a ZIP archive (or just a directory) and the schema itself was smart and simple enough to handle access to the various parts. Because of the mooted XProc validation framework for DSDL, such a facility does not need to be full-featured; it just needs to concentrate on whether there is any low-hanging fruit that allows clearer schemas for XML-in-ZIP... Anyway, it is still early days on all this. Ideas and other experiences readers have had are most welcome...
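The "validate in place" driver code Jelliffe describes — iterating over the XML parts of a ZIP archive and applying rules to each part without unpacking to disk — can be sketched with stdlib tools. The rule below (every part's root element must be namespace-qualified) is a stand-in for a real Schematron pattern, and the helper names are invented for the example:

```python
import io
import zipfile
import xml.etree.ElementTree as ET

def check_parts(zip_bytes, rule):
    """Apply a rule callable to every XML part in a ZIP archive,
    reading each part in place rather than extracting it first.
    Returns a list of (part name, error message) failures."""
    failures = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            if not name.endswith(".xml"):
                continue  # skip media and other non-XML parts
            root = ET.fromstring(zf.read(name))
            error = rule(root)
            if error:
                failures.append((name, error))
    return failures

def namespaced_root(root):
    # Stand-in for a Schematron assert: root must be in a namespace.
    return None if root.tag.startswith("{") else "root not in a namespace"

# Build a tiny XML-in-ZIP package in memory and validate it.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("content.xml", '<doc xmlns="urn:example"/>')
    zf.writestr("meta.xml", "<meta/>")

failures = check_parts(buf.getvalue(), namespaced_root)
```

A declarative mechanism of the kind Jelliffe wants would push the part selection (by namespace, relationship type, phase) into the schema itself, instead of leaving it to driver code like check_parts.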
Namespace Documents (Kudos to XHTML)
Michael Sperberg-McQueen, Blog
Lately I've had occasion to spend some time dereferencing namespaces and looking at what you get when you do so. If, for example, you have encountered some qualified name and want to know what it might mean, the 'follow-your-nose' principle says you should be able to find out by dereferencing the namespace name. The follow-your-nose principle was introduced to me under that name by Dan Connolly, but I think he'd prefer to think of it as a general principle of Web architecture rather than as an invention of his own. And indeed the "Architecture of the World Wide Web", as documented by the W3C's Technical Architecture Group, explicitly recommends that namespace documents be provided for all namespaces. The upshot of my recent examinations is that for some namespaces, even otherwise exemplary applications and demos fail to provide namespace documents. For others, the only namespace document is a machine-readable document (e.g., an OWL ontology) without any human-comprehensible documentation of what the terms in the namespace are intended to mean; for still others, there is useful human-readable description (sometimes only in a comment, but it's there) if you can find it. And for a few, there is something approaching a document intended to be accessible to a human reader. So far, however, the best namespace document I've seen recently is the one produced by the XHTML Working Group... Kudos to the XHTML Working Group!
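Following your nose is mechanical on the client side: take the namespace part of each qualified name in a document and dereference it as a URL. Collecting those candidate URLs can be sketched with the stdlib (ElementTree stores qualified names in Clark notation, `{uri}local`; the function name is illustrative):

```python
import xml.etree.ElementTree as ET

def namespace_names(xml_text):
    """Collect the distinct namespace names used by element names in a
    document -- the URIs a follow-your-nose reader would dereference in
    the hope of finding a namespace document at the other end."""
    uris = set()
    for el in ET.fromstring(xml_text).iter():
        if el.tag.startswith("{"):           # Clark notation: {uri}local
            uris.add(el.tag[1:].split("}", 1)[0])
    return sorted(uris)

doc = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body><p>hello</p></body>
</html>"""
```

Whether fetching each URI then yields an OWL ontology, a comment, a human-readable page, or a 404 is exactly the variation the post describes.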
See also: Namespaces in XML
Discussion Document: Revising the OAI-ORE Profile of Atom
Michael Nelson, Robert Sanderson, and Herbert Van de Sompel
This discussion document relates to "ORE User Guide: Resource Map Implementation in Atom," published in draft form on June 02, 2008. Herbert Van de Sompel (Los Alamos National Laboratory, Research Library) writes: "As a result of feedback that was provided over the past few weeks via this list, in private communications, and via the blogosphere, we have made a bold move to compile a Discussion Document that outlines a proposal for a significantly different ORE Atom serialization... We understand this proposal comes very late in the ORE process, which is expected to deliver a 1.0 specification by the end of September 2008. Your feedback to this proposal is absolutely crucial. Please use the ORE Google Group to share your insights..." Summary from the discussion document: "This document describes a possible revision of the serialization of Resource Maps in Atom. The core characteristics of the revision are: (1) Convey ORE semantics in Atom as add-ons/extensions to regular Atom Feeds, by introducing explicit ORE relationships instead of by according ORE-specific meaning to pre-defined Atom relationship values, as is the case in the current 0.9 serialization. (2) Express an ORE Aggregation at the level of an Atom Entry, not an Atom Feed; there are no ORE-specific semantics at the Feed level. Best practice in the Atom community regarding the use of Atom for specific applications is to define metadata extensions and new relationship types. The current ORE Profile of Atom has not taken this approach. Instead it uses, for example, existing generic Atom relationships to represent specific ORE relationships. GData makes extensive use of '@rel' attributes and external metadata elements that are extensions to the basic Atom-defined ones. For example, the YouTube Atom Entry links related videos to a separate Atom feed (relative to the entry) with a specially defined '@rel' attribute value...
Also, the Liverpool/HP ORE experimentation project (foresite) reported problems determining which information available for an Aggregation should map to native Atom elements, and which should map to embedded 'rdf:Description' elements. This is especially true when the Atom element can only occur once, yet the predicate in RDF can occur multiple times... The functionality of the Atom Publishing Protocol deemed essential for leveraging ORE Aggregations is geared towards the Atom Entry level... The ORE Google Group discussions revealed the need to convey all Aggregations available from a repository in a variety of Feeds available from the repository: for example, subject-based Feeds, most-recent Feeds, monthly Feeds, my personal Feed, etc...
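The proposal's first point — conveying ORE semantics via explicit relationship values rather than overloading predefined Atom @rel values — would look roughly like this at the Entry level. This is a hedged sketch: the rel URI below is illustrative of the ORE terms namespace, not necessarily the final vocabulary the 1.0 serialization adopted, and the builder function is invented:

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
# Illustrative ORE relationship URI; not confirmed as the final vocabulary.
ORE_AGGREGATES = "http://www.openarchives.org/ore/terms/aggregates"

def aggregation_entry(entry_id, resources):
    """Build an Atom Entry describing an ORE Aggregation, linking each
    aggregated resource with an explicit ORE @rel value instead of
    reusing a generic, predefined Atom relationship."""
    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}id").text = entry_id
    for uri in resources:
        ET.SubElement(entry, f"{{{ATOM}}}link",
                      {"rel": ORE_AGGREGATES, "href": uri})
    return entry

entry = aggregation_entry("urn:example:aggregation-1",
                          ["http://example.org/a", "http://example.org/b"])
```

Because all the ORE semantics live on the Entry, any number of such Entries can be carried in ordinary Feeds (subject-based, most-recent, monthly) with no Feed-level extensions, which is the proposal's second point.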
See also: Peter Keane 'OAI-ORE and Atom'
Google Upgrades Search Appliance: Bigger Index, New Enterprise Features
Eric Lai, Computerworld
Google Inc. has announced a new version of its enterprise-oriented Google Search Appliance that the company said can index up to 10 million documents in a single box (up from 3 million previously) while also giving IT managers more control over the search results that end users see. For instance, the new version lets IT admins set pages with certain metadata, such as a particular author's name, to show up higher in search results. Or they can bias results so that engineers get pages that have particular keywords or come from certain data repositories, while marketers typing in the exact same search terms see different results, according to Nitin Mangtani, Google's lead product manager for enterprise search. Google has also beefed up its policy features, enabling IT managers to create privileges and groups of users that map to an external group policy infrastructure technology such as Active Directory or LDAP, Mangtani said. Thus, an end user who lacks access privileges to a certain document can be blocked from even seeing it in the search results he gets. In addition, the appliance's security features have been improved, Mangtani said. For instance, users now can be invisibly logged in and given the appropriate group policy access rights via Kerberos authentication. The device's pricing remains the same as before, starting at $30,000 for indexing up to 500,000 documents over two years. Google now claims to have a total of 20,000 customers for all three of its enterprise search products, including the Google Search Appliance, the lower-end Google Mini appliance and its Google Site Search hosted service... Mangtani rebutted charges from competitors that the Google Search Appliance, while easy to use and relatively inexpensive, is coasting on Google's reputation in the consumer search space and lacks the robustness of rival enterprise search systems from Microsoft Corp.'s FAST subsidiary, IBM and other vendors. 
Meanwhile, start-ups such as Powerset Inc., which now is also owned by Microsoft, claim that their semantic search capabilities, designed to interpret the grammar of search requests submitted by users, are superior to what Google has to offer...
XML Daily Newslink and Cover Pages sponsored by:
Sun Microsystems, Inc.: http://sun.com
XML Daily Newslink: http://xml.coverpages.org/newsletter.html
Newsletter Archive: http://xml.coverpages.org/newsletterArchive.html
Newsletter subscribe: email@example.com
Newsletter unsubscribe: firstname.lastname@example.org
Newsletter help: email@example.com
Cover Pages: http://xml.coverpages.org/