Cover Pages: XML Daily Newslink: Wednesday, 04 February 2009

A Cover Pages Publication http://xml.coverpages.org/
Provided by OASIS and Sponsor Members
Edited by Robin Cover

This issue of XML Daily Newslink is sponsored by:
Microsoft Corporation http://www.microsoft.com

Headlines

Apache Qpid Community Releases Apache Qpid M4 for AMQP
New W3C SPARQL Working Group Charter
Unstructured Information Management Architecture (UIMA) Version 1.0
Automate Metadata Extraction for Corporate Search and Mashups
CMIS Implementation Project: Integrating External Document Repositories with SharePoint Server 2007
Open Web Foundation (OWF) Publishes Proposed Draft for Specification IPR
Intellectual Property or Intellectual Privilege?
REST for Java Developers: NetKernel, Where SOA Meets Multicore

Apache Qpid Community Releases Apache Qpid M4 for AMQP
Rafael Schloming, Apache Foundation Software Announcement

On behalf of the Apache Qpid development community, Rafael Schloming announced the release of Apache Qpid M4. Apache Qpid is a cross-platform enterprise messaging solution which implements the Advanced Message Queuing Protocol (AMQP). Apache Qpid provides brokers written in Java and C++, and clients in C++, Java (including a JMS implementation), .Net, Python, and Ruby. Context: "Enterprise Messaging systems let programs communicate by exchanging messages, much as people communicate by exchanging email. Unlike email, enterprise messaging systems provide guaranteed delivery, speed, security, and freedom from spam. Until recently, there was no open standard for Enterprise Messaging systems, so programmers either wrote their own, or used expensive proprietary systems. AMQP (Advanced Message Queuing Protocol) is an open standard for Enterprise Messaging designed to support messaging for just about any distributed or business application. Routing can be configured flexibly, easily supporting common messaging paradigms like point-to-point, fanout, publish-subscribe, and request-response. AMQP is implemented in a number of products (e.g., AMQP Infrastructure yum installable on latest three versions of Fedora Linux; OpenAMQ by iMatix; RabbitMQ by Rabbit Technologies / CohesiveFT / LShift; Red Hat Enterprise MRG by Red Hat). Apache Qpid implements the latest AMQP specification, providing transaction management, queuing, distribution, security, management, clustering, federation and heterogeneous multi-platform support and a lot more. And Apache Qpid is extremely fast. Qpid provides two AMQP messaging servers: C++ -- with high performance, low latency, and RDMA support; Java -- fully JMS compliant, runs on any Java platform.
Many patterns may be created from Qpid's messaging topologies: Point-to-point; One-to-many; Pub-Sub; FAST Reliable Messaging (fully reliable transfers between any two peers, which means that you can publish or subscribe to the broker fully reliably without the need for transactions, all done in async mode with the C++ broker); Transactional (two types of transactions in AMQP 0-10, TX and DTX); Transient message delivery; Durable message delivery (a header on each message where the message properties are specified, and messages marked as durable and published to a durable queue are safely stored); Federation (Hub-spoke, Trees, graphs); Store-and-forward..."
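The routing paradigms named above (point-to-point, fanout, publish-subscribe) can be illustrated with a small in-memory model. The following is only a sketch under stated assumptions: it does not use Qpid or any AMQP client library, the `Exchange` class and `topic_match` helper are hypothetical names, and queues are plain `deque` objects. The topic wildcards follow the usual AMQP convention that `*` matches one word and `#` matches zero or more words.

```python
from collections import deque


def topic_match(pattern, words):
    """Match routing-key words against a binding pattern.

    '*' stands for exactly one word, '#' for zero or more words.
    """
    if not pattern:
        return not words
    head, rest = pattern[0], pattern[1:]
    if head == "#":
        # Try letting '#' absorb 0, 1, 2, ... of the remaining words.
        return any(topic_match(rest, words[i:]) for i in range(len(words) + 1))
    if not words:
        return False
    return head in ("*", words[0]) and topic_match(rest, words[1:])


class Exchange:
    """Toy exchange (hypothetical, not a Qpid API): direct, fanout, or topic."""

    def __init__(self, kind):
        if kind not in ("direct", "fanout", "topic"):
            raise ValueError(f"unknown exchange kind: {kind}")
        self.kind = kind
        self.bindings = []  # list of (binding_key, queue) pairs

    def bind(self, queue, binding_key=""):
        self.bindings.append((binding_key, queue))

    def _matches(self, binding_key, routing_key):
        if self.kind == "fanout":
            return True                        # every bound queue gets a copy
        if self.kind == "direct":
            return binding_key == routing_key  # point-to-point style routing
        return topic_match(binding_key.split("."), routing_key.split("."))

    def publish(self, routing_key, body):
        """Deliver body once to each matching queue; return how many got it."""
        reached = set()
        for binding_key, queue in self.bindings:
            if id(queue) not in reached and self._matches(binding_key, routing_key):
                queue.append(body)
                reached.add(id(queue))
        return len(reached)


# Publish-subscribe with a topic exchange: two subscribers, different filters.
nyse_only, all_stock = deque(), deque()
topic = Exchange("topic")
topic.bind(nyse_only, "stock.*.nyse")
topic.bind(all_stock, "stock.#")
topic.publish("stock.ibm.nyse", "tick-1")    # reaches both queues
topic.publish("stock.ibm.nasdaq", "tick-2")  # reaches only all_stock
```

A real broker adds the properties the abstract emphasizes (durability, transactions, federation); the sketch covers only the routing decision.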

See also: the FAQ document

New W3C SPARQL Working Group Charter
Leigh Dodds, Blog

"I was really pleased to see that the charter for the W3C SPARQL Working Group is now public. Having made heavy use of SPARQL in a number of personal and commercial projects there have been a number of pain points which, to date, I've only been able to address by resorting to vendor specific extensions. The key ones for me have been support for aggregates as well as handling of collections and containers. These are all on the list of candidate issues for the Working Group to address, so this is good news..." The WG is chartered through July 31, 2010, with Lee Feigenbaum and Axel Polleres as Co-Chairs. SPARQL (pronounced "sparkle") is an RDF query language with a recursive acronym that stands for SPARQL Protocol and RDF Query Language. It is standardized by the W3C's RDF Data Access Working Group (DAWG) and is considered a component of the semantic web. From the published charter: "SPARQL has become very widely implemented and used since then (and, in fact, even before the specification achieved a W3C Recommendation status). The RDF Data Access Working Group has published three SPARQL recommendations (Query Language, Protocol, and Results Format) in January 2008. Usage and implementation of SPARQL have revealed requirements for extensions to the query langauge that are needed by applications. Most of these were already known and recorded when developing the current Recommendation, but there was not enough implementation and and usage experience at the time for standardization. Current implementation experience and feedback from the user community makes it now feasible to handle some of those issues in a satisfactory manner. The mission of the SPARQL Working Group, part of the Semantic Web Activity, is to produce a W3C Recommendation that extends SPARQL. 
The extension is a small set of additional features that (1) have been identified by the users as badly needed for applications; (2) have been identified by SPARQL implementers as reasonable and feasible extensions to current implementations. Note that a strict backward compatibility with existing SPARQL design should be maintained, and currently no radical redefinition of the SPARQL language is envisaged." Proposed deliverables include the following: "SPARQL Use Cases and Requirements, Version 2", with a list of extension categories to be added to the January 2008 version of SPARQL. The document should also prioritize the items, with the possibility to drop some items from the final Recommendation in case the Working Group runs out of time. This document is not expected to be on a Recommendation track. Also: "SPARQL Query Language for RDF" (new version, Recommendation); "SPARQL Protocol for RDF" (new version, Recommendation); "SPARQL Query Results XML Format" (new version, Recommendation); "Serializing SPARQL Query Results in JSON" (new version, Working Group Note).
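To make the "aggregates" pain point concrete, here is a minimal sketch of what a count-per-group query has to compute over RDF-style triples. It is plain Python, not SPARQL and not any SPARQL engine's API; the `match` and `count_by` helpers and the `ex:` sample data are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical sample data: (subject, predicate, object) triples.
TRIPLES = [
    ("ex:alice", "ex:wrote", "ex:post1"),
    ("ex:alice", "ex:wrote", "ex:post2"),
    ("ex:bob", "ex:wrote", "ex:post3"),
    ("ex:bob", "ex:knows", "ex:alice"),
]


def match(pattern, triples):
    """Yield one variable binding (a dict) per triple matching the pattern.

    Terms starting with '?' are variables; any other term must match exactly.
    A variable used twice in a pattern must bind to the same value both times.
    """
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield binding


def count_by(variable, pattern, triples):
    """Group the bindings by one variable and count the size of each group."""
    return Counter(b[variable] for b in match(pattern, triples))


posts_per_author = count_by("?author", ("?author", "ex:wrote", "?post"), TRIPLES)
```

Without aggregate support in the query language, this grouping step has to happen in client code after the query returns, or through a vendor-specific extension, which is the situation the quoted post describes.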

See also: the W3C SPARQL WG Charter

Unstructured Information Management Architecture (UIMA) Version 1.0
Adam Lally, Karin Verspoor, Eric Nyberg (eds), OS Candidate

OASIS announced the submission of "Unstructured Information Management Architecture (UIMA) Version 1.0" Committee Specification to be considered as an OASIS Standard. Statements of successful use/implementation were provided by IBM, Carnegie Mellon University, Amsoft, Thomson Reuters, and University of Tokyo. The specification was produced by members of the OASIS Unstructured Information Management Architecture (UIMA) Technical Committee, chartered to "generalize from the published UIMA Java Framework implementation and produce a platform-independent specification in support of the interoperability, discovery and composition of analytics across modalities, domain models, frameworks and platforms." UIMA Version 1.0 specification summary: "Unstructured information may be defined as the direct product of human communication. Examples include natural language documents, email, speech, images and video. The UIMA specification defines platform-independent data representations and interfaces for software components or services called analytics, which analyze unstructured information and assign semantics to regions of that unstructured information." Details: "The UIMA specification defines platform-independent data representations and interfaces for text and multi-modal analytics. The principal objective of the UIMA specification is to support interoperability among analytics. This objective is subdivided into the following four design goals: (1) Data Representation. Support the common representation of artifacts and artifact metadata independently of artifact modality and domain model and in a way that is independent of the original representation of the artifact. (2) Data Modeling and Interchange. Support the platform-independent interchange of analysis data (artifact and its metadata) in a form that facilitates a formal modeling approach and alignment with existing programming systems and standards.
(3) Discovery, Reuse and Composition. Support the discovery, reuse and composition of independently-developed analytics. (4) Service-Level Interoperability. Support concrete interoperability of independently developed analytics based on a common service description and associated SOAP bindings."
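The idea of an analytic that "assigns semantics to regions" of an artifact can be sketched as stand-off annotation: the artifact is left untouched and each annotation records a type plus begin and end offsets. This is an illustrative sketch only; the `Annotation` and `AnalysisData` classes and the `annotate_emails` analytic are hypothetical names and are not the data model or interfaces defined by the UIMA specification.

```python
import re
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    type_name: str  # label drawn from some domain model, e.g. "Email"
    begin: int      # offset of the first character of the region
    end: int        # offset one past the last character of the region


@dataclass
class AnalysisData:
    """An artifact plus the metadata that analytics have attached to it."""
    artifact: str
    annotations: list = field(default_factory=list)

    def covered_text(self, annotation):
        return self.artifact[annotation.begin:annotation.end]


def annotate_emails(data):
    """A toy analytic: mark regions that look like e-mail addresses."""
    for m in re.finditer(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", data.artifact):
        data.annotations.append(Annotation("Email", m.start(), m.end()))
    return data


data = annotate_emails(
    AnalysisData("Write to ann@example.org or bob@example.com today.")
)
found = [data.covered_text(a) for a in data.annotations]
```

Because annotations only point into the artifact, several independently developed analytics can add their own types to the same analysis data without interfering with one another, which is the interoperability goal the specification summary describes.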

See also: the OASIS announcement

Automate Metadata Extraction for Corporate Search and Mashups
Dan McCreary, DevX.com

This article shows how to extract document semantics with Apache UIMA. There are some exciting developments in automated metadata extraction and its implication for better semantic search and corporate mashups. Advanced open source tools created by linguists to recognize the meaning of words in documents are now becoming an order of magnitude more cost-effective to use. The arrival of the Apache Unstructured Information Management Architecture (UIMA) framework makes these tools accessible to non-programmers. The addition of semantically precise metadata to documents opens the door for new semantic web applications, including better document search and document mashups... Most internal corporate search engines use "keyword search" technology. It involves finding all the important words in a document and putting them in a central index. When you search for a set of keywords, it finds all documents that have those exact words. The problem is that when people use keywords, they frequently don't use the exact same words that are in the document... Document similarity or "semantic nearness" is a very complex concept. Classification requires you to compare all the entities and words in each document with all the other documents in a collection, and then come up with a mathematical weight to sort all the other documents in a collection that is based on the content of a document. The weights might change every time a single document is added or removed. However, mathematical models based on standard algorithms can perform this calculation... The Apache Foundation decided that unstructured analysis was a critical component for many emergent semantic web applications. With assistance from IBM, the Apache Foundation adopted an innovative Unstructured Information Management Architecture. UIMA is now spurring the growth of a new low-cost high-quality component for creating interchangeable components. UIMA tools have already started to lower the cost of high-quality entity extraction.
UIMA is based around a pipeline approach to performing unstructured analysis in a universal pattern. These pipelines are very similar to the concept of UNIX pipes: small modular tools that read data in from one representation, enrich the data, and then send the output to other tools... UIMA's design supports a potentially large library of pipes that all fit together. Developers should be able to configure and customize these pipes to solve a wide variety of problems, with very little programming. A typical project requires the modification of only a small set of XML configuration files. These configuration files have Eclipse forms front ends, so that the average person can set up and install a UIMA pipeline without ever knowing XML syntax or learning how to use an XML editor. But as of today, the number of components is somewhat small, and you might need to create some customized components to meet your organization's needs... Although this article discusses the process of automated entity extraction in the context of increasing the precision of corporate search, you gain the potential to have much more than that. Now, you have a robust architecture for using low-cost libraries of interoperable tools to perform highly specific analysis on components. In the future, these will be tools that automatically suggest document taxonomies for classifying documents or tools that perform statistical database profiling to suggest data element mappings to your data warehouse. UIMA is starting to open the door to automated metadata extraction for many types of entities in your organization and not just documents.
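The pipes analogy, and the gap between keyword search and entity metadata, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Apache UIMA code: the stage functions, the `pipeline` helper, and the `COMPANY_ALIASES` dictionary are all hypothetical, and the alias lookup is a naive substring test.

```python
from functools import reduce

# Hypothetical alias dictionary; a real entity extractor would be far richer.
COMPANY_ALIASES = {"ibm": "IBM", "big blue": "IBM"}


def tokenize(doc):
    """Stage 1: add a lowercase keyword list, leaving the input untouched."""
    return {**doc, "tokens": doc["text"].lower().split()}


def tag_companies(doc):
    """Stage 2: add canonical company names found via the alias dictionary."""
    text = doc["text"].lower()
    found = sorted({name for alias, name in COMPANY_ALIASES.items() if alias in text})
    return {**doc, "companies": found}


def pipeline(*stages):
    """Compose stages left to right, like a chain of UNIX pipes."""
    return lambda doc: reduce(lambda acc, stage: stage(acc), stages, doc)


analyze = pipeline(tokenize, tag_companies)
result = analyze({"text": "Big Blue announced a new server"})

keyword_hit = "ibm" in result["tokens"]      # keyword search misses the document
entity_hit = "IBM" in result["companies"]    # entity metadata finds it
```

Each stage reads the document, enriches it, and passes it on, so stages can be reordered or swapped without touching the others; that is the property the article attributes to UIMA's configurable pipelines.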

CMIS Implementation Project: Integrating External Document Repositories with SharePoint Server 2007
Trent Swanson, Bhushan Nene, and Scot Hillier; MSDN Library

Summary: "This article shows how to take external repositories (Documentum, Interwoven, etc.) and surface them in SharePoint as if they were a document library. The project goes way beyond a few web parts to integrate capabilities such as workflow and custom metadata. There is also a Code Gallery project with the implementation. Although the project is not a full CMIS implementation, it is based on the CMIS specification..." Details: "When developing an information strategy, organizations often begin by considering the structured data found in line-of-business (LOB) systems. Structured data, however, represents less than one third of the total data in an organization. The vast majority of data lives in unstructured documents such as proposals, purchase orders, invoices, employee reviews, and reports. In the enterprise, documents are stored in many different repositories including enterprise content management applications, enterprise resource planning systems, customer relationship management systems, product life-cycle management systems, custom LOB applications, and file shares. It can be difficult for information workers who want to use the data contained within these documents to locate the documents and integrate them within their daily work. Microsoft Office SharePoint Server 2007 provides a platform for storing, retrieving, and using document-based data within the enterprise... The architecture presented in this article uses the Content Management Interoperability Services (CMIS) standard for accessing enterprise content management (ECM) systems in a way that is platform-independent. CMIS is a standard developed by Microsoft, the EMC Corporation, and IBM that uses SOAP, Representational State Transfer (REST), and Atom Publishing Protocol to enable communication with and between ECM systems. While the sample is not intended to be a complete example of a CMIS implementation, it can be used as a starting point for such an effort...
Accessing the Custom Repository Using WCF Services: The custom repository in the sample is accessed through a set of Windows Communication Foundation (WCF) services. These services implement a portion of the CMIS specification and expose methods for sample operations on the repository. The methods are contained in four services as follows: repository, navigation, object, and versioning. The repository service provides access to functions that apply to the repository as a whole, such as making a connection. The navigation service provides functions for navigating the object hierarchy of the repository to display folders and documents. The object service provides functions for operating on an individual object, such as returning the metadata for an object, uploading, or checking out. The versioning service provides functions for managing the versions of an object. These services provide strong support for the sample, but they are not fully compliant with the CMIS specification. A compliant implementation includes both SOAP and REST endpoints and implements all of the required repository operations. The sample implements only a subset of the available operations. Because the repository uses a custom authentication system, the services and the repository are deployed together as a "self-hosted" WCF service. Self-hosting allows the repository to be deployed as an executable file and to use an authentication scheme that is not based on Windows authentication, by passing credentials along with any method call. This approach enables the custom repository to accurately represent commercial document management systems that use custom security schemes. The sample contains an installer for the host and services. After it is installed, the host can be started and the repository is available..."
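The four-service split described above (repository, navigation, object, versioning), with credentials passed on every call, might be modeled roughly as follows. This is a hedged Python sketch, not the article's WCF code and not the CMIS operation signatures: every class and method name here is hypothetical, and the repository is a simple in-memory stand-in.

```python
class Repository:
    """Hypothetical in-memory stand-in for the sample's custom repository."""

    def __init__(self, users):
        self.users = users        # {username: password}
        self.folders = {"/": []}  # folder path -> list of document ids
        self.properties = {}      # document id -> metadata dict
        self.versions = {}        # document id -> list of contents

    def authenticate(self, credentials):
        # Credentials travel with every call rather than relying on a session.
        user, password = credentials
        if self.users.get(user) != password:
            raise PermissionError("invalid credentials")


class Service:
    def __init__(self, repo):
        self.repo = repo


class RepositoryService(Service):
    """Functions that apply to the repository as a whole."""

    def get_info(self, credentials):
        self.repo.authenticate(credentials)
        return {"folders": len(self.repo.folders),
                "documents": len(self.repo.versions)}


class NavigationService(Service):
    """Functions for walking the folder and document hierarchy."""

    def get_children(self, credentials, folder):
        self.repo.authenticate(credentials)
        return list(self.repo.folders[folder])


class ObjectService(Service):
    """Functions that operate on one object, such as upload and metadata."""

    def create_document(self, credentials, folder, doc_id, content, **metadata):
        self.repo.authenticate(credentials)
        self.repo.folders[folder].append(doc_id)
        self.repo.properties[doc_id] = dict(metadata)
        self.repo.versions[doc_id] = [content]

    def get_properties(self, credentials, doc_id):
        self.repo.authenticate(credentials)
        return dict(self.repo.properties[doc_id])


class VersioningService(Service):
    """Functions for managing the versions of an object."""

    def check_in(self, credentials, doc_id, content):
        self.repo.authenticate(credentials)
        self.repo.versions[doc_id].append(content)
        return len(self.repo.versions[doc_id])  # new version number

    def get_all_versions(self, credentials, doc_id):
        self.repo.authenticate(credentials)
        return list(self.repo.versions[doc_id])
```

As in the sample the article describes, this covers only a subset of operations; a compliant CMIS implementation would also need the SOAP and REST endpoints and the full set of required repository operations.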

See also: Andrew Connell's blog

Open Web Foundation (OWF) Publishes Proposed Draft for Specification IPR
Eran Hammer-Lahav, OWF Legal Affairs Committee Posting

[Certain] Members of the OWF Legal Affairs Committee have drafted an initial proposal covering copyright and patents in specifications created under the Open Web Foundation community spec development model. The proposal (proposed draft) is presented in an "Open Web Foundation Final Specification Agreement" document, edited by Eran Hammer-Lahav. As drafted, the agreement may be signed by an individual or corporate entity, and sets forth the terms under which the signing party makes certain intellectual property rights available to others for the use and implementation of a specification. Under the Copyright Grant: "I grant to you perpetual (for the duration of the applicable copyright), worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license, without any obligation for accounting to me, to reproduce, prepare derivative works of, publicly display, publicly perform, sublicense, distribute, and implement the Specification to the full extent of my copyright interest in the Specification." As to patents ('Patent Non-Assert' and 'Patent License Commitment'), the non-binding commentary clarifies that the agreement "makes necessary patents available under two mechanisms. The first is a patent non-assert, which is intended to be a simple and clear way to assure that the broadest audience of developers and users working with commercial or open source software can implement specifications through a simplified method of sharing of technical assets, while recognizing the legitimacy of intellectual property. It is a way to reassure that audience that the specification can be used for free, easily, now and forever. The second mechanism is an agreement to make necessary patent claims available under reasonable and non-discriminatory terms without royalty (also known as RAND-z), which is a common approach taken by standards bodies. The license itself would be agreed upon between the patent owner and the party wishing to use the patent.
We believe that the non-assert should be sufficient for most purposes, but included the RAND-z provision as a fall back for situations in which the non-assert may not be appropriate or valid. The RAND-z option is also intended to facilitate the potential transition of the specification to a formal standards body." The OWF Legal Affairs Committee as constituted on 2008-09-19 included Ben Laurie (Google), David Rudin (Microsoft), DeWitt Clinton (Google), Eran Hammer-Lahav (Yahoo!), Gabe Wachob, Geir Magnusson (Joost), Jim Jagielski (Apache Software Foundation), Lawrence Rosen (Rosenlaw & Einschlag), Pelle Braendgaard (Stake Ventures), Simon Phipps (Sun Microsystems), and Stephan Wenger (Nokia). Editor's comments in the announcement: "The 'proposed draft' designation means that this is not yet a Committee Draft -- a draft endorsed by the Legal Committee as a whole. It is only a proposal put forward by a few members and guests of the committee. We hope that with an open discussion of the draft by the full committee, we can elevate the draft into a Committee Draft and get it closer to a proposed agreement for the OWF board to vote on. The draft has no author, but I would like to acknowledge the involvement and contribution of the following individuals: DeWitt Clinton, David Recordon, David Rudin, Larry Rosen, Stephan Wenger, Gabe Wachob, Ben Lee, and myself. I would also like to express a special gratitude to David Rudin for his outstanding effort in transforming our collective ideas into a legal document..."

Intellectual Property or Intellectual Privilege?
Simon Phipps, SunMink Blog

"Speaking at conferences like linux.conf.au and OSCON is great fun [but] there are political correctness landmines littering this domain. For example, using the terms 'open source' and 'free software' is often taken as an indication of either one's cluefulness or of one's affiliation to a particular world-view. Personally, I consider the two expressions complementary—open source communities collaborate on a free software commons—but there's rarely a chance to explain that before I speak. An especially frustrating one is the expression 'intellectual property'. The term is used widely in the business and legal communities, and it becomes second nature to speak of patents, copyright, trademarks and trade secrets collectively in this way. The problem with doing so is that the expression is factually wrong, and a legion of open source developers take the use of the phrase 'intellectual property' as a genetic marker for 'clueless PHB-type' at best and 'evil oppressor of geeks' at worst. Why is it wrong? Well, none of those things is really 'property'. In particular, copyright and patents are temporary privileges granted to creative people to encourage them to make their work openly available to society. The 'social contract' behind them is 'we'll grant you a temporary monopoly on your work so you can profit from it; in return you'll turn it over to the commons at the end of a reasonable period so our know-how and culture can grow.' Using the term 'intellectual property' is definitely a problem. It encourages a mindset that treats these temporary privileges as an absolute right. This leads to two harmful behaviours: (1) First, people get addicted to them as 'property'. They build business models that forget the privilege is temporary. They then press for longer and longer terms for the privilege without agreeing in return to any benefit for the commons and society. (2) Second, they forget that one day they'll need to turn the material over to the commons. 
Software patents in particular contain little, if anything, that will be of value to the commons—no code, no algorithms, really just a list of ways to detect infringement..."

See also: the OWF posting

REST for Java Developers: NetKernel, Where SOA Meets Multicore
Brian Sletten, Java World Magazine

This article series provides an introduction to REST and a quick walk-through of Restlet, an API that makes it easier to build and consume RESTful interfaces in Java. In this Part 3 article, you learn about NetKernel, a next-generation software environment that mixes what people like about REST, Unix pipes and filters, and service-oriented architecture (SOA). Not only can NetKernel make you more productive, but the abstractions also allow you to fully use modern, multicore, multi-CPU systems with little to no effort. At a minimum, if you understand NetKernel, you will understand REST at a deeper level... In the object-oriented world, we [use interfaces], but without some extraordinary gymnastics, we lack the freedom to return completely arbitrary object representations... Others have tried to solve the hardware/software mismatch in various ways. Ericsson's Erlang is a language and runtime environment that forces developers to write functional software that scales. The ejabberd XMPP server from ProcessOne is an example of a modern, scalable system built in this manner. Neither the language approach nor the cloud approach solves the caching issues, because they don't let you identify uniquely both behavior and the result of calling that behavior. Some modern languages support a technique called memoization, but it is usually specific to a particular calculation, not a general strategy... NetKernel combines many of these ideas by supporting a URI-based, functional model that might resolve locally or might involve distributed requests across multiple servers completely transparently to the client... The benefits of REST for integrating systems are becoming quite well understood. The idea of taking the ideas inside your software architectures is quite revolutionary. NetKernel isn't necessary for building RESTful systems; it just makes it very easy to do. The beauty of an environment like this is that the same benefits apply inside as outside.
The potential to cache, pick implementation languages, change implementation, and so on are all powerful and they're generally unavailable as easily in any other system. Orchestration across heterogeneous back-end systems is trivial. Changing a locally resolved call to a remote, distributed call can be done on the fly. You can leverage cloud assets if needed without having to design your system around the concept. You are also free to take advantage of modern languages such as Clojure, Groovy, and Scala; domain-specific languages such as XSLT; and legacy code that has been written in Java for years. Dealing with your object representations is also a perfectly reasonable thing to do within NetKernel. It is not that objects are bad or useless; they remain good implementation tools. They simply do not represent a level of abstraction that works well for unifying everything that goes into modern systems. There have been valiant attempts to do so, but so far none have really taken off. The resource-oriented style embodied by NetKernel could be just such a successful attempt. Not only can you easily end up reusing most of what you have committed to in the past (Java, Hibernate, SOAP, JDBC, and so on), but you can start to build new types of systems as you go; it is not an all-or-nothing proposition.
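The caching argument, that a URI can uniquely identify both a behavior and the result of invoking it, can be illustrated with a small resolver that memoizes by URI. This is a sketch of the general idea only and is not NetKernel's API: the `Resolver` class, its `request` method, and the `demo:` URIs are hypothetical, and it assumes every endpoint is a pure function of its URI.

```python
class Resolver:
    """Toy URI resolver with a general, URI-keyed cache (hypothetical API)."""

    def __init__(self):
        self.endpoints = {}  # URI prefix -> function(resolver, remainder)
        self.cache = {}      # full URI -> previously computed representation
        self.calls = 0       # how many times an endpoint actually ran

    def register(self, prefix, func):
        self.endpoints[prefix] = func

    def request(self, uri):
        # The URI names behavior and arguments together, so it can serve as
        # the cache key for any endpoint, not just one calculation.
        if uri in self.cache:
            return self.cache[uri]
        for prefix, func in self.endpoints.items():
            if uri.startswith(prefix):
                self.calls += 1
                result = func(self, uri[len(prefix):])
                self.cache[uri] = result
                return result
        raise LookupError(f"no endpoint registered for {uri}")


resolver = Resolver()
resolver.register("demo:upper/", lambda r, arg: arg.upper())
# An endpoint may issue sub-requests; the caller never sees where they resolve.
resolver.register("demo:twice/", lambda r, arg: r.request("demo:upper/" + arg) * 2)

first = resolver.request("demo:twice/abc")
second = resolver.request("demo:twice/abc")  # served from the cache
```

Because callers only ever hold URIs, the function behind a prefix could be replaced by one that forwards to a remote server without changing any caller, which mirrors the article's point about switching a locally resolved call to a distributed one.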

