This issue of XML Daily Newslink is sponsored by:
Sun Microsystems, Inc. http://sun.com
What's New With Apache Solr?
Grant Ingersoll, IBM developerWorks
In this article, Solr and Lucene committer Grant Ingersoll details the improvements in Solr 1.3, including distributed search, easy database imports, integrated spell checking, new extension APIs, and much more. Apache Solr is an open source, primarily HTTP-based, search server based on Apache Lucene. In 2007, Ingersoll introduced Solr to developerWorks readers in the two-part 'Search smarter with Apache Solr' series. With the recent release of Solr 1.3, the time is right to follow up with details about many of the new features and enhancements made since then. Solr contains many enterprise-ready features, such as easy configuration and administration, multiple client language bindings, index replication, caching, statistics, and logging. With the 1.3 release, Solr builds on enormous performance gains in the 2.3 version of Apache Lucene and adds a new, backward-compatible, plug-and-play component architecture. This new architecture has spawned a rush to create new components that further enhance Solr. For example, the 1.3 release contains components for: (1) "Did you mean" spell checking, (2) finding documents that are "More like this", and (3) overriding search results based on editorial input (also known as paid placement). Furthermore, the existing functionality, such as query parsing, searching, faceting, and debugging, has also been componentized, letting you create custom SolrRequestHandlers by chaining these components together. Finally (important to many enterprises), Solr has added the capability to index database content directly and to scale out to support very large systems via distributed search.
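Because Solr is driven mostly over HTTP, the spell checking and distributed search features described above are typically switched on through request parameters. The following is a minimal Python sketch, not taken from the article: the helper name `build_solr_query_url`, the host names, and the `/select` path are illustrative assumptions. It also assumes the request handler on the server has been configured with the spell check component, and that each entry in `shards` points at a reachable Solr core.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url, query, shards=None, spellcheck=False):
    """Build a Solr search URL (hypothetical helper, for illustration only).

    base_url   -- root of a Solr instance, e.g. "http://localhost:8983/solr"
    query      -- user query string, sent as the `q` parameter
    shards     -- optional list of "host:port/path" strings; when given, the
                  request asks Solr to fan the query out to those shards
    spellcheck -- when True, ask for "did you mean" suggestions (assumes the
                  handler has the spell check component configured)
    """
    params = {"q": query}
    if spellcheck:
        params["spellcheck"] = "true"
    if shards:
        # Solr expects a single comma-separated `shards` value.
        params["shards"] = ",".join(shards)
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Example: a misspelled query sent to two assumed shards.
url = build_solr_query_url(
    "http://localhost:8983/solr",
    "apache lucen",
    shards=["host1:8983/solr", "host2:8983/solr"],
    spellcheck=True,
)
print(url)
```

The sketch only builds the URL; sending it and parsing the response are left out so that nothing depends on a running server.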
HTML: The Markup Language (W3C Editor's Draft)
Michael(tm) Smith, W3C Posting
From the WWW-TAG list: "During the face-to-face joint meeting between the TAG and HTML WG in Mandelieu last month, a large part of the discussion concerned the idea of producing a separate normative spec for HTML5 with some of the following characteristics: (1) defines just the syntax, structure, and semantics of the language (2) defines what is conformant for "producers" of HTML content (people/authors and applications, such as editors and content management systems, that produce HTML content) (3) does not define related APIs nor attempt to describe how "consumers" (such as Web browsers) of HTML are meant to process HTML documents (and in general, omits any of the other parts of the current HTML5 draft that cover browser-implementation conformance criteria). Based in part on discussion during that meeting, I've taken a shot at producing a draft of what such a spec might look like: "HTML: The Markup Language" (with abstract) 'This specification describes the fifth major version of the HTML vocabulary. It provides the details necessary for producers of HTML to create conformant HTML documents. By design, it does not describe related APIs nor attempt to describe how consumers of HTML are meant to process HTML documents.'... This message is in part a very initial follow-up on an action item that I took during the meeting, to "lead an HTML WG response to TAG discussion and report back to the TAG at some later time". But this draft document is not in any way a formal response to that action nor a completion of it. It is intended at this point to just provide something concrete that we can use within the HTML WG, and with others outside the group, as a starting point for further discussion about whether we should decide to produce a separate "producers" or "language" spec or not, and if so, should it look anything like this draft, or like something else entirely.
So the draft as yet has no official standing within the HTML WG, and may never (if we decide not to do it, or to replace it with another document that takes a different approach, or whatever). It is at this point just my own attempt at putting together something tangible to help guide further discussion."
See also: the posting
OSGi in the Enterprise
Alex Blewitt, InfoQ
With the recent announcement of GlassFish v3 'Prelude', Sun's OSGi-based Java EE 6 server, the use of OSGi across the enterprise has grown to encompass almost all of the back-end servers. A recent press release by the OSGi Alliance listed the vendors and the technology that uses OSGi: (1) IBM's WebSphere; (2) Oracle's WebLogic; (3) Paremus' Infiniflow Service Fabric; (4) ProSyst's ModuleFusion; (5) Red Hat's JBoss; (6) SpringSource's SpringSource Application Platform; (7) Sun Microsystems' GlassFish Enterprise Server. Peter Kriens noted that JOnAS, arguably the first OSGi-based JEE server, was not mentioned on the list because it is not an OSGi member. He also noted that SAP NetWeaver is moving towards OSGi in the future as well. As previously covered by InfoQ, the main reason these systems are moving to OSGi is to enable greater modularity. This allows a system to be decomposed into more manageable (and testable) units, whilst at the same time providing greater re-use of the component libraries. At the moment, the big players (IBM, Oracle) are using OSGi internally and not exposing that directly to users of the application, but others (SpringSource) are building on the fact that the container itself, and not just the applications, is open to extension. Open-source projects are also moving to OSGi. Spurred on by the Apache Felix OSGi server, other Apache projects are generating OSGi metadata in their builds or moving completely, like Apache Tuscany's recent move. For those open-source projects that don't create metadata at the source, a number of OSGi bundle repositories (SpringSource Enterprise Bundle Repository, OBR, Eclipse Orbit, Felix bundle repository etc.) exist to provide open-source JARs annotated with OSGi metadata. Whether directly or indirectly, the chance of using OSGi in enterprise applications is growing higher.
With the Spring framework becoming a de facto standard for application development and the benefits of the Spring DM server, building dynamic, modular applications is where the industry is headed.
Why Migrate to the Semantic Web?
Richard Hancock, DevX.com
Why should an existing web application, such as CDMS, which currently stores its data in a relational database and presents this information as HTML and PDF documents, migrate to the semantic web? Most web applications are based on HTML documents. The semantic web is based on RDF, the RDF Schema language (RDFS), and the Web Ontology Language (OWL). Additionally, the SPARQL Query Language for RDF is also a part of the semantic web and allows you to query across the semantic web of data. The CDMS application supports quality assurance and compliance checking in the building and construction industries of Australia and New Zealand. It deals with real-world objects—things such as buildings, roads, and infrastructure—as well as the people and organizations involved in building them. Additionally, it needs to incorporate more abstract concepts such as the standards, regulations, and legislation that define the criteria used for compliance checking. Building projects defined in CDMS annotate real world objects with information. RDF is more flexible at annotating objects than a relational database. Some of these annotations include: (1) References to the criteria (e.g., legislation, standards) that form the basis for compliance checking; (2) Linking project participants who may be of the same or different organizations; (3) Location based information...
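The claim that RDF annotates objects more flexibly than a relational schema can be made concrete with a small sketch. The Python below is not RDF tooling and does not reflect the actual CDMS vocabulary: every identifier (`ex:project42`, `ex:checkedAgainst`, and so on) and the `match` helper are hypothetical, chosen only to mirror the kinds of annotations listed above. Statements are stored as (subject, predicate, object) triples, and a pattern with wildcards stands in for a very simple SPARQL-style query.

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard.

    Hypothetical helper standing in for a much richer SPARQL query.
    """
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# All names below are invented for illustration.
triples = [
    ("ex:project42", "ex:concernsObject", "ex:bridge7"),
    ("ex:project42", "ex:checkedAgainst", "ex:standardA"),   # compliance criteria
    ("ex:project42", "ex:hasParticipant", "ex:alice"),       # project participants
    ("ex:alice", "ex:memberOf", "ex:orgX"),
    ("ex:bridge7", "ex:locatedIn", "ex:regionNorth"),        # location information
]

# A new kind of annotation is just one more triple; no table needs altering.
triples.append(("ex:bridge7", "ex:inspectedBy", "ex:alice"))

# Everything said about the bridge, whatever the predicate.
for triple in match(triples, subject="ex:bridge7"):
    print(triple)
```

In a relational design, the `ex:inspectedBy` annotation would usually require a new column or join table; here it is only another row of data, which is the flexibility the article points to.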
Text in PDF Documents
Norman Walsh, Blog
My attention was drawn recently to 'Text Content in PDF Files', by Jim King. If you've ever tried to extract text from a PDF file, you'll appreciate this clear and concise description of why it didn't work. Or at least, didn't work as well as you'd hoped... As a consumer, I would almost always be happier if you simply published your information in reasonably well structured (X)HTML with a little CSS styling to establish whatever look and feel you want. (No Flash, no Silverlight, no PDF, just the content, please). If you've got your information in some more structured XML, I'd probably like access to that too, but publishing arbitrary XML on the web seems not to have taken off. The reason is simple: PDF gives all the formatting control to the author. I'm the reader and I want the control. I want the freedom to flow the text differently to fit on my wide or narrow screen or handheld device, or convert it into an audio format, or do something else that will make the information more useful to me. I think the assertion that 'enough auxiliary information has been defined for PDF files such that a well written program can extract the text content from PDF pages' only tells half the story. It's clear from my experience with PDF files in the wild that there are a lot of PDF generators out there that aren't writing enough auxiliary information... I'll grant that the virtue of PDF is that it allows me to publish page images in a more compressed and flexible format than bitmaps; it's just not a virtue that holds very much appeal to me. I don't think I'd go so far as to say that it's worth the price of obscured content, but what really worries me is the implication that you might put content into PDF intending to get the text back out. I understand the convenience of PDF for publishing page images. I sincerely wish that there was a free, fully conformant XSL FO processor. I think that'd be a big boon for a lot of people.
But for the love of whatever you hold dear, please don't use PDF as an archival format for your content! You want to hold onto originals with richer, more accessible structure. Ideally, you want them to be stored in a widely available XML format. [citation, belatedly]
A Vendor-Neutral Standard For Virtual Machines? There Isn't One.
Charles Babcock, InformationWeek
How's progress coming on a neutral VM runtime format that could be recognized by all the hypervisor vendors? Winston Bumpus, President of the DMTF (Distributed Management Task Force), said: "Nothing is underway at the moment. Nobody's proposed that we undertake that work." Bumpus said his standards body is focused on the Open Virtualization Format (OVF), a common deployment format for virtual machine files; the vendors have agreed only on a common deployment format, not a runtime format. That may be as close to a neutral standard as we're going to get... Why not settle on a neutral runtime format that eliminates the need to convert the alien into a citizen of a particular hypervisor's realm? The vendors could keep their proprietary VMs, but there would be an alternative, the option for a VM to be a citizen of the world with a passport to move around, regardless of governing hypervisor... Bumpus is only being realistic when he says the vendors aren't lobbying for such a thing, and neither the DMTF nor any other standards body is going to impose it on them without their consent. But customers, getting deeper into virtualization by the day, should sit their virtualization salesman down for a heart-to-heart chat about the future and how wise it would be to have a neutral standard available in this area.
See also: on OVF
SAP's Fourth Enhancement Pack Includes Line-Of-Business Offerings
Courtney Bjorlin, SearchSAP.com
SAP has released its fourth enhancement pack, and it's focused on bringing new functionality to many lines of business. New downloadable functionality is now available for finance, human resources, plant and operations, sales and support, procurement, and product development and manufacturing. The biggest improvement in NetWeaver PI 7.1 is the enterprise service repository, a key component that supports the effective creation, management and reuse of a wide range of business metadata, according to Ken Vollmer, principal analyst with Cambridge, Mass.-based Forrester Research. The enterprise service repository (ESR) can now capture more than 15 types of metadata, such as integration scenarios, process component models, service interfaces, global data types, interface mappings and executable integration processes based on the Business Process Execution Language (BPEL) standard. The repository also supports full lifecycle management and impact-analysis features... The new version of NetWeaver PI supports more Web services standards, such as Web services reliable messaging, according to Ken Tsai, director of solution marketing for NetWeaver. There are currently more than 2,800 licensed NetWeaver PI customers. The new version of the Adaptive Computing Controller (ACC) tool, a point of control for assigning resources to run any service on any server at any time, will give IT better visibility into running a virtualized environment. The operator can see everything that's going on in the data center in the ACC central help dashboard. [Note: The Global Data Types are defined using CCTS and the NDR and Core Data Types from UN/CEFACT.]
Selected from the Cover Pages, by Robin Cover
W3C has announced the formation of a new Web Services Resource Access (WS-RA) Working Group as part of the W3C Web Services Activity, chartered through June 30, 2010. The WS-RA Working Group will produce W3C Recommendations for a set of Web Services specifications by refining the WS-Transfer, WS-ResourceTransfer, WS-Enumeration, WS-MetadataExchange, and WS-Eventing Member Submissions contributed in March 2006 and August 2008. The five W3C Member Submissions serving as input specifications were contributed (variably) by BEA Systems, Computer Associates, Fujitsu, Hitachi, IBM, Intel, Layer 7, Microsoft, Progress Software, Red Hat, SAP AG, Software AG, Sonic Software, Systinet (A Mercury Division), Tibco, WSO2, Xerox. Bob Freund (Hitachi) serves as the WG Chair, and Yves Lafon is the W3C Team Contact. The WS-RA Working Group is chartered to standardize a general mechanism for accessing and updating the XML representation of a resource-oriented Web Service and metadata of a Web Service, as well as a mechanism to subscribe to events from a Web Service. Because the submitted specifications are relevant to other standardization efforts that rely on resource access and event subscription mechanisms, the Web Services Resource Access Working Group is expected to complete its work in a timely fashion. The submitted specifications define SOAP-based mechanisms for interacting with the XML representation behind a resource-oriented Web Service, accessing metadata related to that service, as well as a mechanism to subscribe to events related to that resource. New publications expected from the WS-RA Working Group include W3C Recommendations for the output specifications of this Working Group. The Working Group may organize the structure of the specifications into one or more documents. A test suite will be created, intended to promote implementation of the Candidate Recommendation, and to assess interoperability between these implementations. 
The WS-RA Working Group deliverables will be based on W3C Recommendations (SOAP Version 1.2, WS-Addressing 1.0, WSDL 2.0, WS-Policy 1.5) and aligned with ISO 29361:2008 (WS-I BP 1.1). However, the Working Group should also consider conformance to the forthcoming WS-I profiles.
XML Daily Newslink: http://xml.coverpages.org/newsletter.html
Newsletter Archive: http://xml.coverpages.org/newsletterArchive.html
Newsletter subscribe: email@example.com
Newsletter unsubscribe: firstname.lastname@example.org
Newsletter help: email@example.com
Cover Pages: http://xml.coverpages.org/