Document Revision: | $Revision: 1.5 $ |
---|---|
Date: | $Date: 2001/11/26 16:33:37 $ |
This document contains an architectural overview of version 1.1beta of the NewsML Toolkit, a Java library for processing NewsML documents (for more information on NewsML, see the IPTC NewsML Web Site). Additional information on the toolkit is available in the testing document and in the JavaDoc API documentation, and a simple demo application is also available.
The document is intended for people who plan to test, modify, or extend the toolkit itself. Readers should also refer to the JavaDoc documentation, which provides detailed coverage for each of the interfaces. The document assumes a good knowledge of Java and NewsML and at least some familiarity with XML and the common XML processing interfaces in Java. If you are not an XML or NewsML specialist and simply need to use the NewsML Toolkit to develop a NewsML application, it is not necessary for you to read this document.
The toolkit contains a collection of Java interfaces for the different structures that can appear in a NewsML document; these interfaces hide all of the details of XML processing, so that a Java programmer with little or no knowledge of XML markup can write programs to extract information from NewsML packages.
Like any markup specification, NewsML actually contains two parts:
Developers of programs to work with NewsML need to have at least some familiarity with the first part -- they need to know, for example, that a NewsItem contains a NewsComponent, and that a NewsComponent can contain several other types of nodes -- but there is no reason that they need to learn the second part, since it can be handled automatically by an XML-aware Java library like the NewsML Toolkit.
The toolkit contains many interfaces, but it is designed so that it can be learned incrementally: a developer approaching the toolkit and NewsML for the first time should be able to create useful applications after learning only a few key interfaces such as NewsMLFactory, NewsML, NewsItem, NewsComponent, and ContentItem.
This section describes the major stylistic design principles behind the NewsML Toolkit interfaces, including naming conventions.
Where possible, interface names correspond directly to XML element names in the NewsML specification.
The names of the base interfaces, which do not correspond directly with XML element names in the NewsML specification, all end with the word "Node".
Where multiple XML element types in the NewsML specification contain exactly the same functionality, the toolkit uses only one Java interface to represent all of them; for example, the FormalName interface is used for the NewsML XML elements MediaType, Format, MimeType, Notation, Language, Genre, OfInterestTo, MetadataType, Role, NewsService, and many others.
The accessor method name for a single object is usually the corresponding XML element name (where applicable) with the word "get" prepended, as in Text getSystemIdentifier(), except where using that name might cause conflict or confusion.
There are three methods for accessing repeatable objects:
A counter method returning the number of objects available. Its name consists of the XML element name with the word "get" prepended and the word "Count" appended, as in the following example:
public int getContributorCount()
An indexed accessor taking an integer argument and returning the object at that position. Its name consists of the XML element name with the word "get" prepended, as in the following example:
public PartyList getContributor (int index)
An array accessor, returning an array of all available objects of the requested type. Its name also consists of the XML element name with the word "get" prepended, but it does not take any arguments, as in the following example:
public PartyList [] getContributor ()
An accessor method returns null to signal failure, either because the object is not present in the NewsML package, or, in the case of an indexed accessor, because there is no object available at the specified index. The indexed accessors do not throw an ArrayIndexOutOfBoundsException.
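The three accessor forms and the null-on-failure rule can be illustrated with a minimal, self-contained sketch. The PartyStub and Contributors types and the SimpleContributors class below are hypothetical stand-ins written for this example; they are not the toolkit's own PartyList interface or any of its classes, and only the accessor names follow the convention described above.

```java
import java.util.ArrayList;
import java.util.List;

public class RepeatableAccessorSketch {

    /** Hypothetical stand-in for a toolkit node type (not the real PartyList). */
    interface PartyStub {
        String getName();
    }

    /** Hypothetical stand-in exposing the three accessor forms. */
    interface Contributors {
        int getContributorCount();            // counter method
        PartyStub getContributor(int index);  // indexed accessor
        PartyStub[] getContributor();         // array accessor
    }

    /** In-memory implementation, for illustration only. */
    static final class SimpleContributors implements Contributors {
        private final List<PartyStub> parties = new ArrayList<>();

        SimpleContributors(String... names) {
            for (String name : names) {
                parties.add(() -> name);
            }
        }

        public int getContributorCount() {
            return parties.size();
        }

        public PartyStub getContributor(int index) {
            // Return null instead of throwing when the index is out of range.
            return (index >= 0 && index < parties.size()) ? parties.get(index) : null;
        }

        public PartyStub[] getContributor() {
            return parties.toArray(new PartyStub[0]);
        }
    }

    /** Caller-side handling: a null result means "nothing at that position". */
    static String describe(Contributors source, int index) {
        PartyStub party = source.getContributor(index);
        return (party == null) ? "(none)" : party.getName();
    }

    public static void main(String[] args) {
        Contributors source = new SimpleContributors("Reuters", "AFP");
        for (int i = 0; i <= source.getContributorCount(); i++) {
            System.out.println(i + ": " + describe(source, i));
        }
    }
}
```

Because the indexed accessor returns null rather than throwing, callers that loop past the end (as main does deliberately here) need only a null check.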
The early implementation of the NewsML Toolkit is based on the Document Object Model (DOM) interface available from the World Wide Web Consortium (W3C). The DOM is a fairly easy-to-use interface supported by many vendors and open-source developers, but DOM implementations tend to run slowly and use a lot of memory; as a result, the NewsML Toolkit is designed so that other backends can be plugged into an application for different purposes, as in the following examples:
As new backends become available, application developers will need to be able to switch to them without rewriting hundreds or thousands of lines of code. The NewsML Toolkit helps developers to write robust applications by dividing the implementation from the interfaces that the application sees: for example, there is a DOMNewsComponent class that implements a news component on top of the DOM interface, but applications do not refer to the class directly; instead, they work with the NewsComponent interface. If the application developer wants to switch to a different, non-DOM-based backend, the new class will still implement the NewsComponent interface, and all the code for processing news components will continue to work unmodified.
The abstract NewsML application programming interfaces appear in the package org.newsml.toolkit; the concrete classes for the DOM backend appear in the package org.newsml.toolkit.dom.
Of course, at some point the application still needs to specify what backend it is using. It does so by invoking a static factory method from a backend-specific class that implements the common NewsMLFactory interface. Currently, only the DOMNewsMLFactory class is available, but others may appear in the future.
For example, the following fragment creates a new NewsML root object using the DOM backend:
    NewsMLFactory factory = new DOMNewsMLFactory(new XercesDOMFactory());
    try {
        NewsML newsml = factory.createNewsML("http://sample.org/news/story01.xml");
    } catch (IOException e) {
        System.err.println("Failed to load NewsML package");
    }
This should be the only place in the application that refers directly to the DOM backend; everything else will work through the generic interfaces.
Note that in the case of the DOM backend, it is necessary to supply a domFactory argument to the constructor, so that it knows what DOM implementation you want to use (there are many available in Java). The DOMFactory interface is yet another factory interface, this time, one specific to the DOM backend. A default implementation for the DOM implementation in the Apache Xerces XML parser is provided in the XercesDOMFactory class, but it is simple to write new factory classes for other DOM implementations.
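The separation between generic interfaces and backend-specific factories described in this section can be sketched in miniature as follows. Every name in the block (Component, ComponentFactory, TreeBackendFactory, StreamBackendFactory) is a hypothetical stand-in chosen for this example; none of them is a toolkit class, and the sketch does not reproduce the real NewsMLFactory or DOMFactory signatures.

```java
public class BackendSwitchSketch {

    /** Stand-in for a generic interface that application code works with. */
    interface Component {
        String getHeadline();
    }

    /** Stand-in for the common factory interface shared by all backends. */
    interface ComponentFactory {
        Component create(String source);
    }

    /** Hypothetical tree-based backend. */
    static final class TreeBackendFactory implements ComponentFactory {
        public Component create(String source) {
            return () -> "tree:" + source;
        }
    }

    /** Hypothetical alternative backend. */
    static final class StreamBackendFactory implements ComponentFactory {
        public Component create(String source) {
            return () -> "stream:" + source;
        }
    }

    /** Application code: refers only to the interfaces, never to a backend. */
    static String process(ComponentFactory factory, String source) {
        Component component = factory.create(source);
        return "Processed " + component.getHeadline();
    }

    public static void main(String[] args) {
        // The only line that names a concrete backend; swapping it for
        // StreamBackendFactory leaves process() untouched.
        ComponentFactory factory = new TreeBackendFactory();
        System.out.println(process(factory, "story01"));
    }
}
```

Keeping the concrete class name on a single line is what makes a later backend switch a one-line change.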
NewsML is a complex specification, and that complexity is mirrored in the large number of interfaces in the NewsML Toolkit itself. The NewsML XML document type contains many structures that are almost but not quite the same, and those slightly divergent structures can make it difficult to write generalised, reusable code to process NewsML documents.
To help alleviate this problem, the NewsML Toolkit contains a series of more abstract interfaces that capture the simple, common patterns that do appear. These base interfaces do not correspond directly with specific markup structures, but they capture similar patterns that make up parts of many different substructures. These interfaces make it possible to write reusable code for common situations, such as navigating through the main structure of a NewsML document or processing a series of comments; they also simplify the process of learning both the NewsML XML document type and the Java interfaces in the toolkit.
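As a rough illustration of how a base interface enables reusable code, the sketch below processes comments from two unrelated node types through one routine. All of the types are hypothetical stand-ins: the method names getCommentCount() and getComment(int) are assumed from the toolkit's naming convention rather than taken from the CommentNode API documentation, and comments are simplified to plain strings.

```java
import java.util.ArrayList;
import java.util.List;

public class BaseInterfaceSketch {

    /** Stand-in base interface; method names are assumed, not confirmed. */
    interface CommentNodeStub {
        int getCommentCount();
        String getComment(int index);
    }

    /** Two otherwise unrelated stand-in node types sharing the base interface. */
    interface NewsItemStub extends CommentNodeStub {
        String getIdentifier();
    }

    interface TopicSetStub extends CommentNodeStub {
        String getLabel();
    }

    /** Shared array-backed comment storage, for illustration only. */
    static class ArrayComments implements CommentNodeStub {
        private final String[] comments;

        ArrayComments(String... comments) {
            this.comments = comments.clone();
        }

        public int getCommentCount() {
            return comments.length;
        }

        public String getComment(int index) {
            return (index >= 0 && index < comments.length) ? comments[index] : null;
        }
    }

    static final class SimpleNewsItem extends ArrayComments implements NewsItemStub {
        SimpleNewsItem(String... comments) { super(comments); }
        public String getIdentifier() { return "item-1"; }
    }

    static final class SimpleTopicSet extends ArrayComments implements TopicSetStub {
        SimpleTopicSet(String... comments) { super(comments); }
        public String getLabel() { return "topics"; }
    }

    /** Written once against the base interface; works for any node that has comments. */
    static List<String> collectComments(CommentNodeStub node) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < node.getCommentCount(); i++) {
            String comment = node.getComment(i);
            if (comment != null) {
                result.add(comment);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(collectComments(new SimpleNewsItem("checked", "embargoed")));
        System.out.println(collectComments(new SimpleTopicSet("draft")));
    }
}
```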
The following subsections describe the base interfaces in detail.
This interface corresponds to the %assignment; parameter entity in the NewsML DTD: all interfaces that provide information about the authority and circumstances under which information was provided extend this interface.
The methods of an assignment node can answer the following questions:
For more information, see the API documentation for AssignmentNode.
This is the base interface for most other interfaces in the NewsML Toolkit (that is, most NewsML objects can be cast to a BaseNode). Exceptions are interfaces like HeadLineGroup and SubjectCodeItem, which represent ordering patterns rather than actual XML elements.
The methods of a base node can answer four questions:
The methods also allow the user to remove the object from the NewsML tree and to serialize it in XML to a string or character stream.
For more information, see the API documentation for BaseNode.
This interface applies to all objects that contain a resource catalog. A catalog node's method can answer the following question:
For more information, see the API documentation for CatalogNode.
This interface applies to all objects that can contain human-readable comments. The methods in this interface can answer the following question:
For more information, see the API documentation for CommentNode.
This interface represents an object that can appear as the payload of a NewsComponent: a NewsItem, NewsItemRef, a ContentItem, or another NewsComponent.
For more information, see the API documentation for EquivalentNode.
This interface represents an object that contains formal-name information. It is necessary to have a base interface separate from FormalName because the ProviderId interface does not allow a scheme reference. The methods in this interface answer the following questions:
For more information, see the API documentation for FormalNameNode.
This interface represents an object that provides a URI reference. The method in this interface answers the following question:
For more information, see the API documentation for HrefNode.
This interface represents the XML identifier information (the NewsML Duid and Euid) available for most NewsML object types, with the exception of the NewsIdentifier and TopicUse interfaces.
For more information, see the API documentation for IdNode.
This interface represents an object that can have a natural language code assigned to it, following the rules for the xml:lang attribute in the XML 1.0 Recommendation. The method in this interface can answer the following question:
For more information, see the API documentation for LanguageNode.
This interface represents an object that can have text content mixed with Origin sub-elements. The methods in this interface can answer the following question:
For more information, see the API documentation for OriginNode.
This interface represents an object that can have generic properties attached to it (see the Property interface). This interface's methods answer the following question:
For more information, see the API documentation for PropertyNode.
In NewsML, the name of an XML element serves to distinguish both the role and the nature of a structure; as a result, many NewsML element types share exactly the same structure, and differ only in their name (reflecting their different roles). For example, the 16 NewsML XML element types Format, FutureStatus, LabelType, MediaType, MetadataType, MimeType, NewsItemType, NewsLineType, NewsProduct, NewsService, Notation, Priority, Role, Status, TopicType, and Urgency all have exactly the same XML content model and attribute lists and can be processed in the same way (i.e. they have the same nature, or type), but can appear in different places (i.e. they fill different roles, or functions).
In normal object-oriented design (Java or otherwise), the role and nature are specified separately: the role is specified by the accessor method on the interface referencing the node, and the nature is specified by the node's own interface. In other words, while it makes perfect sense in Java to have separate methods getFormat, getFutureStatus, and so on, there is no need to create 16 separate identical interfaces.
For situations like these, the NewsML Toolkit contains the shared interfaces described in the following sections (more will likely be added in future releases).
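The split between role and nature can be sketched as follows. FormalNameStub and ItemStub are hypothetical stand-ins invented for this example; the real FormalName interface has its own methods, and which toolkit interfaces actually declare accessors such as getFormat is not assumed here.

```java
public class RoleAndNatureSketch {

    /** Stand-in for the shared interface: the "nature" of the node. */
    interface FormalNameStub {
        String getName();
    }

    /** Hypothetical owner node: each accessor name carries the "role". */
    interface ItemStub {
        FormalNameStub getFormat();
        FormalNameStub getMediaType();
    }

    /** One routine serves every role, because the nature is the same. */
    static String render(String role, FormalNameStub formalName) {
        String value = (formalName == null) ? "(absent)" : formalName.getName();
        return role + "=" + value;
    }

    public static void main(String[] args) {
        ItemStub item = new ItemStub() {
            public FormalNameStub getFormat() { return () -> "JPEG"; }
            public FormalNameStub getMediaType() { return null; }
        };
        System.out.println(render("Format", item.getFormat()));
        System.out.println(render("MediaType", item.getMediaType()));
    }
}
```

Two accessors return the same interface type, so adding a further role costs one method on the owner rather than a whole new interface.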
The toolkit uses this interface for every element that contains both formal-name and assignment properties (using the NewsML %formalname; and %assignment; parameter entities): Genre, Language, Relevance, Subject, SubjectDetail, SubjectMatter, SubjectQualifier, and TopicOccurrence. The OfInterestTo interface extends this one.
For more information, see the API documentation for AssignedFormalName.
The toolkit uses this interface for every element that contains text together with origin information and assignment properties (using the NewsML Origin element and %assignment; parameter entity): EndDate, Geography, Limitations, RightsHolder, StartDate, and UsageType.
For more information, see the API documentation for AssignedOriginText.
The toolkit uses this interface for elements that contain text together with assignment properties but not origin information (using the NewsML %assignment; parameter entity): currently, only TopicOccurrence.
For more information, see the API documentation for AssignedText.
The toolkit uses this interface for every NewsML structure that contains identifier and three-part formal name information (using the NewsML XML %formalname; parameter entity): Format, FutureStatus, LabelType, MediaType, MetadataType, MimeType, NewsItemType, NewsLineType, NewsProduct, NewsService, Notation, Priority, Role, Status, TopicType, and Urgency. The AssignedFormalName, Instruction, Party, and TopicSet interfaces extend this one directly.
For more information, see the API documentation for FormalName.
The toolkit uses this interface for elements that contain text together with the regular identifiers (using the NewsML %localid; parameter entity): DateAndTime, DateLabel, FileName, FirstCreated, LabelText, NameLabel, SystemIdentifier, ThisRevisionCreated, Url, and Urn. The AssignedText, Comment, Description, and OriginNode interfaces extend this one directly.
For more information, see the API documentation for IdText.
The toolkit uses this interface for elements that contain text without assignment information, mixed with Origin subelements: ByLine, CopyrightHolder, CopyrightDate, CopyrightLine, CreditLine, DateLine, HeadLine, KeywordLine, NewsLineText, RightsLine, SeriesLine, SlugLine, and SubHeadLine. The AssignedOriginText interface extends this one directly.
For more information, see the API documentation for OriginText.
The toolkit uses this interface to represent every NewsML structure that contains information about one or more parties (people or organisations), with optional comments attached: Contributor, Creator, Provider, SentFrom, and SentTo. The SourceList interface extends this one directly.
For more information, see the API documentation for PartyList.
The following information from a NewsML document is either not yet available through the toolkit interfaces or is available only in an incorrect form: