From: http://prdownloads.sourceforge.net/newsml-toolkit/newsml-toolkit-1.0.zip
The original/canonical document in HTML contains important links
Date: 2001-06-26
See: http://www.iptc.org/NMLIntro.htm
See: http://xml.coverpages.org/newsML.html
--------------------------------------------------------

NewsML Toolkit (1.0): Architectural Overview

Document Revision: $Revision: 1.10 $
Date: $Date: 2001/05/23 12:52:48 $

Contents

1. Introduction
2. Style
   2.1. Interfaces
   2.2. Methods
3. Interface Design
   3.1. Factories
4. Base Interfaces
   4.1. AssignmentNode
   4.2. BaseNode
   4.3. CatalogNode
   4.4. CommentNode
   4.5. FormalNameNode
   4.6. HrefNode
   4.7. IdNode
   4.8. LanguageNode
   4.9. OriginNode
   4.10. PrimaryNode
   4.11. PropertyNode
5. Shared Interfaces
   5.1. FormalName
   5.2. PartyList
   5.3. Text
6. Unfinished Tasks

1. Introduction

This document contains an architectural overview of version 1.0 of the NewsML Toolkit, a Java library for processing NewsML documents (for more information on NewsML, see the IPTC NewsML Web Site). Additional information on the toolkit is available in the testing document and in the JavaDoc API documentation, and a simple demo application is also available.

The document is intended for people who plan to test, modify, or extend the toolkit itself. Readers should also refer to the JavaDoc documentation, which provides detailed coverage for each of the interfaces.

The document assumes a good knowledge of Java and NewsML and at least some familiarity with XML and the common XML processing interfaces in Java. If you are not an XML or NewsML specialist and simply need to use the NewsML Toolkit to develop a NewsML application, it is not necessary for you to read this document.
The toolkit contains a collection of Java interfaces for the different structures that can appear in a NewsML document; these interfaces hide all of the details of XML processing, so that a Java programmer with little or no knowledge of XML markup can write programs to extract information from NewsML packages.

Like any markup specification, NewsML actually contains two parts:

1. a logical model for the structure of a NewsML package; and
2. rules for representing an instance of that model in XML markup.

Developers of programs to work with NewsML need to have at least some familiarity with the first part -- they need to know, for example, that a NewsItem contains a NewsComponent, and that a NewsComponent can contain several other types of nodes -- but there is no reason that they need to learn the second part, since it can be handled automatically by an XML-aware Java library like the NewsML Toolkit. Unfortunately, the current NewsML specification does not rigorously separate these layers, so readers looking for information on the logical model are forced to wade through page after page of XML markup details; with luck, future versions of the specification will correct this problem.

The toolkit contains many interfaces, but it is designed so that it can be learned incrementally: a developer approaching the toolkit and NewsML for the first time should be able to create useful applications after learning only a few key interfaces such as NewsMLFactory, NewsML, NewsItem, NewsComponent, and ContentItem.

2. Style

This section describes the major stylistic design principles behind the NewsML Toolkit interfaces, including naming conventions.

2.1. Interfaces

Where possible, interface names correspond directly to XML element names in the NewsML specification. The names of the base interfaces, which do not correspond directly with XML element names in the NewsML specification, all end with the word "Node".
Where multiple XML element types in the NewsML specification contain exactly the same functionality, the toolkit uses only one Java interface to represent all of them; for example, the FormalName interface is used for the NewsML XML elements MediaType, Format, MimeType, Notation, Language, Genre, OfInterestTo, MetadataType, Role, NewsService, and many others.

2.2. Methods

The accessor method name for a single object is usually the corresponding XML element name (where applicable) with the word "get" prepended, as in Text getSystemIdentifier(), except where using that name might cause conflict or confusion.

There are three methods for accessing repeatable objects:

1. A counter method returning the number of objects available. Its name consists of the XML element name with the word "get" prepended and the word "Count" appended, as in the following example:

       public int getContributorCount()

2. An indexed accessor taking an integer argument and returning the object at that position. Its name consists of the XML element name with the word "get" prepended, as in the following example:

       public PartyList getContributor(int index)

3. An array accessor, returning an array of all available objects of the requested type. Its name also consists of the XML element name with the word "get" prepended, but it does not take any arguments, as in the following example:

       public PartyList[] getContributor()

An accessor method returns null to signal failure, either because the object is not present in the NewsML package, or, in the case of an indexed accessor, because there is no object available at the specified index. The indexed accessors do not throw an ArrayIndexOutOfBoundsException.

3. Interface Design

The early implementation of the NewsML Toolkit is based on the Document Object Model (DOM) interface available from the World Wide Web Consortium (W3C).
The DOM is a fairly easy-to-use interface supported by many vendors and open-source developers, but DOM implementations tend to run slowly and use a lot of memory; as a result, the NewsML Toolkit is designed so that other backends can be plugged into an application for different purposes, as in the following examples:

* A backend built using the Simple API for XML (SAX) interface would be able to read a NewsML package in a quick, single pass, with minimum memory overhead.
* A backend built on top of a database could execute database queries directly to build the different parts of a NewsML document.
* A validating backend could apply business rules to every part of the NewsML package as it is queried.
* A masquerading backend could make data in another news-industry format appear as NewsML to an application.

As new backends become available, application developers will need to be able to switch to them without rewriting hundreds or thousands of lines of code. The NewsML Toolkit helps developers to write robust applications by dividing the implementation from the interfaces that the application sees: for example, there is a DOMNewsComponent class that implements a news component on top of the DOM interface, but applications do not refer to the class directly; instead, they work with the NewsComponent interface. If the application developer wants to switch to a different, non-DOM-based backend, the new class will still implement the NewsComponent interface, and all the code for processing news components will continue to work unmodified.

The abstract NewsML application programming interfaces appear in the package org.newsml.toolkit; the concrete classes for the DOM backend appear in the package org.newsml.toolkit.dom.

3.1. Factories

Of course, at some point the application still needs to specify what backend it is using. It does so by invoking a static factory method from a backend-specific class that implements the common NewsMLFactory interface.
Currently, only the DOMNewsMLFactory class is available, but others may appear in the future. For example, the following fragment creates a new NewsML root object using the DOM backend:

    NewsMLFactory factory = new DOMNewsMLFactory(new XercesDOMFactory());
    try {
        NewsML newsml = factory.createNewsML("http://sample.org/news/story01.xml");
    } catch (IOException e) {
        System.err.println("Failed to load NewsML package");
    }

This should be the only place in the application that refers directly to the DOM backend; everything else will work through the generic interfaces.

Note that in the case of the DOM backend, it is necessary to supply a domFactory argument to the constructor, so that it knows what DOM implementation you want to use (there are many available in Java). The DOMFactory interface is yet another factory interface, this time, one specific to the DOM backend. A default implementation for the DOM implementation in the Apache Xerces XML parser is provided in the XercesDOMFactory class, but it is simple to write new factory classes for other DOM implementations.

4. Base Interfaces

NewsML is a complex specification, and that complexity is mirrored in the large number of interfaces in the NewsML toolkit itself. The NewsML XML document type contains many structures that are almost but not quite the same, and those slightly-divergent structures can make it difficult to write generalised, reusable code to process NewsML documents. To help alleviate this problem, the NewsML Toolkit contains a series of more abstract interfaces that capture the simple, common patterns that do appear. These base interfaces do not correspond directly with specific markup structures, but they capture similar patterns that make up parts of many different substructures.
These interfaces make it possible to write reusable code for common situations, such as navigating through the main structure of a NewsML document or processing a series of comments; they also simplify the process of learning both the NewsML XML document type and the Java interfaces in the toolkit. The following subsections describe the base interfaces in detail.

4.1. AssignmentNode

This interface corresponds to the %assignment; parameter entity in the NewsML DTD: all interfaces that provide information about the authority and circumstances under which information was provided extend this interface. The methods of an assignment node can answer the following questions:

* Who assigned the information?
* How important is the information?
* How confident is the assigner about the information?
* How is the information applicable?
* When was the information assigned?

For more information, see the API documentation for AssignmentNode.

4.2. BaseNode

This is the base interface for most other interfaces in the NewsML Toolkit (that is, most NewsML objects can be cast to a BaseNode). Exceptions are interfaces like HeadLineGroup and SubjectCodeItem, which represent ordering patterns rather than actual XML elements. The methods of a base node can answer two questions:

* What is the XML element name associated with this object?
* What is the session to which this object belongs?

For more information, see the API documentation for BaseNode.

4.3. CatalogNode

This interface applies to all objects that contain a resource catalog. A catalog node's method can answer the following question:

* What is the resource catalog for this object (if any)?

For more information, see the API documentation for CatalogNode.

4.4. CommentNode

This interface applies to all objects that can contain human-readable comments. The methods in this interface can answer the following question:

* What human-readable comments, if any, are attached to this object?
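
As a minimal sketch of how a base interface allows one routine to serve many node types, the following example collects comments from any object that implements a comment-bearing interface. The CommentNode declared here is a hypothetical, simplified stand-in: its method names (getCommentCount, getComment) are assumed from the naming conventions in section 2.2 rather than taken from the API documentation, comments are reduced to plain strings, and SimpleCommentNode and CommentDemo exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the toolkit's CommentNode; method names are
// assumed from the section 2.2 conventions, and comments are simplified
// to plain strings.
interface CommentNode {
    int getCommentCount();          // counter method
    String getComment(int index);   // indexed accessor, null if out of range
    String[] getComment();          // array accessor
}

// Minimal in-memory implementation used only for this illustration.
class SimpleCommentNode implements CommentNode {
    private final String[] comments;

    SimpleCommentNode(String... comments) {
        this.comments = comments.clone();
    }

    public int getCommentCount() {
        return comments.length;
    }

    public String getComment(int index) {
        // Signal failure with null rather than throwing an exception.
        if (index < 0 || index >= comments.length) {
            return null;
        }
        return comments[index];
    }

    public String[] getComment() {
        return comments.clone();
    }
}

class CommentDemo {
    // Works for any object implementing CommentNode, whatever XML
    // element type that object represents.
    static List<String> collectComments(CommentNode node) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < node.getCommentCount(); i++) {
            result.add(node.getComment(i));
        }
        return result;
    }

    public static void main(String[] args) {
        CommentNode node = new SimpleCommentNode("Embargoed until noon", "Check spelling");
        System.out.println(collectComments(node));
        // An out-of-range index yields null instead of an exception.
        System.out.println(node.getComment(5));
    }
}
```

Because collectComments depends only on the interface, it would not need to change if a different backend supplied the node.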
For more information, see the API documentation for CommentNode.

4.5. FormalNameNode

This interface represents an object that contains formal-name information. It is necessary to have a base interface separate from FormalName because the ProviderId interface does not allow a scheme reference. The methods in this interface answer the following questions:

* What is the local part of the formal name?
* What is the locally-specified or defaulted vocabulary?
* What is the defaulted (*not* locally-specified) vocabulary scheme?

For more information, see the API documentation for FormalNameNode.

4.6. HrefNode

This interface represents an object that provides a URI reference. The method in this interface answers the following question:

* What URI reference, if any, does this object provide?

For more information, see the API documentation for HrefNode.

4.7. IdNode

This interface represents the XML identifier information (the NewsML Duid and Euid) available for most NewsML object types, with the exception of the NewsIdentifier and TopicUse interfaces.

For more information, see the API documentation for IdNode.

4.8. LanguageNode

This interface represents an object that can have a natural language code assigned to it, following the rules for the xml:lang attribute in the XML 1.0 Recommendation. The method in this interface can answer the following question:

* What natural language code, if any, was assigned to this object?

For more information, see the API documentation for LanguageNode.

4.9. OriginNode

This interface represents an object that can have text content mixed with Origin sub-elements. The methods in this interface can answer the following question:

* What text and Origin elements appear as content?

For more information, see the API documentation for OriginNode.

4.10. PrimaryNode

This interface represents a piece of the content hierarchy of a NewsML package.
The content hierarchy of a NewsML package takes the following pattern:

* The root is always a NewsML object.
* The NewsML object may contain one or more NewsItem objects.
* Each NewsItem object may contain a root NewsComponent object.
* Each NewsComponent object may contain a list of NewsItem or NewsItemRef objects, or a list of other NewsComponent objects, or a list of ContentItem objects.
* Each ContentItem may contain leaf content (such as news stories or photos), either inline or by URL reference.

In other words, to get at the actual news content, a program always needs to walk down the tree until it finds the leaf content items; the nodes above provide information (or metadata) about the leaf content items and describe how the content items are related to each other.

For more information, see the API documentation for PrimaryNode.

4.11. PropertyNode

This interface represents an object that can have generic properties attached to it (see the Property interface). This interface's methods answer the following question:

* What generic properties, if any, are attached to this object?

For more information, see the API documentation for PropertyNode.

5. Shared Interfaces

In NewsML, the name of an XML element serves to distinguish both the role and the nature of a structure; as a result, many NewsML element types share exactly the same structure, and differ only in their name (reflecting their different roles). For example, the 16 NewsML XML element types Format, FutureStatus, LabelType, MediaType, MetadataType, MimeType, NewsItemType, NewsLineType, NewsProduct, NewsService, Notation, Priority, Role, Status, TopicType, and Urgency all have exactly the same XML content model and attribute lists and can be processed in the same way (i.e. they have the same nature, or type), but can appear in different places (i.e. they fill different roles, or functions).
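
To illustrate how a single interface can serve several element types that share one nature, the following minimal sketch lets one helper method process values that fill different roles. Everything declared here is a simplified, hypothetical stand-in for illustration: the getName and getScheme methods on FormalName, the SimpleFormalName class, and the ContentItemSketch interface are assumptions, not the toolkit's actual API (see the JavaDoc for the real interfaces).

```java
// Hypothetical, simplified stand-in for the toolkit's FormalName interface.
interface FormalName {
    String getName();    // local part of the formal name
    String getScheme();  // naming scheme, or null if none was given
}

// Minimal value class used only for this illustration.
class SimpleFormalName implements FormalName {
    private final String name;
    private final String scheme;

    SimpleFormalName(String name, String scheme) {
        this.name = name;
        this.scheme = scheme;
    }

    public String getName() {
        return name;
    }

    public String getScheme() {
        return scheme;
    }
}

// Hypothetical node exposing two roles (Format and MediaType) that
// share one nature (FormalName).
interface ContentItemSketch {
    FormalName getFormat();
    FormalName getMediaType();
}

class SharedInterfaceDemo {
    // One routine handles every role, because the nature is the same.
    static String describe(String role, FormalName value) {
        if (value == null) {
            return role + ": (absent)";
        }
        String scheme = value.getScheme();
        return role + ": " + value.getName()
                + (scheme == null ? "" : " [" + scheme + "]");
    }

    public static void main(String[] args) {
        ContentItemSketch item = new ContentItemSketch() {
            public FormalName getFormat() {
                return new SimpleFormalName("JPEG", "ExampleFormatScheme");
            }
            public FormalName getMediaType() {
                return new SimpleFormalName("Photo", null);
            }
        };
        System.out.println(describe("Format", item.getFormat()));
        System.out.println(describe("MediaType", item.getMediaType()));
    }
}
```

The role lives in the accessor name on the containing node, while the returned type stays the same across roles.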
In normal object-oriented design (Java or otherwise), the role and nature are specified separately: the role is specified by the accessor method on the interface referencing the node, and the nature is specified by the node's own interface. In other words, while it makes perfect sense in Java to have separate methods getFormat, getFutureStatus, and so on, there is no need to create 16 separate identical interfaces. For situations like these, the NewsML toolkit contains the shared interfaces described in the following sections (more will likely be added in future releases).

5.1. FormalName

The toolkit uses this interface for every NewsML structure that contains identifier and three-part formal name information (using the NewsML XML %formalname; parameter entity). There is also a specialised sub-interface AssignedFormalName for a formal name that also contains assignment information (see AssignmentNode).

For more information, see the API documentation for FormalName.

5.2. PartyList

The toolkit uses this interface to represent every NewsML structure that contains information about one or more parties (people or organisations), with optional comments attached.

For more information, see the API documentation for PartyList.

5.3. Text

The toolkit uses this interface for every NewsML structure that contains both text and identifiers; in other words, for every structure that is represented by an XML element containing only plain text, with only %localid; attributes. For NewsML structures represented by plain attribute values, the toolkit uses the regular Java String class. There are also three specialised sub-interfaces:

* AssignedOriginText, for text with assignment information and Origin sub-elements.
* AssignedText, for text with assignment information only.
* OriginText, for text with Origin sub-elements only.

For more information, see the API documentation for Text.

6. Unfinished Tasks

The following information from a NewsML document is not yet available, or is available in incorrect form, through the toolkit interfaces:

NewsItem
  + There is not yet support for incremental updates.