NewsML Functional Specification (Working Draft)

NewsML-CSC-001

9 April 2000

Daniel Rivers-Moore
Director of New Technologies, RivCom
Consultant to the NewsML development process

Copyright © 2000 International Press Telecommunications Council
All Rights Reserved

Contents

Status of this document

Purpose

Further development

Scope

Other bodies of work

Introduction

Overview and general concepts

NewsML is an interchange format

NewsML uses XML as its primary encoding

NewsML carries both content and metadata

NewsML is media-neutral but handles text efficiently

NewsML can manage associations between data objects

Association information can also have metadata

A data object plus its metadata is an information object

NewsML can handle complexity

NewsML can handle the evolution of news over time

NewsML can handle the management of alternative representations of the same information

NewsML can handle information packages

Information objects may be included in packages explicitly or by reference

NewsML Features

Information carrier

Information object

Metadata

Metadata categories

Information object types

Minimum required metadata

Default values

Metadata properties and permitted values

Controlled vocabularies

Controlled vocabulary schemes

Assertions of equivalence and identity

NewsML metadata evolution

Association of metadata with data objects

Information collections

Inclusion, inclusion by reference and exclusion of parts

Revision and representations

Text-oriented features of NewsML

Identification of news content

Accountability and confidence

Authentication and security

NewsML Conformance

Glossary

Status of this document

This Working Draft is the first formal deliverable of the NewsML consultancy process, set in motion at the IPTC Spring Meeting in Nice, France, on 24 March 2000. It represents work in progress towards a full functional specification of NewsML. In its current form, its contents are not binding on IPTC or any of IPTC’s members.

Purpose

The purpose of this document is to establish a baseline functional specification from which further work on NewsML can proceed. The document is deliberately neutral on certain issues that are under active debate and which do not need to be resolved until a later stage in the NewsML development process. However, it makes explicit statements of choice on issues where one or both of the following applies:

The borderline between the issues on which this document makes explicit statements and those which it leaves open-ended is set out below in the section entitled Scope.

Further development

A second Working Draft of the NewsML Functional Specification will be made available to the IPTC membership no later than the Working Party meetings due to be held in London on 4 and 5 May 2000. Working Draft 2 of the NewsML Functional Specification will be sufficiently complete and explicit (including on matters left open-ended in the current Working Draft) for it to be a candidate final functional specification for NewsML Version 1.0.

The development process from Working Draft 1 to Working Draft 2 of the NewsML Functional Specification is as follows:

Scope

Scope of the NewsML Functional Specification

The NewsML Functional Specification, together with the NewsML Requirements and NewsML Encoding Decisions documents, will provide a clear formal statement of the consensus reached by the IPTC membership as to the basis on which NewsML Version 1.0 will be built:

It is in the nature of design principles that they are intended to provide clear guidelines to the designers of a specification, rather than immutable rules. Thus, it is possible that the NewsML Functional Specification will in some cases depart from strict adherence to the principles laid out in the NewsML Encoding Decisions. However, any cases where this occurs should be explicitly recognised as exceptions to the principles in question, and the NewsML Functional Specification should include explicit justification for any such exceptions. Possible justifications might include considerations of comprehensibility, implementation efficiency, performance optimisation, ease of use of relevant tools (such as XML editing tools), or compliance with an external standard or specification. Such exceptions should (by definition) be rare.

The NewsML Functional Specification will include clear definitions of the specialised terms it uses, whether these be the formal names of constructs such as XML elements or attributes, or other terms used in explanations of those constructs and their intended use.

What Working Draft 1 includes

Working Draft 1 of the NewsML Functional Specification covers the overall architecture of NewsML. It addresses issues of how NewsML documents will be structured, how relationships between them will be established, how they can make reference to, and be referred to by, other resources, and how, in general terms, extensions to and further developments of NewsML can be handled.

Working Draft 1 of the NewsML Functional Specification does not define specific XML elements and attributes, or make choices about the names these may have. Nor does it specify which external standards or specifications will be used, either by inclusion or by normative reference. In some instances, it gives indications of specific options in these areas; in others, it leaves such questions open. This is a deliberate choice, as the purpose of Working Draft 1 is to register the consensus so far achieved, and to make decisions on fundamental issues which must be resolved before further progress can be made. It is not the purpose of Working Draft 1 to resolve all issues that are still the subject of active debate, nor to prejudge the decisions of the Working Parties responsible for other aspects of NewsML, or the conclusions the consultant will reach when carrying out his survey of external specifications or bodies of work.

What Working Draft 2 will include

By contrast, Working Draft 2 of the NewsML Functional Specification will make explicit the proposed choices of XML encodings, including naming conventions and normative or other references to other specifications. It will however leave sufficient freedom of choice within the structures it lays down to allow the Metadata and News Management Working Parties to continue their work unconstrained with respect to the business purpose of their activity. In other words, the NewsML Functional Specification will state how news management information and other metadata will be structured, but it will not place bounds on which specific pieces of news management information or other metadata are to be supported by NewsML. These choices are the responsibility of the relevant Working Parties, not of the authors of the NewsML Functional Specification.

Other bodies of work

Other bodies of work that may be deemed relevant to NewsML include:

The ways in which such bodies of work might be used as a support for NewsML include:

Introduction

The NewsML Requirements document sets out the capabilities that NewsML is required to deliver. This NewsML Functional Specification sets out the technical means that will be employed to meet those requirements. The requirements can be briefly summarised as follows (numbers in brackets preceded by the letter R are references to the relevant clauses in the NewsML Requirements document):

NewsML is to be a compact (R900), extensible and flexible (R700) structural framework for news, based on XML and other appropriate standards and specifications (R1000). It must support the representation of electronic news items, collections of such items, the relationships between them, and their associated metadata (R100). It must allow for the provision of multiple representations of the same information (R500), and handle arbitrary mixtures of media types, formats, languages and encodings (R300, R400). It must support all stages of the news lifecycle (R600) and allow the evolution of news items over time (R200). Though media-independent, NewsML will provide specific mechanisms for handling text (R1100). It will allow for the authentication and signature of both metadata and news content (R800).

This functional specification is presented here in three parts. First, an Overview and general concepts section highlights some key considerations and concepts that have a formative impact on the overall design and technical approach that NewsML will adopt, and introduces and explains some key terms that will be used throughout the document. Then follows a NewsML Features section that lists the specific functional capabilities that NewsML will have, together with a brief description of each and, in some cases, a description of the technical approach that will be adopted in order to provide the feature in question. In cases where further discussion or research is needed before the technical approach can be specified in detail, one or more Notes spell out the options and any remaining issues that need to be resolved. Finally, a Glossary provides a concise definition of each of the technical terms that were introduced in the Overview and general concepts section.

Overview and general concepts

In the course of this overview, we shall introduce some terms that will be used in this document to describe the structure and design of NewsML. These terms are purely descriptive and should not be presumed to resemble any future formal names of NewsML constructs such as XML elements and attributes. To prevent any misunderstanding on this point, we have deliberately used composite terms such as ‘information carrier’ and ‘information content’, without concatenation, so that they could not possibly be used as XML names in their current form. These terms and their definitions are reproduced in the Glossary section at the end of this document.

NewsML is an interchange format

NewsML is primarily intended as a format for the interchange of news, rather than its creation, editing, storage or in-house management. However, NewsML will be inherently extensible, and so can provide the basis for the development of further, specialised formats for these and other purposes. Such formats may be proprietary or may result from subsequent standardisation efforts.

NewsML uses XML as its primary encoding

The primary encoding used by NewsML will be XML. This simple statement carries with it several important consequences.

Firstly, it implies that the objects of NewsML are logical rather than physical. An XML document is a single logical object, though it may be built up of the contents of multiple physical files through the mechanism of XML entity references, for example.

With this in mind, we shall use the term data object to represent a logical unit of data, whether or not it is stored as a single file or transmitted as a single data stream.

Secondly, the use of XML implies that all the data-management capabilities that are provided by XML and its associated specifications are available to NewsML, and are inherited by it. In particular, the ability of XML to handle data in non-XML formats, through the NOTATION mechanism, is available to NewsML also. This powerful feature of XML will provide one possible way of addressing some of the media- and encoding-independence issues that are part of the NewsML requirements.
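As a rough illustration of the NOTATION mechanism mentioned above, the following Python sketch parses a small XML document whose internal DTD subset declares a notation and an unparsed entity, and reports what was declared. It uses only the standard xml.parsers.expat module. The element name item, the attribute src, the notation name jpeg and the file name photo.jpg are hypothetical: this Working Draft does not define any NewsML names, and it does not commit NewsML to this technique.

```python
import xml.parsers.expat

# Hypothetical document: all names below are illustrative, not NewsML names.
DOC = """<!DOCTYPE item [
  <!NOTATION jpeg SYSTEM "image/jpeg">
  <!ENTITY photo SYSTEM "photo.jpg" NDATA jpeg>
  <!ELEMENT item EMPTY>
  <!ATTLIST item src ENTITY #REQUIRED>
]>
<item src="photo"/>"""


def read_declarations(document):
    """Return (notations, unparsed_entities) declared in the document's DTD."""
    notations = {}
    unparsed = {}
    parser = xml.parsers.expat.ParserCreate()

    def on_notation(name, base, system_id, public_id):
        # Called once for each <!NOTATION ...> declaration.
        notations[name] = system_id

    def on_unparsed_entity(name, base, system_id, public_id, notation_name):
        # Called once for each <!ENTITY ... NDATA ...> declaration.
        unparsed[name] = (system_id, notation_name)

    parser.NotationDeclHandler = on_notation
    parser.UnparsedEntityDeclHandler = on_unparsed_entity
    parser.Parse(document, True)
    return notations, unparsed


notations, unparsed = read_declarations(DOC)
```

The XML document itself never contains the image data; it only names the external resource and the notation that tells an application how to interpret it.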

NewsML carries both content and metadata

In order to be handled appropriately by an application or business process, a data object often needs additional data to be associated with it. We shall use the term metadata to designate data that is associated with a data object with the intent of enabling a system to handle that data object appropriately. The system may be a computer application, a business process handled by human beings, or some combination of the two. Thus, metadata may need to be machine-interpretable, human-readable, or both. A metadata object is a data object consisting of metadata.

NewsML is media-neutral but handles text efficiently

NewsML makes no assumption about the media type, format or encoding of news. It provides a structure within which pieces of news media relate to each other. NewsML can equally contain text, video, audio, graphics, photos, or other media and combinations of media yet to be invented.

We shall use the term information carrier to designate a data object whose data content is intended to convey information (such as news content) to human recipients. Thus, the video component of a TV news report, the braille-encoded content of a newspaper article destined for blind readers, the electronic encoding of a table of stock-market closing prices, and the textual content of an article destined for a piece of newsprint, would all be described as information carriers.

We shall use the term information content to designate the information carried by an information carrier and intended for human consumption. (Note that we use the term ‘information content’ rather than ‘news content’ in order to make the NewsML specification as general-purpose as possible. A NewsML information carrier could for example be used to carry historical information intended for use in the biography of a historical character, or a documentary about a past event, as easily as it carries up-to-the-minute information about current events.)

It is important that NewsML cater to the needs of traditional formats as well as new or emerging formats. Many information carriers contain structured or unstructured text. Care must therefore be taken to ensure that allowing NewsML to handle non-text-based information does not result in a reduction in the ease and efficiency of text-based news handling.

NewsML can manage associations between data objects

It is frequently necessary to associate a data object with one or more others, which may be physically close to or physically remote from it. Furthermore, the data that encodes the association between data objects may be internal to one or both objects, physically close to one or both, or physically remote from both. The same is true of the metadata associated with a data object: some or all of the metadata may be physically remote from the object to which it relates.

We shall use the term association information to designate the information that conveys the existence and nature of an association between any number of data objects and/or metadata objects, wherever that information is physically held.

We shall use the term association data object to designate any data object that encodes association information.

It is possible for association information to exist without there being any corresponding association data. For example, the fact that a given XML element is the sibling of another will often imply the existence of a meaningful relationship between those two elements. This relationship counts as association information despite the fact that there is no association data that represents the existence of this relationship, beyond the structural relationship that pertains between the elements themselves. We shall use the term implicit association information to designate association information that is implicit in the structural relationships between the data objects themselves rather than being represented by association data objects.

Association information can also have metadata

Like any other data objects, association data objects can themselves have associated metadata. We shall use the term association metadata to designate metadata associated with association data objects. Examples of association metadata would be the date at which an association data object was created in order to assert a relationship between two data objects, the name of an editor who created the association data object, the reason why it was created, a date before which it is not deemed to be valid, etc.
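The idea of an association data object carrying its own metadata can be sketched as follows. This is only an illustrative in-memory model in Python; the class name, the field names and the sample values (including the editor name and identifiers) are assumptions made for the example, not constructs defined by NewsML.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class AssociationDataObject:
    """Explicit record that some data objects are related (hypothetical shape)."""
    nature: str       # illustrative label for the kind of relationship
    members: tuple    # identifiers of the associated data objects
    metadata: dict = field(default_factory=dict)  # association metadata


def is_valid_on(association, day):
    """An association with a 'valid_from' date is not valid before that date."""
    valid_from = association.metadata.get("valid_from")
    return valid_from is None or day >= valid_from


# Hypothetical association between a caption and a photograph.
link = AssociationDataObject(
    nature="caption-of",
    members=("caption-17", "photo-17"),
    metadata={
        "created": date(2000, 4, 9),
        "editor": "A. Editor",
        "valid_from": date(2000, 4, 10),
    },
)
```

Here the creation date, editor and validity date describe the association itself, not the caption or the photograph it links.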

A data object plus its metadata is an information object

We shall use the term information object to designate any data object plus its associated metadata. Another way of putting this is to say that an information object is a data object that is able to be used appropriately by a system. This follows from the fact that metadata is defined as data associated with a data object with the intent of enabling a system to handle it appropriately.

NewsML can handle complexity

It is a characteristic feature of news stories that they often bring together multiple information objects, for example, a text story, a photograph and its caption, and a vector graphic. Further, it is often necessary to bring together multiple complete stories and handle them as a coherent collection, for example in a digest of the week’s major stories, or as a response to a query seeking out stories relating to a particular event or theme.

We shall use the term information collection to designate an information object comprising any number of information objects that are intended to be used together for any purpose. The components of an information collection are zero or more information objects (data objects plus associated metadata), at least one additional item of metadata associated with the information collection itself, and any relevant association information defining the relationships between the information objects that make up the collection. Just as an information collection is itself an information object, so any or all of its component information objects may themselves be information collections.
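The recursive definition above can be made concrete with a small Python sketch. The class and field names are hypothetical and chosen only for illustration; the sketch assumes that metadata can be modelled as a dictionary and a data object as a byte string.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class InformationObject:
    """A data object plus its associated metadata."""
    metadata: dict
    data: Optional[bytes] = None


@dataclass
class InformationCollection(InformationObject):
    """An information object made up of zero or more information objects."""
    components: list = field(default_factory=list)
    associations: list = field(default_factory=list)

    def __post_init__(self):
        # The collection needs at least one item of metadata of its own.
        if not self.metadata:
            raise ValueError("a collection needs at least one metadata item")


def count_leaf_objects(obj):
    """Count the non-collection information objects nested inside obj."""
    if isinstance(obj, InformationCollection):
        return sum(count_leaf_objects(c) for c in obj.components)
    return 1


text = InformationObject(metadata={"type": "text"}, data=b"Story text")
photo = InformationObject(metadata={"type": "photo"}, data=b"\xff\xd8")
story = InformationCollection(
    metadata={"title": "Story with photo"}, components=[text, photo]
)
other = InformationObject(metadata={"type": "text"}, data=b"Another story")
digest = InformationCollection(
    metadata={"title": "Weekly digest"}, components=[story, other]
)
```

Because a collection is itself an information object, the digest can contain the story collection alongside a plain information object without any special casing.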

NewsML can handle the evolution of news over time

Special cases of information collection are ‘the collection of all versions of an article as it proceeds through the editorial workflow from authoring, to editing, to publication’, and ‘the collection of all revisions of a story as new facts are discovered and the story is refined and augmented’. Similarly, special cases of association information are the fact that one information object is an earlier version of another, and a correction or replacement to a third. NewsML will provide explicit support for these kinds of situation through the provision of specific categories of metadata and association information appropriate to the management of the entire news lifecycle.

NewsML can handle the management of alternative representations of the same information

An information collection is not necessarily intended to be used in its entirety at any given moment of news delivery. It might for example include multiple alternative versions of some of its component information objects. The fact that one information object is an alternative to another within the context of a particular information collection is represented by appropriate association information specifying the relationships between the relevant objects.

Criteria for the choice between alternatives might relate to any aspect of the metadata associated with the objects. The choice might depend, for example, on the format (e.g. video, audio or photograph), the encoding (e.g. JPEG, EPS or SVG graphic), the available space (choice between short or long headline, large or small photograph, etc.), the language in which the textual component of the information is written, the language spoken by the speaker in an audio clip, or any other consideration dependent on the constraints or requirements of the delivery environment, process or intended audience.
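A minimal Python sketch of such a choice is given below. It assumes, purely for illustration, that each alternative exposes its metadata as a flat dictionary and that the delivery environment states its requirements as exact values; the property names and identifiers are hypothetical.

```python
def choose_alternative(alternatives, requirements):
    """Return the first alternative whose metadata meets every requirement."""
    for candidate in alternatives:
        if all(candidate.get(key) == value for key, value in requirements.items()):
            return candidate
    return None  # no alternative satisfies the delivery constraints


# Hypothetical alternative headlines within one information collection.
headlines = [
    {"id": "h-long-en", "language": "en", "length": "long"},
    {"id": "h-short-en", "language": "en", "length": "short"},
    {"id": "h-short-fr", "language": "fr", "length": "short"},
]

chosen = choose_alternative(headlines, {"language": "fr", "length": "short"})
```

A real system might rank candidates rather than take the first exact match; the point of the sketch is only that the selection is driven by metadata, not by the content of the objects.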

NewsML can handle information packages

As has been explained above, information objects and information collections are logical rather than physical objects. However, it is sometimes necessary (for example for data transmission or data storage purposes) to render the content of these logical objects in a single file or data stream. We shall use the term information package to designate a physical file or data stream that is used to store or transmit one or more information objects or information collections.

The various protocols or security features of the data storage or transmission environment might require certain information to be included in the information package in addition to the actual information objects it contains. We shall use the term information envelope to designate this additional information. An information package therefore comprises an information envelope together with the information object(s) or information collection(s) that are being stored or transmitted.

Information objects may be included in packages explicitly or by reference

We shall use the term explicit inclusion to designate the physical inclusion of the entire data content of an information object within an information package. We shall use the term inclusion by reference when an information package does not physically include the data content of an information object but instead includes a pointer or other form of reference data, sufficient to enable a NewsML-aware system to retrieve the actual information object at some later point in time, if it is required.

NewsML provides the means for both explicit inclusion and inclusion by reference to be used at will. The criteria by which a given application decides whether to use one inclusion mechanism or the other, or some combination of the two, are implementation-dependent. Senders and receivers of NewsML packages are free to negotiate their own rules in this regard, in the light of the specific constraints or requirements of their physical, technical or business environments.
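The distinction between the two inclusion mechanisms, and the role of the information envelope, can be sketched in Python as follows. All class names, the envelope fields and the example reference are hypothetical; the fetch function stands in for whatever retrieval mechanism a NewsML-aware system would actually use.

```python
from dataclasses import dataclass, field


@dataclass
class ExplicitInclusion:
    """The entire data content is physically present in the package."""
    data: bytes


@dataclass
class InclusionByReference:
    """Only a pointer is present; the content is retrieved later if needed."""
    reference: str


@dataclass
class InformationPackage:
    """An information envelope plus the objects being stored or transmitted."""
    envelope: dict
    entries: list = field(default_factory=list)


def materialise(package, fetch):
    """Return the data content of every entry, fetching referenced ones."""
    return [
        entry.data if isinstance(entry, ExplicitInclusion) else fetch(entry.reference)
        for entry in package.entries
    ]


# Hypothetical remote store and package mixing both mechanisms.
remote_store = {"urn:example:photo-17": b"photo bytes"}
package = InformationPackage(
    envelope={"sender": "Agency A", "sent": "2000-04-09"},
    entries=[
        ExplicitInclusion(b"story text"),
        InclusionByReference("urn:example:photo-17"),
    ],
)
contents = materialise(package, remote_store.__getitem__)
```

A receiver that never needs the photograph can simply skip the fetch, which is the practical benefit of inclusion by reference.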

NewsML Features

Information carrier

NewsML will provide a generic information carrier object, whose purpose is to represent any item of news-related information, destined for human consumption, through some data encoding. The eventual presentation of the information to a human recipient could be through any medium – recorded sound, written or speech-synthesized text, printed or on-screen photograph, or other media yet to be invented. The rendering of the information to the recipient will be through some physical device that is able to interpret the encoding in which the information is carried.

Information object

The information carrier on its own is simply a raw data object without any meaning beyond that implicit in the semantics of its encoding, and even that is of little use unless it is accompanied by additional information to tell the system what that encoding is. The information carrier becomes an information object through the addition of appropriate metadata, which conveys all the additional information relevant to telling a system how the object is intended to be used.

Metadata

NewsML will specify a number of different kinds of metadata, each relating to a different aspect of the ways information objects may need to be used within NewsML-aware systems.

We shall use the term metadata category to designate a broad class of metadata objects that it is useful to consider as a group because the objects that fall within that category are relevant to a particular aspect of the business purpose or practical use to which NewsML information objects may be put.

We shall use the term metadata token to designate a metadata object of a particular kind, falling within one of the supported metadata categories.

Note: The choice of the term ‘token’ is deliberate here, as it is neutral between various possible kinds of encoding that may be used. There is no presumption at this stage as to whether a given metadata token will be encoded as an XML element, an XML attribute, some combination of the two, or even some other kind of object such as an XML-namespace URI or some non-XML data object identified by reference from within an XML data structure.

We shall use the term metadata property to designate that aspect of an information object that is represented by a particular metadata token; put another way, a metadata property is a class of information conveyed by metadata tokens of a particular type.

To give an example: within the metadata category known as ‘physical’, whose purpose includes the description of certain aspects of the way an information object will be rendered, there will be a metadata token called, perhaps, ‘width’, which represents a particular property of information objects, namely the amount of horizontal space they will occupy when rendered.

It should be noted that metadata properties may themselves be structured. It cannot be assumed that a single XML attribute, for example, will be sufficient to carry the semantics of a metadata property. In the example given, the metadata property known as ‘width’ is not atomic. As a minimum it must be expressed as a numeric value accompanied by a unit of measure (for example, ‘5 inches’ or, equivalently, ‘12.7 centimetres’), not just as an unadorned numeric value. In fact, it may be yet more complex, as it might be decided to allow width to have multiple possible values such as ‘recommended width’ and ‘minimum width’, within a single, complex metadata token representing a similarly complex metadata property comprising a set of related widths, each with a different role.
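The following Python sketch shows one way such a structured property could be modelled: a length is a value paired with a unit, and a width is a set of lengths with different roles. The class names, the unit abbreviations and the sample minimum width are assumptions made for the example only.

```python
import math
from dataclasses import dataclass

# Conversion factors to centimetres (1 inch is exactly 2.54 cm).
CM_PER_UNIT = {"cm": 1.0, "in": 2.54}


@dataclass(frozen=True)
class Length:
    """A numeric value that is only meaningful together with its unit."""
    value: float
    unit: str

    def to_cm(self):
        return self.value * CM_PER_UNIT[self.unit]


@dataclass(frozen=True)
class Width:
    """A complex property made of related widths, each with its own role."""
    recommended: Length
    minimum: Length


width = Width(recommended=Length(5, "in"), minimum=Length(8, "cm"))

# '5 inches' and '12.7 centimetres' denote the same width.
same = math.isclose(width.recommended.to_cm(), Length(12.7, "cm").to_cm())
```

An unadorned value of 5 could not be compared with 12.7 in this way, which is why the unit has to travel with the number.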

Note: Working Draft 2 of the NewsML Functional Specification will specify in general terms what XML constructs may be used as metadata tokens, to encode whatever metadata properties the relevant Working Parties decide NewsML will support within each metadata category. NewsML Version 1.0 will fully specify these constructs and give indications of the kinds of ways they may be used in NewsML-aware systems.

Metadata categories

Note: At this point in the NewsML Functional Specification will appear a list of the relevant categories of metadata together with a brief description of the purpose of each. At the time of writing Working Draft 1, this list is not definitively agreed. At least two candidate lists have been proposed. One (included in the Reuters draft functional specification) consists of five categories: descriptive, editorial, identification, role and physical. Another (included in the AFP draft functional specification) consists of seven categories: identification, administrative, descriptive, physical, roles, rights and management. It will be for the relevant IPTC Working Parties to make appropriate decisions as to the definitive list to be adopted by NewsML Version 1.0, and to spell out the specific metadata objects that will be able to exist within each category.

Note: It would be extremely useful (though it may be beyond the scope of NewsML Version 1.0) to allow a formal hierarchy of metadata categories. For example, the ‘physical’ metadata category may have recognised subsets such as ‘geometric’ – to cover such properties as width and height – and ‘time-based’ – to cover such properties as duration. If it is decided to recognise this hierarchy of metadata categories, we shall coin the term metadata subcategory, with the obvious meaning: in the example given, ‘geometric’ would be a subcategory of ‘physical’.

There is no theoretical limit to the kinds of use to which information objects may be put, and consequently none to the metadata categories that may be required to provide information relevant to those kinds of use.

Information object types

It is clear that certain kinds of information object require certain kinds of metadata and not others. An audio object does not have a ‘width’ property, for example, and so does not require a ‘width’ token to encode it.

We shall use the term information object type to designate a class of information objects to which certain metadata categories are relevant.

Minimum required metadata

It is clear from the above list of metadata categories that the scope of NewsML metadata is extremely broad. It is therefore important that the NewsML specification define a special metadata category comprising the minimum subset of metadata properties that all NewsML information objects must have, and that it specify which of those properties are ‘required’ and which are ‘optional’. We shall use the term minimum required metadata to designate this subset.

Default values

For those metadata properties that are deemed to be ‘required’, the corresponding metadata tokens may in fact remain optional, since NewsML may specify default values they are deemed to have if not physically present in a NewsML data object. A required metadata token will be one corresponding to a required metadata property for which no default value has been specified. The default values may be specified in the NewsML specification itself, or in an agreement between the sender and receiver of NewsML information objects.
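The interplay between required properties, optional tokens and default values can be sketched in Python as below. The property names, the default value and, in particular, the rule that defaults agreed between sender and receiver take precedence over defaults from the specification are assumptions made for illustration; this Working Draft does not define them.

```python
# Hypothetical defaults laid down by the specification itself.
SPEC_DEFAULTS = {"language": "en"}

# Hypothetical set of required metadata properties.
REQUIRED_PROPERTIES = {"language", "identifier"}


def effective_metadata(tokens, agreement_defaults=None):
    """Combine defaults with the tokens physically present in the object.

    Assumed precedence: explicit tokens, then sender/receiver agreement,
    then specification defaults.
    """
    merged = dict(SPEC_DEFAULTS)
    merged.update(agreement_defaults or {})
    merged.update(tokens)
    # A required property with no default and no token is an error.
    missing = REQUIRED_PROPERTIES - merged.keys()
    if missing:
        raise ValueError(f"missing required metadata: {sorted(missing)}")
    return merged


resolved = effective_metadata({"identifier": "story-42"})
```

In this sketch ‘language’ is a required property whose token is optional, because a default exists, whereas ‘identifier’ is a required property whose token is itself required.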

Note: It may be appropriate for NewsML Version 1.0 to specify a formal mechanism for the specification of default values for metadata properties, and the range of information objects over which those default values apply.

Metadata properties and permitted values

The NewsML specification will include an explicit list of metadata properties within each metadata category. Certain metadata properties will be allowed any value at all, provided it is of an appropriate data type. Others may require the value to be within a certain range (that is, between specified minimum and maximum values). Others may require the values to be drawn from an explicit list of permitted values, known as a controlled vocabulary.

Note: XML Schema provides ways of specifying constraints on datatypes and ranges that certain data objects must fall within. Working Draft 2 of the NewsML Functional Specification will state whether and how this mechanism will be used within NewsML.
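The three kinds of constraint described above (any value of a given data type, a value within a range, and a value drawn from a controlled vocabulary) can be sketched in Python as follows. This is not a proposal for how NewsML will express such constraints; the property names, the range and the vocabulary contents are hypothetical.

```python
def any_of_type(datatype):
    """Accept any value at all, provided it is of the given data type."""
    return lambda value: isinstance(value, datatype)


def in_range(minimum, maximum):
    """Accept numeric values between the specified minimum and maximum."""
    return lambda value: (
        isinstance(value, (int, float)) and minimum <= value <= maximum
    )


def from_vocabulary(terms):
    """Accept only values drawn from an explicit list of permitted terms."""
    permitted = frozenset(terms)
    return lambda value: value in permitted


# Hypothetical metadata properties and their constraints.
RULES = {
    "headline": any_of_type(str),
    "priority": in_range(1, 9),
    "country": from_vocabulary({"FR", "GB", "US"}),
}


def is_permitted(prop, value):
    """Check a value against the constraint for the named property."""
    return RULES[prop](value)
```

For example, is_permitted("priority", 12) is rejected by the range rule, while is_permitted("country", "FR") is accepted because the term appears in the vocabulary.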

Controlled vocabularies

We shall use the term controlled vocabulary to designate a list of permitted terms that may be used for a particular purpose. The content of a controlled vocabulary is not fixed. However, changes to it can only be made according to a specified and well-managed process.

If a given metadata property needs a specified list of permitted values, this list will take the form of a controlled vocabulary.

Note: The ways in which controlled vocabularies are to be defined and managed require further study. It is already clear today that they will not take the form of enumerated lists of allowed attribute values within an XML DTD. Nor will the terms in a controlled vocabulary be the names of elements or attributes within ATTLIST declarations or sets of ELEMENT declarations in a NewsML DTD. Several possible mechanisms for the management of controlled vocabularies exist. One possible technique is the use of external entity files, identified by a parameter entity reference within a NewsML DTD, where the content of the entity file itself is not specified by the NewsML standard, but is managed by the users of NewsML within a given community. Another possible technique is for NewsML to define an XML Schema that allows specialisations or extensions to the constructs it defines, and for those specialisations and extensions to be managed by a community of NewsML users. Another possible technique is for NewsML to specify one or more namespace URIs, and for IPTC to manage the content of files that will be accessible through those URIs, and which will contain information about the current list of terms within the controlled vocabulary, and their intended meaning and use. Working Draft 2 of the NewsML Functional Specification will specify the technical means whereby NewsML controlled vocabularies may be defined and managed. Either as part of NewsML Version 1.0, or as a separate initiative, IPTC may define a certain number of initial controlled vocabularies, and/or a set of guidelines or specific procedures for the management of these vocabularies. It may be that the following of these procedures is deemed to be a necessary part of NewsML conformance, or it may be that NewsML users will be given a greater degree of freedom as to how this management will be done.

It is within the remit of the NewsML consultant to investigate how these matters are managed in other communities that are faced with similar requirements. It is within the remit of the IPTC to form a view as to the degree to which the controlled vocabulary management process shall or shall not be specified within NewsML Version 1.0 itself.

Note: NewsML may recognise as a controlled vocabulary a set of terms that is managed by an external institution or standards body. For example, the Dublin Core, or some subset of it, may become a permitted controlled vocabulary for certain metadata categories, and ISO 3166 country codes might be designated as a controlled vocabulary for the permitted values of certain metadata properties.

Controlled vocabulary schemes

There may be more than one controlled vocabulary that can be used for a given purpose within NewsML.

NewsML will provide a mechanism for specifying which vocabulary is being used in any particular instance. Thus, for example, in specifying the value of a particular metadata property, it will be possible

Note: Working Draft 2 of the NewsML Functional Specification will specify the mechanism whereby all this is achieved. The mechanism chosen may involve use of the XML Namespaces construct. Alternatively it may simply involve the use of appropriate XML element types and attributes, declared as part of the NewsML DTD and explained in the accompanying documentation. As with the controlled vocabulary mechanism itself, it is within the remit of the NewsML consultant to investigate how these matters are managed in other communities that are faced with similar requirements.
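
As a purely illustrative sketch of the attribute-based option mentioned in the note above, the following Python fragment reads metadata properties that each name the vocabulary from which their value is drawn. The element name Property, the attributes name, scheme and value, and the scheme identifiers are all assumptions made for this example; none of them is defined by NewsML.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: every element, attribute and scheme name here is
# invented for illustration and is not part of any NewsML DTD.
SAMPLE = """
<Metadata>
  <Property name="country" scheme="iso3166-alpha2" value="FR"/>
  <Property name="country" scheme="agency-x-countries" value="France"/>
</Metadata>
"""

def read_properties(xml_text):
    """Return (name, scheme, value) triples for every Property element."""
    root = ET.fromstring(xml_text)
    return [
        (p.get("name"), p.get("scheme"), p.get("value"))
        for p in root.iter("Property")
    ]

properties = read_properties(SAMPLE)
for name, scheme, value in properties:
    print(f"{name} = {value!r} (vocabulary: {scheme})")
```

The point of the sketch is only that a receiving system can tell the two values apart by their declared vocabulary, even though both describe the same metadata property.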

Assertions of equivalence and identity

NewsML will provide a formal mechanism for asserting that the meaning of a term in one controlled vocabulary and the meaning of a term in another controlled vocabulary are the same. These terms will then be considered to be equivalent whenever they are used within NewsML information objects. Likewise, where terms in controlled vocabularies designate objects in the real world (such as places or countries), there will be a mechanism for asserting the identity of their referents. Thus, for example, it will be possible to assert that the town designated by the term ‘Lyons’ in a British controlled vocabulary is identical with the town designated by the term ‘Lyon’ in a French controlled vocabulary.
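
One way such assertions might be held by a receiving system is sketched below in Python. The sketch assumes that equivalence is symmetric and transitive, which this Working Draft does not state, and it uses a union-find structure chosen for this example only. The class name, method names and vocabulary identifiers are hypothetical.

```python
class EquivalenceRegistry:
    """Tracks which (vocabulary, term) pairs have been asserted equivalent."""

    def __init__(self):
        self._parent = {}

    def _find(self, item):
        # Register unseen items as their own representative.
        self._parent.setdefault(item, item)
        # Walk up to the representative, shortening the path on the way.
        while self._parent[item] != item:
            self._parent[item] = self._parent[self._parent[item]]
            item = self._parent[item]
        return item

    def assert_equivalent(self, a, b):
        """Record that terms a and b have the same meaning or referent."""
        self._parent[self._find(a)] = self._find(b)

    def are_equivalent(self, a, b):
        return self._find(a) == self._find(b)


registry = EquivalenceRegistry()
registry.assert_equivalent(("british-places", "Lyons"), ("french-places", "Lyon"))

same = registry.are_equivalent(("french-places", "Lyon"), ("british-places", "Lyons"))
different = registry.are_equivalent(("french-places", "Lyon"), ("french-places", "Paris"))
```

Keying each term by its vocabulary as well as its spelling keeps identically spelled terms from different vocabularies distinct unless an assertion links them.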

NewsML metadata evolution

Just as it is necessary for the lists of permitted values of metadata properties to evolve over time in a controlled manner, so it is necessary to allow for new metadata properties to be recognised over time and supported by NewsML systems, even if they were not explicitly included in the NewsML Version 1.0 specification.

Note: Whereas the decision has already been made that lists of permitted values shall not take the form of enumerations of allowed attribute values within a NewsML DTD, there has been no decision as to whether the list of metadata properties themselves should take the form of element or attribute names within a DTD, or some other mechanism. The stance taken in this Working Draft (which is the recommendation of the NewsML consultant, based on the views he has gleaned during conversations with various IPTC members) is that NewsML should define a set of metadata properties within each metadata category, which should take the form of element types declared within a NewsML DTD, and that there should in addition be a generic metadata element within each metadata category that has a name attribute whose values can be the names of additional metadata properties not explicitly recognised within the DTD. The permitted values of this name attribute should be contained in a controlled vocabulary, to be defined and managed in accordance with the principles to be established (see Controlled vocabularies above for a note on this issue). This two-pronged approach has a number of merits. It allows communities of NewsML users to carry out experimentation in a controlled manner on new candidate metadata properties, while at the same time allowing implementers to build optimised systems that are finely tuned for the efficient handling of that subset of metadata properties that are enshrined in the DTD. After a period of experimentation, a given community might develop its own DTD extension (in the form of entities referenced from within the base DTD), or Schema extension or specialisation in the way defined by the XML Schema specification. This would allow implementers within those communities to enhance or refine their implementations to provide optimised handling of those properties that have been enshrined in the DTD extension or Schema extension.
IPTC may, in the course of time, incorporate some of these new metadata properties that are deemed to be of widespread relevance, within future versions of NewsML itself.
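
The two-pronged approach recommended in the note above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the element names DescriptiveMetadata, Headline, Byline and Property, and the extension name readingTime, are hypothetical, and a plain set stands in for a managed controlled vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical names throughout: NewsML has not yet defined these element types.
DECLARED_PROPERTIES = {"Headline", "Byline"}   # element types fixed in the DTD
GENERIC_ELEMENT = "Property"                   # generic element with a name attribute
EXTENSION_VOCABULARY = {"readingTime"}         # stand-in for a controlled vocabulary

SAMPLE = """
<DescriptiveMetadata>
  <Headline>Storm closes port</Headline>
  <Property name="readingTime">3 min</Property>
  <Property name="mood">gloomy</Property>
</DescriptiveMetadata>
"""

def collect_properties(xml_text):
    """Split child elements into recognised properties and rejected names."""
    recognised, rejected = {}, []
    for child in ET.fromstring(xml_text):
        if child.tag in DECLARED_PROPERTIES:
            # First prong: property declared as its own element type.
            recognised[child.tag] = child.text
        elif child.tag == GENERIC_ELEMENT and child.get("name") in EXTENSION_VOCABULARY:
            # Second prong: generic element whose name is in the vocabulary.
            recognised[child.get("name")] = child.text
        else:
            rejected.append(child.get("name") or child.tag)
    return recognised, rejected

recognised, rejected = collect_properties(SAMPLE)
```

Here the experimental property readingTime is accepted because its name appears in the vocabulary, while mood is reported as unrecognised. Whether a real system should reject or merely ignore such a property is not decided in this draft.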

Association of metadata with data objects

The relationship between a data object and its associated metadata objects is asserted through the existence of appropriate association information. This association information may be implicit or explicit.

Note: The creation of an association between a data object and its metadata through implicit association information might for example consist simply in the fact that the metadata is contained in XML elements that are children of the same parent as the XML element that contains or references the relevant data object. On the other hand, specific association data objects asserting the relationship between the data object and its metadata might for example take the form of XLink elements that contain external pointers to the relevant data and metadata objects, thereby asserting the existence of the association between them. The detailed mechanism or mechanisms to be used, and guidelines on how to choose between them if more than one is made available, will be specified in Working Draft 2 of the NewsML Functional Specification.
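
The difference between the two kinds of association information can be illustrated with a short Python sketch. All element and attribute names below are hypothetical, and the Association element is a simplified stand-in for a linking element; it does not use actual XLink syntax.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, invented for this example only.
SAMPLE = """
<Package>
  <Item>
    <Data id="d1">Story text</Data>
    <Meta id="m1">language=en</Meta>
  </Item>
  <Carriers><Data id="d2">Photo caption</Data></Carriers>
  <Descriptions><Meta id="m2">language=fr</Meta></Descriptions>
  <Association data="d2" metadata="m2"/>
</Package>
"""

def find_associations(xml_text):
    """Return sorted (data id, metadata id, kind) triples."""
    root = ET.fromstring(xml_text)
    found = set()
    # Implicit: a Data element and a Meta element that share the same parent.
    for parent in root.iter():
        for data in parent.findall("Data"):
            for meta in parent.findall("Meta"):
                found.add((data.get("id"), meta.get("id"), "implicit"))
    # Explicit: a separate element that points at both objects by identifier.
    for link in root.iter("Association"):
        found.add((link.get("data"), link.get("metadata"), "explicit"))
    return sorted(found)

associations = find_associations(SAMPLE)
```

In this sample, d1 and m1 are associated only by their position in the document, whereas d2 and m2 sit in different parts of the document and are related solely by the Association element.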

Information collections

Information objects may be composites consisting of multiple component information objects, each including its own metadata, and associated by implicit or explicit association information. The component information objects may in turn be information collections in their own right, to arbitrary levels of nesting.

The ultimate information carriers that are contained in the lowest-level information objects in the collection may be of any media type and may be in any encoding. NewsML will include a specific metadata category whose intent will be to specify the encoding and other physical characteristics of each information carrier, so that it can be appropriately selected, decoded and rendered by a NewsML-aware system.
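
The nesting described above might be modelled as in the following Python sketch. The class name, field names and the "encoding" metadata key are assumptions made for illustration; the actual metadata category for encodings has not yet been specified.

```python
from dataclasses import dataclass, field

@dataclass
class InformationObject:
    name: str
    metadata: dict
    # Component information objects; empty for a lowest-level object.
    components: list = field(default_factory=list)

def leaf_encodings(obj):
    """Yield (name, encoding) for every lowest-level object, depth first."""
    if not obj.components:
        yield obj.name, obj.metadata.get("encoding", "unknown")
        return
    for component in obj.components:
        yield from leaf_encodings(component)

# A collection containing a text report and a nested collection of photos.
package = InformationObject("election-night", {"title": "Election night"}, [
    InformationObject("report", {"encoding": "NITF"}),
    InformationObject("gallery", {"title": "Photos"}, [
        InformationObject("photo-1", {"encoding": "JPEG"}),
    ]),
])

encodings = list(leaf_encodings(package))
```

Because a collection is itself an information object, the same class serves for both the collection and its components, and the traversal works to any depth of nesting.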

Note: It may be appropriate for NewsML to provide a list of preferred encodings for the principal news media types. It has been suggested, for example, that there be a preferred encoding for text information carriers, and that that encoding be NITF. Other approved but not preferred text encodings might for example be raw ASCII, raw UNICODE, namespaced XML with reference to an appropriate DTD, XHTML, etc.

Note: It may also be appropriate for certain encodings to be deprecated. Examples might be HTML not valid according to any of the formal HTML DTDs, etc.

Note: It has been suggested that NewsML develop its own encoding for structured text. However, the overwhelming consensus in the conversations the NewsML consultant has had with IPTC members has been that this encoding should either be NITF or a specified subset of NITF.

Note: It may be appropriate for the NewsML Functional Specification to specify that the MIME-type associated with an encoding should be a required piece of metadata for any information carrier that uses that encoding.

Inclusion, inclusion by reference and exclusion of parts

NewsML will provide a simple mechanism for creating an information collection or information package by specifying an entry point into a web of associated information objects, and providing a set of criteria whereby, starting from the information object designated as entry point, associated information objects are included, explicitly or by reference, within the collection or package.

The process whereby the collection or package will be built up is to locate the association information relevant to the initial information object, apply the specified criteria to determine which associations are to be followed and which not, then to repeat the process for each information object thus located, and so on, recursively, until the rules being followed identify no further relevant associations.
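
The process just described can be sketched in Python as follows. The sketch makes several assumptions that this Working Draft does not: associations are held in a simple dictionary, the criteria are supplied as a function, an iterative worklist replaces literal recursion, and objects already visited are skipped so that circular associations terminate. All names are hypothetical.

```python
from collections import deque

def build_package(entry_point, associations, follow):
    """Collect the objects reachable from entry_point via accepted associations.

    associations maps an object id to a list of (kind, target id) pairs;
    follow(kind, source, target) is the caller-supplied selection criterion.
    """
    included = [entry_point]
    seen = {entry_point}
    queue = deque([entry_point])
    while queue:
        source = queue.popleft()
        for kind, target in associations.get(source, []):
            # Follow an association only if the criteria accept it and
            # the target has not already been included.
            if target not in seen and follow(kind, source, target):
                seen.add(target)
                included.append(target)
                queue.append(target)
    return included

ASSOCIATIONS = {
    "story-1": [("illustrated-by", "photo-7"), ("supersedes", "story-0")],
    "photo-7": [("caption", "caption-7"), ("illustrated-by", "story-1")],
    "story-0": [("illustrated-by", "photo-3")],
}

# Criterion for this example: never follow links to superseded revisions.
package = build_package(
    "story-1", ASSOCIATIONS, lambda kind, source, target: kind != "supersedes"
)
```

The sketch does not distinguish explicit inclusion from inclusion by reference. A fuller version might have the criteria return which form of inclusion applies to each object located.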

Note: It is possible that some combination of XSLT, XLink and XML Query might be useful in specifying the above mechanism. Of these, XSLT is currently an approved W3C Recommendation, XLink is on the point of becoming so, and XML Query is still at an early stage of development. The relative merits and maturity of these specifications will be assessed in order to determine which if any shall be included in the specification of this mechanism. An appropriate statement of principle on this matter will be included in Working Draft 2 of the NewsML Functional Specification.

Revision and representations

NewsML will specify how alternative revisions and representations of the same news story shall be managed and related to one another.

Note: The details of how this mechanism will be specified depend on the specifics of the kinds of metadata that will be made available. It has already been stated in the overview above that the mechanism will depend on appropriate use of metadata and association information. Depending on how mature the metadata specifications are by the time Working Draft 2 of the NewsML Functional Specification is produced, it may be possible to be more explicit in that draft than is possible today.

Text-oriented features of NewsML

It is recognised in the NewsML Requirements document that, while NewsML is itself media-neutral, nonetheless there are certain ways in which text plays a special role. It is a stated requirement of NewsML that it handle these text-oriented features efficiently and effectively.

The text-oriented features of NewsML include

Note: Working Draft 2 of the NewsML Functional Specification will specify how these examples, and perhaps others that will be identified as being important, will be handled. For the moment, it is sufficient to say that the items grouped under the first bullet point above can be treated by NewsML simply as metadata objects, according to the same principles and protocols as all metadata objects. The items grouped under the second bullet point above will be treated by NewsML simply as information containers, according to the same principles and protocols as all information containers. This implies that their encodings must be made known via appropriate associated metadata objects (and this applies equally well whether the encoding is NITF or PostScript). It has already been stated that IPTC may, as part of NewsML Version 1.0, specify which encodings are preferred, and which (if any) deprecated. It is likely that NITF (or a designated subset thereof) will be a preferred encoding for text-based information containers.

Identification of news content

We shall use the term content metadata to designate metadata whose purpose is to provide information about the information content of an information carrier, no matter what encoding the information carrier uses or what media-type is used to render it for human consumption.

Content metadata may occur under a number of different metadata categories. The most obvious example of content metadata is what the Reuters and AFP draft functional specifications both designate by the term ‘descriptive metadata’. This metadata category is explicitly intended to convey information about the information content conveyed by an information carrier. However, this is not the only example of content metadata. In the case of a video of a news report, the journalist’s sign-off statement is part of the information content of the video clip. In this case, the metadata information objects (within the ‘editorial’ or ‘administrative’ metadata category) which identify the journalist and the date and place of the report count as content metadata because they convey the same information as part of the information content of the information carrier itself.

NewsML will provide mechanisms for:

There are many ways in which sections of an information carrier can be designated as being relevant to a particular piece of metadata. These are all dependent on the nature of the information carrier in question, and in some cases are also dependent on its encoding. Some examples are:

NewsML places no limits on the formats or syntaxes that may be used to encode such pointers and references into parts of the information carriers it handles. It will provide mechanisms for stating what encoding is being used. As with the encodings of the information carriers themselves, NewsML may designate some preferred encodings, and/or some encodings that are deprecated, for such pointers. The mechanism itself is generic. The only fundamental restriction is that such an external pointer must be meaningful in relation to the encoding of the information carrier into which it points. In other words, it must be capable of performing its function of identifying a part of the information carrier to which a piece of content metadata relates.

Accountability and confidence

NewsML will provide mechanisms for stating the identity of the person, computer system, organization or authority that creates, transmits or publishes any information object, whether it be an information carrier, a metadata object or an association information object.

It will therefore be possible for recipients of NewsML information packages to form judgements as to the confidence they place in the information they receive, based in part on the identity of the people and/or organisations from which it comes.

Furthermore, NewsML will provide mechanisms whereby a person or organisation may specify the degree of confidence they themselves place in a particular information object. Thus, it would be possible, for example, for a news agency to transmit a story while expressing doubts as to its veracity, or for a photo editor to identify a face in a photograph and state that they believe, but are not certain, that this is the face of a specified person.

This capability of NewsML can be used to attach statements of confidence (or lack of it) to parts of an information package that is transmitted to a customer, for example. It may also be used purely internally, as part of the process of deciding which stories to run, or as part of a research phase of the editorial process, where different stories are being corroborated against one another in order to assess the degree of confidence one wishes to place in them.
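
The internal use described above, deciding which stories to run, might look like the following Python sketch. The numeric confidence scale, the rule of taking the lowest stated confidence, and all names are assumptions made for this example; NewsML has not yet defined how confidence is to be expressed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConfidenceStatement:
    target: str        # identifier of the information object being assessed
    asserted_by: str   # person or organisation expressing the confidence
    level: float       # assumed scale: 0.0 (no confidence) to 1.0 (certain)

def runnable_stories(statements, threshold):
    """Return targets whose lowest stated confidence meets the threshold."""
    lowest = {}
    for s in statements:
        lowest[s.target] = min(s.level, lowest.get(s.target, 1.0))
    return sorted(t for t, level in lowest.items() if level >= threshold)

statements = [
    ConfidenceStatement("story-1", "agency", 0.4),
    ConfidenceStatement("story-1", "news-desk", 0.9),
    ConfidenceStatement("story-2", "news-desk", 0.8),
]

to_run = runnable_stories(statements, threshold=0.7)
```

Recording who made each statement alongside the level keeps the confidence judgement itself accountable, in line with the accountability mechanisms described in this section.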

Authentication and security

Note: It is a requirement on NewsML that it be able to handle authentication and security. The XML Signature Working Group at W3C has defined a set of requirements on digital signature and authentication of XML documents. These requirements are well thought through and likely to be sufficient to cover the needs of the news industry. They include, for example, the ability to sign and authenticate not only an XML document as a whole, but any subelement within it. Thus it would be possible, for example, to have certain information objects within an information package signed by one person or authority, an information collection containing them signed by another, and the package as a whole signed by yet a third. One option for NewsML would be simply to make a normative reference to the XML Signature work, and then carry out a review of the specification when it has been ratified by W3C to see whether IPTC, or NewsML itself, wishes to specify any extensions to it or caveats or guidelines regarding its use. This would have the merit that IPTC would not need to devote effort to working in parallel with W3C on this important issue that is not specific to the news industry but concerns all those wishing to conduct business using XML on the Web. However, this approach goes against the guidelines included in the NewsML Encoding Decisions document regarding the criteria for adoption by NewsML of an external standard or specification. It may be that there is a strong enough case for overriding the guidelines of the NewsML Encoding Decisions document in this instance.

NewsML Conformance

Given that the scope of NewsML metadata is considerable, and may grow as NewsML evolves, it would be inappropriate to require every NewsML system to support every NewsML metadata category. It is clearly of no relevance for a radio broadcasting system, for example, to support metadata relevant to the rendering of photographs. However, it is a requirement of every conformant NewsML system that it support at least the minimum required metadata as defined by the NewsML specification.

Note: It might be appropriate to define a number of ‘NewsML conformance classes’ for different kinds of NewsML environment. For example, it might be that newswire services would require NewsML systems to be of "conformance class X", while TV broadcasting stations would require their NewsML systems to be of "conformance class Y". The different conformance classes would be defined in terms of the metadata categories that the systems complying with them must recognise and know how to handle.
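
If conformance classes were defined as sets of metadata categories, as the note above suggests, a conformance check could be as simple as the following Python sketch. The class labels X and Y come from the note; the category names and the contents of each class are invented for illustration.

```python
# Hypothetical conformance classes: each lists the metadata categories a
# system must recognise and know how to handle.
CONFORMANCE_CLASSES = {
    "X": frozenset({"administrative", "descriptive", "text-rendering"}),
    "Y": frozenset({"administrative", "descriptive", "video-rendering"}),
}

def missing_categories(supported, conformance_class):
    """Return the required categories that a system does not support."""
    return sorted(CONFORMANCE_CLASSES[conformance_class] - set(supported))

newswire_system = {"administrative", "descriptive", "text-rendering"}

missing_for_x = missing_categories(newswire_system, "X")
missing_for_y = missing_categories(newswire_system, "Y")
```

A system conforms to a class when the returned list is empty. In this example the newswire system meets class X but lacks one category required by class Y.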

Glossary

association data object any data object that encodes association information

association information the information that conveys the existence and nature of an association between any number of data objects and/or metadata objects, wherever that information is physically held

association metadata metadata associated with association data

content metadata metadata whose purpose is to provide information about the information content of an information carrier, no matter what encoding the information carrier uses or what media-type is used to render it for human consumption

controlled vocabulary a list of permitted terms that may be used for a particular purpose; the content of a controlled vocabulary is not fixed; however, changes to it can only be made according to a specified and well-managed process

data object a logical unit of data, whether or not it is stored as a single file or transmitted as a single data stream

explicit inclusion the physical inclusion of the entire data content of an information object within an information package

implicit association information association information that is implicit in the structural relationships between the data objects themselves rather than being represented by association data objects

inclusion by reference the virtual inclusion of an information object in an information package resulting from the physical inclusion of a pointer or other form of reference data, sufficient to enable a NewsML-aware system to retrieve the actual information object at some later point in time, if it is required

information carrier a data object whose data content is intended to convey information (such as news content) to human recipients

information carrier an electronic object that encodes information intended to convey information (news content) to human recipients

information-carrier metadata information associated with an information carrier that is not part of its information content

information collection an information object comprising any number of information objects that are intended to be used together for any purpose; the components of an information collection are zero or more information objects (data objects plus associated metadata), at least one additional item of metadata associated with the information collection itself, and any relevant association information defining the relationships between the information objects that make up the collection; an information collection is itself an information object, and any or all of its component information objects may themselves be information collections

information content the information carried by an information carrier and intended for human consumption

information envelope information to be included in an information package in addition to the actual information objects it contains

information object a data object plus its associated metadata; another way of putting this is to say that an information object is a data object that is able to be used appropriately by a system; this follows from the fact that metadata is defined as data associated with a data object with the intent of enabling a system to handle it appropriately

information object type a class of information object to which certain metadata categories are relevant

information package a physical file or data stream that is used to store or transmit one or more information objects or information collections

metadata data that is associated with a data object with the intent of enabling a system to handle that data object appropriately; the system may be a computer application, a business process handled by human beings, or some combination of the two

metadata category a class of metadata objects that it is useful to consider as a group because the objects that fall within that category are relevant to a particular aspect of the business purpose or practical use to which NewsML information objects may be put

metadata object a data object consisting of metadata

metadata property that aspect of an information object that is represented by a particular metadata token; put another way, a metadata property is a class of information conveyed by metadata tokens of a particular type

metadata token a metadata object of a particular kind, falling within one of the supported metadata categories

minimum required metadata the subset of metadata properties that all NewsML information objects must have