UTF - An SGML Standard for the News Distribution Industry


by Dave Becker

Mead Data Central
daveb@meaddata.com
August 15, 1994


Abstract

In June, 1992, a working subcommittee was established to create an industry standard for the interchange of textual material between news agencies and their clients (primarily newspapers) that would replace the current standard IPTC 7901 and ANPA 1312 formats. The new standard is called the Universal Text Format (UTF). After significant discussion, SGML was adopted as the encoding language for the new standard. Members of the working subcommittee are now attempting to finalize and prototype the new standard in selected test environments. The purpose of this paper is to describe the context within which the UTF was developed, the standard itself, and plans for future development.


1. Introduction

Modern transmission of information dates to the development of the telegraph and teletype. Inherent in past development was the concept that the information would appear on a printed page, either in the traditional newspaper or on paper for a news reader at a radio or television studio. Until now, most concerns about information delivery have centered on how the information would appear on paper, even if the printing device was only a teleprinter.

This document discusses the development of a device-independent file format for the transfer of textual information in the news industry. This file format is called the Universal Text Format (UTF). It is intended to be used with the Information Interchange Model (IIM), an electronic envelope convention for the exchange of information which has been adopted by the International Press Telecommunications Council (IPTC) and the Newspaper Association of America (NAA). The UTF is being developed under the direction of the IPTC and the NAA.

2. The Current Distribution Process and Formats

Senders and Receivers

The news distribution industry can be characterized by "senders" and "receivers". The "senders" constitute a number of local, national, regional and international news agencies or wire services. Their primary role is to collect, edit, and disseminate (send) news information. Some of these news agencies specialize in particular information domains such as sports, financial, political or hard news, while others are generalized, covering any number of different news domains. The news information is collected in many different forms ranging from text entered by reporters in the field, to photos, graphics, spreadsheets and even audio and video. Some of the major news distribution organizations involved in the effort to develop the UTF standard are Associated Press, Reuters, United Press International, Canadian Press, Agence France Presse of Paris, Press Association of London, and Deutsche Presse Agentur.

The "receivers" are those organizations that are at the receiving end of a news transmission. Their intent is to repackage the information that has been collected and disseminated by the news agencies into products that are targeted toward their selected markets. These include newspapers, broadcasters, magazines, newsletters, database services and a host of specialized service providers. Some of the major news service users involved in the UTF standard development effort are The New York Times, Dow Jones, The Miami Herald, Chicago Tribune, Star Tribune of Minneapolis, Apalachicola Times of Florida, Mead Data Central, and others. Some organizations, such as The New York Times, Chicago Tribune and Dow Jones, are also news providers.

While the functions of sending and receiving are still clearly defined, the line between organizations that are strictly senders or receivers is beginning to blur as many organizations perform both functions. Organizations such as Dow Jones and The New York Times are both senders and receivers, and are involved in both ends of the news distribution process.

Current Text Distribution Formats

There are a variety of text formats in use today by the news agencies, but the two news wire formats that have been most widely adopted by senders and receivers are:

  • IPTC 7901, the standard of the International Press Telecommunications Council
  • ANPA 1312, the standard of the American Newspaper Publishers Association (now the Newspaper Association of America)

Both of these formats incorporate a significant amount of identifying, descriptive, filing and routing information associated with the actual text of the news story. This information is typically incorporated into a news transmission as part of a rigidly structured header having a set of concatenated fields presented in a rigorous sequence with predefined field separators and terminators. The contents of the fields are usually specified according to either a style guide adopted by each individual news agency or a set of standards established by the agency's industry standards organization (IPTC or ANPA/NAA). Both the IPTC 7901 and the ANPA 1312 formats are 7-bit character code formats based on the ISO 646 IRV coded character set for information interchange.

The text portion of a news wire transmission, whether IPTC or ANPA, is typically sent in a plain, relatively unstructured format (e.g., IPTC Unstructured Character Oriented Format). Such plain formats rely on traditional typographical conventions to convey any structural information that may be present. For example, carriage returns, line feeds and indentations are used to identify new paragraphs. Or, for tabular material, carriage returns and line feeds are used to introduce new rows, while tab characters are used to position to the beginnings of individual preset columns within a row.
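
To illustrate (the story text and figures here are invented), a plain-format transmission conveys structure only through such typographical cues, shown below with <CR>, <LF> and <TAB> made visible:

    The council approved the budget late Tuesday.<CR><LF>
    A final vote is expected next week.<CR><LF>
    <CR><LF>
    Smith<TAB>12<TAB>7<TAB>31<CR><LF>
    Jones<TAB>10<TAB>4<TAB>22<CR><LF>

A receiving system can only infer from these cues which lines are paragraphs and which are table rows; none of that structure is explicitly labeled in the data.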

In some cases senders and receivers are not using IPTC 7901 or ANPA 1312 at all, but much more primitive formatting schemes based on 5-bit and 6-bit coded character sets, some of which support only mono-case (e.g., all upper case) transmissions. In other situations, senders and receivers have agreed to transfer more highly formatted information; in those instances elaborate escape sequences are frequently employed with custom-designed character sets to permit the representation and rendition of specialized publishing symbols, fonts, soft hyphenation and spacing, rules and borders, and many other typographical conventions.

3. The Information Interchange Model (IIM)

In the late 1980's and early 1990's with the advent of digital photo technology, it became obvious that the old transmission formats would not be adequate to handle the new information media that were becoming available. So the IPTC and the NAA joined forces to develop the Information Interchange Model (IIM). The intention of the IIM was to produce an electronic envelope capable of transferring files derived from different sources of information between any of the various computer systems used in the news distribution industry.

The OSI Model

This electronic envelope is based on the widely accepted seven-layer Open Systems Interconnection (OSI) model for exchanging information electronically between computer systems. While the lower five or six layers of this model are filled in by products and standards generated by other organizations, it remains the responsibility of the information senders and receivers to define the model that they will be using at the highest layer of the OSI for their applications. The model that has been adopted is designed to provide for universal communications embracing all types of data, including text, photos, graphics, audio, video, etc., all on a single network or storage medium.

Databases and Datasets

Within the application layer of the OSI model, the IIM has adopted a "database" approach to establishing the electronic envelope for the news information being interchanged. Note that "database" is used here generically, and does not refer to any specific commercial systems or database providers. The IIM database model assumes that the sender wishes to transfer a binary data object (text, photo, graphic, audio, video, etc.). The data object is in some format appropriate to the information being transferred.

Surrounding the data object is an envelope within which can be found information that is associated with the data object. This associated information provides identification and description and facilitates routing and processing of the object. Each piece of information associated with the object is set up as a self-identifying "dataset". These datasets can occur in either a binary form readable only by a computer, or in a coded character form that is readable by both humans and computers. The envelope has relatively few required pieces of information. Only those datasets required for a given application are mandatory. Other datasets are optional and are utilized only when the sender and receiver deem it necessary to do so.

Records

The IIM organizes the data object and its accompanying datasets into a series of database "records" that provide structure to the application that is working with the transmission of the object. Allowance has been made in the IIM for nine different records, but only six are currently defined and being used. The records of the IIM are as follows:

Table 1: IIM Record Structure


Record #        Record Name                 Record Description
--------        -----------                 ------------------
Record 1        Envelope Record             Routing and Identity; Envelopment
Records 2-6     Application Records:
  Record 2      Application Record No. 1    Pertinent Editorial Information about the Object
  Record 3      Application Record No. 2    Image Parameters related to Digitized Image Objects
  Records 4-6   Undefined
Record 7        Pre-Object Descriptor       Estimate of Object Length
Record 8        Object                      The Data Object
Record 9        Post-Object Descriptor      Actual Object Length as Transmitted

The flexibility of this envelope model is such that it will permit users to encapsulate other formats which then can be fully reconstituted from the IIM datasets. For textual data objects this flexibility will enable news information senders and receivers to more easily migrate their existing systems to the IIM by providing a method for converting back to the existing IPTC 7901 or ANPA 1312 wire formats. This approach to the IIM does, however, require that the receiving system incorporate the ability to directly process the IIM envelope or carry out the back conversion procedure that reconstitutes the older formats. It will typically be the responsibility of the news agencies to help the receivers do this as part of their transition to the IIM.

4. Development of the UTF

The initial attention of the working subcommittees of the IPTC and NAA was focused on an overall definition of the electronic envelope scheme and a specific definition using this scheme to support transfer of digital newsphoto files. The original versions of IIM and the Digital Newsphoto standards were completed and approved in January, 1992. The current versions are revisions that were approved in April of 1993.

The original plans were to follow up these standards with the definition of a separate electronic envelope to support the transfer of graphics files. Instead, an implementation guideline was developed through the IPTC to accommodate graphics within the IIM without requiring modification of the envelope. Reuters is now using this scheme to transfer Apple Macintosh-based graphics files. Work on the graphics standard has been followed by work on an electronic envelope to support transfer of digital audio files and other multimedia objects.

While work on graphics and digital photos and other multimedia standards was necessary to address some immediate industry problems, the great bulk of the news distribution industry is based on the distribution of stories that are in "text" format. So, in June of 1992 a working subcommittee was set up to study text objects and to define a format to be used for transferring text files. This new text file format was dubbed the Universal Text Format (UTF), and the working subcommittee came to be called the New Text Subcommittee.

The standards development process being followed by the New Text Subcommittee is roughly composed of four major steps:

  • Analysis of the problems with the current formats and definition of requirements for the new format
  • Investigation and selection of an encoding scheme capable of meeting those requirements
  • Definition of the standard itself (the DTD and its accompanying conventions)
  • Prototyping, review and approval of the standard by the sponsoring organizations

While the process as presented here appears to be very simple, logical and systematic, in practice it has been very difficult and messy. Steps were occasionally executed out of sequence. Backtracking occurred to cover issues that had previously been decided, but that were reopened to accommodate new information, perspectives or alternatives. Progress was very uneven, moving ahead rapidly in some areas, and lagging in others. All of the members of the working subcommittee had other work to do in addition to the standards effort, and their ability to commit time and budget to investigate issues, author portions of the standard and attend working meetings varied greatly during the course of development. Nevertheless, the standard has progressed apace, and the hard work is now beginning to bear fruit.

5. Requirements for a New Text Markup Scheme

The analysis work identified a number of major problems from which have emerged the requirements that the new text format will have to address. These requirements fall loosely into the areas of: device independence, avoidance of proprietary standards, data intelligence and robust markup, ease of customization, markup minimization, availability of cost-effective off-the-shelf tools, and compatibility with new coded character set standards. Let's look at each of the requirements in turn.

Device Independence

The text portion of news transmissions using the current IPTC 7901 and ANPA 1312 formats is largely unstructured and clearly oriented toward its eventual rendition in printed form. While accommodation of the printed form is still very important, the news distribution industry can no longer tie itself exclusively to the printed-page model. First, the vastly increasing flow of information requires that computers handle the data to enable easier access to the information by those who must work with it. And second, the users of the news information contained in news agency transmissions are no longer rendering that information exclusively in print products.

Thus, any new text format must permit receiving systems to process the information precisely, regardless of whether its final presentation is as newspaper pages, videotex, audiotex, radio, television, information retrieval, or something else. If the data is properly marked up and arranged, then that data can be converted into any type of electronically-driven presentation format for any type of service or device.

To be device-independent, "presentation-specific" information must be avoided, because presentation details (e.g., line length, bold face, italics, color, font, pagination, margins, etc.) mean something different to each of the types of services and devices for which the output is targeted. Adoption of the concept of device independence is therefore intended to result in information whose structure and content are well defined, and where the receiving system or device knows what to do with each piece of information it receives.
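
The distinction can be seen in a simple contrast (the tag names below are purely illustrative and are not drawn from any standard):

    Presentation-specific:   net income rose to <bold>$4.2 million</bold>
    Device-independent:      net income rose to <money currency="USD">4.2 million</money>

The first form tells one particular class of device how to render the phrase; the second identifies what the phrase is, leaving each receiving service or device to decide how, or whether, to render it.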

Avoidance of Proprietary Standards

The number of different types of hardware devices and software packages currently used by the news distribution industry is quite large. It is not the intention of the UTF to foster the advancement of any one particular hardware or software vendor or product over another. Indeed, doing so might lock out the natural market forces of competition that help fuel innovation and help maintain reasonable price levels.

Thus, any new text format should not be based on any one of the many proprietary standards (e.g., Microsoft's RTF, IBM's GML or Adobe's PDF) that are used by specific vendors to encode data on their systems.

Instead, the objectives of the new text format would be best accomplished by a non-proprietary standard. It is more likely that a non-proprietary standard would be acceptable to the larger number of senders and receivers in the news distribution industry, each of which is pursuing its own objectives and has its own preferences.

Data Intelligence and Robust Markup

More and more receivers of news transmissions are calling for increasing levels of value in the streams of text that they are receiving. One way that value can be added is in the form of intelligence embedded in the data (manifesting itself as specialized content markup or tagging). Higher levels of intelligence in the data allow the receivers to innovate with new products, and to increase the level of sophistication in the products that they already offer to their customers.

It is important to recognize that some specialized content markup or tagging is already beginning to be inserted in text by some senders and receivers to meet the needs of particular end users. But pressure is building for much more.

There are a number of specific applications where a text format with significantly greater intelligence would be expected to offer advantages over current news wire transmission formats:

  • Tabular Material - This would include highly structured and fielded information whether appearing in a formal columnar format, or in some non-columnar format that is like text (i.e., occurring in an unjustified, concatenated stream, as is frequently the case in a textual presentation of sports statistics or election results). It is generally accepted that the current means of sending tabular data on news wires are woefully inadequate for making that data reusable.
  • Information Identifiers - This would include identification of selected classes of proper nouns (e.g., companies or organizations) and assignment of normalized identifiers that are associated with them (e.g., industry groups, ticker symbols or agency-specific codes). These could then be used to build indexes or customized reports, for example.
  • Linkages - This would include the relationships that exist between separately transmitted news objects:
    • Small-to-Big: e.g., a brief to the full story
    • Medium-to-Medium: e.g., a photo or audio clip to its associated text item
    • Time-A-to-Time-B: e.g., an update to a previous version of a story
  • Specialized Formats - This consists largely of text data types that cannot be sent easily today in a format that makes them reusable (e.g., television scripts, television listings or daybooks).

In each of the above cases, the sending agency is likely to transmit in a standardized format, and then use a conversion routine at the receiving site to deliver the data in the format desired by the receiver. For example, in the case of tabular material, some sites might want to receive it as a common spreadsheet file, others in a database format, and still others in plain old IPTC 7901 or ANPA 1312.

Not only do the applications listed above lend themselves well to tagging of some kind, but they also originate mainly either from machine-generated data sources or from within specialized editorial desks. Any new text encoding scheme is most likely to be implemented first in these controlled environments, long before it is propagated to general editorial work.

Any new text format must be able to encode all of the data intelligence discussed above, and potentially much more. It must accommodate deep levels of encoding in a way that makes the data intelligence easily accessible by both the humans and computers that will be processing the information. And it must facilitate the validation of the markup to ensure high quality.

Ease of Customization

Because of the large number of senders and receivers in the news distribution industry, and the variety of cultures and information domains that they represent, one standard could not possibly cover all of their needs. It is expected that there will naturally arise considerable pressure for specific pairs or groups of senders and receivers to customize the new text format to meet their diverse needs. Given that such a situation will arise regardless of the best intentions of the participating organizations to standardize and conform their transmissions, it would be desirable to accommodate customization in a standard manner.

Thus, any new text format must include mechanisms that will facilitate customization in a number of specific, but standardized, ways.

Markup Minimization

The real world of news delivery currently dictates that information is likely to be delivered through some serial and relatively slow-speed medium such as satellite broadcast, ISDN, or leased asynchronous circuits. The amount of time it takes for a reporter in the field to enter the information or for a receiving system or human to process the information, when added to the amount of time it takes to transmit the information, is often very important. In many situations (e.g., breaking news stories, changes or closes in financial markets, or looming broadcast and publishing deadlines) seconds become very critical.

Thus, any new text format must not impose unreasonable overhead either in the entry, transmission or processing of the information.

Availability of Cost-Effective, Off-the-Shelf Tools

One of the main concerns about implementing any new tagging scheme is whether off-the-shelf applications are available for generating and reading the tagged text. If they are not available, then adoption of the new standard will be considerably slowed due to the need to develop the applications in-house, to contract for their development externally, or to wait for the commercial software community to develop them.

Additionally, the news distribution organizations that will be purchasing these applications are currently under significant cost pressures. They are constantly looking for cost-effective ways to meet their needs. If these needs can be supplied in a timely manner at reasonable cost through purchase of already existing software tools, then the organizations will not have to resort to in-house or contracted development.

Thus, ideally the new text format should already have a suite of application software readily available at reasonable cost.

New Coded Character Sets

As mentioned previously, the existing IPTC 7901 and ANPA 1312 formats are based on ISO 646, a 7-bit coded character set for information interchange. Several other older or proprietary character encoding schemes are also still in use. There has emerged in the past several years a new set of 8-bit and 16-bit character encoding schemes. One of these is ISO 8859, a series of 8-bit single-byte coded graphic character sets for encoding the alphabets and symbols of several major European languages. For example, ISO 8859-1 (also called Latin-1) comprises 191 graphic characters covering most of the requirements of the Western European languages, including English, French, German, Spanish and Italian. Other parts of ISO 8859 cover other sets of languages and alphabets, including those of Scandinavia, Eastern Europe, Russia, Greece and the Mediterranean. Another new standard, Unicode, is a 16-bit, 2-byte character encoding standard that covers most of the languages of the world, including in particular the CJK languages: Chinese, Japanese and Korean.

Any new text format must be able to accommodate data streams that are encoded using these new coded character sets. The new format must also accommodate inclusion of foreign text coded in a language and possibly even an alphabet that is different from the native language and alphabet of the text.

6. The Universal Text Format

SGML and the UTF

[Note: This paper will not provide a description of SGML. It is assumed that the reader is already familiar with SGML or can refer to the large number of excellent descriptive materials available on the subject.]

Given an understanding of the requirements that the new text standard would have to meet, the next step was to investigate and select among the available alternative representation schemes.

From the very beginning there was strong interest in SGML as the language to use for the creation of the UTF. It clearly supports attainment of the objectives of device independence, avoidance of proprietary standards, and deep and robust encoding as described above. There are numerous other successful standards development efforts that are based on SGML. A superficial look was taken at the Open Document Architecture (ODA), but there was little commercial software available to support ODA, and the architecture of the standard would potentially conflict with that of the IIM.

There is an existing and rapidly growing suite of SGML software available on the market. Prices for this software are coming down, albeit slowly. Several of these vendors have met with the working subcommittee. More importantly, vendors of existing word processing and desktop publishing software (Microsoft, WordPerfect, Frame and Interleaf) that is widely used in the news distribution industry are now moving aggressively to incorporate SGML capability in their existing products. And some of the vendors of traditional newsroom editorial systems (namely SII) are also starting to show an active interest.

SGML includes or accommodates mechanisms and techniques that will permit markup minimization and customization, as well as the ability to use or incorporate other coded character sets. Oddly enough, of all of the requirements, markup minimization is the one that was the stumbling block to adoption of SGML, primarily because the encoding of tables using full-blown SGML markup rendered the tables almost totally unreadable and would have imposed unacceptable transmission overhead.

A demonstration was undertaken in late 1993 of table markup minimization using the SGML facilities of OMITTAG, SHORTREF and text entities. When it was finally demonstrated that SGML could adequately address the table markup requirement as well as all of the other requirements described above, then the working subcommittee and the sponsoring organizations acceded to the adoption of SGML in early 1994 as the encoding language for the UTF.

Given the selection of SGML as the encoding language for the UTF, the way was then clear to begin the work of defining and reviewing the basic components of the UTF using SGML. This has involved the creation of an SGML Document Type Definition (DTD) for the UTF, as well as the definition of a series of SGML conventions that permit the attainment of the requirements listed above. The SGML DTD and conventions are discussed below.

Overall Structure of the UTF

A basic set of SGML element and attribute declarations was provided by Mead Data Central (MDC) as a starting point for the DTD. From the MDC starting point, the organization of the UTF has evolved considerably, and the SGML elements can now be grouped into four major categories:

  • The Structural Element Set
  • Tables
  • The Non-structural Element Set
  • The IIM Datasets

Many of the elements and attributes in the original MDC element set and the newer UTF extensions were borrowed in part from early published work conducted by the Text Encoding Initiative (TEI), an SGML standards development project whose aim was to provide coding standards for literary and linguistic computing. The TEI was sponsored by the Association for Computational Linguistics, the Association for Literary and Linguistic Computing, and the Association for Computers and the Humanities, with much of the work being directed from Oxford University and the University of Illinois at Chicago.

Inspiration has also been drawn from three other sources. The first source is the Association of American Publishers (AAP) NISO standard and proposed ISO standard (ISO 12083) for Electronic Manuscript Preparation and Markup. The second source is the CAPSNEWS and ICADD DTDs, two SGML DTDs written in Europe for newspapers. These two DTDs were written with special considerations for rendition into forms (i.e., Braille, large print and computer voice) that provide support for the visually impaired. The third source is CALS (Computer-aided Acquisition and Logistics Support), an SGML initiative of the U.S. Department of Defense.

The Structural Element Set

The elements that compose the top-level structural markup of the UTF are really quite few, and they are relatively simple in their structure. This is because the text of most individual news agency transmissions is usually quite brief. Large complex transmissions are relatively rare and primarily limited to tabular material. These elements are called structural elements because they delineate the overall organization of the material in the document at and above the level of the paragraph.

The structural element set is organized as follows:

  • The IIMUTF and File Elements - At the highest level is the IIMUTF element. This is the basic unit of the UTF and represents the text data object that is transmitted in Record 8 of the IIM. The IIMUTF element contains zero or more IIM Dataset elements followed by a single File element.

  • The IIM Datasets - Textual representations of any of the datasets from the other records of the IIM can be encoded in the UTF. In many cases these datasets are already textual in nature, and can be represented in the UTF in a relatively straightforward manner. In other cases these datasets are in binary format, so that a textual representation of them will be difficult. Much work still remains on exactly how to represent these binary fields.

  • The File Element - The File element is considered to be the basic publishable chunk of text in the UTF. The File element can consist of one or more Text Blocks and IIM Datasets.

  • Text Blocks - Multiple Text Block elements are needed when a single transmission consists of a collection of news stories, such as news briefs or sports roundups. Each story is usable without the other stories in the transmission and can potentially stand on its own. But each story is typically related to the other news stories in the collection because they are all part of coverage of a common event, topic or region. A Text Block can also be included recursively within another Text Block to permit arbitrary subdivisions of the material. Text Blocks have a type attribute that indicates whether they are standard publishable material, a publishable advisory or a non-publishable advisory. Within a Text Block can be found Paragraph Header, Paragraph, List and Table elements appearing in any order and number in addition to IIM Datasets.

  • Paragraph Level Elements - IIM Datasets, Paragraph Headers, Paragraphs, Items within Lists and Entries within Tables in turn consist of Parsed Character Data (PCDATA), the most basic data type in SGML, and/or any of the non-structural elements described below.
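
As a rough sketch of how these structural elements nest (the tag names below are paraphrased from the descriptions above rather than quoted from the draft DTD), a minimal single-story transmission might look like this:

    <iimutf>
    <dataset type="headline">Council Approves Budget</dataset>
    <file>
    <textblock type="publishable">
    <parahdr>CITY HALL</parahdr>
    <para>The council approved the budget late Tuesday.</para>
    <para>A final vote is expected next week.</para>
    </textblock>
    </file>
    </iimutf>

A multi-story roundup would simply carry additional Text Block elements, possibly nested, within the same File element.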

Tables

The tabular portion of the MDC element set that the UTF used as its starting point was the CALS Table DTD. The reason for the adoption of the CALS Table DTD was that, until the last year or so, the only good SGML table-editing software available in the marketplace was hard-coded around the CALS or the AAP DTDs, and CALS was most appropriate for MDC's needs. That software availability situation is now beginning to change, so it is expected that evolution of the Table portion of the UTF DTD away from the CALS model will not necessarily preclude the acquisition of software to work with it. The CALS model is still a reasonably good model for the UTF, as long as all of the provisions for presentation that appear in the attributes are removed or ignored, and some special accommodations are made.

Table markup is organized as follows:

  • Tables and Table Groups - A Table element is composed of one or more Table Group elements. Table Groups permit the construction of multi-part Tables, a situation that occurs frequently, especially in sports (box scores). Interspersed with the Table Groups, but still part of the Table can be found Paragraph Headers and Paragraphs.

    Each Table and Table Group has a style attribute that identifies what category of table it is, and thus what its columnar structure, headers and entry formats should be. There are other attributes that define the number of columns in the Table Group, and defaults for the format type and alignment of the contents of the Entries in the Table Group.

  • Column and Spanner Specifications - Each Table Group will contain a set of Column Specifications and Spanner Specifications. There will be one Column Specification and one Spanner Specification for each column and spanned group of columns (e.g., for headers that span more than one column) that appear in the Table Group. These are empty elements that exist just to supply attributes.

    The attributes associated with each specification supply identifying name and position number of the column or span within the Table Group, as well as defaults for the format type and horizontal alignment of the contents of the Entries in the column. The formats of the Entries can be alphanumeric, integer, decimal, fraction, sign, or some mixture of those five. The horizontal alignment can be either left, right, center, justified both right and left, aligned on some character, or stream (meaning that the Entry contents are concatenated with the contents of the previous Entry). Also, if one column is closely related to another column for some specific reason, a name reference to the related column can be included in a separate attribute.

  • Header, Body and Footer Elements - The substance of a Table Group is composed of an optional Table Header element, one or more Table Body elements, and an optional Table Footer element. The Table Header is composed of one or more Row elements, and can contain overrides of the Column Specifications already presented for the entire Table Group that apply only to the Header. The Table Body elements are composed of Rows. The Table Footer element is structured exactly like the Header, except that it is typically reserved for the inclusion of totals and/or subtotals. This implies that if a Table is to have multiple subtotals, then one option is to construct it from multiple Table Groups. These Header, Body and Footer elements all have an attribute that supplies the vertical alignment of the content that occurs within their Entry elements, whether the content should be positioned toward the top, in the middle or toward the bottom of the row.

  • Rows and Entries - Each Row element is composed of one or more Entry elements. The attributes associated with each Entry can be used to override those attributes supplied in the Table Group and Column Specification elements. Another Table Group element can be inserted in the place of an Entry element, thus permitting the recursive inclusion of tables within tables. It is also possible to stipulate in an attribute that a given Entry (or embedded Table Group) can span more than one row.
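
A sketch of this structure for a simple two-column box-score fragment follows (the tag and attribute names track the CALS table model on which this portion of the UTF is based; the exact UTF spellings and defaults may differ, and the data is invented):

    <table>
    <tgroup style="boxscore" cols="2">
    <colspec colname="player" colnum="1" align="left">
    <colspec colname="points" colnum="2" align="right">
    <thead>
    <row><entry>Player</entry><entry>Points</entry></row>
    </thead>
    <tbody>
    <row><entry>Smith</entry><entry>31</entry></row>
    <row><entry>Jones</entry><entry>22</entry></row>
    </tbody>
    </tgroup>
    </table>

Written out in full like this, the markup clearly overwhelms the data; the minimization techniques discussed later in the paper are what make such tables practical to transmit.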

It is anticipated that a significant amount of minimization and customization will be used for tables. The topics of minimization and customization, particularly as they relate to tables, are discussed at length later in the paper.

The Non-structural Element Set

The working subcommittee had a clear sense that the markup scheme they were creating was intended to add value to the text. It was expected that this value would be added through better and more extensive identification (via markup) of the information components that occur in the text. The intention was to make this information much more reusable by the computer systems of the receivers involved in the transmission.

Of the markup schemes that were investigated, there were very few that could provide deep encoding. The only really extensive and well thought out approach was the work done by the TEI. The TEI approach was the inspiration for the MDC element set and so became the starting point for the non-structural element set.

The non-structural element set is a large group of elements that are called non-structural because they occur in unstructured text, and because they apply to individual words and character sequences below the level of the paragraph. These elements, when combined with Parsed Character Data (PCDATA), compose the content of the structural elements: Paragraph Headers, Paragraphs, List Items and Table Entries. Also, the non-structural elements are typically bound to no particular position in the overall document structure.

The non-structural elements can be loosely grouped into the following categories:

  • Text Characteristics - The Text Characteristics elements include superscripts, subscripts and emphasized text objects. Emphasized text objects have a level attribute associated with them that permits the generic representation of multiple emphasis conventions, the exact rendition of which (e.g., bold, italics, underline, quotes, highlight, font family, size, etc.) can be determined and applied at the time of presentation.

  • Temporal Markup - Temporal Markup elements include dates, times, ranges of dates or times, days of the week, and singular or reoccurring events having names. Most of these elements have a set of attributes that are intended to facilitate the proper processing of the elements when a system encounters them and attempts to process them with other instances of the same element.

  • Names - Name elements include titles of works, names of organizations, names of products, and names of people. Markup that is intended to be supplied for the titles is applicable to the following types of works: books, magazines, articles, plays, movies, songs, etc. The types of organizations whose names are intended to be marked up include companies, corporate divisions, schools, universities, churches, foundations, political parties, unions, consumer groups, sports teams, etc. The markup for people names includes subelements for first, middle and last names and initials, prefixes, suffixes, titles, and indexable portions of any of the people name subelements or combinations of the people name subelements. Various person name attributes are available for providing information about the person such as sex, status, role, aliases, nicknames, and normalized forms for retrieval purposes.

  • Language - The Language elements are intended to indicate the presence of domain or language specific content. These elements include foreign language text that is different from the native language of the containing text object, technical terminology specific to a particular domain, and pronunciation guides. These elements have a set of attributes that supply additional information such as the language of the foreign text, the type of specialized domain and the system of pronunciation that is being employed.

  • Locations - The Location elements are intended to provide markup that indicates the presence of place names and addressing information. This set of elements includes specific elements for geographic names, compass directions, geo-political subdivisions, building, room and street addresses, various forms of structured and unstructured addresses constructed from the preceding elements, and various postal and electronic mailing addresses. Attributes supply descriptive information about the type of the element.

  • Numbers - Number markup is intended to identify numeric values that are present in the text in all of their varied forms. Specific elements are included that indicate numbers used for identification purposes, numbers used for measurement purposes, and ranges of either kind of number. Identification numbers can be of several different types such as phone #'s, fax #'s, social security numbers, product numbers, invoice numbers, etc. Measurement numbers include sub-elements that identify what is being measured, the actual count of what is being measured, and the unit of measurement. Various type, standard, sign, exponent and normalization attributes are available to supply the appropriate information.

  • Annotations and References - Annotation and reference markup is intended to permit the inclusion or recognition of linked components within a document. These linked components can include headnotes, footnotes, endnotes, bibliographies, anaphoric or pronoun references, section references, copyright notices, and other referencing conventions. The markup includes elements for the material being referenced, the reference itself, and the enumerated form of the reference such as a footnote number or page number. These elements make use of the ID and IDREF conventions of SGML.

  • Embedded Quotes and Optional Material - Elements have also been included to permit identification of embedded quotations and optional matter.
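
To give a flavor of how this markup might appear in running text (element and attribute names are again paraphrased from the categories above, and the names, ticker and figures are invented), a single sentence might carry markup such as:

    <para><person><fname>Jane</fname> <lname>Doe</lname></person>, chief
    executive of <org type="company" ticker="XMPL">Example Corp.</org>, said
    the merger should close by <date norm="19941001">Oct. 1</date> for about
    <measure><count>1.2</count> <unit>billion dollars</unit></measure>.</para>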

The list of non-structural elements is quite rich and complex. It is not expected that this deep level of encoding will be fully implemented and employed by anyone anytime soon. It is hard to imagine that a reporter in the field with his laptop trying to meet a submission deadline is going to be inclined to supply this detailed level of markup. Instead, this deep encoding specification is intended to serve as a guidepost or set of guidelines for those organizations that find they need to begin to implement deep markup for their own very practical reasons. They will at least have a framework within which to approach these very complex markup situations.

The initial appearance of this level of markup will almost certainly occur as part of a secondary editorial pass of the data, performed (most probably with the assistance of intelligent, text analysis software) either by the senders who want to make multiple "enhanced-data" releases of their news transmissions, or by the receivers who want to process and include sophisticated markup of transmitted material into their products and services.

Inclusion of Datasets from the IIM

Within the New Text Subcommittee there was considerable debate concerning the advisability of including within the text object encoded versions of the datasets from the other IIM records that are outside of the text object in a typical IIM transmission. Inclusion of these elements in the text object would necessarily require that separate specifications of these elements be incorporated into the UTF.

There were several concerns against including this information in the UTF:

  1. The encoded dataset information in the text object would be a duplication of information already present in the IIM envelope or application records, and any time information is duplicated it becomes subject to synchronization problems which can lead to misprocessing.
  2. Because most of the IIM datasets are optional, a situation could develop where the information that is normally included in the IIM datasets is not included in the IIM at all, but only in the text object. This situation could lead to the transmission not being properly processible by pure IIM-only systems.
  3. The specification of IIM datasets in the UTF could lead to the development of two separate standards for the delivery of text, one with a pure IIM envelope and the other just marked up text without an envelope.
  4. Many of the IIM datasets proposed for inclusion in the markup of the text object contain information that was not intended to be part of the printable object. However, the expected intention of some information receivers is to simply strip away the SGML markup that is included in the text object at the receive site, and then provide the relatively readable raw text to their systems and users. Those systems and users would thus be confronted with potentially unintelligible information, or information that was not intended for their consumption.

The counter argument to these concerns was simply that the primary intention of the IIM and UTF is to make transmitted information easier to use and more useful to the receivers of that information. As senders and receivers actually begin to implement the IIM and UTF, certain components of the standards will naturally prove useful and will be implemented, while other aspects of the standards will not. The important point at this stage of development of the UTF standard is to try to devise ways to accommodate in the standard all of those aspects of usage that can be expected to be encountered and that are reasonable to accommodate. Then the marketplace should be counted on to weed out the truly bad ideas and advance the really good ones.

Eventually the counter argument was accepted, and work is now in progress to define exactly what structures from the non-text object records of the IIM are reasonable to accommodate in the UTF, and exactly how they will be incorporated. For example, some IIM datasets that are early candidates include Byline, Headline, Release Date, Source, Caption, Keywords, Category Codes, etc.

Markup Minimization in the UTF

One of the major requirements for the UTF was that it not impose unreasonable overhead in the entry, transmission or processing of the information. Limiting the amount of markup overhead is particularly important when it comes to tabular material. Markup minimization is the SGML technique that primarily satisfies this objective. There are several mechanisms available in SGML to achieve minimized markup, three of which have been successfully employed with the UTF.

  1. Tag Omission - One method of minimization involves the use of SGML's OMITTAG feature. This feature permits the omission of start tags and/or end tags of certain elements during the encoding and interpretation of the text depending on the context in which the element occurs. Essentially, tags can be omitted whenever it is allowed by the specification of the element, and wherever there can be no ambiguous interpretation as to what tag is inferred at the position where it is omitted. Removal of unnecessary end tags by itself eliminates a significant amount of markup.

  2. Short References - Another method of minimization involves the use of SGML's SHORTREF and USEMAP capabilities. These capabilities permit the replacement of sections of markup with relatively short character sequences during the encoding and transmission process. Upon receipt and contextual interpretation of the character sequences, the markup can be fully restored to permit proper processing to continue.

    During the encoding of tables, this technique was successfully applied to the markup appearing at the beginning of rows, and between the columnar entries in each row. The result was a huge decrease in markup and a vast improvement in readability.

  3. Entities - The last method of minimization employed in the UTF involves the use of SGML entities. This capability permits large portions of commonly recurring text to be replaced by a reference to the text during the encoding and transmission process, thus saving significant amounts of storage space and transmission time. The replaced text (the text entity) is stored in an entity file at the receiving site. Upon receipt and interpretation, the entity reference is resolved to its replacement text whenever necessary.

    During the encoding of tables, this technique was successfully applied to the markup appearing at the beginning of table groups. Properly structured table groups of the same style could be assumed to have the same column specifications and headers. This fact lent itself nicely to the use of public entities and entity references, and resulted in even more decreases in markup and improvements in readability. In addition, the large number of public entity declarations necessitated by the use of public entity references for the different styles of table groups could in turn be codified as a public entity for a given style of table, thus reducing the SGML markup even more.

  4. Public Entities - The reference name (entity reference) and replaced text can be publicly registered so that many organizations can gain access to and use the shorthand. Registration would also ensure that names and content were unique and not subject to misinterpretation. Text entities and entity references that are set up in this fashion are called Public Entities. One suggestion that is currently under consideration is for the NAA and the IPTC to administer a registry of the Public Entities used with the UTF. In this capacity, the NAA and IPTC would be responsible for supplying a repository where master versions would be stored, moderating a review of any proposed changes, performing maintenance as appropriate, and distributing copies to members (presumably for free) and interested parties (possibly for a modest fee).
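
As a rough sketch of how tag omission, short references and entities combine for tables (declaration details are abbreviated, and names such as "rowmap" and "boxscore.hdr" are invented for illustration), the DTD might map the short reference delimiters for record start and tab to bracketed text entities that expand to tags:

    <!ENTITY startrow   STARTTAG "row"    -- expands to <row> -->
    <!ENTITY startentry STARTTAG "entry"  -- expands to <entry> -->
    <!SHORTREF rowmap   "&#RS;"  startrow
                        "&#TAB;" startentry>
    <!USEMAP rowmap tbody>

With OMITTAG permitting the implied start and end tags to be inferred, a body row that would otherwise be keyed as

    <row><entry>Smith</entry><entry>31</entry><entry>7</entry><entry>12</entry></row>

can instead travel as an ordinary tab-separated line (with <TAB> standing for a literal tab character)

    Smith<TAB>31<TAB>7<TAB>12

and be expanded back to full markup on receipt. The recurring column specifications and headers for a given style of table can similarly be collapsed into a single reference to a (potentially registered) public entity such as &boxscore.hdr;.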

Customizing the UTF

Another major requirement of the UTF was that it include mechanisms to facilitate customization of the standard in a number of specific, but standardized, ways. This objective was accomplished by taking full advantage of one of the most underused capabilities of SGML, the document type declaration subset.

The document type declaration is always included at the beginning of every document. It is typically used to associate a document type name with an external, publicly defined document type definition (DTD), but it can also include an internal document type declaration subset within the square brackets of the declaration. By including elements, attributes and entities in the internal subset, it is possible to customize the definition of the markup of any instance of a document, because the included components will override and/or augment their counterparts within the externally defined DTD.
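
For example (the public identifiers and declarations below are invented for illustration, and the sketch assumes the UTF DTD exposes a parameter-entity hook such as %local.inline; for local extension), a sports-oriented document instance might begin:

    <!DOCTYPE iimutf PUBLIC "-//IPTC-NAA//DTD UTF//EN" [
      <!-- pull in a registered subset of sports-specific declarations -->
      <!ENTITY % sports.names PUBLIC "-//IPTC-NAA//ELEMENTS Sports Names//EN">
      %sports.names;
      <!-- locally extend the inline element mix with a new element -->
      <!ENTITY % local.inline "| scorebox">
      <!ELEMENT scorebox - - (#PCDATA)>
    ]>

Because SGML honors the first declaration of an entity that it encounters, entity declarations in the internal subset take precedence over any defaults supplied in the external DTD, which is what makes this form of per-document customization possible.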

There are several ways that this capability is envisioned as being utilized in the UTF:

  • Invoking Subsets of the DTD - In this situation, alternative declarations of selected groups of elements, attributes and entities would be gathered together into DTD subsets that would be registered as public entities. As conditions warranted for any given document instance, the document type declaration at the beginning of the document would include in its internal subset a public entity reference to the appropriate DTD subset. Some cases where this would be useful are:

    • Invoking domain specific elements and attributes (e.g., people name markup (elements and attributes) that would be used for sports stories versus political stories, or football stories from Europe versus football stories from America)
    • Invoking different coding schemes (e.g., the UTF without IIM datasets versus the UTF with IIM datasets)

  • Overriding and Extending the DTD - In some situations specific pairs or groups of senders and receivers may decide that parts of the UTF are not working as expected, or that extensions need to be added to the UTF to accommodate their new or changing requirements. In such situations, rather than waiting for what could be a lengthy change, review and approval process to the official industry standard, it might prove more expedient to override or extend the UTF DTD with the appropriate markup definitions in just those documents that are affected. While such a situation would incur a little extra overhead, it could prove to be an invaluable means of testing new markup schemes within the overall structure of the UTF. Then, if a scheme was successful, it could naturally be incorporated into the UTF or registered as a public entity for everyone to use.

7. Conclusion

The UTF is still undergoing testing and revision. It is a draft standard that has not yet been approved and is still in flux. There are a number of different areas where significant changes can be expected.

It is the intention of the New Text Subcommittee to get the UTF standard out into the real world where the member organizations can rigorously test it and begin to implement it. It is anticipated that the full UTF, or some major subset of it, will have been reviewed and approved during 1994 by the sponsoring organizations, the IPTC and the NAA, and that the standard will be in actual use by the end of 1994.

While there is much work yet to be completed, the UTF in its current form is already a very robust and modern information processing standard that clearly addresses the needs of a major international industry, the news distribution industry.


HTML version prepared by Robin Cover, March 1995. The original document came from Dave Becker [daveb@meaddata.com], January 13, 1995, in RTF format. Appreciation is extended to Dave Becker and the larger UTF research team for making this document available.