[This local archive copy is from the official and canonical URL, http://www.mediacenter.org/report.htm; please refer to the canonical source document if possible.]


Memorandum

To: API Grammarians

From: Tagging Team (Report prepared by Kathy Foley and Tom Johnson)

Date: 02/03/99

Re: Preliminary Thoughts on Journalism Tag Schema

What follows are the first-pass thoughts of the tagging sub-group of the API Grammarians who met in Dallas Jan. 7-8, 1999. Participants on the tag-team were: Kathy Foley (Editor, Information Services-San Antonio Express-News kfoley@express-news.net ); Tom Johnson (Prof. of Journalism - San Francisco State University tom@jtjohnson.com ); Alan Karben ( Associate Director, Interactive Development, The Wall Street Journal Interactive Edition karben@interactive.wsj.com ); B. C. Krishna (FutureTense - bkrishna@futuretense.com ); Chris Ryan (Freedom Forum Fellow - University of Kansas ); Dennis Walsh ( Co-Head of the Interactive Media Center, Miami University of Ohio DPWALSH@miavx1.acs.muohio.edu ); Chris Willis (Senior Editor/Technology & Design - A. H. Belo Corp cwillis@dallasnews.com ); Steve Yelvington ( Editor, Star-Tribune Online stevey@startribune.com ); MTK

OBJECTIVES:

    • To develop a flexible but universal markup language that would give editors or producers a standard code for organizing journalistic content of all types for analysis and delivery via all forms of digital media.
    • The markup language should be integrated into newsroom publishing and pagination systems to capitalize on existing workflow, rather than creating a new task.
    • Such a markup language must allow journalists and media managers maximum flexibility in the storage, retrieval and delivery of content regardless of file type or structure.
    • Such a markup language will allow readers to personalize their own files and databases by selecting and saving components of the content.
    • Such a markup language should be consistent with the standards currently evolving in the library/information management, geography, government and database design professions.
    • Such standards should permit a high degree of customization and application in non-English and non-North American applications.

DEFINITIONS

Metadata: Metadata is, most generally, data that describes other data to enhance its usefulness. The catalog that emerged as an important component of the modern library is used as a canonical example of metadata, although there are many other well-developed examples within libraries, museums, corporations and other institutions that emphasize intellectual assets as a central part of their stock-in-trade. The development and maintenance of this metadata is, then, an essential activity for these institutions. They describe, keep track of, provide access to, and manage their collections by these means. (We recommend the document at http://www.csdl.tamu.edu/~marshall/dl98-making-metadata.html because it addresses situations in other industries and institutions exactly parallel to ours in the media.)

Metadata may not be universal in its scope. Some metadata is local and private, used only by data producers, managers or maintainers -- for example, metadata relating to archiving or media production or distribution. Or metadata may pertain to a limited set of content users -- for example, the metadata relating to the use of particular materials on a particular day or of a given subject by individuals using a particular access medium or service (e.g. the Web versus PalmPilot racing results).

Data Tags: Anyone who has used computer editing systems, created Web pages in HTML or even used WordPerfect or XyWrite word processing software is already familiar with the concept and usage of data tags. Data tags are to a markup language what letters and words are to a written language: the building blocks. A tag generally precedes a string of text and tells the computer to do something specific to whatever characters follow. When the instruction has been completed, a related tag is used to stop the action. For example, the tags around <ital>this phrase</ital> start and stop the italicization of the two words between them.
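The start/stop behavior described above can be sketched in a few lines of Python. This is only an illustration: the `<ital>` tag comes from the example in the text, while the function name `render_ital` and the use of asterisks to stand in for italic type are assumptions made for the sketch.

```python
import re


def render_ital(text: str) -> str:
    """Replace each <ital>...</ital> pair with *...* to mimic italic output."""
    # Non-greedy match, so two separately tagged phrases are handled one by one.
    return re.sub(r"<ital>(.*?)</ital>", r"*\1*", text)


sample = "The tags around <ital>this phrase</ital> start and stop the action."
rendered = render_ital(sample)
print(rendered)
```

Text outside the tag pair passes through unchanged; only the characters between the opening and closing tags are affected.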

In computing and database terms, tags can have greater significance. They can be used to organize the creation, archiving, searching, sorting, retrieval, and communication of a variety of data and data types, so long as that data is stored in a digital format. Data tags can indicate which part of a story is the headline and which is the byline. Tags can be used to label the concepts or content of an audio, video or database file. Tags can also be used to annotate a file with meaning that the file's own digital content does not carry. For example, a picture of Bill kissing Hillary could be tagged "love" or "devotion," though such data would not be automatically inherent in a graphic file.

Tags often reflect Parent-Child or Hierarchy-of-Meaning relationships. For example, a tag in a reporter's story marking a geo-spatial reference might simply point to the name of the city "Arlington." But Arlington alone is not sufficient because the term can be part of multiple hierarchies that extend from Universe, Solar System, Planet (Earth), Continent (North America), Country (United States) and State (Virginia [or Texas, Illinois, Maine, Wisconsin]) down to County (e.g. Tarrant County), Municipality, ZIP and latitude-longitude by degree and minute. The advantage of using a tag to identify Arlington is that the software can be programmed to require the producer or editor to specify which Arlington. And once that location is specified, a digital tag dictionary/thesaurus can make the necessary hierarchical links that would permit "fuzzy" searching, as in "There's that city that starts with an 'A' in Texas that's near Ft. Worth...."
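A minimal sketch of the Arlington case, in Python, may make the idea concrete. Everything here is hypothetical: the `Place` record, the three sample gazetteer entries and the function names are assumptions for illustration, not part of any proposed tag dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    name: str
    state: str
    county: str
    near: tuple  # names of nearby cities


# Hypothetical gazetteer entries, for illustration only.
GAZETTEER = [
    Place("Arlington", "Texas", "Tarrant County", ("Fort Worth", "Dallas")),
    Place("Arlington", "Virginia", "Arlington County", ("Washington",)),
    Place("Austin", "Texas", "Travis County", ("Round Rock",)),
]


def candidates(name):
    """Return every gazetteer entry that shares a bare place name."""
    return [p for p in GAZETTEER if p.name == name]


def needs_disambiguation(name):
    """True when the editor must be asked which place is meant."""
    return len(candidates(name)) > 1


def fuzzy_find(first_letter, state, near):
    """Find places by initial letter, state and a nearby city."""
    return [
        p for p in GAZETTEER
        if p.name.startswith(first_letter) and p.state == state and near in p.near
    ]


print(needs_disambiguation("Arlington"))
print(fuzzy_find("A", "Texas", "Fort Worth"))
```

The first call is the check that would prompt the producer or editor to specify which Arlington; the second is the "city that starts with an 'A' in Texas near Ft. Worth" search, which works only because each entry carries its place in the hierarchy.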

[CHRIS: DO WE NEED A SECTION HERE ON THE WHY'S AND WHEREFORE'S OF XML ???]

TAGS FOR JOURNALISM

In addition to the metatags for journalism, there appear to be three sub-sets. Sometimes, too, metatags will be duplicated in various ways in the sub-set tags. For example, a metatag referring to the size or location of the file could also show up as a tag necessary for the archiving of the content.

Fig. 1 Relationships of Journalism Content Tag Types

 

Often the tags for these sub-sets overlap, but each has specific uses. The three sub-sets are:

    • Tags related to process

These tags typically could include comments between writers, editors, producers, programmers or administrative personnel. They could pertain to publishing schedule or degree of content readiness. They could reflect an audit trail of changes and access rights. They can reflect embedded file types (e.g. a story or ad or announcement that links to an A/V file on the web or an audio file that delivers driving directions over a cell phone).

    • Tags related to content

Content tags tend to be much more specific, yet flexible enough to help differentiate between President Thomas Jefferson and Thomas Jefferson High School.

Digital journalism content is evolving to generally fall in three sub-sets: news/editorial; commercial (ads, transactions) and community (i.e. content generated and largely maintained by individuals or community groups such as church announcements or Little League activities). The system we propose is malleable enough to accommodate all of these data types.

A well-designed tagging plan can help distinguish between the time and/or date a story was published and the time and/or date of an event. These tags can be used to mark-up headlines, subheads, bylines and even such industry-specific components as ledes, nut grafs and kickers. The system is flexible enough—using a coding language called XML—that each newspaper, magazine or TV station can customize the tags to fit its unique newsroom vocabulary.
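As a rough sketch of what such content markup might look like, the Python below parses a small XML story with the standard library. The element names (`headline`, `byline`, `date_published`, `date_referred_to`, `lede`, `nut_graf`), the sample story text and the house vocabulary (`hed`, `lead`) are all invented for illustration; none of them is a proposed standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical story markup: the publication date and the event date are separate tags.
STORY_XML = """
<story>
  <headline>Council approves stadium plan</headline>
  <byline>By Paul Prestige</byline>
  <date_published>1999-02-03</date_published>
  <date_referred_to>1999-02-01</date_referred_to>
  <lede>The city council voted Monday to approve a new stadium.</lede>
  <nut_graf>The vote ends two years of debate.</nut_graf>
</story>
"""

# Hypothetical house vocabulary: one newsroom's tag names mapped to shared ones.
HOUSE_VOCABULARY = {"hed": "headline", "lead": "lede"}


def normalize(root, vocabulary):
    """Rename house-specific tags to the shared names, in place."""
    for element in root.iter():
        element.tag = vocabulary.get(element.tag, element.tag)
    return root


def field(root, tag):
    """Return the text of the first child with this tag, or None."""
    node = root.find(tag)
    return node.text if node is not None else None


story = ET.fromstring(STORY_XML.strip())
print(field(story, "date_published"), field(story, "date_referred_to"))

house = normalize(
    ET.fromstring("<story><hed>Budget talks resume</hed><lead>Talks resumed.</lead></story>"),
    HOUSE_VOCABULARY,
)
print(field(house, "headline"))
```

Because the two dates are tagged separately, software can sort by publication date or by event date without guessing. The mapping step shows one way a newsroom could keep its own vocabulary while still exchanging content under shared names.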

    • Tags related to archiving, bibliographic citation and transactions

On one hand, these tags are tied to the special concerns of archivists and archive vendors such as Lexis-Nexis and MediaStream. But they also will be invaluable to content producers as media institutions come to realize that their archives are one of the few truly unique resources they have, especially as they pertain to local markets. Consequently, ease of precision searching to facilitate, pardon the expression, re-purposing the content will drive P/L decisions. Typical archiving tags identify the source (byline, credit, publication, edition, page, section, zone), the type (news, feature, analysis, review, game story), or the relationship of the object to other objects (photo caption, graphics text, sidebars, series information).
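The precision searching mentioned above can be sketched as a filter over tagged archive records. The Python below is a toy under stated assumptions: the field names loosely follow the archiving tags just listed, and the records, the `related_to` field and the function names are hypothetical.

```python
# Hypothetical archive records; field names loosely follow the archiving tags above.
ARCHIVE = [
    {"accession": "19990101842", "slug": "Israeltalks", "publication": "Daily Rag",
     "section": "A Section", "page": "A01", "edition": "Final", "type": "news"},
    {"accession": "670345", "slug": "Stadiumvote", "publication": "Daily Rag",
     "section": "Sports", "page": "9B", "edition": "State", "type": "game story"},
    {"accession": "670346", "slug": "Stadiumvote-side", "publication": "Daily Rag",
     "section": "Sports", "page": "9B", "edition": "State", "type": "feature",
     "related_to": "670345"},
]


def search(records, **criteria):
    """Return the records whose tags match every given criterion exactly."""
    return [r for r in records if all(r.get(k) == v for k, v in criteria.items())]


def related(records, accession):
    """Return the records tagged as related to another object (e.g. sidebars)."""
    return [r for r in records if r.get("related_to") == accession]


game_stories = search(ARCHIVE, section="Sports", type="game story")
print([r["slug"] for r in game_stories])
print([r["slug"] for r in related(ARCHIVE, "670345")])
```

A full-text search for "stadium" would return both sports records; the source, type and relationship tags are what let the search separate the game story from its sidebar.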

Here are the suggested tags the sub-group came up with in a short time. While we have tried to group the tags in their principal family -- Process, Content or Archive -- a tag will often serve more than one master. It is important to understand, however, that the term standard as applied to these tags does not mean mandatory or constricted. The system proposed is of value primarily because it allows -- and perhaps even encourages -- flexibility and customization.

NOTE: These tags below are just for illustrative purposes; this does not approach a complete list. Everyone should feel free to suggest tags at all levels, keeping in mind the nested functions and cross usage. This formulation is ideally suited for Web presentation. For a model, see "The FGDC Content Standard for Digital Geospatial Metadata" http://www.its.nbs.gov/nbs/meta/meta.html

Journalism Content Metatags (preliminary):

    1. Copyright
      1. Permissions
      2. Rights management
      3. Payment
    2. Story type (each type could have different sub-tags):
      1. Sport
      2. Sport-specific statistics
      3. Results
      4. Politics
      5. Business and economics
    3. Column Name
    4. Series name (package)
    5. Series number
    6. [Story] Package name/ ID
      1. Stories
      2. Graphics
      3. Photos
      4. Links to….
      5. Audio
      6. Video
      7. Database of CAR data

      Preliminary Journalism Process Tags

    7. Source
      1. Press release
      2. Community publishing
      3. Gov. doc
      4. "Commercial"
        1. Ads
        2. Display
          1. Classified
          2. Audio
          3. Video
    8. Byline/Author
    9. Credit
    10. Page
    11. 1st character of text
    12. Section
    13. Edition
    14. Zone
    15. Language
    16. Priority (see ANPA wire service header list)
    17. Text
      1. Words
      2. Bytes
      3. Col. Inches (if appropriate)
      4. Lines

    18. Graphic
      1. File size
      2. Original dimensions
    19. Audio
      1. File size
      2. Run time
    20. Video
      1. File size
      2. Run time

       

      Preliminary Journalism Content Tags

    21. Slug
    22. Byline(s)
    23. Text
        1. Words
        2. Bytes
        3. Col. Inches (if appropriate)
    24. Graphic
        1. File type
        2. File size
        3. Original dimensions
    25. Audio
        1. File type
        2. File size
        3. Run time
    26. Video
        1. File type
        2. File size
        3. Run time
    27. Date
        1. Date Authored
        2. Published
        3. Date Expired
        4. Date(s) referred to
    28. Story type (each type could have different sub-tags):
        1. Sport
          1. Sport-specific statistics
          2. Results
        2. Politics
        3. Business and economics
    29. Headline & Subheads
        1. Poster heads/decks
    30. Lead
    31. Summary/nut graf/abstract
    32. Text
        1. Text that ran
        2. Text that didn't run

      Preliminary Journalism Archive Tags

    33. 1st character of text
    34. Memo
        1. Confidential
        2. Editor's notes
        3. To public
        4. To staff
    35. Correction
        1. Published
        2. Uncorrected
    36. Slug
    37. Series name Unique ID (i.e. accession number used by MediaStream, Lexis-Nexis, etc. for filing)
    38. Column Name
    39. Series name (package)
    40. Series number
    41. 1st character of text
    42. Package name/ ID -- All the associated stories, graphics, photos, dynamic tables, audio, video
    43. Stories
    44. Graphics
    45. Photos
    46. Links to….
    47. Audio
    48. Video
    49. Sports Scores
    50. Results
    51. Database of CAR data
    52. Keywords (i.e. "subject headings")
        1. Proper names
        2. Alias or AKA(s)
        3. Company names
        4. Industry sector codes (SIC codes; these are being replaced by the new North American Industry Classification System, NAICS)
          1. Links to SIC-to-NAICS conversion tables
        5. Stock exchange symbol
        6. Generic Subject Headings
          1. Murder
          2. Schools
          3. Weather
          4. Accidents
    53. Story types (controlled list) – review, obit, news
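The nesting in the outlines above can be modeled as a simple tree, so that software can recover a tag's family from its position. The Python sketch below copies a few entries from the preliminary archive list; the dictionary layout and the function names are assumptions for illustration only.

```python
# A fragment of the preliminary archive tags, written as a nested dictionary.
TAG_TREE = {
    "Keywords": {
        "Proper names": {},
        "Alias or AKA(s)": {},
        "Company names": {},
        "Stock exchange symbol": {},
        "Generic Subject Headings": {
            "Murder": {},
            "Schools": {},
            "Weather": {},
            "Accidents": {},
        },
    },
    "Correction": {"Published": {}, "Uncorrected": {}},
}


def tag_paths(tree, prefix=()):
    """Yield every tag as a tuple path from its top-level family down."""
    for name, children in tree.items():
        path = prefix + (name,)
        yield path
        yield from tag_paths(children, path)


def ancestors(tree, target):
    """Return the parent chain of the first tag named target, or None."""
    for path in tag_paths(tree):
        if path[-1] == target:
            return path[:-1]
    return None


print(ancestors(TAG_TREE, "Weather"))
```

Storing the full path rather than the bare tag name is one way to keep a tag such as "Results" distinct when it appears under more than one parent.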

 

ISSUES FOR THE IMMEDIATE FUTURE:

    • Inclusion of hardware, database, archiving and frontend vendors in our discussions
    • Inclusion of standard lists of accepted tag content that are being developed by specialists in unique areas, e.g. the Keywords for Photos that has become standardized among news librarians.
    • Inclusion of experts in other file formats (especially audio and video file formats) in our discussions
    • Inclusion of individuals from media other than U.S. -- particularly Latin America, Canada and Europe -- in our discussions
    • Greater consideration of user research, especially in terms of how consumers search for, identify and retrieve content
    • Develop more intuitive tag nomenclature that easily reflects the concepts of family or class as they pertain to: (1) journalism metatags; (2) journalism process tags; (3) journalism content tags; (4) journalism archiving tags
    • Develop a plan to create tag portals, or extensions, for: transaction (e-commerce) tags; appending content and files to established works/data; notifying original copyright or authorship holders of re-publication and/or sales of content. See: http://www.bic.org.uk/bic/uniquid.html and http://www.bic.org.uk/rights.html
    • Develop process for consultation with all stakeholders on establishing and modifying these "standards."

Brief History of Metadata and Content Tagging

    • 1995: The National States Geographic Information Council (NSGIC) receives a Federal Geographic Data Committee (FGDC) Competitive Grant to implement its proposal titled "An Educational and Research Program in Support of Content Standards for Digital Geospatial Metadata." NSGIC's mission is to encourage effective and efficient government through the coordinated development of geographic information and systems to ensure that information may be integrated at all levels of government. NSGIC has been active in the area of metadata research and adoption through the review of technical standards, compilation of a catalogue of metadata inventories held by states, and support of publications and other related initiatives. The NSGIC metadata research and education project consists of three phases. The first tests the FGDC metadata standard on a wide range of state, local, tribal and federal information as part of a cooperative effort of nine member states. The second involves the preparation of a practical explanation of the metadata standard for state and local governments. The third phase is a distance education program based on the results of the second phase. http://rat.lic.wisc.edu/metadata/metahome.htm

    • Nov. 1998 API Media Center Conference: "Developing a Grammar for New Media." A proposal to create a news markup language for the Web may be the most influential development to emerge from a gathering of some of the finest creative minds from a broad spectrum of disciplines held Nov. 7-10, 1998, at The Media Center at the American Press Institute in Reston, VA. http://www.mediacenter.org/grammar

    • Jan. 1999 API Media Center Conference: "Grammar II: The Sequel," in Dallas. The intrepid grammarians attempt to develop a News Markup Language. http://www.mediacenter.org/nml.htm

 

References:

Marshall, Catherine C. "Making Metadata: A Study of Metadata Creation for a Mixed Physical-Digital Collection." ABSTRACT: Metadata is an important way of creating order in emerging distributed digital library collections. This paper presents an analysis of ethnographic data gathered in a university library's educational technology center as the staff develops metadata for a mixed physical-digital collection of visual resources. In particular, the paper explores issues associated with the application of standards, uncertain collection and metadata boundaries, distribution and responsibility, the types of description that arise in practice, and metadata temporality and scope. These issues help to characterize a problem space, and to explore the trade-offs collection maintainers must face when they create metadata for heterogeneous materials. http://www.csdl.tamu.edu/~marshall/dl98-making-metadata.html#table2

Rust, Godfrey. "Metadata: The Right Approach. An Integrated Model for Descriptive and Rights Metadata in E-commerce." "There are currently four major active communities of rights-holders directly confronting these questions [involving digital content and metatags]: the DOI (Digital Object Identifier) community, at present based in the book and electronic publishing sector; the IFPI community of record companies; the ISAN community embracing producers, users, and rights owners of audiovisuals; and the CISAC community of collecting societies for composers and publishers of music, but also extending into other areas of authors' rights, including literary, visual, and plastic arts.... This paper examines three propositions that support the need for radical integration of metadata and rights management concerns for disparate and heterogeneous materials, and sets out a possible framework for an integrated approach. It draws on models developed in the CIS (Common Information System) plan and the DOI Rights Metadata group, and work on the ISRC (International Standard Recording Code), ISAN (International Standard Audiovisual Number), and ISWC standards and proposals." http://www.dlib.org/dlib/july98/rust/07rust.html#introduction

Resources:

NOTE: "The Dublin Core" SITE IS THE JUMP STATION TO REVIEW THE MANY YEARS OF INTERNATIONAL WORK ALREADY COMPLETED ON TAGGING CONTENT.

The Dublin Core: A Simple Content Description Model for Electronic Resources : Metadata for Electronic Resources

The Dublin Core is a metadata element set intended to facilitate discovery of electronic resources. Originally conceived for author-generated description of Web resources, it has attracted the attention of formal resource description communities such as museums, libraries, government agencies, and commercial organizations.

The Dublin Core Workshop Series has gathered experts from the library world, the networking and digital library research communities, and a variety of content specialties in a series of invitational workshops. The building of an interdisciplinary, international consensus around a core element set is the central feature of the Dublin Core. The progress represents the emergent wisdom and collective experience of many stakeholders in the resource description arena. An open mailing list supports ongoing work. See: http://purl.oclc.org/dc

Metadata Related Tools: http://purl.oclc.org/dc/tools/index.htm

    • Creating Metadata (Templates)
    • Tools for the Creation/Change of Templates
    • Automatic Extraction/Gathering of Metadata
    • Automatic Production of Metadata
    • Conversion Between Metadata Formats
    • Integrated (Tool) Environments

The Meta Data Coalition (formerly Metadata Coalition) brings together vendors and users allied with a common purpose of driving forward the definition, implementation and ongoing evolution of a meta data interchange format and its support mechanisms. The need for such standards arises as meta data, or the information about the enterprise data, emerges as a critical element in effective data management. Different tools, including data warehousing, distributed client/server computing, databases (relational, OLAP, OLTP...), integrated enterprise-wide applications, etc., must be able to cooperate and make use of meta data generated by each other. http://www.he.net/~metadata/index.html

Meta Data Interchange Specification (MDIS Version 1.1) This is version 1.1 of the Meta Data Interchange Specification. It is available here as a table of contents and complete downloadable copies in PDF and PostScript formats. This document is dated August 1, 1997 http://www.csdl.tamu.edu/~marshall/dl98-making-metadata.html#table2 The Table of Contents here offers a model for how we might structure our reports and presentation.

Geospatial Support Staff Metadata Tutorial Introduction

"In the beginning, one collected ... data without considering that somewhere, sometime, someone might ask -

Why was this data gathered?

What was collected?

Who collected it?

How was it collected?

How current is it?

Where is this data?

[Who has access rights to the data? Who has rights to change the data? Should the data be updated? Should it be referenced to other data? Should the data have a warning flag of any sort (e.g. potentially libelous)?]

"This brings us to the somewhat confusing business of collecting Metadata. When first faced with this nasty problem, the collection of Metadata can seem overwhelming. Fortunately, the Federal Geographic Data Committee (FGDC) has thought this out and has published a metadata content standard. So we know what to collect. Our job is to figure out what information would be most meaningful and best define our data sets using their standard. So let's begin by looking at the 10 Metadata sections." http://www.blm.gov/gis/meta/barney/meta1.html

XML Resources (Note that XML is a trademark held by M.I.T.):

Frequently Asked Questions about the Extensible Markup Language. Maintained on behalf of the World Wide Web Consortium's XML Special Interest Group, with contributions from many members of the SIG as well as FAQ readers around the world. A site with reliable, straightforward answers. http://www.ucc.ie/xml

What the ?XML! may be one of the worst designed Web sites in existence, and Geocities a challenge to navigate, but be patient and drill down. The information is straightforward and helpful. This is the general site with the "Learn XML in 11.5 Minutes" document. http://www.geocities.com/SiliconValley/Peaks/5957/wxml.html

XML.COM is a Seybold Publications and O'Reilly & Associates venture that apparently aims to cover the XML world as it evolves. Good at staying up to date. http://www.xml.com

W3C XML This is the home page of the W3C XML Activity, part of the Architecture Domain. The XML Activity Statement explains the W3C's work on this topic in more detail. http://www.w3.org/XML

Microsoft's XML resource page. This section provides information on Microsoft's support of the Extensible Markup Language (XML) -- the universal format for data on the Web. XML allows developers to easily describe and deliver rich, structured data from any application in a standard, consistent way. XML does not replace HTML; rather, it is a complementary format. (Note that XML has its own newsgroup: microsoft.public.xml. You can use this newsgroup to get answers to your XML questions, learn what's new with XML, and find out what you can do using XML.) http://www.microsoft.com/xml/default.asp

Arbortext, a vendor "of open, XML-based software solutions that accelerate the process of creating, managing, and delivering product information in medium and large enterprises." Has NT products. Click down, however, to find the links to other online XML resources. http://www.arbortext.com

 

Suggested Journalism Tags:

| Metatags | (P)rocess / (C)ontent / (A)rchive Tags | Synonyms | Example of Usage |
|---|---|---|---|
| Source | A | Newspaper, Publication | Washington Post, U.S. News & World Report, Associated Press |
| Byline, Credit | A | Author | By Paul Prestige; Daily Rag Staff Reporter, New York Times |
| Section, Page, Edition, Zone, Version, Priority | A | | A Section, Sports, Editorial; 01A, A01, 9B; Final, State; Jersey Shore |
| Length, Wordcount | A | | 21 inches, 622 words |
| Date (Published), Day Published, Date Authored, Date Expired, Time Stamp | A | | 1/10/99, 19991201: 18:32 |
| Copyright | A | | © Washington Post |
| Type (Record) | A | Format | Text, Photo, Graphic, Audio, Video, Database |
| Headline | A | Head, Title | Self explanatory |
| Lead, Abstract, Nut Graph | A | | Self explanatory |
| Dateline | | | Mexico City |
| Text, Text that didn’t publish, 1st character of text, Captions, Text from Graphics, Pull Quotes | | | Self explanatory |
| Keywords, Subject Terms, Named Persons, Corporate ID, Stock Symbol, Profile | | | Sex Crimes, Rape, William Jefferson Clinton, Microsoft, ATT, Hillary Clinton |
| Column Name, Series Name, Series Number | | | Business Briefs, Pope John Paul II in Cuba, Third of Three parts, One of an occasional series |
| Language | | | English |
| Memo | | | |
| Correction | | | Sam Schleterbahn’s name was incorrectly spelled in Saturday’s story. |
| Slug, Accession Number, Package ID, Thread ID | | | Israeltalks; 19990101842, 670345 |
| Package Connectors | | | Link to Sidebars; Photos + Photo Credit; Graphics + Credit; Audio; Video; Links to; Sports scores; Results; Databases of CAR data |