
[This local archive copy is from the official and canonical URL, http://www.mediacenter.org/report.htm; please refer to the canonical source document if possible.]

Memorandum

To: API Grammarians

From: Tagging Team (Report prepared by Kathy Foley and Tom Johnson)

Date: 02/03/99

Re: Preliminary Thoughts on Journalism Tag Schema

What follows are the first-pass thoughts of the tagging sub-group of the API Grammarians who met in Dallas Jan. 7-8, 1999. Participants on the tag-team were: Kathy Foley (Editor, Information Services - San Antonio Express-News, kfoley@express-news.net); Tom Johnson (Prof. of Journalism - San Francisco State University, tom@jtjohnson.com); Alan Karben (Associate Director, Interactive Development, The Wall Street Journal Interactive Edition, karben@interactive.wsj.com); B. C. Krishna (FutureTense - bkrishna@futuretense.com); Chris Ryan (Freedom Forum Fellow - University of Kansas); Dennis Walsh (Co-Head of the Interactive Media Center, Miami University of Ohio, DPWALSH@miavx1.acs.muohio.edu); Chris Willis (Senior Editor/Technology & Design - A. H. Belo Corp, cwillis@dallasnews.com); Steve Yelvington (Editor, Star-Tribune Online, stevey@startribune.com); MTK

OBJECTIVES:

To develop a flexible but universal markup language that would give editors or producers a standard code for organizing journalistic content of all types for analysis and delivery via all forms of digital media.
The markup language should be integrated into newsroom publishing and pagination systems to capitalize on existing workflow, rather than creating a new task.
Such a markup language must allow journalists and media managers maximum flexibility in the storage, retrieval and delivery of content regardless of file type or structure.
Such a markup language will allow readers to personalize their own files and databases by selecting and saving components of the content.
Such a markup language should be consistent with the standards currently evolving in the library/information management, geography, government and database design professions.
Such standards should permit a high degree of customization and application in non-English and non-North American applications.

DEFINITIONS

Metadata: Metadata is, most generally, data that describes other data to enhance its usefulness. The catalog that emerged as an important component of the modern library is used as a canonical example of metadata, although there are many other well-developed examples within libraries, museums, corporations and other institutions that emphasize intellectual assets as a central part of their stock-in-trade. The development and maintenance of this metadata is, then, an essential activity for these institutions. They describe, keep track of, provide access to, and manage their collections by this means. (We recommend the document at http://www.csdl.tamu.edu/~marshall/dl98-making-metadata.html because it addresses situations in other industries and institutions exactly parallel to ours in the media.)

Metadata may not be universal in its scope. Some metadata is local and private, used only by data producers, managers or maintainers -- for example, metadata relating to archiving or media production or distribution. Or metadata may pertain to a limited set of content users -- for example, the metadata relating to the use of particular materials on a particular day or of a given subject by individuals using a particular access medium or service (e.g. the Web versus PalmPilot racing results).

Data Tags: Anyone who has used computer editing systems, created Web pages in HTML or even used WordPerfect or Xywrite word processing software is already familiar with the concept and usage of data tags. Data tags are to a markup language what letters and words are to a written language: the building-blocks. A tag generally precedes a string of text and tells the computer to do something specific to whatever characters follow. When the instruction has been completed, then a related tag is used to stop the action. For example, around <ital>this phrase</ital> are tags to start and stop the italicization of the two words within the tags.
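The start-and-stop pairing described above can be illustrated with a short Python sketch. It is only an illustration: the <ital> tag comes from the example in the text, while the enclosing <p> element and the variable names are assumptions added so that the snippet parses as well-formed markup.

```python
import xml.etree.ElementTree as ET

# The <ital> pair is the example from the text; the <p> wrapper is an
# assumption, added only so the snippet is well-formed.
snippet = (
    "<p>For example, around <ital>this phrase</ital> "
    "are tags to start and stop italics.</p>"
)

root = ET.fromstring(snippet)

# Everything between an opening <ital> and its closing </ital>.
italic_spans = [el.text for el in root.iter("ital")]

# The same sentence with all tags removed.
plain_text = "".join(root.itertext())

print(italic_spans)
print(plain_text)
```

Because the tags are paired, software can either act on the tagged span (here, collect it) or ignore the tags and recover the plain text.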

In computing and database terms, tags can have greater significance. They can be used to organize the creation, archiving, searching, sorting, retrieval, and communication of a variety of data and data types, so long as that data is stored in a digital format. Data tags can indicate which part of a story is the headline and which is the byline. Tags can be used to label the concepts or content of an audio, video or database file. Tags can be used to indicate an annotation of a file, although the content of that file may not have that specific digital content. For example, a picture of Bill kissing Hillary could refer to "love" or "devotion," though such data would not be automatically inherent in a graphic file.

Tags often reflect Parent-Child or Hierarchy-of-Meaning relationships. For example, a tag in a reporter's story marking a geo-spatial reference might simply point to the name of the city "Arlington." But Arlington alone is not sufficient because the term can be part of multiple hierarchies that extend from Universe-Solar System-Planet-Earth-Continent-North America-United States-Virginia [or Texas, Illinois, Maine, Wisconsin], Tarrant County, Municipality, ZIP, latitude-longitude-degree-minute. The advantage of using a tag to identify Arlington is that the software can be programmed to require the producer or editor to specify which Arlington. And once that location is specified, a digital tag dictionary/thesaurus can make the necessary hierarchical links that would permit "fuzzy" searching, as in "There's that city that starts with an 'A' in Texas that's near Ft. Worth...."
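A minimal Python sketch of the Arlington example follows. Everything in it is hypothetical: the Place record, the three-entry gazetteer and the function names are assumptions made for illustration, not part of any proposed tag dictionary. It shows software refusing an ambiguous place name until the editor picks one, and a "fuzzy" lookup that walks the hierarchy by state and first letter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    """One node in a (much simplified) geographic hierarchy."""
    name: str
    county: str
    state: str
    country: str = "United States"


# Hypothetical, illustrative gazetteer; a real one would be far larger.
GAZETTEER = [
    Place("Arlington", "Tarrant County", "Texas"),
    Place("Arlington", "Arlington County", "Virginia"),
    Place("Austin", "Travis County", "Texas"),
]


def resolve(name, state=None):
    """Return the single matching place, or force the editor to choose."""
    matches = [
        p for p in GAZETTEER
        if p.name == name and (state is None or p.state == state)
    ]
    if not matches:
        raise LookupError(f"no place named {name!r}")
    if len(matches) > 1:
        options = ", ".join(p.state for p in matches)
        raise ValueError(f"which {name}? specify one of: {options}")
    return matches[0]


def fuzzy_search(initial, state):
    """'That city that starts with an A in Texas...'"""
    return [
        p.name for p in GAZETTEER
        if p.state == state and p.name.startswith(initial)
    ]


print(resolve("Arlington", state="Texas"))
print(fuzzy_search("A", "Texas"))
```

Calling resolve("Arlington") with no state raises an error listing the candidate states, which is the point at which an editing system would prompt the producer or editor.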

[CHRIS: DO WE NEED A SECTION HERE ON THE WHY'S AND WHEREFORE'S OF XML ???]

TAGS FOR JOURNALISM

In addition to the metatags for journalism, there appear to be three sub-sets. Sometimes, too, metatags will be duplicated in various ways in the sub-set tags. For example, a metatag referring to the size or location of the file could also show up as a tag necessary for the archiving of the content.

Fig. 1 Relationships of Journalism Content Tag Types

Often the tags for these sub-sets overlap, but each has specific uses. The three sub-sets are:

Tags related to process

These tags typically could include comments between writers, editors, producers, programmers or administrative personnel. They could pertain to publishing schedule or degree of content readiness. They could reflect an audit trail of changes and access rights. They can reflect embedded file types (e.g. a story or ad or announcement that links to an A/V file on the web or an audio file that delivers driving directions over a cell phone).

Tags related to content

Content tags tend to be much more specific, yet flexible enough to help differentiate between President Thomas Jefferson and Thomas Jefferson High School.
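One way such a distinction might be expressed is with a type attribute on a name tag. The Python sketch below is an assumption for illustration only: the <graf> and <name> tags, the type values and the sample sentence are invented and do not come from any agreed tag set.

```python
import xml.etree.ElementTree as ET

# Invented sample sentence; the tag and attribute names are hypothetical.
GRAF = (
    '<graf>Students at <name type="school">Thomas Jefferson High School</name> '
    'studied the writings of <name type="person">Thomas Jefferson</name>.</graf>'
)


def names_of_type(xml_text, wanted):
    """Collect the text of every <name> element with the given type."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter("name") if el.get("type") == wanted]


print(names_of_type(GRAF, "person"))
print(names_of_type(GRAF, "school"))
```

A plain keyword search for "Thomas Jefferson" would match both names; a search restricted to the person type returns only the president.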

Digital journalism content is evolving to generally fall in three sub-sets: news/editorial; commercial (ads, transactions) and community (i.e. content generated and largely maintained by individuals or community groups such as church announcements or Little League activities). The system we propose is malleable enough to accommodate all of these data types.

A well-designed tagging plan can help distinguish between the time and/or date a story was published and the time and/or date of an event. These tags can be used to mark-up headlines, subheads, bylines and even such industry-specific components as ledes, nut grafs and kickers. The system is flexible enough—using a coding language called XML—that each newspaper, magazine or TV station can customize the tags to fit its unique newsroom vocabulary.
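The two ideas in the paragraph above, separate tags for publication date and event date and a locally customized vocabulary, can be sketched in Python. All names here are assumptions: the element names, the alias table and the sample story are invented for illustration, and only the byline text is borrowed from the usage examples later in this memo.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Invented sample story; element names are hypothetical, not a proposed standard.
STORY = """
<story>
  <headline>Council approves budget</headline>
  <byline>By Paul Prestige</byline>
  <date-published>1999-01-10</date-published>
  <date-referred>1999-01-07</date-referred>
  <lede>The city council approved a budget.</lede>
</story>
"""

# One newsroom's local vocabulary mapped onto shared tag names (an assumption).
LOCAL_ALIASES = {"lede": "lead", "nut-graf": "summary"}


def parse_story(xml_text):
    """Return a dict of field name -> text, normalizing local tag names."""
    root = ET.fromstring(xml_text)
    fields = {}
    for child in root:
        key = LOCAL_ALIASES.get(child.tag, child.tag)
        fields[key] = (child.text or "").strip()
    return fields


fields = parse_story(STORY)
published = date.fromisoformat(fields["date-published"])
event_day = date.fromisoformat(fields["date-referred"])

print(fields["lead"])
print((published - event_day).days, "days between event and publication")
```

Because the two dates carry different tags, a search for stories about events on Jan. 7 need not be confused with a search for stories published on Jan. 10, and a newsroom that says "lede" can still exchange content with one that says "lead."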

Tags related to archiving, bibliographic citation and transactions

On one hand, these tags are tied to the special concerns of archivists and archive vendors such as Lexis-Nexis and MediaStream. But they also will be invaluable to content producers as media institutions come to realize that their archives are one of the few truly unique resources they have, especially as they pertain to local markets. Consequently, ease of precision searching to facilitate, pardon the expression, re-purposing the content will drive P/L decisions. Typical archiving tags identify the source (byline, credit, publication, edition, page, section, zone), the type (news, feature, analysis, review, game story), or the relationship of the object to other objects (photo caption, graphics text, sidebars, series information).

Here are the suggested tags the sub-group came up with in a short time. While we have tried to group the tags in their principal family -- Process, Content or Archive -- a tag will often serve more than one master. It is important to understand, however, that the term standard as applied to these tags does not mean mandatory or constricted. The system proposed is of value primarily because it allows -- and perhaps even encourages -- flexibility and customization.

NOTE: These tags below are just for illustrative purposes; this does not approach a complete list. Everyone should feel free to suggest tags at all levels, keeping in mind the nested functions and cross usage. This formulation is ideally suited for Web presentation. For a model, see "The FGDC Content Standard for Digital Geospatial Metadata" http://www.its.nbs.gov/nbs/meta/meta.html

Journalism Content Metatags (preliminary):

Copyright

Permissions
Rights management
Payment

Story type (each type could have different sub-tags):

Sport
Sport-specific statistics
Results
Politics
Business and economics

Column Name
Series name (package)
Series number
[Story] Package name/ ID

Stories
Graphics
Photos
Links to….
Audio
Video
Database of CAR data

Preliminary Journalism Process Tags

Source

Press release
Community publishing
Gov. doc
"Commercial"

Ads
Display

Classified
Audio
Video

Byline/Author
Credit
Page
1st character of text
Section
Edition
Zone
Language
Priority (see ANPA wire service header list)
Text

Words
Bytes
Col. Inches (if appropriate)

Lines

Graphic

File size
Original dimensions

Audio

File size
Run time

Video

File size
Run time

Preliminary Journalism Content Tags

Slug
Byline(s)
Text

Words
Bytes
Col. Inches (if appropriate)

Graphic

File type
File size
Original dimensions

Audio

File type
File size
Run time

Video

File type
File size
Run time

Date

Date Authored
Published
Date Expired
Date(s) referred to

Story type (each type could have different sub-tags):

Sport

Sport-specific statistics
Results

Politics
Business and economics

Headline & Subheads

Poster heads/decks

Lead
Summary/nut graf/abstract
Text

Text that ran
Text that didn't run

Preliminary Journalism Archive Tags

1st character of text
Memo

Confidential
Editor's notes
To public
To staff

Correction

Published
Uncorrected

Slug
Unique ID (i.e. accession number used by MediaStream, Lexis-Nexis, etc. for filing)
Column Name
Series name (package)
Series number
1st character of text
Package name/ ID -- All the associated stories, graphics, photos, dynamic tables, audio, video
Stories
Graphics
Photos
Links to….
Audio
Video
Sports Scores
Results
Database of CAR data
Keywords (i.e. "subject headings")

Proper names
Alias or AKA(s)
Company names
Industry sector codes (SIC code; SIC codes are being replaced by the new North American Industry Classification System (NAICS))

Links to SIC-to-NAICS conversion tables

Stock exchange symbol
Generic Subject Headings

Murder
Schools
Weather
Accidents

Story types (controlled list) – review, obit, news

ISSUES FOR THE IMMEDIATE FUTURE:

Inclusion of hardware, database, archiving and frontend vendors in our discussions
Inclusion of standard lists of accepted tag content that are being developed by specialists in unique areas, e.g. the Keywords for Photos list that has become standardized among news librarians.
Inclusion of experts in other file formats (especially audio and video file formats) in our discussions
Inclusion of individuals from media other than U.S. -- particularly Latin America, Canada and Europe -- in our discussions
Greater consideration of user research, especially in terms of how consumers search for, identify and retrieve content
Develop more intuitive tag nomenclature that easily reflects the concepts of family or class as they pertain to: (1) journalism metatags; (2) journalism process tags; (3) journalism content tags; (4) journalism archiving tags
Develop a plan to create tag portals, or extensions, for: transaction (e-commerce) tags; appending content and files to established works/data; notifying original copyright or authorship holders of re-publication and/or sales of content. See: http://www.bic.org.uk/bic/uniquid.html and http://www.bic.org.uk/rights.html
Develop process for consultation with all stakeholders on establishing and modifying these "standards."

Brief History of Metadata and Content Tagging

October 10, 1995. METADATA COUNCIL FORMS TO LAUNCH STANDARDS INITIATIVE: Industry Leaders Organize Coalition of Vendors and End Users to Define Metadata Standards http://www.he.net/~metadata/press/pr19951010.html and Notable Meta Data Coalition Press Releases http://www.he.net/~metadata/press/index.html

1995 The National States Geographic Information Council (NSGIC) receives a Federal Geographic Data Committee (FGDC) Competitive Grant to implement its proposal titled "An Educational and Research Program in Support of Content Standards for Digital Geospatial Metadata." NSGIC's mission is to encourage effective and efficient government through the coordinated development of geographic information and systems to ensure that information may be integrated at all levels of government. NSGIC has been active in the area of metadata research and adoption through the review of technical standards, compilation of a catalogue of metadata inventories held by states, and support of publications and other related initiatives. The NSGIC metadata research and education project consists of three phases. The first tests the FGDC metadata standard on a wide range of state, local, tribal and federal information as part of a cooperative effort of nine member states. The second involves the preparation of a practical explanation of the metadata standard for state and local governments. The third phase is a distance education program based on the results of the second phase. http://rat.lic.wisc.edu/metadata/metahome.htm

Nov. 1998 API Media Center Conference: "Developing a Grammar for New Media." A proposal to create a news markup language for the Web may be the most influential development to emerge from a gathering of some of the finest creative minds from a broad spectrum of disciplines held Nov. 7-10, 1998 at The Media Center at the American Press Institute in Reston, VA. http://www.mediacenter.org/grammar

Jan. 1999 API Media Center Conference: "Grammar II: The Sequel," in Dallas. The intrepid grammarians attempt to develop a News Markup Language. http://www.mediacenter.org/nml.htm

References:

Marshall, Catherine C. "Making Metadata: a study of metadata creation for a mixed physical-digital collection." ABSTRACT: Metadata is an important way of creating order in emerging distributed digital library collections. This paper presents an analysis of ethnographic data gathered in a university library's educational technology center as the staff develops metadata for a mixed physical-digital collection of visual resources. In particular, the paper explores issues associated with the application of standards, uncertain collection and metadata boundaries, distribution and responsibility, the types of description that arise in practice, and metadata temporality and scope. These issues help to characterize a problem space, and to explore the trade-offs collection maintainers must face when they create metadata for heterogeneous materials. http://www.csdl.tamu.edu/~marshall/dl98-making-metadata.html#table2

Rust, Godfrey. "Metadata: The Right Approach, An Integrated Model for Descriptive and Rights Metadata in E-commerce." "There are currently four major active communities of rights-holders directly confronting these questions [involving digital content and metatags]: the DOI (Digital Object Identifier) community, at present based in the book and electronic publishing sector; the IFPI community of record companies; the ISAN community embracing producers, users, and rights owners of audiovisuals; and the CISAC community of collecting societies for composers and publishers of music, but also extending into other areas of authors' rights, including literary, visual, and plastic arts.... This paper examines three propositions that support the need for radical integration of metadata and rights management concerns for disparate and heterogeneous materials, and sets out a possible framework for an integrated approach. It draws on models developed in the CIS (Common Information System) plan and the DOI Rights Metadata group, and work on the ISRC (International Standard Recording Code), ISAN (International Standard Audiovisual Number), and ISWC standards and proposals." http://www.dlib.org/dlib/july98/rust/07rust.html#introduction

Resources:

NOTE: "The Dublin Core" SITE IS THE JUMP STATION TO REVIEW THE MANY YEARS OF INTERNATIONAL WORK ALREADY COMPLETED ON TAGGING CONTENT.

The Dublin Core: A Simple Content Description Model for Electronic Resources : Metadata for Electronic Resources

The Dublin Core is a metadata element set intended to facilitate discovery of electronic resources. Originally conceived for author-generated description of Web resources, it has attracted the attention of formal resource description communities such as museums, libraries, government agencies, and commercial organizations.

The Dublin Core Workshop Series has gathered experts from the library world, the networking and digital library research communities, and a variety of content specialties in a series of invitational workshops. The building of an interdisciplinary, international consensus around a core element set is the central feature of the Dublin Core. The progress represents the emergent wisdom and collective experience of many stakeholders in the resource description arena. An open mailing list supports ongoing work. See: http://purl.oclc.org/dc

Metadata Related Tools: http://purl.oclc.org/dc/tools/index.htm

Creating Metadata (Templates)
Tools for the Creation/Change of Templates
Automatic Extraction/Gathering of Metadata
Automatic Production of Metadata
Conversion Between Metadata Formats
Integrated (Tool) Environments

The Meta Data Coalition (formerly Metadata Coalition) brings together vendors and users allied with a common purpose of driving forward the definition, implementation and ongoing evolution of a meta data interchange format and its support mechanisms. The need for such standards arises as meta data, or the information about the enterprise data, emerges as a critical element in effective data management. Different tools, including data warehousing, distributed client/server computing, databases (relational, OLAP, OLTP...), integrated enterprise-wide applications, etc., must be able to cooperate and make use of meta data generated by each other. http://www.he.net/~metadata/index.html

Meta Data Interchange Specification (MDIS Version 1.1) This is version 1.1 of the Meta Data Interchange Specification. It is available here as a table of contents and complete downloadable copies in PDF and PostScript formats. This document is dated August 1, 1997 http://www.csdl.tamu.edu/~marshall/dl98-making-metadata.html#table2 The Table of Contents here offers a model for how we might structure our reports and presentation.

Geospatial Support Staff Metadata Tutorial Introduction

"In the beginning, one collected ... data without considering that somewhere, sometime, someone might ask -

Why was this data gathered?

What was collected?

Who collected it?

How was it collected?

How current is it?

Where is this data?

[Who has access rights to the data? Who has rights to change the data? Should the data be updated? Should it be referenced to other data? Should the data have a warning flag of any sort (e.g. potentially libelous)?]

"This brings us to the somewhat confusing business of collecting Metadata. When first faced with this nasty problem, the collection of Metadata can seem overwhelming. Fortunately, the Federal Geographic Data Committee (FGDC) has thought this out and has published a metadata content standard. So we know what to collect. Our job is to figure out what information would be most meaningful and best define our data sets using their standard. So let's begin by looking at the 10 Metadata sections." http://www.blm.gov/gis/meta/barney/meta1.html

XML Resources (Note that XML is a trademark held by M.I.T.):

Frequently Asked Questions about the Extensible Markup Language. Maintained on behalf of the World Wide Web Consortium's XML Special Interest Group, with contributions from many members of the group as well as FAQ readers around the world. A site with reliable, straightforward answers. http://www.ucc.ie/xml

What the ?XML! may be one of the worst designed Web sites in existence, and Geocities a challenge to navigate, but be patient and drill down. The information is straightforward and helpful. This is the general site with the "Learn XML in 11.5 Minutes" document. http://www.geocities.com/SiliconValley/Peaks/5957/wxml.html

XML.COM is a Seybold Publications and O'Reilly & Associates venture that apparently aims to cover the XML world as it evolves. Good at staying up to date. http://www.xml.com

W3C XML This is the home page of the W3C XML Activity, part of the Architecture Domain. The XML Activity Statement explains the W3C's work on this topic in more detail. http://www.w3.org/XML

Microsoft's XML resource page. This section provides information on Microsoft's support of the Extensible Markup Language (XML) -- the universal format for data on the Web. XML allows developers to easily describe and deliver rich, structured data from any application in a standard, consistent way. XML does not replace HTML; rather, it is a complementary format. (Note that XML has its own newsgroup: microsoft.public.xml. You can use this newsgroup to get answers to your XML questions, learn what's new with XML, and find out what you can do using XML.) http://www.microsoft.com/xml/default.asp

Arbortext, a vendor "of open, XML-based software solutions that accelerate the process of creating, managing, and delivering product information in medium and large enterprises." Has NT products. Click down, however, to find the links to other online XML resources. http://www.arbortext.com

Suggested Journalism Tags:

Metatags	(P)rocess / (C)ontent / (A)rchive Tags	Synonyms	Example of Usage
Source	A	Newspaper, Publication	Washington Post, U.S. News & World Report, Associated Press
Byline; Credit	A; A	Author	By Paul Prestige, Daily Rag Staff Reporter; New York Times
Section; Page; Edition; Zone; Version; Priority	A; A; A; A; A; A		A Section, Sports, Editorial; 01A, A01, 9B; Final, State; Jersey Shore
Length; Wordcount	A; A		21 inches; 622 words
Date (Published); Day Published; Date Authored; Date Expired; Time Stamp	A; A; A; A; A		1/10/99; 19991201: 18:32
Copyright	A		© Washington Post
Type (Record)	A	Format	Text, Photo, Graphic, Audio, Video, Database
Headline	A	Head, Title	Self explanatory
Lead; Abstract; Nut Graph	A		Self explanatory
Dateline			Mexico City
Text; Text that didn’t publish; 1st character of text; Captions; Text from Graphics; Pull Quotes			Self explanatory
Keywords; Subject Terms; Named Persons; Corporate ID; Stock Symbol; Profile			Sex Crimes, Rape, William Jefferson Clinton, Microsoft, ATT, Hillary Clinton
Column Name; Series Name; Series Number			Business Briefs, Pope John Paul II in Cuba, Third of Three parts, One of an occasional series
Language			English
Memo
Correction			Sam Schleterbahn’s name was incorrectly spelled in Saturday’s story.
Slug; Accession Number; Package ID; Thread ID			Israeltalks 19990101842, 670345
Package Connectors; Link to Sidebars; Photos + Photo Credit; Graphics + Credit; Audio; Video; Links to Sports scores; Results; Databases of CAR data