Cover Pages: TalkBank and the Codon XML-Based Annotation Framework

[November 16, 2000] "The ongoing growth in computer power and connectivity has led to dramatic changes in the methodology of science and engineering. By stimulating fundamental theoretical discoveries in the analysis of semistructured data, we can extend these methodological advances to the social and behavioral sciences. Specifically, we propose the construction of a major new tool for the social sciences, called TalkBank. The goal of TalkBank is the creation of a distributed, web-based data archiving system for transcribed video and audio data on communicative interactions. We will develop an XML-based annotation framework called Codon to serve as the formal specification for data in TalkBank. Tools will be created for the entry of new and existing data into the Codon format; transcriptions will be linked to speech and video; and there will be extensive support for collaborative commentary from competing perspectives. The TalkBank project will establish a framework that will facilitate the development of a distributed system of allied databases based on a common set of computational tools. Instead of attempting to impose a single uniform standard for coding and annotation, we will promote annotational pluralism within the framework of the abstraction layer provided by Codon. This representation will use labeled acyclic digraphs to support translation between the various annotation systems required for specific sub-disciplines. There will be no attempt to promote any single annotation scheme over others. Instead, by promoting comparison and translation between schemes, we will allow individual users to select the custom annotation scheme most appropriate for their purposes. Codon will also facilitate the direct comparison of complementary and competing analyses of a given dataset."

"Beyond the annotation of communicative data, Codon must provide the necessary structures for storing the web of annotations that corresponds to a particular database. It must also provide tools for creation, maintenance and query. Since the structure of annotation data is ill-suited to conventional data models, and since both the data and the structure are subject to continual revision during the time course of collection and analysis work, Codon will employ semistructured data models. Under this approach to database design, the data is self-describing and there is no requirement to force annotations into the straightjacket of a relational schema. This move brings annotation structures into the realm of database technology, and provides the foundation for data exchange and transformation. The semistructured data model for annotation graphs is, in effect, an internal data structure for exchange of data between corpora, but having such a structure invites the idea of querying data in AG format directly. XML is a natural 'surface' representation for semistructured data, and we have adopted it as the primary exchange format for Codon. Import and export capabilities will be provided for current systems, such as CHAT, SignStream, and Emu. Codon databases, and the multimodal information they store, will be accessible over the Internet. Various XML extensions, such as XML-Data, RDF and XML-QL, will provide the starting point for our explorations of the representation and query of Codon structures within XML."

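As a rough illustration of the "self-describing" data and the XML surface representation described above, the following Python sketch serializes a tiny annotation graph to XML and reads it back. The element and attribute names (graph, node, arc, offset, note) are hypothetical and chosen only for this example; they are not taken from the Codon specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation graph. Nodes may carry a time offset in seconds;
# arcs carry a type and a label. None of these names come from Codon itself.
nodes = {"n1": 0.00, "n2": 0.42, "n3": None}  # n3 has no known offset
arcs = [
    {"src": "n1", "dst": "n2", "type": "word", "label": "hello"},
    {"src": "n2", "dst": "n3", "type": "word", "label": "there"},
    # This arc has an extra field; no fixed schema forbids it.
    {"src": "n1", "dst": "n3", "type": "speaker", "label": "CHI", "note": "child"},
]

def to_xml(nodes, arcs):
    """Build an XML tree in which every record names its own fields."""
    root = ET.Element("graph")
    for node_id, offset in nodes.items():
        attrs = {"id": node_id}
        if offset is not None:
            attrs["offset"] = f"{offset:.2f}"
        ET.SubElement(root, "node", attrs)
    for arc in arcs:
        ET.SubElement(root, "arc", {key: str(value) for key, value in arc.items()})
    return root

xml_text = ET.tostring(to_xml(nodes, arcs), encoding="unicode")
parsed = ET.fromstring(xml_text)  # round trip: the structure is recoverable from the text
```

Because each record names its own fields, an arc can gain an attribute such as note without any schema migration, which is the property the proposal contrasts with a relational schema.
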
Talk Bank development of transcription and commentary tools: "The development of data formats and annotated corpora must proceed hand-in-hand with the construction of tools for transcription and analysis. In accord with our pluralistic emphasis, we will encourage the distributed construction of these tools, as well as the adaptation of existing tools. By adding import and export capabilities to existing tools, we can facilitate the development of interoperability between current tools. To seed this integration process, we will work with the authors of existing systems to get import/export capabilities for the Codon format (specified as an XML DTD, or in some other data structuring formalism that can be expressed in XML). We will also create our own open-source tools for the following tasks: transcription, commentary, alignment, browsing, and retrieval. Transcription tool: We will provide a platform-independent Codon Editor implemented in Tcl/Tk or Java. This will provide a high-level interface looking somewhat like the displays in Project 2, and it will store annotation data as XML. Our experience with the CHILDES editor will provide a starting point for the design. We will provide multiple, customizable 'views' of annotation and signal data. Commentary tool: Annotation often references signal data directly. We will create a commentary tool, which permits 'meta' annotations to reference existing annotations. This possibility for indirect annotation is already intrinsic to the annotation graph formalism (and to XML), and is sometimes termed 'Standoff Markup'."
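
The idea of standoff 'meta' annotation in the passage above can be sketched in a few lines of Python. This is only an illustration under assumed names: the Annotation and Commentary classes, their fields, and the signal_span helper are hypothetical and do not describe the planned commentary tool.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass(frozen=True)
class Annotation:
    """A first-order annotation anchored directly to the signal (seconds)."""
    ann_id: str
    start: float
    end: float
    label: str

@dataclass(frozen=True)
class Commentary:
    """A 'meta' annotation: it points at annotation ids, not at the signal."""
    comment_id: str
    targets: Tuple[str, ...]
    author: str
    text: str

def signal_span(comment: Commentary, annotations: Dict[str, Annotation]) -> Tuple[float, float]:
    """Recover the stretch of signal a commentary covers via its targets."""
    referenced = [annotations[target] for target in comment.targets]
    return (min(a.start for a in referenced), max(a.end for a in referenced))

annotations = {
    "a1": Annotation("a1", 0.00, 0.42, "hello"),
    "a2": Annotation("a2", 0.42, 0.90, "there"),
}
note = Commentary("c1", ("a1", "a2"), "reviewer_1", "Greeting sequence.")
span = signal_span(note, annotations)
```

The commentary never copies or modifies the transcription it discusses, so several commentators can attach competing remarks to the same annotations.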

Talk Bank Query tools: "While we intend to use XML as the standard for data exchange, this leaves open what application programming interfaces (APIs) and other interfaces (GUIs and query languages) should be provided. It is clear that the provision of such interfaces for the Talk Bank data will greatly enhance its use and is likely to determine the success of the project. While a number of APIs for XML are under development, the only generally accepted API at the time of writing is the Document Object Model. Of particular importance is the development of an efficient query language that will allow researchers to scan collections of annotations for features of interest. These will be relatively complex (e.g. parse tree structures) and involve temporal relationships across modalities. The annotation framework proposed by Bird and Liberman is essentially that of a labeled graph, and the direct representation of this in XML is, in its simplest form, two relations consisting of a binary (node-id, node-label) node relation and a ternary (node-id, edge-label, node-id) edge relation. Any non-trivial query against this representation involves joins, and the current crop of XML query languages differ greatly in their ability to express joins: at present only XML-QL provides for arbitrary joins, and even in that language a complex query will be quite cumbersome and probably inefficient. The problem is that XML query languages tend to be tuned to the tree-like structure of XML documents. In this case the tree is 'flat'. We are left with two avenues of investigation: (1) to augment the XML structure so that it is a better match to the query language, or (2) to consider alternative query languages. This will be one of our first investigations. It is likely that a combination of the two approaches will be needed. We shall also investigate the construction of other APIs designed to simplify the problem of programming with Talk Bank. These may make use of an embedded query language."
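
To make the point about joins concrete, here is a minimal Python sketch that stores the binary node relation and the ternary edge relation in an in-memory SQLite database and asks which edge immediately follows the edge labeled 'the'. SQL is used here only as a stand-in; it is not one of the XML query languages discussed in the proposal. The table names, column names, sample data, and the use of time offsets as node labels are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (node_id TEXT PRIMARY KEY, node_label TEXT);
    CREATE TABLE edge (src TEXT, edge_label TEXT, dst TEXT);
""")

# Binary (node-id, node-label) relation; labels here are time offsets.
conn.executemany(
    "INSERT INTO node VALUES (?, ?)",
    [("n1", "0.00"), ("n2", "0.35"), ("n3", "0.80"), ("n4", "1.20")],
)

# Ternary (node-id, edge-label, node-id) relation; each edge is one word.
conn.executemany(
    "INSERT INTO edge VALUES (?, ?, ?)",
    [("n1", "the", "n2"), ("n2", "dog", "n3"), ("n3", "barks", "n4")],
)

# "Which edge follows 'the', and at what node do the two meet?"
# Even this simple question needs a self-join on edge plus a join to node.
rows = conn.execute("""
    SELECT e1.edge_label, e2.edge_label, n.node_label
    FROM edge AS e1
    JOIN edge AS e2 ON e1.dst = e2.src
    JOIN node AS n  ON n.node_id = e1.dst
    WHERE e1.edge_label = 'the'
""").fetchall()
```

Each additional step along the graph adds another self-join, which suggests why a query language tuned to nested tree structure fits this flat representation poorly.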

References:

TalkBank web site
"Talk Bank: A Multimodal Database of Communicative Interaction." TalkBank proposal to NSF.
TalkBank DTDs
Developing an infrastructure for online linguistic archives - Contribution from Gary Simons.
CHAT XML-Schema
AIF DTD

