The OASIS Cover Pages: The Online Resource for Markup Language Technologies
Last modified: November 16, 2000
Multilevel Annotation, Tools Engineering (MATE)

[November 15, 2000] The MATE project [Telematics Project LE4-8370] "aims to facilitate re-use of language resources by addressing the problems of creating, acquiring, and maintaining language corpora. The problems are addressed along two lines: (1) through the development of a standard for annotating resources; (2) through the provision of tools which will make the processes of knowledge acquisition and extraction more efficient. Specifically, MATE will treat spoken dialogue corpora at multiple levels, focusing on prosody, (morpho-) syntax, co-reference, dialogue acts, and communicative difficulties, as well as inter-level interaction. The results of the project will be of particular benefit to developers of spoken language dialogue systems but will also be directly useful for other applications of language engineering."

"The MATE workbench is a program designed to aid in the display, editing and querying of annotated speech corpora. It can also be used for arbitrary sets of hyperlinked XML encoded files... At the heart of the workbench is a database which contains a representation of the XML files which have been loaded into the workbench. The abstract data model that we use is a directed graph, and any directed graph can be represented in the model. Each node in this graph has a type name (e.g., word, phrase, or other arbitrarily chosen name), a set of string-valued attributes and a list of child nodes (corresponding to outgoing edges from the node). Edges in the model are unlabelled. A node may have several different parent nodes, but each node has (at most) one distinguished parent node which can be used to treat the graph as an edge-disjoint set of trees. This view of the graph is useful for algorithms which require a tree structure, such as writing the internal representation out to a set of files, applying the stylesheet processor, or asking questions about the linear order of two elements. The semantics or interpretation of the data model is flexible and may vary depending on the requirements of the annotation schemes used. A common, but not necessary, interpretation of a node and the parent-child relation is that a node represents a segment of speech and its children represent a finer sub-division of the same segment of speech, where the order of the children corresponds to the temporal order. The mapping from XML-encoded files to this data model is fairly straightforward..."

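The data model quoted above can be illustrated with a minimal Python sketch. This is not the workbench's own code (the workbench is written in Java); the names `Node`, `add_child`, and `tree_children` are hypothetical and chosen only for illustration. It shows nodes with a type name, string-valued attributes, and an ordered child list, unlabelled edges, and at most one distinguished parent per node.

```python
class Node:
    """A graph node: type name, string-valued attributes, ordered child list."""

    def __init__(self, type_name, attributes=None):
        self.type_name = type_name                # e.g. "word" or "phrase"
        self.attributes = dict(attributes or {})  # string-valued attributes
        self.children = []                        # outgoing, unlabelled edges
        self.parents = []                         # a node may have several parents
        self.distinguished_parent = None          # at most one per node


def add_child(parent, child, distinguished=False):
    """Add an edge parent -> child; optionally mark parent as distinguished."""
    parent.children.append(child)
    child.parents.append(parent)
    if distinguished:
        if child.distinguished_parent is not None:
            raise ValueError("node already has a distinguished parent")
        child.distinguished_parent = parent


def tree_children(node):
    """Children reached through distinguished edges only (the tree view)."""
    return [c for c in node.children if c.distinguished_parent is node]


# A word that belongs to an utterance (tree edge) and to a phrase (extra edge).
utterance = Node("utterance", {"id": "u1"})
phrase = Node("phrase", {"id": "p1"})
word = Node("word", {"id": "w1"})
add_child(utterance, word, distinguished=True)
add_child(phrase, word)
```

Here `word` has two parents, but only the edge from `utterance` takes part in the tree view, so `tree_children(phrase)` is empty while `phrase.children` still contains `word`.
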
The 'MATE Dialogue Annotation Guidelines' "contains a comprehensive collection of recommendations or guidelines for representing descriptive annotation of spoken dialogue material. Descriptive annotation includes any information that encodes linguistic data with respect to their physical, perceptual, or functional dimensions. Spoken dialogue material refers to any collection of spoken dialogue data (human-human, human-system, or human-human-system), including not only speech files but also logfiles or scenarios which are related to the spoken dialogues. Spoken dialogue annotation is the only area considered in this report, however this does not exclude that the recommendations may apply to other areas as well. It builds on a common standard framework in terms of a coding module at the conceptual level and an underlying representation in XML at the implementational level. For each level considered by MATE, recommendations are provided on how to encode relevant phenomena, one or more best practice coding modules are provided, and several examples are given. The descriptions given in this document allow a complete separation from the underlying machine representation for which MATE uses XML. The separation means that in principle one could decide to [use] other formats than XML at the implementational level without affecting the coding module in any way. In this document recommendations will be made that rely on a given markup language, XML, that has already found broad support. This is an important factor as the availability of parsers and other software enhances the integration of this proposal into existing environments." Annex C provides the XML DTDs.

References:

  • MATE Project Web site

  • MATE Project description

  • MATE Project overview

  • The MATE Markup Framework - MATE Deliverable D1.2

  • MATE Dialogue Annotation Guidelines

  • MATE deliverables

  • MATE publications

  • MATE Advisory Panel

  • MATE institutional partners

  • MATE Workbench

  • An introduction to Mate stylesheets

  • MATE workbench information and discussion

  • See also: MATE Project - IMS

  • See also: Linguistic Annotation - From UPenn Linguistic Data Consortium. "...prepared in conjunction with LDC research on the logical structure of linguistic annotation, based on annotation graphs."

  • "The MATE Workbench - A tool for annotating XML corpora." By Amy Isard, David McKelvie (Language Technology Group, Division of Informatics, Edinburgh University), Andreas Mengel (Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart), and Morten Baun Møller (Natural Interactive Systems Laboratory, Odense University). In Proceedings of Recherche d'Informations Assistée par Ordinateur (RIAO'2000), Paris, April, 2000. "This paper describes the design and implementation of the MATE workbench, a program which provides support for flexible display and editing of XML annotations, and complex querying of a set of linked files. The workbench was designed to support the annotation of XML coded linguistic corpora, but it could be used to annotate any kind of data, as it is not dependent on any particular annotation scheme. Rather than being a general purpose XML-aware editor it is a system for writing specialised editors tailored to a particular annotation task. A particular editor is defined using a transformation language, with suitable display formats and allowable editing operations. The workbench is written in Java, which means that it is platform-independent. This paper outlines the design of the workbench software and compares it with other annotation programs. . . The annotation or markup of files with linguistic or other complex information usually requires either human coding or human correction of automatic coding. This annotation can be a time consuming and tedious process, yet it is a necessary one for the creation or analysis of corpora and the provision of data necessary for training automatic annotation programs, such as parsers, speech recognisers, or named entity recognisers. In order to support the creation and use of such annotations, specialised editing, retrieval and display programs are required. However, the creation of these programs can in itself be a complex process.
In the MATE workbench, we have attempted to provide a parametrised tool which can be specialised to provide these functions on any annotation scheme capable of being expressed in XML. . . The major features of the MATE workbench are: (1) An internal database - using arbitrary XML as an interchange format, extended to cover multiple hierarchies and arbitrary directed graphs using hyperlinks or ID/IDREF pointers between elements. This extension from trees to graphs is required to allow XML to represent more complex data. (2) A query language which is tailored to this internal representation. This language returns tuples instead of single elements (as in the XSLT query language). The architecture allows us to add new structure to the database by evaluating a query. (3) A transformation language and processor that goes beyond XSLT in some respects. (4) A display and editing engine for displaying to the user and enabling editing actions. We believe that the MATE project has developed a prototype annotation tool which demonstrates a feasible approach to making specialised editors easier to develop and to use. This claim will be tested in future work at Edinburgh where we will use the workbench to annotate a corpus of telephone dialogues. The system is based on developing standards in the XML community. There are a number of ideas which underlie the structure of the MATE workbench, which we believe are important. (1) For flexibility of display and the easy definition of editors tailored to a particular annotation task, one should use a high level transformation language to provide a flexible link between logical structures and display structures. (2) The system design should be reflexive -- definitions of user interfaces, query results, and corpus descriptions should all be in the same format and expressible in the same data model. 
(3) In order to handle the complexity of linguistic annotations and a reflexive system design one needs to extend the data model away from trees towards general graphs. The query and transformation languages used should reflect this data model. (4) The display processor should be extendible, so that it becomes easy to add new display options, for example to add tree/graph displaying capabilities. However there are a number of aspects of the prototype system which it would be useful to readdress if we were to re-implement the workbench. Firstly, now that the XSLT language is becoming stable, it would be sensible to address the issue of using the standard XSLT transformation language instead of our cut-down version of it. This would involve (a) using the XPATH query language in stylesheets, (b) addressing how [to] handle hyperlinking in stylesheets, and (c) using XMLSchema to enable access to DTD-type information in stylesheets. An alternative (more speculative) approach would be to keep the MATE query language and extend stylesheet templates to work with tuple-returning queries. Secondly, to address the issue of efficiency and allow the workbench to be used with much larger files. This could involve reimplementing the workbench to use an external database. We have chosen an architecture where entire files are loaded into memory and processed as a group. The alternative would be to provide a streaming interface where larger files were read and processed a section at a time (where the definition of a section would be defined by a query on the XML structure of the file). Thirdly, extend the display objects so as to be able to use any Java (Swing) user interface component in stylesheets and in displays. This would involve writing a schema (DTD or otherwise) for the Java classes, which would be a useful exercise in its own right.
Finally, the transformation language and the data model need to be extended to encompass dynamic interfaces and the updating of the document structure. These are necessary to support the editing of corpus annotations. The two directions for future work are making the language for defining editing actions easier to use, and allowing display structures to be created dynamically, i.e., popup menus." [cache]

  • "The Mate Workbench - an annotation tool for XML coded speech corpora." By David McKelvie, Amy Isard, Andreas Mengel, Morten Baun Møller, Michael Grosse, and Marion Klein. To appear in Speech Communication Volume 33, Numbers 1-2 (December 2000). Special issue on Speech Annotation and Corpus Tools. "This paper describes the design and implementation of the MATE workbench, a program which provides support for the annotation of speech and text. It provides facilities for display and editing of such annotations, and complex querying of a resulting corpus. The workbench offers a more [flexible] approach than most existing annotation tools, which were often designed with a specific annotation scheme in mind. Any annotation scheme can be used with the MATE workbench, provided it is coded using XML markup (linked to the speech signal, if available, using certain conventions). The workbench uses a transformation language to define specialised editors optimised for particular annotation tasks, with suitable display formats and allowable editing operations tailored to the task. The workbench is written in Java, which means that it is platform-independent. This paper outlines the design of the workbench software and compares it with other annotation programs. At the heart of the workbench is a database which contains a representation of the XML files which have been loaded into the workbench. The abstract data model that we use is a directed graph, and any directed graph can be represented in the model. Each node in this graph has a type name (e.g., word, phrase, or other arbitrarily chosen name), a set of string-valued attributes and a list of child nodes (corresponding to outgoing edges from the node). Edges in the model are unlabelled. A node may have several different parent nodes, but each node has (at most) one distinguished parent node which can be used to treat the graph as an edge-disjoint set of trees.
This view of the graph is useful for algorithms which require a tree structure, such as writing the internal representation out to a set of files, applying the stylesheet processor, or asking questions about the linear order of two elements. The semantics or interpretation of the data model is flexible and may vary depending on the requirements of the annotation schemes used. A common, but not necessary, interpretation of a node and the parent-child relation is that a node represents a segment of speech and its children represent a finer sub-division of the same segment of speech, where the order of the children corresponds to the temporal order. The mapping from XML-encoded files to this data model is fairly straightforward. As each XML file is read into the database, each XML element and piece of text is represented as a new node, whose type is the name of the XML element (or #PCDATA for text). Attributes are copied over (taking default values from the XML Document Type Definition [DTD] if required). The parent-child relation is that derived from textual inclusion in the XML files, and the distinguished parent of a node is the node of the textually enclosing XML element. This gives us a set of trees, one per loaded file. As a final stage of loading, elements which are hyperlink sources (i.e., which have href attributes), have additional (non-distinguished) parent-child links added to them. The href attribute is evaluated to give a list of other nodes in the internal representation, and each node in this list is then added as a new child of the source node. This gives us the ability to represent arbitrary directed graphs in XML..." [alt URL, and cache]

  • "A generic approach to software support for linguistic annotation using XML." By Jean Carletta, David McKelvie, Amy Isard, Andreas Mengel, Morten Baun Møller, and Marion Klein. 28 pages.

  • "The MATE workbench annotation tool, a technical description." By Amy Isard, David McKelvie, Andreas Mengel, and Morten Baun Møller. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000), Athens, May 2000.

  • "The MATE Workbench." By Laila Dybkjær and Niels Ole Bernsen. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000), Athens, May 2000.

  • "The MATE Annotation Workbench: User Requirements." By Jean Carletta, Amy Isard, Marion Klein, David McKelvie, Andreas Mengel and Morten Baun Møller. In Proceedings of the ACL99 Workshop 'Towards Standards and Tools for Discourse'. June 1999.
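
The loading procedure quoted above from the Speech Communication paper (each element and piece of text becomes a node, textual inclusion gives the distinguished parent, and href sources gain extra non-distinguished children) can be sketched in Python as follows. This is an illustrative sketch, not the workbench's implementation. It makes several assumptions: the names `Node`, `load`, and `SAMPLE` are hypothetical; an `href` value is taken to be a whitespace-separated list of `#id` references within a single document, which is a simplification and not the MATE hyperlink syntax; and DTD default attribute values are not handled.

```python
import xml.etree.ElementTree as ET


class Node:
    def __init__(self, type_name, attributes=None, text=None):
        self.type_name = type_name                # element name, or "#PCDATA"
        self.attributes = dict(attributes or {})
        self.text = text                          # set only for #PCDATA nodes
        self.children = []
        self.distinguished_parent = None


def _text_node(text, parent):
    node = Node("#PCDATA", text=text)
    node.distinguished_parent = parent
    return node


def _build(element, parent, by_id):
    """Turn one XML element (and its subtree) into nodes; record ids."""
    node = Node(element.tag, element.attrib)
    node.distinguished_parent = parent            # the textually enclosing element
    if "id" in element.attrib:
        by_id[element.attrib["id"]] = node
    if element.text and element.text.strip():
        node.children.append(_text_node(element.text, node))
    for child in element:
        node.children.append(_build(child, node, by_id))
        if child.tail and child.tail.strip():
            node.children.append(_text_node(child.tail, node))
    return node


def _add_links(node, by_id):
    """Final stage: add non-distinguished children for href sources."""
    tree_children = [c for c in node.children if c.distinguished_parent is node]
    href = node.attributes.get("href")
    if href:
        # Assumed simplified href form: "#w1 #w2" (same-document ids only).
        for ref in href.split():
            node.children.append(by_id[ref.lstrip("#")])
    for child in tree_children:
        _add_links(child, by_id)


def load(xml_text):
    by_id = {}
    root = _build(ET.fromstring(xml_text.strip()), None, by_id)
    _add_links(root, by_id)
    return root, by_id


SAMPLE = """
<corpus>
  <words><w id="w1">hello</w><w id="w2">there</w></words>
  <phrase id="p1" href="#w1 #w2"/>
</corpus>
"""

root, by_id = load(SAMPLE)
phrase = by_id["p1"]
print([c.attributes["id"] for c in phrase.children])  # ['w1', 'w2']
```

In this sketch the two `w` nodes keep `words` as their distinguished parent, so the per-file tree is unchanged, while `phrase` reaches them through the added link edges.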


Document URI: http://xml.coverpages.org/mate.html
Robin Cover, Editor: robin@oasis-open.org