The OASIS Cover Pages: The Online Resource for Markup Language Technologies
Created: December 29, 2003.

LREC Post-Conference Workshop on XML-Based Richly Annotated Corpora.

A Call for Papers has been issued in connection with an announcement for the Workshop on XML-Based Richly Annotated Corpora, to be held May 29, 2004 in Lisbon, Portugal as a post-conference event following the Fourth International Conference on Language Resources and Evaluation (LREC). LREC 2004 is organized by the European Language Resources Association (ELRA) in cooperation with several other associations, consortia, and international organizations. The Annotated Corpora Workshop organizers recognize that "XML has become a de facto standard for the representation of corpus resources: it is being used for representing speech and text corpora, multimodal and multimedial corpora, as well as, in particular, integrated corpora which combine different modalities. XML-based representations make it easier to work with richly annotated corpora, which include annotations from different levels of linguistic description or from different modalities. A number of tools have also become available, over the last few years, for creating, managing, annotating, querying such corpora and for their statistical exploration. The workshop aims at bringing together XML experts, both theorists and practitioners, as well as linguists and natural interactivity researchers working on the definition of corpus architectures, annotation and resource exchange schemes and on tools for the use of multilevel and/or multi-layer annotated corpora. It will provide a forum for the definition of requirements for corpus representations and pertaining tools, discussing at the same time case studies from linguistics and natural interactivity research."

From the Announcement

Although XML is a useful representation language, its use alone does not solve all the problems and choices with respect to the representation style (e.g., stand-off annotations vs. embedded annotations); these are in turn closely linked with questions of the architecture of richly annotated corpora, such as the following: should information from different levels of linguistic description be represented in separate "layers" of the annotation? Should a given information type serve as a grounding for all or some of the others? How to account for interdependencies and interaction between phenomena from different levels of description? How to account for concurrent annotation (one phenomenon, different analyses or theories/approaches)? Such questions and the pertaining corpus-architectural considerations interact with at least two more problem areas: on the one hand with the kinds of research questions and of phenomena to be analysed in linguistic and natural interaction research (which may call for certain architectural solutions), and on the other hand with tools for the creation, annotation, manipulation and exploration of XML-based corpora.

The workshop will attempt to address the interplay between the following research areas:

1. Representation Techniques

XML techniques for corpus representation, i.e.,

  • Standoff annotation vs. embedded annotation
  • Use of XML linking standards for language data (XLink, XPointer, XPath); other ways of ensuring relationships between levels, e.g., through naming conventions
  • Concepts of layering in corpora annotated at several levels of linguistic description; types of information grouped together vs. distributed over different "packages"
  • Hierarchical vs. flat annotation
  • The grounding of annotations (e.g., in XML elements vs. in characters?) and its implications
  • Techniques for the manipulation of XML-based representations for massively annotated corpora; usefulness and relevance of XQuery
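To make the first and fifth points above concrete, the following is a minimal sketch, not taken from the announcement, that contrasts embedded annotation with stand-off annotation grounded in character offsets. The element names (`s`, `w`, `layer`, `span`), the attributes (`pos`, `start`, `end`), and the sample sentence are all hypothetical; real corpora define their own schemes. Only Python's standard `xml.etree.ElementTree` module is used.

```python
import xml.etree.ElementTree as ET

# Embedded annotation: the markup is interleaved with the primary text.
# (Hypothetical element and attribute names, for illustration only.)
EMBEDDED = (
    '<s><w pos="DET">The</w> <w pos="NOUN">corpus</w> '
    '<w pos="VERB">grows</w></s>'
)

# Stand-off annotation: the primary text is left untouched, and a separate
# layer points into it. Here the annotations are grounded in character
# offsets (start inclusive, end exclusive), which is an assumption.
PRIMARY_TEXT = "The corpus grows"
STANDOFF = """
<layer type="pos">
  <span start="0" end="3" pos="DET"/>
  <span start="4" end="10" pos="NOUN"/>
  <span start="11" end="16" pos="VERB"/>
</layer>
"""


def read_embedded(xml_text):
    """Return (token, tag) pairs from the embedded representation."""
    root = ET.fromstring(xml_text)
    return [(w.text, w.get("pos")) for w in root.iter("w")]


def read_standoff(text, layer_xml):
    """Return (token, tag) pairs by resolving offsets against the text."""
    layer = ET.fromstring(layer_xml.strip())
    pairs = []
    for span in layer.iter("span"):
        start, end = int(span.get("start")), int(span.get("end"))
        pairs.append((text[start:end], span.get("pos")))
    return pairs


embedded_tokens = read_embedded(EMBEDDED)
standoff_tokens = read_standoff(PRIMARY_TEXT, STANDOFF)
```

Both readers recover the same token and tag pairs. In the stand-off variant, further layers can be added without touching the primary text, at the cost of keeping the offsets valid whenever that text changes.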

2. Description Levels

Levels of linguistic description and their interaction, i.e.,

  • Examples of richly annotated corpora: reasons for the choice of the annotated levels; linguistic and natural interactivity research questions which can (only) be solved with richly annotated data
  • Interaction between levels: new research questions in linguistics and natural interactivity research which can only be addressed because of observation across levels, across modalities, etc. An example is the use of clustering techniques across different levels: e.g., relevant cooccurrences of phenomena from different levels identified via clustering
  • Use and usefulness of concurrent annotations in XML-based corpora; an example is concurrent flat and deep syntactic analysis
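As a rough illustration of the last point, the sketch below stores two concurrent analyses of the same token sequence as separate stand-off layers and then compares them. The layer format (`layer`, `chunk`, `phrase`, with `first` and `last` token indices), the sample sentence, and both analyses are hypothetical and are not drawn from the announcement or from any particular corpus.

```python
import xml.etree.ElementTree as ET

# Shared primary data: one token sequence that both analyses refer to.
TOKENS = ["the", "old", "man", "the", "boats"]

# Two concurrent analyses, each in its own stand-off layer. Spans are
# given as inclusive token indices (a hypothetical convention).
FLAT = """<layer name="flat">
  <chunk type="NP" first="0" last="2"/>
  <chunk type="NP" first="3" last="4"/>
</layer>"""

DEEP = """<layer name="deep">
  <phrase type="NP" first="0" last="1"/>
  <phrase type="VP" first="2" last="4"/>
  <phrase type="NP" first="3" last="4"/>
</layer>"""


def spans(layer_xml):
    """Return the set of (first, last, type) triples found in a layer."""
    root = ET.fromstring(layer_xml)
    return {
        (int(e.get("first")), int(e.get("last")), e.get("type"))
        for e in root
    }


flat_spans = spans(FLAT)
deep_spans = spans(DEEP)

# Spans on which the two analyses agree, and spans proposed only by the
# flat analysis.
shared = sorted(flat_spans & deep_spans)
only_flat = sorted(flat_spans - deep_spans)


def words(span):
    """Resolve a span back to its tokens in the shared primary data."""
    first, last, _ = span
    return TOKENS[first:last + 1]
```

Because both layers point at the same token indices, neither analysis has to be nested inside the other, which is one reason stand-off representations are often considered for concurrent annotation.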

3. Tools

Tools for handling richly annotated corpora: Software solutions for, e.g.,

  • corpus creation, transformation, exchange, and validation
  • interactive annotation
  • exploration: query and retrieval, statistical analysis
  • corpus management (e.g., with respect to metadata)

Tools presented should be positioned with respect to the questions of corpus architecture and with respect to the research directions discussed above under (1) and (2).

About LREC

The Language Resources and Evaluation Conference (LREC) "has become a major event in the field of language engineering, and constitutes a milestone in the life of Human Language Technologies (HLT)..."

"The large range of uses makes the Language Resources infrastructure a strategic part of the e-society, where the creation of a basic set of LRs for all languages must be ensured in order to bring all languages to the same level of usability and availability.

Examples of LRs are written or spoken corpora and lexica, which may be annotated or not, multimodal resources, grammars, terminology or domain specific databases and dictionaries, ontologies, multimedia databases, etc. LRs also cover basic software tools for the acquisition, preparation, collection, management, customisation and use of the above mentioned examples.

The relevance of evaluation for language technologies development is increasingly recognised. This involves assessing the state-of-the-art for a given technology, measuring the progress achieved within a programme, comparing different approaches to a given problem, assessing the availability of technologies for a given application, benchmarking, and assessing system usability and user satisfaction.

The aim of this conference is to provide an overview of the state-of-the-art, discuss problems and opportunities, exchange information regarding LRs, their applications, ongoing and planned activities, industrial uses and needs, requirements coming from the new e-society, both with respect to policy issues and to technological and organisational ones..."

"The Fourth International Conference on Language Resources and Evaluation, LREC 2004, is organised by ELRA in cooperation with other Associations and Consortia, including ACL, AFNLP, ALLC, ALTA, COCOSDA and Oriental COCOSDA, EAFT, EAMT, ELSNET, ENABLER, EURALEX, GKS, GWA, IAMT, ICWLR, ISCA, LDC, ONTOWEB, TEI, and with major national and international organisations, including the Commission of the EU - Information Society DG, Unit E1 'Interfaces and Cognition'..." [adapted from the LREC 2004 conference web site]

About ELRA

The mission of the European Language Resources Association (ELRA) is to "promote language resources for the Human Language Technology (HLT) sector, and to evaluate language engineering technologies, including: (1) Identification of language resources; (2) Promotion of the production of language resources; (3) Production of language resources; (4) Validation of language resources; (5) Evaluation of systems, products, tools, etc., related to language resources; (6) Distribution of language resources; (7) Standardization. The promotion of the production of language resources also includes our support of the infrastructure for evaluation campaigns and our support in developing a scientific field of language resources and evaluation, e.g., via the LREC conference.

Many of these tasks are being carried out by our distribution agency ELDA (Evaluations and Language resources Distribution Agency). ELRA also regularly conducts market studies and surveys in the field of HLT, and publishes a quarterly newsletter, distributed not only to its members but also to a large number of people in the HLT community. In doing so, ELRA participates in the development of HLT and promotes HLT among the players on national, European and international levels..." [adapted from the ELRA front page]

About ELDA

Evaluations and Language resources Distribution Agency (ELDA) is "ELRA's operational body, set up to identify, classify, collect, validate and produce the language resources which may be needed by the HLT -- Human Language Technology -- community. Additionally, ELDA is involved in HLT evaluation campaigns."

"ELDA handles the practical and legal issues related to the distribution of language resources, provides legal advice in the field of HLT, and drafts and concludes distribution agreements on behalf of ELRA..."

Robin Cover, Editor