Distribution of Alembic Workbench and ACE Pilot Format DTDs

Subject: Distribution of Alembic Workbench and ACE Pilot Format DTDs
From: "David S. Day" <day@mitre.org>
Date: Fri, 07 Jan 2000 12:34:07 -0500
CC: ACE XDIT <ace-xdit@linus.mitre.org>
Sender: day@linus.mitre.org

Fellow ACE members,

  This document will explain how to annotate files for the ACE Pilot
EDT evaluation using the Alembic Workbench.  The ACE Pilot Format is a
form of XML stand-off annotation that is the target of our joint
annotation efforts, and any way you can generate documents conforming
to this format (DTD) is fine.  Which is to say, no one is required to
use the Workbench; it is simply one option for how to generate the
necessary reference files for the Pilot evaluation.  The Version 4.23
distribution of the Workbench contains two XML DTD files that can be
used to validate files for their conformance to either the reference
or system output markup requirements.

  The current release of the Workbench has been modified (hacked) to
make EDT annotation as easy as possible under the time constraints.
Since we have not had time for extensive modifications, not everything
that an annotator might wish for is available.  However, it is
possible to generate valid data in a reasonably efficient manner using
this tool, as we have done on a few files.  The Workbench allows
annotations to be "visible" to support the review and correction of
annotations.  If you encounter bugs or problems, or are otherwise
mystified, please do not hesitate to call or write one of the people
listed at the end of this message.


I.  ACE Pilot Format (APF) and associated DTDs

  The ACE Pilot Format (APF) has been designed cooperatively by NIST
(George Doddington, John Garofolo and Jon Fiscus), and MITRE (John
Henderson, Benjamin Wellner and David Day).  This format is the result
of a very time constrained effort to produce something appropriate and
useful for the ACE Pilot evaluation as early as possible.  Thus, we
anticipate further work on the issue of appropriate data/annotation
encoding standards between now and May, most likely coming from the
ATLAS effort.  In the meantime, we want to move forward with the
current approach.

  The annotation is realized in the form of XML standoff annotation,
which means that the file as a whole conforms to XML encoding
standards, and the raw data (or "signal") being annotated resides in a
separate file.  The annotations "point" to portions of the signal via
indices.  Because we anticipate three different kinds of signals
(text, speech, OCR), we have provided three different kinds of indices,
though others can be proposed and adopted.  As you can see from the
DTD, these three are charspan (for text), timespan (for speech) and
pixelboundingbox (for images).
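
  As an illustration of the stand-off idea, the three index kinds
could be modelled roughly as follows.  This is only a sketch in
Python: the class names mirror charspan, timespan and pixelboundingbox
from the DTD, but every field name below (start, end, left, top, ...)
is a hypothetical stand-in, and the DTDs remain the authority on the
actual element and attribute names.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class CharSpan:
    """Index into a text signal (hypothetical field names)."""
    start: int
    end: int


@dataclass(frozen=True)
class TimeSpan:
    """Index into a speech signal (hypothetical field names)."""
    start: float
    end: float


@dataclass(frozen=True)
class PixelBoundingBox:
    """Index into an image signal (hypothetical field names)."""
    left: int
    top: int
    right: int
    bottom: int


Index = Union[CharSpan, TimeSpan, PixelBoundingBox]


@dataclass(frozen=True)
class Annotation:
    """A stand-off annotation: a label plus a pointer into a separate signal file."""
    label: str
    signal_file: str
    index: Index


def signal_kind(annotation: Annotation) -> str:
    """Report which kind of signal an annotation points into."""
    if isinstance(annotation.index, CharSpan):
        return "text"
    if isinstance(annotation.index, TimeSpan):
        return "speech"
    return "image"
```

  The point of the sketch is only that the annotation never carries
the signal itself, just a file reference and an index into it.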

  The usual operation of Alembic Workbench is to load a plain text or
SGML-annotated file and produce additional SGML/XML annotations
embedded directly in the same file, since this has been the de facto
annotation interchange standard to date.  Since the ACE Pilot Format
(APF) for annotation takes the form of a stand-off annotation file, we
have modified the Workbench to include a separate conversion utility
that extracts the information saved in the file being edited and
places it in an appropriately formatted file.  This conversion utility
is called sgm2apf.  It can be called directly from within the Workbench
application and is also available as a standalone program.  We do not
yet have a conversion utility for taking an APF file and the
associated source file and creating a file suitable for the Workbench
to visualize and edit, but we will be writing this over the next week
or so. [One of the uses for this would be to enable system output
produced directly in APF to be viewed with the Workbench
browser/editor.]

  All of the indices into the original source document from an APF
file are based on units appropriate to the signal being annotated.  In
the case of text file "signals" these indices take the form of
character offsets.  There are a number of very important points that
should be understood about these character offsets:

  o The character offsets do *not* include SGML/XML/HTML annotations
    that are embedded in the source document.  The logic of this is
    that the character offsets are indices into the raw text "signal"
    and not into some amalgam of text and other annotations that may
    or may not be present.  Practically speaking, this means that
    applications must be able to distinguish among SGML annotations
    and "raw" data, and only count characters occurring in the raw
    data.  We assume data are encoded in Latin-1 or the UTF-8 Unicode
    standard (which properly includes ASCII).  Offset counting is in
    terms of characters, not bytes.  One property of this approach is
    that the annotations present within a source file may change quite
    dramatically without affecting in any way any conforming standoff
    annotation files.  While it is appropriate to take the view that
    source files are "immutable," the practical situation is often
    quite different.  [The default behavior of the Workbench is to
    always save its annotations as SGML directly embedded within the
    file being annotated, though the underlying text is never
    modified.  The original file can always be retrieved in principle
    by stripping away any and all tags added during a series of
    Workbench annotation sessions.  In practical terms, it is always
    best to save an original copy of the file separate from the file
    being annotated, in case problems arise.]

  o Character offset counting is "zero-based"---that is, the first
    non-annotation character encountered in a file would be indexed by
    a character offset of 0 (zero).

  o These character offsets are relative to the beginning of the
    *file*, not the beginning of the document within the file (indicated
    by the initial <DOC> tag).  Thus, if a file's contents were to begin
    with a space character *before* the first <DOC> tag, then a taggable
    string occurring immediately after this <DOC> tag would have index
    position 1.
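
  The three points above can be sketched in a few lines of Python.
This is a minimal illustration, not the code used by sgm2apf or the
Workbench.  It assumes that anything between "<" and ">" is markup,
that newlines and spaces count as ordinary characters, and that SGML
entities (such as "&amp;") are counted literally; the function names
are invented for this example.

```python
import re

# Assumption: every "<...>" run is an SGML/XML/HTML tag and is not counted.
TAG_RE = re.compile(r"<[^>]*>")


def strip_tags(file_text: str) -> str:
    """Return the raw text "signal": the file contents with all tags removed."""
    return TAG_RE.sub("", file_text)


def raw_offset(file_text: str, phrase: str) -> int:
    """Zero-based character offset of the first occurrence of phrase,
    counted from the beginning of the file and ignoring tags."""
    return strip_tags(file_text).index(phrase)


# A leading space before <DOC> shifts every offset by one.
print(raw_offset(" <DOC>Fyodor Dostoyevsky</DOC>", "Fyodor"))  # 1
print(raw_offset("<DOC>Fyodor Dostoyevsky</DOC>", "Fyodor"))   # 0
```

  Because Python strings are indexed by character rather than by byte,
the offsets come out in characters provided the file was decoded with
the correct encoding (Latin-1 or UTF-8) when it was read.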


II. Setting up Alembic Workbench and data for EDT annotation

  The directions provided below assume that you have downloaded
Version 4.23 of the Alembic Workbench from MITRE's external web site:
"http://www.mitre.org/technology/nlp".  (Please note: you need to have
version 4.23 of the Workbench, which will be available within an hour
or less from now on our web site.)

  Before annotating a file for EDT in the Workbench, it is important
that you operate on a file in which two conditions have been met:

  (1) The file contains only one document (indicated by a single pair
of <DOC> ... </DOC> tags).  This is to ensure that APF annotations do
not refer to mentions across document boundaries.  The current version
of the Workbench does not have any special handling for <DOC> tags, so
it cannot check that EDT mentions are being restricted to
within-document annotations.

  (2) The <DOC> tag is the first text appearing in the file---there
are no leading spaces, newlines or any other character data.  This is
to make sure that the character offset indices are consistent across
annotating sites.
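
  Both conditions are easy to check mechanically before loading a
file.  The following Python sketch is one way to do so; it is not part
of the Workbench distribution, the function name is invented, and it
assumes that the tags are written in upper case exactly as <DOC> and
</DOC>.

```python
import re

# "<DOC" followed by whitespace or ">" so that <DOCNO> is not mistaken for <DOC>.
DOC_OPEN = re.compile(r"<DOC[\s>]")
DOC_CLOSE = re.compile(r"</DOC\s*>")


def check_single_doc(file_text: str) -> list:
    """Return a list of problems; an empty list means both conditions hold."""
    problems = []
    opens = len(DOC_OPEN.findall(file_text))
    closes = len(DOC_CLOSE.findall(file_text))
    if opens != 1 or closes != 1:
        problems.append(
            "expected exactly one <DOC> ... </DOC> pair, found "
            f"{opens} opening and {closes} closing tags"
        )
    if not DOC_OPEN.match(file_text):
        problems.append("the <DOC> tag is not the first text in the file")
    return problems
```

  A file for which the function returns an empty list satisfies
conditions (1) and (2) under the assumptions stated above.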

  The Workbench includes a script, called separate-docs, which will
take a file containing one or more documents (identified by pairs of
<DOC> ... </DOC> tags) and produce individual files that satisfy these
two constraints.  Refer to the documentation of that script produced
when a single "-h" argument is given for more detailed information.
That is, do:

    unix> separate-docs -h

to get more information.  An example use of this on one of the
distributed TDT files is as follows:

    unix> separate-docs -collection <multi-doc-file>
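
  For sites that cannot run the script, the same idea can be sketched
in Python.  This is an illustrative stand-in, not the implementation
of separate-docs, and it makes no claim about that script's options or
output file names.  It assumes upper-case <DOC> and </DOC> tags and
that documents are not nested.

```python
import re

# One document: from "<DOC" (followed by whitespace or ">") to the next "</DOC>".
DOC_RE = re.compile(r"<DOC[\s>].*?</DOC\s*>", re.DOTALL)


def split_docs(collection_text: str) -> list:
    """Return one string per document, each beginning with its <DOC> tag.

    Anything between documents (blank lines, stray text) is dropped, so
    each result starts with <DOC> and contains exactly one document.
    """
    return [match.group(0) + "\n" for match in DOC_RE.finditer(collection_text)]


docs = split_docs("\n<DOC>first</DOC>\n<DOC>second</DOC>\n")
print(len(docs))  # 2
```

  Each returned string can then be written to its own file; character
offsets in the resulting APF files are relative to those individual
files, not to the original collection.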


  There are four important steps to annotating a file in version 4.23
of the Alembic Workbench for EDT:

  (0) Enable the Workbench to generate "APF" standoff annotation files
  every time it saves out a file.  This is done by selecting the
  "Generate ACE Pilot Format" option in the dialog box found by
  selecting the "File Saving Options"  entry under the "Options"
  pulldown menu.  Turning this on should cause it to stay on for all
  subsequent annotation sessions by this user, so this should only
  need to be done once for ACE pilot annotation purposes.

  (1) Load in a file containing a single document, making sure that
  the proper character encoding has been selected.  This is done by
  selecting the appropriate "Load File" option under the "File"
  pulldown menu.  In the dialog box presenting normalization options,
  make sure to select "No Normalization" (the default).  This value
  will remain the default for all subsequent files that are loaded.

  (2) After the file has been loaded, load in the "mention-prefs" tag
  preferences file.  This is done by selecting the "Load Tag
  Preferences" option under the "Tag Preferences" option in the
  "Options" pulldown menu.

  (3) After the mention-prefs have been loaded, load in the
  "PilotEntity.rtd" relations definition for tagging "relations."
  This is done by selecting the "Options" pulldown menu, then
  "Relations", then "Select Relation Type to Add".  This need only be done
  once for a single file.  Once a relation type definition (rtd) has
  already been added to a file, all subsequent editing of this file
  can be performed by selecting the "Edit Existing Relations" option
  under the "Relation" option of the "Options" pulldown menu.

  "Relation" is the term used in the Alembic Workbench to refer to any
  N-ary relationship among strings, stringfills, setfills, or
  instances of relationships ("RELINSTs").  The relations editing
  facility is a spreadsheet model for establishing such relationships,
  where the columns represent "fields" (or "slots") in the relation,
  and rows represent individual relation instances.  Multiple values
  are accommodated in any single cell, and these multiple values are
  "stacked" one above another within the cell, in the order they were
  filled.  New values can be added to a single field by clicking
  button-3 (right mouse button) over the appropriate cell.  A
  particular value in a cell can be *replaced* by clicking button-2
  (middle mouse button) over a particular cell.  Either button can be
  used to fill/add the first value to an empty cell.  More information
  is usually available via the "help" buttons on the various dialog
  boxes in the Workbench.
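
  The spreadsheet model just described can be pictured as a small data
  structure.  The Python sketch below is purely illustrative and says
  nothing about how the Workbench stores relations internally; the
  class, its method names, and any field names passed to it are
  hypothetical.

```python
class RelationTable:
    """Columns are fields, rows are relation instances, and each cell
    holds a stack of values in the order they were filled."""

    def __init__(self, fields):
        self.fields = list(fields)
        self.rows = []

    def add_instance(self) -> int:
        """Append an empty row (a new relation instance); return its index."""
        self.rows.append({field: [] for field in self.fields})
        return len(self.rows) - 1

    def add_value(self, row: int, field: str, value: str) -> None:
        """Like button-3: stack a further value onto the cell."""
        self.rows[row][field].append(value)

    def replace_value(self, row: int, field: str, position: int, value: str) -> None:
        """Like button-2: replace one value, or fill the cell if it is empty."""
        cell = self.rows[row][field]
        if not cell:
            cell.append(value)
        else:
            cell[position] = value
```

  Either method fills an empty cell, which mirrors the remark that
  either button can be used for the first value.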


III. Overview of the EDT annotation process in Workbench version 4.23

  Given the severe time constraints in producing a tool that would
provide a reasonable annotation/viewing environment for EDT, we have
chosen to modify the existing capabilities of the Workbench as much as
possible.  This has meant that the annotation process has been broken
down into two distinct types of operations: (1) Identifying heads for
mentions that contain heads whose extent is different from the full
mention, and (2) Placing mentions in the appropriate column of the
relation table.  A pair of examples should suffice:

      ... the three brothers ...

and

      ... Fyodor Dostoyevsky ...

In the former case the "full extent of the mention" for purposes of
scoring some of the tasks in EDT consists of all three words, while
the "head" of the phrase (for use by other tasks in the EDT pilot
evaluation) is limited to the final word alone.  In the second case,
the guidelines currently indicate that the head of the phrase is
identical to the full extent of the phrase, which would contain both
names.

In the case with differing head/mention extents, the process is as
follows:

  (a) Change your tag preferences to be
      $AWB/tag-preferences/mention-prefs ["Options" pulldown menu; "Tag
      Preferences" cascading option; "Load Tag Preferences" option.
      You only need to do this once for any individual session with the
      Workbench.]

  (b) Select the relation $AWB/relations/PilotEntity.rtd ["Options"
      pulldown menu; "Relations" cascading option; "Select Relation
      Type to Add" option.  You only need to do this once for each
      individual file that you are annotating with the Workbench.
      Thereafter the relation type definition continues to be stored
      within the SGML source file itself.]

  (c) For each mention containing a head whose extent differs from
      that of the maximal mention phrase:

      i) create a Mention phrase around the maximal phrase extent
     ii) this will cause the Mention tag to blink, requesting the
         user to select the head of the phrase.  The user should
         select (using standard Workbench mouse bindings) the
         single-word or multi-word phrase constituting the head of
         the phrase

      This will create two annotations, one a MENTION, a second a
      MENTIONHEAD, where the MENTION tag contains attributes
      "pointing" to the MENTIONHEAD tag.

  (d) While holding the mouse over some portion of the MENTION phrase
      that is *not* also within the extent of the MENTIONHEAD tag, use
      the <shift><button-1> combination to create a new selection with
      the exact same extent as the MENTION phrase.  (Alternatively, one
      can do this with multiple key clicks using the standard mouse
      bindings, but the suggested approach is faster and less prone to
      errors.)

  (e) Now place the mouse over the appropriate cell in the PilotEntity
      table (relation) and click the middle button (or, if you are
      adding an additional value to the cell, click the right button).
      This should fill the slot with both the MENTION phrase string
      *and*, in square brackets, the MENTIONHEAD phrase.

In the case with identical head/mention extents, the process is as
follows:

  (a) In the event a mention's extent is co-extensive with its head
      (e.g., "Tom Smith"), one should skip steps (c) and (d).  Instead,
      swipe/select the extent of the phrase, and then proceed
      immediately to step (e) above (that is, filling the appropriate
      relation cell).  This will fill the cell with the simple phrase
      *without* the additional square-bracketed head phrase.  The
      interpretation of this is that the head and the maximum extent
      of the phrase are identical.
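
  The two cell forms can be told apart mechanically.  The Python
sketch below assumes that a filled cell reads either as "phrase" or as
"phrase [head]", which is an assumption about the displayed layout
based on the description above and not a documented Workbench format.
The function name is invented for this example.

```python
import re

# Assumed layout: the mention string, then the head in square brackets at the end.
CELL_RE = re.compile(r"^(?P<mention>.*?)\s*\[(?P<head>[^\[\]]*)\]\s*$")


def split_cell_value(value: str) -> tuple:
    """Return (mention, head) for one cell value.

    When there is no bracketed head, the head is taken to be identical
    to the full mention extent.
    """
    match = CELL_RE.match(value)
    if match:
        return match.group("mention"), match.group("head")
    mention = value.strip()
    return mention, mention


print(split_cell_value("the three brothers [brothers]"))
# ('the three brothers', 'brothers')
print(split_cell_value("Fyodor Dostoyevsky"))
# ('Fyodor Dostoyevsky', 'Fyodor Dostoyevsky')
```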


IV. Other pointers and observations

  It is possible to get the Alembic Workbench confused.  We strongly
suggest that you save out your results frequently.  The Workbench
allows "backup" versions of the annotated file to be kept, and the
number of these files is controlled by a field in the "File Saving
Options" settings available under the "Options" pulldown menu.  By
saving often, you can retrieve previous versions of the annotated file
and continue to work from there.  The backup versions of the file are
of the form <filename>.<num>, where the <num> indicates the version of
the file saved.  The zero-th version (<filename>.0) is always the
original file prior to any Workbench annotations.
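
  To see which backup versions exist for a given file, something like
the following Python sketch may help.  It relies only on the
<filename>.<num> naming described above and assumes that the backups
sit in the same directory as the annotated file, which is an
assumption rather than documented behavior.  The function name is
invented for this example.

```python
import re
from pathlib import Path


def backup_versions(annotated_file: str) -> list:
    """Return (num, path) pairs for files named <filename>.<num>,
    sorted by version number."""
    path = Path(annotated_file)
    # Match the file name followed by a dot and digits only.
    pattern = re.compile(re.escape(path.name) + r"\.(\d+)")
    versions = []
    for candidate in path.parent.iterdir():
        match = pattern.fullmatch(candidate.name)
        if match:
            versions.append((int(match.group(1)), candidate))
    return sorted(versions)
```

  Version 0 in the returned list, when present, is the original file
prior to any Workbench annotations.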

V.  Example files

  Attached to this message you will find five files.

  The first two are the DTDs that define appropriate ACE Pilot Format
XML standoff annotation files for reference and system output data,
respectively.  These DTDs can also be found in the directory
$AWB/dtds/ in any version 4.23 distribution of the Workbench (or
later).  These files contain some amount of commentary about their
structure and motivation.

  The other three files are different incarnations of a file annotated
for EDT using the Workbench.  The first is the original file, with
only pre-existing SGML.  The second file is this same file after it
has been annotated by the Workbench.  This is the Workbench-native
form of the annotations, utilizing embedded SGML declarations for
relation types, relation instances, and within-text annotation tags.
The third file in this group is the APF version of this EDT
annotation, produced by running the sgm2apf conversion utility.


VI. Points of Contact at MITRE
                                                    "Areas of
                                                    Expertise"
David Day            day@mitre.org  (781) 271-2854  [C,T,U,X]
Lisa Ferro        lferro@mitre.org  (781) 271-5875  [A,G,U]
John Henderson   jhndrsn@mitre.org  (781) 271-2849  [C,U,X]
Alan Goldschen     alang@mitre.org  (703) 883-6005  [G,U,X]
Ben Wellner      wellner@mitre.org  (781) 271-7191  [C,X]
John Aberdeen   aberdeen@mitre.org  (781) 271-2840  [T]

Abbreviations for "Areas of Expertise"

A  Annotation effort (file assignments, distribution, etc.)
G  Annotation Guidelines
C  Conversion program (AWB SGML --> ACE Pilot Format)
T  Alembic Workbench and related tools (technical details)
U  Use of Alembic Workbench to perform ACE annotation
X  XML of ACE Pilot Format

Attachments:
ace-pilot-ref.dtd
ace-pilot-sys.dtd
sample_awb_ed_annotation.orig.sgm
sample_awb_ed_annotation.sgm
sample_awb_ed_annotation.sgm.apf.xml.16