Fellow ACE members,
This document will explain how to annotate files for the ACE Pilot
EDT evaluation using the Alembic Workbench. The ACE Pilot Format is a
form of XML stand-off annotation that is the target of our joint
annotation efforts, and any way you can generate documents conforming
to this format (DTD) is fine. Which is to say, no one is required to
use the Workbench; it is simply one option for how to generate the
necessary reference files for the Pilot evaluation. The Version 4.23
distribution of the Workbench contains two XML DTD files that can be
used to validate files for their conformance to either the reference
or system output markup requirements.
The current release of the Workbench has been modified (hacked) to
make EDT annotation as easy as possible under the time constraints.
Since we have not had time for extensive modifications, not everything
that an annotator might wish for is available. However, it is
possible to generate valid data in a reasonably efficient manner using
this tool, as we have done on a few files. The Workbench allows
annotations to be "visible" to support the review and correction of
annotations. If you encounter bugs or problems, or are otherwise
mystified, please do not hesitate to call or write one of the people
listed at the end of this message.
I. ACE Pilot Format (APF) and associated DTDs
The ACE Pilot Format (APF) has been designed cooperatively by NIST
(George Doddington, John Garofolo and Jon Fiscus), and MITRE (John
Henderson, Benjamin Wellner and David Day). This format is the result
of a very time constrained effort to produce something appropriate and
useful for the ACE Pilot evaluation as early as possible. Thus, we
anticipate further work on the issue of appropriate data/annotation
encoding standards between now and May, most likely coming from the
ATLAS effort. In the meantime, we want to move forward with the
current approach.
The annotation is realized in the form of XML standoff annotation,
which means that the file as a whole conforms to XML encoding
standards, and the raw data (or "signal") being annotated resides in a
separate file. The annotations "point" to portions of the signal via
indices. Because we anticipate three different kinds of signals
(text, speech, OCR), we have provided three different kinds of indices,
though others can be proposed and adopted. As you can see from the
DTD, these three are charspan (for text), timespan (for speech) and
pixelboundingbox (for images).
The usual operation of Alembic Workbench is to load a plain text or
SGML-annotated file and produce additional SGML/XML annotations
embedded directly in the same file, since this has been the de facto
annotation interchange standard to date. Since the ACE Pilot Format
(APF) for annotation takes the form of a stand-off annotation file, we
have modified the Workbench to include a separate conversion utility
that extracts the information saved in the file being edited and
places it in an appropriately formatted file. This conversion utility is
called sgm2apf. It can be called directly from within the Workbench
application and is also available as a standalone program. We do not
yet have a conversion utility for taking an APF file and the
associated source file and creating a file suitable for the Workbench
to visualize and edit, but we will be writing this over the next week
or so. [One of the uses for this would be to enable system output
produced directly in APF to be viewed with the Workbench
browser/editor.]
All of the indices into the original source document from an APF
file are based on units appropriate to the signal being annotated. In
the case of text file "signals" these indices take the form of
character offsets. There are a number of very important points that
should be understood about these character offsets:
o The character offsets do *not* include SGML/XML/HTML annotations
that are embedded in the source document. The logic of this is
that the character offsets are indices into the raw text "signal"
and not into some amalgam of text and other annotations that may
or may not be present. Practically speaking, this means that
applications must be able to distinguish among SGML annotations
and "raw" data, and only count characters occurring in the raw
data. We assume data are encoded in Latin-1 or the UTF-8 Unicode
standard (which properly includes ASCII). Offset counting is in
terms of characters, not bytes. One property of this approach is
that the annotations present within a source file may change quite
dramatically without affecting in any way any conforming standoff
annotation files. While it is appropriate to take the view that
source files are "immutable," the practical situation is often
quite different. [The default behavior of the Workbench is to
always save its annotations as SGML directly embedded within the
file being annotated, though the underlying text is never
modified. The original file can always be retrieved in principle
by stripping away any and all tags added during a series of
Workbench annotation sessions. In practical terms, it is always
best to save an original copy of the file separate from the file
being annotated, in case problems arise.]
o Character offset counting is "zero-based"---that is, the first
non-annotation character encountered in a file would be indexed by
a character offset of 0 (zero).
o These character offsets are relative to the beginning of the
*file*, not the beginning of the document within the file (indicated
by the initial <DOC> tag). Thus, if a file's contents were to begin
with a space character *before* the first <DOC> tag, then a taggable
string occurring immediately after this <DOC> tag would have index
position 1.
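To make the three points above concrete, here is a minimal Python
sketch of the offset counting. It is only an illustration, not the
code used by sgm2apf or the Workbench. The names TAG, raw_text and
raw_offset are hypothetical, and the sketch assumes that a tag is
anything between "<" and the next ">" and that entity references are
counted character by character as they appear in the file.

```python
import re

# Assumption: a "tag" is anything from '<' to the next '>'. Comments,
# CDATA sections and entity references get no special treatment here.
TAG = re.compile(r"<[^>]*>")

def raw_text(file_contents):
    """Return the raw "signal": the file contents with tags removed."""
    return TAG.sub("", file_contents)

def raw_offset(file_contents, file_index):
    """Map an index into the tagged file to a zero-based raw offset.

    Counting starts at the beginning of the file, and characters that
    belong to tags are not counted.
    """
    tag_chars = 0
    for m in TAG.finditer(file_contents):
        if m.start() >= file_index:
            break
        if file_index < m.end():
            raise ValueError("index falls inside a tag")
        tag_chars += m.end() - m.start()
    return file_index - tag_chars

# A file that begins with a space *before* the first <DOC> tag.
sample = " <DOC>Fyodor <b>Dostoyevsky</b> wrote.</DOC>"
first = raw_offset(sample, sample.index("Fyodor"))   # 1, not 0
```

Because Python strings are sequences of characters, the counting is in
characters rather than bytes, provided the file was decoded with the
right encoding when it was read.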
II. Setting up Alembic Workbench and data for EDT annotation
The directions provided below assume that you have downloaded
Version 4.23 of the Alembic Workbench from MITRE's external web site:
"http://www.mitre.org/technology/nlp". (Please note: you need to have
version 4.23 of the Workbench, which will be available within an hour
or less from now on our web site.)
Before annotating a file for EDT in the Workbench, it is important
that you operate on a file in which two conditions have been met:
(1) The file contains only one document (indicated by a single pair
of <DOC> ... </DOC> tags). This is to ensure that APF annotations do
not refer to mentions across document boundaries. The current version
of the Workbench does not have any special handling for <DOC> tags, so
it cannot check that EDT mentions are being restricted to
within-document annotations.
(2) The <DOC> tag is the first text appearing in the file---there
are no leading spaces, newlines or any other character data. This is
to make sure that the character offset indices are consistent across
annotating sites.
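If you want to check a file against these two conditions before
loading it, something along the following lines may help. This is a
sketch in Python, not part of the Workbench distribution; the names
DOC_OPEN and check_edt_ready are hypothetical, and it assumes the
opening tag is written "<DOC>" or "<DOC ...>" in upper case.

```python
import re

# Assumption: the opening tag is "<DOC>" or "<DOC ...>". The character
# class keeps tags such as <DOCNO> from being counted as documents.
DOC_OPEN = re.compile(r"<DOC[\s>]")

def check_edt_ready(file_contents):
    """Return a list of problems; an empty list means both conditions hold."""
    problems = []
    opens = len(DOC_OPEN.findall(file_contents))
    closes = file_contents.count("</DOC>")
    if opens != 1 or closes != 1:
        problems.append(
            "expected one <DOC>...</DOC> pair, found %d opening and %d "
            "closing tags" % (opens, closes))
    if not DOC_OPEN.match(file_contents):
        problems.append("the <DOC> tag is not the first text in the file")
    return problems
```

For example, check_edt_ready(" <DOC>text</DOC>") reports one problem
(the leading space), while a file that starts directly with a single
<DOC> ... </DOC> pair yields an empty list.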
The Workbench includes a script, called separate-docs, which will
take a file containing one or more documents (identified by pairs of
<DOC> ... </DOC> tags) and produce individual files that satisfy these
two constraints. For more detailed information, refer to the
documentation that the script prints when given a single "-h" argument.
That is, do:
unix> separate-docs -h
to get more information. An example use of this on one of the
distributed TDT files is as follows:
unix> separate-docs -collection <multi-doc-file>
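For those who would like to see what such a split involves, here is a
rough Python sketch of the idea. It is not the separate-docs script
itself, whose actual options are described by its own -h output. The
names DOC and split_docs are hypothetical, and the sketch assumes that
documents are not nested and that tags are written in upper case.

```python
import re

# Assumption: each document runs from "<DOC>" (or "<DOC ...>") to the
# next "</DOC>", and documents are never nested.
DOC = re.compile(r"<DOC[\s>].*?</DOC>", re.DOTALL)

def split_docs(collection):
    """Return one string per document, each beginning with its <DOC> tag."""
    # Text between documents (or before the first one) is dropped, so
    # nothing precedes the <DOC> tag in any of the pieces.
    return [m.group(0) + "\n" for m in DOC.finditer(collection)]

collection = "leading text\n<DOC>\none\n</DOC>\n\n<DOC>\ntwo\n</DOC>\n"
docs = split_docs(collection)
```

Each element of docs could then be written to its own file. Character
offsets are thereafter relative to the start of each new file, not to
the original multi-document file.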
There are four important steps to annotating a file in version 4.23
of the Alembic Workbench for EDT:
(0) Enable the Workbench to generate "APF" standoff annotation files
every time it saves out a file. This is done by selecting the
"Generate ACE Pilot Format" option in the dialog box found by
selecting the "File Saving Options" entry under the "Options"
pulldown menu. Turning this on should cause it to stay on for all
subsequent annotation sessions by this user, so this should only
need to be done once for ACE pilot annotation purposes.
(1) Load in a file containing a single document, making sure that
the proper character encoding has been selected. This is done by
selecting the appropriate "Load File" option under the "File"
pulldown menu. In the dialog box presenting normalization options,
make sure to select "No Normalization" (the default). This value
will remain the default for all subsequent files that are loaded.
(2) After the file has been loaded, load in the "mention-prefs" tag
preferences file. This is done by selecting the "Load Tag
Preferences" option under the "Tag Preferences" option in the
"Options" pulldown menu.
(3) After the mention-prefs have been loaded, load in the
"PilotEntity.rtd" relations definition for tagging "relations."
This is done by selecting the "Options" pulldown menu, then
"Relations", then "Select Relation Type to Add". This need only be done
once for a single file. Once a relation type definition (rtd) has
already been added to a file, all subsequent editing of this file
can be performed by selecting the "Edit Existing Relations" option
under the "Relation" option of the "Options" pulldown menu.
"Relation" is the term used in the Alembic Workbench to refer to any
N-ary relationship among strings, stringfills, setfills, or
instances of relationships ("RELINSTs"). The relations editing
facility is a spreadsheet model for establishing such relationships,
where the columns represent "fields" (or "slots") in the relation,
and rows represent individual relation instances. Multiple values
are accommodated in any single cell, and these multiple values are
"stacked" one above another within the cell, in the order they were
filled. New values can be added to a single field by clicking
button-3 (right mouse button) over the appropriate cell. A
particular value in a cell can be *replaced* by clicking button-2
(middle mouse button) over a particular cell. Either button can be
used to fill/add the first value to an empty cell. More information
is usually available via the various "help" buttons on the
various dialog boxes in the Workbench.
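The spreadsheet model described above can be pictured with a small
Python sketch. This is only a way of thinking about the table, not the
Workbench's internal representation. The class RelationTable, its
method names and the field names used in the example are hypothetical,
and which stacked value gets replaced by default is an assumption.

```python
class RelationTable:
    """Columns are fields, rows are relation instances, cells are stacks."""

    def __init__(self, fields):
        self.fields = list(fields)
        self.rows = []

    def add_row(self):
        """Start a new relation instance with every cell empty."""
        self.rows.append({field: [] for field in self.fields})
        return len(self.rows) - 1

    def add_value(self, row, field, value):
        """Like button-3: stack a further value in the cell."""
        self.rows[row][field].append(value)

    def replace_value(self, row, field, value, position=-1):
        """Like button-2: replace a value, or fill the cell if it is empty.

        Assumption: by default the most recently added value is replaced.
        """
        cell = self.rows[row][field]
        if cell:
            cell[position] = value
        else:
            cell.append(value)

# Field names here are placeholders, not those of PilotEntity.rtd.
table = RelationTable(["field_a", "field_b"])
row = table.add_row()
table.replace_value(row, "field_a", "Fyodor Dostoyevsky")  # fills empty cell
table.add_value(row, "field_a", "the author")              # stacks a second value
```

Values stay in the order they were filled, which mirrors the "stacked"
display within a cell.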
III. Overview of the EDT annotation process in Workbench version 4.23
Given the severe time constraints in producing a tool that would
provide a reasonable annotation/viewing environment for EDT, we have
chosen to modify the existing capabilities of the Workbench as much as
possible. This has meant that the annotation process has been broken
down into two distinct types of operations: (1) Identifying heads for
mentions that contain heads whose extent is different from the full
mention, and (2) Placing mentions in the appropriate column of the
relation table. Two examples should suffice:
... the three brothers ...
and
... Fyodor Dostoyevsky ...
In the former case the "full extent of the mention" for purposes of
scoring some of the tasks in EDT consists of all three words, while
the "head" of the phrase (for use by other tasks in the EDT pilot
evaluation) is limited to the final word alone. In the second case,
the guidelines currently indicate that the head of the phrase is
identical to the full extent of the phrase, which would contain both
names.
In the case with differing head/mention extents, the process is as
follows:
(a) Change your tag preferences to be
$AWB/tag-preferences/mention-prefs ["Options" pulldown menu; "Tag
Preferences" cascading option; "Load Tag Preferences" option.
You only need to do this once for any individual session with the
Workbench.]
(b) Select the relation $AWB/relations/PilotEntity.rtd ["Options"
pulldown menu; "Relations" cascading option; "Select Relation
Type to Add" option. You only need to do this once for each individual
file that you are annotating with the Workbench. Thereafter the
relation type definition continues to be stored within the SGML
source file itself.]
(c) For each mention whose head differs in extent from the maximal
mention phrase:
i) create a Mention phrase around the maximal phrase extent
ii) this will cause the Mention tag to blink, requesting the
user to select the head of the phrase. The user should
select (using standard Workbench mouse bindings) the single-
or multi-word phrase constituting the head of the phrase.
This will create two annotations, one a MENTION, a second a
MENTIONHEAD, where the MENTION tag contains attributes
"pointing" to the MENTIONHEAD tag.
(d) While holding the mouse over some portion of the MENTION phrase
that is *not* also within the extent of the MENTIONHEAD tag, use
the <shift><button-1> combination to create a new selection with
the exact same extent as the MENTION phrase. (Optionally, one
can do this with multiple key clicks using the standard mouse
bindings, but the suggested approach is faster and less prone to
errors.)
(e) Now place the mouse over the appropriate cell in the PilotEntity
table (relation) and click the middle button (or, if you are adding
an additional value to the cell, click the right button).
This should fill the slot with both the MENTION phrase string
*and*, in square brackets, the MENTIONHEAD phrase.
In the case with identical head/mention extents, the process is as
follows:
(a) In the event a mention's extent is co-extensive with its head
(e.g., "Tom Smith"), one should skip steps (c) and (d). Instead,
swipe/select the extent of the phrase, and then proceed
immediately to step (e) above (that is, filling the appropriate
relation cell). This will fill the cell with the simple phrase
*without* the additional square-bracketed head phrase. The
interpretation of this is that the head and the maximum extent
of the phrase are identical.
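The two cases differ only in what ends up in the relation cell. A
small Python sketch of that rule follows. The class name Mention, its
fields and the exact spacing around the square brackets are
assumptions for illustration; the Workbench's own display may differ
in detail.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Mention:
    extent: str                  # the maximal mention phrase
    head: Optional[str] = None   # None means the head equals the extent

    def cell_text(self):
        """Text shown in the relation cell for this mention."""
        if self.head is None or self.head == self.extent:
            # Identical head and extent: the simple phrase alone.
            return self.extent
        # Differing head and extent: phrase plus bracketed head.
        return "%s [%s]" % (self.extent, self.head)

with_head = Mention("the three brothers", head="brothers").cell_text()
without_head = Mention("Fyodor Dostoyevsky").cell_text()
```

Here with_head is "the three brothers [brothers]" and without_head is
simply "Fyodor Dostoyevsky".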
IV. Other pointers and observations
It is possible to get the Alembic Workbench confused. We strongly
suggest that you save out your results frequently. The Workbench
allows "backup" versions of the annotated file to be kept, and the
number of these files is controlled by a field in the "File Saving
Options" settings available under the "Options" pulldown menu. By
saving often, you can retrieve previous versions of the annotated file
and continue to work from there. The backup versions of the file are
of the form <filename>.<num>, where the <num> indicates the version of
the file saved. The zero-th version (<filename>.0) is always the
original file prior to any Workbench annotations.
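When several backups have accumulated, a few lines of Python can list
them in version order. This is a sketch under the naming convention
described above; the function name backup_versions is hypothetical,
and it works on a plain list of file names so that it makes no
assumptions about where the Workbench keeps the backups.

```python
import re

def backup_versions(filename, directory_listing):
    """Return (version number, backup name) pairs, lowest number first."""
    # Only names of the exact form <filename>.<num> count as backups.
    pattern = re.compile(re.escape(filename) + r"\.(\d+)")
    found = []
    for name in directory_listing:
        m = pattern.fullmatch(name)
        if m:
            found.append((int(m.group(1)), name))
    return sorted(found)

listing = ["doc.sgm", "doc.sgm.0", "doc.sgm.2", "doc.sgm.10",
           "doc.sgm.apf.xml"]
versions = backup_versions("doc.sgm", listing)
```

In practice the listing could come from os.listdir() on the directory
holding the annotated file. Sorting numerically keeps version 10 after
version 2, which a plain alphabetical listing would not.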
V. Example files
Attached to this message you will find five files.
The first two are the DTDs that define appropriate ACE Pilot Format
XML standoff annotation files for reference and system output data,
respectively. These DTDs can also be found in the directory
$AWB/dtds/ in any version 4.23 distribution of the workbench (or
later). These files contain some amount of commentary about their
structure and motivation.
The remaining three files are different incarnations of a file annotated
for EDT using the Workbench. The first is the original file, with
only pre-existing SGML. The second file is this same file after it
has been annotated by the Workbench. This is the Workbench-native
form of the annotations, utilizing embedded SGML declarations for
relation types, relation instances, and within-text annotation tags.
The third file in this group is the APF version of this EDT
annotation, produced by running the sgm2apf conversion utility.
VI. Points of Contact at MITRE
Name            E-mail              Phone           "Areas of Expertise"
David Day       day@mitre.org       (781) 271-2854  [C,T,U,X]
Lisa Ferro      lferro@mitre.org    (781) 271-5875  [A,G,U]
John Henderson  jhndrsn@mitre.org   (781) 271-2849  [C,U,X]
Alan Goldschen  alang@mitre.org     (703) 883-6005  [G,U,X]
Ben Wellner     wellner@mitre.org   (781) 271-7191  [C,X]
John Aberdeen   aberdeen@mitre.org  (781) 271-2840  [T]
Abbreviations for "Areas of Expertise"
A Annotation effort (file assignments, distribution, etc.)
G Annotation Guidelines
C Conversion program (AWB SGML --> ACE Pilot Format)
T Alembic Workbench and related tools (technical details)
U Use of Alembic Workbench to perform ACE annotation
X XML of ACE Pilot Format
sample_awb_ed_annotation.orig.sgm
sample_awb_ed_annotation.sgm.apf.xml.16