Fellow ACE members,

This document explains how to annotate files for the ACE Pilot EDT evaluation using the Alembic Workbench. The ACE Pilot Format is a form of XML stand-off annotation that is the target of our joint annotation efforts, and any way you can generate documents conforming to this format (DTD) is fine. That is, no one is required to use the Workbench; it is simply one option for generating the necessary reference files for the Pilot evaluation. The Version 4.23 distribution of the Workbench contains two XML DTD files that can be used to validate files for conformance to either the reference or the system output markup requirements.

The current release of the Workbench has been modified (hacked) to make EDT annotation as easy as possible under the time constraints. Since we have not had time for extensive modifications, not everything that an annotator might wish for is available. However, it is possible to generate valid data in a reasonably efficient manner using this tool, as we have done on a few files. The Workbench makes annotations "visible" to support their review and correction.

If you encounter bugs or problems, or are otherwise mystified, please do not hesitate to call or write one of the people listed at the end of this message.

I. ACE Pilot Format (APF) and associated DTDs

The ACE Pilot Format (APF) has been designed cooperatively by NIST (George Doddington, John Garofolo and Jon Fiscus) and MITRE (John Henderson, Benjamin Wellner and David Day). This format is the result of a very time-constrained effort to produce something appropriate and useful for the ACE Pilot evaluation as early as possible. Thus, we anticipate further work on the issue of appropriate data/annotation encoding standards between now and May, most likely coming from the ATLAS effort. In the meantime, we want to move forward with the current approach.

The annotation is realized in the form of XML standoff annotation, which means that the file as a whole conforms to XML encoding standards, and the raw data (or "signal") being annotated resides in a separate file. The annotations "point" to portions of the signal via indices. Because we anticipate three different kinds of signals (text, speech, OCR), we have provided three different kinds of indices, though others can be proposed and adopted. As you can see from the DTD, these three are charspan (for text), timespan (for speech) and pixelboundingbox (for images).

The usual operation of Alembic Workbench is to load a plain text or SGML-annotated file and produce additional SGML/XML annotations embedded directly in the same file, since this has been the de facto annotation interchange standard to date. Since the ACE Pilot Format (APF) for annotation takes the form of a stand-off annotation file, we have modified the Workbench to include a separate conversion utility that extracts the information saved in the file being edited and places it in an appropriately formatted file. This conversion utility is called sgm2apf. It can be called directly from within the Workbench application and is also available as a standalone program.

We do not yet have a conversion utility for taking an APF file and the associated source file and creating a file suitable for the Workbench to visualize and edit, but we will be writing this over the next week or so. [One of the uses for this would be to enable system output produced directly in APF to be viewed with the Workbench browser/editor.]

All of the indices into the original source document from an APF file are based on units appropriate to the signal being annotated. In the case of text file "signals" these indices take the form of character offsets.

There are a number of very important points that should be understood about these character offsets:

o The character offsets do *not* include SGML/XML/HTML annotations that are embedded in the source document. The logic of this is that the character offsets are indices into the raw text "signal" and not into some amalgam of text and other annotations that may or may not be present. Practically speaking, this means that applications must be able to distinguish between SGML annotations and "raw" data, and only count characters occurring in the raw data. We assume data are encoded in Latin-1 or the UTF-8 Unicode standard (which properly includes ASCII). Offset counting is in terms of characters, not bytes.

  One property of this approach is that the annotations present within a source file may change quite dramatically without affecting in any way any conforming standoff annotation files. While it is appropriate to take the view that source files are "immutable," the practical situation is often quite different. [The default behavior of the Workbench is to always save its annotations as SGML directly embedded within the file being annotated, though the underlying text is never modified. The original file can always be retrieved in principle by stripping away any and all tags added during a series of Workbench annotation sessions. In practical terms, it is always best to save an original copy of the file separate from the file being annotated, in case problems arise.]

o Character offset counting is "zero-based": the first non-annotation character encountered in a file is indexed by a character offset of 0 (zero).

o These character offsets are relative to the beginning of the *file*, not the beginning of the document within the file (indicated by the initial <DOC> tag). Thus, if a file's contents were to begin with a space character *before* the first <DOC> tag, then a taggable string occurring immediately after this <DOC> tag would have index position 1.
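
The offset rules above can be illustrated with a short Python sketch. This is not the sgm2apf implementation; it is only a minimal illustration under stated assumptions. The function names (signal_text, mention_charspan) are hypothetical, markup is assumed to be anything between "<" and ">", character entities are not handled, and the end offset is returned as exclusive, which is a choice made here and not something the APF DTD is known to specify. The file should be read with the correct encoding (Latin-1 or UTF-8) so that counting is done in characters rather than bytes.

```python
import re

# Assumption for this sketch: any run of the form "<...>" is markup.
TAG_PATTERN = re.compile(r"<[^>]*>")


def signal_text(file_text):
    """Return the raw text "signal": the file contents with all tags removed.

    Offsets into this string are offsets from the beginning of the file,
    counting only non-annotation characters, starting at zero.
    """
    return TAG_PATTERN.sub("", file_text)


def mention_charspan(file_text, phrase):
    """Return (start, end) offsets of the first occurrence of phrase.

    start is zero-based; end is exclusive (an assumption of this sketch).
    Returns None if the phrase does not occur in the raw signal.
    """
    start = signal_text(file_text).find(phrase)
    if start < 0:
        return None
    return (start, start + len(phrase))


if __name__ == "__main__":
    # A leading space before <DOC> counts as signal, so the offset shifts by one.
    print(mention_charspan(" <DOC>Tom Smith</DOC>", "Tom Smith"))
    # With <DOC> as the very first text in the file, the string starts at zero.
    print(mention_charspan("<DOC>Tom Smith</DOC>", "Tom Smith"))
```

Because tags are removed before counting, adding or removing embedded SGML in the source file leaves these offsets unchanged, which is the property described in the first point above.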

II. Setting up Alembic Workbench and data for EDT annotation

The directions provided below assume that you have downloaded Version 4.23 of the Alembic Workbench from MITRE's external web site: "http://www.mitre.org/technology/nlp". (Please note: you need to have version 4.23 of the Workbench, which will be available within an hour or less from now on our web site.)

Before annotating a file for EDT in the Workbench, it is important that the file meets two conditions:

(1) The file contains only one document (indicated by a single pair of <DOC> ... </DOC> tags). This is to ensure that APF annotations do not refer to mentions across document boundaries. The current version of the Workbench does not have any special handling for <DOC> tags, so it cannot check that EDT mentions are being restricted to within-document annotations.

(2) The <DOC> tag is the first text appearing in the file: there are no leading spaces, newlines or any other character data. This is to make sure that the character offset indices are consistent across annotating sites.

The Workbench includes a script, called separate-docs, which will take a file containing one or more documents (identified by pairs of <DOC> ... </DOC> tags) and produce individual files that satisfy these two constraints. Running the script with a single "-h" argument prints more detailed documentation:

    unix> separate-docs -h

An example use of this on one of the distributed TDT files is as follows:

    unix> separate-docs -collection <multi-doc-file>

There are four important steps to annotating a file in version 4.23 of the Alembic Workbench for EDT:

(0) Enable the Workbench to generate "APF" standoff annotation files every time it saves out a file. This is done by selecting the "Generate ACE Pilot Format" option in the dialog box found by selecting the "File Saving Options" entry under the "Options" pulldown menu. Turning this on should cause it to stay on for all subsequent annotation sessions by this user, so this should only need to be done once for ACE pilot annotation purposes.

(1) Load in a file containing a single document, making sure that the proper character encoding has been selected. This is done by selecting the appropriate "Load File" option under the "File" pulldown menu. In the dialog box presenting normalization options, make sure to select "No Normalization" (the default). This value will remain the default for all subsequent files that are loaded.

(2) After the file has been loaded, load in the "mention-prefs" tag preferences file. This is done by selecting the "Load Tag Preferences" option under the "Tag Preferences" option in the "Options" pulldown menu.

(3) After the mention-prefs have been loaded, load in the "PilotEntity.rtd" relations definition for tagging "relations." This is done by selecting the "Options" pulldown menu, then "Relations", then "Select Relation Type to Add". This need only be done once for a single file. Once a relation type definition (rtd) has been added to a file, all subsequent editing of this file can be performed by selecting the "Edit Existing Relations" option under the "Relation" option of the "Options" pulldown menu.

"Relation" is the term used in the Alembic Workbench to refer to any N-ary relationship among strings, stringfills, setfills, or instances of relationships ("RELINSTs"). The relations editing facility is a spreadsheet model for establishing such relationships, where the columns represent "fields" (or "slots") in the relation, and rows represent individual relation instances. Multiple values are accommodated in any single cell, and these multiple values are "stacked" one above another within the cell, in the order they were filled. New values can be added to a single field by clicking button-3 (right mouse button) over the appropriate cell.
A particular value in a cell can be *replaced* by clicking button-2 (middle mouse button) over a particular cell. Either button can be used to fill/add the first value to an empty cell. More information is usually available via the "help" buttons on the various dialog boxes in the Workbench.

III. Overview of the EDT annotation process in Workbench version 4.23

Given the severe time constraints in producing a tool that would provide a reasonable annotation/viewing environment for EDT, we have chosen to adapt the existing capabilities of the Workbench as much as possible. This has meant that the annotation process has been broken down into two distinct types of operations: (1) identifying heads for mentions that contain heads whose extent is different from the full mention, and (2) placing mentions in the appropriate column of the relation table.

A pair of examples should suffice:

    ... the three brothers ...   and   ... Fyodor Dostoyevsky ...

In the first case the "full extent of the mention" for purposes of scoring some of the tasks in EDT consists of all three words, while the "head" of the phrase (for use by other tasks in the EDT pilot evaluation) is limited to the final word alone. In the second case, the guidelines currently indicate that the head of the phrase is identical to the full extent of the phrase, which would contain both names.

In the case with differing head/mention extents, the process is as follows:

(a) Change your tag preferences to be $AWB/tag-preferences/mention-prefs ["Options" pulldown menu; "Tag Preferences" cascading option; "Load Tag Preferences" option. You only need to do this once for any individual session with the Workbench.]

(b) Select the relation $AWB/relations/PilotEntity.rtd ["Options" pulldown menu; "Relations" cascading option; "Select Relation Type to Add" option. You only need to do this once for each individual file that you are annotating with the Workbench. Thereafter the relation type definition continues to be stored within the SGML source file itself.]

(c) For each mention containing a head that is distinct from (has a different extent than) the maximal mention phrase:

    i) Create a Mention phrase around the maximal phrase extent.

    ii) This will cause the Mention tag to blink, requesting the user to select the head of the phrase. The user should select (using standard Workbench mouse bindings) the one-word or multi-word phrase constituting the head of the phrase.

    This will create two annotations, one a MENTION and the other a MENTIONHEAD, where the MENTION tag contains attributes "pointing" to the MENTIONHEAD tag.

(d) While holding the mouse over some portion of the MENTION phrase that is *not* also within the extent of the MENTIONHEAD tag, use the <shift><button-1> combination to create a new selection with the exact same extent as the MENTION phrase. (Optionally, one can do this with multiple clicks using the standard mouse bindings, but the suggested approach is faster and less prone to errors.)

(e) Now place the mouse over the appropriate cell in the PilotEntity table (relation) and click the middle button (or, if you are adding an additional value to the cell, click the right button). This should fill the slot with both the MENTION phrase string *and*, in square brackets, the MENTIONHEAD phrase.

In the case with identical head/mention extents, the process is as follows:

(a) In the event a mention's extent is co-extensive with its head (e.g., "Tom Smith"), one should skip steps (c) and (d). Instead, swipe/select the extent of the phrase, and then proceed immediately to step (e) above (that is, filling the appropriate relation cell). This will fill the cell with the simple phrase *without* the additional square-bracketed head phrase. The interpretation of this is that the head and the maximum extent of the phrase are identical.

IV. Other pointers and observations

It is possible to get the Alembic Workbench confused.
We strongly suggest that you save out your results frequently. The Workbench allows "backup" versions of the annotated file to be kept, and the number of these files is controlled by a field in the "File Saving Options" settings available under the "Options" pulldown menu. By saving often, you can retrieve previous versions of the annotated file and continue to work from there. The backup versions of the file are of the form <filename>.<num>, where <num> indicates the version of the file saved. The zero-th version (<filename>.0) is always the original file prior to any Workbench annotations.

V. Example files

Attached to this message you will find five files. The first two are the DTDs that define appropriate ACE Pilot Format XML standoff annotation files for reference and system output data, respectively. These DTDs can also be found in the directory $AWB/dtds/ in any version 4.23 (or later) distribution of the Workbench. These files contain some amount of commentary about their structure and motivation.

The remaining three files are different incarnations of a file annotated for EDT using the Workbench. The first is the original file, with only pre-existing SGML. The second is this same file after it has been annotated by the Workbench. This is the Workbench-native form of the annotations, utilizing embedded SGML declarations for relation types, relation instances, and within-text annotation tags. The third file in this group is the APF version of this EDT annotation, produced by running the sgm2apf conversion utility.

VI. Points of Contact at MITRE

Name             E-mail               Phone            Areas of Expertise
David Day        day@mitre.org        (781) 271-2854   [C,T,U,X]
Lisa Ferro       lferro@mitre.org     (781) 271-5875   [A,G,U]
John Henderson   jhndrsn@mitre.org    (781) 271-2849   [C,U,X]
Alan Goldschen   alang@mitre.org      (703) 883-6005   [G,U,X]
Ben Wellner      wellner@mitre.org    (781) 271-7191   [C,X]
John Aberdeen    aberdeen@mitre.org   (781) 271-2840   [T]

Abbreviations for "Areas of Expertise":

A   Annotation effort (file assignments, distribution, etc.)
G   Annotation Guidelines
C   Conversion program (AWB SGML --> ACE Pilot Format)
T   Alembic Workbench and related tools (technical details)
U   Use of Alembic Workbench to perform ACE annotation
X   XML of ACE Pilot Format
sample_awb_ed_annotation.orig.sgm
sample_awb_ed_annotation.sgm.apf.xml.16