[Archive copy mirrored from the URL: http://www.qucis.queensu.ca/achallc97/papers/p029.html; see this canonical version of the document.]
Keywords: semantic mark up, proper noun extraction, SGML
This work belongs to a project aimed at a real-world application; for this reason the categories of the SGML-based standard News Industry Text Format (NITF) have been applied. NITF was developed by the International Press Telecommunications Council (IPTC) for the exchange of news messages. An interesting feature of the NITF standard is that, besides structural mark up, it also allows semantic encoding. Our aim in this project has been twofold: first, to develop an algorithm for automatically identifying those phrases in newly incoming messages which carry semantic information, e.g. names of persons, organizations, places, weekdays etc.; second, to mark up the messages according to the respective NITF categories and export them as NITF-conformant SGML text. The degree of correctness of the automatically marked-up texts is decisive for the applicability of this method in daily practice.
Within the CLIP-ing project, a specific work area relevant to content indexing concerns ways of analyzing machine-readable news texts for the automatic identification and semantic classification of proper nouns, e.g. 'Hans Albrecht', 'Ontario', 'UNICEF', as well as temporal expressions and phrases denoting person roles, e.g. 'Anfang Oktober 1996' (beginning of October 1996), 'Bildungsminister' (minister of education). This task bears similarities to the general text analysis systems reported in the Message Understanding Conference proceedings (Grishman and Sundheim 1996, MUC-3, MUC-5). Our task here is not to develop a full-blown information extraction or text analysis system, but rather to extract only certain application-related information from the news messages and render it in a standard format, thus embedding it in the workflow process.
For the analysis tasks described above we have used the Text Analysis Tool with Object Encoding (TATOE) (Alexa & Rostek 1996). Two features of TATOE are important for this kind of work. First, TATOE enables the analyst to develop corpus-based pattern rules for parsing and marking up a corpus of texts according to a categorization schema (Alexa & Rostek 1997). Second, it enables analysis and mark up of the corpus texts according to different categorization schemata concurrently. Since TATOE did not support the export of mark up into SGML-encoded text, an export procedure has been defined for this particular application context; it is presented later in this paper.
The second schema consists of the same categories (although their names are spelled slightly differently to enable comparisons) and has been used for storing and displaying the automatically performed mark up. A set of pattern rules has been defined in order to parse and mark up the texts accordingly.
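To illustrate what a pattern rule of this kind might look like, the following is a minimal sketch in Python. It does not use TATOE's own rule formalism, which is not shown here; a regular expression stands in for it. The role lexicon `ROLES`, the name pattern `NAME`, and the function `mark_person_with_role` are hypothetical and serve only as illustration.

```python
import re

# Hypothetical mini-lexicon of role words; a real rule set would be corpus-derived.
ROLES = ["Regisseurin", "Intendanten", "Bildungsminister"]
# Assumption: a name token is one capitalized word (German letters included).
NAME = r"[A-ZÄÖÜ][a-zäöüß]+"
# Rule: role word, whitespace, then two name tokens.
RULE = re.compile(rf"\b({'|'.join(ROLES)})\s+({NAME}\s+{NAME})")


def mark_person_with_role(text: str) -> str:
    """Wrap 'ROLE Firstname Lastname' as a PERSON element with a nested FUNCTION."""
    return RULE.sub(r"<PERSON><FUNCTION>\1</FUNCTION> \2</PERSON>", text)


print(mark_person_with_role("Die Regisseurin Andrea Breth gehört"))
```

A role word that is not followed by two capitalized tokens is left untouched by this rule, so a separate rule would be needed to mark it up as a FUNCTION on its own.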
The evaluation of the correctness of the automatic mark up is then a comparison between the mark up of the two schemata, i.e. it measures the differences between the intellectual and the automatic mark up.
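Such a comparison can be sketched as set operations over annotations. The sketch below assumes that each annotation is a `(start, end, category)` triple of character offsets and that a `rename` table maps the category names of the automatic schema onto those of the manual one. The function name `compare_markup` and the lowercase spellings are illustrative assumptions, not the project's actual conventions.

```python
def compare_markup(manual, automatic, rename):
    """Compare two collections of (start, end, category) annotations.

    `rename` maps category names of the automatic schema to the manual one.
    """
    auto = {(s, e, rename.get(c, c)) for s, e, c in automatic}
    manual = set(manual)
    return {
        "agree": manual & auto,           # same span, same category
        "manual_only": manual - auto,     # missed by the automatic mark up
        "automatic_only": auto - manual,  # not confirmed by the manual mark up
    }


# Hypothetical annotations for one text.
manual = {(0, 12, "PERSON"), (40, 60, "ORG")}
automatic = {(0, 12, "person"), (62, 68, "location")}
rename = {"person": "PERSON", "org": "ORG", "location": "LOCATION"}

for kind, items in compare_markup(manual, automatic, rename).items():
    print(kind, sorted(items))
```

Only exact span matches count as agreement in this sketch; partial overlaps between a manual and an automatic annotation would appear in both difference sets.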
<NITF><HEAD><TOBJECT></TOBJECT> <IPTC7901.WIREHEAD IPTC7901.PRIORITY="5" IPTC7901.TIMEDATE="281434 Aug 91" IPTC7901.SVCID="bas" IPTC7901.OPTINFO="vvvvb dpa 260 " IPTC7901.KEYWORD="Theater" IPTC7901.MSGNUM="362" IPTC7901.CATEGORY="ku"></HEAD> <BODY><HEDLINE><HL1><PERSON>Andrea Breth</PERSON> in künstlerischer Leitung der <ORG>Berliner Schaubühne</ORG></HL1></HEDLINE> <DATELINE><LOCATION>Berlin</LOCATION></DATELINE> <P>Die <PERSON><FUNCTION>Regisseurin</FUNCTION> Andrea Breth</PERSON> gehört mit Beginn der Spielzeit <CHRON NORM="19920101">1992</CHRON> / <NUM>93</NUM> der künstlerischen Leitung der <ORG>Berliner Schaubühne am Lehniner Platz</ORG> an, teilte das <ORG>Theater</ORG> am <CHRON NORM="19910827">Dienstag</CHRON> mit. Sie übernimmt die unbesetzte Stelle von <PERSON>Jürgen Gosch</PERSON> , von dem sich das <ORG>Theater</ORG> zum Jahresende <CHRON NORM="19890101">1989</CHRON> vorzeitig getrennt hatte. <PERSON>Andrea Breth</PERSON> inszeniert derzeit an der <ORG>Schaubühne</ORG> <PERSON>Arthur Schnitzlers</PERSON> Stück "Der einsame Weg". Die Premiere ist für den <CHRON NORM="19900930">30. September</CHRON> angekündigt. Die 38jährige <FUNCTION>Regisseurin</FUNCTION> arbeitet auch noch am <ORG>Wiener Burgtheater</ORG> unter dem <PERSON><FUNCTION>Intendanten</FUNCTION> Claus Peymann</PERSON> .</P></BODY></NITF>
During the generation of the SGML expression for each paragraph in a text, we calculate an inclusion lattice of the marked up phrases in order to arrange the overlapping elements and to determine the insertion points. This makes it possible to export mark up from TATOE as SGML.
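The idea of ordering elements by inclusion can be sketched as follows. This is a simplified stand-in, not the lattice computation used in the export procedure: it assumes that marked up phrases are given as `(start, end, tag)` character spans which either nest or are disjoint, and it rejects spans that cross. The function name `to_sgml` is hypothetical.

```python
def to_sgml(text, spans):
    """Insert SGML tags for (start, end, tag) spans; spans must nest or be disjoint."""
    # Outer elements first: earlier start, then the longer span for equal starts.
    ordered = sorted(spans, key=lambda s: (s[0], -s[1]))
    out, stack, pos = [], [], 0
    for start, end, tag in ordered:
        # Close every open element that ends at or before this span's start.
        while stack and stack[-1][0] <= start:
            close_end, close_tag = stack.pop()
            out.append(text[pos:close_end] + f"</{close_tag}>")
            pos = close_end
        # A span reaching past its enclosing element cannot be nested.
        if stack and end > stack[-1][0]:
            raise ValueError(f"span {tag} crosses an enclosing element")
        out.append(text[pos:start] + f"<{tag}>")
        pos = start
        stack.append((end, tag))
    # Close whatever is still open, innermost first.
    while stack:
        close_end, close_tag = stack.pop()
        out.append(text[pos:close_end] + f"</{close_tag}>")
        pos = close_end
    out.append(text[pos:])
    return "".join(out)


text = "Die Regisseurin Andrea Breth gehört"
spans = [(4, 15, "FUNCTION"), (4, 28, "PERSON")]
print(to_sgml(text, spans))
```

Sorting by start offset and then by decreasing length places an enclosing element before the elements it contains, which is what fixes the order of the opening tags when two elements begin at the same position.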
For the temporal information marked up as CHRON elements, the system, following the NITF guidelines, creates an SGML attribute NORM whose value is the concrete date of the temporal phrase in a normalized form. In the text above, <CHRON NORM="19910827">Dienstag</CHRON> means that Dienstag (German for Tuesday) was 27 August 1991, calculated as the Tuesday before the date of the message (28.8.91).
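The weekday calculation can be sketched with Python's standard `datetime` module. The sketch assumes that a weekday name always refers to the most recent such day strictly before the message date; how the system treats a weekday name equal to the message's own weekday is not stated here. The table `WEEKDAYS` and the function `normalize_weekday` are illustrative names.

```python
from datetime import date, timedelta

# German weekday names mapped to Python's numbering (Monday = 0).
WEEKDAYS = {"Montag": 0, "Dienstag": 1, "Mittwoch": 2, "Donnerstag": 3,
            "Freitag": 4, "Samstag": 5, "Sonntag": 6}


def normalize_weekday(name: str, message_date: date) -> str:
    """Return YYYYMMDD for the most recent `name` strictly before message_date."""
    target = WEEKDAYS[name]
    # Step back 1 to 7 days, so the message date itself is never chosen (assumption).
    delta = (message_date.weekday() - target - 1) % 7 + 1
    return (message_date - timedelta(days=delta)).strftime("%Y%m%d")


# Message dated 28.8.91, as in the sample above.
print(normalize_weekday("Dienstag", date(1991, 8, 28)))  # prints 19910827
```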
Alexa, Melina and Lothar Rostek (1996): Computer-assisted corpus-based text analysis with TATOE. Presented at ALLC-ACH96, Bergen, Norway. Abstracts, pp. 11-17.
Chen, Hsin-Hsi and Jen-Chang Lee (1996): Identification and classification of proper nouns in Chinese Texts. Proceedings of COLING-96, Vol. 1, pp. 222-229, Copenhagen, Denmark.
Grishman, Ralph and Beth Sundheim (1996): Message Understanding Conference - 6: a brief history. Proceedings of COLING-96, Vol. 1, pp. 466-471, Copenhagen, Denmark.
Kitani, T. and T. Mitamura (1994): An accurate morphological analysis and proper noun identification for Japanese text processing. In Transactions of Information Processing Society of Japan, Vol. 35, No. 3, pp. 404-413.
MUC-3: Proceedings of the Third Message Understanding Conference (MUC-3), May 1991. Morgan Kaufmann Publishers, San Diego, CA, USA.
MUC-5: Proceedings of the Fifth Message Understanding Conference (MUC-5), August 1993. Morgan Kaufmann Publishers, San Francisco, CA, USA.
Wakao, Takahiro, Robert Gaizauskas and Yorick Wilks (1996): Evaluation of an Algorithm for the Recognition and Classification of Proper Nouns. Proceedings of COLING-96, Vol. 1, pp. 418-423, Copenhagen, Denmark.