Marking up in TATOE and exporting to SGML - Rule development for identifying NITF categories

Lothar Rostek

GMD - Integrated Publication and Information Systems Institute
rostek@darmstadt.gmd.de
Keywords: semantic mark up, proper noun extraction, SGML

Introduction

We have analyzed a corpus of German news messages with the general aim of extracting semantic information. More specifically, we have focused on the automatic categorization of proper nouns relating to persons, organizations and locations, as well as numeral and temporal expressions. Proper noun identification and classification has been examined for different languages, such as English, Chinese or Japanese (Wakao et al. 1996, Chen and Lee 1996, Kitani and Mitamura 1994, among many others). As far as German is concerned, two idiosyncratic features of the language complicate this task. First, German makes no surface distinction in spelling between proper nouns and common nouns, i.e. common nouns are also spelled with the first letter capitalized. Second, compounds in German can be long, without any word boundaries or hyphenation between the contained nouns, and are therefore relatively difficult to identify.

This work belongs to a project which aims at a real-world application, and for this reason the categories of the SGML-based standard News Industry Text Format (NITF) have been applied. NITF was developed by the International Press Telecommunications Council (IPTC) for the exchange of news messages. An interesting feature of the NITF standard is that, besides structural mark up, it also allows semantic encoding. Our aim in this project has been twofold: first, to develop an algorithm for the automatic identification of those phrases in new incoming messages which contain semantic information, e.g. names of persons, organizations, places, weekdays etc.; second, to mark up the messages according to the respective NITF categories and export the marked up messages as NITF-conformant SGML text. The degree of correctness of the automatically marked up texts is decisive for the applicability of this method in daily practice.

The general application context

The task reported above is part of the CLIP-ing project, a national collaborative project supported by DeTe-Berkom (a subsidiary of the German Telekom AG). One of the partners of the consortium is the German press agency (dpa), which provided us with the corpus of news messages. IPTC is another partner and has had a strong interest in the possibility of automatic semantic encoding using the NITF standard. The goal of the CLIP-ing project is to support the linking of different agency services as well as the planning and managing of news production, to add further value to news content by means of content indexing, and to conform to standards. It is envisaged that in this way news agencies may provide their clients with news reports which are semantically marked up according to the NITF standard. This adds value to the information sources of news agencies and caters for the richer information requirements which their clients may have.

Within the CLIP-ing project, a specific work area relevant to content indexing concerns the investigation of ways to analyze machine-readable news texts for the automatic identification and semantic classification of proper nouns, e.g. 'Hans Albrecht', 'Ontario', 'UNICEF', as well as temporal expressions and phrases denoting person roles, e.g. 'Anfang Oktober 1996' (beginning of October 1996), 'Bildungsminister' (minister of education). This task bears similarities to the general text analysis systems which are reported in the Message Understanding Conference proceedings (Grishman and Sundheim 1996, MUC-3, MUC-5). Our task here is not to develop a full-blown information extraction or text analysis system, but rather to extract only certain application-related information from the news messages and render it in a standard format, thus embedding it in the workflow process.

Methodology

A corpus of 483 raw dpa news messages drawn from the dpa text database has been analyzed. In order to have an evaluation basis for the automatic extraction, the whole corpus has been marked up by human coders. Given that the main source of information for the development of the extraction rules has been the news message corpus itself, it has been important to have flexible means for inspecting and viewing corpus words and the contexts they occur in. Furthermore, abstracting from the level of single word types, it is advantageous to be able to display selective concordances by means of syntactic and/or semantic patterns. To enable the definition of syntactic patterns, we have used GERTWOL, a morphological analysis tool for German from Lingsoft, Finland (http://www.lingsoft.fi). With regard to the semantic information, our aim has been to define mechanisms for filtering out relevant words according to part of speech categories and then classifying them semantically.

For the analysis tasks described above we have used the Text Analysis Tool with Object Encoding (TATOE) (Alexa & Rostek 1996). Two features of TATOE are important for this kind of work. First, TATOE enables the analyst to develop corpus-based pattern rules for parsing and marking up a corpus of texts according to a categorization schema (Alexa & Rostek 1997). Second, TATOE enables analysis and mark up of the corpus texts according to different categorization schemata concurrently. Since TATOE did not support the export of mark up into SGML encoded text, an export procedure has been defined for the particular application context. This is presented later on in this paper.

Semantic mark up

The dpa corpus which we imported into TATOE contains 483 dpa messages with 20,407 word types and 124,691 word tokens. We have defined and used two categorization schemata in TATOE. Each schema contains categories which are based on the NITF categories for semantic mark up. One of the schemata has been used in order to mark up all the texts intellectually, i.e. by human coders, with such NITF categories as PERSON, FUNCTION, CITY, CHRON, etc. In addition to the standard NITF categories, we have defined more specific ones in order to allow for more detailed information; for example, the categories PCHRON and FCHRON for distinguishing between past and future temporal phrases, or FUNCPERS for those phrases which express a named person together with his/her role. This intellectual mark up has been used as a test basis for the evaluation of the correctness of the automatic mark up.

The second schema consists of the same categories (although its categories are spelled slightly differently so that the two schemata can be compared) and has been used for storing and displaying the automatically performed mark up. A set of pattern rules has been defined in order to parse and mark up the texts accordingly.
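To give a feel for what such a pattern rule does, the following is a minimal sketch in Python. It is not TATOE's rule formalism: the regular expression, the small list of role nouns and the function name mark_funcpers are all hypothetical, chosen only to illustrate a rule that marks a role noun followed by capitalized name tokens as FUNCTION inside FUNCPERS.

```python
import re

# Hypothetical mini-lexicon of role nouns (an assumption for illustration);
# a real rule set would draw on the corpus and on morphological analysis.
ROLE_NOUNS = {"Regisseurin", "Intendanten", "Bildungsminister"}

# One capitalized token, allowing German umlauts and sharp s.
CAP_WORD = r"[A-ZÄÖÜ][a-zäöüß]+"


def mark_funcpers(text):
    """Return (start, end, category) spans for 'role noun + name' phrases."""
    roles = "|".join(sorted(re.escape(r) for r in ROLE_NOUNS))
    pattern = re.compile(
        rf"\b(?P<func>{roles}) (?P<name>{CAP_WORD}(?: {CAP_WORD})*)"
    )
    spans = []
    for m in pattern.finditer(text):
        # The role noun on its own is a FUNCTION ...
        spans.append((m.start("func"), m.end("func"), "FUNCTION"))
        # ... and role noun plus name together form a FUNCPERS phrase.
        spans.append((m.start(), m.end(), "FUNCPERS"))
    return spans
```

Because German capitalizes common nouns as well, a capitalization test like this one would also absorb a common noun that directly follows the name, which is one reason why purely surface-based rules are not sufficient.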

The evaluation of the correctness of the automatic mark up is then a comparison between the mark up of the two schemata, i.e. it measures the differences between the intellectual and the automatic mark up.
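One possible way to carry out such a comparison is sketched below. The text does not say which measures were used, so precision and recall over exactly matching spans are an assumption here, as are the span representation (start, end, category) and the auto_to_manual mapping that reconciles the differently spelled category names.

```python
def compare_markup(manual, automatic, auto_to_manual):
    """Compare two sets of (start, end, category) spans by exact match.

    auto_to_manual is a hypothetical mapping from the category spellings of
    the automatic schema to those of the human-coded schema.
    """
    manual = set(manual)
    # Rename the automatic categories so that both sets are comparable.
    automatic = {(s, e, auto_to_manual[c]) for s, e, c in automatic}
    matched = manual & automatic
    return {
        "matched": matched,
        "missed": manual - automatic,      # coded by humans only
        "spurious": automatic - manual,    # produced by the rules only
        "precision": len(matched) / len(automatic) if automatic else 0.0,
        "recall": len(matched) / len(manual) if manual else 0.0,
    }
```

Exact matching is deliberately strict: a span that differs from the human-coded one by a single character counts both as missed and as spurious. A more lenient comparison would need an explicit overlap criterion.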

Exporting performed semantic mark up into SGML

Each marked up position in TATOE is stored in an object representing the paragraph it belongs to. This object also contains the text of the paragraph. This storing mechanism, which separates the text from its mark up positions, has the advantage of enabling fast selection and display of all mark up according to the schema currently selected by the user. In addition, multiple and overlapping mark up does not pose a problem. However, if one wants to export the mark up into an SGML format, the mark up positions first need to be selected, then sorted, and finally the SGML tags need to be inserted into the original text. Furthermore, this process has to respect the dependencies between the marked up elements. For example, in the text shown below, the system has stored two marked up phrases, namely 'Regisseurin' and 'Regisseurin Andrea Breth'; however, it needs to be recognized that the two phrases are interdependent, that is, the first phrase is part of the second, and that for this text an element called FUNCTION should be inserted inside the PERSON element.

<NITF><HEAD><TOBJECT></TOBJECT> <IPTC7901.WIREHEAD IPTC7901.PRIORITY="5" IPTC7901.TIMEDATE="281434 Aug 91" IPTC7901.SVCID="bas" IPTC7901.OPTINFO="vvvvb dpa 260 " IPTC7901.KEYWORD="Theater" IPTC7901.MSGNUM="362" IPTC7901.CATEGORY="ku"></HEAD>
<BODY><HEDLINE><HL1><PERSON>Andrea Breth</PERSON> in künstlerischer Leitung der <ORG>Berliner Schaubühne</ORG></HL1></HEDLINE>
<DATELINE><LOCATION>Berlin</LOCATION></DATELINE>
<P>Die <PERSON><FUNCTION>Regisseurin</FUNCTION> Andrea Breth</PERSON> gehört mit Beginn der Spielzeit <CHRON NORM="19920101">1992</CHRON> / <NUM>93</NUM> der künstlerischen Leitung der <ORG>Berliner Schaubühne am Lehniner Platz</ORG> an, teilte das <ORG>Theater</ORG> am <CHRON NORM="19910827">Dienstag</CHRON> mit. Sie übernimmt die unbesetzte Stelle von <PERSON>Jürgen Gosch</PERSON> , von dem sich das <ORG>Theater</ORG> zum Jahresende <CHRON NORM="19890101">1989</CHRON> vorzeitig getrennt hatte. <PERSON>Andrea Breth</PERSON> inszeniert derzeit an der <ORG>Schaubühne</ORG> <PERSON>Arthur Schnitzlers</PERSON> Stück "Der einsame Weg". Die Premiere ist für den <CHRON NORM="19900930">30. September</CHRON> angekündigt. Die 38jährige <FUNCTION>Regisseurin</FUNCTION> arbeitet auch noch am <ORG>Wiener Burgtheater</ORG> unter dem <PERSON><FUNCTION>Intendanten</FUNCTION> Claus Peymann</PERSON> .</P></BODY></NITF>

During the generation of the SGML expression for each paragraph in a text, we calculate an inclusion lattice of the marked up phrases in order to order the overlapping elements and to determine the insertion points. In this way, mark up can be exported from TATOE as SGML.
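The select, sort and insert steps can be illustrated with a short Python sketch. It is a simplified stand-in rather than the lattice computation used in TATOE: it assumes that mark up positions are character offsets into the paragraph text, it handles only spans that are properly included in one another, and it rejects crossing spans instead of resolving them. The function name export_sgml is hypothetical.

```python
def export_sgml(text, spans):
    """Insert SGML tags for (start, end, tag) spans given as character offsets.

    Assumes every pair of spans is either disjoint or nested.
    """
    # Outer spans first: sort by start ascending, then by end descending.
    ordered = sorted(spans, key=lambda s: (s[0], -s[1]))
    out = []
    stack = []  # currently open spans, innermost last
    pos = 0     # how much of the text has been copied so far

    def close_innermost():
        nonlocal pos
        _, end, tag = stack.pop()
        out.append(text[pos:end])
        out.append(f"</{tag}>")
        pos = end

    for start, end, tag in ordered:
        # Close every open element that ends before this span begins.
        while stack and stack[-1][1] <= start:
            close_innermost()
        if stack and end > stack[-1][1]:
            raise ValueError("crossing spans cannot be expressed by nesting")
        out.append(text[pos:start])
        out.append(f"<{tag}>")
        pos = start
        stack.append((start, end, tag))

    while stack:
        close_innermost()
    out.append(text[pos:])
    return "".join(out)
```

Applied to the paragraph text 'Die Regisseurin Andrea Breth gehört' with a FUNCTION span over 'Regisseurin' and a PERSON span over 'Regisseurin Andrea Breth', the sketch places the FUNCTION element inside the PERSON element. It does not escape characters such as '&' or '<' in the text, which a real export would have to do.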

For the temporal information marked up as CHRON elements, the system, following the NITF guidelines, creates an SGML attribute NORM whose value is the concrete date of the temporal phrase in a normalized form. In the text above, <CHRON NORM="19910827">Dienstag</CHRON> means that Dienstag (German for Tuesday) refers to 27 August 1991, calculated as the Tuesday before the date of the message (28.8.91).
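The weekday calculation can be sketched in Python as follows. The function name and the weekday table are hypothetical, and the sketch covers only references to a past weekday; it assumes that a weekday name equal to the message's own weekday refers to the message date itself, which the text above does not specify.

```python
from datetime import date, timedelta

# German weekday names mapped to Python's weekday numbers (Monday = 0).
WEEKDAYS = {
    "Montag": 0, "Dienstag": 1, "Mittwoch": 2, "Donnerstag": 3,
    "Freitag": 4, "Samstag": 5, "Sonntag": 6,
}


def normalize_weekday(weekday_name, message_date):
    """Return the NORM value (YYYYMMDD) for a weekday named in a message."""
    target = WEEKDAYS[weekday_name]
    # Number of days to step back from the message date to reach the
    # most recent occurrence of the named weekday (0 if it is the same day).
    days_back = (message_date.weekday() - target) % 7
    return (message_date - timedelta(days=days_back)).strftime("%Y%m%d")
```

For the message dated 28 August 1991, a Wednesday, normalize_weekday("Dienstag", date(1991, 8, 28)) steps back one day and yields "19910827", the NORM value shown in the sample text.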

Conclusions

We have defined this export procedure from TATOE to SGML specifically for the CLIP-ing application context. Clearly, a general solution for this requirement has to be provided, in which a general descriptive formalism within TATOE specifies the mapping from mark up to SGML tagged text. Nevertheless, we feel that the defined export procedure is an important step in that direction.

References

Alexa, Melina and Lothar Rostek (1997): Pattern concordances - TATOE calls XGrammar. Paper to be presented at ALLC-ACH97, Kingston, Canada. June 1997.

Alexa, Melina and Lothar Rostek (1996): Computer-assisted corpus-based text analysis with TATOE. Presented at ALLC-ACH96, Bergen, Norway. Abstracts, pp. 11-17.

Chen, Hsin-Hsi and Jen-Chang Lee (1996): Identification and classification of proper nouns in Chinese texts. Proceedings of COLING-96, Vol. 1, pp. 222-229, Copenhagen, Denmark.

Grishman, Ralph and Beth Sundheim (1996): Message Understanding Conference - 6: a brief history. Proceedings of COLING-96, Vol. 1, pp. 466-471, Copenhagen, Denmark.

Kitani, T. and T. Mitamura (1994): An accurate morphological analysis and proper noun identification for Japanese text processing. In Transactions of Information Processing Society of Japan, Vol. 35, No. 3, pp. 404-413.

MUC-3: Proceedings of the Third Message Understanding Conference (MUC-3), May 1991. Morgan Kaufmann Publishers, San Diego, CA, USA.

MUC-5: Proceedings of the Fifth Message Understanding Conference (MUC-5), August 1993. Morgan Kaufmann Publishers, San Francisco, CA, USA.

Wakao, Takahiro, Robert Gaizauskas and Yorick Wilks (1996): Evaluation of an algorithm for the recognition and classification of proper nouns. Proceedings of COLING-96, Vol. 1, pp. 418-423, Copenhagen, Denmark.
