[Cache from http://www.csl.sony.co.jp/person/nagao/gda/tagset.html; please use this canonical URL/source if possible.]
This draft discusses the GDA (Global Document Annotation) tag set, giving the rationale for the tags and examples of how to use them. It may serve as a tagging manual for human annotators with ample knowledge of both theoretical and computational linguistics, but the actual tagging manual for most annotators must be provided separately from this draft.
The GDA tag set aims at making the semantic and pragmatic structure of electronic texts automatically recognizable. It is being developed so as to be easy to embed into the TEI, EAGLES, and HTML tag sets, so the meanings of the GDA tags should be maximally consistent with those three tag sets. Some tags are imported from them, but when such a tag is defined in two or more of them, the meaning in HTML is preferred to that in TEI and EAGLES, because GDA tags are expected to be embedded very often in HTML files. For example, although TEI uses <s> to encode a sentence, GDA uses <su> instead, because HTML uses <s> to strike through text.
This draft is published for the sake of public survey and evaluation. Empirical evaluation is necessary on both how useful the tags described below are for practical applications and how consistently people can annotate documents with those tags. We would like to improve the tag set by taking results of evaluations into account, before announcing it for public, extensive use in late 1998.
To optimize the benefit-to-cost ratio of tagging, we try to design as simple a tag set as possible that still captures enough content for practical applications. The semantic and pragmatic content of an utterance might be unlimitedly complex due to the complexity of the context, but an appropriate degree of complexity for the tag set can be identified, because present natural language technology can effectively process only limited sorts of information. For instance, tagging for metonymy may not be very useful. The tag set should keep pace with the contemporary state of the art. We can refine the tag set when more detailed tags become useful as technology advances.
Here we do not restrict ourselves to any single application, but try to capture as many aspects of language as seem sufficiently useful in at least one of translation, retrieval, summarization, question answering, case-based reasoning, and so on. Users interested in only some of these applications may want to use subsets of the tag set, but those subsets will be addressed in a separate document. In this connection, the GDA tags are optional, because applications do not normally require exhaustive tagging. In fact, many relatively simple untagged sentences can be analyzed correctly by current technology.
The GDA tag set is not specific to any particular language, though the example passages below are all in English. The usage of the tags is subject to some customization for particular languages, but we want to use the same vocabulary for the sake of coordination across different languages. Of course different tagging manuals are necessary for different languages. However, we hope to design the tag set so that it is easy for you to write such a manual once you have understood the idea behind the tag set.
The tag set is not meant to address a linguistic theory, but to encode the semantic and pragmatic structures of documents, probably remaining somewhat neutral in linguistic controversies. Encoding a semantic or pragmatic structure and capturing a linguistic generalization are different issues. In particular, syntactic generalization may be sacrificed very often, because syntax is not our primary concern but is used as a partial aid for recovering semantics and pragmatics. This could be justified because people probably have better intuition about semantics and pragmatics than about syntax. For instance, consider the tagged sentences below:
<n syn=p>Not Sue, but Kim</n> came in.
<n syn=p>Instead of Sue, Kim</n> came in.
The attribute syn=p means that the top-level syntactic structure of the element is a parallel construction, where `parallel' means that the maximal subelements of the element share the same syntactic dependencies with respect to the context outside the element. In this case, `Sue' and `Kim' are both regarded as a semantic subject of `came in' in both sentences. However, the underlying mechanisms are quite different in the two sentences. In the first, the shared dependency arises syntactically from the coordinate structure, whereas in the second it seems to arise from general inference, because `instead of Sue' is probably an adjunct to the sentence rather than to `Kim,' granted that sentences such as `Instead of leaving, Kim came in' are fine. Thus the same form of tagging in the two sentences captures little linguistic generalization beyond what concerns semantic structure. We thus intend that the <n> tag, among others, should capture not exactly real syntactic constituency but rather semantic constituency. Of course linguistic theories are very helpful in designing the tag set, but what is important is that the tag set can represent the semantic and pragmatic structure of a wide range of documents, not that it captures linguistic generalizations.
All the GDA tags discussed below may have the following attributes, all of which are optional:
Hereafter the speaker means not only the agent of a speech, but also the author of a written passage, the thinker of a thought, the performer of a sign language or a gesture, and so on, where the speech, the passage, the thought, etc. appear as tagged elements in the document. Similarly, the hearer means the addressee of them in the correspondingly wide sense.
A referential index is a name. It is usually the id value of another element, but there are eight special referential indices which are not the id value of any element. These are 0p (generic people), 1p (first person (speaker/author) singular, or `I/my/me'), 1pp (first person plural, or `we/our/us'), 1px (first person plural including second person, or `we/our/us including you/your/you'), 1px (first person plural excluding second person, or `we/our/us excluding you/your/you'), 2p (second person singular), 2pp (second person plural), and nil (nothing). They can be associated with other referential indices via the who and whm attributes of <q> elements.
A conceptual term is a concept id of some ontology or a term in some natural language. The former is of the form id or ont.id, and the latter of the form lang.term. A conceptual term of the simplest form, id, is a semantic relation defined natively in the GDA tag set. ont is the abbreviated name of an ontology specified in a preceding <ontology> element, and lang is a language identifier from ISO 639-2. 'eng.make fool of' is an example of a conceptual term which represents a term in English; quotation marks are placed at both ends when the term part contains white space.
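The three surface forms of a conceptual term can be told apart mechanically. The following is a minimal sketch, not part of the GDA specification: the names ConceptualTerm and parse_conceptual_term are hypothetical, and so is the rule that a prefix counts as an ontology only if the caller lists it among the declared ontology abbreviations (any other prefix is assumed to be a language identifier). The ontology abbreviation `wn' in the usage note is likewise made up for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class ConceptualTerm:
    kind: str               # "native", "ontology", or "language"
    prefix: Optional[str]   # ontology abbreviation or language code; None if native
    body: str               # concept id or natural-language term


def parse_conceptual_term(raw: str,
                          ontologies: FrozenSet[str] = frozenset()) -> ConceptualTerm:
    """Classify a conceptual term as id, ont.id, or lang.term.

    Assumption (not from the draft): `ontologies` holds the abbreviations
    declared by preceding <ontology> elements; any other prefix is taken
    to be a language identifier.
    """
    text = raw.strip()
    # Terms containing white space are written with quotation marks at both ends.
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "'\"":
        text = text[1:-1]
    prefix, dot, body = text.partition(".")
    if not dot:
        return ConceptualTerm("native", None, text)
    if prefix in ontologies:
        return ConceptualTerm("ontology", prefix, body)
    return ConceptualTerm("language", prefix, body)
```

For instance, parse_conceptual_term("'eng.make fool of'") yields a language term with prefix eng and body `make fool of', while parse_conceptual_term("wn.c42", frozenset({"wn"})) yields an ontology term under the hypothetical ontology wn.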
<p> and the tags thereafter are called intradivisional tags. <cite>, <quote>, <q>, and <iq> are called quotational tags. <seg> and the tags thereafter are called intrasentential tags. The following table shows which elements (left) can contain which tags (right).
<divi> | all tags except <div1> ··· <divi> |
<hi> | intradivisional tags |
<p> and <ss> | <ss>, <su>, and quotational tags |
quotational tags | all tags |
<su> and intrasentential tags except <el> | quotational tags and intrasentential tags |
The following tags are used to encode ambiguities. In GDA, these tags are usually not manually handled, but instead automatically processed by computers. The elements of these tags are all empty, and can appear anywhere in the document. These tags except <anchor> are called link tags. Link elements (elements with link tags) are daughters of other elements only when they are referred to via the dtrs attribute.
The syn attribute encodes types of syntactic constructions. The possible values are the following.
The rel attribute encodes binary relations licensed by syntactic dependency or intersentential juxtaposition. They are relations between the depending element and the depended element, and include grammatical functions, thematic roles, and rhetorical relations. The distinction among these three types of relations is often vague. For instance, LOCATION counts as both a grammatical function and a thematic role. Although CAUSE is usually regarded as a rhetorical relation, it can also serve as a thematic role of phrases such as `due to lack of money.' This is why we conflate grammatical functions, thematic roles, and rhetorical relations, under the rel attribute. Among the values introduced below, cau, cnc, cnd, and so on, serve as both rhetorical relations and thematic roles.
The value of the rel attribute is a relational term, which is a kind of conceptual term representing a binary relation. So rel is an open-class attribute, potentially encompassing all the binary relations lexicalized in natural languages. The rel attribute encodes a relationship in which the current element stands with respect to the element that it depends on, as in:
go <seg rel=fin>to Paris</seg>
In this connection, conceptual terms may also be values of the sem attribute, which encodes word senses, as discussed later. In that case the sem attribute with a relational term as its value is attached to the function word, as in:
go <seg sem=fin>to</seg> Paris
One purpose of the rel attribute is to associate complement elements (subjects, objects, indirect objects, and so forth) with the corresponding arguments of verbs, adjectives, etc. To this end we employ a rather standard approach: the association is specified by marking elements with grammatical functions such as SUBJECT and OBJECT (sbj and obj below, respectively), provided that we have a dictionary containing the argument structures of verbs and so on. In many languages, there is usually no need to explicitly mark up complements such as subjects, objects, and indirect objects, because their grammatical functions are obvious from the surface forms and hence their thematic roles can be inferred from the dictionary. When the verb has multiple argument structures, as with `Tom opens the door' (where `Tom' is the agent) and `The key opens the door' (where `the key' is the instrument), we can either mark up the subject noun phrases with the thematic role or mark up the verb in terms of the argument structure. Also, by using grammatical functions we do not have to worry about whether the subject of `buy' should be AGENT or GOAL, for example.
The other purpose of the rel attribute is to resolve ambiguities in both the thematic roles of adjunct elements, which are typically prepositional and postpositional phrases, and the rhetorical relations which are not explicitly marked. To attain this, we simply mark up the elements in question with thematic roles and rhetorical relations. However, an exhaustive listing of thematic roles and rhetorical relations appears impossible, as is widely recognized. We are not yet sure how many thematic roles and rhetorical relations are sufficient for engineering applications such as machine translation, but as mentioned before, the appropriate granularity of classification will be determined by the current level of technology. The following list is by no means a definitive set of thematic roles and rhetorical relations, but we hope to improve it and arrive at a nearly optimal one.
The native relational terms are enumerated below:
Relational terms can combine to make compound relational terms. There are two types of combination. If a and b are relational terms, then a:b and a_b are relational terms, too. The operator `:' has precedence over `_.' That is, a:b_c is the combination of a:b and c through `_.'
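The precedence rule can be made concrete with a small parser. This is only a sketch under two assumptions that the draft does not state: that simple relational terms never contain `:' or `_' themselves, and that extended terms of the form a-b are out of scope. The function name parse_compound is hypothetical.

```python
from typing import List


def parse_compound(term: str) -> List[List[str]]:
    """Parse a compound relational term into an intersection of compositions.

    Because `:' binds tighter than `_', the term is first split on `_'
    (the outer, lower-precedence operator) and each part is then split
    on `:'. The result is a list of composition chains to be intersected.
    """
    return [part.split(":") for part in term.split("_")]
```

Under these assumptions, parse_compound("a:b_c") gives [["a", "b"], ["c"]], that is, the combination of a:b and c through `_', and parse_compound("res_pur") gives [["res"], ["pur"]].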
a:b represents the composition of a and b as binary relations. That is, x and z stand in relation a:b, if and only if there exists y such that x and y stand in relation a and y and z stand in relation b.
a_b is the intersection of a and b as sets; a binary relation is a subset of the Cartesian product of two sets. That is, x and y stand in relation a_b, if and only if x and y stand in both relations a and b. This is used when the thematic or rhetorical relation between two syntactic elements cannot be sharply classified into any single category.
I came <adp rel=res_pur>so that I met him</adp>.
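The two operators correspond to standard operations on binary relations viewed as sets of pairs. The sketch below illustrates the definitions only; the function names compose and intersect and the sample element names (e1, e2, and so on) are hypothetical and do not come from the GDA draft.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]


def compose(a: Set[Pair], b: Set[Pair]) -> Set[Pair]:
    """a:b -- x and z are related iff some y has (x, y) in a and (y, z) in b."""
    return {(x, z) for (x, y) in a for (y2, z) in b if y == y2}


def intersect(a: Set[Pair], b: Set[Pair]) -> Set[Pair]:
    """a_b -- x and y are related iff (x, y) is in both a and b."""
    return a & b


# Made-up sample relations over made-up element ids.
res = {("e1", "e2"), ("e3", "e4")}
pur = {("e1", "e2"), ("e5", "e6")}

# res_pur holds only for pairs that stand in both relations.
res_pur = intersect(res, pur)

# a:b_c is read as (a:b)_c, since `:' has precedence over `_'.
a = {("x", "y")}
b = {("y", "z")}
c = {("x", "z"), ("x", "w")}
a_b_c = intersect(compose(a, b), c)
```

Here res_pur contains only the pair (e1, e2), and a_b_c contains only (x, z).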
In `Kim likes Mary better than Betty,' for instance, we must specify whether `Betty' is compared with `Kim' or `Mary.' Similarly, in `Kim blamed Mary together with Betty,' we want to mark whether Kim and Betty blamed Mary or Kim blamed Mary and Betty. To implement this in general, we use extended relational terms of the form a-b, where a and b are relational terms but not extended relational terms. a is a relational term such as cmp and jnt, and b is a relational term to indicate which element is in parallel with the current element, as in what follows:
The relational terms are used as attributes as well, which we will call relational attributes. While the rel attribute appears in the depending element (the satellite in the case of rhetorical relations), the relational attribute appears in the depended element (the nucleus) and points to the depending element. Namely, the value of the relational attribute is the referential index of the element, if any, which semantically or pragmatically depends on the element containing this relational attribute. Of course the attribute name indicates the type of the dependency.
Communicative functions including speech acts are encoded by the com attribute. Its values are classified into properties (unary relations) and binary relations. The former are the following:
Under construction
Anaphora by an overt expression such as a pronoun or a definite description is encoded by the crf attribute, which takes a referential index as its value.
<seg id=j0>John</seg> beats <seg crf=j0>his</seg> wife.
When the referent of an anaphoric expression is a member of the same type (kind, set, etc.) as the referent of an earlier expression, we use the ctp attribute.
You bought <seg id=c1>a car</seg>. I bought <seg ctp=c1>one</seg>, too.
You bought <seg id=c0>the car</seg>. I bought <seg ctp=c0>it</seg>, too.
In the latter example, `the car' refers to an instance of a specific model of car, and `it' refers to another instance of the same model.
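Since crf and ctp values are normally id values of other elements, an annotated text can be checked for dangling references. The following is a rough sketch, not a GDA tool: it uses Python's standard html.parser module because it tolerates unquoted attribute values such as id=j0, and the names ReferenceCollector and unresolved_references are hypothetical. As a simplifying assumption, the special referential indices (0p, 1p, nil, and so on) are not handled and would be reported as unresolved.

```python
from html.parser import HTMLParser
from typing import List, Set, Tuple


class ReferenceCollector(HTMLParser):
    """Collect id values and crf/ctp references from GDA-style markup."""

    def __init__(self) -> None:
        super().__init__()
        self.ids: Set[str] = set()
        self.links: List[Tuple[str, str]] = []  # (attribute name, target index)

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id":
                self.ids.add(value)
            elif name in ("crf", "ctp"):
                self.links.append((name, value))


def unresolved_references(text: str) -> List[Tuple[str, str]]:
    """Return the crf/ctp references whose target is not an id in the text."""
    collector = ReferenceCollector()
    collector.feed(text)
    collector.close()
    # Ids are checked only after the whole text is read, so a reference
    # may point forward as well as backward.
    return [(name, target) for name, target in collector.links
            if target not in collector.ids]
```

Applied to `<seg id=j0>John</seg> beats <seg crf=j0>his</seg> wife.' the function returns an empty list; if the id=j0 element were missing, it would return [("crf", "j0")].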
Zero anaphora is encoded by a relational attribute.
Tom visited <seg id=m1>Mary</seg>.
He <seg iob=m1>brought</seg> a present.
If a is a relational attribute, then a:ctp is a relational attribute, too, which will be useful for languages like Japanese in which zero pronouns abound.
Not only noun phrases but also verb phrases, sentences, and so on refer to objects, events, states of affairs, and so on. Here we introduce attributes to classify such references.
The rtyp attribute encodes the type of reference. Its values include:
In a generic reading, the predication concerns (default properties of) the whole kind referred to by the noun phrase in question. An accidental universal quantification, such as with `I know (all) the Emperors of Japan,' does not qualify as a generic reading. We do not distinguish the two types of generic reading: those such as with `Chickens evolved from dinosaurs' and those such as with `Chickens lay eggs.' This distinction is captured by classifying the predicates.
individual vs. stage reading?
Under construction
Most languages have grammaticized marking of tense, but Chinese, for instance, lacks tense marking, so tense tagging will be of great benefit for Chinese. Perhaps no language lacks grammaticized aspect marking, but aspect tagging could still be useful in some cases.
The tns attribute encodes tense. Its values are:
The asp attribute encodes aspect. Its values are:
Under construction
Scopes of quantifiers, negation, modal operators, conditional operators, parallel coordination, and so forth are encoded by the sce (scoping element) and sps (super-scope) attributes.
The sce value of an element A is the id value of another element B such that A is a scope of B. Here A must command B; an element commands another element when the former contains the latter, or contains an element which either points at the latter via a relational attribute or is pointed at by the latter via the dep attribute. For instance, the following annotation yields the interpretation that each of three collectors bought one and the same painting, so that this painting has been bought three times as far as the sentence entails.
<su sce=3m><np id=3m>Three collectors</np> have bought a painting.</su>
An optionally scope-introducing element, such as `three collectors' or `Tom and Mary,' actually introduces a scope only when it is pointed at via the sce attribute. In the above example, if sce=3m were absent, the interpretation would be that the three collectors jointly bought the painting, so that it was bought once.
The scopes of elements such as `every man' and `Tom or Mary,' which always introduce scopes, are assumed to be the minimal dominating <su> or <np> elements. For instance,
<su><np syn=p>Tom or Mary</np> came.</su>
means that Tom came or Mary came, where the scope of `Tom or Mary' is the entire sentence.
The sps value of element A is the id value of another element B which is disjoint from A, and B is the element introducing the minimal scope containing A's referent.
For instance, the following means that each of three men bought a car, entailing three possibly different cars bought.
<su sce=3m><np id=3m>Three men</np> bought <np sps=3m>a car</np>.</su>
Here the referent of `a car' is in the scope introduced by `three men.' Since there are three instantiations of this scope, corresponding to the three men, there are three possibly distinct cars, each of which was bought in one of those instantiations.
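The difference between such annotations can be read off mechanically from the sce and sps attributes. The sketch below is only an illustration under stated assumptions: it relies on Python's standard html.parser module, which accepts unquoted attribute values, and the names ScopeCollector and scope_summary are hypothetical. It merely reports which ids are named as scope introducers and which tags are situated in each scope; it does not check the command condition.

```python
from html.parser import HTMLParser
from typing import Dict, List, Set, Tuple


class ScopeCollector(HTMLParser):
    """Record which ids introduce scopes (sce) and what sits in them (sps)."""

    def __init__(self) -> None:
        super().__init__()
        self.introducers: Set[str] = set()       # ids named by some sce attribute
        self.situated: Dict[str, List[str]] = {}  # id -> tags whose sps names it

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "sce" in attributes:
            self.introducers.add(attributes["sce"])
        if "sps" in attributes:
            self.situated.setdefault(attributes["sps"], []).append(tag)


def scope_summary(text: str) -> Tuple[Set[str], Dict[str, List[str]]]:
    """Return (scope-introducing ids, elements situated in each scope)."""
    collector = ScopeCollector()
    collector.feed(text)
    collector.close()
    return collector.introducers, collector.situated
```

For the annotation with sps=3m on `a car,' the summary lists 3m as a scope introducer with one <np> situated in it; for an annotation carrying only sce=3m, the introducer is the same but nothing is situated in its scope.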
For another example, the de dicto reading of `Jane wants to marry a doctor,' which entails no specific doctor, is marked up as follows:
Jane <v id=w1>wants</v> to marry <np sps=w1>a doctor</np>.
Here the doctor is situated in the scope introduced by the modal operator `wants.' Being the head of the complement of `want,' `marry' is forced to be situated in the scope of `wants,' so the sps attribute need not be specified for `marry.' As for the other elements, absence of the sps attribute by default excludes them from all scopes, situating them in the maximum semantic context introduced by the minimal <q> or <quote> element or the whole document. So if `a doctor' in the above example lacked the sps attribute, the annotation would entail the de re reading involving a specific doctor.
Similarly, the reading of `every man loves a woman' in which `a woman' is in the outermost situation (that is, one woman is loved by all the men) is the default reading of the sentence. The other reading, in which the referent of `a woman' is in the body scope of `every man' (different men may love different women), is encoded by:
<np id=e0>Every man</np> loves <np sps=e0>a woman</np>
Under construction
The sem attribute takes a conceptual term as its value to disambiguate the meaning of the element. A good, free multilingual electronic ontology would be very useful for sense tagging.
Under construction
stl politeness
Under construction