[Cache from http://www.csl.sony.co.jp/person/nagao/gda/tagset.html; please use this canonical URL/source if possible.]


The GDA Tag Set
Draft Version 0.16 (July 6, 1998)

by HASIDA Kôiti and Christoph J. Neumann
This document is under frequent revision. Its latest version is available as http://www.etl.go.jp/etl/nl/GDA/tagset.html. See also the GDA Home Page.
Table of Contents
  1. Introduction
  2. Global Attributes
  3. Syntactic Tagging
  4. Tags
  5. Syntactic Constructions
  6. Semantic Relations
  7. Communicative Functions
  8. Coreferences
  9. Reference Types
  10. Tense and Aspect
  11. Scopes
  12. Word Senses
  13. Others

Tags, attributes, and their values are in bold face in the text. They are in red where they are defined. Examples are in a green typewriter font.

1. Introduction

This draft discusses the GDA (Global Document Annotation) tag set, providing the rationale for the tags and examples of how to use them. It may serve as a tagging manual for human annotators with ample knowledge of both theoretical and computational linguistics, but the actual tagging manual for most annotators must be provided separately from this draft.

The GDA tag set aims at making the semantic and pragmatic structure of electronic texts automatically recognizable. It is being developed so as to be easy to embed into the TEI, EAGLES, and HTML tag sets, so the meanings of the GDA tags should be maximally consistent with those three tag sets. Some tags are imported from them, but when such a tag is defined in two or more of them, the meaning in HTML is preferred to that in TEI and EAGLES, because GDA tags are expected to be embedded very often in HTML files. For example, although TEI uses <s> for encoding sentences, GDA uses <su> instead, because HTML uses <s> to strike a horizontal bar through text (like this).

This draft is published for the sake of public survey and evaluation. Empirical evaluation is necessary on both how useful the tags described below are for practical applications and how consistently people can annotate documents with those tags. We would like to improve the tag set by taking results of evaluations into account, before announcing it for public, extensive use in late 1998.

To optimize the benefit per cost of tagging, we try to design as simple a tag set as possible which captures enough content for practical applications. The semantic and pragmatic content of an utterance might be unlimitedly complex due to the complexity of the context, but an appropriate degree of complexity of the tag set can be identified, because present natural language technology can effectively process only limited sorts of information. For instance, tagging for metonymy may not be very useful. The tag set should go along with the contemporary state of the art. We can refine the tag set when more detailed tags become useful as technology advances.

Here we do not restrict ourselves to any single application, but try to capture as many aspects of language as seem sufficiently useful in at least one of translation, retrieval, summarization, question answering, case-based reasoning, and so on. Users interested in only some of these applications may want to use subsets of the tag set, but those subsets will be addressed in some other document. In this connection, the GDA tags are optional, because applications do not normally require exhaustive tagging. In fact, many relatively simple untagged sentences can be analyzed correctly by current technology.

The GDA tag set is not specific to any particular language, though the example passages below are all in English. The usage of the tags is subject to some customization for particular languages, but we want to use the same vocabulary for the sake of coordination across different languages. Of course different tagging manuals are necessary for different languages. However, we hope to design the tag set so that it is easy for you to write such a manual once you have understood the idea behind the tag set.

The tag set is not meant to address a linguistic theory, but to encode semantic and pragmatic structures of documents, probably remaining somewhat neutral in linguistic controversies. Encoding a semantic or pragmatic structure and capturing a linguistic generalization are different issues. In particular, syntactic generalization may be sacrificed very often, because syntax is not our primary concern but is used as a partial aid for recovering semantics and pragmatics. This could be justified because people probably have better intuition about semantics and pragmatics than about syntax. For instance, consider the tagged sentences below:

<n syn=p>Not Sue, but Kim</n> came in.

<n syn=p>Instead of Sue, Kim</n> came in.

The attribute syn=p means that the top-level syntactic structure of the element is a parallel construction, where `parallel' means that the maximal subelements of the element share the same syntactic dependencies with respect to the context outside the element. In this case, `Sue' and `Kim' are both regarded as a semantic subject of `came in' in both sentences. However, the underlying mechanisms are quite different between the two sentences. In the first, the shared dependency arises syntactically due to the coordinate structure, whereas in the second, it seems to arise from general inferences, because `instead of Sue' is probably an adjunct to the sentence rather than to `Kim,' granted that sentences such as `Instead of leaving, Kim came in' are fine. Thus the same form of tagging in the two sentences captures little linguistic generalization other than that concerning semantic structure. We thus intend that the <n> tag, among others, should capture not exactly real syntactic constituency but rather semantic constituency. Of course linguistic theories are very helpful in designing the tag set, but what is important is that the tag set can represent the semantic and pragmatic structure of a wide range of documents, not that it captures linguistic generalizations.

2. Global Attributes

All the GDA tags discussed below may have the following attributes, all of which are optional:

id
Unique identifier for the element. The value must begin with a letter and can contain letters, digits, hyphens, and periods.
lang
Language of the text in this element; if not specified, the language is assumed to be the same as in the surrounding context. The value should be a three-letter language identifier in ISO 639-2, such as eng (English) and jpn (Japanese).
next
Next element's id value in an aggregate.
prev
Previous element's id value in an aggregate.
  • <q id=q1 who=j1 next=q2>`<su id=s1 next=s2>If it rains,</su>'</q> <seg id=j1>John</seg> said, <q id=q2 prev=q1>`<su id=s2 prev=s1>I won't come.</su>'</q>
    dtrs
    id values of the daughter elements which do not explicitly occur in the element. The daughter elements must have link tags.
    These attributes except dtrs are straightforward imports from TEI.
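The constraint on id values stated above can be sketched as a small check. This is only an illustration: the function name is_valid_gda_id is hypothetical, and the sketch assumes that `letter' and `digit' mean ASCII characters, which this draft does not spell out.

```python
import re

# Assumption: "letter" and "digit" are ASCII; the draft does not say so explicitly.
_ID_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9.-]*")


def is_valid_gda_id(value: str) -> bool:
    """Return True if value begins with a letter and otherwise contains
    only letters, digits, hyphens, and periods."""
    return _ID_PATTERN.fullmatch(value) is not None
```

Under these assumptions, values such as q1 and s2 from the example above are accepted, while a value containing an underscore or white space is rejected.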

    3. Syntactic Tagging


    4. Tags

    Hereafter the speaker means not only the agent of a speech, but also the author of a written passage, the thinker of a thought, the performer of a sign language or a gesture, and so on, where the speech, the passage, the thought, etc. appear as tagged elements in the document. Similarly, the hearer means the addressee of them in the correspondingly wide sense.

    A referential index is a name. It is usually the id value of another element, but there are eight special referential indices which are not the id value of any element. These are 0p (generic people), 1p (first person (speaker/author) singular, or `I/my/me' ), 1pp (first person plural, or `we/our/us'), 1px (first person plural including second person, or `we/our/us including you/your/you'), 1px (first person plural excluding second person, or `we/our/us excluding you/your/you'), 2p (second person singular), 2pp (second person plural), and nil (nothing). They can be associated with other referential indices via who and whm attributes of <q> elements.

    A conceptual term is a concept id of some ontology or a term in some natural language. The former is of the form id or ont.id, and the latter of the form lang.term. A conceptual term of the simplest form id is a semantic relation defined natively in the GDA tag set. ont is the abbreviated name of an ontology specified in a preceding <ontology> element, and lang is a language identifier in ISO 639-2. 'eng.make fool of' is an example of a conceptual term which represents a term in English --- quotation marks are placed on both ends when the term part contains white space.
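The three forms of conceptual term described above might be told apart as in the following sketch. The names ConceptualTerm and parse_conceptual_term are hypothetical, as is the ontology abbreviation wn used in the usage note. The sketch assumes that the prefix ends at the first period and that a prefix is read as an ontology name only if it has been declared; the draft does not state how the two prefixed forms are to be distinguished.

```python
from typing import NamedTuple, Optional, Set


class ConceptualTerm(NamedTuple):
    kind: str              # "native", "ontology", or "language"
    prefix: Optional[str]  # ontology abbreviation or language code, if any
    body: str              # concept id or natural-language term


def parse_conceptual_term(text: str, ontologies: Set[str]) -> ConceptualTerm:
    # Drop the quotation marks used when the term part contains white space.
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "'\"":
        text = text[1:-1]
    prefix, sep, body = text.partition(".")
    if not sep:
        # No prefix: a relation defined natively in the tag set.
        return ConceptualTerm("native", None, text)
    # Assumption: a prefix naming a declared ontology wins; anything else
    # is read as an ISO 639-2 language identifier.
    kind = "ontology" if prefix in ontologies else "language"
    return ConceptualTerm(kind, prefix, body)
```

With no ontologies declared, 'eng.make fool of' is read as the English term `make fool of'; with a hypothetical ontology wn declared, wn.dog is read as a concept id of that ontology.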

    <div1> <div2> ··· <div7>
    Subdivision of document.
    type
    Conventional name of the division. The standard values are part, chapter, section, and subsection, but any other character string is possible for the value.
    GDA does not assign any special meaning to <div>, because it is used in HTML.
    <h0> <h1> ··· <h6>
    Title of the division. <title> is not used here, because HTML browsers hide <title> elements. A <divi> may contain an <hj>, where i and j need not be the same digit. <h0>, which is undefined in HTML, can be used to avoid HTML formatting effects.
    <p>
    Paragraph.
    <ss>
    Sequence of sentences.
    <q> <iq>
    Direct and indirect speech, thought, etc. The quotation marks, if any, of direct speech are included in the <q> element. <q> and <iq> elements often need not be syntactic constituents.
    type
    Type of the content matter, such as speech or thought. The values may be spoken, written, thought, sign (for sign language), and gesture.
    who
    Speaker. The value is a referential index.
    whm
    Hearer. The value is a referential index. The quoted matter is a monologue if who and whm attributes share the same value.
    who is in TEI but whm is unique to GDA.
  • Press the <q>`YES'</q> button.
    In this example, `YES' is interpreted as printed on the button, as if the button were saying 'YES.'
    <cite> <quote>
    Quotation. <cite> is defined in HTML so that its content is italicized in standard browsers. Defined in TEI but not in HTML, <quote> does not have such a formatting effect. <cite> and <quote> elements do not have to be noun phrases. They do not even have to be syntactic constituents when they are parts of the outer texts.
    <su>
    Sentence (sentential unit).
    <seg>
    Subsentential segment, which is a syntactic constituent. Used when the constituent cannot or need not be categorized by any of the following tags.
  • I <seg>saw a girl</seg> with a telescope.
    <cs>
    Complement sentence.
  • I think <cs>that it's OK</cs>.
    <n>
    Noun or noun phrase.
    <v>
    Verb or verb phrase.
    <aj>
    Adjective or adjectival phrase.
    <ad>
    Adverb, adverbial phrase, adnominal phrase, preposition, postposition, prepositional phrase, postpositional phrase, or determiner.
  • <n><ad>the</ad> <n>man</n></n>
    <mentioned>
    Word or phrase not used but mentioned. A <mentioned> element is probably a noun phrase in the outer context.
  • <mentioned>Long</mentioned> is a short word.
    <socalled>
    Word or phrase for which the speaker disclaims responsibility. Like <cite> and <quote> elements, <socalled> elements do not have to be syntactic constituents.
    <date>
    Date.
    value
    Value of the date in the format of ISO 8601.
    <time>
    Time of day.
    value
    Value of the time in the format of ISO 8601.
    <rs>
    General purpose name or referring string.
    type
    Type of the object referred to. The value is a conceptual term.
    <name>
    Proper noun or noun phrase.
    type
    Type of the object referred to. The value is a conceptual term.
  • <rs type=eng.human>Mr. <name>Brown</name></rs>
    <num>
    Number.
    type
    Type of numeric value. The values include int, real, float, ordinal, fraction, and percentage.
    value
    Value of the number in a standard form.
  • <num type=int value='21'>twenty one</num> <num type=percentage value='10'>10%</num>
  • <num type=ordinal value='2'>second</num>
  • <num type=fraction value='1/3'>one third</num>
    <address> <addr>
    Postal address. HTML browsers italicize <address> elements by default. <addr> should be used to avoid that effect.
    <abbr>
    Abbreviation.
    expan
    Expansion of the abbreviation.
    type
    Class of the abbreviation. Values include contraction, suspension, brevigraph, superscription, and acronym.
  • <abbr expan="Electrotechnical Laboratory" type=acronym>ETL</abbr>
    <gloss>
    Gloss or definition for another word or phrase.
    target
    Id value of glossed word or phrase.
  • <seg id=s0>soul</seg> (<gloss target=s0 lang=deu>Geist</gloss>)
    <el>
    Ellipsis containing head. The <el> elements must be empty. Zero anaphors must be encoded not with <el> but with relational attributes, as discussed later.
    ref
    Pointer to the filler.
    fil
    Filler.
  • Tom <seg id=love>loves</seg> Mary and <seg>Bob, <el ref=love> Sue</seg>.
    <idi>
    Idiom, proverb, or other idiosyncratic expression.
  • the event <idi prev=l1>to</idi> which I <idi id=l1>look forward</idi>
    <p> and the tags thereafter are called intradivisional tags. <cite>, <quote>, <q>, and <iq> are called quotational tags. <seg> and the tags thereafter are called intrasentential tags. The following table shows which tags (on the left) can contain which tags (on the right).

    Elements of these tags                      | can contain these tags
    <divi>                                      | all tags except <div1> ··· <divi>
    <hi>                                        | intradivisional tags
    <p> and <ss>                                | <ss>, <su>, and quotational tags
    quotational tags                            | all tags
    <su> and intrasentential tags except <el>   | quotational tags and intrasentential tags

    The following tags are used to encode ambiguities. In GDA, these tags are usually not manually handled, but instead automatically processed by computers. The elements of these tags are all empty, and can appear anywhere in the document. These tags except <anchor> are called link tags. Link elements (elements with link tags) are daughters of other elements only when they are referred to via the dtrs attribute.

    <anchor>
    Anchor point. The <anchor> element must have an id attribute, and assigns an identifier to a point in the document.
    <alt>
    Alternatives.
    targets
    The id values of the alternatives.
    weights
    The percentage probabilities of the corresponding alternatives.
    content
    id values of <anchor> elements. The content attribute specifies the virtual content of the element. For instance, the virtual content of <alt content='n0 n1 s0 s1' targets='v1 v2'> is `The idea' plus `that I should go' if the following text is in the same document file.
  • <anchor id=n0>The idea<anchor id=n1> occurred to me <anchor id=s0>that I should go<anchor id=s1>.
    In general, the virtual content of an element with attribute content='id_1 ··· id_2n' is the aggregate of the regions between the <anchor> elements with id values id_(2i-1) and id_(2i), for i from 1 to n. With an odd number of id values, content='id_1 ··· id_(2n+1)' is equivalent to content='id_1 ··· id_(2n+1) id_(2n+1)'.
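The pairing of anchors described above can be made concrete with a short sketch. The function name virtual_content is hypothetical. The sketch assumes that <anchor> elements are written exactly as in the example, with an unquoted id as their only attribute, and it leaves any other tags in the text untouched.

```python
import re
from typing import Dict, List

# Assumption: anchors look like <anchor id=NAME> with an unquoted id.
_ANCHOR = re.compile(r"<anchor id=([^>\s]+)>")


def virtual_content(marked_up: str, content: str) -> List[str]:
    """Return the regions of marked_up selected by a content attribute value."""
    # Record each anchor's offset in the text with the anchor tags removed.
    offsets: Dict[str, int] = {}
    plain_parts: List[str] = []
    plain_len = 0
    last = 0
    for match in _ANCHOR.finditer(marked_up):
        chunk = marked_up[last:match.start()]
        plain_parts.append(chunk)
        plain_len += len(chunk)
        offsets[match.group(1)] = plain_len
        last = match.end()
    plain_parts.append(marked_up[last:])
    plain = "".join(plain_parts)

    ids = content.split()
    if len(ids) % 2 == 1:
        # An odd-length list is read as if its last id were repeated.
        ids.append(ids[-1])
    # Pair the ids as (1st, 2nd), (3rd, 4th), ... and slice out each region.
    return [plain[offsets[a]:offsets[b]] for a, b in zip(ids[0::2], ids[1::2])]
```

Applied to the anchored sentence above with content='n0 n1 s0 s1', the function returns the two regions `The idea' and `that I should go'.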
    <linkdiv1> ··· <linkidi>
    Link tags corresponding to <div1> ··· <idi>.
    content
    Virtual content of the element. See above.
  • <su dtrs='va0'>I <anchor id=a0>saw <anchor id=a1>the girl <anchor id=a2>with a telescope<anchor id=a3>.</su>
    <alt id=va0 content='a0 a3' targets='vp1 vp2'>
    <linkv id=vp1 dtrs='np1'>
    <linkv id=vp2 dtrs='vp3 pp1'>
    <linkn id=np1 content='a1 a3' dtrs='pp1'>
    <linkv id=vp3 content='a0 a2'>
    <linkad id=pp1 content='a2 a3'>

    Here the <alt> element points to two <linkv> elements, which represent the two alternative parses of `saw the girl with a telescope.'
    The possible inclusion relations among the virtual elements of <linkdiv1> ··· <linkidi> are the same as those among the elements of <div1> ··· <idi> shown before.

    5. Syntactic Constructions

    The syn attribute encodes types of syntactic constructions. The possible values are the following.

    f
    Forward dependency.
  • <su><seg syn=f>I</seg><seg>'m sleepy</seg>.</su>
  • I have <seg syn=f>boiled eggs</seg>.
    The latter means that I have several boiled eggs, not that I have boiled them.
    fx
    Forward crossing dependency, or backward extraction.
  • <su syn=fx>Mary, I hate.</su>
    b
    Backward dependency.
    bx
    Backward crossing dependency, or forward extraction.
  • <su syn=bx>A man came <cs>who I don't know</cs></su>.
    a
    Apposition.
    fa
    Forward apposition. `A, B' forms a forward apposition if it can be paraphrased to `B, which is A' and so on.
    ba
    Backward apposition. `A, B' forms a backward apposition if it can be paraphrased to `A, which is B' and so on.
  • <seg syn=ba><seg>Dr. Brown,</seg> <seg>the founder of this institute</seg></seg>
  • <seg syn=ba>the two guys, <seg>Tom and Bill</seg></seg>
    p
    Parallelism or coordination. Two syntactic constituents are parallel when they share the same sorts of syntactic (semantic) dependency concerning the outer context.
  • Kim <seg syn=p><seg rel=pre>came</seg> and left</seg>.
    bp
    Backward dependency in which the constituents are parallel.
  • Kim ate <seg syn=bp>bread, <seg rel=cnc>though not egg</seg></seg>.
    fp
    Forward dependency in which the constituents are parallel.
  • <seg syn=fp><seg>Instead of Sue</seg>, Kim</seg> came.

    6. Semantic Relations

    The rel attribute encodes binary relations licensed by syntactic dependency or intersentential juxtaposition. They are relations between the depending element and the depended element, and include grammatical functions, thematic roles, and rhetorical relations. The distinction among these three types of relations is often vague. For instance, LOCATION counts as both a grammatical function and a thematic role. Although CAUSE is usually regarded as a rhetorical relation, it can also serve as a thematic role of phrases such as `due to lack of money.' This is why we conflate grammatical functions, thematic roles, and rhetorical relations, under the rel attribute. Among the values introduced below, cau, cnc, cnd, and so on, serve as both rhetorical relations and thematic roles.

    The value of the rel attribute is a relational term, which is a kind of conceptual term representing a binary relation. So rel is an open-class attribute, potentially encompassing all the binary relations lexicalized in natural languages. The rel attribute encodes a relationship in which the current element stands with respect to the element that it depends on, as in:

    go <seg rel=fin>to Paris</seg>
    In this connection, conceptual terms may also be values of the sem attribute, which is to encode word senses, as discussed later. So the sem attribute with a relational term as its value is attached to the function word, as in:
    go <seg sem=fin>to</seg> Paris

    A purpose of the rel attribute is to associate complement elements (subjects, objects, indirect objects, and so forth) with the corresponding arguments of verbs, adjectives, etc. To fulfill this, we employ a rather standard approach: the association is specified by marking elements with grammatical functions such as SUBJECT and OBJECT (sbj and obj below, respectively), provided that we have a dictionary containing the argument structures of verbs and so on. In many languages, there is usually no need to explicitly mark up complements such as subjects, objects, and indirect objects, because their grammatical functions are obvious from the surface forms and hence their thematic roles can be inferred from the dictionary. When the verb has multiple argument structures, such as with `Tom opens the door' (where `Tom' is the agent) and `The key opens the door' (where `the key' is the instrument), we can either mark up the subject noun phrases with the thematic role or mark up the verb in terms of the argument structure. Also, by using grammatical functions we do not have to worry about whether the subject of buy should be AGENT or GOAL, for example.

    The remaining purpose of the rel attribute is to resolve ambiguities of both the thematic roles of adjunct elements, which are typically prepositions and postpositions, and the rhetorical relations which are not explicitly marked. To attain this, we must simply mark up the elements in question with thematic roles and rhetorical relations. However, an exhaustive listing of thematic roles and rhetorical relations appears impossible, as is widely recognized. We are not yet sure how many thematic roles and rhetorical relations are sufficient for engineering applications such as machine translation, but as mentioned before, the appropriate granularity of classification will be determined by the current level of technology. The following list is by no means a definitive set of thematic roles and rhetorical relations, but we hope to improve it to arrive at a nearly optimal one.

    The native relational terms are enumerated below:

    agt
    Agent of action. This value will not be used very often, because sbj will suffice in most cases. In languages like English, even sbj will be unnecessary for the most part.
  • <su><seg rel=agt>Tom</seg> came.</su>
    ben
    Beneficiary.
  • a present <seg rel=ben>for you</seg>
    cau
    Cause, reason, or motivation.
  • <ss><su rel=cau>Tom came.</su> <su>Mary was surprised.</su></ss>
  • I went home <seg rel=cau>because I was sleepy</seg>
  • He died <seg rel=cau>of cancer</seg>.
    cmp
    Comparison.
  • Tom is as tall <seg rel=cmp>as Bill</seg>.
    cnc
    Concession.
  • <ss><su rel=cnc>Tom came.</su> <su>Mary wasn't surprised.</su></ss>
    cnd
    Condition.
  • I'll come <seg rel=cnd>if you're there</seg>
    cnt
    Content of thought, belief, speech, promise, rumor, plan, request, and so forth.
  • <seg>plan <seg rel=cnt>to visit Tokyo</seg></seg>
  • ask <seg rel=exp>her</seg> <seg rel=cnt>for a date</seg>
  • the <seg syn=b>fact <seg rel=cnt>that you're here</seg></seg>
    cntrst
    Contrast.
  • <ss>Tom came. <su rel=cntrst>However, Bill left.</su></ss>
    dir
    Direction.
  • I walked <seg rel=dir>to the north</seg>.
    dur
    Duration. The temporal extension of the event.
  • I was asleep <seg rel=dur>during his talk</seg>.
    `During his talk' in `I slept during his talk' has rel=time rather than rel=dur, if the intended interpretation does not entail that I was sleeping throughout his talk.
    eg
    Example.
  • expensive cars <seg rel=eg>such as Mercedes</seg>
    exp
    Experiencer.
  • It seems <seg rel=exp>to me</seg> that he left.
    fin
    Final point or goal.
    ini
    Initial point or source.
  • <seg syn=b>come <seg rel=ini>from above</seg></seg> [rel=pth:ini?]
    iob
    Indirect object.
  • <seg rel=iob>Who</seg> did you tell it?
    jnt
    Joint participant in the event.
  • You came <seg rel=jnt>with her</seg>.
    loc
    Location (typically spatial). Equivalent to pth:sup.
  • live <seg rel=loc>in Tokyo</seg>
    mnr
    Manner.
  • speak <seg rel=mnr>slowly</seg>
    mns
    Means or instrument.
  • survive <seg rel=mns>by eating grasses</seg>
  • commute <seg rel=mns>by car</seg>
    msr
    Measure.
  • weigh <seg rel=msr>two kilograms</seg>
    obj
    Syntactic object.
  • Mary beats <seg rel=obj>her husband</seg>.
    pat
    Patient.
  • eat <seg rel=pat>an apple</seg>
    pos
    Possessor.
  • a daughter <seg rel=pos>of mine</seg>
    pre
    Precedent.
  • She came <seg rel=pre>after he arrived</seg>.
    pst
    Postcedent.
    pth
    Path. The spatial extension of the event. In contrast, loc (equivalent to pth:sup) entails spatial inclusion (for instance, `walk in the garden' entails that the walking event is spatially included in the garden).
    pur
    Purpose.
  • I went there <seg rel=pur>to see her</seg>.
    ql
    Qualification.
  • I scolded her <seg rel=ql>as her father</seg>.
    rec
    Recipient.
  • <seg syn=b>give <seg rel=rec>him</seg> the book</seg>
    res
    Result. A resulting event or object.
  • Tom is gone, <seg rel=res>so that I'm alone</seg>.
  • Sue built <seg rel=res>a house</seg>.
  • Kim turned the car <seg rel=res>to garbage</seg>.
    restat
    Restatement.
  • New York <seg rel=restat>or Big Apple</seg>
  • <ss>Tom is gone. <su rel=restat>He escaped.</su></ss>
    sbj
    Syntactic subject.
  • <su><seg rel=sbj>Spring</seg> has come.</su>
  • want <seg rel=sbj>you</seg> <seg rel=cnt>to come</seg>
    sub
    Subset, part, or element.
    sum
    Summary.
    sup
    Inverse of sub. Includer of any sort: superset as to subset, whole as to part, set as to element.
    time
    Temporal location. Equivalent to dur:sup.
  • I was born <seg rel=time>in 1958</seg>.
    Relational terms can combine to make compound relational terms. There are two types of combination. If a and b are relational terms, then a:b and a_b are relational terms, too. The operator `:' has precedence over `_.' That is, a:b_c is the combination of a:b and c through `_.'
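The precedence rule just stated can be sketched as a two-level split. The function name parse_relational_term is hypothetical, and the sketch assumes that the names of simple relational terms contain neither `:' nor `_', which holds for the native terms listed above but is not guaranteed for terms drawn from other ontologies or languages.

```python
from typing import List


def parse_relational_term(term: str) -> List[List[str]]:
    """Split a compound relational term into its parts.

    The outer list holds the operands of `_'; each inner list holds the
    operands of `:' within one of them, so that `:' binds more tightly.
    """
    return [part.split(":") for part in term.split("_")]
```

For example, a:b_c yields an outer list of two operands, the first of which is the pair a, b and the second the single term c.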

    a:b represents the composition of a and b as binary relations. That is, x and z stand in relation a:b, if and only if there exists y such that x and y stand in relation a and y and z stand in relation b.

    Note that `until noon' in the first example has rel=dur:fin but not rel=time:fin. Since time is equivalent to dur:sup, time:fin is equivalent to dur:sup:fin, which is the final point of a superset of the duration of the event. The sentence entails that Tom was there at noon, but time:fin fails to entail it.

    a_b is the intersection of a and b as sets; a binary relation is a subset of the Cartesian product of two sets. That is, x and y stand in relation a_b, if and only if x and y stand in both relations a and b. This is used when the thematic or rhetorical relation between two syntactic elements cannot be sharply classified into any single category.

    I came <adp rel=res_pur>so that I met him</adp>.
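Treating binary relations as sets of pairs, the two combinations defined above can be written out directly. This is a sketch for illustration only: the function names compose and intersect are hypothetical, and the pairs in the usage note are toy data, not drawn from any GDA resource.

```python
from typing import Hashable, Set, Tuple

Relation = Set[Tuple[Hashable, Hashable]]


def compose(a: Relation, b: Relation) -> Relation:
    """a:b -- x and z are related iff some y has (x, y) in a and (y, z) in b."""
    return {(x, z) for (x, y1) in a for (y2, z) in b if y1 == y2}


def intersect(a: Relation, b: Relation) -> Relation:
    """a_b -- x and y are related iff (x, y) is in both a and b."""
    return a & b
```

With a = {(1, 2), (3, 4)} and b = {(2, 5), (4, 6)}, compose(a, b) relates 1 to 5 and 3 to 6, while intersect keeps only the pairs that the two relations have in common.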

    In `Kim likes Mary better than Betty,' for instance, we must specify whether `Betty' is compared with `Kim' or `Mary.' Similarly, in `Kim blamed Mary together with Betty,' we want to mark whether Kim and Betty blamed Mary or Kim blamed Mary and Betty. To implement this in general, we use extended relational terms of the form a-b, where a and b are relational terms but not extended relational terms. a is a relational term such as cmp and jnt, and b is a relational term to indicate which element is in parallel with the current element, as in what follows:

    The relational terms are used as attributes as well, which we will call relational attributes. While the rel attribute appears in the depending element (the satellite in the case of rhetorical relations), the relational attribute appears in the depended element (the nucleus) and points to the depending element. Namely, the value of the relational attribute is the referential index of the element, if any, which semantically or pragmatically depends on the element containing this relational attribute. Of course the attribute name indicates the type of the dependency.


    7. Communicative Functions

    Communicative functions including speech acts are encoded by the com attribute. Its values are classified into properties (unary relations) and binary relations. The former are the following:

    info
    New information.
    ord
    Order.
    req
    Request.
    ofr
    Offer.
    qyn
    Yes/no query.
    qwh
    WH query.
    smn
    Summons.
    abu
    Abuse.
    blm
    Blame.
    The binary relations are:
    ack
    Acknowledgment.
    r
    Reply.
    rw
    Reply WH.
    ry
    Reply YES or acceptance.
    rn
    Reply NO or refusal.
    This is admittedly far from the complete set of communicative functions. We would like to incorporate the result of the ongoing international effort toward the design of discourse tags.

    Under construction


    8. Coreferences

    Anaphora by an overt expression such as a pronoun or a definite description is encoded by the crf attribute, which takes a referential index as its value.

    <seg id=j0>John</seg> beats <seg crf=j0>his</seg> wife.
    When the referent of an anaphoric expression is a member of the type (kind, set, etc.) which the referent of a former expression is a member of, we use the ctp attribute.
    You bought <seg id=c1>a car</seg>. I bought <seg ctp=c1>one</seg>, too.

    You bought <seg id=c0>the car</seg>. I bought <seg ctp=c0>it</seg>, too.

    In the latter example, `the car' refers to an instance of a specific model of cars, and `it' refers to another instance of the same model.

    Zero anaphora is encoded by relational attributes.

    Tom visited <seg id=m1>Mary</seg>. He <seg iob=m1>brought</seg> a present.
    If a is a relational attribute, then a:ctp is a relational attribute, too, which will be useful for languages like Japanese in which zero pronouns abound.

    9. Reference Types

    Not only noun phrases but also verb phrases, sentences, and so on refer to objects, events, states of affairs, and so on. Here we introduce attributes to classify such references.

    rtyp is to encode type of reference. Its values include:

    gn
    Generic or attributive.
  • <seg rtyp=gn>Dinosaurs</seg> are extinct.
  • dance like <seg rtyp=gn>a butterfly</seg>
    sg
    Singular.
  • Give me <seg rtyp=sg>your fish</seg>.
    pt
    Partitive.
  • Give me <seg rtyp=pt>water</seg>.
    pl
    Plural.
    du
    Dual.
    plgn
    Plural generic.
  • <seg rtyp=plgn>These cars</seg> are expensive. (Several models of cars are entailed here.)
    dugn
    Dual generic.
    Of course sg, pl, and du are for countable nouns only, and pt is for uncountable nouns only.

    In a generic reading, the predication concerns (default properties of) the whole kind referred to by the noun phrase in question. An accidental universal quantification, such as with `I know (all) the Emperors of Japan,' does not qualify as a generic reading. We do not distinguish the two types of generic reading: those such as with `Chickens evolved from dinosaurs' and those such as with `Chickens lay eggs.' This distinction is captured by classifying the predicates.

    individual vs. stage reading?

    Under construction


    10. Tense and Aspect

    Most languages have grammaticized marking of tense, but Chinese, for instance, lacks tense marking, so tense tagging will be of great benefit in Chinese. Perhaps no language lacks grammaticized aspect marking, but aspect tagging could be useful in some cases.

    The tns attribute encodes tense. Its values are:

    futr
    Future.
    past
    Past, including historical present.
  • Brutus <v tns=past>murders</v> Caesar.
    pres
    Present.
  • He <v tns=pres>could do it</v>.
    The asp attribute encodes aspect. Its values are:

    tel
    Telic.
    prf
    Perfect.
    npt
    Non-perfect telic.
    atel
    Atelic.
    prog
    Progressive.
    stat
    Stative.
    prf and npt are special cases of tel, and prog and stat are special cases of atel. Perhaps we do not need to subdivide atel.
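The subsumption among asp values stated above can be recorded in a small table, for example to let a query for tel also match elements tagged prf or npt. The names ASPECT_PARENT and is_special_case_of are hypothetical, and the sketch covers only the six values listed in this draft.

```python
# Each specific aspect value mapped to the general value it is a special case of.
ASPECT_PARENT = {"prf": "tel", "npt": "tel", "prog": "atel", "stat": "atel"}


def is_special_case_of(value: str, general: str) -> bool:
    """Return True if value equals general or is listed as a special case of it."""
    return value == general or ASPECT_PARENT.get(value) == general
```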

    Under construction


    11. Scopes

    Scopes of quantifiers, negation, modal operators, conditional operators, parallel coordination, and so forth are encoded by the sce (scoping element) and sps (super-scope) attributes.

    The sce value of an element A is the id value of another element B such that A is a scope of B. Here A must command B; an element commands another element when the former contains the latter, or contains an element which either points at the latter element via a relational attribute or is pointed at by the latter element via the dep attribute. For instance, the following annotation entails the interpretation that each of three collectors bought one and the same painting, so that this painting has been bought three times as far as the sentence entails.

    <su sce=3m><np id=3m>Three collectors</np> have bought a painting.</su>

    An optionally scope-introducing element, such as `three collectors' and `Tom and Mary,' actually introduces a scope only when pointed at via the sce attribute. In the above example, if sce=3m were absent, the interpretation would be that the three collectors cooperatively bought a painting, so that the painting was bought once.

    The scopes of elements such as `every man' and `Tom or Mary,' which always introduce scopes, are assumed to be the minimal dominating <su> or <np> elements. For instance,

    <su><np syn=p>Tom or Mary</np> came.</su>
    means that Tom came or Mary came, where the scope of `Tom or Mary' is the entire sentence.

    The sps value of element A is the id value of another element B which is disjoint from A, and B is the element introducing the minimal scope containing A's referent.

    For instance, the following means that each of three men bought a car, entailing three possibly different cars bought.

    <su sce=3m><np id=3m>Three men</np> bought <np sps=3m>a car</np>.</su>
    Here the referent of `a car' is in the scope introduced by `three men.' Since there are three instantiations of this scope, corresponding to the three men, there are three possibly distinct cars each of which was bought in one of those instantiations.

    For another example, the de dicto reading of `Jane wants to marry a doctor,' which entails no specific doctor, is marked up as follows:

    Jane <v id=w1>wants</v> to marry <np sps=w1>a doctor</np>.
    Here the doctor is situated in the scope introduced by the modal operator `wants.' Being the head of the complement of `want,' `marry' is forced to be situated in the scope of `wants.' So the sps attribute need not be specified for `marry.' As for the other elements, absence of the sps attribute by default excludes the elements from all scopes, situating them in the maximum semantic context introduced by the minimal <q> or <quote> element or the whole document. So if `a doctor' in the above example lacked the sps attribute, then it would entail the de re reading involving a specific doctor.

    Similarly, the reading of `every man loves a woman' in which `a woman' is in the outermost situation (that is, one woman is loved by all the men) is the default reading of the sentence. The other reading, in which the referent of `a woman' is in the body scope of `every man' (different men may love different women), is encoded by:

    <np id=e0>Every man</np> loves <np sps=e0>a woman</np>
    Under construction

    12. Word Senses

    The sem attribute takes a conceptual term as its value to disambiguate the meaning of the element. A good, free multilingual electronic ontology would be very useful for sense tagging.

    Under construction


    13. Others

    stl politeness

    Under construction

