A Conceptual Modeling Language for the Analysis and Interpretation of Text

Gary F. Simons
Summer Institute of Linguistics
7500 W. Camp Wisdom Rd.
Dallas, TX 75236
Internet: gary@txsil.lonestar.org

Text Encoding Initiative
Committee on Text Analysis and Interpretation
Document Number: TEI AIW12
January 16, 1990
Version 1, January 17, 1990

CONTENTS

1. The justification for a higher-level metalanguage
     Why syntax without semantics sometimes works
     When syntax without semantics does not work
     From syntactic modeling to conceptual modeling
2. The specifications for a proposed conceptual modeling language
     Element class definitions
     Value set definitions
     Possible directions for enhancement
3. Some example conceptual models
     A basic dictionary
     A glossed and translated text
4. The SGML encoding of analyses based on conceptual models
     The general encoding rules
     An encoded dictionary entry
     An encoded sentence with translation and glosses
5. The problem of links

ABSTRACT

This document proposes a conceptual modeling language which could provide a framework for designing encoding schemes for the linguistic analysis and interpretation of text. Note the focus on "designing encoding schemes." The December 1989 meeting of the TEI-ANA committee concluded that the requirements for encoding linguistic analysis of text are significantly more complex than the requirements for encoding the text itself. While the metalanguage built into SGML (namely, the language for document type definitions) is adequate for expressing the design of encoding for the text itself, it is not adequate for expressing the design of encoding for text analysis. The committee thus concluded that we needed to begin by designing a metalanguage that would allow us to express the design of encoding schemes for text analysis. This document seeks to explain why this is needed and then gives an initial proposal for such a metalanguage with examples from two domains.

1. THE JUSTIFICATION FOR A HIGHER-LEVEL METALANGUAGE

SGML provides a metalanguage for specifying the syntax of markup, but ignores the semantics of markup. This attempt to justify the need for a higher-level metalanguage which addresses semantics as well as syntax proceeds in three steps. First, it explains why syntax without semantics works for some basic SGML applications. Then it shows why syntax without semantics is unacceptably weak for applications in text analysis and interpretation. Finally, it proposes that the needed metalanguage can be viewed as an instance of a conceptual modeling language -- a device that is gaining momentum in some modern database circles.

Why syntax without semantics sometimes works

The basic model of text embodied in garden variety SGML applications (like the AAP tag sets for books, articles, and journals) is that all the bottom-level data elements come from one domain (namely, the words and punctuation of a particular language), that these data elements are arranged in an inherently ordered sequence, and that the main problem of document markup is to encode the fact that the content of the text also involves an organization in which lower-level elements nest inside of higher-level ones to form a hierarchy of text elements.

The SGML metalanguage for expressing the design of text encoding applications, namely the document type definition (DTD) mechanism, models these concepts directly. The domain of words and punctuation is called "character data," or CDATA. The ordering inherent in the data is represented by linear sequence in the text file. <!ELEMENT> definitions declare the kinds of higher-level elements that are allowed, and then define what each element type is in terms of constraints on the patterns by which other elements or CDATA are allowed to nest within it. As Sperberg-McQueen and Burnard have observed in TEI document EDW5, SGML and the DTD mechanism provide only a syntactic definition of the element tags.
Consequently, SGML can guarantee syntactic integrity, but it is the responsibility of the user (or the application software) to ensure semantic integrity.

The SGML approach of enforcing syntactic integrity while ignoring semantic integrity works well in the world of conventional documents. This is because the semantics of the markup bears an implicit one-to-one mapping to its syntax. The semantic domain of the words and punctuation of the author's language maps directly onto the syntactic domain of CDATA. The semantic concept of order in the flow of the conceptual text maps directly onto the linear sequence of elements and CDATA in the encoded text.

When syntax without semantics does not work

The approach of explicit syntax with implicit semantics breaks down when there is not a one-to-one mapping between semantic and syntactic notions. The domains under the purview of the TEI-ANA committee, namely, the encoding of dictionaries and of the linguistic analysis of text, are certainly ones in which there is no such simple mapping. I illustrate this in terms of three aspects of TEI-ANA documents which are not present in garden variety SGML documents: multiple domains for the fundamental data elements, nonlinearity among the elements of conceptual objects, and nonhierarchy in the relationships between objects.

1. Dictionaries provide a good example of how the fundamental data elements may come from many domains. Whereas the fundamental data elements of a garden variety text come from one domain, namely, the domain of words and punctuation in the author's language, in a dictionary they come from many domains. The lemma field comes from the domain of words of the language (normally with constraints on morphological form). The pronunciation field comes from the domain of a phonetic transcription alphabet. The part of speech field comes from the domain of grammatical category abbreviations.
In a monolingual dictionary the definition field comes from the domain of free text; in a bilingual dictionary it is constrained to be free text in the second language. Cross-reference fields (like antonym and synonym) come from the domain of words that are lemmas in the dictionary. The representation of text analysis is similarly full of multiple domains.

These domain constraints are an important aspect of the semantic integrity of a dictionary or a text analysis. The designer of an encoding scheme should be required to specify these constraints, both because it is an important part of the documentation for someone else who would use the encoded dictionary, and because application software could use a formal statement of such constraints to go beyond the syntactic integrity provided by SGML to provide some semantic integrity.

2. Dictionaries (when viewed as stores of lexical information rather than as books to read) also illustrate that the true semantic relationship between consecutive markup elements may be one of simultaneity rather than linearity. Whereas the subelements within elements in a garden variety text (such as sentences within paragraphs, paragraphs within sections, and so on) are intrinsically ordered, the subelements within a lexical entry are not. Leaving aside the complexities of multiple senses, let us assume that a lexical entry contains a headword, a pronunciation, an etymology, a part of speech, a definition, a usage note, and different kinds of cross references. The semantic relationship between these parts is very different from the relationship between, say, sentence elements as the parts of a paragraph element. In the latter case, the subelements are related by their position in a linear sequence; the order in which they appear is actually part of the information.
In the former case, the subelements are related as simultaneous attributes of the higher-level element; the order in which they appear is not part of the information and does not really matter. (When dictionaries are printed, a particular order is chosen and consistently imposed to enhance readability, but this is a feature of the chosen display format, not of the inherent information structure.)

The same phenomenon is common in analyzed text. In another paper (Simons 1987), I have discussed the way in which an analyzed text encodes information from a number of simultaneous dimensions. Whereas a plain sentence is simply a sequence of plain words, an analyzed sentence might be a simultaneous bundle of attributes like its contents (which would be a sequence of analyzed words), a syntactic parse, a free translation, and comments on interesting grammatical features. An analyzed word might further be a simultaneous bundle of attributes like an orthographic representation, an underlying phonological form, a part of speech, and a gloss.

There is thus a semantic distinction between subelements which are sequentially ordered subparts of the containing element, versus subelements which are simultaneous attributes of the containing element. The designer of an encoding scheme should be required to declare these distinctions, both because it is an important part of the documentation for someone else who would use the encoded information, and because application software could use a formal statement of such declarations to treat elements in sequence differently from elements as simultaneous attributes.

3. In garden variety documents for printed publication, the relationships between elements are strictly hierarchical. For instance, there are words within sentences within paragraphs within subsections within sections within chapters.
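The distinction can be made concrete with a short sketch (a hypothetical Python representation invented for illustration, not part of the proposal): ordered subparts map naturally onto a list, where position carries information, while simultaneous attributes map onto a dictionary, where key order carries none.

```python
# Hypothetical illustration: ordered subparts vs. simultaneous attributes.

# A paragraph's sentences are ordered subparts: reordering them
# changes the information, so a list is the natural carrier.
paragraph = ["The dog barked.", "Then it slept."]

# A lexical entry's fields are simultaneous attributes: the dict
# carries the same information no matter the order of the fields.
entry_a = {"hw": "run", "ps": "vi", "df": "to move quickly"}
entry_b = {"df": "to move quickly", "hw": "run", "ps": "vi"}

assert entry_a == entry_b                       # same information
assert paragraph != list(reversed(paragraph))   # order is information
```

The attribute names (hw, ps, df) echo the dictionary example developed in section 3 below.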
The relationship between elements of different levels is always that each lower-level element is included in exactly one higher-level element. The mapping from higher-level units to included lower-level ones is always one-to-many. A linguistically analyzed text, however, may involve both one-to-many and many-to-one mappings between the same two kinds of elements. For instance, a single lexeme might be realized by many words, just as a single word might realize many lexemes. Or, a single morpheme might be realized by many phonemes, just as a single phoneme might be simultaneously involved in the realization of many morphemes.

Many-to-one mappings also occur when alternative analyses are encoded. For instance, if there is more than one way to parse a sentence, the sentence (perhaps encoded as a sequence of analyzed word elements) should be stored just once and each of the multiple parses should refer back to the same sequence of words.

The problems of many-to-one mappings and of alternative analyses are ones discussed in Santorini's working paper (AIW2) on concerns of the syntax subcommittee. Another problem they mention which will require the same mechanism is the problem of discontinuous constituents; for instance, a syntactic parse may need to indicate that the preposition at the end of a sentence is actually part of the verb in the middle of the sentence. Related to this problem is the skewing in linear order when the same subelements are involved as children of different kinds of elements. For instance, the semantic representation of the predicate-argument structure of a sentence might reorder the constituents with respect to their surface order and their arrangement in the syntactic parse tree.

These are all problems which require a mechanism for expressing links between elements. SGML does have the id/refid mechanism for handling links, but it has the major shortcoming that there is only a single, global name space for element identifiers.
The SGML parser can verify that references are syntactically valid, but it can do nothing to check the semantic integrity of links, that is, to ensure that the links are pointing to the right sorts of things. For instance, if a terminal node in a syntactic parse is supposed to point to the word it represents, then ideally, processing software should be able to ensure that the endpoint of the link is indeed a word (and not some other kind of element) and that it is a word of the sentence being parsed. Conceptually, the link is constrained as to both type and scope.

Whereas in garden variety documents the relationships between elements fit a hierarchical pattern in which all links are implicit in the hierarchical inclusion, the linguistic analysis of text is very different. It involves many-to-many mappings between elements which result in a rich network of relationships that requires the explicit expression of linkages in an SGML markup. The designer of an encoding scheme for such analyses should be required to declare the type and scope constraints on the links, both because it is an important part of the documentation for someone else who would use the encoded information, and because application software could use a formal statement of these constraints to help ensure the integrity of the markup.

From syntactic modeling to conceptual modeling

The basic premise of this working paper is that the syntactic modeling of markup that the DTD mechanism of SGML provides is not strong enough for the needs of text analysis and interpretation.
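As a sketch of what such semantic checking could look like, the following hypothetical Python fragment validates both the type and the scope of a link from a parse terminal. The element representation and function name are invented for illustration; they are not part of the proposal.

```python
# Hypothetical link checker: a parse terminal must point to an
# element of class "word" (type constraint), and that word must
# belong to the sentence being parsed (scope constraint).

def check_link(link_target, expected_class, scope_elements):
    """Return True when the link endpoint has the right type
    and lies within the given scope."""
    return (link_target.get("class") == expected_class
            and link_target in scope_elements)

sentence_words = [{"class": "word", "form": "dogs"},
                  {"class": "word", "form": "bark"}]
other_word = {"class": "word", "form": "cat"}

assert check_link(sentence_words[0], "word", sentence_words)
assert not check_link(other_word, "word", sentence_words)         # wrong scope
assert not check_link({"class": "punc"}, "word", sentence_words)  # wrong type
```

An SGML parser with only a global ID space can verify none of this; it sees only that some element with the referenced identifier exists.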
Syntactic modeling works for garden variety documents because there is essentially a one-to-one mapping from the syntax to the semantics of the markup: CDATA corresponds to the single domain of words and punctuation in the author's language, linear sequence of information in the data file corresponds to the linear sequence of the text elements in the published form, and the inclusion of elements within elements corresponds to the only kind of relationship between elements of different levels, namely, hierarchy. The one-to-one mapping from syntax to semantics breaks down drastically in dictionaries and linguistic analyses, where it is the rule rather than the exception that the information comes from many domains, that many information elements are bundles of simultaneous attributes, and that two kinds of elements are related by many-to-many mappings.

To make markup work for these kinds of documents we need to go beyond the syntactic modeling provided by the DTD mechanism of SGML and begin dealing with semantics in the specification of encoding schemes. Though this may be a new idea in text encoding circles, it is only a new application of well-established ideas from other disciplines like database management, programming language theory, and artificial intelligence. In the next few paragraphs I trace some of these relationships.

Note: [For this first draft, I will not take the time to look up all the references and give a well-reasoned literature review. I will certainly do so in a subsequent draft if the paper is well enough accepted to warrant further development of it. The basic ideas I want to refer to are the use of integrity constraints, schemas, and data description languages in database management; the development of the object-oriented view of information in programming language theory; and knowledge representation devices like frames and attribute-value pairs used in AI.]
In the past few years, these streams of thought in database management, programming languages, and artificial intelligence have come together to found the notion of the object-oriented knowledge base (xxx, yyy). The information stored in the knowledge base is either information about particular instances of objects or information about the classes of objects in general, which is used not only to validate instances which are created but also to guide applications like reasoning with the stored knowledge. The general knowledge about object classes is expressed formally in what some authors have called a "Conceptual Modeling Language" (zzz). The idea inherent in this name is that the object classes and their attributes correspond to concepts of the target domain of knowledge, and that the formal definition of the characteristics and relationships of these classes and attributes constitutes a conceptual model of the knowledge domain.

In applying this notion to the encoding of text analysis and interpretation, I am suggesting that a good encoding scheme must reflect a good conceptual model. Therefore the first step in designing an encoding scheme should be to develop a formal conceptual model for the problem domain. That formal conceptual model would then serve three functions: (1) it would serve as an important part of the user documentation about the encoded text, (2) it would serve as the basis from which the syntactic model for the encoding (namely the DTD) could be generated automatically, and (3) it would serve application software as a semantic model of the encoded text.

The design of the conceptual modeling language proposed in the next section draws heavily from the conceptual modeling language under development in the CELLAR (Computing Environment for Linguistic, Literary, and Anthropological Research) project of the Summer Institute of Linguistics (Dallas, TX).
That project is described in some unpublished working papers by Gary Simons, John Thomson, Steve DeRose, and John Boyland.

2. THE SPECIFICATIONS FOR A PROPOSED CONCEPTUAL MODELING LANGUAGE

The specification is given in terms of a BNF-style grammar with running commentary. In the productions, the metacharacters have the following meanings:

     < >   encloses the name of a nonterminal
     " "   encloses literal characters
     ::=   the "rewritten as" operator
     |     separates alternatives
     *     zero or more occurrences of the preceding
     +     one or more occurrences of the preceding
     ?     zero or one occurrence of the preceding
     ( )   grouping to clarify the scope of | and *

Words joined by hyphens and not bracketed by < > or " " represent informal descriptions of terminal strings. Details of the placement of whitespace, including newlines, are ignored in the specification.

A conceptual model is a series of declarations which define either a class or a value set.

     <conceptual model> ::= <declaration>+
     <declaration> ::= <class definition> | <value set definition>

Element class definitions

The definition for a class of elements is introduced with the keyword "class" and associates a class name with a set of attributes. The class name is a string of alphanumeric characters (with no spaces). Since the class name serves as the tag in SGML markup, it is usually kept short. When it is an abbreviation for a longer name, the long name is specified following the keyword "for". The short and long names may be used interchangeably in attribute definitions. The list of attribute definitions is separated by commas and terminated by a period.

     <class definition> ::= "class" <class name> ("for" <long name>)?
                            "has" <attributes> "."
     <class name> ::= string-of-alphanumeric-characters
     <long name> ::= string-of-alphanumeric-chars-plus-space
     <attributes> ::= <attribute definition> ("," <attribute definition>)*

An attribute definition consists of an attribute name followed by an optional long name followed by a semantic constraint on values of the attribute. The attribute name is meant to be a short string that is used in the markup. When the significance of the name is not self-documenting, a long name introduced by the keyword "for" is added. The short and long names may be used interchangeably in other attribute definitions. A colon marks the end of the name part of the definition and the beginning of the constraint part.

     <attribute definition> ::= <attribute name> ("for" <long name>)?
                                ":" <semantic constraint>
     <attribute name> ::= string-of-alphanumeric-characters

The semantic constraint limits the permissible values of the attribute with respect to quantity and with respect to type. The quantity constraint specifies whether only a single value is allowed, or multiple values. In the case of multiple values, they are constrained to comprise a set (in which duplicates are not allowed and ordering is not significant) or a sequence (in which duplicates are allowed and ordering is significant). The type constraint limits the attribute values to be members of a specified value set, instances of a particular class, or references (that is, links or pointers) to instances of a particular class. Reference values can optionally be further constrained to point only to instances of the named type in a particular scope. The problem of scopes and how they are specified is taken up in section 5 below.

     <semantic constraint> ::= <quantity> <type>
     <quantity> ::= "single" | "set of" | "sequence of"
     <type> ::= "member of" <value set name>
              | "instance of" <class name>
              | "reference to" <class name> ("in" <scope>)?
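To show that this constraint sublanguage is mechanically parsable, here is a minimal Python sketch (my own illustration, not part of the proposal) that splits a semantic constraint string into its quantity, type keyword, target name, and optional scope:

```python
# Minimal parser for <semantic constraint> strings as defined above.
# Returns (quantity, type-keyword, target-name, scope-or-None).

def parse_constraint(text):
    text = " ".join(text.split())            # normalize whitespace
    for quantity in ("set of", "sequence of", "single"):
        if text.startswith(quantity + " "):
            rest = text[len(quantity):].strip()
            break
    else:
        raise ValueError("bad quantity: " + text)
    for kind in ("member of", "instance of", "reference to"):
        if rest.startswith(kind + " "):
            target = rest[len(kind):].strip()
            break
    else:
        raise ValueError("bad type: " + rest)
    scope = None
    if kind == "reference to" and " in " in target:
        target, scope = target.split(" in ", 1)
    return quantity, kind, target, scope

assert parse_constraint("single member of LanguageWord") == \
    ("single", "member of", "LanguageWord", None)
assert parse_constraint("sequence of instance of entry") == \
    ("sequence of", "instance of", "entry", None)
assert parse_constraint("set of reference to sense in entry") == \
    ("set of", "reference to", "sense", "entry")
```

A real implementation would, of course, parse whole declarations rather than isolated constraint strings; this fragment only demonstrates that the grammar is unambiguous at this level.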
Value set definitions

The definition for a set of simple values (that is, values that are not links or structured elements) is introduced with the keyword "value set". The definition associates a value set name with the primitive type it is based on, plus an optional further constraint.

     <value set definition> ::= "value set" <value set name> "is any"
                                <primitive type> (<value constraint>)? "."
     <value set name> ::= string-of-alphanumeric-characters

For the present, the set of primitive types will contain only numbers, dates, and strings. (Other textual information like currency or time of day could be defined, as well as non-textual information, like sound or graphics.) It is assumed here that numbers and dates will be represented as ASCII character strings, not as binary values. Built-in primitive types of numbers include common sets like counting numbers, whole numbers, integers, real numbers, and floating-point numbers (that is, numbers expressed in scientific notation); other types would be possible. Dates are declared as to format. A few possible format designators are given; existing standards need to be consulted. Primitive string values are typed according to the character sets they draw from. If no character set is named, then ASCII is assumed. The character set may be an accepted standard, or it may be a special-purpose set defined by the user (as long as it too is documented). The value set definition for language data should always tell the language of the string, either in the <character set name> or in the informal description of the <value constraint>.

     <primitive type> ::= <numeric type> | <date type> | <string type>
     <numeric type> ::= "counting number" | "whole number" | "integer"
                      | "real number" | "floating point"
     <date type> ::= "date in" <date format>
     <date format> ::= "MM/DD/YY" | "DD-MM-YY" | "YYMMDD"
                     | "Month Day, Year" | "Day Month Year"
     <string type> ::= "string" ("in" <character set name>)?
     <character set name> ::= ISO-or-other-standard-name
                            | user-defined-name-for-user-documented-char-set

The value constraint may be a list enumerating all the members of the set, it may be a boolean expression which must be true of any member of the set, or, lacking any formal means of specifying the constraint, it may be an informal prose statement.

     <value constraint> ::= "from" <enumerated set>
                          | "where" <boolean expression>
                          | "where, informally," <description>

An enumerated set is declared with the keyword "from" followed by a series of set element declarations separated by commas. A set element declaration first gives the number or string which is the set member, and then explains what the set element stands for in a quoted string following the keyword "means".

     <enumerated set> ::= <set element> ("," <set element>)*
     <set element> ::= <value> "means" <explanation>
     <value> ::= a-number-in-string-form | a-string-in-quotes
     <explanation> ::= a-string-in-quotes

Finally, a "where" clause defines a constraint for sets that are more open-ended. Where a formal statement of the constraint can be made, it is expressed in some formal language (not yet defined). The formal constraint would be a boolean expression on the keyword "value" (representing any member of the value set) which evaluates to true when "value" is bound to a member of the set, and false otherwise. If the constraint cannot be stated formally, then the word "informally" is added and a prose description of the constraint is given, at least for the sake of documentation.

     <boolean expression> ::= not-defined-in-this-paper
     <description> ::= any-characters-except-period

Possible directions for enhancement

The above proposals for quantity and type constraints in semantic constraints on attributes leave plenty of room for further development.
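A hypothetical sketch of how application software might exploit these declarations: an enumerated set ("from" clause) becomes a lookup table, and a formal "where" clause becomes a predicate on the value, both usable to validate character data beyond what an SGML parser can check. The particular codes shown are taken from the examples in section 3 below.

```python
# Hypothetical validators built from value set declarations.

# An enumerated set becomes a table mapping each code to its
# explanation; validation is simple membership.
PoSCode = {"n": "noun", "vi": "intransitive verb",
           "vt": "transitive verb", "adj": "adjective"}

# A formal "where" clause becomes a predicate on the value
# (here: the value set of counting numbers, kept as a string).
def is_counting_number(value):
    return value.isdigit() and int(value) > 0

assert "vi" in PoSCode and PoSCode["vi"] == "intransitive verb"
assert "verb" not in PoSCode           # not a declared code
assert is_counting_number("3")
assert not is_counting_number("0")     # counting numbers start at 1
assert not is_counting_number("-2")    # minus sign is not a digit
```

An informal "where, informally," clause, by contrast, yields only documentation; no predicate can be derived from it mechanically.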
For instance, quantity (for which there may be a better name) could include further possibilities like "ordered set" (in which duplicates are not allowed but the order is significant) or "sorted sequence" (in which the order of the values is based on a comparison criterion which must also be stated). Type could be extended to include polymorphic types (that is, types which include more than one class or value set). These might be specified by separate "signature" declarations or by means of disjunction (that is, "or") in the <type> production. (This device is actually used in the examples below.)

Another enhancement would be to constrain a polymorphic sequence by a regular expression on class types, somewhat like the pattern constraint on elements in an SGML DTD. Note, however, that such regular expressions are probably not needed with this element and attribute model. For instance, instead of a regular expression on front matter, body, and back matter, a book would have three attributes for front matter, body, and back matter. Similarly, a chapter might have attributes like number, title, author, and contents.

Another useful kind of enhancement would be to allow inheritance in the definition of a new class as based on (but slightly different from) an already defined class. For instance,

     class <new name> based on <old name> also has <attributes>.

could mean that the new class has all of the attributes of the old class plus the new attributes listed, or, if one of the listed attributes has the same name as one of the old attributes, then the new constraint overrides the old one.

Another approach to subclassification and inheritance would be to allow parameters. For instance, a general definition of a bilingual dictionary could be made in terms of Language1 and Language2 (such as in the character set and value constraints of the value set definitions).
Then the following

     class German-EnglishDictionary based on BilingualDictionary
          where Language1="German" and Language2="English".

would define a German-English dictionary as a new subclass of the more general class.

The documentation of user-defined character sets was left as a given above. A metalanguage for defining character sets would be a logical extension of the current proposal.

3. SOME EXAMPLE CONCEPTUAL MODELS

The proposed syntax for specifying conceptual models is now illustrated with two straightforward examples: a basic monolingual dictionary and an analyzed text with word glosses and free translations. These are not meant to be draft models for these kinds of documents; they are offered strictly as examples of how the system can be applied.

A basic dictionary

The basic dictionary is a monolingual dictionary of a particular language. The name of that language is the first data element recorded in the dictionary. To generalize the definition to be for any language, we invent a preprocessor definition to define the language name for a particular application:

     define theLanguage to be <fill in the language name here>.

The hypothetical preprocessor will substitute the actual language name in place of <theLanguage> wherever it occurs in the definitions. The name here encoded is the name used in selecting the proper character set in string-valued value sets.

Here follows a possible definition of a basic monolingual dictionary (more in the sense of a lexical database than of a book to be read):

     class dictionary has            -- double hyphen delimits a comment
          language : single member of LanguageName,    -- name of language
          compilers : sequence of member of PersonalName,
          date : single member of Date,                -- date of last update
          contents : sequence of instance of entry sorted by headword.

     value set LanguageName is any string
          where, informally, value is the name of a language.

     value set PersonalName is any string
          where, informally, value is the name of a person.
     value set Date is date in MM/DD/YY.

     class entry has
          hw for headword : single member of LanguageWord,
          pr for pronunciation : sequence of member of PhoneticWord,
               -- pronunciations are ordered from most preferred to least
          et for etymology : single instance of etymon,
               -- this is terribly simplified
          senses : sequence of instance of subentry sorted by number.
               -- senses are ordered from most basic to most extended

     value set LanguageWord is any string in <theLanguage>
          where, informally, value is a word in the language.

     value set PhoneticWord is any string in IPA.

     class etymon has
          lg for language : single member of LanguageCode,
          wd for word : single member of SourceWord,
          gl for gloss : single member of LanguagePhrase,
          sc for source : single instance of etymon.
               -- source is a recursive instance of etymon for an even
               -- earlier form. Recursion ends when source is empty.

     value set LanguageCode is any string from
          "Lat." means "Latin",
          "OFr." means "Old French",
          "Fr." means "French",
          "It." means "Italian",
          "OHG" means "Old High German",
          ... and so on.

     value set SourceWord is any string
          where, informally, value is a word from the language named
          in the preceding lg attribute.

     value set LanguagePhrase is any string in <theLanguage>
          where, informally, value is a phrase in the language.

     class sub for subentry has
          no for number : single member of SenseNumber,
          ps for part of speech : single member of PoSCode,
          us for usage note : single member of UsageCode,
          df for definition : single member of LanguagePhrase,
          an for antonym : set of reference to one of
               senses of contents of this document,
               -- This is a syntax I made up on the spot, where senses and
               -- contents are attribute names defined above. Note that
               -- definitions of both attributes specify sorted sequences.
               -- The link could be symbolized by the values of the sort
               -- keys (namely, the hw and no attributes), as well as by
               -- an arbitrary reference id.
          ... and so on for other lexical relations.
     value set SenseNumber is any counting number.

     value set PoSCode is any string from
          "n" means "noun",
          "vi" means "intransitive verb",
          "vt" means "transitive verb",
          "adj" means "adjective",
          ... and so on.

     value set UsageCode is any string from
          "arch" means "archaic",
          "sl" means "slang",
          "vul" means "vulgar",
          ... and so on.

A glossed and translated text

The second example is a conceptual model for a particular way of analyzing running text with glosses at the morpheme and word levels and free translations at the sentence level. In particular, it is written as a model for glossing Eskimo texts (and it will be used to encode a sample Eskimo sentence in the next section). Thus the conceptual model contains specific knowledge of analytical features of Eskimo.

The example begins with a class for annotated sentences. Above that there are likely to be classes for paragraphs and the text as a whole.

     class sent for sentence has
          no for id number : single member of sentenceNumber,
          tx for original text : single member of EskimoText,
          ft for free translation : single member of EnglishText,
          wa for word analysis : sequence of instance of (word or punc).

     value set sentenceNumber is any counting number.

     value set EskimoText is any string in Eskimo
          where, informally, value follows spelling and punctuation
          conventions of Eskimo.

     value set EnglishText is any string in ASCII
          where, informally, value follows spelling and punctuation
          conventions of English.

     class word has
          sf for surface form : single member of EskimoWord,
          wg for word gloss : single member of EnglishGloss,
          ma for morpheme analysis : sequence of instance of
               (base or suf or clit).

     value set EskimoWord is any string in Eskimo
          where, informally, value is a single Eskimo word.

     value set EnglishGloss is any string in ASCII
          where, informally, value is a phrase of words that conform
          to an English spelling checker.
class punc for punctuation has
   form : single member of EskimoPunctuation,
   func for function : single member of PuncFunction.

value set EskimoPunctuation is
   any string in Eskimo from
      ".", ",", "!", "?", '"'.

value set PuncFunction is
   any string from
      "SF" means "statement final",
      "QF" means "question final",
      "EF" means "exclamation final",
      "QI" means "direct quote initial",
      "QF" means "direct quote final",
      ... and so on.

class base has
   lf for lexical form : single member of EskimoMorpheme,
   mg for morpheme gloss : single member of EnglishGloss.

class suf for suffix has
   lf for lexical form : single member of EskimoMorpheme,
   mg for morpheme gloss : single member of TechnicalGloss.

class clit for postclitic has
   lf for lexical form : single member of EskimoMorpheme,
   mg for morpheme gloss : single member of EnglishGloss.

value set EskimoMorpheme is
   any string in Eskimo
   where, informally, value is a single morpheme in its lexical
   representation.

value set TechnicalGloss is
   any string from
      "RSL" means "resultative aspect",
      "GER" means "gerundive",
      "s.MOD" means "singular, modalis case",
      ... and so on.

4. THE SGML ENCODING OF ANALYSES BASED ON CONCEPTUAL MODELS

The conceptual modeling language provides a way to define the semantics of a text encoding scheme. The syntax for the SGML markup of an application of the scheme can be deduced directly from the conceptual model. In fact, a formal translator could be implemented which would translate a conceptual model into an SGML DTD. I will not attempt a formal description of the correspondence between conceptual models and DTDs; rather, I will give an informal description of how the class definitions of the conceptual model relate to instances defined through SGML markup.

The general encoding rules

SGML encoding of class instances follows these basic rules:

1. The class name becomes an SGML tag. For instance, an instance of
   "class entry" is marked with the opening tag <entry>.

2.
An explicit closing tag, for instance </entry>, is used to
   mark the end of the data for the instance.

3. The name of an attribute in the conceptual model becomes an SGML
   tag, with one slight modification: all attribute tags use a
   unique precedence character, say "%". Thus, an instance of the
   attribute "senses" is marked with the opening tag <%senses>.

4. No end tags for attributes are required, since they are always
   redundant.

5. Members of value sets are always encoded as character data.

The only rules that seem to need comment are the ones concerning attribute markup. Three aspects of the proposals for attribute markup might be questioned: (1) Why not use SGML's attribute mechanism? (2) Why introduce a special precedence character for attribute tags? (3) Why aren't end tags needed? I now discuss these three points in turn.

SGML's attribute mechanism cannot be used for a very simple reason: it does not allow the nesting of SGML elements. The attributes of the conceptual model must, of course, be permitted to contain instances of elements which in turn contain other elements, and so on. While SGML's attributes could be used when the value is a value set member, it seems a pointless complication to use different mechanisms depending on the complexity of the attribute value.

At first blush the obvious approach to encoding attributes seems to be to use regular tags. But when confronted with a case of elements which have other elements for the value of their first attribute, one begins to appreciate the problem with using regular tags. For instance, assume we have an analyzed text whose first attribute is title, which is filled by a sentence whose first attribute is contents, which is filled by a sequence of words whose first attribute is form. With regular tags for both elements and attributes, the encoded text would look like:

   <text> <title> <sentence> <contents> <word> <form> The </> ...
This quickly presents a disorientation problem for the human reader of the markup. Prefixing the attribute tag names with a unique character like "%", as in

   <text> <%title> <sentence> <%contents> <word> <%form> The </> ...

gives immediate relief by making it clear which tags represent instances of elements and which signal their attributes. This is not just a matter of convenience, however; element instances and element attributes are fundamentally different things, and thus it is reasonable that they be syntactically distinct. Using the tag prefix for attribute names means that there is no naming conflict in the SGML markup if the conceptual model uses the same name for both a class and an attribute. One distinction between element tags and attribute tags is that element tags have a global scope (that is, a given element tag always represents the same class no matter where it occurs in the text), whereas attribute tags have a scope local to the containing element (that is, the same attribute tag immediately dominated by different element tags represents different attributes of different elements).

Another fundamental distinction between element tags and attribute tags has to do with their syntactic privilege of occurrence, which in turn explains why end tags for attributes are always redundant. Unlike elements, attributes cannot nest within each other, and each occurs only once in an element. This means that the occurrence of an attribute tag always simultaneously marks the end of the previous attribute. Similarly, the end tag for an element always simultaneously marks the end of the last attribute of the element. Though there is no harm in including explicit end tags for attributes, they are never needed, so it is recommended that they be omitted as the general rule.

An encoded dictionary entry

Here is the encoding of a sample dictionary entry for the English word 'cook'.
The encoding follows the conceptual model developed in the preceding section. The source of the etymologies and definitions was The American Heritage Dictionary (second college edition). Here is a rough rendering (without the benefit of bold and italic typefaces) of a possible output rendering of the dictionary entry:

   cook (kuk) [ME coke 'cook' < OE coc < LLat cocus < Lat coquus <
   Lat coquere 'to cook'] 1. (vt) To prepare for eating by providing
   heat. 2. (vt) To prepare or treat by heating. 3. (vi) To prepare
   food for eating by providing heat. 4. (vi) To undergo cooking.
   5. (vi, slang) To happen, develop, or take place. 6. (n) A person
   who prepares food for eating.

Below is the encoding of the same information. Note that none of the punctuation symbols shown in the above rendering are encoded, nor is the initial capitalization in definitions. These are all aspects of output format that should be added by a style specification given to the formatter at display time.

<entry>
<%hw>cook
<%pr>kuk
<%et><etymon><%lg>ME <%wd>coke <%gl>cook
     <%sc><etymon><%lg>OE <%wd>coc
     <%sc><etymon><%lg>LLat <%wd>cocus
     <%sc><etymon><%lg>Lat <%wd>coquus
     <%sc><etymon><%lg>Lat <%wd>coquere <%gl>to cook
     </etymon></etymon></etymon></etymon></etymon>
<%senses>
<sub><%no>1 <%ps>vt <%df>to prepare for eating by providing heat</sub>
<sub><%no>2 <%ps>vt <%df>to prepare or treat by heating</sub>
<sub><%no>3 <%ps>vi <%df>to prepare food for eating by providing
     heat</sub>
<sub><%no>4 <%ps>vi <%df>to undergo cooking</sub>
<sub><%no>5 <%ps>vi <%us>sl <%df>to happen, develop, or take
     place</sub>
<sub><%no>6 <%ps>n <%df>a person who prepares food for eating</sub>
</entry>

An encoded sentence with translation and glosses

Here is the encoding of a sample translated and glossed sentence following the conceptual model
developed in the preceding section. The example is based on the same Eskimo sentence used in the SSS proposal. It appears as follows in a more conventional interlinear form of display:

   1. Akutchilighmik-uvva           uqaaqtullangniaqtunga.
      akutuq  -si -liq -mik  =uvva  uqaaqtuq  -llak -niaq -tunga
      icecream-RSL-GER -s.MOD=now   tell story-DUR  -INT  -1s.I
      about making Eskimo icecream  I am going to tell a story
      'I am going to tell a story about making Eskimo ice cream.'

Below is the encoding of the same information. Note that the symbols used to show morpheme separation do not appear in the encoding. They are aspects of output format for classes base, suf, and clit which should be added by a style specification given to the formatter at display time.

<sent>
<%no>1
<%tx>Akutchilighmik-uvva uqaaqtullangniaqtunga.
<%ft>I am going to tell a story about making Eskimo ice cream.
<%wa>
<word><%sf>akutchilighmik-uvva
      <%wg>about making Eskimo icecream
      <%ma><base><%lf>akutuq <%mg>ice cream</base>
           <suf><%lf>si <%mg>RSL</suf>
           <suf><%lf>liq <%mg>GER</suf>
           <suf><%lf>mik <%mg>s.MOD</suf>
           <clit><%lf>uvva <%mg>now</clit>
</word>
<word><%sf>uqaaqtullangniaqtunga
      <%wg>I am going to tell a story
      <%ma><base><%lf>uqaaqtuq <%mg>tell a story</base>
           <suf><%lf>llak <%mg>DUR</suf>
           <suf><%lf>niaq <%mg>INT</suf>
           <suf><%lf>tunga <%mg>1s.I</suf>
</word>
<punc><%form>. <%func>SF</punc>
</sent>

5. THE PROBLEM OF LINKS

This section is unwritten for the time being, but I have the basic ideas in mind (though I won't know how well they work until I write it all out). Here is a sketch. The basic idea I want to propose is that a link be represented by the value of the key attribute of the element it refers to.
For instance, a link to a subentry in a dictionary is a composite of the headword of the entry (which is the key attribute for retrieving an entry from the dictionary) plus the number of the subentry (which is the key attribute for retrieving a subentry from the senses). Thus "cook.4" would be a reference to subentry 4 of headword cook. The key to the semantics of the link is the attribute definition in the conceptual model, which tells the type and scope of the links. It is those definitions which make "cook.4" refer to an instance of a subentry, as opposed to being just a string of characters.

Another example of using links is that a one- or two-letter spelling, like 't' or 'th', could be the key attribute of a phoneme element which gives a full feature analysis. Then, with the right attribute definitions in the conceptual model, the string "k t b" could be a sequence of references to phoneme objects.

It is likely, too, that we need to support an implicit, positional key. Thus, for instance, the words of a sentence might have an implicit key which is an index to their position in the sequence. This would be the key used in making references to the words from ambiguous syntax parse trees, for instance.
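The key-attribute linking ideas above can be illustrated with a short program. The following is a hypothetical sketch, not part of the proposal: the names (DICTIONARY, resolve_link, resolve_positional) and the use of nested Python dicts are invented for the example. The dictionary instance is keyed by the same key attributes the conceptual model declares (hw for retrieving an entry, no for retrieving a subentry), and the positional-key case is shown for the words of a sentence.

```python
# Hypothetical sketch of resolving links by key attributes.
# A toy dictionary instance: entries keyed by their hw attribute,
# subentries keyed by their no attribute (the declared sort keys).
DICTIONARY = {
    "cook": {
        "senses": {
            1: {"ps": "vt", "df": "to prepare for eating by providing heat"},
            4: {"ps": "vi", "df": "to undergo cooking"},
        },
    },
}

def resolve_link(link, dictionary):
    """Resolve a composite link like "cook.4" to a subentry instance.

    The headword is the key attribute for retrieving the entry; the
    sense number is the key attribute for retrieving the subentry.
    """
    headword, _, number = link.rpartition(".")
    entry = dictionary[headword]           # first key: the hw attribute
    return entry["senses"][int(number)]    # second key: the no attribute

def resolve_positional(words, position):
    """Resolve an implicit positional key: the nth word of a sentence."""
    return words[position - 1]             # positions count from 1

print(resolve_link("cook.4", DICTIONARY)["df"])
```

Note that the semantics still reside in the conceptual model's attribute definitions; the code merely follows the declared key attributes, which is what lets "cook.4" denote a subentry instance rather than a bare string.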