ENPC:Documentation

English-Norwegian Parallel Corpus: Manual

Draft chapter, May -94
Stig Johansson, Oslo

2 Coding

2.1 General principles

The coding of the texts is in broad agreement with the TEI guidelines for electronic texts, as presented in Sperberg-McQueen and Burnard (1993). Textual features are marked by tags enclosed within angle brackets. For example, a heading is marked by a start-tag <head> and an end-tag </head>. Tags may have attributes, to provide an identifier of the element or characterize it in some other way, e.g. <p id=p1> to identify a particular paragraph or <div type=chapter> to mark a chapter. Some tags do not enclose text, e.g. <pb n=2> marking a page break at a particular point in the text. So-called entity references (bounded by & and ;) can be used for a variety of purposes, e.g. to represent characters which are not available or to carry a grammatical tag. The tags, attributes, and entity references that may occur in a particular type of document are specified in a document type definition (DTD).

The document type definition for the texts in the corpus differs in some respects from the TEI model. The overall structure is shown by this example:

<partext lang=en type=novel orig=yes id=FW1>
     <header>
     </header>
     <text>
     </text>
</partext>

In other words, there are two main parts: a header and the main text. The type of text is characterized broadly by attributes on the top tag, in this case specifying that the text is in English, that it is a novel and an original text, and that it has the identifier FW1 (indicating text 1 by Fay Weldon). The corresponding coding for the translation would be:

<partext lang=no type=novel orig=no id=FW1T>

The value of the identifier of the translated text is identical to that of the original, with the addition of a letter (T) marking it as a translation.

2.2 Markup levels

A distinction is made between two levels:

level 1 (minimum coding): header, coding of main text structure (divisions, headings, paragraphs, S-units). Attributes for "rendition" may be omitted.

level 2: additional coding as outlined in this chapter

The aim is to code as many texts as possible according to level 2. The markup level of a text is specified in the encoding description of the header (see 2.3.2).

2.3 The header

Each text is described by a header which has four main parts, in accordance with the TEI guidelines: a file description, an encoding description, a profile description, and a revision description. These are tagged as follows:

<header>
     <fileDesc></fileDesc>
     <encodingDesc></encodingDesc>
     <profileDesc></profileDesc>
     <revisionDesc></revisionDesc>
</header>

2.3.1 The file description

The file description is the only part of the header which is obligatorily specified in the early stages of the project. It is generally structured as shown by this example:

Header and main text structure

<partext lang=en type=novel orig=yes id=FW1>
  <header>
       <fileDesc>
            <titleStmt>
                 <title>The Heart of the Country: Extract in machine-readable form</title>
                 <author>Fay Weldon</author>
                 <resp>
                      <role>tagger</role>
                      <name>SJ</name>
                 </resp>
            </titleStmt>
            <extent>10,000 words from beginning of text</extent>
            <publicationStmt></publicationStmt>
            <notesStmt></notesStmt>
            <sourceDesc>
                 <biblStruct>
                      <monogr>
                           <author>Fay Weldon</author>
                           <resp>
                                <role></role>
                                <name></name>
                           </resp>
                           <title>The Heart of the Country</title>
                           <imprint>      
                                <publisher>Hutchinson
                                </publisher>
                                <place></place>
                                <date>1987</date>
                           </imprint>
                      </monogr>
                 </biblStruct>
            </sourceDesc>
       </fileDesc>
       <encodingDesc>level 2</encodingDesc>
       <profileDesc>
            <langUsage>British English</langUsage>
            <textClass>fiction: general</textClass>
            <textDesc></textDesc>
       </profileDesc>
       <revisionDesc></revisionDesc>
  </header>
  <text>
       <front></front>
       <body>
            <div1 type= id= >
                 <div2 type= id= >
                      <p id= >
                           <s id= link= ></s>
                      </p>
                 </div2>
            </div1>
       </body>
       <back></back>
  </text>
</partext>

Note that the <titleStmt> describes the machine-readable file, while the source text is specified in the <sourceDesc>. The title in the <titleStmt> should indicate that this is a machine-readable version and should not be identical to the title of the source text.

The <notesStmt> is left empty in the example above. It consists of a series of <note> elements, each recording potentially significant details about the source text which cannot be accommodated elsewhere.

2.3.2 Encoding description

The TEI encoding description may include a project description, editorial declarations (on correction, normalization, etc.), information on sampling, reference systems, and any classification schemes. In our case the encoding description can be very brief; it chiefly consists of a reference to the manual for the corpus, the markup level, and any additional comments on special features of encoding applying to the individual text. In the early stages of the project the encoding description is limited to an indication of markup level and a description in prose of any special characteristics of the text.

2.3.3 Profile description

The profile description is of particular interest in the encoding of corpora, in that it makes it possible to describe each text in a very detailed manner. The present project will chiefly use the following main parts of the TEI profile description:

<langUsage> where the language/dialect of the text is described;

<textClass> where the text is classified in terms of a classification scheme;

<textDesc> where the text is described in terms of situational parameters.

The description under <langUsage> is in terms of labels like: American English (AmE), Australian English (AuE), British English (BrE), Canadian English (CaE), New Zealand English (NZE), etc. This section may also include observations on special linguistic features of the text (cf. 2.8 below). The classification under <textClass> is in terms of the following scheme:

Fiction:
     Children (FC)
     Detective (FD)
     General (FG)

Non-fiction:
     Popular:
          Belles lettres (biography, memoirs) (NPB)
          Information (information for the general public) (NPI)
          Science (history, biology, etc.) (NPS)
          Miscellaneous (NPM)

     Specialized:
          Acts (NSA)
          Reports (official reports) (NSR)
          Science (history, biology, etc.) (NSS)
          Miscellaneous (NSM)

Note that a broad classification in terms of text type is also given in an attribute on the top tag (see 2.1). In the early stages of the project the <textDesc> is left empty. Eventually this may be used to provide a detailed classification of the text in terms of situational parameters.

2.3.4 Revision description

The revision description takes the form of a series of changes. It is structured as follows:

<revisionDesc>
     <change>
          <date></date>
          <name></name>
          <what></what>
     </change>
</revisionDesc>

In other words, this is a list of changes specifying the date of the change, the person responsible for the change, and the nature of the change.

2.4 Text units

The corpus texts are segmented into the following main units: text, division (where applicable), paragraph, S-unit, and word. Words are simply marked by spacing as in ordinary written text. The other units are explicitly tagged.

2.4.1 Text

Where complete texts are encoded, these have the structure recommended by the TEI guidelines:

<text>
     <front></front>
     <body></body>
     <back></back>
</text>

In the case of text extracts from books, [part of] the body only is included. The encoded text starts with the body of the main text, including headings, and ends with the nearest chapter or section division after the required number of words for the text extract has been reached. If the nearest chapter or section division extends considerably beyond the required number of words, the encoded text ends with the nearest paragraph.

The end of a text extract is marked by an <omit> tag; see 2.13.2.

2.4.2 Divisions

Most written texts include some sort of segmentation in terms of parts, chapters, sections, etc. According to the TEI guidelines, these units are tagged as numbered or unnumbered divisions. This corpus uses numbered divisions, where a lower number indicates a higher level. The type of division is described by an attribute. Example structure:

<body>
     <div1 type=part id=NN1.1>
          <div2 type=chapter id=NN1.1.1>
               <div3 type=section id=NN1.1.1.1></div3>
          </div2>
     </div1>
</body>

Each unit has an identifier which is built up by successively adding to the identifier of the text (in this case text NN1: cf. 2.3.1 above).
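
The successive build-up of division identifiers can be illustrated with a few lines of Python. This is only a sketch for illustration and not part of the project's tools; the function name division_id is hypothetical.

```python
def division_id(text_id, *numbers):
    """Build a division identifier from a text identifier.

    text_id -- identifier of the text, e.g. "NN1"
    numbers -- position of the division at each level (div1, div2, ...)
    """
    if not numbers:
        raise ValueError("at least one division number is required")
    # Each level adds one dot-separated number to the identifier above it.
    return ".".join([text_id] + [str(n) for n in numbers])


print(division_id("NN1", 1))        # NN1.1      (div1)
print(division_id("NN1", 1, 1))     # NN1.1.1    (div2)
print(division_id("NN1", 1, 1, 1))  # NN1.1.1.1  (div3)
```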

Low-level divisions in the text which are only marked by a blank line, asterisks, or the like, are not tagged as divisions. The tag <blankline> is inserted at the appropriate point in the text. This may be taken to signal a major paragraph break.

Where the front and the back of the text are tagged, these may also be marked as containing divisions. See the TEI guidelines.

2.4.3 Paragraphs

Divisions primarily contain a sequence of paragraphs (in addition, there may be headings, notes, etc.). Continuing our example above, these are marked as follows:

<div3 type=section id=NN1.1.1.1>
     <p id=NN1.1.1.1.p1></p>
</div3>

Each paragraph has an identifier which adds yet another layer to the immediately superordinate identifier.

Paragraphs are identified as sections of texts marked by indentation, a blank line, or a combination of the two. Lists are marked as paragraphs or sequences of paragraphs; see 2.10.

2.4.4 S-units

Paragraphs are divided into orthographic sentences, here called S-units to underline that they are not necessarily sentences in a grammatical sense. They are tagged as follows:

<p id=NN1.1.1.1.p1>
     <s id=NN1.1.1.1.s1 link= ></s>
     <s id=NN1.1.1.1.s2 link= ></s>
</p>

S-units are numbered within the nearest division, as shown above. After alignment, each S-unit in the core corpus has a "link" attribute containing a reference to the corresponding unit(s) in the parallel text. S-units in the supplementary corpus have no link attribute.
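
After alignment it may be useful to check the "link" attributes mechanically. The Python sketch below assumes that links are expected to be reciprocal (each unit referred to in a link refers back), which this manual does not state as a requirement; the function names are hypothetical. The identifiers are those of the example given in the section on links.

```python
def parse_link(value):
    """Split the value of a link attribute into the identifiers it contains."""
    return value.strip("'\"").split()


def unreciprocated(links):
    """Return (source, target) pairs where the target does not link back.

    links -- dict mapping each S-unit id to the list of ids in its link attribute
    """
    problems = []
    for source, targets in links.items():
        for target in targets:
            if source not in links.get(target, []):
                problems.append((source, target))
    return problems


links = {
    "DL2.1.s18": parse_link("'DL2T.1.s18 DL2T.1.s19'"),
    "DL2T.1.s18": parse_link("DL2.1.s18"),
    "DL2T.1.s19": parse_link("DL2.1.s18"),
}
print(unreciprocated(links))  # []
```

An empty list means that every link is matched by a link in the opposite direction.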

An S-unit always opens after a paragraph start and ends before an end-of-paragraph marker. S-units are split within paragraphs where a mark of end punctuation (.?! or ... marking ellipsis) is followed by a word beginning with a capital initial (ignoring intervening parentheses, dashes, and quotation marks). No split is made after a colon or semicolon, even when it is followed by a word beginning with a capital initial (unless there is an end-of-paragraph marker).

S-units are not allowed to nest, i.e. they cannot be contained within each other. If there is an included sentence, e.g. within parentheses or between dashes, it is not coded separately, but is part of the S-unit it is included in. S-units may contain embedded poems, quotations, etc.

The division into S-units is complicated in some cases involving abbreviations and direct speech. Examples:

     <s>Dr. Smith, St. George</s>
     <s>'Hurry up!' Wolfram interrupted.</s>
     <s>'Why didn't you come straight to me?' I asked her.</s>

No split is made in such cases, where the capital does not mark the beginning of an S-unit, but rather the nature of the word.
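
The basic splitting rule, with an exception for abbreviations, can be sketched in Python as follows. This is a rough illustration and not the procedure actually used in the project: the abbreviation list and the function name are hypothetical, and the sketch makes no attempt to recognize the direct speech cases exemplified above, which it would split and which therefore have to be checked by hand.

```python
import re

# Hypothetical abbreviation list; a working list would be much longer.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "St."}

# A candidate boundary: end punctuation, any closing quotation marks or
# parentheses, white space, and then (after optional opening quotation
# marks, parentheses, or dashes) a capital letter. Colons and semicolons
# are not end punctuation, so they never produce a split.
BOUNDARY = re.compile(r"""(?<=[.?!])['")]*\s+(?=(?:['"(]|&dash;\s*)*[A-Z])""")


def split_s_units(paragraph):
    """Split the text of one paragraph into a list of S-units."""
    units = []
    start = 0
    for match in BOUNDARY.finditer(paragraph):
        words = paragraph[start:match.start()].split()
        # No split after an abbreviation such as "Dr." or "St."
        if words and words[-1] in ABBREVIATIONS:
            continue
        units.append(paragraph[start:match.end()].strip())
        start = match.end()
    rest = paragraph[start:].strip()
    if rest:
        units.append(rest)
    return units


text = "Dr. Smith arrived late. He had missed the train! Was it his fault? (Nobody knew.)"
for unit in split_s_units(text):
    print("<s>" + unit + "</s>")
```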

Headings, epigraphs, notes, and poems embedded in the text are not split into S-units.

2.4.5 Words

As pointed out above, words are not tagged, but are simply marked by spacing as in ordinary written text. The exception is that contractions are split into two words (in order to facilitate alignment). Examples:

     can't     ca n't
     I'll      I 'll
     it's      it 's
     d'you     d' you

In the early stages of the project words are not grammatically annotated, with a couple of exceptions:

     let's     let 's&pron;
     soon's    soon 's&subord;

The -s is here disambiguated by the following entity reference, which may be regarded as a grammatical tag.
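
The splitting of contractions can be sketched in Python along the following lines. This is an illustration under stated assumptions, not the project's procedure: the function name is hypothetical, 's is only split after a short, hypothetical list of words (so that genitives such as week's are left alone), and no grammatical tags such as &pron; are added.

```python
import re


def split_contractions(text):
    """Split contractions into two words, e.g. can't -> ca n't."""
    # d'you -> d' you (the apostrophe stays with the first part)
    text = re.sub(r"\b([Dd]')(?=you\b)", r"\1 ", text)
    # can't -> ca n't, don't -> do n't
    text = re.sub(r"(?<=\w)(n't)\b", r" \1", text)
    # it's -> it 's, but only after a few listed words, to leave genitives intact
    text = re.sub(r"\b(it|he|she|that|there|here|what|who|where|how)('s)\b",
                  r"\1 \2", text, flags=re.IGNORECASE)
    # I'll -> I 'll, you've -> you 've, etc.
    text = re.sub(r"(?<=\w)('(?:ll|re|ve|d|m))\b", r" \1", text)
    return text


print(split_contractions("I can't say it's late, but I'll ask: d'you know?"))
# I ca n't say it 's late, but I 'll ask: d' you know?
```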

2.5 Headings and other openers

Headings may occur at the beginning of a division or between paragraphs. They are marked by the tag <head>. Examples:

     <head id=NN1.1.h1>Part 1</head>
     <head id=NN1.1.1.h1>1 Mind in myth</head>

The "enumerator" is encoded as part of the head, as in these examples. Headings carry an "id" which is built up according to the same principle as the "id" of paragraphs and S-units, i.e. they are numbered within the nearest <div> but using "h1, h2, etc." rather than "p1, p2, etc." and "s1, s2, etc". See 2.4.3-4.
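
Since headings, paragraphs, and S-units share the same identifier scheme, an identifier can be unpacked mechanically. The Python sketch below assumes that text identifiers consist of capital letters and digits, optionally followed by T for a translation (as in NN1, FW1, DL2T); this pattern is inferred from the examples in this chapter, and the function name is hypothetical.

```python
import re

# Text id, dot-separated division numbers, and an optional unit suffix:
# p (paragraph), s (S-unit), or h (heading) followed by a number.
ID_PATTERN = re.compile(
    r"^(?P<text>[A-Z]+\d+T?)(?P<divs>(?:\.\d+)*)(?:\.(?P<kind>[psh])(?P<num>\d+))?$"
)


def parse_id(identifier):
    """Break an identifier into text, division numbers, unit type, and number."""
    match = ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError("not a recognized identifier: " + identifier)
    kind = match.group("kind")
    return {
        "text": match.group("text"),
        "translation": match.group("text").endswith("T"),
        "divisions": [int(d) for d in match.group("divs").split(".") if d],
        "kind": kind,
        "number": int(match.group("num")) if kind else None,
    }


print(parse_id("NN1.1.1.h1"))
print(parse_id("DL2T.1.s18"))
```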

Where there is more than one heading at a particular point, the tag <head> may be repeated. The typographical rendition of the heading is regularly left unmarked, but it can be specified by a "rend" attribute; see 2.7.1.

Headings as part of the front matter of a book are encoded differently; see the TEI guidelines. Running heads at the top of pages are not encoded.

Epigraphs at the beginning of divisions have the following structure:

<epigraph>
     <quote></quote>
     <bibl></bibl>
</epigraph>

As regards the encoding of other opening elements, see the TEI guidelines.

2.6 Punctuation

The punctuation is regularly left as in the original text. Some problems of detail are taken up below.

2.6.1 Full stop

The full stop is retained both as a marker of abbreviation and when marking the end of an orthographic sentence. The two uses are disambiguated by the tagging of S-units (see 2.4.4).

The marking of ellipsis by successive full stops is regularized; any spaces before or between the dots are removed.
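
The regularization of ellipsis can be sketched in Python as follows. The sketch assumes that a sequence of three or more full stops counts as an ellipsis and that the number of dots is kept as in the source; neither point is laid down in this manual, and the function name is hypothetical.

```python
import re


def regularize_ellipsis(text):
    """Remove spaces and tabs before and between the dots of an ellipsis."""
    return re.sub(
        r"[ \t]*\.(?:[ \t]*\.){2,}",
        lambda match: "." * match.group().count("."),
        text,
    )


print(regularize_ellipsis("Well . . . I suppose so"))  # Well... I suppose so
print(regularize_ellipsis("He left. She stayed."))     # unchanged
```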

2.6.2 Hyphen

Line-end (soft) hyphens are removed where they are not part of the regular spelling of the word. In cases of doubt, guidance should be sought elsewhere in the same text or in dictionaries. If doubt still remains, a hyphen should be retained rather than removed. In these cases a retained line-end hyphen is marked by an entity reference (&shy;).

2.6.3 Dash

Dashes are marked by an entity reference (&dash;). No distinction is made between different types of dashes.

2.6.4 Quotation marks

Quotation marks are regularized to single and double quotes. At a later stage in the project the various uses of quotation marks may be distinguished and marked according to the TEI conventions. See further 2.7.7 and 2.7.8 below.

2.6.5 Apostrophe

The apostrophe is left as it is. In the encoded text it cannot be distinguished from a single quotation mark. This is of less importance, as the two regularly appear in different contexts; the quotation mark at the beginning or end of words, the apostrophe within words (apart from genitives ending in -s' and split contractions; cf. 2.4.5). The ambiguity may be removed at a later stage (cf. 2.6.4).

2.7 Highlighting and quotation

No attempt is made to capture the full typography of the original text. Variation between upper and lower case is reproduced as in the original text. Use of typographical highlighting is marked where it is judged to be significant for the interpretation of the text.

2.7.1 Typographical highlighting

Typographical highlighting is marked by a "rend" (=rendition) attribute, if it applies to a whole element: a heading, a paragraph, or an S-unit, as in:

     <head rend=bold>
     <p rend=italic>
     <s rend=bold>

Where there is no applicable element, the tag <hi> is used:

     I <hi rend=italic>hate</hi> it.

The TEI guidelines propose the tag <emph> for linguistically emphatic or stressed sections of the text. The TEI tag <hi> is preferred in the present corpus, to avoid some problems in identifying the purpose of typographical highlighting.

Where part of a text is highlighted typographically because it is identified as foreign, it is preferable to use the tagging presented in the next section (though the "rend" attribute can be used in addition).

2.7.2 Foreign words and expressions

Foreign words and expressions are marked by a "lang" attribute. This is simple if the foreign element carries a tag:

     <head lang=fr>
     <s lang=la>

Where there is no applicable element, the tag <foreign> is used:

     He was tried <foreign lang=la>in absentia</foreign>

Some possible values of the "lang" attribute are:

     de   German
     en   English
     es   Spanish
     fr   French
     gr   Greek
     la   Latin
     no   Norwegian
     sv   Swedish

Foreign words and expressions are only marked where they are clearly recognizable as foreign (by being identifiable as separate units or being reproduced as typographically distinct from the surrounding text). The "lang" attribute can of course be used in the cases taken up next. Long passages in a foreign language are replaced by an <omit> tag; see 2.13.2.

2.7.3 Language mention

Words and expressions which are mentioned rather than used are normally marked by italics or quotation marks. These are tagged <mentioned>, as in:

     <mentioned rend=italic>She</mentioned> is a personal pronoun.
     <mentioned lang=de>'Singen'</mentioned> is a strong verb.

The rendition is marked by an attribute and/or by retaining quotation marks.

2.7.4 Terms

Highlighted terms are tagged <term>, possibly accompanied by a <gloss>, as in:

     Apical sounds are produced with the <term rend=italic>apex</term>
     <gloss>'tip of the tongue'</gloss>.

2.7.5 Titles

Titles of books, magazines, films, songs, paintings, etc. are tagged <title>, as in:

     Have you read <title>'Paradise Lost'</title>?

Foreign titles are marked by a "lang" attribute.

2.7.6 Names

Names of ships, boats, buildings, etc. are tagged <name>, as in:

     I went on board <name>Tumble</name> and set sail.

They are only tagged if they are typographically highlighted in some way, e.g. by italics, bold, or underscore. The "type" attribute is optional, and is usually not inserted at this stage.

Names of persons, places, organizations, etc. are not tagged.

2.7.7 Quotations

Quotations from extraneous sources are tagged <quote>, as in:

     The Apostle Paul said concerning some that <quote>"By good
     words and fair speeches they deceived the heart of the
     simple."</quote>

Foreign quotations are marked by a "lang" attribute. Long foreign quotations are omitted and replaced by an <omit> tag; see 2.13.2.

Direct speech in fiction is left unmarked and is simply shown by quotation marks. At a later stage direct speech may be tagged as in this example:

     <q>"Let's go,"</q> she said.

Before this tagging, direct speech may not be identifiable, as it is not always indicated by quotation marks. Missing quotation marks can be inserted using the <add> tag; see 2.13.2.

2.7.8 Use of single (') vs double quotation marks (")

All single quotation marks (') are converted to double quotation marks in direct speech and marked text.

     <s>"I do n't know how he stays so thin."</s>
     <s>She used her "meeting voice".</s>

The single quotation mark (') is only used in contractions (She 's, y' enjoy) and to mark the genitive (next week's Sunday newspapers' review section). Quotations within quotations are tagged <q>. This also applies to marked text within quotations or direct speech.

     <p><s>"The finger got stuck inside his nose," Matilda said, "and he
     had to go around like that for a week.</s> <s>People kept saying to
     him, <q>Stop picking your nose</q>, and he could n't do anything
     about it.</s> <s>He looked an awful fool."</s></p>

     <s>"Lately he 's discovered <q>breakfast meetings</q>.</s>
     <s>Now he gorges and guzzles all day.</s> <s>I do n't know how
     he stays so thin."</s>

2.8 Linguistically distinct material

The marking of foreign elements has already been dealt with (see 2.7.2). It may be essential to mark other linguistically distinct material, such as dialect words or idiosyncratic spellings. These are tagged <distinct>, with an attribute indicating the type of deviance. Examples:

     <distinct type=nonstand>Mister Carlyle sure give it to yuh,
     he finds out!</distinct>

     Why do we not treat <distinct type=nonceword>bunkraptcy</distinct>
     precisely as we treat bankruptcy?

The main value used for the "type" attribute in the present project is "nonstand", indicating deviance of different kinds: dialect, slang, idiosyncratic spelling, etc. If such features are pervasive in the text, this is noted in the header (under <langUsage>; see 2.3.3), and each individual case is not marked.

2.9 Notes

Notes in the source text are tagged <note> and are inserted at the place in the text marked by the reference to the note. Attributes include "n", "resp", and "place". Example:

     <note n=1 resp=auth place=foot>Unless otherwise specified, all
     remarks about bilingualism apply as well to multilingualism, the
     practice of using alternately three or more languages.</note>

Values of the "resp" attribute used in the project are: auth (author), ed (editor), tr (translator), tag (tagger). References to notes are omitted.

Only notes produced by the author of the original text are counted as included in the text proper. Editorial notes and translators' notes are omitted in counting the number of words of the text, but they form part of the electronic file.
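
The counting principle for notes can be illustrated by the following Python sketch. It is a rough approximation under stated assumptions: words are taken to be the space-separated strings left after all tags are removed, notes without resp=auth are excluded, nested notes are not handled, and entity references standing alone (e.g. &dash;) are counted as words. The function name is hypothetical.

```python
import re

NOTE = re.compile(r"<note\b([^>]*)>(.*?)</note>", re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def count_words(text):
    """Count words, including authors' notes but no other notes."""

    def keep_author_notes(match):
        attributes, content = match.group(1), match.group(2)
        resp = re.search(r"\bresp=(\w+)", attributes)
        if resp and resp.group(1) == "auth":
            return " " + content + " "
        return " "

    without_other_notes = NOTE.sub(keep_author_notes, text)
    return len(TAG.sub(" ", without_other_notes).split())


sample = ("<p><s>One two three.<note n=1 resp=auth place=foot>Four five.</note></s> "
          "<s>Six.<note n=2 resp=tr place=foot>Seven eight nine.</note></s></p>")
print(count_words(sample))  # 6
```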

In special cases it may be desirable to omit notes. They are then replaced by an <omit> tag. See 2.13.2.

2.10 Lists

Lists which contain very little ordinary language text (e.g. lists of references) are omitted and replaced by an <omit> tag; see 2.13.2. Other lists are treated as paragraphs or sequences of paragraphs (the latter in case each list item is set out typographically as a paragraph). S-units are used for subdivision, as for ordinary paragraphs.

2.11 Figures, diagrams, and tables

Figures, diagrams, and tables are left out and replaced by an <omit> tag. See 2.13.2.

2.12 Embedded texts

Poems, songs, etc. that are embedded in a prose text are tagged <poem>. The internal structure is not specified. Verse lines are reproduced with a line break between each. There is a blank line between stanzas. Poems are included in the nearest S-unit. There is no internal division into S-units.

In some cases it may be preferable to leave out a poem and replace it by an <omit> tag. See 2.13.2.

Embedded texts in prose are simply reproduced as part of the main text. Ordinary paragraph and S-unit marking is used. Frequently they will be tagged as quotations; see 2.7.7.

2.13 Editorial comment

The mechanisms for editorial comment are those recommended by the TEI guidelines for simple editorial changes.

2.13.1 Correction and regularization

Correction is marked as shown by this example:

     ... to render that service to poor <corr sic=poele>people</corr>

Where it is apparent that there is a typographical error, the main text is corrected and the original reading is given as a value of a "sic" attribute. A "resp" attribute should be used to specify the person responsible for the correction (normally "tag" for "tagger"; cf. 2.9).

The tag <sic> is used where there is no straightforward correction, but it is apparent that the text is inaccurate. A suggested correction may be given as a value of a "corr" attribute. A "resp" attribute should be used to specify the person responsible for the correction.

Beyond correction of obvious typographical errors, the language of the corpus texts is not normalized or regularized.

2.13.2 Addition, deletion, and omission

Omission of passages in the text may be marked by an <omit> tag; see 2.4.1, 2.7.2, 2.7.7, 2.9, 2.10, 2.11, 2.12. The tag has the following attributes:

desc: describing the omitted text
reason: giving the reason for the omission
extent: indicating the extent of the omission
resp: specifying the person responsible for the omission

The "desc" and "resp" attributes should normally be used. Sample "desc" values include: table, figure, foreign text.
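
As an illustration, an <omit> tag with these attributes could be generated as in the Python sketch below. The sketch assumes that <omit> is an empty tag (like <pb>) and that attribute values containing spaces are enclosed in single quotes, as in the "link" attribute; the function name and the sample "extent" value are hypothetical.

```python
def omit_tag(desc, resp, reason=None, extent=None):
    """Build an <omit> tag; desc and resp are normally given, the rest is optional."""
    attributes = [("desc", desc), ("reason", reason), ("extent", extent), ("resp", resp)]
    parts = []
    for name, value in attributes:
        if value is None:
            continue
        # Quote values that contain white space.
        if any(character.isspace() for character in value):
            value = "'" + value + "'"
        parts.append(name + "=" + value)
    return "<omit " + " ".join(parts) + ">"


print(omit_tag("table", "tag"))
# <omit desc=table resp=tag>
print(omit_tag("foreign text", "tag", extent="2 pages"))
# <omit desc='foreign text' extent='2 pages' resp=tag>
```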

Addition and deletion in the main text are avoided, though they can be indicated by <add> and <del> tags. An example of the use of the <add> tag is the insertion of a missing quotation mark; cf. 2.7.7.

2.14 Special characters

Special characters are encoded as entity references, e.g.

$         &dollar;
£         &pound;
–         &dash;

For a full list of entity references, see the document type definition (DTD).

NB! Accented and special characters used in Western European languages (de, en, fr, no) are not encoded as entity references at this stage. They are, therefore, system dependent.

2.15 Page breaks

Page breaks in the source text are kept to make it easier to refer back to the source. They are tagged <pb n= >, i.e. with the number as the value of an attribute. The placement of <pb> is normalized and is always given at the beginning of the relevant page. If there is a page break in the middle of a hyphenated word in the original text, <pb> is placed after the relevant word in the encoded text.

2.16 Reference system

A reference system is built up using the identifiers of the text units. See 2.1 (text), 2.4.2 (division), 2.4.3 (paragraph), 2.4.4 (S-unit), 2.5 (heading).

2.17 Links

Links between parallel texts are indicated by attributes of S-units, as shown in 2.4.4. Example:

<s id=DL2.1.s18 link='DL2T.1.s18 DL2T.1.s19'>At once, feeling her
advantage, she said, 'Do n't forget you 've been living soft for four
years.'</s>

<s id=DL2T.1.s18 link=DL2.1.s18>Hun hadde fått et lite overtak og fulgte
det opp.</s>
<s id=DL2T.1.s19 link=DL2.1.s18>"Ikke glem at du har levd godt i fire år
nå."</s>

2.18 Analytic coding

In the earlier stages of the project there will be no linguistic annotation, with one exception; see 2.4.5.