Published in English Language Corpora: Design, Analysis and Exploitation, Papers from the 13th International Conference on English Language Research on Computerized Corpora, Nijmegen 1992, edited by Jan Aarts, Pieter de Haan and Nelleke Oostdijk.
Before this work is carried out, each incoming text is assigned a unique code name, and stored on disk in accordance with agreed file storage procedures (Burnage 1992b). As well as identifying each text in the file system, the code name is used in a database which stores details about each text and its current progress along the production line, details of which are updated regularly (Dunlop 1992c).
Also used extensively is the Icon programming language. It was developed at the University of Arizona under Ralph and Madge Griswold, and is particularly suited to the manipulation of character strings, making it an ideal tool for the re-formatting and encoding of text corpora (Griswold & Griswold 1990). It is available for a wide range of machines and operating systems, and is in the public domain.
The main SGML tool used is a public-domain parser called SGMLS (Clark 1992). The parser can be invoked from within the emacs text editor to analyse a text while it is being edited, which speeds up checking and correction considerably.
The bibliographic database is implemented with the INGRES database management system, which is available to OUCS under a local site licence agreement (Ingres 1989).
-- Character level: The corpus is held as plain ASCII text (strictly, it uses the International Reference Version of ISO 646:1990 (ISO 1990)). Characters outside the limited set permitted by this standard are represented by mark-up.
-- Word level: Word-class tagging is applied to each word in the corpus.
-- Phrase level: A small selection of the texts (the `core corpus') in the BNC has tagging at the phrase level with parse tree analysis.
-- Sentence level: The word-class tagging process divides all texts in the corpus into segments, which correspond closely to sentences in running text. Segments are also used in a reference system which allows a unique reference to be generated for any segment in the corpus.
-- Structural level: Where appropriate and possible, the structure of each document --- consisting of chapters, sections, paragraphs, or similar elements --- is marked.
-- Text level: Each text in the corpus is accompanied by a comprehensive header giving bibliographic information, and listing the criteria by which the text was selected for inclusion in the corpus.
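The character-level rule can be illustrated with a short Python sketch that replaces characters outside the 7-bit set with SGML entity references. The entity names in the table (mdash, hellip, eacute) are borrowed from the ISO 8879 public entity sets and are an assumption here; the mapping actually used by CDIF is not reproduced in this paper, and the function name is hypothetical.

```python
# Hypothetical mapping from characters outside ISO 646 to entity names.
# The names follow the ISO 8879 public entity sets; the real CDIF table
# may well differ.
ENTITY_NAMES = {
    "\u2014": "mdash",   # em dash
    "\u2026": "hellip",  # horizontal ellipsis
    "\u00e9": "eacute",  # e with acute accent
}


def to_seven_bit(text: str) -> str:
    """Return text with every character above code 127 turned into mark-up."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        elif ch in ENTITY_NAMES:
            out.append("&" + ENTITY_NAMES[ch] + ";")
        else:
            # Refuse to guess: an unmapped character needs a human decision.
            raise ValueError("no mark-up defined for U+%04X" % ord(ch))
    return "".join(out)


encoded = to_seven_bit("Poems \u2014 such as caf\u00e9\u2026")
print(encoded)
```

Raising an error for unmapped characters, rather than dropping them silently, keeps the conversion mechanically checkable.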
In some respects, SGML is not itself a mark-up language; rather, it is a language in which mark-up languages may be defined. Consequently, it is possible using SGML to express two functionally identical mark-up languages which are nevertheless incompatible because, for example, they use different names for the same element, or because they use different character sets. Such incompatibilities would make it difficult for researchers using the two schemes to exchange data sets, so the use of SGML alone does not provide a solution to the problem caused in the past by lack of mark-up standardization. (It does, however, address the problem of a lack of common tools: subject only to capacity limitations, and to the ability to handle optional extensions to the base standard, any SGML-aware tool can process any document marked up using SGML.)
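The incompatibility described above can be made concrete with a small Python sketch. Two invented tag vocabularies mark the same structure with different element names; a name mapping makes them interchangeable, but only if someone supplies it. The element names para, p, title and head, and the function rename_elements, are illustrative assumptions and are not taken from any particular scheme.

```python
import re

# Hypothetical correspondence between the element names of two schemes.
SCHEME_A_TO_B = {"para": "p", "title": "head"}

# Matches a start or end tag; captures the slash, the name and the rest.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)([^<>]*)>")


def rename_elements(text: str, mapping: dict) -> str:
    """Rewrite element names found in mapping, leaving other tags alone."""
    def swap(match):
        slash, name, rest = match.groups()
        return "<" + slash + mapping.get(name, name) + rest + ">"
    return TAG.sub(swap, text)


scheme_a = "<title>A sample text</title><para>Some words.</para>"
scheme_b = rename_elements(scheme_a, SCHEME_A_TO_B)
print(scheme_b)
```

A regular expression is used here only for brevity; it ignores SGML features such as tag minimization, which a real conversion would have to handle with a parser.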
The TEI sets out to define an application of SGML which minimizes incompatibilities between the mark-up used by different researchers, while allowing both subsetting and extension. Its recommendations try to describe a spectrum of SGML document type definitions (DTDs) which may be applied to a wide variety of text types, defining mark-up which will facilitate the use and, importantly, the exchange of marked-up text for a wide variety of scholarly and didactic activities. Sperberg-McQueen and Burnard (1992a) divides the features that particular researchers might want to address into a number of subsets, recommending the manner in which tagging should be applied, and giving names which should be used for tags marking particular types of element. Those following the recommendations are free to implement as much or as little of each subset as is required for their application, and may use tags of their own devising to mark elements not described in the recommendations. Thus, a TEI-conformant mark-up may be characterized by the extent to which it implements each subset of the recommendations.
Broadly, CDIF provides a relatively sparse implementation of the text body tagging described by the recommendations; a complete (and, indeed, extended) implementation of the text and corpus header recommendations; and a medium level of word-class tagging. The subsections which follow give more detail:
Consequently, it was decided to divide CDIF text tags into three categories:
Required tags, which must be used to mark particular types of feature if those features appear in a text. (In some cases, as an alternative to tagging, the content of the feature may be silently deleted from the electronic transcription: footnotes are a case in point. The editorial practices declaration in each text header describes the treatment of such features.) Examples of required tags are <p>, to mark paragraphs in written text; <u>, to mark spoken utterances; and <note>, to mark foot-, end- or side-notes, or editorial comments inserted during BNC processing.
Recommended tags, which are not mandatory, but highly desirable. Often these mark text features which could cause anomalous results in corpus-based research if their presence were not noted. Examples are lists (marked with <list>); poetry (<poem>) and material written to be spoken (<sp>).
Optional tags, which may appear if sufficient information has been
captured from the original text, or if their use resolves some problem
identified during syntactic or semantic checking. Examples include
<hi> to describe text rendition (no attempt is made to interpret the
semantic reason for changes in rendition);
Burnard (1992b) summarizes the division of CDIF tags into these three
categories. The text header (Dunlop 1992b) lists the tags used in a
particular text.
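As a rough illustration of how the three categories might be applied
mechanically, the Python sketch below collects the element names used
in a marked-up text and groups them by category, much as a text header
lists the tags used in a text. The category table contains only the
example tags named in this paper, not the full CDIF tag set, and the
function name is hypothetical.

```python
import re

# Only the example tags mentioned in this paper; the full lists are in
# Burnard (1992b). <quote> is taken from figure 5.
CATEGORY = {
    "p": "required", "u": "required", "note": "required",
    "list": "recommended", "poem": "recommended", "sp": "recommended",
    "hi": "optional", "quote": "optional", "cite": "optional",
}

# Matches start tags only: an end tag has "/" after "<" and is skipped.
START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)[^<>]*>")


def tags_by_category(text: str) -> dict:
    """Group the element names used in text by category."""
    found = {"required": set(), "recommended": set(),
             "optional": set(), "unclassified": set()}
    for name in START_TAG.findall(text):
        name = name.lower()
        # "unclassified" means only "absent from the table above".
        found[CATEGORY.get(name, "unclassified")].add(name)
    return found


sample = ("<p>They may also contain <list><item><hi r=it>Poems</hi>"
          "</item></list>")
usage = tags_by_category(sample)
for category in ("required", "recommended", "optional", "unclassified"):
    print(category, sorted(usage[category]))
```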
Figure 2. Written example in source form
While the text of figure 3 would be acceptable for inclusion in the
corpus in this form, most corpus texts show more complete tagging, as
shown in figure 4. Here some recommended tags are added, fully
describing the text structure, and identifying the list and poem
fragment that it contains. The footnote also appears, tagged at its
point of reference.
The addition of optional mark-up, shown in figure 5, provides
information about text rendition, and indicates that the poem fragment
is a quotation.
If the example were an actual corpus text, it would also be marked up
with part-of-speech and segmentation information, independent of the
level of other tagging applied. See Eyes (1992) for further details.
Figure 6: A spoken example represented as a dramatic script.
Figure 6 shows an example spoken text, set out as if in the printed
script of a play. The main feature of the example is the overlap
between the utterances of the two speakers. As figure 7 shows, this
is handled by marking the start and end of the period of overlap in
each utterance. An `alignment map' at the start of the text shows
the ordering in time of the starting and ending marks, and so
indicates which utterances overlap which others. The transcription
method used for spoken material (see Crowdy 1991) correctly captures
up to three simultaneous utterances. (In an actual corpus text, the
identifiers used for each alignment map location would be longer than
those in the example, as they must be unique across the whole corpus.)
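One way to read the alignment map mechanically is sketched below in
Python. It scans mark-up of the kind shown in figure 7, notes which
locations each utterance points to, and reports pairs of speakers whose
marked periods intersect in time. The regular expressions assume the
unquoted attribute style of the figures and are no substitute for a
real SGML parser; the function name is hypothetical.

```python
import re
from itertools import combinations

LOC = re.compile(r"<loc id=(\w+)>")
# An utterance runs until the next <u ...> tag, </div>, or end of input.
UTTERANCE = re.compile(r"<u who=(\w+)>(.*?)(?=<u |</div>|$)", re.S)
PTR = re.compile(r"<ptr t=(\w+)>")


def overlapping_speakers(stext: str) -> list:
    """Return pairs of speakers whose marked overlap periods intersect."""
    # The position of each location in the map gives its order in time.
    order = {loc: i for i, loc in enumerate(LOC.findall(stext))}
    spans = []
    for who, body in UTTERANCE.findall(stext):
        # A KeyError here would signal a pointer to an undeclared location.
        points = [order[target] for target in PTR.findall(body)]
        if len(points) >= 2:
            spans.append((who, min(points), max(points)))
    pairs = []
    for (a, a0, a1), (b, b0, b1) in combinations(spans, 2):
        if max(a0, b0) < min(a1, b1):  # the two periods share some time
            pairs.append((a, b))
    return pairs


figure7 = """<stext><align><loc id=P1><loc id=P2></align>
<div>
<u who=Tom>I used to smoke
<ptr t=P1><vocal desc=cough dur=5><ptr t=P2>
but I never inhaled.
<u who=Dick><ptr t=P1>Thought as much.<ptr t=P2>
</div></stext>"""

print(overlapping_speakers(figure7))
```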
For completeness, figure 8 shows the example text of figure 6 marked
up as it would be if it had been captured from a printed dramatic script.
Involvement with the TEI has many benefits, but, in these relatively
early days, has often necessitated waiting for recommendations to
appear before CDIF mark-up specifications can be frozen. This is a
two-way process: experience with issues raised by the BNC has informed
several aspects of the TEI's work, and the close attention of several
of the technical experts who contribute to the TEI recommendations has
been of great assistance to the BNC project.
Future TEI-conformant corpus-building projects will not be burdened by
many of the issues which the BNC, as an early user, has encountered.
However, any builder of a large corpus faces the problem of converting
texts from a variety of source formats into a uniform electronic
format prior to accession to the corpus. Experience on the BNC
project indicates that the earlier in this process some mechanically
verifiable form of quality control can be introduced, the better.
Aho, Alfred V; Kernighan, Brian W & Weinberger, Peter J.
1988. The Awk programming language. Addison Wesley. Reading,
Massachusetts.
British National Corpus. October 1991. TGAW14, Spoken corpus design
specification. BNC working paper.
British National Corpus. September 1991. BNCW08, Written corpus
design specification. BNC working paper.
Burnage, Gavin. July 1992 (1992a). TGCW26, Is the conversion of
Longman/Lancaster texts to CDIF possible? BNC working paper.
Burnage, Gavin. August 1992 (1992b). TGCW35, Corpus text
processing: directory structures and file names. BNC working paper.
Burnard, Lou. March 1992 (1992a). TGCW27, BNC acceptance procedures ---
Draft OUCS proposals. BNC working paper.
Burnard, Lou. September 1992 (1992b). TGCW30, Corpus Document
Interchange Format, version 1.2. BNC working paper.
Clark, James. October 1992. sgmls 1.0 available. Announcement on
comp.text.sgml newsgroup.
Clear, Jeremy. July 1992. TGCW33, BNC Data Capture: OUP format
definition for text handover to OUCS. BNC working paper.
Crowdy, Steve et al. December 1991. TGCW21, Spoken corpus
transcription guidelines. BNC working paper.
Davis, Caroline. July 1992. TGCW04 annex. Corpus markup: Codes
for freelancers and scanner operators. OUP/BNC working paper.
Dougherty, Dale. 1991. sed & awk. O'Reilly & Associates.
Sebastopol, California.
Dunlop, Dominic. March 1992 (1992a). TGCW25, Mark-Up for non-ISO 646
invariant part characters. BNC working paper.
Dunlop, Dominic. September 1992 (1992b). TGCW34, The relationship
between the TEI.2 header and the BNC corpus and text headers. BNC
working paper.
Dunlop, Dominic. October 1992 (1992c). TGCW36, The new BNC database.
BNC working paper.
Eyes, Elizabeth & Leech, Geoffrey. 1992. Improving corpus annotation
practices. [This ICAME report.]
Garside, Roger; Leech, Geoffrey & Sampson, Geoffrey. 1987. The
computational analysis of English: a corpus-based approach. Longman.
London.
Goldfarb, Charles. 1990. The SGML handbook. Oxford University Press.
Griswold, Ralph E & Griswold, Madge T. 1990. The Icon programming
language, second edition. Prentice Hall. New Jersey.
Ingres Corporation. 1989. Introducing Ingres. Alameda, California.
International Organization for Standardization. 1986. ISO 8879:1986,
Standard Generalized Markup Language (SGML). (Included in
Goldfarb 1990.) Geneva.
International Organization for Standardization. 1990. ISO 646:1990, 7
bit coded character set for information exchange. Geneva.
Langendoen, Terry. January 1992. TGDW09, Preliminary feature
structure definition for CDIF. BNC working paper.
Leech, Geoffrey. April 1992. TGDW08, Revised proposal for basic
grammatical tagset. BNC working paper.
Leech, Geoffrey. September 1992. TGDW11, Proposal for enriched
grammatical tagset. BNC working paper.
Sperberg-McQueen, C.M. and Burnard, Lou (eds.). 1992 (forthcoming).
TEI P2, Recommendations of the Text Encoding Initiative. Chicago &
Oxford.
Wall, Larry & Schwartz, Randal L. 1991. Programming perl.
O'Reilly & Associates. Sebastopol, California.
British National Corpus working papers are available on request in
printed or electronic form from the authors.
The sgmls program and published TEI papers are available by anonymous
FTP from archives at sgml1.exeter.ac.uk. The Icon language processors
are available by anonymous FTP from cs.arizona.edu. The perl language
is available by anonymous FTP from many sites, including ftp.uu.net
and doc.ic.ac.uk. In addition to UNIX, all of these tools are available
for MS-DOS, VMS, and a number of other operating environments.
<quote> to show quotation
of material written by some person other than the main author of a
text; and <cite> to enclose the citation within a text of another
work.
A written example
1. A sample text
<div>s contain paragraphs; they may also contain
a) Lists
b) Poems --- such as
"It's only words,
and words are all I have..."
c) Lower-level <div>s
1.1 A sub-section
Contents of the sub-section*.
* Two further levels are supported
<!DOCTYPE cdif SYSTEM "cdif1.2.dtd" [ ]>
<cdif><header>
A sample text containing only mandatory mark-up
</header>
<text>
<div1><head>A sample text</head>
<p><div>s contain paragraphs;
they may also contain
a) Lists
b) Poems — such as
"It's only words,
and words are all I have…"
c) Lower-level <div>s
<p>A sub-section
<p>Contents of the sub-section.
</text></cdif>
Figure 3. Written example with required CDIF mark-up
Figure 2 shows a sample of text as it might appear on the printed
page. In figure 3, the same text appears with only required CDIF
mark-up added. Note that information about rendition and structure
below the top level is not recorded, and the footnote has been
silently deleted.
<!DOCTYPE cdif SYSTEM "cdif1.2.dtd" [ ]>
<cdif><header>
A sample text containing required and recommended mark-up
</header>
<text>
<div1><head>A sample text</head>
<p><div>s contain paragraphs; they may also contain
<list><label>a)</label><item>Lists</item>
<label>b)</label><item>Poems — such as
<poem><l>"It's only words,
<l>and words are all I have…"
</poem></item>
<label>c)</label><item>Lower-level <div>s</item></list>
<div2><head>A sub-section</head>
<p>Contents of the sub-section
<note place=foot>Two further levels are supported</note>.
</text></cdif>
Figure 4. Written example with required and recommended CDIF mark-up.
<!DOCTYPE cdif SYSTEM "cdif1.2.dtd" [ ]>
<cdif><header>
A sample text containing required, recommended and optional mark-up
</header><text>
<div1><head>A sample text</head>
<p><div>s contain <hi r=it>paragraphs</hi>;
they may also contain
<list><label>a)</label><item r=it>Lists</item>
<label>b)</label><item><hi r=it>Poems</hi> — such as
<quote><poem><l>It's only words,
<l>and words are all I have…
</poem></quote></item>
<label>c)</label><item r=it>Lower-level <div>s</item></list>
<div2><head>A sub-section</head>
<p>Contents of the sub-section
<note place=foot>Two further levels are supported</note>.
</text></cdif>
Figure 5. Written example with required, recommended and optional CDIF
mark-up.
A spoken example
Tom: I used to smoke (coughs) ...
Dick: (interrupting) Thought as much.
Tom: (continuing) but I never inhaled.
<!DOCTYPE cdif SYSTEM "cdif.dtd" [ ]>
<cdif><header>
A sample spoken text
</header>
<stext><align><loc id=P1><loc id=P2></align>
<div>
<u who=Tom>I used to smoke
<ptr t=P1><vocal desc=cough dur=5><ptr t=P2>
but I never inhaled.
<u who=Dick><ptr t=P1>Thought as much.<ptr t=P2>
</div></stext></cdif>
Figure 7: Spoken example with CDIF mark-up.
<!DOCTYPE cdif SYSTEM "cdif.dtd" [ ]>
<cdif><header>
A sample dramatic text
</header><text>
<sp><spkr>Tom</spkr><p>I used to smoke
<stage>coughs</stage>
<sp><spkr>Dick</spkr><stage>interrupting</stage>
<p>Thought as much.
<sp><spkr>Tom</spkr><stage>continuing</stage>
<p>but I never inhaled.
</text></cdif>
Figure 8: Example with CDIF dramatic text mark-up
Lessons learned
As one of the first attempts to build a large, balanced corpus with
uniform, SGML-based mark-up, the BNC project was bound to encounter
unforeseen difficulties and tasks which took longer than had been
anticipated.
References