Encoding the British National Corpus

Gavin Burnage and Dominic Dunlop
Oxford University Computing Services

Published in English Language Corpora: Design, Analysis and Exploitation, Papers from the 13th international conference on English Language research on computerized corpora, Nijmegen 1992, edited by Jan Aarts, Pieter de Haan and Nelleke Oostdijk.

The British National Corpus project
- The role of OUCS in the BNC
Producing corpus texts
- Obtaining electronic text
- Converting text to CDIF
BNC hardware and software facilities at OUCS
- Hardware
- Software
Mark-up for the British National Corpus
Lessons learned
References

The British National Corpus Project

The British National Corpus (BNC) project is currently constructing a 100 million word corpus of modern British English for use in linguistic research. It is a collaborative, pre-competitive initiative carried out by Oxford University Press (OUP), Longman Group UK Ltd., Chambers, Lancaster University's Unit for Computer Research in the English Language (UCREL), Oxford University Computing Services (OUCS), and the British Library. The project receives funding from the UK Department of Trade and Industry and the Science and Engineering Research Council within their Joint Framework for Information Technology.

The role of OUCS in the BNC

OUCS's main role in the project is to encode all corpus texts in a standard format, and to act as a central clearing-house for the exchange and storage of corpus texts for all parties involved in BNC construction work. The common encoding scheme agreed within the project is called the `Corpus Document Interchange Format', or CDIF (Burnard, 1992c). It is an application of the Standard Generalized Markup Language (SGML) (Goldfarb, 1990; ISO 1986), and conforms in large measure to the recommendations of the Text Encoding Initiative (TEI) for the encoding of linguistic corpora (Sperberg-McQueen and Burnard, 1992a). CDIF is the format in which the BNC will be published at the end of the project.

Producing corpus texts

OUCS's work is one part of a production line which involves most of the project's participants directly. The starting point is the creation of electronic versions of a wide range of texts in British English; the finishing point is those same texts encoded in CDIF to show both the structure of each text and the syntactic analysis of each sentence. This production line is illustrated in figure 1.

Obtaining electronic text

The commercial publishers in the project are responsible for the initial stages of the process, namely supplying electronic versions of texts selected for inclusion in the corpus in accordance with the design criteria (British National Corpus, 1991a, 1991b). There are three ways of obtaining electronic text. One is to use scanners such as the Kurzweil Data Entry Machine or the Microtek 600; another is to type in text directly; another is to use material which is already in electronic form, usually from publishers or existing archives. In practice, the amount of existing electronic text which fulfils the corpus design criteria has been a lot smaller than envisaged, which means that the bulk of the material received at OUCS has been scanned or typed. Naturally the transcribed spoken material for which Longman is responsible can only be typed. Both OUP and Longman use their own internal mark-up schemes for the encoding of the data they supply for the BNC (Davis 1992) -- though a pre-condition for a text's inclusion in the corpus is that its automatic conversion to CDIF must be easy to implement (Burnage 1992a, Clear 1992). This illustrates one important reason for the use of TEI-conformant SGML in distributing the corpus: researchers are free to convert to and from the encoding systems and software they are happy using in their local set-up, but for the exchange of data between different computational set-ups, a single, standardized encoding scheme is to be preferred.

Converting text to CDIF

The range of texts received at OUCS is very broad, not only in terms of subject matter and linguistic register, but also in terms of textual structure. The headings, sections, and paragraphs of an academic article are usually well marked, and its logical structure is easy to follow; in contrast, feature articles from colour magazines often contain short snippets of text which are hard to identify --- they could be paragraphs, or headings, or captions -- in no particular order. Moreover, transcription of conversation presents another set of encoding problems (Crowdy 1991). CDIF has been designed to accommodate the encoding of texts whose structures differ widely. It is a single SGML document type definition (DTD) which sets out a formal description that every BNC text must match in order to become a part of the corpus. This formal description is broad enough to encompass the many different types of text intended for the corpus, and rigorous enough to show up some of the errors which occur in the mark-up of these texts.

The CDIF check

The first task which OUCS performs when new electronic text arrives, therefore, is to convert the mark-up to the various conventions set out in the CDIF DTD. The success or otherwise of this conversion can be gauged on one level by using an SGML parser to check the mark-up of the text against the formal description in the DTD. The parser reports any errors in the text, and these can be corrected by hand. If the error is one likely to recur frequently, small programs can be used to speed up the correction process. When the document conforms to the DTD, the parser finds no further errors.
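The kind of small correction program mentioned above might look like the following minimal sketch in Python. It assumes, purely for illustration, that a supplier has used tag names which the DTD does not recognize; the names `para` and `heading` and the function `rename_tags` are hypothetical and are not taken from any BNC working paper. The sketch only rewrites tag names, and the corrected text would still have to be passed through the SGML parser again.

```python
import re

# Hypothetical mapping from tag names a supplier might have used to the
# names expected by the DTD. The left-hand names are invented for this sketch.
RENAMES = {"para": "p", "heading": "head"}

# Matches the opening of a start-tag or end-tag: "<name" or "</name".
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)")

def rename_tags(text, renames=RENAMES):
    """Rewrite start- and end-tag names according to `renames`.

    Attributes and text content are left untouched; tag names that are
    not listed in `renames` are kept as they are.
    """
    def fix(match):
        slash, name = match.groups()
        return "<" + slash + renames.get(name.lower(), name)
    return TAG.sub(fix, text)

if __name__ == "__main__":
    print(rename_tags("<heading>Title</heading><para>Some text."))
```

A mapping table of this kind keeps each recurring correction in one place, so that the same fix can be applied to every text from the same supplier.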

Before this work is carried out, each incoming text is assigned a unique code name, and stored on disk in accordance with agreed file storage procedures (Burnage 1992b). As well as identifying each text in the file system, the code name is used in a database which stores details about each text and its current progress along the production line, details of which are updated regularly (Dunlop 1992c).
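As an illustration only, the tracking of texts by code name could be sketched as follows. The project database is implemented in INGRES; this sketch substitutes Python's built-in sqlite3 module, and the table layout, the stage names, and the code name `X01` are all assumptions made for the example rather than details of the actual database.

```python
import sqlite3

# Hypothetical stage names, loosely following the production line
# described in this paper; the real database may record progress differently.
STAGES = ["received", "cdif_checked", "semantic_checked",
          "tagged_at_ucrel", "final_cdif_checked", "accepted"]

def open_db():
    """Create an in-memory database with one row per corpus text."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE texts ("
               "code TEXT PRIMARY KEY, title TEXT, stage TEXT NOT NULL)")
    return db

def register(db, code, title):
    """Record a newly arrived text under its unique code name."""
    db.execute("INSERT INTO texts VALUES (?, ?, ?)", (code, title, STAGES[0]))

def stage_of(db, code):
    """Return the current stage of the text with the given code name."""
    row = db.execute("SELECT stage FROM texts WHERE code = ?", (code,)).fetchone()
    if row is None:
        raise KeyError(code)
    return row[0]

def advance(db, code):
    """Move a text to the next stage (the last stage is final)."""
    current = STAGES.index(stage_of(db, code))
    nxt = STAGES[min(current + 1, len(STAGES) - 1)]
    db.execute("UPDATE texts SET stage = ? WHERE code = ?", (nxt, code))
    return nxt

if __name__ == "__main__":
    db = open_db()
    register(db, "X01", "A sample text")
    print(advance(db, "X01"))
```

Making the code name the primary key means that an attempt to register the same code twice is rejected by the database itself.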

The semantic check

The fact that a text conforms to the DTD does not necessarily mean that it is faultless or perfectly encoded. Tags can conform to the expectations of the DTD and the parser, but still have been misapplied or misused. The caption to a photograph might mistakenly be labelled a heading, for example. Certain significant textual features such as chapters and paragraphs might not have been tagged at all. There may also be more fundamental problems: portions of text present in the original book, conversation, or whatever may have been omitted from the electronic version. Scanning software or transcribers may inadvertently have introduced typographical errors. For these reasons, a 'semantic' check follows each successful CDIF parse. A portion of each text is examined for errors such as those described above, almost always against the original printed version (in the case of written texts). Each error that occurs is corrected, as are any similar errors which can easily be identified in the rest of the text. If this examination shows that a lot of manual correction will be required to bring the text up to standard, the text may be `bounced' -- that is, returned to its original sender. They may decide to correct it, or simply provide other appropriate texts in its place. Such constraints are designed to ensure that the production line keeps moving at a reasonable rate. Full correction of every badly-encoded text would, unfortunately, cost too much time and effort.

Adding header information

Another task carried out at OUCS is the addition of a CDIF `header' which supplies bibliographic and other information at the beginning of every text (Dunlop 1992b). This includes the title, the publisher, the date and place of publication, the age, sex, and regional origin of the author, information about the sample size, and so on. For spoken material, details are given about the people who participated in the conversations and activities recorded. Much of this information comes to OUCS as part of the electronic texts prepared by the publishers, and it is also stored in a database. Including it in a header for each text allows researchers to find out more about each text; including it in a database means that researchers can extract sub-corpora from the main 100-million word corpus. These sub-corpora can be designed according to the researcher's own needs. The database also means that while the corpus is being constructed, a continuous check can be made on the way the stipulated design criteria are being met. If, for example, too few books by female writers from North-east England have been added to the corpus, then the publishers who supply text to OUCS can be alerted and take steps to remedy the imbalance in the corpus.
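The two uses of the header database described above, extracting sub-corpora and monitoring the balance of the corpus, can be sketched in a few lines of Python. The records, field names, and word counts below are invented for the example, and the functions `select` and `word_totals` are hypothetical; the real database holds far more detail.

```python
from collections import Counter

# Invented header records; the field names are assumptions for this sketch.
HEADERS = [
    {"code": "T01", "author_sex": "female", "region": "North-east England", "words": 40000},
    {"code": "T02", "author_sex": "male", "region": "London", "words": 35000},
    {"code": "T03", "author_sex": "female", "region": "London", "words": 25000},
]

def select(headers, **criteria):
    """Return the records that match every field=value criterion."""
    return [h for h in headers
            if all(h.get(field) == value for field, value in criteria.items())]

def word_totals(headers, field):
    """Total the word counts of the records for each value of `field`."""
    totals = Counter()
    for h in headers:
        totals[h[field]] += h["words"]
    return totals

if __name__ == "__main__":
    sub_corpus = select(HEADERS, author_sex="female", region="London")
    print([h["code"] for h in sub_corpus])
    print(word_totals(HEADERS, "author_sex"))
```

Comparing totals of this kind with the targets in the design criteria is one way in which an imbalance could be noticed while collection is still in progress.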

Further processing

After a text has passed the CDIF and semantic checks satisfactorily, it is sent to UCREL in Lancaster. There syntactic tagging is carried out before the texts are returned to OUCS for one last CDIF conformance check. Once this has been done, the text becomes an official part of the corpus.

BNC hardware and software facilities at OUCS

There are three full-time staff working for the BNC at OUCS, with a wide variety of skills and interests. There is therefore a correspondingly wide range of software tools in use to carry out the work described above.

Hardware

Processing power comes from two Sun Microsystems Sparcstation 2 machines running Sun's UNIX operating system. Given the large memory (32 megabytes) and processing speed (28 mips) of these machines, long texts can be processed quickly. Hard disk storage space currently amounts to two gigabytes; this will shortly be doubled.

Software

For processing text, standard UNIX tools such as awk (Aho 1988) and sed (Dougherty 1991) are in frequent use, along with perl (Wall 1991) when required.

Also used extensively is the ICON programming language. It was developed at the University of Arizona under Ralph and Madge Griswold, and is particularly suited to the manipulation of character strings -- making it an ideal tool for the re-formatting and encoding of text corpora (Griswold & Griswold 1990). It is available for a wide range of machines and operating systems, and is in the public domain.

The main SGML tool used is a public-domain parser called SGMLS (Clark 1992). Using the emacs text editor, the parser can be called upon to analyse the text during an editing session on that text. This speeds up the checking and correction process considerably.

The bibliographic database is implemented with the INGRES database management system, which is available to OUCS under a local site licence agreement (Ingres 1989).

Mark-up for the British National Corpus

As a three-year project with budget constraints and ambitious data collection targets, the BNC cannot be over-ambitious in the amount or complexity of the mark-up that it applies to captured text. The mark-up which is applied relates to content of written and spoken texts at a variety of levels:

-- Character level: The corpus is held as plain ASCII text (strictly, it uses the International Reference Version of ISO 646:1990 (ISO 1990)). Characters outside the limited set permitted by this standard are represented by mark-up.

-- Word level: Word-class tagging is applied to each word in the corpus.

-- Phrase level: A small selection of the texts (the `core corpus') in the BNC has tagging at the phrase level with parse tree analysis.

-- Sentence level: The word-class tagging process divides all texts in the corpus into segments, which correspond closely to sentences in running text. Segments are also used in a reference system which allows a unique reference to be generated for any segment in the corpus.

-- Structural level: Where appropriate and possible, the structure of each document --- consisting of chapters, sections, paragraphs, or similar elements --- is marked.

-- Text level: Each text in the corpus is accompanied by a comprehensive header giving bibliographic information, and listing the criteria by which the text was selected for inclusion in the corpus.
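The segment reference system mentioned under the sentence level can be illustrated with a short Python sketch. It assumes that a reference is formed from a text's code name and the position of the segment within that text; the format `CODE.N`, the code name `X01`, and the function `segment_refs` are hypothetical, as the form of the references is not specified here.

```python
def segment_refs(text_code, segments):
    """Pair each segment with a reference built from the text's code name
    and the segment's position in the text, counting from 1."""
    return [(f"{text_code}.{number}", segment)
            for number, segment in enumerate(segments, start=1)]

if __name__ == "__main__":
    for ref, segment in segment_refs("X01", ["I used to smoke.", "Thought as much."]):
        print(ref, segment)
```

Because code names are unique across the corpus, references built in this way are unique as well.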

CDIF and the recommendations of the Text Encoding Initiative

As has been stated, it was decided at the outset to use SGML in order that consistent mark-up could be applied throughout the corpus. Further, the recommendations of the TEI were to be adhered to where possible.

In some respects, SGML is not itself a mark-up language; rather, it is a language in which mark-up languages may be defined. Consequently, it is possible using SGML to express two functionally identical mark-up languages which are nevertheless incompatible because, for example, they use different names for the same element, or because they use different character sets. Such incompatibilities would make it difficult for researchers using the two schemes to exchange data sets, so the use of SGML alone does not provide a solution to the problem caused in the past by lack of mark-up standardization. (It does, however, address the problem of a lack of common tools: subject only to capacity limitations, and to the ability to handle optional extensions to the base standard, any SGML-aware tool can process any document marked up using SGML.)

The TEI sets out to define an application of SGML which minimizes incompatibilities between the mark-up used by different researchers, while allowing both subsetting and extension. Its recommendations try to describe a spectrum of SGML document type definitions (DTDs) which may be applied to a wide variety of text types, defining mark-up which will facilitate the use and, importantly, the exchange of marked-up text for a wide variety of scholarly and didactic activities. Sperberg-McQueen and Burnard (1992a) divides the features that particular researchers might want to address into a number of subsets, recommending the manner in which tagging should be applied, and giving names which should be used for tags marking particular types of element. Those following the recommendations are free to implement as much or as little of each subset as is required for their application, and may use tags of their own devising to mark elements not described in the recommendations. Thus, a TEI-conformant mark-up may be characterized by the extent to which it implements each subset of the recommendations.

Broadly, CDIF provides a relatively sparse implementation of the text body tagging described by the recommendations; a complete (and, indeed, extended) implementation of the text and corpus header recommendations; and a medium level of word-class tagging. The subsections which follow give more detail:

Characters and character sets

While the language of the BNC is modern British English, which can generally be represented in ISO 646 IRV, there is a need to represent accented Roman letters, the Greek alphabet, and a variety of printers' marks, such as em-dashes, degree signs, and bullets. An annex to SGML (ISO 1986) provides `public entity sets' which address almost all of the needs of the BNC, with marks such as &aacute; (small letter a with acute accent), &mdash; (em dash), and &deg; (degree sign). Only a few additional marks have been introduced. These include &ft; and &inch; (prime and double prime used to indicate measurements in feet and inches respectively); and &bquo; and &equo; for normalized beginning and ending quotation marks, replacing the variety of marks used for this purpose in the original texts. Dunlop (1992a) lists the marks (entities) used in the BNC.
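The replacement of characters by entity references can be sketched as follows in Python. The table covers only the marks named in this section; pairing the BNC-specific entities with particular input characters (for example, treating curly double quotation marks as the marks to be normalized) is an assumption of the sketch, and the function `to_entities` is hypothetical. The complete list of entities is given in Dunlop (1992a).

```python
# Characters mapped to entity references. The first three names come from the
# ISO 8879 public entity sets; the last four are the BNC additions named above.
# Which input characters correspond to the BNC additions is an assumption.
ENTITIES = {
    "\u00e1": "&aacute;",  # small letter a with acute accent
    "\u2014": "&mdash;",   # em dash
    "\u00b0": "&deg;",     # degree sign
    "\u2032": "&ft;",      # prime, for feet
    "\u2033": "&inch;",    # double prime, for inches
    "\u201c": "&bquo;",    # beginning quotation mark
    "\u201d": "&equo;",    # ending quotation mark
}

def to_entities(text):
    """Return `text` with listed characters replaced by entity references.

    Plain 7-bit characters pass through unchanged; any other character
    raises ValueError so that it can be dealt with by hand.
    """
    out = []
    for ch in text:
        if ch in ENTITIES:
            out.append(ENTITIES[ch])
        elif ord(ch) < 128:
            out.append(ch)
        else:
            raise ValueError(f"no entity defined for {ch!r}")
    return "".join(out)

if __name__ == "__main__":
    print(to_entities("\u201cIt was 20\u00b0 outside\u201d"))
```

Raising an error for unlisted characters, rather than dropping them silently, makes gaps in the table visible as soon as they are met.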

Indication of language and dialect

The works in the BNC inevitably contain words, phrases or passages in languages other than modern British English. They may be in non-British English, other modern languages, archaic English or other languages, or dead languages. Some written works also contain representations of modern British dialects, or of English spoken with a non-British accent. Additionally, some of those whose speech is transcribed in the spoken part of the corpus speak with regional or ethnic accents. The project does not have the resources to undertake the very difficult task of marking each departure from standard British English in its texts. Besides, in many cases such judgements would inevitably be subjective. Consequently, although CDIF provides a means by which shifts in language may be marked, this mark-up is not applied in practice. Instead, languages seen during the process of semantic checking of a written text are noted in its header. (Language which cannot be represented using the marks available --- for example, Hebrew or Japanese --- is deleted.) For spoken texts, no attempt is made to provide a phonetic or prosodic representation of the transcribed speech; words are regularized to standard British spelling. (An exception is made in the case of words which appear in the project's control lists of vocalized pauses, and regional and dialectal usages.)

Bibliographic control

The TEI P2 recommendations (Sperberg-McQueen and Burnard, 1992a) require any conformant text to have a header which, at a minimum, gives brief bibliographic information about the electronic text and its source. Where many texts are assembled to make up a corpus, TEI P2 describes a separate corpus header, which gives bibliographic information about the corpus, and information which is common to all the texts it embodies. In the BNC, both corpus header and text headers approach the maximum level of detail provided for in TEI P2, and in some respects exceed it. Reasonably comprehensive bibliographic information about text titles and authors is provided, along with the detailed information needed to define and enforce the selection criteria used in deciding which texts should be included in the corpus. (See British National Corpus 1991a, 1991b.) Headers also describe the processing undergone by each text, state the restrictions on the use of each text, and identify the holders of copyrights around the world. For spoken texts, headers provide as much demographic information about participants as possible.

`Common tags'

TEI P2 provides a rich set of tags which are expected to be applicable to most conformant texts. Examples are <p> to mark paragraphs, and a variety of <div>s to mark higher-level structure. CDIF provides for the use of many of these `common tags', although, as described in Tag Classification below, there is no requirement that all features for which CDIF defines a tag be identified in any given text.

Analytic and Interpretive Information

Each word in the BNC has a class assigned to it by CLAWS, a probabilistic tagger (see Garside et al. 1987). A companion paper in this volume (Eyes 1992) discusses this process. TEI P2 describes a general mechanism which uses `feature structures' to mark parts of speech and other features of any language. Used directly, feature structures can be extremely verbose, and provide for the encoding of far more information than is necessary to characterize modern English, or can be captured by a mainly-automatic tagging process. Consequently, the BNC uses a relatively small set of short `tags' (actually SGML entity references), which either expand to or point to `canned' feature structure definitions. (The exact mechanism to be used in the final corpus has not been determined at the time of writing.) A list of the entities may be found in Burnard (1992c) or Leech (1992a); examples of corresponding feature structures are in Langendoen (1992). Leech (1992b) lists the entities used in the million-word core corpus, a selection of BNC texts subjected to a more detailed analysis.

Tags for Specific Text Types

It is the BNC's intention to mark many varieties of text --- transcribed speech, books, plays, periodicals, letters, handbills... --- using a common tag set. Consequently, CDIF embodies few of the provisions for specific text types described in TEI P2. Areas for which specific special-purpose tagging is defined include poetry, drama, and, importantly, the transcription of spoken material. In the latter, tags exist to handle overlap, truncation, manner of delivery, and a variety of vocal and non-vocal events.

Text and corpus headers

TEI P2 describes a text header applicable to any conformant text, and a corpus header which is similar in structure, but applicable only to conformant corpora. The BNC provides a full implementation of both, with the intention of allowing researchers to build sub-corpora reflecting some feature or combination of features of the texts represented in the corpus. Dunlop (1992b) describes the BNC headers in detail.

Linking of related material

As material has been collected for the BNC, it has become apparent that there are many situations where it would be desirable to link a number of texts --- or subsections of a single text --- together because they have some characteristic in common. To give two examples, the same principal speaker will appear in many spoken corpus texts; a single reporter may contribute more than one article to a given edition of a newspaper. While such common features may be established by examination of text and corpus headers, CDIF provides no means of making such links explicit. At the time of writing, the TEI is considering proposals as to how this might be done; however, resolution will come too late for CDIF.

Tag classification

In discussions with those responsible for data collection for the BNC, it became apparent that it would not be possible to provide a uniform level of mark-up across the whole corpus. For example, when text is captured using optical character recognition, it is cheap and easy to capture changes in type style, but manual intervention is required to mark poetry, and to insert footnote text at its point of reference. Where text is rekeyed, changes in type style may go unnoticed, but the transcriber can handle poetry and notes accurately and with relative ease.

Consequently, it was decided to divide CDIF text tags into three categories:

Required tags, which must be used to mark particular types of feature if those features appear in a text. (In some cases, as an alternative to tagging, the content of the feature may be silently deleted from the electronic transcription: footnotes are a case in point. The editorial practices declaration in each text header describes the treatment of such features.) Examples of required tags are <p>, to mark paragraphs in written text; <u>, to mark spoken utterances; and <note>, to mark foot-, end-, or side-notes, or editorial comments inserted during BNC processing.

Recommended tags, which are not mandatory, but highly desirable. Often these mark text features which could cause anomalous results in corpus-based research if their presence were not noted. Examples are lists (marked with <list>); poetry (<poem>) and material written to be spoken (<sp>).

Optional tags, which may appear if sufficient information has been captured from the original text, or if their use resolves some problem identified during syntactic or semantic checking. Examples include <hi> to describe text rendition (no attempt is made to interpret the semantic reason for changes in rendition); <quote> to show quotation of material written by some person other than the main author of a text; and <cite> to enclose the citation within a text of another work.

Burnard (1992b) summarizes the division of CDIF tags into these three categories. The text header (Dunlop 1992b) lists the tags used in a particular text.
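A list of the tags used in a text, grouped by category, could be drawn up along the following lines. This Python sketch classifies only the example tags named in this section (with <quote> taken from the optional mark-up of figure 5); the full classification is in Burnard (1992b). The function `tags_used` is hypothetical, and a simple pattern match stands in for a proper SGML parse.

```python
import re

# Categories for the example tags named in this section only.
CATEGORY = {
    "p": "required", "u": "required", "note": "required",
    "list": "recommended", "poem": "recommended", "sp": "recommended",
    "hi": "optional", "quote": "optional", "cite": "optional",
}

# Matches the name of a start-tag; end-tags ("</...") do not match.
START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)")

def tags_used(text):
    """Group the tag names found in `text` by category.

    Tags not listed in CATEGORY are reported as "unclassified".
    """
    used = {}
    for name in START_TAG.findall(text):
        name = name.lower()
        used.setdefault(CATEGORY.get(name, "unclassified"), set()).add(name)
    return used

if __name__ == "__main__":
    sample = "<p>See <hi r=it>this</hi><list><item>one</item></list>"
    for category, names in sorted(tags_used(sample).items()):
        print(category, sorted(names))
```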

A written example

    1. A sample text
    <div>s contain paragraphs; they may also contain

    a) Lists

    b) Poems --- such as
    "It's only words,
    and words are all I have..."

    c) Lower-level <div>s

    1.1 A sub-section
    Contents of the sub-section*.

    * Two further levels are supported

Figure 2. Written example in source form

    <!DOCTYPE cdif SYSTEM "cdif1.2.dtd" [ ]>
    <cdif><header>
    A sample text containing only mandatory mark-up
    </header>
    <text>
    <div1><head>A sample text</head>
    <p><div>s contain paragraphs;
    they may also contain

    a) Lists

    b) Poems &mdash; such as
    "It's only words,
    and words are all I have&hellip;"

    c) Lower-level <div>s
    <p>A sub-section
    <p>Contents of the sub-section.
    </text></cdif>

Figure 3. Written example with required CDIF mark-up

Figure 2 shows a sample of text as it might appear on the printed page. In figure 3, the same text appears with only required CDIF mark-up added. Note that information about rendition and structure below the top level is not recorded, and the footnote has been silently deleted.

    <!DOCTYPE cdif SYSTEM "cdif1.2.dtd" [ ]>
    <cdif><header>
    A sample text containing required and recommended mark-up
    </header>
    <text>
    <div1><head>A sample text</head>
    <p><div>s contain paragraphs; they may also contain
    <list><label>a)</label><item>Lists</item>
    <label>b)</label><item>Poems &mdash; such as
    <poem><l>"It's only words,
    <l>and words are all I have&hellip;"
    </poem></item>
    <label>c)</label><item>Lower-level <div>s</item></list>
    <div2><head>A sub-section</head>
    <p>Contents of the sub-section
    <note place=foot>Two more levels are allowed</note>.
    </text></cdif>

Figure 4. Written example with required and recommended CDIF mark-up.

While the text of figure 3 would be acceptable for inclusion in the corpus in this form, most corpus texts show more complete tagging, as shown in figure 4. Here some recommended tags are added, fully describing the text structure, and identifying the list and poem fragment that it contains. The footnote also appears, tagged at its point of reference.

    <!DOCTYPE cdif SYSTEM "cdif1.2.dtd" [ ]>
    <cdif><header>
    A sample text containing required, recommended and optional mark-up
    </header><text>
    <div1><head>A sample text</head>
    <p><div>s contain <hi r=it>paragraphs</hi>;
    they may also contain
    <list><label>a)</label><item r=it>Lists</item>
    <label>b)</label><item><hi r=it>Poems</hi> &mdash; such as
    <quote><poem><l>It's only words,
    <l>and words are all I have&hellip;
    </poem></quote></item>
    <label>c)</label><item r=it>Lower-level <div>s</item></list>
    <div2><head>A sub-section</head>
    <p>Contents of the sub-section
    <note place=foot>Two more levels are allowed</note>.
    </text></cdif>

Figure 5. Written example with required, recommended and optional CDIF mark-up.

The addition of optional mark-up, shown in figure 5, provides information about text rendition, and indicates that the poem fragment is a quotation.

If the example were an actual corpus text, it would also be marked up with part-of-speech and segmentation information, independent of the level of other tagging applied. See Eyes (1992) for further details.

A spoken example

Tom: I used to smoke (coughs) ...
Dick: (interrupting) Thought as much.
Tom: (continuing) but I never inhaled.

Figure 6: A spoken example represented as a dramatic script.

<!DOCTYPE cdif SYSTEM "cdif.dtd" [ ]>
<cdif><header>
A sample spoken text 
</header>
<stext><align><loc id=P1><loc id=P2></align>
<div>
<u who=Tom>I used to smoke 
<ptr t=P1><vocal desc=cough dur=5><ptr t=P2> 
but I never inhaled.
<u who=Dick><ptr t=P1>Thought as much.<ptr t=P2>
</div></stext></cdif>

Figure 7: Spoken example with CDIF mark-up.

Figure 6 shows an example spoken text, set out as if in the printed script of a play. The main feature of the example is the overlap between the utterances of the two speakers. As figure 7 shows, this is handled by marking the start and end of the period of overlap in each utterance. An `alignment map' at the start of the text shows the ordering in time of the starting and ending marks, and so indicates which utterances overlap which others. The transcription method used for spoken material (see Crowdy 1991) correctly captures up to three simultaneous utterances. (In an actual corpus text, the identifiers used for each alignment map location would be longer than those in the example, as they must be unique across the whole corpus.)
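To show how the overlap marking of figure 7 might be used, the following Python sketch reads the utterances of that figure and reports which pairs of speakers overlap. It makes simplifying assumptions: pointers are taken in start and end pairs within each utterance, two utterances are held to overlap only when they refer to the same pair of locations, and the ordering information in the alignment map is not consulted. The function `overlaps` is hypothetical, and pattern matching stands in for a proper SGML parse.

```python
import re

# The two utterances of figure 7, without the surrounding document mark-up.
SAMPLE = """<u who=Tom>I used to smoke
<ptr t=P1><vocal desc=cough dur=5><ptr t=P2>
but I never inhaled.
<u who=Dick><ptr t=P1>Thought as much.<ptr t=P2>"""

def overlaps(stext):
    """Return (speaker, speaker, start, end) for each pair of utterances by
    different speakers that point to the same start and end locations."""
    # Splitting on the start-tag gives ['', speaker, body, speaker, body, ...].
    parts = re.split(r"<u who=(\w+)>", stext)
    spans = []
    for who, body in zip(parts[1::2], parts[2::2]):
        locs = re.findall(r"<ptr t=(\w+)>", body)
        # Assumption: pointers come in (start, end) pairs within an utterance.
        for start, end in zip(locs[0::2], locs[1::2]):
            spans.append((who, start, end))
    found = []
    for i, (who_a, start_a, end_a) in enumerate(spans):
        for who_b, start_b, end_b in spans[i + 1:]:
            if who_a != who_b and (start_a, end_a) == (start_b, end_b):
                found.append((who_a, who_b, start_a, end_a))
    return found

if __name__ == "__main__":
    print(overlaps(SAMPLE))
```

A fuller treatment would use the alignment map to compare locations in time, so that partial overlaps between differently marked periods could also be found.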

<!DOCTYPE cdif SYSTEM "cdif.dtd" [ ]>
<cdif><header>
A sample dramatic text
</header><text>
<sp><spkr>Tom</spkr><p>I used to smoke
<stage>coughs</stage>
<sp><spkr>Dick</spkr><stage>interrupting</stage>
<p>Thought as much.
<sp><spkr>Tom</spkr><stage>continuing</stage>
<p>but I never inhaled.
</text></cdif>

Figure 8: Example with CDIF dramatic text mark-up

For completeness, figure 8 shows the example text of figure 6 marked up as it would be if it had been captured from a printed dramatic script.

Lessons learned

As one of the first attempts to build a large, balanced corpus with uniform, SGML-based mark-up, the BNC project was bound to encounter unforeseen difficulties and tasks which took longer than had been anticipated.

Involvement with the TEI has many benefits, but, in these relatively early days, has often necessitated waiting for recommendations to appear before CDIF mark-up specifications can be frozen. This is a two-way process: experience with issues raised by the BNC has informed several aspects of the TEI's work, and the close attention of several of the technical experts who contribute to the TEI recommendations has been of great assistance to the BNC project.

Future TEI-conformant corpus-building projects will not be burdened by many of the issues which the BNC, as an early user, has encountered. However, any builder of a large corpus faces the problem of converting texts from a variety of source formats into a uniform electronic format prior to accession to the corpus. Experience on the BNC project indicates that the earlier in this process some mechanically verifiable form of quality control can be introduced, the better.

References

Aho, Alfred V; Kernighan, Brian W & Weinberger, Peter J. 1988. The Awk programming language. Addison Wesley. Reading, Massachusetts.

British National Corpus. October, 1991. TGAW14, Spoken corpus design specification. BNC working paper.

British National Corpus. September 1991. BNCW08, Written corpus design specification. BNC working paper.

Burnage, Gavin. July 1992. TGCW26, Is the conversion of Longman/Lancaster texts to CDIF possible? BNC Working Paper.

Burnage, Gavin. August, 1992. TGCW35, Corpus text processing: directory structures and file names.

Burnard, Lou. March 1992. TGCW27, BNC acceptance procedures --- Draft OUCS proposals. BNC Working Paper.

Burnard, Lou, September 1992. TGCW30, Corpus Document Interchange Format, version 1.2, BNC working paper.

Clark, James. October 1992. sgmls 1.0 available. Announcement on comp.text.sgml newsgroup.

Clear, Jeremy. July 1992. TGCW33, BNC Data Capture: OUP format definition for text handover to OUCS. BNC Working Paper.

Crowdy, Steve et al. December 1991. TGCW21, Spoken corpus transcription guidelines. BNC working paper.

Davis, Caroline. July 1992. TGCW04 annex. Corpus markup: Codes for freelancers and scanner operators. OUP/BNC Working Paper.

Dougherty, Dale. 1991. sed & awk. O'Reilly & Associates. Sebastopol, California.

Dunlop, Dominic. March 1992. TGCW25, Mark-Up for non-ISO 646 invariant part characters. BNC working paper.

Dunlop, Dominic. September, 1992. TGCW34, The relationship between the TEI.2 header and the BNC corpus and text headers. BNC working paper.

Dunlop, Dominic. October, 1992. TGCW36, The new BNC database. BNC working paper.

Eyes, Elizabeth & Leech, Geoffrey. 1992. Improving corpus annotation practices. [This ICAME report.]

Garside, Roger; Leech, Geoffrey; Sampson, Geoffrey. 1987. The computational analysis of English: a corpus-based approach. Longman. London.

Goldfarb, Charles. 1990. The SGML handbook, Oxford University Press.

Griswold, Ralph E & Griswold, Madge T. 1990. The Icon programming language, second edition. Prentice Hall. New Jersey.

Ingres Corporation. 1989. Introducing Ingres. Alameda, California.

International Organization for Standardization. 1986. ISO 8879:1986, Standard Generalized Markup Language (SGML). (Included in Goldfarb 1990.) Geneva.

International Organization for Standardization. 1990. ISO 646:1990, 7 bit coded character set for information exchange. Geneva.

Langendoen, Terry. January 1992. TGDW09, Preliminary feature structure definition for CDIF. BNC working paper.

Leech, Geoffrey. April 1992. TGDW08, Revised proposal for basic grammatical tagset. BNC working paper.

Leech, Geoffrey. September 1992. TGDW11, Proposal for enriched grammatical tagset. BNC working paper.

Sperberg-McQueen, C.M. and Burnard, Lou (eds.). 1992 (forthcoming). TEI P2, Recommendations of the Text Encoding Initiative. Chicago & Oxford.

Wall, Larry & Schwartz, Randal L. 1991. Programming perl. O'Reilly & Associates. Sebastopol, California.

British National Corpus working papers are available on request in printed or electronic form from the authors.

The sgmls program and published TEI papers are available by anonymous FTP from archives at sgml1.exeter.ac.uk. The Icon language processors are available by anonymous ftp from cs.arizona.edu. The perl language is available by anonymous ftp from many sites, including ftp.uu.net and doc.ic.ac.uk. In addition to UNIX, all of these languages support MS-DOS, VMS, and a number of other operating environments.