Entry, Encoding, Evaluation: Textual Markup

Textual markup is concerned with the structure of the text: its sections, paragraphs and sentences, for example. Encoding such features in a standardized way helps ensure that the corpus will be usable whatever the local computing set-up.

Material gathered by OUP and Longman was initially transcribed according to their own markup schemes. It was then sent to Oxford University Computing Services, where the markup of each text was converted into the standard, BNC-wide scheme called CDIF (Corpus Document Interchange Format). CDIF uses SGML (Standard Generalized Markup Language), implemented in broad accordance with the recommendations of the TEI (Text Encoding Initiative), so that the corpus can be reused around the world by many different researchers using different types of machines and software. The markup conventions used are specified in the formal CDIF DTD (Document Type Definition).

Linguistic information is added subsequently at Lancaster University, where basic syntactic codes are generated by a modified version of the CLAWS2 parser and tagger. CLAWS output normally comes in its own markup style, so this information is also converted to CDIF format.

While all the CDIF tags, both structural and linguistic, are intended to be an aid to research, they can be rewritten or removed according to the purpose of the researcher.
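
As a rough illustration of removing or filtering tags, the following Python sketch strips SGML-style tags from a line of text, or keeps only a chosen subset of them. The sample line, its tag names and the helper names strip_tags and keep_only are hypothetical and chosen for illustration; they are not taken from the CDIF DTD. The sketch also assumes that "<" and ">" never occur in the text itself, which a real SGML parser would not need to assume.

```python
import re

# Matches any SGML-style start or end tag, e.g. <s n=1> or </p>.
# Group 1 captures the element name. Simplifying assumption: the
# characters "<" and ">" never appear inside the text itself.
TAG = re.compile(r"</?(\w+)[^>]*>")


def strip_tags(marked_up: str) -> str:
    """Remove every tag and collapse the whitespace left behind."""
    text = TAG.sub("", marked_up)
    return " ".join(text.split())


def keep_only(marked_up: str, names: set[str]) -> str:
    """Remove all tags except those whose element name is in `names`."""
    def replace(match: re.Match) -> str:
        return match.group(0) if match.group(1) in names else ""
    return TAG.sub(replace, marked_up)


# Hypothetical sample line: structural tags (<p>, <s>) plus
# word-level tags (<w ...>, <c ...>) carrying illustrative codes.
sample = ("<p><s n=1><w AT0>The <w NN1>corpus <w VBZ>is "
          "<w AJ0>large<c PUN>.</s></p>")

plain = strip_tags(sample)               # text with no markup at all
sentences = keep_only(sample, {"s"})     # sentence boundaries only
```

Here strip_tags suits a researcher who wants plain running text, while keep_only suits one who needs, say, sentence boundaries but not word-level codes.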

The article "Encoding the British National Corpus", written by Dominic Dunlop and Gavin Burnage while the BNC was under development, describes the textual markup used in extensive detail.
