Material gathered by OUP and Longman was initially transcribed
according to their own markup schemes. It was then sent to Oxford
University Computing Services, where the markup of each text was
converted into the standard, BNC-wide scheme called CDIF (Corpus Document Interchange Format). CDIF
uses SGML (Standard Generalized Markup Language), implemented in broad
accordance with the recommendations of the
TEI (Text Encoding Initiative),
to ensure the corpus can be re-used round the world by many different
researchers using different types of machines and software. The markup
conventions used are specified in the formal
Specifically linguistic information is added subsequently at Lancaster
University, where basic syntactic codes are generated by a modified
version of the CLAWS2 parser and tagger. CLAWS normally produces output
in its own markup style, so this information is then converted to CDIF.
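The conversion step can be pictured with a minimal sketch. It assumes, purely for illustration, that the tagger emits tokens in a `word_TAG` form and that the target markup wraps each token as `<w TAG>word` (or `<c TAG>` for punctuation). The function name `claws_to_sgml`, the tag names and both formats are hypothetical stand-ins, not the actual Lancaster conversion software or the exact CDIF element set.

```python
def claws_to_sgml(line: str) -> str:
    """Convert 'word_TAG' tokens into SGML-style '<w TAG>word' elements.

    Assumption: tokens are whitespace-separated and the tag follows the
    last underscore. Tokens without an underscore are passed through.
    """
    pieces = []
    for token in line.split():
        word, sep, tag = token.rpartition("_")
        if not sep:
            # No tag attached: keep the token unchanged.
            pieces.append(token)
            continue
        # Treat tokens with no letters or digits as punctuation (<c>),
        # everything else as a word (<w>). This split is an assumption.
        element = "w" if any(ch.isalnum() for ch in word) else "c"
        pieces.append(f"<{element} {tag}>{word}")
    return " ".join(pieces)


# Placeholder tags, not taken from the corpus itself.
print(claws_to_sgml("The_AT0 corpus_NN1 grows_VVZ ._PUN"))
```

A real converter would also have to handle multi-word units, ambiguity tags and character entities, which this sketch ignores.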
While all the CDIF tags, both structural and
linguistic, are intended to be an aid to research, they can be
rewritten or removed according to the purpose of the researcher.
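As a hedged illustration of removing markup, the sketch below strips every SGML-style tag from a line of text with a regular expression. The sample line and its tag names are invented for the example and are not taken from the corpus; `strip_tags` is a hypothetical helper, and the pattern assumes that no tag contains a literal `>` inside an attribute value.

```python
import re

# Matches any SGML-style tag such as <s n=1> or <w NN1>.
# Assumption: tags never contain a '>' character inside them.
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags(marked_up: str) -> str:
    """Remove all tags, then collapse runs of whitespace to single spaces."""
    text = TAG_PATTERN.sub("", marked_up)
    return " ".join(text.split())


# Invented sample line with placeholder tags.
sample = "<s n=1><w AT0>The <w NN1>corpus <w VBZ>is <w AJ0>large<c PUN>."
print(strip_tags(sample))
```

Character entities such as `&amp;` are left untouched here; a researcher who wants plain text would need to decode them in a separate step.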
An article on Encoding the British
National Corpus, written by Dominic Dunlop and Gavin Burnage
while the BNC was under development, describes the textual markup
in detail.