Using SGML for Linguistic Analysis: the case of the BNC. By Lou Burnard. October 1996

1 Abstract

The British National Corpus (BNC) is a rather large SGML document, comprising some 4124 samples taken from a rich variety of contemporary British English texts of every kind, written and printed, famous and obscure, learned and ignorant, spoken and written. Each of its hundred million words and six and a quarter million sentences is tagged explicitly in SGML and carries an automatically-generated linguistic analysis. Each sample carries a TEI-conformant header, containing detailed contextual and descriptive information, as well as more conventional SGML mark-up.

The corpus was created over a four-year period by a consortium of leading dictionary publishers and academic research centres in the UK, with substantial funding from the British Department of Trade and Industry, the Science and Engineering Research Council, and the British Library. It is currently available under licence within the European Union only, where it is increasingly used in linguistic research and lexicography, in applications ranging from the construction of state-of-the-art language-recognition systems to the teaching of English as a second language.

This paper begins by describing how the corpus was constructed, and gives an overview of some of the SGML encoding issues raised during the process. It also describes the special-purpose, SGML-aware retrieval system developed to analyse the corpus.