The SARA system

SARA (SGML-Aware Retrieval Application) is a client/server software tool allowing a central database of texts with SGML mark-up to be queried by remote clients. The system was developed at Oxford University Computing Services, with funding from the British Library Research and Development Department (1993-4) and the British Academy. The original motivation for its development was the need to provide a robust low-cost search-engine for use with the 100 million word British National Corpus, and several features of the system design necessarily reflect this.

The SARA system has four key parts:

the indexing program, which generates an index of tokens from an SGML marked-up text
the server program, which accepts messages in the Corpus Query Language (see below) and returns results from the SGML text
the SARA protocol, a formally defined set of message types which determines legal interactions between the client and server programs; this protocol makes use of a high-level query language known as CQL (for Corpus Query Language)
one or more client programs, with which a user interacts in any appropriate platform-specific way, and which communicate with the server program using the protocol

At the time of writing, the server program, the protocol, and a client for Microsoft Windows are all relatively stable, and are included in general releases of the BNC. The server program and the protocol are freely available for non-commercial usage. Exact licensing conditions are yet to be determined for the rest of the system.

The SARA Index

The SARA index contains entries for each lexical token identified in the text, and also for each occurrence of an SGML start- or end-tag. Entries for lexical tokens do not distinguish upper and lower case letters, but do distinguish amongst the different part-of-speech (POS) codes associated with the word. For example, "Can (noun)" and "can (noun)" will be indexed together, and distinctly from "can (modal verb)". For very high frequency tags, such as <p> or <s>, special secondary indexes (known as accelerator files) can be maintained to improve performance. In principle, this mechanism could also be used for POS codes, which are not indexed in the BNC implementation.

Because SGML tags are indexed together with lexical tokens, most queries relating to SGML structure can be readily translated into regular expressions. For example, a search for "foo" within the context of a <bar> element must be treated as a search for the start-tag <bar>, the term "foo", and the end-tag </bar> in sequence. The indexer does not explicitly record any information about the SGML document hierarchy.
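The two indexing rules above (case-folded keys that keep POS codes distinct, and tags indexed in the same sequence as words) can be illustrated with a minimal Python sketch. It is not the SARA indexer: the token stream, the function names, and the VM0 code for the modal verb are assumptions made for the example, and it assumes that <bar> elements do not nest.

```python
from collections import defaultdict

# Hypothetical token stream: words and tags share a single sequence.
STREAM = [
    ("tag", "<bar>", None),
    ("word", "Can", "NN1"),
    ("word", "foo", "NN1"),
    ("tag", "</bar>", None),
    ("word", "can", "VM0"),
    ("word", "foo", "NN1"),
]

def build_index(stream):
    """Map each index key to the list of stream positions where it occurs."""
    index = defaultdict(list)
    for position, (kind, form, pos) in enumerate(stream):
        if kind == "word":
            key = (form.lower(), pos)  # case is folded, POS codes stay distinct
        else:
            key = (form, None)         # tags are indexed alongside the words
        index[key].append(position)
    return index

def within_element(index, word_key, name):
    """Positions of word_key between a <name> tag and the next </name> tag."""
    hits = []
    ends = index.get(("</" + name + ">", None), [])
    for start in index.get(("<" + name + ">", None), []):
        end = next((e for e in ends if e > start), None)
        if end is None:
            continue  # unclosed element: nothing to report
        hits.extend(p for p in index.get(word_key, []) if start < p < end)
    return hits

INDEX = build_index(STREAM)
print(INDEX[("can", "NN1")])                          # noun readings only
print(within_element(INDEX, ("foo", "NN1"), "bar"))   # foo inside <bar> only
```

Note that no hierarchy is stored anywhere: the containment test relies purely on the order of start-tag, word, and end-tag positions.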

The indexer is table driven and could in principle be used to index text using any document type definition. It has so far only been used with the BNC's dtd, which is a simplified version of that proposed by the Text Encoding Initiative.

For a running text of 100,000,000 words, the SARA indexer generated an index of some 150,000,000 distinct entries, occupying 33,000 files and 2.5 Gbytes. Indexing is carried out in the traditional manner, with a number of intermediate merge and sort phases. For each lexical item, both the byte offset within the text and the token count (i.e. word position number) are recorded, as neither alone would be sufficient for the full range of retrieval facilities. SGML elements and attributes are indexed without word numbers.
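A small Python sketch may help to show why both numbers are kept: the byte offset locates the token in the file so that context can be extracted, while the word number allows distances to be measured in words. The Posting record, the tokenizing rule, and the sample text are all assumptions for illustration, not the SARA index format.

```python
import re
from typing import NamedTuple

class Posting(NamedTuple):
    byte_offset: int  # where the token starts in the text, for extracting context
    word_number: int  # running token count, for distance-in-words tests

def index_words(data):
    """Record a (byte offset, word number) posting for every word in data."""
    postings = {}
    for word_number, match in enumerate(re.finditer(rb"\w+", data)):
        key = match.group().lower()  # case-folded key, as in the index above
        postings.setdefault(key, []).append(Posting(match.start(), word_number))
    return postings

def words_apart(a, b):
    """Distance between two postings, counted in words rather than bytes."""
    return abs(a.word_number - b.word_number)

TEXT = b"The cat sat. The dog sat near the cat."
POSTINGS = index_words(TEXT)

dog = POSTINGS[b"dog"][0]
first_cat = POSTINGS[b"cat"][0]
print(TEXT[dog.byte_offset:dog.byte_offset + 3])  # the token itself, via its offset
print(words_apart(dog, first_cat))                # proximity, via word numbers
```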

The SARA server

The server program is written in ANSI C, using BSD sockets to implement network connexions. It has been compiled under SunOS, Solaris, Digital Unix, and HP Unix.

In addition to servicing queries from one or more clients, the server program maintains access control facilities and logs system usage.

Two factors require consideration when allocating memory to the server program. The first is the number of concurrent client sessions to be permitted, since each such session spawns a new instance of the server program. The second is the likely complexity of queries to be answered, since each component of a query requires its own buffer. As supplied, the server will support an indefinite number of concurrent logins, the limit being reached when a new server cannot allocate its initial buffers. Up to 100 components per query are supported, though after the first twenty components, buffers for different components are deallocated, so that performance begins to deteriorate. If insufficient memory is available to start a new session (because existing sessions are using all that is available), logins are refused.

The SARA protocol and CQL

A full definition of the SARA protocol is included in the BNC reference manual. This brief description gives a flavour only of the features supported.

The server listens on a specified socket (usually 7000) for login calls from a client. When such a call is received, the server tries to create a process to accept further data packages. If it succeeds, the client is logged on and set-up messages are exchanged which define, for example, the names and characteristics of SGML elements in the server's database. Following this, the client sends queries in the Corpus Query Language, and receives data packets containing solutions to them. Once a connexion has been established in this way, the server expects to receive regular messages from the client, and will time out if it does not. The client sends keepalive packages to show that it is still active. The client can request the server to interrupt certain transactions prematurely. The following functions are amongst those carried out by the server:

supply bibliographic information for a given text identifier (BNC specific).
calculate collocation score for a given query.
look up tokens in the word list, explicitly or using a regular expression.
allocate one of a predefined set of filters to the result set produced by a query, for example to remove partial SGML tags or normalize white space.
find all possible POS codes for a given token (BNC specific).
extract content data from a given location in an SGML text.
log on, log off, change password, save, name, and reload queries.
solve a CQL query.
thin the result set from a CQL query (for example, to return a random selection of exactly n solutions).
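The last function in the list, thinning, is easy to picture with a short Python sketch. This is an illustration only: the function name, the seed parameter, and the decision to keep the selection in corpus order are assumptions, not part of the SARA protocol definition.

```python
import random

def thin(solutions, n, seed=None):
    """Return a random selection of exactly n solutions, kept in their original order."""
    if n > len(solutions):
        raise ValueError("cannot select more solutions than the result set holds")
    rng = random.Random(seed)                      # seed only makes runs repeatable
    chosen = rng.sample(range(len(solutions)), n)  # n distinct positions
    return [solutions[i] for i in sorted(chosen)]

# A hypothetical result set of 100 solution identifiers, thinned to ten.
RESULTS = list(range(100))
SAMPLE = thin(RESULTS, 10, seed=1)
print(len(SAMPLE))
```

Sampling positions rather than values means the thinned set never contains the same solution twice, even if the result set holds duplicates.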

The Corpus Query Language is a fairly typical Boolean-style retrieval language, but has a number of additional features particularly useful for corpus work. Its syntax is rebarbative, but regular.

A query is made up of one or more atomic queries. An atomic query may be one of the following:

a word, punctuation mark, or delimited string e.g. JAM, ?, "don't"
a word+POS pair, known as an L-word, e.g. CAN=NN1.
a phrase e.g. 'NOT ON YOUR LIFE'.
an existing (named) solution set. Names are allocated to queries by the server.
a regular expression.
an SGML query, that is, a search for a start- or end-tag. Attribute values may also be searched for.
the wildcard character _, which will match any single word.

Four unary operators are allowed in CQL:

case: The $ operator makes the query which is its operand case-sensitive.
header: The @ operator makes the query which is its operand search within headers as well as in the bodies of texts (it thus assumes that a TEI-conformant dtd is in use).
optional: The ? operator matches zero or one solutions to the query which is its operand; it makes no sense unless the query is combined with another.
not: The ! operator matches anything which is not a solution to the query which is its operand; it makes no sense unless the query is combined with another.

A CQL expression containing more than one query may use the following binary operators:

sequence: one or more blanks between two queries matches cases where solutions to the first immediately precede solutions to the second.
disjunction: The | operator between two queries matches cases where either query is satisfied.
join: The * operator between two queries matches cases where both queries are satisfied in the order specified; the # operator between two queries matches cases where both queries are satisfied in either order.

When queries are joined, the scope of the expression may be defined in one of the following ways:

SGML element: A join query followed by a / operator and an SGML query matches cases where the joined query is satisfied within the scope of the SGML query.
number: A join query followed by a / operator and a number matches cases where the joined query is satisfied within the number of words specified.

If no scope is supplied for a join query, the default scope is a single <bncDoc> element.

Some examples follow:

dog : finds the word DOG (dog or Dog)
dog=NN1 : finds the word dog as a singular noun
dog|cat : finds either the word dog or the word cat
{CAT.*} : finds words beginning with cat
$Dog : finds the word Dog (but not dog)
@dog : finds the word dog in headers as well as in texts
 : finds the SGML start-tag for head elements
 : finds SGML head elements whose type attribute has the
                  value sub

cat _ dog : finds three word phrases of which the first word is cat and
            the last is dog

!cat dog : finds occurrences of dog not preceded by cat

cat*dog : finds occurrences of cat followed anywhere within the same
          document by dog

cat#dog : finds occurrences of cat followed or preceded by dog anywhere
          within the same document

cat*dog/10 : finds occurrences of cat followed by dog within ten words
cat*dog/ : finds occurrences of cat followed by dog within a
                 single  element
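The meaning of some of these examples can be made concrete with a toy Python model that works over a plain list of words. It does not parse CQL and is not how the SARA server evaluates queries: the function names and the sample sentence are invented for the example, and the sketch assumes that a scope of n words means the second word lies at most n positions from the first.

```python
WORDS = "The cat and dog met but the Dog ignored the cat".split()

def find(word, case_sensitive=False):
    """Positions of a word: like `dog`, or like `$Dog` when case_sensitive is true."""
    if case_sensitive:
        return [i for i, w in enumerate(WORDS) if w == word]
    return [i for i, w in enumerate(WORDS) if w.lower() == word.lower()]

def sequence(pattern):
    """Start positions of an adjacent sequence; "_" matches any single word."""
    hits = []
    for start in range(len(WORDS) - len(pattern) + 1):
        window = WORDS[start:start + len(pattern)]
        if all(p == "_" or p.lower() == w.lower() for p, w in zip(pattern, window)):
            hits.append(start)
    return hits

def join(first, second, within, ordered=True):
    """Pairs (i, j) for first*second/within, or first#second/within if not ordered."""
    pairs = []
    for i in find(first):
        for j in find(second):
            gap = j - i
            if ordered and 0 < gap <= within:
                pairs.append((i, j))       # second follows first
            elif not ordered and 0 < abs(gap) <= within:
                pairs.append((i, j))       # either order is accepted
    return pairs

print(find("dog"))                          # dog : any capitalization
print(find("Dog", case_sensitive=True))     # $Dog : exact case only
print(sequence(["cat", "_", "dog"]))        # cat _ dog
print(join("cat", "dog", within=10))        # cat*dog/10
print(join("cat", "dog", within=10, ordered=False))  # cat#dog/10
```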

The SARA client

The current Windows client allows access to all features of the SARA protocol using a modern graphical user interface for MS Windows 3.1 and above. Its use is documented and illustrated in the BNC Handbook, which also includes a summary of the software's menu structure and functionality.

The SARA client is written in Microsoft Visual C++ and is currently available under beta test by ftp from the BNC project only.