CREATING AND USING A MULTILINGUAL PARALLEL CONCORDANCER

[This local archive copy mirrored from the canonical site: http://sun1.bham.ac.uk/johnstf/paracon.htm; links may not have complete integrity, so use the canonical document at this URL if possible.]

Philip King and David Woolls

The University of Birmingham
English for International Students Unit
Birmingham B15 2TT

e-mail:
Philip King: [email protected]
David Woolls: [email protected]

1.1 Creating the concordancer

The Multilingual Parallel Concordancing project is supported by the European Union under its Lingua initiative, and is led by the University of Nancy II, France [1]. The objective of the project, which began in 1994, was to develop software for parallel concordancing that would enable a user to enter a search string in one language and find not only the citations for that string in the search language, but also the corresponding sentences in the target language. The software therefore operates on parallel translated texts, which constitute the corpus. Furthermore, since the project is specifically pedagogical in its aims, intended for language teachers and learners and for training translators, the software described below enables the creation of tests.

This paper describes in detail the Windows version of the software developed at the University of Birmingham by the second author, and outlines its potential use for students and trainers of translation.

1.2 The programming language and the program environment

The program was developed in C++, using Borland Turbo C++ for Windows. The Windows environment has the advantage of being multilingual by design, handling Western and Eastern European languages, Greek and Cyrillic. The Windows style of presentation is also likely to be familiar to users in all countries in the project. As the program is designed to be used by lecturers and students in the classroom, this was an important consideration.

The corpus itself is designed to conform with the Text Encoding Initiative (TEI), using Standard Generalised Markup Language (SGML) to encode the text. This is to allow the corpus to be used in UNIX and Macintosh environments, as well as on IBM PCs. The program uses the bare minimum of encoding to reduce the physical size of the distributed corpus, and to allow potential users to mark up their own texts relatively easily.

1.3 Truly parallel translations

If automatic alignment is to be possible at all, the two texts undergoing alignment must be in a recognisable relation with each other: each must contain the same number of paragraphs, and those paragraphs must have a one-to-one relationship with each other. The alignment takes place at sentence level within paragraphs, where one-to-many and many-to-one relations between the sentences are catered for.

The constraint of paragraph alignment is a manageable though time-consuming problem. It is sometimes necessary to identify omissions of material, insertions of other text, and re-aligning of paragraphs by translators. In addition, optical scanners sometimes react to features of text layout by creating unwanted paragraphs. Where the initial alignments in a set of parallel texts do not match, then adjustments in all languages are made in relation to the source language of each text.

1.4 Correct sentence alignment between language pairs

A caveat is necessary here. Sentence alignment is not an exact science. The algorithm selected for use with the program described here carries out alignment during concordancing on text which contains only the simplest mark up, and which is pre-aligned only in relation to the paragraphs. Although correct alignment in over 90% of cases is the target, the results are dependent on the nature of the original text. In particular, a text with long paragraphs is likely to produce less satisfactory results.

Why simple mark up?
Using text with a minimum of mark up means that it is simple to switch between language pairings without having to hold a lot of data about the different inter-relationships between the source language and all the translations, and between the translations themselves. It also means that any user with the patience to apply the minimal mark up rules to a pair of texts can make use of them immediately.

What is minimal mark up?
The algorithm needs to recognise two things: paragraph boundaries and sentence boundaries. As mentioned above, the original aim was to use the TEI guidelines on the application of SGML encoding. The only necessary elements for a program operating within a Windows environment are <p> for the start of a paragraph and </p> for the end, and similarly <s> </s> pairs to identify sentence boundaries. Minimal mark up allows the dropping of the </p> and </s>, leaving the texts marked simply as illustrated below.

<p><s> LE PETIT PRINCE
<p><s> I
<p><s> Once when I was six years old I saw a
magnificent picture in a book, called "True Stories
from Nature", about the primeval forest. <s>It was a
picture of a boa constrictor in the act of
swallowing an animal. <s>Here is a copy of the
drawing.
<p><s> In the book it said: "Boa constrictors swallow
their prey whole, without chewing it. <s> After that
they are not able to move, and they sleep through
the six months that they need for digestion."
<p><s> I pondered deeply, then, over the adventures
of the jungle. <s>And after some work with a
coloured pencil I succeeded in making my first
drawing. <s> My Drawing Number One. <s>It looked
like this:
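
A minimal sketch of how text in this reduced form might be read into paragraphs and sentences. It assumes that a paragraph opens with <p> and a sentence with <s>, with no closing tags; the function name is hypothetical, and this is an illustration in Python rather than the program's own code.

```python
import re

def parse_minimal_markup(text):
    """Return a list of paragraphs, each a list of sentence strings.

    Assumption: <p> opens a paragraph and <s> opens a sentence;
    closing tags have been dropped.
    """
    paragraphs = []
    # Everything between one <p> and the next belongs to one paragraph.
    for chunk in re.split(r"<p>", text):
        sentences = []
        for piece in re.split(r"<s>", chunk):
            # Collapse line breaks and runs of spaces left by the layout.
            sentence = " ".join(piece.split())
            if sentence:
                sentences.append(sentence)
        if sentences:
            paragraphs.append(sentences)
    return paragraphs

sample = """<p><s> I pondered deeply, then, over the adventures
of the jungle. <s>And after some work with a
coloured pencil I succeeded in making my first
drawing."""

for number, paragraph in enumerate(parse_minimal_markup(sample), start=1):
    print(number, paragraph)
```

Because only opening tags are needed, a user marking up a new text pair has to insert just two kinds of tag by hand.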

A corpus marked in such a way can be readily created from one containing full SGML mark up for accented characters and other markings, as well as more detailed annotation of each paragraph and sentence. Such a segment is illustrated below.

<div id=1>
<s id=d1p1s1> LE PETIT PRINCE</s><lb></lb>
<lb></lb>
<s id=d1p2s1> I</s><lb></lb>
<lb></lb>
<s id=d1p3s1> Once when I was six years
old I saw a magnificent picture in a book, called
"True Stories from Nature", about the primeval
forest.</s><s id=d1p3s2> It was a picture of a boa
constrictor in the act of swallowing an
animal.</s><s id=d1p3s3> Here is a copy of the
drawing.</s><lb></lb>
<lb></lb>
<s id=d1p4s1> In the book it said: "Boa
constrictors swallow their prey whole, without
chewing it.</s><s id=d1p4s2> After that they are not
able to move, and they sleep through the six months
that they need for digestion."</s><lb></lb>
<lb></lb>
<s id=d1p5s1> I pondered deeply, then,
over the adventures of the jungle.</s><s id=d1p5s2>
And after some work with a coloured pencil I
succeeded in making my first drawing.</s><s
id=d1p5s3> My Drawing Number One.</s><s id=d1p5s4>
It looked like this:</s><lb></lb>

The full markup preserves many features of the original layout and gives each sentence a unique identifier.
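
The reduction from full to minimal mark up could be sketched as follows. This assumes that sentence identifiers always follow the d<division>p<paragraph>s<sentence> pattern seen in the sample, that a change in the paragraph number marks a new paragraph, and that a paragraph start is written <p>; the function name is hypothetical.

```python
import re

# Opening sentence tag with an identifier such as d1p3s2
# (division 1, paragraph 3, sentence 2). The tag may be split across lines.
S_TAG = re.compile(r"<s\s+id=d(\d+)p(\d+)s(\d+)>")

def full_to_minimal(full_text):
    """Strip closing and layout tags, keeping only <p> and <s> openers."""
    # Remove sentence closers, line-break elements and division tags.
    text = re.sub(r"</s>|<lb></lb>|</?div[^>]*>", "", full_text)

    seen_paragraphs = set()

    def open_tag(match):
        paragraph = (match.group(1), match.group(2))
        if paragraph in seen_paragraphs:
            return "<s>"
        # First sentence of a paragraph not met before: mark its start.
        seen_paragraphs.add(paragraph)
        return "<p><s>"

    text = S_TAG.sub(open_tag, text)
    # Drop the blank lines left behind by the removed layout tags.
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

The unique identifiers are lost in the process, which is why the reduction runs in one direction only.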

The algorithm for alignment works on the basis of the following observations. Texts are generally translated in a linear fashion, not only in the succession of paragraphs, but in the succession of sentence content inside paragraphs. For the most part, short sentences are translated by short sentences and long sentences by long ones, or it may be that a long sentence is broken into short sentences, or several short sentences are combined into a longer sentence.

As long as these basic expectations are met, and there is not a great deal of overlapping of the various possibilities, it should be possible to judge simply from comparing the lengths of the sentences within any given paragraph, which ones go with which. This is what the algorithm tries to do. It does not currently make use of any probability measures, but uses simple arithmetic as its only check on what is going on. The linguistic element is accordingly restricted to the observations noted above. Applying the algorithm to the various and varied texts used so far indicates that a success rate of at least 90% is achievable over both long and short texts, as long as care is taken at the editorial stage.
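
The idea of pairing sentences by length alone can be illustrated with a short sketch. The paper does not spell out the arithmetic actually used, so what follows is a generic length-comparison aligner and not the program's algorithm: it assumes that only 1:1, 1:2 and 2:1 groupings are allowed, measures length in characters, and picks the grouping with the smallest total length mismatch. All names are hypothetical.

```python
def align_by_length(src, tgt):
    """Align the sentences of one paragraph with those of its translation.

    Returns a list of (source_indices, target_indices) groups, or an
    empty list if no alignment exists under the allowed groupings.
    """
    total_src = sum(len(s) for s in src)
    total_tgt = sum(len(t) for t in tgt)
    if total_src == 0 or total_tgt == 0:
        return []
    # Rescale source lengths so both paragraphs have the same total.
    scale = total_tgt / total_src
    moves = [(1, 1), (1, 2), (2, 1)]  # assumed set of groupings
    n, m = len(src), len(tgt)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for di, dj in moves:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_len = sum(len(s) for s in src[i:ni]) * scale
                tgt_len = sum(len(t) for t in tgt[j:nj])
                cost = best[i][j] + abs(src_len - tgt_len)
                if cost < best[ni][nj]:
                    best[ni][nj] = cost
                    back[ni][nj] = (i, j)
    if best[n][m] == inf:
        return []
    # Walk back from the end of both paragraphs to recover the groups.
    groups = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        groups.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    groups.reverse()
    return groups
```

With a short source sentence followed by a long one, and a translation that splits the long sentence in two, this sketch groups the long source sentence with both target sentences. It also shows why long paragraphs are harder: the number of competing groupings grows with the number of sentences.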

1.5 Selecting language pairs

The program has a screen which displays the available languages in two lists, Search and Target (see fig. 1). By highlighting, for example, French in Search and Greek in Target, you can obtain a list of files which are available in both languages for searching. As you change either the Search or the Target language, the list of available files is updated. By highlighting the files you wish to include, you can widen or narrow the scope of the search. Each language is currently identified by the EC two-letter abbreviation, e.g. DK for Danish, DE for German. By adding either a letter or a number to this identifier, it will be possible to include category information about the files, which will be available as a further selection box; e.g. DKC could mean Danish children's texts, while ENF could mean English fiction. As the corpus grows in size and variation, some such categorisation will become necessary.
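
The file list for a language pair amounts to a set intersection. The sketch below assumes a hypothetical naming scheme in which each file name starts with the language code followed by an underscore (for example FR_prince.txt); the paper does not say how the program actually stores this information.

```python
def files_in_both(filenames, search, target):
    """List the titles available in both the Search and Target languages.

    Assumption: names look like "<CODE>_<title>", e.g. "FR_prince.txt".
    """
    titles_by_language = {}
    for name in filenames:
        code, _, title = name.partition("_")
        titles_by_language.setdefault(code.upper(), set()).add(title)
    shared = titles_by_language.get(search, set()) & titles_by_language.get(target, set())
    return sorted(shared)

corpus = ["FR_prince.txt", "EN_prince.txt", "GR_prince.txt",
          "EN_time.txt", "GR_time.txt"]
print(files_in_both(corpus, "FR", "GR"))
print(files_in_both(corpus, "EN", "GR"))
```

A category suffix such as DKC or ENF would fit the same scheme, since the code is simply everything before the underscore.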

1.6 Entering queries

Queries are entered in a separate dialogue box. They are entered in the language selected as Search. The program recognises the international character set, so French or German keyboards can be used without modification. A keyboard manager is available to allow simple switching to non-standard fonts such as Greek and Cyrillic.

Queries can be (i) whole words or phrases or (ii) words or phrases containing wild cards at the front, at the end, or both. Internal wild cards are not currently available. The wild card is currently restricted to *, meaning any number of characters, including zero. The ? wild card, meaning a single character, will be implemented in the final version.
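
One way to picture these query rules is as a translation into a regular expression. The sketch below is an illustration in Python, not the program's own matching code; it assumes that * stands for word characters only and that matching is case-sensitive. The function name is hypothetical.

```python
import re

def compile_query(query):
    """Turn a query with optional leading/trailing * into a regex."""
    core = query.strip("*")
    if not core or "*" in core:
        # Internal wild cards are not supported, as in the program.
        raise ValueError("only leading and trailing * are supported")
    prefix = r"\w*" if query.startswith("*") else ""
    suffix = r"\w*" if query.endswith("*") else ""
    # \b anchors the match at word boundaries, so a query without
    # wild cards matches whole words or phrases only.
    return re.compile(r"\b" + prefix + re.escape(core) + suffix + r"\b")

text = "Boa constrictors swallow their prey, swallowing it whole."
print(compile_query("swallow*").findall(text))
print(compile_query("*ing").findall(text))
```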

One Context word or phrase can also be entered, also with the wild card provisions. This context can be selected to give results if found in the same paragraph, the same sentence or within a certain number of words of the hit to the left, right or both up to a maximum of 6 words either side.
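
The word-distance variant of the context check can be sketched as a window test around the hit. This assumes the sentence has already been split into words and handles a single context word without wild cards; the names are hypothetical.

```python
MAX_SPAN = 6  # maximum number of words on either side, as stated above

def context_within(words, hit_index, context, left=0, right=0):
    """Report whether `context` occurs near the hit.

    `left` and `right` give how many words to inspect on each side.
    """
    if not (0 <= left <= MAX_SPAN and 0 <= right <= MAX_SPAN):
        raise ValueError("span must be between 0 and 6 words")
    start = max(0, hit_index - left)
    before = words[start:hit_index]
    after = words[hit_index + 1:hit_index + 1 + right]
    return context in before or context in after

words = "I saw a magnificent picture in a book".split()
hit = words.index("picture")
print(context_within(words, hit, "saw", left=3))
print(context_within(words, hit, "saw", left=2))
```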

Contextual searching carries with it a time cost, since each paragraph with hits has also to be checked for the context before retaining the location data for alignment purposes. Each time a search is initiated the alignment procedure starts from scratch. Until this point the results of the previous search, if any, are available for sorting, selection and test creation, however many times those operations have been carried out. This method was chosen to allow maximum flexibility for switching between languages, in particular the reversing of the search and target languages to explore linguistic features in either of the currently selected languages. It also allows the maximum speed to be achieved by only attempting to align within paragraphs where hits, and context if needed, have been identified. There is an element of redundancy if some of the same words are selected, but no overloading of memory with information no longer required within the same file pairings.

The more complex the query, the longer the program will take to run. In simple queries with infrequent words the response is almost immediate. The faster the processor, the faster everything happens. The parallel concordancer is about four times slower than a monolingual concordancer for a given size of search-language text, and this will be noticeable on longer files.

The program reports for each file how many hits it has found, and how many paragraphs it has looked in. It stops after each file and waits for the user to respond before proceeding. It does not currently allow a Cancel operation. If it reaches the hit maximum of 100 it tells you this and stops looking in any of the remaining files. At present it does continue to read to the end of the file it is in when the limit is reached, but this will be removed in later versions. The limit was introduced simply to indicate to the user that the scope of the search could probably be refined if they wish to avoid a lot of deleting or subsequent editing, and to allow searching for very frequently occurring words without overloading the examples.

1.7 Sorting and selection

The results can be viewed as illustrated in Fig. 2. A dialogue box at the top of the screen controls the sorting, viewing and selection procedures. The results of the search and alignment are shown beneath this, with the Search language above the Target. The hit words or phrases are not highlighted in the search area, but are identified in the list within the dialogue box. Notice in the example given in fig. 2 that two sentences in the original French correspond to one sentence (with a colon in the middle) in the target language English, and that the algorithm has picked up the correct correspondence.

The order of this list can be changed by selecting left, right or keyword sorting. For the left/right sorting the distance from the search word can be set from 0 to 3. The size of the list can be reduced by deleting sentences which are surplus to requirements. If a sentence view is not sufficient, the full paragraph is available in either search or target languages, but not in both simultaneously.

The selections in this box are retained after it is closed, to enable you to see where you were the last time you used the box.

1.8 Test creation

One of the requirements of the project specification was that teachers or students should be able to produce a number of linguistic tests on-screen and be able to save the results for later use. This has been implemented so that tests can be created in either the Search or the Target language, or in interleaved Search/Target form with the test created in one or the other language. Fig. 3 shows a sample of the interleaved results. The order of the results will be that last selected within the Sort/Selection dialogue box, and the listing will only contain what remains after deletion from within that dialogue box. The tests available are as indicated in Fig. 4.

Selecting the plain text option leaves the text entire as a means of providing answers or comparative material in the case of interleaved selection.

Some tests can be selected to be applied to every nth word:

A C-Test removes the second half of the nth word, if it is longer than three letters;
A Cloze test removes all of every nth word;
A First letter test removes all but the first letter of every nth word.

Other tests simply operate on the length of the word:

Less than n removes all words less than n characters in length, replacing the characters by blanks;
More than n removes all words more than n characters in length, similarly replacing the characters by blanks.
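
The five test types listed above can be sketched in a few lines. This is an illustration only: it assumes that a blank is shown as one underscore per removed character, that words are runs of letters or digits with punctuation left in place, and that for odd-length words the C-Test keeps the larger first half. All function names are hypothetical.

```python
import re

WORD = re.compile(r"\w+")

def blank(word, keep=0):
    """Keep the first `keep` characters and blank out the rest."""
    return word[:keep] + "_" * (len(word) - keep)

def every_nth(text, n, transform):
    """Apply `transform` to every nth word, leaving the others intact."""
    count = 0

    def replace(match):
        nonlocal count
        count += 1
        return transform(match.group()) if count % n == 0 else match.group()

    return WORD.sub(replace, text)

def c_test(text, n):
    # Second half removed, but only for words longer than three letters.
    return every_nth(text, n, lambda w: blank(w, (len(w) + 1) // 2) if len(w) > 3 else w)

def cloze(text, n):
    return every_nth(text, n, blank)

def first_letter(text, n):
    return every_nth(text, n, lambda w: blank(w, 1))

def less_than(text, n):
    return WORD.sub(lambda m: blank(m.group()) if len(m.group()) < n else m.group(), text)

def more_than(text, n):
    return WORD.sub(lambda m: blank(m.group()) if len(m.group()) > n else m.group(), text)

sentence = "Here is a copy of the drawing."
print(cloze(sentence, 2))
print(c_test(sentence, 2))
```

Each function returns a new string and never alters its input, in the same spirit as the program leaving the source files unmodified.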

All tests are created on the screen directly out of the source files, which remain unmodified throughout. They can be saved to disk as separate files if required. This method of operation makes minimum demands on computer memory, by only holding internal references to the data, rather than the data itself.

2.1 Using the concordancer

The software described in the first section has been designed principally for pedagogic use in language teaching and learning, and because it works by explicitly presenting corresponding sentences, it can have an important role to play in translation training. The official languages of the project are Danish, English, French, German, Greek and Italian.

2.2 The corpus

The corpus as presently constituted consists of a variety of text types. The criteria for inclusion are, firstly, whether the texts would be suitable foreign-language texts for students at higher secondary or tertiary level, and secondly, whether they have been translated into a number (if not all) of the other project languages and, if so, whether the translations are generally good. In addition, each of the official project languages must be represented by an original text, so that no language corpus is made up entirely of translated texts.

At the time of writing, titles marked up in the original language plus at least one other include selected chapters from A Brief History of Time (English original); The Little Prince (French original); a number of Tintin strips (French-language original); Europe and the Sea (German original); novels originating in Greek and Italian; and a Danish children's book. Other titles are being added as fast as the editing will allow. Some texts in other languages have also become available and have been marked up. In principle, as mentioned above, any language using characters wholly contained in the Western European Windows fonts can be included with no further ado from the computational point of view. The minimum mark-up required by this software also makes it easy for users to mark up their own pairs of texts, so that the range could easily be extended to suit the needs of particular individuals or institutions.

2.3 Some limitations

What this program does is produce corresponding sentences laid out adjacently. Its virtues are the speed and accuracy with which it does this, and the flexibility allowed in searching, sorting and test production. Its limitation is that it cannot answer other queries which one might have about a text pair.

It was pointed out earlier that the algorithm is purely arithmetical, and that the target hit rate is 90%. This means that something like one sentence pair in ten will fail to produce a correspondence between the source language and the target language sentences (although in most cases the corresponding sentence turns out to be an adjacent one). The user has to be prepared for this, and the trainer may well decide to edit the output files at this point before using them with trainees, if the lack of correspondence is perceived as an irritant.

In a few cases, the translation turns out to be divergent, so that the search item is not represented anywhere in the target text. While this may be interesting or significant, it is difficult to make use of a printout of sentence-level equivalences to explore this point, which would be better investigated by recourse to the paragraphs on screen, or the original books.

The other limitation is that features of layout (including the paragraphing of the translated texts) are not preserved, so that in Alice in Wonderland, for instance, it would not be possible to compare how the tale of the mouse (which is shaped in the text like the tail of the mouse) is laid out in the different languages. Information of this type can best be checked by comparison of the printed forms. Equally, if different pictures accompany the same text in different languages, this would be lost.

2.4 The software and the translation trainer

Two modes of use are envisaged for this software: trainer access to prepare materials, and direct trainee access. As an example of how such an approach can work, an account will be given of some material developed by the first author of this paper. The occasion was not a translation class, but a writing seminar that the author was conducting for a group of Greek postgraduate research students. The students were all writing up research into 19th and 20th century Greek literary figures, and one problem was the extent of interference from Greek in their English, particularly in the way they expressed their arguments and hypotheses. Specifically, the students sometimes appeared to be overasserting their claims.

A search was conducted on "would" in A Brief History of Time and its Greek translation. The interleaved sentences, consisting of some 70 pairs, were reduced to some 30 by editing out all those where "would" corresponded to a Greek conditional form. The others were then printed out and presented for discussion.

It became clear from the discussion that propositions and hypotheses that were modulated by the use of "would" in English were often not modulated at all in this way in Greek, where a plain future or plain present often corresponded. The corollary has to be that Greek present or future tenses would regularly, under circumstances which can be defined, be best translated by a modal form in English. Despite the difference in genres (students writing academic theses on literature topics; the text selected representing a popularisation of a science topic), the students were able to recognise the validity of the point for themselves.

Figures 2, 3 and 4 have been selected so that in addition to showing the main steps in the program, they illustrate, at least in part, how the translator into English has rendered the French word on. Questions like this can be quickly and succinctly answered by means of the parallel concordancer.

2.5 What it makes sense to ask

The size and the nature of the corpus as it exists at present mean that some searches are more likely to produce useful answers than others. The frequency distribution pattern of words in text means that a high proportion of items will appear with a frequency of only 1 or 2. It is a matter of training and experience to work out what sort of query will produce an informative answer. At a word level, it is the most frequent items which generally produce the biggest problems in translation, in part because of their mere frequency, and in part because they tend to be polysemous, or take their meaning from the collocations of which they form part.
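
The point about frequency distribution can be checked on any text with a simple count. The sketch below is a hypothetical helper, not part of the program; it measures what share of the distinct word forms in a text occur no more than twice, treating upper and lower case as the same form.

```python
import re
from collections import Counter

def rare_share(text, max_freq=2):
    """Share of distinct word forms occurring at most `max_freq` times."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    if not counts:
        return 0.0
    rare = sum(1 for frequency in counts.values() if frequency <= max_freq)
    return rare / len(counts)

sample = "the boa and the prey and the boa and the jungle"
print(rare_share(sample))
```

A query on a form that falls in this rare group will return at most one or two sentence pairs, which is seldom enough to show a pattern.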

(Figures 1-4 omitted. For figures showing a later version of the program than that described in this paper, see http://web.bham.ac.uk/johnstf/lingua.htm)

NOTE

1 The lead institution is Université de Nancy II, Nancy, France. Contact: Francine Roussel, Campus Lettres et Sciences. Other partners besides Birmingham are the universities of Hull and East Anglia (UK), Aarhus (Denmark), Wuppertal (Germany), Patras (Greece) and Turin (Italy); the School for Translators and Interpreters, Trieste (Italy); Harper Collins publishers (UK); and the Centre de Recherche en Informatique de Nancy (CRIN) (France). The UNIX version can be viewed on the World Wide Web at http://www.loria.fr/exterieur/equipe/dialogue/lingua/pageintro.html; the Windows version can be seen at http://web.bham.ac.uk/johnstf/lingua.htm

KEYWORDS
parallel corpora
multilingual concordance
translation