[This local archive copy mirrored from the canonical site: http://sun1.bham.ac.uk/johnstf/paracon.htm; links may not have complete integrity, so use the canonical document at this URL if possible.]
e-mail:
Philip King: p.b.king@bham.ac.uk
David Woolls: 100343.2362@compuserve.com
This paper describes in detail the Windows version of the software developed at the University of Birmingham by the second author, and outlines its potential use for students and trainers of translation.
The corpus itself is designed to conform with the Text Encoding Initiative (TEI), using Standard Generalised Markup Language (SGML) to encode the text. This is to allow the corpus to be used in UNIX and Macintosh environments, as well as on IBM PCs. The program uses the bare minimum of encoding to reduce the physical size of the distributed corpus, and to allow potential users to mark up their own texts relatively easily.
The constraint of paragraph alignment is a manageable though time-consuming problem. It is sometimes necessary to identify material omitted, other text inserted, and paragraphs re-aligned by translators. In addition, optical scanners sometimes react to features of text layout by creating unwanted paragraphs. Where the initial alignments in a set of parallel texts do not match, adjustments in all languages are made in relation to the source language of each text.
Why simple mark up?
Using text with a minimum of mark up means that it is simple to switch between language pairings without having to hold a lot of data about the different inter-relationships between the source language and all the translations, and between the translations themselves. It also means that any user with the patience to apply the minimal mark up rules to a pair of texts can make use of them immediately.
What is minimal mark up?
The algorithm needs to recognise two things: paragraph boundaries and sentence boundaries. As mentioned above, the original aim was to use the TEI guidelines on the application of SGML encoding. The only necessary elements for a program operating within a Windows environment are <p> for the start of a paragraph and </p> for the end, and similarly <s> </s> pairs to identify sentence boundaries. Minimal mark up allows the dropping of the </p> and </s>, leaving the texts marked simply as illustrated below.
<p><s> LE PETIT PRINCE
<p><s> I
<p><s> Once when I was six years old I saw a
magnificent picture in a book, called "True Stories
from Nature", about the primeval forest. <s>It was a
picture of a boa constrictor in the act of
swallowing an animal. <s>Here is a copy of the
drawing.
<p><s> In the book it said: "Boa constrictors swallow
their prey whole, without chewing it. <s> After that
they are not able to move, and they sleep through
the six months that they need for digestion."
<p><s> I pondered deeply, then, over the adventures
of the jungle. <s>And after some work with a
coloured pencil I succeeded in making my first
drawing. <s> My Drawing Number One. <s>It looked
like this:
A corpus marked in such a way can be readily created from one containing full SGML mark up for accented characters and other markings, as well as more detailed annotation of each paragraph and sentence. Such a segment is illustrated below. The full markup preserves many features of the original layout and gives each sentence a unique identifier.
<div id=1>
<p id=d1p1><s id=d1p1s1> LE PETIT PRINCE</s></p><lb></lb>
<lb></lb>
<p id=d1p2><s id=d1p2s1> I</s></p><lb></lb>
<lb></lb>
<p id=d1p3><s id=d1p3s1> Once when I was six years
old I saw a magnificent picture in a book, called
"True Stories from Nature", about the primeval
forest.</s><s id=d1p3s2> It was a picture of a boa
constrictor in the act of swallowing an
animal.</s><s id=d1p3s3> Here is a copy of the
drawing.</s></p><lb></lb>
<lb></lb>
<p id=d1p4><s id=d1p4s1> In the book it said: "Boa
constrictors swallow their prey whole, without
chewing it.</s><s id=d1p4s2> After that they are not
able to move, and they sleep through the six months
that they need for digestion."</s></p><lb></lb>
<lb></lb>
<p id=d1p5><s id=d1p5s1> I pondered deeply, then,
over the adventures of the jungle.</s><s id=d1p5s2>
And after some work with a coloured pencil I
succeeded in making my first drawing.</s><s
id=d1p5s3> My Drawing Number One.</s><s id=d1p5s4>
It looked like this:</s></p><lb></lb>
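The reduction from full to minimal mark up can be sketched in a few lines of Python. This is an illustration only, not the conversion used in preparing the corpus: the function name to_minimal_markup is hypothetical, and the sketch assumes that the only tags present are the div, p, s and lb elements shown in the sample above.

```python
import re

def to_minimal_markup(full: str) -> str:
    """Reduce fully marked-up text to bare <p> and <s> start tags.

    Assumes only <div>, <p>, <s> and <lb> elements occur, as in the sample.
    """
    # Drop layout and division tags (opening or closing), with any attributes.
    text = re.sub(r"</?(?:lb|div)\b[^>]*>", "", full)
    # Drop the closing paragraph and sentence tags.
    text = re.sub(r"</(?:p|s)\s*>", "", text)
    # Strip attributes such as id=d1p3s1, even when a tag spans two lines.
    text = re.sub(r"<(p|s)\b[^>]*>", r"<\1>", text)
    # Remove the blank lines left behind by the deleted <lb> tags.
    lines = [line.rstrip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


sample = (
    "<div id=1>\n"
    "<p id=d1p2><s id=d1p2s1> I</s></p><lb></lb>\n"
    "<lb></lb>\n"
    "<p id=d1p3><s id=d1p3s1> Once.</s><s\n"
    "id=d1p3s2> Twice.</s></p><lb></lb>"
)
print(to_minimal_markup(sample))
```

Going in the other direction, from minimal to full mark up, would require generating the identifiers, which this sketch does not attempt.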
The algorithm for alignment works on the basis of the following observations. Texts are generally translated in a linear fashion, not only in the succession of paragraphs, but in the succession of sentence content inside paragraphs. For the most part, short sentences are translated by short sentences and long sentences by long ones, or it may be that a long sentence is broken into short sentences, or several short sentences are combined into a longer sentence.
As long as these basic expectations are met, and there is not a great deal of overlapping of the various possibilities, it should be possible to judge simply from comparing the lengths of the sentences within any given paragraph, which ones go with which. This is what the algorithm tries to do. It does not currently make use of any probability measures, but uses simple arithmetic as its only check on what is going on. The linguistic element is accordingly restricted to the observations noted above. Applying the algorithm to the various and varied texts used so far indicates that a success rate of at least 90% is achievable over both long and short texts, as long as care is taken at the editorial stage.
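The paper does not set out the arithmetic in detail, so the following Python sketch should be read as one possible illustration of length-based alignment within a paragraph, not as the program's actual procedure. It assumes that sentence length is measured in characters, that each sentence's share of its paragraph is compared cumulatively, and that a hypothetical tolerance of 0.1 decides when two running totals are close enough to close a group. The function name align_by_length is likewise an assumption.

```python
def align_by_length(src, tgt, tolerance=0.1):
    """Group sentences of one paragraph pair by cumulative length share.

    src, tgt: lists of non-empty sentence strings for the same paragraph.
    Returns a list of (source_indices, target_indices) groups, so that
    1:1, 1:many and many:1 correspondences can all be represented.
    """
    if not src or not tgt:
        return []
    src_total = sum(len(s) for s in src)
    tgt_total = sum(len(t) for t in tgt)
    groups = []
    i = j = 0
    a = b = 0.0  # cumulative share of each paragraph consumed so far
    while i < len(src) and j < len(tgt):
        gi, gj = [i], [j]
        a += len(src[i]) / src_total
        b += len(tgt[j]) / tgt_total
        i += 1
        j += 1
        # Extend whichever side is lagging until the shares roughly agree.
        while abs(a - b) > tolerance:
            if a < b and i < len(src):
                gi.append(i)
                a += len(src[i]) / src_total
                i += 1
            elif b < a and j < len(tgt):
                gj.append(j)
                b += len(tgt[j]) / tgt_total
                j += 1
            else:
                break
        groups.append((gi, gj))
    # Any sentences left over on either side join the final group.
    groups[-1][0].extend(range(i, len(src)))
    groups[-1][1].extend(range(j, len(tgt)))
    return groups
```

With one long source sentence rendered as two shorter target sentences, followed by a short sentence on each side, the sketch pairs the long sentence with both of its translations and the short sentences with each other. Overlapping cases of the kind mentioned above are exactly where so simple a check can go wrong.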
Queries can be (i) whole words or phrases or (ii) words or phrases containing wild cards at the front, at the end, or both. Internal wild cards are not currently available. The wild card is currently restricted to *, meaning any number of characters, including zero. The ? wild card, meaning exactly one character, will be implemented in the final version.
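The query rules just described can be mimicked with regular expressions. The Python sketch below is an assumption about how such matching might be expressed, not a description of the program's internals: the names query_to_regex and find_hits are hypothetical, matching is case-sensitive, and a leading or trailing * is taken to extend only to the edge of the word.

```python
import re

def query_to_regex(query: str) -> re.Pattern:
    """Translate a query with optional leading/trailing * into a regex."""
    lead = query.startswith("*")
    trail = query.endswith("*")
    core = query.strip("*")
    if not core:
        raise ValueError("query must contain at least one literal character")
    if "*" in core:
        raise ValueError("internal wild cards are not supported")
    pattern = (
        r"\b"
        + (r"\w*" if lead else "")   # * at the front: any word beginning
        + re.escape(core)
        + (r"\w*" if trail else "")  # * at the end: any word ending
        + r"\b"
    )
    return re.compile(pattern)

def find_hits(query: str, text: str) -> list:
    """Return every stretch of text matched by the query."""
    return [m.group(0) for m in query_to_regex(query).finditer(text)]
```

For example, find_hits("translat*", "The translator translated a translation.") returns the three words beginning with translat, while a query such as "a*b" is rejected because it contains an internal wild card.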
One Context word or phrase can also be entered, also with the wild card provisions. This context can be selected to give results if found in the same paragraph, the same sentence or within a certain number of words of the hit to the left, right or both up to a maximum of 6 words either side.
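The word-window form of the context check can be sketched as follows. This is a simplified illustration under stated assumptions: the function name context_within is hypothetical, the sentence is taken to be already split into words, and the context is a single word compared exactly, whereas the program also accepts phrases and wild cards.

```python
MAX_SPAN = 6  # maximum number of words either side, as stated in the text

def context_within(words, hit_index, context, left=0, right=0):
    """Report whether `context` occurs within the given window of the hit.

    words:      the sentence as a list of words
    hit_index:  position of the search word in `words`
    left/right: how many words to inspect on each side (0 to MAX_SPAN)
    """
    if not (0 <= left <= MAX_SPAN and 0 <= right <= MAX_SPAN):
        raise ValueError("window must be between 0 and 6 words either side")
    start = max(0, hit_index - left)
    stop = min(len(words), hit_index + right + 1)
    # The window excludes the hit itself.
    window = words[start:hit_index] + words[hit_index + 1:stop]
    return context in window


words = "I saw a magnificent picture in a book".split()
print(context_within(words, 4, "saw", left=3))
print(context_within(words, 4, "saw", left=2))
```

Setting both left and right to non-zero values corresponds to searching on both sides of the hit.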
Contextual searching carries with it a time cost, since each paragraph with hits has also to be checked for the context before retaining the location data for alignment purposes. Each time a search is initiated the alignment procedure starts from scratch. Until this point the results of the previous search, if any, are available for sorting, selection and test creation, however many times those operations have been carried out. This method was chosen to allow maximum flexibility for switching between languages, in particular the reversing of the search and target languages to explore linguistic features in either of the currently selected languages. It also allows the maximum speed to be achieved by only attempting to align within paragraphs where hits, and context if needed, have been identified. There is an element of redundancy if some of the same words are selected, but no overloading of memory with information no longer required within the same file pairings.
The more complex the query, the longer the program will take to run. In simple queries with infrequent words the response is almost immediate. The faster the processor, the faster everything happens. The parallel concordancer is about four times slower than a monolingual concordancer for a given size of search-language text, and this will be noticeable on longer files.
The program reports for each file how many hits it has found, and how many paragraphs it has looked in. It stops after each file and waits for the user to respond before proceeding. It does not currently allow a Cancel operation. If it reaches the hit maximum of 100 it tells you this and stops looking in any of the remaining files. At present it does continue to read to the end of the file it is in when the limit is reached, but this will be removed in later versions. The limit was introduced simply to indicate to the user that the scope of the search could probably be refined if they wish to avoid a lot of deleting or subsequent editing, and to allow searching for very frequently occurring words without overloading the examples.
The order of this list can be changed by selecting left, right or keyword sorting. For the left/right sorting the distance from the search word can be set from 0 to 3. The size of the list can be reduced by deleting sentences which are surplus to requirements. If a sentence view is not sufficient, the full paragraph is available in either search or target languages, but not in both simultaneously.
The selections in this box are retained after it is closed, to enable you to see where you were the last time you used the box.
Selecting the plain text option leaves the text entire as a means of providing answers or comparative material in the case of interleaved selection.
Some tests can be selected to be applied to every nth word:
At the time of writing, titles marked up in the original language plus at least one other include selected chapters from A Brief History of Time (English original); The Little Prince (French original); a number of Tintin strips (French-language original); Europe and the Sea (German original); novels originating in Greek and Italian; and a Danish children's book. Other titles are being added as fast as the editing will allow. Some texts in other languages have also become available and have been marked up. In principle, as mentioned above, any language using characters wholly contained in the Western European Windows fonts can be included with no further ado from the computational point of view. The minimum mark-up required by this software also makes it easy for users to mark up their own pairs of texts, so that the range could easily be extended to suit the needs of particular individuals or institutions.
It was pointed out earlier that the algorithm is purely arithmetical, and that the target hit rate is 90%. This means that something like one sentence pair in ten will fail to produce a correspondence between the source language and the target language sentences (although in most cases the corresponding sentence turns out to be an adjacent one). The user has to be prepared for this, and the trainer may well decide to edit the output files at this point before using them with trainees, if the lack of correspondence is perceived as an irritant.
In a few cases, the translation turns out to be divergent, so that the search item is not represented anywhere in the target text. While this may be interesting or significant, it is difficult to make use of a printout of sentence-level equivalences to explore this point, which would be better investigated by recourse to the paragraphs on screen, or the original books.
The other limitation is that features of layout (including the paragraphing of the translated texts) are not preserved, so that in Alice in Wonderland, for instance, it would not be possible to compare how the tale of the mouse (which is shaped in the text like the tail of the mouse) is laid out in the different languages. Information of this type can best be checked by comparison of the printed forms. Equally, if different pictures accompany the same text in different languages, this would be lost.
A search was conducted on would in A Brief History of Time and the Greek translation. The interleaved sentences, consisting of some 70 pairs, were reduced to some 30 or so by editing out all those where would corresponded to a Greek conditional form. The others were then printed out, and presented for discussion.
It became clear from the discussion that propositions and hypotheses modulated by the use of would in English were often not modulated in this way in Greek, where a plain future or plain present often corresponded. The corollary is that Greek present or future tenses would regularly, under circumstances which can be defined, be best translated by a modal form in English. Despite the difference in genres (the students were writing academic theses on literature topics, while the selected text is a popularisation of a science topic), the students were able to recognise the validity of the point for themselves.
Figures 2, 3 and 4 have been selected so that in addition to showing the main steps in the program, they illustrate, at least in part, how the translator into English has rendered the French word on. Questions like this can be quickly and succinctly answered by means of the parallel concordancer.
(Figures 1-4 omitted. For figures showing a later version of the program than that described in this paper, see http://web.bham.ac.uk/johnstf/lingua.htm)
NOTE
1 The lead institution is Université de Nancy II, Nancy, France. Contact: Francine Roussel, Campus Lettres et Sciences. Other partners besides Birmingham are the universities of Hull and East Anglia (UK), Aarhus (Denmark), Wuppertal (Germany), Patras (Greece), Turin (Italy), and the School for Translators and Interpreters, Trieste (Italy), Harper Collins publishers (UK) and the Centre de Recherche en Informatique de Nancy (CRIN) (France). The UNIX version can be viewed on the World Wide Web at http://www.loria.fr/exterieur/equipe/dialogue/lingua/pageintro.html and the Windows version at http://web.bham.ac.uk/johnstf/lingua.htm
KEYWORDS
parallel corpora
multilingual concordance
translation