The dictionary in question is the Bilingual Canadian Dictionary (BCD), which is still in preparation. As its tentative title indicates, it is a bilingual dictionary which will reflect English and French as they are used in Canada. The creation of this dictionary is the major objective of a vast collaborative research project, called "Comparative Lexicography of French and English in Canada", funded by the Social Sciences and Humanities Research Council. The project involves three universities: the University of Ottawa (which is also the administrative centre), the University of Montreal, and Laval University.
The primary problem seemed to be that a dictionary entry has many components, a number of which are optional or repeatable at various points. Thus, a dictionary entry can be very long (e.g. the BCD entry for coeur), quite short (e.g. our entry for motoneigiste), or simply a cross-reference to another entry (e.g. our entry for naveau). Another problem was the sheer number of entries planned for the dictionary (approximately 80,000).
Terminological database management systems, such as AQUILA and MC4, were obviously unsuitable for our needs because of their predetermined structure, which cannot be modified. MTX, also used for terminological data, offered more flexibility because it does not preestablish fields, but it has the disadvantage of separating information on the source-language word from the information related to that word (the target-language equivalents, illustrative examples, etc.) into two different zones, and of restricting the scope of each of these zones. In any case, it is not intended for large numbers of entries (or lexicographic records).
After rejecting the possibility of using any of the readily available terminological database management systems, we began, in 1993, a worldwide search for lexicographic database management systems. However, despite our conviction that such systems must exist - after all, dictionaries are constantly published - we were able to locate only two: one in Copenhagen (called GestorLEX) and another in Cambridge designed for the Cambridge International Dictionary of English (CIDE). While both GestorLEX and the CIDE database system seemed to be comprehensive, well-thought-out packages with many specialized features, both had two major disadvantages: (a) no North American technical support was available; and (b) both were "closed systems", in that they store their document bases in an internal format that can be read only by the system itself, even though GestorLEX does support SGML markup to facilitate information interchange. In addition, both would have needed to be adapted to our specific needs, and we were unable to obtain test copies to verify just how much further programming would be required on our part.
However, programming was not our strong point, as we soon realized after attempting to adapt a general-purpose database management system, Rbase, to meet our needs. And efforts by the University of Montreal computer centre to design a system for us in Prolog progressed so slowly, apparently because of the complexity of some of our entries, that after a year and a half only the "introductory zone" was completed.
By 1995, we had reached two major decisions: (a) given that ours is a long-term project, we could not run the risk of being tied to one platform or one specific database system; (b) we needed expert advice, which was not available at the university level, on text processing and text retrieval applications.
By this point, we had acquired some knowledge of SGML, and thought that an SGML application could provide us with many advantages:
(a) Since an SGML document exists simply as plain text, it is easily portable from one application to another, or from one platform to another. This is of considerable importance to us for several reasons: we have three lexicographic centres, which need to "exchange" information constantly; since the final product, the BCD, will appear not only in electronic form but also in print, SGML would be a dramatic time-saver both for us and for the publisher; and since, over the duration of the project, we would be updating our hardware, the fact that SGML can be used on any hardware was a definite asset.
(b) Since an SGML document does not require authors to concern themselves with formatting and layout, lexicographers would be better able to concentrate on the actual content of entries.
(c) Since SGML markup captures the document's structure, lexicographers would not accidentally forget elements that need to be included in the entry. Moreover, since SGML entries are automatically parsed when they are saved, a number of lexicographic omissions would be flagged at that point (the sketch following this list illustrates the idea).
(d) Since SGML markup is multipurpose, the tagging can serve for verification of specific entry elements during the final revision of all the dictionary entries, and for the creation of future subproducts (e.g. a bilingual dictionary of Canadianisms).
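To make these advantages concrete, here is a minimal sketch of the kind of declarations and markup involved. The element names and content are invented for illustration only and do not represent the BCD's actual tag set:

    <!-- Illustrative sketch only: invented names, not the BCD tag set -->
    <!ELEMENT entry       - - (headword, pos, sense+)>
    <!ELEMENT headword    - - (#PCDATA)>
    <!ELEMENT pos         - - (#PCDATA)>
    <!ELEMENT sense       - - (sense.ind?, translation+)>
    <!ELEMENT sense.ind   - - (#PCDATA)>
    <!ELEMENT translation - - (#PCDATA)>

    <entry>
      <headword>motoneigiste</headword>
      <pos>n</pos>
      <sense>
        <translation>snowmobiler</translation>
      </sense>
    </entry>

If a lexicographer saved such an entry without the pos element, a validating parser would report the omission immediately, which is the kind of flagging referred to in (c); and because every piece of information is explicitly tagged, the same markup can later be searched or extracted for the purposes mentioned in (d).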
Armed with the conviction that SGML would provide us with a solution, we initially worked on a DTD with the help of a visiting professor from Rennes. Realizing, however, that the DTD design was only one element of an integrated system for our entries, we consulted three computer consulting firms to see if they could help us realize our goal. While two of them could only promise us some sort of usable SGML product in the more or less distant future, we had the great good fortune to find at Microstar in Ottawa a consultant, Dr. David Megginson, who not only specializes in SGML but had also done previous lexicographic programming. With his assistance, we moved, in the space of six months, from word processing to a fully integrated SGML application.
Since the BCD team had already designed a tentative entry structure and had produced about 10,000 entries, we began by examining the proposed dictionary microstructure and its realization in about thirty entries. The proposed microstructure was complex because of the number of components and the combinations of components possible. The entries chosen for examination were selected to represent as many of the components as possible. The main selection criteria were the following: representation of different parts of speech (e.g. aîné n, aîné adj); representation of monosemous and polysemous word entries (e.g. allopathy, a monosemous noun, versus ball, which is a highly polysemous noun); representation of compounds and fixed expressions in certain entries (e.g. feu n); representation of "marked" word entries (e.g. bébelle n) as well as unmarked word entries (e.g. adjust v); representation of entries created in different centres; and representation of entries from French to English as well as English to French.
On the basis of all this supporting documentation, the DTD committee first identified all possible components and described each component in a Component Form.
At the end of this stage, we had identified seventy-six components. These were then reviewed and revised. For example, we had initially identified a component called Annotated Translation Example, which we had defined as "translation of a source language free combination, collocation, fixed expression or compound preceded by one or more of the following: sense indication, actants, referents, comments." At the review stage, however, we realized that we did not need to identify this as a separate component, since all its constituents (translation, sense indication, actant, referent, comment) had been individually identified. At the review stage, we also added a couple of components that had previously been missed.
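In DTD terms, this decision simply means that the constituents appear directly in a content model rather than under a wrapper element. The following declaration is a rough sketch with invented names, not our actual markup:

    <!-- Sketch only: no separate "annotated translation example" element;   -->
    <!-- its constituents occur directly where a translated example appears. -->
    <!ELEMENT example - - (source.phrase,
                           (sense.ind | actant | referent | comment)*,
                           translation)>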
After all the components were identified and described, we decided to set up two separate DTDs - one for a Full Dictionary Entry, another for a Cross-Reference entry - since the latter contains very little information.
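Since a cross-reference entry carries so little information, its DTD can be very small. A rough sketch (again with invented names) of what it might need to declare:

    <!-- Hypothetical sketch of the Cross-Reference entry DTD -->
    <!ELEMENT xref.entry - - (headword, pos?, see)>
    <!ELEMENT headword   - - (#PCDATA)>
    <!ELEMENT pos        - - (#PCDATA)>
    <!ELEMENT see        - - (#PCDATA)>
    <!-- "see" holds the headword of the full entry being pointed to -->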
We then started ordering the components for the Full Dictionary Entry, both vertically and horizontally, using Microstar's product, Near and Far Designer, which uses a simple "drag-and-drop" graphic interface to program the DTD. As the Dictionary Component Form presented above illustrates, while identifying the components, we had already specified to some extent the relations between components. These relations needed to be further clarified at this stage and the elements grouped into blocks and groups. Thus, the elements irregular feminine, irregular plural and irregular singular were grouped together in a Noun Information Block, while the elements irregular feminine, irregular plural, regular comparative, regular superlative, irregular comparative and irregular superlative were placed together in the Modifier Information Block. At this point, we also had to decide which components should be elements and which should be attributes (of the elements). We ended up treating as attributes several components that would disappear from the final dictionary entry: source codes and lexicographer's notes, for instance. After much discussion and reordering, we finally arrived at what seemed a satisfactory DTD structure.
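For illustration, the groupings and element-versus-attribute decisions just described might take roughly the following shape; the names are invented and the sketch is not our actual DTD:

    <!-- Sketch only: inflection elements gathered into blocks -->
    <!ELEMENT noun.info.block - - (irreg.fem?, irreg.pl?, irreg.sg?)>
    <!ELEMENT mod.info.block  - - (irreg.fem?, irreg.pl?,
                                   reg.comp?, reg.superl?,
                                   irreg.comp?, irreg.superl?)>
    <!-- Components that disappear from the printed entry become attributes -->
    <!ATTLIST entry
              source.code CDATA #IMPLIED
              lex.note    CDATA #IMPLIED>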
We then tested the DTD for a couple of months using the SGML authoring program InContext. Entry preparation using the DTD revealed a certain number of problems. For example, we could not add the grammatical comment NonC at the start of a sense division for a word that was non-count in only one or two of its senses (e.g. bush). And while we had a repeatable irregular feminine form element and a repeatable irregular plural form element, we had overlooked the fact that we needed to distinguish the irregular masculine plural from the irregular feminine plural. We established lists of "problems", i.e. elements that were not appropriately placed or subelements that we had ignored. These problems led to further changes to the DTD.
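One way such problems can be repaired, sketched here with invented names purely to show the kind of revision involved (not necessarily the exact solution we adopted), is to allow a grammatical comment at the start of each sense division and to split the plural element in two:

    <!-- Sketch of the kind of revision involved -->
    <!ELEMENT sense.division - - (gram.comment?, translation+)>
    <!ELEMENT gram.comment   - - (#PCDATA)>
    <!-- e.g. NonC, for a sense that is non-count -->
    <!ELEMENT irreg.masc.pl  - - (#PCDATA)>
    <!ELEMENT irreg.fem.pl   - - (#PCDATA)>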
We now have a DTD that works for all words. While we still notice minor "problems" from time to time, the changes required are so minimal that they do not affect the overall structure. However, since every entry differs from the next, we have taken precautions to allow for future additions: we have added a "loophole" at the end of each block, which can later be used to accommodate so-far unforeseen elements.
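As a rough sketch of the idea (names invented as before), each block now ends with an optional, repeatable catch-all element that remains unused until it is needed:

    <!-- Sketch: a "loophole" element closes each block for future additions -->
    <!ELEMENT extra           - - (#PCDATA)>
    <!ELEMENT noun.info.block - - (irreg.fem?, irreg.pl?, irreg.sg?, extra*)>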
More recently, we have begun to use the SGML option in Corel's WordPerfect 7. This program allows the lexicographer to see on the screen, at any given point, all the elements and attributes available in the DTD. However, the elements and attributes are displayed along with their SGML tags, which can "clutter up" the screen. While the tags can be hidden, certain lexicographers still prefer using InContext.
We found that the style sheet editor associated with InContext was not quite suited to our needs (for example, we could not get it to print annotations). We were considering having a DSSSL script written when we found out that WordPerfect 7 also offered the possibility of designing, on an ad hoc basis, different style sheets for our SGML entries. Using this facility, our in-house, self-taught computer expert, Lucie Langlois, designed style sheets that enable us to print out reasonable facsimiles of dictionary entries.
However, one major task remains: converting over 12,000 completed entries from WordPerfect format to SGML. We are looking into the possibility of using a parser for this purpose. Meanwhile, although new entries are being authored in SGML, earlier entries are still in WordPerfect, so our goal of SGMLizing all our entries has not yet been accomplished.