[This local archive copy mirrored from: http://www.stg.brown.edu/webs/tei10/tei10.papers/romary.html; see the canonical version of the document.]
Text Encoding Initiative
Laurent Romary, Patrice Bonhomme, Florence Bruneseaux, Jean-Marie
CRIN-CNRS & INRIA Lorraine
Bâtiment Loria, B.P. 239, F-54506 Vandœuvre-lès-Nancy
In recent years, renewed interest in studies based on computerized linguistic resources has been observed in the human sciences, for linguistic, literary or historical studies, as much as in computer science. Several recent publications (Aarts et al. 92, TAL-95, IJCL-96, for instance), as well as the diversity of the methods and goals of researchers in this field, testify to the liveliness of the movement. This renewal of methods raises a set of essential questions related to the status and the maintenance of the data thus handled. Indeed, it now seems impossible to repeat ad infinitum the working cycle that has almost always been observed in our communities: within the scope of a particular research project, the necessary data are defined and collected, and a couple of ad hoc tools are rapidly constructed to allow the extraction of the information relevant to the current study. Finally, when the work has been done and the results published, the data are left to themselves in a more or less well-identified form and, above all, survive only in the collective memory of the researchers who took part in the project. In most cases these data become totally unusable for any project of the same kind, either because compatible computer tools no longer exist, because the formats have not been documented, or because it would be too expensive to transform the data to make them compatible with the tools defined for the new research. This last point also partly explains why it has so far been impossible to make flexible and modular use of the data held in large textual archives: their specific formats have never been associated with the distribution of tools that are widely available in the academic communities.
Clearly, the question raised here is that of re-usability. This problem cannot be tackled in full generality, and lines of reflection leading to realistic answers for our community have to be defined. First, it seems necessary to outline the different linguistic resources that have to be represented. The case of textual data seems simpler at first, because it implies a low degree of structure. However, we will observe that even in the simplest modes of representation (untagged texts) it is necessary to add a minimum of documentation on the origins and contents of the corresponding texts. Moreover, following M.-P. Péry-Woodley's advice (1995), it seems essential to us to collect textual data from complete and identifiable texts in order to fully control all the parameters (genre, structure) which might be used in later studies. Indeed, text must be seen as a countable noun, not as a mass one.
The other linguistic resources are generally more structured by nature and, because of that, demand more attention if we want to make them available to a large community. Among them, we can mention lexical resources which, according to the use that will be made of them, take the form of a computerized dictionary (for human use) or of a lexical database (for automatic use). In the latter case it is essential to fully normalize the structure used, so that these resources can be integrated into different development platforms. In the same way, a great number of dialogue corpora now exist (transcriptions of human-human dialogues, Wizard of Oz experiments, etc.), but their forms are so different that it is impossible to define unified exploration tools which would allow a full exploitation of these documents. From linguistic resources considered as available data we are soon led to consider the tools to be associated with them. Indeed, making data available should always entail, even if it must remain something of an ideal, a thorough reflection on the working environment which allows their manipulation and use. Thus, according to the category of user, different kinds of tools will be needed: transparent, data-integrated (on-line) tools, or widely distributed and adaptable tools (software libraries, for instance).
As can be seen, the reflection to be carried out is substantial, and it is clear that all difficulties cannot be solved at once. However, hoping to improve the situation in this field within the French-speaking community, the CNRS and the Aupelf-Uref have launched a joint initiative bringing together five university teams, which aims at reaching as many French-speaking sites and laboratories as possible. In this article we present a synthesis of the reflections which have led to the implementation of a first experimental server.
SILFIDE (Serveur Interactif pour la Langue Française, son Identité, sa Diffusion et son Etude) is a tool for sharing, in a congenial and considered way, knowledge about different aspects of the French language. It consists of a network of data-processing servers together with the necessary support.
The aim of SILFIDE is not to integrate the totality of the contents (corpora, glossaries, tools) of the resources available within the academic community, but to allow any researcher to be informed of the existence of such contents, to get a relatively precise idea of them, and to be informed of the methods of access. In the case of resources which are widely used, or which raise no particular access problems, SILFIDE will be able to offer automatic transfer of the corresponding data.
A French-speaking server. SILFIDE is intended to help all the laboratories of the French-speaking community, and anyone interested in the study or the automatic processing of the French language. In this respect, French has to be the main language of our project. On the one hand, most of the data available on SILFIDE will be in French or associated with equivalent data in French (in the case of a parallel corpus, for instance). On the other hand, French will be the metalanguage used for managing the resources, both at the level of their documentation and at the level of the corpus access interface. However, a description of the server in other languages (English or German, for instance) would be useful.
Still, considering the importance of sharing expertise in the field of standardized data delivery, it is now clear that the underlying technology has to be kept generic enough (use of Unicode, etc.) to make sure that it can be duplicated at any site.
Main functions. To begin with, SILFIDE should be able to answer the following questions which might be raised by a user:
Other functions. Besides access to linguistic resources, it might be interesting to offer directly accessible "on-line" tools to users who do not have an elaborate computing environment at their disposal. Concordances can then be generated for a set of selected texts, together with elementary lexical statistics (frequencies, reduced deviations, etc.).
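To make the kind of on-line tool envisaged here concrete, the following is a minimal sketch of a concordancer and of a "reduced deviation" (z-score of an observed frequency under a simple binomial model). The function names and the toy corpus are our own illustrations, not part of the SILFIDE server:

```python
import math
import re
from collections import Counter

def concordance(text, keyword, width=30):
    """Return one keyword-in-context line per occurrence of `keyword`."""
    lines = []
    for m in re.finditer(r'\b%s\b' % re.escape(keyword), text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append('%s[%s]%s' % (left, m.group(0), right))
    return lines

def reduced_deviation(count, n_tokens, expected_rate):
    """Reduced deviation (z-score) of an observed word frequency
    against an expected rate, under a binomial model."""
    expected = n_tokens * expected_rate
    sd = math.sqrt(n_tokens * expected_rate * (1 - expected_rate))
    return (count - expected) / sd

# Toy corpus (invented) for frequencies and a concordance of "langue".
text = "la langue est un outil ; la langue evolue avec la communaute"
tokens = text.split()
freqs = Counter(tokens)
```

Such tools run server-side, so a reader with only a web browser can still query a selected set of texts.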
Moreover, SILFIDE should be helpful in compiling (and possibly documenting) the tools available in the field of textual-resource manipulation. These may be data-encoding tools, but also libraries of functions dedicated to normalized data. These additional functions will have to be progressively integrated into successive versions of the SILFIDE server.
Considering the different targets SILFIDE is aiming at, it is clear that the project could not have taken place without an underlying framework for the representation of structured documents in electronic form. As a matter of fact it was immediately obvious to us, even before starting any kind of work, that we should follow in the steps of the Text Encoding Initiative rather than devise our own scheme, even if on some occasions we have had to simplify, if not to misuse, the actual guidelines provided by the TEI.
What we want to show here is how as large and multifarious a project as ours may consider the TEI from a great many points of view, depending on a) the different kinds of data that are to be represented and b) the different uses that are contemplated for them.
Right from its beginning, SILFIDE had to cope with two opposite views of the data it had to distribute. On the one hand, it had to provide any user with a concise and accurate description of the available linguistic resources which could be queried easily and rapidly. On the other hand, we had to allow access to the content of any resource for specific research purposes. We chose to resolve this conflict between efficiency and exhaustiveness by clearly assigning two different functions to the TEI header on one side and the <text> element on the other. Accordingly, we devised a user scenario which relies on two phases: one during which the user selects the resources he wants to work on, putting them into what we called the "shopping basket", and one corresponding to the actual work on the resources through specific tools which use the structural content of the data.
As such, the TEI header can be seen as a highly structured piece of information which can in essence be assimilated to a database of precisely identified fields (title, author, bibliographic source, etc.). Moreover, from a user's point of view (that is, a user who has not spent hours reading the TEI guidelines), the precise structure of such meta-information has no reason to be known in detail. This is why we decided to consider that the (virtual) set of headers associated with our whole collection would be compiled into a database accessible through a set of indexes directly queryable by the user.
However, putting this into practice can prove a highly difficult task, considering the degree of flexibility (which from other points of view may be valuable) that the TEI allows in the precise structure of the header. Designing a single indexing scheme over the header was all the more difficult for us as we have to cope with quite a large variety of document natures and genres: we have "standard" narrative texts, plays, transcriptions of oral documents, dictionaries and lexica, all requiring specific variations on the TEI header. For instance, the transcription of an oral dialogue implies an extensive and detailed use of both the sourceDesc (by way of the recordingStmt) and the profileDesc (in particular through the particDesc), which are used in quite a different way for a novel.
Considering this, we chose to adopt the following editorial policy regarding the header:
Among the mandatory fields are obvious ones like title and author in the fileDesc, or langUsage in the profileDesc, but also more "exotic" ones such as the systematic use of textClass and textDesc to provide SILFIDE with indexing elements for keywords, genre, domain, etc.
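As an illustration, a header following this policy might look as sketched below. All content values are invented, and the textDesc is simplified to the two children relevant here; element names follow TEI P3:

```xml
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>Un roman quelconque</title>
      <author>Jeanne Dupont</author>
    </titleStmt>
    <publicationStmt>
      <distributor>SILFIDE</distributor>
    </publicationStmt>
    <sourceDesc>
      <bibl>Printed edition, Paris, 1890</bibl>
    </sourceDesc>
  </fileDesc>
  <profileDesc>
    <langUsage>
      <language id="fr">French</language>
    </langUsage>
    <textClass>
      <keywords><term>novel</term><term>19th century</term></keywords>
    </textClass>
    <textDesc>
      <channel mode="w">print</channel>
      <domain>fiction</domain>
    </textDesc>
  </profileDesc>
</teiHeader>
```

It is precisely the fields shown in textClass and textDesc that feed the index database through which a user queries the collection.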
Just as it was difficult for us to provide a sound and generic description of the header for all the resources we had to deal with, we faced several difficulties in devising a clear editorial policy concerning the way we would encode the actual content of our data. As has been observed by several encoding projects which have used the TEI (e.g. the Women Writers Project at Brown), there is always a compromise to be reached between a) the precision of the encoding, which any of us would like to be as refined as possible, and b) the level of genericity of the corresponding document, that is, its compatibility with different possible usages. The main argument associated with this compromise is that, if one wants to keep a homogeneous encoding scheme within one's database, each step towards a more refined encoding may prove not only highly costly in time and funds, but also difficult to control and maintain.
As a result, we adopted the following general principles for the encoding of our resources:
There is no need to detail here the technical platform from which the current version of the SILFIDE server is derived. We can simply point out that all the developments are based on the Internet and its protocols, so that it will eventually be possible to access the server directly from any standard web browser. However, the SILFIDE service, unlike some private initiatives such as the ABU server (Association des Bibliophiles Universels), is not intended for the general public but for a community of researchers who wish to work on the French language. Without making the procedure particularly cumbersome, we have therefore set up a registration system which makes it possible to identify the different users of the server, whether they are suppliers or mere readers. Consequently, none of the functions of the server which require direct access to the resources themselves are available without prior authorization.
In outline, the SILFIDE server takes the form of a navigation interface which gives access to the following functions:
The SILFIDE server, which in its experimental version currently contains an initial corpus of texts and dialogue transcriptions (about 5 million words, or 30 megabytes of data), is accessible at the following address: http://www.loria.fr/Projet/SILFIDE.
The SILFIDE project will only prove its usefulness when it becomes a "natural" component of any research based on linguistic data, that is to say, a place where a user will spontaneously and systematically think of searching for the data necessary to his work, and where any user will feel like a potential supplier. At first, and in accordance with the initial objective of the project, SILFIDE must accompany the structuring actions of our community, such as the Concerted Research Actions of the Aupelf-Uref. Beyond this point, it is important for the project to be enriched with related developments, in terms of its contents (data, tools) as well as within the scope of research projects which would rest on this structure. Finally, a medium-term perspective is the transfer of the SILFIDE model to other sites, in Europe or elsewhere, which want to promote a similar server for languages other than French, or within the context of specific projects (structuring actions with Eastern Europe, for instance).
In that way, an enrichment of the structure is conceivable, firstly because the compatibility of the different holdings should allow, in the end, the interconnection of such servers, and also because each site could develop additional access tools available to all.
Finally, putting the TEI into practice clearly shows that, rather than aiming at being a "standard", the TEI is an occasion to share practices, and sometimes a kind of philosophy, in the encoding of textual documents. Above all, the TEI will prove truly valuable when we are really able to exchange both data and tools (such as SILFIDE) between us without having to revise either of them. This is not yet within reach, but it can be achieved by even more collaborative work between the sites concerned with digital resources.
Anuff E. (1996), The Java Sourcebook, J. Wiley and Sons, New York, Chichester, Brisbane.
Aarts Jan, Peter de Haan and Nelleke Oostdijk (Eds) (1993), English Language Corpora: Design, Analysis and Exploitation, Rodopi, Amsterdam.
Association for Computers and the Humanities (ACH), Association for Computational Linguistics (ACL), and Association for Literary and Linguistic Computing (ALLC) (1994), Guidelines for Electronic Text Encoding and Interchange (TEI P3), edited by C. M. Sperberg-McQueen and Lou Burnard, 2 volumes, Chicago, Oxford: Text Encoding Initiative.
Béthery A. (1993) Abrégé de la classification décimale de Dewey, Editions du cercle de la librairie, Collection Bibliothèques.
Dunlop D. (1995) "Practical Considerations in the Use of TEI Headers in a Large Corpus," Text Encoding Initiative: Background and Context, Kluwer Academic Publishers, Dordrecht, p. 85-98.
Heid U. and Oliver C. (1996), An Investigation into the Use of AFS for Distribution and Networking of Linguistic Resources and Tools, Technical Report, Universität Stuttgart, Institut für maschinelle Sprachverarbeitung.
Ide N. and Véronis J. (1995), MULTEXT/EAGLES Corpus Encoding Standard, Document Version 0.1, CNRS, Aix-en-Provence.
IJCL-96, International Journal of Corpus Linguistics, V. 1 N.1, John Benjamins, 1996.
Krol E. (1992), The Whole Internet: User's Guide and Catalog, O'Reilly & Associates, Sebastopol, Nutshell Handbook series.
Lapeyre D.A. & Usdin T. (1996), TEI and the American Memory Project at the Library of Congress, Workshop: The Text Encoding Initiative Guidelines and their Application to Building Digital Libraries (20-23 March 1996).
Péry-Woodley Marie-Paule (1995), "Quels corpus pour quels traitements automatiques ?" ["Which corpora for which automatic processing?"], in TAL-95.
Pino M. (1996), Encoding Two Large Spanish Corpora with the TEI Scheme: Design and Technical Aspects of Textual Markup, Workshop: The Text Encoding Initiative Guidelines and their Application to Building Digital Libraries (20-23 March 1996).
TAL-95, Traitements probabilistes et corpus, revue t.a.l., Volume 36, Number 1-2, 1995.
Last Modified: Wednesday, 19-Nov-97 15:23:54 EST