[This local archive copy mirrored from the canonical site: http://lacito.vjf.cnrs.fr/archivag/english.htm, text only; links may not have complete integrity, so use the canonical document at this URL if possible.]


Linguistic Data
Archiving Project


Boyd Michailovsky, LACITO/CNRS, coordinator.
John B. Lowe, LACITO/CNRS; University of California, Berkeley.
Michel Jacobson, doctoral candidate, LACITO/CNRS.


  1. Introduction
  2. Data structure
  3. Software
  4. Project history


  1. Flowchart
  2. DTD (document type definition)
  3. XML document: a Hayu story
  4. XSL stylesheet
  5. View of the XML document using the XSL style
  6. View of the XML document showing interlinear glosses
  7. SoundIndex (Macintosh) screen
  8. SoundIndex2 screen
  9. Internet Explorer: text/sound demonstration

1. Introduction

The goals of the LACITO linguistic data archiving project are the conservation and the distribution of speech data. To these ends it has developed norms for the preparation and exploitation of documents incorporating sound and text using internationally recognized standards, SGML (Standard Generalized Markup Language) in particular.

The main source of data for the project is the mass of documents recorded and transcribed in the field by members of the LACITO over the last thirty years. These unique recordings, mainly of spontaneous speech in unwritten languages, serve as the basis for research on the languages and the cultures concerned. Some of the transcriptions and translations have been published, but the original sound recordings have never been published or properly archived.

The documents prepared by the project incorporate both sound and text -- at a minimum a phonological transcription and a free translation, and where available word-by-word glosses, notes, etc. The text is indexed to the sound at the level of the "sentence" or intonational group. The documents can be accessed either locally on CD-ROM or over a network. (See the project flowchart, fig. 1.)

2. Data structure

2.1 XML markup

An explicit XML (Extensible Markup Language) markup has been adopted for the text materials. In many cases, older documents whose structure is implicit are marked up automatically by a program.

XML is a simplified subset of SGML (Standard Generalized Markup Language, ISO 8879); it was adopted as a recommendation by the W3C (World Wide Web Consortium) in February 1998. It is likely to be widely adopted and to benefit from rapid development of programming tools.

The structure of the XML documents prepared by the project is defined by a DTD (Document Type Definition) (fig. 2); all project documents are validated using public-domain tools. The DTD, which will be enriched as the need arises, defines a structure whose elements include a HEAD and a BODY, the latter being divided into segments corresponding roughly to sentences or phrases. Each segment (between the start-tag <S> and the end-tag </S>) contains labeled textual data of different sorts (transcription, translation, etc.) and an AUDIO element which specifies the time-offsets of the beginning and the end of the corresponding sound-data in the sound file. Fig. 3 shows the beginning of one of the project documents, a story in Hayu (a Tibeto-Burman language with about 200 speakers in eastern Nepal). The logical structure of the data is entirely explicit and equivalent to a tree-structure. Parsers, query languages, formatting languages, etc., exploiting this generic structure exist in the public domain.
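The segment structure described above can be read with any generic XML parser. The following is only a minimal sketch in Python, not the project's own software. Only HEAD, BODY, S and AUDIO are named in the text; the root element TEXT, the elements TITLE, FORM and TRANSL, the "id" attribute, and the "start"/"end" attributes (taken here to be milliseconds) are assumptions made for illustration and may differ from the actual DTD of fig. 2.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. TEXT, TITLE, FORM, TRANSL, the id attribute
# and the start/end attributes (assumed milliseconds) are NOT from the DTD.
SAMPLE = """\
<TEXT>
  <HEAD><TITLE>Sample story</TITLE></HEAD>
  <BODY>
    <S id="s1">
      <AUDIO start="0" end="2350"/>
      <FORM>transcription one</FORM>
      <TRANSL>first free translation</TRANSL>
    </S>
    <S id="s2">
      <AUDIO start="2350" end="5100"/>
      <FORM>transcription two</FORM>
      <TRANSL>second free translation</TRANSL>
    </S>
  </BODY>
</TEXT>
"""

def read_segments(xml_text):
    """Return one dict per <S> segment with its text fields and time offsets."""
    root = ET.fromstring(xml_text)
    segments = []
    for s in root.iter("S"):
        audio = s.find("AUDIO")
        segments.append({
            "id": s.get("id"),
            "form": s.findtext("FORM"),
            "transl": s.findtext("TRANSL"),
            "start": int(audio.get("start")),
            "end": int(audio.get("end")),
        })
    return segments

for seg in read_segments(SAMPLE):
    print(seg["id"], seg["start"], seg["end"], seg["form"])
```

Because the document is a tree, each segment carries its own transcription, translation and sound offsets, so a program can process one segment without reading the rest.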

2.2 XSL styles

The parameters of a view on an XML document -- that is, the elements to be displayed and the format of each -- can be defined in XSL (Extensible Style Language). XSL, which is itself an application of XML, was submitted to the W3C as a proposal in 1997. The XSL stylesheet shown in fig. 4, applied to the XML document of fig. 3, results in the view shown in fig. 5. A different XSL stylesheet gives the view shown in fig. 6, which includes word-for-word interlinear glosses.

2.3 Unicode character coding

The Unicode standard (or the Basic Multilingual Plane of the ISO/IEC 10646 standard) assigns a unique two-byte code to more than 30,000 characters and signs used in dozens of languages and writing systems, including Chinese characters, the IPA (International Phonetic Alphabet), Indic scripts, etc. Its purpose is to make non-standard codings based on particular fonts unnecessary.

The transcriptions of linguistic documents are generally coded in a particular font, SIL-IPA, IPA-Times, or NewLacito, for example. For the purposes of the current project, these documents are transcoded as ASCII text files, with phonetic characters represented as "entities" corresponding to their Unicode code-positions. Thus, for example, the IPA letter "eng" in the third word of the Hayu text (fig. 3) is represented as "&#x014B;". In the simplest case, text documents can be displayed using the font "Lucida Sans Unicode" which is included with Windows NT4 and includes 1752 glyphs (figs. 5, 6).
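The transcoding step can be sketched in two stages: map each font-specific code to its Unicode character, then write every non-ASCII character as a hexadecimal reference to its code-position. The Python sketch below is an illustration under stated assumptions, not the project's converter. The mapping table is invented: real tables for SIL-IPA, IPA-Times or NewLacito would have to be built from each font's documentation.

```python
# Hypothetical mapping from one legacy font's character codes to Unicode.
# This single entry is invented for the example; it pretends that the
# letter "N" is displayed as the IPA letter eng in the legacy font.
LEGACY_TO_UNICODE = {"N": "\u014B"}

def transcode(text, table=LEGACY_TO_UNICODE):
    """Map legacy codes to Unicode, then emit non-ASCII as &#xHHHH; references."""
    out = []
    for ch in text:
        ch = table.get(ch, ch)          # font-specific code -> Unicode character
        code = ord(ch)
        if code < 128:
            out.append(ch)              # plain ASCII is kept as-is
        else:
            out.append("&#x{:04X};".format(code))
    return "".join(out)

print(transcode("miN"))  # prints mi&#x014B;
```

The result is a pure ASCII file that does not depend on any particular font. A full converter would also have to escape the characters "&" and "<" where they occur in the text itself.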

2.4 Digitized speech

The sound file format used in the project is RIFF (WAV). This is the native Windows format, but it can be used on other platforms or converted to other formats. The project digitizes at 44.1 kHz with a resolution of 16 bits, stereo or mono depending on the original recording (usually mono). These parameters are perhaps excessive given the quality of the original recordings, but they have been chosen to avoid any further degradation of often irreplaceable documents.
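The storage cost of these parameters follows from simple arithmetic: sampling rate times bytes per sample times number of channels. The short Python sketch below works it out for uncompressed sound data; the function name is ours, and file headers are ignored.

```python
def pcm_bytes(seconds, sample_rate=44100, bits=16, channels=1):
    """Size in bytes of uncompressed PCM sound data (file headers excluded)."""
    return int(seconds * sample_rate * (bits // 8) * channels)

print(pcm_bytes(1))                    # 88200 bytes per second, mono
print(pcm_bytes(3600))                 # 317520000 bytes per hour, mono
print(pcm_bytes(3600, channels=2))     # 635040000 bytes per hour, stereo
```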

2.5 Sound alignment

A standard for multimedia documents on the web, SMIL (Synchronized Multimedia Integration Language), is being developed by the W3C. Until tools implementing this standard become available, the project has adopted a non-standard XML element tagged <AUDIO> and specialized program modules to handle the alignment of sound and text.

3. Software

Software development at the project has been pursued in three areas:

  1. Authoring tools
  2. Browsing tools
  3. Cataloging and access-management tools

3.1 Authoring tools

The SoundIndex program was written in 1996-1997 to help researchers mark up the correspondence between text and sound. The program, written for Macintosh in C++, displays the transcription, whose segments are delimited by a chosen character, and the sound wave (fig. 7), which is read either from a computer sound file (in WAV or AIFF format, on CD-ROM or computer disk) or from an audio CD track. The user listens to the sound and places markers corresponding to the beginning and the end of a segment of text directly on the display of the sound wave. The values of these markers for each segment (that is, the absolute times of the beginning and end in milliseconds, measured from the beginning of the sound file) are recorded in a table. This program is available for Macintosh 68000 or PPC, with a complete manual.

A new version of SoundIndex (fig. 8), for the alignment of documents marked up in XML, has been developed recently in Java for Windows (and ultimately for other platforms). The links between text and sound are recorded in an AUDIO element which is inserted into each segment of the text document by the program.

Documents aligned with the Macintosh version of SoundIndex can be converted by program to the XML format produced by the new version and exploited by the browser (see below).
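The step common to both cases, turning a table of markers into AUDIO elements inside the segments, can be sketched as follows. This Python sketch is an illustration only: the table is assumed to be a list of (start, end) pairs in milliseconds, one per segment and in document order, and the attribute names "start" and "end" and the FORM element are hypothetical. Neither the actual table format of the Macintosh version nor the actual attributes of AUDIO are given here.

```python
import xml.etree.ElementTree as ET

def add_audio(body_xml, markers):
    """Insert an <AUDIO> child into each <S>, pairing segments and markers in order."""
    root = ET.fromstring(body_xml)
    segments = list(root.iter("S"))
    if len(segments) != len(markers):
        raise ValueError("segment and marker counts differ")
    for s, (start_ms, end_ms) in zip(segments, markers):
        # Attribute names are assumptions, not taken from the project DTD.
        audio = ET.Element("AUDIO", {"start": str(start_ms), "end": str(end_ms)})
        s.insert(0, audio)              # make AUDIO the first child of the segment
    return ET.tostring(root, encoding="unicode")

doc = "<BODY><S><FORM>one</FORM></S><S><FORM>two</FORM></S></BODY>"
print(add_audio(doc, [(0, 2350), (2350, 5100)]))
```

Refusing to continue when the counts differ is a deliberate choice: a silent mismatch would shift every later segment onto the wrong stretch of sound.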

3.2 Browsing tools

The browser displays the text data while playing back the corresponding sound. The user accesses the document using a standard browser (Netscape, Internet Explorer, HotJava, etc.) and can listen to the sound corresponding to a chosen segment, or listen to the whole text while the transcription scrolls on the screen. The user can also choose a particular view depending on the styles available for a given document -- with or without translation, with or without interlinear glosses, language of the translation, etc. Access to the sound is managed by an "applet" adapted to the latest versions of the standard browsers. This applet plays back the sound after being passed the URL of the sound file and the time-indexes contained in the AUDIO elements of the XML file.
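The arithmetic behind playing one segment is a conversion from time offsets to sample positions in the sound file. The sketch below shows it in Python with the standard wave module; it is not the project's applet, and it assumes that the time-indexes are milliseconds measured from the beginning of the file. A second of silence built in memory stands in for a real recording.

```python
import io
import wave

def read_segment(wav_file, start_ms, end_ms):
    """Return the raw PCM frames lying between two millisecond offsets."""
    with wave.open(wav_file, "rb") as w:
        rate = w.getframerate()
        first = start_ms * rate // 1000     # first frame of the segment
        last = end_ms * rate // 1000        # frame just past the segment
        w.setpos(first)
        return w.readframes(last - first)

# One second of silent 16-bit mono sound at 44.1 kHz, built in memory.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(44100)
    w.writeframes(b"\x00\x00" * 44100)
buf.seek(0)

frames = read_segment(buf, 250, 750)
print(len(frames))  # 44100 bytes: 22050 frames of 2 bytes each
```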

A demo is available on the Internet; it is somewhat imperfect, partly because of incompatibilities between the "standard" browsers. Fig. 9 shows the text (in Nemi, a New Caledonian language) and the user interface as presented by the browser Internet Explorer.

3.3 Cataloging and access-management

A Microsoft Access database has been developed for internal use to catalog the original documents and the documents produced by the project. The possibility of a public database, giving access to information contained in the HEAD elements of the XML documents and to the documents themselves, is under study.

4. Project history

The first mock-up of a project document associating text and aligned sound was produced by J. B. Lowe in 1995. Since that time, the document and software architecture has been designed by J. B. Lowe and Michel Jacobson. The current software (SoundIndex and the various applets, scripts, etc.) was written by Michel Jacobson.

The project has been supported by the departments of Humanities and Social Sciences (SHS) and Engineering Sciences (SPI) of the CNRS (French National Center for Scientific Research).

Under a contract with the Agency for the Development of Kanak Culture, the LACITO is producing CD-ROMs in a dozen languages of New Caledonia for the Tjibaou Cultural Center in Noumea.

Addresses:

Boyd Michailovsky boydm@vjf.cnrs.fr
John B. Lowe jblowe@socrates.berkeley.edu ; web page: http://bantu.berkeley.edu/public/jblowe.html
Michel Jacobson jacobson@idf.ext.jussieu.fr ; web page: http://www.mygale.org/01/jacobson

B.M. / 6 May 98