CELL - A Chinese Language Learning System based on SGML

Scott Chiang
Michael Leventhal
Text Science, Inc.

[Document mirrored from http://www.textscience.com/ncell2.html. See this canonical URL at Text Science, Inc. for the most recent version, and links to essential graphics.]


This paper describes a prototype Collaborative Environment for Language Learning (CELL) which is used for the text-centered multimedia study of Chinese. The CELL uses Standard Generalized Markup Language (SGML) for the definition and interchange of learning material and is deployed on the Internet through the World Wide Web (WWW). The CELL emphasizes collaboration, taking advantage of the communication capabilities of the Internet, by allowing Chinese language students, teachers, and scholars to share their knowledge with other users of the system. The two-way information flow of the WWW also permits the setup of a virtual classroom with structured exercises and on-line guidance from a human teacher.

The following text is a Chinese translation of the abstract. It will look like garbage unless your system is capable of displaying Big 5 Chinese characters. Some examples later in this text are also encoded in Chinese.

本文介紹一種多媒體合作共濟式的語言學習系統 CELL 來學習中文. 在 CELL 系統中所有的中文及外文教材都要符合 Standard Generalized Markup Language (SGML) 的規範才能在世界網路 (WWW) 及電腦網路上 (Internet) 交換資訊. CELL 特別強調學習中文時的共同參與及合作共濟特性. 因為使用 CELL 的學生 老師學者都能同時參與共同學習分享知識. CELL 利用世界網路的雙向通訊功 能創造了一種虛擬教室 Virtual Classroom 的學習環境.在虛擬教室中老師 可以線上指導及互動式學習.


There are two major advantages to using computers with multimedia technology to study the Chinese language:

The first item is well-understood and has been the focus of most of the effort in computer-based language learning. The possibility of taking advantage of the second item on a large scale has only emerged recently and has not, heretofore, been exploited.

Traditional language learning is a very inefficient process. One major hurdle in progressing from the low intermediate to the advanced stages is the acquisition of a sufficiently large vocabulary, which requires countless hours thumbing through the dictionary. Chinese is particularly difficult in this respect since only Japanese and Korean students will be able to recognize written cognates. As students graduate from primers to real selections from literature, encountering complex constructions and historical and idiomatic language, deciphering the correct meaning becomes increasingly difficult and often requires line-by-line exegesis from the instructor.

The most successful technique for acquiring an adequate vocabulary and mastering decoding is simply to read a lot. Yet the tedium of the mechanics of the process defeats all but exceptionally determined individuals - the majority never attain competency and fluency. One method for making reading in a foreign language more efficient and more rewarding is the use of bilingual and annotated texts. A bilingual text gives the student immediate help when he encounters unfamiliar words or difficult constructions and enables him to tackle material which may be stimulating to him but would be beyond his reach unaided. Bilingual material which has been specially prepared for the language learner is especially valuable. Unfortunately, there is very little of this kind of material available.

Computers can improve on the bilingual reader. A software program can present Chinese text and supply translations on demand in response to the reader's mouse click. Because the material is presented in Chinese and the translation appears only on request, the reader is discouraged from relying on the translation as he might with a bilingual text. It is possible for the software to put pedagogical material in different windows depending on its content - for example, a word dictionary in one window, grammar notes in another, sentence translations in a third - organizing the information into an environment optimized for language instruction.

The software might do other things such as automatically generate vocabulary lists for the student based on the words he marks in the text. Finally, the computer can integrate sound, pictures, and video into the text. Sound is obviously important in helping a student to speak and understand the spoken language, and perhaps more helpful to the student of Chinese than it is to the student of any other language. Pictures not only enliven the material, but may convey very useful information such as diagrams illustrating how to draw characters or examples of handwritten characters. Video has been shown to be a very successful way to convey real-life conversational experience to students. While speed and bandwidth problems hamper the full exploitation of this medium by computer software, the solution to these problems is in the foreseeable future. Computer software will give the student fine control to stop and replay video sequences, to hide and display subtitles, and to link video into an integrated curriculum.

Language students often find that studying together is a way both to learn more efficiently and to reduce some of the unproductive repetitiveness and tedium of language learning. A well-run classroom is, in fact, an example of how organized collaboration can provide both the most efficient and the most rewarding environment for language learning. The computer can be used as a tool to make it easier for individuals to work together. For example, a student who has decoded a difficult phrase will write out his translation, grammatical analysis, and other notes including, perhaps, references to other material. If the student is reading the text through a computer network, it is possible for him to share his understanding nearly instantaneously with other students, either in his class or around the world. The next student who reads the same material can, if he so chooses, build on the knowledge of those who have encountered the same problems, increasing the resources at his disposal for learning Chinese.

This model applies equally well to the perspective of the preparer of language learning material, the teacher. The same vehicle that is used to enable students to cooperate can be used to enable expert instructors around the world to collaborate in the development of high-quality multimedia instructional material. Multimedia and more traditional material are both, today, extremely time-consuming and complicated to produce. A collaborative framework removes the need for special expertise in the production of printed material or computer software. And while such material is today commonly the work of a single individual, the framework can harness the efforts of thousands. The producer of material also has a remarkable means of dissemination at hand - the Internet is capable of distributing his work from Antarctica to the Arctic Circle in a matter of minutes.

It is also possible to use this infrastructure to create Virtual Classrooms where students are guided through a curriculum and are able to interact with the instructor, take examinations, and communicate with other students through the computer. This can provide many of the benefits of a classroom setting to students who may be unable to attend school.


The Collaborative Environment for Language Learning (CELL) is designed to offer students of Chinese both the traditional advantages of using multimedia computers and the advantages of a collaborative paradigm through the use of network-driven software.

CELL is based on international standards and is implemented in our prototype entirely with free software and shareware. A CELL server can be implemented by anyone with a machine running a World Wide Web (WWW) server and the CELL can be used by any student with access to the WWW from a computer capable of displaying Chinese.

The prototype CELL uses NCSA's Mosaic (or any other WWW-compatible browser) and can support text and graphics as well as sound and video if applications to handle these media types are available on the student's computer. There are no specific requirements on the type of software which may be used for multimedia, as long as it is capable of handling the particular media format presented.

CELL displays Chinese text and allows users to access various kinds of information about the material being studied. For example, translations at the levels of glyphs, characters, words (semantic units), phrases, and sentences may be displayed, as well as grammatical and literary notes, sound files with pronunciation, and graphic files with illustrations of handwritten characters.

CELL is collaborative. Users can add their own annotations to any of the information categories listed above and send those annotations to the server to be shared with other users. Users may also select the annotations they wish to see; for example, they may only be interested in the annotations of certain individuals or members of their class.

Creating material for use with the CELL is easy. The Chinese text is marked with tags that define the structure of the text. The structure is used to define annotatable units, e.g., words (semantic units), phrases, and sentences. The syntax of the markup is defined by an ISO (International Organization for Standardization) standard, Standard Generalized Markup Language (SGML). The markup itself is plain text, inserted directly into the Chinese text file, and may be created with any word processor. Once the text has been provided to the CELL, the instructor may create instructional material using the annotation feature, or he may prepare his annotations off-line and import them into the CELL using its batch annotation facility.



The World Wide Web is a global hypertext information network running on the Internet. Web servers are computers connected to the Internet which store documents and send them to users that request them. Users request documents with a Web browser which is, typically, running on a personal computer connected through a modem to an Internet service provider. Web documents may contain hypertext links, i.e., references to other Web documents which, when selected in a Web browser, cause the browser to request the referenced document from the server on which it resides and then to display it. Documents anywhere on the Web may be referenced; it is common for someone "surfing" to jump to documents stored in every corner of the globe within minutes. Web documents may contain multimedia, although current Web browsers only directly support some graphics formats. Most Web browsers, however, can call up other applications to present media types they cannot process themselves, such as sound. Free and shareware Web server and browser software is available, so there is no direct cost associated with participating in the Web for those that are already connected to the Internet. The cost of an Internet connection from a commercial provider is currently about $20 a month.

Web servers can be extended with programs that permit interactive communication with the user. Web documents can be forms which capture information from the user through fill-in fields, check boxes, and selection lists. The information is returned to the Web server where it is processed by whatever program is associated with the form. A program may, for example, put registration information in a database for a student enrolling in a class, or return a document found in a database using search criteria supplied by the user.

CELL stores two types of documents on the Web server: Chinese texts and annotations of the Chinese texts. The Chinese texts are imported onto the server using data filtering programs provided as part of the CELL. The source material is encoded in SGML, as described in the following section. The data filtering programs do three things with each Chinese text:

  1. Convert the input to a CELL-capable WWW representation, Hypertext Markup Language (HTML), which is itself an SGML application.

  2. Make each character in the text a hypertext reference.

  3. Create an annotation database for the document and associate hypertext references in the Chinese document with annotation documents.
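As a rough illustration of steps 2 and 3, the sketch below wraps each character of a text in an HTML anchor and records which position each link refers to. It is not the CELL filter itself: the function name, the URL path /cell/annotations, and the query parameters doc and pos are all assumptions made for this example, and the returned dictionary only stands in for the annotation database.

```python
from html import escape
from urllib.parse import urlencode


def link_characters(text, doc_id, base_url="/cell/annotations"):
    """Wrap every non-whitespace character of `text` in an HTML anchor.

    `base_url`, `doc` and `pos` are hypothetical names chosen for this
    sketch. Returns the HTML string and a dict mapping character position
    to character, a stand-in for the per-document annotation index.
    """
    parts = []
    index = {}
    for pos, ch in enumerate(text):
        if ch.isspace():
            parts.append(ch)  # whitespace is passed through unlinked
            continue
        query = urlencode({"doc": doc_id, "pos": pos})
        parts.append(
            '<a href="{}?{}">{}</a>'.format(base_url, escape(query), escape(ch))
        )
        index[pos] = ch
    return "".join(parts), index


html_text, index = link_characters("種 進", "t1")
```

Keying each link by character position keeps the Chinese document and its annotations in separate files, so annotations can be added later without regenerating the text.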

Annotations submitted directly to the server are also encoded in SGML. The SGML language is described in the following section. Annotations may include, for example, translations of semantic units (words), phrases, and sentences, historical, literary, or grammatical notes, pictures illustrating handwritten characters, stroke order, or character building blocks, and sound files demonstrating pronunciation. Annotations are explicitly associated with portions of text in the Chinese text file. The annotations also have a field identifying the submitter which will later enable the Learner to view only specific sets of annotations.

The learner accesses the Chinese documents through a Web browser. The CELL prototype has been run under Microsoft Chinese Windows. Chinese characters are displayed correctly by the browser simply by selecting the appropriate default font. Since every character is a hypertext link, the learner may click anywhere on the document in order to display the annotations associated with the position of the mouse. Any position may have multiple levels of annotations associated with it; for example, the translation of the immediate word, translations of the phrase and sentence that word is in, and a note applying to the entire paragraph. In addition, annotations may have been provided by many users of the CELL. The learner is able to customize his CELL session, selecting the types of annotations and the annotation providers in which he is interested.
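The selection described above can be pictured as a filter over annotation records. The following is a minimal sketch under stated assumptions, not the prototype's code: the Annotation record, its field names, and the use of dotted addresses such as "1.2.1" to identify a component and its enclosing units are hypothetical choices for this example.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    idref: str      # address of the annotated component, e.g. "1.2.1"
    type: int       # annotation type number (hypothetical values here)
    submitter: str  # who provided the annotation
    text: str


def select_annotations(annotations, idref, types=None, submitters=None):
    """Return annotations on `idref` or on any component that encloses it,
    optionally restricted to given type numbers and submitters."""
    def covers(a):
        return idref == a.idref or idref.startswith(a.idref + ".")

    return [
        a for a in annotations
        if covers(a)
        and (types is None or a.type in types)
        and (submitters is None or a.submitter in submitters)
    ]


notes = [
    Annotation("1", 0, "teacher", "sentence translation"),
    Annotation("1.2", 0, "student_a", "clause translation"),
    Annotation("1.2.1", 1, "teacher", "word pronunciation"),
    Annotation("2.1", 0, "teacher", "a different sentence"),
]
teacher_only = select_annotations(notes, "1.2.1", submitters={"teacher"})
```

Here a click on component 1.2.1 collects the annotation on the word itself plus those on the clause and sentence containing it, and the submitters argument narrows the result to one provider.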

CELL takes advantage of the two-way communication of the Web by allowing the learner to add his own annotations to the document. The annotations will be automatically associated with the location in the document the learner is viewing at the structural (e.g., word, sentence, note) level selected. The annotations will be added to the database with the identification the user provides, available immediately for others to view.

II. SGML and Import Facilities

Standard Generalized Markup Language (SGML) is a language-independent international standard (ISO 8879, also formally supported by an alphabet soup of organizations like NIST, DoD, PTO, IRS, NATO, CERN and many other national and international organizations) for encoding text-based applications. Such applications have included: publishing, computer-based textbooks, hypertext, and multimedia. Some of the advantages of using SGML to create an on-line Chinese language learning environment are:

CELL is SGML-based. As mentioned above, Web documents are encoded in HTML, an SGML language. SGML enables anyone to easily create Web documents and to ensure that they are syntactically correct. However, HTML is not designed to represent the structure of information, for example, how it is broken up into words, phrases, or sentences. One additional tag must be introduced into HTML to represent these structures: the nested component. The syntax of the nested component is as follows:

                                    <c> text </c>

where <c> and </c> enclose the structure which is being defined (text). All text must be enclosed by one or more <c> tags. Usually, punctuation and spaces are not individual units of interest, so they are simply included with the adjacent word. Nested components may be nested to any depth; for example:

                                     池溏  裡. 

has three levels of structure defined by nesting level. We could call those levels sentence, clause, and phrase, equivalent to the more readable structure:

                                    <word> 種 </word>
                                    <word> 進 </word>

CELL uses a generic component with arbitrary nesting depth in order to achieve language independence. For example, the concept of a sentence may or may not be applicable to a Chinese text while a semantic unit of multiple characters in Chinese does not even have a commonly recognized name in English.
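To make the idea of a generic component with arbitrary nesting depth concrete, the sketch below parses <c> markup into nested Python lists and measures the nesting depth. It is an illustration only: it recognizes just the <c> and </c> tags, and the sample string is made up for this example rather than being a reconstruction of the example shown earlier.

```python
import re

# Matches an opening <c>, a closing </c>, or a run of text between tags.
TOKEN = re.compile(r"<(/?)c>|([^<]+)")


def parse_components(markup):
    """Parse nested <c>...</c> markup into nested lists of strings."""
    root = []
    stack = [root]
    for m in TOKEN.finditer(markup):
        text = m.group(2)
        if text is not None:
            if text.strip():
                stack[-1].append(text.strip())
        elif m.group(1):  # closing tag </c>
            if len(stack) == 1:
                raise ValueError("unmatched </c>")
            stack.pop()
        else:  # opening tag <c>
            node = []
            stack[-1].append(node)
            stack.append(node)
    if len(stack) != 1:
        raise ValueError("unclosed <c>")
    return root


def depth(node):
    """Number of component levels at or below `node` (text counts as 0)."""
    if isinstance(node, str):
        return 0
    return 1 + max((depth(child) for child in node), default=0)


# Hypothetical sample with three levels of nesting.
sample = "<c><c><c>種</c><c>進</c></c> <c><c>池溏</c><c>裡.</c></c></c>"
tree = parse_components(sample)
```

Because the parser attaches no names to the levels, the same code handles a text marked up as sentence, clause, and word and one that uses some other division.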

Documents may be submitted to CELL as fully marked up HTML with the addition of the nested components, or they may be submitted in a simpler form. The tag <c> may be shortened to a single character and the </c> end tag may be omitted. A document may also be submitted without any markup whatsoever. In this case, the <c> tag is implicitly placed around each word. Words are recognized by a space or punctuation separator.
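The implicit tagging of unmarked text might look something like the following sketch. The separator set and the function name are assumptions for this example; the paper does not list the punctuation the prototype recognizes. Separators are kept with the preceding word, in line with the convention that punctuation and spaces are included with the adjacent word.

```python
import re

# Hypothetical separator set: whitespace plus some Western and Chinese
# punctuation. The real CELL filter may use a different set.
SEP = r"\s.,;:!?。，、；：！？"

# A word is a run of non-separators followed by any trailing separators.
WORD = re.compile("[^" + SEP + "]+[" + SEP + "]*")


def implicit_markup(text):
    """Wrap each word, with its trailing separators, in <c>...</c>.

    Separators that precede the first word are discarded in this sketch.
    """
    return "".join("<c>{}</c>".format(m.group(0)) for m in WORD.finditer(text))


marked = implicit_markup("池溏 裡.")
```

Note that this approach relies on the submitter having put spaces or punctuation between words; running Chinese text with no separators would come out as a single component.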

Annotations are also encoded in simple SGML. The basic syntax of an annotation is:

                                    <annotation idref= " link to Chinese text "  [ type= " # " ]> 
                                     ...annotation text... </annotation>

The idref provides the connection to the Chinese text by using a nested component addressing scheme. For example, if the components were sentence, clause, and word, the second clause in the fourth sentence would be referenced by

                                    idref="4.2"

The optional type field is used to classify annotations by type. If it is omitted it is assumed to be translation text. The type is given by number, defined by the <type> tag:

                                   <type number="1">Audio Pronunciation</type>

It is useful for the submitter of annotations to provide names that will be used to identify the annotation's structure level to the CELL user. This is accomplished by the <aname> tag, e.g.:

                                  <aname level="1">sentence</aname>
                                  <aname level="2">clause</aname>
                                  <aname level="3">word</aname>
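The addressing scheme and the level names can be sketched together. The code below assumes that an idref is a dot-separated list of 1-based positions, one per nesting level, which is how we read an address such as "1.1.1"; the function names and the nested-list representation of the document are hypothetical.

```python
def resolve(tree, idref):
    """Follow a dotted, 1-based address such as "4.2" through nested lists."""
    node = tree
    for part in idref.split("."):
        position = int(part)
        if position < 1:
            raise ValueError("positions in an idref start at 1")
        node = node[position - 1]
    return node


def level_name(idref, names):
    """Name the level an idref points at, given {level number: name}."""
    return names[idref.count(".") + 1]


# Names as they would be declared with <aname level="...">.
names = {1: "sentence", 2: "clause", 3: "word"}

# A made-up document: four sentences, each a list of clauses,
# each clause a list of words.
doc = [
    [["a", "b"], ["c"]],
    [["d"]],
    [["e"]],
    [["f", "g"], ["h", "i"]],
]
target = resolve(doc, "4.2")
```

With these definitions the address "4.2" selects the second clause of the fourth sentence, and level_name reports it to the user as a "clause".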

The annotation is typically plain text, although HTML is also allowed. Sound and other media are included in an annotation through the use of SGML's NOTATION and ENTITY facilities.

                                 <!NOTATION wav SYSTEM "audio/x-wav">
                                 <!ENTITY wav111 SYSTEM "sound file external identifier" NDATA wav>

                                <annotation idref="1.1.1">&wav111;</annotation>  

The NOTATION is a Multipurpose Internet Mail Extensions (MIME) content type, which is the standard used by the Web to transmit multimedia information.


The CELL described in this paper is a prototype. The concept can be developed and improved in several areas.

The Web browser is adequate but not optimal for a language learning environment. We feel that the Learner's environment can be improved with special Viewers either running off of the Web browser or replacing it.

The CELL database is built on the file system. In order to manage a large database and ensure its integrity and security the server should employ a real database engine. Furthermore, there are many possibilities for the extension of CELL as far as the management of the annotation database goes in terms of establishing public and private access, working groups, making revisions, qualifying submissions, etc.

More sophisticated tools could be developed to make it easier to develop and enhance CELL material.

CELL could be enriched by integrating other services for language learners such as games, vocabulary list generation, tests, and a browsable dictionary. An interface could be created to make CELL into a virtual classroom with structured exercises and on-line guidance from a human teacher.

The most important development, though, is to get the word out about the CELL and to encourage as many people as possible to participate in creating source material. Although we believe that access to free CELL servers is important in order to promote the study of the Chinese language as widely as possible, the commercialization of the WWW which is underway should be very helpful in enabling teachers and scholars to devote time to developing high quality CELL materials.

Further Study

McArthur, Douglas. "World Wide Web & HTML", Dr. Dobb's Journal, December 1994.

Herwijnen, Eric Van. Practical SGML, Kluwer Academic Publishers, 1994.

The authors may be reached at:

Text Science, Inc.
Text Science Tower
1800 Lake Shore Ave., No. 14
Oakland, CA 94606
(V) 510-444-2962 (F) 510-444-1672
michael@textscience.com scott@textscience.com
