[Archive copy mirrored from the URL: http://www.qucis.queensu.ca/achallc97/papers/p045.html; see this canonical version of the document.]

Paper

A Digital Library System for Japanese Classical Literature

Shoichiro Hara

National Institute of Japanese Literature
hara@nijl.ac.jp

Hisashi Yasunaga

National Institute of Japanese Literature
yasunaga@nijl.ac.jp

Keywords: SGML, Japanese classical text, multimedia database

1. Overview

The National Institute of Japanese Literature (NIJL) designs, builds, manages, and maintains databases on Japanese classical literature for academic researchers in Japan and abroad. The NIJL database system consists of a computer and an inter-network, and provides three catalogue databases: the Catalogue of Holding Microfilms of Manuscripts and Printed Books on Japanese Classical Literature, the Catalogue of Holding Manuscripts and Printed Books on Japanese Classical Literature, and the Bibliography of Research Papers on Japanese Classical Literature. A feature of the NIJL computer system is that all data processing, from data compilation and correction to database service and publishing, is executed on a mainframe computer system. However, over more than ten years of operation, the database system has accumulated many unsolved software and hardware problems. To solve these problems, NIJL has started a new digital library project for Japanese classical literature. Over several years, this project will downsize the mainframe computer system and reconstruct it as a distributed computer system. The key words of this project are "standardization of data," "data independent from systems," and "multimedia oriented." At present, following this policy, we are reconstructing the catalogue databases and full text databases, and from this year we are starting to construct a new image database of Holding Manuscripts and Printed Books on Japanese Classical Literature.

After a few years of experiments, we have recognized that a digital library alone cannot always contribute to the research activities of humanities scholars. A digital library is only a bank of raw material data, whereas valuable results are produced in individual research environments. We therefore believe that better and more effective software tools, linked with digital libraries for downloading raw data and uploading research results, can assist researchers in their work. We have begun a new study of software for the humanities, which we call a "Digital Study System."

In the following, chapter two describes the ongoing projects of the digital library, and chapter three describes the new image database project. Finally, chapter four describes the new study of the "Digital Study System" for the humanities.

2. Ongoing Projects

2.1 SGML as the Basis of Data Description

There are several languages or standards for describing text structures, including SGML (Standard Generalized Markup Language), TeX, PostScript, and ODA (Open Document Architecture). Among these, we regard SGML as the only language that can describe the logical structure of text. As it is established as both an ISO and a JIS (Japanese Industrial Standard) standard, many applications have been developed for it.

At present, we are reconstructing the catalogue databases and full text databases. Both kinds of data can essentially be considered as nested, variable-length string fields. SGML can describe complicated text structures such as repeating groups, nesting, order of appearance, and number of appearances. If a data search is regarded as "a search for a specific string in text data," it is possible to construct a database system that uses a string searching device. In fact, in research on Japanese literature, searching by string is more common than searching by number.
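The idea of treating a search as "a search for a specific string" inside a particular field can be illustrated with a minimal sketch. The element names (`record`, `callno`, `title`, `author`, `note`), the call numbers, and the sample records are assumptions made for illustration and are not taken from the NIJL DTD; a plain regular expression stands in for a real SGML parser or string searching device.

```python
import re

# Hypothetical catalogue records. Element names and call numbers are
# illustrative only; repeated <author> and <note> elements model the
# repeating groups that SGML can describe.
RECORDS = [
    "<record><callno>A-001</callno><title>Genji monogatari</title>"
    "<author>Murasaki Shikibu</author>"
    "<note>manuscript</note><note>54 vols.</note></record>",
    "<record><callno>B-017</callno><title>Kokin wakashu</title>"
    "<author>Ki no Tsurayuki</author><author>Ki no Tomonori</author></record>",
]


def find_in_element(records, element, needle):
    """Return call numbers of records where `needle` occurs inside `element`."""
    pattern = re.compile(r"<{0}>(.*?)</{0}>".format(re.escape(element)), re.S)
    hits = []
    for rec in records:
        # findall returns every occurrence, so repeating groups are covered.
        if any(needle in value for value in pattern.findall(rec)):
            hits.append(re.search(r"<callno>(.*?)</callno>", rec).group(1))
    return hits
```

Because the search is restricted to one element, a string that appears in a title does not produce a hit when the author field is searched, which is the main benefit of keeping the logical structure in the data.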

Meanwhile, fast string search devices and software are being developed and sold, and all of these products are capable of handling SGML data. Consequently, we have carried out several projects based on SGML.

2.2 Catalogue Databases

Catalogue data is used for various purposes, such as on-line database service, publication in printed form, and publication on CD-ROM[1]. The database system was designed more than 10 years ago around the devices available at that time. As the latest computer systems cannot support these devices, we took this opportunity to begin reconstructing the whole database system. Having reviewed the old systems, we adopted a new policy of making data independent of hardware and software; specifically, we introduced SGML to describe the data[2]. Because the original data was prepared and compiled by librarians from their own point of view, some researchers are not satisfied with its contents for their research purposes. We are adopting their advice and expanding the data structure while reconstructing the systems.

Based on the ideas described above, we began reconstructing the catalogue database systems. Specifically, we have:

1) Created a DTD (Document Type Definition) for the new catalogue data.
2) Converted original data to SGML data.
3) Test-produced a database system using a string searching tool.
4) Converted SGML data to LaTeX data and output in printed form.

The tools used in this project were MARK-IT for the data conversion and structure analysis, and OPEN-TEXT for the string searching engine.
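Steps 2) and 4) above can be sketched in a few lines. This is only an illustration of the data flow, not a description of how MARK-IT or OPEN-TEXT work: the record layout, the element names, and the LaTeX `description` output are all assumptions, and the regular expression handles only flat records without attributes or nesting.

```python
import re

# LaTeX characters escaped in this sketch (not a complete list).
LATEX_SPECIALS = {"&": r"\&", "%": r"\%", "#": r"\#", "_": r"\_", "$": r"\$"}


def escape_sgml(text):
    """Protect markup delimiters inside field values."""
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def unescape_sgml(text):
    """Reverse escape_sgml; '&amp;' is restored last."""
    return text.replace("&lt;", "<").replace("&gt;", ">").replace("&amp;", "&")


def to_sgml(record):
    """Wrap every field in tags; a list value becomes a repeating group."""
    parts = []
    for name, value in record.items():
        values = value if isinstance(value, list) else [value]
        for item in values:
            parts.append("<{0}>{1}</{0}>".format(name, escape_sgml(item)))
    return "<record>" + "".join(parts) + "</record>"


def sgml_to_latex(sgml):
    """Render one flat <record> as a LaTeX description list."""
    inner = re.fullmatch(r"<record>(.*)</record>", sgml, re.S).group(1)
    lines = [r"\begin{description}"]
    for name, value in re.findall(r"<(\w+)>(.*?)</\1>", inner, re.S):
        text = "".join(LATEX_SPECIALS.get(ch, ch) for ch in unescape_sgml(value))
        lines.append(r"\item[{0}] {1}".format(name, text))
    lines.append(r"\end{description}")
    return "\n".join(lines)
```

The point of the intermediate SGML form is that the same marked-up record can feed both the string search service and the printed output, so neither depends on the original storage format.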

2.3 Full Text Database

When we began constructing our full-text database, there was little movement toward standardizing text data description in Japan. We considered SGML a favorable standard for defining text structure and describing text data. However, its Japanese standard had not been established and, worse, there were no applications that could handle the Japanese language. For these reasons, we had to establish our own text description rules based on SGML. We call these the KOKIN rules (KOKubungaku (Japanese literature) INformation)[3].

Because the KOKIN rules were designed to be easy to understand and use, they have been favored by humanities researchers. However, as they are independent of any other standard, there are no good tools to parse and check KOKIN-based texts. SGML was originally developed as a document markup language for publishers, but recently it has come to be regarded as an encoding scheme for transmitting data among systems. Against this background, we concluded that our text data should be converted to an SGML-based form for effective data circulation. As SGML has recently become popular in Japan, we began a project to construct a new full-text database based on SGML[4]. We used "the Anthology of Storiette" as a sample. This is a collection of short stories of the citizens of the Edo period; the text has already been transcribed by one of our co-workers at NIJL and marked up with the KOKIN rules. The text has complex structures such as editorial corrections, side notes, Japanese renderings, and so on. We conducted the following experiments:

1) Creating DTD for "the Anthology of Storiette."
2) Converting the original data to the SGML data.
3) Constructing the database system using a string searching tool.
4) Converting SGML data to LaTeX data and printing a block-copy manuscript.

The tools used in this experiment were the same as for the above project.

3. Image Database for the Study of Japanese Classical Literature

One of the main complaints from database users is that "the catalogue databases are undoubtedly useful for finding out whether materials exist, but accessing the materials themselves is difficult for distant users (especially foreign users)." In response to this request, we have begun a new project to construct the "Image Database for the Study of Japanese Classical Literature," supported by a grant-in-aid from the Ministry of Education, Science and Culture.

The image data is derived from the microfilms of the "Holding Manuscripts and Printed Books on Japanese Classical Literature," both to avoid copyright problems and to speed up construction. Each image is sampled at 1 bit per pixel (bilevel) and 600 DPI resolution, compressed with the G4 (CCITT Group 4) method, and stored in TIFF format.

The image database will be linked with the new catalogue database mentioned above. Database users first consult the catalogue database to search for the material they want, and then access its image data by following the link between the two databases (this link is based on the call number of the material in both databases).

We have also devised a way to link the two databases in the opposite direction. We write the call number of the original material into each image data file (TIFF tag 0x10d, "Document Name," is used for this purpose), so that a viewer program can access the corresponding catalogue information automatically by matching the call number. With this device, database users can first browse the image database to find interesting material, and then access its catalogue information by following the link in the opposite direction.
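How a viewer program might recover the call number from an image file can be sketched with the standard TIFF directory layout. This is a minimal illustration, assuming the call number is stored as an ASCII string in the first image file directory; the call number "NIJL-0001" and the helper `make_demo_tiff` are hypothetical, and the demo file contains only this one tag, so it is not a viewable image.

```python
import struct

DOCUMENT_NAME = 0x10D  # TIFF tag 269, "DocumentName"
ASCII = 2              # TIFF field type for NUL-terminated strings


def read_document_name(data):
    """Return the DocumentName string from the first IFD, or None if absent."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a TIFF file")
    (n_entries,) = struct.unpack(order + "H", data[ifd_offset:ifd_offset + 2])
    for i in range(n_entries):
        start = ifd_offset + 2 + 12 * i  # each directory entry is 12 bytes
        tag, typ, count = struct.unpack(order + "HHI", data[start:start + 8])
        if tag == DOCUMENT_NAME and typ == ASCII:
            if count <= 4:  # short values are stored inline
                raw = data[start + 8:start + 8 + count]
            else:           # longer values are stored at an offset
                (offset,) = struct.unpack(order + "I", data[start + 8:start + 12])
                raw = data[offset:offset + count]
            return raw.rstrip(b"\0").decode("ascii")
    return None


def make_demo_tiff(call_number):
    """Build a little-endian TIFF skeleton holding only a DocumentName tag."""
    value = call_number.encode("ascii") + b"\0"
    header = struct.pack("<2sHI", b"II", 42, 8)  # first IFD at offset 8
    if len(value) <= 4:
        field, tail = value.ljust(4, b"\0"), b""
    else:
        field, tail = struct.pack("<I", 8 + 2 + 12 + 4), value
    entry = struct.pack("<HHI", DOCUMENT_NAME, ASCII, len(value)) + field
    return header + struct.pack("<H", 1) + entry + struct.pack("<I", 0) + tail
```

A viewer could then use the returned string as a key into the catalogue database, for example `catalogue.get(read_document_name(data))` with a hypothetical `catalogue` mapping, which gives the link from image to catalogue record.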

We will digitize about four hundred thousand frames of microfilm within 1996. If the project progresses satisfactorily, we will finish digitizing all materials (about one million frames of microfilm) in 1998.

4. A Digital Study System

The ongoing projects described above are now on track. However, these systems serve center functions, such as storing and supplying data, and do not always serve research support functions, such as analyzing text. After a few years of experiments, we have recognized the need for better and more effective software tools. Having reviewed past information systems, we have begun software development. At present, the following software is under construction or at the draft stage.

4.1 Image Annotation Program

This program allows researchers to attach text annotations to a position and/or area on an image. It is intended to support the transcription process of humanities researchers. If a researcher attaches keywords or codes to images, he or she can access specific images by searching for a specific string in the annotations attached to them. In the same way, a researcher can collect images on a specific subject. Furthermore, images of different materials can be linked; for example, a researcher can compare a specific sentence in an authentic text and its variants if he or she attaches the same keywords or codes to the several materials.

Another use for this program might be image retrieval. As it stores each text together with its associated coordinates on the image, calculating the positional relations among annotations is easy. Thus, if appropriate annotations are attached to images, retrieval such as "images with a mountain in the center and lakes at its foot (below it)" might be possible.
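The kind of positional query described here can be sketched as follows. The `Annotation` record, the normalized coordinates (0.0 to 1.0, with y increasing downward), the keyword strings, and the `tolerance` parameter are all assumptions made for illustration; the program under construction may store annotations quite differently.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    image_id: str
    text: str
    x: float  # horizontal position, 0.0 (left) to 1.0 (right)
    y: float  # vertical position, 0.0 (top) to 1.0 (bottom)


def images_with_mountain_above_lake(annotations, tolerance=0.2):
    """Find images with a 'mountain' near the center and a 'lake' below it."""
    by_image = {}
    for ann in annotations:
        by_image.setdefault(ann.image_id, []).append(ann)
    hits = []
    for image_id, items in by_image.items():
        mountains = [
            a for a in items
            if a.text == "mountain"
            and abs(a.x - 0.5) <= tolerance
            and abs(a.y - 0.5) <= tolerance
        ]
        lakes = [a for a in items if a.text == "lake"]
        # A larger y means lower on the image, so the lake must have larger y.
        if any(lake.y > m.y for m in mountains for lake in lakes):
            hits.append(image_id)
    return sorted(hits)
```

Only the annotation texts and coordinates are compared, so the quality of such retrieval depends entirely on how consistently researchers annotate the images.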

4.2 Version Control Mechanism

As mentioned in 2.3, we are compiling full text data. However, it is impossible for us to transcribe all materials by ourselves. One way to increase the amount of full text swiftly is to gather full text data from the public. The problems pointed out with this approach include quality control and the difficulty of source identification.

One solution to these problems is a version control mechanism: text data that passes through the NIJL data systems must carry a kind of "header" that includes version information such as the original source, the reviser, and a summary of the revision. The version control mechanism constructs a version tree that shows the history of the data's development. By reviewing this history, users can assess the quality of the data.

This mechanism is at the planning stage. We are considering using the TEI header for this mechanism[5].
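Since the mechanism is still being planned, the following is only a sketch of one way a header and version tree could fit together. The field names (`version_id`, `parent_id`, `source`, `reviser`, `summary`) are assumptions chosen to mirror the items listed above; they are not TEI header elements and not an NIJL design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VersionHeader:
    version_id: str
    parent_id: Optional[str]  # None marks the root of the version tree
    source: str               # original source of the text
    reviser: str              # who produced this version
    summary: str              # summary of the revision


def history(headers, version_id):
    """Return the headers from the root down to `version_id`, oldest first."""
    index = {h.version_id: h for h in headers}
    chain, seen = [], set()
    current = version_id
    while current is not None:
        if current in seen:
            raise ValueError("cycle in version headers")
        seen.add(current)
        header = index[current]
        chain.append(header)
        current = header.parent_id
    return list(reversed(chain))


def children(headers, version_id):
    """Return the ids of versions derived directly from `version_id`."""
    return sorted(h.version_id for h in headers if h.parent_id == version_id)
```

Walking `history` for a given text shows every reviser and revision summary between the original source and the current version, which is the information a user would need to judge its quality.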

4.3 Lexical Analyzer

One of the main kinds of study on text is vocabulary analysis, in which attribute information such as inscription and reading is added to each word to organize useful data. There are many convenient text analysis tools for European languages. However, most of these tools cannot be applied to Japanese text. As there are no spaces between words in Japanese sentences, spaces must be inserted between words manually to prepare the text for further analysis. Moreover, Japanese words readily combine to form compound words, which makes it difficult to separate a sentence into words automatically. In addition, as the style of sentences differs from work to work, genre to genre, and period to period, the methods of preparing, managing, and using a vocabulary index differ as well. These factors make it difficult to apply convenient text analysis tools such as TACT to Japanese text.

Thus, a lexical analyzer that divides a sentence into its elements (words) is very important for Japanese text analysis. Recently, several large electronic dictionaries and some software tools for analyzing vocabulary have become available. We are examining these tools in order to construct more useful lexical tools.
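To make the segmentation problem concrete, the sketch below uses greedy longest-match lookup against a dictionary, a simple and widely known baseline rather than the method of any tool we are examining. The toy dictionary and the sample sentence are assumptions for illustration; a practical analyzer would need a large dictionary and a way to resolve the ambiguities that compound words create.

```python
def segment(text, dictionary, max_len=None):
    """Split `text` into words by greedy longest match against `dictionary`.

    Characters not covered by any dictionary entry are emitted one at a time.
    """
    if max_len is None:
        max_len = max(1, max(map(len, dictionary), default=1))
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


# Toy dictionary (hypothetical); a real one would hold many thousands of entries.
TOY_DICTIONARY = {"むかし", "おとこ", "あり", "けり"}
```

For example, `segment("むかしおとこありけり", TOY_DICTIONARY)` splits the unspaced sentence into four dictionary words. Because the longest entry always wins, a compound word in the dictionary hides its parts, which is one reason dictionary design differs by work, genre, and period.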

5. Conclusion

NIJL is reconstructing its databases using SGML to cope with the multimedia age. These reconstructions are now on track. We have also begun software development to support individual research environments.

References

[1] Keiko Kitamura, Hisashi Yasunaga: Data Base Delivery for Japanese Literature by CD-ROM, Joint International Conference ALLC/ACH Conference Abstract, pp. 261-265, 1991.

[2] Shoichiro Hara, Hisashi Yasunaga: On the Text Based Database Systems for Public Service, Joint International Conference ALLC/ACH Conference Abstract, pp. 43-45, 1995.

[3] Hisashi Yasunaga: Data Description Rule and Full-text Database for Japanese Classical Literature, Joint International Conference ALLC/ACH Conference Abstract, pp. 234-239, 1992.

[4] Shoichiro Hara, Hisashi Yasunaga: SGML Markup of Japanese Classical Text -A Case Study-, Joint International Conference ALLC/ACH Conference Abstract, pp. 131-134, 1996.

[5] C. M. Sperberg-McQueen, Lou Burnard: Guidelines for Electronic Text Encoding and Interchange (TEI P3), 1994.