[Archive copy mirrored from the URL: http://www.epas.utoronto.ca:8080/cch/hara.html; see this canonical version of the document.]

[Occasional Lecture Series]

Shoichiro HARA and Hisashi YASUNAGA (National Institute of Japanese Literature), "On the Text Based Database Systems for Public Service"

Thursday, 16 March 1995, 4 p.m.
Upper Library
Massey College

1. Introduction

The National Institute of Japanese Literature (NIJL) has been designing, building, managing and maintaining a computer system to provide information on Japanese literature to researchers both in Japan and abroad. The NIJL's computer system consists of a mainframe computer and an inter-university network, and provides information from three databases: the Catalogue of Holding Microfilms for Manuscripts and Printed Books on Classical Japanese Literature, the Catalogue of Holding Manuscripts and Printed Books on Classical Japanese Literature, and the Bibliography of Research Papers on Japanese Literature. A feature of the NIJL's computer system is that all operations, from creating and correcting data through providing the database service to creating block copy (camera-ready) manuscripts, are done on the mainframe. Integrated, end-to-end processing of data by computer seems natural today, but considering that the basic design of this system was carried out 10 years ago, it was a fairly ambitious system. After 10 years, however, text data processing and data exchange on the mainframe now seem somewhat inefficient, and the complexity of managing and maintaining the system has become apparent.

To solve these problems, the NIJL will downsize and distribute its computer systems over the next several years. One of the key concepts in distributed computer systems is "standardization of data". The present catalogue databases, the full text database services planned for the near future, and the Catalogue of Holding Historical Materials, which is also planned for service, are all basically text data. Catalogue data is a typical example of a formal record structure, but even so there are problems in constructing the database, such as fields that repeat many times. Moreover, the NIJL's catalogue records include variable length fields containing fairly long sentences, such as "summaries". In a full text database, every text can be considered a variable length field, and the texts contain complicated structures such as repeating groups and nested substructures.

To accumulate a variety of text data in a computer and provide adequate service in a small organization, some kind of standardization is necessary for the efficient use of resources. This paper describes how the NIJL is constructing a new database system with SGML as the basis for standardization.

2. SGML as a Text Structure Descriptive Language

There are several languages and standards for describing text structures, including SGML (Standard Generalized Markup Language), TeX, PostScript and ODA (Open Document Architecture). However, SGML is the only one that can describe the logical structure of text, is established as both an ISO standard and a JIS (Japanese Industrial Standard), and has many applications developed for it. Meanwhile, at the NIJL, a data description standard called "the KOKIN (KOKubungaku INformation) rule" has been developed for Japanese literature text, based on the same concept as SGML. At present, a system based on the KOKIN rule and SGML is being prototyped in the Computer Center. The following discussion describes an attempt to perform all operations, from text data exchange through text data search to block copy manuscript creation, consistently using text processing tools that conform to SGML.

Both catalogue data and full text data can essentially be considered as complicated structured texts with variable length fields. SGML can describe the complicated logical structures that text possesses, such as repeating groups, nesting, order of appearance and number of appearances. Recently, means of creating SGML data have become available. For example, SGML data creation tools such as SGML editors are now on sale. There are also several indirect methods of creating SGML data. One method, an example of which is given in Section 3, is to attach simple tags to the text in a word processor, then convert it to fully marked-up SGML data using a conversion tool. Another is to create the text with a spreadsheet or style sheet, then attach simple tags to it and convert it to fully marked-up SGML data using a conversion tool.

In contrast to text data creation, data searching involves many problems. A relational database comes with well-defined search expressions and standard tools such as SQL, based on an elegant mathematical model, but these cannot be used unless fairly strict restrictions are placed on the data structures. An object-oriented database is suitable for complicated data structures, but has neither a standard search logic nor a standard tool such as SQL. However, if a data search is regarded as a "search for a character string in text data", it is possible to construct a database system that uses a character string search device. In research on Japanese literature, searches that focus on character strings of interest are more common than searches based on the relative magnitudes of numbers. Consequently, we believe that a system based on a character string search device can be effective. Meanwhile, fairly fast character string search devices and software are now being developed and sold, and these products are capable of handling the logical structure of text. Thus, given a high speed character string search system that can handle SGML data, it is possible to construct a database system based on it.

Based on the concepts described above, we have performed an experiment involving the creation of text data and servicing multiple data sets of different natures with a single tool. Specifically, we have:

  1. Created DTDs (Document Type Definitions) for use with SGML for two types of existing data set (the Catalog of Holding Microfilms for Manuscripts and Printed Books on Classical Japanese Literature, and the Anthology of Classical Japanese Poems).
  2. Converted the above text data to SGML data with an SGML conversion tool.
  3. Prototyped a database system using a character string search tool based on SGML.
  4. Converted the SGML data to a TeX file with a conversion tool based on SGML, and output a block copy manuscript.

The tools used in this experiment were MARK-IT for data conversion and structure analysis, and OPEN-TEXT as the character string search engine.
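
The fourth step, converting SGML data to a TeX file, can be sketched roughly as follows. This is a minimal illustration in Python and not MARK-IT itself; the element names (title, author, verse) and the LaTeX templates are assumptions made for the example, since the actual DTDs are not reproduced here. It handles only a flat sequence of elements.

```python
import re

# Hypothetical mapping from SGML element names to LaTeX templates.
# Both the element names and the templates are illustrative assumptions.
LATEX_TEMPLATES = {
    "title": "\\section*{{{0}}}",
    "author": "\\textit{{{0}}}",
    "verse": "\\begin{{quote}}{0}\\end{{quote}}",
}

# Matches one non-nested element: <name>content</name>
ELEMENT = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def escape_latex(text):
    """Backslash-escape a few characters that are special in LaTeX."""
    return re.sub(r"([&%$#_])", r"\\\1", text)


def sgml_to_latex(sgml):
    """Convert a flat sequence of SGML elements to LaTeX source lines.

    Elements without a template are skipped.
    """
    lines = []
    for name, content in ELEMENT.findall(sgml):
        template = LATEX_TEMPLATES.get(name)
        if template is not None:
            lines.append(template.format(escape_latex(content.strip())))
    return "\n".join(lines)


if __name__ == "__main__":
    print(sgml_to_latex("<title>Poems & Notes</title><author>A. Writer</author>"))
```

A real converter would be driven by the DTD and would handle nested elements; the table of templates here only stands in for that mapping.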

3. System Construction

3.1 The Anthology of Classical Japanese Poems

The original text data of the Anthology of Classical Japanese Poems were created by Japanese literature scholars at the NIJL. In addition to the text data, they developed a search system that runs on a personal computer. This experiment covered the conversion and servicing of this text data, as a case of servicing individual data sets created by researchers.

Texts for the Anthology of Classical Japanese Poems carry simple tags that start with the Japanese yen mark, and these tags indicate all text elements. The structure of the text data is relatively simple, but many fields are optional or may appear at unspecified positions. The text data conversion tool (MARK-IT) automatically marks up the text according to the DTD. Specifically, a yen mark followed by one alphabetic letter is regarded as a simple tag and converted to fully marked-up SGML data. This SGML data is used to construct a character string search database using OPEN-TEXT. Then MARK-IT is used again to convert the SGML data to a LaTeX file. The LaTeX file is output on a laser printer as a block copy manuscript using a LaTeX tool on Windows.
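
The expansion of simple tags into full markup can be illustrated with a short Python sketch. It is not MARK-IT's actual procedure: the letter-to-element mapping (A for author, P for poem, T for title) and the root element name are hypothetical, and the yen mark is assumed to be the character U+00A5, which may be represented differently in the original data.

```python
import re

# Hypothetical mapping from one-letter simple tags to SGML element names.
TAG_NAMES = {"T": "title", "A": "author", "P": "poem"}

# A simple tag: the yen mark (assumed to be U+00A5) followed by one letter.
SIMPLE_TAG = re.compile("\u00a5([A-Za-z])")


def expand_simple_tags(record, root="record"):
    """Convert one record with simple tags to fully marked-up SGML."""
    # With one capture group, split() returns
    # [text before first tag, letter, content, letter, content, ...].
    parts = SIMPLE_TAG.split(record)
    out = ["<" + root + ">"]
    for letter, content in zip(parts[1::2], parts[2::2]):
        name = TAG_NAMES.get(letter)
        if name is None:
            raise ValueError("unknown simple tag: " + letter)
        out.append("<{0}>{1}</{0}>".format(name, content.strip()))
    out.append("</" + root + ">")
    return "".join(out)


if __name__ == "__main__":
    print(expand_simple_tags("\u00a5AWriter A\u00a5Pspring mist rises"))
```

Because each simple tag is closed where the next one begins, fields can be optional and can appear in any order, which matches the loose field structure described above. Checking order and repetition against the DTD is left out of this sketch.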

3.2 The Catalog of Holding Microfilms for Manuscripts and Printed Books

This is a public database, but from the standpoint of system configuration it is reaching the limit of maintainability. Therefore, the entire operation from data creation to service is being reviewed, and the possibility of converting from a mainframe-based system to a distributed system based on workstations is being investigated. The present experiment is a first step toward this goal.

The Catalog of Holding Microfilms for Manuscripts and Printed Books has a relatively simple record structure. Its elements are also indicated by simple tags that start with the Japanese yen mark, but in addition there are fields delimited by separators such as ";" and ":", fixed length continuous fields, and repeating fields. Since it is difficult to handle these untagged elements with MARK-IT, simple tags are inserted in a preprocessing step to create intermediate files. MARK-IT then processes these intermediate files and converts them to fully marked-up SGML data. This SGML data is used to create a character string search database with OPEN-TEXT and, as with the Anthology, is converted to a LaTeX file. The LaTeX file is output on a laser printer as a block copy manuscript using the LaTeX tool on Windows.
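
The preprocessing step can be pictured with the following Python sketch, which inserts simple tags in front of separator-delimited and fixed length fields. The tag letters, field layouts and sample values are all made up for illustration; the actual catalogue layout is not given here, and the yen mark is again assumed to be U+00A5.

```python
import re

YEN = "\u00a5"  # assumed representation of the yen mark


def tag_separated_fields(text, letters):
    """Prefix each ';'- or ':'-separated field with a simple tag.

    `letters` gives one hypothetical tag letter per field, in order.
    """
    fields = re.split(r"[;:]", text)
    if len(fields) != len(letters):
        raise ValueError("expected %d fields, got %d" % (len(letters), len(fields)))
    return "".join(YEN + letter + field.strip()
                   for letter, field in zip(letters, fields))


def tag_fixed_fields(text, layout):
    """Prefix each fixed length field with a simple tag.

    `layout` is a list of (tag letter, width) pairs, in order.
    """
    out = []
    pos = 0
    for letter, width in layout:
        out.append(YEN + letter + text[pos:pos + width].strip())
        pos += width
    return "".join(out)


if __name__ == "__main__":
    print(tag_separated_fields("Sample title;2 vols:reel 12", "TVR"))
    print(tag_fixed_fields("0012AB  ", [("N", 4), ("C", 4)]))
```

The output of such a step is an intermediate record in which every element carries a simple tag, so the same tag-expansion pass can be applied to both data sets.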

3.3 Character String Search

A character string search with OPEN-TEXT is performed by specifying a logical element defined in the DTD. For example, in the Anthology of Classical Japanese Poems, a search takes the form "find the writers of poems that contain a specified character string". Likewise, in the Catalog of Holding Microfilms for Manuscripts and Printed Books on Classical Japanese Literature, if a key field is taken as the target element, a character string search can be conducted in exactly the same way as ordinary information retrieval.
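
A search scoped to a logical element can be sketched in Python as below. This does not use OPEN-TEXT or its query language; it is a plain linear scan over marked-up records, and the element names (record, author, poem) are assumptions carried over for illustration only.

```python
import re

# One record per <record> element (hypothetical element name).
RECORD = re.compile(r"<record>(.*?)</record>", re.DOTALL)


def element_text(record, name):
    """Return the content of the first <name> element in a record, or None."""
    tag = re.escape(name)
    match = re.search("<" + tag + ">(.*?)</" + tag + ">", record, re.DOTALL)
    return match.group(1) if match else None


def authors_of_poems_containing(sgml, needle):
    """List the authors of records whose poem element contains `needle`."""
    authors = []
    for record in RECORD.findall(sgml):
        poem = element_text(record, "poem")
        if poem is not None and needle in poem:
            authors.append(element_text(record, "author"))
    return authors


if __name__ == "__main__":
    data = ("<record><author>Writer A</author><poem>spring mist rises</poem></record>"
            "<record><author>Writer B</author><poem>autumn wind</poem></record>")
    print(authors_of_poems_containing(data, "mist"))
```

The point of the sketch is that the string match is restricted to one element (poem) while the result is taken from another (author). A dedicated search engine would answer the same kind of query from an index rather than by scanning every record.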

In the present experiment, two databases of different natures were organized within the same system and searched. We confirmed that both types of database can be operated adequately within the same framework, so the initial objective was achieved. However, because the ready-made search tool provided with OPEN-TEXT was used as the GUI (Graphical User Interface), some operations proved difficult. Because the search tool is independent of the OPEN-TEXT search engine, and the commands exchanged between the search engine and the search tool follow a kind of protocol, a new interface application can nevertheless be developed by users or by the information supply division. Consequently, it is easy to build a client-server information system of this kind.

4. Summary

A catalogue database system and a full text database system have been prototyped, based on a description of the text data and a character string search system that both conform to SGML. As a result, it has become clear that two database systems of different natures can be built on a common application.

In the future, we plan to determine whether the performance of this system will withstand actual use and, through that experience, to design a system for practical use. In addition, we are considering building a multimedia database in which text and images are linked.

Shoichiro Hara
National Institute of Japanese Literature
1-16-10, Yutaka-cho, Shinagawa-ku, Tokyo 142, Japan
hara@nijl.ac.jp
