[Mirrored from: http://sherlock.berkeley.edu/asis_paper/subsectionstar3_2_1.html]




Cheshire II and SGML

One of the problems faced in developing the Cheshire II system was creating a retrieval mechanism that could be applied with equal facility to highly structured text, such as MARC records, and to much less structured text, such as the full text of journal articles. One of the keys to solving this problem was the use of Standard Generalized Markup Language (SGML)[10] as the fundamental data storage type for Cheshire II. All text records within the Cheshire II system are stored as tagged SGML text (an example of a MARC record in SGML format is linked from the online version of this paper).

The use of SGML tagging for all of the textual data stored in Cheshire II has several benefits. Foremost among these is a common storage format for a variety of data. For the developer, this means that basic programming for file access and indexing need only be written to handle one format, rather than requiring separate routines to handle storage and indexing of MARC records, full text, and other data types. Additionally, the structure that SGML imposes on what is generally regarded as unstructured text provides a convenient mechanism for extracting portions of text to generate indexes for searching. Once full-text articles have been tagged, for example, it is a trivial matter to give users the ability to search for text within citations, footnotes, or captions, in addition to more standard indexes such as author and title.
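
The index extraction described above can be illustrated with a minimal Python sketch. It is not the Cheshire II implementation: the record, the element names (article, title, footnote, caption), and the functions extract_fields and build_indexes are all assumptions made for illustration, and a simple regular-expression tokenizer stands in for a real SGML parser driven by a DTD.

```python
import re
from collections import defaultdict

# Hypothetical tagged record; element names are illustrative, not a Cheshire II DTD.
RECORD = """<article id="a1">
<title>Probabilistic retrieval in online catalogs</title>
<author>Jane Doe</author>
<body>Ranked retrieval helps users. <footnote>See the catalog survey.</footnote></body>
<caption>Figure 1: retrieval precision</caption>
</article>"""

# Matches either a start/end tag or a run of character data.
TOKEN = re.compile(r"<(/?)([A-Za-z][\w.-]*)[^>]*>|([^<]+)")


def extract_fields(text, wanted):
    """Return {element name: text found inside it} for the wanted elements."""
    wanted = set(wanted)
    fields = defaultdict(list)
    stack = []  # names of the currently open elements
    for match in TOKEN.finditer(text):
        closing, name, data = match.groups()
        if data is not None:
            # Character data belongs to every open element that is wanted.
            for tag in set(stack) & wanted:
                fields[tag].append(data)
        elif closing:
            name = name.lower()
            if name in stack:
                # Close the element and anything left open inside it.
                while stack.pop() != name:
                    pass
        else:
            stack.append(name.lower())
    # Normalize whitespace in the collected text.
    return {tag: " ".join(" ".join(parts).split()) for tag, parts in fields.items()}


def build_indexes(records, wanted):
    """Build one inverted index per element: element -> term -> record ids."""
    indexes = {tag: defaultdict(set) for tag in wanted}
    for rec_id, text in records.items():
        for tag, content in extract_fields(text, wanted).items():
            for term in re.findall(r"[a-z0-9]+", content.lower()):
                indexes[tag][term].add(rec_id)
    return indexes


indexes = build_indexes({"a1": RECORD}, {"title", "footnote", "caption"})
```

Because every record type passes through the same tag-driven routine, adding a footnote or caption index in this sketch only means adding an element name to the wanted set; no format-specific code is needed.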

The flexibility that SGML affords in generating search indexes has been significantly augmented by making SGML an integral part of the Cheshire II system itself. Configuration details for the Cheshire II system, such as the location of data files, the relevant SGML Document Type Definitions to apply in reading those files, the number and type of indexes to generate on the files, and the indexing mechanisms to employ in generating them, are all specified in a single SGML document (an example of a configuration file is linked from the online version of this paper). Placing all relevant information regarding data storage and indexing in one easily modified file, combined with the ability to provide probabilistic search and retrieval on any SGML data, means that the Cheshire II system can be changed from serving as an advanced catalog system to, say, a retrieval system for a researcher's field notes by making a few alterations to the basic configuration file.
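
As a rough illustration of how a single tagged configuration document can drive the system, the following Python sketch reads a made-up configuration into a small data structure. The element names (config, datafile, dtd, index), the attribute names, the file paths, and the helpers element_text and load_config are all hypothetical; they are not the actual Cheshire II configuration vocabulary, which is not reproduced in this section.

```python
import re
from dataclasses import dataclass, field

# Hypothetical configuration; every element name and path is invented for illustration.
CONFIG = """<config>
<datafile>/data/fieldnotes.sgml</datafile>
<dtd>/data/fieldnotes.dtd</dtd>
<index name="site" method="keyword">note/site</index>
<index name="text" method="probabilistic">note/body</index>
</config>"""


@dataclass
class IndexSpec:
    name: str    # label for the index
    method: str  # indexing mechanism to apply
    path: str    # element whose text feeds the index


@dataclass
class Collection:
    datafile: str
    dtd: str
    indexes: list = field(default_factory=list)


INDEX = re.compile(r'<index\s+name="([^"]+)"\s+method="([^"]+)">(.*?)</index>', re.S)


def element_text(doc, tag):
    """Return the text of the first <tag>...</tag> element, or raise if absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", doc, re.S)
    if match is None:
        raise ValueError(f"missing <{tag}> element")
    return match.group(1).strip()


def load_config(doc):
    """Collect data file, DTD, and index definitions from one configuration document."""
    indexes = [IndexSpec(n, m, p.strip()) for n, m, p in INDEX.findall(doc)]
    return Collection(element_text(doc, "datafile"), element_text(doc, "dtd"), indexes)


collection = load_config(CONFIG)
```

In this sketch, repurposing the system for a different collection amounts to editing the configuration text (a different data file, DTD, and set of index elements); the loading code itself does not change.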

There are further possibilities for the creative use of SGML within the Cheshire II system that remain to be explored. Use of the SGML structure for browsing within a single document is an obvious extension to the graphical user interface. There are also benefits to be achieved by using SGML's capacity to handle a variety of character sets to enhance presentation of foreign-language records to the user. It might also be possible to incorporate additional information within SGML records to enhance users' ability to perform somewhat less directed searching than is provided by Boolean and probabilistic search capabilities (discussed below). For example, if a retrieved document cited another document present within the database, a citation tag containing identifying information for the second document could enable hypertext linking of documents for citation-chain searching. These and other enhancements may be pursued in future versions of Cheshire.
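
The citation-chain idea can be sketched in a few lines of Python. This is only an illustration of the proposal, not an existing Cheshire feature: the cite element, its ref attribute, the record identifiers, and the function citation_chain are all assumptions.

```python
import re
from collections import deque

# Hypothetical records; the <cite ref="..."> convention is assumed for illustration.
RECORDS = {
    "r1": '<article><title>A</title><cite ref="r2">Doe 1990</cite>'
          '<cite ref="r3">Roe 1991</cite></article>',
    "r2": '<article><title>B</title><cite ref="r3">Roe 1991</cite></article>',
    "r3": '<article><title>C</title><cite ref="x9">Not held locally</cite></article>',
}

CITE = re.compile(r'<cite\s+ref="([^"]+)"')


def citation_chain(start, records):
    """Follow citation links breadth-first; return reachable record ids in visit order."""
    seen = [start]
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for ref in CITE.findall(records[current]):
            # Only follow citations that resolve to a record in the database.
            if ref in records and ref not in seen:
                seen.append(ref)
                queue.append(ref)
    return seen[1:]  # exclude the starting record itself


chain = citation_chain("r1", RECORDS)
```

Citations whose identifiers do not resolve to a record in the database (x9 above) are skipped, which mirrors the condition in the text that the cited document must be present within the database.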





Contact: Ray R. Larson
ray@sherlock.berkeley.edu