The Structured Information Manager (SIM) is designed to manage multi-gigabyte collections of documents containing text, images and other kinds of data. Unlike a conventional database, SIM allows a very flexible definition of the possible structure of the records in a table. A grammar is used to describe the structure of the record, so that individual records may vary widely in both size and structure. This makes SIM ideal for storing documents such as bibliographic records, company reports, newspaper articles, office memoranda and legal statutes.
SIM provides a large number of concurrent users with full-text access to the entire document collection. SIM supports emerging open systems standards, thus enabling it to provide networked solutions for organisations with large scale document management needs.
SIM is designed for on-line deletion, insertion and update of documents in the database. This means that SIM can offer 24 hour availability, without any need for down time to re-process or re-index data.
SIM supports the storage of several different kinds of data, including SGML text, binary data, MARC records, and ASCII text. A single record in a SIM database can contain many fields of different types of data, so that a multimedia document including SGML text, several images and sound can easily be stored in SIM.
SIM database schemas have exceptional flexibility and descriptive power, as a grammar is used to describe the structure of the records. Conventional database systems support tables of objects of uniform structure; for example, a relational database schema describes relations that each have a fixed number of attributes. However, in collections of objects containing text and image data, objects of the same kind will often differ in structure. Use of a grammar allows many such objects of differing structure to be grouped together into a single database. For example, a collection of office memoranda will include documents ranging from notes and circulars to letters and even substantial reports.
To cater for these differences in structure, document interchange standards such as SGML (Standard Generalized Markup Language) can specify the structure of documents using a grammar. Thus the use of grammars allows SIM to support documents of highly variant structure which cannot readily be supported by conventional database systems.
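The idea of describing records with a grammar can be illustrated with a minimal Python sketch. This is not SIM's schema language: the grammar format, the element names (memo, to, from, subject, para, attachment) and the conforms function are all hypothetical, chosen only to show how one grammar can admit both a short note and a longer report.

```python
# Hypothetical grammar for a "memo" record; names are illustrative only.
# Each rule lists the child elements in order, with an occurrence indicator:
#   "1" exactly once, "?" optional, "*" zero or more, "+" one or more.
MEMO_GRAMMAR = {
    "memo": [("to", "+"), ("from", "1"), ("subject", "?"),
             ("para", "+"), ("attachment", "*")],
}


def conforms(record, grammar):
    """Return True if a (name, children) record matches the grammar."""
    name, children = record
    if name not in grammar:
        # Elements without a rule are treated as leaves holding raw data.
        return isinstance(children, (str, bytes))
    if not isinstance(children, list):
        return False
    pos = 0
    for child_name, occurrence in grammar[name]:
        count = 0
        # Consume every consecutive child that carries the expected name.
        while pos < len(children) and children[pos][0] == child_name:
            if not conforms(children[pos], grammar):
                return False
            pos += 1
            count += 1
        if occurrence == "1" and count != 1:
            return False
        if occurrence == "?" and count > 1:
            return False
        if occurrence == "+" and count < 1:
            return False
    # Any children left over were not allowed at this point in the rule.
    return pos == len(children)


note = ("memo", [("to", "Pat"), ("from", "Lee"), ("para", "Lunch at noon.")])
report = ("memo", [("to", "Pat"), ("to", "Sam"), ("from", "Lee"),
                   ("subject", "Quarterly figures"),
                   ("para", "Summary."), ("para", "Details."),
                   ("attachment", b"\x89PNG")])
print(conforms(note, MEMO_GRAMMAR), conforms(report, MEMO_GRAMMAR))
```

Both records are accepted by the same grammar although they differ in size and in which fields are present, which is the property that a fixed-attribute relation cannot offer.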
SIM supports a client-server model of processing. Front-end applications interact with SIM via an open systems standard protocol. Using this standard, the front-end can run on the same machine as the SIM DBMS or it can run on a remote system across a wide area network. Back-end SIM DBMS processes can also communicate with each other across the network, allowing distribution of data across several servers.
SIM supports a wide range of user interface platforms, including command line interfaces, forms-based interfaces on character TTY terminals, and graphical user interfaces (GUIs) on X-Window platforms and MS Windows based PCs (even across a serial line).
SIM has a comprehensive range of text searching facilities. Wildcard text searching and stemming (allowing a search to be conducted using the root of the word, so that "compute" would match computer and computing) are supported.
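The two facilities can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the suffix list is a deliberately naive stand-in for a real stemming algorithm, the wildcard syntax is assumed to be shell-style ("*" and "?"), and the function names are hypothetical. None of it is taken from SIM itself.

```python
from fnmatch import fnmatchcase

# Naive, illustrative suffix list (longest first); a real stemmer is
# considerably more careful than this.
SUFFIXES = ("ations", "ation", "ing", "ers", "er", "ed", "es", "e", "s")


def stem(word):
    """Strip one common suffix, keeping at least three letters of root."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def stem_search(query, words):
    """Return the words that share a root with the query word."""
    root = stem(query)
    return [w for w in words if stem(w) == root]


def wildcard_search(pattern, words):
    """Return the words matching a shell-style wildcard pattern."""
    return [w for w in words if fnmatchcase(w, pattern)]


vocabulary = ["computer", "computing", "compute", "commute", "the"]
print(stem_search("compute", vocabulary))
print(wildcard_search("com?ute", vocabulary))
```

With this vocabulary the stemmed search for "compute" returns computer, computing and compute, while the wildcard pattern "com?ute" returns compute and commute.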
SIM also allows the user to index *every* word in the database. Even queries that contain words such as "the" or "and" can be evaluated, as SIM has no "stop" words. (Many text databases do not index very common words such as "the" because of the enormous overhead this involves when using traditional inverted file indexes. Such words are then called "stop" words, as they may stop the search.)
SIM provides rapid access to multi-gigabyte collections of data by using advanced indexing technology, based on compressed inverted files. SIM indexes typically occupy only 50% of the size of the indexed data, and that figure includes "stop" words and all the information necessary for full positional queries.
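The following Python sketch shows, in general terms, what a positional inverted file with compressed postings looks like. It is not SIM's index format, which is not described here: gap encoding with a variable-byte code is used only as one well-known compression technique, and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import accumulate


def build_index(docs):
    """Map each word to {doc_number: [positions]}; no word is skipped."""
    index = defaultdict(dict)
    for doc_no, text in enumerate(docs):
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_no, []).append(pos)
    return index


def vbyte_encode(numbers):
    """Variable-byte code: 7 data bits per byte, high bit marks the last byte."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers


def compress_postings(doc_numbers):
    """Store sorted document numbers as gaps, which are small and code compactly."""
    gaps = [b - a for a, b in zip([0] + doc_numbers, doc_numbers)]
    return vbyte_encode(gaps)


def decompress_postings(data):
    """Rebuild the document numbers by summing the decoded gaps."""
    return list(accumulate(vbyte_decode(data)))


def phrase_search(index, phrase):
    """Return documents in which the words of the phrase occur consecutively."""
    words = phrase.lower().split()
    if not words:
        return []
    hits = []
    for doc_no, positions in index.get(words[0], {}).items():
        for p in positions:
            if all(p + i in index.get(w, {}).get(doc_no, [])
                   for i, w in enumerate(words)):
                hits.append(doc_no)
                break
    return hits


docs = ["to be or not to be", "the cat and the hat", "not to be outdone"]
index = build_index(docs)
print(phrase_search(index, "to be"))
print(len(compress_postings([3, 7, 1000, 1001])), "bytes for four postings")
```

Because every word is indexed together with its positions, a phrase made up entirely of very common words, such as "to be", can still be evaluated. The gap-coded list of four document numbers occupies five bytes in this sketch.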
SIM provides ranking to present the results of a search to the user in the order that is most likely to be relevant. This relevance judgement is based on information such as the frequency of query terms within individual documents and throughout the whole collection. Ranking, especially in conjunction with stemming, can greatly enhance the user's ability to extract useful information from large amounts of text.
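As a rough illustration of ranking by term frequency, the Python sketch below scores each document with a common TF-IDF style weighting: a term counts for more the more often it appears in a document and the fewer documents in the collection contain it. The exact formula SIM uses is not given here, so the weighting and the rank function are assumptions made for the example.

```python
import math
from collections import Counter


def rank(query, docs):
    """Return document numbers ordered by decreasing TF-IDF style score."""
    counts_per_doc = [Counter(d.lower().split()) for d in docs]
    n_docs = len(docs)
    terms = query.lower().split()
    scores = []
    for doc_no, counts in enumerate(counts_per_doc):
        score = 0.0
        for term in terms:
            # Number of documents in the whole collection containing the term.
            df = sum(1 for c in counts_per_doc if term in c)
            if df:
                # Frequency in this document, weighted by rarity in the collection.
                score += counts[term] * math.log(1 + n_docs / df)
        scores.append((score, doc_no))
    # Highest score first; ties broken by document number.
    ordered = sorted(scores, key=lambda s: (-s[0], s[1]))
    return [doc_no for score, doc_no in ordered if score > 0]


docs = ["text retrieval and text compression", "image compression", "office memo"]
print(rank("text compression", docs))
```

Here the first document ranks ahead of the second because it contains both query terms and mentions "text", the rarer term, twice. The third document contains neither term and is left out of the result.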
SIM allows the user to specify that a search is to be carried out over several databases at one time. All the documents that satisfy the search will appear together in one result set, even if they vary in size and structure, and can be manipulated like any other result set.
SIM makes possible the easy integration of text, images and sound within a single database. SIM's data structures allow an intuitive representation of the complex relationships found in such databases, while its variable length fields and advanced indexing efficiently support the large amounts of data typically involved.
SIM has comprehensive billing and security features. Users of the SIM database can be charged by connect time, by information viewed, by resource utilisation, or by a combination of these factors. The security provisions have multiple levels, and access to tables and to individual fields in tables can be restricted to specified users or groups of users.
SIM is an ideal server database for collections of World Wide Web documents. Its support of ranking allows quick access over enormous collections, and its support of SGML allows easy storage and management of World Wide Web (HTML) documents.
SIM allows the database administrator to create new word parsers, which means that the database administrator can decide for each database the exact definition of what should constitute a word. This enables SIM databases to be configured to allow users to search for words like "C++", or "Z39.50".
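The effect of a configurable word parser can be sketched in Python with regular expressions. This is only an illustration of the idea; the patterns, the names DEFAULT_WORD and TECHNICAL_WORD, and the parse_words function are hypothetical and do not describe how word parsers are actually written for SIM.

```python
import re

# A conventional definition of a word: a run of letters and digits.
DEFAULT_WORD = re.compile(r"[A-Za-z0-9]+")

# An alternative definition that also keeps an embedded ".digits" part
# and any trailing "+" signs, so that "Z39.50" and "C++" stay whole.
TECHNICAL_WORD = re.compile(r"[A-Za-z0-9]+(?:\.[0-9]+)*\+*")


def parse_words(text, word_pattern=DEFAULT_WORD):
    """Split text into lower-cased words using the given word definition."""
    return [w.lower() for w in word_pattern.findall(text)]


sentence = "Z39.50 and C++."
print(parse_words(sentence))
print(parse_words(sentence, TECHNICAL_WORD))
```

Under the default definition the sentence yields z39, 50, and, c, so a search for "C++" could not be distinguished from a search for "C". Under the alternative definition it yields z39.50, and, c++.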
Central to the design of the SIM DBMS is the ANSI Z39.50 standard - "Information and Documentation - Search and Retrieve Application Protocol Specification for Open System Interconnection." (This standard is closely related to the ISO standard ISO 10162/10163, and these two standards are continuing to move closer together.)
This standard specifies the protocol that an application should use to query the SIM database kernel. By conforming with this standard, the SIM system can be distributed over a network, with a true client-server separation between its front and back ends.
SIM was built with recent international (ISO) document and networking standards in order to support gateways to other database systems conforming to these standards.
ISO 8879 "Standard Generalized Markup Language" (SGML) is another standard that is integral to the SIM architecture. SGML is a widely used standard for document interchange. By directly supporting the import and export of documents marked up in SGML, SIM automatically has access to a large range of text processing tools such as text editors, and translators that can convert text between a variety of word processing standards. Storing text in SGML format also means that the SIM retrieval engine has access to the structure of the documents as well as their contents. This structural information can be used to improve the processing of user queries and to determine the best ways to display a document.
The SIM database query language is a full implementation of the ISO 8777 standard "Commands for Interactive Text Searching". This standard query language is sometimes known as CCL, the Common Command Language, and is very similar to the ANSI Z39.58 query language standard. This gives the user a comprehensive range of familiar text retrieval functions.
SIM is committed to the open systems philosophy. SIM uses OSI network services to carry out the communication of data from the user to the database. This means that SIM is compliant with the Australian Government's GOSIP standards, and that a large variety of networking hardware can be supported. SIM is written in the C language under the UNIX operating system, and is therefore highly portable.
The underlying technology for the SIM database has been developed at The University of Melbourne and RMIT (Royal Melbourne Institute of Technology) over many years. Research and development is now continuing at CITRI - the Collaborative Information Technology Research Institute, a joint venture between the two universities. CITRI's Hypermedia group is actively involved in the international information retrieval community, and is conducting leading-edge research in many related areas such as hypertext systems, multimedia systems, electronic libraries and text and image compression. All this experience is now being used to build a document retrieval database that can meet the needs of a wide community of information providers and users.
The SIM system is being built in cooperation with Ferntree Computer Corporation. Ferntree is one of Australia's leading computer services companies, with a long history as an information provider with the AUSINET information service. Ferntree's experience in providing a large text database service has been vital in ensuring that SIM meets the needs of existing information service users, and in determining the requirements that must be met for the next generation of text retrieval systems.
If you have a Z39.50V2 client, you might like to try connecting to our server. See here for more details.
723 Swanston St, Carlton 3053, AUSTRALIA
Tel: +61 3 9282 2400, Fax: +61 3 9282 2490
Last Updated: 1/9/93