Edward A. Fox, N. Dwight Barnette
Department of Computer Science, Virginia Tech, Blacksburg VA 24061-0106
This presentation describes the architecture, design, development and administration for our system. In addition to using Mosaic and the NCSA WWW server, we employ Graz University's Hyper-G server and its Harmony and Amadeus clients. Further, the Envision system aids authoring by providing searching, results visualization, and browsing of a large SGML archive and multimedia object database, and also utilizes a WWW server loaded with data from ACM and other publishers. Our materials include large bibliographies, collections of page images, SGML versions of journal articles, videos, and other resources.
We describe the tradeoffs between these several servers and clients, how they function in an integrated environment to aid in education, how powerful search capabilities work with hypermedia collections, techniques of automatic and semi-automatic link formation, and other issues.
Educational institutions must become more efficient, and also must improve the quality of the learning experience. These two objectives seem at odds, but new technology has the potential of helping us meet both simultaneously. In particular, we argue that:
Interest in digital libraries has grown rapidly throughout the 1990s. The National Science Foundation in particular has sponsored a number of workshops and encouraged research in this field. We report on a number of activities, supported in part by NSF, to develop a digital library (DL) in Computer Science (CS), mostly available through WWW, and to apply that DL to improve education.
The next section discusses our initial project to develop a DL for CS. Section 3 explains our project to apply that DL to CS education. Section 4 discusses the design and architecture of our client/server complex which serves that goal. Section 5 reports on our current status, and Section 6 outlines our future plans.
Our work on a DL for CS had its origins in efforts to enhance ACM's publication services and to develop new electronic publishing products. In 1988, it became clear that a central database or archive should be developed, and added to as new publications are prepared. This called for new collaborations, application of standards, pilot tests, development of business strategies, and various types of research.
Accordingly, in 1991 we proposed, and received NSF support for, A User-Centered Database from the Computer Science Literature. ACM agreed to provide access to its publications, so we could convert them to a suitable archival form, and prepare software to provide ready access from the desktop.
Our software built upon earlier experience with developing a testbed for the application of artificial intelligence concepts to information retrieval. That led directly to the MARIAN system, developed initially to provide easier access to our campus online public access catalog data. MARIAN can be accessed through Mosaic as well as with gopher, curses and Motif clients. MARIAN is implemented as a distributed server with a special client/server protocol that supports mixed initiative sessions through a NeXTstep interface. It runs on a collection of NeXT systems, but should be portable to other platforms such as our main target, a DEC Alpha running OSF/1. The server implements the vector space retrieval model and will handle relevance feedback later this year. Morphological analysis of words is facilitated by the large lexicon derived from machine-readable dictionaries, especially the Collins English Dictionary.
Our project to build the DL for CS was called Envision since we wanted to encourage users and developers to be bold in thinking of new ways to utilize a DL, and since end-users were particularly eager to have computer assistance for visualizing search results. A user-centered design approach was used to plan and implement the Envision system, and we tried to adopt sensible principles for representation, system architecture, interface construction, and retrieval. The current interface provides many special features to allow users to work with search output, analyzing it from various perspectives in a two-dimensional graphical display.
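To make the vector space retrieval model mentioned above concrete, the following is a minimal sketch, not the MARIAN implementation. It assumes simple whitespace tokenization in place of morphological analysis, and a common TF-IDF weighting with cosine similarity; the function names (tokenize, build_index, search) are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real morphological analysis)."""
    return text.lower().split()


def weight(terms, idf):
    """Build a sparse TF-IDF vector, ignoring terms unknown to the collection."""
    tf = Counter(terms)
    return {term: count * idf[term] for term, count in tf.items() if term in idf}


def build_index(docs):
    """Return (idf, vectors) for a dict mapping document id to text."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for terms in tokenized.values():
        doc_freq.update(set(terms))
    n_docs = len(tokenized)
    # Assumed weighting: smoothed inverse document frequency.
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    vectors = {doc_id: weight(terms, idf) for doc_id, terms in tokenized.items()}
    return idf, vectors


def cosine(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, idf, vectors):
    """Return ids of documents with nonzero similarity, best match first."""
    query_vector = weight(tokenize(query), idf)
    scored = [(cosine(query_vector, vec), doc_id) for doc_id, vec in vectors.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0.0]
```

Relevance feedback, which the text says is planned, would typically adjust the query vector using terms from documents the user marks as relevant; that step is not shown here.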
Of particular importance is our commitment to using SGML and object-oriented database (OODB) methods. According to our investigations, users prefer to think about objects related to their domain of inquiry (e.g., algorithms, animations, source code, pseudo code, theorems, proofs, conferences, research projects, authors). Thus, we have an object in our OODB for each abstract entity (e.g., author, institution, journal), with associated attributes recorded as well as links to related objects. Since each object has a unique identifier, a URL can be associated with it. Therefore, part of our system can be viewed as a server on WWW, since it can return an HTML representation for any object, presenting the attribute/value entries as well as links. This is illustrated in Figure 1. However, for best results, users should work directly with our client to carry out search and graphical display activities in concert with the full Envision server.
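The mapping from object identifier to URL to HTML described above might be sketched as follows. This is an illustration only, not the Envision code: the DLObject class, the BASE_URL value, and the page layout are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field
from html import escape

# Hypothetical URL prefix; the real server address is not given in the text.
BASE_URL = "http://example.org/envision/object/"


@dataclass
class DLObject:
    """An abstract entity (e.g., author, institution, journal) in the object database."""
    oid: str                                         # unique identifier
    kind: str                                        # entity type
    attributes: dict = field(default_factory=dict)   # attribute name -> value
    links: list = field(default_factory=list)        # identifiers of related objects


def object_url(oid):
    """Derive a URL from the unique identifier."""
    return BASE_URL + oid


def render_html(obj):
    """Return an HTML page listing the attribute/value entries and the links."""
    rows = "".join(
        f"<li>{escape(name)}: {escape(str(value))}</li>"
        for name, value in obj.attributes.items()
    )
    links = "".join(
        f'<li><a href="{escape(object_url(target))}">{escape(target)}</a></li>'
        for target in obj.links
    )
    title = f"{escape(obj.kind)} {escape(obj.oid)}"
    return (
        f"<html><head><title>{title}</title></head>"
        f"<body><h1>{title}</h1><ul>{rows}</ul><ul>{links}</ul></body></html>"
    )
```

Because each link is itself an object URL, a WWW client can follow relationships (say, from an author to a journal) without knowing anything about the underlying database.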
Figure 1. Environment of our Computer Science Digital Library
In contrast with objects that represent abstract entities, other objects that have a concrete representation are described with SGML documents. Early on we developed a Document Type Definition (DTD) for bibliographic entries (e.g., in ACM's Guide to the Computing Literature and the HCI bibliography) and one for articles; ACM has incorporated these into a DTD that it plans on using for large-scale capture and conversion projects. Over 100,000 bibliographic records, and a moderate number of full-text articles, have been converted into SGML representations. Multimedia documents will each have an abbreviated SGML record to describe them, with metadata and preview or summary information included, as well as pointers to the relevant file(s).
The Envision system also has an identifier for each SGML entry, so that once again a URL can be associated with each document. Upon request, the SGML form is analyzed, organized, and a suitable HTML rendering is made, either of the preview form or the full-form of a document.
The original form of most of the objects in our system is retained in our overall database. Guillermo Averboch is developing a system called DELTO for automating the analysis of various raw forms of documents, converting them to SGML and at the same time extracting link and object database information to facilitate search and hypertext access. The Envision system has two target user groups: those who will use it to develop courseware and those who will use it directly. In the former case, after searching and selecting desired items, the URL for each can be recorded so that others can be pointed to interesting documents. In the latter case, the flexible search, results visualization, and Mosaic-assisted browsing modes are of particular value.
Given the existence of the Envision system, we proposed in 1993 to enhance the computer science undergraduate learning experience by integrating courses with usable and useful computerized materials in our emerging DL. NSF agreed to fund work at Virginia Tech and at Norfolk State University to demonstrate the cost effectiveness of using DLs as the integrating concept for improving learning through interaction with rich research resources. By applying information retrieval, hypertext, multimedia, and human-computer interaction methods, we plan to add value to the DL.
Our plans called for two new courses: a senior-level three-credit offering entitled ``Human-Computer Interaction'' and a one-credit service course ``Using Computers and Networked Information''. We proposed revising our Database Management course to broaden its scope, but instead decided to offer another new senior-level three-credit course entitled ``Multimedia, Hypertext and Information Access'' starting in Spring 1995. Extensive revision is being made to ``Computer Professionalism'' in Fall 1994. In addition, changes are being made to courses in Data Structures, Algorithm Analysis, Introduction to Computer Science, Parallel and Distributed Computation, Numerical Analysis, and Software Engineering to take advantage of our emerging DL.
In some cases we have also applied nontraditional educational methods. Thus, for the last three years our graduate course on ``Information Storage and Retrieval'' has followed the Personalized System of Instruction originally proposed by Keller, and has used Mosaic for delivery since Fall 1993.
The first year of the project involved further building of the DL, collecting tools, constructing special software for algorithm visualization (the SWAN system being developed by Jun Yang, Clifford Shaffer and Lenwood Heath), and helping faculty develop new courses to be delivered using hypermedia technology. Several courses make use of the KMS system, which we expect will eventually be connected to WWW; right now it is an excellent tool for helping students improve their writing, collaboration, and hypertext authoring skills.
Three courses make extensive use of WWW for classroom presentations and for student interaction with courseware. The next section explains the design and architecture of this environment.
While our early work with the Envision system focused on constructing the DL and supporting search of it, efforts in the last 18 months have broadened to focus also on browsing and hypertext access. Ultimately, we believe that integration of searching and hypertext is imperative at the deepest levels of system development. However, lack of resources prevented the Envision effort from fully developing this concept. Accordingly, we chose an architecture in which the presentation, browsing, and hypertext parts of our system could be loosely coupled with the rest (searching, OODB, SGML, multimedia).
This approach is in keeping with an evolving architectural model for DLs that is based on a careful study of requirements, client/server systems, document modeling, resource management, and layered application development. Thus, our system can be thought of as one that implements the following main modules: OODB and SGML database managers, vector space search engine, visualization tool, and hypertext access document manager.
In its initial prototype, completed in 1992 with NeXTstep and WAIS support, the Envision search system provided a demonstration of the planned interface capabilities. The current version replaces all that, using the MARIAN system for search support and an X/Motif implementation of the search and results visualization interface. This is illustrated in Figure 1. Presentation of documents as well as browsing and hypertext support is provided by Mosaic, but could be handled by other clients.
Thus, the Hyper-G system might be integrated with Envision sometime in the future; in our educational activities we already have made use of the Hyper-G server and clients (Harmony for UNIX and Amadeus for PC/Windows). These have a number of advantages:
Some use is already made of Hyper-G, such as to browse or search the RISKS internet digest for our course on ``Computer Professionalism'', and more use is planned as the client and server software matures.
Of particular concern is the failure of Mosaic in particular, and the WWW model in general, to support caching adequately. As we expand our use of multimedia, and digital video in particular, this will increase in importance. In addition to considering the possibility of using Hyper-G, we also have begun a feasibility study of modifying Mosaic and WWW servers to increase caching. Since we will have hundreds of students accessing WWW for their courses, and our campus internet connections are already quite busy, this matter must be resolved during the next year.
As is illustrated in the top part of Figure 1, we now have three classes with extensive courseware accessible through WWW:
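As an illustration of the kind of caching at issue, here is a minimal sketch of a size-bounded document cache with least-recently-used eviction. It is not a description of Mosaic, Hyper-G, or any planned modification; the DocumentCache class and its fetch callback are hypothetical, and the eviction policy is an assumption chosen for simplicity.

```python
from collections import OrderedDict


class DocumentCache:
    """Keep recently requested documents in memory, up to max_bytes in total."""

    def __init__(self, fetch, max_bytes):
        self._fetch = fetch            # callable: url -> bytes (e.g., a network request)
        self._max_bytes = max_bytes
        self._entries = OrderedDict()  # url -> body, least recently used first
        self._size = 0
        self.hits = 0
        self.misses = 0

    def get(self, url):
        """Return the document body, fetching it only if it is not cached."""
        if url in self._entries:
            self._entries.move_to_end(url)  # mark as most recently used
            self.hits += 1
            return self._entries[url]
        self.misses += 1
        body = self._fetch(url)
        self._store(url, body)
        return body

    def _store(self, url, body):
        """Insert a document, evicting the least recently used ones if needed."""
        if len(body) > self._max_bytes:
            return  # a single object larger than the cache is never stored
        self._entries[url] = body
        self._size += len(body)
        while self._size > self._max_bytes:
            _, evicted = self._entries.popitem(last=False)
            self._size -= len(evicted)
```

With many students requesting the same courseware, a shared cache of this kind placed between clients and the campus network connection would serve repeat requests locally. Large video objects would need a correspondingly large byte budget or a different policy.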
The Envision system is being tested and refined, at the same time that more data is being loaded. The Envision project funding from NSF will run out in February 1995, at which time the software will be turned over to the Interactive Learning project and Virginia Tech Computing Center for support and enhancement. Meanwhile, documentation, usability testing, performance tuning, porting, and additions to the help system are underway.
The Interactive Learning project has five principal investigators at Virginia Tech, who constitute the local management board, and two at Norfolk State University. Each course that is being launched, revised in a major way, or enhanced slightly has one or more faculty who are responsible. In cases where a course is taught by multiple instructors, each draws from an ever-increasing collection of multimedia materials that serves as a resource pool for all to use. Evaluation activities operate at multiple levels, from usability testing of software and courseware interfaces, to comparative tests of different approaches in one course, to outcomes assessment, to curriculum-wide surveys and measures. System administration of the various computers and software packages involved is handled by a collection of graduate and undergraduate students, some funded and some engaged in course or master's degree level projects.
During the next year, we expect to reach significant milestones in a number of areas. By early 1995, the Envision system should be running smoothly, with a moderately large amount of data loaded. That will make it an effective aid for faculty and students to search and select materials for subsequent examination, or for referral in courseware.
Later in 1995, after porting has been completed, a server system will be installed at ACM Headquarters, running the Envision software and providing access to users over the Internet. It is likely that such service will be for members only, or that some charges will be levied on users who are not ACM members.
During the summer of 1995, a visit will be made to Graz, to collaborate with the Hyper-G developers. Further integration of their server and client software in our projects will take place afterwards.
The first phase of our work on automatic and semi-automatic identification of hypertext links should be completed by early 1995. Already we have a set of our own software working in concert with the Fulcrum retrieval system, and are building document structure trees and vectors for sections, paragraphs and sentences so that links can be suggested at various levels of granularity. We plan to connect that software with other software already used for some MARIAN-related clustering studies and then to try the combination out with parts of our DL. One mode of use will be to have the computer identify similar sections, paragraphs, or sentences to one selected by a user. Another mode will be to carry out a clustering, or incremental clustering that only considers new material, to find clusters of paragraphs (which might indicate that links could be added between paragraphs in the same cluster). This tool should be of use for faculty developing courseware, to supplement the manually identified links.
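The first mode of use described above, finding units similar to one selected by a user, can be sketched as follows. This is a simplified illustration under stated assumptions, not our software or the Fulcrum system: units are represented by raw term-frequency vectors, similarity is the cosine measure, and the suggest_links function and its threshold are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Represent a section, paragraph, or sentence as a term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity of two Counter vectors; 0.0 if either is empty."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


def suggest_links(units, selected_id, threshold=0.3):
    """Return ids of units similar to the selected one, most similar first.

    units maps a unit id to its text; the threshold is an assumed cutoff
    below which no link is suggested.
    """
    target = vectorize(units[selected_id])
    scored = []
    for unit_id, text in units.items():
        if unit_id == selected_id:
            continue
        score = cosine(target, vectorize(text))
        if score >= threshold:
            scored.append((score, unit_id))
    scored.sort(reverse=True)
    return [unit_id for _, unit_id in scored]
```

The same similarity measure could drive the clustering mode: units whose pairwise similarity exceeds the threshold fall into one cluster, and candidate links are proposed within each cluster for a faculty member to accept or reject.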
In parallel with these software developments, more materials will be loaded into the DL. In Spring 1995, the Multimedia, Hypertext and Information Access course will be offered in a classroom that is a laboratory, with a PowerMac system for each student. This will facilitate not only access to the DL, but incorporation of other interactive courseware. Also in the Spring of 1995, the Computers and Networked Information course will be offered through distance learning techniques to Norfolk State University students, and possibly others.
Many other efforts are planned during the next several years of our work on applying DLs to improve education. We especially look forward to improvements and new offerings connected with the WWW that can be applied to our situation.
The digital library research at Virginia Tech has been funded in part by the National Science Foundation through a series of grants, especially IRI-9116991, CDA-9312611 and CDA-9303152. Thanks go to all of the co-investigators and students who have worked on these projects! Special thanks go to those who have co-authored key works.
Knowledge Systems Inc. has provided KMS software, ported it to the DECstations that many of our students own, and assisted in our learning about and application of that system.
PRC Inc. is funding our research on automatic and semi-automatic hypertext linking, and has provided the xprcedit tool for displaying and manipulating page images. The Virginia Tech Computing Center and College of Arts and Sciences have provided cost sharing, and in the former case, staffing to help with the MARIAN and Envision projects.
Dr. Edward A. Fox holds a Ph.D. and M.S. in Computer Science from Cornell University, and a B.S. from M.I.T. Since 1983 he has been at Virginia Tech (VPI&SU), where he serves as Associate Director for Research at the Computing Center, and Associate Professor of Computer Science. In addition to his regular courses, Dr. Fox has taught 22 tutorials or short courses in 8 countries. For ACM he served 1988-91 as editor-in-chief of ACM Press Database and Electronic Products (i.e., electronic publishing). He is chairman of the Special Interest Group on Information Retrieval (SIGIR), chairman of the Steering Committee for the ACM Multimedia series of conferences, and member of the ACM/Springer Journal on Multimedia Systems editorial board. He also serves on the editorial boards of CD-ROM Professional, Electronic Publishing (Origination, Dissemination and Design), Information Processing and Management, J. of Educational Multimedia and Hypermedia, J. of Universal Computer Science, and Multimedia Tools and Applications. He has written extensively on information storage and retrieval, hypertext/hypermedia/multimedia, computational linguistics, CD-ROM and optical disc technology, electronic publishing, and expert systems.
N. Dwight Barnette is an instructor in the Virginia Tech Department of Computer Science. He was formerly an Assistant Professor of Computer Science at Christopher Newport College and an Assistant Professor of Mathematics at Concord College. Mr. Barnette received his B.S. from Concord College and his M.S. from Virginia Tech, where he is currently completing his Ph.D. work in Computer Science Education. His research centers upon designing user-centered networked hypermedia and multimedia knowledge bases to optimize learning, grounded upon students' cognitive styles.