
Digital Library Initiative, University of Illinois at Urbana-Champaign

From Computer theme issue on the US Digital Library Initiative, May 1996

A University of Illinois project is developing an infrastructure for indexing scientific literature so that multiple Internet sources can be searched as a single federated digital library.

Federating Diverse Collections of Scientific Literature

Bruce Schatz, William H. Mischo, Timothy W. Cole, Joseph B. Hardin, Ann P. Bishop, University of Illinois
Hsinchun Chen, University of Arizona

The most important recorded information medium on the Internet, and in the world at large, is the document. Although text might seem prosaic in contrast to multimedia objects, it is still the major medium for communicating information. Internet document retrieval can draw upon years of research results and practical experience in on-line information access and in traditional physical libraries. The technology for text information retrieval is far more mature than that for other media. Therefore, documents are also the best vehicle for investigating problems specific to digital libraries, such as the federation problem of making distributed collections of heterogeneous materials appear to be a single integrated collection.

The Digital Library Initiative (DLI) project at the University of Illinois at Urbana-Champaign is developing the information infrastructure to effectively search technical documents on the Internet. We are constructing a large testbed of scientific literature, evaluating its effectiveness under significant use, and researching enhanced search technology. We are building repositories (organized collections) of indexed multiple-source collections and federating (merging and mapping) them by searching the material via multiple views of a single virtual collection.

Developing widely usable Web technology is also a key goal. Improving Web search beyond full-text retrieval will require using document structure in the short term and document semantics in the long term. Our testbed efforts concentrate on journal articles from the scientific literature, with structure specified by the Standard Generalized Markup Language (SGML). Our research efforts extract semantics from documents using the scalable technology of concept spaces based on context frequency. We then merge these efforts with traditional library indexing to provide a single Internet interface to indexes of multiple repositories.

Our project focuses on developing a large-scale infrastructure adequate for solving real-world problems. The Testbed part of the project is based in the University Library in a new facility that showcases engineering and science information and literature. We are placing article files into the digital library on a production basis in SGML directly from major engineering and science publishers. The National Center for Supercomputing Applications (NCSA) is developing software for the Internet version in an attempt to make server-side repository search as widely available as its Mosaic software made client-side document browsing.[1] The Research section of the project is using NCSA supercomputers to compute indexes for new search techniques on large collections, to simulate the future world, and to provide new technology for the Testbed section.

Federating distributed repositories

A traditional physical library is a single repository for materials from many sources to which a user comes seeking information. A repository is just an organized collection in which documents and other objects are indexed for effective search. The Net situation is quite different, since users can directly access the sources themselves. A digital library is a group of these distributed repositories that users see as a single repository.

It is difficult to support the federation of multiple physical sources into a single logical source. Part of the difficulty is in handling the text: documents have differing structures and styles. Handling searches is also difficult: they must support different classification schemes so that sources can be indexed in various ways at different levels of detail.

Once Net retrieval is transparent, the digital library becomes similar to a typical physical library. Reference librarians help users locate information in a large collection by examining various indexes (search) and sources (display). Traditional libraries should naturally want to support digital libraries, since the range of indexes and sources available already far exceeds what a library building can physically house.

Figure 1 illustrates the major efforts in the Illinois Digital Library Project (one of six participants in the overall DLI). The publishers, our partners, are filtering scientific literature and collecting it into repositories. Our Testbed is developing index technology for effective search and display of SGML repositories. The Internet part of our project is developing interface technology to support multiple indexes for multiple Internet repositories. This will let us evaluate the Testbed's effectiveness for thousands of users on thousands of documents. Our Research effort is developing semantic technology to support federated search across multiple repositories, using document content rather than structure.


Figure 1. Major efforts within the Illinois Digital Library Initiative project.



The Internet interface will incorporate Research technology that provides semantic federation of distributed repositories for scientific literature. The Testbed is the middle ground of our large-scale experiment, where we deploy the technology and evaluate the sociology.

Repositories for scientific literature (testbed)

Our Testbed provides enhanced access over the Internet to the full text of selected engineering journals, using SGML document structure to facilitate search. The Testbed is based in the Grainger Engineering Library Information Center, a $30 million facility opened in March 1994 to showcase emerging information technologies. The Testbed was formally deployed in February 1996, with the production stream consisting of Applied Physics Letters from the American Institute of Physics. The production Testbed will gradually encompass the full collections of all publisher partners (listed below). Students and faculty at the University of Illinois, and then the other Big Ten universities, will be able to access the experimental digital library in accordance with our partner agreements.

Publishers and collection development

The testbed collection gathers articles directly from publishers in SGML format. These articles include the text and all figures, tables, images, and mathematical equations. Our publisher partners are committed to providing us with materials in the same time frame that they produce the print versions. That way we can place articles into our digital library before they reach the shelves of our physical library. We have chosen to manipulate SGML to the fullest extent possible, forgoing, for example, PDF (Portable Document Format), HTML (HyperText Markup Language), and ASCII, as discussed later. We are thus engaged in finding effective, scalable methods for the processing, indexing, retrieval, and display of articles as structured documents.

The testbed collection presently comprises over 4,000 articles from journals in computer science, electrical engineering, physics, civil engineering, and aerospace engineering. Publishers represented in this initial collection are

Thus, for example, this issue of Computer will be in our collection before you read this article. Other professional societies (such as the American Association for the Advancement of Science, which publishes Science) and commercial publishers (such as John Wiley & Sons) have committed to supply us with articles in SGML.

We believe that SGML will become the premier language of open document systems. SGML enables a system to treat documents as fine-grained objects to view, manipulate, and output. Tags delineate header (such as author, title, affiliation, and journal) and body (such as chapter, figure, table, and equation) structures. SGML's strength, in terms of retrieval, is that it reveals such deep document structure. SGML is becoming ubiquitous, but publishers are still mostly generating it as a byproduct of their production process, rather than as an integral part. In many cases we have been the first to actually display the SGML version of the published articles.

In the first phases of this project, we developed procedures for generating collections of SGML materials.[2] We process the heterogeneous SGML we receive from publishers into a federated repository of structured documents. Tags differ from one publisher to another. For example, every publisher has several author tags, which differ across publishers. We can federate some differences with simple syntactic transformations, such as AU or AUT or AUTHOR for the author tag. However, others reflect semantic differences and conventions. Yet the user wants to merely issue a query for author. We settled on an extension of the ISO 12083 Article Document Type Definition (DTD) for the project's canonical DTD. We are writing heuristic software for each DTD that maps publisher tags into our canonical set for indexing and retrieval. This tag normalization is our approach to structure federation.
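The sketch below illustrates the simple syntactic half of this tag normalization; the mapping tables and tag names are invented for illustration, and the semantic differences mentioned above still require per-DTD heuristic software rather than a lookup table.

    from typing import Dict

    # Hypothetical per-publisher mapping tables from native SGML tags to a
    # canonical set loosely modeled on our ISO 12083-based DTD. Real
    # publisher DTDs need heuristics, not just lookups.
    TAG_MAPS: Dict[str, Dict[str, str]] = {
        "publisher_a": {"AU": "author", "TI": "title", "JNL": "journal"},
        "publisher_b": {"AUT": "author", "ATL": "title", "SO": "journal"},
        "publisher_c": {"AUTHOR": "author", "TITLE": "title"},
    }

    def normalize(publisher: str, tag: str) -> str:
        """Map a publisher-specific tag to the canonical tag for indexing.
        Unknown tags pass through; those are the cases needing heuristics."""
        return TAG_MAPS[publisher].get(tag, tag.lower())

    # A user's single query for "author" now covers AU, AUT, and AUTHOR:
    assert normalize("publisher_a", "AU") == normalize("publisher_b", "AUT") == "author"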

To display journal articles, the testbed team has been working with SoftQuad to test and evaluate its Panorama SGML viewer. Figure 2 shows a portion of an SGML document as received from a publisher and displayed in this viewer. The bottom window is an American Institute of Physics document with federated tags, and the top window is how Panorama displays the SGML. Panorama can display all tagged parts directly: the text itself, titles (in this case, PACS), and equations. Style definitions for each DTD associate particular fonts and other aspects of display style with particular tag structures. At present, we are specifying these styles, but eventually publishers must define the styles just as they define the tags. Preserving the "look and feel" of the magazine layout is just as important as maintaining the article structure.


Figure 2. Testbed SGML sample: (top) "cooked," after styles; and (bottom) "raw," with tags.



Repositories and federated search

After adding an SGML document to the collection, we must index it for efficient retrieval. Our indexing techniques utilize the fine-grained structure of the documents so that, for example, users can search for a phrase solely within figure captions. We experimented with full-text retrieval using an SQL (Structured Query Language) engine before we settled on Open Text's Open Text Index search engine for indexing and accessing the DLI Project documents. This engine, tailored to SGML processing and retrieval, has the scalability to index large document stores (the Open Text Web Search Server presently indexes over 3 billion words and over 30 million links).
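As a rough illustration of such fine-grained indexing (a sketch of the idea, not the Open Text engine's actual design), the toy index below posts each term under the SGML element that contains it, so a search can be restricted to figure captions; the element name figcap and the sample data are invented.

    from collections import defaultdict

    # (element, term) -> set of document ids; a toy structure-aware index.
    index = defaultdict(set)

    def add_document(doc_id, elements):
        """elements: (element_name, text) pairs from a parsed SGML article."""
        for element, text in elements:
            for term in text.lower().split():
                index[(element, term)].add(doc_id)

    def search(term, element=None):
        """Documents containing term, optionally only within one element."""
        term = term.lower()
        if element is not None:
            return index.get((element, term), set())
        return set().union(*(ids for (el, t), ids in index.items() if t == term))

    add_document("apl-001", [("title", "Nanostructure growth"),
                             ("figcap", "TEM image of the nanostructure")])
    add_document("apl-002", [("abstract", "Quantum well lasers"),
                             ("figcap", "Band diagram")])

    print(search("nanostructure", element="figcap"))  # {'apl-001'}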

To evaluate database structures and retrieval effectiveness, we implemented a prototype client (written in Visual Basic) under Microsoft Windows. Figure 3 illustrates this prototype, which is our currently functioning testbed. The search query, shown in the upper overlaid window of the composite screen dump, finds nanostructure appearing only in figure captions. Selecting a retrieved article and viewing its short entry version shows that the caption of its Figure 2 contains that word. This figure, labeled as F2, can be viewed within the full article as shown in the window at the bottom of the screen dump. This service of our Engineering Library lets users access SGML document search within the context of other electronic retrieval services. Integrating bibliographic databases, on-line catalogs, local and remote periodical index databases, and the full-text SGML collection is vital to the Illinois digital library system.


Figure 3. Current Testbed client prototype within the context of the Engineering Library.



The information science literature shows that providing different search interfaces tuned to each search need helps users find information. In the current testbed interface, for example, users can use Boolean connectors to specify a phrase with different amounts of proximity or specify multiple phrases, and employ SGML tags to restrict the search to particular subparts of documents or to selected information sources. They can also use a "word wheel" list to choose possible terms appearing in the collection and use preselected lists of "classic" documents to choose documents directly.

At present, we are placing the sources into a single repository maintained at our home site at the University of Illinois. There we process the SGML articles into a single index with federated tags. That index drives the search engine and the document store. Concurrently, we are training our publisher partners to build their own repositories. They can then process and index their own materials and run their own servers for searching across the Net. We expect a number of our publisher partners to establish such repositories, using our federated tag schema. Uniform searching across these will then provide a true testbed for distributed repositories of professional materials.

User and usage evaluation

To evaluate testbed users and usage, we combined a broad study of use with a deep study of social phenomena.[3] Throughout the DLI project, we will observe how engineering work and learning activities intersect with using distributed, digital information. We will interview, individually and in groups, a range of potential and actual testbed users from the engineering community. We will conduct usability tests of various testbed components and versions, and experiment with economic models and charging mechanisms. Large-scale user surveys and testbed transaction logs will also yield extensive data.

Our sociological research has already yielded some valuable results. We asked focus groups of engineering students and faculty members how they use journals to support research and educational activities. The groups also discussed the biggest problems they have in identifying, retrieving, and using journal material. For example, focus group responses supported the relationship between journal structure and information needs and strategies. Many professors noted that figures, rather than abstracts or conclusions, were accurate indicators of whether they would be interested in the entire paper. They claimed that figures revealed what the authors had really done, as opposed to what they wished they had done. Several also reported that sometimes the equations were the only part they really needed to support certain work tasks. Several graduate students reported that the paper's bibliography indicated the paper's utility better than its title or author. In fact, sometimes they used the bibliography without reading the paper at all. The introductory paragraph was how most undergraduates decided whether an article was interesting, relevant, and written at the right level. These findings provide preliminary evidence that flexible interaction with document structure will enhance digital library effectiveness.

Inadequate information retrieval because of shallow semantics was a universal observation. Virtually all participants reported major difficulties in "getting the right words" to perform topic searches. This suggests a critical mismatch between the users' and the library's vocabulary systems. Students reported asking professors what "old" or "weird" term a particular database used to refer to the concept they wanted. They also searched their word processor's thesaurus for suggestions of alternative terms and asked other library patrons if they could think of "better words."

Multiple views for distributed repositories (Internet)

The typical entry into a digital library is a specific search query, which matches some selected documents. The user can display these documents at different levels of detail and issue another search, so that a session gradually retrieves documents relevant to the user's needs. We are developing a multiple-view interface that enables transparent drag-and-drop between multiple indexes for multiple repositories. In addition, we are developing gateway technology to maintain the state and protocols for heterogeneous distributed repositories.

Interfaces to multiple indexes

Multiple views means that different searching techniques are available concurrently. We have built a prototype multiple-view interface, which will be used in the Internet version of the DLI Testbed to be introduced this summer. This interface incorporates a number of different view types, dynamically loading the actual data. We discuss this interface and compare the effectiveness of its views for different information retrieval purposes elsewhere.[4]

The view types integrate into a single framework many indexing styles and the major results from our projects. The primary views are subject thesauri, co-occurrence lists, and full-text search. A human indexer (a professional librarian) generates each subject thesaurus. The thesaurus arranges important terms in a subject area into a semantic hierarchy. A machine indexer (an automatic program) generates each co-occurrence list, which contains a more extensive list of terms, arranged by contextual frequency. Users can employ either one to interactively discover alternative search terms. They can then enter the new terms into a search engine for full-text search of the document collection.

Many studies, including our own, show that users have difficulty generating search terms that appear within the document collection. That is why our interface offers different types of term suggestion, then provides a high-quality search based on these new terms. Our experiments indicate that a typical user session is as follows. First the user consults the subject thesaurus for coarse-grained suggestions to identify the general subject area. Then the user accesses the co-occurrence lists for fine-grained suggestions to gather a list of desired terms. Finally, a full-text search retrieves the documents containing these terms.
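A minimal runnable sketch of this three-step flow, with toy stand-ins for the thesaurus, the co-occurrence list, and the full-text index (all names and data here are invented):

    # Toy stand-ins for the three views: a thesaurus hierarchy, a
    # co-occurrence list, and a full-text document store.
    thesaurus = {"computer applications": ["deductive databases"]}
    cooccur = {"deductive databases": ["logic programming", "datalog"]}
    documents = {"d1": "datalog rules for deductive databases",
                 "d2": "bridge aerodynamics under wind loading"}

    def session(topic):
        terms = set()
        for narrower in thesaurus.get(topic, []):    # coarse-grained suggestion
            terms.add(narrower)
            terms.update(cooccur.get(narrower, []))  # fine-grained suggestion
        # full-text search over the gathered terms
        return {d for d, text in documents.items() if any(t in text for t in terms)}

    print(session("computer applications"))  # {'d1'}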

Interactive term suggestion

The primary index for our initial testbed collection is INSPEC, from the Institution of Electrical Engineers (the British IEEE). It offers extensive coverage of electrical engineering, computer science, and physics. Our subject thesaurus is the INSPEC thesaurus, which has 10,000 terms in a broader-narrower term hierarchy. The co-occurrence list is 200,000 terms from the INSPEC abstracts collection, which we arrange in concept graphs by co-occurrence frequency. The prototype interface lets users drag-and-drop suggested terms into the full-text search system constructed as part of testbed efforts.

The left side of Figure 4 illustrates the INSPEC thesaurus interface,[5] which provides a graphical display of the subject hierarchy of important concept terms. The user can specify a term and see broader and narrower terms, as well as graphically examine related ones. The graph is traversed by specifying computer applications, showing the narrower terms such as deductive databases, whose broader term is database management systems, and whose related terms include logic programming. The prototype interface enables terms so located to be passed into a search query.


Figure 4. Current Internet prototype, showing the multiple-view interface.



The right window in Figure 4, marked "Search Wizard," illustrates the co-occurrence lists.[6] Unlike the subject classifications, which are generated by professional librarians, these are automatically generated directly from the document content. The automatic generation employs co-occurrence analysis, which records how often a term occurs within the same sentence as another. The list of terms in the figure thus reflects terms that appear frequently with deductive databases. The concept graph, which relates term co-occurrence, is the collection of all lists. This approach is based on document content, rather than structure. Thus even in domains where the materials are unstructured, such as the Net, it captures more of the underlying concept semantics.
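The core of the analysis can be sketched as follows. This toy version counts same-sentence co-occurrence of single words, whereas the actual concept spaces are built from noun phrases over hundreds of thousands of abstracts with weighted frequencies; the sample abstracts are invented.

    from collections import Counter
    from itertools import combinations

    def cooccurrence(sentences):
        """Count how often each pair of terms shares a sentence."""
        counts = Counter()
        for sentence in sentences:
            terms = sorted(set(sentence.lower().split()))
            for a, b in combinations(terms, 2):
                counts[(a, b)] += 1
        return counts

    def neighbors(counts, term, k=5):
        """Terms most frequently co-occurring with `term`, best first."""
        scored = Counter()
        for (a, b), n in counts.items():
            if term == a:
                scored[b] += n
            elif term == b:
                scored[a] += n
        return scored.most_common(k)

    abstracts = ["deductive databases extend logic programming",
                 "logic programming underlies deductive databases",
                 "relational databases support transactions"]
    print(neighbors(cooccurrence(abstracts), "databases"))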

Stateful gateways and distributed repositories

To implement complete search sessions, we need techniques for providing state information within the Web. The Web is essentially stateless, with each transaction fetching a document, then stopping. Complete searching requires levels of stateful gateways to provide session history. First, each individual CGI-style gateway must maintain the state of the requests made to each server. Next, a higher level gateway must route queries to the appropriate servers and route results back to the appropriate clients. Finally, a search history must be kept for each user to record the session requests to each gateway. This function logically belongs in the client, which is where it is placed in our current design. However, it could potentially exist in any combination of client, server, or gateway depending on their functionalities.
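A hedged sketch of the idea (class and method names are invented, not our server's actual interface): the gateway issues a session identifier, records each request in a per-session history, and routes queries to the appropriate back-end server.

    import uuid

    class Gateway:
        """Toy stateful gateway over stateless Web-style transactions."""

        def __init__(self, servers):
            self.servers = servers    # name -> callable search back end
            self.sessions = {}        # session id -> list of (server, query)

        def open_session(self):
            sid = str(uuid.uuid4())
            self.sessions[sid] = []
            return sid

        def query(self, sid, server_name, query):
            """Route a query to a back end, recording it in the history."""
            self.sessions[sid].append((server_name, query))
            return self.servers[server_name](query)

        def history(self, sid):
            return list(self.sessions[sid])

    gw = Gateway({"opentext": lambda q: "results for " + repr(q)})
    sid = gw.open_session()
    gw.query(sid, "opentext", "nanostructure within figure captions")
    print(gw.history(sid))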

Our distributed repositories prototype implements the levels of stateful gateways across a variety of protocols. The primary testbed search is an Open Text engine, with a custom protocol built on sockets. We implemented suggestion indexes using a Microsoft SQL engine. The SGML documents themselves reside in files accessed by an HTTP server. The interfaces to external search engines, such as the on-line catalog, follow the Z39.50 protocol. We even have an initial publisher repository, the experimental American Astronomical Society (AAS) server, connected via the CNIDR (Clearinghouse for Networked Information Discovery and Retrieval) Z39.50 software to test distributed repository protocols.

Our DLI project is providing major input to the next-generation server that NCSA is building. The server will move from a WWW document server using HTTP to a distributed repository host using multiple protocols. Server version 2.0, due in the summer of 1996, will feature a modular protocol design and integrated security. We will later incorporate the work on stateful gateways into the server on the output end; the input end will incorporate the work on collection development. Thus the new server will eventually support session history and metadata checking. Later versions will also support security measures such as token passing, which will be used in our economic charging trials involving the NetBill software from the Carnegie Mellon DLI project.

We expect that during the course of the DLI project many of our publisher partners will create their own repositories. This will help the testbed evolve into a multiple-view reference system to distributed repositories. The repository management package will let other organizations and individuals make their organized collections searchable via a multiple-view interface.

Semantic federation across repositories (research)

The holy grail of information retrieval has always been deep semantics across heterogeneous sources. This is clearly expressed in the recent report[7] on the research agenda for digital libraries from a workshop sponsored by the Information Infrastructure Technology and Applications (IITA) committee (the primary technical committee for setting National Information Infrastructure (NII) directions for federal government R&D investment). The report said that "deep semantic interoperability is the grand challenge for digital libraries." At its base, information retrieval technology matches terms specified by the user to terms occurring in documents in a digital collection. This term-matching is most effective when specialists access materials in their own subject area with precise terminology.

Concept spaces for scalable semantic retrieval

Broadening access requires different techniques to extend effective support to nonspecialists or specialists working outside their area of expertise. Specialists in even a closely related subject area usually cannot find relevant materials using current information systems. They know the concepts, but not the right terms. Artificial intelligence and natural-language approaches that parse deep document structure to deduce semantics are usually effective only in narrow subject domains. The broad subject domains in our testbed in particular and the Net in general call for a different approach.

Our research focuses on methods that interactively provide the user with conceptual maps that offer alternative search terms. Interactive term suggestion, where the system suggests terms for the user to choose, can significantly enhance retrieval effectiveness. Although traditional library indexes provide some degree of term suggestion, effective Net searching requires automatic indexing. Many Net repositories are too small or specialized for a human indexer to provide the required level of fine-grained indexing. In addition, most digital repositories are "fluid," containing concepts and vocabularies too new or dynamic for controlled-vocabulary-based human indexing.

We have developed algorithms to extract concepts from documents so as to provide automatic indexing for semantic retrieval. The automatic indexing we are investigating generates concept spaces, which are concept graphs based on co-occurrence analysis.[8] Concept spaces lead to an approach for semantic federation across digital repositories, in particular towards solving the "vocabulary problem."[9] The vocabulary problem is the version of the semantic interoperability problem for text documents, the Grand Challenge of digital library research.

When digital libraries become widespread, every specialized community will have its own digital library of documents. This is already true for large professional communities. The increasing maturity of Net publishing will soon make it increasingly true for small amateur communities as well. The vocabulary problem will increasingly become an obstacle to the propagation of digital libraries.

Solving the vocabulary problem involves mapping a community library's specialized terms into the corresponding terms of other libraries being searched. Intersecting co-occurrence graphs from different domains provides an approach to concept-mapping across community libraries. Two graphs from different subject domains can be intersected by having the user specify a term common to both domains and displaying the graph around that term for both domains. This creates two term suggestion lists that can be compared for terms that are different in each subject domain but represent the same concept. In practice, the user needs to interactively cull the lists, but often discovers vocabulary that can be switched across domains.
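The intersection step might look like the following sketch; the graphs, weights, and terms are invented for illustration, and in practice the paired lists are only candidates that the user culls interactively.

    # Toy co-occurrence graphs: term -> {neighbor: co-occurrence weight}.
    civil = {"fluid dynamics": {"wind loading": 41, "bridges": 30,
                                "vortex shedding": 12}}
    ocean = {"fluid dynamics": {"water currents": 38, "undersea cables": 22,
                                "vortex shedding": 9}}

    def candidate_switches(graph_a, graph_b, shared_term, k=3):
        """Pair the top-weighted neighbors of a term common to both domains."""
        def top(graph):
            ranked = sorted(graph[shared_term].items(), key=lambda x: -x[1])
            return [term for term, _ in ranked[:k]]
        return list(zip(top(graph_a), top(graph_b)))

    for a, b in candidate_switches(civil, ocean, "fluid dynamics"):
        print(a, "<->", b)   # e.g. wind loading <-> water currents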

Vocabulary-switching experiments

We are running large-scale experiments to investigate using co-occurrence graphs for vocabulary switching. These experiments build on smaller successful experiments for vocabulary switching in molecular biology.[10] Since part of our project is based at NCSA, we can use their supercomputers to perform experiments with realistic-scale collections. The experiments use algorithms for vocabulary switching across subject domains based upon the co-occurrence frequency of phrases within documents to generate concept spaces.

Last year we generated the concept space used as the co-occurrence list for the term suggestion above from a collection of 400,000 computer engineering abstracts extracted from the INSPEC database.[6] Using one day (24.5 hours) of CPU time on the 16-node Silicon Graphics Power Challenge, we created a comprehensive concept space of about 270,000 terms and 4,000,000 links. During the two-week period of these computations, our application was the single largest user of NCSA supercomputers, beating out even the physicists and biologists.

This year we performed an order-of-magnitude-larger computation to generate multiple concept spaces for a large-scale vocabulary-switching experiment. We used some 4,000,000 abstracts from the Compendex database covering all of engineering as the collection. We partitioned it along classification code lines into some 600 community repositories. For example, (400) is civil engineering, (401) is bridges and tunnels, and (401.1) is bridges. We then generated a concept space for each individual repository and intersected the spaces to provide semantic mapping. This covers engineering fairly well and provides a large-scale test of mapping similar concepts across related domains with different terms. We used time during the testing phase of the new 64-processor Convex Exemplar at NCSA. The computation took roughly four days of CPU time over two weekends of dedicated machine usage and proved a good match for the shared-memory multiprocessor (SMP) architecture.

The scale of a repository in the Compendex experiment is, for example, on bridges rather than on civil engineering. This means that our prototype can realistically support dialogues across community repositories. Our system can display a list centered around a term like fluid dynamics in several domains. The user can then choose which terms in one domain to map into which terms in another. The user can thus interactively navigate between the spaces (see discussion of Interspace below). We are also experimenting with the concept space approach to semantic interoperability for other data types. For example, we will be switching texture images in spatial maps through a collaboration with the University of California at Santa Barbara DLI project. (This finds the co-occurrence frequency of textures in maps instead of phrases in documents.)

An example of vocabulary switching in our prototype might be:

I'm a civil engineer who designs bridges. I'm interested in using fluid dynamics to compute the structural effects of wind currents on long structures. Ocean engineers who design undersea cables probably do similar computations for the structural effects of water currents on long structures. I want you [the system] to change my civil engineering fluid dynamics terms into the ocean engineering terms and search the undersea cable literature.

Building the Interspace

The encouraging results with concept spaces lead us to believe that we can build a complete information system supporting semantic retrieval. Since supercomputers can be used as a "time machine" to simulate future ordinary processing, ordinary personal computers will be able to generate similar concept spaces some years hence. This will provide essential infrastructure for the information systems possible on the Net of the twenty-first century. We are designing prototypes for community repositories on the Net that researchers outside the community can readily search. These prototypes will demonstrate the technological feasibility of "analysis environments," where researchers solve problems by correlating information from multiple sources across the network.

In the next century, information systems will directly support correlation of information across community repositories. Thus a user will deal with the Interspace rather than the Internet.[11] (The term Interspace indicates interconnection of spaces, just as Internet indicates interconnection of networks.) The fundamental interaction is intersecting concept spaces of related terms across subject domains, extracted from information spaces of interlinked objects comprising community repositories. Each individual and each community will have their own spaces. The Net will then enable information analysis, rather than merely document transfer as it does now.

The DLI project's prototype Interspace environment embeds concept spaces into the infrastructure of a network information system. Basic retrieval employs semantic matching to support information analysis. The user selects navigation paths of relevant objects, which the system records. The system then matches the user path to related paths across community repositories using semantic retrieval on concept spaces. We have completed the preliminary design and are beginning to implement the first prototype.

The Interspace prototype concentrates on the scalable technology for concept spaces.[12]

Since we are prototyping future Net functionality, we assume that distributed objects and syntactic interoperability have already become a mass infrastructure. Our choice of software tools--Smalltalk, CORBA, and ObjectStore--enables us to simulate building upon the future Internet-wide operating system. We are collaborating with research projects like the Stanford DLI project (object interoperability) and the CNRI (Corporation for National Research Initiatives) repository project (object naming). This will help us track and influence the object infrastructure necessary to support the concept infrastructure we are prototyping.

Conclusion

In the coming years, we will continue to investigate whether concept spaces are a generic protocol that supports semantic interoperability across subject domains. We plan to construct complete analysis environments based on these protocols as prototypes of fundamental information infrastructure for the next wave of the Net. These future network information systems will support cross-correlation of information across distributed repositories.

We are optimistic that the Testbed efforts of the Illinois Digital Library project will influence the facilities for searching information on the Net with the help of technology evolved in our Internet version. We are also hopeful that the Research efforts will influence the facilities for analysis of information after the Internet becomes the Interspace.

Acknowledgments

Many people have contributed to the ideas and the prototypes discussed here. In particular, we thank Larry Jackson, Beth Frank, Eric Johnson, Jason Ng, Pauline Cochrane, Leigh Star, Roy Campbell, Charlie Catlett, Dorbin Ng, Kevin Powell, and Susan Harum. We also thank our many publishing partners for making their materials available to us on an experimental basis. This project is funded by the NSF/ARPA/NASA Digital Library Initiative grant to the University of Illinois, IRI-94-11318 COOP.

For further information on the Illinois DLI project, see http://www.grainger.uiuc.edu/dli/.

References

  1. B. Schatz and J. Hardin, "NCSA Mosaic and the World Wide Web: Global Hypermedia Protocols for the Internet," Science, Vol. 265, Aug. 12, 1994, pp. 895-901.
  2. T. Cole and M. Kazmer, "SGML as a Component of the Digital Library," Library Hi Tech, Vol. 13, No. 4, 1995, pp. 75-90.
  3. A. Bishop et al., "Building a University Digital Library: Understanding Implications for Academic Institutions and Their Constituencies," Proc. Monterey Conf. on Higher Education and the NII: From Vision to Reality, Coalition for Networked Information, Washington, D.C., 1995.
  4. B. Schatz et al., "Interactive Term Suggestion for Users of Digital Libraries: Using Subject Thesauri and Co-Occurrence Lists for Information Retrieval," Proc. First ACM Int'l Conf. Digital Libraries, ACM Press, New York, 1996, pp. 126-133.
  5. E. Johnson and P. Cochrane, "A Hypertextual Interface for a Searcher's Thesaurus," Proc. Digital Libraries '95 Conf., 1995, available at http://csdl.tamu.edu/DL95.
  6. H. Chen et al., "A Parallel Computing Approach to Creating Engineering Concept Spaces for Semantic Retrieval: The Illinois Digital Library Initiative Project," IEEE Trans. Pattern Analysis and Machine Intelligence (special issue on digital libraries: representation and retrieval), to appear 1996.
  7. "Interoperability, Scaling, and the Digital Libraries Research Agenda," report of IITA Digital Libraries Workshop, May 1995; available at http://www-diglib.stanford.edu/diglib/pub/reports/iita-dlw/main.html.
  8. H. Chen et al., "Automatic Thesaurus Generation for an Electronic Scientific Community," J. American Soc. Information Science, Vol. 46, No. 3, Apr. 1995, pp. 175-193.
  9. H. Chen, "Collaborative Systems: Solving the Vocabulary Problem," Computer (special issue on computer-supported cooperative work), Vol. 27, No. 5, May 1994, pp. 58-66.
  10. H. Chen et al., "A Concept Space Approach to Addressing the Vocabulary Problem in Scientific Information Processing: An Experiment on the Worm Community System," J. American Soc. Information Science, to appear 1996.
  11. B. Schatz, "Information Analysis in the Net: The Interspace of the Twenty-First Century," America in the Age of Information: A Forum on Federal Information and Communications R&D, sponsored by Committee on Information and Communications, National Science and Technology Council, 1995; http://www.hpcc.gov/cic/forum/CIC_Cover.html.
  12. B. Schatz et al., "Building the Interspace: Overview and Architecture," http://csl.ncsa.uiuc.edu/interspace.html.

Bruce Schatz is principal investigator of the Digital Library Initiative project at the University of Illinois and a research scientist at the National Center for Supercomputing Applications, where he is the scientific advisor for digital libraries and information systems. He is also an associate professor in the Graduate School of Library and Information Science, the Department of Computer Science, and the Program in Neuroscience. He holds an NSF Young Investigator award in science information systems. Schatz has worked in industrial R&D at Bellcore and Bell Labs, where he built prototypes of networked digital libraries that served as a foundation of current Internet services (Telesophy), and at the University of Arizona, where he was principal investigator of the NSF National Collaboratory project that built a national model for future science information systems (Worm Community System).

His current research in information systems involves building analysis environments to support community repositories (Interspace); in information science, he is performing large-scale experiments in semantic retrieval for vocabulary switching using supercomputers. Schatz received an MS in artificial intelligence from the Massachusetts Institute of Technology, an MS in computer science from Carnegie Mellon University, and a PhD in computer science from the University of Arizona.

William H. Mischo is the director of the Grainger Engineering Library Information Center at the University of Illinois at Urbana-Champaign and professor of library administration. He has been responsible for the design and development of several client-server information retrieval systems and has written several articles on interface design, including a benchmark 1987 ARIST (Annual Review of Information Science and Technology) chapter. He is the principal designer and supervisor of the Illinois Digital Library Initiative Testbed.

Timothy W. Cole is the system librarian for digital projects in the University of Illinois Library. From 1989 to 1994 he was an assistant librarian at the UIUC Engineering Library, where he helped to develop the microcomputer interface for end-user searching of bibliographic databases currently used at the UIUC Library. Cole is responsible for the acquisition, processing, and indexing of the SGML materials in the UIUC DLI database. He received a BS in aeronautical and astronautical engineering (1978) and an MS in library and information science (1989), both from the University of Illinois at Urbana-Champaign.

Joseph B. Hardin has been the associate director for software development at NCSA since 1992. Previously he was the manager of the software development group and a visiting research associate at NCSA. He has taught in the department of Speech Communication at the University of Georgia at Athens. Hardin has received a number of grants and awards in the area of scientific visualization and network-based software development, and has spoken extensively on workstation tools for computational science, technologies for networked information systems, and the human dimensions of collaboration technologies in cyberspace. He served as cochair of the Second International World Wide Web Conference 94: Mosaic and the Web. He is also a founder and cochair of the International World Wide Web Conferences Committee, which is coordinating future WWW conferences.

Ann P. Bishop is an assistant professor in the Graduate School of Library and Information Science at the University of Illinois. On the DLI project, she heads the testbed evaluation and social science team. She is currently studying the impact of electronic networking on engineering work and communication and on community life. Recently completed collaborative research projects include a study of federal information inventory/locator systems (sponsored by the US Office of Management and Budget), and an assessment of the impact of high-speed networks on scholarly communication and research (sponsored by the US Office of Technology Assessment). In 1990, Bishop was a cowinner of the American Library Association's Jesse H. Shera Award for research.

Hsinchun Chen is an associate professor of Management Information Systems at the University of Arizona and director of the Artificial Intelligence Group. He is the recipient of an NSF Research Initiation Award, the Hawaii International Conference on System Sciences Best Paper Award, and an AT&T Foundation Award in Science and Engineering. He has published more than 30 articles about semantic retrieval and search algorithms. Chen received a PhD in information systems from New York University.

Readers can contact the authors at Digital Library Initiative Project, Grainger Engineering Library Information Center, 1301 W. Springfield Ave., University of Illinois, Urbana, IL 61801; e-mail dli@uiuc.edu. Hsinchun Chen's address is Dept. of Management Information Systems, McClelland Hall, University of Arizona, Tucson, AZ 85721; hchen@bpa.arizona.edu.

