SGML: TORPEDO: Networked Access to Full-Text

From: gopher://info.lib.uh.edu/00/articles/e-journals/uhlibrary/pacsreview/v6/n3/atkinson.6n3.

 
-----------------------------------------------------------------
Atkinson, Roderick D., and Laurie E. Stackpole.  "TORPEDO:
Networked Access to Full-Text and Page-Image Representations of
Physics Journals and Technical Reports."  The Public-Access
Computer Systems Review 6, no. 3 (1995): 6-15.
-----------------------------------------------------------------
 
1.0  Introduction
 
The Naval Research Laboratory (NRL) Library and the American
Physical Society (APS) are experimenting with electronically
disseminating journals and reports over NRL campus networks.  The
project is called TORPEDO (The Optical Retrieval Project:
Electronic Documents Online).  It involves storing and
disseminating two APS journals (Physical Review Letters and
Physical Review E) as well as the NRL collection of unclassified,
unlimited distribution technical reports.  These paper-format
journals and reports are scanned at NRL to create CCITT Group IV
image files; the image files are converted to ASCII files using
OCR; both types of files are associated with bibliographic
information; and all of them are imported into a
client/server-based commercial imaging system.
 
2.0  Participating Institutions
 
The NRL Library and the APS have been actively exploring the
potentials of electronic information dissemination through a
variety of projects.
 
2.1  The Naval Research Laboratory Library
 
Created in 1923 by Congress for the Department of the Navy on the
advice of Thomas Edison, NRL is the Navy's corporate research and
development laboratory.  The Ruth H. Hooker Research Library and
Technical Information Center (NRL Library) addresses the
information needs of the NRL research community, which consists
of about 3,500 Federal staff and about 1,500 contractors at the
Washington, D.C. facility.
 
NRL occupies a 130-acre campus of 152 buildings located on the
Potomac River in Southwest Washington, D.C.  Research facilities
are also located in Orlando, Florida; Bay St. Louis, Mississippi;
and Monterey, California.  The Library also serves NRL's parent
organization, the Office of Naval Research (ONR), in nearby
Arlington, Virginia.
 
+ Page 7 +
 
The research efforts of the Laboratory are concentrated in 17
broad areas: acoustics, advanced space sensing, artificial
intelligence, astrophysics, biotechnology, chemistry, condensed
matter science, information technology, materials research,
optical sciences, plasma physics, radar and electronics,
radiation technology, remote sensing, space science, space
systems, and structural dynamics.
 
The NRL Library has been in the forefront of the initiative to
move toward a totally digital library.  The Library began
actively scanning its technical reports collection in 1988, and
it has been scanning close to 10,000 pages a day since early
1993.  In addition, the Library has supported a campus-wide
information system, called the InfoNet, since 1992.  InfoNet
provides NRL/ONR researchers desktop access to commercial and
noncommercial online services on the Internet, the NRL Library's
online catalog, NRL resources, CD-ROM databases, and electronic
books.
 
2.2  The American Physical Society
 
The American Physical Society (APS) is an organization of more
than 43,000 physicists worldwide.  The APS publishes several
major physics research journals: the Physical Review series,
Physical Review Letters, and Reviews of Modern Physics.  It
organizes scientific meetings where new results are reported and
discussed.  In addition to these primary functions, the Society
has many other programs in areas such as education, international
affairs, public affairs, and public information.
 
Since its founding in 1899, the primary purpose of the APS has
been to advance the knowledge of physics.  Recently the APS
became quite active in projects to disseminate its journals
electronically.  In addition to working with NRL, the APS is
involved in several electronic journal dissemination projects,
including the development of an archive of the Physical Review at
Los Alamos National Laboratory and the dissemination of its
flagship publication, Physical Review Letters, through OCLC.  As
part of its commitment to electronic publishing, the APS will
utilize SGML for the production of all of its journal
publications.
 
+ Page 8 +
 
3.0  Project Goals
 
By working together to disseminate scientific journals
electronically, the NRL Library and the APS hope to determine:
 
     1.   The attitudes of scientists toward electronic
          information.
 
     2.   How the attitudes of APS members differ from those of
          nonmembers.
 
     3.   The feasibility of disseminating journals in image
          format over campus networks and the Internet.
 
     4.   Researcher preferences for electronic format options
          (e.g., images versus page-definition files).
 
     5.   The desirable features of future electronic journal
          systems.
 
     6.   How publishers and libraries can most effectively
          cooperate in making electronic journals available to
          scientists, and how they can effectively integrate them
          with other materials.
 
     7.   What kind of controls can be used to prevent
          unauthorized users from accessing the system.
 
4.0  Project Implementation
 
TORPEDO is being implemented in three phases.  The first phase
was completed between January and April 1995.  The second phase
began in May 1995.  The third phase will begin in July 1995.
 
The three phases of the project are:
 
     1.   Local access from end-user workstations in the NRL
          Library.
 
     2.   Remote access from anywhere on the NRL's Washington
          campus network by any of the supported computing
          platforms (Microsoft Windows, Macintosh, and X Window
          System workstations).
 
     3.   Internet access from the campus networks of the other
          NRL research units and from the Office of Naval
          Research.  Dial-access will also be provided for
          researchers who are working at home or travelling.
 
+ Page 9 +
 
5.0  TORPEDO Access
 
End-user access to TORPEDO is provided through the NRL Library's
World-Wide Web home page (http://infonext.nrl.navy.mil).  The
Library's home page provides documentation to assist end-users in
learning about the TORPEDO project, provides access to the home
pages of the associated participants (APS and Los Alamos National
Laboratory), permits the downloading of freely distributable
client software and user guides, and serves as the point of
access for TORPEDO.  The computer workstation requirements for
accessing TORPEDO are identical to those for running NCSA Mosaic.
 
To deliver electronic versions of journals and technical reports
to end-users, TORPEDO uses a commercial imaging software package
from Excalibur Technologies called EFS.  EFS is predominantly
client/server based and comes with freely distributable client
software for Microsoft Windows and Macintosh workstations.  UNIX
access to EFS is currently supported through an X Window System
interface, and a true UNIX software client is scheduled for
release in the next major EFS upgrade.
 
6.0  APS Journals
 
APS forwards issues of Physical Review Letters and Physical
Review E via overnight mail to the NRL Library as these issues
come off the press.  Simultaneously, APS sends the bibliographic
data associated with the articles to the NRL Library via
electronic mail.  The NRL Library scans the journals using a
Pentium PC running Microsoft Windows and a duplex scanner.  The
scanner has an autofeeder and is rated at 20 pages per minute.
Images are stored on a Novell NetWare 3.12 server in CCITT Group
IV TIFF format.  The images are converted to ASCII form using
optical character recognition (OCR).  As part of the process, the
images are deskewed and enhanced.  The OCR process is done on a
Pentium PC using two software packages that run under Microsoft
Windows: the Avatar EnMasse! batch image capture and conversion
software and its bundled Calera WordScan Plus software.
 
+ Page 10 +
 
Throughout this process, a technician feeds the batch scanner,
reviews the OCR process to ensure that text columns are correctly
identified, and separates sets of journal image files into
distinct articles.
 
Once the files have been scanned and converted to ASCII form,
they are automatically moved from a Novell file server to a SUN
SPARCstation 20 using a networked 486 PC running Microsoft
Windows.  The intermediary PC is required because the directory
structure used by Avatar to store the images and ASCII text is
different from that used by EFS.  In addition, this PC provides
the EFS database with updated names for the files, adds the files
into an appropriate tree structure, verifies the integrity of the
images, and associates the files with the bibliographic data used
for field searching.
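
The transfer step described above can be sketched roughly as
follows.  This is a minimal illustration in Python, not the
software NRL actually ran: the directory layouts, the file-naming
scheme, and the function names remap_path and looks_like_tiff are
all assumptions made for the example.  The only format fact it
relies on is that a TIFF file begins with the bytes "II*\0" or
"MM\0*".

```python
from pathlib import PurePosixPath

# A TIFF file starts with a byte-order mark ("II" or "MM") followed by 42.
TIFF_SIGNATURES = (b"II*\x00", b"MM\x00*")


def looks_like_tiff(header: bytes) -> bool:
    """Cheap integrity check: do the first four bytes match a TIFF signature?"""
    return header[:4] in TIFF_SIGNATURES


def remap_path(batch_name: str, journal: str, volume: int, issue: int,
               article: int) -> PurePosixPath:
    """Map a flat batch-scan file name into a journal/volume/issue tree.

    Hypothetical input name:  'page0007.tif' (or 'page0007.txt')
    Hypothetical output path: 'PRL/v074/n12/a003/p0007.tif'
    """
    src = PurePosixPath(batch_name)
    if not src.stem.startswith("page"):
        raise ValueError(f"unexpected batch file name: {batch_name}")
    page = int(src.stem[len("page"):])
    return PurePosixPath(
        journal,
        f"v{volume:03d}",
        f"n{issue:02d}",
        f"a{article:03d}",
        f"p{page:04d}{src.suffix}",
    )
```

In a sketch like this, the renamed path would be recorded next to
the article's bibliographic record, so that field searches and
the image files refer to the same document.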
 
Image and ASCII files are then imported into the EFS database and
stored on 5 1/4" multifunction (read-write) optical disks.  The
optical disks themselves, 32 in all, are housed inside one
Hewlett-Packard jukebox.  Each formatted optical disk stores 1 GB
of data.  Each page-image file is approximately 55 KB and each
ASCII file is about 20 KB.  Importing the files into EFS is done
overnight and requires no operator intervention except to
initiate the process.
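
The figures above allow a rough capacity estimate, worked out
below in Python.  Two assumptions are not stated in the text:
that there is one ASCII file per page image, and that 1 GB means
1,000,000 KB.  Index overhead and file-system overhead are
ignored, so the results are upper bounds under those assumptions.

```python
KB_PER_GB = 1_000_000        # assumption: decimal units
DISK_CAPACITY_GB = 1         # from the text: 1 GB per formatted disk
DISKS_IN_JUKEBOX = 32        # from the text
IMAGE_KB_PER_PAGE = 55       # from the text (approximate)
ASCII_KB_PER_PAGE = 20       # from the text; assumes one ASCII file per page


def capacity_in_pages():
    """Return (pages per disk, pages per jukebox) under the assumptions above."""
    kb_per_page = IMAGE_KB_PER_PAGE + ASCII_KB_PER_PAGE
    pages_per_disk = (DISK_CAPACITY_GB * KB_PER_GB) // kb_per_page
    return pages_per_disk, pages_per_disk * DISKS_IN_JUKEBOX


pages_per_disk, pages_per_jukebox = capacity_in_pages()
print(pages_per_disk, pages_per_jukebox)
```

Under these assumptions each disk holds on the order of 13,000
pages, and the full jukebox holds somewhat more than 400,000
pages.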
 
As the image and ASCII files are imported into the EFS system,
they are indexed.  The index itself is stored on a 9-GB hard
disk.  Each morning the SUN server shuts down and reboots so that
the new records can be searched by end-users.
 
The entire process of scanning the paper journals, converting the
scanned files to ASCII form, and importing the files into EFS can
be performed in one 24-hour period.  This means that end-users
can have a current issue of a journal electronically available at
their workstations one day after it is received by the NRL
Library.
 
7.0  NRL Technical Reports
 
As part of an ongoing project begun in 1988, the NRL Library has
already scanned over 100,000 unclassified technical reports in
its collection.  These reports are presently stored on 12" WORM
optical disks housed in a 50-platter Sony jukebox.
 
+ Page 11 +
 
The system used to display these reports, Genesys ImageExtender,
is a commercial PC-client-based, IPX/NetBIOS protocol system.
Linking the ImageExtender product to an existing catalog produced
a system that offers extensive field searching, but is not easily
scaled to meet the needs of TORPEDO end-users in a wide-area
networked environment.  EFS, on the other hand, more closely fits
the wide-area network imaging needs of the NRL/ONR community
because of its native TCP/IP support, client/server
configuration, and multiplatform support.  Therefore, those
unclassified technical reports that have no distribution
restrictions (i.e., unclassified, unlimited documents) are
imported into EFS after going through the same processing as the
APS journals.  These reports are added to the EFS database with
their own hierarchy so that they can be searched by end-users
either independently or in combination with the journals.
 
8.0  Searching
 
End-users can retrieve information from the TORPEDO system using
direct, content, and field searching techniques.
 
8.1  Direct Searching
 
A direct search is used when the end-user is looking for a
specific citation or is simply browsing the collection.
End-users move through a hierarchical menu structure for journals
and reports to find specific documents of interest.  In the case
of journals, the tables of contents are presented as the first
article in the appropriate volume series to facilitate browsing.
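
The hierarchical browsing described above can be pictured as a
walk through a nested structure.  The sketch below is only an
illustration: the catalog contents, the level names, and the
browse function are hypothetical and do not reflect how EFS
stores its hierarchy.

```python
# Hypothetical hierarchy: collection -> volume -> ordered documents.
# The table of contents is listed first in each journal volume.
CATALOG = {
    "Physical Review Letters": {
        "Volume 74": ["Table of Contents", "Article 1", "Article 2"],
    },
    "NRL Technical Reports": {
        "1994": ["Report A", "Report B"],
    },
}


def browse(tree, path=()):
    """Return the menu entries found after following `path` down the tree."""
    node = tree
    for key in path:
        node = node[key]
    # A dict yields its keys (sub-menus); a list yields its documents.
    return list(node)
```

For example, browse(CATALOG) lists the two collections, and
browse(CATALOG, ("Physical Review Letters", "Volume 74")) lists
that volume's documents with the table of contents first.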
 
8.2  Content Searching
 
A content search examines the full text of all documents to find
the word or phrase entered by the end-user.  The end-user also
has the option of limiting content searches to particular
journals or reports, volumes or issues, or any combination
thereof.  Boolean operators may also be used in content
searching, although the format used by EFS for Boolean searches
is not intuitive.
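
The idea of a scoped full-text search with Boolean operators can
be sketched with a small inverted index.  This does not reproduce
EFS's query syntax, which the article does not show; the sample
documents and the function names are assumptions made for the
example.

```python
from collections import defaultdict

# Hypothetical documents: id -> (collection, text).
DOCS = {
    1: ("PRL", "plasma turbulence in tokamak edge"),
    2: ("PRE", "chaos and turbulence in fluid flow"),
    3: ("PRL", "laser cooling of trapped ions"),
}


def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, (_, text) in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search_and(index, docs, terms, collection=None):
    """Ids of documents containing every term, optionally in one collection."""
    hits = set(docs)
    for term in terms:
        hits &= index.get(term.lower(), set())
    if collection is not None:
        hits = {d for d in hits if docs[d][0] == collection}
    return sorted(hits)


def search_or(index, terms):
    """Ids of documents containing at least one of the terms."""
    hits = set()
    for term in terms:
        hits |= index.get(term.lower(), set())
    return sorted(hits)
```

Here the collection argument plays the role of limiting a search
to a particular journal; limiting by volume or issue would add
further keys of the same kind.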
 
+ Page 12 +
 
EFS supports a fuzzy full-text retrieval concept called Adaptive
Pattern Recognition Processing (APRP).  APRP retrieves documents
by recognizing data patterns at a binary level.  As a result,
the data itself directs the creation of indexes that are highly
fault tolerant, which makes it possible to retrieve information
accurately from an approximation of the query terms or phrases.
Because the EFS retrieval software seeks patterns rather than
exact words or phrases, users can accurately search "dirty" ASCII
(raw OCR-processed text) without the need for ASCII cleanup or
rekeying.
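
APRP is a proprietary technique and is not reproduced here.  To
illustrate the general idea of fault-tolerant matching against
OCR errors, the sketch below substitutes a much simpler and
well-known method: comparing words by the overlap of their
character trigrams (Jaccard similarity).  The threshold value
and the sample words are assumptions chosen for the example.

```python
def trigrams(word):
    """Set of overlapping three-character substrings of a word."""
    word = word.lower()
    if len(word) < 3:
        return {word}
    return {word[i:i + 3] for i in range(len(word) - 2)}


def similarity(a, b):
    """Jaccard similarity of the two words' trigram sets (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def fuzzy_find(query, vocabulary, threshold=0.5):
    """Words whose trigram similarity to the query meets the threshold."""
    return [w for w in vocabulary if similarity(query, w) >= threshold]


# "rnagnetic" imitates a common OCR error in which "m" is read as "rn".
words = ["rnagnetic", "genetic", "acoustic", "magnetic"]
print(fuzzy_find("magnetic", words))
```

With these inputs the query "magnetic" matches both the clean
word and the OCR-damaged "rnagnetic", while "genetic" and
"acoustic" fall below the threshold.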
 
8.3  Field Searching
 
EFS supports field searching.  No more than 256 characters can be
entered into a field, and the fields established for all journals
and reports must be the same.  Bibliographic data for both the
APS journals and technical reports is being added to TORPEDO.
End-users will soon be able to search documents for specific
authors, titles, or years.
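
The two constraints mentioned above (a 256-character limit per
field and one shared set of fields for all journals and reports)
can be sketched as a simple validation step followed by a field
lookup.  The field names, the sample records, and the function
names are assumptions; EFS's actual field definitions are not
given in this article.

```python
MAX_FIELD_LEN = 256                    # from the text
FIELDS = ("author", "title", "year")   # hypothetical shared field set


def make_record(**values):
    """Build a record, enforcing the shared field set and the length limit."""
    if set(values) != set(FIELDS):
        raise ValueError(f"record must have exactly the fields {FIELDS}")
    for name, value in values.items():
        if len(value) > MAX_FIELD_LEN:
            raise ValueError(f"field '{name}' exceeds {MAX_FIELD_LEN} characters")
    return dict(values)


def field_search(records, field, needle):
    """Records whose given field contains the search text (case-insensitive)."""
    return [r for r in records if needle.lower() in r[field].lower()]


# Hypothetical sample records.
RECORDS = [
    make_record(author="Smith, J.", title="Plasma Waves", year="1994"),
    make_record(author="Jones, A.", title="Laser Cooling", year="1995"),
]
```

Because journals and reports share one field set in this sketch,
a single call such as field_search(RECORDS, "year", "1995") works
across both kinds of document.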
 
9.0  Electronic Journal Publishing
 
The APS is now in a position to make Physical Review Letters
available to the NRL Library in SGML format on a regular basis,
thereby eliminating the need to scan and OCR the paper copies.
Moreover, Physical Review E is now partially available in SGML
and will soon be available entirely in SGML, as will all of
Physical Review A through Physical Review D.  While the EFS
imaging system chosen by the Library for TORPEDO has no native
support for SGML (EFS only supports CCITT Group IV TIFF files and
ASCII), it does have support for third-party image, word
processing, and SGML display applications.  In fact, EFS can
import files in almost any format and integrate all of them into
one full-text and field searchable database.  The Library is
presently investigating the software viewers of several SGML
product vendors for possible integration with TORPEDO.
 
+ Page 13 +
 
10.0  Summary
 
The NRL Library and the APS have made significant strides in
making collections of physics journals and technical reports
available over networks to a large community of geographically
dispersed researchers who utilize a myriad of computing
platforms.  While these journals and reports currently originate
in paper format and are being converted into images only after
publication, a fully electronic publication system is coming
closer to production.  The lessons learned and the end-user
feedback coming from the TORPEDO project will have a critical
impact on the direction the APS pursues in its electronic
publishing efforts as well as the methods ultimately adopted by
the NRL Library in its quest to provide its research community
with a comprehensive digital library.
 
About the Authors
 
Roderick D. Atkinson, Electronic Resources Coordinator, Naval
Research Laboratory, Code 5220, Washington, DC 20375-5334.
Internet: rod@library.nrl.navy.mil.
 
Laurie E. Stackpole, Chief Librarian, Naval Research Laboratory,
Code 5220, Washington, DC 20375-5334.  Internet:
lauries@library.nrl.navy.mil.
 
-----------------------------------------------------------------
 
Article Formats
 
This article is available in both ASCII and HTML formats.
 
Network Access
 
     o    ASCII File
 
          List Server:
 
          Send the e-mail message GET ATKINSON PRV6N3 F=MAIL to
          listserv@uhupvm1.uh.edu.
 
+ Page 14 +
 
          Gopher:
 
          gopher://info.lib.uh.edu:70/00/articles/e-journals/
          uhlibrary/pacsreview/v6/n3/atkinson.6n3
 
     o    HTML File
 
          World-Wide Web:
 
          http://info.lib.uh.edu/pr/v6/n3/atki6n3.html
 
Publication Information
 
The Public-Access Computer Systems Review is an electronic
journal that is distributed on the Internet and on other computer
networks.  It is published on an irregular basis by the
University Libraries, University of Houston.  There is no
subscription fee.
 
To subscribe, send the following e-mail message to
listserv@uhupvm1.uh.edu: SUBSCRIBE PACS-P First Name Last Name.
 
To retrieve the cumulative index for the journal, send the
following e-mail message to listserv@uhupvm1.uh.edu: GET INDEX PR
F=MAIL.
 
PACS Review back issues (ASCII and HTML files) are available from
the University of Houston Libraries' World-Wide Web server:
http://info.lib.uh.edu/pacsrev.html.
 
Back issues (ASCII files only) are also available from the
University of Houston Libraries' Gopher server: info.lib.uh.edu,
port 70.
 
Copyright
 
This article is in the public domain.
 
+ Page 15 +
 
The Public-Access Computer Systems Review is Copyright (C) 1995
by the University Libraries, University of Houston.  All Rights
Reserved.