[This local archive copy mirrored from the canonical location: http://yuri.stanford.edu/ic1q97/final.htm; see this official version of the document.]

LARGE-SCALE ELECTRONIC DOCUMENT DISTRIBUTION AT PACIFIC BELL

Michael Leventhal, Member, IEEE, Jeffrey S. Kohl, David R. Lewis, and Ann K. Smith

Copyright Notice

This is a draft of a paper submitted to IEEE Internet Computing. This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.

If this paper is accepted for publication the IEEE claims the following specific copyright rights:

Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

ABSTRACT

At Pacific Bell we have developed a document distribution system which leverages a number of Internet technologies to solve extreme scaling requirements while staying cost-effective and meeting demanding business needs. The system, currently a fully operational prototype, is a combination of off-the-shelf and custom software. We are in the process of customizing our solution for specific internal organizations while introducing open document standards which enable the production of information compatible with our delivery system.

Prior Experience

In the early 1990s, one of the first client/server document distribution systems, Bellcore's SuperBook, was installed for two groups in Pacific Bell, each with its own server. In the mid-1990s, a newer version of SuperBook was installed for a third group, for a total of approximately 9,000 users. The documents for this group totaled at least eight feet of paper, or roughly 45,000 single-sided pages of text, tables and graphics. Two custom authoring and document management systems, together with conversion routines, were developed for the three groups. To use the network efficiently, twelve servers were strategically placed throughout California. SuperBook proved quite successful in terms of scalability: we had no problems supporting the user population even as it grew to 9,000 users and beyond, and no performance limits were reached as the library grew to roughly 45,000 pages of documents. Because SuperBook retrieved only the applicable piece or "chunk" of the document, client limitations such as memory or storage were avoided and the load placed on the network was kept low. SuperBook also benefited from significant Bellcore research in usability engineering and in hypertext and hypermedia [3] [4] [5] [6].

Difficulties with the SuperBook system included the proprietary nature of both the system and the markup language. Being only one customer of Bellcore, our requests for enhancements were weighed and compared with the requests of Bellcore's other customers for other work. Although we were primarily interested only in the Microsoft Windows client, any upgrades to the client had to be done for all the clients (Unix, Mac, and Microsoft Windows) thus slowing the rate of upgrades. In addition, there were no commercially available tools for creating the proprietary markup language and no "off-the-shelf" document authoring and management software that could interface with SuperBook.

In an effort to move to open standards and commercial products and to take advantage of the quickly developing Internet technologies, Pacific Bell issued a request for proposal (RFP) in October 1995 for a large-scale SGML browser. This paper provides an overview of the requirements and describes how various commercial components have been integrated to meet our business needs.

Requirements

Our requirements for the system were determined through a set of meetings with potential sponsors and users of the system from various organizations across Pacific Bell. These requirements included architectural requirements, user requirements and administrative and support requirements.

Architectural requirements included use of a distributed architecture with replication and load balancing, and an architecture extensible through APIs, a scripting language, interprocess communication features (e.g., OLE) or a messaging interface. A severe scalability requirement was imposed: the system was to scale to 50,000 users without degradation in document retrieval speed.

User requirements included high-speed retrieval of documents (little or no discernible delay) irrespective of the size of the document or the number of documents in the repository, display of SGML documents, library view navigation using a document table of contents, full-text searching, hypertext support, support for display of tables and graphics as well as printing support.

SGML is included here to leverage standards-based document interchange as a business strategy. In particular, it supports Pacific Bell's participation in the TeleCommunications Industry Forum (TCIF) initiative to develop and promote SGML-based document interchange formats between suppliers and consumers of equipment and services. It also enables us to employ structured document management for information resources developed internally.

Tools for systems administration were also specified as a requirement. These concerned overall systems surveillance, configuration management, repository management, and so on. Requirements related to scalability included, among others, load monitoring, and content replication and load distribution.

Overall, the RFP's detailed requirements were specified in twenty-nine separate categories and were weighted according to their relative importance. While we received a number of responses to this RFP, no single response satisfied the requirements sufficiently for us to select it as the basis for our document delivery system. There did not appear to be an "off-the-shelf" solution satisfying our requirements. We did, however, learn a great deal about candidate components which could be used to build (integrate) a solution. The system described in this article is the result of our current endeavor to develop and integrate a system based on these components.

Architecture

Client-Server System Architecture

Our design brings together web server, search, translation and presentation technology in a two-tier client-server system as illustrated in Figure 1. The hardware platforms used are at present one or more Unix hosts on the server side and some large number of web-browser-capable PCs or workstations on the client side. This figure also indicates the major software components located within the server and client platforms of the system. Since the hardware employed is not in any way novel, the following sections focus on the software architecture and describe these software components in detail.

Our system's software employs as a foundation the classic web architecture of an HTTP daemon process underlying the server-side components and a web browser anchoring the client-side components. Our system builds upon this foundation by adding document preparation, search and server-side TOC delivery/navigation components on the server, and interface framework, library navigation, search interface and TOC presentation/navigation components on the client. All client-side components except the web browser itself are delivered across the network via HTTP at the time they are first needed.

Server

The server side of the software architecture of our system is illustrated in figure 3. The workhorse of this piece of our delivery system is Open Text Inc.'s LiveLink Search (LLS) product, a search and retrieval engine which runs as a CGI process under the HTTP daemon. The LLS engine retrieves entire documents or selected regions of documents according to a "region query" passed to it as a parameter (see Region Queries). The documents themselves are stored in the server's file system along with indices which are rebuilt periodically. Indices and documents may be distributed across multiple servers for load distribution and/or redundancy; LLS's "Parallel Execution Monitor" broadcasts incoming queries to these servers to realize these goals.

Documents are added to the delivery system as the final step in a document loading and preparation process. While a complete explanation of this process is beyond the scope of this paper, the following highlights are offered. Documents arrive at the start of the preparation process either by being posted to the server using an HTML-form-based file upload tool, or as a side-effect of them being posted into a document management system employed within Pacific Bell. Once entered into this process, documents are subjected to one or more translation steps. In cases where a document doesn't arrive in SGML or HTML format, translation into one or the other of these formats is first performed. A translation application built using the Omnimark translation tool (Omnimark Technologies Inc.) is then used to extract the document structure data needed to implement the navigable table of contents (TOC); and, depending on the format of the source document, may also perform some optimization of the document content's markup.

When a document has been loaded into the system, its TOC elements are ready for use in the TOC database, and its contents are loaded into the LLS and indexed. When a document or document region is requested and retrieved by the LLS system, it is piped through a Results Processor, a program which performs any on-the-fly transformations required. Currently the Results Processor is a Perl program we have developed which highlights user-entered search terms in the text and also adds Previous and Next Region hyperlinks at the beginning and end of the text. We have also prototyped use of the Results Processor to convert documents from their source format, e.g., SGML, into the delivery format, HTML; in our current implementation, however, we perform such transformations off-line as described above.
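The two transformations attributed to the Results Processor can be illustrated with a minimal Python sketch. This is not the Perl program described above: the function names, the `/cgi-bin/search` path, the `query` parameter and the use of `<B>` for highlighting (in place of the arrow icons of the real interface) are all assumptions made for illustration.

```python
import re
from urllib.parse import quote

TAG = re.compile(r"(<[^>]*>)")  # capturing group so re.split keeps the tags


def highlight(markup: str, terms: list[str]) -> str:
    """Wrap each search term in <B>...</B>, leaving text inside tags untouched."""
    if not terms:
        return markup
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    parts = TAG.split(markup)
    # With one capturing group, re.split alternates text (even index)
    # and tags (odd index), so only the even-indexed parts are rewritten.
    for i in range(0, len(parts), 2):
        parts[i] = pattern.sub(lambda m: f"<B>{m.group(0)}</B>", parts[i])
    return "".join(parts)


def add_navigation(markup: str, prev_query: str, next_query: str,
                   cgi: str = "/cgi-bin/search") -> str:
    """Put Previous/Next Region links before and after the region text.

    The CGI path and parameter name are hypothetical.
    """
    links = []
    if prev_query:
        links.append(f'<A HREF="{cgi}?query={quote(prev_query)}">Previous</A>')
    if next_query:
        links.append(f'<A HREF="{cgi}?query={quote(next_query)}">Next</A>')
    nav = "<P>" + " | ".join(links) + "</P>"
    return nav + markup + nav


region = '<P CLASS="bell">Pacific Bell rings a bell</P>'
page = add_navigation(highlight(region, ["bell"]), "4.2.0.0", "4.3.0.0")
```

Splitting on tags before substituting keeps a search term that happens to occur inside an attribute value (here `CLASS="bell"`) from being rewritten.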

The server side of the distributed TOC delivery, presentation and navigation subsystem is built using a lightweight database program which contains the TOC for each document in the system and which executes as a CGI program on the server. The TOC, consisting of the set of "nodes" in the document's structure hierarchy, can amount to a significant quantity of data. Our distributed design eliminates the need to deliver all of this data to the client when only a small portion of it is used in the particular view of the TOC the user will see. The input to this program is a document identifier, the current state of the TOC on the client, and the reader's latest request. A request goes to the server only when the reader requests a portion of the TOC which is not present on the client. The program returns the previous state of the TOC and the new section requested by the reader, as well as any peer nodes at the most recently opened level of the TOC hierarchy. To take advantage of the principle of locality, it also returns some adjacent nodes, as permitted by size constraints. These constraints are set based on experience with the network load (the optimum size to be delivered to the client) and the performance of the JavaScript-based client side of the TOC subsystem.
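As a rough illustration of this request/response behavior, the Python sketch below returns the nodes visible after the reader opens a node and then prefetches the next level down until a size budget is reached. The data structure, the function names and the choice of "one level deeper" as the adjacent nodes to prefetch are assumptions; the actual TOC database program is not described at this level of detail.

```python
from dataclasses import dataclass, field


@dataclass
class TocNode:
    node_id: str
    title: str
    children: list["TocNode"] = field(default_factory=list)


def find_node(nodes, node_id):
    """Depth-first search for a node by id; returns None if absent."""
    for node in nodes:
        if node.node_id == node_id:
            return node
        found = find_node(node.children, node_id)
        if found is not None:
            return found
    return None


def visible_ids(nodes, expanded):
    """Ids of the nodes a tree view shows, given the set of expanded node ids."""
    out = []
    for node in nodes:
        out.append(node.node_id)
        if node.node_id in expanded:
            out.extend(visible_ids(node.children, expanded))
    return out


def toc_response(roots, expanded, requested, max_nodes):
    """Prior view plus the newly opened node's children, then prefetch."""
    payload = visible_ids(roots, set(expanded) | {requested})
    node = find_node(roots, requested)
    if node is None:
        return payload
    # Locality assumption: the reader is likely to open a child next,
    # so send grandchildren while the size budget allows.
    for child in node.children:
        for grandchild in child.children:
            if len(payload) >= max_nodes:
                return payload
            payload.append(grandchild.node_id)
    return payload


BOOK = [
    TocNode("1", "Ordering", [
        TocNode("1.1", "Entry", [TocNode("1.1.1", "Forms"), TocNode("1.1.2", "Codes")]),
        TocNode("1.2", "Changes", [TocNode("1.2.1", "Cancellation")]),
    ]),
    TocNode("2", "Billing", [TocNode("2.1", "Cycles")]),
]
small = toc_response(BOOK, set(), "1", max_nodes=5)
large = toc_response(BOOK, set(), "1", max_nodes=100)
```

Raising `max_nodes` trades a larger download for fewer later round trips, which is the balance the size constraints are meant to tune.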

Region Queries

Region queries may restrict searches to named components or regions within documents. For example,

"Pacific Bell" within title within chapter

will only return hits if the string "Pacific Bell" appears in a defined region called title which is within a defined region called chapter. Region queries can also be used to cause the component returned on a retrieval to be a specified region. For example,

chapter including "Pacific Bell" within title

explicitly makes the retrieval unit the entire chapter which contains "Pacific Bell" in the region title.
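The semantics of the two operators can be made concrete with a small Python model in which a region is a named span of character offsets. This is only a sketch of the containment logic, not the LLS query engine; the regex-based indexer, the right-to-left evaluation order and all names are assumptions.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str   # region name such as "chapter"; empty for a text hit
    start: int  # offset where the region begins
    end: int    # offset one past the region's last character


def index_regions(text: str, name: str) -> list[Region]:
    """Index non-nested <name>...</name> spans (toy stand-in for a region index)."""
    pattern = re.compile(rf"<{name}>.*?</{name}>", re.DOTALL)
    return [Region(name, m.start(), m.end()) for m in pattern.finditer(text)]


def find_phrase(text: str, phrase: str) -> list[Region]:
    """Every occurrence of the phrase, as an unnamed region."""
    hits, pos = [], text.find(phrase)
    while pos != -1:
        hits.append(Region("", pos, pos + len(phrase)))
        pos = text.find(phrase, pos + 1)
    return hits


def contains(outer: Region, inner: Region) -> bool:
    return outer.start <= inner.start and inner.end <= outer.end


def within(inner: list[Region], outer: list[Region]) -> list[Region]:
    """Keep the inner regions enclosed by some outer region."""
    return [i for i in inner if any(contains(o, i) for o in outer)]


def including(outer: list[Region], inner: list[Region]) -> list[Region]:
    """Keep the outer regions that enclose some inner region."""
    return [o for o in outer if any(contains(o, i) for i in inner)]


DOC = (
    "<chapter><title>About Pacific Bell</title>Body one.</chapter>"
    "<chapter><title>Other</title>Pacific Bell appears only in the body.</chapter>"
)
chapters = index_regions(DOC, "chapter")
titles = index_regions(DOC, "title")
hits = find_phrase(DOC, "Pacific Bell")

# "Pacific Bell" within title within chapter
title_hits = within(hits, within(titles, chapters))
# chapter including "Pacific Bell" within title
whole_chapters = including(chapters, within(hits, titles))
```

The first query yields the matching text itself, while the second yields the enclosing chapter, so the choice of operator determines the retrieval unit.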

Client

The client-side of our delivery system is built on the web browser provided by our standard desktop environment, currently Netscape Navigator. The special features of our system are implemented through the use of frames, HTML forms, and JavaScript.

The interface presented to users of the system is sketched in figure 4. ^* Three frames are presented to the reader. The main frame, on the right-hand side, holds the document proper. The left-hand side is for locating information: one of its two frames contains an HTML form from which the reader can send full-text, structural, and domain-restricted queries, and the other frame is the navigator. Three different views can be presented in the navigation frame (library, book TOC, and search results), depending upon the system's state.

The document frame contains the document region last selected by the reader from either the search results or the book TOC. In the former case hits will be highlighted with hyperlinked arrow icons which allow the user to jump between occurrences of the hits within the current region. There are also previous and next icons at the top and bottom of the document (region) which permit the reader to go to adjacent regions. While our system is not page-oriented, this feature provides a comfortable equivalent to "turning the page". Previous and next, like all document retrieval operations, are implemented as queries which are passed to the server and from there to the LLS and Results Processor.

The search frame is a simplified variant of the standard LLS query interface, but with specific knowledge about the organization of the document repository built in. Specifically, relevance and precision may be increased by limiting the domain of the search. For example, we have organized a repository along the lines of the SuperBook model with Libraries containing Shelves and Shelves containing Books. The search may be restricted to specific books, shelves, or libraries or may be repository-wide. The reader can also limit the scope of the information returned to hierarchically defined containers within the documents such as "chapter" or "section" or to content- or structure-defined containers such as "examples", "warnings", "tables", or "figure captions". The organization of the repository and the definitions of hierarchical, content, or structural containers vary among different groups within Pacific Bell. The search frame may be customized to accommodate these differences.

Library Navigation and TOC Subsystem

An expandable-contractable library/TOC presentation and navigation capability is built using an implementation distributed across client and server platforms. On the client-side, both library-level and book TOC-level presentation and navigation offer an expandable and collapsible tree view, using the same JavaScript-based client-side program. At start-up, or in response to user navigation, TOC data is fetched from the server-side TOC database and carried to the client within the dataspace of this program. When this program runs it draws the currently visible library or TOC hierarchy with certain tree nodes "expanded" according to the current state as developed during prior reader interaction with this structure.

A subtree of a library or document TOC hierarchy is expanded when the user clicks on a plus sign to the left of a node's description label (e.g., section title), and is collapsed by a click on a minus sign. The JavaScript program will modify the state of the TOC and immediately do a redraw if all needed portions of the hierarchy have previously been downloaded to the client; otherwise the mouse click generates a request to the CGI TOC database program on the server. In response to this request an updated TOC is downloaded. We have discovered in our testing that significant delays may occur on the client if the JavaScript program must process a large number of TOC entries, so this limitation must be balanced against the cost of extra server requests and downloads. Alternatively, a reimplementation in Java could reduce this delay by increasing processing performance.
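The decision between a local redraw and a server request can be sketched as follows. The sketch is in Python purely for illustration (the client program described here is JavaScript), and the class, its fields and the `fetch_children` callback standing in for the CGI request are hypothetical.

```python
class TocClient:
    """Tracks which TOC subtrees are downloaded and which nodes are expanded."""

    def __init__(self, fetch_children):
        self.fetch_children = fetch_children  # stands in for the CGI request
        self.children = {}      # node id -> child ids already on the client
        self.expanded = set()   # node ids currently drawn as expanded
        self.server_requests = 0

    def click(self, node_id: str) -> str:
        """Handle a click on a plus or minus sign and report what happened."""
        if node_id in self.expanded:
            self.expanded.remove(node_id)   # minus sign: collapse locally
            return "redraw"
        if node_id not in self.children:    # subtree not yet downloaded
            self.children[node_id] = self.fetch_children(node_id)
            self.server_requests += 1
            self.expanded.add(node_id)
            return "fetch+redraw"
        self.expanded.add(node_id)          # plus sign, data already local
        return "redraw"


SERVER_TOC = {"4": ["4.1", "4.2"], "4.2": ["4.2.1"]}
client = TocClient(lambda node_id: SERVER_TOC.get(node_id, []))
events = [client.click("4"), client.click("4"), client.click("4")]
```

Only the first expansion of a node costs a round trip; collapsing and re-expanding it afterwards is handled entirely on the client.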

Clicking on a leaf node in the library navigator causes a book TOC to be downloaded into the same frame. A hyperlink which toggles back to the library navigator is provided in the upper frame. Clicking on any node description text (rather than on the plus or minus sign which may precede it) in the book TOC causes all document content located beneath the subtree headed by this node to be downloaded into the text frame. The actual download request takes the form of an HTTP GET or POST message to the server which results in a CGI query to LLS for the appropriate region of the document.

When a user performs a search, the search results list also overwrites the navigator frame. Presentation of the search results list is customizable with respect to ranking criteria and meta-data offered such as the name of the container in which the hit occurs (e.g., book name, chapter title, section title) and the display of the proximate text.

State

Our system maintains a small amount of state information and maintains it entirely on the client: in cookies, embedded in interface forms, and in document content navigation links. The tree view program used in the navigator frame stores the state of each node (expanded or not expanded) in the cookie. The search form contains a hidden field of information with the current context. The next and previous regions adjacent to the currently visible document region are stored as hyperlinks in document content held in the view window.
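One plausible encoding of the expanded-node state as a cookie value is sketched below in Python. The cookie name `toc_state`, the `|` separator and the choice to store only the expanded node ids are assumptions; the actual cookie layout used by the tree view program is not specified here.

```python
from urllib.parse import quote, unquote

COOKIE_NAME = "toc_state"  # hypothetical cookie name


def encode_state(expanded: set[str]) -> str:
    """Serialize the expanded node ids as a single name=value cookie pair."""
    value = "|".join(quote(node_id, safe="") for node_id in sorted(expanded))
    return f"{COOKIE_NAME}={value}"


def decode_state(cookie_header: str) -> set[str]:
    """Recover the expanded node ids from a Cookie header string."""
    for pair in cookie_header.split(";"):
        name, _, value = pair.strip().partition("=")
        if name == COOKIE_NAME:
            return {unquote(node_id) for node_id in value.split("|") if node_id}
    return set()  # no state cookie: every node is collapsed


cookie = encode_state({"4", "4.2", "4.2.1"})
restored = decode_state("session=abc; " + cookie)
```

Because nodes absent from the set are implicitly collapsed, the cookie stays small even for a large TOC.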

Documents: SGML + HTML

Document markup and thus document preparation are just as critical to the system's capabilities as the other components of the delivery system. The very large number of documents which will be made available through our delivery system requires that the production process be streamlined for large-scale cost efficiency while meeting the specific requirements of a distribution system designed for a large-scale readership.

The ability to search and retrieve documents based on the hierarchy or content comes from the documents being hierarchically composed and marked for content in the first place. The technology for accomplishing this is Standard Generalized Markup Language (SGML). SGML defines a formal way to create document markup languages. The formal expression of the rules of the markup language can be employed in a variety of tools to create and check documents using those rules and to transform documents by manipulating the tokens delineated by the markup language. HTML, in fact, is a markup language defined by SGML. Unfortunately, HTML writers and HTML tools tend to honor SGML in the breach, impeding the application of SGML-based tools to the betterment of the World Wide Web.

The SGML markup in our system creates, as Douglas Engelbart expressed it in [2], "Explicitly Structured Documents -- where the objects comprising a document are arranged in an explicit hierarchical structure, and compound-object substructures may be explicitly addressed for access or manipulation of the structural relationships." Open Text's LLS provides us with the ability to index these hierarchical objects and to retrieve them using "region expressions", the general principle of which is described in [1].

Our documents originate from a variety of sources including Microsoft Word, native HTML, and SGML. In all three cases we require the authors or providers to, at a minimum, encode hierarchical divisions into the documents. Those hierarchical divisions are used to generate the TOC database and thus become basic units of content search and retrieval. The authors may also identify other units, or containers, of content (by use of tags or styles depending upon their authoring tools), and when they do, these become additional units of search and retrieval in the delivery system. These additional containers may be smaller (finer-grained) or larger than the hierarchical units, and may be based in traditional document technology ("example" or "paragraph", for instance) or may contain units meaningful within an application or knowledge domain.

Domain-specific containers, such as "product" and "question" in a FAQ document collection, can be used to improve precision in search and conciseness of returned content in recall. Containers (objects) such as these are defined within the knowledge domain of a particular application; for example, "order entry instructions" might be a meaningful object for sales representatives. Conversion scripts have been developed to map the optimal authoring environments in these application areas to SGML markup, which is useful for constructing navigation and search devices, and to HTML markup, which is useful in delivery.

Our current SGML markup encompasses HTML; that is, both HTML and non-HTML tags can exist in a document at the same time. The non-HTML tags provide the container and addressing information while the HTML markup describes content formatting. Currently, we are able to deliver these "SGML+HTML" documents directly to the Netscape client because it, like other current-generation browsers, simply ignores the unrecognized SGML markup. We have it at least informally from W3C that the HTML standard will include a formalization of the current treatment of unrecognized markup, but we recognize that this may be a shaky foundation. If future web browsers are built to be less lenient toward unrecognized markup, or if collisions in the tag name space occur, we can either make creative use of the existing HTML element set or strip the non-HTML tags from the content before delivering it to the client. We could do the latter either during preprocessing or on-the-fly in the Results Processor. A third option, rapidly developing into a distinct possibility, is to use our current element set with a Web browser capable of handling the simplified form of SGML called XML (Extensible Markup Language), which is being developed under the auspices of W3C for general Web use and as an alternative to HTML.
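The fallback of stripping non-HTML tags before delivery could look roughly like the Python sketch below. It is an assumption-laden illustration rather than a description of an existing component: the set of HTML element names is a small illustrative subset, and the regular expression handles only simple tags and comments.

```python
import re

# Illustrative subset of HTML element names; a real filter would use the full set.
HTML_TAGS = {"a", "p", "h1", "h2", "h3", "em", "b", "i", "ol", "ul", "li",
             "table", "tr", "td", "th", "img"}

# Matches either a whole comment or a single start/end tag (group 1 = tag name).
TOKEN = re.compile(r"<!--.*?-->|</?([A-Za-z][A-Za-z0-9]*)[^>]*>", re.DOTALL)


def strip_non_html(markup: str) -> str:
    """Drop tags whose names are not HTML, keeping their text content."""
    def keep(match: re.Match) -> str:
        name = match.group(1)
        if name is None:              # a comment: pass through untouched
            return match.group(0)
        return match.group(0) if name.lower() in HTML_TAGS else ""
    return TOKEN.sub(keep, markup)


sample = "<Sect3><H2><Heading>Title</Heading></H2><Warning><P><Em>Careful</Em></P></Warning></Sect3>"
delivered = strip_non_html(sample)
```

Comments are matched as single tokens so that addressing data hidden inside them is neither exposed nor half-stripped.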

SGML+HTML Markup Example

<P>This file can be region indexed and structurally searched while also displaying as expected in an HTML browser</p>
</Sect3>
<Sect3>
<!--<Id><Seq>4.2.1.0</Seq><Next>4.3.0.0</Next> <NextLev>2</NextLev><Prev>4.2.0.0</Prev><PrevLev>2</PrevLev> <Book>2002d</Book><Lib>3</Lib><Shelf>1</Shelf></Id>-->
<A NAME="toc2"><H2><Heading>SGML+HTML Markup Example</Heading></H2></A>
<Warning><P><Em>Correct Display of SGML+HTML Code is dependent on common but not-standardized browser features. Alternatives include use of <div> elements and "class" attributes or use of browsers that support generic markup, a.k.a., SGML.</Em></P></Warning>
<P>There are three types of additional SGML markup in our implementation (displayed in italics in this example):
<Ol>
<Li><P>Addressing, i.e., everything inside id elements (must be enclosed in comments because HTML browsers currently do not have the ability to hide element content).</P></Li>
<Li><P>Nested section containers</P></Li>
<Li><P>Content object containers such as Warning and Heading above</P></Li>
</Ol>
</P>
</Sect3></Sect2></Sect1>
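To show how the addressing record hidden in the comment might be read by a program such as a results processor, here is a minimal Python sketch. The function name and the regular expressions are assumptions; the sample string reproduces the addressing comment from the example above.

```python
import re

ID_COMMENT = re.compile(r"<!--\s*<Id>(.*?)</Id>\s*-->", re.DOTALL)
FIELD = re.compile(r"<(\w+)>(.*?)</\1>")  # a simple, non-nested <Name>value</Name>


def parse_address(markup: str) -> dict[str, str]:
    """Extract the fields of the first <Id> addressing comment, if any."""
    match = ID_COMMENT.search(markup)
    if match is None:
        return {}
    return dict(FIELD.findall(match.group(1)))


sample = (
    "<Sect3><!--<Id><Seq>4.2.1.0</Seq><Next>4.3.0.0</Next> <NextLev>2</NextLev>"
    "<Prev>4.2.0.0</Prev><PrevLev>2</PrevLev> <Book>2002d</Book><Lib>3</Lib>"
    "<Shelf>1</Shelf></Id>--><H2>SGML+HTML Markup Example</H2></Sect3>"
)
address = parse_address(sample)
```

The resulting dictionary carries the section's own sequence number, its neighbors, and its place in the library, shelf and book hierarchy.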

Region Query Examples

A query generated as the result of the user requesting an entry in the TOC would take the following form (syntax generalized):

<SECT3> including "4.2.1.0" within <SEQ> and "2002d" within <Book>

and would return the entire section entitled "SGML+HTML Markup Example".

A full-text, ad-hoc region-specific query could take the following form:

<WARNING> including "SGML+HTML"

and would return warning sections containing the string "SGML+HTML" (a full-text query would actually return a result list and issue a second query like our first example in order to return the specific warning section desired by the user).
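Queries of the first form are regular enough to be generated mechanically. The Python sketch below builds one from a section level, sequence number and book identifier, and derives Previous and Next queries from an addressing record. Reading `PrevLev` and `NextLev` as the section levels of the neighboring regions is an assumption, as are the function names.

```python
def toc_query(level: int, seq: str, book: str) -> str:
    """Build the generalized TOC retrieval query for one section."""
    return f'<SECT{level}> including "{seq}" within <SEQ> and "{book}" within <Book>'


def neighbor_queries(address: dict[str, str]) -> dict[str, str]:
    """Build Previous/Next region queries from an addressing record.

    Assumes PrevLev/NextLev give the section level of the neighboring region.
    """
    book = address["Book"]
    return {
        "previous": toc_query(int(address["PrevLev"]), address["Prev"], book),
        "next": toc_query(int(address["NextLev"]), address["Next"], book),
    }


address = {"Seq": "4.2.1.0", "Next": "4.3.0.0", "NextLev": "2",
           "Prev": "4.2.0.0", "PrevLev": "2", "Book": "2002d"}
current = toc_query(3, address["Seq"], address["Book"])
neighbors = neighbor_queries(address)
```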

Comparison of Technologies

The table below compares the features of our new delivery system to our existing tool, SuperBook, and to traditional web document delivery.

Feature	SuperBook	"Traditional" Web	Large/Structured Document Delivery System
Transport Protocol	Proprietary	HTTP	HTTP
Server Software	Proprietary	HTTPd	HTTPd + Server-side Components (Index/Region Server + TOC Server)
Client Software	Thick Proprietary	Web browser	Web browser + Thin Client-side Components
Chunking	yes - fixed	no	yes - Reader chooses granularity from available components. Possible granularity of components is determined by document structure which is in turn the result of the application of structured authoring practices.
Structure-aware search	no	no	yes
Structure-aware retrieval	no	no	yes - (see chunking)
Document Format	Proprietary	HTML	SGML + HTML
State Data Location	Server	None	Client
State Data Preserved	Current Page	Document	Current Region + Expanded TOC Nodes

Messaging/Client-Server Operation

Figures 5a through 5d illustrate the messages passed between client and server during system operation.

Principles in a Design for Scalability

We enumerate below the principles which both guided and enabled our project, principles shared with many large-scale Internet engineering projects today. All points have a direct relationship to the technology of large-scale systems or an indirect but still critical bearing on our ability to design and maintain those systems at a reasonable cost.

We believe that each of the principles is well-established, on its own, both in fundamental research and in engineering practice. Some of our work is based on experience in various disciplines using SGML, an area which until recently has been poorly integrated into the mainstream of IR research. For this reason we cannot point to a body of research which, for example, validates the use of region queries and region retrieval as a scalable IR technique which produces such-and-such percent improvement in precision and recall. We have practical experience on other large projects which leads us to believe that region queries belong in the panoply of powerful IR techniques, and we hope this paper will promote further interest in this topic. In terms of the totality of our system architecture and design principles, we think the combination of elements we have pulled together has resulted in an original and unique design. Again, there is no master cookbook for document system designs which could assure us of theoretical validity in its totality. As a conceptual design, we believe our system is founded on a set of principles which operate together with logical integrity. We have now completed a prototype implementation of this system and have tested this prototype on limited numbers of users. Our results have been quite positive and consonant with our expectations based on the conceptual model. It remains for us to expand and test our implementation in further evaluation environments before we declare our design a success or failure and are prepared to proceed to large-scale deployment. We hope we will have the opportunity to present an analysis of our results in a future issue of Internet Computing.

Open Systems

Each component in our delivery system provides a standard or widely-accepted interface to other components in the system. The availability of open, modular components has enabled us to architect the best overall strategy for our environment and to further customize our approach for our sub-organizations. We have not had to build any of the base technologies (document encoding, search, server, presentation) from scratch.

Vendor Independence

Scalability, as a function over time,¹ benefits greatly from the ability to take advantage of improvements in technology as scale increases. Maintaining vendor independence gives the maximum freedom to adopt best-of-breed technologies as they appear and also keeps the pressure on vendors to keep technologies competitive. Document production, delivery, and presentation are decoupled, and we continue to evaluate and integrate multiple tools in the areas of authoring (word processors, DTP, and SGML), search, and browsers (HTML, PDF, SGML, and others).

Zero Added Cost Client Administration

The maintenance of a common set of computer desktop tools and capabilities in a large corporation with a heterogeneous body of users and equipment is a mind-boggling undertaking and perhaps the greatest single challenge in enterprise client-server computing.² Our thin client built on the Web browser achieves zero-added-cost roll-out and administration for our Netscape-equipped Standard Desktop while still distributing computing, being network-efficient, and meeting our user interface objectives. Strictly speaking, we should perhaps distribute the cost of maintaining the Web browser among all the applications that take advantage of its generic interface, but in any event it is clear that the client administration overhead of our document distribution system is drastically lower than that of a thick client system like SuperBook. The number of workstations with access to the document distribution system can always be immediately scaled to the number of workstations equipped with our Standard Desktop. From the point of view of client administration there is no limit within our enterprise to the scalability of the thin client solution.

Balance the Distribution of Computation

Server load is reduced and controlled by pushing some of the computation which could take place at delivery time off-line. The use of region indices, computed at off-hours, greatly reduces the overhead for retrieval. This in turn was enabled by design of the SGML+HTML documents, again generated off-line, which can both be region-indexed and delivered to the browser without on-the-fly transformation. TOC processing is done on the client, while database retrieval operations on large TOCs are on the server. The size and frequency of TOC downloads can be adjusted and balanced by the library administrator.

Structured Documents

Our insights into the natural and traditionally implicit structure of documents came out of our experience with SGML. We require our authors to make that structure explicit using a variety of authoring tools and techniques, from word processor stylesheets to full structural (SGML) editors, and we use that structure to chunk information and level system demand on delivery. Unlike in pure SGML systems, correct stylesheet usage is not checked when the document is created, but we perform a series of validations during subsequent processing of the document. One of our greatest challenges in creating a large-scale volume of information is to retrain authors accustomed to crafting printed pages using WYSIWYG software. (See [8] for an excellent critique of the problem of moving from WYSIWYG systems to structured document authoring. Engelbart says WYSIWYG will give way to WYSIWYN[eed], that is, authoring tools will provide "different options for how you'd view selected portions of the document space". [1]) Change overnight in a large organization like ours is not feasible, so the emphasis has been on incremental improvement. In lieu of having authors produce highly and accurately structured documents, a lot of our effort is currently being put into creating special-purpose, custom filters which can take loosely structured word processing documents and recognize implicit structure from sequential, textual, and visual patterns. In the SGML world this is part and parcel of the highly developed art of legacy conversion as described in numerous papers such as [7].
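A very simple example of recognizing implicit structure from a textual pattern is sketched below in Python: lines that begin with a dotted section number are treated as headings, and nested section containers are opened and closed accordingly. The heading pattern, the element names and the function are illustrative assumptions and do not describe the custom filters themselves, which must cope with far messier input.

```python
import re

# A line such as "4.2.1 Markup Example": dotted number, space, heading text.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+(\S.*)$")


def infer_sections(lines: list[str]) -> list[str]:
    """Wrap numbered headings and their text in nested <SectN> containers."""
    out, open_levels = [], []
    for line in lines:
        if not line.strip():
            continue                      # blank lines carry no structure
        match = HEADING.match(line)
        if match is None:
            out.append(f"<P>{line}</P>")
            continue
        level = match.group(1).count(".") + 1
        # Close every open section at this depth or deeper.
        while open_levels and open_levels[-1] >= level:
            out.append(f"</Sect{open_levels.pop()}>")
        open_levels.append(level)
        out.append(f"<Sect{level}><Heading>{match.group(2)}</Heading>")
    while open_levels:
        out.append(f"</Sect{open_levels.pop()}>")
    return out


structured = infer_sections(
    ["1 Ordering", "Intro text.", "", "1.1 Entry", "Steps.", "2 Billing"]
)
```

A heuristic this crude misreads any paragraph that happens to start with a number, which is one reason such filters are paired with later validation steps.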

Write Once/Read Many³ Benefit Analysis

One of the primary objections to structured authoring is that it requires more upfront effort. Whether or not that is true, the simple fact that effort applied at the authoring side can be repaid many times over by the effectiveness of the delivery of information to a large number of clients is often overlooked. Sometimes the authoring team does not have a sense of their documents as a product which adds value to the company when delivered effectively or sometimes there is a lack of appreciation for the economics of large-scale document distribution. Our project has been very effective in motivating interest in structured authoring because the benefits are clear.

Deliver Only What the Reader Wants

The reader is not forced to think about downloading the optimum-sized piece of information, as the TOC and region delivery direct him or her to a component of the size needed. A reader can choose, however, between downloading a chapter at once in its entirety or section by section. On average the size and frequency of downloads will settle at the optimum balance as readers learn to respond intelligently to system performance.
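The choice between whole-chapter and section-by-section retrieval can be sketched as follows. This is an illustrative model only, not our delivery code: the class names, the `toc` and `fetch` methods, and the use of character counts as a stand-in for download size are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Section:
    title: str
    body: str


@dataclass
class Chapter:
    title: str
    sections: list = field(default_factory=list)

    def toc(self):
        """Table of contents: (index, title, size in characters) per section."""
        return [(i, s.title, len(s.body)) for i, s in enumerate(self.sections)]

    def fetch(self, section_index=None):
        """Return one section, or the whole chapter when no index is given."""
        if section_index is None:
            return "\n\n".join(s.body for s in self.sections)
        return self.sections[section_index].body
```

A reader who needs only one procedure would call `fetch(1)` and receive that section alone, while a reader who intends to read the chapter through would call `fetch()` once. Exposing sizes in the table of contents is one hypothetical way to let readers judge the cost of each request.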

Detach Delivery from Authoring

We mentioned that some authors may object to the additional effort imposed by structured documents. But that extra effort may be paid back not only by the efficiency of the delivery system but also by the effective life-span of the documents they create. Structured authoring leads to more tool-independent, reusable, multi-purpose documents. We also mentioned that scalability is a function of time; major changes in tools and proprietary formats every two years were standard, but now we are faced with Web "dog-years" and six-month cycles. For example, we anticipate significant changes in HTML before we have rolled out our system to the majority of users at Pacific Bell⁴. We will be able to take advantage of those changes. The intrinsic properties of information are eternal but the format du jour is not.

Automation and High-Volume Document Production

WYSIWYG DTP tools have enabled and required the technical writer to take over many of the tasks in the publication production workflow which were formerly performed by specialists in those areas - e.g., graphics designers, typesetters, pasteup artists, illustrators. ⁸ While this approach has a number of benefits, it may not be the most efficient production model for the creation of a large volume of corporate information. Structured authoring reintroduces the idea of a division of labor and also promotes the consolidation and automation of repetitive tasks. In a structured authoring environment the design of document structures is performed by a document architect, the mapping of document structure to a page layout is performed by a formatting expert, and the mechanics of page composition are handled by a programmer. The author's primary concern becomes writing, just writing. A side-effect of this process, which turns out to be a great advantage in automating document production, is that the idiosyncrasies of individual authors working with WYSIWYG tools largely disappear. WYSIWYG paper production tools virtually guarantee that format inconsistencies which are inconsequential on the printed page will cause problems when the documents are processed by text processing software. We are working toward setting up high-volume production processes which minimize the amount of manual "fix-up"⁵, an objective which is entirely complementary to our move toward structured authoring in order to enjoy the benefits of document structure for document distribution.
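The cost of manual "fix-up" estimated in footnote 5 can be reproduced with a short calculation. The sketch below assumes a person-year of 2,080 working hours (52 weeks of 40 hours); that figure is our assumption for illustration and is not stated in the footnote.

```python
def person_years(pages, errors_per_page, minutes_per_fix, hours_per_year=2080):
    """Estimate manual clean-up effort in person-years.

    hours_per_year is an assumed figure (52 weeks x 40 hours).
    """
    total_hours = pages * errors_per_page * minutes_per_fix / 60
    return total_hours / hours_per_year


# Footnote 5: 100,000 pages, 1 error per page, 10 minutes per fix.
estimate = person_years(100_000, 1, 10)
print(f"{estimate:.1f} person-years")
```

Under that assumption the estimate comes to roughly 8 person-years, in line with the footnote. Because the cost is linear in each input, halving either the error rate or the time per fix halves the effort.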

Summary

One of the things we hoped to get across in this paper is that there is an intimate relationship between the scalability of technology and the way that human beings use that technology. We have found that we can neither count on the technology being used efficiently in the absence of a carefully designed plan for scalability nor obtain scalability simply by "educating users". The technology must be crafted so that the rewards ...and punishments... condition both content developers and readers to make the most efficient use of resources from a global point of view. The incentive for the reader to learn to navigate, search, and retrieve in the way that is most efficient for the network is that it is also the way the information sought is most quickly found and downloaded. The incentive for creating structured information lies in lower long-term production costs, better longevity and therefore increased payback from development costs, and higher intrinsic value through the increased efficiency of the reader and of system utilization. Engelbart in [2] discusses the relationship of explicitly structured documents to knowledge-domain interoperability, not scalability. One of the greatest upsides of our project is that we have been able to obtain scalability while also taking a large step toward enabling us, in the future, to optimize the use of our corporate information as an enhanced knowledge repository.

Bibliography

F. J. Burkowski, C. L. A. Clarke, and G. V. Cormack, "An Algebra for Structured Text Search and a Framework for its Implementation," The Computer Journal, 38(1):43-46, 1995. (electronic version @ ftp://cs-archive.uwaterloo.ca/cs-archive/CS-94-30)

D. C. Engelbart, "Knowledge-Domain Interoperability and an Open Hyperdocument System," Proc. of the Conf. on Computer-Supported Cooperative Work, Los Angeles, CA, October 7-10, 1990, pp. 143-156. Republished in Hypertext/Hypermedia Handbook, E. Berk and J. Devlin, eds., McGraw-Hill, 1991. pp. 397-413. (electronic version @ http://beluga.dc.isx.com/bootstrap/final/augment-132082.htm)

J. R. Remde, L. M. Gomez, and T. K. Landauer, "SuperBook: An Automatic Tool for Information Exploration - Hypertext?", Hypertext `87 Papers.

D. E. Egan, J. R. Remde, L. M. Gomez, T. K. Landauer, J. Eberhardt, and C. C. Lochbaum, "Formative Design-Evaluation of SuperBook," ACM Transactions on Information Systems, Vol. 7, No. 1, January 1989, pp. 30-57.

D. E. Egan, J. R. Remde, T. K. Landauer, J. Eberhardt, C. C. Lochbaum, and L. M. Gomez, "Acquiring Information in Books and SuperBooks," Machine-Mediated Learning, Vol. 3, 1989, pp. 259-277.

C. Palowitch and D. Stewart, "Automating the Structural Markup Process in the Conversion of Printed Documents to Electronic Texts," Draft, possibly published in Digital Libraries '95, March 1995. (electronic version @ http://sil.org/sgml/palowitchdl95paper.html)

C. Taylor, "What has WYSIWYG done to us?", The Seybold Report on Publishing Systems, Volume 26, Number 2, September 30, 1996. (electronic version @ http://www.datatext.co.uk/ideography/library/seybold/WYSIWYG.html)

¹Demand always increases. Corollary of Moore's Law.

²If a technician were to spend, say, 2 hours per user per upgrade, approximately 50 person-years would be required to update tools at a company the size of Pacific Bell.

³"Write and Modify A Few Times/Read a lot of Times" as a title for this section would be a more accurate reflection of the authoring process but was eschewed because the reader might miss the allusion to the parallel concept in storage technology.

⁴A corporate-wide rollout may take as long as several years. For example, it has not heretofore been considered necessary to provide access to computer communication technology to a significant number of workers so the fullest deployment of this technology remains a serious undertaking. Our document distribution system is, in fact, one of the compelling reasons, both in terms of utility and cost-effectiveness, for entertaining the prospect of the widest possible rollout.

⁵At 1 error per page, requiring 10 minutes to fix, a small library of 100,000 pages requires 8 person-years of manual "clean-up".

*We regret that a screen-captured image of the current interface cannot be provided at this time due to intellectual property concerns.