NOTE:

This paper was given at the 1994 OmniMark User's Group Meeting (OMUG) in Tyson's Corner, Virginia, on November 6, 1994.

MANAGING WEB DOCUMENTS WITH OMNIMARK

Norman E. Smith
Science Applications International Corp.
P.O. Box 2501
301 Laboratory Road
Oak Ridge, TN 37831-2501
(615) 576-2276
smith@zeus.osti.gov

Abstract

The Department of Energy (DOE) Office of Scientific and Technical Information (OSTI) set up its World Wide Web (WWW) Server as a Standard Generalized Markup Language (SGML) application from the very beginning. SGML processing is built around OmniMark. Web HyperText Markup Language (HTML) documents are parsed with OmniMark and SGML syntax errors are corrected before the documents are loaded on the production Web Server. As the number of documents on a server grows, automating hypertext links becomes an absolute necessity to prevent dangling hyperlinks. SGML provides the automation vehicle for the OSTI Web Server. Hypertext links are managed via SGML and the parsing process. Each document is given a logical name, which is set up as an SGML entity reference. The SGML entity contains the Universal Resource Locator (URL) for the document. The OmniMark program substitutes the proper URL for the logical name reference, automatically generating valid hyperlinks. The SGML approach has made possible several complete reorganizations of the file structure on the Web Server with minimal impact on either outside access or staff sanity. This paper examines the use of OmniMark in managing Web Servers from an SGML perspective.

This document describes work performed at the DOE Office of Scientific and Technical Information under contract DE-ACO5-91MA40061.

Background

The Department of Energy (DOE) Office of Scientific and Technical Information (OSTI) is a major player in the area of Web servers. OSTI installed the Web software and began developing a server in January 1994. Early success of pilot efforts to put full text documents on the system and demonstrate the potential capabilities of the system resulted in OSTI maintaining the official DOE Home Page (http://www.doe.gov or http://apollo.osti.gov) as well as hosting home pages for several other DOE organizations. The DOE Home Page is the centralized starting point for accessing information throughout the Department.

SGML, an integral part of the World Wide Web (WWW) effort at OSTI since the installation of the first pilot server, is also at the heart of the HyperText Markup Language (HTML) processing to prepare documents for installation on the server. HTML is an SGML application. There is an HTML Document Type Definition (DTD) that defines the HTML tag set and document structure. The original HTML DTD has been modified extensively by OSTI to track capabilities available in popular Web viewers and to add tags for local use. We call the modified HTML Document Type Definition MOSAIC.DTD. This DTD makes it possible to use general purpose SGML tools and features that are not available in HTML-specific processing software.

What Is A Web System

The World Wide Web can be thought of as a loose web of information. Built on the concept of information sharing and document hyperlinking, the web was originally developed at CERN in Geneva as a means of sharing information. The original Web server software and viewer were developed there. Several other organizations have since begun developing server and browser software.

The pieces of a Web system include:

Web server
Web client
Web documents

What Does Managing Your Web Server Data Mean?

The short answer is treating the Web server as a production system. It deserves the same care and feeding as your payroll system. If the information does not have value, why make it available on a Web server in the first place?

The problem with Web servers is that they are so easy to put up and feed HTML documents into that many are started as part-time, ad hoc projects by one of the system people because Web servers are neat. Because system installation is so easy, planning is often bypassed. The result is that the server grows so fast that by the time you realize it needs to be managed, there is just too much information stored to go back and do it right.

So what should you do? Some up front planning will pay off many times over the long term. Realize that several areas of expertise are necessary and are rarely found in one person. Use SGML to facilitate document management. Parse documents. Follow configuration management principles.

Parsing HTML documents ensures markup consistency. Doing just that will make other processing much easier. Get staff used to parsing documents the same way they automatically spell check documents. Parsing with a translating parser opens additional doors. For example, additional tags can be added for your own use and dropped before the document is installed on the server. Hyperlink generation can easily be done by a translating parser during the normal parse processing.

Document conversion is also a big job once a Web server grows past the initial few "Who We Are" documents that are usually made available first. The critical point here is automation of the document conversion process. This requires translation programs developed by a programmer along with someone to do the actual document conversion. At OSTI, the main translation programs are OmniMark programs, supported by WordPerfect macros and AWK scripts.

The home-grown approach taken at OSTI was arrived at after an examination of several alternative, publicly available conversion utilities. Each utility seemed to have different deficiencies, ranging from dropping important pieces of information to not running on the document conversion platform. An important part of the OSTI approach was the decision to support the lowest common denominator document format, ASCII text.

Authoring Web documents from scratch is an entirely different process from document conversion. The current best HTML authoring tool, HoTMetaL, is not appropriate for document conversion.

Document Management

Most organizations dedicate a workstation class computer to the Web server role and place access restrictions on the network via electronic firewalls to prevent Internet access to internal networks. This extra security minimizes the risk of computer break-ins. If the rest of the computer installation is worth protecting this way, then so is the information residing on the Web server.

Good configuration management principles dictate that multiple copies of data be kept in separate locations. Baseline Web data is stored on separate computer systems at OSTI. At OSTI, the production Web server documents reside on a SparcServer 630 and the baseline documents are kept on the internal PC network. It is important to have a copy that can be used to rebuild the entire Web server should its data ever be damaged.

When a document is updated, the copy on the PC network is changed. The modified document is parsed with an OmniMark parsing program and any SGML errors are corrected before the document is moved to the Web server. The changes are reviewed prior to the document's installation on the production server area. Even an informal configuration management program as described here can have enormous benefits for the overall quality of a Web system.

Document management can easily be introduced into the production process, both in the course of normal document maintenance and during document conversion. A simple tag set can be added to HTML documents to track progress through the conversion cycle and to record changes. At OSTI, just such an approach has been adopted. The autotagging process automatically puts a skeleton document management tag set into each document. Dates, status, and other relevant information are recorded. After a document has been placed into production status, document changes are tracked. This document management approach does not replace a real document management system, but it does raise awareness of the need.

Why SGML

SGML is a very important cog in the Web wheel at OSTI. Hyperlink maintenance, configuration management, document management, and overall system quality can be traced directly back to the use of SGML. SGML tools ease the maintenance of HTML documents. SGML-based editors validate documents in the authoring process. SGML parsers validate document structure and tags. SGML translators turn logical document names into full URL hyperlinks. OmniMark performs both functions at OSTI. SGML filter programs automatically generate Tables of Contents and split large documents into manageably sized chunks. All of these things contribute to a well-run production system. And since SGML is a general solution that can be applied to other document manipulation problems, valuable SGML experience is gained that can be effectively applied to other projects.

SGML documents have a consistency that facilitates automated manipulation, even with non-SGML tools. Ease of manipulation is a very important, even critical, side effect of having data and documents in SGML form. SGML documents can be reused in different applications ranging from paper publishing to databases to CD-ROM applications.

Processes that come with the SGML territory include:

Document parsing
Hyperlink maintenance
Document conversion

A special program called an SGML parser does document parsing. The parser loads the DTD, then reads the SGML document. The parsing process validates document structure, performs entity substitutions, and validates tag names. Some parsers, such as OmniMark, are also capable of performing additional processing. These are called translating parsers. SGML parsers that only check document structure are called validating parsers.

Parsing HTML documents is an important step in document production at OSTI. Most document browsers are very forgiving of bad markup; they simply ignore invalid tag names, allow headings inside lists, and tolerate many other markup errors. Markup errors tend to build up over time and may contribute to unexplained problems. For example, Mosaic only loaded the first 64 hyperlinks of a large document that contained numerous SGML errors. However, after the document was parsed and the markup errors corrected, all of the more than 200 hyperlinks loaded and functioned properly.
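
As a rough illustration of the kind of error a parser catches and a browser silently accepts, the following Python sketch flags heading tags that are left open or closed by the wrong end tag. It is not how OmniMark validates a document: a real SGML parser checks the entire document against the DTD. The function name and the restriction to heading tags are assumptions made for this example.

```python
import re

# Matches start and end tags for headings H1 through H6, in either case.
HEADING_TAG = re.compile(r"<(/?)(h[1-6])\b[^>]*>", re.IGNORECASE)

def heading_errors(text):
    """Return a list of messages describing unbalanced heading tags."""
    errors = []
    open_heading = None  # name of the heading currently open, if any
    for match in HEADING_TAG.finditer(text):
        is_end = match.group(1) == "/"
        name = match.group(2).lower()
        if not is_end:
            if open_heading is not None:
                errors.append(f"<{open_heading}> not closed before <{name}>")
            open_heading = name
        elif open_heading is None:
            errors.append(f"</{name}> has no matching start tag")
        else:
            if name != open_heading:
                errors.append(f"<{open_heading}> closed by </{name}>")
            open_heading = None
    if open_heading is not None:
        errors.append(f"<{open_heading}> never closed")
    return errors
```

A check this narrow says nothing about list structure, attributes, or undeclared tags, all of which a DTD-driven parse covers.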

The second function of parsing at OSTI is automatic hyperlink generation. This additional processing is possible in the parsing step through the use of OmniMark. After the parser expands the entity references (logical document names) into URLs, the URL is written into the parsed document. A primary advantage of this approach is that the value of a URL is changed in only one place. The documents that reference it are then reparsed, and this can be automated.
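
Finding the documents that must be reparsed after a URL changes is easy to automate. The Python sketch below is one hedged way to do it, not the OSTI implementation: it assumes that baseline files carry the .RAW extension, that references are written in the form &name; with the closing semicolon, and that the logical names shown are hypothetical.

```python
import re
from pathlib import Path

def documents_referencing(entity_name, baseline_dir, pattern="*.RAW"):
    """Return the baseline files that contain the reference &entity_name;."""
    reference = re.compile(r"&" + re.escape(entity_name) + r";")
    hits = []
    # Walk the whole baseline directory tree in a stable order.
    for path in sorted(Path(baseline_dir).rglob(pattern)):
        text = path.read_text(encoding="ascii", errors="replace")
        if reference.search(text):
            hits.append(path)
    return hits
```

The returned list can be handed to whatever batch job runs the parsing program, so that a change to one entity definition touches only the documents that use it.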

Hyperlink maintenance might not seem a critical issue when there are only a few documents on your server. It becomes a big problem with just a few dozen documents. Manual hyperlink maintenance is both very time consuming and virtually impossible to do without errors. The result is usually user frustration caused by dangling hyperlinks.

Document Conversion

Manual document conversion is both error prone and time consuming. Again, SGML plays a significant role in the automated document conversion process at OSTI. The areas SGML impacts include:

Autotagging documents
Hyperlinked Table of Contents generation
Detecting markup errors
Automatic hyperlink maintenance
Document Management

A critical process to automate is document tagging. There are a variety of approaches possible here. The OSTI approach has evolved into a combination of steps that may include WordPerfect macros to preserve emphasis markup, an OmniMark program for the actual autotagging, another OmniMark program to generate internal Table of Contents hyperlinks, and a variety of AWK scripts for specific data cleanup.

The current document conversion system does about 80-95% of the document conversion work, depending on the format of the original document. The percentages have improved as the programs have been refined over a growing number of large document conversions. The OmniMark autotagging program is continually being improved.

In the OSTI environment, every document is essentially a custom conversion. Document format and style consistency saves time and cost in the document conversion process because consistency makes the process more amenable to automation. The markup consistency achieved by the autotagging process makes it practical to create hyperlinked internal Tables of Contents and split large files into smaller, more manageable pieces automatically. Good first steps for any organization toward document consistency are style guides and word processor style sheets. It is not enough to just have style guides and style sheets; they must be used!

The document management tag set consists of a container tag, <DOCMGMT>, and tags for information that identifies the document, who is working on its conversion to HTML, dates, and any general comments deemed necessary by the conversion staff. Together they essentially form an electronic routing sheet.

SGML Approach vs HTML Approach

There is no contest in choosing the SGML approach for managing HTML documents. Good HTML-specific tools exist. However, very few perform any kind of validation of either document structure or tags. The net effect is that creating incredibly bad documents (from a markup perspective) is incredibly easy. Users and managers are lulled into thinking that HTML does not require the same care and feeding as other production systems.

The HTML-only approach to managing documents locks users into HTML. Migration to future HTML versions is questionable because of the lack of markup consistency in documents created in this environment. Conversion becomes almost a manual process because documents do not have markup consistency. HTML documents in this type of system have little chance for reuse. HTML is used as a formatting language for the screen, a la Troff or TeX. In short, the HTML-only approach is a short-term and short-sighted solution that limits the long-term value of the documents you work so hard to convert.

Following the SGML approach adds value to HTML data. Documents contain no markup errors and have a tagging consistency that makes conversion to other forms automatable. The automatic generation of the hyperlinked Table of Contents mentioned several times in this paper is a good example. Consider the following HTML documents:


<h1>HEADER 1                    <HTML>
<P>                             <BODY>
This illustrates the HTML-only  <H1>HEADER 1</H1>
approach. Here is a list:       <P>
<ul>                            This illustrates the SGML
<LI><H2>Large bolded text       approach. Here is a list:
<li><h2>Another item</h4>       <UL>
<h1>ANOTHER HEADER</H1>         <LI><B>Bolded text</B>
Another paragraph.              <LI>Another item
                                </UL>
                                <H1>ANOTHER HEADER</h1>
                                <P>
                                Another paragraph.
                                </BODY>
                                </HTML>

Both versions display essentially identical content. The sample on the left is typical of many documents that follow an HTML-only approach. Automating the Table of Contents generation is impossible with documents like this. There is no document structure and no markup consistency. Mosaic displays the text, but there is very little chance of that HTML document ever being used for anything else.

The version on the right is typical of an HTML document using an SGML approach. It will parse and is much easier to manipulate programmatically. Tags are used for their intended purpose; <H2> means header level two. In the left file, header tags are used to produce screen formatting.

The SGML file is very easy to manipulate via software. Generating a Table of Contents is simply a matter of reading the SGML file and writing each header (open tag, data, and end tag) to an output file. Generating and including hyperlinks is only a little more involved.
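
The header extraction just described can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the OSTI program: it assumes the input has already been parsed, so every header has a matching end tag, and the function name is invented for the example. It produces a plain list of headers without hyperlinks.

```python
import re

# A header with its open tag, data, and end tag, e.g. <H1>HEADER 1</H1>.
# The backreference makes the end tag match the level of the open tag.
HEADING = re.compile(r"<(h[1-6])\b[^>]*>(.*?)</\1\s*>", re.IGNORECASE | re.DOTALL)

def table_of_contents(text):
    """Return the headers of a parsed document, one per line, in order."""
    lines = []
    for match in HEADING.finditer(text):
        level = match.group(1).upper()
        title = " ".join(match.group(2).split())  # collapse line breaks
        lines.append(f"<{level}>{title}</{level}>")
    return "\n".join(lines)
```

The sketch depends entirely on consistent markup. A document whose headers are never closed, or are closed by the wrong end tag, gives it nothing reliable to match.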

Documents created, either through authoring or conversion, using SGML are good candidates for reuse. If a CD-ROM project comes up, SGML documents, even those in HTML, are usually more useful than the original documents created with a proprietary word processor (WordPerfect, MS Word, AmiPro, etc.). Data consistency is a primary reason. Managing the Web server with SGML shows that the data is valued.

How OSTI Uses SGML

SGML is the cornerstone of managing the OSTI Web server. Three primary areas of Web server management heavily affected by SGML are:

Hyperlink management
Document conversion
Configuration management

Automated hyperlink management is a critical function on any Web server. Dangling hyperlinks caused by moving or deleting link destination files result in much user frustration. Both document files and other Web server URLs are defined as SGML entities in the OSTI system. The entity name becomes the logical name used in the hyperlinks instead of the full URL. The OmniMark parsing program loads the DTD and entity definition file. Then, it reads the HTML document, expands the entity references while validating the tag structure, and writes the output document with the expanded entity references translated into full URLs.

Hyperlinks to the OSTI Home Page occur at least four times on the server. Only the entity definition has to be changed to point it to a different file or URL. The documents that reference the entity definition are simply reparsed to update the URL references. Entity references also facilitate simultaneous production and development server areas without having to physically modify documents to move them between development and production status; they are simply reparsed.

The final major role SGML plays in OSTI Web server management is HTML document configuration management. All HTML document processing is performed on the OSTI local area network. The baseline document is kept in a separate directory tree. SGML marked sections are utilized to allow both development and production versions of a document to reside in the same file. Either version can be parsed by user selection. This avoids having multiple versions of a document without knowing which is production and which is development. Hyperlinks can also be tied to development and production marked sections. Several new OSTI-specific tags have been added to the documents to facilitate document tracking. The extra tags are dropped by the parsing process. Adding extra, non-HTML tags to documents is not practical without SGML.

How OSTI Uses OmniMark

The only commercial SGML software OSTI purchased to start their Web system was OmniMark. OmniMark was chosen because it is a translating parser that can manipulate SGML data as well as non-SGML data. Treating the Web server as an SGML application from the very beginning has enabled OSTI to complete four major server restructuring efforts as the DOE Home Page evolved, with no major down time resulting from the ongoing changes. This is remarkable considering the OSTI Web server has over 20M bytes of data in hundreds of HTML documents. The SGML approach has made this smooth evolution possible.

OmniMark programs perform most of the HTML document processing ranging from autotagging to parsing. The following sections take a look at two of the primary OmniMark processing systems in detail.

Production Parsing

The OSTI Web System is set up following configuration management principles. The baseline documents reside on a PC network while the production documents are kept on a Sun SparcServer where they are accessed by users via the Web Server. This separation of baseline documents from production documents has several purposes:

The production system can be regenerated from the baseline documents in the event of a system crash or data corruption.
Documents can be updated or added without affecting the production system, facilitating the testing of documents before they are moved to production.
The baseline documents do not have to be strictly HTML.

Originally, parsing a document before moving it to production was strictly for catching parsing errors and ensuring document structure. It was an administrative rule that I insisted upon. Parsing baseline documents is now required because more and more SGML features have been folded into the system. Baseline documents now contain marked sections and a number of new non-HTML tags, and their hyperlinks are SGML entities.

Marked sections are used to allow both development and production versions of a document in the same file. OSTI requires that the system support both production and development/testing areas on the Web Server. The document destination is specified at parse time, allowing either development or production hyperlinks to be generated.
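
OmniMark resolves marked sections itself, so no code is needed for this at OSTI. Purely to illustrate the mechanism, the Python sketch below selects between sections whose status keyword is supplied by a parameter entity, in the SGML form <![ %name; [ ... ]]>. The entity names prod and dev are hypothetical, and the sketch handles only simple, non-nested sections.

```python
import re

# A non-nested marked section whose status keyword comes from a parameter
# entity, e.g. <![ %prod; [ ...content... ]]>
MARKED_SECTION = re.compile(r"<!\[\s*%([\w.-]+);\s*\[(.*?)\]\]>", re.DOTALL)

def select_sections(text, status):
    """Keep sections whose entity is INCLUDE and drop all others.

    status maps each parameter entity name to "INCLUDE" or "IGNORE".
    """
    def resolve(match):
        name, content = match.group(1), match.group(2)
        return content if status[name] == "INCLUDE" else ""
    return MARKED_SECTION.sub(resolve, text)
```

Choosing the destination at parse time then amounts to swapping the two status values; the baseline file itself is never edited.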

New tags have been added to our MOSAIC.DTD to facilitate document management. See http://apollo.osti.gov/html/osti/eei/encoding/mosaic.html. The ability to add additional, localized tags to HTML documents is a big benefit of keeping baseline and production documents separated. The document management tags are screened out as part of the parsing process.
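
The screening step can be pictured with a short Python sketch. It is a hedged stand-in for what the OmniMark program does, and it makes two assumptions the paper does not state: that the <DOCMGMT> container always has an explicit end tag, and that everything inside the container is dropped along with it.

```python
import re

# The container and everything inside it, e.g. <DOCMGMT> ... </DOCMGMT>,
# plus any whitespace that follows the end tag.
DOCMGMT = re.compile(r"<DOCMGMT\b[^>]*>.*?</DOCMGMT\s*>\s*",
                     re.IGNORECASE | re.DOTALL)

def drop_document_management(text):
    """Return the document with the document management block removed."""
    return DOCMGMT.sub("", text)
```

Because the tracking information lives only in the baseline copy, the production document stays plain HTML that any browser can display.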

Hyperlinks are maintained as SGML entities. Each document is given a logical name, which is the entity name. The URL is the entity value. When the URL of a document changes, the entity file (HYPER.DEF) is the only file that must be edited. The links are updated by simply reparsing the documents that reference the link.
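
To make the substitution concrete, here is a minimal Python sketch of logical names being expanded into URLs. OmniMark performs this expansion as part of normal SGML parsing, so this is only an illustration. The layout assumed for HYPER.DEF (one quoted <!ENTITY name "url"> declaration per document) and the logical names used in the usage note are assumptions, not taken from the OSTI files.

```python
import re

# An SGML general entity declaration such as <!ENTITY ostihome "http://...">.
ENTITY_DECL = re.compile(r'<!ENTITY\s+([\w.-]+)\s+"([^"]*)"\s*>', re.IGNORECASE)
# An entity reference such as &ostihome;
ENTITY_REF = re.compile(r"&([A-Za-z][\w.-]*);")

def load_entities(definitions):
    """Build a {logical name: URL} table from entity declarations."""
    return {name: url for name, url in ENTITY_DECL.findall(definitions)}

def expand_links(text, entities):
    """Replace each defined &name; with its URL.

    Returns the expanded text and a list of names that had no definition;
    those references are left in the text unchanged.
    """
    unresolved = []
    def substitute(match):
        name = match.group(1)
        if name in entities:
            return entities[name]
        unresolved.append(name)
        return match.group(0)
    return ENTITY_REF.sub(substitute, text), unresolved
```

With a definition such as <!ENTITY ostihome "http://apollo.osti.gov/html/osti/ostipg.html">, a baseline link written as <A HREF="&ostihome;"> comes out with the full URL in place of the reference. A non-empty list of unresolved names points at a link that would otherwise dangle.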

OmniMark supports these requirements with ease. A single OmniMark program does the work. See http://apollo.osti.gov/html/osti/eei/encoding/hyper.html for the source code to HYPER.XOM.

OmniMark handles the entity expansion for hyperlinks and the marked section processing automatically. No extra code is necessary. Generating the parsed document with tags does require OmniMark code, as does dropping the non-HTML document management tags. The following code for including graphic images, <IMG>, is typical of the code for most tags:

;
; Handle Images...
;
ELEMENT IMG 
	OUTPUT "<%q "
	DO WHEN ATTRIBUTE src IS SPECIFIED
		OUTPUT "  src=%"%v(src)%" "
	DONE
	DO WHEN ATTRIBUTE align IS SPECIFIED
		OUTPUT "  align=%"%v(align)%" "
	DONE
	DO WHEN ATTRIBUTE ismap IS SPECIFIED
		OUTPUT "  ISMAP"
	DONE
	OUTPUT ">%sc"

Basically, HYPER.XOM loads MOSAIC.DTD and HYPER.DEF. Then the document is read in and entity references are expanded. The processing rules put HTML tags around the data and write the tags to the output file. The local tags are suppressed on output. The file extension .RAW designates a baseline document, which is the input to HYPER.XOM. The output file extension is .HTM. The .HTM file extension is expanded to .HTML when the document is moved to the server.

Document Conversion

Document conversion is an important part of OmniMark's job at OSTI. Document conversion is more than just autotagging. It includes:

Autotagging
Automatic creation of hyperlinked internal Table of Contents for large documents
Breaking large documents into 20k - 30k chunks
Document tracking

A decision made early on in the initial Web Server pilot was that supporting the lowest common denominator document format was important because of the wide variety of document sources OSTI has to deal with. Virtually all word processing and typesetting systems can export ASCII files. Many of them create ASCII files that retain some information that autotagging can key off. Bullet list indents are a good example, as are blank lines between paragraphs.

The next lowest common denominator file format is WordPerfect. It turns out that a normal WordPerfect file saved as ASCII without styles preserves line breaks, blank lines, and indentations making autotagging much easier. A simple WordPerfect macro captures emphasis information that gets passed on to the ASCII file as well. WordPerfect documents with styles are a completely different case.

Autotagging is built around the ASCII text document. A typical document conversion consists of the following steps for a WordPerfect document:

Run a WordPerfect macro that converts [BOLD], [ITAL], and [UNDER] to their corresponding HTML emphasis tags.
Save the document in ASCII format.
Clean up the document to give the autotagging program processing hints.
Run the ASC2HTM.XOM OmniMark program that examines line characteristics to tag the document.
Tags in the output file are visually examined for correctness. ASC2HTM tends to insert extra paragraph marks when it gets confused.
A second OmniMark program, TOCGEN.XOM, generates a hyperlinked internal Table of Contents.
A third program is run to automatically divide large documents into 20k - 30k chunks.
Another pass is made to add customer specified external hyperlinks, decide how to handle tables, and determine how graphics in figures are to be handled.
Several AWK scripts may be run to clean up specific data problems that may occur throughout the conversion process.
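
The chunking step in the list above can be sketched as follows. The paper does not say where the OSTI program breaks a document, so this Python sketch assumes breaks fall only in front of <H1> and <H2> headers and uses a simple greedy rule with an upper size limit. It makes no attempt to enforce the 20k lower bound, and a single section larger than the limit is kept whole.

```python
import re

# Split points: the position just before each <H1> or <H2> start tag.
# Splitting on an empty match requires Python 3.7 or later.
SECTION_START = re.compile(r"(?=<h[12]\b)", re.IGNORECASE)

def split_into_chunks(text, limit=30_000):
    """Group consecutive sections into chunks of at most `limit` characters."""
    sections = [s for s in SECTION_START.split(text) if s]
    chunks, current = [], ""
    for section in sections:
        # Start a new chunk when adding this section would pass the limit.
        if current and len(current) + len(section) > limit:
            chunks.append(current)
            current = ""
        current += section
    if current:
        chunks.append(current)
    return chunks
```

Breaking only at headers keeps each chunk readable on its own and leaves the Table of Contents entries pointing at the start of a section.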

After the conversion process is completed, the document is parsed, reviewed for quality, and then installed on the production server. Document content takes precedence over matching the printed form, due to the formatting limitations of the electronic delivery system. The OSTI document conversion process currently does 80-95% of the conversion work automatically, depending on the original document format.

The following list is a summary of the autotagging rules used in ASC2HTM.XOM:

Blank line between paragraphs
Blank line between bullet list items
Header 1 - Centered line of text with 2 blank lines before and 1 blank line after
Header 2 - Single line of UPPER CASE text with a blank line before and a blank line after.
Header 3 - Single line of lower case text with a blank line before and a blank line after.
Bullet list - Whitespace (either blanks or tabs) followed by ^G, o, or -, followed by at least one blank.
Numbered list - Whitespace (either blanks or tabs) followed by a digit, followed by a . or ), followed by at least one blank.
Alpha list - Whitespace (either blanks or tabs) followed by a letter, followed by a . or ), followed by at least one blank.

A typical FIND rule in ASC2HTM.XOM is:


; This rule handles alpha lists.
;
FIND "%n"* 
     LINE-START BLANK+ ((LETTER ("." OR ")") OR ("(" Letter ")")))
     BLANK+ = leading-blank 
     ANY-TEXT+ = text 
     "%n"
   end-nested-list
   start-alpha-list
   DEACTIVATE li-flag
   OUTPUT "<LI>%x(text)%n"
;------------------------------------------------------------------
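
For readers without OmniMark at hand, the three list rules summarized above can be approximated with regular expressions. The Python sketch below is not a translation of ASC2HTM.XOM: it classifies one line at a time, ignores nesting and the blank-line context, and omits the parenthesized "(a)" form that the FIND rule also accepts. The function and rule names are invented for the example.

```python
import re

# Each rule: leading blanks or tabs, a marker, at least one blank, the text.
# \x07 is the ^G (BEL) character named in the bullet list rule.
LINE_RULES = [
    ("bullet",   re.compile(r"^[ \t]+(?:\x07|o|-)[ \t]+(.*)$")),
    ("numbered", re.compile(r"^[ \t]+[0-9][.)][ \t]+(.*)$")),
    ("alpha",    re.compile(r"^[ \t]+[A-Za-z][.)][ \t]+(.*)$")),
]

def classify_line(line):
    """Return (kind, text) for a list item line, or ("text", line) otherwise."""
    for kind, pattern in LINE_RULES:
        match = pattern.match(line)
        if match:
            return kind, match.group(1)
    return "text", line
```

The order of the rules matters: a line such as "   o First item" must be tested as a bullet before the alpha rule is tried, since "o" is also a letter.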

Summary

Giving HTML documents on the Web server the SGML treatment provides valuable learning experiences for dealing with other SGML applications. The hard parts of an SGML application are already done with a Web system. The DTD and applications for storage and viewing documents already exist. Users can concentrate on SGML document authoring and conversion. Parsing HTML documents prepares staff for other, more complex SGML applications. By dealing with document conversion and related production issues, staff members gain valuable experience using SGML tools, such as OmniMark, that can be transferred to other SGML projects, furthering general SGML efforts.

OmniMark is a key ingredient in the SGML effort at OSTI. OmniMark programs handle most of the SGML-related processing. The text conversion capabilities of OmniMark are utilized to autotag documents. Other OmniMark programs generate hyperlinked Tables of Contents and parse documents.

SGML is generally considered a long-term solution because of high up-front costs. Utilizing an SGML approach to maintaining Web servers is one of the few SGML applications that pays for itself in the short term.

Check out the results of using SGML to manage a Web server at http://www.doe.gov/home.html, http://apollo.osti.gov/home2.html, and http://apollo.osti.gov/html/osti/ostipg.html. Our Mosaic Document Type Definition and some OmniMark code can be accessed at http://apollo.osti.gov/html/osti/eei/eei.html.

Biosketch

Mr. Smith has been programming since he was 18. He has published four programming related books with a fifth at the publisher. He has worked on database publishing applications at OSTI for 8 years and is heavily involved with OSTI's SGML activities ranging from teaching SGML classes to being the architect of the HTML document processing for OSTI's Web Server. Mr. Smith is also the author of EasyDTD, S-Engine, and S-Viewer SGML utilities, which are publicly available via anonymous FTP at ftp.ifi.uio.no in /pub/SGML/Demo.