Managing Web Documents With SGML

Mirror from: http://www.doe.gov/html/osti/eei/sgml/infotech.html - Managing Web Documents With SGML

Norman E. Smith, CDP
Science Applications International Corp.
P.O. Box 2501
301 Laboratory Road
Oak Ridge, TN 37831-2501
615-482-9031
smithn@zeus.osti.gov


Abstract

The DOE Office of Scientific and Technical Information (OSTI) set up its World Wide Web (WWW) Server as an SGML application from the very beginning. Web HyperText Markup Language (HTML) documents are parsed and SGML syntax errors corrected before being loaded on the production Web Server. As the number of documents on a server grows, automating hypertext links becomes an absolute necessity to prevent dangling hyperlinks. SGML provides the automation vehicle for the OSTI Web Server. Hypertext links are managed via SGML and the parsing process. Each document is given a logical name which is set up as an SGML entity reference. The SGML entity contains the Uniform Resource Locator (URL) for the document. The SGML parser substitutes the proper URL for the logical name reference, automatically generating valid hyperlinks. The SGML approach has made possible several complete reorganizations of the file structure on the Web Server with minimal impact on either outside access or staff sanity. This paper examines the issues of managing Web Servers from an SGML perspective.

This document describes work performed at the DOE Office of Scientific and Technical Information under contract DE-ACO5-91MA40061.

Note: This paper was authored in HTML with an SGML editor. I practice what I preach...

BACKGROUND

The Department of Energy Office of Scientific and Technical Information (OSTI) is a major player in the area of Web servers. OSTI installed the Web software and began developing a server in January 1994. Early success in pilots putting full text documents on the system and demonstrating the potential capabilities led to OSTI maintaining the official DOE home page (http://www.doe.gov or http://apollo.osti.gov) as well as hosting pages for several other DOE organizations. The DOE Home Page is the centralized starting point for accessing information throughout the Department.

SGML has been an integral part of the World Wide Web (WWW) effort at OSTI ever since the first pilot server was installed. It is also at the heart of the HyperText Markup Language (HTML) processing for preparing documents for installation on the server.

HTML is an SGML application. There is an HTML Document Type Definition (DTD) that defines the HTML tag set and document structure. The original HTML DTD has been modified extensively by OSTI to track capabilities available in popular Web viewers. We call the modified HTML Document Type Definition MOSAIC.DTD. This DTD makes it possible to use general purpose SGML tools and features that are not available in HTML-specific processing software.

WHAT IS A WEB SYSTEM

The World Wide Web can be thought of as a loose web of information. It is built on the concept of information sharing and document hyperlinking. It was originally developed at CERN in Geneva as a means of sharing information. The original Web server software and viewer were developed there. Several other organizations have since become involved in developing both server and browser software.

The pieces of a Web system include:

Web Server

The Web server software runs on a host computer, typically a workstation or file server. OSTI's server is a Sun SparcServer. Server software is available on a wide range of other systems ranging from Vaxen to Macintoshes.

The server software performs several functions including:

The major purpose of a Web server is to share information, so the network interface is a primary function. This may be to the Internet or to a local network. Web servers employ a special network protocol called HTTP to allow anonymous access to the information managed by the server.

There are two primary sources of Web server software, CERN and the National Center for Supercomputing Applications (NCSA). Their server software is available via anonymous FTP from the Internet. The state of Web server software continues to progress at a fairly rapid pace because of the competition between CERN and NCSA to provide the best server software. OSTI has used both servers and presently uses the NCSA server.

Document storage is another primary function of the server software. Document storage is more than just storing HTML documents; there are graphics files, audio files, and full motion video files. The capability of mapping actions to 'hot spots' on a graphic also falls into this function.

Interface with external software is not necessarily an obvious function for a Web server to provide. Common interfaces include E-mail and external databases such as WAIS. Much of the potential for creating advanced information delivery applications with a Web system can be traced to the server being able to interface with external software.

Web Documents

Web documents are primarily HTML documents. However, other types of files such as graphics often fall under the "Web document" category. HTML documents are text files marked with tags identifying structure, such as <P> for a paragraph. HTML documents also may contain hyperlinks to other HTML documents via Uniform Resource Locators (URLs). The tags allow Web browsers to format text for the screen. The hyperlinks may link to other points within the same document, to other documents on this server, or to documents on any other Web server connected to the network.

Graphics files are generally either Graphics Interchange Format (GIF) or X-windows Bit Map (XBM) files. Any graphics format that is supported by your Web viewer can be used, but these are the two universally supported graphics formats. XBM format is considered obsolete but is still supported.

Audio (WAV) and full motion video (AVI and MPEG) files are also supported by many Web browsers. These have not achieved the universal use of GIF files because not all user computers include the necessary hardware.

Web Client

A Web client is the software that runs on the user's computer which accesses a Web server. The client software may also be referred to as a browser because it allows the user to browse documents on a Web server. The client must:

Web client software must have network access to connect to a server. It uses the same HTTP network protocol that the server does to access documents.

Interpreting the HTML markup tags and rendering the information on the screen is another important task of the client. Most clients are very forgiving of adherence to the HTML DTD and instead treat the tags as a screen formatting language.

Hyperlinks provide the mechanism to navigate around a Web server. The top levels of documents are termed pages, which together form a menu system that is navigated by clicking on hyperlinks. Eventually, you navigate to full text documents which may contain additional hyperlinks to related documents.

Accessing external software on the user's machine is important because the client software does not have to be able to process every conceivable file type. Unknown document types can be passed off to external programs to process. For example, a document whose filename ends in .WRI may be passed off to Write for display if you are running on a MS-Windows based PC.
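The hand-off to external programs can be pictured as a lookup keyed on the filename extension. The following is a minimal sketch in Python, not how Mosaic or any other client actually implements it; the helper program names and the table itself are hypothetical and chosen only for illustration.

```python
import os

# Hypothetical extension-to-helper table; the program names are illustrative only.
HELPERS = {
    ".wri": "write.exe",
    ".wav": "soundplayer",
    ".mpg": "mpegviewer",
}

# File types this imaginary client renders itself.
BUILT_IN = {".html", ".htm", ".gif", ".xbm"}


def choose_handler(filename):
    """Return 'internal', a helper program name, or None if the type is unknown."""
    ext = os.path.splitext(filename)[1].lower()
    if ext in BUILT_IN:
        return "internal"
    return HELPERS.get(ext)
```

Because the table is data rather than code, supporting a new file type in this sketch means adding one entry instead of changing the client.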

Much of the explosion in Web servers is directly attributable to public availability of a high quality Web browser. A partial list of current free Web clients includes:

Mosaic is the client that caused the initial explosion of WWW. Within two or three months of its release, Web servers were popping up everywhere. Mosaic was developed at the National Center for Supercomputing Applications (NCSA). It has a very user friendly graphical user interface that allows access to Web servers by simply clicking a mouse. Gopher server access is also built into Mosaic, and WAIS databases can be accessed via Web server external interfaces.

Mosaic is copyrighted but free. It does have some license restrictions on redistribution. Mosaic is available via anonymous FTP from NCSA at ftp://ftp.ncsa.uiuc.edu. Versions are available for X-windows based workstations, PCs running MS-Windows, and Macintosh computers.

Cello is available from Cornell University at ftp://www.law.cornell.edu. It runs on PCs with MS-Windows. Lynx is a text-based client that runs on MS-DOS, VAX/VMS, and other computers. It is available from the University of Kansas at ftp://ftp2.cc.ukans.edu. WinWeb is a Mosaic-like browser for MS-Windows PCs and is available from ftp://fpt.einet.net.

WHAT DOES MANAGING YOUR WEB SERVER DATA MEAN?

The short answer to "What does managing your Web server data mean?" is treating the Web server as a production system. It deserves the same care and feeding as your payroll system for example. If the information did not have value, why make it available on a Web server in the first place?

The problem is that Web servers are so easy to put up and feed HTML documents to that many start as a part-time, ad hoc project that one of the system people put up because it is neat. System planning is often bypassed because it is so easy. What happens is that the server grows so fast that by the time you realize it needs to be managed, there is just too much information there to go back and do it right.

So what should you do? Some up front planning will pay back many times over the long term. Realize that several areas of expertise are necessary which are rarely found in one person. And if one such person exists in your organization she/he is probably too busy to be of much immediate use. At OSTI, this "WebMaster" role is divided among at least half a dozen people. The areas of expertise include:

System management includes installation of the Web server software, setting up network access, working out server interfaces to other software, and setting up the initial mechanism for loading documents. Backup and related production procedures must be set up if not already in place. Installing Web browsers on user computers may also be done by the systems management group if separate user support functions do not exist.

Administrative processes are also more extensive than many may think at first glance. Who gets Web clients? Who selects/approves documents to be installed on the server? Who responds to external inquiries? Who gets funding? The list goes on and on. There is a lot of administrative work to be done, especially on a large Web system.

The position of information architect is rarely mentioned in conjunction with Web systems. The person that fills this role has to:

Graphics design is an important part of any system with a graphical user interface (GUI) and a Web system is no exception. The secret in a Web system is to work within the constraints of the lowest common denominator viewing system. OSTI designs for a 16 color, VGA 640X480 dot screen where a graphic 600X400 is the largest viewing area. Icons add color and make the screen interesting. Too many graphics will add significantly to the network load, causing documents to take longer to load and annoying the user. Consistently sized icons also make it easier to integrate text in a pleasing fashion. There is a delicate balance to getting just the right amount of graphics without unduly slowing the system down.
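A size constraint like the 600X400 viewing area can be checked mechanically before a graphic is installed. The sketch below, in Python, reads the width and height stored in the header of a GIF file; it is an illustration only and not part of the OSTI tool set, and the function names are invented for this example.

```python
import struct

MAX_WIDTH, MAX_HEIGHT = 600, 400  # largest viewing area quoted in the text


def gif_size(data):
    """Return (width, height) from the first 10 bytes of a GIF file."""
    if len(data) < 10 or data[:6] not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF file")
    # Logical screen width and height are little-endian 16-bit integers.
    return struct.unpack("<HH", data[6:10])


def fits_viewing_area(data):
    """True when the graphic fits inside the MAX_WIDTH x MAX_HEIGHT area."""
    width, height = gif_size(data)
    return width <= MAX_WIDTH and height <= MAX_HEIGHT
```

Only the first ten bytes of each file need to be read, so a check of this kind could be run over an entire graphics directory cheaply.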

Another important but frequently overlooked skill necessary for a successful Web server is that of a programmer who can write text manipulation scripts and translation programs. A programmer with a publishing background is ideal, but hard to find. From our perspective, a programmer with a genuine interest in SGML/TeX/Troff/desktop publishing is the next best thing.

The types of text filters required will vary and may include removing extra tags from HTML files edited with HoTMetaL, generating Table of Contents hyperlinks, and many other repetitive tasks. OSTI uses AWK, but any other text manipulation language, such as PERL, is also appropriate.

Document conversion is a big job once a Web server grows past the initial few "Who We Are" documents that usually go up first. The critical point here is automation of the document conversion process. That means translation programs developed by a programmer and someone else to do the actual document conversion. At OSTI, the main translation programs are a combination of WordPerfect macros, OmniMark programs, and AWK scripts.

Our home grown approach was arrived at after examining several alternative, publicly available conversion utilities. Each seemed to have a different deficiency, ranging from dropping important pieces of information to not running on the computer platform used for document conversion. The decision was also made that supporting the lowest common denominator document format, ASCII text, was important.

A typical document conversion consists of the following steps for a WordPerfect document:

The person performing the conversion then makes another pass to add customer specified external hyperlinks, decide how to handle tables, and determine how graphics in figures are to be handled. Several AWK scripts may be run to clean up specific data problems that may occur throughout the conversion process.

After this process is completed, the document is parsed, reviewed as a quality check, then forwarded to the information architect for installation on the production server. Document content takes precedence over matching the printed form, mainly because of formatting limitations of the electronic delivery system.

Also note at this point that authoring Web documents from scratch is an entirely different process than document conversion. The current best HTML authoring tool, HoTMetaL, is not appropriate for document conversion.

The OSTI document conversion process currently does 80%-95% of the conversion work automatically, depending on the original document format.

DOCUMENT MANAGEMENT

Most organizations dedicate a workstation class computer to the Web server role and place access restrictions on the network via electronic firewalls that prevent Internet access to internal networks. This extra security minimizes the risk of computer break-ins. If the rest of the computer installation is worth protecting this way, then so is the information residing on the Web server. Good configuration management dictates that multiple copies of data be kept in separate locations. Baseline Web data is stored on another computer system at OSTI.

At OSTI, the production Web server documents reside on a SparcServer 630 and the baseline documents are kept on the internal PC network. It is important to have a copy that can be used to rebuild the entire Web server should its data ever be damaged.

When a document is updated, the copy on the PC network is changed. The modified document is parsed with an OmniMark parsing program and any SGML errors corrected before being moved to a development area on the Web server. The changes are reviewed prior to being installed on the production server area. Even an informal configuration management program as described here can have enormous benefits on the overall quality of a Web system.

WHY SGML

SGML is a very important cog in the Web wheel at OSTI. Hyperlink maintenance, configuration management, and overall system quality can be traced directly back to the use of SGML. SGML tools ease the maintenance of HTML documents. SGML based editors, such as HoTMetaL, validate documents in the authoring process. SGML parsers validate document structure and tags. SGML translators turn logical document names into full URL hyperlinks. SGML filter programs automatically generate Table of Contents and split large documents into manageable sized chunks. All of these things contribute to a well run production system. And since SGML is a general solution that can be applied to other document manipulation problems, valuable experience is gained that can be effectively applied to other projects.

SGML documents have a consistency that facilitates automated manipulation, even with non-SGML tools. Ease of manipulation is a very important, even critical, side effect of having data and documents in SGML form. SGML documents can be reused in different applications ranging from paper publishing to databases to CD-ROM applications.

Processes that come with the SGML territory include:

A special program called an SGML parser does the parsing of SGML documents. The parser loads the DTD, then reads the SGML document. The parsing process validates document structure, performs entity substitutions, and validates tag names. Some parsers are also capable of performing additional processing. These are called translating parsers. SGML parsers that only check document structure are called validating parsers.

Parsing HTML documents with an SGML parser is an important step in document production at OSTI. Most document browsers are very forgiving of bad markup; they simply ignore invalid tag names and allow headings inside lists, among many other markup errors. Markup errors tend to build up over time and may contribute to unexplained problems. For example, we have a large document for which Mosaic loaded only the first 64 hyperlinks while it contained numerous SGML errors. After the document was parsed and the markup errors corrected, all of its more than 200 hyperlinks loaded properly.
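The two error classes named above, invalid tag names and headings inside lists, can be illustrated with a small checker. This is a rough sketch in Python using the standard html.parser module as a stand-in; it is not an SGML parser, it does not read a DTD, and the tag set shown is a small illustrative subset rather than the HTML DTD or MOSAIC.DTD.

```python
from html.parser import HTMLParser

# Small illustrative tag set, not the real HTML DTD or OSTI's MOSAIC.DTD.
KNOWN_TAGS = {"html", "head", "title", "body", "h1", "h2", "h3", "p",
              "ul", "ol", "li", "a", "b", "i", "img"}
HEADINGS = {"h1", "h2", "h3"}
LISTS = {"ul", "ol"}


class MarkupChecker(HTMLParser):
    """Collect (line, message) pairs for two kinds of markup error."""

    def __init__(self):
        super().__init__()
        self.list_depth = 0
        self.errors = []

    def handle_starttag(self, tag, attrs):
        line = self.getpos()[0]
        if tag not in KNOWN_TAGS:
            self.errors.append((line, "unknown tag <%s>" % tag))
        elif tag in HEADINGS and self.list_depth > 0:
            self.errors.append((line, "heading <%s> inside a list" % tag))
        if tag in LISTS:
            self.list_depth += 1

    def handle_endtag(self, tag):
        if tag in LISTS and self.list_depth > 0:
            self.list_depth -= 1


def check(text):
    """Return the list of errors found in an HTML document."""
    checker = MarkupChecker()
    checker.feed(text)
    checker.close()
    return checker.errors
```

A validating SGML parser does far more than this, since it checks every element against the content model in the DTD, but the sketch shows why such errors are easy to detect by program and easy to miss by eye.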

The second function parsing accomplishes at OSTI is automatic hyperlink generation via entity references. This additional processing is possible through the use of a translating parser. After the parser expands the entity references (logical document names) into URLs, the URL is written into the parsed document. A primary advantage of this approach is that the value of a URL only has to be changed in one place, after which the documents that reference it are reparsed. Reparsing can be automated. Hyperlink maintenance may not seem a big deal until your Web server grows to several hundred files, but it becomes a problem with just a few dozen documents. Manual hyperlink maintenance is both very time consuming and virtually impossible to do without errors. The result is usually user frustration caused by dangling hyperlinks.
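The substitution of URLs for logical names can be sketched in a few lines. The Python below is an illustration of the idea only, not the OmniMark program used at OSTI; it handles just the simple quoted form of an entity declaration, and the entity name ostihome is a hypothetical example.

```python
import re

ENTITY_DECL = re.compile(r'<!ENTITY\s+([A-Za-z][\w.-]*)\s+"([^"]*)"\s*>')
ENTITY_REF = re.compile(r'&([A-Za-z][\w.-]*);')

# Standard character entities that must be left alone.
CHARACTER_ENTITIES = {"amp", "lt", "gt", "quot"}


def load_entities(declarations):
    """Build a name -> URL table from <!ENTITY name "url"> declarations."""
    return dict(ENTITY_DECL.findall(declarations))


def expand(document, entities):
    """Replace &name; references with URLs; report names with no definition."""
    missing = []

    def substitute(match):
        name = match.group(1)
        if name in entities:
            return entities[name]
        if name not in CHARACTER_ENTITIES:
            missing.append(name)
        return match.group(0)

    return ENTITY_REF.sub(substitute, document), missing
```

With a declaration such as `<!ENTITY ostihome "http://apollo.osti.gov/html/osti/ostipg.html">`, a link written as `<A HREF="&ostihome;">` comes out with the full URL, and any reference without a definition is reported instead of silently becoming a dangling hyperlink.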

Manual document conversion is both error prone and time consuming. Again, SGML plays a significant role in the automated document conversion process at OSTI. The areas SGML impacts include:

The quick examination of the printed copy of the document is called a document review. Problem areas such as large tables and figures are identified. Decisions about logical break points for large files are made. The goal is documents with files no larger than 20K to 30K bytes. The eventual file location on the server is determined so file directory and entity references can be prepared by the time the conversion is completed.

Document tagging is a critical process to automate. There are a variety of approaches possible here. The OSTI approach has evolved into a combination of steps that may include WordPerfect macros to preserve emphasis markup, an OmniMark program for the actual autotagging, another OmniMark program to generate Table of Contents hyperlinks, and a variety of AWK scripts for specific data cleanup.

The current document conversion system does about 80% - 95% of the document conversion work, depending on the format of the original document. The percentages have improved with program refinement based on actual large document conversions. A list of additional enhancements has been identified for next fiscal year that should raise the percentages of conversion automated to the 90%-98% range. I believe that the low end number could be raised closer to 95% if the documents had consistent style and format. (The percentages listed here are estimates based on present experience.) In the OSTI environment, every document is essentially a custom conversion.

The markup consistency achieved by the autotagging process makes it practical to create hyperlinked internal Table of Contents and split large files into smaller, more manageable pieces automatically.
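Splitting becomes straightforward once headers are tagged consistently. The following Python sketch groups whole top-level sections into files of a target size; it assumes that every logical break point is an <H1> header, which is an assumption made for this example and not a description of the OSTI programs.

```python
import re

# Zero-width match just before every <H1 ...> start tag, in either case.
H1_START = re.compile(r"(?=<H1\b)", re.IGNORECASE)


def split_at_headings(text, max_bytes=30000):
    """Group whole <H1> sections into chunks of at most max_bytes where possible."""
    sections = [s for s in H1_START.split(text) if s]
    chunks, current = [], ""
    for section in sections:
        if current and len((current + section).encode("utf-8")) > max_bytes:
            chunks.append(current)
            current = section
        else:
            current += section
    if current:
        chunks.append(current)
    return chunks
```

A single section that is larger than the limit is left whole in this sketch, since breaking inside a section is a decision better made during the document review.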

The final steps in the HTML document conversion process are parsing, online review, and document installation on the Web system.

Document format and style consistency saves time and cost in the document conversion process because consistency makes the process more amenable to automation. Good first steps for any organization toward document consistency are style guides and word processor style sheets. It is not enough to just have style guides and style sheets; they must be used!

SGML APPROACH vs HTML APPROACH

In my mind, there is no contest in choosing the SGML approach for managing HTML documents. Good HTML-specific tools exist. However, very few perform any kind of validation of either document structure or tags. The net effect is that creating incredibly bad documents (from a markup perspective) is incredibly easy. Users and management are lulled into thinking that HTML does not require the same care and feeding as other production systems.

The HTML-only approach to managing documents locks you into HTML. Migration to future HTML versions is questionable because of the lack of markup consistency in documents created in this environment. Conversion becomes almost a manual process because documents don't have markup consistency. HTML documents in this type of system have little chance of reuse. HTML is used as a formatting language for the screen à la Troff or TeX. In short, the HTML-only approach is a short term and short sighted solution that limits the long term value of the documents you work so hard to convert.

Following the SGML approach, on the other hand, adds value to HTML data. Documents contain no markup errors and have a tagging consistency that makes conversion to other forms, whether SGML or any other, automatable. The automatic generation of hyperlinked Table of Contents mentioned several times in this paper is a good example. Consider the following HTML documents:


<h1>HEADER 1                       <HTML>
<P>                                <BODY>
This illustrates the HTML-only     <H1>HEADER 1</H1>
approach. Here is a list:          <P>
<ul>                               This illustrates the SGML
<LI><H2>Large bolded text          approach. Here is a list:
<li><h2>Another item</h4>          <UL>
<h1>ANOTHER HEADER</H1>            <LI>Another item
<LI><B>Bolded text</B>             </UL>
Another paragraph.                 <H1>ANOTHER HEADER</h1>
                                   <P>
                                   <P>
                                   <P>
                                   Another paragraph.
                                   <P>
                                   </BODY>
                                   </HTML>


Both versions display essentially identical content. The sample on the left is typical of many documents that follow an HTML-only approach. Automating the Table of Contents generation is impossible with documents like this. There is no document structure and no markup consistency. Mosaic displays the text, but there is very little chance of that HTML document ever being used for anything else.

The version on the right is typical of any document using an SGML approach. It will parse and is much easier to manipulate programmatically. Tags are used for their intended purpose; <H2> means header level two. In the left file, header tags are used to produce screen formatting. The SGML file is very easy to manipulate via software. Generating a Table of Contents is simply a matter of reading the SGML file and writing each header (open tag, data, and end tag) to an output file. Generating and including hyperlinks is only a little more involved.

Documents created using SGML, whether through authoring or conversion, are good candidates for reuse. If a CD-ROM project comes up, for example, SGML documents, even those in HTML, are usually more useful than the original documents created with a proprietary word processor (WordPerfect, MS Word, AmiPro, etc). Data consistency is a primary reason. Managing your Web server with SGML shows that you value your data.

Giving HTML documents on your Web server the SGML treatment also provides valuable learning experiences for dealing with other SGML applications. The hard parts of an SGML application are already done with a Web system. The DTD already exists and applications for storage and viewing documents already exist. You can concentrate on SGML document authoring and conversion. Parsing HTML documents prepares staff for other, more complex SGML applications. They have to deal with document conversion and related production issues. Staff members gain experience using SGML tools that can easily be translated into value on other SGML projects. Thus, general SGML efforts will also be furthered.

SGML is usually considered a long term solution because up front startup costs are usually high. Utilizing SGML to manage HTML documents for your Web server is one of the few SGML applications that can show short term benefits and pay back. The larger the Web server, the larger the benefit in quality and maintainability. The only commercial software OSTI purchased to start the first Web project was a translating parser (OmniMark).

OSTI's Web server has gone through four major update cycles since January 1994, two of which included complete document tree restructurings. The ease with which these changes were made would not have been possible without SGML. The successful use of SGML for HTML document management contributed directly to OSTI becoming the maintainer of the official Department of Energy Home Page.

SGML is generally considered a long term solution because of high up front costs. Utilizing an SGML approach to maintaining Web servers is one of the few SGML applications that pays for itself in the short haul.

The only commercial SGML software OSTI purchased to start their Web system was the OmniMark translating parser. Treating the Web server as an SGML application from the very beginning has enabled OSTI to complete four major server restructuring efforts as the DOE Home Page evolved with no major down time due to the ongoing changes. This is remarkable considering the OSTI Web server has over 10M bytes of data in hundreds of HTML documents. The SGML approach has made this smooth evolution possible.

HOW OSTI USES SGML

SGML is the cornerstone of managing the OSTI Web server. Three primary areas of Web server management heavily affected by SGML are:

Automated hyperlink management is a critical function on any Web server. Dangling hyperlinks cause much user frustration. They are caused by link destination files being moved or deleted. Both document files and other Web server URLs are defined as SGML entities in the OSTI system. The entity name becomes the logical name used in the hyperlinks instead of the full URL. The parser program loads the DTD and entity definition file. Then, it reads the HTML document, expands the entity references while validating the tag structure, and writes the output document with the expanded entity references translated into full URLs. The data flow for the OSTI parsing program is shown in Figure 1.

Hyperlinks to the OSTI Home Page occur at least four times on the server. Only the entity definition has to be changed to point it to a different file or URL. The documents that reference the entity definition are simply reparsed to update the URL references. Entity references also make it possible to keep simultaneous production and development server areas without having to physically modify documents to move them between development and production status.
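Deciding which documents to reparse after an entity definition changes is itself easy to automate. The Python sketch below scans unparsed documents for references to one entity name; it is a hypothetical helper written for this illustration, and the file names and entity names in the usage note are invented.

```python
import re

ENTITY_REF = re.compile(r"&([A-Za-z][\w.-]*);")


def documents_to_reparse(documents, changed_entity):
    """Return the sorted names of documents that reference changed_entity.

    documents maps a file name to its unparsed text.
    """
    return sorted(
        name for name, text in documents.items()
        if changed_entity in ENTITY_REF.findall(text)
    )
```

For example, given two files that contain `&ostihome;` and one that does not, a call with the name ostihome returns just the first two, which can then be fed back through the parsing step.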

SGML also plays a large part in HTML document conversion beyond just being the final output. The text conversion capabilities of OmniMark (the SGML translating parser used at OSTI) are used to autotag documents. Another OmniMark parser program generates hyperlinked Table of Contents. Finally, the document is parsed and markup errors corrected before being passed on for installation on the server. Again, the markup consistency SGML adds makes some of the extra processing OSTI performs practical.

The final major role SGML plays in the OSTI Web server management is HTML document configuration management. All HTML document processing is performed on the OSTI local area network. The baseline document is kept in a separate directory tree. SGML marked sections are utilized to allow both development and production versions of a document to reside in the same file. Either version can be parsed by user selection. This prevents the possibility of multiple document versions and not knowing which is production and which is development. Hyperlinks can also be tied to development and production marked sections. Several new OSTI specific tags have been added to the documents to facilitate document tracking. The extra tags are dropped by the parsing process. Adding extra, non-HTML, tags to documents is not practical without SGML.
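The effect of marked sections can be sketched in Python. In SGML a marked section has the form `<![ %name; [ ... ]]>`, where the parameter entity resolves to INCLUDE or IGNORE. The example below imitates that selection with a regular expression; the parameter entity names prod and dev are assumptions made for this illustration, and nested marked sections are not handled.

```python
import re

MARKED_SECTION = re.compile(
    r"<!\[\s*%([A-Za-z][\w.-]*);\s*\[(.*?)\]\]>", re.DOTALL
)


def resolve(document, settings):
    """Keep sections whose parameter entity is INCLUDE and drop IGNORE ones.

    settings maps a parameter entity name to "INCLUDE" or "IGNORE".
    Nested marked sections are not handled in this sketch.
    """
    def choose(match):
        name, content = match.group(1), match.group(2)
        return content if settings[name] == "INCLUDE" else ""

    return MARKED_SECTION.sub(choose, document)
```

One source file can then yield either version: `resolve(text, {"prod": "INCLUDE", "dev": "IGNORE"})` produces the production document, and swapping the two values produces the development one. A real SGML parser performs this selection itself as part of parsing.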

RECOMMENDATIONS

The number one recommendation for managing a Web server is to use SGML! There are many benefits that provide immediate return on investment in terms of both cost and quality from using SGML. Take advantage of Web HTML documents to gain experience with SGML systems.

Your Web server is just as much a production system as any other at your site. Your payroll system is valued and managed in a production setting; do the same with your Web system.

Automate as much of the document management process as possible. Two prime areas here are hyperlink maintenance and document conversion. Hyperlinks are almost impossible to maintain without some sort of automated procedure. Document conversion costs are tremendous for any but the smallest document unless the process is automated. Even highly automated document conversion is not inexpensive. The more data/document manipulation you can automate, the higher the quality of the information on your Web server.

Finally, split the duties of the "WebMaster" across several qualified individuals. About half a dozen people share some portion of the WebMaster function at OSTI.

Check out the results of using SGML to manage a Web server at http://www.doe.gov/home.html, http://apollo.osti.gov/home2.html, and http://apollo.osti.gov/html/osti/ostipg.html. Our Mosaic Document Type Definition and some OmniMark code can be accessed at http://apollo.osti.gov/html/osti/eei/eei.html.