[Mirrored from: Multiple Media Publishing in SGML ]
By Paul Prescod and Bill Bunn
In the last few years, document authors have had to choose between a bewildering number of formats for information distribution. Some formats, like Microsoft Word's file format or PostScript, promised authors excellent print quality. Others, like the Web's HTML and Microsoft's Windows Help Format, allowed fast electronic distribution. To compound the problem, new formats are being created every day. In the Internet world there has been Windows help, HTML, HTML 2.0, HTML 3.0, Hyper-G's HTF and Adobe's PDF. In the print world, PostScript, WordPerfect, MS Word 2.0, MS Word 6.0 and Rich Text Format have fallen in and out of favour. Standard Generalized Markup Language (SGML) can help you overcome these "format wars" and deliver high-quality World Wide Web and print documents.
Most of today's documents exist either to be printed or to be published in a particular online medium (Windows Help, World Wide Web, Gopher, etc.). Most existing document file formats are designed for a specific medium without considering a print document's digital afterlife. Titles are often set in a particular font at a particular size despite the fact that other media will not recognize them. References are entirely textual ("See Chapter 4", for example) when they should be hyperlinks in the online version. Conversely, online documents can seldom be converted into high-quality print documents: authors must manually insert page breaks, hypertext links must be changed to page and section numbers, and high-quality fonts must replace the online fonts.
As we move to the next level of documentation technology, we must begin to think about the multiple media dimensions of our documents because their expressions change depending on the medium: footnotes at the bottom of the printed page might become links in the online text. Our documents must be able to outlive the technologies and purposes of today, and yet support fully the mediums and idioms of the past.
Most organizations begin publishing online by manually converting electronic versions of word processor documents. Manual conversion is expensive and subject to errors and inconsistencies. If you want to convert a small number of documents, manual conversion makes sense. However, when the size and number of documents grow, the manual conversion process becomes prohibitively expensive and complex. As document needs rise, organizations come to depend on automatic document conversion processes.
There are many automatic conversion processes, each with its own strengths and one recurring weakness: automatic conversion programs invariably weaken a document's structure. Document structure must be softened to accommodate different formats, because almost every format, online or print, is so different from its neighbors that an author must break the structure to bring the document into a new format. Popular formats like PostScript, RTF and LaTeX describe how the document looks in print. HTML and Windows Help describe a document's online demeanor. Print and online formats have almost nothing in common. Even among formats for the print medium, there is poor overlap of simple features. RTF cannot describe equations at all. LaTeX describes them in very high-level terms. PostScript describes formulae as pictures. The conflict between formats widens when we try to get print and online formats to agree.
A beautiful Microsoft Word document can translate into an extremely poor online document. There are templates and wizards that allow you to switch Word formats to HTML tags. It sounds like a smooth conversion -- but it isn't. Successful print features become liabilities in the online format. For example, large documents are saved as a single HTML file. A single file like this can often be too large to download conveniently. We could allow the conversion program to break the document into smaller pieces, but most likely the print version can't supply the information necessary to do a proper segmentation. Here is a list of common conversion problems:
These problems may frustrate us when we first encounter them, but they can help us, too. If we consider each of these weaknesses as we choose a file format and write conversion programs for our documents, we can make a better world for our documents. Our own documents will gain a regular, consistent structure across the mediums they use for display. If the format, both print and online, will vary and change with every changing breeze, we might start by taking formatting out of our document. We want something that does not change. Why not separate the document and its structure from the format? We want a file format that remembers our document's structure. Then we can convert that structure to a format that displays the document and its structure in a meaningful way.
We want one format that helps express our document properly in any medium that we might choose, or that could be invented. Ideally, we would define the document's structure once and we wouldn't worry about how it looks until we convert it. This is SGML's place. SGML is a file format that tracks the structure of your document. You could consider it the skeleton of your document's body. Your words "flesh out" the document structure. Your words, combined with the document structure, form your document's body. Then you may "dress up" your document in a format of your choosing. Most folks want to add the "bones" of structure to an existing text to make a strong document body. Using SGML, you work your way through your document and define or "mark up" its structure. This marked-up document is much more rigorous and flexible than the original version.
SGML is a language designed to represent document structure. SGML stores your document in structural pieces called elements -- like title, introduction, conclusion, footnotes, and references. A special SGML editor helps you mark up the document to create a relationship between your words and document structure that a traditional word processor wouldn't allow. As we have already suggested, structure is the bony skeleton of your document, and the information fleshes out those bones into a whole body. The format is the clothing on the outside of that body.
When the converter publishes, or "dresses", an SGML document in a display-specific format -- like HTML -- the converter knows the meaningful way to express structure in a given medium. For example, suppose you used a footnote in your document. SGML tells the converter that the document has a footnote. When it converts the document to a paper version, the converter makes sure that the footnote appears at the bottom of the printed page, as you would expect it to. Then, when you convert the document to its online format, SGML hands the converter that same footnote, and the converter makes it a link, rather than placing it at the bottom of a page. The structure remains constant, but the way it's expressed changes.
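The footnote example above can be sketched in a few lines of Python. This is only an illustration of the idea, not the converter described in this article: the `Paragraph` class and the `render_print` and `render_html` functions are hypothetical names, and the print output is plain text rather than a real page layout.

```python
from dataclasses import dataclass, field
from html import escape


@dataclass
class Paragraph:
    """A structural element: running text plus any footnotes attached to it."""
    text: str
    footnotes: list = field(default_factory=list)


def render_print(paragraphs):
    """Print idiom: numbered markers in the text, notes collected below a rule."""
    body, notes = [], []
    for para in paragraphs:
        line = para.text
        for note in para.footnotes:
            notes.append(note)
            line += f" [{len(notes)}]"
        body.append(line)
    footer = [f"{n}. {note}" for n, note in enumerate(notes, start=1)]
    return "\n".join(body + ["-" * 20] + footer)


def render_html(paragraphs):
    """Online idiom: each footnote marker becomes a hyperlink to its note."""
    body, notes = [], []
    for para in paragraphs:
        line = escape(para.text)
        for note in para.footnotes:
            notes.append(note)
            n = len(notes)
            line += f' <a href="#fn{n}">[{n}]</a>'
        body.append(f"<p>{line}</p>")
    for n, note in enumerate(notes, start=1):
        body.append(f'<p id="fn{n}">{n}. {escape(note)}</p>')
    return "\n".join(body)


doc = [Paragraph("SGML separates structure from format.", ["An example note."])]
print(render_print(doc))
print(render_html(doc))
```

The same `doc` value feeds both functions; only the expression of the footnote differs between the two media.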
In textbooks, for example, SGML's online medium optimization manifests as easily downloadable chapters and sub-chapters hyperlinked to each other and to examples, glossaries, indexes and bibliographies. Constructing this complex system of tiny, interlinked files would take weeks of time for a team of highly trained HTML markup experts. For many organizations, a team of HTML experts costs too much. And for one person, tackling a big document, the task is nearly impossible.
A converter can build a large document into a complex online format in a matter of seconds by manipulating the structure -- the bones -- you placed in your SGML file. Sections and subsections are broken into files and assigned section numbers. The converter extracts titles from sections and reorganizes them into a table of contents. Then the converter collects indexed words and places them into a hypertext index. Using the structure, the conversion process builds a glossary, links the document sections and adds navigation buttons to them, producing a single polished web document. You don't have to learn HTML or any other online format. You only need to begin with the right kind of file format -- SGML.
The print manifestations of your SGML files retain all of the traditional idioms and characteristics of high-quality printing. When a print version exists before you mark up the document, you can use style sheets to emulate its usual formatting characteristics. When an author applies a strong SGML structure to a document, that document often changes. This change is a good thing. Most documents written with conventional, unstructured writing tools develop inconsistencies, usually as the result of human error in the original document. SGML irons out the wrinkles and trouble spots in your document, so when you put it through the converter, you end up with a consistent layout.
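As a rough sketch of the splitting and table-of-contents steps just described, the Python below turns a list of sections into one page per section plus an index page. The function name `split_into_pages`, the file naming scheme and the sample sections are all assumptions made for illustration; index and glossary generation are left out.

```python
from html import escape


def split_into_pages(sections):
    """Number the sections, write one page per section, and build a contents page."""
    pages = {}
    toc_items = []
    for number, (title, body) in enumerate(sections, start=1):
        filename = f"section{number}.html"  # hypothetical naming scheme
        heading = f"{number}. {escape(title)}"
        pages[filename] = f"<h1>{heading}</h1>\n<p>{escape(body)}</p>"
        # Each extracted title becomes a hyperlinked table-of-contents entry.
        toc_items.append(f'<li><a href="{filename}">{heading}</a></li>')
    pages["index.html"] = "<ul>\n" + "\n".join(toc_items) + "\n</ul>"
    return pages


sections = [("Introduction", "Why structure matters."),
            ("Conversion", "From one source to many formats.")]
pages = split_into_pages(sections)
for name in sorted(pages):
    print(name)
```

Because the titles and section boundaries come from the structure, no one has to write the links or the contents page by hand.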
Proprietary formats tie the display and editing of your files to particular software tools: Microsoft Word files can only be displayed or edited in Word; RTF files can be displayed or edited in a handful of word processors; HTML files only display in Mosaic, Netscape and other World Wide Web browsers. When a software company decides to update the software or stop supporting that product, you're stuck. When the piece of software required to display or edit a particular file becomes obsolete, that file becomes obsolete with it. Your document is buried with the file format it was written in. Do you remember 8-track tapes?
Of course, you could try to "convert your way out" of obsolescence. You could try to take your old WordPerfect files and convert them to HTML or some other format. The document's structure would be lost or damaged and you'd end up starting over with your text in another format. SGML avoids this dead end. SGML allows us to develop documents that we feel will stand the test of time without being reworked and modified as the years progress. This document, for instance, was written in SGML. We have made no commitment to any particular proprietary format. You're reading this online. Tomorrow it will be print. Next week it will be on a multimedia CD. That's the beauty of SGML.
Of course, although you don't need to touch your SGML file, you do need to upgrade your conversion tools. In the fluctuating world of software, this is a plus. As we write this, the standards are changing. HTML 2.0 standards are moving to HTML 3.0. Proprietary add-ons introduce new tags almost every day. Retargeting the conversion software is simpler than porting old files to new software packages, or re-formatting and re-tagging old files to represent your structure. That's a time-consuming, troublesome route. Let your original file stay constant while the vicissitudes of development carry its expression into new dimensions.
As completely new media arise, the converters change. We could convert our document to display three dimensionally in a virtual reality environment. Or we could write text-to-speech converters so a software format could read our documents aloud to visually impaired people. Suddenly, your file isn't limited to print, or even online formats. It can convert to any format you might want.
SGML itself may become the HTML of the future. In a short while we will be able to publish SGML directly to the Web. HTML may become obsolete as a new breed of browser allows you to view files and styles in their native SGML environment. This gives you one more reason to consider SGML and one more smart reason to use it.
Unlike traditional formats, creating an SGML system requires some up-front thought and consideration. You need to figure out the structure, the bones of your document. And you need to build a file that makes sure that any document you build uses the skeleton you designed. You lay out your document's structural skeleton in the Document Type Definition (DTD).
At the heart of any SGML system are the DTDs. The DTDs set limits on how authors can structure their documents. A single DTD -- one particular skeletal structure -- will not work with all documents, because various document types have different elements or structural parts. For instance, a memo document would have a "to" element where an author can include information about the addressee. But a resume wouldn't use a structural part, or element, like "to".
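The memo and resume contrast above can be modelled very loosely in Python. This is a deliberately simplified stand-in for a DTD: it records only which element names each document type declares, whereas a real DTD also constrains order, nesting and repetition. Apart from "to", every element name below is a hypothetical example.

```python
# Hypothetical, simplified "DTDs": each document type lists its declared elements.
ALLOWED_ELEMENTS = {
    "memo": {"to", "from", "subject", "body"},
    "resume": {"name", "experience", "education"},
}


def undeclared_elements(doctype, element_names):
    """Return the element names that the given document type does not declare."""
    allowed = ALLOWED_ELEMENTS[doctype]
    return [name for name in element_names if name not in allowed]


print(undeclared_elements("memo", ["to", "body"]))
print(undeclared_elements("resume", ["name", "to"]))
```

A "to" element is accepted in a memo but reported as undeclared in a resume, which is the kind of limit a DTD places on authors.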
You can either create a DTD or use one that already exists. They aren't hard to write if you need to write one. Let's look at an imaginary DTD for a Report. We'll investigate a few of its possible features.
The report DTD should have elements for all of the common idioms of documentation, such as the footnote, the reference and the bibliographic entry. The report DTD also has elements for linking to multimedia and World Wide Web resources. Authors are encouraged to use multimedia and Web features for each of the media they want to publish in. Movie or sound clips won't work for print documents. Graphics won't display in a text only browser. The wise author layers the multimedia on the document. He or she includes media throughout the document, specifically targeted for the format of the document. The author includes video and sound clips for online display, screen captures for print and simple HTML, and a strong layout for text only mediums.
Once you arrange the bones of your document with the DTD, you can begin creating a new document or porting a document from another format. This is where companies like InContext enter the picture. InContext makes an editor that works in SGML called InContext 2. InContext 2 is a Windows based structured word processor that can be used to create and edit SGML files. The InContext SGML editor picks up your document's skeletal structure and lets you add the text, the flesh to your document. We can call this operation "structural editing." At this stage, we aren't concerned with the final format of the document, we just want to get the information into a useful structure.
Structural editing is quite a different experience from traditional, format-oriented word processors and their "WYSIWYG" (What You See Is What You Get) displays. In traditional word processors, authors are primarily responsible for making their work look good on a piece of paper. The word processor offers obvious, visual tools for manipulating the formatting and layout of the document. In a structured word processor, the authors have the higher responsibility of matching the bones of structure to the text and don't concern themselves with formatting issues. If an author uses the "tab" key to simulate indentation and create a numbered list, it won't appear in the formatted version. The converter knows what it should do with each of the structural elements and so it ignores any formatting that you might enter manually into the text, like tabs and line breaks. The converter matches the text and its structure to a style and outputs the result.
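To illustrate why hand-entered tabs and line breaks disappear, here is a minimal Python sketch, assuming a converter that collapses whitespace inside each element and generates list numbers from the structure. The names `normalize` and `render_numbered_list` are illustrative, not part of any product mentioned here.

```python
def normalize(text):
    """Collapse tabs, line breaks and repeated spaces into single spaces."""
    return " ".join(text.split())


def render_numbered_list(items):
    """Numbering comes from the list structure, not from what the author typed."""
    return "\n".join(f"{n}. {normalize(item)}"
                     for n, item in enumerate(items, start=1))


# The author's manual tab, line break and leading spaces leave no trace.
print(render_numbered_list(["\tFirst\nitem", "  Second item"]))
```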
Though this SGML development process is very different from traditional word processing, authors won't have too much difficulty adjusting to this new way of editing. Plus, SGML products are advancing as rapidly as the first format-oriented word processors did in their early days. Vigorous competition generates cheaper, more capable products every day.
The editing process should leave you with a document well matched to a structure. Now you can convert your document to its various manifestations. The following conversion is one of many possible conversions you can make and it uses specific software.
The converter processes your document in two steps: first, it analyzes your document structure against the proper structure you defined in your DTD; then, if the document structure checks out, the converter processes your document into the format of your choice. In this specific example, we wanted to generate a print and online expression of the same document, this particular article -- so we did. Here's how it happened.
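The two-step process can be sketched as follows. This is a hedged illustration rather than the converter we used: the set `DECLARED` stands in for a DTD, the formatters `to_html` and `to_text` are hypothetical, and a plain-text target is used in place of RTF to keep the example short.

```python
from html import escape

# Hypothetical, simplified stand-in for a DTD: the set of declared element names.
DECLARED = {"title", "para"}


def to_html(elements):
    """Hypothetical HTML formatter: titles become <h1>, everything else <p>."""
    tags = {"title": "h1"}
    lines = []
    for name, text in elements:
        tag = tags.get(name, "p")
        lines.append(f"<{tag}>{escape(text)}</{tag}>")
    return "\n".join(lines)


def to_text(elements):
    """Hypothetical plain-text formatter: titles are upper-cased."""
    return "\n".join(text.upper() if name == "title" else text
                     for name, text in elements)


FORMATTERS = {"html": to_html, "text": to_text}


def convert(elements, target):
    # Step 1: check the document structure before producing any output.
    undeclared = [name for name, _ in elements if name not in DECLARED]
    if undeclared:
        raise ValueError(f"undeclared elements: {undeclared}")
    # Step 2: only a valid document reaches the formatter for the chosen target.
    return FORMATTERS[target](elements)


report = [("title", "Multiple Media Publishing"), ("para", "Structure first.")]
print(convert(report, "html"))
print(convert(report, "text"))
```

Validating first means every formatter can assume a well-structured document, so supporting a new output format only requires adding one more entry to `FORMATTERS`.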
We double clicked on the converter icon to start the converter software. The converter interface asked us for the name of the SGML file we wanted to convert. Then it checked our document against our DTD. Once our document passed the converter's inspection, it offered us a choice of document formats -- RTF, Microsoft's Rich Text Format, for a print version, or HTML for display on the Web.
Not all print documents look the same, and authors expect some control over what their documents look like. The converter offers us some choice, too. When we chose to convert to RTF, the converter presented a range of style options. The style sheets were easy to manipulate, too. So if we didn't like our presentation, we could have modified it easily.
Once we clicked on the "Convert" button, the converter began to work. It pulled data from the SGML file and dropped it into the RTF format. On the way it looked up styles in the style sheet and applied them to the document. It knew which style to apply by matching a style name to the corresponding structural element. For instance, when the converter found a structural element called a "title" it automatically applied an RTF style called "Heading 1" to that element. For each bit of the text, the converter applied the appropriate, pre-determined style until both the structure and the text were completely transposed into the RTF format. To show us its good work, the converter launched Microsoft Word and loaded the RTF file. We checked it and printed it -- just what you'd expect in a quality paper document.
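The style lookup described above amounts to a table from element names to style names. The Python sketch below shows that idea only; it does not write RTF. The "title" to "Heading 1" pairing comes from the text, while the "para" entry, the fallback style and the function name `apply_styles` are assumptions.

```python
# Style sheet: structural element name -> named style in the output format.
STYLE_SHEET = {
    "title": "Heading 1",  # pairing taken from the article
    "para": "Normal",      # hypothetical entry
}
DEFAULT_STYLE = "Normal"   # hypothetical fallback for unlisted elements


def apply_styles(elements, style_sheet=STYLE_SHEET):
    """Pair each piece of text with the style chosen for its element."""
    return [(style_sheet.get(name, DEFAULT_STYLE), text)
            for name, text in elements]


styled = apply_styles([("title", "Report"), ("para", "Body text.")])
for style, text in styled:
    print(f"{style}: {text}")
```

Changing the look of the printed document then means editing `STYLE_SHEET`, not the document itself.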
Then we returned to our converter and chose the online format -- HTML -- this time. Here's what the converter did to the same SGML file.
Each of the document sections shares a common look, so the pieces look like they belong to the same document. Our final document has an impressive final format. Very few online documents can match the ease and accuracy with which it was generated.
It would be convenient if there were a way to publish SGML directly on the Web without going through the conversion to HTML. We expect more and more authors will publish directly to the Web in SGML. There are two ways to do it -- using real-time conversion, or SGML browsers.
We could publish our SGML directly with Electronic Book Technology's software called DynaWeb. Instead of converting documents when they are created, it converts them on the server while users download them. This all occurs in real time as a user moves through the document. This real-time conversion is technically impressive, and simplifies the process a little, but the setup and the cost could put you off. DynaWeb is an expensive, proprietary product. And DynaWeb presumes that you own a fast server that will process a document each time a user downloads it. This is a lot to ask of an author.
The Web is built on a foundation of documents published in HTML. All Web browsers are required to support HTML. Web browsers can support other formats. As the Web evolves, other formats are gaining in prominence. One of these formats is SGML. SoftQuad Panorama is a program for viewing SGML files directly. Panorama reads in an SGML file and displays the document according to the rules in a style sheet. Panorama is not a full Web browser itself, but it can follow URLs through the Mosaic browser. In the future, Panorama might grow into a full, general-purpose Web browser, or existing Web browsers may add SGML capabilities. This kind of SGML publishing would make all our lives simpler. SGML, naturalized as a Web format, would eliminate the need to convert documents to an online format. Wouldn't that be nice?
We hope this paper has accomplished its task -- to help you consider SGML as a legitimate and superior file format for some of your documents. Start-up costs for an SGML system are usually high -- time, effort and perhaps money. Once you've made the SGML investment, the payback exceeds those initial costs with fast, easy Web and print publishing and simplified document maintenance. SGML preserves our documents in a lasting format so we can reuse the information in the years to come.