Authoring SGML documents with word processors

Jacques Deseyne, Senior Consultant
Sema Group Belgium, Electronic Document Management Systems
Stallestraat 96
B-1180
Brussels
Belgium
Jacques.Deseyne@sema.be

Abstract

A number of new and enhanced tools for producing SGML documents with word processors have appeared since mid-1994. How do these products succeed in bringing structuring functionality to applications which have only limited support for logical hierarchies? What makes them different from dedicated SGML editors?

The following sections discuss the approach taken by packages for two popular word processors in the MS-Windows environment: WordPerfect and Microsoft Word.


Introduction

The majority of today's documents are produced with advanced word processor software, offering an impressive range of layout and presentation functions which, in combination with modern high-resolution printers, make it possible to obtain almost typeset quality output. Features such as templates, macros, wizards, coaches, WYSIWYG display, table editors, drawing tools, spelling and grammar checkers make current word processors an extremely user-friendly environment for producing documents.

Authoring SGML documents puts the user in another world: instead of applying styles and emphasis, with direct visual feedback, he has to worry about the correct elements and attribute values in the right place. Unless he endeavors to take the risky path of a text editor, an author can choose between a number of available "structured editors". While some of these provide formatting and presentation features, their look and feel cannot compete with a modern word processor. "Tagging" documents remains a cumbersome task.

On the other hand, word processors already have a number of structuring features, such as headings of different levels. Why shouldn't it be possible to produce a well-structured document with your favorite word processor and then call a menu item such as "Save as SGML"? That is exactly the functionality which a number of add-ons promise to deliver.

The following discussion sketches the background to the different approaches taken by products for the MS-Windows environment. Several of these tools will be demonstrated during the session.

General issues

The following sections review a number of general issues which apply to all of these tools.

Word processor formats vs. SGML

Basically, the purpose of a word processor application is to produce formatted pages; the current WYSIWYG approach only stresses this idea. Structural elements within word processor applications are not much more than paragraphs, emphasis, headers and footers, numbering schemes, footnotes, graphic elements, object containers, tables and lists. The textual content is structured as a set of paragraphs, characterized by a set of formatting and presentation attributes. These paragraphs may contain character-level formatting. Unfortunately, there are very few (if any) DTDs for which this set of structural elements is rich enough. The Rainbow DTD (see footnote 1) is nothing more than an SGML-style representation of this flat structural hierarchy.

The average structural complexity of an SGML document requires a much higher number of levels to represent its element tree. How do SGML add-ons try to cope with this requirement?

Mapping structure to styles

Styles are now a common feature of word processors; they have evolved from some kind of shorthand for a set of codes to structures holding presentation and formatting attributes. Styles for textual content can be paragraph or character styles. Such styles are "paired", i.e. they set some attributes at the start and reset them at the end of the content portion to which the style is applied.

The basic idea of mapping structural elements is to make styles represent the occurrence and the context of leaf nodes (containing the content) of the document tree.

In general, a one-to-one mapping of element names and styles is not enough. It must be possible to retain the information related to the different contexts in which the element can occur. In the following example, the element ti should be represented by at least four different styles (e.g., ArticleTitle, SectionTitle, ListTitles and TableTitle). It may also be appropriate to create different styles to reflect different levels of section or list titles. This example shows some recursiveness, and the mapping will have to limit itself to a reasonable number of embedded levels.


Figure 1.
Partially developed document definition tree

To avoid any ambiguity, the safest approach is to define different style names corresponding to each different path taken from each possible leaf node element up to a node element which is an ancestor for all possible occurrences of the leaf node element. This common ancestor node may be the document root itself.
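The naming rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the element tree, the element names (article, sec, list, table, ti) and the function style_names are hypothetical and are not taken from any of the tools discussed. Recursion is cut off at a fixed nesting depth, as suggested earlier.

```python
# Hypothetical, partially developed document definition tree:
# each element maps to the child elements it may contain.
TREE = {
    "article": ["ti", "sec"],
    "sec": ["ti", "list", "table", "sec"],  # sec is recursive
    "list": ["ti"],
    "table": ["ti"],
    "ti": [],
}


def style_names(tree, root, leaf, max_depth=2):
    """Return one style name per distinct path from root down to leaf.

    max_depth limits how often the same element may appear on one path,
    which keeps recursive models finite.
    """
    names = []

    def walk(element, path):
        path = path + [element]
        if element == leaf:
            # The full path from the common ancestor becomes the style name.
            names.append("_".join(path))
            return
        for child in tree[element]:
            if path.count(child) < max_depth:
                walk(child, path)

    walk(root, [])
    return names


for name in style_names(TREE, "article", "ti"):
    print(name)
```

In practice the generated names would probably be replaced by friendlier ones (SectionTitle rather than article_sec_ti), but the number of distinct styles needed stays the same.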

Introducing markup

Another way is to provide specific codes inside the document, which serve as markup indicating the start and end of structural elements. Such codes may or may not be coupled to formatting constructs. The developer can implement such codes in several ways. Features of the word processor, such as fields or comments, can be used.

It is also possible to create new codes which are not present in the normal word processor version. In order to do this, the software must provide an adequate interfacing method or the developer must have close enough access to the internals of the application. Of the tools described hereafter, only WordPerfect SGML Edition has followed the latter approach.

Telling your word processor about your SGML application

An SGML application is generally understood as an application of the SGML standard, which includes an SGML declaration, one or more document type declaration(s) and, optionally, link process declarations.

The SGML declaration contains different types of information. The document character set can be almost anything (ASCII, ANSI, EBCDIC, Unicode, ...), and the import and export facilities have to be aware of this information. At present, only WordPerfect SGML Edition provides a well-documented way of configuring import and export functions, enabling mapping between its 14 internal character sets and any document character set.

Apart from this exception, these tools are only really comfortable with a standard SGML declaration declaring the reference concrete syntax. Features like SHORTREF, marked sections, USEMAPs, CONCUR, and LINK are either not handled at all or not handled satisfactorily.

Depending on the tool, the DTD can be loaded immediately or after passing through a DTD compiler. In the case of Microsoft's SGML Author, the DTD logic is only implicitly present in the distinction of style names corresponding to the different contextual occurrences of leaf node elements.

The document prolog can contain entity declarations. In most DTDs, some public sets of character entities are declared. It seems almost natural that the author shouldn't key in the entity reference, but the character itself. Here also, the mapping between the word processor's character set(s) and these entities has to be defined. Most tools provide some standard mapping; again, WordPerfect's tool provides greater flexibility and better documentation than the others. Microsoft Word uses only one character set (corresponding to ISO 8859-1). Visualization of other character sets implies switching from the default font to a specific one (e.g., for Monotoniko Greek), but none of the tested Word add-ons seems to provide this mechanism for representing character entities.
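As a minimal sketch of such a mapping, the Python fragment below converts characters to entity references on export and back again on import. The table is a small hand-picked subset of the ISO 8879 ISOlat1 entity names; the function names are hypothetical, and the format bears no relation to the mapping files actually used by any of the tools. Escaping of markup characters is deliberately ignored.

```python
import re

# Small hand-picked subset of the ISO 8879 "ISOlat1" entity names.
CHAR_TO_ENTITY = {
    "é": "eacute",
    "è": "egrave",
    "ç": "ccedil",
    "ü": "uuml",
    "ß": "szlig",
}
ENTITY_TO_CHAR = {name: char for char, name in CHAR_TO_ENTITY.items()}

# An entity reference: "&", a name, and a closing ";".
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.\-]*);")


def export_text(text):
    """Word processor -> SGML: replace mapped characters by entity references."""
    return "".join(
        f"&{CHAR_TO_ENTITY[ch]};" if ch in CHAR_TO_ENTITY else ch
        for ch in text
    )


def import_text(text):
    """SGML -> word processor: replace known entity references by characters."""

    def resolve(match):
        # Unknown references are left untouched rather than guessed at.
        return ENTITY_TO_CHAR.get(match.group(1), match.group(0))

    return ENTITY_REF.sub(resolve, text)


print(export_text("café français"))
print(import_text("caf&eacute; fran&ccedil;ais"))
```

The author keys in the character itself; the entity reference only appears in the exported instance.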

Word processors as structured editors

After using the tools for a while and talking to others who did the same, my conclusion is that the principle "Make your documents as ever before with your favorite word processor and then just press the SaveAsSGML button" doesn't work. In practice, normal word processor operation is never structured to the same level as authoring an SGML document.

The majority of word processor users create documents in a formatting-oriented way, frequently overriding style attributes or not using styles at all. Such users experience applying specific styles, inserting elements and, most of all, specifying attribute values, as counterproductive. This is particularly true for applying emphasis and creating footnotes or tables. If the users don't get educated about structural hierarchy, the meaning of markup and the advantages of logical structuring, they won't understand why they should apply the Bold character style, or why they should choose an element emph and then fill in an attribute value type="4" instead of just pressing Ctrl-B.

The four reviewed tools reflect different design visions of what should constitute a user-friendly interface for introducing structural elements. Some of them allow WordPerfect or Word to be used as a quite decent structured editor. They all provide buttons and/or menu items for activating a dialog box showing the valid elements/styles/actions in the current context. The information and options offered differ considerably depending on the actual tool. The following screen reproductions show this clearly.


Figure 2.
Dialog box from Microsoft SGML Author presenting allowed styles


Figure 3.
Dialog box from Nice Technologies' TagWizard presenting allowed elements


Figure 4.
Dialog box from Near & Far Author presenting allowed styles

In addition to a Select Next Style pop-up, Near & Far Author provides a graphical presentation of the document instance tree, allowing navigation and creation of new elements.


Figure 5.
Dialog box from WordPerfect SGML Edition presenting allowed elements

Word processors as formatters for SGML documents

Setting up an SGML authoring environment within a word processor in order to produce a logically-structured document may not always be straightforward. How does the opposite process, importing an SGML document and formatting it in the WYSIWYG environment, behave?

All tools can achieve an acceptable level of functionality. With tools using the mapping approach, all the carefully designed styles seem to deliver their promise at last.

The tools using tag equivalents may perform even better, because they allow layout actions to be associated with container elements, while styles can only be applied to content. For instance, you may want to represent all text of subsections in a two-column layout, while section title and the introductory paragraphs should be set in a single column. The <subsection> start-tag may be associated with a Set two columns command, while the </subsection> end-tag will be associated with a Revert to one column command.
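The two-column example can be sketched as follows. The rule table and the function format_stream are hypothetical and only illustrate the principle of attaching layout commands to start-tags and end-tags; they do not reproduce the rule format of any of the tools, and tags with attributes are not handled.

```python
import re

# Hypothetical layout rules: (element, "start" | "end") -> layout command.
LAYOUT_RULES = {
    ("subsection", "start"): "Set two columns",
    ("subsection", "end"): "Revert to one column",
}

# Matches simple start-tags and end-tags without attributes.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.\-]*)>")


def format_stream(sgml):
    """Yield ("command", name) and ("text", data) items in document order."""
    pos = 0
    for match in TAG.finditer(sgml):
        if match.start() > pos:
            yield ("text", sgml[pos:match.start()])
        kind = "end" if match.group(1) else "start"
        command = LAYOUT_RULES.get((match.group(2), kind))
        if command:
            # A rule is attached to the container tag itself, not to content.
            yield ("command", command)
        pos = match.end()
    if pos < len(sgml):
        yield ("text", sgml[pos:])


doc = "<section><ti>Title</ti><subsection><p>Body</p></subsection></section>"
for item in format_stream(doc):
    print(item)
```

A style-based tool has no equivalent place to hang the column switch, because the subsection container itself carries no content to which a style could be applied.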

Such layout specifications actually do exactly what the LINK process definitions in SGML are meant for. Unfortunately, no tool is able to import such definitions directly at this moment. As far as we could check, none of them supports the use of FOSIs or of DSSSL specifications.

Impressions on products

We have had the opportunity to test out several add-on tools. For Microsoft Word, we tested Microsoft's SGML Author 1.0, Nice Technologies' TagWizard 1.8 and MicroStar's Near & Far Author for Word (in beta version). We also tested Novell's WordPerfect 6.1 SGML Edition. Rather than presenting an exhaustive description of each tool, we prefer to mention a number of points which have appeared to us as the most striking.

Products for Word install a number of templates which contain menus and/or button bars coupled to Word macros enabling and accessing specific functions of the tool. To enhance the efficiency of the setup, these functions are available in Word Link Libraries (WLL) which are installed in the Word startup directory. In all cases, installation is relatively straightforward.

Installing Novell's WordPerfect SGML Edition adds a number of DLLs, sets up a number of specific directories and replaces some DLLs and one executable.

Microsoft SGML Author

There are two operational parts in SGML Author: the "System Administrator" part and the User part.

The System Administrator part consists of the "Author" application which enables you to build the mappings between styles and structure. On the Word side, you have paragraph styles and character styles; on the SGML side, you have SGML documents with their hierarchy of elements. The System Administrator or person setting up the environment has to map all the different possible contextual occurrences of each element to a particular style. If the DTD shows any recursiveness, mappings should be provided for a reasonable number of levels. Such mapping files are created and edited in the Author application.

To take full advantage of the special functions, such as a drop-down list of "legal" styles, a specific Word template has to be implemented with a template initialization file (an .INI file) holding several kinds of information (such as a listing of the legal styles applicable within the current style).

The User part consists of a Word Link Library (WLL) which adds a set of functions to MS-Word 6.0 and a number of Word customizations which are supplied as Word templates.

To validate a finalized document, the User part uses the parser DLLs from Avalanche in order to save the document as SGML. Mappings are applied and the resulting marked-up document is checked by this parser. If there are logical errors, they are brought to the author's attention as annotations in a copy of the document.
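To make the idea of applying mappings at save time more concrete, here is a minimal sketch in Python. It is not SGML Author's actual algorithm or mapping file format: the style names, the STYLE_MAP table and the "fresh" index (which states from which depth a style always opens new container elements) are assumptions introduced for the illustration. No parsing or validation is performed.

```python
# Hypothetical mapping: style name -> (element path from the root, fresh).
# Elements at index >= fresh are always opened anew for this style;
# elements below that index are reused if they are already open.
STYLE_MAP = {
    "ArticleTitle": (["article", "ti"], 1),
    "SectionTitle": (["article", "sec", "ti"], 1),  # starts a new sec
    "SectionPara": (["article", "sec", "p"], 2),    # stays in current sec
}


def export_sgml(paragraphs):
    """Turn a flat list of (style, text) paragraphs into tagged text."""
    out = []
    open_elements = []
    for style, text in paragraphs:
        path, fresh = STYLE_MAP[style]
        # Count how many of the open containers can be kept.
        keep = 0
        while (keep < fresh and keep < len(open_elements)
               and open_elements[keep] == path[keep]):
            keep += 1
        # Close everything below the kept containers.
        while len(open_elements) > keep:
            out.append(f"</{open_elements.pop()}>")
        # Open the missing containers (all path elements except the leaf).
        for name in path[keep:-1]:
            out.append(f"<{name}>")
            open_elements.append(name)
        leaf = path[-1]
        out.append(f"<{leaf}>{text}</{leaf}>")
    while open_elements:
        out.append(f"</{open_elements.pop()}>")
    return "".join(out)


print(export_sgml([
    ("ArticleTitle", "T"),
    ("SectionTitle", "S1"),
    ("SectionPara", "a"),
    ("SectionTitle", "S2"),
    ("SectionPara", "b"),
]))
```

The sketch shows why the container structure has to be inferred: the word processor document only holds a flat sequence of styled paragraphs, and the element hierarchy is rebuilt from the context encoded in each style.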

Comments on the product

Checking of the SGML structure is done only when the document is saved. Apart from the parser which operates during saving operations, the user environment is not SGML-aware at all. In order to provide some guidance during authoring, a complex customization has to be built to enable NextLegalStyle, Attribute, etc. dialog boxes.

Creating the template initialization file has to be done by hand and can get very complex for any but the simplest DTD. A suggestion for a future version could be to automate this work as much as possible, as a lot of this information is already present in the DTD and in the mapping file. It would also be a good thing if performance could be enhanced.

The tool can be considered in environments which are standardized on MS-Word and which need to take advantage of other integration functions, such as ODBC connectivity or OLE objects.

TagWizard

The SGML TagWizard announces itself rather modestly as "an aid to produce SGML document instances". Put in its simplest way, the product allows contextually correct tags to be inserted at any point in the document. These tags are set in the SGMLTAG character style and checked by the parser (a DLL), which is very well integrated with the Word environment through macros.

Start- and end-tags may be added independently; it is also possible to "surround" a selected portion with a start- and end-tag. Tags can be inserted through a button, through shortcuts or typed in directly.

TagWizard is more than a simple tagging aid and an instance validator. It converts your Word tables to tagged tables, without limiting you to one table model. The mapping between Word table elements and the elements of your particular table model can be defined in an easy interactive way.

TagWizard also allows formatting to be associated with elements. Formatting occurs as a result of applying a logical structure to your document. Each association has to be made through character styles, which means that paragraph breaks are not automatically applied. However, it is possible to achieve a remarkable number of things, and automated formatting may be achieved with a few simple macros, given that all SGML tags are visible to ordinary Word functions such as Find.

Comments on the product

Of all the tools mentioned here, TagWizard provides the most straightforward way to produce tagged SGML documents. The developers did not seek to hide SGML structured authoring from the users, but rather to present it in a user-friendly way. The user-friendliness is certainly aimed at authors who have tagging experience.

Of course, this isn't the normal Word environment any more and a normal operation of Word won't produce SGML documents. Authors need to be instructed about SGML, descriptive markup and document hierarchies in order to feel at ease with TagWizard. The investment can bring along the benefits of well-structured documents, of which the audience of this Conference doesn't have to be convinced.

Near & Far Author

Near & Far Author introduces structure in a document through styles, but in a quite different way than Microsoft's SGML Author. Not only the leaf nodes, but all elements are represented by a style. A specific template is generated when importing a DTD and styles are automatically created.

During editing, a separate window, the "Near & Far Author" window, displays a tree view of the document structure. This display is very similar to the graphical representation of document models in other Near & Far products. As the user moves through the document, the current position is highlighted in the Near & Far Author tree display. Pressing the <Enter> key is interpreted as the insertion of a new element and a "Select Next Style" dialog box is displayed, presenting only valid styles at the insertion point.

Authoring documents with Near & Far Author is quite different from a normal editing session with Word. Although visual feedback is immediate, editing actions are controlled by the logical document model.

Creation of styles may be automatic, but it will help to differentiate their visual appearance. Styles may be changed as you normally would in Word. The creation of styles representing document elements distinguishes between container elements, elements with text and elements with a mixed content model. Elements which occur within a mixed content model are represented by character styles.

Visual feedback can be much improved by enhancing the style maps. Autotext entries can be defined and, for elements occurring at different depths, a kind of RANKed style names can be used, such as Heading1, Heading2, Heading3, ... for different contexts of a section heading element <h>.

The advantage of inserting a style for container elements is that hooks are present wherever attributes need to be specified; the presence of required or implied attributes is shown by a small in-line button. Pressing this button pops up a dialog box permitting value specification.

Comments on the product

At the time of writing, we have only seen the beta test version, which needed a fairly powerful machine to run on. The editor window and the tree window are not two views of the same data, but two separate data representations which must continuously be synchronized. This synchronization takes as much of the system resources as it can get.

The automatic distinction between paragraph and character styles can't work when an element occurs in models with and without mixed content. In such cases, the corresponding style is a character style, although that may not always be an ideal solution from a layout point of view.

WordPerfect 6.1 SGML Edition

WordPerfect 6.1 SGML Edition is the successor to IntelliTag, the SGML version of WordPerfect 5.1. The new SGML Edition consists of three parts: the DTD compiler, the layout designer and the SGML editing environment, which is an extension of the normal WordPerfect version. Installation of the tool replaces the main executable file WPWIN61.EXE and a number of DLLs.

The DTD compiler takes a DTD, an SGML declaration (which may be a default one) and an entity mapping file. An entity mapping file gives the system-specific resolution of external entity references present in the DTD.

Important entities may be the character entity sets defined in Appendix D of ISO 8879 and elsewhere. The mapping file indicates how such character entities will be represented in the WordPerfect environment. A nice feature is that such entities can be represented either as entity references or as WordPerfect characters. Mapping is active both when importing and when exporting an SGML instance.

The entity mapping file will also indicate the correspondence between document character sets and the internal representation. For instance, entries of ISO 8859-7 (Latin/Greek character set) can be mapped to WordPerfect's character set 8. Such a mapping is imperative when BASESETs other than ISO 646 occur in the SGML declaration.

The Layout Designer is the environment where the layout specification can be set up. Such specification is not required in order to operate the SGML editing environment, but layout rules can make authoring easier by providing visual feedback.

In WP SGML Edition, layout is applied as a result of the logical structure. Layout rules can be specified for start tags and end tags. Rules can include almost all layout and presentation features of WordPerfect, such as fonts, margins, lines, automatic text, justification, indents, tab settings, columns, counters, ... Applying such features is a bit different from normal WordPerfect operation, because most of them can be applied to any element, not just paragraphs.

The SGML editing environment is an extension of the WordPerfect 6.1 environment, showing a different toolbar and additional menus. Structure is introduced by start and end codes for elements, an equivalent of tags. The user-friendliness of the "Insert Element" dialog box has been thoroughly studied. Inserting elements is made more efficient through logic chaining and automatic switching of focus.

Logic chaining means that as long as no choices have to be made for the next element, this will be automatically inserted. For instance, if your DTD has a fragment like

<!ELEMENT section - - (stigrp, (ssec | p)+) >

<!ELEMENT stigrp - - (sti, ssubti?) >

<!ELEMENT sti - - (p)+ >

inserting a <section> start-tag will cause the insertion of the <stigrp>, <sti> and <p> elements.
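The chaining behavior for this fragment can be sketched in Python as follows. The nested-tuple representation of content models and the functions forced_first and chain are assumptions made for the illustration; they say nothing about how WordPerfect implements the feature. Elements without a declaration in the fragment (ssec, ssubti, p) are simply treated as end points of the chain.

```python
# Content models of the DTD fragment, written as nested tuples:
# ("seq", a, b, ...) for a, b   ("or", a, b, ...) for a | b
# ("opt", a) for a?             ("plus", a) for a+
MODELS = {
    "section": ("seq", "stigrp", ("plus", ("or", "ssec", "p"))),
    "stigrp": ("seq", "sti", ("opt", "ssubti")),
    "sti": ("plus", "p"),
}


def forced_first(model):
    """Return the element that must come first, or None if there is a choice."""
    if isinstance(model, str):
        return model
    kind, *items = model
    if kind in ("seq", "plus"):
        # The first item of a sequence or a repeated item is required.
        return forced_first(items[0])
    return None  # "or" and "opt" leave a decision to the author


def chain(element):
    """List the elements inserted automatically when element is inserted."""
    inserted = [element]
    while element in MODELS:
        nxt = forced_first(MODELS[element])
        if nxt is None or nxt in inserted:  # stop at a choice or a cycle
            break
        inserted.append(nxt)
        element = nxt
    return inserted


print(chain("section"))
```

Chaining stops as soon as the content model offers a choice: after the title group of a section, the author has to decide between ssec and p, so nothing further is inserted automatically.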

Focus is also automatically switched from the Insert Element dialog box to the document text area whenever an element with a mixed content model is inserted. The author can immediately start to type text. Pressing the <Enter> key is interpreted as a request to insert a new element and the focus is set back to the Insert Element dialog box.

In addition to these features, it is possible to define aliases for element names. In case of DTDs where element names may not be very clear to the average user, aliases may be very useful (the DTD provided for the proceedings of this SGML BeLux Conference is a good example).

All SGML-specific commands are available in the WordPerfect macro language, which means that for existing documents, a part of the tagging can be automated. A powerful macro is available which can map styles in existing documents to elements.

Layout specifications are not exactly styles, but rather an extension of the paragraph- and character-oriented attribute sets. Layout rules can be created for each element, at each level in the document tree.

Comments on the product

WordPerfect has delivered a quite impressive combination of features with this SGML Edition. Data entry and tagging are made more efficient through the logic chaining and focus change features. Applying layout as a result of introducing logical structure is an approach which many SGML purists will appreciate. For European users, the transparent mapping between characters and entities is really welcome.

The tool may be impressive, but the accompanying documentation, at least in the beta version, is sometimes minimal. As an example, the documentation of IntelliTag is more explicit and more useful for setting up correct mapping files for DTD compilation than that of the SGML Edition.

Conclusion

We have discussed the discrepancy between the formatting-oriented approach of many word processor users and the characteristics of structured editing. The four tools which were discussed have their own way of narrowing this gap and providing an SGML document instance production environment.

Some companies have been offering SGML extensions for a word processor for a longer time than others, and this difference in maturity shows. Some tools offer such functionality that long-time providers of structured editors may have to reconsider their position. The appearance of these tools certainly means that SGML authoring is becoming available in a mainstream environment.

One question was not raised. Producing SGML documents is not a goal in itself. These documents serve some purpose, e.g. they are fed into a database, they are part of an information system or they will be used in a publishing environment. Add-on tools to word processors provide an authoring environment; they do not offer a complete solution.


Footnotes:

footnote 1

The Rainbow DTD was developed by Electronic Book Technologies to serve as an intermediate representation during the up-conversion of word processor documents to SGML. Information can be found at ftp://ftp.ebt.com.