This document provides background information on Standard Generalized Markup Language (SGML) and briefly discusses a forthcoming product from Microsoft to address the SGML authoring needs of Word users.
What follows is a brief history of the emergence of SGML.
The term 'markup' originally referred to the marks handwritten on a manuscript by the copy editor or book designer to tell the compositor how the manuscript was to be formatted. With the introduction of computers and their use in typesetting, the markup instructions were typically embedded in the text of the document through a process called 'specific markup.' These instructions were usually surrounded by obscure control characters to set them apart from the body text, which made entering them a manual and time-consuming task. In addition, each new phototypesetting system used its own proprietary markup language, thereby locking customers into a particular language and vendor.
In the early 1980s the Graphic Communications Association (GCA) set out to define a standard markup language known as 'GenCode.' However, it quickly became apparent that it would be very difficult to build a tag set general enough to serve the needs of all typesetter manufacturers without being unwieldy in size and scope. While the GCA was working on these problems, an ANSI committee was defining a standard based on another computer typesetting language, Generalized Markup Language (GML). This standard represented the document as a hierarchical tree of related elements, each of which would be formatted in a certain way. The two organizations combined their efforts and focused on building one standard. In December 1986 the combined work of the committees was published by the International Organization for Standardization (ISO) as standard 8879, SGML.
First, SGML is a completely open standard which is platform, vendor, and application independent. SGML files are stored as ASCII text which ensures that they can be used on virtually any platform. Second, the power and promise of SGML comes once a document has been marked up with the appropriate SGML tags. By defining structure and relationship within previously unstructured information, SGML enables entirely new ways of managing, publishing and reusing that information. For example, an SGML database could store thousands of tagged documents, and it could use these tags to publish customized versions of the same document on demand.
To see how such a system could work, consider a hypothetical and extremely simple example: a 100,000-page airplane manual. When this document was originally written, every paragraph was tagged with a security clearance indicating whether it was unclassified, classified, secret, or top secret. Chapters were tagged to indicate their relevance to technicians, air traffic controllers, pilots, and flight attendants. All of this information could then be parsed into an SGML database publishing system to facilitate on-demand publishing. Using the encoded tags to understand the structure of the document, you could create customized versions based on the original data. For example, you could just as easily create a version relevant to pilots with classified security clearance as a version relevant to air traffic controllers with unclassified clearance. Because SGML stores information about the structure of the documents and not the formatting, different presentations can be used to suit the distribution model. For example, the information could just as easily be presented as a printed document or viewed with an on-line viewing tool. What were cross-references in the printed document (e.g., see figure on page x) become hypertext jumps in the on-line document. Structuring the information in this way provides new levels of control and flexibility in using and reusing the content.
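The on-demand publishing idea in the airplane manual example can be sketched in a few lines of Python. This is only an illustration, not how any SGML database product works: the class names, the `publish` function, and the sample chapters are all hypothetical, and the sketch assumes that clearances are strictly ordered, so that a reader cleared at one level may see everything at or below it.

```python
from dataclasses import dataclass

# Clearance levels from the example, ordered from least to most restricted.
# Treating them as a strict hierarchy is an assumption of this sketch.
CLEARANCE_ORDER = ["unclassified", "classified", "secret", "top secret"]


@dataclass
class Paragraph:
    text: str
    clearance: str  # one of CLEARANCE_ORDER


@dataclass
class Chapter:
    title: str
    audiences: set  # e.g. {"pilots", "technicians"}
    paragraphs: list


def publish(chapters, audience, clearance):
    """Return (chapter title, [paragraph texts]) pairs this reader may see."""
    limit = CLEARANCE_ORDER.index(clearance)
    edition = []
    for chapter in chapters:
        # Skip chapters that are not tagged as relevant to this audience.
        if audience not in chapter.audiences:
            continue
        visible = [
            p.text
            for p in chapter.paragraphs
            if CLEARANCE_ORDER.index(p.clearance) <= limit
        ]
        if visible:
            edition.append((chapter.title, visible))
    return edition


# Hypothetical sample content standing in for the tagged manual.
manual = [
    Chapter("Pre-flight checks", {"pilots", "technicians"}, [
        Paragraph("Inspect the landing gear.", "unclassified"),
        Paragraph("Verify the encrypted radio settings.", "classified"),
    ]),
    Chapter("Cabin service", {"flight attendants"}, [
        Paragraph("Secure the galley before takeoff.", "unclassified"),
    ]),
    Chapter("Tower handoff", {"air traffic controllers", "pilots"}, [
        Paragraph("Confirm the handoff frequency.", "unclassified"),
        Paragraph("Use the restricted corridor codes.", "secret"),
    ]),
]

pilot_edition = publish(manual, "pilots", "classified")
controller_edition = publish(manual, "air traffic controllers", "unclassified")
```

Both editions are derived from the same tagged source. Only the audience and clearance passed to `publish` differ, which is the point of tagging structure rather than formatting.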
So what is SGML? It is a data description language designed for, but not limited to, describing the structure of textual data. An SGML document has two parts: a DTD (Document Type Definition) and a document instance (the actual data). The DTD describes the structure of the instance. It identifies the legal tags in a document and their relationships to each other. The document instance contains the data, delimited by tags defined in the DTD.
1.2 This is a Title, (Top Secret) This is body Text, this is body text, This is body text, this is body text, This is body text, this is body text, This is body text, this is body text, This is body text.
The preceding paragraph could be encoded in numerous ways. How it is represented in SGML depends on the DTD used to 'mark it up.' Marking it up with the following DTD fragment (taken from a CALS DTD) would produce the SGML that follows the fragment.
<!ELEMENT para0 - - (title?, para+) >
<!ATTLIST para0 sec (u|c|s|ts) u >
<!ELEMENT title - - (#PCDATA) >
<!ELEMENT para - - (#PCDATA|emph)* >
<!ATTLIST para id NUTOKEN #IMPLIED
               NumStyle (Legal|Arabic) Arabic >
<!ELEMENT emph - - (#PCDATA) >
<!ATTLIST emph style (bold|italic|none) none >
<PARA0 Sec='TS'>
<TITLE>This is a Title</TITLE>
<PARA id='1.2' NumStyle='Legal'>
This is body Text, this is body text, This is body text, this is body text, This is body text, this is body text, <EMPH Style='Bold'>This is body text, this is body </EMPH> text, This is body text, this is body text, This is body text, this is body text, This is body text
</PARA>
</PARA0>
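To make the tree structure of the instance visible, the short Python sketch below parses a shortened version of it and prints one line per element. It relies on an assumption: this particular instance writes every end tag and quotes every attribute value, so it also happens to be well-formed XML and can be read with Python's standard `xml.etree.ElementTree` module. SGML in general allows constructs that an XML parser would reject, and this parser does not validate against the DTD. The `outline` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened version of the instance above (body text abbreviated).
instance = (
    "<PARA0 Sec='TS'><TITLE>This is a Title</TITLE>"
    "<PARA id='1.2' NumStyle='Legal'>This is body text, "
    "<EMPH Style='Bold'>this is body</EMPH> text.</PARA></PARA0>"
)

root = ET.fromstring(instance)


def outline(element, depth=0):
    """Return one indented line per element: its tag and its attributes."""
    lines = ["  " * depth + element.tag + " " + str(dict(element.attrib))]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(root)
print("\n".join(tree_lines))
```

The indentation of the printed lines mirrors the nesting that the DTD describes: a TITLE and a PARA inside PARA0, and an EMPH inside the PARA.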
Microsoft's vision of SGML is to broaden the accessibility of this technology without requiring users to understand its details. The largest obstacle to SGML usage today is the added cost of tagging documents, which results from reduced author productivity. The currently available SGML editing tools are typically not very user-friendly and are designed almost exclusively for the UNIX environment.
Our approach contrasts starkly with the current offerings. We hope that our product will greatly increase author productivity by allowing authors to work in a familiar and comfortable editing environment (Word) while still producing SGML.
SGML Author has two parts: a converter (end-user focused) and a separate mapping application (MIS focused).
To author an SGML document, end users simply construct their documents in Word as they normally would, except that they must use styles for all formatting. To ensure that they use the styles appropriately, users format according to an MIS-provided style guide and set of Word templates. To create SGML, the user then saves the file as SGML, just as they would export to any other file format. Likewise, users can choose to import SGML into Word.
Once the user has chosen to save an SGML representation of the file, an ASCII text file is created which contains syntactically correct (i.e., parseable) SGML. To achieve this, the converter may modify the Word file to ensure conformity to the DTD. For example, a DTD might have a <list> element which requires at least two <item> elements in the list. If the user had created only one list item, the converter would create a necessary, albeit empty, second item and inform the user of this fact. The results of any necessary modifications are returned to the user in the form of a new Word file which has been annotated to describe, in Word terminology, why the file was changed. For example, if the DTD required that <para> always follows <para0> and the user did not follow this convention, then the converter would automatically create a <para0> structure in the Word document. It would also insert a Word annotation to inform the user why this change was necessary. The user could then determine whether or not this new Word document is semantically correct (i.e., has the intended meaning) and make any appropriate edits. Importantly, the end user needs to know little about SGML throughout the entire process. The process of exporting a file is detailed in the diagram below.
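The list repair described above can be illustrated with a small Python sketch. It is not the converter's actual logic, which this overview does not specify: the function name `ensure_min_items`, the `minimum` parameter, and the wording of the annotation are all hypothetical.

```python
def ensure_min_items(items, minimum=2):
    """Pad a list with empty items until it meets the DTD's minimum.

    Returns the repaired list plus one annotation per change, so the
    author can review what was added and why.
    """
    repaired = list(items)  # work on a copy; leave the original untouched
    annotations = []
    while len(repaired) < minimum:
        repaired.append("")  # the necessary, albeit empty, item
        annotations.append(
            f"Added empty list item {len(repaired)}: "
            f"the DTD requires at least {minimum} items in a list."
        )
    return repaired, annotations


fixed, notes = ensure_min_items(["Check fuel level"])
for note in notes:
    print(note)
```

Returning the annotations alongside the repaired content reflects the division of labor in the text: the converter guarantees syntactic validity, while the author remains responsible for deciding whether the result still means what was intended.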
To ensure that the desired result is achieved, the converter has to be pre-configured to create the appropriate SGML. This is done by creating a mapping file using the provided Mapping Application. This application is geared toward individuals who are knowledgeable about SGML, and it allows them to build specific mappings between Word templates (i.e., styles) and the structures in the SGML DTD. Microsoft will provide pre-assembled mapping files and templates for CALS and ATA-100. Customers who have built their own DTDs will need to use the mapping application to build corresponding templates and mapping files, as detailed in the diagram below.
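Conceptually, a mapping file associates Word style names with elements in the DTD. The Python sketch below shows that idea in its simplest form. The actual mapping file format is not described in this overview, so the dictionary, the style names, and the `convert` function are hypothetical.

```python
# Hypothetical mapping from Word style names to SGML element names.
STYLE_TO_ELEMENT = {
    "Heading 1": "title",
    "Body Text": "para",
}


def convert(styled_paragraphs, mapping):
    """Wrap each (style, text) pair in the element mapped to its style.

    Returns the tagged lines and the list of styles that had no mapping.
    Text is emitted as-is; a real converter would also have to handle
    markup characters and nesting, which this sketch ignores.
    """
    output = []
    unmapped = []
    for style, text in styled_paragraphs:
        element = mapping.get(style)
        if element is None:
            unmapped.append(style)  # report it instead of guessing a tag
            continue
        output.append(f"<{element}>{text}</{element}>")
    return output, unmapped


document = [
    ("Heading 1", "This is a Title"),
    ("Body Text", "This is body text."),
    ("Caption", "Figure 1"),
]

sgml_lines, unmapped_styles = convert(document, STYLE_TO_ELEMENT)
```

Reporting unmapped styles rather than silently dropping them illustrates why authors are asked to format strictly according to the supplied templates: any style outside the mapping has no defined SGML counterpart.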
SGML Author for Word is planned for commercial availability in early 1995 and will be sold and distributed as an add-on to Microsoft Word. SGML Author requires Word for Windows version 6.0 or later to run.
©1993-1994 Microsoft Corporation. All rights reserved. Printed in the United States of America.
The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication.
This technical overview is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.
Microsoft, MS, and MS-DOS are registered trademarks and Windows is a trademark of Microsoft Corporation.