The following HTML(ized) document was originally obtained from a Microsoft FTP server on November 23, 1994. It was a Microsoft WinWord 6.0 document. See (if the link still remains) ftp://ftp.microsoft.com/Softlib/MSLFILES/SGMLFCTS.EXE. The file (105629 bytes) is a self-extracting executable that produces the WinWord 6.0 document in the disk file SGMLFCTS.DOC (370,688 bytes; 11-19-93). For greatest accuracy, readers are urged to obtain the authentic WinWord file rather than relying upon this text (HTML) version.
INTRODUCTION
  SGML Background
  A Standard is Born
  What the Standard Offers
  What Is SGML?
  An Example
  The DTD Fragment
  Resulting SGML
MICROSOFT SGML AUTHOR FOR WORD
  Market Background
  Goals of SGML Author
  How SGML Author Works
  The End User Model
  The MIS Model
  Expected Availability
This document provides background information on Standard Generalized Markup Language (SGML) and briefly discusses a forthcoming product from Microsoft to address the SGML authoring needs of Word users.
What follows is a brief history of the emergence of SGML. Note 1
The term "markup" originally referred to the marks handwritten on a manuscript by the copy editor or book designer to tell the compositor how the manuscript was to be formatted. With the introduction of computers and their use in typesetting, the markup instructions would typically be embedded in the text of the document through a process called "specific markup." These markup instructions were typically surrounded by obscure control characters to offset them from the body text, making the task of entering them very manual and time consuming. In addition, each new phototypesetting system used its own proprietary markup language, thereby locking consumers into a particular language and vendor.
In the early 1980s the Graphic Communications Association (GCA) set out to define a standard markup language known as "GenCode." However, it quickly became apparent that it would be very difficult to build a tag set that was general enough to serve the needs of all typesetter manufacturers without being unwieldy in size and scope. At the same time the GCA was working on solving these problems, an ANSI committee was defining a standard based on another computer typesetting language, Generalized Markup Language (GML). This standard represented the document as a hierarchical tree of different related elements, each of which would be formatted in a certain way. The two organizations combined their efforts and focused on the task of building one standard. In December 1986 the combined efforts of the committees were introduced by the International Organization for Standardization (ISO) as standard 8879, SGML.
First, SGML is a completely open standard which is platform, vendor, and application independent. SGML files are stored as ASCII text which ensures that they can be used on virtually any platform. Second, the power and promise of SGML comes once a document has been marked up with the appropriate SGML tags. By defining structure and relationship within previously unstructured information, SGML enables entirely new ways of managing, publishing and reusing that information. For example, an SGML database could store thousands of tagged documents, and it could use these tags to publish customized versions of the same document on demand.
To think about a hypothetical and extremely simple example of how such a system could work, pretend you have a 100,000-page airplane manual. As this document was originally written, every paragraph was tagged with a security clearance to say whether it was unclassified, classified, secret, or top secret. Chapters of the document were tagged to determine their relevance for technicians, air traffic controllers, pilots, and flight attendants. All of this information could then be parsed into an SGML database publishing system to facilitate on-demand publishing. Using the encoded tags to understand the structure of the document, you could create customized versions based on the original data. For example, you could just as easily create a version relevant to pilots with classified security clearance as you could create a version relevant to unclassified air traffic controllers. Because SGML stores information about the structure of the documents and not the formatting, different presentations can be used to suit the distribution model. For example, the information could just as easily be presented as a printed document or viewed by an on-line viewing tool. What were cross references in the printed document (e.g., see figure on page x) become hypertext jumps in the on-line document. By structuring this information, whole new levels of control and flexibility are gained to allow the use and reuse of the content.
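The airplane manual scenario can be sketched in a few lines of Python. This is only an illustration, not part of any SGML system: the class and function names are hypothetical, the sample content is invented, and it assumes that a reader cleared at one level may also see everything below that level.

```python
from dataclasses import dataclass

# Clearance levels named in the text, ordered from least to most restricted.
CLEARANCE_ORDER = ["unclassified", "classified", "secret", "top secret"]


@dataclass
class Paragraph:
    text: str
    clearance: str  # one of CLEARANCE_ORDER


@dataclass
class Chapter:
    title: str
    audiences: set    # e.g. {"pilots", "technicians"}
    paragraphs: list  # list of Paragraph


def publish(chapters, audience, clearance):
    """Return (chapter title, paragraph texts) pairs visible to one reader profile."""
    limit = CLEARANCE_ORDER.index(clearance)
    result = []
    for chapter in chapters:
        # Skip chapters that are not tagged as relevant to this audience.
        if audience not in chapter.audiences:
            continue
        # Keep only paragraphs at or below the reader's clearance (assumption).
        visible = [p.text for p in chapter.paragraphs
                   if CLEARANCE_ORDER.index(p.clearance) <= limit]
        if visible:
            result.append((chapter.title, visible))
    return result


# Invented sample data standing in for the tagged manual.
manual = [
    Chapter("Engine start", {"pilots", "technicians"}, [
        Paragraph("Check the fuel pressure.", "unclassified"),
        Paragraph("Override code procedure.", "secret"),
    ]),
    Chapter("Cabin service", {"flight attendants"}, [
        Paragraph("Stow the carts before landing.", "unclassified"),
    ]),
]

pilot_view = publish(manual, "pilots", "classified")
```

The same tagged source yields a different document for each combination of audience and clearance, which is the reuse the paragraph above describes.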
So what is SGML? It is a data description language designed for, but not limited to, describing the structure of textual data. An SGML document has two parts: a DTD (Document Type Definition) and a document instance (the actual data). The DTD describes the structure of the instance. It identifies the legal tags in a document and their relationships to each other. The document instance contains the data, delimited by tags defined in the DTD.
1.2 This is a Title, (Top Secret) This is body Text, this is body text, This is body text, this is body text, This is body text, this is body text, This is body text, this is body text, This is body text.
The preceding paragraph could be encoded numerous ways. How it is represented in SGML depends on the DTD used to "mark it up." If you mark it up using the following DTD fragment (taken from a CALS DTD Note 2), you would get the following SGML.
<!ELEMENT para0 - - (title?, para+) >
<!ATTLIST para0 sec (u|c|s|ts) u >
<!ELEMENT title - - (#PCDATA) >
<!ELEMENT para - - (#PCDATA|emph) >
<!ATTLIST para id #NUMBER #IMPLIED
               NumStyle (Legal|Arabic) Arabic >
<!ELEMENT emph - - (#PCDATA) >
<!ATTLIST emph style (bold|italic|none) none >
<PARA0 Sec="TS"><TITLE>This is a Title</TITLE><PARA id="1.2" NumStyle = "Legal">This is body Text, this is body text, This is body text, this is body text, This is body text, this is body text, <EMPH Style="Bold">This is body text, this is body</EMPH> text, This is body text, this is body text, This is body text, this is body text, This is body text </PARA></PARA0>
[Note to reader: I think the authors meant for the content model of "para" to be "(#PCDATA | emph)*" or "(#PCDATA | emph)+"; and for the declared value to be "NUMBER" (not "#NUMBER"), and for the id value "12" (not "1.2" -- or else declared value NUTOKEN in the DTD fragment) in the document instance. Otherwise, it looks like the parse would fail on all three counts.-- rcc]
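The point about the id value in the note above can be checked with a simplified test. This is a minimal sketch, not an SGML parser: it assumes the reference concrete syntax (name characters are letters, digits, period, and hyphen, with a NAMELEN limit of 8), and the function names are hypothetical.

```python
import re

NAMELEN = 8  # name-length limit assumed from the reference concrete syntax


def is_number(value):
    """NUMBER: one or more digits, no longer than NAMELEN."""
    return bool(re.fullmatch(r"[0-9]+", value)) and len(value) <= NAMELEN


def is_nutoken(value):
    """NUTOKEN: a name token that begins with a digit, no longer than NAMELEN."""
    return bool(re.fullmatch(r"[0-9][A-Za-z0-9.\-]*", value)) and len(value) <= NAMELEN


# The id value used in the document instance versus the alternative from the note.
checks = {
    "1.2 as NUMBER": is_number("1.2"),
    "12 as NUMBER": is_number("12"),
    "1.2 as NUTOKEN": is_nutoken("1.2"),
}
```

Under these assumptions "1.2" fails as a NUMBER because of the period but passes as a NUTOKEN, which matches the two repairs suggested in the note.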
Microsoft's vision of SGML is to broaden the accessibility of this technology without requiring users to understand the details of the technology. The largest problem facing SGML usage today is the increased cost of tagging documents due to decreased productivity. The currently available SGML editing tools are typically not very user friendly and are designed almost exclusively for the UNIX environment.
Our approach contrasts rather starkly with the current offerings, and we hope that our product will result in highly increased author productivity by allowing authors to work in a familiar and comfortable editing environment (Word) while still enjoying linkage with SGML.
SGML Author has two parts, a converter (end-user focused) and a separate mapping application (MIS focused).
To author an SGML document, end users simply construct their documents in Word as they normally would, except they must use styles for all formatting. To ensure that they use the styles appropriately, the users format according to an MIS-provided style guide and set of Word templates. To create SGML, the user then saves the file as SGML, just as they would export to any other file format.
Once the user has chosen to save an SGML representation of the file, an ASCII text file is created which contains syntactically correct (i.e., parseable) SGML. To achieve this syntactically correct SGML, the converter may modify the Word file to ensure conformity to the DTD. For example, a DTD might have a <list> element which required that there be at least two <items> in the list. If the user had only created one list item, the converter would create a necessary, albeit empty, second item and inform the user of this fact. The results of any necessary modifications are returned to the user in the form of a new Word file which has been annotated to describe in Word terminology why the file was changed. For example, if the DTD required that <para> always follows <para0> and the user did not follow this convention, then the converter would automatically create a <para0> structure in the Word document. It would also insert a Word annotation to inform the user why this change was necessary. The user could then determine whether or not this new Word document is semantically correct (i.e., it has the correct meaning), and then make any appropriate edits. Importantly, the end user needs to know little about SGML throughout the entire process. This usage model is detailed in the diagram below.
Diagram for the usage (end user) model [see original document for clarity; the following ASCII representation is approximate only]:
Exporting SGML: When the user saves a document as SGML, they can produce SGML tagged text, an annotated Word document to provide feedback, or both. The mapping file is provided by MIS in order to "configure" the converter appropriately.
Word Document ----> | CONVERTER | ------> Annotated Word Document
Mapping File -----> |           | ------> SGML Instance
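The repair-and-annotate behavior described for the converter can be illustrated with a small Python sketch. It is not Microsoft's implementation: the function names, the annotation wording, and the `<item>` element name are assumptions, and the sketch does not escape markup characters in the text.

```python
def enforce_min_items(items, minimum=2):
    """Pad a list with empty items until it has `minimum` entries.

    Returns the repaired list and one annotation per change made.
    """
    repaired = list(items)  # copy so the caller's list is left untouched
    annotations = []
    while len(repaired) < minimum:
        repaired.append("")
        annotations.append(
            f"Added empty list item {len(repaired)}: "
            f"the DTD requires at least {minimum} items."
        )
    return repaired, annotations


def to_sgml(items):
    """Serialize the items as a <list> element with one <item> per entry."""
    body = "".join(f"<item>{text}</item>" for text in items)
    return f"<list>{body}</list>"


# A user wrote a one-item list; the hypothetical DTD demands two.
repaired, notes = enforce_min_items(["Check fuel"])
sgml = to_sgml(repaired)
```

Here `sgml` holds a parseable list with an empty second item, and `notes` plays the role of the Word annotation that tells the user what was changed and why.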
To ensure that the desired result is achieved, the converter has to be pre-configured to create the appropriate SGML. This is done by creating a mapping file using a provided Mapping Application. This application is geared toward the SGML-knowledgeable individual, and it allows this individual to build specific mappings between Word templates (i.e., styles) and the structures in the SGML DTD. Where standard DTDs do exist (e.g., CALS), Microsoft will provide pre-assembled mapping files and templates. Customers who have built their own DTDs will need to use the mapping application to build corresponding templates and mapping files. This is detailed in the diagram below.
Diagram for the MIS model [see original document for clarity; the following ASCII representation is approximate only]:
Creating a Mapping File: To build a mapping file, MIS must first supply a template and a DTD. They then use the Mapping Application to associate these two items and build a mapping file for use by the converter.
Word Template ----> | MAPPING     | ------> Mapping File
SGML DTD ---------> | APPLICATION |
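One way to picture a mapping file is as a table from Word style names to SGML element names. The Python sketch below is purely illustrative: the real mapping file format is not described in this document, and the style names, the `convert` function, and the sample paragraphs are all hypothetical.

```python
# Hypothetical mapping file contents: Word style name -> SGML element name.
STYLE_TO_ELEMENT = {
    "Heading 1": "title",
    "Body Text": "para",
}


def convert(styled_paragraphs, mapping):
    """Wrap each (style, text) pair in the element its style maps to.

    Returns the SGML fragments plus the styles that had no mapping,
    so they can be reported back to the author.
    """
    fragments = []
    unmapped = []
    for style, text in styled_paragraphs:
        element = mapping.get(style)
        if element is None:
            unmapped.append(style)
            continue
        fragments.append(f"<{element}>{text}</{element}>")
    return fragments, unmapped


# Invented sample document: each paragraph carries the Word style applied to it.
document = [
    ("Heading 1", "This is a Title"),
    ("Body Text", "This is body text."),
    ("Caption", "Figure 1"),
]
fragments, unmapped = convert(document, STYLE_TO_ELEMENT)
```

A lookup like this suggests why the end user model insists on styles for all formatting: text without a mapped style gives the converter nothing to translate.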
This product is planned for commercial availability in the first half of 1994. The initial release will be for Microsoft Windows, and it will be followed by releases for the Apple(tm) Macintosh and Windows NT. These products will be sold and distributed as separate add-ons to Microsoft Word, and they will require Word 6.0 or later to run.
Note 1 This is derived from an article by Elizabeth Gilmore in the Journal of the Society for Technical Communication (Volume 40, Number 2), May 1993.
Note 2 The CALS (Computer-Aided Acquisition Logistics Support) Initiative is a program of the US Department of Defense (DOD) and its vendors that requires the use of SGML to maintain documentation, contracts, and contract proposals. CALS governs a vendor's interaction with the DOD.
(c) 1993 Microsoft Corporation. All rights reserved. Printed in the United States of America.
The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication.
This technical overview is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY.
Microsoft, MS, and MS-DOS are registered trademarks and Windows and Windows NT are trademarks of Microsoft Corporation.
Apple and Macintosh are registered trademarks of Apple Computer, Inc.
10/93 Part. No. 098-53048