Simple SGML

[This local archive copy mirrored from: http://www.i4i.org/simpl.htm; see the canonical version of the document.]

SGML Made SIMPL

Michel Vulpe
Founder and CEO of
Infrastructures for Information Inc.

Abstract

In spite of its name, Standard Generalized Markup Language (SGML) is, at its core, a data schema language, not just a markup language.

Its role as a markup language for text presentation is well understood. As such, it is one of the pillars of the WWW phenomenon. To limit SGML to text markup, however, is to do a disservice to its power. Textual presentation schemas, while important, constitute but one domain in which SGML can be applied. SGML can be used to specify schemas for many types of data and even for behaviors.

SGML has two separate and distinct roles. The first role is for document (that is, text document) interchange. The second, more interesting role, is as a data schema language that supports structural semantics: that is, how objects relate to other objects. It is this use of SGML that allows it to play a fundamental role in managing complex systems.

In this paper we will consider:

SGML as a text-encoding technology
SGML as a data schema language
the future of SGML
how to achieve this future

SGML as a Text-Encoding Technology

Text interchange is where SGML has traditionally been focused. One of the key problems confronting the document manufacturer of the 1960s through the early 1990s was how to exchange documents in a world of proprietary closed systems where there was no clear document standard. Computer scientists and document engineers looked at the problem and determined, quite reasonably, that the problem was that of "metacodes"ⁱ, that is, the codes that tell an application how to treat a stream of text. Each document application has its own unique way of identifying paragraphs, bolds, tables, graphic callouts, and so on. Replacing these metacodes with non-proprietary descriptive codes that any system could read would result in "open" documents.

Document interchange under this model is accomplished by encapsulating text in SGML tags in just the same way as a word processor encapsulates text in its metacodesⁱⁱ. This model is derived from the way in which applications, particularly word processors, work. Software applications are typically self-contained. They provide a class of services to the user: word processing, drawing, email, web browsing, calendar services, etc. The application holds the content in the same data object as the metacodes that it uses to identify and organize the content. The format of that data object is specific to the application -- the file/application paradigm. Each application embeds its own unique metacodes in the content, and this is what creates a proprietary data format.

The traditional SGML solution to the problem of proprietary data formats says that if an application replaces its proprietary metacodes with ISO standard "SGML metacodes" (or tags) then the problem is eliminated. This solution, however, has its own problems, one of the most important being that the SGML metacodes, in and of themselves, have no meaning. By convention, some of them have taken on meaning, for example, CALS (Continuous Acquisition and Lifecycle Support) table tags. But by and large the metacodes have remained descriptive, that is, they have no meaning in the sense that hexadecimal two-zero means "space" in ASCII. This, of course, means that systems that use SGML metacodes need further information so that they can present the tagged information in some useful form. To achieve this, the supporting applications have adopted the metacodes as their internal formatting model. Meaning is assigned either externally through, for instance, FOSIs (Formatted Output Specification Instances), or internally through style sheets. Regardless of how it is done, the net result is that large numbers of metacodes are required in order to achieve a level of functionality that matches commercial-grade information manufacturing tools.

All things considered, as a text-encoding language that duplicates the formatting capabilities of most proprietary systems, SGML has been reasonably successful. Its most visible success story, of course, is HTML, one of the foundations of the World Wide Web. HTML is replete with traditional word processing concepts: paragraphs, headings, tabs, breaks, tables, etc. Its sophistication has allowed webmasters to develop quite startlingly attractive Web pages. An important factor in the success of HTML is that, as a rule, the tools that support HTML do not use the HTML DTD. (Actually, most authors of HTML probably do not even know that there is an HTML DTD.) The DTD is embedded in the tool, and is mapped directly to the tool's internal model. The DTD is the tool's metacode set, in exactly the same way as the MS Word internal formatting codes are its metacode set. HTML is SGML without any of the overhead of declarations and DTDs.

The recent introduction of xML, which is being positioned as a dialectⁱⁱⁱ of SGML, is the result of the lessons learned from HTML. One of these lessons is that the overhead of SGML is excessive for the encoding of text formatting, although it is potentially useful for identification and navigation.

An SGML document is a three-part object: the declaration, the DTD (or schema), and the instance (or subschema). For even moderately complex documents, the volume of information that has to be moved around is enormous. Moreover, the complexity and interdependencies of the information can be daunting. Simple errors, such as the use of SYSTEM identifiers when moving between Mac, PC, and UNIX systems, are frequent enough and troublesome enough to call into question the very notion of SGML as an interchange format. Moving RTF files (or even Word files) between platforms is often easier and more rewarding.

Removing two parts of the SGML document, the DTD and the declaration, significantly reduces the complexity of the environment. Text, encapsulated in tags with supporting attributes, provides enough functional information for most text interchange applications. The tags and their attributes provide basic navigation and search aids, as well as guides for presentation. Style sheets are also needed, since the receiving application may never have seen the tag set before, and therefore has no knowledge of how to present text encoded with it. But the overhead is less than that of delivering full SGML documents which still need the style sheets anyway.

xML continues in the spirit of HTML. The overhead and complexities of SGML are largely removed without losing the dynamism of a rich text-encoding methodology.

SGML as a Data Schema Language

SGML, as a data schema language, has traditionally been implemented as support for structured authoring.

In addition to solving the problem of proprietary metacodes, the developers of SGML sought to address the problems of the freeform nature of information authoring. Early word processors provided little if any support for structure. Styles and stylesheets are a relatively recent innovation^iv. Effective programmatic control over the authoring process was only implemented through forms systems that were far too limiting for effective authoring. In an effort to address this problem, SGML includes a means of describing and enforcing structures.

Structure in the information authoring process is a means of conveying information. Relationships between information objects are fundamental to understanding their relative roles in an information product. A well-crafted information product conveys much of its information through its structure. Different products have different structures -- in fact, a product can often be identified by its structure. For example, a modern textbook, a military manual, and a novel are all recognizable (and, to an extent, distinguishable) from the shapes of their outlines or tables of contents. The designers of SGML leveraged this insight and provided us with a formalism that allows us to develop grammars for these different genres.

A genre grammar is referred to as a DTD. A DTD can describe the structure of a section in a document:

     <!element section - - (title?, (para | list)*) >

There is nothing to say that the same formalism cannot be used to describe other types of objects such as:
a wall

     <!element wall - - (window*, door?) >

or a desktop

     <!element desktop - - (computer?, (memo | notepad | pencil)+) >

or a dashboard

     <!element dashboard - - (tachometer?, speedometer, (radio & climate)?) >

This grammar tells us, for instance, that an object known as a "speedometer" is part of an object known as a "dashboard" and that a "dashboard" may also have "tachometer", "radio", and "climate" objects. Even more important, it tells us something about the order and occurrence of these objects: the optional tachometer is first, the required speedometer is next, and then the optional radio and climate objects can come in any order, but if one is present the other must be present as well. Surely this information is as useful to an engineer as it is to a graphic designer, as it is to an assembly robot, as it is to a technical documentation writer.

Without SGML, considerable application software would be needed to express this logic. In many cases, the cost of developing and maintaining this complex code would be far in excess of the gain. A computer language that can express that logic and make it available to applications is a tremendously powerful tool.
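To make the comparison concrete, here is a minimal sketch of what an application would have to hand-code to enforce the single dashboard declaration above. It is written in Python purely for illustration; the function name and the representation of a dashboard as a list of child element names are our assumptions, not part of any SGML toolkit.

```python
def valid_dashboard(children):
    """Check a list of child element names against the content model
    (tachometer?, speedometer, (radio & climate)?)."""
    items = list(children)
    # tachometer? : an optional tachometer may come first.
    if items and items[0] == "tachometer":
        items = items[1:]
    # speedometer : exactly one is required next.
    if not items or items[0] != "speedometer":
        return False
    items = items[1:]
    # (radio & climate)? : either nothing more, or both objects in any order.
    return items == [] or sorted(items) == ["climate", "radio"]
```

Every change to the content model means editing and retesting code of this kind, whereas in SGML it is a one-line change to the declaration.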

Information, to be useful, is organized, and organized in specific ways. SGML can tell us about that organization in a way that no other technology can. In an SQL environment, organizational structure is expressed in an application. In an object environment, structures are created for this purpose using the application language, and are thus highly specific and not very portable. Consider the complexity of expressing

     <!element dashboard - - (tachometer?, speedometer, (radio & climate)?) >

(trivial in SGML) in some other data language.

Network Management

One interesting use of SGML is for network management. Network configuration follows a set of rules. These rules are specific to both the type of network and the technologies involved. Traditionally, the requirements and constraints for configuring the network are expressed in complex application rules that are hard-coded into the configuration software provided by the network vendor. As technology and products evolve, the rules change, requiring new or updated software. Expressing the configuration rules in a schema language would relieve the vendor of much of the software development and software maintenance costs. In fact, it could even provide the industry with a standardized way of interchanging configuration information.

Fragment of i4i Network DTD:

     <!element fileserver - -
     ((fileserver|hub|printserver|workstation|connection|printer|peripheral)*) >
     <!attlist     fileserver
                   id ID #required
                   netconnections NUMBER #required
                   peripheralconnections NUMBER #required >
     <!element hub - - (generalinfo?, (hub|printserver|workstation)*)>
     <!attlist     hub
                   id ID #required
                   netconnections NUMBER #required
                   peripheralconnections NUMBER #required >
     <!element workstation - - (generalinfo?, (printer|peripheral)* ) >
     <!attlist workstation
                   id ID #required
                   netconnections NUMBER #required
                   peripheralconnections NUMBER #required >

A visual application can be developed that uses the DTD (the generic information needed to configure a network of a certain type) to guide the user through the configuration of a specific network. The output of the application is an instance that conforms to the DTD, and that embodies the specific information needed to configure this specific network. The resulting instance can be processed by the network software with the full assurance that it will result in a proper configuration, because it is valid according to the DTD.
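The validation step can be sketched as follows. This is an illustration only: the content models are transcribed by hand from the DTD fragment above rather than parsed from it, the instance is represented as nested (name, attributes, children) tuples of our own devising, and uniqueness of ID values is not checked.

```python
# Hand-transcribed from the DTD fragment: (optional first child, repeatable children).
# A real tool would read these rules from the DTD itself.
CONTENT_MODELS = {
    "fileserver": (None, {"fileserver", "hub", "printserver", "workstation",
                          "connection", "printer", "peripheral"}),
    "hub": ("generalinfo", {"hub", "printserver", "workstation"}),
    "workstation": ("generalinfo", {"printer", "peripheral"}),
}
REQUIRED_ATTRS = ("id", "netconnections", "peripheralconnections")
NUMBER_ATTRS = ("netconnections", "peripheralconnections")


def validate(node, errors=None):
    """Return a list of error strings for a (name, attrs, children) tree."""
    errors = [] if errors is None else errors
    name, attrs, children = node
    if name not in CONTENT_MODELS:
        return errors  # elements not declared in the fragment are not checked
    for attr in REQUIRED_ATTRS:
        if attr not in attrs:
            errors.append(f"{name}: missing required attribute {attr}")
    for attr in NUMBER_ATTRS:
        if attr in attrs and not str(attrs[attr]).isdigit():
            errors.append(f"{name}: attribute {attr} must be a NUMBER")
    optional_first, repeatable = CONTENT_MODELS[name]
    rest = list(children)
    # Skip the optional leading element (generalinfo?) if it is present.
    if optional_first and rest and rest[0][0] == optional_first:
        rest = rest[1:]
    for child in rest:
        if child[0] not in repeatable:
            errors.append(f"{name}: {child[0]} not allowed here")
        validate(child, errors)
    return errors
```

An empty list means the instance conforms to the transcribed rules; a hub placed inside a workstation, for example, is reported as "workstation: hub not allowed here".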

What SGML provides to the application developer in this scenario is, first of all, a means of specifying the rules or schema of network configuration in a neutral language, and secondly, a means of sharing both the rules (the DTD), and the use of those rules (the instance) between applications.

Knowledge Representation

Like most GUI software, the application for network configuration would normally have some form of context-sensitive online help. Context-sensitive help typically has a one-to-one relationship between the current application object and the help object. Selection of the "Help" icon transfers to the help system an identifier for the current screen or field. This results in the corresponding information box being displayed.

What this type of context sensitivity does not provide is: context! True context sets the current object within a stream of events -- for example, the printer configuration screen was selected first, then the fileserver screen, then the hub screen, and so on. A truly context-sensitive help system would know that, when selecting help from the hub screen, the user had already been to the printer configuration screen and the fileserver screen, in that order. It would also know what values had been entered into the system at those points, because those values may have an impact on the way in which the hub is configured.

In the same way that the rules of the network can be expressed in an SGML schema, the rules for knowledge presentation can be expressed in an SGML schema. A DTD can specify, given a context, what specific and supporting information is required. The application that builds the information product acts on the rules provided in the DTD. The DTD is used to represent the knowledge that the network manufacturer has about how the documentation interacts with network events, as well as the knowledge base the user needs to understand the documentation.

The network configuration application passes data to the tool that builds the information product, which evaluates the current data and what data has already been entered. Based on that information, the tool queries the DTD to establish what documentation objects are necessary. For instance, if the current application object is "printer configuration", the schema informs the tool that this explanation is within the context of "workstation", therefore when assembling the information product it should include the discussion for "workstation". It also informs the tool that the user may need information on "peripheral". The tool then assembles the objects into a single HTML file, and presents it to the user as a help screen in a WWW browser.
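The assembly step described above might look something like the following sketch. The CONTAINS table stands in for what a tool would obtain by querying the DTD (here, only the workstation declaration), and the TOPICS text is placeholder content; both, along with the function names, are hypothetical and are not a description of any actual product.

```python
# Hypothetical stand-in for the relationships a tool would read from the DTD:
# workstation contains printer and peripheral.
CONTAINS = {"workstation": ["printer", "peripheral"]}

# Placeholder documentation objects, keyed by element name.
TOPICS = {
    "printer": "How to configure a printer.",
    "workstation": "How a workstation fits into the network.",
    "peripheral": "How peripherals attach to a workstation.",
}


def help_objects(current):
    """List the current topic, its enclosing context, then related siblings."""
    result = [current]
    for parent, children in CONTAINS.items():
        if current in children:
            result.append(parent)
            result.extend(c for c in children if c != current)
    return result


def build_help_page(current):
    """Assemble the selected documentation objects into one HTML page."""
    sections = "\n".join(
        f"<h2>{name}</h2>\n<p>{TOPICS[name]}</p>" for name in help_objects(current)
    )
    return f"<html><body>\n{sections}\n</body></html>"
```

Because the relationships live in the table rather than in the topic text, changing which objects accompany "printer" requires no change to the printer documentation itself.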

Without the support provided by the schema, the relationships between "workstation", "printer", and "peripheral" would need to be embedded in the printer documentation. Changes in the relationships would require reworking the printer documentation to account, not for the change in the printer information per se, but for the changes in the relationship. By having the relationships represented in an SGML schema, changes are immediately effected by changing the DTD, without needing any changes to the content.

The Future of SGML

The future of any technology is always hard to predict, and predictions may expose biases in the would-be prophet that are better left buried. However, a technology company must invent its future.

The needs of information manufacturers are becoming more pressing. Information products are the norm in virtually all industries. Information, once thought of as something you gave away because you had to, is now carefully packaged and sold as a key product differentiator. The more sophisticated the product, the more costly the information, and with competitors pushing the envelope, the spiral is set in motion.

These pressures are creating a need for information product databases. Databases that can support the needs of the information manufacturer are very different from those required to meet the needs of traditional applications. In fact, the needs of the information manufacturer are the reverse of traditional databases that remove relationships between data from the database and put them into the application. In the information-manufacturing environment, managing the relationship is where the value is added.

The relationships that are of interest to the information manufacturer can be relatively simple, such as those in a lineal text document. However, more complex relationships are rapidly becoming the norm. Multimedia information technology is a dynamic data-driven environment that is currently fueled by application-level logic instead of data-level logic. In manufacturing environments, the information requirements are driven by the behaviors of the equipment, which in turn has its own information requirements. In these environments, where management of data from many sources is overwhelming, a computer language that provides data-level support for the definition and management of complex relationships offers huge potential gains.

The future of SGML is in being that language.

SGML will become a background technology in this domain in much the same way as SQL has become a background technology in its domain. SGML will be used to describe complex systems and to share information about system behaviors. Technologies that use SGML will be embedded in all sorts of systems, from network managers to graphics packages to text processors.

SGML will become a foundation technology if, and only if, it recognizes its strengths and leverages its power.

How to Achieve this Future

Achieving this future will not be easy and will take considerable time. The most important step, though, is a technical one. The constraints of text encoding must be removed. The traditional model of:

     <para>This is a <emph>paragraph</emph></para>

creates a fatal constraint on data, because it embeds in the data stream the information about how the stream should be processed. This results in the locking of the data into that format. While SGML can be used for text encoding, a means must be found to harness its power independently of data type and application.

Infrastructures believes that this can be achieved through what it refers to as the "associative model", which is managed by a "service provider" to any class of application^v. The associative model argues that encoding, be it SGML or otherwise, need not exist inside the content object. If one holds the content and the encoding in an associative relationship such as

     <para> 0 </para> 19
     <emph> 10 </emph>19
     This is a paragraph

and resolves them at run time, then multiple understandings of a single piece of content are possible.

     <para>This is <emph>a </emph>paragraph</para>

is effected not by data duplication, but by creating another associative relationship:

     Relationship 1

     <para> 0 </para> 19

     <emph> 10 </emph>19

     Relationship 2

     <para> 0 </para> 19

     <emph> 8 </emph>10

     This is a paragraph

An application selects the relationship that meets its needs, and resolves the content according to the corresponding encoding.
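Run-time resolution of this kind can be sketched in a few lines. The sketch below is ours and is not a description of any particular implementation; it assumes that each relationship is a list of (tag, start, end) character offsets into the content, listed outermost first and properly nested.

```python
def resolve(content, relationship):
    """Merge a content string with a list of (tag, start, end) offsets
    and return the conventional inline markup."""
    opens, closes = {}, {}
    for tag, start, end in relationship:
        opens.setdefault(start, []).append(tag)    # outer tags open first
        closes.setdefault(end, []).insert(0, tag)  # and close last
    out = []
    for pos in range(len(content) + 1):
        # Close tags ending at this offset before opening new ones.
        for tag in closes.get(pos, []):
            out.append(f"</{tag}>")
        for tag in opens.get(pos, []):
            out.append(f"<{tag}>")
        if pos < len(content):
            out.append(content[pos])
    return "".join(out)


content = "This is a paragraph"
relationship_1 = [("para", 0, 19), ("emph", 10, 19)]
relationship_2 = [("para", 0, 19), ("emph", 8, 10)]
```

With these definitions, resolve(content, relationship_1) yields the first encoding shown earlier and resolve(content, relationship_2) yields the second, while the content string itself is stored only once.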

Not only does the associative model separate the encoding from the content, but, more importantly, it makes (almost) no assumptions about the internal organization of the data. For example, if Application 1 uses vector IDs to identify data because that is the way it needs it, and Application 2 works with text, the model would be:

     Relationship 1

     <para> V12321 </para>

     <emph> Q121 </emph>

     Relationship 2

     <para> 0 </para> 19

     <emph> 8 </emph>10

     This is a paragraph

The associative model allows any technology that has identifiable data objects (the only assumption of this model) to associate the objects in that data with a schema. The technology does not have to embed the schema, or the results of the use of the schema (a subschema), in database jargon. It only needs to be able to inform a tool about data objects, and request information about the schema from the tool in a manner consistent with how the tool understands data. Network managers or graphics packages do not natively support SGML, but through the associative model, any application can have access to SGML.

The associative model provides the mechanism by which the power of SGML as a schema language can be realized, because it removes application dependencies.

Conclusion

We have argued that relationships are at the core of complex systems, and that one way of overcoming the application bottleneck is by moving the logic of the relationships out of the application and into a data schema.

We have argued that SGML provides us with a language to describe and manage those relationships.

Finally, through the associative model, we have offered a means of making SGML, as a schema language, available to all types of technologies. The challenge now is to determine whether we can leverage this to bring the discipline of the database into a complex production environment -- be that information manufacturing, network management, or collaborative authoring.

The potential of SGML is being lost because of history. Its very name is its undoing: Standard Generalized Markup Language -- a name that suggests encoding and format. Its most successful implementation, HTML on the WWW, conveys only format. WWW products read and produce documents in the HTML format. Even SGML implementations usually start off using SGML as a format -- the RFI usually includes the question, "How do we get our documents into SGML format?".

However, once the documents are in the SGML "format", sophisticated users quickly see the benefits they can realize because the documents are now mini databases. (This benefit assumes that the documents have some form of semantic markup.) The simple view of a document as a database is that tagged objects are now addressable. For example, <partno> is uniquely identified and can be found. The more sophisticated view recognizes that not only is <partno> identifiable, but that the relationship of <partno> to the rest of the dataset can be defined and made available. SGML is now a very powerful tool to manage the creation and use of document objects, albeit text documents. This of course leads to the question, "If SGML is so powerful, why are we limited to text?".

In truth, there is no reason, except that the "markup" part of SGML embeds the codes in the datastream, effectively making the markup compatible only with text. The associative model removes that limitation. By putting the metacodes and the content in an associative relationship, the power of SGML, as a language to describe and manage complex relationships, is now open to any class of data, and any type of problem that requires this type of support.

So, if SGML is not the right name, what is the right name?

As a Language for the Manufacturing and Processing of Structured Information, we might say that the answer to that question is SIMPL. For anyone who is acronymically challenged, the name is "Structured Information Manufacturing and Processing Language".

ⁱIn the word processor environment, the "reveal codes" function in WordPerfect exposes a subset of the metacodes that that product uses.
ⁱⁱxML, the extensible Markup Language, more appropriately meets this objective than does SGML.
ⁱⁱⁱSee "XML, Java, and the future of the Web", Jon Bosak, Sun Microsystems.
^iv"Nowadays most of us have newer word processors, and/or a choice of specialized screenplay formatting programs, so there isn't much need for using SGML". (from comp.text.sgml)
^vInfrastructures' S4 Technology provides this service. It is delivered as libraries that are linked into applications to provide access to SGML as a schema language, and to manage the associative relationship.



Created: 08/02/97 Updated: 25/07/97
