Background on SGML

COMPILED AND EDITED BY TERRY GIRILL, JULY 1994

Educational materials on SGML fall into three broad categories, as shown in this comparative chart. An evaluative abstract is available for each item on the chart marked (*).



Topic or        General education       Open technical issues    SGML implications
scope:          on SGML                 or SGML problems         for authors

Primary         DOE publishing          DTD writers, policy-     Scientists and
audience:       professionals           makers & tech leads      engineers who
                (writers,               of SGML projects         publish
                librarians)

Character of    Broad, explanatory,     Focused, presuppose      Approach SGML use,
materials:      suitable for ref        SGML background          features, probs. from
                or training too         already, advocate        perspective of those
                                        specific solutions       who USE text (STI)

Examples:       TEI, Gentle Intro,      Nicholas & Welsch,*      Van Herwijnen,*
                (Ch. 2)*                SGML/ODA interchange     Use...in physics

                Bryan, SGML: An         Kircz, Rhetorical
                Author's Guide          Structure*

                Girill,* SGML &         Mamrak & Barnes,*
                Science Comm.           DTD considerations

(*) = abstracts available for these.

Full bibliographic references for the cited items:

Martin Bryan, SGML: An Author's Guide (Wokingham, England: Addison-Wesley Publishing Company, 1988), 364 pp.

T. R. Girill, "How SGML Promotes Scientific Communication," STC Exchange, 1 (Fall, 1993), 2-5.

Joost G. Kircz, "Rhetorical Structure of Scientific Articles: The Case for Argumentational Analysis in Information Retrieval," Journal of Documentation, 47 (December, 1991), 354-372.

Sandra A. Mamrak and J. A. Barnes, "Considerations for Preparation of SGML Document Type Definitions," Electronic Publishing, 4 (March, 1991), 27-42.

Charles K. Nicholas and Lawrence A. Welsch, "On the Interchangeability of SGML and ODA," Electronic Publishing, 5 (September, 1992), 105-130.

Text Encoding Initiative, C. M. Sperberg-McQueen and Lou Burnard (Eds.), Guidelines for the Encoding and Interchange of Machine-Readable Text, Ch. 2 (Sec. 2.1), "A Gentle Introduction to SGML," 23 pp.

Eric Van Herwijnen, "The Use of Text Interchange Standards for Submitting Physics Articles to Journals," Computer Physics Communications, 57 (December II, 1989), 244-250.

HOW SGML PROMOTES SCIENTIFIC COMMUNICATION

T. R. Girill, "How SGML Promotes Scientific Communication," STC Exchange, 1 (Fall, 1993), 2-5.

This concise paper introduces SGML to an audience of professional editors and others active in technical publishing (originally, it was addressed to members of the Science Communication SIG of the Society for Technical Communication). It places SGML in the context of the three intellectual problems it solves (preserving editorial decisions, document exchange, and information life-cycle management) as well as the social setting in which it developed (standards committees, software vendors, and influential government agencies). SGML's assumptions (that text structure can be made explicit and that content and format are separable) receive just enough illustration using typical elements, attributes, and entities to reveal the flavor of SGML encoding without really teaching any of the technical details.
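The two assumptions named above can be made concrete with a small sketch. It is not taken from Girill's paper: the element names (article, title, para), the "status" attribute, and the two style tables are all hypothetical, and the code models only elements and attributes, not entities. It shows one explicitly structured text being serialized as SGML-style markup and then formatted two different ways without touching the content.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """One structural unit of a document: a name, attributes, and content."""
    name: str
    text: str = ""
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def to_markup(el: Element) -> str:
    """Serialize the explicit structure as SGML-style tagged text."""
    attrs = "".join(f' {key}="{value}"' for key, value in el.attrs.items())
    inner = el.text + "".join(to_markup(child) for child in el.children)
    return f"<{el.name}{attrs}>{inner}</{el.name}>"


def render(el: Element, style: dict) -> str:
    """Format the same content using a separately supplied style table."""
    template = style.get(el.name, "{}")  # unstyled elements pass through
    inner = el.text + "".join(render(child, style) for child in el.children)
    return template.format(inner)


# Hypothetical document and style tables, for illustration only.
article = Element("article", attrs={"status": "draft"}, children=[
    Element("title", "Heat Flow"),
    Element("para", "Results follow."),
])

PRINT_STYLE = {"title": "{}\n\n", "para": "    {}\n"}
SCREEN_STYLE = {"title": "== {} ==\n", "para": "{}\n"}

print(to_markup(article))
print(render(article, PRINT_STYLE))
print(render(article, SCREEN_STYLE))
```

Because the formatting lives entirely in the style tables, an editorial decision recorded in the structure (this text is a title) survives any change of output format.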

This short strategic overview could be good preparation before reading the more thorough TEI "Gentle Introduction" or the much more elaborate treatment in Martin Bryan's book. It is also suitable reading for managers and policy makers who need to be aware of SGML's implications without having to really learn its machinery.


RHETORICAL STRUCTURE OF SCIENTIFIC ARTICLES: THE CASE FOR ARGUMENTATIONAL ANALYSIS IN INFORMATION RETRIEVAL

Joost G. Kircz, "Rhetorical Structure of Scientific Articles: The Case for Argumentational Analysis in Information Retrieval," Journal of Documentation, 47 (December, 1991), 354-372.

Kircz spends half of this article plausibly surveying the problems that beset current attempts at effective information retrieval (IR) among scientific articles, then spends the other half implausibly suggesting an SGML-supported response to those problems. The result is fascinating to read but impractical to adopt.

In a very perceptive if fast-paced tour of IR work over the last few decades, Kircz argues that IR generally (1) ignores the diversity of viewpoints that searchers have (e.g., informed, partly informed, or uninformed in different cases), (2) relies too heavily on overt internal (bibliographic), external (assigned keywords), or transmittal (citation) identifiers, all secondary to the author's content, and (3) wrongly assumes (with computer-assisted indexing and full-text searching) that the author's words alone suffice to represent his or her content and the value of the text.

Kircz then proposes that enriching IR with "an argumentational analysis of science papers" that captures their neglected rhetorical structure will help. He suggests using SGML document analysis and encoding to identify not only such official features as sections and subsections, but also such latent rhetorical elements as assumptions made, outside facts mentioned, points of view adopted, constraints applied, and references to oneself and others (365, 368). Intriguingly, he suggests that both hypertext links and SGML "concurrent" markup (though not by name) are relevant to mapping these features. Unfortunately, Kircz fails to explain just how encoding this rhetorical structure would actually improve those IR efforts that remain "unable to deal with the real problem of finding information one does not yet appreciate sufficiently to be able to phrase questions about in formalised terms" (364).

This is a paper best appreciated by those with some background in information retrieval or rhetoric. Although every SGML enthusiast will sympathize with Kircz's goals, his case remains too weak to guide most practical SGML projects.


CONSIDERATIONS FOR PREPARATION OF SGML DOCUMENT TYPE DEFINITIONS

Sandra A. Mamrak and J. A. Barnes, "Considerations for Preparation of SGML Document Type Definitions," Electronic Publishing, 4 (March, 1991), 27-42.

The authors of this article argue in clear prose with ample examples that document analysts and DTD designers should (1) never use any of the SGML markup minimization features (they yield complex and unpredictable tagging results, especially during text updates), (2) never use attributes (they are not the best way to store metainformation), (3) avoid inclusion and exclusion exceptions (they often have unintended bad side effects), and (4) take advantage of the CONCUR feature (it promotes clear and simple encoding).

Since every one of these proposals is very controversial, perhaps the real value of this article lies not in its overt claims but in the design context in which the authors place all four. Many production-strength DTDs exceed the complexity of a typical computer programming language, making thoughtful, strategic design crucial for success. Mamrak and Barnes remind their readers that DTD success should always be two-fold, involving both understandability for humans and reliability for the software that processes the instructions. They advocate their (severe) design proposals as a means of achieving such two-fold success.

Because of its presentation, this article is suitable even for audiences just learning about SGML (if they are warned that it is controversial). In fact, reading it early in one's SGML career will focus attention on powerful aspects of the formalism whose significance might otherwise be overlooked by the novice. Regardless of whether the authors make their case, every reader who learns to look beyond the obvious implications of SGML features owes them a debt. Alas, their 17 references are almost entirely to private communications or obscure technical reports.


ON THE INTERCHANGEABILITY OF SGML AND ODA

Charles K. Nicholas and Lawrence A. Welsch, "On the Interchangeability of SGML and ODA," Electronic Publishing, 5 (September, 1992), 105-130.

This fairly detailed and technical article compares the chief features of SGML with those of ODA (Office Document Architecture) and spells out the conditions under which documents prepared using one of these schemes can be reliably and automatically translated into the other.

The elements, tags, content models, and DTDs of SGML are compared with the logical objects and the specific and generic logical structures of ODA by looking at how each treats the parts and relationships in a typical basic journal article. Against this background (about half the article), the authors argue that two-way interchange between SGML and ODA requires isomorphism between the elements and content models of the SGML version and the object classes and generator-for-subordinates attributes of the ODA version (although one-way translations require somewhat less). Concerns about losing considerable layout information encoded with ODA or losing significant structural information when relying on "Office Document Language" (ODL) to bridge between SGML and ODA are discussed and illustrated. The article ends by quite briefly describing the authors' own translation software and its successful test converting their own article from SGML to ODA and back.
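The isomorphism requirement can be illustrated with a rough sketch. This is not the authors' translation software, and it models neither SGML content-model syntax nor real ODA generator-for-subordinates expressions. Both sides are written in a hypothetical neutral notation (a tuple of child-name and occurrence pairs), and the names Article, Heading, and Paragraph are invented. The check passes only when the mapping is one-to-one in both directions and every content model translates exactly onto its counterpart.

```python
def is_isomorphic(sgml_models: dict, oda_classes: dict, mapping: dict) -> bool:
    """Return True if `mapping` pairs every SGML element with exactly one
    ODA object class and carries each content model onto its counterpart."""
    # The mapping must be one-to-one and cover both vocabularies completely.
    if set(mapping) != set(sgml_models):
        return False
    if len(set(mapping.values())) != len(mapping):
        return False
    if set(mapping.values()) != set(oda_classes):
        return False
    # Translating each content model must reproduce the target's structure.
    for element, model in sgml_models.items():
        translated = tuple((mapping.get(child), occurs) for child, occurs in model)
        if translated != oda_classes[mapping[element]]:
            return False
    return True


# Hypothetical structures in a simplified neutral notation:
# each entry lists (child name, occurrence), with "1" = once, "+" = one or more.
sgml_models = {
    "article": (("title", "1"), ("para", "+")),
    "title": (),
    "para": (),
}
oda_classes = {
    "Article": (("Heading", "1"), ("Paragraph", "+")),
    "Heading": (),
    "Paragraph": (),
}
mapping = {"article": "Article", "title": "Heading", "para": "Paragraph"}

print(is_isomorphic(sgml_models, oda_classes, mapping))
```

If one side allowed zero paragraphs and the other did not, the check would fail, which is the situation in which a round trip could produce a document the original scheme rejects.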

This article will be best understood by and most useful for readers already familiar with SGML or ODA. Its comparison of the two, though well exemplified and lucid, is effective only for readers with intermediate experience of at least one. The treatment is aimed at publishing professionals (not computer scientists or scientific authors), since it stresses logical issues in text representation rather than the software-engineering challenges of actually building a working translator. It provides excellent conceptual background for anyone faced with interchanging SGML and ODA material, but it does not offer do-it-yourself instructions for performing such conversions by hand or with customized software.


TEXT ENCODING INITIATIVE

Text Encoding Initiative, C. M. Sperberg-McQueen and Lou Burnard (Eds.), Guidelines for the Encoding and Interchange of Machine-Readable Text, Ch. 2 (Sec. 2.1), "A Gentle Introduction to SGML," 23 pp.

This chapter in the Text Encoding Initiative's Guidelines presents "a brief tutorial guide to the main features of the [SGML] standard for those readers who have not encountered it before" (p. 9). It begins by explaining the underlying concepts presupposed by SGML (markup, document types, the descriptive/procedural distinction). Then follows an illustrated tour of all the essential SGML machinery: elements, attributes, and entities; content models; minimization rules and occurrence indicators; and exceptions. The treatment is clear and patient but never trivialized or patronizing (unlike that in some popular magazines). The carefully chosen examples have a literary flavor that makes them especially accessible to readers with a liberal arts background, although less obviously relevant to scientific and technical material.
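A small sketch may help show what content models and occurrence indicators do. It is not drawn from the TEI chapter. It handles only a simplified subset of SGML content-model syntax: the "," (sequence) and "|" (choice) connectors, parenthesized groups, and the "?", "*", and "+" indicators. The "&" connector, #PCDATA, and exceptions are left out, and the element names are hypothetical. The model is translated into a regular expression and matched against a list of child elements.

```python
import re


def model_to_regex(model: str) -> re.Pattern:
    """Translate a simplified content model into a regular expression
    over a space-terminated sequence of child element names."""
    tokens = re.findall(r"[A-Za-z][A-Za-z0-9.-]*|[(),|?*+]", model)
    parts = []
    for tok in tokens:
        if tok == "(":
            parts.append("(?:")          # model group
        elif tok == ",":
            continue                     # sequence is plain concatenation
        elif tok in ")|?*+":
            parts.append(tok)            # same meaning in regex syntax
        else:
            parts.append(f"(?:{re.escape(tok)} )")  # one child element
    return re.compile("".join(parts))


def is_valid(children: list, model: str) -> bool:
    """Check whether the children, in order, satisfy the content model."""
    sequence = "".join(f"{name} " for name in children)
    return model_to_regex(model).fullmatch(sequence) is not None


# Hypothetical model: one title, one or more authors,
# an optional abstract, then any number of sections.
MODEL = "(title, author+, abstract?, section*)"

print(is_valid(["title", "author", "section", "section"], MODEL))
print(is_valid(["title", "abstract"], MODEL))
```

The second call fails because "+" requires at least one author, which is the kind of constraint a validating SGML parser enforces from the DTD.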

Because it is both substantive and readable, this concise chapter provides an excellent primer before reading book-length expository treatments of SGML (such as Bryan or Goldfarb), which are so thorough that they can be disorienting, as well as before reading narrow technical papers about specific SGML issues. The only "advanced" topic that gets disproportionate play in this "gentle introduction" is the encoding of concurrent structures, an optional feature of great potential value if only more SGML software actually supported it. No acquaintance with (or even interest in) the broader TEI agenda is required.


THE USE OF TEXT INTERCHANGE STANDARDS FOR SUBMITTING PHYSICS ARTICLES TO JOURNALS

Eric Van Herwijnen, "The Use of Text Interchange Standards for Submitting Physics Articles to Journals," Computer Physics Communications, 57 (December II, 1989), 244-250.

This paper quickly visits the major unresolved issues blocking the routine interchange of physics articles using SGML, overtly listing (but not thoroughly analyzing) the relevant arguments for resolving each issue.

Drawing his data from a 1989 survey by the European Physical Society, Van Herwijnen notes the broad diversity of computerized text-preparation systems used for journal articles by practicing physicists, and the prominence among them of TeX. He briefly explains why TeX is nevertheless not suitable for general electronic exchange of articles (it ignores database support, is not truly portable across systems, and is not standardized), and why PostScript fares no better (it describes only page layout, not intellectual content). While SGML, by design, overcomes these problems, three more problems must be resolved for it to be a fully appropriate interchange mechanism: physicists still need "a standard structure for physics articles," inexpensive WYSIWYG input software for SGML, and reliable conversion between word-processor formats and SGML.

This typo-rich survey article nicely summarizes the strengths (and weaknesses) of SGML from the perspective of the publishing physicist, a viewpoint sometimes overlooked by enthusiastic SGML insiders. It is not (and was not intended to be) a carefully reasoned analysis of the issues it raises, however, leaving the curious reader hungry for more thorough treatments from elsewhere. The references are mostly to standards documents or private communications, not to more elaborate published discussions.
