SGML FAQ, from Erik Naggum
INTRODUCTORY QUESTIONS with answers
What is SGML, briefly?
SGML is an abbreviation for the "Standard Generalized Markup Language". SGML is defined in an International Standard published by the International Organization for Standardization (ISO), with reference number ISO 8879:1986, bearing the full name "Information processing -- Text and office systems -- Standard Generalized Markup Language (SGML)".
To most people, _markup_ means an increase in the price of an article. Although we talk about increases in value, it's not the same thing. "Markup" is a term coming from the publishing and printing business, where it means the instructions for the typesetter that were written on a typescript or manuscript copy by an editor. Today, with your favorite editor, you can enter the markup yourself, or even have it entered for you, in terms of codes or other instructions for an electronic typesetting program, which in simple cases is also the editor. An example is troff's ".ce" for "center the following line".
A _markup_language_ is a set of means (constructs) to express how text (i.e., that which is not markup) should be processed, or handled in other ways. Unlike most other artificial languages, markup languages have to deal with embedded data, and contain rules for what is markup and what is data. For instance, in TeX the backslash means that subsequent input is TeX instructions. Most markup languages offer additional, administrative, language constructs, with which to define other language constructs (such as macros).
_Generalized_markup_ is markup that has the curious property that it does _not_ specify how things should look. We still call it markup, though, because of the similarity with markup as described above. For instance, "" and "" are used in this FAQ to denote Question and Answer, respectively. This doesn't say anything about how questions should look in a typeset edition of this FAQ. You could have all the questions rendered in bold-face, for instance.
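As a rough illustration of the point that generalized markup says what a piece of text is and leaves its appearance open, the following Python sketch renders the same tagged text in two ways. The tag names Q and A, both style tables, and the render function are assumptions made for this illustration. The regular expression handles only simple, non-nested tag pairs and is in no way an SGML parser.

```python
import re

# Hypothetical tagged text: the tags say what each piece *is*, not how it looks.
TAGGED = "<Q>What is SGML, briefly?</Q><A>A standard for generalized markup.</A>"

# Two independent presentation choices for the same tags (both are assumptions).
PLAIN_STYLE = {"Q": "Q: {}", "A": "A: {}"}
BOLD_STYLE = {"Q": "**{}**", "A": "    {}"}


def render(tagged, style):
    """Render each <TAG>text</TAG> pair with the template registered for TAG."""
    # findall returns (tag name, enclosed text) tuples; \1 requires the
    # end-tag to carry the same name as the start-tag.
    pieces = re.findall(r"<(\w+)>(.*?)</\1>", tagged, flags=re.S)
    return "\n".join(style[tag].format(text) for tag, text in pieces)


if __name__ == "__main__":
    print(render(TAGGED, PLAIN_STYLE))
    print(render(TAGGED, BOLD_STYLE))
```

Changing the look of every question means editing one entry in a style table; the tagged text itself is left alone.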
With generalized markup, you tell the system _what_ you have, rather than how it should look, and you do so by putting a label (tag) around the text. There is a clear correlation between tags and what things look like. Tags are placed at the start and at the end of text of a certain kind, and these are precisely the places where typographic features are used, such as spacing, change of typeface, etc. An example is LaTeX, which, through macros, lets you talk about itemized lists instead of indents and item numbering, among other things.
The _Standard_Generalized_Markup_Language_ started out as GML, the Generalized Markup Language, created by Charles Goldfarb, Edward Mosher and Raymond Lorie (G, M, and L, respectively) in 1969 at IBM. GML became the basis for the Standard through work in ANSI and with aid from a project predating GML, GenCode, which attempted to standardize names of commonly used elements. Rather than take this (impossible) approach, SGML is a language which makes it possible to roll your own generalized markup, but with a standard form and in standard ways. (Historic note: The origin of SGML was confused with that of GenCode in the 1991-12-15 edition of this FAQ.)
In practice, you won't exactly roll your own, any more than you design LaTeX packages on your own. Although some people actually do that!
Central to the design of SGML is the idea that a set of generic identifiers (the names of the tags), together with their interrelationships, form a type (or class) of documents, and that every document is an instance of a class, which means it can be validated with respect to this class.
Can I read more about SGML somewhere?
Let me suggest only one book, and then a bibliography. The book is Charles F. Goldfarb: The SGML Handbook; Oxford University Press, 1990; ISBN 0-19-853737-9.
This book includes the text of the standard, so you don't have to worry about finding out how to order it from your ISO national member body or directly from ISO in Geneva, or wherever. The main feature of this book is that Charles Goldfarb, who is the project editor for the standard in ISO's SGML committee, has added a tremendous amount of annotations and has provided links between parts of the standard to guide your yearning for knowledge. Another big win is the overview, which takes you through a guided tour of concepts and facilities. If there be only one authority on SGML, this book is it. A "paper hypertext" feature makes the links in the text easy to follow. This is a book you need.
The bibliography is Robin Cover's Brief Bibliography, also to be published on this newsgroup, and it covers the essentials, as well as enough pointers to other works to fill a wall of literature. Robin Cover, et alia, produced the huge, 312-page "Bibliography on SGML" (Tech Report 91-299, Queen's University, Kingston, Ontario, Canada), an incredibly useful work. Robin Cover continues to track the SGML arena, and hopefully, he will continue to provide us with the fruits of his work.
SGML is often mentioned as being a "meta-language". What is that?
This refers to the fact that SGML isn't only one language, but a language which describes other languages within its framework. As we talked about classes of documents and every document being an instance of such a class, we talk about a class of markup languages, and every markup language being an instance of the class.
SGML also has the necessary expressive power to redefine the particular characters that are to be considered markup in a particular markup language, so that SGML is really a meta-language with an abstract syntax that each SGML document fills in to get a concrete syntax and a particular markup language for that document. This is the administrative information that makes it possible to talk about "conformance" to SGML.
What does an SGML document look like?
An SGML document is divided into three different parts, each with a clearly defined function.
The first part specifies the character set of the document, which of these characters have special meaning to SGML in the rest of the document, and which advanced features are used. This is called the "SGML declaration", and is like a list of ingredients on food, so you know what to expect and what you can't eat. Using this as a checklist, you can determine whether your system can handle the document at hand. The SGML declaration looks like this:
(There can be several document types, and another construct called link type declarations (similar to DOCTYPE, but with LINKTYPE).)
The third part of an SGML document is the marked-up "real" document which all of the administrative information and legwork makes possible. This is called the document instance. It usually begins with the name of the document in angle brackets, like this; which is the syntax for a start-tag of an element. The corresponding end-tag looks like this:
When your parser reads your document, it checks that the tags in the document belong to the document type, and that they are allowed where they're used, again according to the document type. This process is called "validation". Once a document has been validated, it does not need to be validated again, no matter what your parser is instructed to do with it, and no matter which application will use the data in the document. This is another strength of SGML: application-independent validation.
What do you mean "my parser"? Are there any freely available ones?
99% of the fun with SGML can be had only with a parser, so you do need one. (The remaining 1% comes from beholding the elegance and beauty of the language, and contemplating all the wondrous things you can do with it, once you have a parser. This feeling tends not to last, unless you're developing a parser, in which case it's almost all the fun.)
Fortunately, a competent programmer and SGML aficionado has had a lot of fun lately, and in mid-July 1991, the ARC SGML parser materials were released. The ARC SGML parser materials are legally unencumbered (i.e., you can do whatever you want with them) and they're available for a nominal cost from the SGML Users' Group, as well as from several public SGML repositories.
Can I get the ARC SGML from somewhere electronically?
The University of Oslo, Department of Informatics, kindly sponsors a public FTP archive with material on SGML and has the ARC SGML parser available for anonymous FTP. Both the original MS-DOS distribution and a Unix port done by James Clark are available. This archive also holds information on some standards related to SGML, most notably an SGML application for hypermedia documents (the Hypermedia/Time-based structuring language, HyTime). Take a look around in the SGML and SIGhyper subdirectories.
(Anonymous FTP works like this: You need to be connected to the Internet, and need a program which can talk the FTP protocol, usually something with "FTP" in it. On Unix systems, you can say "ftp ftp.ifi.uio.no", and that should be it. You will be asked for a user name -- reply "anonymous". You will then be asked for a password -- reply with your Internet mail address. You're now logged in, and can use the "cd" command to switch directories ("cdup" to go one level up), and "ls" to look around. Use "get" to fetch files.)
If you need guidance, or can't use FTP, you may write to , which I'll try to answer as fast as possible. There are also other FAQs available on how to FTP.
I've received an SGML document from a net.friend, what can I do with it?
Didn't your net.friend tell you?? Seriously, an SGML document is, as mentioned above, an instance of a document type, and a document type can be many things, and it's only part of an application of SGML.
Such an application consists of several parts: First, there's the document type definition, which says which elements you can have, and how they interrelate. Second, with the document type definition, there's a description of the semantics of the elements, so you know what they mean. The description is needed because SGML is not concerned with what things mean, only how they are represented. (You might complain that this is too small, but it's better to do a given task well than to do a greater task badly. There are other standards in the great SGML family which take care of these things, and more are coming as we witness increased adoption of SGML in the market.)
I'm writing a book, and my publisher wants me to submit an SGML document on a diskette, what do I do?
You take a look at the several SGML editing systems around, and see which one you think you would like to write a whole book with. Recruit your publisher to help you understand what he wants, and try to play with SGML a little before you start writing. SGML is like, um, anyway, it gets better with experience, and can be frightening the first time. For a good list of starter tools, I again refer you to Robin Cover's brief bibliography.
TECHNICAL QUESTIONS with answers
What, precisely, is an "element"?
An element is the smallest part of a document that SGML deals with, and it's the basic building block of document types. An element may contain data (text), subelements, both, or it may be empty. The task of a document type designer is to identify the elements a document is to consist of, and define a hierarchical structure of these elements by means of other elements.
An element definition consists of the name (generic identifier) which will be used in tags, a description of the content (using a "content model"), and an indication of whether the start-tags or the end-tags may be omitted. An element (in the document instance) is indicated by a start-tag, the contents, and an end-tag.
An element, with its notion of content models, provides a powerful abstraction over the different kinds of text that can be found in a document. For instance, ordinary text is just characters that will be formatted somehow on output. If you have special kinds of text, such as a telephone number, it could make sense (depending on your application) to make a special element with generic identifier "phone". That way, you can look for telephone numbers and get matches only at the right places. If you're really far-sighted, you would define a telephone number notation that you associate with this element, so that you could check that all your phone numbers had the right format. Then you could modify the presentation of a phone number to suit a particular need, e.g. +1 516 555 8879 in the document could come out as "(516) 555 8879" in a domestic catalog and in full, international format in an international catalog.
In a way, elements are like concepts, where a concept (say, "beef") is an abstraction over an innumerable lot of things into a particular "type" of thing, all having common characteristics, and fits into a hierarchy where concepts may be abstractions over other concepts.
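The telephone number example above can be sketched in Python as follows. This is only an illustration under stated assumptions: the notation pattern, the function name format_phone, and the catalog names "domestic" and "international" are all made up here, and a real application would attach the notation to the "phone" element in its document type rather than hard-code it.

```python
import re

# Assumed notation for the content of a hypothetical "phone" element:
# "+<country code> <area code> <exchange> <number>", single spaces between.
PHONE_NOTATION = re.compile(r"\+(\d{1,3}) (\d{3}) (\d{3}) (\d{4})")


def format_phone(content, catalog, home_country="1"):
    """Check the content against the notation, then present it per catalog."""
    match = PHONE_NOTATION.fullmatch(content)
    if match is None:
        raise ValueError(f"not in the assumed phone notation: {content!r}")
    country, area, exchange, number = match.groups()
    # A domestic catalog drops the country code for numbers in the home country.
    if catalog == "domestic" and country == home_country:
        return f"({area}) {exchange} {number}"
    # Everything else keeps the full international form found in the document.
    return content


if __name__ == "__main__":
    print(format_phone("+1 516 555 8879", "domestic"))
    print(format_phone("+1 516 555 8879", "international"))
```

Because the number is stored once, in one notation, both the format check and the presentation decision happen outside the document.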
This idea of "types" and of a conceptual tool for text is one of the many great things with SGML. A content model is like the definition of a concept, with the important difference that a content model is defined in terms of the behavior of its subelements. A subelement may be optional, required, or repeatable, and subelements may be chosen from a set, form an ordered set, or form an unordered set. Then there are exceptional subelements, which may either be forbidden or allowed anywhere in the contents of the element.
The similarity between elements and concepts goes further, as elements may have attributes. An attribute is information about an element which is not part of its content.
The element in SGML is thus a high abstraction over identifiable, separate portions of contents of a document from a conceptual and hierarchical view.
What is an "entity" in SGML?
The notion of an entity in SGML is an even higher abstraction than the element, and since this is somewhat unexpected to most readers of SGML, it's probably the reason why so many have problems with it.
The concept of an element comes from looking at the contents of a document and grasping that the contents form an element structure, a hierarchy of elements, and that the nature of each element can be abstracted so that a content model can be defined which spans the varied use of each subelement.
The concept of an entity comes from looking at the individual pieces of text that make up a whole document, and realizing that these pieces are independent of the element structure. E.g., a book may physically consist of several files on the author's disks. The element structure of the book spans all the disks and all the files, yet it's important to be able to refer to the files. The aspect that both complicates and relieves this is that we need to be able to refer to these pieces in a system- and storage-independent way. This is where the entity saves us a lot of trouble. Entities are named pieces of text.
The abstraction that causes some confusion is over what a "piece of text" is, and, in particular, where it is found. We have looked at external entities, that is, entities which, when we refer to them, cause us to read a different file. We may also need to define shorthand notations for things in a document without needing an external file for every small piece of text.
This means that entities have types, as well. There are internal entities: entities that are useful as shorthands for language constructs, entities that are text which is not to be interpreted, etc.; and external entities: entities that are simply text, entities that are in a special notation to be interpreted by a special program, perhaps with parameters, and entities which constitute larger parts of the administrative functions of the first and second parts of the SGML document.
Moreover, entities may be used both by the administrative parts and the user, and the user shouldn't have to worry about which entities are used by the administrative functions he doesn't see. So, entities come in two flavors, parameter entities and general entities.
An "entity", then, is an abstraction over several types of text that you want to refer to by name. Once defined, you don't need to know where it is found, or of what kind it is -- all (general) entities look and feel the same to the user.
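The idea that a general entity is referred to by name, regardless of where its text is stored, can be sketched roughly in Python. Everything named here is an assumption for illustration: the entity table, the dictionary standing in for external files, and the functions entity_text and expand. Only the "&name;" form of a general entity reference is taken from SGML's reference concrete syntax. The sketch ignores parameter entities, notations, and uninterpreted text, and does not guard against entities that refer to themselves.

```python
import re

# Hypothetical entity table. Internal entities carry their replacement text;
# external entities carry an identifier that a storage layer must resolve.
ENTITIES = {
    "sgml": ("internal", "Standard Generalized Markup Language"),
    "chap1": ("external", "chapter1.txt"),
}

# Stand-in for the author's files, so the sketch runs without touching a disk.
FAKE_STORAGE = {"chapter1.txt": "Chapter one is about &sgml;."}


def entity_text(name):
    """Return the text of a named entity, wherever it happens to be stored."""
    kind, value = ENTITIES[name]
    return value if kind == "internal" else FAKE_STORAGE[value]


def expand(text):
    """Replace every &name; reference, including references inside entities."""
    return re.sub(r"&(\w+);", lambda m: expand(entity_text(m.group(1))), text)


if __name__ == "__main__":
    print(expand("&chap1; That is all."))
```

The caller of expand never sees whether "chap1" was internal or external, which is the storage independence the answer above describes.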
FURTHER QUESTIONS without answers
Is this all?
No, it just takes a lot of time to invent questions and write good answers. In this FAQ I have not tried to make a summary of questions asked on the net so far, but to provide answers to questions that I have seen come up in several ways without necessarily being asked in the form presented above. A summary of questions and answers in this group will be incorporated into the next version of the FAQ.
How can I contribute?
Glad you asked. You can, at any time, fetch the latest version of the FAQ at ftp.ifi.uio.no:SGML/FAQ., where is 0.0 for this version. Other versions will be available as I write more, and as your contributions flood my mailbox. Please write to me at Erik Naggum or Erik Naggum