[Mirrored from concatenated files. Canonical source: http://sable.ox.ac.uk/ota/teiedw25/, December 16, 1996]

What is SGML and How Does It Help?

by Lou Burnard

SGML is an abbreviation for ``Standard Generalized Markup Language''. This language, or rather metalanguage, was first defined by an International Standard in 1986 [See note 1]. To complement the many detailed technical descriptions of SGML now available, [See note 2] this paper [See note 3] briefly describes the purpose and scope of the standard, aiming to persuade non-technically minded readers that it has something to offer them.

This electronic version may be freely distributed within the Higher Education community on a not for profit basis. It may not be printed or reproduced for profit.

This is a minor revision of TEI working paper EDW25, recoded in TEI Lite. The HTML version was derived automagically from the original form of EDW25. Please note that the text has not been re-edited and is thus rather out of date, particularly in the bibliography. Why, it doesn't even refer to The Gentle Introduction to SGML...

This text is also available in print form in Computers and the Humanities, vol 29, 41-50, 1995.

1. What is SGML for?

The objectives of those who designed SGML were simple. Confronted with an increasing number of so-called ``markup languages'' for electronic texts, each more or less bound to a particular kind of processing or even to a particular software package, they sought to define a single language in which all such schemes could be re-expressed, so that the essential information represented by such texts could be transferred from one program or application to another. I begin therefore by giving a slightly more formal definition of what is meant by the term ``markup language''. A universal language necessarily presupposes some basic concepts or semantic primitives in which the notions of all other languages can be expressed: the semantic primitives of SGML are simple and few in number, and their definition forms the bulk of the rest of this paper. I begin however with a few remarks about what SGML is not.

Newcomers to SGML often think of it as a special case of the kind of markup language with which they may be familiar. They expect it to define a universal set of tags or to define exactly what tags mean, in terms of how the items identified by tags are to be processed. But the semantics of a markup language are precisely what SGML does not concern itself with: it describes only the formal properties and inter-relations of the components of a document. It does not tell you what it means to define part of a text as belonging to some category (say, ``blort''); it simply tells you how things-called-blort can legally appear within texts -- whether they can be decomposed into ``blortettes'', or whether more than one of them can appear at the start of a document, and so on. Determining what a thing-called-blort actually might be is inextricably entangled with how the text is to be processed, and the function of SGML is to define the content of a document in terms that are entirely independent of its processing.

It follows from this that it is nonsensical to think of SGML as a kind of text formatting system (although its origins can be readily traced in the world of electronic text formatting), or as a competitor for such languages as TeX or PostScript. These are languages which define how ink (or its equivalent) is to be placed onto paper (or its equivalent); they are not primarily concerned with the formal structure of the language represented by those placements of dark-on-light. SGML by contrast is decidedly unhelpful about how texts are to be reproduced, since this is but one of the many applications for which a text may be placed into electronic form. Its strength is that by separating the notion of what the text actually is from how the text is rendered, it makes possible the use of the same text by many different kinds of processor.

As a simple example, consider the headings used to introduce the subdivisions of the present document. These need to be distinguished from the body of the text so that they can be formatted in a particular way. However, I have not yet decided how -- and it is more than likely that those responsible for printing this text will prefer to format them in some other way in any case. If therefore I use the facilities available on my word processor for the display of headings -- say a change of font size, a margin indent and a switch to bold font -- I will not be helping the typesetter's task very much. Moreover, should I wish to prepare a list of the subheadings in my text to serve as an index, I will very probably find it quite difficult to distinguish occasions where bold font indicates headings from cases where it indicates (say) emphasized phrases in the text. By the same token, I will find it difficult to check that each subsection has one and only one subheading, that any numbers included in the subheadings are in the right sequence and so forth. And when, in the fullness of time, this text enters the great database of late twentieth century prose, future linguists and historians will have comparable difficulties in assessing the linguistic properties of text used as subheadings as distinct from those of the main text. If however, I simply tag each sub-heading as such, using some unique string of codes to say ``here begins the text of a subheading'' and some other to mark its end, then the same input text can be used unchanged by any formatter, any indexing program and any linguistic analysis program. Each one will be able to decide for itself what it wants to do with the subheadings -- how it would like to process them, if at all.

While indexing the subheadings in a document of this nature is clearly of somewhat limited importance, it should be apparent that the solution proposed for that problem is an entirely generalisable one. Consider historical source materials for example. Which is likely to be of more use in compiling a list of the names in an electronic transcription of the records of an ecclesiastical court -- a version in which the names are simply italicised (as are, for example, Latin phrases, running titles, annotations etc.) or one in which each name is marked off clearly by a tag such as <name>? Which is likely to be of more use in extracting statistical data for input to a spreadsheet analysis of the average age of offenders -- a version in which birth dates are clearly marked as such, perhaps incorporating some normalised version of the date concerned, or one in which all dates are simply intermingled with the running text?
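The contrast drawn above can be made concrete with a few lines of Python. This is only a sketch: the sample sentence, the personal names in it and the use of underscores to stand for italics are all invented for illustration, and a real SGML processor would use a parser rather than regular expressions.

```python
import re

# Hypothetical fragment of a court record, encoded descriptively.
TAGGED = ("Before <name>Thomas Cowper</name> appeared <name>Agnes Hall</name>, "
          "<foreign>in propria persona</foreign>.")

# The same fragment with purely typographic markup (underscores stand for italics).
ITALIC = "Before _Thomas Cowper_ appeared _Agnes Hall_, _in propria persona_."


def tagged_names(text):
    """Return the content of every <name> element, in document order."""
    return re.findall(r"<name>(.*?)</name>", text)


def italic_spans(text):
    """Return every italicised span; names and Latin phrases are indistinguishable."""
    return re.findall(r"_(.*?)_", text)


print(tagged_names(TAGGED))
print(italic_spans(ITALIC))
```

The first function returns only the two names; the second returns the Latin phrase as well, and nothing in the italicised version allows a program to tell the two kinds of thing apart.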

2. Markup and Markup Languages

The word markup was originally used to describe annotation or other marks within a text intended to instruct a compositor or typist how a particular passage should be printed or laid out. Examples, familiar to proofreaders and others, include wavy underlining to indicate boldface, special symbols for passages to be omitted or printed in a particular font, and so forth. As the production of texts was automated, the term was extended to cover all sorts of special ``markup codes'' inserted into electronic texts to govern formatting, printing, or other processing.

Generalizing from that sense, we define markup, or (synonymously) encoding, as any means of making explicit an interpretation of a text. At a banal level, all printed texts are encoded in this sense: punctuation marks, use of capitalization, disposition of letters around the page, even the spaces between words, might all be regarded as a kind of markup, the function of which is to help the human reader determine where one word ends and another begins, or how to identify gross structural features such as headings, and syntactic units such as dependent clauses or sentences. Encoding a text for computer processing is in principle, like transcribing a manuscript from scriptio continua, a process of making explicit what is conjectural or implicit. It is a process of directing the user as to how the content of the text should be interpreted.

A ``markup language'' may be no more than a loose set of markup conventions used together for encoding texts. A markup language must specify what markup is allowed and whereabouts, what markup is required, how markup is to be distinguished from text, and what the markup means. As noted above, SGML provides the means for doing the first three of these only; it allows you to describe a markup language independently of what the markup is intended to do. To understand and act upon the markup, additional semantic information is needed, which will differ in different situations. Documentation like that enshrined in the TEI's Guidelines provides such information. In just the same way as one may be able to parse the syntactic structure of a Latin unseen without having the least idea what it is about, so an SGML-aware processor can analyze the structure of an SGML-encoded document with no sense of its meaning. This independence is necessary, given the open-ended nature of electronic textual applications. It does not, of course, imply that the intentions of the encoder of a text are unimportant or vacuous; only that they are formally distinct from the encoding itself.

Three basic concepts are fundamental to an understanding of all markup languages, when described in SGML terms. These are the notions of a markup entity, a markup element, with its associated attributes, and a document type. At the most primitive level, texts are composed simply of streams of symbols (characters or bytes of data, marks on a page, graphics, etc.): these are known as entities in SGML. At a higher level of abstraction, a text is composed of representations of objects of various kinds, linguistically or functionally defined. Such objects do not appear randomly within a text: coherence demands that particular types of object appear in specifiable relationship to other objects -- they may be included within each other, linked to each other by reference or simply presented sequentially, for example. This level of description sees texts as composed of structurally defined objects, known as elements in SGML. The grammar defining how elements may legally be combined in a particular class of texts is known as a document type. (This view of the nature of text has been nicely defined by De Rose et al. [See note 4] as an ``ordered hierarchy of content objects''.) These three fundamental concepts together are, it is claimed, adequate to describe all the complexities of marked-up texts, of whatever kind and for whatever purposes. Each is discussed in turn in the following sections.

3. Entities

The word ``entity'' is used in SGML rather differently from its use elsewhere, notably in database design methodology. An SGML entity is simply a named bit of text, considered entirely independently of any structural or categorial classification it might have. A document may be an SGML entity, as may any arbitrary sequence of characters within it, or any symbol it contains. The definition of an entity associates a name with a particular string of bytes, which may be the representation of some characters in a particular computer encoding or held in a system-defined container of some kind (such as a file). Within an SGML document, entities are represented by reference, using the defined name. This mechanism has a number of important uses, specified further below, primarily in making it possible to encode textual features such as special characters or symbols which are unique to a particular environment or application in a way that is independent of that particular environment.

Everyone who has communicated by electronic mail, or even tried to move a file from a Macintosh computer to a PC, knows that some of the symbols of which texts are composed are less portable than others. With the best will in the world, computer manufacturers and standards bodies alike will never be able to represent all the possible symbols occurring in written texts in a single universally agreed code set, simply because these symbols do not form a closed set: the task is as hopeless as that of enumerating all the words in a language. Moreover, it is a fact of life that different computing environments adopt different methods of representing the same symbols, disregard entirely the existence of some and insist on distinguishing others.

A notorious consequence of this state of affairs is that some letters which appear perfectly normal when typed at a keyboard in Odense, Paris, or Tübingen, will either be mysteriously transmuted into a percent sign, or lost completely when transmitted over the network to the UK. How then are words to be stored in a computer file in such a way as to ensure that it can be satisfactorily processed by any computer, not just those which have the decency to be aware of the Danish, French, or German national character sets?

Exactly the same problem arises, in a more acute form, when considering the range of symbols likely to be required in transcribing manuscript texts or spoken language. There is no computer character set in which the long form of s is distinguished from the short, still less for distinguishing ligatured forms of the same letters, or for representing all scribal abbreviations, astrological symbols, non-vocalic grunts, pauses etc. Nor, UNICODE notwithstanding, do I think it likely that there ever will be.

The SGML solution is to encode characters not available in the particular character set used for document transmission [See note 5] by means of entity reference. If Hans Jørgen is represented as Hans J&oslash;rgen, I can associate the unlovely acronym oslash with whatever particular stream of bytes is necessary on my computer to produce the slashed-o in the Danish national set. [See note 6]

Some have objected to the apparent verbosity of such mnemonics, by comparison with the variety of encoding tricks or ad hoc solutions customarily resorted to. The advantage of the entity reference solution is simply that it forms part of a single and consistent convention, comprehensible without resort to special purpose documentation (which is generally absent). Sets of standard mnemonics for all the accented letters and special symbols of modern European languages are to be found in ISO 8879 and elsewhere.

The same mechanism can be applied more widely for any stream of bytes to be included within a document. The use of a single short abbreviation for a much repeated or particularly complex phrase is a simple way of ensuring consistency and reducing effort; it is worth noting in this context that the value of an entity reference can include other mark-up, such as tags or other entity references, provided that any element opened within a given entity is also closed within it. This method has been adopted for example by the TEI committee responsible for defining linguistic annotation. Another use is for identifying objects which cannot be directly represented in a text, for example non-textual entities like graphics or formulae. More mundane uses are not difficult to identify.

It should be stressed that entities have no structural properties: they are simply shortcuts enabling an SGML aware processor to substitute a system-defined string of bytes for a name identified as such by special SGML delimiters. As such they are merely a special (if well thought out) way of doing the kinds of things which transcribers and encoders of text have already been doing for many years.
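The substitution just described is simple enough to sketch in Python. The table of entity names and replacement strings below is an assumption standing in for whatever declarations a real system would supply, and the sketch expands each reference once only, ignoring the possibility that a replacement string itself contains further references.

```python
import re

# Assumed local definitions: entity name -> replacement string on this system.
ENTITIES = {
    "oslash": "\u00f8",                  # slashed o
    "uuml": "\u00fc",                    # u with umlaut
    "TEI": "Text Encoding Initiative",   # abbreviation for a much repeated phrase
}

# An entity reference is a name delimited by an ampersand and a semicolon.
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def expand(text, entities=ENTITIES):
    """Replace each &name; reference with its defined value.

    References to names that have not been defined are left untouched.
    """
    def substitute(match):
        return entities.get(match.group(1), match.group(0))

    return ENTITY_REF.sub(substitute, text)


print(expand("Hans J&oslash;rgen"))
```

A different computing environment would supply a different table and leave the encoded text itself unchanged, which is the whole point of the mechanism.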

4. Elements and their content

The level of description at which texts are composed solely of entities in the SGML sense defined above is not, however, a very satisfactory one. Texts are not simply sequences of words, still less of bytes; they contain instances of objects, such as paragraphs, titles, names, dates etc., represented by such sequences. All markup schemes, to a greater or lesser extent, attempt to identify, distinguish and describe these components. A consideration of such schemes indicates at least three important aspects of textual objects which need to be recognised: their extent -- that is, the points in the textual stream at which object instances begin and end; their type -- that is, the category to which object instances are assigned; and their context -- that is, their relationship to other object instances within the document. SGML addresses each of these concerns: everything in an SGML document is delimited explicitly in some way; a document is decomposed into elements of a named type; and a kind of textual grammar can be defined.

4.1 A note on syntax

Most discussions of SGML mention if only in passing that the particular characters and conventions used to represent SGML markup in a particular document can be redefined. This is of course a necessary consequence of the fact that SGML is not itself a markup language, but a method of describing such languages. However, for simplicity's sake, this document will follow customary practice in using the reference concrete syntax to represent SGML markup. This rebarbative phrase is actually quite a precise description of what it denotes: it is a ``concrete'' syntax, because it represents by particular characters (the < > ! and other delimiters) distinctions required by the SGML model of how markup should be described (its ``abstract'' syntax); it is provided for ``reference'' purposes, as an example of one generally useful way of representing the constructs of the language.

The SGML reference concrete syntax has two great advantages over most other ways of making concrete a view of the abstract structure of a markup language: everything is delimited (bracketed) explicitly, and very few special characters are needed. As we have already seen, entity references are delimited explicitly by the ampersand character and the semicolon. [See note 7] In the same way, element occurrences within an SGML document are explicitly delimited in the reference concrete syntax by named tags. There are two kinds of tag: start-tags, which indicate the beginning of an element, and end-tags, which indicate its end. The tags themselves are delimited by special characters: ``<'' marks the beginning of a start-tag, ``</'' marks the beginning of an end-tag, and ``>'' marks the end of a tag of either kind. Between these delimiters is given a name identifying the type of element delimited by the start- and end-tag pair. For example, an embedded name element in a text might be tagged as follows:

             Call me <name>Ishmael</name>.      
This is by no means the only way of indicating the presence of an SGML element within a text; it is however the most explicit, and hence that into which other representations are most generally mapped.
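To show how little machinery the explicit delimiters require, here is a minimal Python sketch of a scanner for the reference concrete syntax. It is an illustration under stated assumptions rather than a conforming parser: it recognises only start-tags without attributes, end-tags and character data, and it knows nothing of entity references, declarations or omitted tags.

```python
import re

# Three alternatives, tried in order: an end-tag, a start-tag, a run of character data.
TOKEN = re.compile(r"</([A-Za-z][\w.-]*)>|<([A-Za-z][\w.-]*)>|([^<]+)")


def tokenize(text):
    """Split text into ("start", name), ("end", name) and ("text", data) tokens."""
    tokens = []
    for match in TOKEN.finditer(text):
        end_name, start_name, data = match.groups()
        if end_name:
            tokens.append(("end", end_name))
        elif start_name:
            tokens.append(("start", start_name))
        else:
            tokens.append(("text", data))
    return tokens


for token in tokenize("Call me <name>Ishmael</name>."):
    print(token)
```

The extent of the element (where it starts and ends) and its type (the name in the tags) are both recoverable from the token stream without any knowledge of what a name is for.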

4.2 Content models

As suggested earlier, the primary function of the start and end tags within a marked-up text is to indicate the extent of a particular object or textual component (the SGML term is element) within it. In addition, the tags identify the category or type of the element which they delimit, by supplying a name for it (``name'' in the example above). An SGML-aware processor can thus easily identify the start and end of all elements of a given type within a document -- it can identify all names, all sentences, all paragraphs (etc) and process them in a way appropriate for such objects.

The content of a document element of a particular type (that is, the portion between the start and end tags) may consist simply of running text, perhaps including entity references. More usually, it will contain other embedded document elements; occasionally it may have no content at all. The ability of SGML to specify rules about how elements can nest within other elements is one of its chief strengths and is discussed further below. Here we simply note that elements of one type typically contain elements of another: for example, a parish register consists of a mixture of birth, marriage and death records, each of which contains elements such as names, dates and details of an event. We might thus expect to find such records encoded in SGML with different tags for <birth>, <marriage> and <death> elements, within each of which might be found <name> and <date> elements. In exactly the same way, a document such as this one might be encoded as a series of <paper> elements, each of which begins with a <title>, followed optionally by an <abstract>, and at least one (and probably several) <section>s, each composed of <paragraph>s.

An empty element (one which has no content) may seem like a contradiction: what use can it be simply to tag a specific point in a text, especially if there is no way of associating information with it? At the very least, it should be possible to supply a name or other identifier to distinguish one such empty point in a text from another. Fortunately, SGML does provide a mechanism for adding such ``extra-textual'' information to the elements of a text: that of attributes, discussed in the next section.

4.3 Attributes and cross-references

Like ``entity'', the word ``attribute'' has a specific technical sense when used in the SGML context, which differs somewhat from its sense when used in the database design context. An SGML attribute is a category of information associated with a particular type of element, but not contained within it. Attributes are associated with particular element occurrences by including their name and value within the start-tag for the element concerned. For example:
             Call me <name type=Biblical>Ishmael</name>.
Here ``type'' is the name of an attribute associated with any occurrence of the <name> element; ``Biblical'' is the value defined for this attribute in the case of the example <name> shown above. [See note 8]

Attributes are used for two related purposes: they enable an identifying number or name to be associated with a particular element occurrence within a text (which might otherwise be missing), and they enable additional information missing from a text to be added to it without violating its integrity.

As an example of the first usage, consider the page or folio numbering of a historical source. There is a sense in which the individual pages of a source might be regarded as distinct elements within it. This is not however generally the primary focus of interest for those using it: in most cases, the number of the page only is of importance as a means of documenting where the other elements of the text occur. Moreover, the page numbers may not appear at all in the original source. In such cases, a tag <pb> (for ``page break'') may conveniently be used to mark the point in the text at which a new page begins. An attribute (say, n for ``number'') would then provide a convenient means of indicating the number of the page just begun: thus

                  text of page 3 ends here 
                   <pb n=4> 
                 text of page 4 starts here 
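A processor interested in pagination can use such empty elements as milestones. The Python sketch below assumes the unquoted attribute form used in the example and requires the caller to say what the number of the first page is, since nothing in the fragment itself records it; the function name is of course invented.

```python
import re

# An empty <pb> element whose n attribute gives the number of the page just begun.
PB = re.compile(r"<pb n=(\w+)>")


def split_pages(text, first_page):
    """Return a dict mapping each page number to the text found on that page."""
    # re.split with one capturing group yields [text, number, text, number, text, ...].
    parts = PB.split(text)
    pages = {first_page: parts[0].strip()}
    for number, content in zip(parts[1::2], parts[2::2]):
        pages[number] = content.strip()
    return pages


sample = "text of page 3 ends here <pb n=4> text of page 4 starts here"
print(split_pages(sample, first_page="3"))
```

A formatter, by contrast, is free to ignore the <pb> tags altogether and paginate the text afresh.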
As an example of the second usage, consider the common need for normalisation in prosopographical studies. One way of achieving this might be to associate an attribute such as ``key'' with each occurrence of <name> elements in a text, the value of which would be a regularized and encoded form of the name, which could also serve as an identifying key in a database derived from the text. For example:
             <name key='SMITJ04'>Jack Smyth</name> 
Attribute values may be defaulted, taken from a controlled list or specified freely, the only constraint being that they cannot contain markup.

The most common use for attributes in the TEI and other SGML schemes is not however to categorise element occurrences in this way, but to identify them. In the TEI scheme, for example, every element is defined to have an ID attribute, which supplies a unique identifier for that particular textual element within the text. This makes possible the encoding of links between individual elements of a text in a simple and economical way. This facility is very commonly used in document preparation systems (such as TeX or Scribe) in order to link cross-references (such as ``see section 3 above'') within a text with the sections of a text to which they refer, when the section number is not known or may be dynamic. In SGML, such a system is completely generalizable. For example, let us suppose that we wish to encode a register of names in which the following passage occurs:

          John Smith, baker.
          Mary Smith, seamstress, wife of the above.
In this example we have two <entry>s, each containing a <name> and a <trade>. The second entry however contains an additional clause which states a relationship between it and another element. We begin by tagging the elements so far identified: [See note 9]
          <entry>
           <name>John Smith</name>
           <trade>baker</trade>
          </entry>
          <entry>
           <name>Mary Smith</name>
           <trade>seamstress</trade>
           <relation>wife of the above</relation>
          </entry>
Clearly ``wife of the above'' is meaningless as a relation unless we have some way of pointing to the entry with which it is linked. Let us assume that the referent of ``the above'' is the whole of John Smith's entry rather than just the name within it; the assumption does not affect the argument. What is needed is some way of identifying that entry uniquely; that identifying number can then be supplied as the target of the relationship. In other words, we need an identifying attribute (call it ``id'') that can be attached to any <entry> and a pointer attribute (call it ``target'') which can be attached to any <relation>. Using these, and inventing an arbitrary value for the identifier, we can encode the link implicit in the above text as follows:
          <entry id=E1234>
           <name>John Smith</name>
           <trade>baker</trade>
          </entry>
          <entry>
           <name>Mary Smith</name>
           <trade>seamstress</trade>
           <relation target=E1234>wife of the above</relation>
          </entry>

Here we have allocated the arbitrary name or identifier ``E1234'' to the Baker's entry. By supplying that same identifier as the value for the target attribute associated with the <relation> element of the Seamstress' entry, we assert not only the existence of the relationship itself but also its target. This simple solution to a well-known problem has several attractive features, but perhaps the most attractive is that it makes explicit the fact that the target of the relationship is an interpretation brought to bear on the text by the encoder of it, leaving the text itself unchanged. Other attributes (say, ``certainty'' or ``authority'') may also be imagined which might carry additional interpretative information associated with the link.
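Following such a link mechanically is straightforward, as this Python sketch suggests. It assumes a fully tagged register in which every <entry> carries an id attribute and an explicit end-tag; the second identifier, ``E1235'', is invented for the purpose, as are the function and variable names.

```python
import re

# Assumed encoding of the two register entries, with every entry identified.
REGISTER = """
<entry id=E1234><name>John Smith</name><trade>baker</trade></entry>
<entry id=E1235><name>Mary Smith</name><trade>seamstress</trade>
<relation target=E1234>wife of the above</relation></entry>
"""

ENTRY = re.compile(r"<entry id=(\w+)>(.*?)</entry>", re.S)
NAME = re.compile(r"<name>(.*?)</name>")
RELATION = re.compile(r"<relation target=(\w+)>(.*?)</relation>")


def resolve_relations(register):
    """Return (name, relation phrase, name of target entry) for every <relation>."""
    entries = dict(ENTRY.findall(register))  # identifier -> content of the entry
    names = {ident: NAME.search(body).group(1) for ident, body in entries.items()}
    links = []
    for ident, body in entries.items():
        for target, phrase in RELATION.findall(body):
            # names.get gives None when the target identifier is not in the register.
            links.append((names[ident], phrase, names.get(target)))
    return links


print(resolve_relations(REGISTER))
```

Note that the program never has to interpret the words ``the above'': the encoder's reading of them is carried entirely by the target attribute.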

5. Ensuring consistency

While a rose might smell just as sweet by any other name, every computer user knows that names intended for automatic processing must be spelled exactly and defined precisely. The human reader might tolerate paragraphs sometimes labelled <p>, sometimes labelled <para>, and sometimes not labelled at all, but the computer is less forgiving. Slightly less obviously perhaps, the user of an SGML aware software system needs to know what elements have been defined for a given text (or group of texts) and what their possible contents are. He or she needs to know not just whether personal names should be tagged <propname> or <name>, but also in what contexts personal names may reasonably be expected to appear (for example, if something tagged as a name appears within a name, it is probably an error). He or she also needs to know what attribute names have been defined for particular elements and their legal values, and also what entity names should be used for particular symbols. The formal specification of these names and their usage is enshrined in a separate component, unique to SGML, known as a Document Type Definition or DTD.

5.1 Defining a Document: the content model

A DTD performs a function analogous to that of a grammar: it formally defines what are the legal productions of a given markup language. Of course, DTDs can be as lax or as restrictive as any other kind of grammar: the designer of a DTD generally has to trade off generality of use with accuracy of error detection. The simplest kind of DTD would be one which did no more than specify a set of tag names, requiring only that every element tagged in a document use one of them. Such a DTD would of course be unable to detect errors such as <name>s occurring within <name>s or within <date>s, nor to prohibit such errors as register entries appearing other than inside registers. Creating correctly encoded texts with such a DTD would be rather like trying to speak a foreign language with the aid of a lexicon of the language, but no idea of its syntax.

More usually however, the transcribers and creators of electronic texts wish to control how elements can meaningfully appear within a given class of texts, so that processors intended to act on them can do so more intelligently. The specification of what is legal within any one kind of textual component or element is known in SGML as its content model, because it provides a model for its content. Here, for example, is a part of the formal DTD for the register example given informally above.

       <!ELEMENT register - - (entry+)> 
       <!ELEMENT entry - o (name, trade, relation?)> 
       <!ELEMENT name - - (#PCDATA)> 

These three lines are examples of SGML declarations: each defines or declares a name for an element and what its content should be. The details of the syntax need not detain us; note only that each declaration (like everything else in SGML) is explicitly delimited, in this case by a symbol marking its start (``<!'') and another marking its end (``>''). The content model part of each declaration is given in parentheses at the end. Between the name of the element (``register'' in the first case) and the content model are two characters which specify whether or not both start- and end- tags are required to mark off occurrences of the element. The hyphen character indicates that a tag is required, the letter O that it is optional. Thus, in this example, <register>s and <name>s must have both start- and end-tags, whereas <entry>s can be specified using start-tags only.

The content model for register states that a <register> consists of one or more <entry>s; the plus sign indicates that the element before it can be repeated one or more times. Thus a register containing no entries, or one containing something other than an entry, would be regarded as an error by this DTD. The content model for an entry states that an <entry> must begin with a <name>, followed by a <trade> and then optionally by a <relation>. The commas between the components of this content model indicate that the elements must all appear in the order given. The question mark following the <relation> indicates that this element need not be present. Thus, an entry with no name, or one where the trade preceded the name, would both be regarded as erroneous by this DTD, whereas entries are equally acceptable whether with or without <relation>s. Finally, the content model for a <name> states that it may contain only text, that is, simply data with no embedded tags. (The word ``#PCDATA'' is a special SGML symbol standing for parsed character data -- which must be ``parsed'' because it may contain entity references as well as raw characters).

SGML syntax allows for other variations, which will be necessary if we are to refine this model to reflect more accurately the probable content of register entries in the real world. We will begin by relaxing the restriction on the number of <trade>s an entry may contain:

        <!ELEMENT entry - o (name, trade*, relation?)>
The asterisk following the word ``trade'' indicates that an entry may contain zero or more <trade>s. An entry such as the following
            <name>John Smith</name>
            <trade>candle-stick maker</trade>
would be legal according to this second definition, as would one like this:
          <entry><name>John Smith</name></entry>

Suppose however that entries are mixed, sometimes containing names and trades, sometimes only one or the other. One possible content model for this situation would be:

           <!ELEMENT entry - o ((name|trade|relation)+)>
The vertical bar symbol may be read as ``or''. This content model states that an <entry> must contain at least one component, which may be a <name>, a <trade> or a <relation>, and may contain more than one of any of them, in any order. (The inner set of parentheses is needed to indicate that the plus sign is to be applied to the whole group of alternated names). The following entries would all be legal according to this definition:
             <name>John Smith</name>
             <relation>wife of the above</relation>
             <name>Mary Jones</name>
             <name>John Smith</name>
             <name>Henry Jones</name>
As the last example indicates, such laxity of definition may lead to difficulties of interpretation -- our syntax now cannot help us determine whether it is Smith or Jones who is the smith. But presumably in that respect we are being true to the source!
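
The occurrence indicators and connectors met so far (comma, vertical bar, question mark, asterisk and plus sign) correspond closely to regular-expression operators. The following Python sketch makes that correspondence explicit. The functions model_to_regex and fits are hypothetical helpers written for this illustration; the ``&'' connector and exceptions are deliberately left out.

```python
import re


def model_to_regex(model):
    """Translate a simple SGML content model into a compiled regex.

    Handles element names, ',' (sequence), '|' (alternation), grouping
    and the occurrence indicators ? * +. Nothing else is supported.
    """
    parts = []
    for token in re.findall(r"#?[A-Za-z][\w.-]*|[()|,?*+]", model):
        if token == "(":
            parts.append("(?:")          # non-capturing group
        elif token == ",":
            continue                     # sequence is plain concatenation
        elif token in ")|?*+":
            parts.append(token)          # same meaning in a regex
        else:
            # An element name: match it followed by one separating space.
            parts.append("(?:" + re.escape(token) + " )")
    return re.compile("".join(parts))


def fits(model, children):
    """Return True if the list of child element names satisfies the model."""
    text = "".join(name + " " for name in children)
    return model_to_regex(model).fullmatch(text) is not None


print(fits("(name, trade*, relation?)", ["name"]))                 # True
print(fits("(name, trade*, relation?)", ["name", "trade"]))        # True
print(fits("((name|trade|relation)+)", ["relation", "name"]))      # True
print(fits("((name|trade|relation)+)", []))                        # False
```

The last two calls show the laxity discussed above: the alternation model accepts children in any order, and rejects only an entry with no components at all.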

A more realistic situation would be that names, trades and relationships are found promiscuously intertwined in any order within any amount of other text. SGML offers two ways of dealing with this. One is simply to add #PCDATA as a fourth alternate to the example above, to give a declaration like this:

           <!ELEMENT entry - o ((name|trade|relation|#PCDATA)+)>
This is the solution generally preferred in the TEI Guidelines for the general case, where elements consist of small identifiable elements (names, trades etc.) swimming about in an arbitrary mixture or `primal soup'. Another approach achieving a similar effect is to use what is known as an inclusion exception, as in this example:
         <!ELEMENT entry - o (#PCDATA) +(name|trade|relation)>
Either of the above definitions states that items tagged as names, trades or relations may appear anywhere within an entry, an arbitrary number of times, interspersed with arbitrary sequences of character data. The following example would be regarded as legal according to both the above definitions:

    <entry>Also<name>John Smith</name></entry>

    <entry><name>John Smith</name>
           and<name>Adam Smith</name>
           <relation>his brother</relation>
     of the
          <name>Tom Cobbley</name> and all
The difference between the two element declarations is that the second allows names, trades or relations to appear arbitrarily not only within entries, but also within anything that entries contain. Unless further modified, this definition allows (for example) names to occur within trades (or vice versa), or even within other names. Entries such as the following would be legal by the second definition, but not the first:
      <name>John <trade>Smith</trade> Jones</name> 
    <entry> <relation>Brother of 
Of course, entries such as the following would also be acceptable by either definition:
    <entry>Any old string of characters you care to type in

To complete the content models for our simple register example, we need to define the sub-components of an entry. For the sake of argument, we will assume that <name>, <trade> and <relation> are to contain only text, with no further embedded elements. Since they all share the same content model a single declaration will suffice:

        <!ELEMENT (name|trade|relation) - - (#PCDATA)>

5.2 Defining a document: attribute lists

We now turn to the definition of attributes for each of the elements discussed so far. As with elements, SGML requires that all attributes likely to be used within a document must be defined in advance. It also offers a variety of features to control the values that specific attributes may take.

An example attribute declaration might look like the following:

         <!ATTLIST name id     ID    #IMPLIED
                        key    CDATA #IMPLIED
                        type (personal|honorific) personal>
This declaration associates three different attributes with the element <name>. The first attribute is called id, the second key and the third type. Note that all the attributes for a given element are defined together in a single ATTLIST declaration. For each attribute named, two additional pieces of information are required for its declaration. The first, following the name of the attribute, defines what type of value may be supplied for it, while the second, following the type specification, indicates what value should be assumed if the attribute is not specified for any element occurrence.

In this example, three different kinds of value are specified. The id attribute may take as its value only a string [See note 10] which the SGML processor will treat as an identifier for the associated element: this is indicated by the keyword ID. A value for the id attribute need not be supplied, and if it is not, then a processor may take whatever default action it chooses; this is indicated by the keyword #IMPLIED.

The key attribute may take as its value any string of characters; this is indicated by the keyword CDATA. The SGML processor will not check the value of this attribute in any way, except of course that it may not contain any form of markup [See note 11]. Again, there is no requirement to supply a value for this attribute.

More exact checking is provided for in the case of the type attribute: here we have specified that only two values are legal for this attribute -- the type of a name must be marked as either personal or honorific. If no value is specified, the processor is to assume that the intended value is personal.
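
The defaulting and checking behaviour described for the three attributes of <name> can be sketched in Python as follows. The dictionary NAME_ATTLIST and the function resolve_attributes are hypothetical, introduced only for illustration; treating an unsupplied #IMPLIED attribute as None is one possible ``default action'', not something SGML prescribes.

```python
# Hypothetical in-memory form of the ATTLIST declaration for <name>.
# An enumerated type is stored as a tuple of the permitted values.
NAME_ATTLIST = {
    "id":   {"type": "ID",    "default": None},   # #IMPLIED
    "key":  {"type": "CDATA", "default": None},   # #IMPLIED
    "type": {"type": ("personal", "honorific"), "default": "personal"},
}


def resolve_attributes(attlist, supplied):
    """Apply declared defaults and check enumerated values.

    Raises ValueError for an undeclared attribute or for a value that is
    not among those permitted for an enumerated attribute.
    """
    for attr in supplied:
        if attr not in attlist:
            raise ValueError(f"undeclared attribute: {attr}")
    resolved = {}
    for attr, decl in attlist.items():
        value = supplied.get(attr, decl["default"])
        allowed = decl["type"]
        if isinstance(allowed, tuple) and value not in allowed:
            raise ValueError(f"{attr} must be one of {allowed}, got {value!r}")
        resolved[attr] = value
    return resolved


# The key value "SMITH1" is an invented example.
print(resolve_attributes(NAME_ATTLIST, {"key": "SMITH1"}))
# {'id': None, 'key': 'SMITH1', 'type': 'personal'}
```

The sketch does not check that an ID value is a well-formed name or that it is unique within the document, both of which an SGML parser would do.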

To illustrate some of the other possibilities for attribute specifications, we conclude with a declaration for the attribute list to be attached to the <relation> element:

     <!ATTLIST relation  target    IDREF  #REQUIRED 
                         certainty NUMBER 0            >
This states that the relation element may have two attributes, called target and certainty. The former takes as its value an IDREF, that is a string which has been used as the value for an ID-type attribute somewhere within the current document. The SGML parser will check that the value actually identifies some other element, as is the case in our example. The keyword #REQUIRED means additionally that no <relation> can exist in the document unless the element to which it points is specified in this way; this keyword is used to specify that a value must be supplied for every occurrence of the attribute to which it is attached (the examples of relations given above are thus all illegal by this definition!). Finally, the certainty attribute is defined as taking a numeric value, which defaults to zero.
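
The two checks implied by this declaration (a target must be supplied, and it must match some ID in the document) might be sketched in Python as below. The function check_idrefs and its flat list-of-pairs representation of a document are assumptions made for the sake of the example; the identifier values are invented.

```python
def check_idrefs(elements):
    """Check <relation> targets against the IDs declared in a document.

    `elements` is a list of (element_name, attributes) pairs in document
    order. Returns a list of error messages, empty if all is well.
    """
    # Collect every value supplied for an id attribute anywhere in the document.
    ids = {attrs["id"] for _, attrs in elements if "id" in attrs}
    errors = []
    for name, attrs in elements:
        if name != "relation":
            continue
        if "target" not in attrs:
            # #REQUIRED: a value must be supplied for every occurrence.
            errors.append("relation lacks the required target attribute")
        elif attrs["target"] not in ids:
            # IDREF: the value must match an ID used somewhere in the document.
            errors.append(f"target {attrs['target']!r} matches no ID")
    return errors


document = [
    ("name", {"id": "N1"}),
    ("relation", {"target": "N1"}),   # legal: N1 is a declared ID
    ("relation", {"target": "N9"}),   # error: no element has id N9
    ("relation", {}),                 # error: target is #REQUIRED
]
for message in check_idrefs(document):
    print(message)
```

Because the full set of IDs is gathered before any reference is checked, a target may point forwards as well as backwards in the document.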

6. Why should I use a DTD?

It should be emphasized that the preceding brief discussion is by no means comprehensive and is intended only to give a flavour of the kinds of tools at the disposal of the SGML document designer. The interested reader is referred to one of the introductory texts cited in the bibliography for more comprehensive information. It should also be noted that designing a DTD is not something which every user of an SGML system needs to do afresh. On the contrary, it is the objective of endeavours such as the Text Encoding Initiative to define general-purpose DTDs which can be used for a wide variety of purposes.

Why, however, should anyone transcribing a historical source for analysis care about the existence of such endeavours? How can a knowledge of SGML help in understanding a text, in placing it into its proper context? It should be relatively uncontentious that the mechanical drudgery of transcribing, editing and reproducing texts is enormously simplified by the uncoupling of the tasks of data interpretation (tagging) and data reproduction (formatting). The former is an essentially hermeneutical and scholarly act, while the latter is not. Coombs, Renear et al have persuasively argued that the proliferation of sophisticated tools for desktop publishing has effectively seduced scholars from their true vocation (Coombs 1990) and that SGML offers a chance to regain the high ground.

Suppose that you have obtained, or created, a DTD which adequately describes the kind of source texts you wish to process. How will it be of use to you? You will use it firstly as a means of checking that each individual document you process conforms to the model you have defined. As well as providing you with a diagnostic check on the accuracy of your keyboarding, this will also provide what the Americans call a reality check on your interpretation of the sources concerned: you may be forced to confront aspects of your sources to which an initial, possibly over-hasty, assessment has blinded you. Moreover, because each part of an electronic text (or at least, each part that has been tagged) is equally accessible, equally processable, you can analyse the contents of your sources with a far higher degree of accuracy and sophistication than either a simple transcription or a database of derived observations would provide.

Historical sources do not, however, belong to one individual. More perhaps than other kinds of resource, they must be shareable, whether from economic or ethical considerations. An electronic text encoded in SGML, with its associated document type definition and appropriate documentation, is a permanent asset, independent of time, place and personality.

7. A brief bibliography

This bibliography lists some useful further reading for general information about SGML and markup languages. A full and very detailed SGML bibliography edited by Robin Cover and David T. Barnard is also available as TEI document MLW14. A much expanded and updated version of the latter is also available on the World Wide Web from http://www.sil.org/sgml/sgml.html.


[ 1 ] International Organization for Standardization, ISO 8879: Information processing -- Text and office systems -- Standard Generalized Markup Language (SGML), ([Geneva]: ISO, 1986)

[ 2 ] Examples include Bryan (1988), van Herwijnen (1990) and Goldfarb (1991).

[ 3 ] Originally published as part of Greenstein, D.I. (ed) Modelling Historical Data (Göttingen, Max-Planck-Inst. für Geschichte, 1991)

[ 4 ] De Rose (1990)

[ 5 ] I do not discuss here the possibility of using variant character sets within an SGML document; though possible, this does not of course solve the general case.

[ 6 ] The use of the ampersand and semicolon to delimit the acronym is an example of the SGML reference concrete syntax, discussed briefly below. It is not a necessary part of the SGML solution; merely a conventional one.

[ 7 ] The semicolon is not in fact strictly necessary in all situations: the end of an entity reference is signalled by the first character encountered which cannot form part of a name. A space is therefore sufficient; since however entity references are often found within words, rather than between them, the semicolon is often necessary to indicate where the entity name ends and the word within which it is embedded resumes.

[ 8 ] Categorization of this kind could equally well be achieved by using a different tag -- say biblicalName. The decision as to whether to use an attribute or a distinct element type is often a difficult one, involving more detailed technical and design skill than it is appropriate to consider here.

[ 9 ] For clarity of discussion, more specific element names have been used here than are proposed in the basic TEI scheme.

[ 10 ] More exactly, a name token, that is a sequence of alphanumeric characters of which the first is a letter

[ 11 ] It must also be enclosed in quotation marks if it is not a name token