TEI EDW25: What is SGML and How Does It Help?


                           By Lou Burnard

                   1 August 1991, revised 3 Oct 1991

Abstract
--------

SGML is an abbreviation for "Standard Generalized Markup Language".
This language, or rather metalanguage, was first defined by an
International Standard in 1986 [1], and is currently undergoing its first
five-year review. While some changes are likely, it is certain that the
standard will be with us for many years to come. As a number of detailed
technical descriptions of SGML are already available [2], this paper [3]
will briefly describe the purpose and scope of the standard, aiming
to persuade the non-technically minded reader that it has something
to offer him or her.

1. What is SGML for?
--------------------

The objectives of those who designed SGML were simple. Confronted with
an increasing number of so-called "markup languages" for electronic
texts, each more or less bound to a particular kind of processing or
even to a particular software package, they sought to define a single
language in which all such schemes could be re-expressed, so that the
essential information represented by such texts could be transferred
from one program or application to another. In the next section I will
therefore give a slightly more formal definition of what is meant by
the term "markup language". A universal language necessarily presupposes
some basic concepts or semantic primitives in which the notions of all
other languages can be expressed: the semantic primitives of SGML are
simple and few in number, and their definition forms the bulk of the rest
of this paper. I begin, however, with a few remarks about what SGML is
"not".

Newcomers to SGML often think of it as a special case of the kind of
markup language with which they may be familiar. They expect it to
define a universal set of tags or to define exactly what tags mean, in
terms of how the items identified by tags are to be processed. But the
semantics of a markup language are precisely what SGML does not concern
itself with: it describes only the formal properties and inter-relation
of the components of a document.  It does not tell you what it means
to define part of a text as belonging to some category (say, "blort");
it simply tells you how things-called-blort can legally appear within
texts - whether they can be decomposed into "blortettes", or whether
more than one can appear at the start of a document, and so on.
Determining what a thing-called-blort actually may be is inextricably
entangled with how the text is to be processed, and the function of SGML
is to define the content of a document in terms that are entirely
independent of its processing.

It follows from this that it is nonsensical to think of SGML as a kind
of text formatting system (although its origins can be readily traced in
the world of electronic text formatting), or as a competitor for such
languages as TeX or PostScript. These are languages which define how ink
(or its equivalent) is to be placed onto paper (or its equivalent); they
are not primarily concerned with the formal structure of the language
represented by those placements of dark-on-light. SGML by contrast is
decidedly unhelpful about how texts are to be reproduced, since this is
but one of the many applications for which a text may be placed into
electronic form. Its strength is that by separating the notion of what
the text actually "is" from how the text "is rendered", it makes
possible the use of the same text by many different kinds of processor.

As a simple example, consider the headings used to introduce the
subdivisions of the present document. These need to be separated from
the body of the text so that they can be formatted in a particular way.
However, I have not yet decided how - and it is more than likely that
those responsible for printing this text will prefer to format them
in some other way in any case. If therefore I use the facilities
available on my word processor for the display of headings - say a
change of font size, a margin indent and a switch to bold font - I will
not be helping the typesetter's task very much. Moreover, should I wish
to prepare a list of the subheadings in my text to serve as an index, I
will very probably find it quite difficult to distinguish occasions
where bold font indicates headings from cases where it indicates (say)
emphasized phrases in the text. By the same token, I will find it
difficult to check that each subsection has one and only one subheading,
that any numbers included in the subheadings are in the right sequence
and so forth. And when, in the fullness of time, this text enters the
great database of late twentieth century prose, future linguists and
historians will have comparable difficulties in assessing the linguistic
properties of text used as subheadings as distinct from those of the
main text. If, however, I simply tag each subheading as such, using some
unique string of codes to say "here begins the text of a subheading" and
some other to mark its end, then the same input text can be used
unchanged by any formatter, any indexing program and any linguistic
analysis program. Each one will be able to decide for itself what it
wants to do with the subheadings - how it would like to process them,
if at all.
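
As a minimal sketch of this point, assume that subheadings are tagged
with a hypothetical <head> element (the tag name is an illustration, not
one fixed by this paper) and that Python is the processing language. An
indexing program then needs to know nothing about fonts or indents:

```python
import re

# Hypothetical tag name: the paper does not prescribe one for subheadings.
HEADING = re.compile(r"<head>(.*?)</head>", re.DOTALL)

def list_headings(text):
    """Return the content of every <head>...</head> element, in document order."""
    return [content.strip() for content in HEADING.findall(text)]

sample = (
    "<head>1. What is SGML for?</head>\n"
    "The objectives of those who designed SGML were <emph>simple</emph>.\n"
    "<head>2. Markup and Markup Languages</head>\n"
    "The word markup was originally used to describe annotation.\n"
)

# The emphasized phrase is never confused with a heading,
# because the two are tagged differently.
print(list_headings(sample))
```

A regular expression is not an SGML parser; it serves here only to show
that the same tagged input could be handed unchanged to a formatter or
to a linguistic analysis program.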

While indexing the subheadings in a document of this nature is clearly
of somewhat limited importance, it should be apparent that the solution
proposed for that problem is an entirely generalisable one. Consider for
example any of the various kinds of historical source materials
described elsewhere in this volume. [4] Which is likely to be of more use
in compiling a list of the names in an electronic transcription of the
records of an ecclesiastical court - a version in which the
names are simply italicised (as are for example Latin phrases, running
titles, annotations etc.) or one in which each name is marked off clearly
by a tag such as <name>? Which is likely to be of more use in extracting
statistical data for input to a spreadsheet analysis of the average age
of offenders - a version in which birth dates are clearly marked as such,
perhaps incorporating some normalised version of the date concerned, or
one in which all dates are simply intermingled with the running text?

2. Markup and Markup Languages
------------------------------

The word "markup" was originally used to describe annotation or other
marks within a text intended to instruct a compositor or typist how a
particular passage should be printed or laid out.  Examples, familiar to
proof readers and others, include wavy underlining to indicate boldface,
special symbols for passages to be omitted or printed in a particular
font and so forth.  As the production of texts was automated, the term
was extended to cover all sorts of special "markup codes" inserted into
electronic texts to govern formatting, printing, or other processing.

Generalizing from that sense, we define markup, or (synonymously)
"encoding", as any means of making explicit an interpretation of a text.
At a banal level, all printed texts are encoded in this sense:
punctuation marks, use of capitalization, disposition of letters around
the page, even the spaces between words, might be regarded as a kind
of markup, the function of which is to help the human reader determine
where one word ends and another begins, or how to identify gross
structural features such as headings or simple syntactic units such as
dependent clauses or sentences. Encoding a text for computer processing
is, in principle, like transcribing a manuscript from "scriptio continua",
a process of making explicit what is conjectural or implicit, a process
of directing the user as to how the content of the text should be
interpreted.

A "markup language" may be no more than a loose set of markup
conventions used together for encoding texts.  A markup language must
specify what markup is allowed and whereabouts, what markup is required,
how markup is to be distinguished from text, and what the markup means.
As noted above, SGML provides the means for doing the first three
of these only; it allows you to describe a markup language independently
of what the markup is intended to do. To understand and act upon the
markup, additional semantic information is needed, which will differ in
different situations. Documentation like that enshrined in the TEI's
"Guidelines" [5] provides such information. In just the same way as one
may be able to parse the syntactic structure of a Latin unseen without
having the least idea what it is about, so an SGML-aware processor can
analyze the structure of an SGML-encoded document with no sense of its
meaning. This independence is necessary, given the open-ended nature of
electronic textual applications. It does not, of course, imply that the
intentions of the encoder of a text are unimportant or vacuous; only
that they are formally distinct from the encoding itself.

Three basic concepts are fundamental to an understanding of all markup
languages, when described in SGML terms. These are the notions of a
markup "entity", a markup "element", with its associated "attributes",
and a "document type".  At the most primitive level, texts are composed
simply of streams of symbols (characters or bytes of data, marks on a
page, graphics, etc.): these are known as "entities" in SGML. At a
higher level of abstraction, a text is composed of representations of
objects of various kinds, linguistically or functionally defined. Such
objects do not appear randomly within a text: coherence demands that
particular types of object appear in specifiable relationship to other
objects - they may be included within each other, linked to each other by
reference or simply presented sequentially, for example. This level of
description sees texts as composed of structurally defined objects,
known as "elements" in SGML. The grammar defining how elements may be
legally combined in a particular class of texts is known as a "document
type". [This view of the nature of text has been nicely defined by
DeRose (1990) as an "ordered hierarchy of content objects".] These three
fundamental concepts together are, it is claimed,  adequate to describe
all the complexities of marked-up texts, of whatever kind and for
whatever purposes. Each is discussed in turn in the next three sections.

3. Entities
-----------

The word "entity" is used in SGML rather differently from its use
elsewhere, notably in database design methodology. An SGML entity is
simply a named bit of text, considered entirely independently of any
structural or categorical classification it might have. A document may
be an SGML entity, as may any arbitrary sequence of characters within
it, or any symbol it contains. The definition of an entity associates a
name with a particular string of bytes, which may be the representation
of some characters in a particular computer encoding or held in a
system-defined container of some kind (such as a file). Within an SGML
document, entities are represented by reference, using the defined name.
This mechanism has a number of important uses, specified further below,
primarily in making it possible to encode textual features such as
special characters or symbols which are unique to a particular
environment or application in a way that is independent of that
particular environment.

Everyone who has communicated by electronic mail, or even tried to move
a file from a Macintosh computer to a PC, knows that some of the symbols
of which texts are composed are less portable than others. With the best
will in the world, computer manufacturers and standards bodies alike
will never be able to represent all the possible symbols occurring in
written texts in a single universally agreed code set, simply because
these symbols do not form a closed set: the task is as hopeless as that
of enumerating all the words in a language. Moreover, it is a fact of
life that different computing environments adopt different methods of
representing the same symbols, disregard entirely the existence of some
and insist on distinguishing others.

A notorious consequence of this state of affairs is that the second
letter of Hans J/oslash/rgen Marker's second name will look perfectly
satisfactory when typed on a keyboard here in Odense, but when
transmitted over the network to the UK will either be mysteriously
transmuted into a percent sign, or lost completely. How then is this
name to be stored in a computer file in such a way as to ensure that it
can be satisfactorily processed by any computer, not just those which
have the decency to be aware of the Danish national character set?

Exactly the same problem arises, in a more acute form, when considering
the range of symbols likely to be required in transcribing manuscript
texts or spoken language. There is no computer character set in which
the long form of "s" is distinguished from the short, still less for
distinguishing ligatured forms of the same letters, or for representing
all scribal abbreviations, astrological symbols, non-vocalic grunts,
pauses etc.  Nor, Unicode notwithstanding, do I think it likely that
there ever will be.

The SGML solution is to encode characters not available in the
particular character set used for document transmission [6] by means
of entity reference. If Hans J/oslash/rgen is represented as
Hans J&oslash;rgen, I can associate the unlovely mnemonic "&oslash;" with
whatever particular stream of bytes is necessary on my computer to
produce the slashed-o to which he is entirely entitled. [7]
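
The substitution just described can be sketched as follows. This is an
illustration only, in Python: the lookup table stands in for whatever
system-specific entity definitions a real SGML processor would read,
and only the name "oslash" is taken from the text above.

```python
import re

# Local, system-specific replacement text for each entity name.
# Only "oslash" comes from the paper; the table itself is an assumption.
ENTITIES = {"oslash": "\u00f8"}

# An entity reference: ampersand, a name, semicolon.
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")

def expand_entities(text, table=ENTITIES):
    """Replace each &name; reference with its locally defined string.

    Names missing from the table are left untouched rather than guessed at.
    """
    def substitute(match):
        return table.get(match.group(1), match.group(0))
    return ENTITY_REF.sub(substitute, text)

print(expand_entities("Hans J&oslash;rgen Marker"))
```

A site whose equipment cannot display the character could supply a
different table (mapping "oslash" to "oe", say) without any change to
the encoded text itself.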

Some have objected to the apparent verbosity of such mnemonics, by
comparison with the variety of encoding tricks or ad hoc solutions
customarily resorted to. The advantage of the entity reference solution
is simply that it forms part of a single and consistent convention,
comprehensible without resort to special purpose documentation (which is
generally absent). Sets of standard mnemonics for all the accented
letters and special symbols of modern European languages are to be found
in ISO 8879 and elsewhere.

The same mechanism can be applied more widely for any stream of bytes
to be included within a document. The use of a single short abbreviation
for a much repeated or particularly complex phrase is a simple way of
ensuring consistency and reducing effort; it is worth noting in this
context that the value of an entity reference can include other mark-up,
such as tags or other entity references, provided that any element
opened within a given entity is also closed within it.  This method has
been adopted for example by the TEI committee responsible for defining
linguistic annotation. Another use is for identifying objects which
cannot be directly represented in a text, for example non-textual
entities like graphics or formulae. More mundane uses are not difficult
to identify.

It should be stressed that entities have no structural properties: they
are simply shortcuts enabling an SGML-aware processor to substitute a
system-defined string of bytes for a name identified as such by special
SGML delimiters. As such they are merely a special (if well thought out)
way of doing the kinds of things which transcribers and encoders of text
have already been doing for many years.

4. Elements and their content
-----------------------------

The level of description at which texts are composed solely of entities
in the SGML sense defined above is not, however, a very satisfactory
one. All markup schemes, to a greater or lesser extent, attempt to
identify and to distinguish components of texts at a more ambitious
level of description. Texts are not simply sequences of words, still
less of bytes; they contain instances of objects, such as paragraphs,
titles, names, dates etc., represented by such sequences. A
consideration of such schemes indicates at least three
important aspects of textual objects which need to be recognised: their
"extent" - that is, the points in the textual stream at which object
instances begin and end; their "type" - that is, the category to which
object instances are assigned; and their "context" - that is, their
relationship to other object instances within the document. SGML
addresses each of these concerns: everything in an SGML document is
delimited explicitly in some way; a document is decomposed into elements
of a named type; and a kind of textual grammar can be defined.

   4.1 A note on syntax -
   ----------------------

Most discussions of SGML mention if only in passing that the particular
characters and conventions used to represent SGML markup in a particular
document can be redefined. This is of course a necessary consequence of
the fact that SGML is not itself a markup language, but a method of
describing such languages. However, in order to stay sane, this document
will follow customary practice in using the "reference concrete syntax"
to represent SGML markup. This rebarbative phrase is actually quite a
precise description of what it denotes: it is a "concrete" syntax,
because it represents by particular characters (the characters ">", "<",
"!" and other delimiters) distinctions required by the SGML model of how
markup should be described (its "abstract" syntax); it is provided for
"reference" purposes, as an example of one generally useful way of
representing the constructs of the language.

The SGML reference concrete syntax has two great advantages over most
other ways of making concrete a view of the abstract structure of a
markup language: everything is delimited (bracketed) explicitly, and
very few special characters are needed. As we have already seen, entity
references are delimited explicitly by the ampersand character and the
semicolon. [8] In the same way, element occurrences within an SGML
document are explicitly delimited in the reference concrete syntax by
named "tags". There are two kinds of tag:  "start-tags", which indicate
the beginning of an element, and "end-tags", which indicate its end. The
tags themselves are delimited by special characters: "<" to mark the
beginning of a start-tag, and "</" to mark the beginning of an end-tag.
In either case, the character ">" is used to indicate the end of a tag.
Between these delimiters is given a name identifying the type of element
delimited by the start and end tag pair.  For example, an embedded name
element in a text might be tagged as follows:

             Call me <name>Ishmael</name>.

This is by no means the only way of indicating the presence of an SGML
element within a text; it is however the most explicit, and hence that
into which other representations are most generally mapped.
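
To make the delimiter conventions concrete, here is a minimal sketch in
Python of how a processor might separate tags from text. It assumes the
reference concrete syntax with every tag written out in full and without
attributes; it is not a conforming SGML parser.

```python
import re

# Matches a start-tag (<name>) or an end-tag (</name>); attributes are not handled.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)>")

def tokenize(text):
    """Split text into ("start", name), ("end", name) and ("text", data) tokens."""
    tokens = []
    pos = 0
    for match in TAG.finditer(text):
        if match.start() > pos:
            # Anything between two tags is character data.
            tokens.append(("text", text[pos:match.start()]))
        kind = "end" if match.group(1) else "start"
        tokens.append((kind, match.group(2)))
        pos = match.end()
    if pos < len(text):
        tokens.append(("text", text[pos:]))
    return tokens

for token in tokenize("Call me <name>Ishmael</name>."):
    print(token)
```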

   4.2 Content models -
   --------------------

As suggested earlier, the primary function of the start and end tags
within a marked-up text is to indicate the extent of a particular object
or textual component (the SGML term is "element") within it. In addition,
the tags identify the category or type of the element which they delimit,
by supplying a name for it ("name" in the example above).  An SGML-aware
processor can thus easily identify the start and end of all elements of
a given type within a document - it can identify all names, all
sentences, all paragraphs (etc.) and process them in a way appropriate
for such objects.

The content of a document element of a particular type (that is, the
portion between the start and end tags) may consist simply of running
text, perhaps including entity references. More usually, it will contain
other embedded document elements; occasionally it may have no content at
all. The ability of SGML to specify rules about how elements can nest
within other elements is one of its chief strengths and is discussed
further below. Here we simply note that elements of one type typically
contain elements of another: for example, a parish register consists of
a mixture of birth, marriage and death records, each of which contains
elements such as names, dates and details of an event. We might thus
expect to find such records encoded in SGML with different tags for
<birth>, <marriage> and <death> elements, within each of which might
be found <name> and <date> elements. In exactly the same way, a document
such as this one might be encoded as a series of <paper> elements, each
of which begins with a <title>, followed optionally by an <abstract>,
and at least one (and probably several) <section>s, each composed of
<paragraph>s.

An empty element (one which has no content) may seem like a
contradiction: what use can it be simply to tag a specific point in a
text, especially if there is no way of associating information with it?
At the very least, it should be possible to supply a name or other
identifier to distinguish one such empty point in a text from another.
Fortunately, SGML does provide a mechanism for adding such
"extra-textual" information to the elements of a text: that of
"attributes", discussed in the next section.

   4.3 Attributes and cross-references -
   -------------------------------------

Like "entity", the word "attribute" has a specific technical sense
when used in the SGML context, which differs somewhat from its sense
when used in the database design context. An SGML attribute is a
category of information associated with a particular type of element,
but not contained within it. Attributes are associated with particular
element occurrences by including their name and value within the
start-tag for the element concerned. For example:

   Call me <name source=Biblical>Ishmael</name>.

Here "source" is the name of an attribute associated with any
occurrence of the <name> element; "Biblical" is the value
defined for this attribute in the case of the example <name>
shown above. [9]
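
A processor has to separate the element name from the attribute names
and values given in the start-tag. The following Python sketch shows one
way this might be done; the function name is hypothetical, and the
patterns accept only quoted values or simple unquoted ones, which is an
assumption rather than the full SGML rule.

```python
import re

# <name attr=value ...> : an element name followed by optional attribute text.
START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9.-]*)((?:\s+[^<>]*)?)>")

# name="value", name='value' or name=value (unquoted, name characters only).
ATTRIBUTE = re.compile(
    r"""([A-Za-z][A-Za-z0-9.-]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([A-Za-z0-9.-]+))"""
)

def parse_start_tag(tag):
    """Return (element_name, {attribute_name: value}) for one start-tag."""
    match = START_TAG.fullmatch(tag)
    if match is None:
        raise ValueError("not a start-tag: %r" % tag)
    attributes = {}
    for name, double_quoted, single_quoted, bare in ATTRIBUTE.findall(match.group(2)):
        # Exactly one of the three value groups is non-empty for a non-empty value.
        attributes[name] = double_quoted or single_quoted or bare
    return match.group(1), attributes

print(parse_start_tag("<name source=Biblical>"))
```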

Attributes are used for two related purposes: they enable an identifying
number or name to be associated with a particular element occurrence
within a text (which might otherwise be missing), and they enable
additional information missing from a text to be added to it without
violating its integrity.

As an example of the first usage, consider the page or folio numbering
of a historical source. There is a sense in which the individual pages
of a source might be regarded as distinct elements within it. This is
not however generally the primary focus of interest for those using it:
in most cases, the page number is of importance only as a means
of documenting where the other elements of the text occur. Moreover, the
page numbers may not appear at all in the original source. In such
cases, a tag <page.break> may conveniently be used to mark the
point in the text at which a new page begins. An attribute (say,
"number") would then provide a convenient means of indicating the
number of the page just begun: thus

               text of page 3 ends here
                <page.break number=4>
              text of page 4 starts here


As an example of the second usage, consider the common need for
normalisation in prosopographical studies. One way of achieving this
might be to associate an attribute such as "normal" with each occurrence
of <name> elements in a text, the value of which would be a normalised or
encoded form of the name, which could also serve as an identifying key
in a database derived from the text. For example:

          <name normal='SMITJ04'>Jack Smyth</name>


Attribute values may be defaulted, taken from a controlled list or
specified freely, the only constraint being that they cannot contain
markup.

The most common use for attributes in the TEI and other SGML schemes is
not however to categorise element occurrences in this way, but to
identify them. In the TEI scheme, for example, every element is defined
to have an ID attribute, which supplies a unique identifier for that
particular textual element within the text. This makes possible the
encoding of links between individual elements of a text in a simple and
economical way. This facility is very commonly used in document
preparation systems (such as TeX or Scribe) in order to link cross-
references (such as "see section 3 above") within a text with the
sections of a text to which they refer, when the section number is not
known or may be dynamic. In SGML, such a system is completely
generalisable. For example, let us suppose that we wish to encode a
register of names in which the following passage occurs:

          John Smith, baker.
          Mary Smith, seamstress, wife of the above.

In this example we have two <entry>s, each containing a <name> and
a <trade>. The second entry however contains an additional clause
which states a relationship between it and another element. We begin
by tagging the elements so far identified: [10]

           <register>
              <entry><name>John Smith<trade>baker
              <entry><name>Mary Smith<trade>seamstress
              <relation>wife of the above

Clearly "wife of the above" is meaningless as a relation unless we have
some way of pointing to the entry with which it is linked. Let us assume
that the referent of "the above" is the whole of John Smith's entry
rather than just the name within it; the assumption does not affect
the argument. What is needed is some way of identifying that entry
uniquely; that identifying number can then be supplied as the target
of the relationship. In other words, we need an identifying attribute
(call it "id") that can be attached to any <entry> and a pointer
attribute (call it "target") which can be attached to any <relation>.
Using these, and inventing an arbitrary value for the identifier, we can
encode the link implicit in the above text as follows:

          <register>
             <entry id=E1234><name>John Smith<trade>baker
             <entry><name>Mary Smith<trade>seamstress
             <relation target=E1234>wife of the above


Here we have allocated the arbitrary name or identifier "E1234" to the
baker's entry. By supplying that same identifier as the value for the
target attribute associated with the <relation> element of the
seamstress's entry, we assert both the existence of the relationship
itself and its target.  This simple solution to a well-known
problem has several attractive features, but perhaps the most attractive
is that it makes explicit the fact that the target of the relationship
is an interpretation brought to bear on the text by the encoder of it,
leaving the text itself unchanged. Other attributes (say, "certainty" or
"authority") may also be imagined which might carry additional
interpretative information associated with the link.
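
How a program might follow such a link can be sketched in Python. The
attribute names "id" and "target" are those invented above; the helper
functions are hypothetical, and the splitting of the register into
entries relies on the simplifying assumption that each <entry>
start-tag closes the entry before it.

```python
import re

register = """<register>
<entry id=E1234><name>John Smith<trade>baker
<entry><name>Mary Smith<trade>seamstress
<relation target=E1234>wife of the above
"""

def split_entries(text):
    """Return one string per entry; a new <entry start-tag ends the previous one."""
    return ["<entry" + part for part in text.split("<entry")[1:]]

def name_of(entry):
    """Return the content of the first <name> in an entry, or None."""
    match = re.search(r"<name>([^<]*)", entry)
    return match.group(1).strip() if match else None

def resolve_relations(text):
    """Return (source name, relation text, target name) for every <relation>."""
    entries = split_entries(text)
    index = {}
    for entry in entries:
        match = re.match(r"<entry\s+id=([A-Za-z0-9.-]+)", entry)
        if match:
            index[match.group(1)] = entry
    links = []
    for entry in entries:
        pattern = r"<relation\s+target=([A-Za-z0-9.-]+)>([^<]*)"
        for target, label in re.findall(pattern, entry):
            target_entry = index.get(target)   # None if the pointer dangles
            target_name = name_of(target_entry) if target_entry else None
            links.append((name_of(entry), label.strip(), target_name))
    return links

print(resolve_relations(register))
```

The text of the register is never altered: the link lives entirely in
the two attribute values.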

5. Ensuring consistency
-----------------------

While a rose might smell just as sweet by any other name, every computer
user knows that names intended for automatic processing must be spelled
exactly and defined precisely. The human reader might tolerate paragraphs
sometimes labelled <p>, sometimes labelled <para>, and sometimes not
labelled at all, but the computer is less forgiving. Slightly less
obviously perhaps, the user of an SGML-aware software system needs to
know what elements have been defined for a given text (or group of texts)
and what their possible contents are. He or she needs to know not just
whether personal names should be tagged <propname> or <name>, but also
in what contexts personal names may reasonably be expected to appear
(for example, if something tagged as a name appears "within" a name,
it is probably an error). He or she also needs to know what attribute
names have been defined for particular elements and their legal values,
and also what entity names should be used for particular symbols. The
formal specification of these names and their usage is enshrined in a
separate component, unique to SGML, known as a "Document Type Definition"
or DTD.

   5.1 Defining a Document : the content model -
   ---------------------------------------------

A DTD performs a function analogous to that of a grammar: it formally
defines what are the legal productions of a given markup language. Of
course, DTDs can be as lax or as restrictive as any other kind of
grammar: the designer of a DTD generally has to trade off generality of
use with accuracy of error detection. The simplest kind of DTD would be
one which did no more than specify a set of tag names, requiring only
that every element tagged in a document use one of them. Such a DTD
would of course be unable to detect errors such as <name>s occurring
within <name>s or within <date>s, or to prohibit such errors as register
entries appearing other than inside registers. Creating correctly
encoded texts with such a DTD would be rather like trying to speak a
foreign language with the aid of a lexicon of the language, but no idea
of its syntax.

More usually however, the transcribers and creators of electronic
texts wish to control how elements can meaningfully appear within a
given class of texts, so that processors intended to act on them can
do so more intelligently. The specification of what is legal within
any one kind of textual component or element is known in SGML as its
"content model", because it provides a model for its content.  Here,
for example, is a part of the formal DTD for the register example given
informally above.

       <!ELEMENT register - - (entry+)>
       <!ELEMENT entry - o (name, trade, relation?)>
       <!ELEMENT name - o (#PCDATA)>


These three lines are examples of SGML "declarations": each defines or
declares a name for an element and what its content should be. The
details of the syntax need not detain us; note only that each
declaration (like everything else in SGML) is explicitly delimited, in
this case by a symbol marking the start of a declaration (the "<!") and
its end (the ">"). The content model part of each declaration is given
in parentheses at the end. Between the name of the element ("register"
in the first case) and the content model are two characters which
specify whether or not both start- and end- tags are required to mark
off occurrences of the element. The hyphen character indicates that a
tag is required, the letter O that it is optional.  Thus, in this
example, <register>s must have both a start and an end tag, whereas
<entry>s and <name>s can be specified using start-tags only.

The content model for register states that a <register> consists of one
or more <entry>s; the plus sign indicates that the element before it can
be repeated one or more times. Thus a register containing no entries, or
one containing something other than an entry would be regarded as an
error by this DTD. The content model for an entry states that an <entry>
must begin with a <name>, followed by a <trade> and then optionally by a
<relation>. The commas between the components of this content model
indicate that the elements must all appear in the order given. The
question mark following the <relation> indicates that this element need
not be present. Thus, an entry with no name, or one where the trade
preceded the name, would both be regarded as erroneous by this DTD,
whereas entries are equally acceptable, whether with or without
<relation>s.
Finally, the content model for a <name> states that it may contain only
text, that is, simply data with no embedded tags. (The word "#PCDATA" is
a special SGML symbol standing for "parsed character data" - which must
be "parsed" because it may contain entity references as well as raw
characters).
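
Since a content model behaves much like a regular expression over
element names, the checks just described can be sketched in Python by
translating one into the other. This is an illustration under stated
assumptions, not a description of how SGML parsers work: the function
names are hypothetical, only the comma, vertical bar, "?", "*" and "+"
are handled, and #PCDATA is not supported.

```python
import re

TOKEN = re.compile(r"\s*([A-Za-z][A-Za-z0-9.-]*|[(),|?*+])")
OPERATORS = {"(", ")", "|", "?", "*", "+"}

def model_to_regex(model):
    """Translate a content model into a regex over space-terminated child names."""
    model = model.strip()
    parts = []
    pos = 0
    while pos < len(model):
        match = TOKEN.match(model, pos)
        if match is None:
            raise ValueError("unsupported content model near %r" % model[pos:])
        token = match.group(1)
        if token == ",":
            pass                              # sequence is plain concatenation
        elif token in OPERATORS:
            parts.append("(?:" if token == "(" else token)
        else:
            parts.append("(?:%s )" % re.escape(token))   # one child element
        pos = match.end()
    return re.compile("".join(parts))

def is_valid(model, children):
    """True if the ordered list of child element names satisfies the model."""
    sequence = "".join(child + " " for child in children)
    return model_to_regex(model).fullmatch(sequence) is not None

ENTRY = "(name, trade, relation?)"
print(is_valid(ENTRY, ["name", "trade"]))
print(is_valid(ENTRY, ["trade", "name"]))
```

With this entry model, a name followed by a trade is accepted with or
without a trailing relation, while an entry lacking a name, or one with
the trade first, is rejected.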

SGML syntax allows for other variations, which will be necessary if we
are to refine this model to reflect more accurately the probable content
of register entries in the real world. We will begin by relaxing the
restriction on the number of <trade>s an entry may contain:

        <!ELEMENT entry - o (name, trade*, relation?)>

The asterisk following the word "trade" indicates that an entry may
contain zero or more <trade>s. An entry such as the following

         <entry><name>John Smith
                <trade>butcher
                <trade>baker
                <trade>candle-stick maker

would be legal according to this second definition, as would one like
this:

          <entry><name>John Smith


Suppose however that entries are mixed, sometimes containing names and
trades, sometimes only one or the other. One possible content model for
this situation would be:

           <!ELEMENT entry - o (name|trade|relation)+>

The vertical bar symbol may be read as "or". This content model states
that an <entry> must contain at least one component, which may be a
<name>, a <trade> or a <relation>, and may contain more than one of any
of them, in any order. (The parentheses are needed to indicate that the
plus sign is to be applied to the whole group of alternated names.) The
following entries would all be legal according to this definition:

          <entry><name>John Smith
          <entry><trade>Baker<trade>Chandler
          <entry><relation>wife of the above<name>Mary Jones
          <entry><name>John Smith<name>Henry Jones<trade>smith


As the last example indicates, such laxity of definition may lead to
difficulties of interpretation - our syntax now cannot help us
determine whether it is Smith or Jones who is the smith. But presumably
in that respect we are being true to the source!

A more realistic situation would be that names, trades and relationships
are found promiscuously intertwined in any order within  any amount of
other text. SGML offers two ways of dealing with this. One is simply to
add #PCDATA as a fourth alternate to the example above, to give a
declaration like this:

           <!ELEMENT entry - - (name|trade|relation|#PCDATA)*>


This is the solution generally preferred in the TEI Guidelines for the
general case, where an element consists of smaller identifiable elements
(names, trades etc.) swimming about in an arbitrary mixture or "primal
soup". Note that with this content model it is no longer possible to
leave out end-tags.  Another approach achieving a similar effect is to
use what is known as an "inclusion exception", as in this example:

         <!ELEMENT entry - - (#PCDATA) +(name|trade|relation)>


Either of the above definitions states that items tagged as names, trades
or relations may appear anywhere within an entry, an arbitrary number of
times, interspersed with arbitrary sequences of character data. The
following example would be regarded as legal according to both the above
definitions:

    <entry>Also<name>John Smith</name>
    <entry><name>John Smith</name>
           and<name>Adam Smith</name>
           <relation>his brother</relation>
   <entry><relation>Cousin of the<trade>blacksmith</trade>
   <entry>Old<relation>Uncle</relation>
          <name>Tom Cobbley</name> and all


The difference between the two element declarations is that the second
allows names, trades, or relations to appear arbitrarily not only within
entries, but also within anything that entries contain. Unless further
modified, this definition allows (for example) names to occur within
trades (or vice versa), or even within other names. Entries such as the
following would be legal by the second definition, but not the first:

    <entry> <name>John <trade>Smith</trade> Jones</name>
    <entry> <relation>Brother of
              <name>Henry
                     <trade>Potter</trade>
                    Jones
              </name>
            </relation>


Of course, entries such as the following would also be acceptable by
either definition:

    <entry>Any old string of characters you care to type in


To complete the content models for our simple register example, we need
to define the sub-components of an entry.  For the sake of argument, we
will assume that <name>, <trade> and <relation> are to contain only text,
with no further embedded elements.  Since they all share the same
content model a single declaration will suffice:

        <!ELEMENT (name|trade|relation) - - (#PCDATA)>
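
With the sub-components declared as text only, the difference between
the two <entry> declarations can be made concrete. The following Python
sketch is an illustration under stated assumptions, not a description of
any SGML parser: an element is represented (hypothetically) as a pair of
tag name and list of children, a child being either a string of
character data or another such pair, and the two function names are
invented for the purpose.

```python
INLINE = {"name", "trade", "relation"}

def legal_mixed(entry):
    """First declaration: name, trade and relation may appear directly
    inside an entry, and may themselves contain only character data."""
    _tag, children = entry
    for child in children:
        if isinstance(child, str):
            continue                      # character data is always allowed
        child_tag, grandchildren = child
        if child_tag not in INLINE:
            return False
        if not all(isinstance(g, str) for g in grandchildren):
            return False                  # no element may nest inside another
    return True

def legal_inclusion(element):
    """Second declaration: the inclusion applies at any depth, so the
    same check is simply repeated inside every contained element."""
    _tag, children = element
    for child in children:
        if isinstance(child, str):
            continue
        if child[0] not in INLINE or not legal_inclusion(child):
            return False
    return True

flat = ("entry", ["Also", ("name", ["John Smith"])])
nested = ("entry", [("name", ["John ", ("trade", ["Smith"]), " Jones"])])
plain = ("entry", ["Any old string of characters you care to type in"])

for example in (flat, nested, plain):
    print(legal_mixed(example), legal_inclusion(example))
```

Only the nested example distinguishes the two: it is rejected by the
first check and accepted by the second.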


   5.2 Defining a document: attribute lists
   ----------------------------------------

We now turn to the definition of attributes for each of the elements
discussed so far.  As with elements, SGML requires  that all attributes
likely to be used within a document must be defined in advance. It also
offers a variety of features to control the values that specific
attributes may take.

An example attribute declaration might look like the following:

         <!ATTLIST name id     ID    #IMPLIED
                        normal CDATA #IMPLIED
                        type (personal|honorific) personal>


This declaration associates three different attributes with the element
<name>. The first attribute is called "id", the second "normal" and the
third "type". Note that all the attributes for a given element are
defined together in a single ATTLIST declaration. For each attribute
named, two additional pieces of information are required for its
declaration. The first, following the name of the attribute, defines
what type of value may be supplied for it, while the second, following
the type specification, indicates what value should be assumed if the
attribute is not specified for a given occurrence of the element.

In this example, three different kinds of value are specified. The
"id" attribute may take as its value only a string [11] which the SGML
processor will treat as an identifier for the associated element: this is
indicated by the keyword "ID". An ID-value need not be supplied, and if
it is not, then a processor may take whatever default action it chooses;
this is indicated by the keyword "#IMPLIED".

The "normal" attribute may take as its value any string of characters;
this is indicated by the keyword CDATA. The SGML processor will not
check the value of this attribute in any way, except of course that it
may not contain any form of markup. [12] Again, there is no requirement
to supply a value for this attribute.

More exact checking is provided for in the case of the "type" attribute:
here we have specified that only two values are legal for this attribute
- the type of a name must be marked as either "personal" or "honorific".
If no value is specified, the processor is to assume that the intended
value is "personal".

To illustrate some of the other possibilities for attribute
specifications, we conclude with a declaration for the attribute list to
be attached to the <relation> element:

     <!ATTLIST relation  target    IDREF  #REQUIRED
                         certainty NUMBER 0            >


This states that the relation element may have two attributes, called
"target" and "certainty". The former takes as its value an "IDREF", that
is, a string which has been used as the value for an ID-type attribute
somewhere within the current document. The SGML parser will check that
the value actually identifies some other element, as is the case in our
example. The keyword "#REQUIRED" specifies that a value "must" be
supplied for every occurrence of the attribute to which it is attached:
no <relation> can exist in the document unless the element to which it
points is specified in this way (the examples of relations given above
are thus all illegal by this definition!).  Finally, the "certainty"
attribute is defined as taking a numeric value, which defaults to zero.
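
The effect of the two attribute list declarations can be summarized in a
short Python sketch. It is offered as an illustration only: the table
layout, the function name check and the identifier "P1" are all invented
here, and several things a real parser would do (checking that ID values
are unique, for example) are left out.

```python
import re

# The two ATTLIST declarations from the text, transcribed by hand:
# element -> attribute -> (declared value, default)
ATTLISTS = {
    "name": {
        "id": ("ID", "#IMPLIED"),
        "normal": ("CDATA", "#IMPLIED"),
        "type": (("personal", "honorific"), "personal"),
    },
    "relation": {
        "target": ("IDREF", "#REQUIRED"),
        "certainty": ("NUMBER", "0"),
    },
}

def check(element, supplied, known_ids=()):
    """Return (effective attribute values, list of error messages)."""
    declared = ATTLISTS[element]
    effective = {}
    errors = [f"{attr}: not declared for <{element}>"
              for attr in supplied if attr not in declared]
    for attr, (kind, default) in declared.items():
        if attr in supplied:
            value = supplied[attr]
        elif default == "#REQUIRED":
            errors.append(f"{attr}: a value must be supplied")
            continue
        elif default == "#IMPLIED":
            continue      # no value: the application decides what to do
        else:
            value = default
        if isinstance(kind, tuple) and value not in kind:
            errors.append(f"{attr}: {value!r} is not one of {kind}")
        elif kind == "NUMBER" and not re.fullmatch(r"[0-9]+", value):
            errors.append(f"{attr}: {value!r} is not a number")
        elif kind == "IDREF" and value not in known_ids:
            errors.append(f"{attr}: no element has the ID {value!r}")
        else:
            effective[attr] = value
    return effective, errors

print(check("name", {}))                         # type defaults to personal
print(check("relation", {}))                     # target is missing
print(check("relation", {"target": "P1"}, known_ids={"P1"}))
```

The second call reports an error because no target was supplied, which
is the sense in which the earlier examples of relations are illegal.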

6. Why should I use a DTD?
--------------------------

It should be emphasized that the preceding brief discussion is by no
means comprehensive and is intended only to give a flavour of the kinds
of tools at the disposal of the SGML document designer. The interested
reader is referred to one of the introductory texts cited in the
bibliography for more comprehensive information. It should also be noted
that designing a DTD is not something which every user of an SGML system
needs to do afresh. On the contrary, it is the objective of endeavours
such as the Text Encoding Initiative to define general-purpose DTDs
which can be used for a wide variety of purposes, including those which
are the particular concern of historians.

Why, however, should anyone transcribing a historical source for
analysis care about the existence of such endeavours? How can a
knowledge of SGML help in understanding a text, in placing it into its
proper context? It should be relatively uncontentious that the
mechanical drudgery  of transcribing, editing and  reproducing texts is
enormously simplified by the uncoupling of the tasks of data
"interpretation" (tagging) and data "reproduction" (formatting). The
former is an essentially hermeneutical and scholarly act, while the
latter is not.  Coombs, Renear et al. have persuasively argued that the
proliferation of sophisticated tools for desktop publishing has
effectively seduced scholars from their true vocation (Coombs 1987) and
that SGML offers a chance to regain the high ground.

Suppose that you have obtained, or created, a DTD which adequately
describes the kind of source texts you wish to process. How will it be
of use to you? You will use it firstly as a means of checking that
each individual document you process conforms to the model you have
defined. As well as providing you with a diagnostic check on the accuracy
of your keyboarding, this will also provide what the Americans call a
"reality check" on your interpretation of the sources concerned: you may
be forced to confront aspects of your sources to which an initial,
possibly over-hasty, assessment has blinded you.  Moreover, because each
part of an electronic text (or at least, each part that has been tagged)
is equally accessible, equally processable, you can analyse the contents
of your sources with a far higher degree of accuracy and sophistication
than either a simple transcription  or a database of derived
observations would provide.

Historical sources do not belong to one individual, however. More
perhaps than other kinds of resource, they must be shareable, whether
from economic or ethical considerations. An electronic text encoded in
SGML, with its associated document type definition and appropriate
documentation, is a permanent asset, independent of time, place and
personality.


A brief bibliography
--------------------

This bibliography lists some useful further reading for general
information about SGML and markup languages. A full and very detailed
SGML bibliography edited by Robin Cover and David T. Barnard is also
available as TEI document MLW14.

   -Bryan 88
    Bryan, Martin
   **SGML: an Author's Guide to the Standard Generalized
          Markup Language**  (Addison-Wesley, 1988)
    [Detailed text book giving full treatment of
    the standard, but primarily from the publishing perspective.]

   -Coombs 87
    Coombs, James H., et al.
    "Markup Systems and the Future of Scholarly Text Processing"
    **Communications of the ACM**,  Vol. 30 no 11 (November, 1987)
    pp. 933-47
    [Classic polemic in favour of descriptive over procedural
     markup presented from the scholarly perspective.]

   -DeRose 90
    DeRose, Steven J., et al.
    "What is text, really?"
    **Journal of Computing in Higher Education**, Vol 1 no 2
    (Winter, 1990)

   -Goldfarb 91
    Goldfarb, Charles
    **The SGML Handbook**  (Oxford University Press, 1991)
    [Authoritative and exhaustive presentation of all aspects
     of ISO 8879, including annotated and cross referenced full text
     of the standard itself.]

   -ISO 1986
    International Organization for Standardization
    "ISO 8879: Information processing - Text and office systems -
     Standard Generalized Markup Language (SGML)" (ISO, 1986)
    [Annexes A and B to the Standard provide a formal but
     readable summary of its most important features.]

   -ISO 1988
    International Organization for Standardization
    "ISO/TR 9573: Information processing - SGML support facilities -
     Techniques for using SGML" (ISO, 1988)
    [Tutorial discussion of main features of the standard with some
     interesting examples.]

   -van Herwijnen 90
    van Herwijnen, Eric
    **Practical SGML** (Kluwer, 1990)
    [Good introductory textbook with emphasis on how
     SGML is currently being used.]


----------------Notes------------------------------------------------

[1] International Organization for Standardization, "ISO 8879:
Information processing - Text and office systems - Standard
Generalized Markup Language (SGML)", ([Geneva]: ISO, 1986).

[2] Examples include Bryan (1988), van Herwijnen (1990) and Goldfarb
(1991).

[3] Originally published as part of Greenstein, D.I. (ed) **Modelling
Historical Data** (Goettingen:  Max-Planck-Inst. fuer Geschichte, 1991)

[4] See Greenstein, D.I. (ed), **Modelling Historical Data**.

[5] ACH-ACL-ALLC "Guidelines for the Encoding and Interchange of
Machine-Readable Texts", edited by C.M. Sperberg-McQueen and Lou Burnard
(Chicago and Oxford, Text Encoding Initiative, October, 1990)

[6] I do not discuss here the possibility of using variant character
sets within an SGML document; though possible this does not of course
solve the general case.

[7] The use of the ampersand and semicolon to delimit the acronym is an
example of the SGML "reference concrete syntax", discussed briefly
below.  It is not a necessary part of the SGML solution; merely a
conventional one.

[8] The semicolon is not in fact strictly necessary in all situations:
the end of an entity reference is signalled by the first character
encountered which cannot form part of a name. A space is therefore
sufficient; since however entity references are often found within
words, rather than between them, the semicolon is often necessary to
indicate where the entity name ends and the word within which it is
embedded resumes.

[9] Categorisation of this kind could of course equally well be achieved
by using a different tag - say <biblical.name>. The decision as to
whether to use an attribute or a distinct element type is often a
difficult one, involving more detailed technical and design skill than
it is appropriate to consider here.

[10] For simplicity, I have omitted the end-tags in this example; this
is a legitimate abbreviatory convention in many circumstances, as
discussed further below.

[11] More exactly, a "name", that is, a sequence of alphanumeric
characters of which the first is a letter.

[12] It must also be enclosed in quotation marks if it is not
a name token.