Standardised Languages for Data Exchange and Storage. The Encoded Archival Description: using SGML to create permanent electronic handlists.

[This local archive copy mirrored from the canonical site: http://www.dur.ac.uk/Library/asc/eadarticle.html 980319; links may not have complete integrity, so use the canonical document at this URL if possible.]

Durham University Library, Archives and Special Collections

HTML version of an article in Business Archives Principles and Practice, 73 (May 1997) (where, as if in order to demonstrate the dangers of loading a document created in one version of WP into another, the quote marks went quite awry. They have been corrected here, and an additional item added to the reading list)
June 1997

Richard Higgins
Follett project: retrospective conversion of handlists, Durham University Library Archives and Special Collections.

Introduction

Since the invention of the computer there has been a fundamental problem with data: its structure is dependent upon the hardware and operating system on which it was created. Those working at institutions that have been using computers for a decade or so will be familiar with trying to retrieve documents created on defunct systems only to find strange anomalies have been introduced during the translation, or no 8 inch floppy disk drive has survived with which to access the data. The most obvious solution to this problem is only to have one type of computer, but that is not how the situation has evolved, and although there are probably fewer types of computer on desks now, there are several very different machines about, and still more operating systems available for them. In the imperfect world in which we find ourselves, how can we ensure that the data we create is proof against the vagaries of progress, that it is not rendered useless by the purchase of some new machinery, and also that it can be of use to those who already have different systems? Equally important is the potential to use the data of others: nothing brings a greater sense of satisfaction than to be able to make use of the hard work of another without spending more time manipulating the file than it would have taken to retype it from scratch. A primary method for achieving this involves platform independent computing, whereby the functions performed by the computer are not dependent upon the type of machine or operating system used for that particular instance of the operation.

Platform independent computing

The rise of the Internet, propelled by the World Wide Web, has brought more fundamental benefits than the ability to present a photograph of your pet cat to the global audience it has always deserved. One of the major changes it has already wrought in computing is the erosion of the barriers between different computer systems. Formerly, one manufacturer would charge a great deal for the work it took to provide the software to communicate with the operating systems on other machines already in use by your organisation, and each time either operating system changed a further re-write (and charge) would be involved. While all computers function in much the same way, the hardware may require the internal processes to be performed in different ways: more obvious to the user will be the variations in how the machine stores data. Divergent file systems become apparent from the unfamiliar file naming conventions, or simply refuse to work in other environments,⁽¹⁾ and unexpected characters displayed on the screen hint at alternative character sets in use.⁽²⁾

Now, with global networks, the user is oblivious of the origin of the material on view - it may be from the same system, or a different platform on the other side of the world: this is no longer a factor. Whereas a network used to bind computers together within the private world of their own operating system, these monoliths are being superseded by more open configurations. Much of this freedom has been created by introducing a degree of separation between source and viewer, so that instead of treating the entire system as a single entity, bound together by a network, the process is divided into two parts capable of using the network to communicate not just between themselves but with any other devices sharing a common protocol. This is referred to as a client-server setup, and involves, apart from the communication medium, two separate programming tasks. The server runs the supply side of the transaction, controlling and delivering the data from a central source, while the client enables the desktop machine to receive and display the data. Familiar examples of this are e mail, received centrally and then viewed by the recipient, and the WWW browsers such as Lynx, Netscape or Mosaic.⁽³⁾ The programmer has the design specifications for the client and the server, and using these, creates a version to work with the operating system of the machine concerned. This means that this client or server application can interact with those on other computers with different operating systems, but for which the same client-server processes have been created using their own respective native codes. This concept, having a collection of predefined common functions that can be enabled in the vernacular of many machines, is a fundamental part of the new generation of flexibility.

Among the applications that have been constructed upon this principle, there is one which has significant value for those providing large bodies of data. It provides a client-server system for searching across many data collections held in different native formats. By introducing a degree of separation between enquirer and data source the standard, called Z39.50,⁽⁴⁾ seeks to overcome any incompatibility between the machines and systems at either end of the transaction, and also to allow the traversing of many different data repositories in a single sweep. It defines a client-server relationship for general searching by creating an archetypal template for a collection of data. This consists of a database, or collection of records,⁽⁵⁾ each of which has fields defined as access points which can be searched, and retrievable elements which can be reported back to the enquirer. Once the server has been installed and configured to provide this information from your data source, anyone with a Z39.50 client can search across your data without running, or even being aware of, the actual application in which the information resides, and combine this with searches of other Z39.50 compliant servers. Once more, by creating a stable standard that can then be implemented in the language of whatever operating systems are being run, Z39.50 demonstrates how the old barriers to wide data searching have been overcome.

Platform independent data

All of these new developments, however, require raw material: what is the value of being able to search every data source, if there is data that is unavailable? When technology provides global searches at the press of a button, then if the data is not there to be searched the results obtained must always be qualified. Impossible it may be to record everything, but these technological advances only serve to demonstrate that priority should now be given to making data available in a structured format. While it is convenient to store finding aids as simple text files, as the quantities involved accumulate, multiplied across repositories on a global scale, the volume of the shapeless mass of data becomes unmanageable. A notable example which confronts archivists early in their forays onto the WWW is the result produced by searching there for occurrences of the word "archive". As many computer operators keep all the underlying code for their programs, along with all their old data, in directories which they quite logically call archives, a search for "archive" will produce many thousands of examples, none of which relate to repositories. On a very simple level, using a certain type of document to list archival holdings allows an initial filter to be set that only finds relevant files: if this is then pursued further, structuring the data as well as the document type, queries can be yet more specific, and their results more succinct. The amorphous mass of information can be reduced rapidly to a more comprehensible size.

Creating a structure suitable for electronic finding aids involves looking at existing archival lists, and either reproducing their existing format in an electronic medium or developing an entirely new method. Given the amount of work involved in cataloguing archival collections, an unacceptable quantity of existing lists would have to be discarded if the new methods demanded re-cataloguing, so it would seem that this dictates a system that can use the legacy data of existing paper lists.

The simplest way to create a list of records has been to use prose: registers, cartularies, and letter books have for many centuries been used to describe material in sequence. The transition from handwriting to typescript to wordprocessor had no effect on the structure, except perhaps an underlying trend of normalisation that could be attributed to the iterative process of listing itself. Information could, however, become hide-bound in this format, and a different paradigm was pursued by splitting the items up into the sheaf binder or card index, which found its natural progression to the computer in the database. A set of fields can be created into which the data is distributed, with the result that with the addition of indexes a new way of storing information and retrieving from it just those fields that comply with specified criteria becomes possible by interrogating the database with a query language.

Before the developments in interconnectivity I have sketched above, a priority was placed upon enabling data exchange, in order to be able to pool information to create devices such as union catalogues.⁽⁶⁾ The apotheosis of this approach has been MARC (MAchine Readable Cataloguing). A fixed set of fields, and a mechanism for carrying information about the record itself was what it provided, and entire systems were developed to house the resulting datasets. MARC has not, in spite of the MARC AMC (Archives & Manuscript Control) work done in America, managed to overcome the archivist's antipathy to all things originating in libraries. There are some good reasons for this, as MARC has always had a rather one-dimensional approach, which is acceptable for a simple task ideally suited to a database such as recording book titles. Databases, however, tend to deal with uniform sets of information, which with archival collections cannot be relied upon. Above all, mimicking the hierarchical nature of an archive has always proven elusive with a database.⁽⁷⁾

Even in America, MARC AMC has tended to be used for collection level description only, providing an access point to the collection via the authority fields, while firmly directing the enquirer to a paper handlist for greater detail. Many of the inter-relationships between archival materials in a collection have become embedded in the handlist format, to an extent that only becomes clear when it is lost, as when all the items are entered as separate database records. It becomes more difficult to grasp the archive as an organic entity when contextual inferences cannot be drawn from surrounding material. The database is not a good medium for printing, either, and it is easy to spot material printed from a database. Aside from the uniform blocks produced from fields, it is only possible to format the entire contents of a field, preventing the use of italics or underlining for parts of the contents of a field, such as titles.

SGML

There is a middle way between prose and database, which gives a field structure within a text framework: it is called SGML (Standard Generalized Markup Language), has an inherently hierarchical approach to data, and an implementation of it is being designed specifically for the listing of archival material. Just as there is no need to understand the programming language in which a word processor was written, so there is no need to follow how SGML works in any depth: what is important is what it produces. The power and flexibility of SGML comes from the "Generalized", for by creating a system for defining markup rather than markup itself SGML has never become tied to any one application or platform. SGML grew, as so much in computing has, from an IBM project, in this case to develop an information system that could cope with document sharing between computers (this was in 1969, long before the World Wide Web, and the result was called GML [Generalized Markup Language]). This was subsequently developed by the project leader, Charles Goldfarb, until in 1986 it became an ISO standard - ISO 8879:1986.⁽⁸⁾

It is a meta-language which means, like the other developments already mentioned, it introduces a degree of separation between the created text and the application: by creating a system for describing document structure that includes mechanisms for mapping the operating system specific protocols of file naming and character sets within it, SGML solves the problems that defeat other attempts to create platform independent textbases. Markup is information about a document that is not part of the text of the document itself, although it may well affect the appearance of it. It provides a medium for explaining the language used in a document to a computer, and creates a grammar with which to express the structure of a text. Markup has several functions. It tells the computer what to do with text: in a word-processing package these are the "format" options which give such control over the appearance of a document, but carry no information about its structure. This is referred to as procedural markup, as it only deals with how to process the text. In order to describe the structure of a document, and the information that it contains, descriptive markup is required. This defines the logical structure of a text by dividing it into constituent parts, which it explains to the computer by assigning these parts with names and marking them in the document itself. Descriptive markup is where the power and functionality of SGML lies, and what really distinguishes it from most software. Finally, to introduce new material to the computer, referential markup provides a way of expanding its horizons beyond the basic set of alphanumeric characters, either to use the vast repertoire of text characters that are available but not standard to most computers, to jump to other documents (hypertext), or to call into the document photographs, sound recordings or any other media stored externally in a digital format.

Three types of markup defined in SGML need to be distinguished.

Descriptive markup (as stated above, the basis of SGML): referred to hereafter as elements or tags, these are the basic structural divisions of a document. Thus <p>A <genre>letter</genre> from <persname>Florence Nightingale</persname></p> - the line consists of a paragraph (<p>) in which a type of document (<genre>) and person (<persname>) are identified.
Entity references: these are substitutes for anything that cannot be directly expressed in SGML, or can be used to insert frequently used portions of text from source files, and provide the referential markup functionality of SGML.
To illustrate a point you could include ... <graphic entity="bread1">a plate of bread &amp; butter</graphic> ... the entity "bread1" represents a digital file containing a picture of the comestible concerned, while &amp; is a character entity which calls the "&" character from the correct place among the viewer's available fonts.
Markup declaration: this creates or modifies the SGML structure itself, descriptive markup for the markup, as it were, and so it is possible if you are using an existing SGML implementation that this can be left to the experts (for an example of this open any DTD file, where the document structure and tag usage is defined in SGML).

The primary concern of SGML is with re-usable text, so the major users have tended to be information providers. So saying, the commonest implementation of SGML now is HTML, designed to be a simple method of sharing text and graphics via the Internet. It is the most presentational application of SGML and tends to be viewed with disapproval by other SGML users, as it neglects the descriptive power of SGML. SGML is mainly to be found in publishing, either the production of commercial product, where the same file can be processed to produce a book or a CD-ROM, or in the technical support field. The vast body of manuals for the software and hardware that provide modern computing are compiled in SGML, to an agreed standard. In order to be considered to supply equipment to the US military, a product will not be accepted without all manuals being available in SGML. As it is a meta-language, all the software to produce and process these implementations of SGML must work with SGML itself, which means there is a significant customer base, with far more influence than the most global alliance of librarians and archivists, to drive forward the creation of SGML software. Within the academic world, the lead has been taken by TEI, the Text Encoding Initiative, an international consortium for the creation of electronic texts in the humanities,⁽⁹⁾ who use markup not merely to provide online versions of the classics, but to dissect the linguistic contents of large corpora of text or spoken material. The implementation of SGML which is of significance to archivists is the EAD (Encoded Archival Description) which developed from the Berkeley Finding Aid Project, under the lead of Daniel Pitti, to be taken up by the Library of Congress as an intended standard, which is now in development.⁽¹⁰⁾

How SGML works

SGML becomes an application by the creation of a DTD (Document Type Definition). This is effectively programme writing for SGML, but it is seldom required as an existing DTD can often be adopted, avoiding this rather complex process altogether. It requires a preliminary study of the structure of the documents to be produced, and then the definition of all these possible constituent parts as elements, with controls over where these can be used in the final document. In fact, when people say they are "using SGML" what they are actually using is a DTD created in SGML, as this is what allows SGML to perform a function. The DTD defines a series of elements which will form the constituent parts of the resulting document file. All data in the document will be bound within at least one element, which is an identifying container that holds and describes either more elements, ordinary text, or a combination of the two.⁽¹¹⁾

To a certain extent, it helps to think of an element as being like a field in a database, as it contains an item of information, although the element has far greater flexibility, as it can contain and be contained by other elements. The information it contains can be as specific as a date, or as general as a paragraph. An element works like a pair of brackets, in that it consists of both a start-tag and an end-tag,⁽¹²⁾ which contain the text that they are marking up as if in parentheses. Thus a heading could appear: <head>This is a heading</head>. ⁽¹³⁾

The end-tag is signalled by a "/" (solidus) symbol. In this way the computer is now able to identify a string of text as a heading, and can thus perform functions such as interpretation, querying or presentation upon it. What it also does is to stipulate where the elements can be used, so whatever element is open will govern which other elements will be available, and whether normal text can be added. The <head> tag, for example should only appear directly after another element - it would not make sense to type two lines of text and then put a heading. One of the most difficult tasks in the art of DTD writing is balancing the freedom to use a wide range of elements and text with the desire to control the structure of the finished document by restricting the availability of elements within some elements, or forcing the use of certain elements within others. The power of SGML is derived from the flexibility with which this structure can be manipulated far beyond the limits of the database form, allowing prose to escape the unnatural confines of fields.
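As a rough illustration of what "identifying a string of text as a heading" can mean in practice, the following Python sketch pulls out the text held between a start-tag and its end-tag. It is not an SGML parser: the function name element_contents is invented here, and the sketch assumes that every element has an explicit end-tag and is never nested inside another element of the same name.

```python
import re

def element_contents(text, name):
    """Return the text held between each <name ...> start-tag and </name> end-tag."""
    # Non-greedy match; assumes the element is never nested inside itself.
    pattern = re.compile(
        r"<{0}(?:\s[^>]*)?>(.*?)</{0}>".format(re.escape(name)), re.DOTALL
    )
    return [match.group(1) for match in pattern.finditer(text)]

sample = (
    "<head>This is a heading</head>"
    "<p>A <genre>letter</genre> from <persname>Florence Nightingale</persname></p>"
)

headings = element_contents(sample, "head")
people = element_contents(sample, "persname")
print(headings)
print(people)
```

Once the heading or the personal name has been isolated in this way, any further process (display, indexing, searching) can be applied to it without reference to the program that created the file.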

As well as defining the elements, then, the DTD will also control their use. In many cases a logical structure consists of a set sequence of elements: a list contains one or more items, so item elements are only available within a list element. The DTD controls which elements are available within which, as well as where plain text can be entered. An SGML authoring tool will always conform with the DTD, allowing only permitted elements to be inserted. It must be stressed that SGML is a neutral format, so it is up to the application to manipulate the data, but it is easy to see how automated processes could find text contained within a pair of tags and display it in bold, follow it with two blank lines, or search within it for specified text patterns.
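The kind of control a DTD exercises can be imitated on a very small scale. The sketch below is only an analogy, written in Python rather than in DTD syntax: the table CONTENT_MODEL is a hypothetical, much simplified stand-in for a real content model, the sample is written with every end-tag present so that Python's standard XML parser can read it, and only containment is checked, not the order or number of elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical, much simplified content model:
# element name -> (child elements allowed, is plain text allowed?)
CONTENT_MODEL = {
    "list": ({"head", "item"}, False),
    "head": (set(), True),
    "item": ({"persname"}, True),
    "persname": (set(), True),
}

def check(element, problems=None):
    """Collect a message for every breach of CONTENT_MODEL below element."""
    problems = [] if problems is None else problems
    allowed_children, text_allowed = CONTENT_MODEL[element.tag]
    # Text directly inside the element: its own text plus the tail of each child.
    pieces = [element.text] + [child.tail for child in element]
    if not text_allowed and any(piece and piece.strip() for piece in pieces):
        problems.append("text not allowed in <%s>" % element.tag)
    for child in element:
        if child.tag not in allowed_children:
            problems.append("<%s> not allowed in <%s>" % (child.tag, element.tag))
        if child.tag in CONTENT_MODEL:
            check(child, problems)
    return problems

good = ET.fromstring(
    "<list><head>Parties</head><item>(1) <persname>Joan Jones</persname></item></list>"
)
bad = ET.fromstring("<list>stray text<persname>Joan Jones</persname></list>")

print(check(good))
print(check(bad))
```

An authoring tool working from a real DTD applies the same idea while the text is being typed, offering only the elements that are permitted at the current position.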

Attributes occur within the start-tags of elements, in the form of ordinary text, and are controlled by the DTD as part of the definition of the element. They are modifiers of elements, opening another level of interpretation for the computer; a linguistic explanation often used is that they are the adjectives where the element is the noun. In short, their purpose is to provide further information about the contents of the element tag. To refine the information provided by a <persname> tag, the "role" attribute can be used to define the function of a person in a document <persname role="plaintiff">Joan Jones</persname>, or the given location of an item in a <unitloc> element can be on a large - <unitloc loctype="repository"> or small <unitloc loctype="container"> scale. Although they normally provide non-displayed information about the element, attributes can be set to be displayed, and will be searched by any process run on the SGML document. Very occasionally an attribute is compulsory within an element, but the major role of attributes is to introduce a flexible system for controlling or interpreting the use of an element. Nearly all elements will have an "id" attribute, which provides the option of supplying a unique identifying code for the element so that it can be located by reference to it. Thus the <item id="xyz8"> tag can be located by citing the reference "xyz8" elsewhere within the document, or in combination with the name of the file as well, from anywhere in the digital environment. Another common use of attributes is the normalization of data, which tends to mean rendering text in a form that can either be processed by the computer, or make an unusual usage more comprehensible to the user. To allow a computer to process a date, the following approach can be used: <date normal="16670518">18 May 19 Charles II</date>, where the normal form is "yyyymmdd". Attributes also provide procedural markup. It is usual in SGML to format text by providing a general emphasis element to carry the formatting, such as <emph>, which can take a series of attribute options, thus: <emph render="italics">.
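To suggest why a normalized date is worth the trouble, here is a minimal Python sketch that reads the "normal" attribute in its "yyyymmdd" form and leaves the regnal-year text for the human reader. The function name normal_dates is invented for the illustration, the matching is by simple pattern rather than by a true SGML parser, and incomplete values such as "162203xx" are simply passed over. Python's date type assumes the modern (Gregorian) calendar throughout, so the result is suitable for sorting and comparison rather than for working out days of the week.

```python
import re
from datetime import date

DATE_PATTERN = re.compile(
    r'<(date|unitdate)\b[^>]*\bnormal="(\d{4})(\d{2})(\d{2})"[^>]*>(.*?)</\1>',
    re.DOTALL,
)

def normal_dates(markup):
    """Yield (displayed text, date) for each element with a full yyyymmdd value."""
    for match in DATE_PATTERN.finditer(markup):
        year, month, day = (int(match.group(n)) for n in (2, 3, 4))
        yield match.group(5).strip(), date(year, month, day)

sample = (
    '<unitdate normal="16650404">4 April 17 Chas. II [1665]</unitdate> '
    '<date normal="16650331">31 March 1665</date> '
    '<date normal="162203xx">March 1622</date>'
)

results = list(normal_dates(sample))
earliest = min(results, key=lambda pair: pair[1])
print(results)
print(earliest)
```

The displayed wording is untouched, yet the computer can now put the documents in chronological order, which it could never do from "17 Chas. II" alone.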

The third piece of SGML terminology is the entity. As SGML is a platform independent meta-language, it needs a mechanism to cope with the way different computers work, and as the "meta" implies, it does this by creating an abstract system for describing actual things. These things are frequently the two major problems causing incompatibility between computers mentioned above - character sets and files. Even within the PC environment the settings for different countries cause problems with characters, so once machines with different architectures and operating systems become involved it becomes vital to be able to deal with extended character sets. This is where character entities are used: instead of the key combination that produces "é" on a PC display, the string "&eacute;" is used. As this format only requires the stable part of the character set,⁽¹⁴⁾ it will survive the transition to other systems. There is an accepted set of character entities, which can be distinguished by their starting with "&" and ending ";". This, incidentally, means that "&" needs a character entity when it is not indicating an entity - &amp; - ampersand per se ampersand. Files as entities can either represent text, or any other digitally encoded information (such as a graphic, sound or video file). Just as with character sets, the problem lies with the different ways computers manage what we refer to as files, and the solution is the same, to provide a standard method for referring to the file rather than relying upon the file itself conforming to a standard.
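The principle of the character entity can be shown in a few lines of Python. The table CHARACTER_ENTITIES below is a tiny hypothetical selection (a real system would take its entity declarations from the DTD and the public entity sets), and the function name expand_entities is likewise invented for this sketch. The stored text uses only the stable, basic characters; the display character is supplied at the last moment by whatever system is presenting the text.

```python
import re

# A hypothetical, three-entry table; real entity sets are far larger.
CHARACTER_ENTITIES = {
    "eacute": "\u00e9",  # e with acute accent
    "amp": "&",
    "pound": "\u00a3",   # pound sterling sign
}

def expand_entities(text, table=CHARACTER_ENTITIES):
    """Replace each &name; reference with the local character for that name."""
    def replace(match):
        # References not found in the table are left exactly as written.
        return table.get(match.group(1), match.group(0))
    return re.sub(r"&([A-Za-z][A-Za-z0-9.-]*);", replace, text)

stored = "Anne Carr (n&eacute;e Franks): bread &amp; butter, &pound;10"
displayed = expand_entities(stored)
print(stored.isascii())
print(displayed)
```

Because the substitution is made in a single pass, the "&" produced from &amp; is not mistaken for the start of another entity.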

It is vital at all times to regard the SGML file produced in conformance with the chosen DTD as a neutral item. Unlike, for example, a WordPerfect 5.1 document, which can only be used in word-processing software that can interpret its peculiar format,⁽¹⁵⁾ an SGML file is not a WP document, database, or spreadsheet, or even an Internet publishing tool. Just as it is independent of the platform or operating system on which you happen to be working, so too is the data file wholly independent of any application in which it may be used. An SGML application, like the bear of the bestiary, licks its formless progeny into its own image. This is one of the most important advantages of SGML: it confronts and solves the problem of a long term stable data format. It is impossible to predict the future use of our data files. The present rise of the Internet has led to a demand for screen presentation of what used to be produced on paper, so a premium is now placed upon online viewable data, for which SGML is the medium par excellence. It is, however, just as capable of producing print, as the major publishing houses have found, on any scale, while organisations such as ICADD (the International Committee on Accessible Document Design) are using it to produce Braille, voice synthesised or large print versions of texts. Structured data is useful data: the product need not be the entire file, as a Z39.50 server can be attached to an SGML file, allowing for extraction of individual sets of elements, or searching can be done with a text search and extract language such as PERL. As has been stressed, the point of SGML is to create a neutral data file, not to dictate how it is used.

Using SGML for archival description

The only way to grasp how SGML works is to look at some marked up text, so the rest of this article will look at the implementation of SGML most relevant to archivists.⁽¹⁶⁾ The outline structure for handlists defined by the EAD DTD is tripartite, made up of header, description and ancillary material. The first section, the header, contains information about the file itself. This is a vital part of any electronic documentation strategy, as anyone who has tried to keep track of revisions to a computer file will readily admit. Its role is to keep both the user and the author apprised of the history of the file with which they are working, by listing revisions analogous to the editions in which a book is produced. The middle section, usually the major part of a handlist, is the description of the contents of the collection in whatever depth is felt appropriate. The final part can contain further information, such as related material within the repository or elsewhere, a bibliography or index. To understand how EAD works, and yet remains flexible, it is only necessary to elaborate upon the central section.

EAD exploits SGML's element structure to place the descriptive elements within themselves to replicate the hierarchical layout of an archive. EAD provides a series of elements which conform to those set out in ISAD(G),⁽¹⁷⁾ describing date, location, identifying number, physical and content description. The real innovation is produced by the stacking of these descriptions within each other that is allowed by the element system. The following example has been constructed to demonstrate how a detailed description can be marked up, and also to show how EAD works.⁽¹⁸⁾

<c03 level="item"><did>
<unitid id="SHE-156">156.</unitid>
<unitdate normal="16650404">4 April 17 Chas. II [1665]</unitdate>
<note><p><list>
<item>(1) <persname authfilenumber="957702">George Shepperd</persname> of the <geogname authfilenumber="NT0526">town and county of Newcastle upon Tine</geogname>, gent.</item>
<item>(2) <persname authfilenumber="23549">Anne Carr (née Franks)</persname> of <geogname authfilenumber="SS0032">South Sheiles</geogname> in the county of Durham, widow.</item></list>
Lease by (1) to (2) of his half part of the messuage in <geogname authfilenumber="PO0016">Pockerley</geogname> in the county of Durham with its <subject authfilenumber="c56">collieries and coalmines</subject>, and a fulling mill.<lb>
Term: 1 month from <date normal="16650331">31 March 1665</date>.<lb>
Consideration: £10.<lb>
Signed: (1 ). Seal: red wax, papered, on parchment tag.</p></note>
<physdesc><extent>Parchment. 1m.</extent></physdesc>
<unitloc loctype="container">114/5-1</unitloc></did>
<c04 level="item"><did>
<unitid id="SHE-156a">156. (a)</unitid>
<unitdate normal="16650414">14 April 17 Chas. II [1665]</unitdate>
<note><p>Attached to 156:<lb>
Minutes of consultation with <persname authfilenumber="68239">cousin Nan</persname> about above agreement.<lb>
Refers to a book of surveys called <title render="italic">The Book of Pockerley</title> created in <date normal="162203xx">March 1622</date>.<lb>
See <ref target="SHE-2056">no. 2056 below</ref> for letter containing description of this meeting.</p></note>
<physdesc><extent>Paper. 1f.</extent></physdesc>
<unitloc loctype="container">114/5-2</unitloc></did></c04></c03>

A typical, if rather full, description of the sort of legal document which any repository will hold by the thousand,⁽¹⁹⁾ the above fragment shows how EAD analyzes the descriptive information. The hierarchical <c__> elements can be seen embracing the description, a lower level <c04> held within the <c03> describing the piece that it was attached to. Several attributes are being used: each item is accorded a unique id in the <unitid> element that contains the unit's number, and the <ref> element that links to another item uses just such an id as an anchor for referring to it. The <unitid>, along with <unitdate>, the physical description of the item <physdesc> - here containing an element <extent> to narrow the scope of the description still further - and the <unitloc>, which gives the position of the item in the repository's filing system, all provide descriptive markup specific to the item, while the more general procedural markup - <p> and <lb> for paragraphs and new lines - influences the layout of the document. The various authority elements (names and subjects - <persname>, <geogname>, <subject>) have been given "authfilenumber" attributes, which would provide cross-references to, for example, a series of databases of people, places and subjects.
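How a program might follow the nesting of the <c__> elements, and resolve a <ref> by means of an id, can be sketched in Python. Several assumptions are made: the fragment is a cut-down imitation of the sort of description discussed above, written with every end-tag present so that Python's standard XML parser (which knows nothing of SGML or of the EAD DTD) can read it; the <ref> has been pointed at an id that exists within the fragment; and the function name outline is invented for the purpose.

```python
import re
import xml.etree.ElementTree as ET

# Cut-down imitation of a nested description; target altered to stay in the fragment.
FRAGMENT = """
<c03 level="item">
  <did><unitid id="SHE-156">156.</unitid></did>
  <c04 level="item">
    <did><unitid id="SHE-156a">156. (a)</unitid>
      <note><p>See <ref target="SHE-156">no. 156 above</ref>.</p></note>
    </did>
  </c04>
</c03>
"""

def outline(component, depth=0):
    """Return (depth, unit number) for a component and all components nested in it."""
    rows = [(depth, component.findtext("did/unitid"))]
    for child in component:
        if re.fullmatch(r"c\d\d", child.tag):  # c01, c02, c03 ...
            rows.extend(outline(child, depth + 1))
    return rows

root = ET.fromstring(FRAGMENT)
rows = outline(root)
for depth, number in rows:
    print("  " * depth + number)

# Index every element that carries an id, then follow each <ref> to its target.
ids = {element.get("id"): element for element in root.iter() if element.get("id")}
links = [(ref.text, ids[ref.get("target")].text) for ref in root.iter("ref")]
print(links)
```

The indentation of the printed outline comes entirely from the nesting of the elements, which is precisely the contextual information that is lost when each item becomes a separate database record.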

The presentation of the above fragment of SGML depends upon the definitions established in a stylesheet file, which with a little careful thought can be used to give a uniform appearance to a well constructed list.⁽²⁰⁾

As the data is now in a structured format, styles can be applied to parts of the structure. Any element can be given certain features, such as a different font, highlight or preceding characters or spacing, or simply a new line to follow. These can in turn be modified in relation to where the element appears (first/last occurrence, or within another element) or when certain attributes of the element are used. It is possible to create multiple stylesheets, with which to present the EAD SGML file in different roles, perhaps a large font for screen display and a smaller one for printing, or concealing portions of the finding aid that are still being completed.
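The separation of structure from presentation can be imitated very simply in Python. The rules in STYLES are hypothetical and are not the syntax of any real stylesheet language: each element name is merely given some text to be placed before and after its contents, with asterisks standing in for bold. The fragment is a cut-down description with all end-tags present so that the standard XML parser can read it.

```python
import xml.etree.ElementTree as ET

# Hypothetical style rules: element name -> (text before, text after).
STYLES = {
    "unitid": ("**", "**"),   # asterisks stand in for bold
    "unitdate": ("\t", ""),   # preceded by a tab
    "physdesc": ("\n", ""),   # starts a new line
    "unitloc": ("\n[", "]"),  # new line, within square brackets
}

def render(element, styles):
    """Flatten an element to text, applying any style rule for each element met."""
    before, after = styles.get(element.tag, ("", ""))
    parts = [before, element.text or ""]
    for child in element:
        parts.append(render(child, styles))
        parts.append(child.tail or "")
    parts.append(after)
    return "".join(parts)

FRAGMENT = (
    '<did><unitid>156.</unitid><unitdate>4 April 17 Chas. II [1665]</unitdate>'
    '<physdesc><extent>Parchment. 1m.</extent></physdesc>'
    '<unitloc loctype="container">114/5-1</unitloc></did>'
)

did = ET.fromstring(FRAGMENT)
styled = render(did, STYLES)
unstyled = render(did, {})
print(styled)
print(unstyled)
```

The same data file produces both results; only the table of rules differs, which is the sense in which several stylesheets can serve one finding aid.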

If the following styles were attached to elements in a stylesheet:⁽²¹⁾

<unitid> - format as bold; <unitdate> - preceded by a tab; <p> - new paragraph; <item> - new line, with indentation; <lb> - new line; <physdesc> - new line; <title>, with attribute render set to "italic", format in italics; <unitloc> - new line, display within square brackets; <ref> (a hypertext link), format as underlined. All other elements are set to have no effect on formatting. <c__> elements are indented relative to their parents, and to be preceded by a blank line.

This gives the following, on screen or in print:-

156. 4 April 17 Chas. II [1665]
(1) George Shepperd of the town and county of Newcastle upon Tine, gent.
(2) Anne Carr (née Franks) of South Sheiles in the county of Durham, widow.
Lease by (1) to (2) of his half part of the messuage in Pockerley in the county of Durham with its collieries and coalmines, and a fulling mill.
Term: 1 month from 31 March 1665.
Consideration: £10.
Signed: (1). Seal: red wax, papered, on parchment tag.
Parchment. 1m.
[114/5-1]

156. (a) 14 April 17 Chas. II [1665]
Attached to 156:
Minutes of consultation with "cousin Nan" about above agreement.
Refers to a book of surveys called The Book of Pockerley created in March 1622.
See no. 2056 below for letter containing description of this meeting.
Paper. 1f.
[114/5-2]
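The style rules listed above can be imitated with a small lookup table, which may make the mechanism easier to picture. The Python below is a rough sketch and not how any particular SGML browser works: the fragment is a shortened, invented version of the description, its attribute values are hypothetical, <lb> is written as <lb/> so that an XML parser accepts it, and plain-text markers stand in for real formatting (** for bold, * for italics, _ for underlining).

```python
import re
import xml.etree.ElementTree as ET

# Shortened, hypothetical fragment (attribute values are invented).
FRAGMENT = (
    '<c03><did><unitid id="d156">156.</unitid>'
    '<unitdate>4 April 1665</unitdate>'
    '<physdesc>Parchment. <extent>1m.</extent></physdesc>'
    '<unitloc>114/5-1</unitloc></did>'
    '<c04><did><unitid id="d156a">156. (a)</unitid>'
    '<unitdate>14 April 1665</unitdate></did>'
    '<p>Refers to <title render="italic">The Book of Pockerley</title>.<lb/>'
    'See <ref target="d2056">no. 2056</ref> below.</p></c04></c03>'
)

# The "stylesheet": element name -> (text placed before, text placed after).
STYLES = {
    "unitid": ("**", "**"),      # bold
    "unitdate": ("\t", ""),      # preceded by a tab
    "p": ("\n", ""),             # new paragraph
    "item": ("\n    ", ""),      # new line, with indentation
    "lb": ("\n", ""),            # new line
    "physdesc": ("\n", ""),      # new line
    "unitloc": ("\n[", "]"),     # new line, within square brackets
    "ref": ("_", "_"),           # underlined
}
COMPONENT = re.compile(r"c\d\d")  # <c01>, <c02>, ... component elements
INDENT = "  "


def style_for(el):
    """Pick a style, allowing an attribute to modify the rule for <title>."""
    if el.tag == "title" and el.get("render") == "italic":
        return ("*", "*")
    return STYLES.get(el.tag, ("", ""))  # all other elements: no effect


def render(el):
    """Render an element and its children according to STYLES."""
    before, after = style_for(el)
    body = el.text or ""
    for child in el:
        body += render(child) + (child.tail or "")
    out = before + body + after
    if COMPONENT.fullmatch(el.tag):
        # Components are preceded by a blank line and indented one step;
        # nesting accumulates, so a <c04> sits further in than its <c03>.
        lines = out.split("\n")
        out = "\n\n" + "\n".join(INDENT + ln if ln else ln for ln in lines)
    return out


print(render(ET.fromstring(FRAGMENT)))
```

Swapping in a different STYLES table, say one for print and one for the screen, changes the presentation without touching the marked-up text at all.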

If this example appears too laborious, it is merely intended to demonstrate some of the more elaborate mechanisms that the EAD can manage that could be of use in archival description, should they be required.⁽²²⁾

It should also be borne in mind that the listing of archival collections has always been time-consuming and laborious, so the extra effort of working in SGML will probably take only a small proportion of the time allotted to sorting, deciphering and interpreting documents. What this approach does offer is the chance to have a live catalogue in place quite quickly: first, a collection level description can be created using just the preliminary elements, to produce a minimally conformant ISAD(G) account of the holding; as work progresses, a scheme can then be introduced that breaks the whole into its constituent sections using the component elements. The finding aid may well remain at this stage, but at any later point some or all of these components can be analysed down to item level without requiring a new start.
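The staged approach just described can be illustrated with a short Python sketch. It is offered only as a picture of the principle: the element names (<archdesc>, <did>, <unittitle>, <c01>, <c02>) are assumed to follow the EAD pattern, while the collection title, reference codes and the helper function add_component are invented for the example.

```python
import xml.etree.ElementTree as ET


def add_component(parent, level, unitid, title):
    """Hypothetical helper: append a <cNN> component with a minimal <did>."""
    component = ET.SubElement(parent, f"c{level:02d}")
    did = ET.SubElement(component, "did")
    ET.SubElement(did, "unitid").text = unitid
    ET.SubElement(did, "unittitle").text = title
    return component


# Stage 1: a collection level description only (invented title).
archdesc = ET.Element("archdesc", level="collection")
did = ET.SubElement(archdesc, "did")
ET.SubElement(did, "unittitle").text = "Example family papers"
stage1 = ET.tostring(archdesc, encoding="unicode")

# Stage 2: break the whole into its constituent sections.
deeds = add_component(archdesc, 1, "A", "Title deeds")
letters = add_component(archdesc, 1, "B", "Correspondence")

# Stage 3: take one section down to item level; nothing above is rewritten.
add_component(deeds, 2, "A/1", "Lease, 1665")

print(ET.tostring(archdesc, encoding="unicode"))
```

The text produced at the first stage survives unchanged as the opening of the final file, and the correspondence section can be left at the second stage for as long as necessary.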

SGML can provide a viable long term cataloguing tool for archivists. Using the EAD DTD is not, of course, the only way to take advantage of this opportunity, but it is the most obvious. The difficult side of SGML, the basic DTD design, has already been dealt with (which means a far lower level of SGML knowledge is required of the cataloguer), and with the Library of Congress supporting EAD as a standard, data created in this format will become more familiar to the users.⁽²³⁾ If EAD still fails to do exactly what you wish, and the designers cannot be prevailed upon to include some element vital to the peculiarities of your listing practice, then SGML allows the declaration of new elements at the start of any document. This will, of course, mean divergence from an adopted standard, but it does give the opportunity to be very nearly conformant but still flexible.

By storing handlists in SGML, it should prove possible to generate versions of them in any medium required, while the data format is stable and not dependent upon the future vagaries of computer development. The header part of the SGML file enables a proper control to be kept upon the evolution of the document, so it is possible to combine the stability of published data with the freedom to amend and augment at will, thus overcoming the usual dilemmas of printing whereby a handlist is no sooner produced than it is rendered obsolete by the discovery of error or new information. These factors are to the advantage of the repository when creating lists, and would occur with any careful implementation of SGML. When a common DTD is used the benefits are far greater, as it should foster more standard listing methods (while permitting flexibility if required) as both repositories and researchers begin to realise the benefits of a shared SGML environment. These advantages include the potential to use hypertext to link material scattered between many sites, or from handlist to transcription to image (an option that would be prohibitively expensive at present, but which may gradually evolve in an environment of long term stability such as SGML). Above all, the SGML handlist is available: once it is mounted on a web server it distributes itself on demand and it can be read from anywhere with access to the Internet; alternatively it can be despatched on a floppy disk; the file can then be printed out as hard copy or browsed on a computer. As the quantity of online archival listing increases, the need for some structure with which to organise that information will become apparent. The present use of SGML to control large scale data resources shows that when this happens, material listed using the EAD format will already contain the required metadata.

Further information

Liora Alschuler, ABCD ... SGML A user's guide to structured information (Boston MA, 1995)

M Bryan, SGML: an author's guide to the standard generalized markup language (Reading MA, 1988)

C Goldfarb, The SGML handbook (Oxford, 1990)

E van Herwijnen, Practical SGML (Dordrecht, 1990)

[Arriving too late for inclusion in the printed article, but a book which deals with SGML as a web medium is Rubinsky, Y. and Murray, M. SGML on the Web. Small steps beyond HTML (1997)]

Of the above books, Bryan and van Herwijnen give practical explanations of how to write books in SGML. Goldfarb, the deviser of SGML, gives a complete text, with commentary, of the ISO standard and is far more technical. Alschuler gives a series of studies of how SGML is being used. The major resource for information about SGML is the Internet. In view of the fact that the easiest way to access this is via a web page, those who can are pointed to http://www.dur.ac.uk/Library/asc/eadsgml.html where I have put the relevant links. For those with e-mail only, the EAD mailing list can be joined in the usual way by sending

"subscribe EAD yourname" to LISTSERV@LOC.GOV (where yourname is of course what you want the server to call you).

RLG (Research Libraries Group, an international consortium of research libraries, archives and educational establishments) have organised some training courses in using EAD. Although most take place in the USA, one has been given in London. They provide the most useful introduction to getting started, and have so far been organised by Anne Van Camp - e-mail BL.AHV@RLG.ORG

1. Microsoft's word processing software Word has been available for DOS, Macintosh and Windows: although the program appears the same, not only can you not load an ordinary Word file saved on a Macintosh into a PC version of Word, but MS Word 5.5 for DOS will crash if run on MS Windows 95, experto crede.

2. While most computer systems treat the basic alphanumeric characters in similar, predictable ways, the extended character set (accented or non-Roman symbols) will often misbehave, as English users in search of a pound sign can testify.

3. These are three different developments of a client: Lynx is a minimal implementation, that allows most terminals to view the text provided by the server, while Netscape and Mosaic require more powerful machines and allow viewing of text, graphics and the full panoply of digital multimedia.

4. Z39.50 is relatively clear as acronyms go: it is an ANSI (American National Standards Institute) standard, the fiftieth produced by the Z39 committee working on bibliographic standards for that institute.

5. This need not be an actual database, as any structured collection of data will suffice.

6. The other benefit of a standard approach was that user skills could be learned that would be applicable with any employer.

7. Products such as MODES have been designed to cope with hierarchical description, but it is not a process that integrates easily with a database methodology, as will quickly become obvious to a designer working with a standard generic database package.

8. For the background, and a full annotated text of the standard, see The SGML Handbook, Charles Goldfarb (Oxford, 1990). As a full explanation of the language it is, however, rather daunting for those who just want to use it.

9. They have produced an adaptable suite of DTD fragments with which to encode most types of material studied in the Humanities. The manual for using this, Guidelines for Electronic Text Encoding and Interchange (TEI P3), published in June 1994, fills 1,300 pages.

10. For accounts of the history of this project, which began in 1993, see the pointers on the web page cited in the bibliography.

11. An element can also be empty, for example to act as an anchor to call an entity (such as a photograph) into the document.

12. This holds true in nearly all cases. The two exceptions are for the occasional element that marks a processing event without any content, such as a new line, or to call into the document an external file such as a graphic. It is clear in both these cases that the elements will not contain any document text.

13. By convention elements are contained within angle brackets, and take whatever name is assigned to them by the DTD. As with all parts of SGML grammar, it is possible to substitute another character to perform this function. To make it more obvious here, all markup has been rendered in bold, although it is only ordinary ASCII text.

14. This consists of the alphanumeric roman characters, and some punctuation and control characters (more or less those characters actually represented on the keyboard, and those such as the paragraph marks which a word-processing package can be set to reveal). As these are either shared by different systems (PC and UNIX) or are stable in their own environments, they can be relied upon. Changing the country settings on a PC will demonstrate that different countries have different sequences of extended characters (those with accents, etc).

15. For all that manufacturers vaunt their products' ability to convert between formats, something is nearly always lost in that process, often without the user being aware of it.

16. Business archivists may well find in the future that SGML documents are deposited with them. Much legal material is already being held in SGML.

17. ISAD(G): General International Standard Archival Description (ICA: Ottawa, 1994).

18. This fictitious description has been marked up, probably to a more thorough extent than would ordinarily be done. There are more general elements for many of these specific tags, but the use of elements in the example is clearer by being more detailed. It should be noted that American devisers of the EAD DTD follow the pattern for archival description laid down in Steven L. Hensen, Archives, personal papers, and manuscripts: a cataloguing manual for archival repositories, historical societies, and manuscript libraries (Chicago: Society of American Archivists, 1989 [2nd ed.]). That the EAD works with several different listing methods implies a successful interpretation of the underlying structure of finding aids.

19. This description, while of one sort of document, should demonstrate that letters or bound volumes of items such as diaries or ledgers can be produced. Collections of other items, such as photographs, can also be accommodated, with or without digital images of the items. Transcriptions of items, should they be available, can be linked to the description via standard SGML links.

20. Conversely, this becomes a very good test for the inconsistent use of markup, as it will produce obviously incongruous formatting.

21. SGML applications use a separate stylesheet, which defines how to present the text: there can be one for many files, or several for each file (perhaps one to generate printout and another for online viewing).

22. Very little of the tagging given above is required, although the basic structure should be given in at least a skeletal form.

23. The EAD DTD is at present in the beta stage of testing, which means that while it has broadly reached the final state, it is still being amended.