Published in TEXT Technology, 4.2 (Summer 1994), 90-92.

The Electronic Texts We Want and Need

by

Eric Johnson

For those of us who are interested in text analysis, all electronic texts can be useful, of course, and we are glad to have them in almost any form, but some formats are preferable to others. We who use electronic texts should state clearly what we want, go out of our way to praise the best formats, and urge the producers of the others to reconsider their choices.

Electronic texts (sometimes called "machine-readable" texts) are commonly distributed by commercial and non-commercial producers in three forms. First, most electronic texts seem to be disseminated as plain vanilla ASCII files (PVASCII) that are formatted much like the printed texts (although the lines are often longer and hyphenation is avoided at the ends of lines). If you load a PVASCII file into your word processor, you see simply the text -- and that is all that the file contains. Second, recently, electronic texts have been produced as ASCII files that also include information about the texts; such information is contained in Standard Generalized Markup Language encoding (SGML). These files can also be loaded into a word processor, but in addition to the author's text, you will see codes (enclosed in angle brackets) and special references (preceded by an ampersand). The electronic editions of the works of Austen and Coleridge reviewed in "Column One" following are SGML files. Third, some electronic texts are distributed in proprietary formats that can be used only with proprietary software. It will probably be impossible to make sense of such files if they are loaded into a word processor.
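The visible differences among the three forms can be turned into a rough test. The following Python fragment is only a sketch: the function name guess_form and its three labels are invented for illustration, and the test is a heuristic (angle-bracket codes and ampersand references suggest SGML; bytes outside the ASCII range suggest something a word processor will not display sensibly). It does not validate SGML, and a plain text that happens to contain an angle-bracketed word would be misjudged.

```python
import re

# Angle-bracket codes such as <title> or </p>, or references such as &eacute;
MARKUP = re.compile(r"<[A-Za-z/!][^>]*>|&[A-Za-z][A-Za-z0-9]*;")

def guess_form(data: bytes) -> str:
    """Guess which of the three forms a file's contents resemble (heuristic)."""
    try:
        text = data.decode("ascii")
    except UnicodeDecodeError:
        # Bytes outside the ASCII range: possibly a proprietary format.
        return "proprietary or non-ASCII"
    if MARKUP.search(text):
        return "SGML-like"
    return "plain ASCII"

print(guess_form(b"It is a truth universally acknowledged"))
print(guess_form(b"<title>Emma</title>"))
```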

Of the three forms, SGML electronic texts are definitely the most valuable. The Text Encoding Initiative (TEI) has issued guidelines for standard encoding, and, if the guidelines are followed, text users and software developers know exactly what to expect from SGML texts. Such texts may be encoded with tags that simply indicate the titles and the start and end of parts of a work (such as chapters, or individual poems in a collection), or they may contain more complex encoding that identifies the homographic forms of words or identifies each speaker of dialogue in a novel. SGML texts of plays are particularly useful since they distinguish speaker tags, stage directions, and other notes from the lines of characters' speeches. In addition, it is common for SGML texts to contain references that can be converted into special characters (for example, accented letters) appropriate to the type of computer used (Macintosh or IBM). The production of even a simple form of SGML texts is time-consuming and exacting work, but such texts are worth a good deal to the researcher and the teacher. Publishers of SGML texts, such as Oxford University Press, should be praised, and they should be supported by the purchase of the electronic texts and by the purchase of the printed volumes that correspond to the electronic versions. Since the use of electronic texts is still in its relative infancy, commercial producers may have to be patient about recovering the costs of creating SGML texts.
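To suggest why the encoding of plays matters for analysis, here is a minimal Python sketch that counts the words each character speaks while leaving out speaker tags and stage directions. It assumes element names of the kind the TEI uses for drama (sp, speaker, stage, l) and assumes that every end tag is written out, which SGML does not require; the two-speech sample and the function name words_per_speaker are invented for illustration. A real SGML processor would be the proper tool for real texts.

```python
import re
from collections import Counter

SPEECH = re.compile(r"<sp>(.*?)</sp>", re.DOTALL)
SPEAKER = re.compile(r"<speaker>(.*?)</speaker>", re.DOTALL)
STAGE = re.compile(r"<stage>.*?</stage>", re.DOTALL)
TAG = re.compile(r"<[^>]*>")

def words_per_speaker(play: str) -> Counter:
    """Count spoken words per speaker, ignoring stage directions."""
    counts = Counter()
    for speech in SPEECH.findall(play):
        name = SPEAKER.search(speech)
        if name is None:
            continue  # a speech with no speaker tag is skipped
        body = SPEAKER.sub("", speech)   # drop the speaker tag and its content
        body = STAGE.sub("", body)       # drop stage directions inside the speech
        body = TAG.sub(" ", body)        # drop any remaining tags, keep the words
        counts[name.group(1).strip()] += len(body.split())
    return counts

# Invented sample, not taken from any real play.
sample = """
<sp><speaker>ANNA</speaker>
<l>Where is the letter <stage>she searches the desk</stage> I left here?</l></sp>
<stage>Enter BEN</stage>
<sp><speaker>BEN</speaker>
<l>I have it.</l></sp>
"""
counts = words_per_speaker(sample)
print(counts)
```

With a PVASCII text of the same play, the program would have no reliable way to tell a speaker's name or a stage direction from the words spoken.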

PVASCII texts are the easiest of the three kinds to create, but their usefulness is limited. Sophisticated textual analysis requires information about the text, and PVASCII texts do not provide anything but the text. (Occasionally we receive a file that is more or less a plain vanilla text but that also contains references to line or page numbers -- usually not in SGML format.) Often, PVASCII texts do not contain any indication of their source. Sometimes it is said that PVASCII files are frequently inaccurate, but without a source to compare them against, it is difficult to know. It is said that there are people who systematically remove the markup from SGML texts, and then distribute the results as PVASCII texts. Such a practice seems almost shocking: all the care and effort of encoding the texts is thrown away. It would be far, far better to distribute an SGML text along with a computer program that will remove the encoding; then users could have it both ways.
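A program that removes the encoding need not be elaborate. The Python sketch below deletes anything in angle brackets and converts a few ampersand references into characters. The function name strip_markup and the small table of references are assumptions made for illustration; a real text would declare its own set of references, and this simple pattern matching does not handle every SGML construct (comments and marked sections, for example).

```python
import re

# Hypothetical, deliberately small table of references; extend as a text requires.
ENTITIES = {"eacute": "é", "egrave": "è", "amp": "&"}

TAG = re.compile(r"<[^>]*>")
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def strip_markup(sgml_text: str) -> str:
    """Return the text with tags removed and known references converted."""
    without_tags = TAG.sub("", sgml_text)
    # References not in the table are left as they are rather than guessed at.
    return ENTITY.sub(lambda m: ENTITIES.get(m.group(1), m.group(0)), without_tags)

# Invented sample text.
sample = "<poem><title>Caf&eacute; Song</title>\n<l>Salt &amp; smoke</l></poem>"
print(strip_markup(sample))
```

Because the encoded file is kept and the plain version is generated from it, nothing is lost: those who want only the words can have them, and those who need the encoding still have it.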

Text files produced with proprietary formats often contain the kinds of information found in SGML files, and they are sometimes shipped with special indexes. The problem with texts in proprietary formats is that they can be used only with the proprietary software designed especially for them. If that software does not do what we want done, the texts may be useless. Some companies are responsive to users' entreaties, and they will create new software if it is requested, but there will be significant delays, and it may be impossible to obtain additional software if a company has gone out of business. There is so little flexibility in the use of proprietary files that their production should probably be discouraged. Creators of proprietary texts should be urged to convert them to SGML files.

We users of electronic texts should make it known that we want and need SGML files (particularly those that follow the guidelines of the TEI). We want texts that can be used with a variety of commercial and custom-made SGML processors. We want encoding that provides us with as much information about the text as possible; certainly we want it to indicate the titles and parts of a text. We need to know the source of the text. If SGML texts are not available, we can, naturally, do some work with PVASCII texts, and we may be extremely glad to have them, but they are not our first choice.


Eric Johnson is Editor of TEXT Technology; he may be contacted at JohnsonE@dsuvax.dsu.edu
