The Lays of Ancient ROM


[ There follows the text of an article published in the British weekly The Economist, issue dated 27 August 1994, which I think many on this list will find interesting and encouraging. LB ]

The Lays of Ancient ROM

Databases are transforming scholarship in the most conservative corners of the academy, forcing technological choices even on to the humanities

IN 1987 a budding classicist from the University of Lausanne finished four years of labour. She had spent them scouring ancient Greek tomes searching for the classical sources of 2,000 anonymous fragments of medieval text. Then, just when she was getting down to writing her doctoral dissertation, all that effort was eclipsed. In a few dozen hours working with a new database she found every one of her 600 hard-won sources again -- and 300 more that had passed her by.

That database, the Thesaurus Linguae Graecae, was the first of the tools that are transforming the staid world of what used to be bookish learning. When computers were mere calculating machines, only the sciences had need of them. Now that they can be easily used to scan vast memories at inhuman speeds, the humanities have every reason to catch up. Whole libraries are vanishing into the digital domain, where their contents can be analysed exhaustively. The changes in the practice of scholarship may be greater than any since Gutenberg.

The process seems now to have been inevitable, but even the inevitable has to start somewhere. In the 1970s a group of classicists at the University of California, Irvine, thought up a then-extraordinary goal: having every extant word of ancient Greek literature in a single database; 3,000 authors, 66m words, all searchable, accessible and printable. With the help of nearby computer companies, this idea became the Thesaurus Linguae Graecae. There are now 1,400 places around the world where a classicist can use it to do a lifetime's worth of scanning for allusions or collecting references just for a single essay. On compact disc, the whole thesaurus costs about $300.

Scholars using the growing electronic-text archives at such places as Oxford University, the University of Virginia and Rutgers University, New Jersey, have more than the classics to play with. There are at least five different competing software bibles, some with parallel texts that provide Greek and Hebrew, and several with full concordances and indexing. Shakespeare's words have been done as well as God's; indeed, the entire canon of British poetry prior to this century is now digitised. So is Aquinas. Wittgenstein's unpublished fragments -- 20,000 pages of them -- are expected soon; so are 221 volumes of Migne's Patrologia Latina, a collection of theological writings that date from between 200 AD and the Council of Florence in 1439.

Some of this is the work of governments and charities: half the $7m needed for the Thesaurus Linguae Graecae came from America's National Endowment for the Humanities, the other half from foundations and private patrons. Some of it is done for profit, just like traditional publishing. The English Poetry Full-Text Database (EPFTD), released in June on four compact discs by Chadwyck Healey, a company in Cambridge, England, costs £30,000 ($46,500). It took four years to assemble from roughly 4,500 volumes of verse; it is easy to use, and is poised to become an indispensable research tool. Chadwyck Healey says it has sold more than 100 copies. The company is now working on an index to the entire runs of more than 2,000 scholarly journals.

Typing in every word of a nation's literary heritage is a time-consuming and expensive task, even when the work is exported to take advantage of cheap labour in Asia, as it almost always is. Another approach is to record the appearance of books and other writings, rather than their contents, by scanning in images of them. The Archive of the Indies in Seville has used IBM scanners and software to put near-perfect facsimiles of the letters of Columbus, Cortes and their contemporaries on to the screen. Years of scanning by a full-time staff of 30 people have put more than 10m handwritten pages -- one-seventh of the total -- into the archive's memory banks.

The computers store the pages as images, not text, so they cannot be searched and compared in the way that the EPFTD can. They offer scholars other compensations, though. Scanners like those originally designed for medical imaging provide extremely detailed and subtle digitisation. This can then be fed through image-enhancement software, so ancient smudges and ink-spills can be filtered out. And since the users cannot damage the copies as they might the originals, humble students can have access to documents previously available only to a handful of elite scholars. The Seville project is proving so successful that IBM and El Corte Ingles, a big Spanish retailer, have founded a company to market the techniques used. Half-a-dozen ventures are already under way, including a proposal to digitise the gargantuan (and recently opened) Comintern archives in Moscow.

Logos and log-ons

It is possible to combine the image of a page with searchable electronic text, simply by having both stored in the same system with cross-references. A Chaucer archive that offers multiple manuscripts and searchable texts is being released one Canterbury tale at a time ("The Wife of Bath" comes first). Of course, putting both together costs even more than a straight text database, and those can be pretty expensive. The EPFTD works out at a quite reasonable $10-or-so per volume -- but that still makes it a pricey proposition when bought, as it must be, all at once. The cost of such commercially compiled databases worries some scholars, not to mention librarians. It is not their only worry.
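The cross-referencing idea can be pictured with a small sketch. It assumes a hypothetical layout in which each page record holds both a facsimile file name and a transcription, so that a word search over the text leads back to the matching images. The record fields, file names and sample sentences are all invented for illustration; this is not how the Chaucer archive or any real system is known to work.

```python
from dataclasses import dataclass


@dataclass
class Page:
    page_id: str     # hypothetical identifier for one manuscript page
    image_file: str  # hypothetical file name of the scanned facsimile
    text: str        # transcription of the same page


def build_index(pages):
    """Map each lower-cased word to the ids of the pages that contain it."""
    index = {}
    for page in pages:
        for raw in page.text.lower().split():
            word = raw.strip(".,;:!?\"'()")
            if word:
                index.setdefault(word, set()).add(page.page_id)
    return index


def images_for(word, pages, index):
    """Follow the cross-reference from a searched word to facsimile files."""
    by_id = {page.page_id: page for page in pages}
    hits = index.get(word.lower(), set())
    return sorted(by_id[page_id].image_file for page_id in hits)


# Invented sample data: two manuscripts of the same opening, plus a later page.
PAGES = [
    Page("ms-a-f1", "ms_a_folio_1.png", "Here begins the tale of the pilgrim."),
    Page("ms-b-f1", "ms_b_folio_1.png", "Here beginneth the tale, as the pilgrim told it."),
    Page("ms-a-f2", "ms_a_folio_2.png", "The road was long and the inn was far."),
]
INDEX = build_index(PAGES)
pilgrim_images = images_for("pilgrim", PAGES, INDEX)
```

Here the search runs over the text alone, while the images are merely pointed to. That is the sense in which the two are "stored in the same system with cross-references".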

The difference between a printed page and the text it contains is not just one of aesthetics; there can be meaning in the way typefaces are chosen, in how pages are laid out, in the indentations before lines and the gaps between them. There are data on the title page that apply to the whole text. A good database has to encode all this information somehow, and has to offer ways in which it can be used in searches.

That is why databases have "mark-up languages", which allow the text and the spaces within it to be tagged with particular meanings. Mark-up languages tell the computer, for example, that a title is a title and a footnote is a footnote; the computer can then display them as such, with typefaces to taste, and the interested user can search the text for titles and footnotes. The more complex the search, the more extensive the mark-up required. The mark-up for the EPFTD allows the computer to identify things like stanzas, verses, dates and names.
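As a rough illustration of what such tagging buys, the sketch below marks up a short passage and then searches it by tag. The tag names (title, footnote, stanza, line) and the sample verse are invented for this example and are not taken from the EPFTD or any real scheme. The syntax is XML, used here as a simplified stand-in for SGML because Python's standard library can parse it.

```python
import xml.etree.ElementTree as ET

# Hypothetical marked-up passage; the tag names are illustrative only.
SAMPLE = """
<text>
  <title>An Invented Ode</title>
  <stanza n="1">
    <line>The scanner hums above the page,</line>
    <line>And ink becomes a stream of bits.<footnote>An editor's aside.</footnote></line>
  </stanza>
  <stanza n="2">
    <line>No dust shall settle on this verse.</line>
  </stanza>
</text>
"""


def find_tagged(document, tag):
    """Return the text that opens every element named `tag`, in document order."""
    root = ET.fromstring(document.strip())
    return [(element.text or "").strip() for element in root.iter(tag)]


titles = find_tagged(SAMPLE, "title")
footnotes = find_tagged(SAMPLE, "footnote")
lines = find_tagged(SAMPLE, "line")
stanza_count = len(find_tagged(SAMPLE, "stanza"))
```

Because the footnote is tagged separately, a search for lines returns the verse without the editor's aside, and a search for footnotes returns the aside without the verse. An untagged transcription could not make that distinction.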

In a perfect world mark-ups would be neutral and descriptive, capable of applying equally well to almost any texts. In practice individual mark-up languages have sprung up like mushrooms. There is now a move to concentrate on using the Standard Generalised Mark-up Language (SGML) to define the codes that tag text. It was developed at IBM for lawyers, and adopted by the Pentagon for its mountains of manuals. At present SGML is probably the most widely used mark-up language; officially, it is an international standard. But it is not necessarily ideal for academics, who are aware that the way a text is marked up will have far-reaching implications for the kind of research that is possible. Marking up is an invisible act of interpretation; the scholars want the interpreting left to them.

That is why so much effort has gone into a specific way of using SGML for prose, verse, drama and other forms of text that are pored over by scholars. The Text Encoding Initiative is the sort of huge multinational research effort that nuclear physicists are used to but that scholars in the humanities can still be shocked by. After six years of work, supported financially by the American government, the European Union and others, the TEI published its guidelines in May -- all 1,300 pages of them. More than 100 TEI scholars have had to decide everything from how poetry should be distinguished from prose to whether footnotes to footnotes are admissible in conforming texts. Their peers seem happy with the work.

Standardised formats will enable electronic texts to move on-line. That will make them available from any computer hooked up to a telephone line, not just from a dedicated terminal devoted to a single database and nothing else. That is good for the far-flung; the University of Dubrovnik, its library destroyed, has just been given a networked computer terminal that puts it on-line to a host of foreign databases. It is also good for the independent researcher. Texts will be freed from academia's grip, just as books before them were freed from the church and the wealthy by printing.

More research; different research, too. Speculative hypotheses about influence or style will be rigorously testable by textual comparisons as cheap and plentiful as the numerical calculations in a computer model of the weather. Critics still raise the spectre of great literature passing under the die-stamp of conformity, but some degree of conformity may be a price of new forms of access. The first die-stamp for literature was a printing press. The passing of the illuminated manuscript made the world a slightly poorer place; the coming of print made it a far, far richer one.