The OASIS Cover Pages: The Online Resource for Markup Language Technologies
Last modified: March 19, 2005
SGML/XML: Academic Applications. Contents.

TEI: Text Encoding Initiative

[CR: 20010413] [Table of Contents]


[June 30, 1999] On the XML version of the TEI DTD and TEI events after 1999-06, see "Text Encoding Initiative (TEI) - XML for TEI Lite."

The TEI (Text Encoding Initiative) has developed an SGML encoding for a wide range of document types in the domain of humanities computing. The Text Encoding Initiative is an international research project sponsored by the Association for Computing in the Humanities (ACH), the Association for Literary and Linguistic Computing (ALLC), and the Association for Computational Linguistics (ACL). Funding has been provided in part by the US National Endowment for the Humanities, Directorate XIII of the Commission of the European Communities, the Andrew W. Mellon Foundation, and the Social Science and Humanities Research Council of Canada. The TEI ("P3") Guidelines were published in May 1994, after six years of development involving many hundreds of scholars from different academic disciplines worldwide. They are available in print copy, in searchable/linked format on CDROM (see also the announcement), or on the Internet in plain text format. An overview of the TEI's origins and goals is given in "Text Encoding for Information Interchange. An Introduction to the Text Encoding Initiative" (TEI Document no. TEI J31, by Lou Burnard, July 1995). See also: "An Introduction to the Text Encoding Initiative" (TEI EDW26, by Lou Burnard), available on the OTA FTP server, from the UICVM Listserver, or from the SIL WWW server.

The authoritative FTP site for TEI DTDs, Writing System declarations, and documentation is TEI: FTP to UIC. The TEI P3 DTDs are also stored on the OTA FTP server. Encoding guidelines have been published as Guidelines for Electronic Text Encoding and Interchange. TEI P3, May 1994, edited by Michael Sperberg-McQueen and Lou Burnard. See the bibliographic reference for full details. The current draft of the Guidelines is thus sometimes identified as "P3" ("P2" and "P1" represent earlier drafts). Instructions for ordering the P3 Guidelines are available here, or by requesting the file 'P3ORDER DOC' from the UICVM LISTSERVer, using syntax described below. The TEI FTP server also contains a number of resources relating to the production and maintenance of the TEI Guidelines: TEI working papers, TEI organization & personnel, TEI introductions & tutorials, TEILITE introduction & DTDs, Model Editions Partnership (a TEI application), Explanation of TEI tagset for producing TEI P3, Information on TEI P1, P2, and P3, TEI & SGML resources, etc. Information on the literate programming style used to produce the TEI P3 DTDs and documentation is found in the subdirectory; [local archive copy].

Chapter 2 of the TEI Guidelines, "A Gentle Introduction to SGML," is one of the best SGML introductions. It is available from the UIC TEI Web server or from Oxford. It was translated into Russian by Boris Tobotras: HTML or SGML format, [local archive copy].

Canonical files for the Guidelines and other TEI research documents may also be obtained through the official mail server: To get a complete file listing of TEI materials, send the command INDEX TEI-L in the body of an email message to the LISTSERVer at this address. To subscribe to the TEI-L discussion forum, send the command SUBSCRIBE TEI-L YOUR-NAME to the LISTSERVer (where 'YOUR-NAME' is your personal name).
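As a rough illustration of the mail-server commands described above, the following Python sketch builds (but does not send) an email message whose body carries a single LISTSERV command. The function name and both addresses are hypothetical placeholders introduced for this example; the actual LISTSERVer address is not reproduced here and must be supplied by the reader.

```python
from email.message import EmailMessage


def build_listserv_request(command, listserv_address, sender):
    """Build an email whose body is a single LISTSERV command.

    The addresses are supplied by the caller; nothing is sent.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = listserv_address
    # The command goes in the message body, as the instructions above say.
    msg.set_content(command)
    return msg


# Placeholder addresses (hypothetical); substitute the real ones.
index_request = build_listserv_request(
    "INDEX TEI-L", "listserv@example.org", "reader@example.org")
subscribe_request = build_listserv_request(
    "SUBSCRIBE TEI-L Jane Reader", "listserv@example.org", "reader@example.org")

print(index_request.get_content().strip())
```

Keeping message construction separate from delivery means the command text can be checked before anything is handed to a mail client or an SMTP library.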

Courses and seminars on TEI encoding are offered periodically at various universities. For 1997, see the announcement from Lou Burnard for TESS: The Text Encoding Summer School, sponsored by The Humanities Computing Unit at Oxford. The course will be held at Oxford University, 8 - 11 July, 1997.

Conference entry for the TEI 10th Anniversary User Conference, November 14 - 16, 1997. Brown University, Providence, Rhode Island, USA. See: General Information

TEI and the MLA. See the announcement from Charles Faulhaber (University of California, Berkeley) for the publication of the MLA (Modern Language Association of America) draft Guidelines for Electronic Scholarly Editions. Highlights from the draft: "B. Encoding norms. It is preferable to use the implementation of Standard Generalized Markup Language (SGML) specifically devised for coding electronic texts, the Text Encoding Initiative (TEI). The choice of an alternate standard should be fully justified and explained. C. The text itself should be essentially self-describing, which means that the computer file which embodies it should contain a header with essential meta-data. The Guidelines for Electronic Text Encoding and Interchange (TEI P3), edited by C.M. Sperberg-McQueen and Lou Burnard (1994) offer detailed descriptions of the sorts of information that should be provided for the source document as well as the electronic text itself." See: Guidelines for Electronic Scholarly Editions; [archive copy, August 12, 1997]

TEI Monographs and Journal Special Issues. Several monographs and journal special issues have been dedicated to the Text Encoding Initiative's encoding guidelines. See, for example:

  • See [the bibliographic reference for]: Ide, Nancy; Véronis, Jean, (editors, with a volume preface by Charles F. Goldfarb and volume bibliography by Robin C. Cover). The Text Encoding Initiative: Background and Context. Dordrecht, Netherlands: Kluwer Academic Publishers, [August] 1995. Extent: vi + 242 pages. ISBN: 0-7923-3689-5 (hardbound); 0-7923-3704-2 (paperback). Also published as a three-part special issue of CHUM.
  • TEXT Technology special issue, edited by . Details: Electronic Texts and the Text Encoding Initiative. A Special Issue [5.3] of TEXT Technology: The Journal of Computer Text Processing. Madison, SD: College of Liberal Arts, Dakota State University, [George M. and Merrill D. Hunter Electronic Publishing Center], Autumn, 1995. ISSN: 1053-900X.
  • Announcement from Nancy Ide for a special issue of Cahiers GUTenberg dedicated to the Text Encoding Initiative. Number 24 of Cahiers GUTenberg is a 251-page issue containing eleven articles on TEI, all in French. The full text of the issue is now available at the following web site: Table of Contents, local archive copy.
  • Bibliographic entry for the volume edited by Daniel I. Greenstein. Modelling Historical Data: Towards a Standard for Encoding and Exchanging Machine-Readable Texts. Halbgraue Reihe zur Historischen Fachinformatik, Serie A, Historische Quellenkunden, edited by Manfred Thaller, Band (A) 11. St. Katharinen: [Published for the Max-Planck-Institut für Geschichte, Göttingen by] Scripta Mercaturae Verlag, 1991. Extent: iv + 223 pages. ISBN: 3-928134-45-0.

TEI Lite

TEI Lite is a subset of the full TEI DTD. Researchers who may have been put off initially by the TEI's elaborate use of parameter entities and driver files (necessary to initialize a full TEI setup) should return now [July 1995] to have a look at the much simpler TEI Lite. It is a "small but usable subset of the TEI main DTD" that avoids some of the complexities of the full TEI DTD. The documentation for TEI Lite (in SGML and HTML format) is superb, making the TEI accessible to a much wider audience. Whereas the official P3 reference manual weighs in at 1200 pages, TEI Lite is nicely presented in a 200K HTML document.

From the document which describes TEI Lite, cited below: "This document provides an introduction to the recommendations of the Text Encoding Initiative (TEI), by describing a manageable subset of the full TEI encoding scheme. The scheme documented here can be used to encode a wide variety of commonly encountered textual features, in such a way as to maximize the usability of electronic transcriptions and to facilitate their interchange among scholars using different computer systems. It is also fully compatible with the full TEI scheme, as defined by TEI document P3, Guidelines for Electronic Text Encoding and Interchange, published in Chicago and Oxford in May 1994."

TEI: Primary WWW/FTP Sites

[CR: 19970207]

Useful links and prettier views of TEI documents

TEI DTD and SoftQuad's Author/Editor

[CR: 19970617]

For handling of the TEI DTD subsets referenced by parameter entities:

TEI (Lite) DTD and Panorama [version 1.5], with Netscape

TEI and Other Software

[CR: 19981216]

In addition to the comments on TEI DTD configuration with specific software products (above), note the following TEI support tools and facilities:

  • "Ebenezer's software suite for TEI." See the announcement from Kevin Russell (Linguistics, University of Manitoba) for a package of files and installation instructions for editing TEI documents with Emacs and PSGML. See the URL: The package, entitled "Ebenezer's software suite for TEI," includes "the program files for Emacs, Lennart Staflin's PSGML package, James Clark's Jade engine and SP parser, the official files for the TEI (DTDs, entity files, WSDs), the catalogue files for making all of the above run hopefully transparently, and a short tutorial."
  • [October 09, 1998] Apropos of managing DTD fragments, designing modularized DTDs, DTD subsetting, namespaces, (etc.), readers will be interested to survey Lou Burnard's Web page entitled The Pizza Chef: a TEI Tag Set Selector, recently referenced in an announcement. Lou Burnard (European editor for the Text Encoding Initiative Guidelines) has created the tool to help users design their own TEI-conformant document type definition. The TEI DTD itself is very large, but its modular construction and heavy use of 'classes' (defined in parameter entities) allow the user to select desired tag sets for a project and thus 'make up their very own view of the TEI DTD, including their own modifications and restrictions.' The Pizza Chef tool "allows you to select the TEI tagsets you want from a menu, and also to pick out individual elements for inclusion, exclusion, or modification. You can then download a customized DTD subset, or a completely compiled (i.e., non parameterized) DTD for use by e.g., SoftQuad's Rulesbuilder." Another strategy for subsetting large (complex, overly-general) DTDs uses architectural processing; see the abstract for the paper to be presented by Gary Simons at the November Markup Technologies '98 Conference
  • [December 09, 1997] TEItools from Boris Tobotras, as described in a posting to TEI-L. "TEItools denotes my collection of scripts for transforming documents written in SGML to various output format. I'm in process of writing it now, and currently it is able to produce HTML, LaTeX2e, RTF, PS and PDF." See also the TEItools user guide (under development) and the local database entry for TEItools.
  • Possibly useful SGML Open CATALOG for use with the TEI DTD and psgml/emacs [nsgmls], supplied by David Birnbaum. In this connection, note the URL for the TEI DTD public identifiers (FPIs), and the documentation.
  • tei2latex, tei2html: [October 23, 1997] Announcement from Jean-Daniel Fekete (Ecole des Mines de Nantes) for the availability of TEI2LATEX and TEI2HTML version 0.2: "Two Perl5 Programs to Translate TEI Lite Documents into LaTeX2e and HTML documents." TEI2HTML can now split a TEI Lite document into several linked HTML subdocuments. See the main entry for tei2latex: TEILITE to LaTeX2e, or FTP: [Previously: Announcement from Jean-Daniel Fekete (Universite de Paris-Sud) for tei2latex version 0.1: "tei2latex is a Perl5 Program to Translate TEI Lite Documents into LaTeX2e documents..." See also announcement on TEI site.]
  • FTP tei2latex (version 0.1c, July 25, 1996); [archive copy]
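The tag set selection described above for the Pizza Chef rests on the TEI DTD's parameter entities: a document switches on the tag sets it wants by declaring the matching entities as INCLUDE in its DOCTYPE subset. The Python sketch below generates such a declaration. The public identifier, the system identifier "tei2.dtd", and the entity names TEI.prose and TEI.linking follow the P3 convention as best recalled; treat them as assumptions to check against the distributed DTD files. The function name is hypothetical.

```python
# Assumed P3 public identifier; verify against the TEI distribution.
PUBLIC_ID = "-//TEI P3//DTD Main Document Type//EN"


def tei_doctype(tagsets, system_id="tei2.dtd"):
    """Return a DOCTYPE declaration that switches on the named TEI tag sets.

    Each tag set is enabled by declaring its parameter entity as INCLUDE
    in the document type declaration subset.
    """
    lines = [f'<!DOCTYPE TEI.2 PUBLIC "{PUBLIC_ID}" "{system_id}" [']
    for name in tagsets:
        lines.append(f"<!ENTITY % {name} 'INCLUDE'>")
    lines.append("]>")
    return "\n".join(lines)


# Example: the prose base tag set plus the additional linking tag set.
doctype = tei_doctype(["TEI.prose", "TEI.linking"])
print(doctype)
```

A generated declaration of this kind is the hand-written counterpart of a customized DTD subset; a fully compiled, non-parameterized DTD would not need the entity declarations at all.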

Oxford Text Archive (OTA)

[CR: 19961022] [Table of Contents]

For some twenty years the Oxford Text Archive has been collecting electronic texts, and has sponsored extensive research involving the use of SGML in an academic setting. The Archive is "a facility provided by Oxford University Computing Services and forms part of the Humanities Computing Unit . . . serving the interests of the academic community by providing low-cost archival and dissemination facilities for electronic texts."

"The Archive contains electronic versions of literary works by many major authors in Greek, Latin, English and a dozen or more other languages. It contains collections and corpora of unpublished materials prepared by field workers in linguistics. It contains electronic versions of some standard reference works. It has copies of texts and corpora prepared by individual scholars and major research projects worldwide. The total size of the Archive exceeds a gigabyte and there are over 2000 titles in its catalogue." [from the "General Information" page]

"All texts which are publicly available from the Archive's FTP server are first converted to a standard format. This format conforms to the recommendations of the Text Encoding Initiative (TEI), and is therefore an application of ISO 8879, Standard Generalized Mark Up Language (SGML)." A catalog of electronic texts in the Archive is available in SGML format. OTA is also the authoritative FTP site for a significant corpus of literary texts encoded in (TEI) SGML by members of the Oxford Text Archive project and by others. An October 1996 snapshot of the OTA file listings provided here illustrates the range of texts in SGML format available to the academic public via FTP (compare the snapshot of December 2, 1994).


The Oxford Text Archive
Oxford University Computing Services
13 Banbury Road
Oxford OX2 6NN
Tel: +44 01865 273238
FAX: +44 01865 273275

University of Virginia Electronic Text Center

[CR: 19980104] [Table of Contents]

The University of Virginia has pioneered a number of highly successful uses of (TEI) SGML in delivering online electronic texts, including structured-text searches. "Since 1992, the Electronic Text Center at the University of Virginia has combined an on-line archive of thousands of SGML-encoded electronic texts (some of which are publicly available) with a library-based Center housing hardware and software suitable for the creation and analysis of text. Through ongoing training sessions and support of individual teaching and research projects, the Center is building a diverse user community locally, and providing a model for similar enterprises at other institutions." The Text Center, in cooperation with the Bibliographical Society of the University of Virginia, is making Studies in Bibliography [On-Line] freely accessible on the Internet, based upon TEI-SGML encoding of the (ca. 1000) articles.

Many of the texts in UVA's Electronic Text Center are indexed with Open Text's PAT search engine. Some of the materials are available only to UVA's institutional members (OED 2nd edition, English Poetry Full-Text Database, Patrologia Latina, Old English Corpus, Shakespeare, French and Latin Collections). Other online texts are searchable by any researchers via the Internet, including a Middle English corpus (see bibliography), Michigan Early Modern Materials, King James Bible, and others.

The Institute for Advanced Technology in the Humanities at the University of Virginia in Charlottesville uses SGML in many of its text projects, and has developed some SGML (aware) software in this connection.

Contact address:

David Seaman, Coordinator        804-924-3230 (phone)
Electronic Text Center           804-924-1431 (fax)
Alderman Library                 email:
University of Virginia 
Charlottesville, Virginia 22903

The Electronic Archive of Early American Fiction (UVA)

[CR: 19980105] [Table of Contents]

The University of Virginia Library has received a grant from the Andrew W. Mellon Foundation for $400,000 for a two-year project (1996-1998) involving digitizing and delivering electronic texts of rare books. "Two versions of each text will be made available: a TEI-conformant SGML-tagged text and color images of the pages of the first editions--a total of 118,000 pages. The project will conclude in 1998 with an economic study of usage of the e-texts compared with usage of the original rare books."

The encoding uses TEI/SGML: "As the texts are created, standard SGML markup is added to record the physical and structural characteristics of the text: title-page layout, pagination, paragraphs, verse lines, italics, accented letters, etc." The larger project goal is "to create electronic texts of rare books and to compare the usage and costs of electronic texts and of original paper texts of rare books. As part of the study, 582 first editions of the most important novels and short stories will be digitized and put on the World Wide Web. . . The project will focus on e-texts of a well-defined and comprehensive collection of early American fiction derived from the two standard bibliographies of American fiction. Specific outcomes expected from the project are: (1) electronic texts and images on the World Wide Web of 582 seminal volumes in early American literature; (2) a model process, exportable to other libraries, for creating e-texts of rare books; (3) measurement and analysis of usage and costs of the e-texts and of the originals on which they are based; (4) two written reports: (a) presenting this project as a model for the creation of images and SGML-tagged ASCII texts of rare books in research libraries; (b) on the usage and costs of e-texts of rare books; (5) presentations of the results of this project at national or international conferences."


David Seaman
Tel: +1 (804) 924-3230
[See the main UVA entry for other address details]

University of Michigan - Humanities Text Initiative (HTI)

[CR: 19980821] [Table of Contents]

"The Humanities Text Initiative (HTI) is a project of the University of Michigan Libraries, the UM Press, and the School of Library and Information Studies, with support from the College of Literature, Science & Arts. Special thanks to ITD for providing bridge equipment. The HTI is responsible for creating and maintaining new textual collections, primarily in SGML. The initial focus of the project will be in Middle English materials and American verse, as well as recent publications of the UM Press. The HTI is also available to assist faculty and students using SGML and in particular the Text Encoding Initiative Guidelines for publishing. For more information or assistance, send e-mail to or call 761-4760." [from the HTI Home Page]

Under the direction of John Price-Wilkin, the Humanities Text Initiative at the University of Michigan is developing a set of online text resources, some of which employ SGML encoding as a basis for search and retrieval. Currently available to the public via WWW browser (as an interface to PAT): TEI Guidelines for Electronic Text Encoding and Interchange (P3), Middle English Collection, Revised Standard Version of the Bible, King James Version of the Bible, Michigan Early Modern English Works (16 MB of SGML-tagged text). The center also supports other reference material restricted to the University of Michigan: OED 2nd edition, English Poetry Database, Old English Corpus, Migne's Patrologia Latina, Modern English Works. Many of the tools mirror (functionally) the resources at UVA, previously developed by John Price-Wilkin.

Texts structured in SGML are searchable by the PAT "SGML" software (from Open Text), and user interfaces to PAT are provided on the Internet using WWW forms and line-mode access. Short segments of text in a hit list reveal the SGML tags, but linking to the full text from the concordance (hit list) presents the document in a formatted appearance. Very subtle queries are possible: proximity specifiers, an extended set of relational and Boolean operators, context-unit specifiers (for collocations), etc.

The HTI American Verse Project

[CR: 19980115] [Table of Contents]

Summary: "The American Verse Project is a collaboration between the University of Michigan Humanities Text Initiative (HTI) and the University of Michigan Press. The project is assembling an electronic archive of volumes of American verse. Most of the archive is made up of 19th century poetry, although a few early 20th century texts are included. The full text of each volume is being converted into digital form and coded in Standard Generalized Mark-up Language (SGML) using the TEI Guidelines. . . The collection is made accessible in SGML, dynamically rendered HTML, and as a searchable database. As with all of the other Humanities Text Initiative resources, simple word and phrase searches are supported, as well as proximity searches, and searches for verses or paragraphs containing two or three phrases. The project uses an unusual model for rights for a project involving a University Press: most uses are without practical restrictions and cost, but the texts are available for sale to other publishers and agencies who wish to provide access to the texts from their own system."
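To make the idea of a proximity search concrete, here is a minimal Python sketch that reports where two words occur within a given number of words of each other. It is an illustration only and does not reflect PAT's query syntax or implementation; the function name and the `max_gap` parameter are hypothetical, and the sample line is chosen simply as a familiar piece of 19th century American verse.

```python
import re


def proximity_hits(text, first, second, max_gap=5):
    """Return (i, j) word-position pairs where `first` and `second`
    occur within `max_gap` words of each other (case-insensitive)."""
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    first_pos = [i for i, w in enumerate(words) if w == first.lower()]
    second_pos = [j for j, w in enumerate(words) if w == second.lower()]
    return [(i, j) for i in first_pos for j in second_pos
            if i != j and abs(i - j) <= max_gap]


line = "Because I could not stop for Death, he kindly stopped for me"
print(proximity_hits(line, "death", "kindly", max_gap=3))  # [(6, 8)]
```

A production search engine would work from a prebuilt index rather than rescanning the text, but the comparison of word positions is the same in principle.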

". . .second goal of the project is to provide a service to scholars by advancing their ability to use Web documents in their work. Currently, the Internet does not have well-established mechanisms for authors seeking to integrate complete texts, or parts of texts, into their scholarship. The TEI Guidelines provide clearly defined ways of linking from one SGML document to portions of another; however, no one has yet set up a Web server to accept this sort of linking. The HTI proposes to explore this as part of the American Verse project. This will allow, for example, someone writing about Dickinson to embed links in his or her electronic text pointing the reader to various poems, stanzas, or lines from volumes that are part of the project without having to replicate the material within his or her own document as is currently the case. The evidence of scholarship would remain in this central archival server, rather than be replicated on a number of different scholars' machines."


HTI - Middle English Compendium

[CR: 19980821] [Table of Contents]

The Middle English Compendium is a project under the direction of the University of Michigan Digital Library Production Service, funded by a grant from the National Endowment for the Humanities. "The Compendium provides access to and interconnectivity among three resources: an electronic version of the Middle English Dictionary, a HyperBibliography of Middle English prose and verse based on the MED bibliographies, and a full-text Corpus of Middle English Prose and Verse. The MED and the Corpus are encoded in SGML using the Text Encoding Initiative Guidelines. The first installment (currently online) includes 1,073 HyperBibliography entries covering 1,526 copies of Middle English texts, 15,940 MED entries covering M-U (more than one-third of the projected complete print MED), and 42 searchable texts in the Corpus."


Making of America (MOA) Project - University of Michigan and Cornell University

[CR: 20010309] [Table of Contents]

"Making of America (MOA) is a digital library of primary sources in American social history from the antebellum period through reconstruction. The collection is particularly strong in the subject areas of education, psychology, American history, sociology, religion, and science and technology. The collection contains approximately 1,600 books and 50,000 journal articles with 19th century imprints. The project represents a major collaborative endeavor in preservation and electronic access to historical texts." [from the Home Page]

The MOA supports SGML-based Access Systems: "We hope that users of the system will appreciate some of the functionality developed through UM's nearly eight years of experience with deploying SGML-based access and delivery systems. Attractive, easily navigated displays of results showing the number of occurrences per page are combined with displays of the page image, circumventing many of the problems encountered when relying on OCR alone. As we have opportunities to "clean up" and more richly encode OCR'd texts, the system will begin to show dynamically-rendered HTML with links to the page images. The mechanisms used for the MOA system will be provided to participants in the UM's SGML Server Program." [from the announcement].


Making of America Project
University of Michigan Digital Library
Email: John Price-Wilkin

Model Editions Partnership: Historical Editions in the Digital Age

[CR: 19990902] [Table of Contents]

Project Description: "The Model Editions Partnership is a consortium of seven historical editions which has joined forces with leaders of the Text Encoding Initiative and the Center for Electronic Text in the Humanities. The participants are now developing a prospectus setting forth editorial guidelines for publishing historical documents in electronic form. Later they will create a series of SGML demonstration models."

"Electronic editions should use standard non-proprietary formats (markup) for the representation of text, images, and other material. Standard formats, such as SGML for example, are essential if editions are to remain usable despite rapid changes in computer hardware and software. Publicly-controlled standards are essential if editions are to be used with a wide variety of hardware and software. International and national standards issued by recognized standards bodies should be preferred to de facto standards because such organizations guarantee standards based on a consensus of all interested parties. At the current time, this means use of a markup design like the Text Encoding Initiative Guidelines formulated under the Standard Generalized Markup Language architecture, adopted in 1986 by the Organization for International Standardization (ISO 8879). Relevant standards for images and other material have yet to be selected for the Partnership models." [from the Prospectus]

"Text encoded under the Standard Generalized Markup Language has become the de facto standard for creating electronic text. We will use the Text Encoding Initiative's markup to create an SGML archive for samples from each edition. From the archive, we will create both CD-ROM and Internet models." [from the Work Plan]

[September 02, 1999] As of September 1999, the MEP Web site hosted seven mini-editions. "Four of the experimental mini-editions are based on full-text searchable document transcriptions; two are based on document images; and one is based on both images and text." These include: (1) Documentary History of the First Federal Congress, (2) Documentary History of the Ratification of the Constitution and the Bill of Rights, (3) Papers of Henry Laurens, (4) Abraham Lincoln Legal Papers, (5) Papers of General Nathanael Greene, (6) Margaret Sanger Papers, (7) Papers of Elizabeth Cady Stanton and Susan B. Anthony. "The DynaText and Dynaweb software from Inso has been used to present the mini-editions; this software allows users to construct powerful searches or to use a series of built-in search forms. The mini-editions can be searched using the full range of standard search tools -- wildcards, proximity searching and Boolean searching." DynaText also "has built-in support for search of tagged documents with hierarchical structures, such as HTML and XML. By permitting searches of words and phrases inside particular tags, as well as words in documents, DynaText allows users to efficiently target their searches, resulting in more relevant, focused matches."
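The notion of searching for words "inside particular tags" can be sketched in a few lines of Python using the standard library's XML parser. This is not how DynaText works internally; it only illustrates why structural markup makes targeted searching possible. The sample document, its element names, and the function name are invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented sample; the element names are illustrative only.
SAMPLE = """<text>
  <head>Letter to the Congress</head>
  <p>The Congress met in New York.</p>
  <note>Congress is abbreviated in the manuscript.</note>
</text>"""


def find_in_element(xml_text, tag, word):
    """Return the text of every <tag> element whose content contains
    `word` (case-insensitive substring match)."""
    root = ET.fromstring(xml_text)
    hits = []
    for elem in root.iter(tag):
        content = "".join(elem.itertext())
        if word.lower() in content.lower():
            hits.append(content)
    return hits


# Only the paragraph matches; the heading and the note are ignored.
print(find_in_element(SAMPLE, "p", "congress"))
```

The same word appears in three elements of the sample, so restricting the search to one element type is what narrows the result to a single hit.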


American Memory Project, Library of Congress

[CR: 19970806] [Table of Contents]

"American Memory consists of collections of primary source and archival material relating to American culture and history. These historical collections are the Library of Congress's key contribution to the national digital library. Most of these offerings are from the unparalleled special collections of the Library of Congress."

"The elements in each historical collection include digital reproductions of items, a finding aid, and various accompaniments. The finding aid may consist of a catalog (a database of bibliographic records) or take the form of a register (a hierarchical listing or directory)."

The principal standard for text encoding in the American Memory project is SGML, sometimes in TEI-SGML. See the Library of Congress - EAD Finding Aid Pilot Project main entry for other technical information, or "American Memory pilot--seed of a universally available Library".

Woman Suffrage Collection

One of the collections of American Memory is the Woman Suffrage Collection. "The NAWSA collection consists of 165 books, pamphlets and other artifacts documenting the suffrage campaign. They are a subset of the Library's larger collection donated by Carrie Chapman Catt, longtime president of the National American Woman Suffrage Association, in November of 1938. The collection includes works from the libraries of other members and officers of the organization including: Elizabeth Cady Stanton, Susan B. Anthony, Lucy Stone, Alice Stone Blackwell, Julia Ward Howe, Elizabeth Smith Miller, Mary A. Livermore."

Texts are prepared in SGML. See Woman Suffrage Collection: Technical Note on Texts [mirror]: "This full text collection provides researchers with an SGML-encoded (Standard Generalized Markup Language) version of the full text in addition to an HTML-encoded version of the same text. . .Images of the pages and illustrations can be accessed by a viewer launched from Panorama."

WPA Life Histories

Another collection of American Memory is: Life History Manuscripts from the Folklore Project, WPA Federal Writers' Project, 1936 - 1940. "These life histories were written by the staff of the Folklore Project of the Federal Writers' Project for the U.S. Works Progress (later Work Projects) Administration (WPA) from 1936-1940. The Library of Congress collection includes 2,900 documents representing the work of over 300 writers from 24 states. Typically 2,000-15,000 words in length, the documents consist of drafts and revisions, varying in form from narrative to dialogue to report to case history. The histories describe the informant's family education, income, occupation, political views, religion and mores, medical needs, diet and miscellaneous observations. Pseudonyms are often substituted for individuals and places named in the narrative texts."

The texts are encoded in SGML, as explained in WPA Life Histories--Editor's and Technical Notes [mirror copy]: "When initially transcribed, these texts were marked up in Standard Generalized Markup Language (SGML). The American Memory SGML markup scheme conforms to the guidelines of the Text Encoding Initiative (TEI), the work of a consortium of scholarly institutions. Since this Internet presentation employs the conventions of the World Wide Web, the SGML markup has been simplified and reprocessed to create documents in HyperText Markup Language (HTML). In the final version, SGML markup will be utilized. Interested persons may obtain the American Memory SGML document type definition (DTD) and related information by file transfer protocol (ftp) from the Library of Congress server."
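The kind of down-conversion mentioned in the note above, from richer structural markup to simpler HTML, can be pictured with a small Python sketch. It does not reproduce the Library of Congress process or the American Memory DTD; the element names, the tag mapping, and the sample text are all hypothetical, and the input is assumed to be well-formed XML rather than full SGML.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical mapping from a few TEI-style element names to HTML tags.
TAG_MAP = {"text": "div", "head": "h1", "p": "p", "hi": "em"}


def to_html(elem):
    """Recursively render an element as HTML, mapping unknown tags to <span>."""
    tag = TAG_MAP.get(elem.tag, "span")
    parts = [f"<{tag}>", escape(elem.text or "")]
    for child in elem:
        parts.append(to_html(child))
        # Text that follows the child inside the parent element.
        parts.append(escape(child.tail or ""))
    parts.append(f"</{tag}>")
    return "".join(parts)


sample = ("<text><head>Life History</head>"
          "<p>She spoke <hi>plainly</hi> of work.</p></text>")
print(to_html(ET.fromstring(sample)))
```

Because several source elements can collapse onto one HTML tag, a conversion of this kind loses information, which is one reason to keep the structurally encoded text as the archival form.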

African American Pamphlets

A third collection using SGML encoding is African-American Pamphlets from the Daniel A. P. Murray Collection, 1880-1920, Rare Book and Special Collections Division, Library of Congress. "The Daniel A. P. Murray Pamphlet Collection presents a panoramic and eclectic review of African-American history and culture, spanning almost one hundred years from the early nineteenth through the early twentieth centuries, with the bulk of the material published between 1875 and 1900. Among the authors represented are Frederick Douglass, Booker T. Washington, Ida B. Wells-Barnett, Benjamin W. Arnett, Alexander Crummel, and Emanuel Love."

The document African American Pamphlets: Technical Note on Texts explains the use of SGML in the text encoding.

Links for American Memory

Brown University Scholarly Technology Group (STG)

[CR: 19970902] [Table of Contents]

Under the guidance of Allen Renear (Director), the Brown University Scholarly Technology Group (STG) "supports the development and use of advanced information technology in academic research, teaching, and scholarly communication. STG pursues this mission by exploring new technologies and practices, developing specialized tools and techniques, and providing consulting and project management services to academic projects. STG focuses on four related areas: (1) educational applications of hypertext and hypermedia; (2) SGML textbase development; (3) networked scholarly communication; (4) electronic curriculum and collaborative learning environments."

STG's SGML Textbase Development is an example of the technology focus: "STG is committed to open, high-function standards for data representation. Most important among these are SGML (the Standard Generalized Markup Language, a meta-grammar for developing encoding systems for textual data), and two SGML-based encoding systems: HTML (Hypertext Markup Language, used in World Wide Web) and TEI (Text Encoding Initiative Guidelines). Among STG's consultants are internationally active experts in SGML and TEI, and one of its affiliated projects, the Women Writers Project, is among the world's leading SGML/TEI databases."


Scholarly Technology Group
Computing and Information Services
Box 1885
Brown University
Providence, RI 02912
Tel: 401-863-7312
Fax: 401-863-9313

The Brown University Women Writers Project

[CR: 19990329] [Table of Contents]

The Women Writers Project is creating a full-text database of women's writing in English from the period 1330-1830. Texts are encoded in TEI SGML, as explained in the following excerpt from the online overview. "The WWP is developing its encoding system in close cooperation with the international Text Encoding Initiative, of which it is a leading affiliated project. Members of the WWP participate in TEI activities in various ways and participate in research on text encoding and computing methodology. The use of the TEI encoding guidelines ensures not only a very high level of encoding sophistication and sensitivity to scholarly needs, but also, because the TEI Guidelines conform to international standards (namely ISO 8879:1986 SGML), the resulting WWP textbase is entirely free of hardware and software dependencies. Creating this textbase and developing derived products also involves the WWP in related research and scholarship on the application of information technology to humanities research and teaching -- particularly literary text encoding, textbase development, computer-based publishing and textual editing, and computer-supported collaborative work (CSCW) in the humanities."

[March 29, 1999] In March 1999, Julia Flanders of the Brown University Women Writers Project posted an announcement indicating that the WWP textbase is now freely available online in a beta-test version. The Women Writers Project textbase "is a collection of pre-Victorian women's writing in English. The initial publication will include over 200 texts from the period 1450-1830, with 50-100 more being added in the first year. The texts cover a huge range of genres and topics, and represent an unparalleled resource for the study of women's writing and history, and of English literature generally." Features of the system include: "(1) The texts are richly encoded in SGML, using the full TEI Guidelines. The transcription preserves the text of the original document in full, including all front and back matter, with original pagination, typography, spelling, and rendition. Title pages, signatures, catchwords, and other bibliographic details are transcribed in full. (2) The textbase will be published over the web using Inso's DynaWeb software, giving the user full access to the SGML tagging for searching and navigation. (3) Varied style sheets will allow the user to view the text with its original typography and errors intact, or in a corrected and regularized form. (4) Users may search the entire textbase or individual texts for words and phrases, either on their own or within specified contexts, using the SGML markup. Users may also search for sets of texts which meet certain criteria such as date, genre, place of publication, and so forth. (5) The primary source material will be accompanied by topic essays and biographical information for each author."
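
Feature (3) above, one transcription viewed either with its original spelling or in regularized form, can be illustrated with a minimal sketch. It assumes a TEI-style `orig` element that carries its regularized reading in a `reg` attribute; the sample sentence, the function name `render`, and the use of an XML parser on a well-formed fragment are illustrative assumptions, not the WWP's actual encoding or the DynaWeb style sheet mechanism.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: each <orig> keeps the source spelling as content
# and the regularized spelling in its reg attribute.
SAMPLE = (
    '<p>She <orig reg="loved">lov\'d</orig> the '
    '<orig reg="public">publick</orig> good.</p>'
)

def render(fragment: str, regularize: bool) -> str:
    """Return the text of a fragment in original or regularized spelling."""
    root = ET.fromstring(fragment)
    parts = [root.text or ""]
    for child in root:
        if child.tag == "orig" and regularize:
            # Use the regularized reading; fall back to the original text.
            parts.append(child.get("reg", child.text or ""))
        else:
            parts.append("".join(child.itertext()))
        # Text that follows the child element, up to the next element.
        parts.append(child.tail or "")
    return "".join(parts)

if __name__ == "__main__":
    print(render(SAMPLE, regularize=False))
    print(render(SAMPLE, regularize=True))
```

Both views come from the same encoded source, which is the point of the style-sheet approach: the choice is made at display time and the transcription itself is never altered.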


Women Writers Project
Box 1841
Brown University
Providence, RI 02912
Tel: +1 (401) 863-3619
FAX: +1 (401) 863-9313

Midrash Pirqe Rabbi Eliezer Electronic Text Editing Project

[CR: 19970428] [Table of Contents]

[Site under construction, by editor Lewis Barth]

The project addresses ". . .the process of creating a manual for encoding an electronic edition of Pirqe Rabbi Eliezer (Pirqe R. El.), the Chapters of Rabbi Eliezer. Pirqe R. El. is a midrashic retelling of significant aspects of the biblical narrative, from the creation story through the Book of Esther. . . The initial goal of this project was to create a critical edition of Pirqe Rabbi Eliezer. The goal has now expanded to include electronic publication of all Pirqe R. El. manuscripts and fragments in two forms: digital facsimiles and transcriptions with hypertext links. There are two reasons for this: 1) the quantity of textual material and 2) recent hypotheses regarding the development of medieval Hebrew manuscripts which argue that each manuscript of a work is a completely new literary creation. . .[We conclude] SGML/TEI markup is particularly useful for scripturally based text, i.e., texts from the vast literatures of Judaism, Christianity and Islam which frequently cite biblical or koranic verses. There are numerous genres in these religious literatures (exegetical works, homilies, scriptural essays, dialogues, legal texts, liturgical texts, religious poetry, etc.). They all have in common the citation of texts sacred to a religious community, the frequent mention of characters, places and institutions found in such texts, plus references to later individuals, places and institutions. In addition, these texts are often macaronic, i.e., they contain more than one human language."[extract of ACH paper, below; provisional]


Pirqe Rabbi Eliezer Electronic Text Editing Project
Attention: Lewis M. Barth
Hebrew Union College - Jewish Institute of Religion
3077 University Avenue
Los Angeles, California 90007-3796
Office: (213) 749-3424
Office FAX: (213) 749-1192

University of Cincinnati College of Law, Center for Electronic Text in the Law

[CR: 20001002] [Table of Contents]

"CETL currently produces two text databases that can be accessed from the Internet. The first is the University of Cincinnati's portion of DIANA, a unique database of human rights materials. In cooperation with a number of other North American law school libraries, CETL offers through the DIANA database a comprehensive source of human rights documents to researchers and activists around the world and supports the work of the Urban Morgan Institute for Human Rights, an institution affiliated with the College of Law. The documents that the University of Cincinnati contributes to DIANA are Standard Generalized Markup Language (SGML) versions of United Nations human rights documents, historic United Nations material and documents from the Organization of African Unity. Putting these documents into SGML optimizes them for users' needs, because SGML allows maximum access to the information they contain in a variety of formats across computing platforms."

"The second database, the Securities Lawyer's Deskbook, provides electronic access from the Internet to the text of the Securities Act of 1933 and the Securities Exchange Act of 1934, together with the rules and forms necessary for compliance with these statutes. The existence of this database aids practitioners and scholars and supports the work of the College's Center for Corporate Law."

CETL makes use of DynaWeb (from EBT) for management and delivery of documents from an SGML database. Documents themselves use the (abridged) TEI Header for bibliographic control. By clicking on the "TEI" icon for a given document, the SGML version is sent by DynaWeb instead of the HTML version. DynaWeb is Web server software that, in addition to supporting the standard HTTPD Web server protocol, "converts DynaText electronic books stored in SGML into HTML on-the-fly for rapid navigating and searching by any Web browser. . . This product effectively shields publishers from the evolving HTML standards by allowing them to store and manage documents in the stable SGML format, and subsequently re-target the information to the latest version of HTML with minimal incremental effort."
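
The idea of storing one structured source and deriving HTML on request can be sketched in a few lines. This is only an illustration under stated assumptions: the element mapping `TEI_TO_HTML`, the functions `to_html` and `serve`, and the sample document are hypothetical, and the sketch uses Python's XML parser on a well-formed fragment. It does not reproduce how DynaWeb itself works.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from source element names to HTML element names.
# Re-targeting to a newer HTML version would mean editing only this table.
TEI_TO_HTML = {
    "text": "div",
    "head": "h2",
    "p": "p",
    "hi": "em",
    "list": "ul",
    "item": "li",
}

SAMPLE = '<text><head>Title</head><p>Some <hi rend="italic">text</hi>.</p></text>'

def to_html(source: str) -> str:
    """Convert a well-formed source fragment to HTML using the mapping."""
    root = ET.fromstring(source)
    for el in root.iter():
        # Unknown elements degrade to a neutral inline container.
        el.tag = TEI_TO_HTML.get(el.tag, "span")
        el.attrib.clear()
    return ET.tostring(root, encoding="unicode", method="html")

def serve(source: str, fmt: str = "html") -> str:
    """Return the stored source unchanged, or an HTML rendering of it."""
    return source if fmt == "tei" else to_html(source)

if __name__ == "__main__":
    print(serve(SAMPLE))         # HTML rendering derived on request
    print(serve(SAMPLE, "tei"))  # the stored markup, as with the "TEI" icon
```

Because the stored document is never rewritten, a change in the delivery format touches only the mapping table, which is the shielding effect the quoted description refers to.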

The Legal Electronic Text Consortium is associated with DIANA: it is "comprised of a number [thirteen (13) as of August 30, 1996] of academic and research law libraries whose common goal is to further the digitization of legal materials through research and cooperative development. . . Members of LETC are currently engaged in a number of cooperative projects. These include the building of the DIANA database of human rights documents and the development of legal extensions to the Text Encoding Initiative SGML document type definition."

Project addresses:
Center for Electronic Text in the Law
University of Cincinnati College of Law Library
Clifton and Calhoun Streets
P.O. Box 210142
Cincinnati, OH 45221-0142
Tel: (513) 556-0103
FAX: (513) 556-6265

British National Corpus Project (BNC)

Description 2007-11:

"The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English, both spoken and written...

The latest edition [as of 2007-11] is the BNC XML Edition, released in 2007. The written part of the BNC (90%) includes, for example, extracts from regional and national newspapers, specialist periodicals and journals for all ages and interests, academic books and popular fiction, published and unpublished letters and memoranda, school and university essays, among many other kinds of text. The spoken part (10%) includes a large amount of unscripted informal conversation, recorded by volunteers selected from different age, region and social classes in a demographically balanced way, together with spoken language collected in all kinds of different contexts, ranging from formal business or government meetings to radio shows and phone-ins.

The corpus is encoded according to the Guidelines of the Text Encoding Initiative (TEI) to represent both the output from CLAWS (automatic part-of-speech tagger) and a variety of other structural properties of texts (e.g. headings, paragraphs, lists etc.). Full classification, contextual and bibliographic information is also included with each text in the form of a TEI-conformant header.

Work on building the corpus began in 1991, and was completed in 1994. No new texts have been added after the completion of the project but the corpus was slightly revised prior to the release of the second edition BNC World (2001) and the third edition BNC XML Edition (2007). Since the completion of the project, two sub-corpora with material from the BNC have been released separately: the BNC Sampler (a general collection of one million written words, one million spoken) and the BNC Baby (four one-million word samples from four different genres)..."
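
To make the word-level encoding concrete, the following minimal sketch reads one sentence in which every word element carries a word-class code and then counts the classes. The element and attribute names (`s`, `w`, `c`, `c5`, `hw`, `pos`) and the sample sentence are assumptions modelled loosely on the XML edition; the corpus documentation remains the authority on the actual scheme.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sentence: each <w> holds one word plus its word-class codes,
# and <c> holds punctuation.
SAMPLE = (
    '<s n="1">'
    '<w c5="AT0" hw="the" pos="ART">The </w>'
    '<w c5="NN1" hw="corpus" pos="SUBST">corpus </w>'
    '<w c5="VBZ" hw="be" pos="VERB">is </w>'
    '<w c5="AJ0" hw="large" pos="ADJ">large</w>'
    '<c c5="PUN">.</c>'
    '</s>'
)

def pos_counts(sentence_xml: str) -> Counter:
    """Count the simplified word-class labels on the word elements."""
    root = ET.fromstring(sentence_xml)
    return Counter(w.get("pos") for w in root.iter("w"))

def plain_text(sentence_xml: str) -> str:
    """Recover the running text by dropping all markup."""
    return "".join(ET.fromstring(sentence_xml).itertext())

if __name__ == "__main__":
    print(plain_text(SAMPLE))
    print(pos_counts(SAMPLE))
```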

Description vintage-1995:

"... the BNC is a very large (100 million words) corpus of modern English, both spoken and written, produced by an academic/industrial consortium led by Oxford University Press, involving Longman UK Ltd, Chambers/Larousse, Oxford University Computing Services, the University of Lancaster and the British Library. Production of the corpus was funded by the commercial partners and by the UK Government, under the DTI/SERC Joint Framework for Information Technology." [...] "At the last count, the corpus contained 104 million words, totalling about 1.6 gigabytes of disk space. The corpus is automatically segmented into orthographic sentence units, and each word in the corpus is automatically assigned a word class (part of speech) code by the CLAWS software developed at the University of Lancaster. The corpus is encoded according to the TEI (Text Encoding Initiative)'s Guidelines, using the ISO standard SGML to represent this and a variety of other structural properties of texts (e.g. headings, paragraphs, lists etc.). Full classification, contextual and bibliographic information is also included with each text in the form of a TEI conformant header file." [from the FAQ, November 1994]

"The [encoding] format used by the BNC is called the Corpus Document Interchange Format (CDIF for short) and is fully documented in the CDIF Reference Manual. An article by Dominic Dunlop and Gavin Burnage titled Encoding the British National Corpus, written while the BNC was being developed, describes the scheme and its use within the project in some detail. CDIF is an application of SGML (ISO 8879: Standard Generalized Markup Language) and can therefore be used with any SGML-compliant software. SGML is a widely used international standard format for which many public domain and commercial utilities already exist; new software is also coming on the market very rapidly. CDIF is formally defined by an SGML Document Type Definition (DTD)." [from the Encoding description, March 1995]


Earlier references:

British National Corpus
Oxford University Computing Services
13 Banbury Road
Oxford OX2 6NN
TEL: +44 (1865) 273 280
FAX: +44 (1865) 273 275
Email: (Dominic Dunlop)

Linguistic Data Consortium (LDC)

[CR: 19981002] [Table of Contents]

"The Linguistic Data Consortium is an open consortium of universities, companies and government research laboratories. It creates, collects and distributes speech and text databases, lexicons, and other resources for research and development purposes. The University of Pennsylvania is the LDC's host institution. The LDC was founded in 1992 with a grant from ARPA, and is partly supported by grant IRI-9528587 from the National Science Foundation."

"The best formatting mechanism for text is Standard Generalized Markup Language (SGML); it is widely and commonly used (more so than SPHERE: the HyperText Markup Language (HTML), which is the format used throughout the World Wide Web, is actually one instance of SGML usage), it can be kept quite simple, there is free software available to support its use, and it is adaptable to a wide range of languages and uses. It includes the notion of a "Document Type Definition" (DTD), which provides a clear and complete specification of the markup used in a given collection of text. The LDC does not require that a fully functional DTD be supplied, or that the SGML tagging of a text collection be fully compliant to a given set of conventions (e.g. those developed by the Text Encoding Initiative, TEI); what is essential is that the markup be clear, consistent, and correctly applied, so that it can be "parsed" according to a finite set of rules."

Several of the text corpora included in distributions from the LDC use SGML encoding. For example, with respect to the Association for Computational Linguistics Data Collection Initiative (ACL/DCI), 620 MB: "The many formats in which the originals of these texts came have all, to one extent or another, been mapped into a markup language consistent with the SGML standard (ISO 8879). SGML provides a labelled bracketing of the text, with labels permitted to have associated feature-value pairs. Eventually, ACL/DCI will be furnished with tags conformant to the Text Encoding Initiative standards. Because of time constraints, the files in this initial release are not so conformant, and thus are likely to be re-released eventually in a conformant state. The ACL/DCI welcomes help in establishing "proper" SGML coding for all of its collection."

Or, with respect to the United Nations Parallel Text Corpus (English, French, Spanish) [Catalog number LDC93T4A; set of three compact discs]: "In preparing the text for publication, we have applied a fully-compliant SGML format (Standard Generalized Markup Language). For those researchers who use SGML, a working DTD (Document Type Definition) is provided on each disc. For those who do not need SGML markup, a simple script is included that can be used to filter out the SGML-specific material, and leave only the plain text. The character set used is the 8-bit ISO 8859-1 Latin1, in which accented letters and some other non-ASCII characters occupy the upper 128 entries of the character table."
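
The kind of filter described above, one that drops the markup and keeps the plain text, might look like the following minimal sketch. It is not the script shipped on the discs: the function name `strip_sgml`, the sample record, and the regular expression are assumptions. The pattern treats anything between `<` and `>` as a tag, so it does not handle SGML comments, marked sections, entity references, or `>` inside attribute values.

```python
import re

# Anything between angle brackets is treated as markup (a simplification).
TAG = re.compile(r"<[^>]+>")

# Hypothetical record in ISO 8859-1: byte 0xE9 is the letter e with acute.
SAMPLE = (
    b'<doc id="x1">\n'
    b"<p>Les Nations Unies ont adopt\xe9 la r\xe9solution.</p>\n"
    b"</doc>\n"
)

def strip_sgml(raw: bytes) -> str:
    """Decode Latin-1 bytes, remove tags, and drop lines left empty."""
    text = raw.decode("iso-8859-1")
    text = TAG.sub("", text)
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)

if __name__ == "__main__":
    print(strip_sgml(SAMPLE))
```

Decoding explicitly as ISO 8859-1 matters here: each accented letter is a single byte in the upper half of the character table, and reading the same bytes as UTF-8 would fail or garble them.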

With respect to the European Corpus Initiative Multilingual Corpus I [Catalog number LDC94T5]: "Most of the data is marked up in TEI-compliant SGML -- see mci.edt for discussion, and the bin and src directories for tools to assist in processing and accessing the data. The top-level file mci.sgm provides an SGML way in to the corpus as a whole, or for selected parts of it -- again see mci.edt for further instructions."

[Re: TIPSTER format] "The format uses a labelled bracketing, expressed in the style of SGML (Standard Generalized Markup Language). The SGML DTD's used for verification at NIST are included on the CDs. All five different datasets have their major structures identical for easier reading, but have different minor structures. The philosophy in the formatting both at the University of Pennsylvania and at NIST has been to preserve as much of the original structure as possible, but to provide enough consistency to allow simple decoding of the data."


  • Linguistic Data Consortium Home Page
  • LDC SGML encoding: "Through a now refined process the Language Analysis Center is able to produce a final digitized text of approximately 8,000 entries complete with SGML tags, in the span of one month. . . All dictionaries are fully compliant with the latest version of the SGML and TEI guidelines. A Document Type Definition (DTD) is used to describe the structure of tags for each dictionary. It is a fluid document and is delivered with the final version of the project."
  • [October 01, 1998] The Linguistic Data Consortium located at the University of Pennsylvania has announced the release of a new text corpus in the JURIS (Justice Department Retrieval and Inquiry System) collection, from the U.S. Department of Justice. The new two-CD-ROM JURIS set contains a total of 694,667 document units in 1664 individual text files, with text data ranging from the 1700's to the early 1990's. Examples come from Case Law, Executive Orders, Treaties and other International Agreements, Federal Regulations, Administrative Law, Department of Justice Briefs, Freedom of Information Act, Indian Law, Statutory Law, Immigration and Naturalization Law, Tax Law, etc. As with much of the LDC corpus material, these documents are structured in SGML: "The text files are all formatted using a set of SGML tags to mark document boundaries, and to mark major structural features within documents. As with file organization, the markup is derived from the document structures as provided by the Justice Department."
  • [October 01, 1998] Also released by LDC is a corpus of "1997 Mandarin Broadcast News Speech and Transcripts." These data are encoded using "SGML tagging to identify story boundaries, speaker turn boundaries, and phrasal pauses; these tags include time stamps to align the text with the speech data. Word segmentation (white-space between words) is included. A working DTD is provided, and the markup is consistent with that of the 1997 English and Spanish Hub-4 collections."
  • [April 21, 1998] Announcement from the Linguistic Data Consortium for the release of a new SGML-encoded speech corpus. The 1996 Broadcast News Speech Corpus "contains a total of 104 hours of broadcasts from ABC, CNN, and CSPAN television networks and NPR and PRI radio networks with corresponding transcripts" (including programs such as ABC Nightline, ABC World Nightly News, CNN Headline News, CSPAN Washington Journal, NPR All Things Considered, NPR Marketplace, and others). "The released version of the transcripts is in SGML format, and there is accompanying documentation, and an SGML DTD file, included with the transcription release."
  • About the Linguistic Data Consortium
  • Catalog: Speech Corpora
  • Catalog: Text Corpora
  • The LDC as publisher and distributor of speech corpora; [mirror copy]
  • Catalog: Corpora Available from The Linguistic Data Consortium
  • Sample news release: Spanish News Corpus [March 1996] "The presentation of text data in these collections is modeled on the TIPSTER corpus. Within each data file, SGML tagging is used (1) to mark article boundaries, (2) to delimit the text portion within each article, and (3) to label various pieces of information about the article that are external to the text content (e.g. headlines, bylines, and so on)."
  • FTP site:
  • Sample Announcement: "European Language Newspaper Text" ("...that has been marked using SGML - 65 million words of French"); see also the description on the LDC server
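
The three uses of tagging named in the Spanish News Corpus release above (article boundaries, the text portion, and external information such as headlines) can be sketched as a small reader. The tag names `DOC`, `DOCNO`, `HEADLINE`, and `TEXT`, the sample data, and the helper names are assumptions chosen for illustration; the DTD supplied with a given corpus defines the real tag set.

```python
import re

# Hypothetical data file holding two articles.
SAMPLE = """<DOC>
<DOCNO> SP-0001 </DOCNO>
<HEADLINE> First headline </HEADLINE>
<TEXT>
Body of the first article.
</TEXT>
</DOC>
<DOC>
<DOCNO> SP-0002 </DOCNO>
<TEXT>
Body of the second article.
</TEXT>
</DOC>
"""

# (1) Article boundaries: everything between <DOC> and </DOC>.
DOC = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)

def field(article: str, tag: str):
    """Return the stripped content of the first <tag> element, or None."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", article, re.DOTALL)
    return match.group(1).strip() if match else None

def parse_articles(data: str) -> list:
    """Split a data file into articles with their text and external fields."""
    return [
        {
            "docno": field(article, "DOCNO"),
            "headline": field(article, "HEADLINE"),  # (3) external information
            "text": field(article, "TEXT"),          # (2) the text portion
        }
        for article in DOC.findall(data)
    ]

if __name__ == "__main__":
    for record in parse_articles(SAMPLE):
        print(record)
```

A format this regular can be read with simple pattern matching, which fits the stated aim of allowing simple decoding of the data; anything with nested or omitted tags would call for a real SGML parser instead.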

Linguistic Data Consortium
3615 Market Street
Suite 200
Philadelphia, PA 19104-2608
Phone: 215-898-0464
FAX: 215-573-2175
Email: [General Information]
Email: [LDC-Online]

IATH - Institute for Advanced Technology in the Humanities, University of Virginia at Charlottesville

[CR: 20000225] [Table of Contents]

IATH (Institute for Advanced Technology in the Humanities) at the University of Virginia at Charlottesville sponsors text analysis as part of its broad goal "to explore and expand the potential of information technology as a tool for humanities research." Several IATH projects use SGML encoding in the preparation of electronic scholarly text editions, and software developed under IATH auspices has occasionally been released as well. Projects having a significant SGML emphasis have included the Rossetti Archive, the Piers Plowman database, the Walt Whitman Hypertext Archive, the Blake Archive, and others. Structured searches of SGML documents are supported for several collections, including Dante's Inferno, Blake Illuminated Books, The Greek Manumissions Project, and others.

The Institute for Advanced Technology in the Humanities has developed some SGML (aware) software in connection with its digital library projects. For example, Inote (An Image Annotation Program in Java) "can automatically identify lines or columns of text for annotation, and [the authors] are working on SGML utilities that will allow a user to connect SGML transcriptions and annotated images." "MU: Web-Based SGML Markup is a set of Perl programs which, in combination with a Web server, allow one person or a group of people to create and modify SGML files, using standard forms-capable Web browsers as the editing interface. MU supports multiple editing sessions through lock-files, and it builds its forms from simple ascii-text tag templates. MU is distributed with a sample template for the TEI-lite DTD." Also, in early 1995, IATH announced (pre-release) Babble: A Synoptic Text Viewer. "Babble, under development by Robert Bingler, is an SGML-capable synoptic text tool that can display multiple texts in parallel windows. It uses Unicode, an ISO 16-bit character set standard, which allows multilingual texts, using mixed character sets, to be displayed simultaneously. Babble also allows users to search for strings in text or in tags, and to link open texts for scrolling and searching."

IATH links:

IATH: Piers Plowman Database (Hoyt Duggan)

[CR: 19980105] [Table of Contents]

Included in the IATH archive is the Piers Plowman database, demonstrating the work of Hoyt Duggan. The Piers database uses TEI SGML in the encoding of text-critical information.

IATH: Rossetti Archive (Jerome J. McGann)

The Rossetti Archive at IATH features the writings and pictures of Rossetti encoded in SGML. "The Rossetti Archive is a hypermedia environment for studying the works of the Pre-Raphaelite poet and painter D[ante] G[abriel] Rossetti (1828-1882). The archive is a structured database holding digitized images of Rossetti's works in their original documentary forms. Rossetti's poetical manuscripts, early printed texts - including proofs and first editions - as well as his drawings and paintings are stored in the archive, in full color as needed. The materials are marked up for electronic search and analysis, and they are supplied with full scholarly annotations and notes. . . A key feature of the structure of the Archive is its SGML markup (SGML = Standard Generalized Markup Language). This is a formal marking scheme that establishes a set of conceptual categories of information that are determined to be especially important for study purposes. The documents in the Archive are all SGML marked to allow the documents to be searched and analyzed for the marked up features and categories. Thus, all of the pictorial documents are marked for a full physical description of the picture (e.g., medium, dimensions, frame, etc.) or a full treatment of its production and transmission history. Similar formal categories are established for searching and analyzing the Archive's other documents (printed texts, manuscripts, proofs, etc.). The Archive has a search engine (Pat/Lector) for executing the analytic operations made possible by the SGML markup scheme." (from the home page)


IATH: William Blake Archive

[CR: 20000225] [Table of Contents]

[February 25, 2000] TEI-Encoded Edition of Erdman's Complete Poetry and Prose of William Blake. Matt Kirschenbaum recently posted an announcement which reports on a significant milestone reached in the Blake Archive. "The editors of the William Blake Archive are very pleased to announce the publication of our searchable SGML-encoded electronic edition of David V. Erdman's Complete Poetry and Prose of William Blake. The Blake Archive's electronic Erdman is tagged in SGML using the Text Encoding Initiative DTD and is presented online using Inso's DynaWeb software. But we should note that Erdman's edition is an extraordinarily rich and complex textual artifact in its own right, and encoding and rendering it has proven a substantial technical challenge. The addition of the electronic Erdman means that the site is now inclusive of an even greater range of Blake's work than the approximately 3000 digital images that will eventually form the structured core of the Archive proper..."

The William Blake Archive is a hypermedia archive sponsored by the Library of Congress and supported by the Getty Grant Program, the Institute for Advanced Technology in the Humanities at the University of Virginia, Sun Microsystems, and Inso Corporation, with additional support from the Paul Mellon Centre for Studies in British Art and the University of North Carolina at Chapel Hill. Its editors include Morris Eaves (University of Rochester), Robert Essick (University of California, Riverside), and Joseph Viscomi (University of North Carolina at Chapel Hill). Matthew Kirschenbaum serves as Project Manager, and Greg Murray as a Project Assistant.

"The Blake Archive is an electronic archive based on the illuminated books of William Blake, heavily supplemented by his paintings, drawings, and commercial illustrations. For a large international community of art historians and literary critics ... the archive will be a powerful reference tool, offering high-quality reproductions of an important body of work not currently available, and making that work accessible and useable in new ways that can deepen interdisciplinary understanding in this area of interest. The Blake archivists propose to reproduce approximately 55 copies of Blake's illuminated books, about half of which have never been reproduced before. Once archived digitally, tagged (indexed for retrieval by a standard marking system that will be adapted for the purpose), and annotated, the images can be examined like ordinary color reproductions, but they can also be enlarged, computer enhanced, juxtaposed in numerous combinations, and otherwise manipulated to investigate features (such as the etched basis of the designs and texts) that have heretofore been imperceptible without close first-hand scrutiny of the original works, which are housed in international collections at widely separated locations." SGML is "used to tag images and texts in the archive," including "the SGML edition of David V. Erdman's Complete Poetry and Prose of William Blake, which [the Blake archivists] anticipate releasing sometime in the Spring [1998]." See the annotations from the 'Update' documents, below. The Institute for Advanced Technology in the Humanities at the University of Virginia in Charlottesville uses SGML in many of its text projects, and has developed some SGML (aware) software in this connection.


  • Blake Archive Home Page
  • Project Introduction
  • Index of works
  • [February 15, 1998] Update on the collection holdings
  • [December 1997] Update on the Blake Archive from Matthew Kirschenbaum (Project Manager); [alt. source]
  • [June 05, 1998] "Managing the Blake Archive." Published in Romantic Circles. By Matt Kirschenbaum, Blake Archive Project Manager. Romantic Circles is part of "a Website devoted to the study of Lord Byron, Mary Wollstonecraft Shelley, Percy Bysshe Shelley, John Keats, their contemporaries and historical contexts."
  • Morris Eaves, "Behind the Scenes at the William Blake Archive: Collaboration Takes More Than Email."
  • [August 1, 1997] Update "...this copy of Thel has been tagged using SGML (Standard Generalized Markup Language). SGML tagging offers the Archive's users the opportunity to perform sophisticated searches, either on the text of the plates, or, more remarkably, on the content of their illustrations. Search results are retrieved and presented using DynaWeb, a product of the Inso Corporation. The text and image searching enabled by DynaWeb and the underlying SGML tagging is a powerful demonstration of the potential of electronic resources in the humanities."
  • [March 20, 1997] Update Summer 1997 - ". . . encoding everything, texts and images, in SGML (Standard Generalized Markup Language), the common currency of our enterprise."
  • [November 1996] Update on the William Blake Archive - "we are also encoding all texts, plates, and images in Standard General Markup Language."

National Institute of Japanese Literature

[CR: 19971016] [Table of Contents]

"The National Institute of Japanese Literature (NIJL) is one of the inter-university research institutes of Japan founded in 1972. The purpose of its establishment is to survey the printed and handwritten Japanese classical materials from the Edo period (1603-1863) and before, and to collect their original and/or microfilm reproductions in order to preserve these and also to provide public access. Over more than two decades of activity, the NIJL has become the center of archival activity. The Research Information Department has been engaged in design, production, management, and maintenance of an information system of classical Japanese materials for academic researchers both in Japan and foreign countries."

The abstract for a D-Lib Magazine article by Hara and Yasunaga says: "We investigated the various functions for the text data description by analyzing Japanese classical materials. As the result, we have defined and developed the rules with three functions, calling these KOKIN Rules. Many Japanese classical texts have been electronically transcribed using these rules. We have evaluated their availability especially for their application to databases, CD-ROMs, and publishing. Recently, as SGML has become a popular markup language, we have conducted a study of conversion to SGML compliant text. A full-text database system has been produced based on the string search system conforming to SGML."


Japanese Text Initiative (University of Virginia and the University of Pittsburgh)

[CR: 19980122] [Table of Contents]

The Japanese Text Initiative is a collaborative effort by The University of Virginia Library's Electronic Text Center and the University of Pittsburgh East Asian Library to make searchable SGML texts of classical Japanese literature available on the World Wide Web. The texts are co-edited by Kendon Stubbs and Sachie Noguchi. "The first text of this initiative is Ogura Hyakunin Isshu (often also called Hyakunin Isshu), or 100 Poems by 100 Poets. Hyakunin Isshu is an anthology of 100 poems by 100 different poets. The poems are all "waka" (now called "tanka")--five-line poems of 31 syllables. The 100 poems of Hyakunin Isshu are in rough chronological order from the seventh through the thirteenth centuries, and include tanka by the most famous poets through the late Heian period in Japan."

"The Japanese Text Initiative edition of Hyakunin Isshu includes versions in Japanese (EUC) characters, a Roman transliteration, and a new English translation."


CETH: Center for Electronic Texts in the Humanities

[CR: 19970407] [Table of Contents]

"CENTER PROFILE: The Center for Electronic Texts in the Humanities (CETH) is sponsored jointly by Rutgers, the State University of New Jersey and Princeton University. The Center's administrative headquarters are located in the Archibald Stevens Alexander Library, the main humanities and social sciences research library of Rutgers on the College Avenue Campus in New Brunswick, New Jersey. CETH acts as a national focus for the creation, dissemination and use of electronic texts in the humanities with emphasis on scholarly applications and primary source materials. CETH's activities include an Inventory of Electronic Texts in the Humanities, research into methods of providing Internet access to collections of SGML-encoded material in the humanities, an international summer seminar on methods and tools for electronic texts in the humanities and general information services for humanities computing. CETH is also developing associated projects in partnership with other institutions and research groups. A consortium of member institutions is planned to start in July 1995. CETH is supported in part by the National Endowment for the Humanities and the Andrew W. Mellon Foundation." [extracted from job advertisement, posted to HUMANIST on March 21, 1995]

CETH operates from Princeton and Rutgers Universities under the directorship of Susan Hockey. CETH has been an advocate for the use of SGML in the management and delivery of electronic information under the jurisdiction of research libraries. In particular, CETH has promoted the study and use of the TEI (SGML) header as a form of authentication in cataloging electronic texts by libraries and archive centers. The FTP archive contains compiled SoftQuad A/E rules files for TEI SGML (among other resources) for use by researchers.

Papers and Technical Reports

Internet links to CETH

[CR: 19970604]

Electronic Text Centre (ETC), University of New Brunswick Libraries

[CR: 19970405] [Table of Contents]

Under Director Alan Burk (also Associate Director of Libraries), the Electronic Text Centre of the University of New Brunswick Libraries is seeking "to make available and publish over the Web a variety of information from archival material to journals and newspapers. The Centre is also planning to assist faculty and graduate students in publishing texts electronically. The Centre in the near future will be offering instructional assistance, the use of a server, software and eventually physical space, pc's and scanners for project construction."

"At the Centre, archival texts will be marked up in the Standard Generalized Markup Language or SGML. Some other types of publications such as journals will be tagged primarily in HTML. SGML is a descriptive markup language which allows for rich search capabilities and for easy print or Web publishing. HTML is largely a formating language allowing information to be delivered over the Web. At the Centre data is stored on a Sun ULTRASPARC server which is running OpenText search and retrieve software for tagged text. SGML tagged text is converted on the fly to HTML for delivery over the Web."

"With the assistance of the Electronic Text Center at the University of Virginia, an ETC staff member has been marking up in SGML the Benjamin Marston diaries. The Benjamin Marston Diaries Project was designed as a prototype in the creation of the University of New Brunswick Electronic Text Centre (ETC). The diaries form part of the Libraries' Winslow Papers and consist of 3 volumes, covering most of the years 1778-1787. Marston was a prosperous and respected Harvard graduate whose life was torn apart as a result of the American Revolution. A declared Loyalist, Marston quickly lost his wealth, position and family and spent the remaining 17 years of his life struggling to survive."


Electronic Text Centre (ETC)
Harriet Irving Library, Fifth Floor
University of New Brunswick
Fredericton, NB, Canada E3B 5A3
Tel: 506-447-3309
FAX: 506-453-4595
Email: (Alan Burk, Director)

Les Presses de l'Université de Montréal

[CR: 19980323] [Table of Contents]

A program is underway by the Direction des publications électroniques des Presses de l'Université de Montréal to use SGML as the base technology for constructing a specific production line for learned journals. The first step consists of receiving the articles from the publishers in word processing format and converting them to SGML. The SGML documents thus become the genuine products of the electronic publishing process. The encoders then create and disseminate on the Internet some by-products (SGML, HTML, PDF, and XML). The publications team hopes that these diversified means of diffusion will satisfy the various needs of the readers. Up to now they have produced an electronic version of an issue of the journal Géographie physique et Quaternaire as a pilot project. They are now entering a new project which consists of producing all the issues of five different learned journals for the current year. [adapted from a communique of Marie-Hélène Vezina]


The Canterbury Tales Project

[CR: 19951206] [Table of Contents]

The Canterbury Tales Project is sponsored by the Universities of Sheffield and Oxford under the leadership of Project Director Professor N. F. Blake, Executive Officer Dr Peter Robinson, and Principal Transcriber Dr Elizabeth Solopova. "The Canterbury Tales Project CD-Rom's will be published by Cambridge University Press. The first of these will contain transcripts of the fifty-five manuscripts and four pre-1500 printed editions of the Wife of Bath's Prologue and computer images (subject to necessary permissions) of all eleven hundred pages of these manuscripts and editions. It will also contain full collations and analyses of the manuscript relations of the Prologue, in both regularized and unregularized forms, databases of spellings and variants, and (for the Macintosh version) collation and analytic software." [from the electronic edition description]

The electronic edition is being published on CD-ROM using the DynaText browser from Electronic Book Technologies. DynaText displays SGML documents and supports a rich set of searching, hypertext linking, and structured document navigation features based upon the SGML markup. In the production of the CD-ROM, the Collate program is used to carry out the text collations, and then to generate all the collations and spelling databases in Standard Generalized Markup Language (SGML). Collate is also used to convert the witness files into SGML.
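
The idea of collating witnesses, as mentioned above, can be illustrated with a minimal sketch. This is not the Collate program or its algorithm: it uses Python's standard difflib module to align two readings word by word and list the places where they differ. The witness reading and the function name collate are hypothetical, chosen only for illustration.

```python
from difflib import SequenceMatcher

def collate(base, witness):
    """Return (base_reading, witness_reading) pairs where the two texts differ."""
    a, b = base.split(), witness.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    variants = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "equal" spans are shared text; everything else is a variant.
        if tag != "equal":
            variants.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return variants

# Hypothetical readings of one line in two witnesses.
BASE = "Experience though noon auctoritee"
WITNESS = "Experiens though none auctoritee"

variants = collate(BASE, WITNESS)
for base_reading, witness_reading in variants:
    print(f"{base_reading!r} -> {witness_reading!r}")
```

A real collation tool would also handle regularized spellings, transpositions, and more than two witnesses; this sketch only shows the basic alignment step.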


Project addresses:

University of Oxford
The Canterbury Tales Project
Faculty of English
St Cross Building
Manor Road

University of Sheffield
The Canterbury Tales Project
Humanities Research Institute
Rm 1.19, Arts Tower
University of Sheffield
S10 2TN
Tel: 0114 - 2824789 or 2768555 ext. 4789
Fax: 0114 - 2768251

University of Pittsburgh Electronic Text Project

[CR: 19960728] [Table of Contents]

"The University of Pittsburgh Electronic Text Project is a reseach and development effort investigating the technology and policy issues involved in producing, collecting, and serving richly marked-up scholarly texts over the University and wide-area network. The project was chartered in July 1994, and commenced in September 1994. The Electronic Text Project is composed of librarians and faculty from the University of Pittsburgh. An initial and large and component of the project is the SGML TEI encoding of the texts, and a subset of the Project Team is working exclusively on those aspects. Two other subsets of the Project Team will be convened to deal with Interface Design and Collection Issues."

"The [pilot] project will involve a small focused team of librarians and faculty with a teaching or research interest in electronic text. The project will produce: (1) a detailed survey and analysis of electronic text issues and technology, to be used to in form future ULS electronic text policy, (2) a set or sets of complex electronic texts of scholarly importance utilizing SGML markup, and (3) a core of electronic text technology experience and knowledge from which can develop future electronic text services and initiatives." Tasks include..."In depth study and practice with electronic text technologies and scholarly-use applications, primarily SGML, the TEI and HTML applications of SGML, and existing e-texts, textual computing software and corpora as available."


Georgetown University: Labyrinth Medieval Studies and Peirce Projects

[CR: 19950903] [Table of Contents]

Labyrinth Medieval Studies: Manuscripts, Codicology, Paleography

"The Labyrinth is a global information network providing free, organized access to electronic resources in medieval studies through a World Wide Web server at Georgetown University. The Labyrinth's easy-to-use menus and hypertext links provide automatic connections to databases, services, and electronic texts on other servers around the world. . ." The project directors encourage the use of . . ."HTML (Hypertext Markup Language), W3, SGML (Standard General Markup Language) and TEI (Text Encoding Initiative) encoding so that as many people as possible can contribute their own texts to the Labyrinth and participate in its development."

Peirce Project

Project Opéra (Outils pour les documents électroniques, recherche et applications)

[CR: 19951113] [Table of Contents]

A wide range of SGML-related research has been conducted for the past 10 years within French universities and government laboratories. Many theoretical papers and design documents have now been made available on the Internet, and the commercial products (the Grif SGML editor; Symposia) are visible proof of the success of this research. Grif SA (a technical association of INRIA) has commercialized SGML-based products built upon the early prototype Grif editor, while parallel development of the structure editor Thot continues as a research effort within Project Opéra.

Bibliographic entries have been created in this database for representative works made available from the Opéra Project. Look for titles under the following names: Vincent Quint (e.g., the languages of Grif); Jacques André (e.g., SGML and train crashes); Extase K. A. Akpotsui (e.g., Thesis on Type Transformation); Cécile Roisin (e.g., cut-and-paste in structured editing); Dominique Decouchant (e.g., structured, cooperative editing); Philippe Louarn (e.g., electronic documents for the WWW); Hélène Richy (e.g., indexing of structured documents); Irène Vatton (e.g., with V. Quint, hypertext aspects of Grif). Richard Furuta (now at Texas A&M University), while not a part of Project Opéra, has co-authored a number of publications relating to the prototype structure editor Grif (e.g., structured editing with Grif). Fuller bibliographic lists are available in the project reports, and the documents themselves are accessible via the IMAG (Institut d'Informatique et de Mathématiques Appliquées de Grenoble) and IRISA FTP servers.

Vincent Quint (Grenoble) is current Project Head for the Opéra Project, and Jacques André (Rennes) is Opéra's scientific co-leader. Opéra has working relations with the Swiss Federal Institute of Technology, and the Universities of Lausanne, Maryland and California (at Berkeley). It "is concerned with electronic documents (technical documentation, hypertexts, multimedia, digital typography, etc.). It studies models of documents which take into account both their logical or abstract organization, their graphical presentation and contents. It also involves the development of editing techniques based on these models. The long-term goal is to design and build an editing environment for developing and maintaining large, complex multimedia documentation. OPERA is a common project with INRIA Rhône-Alpes, INRIA-Rennes and IMAG."

Opera's Research Topics

  • Design of a meta-model that can represent different types of documents in a homogeneous way, including structured documents, hypertexts and multimedia.
  • Research into document "contents" and their relationships with structures.
  • Production of a structured document editor called Grif [now called Thot within the research program], as the basis for a number of experiments on active documents, the embedding of complex physical structures into logical structures, transformations of logical structures, multimedia editing, etc.
  • Construction of a first prototype of the editing environment with two components: a cooperative structured editor (called Alliance) and a version management tool based on document structure.
  • Development of an authoring system for hypertexts and multimedia documents on the World-Wide Web, Tamaya.

SGML-related research topics from 1994:

WWW Links:


  • Vincent Quint (Project leader, Grenoble): INRIA-IMAG, 2 avenue de Vignate, 38610 Gières, France; E-Mail, Tel +33 76 63 48 31; Fax: +33 76 54 76 15
  • Jacques André (Rennes): Email, Tel +33 99 84 73 50
  • IRISA/INRIA Rennes; Campus Universitaire de Beaulieu; 35042 Rennes Cedex; France; Phone: + 33 99 84 71 00; Fax: + 33 99 84 71 71

MULTEXT (Multilingual Text Tools and Corpora) and MULTEXT-EAST (Multilingual Texts and Corpora for Eastern and Central European Languages)

[CR: 19970609] [Table of Contents]

"Project Overview: MULTEXT (Multilingual Text Tools and Corpora) is a recently initiated large-scale project funded under the Commission of European Communities Linguistic Research and Engineering Program, intended to contribute to the development of generally usable software tools to manipulate and analyse text corpora and to create multi-lingual text corpora with structural and linguistic markup. It will attempt to establish conventions for the encoding of such corpora, building on and contributing to the preliminary recommendations of the relevant international and European standardization initiatives. MULTEXT will also work towards establishing a set of guidelines for text software development, which will be widely published in order to enable future development by others. The project consortium, consisting of eight academic and research institutions and six major European industrial partners, is committed to make its results, namely corpus, related tools, specifications, and accompanying documentation, freely and publicly available."

"At the outset of the project, the consortium will undertake to analyse, test and extend the SGML-based recommendations of the TEI on real-size data, and gradually develop encoding conventions specifically suited to multi-lingual corpora and the needs of NLP and MT corpus-based research. To manipulate large quantities of such texts, the partners will develop conventions for tool construction and use them to build a range of highly language-independent, atomic and extensible software tools."

Markup occurs at four different levels. "The TEI Guidelines provide the basis for markup at levels 0 (the TEI header), 1 and 2 as well as many elements of level 3. In collaboration with Eagles, MULTEXT is extending the TEI scheme in order to specify a TEI-conformant Corpus Encoding Style (CES) that is optimally suited to NLP research and can therefore serve as a widely accepted TEI-based style for European corpus work. Application of the CES to CEE languages, which may require minor modifications to accommodate CEE language-specific information and structures, will provide a test of both the TEI Guidelines and MULTEXT and Eagles' extensions to it." [from Ide/Véronis SLNR paper]


MULTEXT (Multilingual Text Tools and Corpora)
Coordinator: Dr. Jean Véronis
Laboratoire Parole et Langage
CNRS & Université de Provence
29, Avenue Robert Schuman
F-13621 Aix-en-Provence Cedex 1
Tel: +33 42 95 20 73
Fax: +33 42 59 50 96

EAGLES Initiative (Expert Advisory Group for Language Engineering Standards)

[CR: 19960114] [Table of Contents]

"The Expert Advisory Group for Language Engineering Standards (EAGLES) is an initiative of the European Commission which aims to accelerate the provision of standards for: (1) Very large-scale language resources (such as text corpora, computational lexicons and speech corpora); (2) Means of manipulating such knowledge, via computational linguistic formalisms, markup languages and various software tools; (3) Means of assessing and evaluating resources, tools and products." [from the Home Page]

"Text representation: The Text representation subgroup will continue developing a Corpus Encoding Standard to provide a precise set of guidelines for the encoding of corpora optimally suited to use in language engineering research and applications, and which can serve as a widely accepted standard for European corpus work. This will involve evaluation, adaptation and extension of the TEI Guidelines. The final report will provide a Corpus Encoding Standard, which will recommend: (1) A minimal level of encoding that a corpus must achieve to be standardised from the point of view of the descriptive representation (bibliographical information, markup of structural and typographical information); (2) Mechanisms to encode linguistic annotation and alignment; (3) A library of DTDs." [from the Introduction]

Corpus Encoding: "This work has been carried out in collaboration with the LRE project MULTEXT, as part of a task on markup specifications, the goal of which is to develop a proposal for a Corpus Encoding Standard (CES) optimally suited for use in language engineering. The standard will be formulated as a Text Encoding Initiative (TEI)-conformant application of the Standard Generalized Markup Language (SGML) ISO 8879." [Introduction]


Corpus Encoding Standard (CES)

[CR: 19960412] [Table of Contents]

[Abstract] "This document is the first version of the Corpus Encoding Standard (CES). The CES has been designed to be optimally suited for use in language engineering research and applications, in order to serve as a widely accepted set of encoding standards for corpus-based work in natural language processing applications. The CES is an application of SGML (ISO 8879:1986, Information Processing--Text and Office Systems--Standard Generalized Markup Language) compliant with the specifications of the TEI Guidelines for Electronic Text Encoding and Interchange of the Text Encoding Initiative."

"The CES specifies a minimal encoding level that corpora must achieve to be considered standardized in terms of descriptive representation (marking of structural and typographic information) as well as general architecture (so as to be maximally suited for use in a text database). It also provides encoding specifications for linguistic annotation, together with a data architecture for linguistic corpora."

Links: (sample)

European Corpus Initiative (ECI)

[CR: 19951220] [Table of Contents]

"The European Corpus Initiative was founded to oversee the acquisition and preparation of a large multi-lingual corpus to be made available in digital form for scientific research at cost and without royalties. We believe that widespread easy access to such material would be a great stimulus to scientific research and technology development as regards language and language technology. . . . No amount of abstract argument as to the value of corpus material is as powerful as the experience of actually having access to some in one's laboratory.

"In reply to this call, many colleagues spontaneously offered their data and/or supplied the contacts to negotiate redistribution rights for the ECI. In many cases we actively sought out data providers especially for the under-represented languages and for larger parallel collections. The majority of the work of collecting materials and permissions and converting them into a consistent format has been done at the HCRC in Edinburgh and at ISSCO, University of Geneva, under the overall supervision of Henry S. Thompson and Susan Armstrong, respectively."

"The ECI/MCI corpus has now been published on CD-ROM, and contains almost 100 million words in 27 (mainly European) languages. It consists of 48 opportunistically collected component corpora marked up in [TEI] SGML (to varying levels of detail), with easy access to the source text without markup. Twelve (12) of the component corpora are multilingual parallel corpora with from two to nine sub-corpora."


Henry S. Thompson
2 Buccleuch Place,
Edinburgh EH8 9LW, UK
OTS, Utrecht University
Trans 10
3512 JK Utrecht
The Netherlands

Centro Ricerche Informatica e Letteratura (CRILet)

[CR: 19980310] [Table of Contents]

"The Centro Ricerche Informatica e Letteratura (CRILet) is a research group established at the Department of Linguistics and Literary Studies of the University of Roma La Sapienza. The group is co-ordinated by Prof. Giuseppe Gigliozzi, and is composed of a number of undergraduate and postgraduate students and researchers. The main objectives of CRILet are to conduct studies in Humanities and Literary computing and to produce digital and on-line literary resources and archives. CRILet has initiated as an experimental project the Web publication of some texts of the Italian literary tradition encoded in SGML according to the TEI P3 or TEI Lite DTD. The currently-available encoded text are browsable on-line with the help of an SGML browser like Softquad Panorama. The project has also recently prepared an Italian translation of TEI document N. TEI U5, written by Lou Burnard and Michael Sperberg-McQueen, TEI Lite. Introduction to Text Encoding for Interchange." [adapted]


Thesaurus Musicarum Italicarum

[CR: 20010116] [Table of Contents]

[2001-01-16] "TMEuro is an information system for Western music history that contains digitised and enriched source materials dating from before c. 1750. The aim of TMEuro is to provide effective and user-friendly access to these materials, to connect these to related materials outside the system, to store knowledge acquired by users and to communicate it to others, by means of suitable and innovative technology. TMEuro is destined for and supported by a broad group of academic and professional users. TMEuro consists of: (1) reliable, high-quality and durable digital representations of sources (text, music notation, pictures, objects) in a suitable form (facsimile, transcription, edition, translation, sound or otherwise); (2) knowledge that is related to these sources (enrichment, annotations, links, thesauri, databases, electronic publications etc.); (3) a software environment providing access over the Internet (not excluding distribution through other media), making use as much as possible of SGML/XML and related technologies..."

[1997 description] Thesaurus Musicarum Italicarum (TMI) is a multimedial edition of Italian music-theoretical sources. It is "an initiative of the Department of Computer and Humanities at Utrecht University, the aim of which is to publish a cohesive electronic corpus of Italian music treatises from the second half of the sixteenth to the early seventeenth century." [...] "The TMI will employ a specialized SGML markup system for the representation of historical source materials developed by the international Text Encoding Initiative (TEI). In addition to indicating its structure, this system allows the recording of layout and physical properties of a source, which are often of the greatest importance for research. . . DARMS [Digital Alternate Representation of Musical Scores] will be used at the start for the storage of music notation, but the aim is to convert all music notation to SMDL [Standard Music Description Language] if it proves viable. Both DARMS and SMDL allow the encoded music to be played automatically."


Earlier/old links:

Project addresses:
Thesaurus Musicarum Italicarum
Dr. Frans Wiering, Project Coordinator
Vakgroep Computer & Letteren
Achter de Dom 22-24
NL-3512 JP Utrecht
Tel: +31 (0) 30-2536335 or 2546426
FAX: +31 (0) 302539221

Language Technology Group (LTG), Human Communication Research Centre (HCRC), University of Edinburgh

[CR: 19970128] [Table of Contents]

Part of the Human Communication Research Centre (HCRC), the Language Technology Group offers natural language engineering and software services, drawing upon "the skills and expertise of one of the largest communities of natural language processing specialists in Europe." The LTG supports a Software Helpdesk: "a free service dedicated to the support of public domain and freely available software for natural language processing and the fostering of its use in practical applications." "The Language Technology Group makes available various software packages. For research purposes, these are often available for free to academic research groups and for a small fee to industrial R&D groups." The package LT NSL (a library of normalised SGML tools) is described below. [extracts from the Home Page]

"LT NSL is a development environment for SGML-based corpus and document processing, with support for multiple versions and multiple levels of annotation. It consists of a C-based API for accessing and manipulating SGML documents and an integrated set of SGML tools. The LT NSL initial parsing module incorporates v1.0 of James Clark's SP software, arguably the best SGML parser available. The basic architecture is one in which an arbitrary SGML document is parsed once, yielding two results: (1) An optimised representation of the information contained in the document's DTD; (2) A normalised version of the document instance, which can be piped through any tools built using our API for augmentation, extraction, etc."

"The use of the cached DTD together with the normalisation of SGML to nSGML means that applications processing nSGML streams can be very efficient. LT NSL provides two views of an nSGML file; one as a flat stream of markup elements and text; a second as a sequence of tree-structured SGML elements. The two views can be mixed, allowing great flexibility in the manipulation of SGML documents. It also includes a powerful, yet simple, querying language, which allows the user to quickly and easily select those parts of an SGML edocument which are of interest. Finally, LT NSL supports SGML output, making it easier to write SGML to SGML conversion programs."


The HCRC Map Task Corpus

[CR: 19980501] [Table of Contents]

The HCRC Map Task Corpus represents publicly distributed material which uses the TEI encoding scheme for linked spoken language transcripts (of unscripted dialogue) and digital audio. The following entry supplements and partially supersedes information provided in the European Corpus Initiative (ECI), above.

"The HCRC Map Task Corpus was produced in response to one of the core problems of work on natural language: much of our knowledge of language is based on scripted materials, despite most language use taking the form of unscripted dialogue with specific communicative goals. Our intention, therefore, was to elicit unscripted dialogues in such a way as to boost the likelihood of occurrence of certain linguistic phenomena, and to control some of the effects of context. To this extent while our dialogues are spontaneous, the corpus as a whole comprises a large, carefully controlled elicitation exercise. . ."

"The HCRC Map Task Corpus consists of 128 digitally recorded unscripted dialogues and 64 citation form readings of lists of landmark names. All dialogues were transcribed verbatim in standard orthography, including (where possible) filled pauses, false starts, hesitations, repetitions and interruptions. The sampled speech data, transcriptions, list reading, and some other ancillary material has been published for distribution on a collection of 8 CD-ROM disks. . . Transcriptions are provided for each conversation, marked up with TEI-compliant SGML, in a minimally intrusive and easily separated way. PostScript files of the map images used in the experiments are provided, along with full documentation of the experimental design and data collection protocol, resources for using SGML tools on the transcriptions and other text materials, and an extensive set of source code for performing basic signal processing functions on the waveform data, such as down-sampling, de-multiplexing, channel summation, and D/A conversion for Sun workstations (including playback of segments selected via inspection of transcripts in Emacs)."


Lingua Parallel Concordancing Project

[CR: 19980309] [Table of Contents]

"The Multilingual Parallel Concordancing project is supported by the European Union under its Lingua initiative, and is led by the University of Nancy II, France [1]. The objective of the project, which began in 1994, was to develop software for parallel concordancing that would enable a user to enter a search string in one language, and find not only the citations for that string in the search language, but the corresponding sentences in the target language. It operates therefore on parallel translated texts which constitute the corpus. . . The corpus itself is designed to conform with the Text Encoding Initiative (TEI), using Standard Generalised Markup Language (SGML) to encode the text. This is to allow the corpus to be used in UNIX and Macintosh environments, as well as on IBM PC's. The program uses the bare minimum of encoding to reduce the physical size of the distributed corpus, and to allow potential users to mark up their own texts relatively easily." [from the University of Birmingham description]

"The proposal for a parallel concordancing project put to the Lingua bureau of the EU originated in the desire of a group of lecturers from different European universities to enhance the use of concordancing in the process of learning a second language. . . Despite the considerable amount of work implied for anyone who wishes to conform to such guidelines as those produced by the Text Encoding Initiative, it appeared to us that this was definitely worth the effort. Indeed, as a norm, it makes it possible for a given electronic text to be transferred from one scholar to another without spending hours in defining an interchange format and writing tools for transcription. Besides, it offers the possibility to incrementally add new information to a given text without any loss of generality. We have seen how natural it was to add alignment information to a text, but in the context of second language teaching, there is always a possibility to add some specific syntactic or rethorical elements on the basis of the different sets of tags defined within the TEI. It is clear however, that such encodings have to be accompanied by a clearly defined set of tools which will effectively give a semantics to the corresponding marks." [from the project description; see below]


Bâtiment Loria
B.P. 239
F-54506 Vandoeuvre Lès Nancy
Email: (Laurent Romary)
David Woolls
University of Birmingham
B15 2TT

University of Waterloo Centre for the New OED and Text Research

[CR: 20000314] [Table of Contents]

The current research interests at the Centre are very broad, including: "SGML and SQL, Management of text through grammars, Integration of text and relations, Federated database technology, Database optimization, Update of text." Research on SGML and document grammars as applied to the NOED forms a significant body of valuable literature. See a description of some of the NOED work provided in the annotation to Gaston Gonnet's report "Examples of PAT Applied to the Oxford English Dictionary". The Centre's research results have been further developed and commercialized by Open Text Corporation, a spin-off company located in Waterloo. Numerous publications and technical reports relating to the use of SGML are registered in the online bibliographic database which is a part of this work. Search in the relevant bibliography documents under the names of the personal authors, including Frank W. Tompa [example: "What is (Tagged) Text?"], Gaston Gonnet [example: "Mind Your Grammar: A New Approach to Modelling Text"], Darrell R. Raymond [example: "Flexible Text Display with Lector"], Heather J. Fawcett, Derick Wood, and others.

Is there a DTD for the OED? Apparently, of sorts. . . Tim Bray wrote: "I was the only person ever to write a grammar for the OED, & did so by programmatically reverse-engineering the tagged text. It had 25,000 productions, sigh, and was hard to boil down (but was still useful). That's an extreme example of course, but early evidence suggests that the task of automatically abstracting structure from pre-existing texts is Really Hard." (Perl-XML Mailing List, Fri, 04 Sep 1998 10:59:46 -070).

  • The New OED Project at Waterloo
  • "Doing Other Things with Texts: The Use of Electronic Resources in Revising the OED", by Jeffery Triggs, Oxford University Press [mirror copy]
  • Centre for New OED and Text Research Home Page
  • Text/Relational Database Management System Project
  • New Oxford English Dictionary Publications
  • UW Centre for the New OED and Text Research: Software [GOEDEL, PAT, LECTOR, etc.]
  • Frank Tompa's Publications Page
  • [August 04, 2000] "How Oxford University Press and HighWire Took the OED Online." By Mike Letts. In The Seybold Report on Internet Publishing Volume 4, Number 09 (May 2000). "Oxford University Press, with the help of HighWire Press, recently launched the first Web edition of the massive Oxford English Dictionary (OED). Behind the scenes, the project team overcame numerous challenges that the voluminous reference work posed to online delivery, and proved the return on investment that careful markup provides. Believing that contemporary dictionaries were not adequately documenting the history and usage of the English language, the Philological Society of London decided in 1857 to begin a complete documentation of the evolution of the language from the 12th century forward. However, it wasn't until 1879 that a formal agreement was reached with Oxford University Press (OUP) to begin work on what would eventually become the eminently staid, but authoritative, Oxford English Dictionary. Considered the definitive guide to the evolution of the English language, the first full edition of the dictionary wasn't completed until 1928 under the name A New English Dictionary on Historical Principles. What was planned as a four-volume undertaking, became a four-decade, 12-volume project. By 1984, with supplemental volumes added to the first edition, OUP decided to move its magisterial reference work (now known as The Oxford English Dictionary) into an electronic format using a then innovative SGML tagging scheme. The project took five years and cost $13.5 million, culminating in 1989 with the publication of the 20-volume second edition. Now, using the foundation that was laid in SGML some 15 years ago, OED has established the dictionary on the World Wide Web. OED Online, which went live on March 14, is a complete online copy of the second edition, which features more than 500,000 defined terms and 2.5 million quotations. It will also be used as the foundation of the reference work as OUP begins what it estimates will be at least a 10-year project to publish a third edition of the dictionary. . . Oxford's approach to taking the OED online marks yet another major print reference work whose editorial and production processes has been transformed from a print-centric workflow to a Web-centric one, in which print publications are merely snapshots of the live publication that lives online. The speed at which the project was brought online and its relatively small price tag were made possible by Oxford's prescient decision 15 years ago to convert the dictionary from film and typesetting files into SGML. To move a century's worth of work and accumulated material onto the Web in 16 months, at a cost of only about $1.5 million, is a testament to what solid markup, responsible content management and forward thinking can do. It's worth noting that the project could have been done even faster and at less cost had Oxford cut corners in the quality of the presentation or the efficiency of the application. It didn't. OED Online is well-executed, a pleasure to read and to use. Oxford didn't go it alone, however, and credit is certainly due to HighWire Press for its skill in production. Its tuning of Verity's K2 engine to handle the fine-grained searches that OED demanded defies conventional wisdom, which says to avoid such finely grained markup when deploying to thousands of readers. And its experience in SGML-based Web publishing, proven for more than 100 journals, translated well into its first venture into reference works and was essential to the successful launch of this landmark Web publication."
  • [March 14, 2000] On XML-DEV, David Megginson noted the appearance of the 'OED Online': "The OED (an XML-like project, if not XML itself) is now online... That in itself is cool. The fact that they think they'll convince Web users to cough up US$550.00/year for single-user access on a slow server, on the other hand, is just plain..."
  • "Element Sets: A Minimal Basis for an XML Query Engine." By Tim Bray. "Between 1986 and 1996 I was employed, first at the New Oxford English Dictionary project, then at Open Text Corporation, in both places almost entirely concerned with the problems of search and retrieval in large structured text-bases. The technology developed for the New OED project at the University of Waterloo and further refined and marketed by Open Text was a search engine named pat. Pat implemented a facility much like that described in this note. It was applied successfully to the 572 million characters of the electronic OED and variety of other text-bases, including service in one of the first large-scale Internet search engines, the now-defunct Open Text Index. Pat (under a variety of different product names) remains in active use by Open Text customers (although the company has retreated from the chronically-unprofitable search and retrieval business.) The data to which pat was applied was typically tagged in the SGML style, but did not have a DTD - in effect, XML..."
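
The kind of structure-aware retrieval mentioned in the last item (searching tagged text that has no DTD) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not pat's actual algorithm or query language: the sample text, the tag names, and the helper functions `element_regions`, `word_points`, and `including` are all hypothetical, and elements of the same name are assumed not to nest.

```python
import re
from typing import List, Tuple

Region = Tuple[int, int]  # (start, end) character offsets, end exclusive


def element_regions(text: str, tag: str) -> List[Region]:
    """Regions covered by <tag>...</tag> pairs (assumes no nesting of `tag`)."""
    pattern = re.compile(r"<{0}>.*?</{0}>".format(re.escape(tag)), re.DOTALL)
    return [m.span() for m in pattern.finditer(text)]


def word_points(text: str, word: str) -> List[int]:
    """Start offsets of every whole-word occurrence of `word`."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    return [m.start() for m in pattern.finditer(text)]


def including(regions: List[Region], points: List[int]) -> List[Region]:
    """Keep only the regions that contain at least one of the match points."""
    return [(s, e) for s, e in regions if any(s <= p < e for p in points)]


# Hypothetical dictionary-like data, tagged but without any DTD.
SAMPLE = (
    "<entry><hw>cat</hw><quote>the cat sat</quote></entry>"
    "<entry><hw>dog</hw><quote>a dog barked</quote></entry>"
)

quotes = element_regions(SAMPLE, "quote")
matches = including(quotes, word_points(SAMPLE, "cat"))
texts = [SAMPLE[s:e] for s, e in matches]
print(texts)
```

The sketch treats both elements and word matches as sets of positions, so a structural query such as "quotations containing the word cat" reduces to a containment test between two sets.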

Other links for the University of Waterloo

UW Centre for the New OED and Text Research
University of Waterloo
200 University Avenue West
Waterloo, Ontario N2L 3G1

Tel: (519) 888-4567 ext. 6183 or 4675
fax: (519) 885-1208

University of Waterloo English Department - Technical Writing Course Using SGML

[CR: 19961009] [Table of Contents]

"The University of Waterloo English Department is using SGML to develop hypertext courseware for a second year Technical Writing course on the World Wide Web. Subject matter experts from the university and business communities develop tutorials in the InContext structured word processor and save them as SGML. Then they run them through special Windows-based conversion software to create files in the World Wide Web hypertext format, the HyperText Markup Language (HTML). This software was written for English 210E and we were pleased to find that the printed and online versions of our documents are generally of higher quality than we would have created by hand. The automatic publishing process removes the inconsistencies that plague manually formatted print and online documents. . . SGML is also used for student assignments. All assignments are written in SGML and submitted. The submission system converts them and mounts them on the World Wide Web for peer review. Marked assignments with marker's comments are returned through password-protected Web sites. . .English 210E uses three DTDs for resume, letter and report documents. The resume and letter are very confining DTDs intended to be an easy introduction to SGML. Instead of overwhelming students with choices, they give them a strict "menu" of core options." [extracted]]


University of Waterloo
200 University Avenue West
Waterloo, Ontario
Canada N2L 3G1

Indiana University: LETRS Services

[CR: 19960811] [Table of Contents]

LETRS (Library Electronic Text Resource Service) provides online access to SGML tagged texts. "Access is provided by use of the UNIX textbase management system called PAT, originally developed by the Open Text Corporation for the 2nd edition of the Oxford English Dictionary. In conjunction with related client display programs used to access the Open Text system from various types of computers, PAT supports the rapid indexing, searching, retrieving, and navigating of large, structured, full text files in ways relevant to the types of questions that humanists ask about texts. Texts are tagged in SGML (Standard Generalized Markup Language) and conform to the TEI (Text Encoding Initiative) implementation of that standard, which allows tagged texts to be transferred electronically from one location to the other and from one hardware and software system to another without loss of information." (from the home page)


LETRS, Indiana University
E-157 Main Library
Bloomington, IN 47405-1801
Tel: (812) 855-3877 (85LETRS)
Email: (Carolyn C. Sherayko, LETRS Manager)

Indiana University: Victorian Women Writers Project

[CR: 19980313] [Table of Contents]

"The goal of the Victorian Women Writers Project is to produce highly accurate, SGML-encoded transcriptions of literary works by British women writers in the late Victorian period. The works chosen for this project will otherwise not be available in SGML-encoded form, either in the Chadwyck Healey English Poetry Full-Text Database, through the Brown Women Writers Project, or elsewhere. The works will include anthologies, broadsides, and volumes of poetry and verse drama. Considerable attention will be given to the accuracy and completeness of the texts, and to accurate bibliographical descriptions of them. Texts will be encoded according to the Text Encoding Initiative (TEI) Guidelines, using the TEILite.DTD (version 1.6). We will include with each text a header describing fully the source text, the editorial decisions, and the resulting computer file. The texts will be made freely available through the World Wide Web. A faculty advisory board will consider textual and editorial issues, while a technical advisory panel will consider issues of encoding and distribution." (from the Home Page)

As of early 1998, some 34 encoded works were available online.

Principals in the project include: General Editor: Perry Willett; Editorial Advisor: Donald J. Gray; Technical Advisor: Richard Ellis; Contributing Editor: Felix Jung.


Encoded Archival Description (EAD) and Finding Aids Projects

[CR: 19980829] [Table of Contents]

"Finding Aids," according to a description at Berkeley Digital Library SunSITE, are "inventories, registers, indexes or guides to collections held by archives and manuscript repositories, libraries, and museums. Finding aids provide detailed descriptions of collections, their intellectual organization and, at varying levels of analysis, of individual items in the collections. Access to the finding aid is essential for understanding the true content of a collection and for determining whether it is likely to satisfy a scholar's research needs." The leadership role of the US Library of Congress in promoting the Encoded Archival Description (EAD) standard is documented in the "Government Applications" section of the database. The EAD DTD is maintained in the Network Development and MARC Standards Office of the Library of Congress in partnership with the Society of American Archivists. In early 1998, revisions to the DTD were made to render it usable for XML data.

In several sections which follow, applications of the EAD DTD and other SGML-based digital library collections are featured. The institutions and activities described are meant to be representative.


Berkeley Digital Library SunSITE [Formerly: Berkeley Finding Aid Project]

[CR: 19980319] [Table of Contents]

Berkeley Digital Library SunSITE is a digital library and a resource center for other digital library projects. It is sponsored by The UC Berkeley Library, UC Berkeley, and Sun Microsystems, Inc. Among its principal resources are the American Heritage Project, the California Heritage Collection, the Finding Aids for Archival Collections (delivery supported by Inso's DynaWeb software), and The Online Medieval and Classical Library. Several of these collections use SGML encoding. Beginning in 1993, the University of California, Berkeley was the primary center of activity on SGML-based Archival Finding Aids. Since 1995, several other academic institutions and consortial bodies have sponsored independent and collaborative research in this domain. The discussion below focuses upon Berkeley, but links will be found to several sites now participating in this effort.

Background: "Developing the EAD DTD has been a cooperative venture since October 1993 when the University of California, Berkeley, Library initiated a project to investigate the desirability and feasibility of creating a platform-independent encoding standard for inventories, registers, indexes, and other finding aids, which are created by libraries, museums, manuscript repositories, and archives to describe and provide access to their holdings. The project's growth has been nurtured through a series of fellowships, grants, and in-kind contributions made by several institutions and professional associations, including the Department of Education (Title IIA grant); Library, University of California, Berkeley; University of California, San Diego; Commission on Preservation and Access; Bentley Historical Library; Andrew W. Mellon Foundation; National Endowment for the Humanities (Division of Preservation and Access); Library of Congress National Digital Library Program; and the Council on Library Resources."

"Detailed information about the history and development of the project from October 1993 to January 1996 may be found at the Berkeley Digital Library SunSite. On 26 February 1996, the Library of Congress Network Development and MARC Standards Office (NetDev/MSO) announced the release of the alpha version of the EAD. A few months earlier, NetDev/MSO agreed to serve as the international maintenance agency for the standard. At about the same time, the Society of American Archivists (SAA), through an EAD Working Group of its Committee on Archival Information Exchange (CAIE), accepted responsibility for monitoring and assisting in the ongoing development of the EAD. The working group includes individuals representing various interests within the SAA as well as representatives from OCLC, Research Libraries Group (RLG), and the Library of Congress." [From: Background Information on EAD Development and Sponsorship

With assistance from Electronic Book Technologies' Educational Grant Program and the NEH (Act Title IIA Research and Demonstration Grant. October 1993 - September 1995), the University of California at Berkeley is sponsoring advanced research in SGML technologies for use by electronic libraries.

BFAP Project Overview: "The Berkeley Finding Aid Project is a collaborative endeavor to test the feasibility and desirability of developing an encoding standard for archive, museum, and library finding aids. Finding aids are documents used to describe, control, and provide access to collections of related materials. In the hierarchical structure of collection-level information access and navigation, finding aids reside between bibliographic records and the primary source materials. Bibliographic records lead to finding aids, and finding aids lead to primary source materials. A standard for encoding finding aids will ensure not only broad based access to our cultural heritage and natural history collections, but that the findings aids themselves will survive hardware and software platform changes, and thereby remain available for future generations."

"The Project involves two interrelated activities. The first task entails the design and creation of a prototype encoding standard for finding aids. The prototype standard is in the form of a Standard Generalized Markup Language (ISO 8879) Document Type Definition (SGML DTD). Researchers at the University of California, Berkeley are developing the encoding standard in collaboration with leading experts in collection cataloging and processing, text encoding, system design, network communication, authority control, and text retrieval and navigation. The design and development team is analyzing the structure and function of representative finding aids. The basic elements that occur in finding aids are being isolated and their logical interrelationships defined. The DTD is based on the results of this analysis. The project team is using ArborText's Document Architecture to facilitate DTD development. The first iteration of the DTD was completed in Fall 1994."

"Building a prototype database of finding aids is the second objective of the Project. Toward this end, available SGML based software has been evaluated. For authoring and validating finding aids compliant with the DTD, project staff are using ArborText's Adept Editor . Conversion of finding aids that already exist in various word processing and database formats is accomplished through a combination of Adept Editor, WordPerfect macros, and Microsoft Access. Scanning and OCR of finding aids that only exist in paper form will follow conversion of those finding aids already in machine-readable form. Currently the database has approximately 100 finding aids from Berkeley, the Library of Congress, the National Library of Australia, the Getty Center, Duke University, University of California, San Diego, State Historical Society of Wisconsin, and others."

"Electronic Book Technologies' DynaText is being used for electronic network publishing of the finding aids. At this time the only version of DynaText available for use on the Internet is X- Windows. Stand-alone versions of DynaText are available for Macintosh and Microsoft Windows . DynaText supports inline display of a variety of graphic formats (GIF, TIFF, etc.) and launching of external display software for image viewing and manipulation. DynaText also supports a variety of search types within and across finding aids: boolean keyword, word adjacency and proximity, as well as element or field qualified searches. The text viewing and navigation component of DynaText provides dynamic generation of an expandable table of contents adjacent to the document text to supply context clues for reading comprehension and random, informed access to the text." [from the project description]


Daniel Pitti
Librarian for Advanced Technologies Projects
386 Library
University of California, Berkeley
Berkeley, California 94720-6000
Phone: 510/643-6602
FAX: 510/642-4759
Beth Davis-Brown
National Digital Library Program
The Library of Congress
LIBN/CS/NDL (1025)
Washington, DC 20540-1000
Tel: 202-707-3301
FAX: 202-707-0815

Berkeley Art Museum/Pacific Film Archive

[CR: 19970523] [Table of Contents]

The Web site has made a recent [May 1997] "addition of several searchable resources of film documentation (12,000 curatorial film notes encoded in SGML) and EAD-encoded guides to our art collection. . . The means we use for delivery of all the SGML data (including the EAD guides) online is to have them be searchable in both fielded and relevance-ranked full-text searches, and then convert the SGML file (or chunks of it) to HTML on the fly for viewing in standard web browsers."

"These guides present our collections information in a context for greater understanding of them than an object-level database alone would provide. We wanted two things when looking for a way to present primary collections information - use of standards so that this information would have lasting value and maximize the work put into it; and the maximum flexibility and ability to share this information since it is the core upon which we will build. The method we found that fit all the above needs was the EAD. SGML (Standard Generalized Markup Language) is an international standard for encoding full-text and richly structured information. The EAD (Encoded Archival Description) is an implementation of SGML intended to encode information which describes collections. . . The collection guides are comprised of essays by curators and scholars including overviews and organization of collections, as well as artists biographies and historical context of their creation and collection. These guides are intended to provide context and depth to enable meaningful access to our collections."


Berkeley Art Museum/Pacific Film Archive
University of California
2625 Durant, Berkeley, CA 94720-2250

Electronic Binding Project (EBIND) - UC Berkeley Digital Page Imaging and SGML

[CR: 19970601] [Table of Contents]

"The Electronic Binding Project, or Ebind, is a method for binding together digital page images using an SGML document type definition (DTD) developed at UC Berkeley in 1996 by Alvin Pollock and Daniel Pitti. The Ebind SGML file records the bibliographic information associated with the document in an ebindheader, the structural hierarchy of the document (e.g., parts, chapters, sections), its native pagination, textual transcriptions of the pages themselves, as well as optional meta-information such as controlled access points (subjects, personal, corporate, and geographic names) and abstracts which can be provided all the way down to the level of the individual page. . . This SGML file acts primarily as a non-proprietary, international standards-based (ISO 8879) control file for the multiple image files which make up a digitized book or document. But it can also serve as the basis for browsing the images in any SGML-aware software system in a natural and convenient way." [from the main page]

"The structure of the Ebind DTD is based loosely on the Core tag set of the Text Encoding Initiative (TEI) DTDs. Like TEI, Ebind is divided into a bibliographic header, front matter, a body, and back matter. The front, body and back elements can themselves be divided into generic textual divisions called divs. A type attribute on the div element may specify the type of division more precisely, e.g., type='chapter'."


American Heritage Virtual Archive Project

[CR: 19970609] [Table of Contents]

"The American Heritage Virtual Archive Project will investigate one of the most serious problems facing knowledge seekers everywhere, the geographic distribution of both collections of primary source material and the written guides describing and providing access to them. We propose to solve this problem by creating a prototype "virtual archive", integrating into a single source, hundreds of archival finding aids describing and providing access to a large body of primary source materials from collections documenting American culture and history held by four major academic research libraries. The project will demonstrate the feasibility of providing both scholars and average American citizens with user-friendly, universal Internet access to the research collections of the world."

"...the project's participants have agreed to adhere strictly to standards. The collection-level records use the USMARC format. The finding aids are encoded with the Standard Generalized Markup Language Document Type Definition (SGML DTD) developed in the Berkeley Finding Aid Project. . . using X-Server software on networked workstations [researchers] will be able to search, browse, and navigate the finding aid database directly using DynaText software, or by launching DynaText from within the catalog record. Second, using a World Wide Web browser (for example, Mosaic or Netscape), they will have full searching and navigating access to the full DynaText database using Electronic Book Technologies' new product DynaWeb as a filtering device for converting the SGML to HTML. The virtual archive will thus be widely and publicly available."

Institutional participants include [June 1997]: Duke University, Stanford University, The University of California, Berkeley, and The University of Virginia.


Tim Hoyer, Project Manager
The Bancroft Library
University of California, Berkeley
Berkeley, CA 94720
Tel: (510) 643-3202

Yale University Library EAD Finding Aids Project

[CR: 19970520] [Table of Contents]

"The Yale University Library EAD (Encoded Archival Description) Finding Aids Project provides access to archival finding aids in a platform-independent electronic format, using SGML (Standard Generalized Markup Language). Finding aids are inventories, indexes, or guides that are created by archival and manuscript repositories to provide information about specific collections. While the finding aids created by repositories may vary somewhat in style, their common purpose is to provide detailed description of the content and intellectual organization of collections. Access to finding aids through the Internet will assist scholars in determining whether collections contain material relevant to their research."

"The finding aids mounted on Yale's server are encoded in SGML, using the beta version of the EAD DTD (Document Type Definition). SGML encoding of files allows for more complex formatting and navigation than does HTML. The EAD encoded versions of Yale's finding aids are configured to be viewed using SoftQuad's SGML browser, Panorama . . . The system includes finding aids from three Yale repositories: the Beinecke Rare Book and Manuscript Library, Yale Divinity Library Special Collections, and Yale University Library Manuscripts and Archives. There are currently 324 finding aids in the database and more will be added as retrospective encoding of existing finding aids takes place and as new finding aids are created." [from the online descriptions]


Yale University Library
New Haven, CT 06520

University of Iowa Library, Iowa Women's Archives

[CR: 19970316] [Table of Contents]

"The mission of the Louise Noun - Mary Louise Smith Iowa Women's Archives is to collect, preserve, and make available primary source material on the women of Iowa. Established in 1992, the archives is named for its founders, two prominent Des Moines women who conceived the idea of a repository that would collect solely on Iowa women and who worked to bring it to fruition."

"The archives has over 700 linear feet of materials including, but not limited to, letters, diaries and journals, memoirs, scrapbooks, reports, minutes, memoranda, speeches, photographs, audio and videocassettes, oral history interviews, slides, and films. A select number of finding aids have been encoded in SGML and HTML and can be accessed through this web page. The SGML files are best viewed with Softquad's Panorama Viewer, a Netscape plug-in for Windows95."



California Heritage Digital Image Access Project

[CR: 19961217] [Table of Contents]

The California Heritage Digital Image Access Project is building a digital archive containing photographs, pictures, and manuscripts from the collections of the Bancroft Library. The project is sponsored by the National Endowment for the Humanities and the Library, University of California at Berkeley, with a software grant from Electronic Book Technologies. Public access to the archive is provided through "embedding digital representations of the primary sources directly within the documents -- archival finding aids -- created by the Bancroft Library's curators and archivists to describe the collections of which they are part. . . the user of the California Heritage collection finds the objects of his search within an electronic version of the finding aid itself. This is made possible through the use of the emerging standard for archival finding aids, Encoded Archival Description (EAD)."

The SGML-encoded finding aids are served by a special Web server (Electronic Book Technologies' DynaWeb) that translates SGML into [HTML]. "The online finding aid collection can be searched using a rich and powerful query language. In addition to the full range of standard search tools--wildcards, proximity searching, boolean searching--users can search the underlying SGML with which all of the finding aids in this database are marked up." The DynaWeb browser interface also supports a "Stylesheet Selection Form" that allows Net users to configure the view of the information sent by the server (e.g., "default view," "Show Inline Thumbnails", "Show Images As Icons", "Display SGML Tags").

"The project's central objective has been to build a prototype demonstration database that, by the project's end, will provide collection-level access to 25,000 digital representations of primarily source materials documenting California history, which have been selected from the collections of The Bancroft Library. By creating this prototype system and making it available for testing on the Internet, the California Heritage Digital Image Access Project will be addressing important issues in digital image access and control. When fully developed, the prototype will also use USMARC collection-level records to provide access to its EAD encoded finding aids and digital images. While the USMARC access component of this project is still under development, direct access to the encoded finding aids is currently provided here through a WWW interface (DynaWeb) that lets users directly search and navigate the SGML encoded finding aids and digitized primary source materials." [from the Home Page, December 1996]


Daniel Pitti:

Harvard/Radcliffe Digital Finding Aids Project (DFAP)

[CR: 19961204] [Table of Contents]

"The Digital Finding Aids Project was established in February 1995 to plan and oversee the design and deployment of a new computer application system to store, search, and retrieve digital finding aids in SGML format format for all of the forty-nine (49) Harvard/Radcliffe repositories in a shared database. Presently, eight repositories are participating in the project: Baker Library (Business School), Design School, Divinity School, the Gray Herbarium and Houghton Library (Harvard College Library), Law School, Schlesinger Library on the History of Women in America (Radcliffe), and the Harvard University Archives."

The new DFAP WWW site "includes a history of the project to date; Harvard guidelines for using SGML for finding aids, based on the Encoded Archival Description (EAD), a proposed national standard; repository-specific versions of the guidelines; and a growing number of SGML-encoded finding aids."

"The Digital Finding Aids Project is the first step towards establishing a strong administrative structure for the creation and technical support of SGML-encoded finding aids at Harvard. The project is a collaborative effort involving curators, archivists, catalogers, electronic text specialists, and library systems people, engaging participants from across Harvard." [adapted from the announcement]

The project members say: "We are not alone in our belief that SGML finding aids will be a major force in creating the library of the future. The EAD is rapidly becoming a national standard. The Network Development and MARC Standards Office of the Library of Congress, working with the Society of American Archivists, will maintain the EAD as it does the MARC standard. The Council of Library Resources, with SAA, is committed to the development and publication of application guidelines for the EAD. And the National Digital Library Federation sees in the EAD a common standard enabling searches for digitized archival materials across collections and institutions. Products like HyTime, a hypermedia extension of SGML, can be used to link images and sound to the finding aid. Overall, SGML and the EAD promise future flexibility as such enhancements are demanded by the scholarly community." [from the History, "Finding Common Ground"]


Research Libraries Group (RLG) FAST Track: Finding Aids SGML Training

[CR: 19961212] [Table of Contents]

In July 1996, the RLG received $100,000 from the Delmas Foundation to "support RLG's project to train its members in applying SGML (Standard Generalized Markup Language) coding to their archival finding aids." The finding aids documents will become part of the RLG digital collections, a current example of which is Arches (Archival Server and Test Bed). RLG's Arches Project is devoted to building "the conceptual and physical foundation for collaborative projects that will create long-lasting, internationally accessible digital resources of all types." In early 1997, the Arches project plan calls for full-text searching capability that can take advantage of SGML-encoded texts.

The RLG FAST Track project is designed to "enhance national and international access to primary sources through digitized finding aids linked to their RLIN collection-level records. Finding aids are guides that provide detailed descriptions of the content of archival collections; they can form a valuable bridge between collection-level cataloging and whole information objects. The FAST (Finding Aids SGML Training) Track is a series of workshops designed to train RLG members in encoding their finding aids with Standard Generalized Markup Language (SGML) according to the Encoded Archival Description (EAD) standard and guidelines. . . RLG members trained in encoding their finding aids with SGML according to an international standard. Links in the RLIN database leading searchers all over the world to these finding aids. Widespread adoption of the evolving EAD standard. A growing body of valuable primary sources information accessible worldwide."

"With generous support from The Gladys Krieble Delmas Foundation and the Council on Library Resources, RLG has fostered the creation of training curriculum materials and has organized a series of eight regional workshops for members. The RLG FAST workshops build on the work of members and allies in the archival community. The University of California at Berkeley pioneered in developing the EAD Document Type Definition (DTD) through the Berkeley Finding Aids Project. The Bentley Library Research Fellows advanced the EAD standard to the alpha stage, and the EAD "Early Implementers" are helping to take it through the beta stage. The Library of Congress has agreed to be the ISO maintenance agency for this standard. The Society of American Archivists (SAA) has agreed to cosponsor the standard, form a working group to develop application guidelines, and help promote its use." [extracted]

"Four workshops were conducted in 1996, training 60 staff from 35 RLG member institutions through sessions held in St. Paul, Minnesota; at the National Archives II in College Park, Maryland; and at the British Library in London. The schedule for 1997 is: (1) January 9-10, 1997 -- Tarlton Law Library, University of Texas at Austin; (2) March 6-7, 1997 -- Emory University, Atlanta, Georgia; (3) April 28-29, 1997 -- Harvard University, Cambridge, Massachusetts; (4) June 9-10, 1997 -- Columbia University, New York, New York."


RLG Member Services
Attention: Fran Devlin
Tel: +1 415-691-2239
Tel: 1-800-537-7546
FAX: +1 415.964.0943
Research Libraries Group, Inc (RLG)
1200 Villa Street
Mountain View, CA 94041-1100 USA

Cheshire II Project and SGML (UC Berkeley)

[CR: 19960716] [Table of Contents]

"The Cheshire II project is developing a next-generation online catalog and full-text information retrieval system using advanced IR techniques. This system is being deployed in a working library environment and its use and acceptance by local library patrons and remote network users are being evaluated. The Cheshire II system was designed to overcome twin problems of topical searching in online catalogs, search failure and information overload. The system incorporates a client/server architecture with implementations of current information retrieval standards including Z39.50 and SGML." [from the main WWW page]

"The Cheshire II system is intended to provide a bridge between the purely bibliographic realm of previous generations of online catalogs and the rapidly expanding realm of full-text and multimedia information resources. To accomplish this goal, the Cheshire II system includes the following design features: (1) it supports SGML as the primary data base format of the underlying search engine. . ."


UCSD Archives Finding Aids Database (UC, San Diego)

[CR: 19960912] [Table of Contents]

The Mandeville Special Collections Library at the University of California, San Diego, has mounted a subset of the library's finding aids on the WWW in SGML and HTML coded versions.

"The database currently includes approximately 65 listings for manuscripts, personal papers, and UCSD records. Each listing contains a link to the SGML-coded finding aid, to the HTML-coded finding aid, and to the corresponding catalog record in ROGER WEB, the WWW version of UCSD's library catalog. In turn, each catalog record for items having a finding aid includes a link to the SGML finding aid, to the HTML finding aid, and to the "homepage" for the database, where one can search the entire database of SGML finding aids. The SGML part of the database is indexed and searchable by means of Verity Query Language. Also, corresponding records in OCLC's Intercat database include links to the two versions of the finding aids and to the database homepage."


Bradley D. Westbrook
Manuscripts Librarian / University Archivist
Mandeville Special Collections Library 0175S
Geisel Library
UC, San Diego
La Jolla, CA 92093
Tel: 619-534-6766
FAX: 619-534-5950

Durham University Library - EAD Finding Aids

[CR: 19971008] [Table of Contents]

In an October 1997 announcement from Richard Higgins (Durham University Library, Durham, UK), the availability of 100 (+) finding aids was advertised. The EAD (SGML) encoded finding aids for Durham University Library Archives and Special Collections are available via Dynaweb as HTML for ordinary WWW browsers. The Durham University Library's DynaWeb Server "is designed to serve HTML created on the fly from SGML documents, which at present are handlists for the holdings of Durham University Library Archives and Special Collections, which have been created in SGML using the EAD (Encoded Archival Description) DTD. Dynaweb should sense whether your browser is frames-enabled, and if so use a three frame layout; if not, it will format the response to appear within a single screen. If your browser has Javascript, ensure that it is enabled."



AQUARELLE: The Information Network on Cultural Heritage

[CR: 19970826] [Table of Contents]

AQUARELLE, "The Information Network on Cultural Heritage," is carried out by an international consortium gathering public and private cultural organisations. Two of its principal objectives are to "develop a resource discovery system for the cultural heritage information available in archive and folder databases, and to provide the technical facilities supporting information access through hypertext navigation as well as information retrieval by querying." The partnership is currently composed of cultural organisations, publishers, information technology companies, and research organisations; it is for "curators, urban and regional planners, publishers and researchers." Under Project Manager Dr Alain Michard, the project is sponsored by ERCIM, the European Research Consortium for Informatics and Mathematics, and supported by the European Commission, Telematics Applications Programme.

"At the technical level, the most relevant and important standards on which AQUARELLE rely are: (1) Z39.50, a protocol supporting distant access to documentation systems, and (2) SGML, a language supporting the specification of complex document structures. . .The main technical challenges of the project are to support query broadcasting to multilingual archive databases, to provide powerful resource-discovery facilities, and to provide tools for authoring and indexing of distributed hypermedia folders with specific SGML DTDs. The project will last for three years (1996 to 1998), with a total budget of 6,5 MECUS, and an EU-funding of 3 MECUS. It will coordinate with national and international bodies which have launched similar initiatives (e.g.: UNESCO, Getty Trust, CIMI, etc.)."

"SGML DTDs: Folders in Aquarelle version 1 will be based on two different SGML DTDs. The first one has been designed by the CIMI in Project CHIO, and is primarily adapted to process documents related to museum collections. The second one, developed under a French initiative to create digital versions of the "Dossiers de l'Inventaire Général", is adapted to document the built heritage of geographical areas. Both DTDs have been slightly modified to match the precise requirements of the cultural organisations involved in Aquarelle. For the second version of the system, merging these two DTDs into a single one will be considered. . .The Grif SGML editor will be tailored to the specific folder DTD's that will be defined. Besides being fully WYSIWYG, this editor works with any SGML DTD, includes a full API, has a sophisticated physical structure language, includes structured hypertext links and supports most content formats. It is widely used and recently a version of Grif for HTML, the Grif Symposia, has become available to the Internet." See the published report: C. Bounne, SGML Documents Type Definitions, Project AQUARELLE IE-2005, Deliverable Number D6.2, St Quentin, 13-November-1996.


Dr Alain Michard
Project Manager
BP 105
F-78153 Le Chesnay
Cedex, France

Project Silfide (Serveur Interactif pour la Langue Française, son Identité, sa Diffusion et son Étude)

[CR: 19971007] [Table of Contents]

"SILFIDE is a project of the CNRS and AUPELF-UREF. Its main objective, as an interactive server, is to give the whole university community working with language (linguists, teachers, computer scientists, ...) user-friendly and well-reasoned access to textual resources (whatever their origin, written or oral) through a network of computer servers and of activities that feed its functions. Because the project is carried out in a French-speaking setting, the base language is French. The user will be able to query the server through various indexes (access by author, title, language, ...). From this selection, SILFIDE offers different views of the results obtained before the text(s) are accessed directly or via the provider. The resources offered are all encoded in SGML and conform to the guidelines of the TEI (Text Encoding Initiative). SILFIDE's second role is to allow any user to work on the resources to which they have access by means of various tools (keyword search in the text, multilingual alignment, ...)" [from the description]


Network of Literary Archives (NOLA)

[CR: 19960229] [Table of Contents]

The intent of NOLA is to "focus on what may be called literary archives - the unpublished sources on major novelists, philosophers, musicians and painters. Some of these collections are kept by libraries, others by museums, archives and research institutions, in the private as well as in the public sector. NOLA will establish a common platform for cooperation on use of standards, tools and methods as well as the provision of mutual access to resources between libraries and other institutions."

"NOLA will build upon the work already carried out by the EU-funded Text Encoding Initiative (TEI) in defining Guidelines for the encoding and interchange of the widest range of machine-readable resources. The TEI is a major international project, with an unusually high degree of visibility, in North America and the Far East, as well as within Europe. It is the most successful attempt so far made to determine a comprehensive set of encoding standards, based on ISO standard SGML, which are of truly general applicability in realistically scaled projects."


Duke University: Special Collections Library, SGML Finding Aids

[CR: 19980121] [Table of Contents]

"The Duke University Special Collections Library has been an active participant in the Berkeley Finding Aids Project. We have put together this series of WWW pages as a demonstration of some of the benefits of encoding our finding aids, using SGML."

The collection contains many finding aids already encoded in SGML. Each collection name is a link to the SGML source file. If you are using Windows Mosaic (or Spyglass Mosaic), and have SoftQuad's free SGML viewer, Panorama, you will be able to view the finding aid as it is meant to be seen in an on-line environment, complete with a table of contents view, and a full text view. Panorama employs a browser that supports Object Linking and Embedding (OLE), so that it can interact with the Web browser to retrieve auxiliary files, such as the style sheet(s), navigator(s), catalog files, and the Document Type Definition (DTD). Three example documents (with extension .sgml) are provided in the list below:

Special Collections Library
Duke University
Box 90185
Durham, NC 27708-0185
Telephone: 919-660-5820 (office)
Tel: 919-660-5822 (research services reference desk)
Fax: 919-684-2855

University of Warwick Modern Records Centre, Finding Aids Project

[CR: 19980319] [Table of Contents]

"The Modern Records Centre is beginning to convert its finding aids to an electronic format. The finding aids are being encoded in the EAD DTD (Encoded Archival Description Document Type Definition) of SGML (Standard Generalised Mark-up Language). The Modern Records Centre is pleased to announce that it has finally placed an EAD encoded catalogue on the web site. EAD encoded documents may be viewed using Softquad's Author/Editor and Panorama PRO packages. We invite all members of the list and other interested parties to visit and to send in any comments or suggestions." [Adapted from the EAD page and from a posting by Alan Crookham to the EAD list, March 05, 1998]


Project ELSA (Electronic Library SGML Applications)

[CR: 19970316] [Table of Contents]

"Project ELSA is carrying out research into the use of documents in libraries which have been marked up in SGML format (Standard Generalized Markup Language). The project will construct an electronic store of documents which will take the form of a server on a network. Client computers will be able to access the material on the server, download it and make it available to librarians and end users for use and manipulation. The project will establish a detailed specification end user environment based on a client server model through consultation with librarians and end users. The specification and an examination of relevant standards will drive the construction of a user interface which will be produced to provide the ability to search, retrieve and view material. . . The project is funded through European Commission DG XIII and has three partners. Jouve Systemes d'Information (France), the lead partner, will provide the search engine, user interface software and client server software, Elsevier Science (Netherlands) will provide the documents and De Montfort University will develop the user interface and provide a test bed."

". . .Central to the project, there will be an electronic store of articles, which are coded in SGML (Standard Generalised Mark-up Language) with corresponding artwork. The documents are articles from approximately 100 Elsevier Science journals."

Catherine Lupovici
Jouve Systemes d'Information
Tel: +44 76 86 00
Fax: +44 76 86 10
OR The ELSA Project
Division of Learning Development
De Montfort University
Gateway Building 1.8D
The Gateway
Leicester LE1 9BH

UCLA - InfoUCLA Project, including ICADD SGML

UCLA has used SGML in a number of projects. From the historical overview: "In the summer of 1992, Simile [UCLA information processing group] began to experiment with SGML (Standard Generalized Markup Language) as a way to publish its data so that it can be accessed in a variety of ways, one of which is via the International Committee for Accessible Document Design (ICADD) document type definition, which through the use of enabling technologies makes electronic information accessible to people with disabilities. Of no small additional benefit is that using SGML for data representation can permit InfoUCLA's data investment to be preserved as future technologies become available. The initial experiments with SGML demonstrate Simile's intention to keep equal access to electronic information a central and ongoing concern."

"For the future, Simile continues to conduct its search for the perfect information client/server combination: one that allows browsing, searching, sorting of results with relevance feedback, transaction processing, use of non-Roman character sets, file and document delivery, image, sound, video, access to a number of different kinds of servers, and one that is compatible with all platforms. Future projects include a prototype client/server system based on Open Text Software's PAT database engine and LECTOR SGML browser, which will incorporate support for the ANSI Z39.50 search and retrieval protocol, bringing some of the College Library reserves online, and posting Internet training materials online."

Representative Poetry Project: University of Toronto

Representative Poetry is "a historical anthology of English poetry, from the early medieval period to the beginning of the twentieth century, which includes about 730 poems by roughly 80 authors. Based on the University of Toronto Press publication of the same name (3rd edition, revised, 1967)."

"Representative Poetry is available on-line in the World-Wide Web front page of the University of Toronto English Library (UTEL). It is encoded in Standard Generalized Markup Language (SGML) and is converted on the fly to Hypertext Markup Language (HTML) for Web browsers. The encoded collection is part of the TACT manual to be published by the Modern Language Association in 1995. The tagging guidelines for this collection give further details. . .Representative Poetry root files are encoded in Standard Generalized Markup Language (SGML), an ISO-sponsored syntax for the tagging of electronic documents, but are converted to Hypertext Markup Language (HTML) for display on the World Wide Web. SGML tags are useful for the exchange of electronic documents across many different computers and for the enrichment of texts with information. . .the poems are encoded in Standard Generalized Markup Language (SGML) to make explicit features of the original edition that were often unstated, such as line number (for long poems like Spenser's Faerie Queene), stanza number, type of text (including heading, subheading, epigram, note, speech prefix, and stage direction), and language. This kind of tagging assists in text analysis and retrieval." [from the introductions]

See the related work at the University of Toronto in the 'Renaissance Electronic Texts' series ("A series of old-spelling, SGML-encoded editions of Renaissance books and manuscripts (with critical introduction), transcriptions of basic texts, and supplementary studies, published on the Internet as a free resource for students of the period. [General Editor: Ian Lancashire]"). The URL: RENAISSANCE ELECTRONIC TEXTS:
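The on-the-fly conversion described above for Representative Poetry (tagged source files rendered as HTML for Web browsers) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the element names (`poem`, `title`, `stanza`, `l`), the sample text, and the HTML layout are hypothetical and are not the project's actual tagging guidelines or software, and well-formed XML stands in for SGML because Python's standard library has no SGML parser.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical poem markup; well-formed XML used as a stand-in for SGML.
POEM = """<poem lang="en">
  <title>A Sample Lyric</title>
  <stanza n="1">
    <l n="1">First line of the sample</l>
    <l n="2">Second line &amp; an ampersand</l>
  </stanza>
</poem>"""


def poem_to_html(source: str) -> str:
    """Render the hypothetical poem markup as an HTML fragment.

    Stanza and line numbers carried by the markup are preserved as
    attributes, so the display format is decided here, not in the source.
    """
    root = ET.fromstring(source)
    lang = escape(root.get("lang", ""), quote=True)
    parts = [f'<div class="poem" lang="{lang}">']

    title = root.findtext("title")
    if title:
        parts.append(f"<h2>{escape(title)}</h2>")

    for stanza in root.findall("stanza"):
        stanza_n = escape(stanza.get("n", ""), quote=True)
        parts.append(f'<p class="stanza" id="stanza-{stanza_n}">')
        for line in stanza.findall("l"):
            line_n = escape(line.get("n", ""), quote=True)
            text = escape(line.text or "")
            parts.append(f'<span class="line" data-n="{line_n}">{text}</span><br>')
        parts.append("</p>")

    parts.append("</div>")
    return "\n".join(parts)


if __name__ == "__main__":
    print(poem_to_html(POEM))
```

Because the source records structure rather than appearance, the same file could be rendered differently (plain text, a concordance, another HTML layout) by swapping only the conversion function.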

Ian Lancashire
Department of English, New College
Centre for Computing in the Humanities
Robarts Library
University of Toronto
Toronto, Ont. M5S 1A1

Stanford University - Academic Information Resources (AIR) and Academic Text Service (ATS)

[CR: 19960323] [Table of Contents]

Stanford University's Academic Text Service (ATS) hosts a number of informative resources for researchers wishing to prepare electronic texts using SGML encoding. ATS is "now introducing Web-based access to a number of electronic texts, and by the beginning of the 1996-1997 academic year we plan to make all of Stanford University's electronic texts available over the Web. [Due to licensing restrictions, these texts are currently limited in use to the Stanford community only.] The texts that ATS is delivering are encoded in SGML, using a document type definition (DTD) that complies with the TEI."


Stanford University
Academic Text Service (ATS)
315 Sweet Hall
phone: (415) 725-3163

Project PREMIUM (PRoduction of Electronic Materials through International and Uniform Methods)

[CR: 19951015] [Table of Contents]

Project PREMIUM is an academic project closely affiliated with the Department of Humanities Computing of Groningen University. It has a significant SGML emphasis, as described in the project plan, referenced below. The project is commissioned by Stichting SURF (SURF Foundation) and executed by SURFnet bv during 1995.

CELT (Corpus of Electronic Texts) - University College Cork [was:CURIA Project]

[CR: 19971126] [Table of Contents]

"The CELT database is a growing corpus of literary, historical, social, and political texts relating to Ireland and Irish culture. The text is encoded using the TEI DTD, and is available online in SGML, HTML, plaintext, and PostScript."

The CELT Project (Corpus of Electronic Texts), an online resource for contemporary and historical Irish documents in literature, history and politics, is centered at University College, Cork, Ireland. "The CELT project grew out of the joint involvement of the Department of History and the Computer Centre in a number of related text projects over many years, including the project formerly known as CURIA, the Peritia and Chronicon journals, and the History Ireland project. The CELT project aims to produce an online database of contemporary and historical topics from many areas, including literature and the other arts. It will provide material for the greatest possible range of readers, researchers, academic scholars, teachers, students, and the general public. The texts can be searched, read on-screen, downloaded for later use, or printed out. Texts are taken from the best printed editions, scanned, and proofread. Markup for structural and analytic features is added according to the recommendations of the Text Encoding Initiative (TEI). Conversions to HTML are made for online reading in the World-Wide Web, and the master files can be used to create versions in other formats, and for contextual searching, concordancing, and other analyses."

Links to the CELT Project:

Margaret Lantry
Managing Editor
CELT Project
Computer Centre
University College
Tel: +353-21-902736

Chadwyck-Healey: English Poetry Full-Text Database, Patrologia Latina [other full-text databases]

[CR: 19950831] [Table of Contents]

All of Chadwyck-Healey's full-text databases are SGML encoded. The description for the English Poetry Full-Text Database is illustrative of the philosophy behind the C-H endeavor.

"English Poetry is a machine-readable full-text database encompassing [about 165,000] works of 1,350 poets from the Anglo-Saxon period to the end of the nineteenth century, available commercially from Chadwyck-Healey. [The poetic canon draws upon 4,500 printed sources.] It is the largest and most accessible full-text database yet published in the humanities. The great size and chronological span of English Poetry and the consistent coding of all the texts make it a valuable resource for research, teaching and reference."

"An important feature of English Poetry is the use of Standard Generalized Markup Language (SGML) for the coding of the texts. This internationally recognised coding language, specified in ISO 8879, greatly enhances the value of the database to researchers: (1) SGML coding breaks down the full text into its structural elements so that each of these elements can be searched or manipulated separately; (2) It provides a standard method of text capture and description which allows the encoded data to be interchanged or combined with other SGML-coded texts. Users can merge texts from the database with their own texts for analysis. Texts from English Poetry can also be used with any SGML-compatible software; (3) Most traditional coding languages have been designed with specific ends in mind, e.g. database manipulation or typesetting. SGML, however, is deliberately neutral so that the output format is determined by the processing software employed, not by the coding itself."

"The SGML encoding scheme to be used by English Poetry is closely modelled on that being developed by the Text Encoding Initiative (TEI), the international research project sponsored by the Association for Computers and the Humanities, the Association for Literary and Linguistic Computing and the Association for Computational Linguistics, and jointly funded by the US National Endowment for the Humanities, the Commission of the European Communities (DG XIII) and the Andrew W. Mellon Foundation."

"Examples of elements distinguished by the English Poetry encoding scheme include structural units such as volume, part, book, etc down to the level of individual poems or groups of poems, stanzas and lines. Titles, headings, refrains, prologues, notes, etc are clearly distinguished from texts, and prose from verse. Page divisions, use of typographic emphasis and indentation are also all clearly marked. For verse dramas, scene, act, speaker, stage instructions and cast list are coded." [extracts from the database overview online at the University of Virginia]

PC versions of the English Poetry database use a modified version of the SGML browser DynaText, developed by Electronic Book Technologies. The license fees charged by Chadwyck-Healey for institutional/mainframe use of the database are rather high.

Other links. Chadwyck-Healey plans to have a dedicated WWW server soon. Meantime, from an experimental page, here are some descriptions of the electronic databases of probable interest to humanities scholars. Several (which ones?) of the databases use SGML encoding and can be licensed for use with the Windows-based SGML-aware software based upon DynaText.

  • Voltaire électronique. "This important new database will form the most complete and up-to-date critical collection of Voltaire's works. . .In addition to the 24 volumes so far published in the Voltaire Foundation's printed edition of the Oeuvres complètes de Voltaire, it will include all Voltaire's other texts which have been prepared in electronic form by the Voltaire Foundation but not yet published in print. . .All texts have been prepared by the Voltaire Foundation with SGML-coding and the database will also be published as SGML-encoded data on magnetic tape, for institutions wishing to network it with their own retrieval software."
  • The English Poetry Full-Text Database. "English Poetry contains over 165,000 poems drawn from 4,500 printed sources. It is, in essence, a database of the complete English poetic canon from Anglo-Saxon times to the end of the nineteenth century."
  • English Poetry Plus on CD-ROM. "5,000 poems by 314 poets from Chaucer to the end of the nineteenth century with biographies, portraits, background illustrations and recorded readings."
  • English Verse Drama: the Full-Text Database. "English Verse Drama complements and extends the pioneering English Poetry Full-Text Database. It adds six centuries of poetry intended for the stage to the thirteen centuries of poetry already available. English Verse Drama contains more than 1,500 works by around 450 named authors and approximately 230 anonymous works, from the Shrewsbury Fragments of the late thirteenth century through the unparalleled output of the Elizabethan and Jacobean period to the end of the nineteenth century."
  • Patrologia Latina Database. "The Patrologia is divided into the Patrologia Latina and the Patrologia Graeco-Latina. The Patrologia Latina covers the works of the Latin Fathers from Tertullian in 200 A.D. to Pope Innocent III in 1216. In 221 volumes, it covers most major and minor Latin authors and contains the most influential works of late ancient and early medieval theology, philosophy, history, and literature. Patrologia Latina Database is the full-text electronic version of the Patrologia Latina, including all prefatory material, original texts, critical apparatus and indexes. Illustrations are also included, as are Migne's column numbers - essential references for researchers."
  • The American Poetry Full-Text Database. "The poetical works of over 200 authors from the earliest American poets of the seventeenth century to the beginnings of modernism. . .with more than 30,000 poems. The complete text of each poem is included. Any accompanying text written by the original author and forming an integral part of the work, such as notes, dedications and prefaces to individual poems, is also generally included."
  • African-American Poetry, 1760-1900. "More than 2,500 poems written by African-American poets in the late eighteenth and nineteenth centuries."
  • Editions and Adaptations of Shakespeare. "The complete text of eleven major editions of Shakespeare's works from the First Folio to the Cambridge edition of 1863-6, twenty-four separate contemporary printings of individual plays, selected apocrypha and related works and more than 100 adaptations, sequels and burlesques from the seventeenth, eighteenth and nineteenth centuries. You can search the text quickly and precisely within and across editions and display different versions simultaneously on screen. Editions and Adaptations of Shakespeare provides a complete and accessible concordance to Shakespeare's works and enables different editions to be compared as never before."
  • Eighteenth-Century Fiction. "Eighteenth-Century Fiction contains a selection of works in English prose from the period 1700-1780, by writers from the British Isles. All of the most widely-studied texts are in the database, alongside others which are being made available for the first time in many years. The complete text of each work is included, and any accompanying authorial material, such as footnotes or introductions. . . Searches can be made on any word or phrase in the whole database or be limited to an individual author or work and can be further restricted to the text alone or to critical apparatus or secondary authorial matter."
  • English Prose Drama. "English Prose Drama contains more than 1,800 plays by approximately 400 authors from the Renaissance to the end of the nineteenth century. As with English Verse Drama, the bibliographic basis is the New Cambridge Bibliography of English Literature, CUP 1969-72. Early, and where appropriate, collected editions are generally selected. If a contemporary edition is considered unreliable, a later edition may be used. In certain cases, modern critical editions are included. . . As with English Verse Drama, you can search on any word or phrase in the text or title of any drama in the entire database. The detailed encoding of the texts provides you with a wide choice of search criteria, enabling you to select and view, from a vast body of text, just the lines or passages of interest to you."

See now [September 1995] links to the CH WWW site:

Chadwyck-Healey contact address:

Chadwyck-Healey, Inc.
1101 King Street
Alexandria, VA 22314
Phone: (800) 752-0515
Fax: (703) 683-7589

Cambridge University Press Electronic Editions

[CR: 19970429] [Table of Contents]

Several of the Cambridge University Press Electronic Editions use SGML encoding and the DynaText SGML searching/browsing software. In addition to the rich hypertext navigation, SGML encoding permits very specific searches within the texts, including: (1) restricting queries to particular SGML elements, or (2) enabling queries that address the hierarchical text "structure" itself. In addition, the search software supports traditional query logic (proximity, context regions, boolean operators, substrings, wild card characters, etc.). Examples of CUP's SGML encoded electronic texts are given below.
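The two kinds of structure-aware query mentioned above (restricting a search to a particular element, and addressing the hierarchy itself) can be sketched in a few lines. This is only an illustration under stated assumptions: it does not use DynaText or any CUP data, the tag names (`play`, `act`, `scene`, `stage`, `sp`, `speaker`, `l`) and the sample text are hypothetical, and well-formed XML stands in for SGML because Python's standard library has no SGML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified drama markup (XML standing in for SGML).
SAMPLE = """
<play>
  <title>A Sample Play</title>
  <act n="1">
    <scene n="1">
      <stage>Enter the King, with a crown.</stage>
      <sp><speaker>King</speaker><l>Uneasy lies the head that wears a crown.</l></sp>
    </scene>
    <scene n="2">
      <sp><speaker>Fool</speaker><l>A crown is but a hat that lets the rain in.</l></sp>
    </scene>
  </act>
</play>
"""


def search(root, word, element=None, within=None):
    """Return (tag, text) pairs for elements whose own text contains `word`.

    element: if given, only elements with this tag name are considered
             (query restricted to a particular element type).
    within:  if given, an ElementPath expression relative to `root`; only
             descendants of the matched nodes are searched (query that
             addresses the document hierarchy).
    """
    scopes = root.findall(within) if within else [root]
    needle = word.lower()
    hits = []
    for scope in scopes:
        # iter(None) walks every element; iter("l") walks only <l> elements.
        for node in scope.iter(element):
            text = (node.text or "").strip()
            if needle in text.lower():
                hits.append((node.tag, text))
    return hits


root = ET.fromstring(SAMPLE)

if __name__ == "__main__":
    print(search(root, "crown"))                               # anywhere
    print(search(root, "crown", element="stage"))              # stage directions only
    print(search(root, "crown", within="act/scene[@n='2']"))   # one scene only
```

A plain full-text index could answer only the first query; the other two depend on the element boundaries recorded in the markup.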

For Samuel Johnson: Dictionary of the English Language on CD-ROM: "The CD-ROM contains: A complete transcription of the text of the First Edition of Johnson's Dictionary of 1755 (c. 40,000 entries), with full SGML coding and page breaks following the original. A complete transcription of the text of the Fourth Edition of Johnson's Dictionary of 1773 (c. 40,000 entries), with full SGML coding and page breaks following the original. Digitised images of the original printed pages of both First and Fourth Editions. A DynaText search and retrieval engine with user-friendly screen design."

For Chaucer: The Wife of Bath's Prologue: "The CD-ROM contains: (a) Transcriptions of all 58 pre-1500 witnesses of Chaucer's Wife of Bath's Prologue (54 manuscripts and 4 early print editions), fully encoded in SGML (Standard Generalised Mark-up Language); (b) Digitized images of the originals of 1,200 manuscript pages, useful for teaching purposes or for checking the accuracy of any transcription against the original; (c) Hypertext linking between files allowing the reader to call up the full collation of any word across the entire range of witnesses; (d) A lemmatized spelling database permitting scholars to trace every spelling of every word across the witnesses; (e) A description of each witness based on examination of the witness itself; (f) Transcripts of all of the glosses in every manuscript of the Prologue. (g) A powerful DynaText search and retrieval engine with user-friendly screen design."


Cambridge University Press
Edinburgh Building
Shaftesbury Road
Cambridge CB2 2RU
Tel: +44 1223 312393
Email: (Kevin Taylor)

The Electronic Arden Shakespeare: Texts and Sources for Shakespeare Studies

[CR: 19961119] [Table of Contents]

The Arden Shakespeare is a major SGML-based reference resource for Shakespeare scholars, originally developed as an electronic text project by Database Publishing Systems Ltd of Swindon, UK in conjunction with Routledge, where Brad Scott was the Electronic Project Manager. Thomas Nelson will soon [November 1996] be distributing the electronic text as The Arden Shakespeare CD-ROM. The project's consulting editor was Jonathan Bate of the University of Liverpool, UK. While the SGML encoding was done by DPSL, the academic involvement comprised considerable detailed planning with various Shakespeareans about what should be on the CD, and what functionality it should have. A distinguished group of international scholars created a unique design to support high-level research across the entire corpus. SGML encoding permits searching within collections and tagged units: within all plays, within specific plays (acts, scenes), within groups of plays selected by the user, within prose or verse sections, within songs, asides, stage directions and speech prefixes, etc.

The Arden Shakespeare CD-ROM "contains the complete works of Shakespeare in the Arden edition (plays as well as poems and sonnets), together with commentary and variant notes. Related texts are synchronized in a scrolling multiple-window display based upon a customized DynaText SGML browser application. The database includes images of the key early editions, which are also synchronised with the play texts. In addition, it includes a large amount of source material on which Shakespeare probably drew; the most important of these are linked to the individual scenes in the plays. There are a few other useful reference works in there as well. The core texts are thus supplemented by reference and supporting materials which, together, help undergraduates and researchers carry out in-depth research into the works, Shakespeare's sources and the modern editor's reading of the text." [adapted from online descriptions and communiqué from Brad Scott]

The Arden Shakespeare CD-ROM is an integrated database that includes: (1) The complete modern text of every Shakespeare play from The Arden Shakespeare 2nd edition; (2) Facsimile images of each page of first Folios and appropriate early Quarto texts, the key source texts on which any edition must be based; (3) Poems and Sonnets; (4) Bullough: Narrative and Dramatic Sources of Shakespeare to examine major sources for an act, scene, or play, and resonances within the texts; (5) Bibliography: Bevington's Shakespeare with details of over 4,600 titles for further research; (6) Abbott: A Shakespearean Grammar; (7) Onions/Eagleson: A Shakespeare Glossary; (8) Partridge: Shakespeare's Bawdy; (9) Introductions, notes, appendices and variants from The Arden Shakespeare editions; (10) Original illustrations from all texts; (11) Chronology of works (with caveats); (12) A fact table on each play including dates, place in the corpus, and more.


Thomas Nelson
Attention: Roda Morrison
Nelson House
Mayfield Road
Surrey KT12 5PL
Tel: 01932 252211
FAX: 01932 252497
Attention: Brad Scott, Electronic Project Manager
11 New Fetter Lane
London EC4P 4EE
tel: 0171 842 2134
fax: 0171 842 2299

University of North Carolina at Chapel Hill. Documenting The American South: The Southern Experience in 19th Century America

[CR: 19961201] [Table of Contents]

"This database presents primary sources documenting the culture of the American South from the viewpoint of Southerners. We plan to scan and encode texts and digitize images so that faculty and students at colleges, universities, and even secondary schools throughout the South - and the world - can use them. This database is the first stage of a larger project to document the cultural history of the American South. It will offer diaries, autobiographies, travel accounts, titles on slavery and regional literature drawn from the splendid Southern holdings of the UNC--CH Academic Affairs Library. We have begun with testimonial materials because students use them heavily, and we believe they are of interest to a larger audience."

"All the selected materials are encoded according to the Text Encoding Initiative (TEI P3) SGML-based Guidelines, using TEILite.DTD (version 1.6). Please use SoftQuad PanoramaPro, the SGML-browser for the WWW, or its free version - SoftQuad Panorama Free - to read the files."


Academic Affairs Library
University of North Carolina at Chapel Hill
Chapel Hill, NC 27599 USA

English-Norwegian Parallel Corpus Project

[CR: 19951128] [Table of Contents]

The English-Norwegian Parallel Corpus project (ENPC) is a research endeavor carried out primarily under the sponsorship of the Norwegian Computing Centre for the Humanities and the Department of British and American Studies, University of Oslo. All texts prepared for computer processing are encoded according to the recommendations of the Text Encoding Initiative (TEI), as specified in TEI P3. "The document type definition for the texts in the corpus differs in some respects from the TEI model. The differences are, however, mainly additions to the TEI model; a few new tags and entities have been introduced. These tags and entities can be found in the files ENPC.DTD and ENPC.ENT respectively. Together with ENPC.TXT, which invokes the appropriate TEI tag sets, they constitute the complete ENPC tag set."

"The aim of the project is (1) to compile a parallel corpus of English and Norwegian texts for computer processing; (2) to develop tools for analysing parallel texts; and (3) to carry out studies of the structure and communicative use of the two languages on the basis of the corpus. The project is carried out in cooperation with a research group at the University of Lund (headed by Bengt Altenberg and Karin Aijmer) and with similar research teams in Belgium, Denmark, Finland, and Germany. Cooperation has also been established with HarperCollins Publishers, who will provide some English texts in machine-readable form. Through the cooperation with other contrastive teams, the study can be extended to multilingual comparison."

Some primary links:

Department of British and American Studies
University of Oslo
P.O. Box 1003, Blindern
N-0315 OSLO

Norwegian Computing Centre for the Humanities
University of Bergen
Harald Hårfagres gt. 31

ETAP - Uppsala University Parallel Corpus Project

[CR: 19961216] [Table of Contents]

ETAP is "one of the research projects at the department of Linguistics in Uppsala University, Sweden: 'Etablering och annotering av parallelkorpus för igenkänning av översättningsekvivalenter' (in English: 'Creating and annotating a parallel corpus for the recognition of translation equivalents'). This project is a part of The Stockholm-Uppsala Research Programme 'Translation and Interpreting - A Meeting between Languages and Cultures' financed by the National Bank of Sweden (Riksbankens Jubileumsfond). . . Text structure in the documents in the two corpora has been automatically marked up with TEI Lite conformant SGML by means of software developed in the project." [from the Home Page]

"The basic aim of the project is to develop a computerized multilingual corpus that can be used in bilingual lexicographic work and in methodological studies directed towards the automatic recognition and extraction of translation equivalents from text. The corpus will comprise Swedish source text representing different styles and domains with translations into several languages. A basic requirement on the corpus is to have it word class tagged and aligned, primarily, sentence by sentence. By October 1996, the project has resulted in two parallel, aligned, subcorpora, the Scania Corpus and the Swedish Statement of Government Policy Corpus."

"The Scania corpus is a collection of truck manuals which were orginally written with the word processing program Framemaker. The documents in the corpus are available in eight languages. In order to be able to process the corpus for linguistic purposes we have converted the documents in the corpus to TEI SGML." [from abstract, E. F. Tjong Kim Sang]


Anna Sågvall Hein
Department of Linguistics
Uppsala University
Box 513
S 751 20 Uppsala
Fax: +46-18181416
Institutionen för Lingvistik
Box 513
751 20 Uppsala
phone: +46 18 18 11 13
fax: +46 18 18 14 16

Electronic Thesis and Dissertation Project

[CR: 19970812] [Table of Contents]

The ETD (Electronic Thesis and Dissertation Project) is coordinated through Virginia Polytechnic Institute and State University, sponsored partially through the Southeastern Universities Research Association (SURA). The ETD Project management believes that SGML "is the logical solution for the long-term problem of preparing and archiving electronic documents." "The concept of electronic theses and dissertations (ETDs) was first openly discussed at a 1987 meeting in Ann Arbor arranged by UMI, and attended by representatives of Virginia Tech (Ed Fox from Computer Science and Susan Bright from the Computing Center), University of Michigan, SoftQuad, and ArborText. As followup, Virginia Tech funded development of the first SGML Document Type Definition (DTD) for this purpose, by Yuri Rubinsky of SoftQuad."

A current submission option in the Electronic Thesis and Dissertation Project is "for ETDs encoded in the Standard Generalized Markup Language (SGML). SGML was designed to encode electronic documents that are portable across platforms. It is the logical solution for the long-term problem of preparing and archiving electronic documents. SGML documents may encode diverse document structures including headings and paragraphs, but also including structures as complex as music and virtual reality. In fact, the HyperText Markup Language (HTML) is an application of SGML for documents available on the World Wide Web."


Electronic Thesis and Dissertation Project
Virginia Tech
Blacksburg, Virginia
Edward A. Fox, Computing Center
John L. Eaton, Graduate School
Neill Kipp, Computer Science

Electronic Theses and Dissertations: Additional Materials

Some URLs germane to (SGML) encoding in theses and dissertations [970913]:

Princeton University: The Charrette Project

[CR: 19950903] [Table of Contents]

"The Charrette project is a complex, scholarly, multi-media electronic archive containing a medieval manuscript tradition -- that of Chrétien de Troyes's Le Chevalier de la Charrette (Lancelot, ca. 1180). It is developed and maintained by the Department of Romance Languages, Princeton University." A leading role is played by Karl D. Uitti, The John N. Woodhull Professor of Modern Languages, in the Department of Romance Languages and Literatures, Princeton University.

The text of this Old French narrative verse romance "composed by Chrétien de Troyes around 1180, tells the tale of Lancelot and his love for King Arthur's wife Guenevere. This seminal text was recast into the Old French prose Lancelot in the thirteenth century, the primary source for Malory's Morte d'Arthur, which in turn has been the source of modern retellings of the Arthurian legend, including Tennyson's Idylls of the King, White's Once and Future King and Renault's popular novels." It now contains about 7100 verses, and modern editions are based upon 8 mss from the 13th century.

The archive includes transcriptions in TEI P3 SGML: "Electronic diplomatic transcriptions of the eight manuscripts of the Charrette tradition with markup for physical representation and for rhetorical features. The electronic transcriptions include markup for manuscript abbreviations, illuminations, marginal and interlinear text as well as for rhetorical features such as rich rhyme, chiasmus, and adnominatio. The transcriptions follow closely the manuscript sources, preserving source word spacing, punctuation, and textual variation. We have adopted the tagging protocol provided by the Text Encoding Initiative, version P3, a dialect of the Standard Generalized Markup Language."

Using PAT 4.0 and the TEI-SGML encoding notations, the texts are searchable online using specifications for "dictionary form" and "part of speech" in addition to the customary string searching strategies. A forms-based retrieval alternative allows the user to search specifying variable lines of context.
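To make the idea of searching by "dictionary form" and "part of speech" concrete, here is a small Python sketch. It does not use PAT and does not reproduce the Charrette archive's actual tagging: the sample line, the w element, its lemma and pos attributes, and the function find_words are all hypothetical, and the Old French forms are invented for illustration rather than taken from the manuscripts.

```python
import xml.etree.ElementTree as ET

# Invented sample line. The <w> element and its "lemma" and "pos" attributes
# are assumptions for this sketch, not the project's real encoding.
LINE = (
    '<l n="1">'
    '<w lemma="chevalier" pos="noun">chevaliers</w> '
    '<w lemma="estre" pos="verb">fu</w> '
    '<w lemma="preu" pos="adj">preuz</w>'
    '</l>'
)

def find_words(root, lemma=None, pos=None):
    """Return the surface forms of <w> elements matching a lemma and/or part of speech."""
    return [
        w.text
        for w in root.iter("w")
        if (lemma is None or w.get("lemma") == lemma)
        and (pos is None or w.get("pos") == pos)
    ]

root = ET.fromstring(LINE)
forms_of_estre = find_words(root, lemma="estre")
nouns = find_words(root, pos="noun")
```

Because the dictionary form is carried in the markup, a query for one lemma can retrieve every spelling a scribe used for it, which ordinary string matching on the transcription would miss.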

RIDDLE Project - Rapid Information Display and Dissemination in a Library Environment

RIDDLE (Rapid Information Display and Dissemination in a Library Environment) is sponsored by the Libraries Programme of the Commission of the European Communities' TELEMATICS programme.

"Technical approach: Following consultation within the European library community, a set of user requirements has been produced. The project has investigated the current state of the art in the areas of scanning technology, Optical Character Recognition (OCR), and automatic markup/translation. Information from a wide variety of different scientific journals has been analysed and a system design produced which can be applied to any target OLC. International, industry and formal standards such as the Standard Generalized Markup Language (SGML, ISO 8879) have been examined. The catalogue at CWI forms the basis for the pilot which demonstrates the success of the concept."

The RIDDLE Project's evaluation of SGML software has resulted in a valuable published appraisal of post-OCR tagging/conversion technologies using products like SGML Translator (Shaftstall Corporation), TagWorX (Xerox Information Systems), IntelliTag (WordPerfect Corporation), TagWrite (Zandar), OmniMark (Exoterica Corporation), and FastTag (Avalanche Corporation). These studies are presented in the project's fourth deliverable, WP4 - Translation of contents pages text to Online Library Catalogue format. See the full bibliographic entry for this document and an overview article by the developers.

Net links:

Department of Computer Science and Information Systems University of Jyväskylä

Address: Department of Computer Science and Information Systems; University of Jyväskylä; Mattilanniemi, Building MaD, 3rd floor; P.O.Box 35; 40351 JYVÄSKYLÄ FINLAND; Telefax: (+358 41) 603 011

Katholieke Universiteit Leuven - Document Architectures Research Unit

[CR: 19961119] [Table of Contents]

SGML research of the KU Leuven Document Architectures Research Unit includes participation in the CAPS, HARMONY, MATHS, and DigiBook projects, as summarized in the extracts and links below. Some of this research is financed under the Technology Initiative for Disabled and Elderly people (TIDE) programme sponsored by the Directorate-General XIII for Telecommunications, Information Industries and Innovation of the E.U. Primary TEO research participants include: dr. ir. Jan Engelen (project head), with three research assistants: ir. Bart Bauwens, ir. Filip Evenepoel and ir. Geert Bormans.

"The research group on Document Architectures originated from within the former division of Applied Electronics and Optics (Toegepaste Elektronica en Optiek ) within the Department of Electronics at the Katholieke Universiteit Leuven. The electronic research in the Applied Electronics and Optics group was mainly focused on the hardware research in the field of rehabilitation engineering, more specific, on the design of braille printers."

"Document architectures are formal computer languages to describe the logical structure of an electronic document. There are two ISO Standards which describe such document architectures. The first, and most used, standard is called the "Standard Generalized Markup Language", abbreviated by SGML. The second standard is called "Open Document Architecture" or "Office Document Architecture" (ODA). This Research Group in the field of Information Technology focuses on the study and application of ISO standards for Document Architectures, especially those related to SGML." [from the Home Page]

MATHS Project: Mathematical Access for Technology and Science for Visually Disabled Users. "The MATHS Project started in January 1994 and will run until December 1996. . . The MATHS workstation is based on a number of different technologies which are brought together to provide the benefits of their combined facilities. It is important that the workstation should give access to a range of material rather than just to material input by using the workstation itself. This requires a standard format to be used. The project is based upon SGML which provides an international standard for describing the structure of documents. A document conforms to a specific template (or DTD) written within SGML which defines the components which can appear in that document; for example, a DTD for books would consist of author details, title page, chapters etc. This project requires a DTD for describing mathematics. The one which has been chosen is a variation on one of the EUROMATH DTDs. A new development from SGML is HTML, which extends the use of the language to the description of hypertext documents. This has become important as the description language for documents on the World Wide Web. An extension of the MATHS workstation can therefore be used to provide browsing facilities on the Web."

DigiBook Project. "The DigiBook Project started in December 1994 and will run until November 1996. This action is financed under the Cooperation in Science and Technology with Central and Eastern European Countries (COPERNICUS) programme by the Directorate-General XIII for Telecommunications, Information Industries and Innovation of the E.U. The TIDE Office is responsible for co-ordinating the activities for this project." "It is the overall aim of this project, DigiBook, to improve the production of talking books and structured electronic texts for use by reading impaired persons. This can be realised through the application of digital speech processing and document structuring based on the Standard Generalized Markup Language (SGML, ISO 8879)."


Electronic Thesaurus Linguae Latinae

[CR: 20050319] [Table of Contents]

Note: References in this section are under revision.

[March 19, 2005] Updated information from the Commission for the Publication of the Thesaurus Linguae Latinae [Kommission für die Herausgabe eines Thesaurus Linguae Latinae der Bayerischen Akademie der Wissenschaften], October 27, 2004 or later:

"The task of the institute is to produce the first comprehensive scholarly dictionary of the ancient Latin language from the earliest times down to AD 600. The work is based on an archive of about 10 million slips which takes account of all surviving texts. In the older texts there is a slip for each occurrence of each word; the later ones are generally covered by a selection of lexicographically relevant examples. Nowadays this material is supplemented by the use of modern data-banks. The first fascicule appeared in 1900; two thirds of the work have already been published.

The project is staffed by about 20 Latinists, both German and non-German. Among them are holders of scholarships, who work at the Thesaurus for a limited period as part of their academic training. The institute's archive and its unique library are consulted by scholars from the whole world. Since 1949 the Thesaurus has been an international project, in which at present more than 20 academies and learned societies from three continents participate..."

As of October 2004, work on the Thesaurus Linguae Latinae was hosted at the Bayerische Akademie der Wissenschaften, Marstallplatz 8, D-80539 München, Germany; Tel: +49 89 23031 1160; Fax: +49 89 23031 1275; Email Contacts: Vertretungsberechtigter: Professor Dr. Ernst Vogt, Kommissionsvorsitzender; Redaktionelle Verantwortung: Dr. Hugo Beikircher, Generalredaktor; Technische Verantwortung: Dr. Johann Ramminger.

[February 17, 1997] Historical. "Recently work has begun on developing an electronic version of the Thesaurus Linguae Latinae, or TLL. This work is being done under the auspices of the Consortium for Latin Lexicography (CLL). An electronic TLL on CD-ROM has the potential to be an extremely valuable tool, bringing the advantages of computerized search to the wealth of information in TLL. With an electronic TLL, users will be able to find and examine articles that meet the criteria they specify--for example, all articles that cite a certain author or work. It will be possible to retrieve articles of interest with a speed and thoroughness that is impossible with the printed version."

"The CLL has decided to use the Text Encoding Initiative (TEI) implementation of SGML (Standard Generalized Markup Language) as functional tags to distinguish the different types of information found within TLL articles. TEI offers features that we need in our tags for the TLL, such as functional, as well as formal, definitions, and the use of both start and end tags to surround an element; it also has the advantage of being an emerging standard in tagging electronic texts in the Humanities. . .In devising a tag set for the TLL, our starting point will be the tag set for printed dictionaries defined in the TEI P3 guidelines (Sperberg-McQueen and Burnard 1994), which is the set used in the figure above; however, since this tag set does not meet all of the needs of a historic lexicon like the TLL, the CLL hopes to work with the TEI and other interested parties to develop a tag set for historic lexica. (These plans are contingent upon all parties receiving the requisite funding.) The CLL is confident that it will be possible to devise a TEI-conformant SGML tag set that will be able to define each of the sections and types of information within the TLL." [see APAlink below]


Historical/legacy references:

  • 1995 URL for the Electronic Thesaurus Linguae Latinae Home Page.
  • "Developing an Electronic TLL: Constructing a Grammar for a Latin Lexicon." Ann DeVito. HTML version of a paper presented by Ann DeVito at the meetings of the American Philological Association in San Diego, California, 28-December-1995. It describes plans to use a grammar to direct the automated tagging of the electronic TLL database. From the Introduction: "The Consortium for Latin Lexicography (CLL), based at the University of California at Irvine under the Directorship of Patrick Sinclair, has just begun to plan the development of an electronic version of the Thesaurus Linguae Latinae (TLL). We hope to release, in several years' time, an electronic TLL on CD-ROM, complete with its own search engine and user interface. With an electronic TLL, users will be able to find and examine articles that meet the criteria they specify with a speed and thoroughness that is impossible using the printed version. In order to support computerized search of this sort, it is necessary to tag the TLL database appropriately. This paper concentrates on the CLL's grammar-directed approach to automating the tagging of the electronic TLL..." [cache/mirror, partial links]
  • Ann DeVito: Developing an Electronic Thesaurus Linguae Latinae, July 1995. [mirror, text only partially-linked, December 1995]
  • LEXI. Description: "LEXI: An E-mail List for Greek and Latin Lexicography and Language. The LEXI listserv provides an open forum for discussion between scholars interested in the study of Greek and Latin language and lexicography. It is maintained by the Consortium for Latin Lexicography (sponsor of the project to create an electronic Thesaurus Linguae Latinae) and the Thesaurus Linguae Graecae. The listowner is Patrick Sinclair, Associate Professor of Classics at the University of California at Irvine and Director of the Consortium for Latin Lexicography..."

Patrick Sinclair, Director
Consortium for Latin Lexicography
Dept. of Classics, 156 HH
University of California
Irvine, CA 92717-2000
Tel: +1 (714) 824-5831
FAX: +1 (714) 824-2464

Ann DeVito, Systems Analyst
Consortium for Latin Lexicography
Dept. of Computer Science
1C101 Engineering Building
University of Saskatchewan
57 Campus Drive
Saskatoon, Saskatchewan
WWW: Home Page

University of Helsinki - Document Management Research Group

[CR: 19960418] [Table of Contents]

"DocMan is the structured document management research group at the Department of Computer Science at the University of Helsinki. . .Text with a structure is quite common: dictionaries, reference manuals, annual reports etc. are typical examples. In recent years, research on systems for writing structured documents has flourished. The SGML and ODA standards have further increased the interest in the area. Active Projects include: SID (Structured and Intelligent Documents), and DOKU (A Finnish joint project on structured documents and text databases). Previous Projects included: HST/RATI (A structured text database system/Rakenteiset tekstitietokannat), and VITAL/ALCHEMIST (A general purpose transformation generator)."

"Structured and Intelligent Documents (SID) is a three-year research project, which studies and develops methods for attaching intelligent features to structured documents. The purpose of these features is to make the manipulation, i.e., storage, retrieval and assembly of documents easier. . . The SID project started on August 1, 1995 and will end on July 31, 1998. SID is a part of the Electronic Printing and Publishing programme started by the Finnish Technology Development Centre (TEKES). . . As a basis for the project we consider structured documents marked up according to the Standard Generalized Markup Language (SGML), which is an ISO standard for defining document markup languages." [from the SID home page]


Department of Computer Science
Attention: Project Manager, Pekka.Kilpelainen@cs.Helsinki.FI
P.O. Box 26 (Teollisuuskatu 23)
FIN-00014 University of Helsinki
Phone: +358 0 70 851
Fax: +358 0 7084 4441

Project ELVYN: Implementing an Electronic Version of a Journal

[CR: 19951201] [Table of Contents]

"Project ELVYN is a research project funded by the British Library Research and Development Department (BLR&DD) in cooperation with the Institute of Physics Publishers (IoPP) to look at how publishers and libraries can work together to provide an electronic version of a printed journal. The project involves a number of academic institutions in the UK and Europe: Loughborough University of Technology; University of Manchester; Oxford University; University College London; Imperial College; University of Hertfordshire; Chalmers University of Technology, Sweden; The Harwell Laboratory of the Atomic Energy Authority."

"IoPP offered to make all issues of the journal Modelling and Simulation in Materials Science and Engineering (MSMSE) available in electronic form to the members of the project. The offered formats were SGML, PostScript and plain TeX. It was up to the individual libraries to ascertain which format was best for them and to devise their own delivery strategy. The libraries also had to attempt to recruit users who would find the journal useful and would be willing to use an electronic version (not always an easy task!). At Loughborough we looked at the offered formats and decided to adopt the SGML format and use the IoPP's DTD to convert the SGML documents into HTML documents for viewing using standard World Wide Web. To accomplish this we used Klaus Harbo's Copenhagen SGML Tool (CoST). The use of HTML and the WWW fitted in with the change that was occurring on campus at the time to make more information available over the network using the Web and hypertext." [from the Home Page]


HyperLib: Hypertext Interfaces to Library Information Systems

[CR: 19951201] [Table of Contents]

"HyperLib is the EC funded project of the Loughborough University (UK) and the University of Antwerp (B). The project aims to improve access to the services of the libraries of the University of Antwerp and thereby enhance the library's utility to its users. The project adopted a human factors oriented design methodology to ensure that the revised computer interface achieves the advances intended for as wide a spectrum of the library's users as is possible. In order to achieve a system that is both effective and easy to use the project partners decided to investigate graphical interfaces supporting dynamic hypertext linking. The project focusses on two implementations: electronic guides (for library end-users) and manuals (for library staff), and database related resources (an academic bibliography and a navigation tool for subject classification). The technology of the project is based on SGML, WWW and HTML. The complete online documentation (including a report on the HyperLib DTD design and on the conversion from SGML to HTML) can be found at" [description from Jan Corthouts, Deputy Librarian, December 01, 1995]

HyperLib: Hypertext Interfaces to Library Information Systems. Telematic Systems in Areas of General Interest: Libraries Programme. CEC Project: LIB-HYPERLIB/3-1015.


Contact address:
Jan Corthouts
Deputy Librarian
PB 13
2610 Antwerpen
tel.: +32-3-820 21 42
fax : +32-3-820 21 59

Electronic New Testament Manuscript Project

[CR: 19970426] [Table of Contents]

"The Electronic New Testament Manuscripts Project aims to make available on the Internet, and on CD-ROM, transcriptions and digital images of Greek manuscripts of the New Testament. . ."

"Transcriptions will come under a copying policy similar to that of the Free Software Foundation. This would mean that transcriptions are redistributable but that the original transcribers and editors would retain the copyright."

"Transcriptions will be done under the Standard Generalized Markup Language (SGML) application produced by the Text Encoding Initiative (TEI). This scheme, devised by Humanities scholars in Europe and North America provides a standard means for the the transcription of primary sources and textual variants. An increasing amount of software works with TEI encoded documents, including software for collating manuscript variants automatically. Because the manuscript transcriptions will be based on an SGML application, they will be platform-independent and will not become obsolete as software technology advances."

UMI (University Microfilms International)

"UMI is one of the world's leading providers of information to library, corporate, and academic organizations. The company opens up access to more than 25,000 periodicals and newspapers, and over 1.3 million dissertations from universities around the world."

"ASCII/SGML/HTML Dissertations:"

"As part of an overall effort to increase the number of documents available in ASCII format, UMI will be testing the creation of ASCII collections of dissertations in selected subject areas."

"We have no immediate plans to receive, convert to, or distribute dissertations as SGML- or HTML-formatted documents. However, we are implementing internal document database structures that will allow us to convert dissertations to some basic level of SGML or HTML tagging."

UMI Electronic Sales
300 North Zeeb Road
P.O. Box 1346
Ann Arbor, MI 48106-1346
Telephone: (800) 521-0600 x3898

Perseus Project

[CR: 20001026] [Table of Contents]

[1998] A Marlowe web site is "currently under construction at Tufts University as part of the Perseus Project, a digital library for the study of ancient Greece and Rome. This SGML-encoded edition of the complete works of Christopher Marlowe and his sources has been produced according to TEI standards."


University of Bergen (Wittgenstein Archives)

[CR: 19960324] [Table of Contents]

MECS is "sorta-SGML", for studied reasons: "The transcriptions under preparation at The Wittgenstein Archives are coded in a primary format using a syntax called MECS (Multi Element Code System). MECS defines the syntax for The Wittgenstein Archives' registration standard, MECS-WIT, while MECS software allows for varying presentation formats, code extraction, variant control, word lists and other statistical data. . .The relationship between MECS and SGML (Standard Generalised Markup Language) can be outlined as follows. All SGML documents are formally MECS-conforming, but not vice versa. MECS contains some of the properties of SGML, but it also contains additional, simpler mechanisms for representing structures which are cumbersome in SGML. Unlike SGML, MECS allows overlapping elements and does not require a Document Type Definition, although it allows (but does not require) the specification of a similar (though simpler) Code Definition Table." [from the Transcription description; see below]
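
The difference between nested and overlapping elements mentioned above can be illustrated with a minimal sketch. It does not use MECS or SGML syntax; it assumes a hypothetical stand-off representation in which each element is a named (start, end) character range over the transcribed text. The element names and sample text are invented for illustration.

```python
from itertools import combinations

def relation(a, b):
    """Classify two half-open (start, end) spans as disjoint, nested, or overlapping."""
    (s1, e1), (s2, e2) = a, b
    if e1 <= s2 or e2 <= s1:
        return "disjoint"
    if (s1 <= s2 and e2 <= e1) or (s2 <= s1 and e1 <= e2):
        return "nested"
    return "overlapping"

def overlapping_pairs(spans):
    """Return the pairs of element names whose spans cross each other."""
    return [
        (n1, n2)
        for (n1, a), (n2, b) in combinations(spans.items(), 2)
        if relation(a, b) == "overlapping"
    ]

# Hypothetical transcription: a deletion that starts inside one line and runs past its end.
text = "one two three four"
spans = {
    "line": (0, 13),     # "one two three"
    "deleted": (4, 18),  # "two three four"
    "word": (4, 7),      # "two"
}

print(overlapping_pairs(spans))
```

Here "word" nests inside both other elements, but "line" and "deleted" cross. A single strictly nested element tree cannot hold both without splitting one of them, which is the kind of structure the description calls cumbersome in SGML.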

See now: "Markup Language for Complex Documents (Bergen MLCD Project)".


The Wittgenstein Archives
The University of Bergen
Harald Haarfagresgate 31
N-5007 Bergen
Tel: +47-55 58 29 50
Fax: +47-55 58 94 70

The Orlando Project: An Integrated History of Women's Writing in the British Isles

[CR: 200109-7] [Table of Contents]

Information on the Orlando Project is available in a separate document.

Description of 19971018: "The primary objective of the Orlando Project is to produce, in both printed and electronic form, the first scholarly history of women's writing in the British Isles. The integration of the project's key disciplines -- literary history and humanities computing -- will produce a highly sophisticated research tool for the study of women's writing in the multiple traditions of England, Ireland, Scotland, and Wales."

"...Using a combination of project-specific SGML with the TEI, we plan to extend the current capabilities of textual markup to include both subject tagging and the mapping of critical and argumentative movements within our textbase. Doing so will maximize the search, retrieval and display capabilities of our computing project."


The Orlando Project
3-5 Humanities Centre
University of Alberta
Edmonton, AB
T6G 2E5
Email address:
Tel: (403) 492-7803
FAX: (403) 492-8142

Sue Fisher
Project Librarian
An Integrated History of Women's Writing in the British Isles
3-5 Humanities Centre
University of Alberta
Edmonton, AB
T6G 2E5

British Women Romantic Poets Project

[CR: 19971007] [Table of Contents]

"The British Women Romantic Poets Project aims to produce an online scholarly archive consisting of E-text editions of poetry by British women written between 1789 (the onset of the French Revolution) and 1832 (the passage of the Reform Act), a period traditionally known in English literary history as the Romantic period. . . The texts will be tagged using Standard Generalized Markup Language (SGML) and encoded using the Text Encoding Initiative (TEI) Guidelines, with the TEILite DTD."


Nancy Kushigian, Ph.D. (General Editor)
British Women Romantic Poets Project
Shields Library
University of California, Davis
Davis, California 95616
Tel: 916-754-4337

SBL Seminar on Electronic Standards for Biblical Language Texts

[CR: 19980120] [Table of Contents]

In January 1998, Patrick Durusau (Information Technology, Scholars Press) posted an announcement for a new mailing list to support the ongoing work of the SBL (Society of Biblical Literature) Seminar on Electronic Standards for Biblical Language Texts. This seminar is designed for "biblical scholars, publishers, librarians, archivists, researchers and software designers [who] need good computer tools for working with biblical languages and texts" and is "dedicated to solving problems such as interchange and publication of materials containing biblical languages, creation of electronic texts for analysis and archiving, and other problems routinely faced by those working with biblical materials."

"The Biblical Language Standards mailing list is a forum for discussion of and announcements concerning the ongoing work of the SBL Seminar on Electronic Standards for Biblical Language Texts. In February of 1997, the steering committee for the Seminar adopted the following statement of goals:

  1. Create standards for electronic representation and interchange of all types of documents in biblical and related studies. These standards will include naming, description, and analysis.
  2. Encourage the development of software implementing these standards.
  3. Produce a set of sample documents and software developed as models which are consistent with these standards.
  4. Build a consensus for adoption of these standards.
  5. Provide training in their use.
The Seminar is using the TEI Guidelines as the starting point for its discussion of encoding biblical language texts with a view of modifying or extending those guidelines as necessary."


Hebrew Syntax Encoding Initiative

[CR: 19960831] [Table of Contents]

"An ad hoc committee has been formed to extend the Westminster Morphologically Analyzed Machine-Readable Hebrew Bible (MORPH) to the syntactic level. The committee was formed loosely under the auspices of the Computer-Assisted Research Group (CARG) of the Society of Biblical Literature (SBL)."

"So the project entails: (1) a partially tagged data file(s); (2) a front-end interface; (3) some sort of search engine . . . [the project] is looking at a solution using the Text Encoding Initiative's SGML DTD (document type definition) and marking up MORPH; then, using and hacking an SGML parser (like SGMLS), creating the syntactic tagged text, leaping off MORPH. Then the SGML parser can be used to create a query language (or maybe tcl [Tool Command Language]) and attach it to the (probably GUI) interface. Other SGML tools exist which can actually be used to quickly build applications based upon a DTD. That is the theory. Why TEI's DTD? Because it is a standard. SGML DTDs are reconfigurable either by changing the DTD itself (which changes the meaning of the tags already present) or by changing a parser which would actually reconfigure the data files in some arbitrary manner. This approach - if successful - satisfies many of the requirements of HSEI." [adapted from "What is HSEI?"]


Kirk Lowery, Ph.D.
Szeplepke koz 3/b
H-2119 Pecel, HUNGARY
Tel: (36) 30/423/440
Email: Kirk Lowery


Vincent DeCaen, Ph.D.
c/o Near Eastern Studies Dept.
4 Bancroft Ave., 3d floor
University of Toronto
Toronto ON, M5S 1A1, CANADA
Email: Vincent DeCaen

Archivio Testuale Multimediale (ARTEM) Project

[CR: 19960609] [Table of Contents]

"The project, named "Archivio Testuale Multimediale" (ARTEM), will pursue three main goals: (1) To build a repository of electronic texts in Italian language, selected on the basis of the best editorial reliability, and fully encoded according to the best standards available. The repository will be freely accessible in WWW network; (2) To link the repository to other similar ones, offering the same scientific reliability; (3) To build a catalogue of existing electronic texts in Italian language, providing a statement of their editorial reliability and encoding methodology, and stating if and how they are available."

"Special attention is devoted to the problems of encoding, following the SGML procedures, according to the standards proposed by TEI. The previous analysis of textual features, to obtain the full list of elements to encode, will be declared and discussed."

Université de Montréal (EBSI-GRDS)

[CR: 19970531] [Table of Contents]

Research programs and results of the Groupe départemental de Recherche sur les Documents Structurés (GRDS) and the École de bibliothéconomie et des sciences de l'information (EBSI) are summarized in the linked information below.


École de bibliothéconomie et des sciences de l'information
Université de Montréal
C.P. 6128, succursale Centre-ville
Montréal (Québec), Canada
H3C 3J7
Téléphone: (514) 343-7750
Télécopieur: (514) 343-5753

GATE (General Architecture for Text Engineering) Project [Sheffield]

[CR: 19960907] [Table of Contents]

"The GATE (General Architecture for Text Engineering) project is a 3 year EPSRC-funded initiative which aims to enhance collaborative research in Information Extraction (IE) whilst developing a general architecture for language engineering (LE) systems both in research and industrial environments. . . The basic architecture of GATE includes: (1) an object-oriented model of the information associated with documents during their analysis under a range of algorithms; (2) a client-server distributed database for document management based on the US Tipster architecture; (3) SGML input and output for compatibility with, for example, MULTEXT and TEI initiatives." "GDM is the GATE document manager, based on the TIPSTER document manager with added SGML capabilities; GGI is the GATE graphical interface, a development tool for LE research and development, providing integrated access to the services of the other components and adding visualisation and debugging tools; CREOLE is a Collection of REusable Objects for Language Engineering." SGML support is added to the TIPSTER model via i/o filters; SGML annotations are stored in a separate database.
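
The idea of an input filter that separates markup from text can be sketched as follows. This is not GATE's or TIPSTER's API. It is a hypothetical filter, assuming simple well-nested tags without attributes, that strips inline tags and records each element as a stand-off annotation (name, start offset, end offset) kept apart from the plain text. The function name `to_standoff` and the sample input are invented for illustration.

```python
import re

# Matches simple start and end tags such as <s> or </name>; attributes are not handled.
TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)>")

def to_standoff(marked_up):
    """Split tagged input into plain text plus (name, start, end) annotations."""
    text_parts = []
    annotations = []
    open_tags = []  # stack of (name, start offset in plain text)
    pos = 0         # read position in the tagged input
    offset = 0      # current length of the plain text
    for match in TAG.finditer(marked_up):
        chunk = marked_up[pos:match.start()]
        text_parts.append(chunk)
        offset += len(chunk)
        closing, name = match.group(1), match.group(2)
        if closing:
            if not open_tags or open_tags[-1][0] != name:
                raise ValueError(f"unexpected end tag </{name}>")
            _, start = open_tags.pop()
            annotations.append((name, start, offset))
        else:
            open_tags.append((name, offset))
        pos = match.end()
    text_parts.append(marked_up[pos:])
    if open_tags:
        raise ValueError(f"unclosed element <{open_tags[-1][0]}>")
    return "".join(text_parts), annotations

plain, notes = to_standoff("<s><name>GATE</name> runs</s>")
print(plain)
print(notes)
```

Because the annotations only point into the text by offset, they can be stored separately from the document itself, and an output filter could re-insert the tags from the same records.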


GATE - Natural Language Processing Research Group
Department of Computer Science
The University of Sheffield
Regent Court, 211 Portobello Street
Sheffield, S1 4DP, UK

The LEGEBIDUNA Project (Universidad de Deusto)

[CR: 19970313] [Table of Contents]

"The LEGEBIDUNA project concentrates on the exploitation of a bitext corpus of administrative documents in both Basque and Spanish as a source for the development of simultaneous editing and translating software. [The Web site] includes discussion on legal texts, translation memory, descriptive mark-up (SGML, TEI, MULTEXT), variable translation units, and parallel text alignment."

The Legebiduna Project has "created a corpus of administrative documentation: Official Bilingual Journals of the Basque Administration - almost 10 million words in each language, Basque and Spanish. [Researchers are] now tagging the texts, i.e. recognizing administrative and legal formulae and terminology, and their distribution in the texts' structure. The DTDs are deduced from the tagged corpora, i.e., first the text is tagged, then the DTDs are constructed." [adapted from a posting of Joseba K. Abaitua Odriozola to TEI-L.]
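
The "tag first, then construct the DTD" order can be illustrated with a rough sketch. The project's actual procedure is not described here, so everything below is an assumption: the samples are well-formed XML rather than SGML, the element names are invented, and the inferred content models are deliberately loose (any observed child, in any order, any number of times).

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def infer_declarations(samples):
    """Derive loose <!ELEMENT ...> declarations from already tagged samples."""
    children = defaultdict(set)   # element name -> child element names observed
    has_text = defaultdict(bool)  # element name -> whether character data was observed
    for sample in samples:
        root = ET.fromstring(sample)
        for elem in root.iter():
            children[elem.tag]  # register leaf elements too
            if elem.text and elem.text.strip():
                has_text[elem.tag] = True
            for child in elem:
                children[elem.tag].add(child.tag)
                # Text following a child belongs to the parent's content.
                if child.tail and child.tail.strip():
                    has_text[elem.tag] = True
    declarations = []
    for tag in sorted(children):
        kids = sorted(children[tag])
        if not kids:
            model = "(#PCDATA)" if has_text[tag] else "EMPTY"
        elif has_text[tag]:
            model = "(#PCDATA | " + " | ".join(kids) + ")*"
        else:
            model = "(" + " | ".join(kids) + ")*"
        declarations.append(f"<!ELEMENT {tag} {model}>")
    return declarations

# Hypothetical tagged fragment of an administrative text.
tagged = ["<decree><title>Order 12</title><clause>Text <term>grant</term></clause></decree>"]
for line in infer_declarations(tagged):
    print(line)
```

A real corpus would call for tighter models (sequence and occurrence constraints), which this sketch does not attempt.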


Joseba K. Abaitua Odriozola (Universidad de Deusto), Email:, WWW:
Arantza Casillas Rubio (Universidad de Alcalá de Henares), Email:
Raquel Martínez Unanue (Universidad Complutense de Madrid), Email:

SETIS: Electronic Texts at the University of Sydney Library

[CR: 19970316] [Table of Contents]

"The University of Sydney Library has acquired a large number of primary texts in digital form over the last few years. These texts include numerous versions of The Bible, the works of Shakespeare, Goethe and Kant, more than 700 classical Greek texts in the Thesaurus Linguae Graecae, the enormous Patrologia Latina Database of the Church Fathers, the English Poetry Full-Text Database, and the Intelex Philosophy Texts. To these texts and others like them must be added the texts available from remote sites such as the collection of some 2,000 French literary, scientific and philosophical texts at the Frantext Web site, and the many public domain texts available at the University of Virginia Electronic Text Centre and the Oxford Text Archive."

"SETIS makes available a growing number of SGML encoded texts for use via web browsers. . . SETIS is engaged in a number of text creation projects, and this has involved acquiring knowledge and skills not only about scanning and text recognition software, but more significantly, about Standard Generalised Markup Language (SGML) and the Text Encoding Initiative guidelines for humanities texts. Current projects include work on lecture notes by Professor John Anderson from the 1930's up to the 1950's and which are held in the University of Sydney Archives; an edition of Lord Shaftesbury's Characteristics, Manners, Opinions, Times held in the Rare Books collection at Fisher Library; and digital images of the New Australia Journal in Rare Books which are in a state of decomposition. SETIS is also engaged in encoding the novels identified for digitisation by the Australian Co-operative Digitisation Project. These projects will give us the expertise to provide support for similar initiatives at the University among academic staff and research students." [from the article of Creagh Cole, cited below]


Creagh Cole
SETIS Coordinator
University of Sydney Library
University of Sydney 2006
Phone: +61 02 9351 7408
Fax: +61 02 9351 7290

Hosted By
OASIS - Organization for the Advancement of Structured Information Standards

Robin Cover, Editor: