[Archive copy mirrored from: http://sunsite.berkeley.edu/Ebind/, June 01, 1997]

Berkeley Digital Library

Digital Page Imaging and SGML
An Introduction to the Electronic Binding DTD (Ebind)

View some Ebind-encoded documents


The Electronic Binding Project, or Ebind, is a method for binding together digital page images using an SGML document type definition (DTD) developed at UC Berkeley in 1996 by Alvin Pollock and Daniel Pitti. The Ebind SGML file records the bibliographic information associated with the document in an ebindheader, the structural hierarchy of the document (e.g., parts, chapters, sections), its native pagination, and textual transcriptions of the pages themselves. It can also carry optional meta-information, such as controlled access points (subjects and personal, corporate, and geographic names) and abstracts, which can be provided all the way down to the level of the individual page.

This SGML file acts primarily as a non-proprietary control file, based on an international standard (ISO 8879), for the multiple image files which make up a digitized book or document. It can also serve as the basis for browsing the images in a natural and convenient way in any SGML-aware software system. One such system, a CGI program written in Perl, provides a simple web-based interface that remote users can reach with a web browser such as Netscape or MS Internet Explorer. This CGI script is freely available and may be downloaded from this site.


The Digital Photocopy Project

In 1993 the University of California Printing Dept. acquired a Xerox XDOD (Xerox Document on Demand) system consisting of a Docutech printer and a high quality Xerox scanner. Brittle books which previously had been copied using a traditional photocopier could then be copied using this state-of-the-art digital photocopier. The paper output was bound and placed back on the library shelf. In addition, the digital image files could be archived, both to give patrons direct access to the files and to replace damaged pages or even print further copies of the entire book at some future date. Ebind was chosen as the archival control file format because it was based on an international standard, SGML, and would migrate easily to any future standards if and when they are developed. It was also demonstrated that the very same SGML document could act as the basis for an on-line document navigation system, in addition to acting as a control file. That navigation system was the Ebind CGI script written in Perl.

American Heritage Virtual Archive Project

In 1996 the American Heritage Virtual Archive Project was begun as a collaboration between UC Berkeley, Stanford, the University of Virginia, and Duke University to encode the finding aids to their archives and repositories in SGML using the Encoded Archival Description (EAD) DTD. In a later phase of the project, selected manuscripts and other primary source material would be digitized and made available on the World Wide Web. These digitized images would be bound together using Ebind and linked to the electronic finding aid.


The structure of the Ebind DTD is based loosely on the Core tag set of the Text Encoding Initiative (TEI) DTDs. Like TEI, Ebind is divided into a bibliographic header, front matter, a body, and back matter. The front, body and back elements can themselves be divided into generic textual divisions called divs. A type attribute on the div element may specify the type of division more precisely, e.g., type="chapter".
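The four-part layout described above can be sketched in a few lines of Python. This is only an illustration, not the DTD itself: the element names ebindheader, front, body, back, and div and the type attribute come from the description above, while the root element name "ebind" and the "title" and "head" children are assumptions made for the example. The output is XML-style serialization rather than SGML.

```python
import xml.etree.ElementTree as ET


def build_skeleton(title, chapter_titles):
    """Return an Ebind-like tree: header, front matter, body, back matter."""
    root = ET.Element("ebind")  # assumed root element name
    header = ET.SubElement(root, "ebindheader")
    ET.SubElement(header, "title").text = title  # hypothetical header child
    ET.SubElement(root, "front")
    body = ET.SubElement(root, "body")
    for chapter in chapter_titles:
        # A generic division, made specific through its type attribute.
        div = ET.SubElement(body, "div", {"type": "chapter"})
        ET.SubElement(div, "head").text = chapter  # "head" is assumed, TEI-style
    ET.SubElement(root, "back")
    return root


skeleton = build_skeleton("Sample Book", ["Chapter I", "Chapter II"])
print(ET.tostring(skeleton, encoding="unicode"))
```

Consult the DTD itself for the actual element names and content models before relying on any of the assumed names.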

View the Ebind DTD
View SGML tagging of some sample Ebind-encoded documents.

Two fundamental concepts separate Ebind from TEI. First, Ebind privileges the physical structure of a document while TEI privileges the intellectual structure. In Ebind, the atomic unit is the page, while in TEI it can be as small as the individual character. In TEI there is no element which can contain a page. The reason for this is that two distinct structural hierarchies cannot exist within the same document, at least not in current implementations of SGML. If a chapter ends in the middle of a page and a new chapter begins on that same page, one cannot explicitly describe both the hierarchy of the page and the hierarchy of the chapter. TEI favors the chapter by enclosing it within a div tag and describes the hierarchy of the page implicitly through the use of the pb (page break) empty tag, one of TEI's so-called "milestone" elements (see TEI guidelines section 6.9.3). In Ebind, all pages are enclosed within a <page> element. This allows one to gather together a variety of information associated with individual pages, such as textual transcription ("raw" OCR or keyed), page abstracts, and even controlled access points for individual pages if desired.
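The difference between milestones and page containers can be made concrete with a small Python sketch. It takes a TEI-like fragment in which a chapter boundary falls in the middle of page 1, and regroups the text under page containers. The div and pb elements and the page element are taken from the discussion above; the "n" attribute on page, the placement of text directly inside page, and the flat, non-nested divs are simplifying assumptions for the example.

```python
import xml.etree.ElementTree as ET

# Chapter one ends, and chapter two begins, on the same physical page (1).
TEI_STYLE = (
    "<body>"
    '<div type="chapter"><pb n="1"/>End of chapter one.</div>'
    '<div type="chapter">Start of chapter two, same page.'
    '<pb n="2"/>Chapter two continues.</div>'
    "</body>"
)


def text_by_page(body):
    """Collect text per page from pb milestones, ignoring div boundaries."""
    pages = {}
    current = None  # page number of the most recent pb milestone
    for div in body.findall("div"):
        if div.text and div.text.strip() and current is not None:
            pages[current].append(div.text.strip())
        for child in div:
            if child.tag == "pb":
                current = child.get("n")
                pages.setdefault(current, [])
            if child.tail and child.tail.strip() and current is not None:
                pages[current].append(child.tail.strip())
    return pages


def to_page_elements(pages):
    """Wrap each page's text in a page container, Ebind-style."""
    root = ET.Element("body")
    for number, chunks in pages.items():
        page = ET.SubElement(root, "page", {"n": number})  # "n" is assumed
        page.text = " ".join(chunks)
    return root


pages = text_by_page(ET.fromstring(TEI_STYLE))
ebind_body = to_page_elements(pages)
print(ET.tostring(ebind_body, encoding="unicode"))
```

Note what is lost in the regrouping: once the text is held by page containers, the chapter boundary inside page 1 is no longer expressed by the element hierarchy. That is the trade-off between the two hierarchies in miniature.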

The second fundamental difference between TEI and Ebind is that Ebind is simpler to use. It was recognized early on that Ebind would be used in a high-volume production environment and would be applied to a wide variety of documents. The same DTD can be used to encode books, manuscripts, diaries, newspapers, or magazines. For this reason, many of the requirements imposed by TEI were "loosened up" in Ebind. The DTD is far less restrictive. Page elements can occur just about anywhere, for example. They may occur between divs and in fact needn't be enclosed in divs of any kind. This greatly simplifies the task of automated markup.
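Because page elements may appear inside or outside divs, software that only needs the page sequence can ignore the division structure entirely. The following Python sketch shows one way a browsing tool might derive previous and next links. It is not the Ebind CGI script; the root element name "ebind" and the "image" attribute with its file names are hypothetical, chosen only to give each page something to identify it.

```python
import xml.etree.ElementTree as ET

# Pages appear both directly in the body and nested inside a div.
SAMPLE = """
<ebind>
  <body>
    <page image="0001.tif"/>
    <div type="chapter">
      <page image="0002.tif"/>
      <page image="0003.tif"/>
    </div>
    <page image="0004.tif"/>
  </body>
</ebind>
"""


def page_sequence(root):
    """Return page images in document order, whatever their nesting depth."""
    return [page.get("image") for page in root.iter("page")]


def neighbours(images, index):
    """Return the (previous, next) images for the page at index."""
    previous_image = images[index - 1] if index > 0 else None
    next_image = images[index + 1] if index + 1 < len(images) else None
    return previous_image, next_image


images = page_sequence(ET.fromstring(SAMPLE.strip()))
print(images)
print(neighbours(images, 1))
```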

The UC Berkeley Digital Photocopy Project is a good example of how Ebind may be applied in a high-volume production environment. The Ebind SGML documents are encoded programmatically from simple, single-page worksheets prepared by library staff and completed by the scanner operator.
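As a rough illustration of what encoding from a worksheet could look like, here is a minimal Python sketch. It does not reproduce the project's Perl script or its worksheet layout: the key-and-value worksheet format, the root element name "ebind", the "head" element, and the "seq" attribute on page are all hypothetical. Only ebindheader, body, div with its type attribute, and page come from the description of the DTD.

```python
from xml.sax.saxutils import escape

# Hypothetical worksheet: a title, a page count, and the page on which
# each chapter starts, followed by the chapter heading.
WORKSHEET = """\
title: A Sample Brittle Book
pages: 3
chapter: 1 Preface
chapter: 2 Chapter I
"""


def parse_worksheet(text):
    """Return (title, page_count, {start_page: heading}) from the worksheet."""
    title, page_count, chapters = "", 0, {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "title":
            title = value
        elif key == "pages":
            page_count = int(value)
        elif key == "chapter":
            start, _, heading = value.partition(" ")
            chapters[int(start)] = heading
    return title, page_count, chapters


def to_markup(title, page_count, chapters):
    """Emit Ebind-like markup: one page per image, grouped into chapter divs."""
    out = ["<ebind>", "<ebindheader>%s</ebindheader>" % escape(title), "<body>"]
    in_div = False
    for seq in range(1, page_count + 1):
        if seq in chapters:
            if in_div:
                out.append("</div>")  # close the previous chapter
            out.append('<div type="chapter">')
            out.append("<head>%s</head>" % escape(chapters[seq]))
            in_div = True
        out.append('<page seq="%d"></page>' % seq)  # "seq" is assumed
    if in_div:
        out.append("</div>")
    out += ["</body>", "</ebind>"]
    return "\n".join(out)


markup = to_markup(*parse_worksheet(WORKSHEET))
print(markup)
```

Since pages need not sit inside a div, a worksheet with no chapter entries would still yield usable markup here: the loop would simply emit the page elements directly inside the body.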

View some sample worksheets.

Like the Ebind CGI script, the Perl script which generates the SGML file from the worksheet is freely available from this site.

Download perl conversion script (ebind.pl)

Copyright © 1996 UC Regents. All rights reserved.
Document maintained at http://sunsite.berkeley.edu/Preservation/ by the SunSITE Manager.
Last update 6/7/96. SunSITE Manager: manager@sunsite.berkeley.edu