Mirrored from ftp://rs7.loc.gov/pub/american.memory/white.papers/techov.txt, March 1996
ELEMENTS OF DIGITAL ARCHIVAL COLLECTIONS
TECHNICAL OVERVIEW AND FORMAT DESCRIPTION

* * * * *

Carl Fleischhauer
Coordinator, American Memory
October 27, 1994

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

This document is intended to present a "snapshot" of the Library of
Congress digital conversion activity as of October 1994. The ideas and
approaches outlined here represent the outcome of the five-year
American Memory pilot program (1990-1994). Although the elements
described in this document will guide the Library as it plans its
long-term digitizing effort, the institution recognizes that many
avenues remain unexplored and that new technology will continue to
lead to changing practices.

Interested readers are encouraged to also refer to the American Memory
"white papers" provided at this ftp (ftp.loc.gov) or gopher
(marvel.loc.gov) site:

     Reproduction-Quality Issues in a Digital-Library System
     Bibliographic Records in a Full-content Access System
     Frameworks and Finding Aids: Organizing Digital Archival
        Collections (supplement to the paper on bibliographic
        records; to be added during November 1994)

:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

I. What Kinds of "Items" Make Up a Digital Historical Collection?

Digital historical collections are tremendously varied; some actual
examples are described in Section III. Broadly speaking, each
collection consists of two major elements, each of which consists of a
variety of sub-elements:

Framework
     An assemblage of title or "home" pages, finding aid(s), and
     associated information.
     Finding aid
          Either an online register (directory of holdings) or a
          collection-specific database of brief bibliographic records.
     Background information
          Explanatory texts, scope and content notes, chronologies,
          and bibliographies that provide users with a context for
          the collection.
     Special programs
          Interpretive texts and/or interactive programs that
          introduce collections to users, explaining what they
          contain and how they may be used (optional)

Reproductions
     Digital images, texts, sound recordings, or moving-image files
     that reproduce original items at the Library of Congress.
     Detailed specifications are provided in Section II.

**********************************************************************

II. Technical description of items in collections.

A. Background

A collection in its raw state consists of the types of items listed
below. Framework items (finding aids, background information, and
special programs) are produced by Library of Congress staff.
Reproductions, for the most part, are produced by specialist
contractors. Readers are warned that the formats listed below are
subject to change as the Library's approach to digitization evolves.

All of the raw materials are (or should be) archived in their
production formats, as listed below. In a perfect world, all of these
formats would adhere to well-established "true" or "industry"
standards. Alas, the world is not perfect and the list below reveals
that the Library is using a combination of true standards, widespread
industry standards, and temporary solutions. The most stable formats
are those used for cataloging, SGML-marked-up machine-readable texts,
and non-moving images (reproductions of documents and pictorial
materials). The least stable formats are those for time-based items,
i.e., recorded sound and moving images.

The raw materials are also ready to be assembled and loaded into
retrieval software, thus forming the finished collection. At this
writing, our main focus for assembly and retrieval is the Library of
Congress World Wide Web server and we are loading our content there.
In the recent past, however, the Library also produced CD-ROM disks
for both Macintosh and IBM-compatible systems.
In the future, the same content may be loaded into different server
systems and/or other media. At the time of assembly and loading, some
of the archived data must be converted into different formats,
selected to fit the particular delivery system at hand.

The list that follows identifies the production and archiving formats.
For recorded sound and moving-image collections, however, these notes
also highlight the interim formats selected for World Wide Web
delivery.

B. Framework items

1. Finding aids

i. Online register

This type of finding aid has not yet been realized at the Library of
Congress; current plans call for adoption of the SGML scheme developed
by Daniel Pitti at the Bancroft Library, University of California,
Berkeley.

ii. Collection-specific bibliographic-record database

Collection databases containing MARC catalog records that follow AACR2
approaches for minimal-level cataloging (up to a point) and include
fields like 856 and some additional local fields. More details are
provided in other American Memory documents.

2. Background information

Searchable texts are coded (for now) with HyperText Markup Language
(HTML).

3. Special programs

The core element is typically browseable texts coded (for now) with
HyperText Markup Language (HTML). Various images or audio clips may be
attached to these texts; the formats for these will, of course, be
similar to the formats used for collection-item reproductions.

C. Reproductions

1. ASCII texts with SGML markup: searchable reproductions of textual
materials

American Memory's contractor requirement is 99.95 percent accurate
transcribed texts (when compared to the original item). The texts are
coded with Standard Generalized Markup Language (SGML). The American
Memory SGML document type definition (DTD) conforms to the guidelines
of the international-standard Text Encoding Initiative (TEI).
The SGML markup indicates the location and number of each illustration
for an illustrated work, as well as the page images for most works.
This coding permits links to be made to images (generally by file
name), thus reuniting text, illustration, and/or page image at
retrieval time.

The original source texts are given identifiers (alphanumeric strings)
that are incorporated in the filenames for the searchable texts and
associated images. Additional identifying information appears in the
TEI header for the searchable texts. Separate documents provide detail
about the American Memory SGML DTD and the program's use of the TEI
header.

Searchable texts are delivered by the contractor as computer files in
MS-DOS format, together with documentation and a directory of
filenames. Thus filenames adhere to DOS naming requirements.

2. Bitonal images: reproductions of manuscripts or printed matter

a. Images of documents or line art

     TIFF files with TIFF header (Intel rather than Motorola; American
        Memory checks headers in various ways, including the use of
        the shareware TIFFINFO)
     CCITT group 4 compression (group 3 for some early examples)
     300 dpi (as referenced to, say, a printer; may or may not be
        300 dpi as referenced to the source document).

b. Images of illustrations.

NOTE: for printed halftones, special dithering treatment is applied at
scan time to reduce or suppress moiré patterns. Each illustration
image is created in the following forms:

i. A "printing image," i.e., a reproduction of an illustration (line
or printed halftone) that an end-user can print at retrieval time.
     Pixel depth and size: The image is to be bitonal (one bit deep),
        suitable for printing at 300 dpi on 8½x11-inch paper. The
        image need not fill the paper; the requirement is simply
        that, when printed at 100 percent size, the image shall not
        occupy more than one sheet of paper.
     PCX (current practice) or TIFF header (preferred; same flavor as
        indicated in section 2.a; with CCITT group 3 or group 4
        compression)
     300 dpi (as referenced to, say, a printer; may or may not be
        300 dpi as referenced to the source document).

ii. A screen-display image (or thumbnail). Bitonal reproduction of the
illustration for computer-screen display. The image is dithered at
scan time in the case of printed halftones. Gray-scale is being
considered for future collections. The current specification calls for
the screen-display image not to exceed 512 pixels horizontal by 342
pixels vertical. In the future, this may be reduced to a smaller
thumbnail size of 200x200.

c. General notes: Each image is to be placed in a separate file. LC
specifies filenames or a filenaming approach. The filenames must
relate in a logical way to the names assigned to SGML-marked-up text
files, thus permitting the linking described above.

3. Gray-scale and color images: reproductions of pictorial materials,
especially continuous-tone items like photographs.

The Library's contractor creates a film intermediate by
re-photographing materials onto 35mm or 70mm roll film. The film
serves the Library as a very-high-quality analog archiving
reproduction. The contractor digitizes the film images to meet the
specifications that follow.

Image Specifications:

UNCOMPRESSED IMAGES for reprocessing and reuse in future, improved
computer systems.
     Screen resolution 640x480 (optional 1024x768 or 1280x1024)
     B&W images at 8 bits per pixel (bpp)
     Color images at 24 bpp.
     Uncompressed
     TIFF ver. 5.0 headers (see below for added content)

REFERENCE IMAGES for current retrieval system display.
     Screen resolution 640x480 (optional 1024x768 or 1280x1024)
     B&W images at 8 bits per pixel (bpp)
     Color images at 24 bpp.
     B&W images at approx 10:1 compression
     Color images at approx 20:1 compression
     JPEG compression
     JFIF headers

THUMBNAIL IMAGES for current retrieval system display.
     No dimension greater than 150 pixels
     Both B&W and color thumbnails in 8 bpp.
     Color palette optimized (adaptive palettes) for each image.
     Palette "reserves" the colors of the standard/default Windows
        palette to reduce conflict with Windows delivery software.
     Uncompressed
     TIFF ver. 5.0 headers (see below for added content)

TIFF image file headers

TIFF headers. Our usual specification for gray-scale or color images
(pictorial collections) with TIFF headers includes the following tags.
In general, the "typical" or "expected" data goes in the tag; the
comments column indicates cases where the data may not be of the
expected type. In the future, we anticipate using a similar set of
tags for the TIFF headers for bitonal document-type images.

Description                 Tag   Comments

NewSubfileType              254
ImageWidth                  256   actual pixel count
ImageLength                 257   actual pixel count
BitsPerSample               258
Compression                 259
PhotometricInterpretation   262
StripOffsets                273
SamplesPerPixel             277
RowsPerStrip                278
StripByteCounts             279
XResolution                 282   actual pixel count for larger images
                                  (for thumbnails, dots per inch)
YResolution                 283   actual pixel count for larger images
                                  (for thumbnails, dots per inch)
ResolutionUnit              296   1 (no unit) for larger images;
                                  2 (inch) for thumbnails
DocumentName                269   name supplied by LC
Artist                      315   Library of Congress
DateTime                    306   date scanned

4. Recorded sound.
Standard digital formats for compression and networking have not yet
been established. The current World Wide Web offering employs the WAV
format with an "AU wrapper." We anticipate that, for the immediate
future, each individual package or system will convert and load sound
recordings afresh from interim masters. The interim masters may be
analog (e.g. broadcast-standard full-track-monaural or
double-track-stereo ¼-inch tape at 7.5 inches-per-second) or standard
digital form (e.g. DAT cassette), with a list of timings and content.

5. Moving-image materials.
Standard digital formats for compression and networking have not yet
been established. The current World Wide Web offering consists of
films in the AVI format that can be displayed after downloading using
the INDEO codec. They were digitized using INDEO version 3.2 at a low
data-transfer rate (in order to keep file sizes small). We anticipate
that, for the immediate future, each individual package or system will
probably convert and load moving-image materials afresh from interim
masters. Since digital copies tend to be of lower resolution than NTSC
video, the current round of interim masters are analog videotapes made
from the original motion picture film (e.g. broadcast-standard 1-inch
type "C" tapes, D2 digital tapes, or Betacam composite video, etc.).
In the future, as digital-motion quality increases, the source
material may be a motion-picture-film copy of the original or a
better-than-NTSC video master.

**********************************************************************

III. Illustrative examples of collections

1. Life History Manuscripts from the WPA Federal Writers' Project,
1936-1940

QUANTITY   TYPE OF UNIT

FRAMEWORK:

ca. 10     ASCII texts in the Home Page group for background
           information and special program; HTML markup
1          Menu-form finding aid; HTML markup
13         Audio clips for the special program; WAV format/AU wrapper
ca. 30     Digital pictorial images for the special program;
           gray-scale JPEG images, ca. 640x480; also thumbnails

REPRODUCTIONS:

2,900      ASCII texts w/ SGML markup (TEI implementation, American
           Memory DTD); avg size: 11 Kb; total all doc's: 32 Mb
22,591     Digital page images of original pages, bitonal 300 dpi;
           TIFF headers, CCITT group 3 compression (most) or group 4
           (a few); av size: 68 Kb; total all doc's: 1,536 Mb
4          Digital pictorial images of originals in the collection;
           gray-scale JPEG images, ca. 640x480; also thumbnails;
           av size: 300 Kb

2. Early films of New York City, 1897-1906

QUANTITY   TYPE OF UNIT

FRAMEWORK:

ca. 6      ASCII texts in the Home Page group for background
           information; HTML markup
45         Bib-record form finding aid: item level MARC catalog
           records

REPRODUCTIONS:

45         Digital film clips (for now, INDEO format); b/w, silent;
           av length 4 minutes
45         Digital still-frame thumbnail images that present a trio
           of scenes from the films (for presentation with the
           bibliographic record); gray-scale TIFF and GIF images,
           ca. 500x150

3. Panorama photos (American places, ca. 1880-1920)

QUANTITY   TYPE OF UNIT

FRAMEWORK:

ca. 10     ASCII texts in the Home Page group for background
           information and special program; HTML markup
4,000      Bib-record form finding aid: item level MARC catalog
           records
4          Video clips for special program (not yet digitized)
ca. 50     Digital pictorial images for the special program; various
           formats (mostly JPEG or TIFF)

REPRODUCTIONS:

36,000     Digital pictorial images of originals in collection; each
           panorama reproduced in one "wide shot" and from 3-8
           closeups; gray-scale and color JPEG images, ca. 640x480;
           also thumbnails

4. Ethnic Folk Music from Northern California, Recorded by a WPA
Project, 1938-1940, Part One

QUANTITY   TYPE OF UNIT

FRAMEWORK:

ca. 6      ASCII texts in a main Home Page group for background
           information and special program; HTML markup
ca. 10     ASCII texts in secondary-level Home Page groups; HTML
           markup
ca. 600    Bib-record form finding aid: four sets of item-level MARC
           catalog records, each related to a secondary-level Home
           Page group; figure given is approx total of all four sets
ca. 40     Digital pictorial images for the special program; various
           formats (mostly JPEG or TIFF)

REPRODUCTIONS:

393        Recorded sound selections; average length: 1.5 minutes;
           range: 0.5 - 5 minutes; estimated total audio: 15 hours
           (current holding is in the QT compressed format for
           Macintosh)
168        ASCII texts w/SGML markup (TEI implementation, American
           Memory DTD); av size: 6 Kb; total all doc's: 1.3 Mb
219        Digital page images of original pages, bitonal 300 dpi;
           TIFF headers, CCITT group 3 compression; av size: 51 Kb;
           total all doc's: 51 Mb
1,002      Digital pictorial images: gray-scale JPEG images, ca.
           640x480; also thumbnails; av size 230 Kb; total 39 Mb
45         Digital images of engineering drawings: bitonal 300 dpi at
           (1) actual size and (2) reduced to print on 8.5x11-inch
           paper; TIFF group 3; av size for (1), 4 Mb, for (2),
           130 Kb; total 186 Mb.
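Several of the totals quoted in the examples above can be reproduced
by multiplying the quantity of items by their average file size. The
sketch below does this for the page images in example 1 and the
engineering drawings in example 4. It assumes that 1 Mb equals 1,000
Kb, which the figures suggest but the text does not state; the
function and variable names are illustrative only.

```python
# Assumption (not stated in the source): decimal units, 1 Mb = 1,000 Kb.
KB_PER_MB = 1000


def total_mb(parts):
    """Return the total size in Mb for (quantity, average Kb) pairs."""
    total_kb = sum(quantity * avg_kb for quantity, avg_kb in parts)
    return total_kb / KB_PER_MB


# Example 1: 22,591 bitonal page images averaging 68 Kb each.
page_images_mb = total_mb([(22_591, 68)])

# Example 4: 45 engineering drawings, each stored at actual size
# (about 4 Mb) and in a reduced printing size (about 130 Kb).
drawings_mb = total_mb([(45, 4 * KB_PER_MB), (45, 130)])

print(round(page_images_mb))
print(round(drawings_mb))
```

Because the averages in the examples are rounded, a calculation of
this kind is only a rough guide to the storage a collection requires.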