Mirrored from ftp://rs7.loc.gov/pub/american.memory/white.papers/techov.txt, March 1996
ELEMENTS OF DIGITAL ARCHIVAL COLLECTIONS
TECHNICAL OVERVIEW AND FORMAT DESCRIPTION
* * * * *
Carl Fleischhauer
Coordinator, American Memory
October 27, 1994
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
This document is intended to present a "snapshot" of the Library
of Congress digital conversion activity as of October 1994. The ideas
and approaches outlined here represent the outcome of the five-year
American Memory pilot program (1990-1994). Although the elements
described in this document will guide the Library as it plans its
long-term digitizing effort, the institution recognizes that many
avenues remain unexplored and that new technology will continue to
lead to changing practices.
Interested readers are encouraged to also refer to the American
Memory "white papers" provided at this ftp (ftp.loc.gov) or gopher
(marvel.loc.gov) site:
Reproduction-Quality Issues in a Digital-Library System
Bibliographic Records in a Full-content Access System
Frameworks and Finding Aids: Organizing Digital Archival
Collections (supplement to the paper on bibliographic
records; to be added during November 1994)
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
I. What Kinds of "Items" Make Up a Digital Historical Collection?
Digital historical collections are tremendously varied; some
actual examples are described in Section III. Broadly speaking, each
collection consists of two major elements, each of which consists of a
variety of sub-elements:
Framework        An assemblage of title or "home" pages, finding
                 aid(s), and associated information.
    Finding aid             Either an online register
                            (directory of holdings) or a
                            collection-specific database of
                            brief bibliographic records.
    Background information  Explanatory texts, scope and
                            content notes, chronologies, and
                            bibliographies that provide users
                            with a context for the collection.
    Special programs        Interpretive texts and/or
                            interactive programs that introduce
                            collections to users, explaining
                            what they contain and how they may
                            be used (optional).
Reproductions    Digital images, texts, sound recordings, or
                 moving-image files that reproduce original items
                 at the Library of Congress. Detailed
                 specifications are provided in Section II.
**********************************************************************
II. Technical description of items in collections.
A. Background
A collection in its raw state consists of the types of items
listed below. Framework items (finding aids, background information,
and special programs) are produced by Library of Congress staff.
Reproductions, for the most part, are produced by specialist
contractors. Readers are warned that the formats listed below are
subject to change as the Library's approach to digitization evolves.
All of the raw materials are (or should be) archived in their
production formats, as listed below. In a perfect world, all of these
formats would adhere to well-established "true" or "industry"
standards. Alas, the world is not perfect and the list below reveals
that the Library is using a combination of true standards, widespread
industry standards, and temporary solutions. The most stable formats
are those used for cataloging, SGML-marked-up machine-readable texts,
and non-moving images (reproductions of documents and pictorial
materials). The least stable formats are those for time-based items,
i.e., recorded sound and moving images.
The raw materials are also ready to be assembled and loaded into
retrieval software, thus forming the finished collection. At this
writing, our main focus for assembly and retrieval is the Library of
Congress World Wide Web server and we are loading our content there.
In the recent past, however, the Library also produced CD-ROM disks
for both Macintosh and IBM-compatible systems. In the future, the
same content may be loaded into different server systems and/or other
media. At the time of assembly and loading, some of the archived data
must be converted into different formats, selected to fit the
particular delivery system at hand.
The list that follows identifies the production and archiving
formats. For recorded sound and moving-image collections, however,
these notes also highlight the interim formats selected for World Wide
Web delivery.
B. Framework items
1. Finding aids
i. Online register
This type of finding aid has not yet been realized at the
Library of Congress; current plans call for adoption of the
SGML scheme developed by Daniel Pitti at the Bancroft
Library, University of California, Berkeley.
ii. Collection-specific bibliographic-record database
These collection databases contain MARC catalog records that
follow AACR2 approaches for minimal-level cataloging (up to
a point) and include fields like 856 and some additional
local fields. More details are provided in other American
Memory documents.
2. Background information
Searchable texts are coded (for now) with HyperText Markup
Language (HTML).
3. Special programs
The core element typically consists of browsable texts coded (for
now) with HyperText Markup Language (HTML). Various images
or audio clips may be attached to these texts; the formats
for these will, of course, be similar to the formats used
for collection-item reproductions.
C. Reproductions
1. ASCII texts with SGML markup: searchable reproductions of
textual materials
American Memory's contractor requirement is 99.95 percent
accurate transcribed texts (when compared to the original
item). The texts are coded with Standard Generalized Markup
Language (SGML). The American Memory SGML document type
definition (DTD) conforms to the guidelines of the
international-standard Text Encoding Initiative (TEI).
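As a rough illustration of what the 99.95 percent requirement implies, the following sketch computes how many errors a transcription of a given length could contain and still pass. It assumes that accuracy is counted per character, which this document does not state; the function name and the sample length are hypothetical.

```python
def max_allowed_errors(char_count, errors_per_10000=5):
    """Largest whole number of errors that still meets the target.

    99.95 percent accuracy is equivalent to at most 5 errors per
    10,000 characters. Assumption: accuracy is measured per character.
    """
    return (char_count * errors_per_10000) // 10_000

# A text of about 11,000 characters could contain at most 5 errors.
print(max_allowed_errors(11_000))
```

Integer arithmetic is used so that the result does not depend on floating-point rounding.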
The SGML markup indicates the location and number of each
illustration for an illustrated work, as well as the page
images for most works. This coding permits links to be made
to images (generally by file name), thus reuniting text,
illustration, and/or page image at retrieval time.
The original source texts are given identifiers
(alphanumeric strings) that are incorporated in the
filenames for the searchable texts and associated images.
Additional identifying information appears in the TEI header
for the searchable texts. Separate documents provide detail
about the American Memory SGML DTD and the program's use of
the TEI header.
Searchable texts are delivered by the contractor as computer
files in MS-DOS format, together with documentation and a
directory of filenames. Thus filenames adhere to DOS naming
requirements.
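The DOS naming constraint can be checked mechanically. The sketch below validates names against the DOS "8.3" pattern and derives a set of linked filenames from a source-text identifier. The naming scheme shown (identifier plus a three-digit page number, with .sgm and .tif extensions) and the identifier "wpa01" are hypothetical illustrations, not the scheme American Memory specifies.

```python
import re

# DOS "8.3" pattern: 1-8 character base name, optional 1-3 character extension.
# Simplification: only letters, digits, underscore, and hyphen are accepted
# here, although DOS itself permits a few more punctuation characters.
DOS_8_3 = re.compile(r"[A-Za-z0-9_-]{1,8}(\.[A-Za-z0-9_-]{1,3})?")

def is_dos_filename(name):
    """True if the name fits the simplified 8.3 pattern above."""
    return DOS_8_3.fullmatch(name) is not None

def linked_filenames(identifier, page_count):
    """Hypothetical scheme: text '<id>.sgm', page images '<id><nnn>.tif'."""
    names = [f"{identifier}.sgm"]
    names += [f"{identifier}{page:03d}.tif" for page in range(1, page_count + 1)]
    bad = [n for n in names if not is_dos_filename(n)]
    if bad:
        raise ValueError(f"not valid DOS 8.3 names: {bad}")
    return names

print(linked_filenames("wpa01", 2))
# ['wpa01.sgm', 'wpa01001.tif', 'wpa01002.tif']
```

Because the identifier is embedded in every name, a text file and its page images can be matched at retrieval time by filename alone.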
2. Bitonal images: reproductions of manuscripts or printed
matter
a. Images of documents or line art
TIFF files with TIFF header (Intel rather than
Motorola; American Memory checks headers in
various ways, including the use of the shareware
TIFFINFO)
CCITT group 4 compression (group 3 for some early
examples)
300 dpi (as referenced to, say, a printer; may or may
not be 300 dpi as referenced to the source
document).
b. Images of illustrations. NOTE: for printed halftones,
special dithering treatment is applied at scan time to
reduce or suppress moiré patterns.
Each illustration image is created in the following
forms:
i. A "printing image," i.e., a reproduction of an
illustration (line or printed halftone) that an
end-user can print at retrieval time.
Pixel depth and size: The image is to be bitonal
(one bit deep), suitable for printing at 300 dpi
on 8½x11-inch paper. The image need not fill the
paper; the requirement is simply that, when
printed at 100 percent size, the image shall not
occupy more than one sheet of paper.
PCX (current practice) or TIFF header (preferred;
same flavor as indicated in section 2.a), with
CCITT group 3 or group 4 compression
300 dpi (as referenced to, say, a printer; may or
may not be 300 dpi as referenced to the
source document).
ii. A screen-display image (or thumbnail). Bitonal
reproduction of the illustration for
computer-screen display. The image is dithered at
scan time in the case of printed halftones.
Gray-scale is being considered for future
collections.
Current specification calls for the screen-display
image not to exceed 512 pixels horizontal by 342
pixels vertical. In the future, this may be reduced to a
smaller thumbnail size of 200x200 pixels.
c. General notes: Each image is to be placed in a
separate file. LC specifies filenames or a filenaming
approach. The filenames must relate in a logical way
to the names assigned to SGML-marked-up text files,
thus permitting the linking described above.
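The size limits given above for illustration images translate into simple pixel arithmetic: at 300 dpi, one sheet of 8.5x11-inch paper holds at most 2550x3300 pixels, and the screen-display image may not exceed 512x342 pixels. The following sketch checks both limits. It ignores printer margins and assumes portrait orientation; the function names are hypothetical.

```python
PRINT_DPI = 300
PAPER_INCHES = (8.5, 11.0)   # width, height (portrait orientation assumed)
SCREEN_LIMIT = (512, 342)    # maximum pixels: horizontal, vertical

def fits_on_one_sheet(width_px, height_px, dpi=PRINT_DPI, paper=PAPER_INCHES):
    """True if the image, printed at 100 percent size, fits on one sheet."""
    return width_px <= paper[0] * dpi and height_px <= paper[1] * dpi

def fits_screen_display(width_px, height_px, limit=SCREEN_LIMIT):
    """True if the image is within the screen-display pixel limit."""
    return width_px <= limit[0] and height_px <= limit[1]

print(fits_on_one_sheet(2400, 3000))   # True: within 2550 x 3300 pixels
print(fits_screen_display(640, 480))   # False: exceeds 512 x 342 pixels
```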
3. Gray-scale and color images: reproductions of pictorial
materials, especially continuous-tone items like
photographs.
The Library's contractor creates a film intermediate by
re-photographing materials onto 35mm or 70mm roll film. The
film serves the Library as a very-high-quality analog
archiving reproduction. The contractor digitizes the film
images to meet the specifications that follow.
Image Specifications:
UNCOMPRESSED IMAGES for reprocessing and reuse in future, improved
computer systems.
Screen resolution 640x480 (optional 1024x768 or 1280x1024)
B&W images at 8 bits per pixel (bpp)
Color images at 24 bpp.
Uncompressed
TIFF ver. 5.0 headers (see below for added content)
REFERENCE IMAGES for current retrieval system display.
Screen resolution 640x480 (optional 1024x768 or 1280x1024)
B&W images at 8 bits per pixel (bpp)
Color images at 24 bpp.
B&W images at approx 10:1 compression
Color images at approx 20:1 compression
JPEG compression
JFIF headers
THUMBNAIL IMAGES for current retrieval system display.
No dimension greater than 150 pixels
Both B&W and color thumbnails in 8 bpp.
Color palette optimized (adaptive palettes) for each image. Palette
"reserves" the colors of the standard/default Windows palette to
reduce conflict with Windows delivery software.
Uncompressed
TIFF ver. 5.0 headers (see below for added content)
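The specifications above imply approximate file sizes and thumbnail dimensions that can be worked out directly. The sketch below does the arithmetic for a 640x480 image. The compression ratios are the approximate targets listed above, so the compressed figures are estimates only; scaling the thumbnail in proportion to the original is an assumption, and the function names are hypothetical.

```python
def uncompressed_bytes(width, height, bits_per_pixel):
    """Size of the raw pixel data, excluding any file header."""
    return width * height * bits_per_pixel // 8

def compressed_estimate(width, height, bits_per_pixel, ratio):
    """Approximate size after compression at roughly ratio:1."""
    return uncompressed_bytes(width, height, bits_per_pixel) // ratio

def thumbnail_size(width, height, max_dim=150):
    """Scale so that no dimension exceeds max_dim, keeping proportions."""
    longest = max(width, height)
    if longest <= max_dim:
        return width, height
    return (max(1, width * max_dim // longest),
            max(1, height * max_dim // longest))

print(uncompressed_bytes(640, 480, 8))        # 307200 (B&W, 8 bpp)
print(compressed_estimate(640, 480, 8, 10))   # 30720  (B&W at about 10:1)
print(compressed_estimate(640, 480, 24, 20))  # 46080  (color at about 20:1)
print(thumbnail_size(640, 480))               # (150, 112)
```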
TIFF image file headers
TIFF headers. Our usual specification for gray-scale or color images
(pictorial collections) with TIFF headers includes the following tags.
In general, the "typical" or "expected" data goes in the tag; the
comments column indicates cases where the data may not be of the
expected type. In the future, we anticipate using a similar set of
tags for the TIFF headers for bitonal document-type images.
Description                Tag  Comments
NewSubfileType             254
ImageWidth                 256  actual pixel count
ImageLength                257  actual pixel count
BitsPerSample              258
Compression                259
PhotometricInterpretation  262
StripOffsets               273
SamplesPerPixel            277
RowsPerStrip               278
StripByteCounts            279
XResolution                282  actual pixel count for larger
                                images (for thumbnails, dots per
                                inch)
YResolution                283  actual pixel count for larger
                                images (for thumbnails, dots per
                                inch)
ResolutionUnit             296  1 (no unit) for larger images; 2
                                (inch) for thumbnails
DocumentName               269  name supplied by LC
Artist                     315  Library of Congress
DateTime                   306  date scanned
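A header check of the kind described in this document can be sketched with a few lines of code. The example below reads the byte order (Intel or Motorola) and the tag numbers from the first image file directory of a TIFF file and reports which of the tags listed above are absent. It is a minimal illustration under stated assumptions, not the procedure American Memory uses: it reads only the first directory, does not decode tag values, and the function names are hypothetical.

```python
import struct

# Tag numbers and names taken from the table above.
REQUIRED_TAGS = {
    254: "NewSubfileType", 256: "ImageWidth", 257: "ImageLength",
    258: "BitsPerSample", 259: "Compression",
    262: "PhotometricInterpretation", 273: "StripOffsets",
    277: "SamplesPerPixel", 278: "RowsPerStrip", 279: "StripByteCounts",
    282: "XResolution", 283: "YResolution", 296: "ResolutionUnit",
    269: "DocumentName", 315: "Artist", 306: "DateTime",
}

def read_tiff_tags(data):
    """Return (byte order, tag numbers) from the first image file directory."""
    if data[:2] == b"II":
        endian, order = "<", "Intel"
    elif data[:2] == b"MM":
        endian, order = ">", "Motorola"
    else:
        raise ValueError("not a TIFF file")
    # Bytes 2-7: magic number 42, then the offset of the first directory.
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("bad TIFF magic number")
    (entry_count,) = struct.unpack(endian + "H", data[ifd_offset:ifd_offset + 2])
    tags = []
    for i in range(entry_count):
        # Each directory entry is 12 bytes; the tag number is the first 2.
        start = ifd_offset + 2 + 12 * i
        (tag,) = struct.unpack(endian + "H", data[start:start + 2])
        tags.append(tag)
    return order, tags

def missing_tags(data):
    """Names of the required tags that the first directory lacks."""
    _, tags = read_tiff_tags(data)
    return sorted(name for num, name in REQUIRED_TAGS.items() if num not in tags)

# A tiny hand-built header with two entries, used only for demonstration.
sample = struct.pack("<2sHIH", b"II", 42, 8, 2)
sample += struct.pack("<HHII", 256, 3, 1, 640)   # ImageWidth
sample += struct.pack("<HHII", 257, 3, 1, 480)   # ImageLength
sample += struct.pack("<I", 0)                   # no further directories
print(read_tiff_tags(sample))                    # ('Intel', [256, 257])
```

For a file on disk, the same functions can be applied to the bytes returned by reading the file in binary mode.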
4. Recorded sound.
Standard digital formats for compression and networking have not
yet been established. The current World Wide Web offering
employs the WAV format with an "AU wrapper."
We anticipate that, for the immediate future, each individual
package or system will convert and load sound recordings afresh
from interim masters.
The interim masters may be analog (e.g. broadcast-standard
full-track-monaural or double-track-stereo ¼-inch tape at 7.5
inches-per-second) or standard digital form (e.g. DAT cassette),
with a list of timings and content.
5. Moving-image materials.
Standard digital formats for compression and networking have not
yet been established. The current World Wide Web offering
consists of films in the AVI format that can be displayed after
downloading using the INDEO codec. They were digitized using
INDEO version 3.2 at a low data-transfer rate (in order to keep
file sizes small).
We anticipate that, for the immediate future, each individual
package or system will probably convert and load moving-image
materials afresh from interim masters.
Since digital copies tend to be of lower resolution than NTSC
video, the current round of interim masters are analog videotapes
made from the original motion picture film (e.g.
broadcast-standard 1-inch type "C" tapes, D2 digital tapes, or
Betacam composite video, etc.). In the future, as digital-motion
quality increases, the source material may be a
motion-picture-film copy of the original or a better-than-NTSC
video master.
**********************************************************************
III. Illustrative examples of collections
1. Life History Manuscripts from the WPA Federal Writers' Project,
1936-1940
QUANTITY TYPE OF UNIT
FRAMEWORK:
ca. 10 ASCII texts in the Home Page group for background
information and special program; HTML markup
1 Menu-form finding aid; HTML markup
13 Audio clips for the special program; WAV format/AU wrapper
ca. 30 Digital pictorial images for the special program; gray-scale
JPEG images, ca. 640x480; also thumbnails
REPRODUCTIONS:
2,900 ASCII texts w/ SGML markup (TEI implementation, American
Memory DTD); avg size: 11 Kb; total all doc's: 32 Mb
22,591 Digital page images of original pages, bitonal 300 dpi; TIFF
headers, CCITT group 3 compression (most) or group 4 (a
few); av size: 68 Kb; total all doc's: 1,536 Mb
4 Digital pictorial images of originals in the collection;
gray-scale JPEG images, ca. 640x480; also thumbnails;
av size: 300 Kb
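The totals quoted for this collection follow from the item counts and average sizes. The short sketch below reproduces the arithmetic; it assumes that 1 Mb is taken as 1,000 Kb, which matches the figures given but is not stated in this document, and the function name is hypothetical.

```python
def total_megabytes(item_count, average_kb):
    """Total size in Mb, taking 1 Mb = 1,000 Kb (an assumption)."""
    return item_count * average_kb / 1000

print(round(total_megabytes(2_900, 11)))    # 32   (SGML texts)
print(round(total_megabytes(22_591, 68)))   # 1536 (bitonal page images)
```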
2. Early films of New York City, 1897-1906
QUANTITY TYPE OF UNIT
FRAMEWORK:
ca. 6 ASCII texts in the Home Page group for background
information; HTML markup
45 Bib-record form finding aid: item level MARC catalog records
REPRODUCTIONS:
45 Digital film clips (for now, INDEO format); b/w, silent; av
length 4 minutes
45 Digital still-frame thumbnail images that present a trio of
scenes from the films (for presentation with the
bibliographic record); gray-scale TIFF and GIF images,
ca. 500x150
3. Panorama photos (American places, ca. 1880-1920)
QUANTITY TYPE OF UNIT
FRAMEWORK:
ca. 10 ASCII texts in the Home Page group for background
information and special program; HTML markup
4,000 Bib-record form finding aid: item level MARC catalog records
4 Video clips for special program (not yet digitized)
ca. 50 Digital pictorial images for the special program; various
formats (mostly JPEG or TIFF)
REPRODUCTIONS:
36,000 Digital pictorial images of originals in collection; each
panorama reproduced in one "wide shot" and in 3-8
closeups; gray-scale and color JPEG images, ca.
640x480; also thumbnails
4. Ethnic Folk Music from Northern California, Recorded by a WPA
Project, 1938-1940, Part One
QUANTITY TYPE OF UNIT
FRAMEWORK:
ca. 6 ASCII texts in a main Home Page group for background
information and special program; HTML markup
ca. 10 ASCII texts in secondary-level Home Page groups; HTML markup
ca. 600 Bib-record form finding aid: four sets of item-level MARC
catalog records, each related to a secondary-level Home
Page group; figure given is approx total of all four
sets
ca. 40 Digital pictorial images for the special program; various
formats (mostly JPEG or TIFF)
REPRODUCTIONS:
393 Recorded sound selections; average length: 1.5 minutes;
range: 0.5 - 5 minutes; estimated total audio: 15
hours (current holding is in the QT compressed format
for Macintosh)
168 ASCII texts w/SGML markup (TEI implementation, American
Memory DTD); av size: 6 Kb; total all doc's: 1.3 Mb
219 Digital page images of original pages, bitonal 300 dpi; TIFF
headers, CCITT group 3 compression; av size: 51 Kb;
total all doc's: 51 Mb
1,002 Digital pictorial images: gray-scale JPEG images, ca.
640x480; also thumbnails; av size 230 Kb; total 39 Mb
45 Digital images of engineering drawings: bitonal 300 dpi at
(1) actual size and (2) reduced to print on 8.5x11-inch
paper; TIFF group 3; av size for (1), 4 Mb, for (2),
130 Kb; total 186 Mb.