SGML: ELEMENTS OF DIGITAL ARCHIVAL COLLECTIONS

Mirrored from ftp://rs7.loc.gov/pub/american.memory/white.papers/techov.txt, March 1996




             ELEMENTS OF DIGITAL ARCHIVAL COLLECTIONS
             TECHNICAL OVERVIEW AND FORMAT DESCRIPTION



                             * * * * *



                         Carl Fleischhauer

                   Coordinator, American Memory 

                         October 27, 1994


::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

     This document is intended to present a "snapshot" of the Library 
of Congress digital conversion activity as of October 1994.  The ideas 
and approaches outlined here represent the outcome of the five-year 
American Memory pilot program (1990-1994).  Although the elements 
described in this document will guide the Library as it plans its 
long-term digitizing effort, the institution recognizes that many 
avenues remain unexplored and that new technology will continue to 
lead to changing practices.

     Interested readers are encouraged to also refer to the American 
Memory "white papers" provided at this ftp (ftp.loc.gov) or gopher 
(marvel.loc.gov) site:

     Reproduction-Quality Issues in a Digital-Library System
     Bibliographic Records in a Full-content Access System
     Frameworks and Finding Aids: Organizing Digital Archival 
          Collections (supplement to the paper on bibliographic 
          records; to be added during November 1994)


:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::


I.   What Kinds of "Items" Make Up a Digital Historical Collection?

     Digital historical collections are tremendously varied; some 
actual examples are described in Section III.  Broadly speaking, each 
collection consists of two major elements, each of which consists of a 
variety of sub-elements:


     Framework      An assemblage of title or "home" pages, finding 
                    aid(s), and associated information.

               Finding aid         Either an online register 
                                   (directory of holdings) or a 
                                   collection-specific database of 
                                   brief bibliographic records.

               Background information
                                   Explanatory texts, scope and 
                                   content notes, chronologies, and 
                                   bibliographies that provide users 
                                   with a context for the collection.

               Special programs    Interpretive texts and/or 
                                   interactive programs that introduce 
                                   collections to users, explaining 
                                   what they contain and how they may 
                                   be used (optional)


     Reproductions  Digital images, texts, sound recordings, or 
                    moving-image files that reproduce original items 
                    at the Library of Congress.  Detailed 
                    specifications are provided in Section II.


**********************************************************************


II.  Technical description of items in collections.

A.   Background

     A collection in its raw state consists of the types of items 
listed below.  Framework items (finding aids, background information, 
and special programs) are produced by Library of Congress staff.  
Reproductions, for the most part, are produced by specialist 
contractors.  Readers are warned that the formats listed below are 
subject to change as the Library's approach to digitization evolves.

     All of the raw materials are (or should be) archived in their 
production formats, as listed below.  In a perfect world, all of these 
formats would adhere to well-established "true" or "industry" 
standards.  Alas, the world is not perfect and the list below reveals 
that the Library is using a combination of true standards, widespread 
industry standards, and temporary solutions.  The most stable formats 
are those used for cataloging, SGML-marked-up machine-readable texts, 
and non-moving images (reproductions of documents and pictorial 
materials).  The least stable formats are those for time-based items, 
i.e., recorded sound and moving images.

     The raw materials are also ready to be assembled and loaded into 
retrieval software, thus forming the finished collection.  At this 
writing, our main focus for assembly and retrieval is the Library of 
Congress World Wide Web server and we are loading our content there.  
In the recent past, however, the Library also produced CD-ROM disks 
for both Macintosh and IBM-compatible systems.  In the future, the 
same content may be loaded into different server systems and/or other 
media.  At the time of assembly and loading, some of the archived data 
must be converted into different formats, selected to fit the 
particular delivery system at hand.

     The list that follows identifies the production and archiving 
formats.  For recorded sound and moving-image collections, however, 
these notes also highlight the interim formats selected for World Wide 
Web delivery.


B.   Framework items

     1.   Finding aids

          i.   Online register

          This type of finding aid has not yet been realized at the 
          Library of Congress; current plans call for adoption of the 
          SGML scheme developed by Daniel Pitti at the Bancroft 
          Library, University of California, Berkeley.

          ii.  Collection-specific bibliographic-record database

          Collection databases contain MARC catalog records that 
          follow AACR2 approaches for minimal-level cataloging (up to 
          a point) and include fields like 856 and some additional 
          local fields.  More details are provided in other American 
          Memory documents.

     2.   Background information 

          Searchable texts are coded (for now) with HyperText Markup 
          Language (HTML).

     3.   Special programs

          The core element is typically browseable texts coded (for 
          now) with HyperText Markup Language (HTML).  Various images 
          or audio clips may be attached to these texts; the formats 
          for these will, of course, be similar to the formats used 
          for collection-item reproductions.


C.   Reproductions

     1.   ASCII texts with SGML markup: searchable reproductions of 
          textual materials

          American Memory's contractor requirement is 99.95 percent 
          accurate transcribed texts (when compared to the original 
          item).  The texts are coded with Standard Generalized Markup 
          Language (SGML).  The American Memory SGML document type 
          definition (DTD) conforms to the guidelines of the 
          international-standard Text Encoding Initiative (TEI). 
          The SGML markup indicates the location and number of each 
          illustration for an illustrated work, as well as the page 
          images for most works.  This coding permits links to be made 
          to images (generally by file name), thus reuniting text, 
          illustration, and/or page image at retrieval time.
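
          To give a sense of scale for the 99.95 percent accuracy 
          requirement, the following minimal Python sketch computes 
          how many wrong characters a transcription of a given length 
          could contain and still meet the requirement.  The function 
          name is hypothetical, and the sketch assumes that accuracy 
          is counted per character; this document does not state how 
          the contractor's accuracy is actually measured.

```python
from fractions import Fraction


def max_allowed_errors(char_count, accuracy_percent="99.95"):
    """Largest whole number of wrong characters that still meets the target.

    Assumption (not from the source): accuracy is counted per character.
    Fraction is used so that 99.95 percent is represented exactly.
    """
    error_rate = 1 - Fraction(accuracy_percent) / 100  # 1/2000 for 99.95
    return int(char_count * error_rate)  # truncate to a whole character count


if __name__ == "__main__":
    # A text of 10,000 characters may contain at most 5 errors.
    print(max_allowed_errors(10_000))
```

          Under that assumption the requirement works out to one 
          permitted error per 2,000 characters.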

          The original source texts are given identifiers 
          (alphanumeric strings) that are incorporated in the 
          filenames for the searchable texts and associated images.  
          Additional identifying information appears in the TEI header 
          for the searchable texts.  Separate documents provide detail 
          about the American Memory SGML DTD and the program's use of 
          the TEI header.

          Searchable texts are delivered by the contractor as computer 
          files in MS-DOS format, together with documentation and a 
          directory of filenames.  Thus filenames adhere to DOS naming 
          requirements.

     2.   Bitonal images: reproductions of manuscripts or printed 
          matter

          a.   Images of documents or line art

               TIFF files with TIFF header (Intel rather than 
                    Motorola; American Memory checks headers in 
                    various ways, including the use of the shareware 
                    TIFFINFO)
               CCITT group 4 compression (group 3 for some early 
                    examples)
               300 dpi (as referenced to, say, a printer; may or may 
                    not be 300 dpi as referenced to the source 
                    document).
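
               The Intel-versus-Motorola distinction is recorded in 
               the first bytes of every TIFF file: "II" marks Intel 
               (little-endian) byte order, "MM" marks Motorola 
               (big-endian) byte order, and the next two bytes hold 
               the number 42 in that byte order.  The Python sketch 
               below shows one way such a check could be written.  It 
               is an illustration only and is not the TIFFINFO tool or 
               American Memory's actual procedure; the function name 
               is hypothetical.

```python
import struct


def tiff_byte_order(header: bytes) -> str:
    """Return 'Intel' or 'Motorola' from the first four bytes of a TIFF file."""
    if len(header) < 4:
        raise ValueError("need at least the first 4 bytes of the file")
    mark = header[:2]
    if mark == b"II":
        fmt, name = "<H", "Intel"       # little-endian
    elif mark == b"MM":
        fmt, name = ">H", "Motorola"    # big-endian
    else:
        raise ValueError("no TIFF byte-order mark found")
    (magic,) = struct.unpack(fmt, header[2:4])
    if magic != 42:
        raise ValueError("TIFF magic number 42 not found")
    return name


if __name__ == "__main__":
    print(tiff_byte_order(b"II*\x00"))  # Intel
```

               To check a file, pass in its first four bytes, for 
               example the result of open(path, "rb").read(4).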

          b.   Images of illustrations.  NOTE: for printed halftones, 
               special dithering treatment is applied at scan time to 
               reduce or suppress moiré patterns.

               Each illustration image is created in the following 
               forms:

               i.   A "printing image," i.e., a reproduction of an 
                    illustration (line or printed halftone) that an 
                    end-user can print at retrieval time.

                    Pixel depth and size:  The image is to be bitonal 
                    (one bit deep), suitable for printing at 300 dpi 
                    on 8½x11-inch paper.  The image need not fill the 
                    paper; the requirement is simply that, when 
                    printed at 100 percent size, the image shall not 
                    occupy more than one sheet of paper.  

                    PCX (current practice) or TIFF header (preferred; 
                         same flavor as indicated in item 2.a above), 
                         with CCITT group 3 or group 4 compression
                    300 dpi (as referenced to, say, a printer; may or 
                         may not be 300 dpi as referenced to the 
                         source document).

               ii.  A screen-display image (or thumbnail).  Bitonal 
                    reproduction of the illustration for 
                    computer-screen display.  The image is dithered at 
                    scan time in the case of printed halftones.  
                    Gray-scale is being considered for future 
                    collections.

                    Current specification calls for the screen-display 
                    image not to exceed 512 pixels horizontal by 342 
                    pixels vertical.  In the future, this may be 
                    reduced to a smaller thumbnail size of 200x200 
                    pixels.
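
               The pixel limits implied by the printing-image and 
               screen-display specifications can be worked out with a 
               little arithmetic.  The Python sketch below is 
               illustrative only; the function names are hypothetical, 
               and the rule of scaling down while preserving the 
               aspect ratio is an assumption, since the specification 
               states only the maximum dimensions.

```python
def page_pixel_limits(dpi=300, width_in=8.5, height_in=11):
    """Pixel dimensions of one full sheet at the given printer resolution."""
    return round(dpi * width_in), round(dpi * height_in)


def fit_within(width, height, max_width, max_height):
    """Scale (width, height) down to fit the box, keeping the aspect ratio.

    Images that already fit are returned unchanged (never enlarged).
    """
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, int(width * scale)), max(1, int(height * scale))


if __name__ == "__main__":
    print(page_pixel_limits())                # (2550, 3300)
    print(fit_within(1024, 768, 512, 342))    # (456, 342)
    print(fit_within(1024, 768, 200, 200))    # (200, 150)
```

               A printing image therefore may not exceed 2550 by 3300 
               pixels if it is to fit on one sheet at 300 dpi.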

          c.   General notes:  Each image is to be placed in a 
               separate file.  LC specifies filenames or a filenaming 
               approach.  The filenames must relate in a logical way 
               to the names assigned to SGML-marked-up text files, 
               thus permitting the linking described above.
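
               As an illustration of how filenames can relate text 
               files to image files, the Python sketch below derives a 
               set of DOS-style (8.3) filenames from a single item 
               identifier.  The naming pattern, the .SGM and .TIF 
               extensions, and the function names are all hypothetical 
               and are not the Library's actual scheme; the sketch 
               shows only the general idea of a shared identifier.

```python
import re

# Conservative subset of the characters DOS allows in an 8.3 filename.
DOS_NAME = re.compile(r"[A-Z0-9_-]{1,8}\.[A-Z0-9]{1,3}")


def is_dos_filename(name):
    """True if name fits the 8.3 pattern (checked case-insensitively)."""
    return DOS_NAME.fullmatch(name.upper()) is not None


def related_files(identifier, page_count):
    """Hypothetical scheme: <id>.SGM for the text, <id><nnn>.TIF per page."""
    if not 1 <= len(identifier) <= 5:
        raise ValueError("identifier must be 1-5 characters to leave room "
                         "for a 3-digit page number")
    if not 0 <= page_count <= 999:
        raise ValueError("page number must fit in 3 digits")
    text_file = f"{identifier}.SGM"
    page_files = [f"{identifier}{n:03d}.TIF" for n in range(1, page_count + 1)]
    return text_file, page_files


if __name__ == "__main__":
    text_file, page_files = related_files("AB123", 2)
    print(text_file, page_files)
```

               Because every name begins with the same identifier, 
               retrieval software can find the page images for a text 
               by filename alone.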

     3.   Gray-scale and color images: reproductions of pictorial 
          materials, especially continuous-tone items like 
          photographs.

          The Library's contractor creates a film intermediate by 
          re-photographing materials onto 35mm or 70mm roll film.  The 
          film serves the Library as a very-high-quality analog 
          archiving reproduction.  The contractor digitizes the film 
          images to meet the specifications that follow.

Image Specifications: 

UNCOMPRESSED IMAGES for reprocessing and reuse in future, improved 
     computer systems.
Screen resolution 640x480 (optional 1024x768 or 1280x1024)
B&W images at 8 bits per pixel (bpp) 
Color images at 24 bpp.
Uncompressed
TIFF ver. 5.0 headers (see below for added content)

REFERENCE IMAGES for current retrieval system display.
Screen resolution 640x480 (optional 1024x768 or 1280x1024)
B&W images at 8 bits per pixel (bpp) 
Color images at 24 bpp.
B&W images at approx 10:1 compression 
Color images at approx 20:1 compression 
JPEG compression
JFIF headers 
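
The figures above imply approximate file sizes, which the following 
Python sketch works out for a 640x480 image.  It is a rough 
illustration with hypothetical function names; the compression ratios 
are the approximate targets stated above, and real JPEG file sizes 
vary with image content.

```python
def uncompressed_bytes(width, height, bits_per_pixel):
    """Size of the raw pixel data, ignoring headers."""
    return width * height * bits_per_pixel // 8


def compressed_estimate(width, height, bits_per_pixel, ratio):
    """Approximate size in bytes after compression at the given ratio."""
    return uncompressed_bytes(width, height, bits_per_pixel) / ratio


if __name__ == "__main__":
    print(uncompressed_bytes(640, 480, 8))        # 307200 (B&W)
    print(uncompressed_bytes(640, 480, 24))       # 921600 (color)
    print(compressed_estimate(640, 480, 8, 10))   # 30720.0
    print(compressed_estimate(640, 480, 24, 20))  # 46080.0
```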

THUMBNAIL IMAGES for current retrieval system display.
No dimension greater than 150 pixels
Both B&W and color thumbnails in 8 bpp.  
Color palette optimized (adaptive palettes) for each image.  Palette 
     "reserves" the colors of the standard/default Windows palette to 
     reduce conflict with Windows delivery software.
Uncompressed
TIFF ver. 5.0 headers (see below for added content)

TIFF image file headers

TIFF headers.  Our usual specification for gray-scale or color images 
(pictorial collections) with TIFF headers includes the following tags.  
In general, the "typical" or "expected" data goes in the tag; the 
comments column indicates cases where the data may not be of the 
expected type.  In the future, we anticipate using a similar set of 
tags for the TIFF headers for bitonal document-type images.

Description              Tag       Comments
NewSubfileType           254
ImageWidth               256       actual pixel count
ImageLength              257       actual pixel count
BitsPerSample            258
Compression              259
PhotometricInterpretation 262
StripOffsets             273
SamplesPerPixel          277
RowsPerStrip             278
StripByteCounts          279
XResolution              282       actual pixel count for larger 
                                   images (for thumbnails, dots per 
                                   inch)
YResolution              283       actual pixel count for larger 
                                   images (for thumbnails, dots per 
                                   inch)
ResolutionUnit           296       1 (no unit) for larger images; 2 
                                   (inch) for thumbnails
DocumentName             269       name supplied by LC
Artist                   315       Library of Congress
DateTime                 306       date scanned   
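
A file can be checked against the tag list above by reading the tag 
numbers stored in its first image file directory (IFD).  The Python 
sketch below is a minimal illustration and not the Library's actual 
checking procedure.  The function names are hypothetical, and the 
sketch reads only tag numbers, not tag values.

```python
import struct

# Tag numbers and names taken from the table above.
REQUIRED_TAGS = {
    254: "NewSubfileType", 256: "ImageWidth", 257: "ImageLength",
    258: "BitsPerSample", 259: "Compression",
    262: "PhotometricInterpretation", 269: "DocumentName",
    273: "StripOffsets", 277: "SamplesPerPixel", 278: "RowsPerStrip",
    279: "StripByteCounts", 282: "XResolution", 283: "YResolution",
    296: "ResolutionUnit", 306: "DateTime", 315: "Artist",
}


def first_ifd_tags(data: bytes) -> set:
    """Return the set of tag numbers present in the first IFD of a TIFF."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("TIFF magic number 42 not found")
    (count,) = struct.unpack(order + "H", data[ifd_offset:ifd_offset + 2])
    tags = set()
    for i in range(count):
        start = ifd_offset + 2 + 12 * i   # each IFD entry is 12 bytes
        (tag,) = struct.unpack(order + "H", data[start:start + 2])
        tags.add(tag)
    return tags


def missing_tags(data: bytes) -> list:
    """Names of the listed tags that the file's first IFD does not contain."""
    present = first_ifd_tags(data)
    return sorted(name for tag, name in REQUIRED_TAGS.items()
                  if tag not in present)
```

To check a file, pass its full contents to missing_tags, for example 
the result of open(path, "rb").read(); an empty list means every 
listed tag is present.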


4.   Recorded sound.

     Standard digital formats for compression and networking have not 
     yet been established.  The current World Wide Web offering 
     employs the WAV format with an "AU wrapper." 

     We anticipate that, for the immediate future, each individual 
     package or system will convert and load sound recordings afresh 
     from interim masters.

     The interim masters may be analog (e.g. broadcast-standard 
     full-track-monaural or double-track-stereo ¼-inch tape at 7.5 
     inches-per-second) or standard digital form (e.g. DAT cassette), 
     with a list of timings and content.

5.   Moving-image materials.

     Standard digital formats for compression and networking have not 
     yet been established.  The current World Wide Web offering 
     consists of films in the AVI format that can be displayed after 
     downloading using the INDEO codec.  They were digitized using 
     INDEO version 3.2 at a low data-transfer rate (in order to keep 
     file sizes small).

     We anticipate that, for the immediate future, each individual 
     package or system will probably convert and load moving-image 
     materials afresh from interim masters.

     Since digital copies tend to be of lower resolution than NTSC 
     video, the current round of interim masters consists of analog 
     videotapes made from the original motion picture film (e.g. 
     broadcast-standard 1-inch type "C" tapes, D2 digital tapes, or 
     Betacam composite video).  In the future, as digital-motion 
     quality increases, the source material may be a 
     motion-picture-film copy of the original or a better-than-NTSC 
     video master.


**********************************************************************


III. Illustrative examples of collections

1.   Life History Manuscripts from the WPA Federal Writers' Project, 
     1936-1940

QUANTITY  TYPE OF UNIT

          FRAMEWORK:
ca. 10    ASCII texts in the Home Page group for background 
               information and special program; HTML markup
     1    Menu-form finding aid; HTML markup
    13    Audio clips for the special program; WAV format/AU wrapper
ca. 30    Digital pictorial images for the special program; gray-scale 
               JPEG images, ca. 640x480; also thumbnails

          REPRODUCTIONS:
 2,900    ASCII texts w/ SGML markup (TEI implementation, American 
               Memory DTD); avg size: 11 Kb; total all doc's: 32 Mb
22,591    Digital page images of original pages, bitonal 300 dpi; TIFF 
               headers, CCITT group 3 compression (most) or group 4 (a 
               few); av size: 68 Kb; total all doc's: 1,536 Mb
     4    Digital pictorial images of originals in the collection; 
               gray-scale JPEG images, ca. 640x480; also thumbnails; 
               av size: 300 Kb
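
The totals given for this collection follow from the item counts and 
average sizes, as the short Python sketch below shows.  It assumes 
that "Kb" and "Mb" denote kilobytes and megabytes and that 1 Mb equals 
1,000 Kb; the function name is hypothetical.

```python
def total_megabytes(count, avg_kilobytes, kb_per_mb=1000):
    """Total storage in megabytes for `count` files of the given average size."""
    return count * avg_kilobytes / kb_per_mb


if __name__ == "__main__":
    print(round(total_megabytes(2_900, 11)))    # 32   (SGML texts)
    print(round(total_megabytes(22_591, 68)))   # 1536 (page images)
```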


2.   Early films of New York City, 1897-1906 

QUANTITY  TYPE OF UNIT

          FRAMEWORK:
 ca. 6    ASCII texts in the Home Page group for background 
               information; HTML markup
    45    Bib-record form finding aid: item level MARC catalog records

          REPRODUCTIONS:
    45    Digital film clips (for now, INDEO format); b/w, silent; av 
               length 4 minutes
    45    Digital still-frame thumbnail images that present a trio of 
               scenes from the films (for presentation with the 
               bibliographic record) gray-scale TIFF and GIF images, 
               ca. 500x150


3.   Panorama photos (American places, ca. 1880-1920)

QUANTITY  TYPE OF UNIT

          FRAMEWORK:
 ca. 10   ASCII texts in the Home Page group for background 
               information and special program; HTML markup
  4,000   Bib-record form finding aid: item level MARC catalog records
      4   Video clips for special program (not yet digitized)
 ca. 50   Digital pictorial images for the special program; various 
               formats (mostly JPEG or TIFF)

          REPRODUCTIONS:
 36,000   Digital pictorial images of originals in collection; each 
               panorama reproduced in one "wide shot" and from 3-8 
               closeups; gray-scale and color JPEG images, ca. 
               640x480; also thumbnails


4.   Ethnic Folk Music from Northern California, Recorded by a WPA 
     Project, 1938-1940, Part One

QUANTITY  TYPE OF UNIT

          FRAMEWORK:
  ca. 6   ASCII texts in a main Home Page group for background 
               information and special program; HTML markup
 ca. 10   ASCII texts in secondary-level Home Page groups; HTML markup
ca. 600   Bib-record form finding aid: four sets of item-level MARC 
               catalog records, each related to a secondary-level Home 
               Page group; figure given is approx total of all four 
               sets
 ca. 40   Digital pictorial images for the special program; various 
               formats (mostly JPEG or TIFF)

          REPRODUCTIONS:
   393    Recorded sound selections; average length:  1.5 minutes; 
               range:  0.5 - 5 minutes; estimated total audio: 15 
               hours (current holding is in the QT compressed format 
               for Macintosh)
   168    ASCII texts w/SGML markup (TEI implementation, American 
               Memory DTD); av size: 6 Kb; total all doc's: 1.3 Mb
   219    Digital page images of original pages, bitonal 300 dpi; TIFF 
               headers, CCITT group 3 compression; av size: 51 Kb; 
               total all doc's: 51 Mb
  1,002   Digital pictorial images:  gray-scale JPEG images, ca. 
               640x480; also thumbnails; av size 230 Kb; total 39 Mb
     45   Digital images of engineering drawings:  bitonal 300 dpi at 
               (1) actual size and (2) reduced to print on 8.5x11-inch 
               paper; TIFF group 3; av size for (1), 4 Mb, for (2), 
               130 Kb; total 186 Mb.