[This local archive copy mirrored from the canonical site: http://www.iath.virginia.edu/iath/treport/markup.html; links may not have complete integrity, so use the canonical document at this URL if possible.]
Among the advantages of incorporating digital technology into humanities research is quick and easy access to large numbers of accurate reproductions of texts and images. This allows researchers to explore new ways of correlating and analyzing different works through the use of indexed and manipulable document files. In order to provide users of our projects with these files, we must add to the works scanned from newspapers, microfilm, slides and transparencies information files that organize the data within them into a program-readable framework. This is done by using SGML to create a customized document type definition, or DTD, to meet each project's specifications.
The various projects at the Institute incorporate digital images into their electronic archives in different ways, which call for different markup strategies. For Jerome McGann's Rossetti Archive we generated three separate DTDs, one for each type of digital image captured from microfilms of Rossetti's works. The Rossetti Archive Picture DTD allows physical, bibliographical and historical information to be entered into a header which is added onto each image file. This header allows each painting to be identified and indexed by such characteristics as dimensions, medium, model, commissioning patron and provenance. We created the Rossetti Archive Documents DTD for Rossetti's poetry. The poetry image headers include information on composition, page organization and typography, among other data. Some works in the Rossetti Archive combine textual and visual art. The headers for these files use the Rossetti Archive Works DTD.
Edward Ayers's "Valley of the Shadow" Civil War project presented a special problem. Because of the age, condition and printing style of the newspapers, we could not use OCR (Optical Character Recognition) to create machine-readable files from them. Instead we established a newspaper tagging system, in which staff members review each newspaper and create an SGML header file for each issue, and for each page within an issue. Each issue header file lists the year, month, day, paper name, state and county of origin, both within the file and as the file name itself. The issue header also contains the frequency of the paper, the name of the person creating the tags (referred to as the 'tagger') and any additional notes on the edition. Each issue of these Civil War-era newspapers contained four to eight pages. The page headers, which make up the body of the archive, list more specific information, such as the name, location and type of each article on the page; transcriptions and summaries of articles; and the full names of any people mentioned. For added convenience, we devised a system of macros for inserting issue and page header templates into files, searching for page header tags within issue headers and saving the completed headers.
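As a rough illustration of the issue header described above, the following Python sketch builds one record and derives a file name from the same fields. It is a sketch under stated assumptions, not the project's actual macros or DTD: the element names (`issuehdr`, `paper`, and so on), the file-name pattern, the `.sgm` extension and the sample values are all hypothetical.

```python
from dataclasses import dataclass, asdict
from html import escape


@dataclass
class IssueHeader:
    """One issue-level header. Field names are hypothetical, not the project's DTD."""
    year: int
    month: int
    day: int
    paper: str
    state: str
    county: str
    frequency: str
    tagger: str
    notes: str = ""

    def file_name(self) -> str:
        # Assumed naming scheme: the report says only that these fields
        # also serve as the file name, not how they are arranged.
        def slug(text: str) -> str:
            return text.lower().replace(" ", "-")

        date = f"{self.year:04d}-{self.month:02d}-{self.day:02d}"
        return f"{date}_{slug(self.paper)}_{slug(self.state)}_{slug(self.county)}.sgm"

    def to_sgml(self) -> str:
        # Emit one element per field, escaping &, < and > in the values.
        lines = ["<issuehdr>"]
        for tag, value in asdict(self).items():
            lines.append(f"  <{tag}>{escape(str(value), quote=False)}</{tag}>")
        lines.append("</issuehdr>")
        return "\n".join(lines)


# Illustrative values only; they are not taken from the archive.
header = IssueHeader(1861, 4, 19, "Example Gazette", "Virginia",
                     "Example County", "weekly", "A. Tagger")
print(header.file_name())
print(header.to_sgml())
```

Keeping the file name and the header body derived from one record is one way to ensure the two never disagree, which matters when the file name itself is used for retrieval.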
This system of SGML-based text markup allows us to provide a multi-tiered searchable archive of Civil War newspapers for this particular project. For a selected group of articles within each newspaper, the archive gives the user a choice of articles first by county and year. Choosing a year links to summaries of each article, along with its page and column location; each article title calls up a scanned rendition of the actual article. Transcripts of articles from newspapers outside the area of the project can also be retrieved. When the user enters a name into the Newspaper Searching Archive, the archive returns a list of the issues in which the name occurs, along with a line from each cited article for context. Choosing one specific year links to summaries of articles, including page, column number and title, with an additional link to the actual newspaper page itself.
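The name search described above, which returns each occurrence with a line of context, can be sketched in a few lines of Python. This is an illustration only: the `Occurrence` record, the `find_name` function and the tuple layout of the input are hypothetical, and the archive's real search engine is not documented here.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Occurrence(NamedTuple):
    issue: str    # issue identifier, e.g. its date
    page: int     # page number within the issue
    context: str  # the first line on the page that mentions the name


def find_name(pages: Iterable[Tuple[str, int, str]], name: str) -> List[Occurrence]:
    """Return one hit per page whose transcription mentions `name`.

    `pages` yields (issue, page_number, transcription) tuples, an assumed
    layout chosen for this sketch. Matching is case-insensitive.
    """
    needle = name.casefold()
    hits = []
    for issue, page, text in pages:
        for line in text.splitlines():
            if needle in line.casefold():
                hits.append(Occurrence(issue, page, line.strip()))
                break  # keep only the first matching line as context
    return hits


# Illustrative transcriptions only; they are not taken from the archive.
sample = [
    ("1861-04-19", 2, "Local news.\nMr. John Doe visited town.\n"),
    ("1861-04-26", 1, "Nothing of note this week."),
]
for hit in find_name(sample, "john doe"):
    print(hit.issue, hit.page, hit.context)
```

A simple substring match like this would miss spelling variants and initials, which is one reason tagging full names in the page headers, as the project does, is more reliable than searching raw transcriptions.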
We are currently designing a similar template and archival system for the Blake Project so that, once the archive is complete, users can access accurate reproductions of Blake's illustrated poems and commercial works and can also manipulate or juxtapose the images to examine particular features of a single work or of several works in combination. Once captured from slides and transparencies, each image receives an Image Production Header containing technical information on the equipment, the settings and any enhancements or corrections needed to produce the image in its current form. The new template system allows our Blake fellows to create their own markup for each image, as well as review and revise previously recorded header information, without any specialized knowledge of SGML.
We designed a web-based HTML interface so that the fellows can access the image header files through Netscape. The interface includes a "lock-out" feature so that no file can be accessed (and edited) by two people at the same time. Using the interface, our Blake scholars record for each image a list of features drawn from an index of artistic elements, subjects or themes. Our goal is to provide a search engine that can retrieve images according to a scheme of visual components designated by the user, an extension of research capabilities unique to electronic technology. This template design increases interaction between less technically oriented scholars and their electronic projects, and also among the scholars themselves, which generally speeds agreement on project parameters and implementation.
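One common way to build a lock-out of the kind described above is an exclusive lock file created next to the header being edited. The Python sketch below shows that general technique. It is an assumption about how such a feature could work, not a description of the Institute's interface, and the function names and the `.lock` suffix are hypothetical.

```python
import os


def acquire_lock(header_path: str, user: str) -> bool:
    """Try to claim `header_path` for editing; return False if already claimed."""
    lock_path = header_path + ".lock"
    try:
        # O_CREAT | O_EXCL makes creation fail if the lock file already
        # exists, so two editors cannot both succeed.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as lock_file:
        lock_file.write(user)  # record who holds the lock
    return True


def release_lock(header_path: str) -> None:
    """Give up the claim so that another editor can open the file."""
    os.remove(header_path + ".lock")
```

In use, the interface would call `acquire_lock` before opening a header for editing, refuse the request if it returns `False`, and call `release_lock` when the editor saves or cancels. A production system would also need a way to clear stale locks left behind by abandoned sessions.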
Creating digital image archives in accordance with standards in a field that is always expanding, such as electronic technology, is a delicate business. Imaging systems must strike a balance between providing information across as many platforms as possible and respecting the rights and wishes of those who own the images used. The Imaging Initiative at the Getty Art History Information Program (AHIP) has worked for over a decade to keep the implementation of advancing technology in the humanities in touch with these goals. Information on the debate over legal and contractual standards can be found here and at Bandwidth Conservation Society.