ILS 605: The Making of Digital Libraries, Fall 1994
The University of Michigan, School of Information and Library Studies


Retrieving Images
from Structured Documents:
Realizing the Potential of SGML

by John Weise

Images present unique information retrieval problems that must be approached with enthusiasm and caution as new publishing standards such as SGML (Standard Generalized Markup Language) spread throughout the industry.

Publications of all types contain images. From photographs of current events to bar graphs, images are an essential part of most publications. Yet searching collections of publications for specific images is next to impossible. A person can search through library catalogs and indices of periodicals in order to locate books, articles and other materials, but generally catalogs and indices provide very little information about the images that exist in documents. As an example, consider a researcher who would like to examine how popular news magazines have portrayed cigarette smokers by looking at photographs from the past several decades. What options does the researcher have to locate such images? Periodical indices can be searched for articles or issues about smoking, but the task would invariably involve viewing thousands of pages of microfilm in order to find relevant images. Even conducting this search on recent literature, using full text in electronic form with abstracts, keywords and controlled vocabulary, would be a formidable task. The language used to provide access to the content of the literature is not likely to describe the images effectively. It is also possible that the most useful images, from the researcher's perspective, are not associated with articles about smoking at all. A high-precision search is out of the question.

Research in the area of image databases shows that text, specifically provided to facilitate image retrieval, is inadequate for conducting searches. "Historically, text-based intellectual access systems have been woefully inadequate for describing the multitude of access points from which the user might try to recall the image. Images are rich and often contain information that can be useful to researchers coming from a broad set of disciplines." (Besser)

As the world moves deeper and deeper into the digital age, computers are increasingly used to create publications of all types. Digital typesetting has been used for some time, and now photographers use digital cameras, avoiding film and furthering the ease with which digital documents are prepared. An important aspect of this evolution is the adoption of standards. Perhaps the most important is SGML (Standard Generalized Markup Language). "SGML documents have a rigorously described structure that may be analyzed by computers and easily understood by humans." (van Herwijnen) Structure is the key concept here, and structured documents have the potential to dramatically increase access to the images they contain.

What SGML Provides

The structure that SGML imposes does not affect the visual appearance of a document. Instead, the structure provides a method for internally describing documents. For instance, a magazine publisher could develop a document structure that defines all of the different elements of its publication, such as magazine title, article title, article, byline, and image caption, to name just a few. There could also be elements for identifying the editorial section, the index, and the document as a whole. This is a very simple summary of SGML, but two points need to be made. First, SGML is a very powerful standard with the potential to describe most document types. Second, and more important to this essay, it identifies the elements of a document that can be used to enhance searching.

The advantages of SGML for text alone are impressive. A digital document without a structured format can have its content parsed, but not with the control that structure can provide. With structure, for instance, it would be possible to search just the editorial section of multiple issues of the magazine.
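
The kind of section-restricted search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the markup is XML-style text standing in for SGML (so that Python's standard xml.etree.ElementTree parser can read it), and the element names issue, editorial, article, title and para, the date attribute, and the sample issues are all hypothetical rather than taken from any real document type.

```python
import xml.etree.ElementTree as ET

# Hypothetical magazine issues. The element names and the "date" attribute
# are illustrative assumptions, not part of any real document type.
ISSUES = [
    """<issue date="1994-10">
         <editorial><para>Smoking bans are spreading to restaurants.</para></editorial>
         <article><title>Budget talks</title><para>Congress debates spending.</para></article>
       </issue>""",
    """<issue date="1994-11">
         <editorial><para>The election results surprised many.</para></editorial>
         <article><title>Health</title><para>New smoking study released.</para></article>
       </issue>""",
]


def search_section(issues, section, term):
    """Return (issue date, paragraph text) pairs for paragraphs that lie
    inside a `section` element and contain `term` (case-insensitive)."""
    hits = []
    term = term.lower()
    for source in issues:
        root = ET.fromstring(source)
        for sec in root.iter(section):
            for para in sec.iter("para"):
                text = "".join(para.itertext())
                if term in text.lower():
                    hits.append((root.get("date"), text))
    return hits


# Search only the editorial sections of every issue.
editorial_hits = search_section(ISSUES, "editorial", "smoking")
print(editorial_hits)
```

The second issue also mentions smoking, but only inside an article, so a search restricted to the editorial element does not return it. An unstructured full-text search could not make that distinction.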

SGML and Image Accessibility

Regarding images, the most important contribution of SGML is its potential to provide wide-scale access to images that reside within documents of all types. However, the degree to which an image is described within a document is likely to vary. Many images don't have captions or bylines. In those cases, alternatives are to search the text of the page the image is on, or the paragraph that refers directly to the image. If given proper thought at the time the document is created, images can be logically connected to the text that describes them, whether that text is in the caption or on the next page. Furthermore, a range of text could be searched: for example, find all images that are referenced within two paragraphs of the word "cigarette."
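
A proximity search of this kind might look like the following Python sketch. It rests on several assumptions: XML-style markup stands in for SGML, paragraphs are para elements, and an image reference is an empty figref element with a target attribute. All of these names, and the sample article, are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical article. "para" and "figref" are illustrative element names.
ARTICLE = """<article>
<para>The senator spoke to reporters outside the hearing.</para>
<para>He lit a cigarette as the questions began.</para>
<para>Aides looked on nervously <figref target="fig1"/>.</para>
<para>The hearing resumed after lunch.</para>
<para>Testimony continued into the evening.</para>
<para>A vote is expected next week <figref target="fig2"/>.</para>
</article>"""


def images_near(source, word, window=2):
    """Return the targets of image references that occur within `window`
    paragraphs of any paragraph containing `word` as a whole word."""
    root = ET.fromstring(source)
    paras = list(root.iter("para"))
    # Whole-word, case-insensitive match; plural forms are not matched.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    hit_positions = [
        i for i, p in enumerate(paras)
        if pattern.search("".join(p.itertext()))
    ]
    found = []
    for i, p in enumerate(paras):
        if any(abs(i - h) <= window for h in hit_positions):
            for ref in p.iter("figref"):
                target = ref.get("target")
                if target not in found:
                    found.append(target)
    return found


nearby = images_near(ARTICLE, "cigarette")
print(nearby)
```

Here the first image is referenced one paragraph after the word and is returned, while the second is four paragraphs away and is not. Widening the window trades precision for recall, which is the same tension discussed in the next section.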

Realities of Searching for Images

Unfortunately, there are indications that the image access SGML can provide will be only marginally useful. On one hand, images that were not easily located before will become attainable. On the other hand, we return to the aggravating factor that text is a weak tool for describing images. "Both in descriptive cataloging and in providing access points, even extensive text-based descriptions of the images are seldom sufficiently descriptive for the researcher to determine which images are likely to be relevant to his or her needs. Even an enormous amount of descriptive text cannot adequately substitute for the viewing of the image itself." (Besser) This statement was made in the context of discussing image databases, but the principle is applicable to the structured document situation.

Presumably, the text of a document directly relates to its images. SGML document type structures can strengthen associations between the text and the image. In all cases, the value of the text to the image searcher will depend on the existence of similarities between the language used within the document and the language of the searcher. This potential gap could be narrowed if controlled vocabularies or classification schemes are used to describe the images within publications. However, such vocabularies and schemes are not available for many subject domains.
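
To illustrate how a controlled vocabulary can narrow the gap between the searcher's language and the document's language, here is a minimal Python sketch. The vocabulary, the captions, and the image identifiers are all invented for the example. A real scheme would be far larger and, as noted above, may not exist for a given subject domain.

```python
# Hypothetical controlled vocabulary mapping variant terms to one
# preferred term. Entries are invented for illustration.
VOCABULARY = {
    "cigarette": "tobacco smoking",
    "cigar": "tobacco smoking",
    "smoker": "tobacco smoking",
    "smoking": "tobacco smoking",
    "automobile": "motor vehicles",
    "car": "motor vehicles",
}

# Hypothetical image captions keyed by image identifier.
CAPTIONS = {
    "fig1": "A man lights a cigar on the courthouse steps.",
    "fig2": "Commuters wait beside a stalled car.",
    "fig3": "Delegates arrive at the convention.",
}


def preferred_terms(text):
    """Map each word of `text` to its preferred term, ignoring words
    that are not in the vocabulary."""
    words = (w.strip(".,;:!?\"'").lower() for w in text.split())
    return {VOCABULARY[w] for w in words if w in VOCABULARY}


def search_images(query, captions):
    """Return identifiers of images whose caption shares at least one
    preferred term with the query."""
    wanted = preferred_terms(query)
    return sorted(
        image for image, caption in captions.items()
        if wanted & preferred_terms(caption)
    )


print(search_images("smoker", CAPTIONS))
```

The query word "smoker" never appears in any caption, yet the image captioned with "cigar" is found because both words map to the same preferred term. A query word outside the vocabulary, such as "convention", finds nothing, which shows how heavily this approach depends on the coverage of the vocabulary.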

In general, the classification of images for the purpose of organization and retrieval poses problems that are difficult to resolve. This is especially true when the subject domain of the collection is broad and deep. While there is ongoing research to discover and refine methods for image retrieval based on visual attributes such as shape and color, it seems unavoidable that the retrieval of visual information in the foreseeable future will rely primarily on textual description. Howard Besser proposed that the relevance judging process for retrieved image records is greatly improved by providing the searcher with surrogate images that can be browsed. (Besser) This method could be applied to systems that search collections of SGML files.

Structured documents have the potential to provide content level access to images. To gain understanding of the true potential, research that examines the integration of images into structures, and the relationships that exist among the images and text, needs to take place. Investigation of different subject domains will likely uncover nuances of structure that are inherent to that area. As a starting point for research, the following investigations should be made.

Conclusion

Structured text provides invaluable functionality for retrieving images from a variety of subject areas. Success will not be comparable to what can be achieved with a properly developed subject domain classification scheme or controlled vocabulary; nonetheless, the advantages will be numerous. This is especially true for otherwise unorganized images. Research in this area, and continued research in the area of image databases, is required in order to obtain a better understanding of the potential of SGML as an image access medium.


Bibliography

Bell, Leslie A. (1994) Gaining Access to Visual Information: Theory, Analysis and Practice of Determining Subjects: A Review of the Literature with Descriptive Abstracts, Graduate School of Library and Information Science, University of Western Ontario

Besser, Howard. (1990) Visual Access to Visual Images: The UC Berkeley Image Database Project. Library Trends, 38(4), 787-798.

Buckland, Michael K. (1991) Information Retrieval of More Than Text. Journal of the American Society for Information Science, 42(8), 586-588.

Frost, Carolyn O., & Janes, Joseph (1993) Integrating an Image Database Into Gopher: Department of Education Research Grant Proposal, The University of Michigan, School of Information and Library Studies, Ann Arbor, MI 48109

Furuta, Richard. (1994) Defining and Using Structure in Digital Documents. Digital Libraries '94, Proceedings, Hypermedia Research Library, Department of Computer Science, Texas A&M University, College Station, TX 77843-3112.

Lynch, Clifford A. (1992) Describing and Classifying Networked Information Resources. Electronic Networking: Research, Applications and Policy Spring 1992.

Lynch, Clifford A., & Buckland, M., et al. (1992) Networking, Telecommunications, and the Networked Information Revolution. ASIS 1992 Mid-Year Meeting, Proceedings, American Society for Information Science.

Stam, Deirdre C. (1984) How Art Historians Look for Information. Art Documentation, Spring 1984, 117-119.

Stam, Deirdre C. (1989) A Quest for a Code, Or a Brief History of the Computerized Cataloging of Art Objects. Art Documentation, Spring 1989, 7-15.

van Herwijnen, Eric. (1994) Practical SGML (2nd ed.). Boston: Kluwer Academic Publishers.

Zinkham, Helena. (1994) Image Cataloging Issues: Image Collection Implementors' Workshop, Rochester, NY, September 28-29, 1994, Notes, Helena Zinkham, Prints & Photographs Division, Library of Congress.


ILS 605 - The Making of Digital Libraries
Instructor: David L. Rodgers
The University of Michigan
School of Information and Library Studies
Essay 2: First Draft
December 9, 1994

John Weise (jweise@umich.edu)