Project Manager: Diane Vizine-Goetz, Consulting Research Scientist
Abstract
Over the past five years, librarians, humanities computing researchers, and computer scientists have been working to establish standards for encoding and accessing local and networked electronic information resources. These standards are just now being put into practice by their corresponding user communities, and their application provides opportunities for exploring synergies among the various approaches used. The OCLC Office of Research Cataloging Internet Resources project is investigating the relationship among three of these standards: the Machine-Readable Cataloging (MARC) format used by librarians, the Text Encoding Initiative (TEI) Header developed by humanities computing researchers, and the emerging Uniform Resource Citation (URC) standard for accessing materials on the World Wide Web. One result of our analysis is a prototype Web-based tool called Spectrum that enables individuals without specialized knowledge of library cataloging or markup to create records describing the bibliographic and location elements of networked electronic resources of various types.
[Note: A version of this paper will appear in the forthcoming proceedings of the 1994 Chicago World Wide Web Conference, to be published as a special issue of Elsevier Science's Computer Networks and ISDN Systems.]
The United States MARC format, TEI Header, and URC each contain elements for encoding bibliographic data for electronic resources. For details on the definition and use of URCs, see the Internet Engineering Task Force Internet Draft "URC Scenarios and Requirements" at URL: ftp://ds.internic.net/internet-drafts/draft-ietf-uri-urc-req-00.txt
In 1992, the Office of Research conducted a study of Internet resources. The first phase of the project focused on collecting and characterizing electronic textual information available through the Internet. A follow-up cataloging experiment investigated the practical and theoretical difficulties of creating USMARC format bibliographic records for networked textual information. As a consequence of the project, the USMARC field 856 (Electronic Location and Access) was developed. This field, adopted by the USMARC Advisory Group in January 1993, contains all data elements necessary to locate and access Internet resources, including a data element for Uniform Resource Locator (URL). The 856 field was implemented in January 1995.
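The subfield structure of an 856 field can be illustrated with a short Python sketch. It is not part of the project software. The subfield labels reflect our reading of the USMARC 856 definition, the function names are hypothetical, and the sample data is abbreviated from the record used as the running example in this paper.

```python
import re

# Subfield labels as understood from the USMARC 856 definition
# (an assumption of this sketch, not taken from the project software).
SUBFIELD_NAMES = {
    "a": "host name",
    "c": "compression information",
    "d": "path",
    "f": "electronic name",
    "s": "file size",
    "u": "URL",
    "z": "public note",
    "2": "access method",
}

def parse_subfields(field_body):
    """Split '$a value $d value ...' into ordered (code, value) pairs.

    Repeated subfields (for example several $f/$s pairs) are kept in order.
    Assumes that values never contain a literal '$'.
    """
    pairs = re.findall(r"\$(\w)\s+([^$]*)", field_body)
    return [(code, value.strip()) for code, value in pairs]

def values_for(pairs, code):
    """Return every value recorded under one subfield code."""
    return [value for c, value in pairs if c == code]

body = ("$a ftp.rsch.oclc.org $d ftp/pub/internet_resources_project/report "
        "$f cover.ps $s 9679 bytes $f internet.ps $s 257990 bytes "
        "$u ftp://ftp.rsch.oclc.org/pub/internet_resources_project/report")
pairs = parse_subfields(body)
for code, value in pairs:
    print(f"${code} ({SUBFIELD_NAMES.get(code, 'unknown')}): {value}")
```

Keeping the subfields as an ordered list rather than a dictionary preserves the pairing of each file name with the size that follows it.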
While the library community was assessing the viability of cataloging standards for Internet resources, humanities computing scholars were developing a general text encoding scheme for complex electronic textual structures. This effort, the Text Encoding Initiative, resulted in the 1994 publication of guidelines for encoding the intellectual content of texts: Guidelines for Electronic Text Encoding and Interchange. The TEI guidelines are an application of Standard Generalized Markup Language (SGML), an international standard for describing marked-up electronic text. The SGML standard specifies what markup is permitted and necessary and how markup is distinguished from the text. The TEI guidelines define what the markup means.
Although the TEI guidelines focus on the text markup needs of humanities scholars, chapters 5 and 24 contain detailed provisions for recording data elements important for access and bibliographic control of electronic text files. The TEI header, a mandatory part of TEI-conformant texts, contains both bibliographic and nonbibliographic elements. Freestanding headers extracted from TEI documents are called independent headers. Our research extends the application of independent TEI headers to non-TEI-encoded electronic texts. A TEI-based header for the electronic version of Assessing Information on the Internet, available at http://www.oclc.org/oclc/menu/reschdoc.htm, is shown in fig. 1.
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>Assessing Information on the Internet: Toward Providing Library Services for Computer Mediated Communication</title>
      <author>Martin Dillon</author>
      <author>Erik Jul</author>
      <author>Mark Burge</author>
      <author>Carol Hickey</author>
    </titleStmt>
    <editionStmt>NA</editionStmt>
    <extent>NA</extent>
    <publicationStmt>
      <publisher>OCLC Online Computer Library Center, Inc., Office of Research</publisher>
      <address>6565 Frantz Road Dublin, Ohio 43017-3395</address>
      <date>1994</date>
    </publicationStmt>
    <seriesStmt>NA</seriesStmt>
    <notesStmt>
      <note>856 7 $u http://www.oclc.org/oclc/menu/reschdoc.htm $z For an introductory page to an electronic version of: Assessing information on the Internet $2 http</note>
      <note>856 1 $a ftp.rsch.oclc.org $d ftp/pub/internet_resources_project/report $f cover.ps $s 9679 bytes $f internet.ps $s 257990 bytes $f appenda.ps $s 84957 bytes $f appendb.ps $s 66017 bytes $f appendc.ps $s 37973 bytes $f appendd.ps $s 46106 bytes $f appende.ps $s 351941 $u ftp://ftp.rsch.oclc.org/pub/internet_resources_project/report $z These files are in PostScript format. You may read them online if you have a PostScript viewer. Otherwise, load them to disk and print them on a PostScript printer</note>
      <note>856 1 $a ftp.rsch.oclc.org $c Must be decompressed with Unix uncompress $c Must be untarred with Unix tar -xvf $d ftp/pub/internet_resources_project/report $f report.ps.tar.Z $s 312328 bytes $u ftp://ftp.rsch.oclc.org/pub/internet_resources_project/report/report.ps.tar.Z</note>
    </notesStmt>
    <sourceDesc>
      <biblFull>
        <titleStmt>
          <author>Martin Dillon ... [et al.]</author>
          <title>Assessing information on the Internet : toward providing library services for computer-mediated communication</title>
        </titleStmt>
        <editionStmt>
          <edition>NA</edition>
        </editionStmt>
        <extent>1 v. (various pagings) : ill. ; 29 cm.</extent>
        <publicationStmt>
          <resp><role>publisher</role><name>OCLC</name></resp>
          <place>Dublin, Ohio</place>
          <idno type='OCLC'>27635027</idno>
          <date>1993</date>
        </publicationStmt>
        <sourceDesc>No source: this is an original work</sourceDesc>
      </biblFull>
    </sourceDesc>
  </fileDesc>
  <encodingDesc>NA</encodingDesc>
  <profileDesc>
    <textClass>
      <keywords scheme=LCSH>
        <list>
          <item>Internet (Computer network)</item>
          <item>Cataloging of computer files</item>
          <item>Information networks</item>
          <item>Computer networks</item>
          <item>Libraries---Communication systems</item>
          <item>Information storage and retrieval systems</item>
          <item>Library information networks</item>
        </list>
      </keywords>
      <classCode scheme=DDC20>004.67</classCode>
      <classCode scheme=LCC>TK5105.875.I57</classCode>
    </textClass>
  </profileDesc>
  <revisionDesc>NA</revisionDesc>
</teiHeader>
Fig. 1 Example TEI Header
The TEI header is composed of four major parts: file description, encoding description, text profile, and revision history. The file description is intended to serve as the electronic equivalent of the title page of a printed work and is the only mandatory element of the TEI header. TEI headers can be used by libraries and bibliographic utilities to provide access to electronic files.
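As a rough illustration of this structure, the following Python sketch reports which of the four major parts a header contains and whether the mandatory file description is present. It assumes the header has been normalized to well-formed XML (for example, with quoted attribute values), which the SGML original does not guarantee. The function names and the sample header are hypothetical.

```python
import xml.etree.ElementTree as ET

# The four major parts named in the text, as TEI element names.
HEADER_PARTS = ["fileDesc", "encodingDesc", "profileDesc", "revisionDesc"]

def header_parts(header_xml):
    """Return a dict mapping each major part name to True if it is present."""
    root = ET.fromstring(header_xml)
    if root.tag != "teiHeader":
        raise ValueError("root element is not <teiHeader>")
    present = {child.tag for child in root}
    return {part: part in present for part in HEADER_PARTS}

def has_file_description(header_xml):
    """Check only for the one mandatory part, the file description."""
    return header_parts(header_xml)["fileDesc"]

# Hypothetical, heavily abbreviated header used only for this sketch.
sample = """<teiHeader>
  <fileDesc><titleStmt><title>Example</title></titleStmt></fileDesc>
  <profileDesc/>
</teiHeader>"""

print(header_parts(sample))
print(has_file_description(sample))
```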
The Spectrum system provides a method for individuals without specialized cataloging or markup knowledge to create records describing Internet resources of various types. These records, formatted as TEI headers, MARC records, or URCs, can then be collected into a database that is searchable using conventional techniques. As shown in fig. 2, the Spectrum system performs three major tasks: record creation, database creation, and record retrieval. Our focus is on record creation; the other components build on products that are already in place at OCLC.
The user accesses a standard Web browser, such as Mosaic, connected to a standard copy of the National Center for Supercomputing Applications (NCSA) HyperText Transfer Protocol (HTTP) Server, and creates a record by filling out a series of HyperText Markup Language (HTML) forms that prompt for relevant data elements. While creating a record, the user can interactively refine it, view it in TEI, MARC, or URC format, submit it for syntactic and semantic validation, and correct problems identified by the validation process. Once the record is in the desired form, the user can submit it for inclusion in a database of Internet resources that is accessible to Web users.
Our preliminary analysis of the URC, TEI, and MARC schemes reveals that there is enough overlap among these formats to create a set of data elements from which minimal versions of all three record types can be generated. Record translations are performed by calling OCLC's SGML Document Grammar Builder, a set of software tools developed by Shafer (1993) that automatically identifies the corpus structure of SGML-tagged documents. It takes SGML-tagged input, induces the grammar, and produces whatever output is required by the application at hand. The same translator has been used to convert SGML to TeX and SGML to HTML. The translations required for creating catalog records would be difficult to perform in a general way without exploiting knowledge of the source document's grammar.
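The idea of a shared set of data elements can be sketched in Python as follows. This is a simplified illustration and not the SGML Document Grammar Builder. The dictionary keys and function names are hypothetical, the MARC output is an unencoded line-per-field display, and a URC rendering is left out because its syntax is not described here.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# A hypothetical common element set; the keys are illustrative only.
record = {
    "title": "Assessing Information on the Internet",
    "authors": ["Martin Dillon", "Erik Jul"],
    "publisher": "OCLC Online Computer Library Center, Inc., Office of Research",
    "date": "1994",
    "url": "http://www.oclc.org/oclc/menu/reschdoc.htm",
}

def to_tei(rec):
    """Render a minimal TEI-style header (file description only)."""
    authors = "".join(f"<author>{escape(a)}</author>" for a in rec["authors"])
    return (
        "<teiHeader><fileDesc>"
        f"<titleStmt><title>{escape(rec['title'])}</title>{authors}</titleStmt>"
        f"<publicationStmt><publisher>{escape(rec['publisher'])}</publisher>"
        f"<date>{escape(rec['date'])}</date></publicationStmt>"
        f"<notesStmt><note>856 7 $u {escape(rec['url'])} $2 http</note></notesStmt>"
        "</fileDesc></teiHeader>"
    )

def to_marc_lines(rec):
    """Render a minimal MARC-style display, one field per line."""
    authors = rec["authors"]
    lines = []
    if authors:
        lines.append(f"100 {authors[0]}")            # first author
    lines.append(f"245 {rec['title']}")
    lines.append(f"260 $b {rec['publisher']} $c {rec['date']}")
    lines += [f"700 {a}" for a in authors[1:]]       # remaining authors
    lines.append(f"856 7 $u {rec['url']} $2 http")
    return lines

# The TEI rendering parses as well-formed XML.
header = ET.fromstring(to_tei(record))
print(header.findtext("fileDesc/titleStmt/title"))
print("\n".join(to_marc_lines(record)))
```

Generating every format from one neutral structure avoids writing a separate converter for each pair of formats, at the cost of supporting only the elements the formats share.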
To illustrate the kinds of transformations required to generate a MARC record from a TEI record, consider some of the mappings between the TEI record and the MARC format record shown in fig. 3.
090 TK5105.875.I57
092 004.67 $2 20
100 Martin Dillon
245 Assessing information on the internet $h [computer file] : toward providing library services for computer mediated communication
260 Dublin, Ohio : $b OCLC Online Computer Library Center, Inc., Office of Research $c 1994
650 Internet (Computer network)
650 Cataloging of computer files
650 Information networks
650 Libraries---Communication systems
650 Information storage and retrieval systems
650 Library information networks
700 Erik Jul
700 Mark Burge
700 Carol Hickey
856 7 $u http://www.oclc.org/oclc/menu/reschdoc.htm
856 1 $a ftp.rsch.oclc.org $d ftp/pub/internet_resources_project/report $f cover.ps $s 9679 bytes $f internet.ps $s 257990 bytes $f appenda.ps $s 84957 bytes $f appendb.ps $s 66017 bytes $f appendc.ps $s 37973 bytes $f appendd.ps $s 46106 bytes $f appende.ps $s 351941 bytes
856 1 $a ftp.rsch.oclc.org $d ftp/pub/internet_resources_project/report $f report.ps.tar.Z $s 312328
Fig. 3 MARC-Format Record Translated from TEI Header
The 260 field, publisher data, is assembled from the <publisher>, <address>, and <date> elements in the <publicationStmt>. The 092 field, the Dewey Decimal Classification number, takes its data from the <classCode> element in the <profileDesc> section of the TEI header; the data in the $2 subfield is derived from the value of that element's scheme attribute. The 100 field, author data, comes from the first <author> element in the <titleStmt> of the <fileDesc> section of the TEI header. The remaining <author> elements in this portion of the header may be formatted as 700 fields. These <author> elements must be distinguished from the one in the <sourceDesc> subsection, which may be formatted into a 500 field (general note) in the MARC record or omitted, as in this example. The 650 fields, representing the Library of Congress Subject Headings for this record, come from the <item> elements in the <textClass> subsection. The data from these elements can be extracted in a straightforward way, but a MARC-compliant record would require consulting the Library of Congress Subject Authority File for appropriate subfield tags and codes.
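The mappings just described can be sketched in Python as follows. This is an illustration under several assumptions and not the translator used in Spectrum: the header is abbreviated and normalized to well-formed XML with quoted attributes, the <address> is simplified to a place name, the $2 value is taken to be the scheme value with its "DDC" prefix removed, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated, XML-normalized header for illustration only.
TEI = """<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>Assessing Information on the Internet</title>
      <author>Martin Dillon</author>
      <author>Erik Jul</author>
    </titleStmt>
    <publicationStmt>
      <publisher>OCLC Online Computer Library Center, Inc., Office of Research</publisher>
      <address>Dublin, Ohio</address>
      <date>1994</date>
    </publicationStmt>
    <sourceDesc>
      <biblFull><titleStmt><author>Martin Dillon ... [et al.]</author></titleStmt></biblFull>
    </sourceDesc>
  </fileDesc>
  <profileDesc>
    <textClass>
      <keywords scheme="LCSH"><list>
        <item>Internet (Computer network)</item>
        <item>Information networks</item>
      </list></keywords>
      <classCode scheme="DDC20">004.67</classCode>
      <classCode scheme="LCC">TK5105.875.I57</classCode>
    </textClass>
  </profileDesc>
</teiHeader>"""

def tei_to_marc(header_xml):
    """Produce MARC-style display lines from a TEI-style header."""
    root = ET.fromstring(header_xml)
    fields = []
    # 092: Dewey number; $2 assumed to be the scheme value minus "DDC".
    for code in root.findall("profileDesc/textClass/classCode"):
        scheme = code.get("scheme", "")
        if scheme.startswith("DDC"):
            fields.append(f"092 {(code.text or '').strip()} $2 {scheme[3:]}")
    # The explicit path excludes the <author> nested under <sourceDesc>.
    authors = [(a.text or "").strip()
               for a in root.findall("fileDesc/titleStmt/author")]
    if authors:
        fields.append(f"100 {authors[0]}")
    title = root.findtext("fileDesc/titleStmt/title", default="").strip()
    fields.append(f"245 {title}")
    # 260: joined from three separate elements of <publicationStmt>.
    place = root.findtext("fileDesc/publicationStmt/address", default="").strip()
    publisher = root.findtext("fileDesc/publicationStmt/publisher", default="").strip()
    date = root.findtext("fileDesc/publicationStmt/date", default="").strip()
    fields.append(f"260 {place} : $b {publisher} $c {date}")
    # 650: one field per subject heading.
    for item in root.findall("profileDesc/textClass/keywords/list/item"):
        fields.append(f"650 {(item.text or '').strip()}")
    # 700: the remaining authors.
    fields += [f"700 {name}" for name in authors[1:]]
    return fields

marc = tei_to_marc(TEI)
print("\n".join(marc))
```

The sketch does not attempt the authority-file lookup needed for fully MARC-compliant 650 fields.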
The automatic translation of bibliographic records from TEI to MARC formats involves a wide range of operations on tree-structured data. Data extracted from fields in the TEI record may be joined, split, inserted, or omitted. It is sometimes necessary to refer to the ancestors of the TEI tags containing data to be transferred to the MARC record. Finally, the extracted fields must be sorted to conform to cataloging and MARC format standards. Much work remains to be done to fully translate between the TEI header and the MARC format. Since few TEI headers are currently available for electronic resources, project staff work closely with institutions already engaged in projects that use the TEI header as the source for cataloging information, such as the Center for Electronic Texts in the Humanities and University of Virginia Libraries.
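Two of these operations, consulting a tag's ancestors and sorting the extracted fields, can be sketched in Python as follows. The sample header, the function names, and the choice to emit the source author as a 500 field are assumptions made for illustration. Because Python's ElementTree keeps no parent pointers, the sketch builds a child-to-parent map first.

```python
import xml.etree.ElementTree as ET

def parent_map(root):
    """Map each element to its parent (ElementTree has no parent links)."""
    return {child: parent for parent in root.iter() for child in parent}

def ancestor_tags(elem, parent_of):
    """List the tags of all ancestors of elem, nearest first."""
    tags = []
    while elem in parent_of:
        elem = parent_of[elem]
        tags.append(elem.tag)
    return tags

def author_fields(root):
    """Assign each <author> a MARC tag according to its ancestors."""
    parent_of = parent_map(root)
    fields = []
    seen_main = False
    for author in root.iter("author"):          # document order
        name = (author.text or "").strip()
        if "sourceDesc" in ancestor_tags(author, parent_of):
            fields.append(("500", name))        # general note (or omit)
        elif not seen_main:
            fields.append(("100", name))        # first author
            seen_main = True
        else:
            fields.append(("700", name))        # remaining authors
    return fields

def sort_fields(fields):
    """Order fields by numeric tag; the sort is stable, so fields with
    the same tag keep their document order."""
    return sorted(fields, key=lambda field: int(field[0]))

# Hypothetical abbreviated header, normalized to well-formed XML.
root = ET.fromstring("""<teiHeader><fileDesc>
  <titleStmt><author>Martin Dillon</author><author>Erik Jul</author></titleStmt>
  <sourceDesc><biblFull><titleStmt>
    <author>Martin Dillon ... [et al.]</author>
  </titleStmt></biblFull></sourceDesc>
</fileDesc></teiHeader>""")

unsorted_fields = author_fields(root)
sorted_fields = sort_fields(unsorted_fields)
for tag, value in sorted_fields:
    print(tag, value)
```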
As fig. 2 shows, Spectrum includes a means for storing and retrieving records. Once the user has indicated that the record is suitable for the catalog of Internet resources, the record is automatically prepared for inclusion in a full-text database using OCLC's Newton database management system. The record is then accessible to a Mosaic user via OCLC's WebZ Server, a replacement for the standard HTTP server that maintains a database session conforming to the Z39.50 information retrieval protocol. In a later implementation of the Spectrum system, the WebZ Server will be used for record creation as well as record retrieval, eliminating the NCSA HTTPD Server.
Future research efforts will focus on a Common Client Interface version of Spectrum capable of interacting with locally resident software tools, for example, generic tools such as spelling checkers, word processors, general dictionaries and thesauri, and specialized tools such as Electronic Dewey For Windows, LC Cataloger's Desktop, PRISM session, etc. Other enhancements include a module for eliciting user-supplied subject categorizations, and automatic indexing of textual electronic resources.
Dillon, M. et al. 1993. Assessing Information on the Internet: Toward Providing Library Services for Computer-mediated Communication. Dublin, Ohio: OCLC Online Computer Library Center, Inc. OCLC/OR/RR-93/1.
Shafer, Keith. 1993. SGML Grammar Structure. Annual Review of OCLC Research: July 1992-1993, 39-40. Dublin, Ohio: OCLC Online Computer Library Center, Inc.
Project Staff: Mark Bendig, Consulting Systems Analyst; Tony Ershadi, Systems Analyst; Jean Godby, Associated Research Scientist; Carol Hickey, Research Associate