

Volume 4, Number 4 (February 1997)

Internet-Accessible Scholarly Resources
for the Humanities and Social Sciences

This issue focuses on the presentations of a program session on Internet-accessible scholarly resources held at the 1996 ACLS Annual Meeting.

The World Wide Web as a Resource
for Scholars and Students

by Richard C. Rockwell
Executive Director,
Inter-university Consortium for Political and Social Research (ICPSR),
University of Michigan

In this wonderful annual meeting of social scientists and humanists, I assume it's not rude to talk about data, but it might be useful to define what I mean by "data." I refer to data of three kinds. One kind is the chart, table, or graph that you're accustomed to seeing. In the social sciences, these are called aggregate data. The second kind is administrative data, data taken from the records of government: raw data from which personal information has been stripped so that you can't identify a person even by guessing well. These raw data are available for analysis and often underlie those charts, graphs, and tables. The third kind is data taken from surveys, experiments, and other tools used by the social sciences to collect data. These raw data have been widely distributed among social scientists for several generations now and have formed the foundation for much research and education.

The old technologies for disseminating those kinds of data were very labor-intensive, expensive, and slow. So when the Internet came along, there was a major motivation for the social sciences to begin to use it very quickly. Now we are beginning to see the original data collectors, the statistical agencies, and the data archives all taking a step beyond the dissemination of data, and that step is to provide access to the data. So rather than shipping the data sets across the network, you ship your query to the computer where the data reside, and the answer comes back to you over the network. If you want to see an example of how this works and what its potential is, look at the U.S. Bureau of the Census home page. At the Census home page, you'll see the capacity to do a number of analyses on census data products, as well as to retrieve packaged tables and charts. This is the trend within all the statistical agencies of the U.S. government, some of them doing the work themselves, some of them "outsourcing."
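The "ship the query, not the data" pattern described above can be sketched in a few lines of code. This is a hypothetical illustration, not the Census Bureau's actual interface: the server keeps the raw records and returns only the small aggregate the client asked for, so the data set itself never crosses the network.

```python
# Hypothetical sketch of "ship the query, not the data": the raw
# records stay on the server; only the computed answer travels.

# Raw microdata held on the remote server (invented example records).
MICRODATA = [
    {"region": "Northeast", "income": 41000},
    {"region": "Northeast", "income": 38500},
    {"region": "South", "income": 33000},
    {"region": "South", "income": 36000},
]

def answer_query(statistic, variable, group_by):
    """Server side: compute an aggregate and return it; microdata stay put."""
    groups = {}
    for record in MICRODATA:
        groups.setdefault(record[group_by], []).append(record[variable])
    if statistic == "mean":
        return {key: sum(vals) / len(vals) for key, vals in groups.items()}
    raise ValueError("unsupported statistic: " + statistic)

# Client side: ship the query across the network, get a small table back.
table = answer_query("mean", "income", "region")
print(table)  # {'Northeast': 39750.0, 'South': 34500.0}
```

The design point is the one the talk makes: the client sees only the chart-ready aggregate, never the identifiable raw records.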

In addition, as I mentioned, data collectors are beginning to provide direct access to their data. The Health and Retirement Survey and the Panel Study of Income Dynamics, each at the University of Michigan, are both offering very good, very user-friendly ways for people to get access to the data that they're producing.

At the Inter-university Consortium, we now have data from about 97% of our studies up for transmission over the network; that represents about 70% of the archive's data by volume. The archive, the largest social science archive in the world, totals a little upwards of 600 gigabytes of data. In addition we're offering new Web services, including search services that are tailored to social science data and social science data resources. We're offering analytical and extract services to the public on the National Archive of Criminal Justice Data home page, which you can find from the ICPSR home page. We'll soon expand the scope of these public analytical services to data on aging and the aged. We're developing new archives that will be of interest to particular fields. One is going to be on education internationally, and we're working very hard to establish one on the Holocaust.

In all of these things documentation is a major problem, for most documentation is still not electronic, including most documentation produced by federal agencies. What we get instead are what are called "data definition statements" for SAS and SPSS, which are two popular statistical packages. They don't document a study; they don't tell you how to use it, even for classroom use.
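What a data definition statement amounts to can be shown in a few lines: it tells a statistical package where each variable sits in the raw file, and nothing more. The layout below is invented for illustration, not any real agency's file format.

```python
# Invented fixed-width layout, of the kind a SAS or SPSS data
# definition statement encodes: variable name -> (start, end) columns.
LAYOUT = {"caseid": (0, 4), "age": (4, 6), "v23": (6, 7)}

def parse_record(line):
    """Split one raw record by column position. Note what is missing:
    nothing here says what v23 *means*, how the question was asked, or
    which codes mark missing data -- that is the job of a real codebook."""
    return {name: line[start:end].strip() for name, (start, end) in LAYOUT.items()}

print(parse_record("0042375"))  # {'caseid': '0042', 'age': '37', 'v23': '5'}
```

The sketch makes the talk's complaint concrete: such a statement lets software read the bytes, but it documents nothing about the study.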

In order to put the contents of the ICPSR archive up for use, we have to scan around 350,000 pages of material, and I agree with Susan Hockey: OCR is not the way to do that. We're likely to adopt a proprietary product, Adobe Acrobat, as a standard for scanned documents; it solves for us the problems of large image files, and it partly gets around the OCR problem. But such documents are not logically structured: they are only images, or quasi-images. You need to be able to logically discriminate the name of the principal investigator from the code for missing data from the variables or universe statement. And none of that is possible with an image-scanned file.

What we need are structured documents, so that a study can be documented to a statistical package by the "code book" itself. Accordingly, we're developing something called a Document Type Definition (DTD) for SGML (Standard Generalized Markup Language) markup of social science code books, and we'll put that Document Type Definition in the public domain. Information on the DTD is available at the Data Documentation Initiative. We're striving to be compliant with the Text Encoding Initiative (even though it wasn't really constructed for the social sciences). And we strongly hope to see federal statistical agencies use the Document Type Definition to prepare code books for their own data resources. If they do so, they will achieve a major improvement in documentation.
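The kind of logical structure such a DTD enables might look like the fragment below. This is an illustrative sketch, not the actual element set of the Data Documentation Initiative's DTD: the point is that the principal investigator, a variable, its universe, and its missing-data code each get their own named element, so software can tell them apart in a way no scanned image allows.

```sgml
<!-- Hypothetical codebook fragment; element names are illustrative -->
<codebook>
  <study>
    <title>Example Survey of Household Income</title>
    <principal-investigator>J. Doe</principal-investigator>
  </study>
  <variable name="V23">
    <label>Respondent's region of residence</label>
    <universe>All respondents aged 18 and over</universe>
    <missing-code>9</missing-code>
  </variable>
</codebook>
```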

Now, some observations on this brave new world we have entered: so much is enabled by this technology that it's sometimes easy to ignore what remains to be done. I'm going to take a perspective that could easily be identified as a "neo-Luddite" perspective. I think you'll see it's not quite that.

Recently, a faculty member at the University of Michigan said, "My students are now doing the research for their papers on the Internet and never go to the library!" That's a deeply worrisome trend, if it's widespread, because the Internet and the popular World Wide Web, the major tool on it, are really not yet mature information resources. The Web has some immensely valuable resources on it. But it also has a great deal that needs to be improved. We have very far to go before the words and the data currently available on the Web become information of the kind that we have grown accustomed to finding in our reference libraries and data archives.

Barely three years ago, the World Wide Web was the most recursive knowledge system that you could imagine: it was mostly about the World Wide Web! You'd click on a Web page and you'd get information about the World Wide Web. It had been developed at CERN in Geneva to solve a problem of internal communication, and it did that very well.

Even just three years ago, the Web was in its infancy, an amazing plaything for some enormously creative people. Now it's beginning to walk and to talk, and it's beginning to get into trouble, as toddlers often do. It has free speech issues, and there are performance issues. The development of "hot spots" in the commercial networks over the last few months is proving to be a major problem. Pricing issues will become a major topic of debate, with the old Internet ethic of "everything should be free" falling by the wayside. Nothing is free; it's just a question of who pays for it. To reach its early adolescence successfully, the Internet must become technically stable. We can't have the situation in which home pages routinely disappear and in which "404 URL Not Found" is the message that you see at least a dozen times a day. It must be predictable or reliable so that whenever you go out looking for information from a single place you get the same information back every time; it's a little disconcerting when the information changes when it ought not to do so. That means that the Web has to have authoritative content. It means the Web has to be much more functional. It means that the Web has to have better response time. Further, it needs to be priced correctly, probably including pricing of the method of access. It has to be usable with a standard set of tools, including effective search tools, so that the user does not have to learn new procedures for every site that the user visits.

What are some means to these ends? Well, we need to have review procedures for content, not just for design. We need to have Web site reviews in the places where we now will also see book reviews. We need to ensure intellectual property protection and security for cash and information transactions. I think the disciplines ought to create explicit programs for identifying needs and gaps in the Web. There is a major program required to put the intellectual resources of our libraries online. The current digital library programs are just the barest beginning of what is required, although we have already made considerable progress in putting archival data resources online.

In achieving these ends, it seems to me that there is a professional group that we can use far more effectively than it has yet been used in the development of the Web: we need to put the design and implementation of the Web much more in the hands of information specialists, not in the hands of programmers. In other words, we need people who are able to do what librarians, cataloguers, archivists, editors, publishers, and book reviewers have always done. They know a lot, and their knowledge is germane to the development of the Web.

The anarchy of the Web is one of its more attractive features. Yet the resources are not available to permit endless duplication, reinventing of the wheel, pursuit of dead ends. It ought to be possible for major Web developers to work together to achieve common goals. The Coalition for Networked Information, the Web Consortium, and ACLS itself could offer institutional settings for such collaboration.

Let me end by telling you what I think is a horror story. Web sites come and go every few seconds. The content on a given Web site can change one or two times within the course of a day. Thus it's somewhat frightening to realize that the Web is already a part of our national heritage, our cultural heritage. And as such, some aspect of its content must be preserved. Take the example of the statistical tables that will be available from the 2000 Census: The plans at the Bureau are to sharply reduce the printed publications program, preserving only the highly visible works, such as the Statistical Abstract. People will then be able to generate the tables that they need on the fly through Census Bureau Web services.

But what tables about us will then be authoritative? Will people cite a Web site as the source of their evidence? How will they document the procedures they used to generate the tables? How will they and the courts resolve disputes about the evidence?

The private sector will get involved, and different companies will have different ways of producing tables out of the Bureau's Web site. Which company's tables will we believe? Will not society spend more in the aggregate to generate these tables than it would by simply paying the Bureau to do them and to print them?

Further, are government Web sites official records of the United States? If so, they must be archived. But what do you archive? It's hard to do, because you cannot archive a Web site by freezing it for all time! Technological change would soon render it unusable. So when you archive something in this field, you have to continually change it in order to keep it the same. If we don't archive Web sites, then how will the future generation reconstruct what we thought that we knew about ourselves?



Building the Scene: Words, Images, Data, and Beyond by David Green
Electronic Texts: The Promise and the Reality by Susan Hockey
Images on the Internet: Issues and Opportunities by Jennifer Trant
The American Arts and Letters Network (AALN) by Charles Henry
The National Initiative for a Networked Cultural Heritage (NINCH) by David Green
Because It's Time: A Commentary on the Program Session by Willard McCarty
Online Scholarly Resources Mentioned in this Issue

Visit the ACLS website for further information on the American Council of Learned Societies and its publications.