[This local archive copy mirrored from the canonical site: http://www.hcrc.ed.ac.uk/dialogue/maptask.html; links may not have complete integrity, so use the canonical document at this URL if possible.]

The HCRC Map Task Corpus

Motivation for producing the Map Task Corpus

The HCRC Map Task Corpus was produced in response to one of the core problems of work on natural language: much of our knowledge of language is based on scripted materials, despite most language use taking the form of unscripted dialogue with specific communicative goals. There is, of course, good reason for this. There is no guarantee that the phenomena of theoretical interest will appear with any frequency in naturally occurring speech. Even huge corpora may fail to provide sufficient instances to support any strong claims about the phenomenon under study. In addition, there is the problem of context: critical aspects of both linguistic and extralinguistic context may be either unknown or uncontrolled. Prepared materials may lack spontaneity, but they can be designed to elicit specific examples of linguistic behaviour in controlled conditions and so ensure that particular research needs are met. Our intention, therefore, was to elicit unscripted dialogues in such a way as to boost the likelihood of occurrence of certain linguistic phenomena, and to control some of the effects of context. To this extent, while our dialogues are spontaneous, the corpus as a whole comprises a large, carefully controlled elicitation exercise. The choice of variables manipulated in the design illustrates the different interests of the researchers involved in this data gathering effort.

The current version of the map task design was intended to provide a common corpus for a vertical study of dialogue, generating material which can be discussed at levels from the acoustic to the sociolinguistic. All the relevant parameters incorporated in the design are described here.

The entire Map Task Corpus is available on CD-ROM. Some sample dialogues are available for examination.

Task Description

The Map Task (Brown, Anderson, Shillcock and Yule, 1984) is a cooperative task involving two participants. The two speakers sit opposite one another and each has a map which the other cannot see. One speaker -- designated the Instruction Giver -- has a route marked on her map; the other speaker -- the Instruction Follower -- has no route. The speakers are told that their goal is to reproduce the Instruction Giver's route on the Instruction Follower's map. The maps are not identical and the speakers are told this explicitly at the beginning of their first session. It is, however, up to them to discover how the two maps differ.

Map Design

All maps consist of landmarks -- or features -- portrayed as line drawings and labelled with their intended name. The differences in the maps result from the systematic manipulation of a design variable we refer to as sharedness: the extent to which features contrast or are shared between pairs of maps. Features were deemed common if the identical form and label appeared in the identical location on both the Giver's and Follower's map. Features which were not common differed in one of three ways:

All map routes begin with a starting point, marked on both maps, and end with a finishing point marked only on the Instruction Giver's map. Both start and end points are adjacent to a common feature, but landmarks between these points alternate in sharedness.

This manipulation of mismatches between landmarks enables us to control the information initially shared by the participants.

Since the only constraint on the range of map landmarks is the ease with which the feature can be represented graphically (that is, choice is restricted only by the ingenuity of the artist) we were able to include landmark names of phonological interest. Thus, feature names provided sites for four optional phonological reduction processes:

Landmark names also provided examples of polysyllabic words with differing metrical structure (e.g. initial S-W words like buffalo and initial W-S words like baboons).

Familiarity and Eye-Contact

In addition to the design variables relating to the maps themselves, two other variables were incorporated in the design of the corpus overall.

Subjects are necessarily paired for the task, and since the pairing is under the experimenter's control we were able to vary systematically the familiarity between the participants, by asking subjects to attend with a friend. Each pair of familiar subjects was tested in coordination with another pair who were unknown to either member of the first pair. Two pairs formed a quadruple of subjects who used among them a different set of four map-pairs, with maps being assigned to pairs by Latin Square. Each subject participated in four dialogues, twice as Instruction Giver and twice as Instruction Follower, once in each case with a familiar partner, and once with an unfamiliar partner. As Instruction Giver they gave directions on the same map, but when following they used different maps each time. Half of the subjects gave instructions to a familiar partner first, the others to an unfamiliar partner first.
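The pairing scheme for one quadruple can be sketched in code. This is only an illustration of the constraints stated above: the subject labels, map names, the particular unfamiliar pairing (a1 with b1, a2 with b2), and the one-map-per-giver assignment are all assumptions made for the example, not the actual Latin Square allocation used for the corpus. Session order is not modelled.

```python
from collections import Counter


def quadruple_dialogues(pair_a, pair_b, maps):
    """Enumerate the dialogues for one quadruple (two pairs of friends).

    Assumption: each subject gives instructions on one fixed map, once to
    their friend and once to one member of the other pair.
    """
    a1, a2 = pair_a
    b1, b2 = pair_b
    giver_map = dict(zip((a1, a2, b1, b2), maps))
    # Hypothetical partner table: who each giver instructs in each condition.
    partners = {
        a1: {"familiar": a2, "unfamiliar": b1},
        a2: {"familiar": a1, "unfamiliar": b2},
        b1: {"familiar": b2, "unfamiliar": a1},
        b2: {"familiar": b1, "unfamiliar": a2},
    }
    return [
        {"giver": giver, "follower": follower,
         "familiarity": condition, "map": giver_map[giver]}
        for giver, by_condition in partners.items()
        for condition, follower in by_condition.items()
    ]


def role_counts(dialogues):
    """Count how often each subject acts as giver and as follower."""
    givers = Counter(d["giver"] for d in dialogues)
    followers = Counter(d["follower"] for d in dialogues)
    return givers, followers


dialogues = quadruple_dialogues(("a1", "a2"), ("b1", "b2"),
                                ["map1", "map2", "map3", "map4"])
givers, followers = role_counts(dialogues)
for d in dialogues:
    print(d["giver"], "->", d["follower"], d["familiarity"], d["map"])
```

Under these assumptions each quadruple yields eight dialogues, with every subject giving twice on the same map and following twice on two different maps. Sixteen quadruples of this kind account for 64 subjects and 128 dialogues.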

The option of placing a small barrier between map task participants to prevent them from seeing each other's faces allowed us to control the availability of the visual channel for communication. Half of the subjects who took part in the task were able to make eye-contact with their partner, while the other half had no eye-contact.

Procedure

Subjects sat three or four feet apart, facing each other across a desk, with their maps placed on sloping boards, to prevent each subject seeing the other's map. Pairs of subjects were randomly assigned to one of the two "eye-contact" conditions.

After they had completed their map dialogues, subjects were asked to read a wordlist containing all the feature names from the set of maps they had encountered. Feature names appeared twice in random order, and subjects were asked to read the list slowly and carefully, aiming for a between-word interval of approximately one second. These list readings provided citation forms against which the unscripted dialogue forms could be compared.

Materials were recorded on Digital Audio Tape (Sony DTC1000ES) using one Shure SM10A close-talking microphone and one DAT channel per speaker. Split-screen video recordings were also made for half of the dialogues, capturing an almost full-face image of both subjects. Dialogues were orthographically transcribed and then checked several times against the original DAT recordings.

All sixty-four subjects who participated were undergraduates at the University of Glasgow. Sixty-one of the 64 subjects were Scottish, 56 of them having been born or brought up within a thirty-mile radius of Glasgow. Half the subjects were male, half were female, and their mean age was 20. Subjects accommodated easily to the task and experimental setting, producing unselfconscious and relatively fluent speech.

Some Corpus Statistics

The HCRC Map Task Corpus consists of 128 digitally recorded unscripted dialogues and 64 citation form readings of lists of landmark names. All dialogues were transcribed verbatim in standard orthography, including (where possible) filled pauses, false starts, hesitations, repetitions and interruptions. The sampled speech data, transcriptions, list readings, and some other ancillary material have been published for distribution on a collection of 8 CD-ROM disks.

CD-ROM Contents

The waveform data are provided in "raw" (headerless) files (16-bit samples, 20 kHz sample rate, 2 channels per conversation), and alternative header files are provided for use with software based on either the NIST "SPHERE" header structure or the European "SAM" header structure. Transcriptions are provided for each conversation, marked up with TEI-compliant SGML in a minimally intrusive and easily separated way. PostScript files of the map images used in the experiments are provided, along with full documentation of the experimental design and data collection protocol, and resources for using SGML tools on the transcriptions and other text materials. Also included is an extensive collection of source code for performing basic signal processing functions on the waveform data, such as down-sampling, de-multiplexing, channel summation, and D/A conversion for Sun workstations (including playback of segments selected via inspection of transcripts in Emacs).
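For readers without the supplied tools, the following is a minimal sketch of how a headerless file of this kind might be split into per-speaker channels. It is not the code distributed on the CD-ROMs. It assumes the samples are signed and interleaved (channel 0, channel 1, channel 0, ...), and the byte order is not stated here, so a byteswap flag is included; check the header files or the corpus documentation before relying on either assumption. The mix-down averages the two channels rather than adding them, to stay within the 16-bit range.

```python
import array

SAMPLE_RATE_HZ = 20_000   # 20 kHz, from the corpus description
BYTES_PER_SAMPLE = 2      # 16-bit samples
CHANNELS = 2              # one channel per speaker


def demultiplex(raw_bytes, byteswap=False):
    """Split interleaved 16-bit stereo data into two per-speaker channels.

    Assumes signed, interleaved samples. Set byteswap=True if the file's
    byte order differs from this machine's.
    """
    samples = array.array("h")          # signed 16-bit integers
    samples.frombytes(raw_bytes)
    if byteswap:
        samples.byteswap()
    return samples[0::CHANNELS], samples[1::CHANNELS]


def channel_mix(first, second):
    """Mix two channels into one by averaging sample pairs."""
    return array.array("h", ((a + b) // 2 for a, b in zip(first, second)))


def duration_seconds(num_bytes):
    """Duration implied by a raw file size at the stated format."""
    return num_bytes / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


# Synthetic data standing in for the contents of a raw dialogue file.
raw = array.array("h", [1, -1, 2, -2, 3, -3]).tobytes()
speaker_0, speaker_1 = demultiplex(raw)
mixed = channel_mix(speaker_0, speaker_1)
```

At the stated format, one second of dialogue occupies 20,000 × 2 × 2 = 80,000 bytes, so `duration_seconds` gives a quick estimate of a dialogue's length from its file size.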

The CD-ROMs are in High Sierra (ISO 9660) format with the Rock Ridge extensions, and are compatible with (inter alia) Unix, MS-DOS and Macintosh operating systems.

For more information, contact: maptask@cogsci.ed.ac.uk