[This local archive copy mirrored from the canonical site: http://www.hcrc.ed.ac.uk/dialogue/maptask.html; links may not have complete integrity, so use the canonical document at this URL if possible.]
The HCRC Map Task Corpus
Motivation for producing the Map Task Corpus
The HCRC Map Task Corpus was produced in response to one of the core
problems of work on natural language: much of our knowledge of
language is based on scripted materials, despite most language use
taking the form of unscripted dialogue with specific communicative
goals. There is, of course, good reason for this. There is no
guarantee that the phenomena of theoretical interest will appear with
any frequency in naturally occurring speech. Even huge corpora may
fail to provide sufficient instances to support any strong
claims about the phenomenon under study. In addition there is the
problem of context: critical aspects of both linguistic and
extralinguistic context may be either unknown or uncontrolled.
Prepared materials may lack spontaneity but will be designed to elicit
specific examples of linguistic behaviour in controlled conditions and
consequently ensure that the particular research needs are met. Our
intention, therefore, was to elicit unscripted dialogues in such a way
as to boost the likelihood of occurrence of certain linguistic
phenomena, and to control some of the effects of context. To this
extent while our dialogues are spontaneous, the corpus as a whole
comprises a large, carefully controlled elicitation exercise. The
choice of variables manipulated in the design illustrates the
different interests of the researchers involved in this data gathering
effort.
The current version of the map task design was intended to provide a
common corpus for a vertical study of dialogue, generating material
which can be discussed at levels from the acoustic to the
sociolinguistic. All the relevant parameters incorporated in the
design are described here.
The entire Map Task Corpus is available on CD-ROM. Some sample
dialogues are available for examination.
Task Description
The Map Task (Brown, Anderson, Shillcock and Yule, 1984) is a
cooperative task involving two participants.
The two speakers sit opposite one another and each has a map which the
other cannot see.
One speaker -- designated the Instruction Giver --
has a route marked on her map;
the other speaker -- the Instruction Follower -- has no route.
The speakers are told that their goal is to reproduce the Instruction
Giver's route on the Instruction Follower's map.
The maps are not identical and the speakers are told this explicitly
at the beginning of their first session.
It is, however, up to them to discover how the two maps differ.
Map Design
All maps consist of landmarks -- or features --
portrayed as line drawings and labelled with their intended name.
The differences in the maps result from the systematic manipulation of
a design variable we refer to as sharedness:
the extent to which features contrast or are shared between pairs of
maps.
Features were deemed common if the
identical form and label appeared in the identical location on both
the Giver's and the Follower's maps.
Features which were not common differed in one of three ways:
- Absent/Present features were found on one map but not the other;
- Name Change features were identical in form and location but
had different labels on the two maps;
- 2:1 features appeared twice on the Giver's map, once in a
position close to the route and once more distant, while the
Follower's map showed only the distant, irrelevant one.
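The sharedness categories above can be expressed as a small classification rule. The following Python sketch is illustrative only: the map representation (a dictionary from a location to a form/label pair), the function name classify_features, the coordinates, and the label "green fields" are assumptions made for the example and are not taken from the corpus materials.

```python
from collections import Counter


def classify_features(giver, follower):
    """Label each location with a sharedness category.

    giver, follower: dicts mapping a location to a (form, label) pair.
    This representation is hypothetical, chosen only for the sketch.
    """
    giver_counts = Counter(giver.values())
    follower_counts = Counter(follower.values())
    result = {}
    for loc in sorted(set(giver) | set(follower)):
        g = giver.get(loc)
        f = follower.get(loc)
        feature = g or f
        if giver_counts[feature] == 2 and follower_counts[feature] == 1:
            # Same feature drawn twice for the Giver, once for the Follower.
            result[loc] = "2:1"
        elif g is None or f is None:
            result[loc] = "absent/present"
        elif g == f:
            result[loc] = "common"
        elif g[0] == f[0]:
            # Same form and location, different label.
            result[loc] = "name change"
        else:
            result[loc] = "unclassified"
    return result


# Toy maps; coordinates and "green fields" are invented for illustration.
giver_map = {
    (0, 0): ("tree", "chestnut tree"),
    (1, 1): ("gate", "broken gate"),
    (2, 2): ("field", "reclaimed fields"),
    (3, 3): ("meadow", "vast meadow"),   # copy close to the route
    (5, 5): ("meadow", "vast meadow"),   # distant copy
}
follower_map = {
    (0, 0): ("tree", "chestnut tree"),
    (2, 2): ("field", "green fields"),
    (5, 5): ("meadow", "vast meadow"),
}
categories = classify_features(giver_map, follower_map)
```

In this toy example both copies of the duplicated feature are labelled "2:1", so the distant copy is not mistaken for a common feature even though it matches on both maps.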
All map routes begin with a starting point, marked on both maps, and
end with a finishing point marked only on the Instruction Giver's map.
Both start and end points are adjacent to a common feature
but landmarks between these points alternate in sharedness.
This manipulation of mismatches between landmarks enables us to
control the information initially shared by the participants.
Since the only constraint on the range of map landmarks is the ease
with which the feature can be represented graphically
(that is, choice is restricted only by the ingenuity of the artist)
we were able to include landmark names of phonological interest.
Thus, feature names provided sites for four optional phonological
reduction processes:
- /t/-deletion eg vast meadow
- /d/-deletion eg reclaimed fields
- glottalisation eg chestnut tree
- nasal assimilation eg broken gate
Landmark names also provided examples of polysyllabic words with
differing metrical structure (eg initial S-W words like buffalo
and initial W-S words like baboons).
Familiarity and Eye-Contact
In addition to the design variables relating to the maps themselves,
two other variables were incorporated in the design of the corpus
overall.
Subjects are necessarily paired for the task, and since the pairing is
under the experimenter's control we were able to vary systematically
the familiarity between the participants by asking subjects to attend
with a friend. Each pair of familiar subjects was tested in coordination
with another pair who were unknown to either member of the first pair.
Two pairs formed a quadruple of subjects who used among them a
different set of four map-pairs, with maps being assigned to pairs by
Latin Square.
Each subject participated in four dialogues, twice as Instruction
Giver and twice as Instruction Follower, once in each case with a
familiar partner, and once with an unfamiliar partner.
As Instruction Giver they gave directions on the same map both times,
but when following they used a different map each time.
Half of the subjects gave instructions to a familiar partner first,
the others to an unfamiliar partner first.
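The role and familiarity constraints for one quadruple can be checked with a short sketch. The Python below is one possible schedule, not the experimenters' actual assignment: the subject and map identifiers are hypothetical, the cross-pair matching (first member with first member, second with second) is an assumption, and a fixed map order stands in for the Latin Square, whose details are not given here. Session order is not modelled.

```python
from collections import Counter


def quadruple_schedule(pair_a, pair_b, maps):
    """Return 8 (giver, follower, map, familiarity) dialogues.

    pair_a, pair_b: two-element tuples of friends.
    maps: four map-pair identifiers, one per Instruction Giver.
    The matching of strangers across pairs is an assumption.
    """
    subjects = list(pair_a) + list(pair_b)
    friend = {pair_a[0]: pair_a[1], pair_a[1]: pair_a[0],
              pair_b[0]: pair_b[1], pair_b[1]: pair_b[0]}
    stranger = {pair_a[0]: pair_b[0], pair_b[0]: pair_a[0],
                pair_a[1]: pair_b[1], pair_b[1]: pair_a[1]}
    dialogues = []
    for giver, map_id in zip(subjects, maps):
        # Each Giver uses the same map with both partners.
        dialogues.append((giver, friend[giver], map_id, "familiar"))
        dialogues.append((giver, stranger[giver], map_id, "unfamiliar"))
    return dialogues


schedule = quadruple_schedule(("A1", "A2"), ("B1", "B2"),
                              ["m1", "m2", "m3", "m4"])
giver_counts = Counter(d[0] for d in schedule)
follower_counts = Counter(d[1] for d in schedule)
maps_followed = {s: {d[2] for d in schedule if d[1] == s}
                 for s in giver_counts}

# 64 subjects form 16 quadruples of 8 dialogues each.
total_dialogues = (64 // 4) * len(schedule)
```

Under these assumptions every subject gives twice and follows twice, follows on two different maps, and the total of 16 quadruples times 8 dialogues agrees with the 128 dialogues in the corpus.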
The option of placing a small barrier between map task participants to
prevent them from seeing each other's faces allowed us to control the
availability of the visual channel for communication.
Half of the subjects who took part in the task were able to make
eye-contact with their partner, while the other half had no
eye-contact.
Procedure
Subjects sat three or four feet apart, facing each other across a
desk, with their maps placed on sloping boards, to prevent each
subject seeing the other's map.
Pairs of subjects were randomly assigned to one of the two "eye-contact"
conditions.
After they had completed their map dialogues, subjects were asked to
read a wordlist containing all the feature names from the set of maps
they had encountered. Feature names appeared twice in random
order, and subjects were asked to read the list slowly and carefully,
aiming for a between-word interval of approximately one second.
These list readings provided citation forms against which the
unscripted dialogue forms could be compared.
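A wordlist of this kind could be generated as follows. This Python sketch only illustrates "each name twice, in random order"; the function name build_wordlist and the seed parameter are assumptions, and the original lists may have been ordered under further constraints not described here.

```python
import random
from collections import Counter


def build_wordlist(feature_names, seed=None):
    """Return every distinct feature name twice, in shuffled order."""
    rng = random.Random(seed)
    items = sorted(set(feature_names)) * 2
    rng.shuffle(items)
    return items


# Feature names taken from the examples given earlier in this document.
names = ["vast meadow", "reclaimed fields", "chestnut tree", "broken gate"]
wordlist = build_wordlist(names, seed=0)
occurrences = Counter(wordlist)
```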
Materials were recorded on Digital Audio Tape (Sony DTC1000ES)
using one Shure SM10A close-talking microphone and one DAT channel per
speaker. Split-screen video recordings were also made for half of the
dialogues, capturing an almost full-face image of both subjects.
Dialogues were orthographically transcribed and then checked several
times against the original DAT recordings.
All sixty-four subjects who participated were undergraduates at the
University of Glasgow.
Sixty-one of the 64 subjects were Scottish, 56 of them having been
born or brought up within a thirty-mile radius of Glasgow.
Half the subjects were male, half were female, and their mean age was 20.
Subjects accommodated easily to the task and experimental setting,
producing unselfconscious and relatively fluent speech.
Some Corpus Statistics
The HCRC Map Task Corpus consists of 128 digitally recorded unscripted
dialogues and 64 citation form readings of lists of landmark names.
All dialogues were transcribed verbatim in standard orthography,
including (where possible) filled pauses, false starts, hesitations,
repetitions and interruptions.
The sampled speech data, transcriptions, list readings, and some other
ancillary material have been published for distribution on a
collection of 8 CD-ROM disks.
CD-ROM Contents
The waveform data are provided in "raw" (headerless) files (16-bit
samples, 20 kHz sample rate, 2 channels per conversation), and
alternative header files are provided for use with software based on
either the NIST "SPHERE" header structure or the European "SAM" header
structure. Transcriptions are provided for each conversation, marked
up with TEI-compliant SGML, in a minimally intrusive and easily
separated way. PostScript files of the map images used in the
experiments are provided, along with full documentation of the
experimental design and data collection protocol, resources for using
SGML tools on the transcriptions and other text materials, and an
extensive set of source code for performing basic signal processing
functions on the waveform data, such as down-sampling,
de-multiplexing, channel summation, and D/A conversion for Sun
workstations (including playback of segments selected via inspection
of transcripts in Emacs).
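For readers without the supplied tools, the stated waveform format (headerless, 16-bit samples, 20 kHz, 2 channels) is simple enough to de-multiplex by hand. The Python sketch below is not the distributed source code. It assumes that the two channels are interleaved sample by sample and that the byte order is little-endian unless stated otherwise; neither is specified above, so both should be checked against the CD-ROM documentation. The function names are hypothetical.

```python
import array
import sys

SAMPLE_RATE_HZ = 20_000   # from the format description
CHANNELS = 2              # one channel per speaker
BYTES_PER_SAMPLE = 2      # 16-bit samples


def demultiplex(raw_bytes, big_endian=False):
    """Split interleaved 16-bit stereo data into two channel arrays.

    Interleaving and the default byte order are assumptions.
    raw_bytes would typically come from open(path, "rb").read().
    """
    samples = array.array("h")          # signed 16-bit integers
    samples.frombytes(raw_bytes)
    if big_endian != (sys.byteorder == "big"):
        samples.byteswap()              # convert to host byte order
    return samples[0::CHANNELS], samples[1::CHANNELS]


def duration_seconds(raw_bytes):
    """Length of the recording implied by the byte count."""
    frames = len(raw_bytes) / (BYTES_PER_SAMPLE * CHANNELS)
    return frames / SAMPLE_RATE_HZ


# Two little-endian stereo frames: (1, -1) and (2, -2).
demo = bytes([1, 0, 255, 255, 2, 0, 254, 255])
first_channel, second_channel = demultiplex(demo)
```

At this format each second of conversation occupies 20,000 x 2 x 2 = 80,000 bytes, which gives a quick sanity check on file sizes.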
The CD-ROMs are in High Sierra (ISO 9660) format with the Rock Ridge
extensions, and are compatible with (inter alia) Unix, MS-DOS and
Macintosh operating systems.
For more information, contact: maptask@cogsci.ed.ac.uk