Can a Team Tag Consistently? Experiences on the Orlando Project
Terry Butler
Sue Fisher
Susan Hockey
Greg Coulombe
Patricia Clements
Susan Brown
Isobel Grundy
Kathryn Carter
Kathryn Harvey
Jeanne Wood
Arts Technologies for Learning Centre
University of Alberta
5-21A Humanities Centre
Edmonton, AB T6G 2E6
Canada
Terry.Butler@UAlberta.ca
www.ualberta.ca/Orlando/ach-allc1999.html
Introduction
Given that a team of thirty-five humanities researchers cannot possibly use structural and interpretive SGML tags consistently over a period of three years, what issues face a major SGML research project wanting to impose an adequate degree of consistency on its tagged data? This is the key question currently facing the Orlando Project as we take stock and embark on a process of tag cleanup work.
Extent of the Project's Tagging
The Orlando Project is in the fourth year of its six-year tenure as a Major Collaborative Research Initiative funded by the Social Sciences and Humanities Research Council of Canada and the Universities of Alberta and Guelph. Our aim is to research, write, and tag in SGML an integrated history of women's writing in the British Isles. We are not tagging pre-existing texts; rather, we are creating our own literary history by conducting primary research that we then filter through SGML tagging. To do this we have created three distinct yet interdependent SGML document type definitions (DTDs): one that permits the description of biography, one that takes into account all the factors that contribute to a writing career, and one that provides an architecture for describing chronological events. Our DTDs are modeled structurally on the TEI, but each contains many interpretive tags that allow us to foreground our research practices and label the intellectual content of our material. For example, the biography DTD has tags for birth, family, education, and political affiliations; writing documents use tags for such text-specific information as genre, intertextuality, literary awards, and relations with publishers; events documents contain chronological events with such information as organization names and places tagged.
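To make the blend of structural and interpretive markup concrete, here is a minimal Python sketch. The tagged fragment and its element names are hypothetical simplifications in the spirit of our biography DTD, not our actual tag set, and the regular-expression extraction is illustrative rather than part of our toolchain.

    import re

    # Hypothetical fragment loosely modeled on the biography DTD;
    # element names are illustrative, not the project's actual tag set.
    FRAGMENT = """
    <biography>
      <birth><date value="1816-04-21">21 April 1816</date></birth>
      <education mode="domestic">Educated at home.</education>
      <politics>Active in local Tory circles.</politics>
    </biography>
    """

    # Print each start tag in document order, showing how interpretive
    # tags label intellectual content as well as textual structure.
    for tag in re.findall(r"<([A-Za-z]\w*)\b[^>]*>", FRAGMENT):
        print(tag)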
Currently there are 252 unique tags in our DTDs and 114 unique attributes. These tags are used extensively across more than 2,200 documents in our system. As of April 1999, the total number of elements in use in all project documents was over 640,000. For example, in biography documents alone the <quote> tag is used 2,135 times and the <name> tag 8,103 times. There are 51,125 uses of <date> in events documents. At this scale, it became clear to us that we simply could not clean up our tags on an element-by-element basis. This paper addresses the issues we have faced in trying to achieve tagging consistency on the project. It also reports on a pilot study of tag consistency work that we undertook in the Fall and Winter of 1998-1999.
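Statistics on this scale can be gathered mechanically. The following is a minimal sketch of such a census in Python; it is our illustration, not the project's actual reporting code, the .sgm file extension and directory layout are assumptions, and a regular-expression count is only an approximation of what a true SGML parser would report.

    import re
    from collections import Counter
    from pathlib import Path

    def tag_census(directory):
        """Tally start-tag occurrences across all .sgm files in a directory."""
        counts = Counter()
        for path in Path(directory).glob("*.sgm"):
            text = path.read_text(encoding="utf-8", errors="replace")
            # Start tags only; end tags (</...>) and declarations (<!...>)
            # lack the leading letter the pattern requires.
            counts.update(re.findall(r"<([A-Za-z][\w.-]*)\b[^>]*>", text))
        return counts

    # e.g. tag_census("biography")["quote"]   (2,135 uses as of April 1999)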
Is Consistency Possible?
We realize that achieving complete consistency with a diverse team of taggers using a complex tag set is impossible. Surveys of similar work done in the fields of indexing [Leonard, 1977], tagging linguistic data [Garside, 1993], and hypertext linking [Ellis, 1994] show that only a modest degree of consistency can be achieved by a team, even with ample training and a smaller, more focused tag set than we possess. Our own process of document analysis when we developed our DTDs led us to the same conclusion: it is immensely difficult for a group of people to come to a common understanding of exactly what a particular tag means and how it should be used in a research document.
However, there is a level of consistency to which we aspire, and it is predicated upon the needs of the scholarly community who will be our end users. Therefore, some tag cleanup will be necessary to ensure adequate search and retrieval, chronological sorting, and consistent presentation of material to our research communities.
Current and Past Efforts at Achieving Tagging Consistency
Although we only began tag cleanup work in earnest in the Fall of 1998, we have had practices in place from the early stages of the project that help us work towards consistent tagging. Our Graduate Research Assistants (GRAs) receive three full days of training each September, after which they begin tagging and have their work reviewed by experienced team members. This training and checking is complemented by a comprehensive online glossary of our SGML tags and tagging practices. When a tagger cannot find an answer in the documentation or is faced with a new tagging or research problem, she can pose a question to our student e-mail discussion list. There, Postdoctoral Fellows answer most questions that come up (referring the more difficult cases to an e-mail list of Co-investigators and Postdoctoral Fellows) and revise and augment the documentation when necessary.
Our practices for ensuring tag consistency have proven enormously useful, but they cannot in themselves moderate the work of twenty researchers using such a complex tag set. Tags sometimes get used without the tagger realizing that an issue has already been discussed or appears in the documentation. At over 500 pages, the documentation is a testament to the fact that a complex research project deals with more issues than all taggers can track at all times.
Tag Cleanup Pilot
In the Fall of 1998, we began a tag cleanup pilot study to investigate the problem of inconsistent tagging and to draw conclusions about the degree of consistency that will be achievable. During the pilot, GRAs were each given a group of representative tags to examine in detail across all project documents. The following are some of the strategies they used to uncover inconsistency problems in the data:
- Comparing how a tag has actually been used with how its use is documented, and recommending a change in practice or documentation where necessary.
- Investigating the use of "odd" sub-elements. For example, there is an instance in the system where a word has been tagged as both <genre> and <characterName>. In all likelihood, this is a tagging mistake.
- Finding missing attribute values, such as the titletype attribute on <title>.
- Investigating "odd" uses of attributes.
- Finding improperly filled-in attributes. For example, the value attribute on <date> shows inconsistencies in how standard dates are expressed. (Both this check and the missing-attribute check above are illustrated in the sketch following this list.)
- Investigating the over-use of a single tag in a document.
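The following minimal sketch shows how two of these checks, the missing titletype attribute and the malformed date value, can be automated. The file handling, the case-insensitive attribute matching, and the assumption that standard dates take an ISO-style YYYY, YYYY-MM, or YYYY-MM-DD form are ours, offered as illustration rather than as the project's actual cleanup code.

    import re
    from pathlib import Path

    # Assumed standard form for the VALUE attribute on <date>:
    # YYYY, YYYY-MM, or YYYY-MM-DD.
    STANDARD_DATE = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")

    def audit(path):
        """Flag <title> tags lacking a TITLETYPE attribute and <date>
        VALUEs that stray from the assumed standard pattern."""
        text = Path(path).read_text(encoding="utf-8", errors="replace")
        problems = []
        for m in re.finditer(r"<title\b([^>]*)>", text, re.IGNORECASE):
            if "titletype" not in m.group(1).lower():
                problems.append(f"<title> missing TITLETYPE at offset {m.start()}")
        for m in re.finditer(r'<date\b[^>]*\bvalue="([^"]*)"', text, re.IGNORECASE):
            if not STANDARD_DATE.match(m.group(1)):
                problems.append(f"non-standard date value {m.group(1)!r}")
        return problems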
In their reports, the GRAs noted which problems could be fixed with batch search-and-replace operations and which would need to be fixed manually. The pilot helped us realize that the tags most in need of fixing were those that act as index/linking tags (name, orgname), those that govern the presentation of our material (title, quote), and those that must be consistent for automated sorting and alphabetizing (date, name, orgname, title).
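Where a problem is regular enough to be caught by search and replace, a scripted batch change is both faster and more accurate than hand-editing. Here is a minimal sketch of the idea; the specific rewrite, normalizing a slash-separated date value to the hyphenated standard, is an invented example, and the .sgm extension is again an assumption.

    import re
    from pathlib import Path

    # Invented example of a batch change: rewrite VALUE="1818/07/30"
    # as VALUE="1818-07-30" across every document in a directory.
    SLASH_DATE = re.compile(r'(value=")(\d{4})/(\d{2})/(\d{2})"', re.IGNORECASE)

    def batch_fix(directory):
        for path in Path(directory).glob("*.sgm"):
            text = path.read_text(encoding="utf-8", errors="replace")
            fixed = SLASH_DATE.sub(r'\g<1>\g<2>-\g<3>-\g<4>"', text)
            if fixed != text:
                path.write_text(fixed, encoding="utf-8")
                print(f"updated {path}")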
Computer Tools
In the spring of 1999, we turned our attention to developing a set of protocols for fixing these tags. We defined and ran a set of batch changes that addressed many of the consistency problems outlined in the tag cleanup reports. Over the last year our programmer has also created a set of tools that ease the process of manual cleanup. A Web front-end to the SGREP program, with canned search queries, allows team members to see just how a tag or attribute has been used (or misused) across all project documents. A "System-wide Document Statistics Report" provides statistical information on the use of tags and their attributes and flags documents where tag use varies widely from normal usage. And finally, a "Tag Cleanup Reporting Tool" alphabetizes any given tag according to its standard or reg attribute across any combination of project documents. These tools will be demonstrated and their merits discussed during the session.
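To give a flavour of the last of these, here is a minimal sketch of the alphabetizing idea behind the Tag Cleanup Reporting Tool. The reg attribute name comes from our DTDs, but the extraction and sorting logic below are our simplification, not the tool's actual code, and the sgrep query shown in the comment is only an approximation of the canned queries behind the Web front-end.

    import re
    from pathlib import Path

    # The Web front-end wraps sgrep structural queries of roughly this shape
    # (approximate syntax; the canned queries are not reproduced here):
    #   sgrep '"<name" .. "</name>"' *.sgm

    def name_report(directory):
        """Alphabetize <name> occurrences by their REG (regularized form)
        attribute across a set of documents."""
        entries = []
        for path in Path(directory).glob("*.sgm"):
            text = path.read_text(encoding="utf-8", errors="replace")
            for m in re.finditer(r'<name\b[^>]*\breg="([^"]*)"', text,
                                 re.IGNORECASE):
                entries.append((m.group(1), path.name))
        for reg, doc in sorted(entries):
            print(f"{reg}\t{doc}")

Sorting brings variant regularized forms next to one another, so a researcher can spot at a glance when the same name has been regularized in two different ways.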
Conclusions
The insights gleaned from our pilot study have helped us better quantify and assign the work that we are currently undertaking. By setting priorities among the problems, we have tried to define a level of consistency that we can achieve and learn to be content with. We also hope to avoid having qualified researchers spend valuable time fixing problems that computer processes can fix more quickly and accurately. This work should keep our data rich in the breadth and depth of its research and tagging, and make our end product meaningful and accessible to our user community.
References
- Ellis, David, Jonathan Furner-Hines, and Peter Willett. "On the Creation of Hypertext Links in Full-Text Documents: Measurement of Inter-Linker Consistency." Journal of Documentation 50.2 (June 1994): 67-98.
- Garside, Roger. "The Large-scale Production of Syntactically Analysed Corpora." Literary and Linguistic Computing 8.1 (1993): 39-46.
- Leonard, Lawrence E. Inter-Indexer Consistency Studies, 1954-1975: A Review of the Literature and Summary of the Study Results. University of Illinois Graduate School of Library Science, Occasional Papers no. 131, December 1977.