[Archive copy mirrored from http://www.qucis.queensu.ca/achallc97/papers/s005.html]
Keywords: literary history, SGML, hypertext
Orlando is using computing technology and SGML at all stages within the project, but one of its key contributions to humanities computing methodologies is the use of SGML to encode interpretive information in basic research notes as the research is being carried out. The authors of the printed volumes will draw on this database of SGML-encoded information as they write. The database will also be used to create a number of hypertext products for research and teaching.
Because of the scope of the project, Orlando sets a significant and challenging test for some basic humanities computing tenets:
(1) that SGML has the flexibility to encode sophisticated and subtle electronic text
(2) that scholarly information can be stored and managed in systems which will not soon be blind-sided by technological change
(3) that our networked computer landscapes can facilitate substantial collaboration that overcomes the limits of both place and time.
In short, our project depends upon the confidence that systems and tools are now mature in the field of humanities computing. Our experiences over the first two years of this project will be examined in light of these themes.
Our three papers address these issues in various ways. The first paper, "The Orlando Project: Origins and Aims," explains the intellectual aims of the project and develops the rationale for our use of SGML as the primary vehicle for data capture. It also explores the iterative process by which the capabilities of our computer systems challenge and limit the scholarly work.
The second paper, "SGML and the Orlando Project: Theorizing Descriptive Markup," delves more deeply into SGML, and explores the innovative characteristics of our project's use of it. A signal feature is our development of DTD's prior to the writing of the scholarly research text (known fondly to the team members as "document analysis without documents"). Key to our SGML development have been document analysis sessions aimed at integrating the subjects, themes, and issues of research into documents that do not give short shrift to the demands of discursive prose. The paper focuses on our experiences balancing structural and content-based SGML markup within our documents in a manner that will provide the most comprehensive access to our material when our final delivery systems are in place.
Our third paper, entitled "Eluding the Chains of Technology: Computer Systems for Orlando," explores other dimensions of our computer systems, and their interaction with the research procedures and scholarly practices of the team members. Our initial choice of operating systems and hardware arose out of a matrix of factors: users' existing skills; their computer apprehensions; limited time to train; and the expectation of significant staff turnover. Some of those decisions have evolved in year two, as the expertise and confidence of the team increase. Web-based tools have been developed to handle the document management process. The difficulty of managing SGML documents with SGML-ignorant tools has been borne in upon us, but we have created some useful systems none the less. We rely extensively on network-based tools, such as e-mail and mailing lists, for inter-project communication. The degree to which the project can remain "in synch" between Alberta (where most of the project team members reside) and Guelph (where one principal investigator and her two research assistants work) depends in large measure on the effectiveness of these tools and our intelligent use of them. Some practical lessons learned from this process will be expounded. A major challenge of year two of the project has been to model the electronic delivery system -- who is our audience, and what will their needs, abilities, and desires be?
We will first outline our intellectual aims and the reason we decided to integrate computer technologies with our research from the outset.
Our use of computer technology is intimately bound up with the way we conceive of literary history. Critics have charged traditional literary history with contributing to a totalizing or linear view of the past, and we thus seek to mobilize multiple arguments which foreground process and the degree to which the components of a history are always in flux. We see our history as bringing together a number of distinct fields: writing by women in the British Isles, the conditions of their lives, writing by men and writing from outside the British Isles, historical processes, and the broader social and cultural environment. Following as we do the last several decades of vigorous activity in the recovery and reassessment of women's writing, we will be drawing on an immense body of prior research as well as adding to it our own discoveries and conclusions. Our aim is to pay particular, but not exclusive, attention to the complex and changing construction of gender, and always to keep it in dialogue with other constituents of identity and other historical and cultural factors. Our historical methodologies and inquiries both draw on and test themselves against the conclusions of other disciplines, including history.
In the Orlando project we are using computer tools, and particularly SGML--Standard Generalized Markup Language--to bring together the diverse fields of inquiry that combine in our vision of literary history: not as a stable unity but as a series of complex relationships. Our tagging structures and makes available a large quantity of research in the areas of women's writing and women's history, much of it subject to contestation. We are using SGML as the basis for a presentation of these details that will allow complications and contradictions to emerge from the connections and narratives which we will be offering. We believe our tagging will enable us to make visible the patterns and meanings immanent in this mass of detail, in addition to calling attention to the gaps, the discontinuities, and the unknowable silences in the history of women's writing. We are thus in the process of creating a textbase and a delivery system that attends to the theoretical concerns of recent literary historiography, which includes consideration of generic inclusiveness, the politics of canon formation, and the multiplicity of theoretical paradigms that make sense of the past. The bane of literary history is oversimplification and stereotyping: our electronic tools seek to emulate the fluidity, flexibility, and nuance of continuous prose while incorporating the structure and complex searchability of a database.
One particularly innovative feature of this project is that, rather than planning and conducting the research and writing first, driven only by the considerations of the humanities scholar, and deciding on the electronic delivery system once the research is nearing completion, we are designing our data structure as part of the process of research and writing. The research process and the computing practices of the project are thus indistinguishable and indeed thoroughly integrated. Our process, which you will be hearing about in more technical detail shortly, has been to combine hypertext linking with SGML encoding that structures and "adds value" to the textual material we are writing.
This means that our electronic literary history can be interactive and respond minutely to particular users' interests and levels of expertise. Users of our end-product will be able to find their way to information of which they are not yet aware; to generate their own time-lines for the life-histories of individual authors, of genres (like the epistolary novel, or detective fiction), or of historical processes as they impinge on writing (like the suffrage campaign or the shifting practices around childbirth). SGML allows us to tag various kinds of information to make it accessible to analysis and retrieval, but users will be able to shape the information according to their own interests, whether they are students, scholars, or general users. To do this we needed to work out a complex set of content tags, quite unlike the kind of structural tags, such as paragraph markers or address information, for which SGML is typically used. We have been developing a series of SGML Document Type Definitions, or DTD's, related to the different kinds of material that our history will incorporate. This aspect of the project will be the focus of the second paper in this session.
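To make the distinction concrete, here is a purely illustrative sketch (the content elements and their attributes are invented for this abstract, not drawn from our actual tag set) of content tags embedded within ordinary structural markup:

    <!-- p is structural; date, name, and marriage stand in for the
         kind of content tags described above (names hypothetical) -->
    <p>In <date value="1846">1846</date>, <name key="BrowEl">Elizabeth
    Barrett</name> married <name key="BrowRo">Robert Browning</name>
    against her father's wishes: an episode a user could retrieve by
    searching on <marriage>the marriage itself</marriage> rather than
    on any accidental word in the prose.</p>

A structure-aware search engine can then answer a request like "all marriages recorded in the textbase" without relying on the vagaries of wording.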
Our process of developing the document coding scheme has had to be innovative. Because the research document design is being developed at the same time the research program is being formulated, we have an unusual situation -- we are performing document analysis, which is the usual way of developing SGML coding, but there are no extant documents to analyze! Document analysis normally seeks to find features of interest to tag in a corpus of texts. Our process is rather a collaborative enterprise in which the intention for the research directs the creation of a suitable vehicle for its expression and capture. The past two years have involved us in intense discussion, debate, and negotiation over what the information that we are collecting consists in and how to understand and organize it.
Early on, we identified several primary areas of inquiry in which the work will proceed: biography, the writing lives and textual histories of the works, and their cultural, social, and historical contexts: in short, women, their writing, and the world. The document design process has led us to formalize the definition of hundreds of tags for each of these axes, to structure, organize, and interrelate the research information. The tags encourage precision in the collection of information, and consistency in its preparation; and they will facilitate searching and display by end-users that is both more precise, and much deeper, than what has previously been available in electronic information products.
We will briefly outline what we see as the major components of the project: a complex and highly manipulable chronology of women's writing; numerous brief, informative documents on topics ranging from biographical summaries and discussions of literary genres to short explanations of historical events or cultural phenomena; lengthier discussions treating major developments in women's literary history; mapping capabilities drawing on SGML structures; the generation of webs of association and connection; sound clips; images; and possibly some video. All of these facets of the project, while diverse in many respects, will be integrated by the SGML structures we have devised and the complex cross-referencing, linking, and thesaural systems that we are developing to bring them together. This task of incorporating the intellectual concerns and, perhaps, the conceptual structures of our arguments into the structure of our markup remains our biggest challenge.
Having outlined the nature and scope of the project, we will then briefly discuss some of our conclusions resulting from the two years of this conversation between literary history and humanities computing.
The tools of humanities computing are helping us achieve an unprecedented degree of searchability given the complex structuring of our data. We anticipate our literary history hypertext will be used in ways impossible with a print text, and for specialized ends we may not anticipate. Yet the diversity of the intellectual aims of our project also means that we find ourselves having to consider carefully the constraints and structures that the use of SGML and our plans for hypertext delivery dictate. We will thus, for instance, consider the implications of SGML's hierarchical structures for our project and the overall impact that adopting SGML has had on the way we have approached the research. We will reflect on the implications of combining documents tagged according to the Text Encoding Initiative, or TEI, with documents structured using non-TEI-conformant SGML. We will also consider the possible drawbacks, as well as the advantages, of hypertext linking for a project that wants to bring large amounts of information together with complex and sustained argumentation.
The Orlando Project is thus attempting, in developing a textbase for a literary critical history of British women's writing, to adapt the existing tools of humanities computing to suit an interlocking set of methodological and intellectual aims. We think SGML offers the greatest flexibility for our purposes, as well as the hope of longevity. It allows us to build relational database-like structures for chronological and other granular data, while also supporting multiple levels of encoding specificity for framing critical arguments and for interlinking both data and argument in a hypertext system. It will also allow us to link our project with the growing corpus of primary works on-line that are tagged in TEI-conformant SGML.
Our two years of integrating literary historical research with humanities computing design have brought home to us the magnitude of what we are attempting. We have been grappling hands-on with the attempt to make computer tools address the myriad needs of contemporary scholarship in the humanities. In developing our tagging systems, we have had substantial experience of ways that this process differs from traditional approaches to literary research. One thing that has become clear is that our use of computing tools is radically intensifying the collaborative nature of the project. Instead of a single researcher needing to communicate effectively and clearly with one or more research assistants, we have a research collective that has had to develop together a shared view of the project's research aims. We have already learned that explicitness is paramount. We are continually forced, in ways both frustrating and beneficial, to articulate our various assumptions about our purpose, our methods, and our theoretical frameworks more explicitly and more frequently than, say, a traditional co-editing or co-authoring project would demand. Every term in use needs ample discussion, to ensure we have a common understanding of its scope and import, and every new tag produces a body of discussion and debate about its purpose, use, and application.
One of the many challenges that remains for the future, of course, is what kind of delivery system or systems we can develop to reflect the project's aims, which is the focus of the third paper in this session. We hope to deliver, along with accessibility and flexibility, the same degree of explicitness and self-reflexivity that this process is forcing on us. We want the users to be able to construct their own narrative trajectories of women's writing and create their own web of history. Our challenge is to enable as many seen and unforeseen, imagined and unimaginable, understandings of women's literary history as possible.
The computing mandate for the Orlando Project is unique in the world of SGML. Like other projects, we are interested in SGML's descriptive markup capabilities for delimiting the structural components of our material; marking such information as chapter divisions, paragraphs, and long quotations for the purposes of systematic display and search and retrieval is key to ensuring the flexibility of our information. Also like other projects, we are interested in the rudimentary content tagging that comes hand in hand with descriptive markup: names, dates, and titles of books provide information that goes beyond the simple structure of the document at hand. Unlike other humanities computing projects, however, we want to exploit the possibilities of SGML to systematize and foreground the interpretive models that underlie our literary history. Such interpretive structures vary from the seemingly objective keyword labeling of historical events such as women's suffrage to tagging the central issues in our history as they occur in our documents. For example, we want to be able to mark all instances where we discuss women's relationships to economic institutions, regardless of what line of argument the person writing the document adopts. Such tagging will allow us to foreground the positions taken by the multiple voices that make up the Orlando Project.
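As a hedged illustration of what such interpretive tagging might look like (the element and attribute names here are invented for this example), the same interpretive tag would wrap the relevant stretch of prose whatever argument its author happens to be making:

    <!-- hypothetical interpretive markup: economics and its attributes
         are illustrative, not our published tag set -->
    <p>Her dealings with the publishing house were conducted entirely
    through her brother: <economics institution="publishing">as an
    unmarried woman she was in practice unable to negotiate her own
    contracts</economics>.</p>

A search on the economics tag would then retrieve this passage alongside passages that argue quite different positions, foregrounding the debate itself.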
Most SGML-based humanities computing projects use descriptive markup to identify information that has meaning within the context of the document itself. Our extended use of descriptive markup allows us to tag content that has referential meaning to the world beyond our documents. In theorizing and pushing the limits of descriptive markup, we recognize that the world we wish to label cannot be viewed objectively; rather, we are presenting that world as contextualized in a specific time and place, a world seen through the critical lens of the project's researchers.
To create our DTD's we have spent much time in document analysis sessions. But because we feel it important that our documents be researched, written, and tagged in an integrated process, our document analysis sessions have preceded the creation of our documents. This situation has been both frustrating at times, in that we lack concrete models from which to derive the hierarchical models for our DTD's, and freeing at other times, in that it has forced us to examine the research aims of the project in constant dialogue with the possibilities offered by the computing tools. In such sessions, the project's co-investigators discuss the aims of their research, what they want the project's hypertext to accomplish, and how these aims relate to individual areas of research, such as women's biography, writing lives, world chronology, and the like. Based on these sessions, we create hierarchical structures that reflect the overall patterns of thinking in such meetings. These hierarchical structures are then revisited by a sub-committee representing the co-investigators, the SGML authors, the project's postdoctoral fellows, and the graduate research assistants (GRA's) who do the research and tagging. Once all the components in these hierarchical models have been defined, documented, and scrutinized, we design DTD's based on both the structural and research areas of document analysis. These DTD's are then tested by creating real research documents, in dialogue with the needs of the document as defined by the DTD.
In creating the structural shell for our DTD's we are indebted to the work of the Text Encoding Initiative. Its understanding of document divisions, paragraph components, linking mechanisms, and expressions for dates, names, and places has sharpened our thinking in these areas. Furthermore, the cataloguing principles inherent in the TEI header have been of great use to us in our SGML authoring. The structural principles in place in the TEI have been modified by our project, however, in order to accommodate our research needs. Such extensions include structural components for chronology items, research and scholarly notes, and the ability to blend our subject tagging into our structural hierarchies.
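A sketch of what such an extension might look like in DTD syntax (the element names and content models are invented for illustration, not our working declarations; %phrase; stands in for a parameter entity of phrase-level elements):

    <!-- a chronology item and a scholarly note grafted onto a
         TEI-like structural shell -->
    <!ELEMENT chronitem   - - (date, chronprose, bibref?, scholarnote?) >
    <!ELEMENT chronprose  - - (#PCDATA | %phrase;)*                     >
    <!ELEMENT scholarnote - - (p+)                                      >
    <!ELEMENT bibref      - O EMPTY                                     >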
Integrated within our TEI-like structural shell are all our content and critical/interpretive tags. Layering structural and content tags has forced us to negotiate the difficulties of wedding multiple hierarchies in SGML. Consequently, some of our content tags have been integrated into the structural hierarchical model of the DTD's, while others have been rendered as location-specific inclusions.
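For instance (again with invented names), an interpretive element can be made available anywhere inside a division by an SGML inclusion exception, rather than being written into every content model it might appear in:

    <!-- the +(economics | politics) inclusion lets these content tags
         appear anywhere within a div without complicating its model -->
    <!ELEMENT div - - (head?, (p | quote)+) +(economics | politics) >

This is one of the standard SGML devices for easing the tension between a document's single structural hierarchy and content tags that cut across it.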
Although this paper has, to this point, referred to the project's DTD's as they exist independent of one another, we have devised several mechanisms for making documents created with multiple DTD's and existing in thousands of files participate in our hypertext web of history. The overarching architecture for our hypertext will be the use of linking tags, such as the TEI's ptr, ref, xptr, and xref. Combined, these tags will present paths through our documents that the end-user can follow according to her wishes. Such links are not adequate for creating a complex electronic reference tool, however, as the hypertext trails have been blazed according to the wishes of those creating the links, not according to the needs of those following them. In order to enhance this hypertext model, we have planned other access routes to and through our information. We will provide hierarchically organized subject-specific indexes based on the content and critical/interpretive elements within our documents. We will also label information according to a series of keywords developed by our literary researchers as pertaining to the key areas of subject interest on the project. Thus an end-user will be able to search for information in all documents having to do with, for example, politics--activism. Finally, for all items labeled as pertaining to chronology, we will assign keywords and values for relevance so that end-users will be able to produce subject- and period-specific chronologies. We feel that these DTD design strategies will allow search and retrieval of our information to go far beyond simple text string searching.
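By way of a loose sketch (the identifiers and the keyword element are invented; ref is the TEI's own linking element), a passage might combine an authored link with a keyword label:

    <!-- ref is TEI; kw and its attributes are hypothetical stand-ins
         for our keyword labelling -->
    <p>Her part in the suffrage campaign is treated at length elsewhere
    (<ref target="suff01">see the suffrage overview</ref>); the passage
    also carries a subject label for index-driven retrieval:
    <kw term="politics--activism">her work for the 1866 petition to
    Parliament</kw>.</p>

The authored ref supplies one blazed trail; the kw label lets a user who never follows that trail still reach the passage through the politics--activism index.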
To date, we have DTD's for biography and women's writing lives in place and in use by our graduate research assistants. Since January 1996, our GRA's have been carrying out the integrated processes of researching, writing, and tagging documents about British women writers. Their experiences have been the true test of our approach to SGML. The first two months of GRA tagging quickly revealed to us the numerous shortcomings in our DTD design. Key among these was our awkward handling of the admixture of tabular information and discursive prose. Once such problems were addressed, however, we found that our GRA's were creatively adapting the tags to the information at hand. The biggest ongoing issue now facing our taggers is managing all the practice decisions that have been made regarding proper use of our SGML. We continue to refine how our SGML is put into practice in order to ensure that 8 GRA's in two separate provinces are creating documents that are consistent in readability and tagging practice.
The decisions taken at the start of the project with regard to operating systems and hardware involved an interesting admixture of crystal-ball gazing, given the five-year timelines of our grant. Had we been making such projections five years earlier, in 1990, we know we would have completely overlooked the relevance of the World Wide Web, and probably hotly debated whether Macintosh or DOS/Windows would be the better beast to ride to the promised land of academic computing. And we knew in 1995 that the pace of change in the underlying technologies was increasing, and that dramatic changes, not just gentle evolutions, had been and would be characteristic of our future.
Our central choice, which was not a difficult one, was a commitment to use SGML to capture our research information and build the data collection toward the end product, which will be both print (volumes of literary history and a chronology) and electronic (a hypertext/on-line searchable electronic text). To match this intention up with specific tools, we chose Windows for Workgroups as our underlying operating system, and conventional Intel hardware as our base platform. This tandem was much less expensive, pound for pound, than a Macintosh-based one. There was already in 1995 no shortage of SGML and hypertext/multimedia tools available for these systems, and we believed that the unholy alliance of SGML riding on Bill Gates's shoulders would provide us with an upgrade path to new tools and systems over the following five years.
Decisions as to specific tools followed, with the same logic. Our choice of an SGML editor was crucial, for virtually all the team members would use it extensively, and the research assistants would have it as their regular daily work tool. Several SGML editors attracted our attention at the beginning: the add-on to Microsoft Word; SoftQuad's Author/Editor; and WordPerfect's SGML edition. But we attended to our own arguments, which took into account the existing team's computer skill levels and experience (as well as their cheerfully disguised computer apprehensions); the limited time available to train our professors in the tool's use; and the likelihood of significant staff turnover among the cadre of 8 graduate students. Inquiries about other SGML projects showed that WordPerfect SGML was successfully in use where the users were "amateurs" in various senses, while Author/Editor success stories came most often from settings where the users employed it for their entire writing job, or could be put under "orders" to do so (technical writing environments, for example). So our decision to use the WP SGML edition leveraged the existing familiarity with WordPerfect which most of our team had, and made for a more "comfortable" learning curve (quickly renamed the "learning cliff" when the complexities of our DTD's were unveiled to the taggers). Our steeply discounted academic price for the product didn't hurt either.
Our year one experiences with WP SGML have been generally positive; our taggers have learned to use it well, and have been productive with it. It has not been without its rough edges, of course. We ran the WP version 6.0 beta for most taggers under Windows 3.11, and tried the integrated Corel WP 7.0 under Windows 95. The program at first appeared unstable, but carefully controlling our procedures for saving and opening documents, and being "gentle with it" (as one tagger aptly said), reduced the problems to a tolerable level. Recently, our taggers have been testing the latest version of Author/Editor, and have expressed enthusiasm for it. It is fair to speculate that they are now more like the "pros" than they were in year one. The price gap between the two products is no longer as large as it once was, and we have now purchased copies of Author/Editor for all the team. Both programs will be in our arsenal, but we think experienced taggers will be more productive with Author/Editor.
The value of our SGML commitment is already proving itself, as the change of authoring tools makes clear. We are managing and massaging our documents with a number of SGML-aware tools: the most valuable of these is SGMLS. As we have begun to enhance what we can do with our documents, through style-sheets in WP SGML, and through style-sheets and webs in Panorama Pro, the principal investigators and research assistants have become more assertive in their clamour for "more". The SGML tagging was really not meaningful to the taggers until we could pop a viewable version of the document up in Panorama for them: then the penny dropped, and our mixed content and hierarchical tagging began to be used to tell the intended story. But the counterbalancing danger is that seeing the document in this form tends to reify it -- and a Panorama view of one of our documents is far from what we intend and imagine as our final delivery system. So taggers are certainly now "writing for Panorama" in a sense, and we are all to some extent judging the success and adequacy of our work by how it looks ... for now.
The empowerment of the taggers pulls us in another direction: besides multiple views of a document, they want views of multiple selections out of our growing document "text base". We are only beginning to build some computer systems which can respond to these requests. The question arises: how much time and effort should we invest in such tools, which may well be transient? Our exploration of SGML-savvy document management systems concluded, at least as of November 1996, that none are really mature and fit our needs (which include the need to manage and deliver documents to distributed sites) -- at least, none that fit into the budget of even a generously-funded humanities computing project.
We have exploited some ad-hoc tools to manage our SGML data: VEdit (a programmable editor) in the DOS environment; Perl and shell scripts in Unix. These tools have proved useful for making simple programs which generate and manipulate our SGML files. But it must be firmly said: SGML data is very opaque to non-SGML tools. It is very hard to extract useful information out of an SGML document instance, because the SGML character-stream spans from one line to another, and any program which tries to navigate into the sea of SGML structure does so very inconsistently, unless it is (or has access to) an SGML parser. By using SGMLS (a shareware SGML parser) and Perl scripts, we have been able to extract components from our documents. We have our complete document collections, and subsets of our documents, loaded into a LiveLink searching program, for our team's use.
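To make the approach concrete: SGMLS emits a line-oriented ESIS stream in which a line beginning "(GI" opens an element, ")GI" closes it, and "-" carries character data, which an ordinary Perl filter can scan without any SGML-awareness of its own. A minimal sketch (the NAME element stands in for illustration; invocation details vary with the SGML declaration):

    # usage (roughly): sgmls mydecl.dtd document.sgm | perl names.pl
    # prints the character data of every NAME element in the stream
    while (<>) {
        chomp;
        if    (/^\(NAME$/) { $in = 1; $text = ""; }       # element opens
        elsif (/^\)NAME$/) { $in = 0; print "$text\n"; }  # element closes
        elsif ($in && /^-(.*)/) {                         # character data
            ($data = $1) =~ s/\\n/ /g;  # ESIS escapes record ends as \n
            $text .= $data;
        }
    }

Nested NAME elements would need a small stack rather than a flag, but even this crude filter is enough to pull author names out of thousands of documents.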
An example of a useful document management tool which we built is a web-based document checkin/checkout system. The system was built to give equal access to our central (Alberta) and remote site (Guelph), as well as our users who are at times working in libraries, or from home. The system stores documents on our Unix server, and provides a Web interface for searching the document archive, submitting a document, or checking a document out. This provides a central project repository for the documents, as well as version control and back-up of them.
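A minimal sketch of the check-out half of such a system, written with Perl and the CGI module (the paths, parameter names, and lock-file convention are all invented for illustration; a real system must also validate its inputs):

    #!/usr/bin/perl
    use CGI;
    $q    = new CGI;
    $doc  = $q->param('doc');        # e.g. "bio-behn.sgm"
    $user = $q->param('user');
    $repo = "/orlando/repository";   # hypothetical document archive

    if (-e "$repo/$doc.lock") {      # someone already has it out
        print $q->header('text/plain'),
              "Sorry: $doc is already checked out.\n";
    } else {
        open(LOCK, ">$repo/$doc.lock");          # record the borrower
        print LOCK "$user ", scalar(localtime), "\n";
        close(LOCK);
        open(DOC, "<$repo/$doc");
        print $q->header('text/plain');          # hand the document over
        print while <DOC>;
        close(DOC);
    }

The lock file is what provides the version control described above: the companion check-in script copies the submitted document into the repository, archives the previous version, and removes the lock.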
Year two of our project (1997) also brings the task of integrating our SGML documents with non-SGML data. At the start of the project, we began capturing chronology information in a Microsoft Access database of our own devising, and we have been capturing bibliographic citations with Pro-Cite. The design of these databases has all along been synchronized with our document analysis, with the intention of ultimately folding them into our SGML document collection. This process has been completed, but the management of information as granular as a single "event" in a chronology, or a single bibliographic reference, is difficult. We are currently exploring the use of the Oracle database management system as a repository for these documents, so that they can be viewed as a database collection or worked on singly or in clusters, from any one of our locations.
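An invented sketch of how a single chronology "event", formerly an Access record, might look once folded into SGML (the element and attribute names are illustrative only, consistent with the DTD sketch above):

    <!-- one free-standing, database-granular chronology item -->
    <chronitem>
      <date value="1866-06-07">7 June 1866</date>
      <chronprose>The first mass women's suffrage petition is presented
      to Parliament by John Stuart Mill.</chronprose>
      <bibref target="pc0421">  <!-- hypothetical Pro-Cite record key -->
    </chronitem>

Records this small are trivial as SGML but awkward as thousands of tiny files, which is precisely why a database repository that can serve them up singly or in clusters is attractive.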
Our networked communications have relied extensively upon e-mail, and several electronic lists, for information sharing and exchange. The volume of traffic on these lists is impressive, and they are the main vehicle for conveying detailed practice issues about the tagging to the team. The lists work very well as a written vehicle to convey the "how do I ..." and "here's how you do ..." exchanges. Everyone involved is trying to achieve the discipline of always using the list, when just a shout across the hall is so much easier: but the shout evaporates into silence, while the posting with its answer becomes an important element of the team's documented procedures. This communication works well when the questions asked are current and the "need to know" is warm: it is less useful when the need has cooled, and the tagger recollects "I saw something about that just last month ..." We needed better indexing of the e-mail messages themselves (again requiring discipline by the sender in keeping each message to a single topic, and giving a full subject line to it) and better and more various ways to index and access the on-line documentation of tagging procedures, including a World Wide Web-searchable format. These indexes have been mounted: using MHonArc for the e-mail messages, and LiveLink for the documentation.
As far as policy issues and in-depth collaborative work go, our use of e-mail, FTP, and shared file spaces has been ambitious, but far from resoundingly successful. Sometimes the details trip us up: FTP that won't "T", messages which get missed among the many others on our screens, and so on. A sense of consensus develops over issues at the Alberta centre that is neither explicit nor defined in any particular conversation or meeting; conveying this evanescent sense to Guelph, or securing their participation in it, is elusive. We all feel strongly that we work much better, and get much more done, when we are all present face-to-face.
Our final computing challenge, at this stage of our project, is to bring the modelling of the electronic delivery system into a much sharper focus, to translate it from "wouldn't it be nice ..." into "it will look and work like this ...". We have said bold things: the user will produce maps on the fly, which will locate events (controlled by selection criteria of time, place, and topic) in a geographic display. Our scholars are excited by the power and potential of such a system, and the new connections and links that users will be able to make with it. We are presently capturing place information with a considerable degree of exactitude in our data; we now must delve into Geographic Information Systems, base maps, and coordinate systems, to work toward a realization of this goal.
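We are not yet committed to a coordinate scheme, but the exactitude we mean might look something like this (the place element and its attributes are invented for illustration):

    <!-- hypothetical: regularized place name plus coordinates that a
         GIS layer could plot against a base map -->
    <place reg="Haworth, Yorkshire, England" lat="53.83" long="-1.96">
      Haworth</place>

Once such coordinates are in the data, producing a map on the fly reduces to selecting events by time, place, and topic, and handing their coordinates to the display layer.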
The delivery system will embody a sophisticated text searching engine, which will be able to exploit the richness and complexity of the hierarchies we have embedded in our texts. The users' interests will have to be brought into contact with our purposes and intentions, the story we want to tell, the emphases we wish to make, the misconceptions (some of a monumental nature) we wish to redress. The process of modeling (picturing, imagining) what our system will do and what it will be significantly shapes the scholarly text which we are now researching and writing, and will shape the final computer product, about which we can, at present, only dream.