The use of SGML by philologists

[Mirrored from: http://www.sgmlbelux.be/96/dobreva.htm]

Experiences gained during the Medieval Slavic Manuscripts Encoding Project

Milena Dobreva
Institute of Mathematics and Computer Science
bl. 8, Acad. G. Bonchev
1113 Sofia
Bulgaria

E-mail : dobreva@bgearn.acad.bg

Abstract:

This paper presents the problems which arose in the use of TEI P3 encoding of medieval Slavic manuscripts by philologists.

An encoding scheme for manuscript description was created within the framework of a joint American-Bulgarian project entitled 'Computer Supported Processing of Medieval Slavic Manuscripts'.

During the work on the project, we had to train specialists for data entering using SoftQuad Author/Editor(tm) v. 3.1. Eight philologists who were both SGML- and TEI-unaware entered about 200 manuscript descriptions.

Since this was the first SGML-oriented project implemented in Bulgaria we were not able to use any previous local experience. In addition to this, the problem area for which the TEI P3 encoding scheme was developed (medieval manuscripts) resulted in a high level of complexity for the work our specialists had to perform.

We present our experience in the training, data entering and editing stages of the project, as well as some characteristics of the work performed (time required, error rates and user satisfaction).

As a result of our work, half of the specialists entering manuscript data suggested improvements to the SGML encoding scheme used - a fact which we consider to be one of the most positive pieces of feedback from the project, and a clear indication of the acceptance of SGML by our previously inexperienced users.

Keywords: TEI P3, SGML, Medieval Slavic Manuscripts, project organisation, training of computer-illiterate users

Introduction

A joint American-Bulgarian project entitled 'Computer Supported Processing of Medieval Slavic Manuscripts' coordinated by Prof. D. Birnbaum, University of Pittsburgh, USA, and Assoc. Prof. A. Miltenova, Institute of Literature, Sofia, Bulgaria, was implemented in 1994-1996.

The basic aim of the project was to produce a representative set of manuscript descriptions in electronic form to be used for educational and research purposes. The reasons which motivated us to undertake the project were:

lack of electronic resources in the field of Medieval Slavic studies (we do not have electronic text corpora or representative databases storing manuscript descriptions),
our desire to help the researchers and trainers in the area of Medieval Slavic studies by providing them with (i) electronic resources, and (ii) specialised tools for using those resources.

The basic difficulty in creating computer tools for manuscript description is caused by the fact that the application area still lacks generally accepted formal models for representing its knowledge (this is a common problem in the design of computer applications in the Humanities). On the one hand, the development of any computer-oriented model still requires a clear identification of the objects in the knowledge domain and their relations. On the other hand, the needs of the specialists in the application area, the degree of detail and the depth of their knowledge differ considerably. In the case of medieval manuscripts for example, the researchers who study the codicology of manuscripts are usually not interested in the structure of the manuscript texts. They would therefore not need a tool offering them a detailed description of the text structure, and vice versa.

In addition to these difficulties specific to the application area, we also had to overcome problems related to the specific computer platforms being used. The computer representation of medieval Slavic manuscript texts is still a subject of standards development (i.e. we do not have an internationally recognised standard which serves the needs of the specialists in the area).

Experience in applying computer technology to medieval Slavic studies has been accumulating for the last 15 years. There have been three major attempts to use computers for manuscript descriptions: one implemented in the Netherlands in the early 1980s (Note), one in Bulgaria in the late 1980s and one in Germany in the early 1990s. The experience gained during these projects showed that:

the computer platform (hardware, software and encoding standards) which is being used for a project has a crucial impact on its future use by specialists from other organisations and/or countries
the ease of electronic interchange of descriptions influences the popularity of a certain computer tool aimed at manuscript description
the computer model should be open for adding further levels of detail without damaging the already entered data.
the manuscript description tool should be as user-friendly as possible.

With respect to these requirements, the goal of the project was to create a manuscript description which would be:

user-friendly
easily used on different computer platforms
interchangeable
very detailed.

We chose the Standard Generalized Markup Language (SGML) and, more specifically, the Text Encoding Initiative (TEI P3) standard as the encoding platform that serves our needs most adequately, for four reasons: its orientation towards document interchange; the possibility of working at different levels of detail; the possibility of changing a data model without data loss, provided the original model is well elaborated; and the existence of both commercial and shareware tools for various computer platforms, meeting different user needs.

The Work on the Project

Project Organisation

The researchers involved in the project formed an interdisciplinary group of three specialists with expertise in different areas of Medieval Slavic Studies and one computer scientist. During the implementation of the project, 6 more specialists in Slavic Studies worked on manuscript description encoding.

The work on the project consisted of several stages:

Design of the encoding scheme.
Training and data entering of manuscript descriptions.
Editing and initial use of the manuscript descriptions.

The first stage, design of the encoding scheme, took an initial period of about 3 man-months. This stage cannot be considered finalized, since new ideas for further refinements still arise periodically.

This stage started with a discussion of the model of a manuscript description which we would like to present in an electronic form. The basic elements which had to be included in the model and their relations were clarified. After that, it was checked how TEI P3 could meet the needs of our model. It became clear that the most specialised elements of the model are not present in the general TEI P3 Guidelines. This forced us to produce our own DTD which contains a significant number of new elements, added to those already present in TEI P3. The ratio of the number of added elements to the number of already existing elements in the TEIHEADER is 3:1. This means that even users accustomed to TEI P3 would still need some additional training in order to use our description.

The computer implementation of our DTD was based on the use of SoftQuad Author/Editor(tm) with compilation of a rules file by the SoftQuad Rules Builder.

The second stage, training and data entering of manuscript descriptions, takes about 2 weeks per person for training and 2 days (on average) for entering one manuscript description (usually after a preparatory period of data collecting which takes about 1-3 weeks, depending on the expert's familiarity with the particular manuscript).

We discuss the training together with the data entering because the typical training practice was to work directly on the computer with a real manuscript example.

Participation in specialised SGML workshops was another important form of training support. The philologists who entered data participated in 2 workshops (one before the beginning of the project and one in the middle). The first one, delivered by Dr. Harry Gaylord in 1993 in Sofia, provided the philologists with some initial knowledge about the underlying philosophy of SGML. The second one, a hands-on session presented by Prof. D. Birnbaum, Dr. H. Gaylord, Prof. N. Finke and Dr. W. Bader at the First International Conference on Computer Processing of Medieval Slavic Manuscripts held in 1995 in Blagoevgrad, Bulgaria, gave the same philologists an opportunity to refresh their views on the encoding principles which they were already applying in their work.

The last stage, editing and initial use of the manuscript descriptions, is the phase of the project we are in at present. The use of the data entered in the manuscript descriptions is basically focussed on extracting lists of data which serve the needs for standardising terminology, finding misspellings and cross-checking some of the data. For the extraction of these data we had to develop our own software tools.
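The kind of list extraction described above can be sketched in a few lines of Python. This is only an illustration and not the software used in the project: it assumes that every element in a description carries an explicit end tag, it relies on the standard `html.parser` module as a tolerant tag scanner instead of a validating SGML parser, and the sample descriptions, the class `ElementLister` and the function `list_values` are invented for the example.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser


class ElementLister(HTMLParser):
    """Collect the text content of every element, keyed by element name."""

    def __init__(self):
        super().__init__()
        self.open_elements = []          # stack of currently open element names
        self.buffers = []                # one text buffer per open element
        self.values = defaultdict(list)  # element name -> list of text values

    def handle_starttag(self, tag, attrs):
        self.open_elements.append(tag.upper())
        self.buffers.append([])

    def handle_data(self, data):
        if self.buffers:
            self.buffers[-1].append(data)

    def handle_endtag(self, tag):
        # Only close the element if it matches the innermost open one.
        if self.open_elements and self.open_elements[-1] == tag.upper():
            name = self.open_elements.pop()
            text = " ".join("".join(self.buffers.pop()).split())
            if text:
                self.values[name].append(text)


def list_values(descriptions, element):
    """Count how often each value of `element` occurs across all descriptions."""
    counts = Counter()
    for sgml in descriptions:
        parser = ElementLister()
        parser.feed(sgml)
        parser.close()
        counts.update(parser.values[element.upper()])
    return counts


# Invented sample data, for illustration only.
descriptions = [
    "<SCRIBE><NAME>Scribe A</NAME></SCRIBE><CODICOLOGY><INK>black</INK></CODICOLOGY>",
    "<SCRIBE><NAME>Scribe A</NAME></SCRIBE><CODICOLOGY><INK>blakc</INK></CODICOLOGY>",
    "<SCRIBE><NAME>Scribe B</NAME></SCRIBE><CODICOLOGY><INK>black</INK></CODICOLOGY>",
]

ink = list_values(descriptions, "INK")
# Values that occur only once are candidates for misspellings.
suspects = sorted(value for value, n in ink.items() if n == 1)
print(ink)
print(suspects)
```

A list of this kind, produced for one encoder or for the whole set of descriptions, gives the editor a starting point for standardising terminology; whether a rare value is a misspelling or a legitimate rarity still has to be decided by a specialist.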

The Encoding Scheme

The encoding scheme used in the project is presented in Appendix 1. The basic changes we made to the TEIHEADER section of the TEI.2 element are in its PROFILEDESC part, where we placed elements representing specific subject-domain knowledge connected with the history of the creation of the manuscript. The most important groups of elements present data about:

codicological appearance of the manuscript (grouped in the element CODICOLOGY),
scribes (grouped in the element SCRIBE),
content of the manuscript as a whole (grouped in the element MANUSCRIPTCONTENTDESC),
content of a single manuscript text (to avoid the confusion with the TEI sense of the term text, we called this part of the description ARTICLECONTENTDESC).

Some smaller changes were made in the FILEDESC section of the TEIHEADER to accommodate the specific features of manuscript cataloguing information.

We also had to add new elements presenting the incipits and explicits of manuscript texts, which are important for the study of medieval manuscripts.

Thus, our encoding contains about 100 elements relevant to the description of the manuscript itself, excluding the features which could be encoded in the presentation of the text itself. This description can be considered to be rather complex to use, since somebody who is planning to enter data should be well-acquainted with all the elements in a description and their proper placement and interrelations.

Although our current document model has already reached an impressive level of complexity, we should mention that many elements can still be represented in even greater detail than they already are in our current encoding. One of the difficult challenges in our work was to stop adding more details which could be considered to be interesting for only a very small group of researchers. This requires taking decisions which are not always easy to make since the decision criteria ultimately depend on personal views about what is important and what is not, criteria which can be easily attacked.

Preliminary Expectations of the Participants

During the design stage of our work, we also had to take into account the intended use of the encoded materials. The normal software development approach is to choose among several possible encoding solutions by taking into account the desired future use.

However, software development for use in the Humanities does not follow this approach because the work goes in exactly the opposite way. It is only after some computer resources have already been created, based on rather poor computer models because of the inherent difficulties of interdisciplinary work, that the users start to understand more about the use of computers and can become dissatisfied with the existing model they have to use.

In our case, it was clear from the very beginning that the encoding scheme would be used for entering data, but besides this the really interesting question was: "What are we going to use these encoded materials for?". The possible uses suggested by the philologists at the beginning of the project did not go beyond the compiling of indices and file cards on the basis of the description. This limited view of the possible uses of SGML technology clearly illustrates how familiarity with old ways of working limits one's capability of imagining new, innovative ways of using computer technology.

However, the existence of electronic descriptions was estimated as being useful if only because of the availability of electronic resources. The philologists' ideas about the data they would like to extract from the descriptions were not very clear, a fact still confirmed by the continuing appearance of new suggestions for desired outputs.

This kind of situation is rather typical for a project in the Humanities and it requires a high level of understanding from the computer experts involved in the implementation of a software tool.

Involvement of Philologists

We would like to present here the tasks which were performed by the philologists participating in the project and to discuss some of the specific difficulties we encountered.

Design stage

At the design stage, the philologists had to discuss the elements to be included in the model. During the initial work on the encoding scheme, we had 2 specialists participating actively. Some previous experience acquired during the development of other models in the same problem area was used (we should mention here the traditional manuscript description formats used in the printed editions and the formal models applied in previously developed computer systems.)

The participation of the same philologists who entered the manuscript descriptions resulted at the end of the project in a number of proposals for extending the original encoding scheme. These proposals probably represent the most important feedback about the success of the use of SGML in the encoding work, because they were themselves submitted in an SGML-like style.

Entering data

This was the stage of the project that required the most intensive involvement, both in time and effort spent, of the philologists participating in the project. Since the data entering requires well-qualified people (capable of recognizing different elements in the description and possessing a good knowledge of Old Church Slavic), we had to include in our team researchers who were interested in contributing to the creation of electronic resources in the field of medieval Slavic manuscripts.

We had 8 philologists who entered manuscript descriptions. Three of them have long experience in the subject domain (more than 18 years), and 5 of them constitute a group with less than 10 years of experience. The computer literacy of the group does not correlate with the years of work in the field. We had researchers with some level of computer literacy, and we also had researchers without any computer literacy, in both of the above-mentioned groups.

Processing data

In this stage of the project, we had one specialist who was editing all manuscript descriptions. After the editing we used specially designed software to produce lists of important elements (usually by processing all the descriptions produced by one specialist, and the whole set of descriptions). The philologist who entered the descriptions was thus able to see what results were extracted from his/her description and to find out about their place in the general picture.

As mentioned earlier, we still receive new proposals for desirable data outputs from the descriptions. This clearly illustrates the rise in competence in the use of electronic resources by the members of the project team.

Feedback

In order to improve the work on this project and similar projects in future, we tried to perform some evaluation of the acceptance of SGML by the project participants, people who are not specialists in computer science and who did not have a high level of computer literacy before starting their work on the project.

We have three basic types of feedback. The first one, which I call here controlled feedback, consists of the results of a test which was designed especially for evaluation purposes. Another interesting type of feedback we received was extracted from the typical errors observed during the encoding work. A third important type of feedback was the receipt of suggestions for improvement of the work related to the encoding scheme used.

Test evaluation

Test Design

After one year of experience with the use of the encoding scheme, we designed a test whose basic aims were to determine:

the level of understanding of the SGML concepts underlying the encoding used in the project;
the knowledge acquired about the encoding scheme elements;
the degree of acceptance of the work on encoding manuscripts;
the views expressed on the usability of the encoded descriptions;
the level of satisfaction expressed by the participants in the project;
the degree of complexity of the manuscript encoding as an activity,
the influence which any personal experience with computers and the subject area itself had on the above-mentioned elements.

Group Characteristics

The basic characteristics of the participants in the test evaluation which were used to differentiate between different types of participants were: their computer literacy, their English language fluency and their experience in the application domain (medieval Slavic manuscripts).

We had 3 persons with more than 18 years of experience in the subject domain, but only one of them had used word processing software in the Microsoft Windows environment before the project started. The two other specialists had some experience in the use of the DOS-based text editor ChiWriter.

In the group of 5 specialists with less than 10 years of experience, which represents the younger generation, only one person did not have any computer experience. The others had different levels of computer literacy - one with the use of ChiWriter, the rest with word processing software under Microsoft Windows.

The level of English language fluency was an important characteristic for us because the encoding scheme is based on the use of English terms, and the user interface of Microsoft Windows and SoftQuad Author/Editor(tm) requires some working knowledge of English. The majority of the researchers reported an average or good knowledge of English, while 3 specialists reported a very good working knowledge of English and one reported unfamiliarity with the language.

The distribution of the values of the above-mentioned characteristics in our relatively small group of studied subjects shows that the group can be divided into 2 homogeneous subgroups with regard to their experience in the subject domain (less than 10 years and more than 18 years), with a relatively low level of computer literacy but a good working knowledge of English.

Controlled Feedback

In the controlled feedback we were interested to find out more about the participants' appreciation of the proposed encoding scheme. All the participants demonstrated a good knowledge of the scheme's details and expressed general satisfaction with its use. They were also absolutely positive that they would apply it in their further work.

This desire to continue to use the computer-based description of the texts was somewhat in contradiction with the fact that they showed interest only in such traditional uses of the encoded texts as the production of indices. However, all participants in the experiment showed a good understanding of which of the results given as examples in the test were extractable and which were not.

One feature of the encoding scheme which caused general dissatisfaction among the participants was the necessity to retype some data several times in different descriptions.

An interesting observation on the level of SGML competence acquired after more than one year of practical experience was that some of the specialists in medieval Slavic manuscripts still did not understand the difference between elements and attributes and did not recognize the hierarchical nature of the structure they were using. This looks surprising because SoftQuad Author/Editor(tm) supports a clear hierarchical representation of the document structure on the screen and supports different mechanisms for entering elements' values and attributes' values. These misunderstandings influenced the work of some specialists, but it is still a point of discussion whether the users should be acquainted with SGML itself when it is being applied in a project.
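The distinction between elements, attributes and their hierarchy can be made concrete with a small sketch. The Python fragment below is an illustration only, not part of the project software: it prints an indented outline in which nesting depth shows the hierarchy, attribute values appear in square brackets next to the element they qualify, and element content appears in quotation marks on its own line. The sample fragment uses element and attribute names from our encoding scheme, but its values are invented, and the class `OutlinePrinter` and function `outline` are names introduced here. Explicit end tags are assumed.

```python
from html.parser import HTMLParser


class OutlinePrinter(HTMLParser):
    """Build an indented outline: one line per element or text value."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Attributes qualify the element itself, so they stay on its line.
        attr_text = "".join(f" [{name.upper()}={value}]" for name, value in attrs)
        self.lines.append("  " * self.depth + tag.upper() + attr_text)
        self.depth += 1

    def handle_data(self, data):
        # Element content is shown one level below the element that holds it.
        text = " ".join(data.split())
        if text:
            self.lines.append("  " * self.depth + f'"{text}"')

    def handle_endtag(self, tag):
        self.depth -= 1


def outline(sgml):
    printer = OutlinePrinter()
    printer.feed(sgml)
    printer.close()
    return printer.lines


# Invented values; element and attribute names follow the encoding scheme.
sample = """
<CODICOLOGY>
  <BINDING>leather</BINDING>
  <MATERIALDESC TYPE="paper">
    <LAYOUT>
      <SIZEMATERIAL TYPE="vertical" UNIT="cm">28</SIZEMATERIAL>
    </LAYOUT>
  </MATERIALDESC>
</CODICOLOGY>
"""

lines = outline(sample)
print("\n".join(lines))
```

In the outline, `paper` is attached to MATERIALDESC as the value of its TYPE attribute, whereas `28` is the content of SIZEMATERIAL; a view of this kind shows both on one screen, which the editing interface described above does not.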

Analysis of the work

Another source of information about the level of acceptance and understanding of the work in an SGML environment is the practical work done by the specialists. On the one hand, their typical errors show the specific problems they have in their work. On the other hand, a comparison of the different usage styles of the same SGML descriptions may give some good ideas for future work.

Errors

The typical errors observed during the stage of the processing of the descriptions included:

Forgetting about optional elements.
One of the problems observed quite often was that no data for optional elements was filled in. This could mean that the optional elements (like BINDING) were considered not to be important, but discussions with the philologists quickly showed that they simply forgot that such elements exist in the encoding scheme, or could not find their location during their data entering work. This phenomenon was observed in the work of 7 specialists. The forgetting of optional elements is probably directly influenced by the size of the encoding scheme. We do not know whether any research has already been performed on the relationship between the complexity of a document model and its improper use, but from our own experience we are certainly inclined to believe that such research could be very useful for the practical work of DTD developers, in the creation of front-end tools, etc.
Misuse of attribute values.
The necessity to change the default values or to fill in specific values for an attribute also caused problems for practically all specialists. This corroborates the results of our test, which showed that the roles of elements and attributes were frequently confused. Probably the simple explanation here is the specific interface design of SoftQuad Author/Editor(tm), where the attributes are only visible after opening a new window.
Getting confused about references.
One particular difficulty in the work of the philologists was caused by the necessity to keep records of the identifiers of references. This could be overcome by support for 'reference-books' that are easy to handle.
Misspellings in transcriptions of Old Church Slavic, modern Bulgarian, etc.
During the work on entering data our specialists had to use a transcription table for representing the Old Church Slavic and modern Cyrillic alphabets. We observed errors in the work of all philologists in the use of this transcription table. This was caused by the fact that the transcription table used differs from the traditional transcription systems (e.g., the telegraphic one for modern Bulgarian). One of the decisions taken from the very beginning of our project was to apply the transcription system which was created in the Netherlands for the purposes of the pioneering project in this area. This decision however caused a conflict with the 'transcription habits' of all users.
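Two of the error types listed above, omitted optional elements and misused attribute values, lend themselves to a simple automatic reminder. The following Python sketch is hypothetical and was not part of the project: the attribute values in `ALLOWED` are copied from the encoding scheme in the appendix, BINDING is the optional element named above, and the choice of INK and WATERMARK as further elements to remind the encoder about is an assumption made for the example, as are the names `Checker` and `check`.

```python
from html.parser import HTMLParser

# Enumerated attribute values, copied from the encoding scheme in the appendix.
ALLOWED = {
    ("MATERIALDESC", "TYPE"): {"paper", "vellum", "papyrus"},
    ("SIZEMATERIAL", "UNIT"): {"cm", "inch"},
    ("SOURCE", "TYPE"): {"Greek", "other"},
}

# Elements the encoder is reminded about. BINDING is mentioned in the text;
# INK and WATERMARK are assumed here purely for illustration.
REMIND = ["BINDING", "INK", "WATERMARK"]


class Checker(HTMLParser):
    """Record which elements occur and which attribute values are not allowed."""

    def __init__(self):
        super().__init__()
        self.seen = set()
        self.bad_values = []

    def handle_starttag(self, tag, attrs):
        name = tag.upper()
        self.seen.add(name)
        for attr, value in attrs:
            allowed = ALLOWED.get((name, attr.upper()))
            # Comparison is case-sensitive here, a simplification.
            if allowed is not None and value not in allowed:
                self.bad_values.append((name, attr.upper(), value))


def check(sgml):
    """Return (elements never used, attribute values outside the scheme)."""
    checker = Checker()
    checker.feed(sgml)
    checker.close()
    missing = [name for name in REMIND if name not in checker.seen]
    return missing, checker.bad_values


# Invented description fragment with one disallowed attribute value.
sample = (
    '<CODICOLOGY><MATERIALDESC TYPE="parchment">'
    "<INK>brown</INK></MATERIALDESC></CODICOLOGY>"
)
missing, bad_values = check(sample)
print("Not filled in:", missing)
print("Check these attribute values:", bad_values)
```

A report of this kind does not decide whether an omission is an error, since an optional element may legitimately be absent; it only reminds the encoder that the element exists.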

Proposals

Half of the researchers who entered manuscript descriptions suggested further refinements and enlargements of the encoding scheme. The style of these suggestions was SGML-like, which shows that all of them accepted the underlying design philosophy. Some of them also explained that their involvement in the project helped them to better organize their own research work.

We also constructed new types of outputs from the descriptions as requested by the users, which also illustrates their growth in understanding of how computer encoded descriptions can serve their research needs.

Conclusion

Finally, we would like to share some ideas about how to improve the participation of similar SGML-inexperienced users in a major SGML project:

The level of motivation of the users can be raised by offering a detailed explanation of the beneficial outcomes from the project. Sometimes, the discussion of the expected outcomes will also improve the work done during the design stages, where all the crucial decisions about the actual computer implementation still can be made.
The lack of understanding of the future uses and the instability of the document models suggest that special attention has to be paid to a prototyping stage, in order to avoid producing something which is already of no use immediately after the end of the project.
We hope that in the future increasing attention will be paid to the study of the complexity of the document models used in SGML applications (the number of elements, attributes and their relations). Since we ourselves worked on a very complex model, we witnessed firsthand that some elements were not used systematically.
Very complex document models also need special tools for visualizing. Entering data during the encoding was sometimes depressing for the specialists, because the tags would already take up several display screens before any real content could be actually entered. Probably, front-ends based on a dialogue-like interaction would be more appropriate in such cases.
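To illustrate what a dialogue-like front-end might look like, the Python sketch below asks one question per field and generates the tagged description only from the answers given, so that the encoder never sees empty tags. It is a minimal sketch under stated assumptions, not a tool from the project: the three questions in `FIELDS` are invented, the element paths follow the encoding scheme in the appendix, and answers are inserted verbatim, so a real tool would also have to handle markup characters and the project's transcription conventions.

```python
# (question shown to the encoder, element path in the encoding scheme)
# The questions are invented; the paths follow the appendix.
FIELDS = [
    ("Number of folia?", ("CODICOLOGY", "NUMFOLIO")),
    ("Binding?", ("CODICOLOGY", "BINDING")),
    ("Name of the scribe?", ("SCRIBE", "NAME")),
]


def build_tree(answers):
    """Turn the answers into nested dictionaries, skipping empty answers."""
    tree = {}
    for question, path in FIELDS:
        value = (answers.get(question) or "").strip()
        if not value:
            continue
        node = tree
        for name in path[:-1]:
            node = node.setdefault(name, {})
        node[path[-1]] = value
    return tree


def render(tree, depth=0):
    """Render the nested dictionaries as indented, explicitly closed tags."""
    pad = "  " * depth
    lines = []
    for name, content in tree.items():
        if isinstance(content, dict):
            lines.append(f"{pad}<{name}>")
            lines.extend(render(content, depth + 1))
            lines.append(f"{pad}</{name}>")
        else:
            lines.append(f"{pad}<{name}>{content}</{name}>")
    return lines


def interview(ask=input):
    """Ask every question in turn and return the generated markup."""
    answers = {question: ask(question + " ") for question, _ in FIELDS}
    return "\n".join(render(build_tree(answers)))


# Scripted answers stand in for a live dialogue; the empty answer is skipped.
scripted = {
    "Number of folia? ": "120",
    "Binding? ": "",
    "Name of the scribe? ": "Scribe A",
}
markup = interview(ask=scripted.get)
print(markup)
```

Calling `interview()` without arguments would run the same dialogue interactively at the keyboard. The design choice is that the document structure lives in the table of fields, not on the screen, which is one possible answer to the problem of tags filling several display screens before any content is entered.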

We believe that our experiences, as presented in this paper, will be of interest to all those who intend to start working on large-scale projects involving the participation of SGML-unaware users.

Appendix 1: ENCODING SCHEME

 <TEI.2>
 <TEIHEADER>
 <FILEDESC>
                <TITLESTMT>
                        <TITLE>
                        <AUTHOR>
                        <EDITOR>
                        <FUNDER>
                        <PRINCIPAL>
                        <SPONSOR>
                <PUBLICATIONSTMT>
                        <PUBLISHER>
                        <PUBPLACE>
                        <DATE>
                <SOURCEDESC>
                        <CATALOGUESTMT>
                                <MANUSCRIPTNAME>
                                <MANUSCRIPTLOCATION>
                                        <REPOSITCOUNTRY>
                                        <REPOSITCITY>
                                        <REPOSITORY>
                                        <REPOSITSIGNATURE>
                                        <CATALOGNR>
                                        <RELATEDPERSON>
                                </MANUSCRIPTLOCATION>
                        </CATALOGUESTMT>
                </SOURCEDESC>
 </FILEDESC>
 <ENCODINGDESC></ENCODINGDESC>
 <PROFILEDESC>
                <LANGUSAGE>
                <CODICOLOGY>
                        <NUMFOLIO>
                        <QUIRESTRUCTURE>
                                <QUIRE>
                                        <NUM>
                                        <COMPOSITIONQUIRE>
                        <PAGINATION>
                        <PRICKING>
                        <BINDING>
                        <MATERIALDESC
                        TYPE=paper|vellum|papyrus
                        EXTENT=general|partial
                                          missing>
                                <LAYOUT>
                                        <NUMFOLIO>
                                        <SIZEMATERIAL
                                        TYPE=vertical|horizontal
                                        RANGE=material|written
                                                        area
                                        UNIT=cm|inch>
                                        <RULELINE>
                                        <NUMBCOLUMN>
                                        <NUMBLINES>
                                </LAYOUT>
                                <INK>
                                <WATERMARK>
                                <GREGORYRULE>
                                <ORNAMENT
                                TYPE=borders|cadels|
                                calendars|capitals|
                                initials|illustrations|
                                linefillers|vjaz>
                                <MISCOBSERVAT
                                TYPE=miscdamage|
                                restoration|palimpsest>
                                <ALPHABET>
                        </MATERIALDESC>
                </CODICOLOGY>
                <SCRIBE>
                        <NAME>
                        <PAGERANGE>
                                <STARTINGPAGE>
                                <ENDINGPAGE>
                        <ORTHOGRCHARACT>
                        <PALAEOCHARACT>
                </SCRIBE>
                <MANUSCRIPTCONTENTDESC
                TYPE=compilation|original|
                translation
                STYLE=narrative|non-narrative>
                <NUMBERTEXTS>
                <MANUSCRIPTCREATION>
                        <MANUSCRIPTDATE>
                        <MANUSCRIPTPLACE>
                <SOURCE TYPE=Greek|other>
                <TRANSLATION>
                        <NUM>
                        <DATE>
                <PROTOGRAPH>
                <ANTIGRAPH>
                <LITREDACTION>
        </MANUSCRIPTCONTENTDESC>
        <ARTICLECONTENTDESC
                ID
                TYPE=compilation|original|
                translation
                STYLE=narrative|non-narrative>
                <NUMBERTEXTS>
                <ARTICLENAME>
                <ARTICLEAUTHOR>
                <SOURCE TYPE=Greek|other>
                <TRANSLATION>
                        <NUM>
                        <DATE>
                <ANTIGRAPH>
                <APOGRAPH>
                <CHURCHCALENDAR>
        </ARTICLECONTENTDESC>
</PROFILEDESC>
<REVISIONDESC></REVISIONDESC>
</TEIHEADER>
 <TEXT>
        <BODY>
                <DIV>
                        <HEAD></HEAD>
                                <INCIPIT>
                                        <NORMINCIPIT>
                                        <NONNORMINCIPIT>
                                </INCIPIT>
                                <P></P>
                                <EXPLICIT>
                                        <NORMEXPLICIT>
                                        <NONNORMEXPLICIT>
                                </EXPLICIT>
                        </DIV>
                </BODY>
        </GROUP>
</TEXT>
</TEI.2>

Acknowledgements

The research presented in the paper was partially supported by a research project MU-IS 6/94 entitled 'Computer Modelling of Palaeographic Data and Activities' (National Fund for Scientific Research).

I would like to thank Dr. Malina Jordanova for her kind assistance in the preparation of the feedback test. I also would like to express my gratitude to the specialists in Medieval Slavic Studies who worked on the project.

Notes

A good overview of the earliest attempts to apply computer technology to Medieval Slavic studies can be found in vol. 17 of the journal 'Polata Knigopisnaja', 1987.