[Mirrored from: http://www.sgmlbelux.be/96/dobreva.htm]

The use of SGML by philologists
Experiences gained during the Medieval Slavic Manuscripts Encoding Project

Milena Dobreva
Institute of Mathematics and Computer Science
bl. 8, Acad. G. Bonchev
1113 Sofia
Bulgaria

E-mail : dobreva@bgearn.acad.bg

Abstract:

This paper presents the problems which arose when philologists used TEI P3 to encode descriptions of medieval Slavic manuscripts.

An encoding scheme for manuscript description was created within the framework of a joint American-Bulgarian project entitled 'Computer Supported Processing of Medieval Slavic Manuscripts'.

During the work on the project, we had to train specialists for data entry using SoftQuad Author/Editor(tm) v. 3.1. Eight philologists who were both SGML- and TEI-unaware entered about 200 manuscript descriptions.

Since this was the first SGML-oriented project implemented in Bulgaria we were not able to use any previous local experience. In addition to this, the problem area for which the TEI P3 encoding scheme was developed (medieval manuscripts) resulted in a high level of complexity for the work our specialists had to perform.

We present our experience in the training, data entry and editing stages of the project, as well as some characteristics of the work performed (time required, error rates and user satisfaction).

As a result of our work, half of the specialists entering manuscript data suggested improvements to the SGML encoding scheme used. We consider this to be one of the most positive outcomes of the project, and a clear indication of the acceptance of SGML by our previously inexperienced users.


Keywords: TEI P3, SGML, Medieval Slavic Manuscripts, project organisation, training of computer-illiterate users

Introduction

A joint American-Bulgarian project entitled 'Computer Supported Processing of Medieval Slavic Manuscripts' coordinated by Prof. D. Birnbaum, University of Pittsburgh, USA, and Assoc. Prof. A. Miltenova, Institute of Literature, Sofia, Bulgaria, was implemented in 1994-1996.

The basic aim of the project was to produce a representative set of manuscript descriptions in electronic form to be used for educational and research purposes. The reasons which motivated us to undertake the project were:

The basic difficulty in creating computer tools for manuscript description is caused by the fact that the application area still lacks generally accepted formal models for representing its knowledge (this is a common problem in the design of computer applications in the Humanities). On the one hand, the development of any computer-oriented model still requires a clear identification of the objects in the knowledge domain and their relations. On the other hand, the needs of the specialists in the application area, the degree of detail and the depth of their knowledge differ considerably. In the case of medieval manuscripts for example, the researchers who study the codicology of manuscripts are usually not interested in the structure of the manuscript texts. They would therefore not need a tool offering them a detailed description of the text structure, and vice versa.

In addition to these difficulties specific to an 'application area', we also had to overcome problems related to the specific computer platforms being used. The computer representation of medieval Slavic manuscripts texts is still a subject of standards development (i.e. we do not have an internationally recognised standard which serves the needs of the specialists in the area).

Experience in applying computer technology to medieval Slavic studies has accumulated over the last 15 years. There were three major attempts to use computers for manuscript descriptions: one implemented in the Netherlands in the early 1980s (see Notes), one in Bulgaria in the late 1980s and one in Germany in the early 1990s. The experience gained during these projects showed that:

With respect to these requirements, the goal of the project was to create a manuscript description which would be:

We chose to use the Standard Generalized Markup Language (SGML) and, more specifically, the Text Encoding Initiative (TEI P3) standard as the encoding platform best suited to our needs, for four reasons: its orientation towards document interchange; the possibility of working at different levels of detail; the possibility of changing a data model without data loss, provided the original model is well elaborated; and the existence of both commercial and shareware tools for various computer platforms, meeting different user needs.

The Work on the Project

Project Organisation

The researchers involved in the project formed an interdisciplinary group of three specialists with expertise in different areas of Medieval Slavic Studies and one computer scientist. During the implementation of the project, 6 more specialists in Slavic Studies worked on manuscript description encoding.

The work on the project consisted of several stages:

The first stage, design of the encoding scheme, took an initial period of about 3 man-months. This stage cannot be considered final, since new ideas for further refinements still arise periodically.

This stage started with a discussion of the model of a manuscript description which we wanted to present in electronic form. The basic elements to be included in the model, and their relations, were clarified. We then checked how far TEI P3 could meet the needs of our model. It became clear that the most specialised elements of the model are not present in the general TEI P3 Guidelines. This forced us to produce our own DTD, which adds a significant number of new elements to those already present in TEI P3. The ratio of the number of added elements to the number of already existing elements in the TEIHEADER is 3:1. This means that even users accustomed to TEI P3 would still need some additional training in order to use our description.

The computer implementation of our DTD was based on the use of SoftQuad Author/Editor(tm) with compilation of a rules file by the SoftQuad Rules Builder.

The second stage, training and entry of manuscript descriptions, takes about 2 weeks of training per person and 2 days (on average) to enter one manuscript description. Entry usually follows a preparatory period of data collection which takes about 1-3 weeks, depending on the expert's familiarity with the particular manuscript.

We discuss training together with data entry because the typical training practice was to work directly on the computer with a real manuscript example.

Another important form of training support was participation in specialised SGML workshops. The philologists who entered data took part in 2 workshops (one before the beginning of the project and one in the middle). The first one, delivered by Dr. Harry Gaylord in 1993 in Sofia, provided the philologists with some initial knowledge about the underlying philosophy of SGML. The second one, a hands-on session presented by Prof. D. Birnbaum, Dr. H. Gaylord, Prof. N. Finke and Dr. W. Bader at the First International Conference on Computer Processing of Medieval Slavic Manuscripts held in 1995 in Blagoevgrad, Bulgaria, gave the same philologists an opportunity to refresh their views on the encoding principles which they were already applying in their work.

The last stage, editing and initial use of the manuscript descriptions, is the current phase of the project. The data entered in the manuscript descriptions are used mainly to extract lists which help in standardising terminology, finding misspellings and cross-checking some of the data. For the extraction of these data we had to develop our own software tools.
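The kind of list extraction described above can be illustrated with a minimal Python sketch. It is not the project's software, whose details are not given in this paper. It assumes that the descriptions have been normalised to well-formed markup which an XML parser can read (SGML files may omit end tags); the sample data, the function names and the similarity cutoff are hypothetical. The sketch collects the content of one element across several descriptions and reports pairs of values which are spelled almost alike, as candidates for terminology standardisation or misspelling checks.

```python
import difflib
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample descriptions, reduced to a few elements of the scheme.
SAMPLES = [
    "<SOURCEDESC><REPOSITCITY>Sofia</REPOSITCITY>"
    "<REPOSITORY>National Library</REPOSITORY></SOURCEDESC>",
    "<SOURCEDESC><REPOSITCITY>Sofia</REPOSITCITY>"
    "<REPOSITORY>National Libary</REPOSITORY></SOURCEDESC>",
    "<SOURCEDESC><REPOSITCITY>Sofija</REPOSITCITY>"
    "<REPOSITORY>National Library</REPOSITORY></SOURCEDESC>",
]


def collect_values(documents, tag):
    """Count the distinct text values of `tag` over all documents."""
    counts = Counter()
    for doc in documents:
        root = ET.fromstring(doc)
        for element in root.iter(tag):
            text = (element.text or "").strip()
            if text:
                counts[text] += 1
    return counts


def near_duplicates(values, cutoff=0.85):
    """Return pairs of distinct values whose similarity ratio is >= cutoff."""
    ordered = sorted(values)
    pairs = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            ratio = difflib.SequenceMatcher(None, first, second).ratio()
            if ratio >= cutoff:
                pairs.append((first, second))
    return pairs


if __name__ == "__main__":
    for tag in ("REPOSITCITY", "REPOSITORY"):
        values = collect_values(SAMPLES, tag)
        print(tag, dict(values), near_duplicates(values))
```

A philologist would still have to decide which spelling of each reported pair is the standard one; the sketch only narrows down where to look.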

The Encoding Scheme

The encoding scheme used in the project is presented in the Appendix. The main changes we made to the TEIHEADER section of the TEI.2 element are in its PROFILEDESC part, where we placed elements which represent domain-specific knowledge and relate to the history of the creation of the manuscript. The most important groups of elements present data about:

Some smaller changes were made in the FILEDESC section of the TEIHEADER to accommodate the specific features of manuscript cataloguing information.

We also had to add new elements for the incipits and explicits (the opening and closing words) of manuscript texts, which are important for the study of medieval manuscripts.

Thus, our encoding contains about 100 elements relevant to the description of the manuscript itself, excluding the features which could be encoded in the presentation of the text itself. This description can be considered to be rather complex to use, since somebody who is planning to enter data should be well-acquainted with all the elements in a description and their proper placement and interrelations.

Although our current document model has already reached an impressive level of complexity, we should mention that many elements can still be represented in even greater detail than they already are in our current encoding. One of the difficult challenges in our work was to stop adding more details which could be considered to be interesting for only a very small group of researchers. This requires taking decisions which are not always easy to make since the decision criteria ultimately depend on personal views about what is important and what is not, criteria which can be easily attacked.

Preliminary Expectations of the Participants

During the design stage of our work, we also had to take into account the intended use of the encoded materials. The normal software development approach is to choose among several possible encoding solutions by taking into account the desired future use.

However, software development for use in the Humanities does not follow this approach because the work goes in exactly the opposite way. It is only after some computer resources have already been created, based on rather poor computer models because of the inherent difficulties of interdisciplinary work, that the users start to understand more about the use of computers and can become dissatisfied with the existing model they have to use.

In our case, it was clear from the very beginning that the encoding scheme would be used for entering data, but beyond this the really interesting question was: "What are we going to use these encoded materials for?". The possible uses suggested by the philologists at the beginning of the project did not go beyond the compiling of indices and file cards on the basis of the description. This limited view of the possible uses of SGML technology clearly illustrates how familiarity with old ways of working limits one's capability of imagining new, innovative ways of using computer technology.

However, the existence of electronic descriptions was judged useful, if only because it made electronic resources available. The philologists' ideas about the data they would like to extract from the descriptions were not very clear, and this is still confirmed by the continual appearance of new suggestions for desired outputs.

This kind of situation is rather typical for a project in the Humanities and it requires a high level of understanding from the computer experts involved in the implementation of a software tool.

Involvement of Philologists

We would like to present here the tasks which were performed by the philologists participating in the project and to discuss some of the specific difficulties we encountered.

Design stage

At the design stage, the philologists had to discuss the elements to be included in the model. During the initial work on the encoding scheme, we had 2 specialists participating actively. Some previous experience acquired during the development of other models in the same problem area was used (we should mention here the traditional manuscript description formats used in the printed editions and the formal models applied in previously developed computer systems.)

The participation of the same philologists who entered the manuscript descriptions resulted at the end of the project in a number of proposals for extending the original encoding scheme. These proposals probably represent the most important feedback about the success of the use of SGML in the encoding work, because they were themselves submitted in an SGML-like style.

Entering data

This was the stage of the project that required the most intensive involvement, in both time and effort, of the philologists participating in the project. Since data entry requires well-qualified people (capable of recognizing different elements in the description and having a good knowledge of Old Church Slavic), we had to include in our team researchers who were interested in contributing to the creation of electronic resources in the field of medieval Slavic manuscripts.

We had 8 philologists who entered manuscript descriptions. Three of them have long experience in the subject domain (more than 18 years), and 5 of them constitute a group with less than 10 years of experience. The computer literacy of the group does not correlate with the years of work in the field. Both groups included researchers with some level of computer literacy and researchers without any.

Processing data

In this stage of the project, we had one specialist who was editing all manuscript descriptions. After the editing we used specially designed software to produce lists of important elements (usually by processing all the descriptions produced by one specialist, and the whole set of descriptions). The philologist who entered the descriptions was thus able to see what results were extracted from his/her description and to find out about their place in the general picture.

As mentioned earlier, we still receive new proposals for desirable data outputs from the descriptions. This clearly illustrates the rise in competence in the use of electronic resources among the members of the project team.

Feedback

In order to improve the work on this project and similar projects in future, we tried to perform some evaluation of the acceptance of SGML by the project participants, people who are not specialists in computer science and who did not have a high level of computer literacy before starting their work on the project.

We have three basic types of feedback. The first one, which I call here controlled feedback, consists of the results of a test which was designed especially for evaluation purposes. Another interesting type of feedback we received was extracted from the typical errors observed during the encoding work. A third important type of feedback was the receipt of suggestions for improvement of the work related to the encoding scheme used.

Test evaluation

Test Design

After one year of experience with the use of the encoding scheme, we designed a test whose basic aims were to determine:

Group Characteristics

The basic characteristics used to differentiate between types of participants in the test evaluation were: their computer literacy, their English language fluency and their experience in the application domain (medieval Slavic manuscripts).

We had 3 persons with more than 18 years of experience in the subject domain, but only one of them had used word processing software in a Microsoft Windows environment before the project started. The two other specialists had some experience in the use of the DOS-based text editor ChiWriter.

In the group of 5 specialists with less than 10 years of experience, which represents the younger generation, only one person did not have any computer experience. The others had different levels of computer literacy: one had used ChiWriter, the rest word processing software under Microsoft Windows.

The level of English language fluency was an important characteristic for us because the encoding scheme is based on the use of English terms, and the user interface of Microsoft Windows and SoftQuad Author/Editor(tm) requires some working knowledge of English. The majority of the researchers reported an average or good knowledge of English, while 3 specialists reported a very good working knowledge of English and one reported unfamiliarity with the language.

The distribution of the values of the above-mentioned characteristics in our relatively small group of studied subjects shows that the group can be divided into 2 homogeneous subgroups with regard to their experience in the subject domain (less than 10 years and more than 18 years), with a relatively low level of computer literacy but a good working knowledge of English.

Controlled Feedback

In the controlled feedback we were interested to find out more about the participants' appreciation of the proposed encoding scheme. All the participants demonstrated a good knowledge of the scheme's details and expressed general satisfaction with its use. They were also absolutely positive that they would apply it in their further work.

This desire to continue to use the computer-based description of the texts was somewhat in contradiction with the fact that they showed interest only in such traditional uses of the encoded texts as the production of indices. However, all participants in the experiment showed a good understanding of which of the results given as examples in the test could be extracted and which could not.

One feature of the encoding scheme which caused general dissatisfaction among the participants was the necessity to retype some data several times in different descriptions.

An interesting observation on the level of SGML competence acquired after more than one year of practical experience was that some of the specialists in medieval Slavic manuscripts still did not understand the difference between elements and attributes and did not recognize the hierarchical nature of the structure they were using. This looks surprising because SoftQuad Author/Editor(tm) supports a clear hierarchical representation of the document structure on the screen and supports different mechanisms for entering elements' values and attributes' values. These misunderstandings influenced the work of some specialists, but it is still a point of discussion whether the users should be acquainted with SGML itself when it is being applied in a project.
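The distinction which some participants found difficult can be made concrete with a short Python sketch. It is only an illustration under stated assumptions: the fragment uses element and attribute names from our scheme, but the values are invented, the markup is written in well-formed style so that an XML parser can read it, and the function name outline is hypothetical. The sketch prints one line per element, indented according to its depth in the hierarchy, and labels attribute values (which qualify an element) separately from element content (which is the data recorded in it).

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: names come from the encoding scheme, values are invented.
FRAGMENT = """
<CODICOLOGY>
  <NUMFOLIO>120</NUMFOLIO>
  <MATERIALDESC TYPE="paper" EXTENT="general">
    <INK>brown</INK>
  </MATERIALDESC>
</CODICOLOGY>
"""


def outline(element, depth=0):
    """Return one line per element: indentation, name, attributes, own text."""
    attributes = " ".join(f'{name}="{value}"' for name, value in element.attrib.items())
    text = (element.text or "").strip()
    line = "  " * depth + element.tag
    if attributes:
        line += f" [attributes: {attributes}]"
    if text:
        line += f" [content: {text}]"
    lines = [line]
    # Child elements are nested one level deeper than their parent.
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


lines = outline(ET.fromstring(FRAGMENT.strip()))
print("\n".join(lines))
```

In this fragment TYPE and EXTENT are attributes of MATERIALDESC, while INK is an element nested inside it with content of its own.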

Analysis of the work

Another source of information about the level of acceptance and understanding of the work in an SGML environment is the practical work done by the specialists. On the one hand, their typical errors show the specific problems they have in their work. On the other hand, a comparison of the different usage styles of the same SGML descriptions may give some good ideas for future work.

Errors

The typical errors observed during the stage of the processing of the descriptions included:

Proposals

Half of the researchers who entered manuscript descriptions suggested further refinements and enlargements of the encoding scheme. The style of these suggestions was SGML-like, which shows that all of them accepted the underlying design philosophy. Some of them also explained that their involvement in the project helped them to better organize their own research work.

We also constructed new types of outputs from the descriptions as requested by the users, which also illustrates their growth in understanding of how computer encoded descriptions can serve their research needs.

Conclusion

Finally, we would like to share some ideas about how to improve the participation of similar SGML-inexperienced users in a major SGML project:

We believe that our experiences, as presented in this paper, will be of interest to all those who intend to start working on large-scale projects involving the participation of SGML-unaware users.

Appendix : ENCODING SCHEME

 <TEI.2>
 <TEIHEADER>
 <FILEDESC>
                <TITLESTMT>
                        <TITLE>
                        <AUTHOR>
                        <EDITOR>
                        <FUNDER>
                        <PRINCIPAL>
                        <SPONSOR>
                <PUBLICATIONSTMT>
                        <PUBLISHER>
                        <PUBPLACE>
                        <DATE>
                <SOURCEDESC>
                        <CATALOGUESTMT>
                                <MANUSCRIPTNAME>
                                <MANUSCRIPTLOCATION>
                                        <REPOSITCOUNTRY>
                                        <REPOSITCITY>
                                        <REPOSITORY>
                                        <REPOSITSIGNATURE>
                                        <CATALOGNR>
                                        <RELATEDPERSON>
                                </MANUSCRIPTLOCATION>
                        </CATALOGUESTMT>
                </SOURCEDESC>
 </FILEDESC>
 <ENCODINGDESC></ENCODINGDESC>
 <PROFILEDESC>
                <LANGUSAGE>
                <CODICOLOGY>
                        <NUMFOLIO>
                        <QUIRESTRUCTURE>
                                <QUIRE>
                                        <NUM>
                                        <COMPOSITIONQUIRE>
                        <PAGINATION>
                        <PRICKING>
                        <BINDING>
                        <MATERIALDESC
                        TYPE=paper|vellum|papyrus
                        EXTENT=general|partial
                                          missing>
                                <LAYOUT>
                                        <NUMFOLIO>
                                        <SIZEMATERIAL
                                        TYPE=vertical|horizonal
                                        RANGE=material|written
                                                        area
                                        UNIT=cm|inch>
                                        <RULELINE>
                                        <NUMBCOLUMN>
                                        <NUMBLINES>
                                </LAYOUT>
                                <INK>
                                <WATERMARK>
                                <GREGORYRULE>
                                <ORNAMENT
                                TYPE=borders|cadels|
                                calendars|capitals|
                                initials|illustrations|
                                linefillers|vjaz>
                                <MISCOBSERVAT
                                TYPE=miscdamage|
                                restoration|palimpsest>
                                <ALPHABET>
                        </MATERIALDESC>
                </CODICOLOGY>
                <SCRIBE>
                        <NAME>
                        <PAGERANGE>
                                <STARTINGPAGE>
                                <ENDINGPAGE>
                        <ORTHOGRCHARACT>
                        <PALAEOCHARACT>
                </SCRIBE>
                <MANUSCRIPTCONTENTDESC
                TYPE=compilation|original|
                translation
                STYLE=narrative|non-narrative>
                <NUMBERTEXTS>
                <MANUSCRIPTCREATION>
                        <MANUSCRIPTDATE>
                        <MANUSCRIPTPLACE>
                <SOURCE TYPE=Greek|other>
                <TRANSLATION>
                        <NUM>
                        <DATE>
                <PROTOGRAPH>
                <ANTIGRAPH>
                <LITREDACTION>
        </MANUSCRIPTCONTENTDESC>
        <ARTICLECONTENTDESC
                ID
                TYPE=compilation|original|
                translation
                STYLE=narrative|non-narrative>
                <NUMBERTEXTS>
                <ARTICLENAME>
                <ARTICLEAUTHOR>
                <SOURCE TYPE=Greek|other>
                <TRANSLATION>
                        <NUM>
                        <DATE>
                <ANTIGRAPH>
                <APOGRAPH>
                <CHURCHCALENDAR>
        </ARTICLECONTENTDESC>
</PROFILEDESC>
<REVISIONDESC></REVISIONDESC>
</TEIHEADER>
 <TEXT>
        <BODY>
                <DIV>
                        <HEAD></HEAD>
                                <INCIPIT>
                                        <NORMINCIPIT>
                                        <NONNORMINCIPIT>
                                </INCIPIT>
                                <P></P>
                                <EXPLICIT>
                                        <NORMEXPLICIT>
                                        <NONNORMEXPLICIT>
                                </EXPLICIT>
                        </DIV>
                </BODY>
        </GROUP>
</TEXT>
</TEI.2>

Acknowledgements

The research presented in the paper was partially supported by a research project MU-IS 6/94 entitled 'Computer Modelling of Palaeographic Data and Activities' (National Fund for Scientific Research).

I would like to thank Dr. Malina Jordanova for her kind assistance in the preparation of the feedback test. I also would like to express my gratitude to the specialists in Medieval Slavic Studies who worked on the project.


Notes

A good overview of the earliest attempts to apply computer technology to Medieval Slavic studies can be found in vol. 17 of the journal 'Polata Knigopisnaja', 1987.