The TEI and Author/Editor

This information is derived from a post to the TEI-L about the problems involved in getting some parts of the TEI DTD to compile with RulesBuilder, the DTD parser/compiler companion for SoftQuad's Author/Editor product.

This message does not pretend to be anything other than an informative summary of events; it is not intended as any kind of formal statement of how a solution ought or ought not to be derived.

The message header was:
From pflynn  Fri Jul 14 10:57:34 1995
Date: Fri, 14 Jul 1995 10:57:34 +0100
From: Peter Flynn <pflynn>
To: TEI-L@UICVM.CC.UIC.EDU
Subject: Re: RulesBuilder a nightmare?
Cc: trujillo@mail.utexas.edu
and the original text is located in the TEI-L archive at listserv@uicvm.cc.uic.edu as
Item #   Date   Time  Recs   Subject
------   ----   ----  ----   -------
001254 95/07/14 09:19  219   Re: RulesBuilder a nightmare?

A previous correspondent had mentioned:

> I accidentally erased the message, but in the last couple of
> days someone wrote an aside about the horrors of using Rules Builder
> with the TEI DTD(s).

That was probably me, but there weren't any horrors, just a few snags getting it to swallow stuff the right way round. Now that I have found out how to do it (thanks to some swift help from SQ), it's working fine. I was critical of RB doing things this way, but SQ's support has been excellent and has fixed the problem.

> Please tell me more.  I am about to embark on a medium-sized tagging
> project (a 45,000-word language corpus), and after getting lots of
> really slick adverts from the big names in the business (ArborText,
> Interleaf, etc) with Big Business-sized price tags, I had pretty
> much decided on getting an Author/Editor-Rules Builder bundle
> which carries a *substantial* academic discount. But now I'm not
> so certain. . .

No, as far as I can see A/E is quite capable of handling this. I don't know how it performs on a file that large (if you really do have it all as a single corpus file), as my experience of editing files >1Mb has been limited to A/E on a very small old PC, which is no basis for comparison with modern machines. (In fact, A/E worked fine; it was the slowness of the PC and its lack of memory and disk space that I had a problem with :-)

I got the DTD finished yesterday, and I promised to tell you all what I had to do, so here goes:

I had tried to compile the DTD as it stands, using Rules Builder for MS-Windoze. It didn't like it, which was why I posted my original complaint. I'm not sure why SQ wrote it so that it won't accept a prolog in the same fashion as other SGML software, but they were quick to find me a solution once I had identified the problem.

Here's what the original prolog said:

   <!DOCTYPE TEI.2 SYSTEM "tei2.dtd"
   [
   <!-- Standard tagsets needed for TLH work -- >
   <!ENTITY % TEI.corpus.dtd         'INCLUDE'>
   <!ENTITY % TEI.prose              'INCLUDE'>
   <!--ENTITY % TEI.verse              'INCLUDE' can't be used with prose-->
   <!ENTITY % TEI.transcr            'INCLUDE'>
   <!ENTITY % TEI.textcrit           'INCLUDE'>
   <!ENTITY % TEI.names.dates        'INCLUDE'>
   <!-- Extra tagset needed to allow documentation of tags in header -->
   <!--ENTITY % TEI.tagsets.dtd system 'teitsd2.dtd'>
   %TEI.tagsets.dtd;-->
   <!-- Standard character entities -->
   <!ENTITY % ISOlat1 system         "ISOLat1" 
        --"ISO 8879:1986//ENTITIES Added Latin 1//EN"-->
   %ISOlat1;
   <!ENTITY % ISOlat2 system         "ISOLat2" 
        --"ISO 8879:1986//ENTITIES Added Latin 2//EN"-->
   %ISOlat2;
   <!-- Local mods to supplement ISO char ents and rename tags -->
   <!ENTITY % CURIA.entities system  'curia.ent'>
   %CURIA.entities;
   ]>

The file curia.ent (pulled in through the CURIA.entities parameter entity above) contains extra character entities as well as some renaming of elements. The project is adding a lot of content-descriptive markup, so shortening the tag names reduces file size and makes the files more usable for people without graphical SGML editors. Here's the file:

   <!ENTITY fdot     SDATA "f" -- lenited f (dot-over)-->  
   <!ENTITY Fdot     SDATA "F" -- lenited F (dot-over)-->
   <!ENTITY ndot     SDATA "n" -- nasalised n (dot-over)-->
   <!ENTITY mdot     SDATA "m" -- nasalised m (dot-over)-->
   <!ENTITY mmacr    SDATA "m" -- m with macron -->
   <!ENTITY Sdot     SDATA "S" -- lenited S (dot-over)-->
   <!ENTITY sdot     SDATA "s" -- lenited s (dot-over)-->
   <!ENTITY ampersir SDATA "&" -- insular ampersand-->
   <!ENTITY turnsemi SDATA ";" -- inverted semi-colon-->
   <!-- we need to add more of these, eg &longe; -->

   <!ENTITY % n.persName  'ps'>
   <!ENTITY % n.addName   'an'>
   <!ENTITY % n.genName   'gn'>
   <!ENTITY % n.forename  'fn' >
   <!ENTITY % n.surname   'sn' >
   <!ENTITY % n.nameLink  'nk' >
   <!ENTITY % n.placeName 'pn'>
   <!ENTITY % n.orgName   'on' >
   <!ENTITY % n.roleName  'rn'>
   <!ENTITY % n.expan     'ex' >
   <!ENTITY % n.foreign   'frn' >
   <!ENTITY % n.milestone 'mls' >
   <!ENTITY % n.supplied  'sup' >
   <!ENTITY % n.quote     'qt' >
   <!ENTITY % n.unclear   'uncl' >

   <!-- Bodge to get round bug in TEI DTD, courtesy of LB -->
   <!ENTITY % x.data '%n.persName | %n.placeName | %n.orgName |'>

This lot compiled happily with psgml-mode. For RB, I had simply snipped the prolog out of an instance, put it in a separate file, and expected it to compile. But here's the way SQ said to do it:

   <!-- Standard tagsets needed for TLH work -- >
   <!ENTITY % TEI.corpus.dtd         'INCLUDE'>
   <!ENTITY % TEI.prose              'INCLUDE'>
   <!--ENTITY % TEI.verse              'INCLUDE' can't be used with prose-->
   <!ENTITY % TEI.transcr            'INCLUDE'>
   <!ENTITY % TEI.textcrit           'INCLUDE'>
   <!ENTITY % TEI.names.dates        'INCLUDE'>
   <!-- Extra tagset needed to allow documentation of tags in header -->
   <!--ENTITY % TEI.tagsets.dtd system 'teitsd2.dtd'>
   %TEI.tagsets.dtd;-->
   <!-- Standard character entities -->
   <!ENTITY % ISOlat1 system         "ISOLat1" 
        --"ISO 8879:1986//ENTITIES Added Latin 1//EN"-->
   %ISOlat1;
   <!ENTITY % ISOlat2 system         "ISOLat2" 
        --"ISO 8879:1986//ENTITIES Added Latin 2//EN"-->
   %ISOlat2;
   <!-- Local mods to supplement ISO char ents and rename tags -->
   <!ENTITY % CURIA.entities system  'curia.ent'>
   %CURIA.entities;

   <!ENTITY % TEI2.full.dtd "tei2.dtd">
   %TEI2.full.dtd;

These last two lines are what was supposed to do the trick, but RB still didn't like it, so I replaced the two lines with the entire contents of tei2.dtd and suddenly it seemed to work.

Not quite: during compilation it claimed it couldn't find any of the parameter entity files referenced in the above, even though they were in the same directory as the DTD it was compiling. Fortunately it's smart enough to put up a dialog box so you can tell it where they are (rather than dying, like so much software does). This is fixed by editing the configuration file rb.ini and putting a period and a semicolon in front of the relevant search paths, so that it searches the current directory first, before trying elsewhere and failing.

Incidentally, that still doesn't fix the failure to find isolat1.ent and isolat2.ent... I've obviously missed something, and I'm sure it must be easy to fix.

At the end of compilation it complained fatally that I had some elements referenced but not declared. Now, this is the source of some contention. First, you have to be very clear that we are talking about elements which your DTD mentions in a content model, but which are not declared because they come from modules which are not loaded in your version of the DTD (eg <camera>, because you might not be doing film stuff). We are not talking about elements which are declared but simply not used anywhere: RB already handles these latter via a checkbox on the BUILD panel.

Referencing elements which are not declared is actually permissible SGML (which surprised me when I first learned it). It is quite kosher; it's just not handled by RB. In some people's view this is a bug. IMHO it's probably politer to call it by the latest buzzword, a USI (Unexpected System Inability :-)

The simple way round it is to add declarations for all the `missing' elements, and set them either to EMPTY or ANY. This does have the disadvantage that they will then appear in the `allowed elements' menus when you do an `Insert Element' etc, but you can annotate them using the tag file mechanism that RB provides, so that users know not to use them.

RB doesn't write a log file listing the undeclared elements, so you have to copy them down from the error panel (which only mentions the first few), add them to the DTD, re-compile, note down the next batch mentioned, add them, re-compile, and repeat until you have them all. Windoze doesn't let you clip text from error panels :-)
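
The copy-down-and-recompile loop could in principle be shortcut with a small script. What follows is only a rough sketch in Python, not anything RB or SQ provide. The function names undeclared_elements and missing_declarations are made up for illustration, and the sketch assumes you have a copy of the DTD in which all parameter entities have already been expanded and ignored marked sections removed (the TEI DTD as distributed leans heavily on both, so this will not work on it directly). It also does not cope with comments inside declarations.

```python
import re
import textwrap

# Comment declarations, including the "-- >" form with space before ">".
COMMENT = re.compile(r"<!--.*?--\s*>", re.DOTALL)
# <!ELEMENT name-or-name-group  rest-of-declaration>
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+(\([^)]*\)|[^\s>]+)\s+([^>]*)>",
                          re.IGNORECASE)
# Tag minimization such as "- o" or "- -" at the start of the declaration body.
MINIMIZATION = re.compile(r"^\s*[-oO]\s+[-oO]\s+")
NAME = re.compile(r"#?[A-Za-z][A-Za-z0-9.\-]*")
RESERVED = {"empty", "any", "cdata", "rcdata"}


def names_in(text):
    """Return the lower-cased names in text, skipping #PCDATA and the like."""
    return {n.lower() for n in NAME.findall(text) if not n.startswith("#")}


def undeclared_elements(dtd_text):
    """List elements used in content models but never declared.

    Assumption: dtd_text has no unexpanded parameter entity references.
    """
    dtd_text = COMMENT.sub("", dtd_text)
    declared, referenced = set(), set()
    for target, rest in ELEMENT_DECL.findall(dtd_text):
        declared |= names_in(target)
        model = MINIMIZATION.sub("", rest)
        referenced |= names_in(model) - RESERVED
    return sorted(referenced - declared)


def missing_declarations(names, content="EMPTY", entity="missing.bits"):
    """Build one entity plus one ELEMENT declaration covering all names."""
    if content not in ("EMPTY", "ANY"):
        raise ValueError("content must be EMPTY or ANY")
    unique = list(dict.fromkeys(n.strip() for n in names if n.strip()))
    if not unique:
        raise ValueError("no element names given")
    # Wrap only at spaces so that no element name is split across lines.
    group = textwrap.fill(" | ".join(unique), width=60,
                          break_long_words=False, break_on_hyphens=False)
    minimization = "- o" if content == "EMPTY" else "- -"
    return (f'<!ENTITY % {entity} "{group}">\n'
            f"<!ELEMENT (%{entity};) {minimization} {content}>\n")
```

Passing the result of undeclared_elements() to missing_declarations() gives a block you can paste in ahead of the master DTD; whether EMPTY or ANY suits you better is the same choice as described above.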

So here's what I ended up with, picking up at the two lines referencing the `master' tei2.dtd file, which are now commented out because (as you'll recall) I replaced them with the contents of the file itself:

   <!--ENTITY % TEI2.full.dtd "tei2.dtd">
   %TEI2.full.dtd;-->

   <!ENTITY % missing.bits  "lang|oref|ovar|pref|pvar|link|xptr|xref|caesura|
                             anchor|c|cl|m|phr|s|seg|w|att|gi|tag|val|formula|
                             camera|caption|move|sound|tech|view|castlist|
                             figure|table|textdesc|etree|graph|tree|
                             particdesc|settingdesc|alt|altgrp|certainty|
                             flib|fs|fslib|fvlib|interp|interpgrp|join|
                             joingrp|linkgrp|respons|span|spangrp|timeline|
                             epilogue|performance|prologue|set">

   <!ELEMENT (%missing.bits;) - o EMPTY>

   <!-- tei2.dtd:  written by OddDTD 1994-09-09                  -->
   <!-- 3.6.1: File tei2.dtd: Main document type declaration     -->
   <!-- file                                                     -->

and then the rest of the DTD. Note the sequence is important:

   1. your module options first
   2. any private project additions
   3. any `missing' declarations
   4. then the `master' DTD
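
Part of the reason the order matters is that when SGML meets two declarations for the same entity, the first one wins, so your own settings have to be seen before the defaults in the master DTD. Here is a minimal sketch, in Python, of stitching the pieces together in that order. The function names and the file names options.dtd and missing.dtd are hypothetical, and this is just one way of doing it:

```python
from pathlib import Path


def assemble_dtd(parts):
    """Join DTD fragments in the order given, one blank line between them.

    Empty fragments are skipped; the caller is responsible for the order.
    """
    return "\n\n".join(p.strip("\n") for p in parts if p.strip()) + "\n"


def assemble_dtd_files(paths, out_path):
    """Read each fragment file in order and write the combined DTD."""
    text = assemble_dtd(Path(p).read_text() for p in paths)
    Path(out_path).write_text(text)
    return text
```

A call such as assemble_dtd_files(["options.dtd", "curia.ent", "missing.dtd", "tei2.dtd"], "combined.dtd") would then produce a single file for RB to compile, assuming those four files exist in the current directory.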

It works just fine now I've got the knack. It sounds long-winded but it's not, really, just a little different from other systems I've used.

> (BTW, I am primarily an OS/2 user, and although I have several
> Windows-based products, MS-published software is generally not
> welcome here. So the inexpensive MS Word SGML editor add-ons are
> not an option.)

I don't know if SQ do an OS/2 version of A/E. I haven't had a chance to test SGML Author for Word or WordPerfect SGML Edition with the TEI. Anyone done this yet?

///Peter