[Mirrored from: http://www.pitt.edu/~djbpitt/etp_ama.html]

University of Pittsburgh

University Library System (ULS)

Electronic Text Project (ETP)

American Medical Association DTD Development

Report by David J. Birnbaum
21 February 1995

Introduction

The ETP team adopted the Text Encoding Initiative's (TEI) teilite.dtd, a reduced TEI-conformant DTD, as its starting point for the etp_ama.dtd that we will be using to encode early volumes of the Journal of the American Medical Association. Our document analysis led us to conclude that two types of modification were required: we needed to add support for tables, and we needed to add new elements for use by analysts working with our materials.

Tables

We borrowed our table structure from the DTD being developed for HTML3, largely because we felt that this strategy would support the features we needed in a way that would simplify down-translation for HTML-based document serving. We modified the HTML3 DTD by retaining only the <tr> (table row), <th> (table header information), <td> (table data information), and <caption> elements, all of which we defined as having #PCDATA content. We deleted all presentation- or rendering-oriented attributes, retaining only "rowspan" and "columnspan," which indicate the number of rows and columns a particular cell spans.

Our modified table support is encoded in a separate html3tab.dtd file, which is invoked from our main etp_ama.dtd file. We retain some misgivings about the general table-encoding strategy, since it means that the contents of a particular cell can be identified only by walking the entire table from the upper left, but decided to adopt it because of our desire for easy translation to real HTML3.
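The misgiving about walking the table from the upper left can be illustrated with a short Python sketch. It is not part of the ETP tooling. The sample table and its contents are invented for illustration, the markup is written as well-formed XML so that the standard library can parse it, and the column attribute is assumed to use the HTML3 spelling "colspan" (the attribute names are parameters, so another spelling can be substituted).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample table; only the element names come from the report.
SAMPLE = """<table>
  <caption>Cases by year</caption>
  <tr><th rowspan="2">Year</th><th colspan="2">Cases</th></tr>
  <tr><td>Male</td><td>Female</td></tr>
  <tr><td>1890</td><td>12</td><td>9</td></tr>
</table>"""


def cell_grid(table, row_attr="rowspan", col_attr="colspan"):
    """Map every (row, column) position to the text of the cell covering it.

    A cell's column cannot be read off the markup directly: it depends on
    every spanning cell that precedes it, so the whole table is walked
    from the upper left.
    """
    grid = {}
    for r, row in enumerate(table.findall("tr")):
        c = 0
        for cell in row:
            if cell.tag not in ("th", "td"):
                continue
            # Skip positions already claimed by a cell spanning down
            # from an earlier row.
            while (r, c) in grid:
                c += 1
            rows = int(cell.get(row_attr, "1"))
            cols = int(cell.get(col_attr, "1"))
            text = (cell.text or "").strip()
            for dr in range(rows):
                for dc in range(cols):
                    grid[(r + dr, c + dc)] = text
            c += cols
    return grid


GRID = cell_grid(ET.fromstring(SAMPLE))
```

In this sample the "Male" cell is the first cell in the markup of its row, but it lands in the second column, because the "Year" header spans down from the row above. That dependency on earlier cells is the cost of the row-oriented encoding.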

Our <table> element is a global inclusion exception in our DTD. It does not inherit any of the TEI global attributes, or any of the new attributes that we define for our other new elements, below.

New Elements

We modified the original teilite.dtd by adding certain entity definitions at the beginning and certain element and attribute declarations at the end. In keeping with the general guidelines for modifying TEI DTD materials, we made no changes to the existing teilite.dtd declarations themselves. But because teilite.dtd is a single file, our additions had to be placed in that file, rather than in auxiliary .ent and .dtd files, as is preferred for regular TEI DTD modifications.

For legibility and ease of maintenance, we identified five classes (m-classes, in TEI terms) of new elements, representing places (social, medical, geographic, commercial), persons (medical, patient, [other] person), objects (animal, medicine, disease, apparatus), processes (nutrition, sanitation, treatment, diagnosis), and descriptions (housing conditions, economic conditions, environmental conditions).

In addition to the specific class members listed above, each class also contains a class-specific "postpone" element, so that our taggers can mark items whose general class membership is clear, but which do not fit cleanly into any of the specific elements we have defined for that class. Our DTD also contains a global <postpone> element, so that our taggers can identify items that they feel require tagging, but whose class membership is not clear. We will use these "postpone" tags as feedback for modifying the DTD during the early stages of the encoding process.
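As a rough illustration of how the "postpone" tags could feed back into DTD revision, the following Python sketch tallies postponed items by class. It is only a sketch under stated assumptions: the report names only the global <postpone> element, so the class-specific element names used here (placePostpone, objectPostpone) are hypothetical, as is the sample text, which is written as well-formed XML so that the standard library can parse it.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical tag names, except for the global "postpone" element.
POSTPONE_TAGS = {
    "postpone": "unclassified",
    "placePostpone": "places",
    "objectPostpone": "objects",
}

# Hypothetical sample text.
SAMPLE = (
    "<div><p>He lodged at the <placePostpone>almshouse annex</placePostpone> "
    "and used a <objectPostpone>galvanic belt</objectPostpone>; the "
    "<postpone>miasma</postpone> was blamed. A second "
    "<objectPostpone>inhaler</objectPostpone> was tried.</p></div>"
)


def tally_postponed(root):
    """Count postponed items per class across the whole document."""
    counts = Counter()
    for elem in root.iter():
        if elem.tag in POSTPONE_TAGS:
            counts[POSTPONE_TAGS[elem.tag]] += 1
    return counts


TALLY = tally_postponed(ET.fromstring(SAMPLE))
```

A class that accumulates many postponed items would be a candidate for a new specific element, while a large "unclassified" count might suggest that a whole class is missing.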

All of our new elements are defined as phrase-level in TEI terms, which means that they can occur only within chunks (paragraphs or paragraph-level elements), not between them (and thus not directly within a <div>, etc.). This is explained in greater detail in the TEI P3 documentation, section 3.7.3, p. 62. Our new elements are all declared as having mixed content consisting of #PCDATA and phrase-level elements (phrase.seq, in TEI terms), as is explained in P3, section 3.7.7, p. 68. They have the TEI a.global and a.seg attributes, as well as three new attributes: number ("one" or "many," defaulting to "one"), authform (for the authoritative form of an entry, according to MeSH or another authoritative name source), and authority (to identify the source of the authform value). All have "seg" as the value of their TEIform attribute.
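The three new attributes can be illustrated with a minimal Python sketch of how an analyst's script might read them. The element names <disease> and <medicine>, the sample sentence, and the attribute values are all hypothetical. The fragment is written as well-formed XML so that the standard library can parse it. Because that parser does not read the DTD, the default value "one" for number is supplied in the code rather than by the parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagged paragraph; element names and values are invented.
FRAGMENT = (
    '<p>The patient was treated for '
    '<disease authform="Typhoid Fever" authority="MeSH">enteric fever</disease> '
    'with <medicine number="many">tonics</medicine>.</p>'
)


def describe(elem):
    """Collect the new attributes of one phrase-level element."""
    return {
        "tag": elem.tag,
        "text": "".join(elem.itertext()),
        # The DTD default is "one"; a non-validating parser will not
        # supply it, so it is applied here.
        "number": elem.get("number", "one"),
        "authform": elem.get("authform"),
        "authority": elem.get("authority"),
    }


# Iterating over the paragraph yields its phrase-level children.
RECORDS = [describe(child) for child in ET.fromstring(FRAGMENT)]
```

The first record pairs the text as printed ("enteric fever") with its authoritative form and the source of that form, which is the kind of lookup the authform and authority attributes are meant to support.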