[See canonical source at http://cdli.ucla.edu/methods_dtd.html or http://cdli.mpiwg-berlin.mpg.de/methods_dtd.html.]

The following is the current document type definition for the CDLI and PSD projects.
Please contact cdli@ucla.edu with comments or questions.


Preliminary (but probably pretty close to final) DTD for CDLI XML archival text formats. Developed as part of the March 2001 meeting at UCLA (the Kinsey Accord) with input from the participants of that meeting.

This version, by Steve Tinney 04/03/01, for PSD and CDLI projects.

v1.0. Placed in the Public Domain.


This DTD defines an object type which requires its own XML namespace. CDLI objects should have namespaces rooted in a common absolute URL, followed by a type, optionally followed by a version (the versioning is an emerging and still-discussed practice; it is not an indicator of minor change, but a buffer against future incompatible changes).
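As a rough illustration of the root/type/version scheme, the following Python sketch assembles a namespace URI and shows how a namespace-aware parser reports an element in it. The helper name cdli_namespace is hypothetical and not part of any CDLI tooling; the root URL and the version 1 are taken from the xmlns value this DTD fixes for texts.

```python
import xml.etree.ElementTree as ET

# Common absolute URL root; taken from the namespace this DTD fixes for texts.
CDLI_ROOT = "http://cdli.ucla.edu"


def cdli_namespace(object_type, version=None):
    """Return root/type, with /version appended when a version is given.

    Hypothetical helper, for illustration only.
    """
    parts = [CDLI_ROOT, object_type]
    if version is not None:
        parts.append(str(version))
    return "/".join(parts)


TEXT_NS = cdli_namespace("text", 1)

# ElementTree reports namespaced element names as "{namespace}local-name".
root = ET.fromstring('<texts xmlns="%s"/>' % TEXT_NS)
qualified_name = root.tag
```

Because the version is part of the namespace, a consumer written against one version would simply not match elements from an incompatible later version, which is the buffering effect described above.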

<!ENTITY % cdli-text-ns
"xmlns CDATA #FIXED 'http://cdli.ucla.edu/text/1'">

A convenience wrapper for having files with more than one text; this practice probably has to be deprecated when we get to the point of defining the structural map of the digital object metadata (because we will have an easier time if we can derive the URL of the transliteration and images directly from the text ID rather than having to look up which file the text is in).

<!ELEMENT texts (text)*>

TEXT is the basic data object type in the CDLI corpus. A TEXT consists of a series of one or more objects, and zero or more associated seals. *Properly*, seals should be handled in an external dataset, but as discussed in the March 2001 meeting (the Kinsey Accord), it is *practically* preferable to enable inline editing of seals during the preparation of the corpus.

@N: the museum number, publication reference or other common way of referring to the text. This is the one that humans use.

Constraint: id must be unique among the entire CDLI corpus. This is the relational ID that can be used for catalogue lookups also.
<!ELEMENT text (object+ , seal*) >
<!ATTLIST text

TEXTs contain OBJECTS; normally, multiple OBJECTS will be of different types, as with tablets and envelopes.

If a TEXT contains multiple OBJECTs of the same type, each OBJECT is a fragment of the original whole OBJECT of that type.

@N: The name of the object. This is often the same as the name of the parent TEXT element, in which case it is unnecessary. It provides for a TEXT being composed of fragments or envelopes with different names from the TEXT name.

Constraint: id must be unique among the entire CDLI corpus.
<!ELEMENT object (surface)+ >
<!ATTLIST object
type (tablet|envelope|prism|tag) "tablet"
certain (y|n) "y">
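The fragment rule can be checked mechanically. The sketch below, in Python, counts OBJECTs by type within one TEXT and reports the types that occur more than once. The sample document and the function name fragment_types are invented for illustration; namespace and id attributes are omitted to keep the sample short. Note that a non-validating parser does not supply the DTD default for type, so the code applies "tablet" itself.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample: two tablet objects (one relying on the default type)
# and one envelope.
SAMPLE = """
<text n="Hypothetical 1">
  <object type="tablet">
    <surface type="obverse"><column n="0"><l n="1"/></column></surface>
  </object>
  <object>
    <surface type="reverse"><column n="0"><l n="1"/></column></surface>
  </object>
  <object type="envelope">
    <surface type="obverse"><column n="0"><l n="1"/></column></surface>
  </object>
</text>
"""


def fragment_types(text_el):
    """Return the object types that occur more than once in a text.

    ElementTree does not read DTD defaults, so a missing type attribute
    is treated as "tablet", matching the ATTLIST default.
    """
    counts = Counter(obj.get("type", "tablet") for obj in text_el.findall("object"))
    return sorted(t for t, c in counts.items() if c > 1)


text_el = ET.fromstring(SAMPLE)
types_with_fragments = fragment_types(text_el)
```

In this sample the two tablet objects would be read as fragments of one original tablet, while the envelope stands alone.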

A seal is a text, and as such has columnar and non-columnar data. Because the situation of seals in the final (probably post-1.0) CDLI corpus is likely to change, we accommodate three vectors of association:

@ID: the cdli-global-unique ID of this edition of a seal text.

@N: the text-local-unique ID of this edition of a seal text (e.g., 1, 2, 3). The goal of this ID is to provide a target for seal location designators elsewhere on the OBJECTs, so the intent is that this be a simple identifier. It should be easy for a transliterator to say something like:

$ seal 1


@ Seal 1
1. da-da
2. arad-zu

This encodes the fact that `seal 1' occurs somewhere on the object and then also describes what `seal 1' says.

@SCID: the Seal-Corpus ID; this is really an IDREF, but to facilitate parsing it is defined here as a NMTOKEN. The assumption is that within the seal-corpus every seal will have an ID, and this will provide the facility for cross referencing from specific transliterations to the instance of the seal in the seal-corpus.
<!ELEMENT seal (column|noncolumn)+ >
<!ATTLIST seal

<!-- Surface is a physical area on which text is laid out. The provision for multiple seals and faces (for prisms) is weak, but probably adequate (the expectation is that it will be extended as necessary when the present limits are exceeded).

Constraint: id must be unique among the entire CDLI corpus.
<!ELEMENT surface (column|noncolumn)+ >
<!ATTLIST surface
type (obverse | reverse | left | right | top | bot | face) #REQUIRED
face (a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t) #IMPLIED
certain (y|n) "y">

Columns are wrapped in COLUMN; there must always be at least one column, even in a single-column text.

@N: the 'name' of the column as presented for display, unless n="0", in which case it is a redundant wrapper column on a single-column text.

Constraint: id must be unique among the entire CDLI corpus.
<!ELEMENT column (l|nonl)+ >
<!ATTLIST column
certain (y|n) "y">

Lines are wrapped in L.
@N: the 'name' of the line as presented for display

Constraint: id must be unique among the entire CDLI corpus.
<!ELEMENT l (nong|n|w|g|cg|gg|igg)*>
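The corpus-wide uniqueness constraint on id goes beyond what a validating parser checks, since XML ID uniqueness applies only within a single document. A minimal Python sketch of a cross-file check follows; the ids shown (T1, T1.l1 and so on) are invented, and the attribute is assumed to be spelled id in lowercase.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_ids(roots):
    """Return ids that occur more than once across all given element trees."""
    counts = Counter(
        el.get("id")
        for root in roots
        for el in root.iter()
        if el.get("id") is not None
    )
    return sorted(i for i, c in counts.items() if c > 1)


# Two hypothetical files; the second reuses the line id "T1.l1" by mistake.
file_a = ET.fromstring(
    '<text id="T1"><object id="T1.o1"><surface id="T1.s1" type="obverse">'
    '<column id="T1.c1" n="0"><l id="T1.l1" n="1"/></column>'
    "</surface></object></text>"
)
file_b = ET.fromstring(
    '<text id="T2"><object id="T2.o1"><surface id="T2.s1" type="obverse">'
    '<column id="T2.c1" n="0"><l id="T1.l1" n="1"/></column>'
    "</surface></object></text>"
)

clashes = duplicate_ids([file_a, file_b])
```

Each file is valid on its own here; the clash only becomes visible when the ids of both are collected together.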

Words are wrapped in W.
<!ELEMENT w (g|cg|gg|igg)*>
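To show how the L/W/G nesting maps back to a conventional transliteration line, here is a small Python sketch. It assumes that a G element carries its grapheme as text content, that graphemes within a word are joined with hyphens, and that words are separated by spaces; none of these display conventions is fixed by this DTD. Only W and G are handled; N, CG, GG, IGG and NONG are ignored for brevity.

```python
import xml.etree.ElementTree as ET


def render_word(w):
    """Join the graphemes of a word with hyphens (assumed display convention)."""
    return "-".join(g.text or "" for g in w.findall("g"))


def render_line(l):
    """Join the words of a line with spaces (assumed display convention)."""
    return " ".join(render_word(w) for w in l.findall("w"))


# Hypothetical line built from the graphemes used in the seal example above.
line = ET.fromstring(
    '<l n="1">'
    "<w><g>da</g><g>da</g></w>"
    "<w><g>arad</g><g>zu</g></w>"
    "</l>"
)
rendered = render_line(line)
```

The point of the sketch is that word and grapheme boundaries are carried by the markup, so separators are a presentation decision rather than part of the data.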

Numbers are wrapped in N. The content is the word or grapheme sequence constituting the number; attributes give information on the numeric system and the value for which the sequence is taken to stand.

<!-- add systems here as desired -->
<!ENTITY % systems "(
| capacity
| unidentified
)" >

<!ELEMENT n (w|g|cg|gg|igg)*>
system %systems; "unidentified" >

<!-- IGG = Interpretive Grapheme Group; a mechanism for inline presentation of both an interpretive version of what the graphemes on the tablet were supposed to be, and the literal sequence occurring on the tablet.

By definition, the first child of the group is the interpretation; the second child is the literal grapheme sequence on the object.
<!ELEMENT igg ((g|cg|gg),(g|cg|gg))>
type (ordering|correction|explanation) #REQUIRED >
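Since the meaning of an IGG depends purely on child order, a consumer can pick either reading by position. The following Python sketch does so and also checks the type attribute against the three values enumerated above. The function name and the sample graphemes are invented, and G is assumed to carry its grapheme as text content.

```python
import xml.etree.ElementTree as ET

IGG_TYPES = {"ordering", "correction", "explanation"}


def igg_readings(igg):
    """Return (interpretation, literal) for an igg element.

    By definition the first child is the interpretation and the second
    child is the literal grapheme sequence on the object.
    """
    if igg.get("type") not in IGG_TYPES:
        raise ValueError("igg type must be one of %s" % sorted(IGG_TYPES))
    children = list(igg)
    if len(children) != 2:
        raise ValueError("igg must have exactly two children")
    return children[0], children[1]


# Hypothetical correction: the editor reads "ki" where the object has "di".
sample = ET.fromstring('<igg type="correction"><g>ki</g><g>di</g></igg>')
interpretation, literal = igg_readings(sample)
```

A display layer could then choose to show only the interpretation, only the literal sequence, or both.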

GG (grapheme group) is exclusively a scoping mechanism for treating several graphemes as a single unit.
<!ELEMENT gg ((g|cg),(g|cg)+)>

Graphemes are wrapped in G.

** Grapheme regexps to be added. **

Compound graphemes are wrapped in CG.

** Content model to be added. **

The grapheme attribute definitions were made with an underlying assumption that CDLI transliterations would be as simple as possible for manipulation as data, and that wherever possible editorial commentary and squeamishness should be reserved to a commentary file.

The commentary is not expected to be machine-manipulated, beyond the assumption that commentary entries will reference the ID at the L level, such that an HTML version of the corpus could include machine-generated links back from the lines to the commentary.

Items to be removed/reserved to the commentary include:

palimpsest writings
alternate possible identifications; e.g., ki/di
alternate readings; e.g., gin/du
explanatory addition of sign name; e.g., mu4(TUG2)


Defines whether the grapheme content is a sign-value, a sign-name or a reference to an entry in a sign list. Grapheme readings and sign-names are not differentiated by use of lowercase and uppercase, but instead by use of @TYPE.


Allows qualification of whether graphemes are glosses or not. No distinction is made between types of gloss. Glosses are simply characterized as pregloss (i.e., occurring before what they gloss) or postgloss (i.e., occurring after what they gloss).
nametype (signref|listref) #IMPLIED
breakage (damaged|missing) #IMPLIED
sign (unusual.form|really.is|ed.emended|ed.removed|ed.supplied |scribe.implied) #IMPLIED
uncertain (y) #IMPLIED
collated (y) #IMPLIED
gloss (pre|post) #IMPLIED

All the non-x types (noncolumn, nonl (non-line) and nong (non-grapheme)) share a common set of attributes and content model.

The content model, PCDATA, is intended purely for the preservation of the verbatim text of comments in legacy data.

break, gap and ruling: self-explanatory
traces: for extents with signs or traces which have not been transliterated as data
image: for drawings included by the scribe
seal: for seals

self: derived from object-oriented programming practice;
'self' indicates that the extent is given in units of the type of the element on which the attribute occurs: for noncolumn, self means 'column(s)'; for nonl, self means 'line(s)'; for nong, self means 'grapheme(s)'.

Co-constraint notes (these cannot be expressed in the DTD):

For type=image, UNIT may be 'self'. If type=image and unit=quantity, the extent indicates the amount of 'self' which is covered with the image. Regardless of the value of UNIT, the REF attribute may be used when type=image to give a URL which shows the image.

For type=seal, UNIT is always 'self'; the REF attribute is always used. The reference is either to a text-local seal-transliteration or to the seal corpus entry of which the instance seal is an exemplar.
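Because these co-constraints cannot be expressed in the DTD, they have to be checked by a separate program. The Python sketch below covers only the type=seal rule, as one possible approach. The function name check_non_x is invented, and the attribute names type, unit and ref are assumed to be lowercase.

```python
import xml.etree.ElementTree as ET

# What 'self' stands for on each non-x element, as described above.
SELF_UNITS = {"noncolumn": "column(s)", "nonl": "line(s)", "nong": "grapheme(s)"}


def check_non_x(el):
    """Return a list of co-constraint violations for one non-x element."""
    if el.tag not in SELF_UNITS:
        raise ValueError("not a non-x element: %s" % el.tag)
    problems = []
    if el.get("type") == "seal":
        # For seals, UNIT is always 'self' and REF is always used.
        if el.get("unit") != "self":
            problems.append("type=seal requires unit=self")
        if el.get("ref") is None:
            problems.append("type=seal requires ref")
    return problems


good = ET.fromstring('<nonl type="seal" unit="self" ref="1"/>')
bad = ET.fromstring('<nonl type="seal" unit="quantity"/>')

good_problems = check_non_x(good)
bad_problems = check_non_x(bad)
```

The type=image notes are permissive ("may be") rather than mandatory, so the sketch does not turn them into errors.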

quantity: indicates that the extent is given as a quantity

A reference. For seals, the REF should give the local ID of the seal whose occurrence is being noted in the non-x element.
For images, the REF should be a URL (note: this is technically never necessary for CDLI; an exception could be a situation in which a specific file contains a shot of the image which is better than, or more specifically targeted than, the images which give shots of the tablets).
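For the seal case, resolving a REF amounts to finding the SEAL in the same TEXT whose text-local @N matches. The Python sketch below illustrates this under the assumption that the attributes are spelled ref and n; the sample text, its name and the function name resolve_seal are invented, and id attributes are omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical text: a seal impression noted on the obverse, plus the
# transliteration of that seal.
SAMPLE = """
<text n="Hypothetical 1">
  <object>
    <surface type="obverse">
      <column n="0">
        <nonl type="seal" unit="self" ref="1"/>
      </column>
    </surface>
  </object>
  <seal n="1">
    <column n="0">
      <l n="1"><w><g>da</g><g>da</g></w></l>
      <l n="2"><w><g>arad</g><g>zu</g></w></l>
    </column>
  </seal>
</text>
"""


def resolve_seal(text_el, non_x_el):
    """Return the seal in text_el whose n matches the ref of non_x_el, or None."""
    ref = non_x_el.get("ref")
    for seal in text_el.findall("seal"):
        if seal.get("n") == ref:
            return seal
    return None


text_el = ET.fromstring(SAMPLE)
marker = next(text_el.iter("nonl"))
seal = resolve_seal(text_el, marker)
```

A REF that points into the seal corpus instead of at a text-local seal would need a different lookup, which this sketch does not attempt.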

@EXTENT: the extent of the non-x material. Should match

i.e., it may be a number, or a measurement in mm or cm.

<!ELEMENT noncolumn (#PCDATA)>

<!ENTITY % non-x-attr-set "
type (broken|traces|gap|ruling|image|seal) #REQUIRED
unit (self|quantity|ref) #IMPLIED

<!ATTLIST noncolumn %non-x-attr-set; >
<!ATTLIST nonl %non-x-attr-set; >
<!ATTLIST nong %non-x-attr-set; >