[Archive copy mirrored from: http://users.ox.ac.uk/~lou/papers/XR/, which should be regarded the canonical document]

TEI Extended Pointers: a brief tutorial

by Lou Burnard

This is a background document prepared originally for the information of the W3C XML working group. It provides an informal introduction to the methods proposed in the Text Encoding Initiative's published Guidelines (TEI P3) for the representation of inter- and intra-document links. Authoritative and complete information is provided in that work; in case of conflict between this document and TEI P3, this document is wrong.

Linking mechanisms are explicitly defined for several different elements in the Guidelines: these include independent links (<link>), alternative readings (<alt>), aggregations (<join>) and conventional footnotes (<note>), as well as the generic pointer elements discussed here. A brief tutorial introduction is given below. For completeness, this is followed by a brief discussion of some of the other linking mechanisms.


1 Extended Pointers

The TEI scheme defines two generic pointer elements which support both inter- and intra-document linkage: <xptr> and <xref>. The only difference between them is that the former is empty, while the latter can contain phrase-level elements or PCDATA. The content of an <xref> is typically a string indicating how the link is to be rendered at the source end.

These elements share the attributes doc, from and to, which are used to specify the target of the cross reference or link.

An <xptr> (or <xref>) may point to the whole of some other entity simply by supplying its name as the value of the doc attribute:

<!ENTITY TEIP3 SYSTEM "http://elib.virginia.edu/TEI/">
<!-- ... -->
see <xref doc=TEIP3>The TEI Guidelines, passim</xref>

This example assumes that some system or public entity with the name TEIP3 has been declared.

The from attribute is used to specify a location within the document specified by the doc attribute, using the TEI extended pointer syntax. In this language, locations are defined as a series of steps, each one identifying some part of the document, often with respect to locations identified in a previous step. For example, you would point to the third sentence of the fourth paragraph of chapter two by selecting chapter two in the first step, the fourth paragraph in the second step, and the third sentence in the last step. A step can be defined in terms of the SGML tree (using such keywords as parent, descendant, preceding, etc.) or, more loosely, in terms of text patterns, or word or character positions. You can also use a foreign (non-SGML) notation, or specify a location within a graphic in terms of its co-ordinate system.

The from and to attributes use the same notation. Each points to a location within the target document; the target of the extended pointer is the whole sequence beginning at the start of the location indicated by the from attribute, and running to the end of the location indicated by the to attribute.
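
The way a location ladder breaks down into steps can be sketched in a few lines of Python. This is an illustration only, not part of TEI P3 or of any TEI software: the function name parse_ladder and the regular expressions are assumptions made for this sketch, and it handles only the simple shape used in this tutorial's examples (a keyword followed by zero or more parenthesized lists).

```python
import re

# One step: a keyword followed by zero or more parenthesized argument lists.
# (Hypothetical helper patterns, written for this sketch only.)
_STEP = re.compile(r"([A-Za-z]+)\s*((?:\([^()]*\)\s*)*)")
_ARGS = re.compile(r"\(([^()]*)\)")


def parse_ladder(ladder):
    """Split a location ladder into (KEYWORD, [argument lists]) pairs."""
    steps = []
    for keyword, arg_text in _STEP.findall(ladder):
        # Each parenthesized list becomes a list of whitespace-separated items.
        args = [group.split() for group in _ARGS.findall(arg_text)]
        steps.append((keyword.upper(), args))
    return steps


steps = parse_ladder("id (SA) child (3 p)")
```

Here steps holds one pair for the id step and one for the child step, in the order in which they would be applied.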

The first step in a location path will often be to specify the identifier of some element within the target document, as in this example:

<xptr doc=TEIP3 from='id (SA)'>
This selects the whole of whatever element bears the identifier SA within the entity TEIP3. If a finer-grained target is required, other steps might follow. The following keywords are available for you to specify other locations in terms of their relationship to this one:

In the above definitions and elsewhere, `preceding' and synonymous terms are to be understood as implying elements which would be encountered earlier when the document is processed correctly from beginning to end. The term pseudo-element is used for any string of PCDATA content occurring between SGML tags, which is not itself a complete SGML element, but forms part of one.

The <p> element in the following example
<p>See <xref doc=TEIP3>The TEI Guidelines, passim</xref>
for a full discussion</p>
has three children: the second is an element (the <xref>) while the first and last are pseudo-elements (the pieces of content data containing the words `See ' and `for a full discussion' respectively).

Each of the above keywords implies a particular set of elements or pseudo-elements (the set of children, the set of ancestors, the set of previous siblings, etc.); to specify which of them you are pointing at, the keyword may optionally be followed by a parenthesized list containing:

Continuing the above example, the following reference will select the third <p> element directly contained by whatever element has the identifier SA:

<xptr doc=TEIP3 from='id (SA) child (3 p)'>

Note the difference between this and

<xptr doc=TEIP3 from='id (SA) child (3)'>
which selects the third child of the element bearing the identifier SA, whatever it may be. If entity TEIP3 contained the following text:
<div id=SA><head>Linking and Alignment</head>
<p id=Para1>Text of paragraph 1. </>
<p id=Para2>Text of paragraph <num>2</num>, which is rather short.</p>
<p id=Para3>Text of paragraph <num>3</num>, which is also rather short.</p></div>
the above <xptr> would reference the second paragraph above, because of the <head> element which is also a child element. Similarly, the following <xptr>
<xptr doc=TEIP3 from='id (Para3) child (3)'>
points to the pseudo-element `, which is also rather short.' within the element with identifier Para3.
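
The counting of children, pseudo-elements included, can be tried out with a short Python sketch. It is only an approximation under stated assumptions: the sample text has been rewritten as well-formed XML (quoted attributes, full end tags) so that the standard xml.etree.ElementTree module can read it, whitespace-only text between elements is not counted as a pseudo-element, and the function names children and child are invented for this illustration.

```python
import xml.etree.ElementTree as ET


def children(element):
    """List children in document order: elements and non-blank text runs."""
    result = []
    if element.text and element.text.strip():
        result.append(element.text)          # leading pseudo-element
    for sub in element:
        result.append(sub)                   # a real element
        if sub.tail and sub.tail.strip():
            result.append(sub.tail)          # pseudo-element after it
    return result


def child(element, n, name=None):
    """Return the n-th child (counting from 1), optionally of a given type."""
    candidates = children(element)
    if name is not None:
        candidates = [c for c in candidates
                      if not isinstance(c, str) and c.tag == name]
    return candidates[n - 1]


# XML rendering of the sample text above (an assumption of this sketch).
SAMPLE = """<div id="SA"><head>Linking and Alignment</head>
<p id="Para1">Text of paragraph 1. </p>
<p id="Para2">Text of paragraph <num>2</num>, which is rather short.</p>
<p id="Para3">Text of paragraph <num>3</num>, which is also rather short.</p></div>"""

div = ET.fromstring(SAMPLE)
third_p = child(div, 3, "p")      # like child (3 p): the element Para3
third_child = child(div, 3)       # like child (3): Para2, since <head> counts
pseudo = child(third_p, 3)        # third child of Para3: a pseudo-element
```

The point of the design is that a text run and an element are both counted when no element type is given, but only elements of the named type are counted when one is.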

Steps (the rungs of a location ladder) of the same or different kinds can be combined as required. Assuming for example that the entity TEIP3 is in fact a reference to the SGML form of the TEI Guidelines, then the following reference will select section 14.2.2 of that publication in which (as it happens) the extended pointer syntax is formally defined:

For full details, see
<xref doc=TEIP3 from='id (SA) child (2 div2) child (2 div3)'>
  TEI Extended pointer syntax definition</xref>

Complex specifications are easily built using this syntax. For example, the following reference will select the most recent <head> element which carries an attribute lang with the value LAT, and which occurs before the start of the element with identifier SA:

<xptr doc=TEIP3 from='id (SA) preceding (1 head lang lat)'>

You can define the target of a link with respect to the location of the link itself, rather than with respect to the root of the document, by using the keyword HERE. For example,

<xptr from="HERE ancestor(1)"> 
points to the parent of the element within which it appears;
<xptr from="HERE ancestor(2)"> 
points to the grandparent of the element within which it appears, and so on. As this example also shows, when no value is supplied for the doc attribute, the current document is assumed. The HERE keyword makes no sense except as the first rung of a location ladder.

2 Locating parts of element content

The TEI extended pointer syntax is most reliably used to locate particular SGML element (or pseudo-element) occurrences. In the TEI scheme any SGML element can bear an ID attribute, which (together with the tree location methods described above) means that this is less of a restriction than it might appear.

Sometimes however the target of a cross reference does not correspond with any particular feature of a text, and so may not be tagged as an element, or its position within the SGML document tree is not reliably known. If the desired target is simply a point in the current document, the easiest way to mark it is by introducing an <anchor> element at the appropriate spot. If the target is some sequence of words not otherwise tagged, the <seg> element may be introduced to mark them.

In the following (imaginary) example, <xref> elements have been used to represent points in this text which are to be linked in some way to other parts of it; in the first case to a point, and in the second, to a sequence of words:

Returning to <xref from="id(ABCD)">the point where I dozed off</xref>, I noticed that <xref from="id(EFGH)"> three words</xref> had been circled in red by a previous reader

This encoding requires that elements with the specified identifiers (ABCD and EFGH in this example) are to be found somewhere else in the current document. Assuming that no element already exists to carry these identifiers, the <anchor> and <seg> elements might be used:

  .... <anchor type=bookmark id='ABCD'> ....
   ....<seg type=target id='EFGH'> ... </seg> ...

An alternative approach, useful when identifiers or other markup cannot be introduced into the target document, is to use the string, token, or pattern location methods provided in the TEI extended pointer syntax by the keywords str, token, and pattern.

These three methods should not be used to count across element boundaries: they are provided chiefly to locate fine detail within a given document element, where such points are not already explicitly marked up. The token and str methods are defined as behaving in exactly the same way as the HyTime dataloc method, with quanta token and str respectively. The syntax used to define pattern locations is (yet another) subset of the regular expression syntax used by most Unix systems.

Some examples follow:

<p>This <xptr from="HERE token(3 5)">is not a very good idea.
selects the three tokens `a very good'.
<p>This <xptr from="HERE str(3 5)">is not a very good idea.
selects the string ` no' (i.e. space, n, o).
<p>This <xptr from="HERE pattern([aeiou][aeiou])">is not a very good idea.
selects the first pair of adjacent vowels following the pointer, i.e. the string `oo' in `good'.
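
The three examples above can be checked with a rough Python sketch. It is not an implementation of the HyTime dataloc method or of the TEI pattern syntax: Python's re module stands in for the TEI regular expression subset, tokens are assumed to be whitespace-delimited, and the function names are invented for this illustration.

```python
import re


def select_str(text, first, last):
    """Characters first..last of text, counted from 1, inclusive."""
    return text[first - 1:last]


def select_token(text, first, last):
    """Whitespace-delimited tokens first..last, counted from 1, inclusive."""
    return text.split()[first - 1:last]


def select_pattern(text, pattern):
    """First substring of text matching pattern, or None if there is none."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


# The content that follows the pointer in each of the examples above.
content = "is not a very good idea."

tokens = select_token(content, 3, 5)
chars = select_str(content, 3, 5)
vowels = select_pattern(content, "[aeiou][aeiou]")
```

All three functions work on the character data of a single element, in keeping with the advice that these methods should not be used to count across element boundaries.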

Thus, assuming that the three words circled in red in the example above occurred at the start of the third paragraph in the chapter with identifier `C5', a pointer like the following would point to them:

I noticed that <xref from="id(C5) child(3 p) token(1 3)"> three words</xref> had been circled in red by a previous reader

3 Locating sequences of element content

Often, the scope of a cross reference will be adequately defined by the from attribute. For some documents, however, it may be necessary to define both a starting and an ending location: if for example the target crosses SGML element boundaries, or involves several elements. As noted above, the to attribute is provided for this purpose. For example,

  <xptr doc=P1 from='id(xyz)' to='id(abc)'>
is an extended pointer whose target is the part of the document P1 starting at the beginning of whatever element has identifier XYZ and ending at the end of whatever element in the same document has identifier ABC. Any elements in between are also included, irrespective of structure; the pointer is erroneous if the end of ABC precedes the start of XYZ.

The keyword DITTO can be used to simplify the specification of a location ladder for a to attribute which differs only slightly from that already supplied in the from attribute. For example:

<xptr from='id(xyz) ancestor(1 div) pattern(Hegel)' to='ditto pattern(Marx)'>
will find the sequence starting with the first occurrence of the string `Hegel' in the nearest <div> element that is an ancestor of the element with the identifier xyz, and ending with the first occurrence of the string `Marx' that follows it within the same element.

4 Other location methods

Three other location methods are defined in the TEI extended pointer syntax, specifically for non-SGML data and for HyTime conformant data. The SPACE keyword may be used to address locations in space; the FOREIGN keyword may be used to address locations in terms of some external notation not defined by the Guidelines; the HYQ keyword may be used to supply a HyTime query expression. The FOREIGN and HYQ keywords are not further described here. (The latter is obsolete, while the former is largely of documentary use only.)

The SPACE keyword takes either two or three parameters. The first is a name for the co-ordinate system in use: it will typically be a string like `2D' (two dimensional) or `3D' (three dimensional). The second and third parameters each consist of a list of numerical values giving co-ordinate values as measured along each dimension of a Cartesian space with all axes orthogonal. The number of values in each list is equal to the number of axes (usually 2 or 3). If only the second parameter is given, the location indicated is a point in the co-ordinate system. For example:

<xptr from='SPACE (2d) (0 0)'>
indicates the origin of a two dimensional space.

If the third parameter is present, the location indicated is the rectangular prism defined by treating corresponding items from the two lists as inclusive bounds along each dimension in turn. For example

<xptr from='SPACE (2d) (0 0) (1 1)'>
indicates a unit square with one corner at the origin of a two dimensional space.
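
The difference between the two-parameter and three-parameter forms can be illustrated with a small Python sketch. The function names space_location and contains, and the dictionary layout they use, are assumptions made for this illustration; the co-ordinate system name is left out because it plays no part in the arithmetic.

```python
def space_location(first, second=None):
    """Interpret the co-ordinate lists that follow the system name."""
    if second is None:
        # Only one list: a single point.
        return {"kind": "point", "at": tuple(first)}
    if len(first) != len(second):
        raise ValueError("both lists need one value per axis")
    # Two lists: corresponding items are inclusive bounds on each axis.
    bounds = [(min(a, b), max(a, b)) for a, b in zip(first, second)]
    return {"kind": "region", "bounds": bounds}


def contains(location, point):
    """True if point lies at, or within the inclusive bounds of, location."""
    if location["kind"] == "point":
        return tuple(point) == location["at"]
    return all(low <= x <= high
               for (low, high), x in zip(location["bounds"], point))


origin = space_location([0, 0])            # like SPACE (2d) (0 0)
square = space_location([0, 0], [1, 1])    # like SPACE (2d) (0 0) (1 1)
```

Because the bounds are inclusive, points on the edge of the square, including the origin itself, count as inside it.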

5 Attributes for linking elements

All TEI linking elements carry a number of general purpose attributes, listed here for completeness:

The targType attribute may be used to add an additional semantic constraint to the linkage, by requiring that the GI (generic identifier, i.e. the element name) of the element pointed to match one of the names supplied, as in the following example:

this is discussed in <xref from="id(dspec)" targType='div1 p'>the section on links</xref>
This reference should fail if the element with identifier dspec is not either a <div1> or a <p>. This constraint is not enforceable by SGML parsers, of course, but a TEI-aware application may choose to enforce it.

The type attribute can be used to categorize the link represented by the pointer in any convenient way. The resp and crDate attributes may also be used to represent the person or agency responsible for making the link, and its date of creation, as in the following example:

   this is discussed in
   <xref type=navigator
    resp=auto crdate=950521 
    targtype='div1 div2'>
   the section on links</xref>

The evaluate attribute can take values all, one or none. Its purpose is to specify the intended significance of a link which points to a pointer. With evaluate=all, if the element pointed at is itself a pointer, then the target of that pointer will be taken, and so on, until an element is found which is not a pointer. With evaluate=one, this evaluation process is carried out once only; with evaluate=none, it is not carried out at all.
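
The three values of the evaluate attribute can be modelled with a short Python sketch. The data structure is an assumption made for this illustration: a dictionary maps the identifier of each pointer to the identifier of whatever it points at, and any identifier that is not a key stands for an element which is not a pointer. The function name resolve and the check for circular chains are likewise not taken from the Guidelines.

```python
def resolve(target, pointers, evaluate="all"):
    """Follow pointer-to-pointer links as directed by evaluate."""
    if evaluate == "none":
        return target                        # take the target as it stands
    if evaluate == "one":
        return pointers.get(target, target)  # follow at most one pointer
    if evaluate != "all":
        raise ValueError("evaluate must be all, one or none")
    seen = set()
    while target in pointers:                # keep going until a non-pointer
        if target in seen:
            raise ValueError("circular chain of pointers")
        seen.add(target)
        target = pointers[target]
    return target


# Hypothetical identifiers: P1 points to P2, which points to element E1.
pointers = {"P1": "P2", "P2": "E1"}
final = resolve("P1", pointers, "all")
once = resolve("P1", pointers, "one")
untouched = resolve("P1", pointers, "none")
```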

6 Special purpose linking attributes

A number of special purpose linking attributes may be defined for every element in the TEI Lite DTD:

All of these attributes have declared values of either IDREF or IDREFS. If an <xptr> is required to carry one of the semantic roles listed above, it should be given an identifier which can then be specified as the value for the attribute concerned. For example, a linguistic analysis of the sentence ``John loves Mary'' might be encoded as follows:

<seg type=sentence ana=SVO>
  <seg type=lex ana=NP1>John</seg>
  <seg type=lex ana=VVI>loves</seg>
  <seg type=lex ana=NP1>Mary</seg>
</seg>
This encoding implies the existence elsewhere in the document of elements with identifiers SVO, NP1, and VVI, where the significance of these particular codes is explained. Such elements might in fact be references to elements in some other document, as follows:
<xptr id=SVO doc=synSpec from="id(xsvo)">
<xptr id=NP1 doc=synSpec from="id(xnp1)">
Here the implication is that there is an element in the external entity synSpec which carries an identifier xsvo and which provides the definition for the analysis concerned.

In the same way, the corresp (corresponding) attribute can be used to represent some form of correspondence between two elements within a document, or in different documents. For example, in a multilingual text, it may be used to link translation equivalents, as in the following example:

<seg lang=FRA id=FR001 corresp=EN001>Jean aime Marie</seg>
<seg lang=ENG id=EN001 corresp=FR001>John loves Mary</seg>

If, as is more likely, the French and English sentences are contained in two different documents, external pointers could be added to either document to refer to the other.

All the pointers so far discussed have been contextual, that is, one end of the link being represented is given by the location of the pointing element itself. The TEI also defines an independent linking pointer (in HyTime terms, an ilink), represented by the <link> element. The targets attribute of this element specifies the identifiers of two or more other elements which are to be linked, in some way defined by its type attribute. For example, the translation equivalence expressed by means of the corresp attribute above could equally well be represented as follows:

<seg lang=FRA id=FR001>Jean aime Marie</seg>
<seg lang=ENG id=EN001>John loves Mary</seg>
<link type=translation targets="EN001 FR001">

This mechanism provides a convenient way of linking together sentences from different entities. Suppose that the English sentences are in an entity called ENtext and the French in one called FRtext. Since these are distinct SGML documents, we will use extended pointers to indicate each sentence, and express their alignment by means of an independent link:

<xptr id=EN1 doc=ENtext from="id(S1)">
<xptr id=FR1 doc=FRtext from="id(S1)">
<link type=translation targets="EN1 FR1">

A <link> element whose targets are pointers is defined as linking the targets of those pointers.

Groups of links of similar types can be identified using the <linkGrp> element: all the links within such a group inherit a type value from their parent.

7 Implementations

A large subset of the TEI recommendations for extended pointers has been implemented in the Synex Viewport engine, and is consequently available to applications of this engine, such as Softquad's Panorama and Panorama Pro.

Here is the formal specification for the subset implemented by Synex. Note that the ten keywords given here in upper case can in fact be specified in upper or lower case, or a mixture:

locterm  ::= 'ROOT'	  // default first rung
   |    'HERE'	          // location of the xptr.
   |    'ID' '(' NAME ')' // only one ID is allowed. 
   |    'CHILD' steps
   |    'ANCESTOR'  steps
   |    'PREVIOUS'  steps
   |    'NEXT'      steps
   |    'PRECEDING' steps
   |    'FOLLOWING' steps
   |    'DITTO'	          // valid only in TO attribute.

steps    ::=  '(' step ')'
   |    steps '(' step ')' 
step     ::= instance
   |    instance element
   |    instance element avspecs 
avspecs  ::= attribute value
   |    avspecs attribute value 
instance ::=  'ALL'
   |    NUMBER	        // default sign is + 
   |    '+' NUMBER 
   |    '-' NUMBER 
element  ::=  NAME 
   |    '#CDATA' 
   |    '*' 
attribute ::= NAME 
   |     '*' 
value     ::=  LITERAL	// i.e. a quoted string.
   |     NAME 	        // As for attribute values in 
   |     NUMBER	        //   a document, NMTOKENs need not
   |     NUMTOKEN	//   be quoted.
   |    '#IMPLIED'	// No value specified, no default.
   |    '*'	        // Any value matches. 
range    ::=  NUMBER
   |    NUMBER NUMBER
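
To make the step production in this grammar more concrete, here is a minimal Python sketch that splits the content of one parenthesized step into its instance, element and attribute-value parts. It is not the Synex implementation: the function name parse_step is invented, quoted LITERAL values containing spaces are not handled, and treating ALL without regard to case is an assumption.

```python
def parse_step(step):
    """Split a step such as '1 head lang lat' into its three parts."""
    parts = step.split()
    if not parts:
        raise ValueError("a step needs at least an instance")
    if parts[0].upper() == "ALL":
        instance = "ALL"
    else:
        instance = int(parts[0])     # int() accepts '3', '+3' and '-3'
    element = parts[1] if len(parts) > 1 else None
    rest = parts[2:]
    if len(rest) % 2 != 0:
        raise ValueError("attributes and values must come in pairs")
    # Pair each attribute name with the value that follows it.
    avspecs = list(zip(rest[0::2], rest[1::2]))
    return instance, element, avspecs


parsed = parse_step("1 head lang lat")
```

The result for the step used earlier in preceding (1 head lang lat) is an instance of 1, the element name head, and one attribute-value pair.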

You do not need to use the TEI DTD to take advantage of TEI extended pointers (though it may be a good idea to do so for other reasons). If you are using Panorama, you specify in your DTD or document that it should treat any element as a TEI extended pointer by supplying a processing instruction like the following:

<?TAGLINK foo "TEI-P3">
The element named (<foo> in the above example) must, of course, be defined in your DTD with attributes doc, from and to in the same way as the TEI elements <xptr> and <xref>.

A simple test file, demonstrating some of the features documented here, is available from the URL http://users.ox.ac.uk/~lou/papers/XR/XRtest.sgm. You must have Panorama installed on your machine to read this file.