Using HyTime to Link Translations

Version 1.0, 6th December 1995 © Martin Bryan, The SGML Centre

This paper has been written in response to Manuel Tomas Carrasco Benitez's proposal to the WINTER e-mail discussion group for the development of a mechanism to link translations of documents available on the World Wide Web. It suggests how the Hypermedia/Time-based Structuring Language (HyTime) defined in ISO/IEC 10744 could be used to manage the relationships between a source document and its translations.

This proposal will take into account the extensions to HTML currently being proposed by the HTML Working Group at the IETF. In particular it will refer to the proposal to allow an ID attribute to be associated with any element.

The Current Situation in HTML

The elements that can be used to link documents together in HTML are the <LINK> element used in the header to identify meta-data links and the <A> (anchor) element used in the text to identify actual links. The current definition for <A> given in RFC1866 is:

5.7.3. Anchor: A

The <A> element indicates a hyperlink anchor (see 7, "Hyperlinks").
At least one of the NAME and HREF attributes should be present.
Attributes of the <A> element:

HREF
gives the URI of the head anchor of a hyperlink.

NAME
gives the name of the anchor, and makes it available as
a head of a hyperlink.

TITLE
suggests a title for the destination resource --
advisory only. The TITLE attribute may be used:

* for display prior to accessing the destination
  resource, for example, as a margin note or on a
  small box while the mouse is over the anchor, or
  while the document is being loaded;

* for resources that do not include a title, such as
  graphics, plain text and Gopher menus, for use as a
  window title.

REL
The REL attribute gives the relationship(s) described by
the hyperlink. The value is a whitespace separated list
of relationship names. The semantics of link
relationships are not specified in this document.

REV
same as the REL attribute, but the semantics of the
relationship are in the reverse direction. A link from A
to B with REL="X" expresses the same relationship as a
link from B to A with REV="X". An anchor may have both
REL and REV attributes.

URN
specifies a preferred, more persistent identifier for
the head anchor of the hyperlink. The syntax and
semantics of the URN attribute are not yet specified.

METHODS
specifies methods to be used in accessing the
destination, as a whitespace-separated list of names.
The set of applicable names is a function of the scheme
of the URI in the HREF attribute. For similar reasons as
for the TITLE attribute, it may be useful to include the
information in advance in the link. For example, the
HTML user agent may chose a different rendering as a
function of the methods allowed; for example, something
that is searchable may get a different icon.

The definition for <LINK> is:

5.2.4. Link: LINK

The <LINK> element represents a hyperlink (see 7, "Hyperlinks"). Any
number of LINK elements may occur in the <HEAD> element of an HTML
document. It has the same attributes as the <A> element (see 5.7.3,
"Anchor: A").

The <LINK> element is typically used to indicate authorship, related
indexes and glossaries, older or more recent versions, document
hierarchy, associated resources such as style sheets, etc.

These definitions are further qualified by the following descriptions of HTML hyperlinks:

7. Hyperlinks

In addition to general purpose elements such as paragraphs and lists,
HTML documents can express hyperlinks. An HTML user agent allows the
user to navigate these hyperlinks.

A hyperlink is a relationship between two anchors, called the head
and the tail of the hyperlink[DEXTER]. Anchors are identified by an
anchor address: an absolute Uniform Resource Identifier (URI),
optionally followed by a '#' and a sequence of characters called a
fragment identifier. For example:

http://www.w3.org/hypertext/WWW/TheProject.html
http://www.w3.org/hypertext/WWW/TheProject.html#z31

In an anchor address, the URI refers to a resource; it may be used in
a variety of information retrieval protocols to obtain an entity that
represents the resource, such as an HTML document. The fragment
identifier, if present, refers to some view on, or portion of the
resource.

Each of the following markup constructs indicates the tail anchor of
a hyperlink or set of hyperlinks:

* <A> elements with HREF present.

* <LINK> elements.

* <IMG> elements.

* <INPUT> elements with the SRC attribute present.

* <ISINDEX> elements.

* <FORM> elements with `METHOD=GET'.

These markup constructs refer to head anchors by a URI, either
absolute or relative, or a fragment identifier, or both.

In the case of a relative URI, the absolute URI in the address of the
head anchor is the result of combining the relative URI with a base
absolute URI as in [RELURL]. The base document is taken from the
document's <BASE> element, if present; else, it is determined as in
[RELURL].

7.1. Accessing Resources

Once the address of the head anchor is determined, the user agent may
obtain a representation of the resource.

For example, if the base URI is `http://host/x/y.html' and the
document contains:

<img src="../icons/abc.gif">

then the user agent uses the URI `http://host/icons/abc.gif' to
access the resource, as in [URL].

7.2. Activation of Hyperlinks

An HTML user agent allows the user to navigate the content of the
document and request activation of hyperlinks denoted by <A>
elements. HTML user agents should also allow activation of <LINK>
element hyperlinks.

To activate a link, the user agent obtains a representation of the
resource identified in the address of the head anchor. If the
representation is another HTML document, navigation may begin again
with this new document.

7.4. Fragment Identifiers

Any characters following a `#' character in a hypertext address
constitute a fragment identifier. In particular, an address of the
form `#fragment' refers to an anchor in the same document.

The meaning of fragment identifiers depends on the media type of the
representation of the anchor's resource. For `text/html'
representations, it refers to the <A> element with a NAME attribute
whose value is the same as the fragment identifier. The matching is
case sensitive. The document should have exactly one such element.
The user agent should indicate the anchor element, for example by
scrolling to and/or highlighting the phrase.

For example, if the base URI is `http://host/x/y.html' and the user
activated the link denoted by the following markup:

<p>See: <a href="app1.html#bananas">appendix 1</a>
for more detail on bananas.

Then the user agent accesses the resource identified by
`http://host/x/app1.html'. Assuming the resource is represented using
the `text/html' media type, the user agent must locate the <A>
element whose NAME attribute is `bananas' and begin navigation there.
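
The address arithmetic in the RFC 1866 excerpts above can be checked with a few lines of Python. This is only a sketch for illustration: head_anchor is a name I have made up, and the standard urllib.parse module follows the later RFC 3986 rules rather than the [RELURL] document cited by RFC 1866, although the two agree on these simple cases.

```python
from urllib.parse import urldefrag, urljoin


def head_anchor(base_uri, reference):
    """Return (absolute URI, fragment identifier) for a reference.

    base_uri is the address of the document containing the reference
    (or the value of its <BASE> element); reference is the HREF or SRC value.
    """
    absolute = urljoin(base_uri, reference)   # combine relative URI with the base
    uri, fragment = urldefrag(absolute)       # split off anything after '#'
    return uri, fragment


# The two examples from sections 7.1 and 7.4 of RFC 1866.
print(head_anchor("http://host/x/y.html", "../icons/abc.gif"))
print(head_anchor("http://host/x/y.html", "app1.html#bananas"))
```

The second call returns the resource address and the fragment separately, which mirrors the two steps the RFC describes: the user agent first retrieves the resource and then looks inside it for the anchor whose NAME matches the fragment.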

Possible Changes for HTML Version 3.0

In the March 1995 draft of the proposed changes Dave Raggett wrote the following:

The Body Element and Related Elements

The BODY element

   Permitted Context: HTML   Content Model: %Body.Content 

Within the BODY element, you can structure text into paragraphs, and
lists, as well as highlighting phrases and creating links, amongst
other things. The BODY element has the following attributes, all of
which are optional: 

Note that the ID, LANG and CLASS attributes can be used with
virtually all of the elements permitted in the document body. 
ID    An SGML identifier used as the target for hypertext links or 
      for naming particular elements in associated style sheets.
      Identifiers are NAME tokens and must be unique within the scope
      of the current document. 
LANG  This is one of the ISO standard language abbreviations, e.g.
      "en.uk" for the variation of English spoken in the United
      Kingdom. It can be used by parsers to select language specific
      choices for quotation marks, ligatures and hypenation rules etc.
      The language attribute is composed from the two letter language
      code from ISO 639, optionally followed by a period and a two
      letter country code from ISO 3166. 
CLASS This a space separated list of SGML NAME tokens and is used to
      subclass tag names. For instance, <P CLASS=STANZA.COUPLET>
      defines a paragraph that acts as a couplet in a stanza. By
      convention, the class names are interpreted hierarchically, with
      the most general class on the left and the most specific on the
      right, where classes are separated by a period. The CLASS
      attribute is most commonly used to attach a different style to
      some element, but it is recommended that where practical class
      names should be picked on the basis of the element's semantics
      as this will permit other uses, such as restricting search
      through documents by matching on element class names. The
      conventions for choosing class names are outside the scope of
      this specification.

Unfortunately these attributes cannot be used with the empty <LINK> element, though they can be used with anchors (<A>) and most other elements containing text.

In different parts of the specification, Dave Raggett also proposed the following extensions to the use of <LINK> elements:

Additional features include a static banner area for corporate
logos, disclaimers and customized navigation/search controls. The
LINK element can be used to provide standard toolbar/menu items for
navigation, such as previous and next buttons. The NOTE element is
used for admonishments such as notes, cautions or warnings, and also
used for footnotes. 

LINK
The LINK element indicates a relationship between the document and
some other object. A document may have any number of LINK elements.
The LINK element is empty (does not have a closing tag), but takes
the same attributes as the anchor element. The important attributes
are: 
REL   This defines the relationship defined by the link
REV   This defines a reverse relationship.
      A link from document A to  document B with REV=--relation--
      expresses the same relationship  as a link from B to A with REL=--relation--.
      REV=made is  sometimes used to identify the document author, either the
      author's email address with a --mailto-- URI, or a link to the
      author's home page. 
HREF  This names an object using the URI notation. 

Using LINK to define document specific toolbars

An important use of the LINK element is to define a toolbar of
navigation buttons or an equivalent mechanism such as menu items.

LINK relationship values reserved for toolbars are: 
   REL=Home         The link references a home page or the top of some hierarchy. 
   REL=ToC          The link references a document serving as a table of contents. 
   REL=Index        The link references a document providing an index for the
                    current document. 
   REL=Glossary     The link references a document providing a glossary of terms
                    that pertain to the current document. 
   REL=Copyright    The link references a copyright statement for the current
                    document. 
   REL=Up           When the document forms part of a hierarchy, this link
                    references the immediate parent of the current document. 
   REL=Next         The link references the next document to visit in a guided tour. 
   REL=Previous     The link references the previous document in a guided tour. 
   REL=Help         The link references a document offering help, e.g. describing
                    the wider context and offering further links to relevant
                    documents. This is aimed at reorienting users who have lost
                    their way. 
   REL=Bookmark    Bookmarks are used to provide direct links to key entry points
                   into an extended document. The TITLE attribute may be used to
                   label the bookmark. Several bookmarks may be defined in each
                   document, and provide a means for orienting users in extended
                   documents. 
   An example of toolbar LINK elements: 
       <LINK REL=Previous HREF=doc31.html>
       <LINK REL=Next HREF=doc33.html>
       <LINK REL=Bookmark TITLE="Order Form" HREF=doc56.html>

Using LINK to include a Document Banner

   The LINK element can be used with REL=Banner to reference another
   document to be used as banner for this document. This is typically
   used for corporate logos, navigation aids, and other information
   which shouldn't be scrolled with the rest of the document. For
   example: 
       <LINK REL=Banner HREF=banner.html>
   The use of a LINK element in this way, allows a banner to be shared
   between several documents, with the benefit of being able to
   separately cache the banner. Rather than using a linked banner, you
   can also include the banner in the document itself, using the 
   BANNER element. 

Link to an associated Style Sheet

   The LINK element can be used with REL=StyleSheet to reference a
   style sheet to be used to control the way the current document is
   rendered. For example: 
       <LINK REL=StyleSheet HREF=housestyle.dsssl>

Other uses of the LINK element

   Additional relationship names have been proposed, but do not form
   part of this specification. Servers may also allow links to be
   added by those who do not have the right to alter the body of a
   document.

Until these proposals are accepted it will not be possible to use links within banner headlines or to identify every element in a translated document. As will be seen below, HyTime location facilities could be used to identify elements that do not currently have unique identifiers (IDs) and also strings within elements for which links are required.

Using Existing HyTime Constructs to Associate Translations

The existing <LINK> and <A> elements allow a start to be made on controlling translation referencing. The first problem is that of identifying the translations of a retrieved document. If the document starts with link statements of the following form it will be possible to identify the translations that are available for the currently displayed document:

<BASE href="http://www.myco.org/pub/subject/en/myfile.htm">
<LINK name=author  href="mailto:author@myco.com" title="Author" rev=made>
<LINK name=spanish href="../sp/myfile.htm" title="Español" rel=translation>
<LINK name=french  href="../fr/myfile.htm" title="Français" rel=translation>
<LINK name=german  href="../de/myfile.htm" title="Deutsch" rel=translation>

Unfortunately current browsers do not display link elements, so these elements will not be selectable by users until the HTML 3.0 proposals for Banners are actioned by browser developers.
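
Software other than a browser can already make use of such link statements. The following Python sketch shows one way a tool might list the translations declared in a document header. The class and function names are my own invention, and rel=translation is the relationship name suggested in this paper rather than one defined by RFC 1866.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TranslationLinks(HTMLParser):
    """Collect the <BASE> address and every <LINK rel=translation>."""

    def __init__(self):
        super().__init__()
        self.base = None
        self.translations = []          # (title, href) pairs as written

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)             # tag and attribute names arrive lower-cased
        if tag == "base":
            self.base = attrs.get("href")
        elif tag == "link":
            rels = (attrs.get("rel") or "").lower().split()
            if "translation" in rels and attrs.get("href"):
                self.translations.append((attrs.get("title"), attrs["href"]))


def list_translations(markup):
    """Return (title, absolute address) for each translation link."""
    parser = TranslationLinks()
    parser.feed(markup)
    parser.close()
    return [(title, urljoin(parser.base or "", href))
            for title, href in parser.translations]


HEADER = """<BASE href="http://www.myco.org/pub/subject/en/myfile.htm">
<LINK name=author  href="mailto:author@myco.com" title="Author" rev=made>
<LINK name=spanish href="../sp/myfile.htm" title="Español" rel=translation>
<LINK name=french  href="../fr/myfile.htm" title="Français" rel=translation>"""

for title, address in list_translations(HEADER):
    print(title, address)
```

The author link is skipped because it carries REV=made rather than REL=translation, so only the Spanish and French addresses are listed.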

Until browsers offer mechanisms for listing links in menus that users can use to interconnect files, the only alternative is to turn the links into anchors within the body of the document, using definitions such as:

<A name=spanish href="../sp/myfile.htm" title="Español" rel=translation>
[Español]</a>
<A name=french  href="../fr/myfile.htm" title="Français" rel=translation>
[Français]</a>
<A name=german  href="../de/myfile.htm" title="Deutsch" rel=translation>
[Deutsch]</a>

Note that links and anchors follow different models: links are declared using the EMPTY keyword and therefore have no content or end-tag, whereas anchors must have content or, at the very least, an end-tag.

When HTML 3.0 is available it will be possible to extend the above descriptions as follows:

<A name=spanish href="../sp/myfile.htm" title="Español" rel=translation lang=es class=automatic>
[Español]</a>
<A name=french  href="../fr/myfile.htm" title="Français" rel=translation lang=fr class=manual>
[Français]</a>
<A name=german  href="../de/myfile.htm" title="Deutsch" rel=translation lang=de class=semi-automatic>
[Deutsch]</a>

In this case I have used class to indicate whether conversion was done automatically by a program, manually by a skilled translator or semi-automatically, using human correction to an automatic translation.

Now let us look at how we can name points within a document. At present the only valid mechanism that will work across all browsers is the <A name=xyz> construct, as the HTML 2.0 specification does not allow IDs to be associated with elements. If you want to attach a name to a paragraph you have to do something along the lines of:

<p><a name=SGML></a>The uses of SGML are legion ...

Note that I have named the paragraph here, not identified a piece of text as an anchor. The anchor in this case has no content. Whilst it would be possible to place the end-tag for the anchor at the end of the paragraph this would serve no real purpose and introduces the risk that the limited model of HTML could be broken by elements within the paragraph.

If I translate this file into French, the translated file would contain:

<p><a name=SGML></a>Les usages de SGML sont legion ...

[Excuse my lousy French!]

I don't need to rename the object as anchor names are local to the file they are in, as the above excerpt from RFC1866 on Fragment Identifiers makes clear.

Note: These HTML links are not really "fragment identifiers" as anchors do not identify element sets. HTML anchors can at most identify a set of characters within a single element. Alternatively, as shown here, they can identify a spot in the document.

Links such as <a href="#SGML"> will take you to the point in the current document that has the name SGML assigned to an anchor. All you need to do to move from one translation to another is to switch from the BASE definition of the document you are working in to that of the translation listed in the LINK statements that is appropriate to your request. A mechanism for doing this would be very easy to define in the HTML 3.0 specification.
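
The switch described above amounts to keeping the fragment identifier and replacing the rest of the address. A minimal Python sketch, assuming the same anchor name is used in every translation (switch_translation is a hypothetical helper, not part of any HTML proposal):

```python
from urllib.parse import urldefrag, urljoin


def switch_translation(current_address, translation_href):
    """Return the address of the same named point in a translation.

    current_address  -- address of the point being viewed, e.g. ending in #SGML
    translation_href -- HREF taken from the matching LINK statement
    """
    base, fragment = urldefrag(current_address)     # remember the anchor name
    target = urljoin(base, translation_href)        # resolve against the current BASE
    return f"{target}#{fragment}" if fragment else target


print(switch_translation(
    "http://www.myco.org/pub/subject/en/myfile.htm#SGML",
    "../fr/myfile.htm",
))
```

This only works because anchor names are local to each file, so the translator is free to reuse the name SGML in the French file.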

Where document structure changes during translation the translator must determine which is the most appropriate place in the translation to place the anchor. (There is a distinct advantage in the use of empty anchors in this context as they are much easier to reposition than anchors that are placed around text, as the latter may need to be mingled with other text when translated, as anyone who has struggled with maintaining links in English/German translations will tell you! In fact this makes a very good argument for using anchors rather than IDs to identify points in the document that are linked within translations.)

The HyTime Location Model

Note: The following material is extracted from a book I am in the process of writing on HyTime. It is probably more detailed than is needed at present, but is included here for completeness. It is copyrighted and must not be used outside of the discussions within the WINTER discussion group.

Within an SGML document the basic method of addressing an element is by assigning a unique identifier (ID) to the element and then making a reference to this identifier using an attribute whose declared value is either an id reference value (IDREF) or an id reference list (IDREFS). Each unique identifier must be a valid SGML name, beginning with a letter, or one of the alternative name start characters defined in the SGML declaration, optionally followed by one or more letters, digits, or characters declared in the SGML declaration to be valid name characters.

Note: URLs are not valid SGML names, and neither are fragment identifiers by default, as # is not a valid name character in HTML. The name used in an SGML IDREF is identical to that used in the ID being pointed to. All IDREFs are presumed to be local, so there is no need to add a # to distinguish local name references as HTML currently requires.
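
The naming rule can be expressed as a simple test. The Python sketch below assumes the reference concrete syntax, in which names start with a letter and may continue with letters, digits, full stops and hyphens; an SGML declaration can add further name characters and also sets a maximum name length (NAMELEN), which is modelled here as an optional parameter. The function name is my own.

```python
import re

# Assumption: reference concrete syntax name characters only.
_NAME = re.compile(r"[A-Za-z][A-Za-z0-9.\-]*")


def is_sgml_name(token, namelen=None):
    """Return True if token could serve as an ID or IDREF value."""
    if namelen is not None and len(token) > namelen:
        return False
    return _NAME.fullmatch(token) is not None


for candidate in ("p-12", "SGML", "#SGML", "http://host/x/y.html", "12-p"):
    print(candidate, is_sgml_name(candidate))
```

Only the first two candidates pass: the fragment reference fails because of its leading #, the URL because of its colon and slashes, and the last because it starts with a digit.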

SGML's basic object addressing method has a number of limitations. The principal limitation is that the identified object must form part of the document, or subdocument, from which it is referenced. In addition, unique identifiers must be assigned to a single element's start-tag. This prevents an identifier from selecting more than one point in a document, though references can be made to more than one unique identifier from a reference point using an id reference list. A further restriction is that, because unique identifiers are defined as attributes of elements, they cannot be assigned to entity references, or to significant data strings.

HyTime's location address module allows SGML unique identifiers to be assigned to parts of a document which do not otherwise have identifiers. HyTime allows unique identifiers to be assigned to:

SGML entities
identified elements in other documents/subdocuments
unidentified elements in any document
data strings
a node in a data tree or other form of object list
a token within a token list
a position within a finite coordinate space
a date or time specification
elements with specific property values, including generic identifiers and attribute values
elements or entities belonging to a specific object class, e.g. conforming to a particular notation, or conforming to a particular architectural form
data encoded using any structured or unstructured notation
inaccessible objects that can only be identified using a classical, bibliographic, form of textual referencing.

There are three main types of HyTime location address:

name space locations identify objects by reference to one of SGML's standard name spaces, such as those for entity names and the unique identifiers assigned to elements
coordinate locations identify objects by reference to their position along an axis whose quanta are defined in terms of known measurement granules (virtual or absolute)
semantic locations identify objects by reference to one or more properties, which may be notation specific or otherwise unresolvable by a HyTime engine.

Name space locations

Name space locations are used to point to objects that have been assigned a name. Such objects include:

SGML elements that have been assigned a unique identifier
SGML entities
SGML document entities, with their document type definitions and link process definitions
HyTime coordinate and semantic locations.

The named location address (nameloc) element type architectural form is used to define elements that can identify locations by reference to a name in a specified name space. Its meta-DTD definition is:

<!element nameloc  -- Assigns a local ID to one or more named objects --
                   -- Constraint: name list derived from content 
                      elements. --
                   - O      (nmlist|nmquery)* >
<!attlist nameloc  HyTime   NAME     nameloc
                   id       ID       #REQUIRED
                            -- multloc attributes --
                            -- spanloc attributes --  >

Each named location address associates a unique identifier (id) with one or more objects that are identified by name in a constructed name list. This name list is constructed by concatenating the name lists supplied by the name list specifications (nmlist) and name list queries (nmquery) that form the contents of the named location address.

The name list specification (nmlist) element type architectural form is used to define elements that will contain the lists of names used to form a constructed name list. The meta-DTD definition for this architectural form is:

<!element nmlist   -- List of local or external ID or entity names --
                   - O      (#PCDATA) -- lextype(NAMES) -->
<!attlist nmlist   HyTime   NAME     nmlist
                   nametype -- Entity names or IDs of elements --
                            (entity|element|unified) entity
                   obnames  -- Objects treated as names? --
                            (obnames|nobnames) nobnames
                   docorsub -- SGML document or subdoc whose prolog 
                               declares entities or elements named in
                               name list; initially the document in
                               which this occurs. --
                            ENTITY   #IMPLIED  --Default: no change--
                   dtdorlpd -- Active DTD or LPDs for SGML document
                               entity. Base DTD if unspecified and
                               docorsub changed; no change if 
                               unspecified and docorsub unchanged. --
                            -- lextype(DTD|LPD+) --
                            NAMES    #IMPLIED
                            -- Default: unchanged or base --           >

The name type (nametype) attribute is used to identify whether the name is a unique identifier for an element or the name of an entity.

The objects treated as names (obnames) attribute indicates whether or not the objects pointed to are to be treated as names. If the default value, nobnames, is changed to obnames, the content of each of the named objects is treated as a name specification list which is to be added to the constructed name list of the named location address in which the object was named.

The SGML document or subdocument (docorsub) attribute identifies the SGML document/subdocument entity used to declare the document/subdocument in which the names listed are defined.

If multiple DTDs and SGML LINKTYPES are being supported the active DTD or LPDs (dtdorlpd) attribute can be used to identify the tree the names belong to.

The names in a name list specification are not checked for validity as a unique identifier or entity name until the named location address is referenced in a manner that requires access to the addressed object.

To understand the effect of the above rules, consider the following examples, which are constructed using the following elements:

<!ELEMENT topic    - O (anchors+)                       >
<!ATTLIST topic    HyTime   NAME      #FIXED   nameloc
                   id       ID                 #REQUIRED
                   set      (set|notset)       set      >
<!ELEMENT anchors - O (#PCDATA) --lextype(NAMES)--      >
<!ATTLIST anchors HyTime   NAME      #FIXED   nmlist
                  nametype (entity|element)   element
                  obnames  (obnames|nobnames) nobnames
                  docorsub CDATA --lextype(entity|URL)-- #IMPLIED           >

Using the default values for the non-compulsory attributes the following named location addresses could be specified:

<topic id=fleas><anchors>p-12 p-34 p-35</topic>
<topic id=dogs><anchors>p-3 p-24 p-35</topic>

Each of the names in the anchors name list specifications points to an anchor element defined in the currently active document. Each of the topic named location address elements identifies three paragraphs. Note that the paragraph called p-35 is included under both the dogs and fleas topics.

The following named location specification can be used to concatenate the two lists:

<topic id=allergies><anchors obnames>fleas dogs</topic>

In this example the name list specification is identified as pointing at the unique identifiers of one or more object names. The elements pointed to are the two named topic elements used in the preceding example. Because the default value of the set attribute is set, duplicated entries will be removed, so the constructed name list known as allergies will contain the following entries: p-3 p-12 p-24 p-34 p-35
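
The way the constructed name list is built up can be sketched in Python. This is an illustrative model only, not the behaviour of any HyTime engine: the dictionary layout and function name are assumptions of mine, and the sketch keeps names in the order in which they are concatenated, whereas the list above is shown in paragraph order. The set of names is the same in both cases.

```python
def constructed_name_list(topic_id, topics, as_set=True):
    """Build the name list for one nameloc-style element.

    topics maps each topic id to its list of nmlist-style entries; an entry
    flagged obnames names other topics whose own lists are to be spliced in.
    """
    names = []
    for nmlist in topics[topic_id]:
        for name in nmlist["names"].split():
            if nmlist.get("obnames"):
                # The named object is itself treated as a list of names.
                names.extend(constructed_name_list(name, topics, as_set))
            else:
                names.append(name)
    if as_set:
        # set=set: drop duplicates, keeping the first occurrence of each name.
        names = list(dict.fromkeys(names))
    return names


TOPICS = {
    "fleas": [{"names": "p-12 p-34 p-35"}],
    "dogs": [{"names": "p-3 p-24 p-35"}],
    "allergies": [{"names": "fleas dogs", "obnames": True}],
}

print(constructed_name_list("allergies", TOPICS))
print(constructed_name_list("allergies", TOPICS, as_set=False))
```

With as_set left at its default the paragraph p-35 appears once; with as_set=False it appears twice, once for each topic that lists it.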

Coordinate locations

HyTime coordinate locations can be used to identify the following types of objects that can be addressed as HyTime "quanta":

bit combinations, which may represent characters in a character string, can be addressed using a data location address (dataloc)
tokens, such as SGML names, numbers and words, in a token list can also be addressed using a data location address (dataloc)
nodes in a tree structure, such as the elements in an SGML document, can be addressed using either a width first tree location address (treeloc) or a depth first path location address (pathloc), or through their relationship with another node (relloc)
nodes in a list, such as a name specification list, can be addressed by a list location address (listloc)
events in a schedule, including `events' such as cells in a multidimensional spreadsheet or table, can be identified using a finite coordinate system location address (fcsloc).

Each coordinate location addresses a location with respect to some other location, known as a location source. The combination of a location source and a coordinate location is known as a location view. Different location views can use the same location source. The location source for a coordinate location can be another location address element. In such cases the combination of locations is said to form a location ladder.

Note: Location sources allow relative addressing.

The location source (locsrc) attribute list architectural form is used by all location addresses that can form part of a location view to identify the element(s) that the address is related to. The meta-DTD definition for this architectural form is:

<!attlist locsrc   -- use: dataloc treeloc pathloc relloc
                           listloc proploc fcsloc --
          locsrc   -- location source --
                   -- Constraint: No HyTime reftype constraints --
                   IDREFS   #CURRENT --Default: previous specified-->

The location source (locsrc) attribute identifies the source of the coordinate location by reference to a unique identifier which has been assigned to either an element in the current document or a location address that identifies the required source element. As such it has the same effect as a named location address, except that it can only identify elements in the current document.

If no value is entered the location is taken to be that which applied the last time the attribute was specified (#CURRENT). For the first occurrence of an element with this attribute a value must be specified.

Data location addresses

Data location addresses can be used to identify:

a sequence of bit combination quanta known as a string
a space-delimited token in a sequence of token quanta known as a token list.

The data location address (dataloc) element type architectural form must be used to define elements that identify locations within a data stream. The meta-DTD definition for this architectural form is:

<!element dataloc  -- Locates string and token data objects in data --
                   -- Constraint: dimlists are concatenated into one
                                  list --
                   - O     (dimlist*) >
<!attlist dataloc  HyTime  NAME     dataloc
                   id      ID       #REQUIRED
                   quantum -- Data quantum: bit combination or token --
                           (str|norm|word|name|sint|date|time|utc) str
                   catsrc  -- Concatenate multiple source objects into
                              a single object before applying dimlist
                              to it --
                           (catsrc|catsrcsp|nocatsrc) nocatsrc
                   catres  -- Concatenate results of applying dimlist,
                              in the order of the dimlist --
                           (catres|catressp|nocatres) nocatres
                           -- overrun attributes --
                           -- locsrc attributes --
                           -- multloc attributes --
                           -- spanloc attributes --                   >

Each dimension list (dimlist) should contain two markers, one identifying where the data location starts and the other where it ends. Where more than one dimension list is specified, or a dimension list contains more than one dimension specification, each identified part of the addressable range will be concatenated to form a single data string or token list.

The attributes associated with the data location address architectural form include the HyTime attribute that identifies the element as conforming to the dataloc architectural form and a compulsory unique identifier (id). The data quantum (quantum) attribute identifies the type of data being located. Permitted values for this attribute are:

str to indicate that the dimension list identifies a sequence of bit combinations that form a string
norm to indicate that the identified data is part of a normalized (tokenized) text string
word to indicate that each token in the tokenized string consists solely of valid SGML name characters
name to indicate that the located data forms a single SGML name
sint to indicate that the located data is a signed integer consisting of digits optionally preceded by a plus sign or a hyphen
date to indicate that the located data is a valid universal time coordinate (UTC) date in yyyy-mm-dd format
time to indicate that the located data is a valid UTC time in hh:mm:ss.decimal format
utc to indicate that the located data forms a valid UTC date/time, in yyyy-mm-dd hh:mm:ss.decimal format.

Except when the value of this attribute is set to str, bit combinations in the addressable range will be treated as normalized tokens. This means that any leading or trailing white space and separators will be removed, and each run of multiple separators will be replaced by a single space. Any characters which do not conform to the model for the specified data type will be replaced by spaces.

The concatenate source (catsrc) attribute indicates whether or not data segments identified by a multiple location source are to be concatenated into a single addressable string or token list. If the attribute is set to catsrcsp the data associated with each element identified by the locsrc attribute will be concatenated with a single space between each string. No space will be inserted between the concatenated strings/token lists if the value is set to catsrc. If the attribute is left at the default nocatsrc value the data associated with each source location will be processed individually, in the order specified in locsrc, against the same set of dimension specifications.

The concatenate result (catres) attribute indicates whether or not strings located in different location sources are to be concatenated to form a single result string or token list. If the value is set to catressp the string identified in each of the location sources identified by the locsrc attribute will be separated from its neighbours by a space when the results are concatenated into a single string. If catres is used as the attribute value the identified strings/tokens will be concatenated with no space inserted between them. The default value of nocatres indicates that the data located from each location source is to be treated as a separate result string/token list.
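The interaction of catsrc and catres can be sketched in Python. This is only an illustration under stated assumptions, not part of HyTime: the function name apply_dataloc, the reduction of the dimension list to a single start/length pair, and the sample strings are all hypothetical.

```python
def apply_dataloc(sources, start, length, catsrc="nocatsrc", catres="nocatres"):
    """Apply one (start, length) dimension to a list of source strings.

    `start` is 1-based. The result is always returned as a list of strings.
    """
    # Step 1: optionally merge the sources before the dimension is applied.
    if catsrc == "catsrc":
        sources = ["".join(sources)]
    elif catsrc == "catsrcsp":
        sources = [" ".join(sources)]
    elif catsrc != "nocatsrc":
        raise ValueError("unknown catsrc value: " + catsrc)

    # Step 2: apply the same dimension to every remaining source, in order.
    results = [s[start - 1 : start - 1 + length] for s in sources]

    # Step 3: optionally merge the individual results.
    if catres == "catres":
        return ["".join(results)]
    if catres == "catressp":
        return [" ".join(results)]
    if catres == "nocatres":
        return results
    raise ValueError("unknown catres value: " + catres)


SOURCES = ["Orient Express", "Gare du Nord"]  # hypothetical location sources

separate = apply_dataloc(SOURCES, 1, 6)
spaced = apply_dataloc(SOURCES, 1, 6, catres="catressp")
merged = apply_dataloc(SOURCES, 16, 4, catsrc="catsrcsp")
```

With the defaults each source yields its own result string; with catressp the two results are joined by a space; with catsrcsp the dimension is counted along the single string formed from both sources.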

The following SGML-encoded data will be used to illustrate the use of data location addresses:

<p id=oriental>The <train>Orient Express</train> will leave·
<station>Gare du Nord</station> at <time>11:00:00.00</time> on·
<date>1999-12-23</date>.</p>

The elements used to identify data within this structure have the following declarations:

<!ELEMENT string  - O (segment+)                                       >
<!ATTLIST string   HyTime  NAME             #FIXED    dataloc
                   id      ID               #REQUIRED
                   locsrc  IDREFS           #CURRENT
                   quantum (str|norm|word|name|sint|date|time|utc) str
                   catsrc  (catsrc|catsrcsp|nocatsrc) nocatsrc
                   catres  (catres|catressp|nocatres) catres           >
<!ELEMENT segment  O O      (starts, (length|ends))+                   >
<!ATTLIST segment  HyTime   NAME             #FIXED    dimspec         >
<!ELEMENT (starts|length|ends) O O (#PCDATA)                           >
<!ATTLIST (starts|length|ends) HyTime NAME   #FIXED    marklist        >

These elements can be used to specify the following simple data location address:

<string id=action locsrc=oriental><segment><starts>20<length>10</string>

The locsrc attribute tells us that the data to be located is part of the object which has been assigned the identifier oriental within the current document. The data segments of this element, together with those of all its descendants, will be concatenated to identify the addressable range of the located object. Figure 6.2 shows the 72 characters that form the identified data string.

NOTE: For this example · is used to identify a new line (record end) code.


The Orient Express will leave·Gare du Nord at 11:00:00.00 on·1999-12-23.

|________|_________|_________|_________|_________|_________|_________|__
1        10        20        30        40        50        60        70

Figure 6.2: Data string for object whose unique identifier is oriental

In this case the dimension list (segment) starts at quantum (character) 20 and identifies a string that is 10 characters long. In other words it identifies the string "will leave".
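The character counting used here can be checked with a short Python sketch. The function name locate_str is hypothetical, and a newline is assumed to stand for each record end code so that it counts as one character.

```python
# Data string from Figure 6.2; "\n" stands in for the record end code.
DATA = "The Orient Express will leave\nGare du Nord at 11:00:00.00 on\n1999-12-23."


def locate_str(data, start, length):
    """Return `length` characters beginning at 1-based position `start`."""
    if start < 1 or length < 0 or start - 1 + length > len(data):
        raise ValueError("dimension falls outside the addressable range")
    return data[start - 1 : start - 1 + length]


located = locate_str(DATA, 20, 10)
print(len(DATA), repr(located))
```

Positions are counted from 1, as in the figure, so the slice starts at index start - 1.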

Another way to identify the two words that make up this string would be to enter the following address specification:

<string id=does locsrc=oriental quantum=word>
<segment><starts>4<length>2</string>

Figure 6.3 shows the words that form the data within the element whose unique identifier is oriental. Note that the time is split into three words as colons are not recognized as valid within the standard SGML name character set, and so are replaced by spaces during tokenization. Conversely the date, including the following period, is treated as a single word as both hyphens and periods are normally valid SGML name characters.


 The Orient Express will leave Gare du Nord at 11 00 00.00 on 1999-12-23.
|___|______|_______|____|_____|____|__|____|__|__|__|_____|__|___________|
   1    2      3      4    5    6    7   8   9 10 11 12    13     14

Figure 6.3: Data string counted using word quanta
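The word count in Figure 6.3 can be reproduced with the following Python sketch. It assumes that the SGML name characters are the letters, digits, period and hyphen of the reference concrete syntax; the function names tokenize_words and locate_words are hypothetical.

```python
import re

# Data string from Figure 6.2; "\n" stands in for the record end code.
DATA = "The Orient Express will leave\nGare du Nord at 11:00:00.00 on\n1999-12-23."


def tokenize_words(data):
    """Replace every non-name character by a space, then split into words."""
    cleaned = re.sub(r"[^A-Za-z0-9.\-]", " ", data)
    return cleaned.split()


def locate_words(data, start, length):
    """Return `length` words beginning at 1-based word position `start`."""
    words = tokenize_words(data)
    return words[start - 1 : start - 1 + length]


WORDS = tokenize_words(DATA)
located = locate_words(DATA, 4, 2)
```

Because the colons become spaces the time yields three words, while the date and its trailing period survive as a single word, giving 14 words in all.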

Node locations

Structured documents such as SGML documents can be viewed as a hierarchically structured tree consisting of a number of nodes. For SGML documents the following node types can be identified:

element - an element whose start and end points are clearly identifiable from the data model
pelement - pseudo-element containing a segment of data that satisfies a #PCDATA token within an element's content model (or forms a data tag pattern that occurs between two tags where the end-tag of the preceding element has been entered in the form of a data tag)
dataent - data contained in a uniquely named SGML data entity (which is not parsed by the SGML parser)
dataobj - data identified by a location address element within the content of an element, data entity or property value.

Trees can be viewed in either a width-first or a depth-first manner. The width-first approach looks at each level in the tree as a separate list of nodes from which members can be selected. The depth-first approach considers each path down the tree as a separate list of nodes.

Both of these approaches to the tree structure can be used to define a set of measurement domains whose addressable range is determined by the number of nodes in a given node list. Nodes in a node list do not need to be unique. For example, the same data entity could occur at two points at a given level. Hence node lists do not form 'sets' in the mathematical sense. Each node list is 'ordered'; the order in which the nodes are listed is always preserved during processing.

Each node in a node list can be considered as a tree of one or more nodes. This allows location ladders to be built that find one node in a tree and then locate other nodes with respect to the located node.
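The width-first and depth-first views described above can be sketched in Python as two ways of deriving ordered node lists from the same tree. The tree shape, the (name, children) tuple representation and the function names are assumptions made for illustration only.

```python
# A hypothetical document tree: each node is a (name, children) pair.
TREE = ("doc", [
    ("head", [("title", [])]),
    ("body", [("h1", []), ("p", [("a", [])])]),
])


def width_first_levels(tree):
    """Return one ordered node list per level of the tree."""
    levels, current = [], [tree]
    while current:
        levels.append([name for name, _ in current])
        current = [child for _, kids in current for child in kids]
    return levels


def depth_first_paths(tree, prefix=()):
    """Return one ordered node list per path from the root to a leaf."""
    name, kids = tree
    path = prefix + (name,)
    if not kids:
        return [list(path)]
    paths = []
    for kid in kids:
        paths.extend(depth_first_paths(kid, path))
    return paths


LEVELS = width_first_levels(TREE)
PATHS = depth_first_paths(TREE)
```

Each list keeps document order, and the same node name may appear in more than one path, which matches the point that node lists are ordered but are not sets.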

NOTE: This paper will not concern itself with the use of tree or path locators.

Relative location addresses

If support for the relloc location address module option has been declared, the relative location address (relloc) element type architectural form can be used to define elements that will be used to select nodes based on their relationships to nodes named by another location address or a unique identifier.

The meta-DTD for the relative location address element type architectural form is defined as:

<!element relloc   -- Locates nodes in a tree relative to a starting
                      node --
                   -- Constraint: dimlists are merged into a single list
                      that resolves to pairs of dimspecs --
                   - O      (dimlist*)                                 >
<!attlist relloc   HyTime   NAME     relloc
                   id       ID       #REQUIRED
                   root     -- Root of tree --
                            -- Constraint: 1 per starting node 
                                           or 1 for all --
                            IDREFS   #IMPLIED
                            -- Default: document root of each 
                                        starting node --
                   relation -- Relationship to starting node --
                            -- Constraint: des requires "pathloc" 
                                           option --
                            (anc|esib|ysib|des|parent|children) parent
                            -- overrun attributes --
                            -- locsrc attributes --
                            -- multloc attributes --
                            -- spanloc attributes --                   >

In addition to the compulsory HyTime and id attributes relative locations have two unique attributes. The root of tree (root) attribute is used to specify the object which is to form the starting node of the relationship. Where locsrc identifies a multiple location, all the starting nodes of which occur in the same tree, a single starting node can be specified for all sources. Otherwise a separate starting node has to be specified for each tree identified by the location source attribute. If no value is entered the document element of each source will be taken as the starting point for the relationship.

The relationship to starting node (relation) attribute is used to specify the part of the located tree that is to form the addressable range from which nodes can be selected. The available options are:

anc - all ancestors, between root node and the parent of the starting node
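esib - all elder (left-hand) siblings of the starting node, i.e. the children of the starting node's parent that precede it
ysib - all younger (right-hand) siblings of the starting node, i.e. the children of the starting node's parent that follow it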
des - all descendants in the subtree whose root is the starting node
parent - parent node of starting node
children - children of starting node.
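The six relations can be illustrated with a small Python sketch. It is not a HyTime implementation: the Node class and the relloc function are hypothetical, esib and ysib are read as the elder and younger siblings found under the starting node's parent, ancestors are assumed to be listed from the root downwards, and descendants are assumed to be listed in document order.

```python
class Node:
    """A tree node that records its parent so relations can be walked."""

    def __init__(self, name, children=()):
        self.name = name
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self


def relloc(start, relation):
    """Return the node list that `relation` selects relative to `start`."""
    if relation == "parent":
        return [start.parent] if start.parent is not None else []
    if relation == "children":
        return list(start.children)
    if relation == "anc":
        found, node = [], start.parent
        while node is not None:
            found.append(node)
            node = node.parent
        return found[::-1]  # root first, parent of the starting node last
    if relation in ("esib", "ysib"):
        if start.parent is None:
            return []
        siblings = start.parent.children
        index = next(i for i, n in enumerate(siblings) if n is start)
        return siblings[:index] if relation == "esib" else siblings[index + 1:]
    if relation == "des":
        found = []
        for child in start.children:
            found.append(child)
            found.extend(relloc(child, "des"))
        return found
    raise ValueError("unknown relation: " + relation)


def names(nodes):
    return [node.name for node in nodes]


# Tree modelled on the sample paragraph used for the dataloc examples.
train, station, time_, date = (Node(n) for n in ("train", "station", "time", "date"))
p = Node("p", [train, station, time_, date])
doc = Node("doc", [p])
```

Taking station as the starting node, esib selects train, ysib selects time and date, and anc selects doc followed by p.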

Semantic locations

HyTime provides access to three types of semantic address:

addresses based on one or more of the properties associated with an object
addresses that must be interpreted according to the rules specified for a particular data notation
addresses that must be interpreted, in the classical bibliographic manner, by a human, or by a program that can identify objects in some way other than through the entity references used to define the objects that make up a HyTime bounded object set.

NOTE: This paper will not concern itself with the definition of semantic locations.

HyTime Links

HyTime provides two types of links:

contextual links (clink) are embedded in the source document at the point from which the reference is being made
independent links (ilink) are stored independently of the data they reference, either in one of the documents being pointed to or in a completely separate document.

The meta-DTD definition for a contextual link is:

                       <!-- Contextual Link -->            
<!element clink -- Contextual link --
                - O      (%HyBrid;)* >
<!attlist clink HyTime   NAME     clink
                id       ID       #IMPLIED  -- Default: none --
                linkend  -- Link end --
                         -- Constraint: No HyTime reftype constraints,
                            but application designers can constrain
                            element types with reftype attribute --
                         IDREF    #REQUIRED                             >

In addition to the standard HyTime and id attributes a single reference to a unique identifier (IDREF) must be entered in the linkend attribute. The located ID could belong to a HyTime locator or an element in the current file. When a HyTime locator has been used the IDREF can point to a location in a document other than the current one as long as the locator element that identifies it is in the same document as the clink element.

NOTE: The HTML <A> element can be seen as a less constrained version of this model, with the name attribute taking the part of the id attribute and href taking the place of the linkend attribute. Because both HTML attributes are declared using the CDATA keyword, their values are less constrained than their HyTime equivalents, which must follow SGML's naming rules. Those rules would not accommodate URLs without modification of the default name character set, although this is not a problem as HTML already redefines most of the other default SGML limitations.

HyTime independent links must conform to the following meta-DTD definition:

                   <!-- Independent Link -->        
<!element ilink -- Independent link --
                - O      (%HyBrid;)* >
<!attlist ilink HyTime   NAME     ilink
                id       ID       #IMPLIED  -- Default: none --
                anchrole -- Anchor roles --
                         -- Constraint: one per anchor --
                         -- lextype((NAME, s+, (RNI, "AGG")?),
                                    (s+, NAME, s+, (RNI, "AGG")?)+) --
                         CDATA    #FIXED in-DTD
                linkends -- Link ends --
                         -- Constraint: one anchor per anchor role. If
                            one is omitted, ilink element is first
                            anchor. --
                         -- Constraint: No HyTime reftype constraints,
                            but application designers can constrain
                            element types with reftype attribute --
                         IDREFS   #REQUIRED
                extra    -- External access traversal rule --
                         -- Constraint: one/anchor or one for all--
                         -- lextype(("E"|"I"|"A"|"N"|"P"),
                                    (s+, ("E"|"I"|"A"|"N"|"P"))*) --
                         NAMES    #IMPLIED -- Default: no traversal --
                intra    -- Internal access traversal rule --
                         -- Constraint: one/anchor or one for all --
                         -- lextype(("E"|"I"|"A"|"N"|"P"),
                                    (s+, ("E"|"I"|"A"|"N"|"P"))*) --
                         NAMES    #IMPLIED -- Default: no traversal --
                endterms -- Link end term information --
                         -- Constraint: one/anchor or one for all --
                         -- reftype(HyBrid) --
                         IDREFS   #IMPLIED  -- Default: none --
                aggtrav  -- Traversal of agglink anchors: agg or
                             members --
                         -- Constraint: one/anchor or one for all --
                         -- lextype(("AGG"|"MEM"|"COR"),
                                    (s+, ("AGG"|"MEM"|"COR"))*) --
                         NAMES    agg                                   >

In this case the link points to two or more named location specifications via its linkends attribute. Each of these locations can be assigned a role by the anchor role (anchrole) attribute. Each role can have a method associated with it through the endterms attribute. Rules for traversing to each of the anchors can be controlled using the intra and extra attributes, while the aggregate traversal (aggtrav) attribute determines whether the links are traversed in parallel, separately or under user control.

Using HyTime within HTML

To see how HyTime independent links work consider the following example of what might be possible if HyTime locators and independent links were permitted in the header of an HTML document. For this example I will use a special element defined as follows:

<!ELEMENT extlink -- External link --
                - O      (#PCDATA) >
<!ATTLIST extlink HyTime    NAME     ilink
                  HyNames   CDATA    "anchrole languages
                                      linkends locators
                                      endterms show-as"
                  id        ID       #IMPLIED  -- Default: none --
                  languages CDATA    #REQUIRED --one per language--
                  locators  IDREFS   #REQUIRED --one per language--
                  show-as   IDREFS   #REQUIRED --one per language--
                  extra     NAMES    #IMPLIED -- Default: None --
                  intra     NAMES    "A"
                  aggtrav   NAMES    agg                               >

In this example three of the attributes have been renamed using the HyNames attribute. This allows us to use languages as the name of the anchrole attribute, locators as the replacement for linkends and show-as in place of endterms.

Each use of the element must have three attributes defining which languages are to be available, which locators are used to identify the relevant point in each language, and what form should be used to identify the translations.

For this example no external (extra) traversal through the link is permitted by default, but all links can be traversed from within the link itself. Aggregate (agg) traversal is used if a multiple location is pointed to by one of the locators so that users can control which link to go to by selecting from a menu, etc.
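How a processor might map the renamed attributes back to their architectural names and pair up the per-anchor values can be sketched in Python. This is an illustration under assumptions: the function names are hypothetical, the sample attribute values are invented in the style of this proposal, the HyNames pairs follow the prose description (languages for anchrole), and the #AGG marker and the case of an omitted link end are ignored.

```python
def rename_map(hynames):
    """Map local attribute names to architectural names from HyNames pairs."""
    words = hynames.split()
    if len(words) % 2:
        raise ValueError("HyNames must hold (architectural, local) pairs")
    return {local: arch for arch, local in zip(words[0::2], words[1::2])}


def per_anchor(value, count):
    """Expand a 'one per anchor or one for all' value to `count` entries."""
    items = value.split() if value else []
    if not items:
        return [None] * count  # attribute not specified
    if len(items) == 1:
        return items * count   # one value applies to all anchors
    if len(items) == count:
        return items           # one value per anchor
    raise ValueError("expected 1 or %d values, got %d" % (count, len(items)))


def anchors(attrs, hynames):
    """Return one record per anchor, keyed by architectural attribute roles."""
    local_to_arch = rename_map(hynames)
    arch = {local_to_arch.get(k, k): v for k, v in attrs.items()}
    roles = arch["anchrole"].split()
    ends = arch["linkends"].split()
    if len(ends) != len(roles):
        raise ValueError("expected one link end per anchor role")
    terms = per_anchor(arch.get("endterms"), len(roles))
    intra = per_anchor(arch.get("intra"), len(roles))
    extra = per_anchor(arch.get("extra"), len(roles))
    return [
        {"role": r, "linkend": e, "endterm": t, "intra": i, "extra": x}
        for r, e, t, i, x in zip(roles, ends, terms, intra, extra)
    ]


HYNAMES = "anchrole languages linkends locators endterms show-as"
EXTLINK = {  # hypothetical attribute values for one extlink element
    "languages": "EN SP FR DE",
    "locators": "SGML-en SGML-sp SGML-fr SGML-de",
    "show-as": "hot-spot SP-flag FR-flag DE-flag",
    "intra": "A",
}
ANCHORS = anchors(EXTLINK, HYNAMES)
```

The single intra value is applied to every anchor, while the unspecified extra attribute leaves each anchor without an external traversal rule.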

For this example we will also use the topic and anchor elements that were defined earlier. To make the example more HTML-friendly, one of the attributes, docorsub, has been renamed URL using the optional HyNames attribute, and the lexical model for the element has been redefined so that it refers to HTML-NAMES (i.e. a name attribute of an HTML anchor) rather than an SGML unique identifier, resulting in the following definition:

<!ELEMENT anchors - O (#PCDATA) --lextype(HTML-NAMES)--      >
<!ATTLIST anchors HyTime   NAME      #FIXED   nmlist
                  HyNames  CDATA     #FIXED   "docorsub URL"
                  nametype (entity|element)   element
                  obnames  (obnames|nobnames) nobnames
                  URL      CDATA --lextype(entity|URL)-- #IMPLIED      >

The external link (extlink), topic and anchors elements could then be used as follows within an HTML header:

<html><head><title>Linked Translation Set</title>
<BASE href="http://www.myco.org/pub/subject/en/myfile.htm">
<LINK name=author  href="mailto:author@myco.com" title="Author" rev=made>
<LINK name=spanish href="../sp/myfile.htm" title="Español" rel=translation>
<LINK name=french  href="../fr/myfile.htm" title="Français" rel=translation>
<LINK name=german  href="../de/myfile.htm" title="Deutsch" rel=translation>
<topic id=SGML-en>
<anchors>SGML HTML</topic>
<topic id=SGML-sp>
<anchors URL="../sp/myfile.htm">SGML HTML</topic>
<topic id=SGML-fr>
<anchors URL="../fr/myfile.htm">SGML HTML</topic>
<topic id=SGML-de>
<anchors URL="../de/myfile.htm">SGML HTML</topic>
<extlink id=connect-sgml languages="EN SP FR DE" locators="SGML-en SGML-sp SGML-fr SGML-de" show-as="hot-spot SP-flag FR-flag DE-flag">
...
</head>
<body>
<h1>Linking together the World Wide Web</h1>
<p><a name=HTML></a>
HTML is ....
<p><a name=SGML></a>
SGML is ...
</body></html>

Note that the anchors definition for the topic related to the local file (SGML-en) has no URL attribute. This is because the default value for this attribute is the local document or the document identified by the BASE element.

Using this form of coding it will be possible to provide a facility whereby, when you select a point in the document, the system will look for the nearest named element and then look in the header to identify which topics refer to that locator. Users can then either choose to go to related points in the same document, or to the similarly named points in any of the translations that have been identified as being associated with the document.
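The lookup described above might work along the following lines. This Python sketch is hypothetical throughout: the character offsets of the named anchors are invented, "nearest" is assumed to mean the smallest distance in characters, and the targets are assumed to be addressed as URL#name in the usual HTML manner.

```python
from urllib.parse import urljoin

BASE = "http://www.myco.org/pub/subject/en/myfile.htm"

# (offset, name) for each <a name=...> in the body; offsets are invented.
NAMED_ANCHORS = [(0, "HTML"), (120, "SGML")]

# topic id -> (URL attribute or None, names listed in the anchors element)
TOPICS = {
    "SGML-en": (None, ["SGML", "HTML"]),
    "SGML-sp": ("../sp/myfile.htm", ["SGML", "HTML"]),
    "SGML-fr": ("../fr/myfile.htm", ["SGML", "HTML"]),
    "SGML-de": ("../de/myfile.htm", ["SGML", "HTML"]),
}

EXTLINK = {
    "languages": "EN SP FR DE",
    "locators": "SGML-en SGML-sp SGML-fr SGML-de",
}


def nearest_name(offset):
    """Return the name of the named anchor closest to `offset`."""
    return min(NAMED_ANCHORS, key=lambda anchor: abs(anchor[0] - offset))[1]


def translations(offset):
    """Map each language to the matching named point in its document."""
    name = nearest_name(offset)
    targets = {}
    pairs = zip(EXTLINK["languages"].split(), EXTLINK["locators"].split())
    for language, topic_id in pairs:
        url, topic_names = TOPICS[topic_id]
        if name in topic_names:
            # A topic without a URL refers to the document named by BASE.
            document = urljoin(BASE, url) if url else BASE
            targets[language] = document + "#" + name
    return targets


RESULT = translations(130)
```

Selecting a point near offset 130 resolves to the SGML anchor, and the returned table offers the same named point in the English, Spanish, French and German files.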

Note: If this basic approach meets the approval of the WINTER discussion group I will expand this paper to show how datalocs can be used to identify individual words and phrases, and how elements not assigned identifiers or names could be identified using location ladders.