Version 1.0, 6th December 1995 © Martin Bryan, The SGML Centre
This paper has been written in response to Manuel Tomas Carrasco Benitez's proposal to the WINTER e-mail discussion group for the development of a mechanism to link translations of documents available on the World Wide Web. It suggests how the Hypermedia/Time-based Structuring Language (HyTime) defined in ISO/IEC 10744 could be used to manage the relationships between a source document and its translations.
This proposal will take into account the extensions to HTML currently being proposed by the HTML Working Group at the IETF. In particular it will refer to the proposal to allow an ID attribute to be associated with any element.
The elements that can be used to link documents together in HTML are the <LINK>
element, used in the header to identify meta-data links, and the <A>
(anchor) element, used in the text to identify actual links. The current
definition for <A> given in RFC1866 is:
5.7.3. Anchor: A

The <A> element indicates a hyperlink anchor (see 7, "Hyperlinks"). At least one of the NAME and HREF attributes should be present. Attributes of the <A> element:

HREF
    gives the URI of the head anchor of a hyperlink.

NAME
    gives the name of the anchor, and makes it available as a head of a hyperlink.

TITLE
    suggests a title for the destination resource -- advisory only. The TITLE attribute may be used:
    * for display prior to accessing the destination resource, for example, as a margin note or on a small box while the mouse is over the anchor, or while the document is being loaded;
    * for resources that do not include a title, such as graphics, plain text and Gopher menus, for use as a window title.

REL
    The REL attribute gives the relationship(s) described by the hyperlink. The value is a whitespace separated list of relationship names. The semantics of link relationships are not specified in this document.

REV
    same as the REL attribute, but the semantics of the relationship are in the reverse direction. A link from A to B with REL="X" expresses the same relationship as a link from B to A with REV="X". An anchor may have both REL and REV attributes.

URN
    specifies a preferred, more persistent identifier for the head anchor of the hyperlink. The syntax and semantics of the URN attribute are not yet specified.

METHODS
    specifies methods to be used in accessing the destination, as a whitespace-separated list of names. The set of applicable names is a function of the scheme of the URI in the HREF attribute. For similar reasons as for the TITLE attribute, it may be useful to include the information in advance in the link. For example, the HTML user agent may chose a different rendering as a function of the methods allowed; for example, something that is searchable may get a different icon.
The definition for <LINK> is:
5.2.4. Link: LINK The <LINK> element represents a hyperlink (see 7, "Hyperlinks"). Any number of LINK elements may occur in the <HEAD> element of an HTML document. It has the same attributes as the <A> element (see 5.7.3, "Anchor: A"). The <LINK> element is typically used to indicate authorship, related indexes and glossaries, older or more recent versions, document hierarchy, associated resources such as style sheets, etc.
These definitions are further qualified by the following descriptions of HTML hyperlinks:
7. Hyperlinks

In addition to general purpose elements such as paragraphs and lists, HTML documents can express hyperlinks. An HTML user agent allows the user to navigate these hyperlinks.

A hyperlink is a relationship between two anchors, called the head and the tail of the hyperlink[DEXTER]. Anchors are identified by an anchor address: an absolute Uniform Resource Identifier (URI), optionally followed by a '#' and a sequence of characters called a fragment identifier. For example:

    http://www.w3.org/hypertext/WWW/TheProject.html
    http://www.w3.org/hypertext/WWW/TheProject.html#z31

In an anchor address, the URI refers to a resource; it may be used in a variety of information retrieval protocols to obtain an entity that represents the resource, such as an HTML document. The fragment identifier, if present, refers to some view on, or portion of the resource.

Each of the following markup constructs indicates the tail anchor of a hyperlink or set of hyperlinks:

* <A> elements with HREF present.
* <LINK> elements.
* <IMG> elements.
* <INPUT> elements with the SRC attribute present.
* <ISINDEX> elements.
* <FORM> elements with `METHOD=GET'.

These markup constructs refer to head anchors by a URI, either absolute or relative, or a fragment identifier, or both.

In the case of a relative URI, the absolute URI in the address of the head anchor is the result of combining the relative URI with a base absolute URI as in [RELURL]. The base document is taken from the document's <BASE> element, if present; else, it is determined as in [RELURL].

7.1. Accessing Resources

Once the address of the head anchor is determined, the user agent may obtain a representation of the resource.

For example, if the base URI is `http://host/x/y.html' and the document contains:

    <img src="../icons/abc.gif">

then the user agent uses the URI `http://host/icons/abc.gif' to access the resource, as in [URL].

7.2. Activation of Hyperlinks

An HTML user agent allows the user to navigate the content of the document and request activation of hyperlinks denoted by <A> elements. HTML user agents should also allow activation of <LINK> element hyperlinks.

To activate a link, the user agent obtains a representation of the resource identified in the address of the head anchor. If the representation is another HTML document, navigation may begin again with this new document.

7.4. Fragment Identifiers

Any characters following a `#' character in a hypertext address constitute a fragment identifier. In particular, an address of the form `#fragment' refers to an anchor in the same document.

The meaning of fragment identifiers depends on the media type of the representation of the anchor's resource. For `text/html' representations, it refers to the <A> element with a NAME attribute whose value is the same as the fragment identifier. The matching is case sensitive. The document should have exactly one such element. The user agent should indicate the anchor element, for example by scrolling to and/or highlighting the phrase.

For example, if the base URI is `http://host/x/y.html' and the user activated the link denoted by the following markup:

    <p>See: <a href="app1.html#bananas">appendix 1</a> for more detail on bananas.

Then the user agent accesses the resource identified by `http://host/x/app1.html'. Assuming the resource is represented using the `text/html' media type, the user agent must locate the <A> element whose NAME attribute is `bananas' and begin navigation there.
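The address resolution described in the RFC extract above can be tried out with a few lines of Python. This is only a sketch: it relies on the standard library's urllib.parse module, which follows a later URI specification than the [RELURL] rules cited by RFC1866 (the two agree for the simple cases quoted here), and the helper name resolve_anchor is my own.

```python
from typing import Tuple
from urllib.parse import urldefrag, urljoin


def resolve_anchor(base_uri: str, reference: str) -> Tuple[str, str]:
    """Return (absolute URI of the resource, fragment identifier).

    base_uri is the document's base URI (from <BASE>, if present);
    reference is the value of an HREF or SRC attribute.
    """
    absolute = urljoin(base_uri, reference)   # combine relative URI with base
    uri, fragment = urldefrag(absolute)       # split off anything after '#'
    return uri, fragment


base = "http://host/x/y.html"
# The two examples given in RFC1866 sections 7.1 and 7.4:
print(resolve_anchor(base, "../icons/abc.gif"))
print(resolve_anchor(base, "app1.html#bananas"))
# A reference of the form '#fragment' stays within the same document:
print(resolve_anchor(base, "#z31"))
```

The user agent fetches the first member of each pair and then, for text/html, looks for the <A> element whose NAME matches the second member.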
In the March 1995 draft of the proposed changes Dave Raggett wrote the following:
The Body Element and Related Elements

The BODY element

Permitted Context: HTML
Content Model: %Body.Content

Within the BODY element, you can structure text into paragraphs, and lists, as well as highlighting phrases and creating links, amongst other things. The BODY element has the following attributes, all of which are optional:

Note that the ID, LANG and CLASS attributes can be used with virtually all of the elements permitted in the document body.

ID
    An SGML identifier used as the target for hypertext links or for naming particular elements in associated style sheets. Identifiers are NAME tokens and must be unique within the scope of the current document.

LANG
    This is one of the ISO standard language abbreviations, e.g. "en.uk" for the variation of English spoken in the United Kingdom. It can be used by parsers to select language specific choices for quotation marks, ligatures and hypenation rules etc. The language attribute is composed from the two letter language code from ISO 639, optionally followed by a period and a two letter country code from ISO 3166.

CLASS
    This a space separated list of SGML NAME tokens and is used to subclass tag names. For instance, <P CLASS=STANZA.COUPLET> defines a paragraph that acts as a couplet in a stanza. By convention, the class names are interpreted hierarchically, with the most general class on the left and the most specific on the right, where classes are separated by a period. The CLASS attribute is most commonly used to attach a different style to some element, but it is recommended that where practical class names should be picked on the basis of the element's semantics as this will permit other uses, such as restricting search through documents by matching on element class names. The conventions for choosing class names are outside the scope of this specification.
Unfortunately these attributes cannot be used with the empty <LINK>
element, though they can be used with anchors (<A>) and most
other elements containing text.
Dave Raggett also proposed, in different parts of the specification, the
following extensions to the use of <LINK>:
Additional features include a static banner area for corporate logos, disclaimers and customized navigation/search controls. The LINK element can be used to provide standard toolbar/menu items for navigation, such as previous and next buttons. The NOTE element is used for admonishments such as notes, cautions or warnings, and also used for footnotes.

LINK

The LINK element indicates a relationship between the document and some other object. A document may have any number of LINK elements. The LINK element is empty (does not have a closing tag), but takes the same attributes as the anchor element. The important attributes are:

REL
    This defines the relationship defined by the link

REV
    This defines a reverse relationship. A link from document A to document B with REV=--relation-- expresses the same relationship as a link from B to A with REL=--relation--. REV=made is sometimes used to identify the document author, either the author's email address with a --mailto-- URI, or a link to the author's home page.

HREF
    This names an object using the URI notation.

Using LINK to define document specific toolbars

An important use of the LINK element is to define a toolbar of navigation buttons or an equivalent mechanism such as menu items. LINK relationship values reserved for toolbars are:

REL=Home
    The link references a home page or the top of some hierarchy.

REL=ToC
    The link references a document serving as a table of contents.

REL=Index
    The link references a document providing an index for the current document.

REL=Glossary
    The link references a document providing a glossary of terms that pertain to the current document.

REL=Copyright
    The link references a copyright statement for the current document.

REL=Up
    When the document forms part of a hierarchy, this link references the immediate parent of the current document.

REL=Next
    The link references the next document to visit in a guided tour.

REL=Previous
    The link references the previous document in a guided tour.

REL=Help
    The link references a document offering help, e.g. describing the wider context and offering further links to relevant documents. This is aimed at reorienting users who have lost their way.

REL=Bookmark
    Bookmarks are used to provide direct links to key entry points into an extended document. The TITLE attribute may be used to label the bookmark. Several bookmarks may be defined in each document, and provide a means for orienting users in extended documents.

An example of toolbar LINK elements:

    <LINK REL=Previous HREF=doc31.html>
    <LINK REL=Next HREF=doc33.html>
    <LINK REL=Bookmark TITLE="Order Form" HREF=doc56.html>

Using LINK to include a Document Banner

The LINK element can be used with REL=Banner to reference another document to be used as banner for this document. This is typically used for corporate logos, navigation aids, and other information which shouldn't be scrolled with the rest of the document. For example:

    <LINK REL=Banner HREF=banner.html>

The use of a LINK element in this way, allows a banner to be shared between several documents, with the benefit of being able to separately cache the banner. Rather than using a linked banner, you can also include the banner in the document itself, using the BANNER element.

Link to an associated Style Sheet

The LINK element can be used with REL=StyleSheet to reference a style sheet to be used to control the way the current document is rendered. For example:

    <LINK REL=StyleSheet HREF=housestyle.dsssl>

Other uses of the LINK element

Additional relationship names have been proposed, but do not form part of this specification. Servers may also allow links to be added by those who do not have the right to alter the body of a document.
Until these proposals are accepted it will not be possible to use links within banner headlines or to identify every element in a translated document. As will be seen below, HyTime location facilities could be used to identify elements that do not currently have unique identifiers (IDs) and also strings within elements for which links are required.
The existing HTML facilities do, however, allow a start to be made on
controlling translation referencing. The first problem is that of identifying
the translations of a retrieved document. If the document starts with link
statements of the following form it will be possible to identify the
translations that are available for the currently displayed document:
<BASE href="http://www.myco.org/pub/subject/en/myfile.htm">
<LINK name=author href="mailto:email@example.com" title="Author" rev=made>
<LINK name=spanish href="../sp/myfile.htm" title="Español" rel=translation>
<LINK name=french href="../fr/myfile.htm" title="Français" rel=translation>
<LINK name=german href="../de/myfile.htm" title="Deutsch" rel=translation>
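To show that a program could pick the available translations out of such a header today, here is a rough Python sketch using the standard html.parser module. The class and function names are my own invention, and rel=translation is the relationship name suggested in this paper rather than a registered value.

```python
from html.parser import HTMLParser
from typing import Dict
from urllib.parse import urljoin

HEAD = """
<BASE href="http://www.myco.org/pub/subject/en/myfile.htm">
<LINK name=author href="mailto:email@example.com" title="Author" rev=made>
<LINK name=spanish href="../sp/myfile.htm" title="Español" rel=translation>
<LINK name=french href="../fr/myfile.htm" title="Français" rel=translation>
<LINK name=german href="../de/myfile.htm" title="Deutsch" rel=translation>
"""


class TranslationLinks(HTMLParser):
    """Collects <LINK rel=translation> elements (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self.base = ""                              # HREF of <BASE>, if any
        self.translations: Dict[str, str] = {}      # title -> absolute URI

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag and attribute names for us.
        a = dict(attrs)
        if tag == "base" and a.get("href"):
            self.base = a["href"]
        elif tag == "link":
            relationships = (a.get("rel") or "").lower().split()
            if "translation" in relationships:
                label = a.get("title") or a.get("name") or ""
                self.translations[label] = urljoin(self.base, a.get("href") or "")


def find_translations(html: str) -> Dict[str, str]:
    parser = TranslationLinks()
    parser.feed(html)
    return parser.translations


for title, uri in find_translations(HEAD).items():
    print(title, uri)
```

The author link is skipped because it carries REV rather than REL; a browser menu of translations could be built from the remaining three entries.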
Unfortunately current browsers do not display link elements, so these elements will not be selectable by users until the HTML 3.0 proposals for Banners are actioned by browser developers.
Until browsers offer mechanisms for listing links in menus that users can use to move between files, the only alternative is to turn the links into anchors within the body of the document, using definitions such as:
<A name=spanish href="../sp/myfile.htm" title="Español" rel=translation>[Español]</a>
<A name=french href="../fr/myfile.htm" title="Français" rel=translation>[Français]</a>
<A name=german href="../de/myfile.htm" title="Deutsch" rel=translation>[Deutsch]</a>
Note that there is a different model for links (which are declared using the EMPTY keyword and therefore have no content or end-tag) and anchors (which must have content or, at very least, an end-tag).
When HTML 3.0 is available it will be possible to extend the above descriptions as follows:
<A name=spanish href="../sp/myfile.htm" title="Español" rel=translation lang=es class=automatic>[Español]</a>
<A name=french href="../fr/myfile.htm" title="Français" rel=translation lang=fr class=manual>[Français]</a>
<A name=german href="../de/myfile.htm" title="Deutsch" rel=translation lang=de class=semi-automatic>[Deutsch]</a>
In this case I have used class to indicate whether conversion was done automatically by a program, manually by a skilled translator or semi-automatically, using human correction to an automatic translation.
Now let us look at how we can name points within a document. At present the
only valid mechanism that will work across all browsers is through use of the
<A name=xyz> mechanism as the HTML 2.0 spec does not allow
IDs to be associated with elements. If you want to attach a name to a paragraph
you have to do something along the lines of:
<p><a name=SGML></a>The uses of SGML are legion ...
Note that I have named the paragraph here, not identified a piece of text as an anchor. The anchor in this case has no content. Whilst it would be possible to place the end-tag for the anchor at the end of the paragraph this would serve no real purpose and introduces the risk that the limited model of HTML could be broken by elements within the paragraph.
If I translate this file into French, the translated file would contain:
<p><a name=SGML></a>Les usages de SGML sont legion ...
[Excuse my lousy French!]
I don't need to rename the object as anchor names are local to the file they are in, as the above excerpt from RFC1866 on Fragment Identifiers makes clear.
Note: These HTML links are not really "fragment identifiers" as anchors do not identify element sets. HTML anchors can at most identify a set of characters within a single element. Alternatively, as shown here, they can identify a spot in the document.
Links such as
<a href="#SGML"> will take you
to the point in the current document that has the name SGML assigned to an
anchor. All you need to do to move from one translation to another is to switch
from the BASE definition of the document you are working in to that of the
translation listed in the LINK statements that is appropriate to your request. A
mechanism for doing this would be very easy to define in the HTML 3.0 specification.
Where document structure changes during translation the translator must determine which is the most appropriate place in the translation to place the anchor. (There is a distinct advantage in the use of empty anchors in this context as they are much easier to reposition than anchors that are placed around text, as the latter may need to be mingled with other text when translated, as anyone who has struggled with maintaining links in English/German translations will tell you! In fact this makes a very good argument for using anchors rather than IDs to identify points in the document that are linked within translations.)
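The switch from one translation to another at the same named point can be sketched as follows. This is an illustration only: the function name switch_translation is hypothetical, and it assumes the translator has kept the same anchor names in every translation, as recommended above.

```python
from urllib.parse import urldefrag, urljoin


def switch_translation(current_address: str, translation_href: str) -> str:
    """Return the address of the same named point in a translation.

    current_address  -- address being displayed, possibly with '#name'
    translation_href -- HREF taken from the matching LINK statement
    """
    document, fragment = urldefrag(current_address)   # drop and remember '#name'
    target = urljoin(document, translation_href)      # resolve against current BASE
    return f"{target}#{fragment}" if fragment else target


here = "http://www.myco.org/pub/subject/en/myfile.htm#SGML"
print(switch_translation(here, "../fr/myfile.htm"))
```

Because anchor names are local to each file, nothing else needs to change when the reader moves between the English and French versions.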
Note: The following material, which is probably more detailed than is needed at present but is included here for completeness' sake, is extracted from a book I am in the process of writing on HyTime. It is copyrighted and must not be used outside of the discussions within the WINTER discussion group.
Within an SGML document the basic method of addressing an element is by
assigning a unique identifier (ID) to the element and then making a reference
to this identifier using an attribute whose declared value is either an id
reference value (IDREF) or an id reference list (IDREFS).
Each unique identifier must be a valid SGML name, beginning with a letter, or
one of the alternative name start characters defined in the SGML declaration,
optionally followed by one or more letters, digits, or characters declared in
the SGML declaration to be valid name characters.
Note: URLs are not valid SGML names, and neither are fragment identifiers by default, as # is not a valid name character in HTML. The name used in an SGML IDREF is identical to that used in the ID being pointed to. All IDREFs are presumed to be local, so there is no need to add a # to distinguish local name references as HTML currently requires.
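The naming rule can be checked mechanically. The sketch below assumes the name characters of SGML's reference concrete syntax (a letter first, then letters, digits, periods and hyphens) and ignores the name length limit set in the SGML declaration; a document that declares other name characters would need a different pattern. The function name is my own.

```python
import re

# Assumed name characters: reference concrete syntax only.
_SGML_NAME = re.compile(r"[A-Za-z][A-Za-z0-9.\-]*")


def is_sgml_name(candidate: str) -> bool:
    """True if candidate could serve as an ID or IDREF value."""
    return _SGML_NAME.fullmatch(candidate) is not None


for value in ["SGML", "p-12", "#SGML", "http://www.w3.org/", "12p"]:
    print(value, is_sgml_name(value))
```

The first two values pass; the fragment reference, the URL and the name starting with a digit are all rejected.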
SGML's basic object addressing method has a number of limitations. The principal limitation is that the identified object must form part of the document, or subdocument, from which it is referenced. In addition, unique identifiers must be assigned to a single element's start-tag. This prevents an identifier from selecting more than one point in a document, though references can be made to more than one unique identifier from a reference point using an id reference list. A further restriction is that, because unique identifiers are defined as attributes of elements, they cannot be assigned to entity references, or to significant data strings.
HyTime's location address module allows SGML unique identifiers to be assigned to parts of a document which do not otherwise have identifiers. HyTime allows unique identifiers to be assigned to:
There are three main types of HyTime location address:
Name space locations are used to point to objects that have been assigned a name. Such objects include:
The named location address (nameloc) element
type architectural form is used to define elements that can identify locations
by reference to a name in a specified name space. Its meta-DTD definition is:
<!element nameloc
    -- Assigns a local ID to one or more named objects --
    -- Constraint: name list derived from content elements. --
    - O (nmlist|nmquery)* >
<!attlist nameloc
    HyTime  NAME  nameloc
    id      ID    #REQUIRED
    -- multloc attributes --
    -- spanloc attributes --
>
Each named location address associates a unique identifier (id)
with one or more objects that are identified by name in a constructed name
list. This name list is constructed by concatenating the name lists
supplied by the name list specifications (nmlist) and name list
queries (nmquery) that form the contents of the named location
address.
The name list specification (nmlist) element
type architectural form is used to define elements that will contain the lists
of names used to form a constructed name list. The meta-DTD definition for this
architectural form is:
<!element nmlist
    -- List of local or external ID or entity names --
    - O (#PCDATA) -- lextype(NAMES) -->
<!attlist nmlist
    HyTime    NAME  nmlist
    nametype  -- Entity names or IDs of elements --
              (entity|element|unified)  entity
    obnames   -- Objects treated as names? --
              (obnames|nobnames)  nobnames
    docorsub  -- SGML document or subdoc whose prolog declares entities
                 or elements named in name list; initially the document
                 in which this occurs. --
              ENTITY  #IMPLIED  --Default: no change--
    dtdorlpd  -- Active DTD or LPDs for SGML document entity. Base DTD if
                 unspecified and docorsub changed; no change if unspecified
                 and docorsub unchanged. --
              -- lextype(DTD|LPD+) --
              NAMES  #IMPLIED  -- Default: unchanged or base --
>
The name type (nametype) attribute is used
to identify whether the name is a unique identifier for an element or the name
of an entity.
The objects treated as names (obnames)
attribute indicates whether or not the objects pointed to are to be treated as
names. If the default value, nobnames, is changed to obnames,
the contents of each of the named objects are treated as a name specification
list which is to be added to the constructed name list of the named location
address in which the object was named.
The SGML document or subdocument (docorsub)
attribute identifies the SGML document/subdocument entity used to declare the
document/subdocument in which the names listed are defined.
If multiple DTDs and SGML LINKTYPES are being supported the active
DTD or LPDs (
dtdorlpd) attribute can be used to identify
the tree the names belong to.
The names in a name list specification are not checked for validity as a unique identifier or entity name until the named location address is referenced in a manner that requires access to the addressed object.
To understand the effect of the above rules, consider the following examples, which are constructed using the following elements:
<!ELEMENT topic - O (anchors+) >
<!ATTLIST topic
    HyTime  NAME          #FIXED nameloc
    id      ID            #REQUIRED
    set     (set|notset)  set >
<!ELEMENT anchors - O (#PCDATA) --lextype(NAMES)-- >
<!ATTLIST anchors
    HyTime    NAME                         #FIXED nmlist
    nametype  (entity|element)             element
    obnames   (obnames|nobnames)           nobnames
    docorsub  CDATA --lextype(entity|URL)--  #IMPLIED >
Using the default values for the non-compulsory attributes the following named location addresses could be specified:
<topic id=fleas><anchors>p-12 p-34 p-35</topic>
<topic id=dogs><anchors>p-3 p-24 p-35</topic>
Each of the names in the anchors name list specifications
points to an element defined in the currently active document. Each of
the topic named location address elements identifies three paragraphs.
Note that the paragraph called p-35 is included under both the
dogs and fleas topics.
The following named location specification can be used to concatenate the two lists:
<topic id=allergies><anchors obnames>fleas dogs</topic>
In this example the name list specification is identified as pointing at the
unique identifiers of one or more object names. The elements pointed to are the
two named topic elements used in the preceding example. Because the default
value assigned to the set attribute is set,
duplicated entries will be removed, so the constructed name list known as
allergies will contain the following entries:
p-3 p-12 p-24 p-34 p-35
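The way the constructed name list is built up can be mimicked in a few lines of Python. This is a toy model rather than a HyTime engine: the NAMELOCS table simply transcribes the three topic elements above, and sorting by paragraph number is my own stand-in for whatever ordering a real system would apply once duplicates are removed.

```python
from typing import Dict, List, Tuple

# id -> (obnames flag, names given in the <anchors> name list)
NAMELOCS: Dict[str, Tuple[bool, List[str]]] = {
    "fleas": (False, ["p-12", "p-34", "p-35"]),
    "dogs": (False, ["p-3", "p-24", "p-35"]),
    "allergies": (True, ["fleas", "dogs"]),
}


def constructed_name_list(loc_id: str, as_set: bool = True) -> List[str]:
    """Expand a named location address into its list of element IDs."""
    obnames, names = NAMELOCS[loc_id]
    result: List[str] = []
    for name in names:
        if obnames:
            # The named object is itself a name list: splice in its contents.
            result.extend(constructed_name_list(name, as_set=False))
        else:
            result.append(name)
    if as_set:
        # set=set: drop duplicates. Ordering by paragraph number is an
        # assumption made so that the output matches the list shown above.
        result = sorted(set(result), key=lambda n: int(n.split("-")[1]))
    return result


print(" ".join(constructed_name_list("allergies")))
print(" ".join(constructed_name_list("allergies", as_set=False)))
```

With as_set=False (the notset case) p-35 appears twice, once for each topic that names it.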
HyTime coordinate locations can be used to identify the following types of objects that can be addressed as HyTime "quanta":
nodes in a document tree, located either by a tree location address (treeloc) or a depth-first path location address (pathloc), or through their relationship with another node (relloc)
Each coordinate location addresses a location with respect to some other location, known as a location source. The combination of a location source and a coordinate location is known as a location view. Different location views can use the same location source. The location source for a coordinate location can be another location address element. In such cases the combination of locations is said to form a location ladder.
Note: Location sources allow relative addressing.
The location source (
locsrc) attribute list
architectural form is used by all location addresses that can form part of a
location view to identify the element(s) that the address is related to. The
meta-DTD definition for this architectural form is:
<!attlist locsrc
    -- use: dataloc treeloc pathloc relloc listloc proploc fcsloc --
    locsrc  -- location source --
            -- Constraint: No HyTime reftype constraints --
            IDREFS  #CURRENT  --Default: previous specified-->
The location source (locsrc) attribute
identifies the source of the coordinate location by reference to a unique
identifier which has been assigned to either an element in the current document
or a location address that identifies the required source element. As such it
has the same effect as a named location address, except that it can only
identify elements in the current document.
If no value is entered the location is taken to be that which applied the
last time the attribute was specified (
#CURRENT). For the first
occurrence of an element with this attribute a value must be specified.
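The effect of the #CURRENT default can be illustrated with a small sketch. It models only the defaulting rule just described; the function name and the identifier timetable are hypothetical, and None stands for an element on which the attribute was not entered.

```python
from typing import Iterable, List, Optional


def resolve_current(values: Iterable[Optional[str]]) -> List[str]:
    """Apply #CURRENT defaulting to successive locsrc attribute values."""
    resolved: List[str] = []
    current: Optional[str] = None
    for value in values:
        if value is not None:
            current = value                 # an explicit value becomes current
        elif current is None:
            # No earlier value exists to inherit from.
            raise ValueError("first occurrence must specify a value")
        resolved.append(current)
    return resolved


print(resolve_current(["oriental", None, "timetable", None]))
```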
Data location addresses can be used to identify:
The data location address (dataloc) element
type architectural form must be used to define elements that identify locations
within a data stream. The meta-DTD definition for this architectural form is:
<!element dataloc
    -- Locates string and token data objects in data --
    -- Constraint: dimlists are concatenated into one list --
    - O (dimlist*) >
<!attlist dataloc
    HyTime   NAME  dataloc
    id       ID    #REQUIRED
    quantum  -- Data quantum: bit combination or token --
             (str|norm|word|name|sint|date|time|utc)  str
    catsrc   -- Concatenate multiple source objects into a single
                object before applying dimlist to it --
             (catsrc|catsrcsp|nocatsrc)  nocatsrc
    catres   -- Concatenate results of applying dimlist, in the
                order of the dimlist --
             (catres|catressp|nocatres)  nocatres
    -- overrun attributes --
    -- locsrc attributes --
    -- multloc attributes --
    -- spanloc attributes --
>
Each dimension list (
dimlist) should contain two markers, one
identifying where the data location starts and the other where it ends. Where
more than one dimension list is specified, or a dimension list contains more
than one dimension specification, each identified part of the addressable range
will be concatenated to form a single data string or token list.
The attributes associated with the data location address architectural form
include a HyTime attribute that identifies the element as
conforming to the dataloc architectural form and a compulsory
unique identifier (id). The data quantum (quantum)
attribute identifies the type of data being located. Permitted values for this
attribute are:
str to indicate that the dimension list identifies a sequence of bit combinations that form a string
norm to indicate that the identified data is part of a normalized (tokenized) text string
word to indicate that each token in the tokenized string consists solely of valid SGML name characters
name to indicate that the located data forms a single SGML name
sint to indicate that the located data is a signed integer consisting of digits optionally preceded by a plus sign or a hyphen
date to indicate that the located data is a valid Coordinated Universal Time (UTC) date in yyyy-mm-dd format
time to indicate that the located data is a valid UTC time in hh:mm:ss.decimal format
utc to indicate that the located data forms a valid UTC date/time, in yyyy-mm-dd hh:mm:ss.decimal format.
Except when the value of this attribute is set to str, the bit
combinations in the addressable range will be treated as normalized tokens. This
means that any leading or trailing white space and separators will be removed,
with multiple sequences of separators being replaced by a single space. Any
characters which do not conform to the model for the specified data type will be
replaced by spaces.
The concatenate source (catsrc) attribute
indicates whether or not data segments identified by a multiple location source
are to be concatenated into a single addressable string or token list. If the
attribute is set to catsrcsp the data associated with each element
identified by the locsrc attribute will be concatenated with a
single space between each string. No space will be inserted between the
concatenated strings/token lists if the value is set to catsrc. If
the attribute is left at the default nocatsrc value the data
associated with each source location will be processed individually, in the
order specified in locsrc, against the same set of dimension
lists.
The concatenate result (catres) attribute
indicates whether or not strings located in different location sources are to be
concatenated to form a single result string or token list. If the value is set
to catressp the string identified in each of the location sources
identified by the locsrc attribute will be separated from its
neighbours by a space when the results are concatenated into a single string. If
catres is used as the attribute value no space will be inserted
between identified strings/tokens, but they will be concatenated. The default
value, nocatres, indicates that the data located from each
location source is to be treated as a separate result string/token list.
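The interplay of the two concatenation attributes is easier to follow in code. The sketch below is a simplified model under several assumptions of mine: it handles only the str quantum, takes a single start/length pair in place of a full dimension list, and counts positions from 1. The function name locate is hypothetical.

```python
from typing import List


def locate(sources: List[str], start: int, length: int,
           catsrc: str = "nocatsrc", catres: str = "nocatres") -> List[str]:
    """Apply one (start, length) dimension to one or more source strings."""
    # Step 1: optionally merge the sources before the dimension is applied.
    if catsrc == "catsrc":
        sources = ["".join(sources)]
    elif catsrc == "catsrcsp":
        sources = [" ".join(sources)]
    # Step 2: apply the same dimension to each remaining source in order.
    results = [s[start - 1:start - 1 + length] for s in sources]
    # Step 3: optionally merge the located results.
    if catres == "catres":
        return ["".join(results)]
    if catres == "catressp":
        return [" ".join(results)]
    return results


sources = ["Orient Express", "Gare du Nord"]
print(locate(sources, 1, 4))                      # separate results
print(locate(sources, 1, 4, catres="catressp"))   # results joined by a space
print(locate(sources, 8, 12, catsrc="catsrcsp"))  # sources joined first
```

Note that concatenating the sources changes what a given position means, whereas concatenating the results only changes how the located strings are packaged.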
The following SGML-encoded data will be used to illustrate the use of data location addresses:
<p id=oriental>The <train>Orient Express</train> will leave·
<station>Gare du Nord</station> at <time>11:00:00.00</time> on·
<date>1999-12-23</date>.</p>
The elements used to identify data within this structure have the following declarations:
<!ELEMENT string - O (segment+) >
<!ATTLIST string
    HyTime   NAME    #FIXED dataloc
    id       ID      #REQUIRED
    locsrc   IDREFS  #CURRENT
    quantum  (str|norm|word|name|sint|date|time|utc)  str
    catsrc   (catsrc|catsrcsp|nocatsrc)               nocatsrc
    catres   (catres|catressp|nocatres)               catres >
<!ELEMENT segment O O (starts, (length|ends))+ >
<!ATTLIST segment
    HyTime   NAME    #FIXED dimspec >
<!ELEMENT (starts|length|ends) O O (#PCDATA) >
<!ATTLIST (starts|length|ends)
    HyTime   NAME    #FIXED marklist >
These elements can be used to specify the following simple data location address:
<string id=action locsrc=oriental><segment><starts>20<length>10</string>
The locsrc attribute tells us that the data to be located is
part of the object which has been assigned the identifier oriental
within the current document. The data segments of this element, together with
those of all its descendants, will be concatenated to identify the addressable
range of the located object. Figure 6.2 shows the 72 characters that form the
identified data string.
NOTE: For this example
· is used to identify a new line
(record end) code.
The Orient Express will leave·Gare du Nord at 11:00:00.00 on·1999-12-23.
|________|_________|_________|_________|_________|_________|_________|__
1        10        20        30        40        50        60        70
Figure 6.2: Data string for object whose unique identifier is oriental
In this case the dimension list (segment) starts at quantum
20 and identifies a string that is 10
characters long. In other words it identifies the string "will leave".
Another way to identify the two words that make up this string would be to enter the following address specification:
<string id=does locsrc=oriental quantum=word>
<segment><starts>4<length>2</string>
Figure 6.3 shows the words that form the data within the element whose
unique identifier is
oriental. Note that the time is split into
three words as colons are not recognized as valid within the standard SGML name
character set, and so are replaced by spaces during tokenization. Conversely the
date, including the following period, is treated as a single word as both
hyphens and periods are normally valid SGML name characters.
The Orient Express will leave Gare du Nord at 11 00 00.00 on 1999-12-23.
|___|______|_______|____|_____|____|__|____|__|__|__|_____|__|___________|
1   2      3       4    5     6    7  8    9  10 11 12    13 14
Figure 6.3: Data string counted using word quanta
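The word count of Figure 6.3 can be reproduced with a short Python sketch. It assumes that name characters are the letters, digits, period and hyphen, and that every other character (including the record end) acts as a separator; the function names words and locate_words are hypothetical.

```python
import re

# Data string from Figure 6.2; "\n" stands in for the record end code.
DATA = "The Orient Express will leave\nGare du Nord at 11:00:00.00 on\n1999-12-23."

def words(data):
    """Split data into words, treating any non-name character as a separator."""
    # Assumed name characters: letters, digits, period and hyphen.
    return re.sub(r"[^A-Za-z0-9.-]", " ", data).split()

def locate_words(data, starts, length):
    """Return `length` words beginning at 1-based word position `starts`."""
    tokens = words(data)
    return tokens[starts - 1:starts - 1 + length]

print(len(words(DATA)))             # 14
print(words(DATA)[9:12])            # ['11', '00', '00.00']
print(locate_words(DATA, 4, 2))     # ['will', 'leave']
```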
Structured documents such as SGML documents can be viewed as a hierarchically structured tree consisting of a number of nodes. For an SGML document the following node types can be identified:
#PCDATA token within an element's content model (or forms a data tag pattern that occurs between two tags where the end-tag of the preceding element has been entered in the form of a data tag
Trees can be viewed in either a width-first or a depth-first manner. The width-first approach looks at each level in the tree as a separate list of nodes from which members can be selected. The depth-first approach considers each path down the tree as a separate list of nodes.
Both of these approaches to the tree structure can be used to define a set of measurement domains whose addressable range is determined by the number of nodes in a given node list. Nodes in a node list do not need to be unique. For example, the same data entity could occur at two points at a given level. Hence node lists do not form 'sets' in the mathematical sense. Each node list is 'ordered'; the order in which the nodes are listed is always preserved during processing.
Each node in a node list can be considered as a tree of one or more nodes. This allows location ladders to be built that find one node in a tree and then locate other nodes with respect to the located node.
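The two views of a tree described above can be sketched in Python as follows. The tree and its element names are invented for illustration, and the functions levels and paths are hypothetical helpers rather than HyTime constructs; each returns ordered node lists from which members could be selected by position.

```python
# A tree as (name, children) tuples; the element names are invented.
TREE = ("doc", [
    ("head", [("title", [])]),
    ("body", [("h1", []), ("p", [])]),
])

def levels(root):
    """Width-first view: one ordered node list per level of the tree."""
    result, current = [], [root]
    while current:
        result.append([name for name, _ in current])
        # The next level is every child of every node on this level, in order.
        current = [child for _, children in current for child in children]
    return result

def paths(node, prefix=()):
    """Depth-first view: one ordered node list per path from root to leaf."""
    name, children = node
    here = prefix + (name,)
    if not children:
        return [list(here)]
    return [path for child in children for path in paths(child, here)]

print(levels(TREE))
# [['doc'], ['head', 'body'], ['title', 'h1', 'p']]
print(paths(TREE))
# [['doc', 'head', 'title'], ['doc', 'body', 'h1'], ['doc', 'body', 'p']]
```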
NOTE: This paper will not concern itself with the use of tree or path locators.
If support for the relloc location address module option has
been declared the relative location address (relloc)
element type architectural form can be used to define elements that will be used
to select nodes based on their relationships to nodes named by another location
address or a unique identifier.
The meta-DTD for the relative location address element type architectural form is defined as:
<!element relloc
  -- Locates nodes in a tree relative to a starting node --
  -- Constraint: dimlists are merged into a single list that resolves
     to pairs of dimspecs --
  - O (dimlist*) >
<!attlist relloc
  HyTime    NAME    relloc
  id        ID      #REQUIRED
  root      -- Root of tree --
            -- Constraint: 1 per starting node or 1 for all --
            IDREFS  #IMPLIED  -- Default: document root of each starting node --
  relation  -- Relationship to starting node --
            -- Constraint: des requires "pathloc" option --
            (anc|esib|ysib|des|parent|children)  parent
  -- overrun attributes --
  -- locsrc attributes --
  -- multloc attributes --
  -- spanloc attributes --
>
In addition to the compulsory attributes, relative locations have two unique
attributes. The root of tree (root) attribute is used to specify the object which
is to form the starting node of the relationship. Where the location source
(locsrc) identifies a multiple location, all the starting nodes of which occur in
the same tree, a single starting node can be specified for all sources. Otherwise a
separate starting node has to be specified for each tree identified by the
location source attribute. If no value is entered the document element of each
source will be taken as the starting point for the relationship.
The relationship to starting node (relation)
attribute is used to specify the part of the located tree that is to form the
addressable range from which nodes can be selected. The available options are:
anc - all ancestors, between the root node and the parent of the starting node
esib - all elder (left-hand) siblings of the starting node, i.e. the earlier children of its parent
ysib - all younger (right-hand) siblings of the starting node, i.e. the later children of its parent
des - all descendants in the subtree whose root is the starting node
parent - parent node of the starting node
children - children of the starting node.
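The six relation values can be illustrated with a small Python sketch. This is not an implementation of relloc: the Node class, the function related and the sample tree are hypothetical, esib and ysib are read here as the elder and younger siblings of the starting node itself, ancestors are assumed to be listed root first, and des is assumed to exclude the starting node.

```python
class Node:
    """A tree node that records its parent when children are attached."""
    def __init__(self, name, children=()):
        self.name, self.parent, self.children = name, None, list(children)
        for child in self.children:
            child.parent = self

def related(start, relation):
    """Return the ordered node list selected by one relation value."""
    if relation == "parent":
        return [start.parent] if start.parent is not None else []
    if relation == "children":
        return list(start.children)
    if relation == "anc":
        found, node = [], start.parent
        while node is not None:
            found.append(node)
            node = node.parent
        return found[::-1]                  # assumed order: root first, parent last
    if relation in ("esib", "ysib"):
        if start.parent is None:
            return []
        siblings = start.parent.children
        i = siblings.index(start)
        return siblings[:i] if relation == "esib" else siblings[i + 1:]
    if relation == "des":
        found = []
        for child in start.children:        # document order, starting node excluded
            found.append(child)
            found.extend(related(child, "des"))
        return found
    raise ValueError("unknown relation: " + relation)

def names(nodes):
    return [node.name for node in nodes]

# Invented sample tree: doc -> (head, body -> (p1, p2, p3))
p1, p2, p3 = Node("p1"), Node("p2"), Node("p3")
body = Node("body", [p1, p2, p3])
doc = Node("doc", [Node("head"), body])

print(names(related(p2, "anc")))    # ['doc', 'body']
print(names(related(p2, "esib")))   # ['p1']
print(names(related(p2, "ysib")))   # ['p3']
print(names(related(doc, "des")))   # ['head', 'body', 'p1', 'p2', 'p3']
```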
HyTime provides access to three types of semantic address:
NOTE: This paper will not concern itself with the definition of semantic locations.
HyTime provides two types of links:
Contextual links (clink) are embedded in the source document at the point from which the reference is being made
Independent links (ilink) are stored independently of the data they reference, either in one of the documents being pointed to or in a completely separate document.
The meta-DTD definition for a contextual link is:
<!-- Contextual Link -->
<!element clink  -- Contextual link --
  - O (%HyBrid;)* >
<!attlist clink
  HyTime   NAME   clink
  id       ID     #IMPLIED  -- Default: none --
  linkend  -- Link end --
           -- Constraint: No HyTime reftype constraints, but application
              designers can constrain element types with reftype attribute --
           IDREF  #REQUIRED
>
In addition to the standard attributes, a single reference to a unique
identifier (IDREF) must be entered as the value of the linkend attribute. The
located ID could belong to a HyTime locator or an element in the current file.
When a HyTime locator has been used the IDREF can point to a location in a
document other than the current one, as long as the locator element that
identifies it is in the same document as the contextual link.
NOTE: The HTML <A> element can be seen as a less constrained version of this
model, with the name attribute taking the part of the id attribute and href
taking the place of the linkend attribute. As both attributes are defined
using the CDATA keyword they are less constrained than their HyTime
equivalents, which must follow SGML's naming rules. Those rules would not
accommodate URLs without modification of the default name set (which is not a
problem, as HTML already redefines most of the other defaults).
HyTime independent links must conform to the following meta-DTD definition:
<!-- Independent Link -->
<!element ilink  -- Independent link --
  - O (%HyBrid;)* >
<!attlist ilink
  HyTime    NAME    ilink
  id        ID      #IMPLIED  -- Default: none --
  anchrole  -- Anchor roles --
            -- Constraint: one per anchor --
            -- lextype((NAME, s+, (RNI, "AGG")?), (s+, NAME, s+, (RNI, "AGG")?)+) --
            CDATA   #FIXED in-DTD
  linkends  -- Link ends --
            -- Constraint: one anchor per anchor role. If one is omitted,
               ilink element is first anchor. --
            -- Constraint: No HyTime reftype constraints, but application
               designers can constrain element types with reftype attribute --
            IDREFS  #REQUIRED
  extra     -- External access traversal rule --
            -- Constraint: one/anchor or one for all --
            -- lextype(("E"|"I"|"A"|"N"|"P"), (s+, ("E"|"I"|"A"|"N"|"P"))*) --
            NAMES   #IMPLIED  -- Default: no traversal --
  intra     -- Internal access traversal rule --
            -- Constraint: one/anchor or one for all --
            -- lextype(("E"|"I"|"A"|"N"|"P"), (s+, ("E"|"I"|"A"|"N"|"P"))*) --
            NAMES   #IMPLIED  -- Default: no traversal --
  endterms  -- Link end term information --
            -- Constraint: one/anchor or one for all --
            -- reftype(HyBrid) --
            IDREFS  #IMPLIED  -- Default: none --
  aggtrav   -- Traversal of agglink anchors: agg or members --
            -- Constraint: one/anchor or one for all --
            -- lextype(("AGG"|"MEM"|"COR"), (s+, ("AGG"|"MEM"|"COR"))*) --
            NAMES   agg
>
In this case the link points to two or more named location specifications
listed in the linkends attribute. Each of these locations can be
assigned a role by the anchor role (anchrole) attribute. Each role
can have a method associated with it through the endterms
attribute. Rules for traversing to each of the anchors can be controlled using
the intra and extra attributes, while the aggregate traversal
(aggtrav) attribute determines whether the links are
traversed in parallel, separately or under user control.
To see how HyTime independent links work consider the following example of what might be possible if HyTime locators and independent links were permitted in the header of an HTML document. For this example I will use a special element defined as follows:
<!ELEMENT extlink  -- External link --
  - O (#PCDATA) >
<!ATTLIST extlink
  HyTime     NAME    ilink
  HyNames    CDATA   "anchrole languages linkends locators endterms show-as"
  id         ID      #IMPLIED   -- Default: none --
  languages  CDATA   #REQUIRED  -- one per language --
  locators   IDREFS  #REQUIRED  -- one per language --
  show-as    IDREFS  #REQUIRED  -- one per language --
  extra      NAMES   #IMPLIED   -- Default: None --
  intra      NAMES   "A"
  aggtrav    NAMES   agg
>
In this example three of the attributes have been renamed using the HyNames
attribute. This allows us to use languages as the name of the anchrole
attribute, locators as the replacement for linkends, and show-as in place of
endterms.
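The renaming performed by HyNames can be pictured as a simple lookup table. The Python sketch below assumes that the attribute value is a list of pairs, each giving the architectural name followed by the name used in the application's own DTD; the function parse_hynames is hypothetical, and the value is written with languages to match the declared attribute name.

```python
def parse_hynames(value):
    """Map each local attribute name to the architectural name it replaces."""
    tokens = value.split()
    if len(tokens) % 2:
        raise ValueError("HyNames must contain (architectural, local) pairs")
    # Tokens alternate: architectural name first, then the local name.
    return {local: arch for arch, local in zip(tokens[0::2], tokens[1::2])}

renames = parse_hynames("anchrole languages linkends locators endterms show-as")
print(renames)
# {'languages': 'anchrole', 'locators': 'linkends', 'show-as': 'endterms'}
```

A processor could consult such a table to treat the locators attribute of an extlink element as the linkends attribute of an independent link.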
Each use of the element must have three attributes defining which languages are to be available, which locators are used to identify the relevant point in each language, and what form should be used to identify the translations.
For this example no external (
extra) traversal through the
link is permitted by default, but all links can be traversed from within the
link itself. Aggregate (
agg) traversal is used if a multiple
location is pointed to by one of the locators so that users can control which
link to go to by selecting from a menu, etc.
For this example we will also use the topic and anchor elements that were
defined earlier, but to make the example more HTML friendly we will rename one
of the attributes, docorsub, as URL using the HyNames attribute, and redefine
the lexical model for the element so that it refers to HTML-NAMES (i.e. a name
attribute of an HTML anchor) rather than an SGML unique identifier, resulting in
the following definition:
<!ELEMENT anchors  - O (#PCDATA)  -- lextype(HTML-NAMES) -- >
<!ATTLIST anchors
  HyTime    NAME                #FIXED nmlist
  HyNames   CDATA               #FIXED "docorsub URL"
  nametype  (entity|element)    element
  obnames   (obnames|nobnames)  nobnames
  URL       CDATA  -- lextype(entity|URL) --  #IMPLIED
>
The external link (extlink) elements could then be used as follows within an HTML header:
<html><head><title>Linked Translation Set</title>
<BASE href="http://www.myco.org/pub/subject/en/myfile.htm">
<LINK name=author href="mailto:firstname.lastname@example.org" title="Author" rev=made>
<LINK name=spanish href="../sp/myfile.htm" title="Español" rel=translation>
<LINK name=french href="../fr/myfile.htm" title="Français" rel=translation>
<LINK name=german href="../de/myfile.htm" title="Deutsch" rel=translation>
<topic id=SGML-en> <anchors>SGML HTML</topic>
<topic id=SGML-sp> <anchors URL="../sp/myfile.htm">SGML HTML</topic>
<topic id=SGML-fr> <anchors URL="../fr/myfile.htm">SGML HTML</topic>
<topic id=SGML-de> <anchors URL="../de/myfile.htm">SGML HTML</topic>
<extlink id=connect-sgml languages="EN SP FR DE"
         locators="SGML-en SGML-sp SGML-fr SGML-de"
         show-as="hot-spot SP-flag FR-flag DE-flag">
...
</head>
<body>
<h1>Linking together the World Wide Web</h1>
<p><a name=HTML></a> HTML is ....
<p><a name=SGML></a> SGML is ...
</body></html>
Note that the anchors definition for the topic related to the local file
(SGML-en) has no URL attribute. This is because the default value for this
attribute is the local document or the document identified by the <BASE>
element.
Using this form of coding it will be possible to provide a facility whereby, when you select a point in the document, the system will look for the nearest named element and then look in the header to identify which topics refer to that locator. Users can then either choose to go to related points in the same document, or to the similarly named points in any of the translations that have been identified as being associated with the document.
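The lookup just described might work roughly as in the following Python sketch. It is an assumption-laden illustration rather than a specification: the header of the example document is transcribed by hand into the hypothetical TOPICS and EXTLINKS structures, the function translations is invented, and targets are assumed to be formed by appending "#" and the anchor name to the URL of each translation.

```python
# Data transcribed from the example header; None means "the local document".
TOPICS = {
    "SGML-en": (None, ["SGML", "HTML"]),
    "SGML-sp": ("../sp/myfile.htm", ["SGML", "HTML"]),
    "SGML-fr": ("../fr/myfile.htm", ["SGML", "HTML"]),
    "SGML-de": ("../de/myfile.htm", ["SGML", "HTML"]),
}
EXTLINKS = [{
    "id": "connect-sgml",
    "languages": ["EN", "SP", "FR", "DE"],
    "locators": ["SGML-en", "SGML-sp", "SGML-fr", "SGML-de"],
    "show-as": ["hot-spot", "SP-flag", "FR-flag", "DE-flag"],
}]

def translations(anchor_name):
    """List (language, target, show-as) for links whose local topic names the anchor."""
    results = []
    for link in EXTLINKS:
        rows = list(zip(link["languages"], link["locators"], link["show-as"]))
        # A link applies if one of its local topics (no URL) lists the anchor.
        applies = any(TOPICS[loc][0] is None and anchor_name in TOPICS[loc][1]
                      for _, loc, _ in rows)
        if not applies:
            continue
        for lang, loc, show in rows:
            url, anchor_names = TOPICS[loc]
            if anchor_name in anchor_names:
                results.append((lang, (url or "") + "#" + anchor_name, show))
    return results

for row in translations("SGML"):
    print(row)
# ('EN', '#SGML', 'hot-spot')
# ('SP', '../sp/myfile.htm#SGML', 'SP-flag')
# ('FR', '../fr/myfile.htm#SGML', 'FR-flag')
# ('DE', '../de/myfile.htm#SGML', 'DE-flag')
```

A browser could present the returned rows as the menu from which the user selects a translation.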
Note: If this basic approach meets the approval of the Winter
discussion group I will expand this paper to show how data location addresses
could be used to identify individual words and phrases, and how elements not
assigned identifiers or names could be identified using location ladders.