<!DOCTYPE gcapaper PUBLIC "-//GCA//DTD GCAPAPER.DTD 19970212 Vers 2.0a//EN">
<gcapaper>
<front>
<title>Hyperlink semantics for standoff markup of read-only documents</title>
<author><fname>Henry S. </fname><surname>Thompson</surname>
<address>
<affil>University of Edinburgh</affil>
<subaffil>Language Technology Group,</subaffil>
<subaffil>Human Communication Research Centre,</subaffil>
<aline>2 Buccleuch Place, </aline>
<city>Edinburgh EH8 9LW, </city>
<cntry>SCOTLAND</cntry>
<phone>+44 131 650-4440</phone>
<fax>+44 131 650-4587</fax>
<email>ht@cogsci.ed.ac.uk </email>
<web>http://www.cogsci.ed.ac.uk/~ht/ </web>
</address>
<bio><para>Henry S. Thompson is a Reader in the Department of Artificial
Intelligence and the Centre for Cognitive Science at the University of
Edinburgh, where he is also a member of the Human Communication
Research Centre. He received his Ph.D. in Linguistics from the
University of California at Berkeley in 1980. His university
education was divided between Linguistics and Computer Science, in
which he holds an M.Sc. His research interests are in the area of
Natural Language and Speech processing, from both the applications and
Cognitive Science perspectives. Particular projects he has been
involved with include the creation of spoken and written language
corpora, the use of real language in linguistic theory and practice,
and the creation of tools and architectures for SGML-based processing
of language data. He is a member of the W3C SGML Working Group, and
has advised both the British government and the Commission of the
European Community on strategic planning in the areas of speech and
language technology.</para></bio></author>
<author><fname>David </fname><surname>McKelvie</surname>
<address>
<affil>University of Edinburgh</affil>
<subaffil>Language Technology Group,</subaffil>
<subaffil>Human Communication Research Centre,</subaffil>
<aline>2 Buccleuch Place, </aline>
<city>Edinburgh EH8 9LW, </city>
<cntry>SCOTLAND</cntry>
<phone>+44 131 650-4630</phone>
<fax>+44 131 650-4587</fax>
<email>David.McKelvie@cogsci.ed.ac.uk </email>
<web>http://www.cogsci.ed.ac.uk/~dmck/ </web>
</address>
<bio><para>David McKelvie is a research fellow in the Language
Technology Group of HCRC. He is an experienced programmer/researcher
with particular interests in speech recognition, language engineering,
corpus markup and user interface design. He has been involved in a
senior position on five British and EU research projects. His latest
work involves the SGML markup of large multilingual corpora and the
design of programs for the linguistic analysis of these corpora.</para></bio>
</author>
<abstract>
<para>There are at least three reasons why separating markup from the
material marked up ("standoff annotation") may be an attractive
proposition:
</para>
<para>1) The base material may be read-only and/or very large, so copying it
to introduce markup may be unacceptable;
2) The markup may involve multiple overlapping hierarchies;
3) Distribution of the base document may be controlled, but the markup
is intended to be freely available.
</para>
<para>In this paper we introduce two kinds of semantics for hyperlinks to
facilitate this type of annotation, and describe the LT NSL toolset
which supports these semantics.
</para>
<para>The two kinds of hyperlink semantics which we describe are (a)
inclusion, where one includes a sequence of SGML elements from the
base file; and (b) replacement, where one provides a replacement for
some material in the base file, incorporating everything else from the
base file unchanged.
</para>
<para>We also address the issues of indexing
large files to improve the speed of accessing SGML elements in the
base files.
</para>
</abstract>
</front>
<body>
<section><title>Introduction</title>
<para>
This is not a paper about browsers or rendering. It is a
paper about novel uses of hyperlinks to address a range of problems in
document management and corpus annotation. The canonical application
environments for the kinds of markup I will discuss are information
retrieval, message understanding, machine translation, text
summarisation---in other words, language and content oriented
applications.
</para>
<para>
This is not a paper about process architecture (that's another paper).
I assume a pipelined architecture where individual tools operate on an
SGML or XML document stream, augmenting, transforming and modifying it
step-by-step. It is a paper about document architecture: how to use
specialised link semantics to organise information across documents in
a sensible and powerful way, while hiding that distribution from
applications.
</para>
<para>
The crucial idea is that standoff annotation allows us to distribute
aspects of virtual document structure, i.e. markup, across more than
one actual document, while the pipelined architecture allows tools to
work with the resulting virtual document as if it were a single stream.
</para>
<para>
This can be seen as taking the SGML paradigm one step further:
Whereas SGML allows single documents to be stored transparently in any
desired number of entities, our proposal allows virtual documents to be
composed transparently from any desired network of component documents.
</para>
<para>
Note that the examples which follow use a simplified version of the
proposed syntax for links from the XML-LINK working draft <bibref
refloc='xml'>, which notates a span of elements with two
<acronym.grp><acronym>TEI</><expansion>Text Encoding
Initiative</></acronym.grp> extended pointer expressions separated by
two dots ('..').
</para>
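<para>
For instance, a link of the following form (the element name and the
IDs here are arbitrary, chosen only for illustration) denotes the
contiguous span of elements running from the element whose ID is x1
through the element whose ID is x9:
</para>
<para><sgml.block>
<foo xml-link='simple' href="#ID(x1)..ID(x9)"></foo>
</sgml.block></para>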
</section>
<section><title>Adding markup from a distance</title>
<para>
Consider marking sentence structure in a read-only corpus of text which is
marked-up already with tags for words and punctuation, but nothing more:
</para>
<para>
<sgml.block>
. . .
<w id='w12'>Now</w><w id='w13'>is</w><w id='w14'>the</w>
. . .
<w id='w27'>the</w><w id='w28'>party</w><c id='c4'>.</c>
</sgml.block>
</para>
<para>
With an inclusion semantics, I can mark sentences in a separate document
as follows:
</para>
<para><sgml.block>
. . .
<s xml-link='simple' href="#ID(w12)..ID(c4)"></s>
<s xml-link='simple' href="#ID(w29)..ID(c7)"></s>
. . .
</sgml.block></para>
<para>
Now, crucially (and our <acronym>LT NSL</> and <acronym>LT XML</> products already implement this
semantics <bibref refloc='ltg'>), we want our application to see this document collection as
a single stream with the words nested inside the sentences:
</para>
<para><sgml.block>
. . .
<s>
<w id='w12'>Now</w><w id='w13'>is</w><w id='w14'>the</w>
. . .
<w id='w27'>the</w><w id='w28'>party</w><c id='c4'>.</c>
</s>
<s>
. . .
</s>
</sgml.block></para>
<para>
Note that the linking attributes are gone from the <sgml>S</> start-tag,
because their job has been done.
</para>
<para>
We believe this simple approach will have a wide range of powerful
applications. We are currently using it in the development of a
shared research database, allowing the independent development of
orthogonal markup by different sub-groups in the lab.
</para>
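<para>
For example (a purely illustrative sketch: the <sgml>phr</> element,
its <sgml>type</> attribute and the file names are invented for this
example), one sub-group might maintain the sentence markup shown
above, while another independently maintains phrase markup over the
same base words, each in its own document:
</para>
<para><sgml.block>
<!-- sentences.sgm -->
<s xml-link='simple' href="#ID(w12)..ID(c4)"></s>
<!-- phrases.sgm -->
<phr type='np' xml-link='simple' href="#ID(w27)..ID(w28)"></phr>
</sgml.block></para>
<para>
Because neither annotation document modifies the base corpus, the two
sets of markup can overlap or cross freely, and each can be combined
with the base text on demand.
</para>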
</section>
<section><title>Invisible Mending</title>
<para>
We use <highlight style=ital>inverse replacement</> semantics for tasks such as correcting errors in
read-only material. Suppose our previous example actually had:
</para>
<para><sgml.block>
. . .
<w id='w15'>tiem</w>
. . .
</sgml.block></para>
<para>
If we interpret the following with inverse replacement semantics:
</para>
<para><sgml.block>
<mend xml-link='simple' href="#ID(w15)">
<w id='w15'>time</w>
</mend>
</sgml.block></para>
<para>
we mean "take everything from the base document except word 15, for
which use my content". In other words, we can take this document, and
use it as the target for the references in the sentence example, and
we'll get a composition of linking producing a stream with sentences
containing the corrected word(s).
</para>
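<para>
Concretely, if the sentence document of the previous section takes
this mend document as the target of its references, the composed
stream comes out along the following lines (eliding the words not
shown earlier):
</para>
<para><sgml.block>
. . .
<s>
<w id='w12'>Now</w><w id='w13'>is</w><w id='w14'>the</w>
<w id='w15'>time</w>
. . .
<w id='w27'>the</w><w id='w28'>party</w><c id='c4'>.</c>
</s>
. . .
</sgml.block></para>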
</section>
<section>
<title>Indexing SGML files</title>
<para>
The paradigm of pipelined processes communicating via (normalised)
SGML is very suitable where the desired processing only requires
localised sequential access to the document. This is the case for many
language-based algorithms. For example, for a message understanding
application, the corpus can be treated as a sequence of messages, each
of which can be read into memory (as an SGML document tree) and
processed. However, sequential processing is less efficient in cases where
random access to large documents is required. In such cases
a database solution or separate index files are needed.
</para>
<para>
The <acronym>LT NSL</> system contains programs which allow one to
create index files for SGML documents, providing a random-access
mapping from a subset of the document's elements (selectable by a
query language) to file names and character offsets (since SGML
documents can be distributed over several files).
A separate program allows one to build content-addressable
indices, providing a flexible method of indexing SGML elements by
their text contents (<bibref refloc='ltgindex'>).
Finally, we provide retrieval programs for both of these indexing
schemes.
The above programs have been used by us and a group at Sheffield
University to index the large British National Corpus.
</para>
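<para>
Conceptually, such an index records, for each selected element, the
file it occurs in and the character offset at which it starts. The
following sketch is purely illustrative of that mapping; it is not the
actual LT NSL index file format:
</para>
<para><sgml.block>
<!-- illustrative sketch only: not the real LT NSL index format -->
<index>
<entry id='w12' file='corpus1.sgm' offset='10432'></entry>
<entry id='w27' file='corpus1.sgm' offset='10521'></entry>
<entry id='c4' file='corpus2.sgm' offset='88'></entry>
</index>
</sgml.block></para>
<para>
A retrieval program can then seek directly to the recorded offset in
the appropriate file and parse only the element required, rather than
scanning the whole document from the start.
</para>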
<para>
At present these indexing programs are separate from the LT NSL
application program interface and the work on hyperlinking described
above. It is clear that it would be very useful to develop a single
abstraction covering both.
Further work is continuing on discovering the most common patterns of
usage for referring to SGML elements via a random access index.
Two possible uses would be (a) to allow access to the text of a
footnote or bibliographic item while processing a paragraph
containing a reference to it; and (b) to allow the hyperlinks to refer
to an index file, allowing random access to hyperlinked elements (the
current implementation is much more efficient if the linked-to elements
occur in document order in the target document).
</para>
</section>
</body>
<rear>
<acknowl><para> The work described here was conducted in the Language
Technology group of the Human Communication Research Centre, whose
baseline funding comes from the UK Economic and Social Research
Council. The work was initiated in the context of the EU-funded
MULTEXT project, and is now being carried forward with support from
the UK Science and Engineering Research Council.</para>
</acknowl>
<bibliog>
<bibitem id='xml'>
<bib>Bray & DeRose 1997</bib>
<pub>
T. Bray and S. DeRose, eds,
<highlight style=ital>Extensible Markup Language (XML) Version 1.0</>:
WD-xml-link-970406, World Wide Web Consortium, 1997.
(See also <highlight style=bold>http://www.w3.org/pub/WWW/TR/</>)
</pub>
</bibitem>
<bibitem id='ltgindex'>
<bib>Mikheev and McKelvie 97</bib>
<pub>
A. Mikheev and D. McKelvie,
<highlight style=ital>Indexing SGML files using LT NSL</>,
Technical Report, Language Technology Group, HCRC, University
of Edinburgh, 1997.
</pub>
</bibitem>
<bibitem id='ltg'>
<bib>LTG tools</bib>
<pub>
<highlight style=bold>http://www.ltg.ed.ac.uk/software/</>
</pub>
</bibitem>
</bibliog>
</rear>
</gcapaper>