[Mirrored from: http://www.ltg.hcrc.ed.ac.uk/projects/nsl/sgml-europe/workshop.html - "A software architecture for simple, efficient SGML applications"]
(A workshop presented at SGML Europe '96, Munich)
This workshop expands on the topic presented in session B on the LT NSL SGML API. Here the LT NSL API is presented in more detail, with examples of how it can be used. We will compare it to other approaches to the development of SGML applications, particularly where large volumes of data need to be pipelined through a number of applications.
We define semi-valid SGML to be a sequence of text and/or SGML elements, each of which is a valid subtree of a named cached DTD. For example, if the DTD defines an element <name>, then a file consisting of a sequence of <name> elements is semi-valid SGML, even though it may not be valid SGML. The reason for this increased flexibility is to allow the output of tools which select parts of the input stream to be processed by subsequent NSL programs without requiring an explicit change of DTD.
A file is in Normalised SGML (nSGML) format if it satisfies the
following conditions:
nSGML is used as a format for communicating data
between different NSL programs.
NSL Corpus components can be
hyper-documents, with low-density (i.e. above the token level)
annotation being expressed indirectly in terms of links. In the first
instance, this is constrained to links with an
incorporation semantics, that is, situations where
element content at one level of one document is entirely composed of
elements from another document. Suppose, for example, we had already
segmented the 3rd component of an English corpus, resulting in a single
document d marked up with TEI-compliant headers
and paragraph marking, and with the segmentation marked with <w>
tags:
The output of a phrase-level segmentation might then be stored as
follows:
Linking is specified using one of the available TEI mechanisms;
the details are not relevant here, suffice it to say that doc=file1
resolves to d and establishes a default for
subsequent links. At a minimum, links are able to target single
elements or sequences of contiguous elements. The NSL implements a
textual inclusion semantics for such links, inserting
the referenced material as the content of the element bearing the
linking attributes. Note that although the example above shows links
to only one document, this is an artefact of the example; it is possible to link
to several documents, e.g. to a word document and a lexicon document:
Note that the architecture is recursive, in that
e.g. sentence-level segmentation could be expressed in terms of links
into the phrase-level segmentation as presented above.
The data architecture needs to address not only multiple levels of
annotation but also alternative versions at a given level. Since our
linking mechanism uses the SGML entity mechanism to implement the
identification of target documents, we can use the entity manager's
catalogue as a means of managing versions. For our example above,
this means that the connection between the phrase encoding document
and the segmented document would be made in two steps: the phrase document
would use a PUBLIC identifier, which the catalogue would map to the
particular file d. Since catalogue entries are
interpreted by tools as local to the directory where the catalogue
itself is found, this means that binding together groups of
alternative versions can be easily achieved by storing them under the
same directory.
Subdirectories with catalogue fragments can thus be used to represent
both increasing detail of annotation and alternatives at a given level
of annotation.
Note also that with a modest extension of functionality, it is
possible to use the data architecture described here to implement
patches, e.g. to the tokenisation process. If, alongside the inclusion
semantics, we have a special empty element <repl> which is replaced by the range it points to, we can produce a patch file,
e.g. for a misspelled word, as follows (irrelevant details omitted):
Whether such a patch would have knock-on effects on higher levels of
annotation would depend, inter alia, on whether a change in
tokenisation crossed any higher-level boundaries.
Should entities (especially character entities) be expanded by the
stream interface before they are handed to the tool, and if so, should
they be replaced in the stream before they are output?
Consider the following fragment of SGML marked-up text:
Let us assume further that the character set is 7-bit ASCII, and
therefore that there is no expansion of the entity &ccedil;.
Which, if any, of these entities should be expanded? There are
some properties of these entities that a tool will need to know about
if it is to do its job. For example, it may need to know that
&ccedil; is a lower-case letter, that &amp; is an
ampersand sign, and that &Xerox; is a string which expands to a
word, or a string of words. If markup is to be preserved throughout
the process of, for example, a tokeniser and a word segmenter, then
clearly these are all potential problems.
We therefore propose that tools must know about character entities such as &ccedil; and &amp;.
These will not be expanded, and can be reincorporated into the markup
by the stream interface. All other entities, whether or not they
contain markup, will be expanded and lost. This is necessary since
markup may need to be added within their expansions (e.g. if
&Xerox; expanded to ``The Rank Xerox Corporation'', then word
markup would have to be added within the entity). In general, if a
string entity is expanded, then markup started outside the entity may
have to terminate inside it. This would be obscure, and in any case is
not licensed by SGML under its ``obfuscatory markup'' rules.
The current implementation (version 1.4.4) achieves this by
expanding all entity references except those of type SDATA (assumed to be un-coded
characters in the document character set) or external entities with
explicit NOTATION declarations.
There are two common ways of looking at SGML documents: first, as a
linear stream of text with embedded markup tags, and second, as a
hierarchic tree structure. Both of these views can be useful in some
circumstances, and accordingly the NSL API provides data structures
and access functions to support both.
To view SGML documents as a linear stream of text plus markup, we
use the NSL_Bit data structure. An NSL_Bit is either an SGML start tag, an end tag, an
empty tag, an SGML processing instruction or text with no SGML
elements in it. For example, the SGML text:
would be split into the following NSL_Bits
(one per line).
The NSL function GetNextBit reads the next
NSL_Bit from an nSGML file. The function PrintBit writes NSL_Bits to an
output nSGML file.
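As a minimal illustration of the stream view, the following sketch copies an nSGML stream from stdin to stdout one NSL_Bit at a time. It is built only from the calls mentioned above and those used in simple.c below; in particular, the signature PrintBit(file, bit) is an assumption made by analogy with PrintItem.

  #include "nsl.h"

  /* Sketch: an identity filter over the stream view.  Every NSL_Bit
     (start tag, end tag, empty tag, processing instruction or text)
     is read, written out unchanged and then freed. */
  int main(void) {
    NSL_Bit *bit;
    NSL_File inf, outf;
    NSL_Doctype dct;

    NSLInit(0);
    inf = SFFopen(stdin, NULL, NSL_read, "");   /* doctype read from the <?NSL DDB ...> PI */
    dct = DoctypeFromFile(inf);
    outf = SFFopen(stdout, dct, NSL_write_normal, "");
    while ((bit = GetNextBit(inf))) {
      PrintBit(outf, bit);                      /* assumed signature, by analogy with PrintItem */
      FreeBit(bit);                             /* the caller frees each bit */
    }
    SFclose(outf);
    return 0;
  }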
In order to view an SGML document as a hierarchic structure, the
NSL API constructs a C data structure made up of NSL_Item and NSL_Data data
structures which mirrors the tree structure of the document.
The NSL_Item type describes an SGML element
and all its contents in a document, i.e. it represents a complete
subtree of the document structure. Thus, in the above example,
there would be one NSL_Item for the <P>
element, with a nested NSL_Item for the
<NAME> element.
The NSL_Data data structure represents a chunk
of SGML element content, i.e. either an SGML element or a piece of
text without element structure. NSL_Data
structures for the contents of an NSL_Item are
organised into a linked list: a mixture of NSL_Items
and text for mixed content, or NSL_Items only for
element-only content. In the above example, the NSL_Item corresponding to <P> will point to five
NSL_Data structures corresponding to ``This
... name,'', <NAME>, ``and ... break'', <PB>, and ``in
it.''
The resulting data structure is shown in Figure 1. This model application program (simple) has been
written to demonstrate the use of the NSL API. The simple program reads an nSGML file containing
paragraph and word markup. It assumes that each word element has an
attribute which contains part of speech (POS) information. The program
then outputs a modified version of the input file where the text of
each word element has been replaced by some text which shows the word
and the POS tag associated with the word. For example, if the input
file looks like:
then the output file will look like:
Simple.c is not intended to be a particularly
useful program; rather, it is an example of the use of the NSL API.
The program can be called as follows:
The source of the simple program can be found
in the NSL release file.
The annotated code of the simple program is as
follows:
Include header file for character functions.
Include header file for string functions.
Include the header file for the NSL API.
Main program.
Initialise the NSL SGML API.
Read the command line options.
The optional -d option allows one to specify
a .ddb file externally. If not given, then we assume that
the input file contains a <?NSL DDB ... > processing instruction.
Open the input nSGML file, passing in the doctype declaration
dct if any has been specified by a -d option. If
the doctype declaration was not specified by a -d option, then it will
be set by reading from the input file on opening.
Open the output nSGML stream using the same doctype declaration.
Get the unique names of the elements we care about.
Loop round, reading bits of the SGML input
text. A bit is either a single piece of SGML markup (a start tag, an end tag,
an empty tag, or a processing instruction) or it is text without SGML element
markup.
Now we switch on the type of the NSL_Bit we have just read.
Case 1: We have found the start tag for an SGML element. Note
that the item value of this NSL_Bit is of type NSL_inchoate, meaning that unless you call ItemParse on it, it contains
only the start tag information and no content.
If we are inside a <text> element, then note this fact.
If we are inside a paragraph (<P>) inside text, then note this fact.
We have found a word in a text paragraph. Note this fact and
save the POS tag, which is the value of the SGML attribute
(named by the tagAttr variable) of the <W> element
that we just found.
In any case, we fall through to the next case
to print out the NSL_Bit we just read.
Case 2: We have found an empty SGML element.
There is no point in comparing labels of empty items. Note that
PrintItem only prints a start tag for empty or
inchoate items (not an end tag), but puts the output stream into a
different state depending on which it was.
Case 3: If we are inside the text of a word element, then
strip off trailing whitespace, and write the word and its POS in
a user defined format. Note that we use PrintText, rather
than printf to
actually write the string to the output file. This is in order
to keep the nSGML output state up-to-date.
Case 4: We have found an SGML end tag, so we keep track of where we are
in the SGML markup tree.
Note: processing instructions are not dealt with by the above code. If this
is required, then a new case should be added with the label
NSL_pi_bit.
NSL_Bits are fixed but their contents are not freed
unless we do it ourselves.
At the very end we need to close the output nSGML stream.
It will be noted that in the above program (simple.c), a
fairly large part of the code is devoted to implementing a finite-state
machine which keeps track of where we are in the SGML document
structure; the NSL API query functions described below provide support for tracking such
positions. The reason that this support was not used in simple.c is that, for simple
cases, it is much more efficient to spell out this processing
explicitly rather than to call the API query processing mechanism,
which, by necessity, must be able to deal with the general case. If
however, ease of programming is more important than processing speed
or you need to handle a wider range of document structures, then the
NSL query processing functions may provide what you need.
NSL queries are a way of specifying particular nodes in the
SGML document structure. Queries are coded as strings which
give a (partial) description of a path from the root of the
SGML document (top-level element) to the desired
SGML element(s). For example, the query ".*/TEXT/.*/P"
describes any <P> element which occurs anywhere (at any level of
nesting) inside a <TEXT> element which, in turn, can occur
anywhere inside the top-level document element.
A query is basically a path consisting of terms separated by ``/'', where each
term describes an SGML element. The syntax of queries is as follows:
A condition with an index matches only the index'th sub-element of the
enclosing element. Index counting starts from 0, so the first
sub-element is numbered 0. Conditions with indices and atests only match if
the index'th sub-element also satisfies the atests. Attribute tests
are not exhaustive, i.e. P[rend=it] will match <P n=45 rend=it> as
well as <P rend=it>. They will match against both explicitly present and
defaulted attribute values, using string equality. Bare anames are
satisfied by any value, explicit or defaulted. Matching of
queries is bottom-up, deterministic and shortest-first.
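As a minimal usage sketch (built only from the calls that appear in simpleq.c below, and therefore making the same assumptions about their signatures), the following program copies its input to its output while counting the <P> elements matched by the query ``.*/TEXT/.*/P'':

  #include <stdio.h>
  #include "nsl.h"

  /* Sketch: count the elements matching a query while copying the
     document through unchanged.  Non-matching material is written to
     the output by GetNextQueryItem; matched items are printed (and
     freed) explicitly, as in simpleq.c. */
  int main(void) {
    NSL_File inf, outf;
    NSL_Doctype dct;
    NSL_Query qu;
    NSL_Item *item;
    int n = 0;

    NSLInit(0);
    inf = SFFopen(stdin, NULL, NSL_read, "");
    dct = DoctypeFromFile(inf);
    outf = SFFopen(stdout, dct, NSL_write_normal, "");
    qu = ParseQuery((char *)".*/TEXT/.*/P");    /* any <P> anywhere inside a <TEXT> */
    while ((item = GetNextQueryItem(inf, qu, outf))) {
      n++;                                      /* this subtree matched the query */
      PrintItem(outf, item);                    /* copy the matched subtree to the output */
      FreeItem(item);                           /* each matched item must be freed */
    }
    SFclose(outf);
    fprintf(stderr, "%d matching <P> elements\n", n);
    return 0;
  }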
In this section we show some examples of NSL queries, assuming the
following DTD.
The SGML structure of a sample document which uses this DTD is shown in Figure 2.
The query: "CORPUS/DOC/TITLE/s"
means all s elements directly under TITLE's directly under DOC. This
is shown graphically in
Figure 3.
The query: "CORPUS/DOC/./s"
means all s's directly under anything directly under DOC, as
shown in
Figure 4.
"CORPUS/DOC/.*/s"
means all s's anywhere underneath DOC. ``.*'' can be thought of as
standing for all finite sequences of ``.''. For the example document
structure this means the same as CORPUS/DOC/./s, but in more nested
structures this would not be the case. An alternative way of
addressing the same sentences would be to specify .*/s as query.
We also provide a means of specifying the Nth node in a
particular local tree. So "./.[1]/.[2]/.[0]" means the 1st element
below the 3rd element below the 2nd element in a stream of
elements, as shown in Figure 5.
This is also the referent of
"CORPUS/DOC[1]/BODY[2]/s[0]"
assuming that all our elements are s's under BODY under DOC, which
illustrates the combination of positions and types.
".*/BODY/s[0] "
refers to the set of the first elements under any BODY which are also s's.
The referent of this is shown in
figure 6
Additionally, we can refer to attribute values in the square
brackets: ".*/s/w[0 rend=lc]" gets the initial elements under any
<s> element so long as they are words with rend=lc (perhaps
lower case words starting a sentence).
As will be obvious from the preceding description, the query language is
designed to provide a small set of orthogonal features. Queries which depend
on knowledge of prior context, such as ``the third element after the
first occurrence of a sentence having the attribute quotation'' are not
supported. It is however possible for tools to use the lower-level
API to find such items if desired. The reason for the limitation is
that without it the search engine might be obliged to keep potentially
unbounded amounts of context.
The following program simpleq.c shows how the NSL API query functions
can be used.
Include header file for character functions.
Include header file for string functions.
Include the header file for the NSL API.
Main program.
Initialise the NSL SGML API.
Read the command line options.
The optional -d option allows one to specify
a .ddb file externally. If not given, then we assume that
the input file contains a <?NSL DDB ... > line.
Open the input nSGML file, passing in the doctype declaration
dct if any has been specified by a -d option. If
the doctype declaration was not specified by a -d option, then it will
be set by reading from the input file on opening.
Open the output nSGML stream using the same doctype declaration.
Construct a query, which looks for words anywhere inside paragraphs
anywhere inside a text.
Read items of the SGML input text. When we find an item which
matches the query we execute the body of the while loop. Items which
do not match are written to the output stream by GetNextQueryItem. Each call of GetNextQueryItem creates a new item, which it is the
responsibility of the programmer to free once it has been used.
If we are inside the text of a word element, then strip off trailing
whitespace (in the inner while loop), and write the word and its
part of speech in a user defined format. Note that we use PrintItem to write the modified item to the output file. This is
in order to keep the nSGML output state up-to-date. Note that the
code here assumes that each word element contains only text and no
embedded SGML markup; more complex code would be needed to cope with
embedded markup.
NSL_Items are fixed but their contents are not freed
unless we do it ourselves.
At the very end we need to close the output nSGML stream.
We don't bother to explicitly free the NSL_Query since there was only
one created.
Compound Annotation and Links
. . .
<p id=en.c3.p4>
<w id=en.c3.p4.w1>
Time
</w>
<w id=en.c3.p4.w2>
flies
</w>
<w id=en.c3.p4.w3>
.
</w>
</p>
. . .
. . .
<p id=en.c3.p4>
<phr id=en.c3.p4.ph1 type=n doc=file1 from='id en.c3.p4.w1'>
</phr>
<phr id=en.c3.p4.ph2 type=v from='id en.c3.p4.w2'>
</phr>
</p>
<word>
<source doc=file1 from='id en.c3.p4.w1'>
</source>
<lex doc=lex1 from='id en.lex.40332'>
</lex>
</word>
Versions
<nsl>
<repl doc=original from='id hdr1'>
<!-- to get the original header-->
<text>
<repl from='id p1' to='id p324'>
<!-- the first swatch of unchanged text -->
<p id=p325>
<repl from='id p325.t1' to='id p325.t15'>
<!-- more unchanged text -->
<corr sic='procede' resp='ispell'>
<!-- the correction itself -->
<token id=p325.t16>
proceed
</token></corr>
<repl from='id p325.t17' to='id p325.t96'>
<!-- more unchanged text-->
</p>
<repl from='id p326' to='id p402'>
<!-- the rest of the unchanged text-->
</text>
</nsl>
Entities in nSGML
<s>Fran&ccedil;ois Martin said yesterday
that the following companies announced quarterly
results: IBM; AT&amp;T; &Xerox;</s>
Overview of the NSL API
Stream View
<P> This is some text with a name, <NAME>Fred Bloggs</NAME>
and a page break <PB> in it. </P>
<P> type=NSL_start_bit
This is some text with a name, type=NSL_text_bit
<NAME> type=NSL_start_bit
Fred Bloggs type=NSL_text_bit
</NAME> type=NSL_end_bit
and a page break type=NSL_text_bit
<PB> type=NSL_empty_bit
in it. type=NSL_text_bit
</P> type=NSL_end_bit
Tree View
NSL API data structure
Figure 1
<name>David<surname>McKelvie</surname></name>
simple.c -- A model NSL application
<HEADER>blah blah</HEADER>
<TEXT><P>
<W TYPE=det>The</W>
<W TYPE=nn>cat</W>
</P></TEXT>
<HEADER>blah blah</HEADER>
<TEXT><P>
<W TYPE=det>The/det</W>
<W TYPE=nn>cat/nn</W>
</P></TEXT>
simple [options] nsgmlfile
------- ---------
Allowed options (all of which are optional) are:
-d The name of the cached DOCTYPE .ddb file
-t name of attribute containing the POS information
(default TYPE)
-w name of word element
(default W)
-f print format for output words and their POS tags
(default ``%s/%s'')
#include <ctype.h>
#include <string.h>
#include "nsl.h"
int main(int argc, char **argv) {
NSL_Bit *bit;
NSL_File inf, outf;
NSL_Doctype dct=NULL;
char *paraLabel,*wordLabel,*textLabel,*ptr,*label,*tagVal=NULL,buf[100];
int in_para=0,in_text=0,ac=1,in_word=0,len;
/* defaults for command line arguments */
/* Name of attribute carrying tag -- set with -t */
char* tagAttr=(char*)"TYPE";
/* Name of word element -- set with -w */
char* wordTag=(char*)"W";
/* Format string for word, tag -- set with -f */
char* textFormat=(char*)"%s/%s";
NSLInit(0);
while (ac<argc) {
if (STREQ(argv[ac], "-d")) {
dct=DoctypeFromDdb(argv[++ac]);
ac++;
}
else if (STREQ(argv[ac], "-t")) {
ptr=tagAttr=argv[++ac];
ac++;
/* need upper case for attribute comparison */
while (*ptr) {
*ptr=toupper(*ptr);
ptr++;
};
}
else if (STREQ(argv[ac], "-w")) {
ptr=wordTag=argv[++ac];
ac++;
/* need upper case for tag comparison */
while (*ptr) {
*ptr=toupper(*ptr);
ptr++;
};
}
else if (STREQ(argv[ac], "-f")) {
textFormat=argv[++ac];
ac++;
}
else {
break;
};
};
inf=SFFopen(stdin, dct, NSL_read,"");
dct=DoctypeFromFile(inf);
outf=SFFopen(stdout, dct, NSL_write_normal,"");
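/* get unique names for the elements we care about; these let the
   label comparisons below use == */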
textLabel=ElementUniqueName(dct,(char*)"TEXT",4);
paraLabel=ElementUniqueName(dct,(char*)"P",1);
wordLabel=ElementUniqueName(dct,wordTag,0); /* length will be computed */
while ((bit=GetNextBit(inf))) {
switch (bit->type) {
case NSL_start_bit:
if((label=bit->label)==textLabel) {
in_text=1;
}
else if (in_text && label==paraLabel) {
in_para=1;
}
else if (in_para && label==wordLabel) {
in_word=1;
tagVal=GetAttrVal(bit->value.item,tagAttr);
};
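/* no break: fall through so that the start tag is printed by PrintItem below */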
case NSL_empty_bit:
PrintItem(outf, bit->value.item);
break;
case NSL_text_bit:
if (in_word) {
len=strlen(bit->value.body);
while (len>0 && strchr((char*)" \t\n",bit->value.body[len-1])) {
bit->value.body[--len]='\000';
};
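/* format the word and its POS tag; assumes the result fits in buf[100] */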
sprintf(buf,textFormat,bit->value.body,tagVal);
PrintText(outf,buf);
}
else {
/* text in some other context -- just print it */
PrintText(outf, bit->value.body);
}
break;
case NSL_end_bit:
if (in_para) {
if (bit->label==paraLabel) {
/* no longer in para */
in_para=0; /* NOTA BENE assume no nested para's! */
}
else if (bit->label==wordLabel) {
/* no longer in word */
in_word=0;
};
};
/* print it no matter what */
PrintEndTag(outf,bit->label);
break;
default:
SHOULDNT;
}; /* end switch */
FreeBit(bit);
}; /* end while */
SFclose(outf);
return 0;
}
NSL queries
<query>  := <term> ( '/' <term> )*
<term>   := <GI> <cond>? '*'?
<GI>     := <elementName> | '.'
<cond>   := '[' ( <index> | <atests> | <index> <atests> ) ']'
<index>  := <number>
<atests> := <atest> ( ' ' <atest> )*
<atest>  := <aname> ( '=' <aval> )?
That is, a query is a sequence of terms, separated by ``/''. Each term
describes either an SGML element or a nested sequence of SGML
elements. A term is given by an SGML element name, optionally
followed by a list of attribute specs (in square brackets), and
optionally followed by a ``*''. A term which ends in a ``*'' matches
a nested sequence of any number of SGML elements, including zero,
each of which matches the term without the ``*''. For example ``P*''
will match a <P> element, arbitrarily deeply nested inside other
<P> elements. The special GI ``.'' will match any SGML element
name. Thus, a common way of finding a <P> element anywhere inside
a document is to use the query ``.*/P''. Aname (attribute name) and
aval (attribute value) are as per SGML.
Examples of NSL queries
<!ELEMENT CORPUS - - (DOC+)>
<!ELEMENT DOC - - (DOCNO,TITLE,BODY,IT,NI) >
<!ELEMENT DOCNO - - (#PCDATA) >
<!ELEMENT TITLE - - (s+) >
<!ELEMENT BODY - - (s+) >
<!ELEMENT IT - - (#PCDATA) >
<!ELEMENT NI - - (#PCDATA) >
<!ELEMENT s - - (#PCDATA|w)* >
<!ELEMENT w - - (#PCDATA) >
<!ATTLIST BODY id ID #IMPLIED >
<!ATTLIST IT id ID #IMPLIED>
<!ATTLIST w rend CDATA #IMPLIED>
The hierarchical structure of an example document.
Figure 2
CORPUS/DOC/TITLE/s
CORPUS/DOC/.*/s
./.[1]/.[2]/.[0]
.*/BODY/s[0]
simpleq.c -- A model NSL application using queries
#include <ctype.h>
#include <string.h>
#include "nsl.h"
int main(int argc, char **argv) {
NSL_File inf, outf;
NSL_Doctype dct=NULL;
NSL_Query qu;
NSL_Item *item;
char *ptr, *tagVal=NULL,buf[100],qustr[100];
int ac=1,len;
/* defaults for command line arguments */
/* Name of attribute carrying tag -- set with -t */
char* tagAttr=(char*)"TYPE";
/* Name of word element -- set with -w */
char* wordTag=(char*)"W";
/* Format string for word, tag -- set with -f */
char* textFormat=(char*)"%s/%s";
NSLInit(0);
while (ac<argc) {
if (STREQ(argv[ac], "-d")) {
dct=DoctypeFromDdb(argv[++ac]);
ac++;
}
else if (STREQ(argv[ac], "-t")) {
ptr=tagAttr=argv[++ac];
ac++;
/* need upper case for attribute comparison */
while (*ptr) {
*ptr=toupper(*ptr);
ptr++;
};
}
else if (STREQ(argv[ac], "-w")) {
ptr=wordTag=argv[++ac];
ac++;
/* need upper case for tag comparison */
while (*ptr) {
*ptr=toupper(*ptr);
ptr++;
};
}
else if (STREQ(argv[ac], "-f")) {
textFormat=argv[++ac];
ac++;
}
else {
break;
};
};
inf=SFFopen(stdin, dct, NSL_read,"");
dct=DoctypeFromFile(inf);
outf=SFFopen(stdout, dct, NSL_write_normal,"");
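/* build the query: word elements anywhere inside <P>s anywhere inside a <TEXT> */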
strcpy(qustr,".*/TEXT/.*/P/.*/");
strcat(qustr,wordTag);
qu=ParseQuery(qustr);
while( ( item=GetNextQueryItem(inf, qu, outf ) ) ) {
len=strlen(item->data->first);
while (len>0 && strchr((char*)" \t\n",item->data->first[len-1])) {
item->data->first[--len]='\000';
};
tagVal=GetAttrStringVal(item,tagAttr);
sprintf(buf,textFormat,item->data->first,tagVal);
item->data->first = buf;
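/* point the item's text at the formatted buffer so that PrintItem
   writes the word/POS form; the original text pointer is simply overwritten */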
PrintItem(outf, item);
FreeItem(item);
}; /* end while */
SFclose(outf);
return 0;
}