SGML: CoST-2.0
Subject: Re: Grep-like tool for SGML
Date: Thu, 18 Apr 96 08:44:27 GMT
From: Peter Murray-Rust <Peter@ursus.demon.co.uk>
In article <4l4gp6$fag@murphy2.servtech.com>
amiano@marburg.roc.servtech.com "amiano" writes:
>
> You may want to look at one of the tcl-based interpreters like tclyasp or cost,
> since these would allow more complex pattern/action matching. In some ways,
The ability to search SGML documents has been a revelation to me. I have
been using CoST-2.0 for several months and am amazed by the precision and
power of questions that can be asked. For example, if I now start on an
information-based project I would seriously question whether effort should
be put into relational schemas rather than developing an SGML-based approach.
This heresy(?) springs from the realisation that a well-thought-out DTD lets
future users ask a much wider range of questions than a schema that fixes, at
the outset, which types of information can be queried.
(NOTE: CoST-2.0 is very different from CoST-1 (b4) - there has been some
confusion on this newsgroup. It is easy to install and run and provides
very powerful search facilities. It holds the documents in a hierarchically
structured form (ESIS) as well as using the event stream model where needed.)
I have developed a graphical interface to CoST-2.0 and given a lot of thought
to querying documents. (The interface - costwish - is due RRRSN - hopefully
on Saturday/Sunday.) Joe English has written a very powerful prolog-like
language for queries in CoST and I have concentrated on three of its
components:
- GIs
- attributes
- content (#PCDATA/CDATA)
All of these can be searched independently or in many combinations. Remember
that SGML preserves the *context* of information so that it's easy to
look for a TITLE that belongs to a SCENE but not to a PLAY or ACT. My
examples below come from Julius Caesar marked up by Jon Bosak (thanks, Jon).
I have given the CoST queries verbatim, but costwish also lets you construct
these queries graphically without having to know the query language. (You
need to have a feel for the structure of the document, obviously, and the idea
of hierarchical concepts).
Q1: what are the titles of the SCENES? (The text contains markup like:
<SCENE><TITLE>SCENE I. Rome. A street.</TITLE>..... </SCENE> )
CoST:
foreachNode doctree withGI SCENE child withGI TITLE {
    puts stdout [content]
}
This searches the whole tree (doctree), finds all nodes (elements) with a
GI of SCENE (i.e. <SCENE>...</SCENE>) and uses only those which contain
immediate subelements with GI TITLE (child). It then outputs the 'content'
of TITLE - in this case the ASCII text "SCENE I. ...street.".
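For readers without CoST to hand, the same selection can be approximated in
Python. This is only a sketch of the idea, not CoST itself: it assumes the
markup is well-formed enough to be read by the standard xml.etree.ElementTree
module, and SAMPLE is a made-up fragment in the shape of the example above.
The function name scene_titles is likewise just for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like the markup quoted in Q1.
SAMPLE = """<PLAY><TITLE>The Tragedy of Julius Caesar</TITLE>
<ACT><TITLE>ACT I</TITLE>
<SCENE><TITLE>SCENE I. Rome. A street.</TITLE></SCENE>
<SCENE><TITLE>SCENE II. A public place.</TITLE></SCENE>
</ACT></PLAY>"""


def scene_titles(root):
    """Return the text of every TITLE that is an immediate child of a SCENE."""
    titles = []
    for scene in root.iter("SCENE"):          # every SCENE anywhere in the tree
        for title in scene.findall("TITLE"):  # direct children only
            titles.append(title.text)
    return titles


root = ET.fromstring(SAMPLE)
for text in scene_titles(root):
    print(text)
```

The TITLEs of the PLAY and the ACT are skipped because they are not children
of a SCENE, which is the point about context made above.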
Q2: which SCENES take place in Rome? Here we have to search the content
of the titles above for the string 'Rome'. (Since it *might* be 'ROME'
we'll make the match case-insensitive.) As above, but with the loop body:
CoST:
if {[regexp -nocase "Rome" [content]]} {puts stdout [content]}
The content is extracted and searched (grepped) for the regular expression
'Rome'; only the titles that match are printed.
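The extract-then-match pattern of Q2 can be sketched outside CoST as well.
The Python below is an analogue under the same assumptions as before
(ElementTree-parsable markup, an invented SAMPLE with made-up scene titles,
and an illustrative function name scenes_matching); it is not the CoST query
language.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fragment; the titles are invented to show both spellings.
SAMPLE = """<ACT><TITLE>ACT I</TITLE>
<SCENE><TITLE>SCENE I. Rome. A street.</TITLE></SCENE>
<SCENE><TITLE>SCENE II. A public place.</TITLE></SCENE>
<SCENE><TITLE>SCENE III. ROME. A room in the house.</TITLE></SCENE>
</ACT>"""


def scenes_matching(root, pattern):
    """Titles of SCENEs whose TITLE content matches pattern, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for scene in root.iter("SCENE"):
        for title in scene.findall("TITLE"):
            # Extract the content first, then search it, as in the Q2 query.
            if title.text and regex.search(title.text):
                hits.append(title.text)
    return hits


for text in scenes_matching(ET.fromstring(SAMPLE), "Rome"):
    print(text)
```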
Here are some further examples of questions that can be answered in a line
or two of CoST or through costwish:
"When do two actors speak at the same time?"
"When does someone leave in the middle of someone else's speech?"
"How many times does Caesar mention 'Brutus'?"
NOTE: CoST incorporates regular expressions through the tcl language as in
Q2, where the content is extracted and then searched. This can be somewhat
inefficient if a large amount of information is to be searched and I believe
that Joe has plans to add regexp to his query language, Joe??
My primary interest in developing costwish has been for managing non-textual
information (e.g. numeric data, molecules, DNA/protein sequences, etc.). Many
of the current applications use (IMO) inappropriate data models or tools
and the great attraction of SGML is that authors can prepare data by *marking
up what they know about the data* rather than spending lots of time trying
to develop relational models which most primary creators of the data and
most users do not understand. (Data marked up in SGML can usually be converted
into other forms if required for performance or similar concerns.) If, say,
a scientific publication contains markup like:
The boiling-point of <C.MOL NAME="ethanol"><C.FORM FORMULA="C2H6O"></C.FORM>
</C.MOL> is <X.VAR UNITS="celsius" TYPE="float">80</X.VAR>
it can then be searched at a later stage in very complex ways which the author
did not anticipate (e.g. "all molecules with molecular weight < 60 and boiling
point > 300K" - the numeric translations (formula->molwt, celsius->K) can be
done automatically).
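To make the two numeric translations concrete, here is a rough Python sketch
(again an analogue, not CoST). The assumptions are mine: the fragment is
wrapped in a hypothetical <P> element so that it parses, the X.VAR is taken to
describe the C.MOL in the same sentence, formulas are simple (no brackets),
and the table of atomic weights covers only the elements needed here. The
names mol_weight, to_kelvin and matches are illustrative.

```python
import re
import xml.etree.ElementTree as ET

# Rounded standard atomic weights; deliberately incomplete.
ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}

# The markup from the text, wrapped in a hypothetical <P> element.
SAMPLE = """<P>The boiling-point of <C.MOL NAME="ethanol"><C.FORM FORMULA="C2H6O"></C.FORM>
</C.MOL> is <X.VAR UNITS="celsius" TYPE="float">80</X.VAR></P>"""


def mol_weight(formula):
    """formula->molwt for simple formulas such as 'C2H6O'."""
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_WEIGHT[symbol] * (int(count) if count else 1)
    return total


def to_kelvin(value, units):
    """celsius->K; values already in kelvin pass through unchanged."""
    if units == "kelvin":
        return value
    if units == "celsius":
        return value + 273.15
    raise ValueError("unsupported units: " + units)


def matches(root, max_weight, min_boiling_k):
    """True if the molecule is lighter than max_weight and boils above min_boiling_k."""
    formula = next(root.iter("C.FORM")).get("FORMULA")
    var = next(root.iter("X.VAR"))
    boiling_k = to_kelvin(float(var.text), var.get("UNITS"))
    return mol_weight(formula) < max_weight and boiling_k > min_boiling_k


root = ET.fromstring(SAMPLE)
print(matches(root, max_weight=60, min_boiling_k=300))
```

The author of the markup only recorded a formula and a temperature in
celsius; the weight and the kelvin value are derived at query time.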
I would be interested to hear what other types of information can be searched
for in SGML documents (I don't use ENTITIES or NOTATION very much, but CoST
will cater for those and several other concepts).
Peter.
--
Peter Murray-Rust, Virtual School of Molecular Sciences, domestic net connexion