SGML: CoST-2.0

Subject: Re: Grep-like tool for SGML
Date: Thu, 18 Apr 96 08:44:27 GMT
From: Peter Murray-Rust <Peter@ursus.demon.co.uk>
In article <4l4gp6$fag@murphy2.servtech.com> amiano@marburg.roc.servtech.com "amiano" writes:

> You may want to look at one of the tcl-based interpreters like tclyasp or cost,
> since these would allow more complex pattern/action matching. In some ways,

The ability to search SGML documents has been a revelation to me. I have been using CoST-2.0 for several months and am amazed by the precision and power of questions that can be asked. For example, if I now start on an information-based project I would seriously question whether effort should be put into relational schemas rather than developing an SGML-based approach. This heresy(?) springs from the realisation that a well-thought-out DTD allows the future user to ask a much wider range of questions than deciding at the beginning what types of information should be queriable.

(NOTE: CoST-2.0 is very different from CoST-1 (b4) - there has been some confusion on this newsgroup. It is easy to install and run and provides very powerful search facilities. It holds the documents in a hierarchically structured form (ESIS) as well as using the event stream model where needed.)

I have developed a graphical interface to CoST-2.0 and given a lot of thought to querying documents. (The interface - costwish - is due RRRSN - hopefully on Saturday/Sunday.) Joe English has written a very powerful prolog-like language for queries in CoST and I have concentrated on three of its components:

- GIs
- attributes
- content (#PCDATA/CDATA)

All of these can be searched independently or in many combinations. Remember that SGML preserves the *context* of information so that it's easy to look for a TITLE that belongs to a SCENE but not to a PLAY or ACT.

My examples below come from Julius Caesar marked up by Jon Bosak (thanks, Jon). I have given the CoST queries verbatim, but costwish also lets you construct these queries graphically without having to know the query language.
(You need to have a feel for the structure of the document, obviously, and for the idea of hierarchical concepts.)

Q1: what are the titles of the SCENES?

(The text contains markup like:

    <SCENE><TITLE>SCENE I. Rome. A street.</TITLE>..... </SCENE>

)

CoST:

    foreachNode doctree withGI SCENE child withGI TITLE { puts stdout [content] }

This searches the whole tree (doctree), finds all nodes (elements) with a GI of SCENE (i.e. <SCENE>...</SCENE>) and uses only those which contain immediate subelements with GI TITLE (child). It then outputs the 'content' of TITLE - in this case the ASCII text "SCENE I. ...street.".

Q2: which SCENES take place in Rome?

Here we have to search the content of the titles above for the string 'Rome'. (Since it *might* be 'ROME' we'll use case-insensitivity.) The query is as above, but with the body replaced by a test:

CoST:

    if {[regexp -nocase "Rome" [content]]} {puts stdout [content]}

The content is extracted and searched (grep'ed) for the regular expression 'Rome'.

Here are some further examples of questions that can be answered in a line or two of CoST or through costwish:

"When do two actors speak at the same time?"
"When does someone leave in the middle of someone else's speech?"
"How many times does Caesar mention 'Brutus'?"

NOTE: CoST incorporates regular expressions through the tcl language as in Q2, where the content is extracted and then searched. This can be somewhat inefficient if a large amount of information is to be searched and I believe that Joe has plans to add regexp to his query language. Joe??

My primary interest in developing costwish has been for managing non-textual information (e.g. numeric data, molecules, DNA/protein sequences, etc.). Many of the current applications use (IMO) inappropriate data models or tools, and the great attraction of SGML is that authors can prepare data by *marking up what they know about the data* rather than spending lots of time trying to develop relational models which most primary creators of the data and most users do not understand.
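
For readers without CoST to hand, the logic of Q1 and Q2 can be sketched in Python. This is only a rough analogue and not CoST itself: it assumes the play is available as well-formed XML, so that the standard xml.etree.ElementTree parser can read it, and the SAMPLE fragment and the function names scene_titles and scenes_matching are invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fragment in the spirit of the play markup; not the real file.
SAMPLE = """
<PLAY><TITLE>The Tragedy of Julius Caesar</TITLE>
  <ACT><TITLE>ACT I</TITLE>
    <SCENE><TITLE>SCENE I. Rome. A street.</TITLE></SCENE>
    <SCENE><TITLE>SCENE II. A public place.</TITLE></SCENE>
  </ACT>
  <ACT><TITLE>ACT IV</TITLE>
    <SCENE><TITLE>SCENE I. A house in Rome.</TITLE></SCENE>
  </ACT>
</PLAY>
"""


def scene_titles(root):
    """Q1 analogue: text of every TITLE that is an immediate child of a SCENE."""
    return [
        title.text or ""
        for scene in root.iter("SCENE")      # every SCENE anywhere in the tree
        for title in scene.findall("TITLE")  # direct TITLE children only
    ]


def scenes_matching(root, pattern):
    """Q2 analogue: the Q1 titles whose text matches pattern, ignoring case."""
    return [t for t in scene_titles(root) if re.search(pattern, t, re.IGNORECASE)]


root = ET.fromstring(SAMPLE.strip())
for title in scenes_matching(root, "Rome"):
    print(title)
```

Because only direct TITLE children of SCENE are inspected, the PLAY and ACT titles are never considered, which mirrors the point about context: a TITLE is found because it belongs to a SCENE. As in Q2, the text is first extracted and then matched, so the same efficiency caveat applies.
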
(Data marked up in SGML can usually be converted into other forms if required for performance or similar concerns.)

If, say, a scientific publication contains markup like:

    The boiling-point of <C.MOL NAME="ethanol"><C.FORM FORMULA="C2H6O"></C.FORM>
    </C.MOL> is <X.VAR UNITS="celsius" TYPE="float">80</X.VAR>

it can then be searched at a later stage in very complex ways which the author did not anticipate (e.g. "all molecules with molecular weight < 60 and boiling point > 300K" - the numeric translations (formula->molwt, celsius->K) can be done automatically).

I would be interested to hear what other types of information can be searched for in SGML documents (I don't use ENTITIES or NOTATION very much, but CoST will cater for those and several other concepts).

Peter.

-- 
Peter Murray-Rust, Virtual School of Molecular Sciences, domestic net connexion
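
The kind of later search described in the boiling-point example above (molecular weight < 60 and boiling point > 300K) might be sketched as follows in Python. This is an illustration under assumptions and not part of CoST: the wrapper elements DOC and P, the methane entry, the small ATOMIC_WEIGHT table and all function names are hypothetical, the markup is assumed to be well-formed XML, and each P is assumed to hold one C.MOL and one X.VAR.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical data: the ethanol line follows the post, the rest is invented.
SAMPLE = """
<DOC>
  <P>The boiling-point of <C.MOL NAME="ethanol"><C.FORM FORMULA="C2H6O"></C.FORM>
  </C.MOL> is <X.VAR UNITS="celsius" TYPE="float">80</X.VAR></P>
  <P>The boiling-point of <C.MOL NAME="methane"><C.FORM FORMULA="CH4"></C.FORM>
  </C.MOL> is <X.VAR UNITS="celsius" TYPE="float">-162</X.VAR></P>
</DOC>
"""

# Approximate standard atomic weights for the few elements used here.
ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def mol_weight(formula):
    """formula -> molwt for simple formulas such as C2H6O (no brackets)."""
    pairs = re.findall(r"([A-Z][a-z]?)(\d*)", formula)
    return sum(ATOMIC_WEIGHT[symbol] * int(count or 1) for symbol, count in pairs)


def to_kelvin(value, units):
    """celsius -> K; kelvin values pass through unchanged."""
    if units == "celsius":
        return value + 273.15
    if units == "kelvin":
        return value
    raise ValueError("unsupported units: " + units)


def search(root, max_weight, min_kelvin):
    """Names of molecules lighter than max_weight that boil above min_kelvin."""
    hits = []
    for para in root.iter("P"):
        mol = next(para.iter("C.MOL"), None)
        form = next(para.iter("C.FORM"), None)
        var = next(para.iter("X.VAR"), None)
        if mol is None or form is None or var is None:
            continue  # skip paragraphs that lack the expected markup
        weight = mol_weight(form.get("FORMULA"))
        kelvin = to_kelvin(float(var.text), var.get("UNITS"))
        if weight < max_weight and kelvin > min_kelvin:
            hits.append(mol.get("NAME"))
    return hits


root = ET.fromstring(SAMPLE.strip())
print(search(root, max_weight=60, min_kelvin=300))  # prints ['ethanol']
```

Neither the molecular weight nor the temperature in kelvin appears in the markup; both are derived at query time from what the author did record (the formula and the value with its units), which is the sense in which the translations can be done automatically.
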