COSTWISH overview

costwish is a simple addition of a GUI to Joe English's CoST for the analysis of SGML documents. There is an executable (costwish) which is a tcl/tk interpreter and for which I have written a number of *.tcl scripts. I hope that you can start using it without knowing any tcl/tk or SGML. To help you with this I have written some simple tutorials which you should find under the 'User Options' button.

SGML can be used in several ways for structuring information and CoST supports at least:

Tree-structured documents. Think of a table of contents, with chapters, sections, subsections, appendices, etc. This is the most developed metaphor in costwish and the default way of presenting a document. If you read in an SGML document it will be presented as an indented table, with active links from the components.
A stream. If you're familiar with HTML, think of the tags (e.g. ...) as switching formatting on or off as the stream of text is processed. costwish only really addresses this when it encounters HTML2.0 documents.
Entity management. SGML supports documents composed of many distributed 'bits' (think of a book made up of 'include' files). CoST supports entities but costwish has not yet addressed them.
Searchable information. CoST can locate parts of an SGML document by searching for tags, modified if necessary by context and attributes. This is extremely powerful, but it's not easy to customise this generally and I haven't tried. I suspect that in the first instance (like me) you should learn CoST and hide the queries under GUI buttons.
Translation. SGML documents are often translated to various formats (e.g. LaTeX, HTML, etc.) and Joe has written some routines for this. I haven't yet included them, partly because the LaTeX one is in [incr tcl] which I haven't installed.

SGML documents

An SGML document requires a DTD (Document Type Definition) for it to be valid and will carry the name of the DTD in a DOCTYPE statement at the head. (Any document without DOCTYPE is not a conforming SGML document.) The DTD is used to validate the document and can also add defaults, so that in principle an SGML document cannot be read without the DTD. For newcomers this can be a very tricky area as several files are often required to manage the 'parsing'. Amongst these are the SGML declaration, the DTDs, and often a CATALOG to manage the DTDs. There may also be environment variables or other mechanisms for managing SGML parsing. costwish allows the user to customise this process if required.

The result of parsing an SGML document is an ESIS document - essentially a parse tree of the SGML. It is a translation of the SGML document into one that no longer requires the DTD. (It can't be retransformed completely into SGML without knowledge of the DTD - e.g. EMPTY tags and defaults can't be recreated automatically.) CoST operates on ESIS streams (it does not read any DTDs) so that the fundamental operation is to load an ESIS file. If you are happy operating SGML parsers you may wish to create your ESIS files independently of costwish. If, like me, you often get the mechanics of running the parser wrong, you may like to try to customise that bit under costwish.

CoST

CoST is a very smart program and I have only mastered about half of it. Essentially it lets the user load an ESIS file and then write tcl commands to search, translate and analyse the parsed data. It has a very flexible query language which navigates the tree, and this can't easily be customised under a GUI. costwish, therefore, only really supplies some basic operations, but hopefully they are ones which will let you learn more about SGML and CoST. One way forward may be to add a command line option, but this is for later.

The Parse Tree

In very simple terms an SGML document consists of tags which may contain other tags and/or free text (#PCDATA). Tags may also have attributes. Precisely what is allowed is governed by the particular DTD. CoST regards tags as nodes in a tree with optional attributes. For example, in the HTML string:
<P>Here is <A HREF="me.html">my <B>home</B> page</A> at work.</P>
the P node contains:

A PCDATA string: "Here is "
An A node
More PCDATA: " at work."

The A node contains:

PCDATA: "my "
A B node
more PCDATA: " page"

The A node also has an attribute HREF with a value "me.html". (The " are stripped from attribute values by the parser). Although you can't see it, the nodes all have additional default attributes added by the DTD. Thus if the string above is part of a conforming HTML2.0 document it is actually parsed into:



ASDAFORM CDATA Para
(P
-Here is 
AHREF CDATA me.html
(A
-my 
ASDAFORM CDATA B
(B
-home
)B
- page
)A
- at work.
)P

This is part of the ESIS stream. (X ... )X represents an X node which can contain other nodes. The attributes precede the opening of the node and are prefixed by A. Thus the (only) attribute of P is SDAFORM="Para" which was automatically added by the parser from the DTD as a default. (The SDA attributes are a standard document architecture which supports Braille.) The PCDATA is denoted by a leading "-" field. (There are a few other types of information that can occur in the ESIS stream but it's fairly simple to follow.)
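As a rough illustration of how such a stream maps onto a tree, here is a minimal Python sketch. It is not how CoST itself is implemented (CoST is driven from tcl); the Node class and the parse_esis function are hypothetical names introduced only for this example. It handles just the three line types described above (attributes prefixed by A, node open/close, and "-" data), does not decode any escape sequences in the data, and uses a sample stream written out from the P/A/B example in the text.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One element of the tree (hypothetical structure, not CoST's own)."""
    name: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)  # Node objects or str data


def parse_esis(lines):
    """Build a tree from the subset of ESIS line types described above."""
    root = Node("#ROOT")          # artificial root so the stack is never empty
    stack = [root]
    pending = {}                  # attributes waiting for the next "(" line
    for line in lines:
        if not line:
            continue
        code, rest = line[0], line[1:]
        if code == "A":           # "Aname TYPE value"
            parts = rest.split(" ", 2)
            pending[parts[0]] = parts[2] if len(parts) > 2 else None
        elif code == "(":         # open a node; it takes the pending attributes
            node = Node(rest, pending)
            pending = {}
            stack[-1].children.append(node)
            stack.append(node)
        elif code == ")":         # close the current node
            closed = stack.pop()
            if closed.name != rest:
                raise ValueError(f"unbalanced stream: expected ){closed.name}")
        elif code == "-":         # character data belongs to the open node
            stack[-1].children.append(rest)
        # any other line type is ignored in this sketch
    return root


sample = [
    "ASDAFORM CDATA Para",
    "(P",
    "-Here is ",
    "AHREF CDATA me.html",
    "(A",
    "-my ",
    "ASDAFORM CDATA B",
    "(B",
    "-home",
    ")B",
    "- page",
    ")A",
    "- at work.",
    ")P",
]

tree = parse_esis(sample)
p_node = tree.children[0]
a_node = p_node.children[1]
print(p_node.name, p_node.attrs)
print(a_node.name, a_node.attrs, a_node.children)
```

The stack mirrors the nesting of (X ... )X pairs: each "(" pushes a node and each ")" pops it, so the data lines always attach to whichever node is currently open.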

ESIS is the sole input for CoST. When it loads a file it stores it as a tree and every tag creates a node. These nodes can have attributes and content (usually CDATA). (There are several types of string defined in SGML (PCDATA, CDATA, RDATA) but some of these distinctions disappear when they get translated into ESIS.)

I - and I'm not alone - find the treatment of string data very tricky in SGML. There is no concept of whitespace (spaces are transcribed exactly to the ESIS stream) but there is a concept of record-end (RE). Roughly, REs correspond to newlines in the PCDATA except for leading and trailing ones. If newlines matter to you as anything more than whitespace, you'll have to explore this carefully. CoST stores REs as nodes and this may take some getting used to. To manage this, CoST also introduces Pseudo-elements (PELs) which you can think of as invisible tags managing the CDATA and REs. There is confusion between SGML and HyTime as to how to treat REs and I trust Joe's analysis of the problem!

Every node has an address in the parse tree and this can be used for referencing it. costwish uses these addresses extensively since they are the only unique way of referencing nodes. You may therefore see them displayed (e.g. 10:2). There is an option for switching them on in the overview so that subtrees can be located.

Nodes may have the following types of 'content' which are formally distinct though some applications may blur them:

Attributes. e.g. TITLE="some title" is provided for many HTML elements.
Contained elements. e.g. UL in HTML contains LIs (and possibly ULs).
String content (PCDATA).

It's sometimes a matter of taste as to whether something is provided as an attribute or is contained, and whether it is then a string or a separate element. For this reason costwish cannot make intelligent guesses about what the 'content' 'means'.

There's actually a much stronger reason why costwish cannot interpret elements - SGML is precisely semantically free. All it guarantees (but that is considerable) is the abstract structure of a document. Thus:
<FOO>
<BAR>xyzzy</BAR>
</FOO>
and
<AUTHOR>
<NAME>plugh</NAME>
</AUTHOR>
are isomorphous to SGML.
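One way to make 'isomorphous' concrete is to erase the element names and compare what is left. The Python sketch below does this for the two fragments; the (name, children) tuple layout and the shape function are assumptions made for the illustration, not anything provided by CoST or costwish, and the newlines between the tags are ignored.

```python
def shape(node):
    """Return the structure of a node with every element name erased."""
    _name, children = node
    return tuple(
        "#text" if isinstance(child, str) else shape(child)
        for child in children
    )


# Each element is written as (name, [children]); strings are character data.
foo = ("FOO", [("BAR", ["xyzzy"])])
author = ("AUTHOR", [("NAME", ["plugh"])])

print(shape(foo) == shape(author))
```

The comparison sees only 'an element containing an element containing text' in both cases; nothing in the structure says that one of them describes a person.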

It is at the postprocessor stage (costwish) that we start to add semantics to documents. For that reason it is VERY important that costwish scripts are prepared carefully since they have the power to reinterpret or change information. For example:
<HEIGHT>173</HEIGHT>
is meaningless until it is decided whether the UNITS are cm or m! A more involved example is the use of BASE, META, LINK, A in HTML2.0. The DTD does not define precisely how they are to be implemented and Murray Maloney is producing a document for the HTML-WG to coordinate ideas. In using costwish to postprocess HTML2.0 documents, it will be very important to implement these semantics as consistently as possible.

Rendering

The success of HTML on the WWW (which I welcome) has given the impression that documents are primarily for rendering (i.e. 'viewing' or 'browsing'). In fact SGML has great power to transmit document content precisely, and rendering is only one of many operations. Others are archiving, merging, searching, indexing, translation, etc. costwish does not emphasise rendering (and does not honour non-standard HTML tags such as CENTER - if only because they cannot be included in conforming SGML documents).

It's tempting for authors to provide rendering instructions or data in SGML documents, but costwish is unlikely to be able to take much notice of these. In general it can do the following:

Treat all nodes in a document as generic and render for each:
- The attributes
- Any CDATA content
- Any contained elements
In general this will be visually messy since there could be zero or hundreds of any of these components in any node.
Allow the user to write specific routines for any elements they wish, and to allow generic display for the others. This is the recommended route - i.e. that the users customise costwish for their own applications and DTDs. This can even be done at run-time. I have done this for my own DTDs XML and CML and you are welcome to hack through what I've written. I've tried to keep costwish application-independent, but I may not get it quite right at the start.
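The second approach amounts to a table of element-specific routines with a generic fallback. The Python sketch below shows the idea only; costwish itself does this with tcl scripts, and the names register, render and render_generic are hypothetical, as is the choice to pass each node as a name, an attribute dictionary and a content string.

```python
def render_generic(name, attrs, content):
    """Fallback display: the element name, its attributes, then its content."""
    attr_text = " ".join(f'{key}="{value}"' for key, value in sorted(attrs.items()))
    header = f"[{name} {attr_text}]" if attr_text else f"[{name}]"
    return f"{header} {content}".rstrip()


handlers = {}  # element name -> user-supplied rendering routine


def register(name, func):
    """Install (or replace) the routine for one element, even at run-time."""
    handlers[name.upper()] = func


def render(name, attrs, content):
    """Use the element's own routine if there is one, else the generic display."""
    func = handlers.get(name.upper(), render_generic)
    return func(name, attrs, content)


# A user customisation for one element; everything else stays generic.
register("TITLE", lambda name, attrs, content: content.upper())

print(render("TITLE", {}, "costwish"))
print(render("FOO", {"ID": "x1"}, "bar"))
```

Because the table is consulted on every call, a routine registered later takes effect immediately, which is the sense in which customisation can happen at run-time.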

Translating

SGML is an excellent way to provide documents for translation into different target formats. To do this the user must provide tools which decide on an action for each node, and they may also wish to add something about the overall document structure (e.g. table of contents, index, etc). Here are some common examples:

It's almost trivial to reconvert ESIS into a conforming SGML document, but you'll need to supply a list of EMPTY elements. (A typical EMPTY element in HTML 2.0 is IMG which has no closing tag. If one is included, it breaks the parser - this seems unnecessary, but those are the rules.)
It is often easy to transform a text-based document into HTML. For example you might render CHAPTER as H2. Go through your DTD and see whether a tag can be transformed, or whether it's best omitted.
LaTeX. Again it's not too difficult to do this for simple documents, and <CHAPTER>...</CHAPTER> could translate to \chapter{...}.
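Two of these translations can be sketched in a few lines of Python working directly on ESIS lines. This is only an illustration under simplifying assumptions: Joe's own translation routines are not shown here, the function names esis_to_sgml and esis_to_latex and the LATEX table are hypothetical, every attribute with a value is written out explicitly (defaults included), and no escaping of special characters is attempted.

```python
def esis_to_sgml(lines, empty=frozenset()):
    """Rebuild SGML text from ESIS lines; 'empty' lists the EMPTY elements."""
    out = []
    pending = []                       # attributes for the next start tag
    for line in lines:
        if not line:
            continue
        code, rest = line[0], line[1:]
        if code == "A":                # "Aname TYPE value"
            parts = rest.split(" ", 2)
            if len(parts) == 3:
                pending.append(f' {parts[0]}="{parts[2]}"')
        elif code == "(":
            out.append(f"<{rest}{''.join(pending)}>")
            pending = []
        elif code == ")":
            if rest not in empty:      # EMPTY elements get no closing tag
                out.append(f"</{rest}>")
        elif code == "-":
            out.append(rest)
    return "".join(out)


# Hypothetical mapping: element name -> (text at open, text at close).
LATEX = {"CHAPTER": ("\\chapter{", "}")}


def esis_to_latex(lines, mapping=LATEX):
    """Translate mapped elements to LaTeX; unmapped tags are simply dropped."""
    out = []
    for line in lines:
        if not line:
            continue
        code, rest = line[0], line[1:]
        if code == "(":
            out.append(mapping.get(rest, ("", ""))[0])
        elif code == ")":
            out.append(mapping.get(rest, ("", ""))[1])
        elif code == "-":
            out.append(rest)
    return "".join(out)


html_lines = ["-See ", "ASRC CDATA pic.gif", "(IMG", ")IMG", "(B", "-this", ")B"]
print(esis_to_sgml(html_lines, empty={"IMG"}))
print(esis_to_latex(["(CHAPTER", "-Introduction", ")CHAPTER"]))
```

The list of EMPTY elements has to come from outside because, as noted earlier, the ESIS stream no longer carries that part of the DTD.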

Conclusion

costwish and its scripts form a DTD-independent 'core' on which many simple extensions can be built. Some of these will be to support other DTDs, whilst others might add functionality (e.g. translation, searching, etc.). We are hoping to maintain a consistent core for costwish but would be pleased to know of extensions whose incorporation we could then explore.

Peter Murray-Rust
April 1996