James Clark's SGML Parser SP
SP
I have been writing a new SGML parser, called SP, over the last couple
of years. My main goal was to provide a solid base for the SGML
applications that I wanted to write in the future. But I think it now
has sufficient functionality that it may be useful to others. So I've
decided to make it available for alpha-testing.
The code is copyrighted under the same terms as X11R6; these allow
commercial use.
Here are some of the features of SP:
* Written from scratch in C++.
* Supports any concrete syntax allowed by the standard.
* Supports all varieties of LINK (but chains of link processes are
limited to length 1).
* Reentrant: a single process can use multiple parsers at the same
time.
* Includes an application (nsgmls) that generates the sgmls output
format and is command-line compatible with sgmls.
* Includes an application (rast) that generates the RAST output format.
(Converting sgmls output to RAST format isn't sufficient for
testing LINK.)
* Supports large character sets. It can be compiled to use 16-bit
characters internally (32 bits is also a possibility for the
future). This allows the entity manager to deal with all the messy
details of multi-byte encodings. System identifiers can specify
which coding scheme a file uses. Supported coding systems include
UTF-8 and Unicode/UCS-2 (intended for use with ISO 10646) and
UJIS/EUC and Shift-JIS (for Japanese character sets).
* Fast for large documents. For example, on my machine (a Sparc 10),
it processes the 7 MB Exoterica test document (lcrw.sgm) roughly
twice as fast as sgmls. On the other hand, it's slower than sgmls
at parsing prologs. It also uses much more memory than sgmls. I
haven't looked at its performance on other machines, and it may
well be worse.
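To illustrate the large-character-set feature listed above, here is a minimal sketch of the kind of work an entity manager does when a file is stored as UTF-8 but the parser works on 16-bit characters. The function name decodeUtf8 and its error policy (replace anything malformed or above U+FFFF with U+FFFD) are assumptions made for this example; nothing here is taken from SP's sources.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not taken from SP: decode a UTF-8 byte string into
// 16-bit characters. Only code points up to U+FFFF fit in 16 bits, so
// anything else is replaced with U+FFFD, as is malformed input.
std::u16string decodeUtf8(const std::string &bytes)
{
  const char16_t replacement = 0xFFFD;
  const std::size_t n = bytes.size();
  std::u16string out;
  std::size_t i = 0;
  while (i < n) {
    // Read bytes as unsigned so that values >= 0x80 are not negative.
    const unsigned b0 = static_cast<unsigned char>(bytes[i]);
    const unsigned b1 = i + 1 < n ? static_cast<unsigned char>(bytes[i + 1]) : 0;
    const unsigned b2 = i + 2 < n ? static_cast<unsigned char>(bytes[i + 2]) : 0;
    const bool cont1 = (b1 & 0xC0) == 0x80;  // 0 is not a continuation byte
    const bool cont2 = (b2 & 0xC0) == 0x80;
    if (b0 < 0x80) {
      // One byte: 0xxxxxxx
      out.push_back(static_cast<char16_t>(b0));
      i += 1;
    }
    else if ((b0 & 0xE0) == 0xC0 && cont1) {
      // Two bytes: 110xxxxx 10xxxxxx (values below 0x80 are overlong)
      const unsigned c = ((b0 & 0x1F) << 6) | (b1 & 0x3F);
      out.push_back(c >= 0x80 ? static_cast<char16_t>(c) : replacement);
      i += 2;
    }
    else if ((b0 & 0xF0) == 0xE0 && cont1 && cont2) {
      // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
      // Reject overlong forms and the surrogate range D800..DFFF.
      const unsigned c = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
      const bool ok = c >= 0x800 && (c < 0xD800 || c > 0xDFFF);
      out.push_back(ok ? static_cast<char16_t>(c) : replacement);
      i += 3;
    }
    else {
      // Stray continuation byte, truncated sequence, or a lead byte for a
      // code point above U+FFFF: emit one replacement and skip one byte.
      out.push_back(replacement);
      i += 1;
    }
  }
  return out;
}
```

With this sketch, the three bytes E2 82 AC decode to the single 16-bit character U+20AC, so code downstream of the decoder never has to know that the file used a multi-byte encoding.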
It doesn't support any of the following:
* CONCUR (it will parse documents that use CONCUR but you can't make
document types active).
* DATATAG (support for arbitrary short references makes DATATAG of
limited usefulness in my view).
* CAPACITY checking.
I would emphasize that this is an alpha release. This means:
* It hasn't been tested very much (although I have run it on one
large SGML test suite).
* There isn't much documentation. In particular, there's no
documentation on the parser interface. On the other hand, by
reading the header files and looking at the included applications,
you should be able to figure out how to use it.
* I haven't done much release engineering. There's no fancy
configure script. You may well encounter difficulties building it
if your system differs from mine.
* The parser is still under active development. All of the
interfaces are subject to change.
My development platform is a Sparc 10/Solaris 2.3 with gcc 2.6.2.
Earlier versions of gcc will not work. An iostreams library is needed
(such as is included with libg++). You should have little difficulty
building on SunOS 4.1, since I only recently moved from that to
Solaris. I have also compiled it with Sun C++ 4.0, but I wouldn't
recommend using this. The code is in theory quite portable, but it
uses templates very extensively. The mechanisms for instantiating
templates vary greatly between compilers, and many template
implementations are rather buggy, so I would recommend using gcc if
you can. The source code uses long filenames (> 14 characters), so if
you want to build it with a DOS-hosted development system, you will
have to do some renaming (I would suggest building it with a Windows
NT or OS/2 hosted development system instead).
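As background on the template instantiation issue mentioned above: one mechanism that does not depend on a compiler's automatic scheme is explicit instantiation, where the source names each instantiation it needs. The sketch below shows only the syntax. The FixedStack class is hypothetical, and the text above does not say which instantiation mechanism SP itself relies on.

```cpp
#include <cstddef>

// Hypothetical fixed-capacity stack, standing in for any template that a
// program instantiates in many places. Not part of SP.
template<class T, std::size_t N>
class FixedStack {
public:
  FixedStack() : size_(0) {}
  // Returns false instead of overflowing when the stack is full.
  bool push(const T &value) {
    if (size_ == N)
      return false;
    items_[size_++] = value;
    return true;
  }
  // Precondition: the stack is not empty.
  const T &top() const { return items_[size_ - 1]; }
  std::size_t size() const { return size_; }
private:
  T items_[N];
  std::size_t size_;
};

// Explicit instantiation definition: asks the compiler to generate every
// member of FixedStack<int, 4> in this translation unit, rather than
// relying on a compiler-specific automatic scheme.
template class FixedStack<int, 4>;
```

The trade-off is bookkeeping: every instantiation a program uses has to be listed by hand, which gets tedious when templates are used as extensively as described above.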
The following are available:
* source code for version 0.1
* binary of nsgmls for a Sparc running Solaris 2.3
* binary of rast for a Sparc running Solaris 2.3
* binary of nsgmls for a Sparc running SunOS 4.1.3
* binary of rast for a Sparc running SunOS 4.1.3
* unformatted man page for nsgmls (this is also included in the
source code)
* unformatted man page for rast (this is also included in the source
code)
I would very much like to hear about any bugs you find. I am also
interested in hearing about any places where the code does not conform
to the current ANSI C++ working paper. At this point, I'm less
interested in hearing about bugs in compilers that prevent them from
compiling SP.
Although the code is all new, the parser uses many ideas from Charles
Goldfarb's ARCSGML parser and the entity manager uses a number of
ideas from Erik Naggum's POEM. I thank them both.
James Clark
jjc@jclark.com