James Clark's SGML Parser SP

                                      SP
                                         
   I have been writing a new SGML parser, called SP, over the last couple
   of years. My main goal was to provide a solid base for the SGML
   applications that I wanted to write in the future. But I think it now
   has sufficient functionality that it may be useful to others. So I've
   decided to make it available for alpha-testing.
   
   The code is copyrighted under the same terms as X11R6; these allow
   commercial use.
   
   Here are some of the features of SP:
     * Written from scratch in C++.
     * Supports any concrete syntax allowed by the standard.
     * Supports all varieties of LINK (but the length of chains of LINK
       processes is limited to 1).
     * Reentrant: a single process can use multiple parsers at the same
       time.
     * Includes an application (nsgmls) that generates sgmls output
       format and is command-line compatible with sgmls.
     * Includes an application (rast) that generates RAST output format.
       (Converting sgmls output to RAST format isn't sufficient for
       testing LINK.)
     * Supports large character sets. It can be compiled to use 16-bit
       characters internally (32 bits is also a possibility for the
       future). This allows the entity manager to deal with all the messy
       details of multi-byte encodings. System identifiers can specify
       which coding scheme a file uses. Supported coding systems include
       UTF-8 and Unicode/UCS-2 (intended for use with ISO 10646) and
       UJIS/EUC and Shift-JIS (for Japanese character sets).
     * Fast for large documents. For example, on my machine (a Sparc 10),
       it processes the 7 MB Exoterica test document (lcrw.sgm) roughly
       twice as fast as sgmls. On the other hand, it's slower than sgmls
       at parsing prologs. It also uses much more memory than sgmls. I
       haven't looked at its performance on other machines, and it may
       well be worse.
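
   To illustrate the kind of work the entity manager takes on for
   multi-byte encodings, here is a minimal sketch of decoding UTF-8 bytes
   into 16-bit characters. It is not SP's code: the names Char16 and
   decodeUtf8 are made up for this example, and replacing malformed or
   out-of-range input with U+FFFD is an assumption, not SP's documented
   behaviour.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical 16-bit internal character type (not SP's actual type).
typedef std::uint16_t Char16;

// Character substituted for input that cannot be decoded (an assumption).
const Char16 kReplacement = 0xFFFD;

// True if b has the form 10xxxxxx, i.e. a UTF-8 continuation byte.
static bool isContinuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// Decode UTF-8 into 16-bit characters. Only sequences of one to three
// bytes fit in 16 bits; anything else yields kReplacement.
std::vector<Char16> decodeUtf8(const std::string &bytes)
{
    std::vector<Char16> out;
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        unsigned long value = kReplacement;
        std::size_t used = 1;  // on error, skip a single byte
        if (b0 < 0x80) {
            value = b0;  // one byte: 0xxxxxxx
        } else if ((b0 & 0xE0) == 0xC0 && i + 1 < n) {
            // two bytes: 110xxxxx 10xxxxxx (0xC0 and 0xC1 would be overlong)
            const unsigned char b1 = static_cast<unsigned char>(bytes[i + 1]);
            if (b0 >= 0xC2 && isContinuation(b1)) {
                value = ((b0 & 0x1Fu) << 6) | (b1 & 0x3Fu);
                used = 2;
            }
        } else if ((b0 & 0xF0) == 0xE0 && i + 2 < n) {
            // three bytes: 1110xxxx 10xxxxxx 10xxxxxx
            const unsigned char b1 = static_cast<unsigned char>(bytes[i + 1]);
            const unsigned char b2 = static_cast<unsigned char>(bytes[i + 2]);
            if (isContinuation(b1) && isContinuation(b2)) {
                const unsigned long v = ((b0 & 0x0Fu) << 12)
                                      | ((b1 & 0x3Fu) << 6)
                                      | (b2 & 0x3Fu);
                const bool overlong = v < 0x800;
                const bool surrogate = v >= 0xD800 && v <= 0xDFFF;
                if (!overlong && !surrogate) {
                    value = v;
                    used = 3;
                }
            }
        }
        out.push_back(static_cast<Char16>(value));
        i += used;
    }
    return out;
}
```

   Once the bytes have been decoded in one place like this, the rest of
   a parser can work purely in terms of fixed-width characters, whatever
   encoding the file used.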
       
   
   
   It doesn't support any of the following:
     * CONCUR (it will parse documents that use CONCUR but you can't make
       document types active).
     * DATATAG (support for arbitrary short references makes DATATAG of
       limited usefulness in my view).
     * CAPACITY checking.
       
   
   
   I would emphasize that this is an alpha release. This means:
     * It hasn't been tested very much (although I have run it on one
       large SGML test suite).
     * There isn't much documentation. In particular, there's no
       documentation on the parser interface. On the other hand, by
       reading the header files and looking at the included applications,
       you should be able to figure out how to use it.
     * I haven't done much release engineering. There's no fancy
       configure script. You may well encounter difficulties building it
       if your system differs from mine.
     * The parser is still under active development. All of the
       interfaces are subject to change.
       
   
   
   My development platform is a Sparc 10/Solaris 2.3 with gcc 2.6.2.
   Earlier versions of gcc will not work. An iostreams library is needed
   (such as is included with libg++). You should have little difficulty
   building on SunOS 4.1, since I only recently moved from that to
   Solaris. I have also compiled it with Sun C++ 4.0, but I wouldn't
   recommend using that compiler. The code is in theory quite portable,
   but it uses templates very extensively. The mechanisms for
   instantiating templates vary greatly between compilers, and many
   template implementations are rather buggy, so I would recommend using
   gcc if you can. The source code uses long filenames (> 14 characters),
   so if you want to build it with a DOS-hosted development system, you
   will have to do some renaming (I would suggest you build it with a
   Windows NT or OS/2 hosted development system).
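
   One way to make template instantiation less dependent on the
   compiler's automatic mechanism is to request each instantiation
   explicitly. The sketch below shows the general technique only; the
   FixedStack template is hypothetical and is not a class from SP, and
   nothing here describes how SP itself arranges its instantiations.

```cpp
#include <cstddef>

// Hypothetical container used only for illustration (not part of SP).
template <class T>
class FixedStack {
public:
    FixedStack() : size_(0) {}

    // Add an item; returns false when the stack is already full.
    bool push(const T &item)
    {
        if (size_ == kCapacity)
            return false;
        items_[size_++] = item;
        return true;
    }

    std::size_t size() const { return size_; }

private:
    enum { kCapacity = 16 };
    T items_[kCapacity];
    std::size_t size_;
};

// Explicit instantiation: generate all members of these specializations
// in this translation unit, rather than leaving it to whatever implicit
// scheme the compiler uses.
template class FixedStack<int>;
template class FixedStack<char>;
```

   Listing the needed instantiations in one source file makes it clear
   which ones exist, at the cost of maintaining that list by hand.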
   
   The following are available:
     * source code for version 0.1
     * binary of nsgmls for a sparc running Solaris 2.3
     * binary of rast for a sparc running Solaris 2.3
     * binary of nsgmls for a sparc running SunOS 4.1.3
     * binary of rast for a sparc running SunOS 4.1.3
     * unformatted man page for nsgmls (this is also included in the
       source code)
     * unformatted man page for rast (this is also included in the source
       code)
       
   
   
   I would very much like to hear about any bugs you find. I am also
   interested in hearing about any places where the code does not conform
   to the current ANSI C++ working paper. At this point, I'm less
   interested in hearing about bugs in compilers that prevent them from
   compiling SP.
   
   Although the code is all new, the parser uses many ideas from Charles
   Goldfarb's ARCSGML parser and the entity manager uses a number of
   ideas from Erik Naggum's POEM. I thank them both.
   
    James Clark
    jjc@jclark.com