Notes about SP version 1.0.1

Introduction

Writing HTML files without a validating parser is like trying to write computer programs without a compiler: don't do it! Fortunately, help is readily available on the Internet.

James Clark <jjc@jclark.com> is developing a new implementation of a suite of SGML parser tools, called SP. These include:

nsgmls
-- an sgmls-compatible validating SGML parser.
spam
-- an SGML markup stream editor. This program is capable of supplying missing end tags, expanding abbreviated tags, and also of filling in all known attributes in every begin tag. The last of these can be handy if you want to find what attributes are available, without having to read the HTML grammar.
sgmlnorm
-- a simple SGML tag normalizer (as yet undocumented).
spent
-- prints an SGML entity on the standard output.

Besides being a complete redesign of the earlier successful sgmls implementation, the new programs are designed for the future: they support extended character sets, such as Unicode, and various multi-byte encodings used in East Asian languages.

Some details

The new code is written almost entirely in C++ (just over 50K lines at version 1.0.1, or 2.5 times the size of Don Knuth's TeX or Metafont), and requires template support, a relatively new feature of C++ which is not yet widely available.

WARNING: To build these programs, you will need about 50MB of disk space, unless you remove the default -g compiler option. Doing so reduces the executable sizes from almost 10MB each to about 1.5MB (on a Sun SPARC Solaris 2.3 system). Alternatively, you can build them, then run the UNIX strip command on the executables to remove debug symbols.
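As a rough illustration of the strip alternative, the following sketch strips each SP executable found in a given directory and reports its size before and after. The function name strip_sp_binaries and its directory argument are hypothetical, chosen only for this example; they are not part of the SP distribution.

```shell
#!/bin/sh
# Hypothetical helper (not part of SP): strip debug symbols from the
# four SP programs in the directory given as $1, printing sizes in bytes.
strip_sp_binaries() {
    bindir=$1
    for prog in nsgmls spam sgmlnorm spent; do
        file=$bindir/$prog
        [ -f "$file" ] || continue      # skip programs that were not built
        before=$(wc -c < "$file")
        strip "$file" || return 1       # stop at the first failure
        after=$(wc -c < "$file")
        # Unquoted expansions drop any padding that wc adds to its output.
        echo "$prog:" $before "->" $after bytes
    done
    return 0
}
```

It would be called with the directory that holds the freshly built executables, for example strip_sp_binaries /usr/local/bin if that is where they were installed.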

The SP programs can be compiled and built using recent releases of GNU g++ and libg++ (2.7.1 or later: patches to gcc 2.7.0 are included in the SP distribution). g++ itself is built as part of the GNU gcc compiler installation; although that installation takes a few hours, and requires about 120MB of disk space to be able to run the validation tests before installation, it is straightforward, and should be problem free on most current UNIX systems. The GNU compiler suite has also been built on IBM PC MS DOS and DEC OpenVMS systems, although those versions usually lag behind.

WARNING: With at least libg++ 2.7.1, there is an installation problem that has been reported to the developers: make install does not install libio.a, libiostream.a, and librx.a. libiostream.a is required for building SP, and most other C++ programs. To remedy this, I did the following steps manually in the libg++ directory:

(cd librx; make install)
cp libio/libio*.a /usr/local/lib

Unfortunately, I only discovered this problem after having built libg++ on 8 systems, and then having deleted the build trees after the make install, so I had to do it all over again, sigh...
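One way to avoid repeating that mistake is to confirm that the three libraries are really present in the installation directory before removing a build tree. The sketch below does just that; the function name check_libgxx_libs is hypothetical, and the library directory is passed as an argument because it depends on the local installation prefix.

```shell
#!/bin/sh
# Hypothetical helper: report which of the libraries that "make install"
# may skip are absent from the directory given as $1.
# Returns 0 if all three are present, 1 otherwise.
check_libgxx_libs() {
    libdir=$1
    missing=0
    for lib in libio.a libiostream.a librx.a; do
        if [ ! -f "$libdir/$lib" ]; then
            echo "missing: $libdir/$lib"
            missing=1
        fi
    done
    return $missing
}
```

For instance, check_libgxx_libs /usr/local/lib && rm -rf build-tree would delete a (hypothetically named) build-tree directory only when nothing is missing.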

The SP distribution site has binaries for IBM PC DOS, Intel 386 Linux, Intel 386 Windows NT 3.5, Sun SunOS 4.1.3 and Sun Solaris 2.3, so if you have such a system, you may not need to build any of the SP code from scratch, or to install g++. Binaries are also available for the previous version (0.4) for DEC Alpha OSF/1 3.x and IBM PC OS/2 systems.

Just as with sgmls, lengthy command lines are needed to run these programs successfully. To facilitate their use, I've prepared simple UNIX shell scripts html-ncheck and html-spam to hide the complexity, so that only the HTML files need to be provided on the script command lines.
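To give an idea of what such a wrapper hides, here is a minimal sketch of a checking script. It is not the actual html-ncheck: the function name check_html_files, the CATALOG and NSGMLS variables, and the use of the nsgmls options -s (suppress normal output) and -c (name a catalog file) are all assumptions, and the options in particular should be checked against the nsgmls documentation for the installed version.

```shell
#!/bin/sh
# Hypothetical stand-in for a wrapper such as html-ncheck; the real
# script may differ. Both variables can be overridden from the environment.
CATALOG=${CATALOG:-/usr/local/lib/html-check/lib/catalog}
NSGMLS=${NSGMLS:-nsgmls}

# Validate each file named on the command line; return nonzero if the
# parser reports a failure for any of them.
check_html_files() {
    status=0
    for file in "$@"; do
        echo "==> $file"
        # -s and -c are assumed option names; verify them locally.
        "$NSGMLS" -s -c "$CATALOG" "$file" || status=1
    done
    return $status
}
```

With this in place, only the HTML file names need to be supplied, as in check_html_files index.html notes.html.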

If you have installed the html-check distribution, and you want to use html-spam, you need to add these lines to the end of the HTML catalog file, /usr/local/lib/html-check/lib/catalog:

        -- Added at the suggestion of James Clark <jjc@jclark.com> --
        -- so that spam -p doesn't output the contents of html.decl --
SGMLDECL html.decl

Without this change, the contents of html.decl are copied to the output if -p is included in the spam invocation in html-spam; omitting -p and including html.decl doesn't help, because the <!DOCTYPE ... > line is then lost.
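The catalog change can also be applied with a small script that does nothing if an SGMLDECL entry is already there, so that it is safe to run more than once. This is only a sketch: the function name add_sgmldecl is hypothetical, and the catalog path is passed as an argument rather than assumed.

```shell
#!/bin/sh
# Hypothetical helper: append the SGMLDECL entry to the catalog file
# given as $1, unless the file already contains such an entry.
add_sgmldecl() {
    catalog=$1
    # Catalog keywords are case-insensitive, hence grep -i.
    if grep -i '^SGMLDECL' "$catalog" >/dev/null; then
        echo "SGMLDECL entry already present in $catalog"
        return 0
    fi
    cat >> "$catalog" <<'EOF'
        -- Added at the suggestion of James Clark <jjc@jclark.com> --
        -- so that spam -p doesn't output the contents of html.decl --
SGMLDECL html.decl
EOF
}
```

For the default html-check installation the call would be add_sgmldecl /usr/local/lib/html-check/lib/catalog.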

Installation report

I have successfully built sp-1.0.1 with g++ (gcc 2.7.1 [13-Nov-1995] and libg++ [15-Nov-1995]) on these systems:

using the command

make && make check && make install

On a few of these, minor problems cropped up and were solved; they are discussed further below.

I also made unsuccessful attempts to build SP with native C++ compilers on Hewlett-Packard HP-UX 10.0.1 and Silicon Graphics IRIX 5.3, with a command line like

make CXX=CC CXXFLAGS=-O DEFINES='-DANSI_CLASS_INST $(XDEFINES)'

Numerous compiler errors quickly led to my abandoning the effort.

Compilation with native Sun Solaris 2.3 CC looked initially promising, but linking failed with errors about differing sizes of particular symbols, and with many missing functions arising from template instantiation. This linking problem is just what I found with SP 0.4 on the IBM RS/6000 AIX 3.2.5 systems too.

DECstation ULTRIX 4.3

The make step completed successfully, but the make check failed with a shell script error:

./dotest
sh: bad substitution

I simply switched shells from sh to GNU bash, instead of fiddling with the dotest script:

bash < dotest

The test completed successfully, and make install worked as expected.

IBM RS/6000 AIX 3.2.5

Once the missing libiostream.a problem (see above) was solved, I was able to complete the first successful installation of SP on the IBM RS/6000. I was previously completely unable to get version 0.4 to build successfully with either g++ or native xlC.

I also tried a build with the native C++ compiler, using

make CXX=xlC WARN= DEFINES='-DANSI_CLASS_INST $(XDEFINES)' -i

This may be close to working: here are the compilation errors produced:

sp-1.0.1/entmg:
xlC -ansi -I. -I./../lib -I./../entmgr -DANSI_CLASS_INST -c \
            ExtendEntityManager.C
"ExtendEntityManager.C", line 34.1: 1540-251: (S) The previous
            declaration of "memmove" did not have a linkage
            specification.

sp-1.0.1/app:
xlC -ansi -I. -I./../lib -I./../entmgr -I./../parser -I./../xentmgr \
            -DANSI_CLASS_INST -c LineOutputCodingSystem.C
"LineOutputCodingSystem.C", line 17.1: 1540-293: (W)
            "LineEncoder::output(const Char*,size_t,streambuf*)" hides
            the virtual function
            "Encoder::output(Char*,size_t,streambuf*)".

sp-1.0.1/nsgmls:
xlC -ansi -I. -I./../lib -I./../entmgr -I./../parser -I./../xentmgr \
            -I./../app -DANSI_CLASS_INST -c nsgmls.C
"nsgmls.C", line 77.1: 1540-055: (S) "char**" cannot be converted to
            "const char**".
"nsgmls.C", line 77.1: 1540-306: (I) The previous message applies to
            argument 2 of function "getopt(int,const char**,const
            char*)".

sp-1.0.1/spam:
xlC -ansi -I. -I./../lib -I./../entmgr -I./../parser -I./../xentmgr \
            -I./../app -DANSI_CLASS_INST -c spam.C
"spam.C", line 101.1: 1540-055: (S) "char**" cannot be converted to
            "const char**".
"spam.C", line 101.1: 1540-306: (I) The previous message applies to
            argument 2 of function "getopt(int,const char**,const
            char*)".

sp-1.0.1/sgmlnorm:
xlC -ansi -I. -I./../lib -I./../entmgr -I./../xentmgr -I./../app \
            -I./../api -DANSI_CLASS_INST -c sgmlnorm.C
"sgmlnorm.C", line 43.1: 1540-055: (S) "char**" cannot be converted to
            "const char**".
"sgmlnorm.C", line 43.1: 1540-306: (I) The previous message applies to
            argument 2 of function "getopt(int,const char**,const
            char*)".

sp-1.0.1/spent:
xlC -ansi -I. -I./../lib -I./../entmgr -I./../xentmgr -I./../app \
            -DANSI_CLASS_INST -c spent.C
"spent.C", line 54.1: 1540-055: (S) "char* const*" cannot be converted
            to "const char**".
"spent.C", line 54.1: 1540-306: (I) The previous message applies to
            argument 2 of function "getopt(int,const char**,const
            char*)".

All of the errors about getopt() arise from confusion between const char** and char* const*. The DEC Alpha OSF/1 3.x, Hewlett-Packard HP-UX 10.x, Silicon Graphics IRIX 5.x, and Sun Solaris 2.x header files stdlib.h have the latter, while the IBM RS/6000 stdlib.h file has the former.

As an experiment, I therefore temporarily modified the file spent/spent.C to add a type cast (const char**) to the second argument of getopt(): compilation was then successful, but after adding a needed -L/usr/local/lib search path to the LIBS variable in the Makefile, linking failed with massive numbers of unresolved external names generated from templates. This is the same problem that existed with both g++ 2.6.3 and xlC with sp 0.4, and I therefore abandoned further attempts with the xlC compiler.

NeXT Turbostation Mach 3.0

I modified the top-level SP Makefile to set RANLIB=ranlib. The build of SP then completed successfully, and make check passed all of the validation tests.

Sun SPARC SunOS 4.1.3 with gcc 2.6.2

On Sun SunOS 4.1.3, the Makefile needs to have comment markers removed to generate the lines

LIBOBJS = strerror.o memmove.o
LIBS    = -liostream -lg++ -L/usr/lang/SC1.0/ansi_lib -lansi

Without the -lansi, function strtoul was not resolved from the C or C++ libraries. The Makefile comments

# On SunOS 4, using libg++ 2.6, uncomment this.
# libg++ is needed for strtoul which is used by libiostream.
# LIBS=-liostream -lg++

incorrectly imply that strtoul can be found in libg++.a, but that is not the case. However, the function can be found in the library for the SunOS 4.x half-ANSI acc compiler.