NSGMLS (SP 0.2) man page
NSGMLS(1) NSGMLS(1)
NAME
nsgmls - a validating SGML parser
An SGML System Conforming to
International Standard ISO 8879 --
Standard Generalized Markup Language
SYNOPSIS
nsgmls [ -deglprsuv ] [ -alinktype ] [ -iname ] [ -mfile ]
[ filename... ]
DESCRIPTION
Nsgmls parses and validates the SGML document entity in
filename... and prints on the standard output a simple
text representation of its Element Structure Information
Set. (This is the information set which a structure-
controlled conforming SGML application should act upon.)
Note that the document entity may be spread amongst sev-
eral files; for example, the SGML declaration, document
type declaration and document instance set could each be
in a separate file. If no filenames are specified, then
nsgmls will read the document entity from the standard
input. Each filename is actually interpreted as a system
identifier. A command line filename of - can be used to
refer to the standard input. (Normally in a system iden-
tifier, fd:0 is used to refer to standard input.)
The following options are available:
-alinktype
Make link type linktype active. Not all ESIS
information is output in this case: the active LPDs
are not explicitly reported, although each link
attribute is qualified with its link type name;
there is no information about result elements; when
there are multiple link rules applicable to the
current element, nsgmls always chooses the first.
-d Warn about duplicate entity declarations.
-e Describe open entities in error messages. Error
messages always include the position of the most
recently opened external entity.
-g Show the GIs of open elements in error messages.
-iname Pretend that
<!ENTITY % name "INCLUDE">
occurs at the start of the document type declara-
tion subset in the SGML document entity. Since
repeated definitions of an entity are ignored, this
definition will take precedence over any other
definitions of this entity in the document type
declaration. Multiple -i options are allowed. If
the SGML declaration replaces the reserved name
INCLUDE then the new reserved name will be the
replacement text of the entity. Typically the doc-
ument type declaration will contain
<!ENTITY % name "IGNORE">
and will use %name; in the status keyword specifi-
cation of a marked section declaration. In this
case the effect of the option will be to cause the
marked section not to be ignored.
-l Output L commands giving the current line number
and filename.
-mfile Map public identifiers and entity names to system
identifiers using the catalog entry file whose sys-
tem identifier is file. Multiple -m options are
allowed. Catalog entry files specified with the -m
option will be searched before the defaults.
-p Parse only the prolog. Nsgmls will exit after
parsing the document type declaration. Implies -s.
-r Warn about defaulted references.
-s Suppress output. Error messages will still be
printed.
-u Warn about undefined elements: elements used in the
DTD but not defined.
-v Print the version number.
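When nsgmls is driven from a script, the option syntax above can be assembled programmatically. The following Python sketch is illustrative only: the helper name nsgmls_argv and its parameter names are assumptions and not part of nsgmls. It follows the synopsis in attaching each option argument directly to its option letter.

```python
ALLOWED_FLAGS = set("deglprsuv")  # the argument-less options from the synopsis


def nsgmls_argv(filenames=(), flags="", link_types=(), includes=(), catalogs=()):
    """Build an argument vector for nsgmls (hypothetical helper)."""
    unknown = set(flags) - ALLOWED_FLAGS
    if unknown:
        raise ValueError("unknown option letters: " + "".join(sorted(unknown)))
    argv = ["nsgmls"]
    if flags:
        argv.append("-" + flags)
    argv += ["-a" + t for t in link_types]  # -alinktype
    argv += ["-i" + n for n in includes]    # -iname
    argv += ["-m" + f for f in catalogs]    # -mfile
    argv += list(filenames)                 # each is a system identifier
    return argv
```

The resulting list could be handed to subprocess.run, provided an nsgmls executable is installed and on the search path.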
External entities
An external entity resides in one or more storage objects,
each of which contains a sequence of bytes. The entity
manager component of nsgmls maps a sequence of storage
objects into an entity as follows:
1. The bytes in each storage object are converted into
characters, each represented by a single bit combi-
nation, according to the encoding translation asso-
ciated with the storage object.
2. The characters in each storage object are concate-
nated.
3. The sequence of characters is treated as a sequence
of lines each terminated by a line terminator. The
line terminator is either a line feed or a carriage
return or a carriage return followed by a line
feed. Nsgmls determines which line terminator to
use for a storage object according to which of the
possible line terminators is used for the first
line of the storage object. A record start is
inserted at the beginning of each line, and a
record end at the end of each line. If there is a
partial line (a line that doesn't end with the line
terminator) at the end of the entity, then a record
start will be inserted before it but no record end
will be inserted after it.
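The record handling in step 3 can be sketched in Python as follows. This is a simplified illustration, not the nsgmls implementation: it treats a single storage object as the whole entity, assumes the record start and record end codes of the reference concrete syntax (10 and 13), and returns an empty string for empty input, which the text above does not specify. The function names are hypothetical.

```python
RS = "\x0a"  # record start, code 10 (assumed reference concrete syntax)
RE = "\x0d"  # record end, code 13 (assumed reference concrete syntax)


def detect_terminator(text):
    """Return the line terminator used by the first line, or None."""
    for i, ch in enumerate(text):
        if ch == "\n":
            return "\n"
        if ch == "\r":
            return "\r\n" if text[i + 1:i + 2] == "\n" else "\r"
    return None


def to_records(text):
    """Wrap each line in RS ... RE; a partial final line gets RS only."""
    term = detect_terminator(text)
    if term is None:
        return RS + text if text else ""
    lines = text.split(term)
    out = [RS + line + RE for line in lines[:-1]]
    if lines[-1]:
        out.append(RS + lines[-1])  # partial line: no record end
    return "".join(out)
```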
An encoding translation defines a translation between the
storage coding system and the entity coding system. The
storage coding system represents characters by sequences
of bytes; it can be variable width and stateful. The
entity coding system represents each character by a single
bit combination; it is fixed-width (but not limited to 8
bits) and stateless. Note that the SGML declaration
describes the entity coding system, not the storage coding
system.
System identifiers
A system identifier describes a sequence of storage
objects, each optionally associated with an encoding
translation. Nsgmls will attempt to interpret a system
identifier as a keyword followed by a colon followed by a
string, which is interpreted in a keyword-dependent way.
Keywords are case-insensitive. The following keywords are
recognized:
file The string is interpreted as a filename. The sys-
tem identifier describes a single storage object
that will be read from the named file.
fd The string is interpreted as a number. The system
identifier describes a single storage object that will
be read from the file descriptor with that number. For
example, fd:0 will read the storage object from
standard input.
concat The string is treated as a list of substrings sepa-
rated by + characters. Each of the substrings is
in turn interpreted as a system identifier, and the
sequences of storage objects that each denote are
concatenated. The concat system identifier
describes the resulting sequence of storage
objects.
utf8 The string is interpreted as a system identifier.
Each storage object that it describes that is not
associated with an encoding translation is associated
with an encoding translation that translates
UTF8 to a fixed-width encoding. Invalid multi-byte
sequences are represented by the character 0xFFFD.
This keyword is recognized only in the multi-byte
version of nsgmls.
ucs2 The string is interpreted as a system identifier.
Each storage object that it describes that is not
associated with an encoding translation is associated
with an encoding translation that translates
UCS2 to a fixed-width encoding. The more significant
octet of each character always precedes the
less significant octet irrespective of the system's
native byte-order. The codes 0xFFFE and 0xFEFF are
not treated specially in any way. This keyword is
recognized only in the multi-byte version of
nsgmls.
unicode
The string is interpreted as a system identifier.
Each storage object that it describes that is not
associated with an encoding translation is associated
with an encoding translation that translates
the Unicode coding system to a fixed-width
encoding. The Unicode coding system treats each
pair of octets as a character in the system's byte
order. If the first character is the byte order
mark character (0xFEFF), it will be discarded. If
the first character is the byte order mark charac-
ter byte-swapped, it will be discarded and the
remaining characters will be byte-swapped. This
keyword is recognized only in the multi-byte ver-
sion of nsgmls.
ujis The string is interpreted as a system identifier.
Each storage object that it describes that is not
associated with an encoding translation is associated
with an encoding translation where the storage
coding system is variable-width (packed) UJIS
(EUC), and the entity coding system represents each
character in the same way as the EUC complete two-
byte format. In the entity coding system the code
of characters in the G0 set (usually the Japanese
version of ISO 646) is unchanged; the code of characters
in the G1 set (usually JIS X 0208-1990) is
ORed with 0x8080; the code of characters in the G2
set (usually half-width katakana from JIS X
0201-1986) is ORed with 0x0080; the code of charac-
ters in the G3 set (JIS X 0212-1990) is ORed with
0x8000. This keyword is recognized only in the
multi-byte version of nsgmls.
sjis The string is interpreted as a system identifier.
Each storage object that it describes that is not
associated with an encoding translation is associated
with an encoding translation where the storage
coding system is Shift JIS and the entity coding
system is the same as with the ujis encoding trans-
lation (except for characters in the G3 set which
are not representable using Shift JIS.) This key-
word is recognized only in the multi-byte version
of nsgmls.
identity
The string is interpreted as a system identifier.
Each storage object that it describes that is not
associated with an encoding translation is associated
with the identity encoding translation. The
identity coding system converts bytes to characters
by zero-extending each byte.
raw The string is interpreted as a system identifier.
No translation of line-terminators onto RS and RE
characters will be performed for each storage
object that it describes. Error messages referring
to these storage objects will not contain line num-
bers.
huge This keyword is intended for use with huge files,
for which the cost of keeping track of line bound-
aries (roughly one byte per line) is too large.
The string is interpreted as a system identifier.
For each storage object that it describes, nsgmls
will not keep track of where line boundaries occur
as it usually does. Error messages referring to
these storage objects will not contain line num-
bers.
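Two of the encoding translations described above, unicode and ujis, can be sketched in Python as follows. These are illustrations under stated assumptions rather than the nsgmls code: the function names are hypothetical, a trailing odd octet is silently dropped, truncated or malformed UJIS input is not handled, and the single-shift bytes 0x8E (G2) and 0x8F (G3) are those of packed EUC.

```python
import sys

SS2, SS3 = 0x8E, 0x8F  # EUC single shifts selecting G2 and G3


def decode_unicode(data, byteorder=sys.byteorder):
    """Pairs of octets in the given byte order, with byte order mark handling."""
    codes = [int.from_bytes(data[i:i + 2], byteorder)
             for i in range(0, len(data) - 1, 2)]
    if codes and codes[0] == 0xFEFF:
        return codes[1:]                      # discard the byte order mark
    if codes and codes[0] == 0xFFFE:          # byte-swapped byte order mark
        return [((c & 0xFF) << 8) | (c >> 8) for c in codes[1:]]
    return codes


def decode_ujis(data):
    """Map packed UJIS (EUC) bytes to the fixed-width entity codes."""
    codes, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                          # G0: code unchanged
            codes.append(b)
            i += 1
        elif b == SS2:                        # G2: ORed with 0x0080
            codes.append((data[i + 1] & 0x7F) | 0x0080)
            i += 2
        elif b == SS3:                        # G3: ORed with 0x8000
            code = ((data[i + 1] & 0x7F) << 8) | (data[i + 2] & 0x7F)
            codes.append(code | 0x8000)
            i += 3
        else:                                 # G1: ORed with 0x8080
            code = ((b & 0x7F) << 8) | (data[i + 1] & 0x7F)
            codes.append(code | 0x8080)
            i += 2
    return codes
```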
If a system identifier does not contain a keyword or uses
a keyword that is not recognized, then the system identi-
fier will be treated as a filename. Note that the system
identifier file:utf8:doc.sgm identifies the file named
utf8:doc.sgm but utf8:file:doc.sgm identifies the file
named doc.sgm using the utf8 coding scheme.
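The keyword rules above can be illustrated with a short Python sketch. It is not the nsgmls parser: the function name split_sysid and the returned tuple shape are assumptions, and the string after a concat keyword is returned unsplit rather than being interpreted further.

```python
KEYWORDS = {"file", "fd", "concat", "utf8", "ucs2", "unicode",
            "ujis", "sjis", "identity", "raw", "huge"}
# Keywords whose string is itself interpreted as a system identifier.
WRAPPERS = KEYWORDS - {"file", "fd", "concat"}


def split_sysid(sysid):
    """Return (wrapper keywords, final keyword, remaining string)."""
    wrappers = []
    while True:
        head, sep, rest = sysid.partition(":")
        keyword = head.lower()              # keywords are case-insensitive
        if not sep or keyword not in KEYWORDS:
            return wrappers, "file", sysid  # no usable keyword: a filename
        if keyword in WRAPPERS:
            wrappers.append(keyword)
            sysid = rest
            continue
        return wrappers, keyword, rest
```

For example, split_sysid("utf8:file:doc.sgm") yields the wrapper list ["utf8"] with the filename doc.sgm, whereas split_sysid("file:utf8:doc.sgm") yields no wrappers and the filename utf8:doc.sgm.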
A relative filename in a system identifier is interpreted
relative to the file in which the system identifier is
specified, if any, and otherwise relative to the current
directory. This applies both to system identifiers speci-
fied in SGML documents, and to system identifiers speci-
fied in catalog entry files.
If a system identifier does not specify the encoding
translation, the encoding translation of the storage
object in which the system identifier was specified will
be used.
System identifier generation
If a system identifier is not specified, then the entity
manager will attempt to generate one using catalog entry
files in the format defined in the SGML Open Draft
Technical Resolution on Entity Management. A catalog
entry file contains a sequence of entries in one of the
following forms:
PUBLIC pubid sysid
This specifies that sysid should be used as the
system identifier if the public identifier is
pubid. Sysid is a system identifier as defined in
ISO 8879 and pubid is a public identifier as
defined in ISO 8879.
ENTITY name sysid
This specifies that sysid should be used as the
system identifier if the entity is a general entity
whose name is name.
ENTITY %name sysid
This specifies that sysid should be used as the
system identifier if the entity is a parameter
entity whose name is name. Note that there is no
space between the % and the name.
DOCTYPE name sysid
This specifies that sysid should be used as the
system identifier if the entity is an entity
declared in a document type declaration whose docu-
ment type name is name.
LINKTYPE name sysid
This specifies that sysid should be used as the
system identifier if the entity is an entity
declared in a link type declaration whose link type
name is name.
OVERRIDE
This specifies that system identifiers specified in
the catalog should override system identifiers
specified in the document. Normally, if an entity
declaration in the document specifies a system
identifier, the catalog is not consulted. If OVERRIDE
is specified, then the catalog is searched
first; the system identifier specified in the
document is used only if no match is found in
the catalog.
SGMLDECL sysid
This specifies that if the document does not con-
tain an SGML declaration, the SGML declaration in
sysid should be implied.
The last four forms are extensions to the SGML Open for-
mat. The delimiters can be omitted from the sysid pro-
vided it does not contain any white space. Comments are
allowed between parameters delimited by -- as in SGML.
The environment variable SGML_CATALOG_FILES contains a
colon-separated list of catalog entry files. These will
be searched after any catalog entry files specified using
the -m option. If this environment variable is not set,
then a system dependent list of catalog entry files will
be used. A match in a catalog entry file for a PUBLIC
entry will take precedence over a match in the same file
for an ENTITY, DOCTYPE or LINKTYPE entry.
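The catalog format and search order can be sketched in Python as follows. This is a minimal illustration and not the entity manager's implementation: the function names, the case-insensitive treatment of entry keywords, the skipping of unrecognized tokens, and the sample identifiers are all assumptions. Only PUBLIC and ENTITY lookups are shown, and public identifiers are compared literally.

```python
import re

# A comment (-- ... --), a quoted literal, or a bare token.
TOKEN = re.compile(r"""--.*?--|"([^"]*)"|'([^']*)'|(\S+)""", re.S)
ARITY = {"PUBLIC": 2, "ENTITY": 2, "DOCTYPE": 2, "LINKTYPE": 2,
         "SGMLDECL": 1, "OVERRIDE": 0}


def tokenize(text):
    tokens = []
    for m in TOKEN.finditer(text):
        for group in m.groups():
            if group is not None:
                tokens.append(group)
                break
        # when no group matched, the token was a comment and is dropped
    return tokens


def parse_catalog(text):
    """Return a list of entries such as ("PUBLIC", pubid, sysid)."""
    tokens, entries, i = tokenize(text), [], 0
    while i < len(tokens):
        n = ARITY.get(tokens[i].upper())
        if n is None:
            i += 1  # assumption: skip anything unrecognized
            continue
        entries.append((tokens[i].upper(), *tokens[i + 1:i + 1 + n]))
        i += 1 + n
    return entries


def resolve(catalogs, pubid=None, entity=None):
    """Search catalogs in order; within one file PUBLIC wins over ENTITY."""
    for entries in catalogs:
        for kind, key in (("PUBLIC", pubid), ("ENTITY", entity)):
            for entry in entries:
                if key is not None and entry[0] == kind and entry[1] == key:
                    return entry[2]
    return None
```

A parameter entity is looked up by passing its name with the leading %, matching the ENTITY %name form. Catalogs named with -m would appear in the list before those from SGML_CATALOG_FILES.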
System declaration
The system declaration for nsgmls is as follows:
SYSTEM "ISO 8879:1986"
CHARSET
BASESET "ISO 646-1983//CHARSET
International Reference Version (IRV)//ESC 2/5 4/0"
DESCSET 0 128 0
CAPACITY PUBLIC "ISO 8879:1986//CAPACITY Reference//EN"
FEATURES
MINIMIZE DATATAG NO OMITTAG YES RANK YES SHORTTAG YES
LINK SIMPLE YES 65536 IMPLICIT YES EXPLICIT YES 1
OTHER CONCUR NO SUBDOC YES 100 FORMAL YES
SCOPE DOCUMENT
SYNTAX PUBLIC "ISO 8879:1986//SYNTAX Reference//EN"
SYNTAX PUBLIC "ISO 8879:1986//SYNTAX Core//EN"
VALIDATE
GENERAL YES MODEL YES EXCLUDE YES CAPACITY NO
NONSGML YES SGML YES FORMAL YES
SDIF
PACK NO UNPACK NO
The limit for the SUBDOC parameter is memory dependent.
Any legal concrete syntax may be used.
SGML declaration
The SGML declaration may be omitted; in that case the
following declaration will be implied:
<!SGML "ISO 8879:1986"
CHARSET
BASESET "ISO 646-1983//CHARSET
International Reference Version (IRV)//ESC 2/5 4/0"
DESCSET 0 9 UNUSED
9 2 9
11 2 UNUSED
13 1 13
14 18 UNUSED
32 95 32
127 1 UNUSED
CAPACITY PUBLIC "ISO 8879:1986//CAPACITY Reference//EN"
SCOPE DOCUMENT
SYNTAX PUBLIC "ISO 8879:1986//SYNTAX Reference//EN"
FEATURES
MINIMIZE DATATAG NO OMITTAG YES RANK NO SHORTTAG YES
LINK SIMPLE NO IMPLICIT NO EXPLICIT NO
OTHER CONCUR NO SUBDOC YES 99999999 FORMAL YES
APPINFO NONE>
with the exception that characters 128 through 254 will be
assigned to DATACHAR.
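The DESCSET portion of the implied declaration can be read as a table of (described character number, count, base character number) rows. The following Python sketch, whose names are hypothetical, expands the rows listed above into a per-character mapping to make the coverage easier to check; it does not model the DATACHAR exception for characters 128 through 254.

```python
# Rows copied from the implied declaration; None stands for UNUSED.
DESCSET = [
    (0, 9, None),
    (9, 2, 9),
    (11, 2, None),
    (13, 1, 13),
    (14, 18, None),
    (32, 95, 32),
    (127, 1, None),
]


def expand(descset):
    """Map each described character number to its base number or None."""
    mapping = {}
    for start, count, base in descset:
        for k in range(count):
            mapping[start + k] = None if base is None else base + k
    return mapping


mapping = expand(DESCSET)
used = sorted(n for n, base in mapping.items() if base is not None)
```

With these rows, the characters that remain in use are 9, 10, 13 and 32 through 126.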
Nsgmls identifies base character sets using the designat-
ing sequence in the public identifier. The following des-
ignating sequences are recognized:
Designating         ISO           Minimum    Number
Escape              Registration  Character  of          Description
Sequence            Number        Number     Characters
--------------------------------------------------------------------------------
ESC 2/5 4/0         -             0          128         full set of ISO 646 IRV
ESC 2/8 4/0         2             0          128         G0 set of ISO 646 IRV
ESC 2/8 4/2         6             0          128         G0 set of ASCII
ESC 2/13 4/1        100           0          128         G1 set of ISO 8859-1
ESC 2/1 4/0         1             0          32          C0 set of ISO 646
ESC 2/2 4/3         77            0          32          C1 set of ISO 6429
ESC 2/5 2/15 4/0    162           0          65536       ISO 10646 UCS-2 level 1
ESC 2/5 2/15 4/3    174           0          65536       ISO 10646 UCS-2 level 2
ESC 2/5 2/15 4/5    176           0          65536       ISO 10646 UCS-2 level 3
ESC 2/5 2/15 4/1    163           0          2147483648  ISO 10646 UCS-4 level 1
ESC 2/5 2/15 4/4    175           0          2147483648  ISO 10646 UCS-4 level 2
ESC 2/5 2/15 4/6    177           0          2147483648  ISO 10646 UCS-4 level 3
The graphic character sets do not strictly include C0 and
C1 control character sets. For convenience, nsgmls aug-
ments the graphic character sets with the appropriate con-
trol character sets.
Output format
The output is a series of lines. Lines can be arbitrarily
long. Each line consists of an initial command character
and one or more arguments. Arguments are separated by a
single space, but when a command takes a fixed number of
arguments the last argument can contain spaces. There is
no space between the command character and the first argu-
ment. Arguments can contain the following escape
sequences.
\\ A \.
\n A record end character.
\| Internal SDATA entities are bracketed by these.
\nnn The character whose code is nnn octal.
A record start character will be represented by \012.
Most applications will need to ignore \012 and translate
\n into newline.
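An application reading this output might decode the escape sequences along the following lines. The Python sketch below is one possible approach, not a prescribed one: the function name and the sdata_mark parameter are assumptions, octal escapes are taken to have exactly three digits, and the record start (\012) is dropped as suggested above.

```python
import re

ESCAPE = re.compile(r"\\(\\|n|\||[0-7]{3})")


def unescape(arg, sdata_mark=""):
    """Decode the escape sequences in one command argument."""
    def replace(match):
        seq = match.group(1)
        if seq == "\\":
            return "\\"
        if seq == "n":
            return "\n"          # record end translated to newline
        if seq == "|":
            return sdata_mark    # bracket around internal SDATA text
        if seq == "012":
            return ""            # record start is ignored
        return chr(int(seq, 8))  # \nnn: character with octal code nnn
    return ESCAPE.sub(replace, arg)
```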
The possible command characters and arguments are as fol-
lows:
(gi The start of an element whose generic identifier is
gi. Any attributes for this element will have been
specified with A commands.
)gi The end of an element whose generic identifier is gi.
-data Data.
&name A reference to an external data entity name; name
will have been defined using an E command.
?pi A processing instruction with data pi.
Aname val
The next element to start has an attribute name
with value val which takes one of the following
forms:
IMPLIED
The value of the attribute is implied.
CDATA data
The attribute is character data. This is
used for attributes whose declared value is
CDATA.
NOTATION nname
The attribute is a notation name; nname will
have been defined using an N command. This
is used for attributes whose declared value
is NOTATION.
ENTITY name...
The attribute is a list of general entity
names. Each entity name will have been
defined using an I, E or S command. This is
used for attributes whose declared value is
ENTITY or ENTITIES.
TOKEN token...
The attribute is a list of tokens. This is
used for attributes whose declared value is
anything else.
Dename name val
This is the same as the A command, except that it
specifies a data attribute for an external entity
named ename. Any D commands will come after the E
command that defines the entity to which they
apply, but before any & or A commands that refer-
ence the entity.
atype name val
The next element to start has a link attribute with
link type type, name name, and value val, which
takes the same form as with the A command.
Nnname Define a notation nname. This command will be
preceded by a p command if the notation was declared
with a public identifier, and by an s command if the
notation was declared with a system identifier. A
notation will only be defined if it is to be refer-
enced in an E command or in an A command for an
attribute with a declared value of NOTATION.
Eename typ nname
Define an external data entity named ename with
type typ (CDATA, NDATA or SDATA) and notation nname.
This command will be preceded by one or more f commands
giving the filenames generated by the entity
manager from the system and public identifiers, by
a p command if a public identifier was declared for
the entity, and by an s command if a system identifier
was declared for the entity. nname will have
been defined using an N command. Data attributes
may be specified for the entity using D commands.
An external data entity will only be defined if it
is to be referenced in a & command or in an A com-
mand for an attribute whose declared value is
ENTITY or ENTITIES.
Iename typ text
Define an internal data entity named ename with
type typ (CDATA or SDATA) and entity text text. An
internal data entity will only be defined if it is
referenced in an A command for an attribute whose
declared value is ENTITY or ENTITIES.
Sename Define a subdocument entity named ename. This com-
mand will be preceded by one or more f commands
giving the filenames generated by the entity man-
ager from the system and public identifiers, by a p
command if a public identifier was declared for the
entity, and by an s command if a system identifier
was declared for the entity. A subdocument entity
will only be defined if it is referenced in a {
command or in an A command for an attribute whose
declared value is ENTITY or ENTITIES.
ssysid This command applies to the next E, S or N command
and specifies the associated system identifier.
ppubid This command applies to the next E, S or N command
and specifies the associated public identifier.
ffilename
This command applies to the next E or S command and
specifies an associated filename. There will be
more than one f command for a single E or S command
if the system identifier used a colon.
{ename The start of the SGML subdocument entity ename;
ename will have been defined using an S command.
}ename The end of the SGML subdocument entity ename.
Llineno file
Llineno
Set the current line number and filename. The
filename argument will be omitted if only the line
number has changed. This will be output only if
the -l option has been given.
#text An APPINFO parameter of text was specified in the
SGML declaration. This is not strictly part of the
ESIS, but a structure-controlled application is
permitted to act on it. No # command will be out-
put if APPINFO NONE was specified. A # command
will occur at most once, and may be preceded only
by a single L command.
C This command indicates that the document was a con-
forming SGML document. If this command is output,
it will be the last command. An SGML document is
not conforming if it references a subdocument
entity that is not conforming.
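A consumer of this output typically reads it line by line, collecting A commands until the next element start. The Python sketch below shows that pattern for a handful of commands; the function name, the event tuples and the sample element names are hypothetical, escape sequences are left undecoded, and all other commands are passed through untouched.

```python
def parse_esis(lines):
    """Turn nsgmls output lines into a list of simple event tuples."""
    events = []
    pending = {}  # attributes announced by A commands for the next element
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        command, rest = line[0], line[1:]
        if command == "A":
            name, _, value = rest.partition(" ")
            kind, _, data = value.partition(" ")  # e.g. CDATA, TOKEN, IMPLIED
            pending[name] = (kind, data)
        elif command == "(":
            events.append(("start", rest, pending))
            pending = {}
        elif command == ")":
            events.append(("end", rest))
        elif command == "-":
            events.append(("data", rest))
        elif command == "?":
            events.append(("pi", rest))
        elif command == "C":
            events.append(("conforming",))
        else:
            events.append(("other", command, rest))
    return events
```

Because the C command is the last one output when it appears, checking that the final event is ("conforming",) is a simple way to tell whether the document was reported as conforming.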
ENVIRONMENT
NSGMLS_CODE
If this is set to the name of an encoding translation,
then that encoding translation will be used
as the default encoding translation for everything
(including file input, file output, message output,
filenames and command line arguments). Otherwise
the identity encoding translation will be used.
Setting this to ucs2 or unicode is unlikely to give
reasonable results.
SEE ALSO
The SGML Handbook, Charles F. Goldfarb
ISO 8879 (Standard Generalized Markup Language), Interna-
tional Organization for Standardization
BUGS
Not all ESIS information for LINK is reported.
AUTHOR
James Clark (jjc@jclark.com).