MtSgmlQL: I/O formats
The nSGML format is a normalized SGML format that was defined in the MULTEXT project in collaboration with the Language Technology Group (LTG) of the University of Edinburgh. The variant that we use differs slightly from LTG's.
nSGML is defined as follows:
nSGML format

1. Document is valid SGML.
2. Reference concrete syntax is used.
3. No capacity/length restrictions.
4. No short refs or tag minimisation.
5. All end-tags present except for EMPTY elements.
6. No entity references except character references.
7. All character references terminated with ";".
8. No CDATA or RCDATA element content.
9. No attribute value minimisation; literal delimiters (quotes) may not be omitted.
10. No SUBDOCs.
11. No marked sections.
    <!DOCTYPE MEMO SYSTEM "memo.dtd">
    <MEMO TYPE="CONFIDEN">
    <TO>Dr. Watson</TO>
    <FROM>Sherlock Holmes</FROM>
    <BODY>
    <P>Please install PGP on your computer.</P>
    <P>You'll see my public key below.</P>
    </BODY>
    <SIGN TYPE="PGP">
    </MEMO>
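Some of the rules above are purely lexical and can be spot-checked without a DTD. The following Python sketch is a heuristic, not a validator: it looks only at rules 6, 7, 9 and 11, uses regular expressions rather than an SGML parser, and will be confused by comments or by ">" inside attribute values. The function name `lexical_nsgml_problems` is hypothetical and not part of MtSgmlQL.

```python
import re


def lexical_nsgml_problems(text):
    """Return a list of messages for a few lexical nSGML rule violations.

    Heuristic only: rules that need the DTD (validity, EMPTY elements,
    CDATA content, ...) are not checked here.
    """
    problems = []

    # Rule 11: marked sections start with "<![".
    if "<![" in text:
        problems.append("rule 11: marked section")

    # Rule 6: "&" followed by a letter starts a general entity reference.
    if re.search(r"&[A-Za-z]", text):
        problems.append("rule 6: entity reference")

    # Rule 7: a character reference that is not closed by ";".
    if re.search(r"&#[A-Za-z0-9]+(?![A-Za-z0-9;])", text):
        problems.append("rule 7: unterminated character reference")

    # Rule 9: inside each start-tag, blank out quoted literals, then look
    # for "=" that is not followed by a quote.
    for tag in re.findall(r"<[A-Za-z][^<>]*>", text):
        without_literals = re.sub(r'"[^"]*"|\'[^\']*\'', '""', tag)
        if re.search(r'=\s*[^"\s]', without_literals):
            problems.append(f"rule 9: unquoted attribute value in {tag}")

    return problems


if __name__ == "__main__":
    good = '<MEMO TYPE="CONFIDEN"><TO>Dr. Watson</TO><SIGN TYPE="PGP"></MEMO>'
    bad = "<MEMO TYPE=CONFIDEN>&amp; &#65 <![CDATA[x]]></MEMO>"
    print(lexical_nsgml_problems(good))
    print(lexical_nsgml_problems(bad))
```

An empty result only means that none of these four patterns was found; it does not show that the document is valid nSGML.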
The sgmls format is a subset of the output format of the well-known sgmls and nsgmls parsers developed by James Clark.
The subset is defined as follows (the text is extracted from the nsgmls manual).
sgmls format

The output is a series of lines. Lines can be arbitrarily long. Each line consists of an initial command character and one or more arguments. Arguments are separated by a single space, but when a command takes a fixed number of arguments the last argument can contain spaces. There is no space between the command character and the first argument. Arguments can contain the following escape sequences.

    \\      A \.
    \n      A record end character.
    \|      Internal SDATA entities are bracketed by these.
    \nnn    The character whose code is nnn octal. A record start character will be represented by \012. Most applications will need to ignore \012 and translate \n into newline.
    \#n;    The character whose number is n in decimal. n can have any number of digits. This is used for characters that are not representable by the encoding translation used for output (as specified by the NSGML_CODE environment variable). This will only occur with the multibyte version of nsgmls.

The possible command characters and arguments are as follows:

    (gi         The start of an element whose generic identifier is gi. Any attributes for this element will have been specified with A commands.
    )gi         The end of an element whose generic identifier is gi.
    -data       Data.
    Aname val   The next element to start has an attribute name with value val which takes one of the following forms:

        IMPLIED          The value of the attribute is implied.
        CDATA data       The attribute is character data. This is used for attributes whose declared value is CDATA.
        NOTATION nname   The attribute is a notation name; nname will have been defined using a N command. This is used for attributes whose declared value is NOTATION.
        ENTITY name...   The attribute is a list of general entity names. Each entity name will have been defined using an I, E or S command. This is used for attributes whose declared value is ENTITY or ENTITIES.
        TOKEN token...   The attribute is a list of tokens. This is used for attributes whose declared value is anything else.
        ID token         The attribute is an ID value. This will be output only if the -oid option is specified. Otherwise TOKEN will be used for ID values.
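The escape sequences described above can be decoded in a single pass. The Python sketch below is illustrative only: the function name `decode_argument` is hypothetical, and dropping the `\|` brackets around SDATA entities is an assumption (an application may prefer to keep them as markers). Following the manual's advice, `\012` is ignored and `\n` becomes a newline.

```python
import re

# One alternative per escape sequence: \\  \n  \|  \nnn (octal)  \#n; (decimal)
_ESCAPE = re.compile(r"\\(?:(\\)|(n)|(\|)|([0-7]{3})|#(\d+);)")


def decode_argument(arg):
    """Decode the escape sequences in one sgmls-format argument."""

    def replace(match):
        backslash, record_end, sdata, octal, decimal = match.groups()
        if backslash:
            return "\\"
        if record_end:
            return "\n"
        if sdata:
            # Assumption: the SDATA brackets themselves are discarded.
            return ""
        if octal:
            code = int(octal, 8)
            # \012 is the record start character, which is ignored.
            return "" if code == 0o12 else chr(code)
        return chr(int(decimal))

    return _ESCAPE.sub(replace, arg)


if __name__ == "__main__":
    print(repr(decode_argument(r"\012first line\nsecond line")))
    print(repr(decode_argument(r"caf\#233;")))
```

Because the pattern is matched left to right, an escaped backslash followed by the letter n decodes to those two characters and not to a newline.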
    ATYPE TOKEN CONFIDEN
    (MEMO
    (TO
    -Dr. Watson
    )TO
    (FROM
    -Sherlock Holmes
    )FROM
    (BODY
    (P
    -Please install PGP on your computer.
    )P
    (P
    -You'll see my public key below.
    )P
    )BODY
    ATYPE TOKEN PGP
    (SIGN
    )SIGN
    )MEMO
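To show how the two formats relate, here is a minimal Python sketch that reads lines in the sgmls subset above and emits the corresponding tags and data. It is a sketch under several assumptions: the function name `sgmls_to_tags` is hypothetical, escape sequences in arguments are left undecoded, attribute values are assumed not to contain a double quote, and because no DTD is consulted an end-tag is emitted for every element, including EMPTY ones such as SIGN (which nSGML proper would omit).

```python
def sgmls_to_tags(lines):
    """Convert sgmls-format lines into a list of tag and data strings."""
    output = []
    pending = []  # attributes announced by A commands for the next element

    for line in lines:
        if not line:
            continue
        command, rest = line[0], line[1:]
        if command == "A":
            # "name KIND value..." where the value may contain spaces.
            name, kind, *value = rest.split(" ", 2)
            if kind != "IMPLIED":
                pending.append((name, value[0] if value else ""))
        elif command == "(":
            attrs = "".join(f' {name}="{val}"' for name, val in pending)
            output.append(f"<{rest}{attrs}>")
            pending = []
        elif command == ")":
            output.append(f"</{rest}>")
        elif command == "-":
            output.append(rest)
        else:
            raise ValueError(f"command not in the subset: {line!r}")

    return output


if __name__ == "__main__":
    memo = [
        "ATYPE TOKEN CONFIDEN", "(MEMO",
        "(TO", "-Dr. Watson", ")TO",
        "ATYPE TOKEN PGP", "(SIGN", ")SIGN",
        ")MEMO",
    ]
    print("".join(sgmls_to_tags(memo)))
```

Buffering the A commands until the next `(` command mirrors the format itself, in which attributes always precede the start of the element they belong to.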