[Archive copy mirrored from: http://www.w3.org/pub/WWW/Journal/3/s2.connolly.html]

WWW Journal, Issue 3

A Lexical Analyzer for HTML and Basic SGML

Dan Connolly

Abstract

[W3C Working Draft; WD-sgml-lex; June 15, 1996]

The Standard Generalized Markup Language (SGML) is a complex system for developing markup languages. It is used to define the Hypertext Markup Language (HTML) used in the World Wide Web, as well as several other hypermedia document representations.

Systems with interactive performance constraints use only the simplest features of SGML. Unfortunately, the specification of those features is subtly mixed into the specification of SGML in all its generality. As a result, a number of ad-hoc SGML lexical analyzers have been developed and deployed on the Internet, and reliability has suffered.

We present a self-contained specification of a lexical analyzer that uses automated parsing techniques to handle SGML document types limited to a tractable set of SGML features. An implementation is available as well.

Status of This Document

This is a W3C Working Draft for review by W3C members and other interested parties. It is a draft document and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use W3C Working Drafts as reference material or to cite them as other than "work in progress." A list of current W3C tech reports can be found at http://www.w3.org/pub/WWW/TR/.

Please direct comments and questions to www-html@w3.org, an open discussion forum. Include the keyword "sgml-lex" in the subject.

Introduction

"The hypertext markup language is an SGML format." --Tim Berners-Lee, in "About HTML"

The result of that design decision is something of a collision between the World Wide Web development community and the SGML community-- between the quick-and-dirty software community and the formal ISO standards community. It also creates a collision between the interactive, online hypermedia technology and the bulk, batch print publications technology.

SGML, the Standard Generalized Markup Language, is a complex, mature, stable technology. The international standard, ISO 8879:1986 [1], is nearly ten years old, and GML-based systems predate the standard by years. On the other hand, HTML, the Hypertext Markup Language, is a relatively simple, new, and rapidly evolving technology.

SGML has a number of degrees of freedom which are bound in HTML. SGML is a system for defining markup languages, and HTML is one such language; in standard terminology, HTML is an SGML application.

Lexical Analysis of Basic SGML Documents

The degrees of freedom in SGML that the HTML 2.0 specification [4] binds can be separated into high-level, document structure considerations on the one hand, and low-level, lexical details on the other. The document structure issues are specific to the domain of application of HTML, and they are evolving rapidly to reflect new features in the Web.

The lexical properties of HTML 2.0 are very stable by comparison. With a few exceptions, HTML documents fit into a category termed basic SGML documents in the SGML standard. These properties are independent of the domain of HTML application. They are shared by a number of contemporary SGML applications, such as TEI [6], DocBook [9], HTF [10], and IBM-IDDOC [11].

The specification of this straightforward category of SGML documents is, unfortunately, subtly mixed into the specification of SGML in all its generality. The result is that a number of lexically incompatible HTML parser implementations have been developed and deployed [20].

The objectives of the document are to:

  1. Refine the notion of "basic SGML document" to the precise set of features used in HTML 2.0.

  2. Present a more traditional automated model of lexical analysis and parsing for these SGML documents [12].

  3. Make freely available to the Web development community a rigorous specification of this lexical analyzer that can be understood without prior knowledge of SGML.

While this report focuses on the SGML features necessary for HTML 2.0 user agents, it should be applicable to future HTML versions and to extensions of the HTML standard [18], as well as other SGML applications used on the Internet [5]. See the "Future Work" section for discussion.

SGML and Document Types

SGML Documents

An SGML document is a sequence of characters organized as one or more entities for storage and transmission, with a logical hierarchy of elements imposed.

The organization of an SGML document into entities is analogous to the organization of a C program into source files [13]. This report does not formally address entity structure. We restrict our discussion to documents consisting of a single entity.

The element hierarchy of an SGML document is actually the last of three parts. The first two are the SGML declaration and the prologue.

The SGML declaration binds certain variables such as the character strings that serve delimiter roles, and the optional features used. The SGML declaration also specifies the document character set--the set of characters allowed in the document and their corresponding character numbers. For a discussion of the SGML declaration, see [8].

The prologue, or DTD, declares the element types allowed in the document, along with their attributes and content models. The content models express the order and occurrence of elements in the hierarchy.

Document Types and Element Structure

SGML facilitates the development of document types, or specialized markup languages. An SGML application is a set of rules for using one or more document types. Typically, a community such as an industry segment, after identifying a need to interchange data in a rigorous method, develops an SGML application suited to their practices.

The document type definition includes two parts: a formal part, expressed in SGML, called a document type declaration or DTD, and a set of application conventions. An overview of the syntax of a DTD follows. For a more complete discussion, see [7].

The DTD essentially gives a grammar for the element structure of the specialized markup language: the start symbol is the document element name; the productions are specified in element declarations, and the terminal symbols are start-tags, end-tags, and data characters. For example:

<!doctype Memo [ 
<!element Memo         - - (Salutation, P*, Closing?)> 
<!element Salutation   O O (Date & To & Address?)> 
<!element (P|Closing|To|Address) - O (#PCDATA)> 
<!element Date - O EMPTY> 
<!attlist Date 
numeric CDATA #REQUIRED> 
]> 

These four element declarations specify that a Memo consists of a Salutation, zero or more P elements, and an optional Closing. The Salutation is a Date, To, and optionally, an Address.

The notation - - specifies that both start and end tags are required; O O specifies both are optional, and - O specifies that the start tag is required, but the end tag is optional. The notation #PCDATA refers to parsed character data-- data characters with auxiliary markup such as comments mixed in. An element declared EMPTY has no content and no end-tag.

The ATTLIST declaration specifies that the Date element has an attribute called numeric. The #REQUIRED notation says that each Date start-tag must specify a value for the numeric attribute.

The following is a sample instance of the memo document type:

<!doctype memo system> 
<Memo> 
<Date numeric="1994-06-12">  
<To>Third Floor 
<p>Please limit coffee breaks to 10 minutes. 
<Closing>The Management 
</Memo> 

The following left-derivation shows the nearly self-evident structure of SGML documents when viewed at this level:

Memo -> <Memo>, Salutation, P, Closing, </Memo> 
Salutation -> Date, To 
Date -> <Date numeric="1994-06-12"> 
To -> <To>, "Third Floor" 
P -> <P>, "Please limit coffee breaks to 10 minutes." 
Closing -> <Closing>, "The Management" 

The lexical analyzer in this report reports events at the level of start-tags, end-tags, and data.

Basic SGML Language Constructs

Basic SGML documents are like ordinary text files, but the text is enhanced with certain constructs called markup. The markup constructs add structure to documents.

The lexical analyzer separates the characters of a document into markup and data characters. Markup is separated from data characters by delimiters. The SGML delimiter recognition rules include a certain amount of context information. For example, the delimiter string </ is only recognized as markup when it is followed by a letter.
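The contextual recognition rule above can be sketched in C. This is a minimal illustration, not code from the sgml-lex distribution: the names open_kind and tag_open_at are hypothetical, and only the two tag-open delimiters are covered (declarations, processing instructions, and references are left out). It assumes an ASCII-compatible C locale.

```c
#include <ctype.h>

/* Hypothetical result codes for this sketch; not names from sgml-lex. */
enum open_kind { OPEN_NONE, OPEN_START_TAG, OPEN_END_TAG };

/*
 * Decide whether the text at p opens a tag. "<" and "</" count as
 * delimiters only when a letter (a name-start character) follows
 * immediately; otherwise the characters are ordinary data.
 */
enum open_kind tag_open_at(const char *p)
{
    if (p[0] != '<')
        return OPEN_NONE;
    if (p[1] == '/')
        return isalpha((unsigned char)p[2]) ? OPEN_END_TAG : OPEN_NONE;
    return isalpha((unsigned char)p[1]) ? OPEN_START_TAG : OPEN_NONE;
}
```

With this rule, the text "< x >" and "</234>" is treated as data, while "<x>" and "</X>" open a start-tag and an end-tag respectively.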

For a formal specification of the language constructs, see the lex specification in the Appendix (which is part of the implementation source distribution [19]). The following sections contain an informal overview.

Markup Declarations

Each SGML document begins with a document type declaration. Comment declarations and marked section declarations are other types of markup declarations.

The string <! followed by a name begins a markup declaration. The name is followed by parameters and a >. A [ in the parameters opens a declaration subset, which is a construct prohibited by this report.

The string <!-- begins a comment declaration. The -- begins a comment, which continues until the next occurrence of --. A comment declaration can contain zero or more comments. The string <!> is an empty comment declaration.
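The comment declaration rule can be checked with a small C routine. The sketch below is illustrative only: count_comments is a hypothetical helper, not part of the sgml-lex API, and it handles pure comment declarations (not comments embedded in other declarations such as doctype). It assumes that white space may follow each comment, and that the first comment, if any, must start immediately after <!.

```c
#include <string.h>

/* Skip the separator characters: space, tab, and both line-end characters. */
static const char *skip_space(const char *s)
{
    while (*s == ' ' || *s == '\t' || *s == '\n' || *s == '\r')
        s++;
    return s;
}

/*
 * Return the number of comments in a comment declaration such as
 * "<!-- a -- -- b -->", or -1 if decl is not a well-formed comment
 * declaration. "<!>" is the empty comment declaration (zero comments).
 */
int count_comments(const char *decl)
{
    const char *s = decl;
    int n = 0;

    if (strncmp(s, "<!", 2) != 0)
        return -1;
    s += 2;
    /* Each "--" opens a comment that runs to the next "--". */
    while (strncmp(s, "--", 2) == 0) {
        const char *close = strstr(s + 2, "--");
        if (close == NULL)
            return -1;              /* comment never closed */
        n++;
        s = skip_space(close + 2);
    }
    /* Only the closing ">" may remain. */
    return (s[0] == '>' && s[1] == '\0') ? n : -1;
}
```

Under these assumptions, a run of twelve hyphens between <! and > counts as three empty comments, while a run of five hyphens is rejected.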

The string <![ begins a marked section declaration, which is prohibited by this report.

For example:

<!doctype foo> 
<!DOCTYPE foo SYSTEM> 
<!doctype bar system "abcdef"> 
<!doctype BaZ public "-//owner//DTD description//EN"> 
<!doctype BAZ Public "-//owner//DTD desc//EN" "sysid"> 
<!> 
another way to escape < and &: <<!>xxx &<!>abc; 
<!-- xyz --> 
<!-- xyz -- --def--> 
<!---- ---- ----> 
<!------------> 
<!doctype foo --my document type-- system "abc"> 

The following examples contain no markup. They illustrate that <! does not always signal markup.

<! doctype> <!,doctype> <!23> 
<!- xxx -> <!-> <!-!> 

The following are errors:

<!doctype xxx,yyy> 
<!usemap map1> 
<!-- comment-- xxx> 
<!-- comment -- -> 
<!-----> 

The following are errors, but they are not reported by this lexical analyzer.

<!doctype foo foo foo> 
<!doctype foo 23 17> 
<!junk decl> 

The following are valid SGML constructs that are prohibited by this report:

<!doctype doc [ <!element doc - - ANY> ]> 
<![ IGNORE [ lkjsdflkj sdflkj sdflkj  ]]> 
<![ CDATA [ lskdjf lskdjf lksjdf ]]> 

Tags

Tags are used to delimit elements. Most elements have a start-tag, some content, and an end-tag. Empty elements have only a start-tag. For some elements, the start-tag and/or end-tag are optional. Empty elements and optional tags are structural constructs specified in the DTD, not lexical issues.

A start-tag begins with < followed by a name, and ends with >. The name refers to an element declaration in the DTD. An end-tag is similar, but begins with </.

For example:

<x> yyy </X> 
<abc.DEF   > ggg </abc.def > 
<abc123.-23> 
<A>abc def <b>xxx</b>def</a> 
<A>abc def <b>xxxdef</a> 

The following examples contain no markup. They illustrate that the < and </ strings do not always signal markup.

< x > <324 </234> 
<==> < b> 
<%%%> <---> <...> <---> 

The following examples are errors:

<xyz!> <abc/> 
</xxx/> <xyz&def> <abc_def> 

These last few examples illustrate valid SGML constructs that are prohibited in the languages described by this report:

<> xyz </> 
<xxx<yyy> </yyy</xxx> 
<xxx/content/ 

Names

A name is a name-start character (a letter) followed by any number of name characters (letters, digits, periods, or hyphens). Entity names are case sensitive, but all other names are not.
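The definition of a name translates directly into a predicate. The following C sketch uses hypothetical function names (is_name_char, is_sgml_name) and assumes an ASCII-compatible C locale, so that isalpha and isalnum accept exactly the letters and digits meant here.

```c
#include <ctype.h>

/* A name character is a letter, a digit, a period, or a hyphen. */
int is_name_char(int c)
{
    return isalnum(c) || c == '.' || c == '-';
}

/*
 * Return 1 if s is a name: one name-start character (a letter)
 * followed by zero or more name characters; return 0 otherwise.
 */
int is_sgml_name(const char *s)
{
    if (!isalpha((unsigned char)*s))
        return 0;
    for (s++; *s != '\0'; s++) {
        if (!is_name_char((unsigned char)*s))
            return 0;
    }
    return 1;
}
```

For instance, abc123.-23 is a name under this definition, while abc_def and 324 are not.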

Attributes

Start tags may contain attribute specifications. An attribute specification consists of a name, an equals sign (=), and a value specification. The name refers to an item in an ATTLIST declaration.

The value can be a name token or an attribute value literal. A name token is one or more name characters. An attribute value literal is a string delimited by double-quotes (") or a string delimited by single-quotes ('). Interpretation of attribute value literals is covered in the discussion of the lexical analyzer API.
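The two forms of value specification can be told apart with a short classifier. This is a sketch under stated assumptions, not the library's own code: value_kind and classify_value are hypothetical names, the input is the value text alone (with no surrounding white space), and an ASCII-compatible C locale is assumed.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical classification codes for this sketch. */
enum value_kind { VALUE_INVALID, VALUE_NAME_TOKEN, VALUE_LITERAL };

enum value_kind classify_value(const char *v)
{
    size_t n = strlen(v);
    size_t i;

    if (n >= 2 && (v[0] == '"' || v[0] == '\'')) {
        /* Literal: closed by the same quote, which must not occur inside. */
        if (v[n - 1] != v[0])
            return VALUE_INVALID;
        for (i = 1; i + 1 < n; i++) {
            if (v[i] == v[0])
                return VALUE_INVALID;
        }
        return VALUE_LITERAL;
    }
    if (n == 0)
        return VALUE_INVALID;
    /* Name token: one or more name characters, and nothing else. */
    for (i = 0; i < n; i++) {
        unsigned char c = (unsigned char)v[i];
        if (!(isalnum(c) || c == '.' || c == '-'))
            return VALUE_INVALID;
    }
    return VALUE_NAME_TOKEN;
}
```

Unlike a name, a name token may begin with a digit or a period, so .76meters and 23 are name tokens. A value such as 50% or http://foo/bar/ is neither a name token nor a literal and therefore needs quotes.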

If the ATTLIST declaration specifies an enumerated list of names, and the value specification is one of those names, the attribute name and "=" may be omitted.

For example:

<x attr="val"> 
<x ATTR ="val" val> 
<y aTTr1= "val1"> 
<yy attr1='xyz' attr2="def" attr3='xy"z' attr4="abc'def"> 
<xx abc='abc&#34;def'> 
<xx aBC="fred &amp; barney"> 
<z attr1 = val1 attr2 = 23 attr3 = 'abc'> 
<xx val1 val2 attr3=.76meters> 
<a href=foo.html> ..</a> <a href=foo-bar.html>..</a> 

The following examples illustrate errors:

<x attr = abc$#@> 
<y attr1,attr2> 
<tt =xyz> 
<z attr += 2> 
<xx attr=50%> 
<a href=http://foo/bar/> 
<a href="http://foo/bar/> ... </a> ... <a href="xyz">...</a> 
<xx "abc"> 
<xxx abc=> 

Character References and Entity References

Characters in the document character set can be referred to by numeric character references. Entities declared in the DTD can be referred to by entity references.

An entity reference begins with & followed by a name, followed by an optional semicolon (;).

A numeric character reference begins with &# followed by a number followed by an optional semicolon. (The string &# followed by a name is a construct prohibited by this report.) A number is a sequence of digits.
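As an illustration, a numeric character reference can be scanned as follows. The names ref_kind and scan_numcharref are hypothetical and do not come from the sgml-lex API. The sketch assumes an ASCII-compatible C locale, treats digits followed directly by other name characters as an error, accepts a semicolon or a newline as the optional terminator, and does not check the number against the document character set or guard against overflow of very long digit strings.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch. */
enum ref_kind { REF_NONE, REF_NUMERIC, REF_ERROR };

static int name_char(unsigned char c)
{
    return isalnum(c) || c == '.' || c == '-';
}

/*
 * Scan text that may start with a numeric character reference.
 * REF_NONE:    s does not start with "&#" plus a digit; outputs untouched.
 * REF_NUMERIC: *value holds the number, *len the characters consumed.
 * REF_ERROR:   digits were followed by other name characters ("&#23x;");
 *              *len covers the offending text, *value is not meaningful.
 */
enum ref_kind scan_numcharref(const char *s, unsigned long *value, size_t *len)
{
    size_t i = 2;
    unsigned long v = 0;
    enum ref_kind kind = REF_NUMERIC;

    if (s[0] != '&' || s[1] != '#' || !isdigit((unsigned char)s[2]))
        return REF_NONE;
    while (isdigit((unsigned char)s[i])) {
        v = v * 10 + (unsigned long)(s[i] - '0');
        i++;
    }
    if (name_char((unsigned char)s[i])) {
        kind = REF_ERROR;
        while (name_char((unsigned char)s[i]))
            i++;
    }
    if (s[i] == ';' || s[i] == '\n')
        i++;                        /* optional reference end */
    *value = v;
    *len = i;
    return kind;
}
```

A named character reference such as &#SPACE; yields REF_NONE here, since that construct is handled (and prohibited) separately.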

The following examples illustrate character references and entity references:

&#38; &#200; 
&amp; &ouml; 
&#38 &#200,xxx 
&amp &abc() &xy12/.. 
To illustrate the X tag, write &lt;X&gt; 

These examples contain no markup. They illustrate that & does not always signal markup.

a & b, a &# b 
a &, b &. c 
a &#-xx &100 

These examples are errors:

&#2000000; &#20.7 &#20-35 
&#23x; 

The following are valid SGML, but prohibited by this report:

&#SPACE; 
&#RE; 

Processing Instructions

Processing instructions are a mechanism to capture platform-specific idioms. A processing instruction begins with <? and ends with >.
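Because a processing instruction simply runs from <? to the next >, recognizing one takes only a few lines. The helper name pi_length in this C sketch is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/*
 * If s starts with a processing instruction, return its length in
 * characters, including the "<?" and ">" delimiters. Return 0 if s
 * does not start with "<?" or the closing ">" is missing.
 */
size_t pi_length(const char *s)
{
    const char *close;

    if (strncmp(s, "<?", 2) != 0)
        return 0;
    close = strchr(s + 2, '>');
    return close != NULL ? (size_t)(close - s) + 1 : 0;
}
```

Note that the content of a processing instruction cannot itself contain >, since the first > ends it.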

For example:

<?>
<?style tt = font courier>
<?page break>
<?experiment> ... <?/experiment>  

The Application Programmer Interface (API) to the Lexical Analyzer

An implementation of this specification is available [19], in the form of an ANSI C library. This section documents the API to the library. Note that the library is undergoing testing and revision. The API is expected to change.

The client of the lexical analyzer creates a data structure to hold the state of the lexical analyzer with a call to SGML_newLexer, and uses calls to SGML_lex to scan the data. Constructs are reported to the caller via three callback functions. SGML_lexNorm is used to set case folding of names and whitespace normalization, and SGML_lexLine can be used to get the number of lines the lexer has encountered.

The output of the lexical analyzer, for each construct, is an array of strings, and an array of enumerated types in one-to-one correspondence with the strings.

Data Characters

Data characters are passed to the primary callback function as an array of one single string containing the data characters and SGML_DATA as the type.

Note that the output contains all newlines (record end characters) from the input verbatim. Implementing the rules for ignoring record end characters (as per section 7.6.1 of SGML) is left to the client.

Tags and Attributes

Start-tags and end-tags are also passed to the primary callback function.

For a start-tag, the first element of the output array is a string of the form <name with SGML_START as the corresponding type. If requested (via SGML_lexNorm), the name is folded to lowercase. As shown below, the remaining elements of the array give the attributes. For an end-tag, the first element of the array is a case-folded string of the form </name with SGML_END as the type.

The output for attributes is included with the tag in which they appear. Attributes are reported as name/value pairs. The attribute name is output as a string of the form name and SGML_ATTRNAME as the type. An omitted name is reported as NULL.

An attribute value literal is output as a string of the form "xxx" or 'xxx' including the quotes, with SGML_LITERAL as the type. Other attribute values are returned as a string with SGML_NMTOKEN as the type. For example:

<xX val1 val2 aTTr3=.76meters> 

is passed as an array of eight strings:

[Tag/Data] 
Start Tag: '<xx' 
  Attr Name: '' 
  Name: 'val1' 
  Attr Name: '' 
  Name: 'val2' 
  Attr Name: 'attr3' 
  Name Token: '.76meters' 
  Tag Close: '>' 
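A client callback might walk such an event as follows. This is a sketch under stated assumptions: the enum tok_type and the function attr_value are hypothetical stand-ins for the library's own types (the actual callback signature is not given in this report), and the strings are assumed to be NUL-terminated, with NULL for an omitted attribute name.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical token types mirroring the kinds listed in the output above. */
enum tok_type {
    TOK_START, TOK_ATTRNAME, TOK_NAME, TOK_NMTOKEN, TOK_LITERAL, TOK_TAGC
};

/*
 * Look up an attribute by name in a start-tag event given as parallel
 * arrays of types and strings. Returns the value string reported right
 * after the matching attribute name, or NULL if the name is absent.
 */
const char *attr_value(const enum tok_type *types,
                       const char *const *strings,
                       int qty, const char *name)
{
    int i;

    for (i = 0; i + 1 < qty; i++) {
        if (types[i] == TOK_ATTRNAME
            && strings[i] != NULL            /* NULL marks an omitted name */
            && strcmp(strings[i], name) == 0)
            return strings[i + 1];
    }
    return NULL;
}
```

For the event listed above, looking up attr3 returns the name token .76meters. Values with omitted names (val1, val2) cannot be found by name, and the client must resolve them against the enumerated lists in the ATTLIST declaration.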

Note that attribute value literals are output verbatim. Interpretation is left to the client. Section 7.9.3 of SGML says that an attribute value literal is interpreted as an attribute value by:

Character and Entity References

A character reference refers to the character in the document character set whose number it specifies. For example, if the document character set is ISO 646 IRV (aka ASCII), then &#65; is another way to write A.

A numeric character reference is passed to the primary callback as an event whose first token type is SGML_NUMCHARREF and whose string takes the form &#999. The second token, if present, has type SGML_REFC, and consists of a semi-colon (;) or a newline.

A general entity reference is passed as an event whose first token is of the form &name with SGML_GEREF as its type. The second token, if present, has type SGML_REFC, and consists of a semi-colon (;) or a newline.

The reference should be checked against those declared in the DTD by the client.

Other Markup

Other markup is passed to the second callback function.

A comment declaration is reported as the string <! with type SGML_MARKUP_DECL, followed by zero or more strings of the form

-- comment -- 

with SGML_COMMENT as the type, followed by > with type MDC.

Other markup declarations are output as a string of the form <!doctype followed by strings of type SGML_NAME, SGML_NUMBER, SGML_LITERAL, and/or SGML_COMMENT, followed by TAGC.

For example:

<!Doctype Foo --my document type-- System "abc"> 

is reported as:

[Aux Markup] 
Markup Decl: '<!doctype' 
  Name: 'foo' 
  Comment: '--my document type--' 
  Name: 'system' 
  Literal: '"abc"' 
  Tag Close: '>' 

Processing instructions are passed as a string of the form <?pi stuff> with type SGML_PI.

Errors and Limitations

Errors are passed to the third callback function. Two strings and two types are passed. For errors, the first string is a descriptive message, and the type is SGML_ERROR. The second string is the offending data, and the type is SGML_DATA.

Limitations imposed in this report are output similarly, but with type SGML_LIMITATION instead of SGML_ERROR. The lexical analyzer skips to a likely end of the error construct before continuing.

For example:

<tag xxx=yyy ?>xxx <![IGNORE[ a<b>c]]> zzz 

causes six callbacks:

[Err/Lim] 
!!Error!!: 'bad character in tag' 
  Data: '?' 
[Tag/Data] 
Start Tag: '<tag' 
  Attr Name: 'xxx' 
  Name Token: 'yyy' 
  Tag Close: '>' 
[Tag/Data] 
Data: 'xxx ' 
[Err/Lim] 
!!Limitation!!: 'marked sections not supported' 
  Data: '<![' 
[Err/Lim] 
!!Limitation!!: 'declaration subset: skipping' 
  Data: 'IGNORE[ a<b>c' 
[Tag/Data] 
Data: ' zzz' 

Differences from Basic SGML

In section 15.1.1 of the SGML standard, a Basic SGML document is defined as an SGML document that uses the reference concrete syntax and the SHORTTAG and OMITTAG features. A concrete syntax is a binding of the SGML abstract syntax to concrete values. The reference concrete syntax binds, for example, the delimiter role STAGO to the string <. The documents described in this report are Basic SGML documents, with the exceptions described in this section.

Some of these exceptions are likely to be reflected in the ongoing revision of SGML [3].

Arbitrary Limitations Removed

The reference concrete syntax includes certain limitations (capacities and quantities, in the language of the standard). For most purposes, these limitations are unnecessary. We remove them:

Long Names

The reference concrete syntax binds the parameter NAMELEN to 8. This means that names are limited to 8 characters. We remove this limitation. Arbitrarily long names are allowed.

Long Attribute Value Literals

We similarly remove the limitation of setting LITLEN to 960 and ATTSPLEN to 240.

Simplifications

We require the SGML declaration to be implicit and the DTD to be included by reference only:

SGML declaration

The SGML declaration is generally transmitted out of band, and is assumed by the sender and the receiver. The lexical analyzer will accept an in-line SGML declaration, but it will not adhere to the declarations therein. The lexical analyzer client should signal an error.

Internal declaration subset

The DTD is often included by reference, but some documents contain additional entity, element and attribute declarations in the <!DOCTYPE declaration. We prohibit additional declarations in the <!DOCTYPE declaration (see "Internal Declaration Subsets" in the "Future Work" section).

Parameter entity reference

The %name; construct is a parameter entity reference--similar to a reference to a C macro. There is little use for these entity references given the above limitations. An occurrence of a parameter entity in a markup declaration is prohibited.

Named character references

The construct &#SPACE; refers to a space character. This construct is not widely supported, and is reported as a limitation.

Marked sections

The construct <![ IGNORE [ ... ]]> is similar to the #ifdef construct in the C preprocessor. It is a novel construct that can be used to represent effectivity (applicability of parts of a document to various environments, depending on locale, client capabilities, user preferences, etc.). We expect that it will be deployed eventually (see "Marked Sections"), but to avoid interoperability issues, we prohibit its use.

Shorthand Markup Prohibited

Some constructs save typing, but add no expressive capability to the languages. And while they technically introduce no ambiguity, they reduce the robustness of documents, especially when the language is enhanced to include new elements. The SHORTTAG constructs related to attributes are widely used and implemented, but those related to tags are not.

These are relatively straightforward to support, but they are not widely deployed. While documents that use them are conforming SGML documents, they will not work with the deployed HTML tools. This lexical analyzer signals a limitation when encountering these constructs.

NET tags 
     <name/.../ 
 
Unclosed Start Tag 
     <name1<name2> 
 
Empty Start Tag 
     <> 
 
Empty End Tag 
     </> 

In addition, the lexical analyzer assumes no short references are used.

Future Work

This report presents technology that is usable, but not complete. Work is ongoing in the following areas. Contributions are welcome. Send a note to www-html@w3.org with "sgml-lex" in the subject.

Marked Sections

Support for marked sections is an integral part of a strategy for interoperability among HTML user agents supporting different HTML dialects [18]. It has other valuable applications, and it is a straightforward addition to the lexical analyzer in this report.

Internationalization

Support for character encodings and coded character sets other than ASCII is a requirement for production use. Support for the X Windows compound text encoding (related to ISO-2022) and the UTF-8 or perhaps UCS-2 encoding of Unicode (ISO-10646), with extensibility for other character encodings seems most desirable.

Internal Declaration Subset Support

Internal declaration subsets are not expected to become a part of HTML. But the technology in this report is applicable to other SGML applications, and internal declaration subsets are a straightforward addition to this lexical analyzer. Relevant mechanisms include:

Short References and Empty End-tags

While they may increase the complexity of the lexical analyzer, short references may be necessary to support math markup in HTML. Empty end-tags are not likely to be used in HTML, as they interact badly with conventions for handling undeclared element tags. But in other SGML applications, they are a useful feature.

Appendix: Flex Specification and Source Distribution

A formal specification of the lexical analyzer discussed in this report is given in the form of a flex input file in Example 1.

The flex input file is part of the sgml-lex source distribution, which contains an implementation of the API discussed above, and some test materials.

The source distribution is provided under the W3C copyright, which allows unlimited redistribution for any purpose.
/* $Id: s2.connolly.html,v 1.2 1996/12/11 16:57:41 jigsaw Exp $ */
/* sgml.l -- a lexical analyzer for Basic+/- SGML Documents
* See: "A Lexical Analyzer for HTML and Basic SGML"
*/
 
/*
* NOTE: We assume the locale used by lex and the C compiler
* agrees with ISO-646-IRV; for example: '1' == 0x31.
*/
 
 
/* Example 1 -- Character Classes: Abstract Syntax */
 
Digit [0-9]
LCLetter [a-z]
Special ['()_,\-\./:=?]
UCLetter [A-Z]
 
/* Example 2 -- Character Classes: Concrete Syntax */
 
LCNMCHAR [\.-]
/* LCNMSTRT [] */
UCNMCHAR [\.-]
/* UCNMSTRT [] */
/* @# hmmm. sgml spec says \015 */
RE \n
/* @# hmmm. sgml spec says \012 */
RS \r
SEPCHAR \011
SPACE \040
 
/* Example 3 -- Reference Delimiter Set: General */
 
COM "--"
CRO "&#"
DSC "]"
DSO "["
ERO "&"
ETAGO "</"
LIT \"
LITA "'"
MDC ">"
MDO "<!"
MSC "]]"
NET "/"
PERO "%"
PIC ">"
PIO "<?"
REFC ";"
STAGO "<"
TAGC ">"
 
/* 9.2.1 SGML Character */
 
/*name_start_character {LCLetter}|{UCLetter}|{LCNMSTRT}|{UCNMSTRT}*/
name_start_character {LCLetter}|{UCLetter}
name_character {name_start_character}|{Digit}|{LCNMCHAR}|{UCNMCHAR}
 
/* 9.3 Name */
 
name {name_start_character}{name_character}*
number {Digit}+
number_token {Digit}{name_character}*
name_token {name_character}+
 
/* 6.2.1 Space */
s {SPACE}|{RE}|{RS}|{SEPCHAR}
ps ({SPACE}|{RE}|{RS}|{SEPCHAR})+
 
/* trailing white space */
ws ({SPACE}|{RE}|{RS}|{SEPCHAR})*
 
/* 9.4.5 Reference End */
reference_end ({REFC}|{RE})
 
/*
* 10.1.2 Parameter Literal
* 7.9.3 Attribute Value Literal
* (we leave recognition of character references and entity references,
* and whitespace compression to further processing)
*
* @# should split this into minimum literal, parameter literal,
* @# and attribute value literal.
*/
literal ({LIT}[^\"]*{LIT})|({LITA}[^\']*{LITA})
 
 
 
/* 9.6.1 Recognition modes */
 
/*
* Recognition modes are represented here by start conditions.
* The default start condition, INITIAL, represents the
* CON recognition mode. This condition is used to detect markup
* while parsing normal data characters (mixed content).
*
* The CDATA start condition represents the CON recognition
* mode with the restriction that only end-tags are recognized,
* as in elements with CDATA declared content.
* (@# no way to activate it yet: need hook to parser.)
*
* The TAG recognition mode is split into two start conditions:
* ATTR, for recognizing attribute value list sub-tokens in
* start-tags, and TAG for recognizing the TAGC (">") delimiter
* in end-tags.
*
* The MD start condition is used in markup declarations. The COM
* start condition is used for comment declarations.
*
* The DS condition is an approximation of the declaration subset
* recognition mode in SGML. As we only use this condition after signalling
* an error, it is merely a recovery device.
*
* The CXT, LIT, PI, and REF recognition modes are not separated out
* as start conditions, but handled within the rules of other start
* conditions. The GRP mode is not represented here.
*/
 
/* EXCERPT ACTIONS: START */
 
/* %x CON == INITIAL */
%x CDATA
 
%x TAG
%x ATTR
%x ATTRVAL
%x NETDATA
%x ENDTAG
/* this is only to be permissive with bad end-tags: */
%x JUNKTAG
 
%x MD
%x COM
%x DS
 
/* EXCERPT ACTIONS: STOP */
 
%%
 
int *types = NULL;
char **strings = NULL;
size_t *lengths = NULL;
int qty = 0;
 
/*
* See sgml_lex.c for description of
* ADD, CALLBACK, ERROR, TOK macros.
*/
 
 
/*
* 9.6 Delimiter Recognition and
* Figure 3 -- Reference Delimiter Set: General
*
* This is organized by recognition mode: first CON, then TAG,
* MD, and DS. Within a mode, the rules are ordered alphabetically
* by delimiter name.
*/
 
 
/* &#60; -- numeric character reference */
<INITIAL,NETDATA>{CRO}{number}{reference_end}? {
reference(yytext, yyleng, SGML_NUMCHARREF, 0,
l, tokF, tokObj);
}
 
/* &#60xyz. -- syntax error */
<INITIAL,NETDATA>{CRO}{number_token}{reference_end}? {
ERROR(SGML_ERROR,
"bad character in character reference",
yytext, yyleng);
}
 
 
/* &#SPACE; -- named character reference. */
<INITIAL,NETDATA>{CRO}{name}{reference_end}? {
if (l->restrict) {
if (l->compat)
/* old-style user agents use it as data. */
TOK(tokF, tokObj, SGML_DATA, yytext, yyleng);
else{
ERROR(SGML_LIMITATION,
"named character references are not supported",
yytext, yyleng);
}
}else{
reference(yytext, yyleng, SGML_NAMECHARREF, l->normalize,
l, tokF, tokObj);
}
}
 
/* &amp; -- general entity reference */
<INITIAL,NETDATA>{ERO}{name}{reference_end}? {
reference(yytext, yyleng, SGML_GEREF, 0,
l, tokF, tokObj);
}
 
/* </name < -- unclosed end tag */
<INITIAL,CDATA>{ETAGO}{name}?{ws}/{STAGO} {
if (l->restrict){
ERROR(SGML_LIMITATION,
"unclosed end tag not supported",
yytext, yyleng);
}else{
ADDCASE(SGML_END, yytext, yyleng);
CALLBACK(tokF,tokObj);
}
}
 
/* </title> -- end tag */
<INITIAL,CDATA>{ETAGO}{name}{ws} {
ADDCASE(SGML_END, yytext, yyleng);
if (l->restrict && l->compat) {
BEGIN(JUNKTAG);
}else {
BEGIN(ENDTAG);
}
}
 
/* @# HACK for XMP, LISTING?
Date: Fri, 19 Jan 1996 23:13:43 -0800
Message-Id: <v01530502ad25cc1a251b@[206.86.76.80]>
To: www-html@w3.org
Subject: Re: Daniel Connolly's SGML Lex Specification
*/
 
/* @@ all these are recognized in NETDATA too. Need a stack? */
 
/* </> -- empty end tag */
{ETAGO}{TAGC} {
if (l->restrict) {
if (l->compat)
TOK(tokF, tokObj, SGML_DATA, yytext, yyleng);
else
ERROR(SGML_LIMITATION,
"empty end tag not supported",
yytext, yyleng);
}else{
TOK2(tokF, tokObj,
SGML_START, yytext, yyleng-1,
SGML_TAGC, yytext + yyleng - 1, 1);
}
}
 
/* <!DOCTYPE -- markup declaration */
{MDO}{name}{ws} {
ADDCASE(SGML_MARKUP_DECL, yytext, yyleng);
BEGIN(MD);
}
 
/* <!> -- empty comment */
{MDO}{MDC} {
TOK(auxF, auxObj, SGML_MARKUP_DECL,
yytext, yyleng);
}
 
/* <!-- -- comment declaration */
{MDO}/{COM} {
ADD(SGML_MARKUP_DECL, yytext, yyleng);
BEGIN(COM);
}
 
/* <![ -- marked section */
{MDO}{DSO}{ws} {
ERROR(SGML_LIMITATION,
"marked sections not supported",
yytext, yyleng);
BEGIN(DS); /* @# skip past some stuff */
}
 
/* ]]> -- marked section end */
{MSC}{MDC} {
ERROR(SGML_ERROR,
"unmatched marked sections end",
yytext, yyleng);
}
 
/* <? ...> -- processing instruction */
{PIO}[^>]*{PIC} {
if (l->restrict && l->compat){
/*@# issue warning? */
TOK(tokF, tokObj, SGML_DATA,
yytext, yyleng);
}else{
TOK(auxF, auxObj, SGML_PI,
yytext, yyleng);
}
}
/* <name -- start tag */
{STAGO}{name}{ws} {
ADDCASE(SGML_START, yytext, yyleng);
BEGIN(ATTR);
}
 
 
/* <> -- empty start tag */
{STAGO}{TAGC} {
if (l->restrict) {
if (l->compat)
TOK(tokF, tokObj, SGML_DATA, yytext, yyleng);
else
ERROR(SGML_LIMITATION,
"empty tag not supported",
yytext, yyleng);
}else {
ADDCASE(SGML_START, yytext, yyleng - 1);
CALLBACK(tokF, tokObj);
}
}
 
/* abcd -- data characters */
([^<&]|(<[^<&a-zA-Z!->?])|(&[^<&#a-zA-Z]))+|. {
TOK(tokF, tokObj, SGML_DATA, yytext, yyleng);
}
 
/* abcd -- data characters */
<CDATA>[^<]+|. {
TOK(tokF, tokObj, SGML_DATA, yytext, yyleng);
}
 
<NETDATA>{NET} {
TOK(tokF, tokObj, SGML_NET, yytext, yyleng);
BEGIN(INITIAL);
}
 
/* <em/ ^abcd / -- data characters within null end tag */
<NETDATA>([^/&<])+|. {
TOK(tokF, tokObj, SGML_DATA, yytext, yyleng);
}
 
/* 7.4 Start Tag */
/* Actually, the generic identifier specification is consumed
* along with the STAGO delimiter ("<"). So we're only looking
* for tokens that appear in an attribute specification list,
* plus TAGC (">"). NET ("/") and STAGO ("<") signal limitations.
*/
 
/* 7.5 End Tag */
/* Just looking for TAGC. NET, STAGO as above */
 
/* <a ^href = "xxx"> -- attribute name */
<ATTR>{name}{s}*={ws} {
 
if(l->normalize){
 
/* strip trailing space and = */
while(yytext[yyleng-1] == '='
|| isspace(yytext[yyleng-1])){
--yyleng;
}
}
 
ADDCASE(SGML_ATTRNAME, yytext, yyleng);
BEGIN(ATTRVAL);
}
 
/* <img src="xxx" ^ismap> -- name */
<ATTR>{name}{ws} {
ADD(SGML_ATTRNAME, NULL, 0);
ADDCASE(SGML_NAME, yytext, yyleng);
}
 
/* <a name = ^xyz> -- name token */
<ATTRVAL>{name_token}{ws} {
ADD(SGML_NMTOKEN, yytext, yyleng);
BEGIN(ATTR);
}
 
/* <a href = ^"a b c"> -- literal */
<ATTRVAL>{literal}{ws} {
if(yyleng > 2 && yytext[yyleng-2] == '='
&& memchr(yytext, '>', yyleng)){
ERROR(SGML_WARNING,
"missing attribute end-quote?",
yytext, yyleng);
}
ADD(SGML_LITERAL, yytext, yyleng);
BEGIN(ATTR);
}
 
/* <a name= ^> -- illegal tag close */
<ATTRVAL>{TAGC} {
ERROR(SGML_ERROR,
"Tag close found where attribute value expected",
yytext, yyleng);
/* @@ need test for this */
ADD(SGML_TAGC, yytext, yyleng);
 
CALLBACK(tokF,tokObj);
BEGIN(INITIAL);
}
 
/* <a name=foo ^>,</foo^> -- tag close */
<ATTR,TAG>{TAGC} {
ADD(SGML_TAGC, yytext, yyleng);
CALLBACK(tokF,tokObj);
BEGIN(INITIAL);
}
 
/* <a name= ^/ -- NET where attribute value expected */
<ATTRVAL>{NET} {
ERROR(SGML_ERROR,
"attribute value missing",
yytext, yyleng);
 
ADD(SGML_NET, yytext, yyleng);
CALLBACK(tokF, tokObj);
BEGIN(INITIAL);
}
 
/* <em^/ -- NET tag */
<ATTR>{NET} {
if (l->restrict) {
CALLBACK(tokF, tokObj);
ERROR(SGML_LIMITATION, "NET tags not supported",
yytext, yyleng);
BEGIN(INITIAL);
}else{
ADD(SGML_NET, yytext, yyleng);
CALLBACK(tokF, tokObj);
BEGIN(NETDATA);
}
}
 
/* <foo^<bar> -- unclosed start tag */
<ATTR,ATTRVAL,TAG>{STAGO} {
/* report pending tag */
CALLBACK(tokF, tokObj);
BEGIN(INITIAL);
 
if(l->restrict){
ERROR(SGML_LIMITATION,
"Unclosed tags not supported",
yytext, yyleng);
}else{
/* save STAGO for next time */
#ifdef FIX_YYLESS /*@@*/
yyless(yyleng-1); /*@# length of STAGO assumed 1 */
#endif
BEGIN(INITIAL);
}
}
 
/* <a href = ^http://foo/> -- unquoted literal HACK */
<ATTRVAL>[^ "\t\n\r>]+{ws} {
ERROR(SGML_ERROR,
"attribute value needs quotes",
yytext, yyleng);
ADD(SGML_LITERAL, yytext, yyleng);
BEGIN(ATTR);
}
 
<ATTR,ATTRVAL,TAG>. {
ERROR(SGML_ERROR,
"bad character in tag",
yytext, yyleng);
}
 
/* end tag -- non-permissive */
<ENDTAG>{TAGC} {
ADD(SGML_TAGC, yytext, yyleng);
CALLBACK(tokF,tokObj);
BEGIN(INITIAL);
}
 
<ENDTAG>. {
ERROR(SGML_ERROR, "extraneous character in end tag",
yytext, yyleng);
}
 
/* permissive search for tag close */
<JUNKTAG>[^>]+ {
/* skip */
}
 
<JUNKTAG>{TAGC} {
BEGIN(INITIAL);
}
 
 
/* 10 Markup Declarations: General */
 
/* <!^--...--> -- comment */
<MD,COM>{COM}([^-]|-[^-])*{COM}{ws} {
ADD(SGML_COMMENT, yytext, yyleng);
}
 
/* <!doctype ^%foo;> -- parameter entity reference */
<MD>{PERO}{name}{reference_end}?{ws} {
if (l->restrict) {
ERROR(SGML_LIMITATION,
"parameter entity reference not supported",
yytext, yyleng);
}
ADD(SGML_PERO, yytext, yyleng);
}
 
/* <!entity ^% foo system "..." ...> -- parameter entity definition */
<MD>{PERO}{ps} {
if (l->restrict) {
ERROR(SGML_LIMITATION,
"parameter entity definition not supported",
yytext, yyleng);
}
ADD(SGML_PERO, yytext, yyleng);
}
/* The limited set of markup declarations we're interested in
* uses only numbers, names, and literals.
*/
<MD>{number}{ws} {
ADD(SGML_NUMBER, yytext, yyleng);
}
 
<MD>{name}{ws} {
ADDCASE(SGML_NAME, yytext, yyleng);
}
 
<MD>{number_token}{ws} {
ADD(SGML_NUMTOKEN, yytext, yyleng);
}
 
<MD>{name_token}{ws} {
ADDCASE(SGML_NMTOKEN, yytext, yyleng);
}
 
<MD>{literal}{ws} {
ADD(SGML_LITERAL, yytext, yyleng);
}
 
<MD,COM>{MDC} {
ADD(SGML_TAGC, yytext, yyleng);
CALLBACK(auxF, auxObj);
BEGIN(INITIAL);
}
 
/* other constructs are errors. */
/* <!doctype foo ^[ -- declaration subset */
<MD>{DSO} {
if(l->restrict){
ERROR(SGML_LIMITATION,
"declaration subset not supported",
yytext, yyleng);
}
ADD(SGML_DSO, yytext, yyleng);
CALLBACK(auxF, auxObj);
BEGIN(DS);
}
 
<MD,COM>. {
ERROR(SGML_ERROR,
"illegal character in markup declaration",
yytext, yyleng);
}
 
 
/* 10.4 Marked Section Declaration */
/* 11.1 Document Type Declaration Subset */
 
/* Our parsing of declaration subsets is just an error recovery technique:
* we attempt to skip them, but we may be fooled by "]"s
* inside comments, etc.
*/
 
/* ]]> -- marked section end */
<DS>{MSC}{MDC} {
BEGIN(INITIAL);
}
/* ] -- declaration subset close */
<DS>{DSC} { BEGIN(COM); }
 
<DS>[^\]]+ {
ERROR(SGML_LIMITATION,
"declaration subset: skipping",
yytext, yyleng);
}
 
/* EXCERPT ACTIONS: STOP */
 
%%
 

References

Goldfarb, C.F., The SGML Handbook, Y. Rubinsky, ed., Oxford University Press, 1990.

ISO 8879, Information Processing--Text and Office Systems--Standard Generalized Markup Language (SGML), 1986.

Goldfarb, C.F., ed., ISO/IEC JTC1/SC18/WG8 N1351, Request for contributions for review of ISO 8879, 11 October 1991.

Berners-Lee, T., and D. Connolly, "Hypertext Markup Language--2.0," RFC 1866, MIT/W3C, November 1995.

Levinson, E., "SGML Media Types," RFC 1874, December 1995.

Sperberg-McQueen, C. M., and Lou Burnard, eds., "Guidelines for Electronic Text Encoding and Interchange," 16 May 1994.

The SGML PRIMER SoftQuad's Quick Reference Guide to the Essentials of the Standard: The SGML Needed for Reading a DTD and Marked-Up Documents and Discussing Them Reasonably.

Wohler, Wayne L. "The DTD May Not Be Enough: SGML Declarations," 5/10 (October 1992) 6-9, 6/1 (January 1993) 1-7; 6/2 (February 1993) 1-6.

Eve Maler, "DocBook V2.3 Maintainer's Guide," ArborText, Inc., Revision 1.1, 25 September 1995.

Online documentation of HTF (Hyper-G Text Format), 94/09/21.

Wohler, Wayne, Don R. Day, W. Eliot Kimber, Simch Gralla, and Mike Temple, "IBM ID Doc Language Reference, Draft 1.4," February 13, 1994.

Aho, Alfred V., Ravi Sethi, and Jeffrey D. Ullman, Compilers: Principles, Techniques, and Tools, Addison-Wesley, 1988.

Kernighan, Brian W. and Dennis M. Ritchie, The C Programming Language, Prentice Hall, 1988.

Kaelbling, Mike, "On Improving SGML," Electronic Publishing--Origination, Dissemination and Design, 3(2):93-98, May 1990; also available as Ohio State Tech Report 88-22.

Vern Paxson, Systems Engineering, Bldg. 46A, Room 1123, Lawrence Berkeley Laboratory, University of California, Berkeley, CA 94720, vern@ee.lbl.gov

Connolly, Dan, "SGML and the Web," W3C, Work in progress, January 1996.

Connolly, Dan, "HTML Dialects: Internet Media and SGML Document Types," W3C, Work in progress, January 1996.

About the Author

Dan Connolly

World Wide Web Consortium

MIT Laboratory for Computer Science

545 Technology Square

Cambridge, MA 02139

connolly@w3.org

Dan discovered the Web project in 1991, soon after graduating from U.T. Austin, while he was at Convex. His industry experience in online documentation tools, distributed computing, and information delivery kept him in touch with the project while he was at Dazel and HaLSoft.

In March 1995, Dan joined the W3C, utilizing his background in formal systems to work on the specification of HTML and other parts of the Web. He was the editor of Issue 2 of the World Wide Web Journal (W3J), and is currently editor of Web Programmer magazine.
