
A Lexical Analyzer for HTML and Basic SGML

Dan Connolly
$Id: sgml-lex.html,v 1.11 1995/10/18 09:12:04 connolly Exp $

Abstract

The Standard Generalized Markup Language (SGML) is a complex system for developing markup languages. It is used to define the Hypertext Markup Language (HTML) used in the World Wide Web, as well as several other hypermedia document representations.

Systems with interactive performance constraints use only the simplest features of SGML. Unfortunately, the specification of those features is subtly mixed into the specification of SGML in all its generality. As a result, a number of ad-hoc SGML lexical analyzers have been developed and deployed on the Internet, and reliability has suffered.

We present a self-contained specification of a lexical analyzer that uses automated parsing techniques to handle SGML document types limited to a tractable set of SGML features.

Introduction

The hypertext markup language is an SGML format.

--Tim Berners-Lee, in "About HTML"

The result of that design decision is something of a collision between the World Wide Web development community and the SGML community -- between the quick-and-dirty software community and the formal ISO standards community. It also creates a collision between the interactive, online hypermedia technology and the bulk, batch print publications technology.

SGML, the Standard Generalized Markup Language, is a complex, mature, stable technology. The international standard, ISO 8879:1986[SGML], is nearly ten years old, and GML-based systems pre-date the standard by years. On the other hand, HTML, the Hypertext Markup Language, is a relatively simple, new and rapidly evolving technology.

SGML has a number of degrees of freedom which are bound in HTML. SGML is a system for defining markup languages, and HTML is one such language; in standard terminology, HTML is an SGML application.

Lexical Analysis of Basic SGML Documents

The degrees of freedom in SGML which the HTML 2.0 specification[HTML2.0] binds can be separated into high-level, document structure considerations on the one hand, and low-level, lexical details on the other. The document structure issues are specific to the domain of application of HTML, and they are evolving rapidly to reflect new features in the web.

The lexical properties of HTML 2.0 are very stable by comparison. HTML documents fit into a category termed basic SGML documents in the SGML standard, with a few exceptions (these exceptions are likely to be reflected in the ongoing revision of SGML). These properties are independent of the domain of application of HTML. They are shared by a number of contemporary SGML applications, such as TEI[TEI], DocBook[DocBook], HTF[HTF], and IBM-IDDOC[IBM-IDDOC].

The specification of this straightforward category of SGML documents is, unfortunately, subtly mixed into the specification of SGML in all its generality.

An unfortunate result is that a number of lexically incompatible HTML parser implementations have been developed and deployed.[REF! Mosaic 2.4, Cern libwww parser]

The objectives of the document are to:

refine the notion of "basic SGML document" to the precise set of features used in HTML 2.0.
present a more traditional automated model of lexical analysis and parsing for these SGML documents[Dragon].
make freely available to the web development community a rigorous specification of this lexical analyzer that can be understood without prior knowledge of SGML.

SGML and Document Types

SGML Documents

An SGML document is a sequence of characters organized physically as one or more entities, and logically as a hierarchy of elements.

The organization of an SGML document into entities is analogous to the organization of a C program into source files[KnR2]. This report does not formally address entity structure. We restrict our discussion to documents consisting of a single entity.

The element hierarchy of an SGML document is actually the last of three parts. The first two are the SGML declaration and the prologue.

The SGML declaration binds certain variables such as the character strings that serve delimiter roles, and the optional features used. The SGML declaration also specifies the document character set -- the set of characters allowed in the document and their corresponding character numbers.

The prologue, or DTD, declares the element types allowed in the document, along with their attributes and content models. The content models express the order and occurrence of elements in the hierarchy.

Document Types and Element Structure

SGML facilitates the development of document types, or specialized markup languages. An SGML application is a set of rules for using one or more document types. Typically, a community such as an industry segment, after identifying a need to interchange data in a rigorous method, develops an SGML application suited to their practices.

The document type definition includes two parts: a formal part, expressed in SGML, called a document type declaration or DTD, and a set of application conventions.

The DTD essentially gives a grammar for the element structure of the specialized markup language: the start symbol is the document element name; the productions are specified in element declarations, and the terminal symbols are start-tags, end-tags, and data characters. For example:

<!doctype Memo [
<!element Memo         - - (Salutation, P*, Closing?)>
<!element Salutation   O O (Date & To & Address?)>
<!element (P|Closing|To|Address) - O (#PCDATA)>
<!element Date - O EMPTY>
<!attlist Date
	numeric CDATA #REQUIRED>
]>

These four element declarations specify that a Memo consists of a Salutation, zero or more P elements, and an optional Closing. The Salutation consists of a Date, a To, and optionally an Address; the "&" connector allows them to occur in any order.

The notation "- -" specifies that both start and end tags are required; "O O" specifies both are optional, and "- O" specifies that the start tag is required, but the end tag is optional. The notation #PCDATA refers to parsed character data -- data characters with auxiliary markup such as comments mixed in. An element declared EMPTY has no content and no end-tag.

The ATTLIST declaration specifies that the Date element has an attribute called numeric. The #REQUIRED notation says that each Date start-tag must specify a value for the numeric attribute.

The following is a sample instance of the memo document type:

<!doctype memo system>
<Memo>
<Date numeric="1994-06-12"> 
<To>Third Floor
<p>Please limit coffee breaks to 10 minutes.
<Closing>The Management
</Memo>

The following leftmost derivation shows the nearly self-evident structure of SGML documents when viewed at this level:

Memo -> <Memo>, Salutation, P, Closing, </Memo>	
Salutation -> Date, To	
Date -> <Date numeric="1994-06-12">	
To -> <To>, "Third Floor"	
P -> <P>, "Please limit coffee breaks to 10 minutes."	
Closing -> <Closing>, "The Management"

The lexical analyzer in this report reports events at this level: start-tags, end-tags, and data.

Basic SGML Language Constructs

Basic SGML documents are like ordinary text files, but the text is enhanced with certain constructs called markup. The markup constructs add structure to documents.

The lexical analyzer separates the characters of a document into markup and data characters. Markup is separated from data characters by delimiters. The SGML delimiter recognition rules include a certain amount of context information. For example, the delimiter string "</" is only recognized as markup when it is followed by a letter.
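As a rough illustration of contextual delimiter recognition, the rule for "</" stated above might be sketched in Python as follows. This is not the lex specification; the function name is_etago is hypothetical, and the sketch assumes that only ASCII letters count as letters.

```python
import string


def is_etago(text, pos):
    """Return True if the "</" at text[pos] is followed by a letter,
    which is the context in which it is recognized as markup."""
    return (text.startswith("</", pos)
            and pos + 2 < len(text)
            and text[pos + 2] in string.ascii_letters)
```

Under these assumptions, the "</" in "</a>" is recognized, while the "</" in "a </ b" is left as data.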

For a formal specification of the language constructs, see the lex specification. The following is an informal overview.

Markup Declarations

Each SGML document begins with a document type declaration. Comment declarations and marked section declarations are other types of markup declaration.

The string <! followed by a name begins a markup declaration. The name is followed by parameters and a >. A [ in the parameters opens a declaration subset, which is a construct prohibited in this report.

The string <!-- begins a comment declaration. The -- begins a comment, which continues until the next occurrence of --. A comment declaration can contain zero or more comments. The string <!> is an empty comment declaration.

The string <![ begins a marked section declaration, which is prohibited in this report.

For example:

<!doctype foo>
<!DOCTYPE foo SYSTEM>
<!doctype bar system "abcdef">
<!doctype BaZ public "-//owner//DTD description//EN">
<!doctype BAZ Public "-//owner//DTD desc//EN" "sysid">
<!>
another way to escape < and &: <<!>xxx &<!>abc;
<!-- xyz -->
<!-- xyz -- --def-->
<!---- ---- ---->
<!------------>
<!doctype foo --my document type-- system "abc">

The following examples contain no markup. They illustrate that "<!" does not always signal markup.

<! doctype> <!,doctype> <!23>
<!- xxx -> <!-> <!-!>

The following are errors:

<!doctype xxx,yyy>
<!usemap map1>
<!-- comment-- xxx>
<!-- comment -- ->
<!----->

The following are errors, but they are not reported by this lexical analyzer.

<!doctype foo foo foo>
<!doctype foo 23 17>
<!junk decl>

The following are valid SGML constructs that are prohibited in this report:

<!doctype doc [ <!element doc - - ANY> ]>
<![ IGNORE [ lkjsdflkj sdflkj sdflkj  ]]>
<![ CDATA [ lskdjf lskdjf lksjdf ]]>
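The comment declaration rules described in this section can be approximated with a small Python sketch. It is an illustration only, not the lex specification: the function name parse_comment_declaration is hypothetical, and the sketch assumes that the whole declaration, from "<!" to ">", has already been isolated.

```python
def parse_comment_declaration(decl):
    """Return the text of each comment in a comment declaration."""
    if not (decl.startswith("<!") and decl.endswith(">")):
        raise ValueError("not a markup declaration")
    body = decl[2:-1]
    # "<!" must be followed directly by "--" (or by ">" for the empty form).
    if body and not body.startswith("--"):
        raise ValueError("not a comment declaration")
    comments, pos = [], 0
    while pos < len(body):
        if body.startswith("--", pos):
            # A comment runs from "--" to the next occurrence of "--".
            end = body.find("--", pos + 2)
            if end < 0:
                raise ValueError("unterminated comment")
            comments.append(body[pos + 2:end])
            pos = end + 2
        elif body[pos].isspace():
            pos += 1          # white space between comments is allowed
        else:
            raise ValueError("bad character in comment declaration")
    return comments
```

With this sketch, the valid comment declarations among the examples above yield zero or more comments, and the erroneous ones (such as the declaration ending in "-- xxx>") raise ValueError.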

Names

A name is a name-start character -- a letter -- followed by any number of name characters -- letters, digits, periods, or hyphens. Entity names are case sensitive, but all other names are not.
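The definition of a name translates directly into a regular expression. The following Python sketch is illustrative; the identifiers NAME, is_name and fold are hypothetical, and only ASCII letters and digits are assumed.

```python
import re

# A name-start character (a letter) followed by any number of
# name characters (letters, digits, periods, or hyphens).
NAME = re.compile(r"[A-Za-z][A-Za-z0-9.\-]*")


def is_name(s):
    """Return True if the whole string is a name."""
    return NAME.fullmatch(s) is not None


def fold(name):
    """Fold a name that is not an entity name to lower case."""
    return name.lower()
```

Note that a string such as .76meters consists of name characters but does not begin with a letter, so it is a name token rather than a name.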

Attributes

Start tags may contain attribute specifications. An attribute specification consists of a name, an "=" and a value specification. The name refers to an item in an ATTLIST declaration.

The value can be a name token or an attribute value literal. A name token is one or more name characters. An attribute value literal is a string delimited by double-quotes (") or a string delimited by single-quotes ('). Interpretation of attribute value literals is covered in the discussion of the lexical analyzer API.

If the ATTLIST declaration specifies an enumerated list of names, and the value specification is one of those names, the attribute name and "=" may be omitted.

For example:

<x attr="val">
<x ATTR ="val" val>
<y aTTr1= "val1">
<yy attr1='xyz' attr2="def" attr3='xy"z' attr4="abc'def">
<xx abc='abc&#34;def'>
<xx aBC="fred &amp; barney">
<z attr1 = val1 attr2 = 23 attr3 = 'abc'>
<xx val1 val2 attr3=.76meters>
<a href=foo.html> ..</a> <a href=foo-bar.html>..</a>

The following examples illustrate errors:

<x attr = abc$#@>
<y attr1,attr2>
<tt =xyz>
<z attr += 2>
<xx attr=50%>
<a href=http://foo/bar/>
<xx "abc">
<xxx abc=>
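The attribute rules above can be sketched as a small Python parser. This is a simplified illustration, not the lex specification: the names ATTR and parse_attributes are hypothetical, the input is assumed to be the text between the tag name and the closing ">", and attribute names are folded to lower case.

```python
import re

NAME = r"[A-Za-z][A-Za-z0-9.\-]*"          # letter, then name characters
TOKEN = r"[A-Za-z0-9.\-]+"                 # one or more name characters
LITERAL = r'"[^"]*"' + "|" + r"'[^']*'"    # double- or single-quoted string

# Either name "=" value, or a bare name token (name and "=" omitted).
ATTR = re.compile(
    r"\s*(?:(?P<name>%s)\s*=\s*(?P<value>%s|%s)|(?P<bare>%s))"
    % (NAME, LITERAL, TOKEN, TOKEN))


def parse_attributes(text):
    """Parse attribute specifications into (name, value) pairs.

    An omitted attribute name is returned as None. Literals keep
    their quotes; interpreting them is left to the caller.
    """
    text = text.rstrip()
    pairs, pos = [], 0
    while pos < len(text):
        m = ATTR.match(text, pos)
        if m is None:
            raise ValueError("bad attribute specification: %r" % text[pos:])
        if m.group("bare") is not None:
            pairs.append((None, m.group("bare")))
        else:
            pairs.append((m.group("name").lower(), m.group("value")))
        pos = m.end()
    return pairs
```

The sketch accepts the attribute text of each valid example above and raises ValueError for each of the erroneous ones, including a quoted literal with no attribute name.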

Character References and Entity References

Characters in the document character set can be referred to by numeric character references. Entities declared in the DTD can be referred to by entity references.

An entity reference begins with "&" followed by a name, followed by an optional semicolon.

A numeric character reference begins with "&#" followed by a number followed by an optional semicolon. (The string "&#" followed by a name is a construct prohibited by this report.) A number is a sequence of digits.

The following examples illustrate character references and entity references:

&#38; &#200;
&amp; &ouml;
&#38 &#200,xxx
&amp &abc() &xy12/..\
To illustrate the X tag, write &lt;X&gt;

These examples contain no markup. They illustrate that & does not always signal markup.

a & b, a &# b
a &, b &. c
a &#-xx &100

These examples are errors:

&#2000000; &#20.7 &#20-35
&#23x;

The following are valid SGML, but prohibited by this report:

&#SPACE;
&#RE;
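As an illustration of the reference syntax, the following Python sketch finds character references and entity references in a string. The names REF and find_references are hypothetical. The sketch only locates references; it does not diagnose the error cases or the prohibited named character references shown above.

```python
import re

# "&#" number, or "&" name, each followed by an optional semicolon.
REF = re.compile(
    r"&(?:#(?P<num>[0-9]+)|(?P<name>[A-Za-z][A-Za-z0-9.\-]*));?")


def find_references(text):
    """Return a list of ("charref", number) and ("entity", name) pairs."""
    found = []
    for m in REF.finditer(text):
        if m.group("num") is not None:
            found.append(("charref", int(m.group("num"))))
        else:
            found.append(("entity", m.group("name")))
    return found
```

An "&" that is not followed by a letter, or by "#" and a digit, matches nothing and so remains data, as in the examples that contain no markup.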

Processing Instructions

Processing instructions are a sort of escape to platform-specific idioms. A processing instruction begins with <? and ends with >.

For example:

<?>
<?style tt = font courier>
<?page break>
<?experiment> ... <?/experiment>
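Because a processing instruction simply runs from <? to the next >, it can be recognized with a one-line pattern. The Python sketch below is illustrative, and the names PI and find_processing_instructions are hypothetical.

```python
import re

# "<?" followed by anything up to and including the next ">".
PI = re.compile(r"<\?[^>]*>")


def find_processing_instructions(text):
    """Return every processing instruction in text, delimiters included."""
    return PI.findall(text)
```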

The Application Programmer Interface (API) to the Lexical Analyzer

The client of the lexical analyzer creates a data structure to hold the state of the lexical analyzer with a call to SGML_newLexer, and uses calls to SGML_lex to scan the data. Constructs are reported to the caller via three callback functions.

The output of the lexical analyzer, for each construct, is an array of strings, and an array of enumerated types in one-to-one correspondence with the strings.

Data Characters and Character References

Data characters are passed to the primary callback function as an array containing a single string (the data characters), with SGML_DATA as its type.

Note that the output contains all newlines (record end characters) from the input verbatim. Implementing the rules for ignoring record end characters as per section 7.6.1 of SGML is left to the client.

A character reference refers to the character in the document character set whose number it specifies. For example, if the document character set is ISO 646 IRV (aka ASCII), then &#65; is another way to write A.

A character reference is reported just like data characters: the first string in the output array is the one-character string that the character number refers to, with SGML_DATA as its type.

Tags and Attributes

Start-tags and end-tags are also passed to the primary callback function.

For a start-tag, the first element of the output array is a string of the form <name with SGML_START as the corresponding type. The name is folded to lower-case. The remaining elements of the array give the attributes; see below. For an end tag, the first element of the array is a case-folded string of the form </name with SGML_END as the type.

The output for attributes is included with the tag in which they appear. Attributes are reported as name/value pairs. The attribute name is output as a string of the form name and SGML_ATTRNAME as the type. An omitted name is reported as NULL.

An attribute value literal is output as a string of the form "xxx" or 'xxx' including the quotes, with SGML_LITERAL as the type. Other attribute values are returned as a string with SGML_NMTOKEN as the type. For example:

<xX val1 val2 aTTr3=.76meters>

is passed as an array of seven strings:

[Tag/Data]
Start Tag: `<xx'
  Attr Name: `'
  Name: `val1'
  Attr Name: `'
  Name: `val2'
  Attr Name: `attr3'
  Name Token: `.76meters'

Note that attribute value literals are output verbatim. Interpretation is left to the client. Section 7.9.3 of SGML says that an attribute value literal is interpreted as an attribute value by:

Removing the quotes
Replacing character and entity references
Deleting character 10 (ASCII LF)
Replacing character 9 and 13 (ASCII HT and CR) with character 32 (SPACE)
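A client might implement the steps above along the following lines. This Python sketch makes several assumptions: the entity table is a hypothetical stand-in for entities declared in the DTD, undeclared entity references are left untouched, and all replacements happen in a single pass so that characters produced by references are not themselves treated as white space.

```python
import re

# Hypothetical entity table; a real client would take these from the DTD.
ENTITIES = {"amp": "&", "lt": "<", "gt": ">", "quot": '"'}

# A character reference, an entity reference, or one of HT, LF, CR.
PIECE = re.compile(
    r"&#([0-9]+);?|&([A-Za-z][A-Za-z0-9.\-]*);?|([\t\n\r])")


def attribute_value(literal, entities=ENTITIES):
    """Interpret an attribute value literal (quotes included)."""
    body = literal[1:-1]                      # remove the quotes

    def replace(m):
        number, name, space = m.groups()
        if number is not None:
            return chr(int(number))           # character reference
        if name is not None:
            return entities.get(name, m.group(0))   # entity reference
        return "" if space == "\n" else " "   # delete LF; HT, CR become SPACE

    return PIECE.sub(replace, body)
```

For instance, the literal 'abc&#34;def' from the earlier examples is interpreted as abc"def, and "fred &amp; barney" as fred & barney.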

Other Markup

Other markup is passed to the second callback function.

Each comment is output as a string of the form -- comment -- with SGML_COMMENT as the type. The null comment is not reported.

Other markup declarations are output as a string of the form <!doctype followed by strings of type SGML_NAME, SGML_NUMBER, and/or SGML_LITERAL.

A markup declaration is reported when the end of it is reached. Each comment is reported as it is encountered. Hence a comment may be reported before the markup declaration in which it occurs.

For example:

<!Doctype Foo --my document type-- System "abc">

involves two callbacks:

[Aux Markup]
Comment: `--my document type--'

[Aux Markup]
Markup Decl: `<!doctype'
  Name: `foo'
  Name: `system'
  Literal: `"abc"'
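The ordering of these callbacks can be illustrated with a short Python sketch. It is not the report's implementation: the names PARAM and lex_declaration are hypothetical, the type labels are borrowed from the listing above rather than from the SGML_ constants, and the callback is assumed to take a list of strings and a parallel list of types.

```python
import re

PARAM = re.compile(r"""
    \s*(?:
        (?P<comment>--.*?--)                 # a comment inside the declaration
      | (?P<name>[A-Za-z][A-Za-z0-9.\-]*)    # a name
      | (?P<number>[0-9]+)                   # a number
      | (?P<literal>"[^"]*"|'[^']*')         # a quoted literal
    )""", re.VERBOSE | re.DOTALL)


def lex_declaration(decl, callback):
    """Report a markup declaration and any comments it contains."""
    head = re.match(r"<!([A-Za-z][A-Za-z0-9.\-]*)", decl)
    if head is None or not decl.endswith(">"):
        raise ValueError("not a markup declaration")
    strings, types = ["<!" + head.group(1).lower()], ["Markup Decl"]
    pos, end = head.end(), len(decl) - 1
    while True:
        m = PARAM.match(decl, pos, end)
        if m is None:
            break
        pos = m.end()
        kind = m.lastgroup
        if kind == "comment":
            # A comment is reported as soon as it is encountered.
            callback([m.group(kind)], ["Comment"])
        else:
            text = m.group(kind)
            strings.append(text.lower() if kind == "name" else text)
            types.append(kind.capitalize())
    if decl[pos:end].strip():
        raise ValueError("bad character in markup declaration")
    # The declaration itself is reported only when its end is reached.
    callback(strings, types)
```

Calling lex_declaration on the example declaration with a callback that records its arguments produces the comment first and the declaration second, matching the two callbacks listed above.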

A general entity reference is passed as a string of the form &name with SGML_GEREF as its type. The reference should be checked against those declared in the DTD by the client.

A processing instruction is passed as a string of the form <?pi stuff> with type SGML_PI.

Errors and Limitations

Errors are passed to the third callback function. Two strings and two types are passed. For errors, the first string is a descriptive message, and the type is SGML_ERROR. The second string is the offending data, and the type is SGML_DATA.

Limitations imposed in this report are output similarly, but with type SGML_LIMITATION instead of SGML_ERROR. The lexical analyzer skips to a likely end of the error construct before continuing.

For example:

<tag xxx=yyy ?>xxx <![IGNORE[ a<b>c]]> zzz

causes six callbacks:

[Err/Lim]
!!Error!!: `bad character in tag'
  Data: `?'

[Tag/Data]
Start Tag: `<tag'
  Attr Name: `xxx'
  Name Token: `yyy'

[Tag/Data]
Data: `xxx '

[Err/Lim]
!!Limitation!!: `marked sections not supported'
  Data: `<!['

[Err/Lim]
!!Limitation!!: `declaration subset: skipping'
  Data: `IGNORE[ a<b>c'

[Tag/Data]
Data: ` zzz'

Differences from Basic SGML

In section 15.1.1 of the SGML standard, a Basic SGML document is defined as an SGML document that uses the reference concrete syntax and the SHORTTAG and OMITTAG features. A concrete syntax is a binding of the SGML abstract syntax to concrete values. The reference concrete syntax binds the delimiter role stago to the string <, the role of etago to </, and so on. The OMITTAG feature allows documents to omit tags in certain cases that do not introduce ambiguity -- without OMITTAG, every element's start and end tags must occur in the document. The SHORTTAG feature allows for some short-hand syntax in attributes and tags.

Arbitrary Limitations Removed

The reference concrete syntax includes certain limitations (capacities and quantities, in the language of the standard). For most purposes, these limitations are unnecessary. We remove them:

Long Names: The reference concrete syntax binds the parameter NAMELEN to 8. This means that names are limited to 8 characters. We remove this limitation. Arbitrarily long names are allowed.
Long Attribute Value Literals: The reference concrete syntax sets LITLEN to 240 and ATTSPLEN to 960. We similarly remove these limitations.

Simplifications

We require the SGML declaration to be implicit and the DTD to be included by reference only:

SGML Declaration: The SGML declaration is generally transmitted out of band -- assumed by the sender and the receiver. The lexical analyzer will accept an in-line SGML declaration, but it will not adhere to the declarations therein. The lexical analyzer client should signal an error.
Internal Declaration Subset: The DTD is often included by reference, but some documents contain additional entity, element and attribute declarations in the <!DOCTYPE declaration. We prohibit additional declarations in the <!DOCTYPE declaration.
Parameter Entity Reference: The %name; construct is a parameter entity reference -- similar to a reference to a C macro. There is little use for these given the above limitations. An occurrence of a parameter entity in a markup declaration is prohibited.
Named Character References: The construct &#SPACE; refers to a space character. With changes to the SGML declaration prohibited, this adds no expressive capability to the language. We prohibit it to avoid interoperability problems.
Marked Sections: The construct <![ IGNORE [ ... ]]> is similar to the #ifdef construct in the C preprocessor. It is a novel construct that can be used to represent effectivity (applicability of parts of a document to various environments, depending on locale, client capabilities, user preferences, etc.). We expect that it will be deployed eventually, but to avoid interoperability issues, we prohibit its use.

Shorthand Markup Prohibited

Some constructs save typing, but add no expressive capability to the language. And while they technically introduce no ambiguity, they reduce the robustness of documents, especially when the language is enhanced to include new elements. The SHORTTAG constructs related to attributes are widely used and implemented, but those related to tags are not.

The tag-related constructs are relatively straightforward to support, but they are not widely deployed. While documents that use them are conforming SGML documents, they will not work with the deployed tools. This lexical analyzer signals a limitation when it encounters any of the following constructs:

NET tags: <name/.../
Unclosed Start Tag: <name1<name2>
Empty Start Tag: <>
Empty End Tag: </>

In addition, the lexical analyzer assumes no short references are used.

Future Work

Internationalization -- support for character encodings other than ASCII.
Internal declaration subset support
- General entity declarations with URIs as system identifiers
- General entity declarations as macros
- Parameter entity declarations for "switches" and "hooks"
Empty end tags
Marked Sections

References

[GOLD90]

C. F. Goldfarb. "The SGML Handbook." Y. Rubinsky, Ed., Oxford University Press, 1990.

[SGML]

ISO 8879. Information Processing -- Text and Office Systems - Standard Generalized Markup Language (SGML), 1986.

[HTML2.0]

T. Berners-Lee, D. Connolly. "Hypertext Markup Language - 2.0" RFC XXX@@, MIT/@W3C, September 1995.

[TEI]

C. M. Sperberg-McQueen and Lou Burnard, Eds. "Guidelines for Electronic Text Encoding and Interchange", 16 May 1994

[DocBook]

Eve Maler, "DocBook V2.3 Maintainer's Guide" ArborText, Inc. Revision 1.1, 25 September 1995

[HTF]

[IBM-IDDOC]

Wayne Wholer, Don R. Day, W. Eliot Kimber, Simch Gralla, Mike Temple. "IBM ID Doc Language Reference, Draft 1.4", February 13, 1994

[Dragon]

Aho, Alfred V., Sethi, Ravi, Ullman, Jeffrey D. Compilers, principles, techniques, and tools, 1988, Addison-Wesley. ISBN 0-201-10088-6

[KnR2]

Brian W. Kernighan, Dennis M. Ritchie, The C Programming Language, 2nd Edition. Prentice Hall, NJ 1988. ISBN 0-13-110370-9

SGML Open recommendations on HTML 3

Message-Id: <199503202311.SAA23789@EBT-INC.EBT.COM>
Date: Mon, 20 Mar 1995 18:21:23 -0500
To: html-wg@oclc.org, sgml-internet@ebt.com
From: sjd@ebt.com (Steven J. DeRose)
Subject: SGML Open recommendations on HTML 3

On Improving SGML

Sandy directly (mamrak@cis.ohio-state.edu), or the author, Mike Kaelbling (mjk@ztivax.zfe.siemens.de). It is Tech Report 88-22

flex

Vern Paxson Systems Engineering Bldg. 46A, Room 1123 Lawrence Berkeley Laboratory University of California Berkeley, CA 94720 vern@ee.lbl.gov

Grosso on capacities

from HTML-WG archive