HEX - The HTML Enabled XML Parser

http://www-uk.hpl.hp.com/people/ak/java/hex.html

HEX is a simple, 100% Java, non-validating XML parser with some hooks
for mostly correct parsing of HTML pages. It understands neither SGML
nor XML DTDs, but the parser API lets the application control its
operation in ways that facilitate HTML parsing. It implements the DOM
Level 1 Core API and the event-driven SAX API, and comes with a couple
of sample applications.
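
For readers unfamiliar with the event-driven SAX style mentioned above,
here is a minimal sketch. It uses the parser bundled with the JDK and the
SAX2 DefaultHandler class, not HEX itself, since HEX's own class names
are not given in this announcement. SaxSketch and elementNames are
hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not part of HEX.
class SaxSketch {
    /** Returns the element names in document order, as reported by SAX events. */
    static List<String> elementNames(String xml) throws Exception {
        List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // The parser calls this once per start tag; no tree is built.
                names.add(qName);
            }
        };
        // Assumption: the JDK's default SAX parser stands in for HEX here.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(elementNames("<ul><li>one</li><li>two</li></ul>"));
    }
}
```

The application only supplies callbacks, so in principle any parser that
implements SAX could be substituted for the one used in this sketch.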

[Note: "The parser is extremely lax with errors. It will happily
produce a parse tree without reporting errors for some documents that
are not well-formed. HTML 4.0 is assumed, but that doesn't mean the
parser will reject all documents that are malformed according to
HTML 4.0; it does mean that most correct HTML 4.0 documents will be
parsed correctly.

"Basically the parser is very simple. It has some mechanisms for dealing
with certain basic differences between HTML and XML. For example, when
put into HTML-parsing mode it will know that certain elements, e.g. LI,
are terminated by an LI end tag, another LI start tag, or a UL end tag.
However, it won't try to verify that the HTML content model is adhered
to. It is a very pragmatic piece of software.

"My reason for writing this thing was that I needed to handle HTML pages
with some extra (XML) markup. What this means is of course undefined (at
the moment), and since HTML pages are notoriously non-conformant, it
made sense to make the parser very pragmatic."]
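
The LI rule quoted above (an open LI ends at an LI end tag, another LI
start tag, or a UL end tag) can be sketched roughly as follows. This is
not HEX's code: ImplicitEndTags and normalize are hypothetical names,
and the input is assumed to be already split into tag and text tokens.
Like the parser described in the note, the sketch makes no attempt to
check the HTML content model.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration class; not part of HEX.
class ImplicitEndTags {
    /** Inserts the </LI> end tags that the input omitted. */
    static List<String> normalize(List<String> tokens) {
        List<String> out = new ArrayList<>();
        Deque<String> open = new ArrayDeque<>();  // stack of open element names
        for (String tok : tokens) {
            if (tok.startsWith("</")) {
                String name = tok.substring(2, tok.length() - 1).toUpperCase();
                // A UL end tag implicitly closes a still-open LI.
                if (name.equals("UL") && "LI".equals(open.peek())) {
                    out.add("</LI>");
                    open.pop();
                }
                // Pop the element this end tag matches, if it is on top.
                if (name.equals(open.peek())) {
                    open.pop();
                }
                out.add(tok);
            } else if (tok.startsWith("<")) {
                String name = tok.substring(1, tok.length() - 1).toUpperCase();
                // Another LI start tag implicitly closes the previous LI.
                if (name.equals("LI") && "LI".equals(open.peek())) {
                    out.add("</LI>");
                    open.pop();
                }
                open.push(name);
                out.add(tok);
            } else {
                out.add(tok);  // character data passes through unchanged
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [<UL>, <LI>, one, </LI>, <LI>, two, </LI>, </UL>]
        System.out.println(normalize(
                List.of("<UL>", "<LI>", "one", "<LI>", "two", "</UL>")));
    }
}
```

Input that already carries explicit LI end tags passes through
unchanged, because each end tag pops the matching open element before
the next start tag is seen.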

Anders

-- 
Anders Kristensen <ak@hplb.hpl.hp.com>,
http://www-uk.hpl.hp.com/people/ak/
Hewlett-Packard Labs, Bristol, UK