Tim Bray's Lark 1.0 final beta and Larval 0.8

Date: Mon, 05 Jan 1998 11:17:03 -0800
To: xml-dev@ic.ac.uk
From: Tim Bray <tbray@textuality.com>
Subject: Lark 1.0 final beta and Larval 0.8

     -------------------------------------------------------------

This isn't finished yet, but I am uncomfortable about the fact that 
for the last couple of months, there has not been a Java-language XML 
syntax-checker that is really very close to the spec.  So, at 

 http://www.textuality.com/Lark/

I have placed the Lark 1.0 final beta, and release 0.8 of Larval, 
a validating XML processor based on Lark.

---------------
Status Snapshot
---------------

In my tests, Lark does all the things it used to do, and also rejects
163 of 164 of James' non-well-formed documents; the odd-doc-out is the
notorious 088.xml, which I consider to be well-formed and which represents 
a policy issue that the WG is going to have to make a call on.  The
only hole I know about in Lark at the moment is that it doesn't do
text declarations in external parsed entities; but I won't have
time to work on it until next week, so I decided to ship anyhow.

James' test-suite represents a tremendous resource: a de-facto 
reproducible test of conformance that will greatly increase the 
interoperability of XML docs.  We are all considerably in his debt, 
not for the first time; thank you once again, James.

Larval validates quite a few things, and boots out quite a few other
things, but has not been tested to anywhere near the same level that
Lark has.

These class files have been compiled with Microsoft VJ++1.1 and
tested with Microsoft JView and with Sun's Java from JDK 1.1.3.  
At the moment, if I compile with the Sun fastjavac, then neither
the Sun nor Microsoft Java interpreters can use the resulting
class files.  Admittedly, Lark.java and Larval.java are a pretty
severe strain on a compiler; on the other hand, I know about some pretty
egregious violations of the Java language spec that will get by
both of those compilers.  I suspect that my current problem with
fastjavac is as likely to be me breaking some rule about what can
be in a static string (J++ is forgiving) as it is a compiler bug.

----------------
Source Available
----------------

There's a policy change in that the Java source code for every Lark
class is now included in the distribution.  If you actually look at
Lark.java and Larval.java, you'll see that this is not quite as
generous as it sounds.

------------
Still Undone
------------

Lark 1.0 has also not received a walk-through looking for dead code,
software rot, and unconcealed evidence of stupidity, and has not been 
profiled.  It is noticeably but not unbearably slower than 0.97, but it'll 
be faster before I'm done.  I have established with previous releases
that with a little work any given release of Lark can be made faster
and smaller.  This release has grown in size by 10K.

Lark's UTF8 processing is still pretty shaky - I think that the
Java libraries are moving in the right direction fast enough to
make it not cost-effective for me to wrangle with this much more
at the moment.  Since XmlInputStream is now available at source level,
if someone were to want to plug in some robust UTF8 code that'd be
lovely.
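
As an illustration of what "robust" could mean here, the following is a
minimal sketch of strict UTF-8 decoding that rejects malformed input
instead of quietly substituting replacement characters.  It assumes the
java.nio.charset classes from later Java releases; the class name
StrictUtf8 is hypothetical, and nothing here is Lark or XmlInputStream
code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the Lark distribution.
class StrictUtf8 {
    // Decodes the bytes as UTF-8 and throws CharacterCodingException
    // on malformed input rather than substituting U+FFFD.
    static String decode(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }
}
```

A processor built this way could turn the exception into a
well-formedness error, which is one plausible policy among several.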

Everything else is, I think, conformant without exception.

----------
Validation
----------

Larval is just another version of Lark; but it has some more methods,
most notably 
 public void validate(boolean)
which as a side-effect turns on processExternalEntities; there
is a new validityError() callback in the Handler.  Of course there
are a bunch of new classes with names like DTD and Validator and
Attlist and so on.
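
A rough sketch of how this might look from the calling side.  Only
validate(boolean), processExternalEntities and the validityError()
callback are named above; the classes SketchHandler and SketchProcessor,
the signature of validityError, and the checkDeclared helper are all
hypothetical stand-ins, not Larval's actual API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the Handler.  The callback name comes from
// the text; its String argument is an assumption.
class SketchHandler {
    final List<String> errors = new ArrayList<>();

    void validityError(String message) {
        errors.add(message);
    }
}

// Hypothetical stand-in for a Larval-like processor.
class SketchProcessor {
    private boolean validating = false;
    private boolean processExternalEntities = false;
    private final SketchHandler handler;

    SketchProcessor(SketchHandler handler) {
        this.handler = handler;
    }

    // Turning validation on also turns on processExternalEntities as a
    // side effect.  Leaving the flag alone on validate(false) is an
    // assumption; the text does not say what happens then.
    public void validate(boolean on) {
        validating = on;
        if (on) {
            processExternalEntities = true;
        }
    }

    boolean processesExternalEntities() {
        return processExternalEntities;
    }

    // Assumed helper: report an element that has no declaration.
    void checkDeclared(String element, List<String> declared) {
        if (validating && !declared.contains(element)) {
            handler.validityError("undeclared element: " + element);
        }
    }
}
```

The point of the sketch is only that validity problems arrive through
a separate callback and appear only once validation is switched on.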

Larval is done this way because if you just use Lark, you'll never have 
to include any validation class files.  I can get away with this because 
even though Java doesn't have a preprocessor, Lark does.  Presumably I
will use the same trick to do SAX.

The validation implementation is pretty naive.  Rather than compiling
tables, Larval builds a data structure more or less isomorphic to
the declaration in the DTD, and then laboriously pokes around in it
every time it sees a start/end tag.  I think it proves that (a) a
naive implementation of validation can be done, and (b) this isn't
the right way to do it in the long-term.  However, it's nowhere
near as slow as I expected, and is good enough to be useful already
in debugging XML documents.
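
To make "isomorphic to the declaration" concrete, here is a minimal
sketch of the general idea, not Larval's code: a tree with one node per
name, sequence or choice in a content model, walked directly against
the list of child element names.  All class and method names are
hypothetical, and the sketch checks a complete child list in one go,
whereas Larval is described as working at each start/end tag.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical tree mirroring a declaration such as
// <!ELEMENT doc (title, (para | list)*)>
abstract class ModelNode {
    char occurrence = ' ';   // ' ', '?', '*' or '+', as written in the DTD

    ModelNode occurs(char c) { occurrence = c; return this; }

    // Positions reachable after matching this node exactly once from pos.
    abstract Set<Integer> matchOnce(List<String> kids, int pos);

    // Positions reachable once the occurrence indicator is applied.
    Set<Integer> match(List<String> kids, int pos) {
        Set<Integer> result = new TreeSet<>();
        if (occurrence == '?' || occurrence == '*') result.add(pos);
        Set<Integer> frontier = matchOnce(kids, pos);
        if (occurrence == ' ' || occurrence == '?') {
            result.addAll(frontier);
            return result;
        }
        // '*' or '+': repeat until no new positions turn up
        while (!frontier.isEmpty()) {
            Set<Integer> next = new TreeSet<>();
            for (int p : frontier) {
                if (result.add(p)) next.addAll(matchOnce(kids, p));
            }
            frontier = next;
        }
        return result;
    }
}

class NameNode extends ModelNode {
    final String name;
    NameNode(String name) { this.name = name; }
    Set<Integer> matchOnce(List<String> kids, int pos) {
        Set<Integer> r = new TreeSet<>();
        if (pos < kids.size() && kids.get(pos).equals(name)) r.add(pos + 1);
        return r;
    }
}

class SeqNode extends ModelNode {
    final List<ModelNode> parts;
    SeqNode(ModelNode... parts) { this.parts = Arrays.asList(parts); }
    Set<Integer> matchOnce(List<String> kids, int pos) {
        Set<Integer> current = new TreeSet<>();
        current.add(pos);
        for (ModelNode part : parts) {
            Set<Integer> next = new TreeSet<>();
            for (int p : current) next.addAll(part.match(kids, p));
            current = next;
        }
        return current;
    }
}

class ChoiceNode extends ModelNode {
    final List<ModelNode> options;
    ChoiceNode(ModelNode... options) { this.options = Arrays.asList(options); }
    Set<Integer> matchOnce(List<String> kids, int pos) {
        Set<Integer> r = new TreeSet<>();
        for (ModelNode option : options) r.addAll(option.match(kids, pos));
        return r;
    }
}

class NaiveValidator {
    // True if the whole child list is accepted by the content model.
    static boolean accepts(ModelNode model, List<String> kids) {
        return model.match(kids, 0).contains(kids.size());
    }
}
```

Every check re-walks the tree, which is the cost that compiling tables
would avoid; the attraction is that the structure can be built straight
from the declaration with no further processing.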

-------------
Other Changes
-------------

The doPI method now has separate args for target and remainder.
There is a doXmlDeclaration method.
There is a new method to tell Lark what name it should use for
  the document Entity, e.g. in error reporting.
There is an ESIS class that extends Handler; I don't claim this to
  be anything like a real SGML ESIS, but it's sure useful in 
  automated testing.
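
A small sketch of how the two-argument doPI and an ESIS-style handler
might fit together for testing.  Only doPI (with target and remainder)
and the idea of an ESIS class extending Handler come from the list
above; EventHandler, EsisSketch, PiSplitter, the other callback names,
and the line prefixes (borrowed loosely from nsgmls-style output) are
assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the Handler; apart from doPI, the callback
// names are assumed.
class EventHandler {
    void doPI(String target, String remainder) { }
    void doStartTag(String name) { }
    void doEndTag(String name) { }
    void doText(String text) { }
}

// Records one line per event so that two runs can be compared as text.
class EsisSketch extends EventHandler {
    final List<String> lines = new ArrayList<>();

    @Override void doPI(String target, String remainder) {
        lines.add(remainder.isEmpty() ? "?" + target
                                      : "?" + target + " " + remainder);
    }
    @Override void doStartTag(String name) { lines.add("(" + name); }
    @Override void doEndTag(String name)   { lines.add(")" + name); }
    @Override void doText(String text)     { lines.add("-" + text); }
}

class PiSplitter {
    // Splits the text found between "<?" and "?>" into the target and
    // whatever follows the first run of whitespace.
    static void dispatch(String piBody, EventHandler handler) {
        int i = 0;
        while (i < piBody.length() && !Character.isWhitespace(piBody.charAt(i))) i++;
        String target = piBody.substring(0, i);
        while (i < piBody.length() && Character.isWhitespace(piBody.charAt(i))) i++;
        handler.doPI(target, piBody.substring(i));
    }
}
```

Comparing the recorded lines against a stored expected file is one
simple way such a handler could be used in an automated test run.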

------------
Future Plans 
------------

Lark's version will remain 1.0 as long as XML does (a long time, I
hope).  Once it's no longer 'final beta' Lark.toString() will add a
build date-stamp to the "1.0" version string.

Larval will progress toward 1.0 as I get around to doing some really
serious testing on it.

 -Tim
