Lark 0.92 now available

Date: Mon, 08 Sep 1997 23:25:19 -0700
From: Tim Bray <>
Subject: Lark 0.92 available


Hi - Lark 0.92 is now available at

Pardon the quick releases, but thanks to Sun's JWS profiler, Lark 0.92
is now 11.9 times faster than 0.91.  Also, the accompanying "xh"
application, which formats the XML spec and related documents (including
what you get at the URL above), has been upgraded so that it can now
process the Japanese version of the XML specification and produce beautiful
UCS-2 Japanese HTML output.  (Go to and download their
Cyberbit font if you want to see some damn nice-looking stuff on your 
screen - Netscape can do it, but be warned that Communicator 4 + Cyberbit 
between them will use all your memory, no matter how much you have).

When you can have a few tens of K of code do this kind of transformation 
on two violently different character sets, it bespeaks, I think, a couple 
of standards (Java and XML) in pleasing harmony.

The process of getting the Japanese formatting working would have been
completely impossible without all sorts of support and question-answering
and double-checking and pointing-to-useful-resources from Murata Makoto
of FXIS; many thanks to him.

[From the 0.91 version announcement, September 04, 1997]:

Hi - Lark 0.91 is now available at

Only one real difference - it now does Unicode.  It reads the BOM and thus
handles UCS-2/UTF-16 (it even byte-swaps); if there's no BOM, it reads the
encoding declaration and tries to use it, booting the document if the
declaration says anything but "UTF-8" or "UTF8".  It successfully parses
Murata-san's translation of the XML spec; I would love to get my hands on
some more internationalized XML, in particular with non-ASCII markup.
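
The detection order described above (BOM first, then the encoding
declaration) can be sketched roughly as follows.  This is not Lark's
actual code: the class and method names are hypothetical, it uses the
modern java.nio.charset API rather than anything available in 1997, and
it assumes that a missing declaration defaults to UTF-8 and that the
declared name is compared case-insensitively.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not part of Lark.
final class EncodingSniffer {

    /** Returns the charset implied by a leading byte-order mark, or null if there is none. */
    static Charset fromBom(byte[] head) {
        if (head.length >= 2) {
            int b0 = head[0] & 0xFF;
            int b1 = head[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) return StandardCharsets.UTF_16BE;
            // Reversed BOM: the 16-bit units are in the opposite byte order.
            if (b0 == 0xFF && b1 == 0xFE) return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    /**
     * Picks an encoding: the BOM wins; otherwise only a declared
     * "UTF-8" or "UTF8" is accepted.  Treating a missing declaration
     * (null) as UTF-8 is an assumption of this sketch.
     */
    static Charset detect(byte[] head, String declared) {
        Charset bom = fromBom(head);
        if (bom != null) return bom;
        if (declared == null
                || declared.equalsIgnoreCase("UTF-8")
                || declared.equalsIgnoreCase("UTF8")) {
            return StandardCharsets.UTF_8;
        }
        // "Boot" anything else.
        throw new IllegalArgumentException("unsupported encoding: " + declared);
    }
}
```

A caller would pass the first few bytes of the document plus whatever
name the encoding declaration gave, e.g.
EncodingSniffer.detect(firstBytes, "UTF-8").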

Another 6K of .class files for I18n, sigh.

Lots of bug-fixes in the event-stream module.  I had to write a
significant event-stream Lark application to pull the character classes
out of the XML spec in order to build the file, and
ran across a few bodacious bugs in end-tag handling.

It's a bit bogus because it really doesn't do UTF-8 yet, just ASCII
masquerading as such.  UTF-8 Real Soon Now.

Cheers, Tim Bray +1-604-708-9592