docproc

[This local archive copy mirrored from the canonical site: http://javalab.uoregon.edu/ser/software/docproc_2/ snapshot 980318; links may not have complete integrity, so use the canonical document at this URL if possible.]

docproc

Version 19980207

Abstract	docproc is a software package that provides processing and layout of XML documents based on XSL scripts. docproc is written in pure Java, and can be used as a server-side preparser for serving XML documents on the web.
Table of Contents	Abstract
	Table of Contents
	New
	Why use docproc?
	Requirements
	Usage
		Using docproc in an HTTP server
		Using docproc from a CLI
	Mailing list
	Completed
	To Do
	Known bugs
	Getting docproc
	License
	Regarding scripting
	Extending docproc
	Notes
	Examples
	Characteristics Value Types
	Credits
New	This file was last modified on 07-Feb-98.

	08 Feb 98
	An email from Chris Lilley has prompted me to clean up the HTML docproc generates, and I'm trying to get the output to at least pass through a validator unmolested. One of the things I've had to do is hack an addition to the SCROLL layout object. If you create a scroll object, you can give it the following tags, which will be inserted into the header of the HTML document: title, author, and date. This is not a good solution! I'm taking recommendations on a better way to solve this. XSL doesn't provide semantics for this sort of thing.

	07 Feb 98
	I feel it is necessary to comment at this point on ECMAScript support in docproc. docproc uses Pnuts for scripting, and I've given my reasons below. As soon as I find a (relatively) freely available ECMAScript interpreter in Java, I will bind support for that language into docproc so that we can comply with the XSL specification. However, I know of no Java ECMAScript interpreter that I could bundle with docproc, and I'm not about to go author my own. For now, Pnuts will have to do. Pnuts is very similar to ECMAScript, and scripts written with Pnuts should convert to ECMAScript with little effort as the need warrants. Thank you for your patience.

	07 Feb 98
	Fixed another bug in the style sheet fetching procedure.

	04 Feb 98
	I introduced a lot of bugs in that last release. This release fixes them.

	28 Jan 98
	Har! I found out from Tim that the new version of Lark is case sensitive. This distribution is therefore only available as a full distribution, including the new Lark. Case insensitivity has been removed throughout docproc, resulting in improved performance.
Why use docproc?	docproc is one way of solving the problem commonly known as static HTML pages. When a web site is designed around HTML pages, the person writing the content is also usually the person laying out the web pages, designing not only what the page says, but how it looks. Often, it is preferable to separate these two jobs, allowing the content expert to author the content, and a layout professional to design the layout. Separating these jobs is difficult with static HTML.

	Another problem appears when the web site is upgraded, or a change in the layout is desired. At some point, someone must be assigned the tedious task of going through every web page at the site and changing the HTML so that all pages look the same, or the site must suffer from the lack of a consistent look and feel. Inevitably, there is a period when the site is in a state of flux, and often some pages are missed.

	There are two general solutions to static HTML pages. The first is for a site to purchase a site management package. Such a package normally contains a database connection, a template format, and a web server package that combines the two. When the web server receives a request for a document, the server software finds the data in the database and inserts it into the template, serving a dynamic HTML page to the client. The content experts can then add content into the database without worrying about layout, and the layout experts can design the web site layout while ignoring the content.

	The second solution involves a translation medium. With this solution, the content authors use a markup language to specify the type of information without specifying the layout. This is, in fact, a form of database entry, except that the document itself serves as the database. docproc follows this model, using XML as the markup language.

	The two solutions are very similar. Both are dynamic and allow special processing before serving the HTML page, such as inserting dynamic data (current stock quotes, time of day, etc.) and client-specific processing. Both allow the layout to be changed without altering the content source. The database approach has the advantage that the content can normally be authored with any software (WordPerfect, etc.), while a special converter adds the data into the database. This makes data entry easier. The database approach also costs money and requires significant infrastructure (if you know of a software solution that is freely distributable, please let me know). XML has two advantages over the database: the data stands on its own, and there is infrastructure for converting XML into formats other than HTML.

	The two solutions may be combined. docproc will attempt to do this in a later version, leveraging jDB, a database authored in pure Java by yours truly.
Requirements	docproc is written as a servlet, and will consequently run with any HTTP server software that supports servlets. The most obvious server, therefore, is JavaSoft's Java Web Server, which is free for noncommercial use.

	"JavaSoft has provided a package which may be used to embed servlet support in other web servers, including Apache (and derived servers such as Stronghold), Netscape servers, and Microsoft's IIS." (Java Web Server documentation on servlets)

	If you don't plan to use docproc as a server-side preparser, you can still use it to convert your XML documents to HTML by hand using the docproc application. For this, you need a Java virtual machine.
Usage	docproc can be used in two different ways. The first, and ideal, method is to use docproc as a servlet; the other is to call it by hand on documents that you want to reformat.

	Using docproc in an HTTP server

	How you install docproc as a servlet really depends on your HTTP server installation. For a more general overview of your servlet options, see my Hunt for Web October (I know, stupid name) document. As of 9.1.98, it is still being modified.

	The Java Web Server, by default, keeps all of its servlets in the directory <server home>/servlets/. The easiest thing to do is copy XMLServlet.class into this directory. Place the lark.jar, ser.jar, and pnuts.jar archives somewhere in the classpath of the web server; this may require modifying the httpd startup scripts, if you use them. Also make sure that the directory containing HTML.class, ASCII.class, and the other backends is in the classpath of the web server. Once this is done, go into the administration tool, manage the Web Service, go to Servlet Aliases and add a new entry with "Alias" = .xml and "Servlet Invoked" = XMLServlet. You may also want to add an entry in "MIME Types" for XML ("Extension" = xml, "Type" = text/xml), but this isn't strictly necessary. Then go into the Servlets section and add a new Servlet with whatever you want for the description, and the class name "XMLServlet". You'll probably want it loaded at startup, so check that option. Close the window and restart the server. If this document (not the HTML conversion, but the original XML document) is accessible through your HTTP server, you should now be able to reference this document and get a formatted HTML file served to your client!

	Nexus is a little web server authored by Anders Kristensen of HP. The original Nexus web server is fairly usable, although it lacks some of the bells and whistles. It is very cleanly coded, very small, and very fast. You may find easy extension worth the effort. Nexus was the only web server I found outside of the JWS that was close enough to supporting URI filtering, and I've installed most of the non-commercial ones. The original Nexus didn't support filtering, but I've added that support, and with Anders' permission, will make the distribution available for download. Email me if you are interested. Installation of docproc on Nexus is even easier than on JWS. Simply make sure that the docproc jar files and the XMLServlet class are in the class path when you run Nexus, and add an entry to the servlets.conf file:

		XMLServlet { code "XMLServlet", path ".xml" }

	and you're off!

	Other web servers are more difficult or impossible to configure. The problem is that the servlet API is not consistently implemented across all servers, and also that most servers don't support document filtering in this manner. Your web server must not only support servlets, but must allow you to filter a document request through a servlet, not just allow you to invoke a servlet by name. If you find out how to make Apache do this, please let me know.

	Using docproc from a CLI

	Here are the instructions for using docproc by hand. A UNIX shell script is provided for ease of use; call it with the -h option for a list of options. These instructions describe the details of invoking docproc.

	Make sure that your Java VM can find the textuality, Pnuts, and ser packages. The easiest way to do this is to include the three jar files in your classpath. Since the backends are not included in any package, make certain that the VM can find them as well. The two backends supplied with docproc at this point are a nearly fully functional HTML.class and a partially functional ASCII.class. A LOUT.class is in the works.

	An example CLI usage is:

		java ser.nexus.docproc -t HTML index.xml > index.html

	To get a help list describing docproc usage, use the -h option:

		java ser.nexus.docproc -h
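The CLI invocation above can also be driven from another Java program. The following is only a sketch: the class `DocprocRunner` and its methods are hypothetical and not part of docproc, and it assumes that the three jar files named in the servlet instructions (lark.jar, ser.jar, pnuts.jar) sit in one directory and the backend classes in another. Only the `-t` option shown in the example is used.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of docproc.
final class DocprocRunner {

    // Classpath holding the three jars plus the directory with the backends
    // (HTML.class, ASCII.class), which are not in any package.
    static String classpath(File libDir, File backendDir) {
        return String.join(File.pathSeparator,
                new File(libDir, "lark.jar").getPath(),
                new File(libDir, "ser.jar").getPath(),
                new File(libDir, "pnuts.jar").getPath(),
                backendDir.getPath());
    }

    // Equivalent of: java -classpath <cp> ser.nexus.docproc -t <backend> <xmlFile>
    static List<String> command(String classpath, String backend, String xmlFile) {
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-classpath");
        cmd.add(classpath);
        cmd.add("ser.nexus.docproc");
        cmd.add("-t");
        cmd.add(backend);
        cmd.add(xmlFile);
        return cmd;
    }

    // Runs the command, sending standard output to the given file
    // (the "> index.html" part of the shell example).
    static int run(List<String> cmd, File output) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(cmd)
                .redirectOutput(output)
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 4) {
            System.err.println("usage: DocprocRunner <libDir> <backendDir> <in.xml> <out.html>");
            return;
        }
        String cp = classpath(new File(args[0]), new File(args[1]));
        int exit = run(command(cp, "HTML", args[2]), new File(args[3]));
        System.out.println("docproc exited with status " + exit);
    }
}
```

Building the classpath explicitly avoids editing the environment of the calling process, at the cost of hard-coding the jar names.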
Mailing list	There is now a mailing list for docproc. At this point, this mailing list covers both docproc and docproc_2. To subscribe to the docproc mailing list, send an empty message to ser-docproc-subscribe@javalab.uoregon.edu.

	This is a qmail mailing list, with limited frills. I've been mucking about with various extensions to the qmail package, and the list server is a little more robust than it was before. You may send an email to the help server for docproc for a list of list server capabilities.

	If you subscribed to the mailing list before, please subscribe again. I didn't have the list server configured correctly before, and none of you are on the list at this point. Sorry for the inconvenience.
Completed	This section itemizes ways in which docproc deviates from the XSL specification. If you find anything on the following list that doesn't work correctly, please send me a bug report.

	The following list is based on the 27 August 1997 XSL proposal. Each item lists a section in the XSL proposal which has been implemented, excepting gross descriptions of how XSL works. Specifics start at section 2.2.

	Core query language items supported through scripting. For all of the methods which take an Element argument, sibling methods exist where the element argument defaults to the current element.

		String formatNumber( int, char )
		int[] hierarchicalNumber( String[], Element )
		int[] hierarchicalNumberRecursive( String, Element )
		Element parent( Element )
		Element ancestor( String, Element )
		String gi( Element )
		String firstChildGi( Element )
		String id( Element )
		int childNumber( Element )
		int ancestorChildNumber( String, Element )
		String attributeString( String, Element )
		String inheritedAttributeString( String, Element )
		String inheritedElementAttributeString( String, Element )
		boolean firstSibling( Element )
		boolean absoluteFirstSibling( Element )
		boolean lastSibling( Element )
		boolean absoluteLastSibling( Element )
		boolean haveAncestor( Element )

	Section 2.2, as suggested in the proposal. XML documents may define which style sheet they use by using the <?xml-stylesheet?> tag. docproc does not process URLs yet! The href attribute should contain, at most, a file path. Style sheets are looked for first within the same directory as the XML source, then in a global directory defined by the application environment.

	4: style-rules.

	4.1: docproc does not use named styles and I have no intention of supporting them. While it does allow you to define styles, to access them you must use the apply directive or the use attribute.

	Scripting: docproc uses Pnuts as a default scripting language. See the section on scripting for more details.
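To give a feel for what a few of the core query functions listed above might compute, here is a sketch written against the standard org.w3c.dom API rather than Lark's element type, which docproc actually uses. The class `QuerySketch` is hypothetical, and the semantics are assumptions inferred from the function names and from DSSSL usage: `gi` is taken to be the element's tag name, `ancestor` the nearest enclosing element with the given name, `childNumber` the 1-based position among preceding siblings with the same name, and `inheritedAttributeString` the value from the nearest element (starting with the element itself) that carries the attribute. docproc's own behavior may differ.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical illustration only; not docproc's implementation.
final class QuerySketch {

    // Generic identifier: assumed to be the tag name.
    static String gi(Element e) {
        return e.getTagName();
    }

    // Parent element, or null at the document root.
    static Element parent(Element e) {
        Node p = e.getParentNode();
        return (p instanceof Element) ? (Element) p : null;
    }

    // Nearest ancestor with the given name, or null if there is none.
    static Element ancestor(String name, Element e) {
        for (Element a = parent(e); a != null; a = parent(a)) {
            if (name.equals(gi(a))) return a;
        }
        return null;
    }

    // Assumed semantics: 1 + number of preceding siblings with the same name.
    static int childNumber(Element e) {
        int n = 1;
        for (Node s = e.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
            if (s instanceof Element && gi((Element) s).equals(gi(e))) n++;
        }
        return n;
    }

    // Assumed semantics: look on the element, then on each ancestor in turn.
    static String inheritedAttributeString(String name, Element e) {
        for (Element a = e; a != null; a = parent(a)) {
            if (a.hasAttribute(name)) return a.getAttribute(name);
        }
        return null;
    }

    // Parses a small XML string; wraps checked exceptions for brevity.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        Document d = parse("<doc lang='en'><sec><p/><note/><p id='x'/></sec></doc>");
        Element p = (Element) d.getElementsByTagName("p").item(1);
        System.out.println(childNumber(p));                      // prints 2
        System.out.println(gi(ancestor("doc", p)));              // prints doc
        System.out.println(inheritedAttributeString("lang", p)); // prints en
    }
}
```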
To Do	This is an incomplete list of things left to do and of items I'd like comments on from users. If you find anything which is incomplete and is not on this list, please let me know.

	Improve HTML output.

	Child pattern matching support needs position and only support.

	Add support in docproc for using Lark's validation methods.

	Add support for multiple output streams.

	Add support in the servlet for caching of pages. What I want to do is allow "virtual" pages, so that a style sheet can be defined calling multiple scroll objects, with each one generating a page. These pages would be stored in a cache by the servlet for service upon request.

	I'm not sure yet whether this is a problem with the HTML backend, or if it is indicative of a deeper problem in docproc, but display-groups are not handled completely properly. For example:

		<display-group>
		  <select>
		    <target-element type="blah"/>
		  </select>
		</display-group>

	does not display each child selected, but rather displays the group of selected items as if they were one item.

	Document the source code. I've been really bad about that.

	Section 3.2.4: ID and CLASS attributes

	Section 3.3.4: Flow object macros

	Section 7.5: define-script

	Other: inline styles, better CSS support

	CSS support is rather weak. In fact, any non-DSSSL flow object that docproc encounters is passed through unchanged (except for processing of scripting in attributes) to the final output. Since the unofficial stand of the XSL working group is that CSS flow objects are differentiated from DSSSL flow objects by virtue only of their case, I'll need to modify docproc to handle this correctly.
Known bugs	Documents must have a root node or else the first node will be processed using the default rule.

	The processing of children pattern matching, which I've mentioned is rather weak, is now known to contain bugs.

	The choosing of the style sheet that is used is poorly ordered.
Getting docproc	Since the Lark and Pnuts archives are so large, I'm making two distributions: a full distribution with everything docproc needs to function, and a minimal distribution intended for upgrading, which does not contain the Lark or Pnuts archives. Also, I've discontinued the tar/gzip distributions, since anyone running docproc has access to the jar tool.

	Minimal distribution (no Lark or Pnuts). (262Kb, 07-Feb-98)

	Full distribution (includes Lark and Pnuts). (466Kb, 07-Feb-98)

	Both of these archives are signed by jarsigner (javakey, for 1.1 users). You can fetch my key from my web page. If you don't yet have JDK1.1, you can use unzip to extract the archive.
License	This package is freely distributable, provided that the package is redistributed in its original form. Value-added resellers and people intending to make money leveraging off this software should contact me first. I don't have anything against making money. Educational institutions are exempt from this, as is private (personal) usage. Contact me if you feel you fall into a grey area. All source code is supplied. Feel free to spread the word and point people in my direction.
Regarding scripting	Scripting has turned out to be enough of a problem to require a section of its own. It was also a very large section, so I've moved it here.
Extending docproc	docproc can be extended by adding backends, or by improving the backends I've provided. Look at the source code for the HTML and ASCII classes for an idea of what needs to be done.
Notes	If you use docproc as a servlet, please note that XMLServlet takes advantage of caching, and caching only notices changes to the XML source, not the XSL source. Therefore, if you edit your XSL style sheet but don't change your XML document, you probably won't notice any change when you reload the document in your browser. The solution is to "touch" the XML source (update the timestamp on the XML source file) before reloading.

	If a system property named xsl-path is set, its value is used as a search path for XSL scripts. It should be a comma-separated series of paths.

	XSL style sheets are searched for in the following order:

		In the directory containing the source document
		In the directory(s) supplied via the ser.nexus.Nexus.addGlobalPath() method. This is normally only set via the "-g" command line option.
		In the directory(s) found in the xsl.path system property

	When using Pnuts, two variables are set for use in your scripts: element and source_path. element is a reference to the XML element that triggered the current script. source_path is usually a fully qualified path name pointing to the directory in which the XML source file is contained. source_path will always be set by the XMLServlet.

	See docproc v1 for an earlier, and somewhat simpler, version of this package. The earlier version is distributed under the GNU General Public License.
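The style sheet search order described above can be sketched as follows. This is not docproc's code: the class `StyleSheetLocator` and its methods are hypothetical, the global paths are passed in as a plain list instead of going through ser.nexus.Nexus.addGlobalPath(), and the value of the system property is passed in as a string so the sketch does not depend on which property name is in effect.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of docproc: mirrors the documented search order.
final class StyleSheetLocator {

    // Splits a comma-separated series of paths; null or blank entries are dropped.
    static List<String> splitPaths(String value) {
        List<String> paths = new ArrayList<>();
        if (value == null) return paths;
        for (String part : value.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) paths.add(trimmed);
        }
        return paths;
    }

    // Returns the first existing file named href, or null if none is found.
    static File locate(File xmlSource, String href,
                       List<String> globalPaths, String propertyValue) {
        List<File> dirs = new ArrayList<>();

        // 1. The directory containing the source document.
        File sourceDir = xmlSource.getAbsoluteFile().getParentFile();
        if (sourceDir != null) dirs.add(sourceDir);

        // 2. The global directories (the "-g" option in docproc).
        for (String global : globalPaths) dirs.add(new File(global));

        // 3. The directories named by the system property.
        for (String path : splitPaths(propertyValue)) dirs.add(new File(path));

        for (File dir : dirs) {
            File candidate = new File(dir, href);
            if (candidate.isFile()) return candidate;
        }
        return null;
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("usage: StyleSheetLocator <source.xml> <href> [globalDir...]");
            return;
        }
        List<String> globals = new ArrayList<>();
        for (int i = 2; i < args.length; i++) globals.add(args[i]);
        File found = locate(new File(args[0]), args[1], globals,
                System.getProperty("xsl.path"));
        System.out.println(found == null ? "not found" : found.getPath());
    }
}
```

Because the source document's own directory is always tried first, a style sheet placed next to the XML file shadows any global copy with the same name.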
Examples	This document was generated from this XML source, using this XSL style sheet. This style sheet is a good example of several docproc abilities, including non-trivial pattern matching, extended scripting, and complex layout. This style sheet does not take full advantage of XSL supported by docproc.
Characteristics Value Types	The following table describes the mapping of DSSSL types to Java types that are stored in the `Characteristics` class. Types
Credits	This package would not function were it not for two resources, and derives its most useful ability from a third:

	Tim Bray, of Textuality, provided the XML parser, Lark. I use Lark in most of my software.

	Toyokazu Tomatsu, of Sun, provided the default scripting language, Pnuts. Pnuts is proving to be very useful as a rapid prototyping aid, even though it is in early development.

	I'd also like to acknowledge:

	My fiancée, for letting me work on this when I should have been spending time with her.

	JavaSoft, yet again, for providing us with Java in the first place.

	The working group for the XSL proposal. If not for XSL, docproc would use XS (DSSSL-o) stylesheets, which are Scheme-based. Oh, the horror.

	The working group for the XML proposal, for obvious reasons.

	Robin Cover maintains a great SGML/XML resource list.