[Mirror copy of document from ftp://naggum.no/pub/SGML94/; poster delivered by Erik Naggum at SGML '94 at Tyson's Corner, VA]

Some Problems in Parsing SGML with Standard Software

SGML is hard to parse correctly. SGML parsers are hard to interface to, or come with their own languages that don't talk to any of your other software. Various approaches to turning parsing into an "event stream" have been proposed, but there is not yet any consensus or significant experience on what these events should be, and how to work with them. There isn't even any consensus on what an SGML document contains or means.

This leads me to the conclusion that programmers of various utilities will want to write their own parser, and that a definition of SGML that has a "unified" syntax (that is, one that does not change as features change), or a context-free recognition of markup constructs, to put it another way, must be a major win.

The standard is very clear that some things cannot be (valid) markup unless the prerequisite features are enabled. It is less clear that these constructs will be treated as data if they are not allowed. I propose that they be flagged as errors, either by a parser, or by a special "long-term investment protection" application.

Why more validation than SGML can do for you?

All applications that are worth using will have validation requirements above and beyond what SGML can offer. HyTime codifies a number of such requirements in application-level SGML constructs, and a HyTime engine will be needed to do the work.

Just as HyTime restricts what you can do to what makes sense for HyTime applications, I propose a set of conventions that will focus on longevity, with an eye for the fact that SGML must and will evolve, and to make certain that we don't unwittingly block the path some number of years down the road.

Take a look at Clause 9.6 Delimiter Recognition in the standard.
It lists the conditions under which delimiters are recognized as markup, depending on which features are enabled and on which characters follow the delimiter. Turn this on its head, and never use those things that can be interpreted as markup:

* don't use &# (cro) unless you mean a character entity reference.
* don't use ]]> (mse mdc) in your data (already forbidden!).
* don't use --> (com mdc) in your data, or you can't comment that data out.
* be careful with ; (refc) and > (tagc, mdc, pic).

You may want to be warned of any character that is part of a delimiter sequence that could occur in data. (This excludes those that can only occur in the SGML decl and DTD, as they are mostly harmless.)
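
The warning proposed above could be sketched as a small checker, roughly as follows. This is only an illustration, not an SGML parser: it assumes the delimiter strings of the reference concrete syntax, it scans a string that is already known to be data, and the names RISKY, CAREFUL and find_risky_delimiters are made up for this example. The check is deliberately context-free, so it reports "&#" even where Clause 9.6 would not recognize it as markup.

```python
# Delimiter strings as assigned in the reference concrete syntax.
# Assumption: a document using a variant concrete syntax would need
# a different table.
RISKY = [
    ("&#", "cro"),
    ("]]>", "mse mdc"),
    ("-->", "com mdc"),
]

# Single characters that merely deserve care; reported only on request,
# since they are common in ordinary text.
CAREFUL = [
    (";", "refc"),
    (">", "tagc/mdc/pic"),
]


def find_risky_delimiters(data, include_careful=False):
    """Return sorted (offset, text, role) tuples for each delimiter
    string found in data, regardless of surrounding context."""
    table = RISKY + (CAREFUL if include_careful else [])
    hits = []
    for text, role in table:
        start = data.find(text)
        while start != -1:
            hits.append((start, text, role))
            start = data.find(text, start + 1)
    return sorted(hits)
```

Calling find_risky_delimiters on a piece of data returns an empty list when none of the listed strings occur. Passing include_careful=True also reports every ; and >, which is noisy on ordinary prose and is therefore left off by default.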