SGML: SGML and Context-Free Grammars
Subject: Re: Various questions
Date: 6 Feb 1997 15:49:29 -0800
From: jenglish@crl.com (Joe English)
Newsgroups: comp.text.sgml
------------------------------------------------
Lars Marius Garshol <larsga@ifi.uio.no> wrote:
>eric.kidd@pobox.com (Eric M. Kidd) writes:
>> An SGML document cannot be described with a context-free grammar.
>> [...]
>> The element structure permitted by a given DTD could be recognized by a
>> sufficiently complex deterministic pushdown automaton. I don't know if it's
>> possible to be more specific than that.
>
>The set of languages that can be produced by context-free grammars is
>identical to the set of languages that can be recognized with pushdown
>automata. (As far as I can remember that includes non-deterministic ones as
>well.) So which one is it? :-)
The basic problem here is that an SGML document begins with its
own "grammar" (the DTD). For any particular DTD, the set of
_instances_ can be described with a context-free grammar [*],
but the language of complete SGML _documents_ is the set
{ (SGML-declaration + _prolog_ + _instance_) : where _instance_
conforms to the grammar defined by _prolog_ }
and recognizing that language takes a Turing machine.
[*] Converting a DTD to a CFG is _possible_, but not in most
cases _practical_ when you take into account things like
start-tag omission and CDATA/RCDATA declared content.
>And as for SGML (and C++) not being either LL(k) or LALR I'm a bit surprised,
>as that must make parsing very difficult.
That only means it's difficult to parse SGML with an LL or LALR
parser... There are other parsing techniques, you know :-)
From an algorithmic standpoint, parsing SGML is in some
ways easier than the traditional lex/yacc approach.
(By this I mean that, although writing an SGML parser is
considerably more difficult than *using* lex and yacc,
it's easier than *writing* lex and yacc.)
The first tricky part is in entity management. The lex/yacc
approach assumes that the input is a single stream of characters
(perhaps, as in the case of C/C++, preprocessed to handle macro
expansion and file inclusion). An SGML document is composed of a
tree of entities, so the parser has to handle a stack of input
sources.
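As a rough illustration (the class and the toy replacement-text scheme are my own, not anything from ISO 8879), the entity manager can be modeled as a stack of input sources:

```python
class EntityManager:
    """Toy sketch of SGML-style entity management: the parser reads
    characters from a stack of input sources, not a single stream."""

    def __init__(self, entities, root):
        self.entities = entities                # name -> replacement text
        self.stack = [iter(entities[root])]     # document entity at the bottom

    def push(self, name):
        # An entity reference opens a new input source on top of the stack.
        self.stack.append(iter(self.entities[name]))

    def read(self):
        # Yield the next character, falling back to the enclosing
        # entity whenever the current one is exhausted.
        while self.stack:
            ch = next(self.stack[-1], None)
            if ch is not None:
                return ch
            self.stack.pop()
        return None                             # end of the document entity
```

Here the caller pushes entities explicitly; a real parser would spot entity references during delimiter scanning and push the replacement text itself.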
Delimiter recognition corresponds roughly to the "lex"
phase of a lex/yacc parse. It is in some ways more complicated
than lex-based scanning due to the number of delimiter recognition
modes and their interaction with the other parsing phases,
but in other ways it's simpler. Since all the delimiters are
fixed-length strings [*] you can use a trie instead of a
full-blown DFA like lex and flex do.
[*] With the exception of "B" sequences in SHORTREF maps;
these are a pain.
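A minimal sketch of the trie approach (the delimiter set below is a small illustrative subset, not the full ISO 8879 reference delimiter set):

```python
def build_trie(delimiters):
    """Build a character trie; the empty-string key marks the end
    of a complete delimiter."""
    root = {}
    for d in delimiters:
        node = root
        for ch in d:
            node = node.setdefault(ch, {})
        node[''] = d
    return root

def match_delimiter(trie, text, pos):
    """Return the longest delimiter starting at text[pos], or None.
    Because every delimiter is a fixed string, one walk down the
    trie suffices -- no DFA construction needed."""
    node, best = trie, None
    for ch in text[pos:]:
        if ch not in node:
            break
        node = node[ch]
        if '' in node:
            best = node['']
    return best
```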
Markup parsing -- interpreting markup declarations, start- and
end-tags, etc., -- can be done with a simple recursive-descent
parser, since this aspect of the ISO 8879 grammar is all LL(1).
The main difficulty here is the interaction with the entity manager
and delimiter recognizer.
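For instance, a start-tag with attributes needs only one token of lookahead. The grammar below is a toy approximation of my own, not the ISO 8879 productions:

```python
import re

# Names, quoted literals, and single-character delimiters never
# overlap here, so one token of lookahead (LL(1)) is enough.
TOKEN = re.compile(r'\s*([A-Za-z][\w.-]*|"[^"]*"|[<>=/])')

def parse_start_tag(source):
    """Recursive-descent sketch for a toy start-tag grammar:
        start_tag ::= '<' NAME attribute* '>'
        attribute ::= NAME '=' LITERAL
    """
    tokens = TOKEN.findall(source)
    pos = 0

    def expect(tok):
        nonlocal pos
        if tokens[pos] != tok:
            raise SyntaxError(f'expected {tok!r}, got {tokens[pos]!r}')
        pos += 1

    expect('<')
    name = tokens[pos]; pos += 1
    attributes = {}
    while tokens[pos] != '>':            # peek one token to decide
        attr = tokens[pos]; pos += 1
        expect('=')
        attributes[attr] = tokens[pos].strip('"'); pos += 1
    expect('>')
    return name, attributes
```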
The next phase is building the element hierarchy. (I suspect that
this is where most people who try to use yacc to parse SGML
get into trouble. This is really best treated as a separate parsing phase,
rather than trying to account for the markup and element hierarchy
with a single grammar.) This is *considerably* easier than
LL or LALR parsing: you can match element tokens against
individual content models with regular expression techniques,
while LL/LALR parsing requires that the entire grammar be
processed to build the parse tables.
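A sketch of the idea, assuming one-letter generic identifiers so that a (simplified) content model can be translated into a Python regular expression almost verbatim:

```python
import re

def model_to_regex(model):
    """Translate a toy content model into a regex over element names.
    ',' (sequence) becomes concatenation; '|' (alternation) and the
    occurrence indicators ?, *, + already mean the same thing in
    regex syntax.  (AND groups and OMITTAG are deliberately ignored.)"""
    return re.compile(model.replace(',', ''))

def content_matches(model, children):
    """Check a sequence of child element names against one content model."""
    return model_to_regex(model).fullmatch(''.join(children)) is not None
```

Each element's children are matched against that element's own content model in isolation; no whole-grammar table construction is involved.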
The main difficulties in content-model matching are dealing with
OMITTAG (mainly start-tag omission; omitted end-tags are easier),
and handling AND groups. The latter is made somewhat easier
by the rule prohibiting ambiguous content models (if you build
an NFA from a valid content model, it's guaranteed to be deterministic,
which allows some important optimizations), but AND groups are
still a pain.
XML forbids AND groups, OMITTAG, SHORTREF, and R/CDATA declared
content; these seem to be the hardest aspects of SGML parsing.
(Actually, the hardest aspect is the astonishing number of
minor points and small details in ISO 8879, and since the
Standard is written in "legalese" rather than "computerese"
it takes a great deal of effort to account for all of
them.)
--Joe English
jenglish@crl.com
-----------------------------------------------------------------------
Subject: Re: DTD vs BNF Questions
Date: 23 May 1997 11:54:47 -0700
From: jenglish@crl2.crl.com (Joe English)
Newsgroups: comp.text.sgml
References: <feustelEAJKpu.Ksu@netcom.com>
Dave Feustel <feustel@netcom.com> wrote:
>
>Are DTD and BNF (Backus-Naur Form) equivalent?
No.
>If not, is one a subset of the other?
>
>If neither specification language is a proper subset of the other,
>what parts of the DTD specification are not expressible in BNF?
>
>Is it correct that XML *is* equivalent to a BNF?
No.
It is however possible to create an equivalent context-free (BNF)
grammar from any XML DTD, where the terminals are start-tags, end-tags,
and #PCDATA and the productions correspond to element types and
content model positions. The reverse is not possible in general,
so XML DTDs are (in a sense) a subset of BNF.
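As an illustration (the two-element DTD and the backtracking recognizer below are my own toy construction, not a general conversion algorithm):

```python
# Hypothetical DTD:  <!ELEMENT doc   (title, para*)>
#                    <!ELEMENT title (#PCDATA)>
#                    <!ELEMENT para  (#PCDATA)>
# Derived CFG; the terminals are start-tags, end-tags, and #PCDATA:
RULES = {
    'doc':   [['<doc>', 'title', 'paras', '</doc>']],
    'paras': [['para', 'paras'], []],            # para* unrolled
    'title': [['<title>', '#PCDATA', '</title>']],
    'para':  [['<para>', '#PCDATA', '</para>']],
}

def derive(symbol, tokens, i):
    """Try each production in order; return the new position or None.
    (Ordered choice with backtracking -- enough for this toy grammar,
    though not a fully general CFG parser.)"""
    if symbol not in RULES:                      # terminal symbol
        return i + 1 if i < len(tokens) and tokens[i] == symbol else None
    for production in RULES[symbol]:
        j = i
        for part in production:
            j = derive(part, tokens, j)
            if j is None:
                break
        else:
            return j
    return None

def recognize(tokens):
    return derive('doc', tokens, 0) == len(tokens)
```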
The same *might* be true of SGML, but when you consider things
like declared content, exceptions, AND groups, and OMITTAG,
converting a general SGML DTD to a CFG is a much more difficult
problem. "Do SGML DTDs define context-free languages" is still
an open question AFAIK. (I suspect the answer is "yes", but
even if so it would not be a very useful result; the derived CFG
would be intractably large in many cases.)
>Are there any articles on-line which address these questions?
There's a section in the "Special Topics" page of Robin Cover's
SGML Web site; <URL: http://www.sil.org/sgml/topics.html>.
Also look for papers by Derick Wood and Anne Brüggemann-Klein;
they've done a lot of research in this area.
--Joe English
jenglish@crl.com