Regularity and Locality of String Languages and Tree Languages
From forest-automaton-request@fxis.fujixerox.co.jp Thu Feb 4 20:19:13 1999 Date: Fri, 05 Feb 1999 11:07:53 +0900 From: MURATA Makoto <murata@apsdc.ksp.fujixerox.co.jp> Subject: Fwd: Regularity and locality of string languages and tree languages To: forest-automaton@fxis.fujixerox.co.jp
MURATA Makoto wrote:
Regularity and locality of string languages and tree languages
Murata Makoto
Fuji Xerox Information Systems
0. Abstract
Regularity and non-locality of string languages and tree languages are considered. First, we study string-regular languages and string-local languages. String-local languages can be accepted without using non-local information (i.e., what has occurred before the current symbol), while string-regular languages require non-local information via non-terminals. Then, we study tree-local languages, which does not use non-local information (i.e., what has occurred above the current symbol). DTDs correspond to such tree-local languages. Finally, we consider tree-regular languages, which require non-local information via non-terminals. Tree-regular languages naturally capture ancestor-sensitivity.
1. String-regular languages
When we talk about regularity, we usually consider sets of strings. Since we later consider sets of trees, we say "string-regular" rather than "regular". Likewise, we say "string-regular expressions" , "string (finite) automata", "string-reguar grammars" rather than "regular expresions", "(finite) automata", and "regular grammars".
A set of strings is called string-regular when:
- it is generated from a string-regular expression,
- it is accepted by a string automaton, or/li>
- it is generated from a string-regular grammar./li>
The above three conditions are equivalent. In this message, I explain string-regular grammars only and skip details of tring-regular expressions and string automata.
The set of strings that consists of elements of a finite set A is denoted by A*. In particular, A* contains the null string, which is denoted by epsilon.
A string-regular grammar G is a quadruple <N, T, S, P>, where
N is a finite set of non-terminal symbols,
T is a finite set of terminal symbols,
S is a start non-terminal symbol (S is an element of N),
P is a finite set of productions ruls of the form
n1 ::= t (n1 is a non-terminal symbol and t is a terminal symbol),
n1 ::= t n2 (n1 and n2 are non-terminal symbols and t is a
terminal symbol),
n1 ::= n2 (n1 and n2 are non-terminal symbols), or
S ::= epsilon.
Note: the third form (n1 ::= n2) are only for readability and does
not contribute to the expressive power of string-regular
grammars.
Derivation is defined in an usual manner. A string s is generated from the string-regular grammar G if s is derived from the start symbol S.
2. String-local languages
We now introduce a new concept called locality. A string-local language is string-regular, but a string-regular language is not always string-local.
As a motivating example, we consider a set {ab, abb, abbb, ....}.
To tell if a string is contained by this set, we only have to
examine conditions as below:
- Is 'a' the first symbol in the string?
- If the current symbol is 'a', is it followed by 'b'?
- If the current symbol is 'b', is it the last symbol in the string or
is it followed by 'b'?
Observe that these conditions do *not* depend on what has occurred
before the current symbol. That is, these conditions are *local*.
Next, consider a set {abac}. The first occurrence of 'a' has to be
followed by 'b', while the second occurrence has to be followed by
'c'. That is, to tell if a string is contained by this set, we have
to know what has happended BEFORE the current occurence of 'a'. Such
a set is *non-local*, since the local info (i.e., the current symbol
is 'a') is not enough for determining permissible successors.
Formally, a set of strings is string-local if it is generated by a
string-local grammar. A string-local grammar G is a quadruple
<N, T, S, P>, where
T is a finite set of terminal symbols,
N is the set of [t], where t is a terminal symbol,
S is [t], where t is some terminal symbol, and
P is a set of productions rules of the form
[t1] ::= t1 (t1 is a terminal symbol), or
[t1] ::= t1 [t2] (t1 and t2 are terminal symbols).
Obviously, a string-local grammar is a string-regular grammar.
We now show that the set {abac} cannot be generated by a string-local
grammar. We suppose a string-local grammar that accepts the string
"abac", and show that this grammar accepts more than this string
Since the first terminal symbol is 'a', the start symbol has to be
[a]. Since the second terminal symbol is 'b', we need a production
rule [a] ::= a [b]. Likewise, since the third terminal symbol is 'a',
we need a production rule [b] ::= b [a]; since the forth and last
terminal symbol is 'c', we need production rules [a] ::= a [c] and
[c] ::= c.
Therefore, we have a local-grammar <{[a], [b], [c]}, {a, b, c}, [a], P>,
where P contains productions rules as below:
[a] ::= a [b],
[b] ::= b [a],
[a] ::= a [c], and
[c] ::= c.
This local grammar generates not only "abac" but also strings
"ac", "abac", "ababac", "abababac", "ababababac", etc.
The set {"ac", "abac", "ababac", "abababac", "ababababac", ...} is
local. To tell if a string is contained by this set, we only have to
examine conditions as below:
- Is 'a' the first symbol in the string?
- If the current symbol is 'a', is it followed by 'b' or 'c'?
- If the current symbol is 'b', is it followed by 'a'?
- If the current symbol is 'c', is it the last symbol in the string?
None of these conditions require non-local information (i.e.,
what happened before the current symbol).
Now, let us simplify the definition of string-local grammars. Since N
can be computed from T, we do not need N. We can confuse [t] and t
without loss of information. Since every production whose head is [t]
has t as the first symbol in the body, we can get rid of this first
symbol and use the rest of the body only.
Here is a revised definition of string-local grammars. A string-local
grammar is a triplet <T, S, P>, where
T is a finite set of symbols,
S is t, where t is some symbol, and
P is a set of productions rules of the form
t1 ::= (t1 is a symbol), or
t1 ::= t2 (t1 and t2 are symbols).
3. Tree-local language
We have considered DTDs as context-free grammars and used them to describe permissible element trees. Observe that this approach is already backhanded since context-gree grammars are designed to describe permissible *strings*.
In DTDs, we can examine if the content of an element is valid WIHOUT KNOWING WHAT HAS OCCURRED ABOVE THIS ELEMENT. This is actually tree-locality.
Formally, a set of trees is tree-local if it is generated by a
tree-local grammar. A tree-local grammar G is a triplet <T, S, P>,
where
T is a finite set of symbols,
S is t, where t is some symbol, and
P is a set of productions rules of the form
t1 ::= u, where u is a string-regular expression over T.
For example, consider a DTD as below:
<!ELEMENT doc (sec*, app*)>
<!ELEMENT sec (para*)>
<!ELEMENT app (para*)>
<!ELEMENT para (#PCDATA)>
This is captured by a tree-local grammar <T, S, P>, where
T = {doc, sec, app, para, #PCDATA},
S = doc ,
P = {
doc ::= sec* app*,
sec ::= para*,
app ::= para*,
para ::= #PCDATA}
This tree-local grammar looks very similar to a context-free grammar. However, tree-local grammars directly capture sets of trees, while context-free grammars capture sets of strings. It is true that a context-free grammar also describes a set of parse trees, but this is merely a side-effect.
4. Tree-regular languages
Tree-regular languages are similar to string-regular languages in that we can use non-local information. That is, we are allowed to know what has occurred ABOVE the current symbol (element). This non-locality has been (inappropriately) called context-sensitivity.
A tree-regular grammar G is a quadruple <N, T, S, P>, where
N is a finite set of non-terminal symbols,
T is a finite set of terminal symbols,
S is a start non-terminal symbol (S is an element of N),
P is a finite set of productions ruls of the form
n ::= t <u> (n is a non-terminal symbol, t is a
terminal symbol, and u is a string-regular expression
over N), or
n ::= <u> (n is a non-terminal symbol and u is a string-regular
expression over N).
Note that productions of the form n ::= t are covered by the first case
and productions of the form S ::= epsilon are covered by the second
case , since u may be the null string.
Now, let us construct a tree-regular grammar that takes advantage
of non-local information. That is, permissble subordinates of the
element type para depends on superior elements.
G = <T, N, S, P>, where
T = {doc, sec, app, para, fnote, #PCDATA},
N = {DOC, SEC, APP, PARA1, PARA2, FNOTE, PCDATA}
S = DOC ,
P = {
DOC ::= doc <SEC* APP*>,
SEC::= sec<PARA1*>,
APP ::= app<PARA2*>,
PARA1 ::= para<(PCDATA|fnote)*>,
PARA2 ::= para<PCDATA>,
FNOTE ::= fnote<PCDATA>,
PCDATA ::= #PCDATA<>}
This tree-regular grammar is equivalent to a DTD with exclusion.
<!ELEMENT doc (sec*, app*)>
<!ELEMENT sec (para*)>
<!ELEMENT app (para*) - (fnote) >
<!ELEMENT para (#PCDATA|fnote)*>
<!ELEMENT fnote (#PCDATA)>
Finally, productions of the form n ::= <u> contribute to readability
For example,
<!ENTITY % phrase "(fnote|em)">
<!ELEMENT p (#PCDATA|%phrase;)*>
<!ELEMENT title (#PCDATA|%phrase;)*>
can be captured by
Phrase ::= FNOTE|EM
P ::= p <(PCDATA|Phrase)*>,
TITLE ::= title <(PCDATA|Phrase)*>,
Thus, we no longer require parameter entities for describing content models.
Cheers,
Makoto
Fuji Xerox Information Systems
Tel: +81-44-812-7230
Fax: +81-44-812-7231
E-mail: murata@apsdc.ksp.fujixerox.co.jp
Prepared by Robin Cover for the The SGML/XML Web Page archive. For additional references, see "SGML/XML and Forest Automata Theory." See also in this connection "Automatically Constructing the Intersection/Union/Difference of Two Schemas." Presented by Makoto Murata (Fuji Xerox) at XTech '99, Tuesday March 9, 1999.

