Cover Pages Logo SEARCH
Advanced Search
ABOUT
Site Map
CP RSS Channel
Contact Us
Sponsoring CP
About Our Sponsors

NEWS
Cover Stories
Articles & Papers
Press Releases

CORE STANDARDS
XML
SGML
Schemas
XSL/XSLT/XPath
XLink
XML Query
CSS
SVG

TECHNOLOGY REPORTS
XML Applications
General Apps
Government Apps
Academic Apps

EVENTS
LIBRARY
Introductions
FAQs
Bibliography
Technology and Society
Semantics
Tech Topics
Software
Related Standards
Historic

Regularity and Locality of String Languages and Tree Languages


From     forest-automaton-request@fxis.fujixerox.co.jp Thu Feb  4 20:19:13 1999
Date:    Fri, 05 Feb 1999 11:07:53 +0900
From:    MURATA Makoto <murata@apsdc.ksp.fujixerox.co.jp>
Subject: Fwd: Regularity and locality of string languages and tree languages
To:      forest-automaton@fxis.fujixerox.co.jp

MURATA Makoto wrote:


Regularity and locality of string languages and tree languages

Murata Makoto
Fuji Xerox Information Systems

0. Abstract

Regularity and non-locality of string languages and tree languages are considered. First, we study string-regular languages and string-local languages. String-local languages can be accepted without using non-local information (i.e., what has occurred before the current symbol), while string-regular languages require non-local information via non-terminals. Then, we study tree-local languages, which does not use non-local information (i.e., what has occurred above the current symbol). DTDs correspond to such tree-local languages. Finally, we consider tree-regular languages, which require non-local information via non-terminals. Tree-regular languages naturally capture ancestor-sensitivity.

1. String-regular languages

When we talk about regularity, we usually consider sets of strings. Since we later consider sets of trees, we say "string-regular" rather than "regular". Likewise, we say "string-regular expressions" , "string (finite) automata", "string-reguar grammars" rather than "regular expresions", "(finite) automata", and "regular grammars".

A set of strings is called string-regular when:

  1. it is generated from a string-regular expression,
  2. it is accepted by a string automaton, or/li>
  3. it is generated from a string-regular grammar./li>

The above three conditions are equivalent. In this message, I explain string-regular grammars only and skip details of tring-regular expressions and string automata.

The set of strings that consists of elements of a finite set A is denoted by A*. In particular, A* contains the null string, which is denoted by epsilon.

A string-regular grammar G is a quadruple <N, T, S, P>, where 
  N is a finite set of non-terminal symbols,
  T is a finite set of terminal symbols,
  S is a start non-terminal symbol (S is an element of N),
  P is a finite set of productions ruls of the form 

    n1 ::= t	(n1 is a non-terminal symbol and t is a terminal symbol),  
    n1 ::= t n2 (n1 and n2 are non-terminal symbols and t is a 
		 terminal symbol), 
    n1 ::= n2   (n1 and n2 are non-terminal symbols), or 
    S  ::= epsilon.

Note: the third form (n1 ::= n2) are only for readability and does 
      not contribute to the expressive power of string-regular 
      grammars.

Derivation is defined in an usual manner. A string s is generated from the string-regular grammar G if s is derived from the start symbol S.

2. String-local languages

We now introduce a new concept called locality. A string-local language is string-regular, but a string-regular language is not always string-local.


As a motivating example, we consider a set {ab, abb, abbb, ....}.  
To tell if a string is contained by this set, we only have to 
examine conditions as below:
   - Is 'a' the first symbol in the string?
   - If the current symbol is 'a', is it followed by 'b'?
   - If the current symbol is 'b', is it the last symbol in the string or 
     is it followed by 'b'?

Observe that these conditions do *not* depend on what has occurred
before the current symbol.  That is, these conditions are *local*.

Next, consider a set {abac}.  The first occurrence of 'a' has to be
followed by 'b', while the second occurrence has to be followed by
'c'.  That is, to tell if a string is contained by this set, we have
to know what has happended BEFORE the current occurence of 'a'.  Such
a set is *non-local*, since the local info (i.e., the current symbol
is 'a') is not enough for determining permissible successors.

Formally, a set of strings is string-local if it is generated by a
string-local grammar.  A string-local grammar G is a quadruple 
<N, T, S, P>, where
  T is a finite set of terminal symbols,
  N is the set of [t], where t is a terminal symbol, 
  S is [t], where t is some terminal symbol, and
  P is a set of productions rules of the form 
    [t1] ::= t1	(t1 is a terminal symbol), or
    [t1] ::= t1 [t2] (t1 and t2 are terminal symbols).

Obviously, a string-local grammar is a string-regular grammar.

We now show that the set {abac} cannot be generated by a string-local
grammar.  We suppose a string-local grammar that accepts the string
"abac", and show that this grammar accepts more than this string

Since the first terminal symbol is 'a', the start symbol has to be
[a].  Since the second terminal symbol is 'b', we need a production
rule [a] ::= a [b].  Likewise, since the third terminal symbol is 'a',
we need a production rule [b] ::= b [a]; since the forth and last
terminal symbol is 'c', we need production rules [a] ::= a [c] and 
[c] ::= c.

Therefore, we have a local-grammar <{[a], [b], [c]}, {a, b, c}, [a], P>, 
where P contains productions rules as below:
      [a] ::= a [b],
      [b] ::= b [a],
      [a] ::= a [c], and
      [c] ::= c.

This local grammar generates not only "abac" but also strings
"ac", "abac", "ababac", "abababac", "ababababac", etc.

The set {"ac", "abac", "ababac", "abababac", "ababababac", ...}  is
local.  To tell if a string is contained by this set, we only have to
examine conditions as below:
   - Is 'a' the first symbol in the string?
   - If the current symbol is 'a', is it followed by 'b' or 'c'?
   - If the current symbol is 'b', is it followed by 'a'?
   - If the current symbol is 'c', is it the last symbol in the string? 

None of these conditions require non-local information (i.e., 
what happened before the current symbol).

Now, let us simplify the definition of string-local grammars.  Since N
can be computed from T, we do not need N.  We can confuse [t] and t
without loss of information.  Since every production whose head is [t]
has t as the first symbol in the body, we can get rid of this first  
symbol and use the rest of the body only.

Here is a revised definition of string-local grammars.  A string-local
grammar is a triplet <T, S, P>, where
  T is a finite set of symbols,
  S is t, where t is some symbol, and
  P is a set of productions rules of the form 
    t1 ::=    (t1 is a symbol), or
    t1 ::= t2 (t1 and t2 are symbols).

3. Tree-local language

We have considered DTDs as context-free grammars and used them to describe permissible element trees. Observe that this approach is already backhanded since context-gree grammars are designed to describe permissible *strings*.

In DTDs, we can examine if the content of an element is valid WIHOUT KNOWING WHAT HAS OCCURRED ABOVE THIS ELEMENT. This is actually tree-locality.


Formally, a set of trees is tree-local if it is generated by a
tree-local grammar.  A tree-local grammar G is a triplet <T, S, P>,
where
  T is a finite set of symbols,
  S is t, where t is some symbol, and
  P is a set of productions rules of the form 
    t1 ::= u, where u is a string-regular expression over T.

For example, consider a DTD as below:

<!ELEMENT doc (sec*, app*)>
<!ELEMENT sec (para*)>
<!ELEMENT app (para*)>
<!ELEMENT para (#PCDATA)>

This is captured by a tree-local grammar <T, S, P>, where
     T = {doc, sec, app, para, #PCDATA},
     S = doc ,
     P = {
       doc ::= sec* app*,
       sec ::= para*,
       app ::= para*,
       para ::= #PCDATA}

This tree-local grammar looks very similar to a context-free grammar. However, tree-local grammars directly capture sets of trees, while context-free grammars capture sets of strings. It is true that a context-free grammar also describes a set of parse trees, but this is merely a side-effect.

4. Tree-regular languages

Tree-regular languages are similar to string-regular languages in that we can use non-local information. That is, we are allowed to know what has occurred ABOVE the current symbol (element). This non-locality has been (inappropriately) called context-sensitivity.

A tree-regular grammar G is a quadruple <N, T, S, P>, where 
  N is a finite set of non-terminal symbols,
  T is a finite set of terminal symbols,
  S is a start non-terminal symbol (S is an element of N),
  P is a finite set of productions ruls of the form 
    n ::= t <u> (n is a non-terminal symbol, t is a 
		 terminal symbol, and u is a string-regular expression 
		 over N),  or
    n ::= <u>   (n is a non-terminal symbol and u is a string-regular 
	      expression over N).
   
Note that productions of the form n ::= t are covered by the first case
and productions of the form S ::= epsilon are covered by the second
case , since u may be the null string.

Now, let us construct a tree-regular grammar that takes advantage 
of non-local information.  That is, permissble subordinates of the 
element type para depends on superior elements.

G = <T, N, S, P>, where
     T = {doc, sec, app, para, fnote, #PCDATA},
     N = {DOC, SEC, APP, PARA1, PARA2, FNOTE, PCDATA}
     S = DOC ,
     P = {
       DOC ::= doc <SEC* APP*>,
       SEC::= sec<PARA1*>,
       APP ::= app<PARA2*>,
       PARA1 ::= para<(PCDATA|fnote)*>,
       PARA2 ::= para<PCDATA>,
       FNOTE ::= fnote<PCDATA>,
       PCDATA ::= #PCDATA<>}

This tree-regular grammar is equivalent to a DTD with exclusion.

<!ELEMENT doc (sec*, app*)>
<!ELEMENT sec (para*)> 
<!ELEMENT app (para*) - (fnote) >
<!ELEMENT para (#PCDATA|fnote)*>
<!ELEMENT fnote (#PCDATA)>

Finally, productions of the form n ::= <u> contribute to readability
For example,

<!ENTITY % phrase "(fnote|em)">
<!ELEMENT p     (#PCDATA|%phrase;)*>
<!ELEMENT title (#PCDATA|%phrase;)*>

can be captured by 

    Phrase ::= FNOTE|EM
    P      ::= p <(PCDATA|Phrase)*>,
    TITLE  ::= title <(PCDATA|Phrase)*>,

Thus, we no longer require parameter entities for describing content models.

Cheers,
Makoto

Fuji Xerox Information Systems
Tel: +81-44-812-7230
Fax: +81-44-812-7231
E-mail: murata@apsdc.ksp.fujixerox.co.jp


Prepared by Robin Cover for the The SGML/XML Web Page archive. For additional references, see "SGML/XML and Forest Automata Theory." See also in this connection "Automatically Constructing the Intersection/Union/Difference of Two Schemas." Presented by Makoto Murata (Fuji Xerox) at XTech '99, Tuesday March 9, 1999.


Globe Image

Document URL: http://xml.coverpages.org/murataRegularity.html