# Regularity and Locality of String Languages and Tree Languages

From forest-automaton-request@fxis.fujixerox.co.jp Thu Feb 4 20:19:13 1999 Date: Fri, 05 Feb 1999 11:07:53 +0900 From: MURATA Makoto <murata@apsdc.ksp.fujixerox.co.jp> Subject: Fwd: Regularity and locality of string languages and tree languages To: forest-automaton@fxis.fujixerox.co.jp

MURATA Makoto wrote:

## Regularity and locality of string languages and tree languages

Murata Makoto

Fuji Xerox Information Systems

## 0. Abstract

Regularity and non-locality of string languages and tree languages are
considered. First, we study string-regular languages and string-local
languages. String-local languages can be accepted without using
non-local information (*i.e.*, what has occurred before the current
symbol), while string-regular languages require non-local information
via non-terminals. Then, we study tree-local languages, which does
not use non-local information (*i.e.*, what has occurred above the
current symbol). DTDs correspond to such tree-local languages.
Finally, we consider tree-regular languages, which require non-local
information via non-terminals. Tree-regular languages naturally
capture ancestor-sensitivity.

## 1. String-regular languages

When we talk about regularity, we usually consider sets of strings. Since we later consider sets of trees, we say "string-regular" rather than "regular". Likewise, we say "string-regular expressions" , "string (finite) automata", "string-reguar grammars" rather than "regular expresions", "(finite) automata", and "regular grammars".

A set of strings is called string-regular when:

- it is generated from a string-regular expression,
- it is accepted by a string automaton, or/li>
- it is generated from a string-regular grammar./li>

The above three conditions are equivalent. In this message, I explain string-regular grammars only and skip details of tring-regular expressions and string automata.

The set of strings that consists of elements of a finite set A is denoted by A*. In particular, A* contains the null string, which is denoted by epsilon.

A string-regular grammar G is a quadruple <N, T, S, P>, where N is a finite set of non-terminal symbols, T is a finite set of terminal symbols, S is a start non-terminal symbol (S is an element of N), P is a finite set of productions ruls of the form n1 ::= t (n1 is a non-terminal symbol and t is a terminal symbol), n1 ::= t n2 (n1 and n2 are non-terminal symbols and t is a terminal symbol), n1 ::= n2 (n1 and n2 are non-terminal symbols), or S ::= epsilon. Note: the third form (n1 ::= n2) are only for readability and does not contribute to the expressive power of string-regular grammars.

Derivation is defined in an usual manner. A string s is generated from the string-regular grammar G if s is derived from the start symbol S.

## 2. String-local languages

We now introduce a new concept called locality. A string-local language is string-regular, but a string-regular language is not always string-local.

As a motivating example, we consider a set {ab, abb, abbb, ....}. To tell if a string is contained by this set, we only have to examine conditions as below: - Is 'a' the first symbol in the string? - If the current symbol is 'a', is it followed by 'b'? - If the current symbol is 'b', is it the last symbol in the string or is it followed by 'b'? Observe that these conditions do *not* depend on what has occurred before the current symbol. That is, these conditions are *local*. Next, consider a set {abac}. The first occurrence of 'a' has to be followed by 'b', while the second occurrence has to be followed by 'c'. That is, to tell if a string is contained by this set, we have to know what has happended BEFORE the current occurence of 'a'. Such a set is *non-local*, since the local info (i.e., the current symbol is 'a') is not enough for determining permissible successors. Formally, a set of strings is string-local if it is generated by a string-local grammar. A string-local grammar G is a quadruple <N, T, S, P>, where T is a finite set of terminal symbols, N is the set of [t], where t is a terminal symbol, S is [t], where t is some terminal symbol, and P is a set of productions rules of the form [t1] ::= t1 (t1 is a terminal symbol), or [t1] ::= t1 [t2] (t1 and t2 are terminal symbols). Obviously, a string-local grammar is a string-regular grammar. We now show that the set {abac} cannot be generated by a string-local grammar. We suppose a string-local grammar that accepts the string "abac", and show that this grammar accepts more than this string Since the first terminal symbol is 'a', the start symbol has to be [a]. Since the second terminal symbol is 'b', we need a production rule [a] ::= a [b]. Likewise, since the third terminal symbol is 'a', we need a production rule [b] ::= b [a]; since the forth and last terminal symbol is 'c', we need production rules [a] ::= a [c] and [c] ::= c. Therefore, we have a local-grammar <{[a], [b], [c]}, {a, b, c}, [a], P>, where P contains productions rules as below: [a] ::= a [b], [b] ::= b [a], [a] ::= a [c], and [c] ::= c. This local grammar generates not only "abac" but also strings "ac", "abac", "ababac", "abababac", "ababababac", etc. The set {"ac", "abac", "ababac", "abababac", "ababababac", ...} is local. To tell if a string is contained by this set, we only have to examine conditions as below: - Is 'a' the first symbol in the string? - If the current symbol is 'a', is it followed by 'b' or 'c'? - If the current symbol is 'b', is it followed by 'a'? - If the current symbol is 'c', is it the last symbol in the string? None of these conditions require non-local information (i.e., what happened before the current symbol). Now, let us simplify the definition of string-local grammars. Since N can be computed from T, we do not need N. We can confuse [t] and t without loss of information. Since every production whose head is [t] has t as the first symbol in the body, we can get rid of this first symbol and use the rest of the body only. Here is a revised definition of string-local grammars. A string-local grammar is a triplet <T, S, P>, where T is a finite set of symbols, S is t, where t is some symbol, and P is a set of productions rules of the form t1 ::= (t1 is a symbol), or t1 ::= t2 (t1 and t2 are symbols).

## 3. Tree-local language

We have considered DTDs as context-free grammars and used them to describe permissible element trees. Observe that this approach is already backhanded since context-gree grammars are designed to describe permissible *strings*.

In DTDs, we can examine if the content of an element is valid WIHOUT KNOWING WHAT HAS OCCURRED ABOVE THIS ELEMENT. This is actually tree-locality.

Formally, a set of trees is tree-local if it is generated by a tree-local grammar. A tree-local grammar G is a triplet <T, S, P>, where T is a finite set of symbols, S is t, where t is some symbol, and P is a set of productions rules of the form t1 ::= u, where u is a string-regular expression over T. For example, consider a DTD as below: <!ELEMENT doc (sec*, app*)> <!ELEMENT sec (para*)> <!ELEMENT app (para*)> <!ELEMENT para (#PCDATA)> This is captured by a tree-local grammar <T, S, P>, where T = {doc, sec, app, para, #PCDATA}, S = doc , P = { doc ::= sec* app*, sec ::= para*, app ::= para*, para ::= #PCDATA}

This tree-local grammar looks very similar to a context-free grammar. However, tree-local grammars directly capture sets of trees, while context-free grammars capture sets of strings. It is true that a context-free grammar also describes a set of parse trees, but this is merely a side-effect.

## 4. Tree-regular languages

Tree-regular languages are similar to string-regular languages in that we can use non-local information. That is, we are allowed to know what has occurred ABOVE the current symbol (element). This non-locality has been (inappropriately) called context-sensitivity.

A tree-regular grammar G is a quadruple <N, T, S, P>, where N is a finite set of non-terminal symbols, T is a finite set of terminal symbols, S is a start non-terminal symbol (S is an element of N), P is a finite set of productions ruls of the form n ::= t <u> (n is a non-terminal symbol, t is a terminal symbol, and u is a string-regular expression over N), or n ::= <u> (n is a non-terminal symbol and u is a string-regular expression over N). Note that productions of the form n ::= t are covered by the first case and productions of the form S ::= epsilon are covered by the second case , since u may be the null string. Now, let us construct a tree-regular grammar that takes advantage of non-local information. That is, permissble subordinates of the element type para depends on superior elements. G = <T, N, S, P>, where T = {doc, sec, app, para, fnote, #PCDATA}, N = {DOC, SEC, APP, PARA1, PARA2, FNOTE, PCDATA} S = DOC , P = { DOC ::= doc <SEC* APP*>, SEC::= sec<PARA1*>, APP ::= app<PARA2*>, PARA1 ::= para<(PCDATA|fnote)*>, PARA2 ::= para<PCDATA>, FNOTE ::= fnote<PCDATA>, PCDATA ::= #PCDATA<>} This tree-regular grammar is equivalent to a DTD with exclusion. <!ELEMENT doc (sec*, app*)> <!ELEMENT sec (para*)> <!ELEMENT app (para*) - (fnote) > <!ELEMENT para (#PCDATA|fnote)*> <!ELEMENT fnote (#PCDATA)> Finally, productions of the form n ::= <u> contribute to readability For example, <!ENTITY % phrase "(fnote|em)"> <!ELEMENT p (#PCDATA|%phrase;)*> <!ELEMENT title (#PCDATA|%phrase;)*> can be captured by Phrase ::= FNOTE|EM P ::= p <(PCDATA|Phrase)*>, TITLE ::= title <(PCDATA|Phrase)*>,

Thus, we no longer require parameter entities for describing content models.

Cheers,

Makoto

Fuji Xerox Information Systems

Tel: +81-44-812-7230

Fax: +81-44-812-7231

E-mail: murata@apsdc.ksp.fujixerox.co.jp

Prepared by Robin Cover for the The SGML/XML Web Page archive. For additional references, see "SGML/XML and Forest Automata Theory." See also in this connection "Automatically Constructing the Intersection/Union/Difference of Two Schemas." Presented by Makoto Murata (Fuji Xerox) at XTech '99, Tuesday March 9, 1999.