Cover Pages: SGML/XML and Forest/Hedge Automata Theory

[February 24, 2000] This document contains a collection of references on Tree-Regular Languages, Forest-Regular Languages, DTD/Document Transformation, Forest/Tree Automata Theory, Schema Languages, and related matters. Most of the publications below are written by Murata Makoto and Paul Prescod; the corresponding bibliographic entries contain document abstracts. As of early 1999, 'forest automata' are being referred to as 'hedge automata' in the context of SGML/XML schemas.

References:

[April 19, 2000] K. Kawaguchi recently posted an announcement inviting participation in the new 'RELDEVE' mailing list. RELDEVE [reldeve@egroups.com] is an unmoderated mailing list for those who are intereseted in RELAX schema language. The list 'reldevj@xml.gr.jp' is also available for Japanese. RELAX is a language for describing tag sets of XML; it is based on hedge automata. The list announcement from MURATA Makoto (Chair of the INSTAC XML SWG) supplies additional information. See: (1) the RELAX Web site and (2) "REgular LAnguage description for XML (RELAX)."
"Hedge automata: a formal model for XML schemata." By MURATA Makoto (Fuji Xerox Information Systems). "This note shows preliminaries of the hedge automaton theory. In the XML community, this theory has been recently recognized as a simple but powerful model for XML schemata. In particular, the design of RELAX (REgular LAnguage for XML) is directly based on this theory." Alt URL: www.xml.gr.jp/relax/hedge_nice.html. [local archive copy]
"DTD Transformation by Patterns and Contextual Conditions." By Murata Makoto (Fuji Xerox Information Systems). In SGML/XML '97 Conference Proceedings, pages 153-169 (with 13 references). Available online.
Twenty-five slides from Makoto's SGML/XML '97 presentation have been made available by the author.
"Transformation of Documents and Schemas by Patterns and Contextual Conditions" = the "mathematical" version" of the paper "DTD Transformation. . ." (above), restricted to binary trees; pages 153-169 (with 13 references). By Murata Makoto. In Principles of Document Processing. Proceedings of the Third International Workshop, 1996. Available online in PDF format.
"Forest-Regular Languages and Tree-Regular Languages." By Murata Makoto, 1995. This paper provides the basis for Makoto's other work on forest regular language theory. Available online.
"Data Model for Document Transformation and Assembly." By Murata Makoto. Extended abstract for a PODDP '98 presentation. The paper shows a data model for transforming and assembling document information such as SGML and XML documents; uses a declarative query language based upon principles articulated in "DTD Transformation by Patterns. . ." (above). Available online in PDF format.
"Formalizing XML and SGML Instances with Forest Automata Theory." Draft technical document. By Paul Prescod, University of Waterloo Computer Science. Revision 0.95, May 5, 1998. Abstract: "The notions of schemata and validation are widely deemed to be crucial to the success of XML and SGML. Yet there are very few people who consider the formal definition of these terms nor formal definitions of how individual schema languages work. This paper surveys a particular formalization for schemata that is sufficiently powerful to implement recursive SGML content model validation and also many features that SGML content models do not support." Online preliminary version of the paper; [local archive copy]
[March 19, 2001] "Extended Path Expressions for XML." By Murata Makoto (IBM Tokyo Research Lab/IUJ Research Institute, 1623-14, Shimotsuruma, Yamato-shi, Kanagawa-ken 242-8502, Japan; Email: mmurata@trl.ibm.co.jp). [Extended abstract] prepared a presentation at PODS (Principles of Database Systems) 2001. With 35 references. ZIP format. Abstract: "Query languages for XML often use path expressions to locate elements in XML documents. Path expressions are regular expressions such that underlying alphabets represent conditions on nodes. Path expressions represent conditions on paths from the root, but do not represent conditions on siblings, siblings of ancestors, and descendants of such siblings. In order to capture such conditions, we propose to extend underlying alphabets. Each symbol in an extended alphabet is a triplet (e1; a; e2), where 'a' is a condition on nodes, and 'e1 (e2)' is a condition on elder (resp. younger) siblings and their descendants; 'e1' and 'e2' are represented by hedge regular expressions, which are as expressive as hedge automata (hedges are ordered sequences of trees). Nodes matching such an extended path expression can be located by traversing the XML document twice. Furthermore, given an input schema and a query operation controlled by an extended path expression, it is possible to construct an output schema. This is done by identifying where in the input schema the given extended path expression is satisfied." Details: "XML has been widely recognized as one of the most important formats on the WWW. XML documents are ordered trees containing text, and thus have structures more exible than relations of relational databases. Query languages for XML have been actively studied. Typically, operations of such query languages can be controlled by path expressions. A path expression is a regular expression such that underlying alphabets represent conditions on nodes. For example, by specifying a path expression, we can extract figures in sections, figures in sections in sections, figures in sections in sections in sections, and so forth, where section and figure are conditions on nodes. Based on well-established theories of regular languages, a number of useful techniques (e.g., optimization) for path expressions have been developed. However, when applied to XML, path expressions do not take advantage of orderedness of XML documents. For example, path expressions cannot locate all <figure> elements whose immediately following siblings are <table> elements. On the other hand, industrial specifications such as XPath have been developed. Such specifications address orderedness of XML documents. In fact, XPath can capture the above example. However, these specifications are not driven by any formal models, but rather designed in an ad hoc manner. Lack of formal models prevents generalization of useful techniques originally developed for path expressions. As a formal framework for addressing orderedness, this paper shows a natural extension of path expressions. First, we introduce hedge regular expressions, which generate hedges (ordered sequences of ordered trees). Hedge regular expressions can be converted to hedge automata (variations of tree automata for hedges) and vice versa. Given a hedge and a hedge regular expression, we can determine which node in the hedge matches the given hedge regular expression by executing the hedge automaton. The computation time is linear to the number of nodes in hedges. Second, we introduce pointed hedge representations. They are regular expressions such that each 'symbol' is a triplet (e1, a1, e2), where e1 e2 are hedge regular expressions and a is a condition on nodes. Intuitively, e1 represent conditions on elder siblings and their descendants, while e2 represent conditions on younger siblings and their descendants. As a special case, if every hedge regular expression in a pointed hedge representation generates all hedges, this pointed hedge representation is a path expression. Given a hedge and a pointed hedge representation, we can determine which node in the hedge matches the given pointed hedge representation. For each node, (1) we determine which of the hedge regular expressions matches the elder siblings and younger siblings, respectively, (2) we then determine which of the triplets the node matches, and (3) we finally evaluate the pointed hedge representation. Again, the computation time is linear to the number of nodes in hedges. Another goal of this work is schema transformation. Recall that query operations of relational databases construct not only relations but also schemas. For example, given input schemas (A; B) and(B;C), the join operation creates an output schema (A; B; C). Such output schemas allow further processing of output relations. It would be desirable for query languages for XML to provide such schema transformations. That is, we would like to construct output schemas from input schemas and query operations (e.g., select, delete), which utilize hedge regular expressions and pointed hedge representations. To facilitate such schema transformation, we construct match-identifying hedge automata from hedge regular expressions and pointed hedge representations. The computation of such automata assigns marked states to those nodes which match the hedge regular expressions and pointed hedge representations. Schema transformation is effected by first creating intersection hedge automata which simulate the match-identifying hedge automata and the input schemata, and then transforming the intersection hedge automata as appropriate to the query operation... In Section 2, we consider related works. We introduce hedges and hedge automata in Section 3, and then introduce hedge regular expressions in Section 4. In Section 5, we introduce pointed hedges and pointed hedge representations. In Section 6, we define selection queries as pairs of hedge regular expressions and pointed hedge representations. In Section 7, we study how to locate nodes in hedges by evaluating pointed hedge representations. In Section 8, we construct match-identifying hedge automata, and then construct output schemas. In Section 9, we conclude and consider future works... We have assumed XML documents as hedges and have presented a formal framework for XML queries. Our selection queries are combinations of hedge regular expressions and pointed hedge representations. A hedge regular expression captures conditions on descendant nodes. To locate nodes, a hedge regular expression is first converted to a deterministic hedge automaton and then it is executed by a single depth-first traversal. Meanwhile, a pointed hedge representation captures conditions on non-descendant nodes (e.g., ancestors, siblings, siblings of ancestors, and descendants of such siblings). To locate nodes, a pointed hedge representation is first converted to triplets: (1) a deterministic hedge automaton, (2) a finite-index right-invariant equivalence of states, and (3) a string automaton over the equivalence classes. Then, this triplet is executed by two depth-first traversals. Schema transformation is effected by identifying where in an input schema the given hedge regular expression and pointed hedge representation is satisfied. Interestingly enough, as it turns out our framework exactly captures the selection queries definable by MSO, as do boolean attribute grammars and query automata. On the other hand, our framework has two advantages over MSO-driven approaches. First, conversion of MSO formulas to query automata or boolean attribute grammars requires non-elementary space, thus discouraging implementations. On the other hand, our framework employs determinization of hedge automaton, which requires exponential time. However, we conjecture that such determinization usually works, as does determinization of string automata. Second,(string) regular expressions have been so widely and successfully used by many users because they are very easy to understand. We hope that hedge regular expressions and pointed hedge representations will become commodities for XML in the near future. There are some interesting open issues. First, is it possible to generalize useful techniques (e.g., optimization) developed for path expressions to hedge regular expressions and pointed hedge representations? Second, we would like to introduce variables to hedge regular expressions so that query operations can use the values assigned to such variables. For this purpose, we have to study unambiguity of hedge regular expressions. An ambiguous expression may have morethan one way to match a given hedge, while an unambiguous expression has at most only one such way. Variables can be safely introduced to unambiguous expressions." [cache]
[October 05, 1999] Murata Makoto (Fuji Xerox Information Systems) has published a preliminary version of a document on 'hedge regular languages' (aka 'forest-regular languages and tree-regular languages'): "Hedge Automata: A Formal Model for XML Schemata." The research note describes "preliminaries of the hedge automaton theory. In the XML community, this theory has been recently recognized as a simple but powerful model for XML schemata. In particular, the design of two schema languages for XML, namely RSL (Regular Schema Language) and DSD (Document Structure Description), is directly derived from this theory. First, we introduce hedges. Informally, a hedge is a sequence of trees. In the XML terminology, a hedge is a sequence of elements possibly intervened by character data (or types of character data). In particular, an XML document can be considered as a hedge. [Then] we introduce hedge regular grammars (RHGs). An RHG is a mechanism for generating hedges. In other words, an RHG describes a set of hedges. Since the primary role of an XML schema is to describe a set of valid documents, an RHG can be considered as a formal representation of a XML schema. [Then we introduce] deterministic hedge automata and non-deterministic hedge automata. . . The set of parse trees of an extended context-free grammar is said to be a local tree language. A lot is known about the relationships between local tree languages and regular hedge languagess. We mention two observations which are directly relevant to XML. (1) A local tree language is a regular hedge language (in other words, for any extenced context-free grammar, we can construct a DHA.), and (2) For any regular hedge language that contains trees only, there exists a unique minimal local set that includes the language. Observation '(1)' implies that HRGs are more powerful than DTDs, while '(2)' ensures that given any HRG, we can construct a reasonable DTD." [local archive copy, PDF] Available also in Postscript format, [local archive copy, PS].
Automatically Constructing the Intersection/Union/Difference of Two Schemas. Presented by Makoto Murata (Fuji Xerox) at XTech '99, Tuesday March 9, 1999. Abstract: "This talk demonstrates a new technology for automatically constructing the intersection/union/difference of two schemata. The hedge automaton theory provides the foundation of this technology. First, input schemata (written in XSchema) are converted to hedge automata (formerly called forest automata). Second, by applying boolean operations to these hedge automata, an output hedge automaton is constructed. Last, the constructed automaton is converted to a schema (again, written in XSchema). The internal representation and boolean operations of hedge automata are built on top of the 'Grail' automaton construction toolkit." See, in this connection, "Regularity and Locality of String Languages and Tree Languages." See also the 20 presentation slides, the annotated log file from the demonstration, and complete .ZIP package format [local archive copy]
[February 10, 1999] "Regularity and Locality of String Languages and Tree Languages." By MURATA Makoto. Technical Report. February, 1999. Abstract: "Regularity and non-locality of string languages and tree languages are considered. First, we study string-regular languages and string-local languages. String-local languages can be accepted without using non-local information (i.e., what has occurred before the current symbol), while string-regular languages require non-local information via non-terminals. Then, we study tree-local languages, which does not use non-local information (i.e., what has occurred above the current symbol). DTDs correspond to such tree-local languages. Finally, we consider tree-regular languages, which require non-local information via non-terminals. Tree-regular languages naturally capture ancestor-sensitivity."
[March 11, 1999] "Syntax for Regular-but-non-local Schemata for Structured Documents." By MURATA Makoto. Technical Report. February/March, 1999. "In this note, I show practical requirements for regular-but-non-local schemata for structured documents, and then suggest a syntax for writing regular-but-non-local schemata. Thoughout this memo, a schema is said to be local if it describes a local tree set. A local schema language is a schema langauge that can express local schemata only. A schema is said to be regular if it describes a regular tree set. A regular schema language is a schema language that can express regular schemata only."
[February 24, 1999] Applications of Tree Automata in Rewriting, Logic and Programming. Dagstuhl Seminar 9743. October 20-24, 1997. Seminar organized by Hubert Comon, Dexter Kozen, Helmut Seidl, and Mosche Y. Vardi. "The idea of organizing this seminar came during the Federated Logic Conference, New Brunswick, 1996, which brought together the Symposium on Logic in Computer Science, the Conference on Rewriting Techniques and Applications, the Conference on Automated Deduction and the Conference on Computer Aided Verification. . . By a group of French researchers, a first draft of a basic textbook on tree automata [Tree Automata Techniques and Applications] was presented." [local archive copy]
[August 11, 2000] "Using Regular Tree Automata as XML Schemas." By Boris Chidlovskii (Xerox Research Centre Europe, Grenoble Laboratory, 6 Chemin de Maupertuis, F-38240 Meylan, France). Pages 89-98 (with 28 references) in Proceedings of IEEE Advances in Digital Libraries 2000 [Washington, DC, USA, May 22-24, 2000; Los Alamitos, CA, USA: IEEE Computer Society, 2000, edited by J. Hoppenbrouwers, T. de Souza Lima, M. Papazoglou, and A. Sheth]. PostScript, 147K. Document: P91004. "We address the problem of tight XML schemas and propose regular tree automata to model XML data. We show that the tree automata model is more powerful than the XML DTDs and is closed under main algebraic operations. We introduce the XML query algebra based on the tree automata model, and discuss the query optimization and query pruning techniques. Finally we show the conversion of tree automata schema into XML DTDs. . . Based on SGML (ISO 8879), XML is designed to meet the challenges of large-scale electronic publishing and data exchange on the Web. XML documents are composed of nested elements and the logical structure of elements is defined by Document Type Definitions (DTDs). Defined as grammars, DTDs are highly flexible; the structure they impose in documents is often less restrictive than the rigid structure of relational schemas but more restrictive than allowing any-to- any relationships between object types. The knowledge of DTDs is highly useful for the query formulation through the user graphic interface and for the XML query optimization. However, despite all these features, DTDs appear to be somewhat unsatisfactory as a schema formalism. Some obvious missing things in DTDs, such as elementary data types, structural and validation constraints, have been recently addressed in the W3C XML Schema proposal. There remains however another conceptual difficulty with DTDs, not addressed by the W3C proposal. The DTDs appear to be surprisingly inflexible when one attempts to capture modifications imposed by various XML retrieval and manipulation mechanisms. Even for simple XML selection queries, DTDs face the tightness problem, when no DTD can precisely represent the structure of query results. To understand the cause of the tightness problem, we have to look at the way SGML/XML define a document structure. The content models of individual elements in XML documents are regular expressions. However, the language defined by an entire DTD is not regular; it is formalized as the extended context-free language (ECFL). The class of context-free grammars is not closed under certain elementary operations, such as intersection and negation and extending context-free grammars with regular expressions brings in no additional power. Consequently, if an XML query is more complex than a regular expression, we cannot necessarily make a context-free grammar (the result DTD) for the set of documents fitting both the original DTD and the query formulation. . . In this paper, we address the problem of tight XML schemas and introduce a novel mechanism for modeling XML documents, based on tree automata. This model owns a number of beneficial properties similar to the string regular expressions; in particular, tree automata form the boolean algebra. With the tree automata model, we consider a powerful set of algebraic operations, defined in the way similar to the relational algebra. For any query expressed by these operations, a precise schema represented by a tree automaton can be obtained in polynomial time. The algebraic operations allow using regular path expressions in query formulation and we show how XML schema given by tree automata can be used for the pruning of path expressions. The tree automata mechanism is more powerful than DTDs. Thus, any DTD can be correctly captured by a tree automaton. On the other hand, we show that translation of tree automata into DTD may require some generalization. We describe the translation procedure and prove that the generalization performed in this way is minimal. Research on XML data has been preceded by intensive research on semi-structured data. . . Our algorithm 1 generates the minimal element contents for XML tags, however it does not cope with the DTD unambiguity. Some ambiguous DTDs can be converted into equivalent unambiguous ones. Unfortunately, in most cases, the conversion requires generalization of grammars. For DTD unambiguity, we rely on work done by Wood and Brueuggeman-Klein, who studied the conditions when regular expressions used for definition of XML element contents are unambiguous. They introduced the notion of 1-determinism and proved that only those regular expressions satisfying the 1-deterministic condition can be used in DTD. Our method also exploits the 1-deterministic property established for element contents of an unambiguous DTD. Orbit properties of 1-deterministic grammars give the sufficient condition that a given finite-state automaton recognizes words that form 1-deterministic regular language. If the automaton has no orbit property, it should be generalized by merging nodes and adding transitions. For a detailed description of the method transforming ambiguous regular expression into a unambiguous, we refer to [H. Ahonen, 'Disambiguation of SGML Content Models.']. [Conclusion:] We study the use of tree automata as schemas for XML files. Tree automata are proven to be closed under boolean operators; this allows us to design a XML query language in the way similar to the relational algebra, and induce precise schema for any XML query formulated in this language. We show how to translate tree automata into DTDs and discussed the DTD unambiguity."
[September 15, 2000] "Tree Automata and Pattern Matching." By Haruo Hosoya and Benjamin Pierce (Department of Computer and Information Science, University of Pennsylvania). Email: {fhahosoya,bcpierceg}@cis.upenn.edu. July 18, 2000. 15 pages, 28 references. "Tree automata have been used both in program analyses for functional languages and in type systems for tree-like data such as SGML and XML. They also form a natural basis for rich forms of pattern matching on tree structures, including conventional ML-style patterns as well as extensions such as 'OR-patterns' and 'recursive patterns.' Incorporating tree patterns and tree types into full-blown programming languages involves substantial generalizations to a number of standard analyses, including (1) exhaustiveness and redundancy checks, (2) propagation of type constraints to pattern variables from the surrounding context, to avoid the need for type annotations on pattern variables, and (3) optimization of pattern matching. We present algorithms for all three. The main technical challenges arise from the fact that recursively defined patterns make termination of the algorithms somewhat delicate, and that complex control arising from alternation patterns and the first-match policy make it non-trivial to infer the best types for pattern variables. To address the first difficulty, we rely on the finiteness of state spaces generated by regular tree automata. For the second, we exploit closure properties of tree automata to maintain the precision of type information as it is propagated through complex control." See also the full paper, "Tree Automata and Pattern Matching." [cache]
[June 30, 1998] Tree Automata Techniques and Applications (TATA) - "a textbook which presents the basics of tree automata and several variants of tree automata. . . submissions of contributions to new chapters and improvements of existing ones are welcome." Authors (as of 1998-06-30): Hubert Comon, Max Dauchet, Remi Gilleron, Denis Lugiez, Sophie Tison, Marc Tommasi. See the TATA main page, the Volume Introduction (HTML), the Volume Table of Contents (HTML), a Postscript version of draft 980630 [local archive copy]. Email contact: tata@lifl.fr.
[June 30, 1998] "Locating Matches of Tree Patterns in Forests." By Andreas Neumann and Helmut Seidl (Universität Trier - Informatik) [local archive copy]. Submitted to FST&TCS '98. See also A. Neumann's publications page. "Abstract: We deal with matching and locating of patterns in forests of arbitrary arity. A pattern consists of a structural and a contextual condition for subtrees of a forest, both of which are given as tree or forest regular languages. We adopt the notation of mu-formulae for uniformly specifying both kinds of conditions. In order to implement pattern matching we introduce the class of pushdown forest automata. We identify a special class of contexts such that not only pattern matching but also locating all of a forest's subtrees matching in context can be performed in a single traversal. We also give a method for computing the reachable states of an automaton in order to minimize the size of transition tables." Bibliography entry.
"Regular tree languages over non-ranked alphabets." By Anne Brüggemann-Klein, Derick Wood, and Murata Makoto. Version 0.3, April 19, 1998. In Postscript format; archive copy, 980529. A TeX version and substantive bibliography are also available in the FTP directory. The final section on "Previous work" surveys literature used by the authors: "Work on tree automata and tree-regular languages can be divided into two categories, one dealing with ranked and the other with non-ranked alphabets. . ." See the main bibliography entry for other details.
Metastructures 1998 Presentation. Paul Prescod (ISOGEN International) - "Forest Automata: A Formalism for Document Schemas and Validation." Abstract: "The characteristics of the tree-like element containment structure of SGML and XML documents can be constrained by schemas called Document Type Definitions (DTDs); each schema describes a class of documents. The declarations found in DTDs can be viewed as an extended context free grammar (CFG) for a simple class of language. Unlike the syntax of full-blown SGML, the syntax of XML is simple enough to allow XML documents to be parsed without knowing the grammar. Instead, we can look at the grammar as a form of constraint not on the document string, but on the tree that results from that string. We can describe the tree in terms of the forest automata theory, which describes classes of forests (forest languages) instead of strings. The new formalism is a generalization of string automata theory, and it supports concepts such as forest regular languages, forest regular expressions and forest regular grammars. Given union, concatenation, linear tree homomorphisms and other, similar operations, we can build a full query and transformation language."
Bibliographic note (in addition to references in the above papers). Gregg Reynolds wrote: 'There is a bibliography of 144 items in Handbook of Theoretical Computer Science Part B, Elsevier and MIT Press 1996, Chapter 4 "Automata on Infinite Objects" by Wolfgang Thomas. Part Two of the chapter is entitled "Automata on Infinite Trees." Caveat: very heavy sledding. Most of it sails way over my head, but on pages 165-166 there is a very concise account of tree concatenation etc., which I found understandable and useful. You can probably find this in a bookstore if you're in a major city. Two other good references: 1) Text Algorithms by Maxime Crochemore and Wojciech Rytter, Oxford 1994; 2) Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, by Dan Gusfield, Cambridge Univ. Press 1997. Very cool book. Excellent and thorough treatment of suffix trees.'


SEARCH \| ABOUT \| INDEX \| NEWS \| CORE STANDARDS \| TECHNOLOGY REPORTS \| EVENTS \| LIBRARY