The OASIS Cover Pages: The Online Resource for Markup Language Technologies
Last modified: August 29, 2000
SGML/XML DTD Transduction and Generation

[August 29, 2000] A provisional reference list. See the bibliographies in the individual articles.


References:


[CR: 19970523]

Ahonen, Helena. "Automatic Generation of SGML Content Models." Pages 195-206 (with 12 references) in EP '96. Proceedings of the Sixth International Conference on Electronic Publishing, Document Manipulation and Typography. [ = Journal Special Issue: Electronic Publishing - Origination, Dissemination and Design (EPODD), June & September 1995, Volume 8, Issues 2-3.] Sixth International Conference on Electronic Publishing, Document Manipulation and Typography, Palo Alto, California. September 24-26, 1996. Sponsored by Adobe Systems Incorporated; School of Information Management and Systems, University of California at Berkeley; Xerox Corporation. [Proceedings Volume] Edited by Allen Brown, Anne Brüggemann-Klein, and An Feng; [Journal] Editors David F. Brailsford and Richard K. Furuta. Chichester/New York: John Wiley & Sons, 1996. ISSN: 0894-3982. Author's affiliation: Department of Computer Science, P. O. Box 26 (Teollisuuskatu 23), FIN-00014 University of Helsinki, Finland. Phone: +358 0 708 44218; Fax: +358 0 708 44441; Email: helena.ahonen@helsinki.fi. WWW: Helena Ahonen Home Page.

Abstract: "We study the problem of automatic generation of a document type definition (DTD) for a set of Standard Generalized Markup Language (SGML) documents. We present various situations where we have tagged documents but no DTD, and discuss the requirements various applications may have with respect to the generation process. We also present an automatic DTD generation tool that can be adjusted for several tasks necessary in the applications. The method is also demonstrated with some experimental cases."

Keywords: SGML, document type definition, generation, TEKES.

For other conference information, see the main conference entry for EP '96, or the brief history of the conference as sixth in a series since 1986. See the volume main bibliographic entry for a linked list of other EP '96 titles relevant to SGML and structured documents.

The document is available in Postscript format: http://www.cs.helsinki.fi/~hahonen/helena_ep96.ps [mirror copy].



[CR: 19960728]

Ahonen, Helena. Automatic Generation of SGML Content Models. Paper submitted and accepted for presentation at Electronic Publishing '96. Helsinki, Finland: Department of Computer Science, University of Helsinki, 1996. Extent: 10 pages. Author's affiliation: Department of Computer Science, P. O. Box 26 (Teollisuuskatu 23), FIN-00014 University of Helsinki, Finland. Phone: +358 0 708 44218; Fax: +358 0 708 44441; Email: helena.ahonen@helsinki.fi. WWW: Helena Ahonen Home Page.

Abstract: "We study the problem of automatic generation of a document type definition (DTD) for a set of Standard Generalized Markup Language (SGML) documents. We present various situations where we have tagged documents but no DTD, and discuss the requirements various applications may have with respect to the generation process. We also present an automatic DTD generation tool that can be adjusted for several tasks necessary in the applications. The method is also demonstrated with some experimental cases."

The document is available on the Internet: http://www.cs.helsinki.fi/~hahonen/helena_ep96.ps [mirror copy].



[CR: 19951220]

Ahonen, Helena; Nikunen, Erja. "Forming Grammars for Structured Documents: An Application of Grammatical Inference." Pages 153-167 in Grammatical Inference and Applications. Papers Presented During the Second International Colloquium. Second International Colloquium on Grammatical Inference - ICGI-94. Alicante, Spain, September 21-23, 1994. Edited by Rafael C. Carrasco and Jose Oncina. Lecture notes in computer science, number 862. Berlin/New York: Springer-Verlag, 1994. ISBN: 3540584730 (Berlin); 0387584730 (New York). ISSN: 0302-9743. Authors' affiliation: Department of Computer Science, P. O. Box 26 (Teollisuuskatu 23), FIN-00014 University of Helsinki, Finland. Phone: +358 0 708 44218; Fax: +358 0 708 44441; Email: helena.ahonen@helsinki.fi. WWW: Helena Ahonen Home Page.

"Abstract: We consider the problem of generating grammars for classes of structured documents -- dictionaries, encyclopedias, user manuals, and so on -- from examples. The examples consist of structures of individual documents, and they can be collected either by converting typographical tagging of documents prepared for printing into structural tags, or by using document recognition techniques. Our method forms first finite-state automata describing the examples completely. These automata are modified by considering certain context conditions; the modifications correspond to generalizing the underlying language. Finally, the automata are converted into regular expressions, and they are used to construct the grammar. In addition to automata, an alternative representation, characteristic k-grams, is introduced. Some interactive operations are also described that are necessary for generating a grammar for a large and complicated document."

Available on the Internet: http://www.cs.helsinki.fi/~hahonen/ahonen_icgi94.ps [mirror copy, December 1995].
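
The first step named in the abstract above, building finite-state automata that describe the examples completely, can be illustrated with a prefix-tree acceptor. The following is a minimal sketch under stated assumptions, not the authors' implementation: each example is assumed to be the sequence of child element names observed under one parent element, and the function names `build_prefix_tree` and `pta_accepts` are hypothetical. The later steps (generalizing by context conditions, converting to regular expressions) are not shown.

```python
from typing import Dict, Iterable, Sequence, Set, Tuple

Transitions = Dict[Tuple[int, str], int]


def build_prefix_tree(examples: Iterable[Sequence[str]]) -> Tuple[Transitions, Set[int]]:
    """Build an automaton that accepts exactly the example sequences.

    State 0 is the start state. Examples that share a prefix share states.
    """
    transitions: Transitions = {}
    accepting: Set[int] = set()
    next_state = 1
    for seq in examples:
        state = 0
        for symbol in seq:
            key = (state, symbol)
            if key not in transitions:
                # First time this symbol is seen after this prefix: add a state.
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        accepting.add(state)  # the state reached at the end of an example
    return transitions, accepting


def pta_accepts(transitions: Transitions, accepting: Set[int], seq: Sequence[str]) -> bool:
    """Return True if the automaton accepts the sequence."""
    state = 0
    for symbol in seq:
        if (state, symbol) not in transitions:
            return False
        state = transitions[(state, symbol)]
    return state in accepting


if __name__ == "__main__":
    samples = [["title", "para"], ["title", "para", "para"]]
    trans, acc = build_prefix_tree(samples)
    print(pta_accepts(trans, acc, ["title", "para"]))
    print(pta_accepts(trans, acc, ["title", "para", "para", "para"]))
```

Because this automaton accepts only the sequences it was built from, a third `para` is rejected; that is why a generalization step is needed before a useful content model can be read off.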



[CR: 19951220]

Ahonen, Helena; Mannila, H.; Nikunen, Erja. "Generating Grammars for SGML Tagged Texts Lacking DTD." Pages [???-???] in Principles of Documents Processing, PODP '94. Principles of Documents Processing. Darmstadt. April 11-12, 1994. Sponsored by: Fuji Xerox Systems and Communications Lab, GMD-IPSI, Rank Xerox Research Centre, and Xerox Webster Research Center. Edited by Makoto Murata and Herve Gallaire. [pub-location: Darmstadt?]: [publisher: GMD-IPSI?], 1994. Authors' affiliation: [Ahonen, Mannila] Department of Computer Science, P. O. Box 26 (Teollisuuskatu 23), FIN-00014 University of Helsinki, Finland. Phone: +358 0 708 44218; Fax: +358 0 708 44441; Email: helena.ahonen@helsinki.fi. WWW: Helena Ahonen Home Page; [Nikunen] Research Centre for Domestic Languages.

"Abstract: We describe a technique for forming a context free grammar for a document that has some kind of tagging -- structural or typographical -- but no concise description of the structure is available. The technique is based on ideas from machine learning. It forms first a set of finite-state automata describing the document completely. These automata are modified by considering certain context conditions; the modifications correspond to generalizing the underlying languages. Finally, the automata are converted into regular expressions, which are then used to construct the grammar. An alternative representation, characteristic k-grams, is also introduced. Additionally, the paper describes some interactive operations necessary for generating a grammar for a large and complicated document."

Available online: http://www.cs.helsinki.fi/~hahonen/ahonen_podp94.ps [mirror copy, December 1995]. The paper is also to appear in Mathematical and Computer Modelling. See the first author's home page for more up-to-date bibliographic details and other SGML-related research.
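
The abstract above mentions characteristic k-grams as an alternative representation to automata. One common way to use k-grams for generalization is sketched below; it is an assumption that this corresponds to the paper's definition, and the details there may differ. The set of k-grams seen in the examples is collected, and any sequence whose k-grams all belong to that set is accepted. The markers `^` and `$` and the function names are hypothetical.

```python
from typing import Iterable, Sequence, Set, Tuple

START, END = "^", "$"  # hypothetical boundary markers


def kgrams(seq: Sequence[str], k: int = 2) -> Set[Tuple[str, ...]]:
    """Return the set of k-grams of a sequence padded with boundary markers."""
    padded = [START] * (k - 1) + list(seq) + [END]
    return {tuple(padded[i:i + k]) for i in range(len(padded) - k + 1)}


def learn_kgrams(examples: Iterable[Sequence[str]], k: int = 2) -> Set[Tuple[str, ...]]:
    """Collect every k-gram that occurs in at least one example."""
    grams: Set[Tuple[str, ...]] = set()
    for seq in examples:
        grams |= kgrams(seq, k)
    return grams


def kgram_accepts(grams: Set[Tuple[str, ...]], seq: Sequence[str], k: int = 2) -> bool:
    """Accept a sequence if all of its k-grams were seen in the examples."""
    return kgrams(seq, k) <= grams


if __name__ == "__main__":
    samples = [
        ["headword", "sense"],
        ["headword", "sense", "sense", "example"],
    ]
    grams = learn_kgrams(samples)
    print(kgram_accepts(grams, ["headword", "sense", "sense", "sense"]))
    print(kgram_accepts(grams, ["sense", "headword"]))
```

With k = 2 the learned set already permits any number of repeated `sense` elements, even though the examples contain at most two. A larger k generalizes less.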


  • OCLC Fred: Automatic DTD Creation from a URL or Sample Text

  • The OCLC Fred Home Page

  • Keith Shafer: "Creating DTDs via the GB-Engine [Grammar Builder] and Fred", with bibliography ( mirror copy or bibliographic entry).

  • Free Fred, for non-commercial use

  • DTD Generation tools - in "Free XML tools" list.

  • [August 29, 2000] "DTD-Miner: A Tool for Mining DTD from XML Documents." By Chuang-Hue Moh, Ee-Peng Lim, and Ng Wee Keong [Email: awkng@ntu.edu.sg]. Pages 144-151 in Proceedings of the Second International Workshop on Advanced Issues of E-Commerce and Web-Based Information Systems. Second International Workshop on Advanced Issues of E-Commerce and Web-Based Information Systems (WECWIS 2000), Milpitas, CA, June 8-9, 2000.

    "XML documents are semi-structured and the structure of the documents is embedded in the tags. Although XML documents can be accompanied by a document type definition (DTD) that defines the structure of the documents, the presence of a DTD is not mandatory. The difficulty in deriving the DTD for XML documents lies in the fact that DTDs are of a different syntax from XML and that prior knowledge of the structure of the documents is required. In this paper, we introduce DTD-Miner, an automatic structure mining tool for XML documents. Using a Web-based interface, the user is able to submit a set of similarly structured XML documents and the system automatically suggests a DTD. The user is also able to further refine the DTD generated to reduce the complexity by relaxing some of the rules used in the system."

    Note: The authors have provided an online demo for DTD-Miner. From the Web site, 'Automatic Derivation of DTDs for XML Documents': "The DTD-Miner [Version 1.5] is a prototype system for mining DTDs from XML documents. This system was built at the Centre for Advanced Information Systems, School of Applied Science of the Nanyang Technological University, under the supervision of Asst. Prof. (Dr) Lim Ee Peng. For further details pertaining to this project, please refer to the Project Objective and Project Description pages. Also, do not forget the people that made this project possible."

    Objective: "Web documents are semistructured and this encumbers the automatic post-processing of the information that they contain. Semistructured data, however, do contain some form of non-rigid structures, which is often encapsulated in the documents. XML documents, in particular, are semistructured and the structure of the documents is embedded in the tags. Although XML documents can be accompanied by a DTD that defines the structure of the documents, the presence of a DTD is not mandatory. The difficulty in deriving the DTD for XML documents lies in the fact that DTDs are of a different syntax from XML and that prior knowledge of the structure of the documents is required. The DTD-Miner is an automatic structure mining tool for XML documents. Using a Web-based interface, the user will be able to submit a set of similarly structured XML documents and the system will automatically suggest a structure for the set of documents in the form of a DTD. The system further ensures that the set of documents will be in conformance to the DTD generated. The user is also able to further refine the DTD generated to reduce the complexity by relaxing some of the rules used in the system."

  • [August 29, 2000] "Re-engineering Structures from Web Documents." By Chuang-Hue Moh, Ee-Peng Lim, and Wee-Keong Ng (Center for Advanced Information Systems, School of Applied Science, Nanyang Technological University, Nanyang Avenue, Singapore 639798, SINGAPORE). Pages 67-76 in Proceedings of the Fifth ACM Conference on Digital Libraries. ACM Digital Libraries 2000, June 2-7, 2000, San Antonio, Texas.

    "To realise a wide range of applications (including digital libraries) on the Web, a more structured way of accessing the Web is required and such requirement can be facilitated by the use of XML standard. In this paper, we propose a general framework for reverse engineering (or re-engineering) the underlying structures i.e., the DTD from a collection of similarly structured XML documents when they share some common but unknown DTDs. The essential data structures and algorithms for the DTD generation have been developed and experiments on real Web collections have been conducted to demonstrate their feasibility. In addition, we also proposed a method of imposing a constraint on the repetitiveness on the elements in a DTD rule to further simplify the generated DTD without compromising their correctness. . .

    The key objective of this project is to re-engineer the underlying structures of a given set of Web documents. We propose a general framework for Structure Re-engineering from Web documents and produce a DTD for each subset of similarly structured Web documents as a final result. In the project, we do not attempt to solve all the problems pertaining to structure re-engineering. Instead, we propose a general framework for structure re-engineering of Web documents. We introduce a structural representation for a set of Web documents, in particular XML documents or semantically tagged HTML documents that share a common structure but do not come with a DTD. We then develop the algorithms for discovering the DTD from the structural representation. We have also conducted experiments on real-life examples of Web documents to demonstrate the discovery algorithms. In this research, we focus on the textual and tag information within the Web documents. Other objects embedded in the documents such as multimedia data, hyperlinks, entity references and element attributes have not been considered so far but extensions of our algorithms to cater for such objects can be made in the future research. . .

    The automatic creation of DTD in the OCLC's GB-Engine uses an approach that is fairly similar to ours. In the GB-Engine, an internal tree representation is built and converted into a grammar. The grammatical rules are then combined, generalized and reduced to produce a corresponding DTD. We see that the generation of an internal tree representation is similar to the Document Tree data structure that we propose. In their work, reduction rules like 'identical bases', 'off by one' and 'redundant' were used to reduce the complexity of the DTDs generated. Nevertheless, the complexity of generated DTDs cannot be easily controlled by the users. In our proposal, we employ the Longest Common Subsequence (LCS) concept and also a user defined parameter maximum repetition factor to provide a more general and flexible method to reduce the complexity of the DTD generated. In the Lore project, the OEM was proposed to model the structures of semistructured data. The OEM model addresses the need of a more flexible data model for semistructured data like Web documents, as compared to conventional data models like object-oriented models. The main 'drawback' of the OEM model is the missing ordering information about the elements in the schematic description of OEM model, also known as the DataGuides. XML, on the other hand, does require the elements to conform to the ordering defined in the DTD.

    [Conclusions:] In this paper, the concept of re-engineering structures from Web documents has been introduced. Based on a structure re-engineering framework, we have developed some algorithms to construct a Spanning Graph that describes the structures of a set of similarly structured XML documents. We further proposed to generate the DTD for these XML documents using a set of heuristic rules. For demonstration purposes, we have implemented our proposed technique into a prototype system known as DTDMiner. The Web interface for the system can be found at http://www.cais.ntu.edu.sg:8000/chmoh/dtd-miner/. The system allows the user to supply some XML files and generates a DTD for them. It also supports relaxation of the generated DTDs. As part of our future research, we plan to extend the reengineering techniques in the following directions:

    (1) Discovering of attributes and attribute types: The way that we have handled attributes so far is to simply assume that all the attributes are mandatory and of type CDATA. Attributes however, can be of various data types and may not always be required in the XML standard. As a result, we need to explore into more sophisticated ways of handling attributes to produce more accurate DTDs. Note that attributes can prove to be important to the structures of XML documents e.g., the XLink standard utilizes attributes to define the hyperlinks between XML documents.

    (2) Discovering inter-document structures: The framework we have proposed is primarily used to discover the structures within Web documents i.e., intra-document structures. We see that such structures are not the only category of structures that can exist in Web documents. The hyperlinks that exist in almost all Web documents present an inter-document structure (e.g., Web-site structure). Used in conjunction with the DTD discovered, the inter-document structures can provide a useful road-map to user query formulation."

  • Research on grammar transduction at UWaterloo (Rick Kazman, Gaston Gonnet, and others)

  • DTDGenerator - XML DTD Generator. From Michael Kay (ICL). "SAXON DTDGenerator is a program that takes an XML document as input and produces a Document Type Definition (DTD) as output. The aim of the program is to give you a quick start in writing a DTD." [19980505.] Note 2000-01-05: DTDGen is now part of SAXON.

  • DTDGenerator Frontend - A Perl script written by Paul Tchistopolskii, as a front end to Michael Kay's DTDGenerator.
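
Several tools listed above take sample XML as input and emit a DTD. The following is a minimal sketch of that idea using Python's standard `xml.etree.ElementTree`; it is not the algorithm of DTD-Miner, Fred, or DTDGenerator, and the function name `infer_dtd` is hypothetical. It assumes no namespaces, writes a deliberately loose content model `(a | b)*` that ignores child order and repetition limits, and declares every attribute as `CDATA #REQUIRED`, mirroring the simplification the Moh, Lim, and Ng paper reports for its own attribute handling.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict, Iterable, List


def infer_dtd(xml_texts: Iterable[str]) -> str:
    """Infer loose DTD declarations from sample XML strings (illustrative only)."""
    children: Dict[str, List[str]] = defaultdict(list)  # child names, first-seen order
    attrs: Dict[str, List[str]] = defaultdict(list)     # attribute names, first-seen order
    has_text: Dict[str, bool] = defaultdict(bool)       # element contains character data
    order: List[str] = []                               # element names, first-seen order

    for text in xml_texts:
        for elem in ET.fromstring(text).iter():
            if elem.tag not in order:
                order.append(elem.tag)
            if elem.text and elem.text.strip():
                has_text[elem.tag] = True
            for child in elem:
                if child.tag not in children[elem.tag]:
                    children[elem.tag].append(child.tag)
                # Text after a child still belongs to the parent's content.
                if child.tail and child.tail.strip():
                    has_text[elem.tag] = True
            for name in elem.attrib:
                if name not in attrs[elem.tag]:
                    attrs[elem.tag].append(name)

    lines: List[str] = []
    for tag in order:
        kids = children[tag]
        if has_text[tag] and kids:
            model = "(#PCDATA | " + " | ".join(kids) + ")*"  # mixed content
        elif has_text[tag]:
            model = "(#PCDATA)"
        elif kids:
            model = "(" + " | ".join(kids) + ")*"            # any order, any count
        else:
            model = "EMPTY"
        lines.append(f"<!ELEMENT {tag} {model}>")
        for name in attrs[tag]:
            # Simplifying assumption: every attribute is mandatory CDATA.
            lines.append(f"<!ATTLIST {tag} {name} CDATA #REQUIRED>")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = '<entry id="1"><head>cat</head><sense>animal</sense><sense>pet</sense></entry>'
    print(infer_dtd([sample]))
```

The loose `(a | b)*` model always accepts the sample documents but records no ordering. Recovering order and repetition is the part that the grammatical-inference and re-engineering work cited above addresses.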



Document URI: http://xml.coverpages.org/grammarTransduction.html  —  Legal stuff
Robin Cover, Editor: robin@oasis-open.org