Abstract: In this paper, we motivate and describe tools we have built to automatically create reduced structural representations of tagged text. These tools are novel in that they let one use the basic tenants of SGML without creating DTDs by hand.
The Standard Generalized Markup Language (SGML) is a meta-language for writing Document Type Definitions (DTD) [ISO8879]. A DTD describes how a document conforming to it should be marked up: the structural tags that may occur in the document, the ordering of the tags, and a host of other features.
As an electronic publisher, OCLC has several tagged data sources for its reference databases. While this tagged text appears to be SGML, it does not always have or adhere to a DTD. Despite this, OCLC must build data transformations, databases, and interfaces for this tagged text. These problems became particularly apparent during the CORE project.
The CORE project was a joint research effort between Cornell University's Mann Library, Bell Communications Research, the American Chemical Society, Chemical Abstracts Service and OCLC [Entlich95, Garson90, Garson91, Garson92, Weibel92, Weibel94]. The goal of the CORE project was to put 250 journal-years of chemistry journals online, accessible via graphical interfaces. To accomplish this, archival typography files were translated to ASCII tagged text. This tagged text was essentially SGML, except that it had no pre-existing DTD.
Over the life of the CORE project, several DTDs were written. Each of these DTDs represented the understanding of the structure of the tagged text at the time of its writing. However, since these DTDs often did not reflect the actual markup in the tagged text, documents were rejected or mishandled by the CORE applications. While it is not uncommon to use DTDs to validate input documents, one could argue that the DTDs were incorrect in the case of CORE, not the tagged text. That is, the DTDs were expected to reflect the structure of the tagged CORE text, but they did not.
The writing of a correct CORE DTD could have been greatly aided by a tool that could take tagged text and create a DTD for it automatically. A DTD created in this manner need not be as aesthetically pleasing as one created by hand as long as it was readable and correctly reflected the structure of the tagged text. Such a DTD could then be edited by hand to insert additional SGML features and make it more useful.
While end users can easily tag their documents to make information structure apparent, many do not have access to DTDs that exactly match their data collection needs. Traditionally, a DTD is created for a document collection before the collection is built. This assumes that some sort of paper analysis has been performed to first identify important structure. However, sometimes electronic, tagged data exists that has no corresponding DTD. In the remainder of this paper, we describe the system we built to automatically create DTDs from sample tagged text.
The SGML Document Grammar Builder project is an ongoing research effort at OCLC Online Computer Library Center, Inc. studying the manipulation of tagged text [Shafer94a]. This project has resulted in the construction of a C++ engine library, the Grammar Builder Engine (GB-Engine), that can also be used to automatically create reduced structural representations of tagged text (DTDs) [Watson95a], translate tagged text [Shafer95a, Shafer95b], automate database creation, and automate interface design -- all from sample tagged text.
In many ways, the GB-Engine addresses a classic chicken and egg problem in the SGML world: "Which comes first, the tagged document or the DTD?" If you believe the DTD comes first, you have a lot of planning/analysis to do before you begin to create electronic documents. You may even have to hire an SGML consultant to help you analyze your document needs and help you write the appropriate DTDs.
If, on the other hand, you happen to have or create sample tagged documents, you can run these sample tagged documents through the GB-Engine and get a DTD created for you automatically. The idea is not that the automatically created DTD will be perfect, but that you may be able to improve the document analysis and conversion. OCLC receives so much tagged text that it would be impossible to analyze all of it by hand. If you find yourself in a similar situation, you may find the GB-Engine useful.
While the GB-Engine is embedded in a number of systems, Fred is currently the most popular. Fred is an extended Tcl/Tk interpreter. Tcl is a complete string command language with variables, strings, and lists, while Tk is an X-based toolkit [Ouster]. As a result, Fred is a complete interpreter/shell that has access to the GB-Engine objects that can be used to easily build X interfaces. It is important to note that Fred's basic functionality is actually provided by the GB-Engine and is not restricted to Fred. For instance, we have ported the GB-Engine to the PC and dropped the GB-Engine into Perl [Wall] and Scheme [Abelson] as an alternative to the Tcl portion of Fred. Several Fred services, including automatic DTD creation, are freely available via the WWW [Shafer94b].
To understand automatic GB-Engine creation of DTDs, one must at least have a superficial knowledge of SGML and grammar notation. Specifically,
Those familiar with these concepts may want to skip to the next section.
SGML is used to guide the markup of document structure with tags that are clearly distinguishable from document content. Generally speaking, the beginning of a structure in an SGML document is marked by a start tag and the end of the structure is marked by an end tag. A start tag is the character `<', followed by the tag name, followed by a `>' (e.g., <author>). An end tag is a `<', followed by a `/', followed by the tag name, followed by a `>' (e.g., </author>). For example, one might see this text and markup in a document which means that the structure author has the content John Smith:
GB-Engine objects recognize traditional SGML tags in text to build a structural representation of the document. This is done by matching start and end tags. During this process, the GB-Engine is able to determine when a start/end tag match resides inside of another start/end tag match. In this way, the GB-Engine can build a tree-structured directory of the document. For instance, author and title reside inside of article in the following:
<article> <title>A Short Document</title> <author>John Smith</author> ... </article>
The GB-Engine builds a corresponding internal tree representation of the article that looks something like this:
article / | \ / | \ title author ...
We would like to note that during the course of discovering document structure, the GB-Engine can also handle certain missing tags. Without a DTD, however, it is impossible to know how to handle all missing tags. (Recall that we often see tagged text without a corresponding DTD.) Consider the following markup:
<foo>... <name>John Smith <extension>5555</extension>
Should </name> be inserted before <extension>? We would say yes, but this is probably because we see the tag name, think of the semantic idea of a name, read John Smith, and think, "Hey, John Smith is a name". But, where should </foo> go? Before <name>, before <extension>, or after </extension>? Since we have no semantic understanding of foo, it is impossible for us as readers to determine where </foo> should go. Likewise, it is impossible for the GB-Engine objects to determine where all missing tags should go without outside intervention.
Structural components like author and title in the above are commonly referred to as document elements in SGML and the remainder of this paper.
Essentially, SGML tag attributes can be thought of as values that are associated with a given structural start tag. For example, the tag author in the following has the attribute primary with the value of yes.
<author primary=yes>John Smith</author>
This markup shows that John Smith is the primary author. It does not indicate what an application should do with this additional information. For instance, an application might make this name appear first, or extract it for some abstracting service. GB-Engine objects accumulate and use the SGML attribute information that they see in the sample documents.
SGML entities are like textual variables and pointers. Basically, an SGML entity begins with an ampersand `&' and ends with a semi-colon `;'. For instance, a document could contain the entity reference `&food;' which might be replaced by the word `pizza' in the translated text. While the GB-Engine accumulates SGML entity references, it cannot specify what the intended entity values are supposed to be without outside information such as that found in a DTD or translation script.
The structure of a document can be written as a grammar where each element definition is a grammar rule. In this paper, we will write element definitions (grammar rules) as follows:
ELEMENT --> DEFINITION
An element definition may include ANDs, ORs, and parentheses where AND is implied by juxtaposition and OR is signified by a vertical bar `|'. We assume that AND has a higher precedence than OR and thus will often not place parentheses around ANDs. For example, if A's definition is
( B AND C ) OR ( D AND E )
we will write:
A --> B C | D E
An element definition may use three different repetition symbols to signify that the structure in question has zero or one, one or more, or zero or more repetitions: ?, +, and *, respectively. For instance, assume that an article has a title, one or more authors, and a body. The structural definition of article can be written as:
article --> title author+ body
Finally, we will refer to a complex element definition (rule) as being made up of subrules, structural components between conjunctions. For example, `B C' and `D E' are subrules in A's definition above. Furthermore, `B' and `C' are subrules of `B C' and `D' and `E' are subrules in `D E'. The reductions that we present in the Reduction Section below simplify grammars by reducing the number and complexity of the subrules in element definitions.
The automatic DTD creation process is a means whereby a generalized grammar is built from one or more document instances by first creating structural rules as described in the previous section and then combining, generalizing, reducing, and these rules. The resulting grammar is then output using an appropriate DTD syntax. We will explain all of this in more detail in the following.
To create a DTD for a document collection, we first glean the element definitions from the collection by extracting the specific document structures as described in the section on tagged document structure above. Next, we collect all of the rules found for a given element and combine them. That is, we take all of the rules with the same element on the left hand side and combine the corresponding right hand sides as subrules of one rule joined by ORs. For instance, if we find these definitions for A in a collection:
A --> B A --> C A --> D E
we will combine these to produce the single rule (definition):
A --> B | C | D E
Once all of the element definitions have been collected, they need to be generalized and reduced to produce an accurate representation of the collection. That is, while the grammar created by collecting the specific element definitions is correct for the current state of the collection, it may not reflect the desired collection structure. This is especially true when the specific document structures have been extracted from a sample set of the collection instead of the complete collection.
For example, imagine that the body of an article is composed of any number of paragraphs (p's) and that all of the articles in a specific collection have one, two, or four paragraphs:
body --> p body --> p p body --> p p p p
body --> p p p
That is, "Is a body with three paragraphs also valid?" Probably, but since we have not seen an body with three paragraphs, the specific structures seen must be generalized to correctly reflect the underlying (or desired) collection structure. In this case, the generalized rule might be:
body --> p+
Note that in a finite collection, it is impossible to prove that the generalized rule is correct since `+' allows infinitely many p's.
Generalizing a grammar has the side benefit of making the grammar easier to read and understand. For example, imagine that a collection contains hundreds of articles, each with a different number of paragraphs in their body. Without generalization, we would have to list all of these rules. Clearly, this is not desirable.
Similarly, we have found that automatically generated grammars are most useful if they are compact and easy to read. The reductions we present in the Reduction Section below simplify grammars by reducing the complexity of subrules in element definitions using several different approaches.
Many reductions are based on the idea that one subrule subsumes another. For instance, `B*' clearly subsumes (or makes redundant) `B B' in the following:
A --> B B | B*
Accordingly, A's definition should be written as:
A --> B*
In a perfect world, we would be able to determine when a subrule subsumes another and always remove the redundant subrule. However, most are not as easy as the previous example to determine, so we approximate this by the reductions we have implemented.
Another important idea is that an automatically created subrule can subsume multiple actual subrules. For instance, imagine that A's definition is:
A --> B E | B (C D)+ E
In this case, the subrule before the `|' is missing the subrule `(C D)+' contained in the subrule after the `|'. By creating a new subrule with the subrule `(C D)*', we can remove the original subrules because they are subsumed by the new subrule:
A --> B (C D)* E
We would like to note that many SGML validators cannot handle ambiguity. Ambiguity exists in a grammar rule when two subrules have indistinguishable prefixes. While we have seen that the reductions implemented to date remove many cases of ambiguity, they do not try to remove all ambiguity. Thus, depending on the SGML validator that you are using, some automatically generated DTDs may have to be refined by hand. Reference [Exoterica] contains a excellent theoretical description of the ambiguity problem and an algorithmic proposal for solving it.
The grammar format used in this paper is not that used in DTDs. The GB-Engine maintains grammars in an internal format conducive to its many uses, especially reductions, and can write the grammar out using several different formalism. So, when writing the grammar out as a DTD, the appropriate DTD syntax is used. Furthermore, the concrete syntax of SGML will be changed if one of the element names was longer than eight characters. If you do not know what the concrete syntax of SGML is, good! It is our opinion that users should concentrate on the abstract structure of their documents, not the syntax details of writing DTDs.
In this section, we give a brief overview of the reductions available in the GB-Engine. Most interfaces to the GB-Engine provide users with mechanisms to selectively choose which reductions are applied. Currently, the reductions are applied in the order listed here. In the following, we will present reduction examples as:
ELEMENT --> DEF_1 --> DEF_2
signifying that the original definition DEF_1 will be reduced by the reduction in question to DEF_2. Furthermore, letters in the right hand side subrules refer to a single element or a another subrule. That is, the reductions work recursively on subrule trees, not just on single elements.
Z --> A () (B) --> A B Z --> ((A?)+) () ((B|(C))) (B)? --> A* (B|C) B?
Z --> (A B) C --> A B C Z --> A B (C D) F (G H)* --> A B C D F (G H)*
NOTE: This reduction can remove groupings that could be used to recognize patterns:
Z --> (A B) (A B)+ --> A B (A B)+
Whereas if left alone, the Repeating Atoms reduction below could reduce this to (A B)+.
Z --> (A | B) | C --> A | B | C Z --> (A | (B | C) | (D | E)+)* --> (A | B | C | D | E)*
Z --> A A A --> A+ Z --> A A? (B|C|B+)+ (B|C|B+) D --> A+ (B+|C)+ D
Z --> A #PCDATA --> (A | #PCDATA)+ Z --> A B | #PCDATA B --> (A | B | #PCDATA)+
Z --> (A B)* | A? B? --> (A? B?)* Z --> (A B)* | (A? B?) | A B --> (A? B?)*
NOTE: Identical Bases may not be exactly what you want since it can create a MUCH MORE generous language than was intended. For instance, in the following, the original language did not include A A, but the combined rule does:
Z --> (A B? | A* B) --> A* B?
Or how about the following where the resulting rule is any string of A's and B's whereas the original language was "A, B, and any number of repeating ABs":
Z --> (A B)* | (A? B?) --> (A? B?)*
Z --> A+ B | B --> A* B Z --> A B C | A C | A B --> A B? C | A B C?
NOTE: In the second example above, the reduced rule is not A B? C? as this would then allow A to exist by itself. This would not be correct since in the original, A never appeared alone.
Z --> A B | A B* --> A B* Z --> A B | A B* | A B | C | C --> A B* | C
While a detailed analysis of the execution times of these reductions and their theoretical background is beyond the scope of this paper, we would like to note that these reductions execute extremely fast by approximating several states and avoiding unnecessary, multiple subrule traversals.
In this paper, we have motivated the need for the automatic creation of DTDs from sample tagged text and introduced the GB-Engine and Fred. We have used Fred to analyze several tagged sources, fix markup errors, and translate tagged documents. We expect that others will find similar uses for Fred or comparable tools as SGML proliferates.
The best way to get a feel for automatic DTD creation is to experiment with examples of your own text. Accordingly, OCLC makes several Fred-based services freely available via the WWW [Shafer94b]. Currently, these services include free automatic DTD creation, direct grammar reduction, and arbitrary text translation. After only ten months of availability, the WWW-based DTD creation service has been used to generate approximately 3,300 DTDs.
Harold Abelson, Gerald Jay Sussman, and Julie Abelson. Structure and Interpretation of Computer Programs. The MIT Press, 1985.
Richard Entlich, Lorrin Garson, Michael Lesk, Lorraine Normore, Jan Olsen, and Stuart L. Weibel. Making a Digital Library: The Chemistry Online Retrieval Experiment. Communications of the ACM, 38(4):54, 1995.
Content Model Algebra. Report ECM11-1091 from Exoterica Corporation, Ottawa, Ontario, Canada. Copies may be requested from email@example.com. Accessible at <URL:http://www.exoterica.com/products/resources/whitepapers/cma/>, October 1991.
Lorrin Garson, Michael Lesk, Martha J. Lindeman, Jim Lundeen, and Jan Olsen. CORE The Chemical Online Retrieval Experiment. In Annual Review of OCLC Research July 1989-June 1990, pages 39-40, 1990.
Lorrin Garson, Michael Lesk, Jim Lundeen, Jan Olsen, and Stuart L. Weibel. CORE The Chemical Online Retrieval Experiment. In Annual Review of OCLC Research July 1990-June 1991, pages 32-33, 1991.
Lorrin Garson, Michael Lesk, Jim Lundeen, Jan Olsen, and Stuart L. Weibel. CORE The Chemical Online Retrieval Experiment. In Annual Review of OCLC Research July 1991-June 1992, pages 39-43, 1992.
Information Processing -- Text and Office Systems -- Standard Generalized Markup Language (SGML). International Organization for Standardization. Ref. No. ISO 8879:1986, 1986.
John K. Ousterhout. Tcl and the Tk Toolkit. Addison-Wesley Publishing Company, 1994.
Keith Shafer. SGML Grammar Structure. In Annual Review of OCLC Research July 1992-June 1993, pages 39-40, 1994.
Keith Shafer. Fred: The SGML Grammar Builder. Fred's WWW home page. Accessible at <URL:http://www.oclc.org/fred/>, 1994.
Keith Shafer and Roger Thompson. Introduction to Translating Tagged Text via the SGML Document Grammar Builder Engine. Accessible at <URL:http://www.oclc.org/fred/docs/translations/intro.html>, 1995.
Keith Shafer and Roger Thompson. Translating Mathematical Markup for Electronic Documents. To appear in Proceedings of the Fourth International World Wide Web Conference, 1995.
Larry Wall and Randal L. Schwartz. Programming Perl. O'Reilly & Associates, Inc., 1992.
Bradley C. Watson and Keith Shafer. Creating Custom SGML DTDs for Documentation Products. In SIGDOC '95 The 13th Annual International Conference on Systems Documentation, pages 189-196. ACM, 1995.
Stuart L. Weibel. CORE Project Installs Database at Cornell. OCLC Newsletter, pages 12-13, May/June 1992.
Stuart L. Weibel. The CORE Project Technical Shakedown and Preliminary User Studies. In Annual Review of OCLC Research July 1992-June 1993, pages 46-49, 1994.