This page: http://www.pms.informatik.uni-muenchen.de/forschung/datamodeling-markup.html
Francois.Bry@informatik.uni-muenchen.de, Norbert.Eisinger@informatik.uni-muenchen.de, 2000-07-03

Lehr- und Forschungseinheit für Programmier- und Modellierungssprachen,
Institut für Informatik der Ludwig-Maximilians-Universität München


Data Modeling with Markup Languages (DM²L)

by François Bry and Norbert Eisinger

23 March 2000, revised 3 July 2000

Chapters.
1. Research Context and General Goals
2. Methodology: Research Issues and Applications
3. Research Issues
4. Applications
5. Related Application Areas

1. Research Context and General Goals

Contents of Chapter 1.
1.1 Modern Markup Languages are Generic
1.2 Structuring
1.3 Enriching Data with Text -- Ontologies
1.4 Data Modeling with Generic Markup Languages as an Interdisciplinary Area of Research

Modern markup languages, such as SGML (Standard Generalized Markup Language) and XML (eXtensible Markup Language), which were initially conceived for modeling texts, are now receiving increasing attention as formalisms for data and knowledge modeling. XML is currently establishing itself as a successor to HTML (HyperText Markup Language), allowing for a better modeling of texts as well as of other kinds of data. There are several reasons for this evolution.

1.1 Modern Markup Languages are Generic

Modern markup languages such as SGML and XML are generic, i.e., they serve to specify the logical structure of data items rather than their layout, and they impose neither predefined structures nor predefined names for structural elements.

In the following, the term data refers also, but not exclusively, to text data. Thus, a data item may consist of text only (such data items are also known as human-readable documents), of non-text data only (also known as data-oriented documents), or of both (also known as mixed-model documents). In the terminology of generic markup languages, data items are called documents. In the following, the term data item is used in lieu of document to stress that not only (structured) texts are meant, but more generally (structured) data of any kind.

Layout vs. Structure

Widespread specific markup languages such as PostScript or RTF (Rich Text Format), whose conceptual roots go back to the seventies, serve to specify the layout of data items. Here layout is not exclusively meant as the appearance of a data item when printed on paper, but more generally as any kind of presentation of a data item to human perception. Examples of such an extended notion of layout include the formats of data items as they are displayed on a terminal screen, vocalized by a speech synthesizer, rendered in the Braille script on a tactile output device, or presented by any other means on any device.

The family of generic markup languages started in the late eighties with the conception of its first specimen, SGML. The purpose of a generic markup language is to specify the semantic -- or logical -- structure of data items, not their layout. Consider, for example, a data item representing a letter. A generic markup language can be used to mark some part of this data item as the salutation, another part as the complimentary close, and so on. In contrast to that, a specific markup language would be used to assign certain indentations, spacings, font weights, font sizes etc. to these parts of the data item, without making explicit which relationships the parts have to the whole.
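
For illustration, the letter example might be marked up generically as follows (the element names are freely chosen for this sketch and are not prescribed by any standard or DTD):

      <letter>
         <salutation>Dear Ms. Smith,</salutation>
         <body>... the actual message ...</body>
         <complimentary-close>Yours sincerely,</complimentary-close>
         <signature>John Doe</signature>
      </letter>

A specific markup language would instead record only indentations, fonts, and spacings for these parts, leaving their roles in the letter implicit.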

In the following, the term presentation is reserved to refer to the layout of a data item in the extended sense above, while the term representation refers to how semantics is conveyed through structural elements of the underlying data modeling formalism.

The distinction between layout and structure is important, for a layout format is device or system dependent, whereas a semantic structure should not be. It is desirable that the semantic structure of data items be specified independently of any layout. This ensures both the independence of the data from their usage and the independence of the data from the devices used for their presentation.

The first property, data independence from usage, is important because data is rarely used in a single manner only. For example, a list intended to provide tourist information about cities might be "misused" by someone looking up postal codes and telephone area codes in order to fill gaps in an address book, and by someone else extracting the numbers of inhabitants as raw data for some survey on population distribution. Neither of these "misusers" would be interested in what the author considered the primary information: sightseeing, transportation, night life, and so on. In cases where such "misuses" of an information server or database can be foreseen by the designers, they are often modeled relying on so-called database views, or views for short.

The second property, data independence from presentation devices, is important for several reasons. To begin with, different kinds of presentation devices require different layouts. For example, a structurally complex data item is likely not to be displayed using identical layouts on standard size screens and on small screens like those of cellular phones. Also, such devices are likely to become technically obsolete sooner than data. Moreover, a presentation format does not necessarily fully convey data semantics. For instance, it is common practice to rely on printed text layout for conveying semantic structure when using text processing systems or the markup language HTML. This practice often leads to semantic losses, especially when files are transferred from one text processing system to another, because the layout of the one system cannot always be faithfully mapped into that of the other system.

In order to specify layouts for classes of documents specified in a generic markup language, so-called style-sheet languages are used in addition. These languages basically allow the definition of layouts for those structural elements specified with the markup language. Such definitions do not have to be unique, thus ensuring the desired independence of the data from their presentations in various contexts.
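
As a rough sketch of how this works, an XSLT style-sheet (XSLT being one of the style-sheet languages usable with XML) might assign an HTML rendering to the salutation element of a letter such as the one above; the template is indicative only and makes no claim about the layout actually intended by an author:

      <?xml version="1.0"?>
      <xsl:stylesheet version="1.0"
                      xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
         <!-- render a salutation as an emphasised paragraph -->
         <xsl:template match="salutation">
            <p><em><xsl:apply-templates/></em></p>
         </xsl:template>
      </xsl:stylesheet>

A second style-sheet could assign an entirely different layout, e.g., for a mini-screen or a speech synthesizer, without any change to the marked-up letter itself.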

Free Structures with Free Naming -- Structure-Conveying Data

Generic markup languages neither impose any predefined structure, nor predefined names for the structural elements occurring in data items. Structure and names can be freely chosen, hence the denomination of generic markup language.

Thus, using generic markup languages it is possible to faithfully model the structure of data items needed in applications and to name the structural elements of a chosen structure in a way that is natural in the application context. The "naming freedom" provided by generic markup languages is reminiscent of modern high-level programming languages, which allow the programmer to use intuitive names for variables, routines, and other constructs of the programming language.

Using a generic markup language, an address book might thus consist of person elements, which in turn consist of first-name, last-name, geographical-address, email-address, and phone-number elements. A geographical-address element might in turn consist of subelements street, street-number, city, and state. An example of an entry in such an address book is as follows:

      <person>
         <first-name>François</first-name>
         <last-name>Bry</last-name>
         <geographical-address>
            <street>rue de Vincennes</street>
            <street-number>70</street-number>
            <city>Bordeaux</city>
            <state>France</state>
         </geographical-address>
         <email-address>francois.bry@hotmail.com</email-address>
         <phone-number>0033556988223</phone-number>
      </person>
    

The use of application-relevant names for structural elements is at the origin of the expression structure-conveying data. Note that data items structured in such a manner are not necessarily accompanied by a schema (here and in the following, the term schema refers to a DTD (Document Type Definition), an XML schema, or some other specification of the structure underlying the data items considered).

It is worth recalling that the widespread markup language HTML is not a generic markup language in spite of its syntax similarities with SGML and XML. Indeed, although HTML provides for structuring constructs, it is not free from layout constructs. Furthermore, names for structuring constructs cannot be chosen freely in HTML. The fact that interfaces to a number of style-sheet languages have been defined for HTML has no influence on the non-generic character of that markup language.

References

[Bos97]
Jon Bosak: XML, Java, and the future of the Web. Sun Microsystems, 1997. http://metalab.unc.edu/pub/sun-info/standards/xml/why/xmlapps.htm. One of the early articles drawing attention to XML.
[BB99]
Jon Bosak and Tim Bray: XML and the Second-Generation Web. Scientific American, 1999. http://www.sciam.com/1999/0599issue/0599bosak.html.
[Bra98]
Neil Bradley: The XML Companion. Addison Wesley, 1998. ISBN 0-201-342855.
[GP00]
Charles F. Goldfarb and Paul Prescod: The XML Handbook. Prentice Hall PTR, 2000. ISBN 0-13-014714-1.
[vHe94]
Eric van Herwijnen: Practical SGML. Kluwer Academic Publishers, 1994. ISBN 0-7923-9434-8.
[Mac97]
Ingo Macher: XML: Professionelle Alternative zu HTML -- Revolution der Experten. 1997. http://www.heise.de/ix/artikel/1997/06/106/.
[Mic99]
Alain Michard: XML -- Langage et applications. Eyrolles, 1999 (in French). ISBN 2-212-09052-8.
[Har99]
Elliotte Rusty Harold: XML Bible. IDG Books, 1999.
[W3C99a]
W3C: XML in 10 points. 1999. http://www.w3.org/XML/1999/XML-in-10-points.
[W3C99b]
W3C: Extensible Markup Language (XML) Activity. Activity Statement, 1999. http://www.w3.org/XML/Activity.html.
[BE00]
François Bry and Norbert Eisinger: Vorlesung "Markup-Sprachen und semi-strukturierte Daten" (SS 2000). lecture notes, 2000 (in German). http://www.pms.informatik.uni-muenchen.de/lehre/markupsemistrukt/00ss/.

1.2 Structuring

The formalisms used by generic markup languages for specifying the structure of classes of documents, or more generally of data items, allow for the specification of complex structures, for data sharing, and for optional elements.

Technically, the formalisms used by generic markup languages for specifying the structure of classes of data items basically -- i.e., with minor restrictions -- amount to context-free grammars. Context-free grammars are a well-established formalism used especially for specifying and compiling programming languages.

Complex structures are essential, for they are ubiquitous in data modeling -- from records in programming languages to objects in programming languages, artificial intelligence, software engineering, databases, and logics for knowledge representation.

Data sharing is a key feature in data modeling, for without it error-prone redundancies would be unavoidable.

Modeling Exceptions -- Semistructured Data

The formalisms provided by generic markup languages make it possible to specify complex structures. Therefore they are much richer than the data model of relational databases. They are also richer than the data models of current object database systems, because they allow optional elements.

Optional elements are very appealing in databases, for they make it possible to express exceptions, which often occur in practical database applications. Considering again the above-mentioned address book database, it might for example happen that not all entries have a first-name element, even though most do have one. The name semistructured data has been coined to emphasize the possibility of such exceptions in the framework of structure-conveying data. It is under the denomination of semistructured data that most database research on using markup languages for data modeling is currently pursued.
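
A minimal DTD sketch for the address book example of Section 1.1, in which the occurrence indicator "?" declares the first-name element optional (the DTD is illustrative only and not taken from any existing application):

      <!ELEMENT person (first-name?, last-name, geographical-address,
                        email-address, phone-number)>
      <!ELEMENT geographical-address (street, street-number, city, state)>
      <!ELEMENT first-name      (#PCDATA)>
      <!ELEMENT last-name       (#PCDATA)>
      <!ELEMENT street          (#PCDATA)>
      <!ELEMENT street-number   (#PCDATA)>
      <!ELEMENT city            (#PCDATA)>
      <!ELEMENT state           (#PCDATA)>
      <!ELEMENT email-address   (#PCDATA)>
      <!ELEMENT phone-number    (#PCDATA)>

An entry without a first-name element is valid with respect to this DTD, whereas a relational schema would have to resort to null values to express the same exception.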

Pointers and Hypertext

The predominant reason for the success of modern markup languages -- both specific and generic ones -- is the ability to define pointers between data items and to follow them at the click of a button. Thus, markup languages make it very easy to construct so-called hypertext systems, i.e., networks of information-containing nodes interconnected by hyperlinks (a more common term is hypertext links, which might wrongly suggest that nodes must consist of text rather than arbitrary data items; a sloppier term is links, which could be confused with other kinds of pointers). There may be different types of hyperlinks. In the following, the available hyperlink types and their meanings are referred to as the hyperlink model. This term has been consciously chosen in allusion to the well-established and more general notion of hypertext model ([BHK92], [Hy97], [HS94]).

The generic markup language SGML is usually considered to have no hyperlink model, although its attribute types ID and IDREF enable a simple scheme of binary, directed hyperlinks whose source and target must be part of a single data item. The generic markup language XML inherits this mechanism from SGML, but is also accompanied by the XML Linking Language XLink. XML+XLink provides constructs for a wide range of hyperlink types, which largely correspond to the quite sophisticated HyTime model [Hy97]. The constructs of XML+XLink cover, among others, simple binary directed hyperlinks, hyperlinks with more than two participating data items, out-of-line hyperlinks that need not be part of the data items they relate, and attributes specifying traversal behavior.

This variety of hyperlinks is but one aspect of the richness of XML as a data modeling formalism. Some of the hyperlink types are very simple and convenient for expressing data sharing in complex structures, making possible structures corresponding to graphs that are not trees. More complex hyperlink types are primarily intended for browsing facilities, but seem to be usable also for expressing semantic relationships.

References

[BHK92]
P. De Bra, G.J. Houben, Y. Kornatzky: An Extensible Data Model for Hyperdocuments. 4th ACM Conference on Hypertext, Milano, pp 222-231, 1992.
[Hy97]
International Standard ISO/IEC 10744: Information technology -- Hypermedia/Time-based Structuring Language (HyTime). Second edition 1997-08-01.
[HS94]
F. Halasz, M. Schwartz: The Dexter Hypertext Reference Model. Communications of the ACM, 37(2):30-39, 1994.

1.3 Enriching Data with Text -- Ontologies

Enriching standard data -- like the numerical and string data of classical managerial databases -- with more informative texts is an old issue in database research and in applications known as data dictionaries. Data dictionaries are basically agreed vocabularies for an application or a class of applications and often taxonomies, i.e., classifications of terms.

Recently, the issue has gained new attention and has been enhanced with artificial intelligence knowledge modeling techniques leading to so-called ontologies.

An ontology basically provides a vocabulary whose terms are precisely defined by texts such as dictionary entries or encyclopaedia entries. Further, an ontology also defines semantic relationships between terms using formal modeling techniques, in general taken from logic-based specification formalisms such as description logics. Thus, an ontology starts from precisely defined basic concepts, building up complex concepts by relying on relationships that are precisely defined as well. These relationships permit the construction of taxonomies, but also of richer structures.
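
A purely illustrative XML sketch of an ontology entry combining a textual definition with a formally modeled relationship (the element and attribute names are invented here; they are not taken from XOL or any other ontology exchange format):

      <concept name="rodent">
         <definition>A gnawing mammal of an order that includes rats,
            mice, and squirrels.</definition>
         <relationship type="generalization" target="mammal"/>
      </concept>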

With the advent of the Web and data interchange intensive applications such as electronic commerce, ontologies are becoming a key issue. They are used for ensuring software interoperability and data sharing [Ont]. In spite of their practical relevance, data dictionaries and ontologies have until recently not received as much attention from the database research community as database practitioners might have wished.

This discrepancy clearly reflects the fact that, until now, textual data has not been a central interest in database research and database system development, nor has the modeling of complex semantic relationships as in artificial intelligence knowledge modeling approaches such as description logics. This seems to be changing now. Arguably, textual data and the modeling of complex semantic relationships are gaining in importance within database research.

References

[Ont]
http://www.ontology.org/
[Gua98]
Nicola Guarino, ed.: Proceedings of the First International Conference on Formal Ontology in Information Systems (FOIS'98). IOS Press and Ohmsha, 1998. ISBN 90 5199 399 4 (IOS Press) ISBN 4 274 90223 4 C3000 (Ohmsha).
[CJK98]
Pierre-Jean Charrel, Hannu Jaakkola, Hannu Kangassalo, and Eiji Kawaguchi, eds.: Information Modelling and Knowledge Bases IX, IOS Press and Ohmsha, 1998. ISBN 90 5199 396 X (IOS Press) ISBN 4 274 90221 8 C3000 (Ohmsha).
[VHG98]
Virtual Hyperglossary. http://www.vhg.org.uk/index.html. 1998.
[XOL99]
XOL Ontology Exchange Language. http://www.ai.sri.com/~pkarp/xol/. 1999.

1.4 Data Modeling with Generic Markup Languages as an Interdisciplinary Area of Research

Data modeling with generic markup languages is an interdisciplinary area of research at the intersection of four traditionally distinct fields of research: databases, artificial intelligence, information retrieval, and document processing.

This convergence is interesting, because each field brings its own focus, methods, and "philosophy".

From databases, the research area of data modeling with generic markup languages gains an interest in declarative query languages, of which SQL is the best-known example. Declarativeness is a loosely defined notion to be understood here as meaning that the users of such query languages need to be aware neither of the computation strategy, nor of the internal organization of the data in memory, nor -- or as little as possible -- of termination issues and of efficiency. Indeed, queries expressed in such query languages can be automatically optimized. Query optimization guarantees a predictable "average efficiency", which is one of the appreciated features of database systems. Also from databases, the area of data modeling with generic markup languages inherits its interest in data structures ensuring efficient storage, retrieval, and updating of very large data sets. Conversely, database research itself is enriched by the attention to text data, to data accessible from the Web, and to richer data models allowing for exceptions.

From artificial intelligence, the research area gains data and knowledge modeling methods that go far beyond the relational or object database models. Artificial intelligence approaches to knowledge representation have always been driven by natural language applications, where extremely rich and complex semantics are encountered. Natural language applications have inspired most knowledge representation research and are at the origin of knowledge representation approaches like semantic networks and the nowadays ubiquitous object orientation. The need for software interoperability and data interchange in Web-based applications such as electronic commerce, health care management, and computational biology (also called bioinformatics) has led researchers as well as practitioners to express advanced artificial intelligence knowledge representation formalisms such as description logics by relying upon the generic markup language XML.

From information retrieval, the research area can learn how to automatically "grasp" knowledge from the content of -- in general large -- texts. The field of information retrieval itself gains from the consideration of structured texts, which until recently have almost never been considered by the information retrieval community.

The contributions of the document processing field to the research area of data modeling with generic markup languages are the generic markup languages themselves and hyperlink models. Document processing itself might benefit from the interdisciplinary approach to data modeling with methods for declarative query answering and for efficient storage.

The focus of the research activities described here lies on data modeling as it has emerged from combining techniques and ideas from both databases and artificial intelligence. As a consequence, issues more specific to one of the fields of databases, information retrieval, and document processing are likely to receive less attention within the research activities described below.

References

[ABS00]
Serge Abiteboul, Peter Buneman, and Dan Suciu: Data on the Web -- From Relations to Semistructured Data and XML. Morgan Kaufmann, 2000. ISBN 1-55860-622-X.
[Bou00]
Ronald Bourret: XML and Databases. http://www.informatik.tu-darmstadt.de/DVS1/staff/bourret/xml/XMLAndDatabases.htm. 1999, 2000.
[Bun97]
Peter Buneman: Semistructured Data. Proceedings of the Symposium on Principles of Database Systems (PODS), 117-121, 1997.
[CCD99]
Stefano Ceri, Sara Comai, Ernesto Damiani, Piero Fraternali, Stefano Paraboschi, and Letizia Tanca: XML-GL: a Graphical Language for Querying and Restructuring XML Documents, Proceedings of the Eighth International World Wide Web Conference, 1999. http://www8.org/w8-papers/1c-xml/xml-gl/xml-gl.html.
[FLM98]
Daniela Florescu, Alon Levy, and Alberto Mendelzon: Database Techniques for the World-Wide Web: A Survey. SIGMOD Record, 27(3):59-74, 1998.
[Suc99]
Dan Suciu: From Semistructured Data to HTML. Tutorial at the 25th International Conference on Very Large Data Bases (VLDB), 1999. http://www.research.att.com/~suciu/vldb99-tutorial.pdf.

2. Methodology: Research Issues and Applications

A research issue is a problem of interest investigated independently of its actual or potential applications.

An application aims at the development of a prototype software system fulfilling a practical need. This need might be an actual need or a perceived potential need.

Investigations into research issues are usually method driven rather than application driven. Although these two kinds of approaches are different, they complement each other nicely. The "glue" binding research issues and applications consists in their relationships and mutual relevance. On the one hand, the considered research issues are relevant to application projects, because they address problems encountered within these projects. On the other hand, application projects supply concrete examples for the research issues. Thus, research issues and application projects are united through problems, methods, and hopefully through the solutions looked for, too.

The remaining chapters of this text outline both research issues and applications in the research area introduced by the previous chapter: data modeling with markup languages. This outline is not intended to define a working plan for a short-term project, but to provide a common framework guiding a number of research activities, including students' projects, diploma and PhD theses. As time goes by, the focus is likely to move to new research issues and new applications, causing the framework to evolve. The following chapters describe in detail the research issues and applications forming the current framework.


3. Research Issues

Contents of Chapter 3.
3.1 Hyperlink Model vs. Browser Model -- Hyperlinks for Semantic Modeling
3.2 Evolving Subject Indexes
3.3 Time-Dependent Data: Modeling and Presentation
3.4 Query Languages
3.5 Views
3.6 Modeling of Biological Data

The table of contents above lists research issues to be investigated under the current framework. As explained in Chapter 2. Methodology: Research Issues and Applications, this framework is flexible, and the list does not preclude further issues that might emerge in the course of the research.

3.1 Hyperlink Model vs. Browser Model -- Hyperlinks for Semantic Modeling

Hyperlink Model vs. Browser Model: Semantics vs. Navigation

It is proposed here to distinguish between hyperlink model and browser model. These notions are to be understood as follows: a hyperlink model specifies the available hyperlink types and their meanings, i.e., which semantic relationships between data items they express; a browser model specifies how hyperlinks of each type are treated by a browser, i.e., what happens when such a hyperlink is actuated.

The proposed distinction is analogous to the distinction between a generic markup language, whose purpose is to model the logical structure of data items, and a style-sheet language, whose purpose is to specify the layout for a given logical structure.

It is true that currently, hyperlink models and browser models are usually not distinguished. For example, the HyTime model [Hy97], which inspired XML+XLink, defines hyperlinks primarily in terms of the behavior of standard browsers, i.e., browsers similar to the Netscape Navigator or the Internet Explorer, to be used with standard size screens. The distinction is, however, discussed in the literature ([BHK92], [BHK94], [HS94]). The research issue described here is based upon the hypothesis that this distinction is likely to become widely accepted.

As an illustration of the distinction consider the "behavior attribute" called show defined by XML+XLink. For simplicity assume a binary, directed hyperlink between a source data item and a target data item. The attribute may have the value show="undefined" or one of the following three values: show="replace" means that the presentation of the source data item disappears from the window from which the hyperlink was actuated and instead a presentation of the target data item appears in this window; show="new" means that the presentation of the source data item remains unchanged and a presentation of the target data item appears in a new window; show="embed" means that a presentation of the target data item is integrated into the presentation of the source data item at the position where the hyperlink was actuated (in the case of data items consisting of text, this behavior is also known as stretch text).

A hyperlink model might define the following, semantically characterized, hyperlink types: "hyperlink to new information", "hyperlink to alternative descriptions of the current information", "hyperlink to supplementary information". A browser model could then specify that these three types of hyperlinks behave in the three ways described above. Different browser models may specify the behavior differently.
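
A rough sketch of how such a semantically typed hyperlink might be written (the element name and the role value are invented; the xlink attributes follow the XLink working draft current at the time of writing, which may still change):

      <background-reference xmlns:xlink="http://www.w3.org/1999/xlink"
                            xlink:type="simple"
                            xlink:href="older-article.xml"
                            xlink:role="supplementary-information">
         details on the earlier negotiations
      </background-reference>

A browser model for standard screens might treat this role like show="new", i.e., open the target in a new window, while a browser model for mini-screens might treat it like show="embed"; the hyperlink itself remains unchanged.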

It is even interesting to investigate different browser models for the same hyperlink model. There are two reasons. First, in the future many data items are likely to be browsed using not only standard size screens, but also mini-screens like those of cellular phones or new "paper-like" electronic output devices like "electronic paper (e-paper)", "E Ink" or "Organic LED (OLED)" (cf. 5.4 Electronic Books and Newspapers), with which browsing will most likely take new forms. Second, if hyperlinks are to be used for expressing semantic dependencies, which is taken here as a working assumption, then necessarily browsing along such hyperlinks will not be uniquely definable. Thus, the distinction between hyperlink model and browser model proposed here contributes to both independence of data modeling from data usage and independence of data modeling from presentation devices, which, as pointed out in Section 1.1 Modern Markup Languages are Generic, are key issues in data modeling.

As discussed below, this issue is relevant to several application areas, e.g., e-commerce, electronic news magazines, intelligent tutoring systems, or mobile computing.

Semantic Modeling supported by Hyperlinks

A wide class of modeling problems investigated in the areas of artificial intelligence and knowledge representation boils down to defining so-called ontologies. An ontology is a set of concepts with a set of relationships between concepts.

The concepts represent the basic terms of an application domain, and the meaning of these basic terms must be precisely specified.

The relationships between concepts are domain-independent. A very common example is the generalization relationship: it relates a more specific concept with a more general concept, such as rabbit with rodent, rodent with mammal, mammal with animal, car with vehicle, house with building. Generalization hierarchies, so-called taxonomies, can be found in virtually every application domain. Somewhat less obvious is the fact that there are many other domain-independent relationships, some of them hierarchy-forming. Examples include part-whole relationships such as component-object (finger/hand, hand/arm), member-collection (person/family, family/clan), and subarea-superarea (suburb/city, city/region), as well as agent-result relationships (programmer/program, program/output), order relationships, and similarity relationships. A set of such relationships can represent much of the semantics of a specific domain in a completely domain-independent way.

It is suggestive to map ontologies to hypertext systems, with data items representing concepts and hyperlinks representing relationships between concepts. In this way the hyperlinks are no longer considered as mere contributors to the navigation infrastructure for a browser, but as constructs of a formalism for modeling the semantics of data.

The hyperlinks supported by XML+XLink are quite promising for that purpose, provided that hyperlinks are viewed as semantics-conveying rather than navigation-supporting constructs. Different relationships may be distinguished by metadata that can be associated with a hyperlink, or by exploiting the fact that hyperlinks can themselves be implemented as just another kind of data item. This also allows the attachment of additional constraints such as the number restrictions and value restrictions of so-called description logics. Hyperlinks that have more than two participating data items support non-binary semantic relationships such as agent-tool-result. Out-of-line hyperlinks can model newly defined relationships between data items without changing these data items.
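
A purely illustrative sketch of an out-of-line hyperlink relating three data items through an agent-tool-result relationship (the element names and role values are invented; the xlink attributes follow the XLink working draft, which was not yet final at the time of writing):

      <agent-tool-result xmlns:xlink="http://www.w3.org/1999/xlink"
                         xlink:type="extended">
         <participant xlink:type="locator" xlink:role="agent"
                      xlink:href="persons.xml#programmer-17"/>
         <participant xlink:type="locator" xlink:role="tool"
                      xlink:href="tools.xml#compiler-3"/>
         <participant xlink:type="locator" xlink:role="result"
                      xlink:href="programs.xml#program-42"/>
      </agent-tool-result>

Since this link is stored outside the three data items it relates, the relationship can be asserted without modifying any of them.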

Thus, the research issue is to investigate how the different hyperlink types can be exploited to represent sophisticated semantic relationships. A good example of this use of hyperlinks is described in [MP98].

References

[BHK92]
P. De Bra, G.J. Houben, Y. Kornatzky: An Extensible Data Model for Hyperdocuments. 4th ACM Conference on Hypertext, Milano, pp 222-231, 1992.
[BHK94]
P. De Bra, G.J. Houben, Y. Kornatzky: A Formal Approach to Analyzing the Browsing Semantics of Hypertext. Proc. CSN-94, Utrecht, 1994.
[Hy97]
International Standard ISO/IEC 10744: Information technology -- Hypermedia/Time-based Structuring Language (HyTime). Second edition 1997-08-01.
[HS94]
F. Halasz, M. Schwartz: The Dexter Hypertext Reference Model. Communications of the ACM, 37(2):30-39, 1994.
[MP98]
A. Michard, G. Pham-Dac: Description of Collections and Encyclopaedias on the Web using XML. Archives and Museum Informatics, 12, 39-79, Kluwer Academic Publishers, 1998.
[XML-Data]
XML-Data, W3C Note 05 Jan 1998.
[XML-Schema-0]
XML Schema Part 0: Primer.
[XML-Schema-1]
XML Schema Part 1: Structures.
[XML-Schema-2]
XML Schema Part 2: Datatypes.
[DDML/XSchema]
DDML/XSchema: historical background.

3.2 Evolving Subject Indexes

Evolution was first identified in biology, but it is one of the fundamental principles pervading reality.

Biology studies a continuously replenished pool of organisms, which is classified into species according to certain characteristics some organisms have in common. These characteristics may change over time, requiring the definition of what constitutes a given species to be time-dependent and making possible the emergence of subgroups of organisms of the same species, which may start out as races or subspecies and eventually split off to become species of their own.

Art history studies a continuously replenished pool of artefacts, which is classified into styles according to certain characteristics some artefacts have in common. These characteristics may change over time, requiring the definition of what constitutes a given style to be time-dependent and making possible the emergence of subgroups of artefacts of the same style, which may start out as stylistic variants and eventually split off to become styles of their own.

The same kind of description applies throughout: in history, a pool of cultural groups is classified into evolving civilizations; in cosmology, a pool of material particles is classified into evolving dust clouds, galaxies, or stars; in a department store, a pool of commodities is classified into evolving departments; in society, a pool of lifestyle patterns is classified into evolving fashions; in science, a pool of ideas is classified into evolving scientific disciplines; the list could go on indefinitely.

At this level of generality the issue is hard to grasp from the point of view of knowledge modeling. In most cases it is not even clear which features adequately describe the members of a "pool" and how such features contribute to the classification inside the pool. So let us specialise the issue to cases where there is a pool of semistructured data items that model members of some "pool" in reality, and some parts of these data items correspond to classification features. For example, such a data item might be a research article with classification features including an author and a date and a list of keywords and a list of cited articles. It does not matter whether the keywords are provided by the author or automatically extracted from the article using information retrieval methods.
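
A minimal sketch of such a data item (all element names are invented for this example):

      <article id="a-2000-17">
         <author>N. N.</author>
         <date>2000-05-12</date>
         <keyword>semistructured data</keyword>
         <keyword>markup languages</keyword>
         <cites ref="a-1998-03"/>
         <cites ref="a-1999-41"/>
         <body>...</body>
      </article>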

Given the classification features, it is possible to establish relationships between data items. In the example domain of research articles, straightforward relationships would be "written-by-the-same-author", "written-before", "cited-by", somewhat less straightforward relationships "similar-subject", "taken-up-and-advanced-by".

Appropriate numeric measures of the density of such relationships then allow the identification of data items that are "condensation kernels" for classes, and measures of the distance from those condensation kernels might define the boundaries of the classes. Taking into account the relationships along the temporal dimension, one can distinguish condensation kernels at different times and identify metamorphoses of classes.

This sketched approach differs from ontologies in several important ways: the condensation kernels and classes cannot be predefined, and the membership of data items in classes is fuzzy and time-dependent. The research issue would be to integrate numerical and fuzzy notions into the framework of semistructured data. Foreseeable application areas of "evolving subject indexes" reflecting the changes of a domain are, among others, electronic news magazines and electronic commerce.

References

[CRSS98]
S. Conrad, J. Ramos, G. Saake, C. Sernadas: Evolving Logical Specification in Information Systems. Chapter 7 in Logics for Databases and Information Systems, edited by J. Chomicki and G. Saake, Kluwer Academic Publishers, 1998.

3.3 Time-Dependent Data: Modeling and Presentation

For many applications, time is a crucial aspect of data. This is especially so in the context of the ever-changing Web.

For instance, electronic commerce applications rely on price lists and/or resource lists. Often, not only the latest versions of such lists have to be accessed, but also former ones. In such cases, it would be interesting to provide "views" of prices or resources at given -- past or even future -- times. Another example is document-based workflow. Managing the queue of patients waiting for an X-ray examination requires the processing of time constraints imposed by, say, the surgery ward and the emergency ward. Planning a round trip involves the coordination of personal temporal requirements with flight schedules, future availability of seats on flights, future availability of convenient accommodation, and so on.

The semantics to be modeled can be quite complex and conceptually rich. For example, a view of the future is not time invariant, but depends on the point in time at which it is computed. The question whether a seat is available on some flight may have different answers at different times before the flight and would be meaningless after the flight.

To date, much of the time-dependent semantics of such applications is only partially modeled in information servers. The modeling of time-dependencies in an XML setting would be especially interesting, because XML is expected to become the standard language for the modeling of Web-based information servers, many of which manage short-lived data and/or data with very different life spans. Various approaches to dealing with time have been studied in the contexts of artificial intelligence and database systems. It appears promising to investigate how such approaches can be adapted to the context of a generic markup language like XML.
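
A rough sketch of how a price list might carry explicit validity periods, so that "views" of prices at past or future times can be computed (the attribute names are invented for this example and do not follow any existing temporal XML proposal):

      <price-list>
         <price article="widget-a" currency="EUR"
                valid-from="2000-01-01" valid-to="2000-06-30">12.50</price>
         <price article="widget-a" currency="EUR"
                valid-from="2000-07-01" valid-to="2000-12-31">11.80</price>
      </price-list>

A query for the price of widget-a on a given date would then select the element whose validity interval contains that date.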

Time affects not only data semantics, but also data presentation. For example, it might be desirable to enable a browser to present several projections referring to different future times in a manner appropriate for comparisons. Current browsers do not support this, and time-dependent applications find almost no dedicated support for the presentation of their data.

In order to identify necessary and useful functionalities for the presentation of time-dependent data, adaptive browsers developed for Intelligent Tutoring Systems provide an interesting starting point. Such browsers adapt the presentation of learning material according to what the learner has or has not learned at the time of reading a unit of the learning material. This notion of time might not fully coincide with that relevant to applications of the kind described above, but it seems to be similar enough to merit a closer study.

References

[CT98]
J. Chomicki, D. Toman: Temporal Logic in Information Systems. Chapter 3 in Logics for Databases and Information Systems, edited by J. Chomicki and G. Saake, Kluwer Academic Publishers, 1998.

3.4 Query Languages

*** Section in preparation ***

References

[ABS00]
Serge Abiteboul, Peter Buneman, and Dan Suciu: Data on the Web -- From Relations to Semistructured Data and XML. Morgan Kaufmann, 2000. ISBN 1-55860-622-X.
[FFL97]
M. Fernandez, D. Florescu, A. Levy, and D. Suciu: A Query Language for a Web-Site Management System. SIGMOD Record, 26(3):4-11, 1997.
[LSS96]
L. Lakshmanan, F. Sadri and I. Subramanian: A Declarative Language for Querying and Restructuring the Web, Proceedings 6th International Workshop on Research Issues in Data Engineering (RIDE'96), 1996.
[Str99]
The Strudel System. http://www.w3.org/Style/. 1999.
[WebDB]
Proceedings of the International Workshops on the Web and Databases (WebDB). 2000: http://www.research.att.com/conf/webdb2000/, 1999: http://www-rocq.inria.fr/~cluet/webdb99.html, 1998: http://www.dia.uniroma3.it/webdb98/, cf. also http://www.informatik.uni-trier.de/~ley/db/conf/webdb/index.html.
[XWG]
XML Query Working Group, http://www.w3.org/XML/Activity.html#query-wg.
[XQR]
XML Query Requirements. W3C Working Draft. http://www.w3.org/TR/xmlquery-req. 2000.
[QX98]
Queries for XML: Position Papers. http://www.w3.org/Tands/QL/QL98/pp.html. 1998.

3.5 Views

Views are an essential feature of databases, for they ensure the conceptual independence needed in most applications between the primary data stored in the database and various interpretations, such as partial look-ups, of the stored data. Most applications based on databases rely on views.

Views also make sense in texts, especially in texts with complex structures and/or contents. Technical manuals or textbooks are often read at different "semantical levels". This is for example the case with a maintenance manual, which can be read both for estimating the time needed to perform a certain operation and for learning how to perform the operation. It is also the case with a textbook, which can be read by a teacher preparing a lecture, or by a student studying for an examination. The two readers will probably skip different parts. The teacher might, for example, focus on the arguments of the introductory sections and skip proofs or explanations he is already familiar with, or restrict his attention to overviews of these proofs and explanations, if they are available. The student, in contrast, is likely to pay less attention to the introductory parts, but to need very detailed expositions of proofs and explanations.
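
A minimal sketch of how such views might be supported at the markup level, by tagging text fragments with the audiences they address (the element and attribute names are invented for this example):

      <section title="Soundness of the calculus">
         <overview audience="teacher student">...</overview>
         <proof audience="student" detail="full">...</proof>
         <proof audience="teacher" detail="sketch">...</proof>
      </section>

A "teacher view" and a "student view" could then be derived from the same source, avoiding the redundancy problems discussed next.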

Having different texts for different purposes would present several drawbacks. First, this would induce redundancies. Second, because of these redundancies, the approach would be error prone. Third, the consistency between the different texts giving complementary "views" of the same content would be difficult to maintain, which is yet another possible source of errors. For these reasons, it appears desirable to model a notion of view while specifying the semantics of the considered documents.

If data items combine both conventional data, as in traditional databases, and texts, as is the case, e.g., in biological databases (cf. 3.6 Modeling of Biological Data), a notion of view accommodating data of both kinds is desirable.

As far as data modeling with markup languages is concerned, the investigation of views leads to the following interesting research questions:

3.6 Modeling of Biological Data

Modern biology, in particular genomic research, is data and computation intensive. In biology in general and in genomic research in particular, it is nowadays common practice to build databases of biological data. Most biological databases are -- freely or not -- accessible through the Web.

From the viewpoint of data modeling, especially of data modeling with markup languages, biological data and biological databases are interesting for several reasons:

Note also that most biological databases are -- at least potentially -- very large. For this reason, biological databases are also an interesting application from the viewpoint of (conventional) database system research and development.

Unfortunately, biological data modeling is rather difficult for most computer scientists to grasp. These databases and the research issues they raise are not widely known outside computational biology. This is unfortunate because it prevents a fruitful cross-fertilization between application driven computer science research, as mostly practiced by biologists and computational biologists, and method driven computer science research, as practiced by computer science generalists.

An interesting research issue is to investigate, from the "General Computer Science" viewpoint, what the essential aspects of biological data modeling are. In this respect, approaches based on description logics seem especially promising. A further interesting research issue is to investigate whether specific querying methods are needed for biological databases.

References

[HLS99]
Ralf Hofestädt, Matthias Lange, and Uwe Scholz: Molecular Information Fusion for Metabolic Networks 1. Lecture Notes, Bioinformatics / Medical Informatics, Institute of Technical and Business Information Systems Otto-von-Guericke-University Magdeburg, 1999. http://wwwiti.cs.uni-magdeburg.de/iti_bm/ibss/courses/HofLanSch/.
[SPA97]
Fernando L. Silva, José C. Príncipe, and Luís B. Almeida, eds.: Spatiotemporal Models in Biological and Artificial Systems. IOS Press and Ohmsha, 1997. ISBN 90 5199 304 8 (IOS Press), ISBN 4 274 90133 5 C3000 (Ohmsha).
[Mic99]
Gerhard Michal, ed.: Biochemical Pathways -- Biochemie-Atlas, Spektrum Akademischer Verlag, 1999 (in German). ISBN 3-86025-239-9.
[NARJ]
Nucleic Acids Research Journal, Issue on biology databases/databanks. http://www3.oup.co.uk/nar/Volume_28/Issue_01/.
[CML]
XML/CML (CML = Chemical Markup Language). http://www.nottingham.ac.uk/~pazpmr/README.
[BSML]
BSML (Bioinformatic Sequence Markup Language). Soon at: http://visualgenomics.com/.
[Prot97]
Principles of Protein Structure Using the Internet. http://www.cryst.bbk.ac.uk/PPS2/top.html. 1997.
[K2]
K2 and Information Integration. http://www.cbil.upenn.edu/K2*/k2web.

4. Applications

Contents of Chapter 4.
4.1 Modeling of Complex Technical Texts
4.2 Modeling of Electronic News Magazines

The table of contents above lists applications to be investigated under the current framework. As explained in Chapter 2. Methodology: Research Issues and Applications, application projects supply concrete examples for research issues as described in the previous chapter, and the list of application projects is likely to grow.

4.1 Modeling of Complex Technical Texts

In this project, texts with a certain complexity are considered. The complexity meant here is not that due to the sheer size of the texts, but that of their semantical content and/or structure. Typical examples of such complex texts are software installation handbooks, lecture notes (e.g., for a university course at the graduate level), maintenance manuals, and legal or administrative regulations.

Such texts are sufficiently complex to suggest or even to require the support of textual views based on a suitable document model, of browser presentation modes, and of individual reader's views, as discussed in the following subsections.

Textual Views and Document Model

Textual views may depend on many factors, two of which are:

There are many more examples, including views at the meta level, which provide the reader or the author with views on available views.

Textual views are special cases of what is described in 3.5 Views. Common approaches to support textual views, especially in the field of Intelligent Tutoring Systems, are based on hypertext models, raising the research issues described in 3.1 Hyperlink Model vs. Browser Model -- Hyperlinks for Semantic Modeling.

In order to develop a document model for complex technical texts, such texts will be analysed with the aim of identifying typical classes of text fragments, typical rôles they play in the whole text, and typical relationships between text fragments. A further issue is how such classes, rôles, and relationships can be modeled with modern markup languages such as XML.

The support of textual views suggests other aspects, such as: domain models, reader models, didactic models, to name a few. These are not currently the focus of the project described here.

Browser Model with Presentation Modes

Presentation modes aim at influencing the treatment of hyperlinks. Consider, for example, a "hyperlink to alternative descriptions of the current information" (cf. 3.1 Hyperlink Model vs. Browser Model -- Hyperlinks for Semantic Modeling). In a "standard mode" the browser treats a hyperlink of this kind such that the source text remains unchanged and the target text appears in a new window. In a "paper mode", where the window concept does not make much sense, the hyperlink behaves like stretch text with appropriate delimiters for the beginning and end of the insertion. In a "mini-screen mode" the number of windows might be restricted to two, say, a main window and an auxiliary window, and the treatment of the hyperlink depends on whether it is actuated from the main window or from the auxiliary window and on whether the target text is already displayed in one of the windows.

Presentation modes generalise a functionality available, for instance, with the Netscape Navigator by clicking on a hyperlink with the left mouse button or the middle mouse button. It is not unreasonable to assume that, in the long run, browsers will not have built-in presentation modes, but will interpret specifications of presentation modes formulated in some declarative language, on a par with style-sheets formulated in some style-sheet language. The kind of independence assumed here is similar to that ensured by generic markup languages by distinguishing between structure and layout (cf. 3.1 Hyperlink Model vs. Browser Model -- Hyperlinks for Semantic Modeling).
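
A purely hypothetical sketch of what such a declarative specification of presentation modes might look like (all names are invented here; no such language exists yet):

      <presentation-mode name="mini-screen">
         <windows max="2"/>
         <hyperlink-type name="alternative-description"
                         treat-as="embed"/>
         <hyperlink-type name="supplementary-information"
                         treat-as="auxiliary-window"/>
      </presentation-mode>

A browser supporting such specifications would interpret them in the same way as it interprets style-sheets, leaving the hyperlink model of the documents unchanged.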

Individual Reader's View

While reading a text, a reader may wish to dynamically construct an individual view of this text. Such a view is inherently different from views that are provided by the author at the time of writing the text. First, it has to be dynamically built up. Second, structural consistency aspects (cf. 3.5 Views) cannot be resolved before a reader's view has been built up.

The reader may compose an individual view by selecting from the elements and presentation parameters provided by the document. Thus, this view presents textual elements selected by the reader in a form determined by the reader.

A reader's view can be used like a slightly generalized back-button, to reproduce what the reader read in the past, but taking into account intermediate changes of presentation parameters. Unlike a back-button, however, the new functionality allows the reader to explore a text tentatively and to decide whether and when text fragments are incorporated into the reader's view.

Moreover, the reader may incorporate his or her own pieces of text into the reader's view, for example personal notes on the given text. These pieces of text may contain hyperlinks, either to fragments of the given text or to text added by the reader or to other targets. (This becomes possible by so-called external hyperlinks that need not be part of either source texts or target texts.)

This functionality allows the reader to augment the given structure by personal extensions and even to deviate completely from the given structure. As an extremely special case, the individual reader's view covers the bookmark functionality provided by current browsers.

The functionality described here is clearly application driven, but it will probably raise research questions beyond those described in 3.5 Views.

4.2 Modeling of Electronic News Magazines

Most newspapers and news magazines maintain pages on the World Wide Web. These pages contain current news articles, which are usually supplemented with a list of hyperlinks to earlier articles on related subjects. Recent developments suggest that the provision of such hyperlinks will be further enhanced into services offering assorted subject-centered background information to subscribers.

Such services require support for various temporal aspects of their information (cf. 3.3 Time-Dependent Data: Modeling and Presentation). An article may refer to an older article quoting a statement by a politician about an event that appeared insignificant at the time of the statement, but crucial at the time when the later article was published. Thus, there seem to be (at least) two different notions of time inherent in this kind of information: the time when an event happens and the time when a report about an event is published (which may be an event in its own right, as in the example above). It might be necessary to rely upon both notions of time in modeling electronic news magazines.
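
A rough sketch of how both notions of time might be made explicit in the markup of a news article (the element and attribute names are invented for this example):

      <article publication-date="2000-06-15">
         <reports-on>
            <event date="2000-06-14">summit meeting</event>
         </reports-on>
         <cites article="a-1998-220" publication-date="1998-03-02"/>
      </article>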

Another characteristic of electronic news magazines is that the subjects that may become news and their possible logical structure cannot be anticipated. They cannot be based on predefined static subject indexes, but require evolving dynamic subject indexes (cf. 3.2 Evolving Subject Indexes).


5. Related Application Areas

Contents of Chapter 5.
5.1 Intelligent Tutoring Systems
5.2 E-Commerce
5.3 Workflow Management
5.4 Electronic Books and Newspapers

5.1 Intelligent Tutoring Systems

*** Section in preparation ***

References

[BEG99]
François Bry, Norbert Eisinger, and Tim Geisler: Hauptseminar Elektronische Medien in der Lehre: Modellierungsaspekte http://www.pms.informatik.uni-muenchen.de/lehre/seminar/emedien/99ws00/#Vortraege-Literatur. 1999.
Contains an overview (in German) and a list of references.

5.2 E-Commerce

*** Section in preparation ***

References

E-Commerce initiatives relying upon XML:
BizTalk, http://www.biztalk.org.
CommerceNet, http://www.commerce.net.
EDI, http://www.geocities.com/WallStreet/Floor/5815/.

5.3 Workflow Management

*** Section in preparation ***

References

*** References in preparation ***

5.4 Electronic Books and Newspapers

A novel kind of computer terminal is likely to become available soon, opening new perspectives for electronic newspapers, news magazines, and books. These terminals will rely upon three similar devices called "epaper", "E Ink", and "OLED", currently being developed by Xerox, IBM, and Siemens, respectively. These devices will make "computer terminals" possible whose appearance, texture, consistency, size, and weight are similar to those of paper.

"epaper" is the commercial name of a material called "Cyricon" developed during the last decade by a research team led by Mrs. Fereshte Lesani and Mr. Nick Sheridon at the Palo Alto Research Center of Xerox. Cyricon is as thin as a sheet of paper. It has on its surface a layer of tiny, electrically charged toner balls that are black on the one side, white on the other. Variations of an electric charge turn these balls so as to display characters or black-and-white photographs or pictures of any kind. The resolution achieved is comparable to that of printed newspapers or current computer terminals. Xerox announced in the mid of 1999 that it will soon sell Cyricon under the name of "epaper" in a joint venture with 3M.

At IBM a material called "E Ink" similar to Xerox's Gyricon/epaper has been developed as part of the "e-commerce" research project. IBM presented in 1999 a newspaper-like research prototype of 16 pages based upon E Ink.

The project OLED (Organische Licht-Emittierende Dioden, i.e., Organic Light-Emitting Diodes) at Siemens aims at developing a color display similar to epaper and E Ink. Mini-screens based on OLED, of a size comparable to that of cellular phone displays, have already been built. A prototype of a cellular phone with a stretchable OLED-based "screen" for online news has been developed. Also, the prototype of a money card with an OLED-based display for showing the balance has been presented.

Electronic newspapers, news magazines, and books based on epaper, E Ink, or OLED have many advantages over paper prints. First, the content of an epaper, E Ink, or OLED based magazine can be updated at any time, for example while the magazine is being read. Second, with epaper and E Ink animated features are possible. Third, the use of a display which is in essence a computer terminal makes it possible to offer readers new, additional services such as customization of content and/or presentation, and search.

The new devices are likely to be useful not only for newspapers and news magazines, but also for many other applications depending upon short-lived or frequently updated data such as prices, product availabilities, or timetables. It is worth mentioning that daily updated prices, product options, and delivery delays are being used more and more frequently, in particular in the car-selling business. Some supermarket chains already rely on computers for instant updates of price and product lists. The use of epaper, E Ink, and OLED is also considered for textbooks and more generally for all sorts of teaching material.

Thus, the new materials epaper, E Ink, and OLED might boost software research and development in electronic document management and in the field of Web-based data interchange, i.e., in one of the most promising application areas of XML and of the semistructured data model.

The Microsoft-led Open e-Book initiative is related to electronic books and news magazines. It aims at establishing an "Open e-Book standard" as industry-wide voluntary guidelines for formatting and delivering textual content to electronic reading devices. Already, an "Open e-Book 1.0 draft specification" has been published.

References

[EInk97a]
J. Jacobson, B. Comiskey, C. Turner, J. Albert, and P. Tsao: The last book. IBM Systems Journal, Vol 36, No. 3, 1997. http://www.research.ibm.com/journal/sj/363/jacobson.html.
[EInk97b]
B. Comiskey, J. D. Albert, B. Polito, and J. Jacobson: Electrophoretic Ink: A Printable Display Material. Proceedings of the Society for Information Display, Boston, MA, p. 75, May 1997.
[Gyricon97]
N. K. Sheridon and M. A. Berkovitz: The Gyricon -- A Twisting Ball Display. Proceedings of the Society for Information Display, Boston, MA, p. 289, May 1997
[ePaper99a]
Electronic Paper. Oct. 1999. http://www.parc.xerox.com/dhl/projects/epaper/.
[ePaper99b]
Mary Lisbeth D'Amico and Marc Ferranti: Xerox, 3M collaborate on electronic paper. June 30, 1999. http://www.windowstechedge.com/wte/wte-1999-07/wte-07-paper_p.html.
[SIAM98]
The reinvention of paper. Scientific American, Sept. 1998 http://www.sciam.com/1998/0998issue/0998techbus1.html.
[Heise99]
heise online: Elektronisches Papier im Einsatz. 4 May 1999. http://www.heise.de/newsticker/data/ae-04.05.99-000/.
[OLED99a]
Birgit Zellmann: Leuchtende Kunststoffe -- Flach, flexibel, futuristisch -- die organischen Leuchtdioden. Forschung und Innovation. Nr 2, 1999 (in German). http://w1.siemens.de/FuI/de/zeitschrift/archiv/Heft2_99/artikel08/index.html.
[OLED99b]
Ulrich Eberl: Organic LEDs Hold the Key to the Paperless Newspaper. Research and Development, November 1999. http://www.usa.siemens.com/Innovation_and_Technology/Articles/RAndDArticle8.htm.
[OLED97]
Dietmar Theis: Flachbildschirme -- das Fenster zum Informationszeitalter. Forschung und Innovation, Nr 2, 1997 (in German). http://w1.siemens.de/FuI/de/zeitschrift/archiv/Heft2_97/artikel08/index.html.
[OLED99c]
NewsDesk 9940/1: Die Hintergrundinformation (2). Referat Innovation und Technologie: Biegsame Monitore und elektronische Zeitungen, 6.10. 1999 (in German). http://w1.siemens.de/de/press_service/newsdesk/nd99401b.html.
[eBooka]
Open eBook Initiative. http://www.openebook.org/ebooks.htm.
[eBookb]
Open eBook Standard: Open eBook Publication Structure 1.0. http://www.openebook.org/OEB1.html#_Toc462070407.
[eBookc]
From "Wired News" 1999: http://www.wired.com/news/news/email/explode-infobeat/technology/story/19870.html.
[eBookd]
From "Wired News" 1998: http://www.wired.com/news/culture/0,1284,15501,00.html.
[Clax98]
Bill Claxton: Publishing Trends in the Next Century Using PDF, 1998. http://www.planetpdf.com/planetpdf/pdfs/nlbtalk.pdf.
