[This local archive copy is from the official and canonical URL, http://www.w3.org/TandS/QL/QL98/pp/linkhier.html#ID4; please refer to the canonical source document if possible.]
This document briefly sets out functional requirements for languages to characterize and retrieve structure and content in linked hierarchies (such as XML documents). This kind of data poses more complex requirements than some others, because not only content but also structure and linkage must be available to queries. Queries on structured data are important for several domains: information retrieval, database searching, hypermedia link traversal, stylesheet rule selection, and document validation using XML schemas. Though commonly considered separate, all these domains have similar requirements for query languages. They can all be described in terms of a single abstract model of query semantics, and they all require similar sets of operators: access, comparison, testing, and retrieval of parts of XML, HTML, and SGML documents -- or more generally, searching or characterizing locations in multi-level hierarchies with links. The expression and implementation of query languages may -- and probably should -- vary among these domains, but they can benefit from a well-defined common set of primitive concepts.
This document contains three sections:
Conspicuously missing is any proposal for the syntax of a query language; until the required scope of the language is clarified, syntax design would be premature. Our examples are given in a pseudo-code which is emphatically not intended as a proposal for the syntax of a real query language.
We use the term attribute here solely in the XML sense; the term has a quite different meaning in database theory. The term entity is also used differently in the two fields, but we don't use that term here.
The term structured document is used in a wide variety of different ways. Here, we use it (without loss of generality, we think) to mean XML documents, which are ordered trees of class instances called elements. This meaning is quite different from what is called `structured data' in the relational database world: it has properties of order, hierarchy, repetition, and heterogeneity of instances that RDBs do not share (more below).
What we here call general user querying is also commonly called structured information retrieval. It is intended to provide users who know something about the content and structure of a document or collection with the ability to use that knowledge to characterize and retrieve desired portions of it.
By schema validation, we mean the process of checking that a document conforms to a schema, in particular one expressed in the XML schema language now being developed by a W3C work group.
We assume some notion of locations in a document; informally, one can think of a location as a portion of a document (e.g. an XML element or a string of characters in an element's content), or a position in it. We will treat locations as an abstract data type, and not describe further precisely how they might or should be implemented. Different query languages may define the idea of locations differently, with important results for the utility of the language.
A predicate is some expression that can be evaluated to
true or false, with respect to some document location. Predicates are
the main means of expressing what portions of data a query or other
retrieval task should or should not match. For example, is an
element of type P
or occurs within a FORM
are
simple predicates that might apply to locations within a document.
For any predicate, there may be zero or more locations in a given set of documents which satisfy it. For any given location in a document, there will be (in any interesting query language) a large (perhaps infinite) set of predicates satisfied by it. There is thus a many-to-many mapping between predicates in a query language and locations in a set of documents.
The set of locations which satisfy a predicate is the extension of the predicate. The set of characteristics of a location by virtue of which it satisfies the predicate is the intension of the predicate. Query languages may vary with respect to both the extensions and the intensions of their predicates.
A query is a request for information about a predicate-location mapping; in practice, it will take very different forms depending on the domain. See below.
Documents impose requirements on querying very different from those of relational databases. The basic characteristics of tables and hierarchies are different, and those basic characteristics lead to differences in many of the design decisions for any query language.
In RDBs, data are inherently unordered: there is by definition no relative order to the fields in a record, the records in a table, or the tables in a schema. Obviously things may be stored in some order, but this ordering is not significant (not information) as far as the database is concerned. The user can sort while making reports, but these orders are not preserved. On the contrary: database management systems are free to shuffle things around as they see fit: that is a source of many powerful optimizations in RDBMS, which should not be impeded.
In documents, the order of words of a paragraph, the order of items of a list, and the order of sections of a chapter are part of the information: the order of all must be preserved, and made accessible to the user. A query language that gave no access to information about the order of elements, or which gave only inconvenient access to information about sequence, would not be adequate for XML data.
It is occasionallly necessary to preserve or impose a given order of records in RDBMS; in such cases, one must add serial numbers, sort the records in a result (possibly repeatedly), and possibly perform other expensive tasks in order to impose or preserve the ordering. Likewise, there are cases of unordered data in documents (such as the components of a bibliography entry), where order is imposed by layout and style constraints, not by the meaning of the data itself. It is worth noting that such unusual boundary cases are handled only poorly by tools on each side of the relational/documentary divide.
By definition, RDB tables do not contain other tables or fields, records do not contain tables, and each field contains only an atomic value. Fields may of course have as their value a key or other data that can be used to retrieve another record. This is one special case of linking, but it is not a case of document structuring (see DeRose 1989 for more on the distinction). Relational queries do not have operators for containment (particularly recursive containment, which is the natural form for a containment query against an XML document), they do not support text-matches that reach across fields, and so on. Simulating these very common document requirements is extraordinarily hard in relational systems.
The XML document model is crucially hierarchical: a typical XML
document type may define divisions, sections, nested lists, and hundreds
of other text objects with internal structure.[1]
Useful document queries often require knowledge of the hierarchical
structure: the query find chapters with titles containing the
word
"Syntax" in books with titles containing the phrase "programming
language"
is likely to find discussions of programming-language
syntax and formal grammars for C, Java, Perl, etc. The query find
chapters with titles containing the phrase "programming language(s)"
within books with titles containing the word "syntax"
, in
contrast, is likely to find discussions of programming languages which
are useful in the study of natural-language syntax.[2]
Probably the most common query against documents is finding instances of a word or element type `inside' another particular type of element, regardless of whether other element types intervene. For example, finding footnotes (<fn>) that are inside chapters (<chap>), even if there are <p>s, <list>s, and so on (vertically) in between. In a naive and non-redundant relational representation of a document structure,[3] this relationship cannot be expressed at all. At best, one can write a query for direct containment, or a query twice as long to handle cases with one intervening node, or a query three times as long to handle two levels, and so on.[4] For the default SGML limit on nesting levels (28), such a query takes 25 pages of SQL (and is apt to be slow with typical query optimizers). For XML there is no level limit, so an accurate SQL query against the naive representation of a tree in third normal form is not inconvenient but impossible.
In practice, one can solve this problem most conveniently by including redundant non-normalized data in the tables: the elements table might record not only the parent but also all the ancestors of each element, either as tokens in a string field, or as a host of ancestor fields, or ancestors might appear as rows in an ancillary table of descendant-ancestor-pairs. The 25-page recursive query is now dramatically simpler,[5] though the cost in space and in complexity of query evaluation may be high.
In RDBs, by definition no field can be repeated within a record. One may define multiple fields with similar names or purposes, which users are invited to think of as repetitions (address1 and address2), but the database has no knowledge that they are related. Likewise, one may pack repetitions into a single (string) field ("zip 02906 12345"), but this makes data verification, retrieval, and many other required operations difficult or impossible. Otherwise, it is necessary to create secondary tables and link them via key fields; this imposes significant costs in size and (for typical RDMS implementations) speed.
In XML documents it is not only possible for an element to have many sub-elements of the same name; it is the norm. It would be absurd to require elements to contain only one paragraph each, or to create a separate element type for each paragraph sub-element. It would be similarly absurd to prohibit re-use of the same content in another place (as is very common for boilerplate text, warnings, etc.).
Datatypes in traditional databases are generally limited to atomic
types: INTEGER
, REAL
, BOOLEAN
,
DATE
, CHARACTER
, and STRING
(often
fixed-length). Anything else is a `blob'
about which the database knows nothing (it cannot version-manage,
display, edit, search, or otherwise process it other than keeping it
around and producing it on demand). Aggregates exist only in the form
of (rows in) tables.
Datatypes in XML documents include mainly aggregates, such as elements. XML 1.0 does not define anything like a useful set of atomic datatypes; when integers, dates, etc. appear in XML documents, they are normally encoded as strings. (If not, then the document is probably not actually XML, though it may be a very fine internal format to use with XML). This affects the most natural sizes of numbers (powers of two are not significant to XML), as well as the ever-present need to derive other data types (such as dates) from strings (such as "9/23/98" or "Jueves, 24/9/1998", not to mention "probably late in the reign of Nero").
The development of a data type system for XML will bring the problems of relational querying and XML querying closer together in the future.
In XML, linking is expected to be a very important way of relating data. Links are perhaps most important for relating parts of data that are not near each other in terms of prose flow. The Web's current requirement to break documents into small parts makes this even more important: Such parts are attached by links that are generally indistinguishable from other links; distinguishing link types would be somewhat easier if the rel and rev attributes were used consistently, but they are not. Work such as Gibson et al. (1998) shows how valuable such relationships can be for Web-scale search, even when they must be inferred through sophisticated (i.e. expensive) methods.
But in addition to this widespread need, it is worth noting that any structure that can be represented by a subelement or attribute can also be represented instead by a link. For example, one can indicate that some portion of data is a poem or price-tag, or that it is to be formatted a certain way, by creating the claim elsewhere and linking it in.[6] On the Web, this will become extremely important as people begin to attach their own annotations, elements, or attributes to other documents for their own use, without modifying those documents. (XLink provides this capability.)
In such a context, the links between data objects must participate
fully in querying, and queries must be able to reference them with
full generality, test, count, and otherwise combine them into larger
queries. For example, in order to search a document for words (tagged
<w>) which a parser (human or automatic) has interpreted as
nouns (using the <interp> element with a
wordclass attribute to specify the word class and an
instances attribute to point at the nouns of the text),
one might issue a command like find all W elements whose ID
value occurs as an IDREF in the INSTANCES attribute of INTERP elements
with attribute WORDCLASS="NN"
. This query must be just as
possible, and ideally just as simple to express, as a query which
assumes that the part of speech information has been inserted into the
wordclass attribute on the <w> element itself:
find all W elements with attribute WORDCLASS="NN"
. The
query find all U elements whose WHO attribute is the same as the
ID attribute on a SPEAKER element with attribute SEX="M"
(which
finds the utterances -- <u> elements -- spoken by males) should
not be harder to express than the query find all U elements with
attribute SPEAKER-SEX="M"
.
In particular, indirect or typed linkages must be supported without
resort to explosively long or convoluted query syntax. The examples
above, for example, should not be required to specify whether the link
occurs through an ID
/IDREF
link or through an extended
link, possibly a chain of Xpointers; a preferable formalism would allow
the expression of queries like find all W elements pointed to by
INTERP elements with attribute WORDCLASS="NN"
and find all
U elements whose WHO attribute points to a SPEAKER element with
attribute SEX="M"
.
In addition to basic structural differences, typical query languages for relational data (such as SQL) devote considerable attention to data transformation: generating new tables or views, reports, and the like. This is also true of XSL (though XSL requires many quite different operations because it must produce ordered hierarchies, not tables), but it is quite different from the requirements on XPointer or generalized document retrieval queries.
None of these differences should be taken as a criticism of either RDBs or documents: Each model is highly valuable for the kind of data it was designed for. The point is simply that forcing either kind of data into the other model cannot work well. That's why XML models for tables are a traditional sore spot, just as supporting XML on top of RDBs is (other than as blobs). Of course in a Turing-complete system anything that can be done in one model can in principle be done in the other: at some level, it's all just bits. But the most natural operations and assumptions of the models are in stark contrast. Simulating either with the other is expensive to develop, hard to maintain, and slow.
Failure to address these fundamental differences has, in our view, seriously weakened many query languages proposed for documents. Some proposals are only weakly suitable for documents: they support only queries that would also make sense in tables. Using such query languages for XML documents is like editing XML with WordPad or vi: it can be done, but it is probably not the best way to do the job. Table-oriented query languages can easily be made to look good by careful selection of mostly tabular XML schemas and documents; evaluation of query languages for general use needs to be performed on a more realistic mix of document structures.
Typical approaches of the table-oriented sort add various operators
such as CONTAINS
, but have no way of managing text that is
divided among several hierarchically-organized containers. Or, they
may be able to retrieve elements that contain both an
<abstract> and a <summary>, but not test what order
those sub-elements occur in, or whether they are adjacent. No
proposal of which we are aware does much at all with hyperlinked data
(though see Abiteboul (1998) on the theory of incorporating links into
querying). This makes perfect sense because notions such as
before, after, and adjacent
simply do not exist in relational databases, and links are present only
in the form of foreign keys, without link-type or role information.
But these notions are
very commonly important in queries on document bases, and need to be
design centers, not marginal or expensive extensions on the periphery;
otherwise documents will be second-class citizens in the retrieval
language.
Finally, it is worth noting that RDBs and documents are by no means the only data models or topologies around. Hierarchical file systems, for example, differ from both: directories are deeply and importantly hierarchical, but the members (files) of a directory are unordered. Also, where XML provides IDs that are unique across an entire document and across all element types, directories provide only locally unique filenames, and RDBs provide keys that are unique per key field per table. These represent quite different scopes, and therefore require different semantics for any operation dealing with unique identifiers. This places a roadblock in the path of those who adopt the otherwise appealing approach of extending directory-access languages to handle XML querying, or vice-versa.
XSL, XPointer, and general user querying all share some characteristics. They are, after all, operating on the same ordered hierarchical document structures. This has led to repeated proposals to design a single query language that could be used for interactive general user querying, for XSL patterns governing style-rule selection, and for addressing in XPointer. Such a unified language is a worthwhile attempt, but is not necessarily the right approach just because there is some overlap. In particular note that a query takes rather different forms in different contexts:
select-region-and-create-link
) must return a predicate
uniquely satisfied by a given location; for others (e.g. traverse
link
) it is the predicate, not the location, which is given, and
the system must identify the location(s) satisfying the given predicate.
What is returned may vary with the implementation: perhaps a full
document with a pointer (or handle) identifying the
location(s) in question, or (if the document is known to be already
accessible to the user-interface modules) perhaps just a set or list of
handles for the locations.Predicate X set of Doc -> Doc Predicate X set of Doc -> set of Doc Predicate X set of Doc -> list of Doc Predicate X set of Doc -> list of Loc Predicate X set of Doc -> set of Docfragment Loc -> Predicate Predicate -> set of Loc Predicate -> < Doc, Handle > Predicate X Loc -> Boolean
This section more specifically compares the general query-like facilities needed in these different systems. Note that the requirements for XSL and XPointer are reasonably well understood and documented; requirements for general querying and schema checking have not been worked out in similar detail, and in those columns we are relying on our own judgement about plausible requirements.
The kinds of data characteristics that can be tested via a query are largely the same in all applications. This is because the XML information structure only contains certain information (the final definition is being worked out in the DOM and XML Information Set WGs), and that is just the information that must be accessible. Thus the first set of requirements is quite similar:
Feature | XSL | XPointer | Querying | Schemas |
Test element types (GIs) | Y | Y | Y | Y |
Test attribute values | Y | Y | Y | Y |
Test ancestors GIs and attributes | Y | Y | Y | proposed |
Test order relative to siblings | Y | Y | Y | Y |
Test for place in a pattern of siblings | Y | Y | Y | ? |
Test descendants GIs and attributes | ? | Y | Y | proposed |
The different applications have similar but not identical needs for the types of data portions that can be returned. For example, all applications should be able to return any node of an XML tree, but there is seldom need for general querying to find a single character or a point selection (while XPointer has this as a very important requirement). XSL needs to address non-elements for various purposes, such as adjusting a single character, formatting drop-initial capitals, and so on. If schema validators are to test the lexical structure of attribute values and the content of elements, they will also require addressing to the character level. Intermediate results may take forms which cannot be returned as the final result of the query.
Feature | XSL | XPointer | Querying | Schemata |
Addressing characters | Y | Y | N | Y |
Addressing strings not broken by markup | Y | Y | Y | ? |
Addressing elements | Y | Y | Y | Y |
Addressing attributes | N | Y | ? | Y |
Addressing processing instructions | ? | Y | N | ? |
Addressing selections (anything a user can drag) | N | Y | N | N |
Only in XPointer and some situations in schema checking is it an error to find no result. User querying often returns no hits (it is, after all, an exploratory activity), while an XSL stylesheet may well define styles for elements that do not occur in some of the documents the stylesheet is used on (such as the many optional elements in most DTDs). On the other hand, XPointer may have little need for pointing to sets of locations, while user querying surely must. Thus, in this area we get into substantive differences in requirements.
Feature | XSL | XPointer | Querying | Schemas |
Finding no match, is an error | No | Yes | No | Sometimes |
Single returned location expected | N/A | Usually | No | Sometimes |
Support for multiple returned locations | No | No(?) | Yes | Yes |
Results sorted | No | No(?) | Option | No |
These issues involve the way an entire application
`thinks' of querying. The biggest difference here is
that for XPointer and general user querying, a query expression is
given, and data is searched for the match(es). But in XSL just the
opposite happens: a piece of data is given, and the goal is to
find a query that matches it. In particular, XSL must find
the best matching query. If (as is typical) there is one
style that applies for <p> in general and another for <p>
within <footnote>, then the latter must be selected. In a schema
language with CHECK
statements analogous to SQL CHECK
constraints, the CHECK
statements might resemble pointers in
some respects, and general user queries in others.
Feature | XSL | XPointer | Querying | Schemata |
Query arbitration (find best matching query given location) | Always | Never | Never | Never |
Cross-document matching | N | N | Y | Y |
Target authentication | N | Y | Y | Y |
Transform data before return | Y | N | ? | ? |
The feature set in any XML searching application should include ordinary operators such as Boolean operators, constants, variables, parentheses, and comparisons of the supported types. While there are many such operations in common, there are clearly significant differences as well. If a single language is to be devised that can handle all of these requirements, it must be done with great care to avoid compromising the specific requirements of each application.
As an example, if a language is designed with operations that can naturally return many hits, it may be awkward or confusing to subset it for handling only single hits: in such a language, whether multiple hits are returned depends not on the operators used, but on the data. Defining full double sets of singleton and multi-return operators would solve this question, but would not have a good effect on the clarity or unity of the language.
This analysis is excerpted from a feature set the first author developed starting in April of 1998 based on a number of standard and proprietary languages for document and database querying. The feature set here tries to capture all the basic operations needed to find things in the structure, content, and linkage of hierarchical, linked document data. Many of these features can be built from others if the right core set is chosen and adequate ways to combine operations are also provided. We make no claim that this is the right way to package these capabilities -- it is not intended to be that at all. But a language that claims adequacy should be able to accomplish all these functions one way or another, and should not require ridiculously long workarounds to get them.
Data objects occur in three places: in the data itself, in the expression of a query, and in the results generated by a query (intermediate and final query results). A query language must include atomic datatypes such as numbers, dates, truth values, and strings; operations for testing and combining data; as well as document parts such as (in XML) elements, attributes, PIs, and so on. Results are ordered (or potentially unordered) sets of locations in the data: handles to the particular data portions that match the query.
For example, find all chapters with attribute
type='secret'
shows most of these types in use:
'secret'
).The types needed include at least:
It is very important to specify clearly the relationship between the
datatypes returned from queries and the datatypes in the documents being
searched. It is a priori desirable to unify these two
sets, and thus to achieve closure. However, in
practice closure has proven frustratingly difficult to achieve in XML
query languages. Some
requirements may necessitate allowing more types in results than are
strictly necessary in the data. For example, XPointer has the very
basic requirement to model any user selection. This, after all,
reflects the simplest, most basic interface for creating XPointers
(select and create link
), as well as many other user
interactions. Yet only a small percentage of possible selections
constitute single nodes in an XML document.
A language must also be very clear about what datatypes can be explicitly or implicitly converted to other types. Explicit conversions (or casts) should be available wherever they make sense for the datatypes involved. Implicit casts can, if carefully designed, make a language far easier to use, though they risk encouraging imprecision and uncertainty about the precise meaning of expressions.
For example, many queries can be vastly simplified if there is a rule that a query can be embedded in a larger query and treated implicitly as a Boolean value: if the sub-query returns anything, it is treated as true, otherwise as false. As one simple case, one might want to find all block quotes (<bq>) directly containing more than one paragraph (<p>). This can be done by looking for the second <p> child and seeing if it exists.
Find elements e where type is BQ and exists(e.child(2,P))If we implicitly treat embedded queries as Boolean tests, we can simplify the query:
Find elements e where type is BQ and e.child(2,P)
The syntax here is for illustration. The is
indicates a
reserved relationship, namely that of an element to its element type
name; while the dot indicates a relationship that must exist in the
document structure relative to e. Such relationships are
already well-handled by XPointer, and so we use XPointer syntax
for such relationship predicates in most of the examples to follow.
We have not listed type conversions in the catalog of capabilities below, but they must be provided (whether explicitly or implicitly) in any actual language.
The means by which the language determines that a given portion of XML content or attribute value amounts to a date, a number (expressed in some numeric notation such as decimal, hex, or scientific notation), etc., must be solved as well. This is a slightly different problem than in RDBs, because all XML data starts out expressed as strings in the XML source, and because desired types for processing (such as integers) may be legitimately represented in multiple ways in the XML. Type declarations in XML schemas will solve the problem for some documents, but not for all.
We have tried to make the set of capabilities below symmetrical, since lack of symmetry reduces language functionality (often in surprising ways), and increases complexity. Asymmetry can result from designing a syntax too early, on the basis of only a few simple cases: the result is that the cases foreseen by the designers are simple, but unforeseen cases become impractically convoluted. Considering in advance the full range of requirements makes it easier to design a language where simple things are simple to express, moderate things are merely moderate, and only truly hard things are hard to express.
We have not separately addressed making the query language namespace-aware. In principle, any reference to an element type should be replaceable by a reference to a namespace-qualified name. This requires a uniform way to qualify any element type referenced in a query. Operators are also needed to query aspects of namespace use itself: find all documents using namespace N, that intermix elements from namespaces N and M, and so on.
Many of the functions listed can be useful either as a request ("get nth child of X") or as a predicate ("is this thing I've got, the nth child of X?"). We have not listed both variants unless their formulation seems non-obvious. A simple rule is used in many languages to save on such definitions: any request can be treated as a predicate: if the request in fact finds anything, it comes out true; otherwise, false. For, example, a request for the third child of type <P> can be treated as a test for whether there are at least three <P> elements in a given context. Testing for the total number of nodes fitting a given condition requires an explicit operator that returns that number (commonly called the cardinality operator).
Many of the operations listed can have mirrored operations: the
operation X precedes Y
has a mirror pair in Y follows
X
. We have not listed the extra forms, though in some cases it
is convenient or necessary to provide them (such as means for getting to
preceding siblings as well as following).
The operations are organized under these categories:
(see also the notes following the table)
Equivalence predicates |
---|
Location X is the very same place as Y |
X has exactly the same content/structure as Y |
Sibling (horizontal) predicates |
X is a sibling of Y |
X is the adjacent sibling of Y |
X is preceding sibling of Y |
X is immediately preceding sibling of Y |
X fits location L in node pattern P:Find \1 in "TI, \([P Q]+ \), UL, [^UL OL]* $" (finds the sequence of <P> and <Q> elements after the <TI> element and before the <UL>) |
Precedence predicates (global) |
X precedes Y (start tag is earlier) |
X directly-precedes Y |
X directly-precedes Y up to/at n nodes |
Global patterns (compare above under sibling predicates) |
Containment (vertical) predicates |
X is a child of Y |
X is a descendant of Y |
X is a descendant of Y up to/at n levels |
Vertical (ancestry) patterns (compare above under sibling predicates) |
Link predicates |
X directly links to Y |
X indirectly links to Y |
X indirectly links to Y at/up to n levels |
X and Y are directly linked (either direction) |
X and Y are indirectly linked (either direction) |
X and Y are indirectly linked at/up to n levels |
Link overlap predicates |
Part of X directly links Y |
Part of X indirectly links Y |
Part of X in directly links Y at/up to n levels |
X directly links part of Y |
X indirectly links part of Y |
X indirectly links part of Y at/up to n levels |
X and Y are partially directly linked |
X and Y are partially indirectly linked |
X and Y are partially indirectly linked at/up to n levels |
Child operations |
---|
Get first child |
Get last child |
Get nth child (signed) |
Path expressions (node.node.node to traverse tree, with choice predicate at each step -- see OQL, XPointer, etc.) |
Descendant operations |
Get first descendant |
Get last descendant |
Get nth child (signed) |
Get number of levels of X below Y (or root) |
Ancestor operations |
Get first ancestor (=parent) |
Get last ancestor (=root) |
Get nth ancestor (signed) |
Get-ancestor-or-self such-that |
Sibling operations |
Get first-sibling |
Get last-sibling |
Get nth-sibling (signed) |
Get first-sibling-or-self |
Get last-sibling-or-self |
Get nth-sibling-or-self (signed) |
Precedence operations (global) |
Get first-node |
Get last-node |
Get nth-node (signed) |
Link operations |
Get node linked from X via link L as linkend E (L and E may be predicates on links) |
Many relationships have variant forms where the relationship is immediate, indirect of arbitrary distance, indirect but of limited distance, or indirect but of precise distance. There seems no principled reason to exclude these variants for some relations while excluding them for others; however, some are likely to be less used than others. In this list we show first, last, and n-th operations because they are so commonly needed.
Of course one can simulate first via a get place number operator and filtering out all returns for which it returns other than 1. However, doing so has problems: (a) It obscures the user's actual intent; (b) it is harder for users to figure out how to do a simple operation; and (c) it makes it harder to implement query optimizers, which have more opportunities with high-level specifications than with low-level ones.
A language must allow operations and predicates to be combined, so that users can, for example, ask for the first, last, or nth element such that it fulfills a given predicate (however simple or complex the predicate may be). In a reasonable syntax design this should not require extra operations, just the combination of more basic ones shown here.
XPointer has long included a simple way to combine these operations: An
instance-number argument that takes signed values (or an ALL
keyword):
id(foo).child(1,P).
Negative numbers count backwards from the end of the
range involved, so -1
provides last
functionality.
There have been proposals to extend this parameter to allow multiple comma-separated values, thus covering the range functionality. This is very simple to specify syntactically, and can be quite useful, especially when a range is needed. There is also a case for making the range function independently (say, as a location term or suffix instead of a parameter); this gains some additional power because it can be applied either before or after another selection operation. This is similar to the HyTime listloc approach.
Link predicates need to be able to be conditioned on link type: it
is necessary to be able to say X is linked to Y
but it is
also necessary to be able to provide more information: X is
linked to by a link of type "critical" as the linkend with role
"reason"
. The more commonly hypertext links are used to
enrich documents with more information, the greater will be the
necessity of distinguishing among different hyperlinks by type, and
among their link-ends by role.
The sibling-or-self
and ancestor-or-self
operations may seem surprising at first glance. They are included
because experience with structural retrieval and characterization
languages has shown they are highly useful. For example, a hyperlink
may want to specify a particular sub-resource target, that might be a
selection, paragraph, or other small data portion. But for display,
application or even legal conventions may require that an entire
<CHAP> be accessible as context (in concrete terms, there may be
a requirement that the entire <CHAP> be accessible by scrollbar).
The obvious way to do this is to specify the display scope as (in
XPointer terms) ancestor(1, CHAP)
. But if the link happens
to be to the entire <CHAP> to start with, that specification will
fail (or, if the <CHAP> is within another <CHAP>, it will
leap to the larger one). These operations permit the starting node to
satisfy the predicate. In mathematical terms, they support closed
intervals, not just open intervals. Like many of the other operations,
they can be simulated.
Property predicates |
---|
X has attribute named N |
X is of same namespace throughout descendants |
Property operations |
Get node class of X (element, pi, comment, etc) (could also have is-a predicate for each) |
Get originating entity of X |
Get element type of X |
Get namespace of X |
Get all namespaces used within X |
Get attribute named N of X |
Get 1st/last/nth attribute of X (??) |
Content operations |
Get text content |
Get chars n-m of text content |
Get 1st/last/nth match of regex R |
Natural-language operations |
Match regardless of case distinctions |
Match regardless of accents |
Match ignoring punctuation distinctions (", " vs " ") |
Match on word roots (stemming) |
Hyphenated words |
Find (string or regex) match with proximity in characters (all proximity operators may be ordered or unordered) |
Find match with Proximity in words |
Find match with Proximity in sentences |
Find match with Proximity in elements |
Comparison is among the most basic operations required of any query
language. Equality comparison can be applied to any datatype (including
aggregates), but natural language strings pose some complications, such as
the need for comparison while ignoring case, accent, punctuation,
transliteration, digraph, or other distinctions, and restricting a string
to being a whole token. Although the full complexity of these problems may
not be soluble in the short term, some of these capabilities are so basic
as to be requirements. A query language incapable of searching for the
and The
simultaneously would be unacceptable to much of the world, as
would a language that permitted it only by lengthy workarounds such as
[tT][hH][eE]
.
Inequality comparisons are more complex when the values are not numeric; strings in particular have sort ordering rules that are enormously complex, and vary from one language, region, and discipline to another.
= |
<> |
> |
< |
>= |
<= |
String-contains (case sensitivity?) |
Contains-at start/end (=regex ^ $) |
Contains-as-token (=regex \< \>) |
Without these operations, of course the range of conditions that could be tested would be severely limited.
And |
Or |
Not |
Xor (of marginal use) |
Arithmetic operations are useful since they enable a range of predicates that cannot otherwise be simulated. For example, they allow testing that two parts of data are consistent, such as for checking online orders:
find elements e where e is ORDERED-ITEM and e.attr(price) * e.attr(quantity) <> e.attr(total)
These operations could be limited to integer and real datatypes; in some cases it may make sense to also support date, time, string, or other types for some operations (such as subtracting dates and times to find elapsed time).
+ |
- |
* |
/ |
** |
Modulo |
Truncation |
Absolute value |
String, numeric, date, time, and Boolean constants |
Macros or "let" binding operator |
Indirection |
Aggregates comes in many types. The most important ones within the query language (not including the various containers found in the data itself) are:
A query language must support the intrinsic operations relevant to these, since any query returns such objects: sets or ordered sets of locations.
Aggregates permitting duplicates may also be needed; this seems unclear. If they are needed, then the consequent conversion operations between aggregate types are as well.
It is probably acceptable to avoid the complexity of nested aggregates by asserting that result sets (ordered or unordered) cannot contain other result sets as members; of course they can still contain members which specify locations that are container elements - quite a different thing. Thus, any operation that generates lists of locations, even if nested, finally collapses them to a single, ordered location list without duplicates.
Since nodes in a tree do not exist entirely independently, but have relationships such as containment, operations that respond to partial or complete containment are also useful. For example, if a query returns all the descendants of a given chapter, the user might want to just get the chapter instead; operators to express this are invaluable as the amount of data being queried scales up.
A key problem is that some applications wish to apply the same queries at both the intra-document level, where things are ordered and there can be duplicate names (element types); and at the inter-document level, where things are unordered and there cannot be duplicate names (filenames). Integrating these two quite different structures is not simple.
Set/Ordered set predicates |
---|
Is-in-set |
Is-subset-of |
Is-proper-subset |
Is-empty |
Sizeof/cardinality |
X is the same set as Y |
X is a set with all members structurally identical to those in Y |
Set/Ordered set operations |
Construct set from items |
Union |
Intersect |
Difference |
Symmetric-difference |
Complement |
Extract members that satisfy predicate P (selection) |
Map each member M to expr(M) (apply) |
Ordered set operations |
Get first member |
Get last member |
Get nth member (signed) |
Ordered-set-concatenation |
Ordered-set-intersection |
Ordered set of tree-nodes operations |
Delete nodes with ancestors also in list |
Delete nodes with children also in list |
Delete nodes with descendants also in list |
Delete nodes with all ancestors also in list |
Delete nodes with all children also in list |
Delete nodes with all descendants also in list |
Replace sibling sequences by parents |
Many of these operations can be built up from more primitive ones. For
example, Replace sibling sequences by parents
is extremely useful in
practice, but can be done via a more general aggregate-mapping operation,
that maps all members to their parent. Either way, would presumably
automatically eliminate duplicates parents unless aggregate types
supporting duplicates prove necessary (in which case an explicit
remove-duplicates operation would be needed).
Note: Abstract: We demonstrate the power of object identities (oid's) as a database query language primitive. We develop an object-based data model whose structural part generalizes most of the known complex-object data models: cyclicity is allowed in both its schemas and its instances. Our main contribution is the operational part of the data model, the query language IQL, which uses oid's for three critical purposes: (1) to represent data structures with sharing and cycles, (2) to manipulate sets and (3) to express any computable database query. IQL can be statically type checked, can be evaluated bottom-up and naturally generalizes most popular rule-based languages. The model can also be extended to incorporate type inheritance, without changes to IQL. Finally, we investigate an analogous value-based data model whose structural part is founded on regular infinite trees and whose operational part is IQL.
Note: An analysis of some aspects of the difficulties in combining data of varying schemas, into single searchable databases.
Note: Discusses how researches use some of the information components within articles, not just the articles and their cataloging data, in order to improve research effectiveness.
Note: Discusses the complete design and implementation of one of the earliest systems able to display and edit structure and formatted views of a document simultaneously.
Note: Discusses the syntax and semantics of TEI extended pointers, on which XPointers are based.
Note: Presents the case for descriptive, structural document representations over solely format-oriented alternatives, with specific advantages in each phase of publication.
Note: From abstract: The communities can be viewed as containing a core of central "authoritative" pages linked together by "hub pages", and the exhibit a natural type of hierarchical topic generalization that can be inferred directly from the pattern of linkage. Our investigation shows that although the process by which users of the Web create pages and links is very difficult to understand at a "local" level, it results in a much greater degree of orderly high-level structure than has typically been assumed.
Note: Abstract: This paper describes a new object-oriented model and query algebra to be used as an input language for the query optimizers that are being built as a part of the EREQ project. The model adopts a uniform view of objects and values and separates syntactic, semantic, and implementation concerns. The algebra addresses issues of type-defined equality and duplicate elimination as well as extensions to bulk types other than sets.
Note: An analysis of how document structure can be integrated into DL systems to provide enhanced display, retrieval, and other functionality.
Note: Salton's work involves the use of word-frequency statistics to quantify the relationship of documents to each other and/or to user queries.
Note: Abstract: We define an algebra that synthesizes relational query concepts with object-oriented databases. The algebra fully supports abstract data types and object identity while providing associative access to objects, including a unique join capability. The operations take an abstract view of objects and access typed collections of objects through the public interface defined for the type. The algebra supports access to relationships implied by the structure of the objects, as well as the definition and creation of new relationships between objects. The structure of the algebra and the abstract access to objects offer opportunities for query optimization.
Note: An actual case study in which the authors treated the same data in multiple types of database, and evaluated the relative strengths and weaknesses of each for different kinds of structures.
Note: The formal specification and documentation for the TEI schemas, widely used for all kinds of documents in the humanities and social sciences. Also the source for many innovative tag sets used in other domains, such as for marking up information about other markup, about errors and corrections to texts converted from other media, about multilingual and parallel editions, and so on.
Note: Defines a minimal set of operators required for meaningful querying on tree-structures, and how it differs from the requirements and operators needed for other structures.
Note: A thorough treatise on query evaluation and optimization, covering relational, object-oriented, and other approaches. Also includes summaries of the theory of each database type.
[1]
Note
that the optimization of XML for hierarchical data makes it less adept
at tabular data. That is why tables often seem awkward in markup
systems: there is no principled reason to make rows or columns more
important, but to represent a table as a tree one must do so. Most DTDs
privilege rows; for example, that is why HTML has no direct
representation of table columns, and cannot center-justify a whole
column. CALS tables can justify columns, but only at the expense of a
block of empty column-definition elements elsewhere.
[return to text]
[2]
The example is due to Makoto Nagao of Kyoto
University.
[return to text]
[3]
For example,
a representation in third normal form in which elements appear as
entries in an
elements table with columns for
elementId, gi, parent,
eldestChild, rightSibling, etc.
[return to text]
[4] The one-level query might be expressed thus:
SELECT c.elementId FROM elements c, elements p WHERE c.gi = 'FN' AND p.gi = 'CHAP' AND c.parent = p.elementIdor thus, using a nested
SELECT
:
SELECT elementId FROM elements WHERE gi='FN' AND parent IN (SELECT elementID FROM elements WHERE gi='CHAP')A query that handles one or two intervening elements is somewhat longer:
SELECT elementId FROM elements this, elements mother, elements grandma, elements ggma WHERE this.gi = 'FN' AND ((this.parent = mother.elementId AND mother.gi = 'CHAP') OR (this.parent = mother.elementId AND mother.parent = grandma.elementId AND grandma.gi = 'CHAP') OR (this.parent = mother.elementId AND mother.parent = grandma.elementId AND grandma.parent = ggma.elementId AND ggma.gi = 'CHAP'))Further discussion may be found in Steven J. DeRose and David G. Durand (1994), p. 187. The query is simpler in languages with a primitive notion of indirect containment: the HyQ query language defined by the HyTime standard can seek all sections containing a footnote thus:
<HyQ qdomain=mydoc> Select( DOMTREE And( Eq(Proploc(CAND GI) "SECTION") Select( DOMTREE Eq(Proploc(CAND GI) "FOOTNOTE") ) ) ) </HyQ>
[5] The SQL query might now be expressed thus:
SELECT d.elementId FROM elements d, elements a, ancestors da WHERE d.gi = 'FN' AND a.gi = 'CHAP' AND da.descendant = d.elementId AND da.ancestor = a.elementId
[6]
SGML veterans may be reminded, not accidentally,
of the SGML CONREF
feature, which is too weak and too
inconsistently implemented to be seriously useful in this way, but which
did provide a limited form of the functionality described
here.
[return to text]