Introduction to XPQL (XPath-based Query Language)


June 30 2000 Dongwook Shin
The developer of XPERT
	dwshin@futurexpert.com

 

First, you may wonder why XPERT supports yet another query language besides XPath, XQL, or Quilt. Why should I make another XML query language when there are already a dozen of them? In fact, I do not want to complicate things. One of the lessons I learned from previous experience is "the simpler, the better!"

In my opinion, the current version of XPath is more complicated than it should be. XQL and Quilt are also complicated and still evolving. The only things I like in XPath and XQL are the "simple path notation" and the "predicate" (or "filter" in XQL). So I want to rebuild the language to be as simple as possible around these two notions.

Then the question is why a language should be as simple as possible. The reason is twofold. First, a simpler language is easier to use. Second, a simpler language is more likely to admit better optimization.

For instance, the XPath specification does not state which symbol is the start symbol of the grammar: Expr or PathExpr. Moreover, the grammar is mutually recursive (Expr refers to PathExpr and PathExpr refers to Expr), which makes it hard to picture the expressiveness of the language.

I think an XML query language should have three functionalities:

  1. Node selection
  2. Transformation
  3. Update

These three are quite different from one another, so they should not be mixed together. One good example is XSLT, which provides node selection and transformation. It is a good language (I think) in that it first selects nodes with its selection facility (XPath) and then transforms them into another form. XQL, on the other hand, does not support transformation.

So I want to separate these three and make a language that offers all of them in distinct ways. Unfortunately, I do not yet know how to do that. The only thing I can do is build the language constructively, adding one feature at a time without violating the basic assumption (the simpler, the better). At this moment, I want to build the node selection component in this way. One of the main reasons I chose XPath as the starting point is that it is a W3C Recommendation and has many good features. Another rule I follow in building the language is not to provide many ways to express the same query. You may think that the more ways you have to express the same query, the more flexibility you enjoy. In reality, you pay much more than you get. One important price is query evaluation performance: in general, the more complex a query language is, the harder it is to optimize.

 

So the question is: "Do you really want extra ways to express a query that you can already express, at the expense of query performance?" I think the answer should be no. So I try to put reasonable constraints on how a query is composed, which prevents many ways of expressing the same question and helps achieve better optimization.

One such constraint is:

The test node (or node test in XPath notation) should be an ancestor of, or the same node as, the condition nodes in its predicates.

For instance, take the XPath query "//SECTION/TITLE[contains(..//PARA, "XPath")]", which says "retrieve the TITLEs of SECTIONs that have a descendant PARA containing the literal "XPath"". It is a legal XPath expression. But you can also express the query as

"//SECTION[contains(.//PARA, "XPath")]/TITLE"

Which one is faster depends entirely on your implementation. Right now, I allow only the second expression, since it is easier to implement and makes query evaluation simpler. Later, when I develop a way to transform the first into the second, or find a simpler and faster way to implement the first, I will drop the constraint.
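
To make the two formulations concrete, here is a minimal Python sketch that evaluates both by hand with the standard library's ElementTree. It is not XPERT's implementation. The sample document and the helper names (text_of, has_para_with, title_first, section_first) are hypothetical, and it assumes the predicate holds when any descendant PARA contains the word, as the prose above reads (XPath 1.0 itself converts a node-set argument of contains() to the string value of its first node).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration.
DOC = """<BOOK>
  <SECTION><TITLE>Paths</TITLE><PARA>XPath basics</PARA></SECTION>
  <SECTION><TITLE>Styles</TITLE><DIV><PARA>XSLT only</PARA></DIV></SECTION>
</BOOK>"""

def text_of(elem):
    """String value of an element: all descendant text, concatenated."""
    return "".join(elem.itertext())

def has_para_with(section, word):
    # Assumption: the condition holds if ANY descendant PARA contains the word.
    return any(word in text_of(p) for p in section.iter("PARA"))

def title_first(root, word):
    """Mimics //SECTION/TITLE[contains(..//PARA, word)]: visit each TITLE,
    climb to its parent SECTION, then search back down for PARA."""
    parent = {child: p for p in root.iter() for child in p}
    result = []
    for title in root.iter("TITLE"):
        section = parent.get(title)
        if section is not None and section.tag == "SECTION" \
                and has_para_with(section, word):
            result.append(title)
    return result

def section_first(root, word):
    """Mimics //SECTION[contains(.//PARA, word)]/TITLE: filter the SECTIONs
    first, then step down to their TITLE children."""
    result = []
    for section in root.iter("SECTION"):
        if has_para_with(section, word):
            result.extend(section.findall("TITLE"))
    return result

root = ET.fromstring(DOC)
print([t.text for t in title_first(root, "XPath")])    # ['Paths']
print([t.text for t in section_first(root, "XPath")])  # ['Paths']
```

The second form never needs to look upward from the test node, which is one way to see why it is the easier of the two to evaluate.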

A similar rationale applies to how XPERT evaluates XPQL queries. XPERT retrieves only the outermost element when several matching elements are nested inside one another. For instance, suppose a <PARA> element matches the query "//PARA[contains(., "XPERT")]" and contains another matching <PARA> element:

<PARA>
	XPERT
	<PARA>
		XPERT
	</PARA>
</PARA>

then XPERT returns only the outer <PARA> instead of both. XPERT considers efficiency more important than completeness. If you want the nested elements as well, you can search within the retrieved result, which still lets you reach completeness.
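
The outermost-only rule can be sketched as a top-down walk that stops descending as soon as an element matches. This is only an illustration in Python with the standard library's ElementTree, not XPERT's actual code. The function name outermost_matches and the wrapping <DOC> element are hypothetical.

```python
import xml.etree.ElementTree as ET

def outermost_matches(elem, tag, word):
    """Collect elements named `tag` whose text contains `word`,
    without descending into an element that already matched."""
    if elem.tag == tag and word in "".join(elem.itertext()):
        return [elem]          # matched: nested matches are not reported
    found = []
    for child in elem:
        found.extend(outermost_matches(child, tag, word))
    return found

root = ET.fromstring(
    "<DOC><PARA>XPERT<PARA>XPERT</PARA></PARA><PARA>other</PARA></DOC>"
)

hits = outermost_matches(root, "PARA", "XPERT")
print(len(hits))  # 1: only the outer PARA is returned

# Searching the retrieved result again recovers the nested element.
nested = [m for child in hits[0]
          for m in outermost_matches(child, "PARA", "XPERT")]
print(len(nested))  # 1
```

Each matching subtree is pruned as soon as it is found, so the first pass never inspects elements below a hit; completeness is recovered only when the caller asks for it.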

The current limitations of XPQL relative to the XPath abbreviated syntax are:
(1) A test node can have at most one predicate.
(2) A predicate cannot appear inside another predicate.
(3) A test node should be an ancestor of the condition nodes appearing in its predicate.
(4) At present, some operators cannot appear in some positions.
(5) Numbers appearing inside a predicate should be positive integers.
(6) One side of an equality or relational operator (=, <, >, <=, >=) should be a literal.
(7) The join operation is not supported yet.
(8) Test nodes should be elements.
(9) At present, only three functions, "contains()", "in()", and "last()", are supported.
(10) '*' representing 'anything' is not supported.
 
With (1), a query like "//researchers/person[@name = "Shin"][@loc = "Bethesda"]" is invalid, even though it is valid in XPath. You have to convert the query to "//researchers/person[@name = "Shin" and @loc = "Bethesda"]".
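
The conversion for (1) is mechanical enough to sketch in a few lines of Python. The function name merge_predicates is hypothetical and this is not part of XPERT. The sketch assumes the predicates are boolean conditions without "or": positional predicates such as "[1][2]" do not mean the same thing as "[1 and 2]", and since XPQL has no parentheses, a predicate containing "or" cannot be merged this way.

```python
def merge_predicates(query):
    """Rewrite consecutive predicates '...][...' as a single '... and ...'.

    Quoted literals are skipped so that a ']['
    inside a string is left untouched.
    """
    out = []
    quote = None  # the quote character we are currently inside, if any
    i = 0
    while i < len(query):
        ch = query[i]
        if quote:
            if ch == quote:
                quote = None           # closing quote
        elif ch in "\"'":
            quote = ch                 # opening quote
        elif ch == "]" and query[i + 1:i + 2] == "[":
            out.append(" and ")        # join the two predicates
            i += 2
            continue
        out.append(ch)
        i += 1
    return "".join(out)

print(merge_predicates('//researchers/person[@name = "Shin"][@loc = "Bethesda"]'))
# //researchers/person[@name = "Shin" and @loc = "Bethesda"]
```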
 
With (2), a query like "//researchers/person[@name = //salesperson/person/name[@id > "8080"]]" is invalid, since it has a predicate nested inside another predicate.
 
With (3), a query like "//SECTION/TITLE[contains(../PARA, "XML")]" is invalid since the test node "//SECTION/TITLE" is not an ancestor of the condition node "//SECTION/PARA". You have to convert the query to "//SECTION[contains(PARA, "XML")]/TITLE", where the test node "//SECTION" is an ancestor of the condition node "//SECTION/PARA".
 
With (4), the operator "|" (union operator) is the only one allowed for connecting path expressions outside predicates. So the query "//a | //b | //c" is valid, whereas the query "//a or //b or //c" is invalid. Inside predicates, the "and" and "or" operators are allowed as well as "|". So the query "//a[1 and in("infor*", b) or last( )]" is valid.
 
With (5), a query like "//a[-1]" or "//a[1.1]" is invalid. A number inside a predicate should be a positive integer, so a query like "//a[5]" is valid. Note that a number and a literal are different from each other, as in XPath. A number is written without enclosing double quotes ("), whereas a literal is a string enclosed in double quotes. Hence, 1 is a number, whereas "1" is a literal.
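
The distinction between a number and a literal can be sketched with two regular expressions in Python. The pattern for numbers follows the Number rule of the grammar ([1-9][0-9]*). The pattern for literals covers only the double-quoted form described above. The function name classify is hypothetical.

```python
import re

NUMBER = re.compile(r"[1-9][0-9]*")   # positive integer, no sign, no fraction
LITERAL = re.compile(r'"[^"]*"')      # string enclosed in double quotes

def classify(token):
    """Return 'number', 'literal', or 'invalid' for a predicate token."""
    if NUMBER.fullmatch(token):
        return "number"
    if LITERAL.fullmatch(token):
        return "literal"
    return "invalid"

for token in ["5", '"1"', "-1", "1.1"]:
    print(token, classify(token))
```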
 
With (6) and (7), comparison inside predicates is currently limited. One argument of an equality or comparison operator should be a literal. For instance, a query like "//a[@id > "100"]" or "//a[title = "XPERT"]" is valid, whereas one like "//a[@id=//b/@id]" is invalid. Semi-joins are not allowed at present, but they will be supported later.
 
With (8), only element nodes can be test nodes at present. Hence a query like "/person/@first-name[@last-name = "shin"]" is invalid. You have to write a query like "/person[@last-name = "shin"]" and search the element content. Attribute test nodes will be supported in a later version.

With (9), only three functions, "contains()", "in()", and "last()", are supported at present. More functions will be supported later.
 
 

Now, here are the BNF form of XPQL and some example legal queries. I try to use the same symbols as the XPath specification. This is just a beginning. Don't be disappointed if XPQL omits features you think are important. The language is evolving, and we will cover them in the next release unless they violate our mission. Don't hesitate to mail us at dwshin@futurexpert.com and join us in making a better language.

 
BNF form of XPQL
XPQLquery		::=	Expr

Expr			::= 	PathExpr
				| PathExpr '|' Expr 
(* '|' is the union operator)

PathExpr		::=	AbsoluteLocationPath

AbsoluteLocationPath 	::=	'/' RelativeLocationPath ?
				| '//' RelativeLocationPath ?

RelativeLocationPath 	::=	Step
				| RelativeLocationPath '/' Step
				| RelativeLocationPath '//' Step

Step			::=	ElementName
				| ElementName Predicate

Predicate		::=	'[' PredicateExpr ']'

ElementName		::= 	NodeName | 
				'.' | 
				'..'
(* NodeName stands for the legal names of XML nodes and thus is not defined in more detail)


PredicateExpr		::=	OrExpr
				| OrExpr or PredicateExpr

OrExpr			::=	AndExpr
				| AndExpr and OrExpr

AndExpr			::= 	UnionExpr
				| UnionExpr '|' AndExpr 

UnionExpr		::= 	Number 
				| last( )
				| contains(PathwithoutPredicate, literal )
				| in(IRExpr, PathwithoutPredicate )
				| PathwithoutPredicate operator literal 
				| literal operator PathwithoutPredicate
(* The argument positions of in() are the opposite of those in contains())  
				
PathwithoutPredicate		::= LocationPathwithoutPredicate

LocationPathwithoutPredicate	::= RelativeLocPathwithoutPredicate
				| AbsoluteLocPathwithoutPredicate

AbsoluteLocPathwithoutPredicate ::= '/' RelativeLocPathwithoutPredicate ?
				| '//' RelativeLocPathwithoutPredicate ?

RelativeLocPathwithoutPredicate ::= StepwithoutPredicate
				| RelativeLocPathwithoutPredicate '/' StepwithoutPredicate
				| RelativeLocPathwithoutPredicate '//' StepwithoutPredicate

StepwithoutPredicate		::=NodeName
				| NodeName@AttributeName
				| '.' 
				| '..'
(* AttributeName stands for the legal names of XML attributes and thus is not defined in more detail)


operator		::=	'=' | '<=' | '<' | '>=' | '>'

IRExpr			::= 	OrIRExpr

OrIRExpr		::=	AndIRExpr
				| AndIRExpr or OrIRExpr
	
		
AndIRExpr		::= 	literal 
				| literal and AndIRExpr

literal			::=	'"' [^"]* '"'
				| "'" [^']* "'"
(* inside the in() function, '*' is a wildcard character)

Number 			::= [1-9][0-9]* 
(* Number means the position of a child, so it should be a positive integer)
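
As a rough check of the path-level rules above (Expr, PathExpr, AbsoluteLocationPath, RelativeLocationPath, Step), here is a small recognizer sketched in Python. It is not the XPERT parser. It reads the continuation lines of each rule as alternatives and rewrites the left recursion as a loop. It assumes ASCII-style node names. It treats a predicate as an opaque bracketed span and only rejects a "[" nested inside it, so the contents of a predicate and the ancestor constraint (3) are not checked. The names TOKEN, tokenize, and is_valid_xpql_path are hypothetical.

```python
import re

# One token per match: a path separator, the union bar, a whole predicate
# (with no '[' or ']' inside it except within quoted literals), or a name.
TOKEN = re.compile(r"""
    \s*(?:
      (?P<slash>//|/)
    | (?P<union>\|)
    | (?P<pred>\[(?:"[^"]*"|'[^']*'|[^\[\]"'])*\])
    | (?P<name>[A-Za-z_][\w.-]*|\.\.|\.)
    )""", re.VERBOSE)

def tokenize(query):
    """Return a list of (kind, text) pairs, or None if a character fits no token."""
    query = query.strip()
    pos, tokens = 0, []
    while pos < len(query):
        m = TOKEN.match(query, pos)
        if m is None:
            return None
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens

def is_valid_xpql_path(query):
    """Recognize: Expr ::= PathExpr ('|' PathExpr)*, each PathExpr absolute."""
    tokens = tokenize(query)
    if not tokens:
        return False
    i, n = 0, len(tokens)
    while True:
        # AbsoluteLocationPath must start with '/' or '//'.
        if i >= n or tokens[i][0] != "slash":
            return False
        i += 1
        # Optional RelativeLocationPath: Step (('/' | '//') Step)*
        if i < n and tokens[i][0] == "name":
            while True:
                i += 1                                   # ElementName
                if i < n and tokens[i][0] == "pred":
                    i += 1                               # at most one Predicate
                if i < n and tokens[i][0] == "slash":
                    i += 1
                    if i >= n or tokens[i][0] != "name":
                        return False                     # separator needs a Step
                    continue
                break
        if i == n:
            return True
        if tokens[i][0] != "union":
            return False
        i += 1                                           # another PathExpr follows

print(is_valid_xpql_path('//a | //b | //c'))      # True
print(is_valid_xpql_path('//a or //b or //c'))    # False
```

A second predicate on the same step, a nested predicate, "*", and an attribute used as a test node all fall outside the token set or the step loop, so the recognizer rejects them, in line with limitations (1), (2), (10), and (8).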



Here are some legal XPQL queries.