Abstract
Abstract
The
XQuery 1.0 specification describes a powerful language for accessing
and manipulating XML. In its initial form XQuery lacks two interesting
features: fast token-based text search and node-level update.
Specifications for these features are under development at the W3C with
release dates after XQuery 1.0. This paper looks at the mechanisms by
which XQuery can support search and update based on our experience at
Mark Logic implementing these features.
Are Search and Update Needed?
To
begin, let us ask the question, are search and update capabilities
needed in an XQuery environment? The answer depends on the use case.
When using XQuery to federate queries against data-oriented RDBMS
stores, text-based search isn't required and updates can be handled
through direct SQL calls. The XQuery in this situation acts as a
read-only view, a convenience rather than a primary access and
manipulation language.
In other use cases the need can be
acute. Take for example O'Reilly Media, a Mark Logic customer with a
custom publishing site called SafariU that's targeted at helping
university professors create custom course readers. Professors access
book and article content (held as Docbook within Mark Logic) by
performing advanced searches to locate items of interest -- chapters,
sections, or code examples. Professors gather together these items into
a new book, reviewing HTML and later PDF copies, and eventually
receiving a stack of their custom books at the university bookstore.
SafariU
requires fast text search so professors can find content to select and
update to track the new hierarchical book structure and content usage
for royalty reports. In fact, before the first professor visits
SafariU, O'Reilly uses XQuery update calls to "decorate" the book and
article content: adding <indexterm>
elements on articles, adding estimated page counts to content pieces, etc.
Mark
Logic has opted to provide search and update using function calls --
rather than language extensions -- so as to conform to the XQuery
standard as much as possible. We have been able to make search and
update calls efficient, executing quickly even against gigabyte
documents in terabyte collections. In the rest of this paper we'll
explain the Mark Logic approaches to search and update and compare and
contrast that approach with the preliminary W3C proposals.
Section Title
If
XQuery search were to provide only what a typical search engine
provides, it would be fairly easy. You specify the search word or
phrase, possibly add a basic document selection constraint, and
retrieve a relevance ordered list of documents containing the search
term(s). Using the Mark Logic search function calls and the input()
function to return a sequence of all document nodes, the expression would look like this:
cts:search(input(), "word or phrase")
Yet XQuery enables so much
more. First, XQuery has direct access to and can utilize the XML
document structure. This lets you determine what subset of content
should be searched. As shown in Example 1, you can look for the most relevant
<sect1>
blocks discussing "cougars" where the
<sect1>
blocks must be within chapters about "veterinary medicine" in books published after 2003.
Example
1: A Structure-aware Search
cts:search(
input()/book[meta/pubdate >= 2003]
//chapter[cts:contains(title, "veterinary medicine")]//sect1,
"cougars
)
Second, XQuery can access the items returned
from each search and programmatically manipulate each item for
rendering or to feed into another subsequent search. In the cougar
example, the search expression might return a sequence of <sect1>
elements, then the query might continue
on to extract all the figure blocks within those sections for display,
along with any paragraph linking to the figure. Example 2 shows what that query would look like, and how a theoretical
print-references()
might be implemented:
Example
2:
Manipulating Search Results
for $s in
cts:search(...)[1 to 10]
for $f in $s//figure
return (
print-figure($f),
print-references($f)
)
define function print-figure($f as element(figure)) { ... }
define function print-references($f as element(figure))
{
let $id := $f/@id
for $link in $f/ancestor::book//link[@linkend=$id]
return
<p>
{ string-join($link/ancestor::para[1]//text(), " ") }
</p>
}
The power to control what's
searched and what's returned proves especially valuable when documents
have useful semantic structure. Figure
1 shows a screen shot of a query for "tennis elbow" executed against a collection of medical textbooks:
The results were rendered with custom XQuery logic. Start at the most relevant
<para>
elements, walk up the document hierarchy to the nearest containing
section, then render that block as HTML. Clickable breadcrumbs were
added based on the full ancestral history of the matching <para>
. The search returns paragraphs, but for the end result we've opted to include full sections.
By
comparison, a search engine presented with a large number of Docbook
files would return simply URL pointers to which files were most
relevant. Any post processing would be entirely manual and external to
the search engine.
Executing a Search
Mark Logic provides two functions to execute searches:
cts:contains
()
and
cts:search
()
. The "cts
" namespace stands for "core text search" and is a standard extension in Mark Logic.
The
cts:contains
()
call is the simpler of the pair. It returns a boolean
true()
if the node passed as the first argument contains the word or phrase specified by the second argument. Example
3 returns all the
programlisting
s
within any example element whose title contains "black and white"
within any book element whose title contains "java servlet". The query
executes quickly based on indexes of both content and structure.
Example
3: Searching content and structure
input()/book[cts:contains(title, "java servlet")]
//example[cts:contains(title, "black and white")]
//programlisting
Those familiar with XQuery may recall a standard
fn:contains
()
function. The standard function operates on a character level rather than token (word), so
fn:contains
(title, "foo")
somewhat unexpectedly matches the word "food". The
cts:contains
(title, "foo")
call would not, because it matches tokens.
The
cts:contains
()
function works well for black and white searches. By this I don't mean
searches for "black and white", but rather those cases where your item
is either in or it's out. There's no relevance rank ordering. For that
you use cts:search
()
.
The
cts:search
()
call returns a result sequence in decreasing relevance order. In common
usage it accepts two arguments, an expression to be searched and a
query specifying what search to perform. Below is the Example
1 search query shown earlier:
cts:search(
input()/book[meta/pubdate >= 2003]
//chapter[cts:contains(title, "veterinary medicine")]//sect1,
"cougars"
)[1 to 10]
The first argument limits the search domain to
<sect1>
elements within specific chapters within specific books. The second
argument indicates the search criteria. It's the search criteria that
influences the relevance ranking. The following two queries are very
different:
//title[cts:contains(., "XQuery search")]
cts:search(//title, "XQuery search")
The first returns titles that contain
"XQuery search", in document order. The second returns titles that
contain "XQuery search", in decreasing relevance order.
The second argument to
cts:search
()
can accept not just simple words and phrases but
cts:query
instances. Through a
cts:query
instance you can apply an arbitrarily complex and nested search
criteria. Each criteria can provide search options (case insensitivity,
stemming enablement, wildcard enablement, language) and a term weight
(an indicator of this term's relative importance in the search, between
0.0 and 1.0).
You build a
cts:query
instance through of a series of "cts
" function calls.
Example
4 shows several examples:
Example
4: Query Examples
(: Simple strings are treated as
cts:word-query() :)
cts:word-query("XQuery search")
(: How to use a
cts:word-query() :)
cts:search(input(),
cts:word-query("rhinos"))
(: Indicates to use stemming so "searching" would be a match also :)
cts:word-query("XQuery search", "stemmed")
(: Either wine color is acceptable, red 2x preferred :)
cts:or-query(
cts:word-query("white wine", "stemmed", 0.5),
cts:word-query("red wine", "stemmed", 1.0)
)
(: Both explorers should appear :)
cts:and-query(
cts:word-query("Lewis", "case-sensitive"),
cts:word-query("Clark", "case-sensitive")
)
(: Same as previous. Capitalized words trigger case sensitivity. :)
cts:and-query(
cts:word-query("Lewis"),
cts:word-query("Clark")
)
(: Look for "spring" with relevance weighting by location :)
cts:or-query(
cts:element-word-query(xs:QName("title"), "spring", (), 1.0),
cts:element-word-query(xs:QName("subtitle"), "spring", (), 0.3),
cts:element-word-query(xs:QName("para"), "spring", 0.1)
)
(: Use
cts:near-query() to specify a term distance :)
cts:near-query(
(cts:word-query("white wine", "stemmed", 0.5),
cts:word-query("red wine", "stemmed", 1.0)),
2)
Scoring a Search
The
cts:search
()
call returns results in (decreasing) relevance order. You can quantify each item's relevance using the
cts:score
()
function, as shown in
Example
5:
Example
5: Scoring Results
for $x in
cts:search(//para, "spring break")
return <p score="{cts:score($x)}"> { $x/* } </p>
A score is a non-negative integer, unbounded and increasing
logarithmetically. The result of calling
xdmp:score
()
for a given nodes depends on the structure of the most recent function
call which returned the node as a value. This implementation of score
violates XQuery side-effect free-ness. On the other hand, it appears to
be the only feasible approach to scoring which is compatible with the
current XQuery grammar. Recently the Free Text Task Force of the W3C
XQuery Working Group has published a draft recommendation for XML free
text search which includes a grammatical recommendation for managing
relevance scoring. We will describe this is the next section and
compare it to the Mark Logic infrastructure.
Relevance
scores are computed using a "tf-idf" formula (tf-idf stands for term
frequency, inverse document frequency) with byte-length normalization.
Element nodes with a higher occurrence of a given word or tag will
score higher than nodes of the same size with a lower occurrence of the
given word or tag. Between two element nodes with the same number of
occurrences of a given word or tag, the smaller element will normalize
to larger score, and therefore a higher relevance. Between two words or
tags occurring with the same frequency in a given element, the word or
tag which is less common in the collection will make a bigger
contribution to the relevance score.
Relation of Mark Logic Search to the W3C FTTF Activity
The
Free Text Task Force of the W3C XQuery Working Group has published a
draft recommendation for XQuery text search. The recommendation
describes a surface syntax for XML search which integrates with the
existing XQuery grammar. The basic component of this syntax is the "ftcontains
" predicate relation. This predicate maps very closely to the Mark Logic
cts:contains
()
predicate. For example, the FTTF proposed search expression:
/books/book[title
ftcontains ("dog" with stemming) && "cat"]/author
maps directly to the Mark Logic syntax
/books/book[cts:contains(title,
cts:and-query((cts:word-query("dog","stemmed"),"cat")))]/author
The Mark Logic search function API acts
effectively as an implementation API for the W3C draft recommendation.
It would be premature to start implementing the FTTF grammar given the
lack of maturity and large number of unresolved issues in that
recommendation.
The FTTF grammar also uses the "ftcontains
"
predicate relation as an integral part of the FLWOR expression. For
example, the set of author children of book elements, where the book
element has a title child containing both "dog" and "cat":
for $b in //book
where $b/title
ftcontains "dog" && "cat"
return $b/author
maps to the Mark Logic function API:
for $b in
cts:search(//book,
cts:element-query(QName("title"),
cts:and-query(("dog","cat"))))
return $b/author
A key difference, however, is that the Mark Logic
cts:search
()
function returns results in (decreasing) relevance order by default,
whereas the FLWOR expression returns results in document order unless
an explicit order by
clause is included in the expression. The fully equivalent FTTF expression would be:
for $b in //book
score $s as $b/title
ftcontains "dog" && "cat"
where $s > 0
order by $s descending
return $b/author
which is surprisingly more verbose than the
lower-level Mark Logic implementation. De-coupling the search and the
relevance expression is potentially very useful, but in practice the
two expressions are almost always equal. The EBNF grammar allows that
the right-hand-side of the "ftcontains
" relation may be any expression, but the specification restricts the expression to be a Boolean combination of ft expressions.
Considering FT Expressions
FT search expressions appear on the right hand side of the
ftcontains
predicate. They are composed from the following primitives, which are
almost all supported directly by the Mark Logic search function API:
-
Primitives
-
-
Boolean And, Or, Not --
supported;MildNot --
not supported
-
Proximity –
supported
-
Case and diacritic sensitivity --
supported
-
Stemming and thesaurus options --
supported
-
Stop word options --
not supported
-
Language options --
supported
-
Wildcard matching --
supported
-
Match incidence counting --
not supported
-
Ordered matching --
supported
-
"at start" and "at end" anchors --
supported with
regexp matching
-
"words", "sentences", and "paragraph" scope --
supported with appropriate markup
From the FTTF
Mildnot specification:
If I want to find articles that mention
Mexico, I might search for
'"
Mexico
" mild not "
New Mexico
"'
.
'"
Mexico
" mild not "
New Mexico
"'
matches any
Expr
that contains
Mexico on its own. An
Expr
that contains "New Mexico" is not "excluded" from the result - it may mention "Mexico" as well.
(The specification probably means to say "node
" where it says "Expr
"
as one does not match expressions, one evaluates expressions to obtain
a set of nodes which may match some search condition.)
This seems like a nice idea, and it will be considered for subsequent releases of the Mark Logic server.
Conversely,
the Mark Logic search system includes important concepts which are not
present in the FTTF draft recommendation. The Mark Logic server may be
configured to "phrase through", "phrase around", or "phrase into" a
given set of markup. Phrasing through has the effect of making markup
"transparent". For example, inline emphasis tags should not interrupt
the phrase structure of a text search.
You <b>phrase through</b> a bold tag.
Phrasing around has the effect of
"ignoring" markup which interrupt the phrase structure. For example,
inline footnotes should not contribute to the phrase structure of the
footnoted text.
A GPS <footnote>Global Positioning System</footnote> device should match a search for "GPS device".
Lastly, there are cases when the best
behavior is to allow the phrase structure to descend recursively into
the string value of the subtree below a given element node.
Mark Logic Update
Mark
Logic provides a handful of functions to enable document-level and
node-level update. A single query can execute any number of these
functions.
Three basic functions perform document-level updates. These functions all use the Mark Logic standard "xdmp
" namespace prefix, as shown in
Example
6.
Example
6: Document-level Updates
(: Loads a document from the
filesystem to the specified URI.
The URI is what would be used as a parameter to
fn:doc().
Replaces any existing document at that URI. :)
xdmp:load($filepath, $uri)
(: Loads a dynamically constructed document or element node :)
xdmp:document-insert($uri, $root as node())
(: Deletes the specified document :)
xdmp:document-delete($uri)
Three other functions provide node-level insert abilities. Through the three basic calls shown in
Example
7 a node can be added at any location:
Example
7: Node-level Inserts
(: Inserts the new node as the last child of the parent :)
xdmp:node-insert-child($parent as node(), $new as node())
(: Inserts the node as a following sibling to the reference node :)
xdmp:node-insert-after($sibling as node(), $new as node())
(: Inserts the node as a preceding sibling to the reference node :)
xdmp:node-insert-before($sibling as node(), $new as node())
A final pair of functions in
Example
8 provide the rest of the node-level updates:
Example
8: Node-level Replacements
(: Replace the old with the new :)
xdmp:node-replace($old as node(), $new as node())
(: Delete the node (and its descendents) from the document :)
xdmp:node-delete($old as node())
There are additional methods to get and set document properties (used for
WebDAV), to add and remove documents from collections, and to alter document permissions. Example
9 utilizing several of these calls appears below. It's an implementation of a named queue:
Example
9: Named Queue
define function
enqueue($name as xs:string, $i as item()) as empty() {
(: wrap $i in an element() so it can be added to a doc,
: whether it's already an element or not: works for attributes, etc.
:)
let $n := <entry>{$i}</entry>
let $queue := doc($name)/queue
return
if (exists($queue))
then
xdmp:node-insert-child($queue, $n)
else
xdmp:document-insert($name, <queue>{$n}</queue>)
}
define function
dequeue($name as xs:string) as item()? {
(: get the first child element under the queue root, and unwrap it :)
let $node := doc($name)/queue/entry[1]
return
if (exists($node))
then (xdmp:node-delete($node), $node/@*, $node/node())
(: if there's nothing on the queue, return the empty sequence :)
else ()
}
Everything is straightforward in this example except perhaps the return value from
dequeue
()
. The "then
" clause returns a sequence of three items: the result of the
xdmp:node
-delete()
call (always empty), the
$node/@*
(any possible attribute on the
<entry>
in case the item stored was an attribute), and
$node/node()
(any nodes under the
<entry>
in case the item stored was an element, PI, text node, etc).
To
conform to the spirit of the XQuery specification in disallowing side
effects, updates are not made visible to the query environment
executing the update. This requires some getting used to, but doesn't
hinder productivity.
Updates occur after the query
completes without error. Should a query produce an uncaught error, the
set of updates from that query do not occur. Mark Logic does not yet
provide explicit commit/rollback support. However, Mark Logic provides
a try/catch expression so that errors can be caught and processed while
allowing execution to continue. Its syntax:
try {
$try-expression
}
catch ($exception as element()) {
$catch-expression
}
Example
10 loads files from a list of filenames, reporting errors but not letting any errors stop the other files from being loaded:
Example
10: Handling Load Errors
for $file in $files
return
try {
<li>
{ xdmp:load($file, $file) }
{ $file } loaded
</li>
}
catch ($exception) {
<span>
Problem while loading { $file }:<br/>
<i xmlns:e="http://marklogic.com/xdmp/error">
{$exception/e:message/text()}
</i>
</span>
}
ACID Semantics
Mark Logic updates satisfy a subset of ACID semantics. Below are some definitions conveniently extracted from the
Wikipedia that define ACID semantics.
- Atomicity
refers to the ability of the DBMS to guarantee that either all of the
tasks of a transaction are performed or none of them are.
- Consistency
refers to the database being in a legal state when the transaction
begins and when it ends. This means that a transaction can't break the
rules, or integrity constraints, of the database.
- Isolation
refers to the ability of the application to make operations in a
transaction appear isolated from all other operations. This means that
no operation outside the transaction can ever see the data in an
intermediate state.
- Durability refers to the
guarantee that once the user has been notified of success, the
transaction will persist, and not be undone. This means it will survive
system failure, and that the database system has checked the integrity
constraints and won't need to abort the transaction. Typically, all
transactions are written into a log that can be played back to recreate
the system to its state right before the failure. A transaction can
only be deemed committed after it is safely in the log.
The
Mark Logic Server evaluates an XQuery module by spawning an evaluation
thread which operates on a snapshot of the database. All the updates
appearing in the module are collected and a change vector computed and
journaled. The XQuery module has an implied commit which guarantees
that either all the changes or none of the changes become persistent.
Update evaluation guarantees atomicity, isolation, and durability.
Consistency is satisfied, but only in the weak sense, since integrity
constraints are not explicitly part of the Mark Logic architecture.
Integrity is expressed as de-normalized documents: all the elements
collected in one document are maintained as a consistent collection.
Isolation is maintained in the strongest sense: each query evaluation
thread operates on a static snapshot, and all the updates refer to the
original values, even if an update sub-expression modifies one of the
values appearing elsewhere in the XQuery module.
W3C XQuery Update Activity
The
W3C XQuery Working Group has published a draft requirements document
for XML XQuery update. This draft is available at
http://www.w3.org/TR/2005/WD-xquery-update-requirements-20050211/. The
pertinent parts for this paper are the basic functional requirements
and the ACID semantics requirements.
W3C XQuery Update Functional Requirements
From the specification:
Locus of modifications --
supported
The
XQuery Update Facility MUST be able to change the properties of
existing nodes while preserving their identity. The XQuery Update
Facility MUST be able to create a new copy of a node with a specific
set of changes.
Delete --
supported
The XQuery Update Facility MUST be able to delete nodes.
Insert --
supported
The XQuery Update Facility MUST be able to insert new nodes in specified positions.
Replace --
supported
The XQuery Update Facility MUST be able to replace a node.
Changing values --
supported
The XQuery Update Facility MUST be able to change the value returned by the typed-value
accessor for a node.
Modifying properties --
not supported
The XQuery Update Facility SHOULD be able to modify some of the properties of a node such as the name, type, content,
nilability, base-URI, etc.
Moving nodes --
not supported
The XQuery Update Facility MAY be able to move a node from one location to another.
Conditional updates --
supported
The XQuery Update Facility MUST be able to do conditional updates.
Iterative updates --
supported
The XQuery Update Facility MUST be able to iterate over nodes to do updates.
Validation --
not supported
The XQuery Update Facility MAY support an explicit XML Schema validation operation that preserves node identity.
Compositionality --
supported
The
XQuery Update Facility MUST be able to compose update operators with
other update operators. The XQuery Update Facility MAY be compositional
with respect to XQuery expressions; that is, it may be possible to use
an update wherever an XQuery expression is used.
Parameterization --
supported
The XQuery Update Facility SHOULD provide a means to parameterize update operations.
W3C XQuery Update Semantics Requirements
From the specification:
Atomicity --
supported
The
XQuery Update Facility MUST provide a set of atomic operations, and
MUST define a means to group atomic operations into an atomic execution
unit.
Consistency --
supported
At
the end of an outermost update operation (that is, an update operation
invoked from the external environment), the data model MUST be
consistent with respect to the constraints specified in the Data Model.
In particular, all type annotations MUST be consistent with the content
of the items they govern. The XQuery Update Facility MAY define
additional levels of granularity at which Data Model constraints are
enforced.
Isolation --
supported
The XQuery Update Facility SHOULD define the means by which operations can be isolated from concurrent operations.
Durability --
supported
The XQuery Update Facility MUST define a means to control the durability of atomic operations and atomic execution units.
Conclusion
Interested parties can download evaluation and community licensed copies of Mark Logic Server from
http://xqzone.marklogic.com. Full function documentation for Mark Logic search and update functions can be found at
http://xqzone.marklogic.com/pubs/.
Biography
Jason
Hunter
Mark Logic
Jason Hunter works as a Lead Applications Engineer at Mark Logic. He's author of
Java Servlet Programming
(O'Reilly). He's a member of the expert groups responsible for Servlet,
JSP, JAXP, and XQJ API development, and has assisted the W3C XQuery
Working Group. He is also an Apache Member and as Apache's
representative to the Java Community Process Executive Committee he
established a landmark agreement allowing open source Java. He
co-created the open source JDOM library to enable optimized Java and
XML integration.