Managing XML data: Native XML databases
When your only tool is a hammer, everything looks like a
nail. When your only tool is a relational database, everything looks
like a table. Reality, however, is more complicated than that. Data
often isn't tabular and can benefit from a tool that more closely fits
its natural structure. When that data is XML, the appropriate tool for
managing it might well be a native XML database. For many classes of
applications with significant XML processing needs, a native XML
database is a very powerful tool. Explore the nature of native XML
databases and get some general ideas about what to expect from this new
tool in the developer's toolbox.
Relational databases in general, and SQL databases in particular, have been so incredibly successful
that they've almost completely eliminated the competition, at least in mind share if not always in actual
installations. (A lot of data is still locked up in hierarchical, big iron databases like IMS™, and quite a bit
more is stored in lower-end, non-SQL databases like FileMaker.) However, although relational databases
fit a lot of problems very well, they don't really fit XML documents, at least not in their full generality.
While you can shred an XML document enough to stuff it into a relational table or just treat it as
one big blob, neither approach really lends itself to indexing and fast queries. In practice, shredding
also tends to lead to the loss of details like element order, processing instructions, comments, white
space, and other elements that are important in many applications in which XML documents don't
look exactly like serialized tables in the first place. Field and record boundaries just don't match the
boundaries of an XML document. Applications such as publishing systems that care about these details
need to look beyond the relational database for their information storage needs.
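To make the loss concrete, here is a minimal Python sketch of a naive shredder. The sample document and the row layout (element name plus direct text) are my own assumptions for illustration, not the mapping used by any particular product.

```python
import xml.etree.ElementTree as ET

# Invented sample document with a comment, a processing instruction,
# and mixed content.
DOC = """<!-- reviewed by legal -->
<article>
  <?page-break?>
  <title>Native XML</title>
  <para>Shredding <em>loses</em> mixed content.</para>
</article>"""

def shred(xml_text):
    """Flatten a document into (element name, direct text) rows, roughly
    the way a naive XML-to-table mapping might (hypothetical mapping)."""
    root = ET.fromstring(xml_text)  # the default parser drops comments and PIs
    return [(el.tag, (el.text or "").strip()) for el in root.iter()]

rows = shred(DOC)
for row in rows:
    print(row)
```

The rows keep the element names and their leading text, but the comment, the processing instruction, the white space, and the text that follows the `em` element are all gone, and nothing in the rows records where they were.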
Traditionally, information that doesn't naturally fit into tables has been stored in a file system.
However, that approach is showing its age and probably should have been abandoned years ago.
A great deal of data is now being encoded in XML, and more is being created every day. However, many
people dump these XML documents into file systems without giving much thought to managing
the superstructures formed by the document collections (as distinct from the internal structures of each
separate document). It's time for something better.
Consequently, various vendors have released native XML databases. A native XML database
is one that treats XML documents and elements as the fundamental structures rather than tables, records,
and fields. Such a database enables developers to use tools and languages that more naturally fit the
structure of the documents they're working with, thereby enhancing productivity. It is also widely believed
(if not exactly proven) that native XML databases can significantly outperform traditional relational databases
for tasks that involve heavy document processing, such as newspaper publishing, Web site management,
and Web services.
Database models
Relational databases have a well-understood mathematical theory behind them, as
laid down 30 years ago by E. F. Codd, and expanded and expounded upon
in the decades since by C. J. Date and others. Implementations don't
always (okay, never) follow the theory precisely. However, the theory
does provide the community with a reasonably shared understanding of
what the phrase "relational database" means. The understanding is even
clearer if you say "SQL database," because an ISO standard lays out
exactly what such a database must provide.
The situation in the world of native XML databases is much murkier, in part because they're still in
development. Standards are just being developed, and they cover
only part of what's needed to interface with such a database. In fact, most of what I say here about
native XML databases won't apply to all products called "native XML databases." Nonetheless, the
smoke from the initial volleys is beginning to clear, and I can begin to make some general statements
that are at least mostly true about most XML databases, even if exceptions aren't hard to find.
I'll begin with a comparison of the XML model and the relational model, as Table 1 shows. I
should say that this is a comparison of an XML model to the relational model, because
although the relational model is fairly well defined (even if not always precisely implemented), the XML
model has no such standard, de facto or de jure. Still, Table 1 is a reasonable, rough outline of
what you can expect.
Table 1. Relational databases compared to XML databases
Relational database | XML database
A relational database contains tables. | An XML database contains collections.
A relational table contains records with the same schema. | A collection contains XML documents with the same schema.
A relational record is an unordered list of named values. | An XML document is a tree of nodes.
A SQL query returns an unordered set of records. | An XQuery returns an ordered sequence of nodes.
In the trough of the adoption curve?
This isn't a bad time to explore the XML database space. Native XML databases seem to be
following a classic double-bump adoption curve. The initial XML hype and dot-com hysteria led to a lot of
investment in XML database technology. The resulting databases hit the market with a fairly resounding
thud; in the resulting shakeout, many vendors have abandoned the space (Zvon, Stanford), been
acquired by others (XYZFind, Coherity, Excelon, B-Bop), slowed (Apache) or ceased (MindSuite)
development, or simply gone out of business (Tendara). Nonetheless, there's a real need for this stuff.
The initial offerings were just overpromised, overpriced, and under-delivered. Going forward,
the situation looks a lot more hopeful.
Implementations differ on each of these points. Some native XML databases don't really have a notion
of collection. Some databases allow a collection to support several schemas. A few low-end databases
don't support schemas at all. (Such databases are more useful than you might expect -- after all, you tend
to care more about the instance documents than the schemas.) Other, mostly early, products
supported only DTDs. Currently, W3C XML Schema is the
most commonly supported language among native XML databases. Indeed, the needs of databases -- both
traditional relational and native XML -- were major drivers in the design of the W3C XML Schema language.
However, widespread dissatisfaction with that language is causing a few vendors to start thinking about
RELAX NG, though I've yet to see it implemented in any actual products.
In most XML databases, the fundamental unit is the XML document, which roughly corresponds to a
record in a traditional database. One big advantage of a native XML database is that it can run queries that
combine (or join, in SQL parlance) information contained in multiple XML documents. The need to query
multiple documents explains the design of XQuery, the developing query language for native XML documents,
which is in turn based on XPath 2. In fact, the ability to query multiple documents is probably the single most
fundamental difference between XPath 1 and XPath 2/XQuery. What SQL is to relational databases, XQuery
is to native XML databases.
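The kind of cross-document join described above can be sketched in a few lines of Python. The two-document collection, the element names, and the `titles_by_surname` helper are all invented for illustration. A native XML database would express the same thing as a query rather than as application code.

```python
import xml.etree.ElementTree as ET

# A hypothetical two-document "collection"; names and content are invented.
COLLECTION = {
    "authors.xml": (
        "<Authors>"
        "<Author id='a1'><Surname>Harold</Surname></Author>"
        "<Author id='a2'><Surname>Smith</Surname></Author>"
        "</Authors>"
    ),
    "books.xml": (
        "<Books>"
        "<Book author='a2'><Title>Sample Title One</Title></Book>"
        "<Book author='a1'><Title>Sample Title Two</Title></Book>"
        "</Books>"
    ),
}

def titles_by_surname(collection):
    """Join Book/@author against Author/@id across two documents."""
    authors = ET.fromstring(collection["authors.xml"])
    books = ET.fromstring(collection["books.xml"])
    surname_by_id = {a.get("id"): a.findtext("Surname") for a in authors.iter("Author")}
    # Results follow the document order of books.xml.
    return [(surname_by_id[b.get("author")], b.findtext("Title")) for b in books.iter("Book")]

pairs = titles_by_surname(COLLECTION)
print(pairs)
```

The point of the sketch is only the shape of the operation: values from one document select nodes in another, and the results come back in document order.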
However, XQuery does not do as much as SQL does. Whereas SQL has four fundamental operations --
SELECT, INSERT, UPDATE, and DELETE -- as well as some lesser commands for creating and dropping
tables and users, XQuery really starts and stops with SELECT. XQuery lets you
retrieve information from an XML database, but that's it. It can't add documents to the database, delete
documents from the database, modify existing documents, or do anything else, which is a pretty gaping hole
in its capabilities.
For the moment, most native XML databases fill this hole in various, proprietary ways, often implemented
as an XQuery extension. The closest thing to a standard in this space (close only in the sense that
the moon is closer to Brooklyn than Jupiter is) is XUpdate. XUpdate is implemented by dbXML, eXist, and
X-Hive/DB, among other products. For example, here's a simple XUpdate that adds a
<MiddleName>Rusty</MiddleName> element to every Author element that has a Surname child element
with the value "Harold":
<xupdate:modifications version="1.0" xmlns:xupdate="http://www.xmldb.org/xupdate">
  <xupdate:append select="//Author[Surname='Harold']">
    <xupdate:element name="MiddleName">Rusty</xupdate:element>
  </xupdate:append>
</xupdate:modifications>
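If you don't have an XUpdate processor handy, the effect of the XUpdate above can be approximated with Python's standard library. This is a sketch of what the update does to a document, not how any database implements it. The sample document and the `append_middle_name` helper are my own.

```python
import xml.etree.ElementTree as ET

# Invented sample document.
DOC = """<Authors>
  <Author><GivenName>Elliotte</GivenName><Surname>Harold</Surname></Author>
  <Author><GivenName>Chris</GivenName><Surname>Date</Surname></Author>
</Authors>"""

def append_middle_name(root, surname, middle_name):
    """Append a MiddleName child to every Author whose Surname matches."""
    matches = root.findall(f".//Author[Surname='{surname}']")
    for author in matches:
        # Like xupdate:append, this adds the new element as the last child.
        ET.SubElement(author, "MiddleName").text = middle_name
    return len(matches)

root = ET.fromstring(DOC)
changed = append_middle_name(root, "Harold", "Rusty")
print(changed, ET.tostring(root, encoding="unicode"))
```

Only the Author element whose Surname is "Harold" gains the new child; the other is left alone.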
However, XUpdate is just one possibility, and the native XML databases that don't implement it
outnumber those that do.
A lot of work remains before XUpdate becomes a serious contender, and it really hasn't
advanced in the past four years. Longer term, the W3C XQuery working group is expected to add update
facilities to XQuery. However, work on this has just begun. So far, only requirements have been published.
The group hasn't even published a proposal for the actual syntax of the language. Given that just
implementing the XML equivalent of SELECT has taken the same group five
years and counting, I'm not holding my breath.
Benefits of native XML databases
Given that the current state of native XML databases is so unsettled, why might you consider using one?
Well, possibly for the same reasons you might have considered using relational databases in the early 1980s.
Twenty-five years ago, relational databases were slow, buggy, nonstandard memory hogs. Nonetheless,
they still had a lot of advantages compared to traditional systems, and they only got better over time.
Today's native XML databases are certainly nonstandard. Some of them, perhaps most, are also slow
memory hogs -- though 25 years of Moore's law have made that particular problem less noticeable.
How buggy they are varies a lot from one product to the next. Some are ready to go into production today,
and others I wouldn't trust to manage a grocery list. However, if you've got a lot of XML data to manage,
the technology has some real advantages that may make it worth your time to evaluate the current crop
of products.
Everything is in one place
The most important (and most often overlooked) advantage of a native XML database is simply that it
keeps all your content in one easily searched, easily managed place. You don't need to worry about
file-naming conventions or directory structures -- everything's in the database. All you have to do to get the
information out is make a query. File systems are adequate (barely) for single-user systems, and even for
those systems, traditional file systems are showing their age. Companies like Apple and Microsoft® are slowly
moving the foundations of their operating systems to more database-like structures. For data that is
accessed and edited by many different users with varying levels of privilege across heterogeneous
systems, a database of some kind is the only option. Today, too much critical data is stored in
Microsoft Word files and Excel spreadsheets on the Chief Executive Officer's laptop or in the
lead programmer's personal CVS repository. Some (not all) of this information
can plausibly be stored in a centralized, database-backed repository. Besides making it possible to find the
information when you need it, doing this also enables centralized, professionally managed redundant systems
and backups. By storing content in a database, you can avoid losing every draft of a seven-figure proposal
and all its supporting documents when your boss leaves his unbacked-up laptop in a taxi.
Multiple views of the same data
A related advantage of storing data in a database is that doing so enables multiple views of the same
content. For instance, the version of a proposal you show to the internal team might contain content about
anticipated cost structures and profit margins that you might not want to make available to the company
whose business you're bidding for. Of course, this advantage is hardly unique to native XML databases.
Relational databases do this very well, too. However, it's still worth mentioning. Perhaps the special advantage
of a native XML database in this case is that the final report itself becomes just another database query, rather
than something produced by a nonstandard tool such as Crystal Reports operating over the output of the query.
Beyond the advantages that are inherent in any database system -- relational, native XML, or otherwise --
using a database specifically for processing XML has several advantages.
Performance
The first advantage is performance. Queries over a well-designed, well-implemented native XML database
are simply faster than queries over documents stored in a file system, and for several reasons. First, the
database can do all sorts of indexing tricks to operate quickly. For instance, it can maintain a table of all the ID
values in a document so that it can jump right to the element with a certain ID rather than having to walk the tree
looking for it, as a non-database tool such as the Jaxen XPath engine does. The database can assign sequence
numbers to each node so that it knows the position of each node and can compare the document order of two
nodes in constant time.
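Both tricks are easy to sketch in Python. The `IndexedDocument` class below is a hypothetical stand-in for the bookkeeping a database might do when it stores a document. It assumes that any attribute named `id` is an ID, whereas a real system would take that from the DTD or schema.

```python
import xml.etree.ElementTree as ET

class IndexedDocument:
    """Indexes built once at load time (hypothetical, for illustration)."""

    def __init__(self, xml_text):
        self.root = ET.fromstring(xml_text)
        self.by_id = {}      # id attribute value -> element
        self.position = {}   # element -> sequence number in document order
        for seq, el in enumerate(self.root.iter()):
            self.position[el] = seq
            if "id" in el.attrib:  # assumption: "id" attributes are IDs
                self.by_id[el.get("id")] = el

    def element_by_id(self, id_value):
        # One dictionary lookup instead of a walk over the tree.
        return self.by_id.get(id_value)

    def precedes(self, a, b):
        # Compare stored sequence numbers instead of walking the tree.
        return self.position[a] < self.position[b]

doc = IndexedDocument(
    "<book><chapter id='c1'><para id='p1'/></chapter><chapter id='c2'/></book>"
)
p1 = doc.element_by_id("p1")
c2 = doc.element_by_id("c2")
print(p1.tag, c2.tag, doc.precedes(p1, c2))
```

The cost is paid once, in the constructor. Every later lookup or order comparison reuses the stored result.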
The next reason is that the database has essentially pre-parsed each document when storing it. Therefore,
it doesn't need to check each document that the query has accessed for well-formedness, or build an object model
representing that document. All these details are already inside the database in a form the query engine can use.
XML databases use a lot of other tricks to optimize performance. A few of these (smart query rewriting, for
example) are available to tools that aren't backed by a database. However, the biggest performance wins come
from trading insertion and update speed for query speed. The database does more work when it adds or modifies
a document, stores the result of that work, and then uses the results to run lightning-fast queries. If queries are
significantly more frequent than insertions and updates -- as they tend to be in many applications -- then the extra
cost paid to put documents into the database is more than earned back when retrieving them.
Very large documents
The second advantage that native XML databases have over non-database
systems is document size. Because databases can be disk backed, they
can essentially process arbitrarily large documents. Streaming tools
like SAX and System.Xml.XmlReader can do this too, but tree-based tools
like XSLT, XPath, and DOM tend to self-limit when documents hit
approximately 100 megabytes. Native XML databases allow XSLT, XQuery,
DOM, and so forth, to process arbitrarily large documents.
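For comparison, here is roughly what the streaming approach looks like in Python, using `xml.etree.ElementTree.iterparse` in the role that SAX or XmlReader plays elsewhere. The `count_elements` helper and the generated sample are assumptions of mine.

```python
import io
import xml.etree.ElementTree as ET

def count_elements(source, tag):
    """Count elements named `tag` without keeping their content in memory."""
    count = 0
    for _event, el in ET.iterparse(source, events=("end",)):
        if el.tag == tag:
            count += 1
        el.clear()  # drop this element's text, attributes, and children
    return count

# A small generated document stands in for a very large file.
sample = io.BytesIO(b"<log>" + b"<entry>x</entry>" * 1000 + b"</log>")
total = count_elements(sample, "entry")
print(total)
```

Streaming keeps memory use low, but each question means another pass over the input, and emptied element shells still accumulate under the root until it closes. A disk-backed database aims to give you tree-style access without either cost.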
Not one bit is lost
A final advantage of some (though not all) native XML databases is worth mentioning. They can retrieve the original,
unparsed document, character for character or even byte for byte. This functionality is critical in certain legal
situations where you need to reproduce the exact, original document down to the last byte. This functionality
can also be important in software development, particularly in bug tracking and performance optimization. In
these cases, seemingly irrelevant details that shouldn't matter sometimes do. It's important to make sure that the
database doesn't change the two bytes in a 10-megabyte document that actually trigger the bug. Parser-based
solutions, including systems that shred XML documents before storing them in relational databases, tend to lose
some things like white space inside tags, numeric character references, and other normally
irrelevant details. Ninety-nine percent of the time, you don't care about these arcana, but if you find yourself
in the one percent of cases in which this stuff matters, it's worth looking for a database that preserves it.
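A quick round trip through a parser shows the kind of detail that disappears. This sketch uses Python's standard library, and the sample bytes are invented.

```python
import xml.etree.ElementTree as ET

# White space inside the tags and two numeric character references.
original = b'<note  lang = "en" >&#65;&#x42;C</note >'

# Parse, then serialize again.
reparsed = ET.tostring(ET.fromstring(original))

print(original)
print(reparsed)
print(reparsed == original)
```

The two documents carry the same information as far as an XML parser is concerned, yet they are not the same bytes. A database that stores only the parsed form cannot give the original back.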
Looking forward
The more data you have, the more important it becomes to use some sort of database system to manage it. If
the data is XML, a solid native XML database is an obvious choice. The question then becomes where you can
find such a system. Given that some of the foundation technologies like XQuery are at least a year away from
completion and others are barely getting started, you might question whether you really can find a stable system
at this time. Still, the benefits are enough for you to consider moving to a native XML database
anyway, as long as you adopt it with the full knowledge that you're going to pay the upgrade costs in the future,
either in time, in money, or both.
Resources
- Read Chris Date's book, An Introduction to Database Systems, the standard introduction to the relational model. Date is probably the best advocate of the relational-is-the-one-true-data-model position. The book's eighth edition now includes (grudgingly) a chapter about XML written by IBM's Nick Tindall.
- Get a solid introduction to using XML with various types of database systems at Ronald Bourret's site.
- James Gosling describes the double bump adoption curve now being experienced by native XML databases in "Phase Relationships in the Standardization Process". An alternate way of looking at this is that native XML databases are now "crossing the chasm," as described in the book of the same name by Geoffrey A. Moore.
- Read the XUpdate specification.
- Printed out, the W3C's XQuery specs run to hundreds of pages. I recommend starting with the XML Query Use Cases.
- Check out the open source eXist, which is probably the most widely deployed native XML database, though it has some performance issues.
- Find out more about the Mark Logic Content Interaction Server, probably the hottest closed source XML database right now. Whether it's the most worthy remains to be seen.
- Take a look at dbXML 2.0 from the dbXML Group and Berkeley DB XML 2.1 from Sleepycat Software. Although their names are confusingly similar, they are probably the most robust open source native XML databases available today.
- Read the previous installments of Elliotte Rusty Harold's Managing XML data column here on developerWorks.
- Find hundreds more XML resources on the developerWorks XML zone.
- Learn how you can become an IBM Certified Developer in XML and related technologies.