Why And How To Use PDOM: A Persistent W3C DOM API

gembin


What is PDOM?

PDOM stands for Persistent Document Object Model.

PDOM implements the W3C DOM API. Just as MiniDOM solves SAX processing problems, PDOM solves DOM scalability problems by providing a persistent implementation of the DOM API.

An enhanced implementation of XPath provides excellent usability.

PDOM's implementation exploits the capabilities of GPO, the Generic Persistent Object model.

Is it straightforward to use?

It is simplest to give an example. Here is some Java code:

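import cutthecrap.pdom.*;

// Connect to the PDOM store held in the given file.
PDOMClient client = new PDOMClient(null, "D:/testxml/pdom.rw");
PDOM pdom = client.getPDOM();

// Parse the XML file and store it as a persistent document named "Opera".
PDocument doc = pdom.createDocument("Opera", "D:/testxml/opera.xtm");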

The PDocument class implements the org.w3c.dom.Document interface.

You should be able to see from the above code that a PDOM system can contain many DOM documents.

At some later stage, the persistent document could be retrieved:

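import cutthecrap.pdom.*;

// Connect to the same PDOM store as before.
PDOMClient client = new PDOMClient(null, "D:/testxml/pdom.rw");
PDOM pdom = client.getPDOM();

// Look the stored document up by name; the XML file is not parsed again.
PDocument doc = pdom.getDocument("Opera");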

So PDOM provides a persistent DOM repository that can both manage individual huge XML documents and store perhaps millions of separate XML documents.

Once a PDocument has been returned, the Document interface can be used to navigate to the contained nodes.
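Because PDocument implements org.w3c.dom.Document, navigation code only needs the standard interfaces. The sketch below is an illustration, not PDOM code: it uses the JDK's in-memory parser as a stand-in for pdom.getDocument(...), and the class name, helper names and sample XML are all invented for the example. The assumption is that childElementNames would work unchanged when handed a PDocument.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomNavigationSketch {

    // Collects the tag names of the element children of the document root,
    // using only org.w3c.dom calls.
    static List<String> childElementNames(Document doc) {
        List<String> names = new ArrayList<>();
        Element root = doc.getDocumentElement();
        for (Node n = root.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                names.add(n.getNodeName());
            }
        }
        return names;
    }

    // Stand-in for pdom.getDocument(...): parses a small in-memory sample
    // with the JDK parser (hypothetical sample data).
    static Document sampleDocument() {
        String xml = "<topicMap><topic id=\"t1\"/><topic id=\"t2\"/><association/></topicMap>";
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("sample XML failed to parse", e);
        }
    }

    public static void main(String[] args) {
        Document doc = sampleDocument();
        System.out.println(childElementNames(doc));
        System.out.println(doc.getElementsByTagName("topic").getLength());
    }
}
```

Writing the helper against Document rather than a concrete class is the point of the exercise: the caller decides whether the tree lives in memory or in a persistent store.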

In addition to the org.w3c.dom interfaces, support is also provided for XPath-based queries.

What PDOM Isn't

PDOM has not been developed to provide a rigorous implementation of the full W3C DOM model. It does not currently support DTDs and there are no immediate plans to do so.

One example of this is that PDOM automatically treats the "id" attribute as the identity of an element, subsequently accessible using document.getElementById, whereas the standard specifies that the DTD must indicate which attribute is used to identify a specific element type.
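For comparison, here is a small sketch of the standard behaviour that PDOM relaxes, using the JDK's own DOM implementation rather than PDOM. The class name and sample XML are invented for the example. Without a DTD, the attribute named "id" is not an ID attribute, so getElementById finds nothing until the attribute is flagged explicitly with the DOM Level 3 method setIdAttribute.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class IdAttributeSketch {

    // Parses an XML string with the JDK's default (non-validating) parser.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("sample XML failed to parse", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<topics><topic id=\"opera\"/></topics>");

        // No DTD declares "id" as an ID attribute, so the lookup finds nothing.
        System.out.println(doc.getElementById("opera") == null);

        // Flag the attribute as an ID by hand; the same lookup now succeeds.
        Element topic = (Element) doc.getDocumentElement().getFirstChild();
        topic.setIdAttribute("id", true);
        System.out.println(doc.getElementById("opera").getTagName());
    }
}
```

With PDOM, as described above, the second step should be unnecessary because "id" is recognised automatically.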

Also, by default, text nodes are not added if they contain only whitespace, although this behaviour can be overridden when an XML document is imported.
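To see what difference this default makes, the sketch below parses an indented fragment with the JDK parser, which keeps whitespace-only text nodes when no DTD is present, and counts the children with and without them. This is not PDOM code; the class name, the helper and the sample XML are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class WhitespaceTextSketch {

    // Parses an XML string with the JDK's default (non-validating) parser.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("sample XML failed to parse", e);
        }
    }

    // Counts the children of parent, optionally skipping text nodes that
    // contain nothing but whitespace.
    static int countChildren(Node parent, boolean skipWhitespaceOnlyText) {
        int count = 0;
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            boolean blankText = n.getNodeType() == Node.TEXT_NODE
                    && n.getNodeValue().trim().isEmpty();
            if (!(skipWhitespaceOnlyText && blankText)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Document doc = parse("<topic>\n  <baseName/>\n</topic>");
        Node root = doc.getDocumentElement();
        System.out.println(countChildren(root, false)); // indentation text nodes included
        System.out.println(countChildren(root, true));  // element children only
    }
}
```

Code that walks children by position sees a different tree depending on this setting, so it is worth knowing which behaviour the store was imported with.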

XPath

Support is provided for using XPath to return nodes from the DOM.

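import java.util.Iterator;
import org.w3c.dom.Text;

PNode someNode = doc.getElementById("someId");
XPathQuery query = new XPath(".//baseNameString/text()");

// Evaluate the query relative to someNode.
query.setContext(someNode);
Iterator nodes = query.execute();

// Each result is a text node; print its value.
while (nodes.hasNext()) {
    Text txt = (Text) nodes.next();
    System.out.println("baseNameString : " + txt.getNodeValue());
}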

A number of utility methods are provided to make this even simpler, for example:

PNode someNode = doc.getElementById("someId");

Iterator nodes = someNode.queryXPath(".//baseNameString/text()");

This will produce the same result.

Creating XPathQuery objects directly may have some advantages, though. For example, a query can be passed as an argument to a method and applied to other, computationally chosen nodes by simply calling setContext for each node to be queried against.
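The same pattern of one query object reused against many context nodes can be sketched with the JDK's javax.xml.xpath API, shown here as an analogue rather than as PDOM's own classes. The class name, helper names and sample XML are invented for the example; a compiled XPathExpression plays the role of the XPathQuery, and the context node is passed to evaluate instead of being set with setContext.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ReusableXPathSketch {

    // Hypothetical sample data, loosely shaped like a topic map.
    static final String SAMPLE =
            "<topicMap>"
            + "<topic><baseName><baseNameString>Tosca</baseNameString></baseName></topic>"
            + "<topic><baseName><baseNameString>Puccini</baseNameString></baseName></topic>"
            + "</topicMap>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("sample XML failed to parse", e);
        }
    }

    // Compiles the expression once so it can be handed around and reused.
    static XPathExpression compile(String expression) {
        try {
            return XPathFactory.newInstance().newXPath().compile(expression);
        } catch (Exception e) {
            throw new IllegalArgumentException("bad XPath: " + expression, e);
        }
    }

    // Applies an already compiled query to whichever context node is supplied.
    static List<String> textValues(XPathExpression query, Node context) {
        try {
            NodeList hits = (NodeList) query.evaluate(context, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                values.add(hits.item(i).getNodeValue());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        XPathExpression query = compile(".//baseNameString/text()");
        NodeList topics = doc.getElementsByTagName("topic");
        for (int i = 0; i < topics.getLength(); i++) {
            System.out.println(textValues(query, topics.item(i)));
        }
    }
}
```

Because the query is relative (it starts with "."), each call only returns the text found under the topic it was given.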

XPath [Predicates]

The XPath support now also includes predicates where before it was limited to object navigation. For example:

PElement root = (PElement) doc.getDocumentElement();



nodes = root.queryXPath(".//instanceOf/topicRef[starts-with(@xlink:href,'#wri')]");

or:

nodes = root.queryXPath(".//instanceOf/topicRef[string-length(@xlink:href)=9]");
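The two predicate functions used above, starts-with and string-length, are standard XPath 1.0, so their effect can be tried out with the JDK's XPath engine. The sketch below is not PDOM code. To avoid setting up a namespace context it uses a plain href attribute instead of xlink:href, and the class name, helper and sample data are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathPredicateSketch {

    // Hypothetical sample: three topics whose type references differ in
    // prefix and length ("#writer" is 7 characters, "#composer" 9, "#written" 8).
    static final String SAMPLE =
            "<topicMap>"
            + "<topic><instanceOf><topicRef href=\"#writer\"/></instanceOf></topic>"
            + "<topic><instanceOf><topicRef href=\"#composer\"/></instanceOf></topic>"
            + "<topic><instanceOf><topicRef href=\"#written\"/></instanceOf></topic>"
            + "</topicMap>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("sample XML failed to parse", e);
        }
    }

    // Runs the expression from the document root and returns the href value
    // of every matching element, in document order.
    static List<String> matchingHrefs(Document doc, String expression) {
        try {
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc.getDocumentElement(), XPathConstants.NODESET);
            List<String> hrefs = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                hrefs.add(((Element) hits.item(i)).getAttribute("href"));
            }
            return hrefs;
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        System.out.println(matchingHrefs(doc,
                ".//instanceOf/topicRef[starts-with(@href,'#wri')]"));
        System.out.println(matchingHrefs(doc,
                ".//instanceOf/topicRef[string-length(@href)=9]"));
    }
}
```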

It should be stressed, though, that XPath access should not be "abused". Many ill-considered XPath queries traverse the entire XML tree where more focused queries could and should be used.

Performance

PDOM is built using the Generic Persistent Object Model. No special optimization has been carried out to minimise storage requirements for the PDOM data model.

Compared with the Xerces DOM, an in-memory PDOM requires more than twice the Java memory that Xerces needs to store the same data. For example:

Source XML    Xerces    PDOM (memory based)
----------    ------    -------------------
523 KB        4.8 MB    10.2 MB

The figures for the in-memory PDOM representation are a little disappointing; it would have been nice to show a broad equivalence with Xerces for in-memory use. Xerces is also significantly quicker than PDOM at parsing the document.

It should be stressed that these figures demonstrate what an excellent product Xerces is. PDOM uses a generic representation that requires many Java objects. The "bloat" in PDOM's memory usage is mostly explained by the overhead that comes with every object instance.

However, if the PDOM is stored persistently, the memory requirement drops. Here are the figures for the PDOM memory requirements and the datastore disk space:

Source XML    PDOM      GPO Datastore
----------    ------    -------------
523 KB        1.9 MB    1.6 MB

You may find it odd that the datastore is so small. This is achieved by various optimizations that ensure the object data is packed efficiently.

Clearly, as objects are read in, the PDOM Java memory requirement will increase, particularly if the application retains references to many objects.

It should be emphasised that the PDOM memory increases only very slightly as the source XML becomes bigger, while the backing datastore will be approximately three times the size of the source XML.

Scalability

The main reason to use PDOM is scalability. For small DOMs Xerces is an excellent choice, and its parsing performance is particularly impressive, but if you cannot predict what size the DOM will be, then PDOM provides a scalable solution.

If you read in a 300 MB XML file, the Xerces DOM will require a Java VM of around two gigabytes just to hold the data, while PDOM would process the file quite happily with a backing store of around 1 GB, even with a Java VM limited to 10 MB.

Furthermore, processing a 300 MB XML file will take Xerces a considerable time, assuming the memory is available. Processing with PDOM will also take some time, perhaps several times longer than Xerces would (if Xerces is able to do so at all), but thereafter the DOM can be accessed directly rather than having to reprocess the file.

Not having a 300 MB XML file to hand, here are some figures for a 5 MB file.

Source XML    Xerces    PDOM      GPO Datastore
----------    ------    ------    -------------
5.2 MB        21 MB     1.9 MB    13 MB
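As a rough cross-check, the ratios behind the 300 MB estimates can be worked out from the figures quoted in the tables. The snippet below is only arithmetic on those published numbers; the class and method names are invented, and the one assumption is that the 523K source is treated as 523/1024 MB.

```java
class SizeRatioSketch {

    // How many times larger a representation is than its source XML.
    static double ratio(double sizeMb, double sourceMb) {
        return sizeMb / sourceMb;
    }

    public static void main(String[] args) {
        double smallSourceMb = 523.0 / 1024.0; // the 523K sample, in MB
        double largeSourceMb = 5.2;            // the 5.2 MB sample

        // Figures taken from the tables in this article.
        System.out.printf("523K file:   Xerces x%.1f, datastore x%.1f%n",
                ratio(4.8, smallSourceMb), ratio(1.6, smallSourceMb));
        System.out.printf("5.2 MB file: Xerces x%.1f, datastore x%.1f%n",
                ratio(21.0, largeSourceMb), ratio(13.0, largeSourceMb));
    }
}
```

The datastore comes out at roughly 2.5 to 3 times the source in both samples, which scales to somewhere under 1 GB for a 300 MB file. The Xerces ratio varies more between the two samples, from about 4 to over 9 times the source, so the two gigabyte figure sits in the middle of that range.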

When PDOM was used to produce these figures, I ran with:

java -mx10M

This limits the Java heap to a maximum of 10 MB. The overhead will effectively remain constant no matter how big the DOM documents are or how many of them are stored in the datastore.
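If you want to confirm that the heap limit has actually been applied before running a large import, a few lines of plain Java will report it. This is a generic JVM check, not part of PDOM, and the class name is invented for the example.

```java
class HeapLimitSketch {

    // Maximum heap the running JVM will attempt to use, in whole megabytes.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```

Run it with the same heap option as the real job. The reported value can differ slightly from the requested size, depending on the JVM and garbage collector in use.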

Summary

PDOM solves the problem of using the standard DOM API to access huge XML data files.

The persistent DOM allows for XML files to be parsed once, and thereafter retrieved by name.

The resource overhead that huge in-memory DOMs place on the Java VM, and on OS virtual memory, is removed.

How Can I Get PDOM?

PDOM is provided as part of the full Cut The Crap distribution and can be downloaded from www.cutthecrap.biz/software/downloads.html along with other Cut The Crap software.

Posted on 2008-07-29 17:18 by gembin. Category: XML
