What is PDOM?
PDOM stands for Persistent Document Object Model.
PDOM implements the W3C DOM API. Just as MiniDOM solves SAX processing problems, PDOM solves
DOM scalability problems by providing a persistent implementation of the DOM API.
An enhanced implementation of XPath makes the stored documents easy to query.
PDOM's implementation exploits the capabilities of GPO, the Generic Persistent Object model.
Is it straightforward to use?
It is simplest to give an example. Here is some Java code:
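import cutthecrap.pdom.*;

// Connect to the PDOM store held in pdom.rw.
PDOMClient client = new PDOMClient(null, "D:/testxml/pdom.rw");
PDOM pdom = client.getPDOM();
// Parse the XML file and store it under the name "Opera".
PDocument doc = pdom.createDocument("Opera", "D:/testxml/opera.xtm");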
The PDocument class implements the org.w3c.dom.Document interface.
As the code above shows, a PDOM system can contain many DOM documents.
At some later stage, the persistent document could be retrieved:
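import cutthecrap.pdom.*;

// Reconnect to the same PDOM store.
PDOMClient client = new PDOMClient(null, "D:/testxml/pdom.rw");
PDOM pdom = client.getPDOM();
// Fetch the stored document by name; the XML file is not parsed again.
PDocument doc = pdom.getDocument("Opera");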
So PDOM provides a persistent DOM repository that can both manage and interact with individual
huge XML documents and store perhaps millions of separate XML documents.
Once a PDocument has been returned, the Document interface can be used to navigate to the
contained nodes.
In addition to the org.w3c.dom interfaces, support is also provided for XPath-based queries.
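Because PDocument implements org.w3c.dom.Document, ordinary DOM traversal code should apply to it unchanged. The sketch below illustrates this using only the standard org.w3c.dom calls. It parses a small in-memory string with the JDK's built-in parser as a stand-in for a document fetched from PDOM; the class name DomWalk and the sample XML are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DomWalk {
    // Stand-in for pdom.getDocument(...): parse a string with the JDK parser.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Collect element names in document order using only org.w3c.dom calls.
    static List<String> elementNames(Node start) {
        List<String> names = new ArrayList<>();
        collect(start, names);
        return names;
    }

    private static void collect(Node node, List<String> names) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            names.add(node.getNodeName());
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, names);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<topicMap><topic id=\"t1\"><baseName/></topic></topicMap>");
        System.out.println(doc.getDocumentElement().getNodeName() + " -> " + elementNames(doc));
    }
}
```

The traversal depends only on the Node interface, so the same method could in principle be handed a document obtained from PDOM instead of one built by parse.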
What PDOM Isn't
PDOM has not been developed to provide a rigorous implementation of the full W3C DOM model. It does
not currently support DTDs and there are no immediate plans to do so.
One example of this is that PDOM automatically recognises the "id" attribute as the identity
of an element, subsequently accessible using document.getElementById, whereas the standard
specifies that the DTD must indicate which attribute is used to identify a specific
element type.
Also by default, text nodes are not added if they contain only whitespace, although
this behaviour can be overridden when an XML document is imported.
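For comparison, the sketch below shows how a standard parser treats the same two cases. It uses the JDK's built-in parser, not PDOM, and the class name WhitespaceDemo and helper stripWhitespaceText are invented for illustration. The helper removes whitespace-only text nodes after parsing, which approximates the default described above for PDOM.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class WhitespaceDemo {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Recursively remove text nodes that contain only whitespace.
    static void stripWhitespaceText(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember the sibling before any removal
            if (child.getNodeType() == Node.TEXT_NODE && child.getNodeValue().trim().isEmpty()) {
                node.removeChild(child);
            } else {
                stripWhitespaceText(child);
            }
            child = next;
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<topicMap>\n  <topic id=\"t1\">\n    <baseName>Tosca</baseName>\n  </topic>\n</topicMap>");
        Node root = doc.getDocumentElement();
        System.out.println("children before: " + root.getChildNodes().getLength());
        stripWhitespaceText(root);
        System.out.println("children after:  " + root.getChildNodes().getLength());
        // No DTD declares "id" as an ID attribute here, so the JDK parser
        // is expected to find no element for this lookup.
        System.out.println("getElementById:  " + doc.getElementById("t1"));
    }
}
```

Text nodes that carry real content, such as the one inside baseName, are left untouched; only the indentation between elements is dropped.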
XPath
Support is provided for using XPath to return nodes from the DOM:
PNode someNode = doc.getElementById("someId");
XPathQuery query = new XPath(".//baseNameString/text()");
query.setContext(someNode);
Iterator nodes = query.execute();
while (nodes.hasNext()) {
    Text txt = (Text) nodes.next();
    System.out.println("baseNameString : " + txt.getNodeValue());
}
A number of utility methods are provided to make this even simpler, for example:
PNode someNode = doc.getElementById("someId");
Iterator nodes = someNode.queryXPath(".//baseNameString/text()");
This produces the same result.
Creating XPathQuery objects directly may still have some advantages. For example, a query
object can be passed as an argument to a method and applied to other, computationally chosen
nodes by simply calling setContext for each node to be queried.
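The same reuse pattern exists in the JDK's own javax.xml.xpath package, where an expression is compiled once and evaluated against different context nodes. The sketch below uses that standard API as an analogy, not PDOM's XPathQuery; the class name ReusableQuery and the sample XML are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ReusableQuery {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Compile the expression once so it can be applied to many context nodes.
    static XPathExpression compile(String expression) {
        try {
            return XPathFactory.newInstance().newXPath().compile(expression);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Evaluate the compiled query relative to one context node.
    static List<String> values(XPathExpression query, Node context) {
        try {
            NodeList hits = (NodeList) query.evaluate(context, XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                out.add(hits.item(i).getNodeValue());
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<topicMap>"
                + "<topic><baseNameString>Tosca</baseNameString></topic>"
                + "<topic><baseNameString>Aida</baseNameString></topic>"
                + "</topicMap>");
        XPathExpression query = compile(".//baseNameString/text()");
        NodeList topics = doc.getElementsByTagName("topic");
        for (int i = 0; i < topics.getLength(); i++) {
            // Same query object, different context node each time.
            System.out.println(values(query, topics.item(i)));
        }
    }
}
```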
XPath [Predicates]
The XPath support now also includes predicates, where before it was limited to object
navigation. For example:
PElement root = (PElement) doc.getDocumentElement();
nodes = root.queryXPath(".//instanceOf/topicRef[starts-with(@xlink:href,'#wri')]");
or:
nodes = root.queryXPath(".//instanceOf/topicRef[string-length(@xlink:href)=9]");
It should be stressed, though, that XPath access should not be "abused". Many ill-considered
XPath queries may involve traversal of the entire XML tree where more focussed queries could
and should be used.
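The two predicate functions used above, starts-with and string-length, are standard XPath 1.0, so their behaviour can be tried out with the JDK's javax.xml.xpath package. The sketch below does so with an un-prefixed href attribute in place of xlink:href (which avoids having to register a namespace) and is not PDOM code; the class name PredicateDemo and the sample XML are invented for illustration. It also contrasts a ".//" query, which may visit every descendant, with an explicit child path that only follows the named steps.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class PredicateDemo {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Number of nodes selected by the expression, relative to the context node.
    static int count(Node context, String expression) {
        try {
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, context, XPathConstants.NODESET);
            return hits.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<topicMap>"
                + "<topic><instanceOf><topicRef href=\"#writer\"/></instanceOf></topic>"
                + "<topic><instanceOf><topicRef href=\"#composer\"/></instanceOf></topic>"
                + "</topicMap>");
        Node root = doc.getDocumentElement();

        // Descendant searches with predicates.
        System.out.println(count(root, ".//instanceOf/topicRef[starts-with(@href,'#wri')]"));
        System.out.println(count(root, ".//instanceOf/topicRef[string-length(@href)=9]"));

        // A more focussed query: name each step instead of searching all descendants.
        System.out.println(count(root, "topic/instanceOf/topicRef[starts-with(@href,'#wri')]"));
    }
}
```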
Performance
PDOM is built using the Generic Persistent Object Model. No special optimization has been
carried out to minimise storage requirements for the PDOM data model.
Compared with the Xerces DOM, if an in-memory system is specified then PDOM will require
over twice the Java memory that Xerces needs to store the same data, for example:
Source XML    Xerces    PDOM (memory based)
----------    ------    -------------------
523K          4.8Mb     10.2Mb
The figures for the in-memory PDOM representation are a little disappointing. It would have
been nice to show a broad equivalence with Xerces for in-memory options. Xerces is also
significantly quicker than PDOM at parsing the document.
It should be stressed that these figures demonstrate what an excellent product Xerces is.
PDOM uses a generic representation that requires many Java objects. The "bloat" in the
PDOM memory usage is mostly explained by the overhead associated with any object instance.
However, if the PDOM is stored persistently, the memory requirement drops. Here are the figures
for the PDOM memory requirements and the datastore disk space:
Source XML    PDOM      GPO Datastore
----------    ------    -------------
523K          1.9Mb     1.6Mb
You may find it odd that the datastore is so small. This is achieved by various optimizations
that ensure the object data is packed efficiently.
Clearly, as objects are read in, the PDOM Java memory requirement will increase, particularly
if the application retains references to many objects.
It should be emphasised that the PDOM memory increases only very slightly as the source
XML becomes bigger, while the backing datastore will be approximately three times the size of
the source XML.
Scalability
The main reason to use PDOM is scalability. For small DOMs Xerces is an excellent choice, and
its parsing performance is particularly impressive, but if you cannot predict what size the
DOM will be, then PDOM provides a scalable solution.
If you read in a 300Mb XML file, the Xerces DOM will require a Java VM of around
two gigabytes just to hold the data, while PDOM would process the file with a backing store of
around 1Gb and do so quite happily, even with a Java VM limited to 10Mb.
Furthermore, processing a 300Mb XML file will take Xerces a considerable time, assuming the
memory is available. Processing with PDOM will also take some time, perhaps several times longer
than Xerces would (if it is able to do so), but thereafter the DOM can be accessed directly rather
than having to reprocess the file.
Not having a 300Mb XML file around, here are some figures for a 5Mb file.
Source XML    Xerces    PDOM      GPO Datastore
----------    ------    ------    -------------
5.2Mb         21Mb      1.9Mb     13Mb
When PDOM was used to produce these figures I ran with:
java -mx10M
This limits the Java heap to a maximum of 10Mb. The overhead will effectively remain
constant no matter how large the DOM documents are or how many are stored in the datastore.
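The 300Mb estimates can be checked against the measured figures with a little arithmetic. The sketch below computes the size ratios from the two sets of measurements above and scales them linearly to a 300Mb source. It assumes that 523K means 523/1024 Mb and that the ratios hold for larger files; the class name SizeRatios is invented for illustration.

```java
public class SizeRatios {
    // Size of a representation relative to the source XML it was built from.
    static double ratio(double sizeMb, double sourceMb) {
        return sizeMb / sourceMb;
    }

    // Linear extrapolation of a measured ratio to a different source size.
    static double scaled(double sourceMb, double ratio) {
        return sourceMb * ratio;
    }

    public static void main(String[] args) {
        double small = 523.0 / 1024.0; // 523K source, expressed in Mb (assumed unit conversion)
        double large = 5.2;            // 5.2Mb source

        double xercesSmall = ratio(4.8, small);
        double xercesLarge = ratio(21.0, large);
        double storeSmall = ratio(1.6, small);
        double storeLarge = ratio(13.0, large);

        System.out.printf("Xerces heap / source:  %.1f and %.1f%n", xercesSmall, xercesLarge);
        System.out.printf("Datastore / source:    %.1f and %.1f%n", storeSmall, storeLarge);

        // Scale both measured ratios to a 300Mb source file.
        System.out.printf("300Mb in Xerces:       %.0f to %.0f Mb%n",
                scaled(300, xercesLarge), scaled(300, xercesSmall));
        System.out.printf("300Mb in datastore:    %.0f to %.0f Mb%n",
                scaled(300, storeLarge), scaled(300, storeSmall));
    }
}
```

On these assumptions the datastore ratio falls between 2.5 and roughly 3.1, which puts a 300Mb source at somewhat under 1Gb of backing store, while the Xerces ratio falls between about 4 and 9.4, which brackets the two-gigabyte estimate.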
Summary
PDOM solves the problem of using the standard DOM API to access huge XML data files.
The persistent DOM allows for XML files to be parsed once, and thereafter
retrieved by name.
The resource overhead on the java VM - and OS virtual memory - when retaining huge
in-memory DOMs is removed.
How Can I Get PDOM?
PDOM is provided as part of the full Cut The Crap distribution and can be downloaded from
www.cutthecrap.biz/software/downloads.html along with other Cut The Crap software.