gembin


Java Content Repository specification (JSR-170)

JCR: A Practitioner's Perspective
The Java Content Repository specification (JSR-170) focuses on "content services," which not only manage data but also offer author-based versioning, full-text search, fine-grained access control, content categorization, and content event monitoring. Programmers use a repository much as they would use a JDBC connection to a database: they obtain a connection to a repository, open a session, use the session to access a set of data, and then close the session. The JCR specification has multiple levels of compliance; the simplest level offers read-only access to a repository, XPath-like queries, and some other elements, while other levels of the specification offer a SQL-like query syntax, write capabilities, and more advanced features.

Apache offers the reference implementation for JCR, called "Jackrabbit." As the reference implementation, Jackrabbit offers all of the mandatory elements of all of the JCR levels, and while it does not have built-in fine-grained access control, adding this feature is certainly possible.

In addition to the specification itself, there are three primary online articles that introduce JCR to developers (see Resources, at the end of this article, for URLs). However, each of these offers only introductory material, and those looking to use the specification in anger are left without further guidance. This article is meant to offer a similar "short introduction" to JCR, and then delve into some frequently asked questions for which online answers are not easy to find, but should be.

JCR is an abstraction of a content repository. The actual repository might be implemented on top of a file system, on a database, or on any other storage mechanism that might be appropriate. Accessing the repository can be done directly (through a local repository contained in an application), through a resource (where a J2EE container manages the repository and access is provided via JNDI), or over the network via RMI or HTTP. Each of these deployment models has strengths and weaknesses, and each deployment should evaluate which of the models is appropriate. Most server-side applications will use the second or third model, but your mileage may vary. In any event, the only programmatic difference between the models is how the initial Repository reference is obtained.

There are four basic operations a repository provides: reads, writes, queries, and deletions. Let's walk through each of these, just to make sure you're up to speed on basic access to a repository.

For the simple example, we'll use a transient repository, which is not meant for production use but will illustrate the basic concepts. Setting up the environment is a matter of making sure you have JavaDB (i.e., Derby) and Jackrabbit in your classpath. After you've set up your environment, we'll create a base class that provides common functionality:

package jcr;

import org.apache.jackrabbit.core.TransientRepository;

import javax.jcr.*;
import java.io.IOException;

public abstract class Base {
    // Creates the Jackrabbit-specific transient repository used by all samples.
    public Repository getRepository() throws IOException {
        return new TransientRepository();
    }

    // Logs in without credentials; the samples treat this as a read-only session.
    public Session getReadonlySession(Repository repository) throws RepositoryException {
        return repository.login();
    }

    // Logs in with credentials so the session can write to the repository.
    public Session getSession(Repository repository) throws RepositoryException {
        return repository.login(new SimpleCredentials("username", "password".toCharArray()));
    }

    public void logout(Session session) {
        session.logout();
    }
}

Our next task is to read data from a known position in the Repository. To understand what's going on, though, a simple explanation of how a Repository stores data is necessary (this is also covered in more detail later in this document).

A Repository is a hierarchical store, much like a file system or an XML document. (This does not mean it uses XML or the file system, but only that these are analogues to how a Repository is represented.) Therefore, there is a "root node" at the "top" of the Repository, and the root node has child nodes, each of which can contain other child nodes or properties, where the properties contain actual data. This is a very simple explanation, and it leaves out the concepts of access control, versioning, workspaces, and a few other ideas, but this will be enough to get started.
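To make the node-and-property picture concrete, here is a minimal in-memory sketch of the same shape. It is a toy model only: `ToyNode` and its methods are hypothetical names invented for illustration, they are not part of the JCR API, and the sketch deliberately ignores workspaces, versioning, and access control.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy stand-in for a repository node (hypothetical; not the JCR API). */
final class ToyNode {
    private final String name;
    private final ToyNode parent;
    private final Map<String, ToyNode> children = new LinkedHashMap<>();
    private final Map<String, String> properties = new LinkedHashMap<>();

    ToyNode(String name, ToyNode parent) {
        this.name = name;
        this.parent = parent;
    }

    /** Adds a child node under this node and returns it. */
    ToyNode addNode(String childName) {
        ToyNode child = new ToyNode(childName, this);
        children.put(childName, child);
        return child;
    }

    /** Properties, not nodes, hold the actual data. */
    void setProperty(String key, String value) {
        properties.put(key, value);
    }

    /** Returns the property value, or null if the property is absent. */
    String getProperty(String key) {
        return properties.get(key);
    }

    /** Resolves a relative path such as "foo/child"; returns null if a step is missing. */
    ToyNode getNode(String relPath) {
        ToyNode current = this;
        for (String step : relPath.split("/")) {
            if (step.isEmpty()) {
                continue;
            }
            current = current.children.get(step);
            if (current == null) {
                return null;
            }
        }
        return current;
    }

    /** Absolute path of this node; the root node is "/". */
    String getPath() {
        if (parent == null) {
            return "/";
        }
        String parentPath = parent.getPath();
        return parentPath.equals("/") ? "/" + name : parentPath + "/" + name;
    }

    public static void main(String[] args) {
        ToyNode root = new ToyNode("", null);
        ToyNode foo = root.addNode("foo");
        foo.setProperty("bar", "this is some data");
        System.out.println(root.getNode("foo").getPath());
        System.out.println(root.getNode("foo").getProperty("bar"));
    }
}
```

A real repository reports a missing node with an exception rather than null; the toy returns null only to keep the sketch short.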

We're going to create a simple structure later, where the node name will be "/foo", with one property, "bar". Our first task is to show whether the "/foo" node exists, and if it does, show the value of "bar." For each element we look for, we'll trap the exception and show the result. To do this, we get the repository, open a read-only session, get the root node, and then look for "foo" -- a child of the root node -- and then the property "bar", with simple exception handling showing our results. We then close the session. Note that we're not exactly trapping exceptions properly in this sample code; in real-world code, you'd want to make sure all sessions were properly closed even in the case of repository exceptions (much like making sure you've closed JDBC Connections).

package jcr;

import javax.jcr.*;
import java.io.IOException;

public class ReadData extends Base {
    public ReadData() {
    }

    public static void main(String[] args) throws IOException, RepositoryException {
        ReadData readdata = new ReadData();
        readdata.run();
    }

    private void run() throws IOException, RepositoryException {
        Repository repository = getRepository();

        // Reading needs no write access, so a read-only session is enough.
        Session session = getReadonlySession(repository);

        Node rootnode = session.getRootNode();

        Node childnode = null;
        try {
            childnode = rootnode.getNode("foo");
            try {
                Property prop = childnode.getProperty("bar");
                System.out.println("value of /foo@bar: " + prop.getString());
            } catch (PathNotFoundException pnfe) {
                System.out.println("/foo@bar not found.");
            }
        } catch (PathNotFoundException pnfe) {
            System.out.println("/foo not found.");
        }

        logout(session);
    }
}

This class will show "/foo not found" after initializing the repository, because we haven't stored anything in it.

Now, to store data in our Repository, we follow a pattern not too far removed from the ReadData process: we need to get the Repository, get a Session that can write to the Repository, get the root node, see if the child node exists, add a child node, add data to that child node, save the session data if it has been changed, and then log out of the Session. Again, we are ignoring most error handling for the purpose of code clarity. Note that the save() call is critical; all changes to the repository are transient until they have been explicitly saved.
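The effect of save() can be pictured with a small toy model. This is only a sketch under stated assumptions: `ToySession` is a hypothetical class, not the JCR Session interface, and it models nothing more than the rule that a session's changes stay private to that session until they are saved and are dropped on logout.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a session's transient state (hypothetical; not the JCR API). */
final class ToySession {
    private final Map<String, String> persisted;                 // shared "repository" storage
    private final Map<String, String> pending = new HashMap<>(); // this session's unsaved changes

    ToySession(Map<String, String> persisted) {
        this.persisted = persisted;
    }

    /** Records a change in this session only. */
    void setProperty(String path, String value) {
        pending.put(path, value);
    }

    /** Reads see this session's pending changes first, then the saved state. */
    String getProperty(String path) {
        return pending.containsKey(path) ? pending.get(path) : persisted.get(path);
    }

    /** Makes the pending changes visible to every session. */
    void save() {
        persisted.putAll(pending);
        pending.clear();
    }

    /** Logging out discards anything that was never saved. */
    void logout() {
        pending.clear();
    }

    public static void main(String[] args) {
        Map<String, String> store = new HashMap<>();
        ToySession writer = new ToySession(store);
        ToySession reader = new ToySession(store);

        writer.setProperty("/foo/bar", "this is some data");
        System.out.println("before save, reader sees: " + reader.getProperty("/foo/bar"));

        writer.save();
        System.out.println("after save, reader sees: " + reader.getProperty("/foo/bar"));
    }
}
```

Forgetting the save() call in StoreData would therefore leave ReadData reporting "/foo not found" on the next run.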

package jcr;

import javax.jcr.*;
import java.io.IOException;

public class StoreData extends Base {
    public StoreData() {
    }

    public static void main(String[] args) throws IOException, RepositoryException {
        StoreData sd = new StoreData();
        sd.run();
    }

    private void run() throws IOException, RepositoryException {
        Repository repository = getRepository();

        Session session = getSession(repository);

        Node rootnode = session.getRootNode();

        Node childnode = null;
        try {
            childnode = rootnode.getNode("foo");
        } catch (PathNotFoundException pnfe) {
            childnode = rootnode.addNode("foo");
            childnode.setProperty("bar", "this is some data");
            session.save();
        }

        logout(session);
    }
}

Now, if we run this class and then the ReadData class, we'll see that the value of /foo@bar is "this is some data", just as we'd expect.

Our next task is to clear out our data. Removing a node is as simple as getting the session (as we've done in StoreData), finding the node we want to remove, telling it to remove itself, and then saving the session. Here's the code:

package jcr;

import javax.jcr.*;
import java.io.IOException;

public class RemoveData extends Base {
    public RemoveData() {
    }

    public static void main(String[] args) throws IOException, RepositoryException {
        RemoveData sd = new RemoveData();
        sd.run();
    }

    private void run() throws IOException, RepositoryException {
        Repository repository = getRepository();

        Session session = getSession(repository);

        Node rootnode = session.getRootNode();

        Node childnode = null;
        try {
            childnode = rootnode.getNode("foo");
            childnode.remove();
            session.save();
        } catch (PathNotFoundException pnfe) {
            System.out.println("/foo not found; not removed.");
        }

        logout(session);
    }
}

We can now load our data, check to see if it's there, and remove it through these three classes. Now we're off to query our data!

Querying isn't much different, in theory, from reading or writing data, except that you access queries through a QueryManager, which is obtained through a workspace. A Query returns a QueryResult, which can return an iterator for Nodes or Rows (depending on how you want to access the data, and what your implementation supports.) In addition, the ways you can query data are impressive, so the basic structure of this class will change a little.

We'll create a console-driven application with which you can run your own queries against a repository, and show you some queries that should (and should not) return data.

package jcr;

import javax.jcr.*;
import javax.jcr.query.QueryManager;
import javax.jcr.query.Query;
import javax.jcr.query.QueryResult;
import java.io.IOException;
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class CommandLineQuery extends Base {
    public CommandLineQuery() {
    }

    public static void main(String[] args) throws IOException, RepositoryException {
        CommandLineQuery clq = new CommandLineQuery();
        clq.run();
    }

    private void run() throws IOException, RepositoryException {
        Repository repository = getRepository();
        Session session = getReadonlySession(repository);
        Workspace workspace = session.getWorkspace();
        QueryManager qm = workspace.getQueryManager();
        BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
        for (;;) {
            System.out.print("JCRQL> ");
            String queryString = reader.readLine();
            // readLine() returns null at end of input; treat that like "quit"
            if (queryString == null || queryString.equals("quit")) {
                break;
            }
            // skip blank lines and comment lines
            if (queryString.length() == 0 || queryString.startsWith("#")) {
                continue;
            }

            int resultCounter = 0;
            try {
                Query query = qm.createQuery(queryString, Query.XPATH);
                QueryResult queryResult = query.execute();
                NodeIterator nodeIterator = queryResult.getNodes();
                while (nodeIterator.hasNext()) {
                    Node node = nodeIterator.nextNode();
                    dump(node);
                    resultCounter++;
                }
            } catch (Exception e) {
                e.printStackTrace();
            }

            System.out.println("result count: " + resultCounter);
        }
        logout(session);
    }

    // Prints the node name, path, and every property as a string.
    private void dump(Node node) throws RepositoryException {
        StringBuilder sb = new StringBuilder();
        String sep = ",";
        sb.append(node.getName());
        sb.append("[" + node.getPath());
        PropertyIterator propIterator = node.getProperties();
        while (propIterator.hasNext()) {
            Property prop = propIterator.nextProperty();
            sb.append(sep);
            sb.append("@" + prop.getName() + "=\"" + prop.getString() + "\"");
        }
        sb.append("]");
        System.out.println(sb.toString());
    }
}

This class isn't perfect by any means. In particular, the dump(Node) method is incapable of treating properties correctly, so if you plan to use this class on "real data," you should expect to modify this method extensively to support properties that contain multiple values, properties that contain different value types, and other conditions.

Here is some sample input and output based on the Repository containing one node, from the StoreData class above:

JCRQL> # Dump root's child node "foo"
JCRQL> //foo
foo[/foo,@bar="this is some data",@jcr:primaryType="nt:unstructured"]
result count: 1
JCRQL> # Dump all /foo nodes where the property "bar" has the value "this is some data"
JCRQL> //foo[@bar='this is some data']
foo[/foo,@bar="this is some data",@jcr:primaryType="nt:unstructured"]
result count: 1
JCRQL> # dump all nodes whose 'bar' property contains 'data'
JCRQL> //*[jcr:contains(@bar, 'data')]
foo[/foo,@bar="this is some data",@jcr:primaryType="nt:unstructured"]
result count: 1
JCRQL> # dump all nodes whose 'bar' property contains 'daata' -- note extra a
JCRQL> //*[jcr:contains(@bar, 'daata')]
result count: 0
JCRQL> # dump all nodes whose 'bar' property is "like" "thi%" -- much as a SQL "like" comparison
JCRQL> //*[jcr:like(@bar, 'thi%')]
foo[/foo,@bar="this is some data",@jcr:primaryType="nt:unstructured"]
result count: 1
JCRQL> # note that jcr:like is a full wildcard, so this would only find nodes with
JCRQL> # property 'bar' that *start* *with* "dat"
JCRQL> //*[jcr:like(@bar, 'dat%')]
result count: 0
JCRQL> # meanwhile, this query will find any node whose property 'bar' contains 'dat'
JCRQL> //*[jcr:like(@bar, '%dat%')]
foo[/foo,@bar="this is some data",@jcr:primaryType="nt:unstructured"]
result count: 1
JCRQL> # note that jcr:contains matches whole words, so this is not the same as the previous query
JCRQL> //*[jcr:contains(@bar, 'dat')]
result count: 0

This is obviously only a simple introduction to queries; see the JCR specification, section 6.6, for more detail on the XPath-like query syntax as well as a discussion of the SQL-like query syntax and how XPath and SQL map to each other in the QueryManager.
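The difference between jcr:like and jcr:contains in the session above can be mimicked in plain Java. The sketch below is an approximation, not how a repository implements either function: `like` translates a SQL-style pattern into a regular expression that must match the whole value, and `containsWords` is a crude stand-in for full-text search that only checks for whole words. Both method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

/** Plain-Java approximation of the two matching styles (hypothetical helper). */
final class LikeVersusContains {

    /** SQL-style LIKE: '%' matches any run of characters, '_' exactly one; the whole value must match. */
    static boolean like(String value, String pattern) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '%') {
                regex.append(".*");
            } else if (c == '_') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL).matcher(value).matches();
    }

    /** Rough word-based search: every search term must appear in the value as a whole word. */
    static boolean containsWords(String value, String terms) {
        Set<String> words = new HashSet<>(Arrays.asList(value.toLowerCase().split("\\W+")));
        for (String term : terms.toLowerCase().split("\\W+")) {
            if (!term.isEmpty() && !words.contains(term)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String bar = "this is some data";
        System.out.println("like 'thi%':     " + like(bar, "thi%"));
        System.out.println("like 'dat%':     " + like(bar, "dat%"));
        System.out.println("like '%dat%':    " + like(bar, "%dat%"));
        System.out.println("contains 'data': " + containsWords(bar, "data"));
        System.out.println("contains 'dat':  " + containsWords(bar, "dat"));
    }
}
```

The pattern 'dat%' fails because the value does not start with "dat", while 'dat' fails the word search because "dat" is only a fragment of the word "data".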

Now you should have a raw starting look at JCR and how you might use it. However, this introduction doesn't really explain much about JCR besides the basic "how-to" details. Most people reach this point and stop, because information on using JCR "in anger" is hard to find.

Here are some of the common questions new JCR users run into almost immediately, along with some answers from people who have used the specification in real applications.

Q: Why would I really use JCR? Is it better than a file system or a database?

A: I think this is one of the most frequently asked questions in the Java Content Repository community. Before answering it, we should point out what JCR is and what the original rationale behind it was. Based on these, we will be able to identify a couple of possible usages.

The parallel with a file system seems appropriate, but what JCR really does is offer a stable, unified, even transactional (*) API to work with hierarchical data, with the capability to define constraints on the stored information. It promotes the hierarchy into the interface so the actual medium does not have to be a file system at all. It can be a file system, of course, but it can also be a relational database, a Berkeley database, or any storage that can store hierarchical data in some way.

When should you use hierarchical storage? A hierarchy, or tree, is used to provide classification. In a set, a tree might represent collating order, for example. For more complex data, hierarchy indicates what domain the data exists in: much as in XML, "address" might mean different things if the parent node is "person" (in which case "address" might mean "Mr." or "Mrs.") or if the parent node is "location" (in which case the address probably has street address, city, country, and postal code data). In JCR, content can easily reference other content, providing access and reference across the hierarchy. This sounds very much like the use cases for XML, especially when JCR's naming and query conventions are examined; it shouldn't surprise you that JCR abstracts an XML view over the backend storage.

Q: You said the persistent storage does not have to be a file system. Why does that matter? What does that get me?

A: The JSR-170 specification does not talk about the actual persistent storage, as this is a detail for the implementation. Discussing the pros and cons of different storage solutions is well beyond the scope of this article. However, I think that from an implementation perspective the storage type will have an important influence on how easily other JCR features are supported (e.g., for file system-based solutions, support for transactions is difficult, but support for versioning is easy; for relational data stores, transactions are easy, but versioning will be more complex). What I should emphasize here is that an application written against the JCR API will run on any implementation, whatever backing persistence storage is used.

Q: Okay, so I have chosen to use JCR. Where should I deploy it - as part of my web application, as part of my container? If it is part of my container, how do I make sure it is shut down properly?

A: This is a very generic question, so my answer would be, as always: Use it the way it best fits your application. If you take a look at Jackrabbit, which is one of the most complete open source implementations of the JCR (JSR-170 spec), you will find out that you have various options: embedded with your application, deployed as a shared J2EE container resource or even deployed as a standalone server.

How do you decide which of the above approaches to take? Since every installation and problem domain is unique, you should analyze the lifecycle of your application versus the lifecycle of the JCR, analyze the performance you need, and then pick the one that offers you the best answers.

Q: But I do not understand what should factor into my decisions. What are the advantages of each?

A: There are three models for deploying Jackrabbit - other implementations may differ. Jackrabbit's models are fairly descriptive, though.

The first model is the simplest: deploying Jackrabbit as part of your application, just as you would Lucene or commons-dbcp. This means putting all of the jars into your application classpath.

The second model installs Jackrabbit at the container level, making it available to all of the container's applications and accessed via JNDI, just like a database pool. The application only needs access to the JCR API jar with this model.

The third model runs a server instance of Jackrabbit, where each client application that needs the repository communicates with the server over the network (by way of SOAP, DAV, RMI, or any other protocol, really). The role of JCR in this instance is to abstract the details of communication away from the programmer.

The models increase in complexity as you go down the list. Deploying JCR as part of the application is the simplest, but also the least flexible: only that one application is able to connect to the repository (although if you use a multi-user storage mechanism, you could configure it for multi-user access), and the Jackrabbit libraries are duplicated in every application.

The second model provides JCR to every application in the container, because Jackrabbit is set up as a container resource; this provides all of the transaction capabilities, etc. to the JCR clients. This will appeal to most programmers because it means they only have to configure JCR once.

The third model is the most "enterprise-y," in that it moves JCR into a server separate from the application server, much like a relational database. It provides the most access from multiple containers, but also carries with it the requirement for network access (SOAP, RMI, etc.). This also means that resources for the repository are not shared with the applications, which is the biggest benefit of the model: backups and CPU time are dedicated to the repository.

Q: What is the "happy path" using JCR?

A: From an application perspective, working with a Repository is quite easy:

  1. Obtain a Repository reference. Unfortunately, this is not covered by the spec, so you will need to identify how to do this with your implementation.
  2. Obtain a JCR Session by logging in, optionally to a specific workspace:
    1. repository.login(); // authentication is handled by an external mechanism and you log in to the default workspace
    2. or repository.login(Credentials credentials);
    3. or repository.login(Credentials credentials, String workspaceName);
  3. Use the JCR Session to query or update the repository.
  4. Log out from the current JCR Session.

To give an example of the first step as well: when using Jackrabbit, you can take one of the following approaches, depending on the type of architecture you are using:

a) if you are just investigating JCR you can always do:

    Repository repository = new TransientRepository();

b) for the more advanced deployment models we have presented, usually it is a good idea to retrieve the Repository through a JNDI lookup, something along these lines:

InitialContext context = new InitialContext();
Context environment = (Context) context.lookup("java:comp/env");
Repository repository = (Repository) environment.lookup("jcr/repository");

For more details, I would recommend reading the Apache Jackrabbit documentation, as it describes the prerequisites and how this must be done [2].

Q: When I log in to a Session and it's the only Session, the repository starts and stops - this is really expensive! What do I do?

A: Well, this is a problem only if you are using the TransientRepository in your deployment, because TransientRepository was meant for minimal use. Model 1 deployments might use it, but server-side applications should consider moving to the model 2 deployment (i.e., container-managed) as soon as possible. In any event, a trick that addresses this "problem" would be to use a global, read-only Session, which is left open for the runtime life of the Repository. This can be obtained with the following code:

Repository repository = new TransientRepository(); // see answer on obtaining a repository reference
Session sessionHolder=repository.login();

This will leave the repository open, which will allow other sessions to log in very quickly. In a servlet environment (again, necessary only with a model 1 deployment), you can use a ServletContextListener such as the following in your web application:

package server;
import org.apache.jackrabbit.core.TransientRepository;

import javax.servlet.ServletContextListener;
import javax.servlet.ServletContextEvent;
import javax.jcr.Session;
import javax.jcr.RepositoryException;
import javax.jcr.Repository;
import java.util.logging.Logger;
import java.io.IOException;
/**
* This is a ServletContextListener that establishes a "persistent session"
* for the life of the servlet context, which will keep a repository open
* as long as the web context is active.
*
* <h1>Warning!</h1>
*
* <p>This class is meant to be overridden, if you're using any deployment
* model other than the "model 1" deployment model. This context listener
* is only appropriate <b>if the repository is not container-managed.</b>
*/
public class JCRSessionManager implements ServletContextListener {
    private Session session = null;
    private Logger log = Logger.getLogger(this.getClass().getName());

    /**
     * Simple constructor; public so that the servlet container can instantiate the listener.
     */
    public JCRSessionManager() {
    }

    /**
     * This method obtains the session from the repository in the context startup.
     * @param servletContextEvent ignored.
     */
    public final void contextInitialized(ServletContextEvent servletContextEvent) {
        try {
            session = getSession();
        } catch (RepositoryException e) {
            log.severe("Repository Exception: " + e.getMessage());
        }
    }

    /**
     * This method performs a read-only login to a content repository.
     * @return a read-only Session
     * @throws RepositoryException in the case of a login failure of some kind - which
     * normally indicates a repository misconfiguration.
     */
    private Session getSession() throws RepositoryException {
        Repository repository = getRepository();
        return repository.login();
    }

    /**
     * This method returns a repository instance. It's meant to be overridden, in the case of
     * the deployment models 2 and 3 (where the container manages the repository, or the repository
     * is external.)
     * @return a valid Repository instance
     * @throws RepositoryException in the case that the repository could not be found or opened
     */
    protected Repository getRepository() throws RepositoryException {
        try {
            return new TransientRepository();
        } catch (IOException e) {
            throw new RepositoryException(e);
        }
    }

    /**
     * This method releases the read-only session
     * @param servletContextEvent Provides access to the context on shutdown; ignored.
     */
    public final void contextDestroyed(ServletContextEvent servletContextEvent) {
        // The session is null if the login failed at startup.
        if (session != null) {
            session.logout();
        }
    }
}

Note that this class relies on Jackrabbit being present as shown: it uses the Jackrabbit TransientRepository implementation, which does not exist in other implementations of JCR. Again, this isn't likely to be required in a production environment, where TransientRepository isn't likely to be used.

Q: What happens if a server crashes while containing an open session?

A: Let's try a parallel here: what happens when your RDBMS crashes and you are not using transactions? Your data will be more or less in an unknown state, maybe even inconsistent. What happens when you are using transactions and your RDBMS crashes? Most of the time, your DB will try to make sure that the data is in a consistent state, but as you may know this cannot be 100% guaranteed, and sometimes you may get a heuristic result (for example take a look at the JTA exceptions: HeuristicCommitException, HeuristicMixedException, HeuristicRollbackException). Now, getting back to JCR, transactions are one of the optional features that an implementation may choose to offer. In this case, I would expect pretty much the same behavior. In transactional systems, changes are queued up until the commit occurs, so as long as the transaction is not in the middle of actually being applied by the backend data stores, your hierarchy should remain consistent and all right.

Q: What JCR implementations are there? What should I consider when making a choice about a JCR implementation?

A: As far as I know, at this moment there are quite a few implementations available, covering more or less the whole JSR-170 spec. I would mention Jackrabbit, Alfresco, eXo JCR, Day's CRX, Percussion's Rhythmyx, etc. More and more vendors in the content/document management market will probably start looking into adding support for JSR-170 to their solutions. Many of them already list JSR-170 compatibility in their feature sets.

I would say that the process of choosing a JCR implementation can be reduced to picking the features from the specification that you intend to use and matching them against the existing implementations.

Q: Is JCR really portable across implementations?

A: JSR-170 is a specification that defines a data model abstracting the persistence storage access and the API to handle the data. With this in mind, if your application is coded according to the spec and doesn't use any implementation-specific APIs, then yes, it should run on all implementations that provide the features it uses. Another interesting aspect of portability is the fact that JSR-170 defines two compliance levels, level 1 (read-only repository) and level 2 (read/write repository), plus a set of optional features (transactions, versioning, locking, observation, SQL querying) [1]. Unfortunately, there are things that were not addressed or completely clarified by the spec (like accessing a Repository instance, custom node type registration, etc.), so to fully answer the question: application portability depends on correct API usage and on the set of features your application relies on.
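Matching required features against an implementation can be automated with repository descriptors. The sketch below is stdlib-only so that it runs without a JCR jar: `FeatureCheck` and `missingFeatures` are hypothetical names, and the descriptor lookup is passed in as a function. With a real repository you would presumably pass `repository::getDescriptor`; the key strings used here are what I believe the JCR 1.0 `Repository` constants (such as `LEVEL_2_SUPPORTED` and `OPTION_VERSIONING_SUPPORTED`) contain, so check them against your javax.jcr version before relying on them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Compares the features an application needs with what a repository reports (hypothetical helper). */
final class FeatureCheck {

    /** Returns the required descriptor keys that the lookup does not report as "true". */
    static List<String> missingFeatures(Function<String, String> descriptors, List<String> required) {
        List<String> missing = new ArrayList<>();
        for (String key : required) {
            if (!"true".equals(descriptors.apply(key))) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Stand-in for a repository's descriptors; the values are made up for the demo.
        Map<String, String> fakeDescriptors = new HashMap<>();
        fakeDescriptors.put("level.1.supported", "true");
        fakeDescriptors.put("level.2.supported", "true");
        fakeDescriptors.put("option.versioning.supported", "false");

        List<String> required = Arrays.asList(
                "level.2.supported",
                "option.versioning.supported",
                "option.locking.supported");

        System.out.println("missing: " + missingFeatures(fakeDescriptors::get, required));
    }
}
```

Running such a check at startup turns a portability problem into an early, explicit failure instead of an unsupported-operation error deep inside the application.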

Q: Looking at Jackrabbit, what differentiates the various "file system" implementations? Why would I choose one over another?

A: As we already mentioned, the JCR API abstracts away the real physical persistence storage. Jackrabbit offers different solutions ranging from flat files to XML, relational databases, or even non-relational embeddable storage solutions such as BerkeleyDB (as mentioned earlier).

The answer to why you would choose one over the others is pretty complex, and the only good answer will come from analyzing your requirements and testing what you need to do.

Q: How does JCR look at the actual repository? Is it actually XML? Is that any better than a simple XML repository?

A: JCR defines a data model around the repository. This repository model defines the way your data is structured and identified. A repository consists of one or more workspaces, each of which contains a tree of items. Here is a diagram of this model:

As you can see, the workspace looks pretty much like a Unix file system structure. One thing that needs to be pointed out is that Nodes have Properties, but Nodes have no data associated with them in and of themselves.

Speaking of XML, the specification does indeed describe two XML-like views of the data: the system view and the document view. These mappings are quite important when thinking about querying your data or when considering import/export facilities. According to the specification:

The system view mapping provides a complete serialization of workspace content to XML without loss of information. In level 1, this allows the complete content of a workspace to be exported. In level 2, this also allows for round tripping of content to XML and back again through export and import.

The document view is a human-readable version of the system view.

In level 1 the document view is used as the format for the virtual XML stream against which an XPath query is run. As well, in level 1, export to document view format is supported. In level 2, document view also allows for the import of arbitrary XML.

Q: The JCR implementation docs suggest that a deeper tree structure is better than a wide tree. How true is that? What's the best way to guarantee that?

A: I think the theoretical answer to this question comes from the days of hierarchical databases. In addition, using the parallel with file systems, I think everybody knows that the performance of scanning deeper folder structures is much superior to scanning flat but wide folder structures. From the Jackrabbit implementation details, this is once again true: the structure of a parent node and its children is stored together: the wider it is, the slower the access will be. As far as guaranteeing that...

Q: What's the best way to organize your data? For example, if I have a node that has some data that should keep track of versions, it makes sense to do something like this:

Node myNode=parent.addNode("myNodeName");
myNode.setProperty("prop1", "value1");

Node tempNode=myNode.addNode("data1");
tempNode.addMixin("mix:versionable");
tempNode.setProperty("value", "data1value");

tempNode=myNode.addNode("data2");
tempNode.addMixin("mix:versionable");
tempNode.setProperty("value", "data2value");

Is this the "right way" to do something like this?

A: Well, it would work. However, it might be easier to create another node type, creating snapshots of the versioned data, rather than storing the versioned data in their own nodes and properties. In other words, you'd create a "data" node under "myNodeName," create properties under the data node such as "data1value" and "data2value," and version the "data" node instead of creating versions for every child node under "myNodeName."

Q: How would one search for "data2value" given the above structure?

A: We have already drawn this parallel a couple of times in this article, and it will help us once again to answer the question: the JCR model resembles a Unix file system. Therefore, accessing data is like navigating a file system based on absolute or relative paths. That's exactly what the JCR API offers: node navigation using relative paths, and direct access using absolute paths.

The direct access can be written:

Property data2Property = (Property) session.getItem("/myNodeName/data2/value");

or by navigating from the root node:

Node rootNode = session.getRootNode();
Node data2Node = rootNode.getNode("myNodeName/data2");
Property data2Property = data2Node.getProperty("value");

Q: How do I find nodes in a repository by (A) name, (B) attribute, (C) version, or (D) unique identifier?

A: You can always navigate the workspace tree and filter the traversed nodes according to different criteria. Also, as part of the Level 1 compliance features, you can use XPath queries to find nodes. Although JCR queries support only a subset of the XPath spec, that subset is powerful enough for most needs.

Examples:

//searched_node_name: find all nodes having the specified name
/jcr:root/some/additional/path//searched_node_name: find all nodes under /some/additional/path having the specified name
//*[@searched_property]: find all nodes having a property with the given name
//*[@property='value']: find all nodes having a property with the specified value
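
When the node name or property value comes from user input, a small helper can be safer than ad hoc string concatenation. The sketch below is a hypothetical utility (XPathQueries and its method names are my own, not part of JCR) that produces the four query shapes listed above. It escapes single quotes in values by doubling them, which is how XPath 2.0 string literals escape them. It does not validate or encode names, so it assumes they are already legal JCR names.

```java
/** Hypothetical helper (not part of the JCR API) for building XPath statements. */
final class XPathQueries {

    private XPathQueries() {
    }

    /** Wraps a value in single quotes, doubling any embedded single quote. */
    static String quote(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    /** All nodes with the given name, anywhere in the workspace. */
    static String byName(String nodeName) {
        return "//" + nodeName;
    }

    /** All nodes with the given name below an absolute path such as "/some/path". */
    static String byNameUnder(String absolutePath, String nodeName) {
        return "/jcr:root" + absolutePath + "//" + nodeName;
    }

    /** All nodes that have a property with the given name. */
    static String havingProperty(String propertyName) {
        return "//*[@" + propertyName + "]";
    }

    /** All nodes whose property equals the given value. */
    static String byPropertyValue(String propertyName, String value) {
        return "//*[@" + propertyName + "=" + quote(value) + "]";
    }
}
```

The resulting string would then be passed to QueryManager.createQuery(statement, Query.XPATH), with the QueryManager obtained from session.getWorkspace().getQueryManager().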

Querying for versions is a bit different, as versioning is an optional feature. According to the specification, all versioning information is exposed in each workspace under a special path: /jcr:system/jcr:versionStorage. Here are a couple of examples:

//element(*, nt:frozenNode)[@my:property = 'data' and @jcr:frozenUuid = '<the uuid of original_node>']

The UUID of the original node is needed because the current path of a node in a workspace does not identify the node in the version storage. A version of a node can be checked out to any place in a workspace.

Finding a labeled version of the node with a known label:

//element(*, nt:versionLabels)/jcr:deref(@labelIKnow, '*')/*[@jcr:frozenUuid = '<the uuid of original_node>']

The query dereferences all nt:versionLabels nodes with a @labelIKnow reference property in the version storage. The targets of the references are nt:version nodes, and their children are the nt:frozenNodes with the versioned properties.

Other searches:

//*[jcr:contains(@data, 'foo bar')]

This query retrieves nodes that have a "data" property that contains "foo" and "bar" in their text.

/jcr:root/wiki/*/*[@published='true']

This is modeled on a wiki that stores the actual entries two levels down in the tree (i.e., "foo" would map to "/wiki/f/foo", "bar" would map to "/wiki/b/bar"). This query would retrieve all of the wiki entries that have a "published" property with a value of "true."
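
The two-level layout described here can be computed from the entry name. As a minimal sketch, assuming entries are bucketed by the lowercased first character of their name (the text only shows the "foo" and "bar" cases, so the lowercasing and the class name WikiPaths are my assumptions):

```java
import java.util.Locale;

/** Hypothetical helper: maps a wiki entry name to its bucketed repository path. */
final class WikiPaths {

    private WikiPaths() {
    }

    static String entryPath(String entryName) {
        if (entryName == null || entryName.isEmpty()) {
            throw new IllegalArgumentException("Entry name must not be empty");
        }
        // Take the first code point (not just the first char) as the bucket name.
        int end = entryName.offsetByCodePoints(0, 1);
        String bucket = entryName.substring(0, end).toLowerCase(Locale.ROOT);
        return "/wiki/" + bucket + "/" + entryName;
    }
}
```

Bucketing like this keeps the child list of each parent node small, which is in line with the earlier advice that deep structures perform better than flat, wide ones.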

You can also use SQL to retrieve nodes, if your JCR repository supports the optional SQL query feature (as JackRabbit does). Assuming the same wiki structure is used:

select * from nt:unstructured where jcr:path like '/wiki/%/%' order by createdate

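This query returns the nodes under /wiki, ordered by the "createdate" property's value. As written, it differs from the prior XPath query in two ways: it does not test the "published" property (adding "and published = 'true'" to the where clause would do that), and because % also matches "/", the pattern matches nodes deeper than two levels as well.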

You can also search by UUID, if your node has been marked with the "mix:referenceable" mixin:

select * from nt:unstructured where jcr:uuid='[uuid value]'
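
Since the UUID is interpolated into the statement, it may be worth sanity-checking it first. The sketch below is a hypothetical helper (UuidQueries is not part of JCR) that assumes the repository uses standard textual UUIDs and builds the SQL statement shown above, normalized to the canonical lowercase form. java.util.UUID.fromString is only a light check: it rejects clearly malformed input but is lenient about some abbreviated forms.

```java
import java.util.UUID;

/** Hypothetical helper (not part of the JCR API) for UUID lookups via SQL. */
final class UuidQueries {

    private UuidQueries() {
    }

    static String byUuid(String uuid) {
        // Throws IllegalArgumentException if the input is not a textual UUID,
        // which also keeps quote characters out of the statement.
        UUID parsed = UUID.fromString(uuid.trim());
        return "select * from nt:unstructured where jcr:uuid='" + parsed + "'";
    }
}
```

If a query is not needed, JCR 1.0 also offers a direct lookup through Session.getNodeByUUID(String).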

Q: If I wanted to say that "myNodeName" from the earlier question was categorized in a taxonomy, is there a way I can add a reference to it to a top-level taxonomy node?

A: Referenceability and referential integrity are another interesting (but again optional) feature of JSR-170. Once a node is mix:referenceable, you can link to it from any other node, and the system will guarantee referential integrity.

Now, in a bit more detail: every node in the JCR has a primary type, which describes what you can store in that node. Additionally, you can extend that node type description by using a mixin: a set of definitions and constraints that augments the node definition. There are a couple of predefined mixins described by the optional JCR features: mix:referenceable (allows references to the node to be created), mix:versionable (allows the node to be versioned), and mix:lockable (allows the node to be locked). So, there are two things you must do in order to use node references: make sure that the target node is mix:referenceable, and create a REFERENCE property in your node pointing to the target node.

Your next question may be: what if I want to link my node to multiple nodes? JCR has an answer for this question too: properties of a node can be single-valued or multi-valued. To link your node to multiple target nodes, use a multi-valued REFERENCE property and you are done.

If you compare referenceability in JCR and in an RDBMS, you will notice how much more easily the JCR model can be used: you don't have to add columns to your tables or create new tables, you are not bound to a specific column for a relationship, you do not face constraint issues, etc.

Q: Why would I use SQL vs. XPath queries?

A: I think this is pretty much a matter of taste and of the existing expertise in your IT shop. If your developers are familiar with SQL, then it probably makes more sense to use it. But we should emphasize that SQL query support is an optional feature of JCR implementations, so when making a decision you should make sure that the JCR implementation you are going to use supports it. Also, if your application needs to work with different JCR implementations, you may have a problem later when moving to one that does not support SQL querying.

Q: How do I do a full-text search with JCR?

A: The JSR-170 specification requires support for XPath-like querying. This also includes support for full-text search through XPath extension functions such as jcr:like and jcr:contains. The semantics of these functions are quite complex, and I would recommend that everybody read the relevant specification chapters. But, as a quick example, let's see how we can use these two functions to retrieve the nodes whose value property (@value) contains the string "data":

//*[jcr:like(@value, '%data%') or jcr:contains(@value, 'data')]

SQL queries are mentioned also in the JSR-170 specification, under the optional compliance features. If your implementation supports SQL queries, then you will be able to use the corresponding LIKE and CONTAINS predicates.
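
To make the two XPath functions concrete, here is a small sketch that assembles the combined predicate for an arbitrary property and search term. FullTextQueries and its method are hypothetical names, not part of JCR. The helper doubles single quotes, as XPath string literals require, but it assumes the term contains no % or _ characters (which jcr:like treats as wildcards) and none of the special characters of the jcr:contains search expression syntax.

```java
/** Hypothetical helper (not part of the JCR API) for a combined text predicate. */
final class FullTextQueries {

    private FullTextQueries() {
    }

    /** Matches nodes whose property is LIKE %term% or contains the term. */
    static String likeOrContains(String propertyName, String term) {
        // Double single quotes so the term stays inside the string literal.
        String escaped = term.replace("'", "''");
        return "//*[jcr:like(@" + propertyName + ", '%" + escaped + "%')"
                + " or jcr:contains(@" + propertyName + ", '" + escaped + "')]";
    }
}
```

Calling FullTextQueries.likeOrContains("value", "data") yields a statement that can be run as an XPath query through the QueryManager.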

Q: Workspaces! What is their purpose? How should I use them, and how do I access a specific workspace in a repository?

A: This is a very legitimate question because, as far as I know, the specification does not provide a clear definition of what a workspace is. In most cases, I think a single workspace is enough, but there may be cases where a clean separation of the stored data makes sense. It is worth identifying some of the pros of each usage scenario.

Pros for a single workspace:

  • It is easier to use, because you will not have to manage multiple workspace logins, or multiple sessions.
  • You can use node references (this is not going to work with multiple workspaces, as references are only available within the same workspace)
  • In case the JSR-170 implementation you are using is Jackrabbit, a single workspace proves to be less resource intensive, because it will use a single backend store (e.g. database) and it will use fewer file handles, as it requires only two indexes (one for the workspace and one for the jcr:system tree)

Pros for using multiple workspaces:

  • The first benefit I see is to have a clean separation of the stored data. You may think of this as in different DB schemas.
  • Another important aspect of multiple workspaces (at least when considering Jackrabbit) is that you can use a specific backend storage per workspace (by specifying a persistence manager implementation), tailored to the usage of that workspace.

I think there are a couple more: better concurrency when you have many write operations that can be distributed across multiple workspaces, and increased cache efficiency (each workspace has its own cache).

Connecting and using a specific workspace is an easy task: the Repository API offers two methods for doing this:

login(Credentials credentials, String workspaceName) and login(String workspaceName).

In Closing...

The Java Content Repository is a complex specification, but it does an excellent job of abstracting away the details of managing content. Assuming your repository complies with the entire specification, you can version your data, lock it, store any number of attributes (with a large number of data types), validate your structure (with custom node types), and query it with both XPath and SQL, depending on which is more comfortable to you. JackRabbit, as the reference implementation, is remarkably capable for simple deployments, both in capabilities and in scalability (in one known example, a JCR-backed application successfully survived slashdotting).

Hopefully, this set of questions and answers addresses the most common issues people discover with JCR after first stepping into it, such as how to add data to a repository, how to speed up access to a repository, how to query it, along with some suggestions as to how to organize it. This is meant to be a "second step" in using JCR, to cover the simple things that the specifications and the other "first steps" (see "Resources") don't cover. Any further questions would be welcomed.

Resources

  1. Introducing the Java Content Repository API, http://www-128.ibm.com/developerworks/java/library/j-jcr/
  2. Catch Jackrabbit and the Java Content Repository API, http://www.artima.com/lejava/articles/contentrepository.html
  3. What is Java Content Repository, http://www.onjava.com/pub/a/onjava/2006/10/04/what-is-java-content-repository.html
  4. JSR-170: What's in it for me?, http://www.cmswatch.com/Feature/123

[1] See JSR-170: 4.2. Compliance levels
[2] http://jackrabbit.apache.org/doc/deploy.html

About the Authors

Alexandru Popescu is Chief Architect and co-founder of InfoQ.com. He is involved in many open source initiatives and bleeding-edge technologies (AOP, testing, web, etc.), being co-founder of the TestNG Framework and a committer on the WebWork and Magnolia projects. Alexandru formerly was one of three committers on the AspectWerkz project before it merged with AspectJ. He also publishes a blog on tech topics at http://themindstorms.blogspot.com/.

Joseph Ottinger is the editor-in-chief of TheServerSide.com, and has extensive experience in server-side technologies ranging from Perl to Cold Fusion to C/C++ and Java. Joe still doesn't have a MacBook Pro.


posted on 2008-07-22 11:53 by gembin, category: JSR

