The Java Content Repository specification (JSR-170) focuses on
"content services," which not only manage data but also offer
author-based versioning, full-text search, fine-grained access control,
content categorization, and content event monitoring. Programmers use
a repository much as a JDBC connection accesses a database:
programmers obtain a connection to a repository, open a
session, use the session to access a set of data, and then close the
session. The JCR specification has multiple levels of compliance; the
simplest level offers read-only access to a repository, XPath-like
queries, and some other elements, while other levels of the
specification offer a SQL-like query syntax, write capabilities, and
more advanced features.
Apache offers the reference implementation for JCR, called
"Jackrabbit." As the reference implementation, Jackrabbit offers all of
the mandatory elements of all of the JCR levels, and while it does not
have built-in fine-grained access control, adding this feature is
certainly possible.
In addition to the specification itself, there are three primary
online articles that introduce JCR to developers (see Resources, at the
end of this article, for URLs). However, each of these offers only
introductory material, and those looking to use the specification in
anger are left without further guidance. This article is meant to offer
a similar "short introduction" to JCR, and then to delve into some
frequently asked questions for which online answers are not easy to
find, but should be.
JCR is an abstraction of a content repository. The actual repository
might be implemented on top of a file system, on a database, or on any
other storage mechanism that might be appropriate. Accessing the
repository can be done directly (through a local repository contained
in an application), through a resource (where a J2EE container manages
the repository and access is provided via JNDI), or over the network
via RMI or HTTP. Each of these deployment models has strengths and
weaknesses, and a specific deployment should evaluate which of the
models is appropriate. Most server-side applications will use the
second or third model, but your mileage may vary. In any event, the
only programmatic difference between the models is how the initial
Repository reference is obtained.
There are four basic operations a repository provides: reads,
writes, queries, and deletions. Let's walk through each of these, just
to make sure you're up to speed on basic access to a repository.
For the simple example, we'll use a transient repository, which is
not meant for production use but will illustrate the basic concepts.
Setting up the environment is a matter of making sure you have JavaDB
(i.e., Derby) and Jackrabbit in your classpath. After you've set up
your environment, we'll create a base class that provides common
functionality:
package jcr;

import org.apache.jackrabbit.core.TransientRepository;

import javax.jcr.*;
import java.io.IOException;

public abstract class Base {
    // Creates a Jackrabbit TransientRepository using its default configuration.
    public Repository getRepository() throws IOException {
        return new TransientRepository();
    }

    // Logs in without credentials; used wherever we only need to read.
    public Session getReadonlySession(Repository repository) throws RepositoryException {
        return repository.login();
    }

    // Logs in with simple credentials so that the session can write.
    public Session getSession(Repository repository) throws RepositoryException {
        return repository.login(new SimpleCredentials("username", "password".toCharArray()));
    }

    public void logout(Session session) {
        session.logout();
    }
}
Our next task is to read data from the Repository from a known
position. To understand what's going on, though, a simple explanation
of how a Repository stores data is necessary (and this is covered in
more detail later in this document as well.)
A Repository is a hierarchical store, much like a file system or an
XML document. (This does not mean it uses XML or the file system, but
only that these are analogues to how a Repository is represented.)
Therefore, there is a "root node" at the "top" of the Repository, and
the root node has child nodes, each of which can contain other child
nodes or properties, where the properties contain actual data. This is
a very simple explanation, and it leaves out the concepts of access
control, versioning, workspaces, and a few other ideas, but this will
be enough to get started.
We're going to create a simple structure later, where the node name
will be "/foo", with one property, "bar". Our first task is to show if
the "/foo" node exists, and if it does, show the value of "bar." For
each element we look for, in this case, we'll trap the exception and
show the result. To do this, we get the repository, open a read-only
session, get the root node, and then look for "foo" -- a child of the
root node -- and then the property "bar", with simple exception
handling showing our results. We then close the session. Note that
we're not exactly trapping exceptions properly in this sample code; in
real-world code, you'd want to make sure all sessions were properly
closed even in the case of repository exceptions (much as you'd make
sure you've closed JDBC Connections).
package jcr;

import javax.jcr.*;
import java.io.IOException;

public class ReadData extends Base {
    public ReadData() {
    }

    public static void main(String[] args) throws IOException, RepositoryException {
        ReadData readdata = new ReadData();
        readdata.run();
    }

    private void run() throws IOException, RepositoryException {
        Repository repository = getRepository();
        // Reading needs no write access, so a read-only session is enough.
        Session session = getReadonlySession(repository);
        Node rootnode = session.getRootNode();
        Node childnode = null;
        try {
            // Look for /foo, a child of the root node.
            childnode = rootnode.getNode("foo");
            try {
                Property prop = childnode.getProperty("bar");
                System.out.println("value of /foo@bar: " + prop.getString());
            } catch (PathNotFoundException pnfe) {
                System.out.println("/foo@bar not found.");
            }
        } catch (PathNotFoundException pnfe) {
            System.out.println("/foo not found.");
        }
        logout(session);
    }
}
This class will show "/foo not found" after initializing the repository, because we haven't stored anything in it.
Now, to store data in our Repository, we follow a pattern not too
far removed from the ReadData process: we need to get the Repository,
get a Session that can write to the Repository, get the root node, see
if the child node exists, add a child node, add data to that child
node, save the session data if it has been changed, and then log out of
the Session. Again, we are ignoring most error handling for the purpose
of code clarity. Note that the save() call is critical; all changes
to the repository are transient until they have been explicitly saved.
package jcr;

import javax.jcr.*;
import java.io.IOException;

public class StoreData extends Base {
    public StoreData() {
    }

    public static void main(String[] args) throws IOException, RepositoryException {
        StoreData sd = new StoreData();
        sd.run();
    }

    private void run() throws IOException, RepositoryException {
        Repository repository = getRepository();
        Session session = getSession(repository);
        Node rootnode = session.getRootNode();
        Node childnode = null;
        try {
            // If /foo already exists, there is nothing to do.
            childnode = rootnode.getNode("foo");
        } catch (PathNotFoundException pnfe) {
            // /foo is missing: create it, set its property, and persist the change.
            childnode = rootnode.addNode("foo");
            childnode.setProperty("bar", "this is some data");
            session.save();
        }
        logout(session);
    }
}
Now, if we run this class and then the ReadData class, we'll see
that the value of /foo@bar is "this is some data", just as we'd expect.
Our next task is to clear out our data. Removing a node is as simple
as getting the session (as we've done in StoreData), finding the node
we want to remove, telling it to remove itself, and then saving the
session. Here's the code:
package jcr;

import javax.jcr.*;
import java.io.IOException;

public class RemoveData extends Base {
    public RemoveData() {
    }

    public static void main(String[] args) throws IOException, RepositoryException {
        RemoveData rd = new RemoveData();
        rd.run();
    }

    private void run() throws IOException, RepositoryException {
        Repository repository = getRepository();
        Session session = getSession(repository);
        Node rootnode = session.getRootNode();
        Node childnode = null;
        try {
            childnode = rootnode.getNode("foo");
            // The removal is transient until the session is saved.
            childnode.remove();
            session.save();
        } catch (PathNotFoundException pnfe) {
            System.out.println("/foo not found; not removed.");
        }
        logout(session);
    }
}
We can now load our data, check to see if it's there, and remove it
through these three classes. Now we're off to query our data!
Querying isn't much different, in theory, from reading or writing
data, except that you access queries through a QueryManager, which is
obtained through a workspace. A Query returns a QueryResult, which can
return an iterator for Nodes or Rows (depending on how you want to
access the data, and what your implementation supports.) In addition,
the ways you can query data are impressive, so the basic structure of
this class will change a little.
We'll create a console-driven application with which you can run
your own queries against a repository, and show you some queries that
should (and should not) return data.
package jcr;

import javax.jcr.*;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;
import javax.jcr.query.QueryResult;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class CommandLineQuery extends Base {
    public CommandLineQuery() {
    }

    public static void main(String[] args) throws IOException, RepositoryException {
        CommandLineQuery clq = new CommandLineQuery();
        clq.run();
    }

    private void run() throws IOException, RepositoryException {
        Repository repository = getRepository();
        Session session = getReadonlySession(repository);
        Workspace workspace = session.getWorkspace();
        QueryManager qm = workspace.getQueryManager();
        BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
        for (;;) {
            System.out.print("JCRQL> ");
            String queryString = reader.readLine();
            // readLine() returns null at end of input; treat that like "quit".
            if (queryString == null || queryString.equals("quit")) {
                break;
            }
            // Skip blank lines and comment lines.
            if (queryString.length() == 0 || queryString.startsWith("#")) {
                continue;
            }
            int resultCounter = 0;
            try {
                Query query = qm.createQuery(queryString, Query.XPATH);
                QueryResult queryResult = query.execute();
                NodeIterator nodeIterator = queryResult.getNodes();
                while (nodeIterator.hasNext()) {
                    Node node = nodeIterator.nextNode();
                    dump(node);
                    resultCounter++;
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
            System.out.println("result count: " + resultCounter);
        }
        logout(session);
    }

    // Prints a node as name[path,@property="value",...].
    // Only single-valued properties are handled: getString() fails on
    // multi-valued properties.
    private void dump(Node node) throws RepositoryException {
        StringBuilder sb = new StringBuilder();
        String sep = ",";
        sb.append(node.getName());
        sb.append("[" + node.getPath());
        PropertyIterator propIterator = node.getProperties();
        while (propIterator.hasNext()) {
            Property prop = propIterator.nextProperty();
            sb.append(sep);
            sb.append("@" + prop.getName() + "=\"" + prop.getString() + "\"");
        }
        sb.append("]");
        System.out.println(sb.toString());
    }
}
This class isn't perfect by any means. In particular, the dump(Node)
method is incapable of treating properties correctly, so if you plan to
use this class on "real data," you should expect to modify this method
extensively to support properties that contain multiple values,
properties that contain different value types, and other conditions.
Here is some sample input and output based on the Repository containing one node, from the StoreData class above:
JCRQL> # Dump root's child node "foo"
JCRQL> //foo
foo[/foo,@bar="this is some data",@jcr:primaryType="nt:unstructured"]
result count: 1
JCRQL> # Dump all /foo nodes where the property "bar" has the value "this is some data"
JCRQL> //foo[@bar='this is some data']
foo[/foo,@bar="this is some data",@jcr:primaryType="nt:unstructured"]
result count: 1
JCRQL> # dump all nodes whose 'bar' property contains 'data'
JCRQL> //*[jcr:contains(@bar, 'data')]
foo[/foo,@bar="this is some data",@jcr:primaryType="nt:unstructured"]
result count: 1
JCRQL> # dump all nodes whose 'bar' property contains 'daata' -- note extra a
JCRQL> //*[jcr:contains(@bar, 'daata')]
result count: 0
JCRQL> # dump all nodes whose 'bar' property is "like" "thi%" -- much as a SQL "like" comparison
JCRQL> //*[jcr:like(@bar, 'thi%')]
foo[/foo,@bar="this is some data",@jcr:primaryType="nt:unstructured"]
result count: 1
JCRQL> # note that jcr:like is a full wildcard, so this would only find nodes with
JCRQL> # property 'bar' that *start* *with* "dat"
JCRQL> //*[jcr:like(@bar, 'dat%')]
result count: 0
JCRQL> # meanwhile, this query will find any node whose property 'bar' contains 'dat'
JCRQL> //*[jcr:like(@bar, '%dat%')]
foo[/foo,@bar="this is some data",@jcr:primaryType="nt:unstructured"]
result count: 1
JCRQL> # note that jcr:contains matches whole words, so this is not the same as the previous query
JCRQL> //*[jcr:contains(@bar, 'dat')]
result count: 0
This is obviously only a simple introduction to queries; see the JCR
specification, section 6.6, for more detail on the XPath-like query
syntax as well as a discussion of the SQL-like query syntax and how
XPath and SQL map to each other in the QueryManager.
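The jcr:like behavior seen in the sample session can be mimicked outside a repository, which may help when reasoning about why 'dat%' finds nothing while '%dat%' finds the node. The following is only a rough sketch, not how any JCR implementation evaluates queries: the class name LikeDemo and its methods are made up for this illustration, and it assumes just two wildcards ('%' for any run of characters, '_' for exactly one character) while ignoring escape characters.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of JCR or Jackrabbit: it mimics LIKE-style
// matching with plain java.util.regex so the wildcard rules can be tried out.
final class LikeDemo {

    // Translates a LIKE-style pattern into a regular expression.
    // Assumption: '%' matches any run of characters (including none),
    // '_' matches exactly one character, everything else is literal.
    public static Pattern likeToRegex(String likePattern) {
        StringBuilder regex = new StringBuilder();
        for (char c : likePattern.toCharArray()) {
            if (c == '%') {
                regex.append(".*");
            } else if (c == '_') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    // True if the whole value matches the LIKE-style pattern.
    public static boolean like(String value, String likePattern) {
        return likeToRegex(likePattern).matcher(value).matches();
    }

    public static void main(String[] args) {
        String bar = "this is some data";
        System.out.println(like(bar, "thi%"));   // true: the value starts with "thi"
        System.out.println(like(bar, "dat%"));   // false: the value does not start with "dat"
        System.out.println(like(bar, "%dat%"));  // true: "dat" appears somewhere in the value
    }
}
```

The pattern has to match the entire property value, which is why a leading '%' is needed to find text in the middle of a string.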
Now you should have a raw starting look at JCR and how you might use
it. However, this introduction doesn't really explain much about JCR
besides the basic "how-to" details. Most people reach this point and
stop, because information on using JCR "in anger" is hard to find.
Here are some of the common questions new JCR users run into almost
immediately, along with some answers from people who have used the
specification in real applications.
Q: Why would I really use JCR? Is it better than a file system or a database?
A: I think this is one of the most asked questions in the Java
Content Repository community. Before answering it, we should point out
what JCR is and what the original rationale behind it was; based on
these, we will be able to identify a couple of possible usages.
The parallel with a file system seems appropriate, but what JCR
really does is offer a stable, unified, even transactional (*) API to
work with hierarchical data, with the capability to define constraints
on the stored information. It promotes the hierarchy into the interface
so the actual medium does not have to be a file system at all. It can
be a file system, of course, but it can also be a relational database,
a Berkeley database, or any storage that can store hierarchical data in
some way.
When should you use hierarchical storage? A hierarchy, or tree, is
used to provide classification. In a set, a tree might represent
collating order, for example. For more complex data, hierarchy
indicates what domain the data exists in. Much as in XML, "address"
might mean different things if the parent node is "person" (in which
case "address" might mean "Mr." or "Mrs.") or if the parent node is
"location" (in which case the address probably has street address, city,
country, and postal code data). In JCR, content can easily reference other content,
providing access and reference across the hierarchy. This sounds very
much like the use cases for XML, especially when JCR's naming and query
conventions are examined; it shouldn't surprise you that JCR abstracts
an XML view over the backend storage.
Q: You said the persistent storage does not have to be a file system. Why does that matter? What does that get me?
A: The JSR-170 specification does not talk about the actual
persistent storage, as this is a detail for the implementation.
Discussing the pros and cons of different storage solutions is well
beyond the scope of this article. However, I think that from an
implementation perspective the storage type will have an important
influence on how easily other JCR features are supported (e.g., for
file system-based solutions, support for transactions is difficult, but
support for versioning is easy; for relational data stores,
transactions are easy, but versioning will be more complex). What I
want to emphasize here is that an application written against the JCR
API does not depend on the backing store: it will run on any compliant
implementation, whatever persistent storage is used.
Q: Okay, so I have chosen to use JCR. Where should I deploy it -
as part of my web application, as part of my container? If it is part
of my container, how do I make sure it is shut down properly?
A: This is a very generic question, so my answer would be, as
always: Use it the way it best fits your application. If you take a
look at Jackrabbit, which is one of the most complete open source
implementations of the JCR (JSR-170 spec), you will find out that you
have various options: embedded with your application, deployed as a
shared J2EE container resource or even deployed as a standalone server.
How do you decide which of the above approaches to take? Since every
installation and problem domain is unique, you should analyze the
lifecycle of your application versus the lifecycle of the JCR
repository, analyze the performance you need, and then pick the
approach that fits best.
Q: But I do not understand what should factor into my decisions. What are the advantages of each?
A: There are three models for deploying Jackrabbit - other
implementations may differ. Jackrabbit's models are fairly descriptive,
though.
The first model is the simplest: deploying Jackrabbit as part of
your application, just as you would Lucene or commons-dbcp. This means
putting all of the jars into your application classpath.
The second model installs Jackrabbit at the container level, making
it available to all of the container's applications and accessed via
JNDI, just like a database pool. The application only needs access to
the JCR API jar with this model.
The third model runs a server instance of Jackrabbit, where each
client application that needs the repository communicates with the server
over the network (by way of SOAP, DAV, RMI, or any other protocol,
really). The role of JCR in this instance is to abstract the details of
communication away from the programmer.
The models increase in complexity as you go down the list: deploying
JCR as part of the application is the simplest, but also the least
flexible: only that one application is able to connect to the
repository (although if you use a multi-user storage mechanism, you
could configure it for multi-user access) and the Jackrabbit APIs are
replicated for every application.
The second model provides JCR to every application in the container,
because Jackrabbit is set up as a container resource; this provides all
of the transaction capabilities, etc. to the JCR clients. This will
appeal to most programmers because it means they only have to configure
JCR once.
The third model is the most "enterprise-y," in that it moves JCR
into a server separate from the application server, much like a
relational database. It provides the most access from multiple
containers, but also carries with it the requirement for network access
(SOAP, RMI, etc.). This also means that resources for the repository are
not shared with the applications, which is the biggest benefit of the
model: backups and CPU time are dedicated to the repository.
Q: What is the "happy path" using JCR?
A: From an application perspective, working with a Repository is quite easy:
a) Obtain a Repository reference. Unfortunately, this is not covered
by the spec, so you will need to identify how to do this with your
implementation.
b) Obtain a JCR Session by logging in, optionally to a specific workspace:
- repository.login(); // authentication is handled by an external mechanism and you log in to the default workspace
- or repository.login(Credentials credentials);
- or repository.login(Credentials credentials, String workspaceName);
c) Use the JCR Session to query or update the repository.
d) Log out from the current JCR Session.
To provide an example for the first step as well: when using
Jackrabbit, you can take one of the following approaches,
depending on the type of architecture you are using:
a) if you are just investigating JCR you can always do:
Repository repository = new TransientRepository();
b) for the more advanced deployment models we have presented,
usually it is a good idea to retrieve the Repository through a JNDI
lookup, something along these lines:
InitialContext context = new InitialContext();
Context environment = (Context) context.lookup("java:comp/env");
Repository repository = (Repository) environment.lookup("jcr/repository");
For more details I would recommend reading the Apache Jackrabbit
documentation, as it describes the prerequisites and how this must be
done [2].
Q: When I log in to a Session and it's the only Session, the
repository starts and stops - this is really expensive! What do I do?
A: Well, this is a problem only if you are using the
TransientRepository in your deployment, because TransientRepository was
meant for minimal use. Model 1 deployments might use it, but
server-side applications should consider moving to the model 2
deployment (i.e., container-managed) as soon as possible. In any event,
a trick that addresses this "problem" would be to use a global,
read-only Session, which is left open for the runtime life of the
Repository. This can be obtained with the following code:
Repository repository = new TransientRepository(); // see answer on obtaining a repository reference
Session sessionHolder=repository.login();
This will leave the repository open, which will allow other sessions
to log in very quickly. In a servlet environment (again, necessary only
with a model 1 deployment), you can use a ServletContextListener
such as the following in your web application:
package server;

import org.apache.jackrabbit.core.TransientRepository;

import javax.servlet.ServletContextListener;
import javax.servlet.ServletContextEvent;
import javax.jcr.Session;
import javax.jcr.RepositoryException;
import javax.jcr.Repository;
import java.util.logging.Logger;
import java.io.IOException;

/**
 * This is a ServletContextListener that establishes a "persistent session"
 * for the life of the servlet context, which will keep a repository open
 * as long as the web context is active.
 *
 * <h1>Warning!</h1>
 *
 * <p>This class is meant to be subclassed if you're using any deployment
 * model other than the "model 1" deployment model. This context listener
 * is only appropriate <b>if the repository is not container-managed.</b>
 */
public class JCRSessionManager implements ServletContextListener {
    private Session session = null;
    private Logger log = Logger.getLogger(this.getClass().getName());

    /**
     * Simple constructor. It is public so that the servlet container
     * can instantiate the listener.
     */
    public JCRSessionManager() {
    }

    /**
     * This method obtains the session from the repository in the context startup.
     * @param servletContextEvent ignored.
     */
    public final void contextInitialized(ServletContextEvent servletContextEvent) {
        try {
            session = getSession();
        } catch (RepositoryException e) {
            log.severe("Repository Exception: " + e.getMessage());
        }
    }

    /**
     * This method performs a read-only login to a content repository.
     * @return a read-only Session
     * @throws RepositoryException in the case of a login failure of some kind - which
     * normally indicates a repository misconfiguration.
     */
    private Session getSession() throws RepositoryException {
        Repository repository = getRepository();
        return repository.login();
    }

    /**
     * This method returns a repository instance. It's meant to be overridden, in the case of
     * the deployment models 2 and 3 (where the container manages the repository, or the repository
     * is external.)
     * @return a valid Repository instance
     * @throws RepositoryException in the case that the repository could not be found or opened
     */
    protected Repository getRepository() throws RepositoryException {
        try {
            return new TransientRepository();
        } catch (IOException e) {
            throw new RepositoryException(e);
        }
    }

    /**
     * This method releases the read-only session, if one was obtained at startup.
     * @param servletContextEvent Provides access to the context on shutdown; ignored.
     */
    public final void contextDestroyed(ServletContextEvent servletContextEvent) {
        // The login in contextInitialized may have failed, leaving session null.
        if (session != null) {
            session.logout();
        }
    }
}
Note that this class relies on Jackrabbit being present as shown: it
uses the Jackrabbit TransientRepository implementation, which does not
exist in other implementations of JCR. Again, this isn't likely to be
required in a production environment, where TransientRepository isn't
likely to be used.
Q: What happens if a server crashes while containing an open session?
A: Let's try a parallel here: what happens when your RDBMS crashes
and you are not using transactions? Your data will be more or less in
an unknown state, maybe even inconsistent. What happens when you are
using transactions and your RDBMS crashes? Most of the time, your DB
will try to make sure that the data is in a consistent state, but as
you may know this cannot be 100% guaranteed, and sometimes you may get
a heuristic result (for example take a look at the JTA exceptions:
HeuristicCommitException, HeuristicMixedException,
HeuristicRollbackException). Now, getting back to JCR, transactions are
one of the optional features that an implementation may choose to
offer. In this case, I would expect pretty much the same behavior. In
transactional systems, changes are queued up until the commit occurs,
so as long as the transaction is not in the middle of actually being
applied by the backend data stores, your hierarchy should remain
consistent and all right.
Q: What JCR implementations are there? What should I consider when making a choice about a JCR implementation?
A: As far as I know, at this moment there are quite a few
implementations available, covering more or less the whole JSR-170
spec. I would mention Jackrabbit, Alfresco, eXo JCR, Day's CRX,
Percussion's Rhythmyx, etc. Probably more and more vendors on the
content/document management market will start looking into adding
support for JSR-170 in their solutions. Many of them already list
JSR-170 compatibility in their feature sets.
I would say that the process of choosing a JCR implementation can be
reduced to picking the features from the specification that you
intend to use and matching them against the existing implementations.
Q: Is JCR really portable across implementations?
A: JSR-170 is a specification that defines a data model abstracting
the persistence storage access and the API to handle the data. Having
this in mind, if your application is coded according to the spec, and
doesn't use any implementation specific APIs, then yes, it will be
guaranteed to run on all implementations. Another interesting aspect of
portability is the fact that JSR-170 defines two compliance levels,
level 1 (read-only repository) and level 2 (read/write repository),
plus a set of optional features (transactions, versioning, locking,
observation, SQL querying) [1]. Unfortunately, there are things that were
not addressed or completely clarified by the spec (like accessing a
Repository instance, custom node type registration, etc.). So, to fully
answer the question: application portability depends on correct API
usage and on the set of features your application relies on.
Q: Looking at Jackrabbit, what differentiates the various "file system" implementations? Why would I choose one over another?
A: As we already mentioned, the JCR API abstracts away the real
physical persistence storage. Jackrabbit offers different solutions
ranging from flat files, to XML, relational DB, or even non-relational
embeddable storage solutions such as BerkeleyDB (as mentioned earlier).
The answer to why you would choose one over the others is pretty
complex, and the only good answer will come from analyzing your
requirements and testing what you need to do.
Q: How does JCR look at the actual repository? Is it actually XML? Is that any better than a simple XML repository?
A: JCR defines a data model around the repository. This repository
model defines the way your data is structured and identified. A
repository consists of one or more workspaces, each of which contains
a tree of items. Here is a diagram of this model:
As you can see, the workspace looks pretty much like a Unix file
system structure. One thing that needs to be pointed out is that Nodes
have Properties, but Nodes have no data associated with them in and of
themselves.
As for XML: the specification does describe two
XML-like views of the data, the system view and the document view.
These mappings are quite important when thinking about querying your
data or when considering import/export facilities. According to the
specification:
The system view mapping provides a complete serialization
of workspace content to XML without loss of information. In level 1,
this allows the complete content of a workspace to be exported. In
level 2, this also allows for round tripping of content to XML and back
again through export and import.
The document view is a human-readable version of the system view.
In level 1 the document view is used as the format for the
virtual XML stream against which an XPath query is run. As well, in
level 1, export to document view format is supported. In level 2,
document view also allows for the import of arbitrary XML.
Q: The JCR implementation docs suggest that a deeper tree
structure is better than a wide tree. How true is that? What's the best
way to guarantee that?
A: I think the theoretical answer to this question comes from the
days of hierarchical databases. In addition, using the parallel with
file systems, I think everybody knows that the performance of scanning
deeper folder structures is much superior to scanning flat but wide
folder structures. The Jackrabbit implementation details bear this
out: a parent node is stored together with the list of its children, so
the wider the node is, the slower access to it will be. As far
as guaranteeing that...
Q: What's the best way to organize your data? For example, if I
have a node that has some data that should keep track of versions, it
makes sense to do something like this:
Node myNode=parent.addNode("myNodeName");
myNode.setProperty("prop1", "value1");
Node tempNode=myNode.addNode("data1");
tempNode.addMixin("mix:versionable");
tempNode.setProperty("value", "data1value");
tempNode=myNode.addNode("data2");
tempNode.addMixin("mix:versionable");
tempNode.setProperty("value", "data2value");
Is this the "right way" to do something like this?
A: Well, it would work. However, it might be easier to create
another node type, creating snapshots of the versioned data, rather
than storing the versioned data in their own nodes and properties. In
other words, you'd create a "data" node under "myNodeName," create
properties under the data node such as "data1value" and "data2value,"
and version the "data" node instead of creating versions for every
child node under "myNodeName."
Q: How would one search for "data2value" given the above structure?
A: We have already drawn this parallel a couple of times in this article,
and it will help us once again to answer the question: the JCR model
resembles a Unix file system. Therefore, accessing data is like
navigating a file system based on absolute or relative paths. That's
exactly what the JCR API is offering: node navigation using relative
paths, direct access using absolute paths.
The direct access can be written:
Property data2Property = (Property) session.getItem("/myNodeName/data2/value");
or by navigating from the root node:
Node rootNode = session.getRootNode();
Node data2Node = rootNode.getNode("myNodeName/data2");
Property data2Property = data2Node.getProperty("value");
Q: How do I find nodes in a repository by (A) name, (B) attribute, (C) version, or (D) unique identifier?
A: You can always navigate the workspace tree and filter the
traversed nodes according to different criteria. Also, according to the
Level 1 compliance features, you can use XPath queries to find nodes.
Even if the JCR queries support only a subset of the XPath spec, most
of the time it is powerful enough to do whatever you need.
Examples:
//searched_node_name: find all nodes having the specified name
/jcr:root/some/additional/path//searched_node_name: find all nodes under /some/additional/path having the specified name
//*[@searched_property]: find all nodes having a property with the given name
//*[@property='value']: find all nodes having a property with the specified value
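The query strings above can be assembled programmatically. The following is only a minimal sketch: XPathQueries and its methods are hypothetical helpers, not part of the JCR API, and the sketch assumes that a single quote inside a string literal is escaped by doubling it, as in XPath 2.0. It does not validate node or property names. The resulting string would then be handed to the repository's query facility.

```java
import java.util.Objects;

/** Hypothetical helper (not part of the JCR API) that assembles JCR XPath query strings. */
final class XPathQueries {

    private XPathQueries() {
    }

    /** Wraps a value in single quotes, doubling any embedded single quote (assumed escaping rule). */
    static String literal(String value) {
        Objects.requireNonNull(value, "value");
        return "'" + value.replace("'", "''") + "'";
    }

    /** All nodes with the given name, anywhere in the workspace. */
    static String byName(String nodeName) {
        return "//" + nodeName;
    }

    /** All nodes with the given name below an absolute path such as /some/additional/path. */
    static String byNameUnder(String absolutePath, String nodeName) {
        return "/jcr:root" + absolutePath + "//" + nodeName;
    }

    /** All nodes that have a property with the given name. */
    static String havingProperty(String propertyName) {
        return "//*[@" + propertyName + "]";
    }

    /** All nodes whose property has the given string value. */
    static String byPropertyValue(String propertyName, String value) {
        return "//*[@" + propertyName + " = " + literal(value) + "]";
    }

    public static void main(String[] args) {
        System.out.println(byName("searched_node_name"));
        System.out.println(byNameUnder("/some/additional/path", "searched_node_name"));
        System.out.println(havingProperty("searched_property"));
        System.out.println(byPropertyValue("author", "O'Reilly"));
    }
}
```

Keeping the quoting in one place (the literal method) means a value containing a quote cannot accidentally terminate the string literal and change the meaning of the query.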
Querying for versions is a bit different, as versioning is an
optional feature. According to the specification, all versioning
information is exposed in each workspace under a special path:
/jcr:system/jcr:versionStorage. For example, to find a frozen node by
one of its property values and the UUID of the original node:
//element(*, nt:frozenNode)[@my:property = 'data' and @jcr:frozenUuid = '<the uuid of original_node>']
The UUID of the original node is needed because the current path of
a node in a workspace does not identify the node in the version
storage. A version of a node can be checked out to any place in a
workspace.
Finding a labeled version of the node with a known label:
//element(*, nt:versionLabels)/jcr:deref(@labelIKnow, '*')/*[@jcr:frozenUuid = '<the uuid of original_node>']
The query dereferences all nt:versionLabels nodes in the version
storage that have a @labelIKnow reference property. The targets of the
references are nt:version nodes, and their children are the
nt:frozenNodes with the versioned properties.
Other searches:
//*[jcr:contains(@data, 'foo bar')]
This query retrieves nodes that have a "data" property that contains "foo" and "bar" in their text.
/jcr:root/wiki/*/*[@published='true']
This is modeled on a wiki that stores the actual entries two levels
down in the tree (i.e., "foo" would map to "/wiki/f/foo", "bar" would
map to "/wiki/b/bar"). This query would retrieve all of the wiki
entries that have a "published" property with a value of "true."
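The two-level layout described above can be sketched as a small path-mapping function. This is only an illustration: WikiPaths is a hypothetical helper, and the sketch assumes that the intermediate node is simply named after the first character of the entry name, exactly as given (no case folding or other normalization).

```java
/** Hypothetical helper that maps wiki entry names onto the two-level layout described above. */
final class WikiPaths {

    private WikiPaths() {
    }

    /** Path of the intermediate node for an entry, e.g. "foo" maps to "/wiki/f". */
    static String bucketPath(String entryName) {
        if (entryName == null || entryName.isEmpty()) {
            throw new IllegalArgumentException("entryName must not be empty");
        }
        // Assumption: the bucket is the first character of the name, taken as-is.
        return "/wiki/" + entryName.substring(0, 1);
    }

    /** Absolute path of the entry node itself, e.g. "foo" maps to "/wiki/f/foo". */
    static String entryPath(String entryName) {
        return bucketPath(entryName) + "/" + entryName;
    }

    public static void main(String[] args) {
        System.out.println(entryPath("foo"));
        System.out.println(entryPath("bar"));
    }
}
```

A layout like this keeps the number of children under any single node small, at the cost of having to compute the path before each direct lookup.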
You can also use SQL to retrieve nodes, if your JCR repository
supports the optional SQL query feature (as Jackrabbit does). Assuming
the same wiki structure is used:
select * from nt:unstructured where jcr:path like '/wiki/%/%' order by createdate
This query covers the same part of the tree as the prior XPath query, ordered by the "createdate" property's value. Note that, unlike the XPath query, it does not filter on the "published" property, and the '%' wildcard also matches any deeper descendants.
You can also search by UUID, if your node has been marked with the "mix:referenceable" mixin:
select * from nt:unstructured where jcr:uuid='[uuid value]'
Q: If I wanted to say that "myNodeName" from the earlier question
was categorized in a taxonomy, is there a way I can add a reference to
it to a top-level taxonomy node?
A: Referenceability and referential integrity are another interesting
(but, again, optional) feature of JSR-170. Once a node is
mix:referenceable, you can link to it from any other node,
and the repository will guarantee referential integrity.
In a bit more detail: every node in the JCR has a primary type,
a description of what you can store in that node. Additionally, you can
refine that description by using a mixin: a set
of definitions and constraints that augments the node definition. The
optional JCR features define a few predefined mixins:
mix:referenceable (allows references to the node to be created),
mix:versionable (allows the node to be versioned), and mix:lockable
(allows the node to be locked). So, there are two things you must do in
order to use node references: make sure that the target node is
mix:referenceable, and create a REFERENCE property on your own node
that points to the target node.
Your next question may be: what if I want to link my node to
multiple nodes? JCR has an answer for this question too: properties of
a node can be single-valued or multi-valued. To link your node
to multiple target nodes, use a multi-valued REFERENCE
property and you are done.
If you compare references in JCR with relationships in an RDBMS, you
will notice how much more easily the JCR model can be used: you don't
have to add columns to your tables or create new tables, you are not
bound to a specific column for a relationship, you do not run into
constraint issues, and so on.
Q: Why would I use SQL vs. XPath queries?
A: I think this is pretty much a matter of taste and of the
existing expertise in your IT shop. If your developers are familiar
with SQL, then it probably makes more sense to use it. However, we
should emphasize that SQL query support is an optional feature
of JCR implementations, so before deciding you should make sure
that the JCR implementation you are going to use supports it. Also, if
your application needs to work with different JCR implementations,
you may run into problems later when moving to an
implementation that does not support SQL querying.
Q: How do I do a full-text search with JCR?
A: The JSR-170 specification requires support for XPath-like
querying. This also includes support for text matching through the XPath
extension functions jcr:like (pattern matching) and jcr:contains
(full-text search). The semantics of these functions are quite complex,
and I would recommend that everybody read the relevant chapters of the
specification. As a quick example, let's see how
we can use these two functions to retrieve the nodes whose
value property (@value) contains the string "data":
//*[jcr:like(@value, '%data%') or jcr:contains(@value, 'data')]
SQL queries are mentioned also in the JSR-170 specification, under
the optional compliance features. If your implementation supports SQL
queries, then you will be able to use the corresponding LIKE and
CONTAINS predicates.
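As a rough sketch of how such text-search queries might be put together in code: FullTextQueries is a hypothetical helper, not part of the JCR API. It assumes that terms separated by whitespace in a jcr:contains expression must all match, and that single quotes are escaped by doubling them. Characters that are special to either function (such as % and _ in jcr:like) are passed through untouched.

```java
/** Hypothetical helper (not part of the JCR API) for building text-search XPath queries. */
final class FullTextQueries {

    private FullTextQueries() {
    }

    /** Doubles single quotes so the text can sit inside a quoted XPath literal (assumed rule). */
    private static String escape(String text) {
        return text.replace("'", "''");
    }

    /** Nodes whose property contains every one of the given terms (full-text search). */
    static String containsAll(String propertyName, String... terms) {
        // Terms are joined with spaces; whitespace-separated terms are assumed to be ANDed.
        String expression = String.join(" ", terms);
        return "//*[jcr:contains(@" + propertyName + ", '" + escape(expression) + "')]";
    }

    /** Nodes whose property value contains the given substring (pattern matching). */
    static String like(String propertyName, String substring) {
        // Note: % and _ inside the substring would act as wildcards; they are not escaped here.
        return "//*[jcr:like(@" + propertyName + ", '%" + escape(substring) + "%')]";
    }

    public static void main(String[] args) {
        System.out.println(containsAll("data", "foo", "bar"));
        System.out.println(like("value", "data"));
    }
}
```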
Q: Workspaces! What is their purpose? How should I use them, and how do I access a specific workspace in a repository?
A: This is a very legitimate question because, as far as I know,
the specification does not provide a clear definition of what a
workspace is. In most cases, I think a single workspace is enough, but
there may be cases where a clean separation of the stored data
makes sense. It is worth identifying some of
the pros of each usage scenario.
Pros for a single workspace:
- It is easier to use, because you will not have to manage multiple workspace logins, or multiple sessions.
- You can use node references (this is not going to work with
multiple workspaces, as references are only valid within the same
workspace).
- If the JSR-170 implementation you are using is Jackrabbit, a
single workspace is less resource intensive, because it
uses a single backend store (e.g. a database) and fewer file
handles, as it requires only two indexes (one for the workspace and one
for the jcr:system tree).
Pros for using multiple workspaces:
- The first benefit I see is a clean separation of the stored data. You may think of workspaces as being like different DB schemas.
- Another important aspect of multiple workspaces (at least when
considering Jackrabbit) is that you can configure backend storage
per workspace (by specifying a persistence manager implementation)
that is tailored to the usage of that workspace.
I think there are a couple more: better concurrency when you
have many write operations that can be distributed across multiple
workspaces, and increased cache efficiency (each workspace has its own
cache).
Connecting to a specific workspace is easy: the Repository API offers two login methods that take a workspace name:
login(Credentials credentials, String workspaceName) and login(String workspaceName).
In Closing...
The Java Content Repository is a complex specification, but it does
an excellent job of abstracting away the details of managing content.
Assuming your repository complies with the entire specification, you
can version your data, lock it, store any number of attributes (with a
large number of data types), validate your structure (with custom node
types), and query it with either XPath or SQL, depending on which is
more comfortable for you. Jackrabbit, as the reference implementation,
is remarkably capable for simple deployments, both in features and
in scalability (in one known example, a JCR-backed application
survived a slashdotting).
Hopefully, this set of questions and answers addresses the most
common issues people discover with JCR after first stepping into it,
such as how to add data to a repository, how to speed up access to a
repository, how to query it, along with some suggestions as to how to
organize it. This is meant to be a "second step" in using JCR, covering
the simple things that the specification and the other "first steps"
(see "Resources") don't cover. Further questions are welcome.
Resources
- Introducing the Java Content Repository API, http://www-128.ibm.com/developerworks/java/library/j-jcr/
- Catch Jackrabbit and the Java Content Repository API, http://www.artima.com/lejava/articles/contentrepository.html
- What is Java Content Repository, http://www.onjava.com/pub/a/onjava/2006/10/04/what-is-java-content-repository.html
- JSR-170: What's in it for me?, http://www.cmswatch.com/Feature/123
[1] See JSR-170: 4.2. Compliance levels
[2] http://jackrabbit.apache.org/doc/deploy.html
About the Authors
Alexandru Popescu is Chief Architect and co-founder of
InfoQ.com. He is involved in many open source initiatives and
bleeding-edge technologies (AOP, testing, web, etc.), being co-founder
of the TestNG Framework and a committer on the WebWork and Magnolia
projects. Alexandru formerly was one of three committers on the
AspectWerkz project before it merged with AspectJ. He also publishes a
blog on tech topics at http://themindstorms.blogspot.com/.
Joseph Ottinger is the editor-in-chief of TheServerSide.com,
and has extensive experience in server-side technologies ranging from
Perl to Cold Fusion to C/C++ and Java. Joe still doesn't have a MacBook
Pro.