In this article, based on chapter 4 of MongoDB in Action, author Kyle Banker explains how MongoDB schema differs from an equivalent RDBMS schema, and how common relationships between entities, such as one-to-many and many-to-many, are replicated in MongoDB.
Author: Kyle Banker, http://www.manning.com/banker/
This article is based on MongoDB in Action, to be published November 2011. It is being reproduced here by permission from Manning Publications. Manning early access books and ebooks are sold exclusively through Manning. Visit the book’s page for more information.
This article takes a closer look at the document-oriented data model and at how data is organized at the database, collection, and document level in MongoDB. We start with a brief, general discussion of schema design. This is helpful because a large number of MongoDB users have never designed schemas outside the realm of a traditional relational database management system (RDBMS). This exploration of principles helps set the stage for the second part of the article, where we examine the design of an e-commerce schema in MongoDB. Along the way, I’ll explain how this schema differs from an equivalent RDBMS schema, and we’ll learn how common relationships between entities, such as one-to-many and many-to-many, are replicated in MongoDB.
Principles of schema design
Database schema design is the process of choosing the best representation for a data set given the features of the database system, the nature of the data, and the requirements of the application. The principles of schema design for relational database systems are well established. With RDBMSs, we’re encouraged to shoot for a normalized data model, which helps to ensure generic queryability and avoid updates to the data that might result in inconsistencies. Developers aren’t wondering how to create one-to-many and many-to-many relationships in an RDBMS. But, having said that, schema design is not a hard science even with relational databases. Performance-intensive applications or applications that have to consume data flexibly, may require a more denormalized data model. Some applications are so demanding in their storage and scaling requirements that they’re forced to break all the old rules of schema design. FriendFeed is a great example of this, and reading about their data model is well worth your time.
If you’re moving to MongoDB from the RDBMS world, you may be troubled by the lack of hard schema design rules. Good practices are beginning to emerge but, frankly, there’s often more than one good way to model a given data set. The premise of this section is that principles can drive schema design. But those principles are, in practice, variable. Here, then, are few questions that can be applied when modeling data with any database system.
* What’s the basic unit of data?
In an RDBMS, we have tables with columns and rows. In a key-value store, we have keys and amorphous values. And, in MongoDB, the basic unit of data is the BSON document.
* How can we query and update that data?
Once we have a basic unit of data, we need to know how we can manipulate it. RDBMSs feature indexed dynamic queries and joins. MongoDB also has indexed dynamic queries, but joins aren’t supported. And a simple key-value store can only get and put values based on a single key. Databases also diverge in the kinds of updates they permit. With an RDBMS, we can update documents in sophisticated ways using SQL and we can wrap such updates in a transaction to get atomicity and rollback. MongoDB doesn’t support transactions, but it does support a variety of atomic update operations that can work on the internal structures of a complex document. With simple key-value stores, we might be able to update a value but every update involves replacing the value completely. The essential point of this second question is that building the best data model means understanding the features of your database. If you want to model data well in MongoDB, you need to know exactly what sorts of queries and updates it performs the best.
* What are our application access patterns?
In addition to understanding our basic unit of data and database features, we also need to pin down the needs of our application. If you read the FriendFeed article suggested earlier, you see that the idiosyncrasies of an application can easily demand a schema that goes against firmly-held data modeling principles. Thus, numerous questions can be asked about the application to help determine the data model. What’s the read/write ratio? What’s sorts of queries do we need? How is the data updated? What concurrency issues are involved?
The best schema designs are always the product of deep knowledge of the database you’re using, good judgment about the requirements of the application at hand, and plain old experience.
Designing an e-commerce data model
Demonstrations of next-generation data stores typically revolve around social media: Twitter-like demo apps are the norm. Unfortunately, such apps tend to have rather simple data models. E-commerce has the advantage of including a large number of familiar data modeling patterns. It’s not difficult to imagine how products, categories, product reviews, and orders would typically be modeled using an RDBMS. This should make the upcoming examples more instructive, as we’ll be able to contrast them with our preconceived notions of schema design.
E-commerce has always been a domain exclusive to RDBMSs for a couple of reasons. The first is that e-commerce sites generally require transactions, and transactions are a strong point of RDBMSs. The second is that, up until very recently, domains that require rich data models and dynamic queries have been assumed to fit best within the realm of RDBMS. The examples below call into question this second assumption.
Building an entire e-commerce back-end is beyond this article’s scope. What we’re going to do instead is pick out a handful of e-commerce entities and show how they might be modeled in MongoDB. In particular, we’re going to look at products and categories, users and orders, and product reviews. For each entity, we’ll show an example document. Then, we’ll hint at some of the database features that make that particular structure viable.
Before moving on, we must say a few words about object mappers. When developing applications on an RDBMS, you’ve probably used an object-relational mapping library, such as Java’s Spring framework or Ruby’s ActiveRecord. Such libraries are absolutely necessary for building applications efficiently on an RDBMS. But they’re less necessary with MongoDB. This is due in part to documents already being an object-like representation. It’s also partly due to the good usability of drivers, which provide a high-level interface to MongoDB.
Indeed, applications can frequently be built on drivers alone. Having said that, object mappers are nice for larger projects, since they usually provide validation, type checking, and associations for free. And there are, undoubtedly, a number of mature object mappers for MongoDB that work on top of the basic language drivers. Ultimately, regardless of the object mapper, what gets stored in MongoDB are documents. Knowing the shape of documents in a good MongoDB schema will prepare you to work with the database efficiently with or without an object mapper.
Products and categories
Products and categories are mainstays of any e-commerce site. Products in a normalized RDBMS model tend to require a large number of tables. There’s a table for basic product info, such as the name and SKU, but there will also be other tables to relate shipping information and pricing histories. If the system allows creating products with arbitrary attributes, a complicated series of tables will be added to define and store those attributes. All of this will be facilitated by the RDBMS’s ability to join tables.
Modeling a product in MongoDB should be less complicated. Because collections don’t enforce a schema, any product document will have room for whichever dynamic attributes the product dictates. And, because we can use arrays to contain inner document structures, we can typically condense a multitable RDBMS representation into a single MongoDB collection. Listing 1 shows a product from a sample gardening store.
Listing 1 A sample product document
@doc =
{ :_id => BSON::ObjectID(“4c4b1476238d3b4dd5003981”),
:slug => “wheel-barrow-9092”,
:sku => “9092”,
:name => “Extra Large Wheel Barrow”,
:description => “Heavy duty wheel barrow...”,
:details => {
:weight => 47,
:weight_units => “lbs”,
:model_num => 4039283402
},
:tags => [“tools”, “equiptment”, “soil”],
:pricing => {
:retail => 589700,
:sale => 489700,
},
:price_history => [
{:retail => 529700,
:sale => 429700,
:start => Time.utc(2010, 5, 1),
:end => Time.utc(2010, 5, 8)
},
{:retail => 529700,
:sale => 529700,
:start => Time.utc(2010, 5, 9),
:end => Time.utc(2010, 5, 16)
},
],
:category_ids => [BSON::ObjectID(“6a5b1476238d3b4dd5000048”),
BSON::ObjectID(“6a5b1476238d3b4dd5000048”)],
:average_review => 4.5
}
The document contains the basic name, sku, and description fields. There’s also the standard MongoDB object id stored in the _id field. In addition, we’ve defined a slug, “wheel-barrow-9092,” to provide a meaningful URL. MongoDB users sometimes justifiably complain about the ugliness of object ids in URLs. Naturally, we don’t want URLs looking like this: http://mygardensite.org/products/4c4b1476238d3b4dd5003981. Meaningful ids are so much better: http://mygardensite.org/products/wheel-barrow-9092
As a result, we generally recommend building out a slug field if a URL will be generated for the document. Such a field should have a unique index on it so that it can be used as a primary key. Here’s how we’d create the unique index in Ruby:
@products.create_index(‘slug’)
If we have a unique index on slug, we’ll need to insert our product document using safe mode so that we’ll know if the insert fails. That way, we can retry with a different slug if necessary. To take an example, imagine our garden store has multiple wheel barrows for sale. We decide to start selling a new wheel barrow, so our code will need to generate a unique slug for the new product. Here’s how we perform the insert:
@products.insert({:name => “Extra Large Wheel Barrow”,
:sku => “9092”,
:slug => “wheel-barrow-9092”},
:safe => true)
If the insert succeeds without raising an exception, we know we’ve chosen a unique slug. But if an exception is raised, our code will have to retry. In any case, the management of slugs is easy with unique indexes.
Continuing on, we have a key, details, that points to a subdocument containing various product details. We’ve specified the weight, weight units, and the manufacturer’s model number. But, we might store other ad-hoc attributes here as well. For instance, if we were selling seeds, we might include attributes here for the expected yield and time to harvest, and, if we were selling lawn mowers, we could include horse power, fuel type, and mulching options. The details attribute provides a nice container for these types of dynamic attributes.
Notice that we’re also able to store the product’s current and past prices in the same document: pricing simply points to an object containing retail and sale prices; price history, by contrast, references a whole array of pricing options. This is a common technique for versioning a document.
Next, we see an array of tag names for the product. Since we can index array keys, this is the simplest and best way of storing relevant tags on an item while at the same time assuring efficient queryability.
But what about relationships? We’re able to use rich document structures, like embedded documents and arrays, to store product details, prices, and tags—all in a single document. And yet, we do eventually have to relate to documents in other collections. To start, we’re going to relate products to a category structure. This relationship between products and categories is usually conceived as many to many, where each product can belong to more than one category, and each category can contain multiple products. In an RDMBS, we’d use a join table to represent a many-to-many relationship like this. Join tables are great because they store all the relationship references between two tables in a single table. Using a SQL join, it then becomes possible to issue a single query to pull back a product and all its categories, and vice versa.
Now, MongoDB doesn’t support joins, so we need a different many-to-many strategy. Looking at our wheelbarrow document, you’ll see a field called category_ids containing an array of object ids. Each of these object ids is a reference to the _id field of some category document. For reference, you can see an example category document in listing 2.
Listing 2 A category document
doc =
{
:_id => BSON::ObjectID(“6a5b1476238d3b4dd5000048”),
:slug => “gardening-tools”,
:parents =>
[{
:name => “Home”,
:_id => BSON::ObjectID(“8b87fb1476238d3b4dd5000003”),
:slug => “home”
},
{
:name => “Outdoors”,
:_id => BSON::ObjectID(“9a9fb1476238d3b4dd5000001”),
:slug => “outdoors”
}],
:name => “Gardening Tools”,
:description => “Gardening gadgets galore!”,
}
If you look carefully at the object ids in the category_ids fields, you’ll see that our product is related to the “Gardening Tools” category just shown. Having the category_ids array key in the product document enables all of the kinds of queries we’d want to do on a many-to-many relationship. For instance, to query for all products in the “Gardening Tools” category, the Ruby code is quite simple:
gardening_id = @category[‘_id’]
@products.find({:category_ids => gardening_id})
To query for all categories for our products, we pass the array of category ids using the special $in operator:
@categories.find({:_id => {“$in” => @product[‘category_ids’]}})
Queries like this, with the $in operator, are not unlike SQL’s IN directive. With the many-to-many relationship out of the way, we must say a few words about the category document itself. You’ll notice the standard _id, slug, name, and description fields. These are straightforward, but the array of parent documents probably won’t be. Why are we storing such a large percentage of each parent category document? The fact is that categories are always conceived as a hierarchy and the ways of representing such a hierarchy in a database are many.
The strategy we choose is always dependent on the needs of the application. In this case, because MongoDB doesn’t support joins, we’ve elected to cache the parent category names in each child document. This way, when querying for the “Gardening Products” category, there’s no need to perform additional queries to get the names and URLs of the parent categories, “Outdoors” and “Home.”
Some developers would, no doubt, consider this level of denormalization unacceptable. In those cases, there are other options for representing a tree. For the moment, try to be open to the possibility that what best determine the schema are the demands of the application, and not necessarily the dictates of theory.
Users and orders
Looking at how we model users and orders illustrates another common relationship: one to many. In this case, a user is said to have many orders. In an RDBMS, we’d use a foreign key in our orders table; here, the convention is similar. Examine listing 3.
Listing 3 An e-commerce order as a document with line items, pricing, and a shipping address
doc =
{
:_id => BSON::ObjectID(“6a5b1476238d3b4dd5000048”)
:user_id => BSON::ObjectID(“4c4b1476238d3b4dd5000001”)
:state => “CART”,
:line_items =>
[{
:_id => BSON::ObjectID(“4c4b1476238d3b4dd5003981”),
:sku => “9092”,
:name => “Extra Large Wheel Barrow”,
:pricing =>
{
:retail => 5897,
:sale => 4897,
}
},
{
:_id => BSON::ObjectID(“4c4b1476238d3b4dd5003981”),
:sku => “10027”,
:name => “Rubberized Work Glove, Black”,
:pricing =>
{
:retail => 1499,
:sale => 1299,
}
}],
:shipping_address =>
{
:street => “588 5th Street”,
:city => “Brooklyn”,
:state => “NY”,
:zip => 11215
},
:subtotal => 6196
}
The second attribute, user_id, stores a given user’s _id. That points to our sample user, shown in listing 4 (we discuss this listing presently).
Listing 4 A user document with addresses and payment methods
doc =
{
:_id => BSON::ObjectID(“4c4b1476238d3b4dd5000001”)
:username => “kbanker”,
:first_name => “Kyle”,
:last_name => “Banker”,
:hashed_password => “bd1cfa194c3a603e7186780824b04419”,
:addresses =>
[{
:name => “home”,
:street => “588 5th Street”,
:city => “Brooklyn”,
:state => “NY”,
:zip => 11215
},
{
:name => “work”,
:street => “1 E. 23rd Street”,
:city => “New York”,
:state => “NY”,
:zip => 10010
}],
:payment_methods =>
[{
:name => “VISA”,
:last_four => 2127,
:crypted_number => “43f6ba1dfda6b8106dc7”,
:expiration_date => Time.utc(2014, 5)
}]
}
This modeling arrangement makes it easy to query either side of the relationship. To find all orders for a given user is simple:
@orders.find({:user_id => some_user_id})
The query for getting the user for a particular order is equally simple:
user_id = @order[‘user_id’]
@users.({:_id => user_id})
Thus, a one-to-many relationship between orders and users is easily accomplished using a foreign key. Let’s look at some other salient aspects of our order document. In general, we’re using the rich representation afforded to us by the document data model to represent the object holistically. You’ll see that the document includes both the line items and the shipping address, which, in a normalized relational model, would be located in separate tables. Our line items consist of an array of subdocuments, each describing a product in our shopping cart. The shipping address attribute points to a single object containing address fields.
Let’s take just a moment to discuss the merits of this representation. First, there’s a win for the human mind. Our entire concept of an order, including line items, shipping address, and, eventually, payment information, can be encapsulated in a single entity. When querying the database, we can return the entire order object without a single join. What’s more, the products, as they appeared when purchased, are effectively captured within our order document. The order document can be easily queried and modified.
The user document presents similar patterns in storing a list of address document along with a list of payment method documents. In addition, at the top level of the document, we store the basic attributes common to any user model. And, like the slug field on our product, the username field here would have a unique index.
Reviews
We’re going to close out our sample data model with product reviews. Relationally speaking, each product has many reviews. Here, that relationship is encoded using a foreign key, review_id, which you can see in our sample review document in listing 5.
Listing 5 A document representing a product review
doc =
{
:_id => BSON::ObjectID(“4c4b1476238d3b4dd5000041”),
:product_id => BSON::ObjectID(“4c4b1476238d3b4dd5003981”),
:date => Time.at(2010, 6, 7),
:title => “Amazing”,
:text => “Has a squeaky wheel, but still the best wheel barrow I’ve ever owned.”,
:rating => 4,
:user_id => BSON::ObjectID(“4c4b1476238d3b4dd5000041”),
:username => “dgreenthumb”,
:helpful_votes => 3,
:voter_ids =>
[
BSON::ObjectID(“4c4b1476238d3b4dd5000041”),
BSON::ObjectID(“7a4f0376238d3b4dd5000003”),
BSON::ObjectID(“92c21476238d3b4dd5000032”)
]
}
Most of the remaining attributes are self-explanatory: we store the review’s date, title, and text; the rating provided by the user; and the user’s id. But it may come as a surprise that we store the username as well. After all, if this were an RDBMS, we’d be able to pull in the username with a join on the users’ table. Since we don’t have the join option with MongoDB, we can proceed in one of two ways: either query against the users collection for each review or accept some denormalization. Issuing a query for every review seems unnecessarily costly when the attribute being queried, the username, is extremely unlikely to change. So it makes sense to take the denormalized route. This does mean that a username update is more expensive, since a username will have to change in every place that it appears but, again, that happens infrequently enough to justify this design choice.
Also noteworthy is our decision to store votes in the review document itself. Of course, it’s common for users to be able to vote on reviews. Here, we store the object id of each voting user in an array of voter ids. This allows us to prevent users from voting on a review more than once, and it also gives us the ability to query for all the reviews a user has voted on. Notice that we also cache the total number of helpful votes, which allows, among other things, sorting of reviews based on helpfulness. With that, we’ve covered a basic data model for e-commerce.
Summary
We've covered a lot of ground in this article! We began with a theoretical discussion of schema design and then proceeded to outline the data model for an e-commerce application. This gave you a chance to see what documents might look like in a production system and it should have gotten you thinking in a more concrete way about the differences between schemas in RDMBSs and MongoDB. Granted, some of the rationale for the design of these documents won't become clear until you get more experience with MongoDB’s query and update powers. But the general pattern of using a rich document data structure should be coming to the fore.
-----------------------------------------------------
Silence, the way to avoid many problems;
Smile, the way to solve many problems;