The relational debate rages on. On the relational side of things, you have single sources of truth, all the joins you could possibly want, and fast queries on everything you could possibly imagine. Then you got the rebels tossing around POJOs and rapidly iterating the super loose tables. What is a dev to do?
I was working on a project where we were modelling some highly related data using MongoDB, projects as parts of an organization, and we came up against the inevitable existential question, to embed or not to embed?
Embedding documents keeps everything in one place and makes all relevant data easily accessible with a single query.
//Organizations - Embedded
{
name: 'Organization',
projects: [{name: 'project1'}, {name:'project2'}]
}
Creating separate collections means we have only one source of truth to make updates on and we can easily access any piece of the hierarchy.
//Organizations - Separate Collections
{
name: 'Organization',
projects: [project1._id, project2._id]
}
// Projects collection
{
name: 'Project1'
}
{
name: 'Project2'
}
We ended up choosing to separate out the collections. There was too much overlap between different documents; if you updated one sub document, you had to make sure it was updated everywhere. And it proved very difficult to make complex queries and operations on sub documents. That being said, how can you conveniently link related documents together in MongoDB?
The solution I ended up using was to create separate collections using Mongoose DB-refs to link related documents on different collections, and then creating a middleware function to auto-populate the documents with their db-refs on any query.
DB-refs
DB-refs essentially save a reference to an _id property for one or many other documents that are related to the current document. From the Mongoose docs:
var StorySchema = new Schema({
_creator : { type: Schema.ObjectId, ref: 'Person' }
, title : String
, fans : [{ type: Schema.ObjectId, ref: 'Person' }]
});
The story schema can be linked to a single _creator document and any number of fan documents. These references will be saved in the _creator
and fans
properties respectively. To set up these connections, you only need to save the corresponding _id value in _creator.
var aaron = new Person({ name: 'Aaron', age: 100 });
aaron.save(function (err) {
if (err) ...
var story1 = new Story({
title: "A man who cooked Nintendo"
, _creator: aaron._id
});
story1.save(function (err) {
if (err) ...
});
})
Now the story document has a reference to 'aaron', its creator. To get access to this document, you can 'populate' it to fill in all the DB-ref fields with the documents that they actually reference.
Story
.findOne({ title: /Nintendo/i })
.populate('_creator', ['name']) // <-- only return the Persons name
.exec(function (err, story) {
if (err) ..
console.log('The creator is %s', story._creator.name);
// prints "The creator is Aaron"
console.log('The creators age is %s', story._creator.age)
// prints "The creators age is null'
})
Now this is all well and good, but you have to set up the query to populate. And what if you have multiple fields you want to populate? And what if you want to do it all the time? It would be pain in the ass to have to include populate on every single query you write out, so we'll ship it off to Mongoose middleware to do it each time for us.
Shifting gears away from the stories a little, in my recent project we had a users schema that held references to organizations, projects, and tasks all of which were separate collections in their own right.
Here's what our user schema looked like:
var usersSchema = new Schema({
username: {
type: String,
required: true,
unique: true
},
password: {
type: String,
required: true
},
organization: [{
type: Schema.ObjectId,
ref: 'Org'
}],
project_list: [{
type: Schema.ObjectId,
ref: 'Project'
}],
task_list: [{
type: Schema.ObjectId,
ref: 'Task'
}]
});
It would be nice, no matter when you grabbed a user from the DB, it had all of those fields automatically populated, right?
On defining the model, you can hook into the Mongoose middleware, specifically the 'init' hook which will get fired whenever an existing model is grabbed from the DB. Inside of the users' model definition, I added in this hook. It makes use of the Mongoose Model's populate method. You can pass a space delimited string of all the fields you want to populate. Any time a user model is picked up, it will have each of those DB-ref fields filled out.
db.usersSchema.pre('init', function(next, data) {
User.populate(data, {
path: 'organization project_list task_list'
}, function(err, user) {
data = user;
next();
});
});
I found this solution to work quite nicely with what we were trying to achieve. We maintained separate, highly related collections. Each collection was easy to access and update, and there was only one source of truth for each entry. We also got access to the convenience and intuitive structure of POJOs automatically built for us with every query! A nice little solution that gave us the best of both worlds.