blog

Simulating Inner Joins on MongoDB

Probably one of the most important features in SQL, for more complicated queries, is the ability to join data from several tables and group them; for example: having a table of users and a table of messages, and then joining them both to get the users, as well as their messages. There are several types of joins in SQL:

Diagram of joins

Different types of joins

Now, for our project, we are using MongoDB, a NoSQL database - how do joins work, in this case? Say we have two collections on MongoDB that follow this schema:

Users: {
    _id: Number,
    name: String
}

Messages: {
    _id: Number,
    text: String,
    creator: Number // References _id in Users
}

And a sample of the data:

Users: [{
    _id: 100,
    name: "John",
}, {
    _id: 101,
    name: "Paul"
}]

Messages: [{
    _id: 200,
    text: "Hello, how are you?",
    creator: 101
}]

And now I want to get all the messages, as well as the creator's name. Is there an easy way to do this? There is, with Mongoose, we can build these relationships in the schema, and we can use the method populate to join the two pieces together:

Schema:

Users: {
    _id: { type: Number },
    name: { type: String }
}

Messages: {
    _id: { type: Number },
    text: { type: String },
    creator: { type: Number, ref: 'Users' } // References _id in Users
}

Joining:

db.Messages.findOne({})
           .populate('creator')
           .exec((err, docs) => {
               if (err) { throw err; }
               console.log(docs);
           });

This would give us an output similar to this:

[{
    _id: 200,
    text: "Hello, how are you?",
    creator: {
        _id: 101,
        name: "Paul"
    }
}]

Good enough, right? Ok. But the problem is that this is a full left join: if there was a message without a creator, it would still be selected. So, what if I want an inner join? Short answer: you can't. MongoDB does not support inner joins. This is fine for most scenarios: you can simply filter the data afterwards to get rid of the incomplete documents; but it starts to be a problem when you run in to memory issues, which was the problem we faced during the development of a module, and it would be a really big problem. Luckily, we have algorithms by our side!

In our case, execution time is not a big issue, we must do inner joins using many collections (often more than 5), and memory is a limiting factor, so we tried to get the best of this scenario. I designed a module that did the inner joins manually for us and saved as much memory as possible, this is how I did it:

1- The most specific queries with the most sparsely populated collections happen first: if you are looking for "all the users that use IE 6", it is a much better idea to "look for the ID of the IE6 browser in the database, and then fetch the users that have that ID in the entry" than "getting all the users, selecting all their browsers, and then getting only the ones that use IE6".

2- For every query done, we build up more and more conditions for the next query: if you want all the users that use IE6, as long as they live in Canada, you do the query to "find the ID of the IE6 browser, and then you find the addresses within canada, and then you query for the users - but only the ones that match the accepted addresses and browser", instead of simply getting all the users at the end and joining the information.

3- Leave extra information for the end: if you want all the users messages in addition to the users from the previous case, fist you should find all the users that match those conditions and then you find their messages, instead of scanning for all the messages and then joining them with the users that matched the conditions.

4- If a query returned too many results even with conditions, try again later: following the rule #2, it is likely that if you let other queries run, you will end up with more conditions to refine the search even more. For example: if your first search for browsers returned too many results, but the next search, the one for users only returned 1 result, your next query for the browsers will only need to find the browser for that particular user.

Following these 4 rules, I managed to come up with a module that make inner joins on MongoDB for our project: you pass a JSON object with the conditions you want, and it will do the queries for you and join them automatically. For example:

stores.Users.execute({
    Users: {
        name: { contains: 'John' }
    },
    Browsers: {
        name: { contains: 'IE6' }
    },
    Address: {
        country: { matches: 'CA' }
    }
});

The snippet above would select all, and only the users that have "John" in their names, live in Canada and use IE6.

I can't believe it actually works.