hcoelho.com

my blog


Fixing memory problems with Node.js and Mongo DB


Now that the basic functionality of Rutilus is done, I spent some time addressing the memory limitations we had been facing. In this post I will list the problems and how I solved them.

Observation: we were using Mongoose for these queries, not the native MongoDB driver for Node.js.

1- Steps in the aggregation pipeline taking too much memory

From the MongoDB manual:

"Aggregations are operations that process data records and return computed results. MongoDB provides a rich set of aggregation operations that examine and perform calculations on the data sets. Running data aggregation on the MongoDB instance simplifies application code and limits resource requirements."

So, naturally, a pipeline such as the one below needs enough memory available to perform all of its stages:

ZipCodes
  .aggregate([
    { $group: {
      _id: { state: "$state", city: "$city" },
      pop: { $sum:  "$pop" }
    }},
    { $sort: { pop: 1 }},
    { $group: {
      _id : "$_id.state",
      biggestCity:  { $last:  "$_id.city" },
      biggestPop:   { $last:  "$pop"      },
      smallestCity: { $first: "$_id.city" },
      smallestPop:  { $first: "$pop"      }
    }},
    { $project: {
      _id: 0,
      state: "$_id",
      biggestCity:  { name: "$biggestCity",  pop: "$biggestPop"  },
      smallestCity: { name: "$smallestCity", pop: "$smallestPop" }
    }}
  ])
  .exec((err, docs) => {
    ...
  });

The problem we were having in this case was that we did not have enough memory to perform the stages, even though we did have enough memory for the output (by default, MongoDB limits each pipeline stage to 100MB of RAM). In other words: the output was small and concise, but producing it required a lot of memory.

The solution for this was easy: we can simply tell MongoDB to use disk space temporarily to store intermediate data. It is probably slower, but it is better than not being able to run the query at all. To do this, we just needed to add an extra call (allowDiskUse) to that method chain:

ZipCodes
  .aggregate([
    ...
  ])
  .allowDiskUse(true) // < Allows MongoDB to use the disk temporarily
  .exec((err, docs) => {
    ...
  });

2- Result from aggregation pipeline exceeding maximum document size

For queries with a huge number of results, the aggregation pipeline would greet us with the lovely error "aggregation result exceeds maximum document size". This is because the result of an aggregation pipeline is returned in a single BSON document, which has a size limit of 16MB.

There are two ways to solve this problem:

1- Piping the results into another collection and querying it later (see the sketch below)

2- Getting a cursor to the results and iterating through them
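
For reference, the first option can be done with the $out stage, which writes the results of the pipeline into a collection instead of returning them. A minimal sketch (the collection name zipSummaries is just an example, not from our project):

ZipCodes
  .aggregate([
    ...
    { $out: "zipSummaries" } // < Must be the last stage of the pipeline
  ])
  .allowDiskUse(true)
  .exec((err) => {
    // The results now live in the zipSummaries collection and can be
    // queried later with regular queries, free of the 16MB limit
  });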

I picked the second method, and this is how I used it:

const cursor = ZipCodes
  .aggregate([
    ...
  ])
  .allowDiskUse(true)
  .cursor({ batchSize: 1000 }) // < Important
  .exec(); // < Returns a cursor

// The method .toArray of a cursor iterates through all documents
// and loads them into an array in memory
cursor.toArray((err, docs) => {
  ...
});

The batchSize option specifies how many documents we want returned in each batch; according to the MongoDB documentation, this setting does not really affect the application, since the cursor fetches the following batches transparently as it is iterated.
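
If even the final array would be too big to hold in memory, the cursor can instead be consumed one document at a time. A sketch, assuming the cursor interface of the native driver (which is what .exec() returns here):

const cursor = ZipCodes
  .aggregate([
    ...
  ])
  .allowDiskUse(true)
  .cursor({ batchSize: 1000 })
  .exec();

// Pull one document at a time instead of buffering all of them
function processNext() {
  cursor.next((err, doc) => {
    if (err) { throw err; }
    if (doc === null) { return; } // < Cursor exhausted, we are done
    // ... process doc here ...
    processNext();
  });
}
processNext();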

3- JavaScript Heap out of memory

After getting those beautiful millions of rows from the aggregation pipeline, we were greeted by another lovely error: "FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory". This happens when the Node.js heap runs out of memory (as you probably inferred from the description of the error).

According to some sources on the internet, the default memory limit for Node.js is 512MB on 32-bit systems and 1GB on 64-bit systems. We can increase this limit when launching the Node.js application with the option --max_old_space_size, specifying how much memory we want in MB. For example:

node --max_old_space_size=8192 app.js

This will launch the app.js application with an 8GB heap limit instead of 1GB.
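
To confirm the new limit took effect, the effective heap limit (in MB) can be printed from inside Node with the built-in v8 module; the value should be close to the number passed in the flag:

node --max_old_space_size=8192 -e "console.log(require('v8').getHeapStatistics().heap_size_limit / 1024 / 1024)"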


Prototyping a calculated field on MongoDB for quick access


The next phase of our project will be a content recommendation system for the users who visit our website: we will consider their past preferences (article categories, for example) in order to recommend new content. This system needs to be fast and not use the database unnecessarily, since it will run on every visit of every user. Considering that all the data we gather from our users is spread across several collections in our database, we cannot afford an expensive, slow operation full of joins; we need a way to make this operation fast and cheap.

Calculated values are a great way to turn expensive, slow operations into very simple queries; however, they have a drawback: how do we keep them synchronized? Our solution was to use a collection that contains all the hits made by a user, which we called a "session" (a session contains many hits); every time the user makes a new hit, we use the information from that hit to update the history we keep in the session - this ensures that the calculated fields are always up to date.

For example, assuming this is our current history for the user:

Session
{
    history: {
        visitedIds: [1, 2, 3, 4, 5, 6, 7, 8],
        articlesVisited: 5,
        videosVisited: 3,
    }
}

The history says that the user visited 5 articles and 3 videos; the IDs visited (of the articles and videos, assuming they are stored in the same collection) are 1, 2, 3, 4, 5, 6, 7, and 8.

If the user makes another hit on another article (say article #9), the history in the user's session would change to:

Session
{
    history: {
        visitedIds: [1, 2, 3, 4, 5, 6, 7, 8, 9],
        articlesVisited: 6,
        videosVisited: 3,
    }
}

Changes like these are very easy to make in MongoDB. For pushing the new ID into the array, we can simply use the $push (allows duplicate values) or the $addToSet (unique values only) operator:

db.sessions.update({
    _id: <session id>
}, {
    $addToSet: {
        "history.visitedIds": <article id>
        // In our case, the article id would be "9"
    }
});

Likewise, it is easy to increment values, such as the number of articles visited, with the $inc operator:

db.sessions.update({
    _id: <session id>
}, {
    $inc: {
        "history.<field to increment>": 1
        // In our case, the field to increment would be "articlesVisited"
    }
});

Joining them together:

db.sessions.update({
    _id: <session id>
}, {
    $addToSet: {
        "history.visitedIds": <article id>
    },

    $inc: {
        "history.<field to increment>": 1
    }
});

This takes care of keeping the calculated fields up to date with a single, simple operation.

Now we get to another detail: the calculated fields we are keeping are not in the exact format we want; for example, instead of just the raw number of visits a user made, couldn't we store percentages? That would make it easier to group users into clusters later, if we so desire. For example:

Session 1
{
    history: {
        visitedIds: [1, 2, 3, 4, 5, 6],
        articlesVisited: 4,
        videosVisited: 2,
    }
}

Session 2
{
    history: {
        visitedIds: [1, 2, 3],
        articlesVisited: 2,
        videosVisited: 1,
    }
}

Despite the user from Session 2 having fewer visits than the user from Session 1, their preferences are actually similar: both visited twice as many articles as videos. We could abstract these preferences like this:

Session 1
{
    history: {
        visitedIds: [1, 2, 3, 4, 5, 6],
        articlesVisited: 4,
        videosVisited: 2,
    },
    affinity: {
        articles: 66.666,
        videos: 33.333,
    }
}

Session 2
{
    history: {
        visitedIds: [1, 2, 3],
        articlesVisited: 2,
        videosVisited: 1,
    },
    affinity: {
        articles: 66.666,
        videos: 33.333,
    }
}

This could be done on the server, after we pull the data, or directly in the database. If we do it in the database, we can use MongoDB's aggregation framework for the calculation:

First, we get the total number of visits. For this, we can use the $project operator to add the article and video visits together:

db.test.aggregate([
    { $project: {

        _id: 1, // Keep the ID

        history: 1, // Keep the history

        // Creating the "totalVisits" field by adding the visits together
        totalVisits: { $add: [
            "$history.articlesVisited",
            "$history.videosVisited"
        ]}
    }}
])

This would be the result:

Session 1
{
    history: {
        visitedIds: [1, 2, 3, 4, 5, 6],
        articlesVisited: 4,
        videosVisited: 2,
    },
    totalVisits: 6,
}

Now that we have the total of visits, we can do some arithmetic ($multiply and $divide for multiplication and division) to find the percentage of the categories with another $project:

db.test.aggregate([
    { $project: {
        _id: 1,
        history: 1,
        totalVisits: { $add: [
            "$history.articlesVisited",
            "$history.videosVisited"
        ]}
    }},

    { $project: {

        _id: 1,

        history: 1,

        // We don't project the totalVisits here, if we want to omit it

        affinity: {
            articles: { $multiply: [
                { $divide: [
                    "$history.articlesVisited", "$totalVisits"
                ]},
                100
            ]},

            videos: { $multiply: [
                { $divide: [
                    "$history.videosVisited", "$totalVisits"
                ]},
                100
            ]}
        }
    }}
])

And this will be the result:

{
    history: {
        visitedIds: [1, 2, 3, 4, 5, 6],
        articlesVisited: 4,
        videosVisited: 2,
    },
    affinity: {
        articles: 66.666,
        videos: 33.333,
    }
}

In this example, the categories were hard-coded: we will have more than just "articles" and "videos", but the goal here was only to show that what we are envisioning can be done; we only need a more elaborate schema and a more intelligent algorithm.
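
As a hint of what that could look like on the server side, here is a hypothetical helper (not part of our actual module) that computes the affinity percentages for arbitrary categories, assuming the counters follow the <category>Visited naming convention used above:

// Given a history like { visitedIds: [...], articlesVisited: 4, videosVisited: 2 },
// compute affinity percentages without hard-coding the category names
function computeAffinity(history) {
    const keys = Object.keys(history).filter((k) => k.endsWith('Visited'));
    const total = keys.reduce((sum, k) => sum + history[k], 0);
    const affinity = {};
    keys.forEach((k) => {
        affinity[k.replace(/Visited$/, '')] = (history[k] / total) * 100;
    });
    return affinity;
}

// computeAffinity({ articlesVisited: 4, videosVisited: 2 })
// => { articles: 66.666..., videos: 33.333... }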


Simulating Inner Joins on MongoDB


Probably one of the most important features of SQL for more complicated queries is the ability to join data from several tables and group them; for example: having a table of users and a table of messages, and joining them to get the users along with their messages. There are several types of joins in SQL:

[Diagram: the different types of SQL joins]

Now, for our project, we are using MongoDB, a NoSQL database - how do joins work, in this case? Say we have two collections on MongoDB that follow this schema:

Users: {
    _id: Number,
    name: String
}

Messages: {
    _id: Number,
    text: String,
    creator: Number // References _id in Users
}


And a sample of the data:

Users: [{
    _id: 100,
    name: "John",
}, {
    _id: 101,
    name: "Paul"
}]

Messages: [{
    _id: 200,
    text: "Hello, how are you?",
    creator: 101
}]

And now I want to get all the messages, as well as the creators' names. Is there an easy way to do this? There is: with Mongoose, we can declare these relationships in the schema and use the populate method to join the two pieces together:

Schema:
Users: {
    _id: { type: Number },
    name: { type: String }
}

Messages: {
    _id: { type: Number },
    text: { type: String },
    creator: { type: Number, ref: 'Users' } // References _id in Users
}


Joining:
Messages.find({})
    .populate('creator')
    .exec((err, docs) => {
        if (err) { throw err; }
        console.log(docs);
    });

This would give us an output similar to this:

[{
    _id: 200,
    text: "Hello, how are you?",
    creator: {
        _id: 101,
        name: "Paul"
    }
}]

Good enough, right? OK. But the problem is that this behaves like a left outer join: if there was a message without a creator, it would still be selected. So, what if I want an inner join? Short answer: you can't. MongoDB does not support inner joins. This is fine for most scenarios: you can simply filter the data afterwards to get rid of the incomplete documents (see the sketch below); but it starts to become a problem when you run into memory issues, which is exactly what we faced during the development of one of our modules. Luckily, we have algorithms on our side!
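
The naive post-filtering is trivial when the result set fits in memory; a sketch, continuing the populate example above:

Messages.find({})
    .populate('creator')
    .exec((err, docs) => {
        if (err) { throw err; }
        // Simulate an inner join: discard the messages whose creator
        // could not be resolved by populate
        const joined = docs.filter((doc) => doc.creator !== null);
        console.log(joined);
    });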

In our case, execution time is not a big issue, but we must do inner joins across many collections (often more than 5) and memory is a limiting factor, so we tried to make the best of this scenario. I designed a module that does the inner joins manually for us while saving as much memory as possible. This is how I did it (a simplified sketch follows the list):

1- The most specific queries, on the most sparsely populated collections, happen first: if you are looking for "all the users that use IE6", it is a much better idea to "look for the ID of the IE6 browser in the database, and then fetch the users that have that ID in their entries" than to "get all the users, select all their browsers, and then keep only the ones that use IE6".

2- Every query done builds up more conditions for the next query: if you want all the users that use IE6, as long as they live in Canada, you "find the ID of the IE6 browser, then find the addresses within Canada, and then query for the users - but only the ones that match the accepted addresses and browser", instead of simply getting all the users at the end and joining the information.

3- Leave extra information for the end: if, in addition to the users from the previous case, you also want their messages, first find all the users that match the conditions and then find their messages, instead of scanning all the messages and joining them with the users that matched the conditions.

4- If a query returns too many results even with conditions, try it again later: following rule #2, it is likely that if you let other queries run first, you will end up with more conditions to refine the search even further. For example: if your first search, for browsers, returned too many results, but the next search, for users, returned only 1 result, your next query for browsers will only need to find the browser of that particular user.
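
To make rules 1 and 2 concrete, here is the simplified sketch mentioned above; the collection and field names (Browsers, Addresses, Users.browser, Users.address) are illustrative, not our actual schema:

// Query the sparsest, most specific collections first (rule 1)...
Browsers.find({ name: /IE6/ }, '_id', (err, browsers) => {
    if (err) { throw err; }
    const browserIds = browsers.map((b) => b._id);

    Addresses.find({ country: 'CA' }, '_id', (err, addresses) => {
        if (err) { throw err; }
        const addressIds = addresses.map((a) => a._id);

        // ...and only then query the (large) Users collection, with all
        // the conditions accumulated from the previous queries (rule 2)
        Users.find({
            browser: { $in: browserIds },
            address: { $in: addressIds }
        }, (err, users) => {
            if (err) { throw err; }
            // "users" now holds the result of the simulated inner join
        });
    });
});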

Following these 4 rules, I managed to come up with a module that makes inner joins on MongoDB for our project: you pass it a JSON object with the conditions you want, and it does the queries and joins them automatically. For example:

stores.Users.execute({
    Users: {
        name: { contains: 'John' }
    },
    Browsers: {
        name: { contains: 'IE6' }
    },
    Address: {
        country: { matches: 'CA' }
    }
});

The snippet above would select all, and only, the users that have "John" in their names, live in Canada, and use IE6.

I can't believe it actually works.
