
Fixing memory problems with Node.js and Mongo DB


Now that the basic functionality of Rutilus is done, I spent some time addressing the memory limitations we ran into. In this post I will list the problems we faced and how I solved them.

Observation: we were using Mongoose for these queries, not the native MongoDB driver for Node.js.

1- Steps in the aggregation pipeline taking too much memory

From the MongoDB manual:

"Aggregations are operations that process data records and return computed results. MongoDB provides a rich set of aggregation operations that examine and perform calculations on the data sets. Running data aggregation on the MongoDB instance simplifies application code and limits resource requirements."

So, obviously, a pipeline such as the one below needs enough memory available to perform all of those stages:

ZipCodes
  .aggregate([
    // Group the zip codes by state/city and sum their populations
    { $group: {
      _id: { state: "$state", city: "$city" },
      pop: { $sum:  "$pop" }
    }},
    // Sort ascending by population, so $first/$last pick the extremes
    { $sort: { pop: 1 }},
    // Regroup by state, keeping the biggest and smallest city of each
    { $group: {
      _id : "$_id.state",
      biggestCity:  { $last:  "$_id.city" },
      biggestPop:   { $last:  "$pop"      },
      smallestCity: { $first: "$_id.city" },
      smallestPop:  { $first: "$pop"      }
    }},
    // Shape the final output documents
    { $project: {
      _id: 0,
      state: "$_id",
      biggestCity:  { name: "$biggestCity",  pop: "$biggestPop"  },
      smallestCity: { name: "$smallestCity", pop: "$smallestPop" }
    }}
  ])
  .exec((err, docs) => {
    ...
  });

The problem we were having in this case was that we did not have enough memory to run the stages, even though we did have enough memory for the output: each stage of the aggregation pipeline is limited to 100 MB of RAM, and the query fails if a stage exceeds that limit. In other words, the output was small and concise, but producing it required more memory than the stages were allowed to use.

The solution for this was easy: we can simply tell MongoDB to use disk space temporarily to store the intermediate data. It is probably slower, but that is better than not being able to run the query at all. To do this, we just needed to add an extra call (allowDiskUse) to that method chain:

ZipCodes
  .aggregate([
    ...
  ])
  .allowDiskUse(true) // < Allows MongoDB to use the disk temporarily
  .exec((err, docs) => {
    ...
  });

2- Result from aggregation pipeline exceeding maximum document size

For queries with a huge number of results, the aggregation pipeline would greet us with the lovely error "exceeds maximum document size". This is because the result of an aggregation pipeline is returned as a single BSON document, which has a size limit of 16 MB.

There are two ways to solve this problem:

1- Piping the results into another collection and querying it later (see the sketch after this list)

2- Getting a cursor to the first document and iterating through it
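
For reference, a minimal sketch of the first approach: MongoDB's $out stage writes the pipeline results into a collection instead of returning them as one big document (the collection name "zipCodeStats" below is just a made-up example):

ZipCodes
  .aggregate([
    ...
    { $out: "zipCodeStats" } // < Must be the last stage of the pipeline
  ])
  .allowDiskUse(true)
  .exec((err) => {
    // The results can now be queried from the "zipCodeStats" collection
  });

Since every result becomes its own document in the new collection, the 16 MB limit applies per result instead of to the output as a whole.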

I picked the second method, and this is how I used it:

const cursor = ZipCodes
  .aggregate([
    ...
  ])
  .allowDiskUse(true)
  .cursor({ batchSize: 1000 }) // < Important
  .exec(); // < Returns a cursor

// The .toArray method of a cursor iterates through all documents
// and loads them into an array in memory
cursor.toArray((err, docs) => {
  ...
});

The batchSize refers to how many documents we want returned in every batch; according to the MongoDB documentation, this does not change how the application sees the results, since the driver delivers them as if they had arrived in a single batch.
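
Note that .toArray still loads every document into memory at once. If that also becomes a problem, the same cursor can be consumed one document at a time instead; a minimal sketch, assuming the native driver cursor that Mongoose hands back here (handleDoc is a hypothetical per-document handler):

cursor.each((err, doc) => {
  if (err) { throw err; }
  if (doc === null) { return; } // < A null document means the cursor is exhausted
  handleDoc(doc); // Process one document at a time
});

This keeps memory usage roughly constant no matter how many documents the pipeline produces.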

3- JavaScript Heap out of memory

After getting those beautiful millions of rows from the aggregation pipeline, we were greeted by another lovely error: "FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory". This happens when the Node.js heap runs out of memory (as you probably inferred from the error message).

According to some sources on the internet, the default memory limit for Node.js is 512 MB on 32-bit systems and 1 GB on 64-bit systems. We can increase this limit when launching the Node.js application with the option --max_old_space_size, specifying how much memory we want in MB. For example:

node --max_old_space_size=8192 app.js

This will launch the app.js application with an 8 GB heap limit instead of the default 1 GB.
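
To double-check the limit the process actually ended up with, one option (assuming a Node.js version that exposes v8.getHeapStatistics) is to log the heap size limit reported by V8:

const v8 = require('v8');

// heap_size_limit is the maximum heap size, in bytes
const heapLimitMb = v8.getHeapStatistics().heap_size_limit / 1024 / 1024;
console.log(`Heap limit: ${heapLimitMb} MB`);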
