
How not to flood your background jobs (event loop) with node


I recently encountered a problem in our program that could not be properly solved with synchronous methods, nor with asynchronous ones - but it worked nicely when I somewhat mixed both. I thought this was an interesting problem, so I am going to describe it in this post.

Consider the following: you have a list of 100,000 entries in an array, and you have to iterate through all of them in your Node.js server. How would you do it?

Let's say, in this case, that our list is just a sequence from 0 to 100,000 and I want to print the numbers on the screen. Pretend for a moment that this "print" is a method that requires lots of computation.

The simplest answer would be using a forEach method or a for loop like this:

function printNumber(c) {
    console.log(c);
}

for (let i = 0; i < 100000; i++) {
    printNumber(i);
}

That's nice. It certainly works. But if you are doing this in a server, you will block any requests until this job is done. If we ran the code below, it is obvious that the "new task" would be printed only when we finish with all the "expensive" operations.

function printNumber(c) {
    console.log(c);
}

for (let i = 0; i < 100000; i++) {
    printNumber(i);
}

// Putting a new task in the event loop
console.log('=========================== NEW TASK');

Fine. So what do we do? Asynchronous, right? That's the killer feature of node! Let's try this:

function printNumber(c) {
    const rand = Math.random(); // Just giving some randomness to the output time
    setTimeout(() => {
        console.log(c);
    }, rand * 10);
}

for (let i = 0; i < 100000; i++) {
    printNumber(i);
}

// Putting a new task in the event loop
setImmediate(() => {
    console.log('=========================== NEW TASK');
});

The idea here is the same: we'll print all the numbers, but every print will be asynchronous.

Now, that asynchronous task at the end is the important part: what I want to simulate here is what happens if we get another asynchronous task (say, a request) as soon as the loop finishes?

So, what do I expect to see in the output? Probably something like this:

// Beginning of the output
0
3
2
1
4
5
== NEW TASK
7
6
9
10
8
...

But instead, this is what I got:

...
99978
99998
99903
99956
99966
99969
99977
99980
99989
99992
99995
99999
=========================== NEW TASK
// End of output

Well, that's disappointing. The numbers are all out of order, but the new task still had to wait for all of them to finish. Why is that?

To understand what happened, we'll have to understand how Node's event loop works. I will try to make this very simple: in a conventional programming language, we would have the stack - a data structure that keeps track of where the execution is. In JavaScript, you also have the event loop, which works with a callback queue: a list of "things to do when the stack is clear". When the stack gets cleared, we pull the next item from that queue and execute it - it is basically a list of things "to do when you finish what you are doing now". When we use setTimeout or setImmediate, we are pushing an instruction to this list.
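
Here is a tiny example (not part of the original problem) that shows this ordering:

console.log('A'); // Runs right away, on the stack

setTimeout(() => {
    console.log('B'); // Goes to the queue; only runs once the stack is clear
}, 0);

console.log('C'); // Still runs before 'B', even with a timeout of 0

// Output: A, C, B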

Here is the problem with what happened there: we made a for loop that pushed 100,000 instructions in the queue all at once, almost immediately. At this point, the event loop looked like this:

eventLoop = [printNumber(0), printNumber(1) ... printNumber(99998), printNumber(99999)]

And then, a little later, we pushed the new task:

eventLoop = [printNumber(0), printNumber(1) ... printNumber(99998), printNumber(99999), newTask]

In order for the new task to be executed, all the prints had to be finished. So even though we made the code asynchronous, by flooding the event loop, it became synchronous again.

Solution

To fix this, we must avoid flooding the event loop with all the instructions at once. Instead, this is what we can do: we will push only a fraction of those instructions (maybe 50 or so), and then, when they are all done, we will push 50 more, and so on. This will ensure that, if we receive any additional task in the event loop, it will fit "in between" those prints, and not at the end. Something like this:

eventLoop = [printNumber(0), printNumber(1) ... printNumber(49), newTask]

And after printNumber(49) is called, it will be:

eventLoop = [newTask, printNumber(50) ... printNumber(99)]

This is relatively simple to do:

function printNumber(c, cb) {
    const rand = Math.random();
    setTimeout(() => {
        console.log(c);
        cb();
    }, rand * 10);
}

const upperLimit = 100000; // How many numbers we want
const batchSize = 50; // How many numbers for each batch
let currBatch = 0; // Which batch are we in?

(function pushBatch() {

    // Where do we start in this batch
    const batchStart = currBatch * batchSize;
    if (batchStart >= upperLimit) { return; }

    // Where does this batch end?
    let batchEnd = (currBatch + 1) * batchSize;
    if (batchEnd > upperLimit) { batchEnd = upperLimit; }

    // How many tasks do we still have to do in this batch?
    // (the last batch could be smaller than batchSize)
    let tasksToDo = batchEnd - batchStart;

    for (let i = batchStart; i < batchEnd; i++) {

        // Printing the number, and passing a callback to it:
        // if this was the last number to be printed, we start
        // the next batch
        printNumber(i, () => {
            tasksToDo--;
            if (tasksToDo < 1) {
                currBatch++;
                pushBatch();
            }
        });
    }
})();

// Putting a new task in the event loop
setImmediate(() => {
    console.log('=========================== NEW TASK');
});

And finally, here is the output:

...
1507
1518
1529
1536
=========================== NEW TASK
1467
1475
1497
1503
1517
1519
...

The problem with this is: it is slow. All these 100,000 functions are being called, then pushed in the event loop, then moved to the stack, and then executed. All individually. We can change it slightly to make it more performant: instead of making every single print asynchronous (they'll follow one after the other in the event loop anyway), we'll make the batch asynchronous:

function printNumber(c) {
    console.log(c);
}

const upperLimit = 100000; // How many numbers we want
const batchSize = 50; // How many numbers for each batch
let currBatch = 0; // Which batch are we in?

(function pushBatch() {
    setImmediate(() => {
        // Where do we start in this batch
        const batchStart = currBatch * batchSize;
        if (batchStart >= upperLimit) { return; }

        // Where does this batch end?
        let batchEnd = (currBatch + 1) * batchSize;
        if (batchEnd > upperLimit) { batchEnd = upperLimit; }

        for (let i = batchStart; i < batchEnd; i++) {
            printNumber(i);
        }

        currBatch++;
        pushBatch();
    });
})();

// Putting a new task in the event loop
setImmediate(() => {
    console.log('=========================== NEW TASK');
});

This script is MUCH faster, and produces similar output.
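
Just as a side note - this was not part of my original solution - the same batching idea can be sketched with setImmediate wrapped in a promise, if you prefer async/await:

// A rough sketch of the same batching idea using async/await
const setImmediateP = () => new Promise((resolve) => setImmediate(resolve));

async function printAllInBatches(upperLimit, batchSize) {
    for (let start = 0; start < upperLimit; start += batchSize) {
        await setImmediateP(); // Yield to the event loop before every batch

        const end = Math.min(start + batchSize, upperLimit);
        for (let i = start; i < end; i++) {
            console.log(i); // Our "expensive" print
        }
    }
}

printAllInBatches(100000, 50);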

Demonstration

Loupe is a little website that helps you visualize how the event loop works.

Try running the scripts below to see how they behave.

Just as a reminder: the purpose of this is to demonstrate what happens when we get a new task AFTER we flooded the event loop with tasks.

Normal asynchronous code:

function printNumber(c) {
    setTimeout(function printing() {
        console.log(c);
    }, 0);
}

for (var i = 0; i < 5; i++) {
    printNumber(i);
}

// Pushes the task as soon as the loop finishes
setTimeout(function newTask() {
    console.log('NEW TASK');
}, 0);

In batches:

function printNumber(c) {
    console.log(c);
}

var upperLimit = 10; // How many numbers we want
var batchSize = 3; // How many numbers for each batch
var currBatch = 0; // Which batch are we in?

(function pushBatch() {
    setTimeout(function computingBatch() {
        // Where do we start in this batch
        var batchStart = currBatch * batchSize;
        if (batchStart >= upperLimit) { return; }

        // Where does this batch end?
        var batchEnd = (currBatch + 1) * batchSize;
        if (batchEnd > upperLimit) { batchEnd = upperLimit; }

        for (var i = batchStart; i < batchEnd; i++) {
            printNumber(i);
        }

        currBatch++;
        pushBatch();
    }, 0);
})();

// Putting a new task in the event loop
setTimeout(function newTask() {
    console.log('NEW TASK');
}, 0);


AWS ECS: Fixing agent container not running and exit code 5


I had a great time this week when our instance on Amazon imploded for strange reasons. Sadly, I don't have the logs anymore to post them here, but these were the key problems (or what I remember from the event) and their solutions:

The Docker daemon suddenly stopped responding to some commands, such as docker ps

Solution: I reinstalled Docker and the AWS ECS Container Agent.

Docker was now responding, but the container agent would not run

By running it with the -it flags, we noticed it was exiting with code 5.

Solution: Delete the file /var/lib/ecs/data/ecs-agent-data.json and start the container agent again.


Fixing memory problems with Node.js and Mongo DB


Now that the basic functionality of Rutilus is done, I spent some time working around the memory limitations that we faced. In this post I will list the problems we ran into and how I solved them.

Observation: We were using Mongoose for these queries, and not the native Node.js driver.

1- Steps in the aggregation pipeline taking too much memory

From the MongoDB manual:

"Aggregations are operations that process data records and return computed results. MongoDB provides a rich set of aggregation operations that examine and perform calculations on the data sets. Running data aggregation on the MongoDB instance simplifies application code and limits resource requirements."

So, obviously, a pipeline such as the one below would need to have memory available to perform all those stages:

ZipCodes
  .aggregate([
    { $group: {
      _id: { state: "$state", city: "$city" },
      pop: { $sum:  "$pop" }
    }},
    { $sort: { pop: 1 }},
    { $group: {
      _id : "$_id.state",
      biggestCity:  { $last:  "$_id.city" },
      biggestPop:   { $last:  "$pop"      },
      smallestCity: { $first: "$_id.city" },
      smallestPop:  { $first: "$pop"      }
    }},
    { $project: {
      _id: 0,
      state: "$_id",
      biggestCity:  { name: "$biggestCity",  pop: "$biggestPop"  },
      smallestCity: { name: "$smallestCity", pop: "$smallestPop" }
    }}
  ])
  .exec((err, docs) => {
    ...
  });

The problem we were having in this case was: we did not have enough memory to perform the stages, even though we did have enough memory for the output. In other words: the output was small and concise, but we needed a lot of memory to produce it.

The solution for this was easy: we can simply tell Mongo to use disk space temporarily to store the data. It is probably slower, but it is better than not being able to run the query at all. To do this, we just needed to add an extra call (allowDiskUse) to that method chain:

ZipCodes
  .aggregate([
    ...
  ])
  .allowDiskUse(true) // < Allows MongoDB to use the disk temporarily
  .exec((err, docs) => {
    ...
  });

2- Result from aggregation pipeline exceeding maximum document size

For queries with a huge number of results, the aggregation pipeline would greet us with the lovely error "exceeds maximum document size". This is because the result of an aggregation pipeline is returned in a single BSON document, which has a size limit of 16MB.

There are two ways to solve this problem:

1- Piping the results to another collection and querying it later (a rough sketch of this is shown below)

2- Getting a cursor to the results and iterating through them
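
The first option works by adding an $out stage at the end of the pipeline, which tells MongoDB to write the results into another collection instead of returning them. A rough sketch (the collection name here is just an example):

ZipCodes
  .aggregate([
    ...
    { $out: "zipCodeStats" } // < Writes the results into this collection
  ])
  .allowDiskUse(true)
  .exec((err) => {
    // The results can now be queried from the "zipCodeStats" collection
  });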

I picked the second method, and this is how I used it:

const cursor = ZipCodes
  .aggregate([
    ...
  ])
  .allowDiskUse(true)
  .cursor({ batchSize: 1000 }) // < Important
  .exec(); // < Returns a cursor

// The method .toArray of a cursor iterates through all documents
// and loads them into an array in memory
cursor.toArray((err, docs) => {
  ...
});

The batchSize refers to how many documents we want returned in every batch; according to the MongoDB documentation, modifying it usually does not affect the application, since the driver returns the results as if they had come in a single batch.
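
If loading everything into an array at once is still too much, the cursor can also be consumed one document at a time. Depending on the driver/Mongoose version, something along these lines should work (this is an assumption about the cursor API, not what we ended up using):

// Iterates through the documents one by one, without keeping all of them in memory
cursor.each((err, doc) => {
  if (doc === null) {
    // The cursor is exhausted: we are done
    return;
  }
  // Process a single document here
});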

3- JavaScript Heap out of memory

After getting those beautiful millions of rows from the aggregation pipeline, we were greeted by another lovely error: "FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory". This happens when the Node.js heap runs out of memory (as you probably inferred from the description of the error).

According to some sources on the internet, the default memory limit for Node.js on 32-bit systems is 512MB, and 1GB for 64-bit systems. We can increase this limit when launching the Node.js application with the option --max_old_space_size, specifying how much memory we want in MB. For example:

node --max_old_space_size=8192 app.js

This will launch the app.js application with 8GB of memory available instead of 1GB.
