blog

How not to flood you background jobs (even loop) with node

I recently encountered a problem in our program that could not be properly solved by synchronous methods, nor asynchronous method - but it worked nicely when I somewhat mixed both. I thought this was an interesting problem and I am going to describe it in this post.

Consider the following: you have a list of 100 million entries in an array, and you have to iterate through all of them in your Node.js server. How would you do it?

Let's say, in this case, that our list is just a sequence from 0 to 100,000,000 and I want to print them on the screen. Pretend for a moment that this "print" is a method that requires lots of computation.

The simplest answer would be using a forEach method or a for loop like this:

function printNumber(c) {
    console.log(c);
}

for (let i = 0; i < 100000; i++) {
    printNumber(i);
}

That's nice. It certainly works. But if you are doing this in a server, you will block any requests until this job is done. If we ran the code below, it is obvious that the "new task" would be printed only when we finish with all the "expensive" operations.

function printNumber(c) {
    console.log(c);
}

for (let i = 0; i < 100000; i++) {
    printNumber(i);
}

// Putting a new task in the even loop
console.log('=========================== NEW TASK');

Fine. So what do we do? Asynchronous, right? That's the killer feature of node! Let's try this:

function printNumber(c) {
    const rand = Math.random(); // Just giving some randomness to the output time
    setTimeout(() => {
        console.log(c);
    }, rand * 10);
}

for (let i = 0; i < 100000; i++) {
    printNumber(i);
}

// Putting a new task in the even loop
setImmediate(() => {
    console.log('=========================== NEW TASK');
});

The idea here is the same: we'll print all the numbers, but every print will be asynchronous.

Now, that asynchronous task at the end is the important part: what I want to simulate here is what happens if we get another asynchronous task (say, a request) as soon as the loop finishes?

So, what do I expect to see in the output? Probably something like this:

// Beginning of the output
0
3
2
1
4
5
== NEW TASK
7
6
9
10
8
...

But instead, this is what I got:

...
99978
99998
99903
99956
99966
99969
99977
99980
99989
99992
99995
99999
=========================== NEW TASK
// End of output

Well, that's disappointing. The numbers are all out of order, but the new task still had to wait for all of them to finish. Why is that?

To understand what happened, we'll have to understand how Node's event loop works. I will try to make this very simple: in a conventional programming language, we would have the stack - a data structure that keeps track of where the execution is. In javascript, you also have the event loop (also called callback queue): a queue with "things to do when the stack is clear". When the stack gets cleared, we pull something from the event loop and execute it, it is basically a list of things "to do when you finish what you are doing now". When we use setTimeout or setImmediate, we are pushing an instruction to this list.

Here is the problem with what happened there: we made a for loop that pushed 100,000 instructions in the queue all at once, almost immediately. At this point, the event loop looked like this:

eventLoop = [printNumber(1), printNumber(2) ... printNumber(99998), printNumber(99999)]

And then, a little later, we pushed the new task:

eventLoop = [printNumber(1), printNumber(2) ... printNumber(99998), printNumber(99999), newTask]

In order for the new task to be executed, all the prints had to be finished. So even though we made the code asynchronous, by flooding the event loop, it became synchronous again.

Solution

To fix this, we must avoid flooding the event loop with all the instructions at once. Instead, this is what we can do: we will push only a fraction of those instruction (maybe 50 or so), and then when they are all done, we will push more 50, and so on. This will ensure that, if we receive any additional task in the event loop, it will fit "in between" those prints, and not at the end. Something like this:

eventLoop = [printNumber(1) ... printNumber(49) newTask]

And after printNumber(49) is called, it will be:

eventLoop = [newTask, printNumber(50) ... printNumber(99)]

This is relatively simple to do:

function printNumber(c, cb) {
    const rand = Math.random();
    setTimeout(() => {
        console.log(c);
        cb();
    }, rand * 10);
}

const upperLimit = 100000; // How many numbers we want
const batchSize = 50; // How many numbers for each batch
let currBatch = 0; // Which batch are we in?

(function pushBatch() {

    // Where do we start in this batch
    const batchStart = currBatch * batchSize;
    if (batchStart >= upperLimit) { return; }

    // Where does this batch end?
    let batchEnd = (currBatch + 1) * batchSize;
    if (batchEnd > upperLimit) { batchEnd = upperLimit; }

    // How many tasks do we still have to do?
    let tasksToDo = batchSize;

    for (let i = batchStart; i < batchEnd; i++) {

        // Printing the number, and passing a callback to it:
        // if this was the last number to be printed, we start
        // the next batch
        printNumber(i, () => {
            tasksToDo--;
            if (tasksToDo < 1) {
                currBatch++;
                pushBatch();
            }
        });
    }
})();

// Putting a new task in the even loop
setImmediate(() => {
    console.log('=========================== NEW TASK');
});

And finally, here is the output:

...
1507
1518
1529
1536
=========================== NEW TASK
1467
1475
1497
1503
1517
1519
...

The problem with this is: it is slow. All these 100,000 functions are being called, then pushed in the event loop, then moved to the stack, and then executed. All individually. We can change it slightly to make it more performant: instead of making every single print asynchronous (they'll follow one after the other in the event loop anyway), we'll make the batch asynchronous:

function printNumber(c, cb) {
    console.log(c);
}

const upperLimit = 100000; // How many numbers we want
const batchSize = 50; // How many numbers for each batch
let currBatch = 0; // Which batch are we in?

(function pushBatch() {
    setImmediate(() => {
        // Where do we start in this batch
        const batchStart = currBatch * batchSize;
        if (batchStart >= upperLimit) { return; }

        // Where does this batch end?
        let batchEnd = (currBatch + 1) * batchSize;
        if (batchEnd > upperLimit) { batchEnd = upperLimit; }

        // How many tasks do we still have to do?
        let tasksToDo = batchSize;

        for (let i = batchStart; i < batchEnd; i++) {
            printNumber(i);
        }

        currBatch++;
        pushBatch();
    });
})();

// Putting a new task in the even loop
setImmediate(() => {
    console.log('=========================== NEW TASK');
});

This script is MUCH faster, and produces similar output.

Demonstration

Loupe is a little website that helps you visualizing how the event loop works.

Try running the scripts below to see how they behave.

Just as a reminder: the purpose of this is to demonstrate what happens when we get a new task AFTER we flooded the event loop with tasks.

Normal asynchronous code:

function printNumber(c) {
    setTimeout(function printing() {
        console.log(c);
    }, 0);
}

for (var i = 0; i < 5; i++) {
    printNumber(i);
}

// Pushes the task as soon as the loop finishes
setTimeout(function newTask() {
    console.log('NEW TASK');
}, 0);

In batches:

function printNumber(c, cb) {
    console.log(c);
}

var upperLimit = 10; // How many numbers we want
var batchSize = 3; // How many numbers for each batch
var currBatch = 0; // Which batch are we in?

(function pushBatch() {
    setTimeout(function computingBatch() {
        // Where do we start in this batch
        var batchStart = currBatch * batchSize;
        if (batchStart >= upperLimit) { return; }

        // Where does this batch end?
        var batchEnd = (currBatch + 1) * batchSize;
        if (batchEnd > upperLimit) { batchEnd = upperLimit; }

        // How many tasks do we still have to do?
        var tasksToDo = batchSize;

        for (var i = batchStart; i < batchEnd; i++) {
            printNumber(i);
        }

        currBatch++;
        pushBatch();
    }, 0);
})();

// Putting a new task in the even loop
setTimeout(function newTask() {
    console.log('NEW TASK');
}, 0);