blog

Prototyping a calculated field on MongoDB for quick access

The next phase of our project will be a content recommendation system for the users who visit our website: we will consider their past preferences (article category, for example) in order to recommend new content. This system needs to be fast and not use the database unnecessarily, since it will be used for every visit of every user. Considering that all the data we gather from our users are spread among several collections in our database, we cannot afford to make an expensive, slow operation with joins; we need a way to make this operation fast and cheap.

Calculated values are a great way to turn expensive and slow operations into very simple queries, however, they have a drawback: how to keep them synchronized? Our solution for this problem was using a collection that contains all the hits made by a user, which we called a "session" (a session contains many hits); every time the user makes a new hit, we use the information from this hit to improve the history we have in the session - it also ensures that the calculated fields will always be up to date.

For example, assuming this is our current history for the user:

Session

{
    history: {
        visitedIds: [1, 2, 3, 4, 5, 6, 7, 8],
        articlesVisited: 5,
        videosVisited: 3,
    }
}

The history says that we visited 5 articles and 3 videos; the IDs (of the articles and videos, assuming they are stored in the same collection) visited are 1, 2, 3, 4, 5, 6, 7, and 8.

If the user makes another hit in another article (say article #9), the history in the user's session would be changed to:

Session

{
    history: {
        visitedIds: [1, 2, 3, 4, 5, 6, 7, 8, 9],
        articlesVisited: 6,
        videosVisited: 3,
    }
}

Changes like these are very easy to do with MongoDB. For pushing the new ID in the array, we can simply use the $push (not unique values) or the $addToSet (unique values) operator:

db.sessions.update({
    _id: <session id>
}, {
    $addToSet: {
        "history.visitedIds": <article id>
        // In our case, the article id would be "9"
    }
});

Likewise, it is easy to increment values, like for articles visited using the $inc operator:

db.sessions.update({
    _id: <session id>
}, {
    $inc: {
        "history.<field to increment>": 1
        // In our case, the field to increment would be "articlesVisited"
    }
});

Joining them together:

db.sessions.update({
    _id: 
}, {
    $addToSet: {
        "history.visitedIds": <article id>
    },

    $inc: {
        "history.<field to increment>": 1
    }
});

This takes care of maintaining the calculated fields up to date with a simple operation.

Now we get to another detail: the calculated field we are keeping is not in the exact format we want it to have; for example: instead of just the raw numbers of visits a user made couldn't we have it in percentage? This would help us to group them in clusters, if we so desire; for example:

Session 1

{
    history: {
        visitedIds: [1, 2, 3, 4, 5, 6],
        articlesVisited: 4,
        videosVisited: 2,
    }
}

Session 2

{
    history: {
        visitedIds: [1, 2, 3],
        articlesVisited: 2,
        videosVisited: 1,
    }
}

Despite the user from Session 2 having less visits than the user from Session 1, their preferences are actually similar: they visited twice more articles than videos. we could abstract these preferences like this:

Session 1

{
    history: {
        visitedIds: [1, 2, 3, 4, 5, 6],
        articlesVisited: 4,
        videosVisited: 2,
    },
    affinity: {
        articles: 66.666,
        videos: 33.333,
    }
}

Session 2

{
    history: {
        visitedIds: [1, 2, 3],
        articlesVisited: 2,
        videosVisited: 1,
    },
    affinity: {
        articles: 66.666,
        videos: 33.333,
    }
}

This could be done after we pull the data, on the server, or directly in the database. If we do it in the database, we can use the aggregation framework on MongoDB to make this calculation:

First, we get the total number of visits. For this, we can use the $project operator to sum the number of visits on articles and videos:

db.test.aggregate([
    { $project: {

        _id: 1, // Keep the ID

        history: 1, // Keep the history

        // Creating the "totalVisits" field by adding the visits together
        totalVisits: { $add: [
            "$history.articlesVisited",
            "$history.videosVisited"
        ]}
    }}
])

This would be the result:

Session 1

{
    history: {
        visitedIds: [1, 2, 3, 4, 5, 6],
        articlesVisited: 4,
        videosVisited: 2,
    },
    totalVisits: 6,
}

Now that we have the total of visits, we can do some arithmetic ($multiply and $divide for multiplication and division) to find the percentage of the categories with another $project:

db.test.aggregate([
    { $project: {
        _id: 1,
        history: 1,
        totalVisits: { $add: [
            "$history.articlesVisited",
            "$history.videosVisited"
        ]}
    }},

    { $project: {

        _id: 1,

        history: 1,

        // We don't project the totalVisits here, if we want to omit it

        affinity: {
            articles: { $multiply: [
                { $divide: [
                    "$history.articlesVisited", "$totalVisits"
                ]},
                100
            ]},

            videos: { $multiply: [
                { $divide: [
                    "$history.videosVisited", "$totalVisits"
                ]},
                100
            ]}
        }
    }}
])

And this will be the result:

{
    history: {
        visitedIds: [1, 2, 3, 4, 5, 6],
        articlesVisited: 4,
        videosVisited: 2,
    },
    affinity: {
        articles: 66.333,
        videos: 33.333,
    }
}

In this example, the categories were "hard coded": we will have more than "articles" and "videos", but this example was only to show that what we are envisioning can be done: we only need a more elaborated schema and a more intelligent algorithm.