User Affinity Tool: grouping and finding patterns for users

One of the last steps in our project is building the user affinity tool. In this post I will explain what it is going to do and how it works; there are some details I cannot disclose, but I will try to describe the general idea behind this tool.

Let's suppose we have some users in our database, and we are recording a history of what they read on our website:

users: [
    {
        name: "John",
        articlesVisited: [
            {
                tag: "Engineering",
                title: "Nanomaterials"
            }, {
                tag: "Engineering",
                title: "Nanoscale Sensors"
            }, {
                tag: "Engineering",
                title: "Challenges of Nanotechnology"
            }, {
                tag: "Arts",
                title: "Origins of Music"
            }
        ]
    }, {
        name: "Mark",
        job: "Artist",
        articlesVisited: [
            {
                tag: "Arts",
                title: "Syncing Music to Video"
            }, {
                tag: "Arts",
                title: "Music Lessons"
            }
        ]
    }, {
        name: "Lynda",
        job: "Engineer",
        articlesVisited: [
            {
                tag: "Engineering",
                title: "Nanomaterials"
            }, {
                tag: "Engineering",
                title: "Nanoscale Sensors"
            }, {
                tag: "Engineering",
                title: "Challenges of Nanotechnology"
            }, {
                tag: "Engineering",
                title: "Milling Processes"
            }, {
                tag: "Engineering",
                title: "3D Printing for Manufacturing"
            }, {
                tag: "Engineering",
                title: "The Automotive Industry"
            }, {
                tag: "Arts",
                title: "Music Lessons"
            }, {
                tag: "Arts",
                title: "Origins of Music"
            }
        ]
    }, {
        name: "Mary",
        job: "Artist",
        articlesVisited: [
            {
                tag: "Engineering",
                title: "The Automotive Industry"
            }, {
                tag: "Arts",
                title: "Music Lessons"
            }, {
                tag: "Arts",
                title: "Origins of Music"
            }
        ]
    }
]

Let's also assume that we have an anonymous user visiting our website - we don't have any information about him, but we are tracking his browsing history via cookies. This is what his history looks like:

{
    articlesVisited: [
        {
            tag: "Arts",
            title: "Music Lessons"
        }, {
            tag: "Arts",
            title: "Origins of Music"
        }
    ]
}

Here is the challenge of working with big data: recording and keeping a bunch of information is easy; the hard part is turning it into something useful. What can we do with information like this? How can we turn it into something beneficial for the users and for the website?

There are two routes we took with our user affinity tool to work with this kind of data:

1- Generate recommendations based on the user history: if we know what the user is interested in and what the user looks like, we can recommend content based on this

2- Make unknown users known: based on browsing patterns, we could assume characteristics for the users even when they haven't provided them

The question now is: how are we going to categorize the users? We have several possibilities depending on what we are recording, and we don't have to pick only one. We could categorize them based on their author preference, the tags of the articles, what sections they visit, among other characteristics. Since I am only using tags in this example, I'll use these tags to form clusters of users.

To group the users based on their tag preference, I will calculate, for every user, each tag's share of that user's total number of visits. For example: if the user visited 10 articles, where 8 articles were about engineering and 2 articles were about arts, the engineering tag will receive 80% and the arts tag will receive 20%.

User name          Engineering   Arts
John               75%           25%
Lynda              75%           25%
Mary               33.3%         66.7%
Mark               0%            100%
-anonymous user-   0%            100%
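
As a minimal sketch of how the percentages in this table could be computed (the function name tag_shares is my own, and the shares are returned as fractions rather than percentages), counting tags with Python's Counter is enough:

```python
from collections import Counter

def tag_shares(articles_visited):
    """Return each tag's share of the visits, as a fraction between 0 and 1."""
    counts = Counter(article["tag"] for article in articles_visited)
    total = sum(counts.values())
    if total == 0:
        return {}  # no history yet, so there are no shares to report
    return {tag: count / total for tag, count in counts.items()}

# John's history from the sample data above.
john = [
    {"tag": "Engineering", "title": "Nanomaterials"},
    {"tag": "Engineering", "title": "Nanoscale Sensors"},
    {"tag": "Engineering", "title": "Challenges of Nanotechnology"},
    {"tag": "Arts", "title": "Origins of Music"},
]
john_shares = tag_shares(john)
print(john_shares)  # {'Engineering': 0.75, 'Arts': 0.25}
```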

We are starting to see some patterns here, aren't we? Notice that:

1- John and Lynda have very similar browsing patterns

2- Mark and the anonymous user have very similar browsing patterns

3- Mark and the anonymous user are more similar to Mary than John and Lynda are

Based on these characteristics, we could generate some scores to rank users based on their similarities (where 10 means "identical" and 0 means "completely different"). Let's suppose these are the scores the users got:

                   John   Lynda   Mary   Mark   -anonymous user-
John               -      10      4      2      2
Lynda              10     -       4      2      2
Mary               4      4       -      6      6
Mark               2      2       6      -      10
-anonymous user-   2      2       6      10     -
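
The scores above are only illustrative, and I am not describing the real scoring here. One simple stand-in, assuming each user is reduced to tag shares, is to take the total variation distance between two users and map it onto the 0 to 10 scale. The function name and the choice of distance are my own assumptions, and the results will not match the table exactly (John and Mary get roughly 5.8 instead of 4), although the ordering is the same.

```python
def similarity(shares_a, shares_b):
    """Score two tag-share dicts from 0 (no overlap) to 10 (identical)."""
    tags = set(shares_a) | set(shares_b)
    # Total variation distance: half the sum of absolute differences, in [0, 1].
    distance = sum(abs(shares_a.get(t, 0.0) - shares_b.get(t, 0.0)) for t in tags) / 2
    return round(10 * (1 - distance), 1)

john = {"Engineering": 0.75, "Arts": 0.25}
lynda = {"Engineering": 0.75, "Arts": 0.25}
mark = {"Arts": 1.0}

print(similarity(john, lynda))  # 10.0
print(similarity(john, mark))   # 2.5
```

With more than two tags, a measure such as cosine similarity might behave better. Which one fits depends on the data.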

Making unknown users known

With that information, maybe now we can start assuming characteristics for the users:

First, John and the anonymous user did not specify their jobs, but based on the scores we got, we can assume that:

1- John is probably an engineer (he is very similar to Lynda, who is an engineer), but there is a fairly small chance that he is actually an artist (he is moderately similar to Mary but very different from Mark, and they are both artists).

2- The anonymous user is probably an artist: his browsing history is very similar to Mark's and moderately similar to Mary's, who are artists; there is a very small chance that he is an engineer, since Lynda (who is an engineer) has a very small similarity to him.
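
One way to sketch this reasoning, assuming we simply let every known user "vote" for their own job with a weight equal to their similarity score, is shown below. The function name and the voting rule are hypothetical, and the resulting numbers are relative weights, not calibrated probabilities.

```python
from collections import defaultdict

def job_likelihoods(similar_users):
    """Turn (job, similarity score) pairs into a normalized weight per job."""
    weights = defaultdict(float)
    for job, score in similar_users:
        if job is not None:  # users without a stated job cannot vote
            weights[job] += score
    total = sum(weights.values())
    if total == 0:
        return {}
    return {job: weight / total for job, weight in weights.items()}

# John's scores from the table above: Lynda 10, Mary 4, Mark 2.
# The anonymous user has no stated job, so that entry is ignored.
john_neighbours = [("Engineer", 10), ("Artist", 4), ("Artist", 2), (None, 2)]
john_jobs = job_likelihoods(john_neighbours)
print(john_jobs)  # {'Engineer': 0.625, 'Artist': 0.375}
```

A plain weighted vote like this gives the artists more weight than the "fairly small chance" described above would suggest. Squaring the scores, or only counting the closest users, would push the result further towards the most similar people.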

Making recommendations

And we can also start recommending content to users based on their browsing history: by looking at what articles other people who are similar to them visited, we can take the articles that they read but our user didn't, and recommend them.

For example, with the user John: John visited the articles Nanomaterials, Nanoscale Sensors, Challenges of Nanotechnology, and Origins of Music.

1- John is very similar to Lynda, who visited the articles Nanomaterials, Nanoscale Sensors, Challenges of Nanotechnology, Milling Processes, 3D Printing for Manufacturing, The Automotive Industry, Music Lessons, and Origins of Music. Notice that Lynda visited some articles that John didn't: Milling Processes, 3D Printing for Manufacturing, The Automotive Industry, and Music Lessons. Since John and Lynda are so similar, we can recommend these articles to John with a very high priority and assume he will be interested in them.

2- John and Mary are moderately similar, and Mary visited two articles that John did not visit: The Automotive Industry and Music Lessons. Although John is less similar to Mary than to Lynda, he has some interest in arts, and we can recommend these articles to him with a lower priority.

3- John is very different from Mark and the anonymous user, but we can recommend some articles from them too, only with a much lower priority.
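
The three steps above could be sketched as follows. This is only one possible rule, assuming an article's priority is the sum of the similarity scores of everyone who read it. The function name and the summing rule are my own choices, and the scores come from the illustrative table.

```python
def recommend(target_titles, neighbours):
    """Rank unread titles by the summed similarity of the users who read them.

    neighbours is a list of (similarity score, titles that user visited) pairs.
    """
    already_read = set(target_titles)
    priority = {}
    for score, titles in neighbours:
        for title in set(titles) - already_read:
            priority[title] = priority.get(title, 0) + score
    # Highest priority first; ties are broken alphabetically for a stable order.
    return sorted(priority.items(), key=lambda item: (-item[1], item[0]))

john = ["Nanomaterials", "Nanoscale Sensors",
        "Challenges of Nanotechnology", "Origins of Music"]
neighbours = [
    (10, ["Nanomaterials", "Nanoscale Sensors", "Challenges of Nanotechnology",
          "Milling Processes", "3D Printing for Manufacturing",
          "The Automotive Industry", "Music Lessons", "Origins of Music"]),  # Lynda
    (4, ["The Automotive Industry", "Music Lessons", "Origins of Music"]),   # Mary
    (2, ["Syncing Music to Video", "Music Lessons"]),                         # Mark
    (2, ["Music Lessons", "Origins of Music"]),                               # anonymous user
]
ranking = recommend(john, neighbours)
for title, score in ranking:
    print(score, title)
```

With these numbers, Music Lessons comes first because every neighbour read it, and Syncing Music to Video comes last because only Mark did.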

Identity dilemma: what the users say they are, and what they look like

Suppose we have another user called Jane with this profile:

{
    name: "Jane",
    job: "Artist",
    articlesVisited: [
        {
            tag: "Engineering",
            title: "Nanomaterials"
        }, {
            tag: "Engineering",
            title: "Nanoscale Sensors"
        }, {
            tag: "Engineering",
            title: "Challenges of Nanotechnology"
        }, {
            tag: "Engineering",
            title: "Milling Processes"
        }, {
            tag: "Engineering",
            title: "3D Printing for Manufacturing"
        }
    ]
}

As you can see, we are in trouble: Jane says she is an artist, but her browsing history says she is an engineer. Do we recommend articles that artists like Mark visited, or do we recommend articles that engineers like Lynda visited?

We could separate these two identities of the user into "what the user says they are" and "what the user actually looks like" - I am going to call the first one a persona and the second one a profile.

There is no right answer to this, but there are some ways out of this problem. I think the easiest ones are:

1- What is the degree of certainty that we have when we say that Jane looks like an engineer? Reading 1 article about engineering doesn't mean she actually looks like an engineer; but if she read 100 articles, all of them about engineering, then it's much safer to say that she looks like an engineer.

2- We could reserve a percentage of the recommended articles for the persona, and another for the profile.
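
Both ideas can be combined in a small sketch: a confidence value that grows with the length of the browsing history decides what share of the recommendation slots goes to the profile, and the rest goes to the persona. Everything here is an assumption made for illustration: the function names, the n / (n + half_point) formula, the half_point of 5 articles and the example lists.

```python
def profile_confidence(articles_read, half_point=5):
    """Grow from 0 towards 1 as the history gets longer; 0.5 at half_point articles."""
    return articles_read / (articles_read + half_point)

def blend(persona_recs, profile_recs, slots, profile_share):
    """Fill `slots` recommendations, reserving `profile_share` of them for the profile."""
    profile_slots = round(slots * profile_share)
    picked = list(profile_recs[:profile_slots])
    # Fill the rest from the persona list, then from any leftover profile items.
    for title in list(persona_recs) + list(profile_recs[profile_slots:]):
        if len(picked) >= slots:
            break
        if title not in picked:
            picked.append(title)
    return picked

# Hypothetical lists for Jane: what artists read vs. what engineers read.
persona_recs = ["Syncing Music to Video", "Music Lessons", "Origins of Music"]
profile_recs = ["The Automotive Industry", "Music Lessons"]

share = profile_confidence(5)  # Jane read 5 engineering articles, so 0.5
jane_recs = blend(persona_recs, profile_recs, slots=4, profile_share=share)
print(jane_recs)
```

With 100 engineering articles instead of 5, the same formula would give the profile about 95% of the slots, which matches the intuition in the first point.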



In conclusion, grouping users into clusters in order to assume their characteristics and recommend articles is not necessarily hard, but the algorithms that make the calculations must be finely tuned. Just because one clustering method works for a website doesn't mean it will work for another one; it is important to make it easy for the clients to change their algorithms as they need, and to provide some ways to test the performance of these methods (with some A/B testing, for instance).