Lookalikes: Finding needles in a haystack

Co-Authors: Konstantinos Bozas and Stefano Costantini

Collaborators: Dimitris PapadopoulosVladislav SoldatovMichael Davy

At Schibsted, we use data science to build models that aggregate user behaviors and preferences. Advertisers can combine these to allocate users in groups, known as targeting segments.

Depending on how they are defined, some of these segments can be quite niche, limiting their reach. However, advertisers always look for ways to further extend the potential audience for their message.

What could be better than finding users that behave like the potential customers they are already targeting? It is safe to assume that these users would be more receptive to their message than the average user, thus providing a well-targeted audience. We call these additional users the lookalike users.

This sounds great, but how can we find them?

Our starting point is the information on our users’ browsing behavior. For example, we collect data on different aspects of their visiting our sites, such as the pages they browse, the times they access them and the devices they use. This can help us identify lookalike users.

However, such a wealth of information is both a blessing and a curse. It is a blessing because it provides us with the flexibility to model the behavior of our users in a variety of ways and to choose the set of features that best characterizes their activity. But it is also a curse because, given the high dimensionality of the data, finding the right set of features is challenging. And choosing the wrong set of features would make finding the precious needles in the vast haystack of users all but impossible.

The image below, constructed from already labeled data, illustrates the problem. Each dot in the image is a 2D t-SNE projection of the original features for each user. The blue dots represent an initial set of users — in this case a set of newspaper subscribers — for which we need to find lookalikes. The red dots are the true lookalikes (i.e. other subscribers not included in the initial set) while the yellow dots are generic users that have not subscribed to the newspaper.

t-SNE project of the feature space

Our objective is to pick the true subscribers (the “red”), leaving out the generic users (the “yellow”). As we can see in the picture this appears to be an intractable problem.

In this post, we show how we solved it. It is a pragmatic solution that allows us to make the most of the multiple features available to us, leveraging the information contained in the original seed set.

The lookalikes problem

We can define our objective as follows:

    • Assume u is a user representation in the form of a feature vector in a d-dimensional space.
    • Let
     be the initial set of users, called the ‘seed set’
     a second set of users, where typically n >> m
    • Return ⊆ T that includes the users that look like those in S according to some criteria. In our case, the users’ online activity on Schibsted sites.

In our experiments user representation vectors u are binary, encoding the presence or absence of a particular characteristic of a user’s online behavior.

How to achieve this? We discuss it below, but first we begin with a dilemma: shall we use a supervised or an unsupervised learning approach? Both approaches aim to find L, however each approach does it in a fundamentally different way

In the unsupervised case, the model goes through all the users in S, one by one, and for each it retrieves user(s) from T whose activity looks most similar to that of the given users. Under this approach, the set of ‘lookalike’ users will contain users that are similar to those contained in the initial set. However, the resulting set could be as heterogeneous as the initial set.

In the supervised case, we use information of the users in sets S and T, to train a supervised machine learning algorithm that would identify users similar to those in the initial set. The model would try to learn the implicit characteristics that make the users in the initial seed set stand out in the general population and would use them to find lookalike users. The main advantage of this approach is that provided the users in the initial set share some common latent characteristics, it can be very effective at identifying similar users. However, finding lookalikes is not an ordinary supervised learning problem due to the lack of definitely negative examples. There are ways to get around those issues, such as Positive-Unknown learning, or one-class classification.

From an engineering point of view, the unsupervised approach is much more straightforward to implement and maintain, as there is no need for training, parameter tuning, storage and serving of the models.

K.I.S.S. (Keep It Stupid Simple)

We started with the simplest possible approach we could use: K-Nearest-Neighbors (KNN).

Under this approach, each user in the population T is scored against the input seed set S. We score each user in T by calculating their similarity to each user in S, and then taking the average. So, for each user, the final similarity score is given by:

After calculating the similarity score for each user in T, we can rank them in order of “closeness” to the initial seed set and then pick the top K users with the highest score as the lookalike set.

But, does this approach work? Well, this question brings us back to the issue of data dimensionality that we discussed in the introduction. Specifically, our data has a dimensionality of ~30,000–40,000 (depending on the feature set we choose). This gets us face to face with the curse of dimensionality.

When the feature space is so vast, the available data becomes sparse. As a consequence, it becomes harder to differentiate the activity of each user from that of everybody else, as everybody appears dissimilar from everybody else in some way.

Despite this challenge, we evaluated the vanilla KNN approach with our dataset to establish a baseline. The cosine distance is used for all the experiments, as we found it produces better results compared to other metrics in our data.

The evaluation of a lookalike model is not as straightforward as a typical binary classification problem, due to the absence of negative ground truth. During the early prototyping phases, we overcame this hurdle by obfuscating part of the labels in a fully-known dataset. In this case, we used the traditional classification metrics, such as precision and recall. But this approach only took us so far.

In real life scenarios, such as in the case we consider in this post, negative samples are just not available. So we need an alternative measure of performance. In this case, we chose to use the Mean Average Precision (MAP), which measures the degree to which the model ranks the positive class on top of the lookalike ranking. The image below shows how the MAP calculation works and what an improvement in MAP means. First, suppose we have a lookalike algorithm that provides the ranking shown at the top of the picture, below. We know that the positive samples should be towards the top of the ranking. In this case we find them in positions 1, 5 and 7. Accordingly, the MAP of this ranking is 0.609. Now, suppose now we identify an improvement that changes the ranking as shown at the bottom of the picture. In this case, as the positive samples are more bunched up towards the top (in positions 1, 2 and 5) the MAP goes up to 0.866.

Mean Average Precision (MAP) explained

In our case, the MAP score of the vanilla KNN approach is 0.704This can be compared with the MAP of a random ranking, which is 0.196. It’s not bad, but we can do better. It’s time to turn to ways in which we can handle the high dimensionality of the data better.

Leverage what you know

While the naive approach appears to be able to identify lookalike users to some extent, the original feature space may not be serving our purposes well. We need a way to reduce the dimensionality of the data.

The canonical technique for dimensionality reduction is PCA, however, when we applied it, we didn’t get any improvement to the MAP score. That can be explained because PCA attempts to keep only the orthogonal components with the highest variability and is agnostic to different trends occurring between the seed users activity and the big pool of users.

Instead of using PCA or other dimensionality reduction approaches, perhaps we could improve the performance further if we identified which features are the most valuable to the problem, i.e. features that help differentiate the seed set users from the rest. If we knew this, we could calculate user similarity relying more on the most relevant features.

We start from the assumption that the input seed set contains users that exhibit a specific behavior which we would like to identify in the population at large, while most of the individuals in the population exhibit a generic behavior. By comparing the two sets, we can highlight the differences between them and use this information to identify the features that should be given more importance.

More formally, to measure the importance of each feature, we can employ information theory metrics, such as mutual information. Specifically, Ma et al.found that Information Value (IV) works well: we have followed a similar approach for our problem. The information value is frequently used in variable selection and is defined as:

Information Value score

Where p_i and q_i are the proportions of users for which feature i (f_i) is activated in the seed set S and population set T respectively.

Since our features vectors are binary, we can generate new weighted feature vectors by multiplying the calculated IV weights w to the vectors u. This procedure generates a new feature space with a higher discriminative potential.

We visually demonstrate this by generating the 2D t-SNE projection of the same samples in the new feature space.

t-SNE projection of the of the original features space and the weighted one

As we can see, when we use the IV weighted features, the categories of users appear to be much more separated: the positive test set is now clearly differentiated from the unknown test set. Moreover, we can see the samples belonging to the positive set being grouped together with those in the seed set. As expected, this has a positive impact on the performance of the lookalike algorithm: the MAP on the weighted feature space is 0.791 which is over 12% better than our baseline.

Scaling it up

Around 200 millions users visit Schibsted’s media and marketplace sites daily. Scaling the weighted KNN prototype was a serious challenge and a crucial part of a successful product delivery.

A simplified complexity analysis of our algorithm is as follows:

The cosine similarity between two vectors needs approximately 3d operations (1 pass for the dot product and 1 for each of the vector norms), where is the vector dimensions.

  • for a reference set sample, we need |S| x 3d +|S| operations (one pass to calculate the distance and one to sum all the distances).
  • We already notice that if the seed set size is large then scoring a single sample can get quite expensive.
  • Finally, to rank the whole reference set |T| x (|S|x3d + |S|) operations are required.

Even if the sparse nature of our features means that each distance calculation can be solved in much less than 3d operations, the total computational effort when dealing with several million users is vast.

Luckily, this task can be parallelized.

In our production setup, we use Spark and Scala. We found that we can significantly speed up the process, taking advantage of the sparsity of our features, by implementing a sparse matrix multiplication approach in conjunction with broadcasting the seed set sparse matrix and the seed set norms in Spark. If u_T^i is the feature vector to be ranked, W the seed set sparse matrix, g is a vector holding the seed set norms and v is the norm of u_T^i then we can get the numerator of the cosine distances by multiplying the feature vector with the seed set matrix z = u_T^i^T W and then compute the final score:

Here is the implementation:

We are also aware of approximate nearest neighbour search techniques such as Local Sensitive Hashing (LSH), yet the implementation of the exact search presented above can cope with our user base and generate results in less than one hour in a moderate sized cluster.

Does it work?

A machine learning algorithm shows its real value when performs well in the real world. To check this, we set up two online campaigns: the first showing targeted ads to users belonging to a known segment and the second made up by lookalike users of the first segment. We constructed the second segment by applying the algorithm described above, using the known segment as the seed set.

The lookalike segment was created with the objective to extend the audience of the first segment. However, this is only achievable if the performance of this segment is comparable to that of the original one. The chart below compares the click-through rates (CTRs) of ads shown to the users of the original segment (left) and the lookalike segment (right). The chart shows that the performance of the campaign based on lookalikes was comparable with that of the original campaign.

The original segment performed better on the desktop, while the lookalikes segment showed a higher median CTR on mobile. In both cases though, the difference is well within the margin of error. Most importantly, the performance of the lookalike segment is not significantly worse than that of the original segment. This result indicates that the lookalike algorithm could be used for its intended audience-extension objective. Specifically, in the case above we were able to double the size of the segment while keeping up the CTR performance. All in all, a successful first online test. We are currently carrying out more tests across different campaigns and advertisers and the preliminary results are promising.

Read more from the Data science category