Comparing three solutions for estimating population sizes

In an earlier post we discussed a service to estimate the number of unique visitors to a website. In this article we explore the algorithms we used to build the system: HLL (HyperLogLog) and KMV (Kth Minimal Value) and evaluate each.

When you set up an advertising campaign, it’s crucial to know how many users the campaign will reach. We built an estimation subsystem inside Schibsted’s Audience Targeting Engine (ATE) which can answer questions like:

  • “How many males from Oslo who are interested in sports visit Schibsted websites during a week?”
  • “How many users in Helsinki age 25-35 visit Schibsted websites in a day?”

Example: males from Norway who are interested in sports

The Problem

To estimate the number of unique users for the next week, we can look at historical data for the previous weeks and assume that we will get a similar number of users.

From a technical point of view this is a count-distinct problem. Though we don’t need to have the exact cardinality, it’s essential to have some approximate value.

As we discussed in our earlier article, we have several limitations:

  • We should provide the estimate in real time for any future targeting campaigns. This allows advertising campaign managers to try multiple combinations of targeting parameters and select the best one.
  • In the ATE, we have many values for our targeting parameters. For example, we have more than 100,000 locations, more than 100,000 interests and more than a million search terms, so it’s impossible to calculate in advance all combinations of targeting parameters.

HLL approach

The most popular approach to solve the count-distinct problem is to use the HyperLogLog (HLL) algorithm, which allows us to estimate the cardinality with a single iteration over the set of users, using constant memory.

The popular databases Redis and Redshift as well as the Spark computing framework already have a built-in implementation of this algorithm.

HLL in Spark

Our original approach was to use a Spark cluster. External job server functionality allows us to execute queries on Spark using a REST API. These queries are executed on data that is already cached in memory and disk, so it doesn’t require additional time to load the data from file systems and distribute it across the nodes.

To get an estimate, Spark executes queries using the countApproxDistinct function. For example: to find all males in Oslo of age 25-35, the query is the following.

This approach has an issue: Spark has to filter the whole dataset and only then it can create the corresponding HLL. These queries are very fast on small datasets but take up to several seconds on large datasets like ours. That is usually suitable for ad-hoc analytics, but it is not acceptable in our case, because our goal is to get a response within no more than a couple of seconds.

HLL in Redis

Redis uses a slightly different approach. Rather than calculate an HLL at query time, it stores pre-calculated HLLs as a value in its key-value storage. At query time, it only calculates an estimate based on the HLL. So the query executes within milliseconds.

Also, Redis supports unions of severals HLLs. So in order to select people who have been in Oslo or Helsinki, we can build two HLLs for these cities and compute the union of them. But for intersections, for example where we need to find people who were both in Oslo and Helsinki, it performs poorly in certain cases – for example, when we need to intersect multiple sets or when we need to intersect sets with very different sizes. To understand why exactly this happens we need to learn how the HLL algorithm works.

HLL Algorithm

The algorithm consists of two parts:

  1. In the first part we create a basic HLL array which can be modified if we need to add more users. This is the most time-consuming part.
  2. The second part is the actual evaluation, which is very fast and can be performed at any time.

Calculating the base array

Internally, an HLL is represented as an array M of length 2^b. In order to add an element to the array we need to do 3 steps:

  1. Compute the hashcode of the element.
  2. Split the binary hash value in 2 parts.
    1. The first part of length b represents the position in the array M; we call it i
    2. The second part (we call it w) is used to derive the position of the leftmost digit “1” in binary representation, we call it leftmostPos(w).
  3. Insert the value into the array using the formula M[i] = max(M[i], leftmostPos(w))

In the example we took b=4, so the array M will have 16 elements. We will start with empty array M:

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  1. Let’s say we inserting element x where hash(x) = 0011000110
  2. Split the hash value into 2 parts:
  3. Insert the position of the leftmost 1 = 4 to the 3rd position in the M array (array indexes are starting from zero):
0 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0

Estimation

To get the cardinality estimation we need to calculate the harmonic mean function:

and then finally determine the count:

 

where α is a constant to correct hash collisions.

These formulas may not look very intuitive; for the mathematical proof of the algorithm you can check the original article.

HLL Unions

To calculate unions, we need two arrays M1 and M2 with calculated p(w) values. Based on these two arrays, we calculate a new array M. For each element we apply a formula similar to the one in step 3. M[i] = max(M1[i], M2[i]). This will allow us to get a new base array, so we can perform evaluations on it.

Intersections in HLL

There are two approaches to building intersections on top of HLLs:

The most commons approach is the Inclusion-exclusion principle. It allows us to get the intersection based on original estimates and their unions. The idea is based on the fact that the union of 2 sets is the sum of these set sizes, minus their intersection.

|A ∪ B| = |A| + |B| – |A ∩ B|

This idea can be expanded to any number of sets. For example for three sets the formula looks slightly more complex:

|A ∪ B ∪ C|=  |A| + |B| +  |C|  – |A ∩ B|  – |A ∩ C|  –  |B ∩ C| +  |A ∩ B ∩ C|

Example of 3 set intersection, source: Wikipedia

Another way of calculating intersections is with the Minhash approach. It suggests  building an additional minhash data structure which helps to estimate intersections.

As we will see in the evaluations, the first approach has issues when we need to intersect multiple HLLs, and the second one requires additional time to build minhash structure.

Kth Minimal Value approach

We wanted to find an algorithm that has the same speed and precision as HLL but supports intersections by design.

This type of data structure exists in the datasketches framework and is called a theta sketch. It was developed fairly recently at Yahoo and to our knowledge only the Druid database is using it at the moment. The core of a theta sketch is based on the KMV (Kth Minimal Value) algorithm.

The KMV Algorithm

This algorithm has two steps, similar to HLL.

  1. from each incoming element, calculate a hash value with a floating point number between 0 and 1, and keep only K minimal values.
  2. based on the value of the Kth minimal element, it is possible to make an estimation:

Count = number of samples – 1value of the Kth sample

For example we have only 10 elements and we selected K=3

Hash codes of our elements:

0.02 0.07 0.18 0.20 0.21 0.31 0.56 0.59 0.81 0.96

Because we selected K=3, we keep only 3 elements.

0.02 0.07 0.18

The estimate in this case is

      Count = 3 – 10.18 = 11.1

This estimate is pretty close, but in this case we are just lucky because the value of K is too small for practical purposes.

The best way to pick the value of K is to find a suitable tradeoff between accuracy and sketch size. For example, if we select K=2^12 then the error will be ~3% and the size of the sketch will be no more than 61Kb.

Unions

In order to calculate a union we need to have several sets with K (or less) minimal values in each. Then we can combine them into a single set that will contain the K minimal values of all the lists.

For example, we can take the set with 10 elements from the previous example and union it with the set which has 5 elements in common, so the expected estimate of the union is 15.

Set 1:

0.02 0.07 0.18 0.20 0.21 0.31 0.56 0.59 0.81 0.96

Set 2:

0.02 0.12 0.18 0.20 0.21 0.46 0.61 0.66 0.81 0.82

We have the same value of K=3, as we had in previous example, so we keep only 3 minimal values from each set.

Set 1:

0.02 0.07 0.18

Set 2:

0.02 0.12 0.18

Now we can select 3 minimal unique numbers out of these 2 sets.

Union set:

0.02 0.07 0.12

Count = 3 – 10.18 = 16.7, where original estimate was 15.

Intersections

Intersections can be calculated as an intersection of two arrays with K minimal values. In this case K may become smaller, but if the intersection is big enough this is not a problem.

For example, we can take the same two sets of 10 element each with 5 elements in common from the previous example. In this case, the expected estimate is 5.

We have the same value of K=3 as we had in previous example, so we keep only 3 minimal values.

Set 1:

0.02 0.07 0.18

Set 2:

0.02 0.12 0.18

After intersection, we keep only duplicated elements. In this case there are only two of them, so our K reduces to 2.

Intersected set:

0.02 0.18

Count = 2 – 10.18 = 5.6, where expected estimate was 5.

You can read more about KMV in this KMV Tutorial and original Datasketches paper.

Evaluation

Our ATE components are implemented in Scala, so we were looking for a JVM library that is compatible with our ecosystem.

In our experiments we compared different approaches to get estimates:

  • HLLs with intersections using inclusion-exclusion principle implemented in the algebird library
  • HLLs with intersection using minhash approach implemented by Cantor
  • Theta sketches implemented in the Datasketches library.

Metrics

Our goal was to build an accurate, high performance system, so the most important metrics for us were:

  • Accuracy
  • Time to compute an estimate (query time).

Accuracy in our case means relative error in percent, which can be described as

error = (measured – truth) / truth

HLLs and Theta sketches provide theoretical estimations of this metric, which depend on the size of the data structure. The more accurate the estimate, the more memory it uses. However, in some cases, especially involving intersections, there are no theoretical numbers, so it is useful to get experimental results.

The second important metric is the response time for a user, i.e. how quickly we can compute an estimate from HLLs or Theta sketches. This includes the time that is required to perform a union or an intersection of several sketches.

Memory usage is less important because all solutions are constant in memory, and accuracy is usually a tradeoff with memory consumption. Time to build the initial data structure is also not very important because we can always precalculate that data structure.

Experiments

When an advertising campaign is set up to target only males, it reaches millions of users. If it targets people with a specific interest, it is just a couple of thousand users. But it’s important in both cases to get fast and precise estimates.We conducted a large number of experiments with HLLs and datasketches and the experiments below are the most representative.

To make sure we could compare relative error and execution time of the algorithms we set up their internal parameters to have a 1% theoretical error.

To make the experiments more descriptive and flexible to set up, the algorithms were conducted on various sets of randomly generated data. The experiments were executed on AWS c4.xlarge instances.

Experiment 1: One category

Some advertising campaigns are very simple and target just one category. In this case, we build an HLL/datasketch with from 100 to several millions elements and query it. All the experimental values were below the theoretical 1% and the execution time for all algorithms was less than 0.1 ms.

Experiment 2: Unions of multiple groups of users

In the first article, we explained how we use a grid which can have up to several thousand points to target users by geographical coordinates. Each point has a relatively small number of users, but to get a final estimate we need to union all of them.

Unions of 1000 groups of 1000 unique users:

Algorithm Relative Error Execution time
HLL with inclusion/exclusion 0.25 % 60 ms
HLL with minhash 0.05 % 400 ms
Theta sketches 0.20 % 30 ms

As we can see from the table this estimate is very precise and very fast in most of the cases.

Experiment 3: Intersection of several large groups of users

Another common case is when we want to target users with several targeting parameters. For example, it can be a popular website in a country where we target males of a specific age who are interested in sports. So we get an intersection of several groups with many users.

Execution time of intersections

Intersection of 20 groups, 1 million users each:

Relative Error for 20 intersected groups Execution time for 20 intersected groups
HLL with inclusion/exclusion 0.35 % 30 seconds
HLL with minhash 0.60 % 700 ms
Theta sketches 0.20 % 50 ms

The precision of all algorithms is very good but in this case, HLL with inclusion/exclusion struggles with execution time.

Experiment 4: Intersection of a small and a large group of users

This represents the case when we want to target a relatively small group of users with several targeting parameters. For example, females in a Norwegian town. The group “females” will have millions of users, but a Norwegian town may have only a few thousand people.

Precision of intersections

Relative Error for intersection of 1000 with 1 mln users Execution time for intersection of 1000 with 1 mln users
HLL with inclusion/exclusion 34 % 4.4 ms
HLL with minhash 11 % 78 ms
Theta sketches 13 % 3.2 ms

In this experiment, opposite to Experiment 3, all implementations turn out to be very fast, but the estimates in the corner cases are not precise enough.

Conclusions

The results of our experiments show that in general, all algorithms have good results both for accuracy and execution time. In simple use-cases when we have only one category or union of several categories, there is no significant difference between algorithms. However, in certain cases especially with intersections, execution time and accuracy was not ideal for the HLL-based solutions.

For our service, we chose the theta framework from Yahoo’s Datasketches library because it shows the most stable results both for accuracy and execution in different complex cases.

Read more from the Software engineering category
SUBSCRIBE TO OUR UPDATES
Menu