
DataSketches: The secret sauce behind population estimation

We built a service to estimate the number of website visitors reached by new audience segments in real time, for queries with any combination of user attributes. Here’s how we did it.

By Manuel Weiss, with help from the ATE team

Audience targeting and segments

Schibsted’s Audience Targeting Engine (ATE) allows us to target advertising based on a user’s attributes, like their age, gender, search history, location, etc.

ATE’s Segment Manager allows the creation of audience “segments”. A segment represents a group of users and is defined by a set of attributes that an advertiser is interested in (e.g. users aged 40-45, based in London).  These segments can then be attached to an advertisement campaign.

When an ad campaign manager sets up a segment to target a certain group of users, it is important to know the approximate size of that user group so that the number of impressions can be estimated.

Example of a segment: users interested in business and technology, aged 40-45, based in London

The problem

To calculate the number of users matching a segment, we can look at historical data and compute how many unique users we have observed over a given time period (last week, for example) that would fall into this segment. (This is also known as the “count-distinct problem”.)

This is fairly easy to do for existing segments. But the estimate is most crucial when a new segment is created or an existing one is changed, and it takes a while to go through the historical data and count all users matching the newly created segment. Nor can we precompute all possible segments, as they can consist of any combination of attributes. Additionally, it’s possible to upload custom lists of user IDs, which can then be combined (i.e. intersected) with other targeting attributes; such a list is obviously unknown until it is uploaded.

Example of unions and intersections for a segment

The solution to that problem is to record the unique set of users separately for every criterion and then do the unions and intersections as needed when a new segment is created or changed. To give an example, let’s assume the following segment:

  • Gender: Male
  • Age groups: 36-45, 46-55
  • Location: Oslo, Bergen, Stavanger
  • Interests: Porsche, Tesla

We will have 8 sets of users: all male users, users aged 36 to 45 or 46 to 55, users in Oslo, Bergen or Stavanger and users interested in Porsche or Tesla. Now we’ll do a union of all sets within a targeting category and then an intersection between categories. This gives us the count of unique users for our segment.
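To make this concrete before we get to sketches, here is the same composition written out with plain, exact sets of user IDs. This is only an illustration with made-up IDs; storing these sets exactly is precisely what we cannot afford to do at scale, which is where sketches come in below.

// Exact, non-sketch illustration: union within a targeting category,
// then intersection between categories (Scala).
val gender    = Seq(Set("u1", "u2", "u3", "u5"))                  // Male
val ageGroups = Seq(Set("u1", "u4"), Set("u2", "u5"))             // 36-45, 46-55
val locations = Seq(Set("u1", "u2"), Set("u5"), Set("u2", "u6"))  // Oslo, Bergen, Stavanger
val interests = Seq(Set("u1", "u2", "u5", "u7"), Set("u2", "u3")) // Porsche, Tesla

val categories = Seq(gender, ageGroups, locations, interests)

val segmentUsers = categories
  .map(_.reduce(_ union _))     // union of all sets within a category
  .reduce(_ intersect _)        // intersection between categories

println(s"Unique users in segment: ${segmentUsers.size}")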

In order to do this for any new segment, we need to store a set of unique user IDs for each possible value of each criterion:

  • 3 for gender: Male, Female, Unknown
  • 6 age groups
  • > 100,000 for location by district/region/country
  • Several hundred thousand for interest categories (separate for each supported website)
  • Several million for location by latitude/longitude
  • Several million for search terms

This creates a new problem: how do we efficiently store millions of user IDs for each of these sets while still being able to do unions and intersections?

To the rescue: sketches!

Luckily, smart people have been thinking about this for a while and have come up with an elegant solution: sketches! They are probabilistic data structures, much like the slightly better-known Bloom filter. The basic idea is similar to lossy compression (as we know it from JPEGs and MP3s): you don’t get quite the same thing back, but it is good enough for your purposes, using only a fraction of the original size.

There are many different versions of these data sketches, as they’re also called, but they all build on the following observation: if I store some information about specific patterns in the incoming data, I can estimate how many distinct items I have observed so far. As a very simplified example, if I flip a coin many times and only store the largest number of heads in a row, I can estimate how many times the coin was flipped. For a more detailed explanation of sketches, see here.

All of these sketches rely on a uniform distribution of the incoming values, which can easily be achieved by hashing the original user IDs.
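To make the coin-flip analogy a little more concrete, here is a deliberately naive estimator in a few lines of Scala. It is not what any real sketch library does, but it shows the principle: hash each user ID so the values are spread uniformly, and remember only the largest number of leading zero bits seen in any hash. Roughly 2^max distinct items are needed before a hash with that many leading zeros shows up, so 2^max is a (very rough) estimate of the distinct count; HyperLogLog and friends are careful refinements of exactly this trick.

import scala.util.hashing.MurmurHash3

// Naive distinct-count estimator: remember only the largest number of
// leading zero bits observed among the hashed IDs.
class NaiveSketch {
  private var maxLeadingZeros = 0

  def add(userId: String): Unit = {
    val hash  = MurmurHash3.stringHash(userId)      // roughly uniform 32-bit hash
    val zeros = Integer.numberOfLeadingZeros(hash)
    if (zeros > maxLeadingZeros) maxLeadingZeros = zeros
  }

  // A hash with n leading zeros appears about once per 2^n distinct items.
  def estimate: Double = math.pow(2, maxLeadingZeros)
}

val sketch = new NaiveSketch
(1 to 100000).foreach(i => sketch.add(s"user-$i"))
println(f"Estimated distinct count: ${sketch.estimate}%.0f (true count: 100000)")

A single counter like this has huge variance; real sketches keep many such observations (for example, the k smallest hash values), which both tightens the estimate and makes the set operations described below possible.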

So we can construct a data structure that allows us to add a very large number of values, and then ask how many distinct values were observed. But even better – we can take two such data structures and do a union (i.e. how many distinct values show up in either of the two: “lives in Oslo or Bergen”) or an intersection (i.e. how many distinct values show up in both: “is male and 46-55 years old”).
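In code, with the Theta sketches from the DataSketches Java library that gives this post its title, those two operations look roughly like the snippet below. Treat it as a sketch of the API rather than our production code: the package and method names follow the older com.yahoo.sketches releases, and newer Apache DataSketches versions have renamed some of them.

import com.yahoo.sketches.theta.{SetOperation, Sketch, UpdateSketch}

// Build one sketch per attribute value by feeding it user IDs.
def sketchOf(userIds: Seq[String]): Sketch = {
  val sk = UpdateSketch.builder().build()
  userIds.foreach(id => sk.update(id))
  sk.compact()
}

val oslo   = sketchOf(Seq("u1", "u2", "u3"))
val bergen = sketchOf(Seq("u3", "u4"))
val male   = sketchOf(Seq("u1", "u3", "u4"))
val age46  = sketchOf(Seq("u3", "u5"))

// "lives in Oslo or Bergen": union of the two sketches
val union = SetOperation.builder().buildUnion()
union.update(oslo)
union.update(bergen)
println(union.getResult.getEstimate)        // 4.0

// "is male and 46-55 years old": intersection of the two sketches
val intersection = SetOperation.builder().buildIntersection()
intersection.update(male)
intersection.update(age46)
println(intersection.getResult.getEstimate) // 1.0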

For a more detailed introduction to these concepts, have a look at these excellent blog posts.

In an upcoming blog post, we will discuss some data sketch implementations and the benchmarks we did. Here, we want to focus on how we used data sketches to solve the problem outlined in the introduction and how we implemented this system in production.

Commercial importance of population estimates

For our campaign managers, seeing the population estimates in real time as they work on their segments is absolutely crucial. A segment’s reach needs to be greater than a certain number of unique users to be commercially viable. What that size is depends entirely on the type of segment. The segments that make up the standard offering in the product portfolio need to be very large and reach a cross-section of our users. Custom segments for a specific advertiser can be very small in comparison, but still very attractive for a particular campaign. And the ad sales team always needs to know the size of each segment to make the call: should I sell this? Will the campaign be able to deliver?

 

The segment creation user interface with the estimated weekly unique users in the top right.

Using data sketches in a production system

Our previous solution

When we first started estimating reach, we had only a couple of targeting attributes, and needed to support internal users only. For each query, we would kick off a job via Spark Jobserver that would calculate the estimate on the fly based on HyperLogLog data structures kept in memory. This approach did not scale with the number of targeting attributes, the amount of data and the number of concurrent estimation requests. When there was an issue with Spark, the system would stop responding. The new design overcomes all these issues and can respond to many simultaneous requests.

Calculating the sketches

Data flow in our population estimation system

Every night, we run a batch job on Spark which reads all the user profiles recorded on the previous day and calculates a data sketch for each attribute. The attributes in these profiles have been inferred by our User Modelling data science team based on browsing behaviour on our sites. Once all the sketches are calculated, we load them from S3 into a PostgreSQL RDS instance (each sketch serialised as a byte array).
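The serialisation itself is trivial with the library. The round trip looks roughly like this (again assuming the com.yahoo.sketches Theta API; the exact Memory wrapper class has moved between library versions):

import com.yahoo.memory.Memory
import com.yahoo.sketches.theta.{Sketches, UpdateSketch}

val sk = UpdateSketch.builder().build()
Seq("u1", "u2", "u3").foreach(id => sk.update(id))

// Serialise the sketch so it can be stored in a bytea column in PostgreSQL ...
val bytes: Array[Byte] = sk.compact().toByteArray

// ... and restore it at query time.
val restored = Sketches.heapifySketch(Memory.wrap(bytes))
println(restored.getEstimate) // 3.0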

This might seem like an odd choice, given there’s nothing relational about our list of sketches. But we are relying on two very convenient features:

  • Geo queries with the PostGIS extension
  • Partitioning

Geo queries or “How many users are in this area?”

Map of Oslo area with example grid: one data sketch is computed for each square in the grid (red: 10x10km, blue: 1x1km, highest resolution omitted).
© OpenStreetMap contributors

The geo location sketches come in two different flavours, as there are two ways to define locations in a segment: by postcode/district code/region code and by latitude/longitude + radius. The latter allows the precise definition of target areas; e.g. for a new shopping mall which wants to target everyone living within 20km of their location or a coffee shop targeting mobile users within 500m of any of their branches.

To allow the targeting by latitude/longitude, we divide the world into a grid of points on the surface and calculate a data sketch for every quadrant in this grid for which we have seen any users. For efficiency reasons, we calculate these sketches at three different resolutions:

  • High resolution: latitude/longitude with 3 decimals ≈ 100m x 100m quadrant
  • Medium resolution: 2 decimals ≈ 1km x 1km quadrant
  • Low resolution: 1 decimal ≈ 10km x 10km quadrant
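One simple way to derive the grid cell for an observed latitude/longitude is to round the coordinates to the number of decimals of the chosen resolution; the rounded pair then becomes the key of the sketch that the user is added to. The exact keying scheme in ATE is an implementation detail, so take this purely as an illustration.

// Illustrative grid key: round to 3/2/1 decimals for the
// ~100 m / ~1 km / ~10 km resolutions respectively.
def gridKey(lat: Double, lon: Double, decimals: Int): (Double, Double) = {
  val factor = math.pow(10, decimals)
  (math.round(lat * factor) / factor, math.round(lon * factor) / factor)
}

gridKey(59.91273, 10.74609, 2) // (59.91, 10.75) -> the 1x1 km sketch for this square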

 

Example of selecting all squares in the grid that fall within a circle: we have to compute the union of all the corresponding data sketches (one per shaded square) to know how many visitors are from within this area
© OpenStreetMap contributors

Depending on the radius in the query, we then select the appropriate resolution.

So when we need to know how many users we have observed within 20km of 59.915/10.742, we’ll fetch all the quadrants that fall within that circle and do a union of the corresponding sketches.

Luckily, with the PostGIS extension and a geo index, this is as simple as the following query:

SELECT datasketches FROM low_res_geo_sketches
WHERE ST_DWithin(geography, 'SRID=4326;POINT(10.742 59.915)'::geography, 20000.0)
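On the application side, the returned byte arrays are deserialised and unioned into a single sketch, whose estimate is the number of users within the circle. A rough sketch of that step, assuming plain JDBC and the same Theta sketch API as above (connection details and error handling omitted):

import java.sql.DriverManager
import com.yahoo.memory.Memory
import com.yahoo.sketches.theta.{SetOperation, Sketches}

val conn = DriverManager.getConnection("jdbc:postgresql://...") // illustrative; real connection config omitted
val stmt = conn.prepareStatement(
  """SELECT datasketches FROM low_res_geo_sketches
    |WHERE ST_DWithin(geography, 'SRID=4326;POINT(10.742 59.915)'::geography, 20000.0)""".stripMargin)

val rs    = stmt.executeQuery()
val union = SetOperation.builder().buildUnion()
while (rs.next()) {
  // One sketch per grid square inside the 20 km circle
  union.update(Sketches.heapifySketch(Memory.wrap(rs.getBytes("datasketches"))))
}
println(s"Estimated users within 20 km: ${union.getResult.getEstimate}")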

Obviously, the user defining a segment in the UI simply clicks on a point in a map.

Lowest-resolution coverage for Norway, Sweden, Finland and Belarus. Each dot represents one data sketch. Areas without dots don’t have any active users.

 

 

Partitioning in PostgreSQL

Another nice feature of PostgreSQL that we rely on is partitioning. Partitioning refers to splitting what is logically one large table into smaller physical pieces. We are loading new data sketches into the database every day and want to query the last seven days of data. At the same time, we want to remove older data when it is not in use anymore. As a precaution, we always keep 14 days of data in case there is some issue with loading new data.

To achieve this, we have a parent table, for example low_res_geo_sketches, and one partition for each day. When we add a new partition, we remove the oldest from the parent table, but keep it around for another week. For selects on the parent table, PostgreSQL automatically includes data from all partitions associated with it.

Another optimisation is to run the CLUSTER command after loading a new partition and creating the geo index. This reorganises the data physically on disk to match the index, so that retrieval is as fast as possible. It can take quite a while, depending on the amount of data, but it only needs to be done once and happens in the middle of the night.
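For reference, this nightly partition maintenance boils down to a handful of SQL statements issued by the loader. The snippet below is only a sketch, using declarative partitioning and made-up partition/index names and dates; whether inheritance-based or declarative partitioning is used (and the exact names) depends on the PostgreSQL version and setup.

import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:postgresql://...") // illustrative; real connection config omitted
val stmt = conn.createStatement()

// Attach the freshly loaded and indexed day table as a new partition of the parent.
stmt.execute(
  """ALTER TABLE low_res_geo_sketches
    |ATTACH PARTITION low_res_geo_sketches_day_14 FOR VALUES FROM ('2017-06-14') TO ('2017-06-15')""".stripMargin)

// Detach the oldest partition from the parent, but keep the table itself
// around for another week as a safety net.
stmt.execute("ALTER TABLE low_res_geo_sketches DETACH PARTITION low_res_geo_sketches_day_01")

// Physically reorder the new partition to match its geo index,
// so the radius queries touch as few disk pages as possible.
stmt.execute("CLUSTER low_res_geo_sketches_day_14 USING low_res_geo_sketches_day_14_geo_idx")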

Computing population estimates on demand

Now that we have all the data sketches ready in our database, how do we actually compute the population estimates for a given segment? When someone creates or edits a segment, the browser sends a request to the population estimation API. This request contains the list of targeting values, like in our example at the beginning.

The server will fetch the required sketches from the database (8 of them in our example above), deserialise them, do a union of all sketches within a targeting category and then an intersection between categories. The resulting sketch is then queried for its estimated count of unique users. Depending on the number of sketches that need to be retrieved, this typically takes less than a second.
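The core of that computation is a small fold over the targeting categories, mirroring the exact-set example from the beginning of the post. Here is a sketch of it, again assuming the Theta API and leaving out the database access and deserialisation shown earlier:

import com.yahoo.sketches.theta.{SetOperation, Sketch}

// One deserialised sketch per selected attribute value, grouped by
// targeting category (gender, age groups, locations, interests, ...).
def estimateSegment(sketchesByCategory: Seq[Seq[Sketch]]): Double = {
  // Union of all sketches within each targeting category ...
  val perCategory = sketchesByCategory.map { sketches =>
    val union = SetOperation.builder().buildUnion()
    sketches.foreach(s => union.update(s))
    union.getResult
  }
  // ... then an intersection between the categories.
  val intersection = SetOperation.builder().buildIntersection()
  perCategory.foreach(s => intersection.update(s))
  intersection.getResult.getEstimate
}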

If there are many geo radius criteria defined in a segment, up to several thousand sketches may need to be loaded from the database. This can sometimes take up to 20 seconds, which is still acceptable as these segments are not created or changed very often. The limiting factor is not the speed of computing the unions/intersections, which is very fast, but the time it takes to load all the bytes from the database. There are many ways to optimise this further, for example by precomputing the union per attribute across the lookback period of seven days, or even by doing the unions/intersections in the database. But for our current needs, it is good enough.

Conclusion

Estimating unique counts, even for complex queries, is not rocket science and can be done in real time, provided enough effort has been spent computing things upfront. There are a number of robust existing implementations of the required data structures, or “sketches”, which can simply be used as a library, with results that come close to magic!
