Using awareness to ensure good data quality.
The new Schibsted
Schibsted is changing. Heroic efforts are underway to turn us into a global technology company. One such effort is the consolidation of our user base of some 200 million unique users a month. A user base of 200 million can provide a wealth of insights into users’ behavior, activities and interests, and a wealth of opportunities – from enabling customized interactions to sharing best practice, all to ensure an excellent user experience.
Given the importance of taking advantage of such opportunities, the efficient and consistent acquisition, dissemination and analysis of data is essential. Without these analyses, we will not be able to inform the wide range of operations and decision-making processes that depend on them. Pulse, Schibsted’s internal tracking solution, is the technology that enables the acquisition and dissemination of data, while PulseMonitor ensures that we can analyse that data efficiently and consistently by securing high-quality data.
The purpose of PulseMonitor
Pulse provides us at Schibsted with the ability to record a user’s behavior as they interact with our sites through the collection of discrete events. We track our users’ behavior for a multitude of reasons: to reveal usage patterns, determine trends in activity, augment the user experience and much more. Our users trust us with their data, so it is vital that we take privacy seriously, and we do. At Schibsted we work hard to be transparent – because we recognize that our users’ trust is our most important asset.
With Pulse we aim to enable the creation of unique and valuable experiences, so each user that interacts with our sites and applications can achieve their goals, be that reading informative and thoughtful news articles, or buying that Swedish designer chair they have spent five years searching for. Tracking is necessary to enable a good user experience, but that doesn’t mean we don’t understand and take seriously the implications of users trusting us with their data.
Collecting data for the mere purpose of collecting data doesn’t make much sense. What is important is that we collect the right data, and that it is of sufficient quality for our data scientists or the sites themselves. Maintaining a sufficient quality of data is therefore vital, and this is the purpose of PulseMonitor.
More than just a dashboard for showing the status of a tracking integration, PulseMonitor aims to ensure that sites across Schibsted are consistently integrated. To extract value from the data that sites provide, we need to be able to trust the information it contains and capture enough of the overall picture to analyse it correctly. We also need to be able to trust the conclusions drawn from those analyses.
PulseMonitor ensures that sites are aware of their tracking integration, how they can improve their integration, how they compare to other sites, and most importantly, how well the data they collect meets the expectations of the data consumers.
Schemas, we have them for a reason
Each interaction a user has with a site, or each action performed by a backend system on their behalf, can result in a collected event. Each of these events has an associated type with an associated schema that it needs to conform to (see figure above for an example). The schema represents a contract, or blueprint, between the dispatcher and receiver of the event. The schema is important because it adds predictability, allowing us to validate an incoming event. This implies that we can sort events based on their type and version, and process them accordingly.
A schema also allows us to make some assumptions about the intrinsic quality of an event, where intrinsic quality refers to the data inherent to that event. For example, if an event has a field called address of type string, and the value contained within that field is indeed a string, the field at least has the correct type. This tells us nothing about the semantic correctness of the content, but it is a hint.
PulseMonitor can then cover the introspection of schemas by providing a simple user interface to easily discover event types, connections between event types, required fields, etc.
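As a rough illustration of the schema contract described above, the following minimal sketch checks an event against a schema and reports what is wrong with it. The schema layout, the field names other than address, and the function name are assumptions made for this example; they are not the actual Pulse schema format.

```python
from typing import Any, Dict, List

# Hypothetical schema for illustration only; real Pulse schemas are not shown here.
EXAMPLE_SCHEMA: Dict[str, Any] = {
    "type": "example-event",
    "version": 1,
    "fields": {
        "address": {"type": str, "required": True},
        "postal_code": {"type": str, "required": False},
    },
}


def validate_event(event: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the event conforms."""
    problems: List[str] = []
    for name, rule in schema["fields"].items():
        if name not in event:
            # Only required fields count as a problem when absent.
            if rule["required"]:
                problems.append(f"missing required field '{name}'")
            continue
        if not isinstance(event[name], rule["type"]):
            problems.append(
                f"field '{name}' should be {rule['type'].__name__}, "
                f"got {type(event[name]).__name__}"
            )
    return problems
```

A check like this only covers intrinsic, type-level quality: an address of "xyz" would pass, which is exactly the limitation noted above.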
Schibsted uses Amazon Web Services (AWS) for running a number of services and this is also true for the data collection pipeline, depicted in the figure above. Events being dispatched by the SDKs are received by an endpoint called the Collector, which is a very simple service that quickly pushes the received events into a persistent queue. We use Kinesis as our persistent queue. It is a sliding window queue that provides certain guarantees about reliability and retention. It’s called a sliding window queue because all events stored in Kinesis can be read as many times as required during the retention period (seven days in our case). It is up to the readers to keep track of processed events. Once the seven-day retention period has passed, events are removed automatically by Kinesis.
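The sliding window behaviour can be sketched with a toy in-memory queue. This is not the Kinesis API; the class and method names are invented for the example, and only the seven-day retention figure comes from the text. It shows the two properties that matter here: reading does not remove records, and each reader keeps its own checkpoint.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple

RETENTION_SECONDS = 7 * 24 * 60 * 60  # seven days, as stated in the article


@dataclass
class SlidingWindowQueue:
    """Toy model: records stay readable until they age out of the window."""

    records: List[Tuple[int, float, Any]] = field(default_factory=list)
    next_sequence: int = 0

    def put(self, payload: Any, now: float) -> int:
        """Append a record and return its sequence number."""
        sequence = self.next_sequence
        self.records.append((sequence, now, payload))
        self.next_sequence += 1
        return sequence

    def expire(self, now: float) -> None:
        """Drop records whose age has reached the retention period."""
        self.records = [r for r in self.records if now - r[1] < RETENTION_SECONDS]

    def read_after(self, checkpoint: int, now: float) -> List[Tuple[int, Any]]:
        """Return records newer than the reader's checkpoint; nothing is consumed."""
        self.expire(now)
        return [(seq, payload) for seq, _, payload in self.records if seq > checkpoint]
```

A reader would store the highest sequence number it has processed and pass it back as the checkpoint on the next read; a second reader with a different checkpoint sees the same records independently.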
There are two paths in the collection pipeline: one for fast processing or streaming, and one for slow or bulk processing, where the volume of data matters more than low latency. For PulseMonitor we rely on the fast processing path. In the collection pipeline, a component called Piper can forward all events, or a filtered subset of them, to any dependent client.
PulseMonitor receives all events, since we are interested in providing a global view of Pulse’s status. Furthermore, we hook onto the fast path, because we want the lowest possible feedback-loop latency for our users.
Immediate feedback on the modifications a site makes to its tracking integration is a matter of timeliness, and it is very important.
Piper then forwards all the events to our own Kinesis stream, from where we read the events and push them into ElasticSearch (the component called Slurp in the above figure). Between Kinesis and ElasticSearch, Slurp allows us to modify events if required, with some basic hacking, slashing, validations or aggregations. It also enables us to push events into one or more ElasticSearch indices, depending on requirements.
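The role Slurp plays between the stream and the search indices can be sketched as a transform step followed by a routing step. The sketch below uses plain Python lists instead of real Kinesis or ElasticSearch clients, and the field names, index names and function names are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Optional


def transform(event: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Drop events without a type and keep only the fields a dashboard might need."""
    if "type" not in event:
        return None
    return {
        "type": event["type"],
        "site": event.get("site", "unknown"),
        "sdk": event.get("sdk", "unknown"),
    }


def target_indices(event: Dict[str, Any]) -> List[str]:
    """Route every event to a global index and to a per-site index (hypothetical names)."""
    return ["events-all", f"events-{event['site']}"]


def slurp(events: Iterable[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Group transformed events by destination index, ready for bulk indexing."""
    bulk: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for raw in events:
        event = transform(raw)
        if event is None:
            continue  # invalid events are skipped in this sketch
        for index in target_indices(event):
            bulk[index].append(event)
    return dict(bulk)
```

In a real deployment the grouped documents would be handed to a bulk indexing call; here they are simply returned so the routing is easy to inspect.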
PulseMonitor is a web service running queries towards ElasticSearch through data providers written using GraphQL, a powerful client-side friendly query language. With GraphQL, our webview is only one potential client; anyone else who wants to extract a subset of the metrics can do so through the GraphQL API and publish them to their own services, such as DataDog, or another ElasticSearch cluster.
So why not publish these metrics directly to DataDog? One thing we want to achieve with PulseMonitor is to provide suggestions or actions for improving an integration, which is not easily done with DataDog.
Also, we only want to provide insights into recent metric data with PulseMonitor. With approximately 700 million events ingested per day we don’t want to keep more than a small window of events available – perhaps a day or two. For long term insights, and trends over weeks, months or even years, we have separate teams.
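To put the volume in perspective, the arithmetic behind a short retention window can be worked through in a few lines. Only the figure of approximately 700 million events per day comes from the text; the function names are made up for the example, and the result is an average that ignores daily peaks.

```python
EVENTS_PER_DAY = 700_000_000  # approximate figure from the article
SECONDS_PER_DAY = 24 * 60 * 60


def average_events_per_second(events_per_day: int) -> float:
    """Average ingest rate, assuming events are spread evenly over the day."""
    return events_per_day / SECONDS_PER_DAY


def events_in_window(events_per_day: int, window_days: int) -> int:
    """Number of events held when keeping a window of the given length."""
    return events_per_day * window_days


rate = average_events_per_second(EVENTS_PER_DAY)   # roughly 8,100 events per second
two_days = events_in_window(EVENTS_PER_DAY, 2)     # 1.4 billion events
```

Even a two-day window therefore means well over a billion documents to index and query, which is why longer-term trends are left to other systems.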
PulseMonitor displays metrics and suggestions on simple data sets; we are not in the business of creating complex analyses, for example of retention. That is the job of our very competent data scientists.
What is data quality?
Quality cannot be assessed independently of the consumers who are the users of a product. Similarly, data quality cannot be assessed independently of those who use the data, i.e., the data consumers. At Schibsted, those consumers are primarily our data scientists. With improved data quality, Schibsted’s data scientists can increase the accuracy of their models and extract more insights. More accurate models, which capture a broader and more accurate picture of the state of affairs, in turn enable the sites to improve their products more precisely. This is a mutually beneficial relationship.
Quality categories and dimensions
To understand how data quality can be measured, there are a number of categories and dimensions that are commonly used, as shown in Table 1. Given these definitions, we can invert the problem and say that data quality problems can be defined as those caused by collected data being unfit for use in one or more of these quality dimensions.
|Point of view||Category||Dimensions|
|Inherent||Intrinsic||Accuracy, objectivity, believability and reputation|
|System dependent||Accessibility||Accessibility, access security|
|Contextual||Relevancy, value-added, timeliness, completeness, amount|
|Representational||Interpretability, ease of understanding, concise representation, consistent representation|
Table 1: Data quality categories and dimensions
The intrinsic category refers to the quality the data has in its own right, i.e., whether it is accurate or inaccurate. These are dimensions related to the quality of the data itself, not how it is accessed or used. For example, average user reading times for an article might be accurate, but of no use for determining a user’s gender.
Accessibility relates to the ease of access and understanding of data. Are there processes in place to access certain sets of data? Is a particular set of credentials required? Are we able to easily find datasets that are useful for an analysis? Are we able to interpret the data without expert assistance? With accessibility we want to remove any problems that conceal or make data inaccessible.
Contextual refers to the quality of data in the context of the task being performed. If our analysis relies on real-time data, data transmitted with significant delay will provide little value for the context it is processed in. Completeness is another example: we might be sending a number of events, all of which are correct, yet only half of the types of events we should be sending, and the lack of those events might not be known or communicated. Being able to flag this as an issue, and have it easily communicated to all data producers, will form a vital piece of functionality in PulseMonitor.
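The completeness example can be made concrete with a small sketch that compares the event types a site actually sends against the types it is expected to send. The expected set and the function name are hypothetical; how expectations would really be defined is not covered in this article.

```python
from typing import Any, Dict, Iterable, Set

# Hypothetical set of event types a site is expected to send.
EXPECTED_TYPES: Set[str] = {"View", "Click", "Purchase", "Search"}


def completeness_report(
    events: Iterable[Dict[str, Any]], expected_types: Set[str]
) -> Dict[str, Any]:
    """Report which expected event types were never seen, and the coverage ratio."""
    seen = {event["type"] for event in events if "type" in event}
    missing = sorted(expected_types - seen)
    if not expected_types:
        coverage = 1.0  # nothing expected, so nothing can be missing
    else:
        coverage = len(expected_types & seen) / len(expected_types)
    return {"coverage": coverage, "missing": missing}
```

Every individual event may be valid while the report still shows a coverage of 0.5, which is the situation described above: correct events, but only half of the expected types.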
Representational data quality means data is easily interpreted, concise and consistent. Accessing large amounts of data is time-consuming, so introducing columnar data storage can reduce computational time for some sets of analysis. Can we access the stored data in a consistent format? For example, metadata extracted from video, audio or images should be available in the same format as events collected from the frontend or backend. Having meaningful, concise and consistent representations of data is vital.
What can we do to ensure quality?
For batch jobs, as mentioned with columnar storage, ensuring quality may simply involve providing a data format that doesn’t make it cumbersome to process large amounts of data. An overview of the storage formats available for various datasets, with advice on the situations each format suits, can be very useful to data consumers. Similarly, a notification framework that announces when bulk data sets become available for processing can be of great value, and can provide statistics about general dataset availability.
Since processing resources are closely linked to data accessibility, we could also display the resources dedicated to shared clusters, average times for running jobs, and tips and tricks on how to optimize jobs. Such simple insights could then make it possible for data consumers to alert resource managers if jobs are piling up, or to request assistance from experts for optimizing their work.
With regard to proper interpretation and understanding, we can ensure the data is accompanied by descriptions and examples of use, to highlight how it can be interpreted without specialist aid. If specialists are required to interpret the data, contacts or channels of communication can be provided to address relevant questions.
For concise and consistent representation, e.g., extracting metadata from images and querying it in combination with text or other objects, supporting material could also be made available via PulseMonitor as part of a documentation set, perhaps just as a link to an API that can be used.
Ultimately, we want to increase data utilization with PulseMonitor, and the best way to achieve this is by removing common sources of faults.
Where are we now and where do we want to be?
Within the described quality categories and dimensions there are a number of high-reward features we can integrate into PulseMonitor. The goal is to provide a real-time, live-aggregation dashboard that visualises metrics from the events currently flowing through the system. We already have a prototype for this (see figure above). Key metrics are event type distribution, number of events, distribution across different SDKs, etc. More importantly, this will provide valuable input for sites to increase the quality of their integration and extract further value from tracking.
There are different types of quality signals. For instance, a lack of event diversity across comparable sites, e.g., marketplaces, can be an indication that one marketplace is not collecting a sufficiently diverse set of events. Adding tracking for an additional set of event types might enable them to collect more complete profiles of user behaviors, increasing the value of these user interactions with the site. In the long term, the ability to compare your integration with others in comparable markets is a powerful one. It enables sites to adapt to positive changes on other sites, for example through awareness that another site collects location data for more events. At the very least it creates an incentive to reach out and see if there are lessons to be learnt.
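One simple way to express the diversity signal is to list, for each site, the event types that comparable sites collect but it does not. The sketch below assumes the set of observed event types per site is already available; the site names, event types and function name are invented for the example.

```python
from typing import Dict, List, Set


def diversity_gaps(types_by_site: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """For each site, list event types that its peers collect but it does not."""
    gaps: Dict[str, List[str]] = {}
    for site, own_types in types_by_site.items():
        # Union of every event type seen on any other comparable site.
        peer_types: Set[str] = set()
        for other_site, other_types in types_by_site.items():
            if other_site != site:
                peer_types |= other_types
        gaps[site] = sorted(peer_types - own_types)
    return gaps


# Hypothetical marketplaces and event types.
marketplaces = {
    "marketplace-a": {"View", "Click", "Location"},
    "marketplace-b": {"View"},
}
gaps = diversity_gaps(marketplaces)
```

A non-empty gap list is only a prompt for a conversation, not a verdict: a site may have good reasons not to track a given event type.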
Other quality signals could be simple statistics on which SDK versions sites are shipping events from. If no events are being sent from iOS, perhaps a potential market is not being exploited; the numbers for other sites on that platform might indicate a potential benefit. We can also provide simple insights, such as the percentage of users who opt out or the percentage of events containing certain fields, e.g., location.
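Statistics of this kind are straightforward aggregations over a batch of events. The following sketch counts events per SDK and computes the percentage of events that carry a given field; the event layout, the SDK labels and the function names are assumptions for illustration.

```python
from collections import Counter
from typing import Any, Dict, List


def sdk_distribution(events: List[Dict[str, Any]]) -> Dict[str, int]:
    """Count events per SDK label; events without one are counted as 'unknown'."""
    return dict(Counter(event.get("sdk", "unknown") for event in events))


def field_coverage(events: List[Dict[str, Any]], field_name: str) -> float:
    """Percentage of events in which the given field is present and not None."""
    if not events:
        return 0.0
    with_field = sum(1 for event in events if event.get(field_name) is not None)
    return 100.0 * with_field / len(events)


# Hypothetical sample batch.
sample = [
    {"sdk": "android-1.2", "location": "59.91,10.75"},
    {"sdk": "android-1.2"},
    {"sdk": "js-3.0"},
    {"sdk": "js-3.0", "location": None},
]
distribution = sdk_distribution(sample)
location_share = field_coverage(sample, "location")
```

In this sample no event comes from an iOS SDK, which is the kind of absence a dashboard could surface.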
Such a dashboard can provide insights, such as the first time an event type was seen, or the percentage of those events received within a time period. It could also be a one-stop-shop for integrating tracking into a site.
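Both insights can be derived from timestamped events. The sketch below assumes each event carries a numeric timestamp and a type, and it interprets the percentage as the share of one event type among all events received in a time window; that interpretation, like the function names, is an assumption made for this example.

```python
from typing import Any, Dict, List


def first_seen(events: List[Dict[str, Any]]) -> Dict[str, float]:
    """Earliest timestamp at which each event type was observed."""
    earliest: Dict[str, float] = {}
    for event in events:
        event_type, timestamp = event["type"], event["timestamp"]
        if event_type not in earliest or timestamp < earliest[event_type]:
            earliest[event_type] = timestamp
    return earliest


def share_in_window(
    events: List[Dict[str, Any]], event_type: str, start: float, end: float
) -> float:
    """Percentage of events in [start, end) that are of the given type."""
    in_window = [e for e in events if start <= e["timestamp"] < end]
    if not in_window:
        return 0.0
    matching = sum(1 for e in in_window if e["type"] == event_type)
    return 100.0 * matching / len(in_window)
```

The events do not need to arrive in timestamp order for either function to work, which matters when events from different SDKs are delivered with different delays.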
We can then provide snippets of code for integrating the missing tracking SDKs, or documentation on how to track more types of events. It can also be used to display the schema, and show a distribution of events that are valid or conform to the given schema. Events that don’t conform can then be listed, with hints about what is wrong with the event (and potential fixes) provided to the site.
Pulse is Schibsted’s tracking solution, providing a unified view of Schibsted’s user base, with the aim of extracting value from the behavioral data created by our users.
The efficient collection, dissemination and analysis of data is becoming increasingly important. To support our data consumers, who are primarily Schibsted data scientists, we have to ensure that data utilization is high and that problems with data quality are kept to a minimum. This is important because the mutually beneficial relationship that occurs as a result of more accurate models can lead to more users or better retention, which means more data, etc.
PulseMonitor is a tool for sites to efficiently examine the status of their tracking integration, and for data scientists to highlight issues related to data quality problems, and have them removed.