Getting into business with Prometheus

FINN has moved towards an architecture of microservices and uses a number of technologies – Prometheus included – to identify and fix service outages.

With each piece of functionality composed of a growing number of individual services, specialized tools are required for the detection, analysis and mitigation of errors. We already use our fair share, including Zipkin, Hystrix, Kibana, Grafana and Sensu.

When it comes to metrics and monitoring of time series data, FINN has traditionally employed an infrastructure based on StatsD and Graphite. Recently, however, we opted to switch to Prometheus.

Prometheus is heavily inspired by Google’s Borgmon monitoring system and was originally developed by SoundCloud as a reaction to scaling issues experienced with just StatsD and Graphite. Bearing in mind Adrian Cockcroft’s rule #4 of monitoring, “Monitoring systems need to be more available and scalable than the systems being monitored”, Prometheus stands out as the best choice for us. It is directly supported by Kubernetes, the future container management platform of choice here at FINN, and has been engineered from the ground up to deal with issues of scale and stability. In this new area of monitoring microservices, Graphite seems to be losing ground.

All our autonomous service teams rapidly switched to Prometheus and have implemented monitoring of the “traditional” application metrics like latency and memory usage. Special care has also been taken to ensure adequate monitoring of business metrics.

Business metrics monitoring

When you are dealing with a large number of services, the implications of one failed or partly failing service can be hard to evaluate. Since a complex problem can now be broken up into units that are truly independent, every individual part can continue to work on its own while the end result does not. Monitoring business metrics is key to successfully discovering and mitigating such problems.

Your business metrics are the core performance indicators associated with the services you provide. These are primary features of the business which can suffer if any operational or functional part of the system is not performing. Turnbull describes business metrics as

“…the next layer up from application metrics […] Business metrics might include the number of new users/customers, number of sales, sales by value or location, or anything else that helps measure the state of a business.” (from Turnbull, James: The Art of Monitoring).

Other examples are orders per second at Amazon, or stream starts per second at Netflix.

My team at FINN, FINN småjobber (Norway’s leading marketplace for matching labor and demand for help with everyday tasks such as cleaning and moving), has recently finished implementing metrics monitoring with Prometheus. In the weeks to come this will require some tuning to adjust thresholds for automatic alerts and so forth, but it is already looking good.

I have collected five tips related to metrics monitoring with Prometheus based on our recent experiences.

Counters for the win

Counters are great. Unlike gauges, which can spike when you are not looking, counters have no loss of information between samples. But there is one core rule of thumb when working with counters:

The only mathematical operations you can safely apply directly to a counter are rate, irate, increase, and resets. Anything else will cause problems.

Tip courtesy of Brian Brazil.
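
To see why the rule matters, here is a minimal Python sketch (not Prometheus code) of two per-server counters where one server restarts between scrapes. The increase helper is a simplified stand-in for Prometheus's increase(): it treats any drop as a counter reset but skips the extrapolation the real function performs. The sample values are made up for illustration.

```python
from typing import Sequence


def increase(samples: Sequence[float]) -> float:
    """Total increase across the samples, treating any drop as a counter reset.

    Simplified stand-in for Prometheus's increase(); no extrapolation is done.
    """
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            # The counter restarted from zero, so everything counted
            # since the restart is new.
            total += curr
    return total


# Hypothetical scrapes of number_of_purchases on two servers.
server_a = [100, 110, 120, 130]
server_b = [200, 210, 5, 15]  # restarted between the 2nd and 3rd scrape

# Safe: apply increase to each counter first, then aggregate.
increase_then_sum = increase(server_a) + increase(server_b)

# Unsafe: aggregate the raw counters first, then apply increase.
summed = [a + b for a, b in zip(server_a, server_b)]
sum_then_increase = increase(summed)

print(increase_then_sum)   # 55.0
print(sum_then_increase)   # 165.0
```

Summing first hides the restart of a single server inside a larger number, so the reset handling ends up counting almost the whole aggregate as new purchases.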

Remember recording rules

Once your data are turned into an instant vector (for example, by aggregating with the sum function), functions that require range vectors can no longer be applied. However, sometimes you want to keep transforming the result, and business metrics monitoring is one area where this quickly becomes a reality. Say the business metric is number_of_purchases, and you want to monitor whether the total number of purchases is at a healthy level across all your servers. Assuming counters of type number_of_purchases{server="X"}, a Prometheus query for this would look something like

sum(increase(number_of_purchases[10m]))

This yields a meaningful metric to reason about: the total number of new purchases in the previous ten minutes. To alert on deviations in this metric, functions like deriv, delta and holt_winters may be useful, but they all require range vectors. One solution is to record the data using a recording rule.

Recording rules allow new time series to be created from user-configured expressions, which means you can query them as if they were any other metric.

Recording rules are otherwise used to precompute computationally expensive expressions and save them as new time series. They’re particularly useful when used in real-time dashboards (which are often refreshed at frequent intervals).
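
As a rough illustration of what the recorded series makes possible, the Python sketch below applies a least-squares slope, the same kind of simple linear regression that deriv is documented to use, to a window of samples from a recorded total. The sample values and the one-minute evaluation interval are assumptions made up for the example, and the helper is a conceptual stand-in, not Prometheus code.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, value)


def deriv(window: List[Sample]) -> float:
    """Least-squares slope (value change per second) over a window of samples."""
    n = len(window)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in window) / n
    mean_v = sum(v for _, v in window) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in window)
    var = sum((t - mean_t) ** 2 for t, _ in window)
    return cov / var


# Hypothetical recorded series: total purchases over the trailing ten
# minutes, evaluated once a minute by a recording rule.
recorded = [(0, 120.0), (60, 118.0), (120, 116.0), (180, 114.0), (240, 112.0)]

slope_per_second = deriv(recorded)
slope_per_minute = slope_per_second * 60  # roughly -2 purchases per minute
```

Because the recorded total is an ordinary time series, a window of its samples can be selected just like one from any raw metric, which is exactly what range-vector functions need.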

Use time() to deal with seasonality (business metric monitoring tip #1)

Business metrics like number_of_purchases can vary a lot throughout the day, typically correlating with the traffic to your site, and often with the day of the week or other seasonal patterns. Functions like holt_winters can help when alerting on such metrics, but they are not always adequate, for example when general volumes are just too low.

At such times, the time function may come in handy. time() returns the number of seconds since January 1, 1970 (UTC), and can be used to determine the current hour of the day, in UTC:

(time() / (60 * 60)) % 24

Expressions can now be built to incorporate special handling for chosen periods, for example, to compensate for night hours where traffic is low and when alerts on low volumes would typically be triggered.

Adding the expression + (((time() / (60 * 60)) % 24) < bool 8) * 1000 to a query adds 1000 to the metric during the hours from 00:00 to 08:00 (UTC), lifting it clear of any low-volume alerting threshold set below that level and so exempting these periods from alerting.
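
The arithmetic is easy to check outside Prometheus. The Python sketch below mirrors the two expressions for a pair of made-up timestamps; the function names, the example dates and the purchase count of 3 are illustrative assumptions, not part of our setup.

```python
from datetime import datetime, timezone


def hour_of_day(unix_seconds: float) -> float:
    """Mirror of (time() / (60 * 60)) % 24: fractional hour of the day in UTC."""
    return (unix_seconds / (60 * 60)) % 24


def night_offset(unix_seconds: float, cutoff_hour: float = 8, bump: float = 1000) -> float:
    """Mirror of ((hour) < bool 8) * 1000: bump before the cutoff hour, else 0."""
    return bump if hour_of_day(unix_seconds) < cutoff_hour else 0.0


# Two hypothetical evaluation times, both expressed in UTC.
night = datetime(2017, 3, 1, 3, 30, tzinfo=timezone.utc).timestamp()
day = datetime(2017, 3, 1, 14, 0, tzinfo=timezone.utc).timestamp()

purchases_last_10m = 3  # a low night-time volume, made up for the example

print(hour_of_day(night))                         # 3.5
print(purchases_last_10m + night_offset(night))   # 1003
print(purchases_last_10m + night_offset(day))     # 3.0
```

Note that the hour is computed in UTC, so for a site in another time zone the cutoff has to be shifted accordingly.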

[Figure: the time() function used to represent the current hour of day]

(Depending on your alerting tool, you may be able to exclude time periods from alerts in the tool itself. However, this will not usually fix how the metric’s status appears in dashboards, whereas a time() clause in the query does.)

Use micro-metrics (business metric monitoring tip #2)

Although this tip is not strictly related to Prometheus as a technology, you may find that the volume of an interesting metric is so low that efficient monitoring and alerting is hard. Low volumes do not mean that monitoring is useless, only that it will take longer to detect deviations in the metric. The problem is analogous to A/B testing on low-traffic websites (as explained here).

Here’s a tip: instead of (or as well as) your key metrics, monitor correlated micro-metrics (or micro-conversions, to use A/B testing terminology). For example, instead of monitoring only the key metric number_of_purchases, monitor related metrics like clicks on the “add to cart” button or the number of product page views. The main point is that the higher the volume, the faster you can detect deviations. This is because a larger sample makes it easier to tell a real change from random noise, and a large sample accumulates more quickly with higher traffic.
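
A back-of-the-envelope Python sketch makes the effect of volume concrete. It assumes events arrive as a Poisson process, so a window expecting N events has a standard deviation of about sqrt(N), and it asks how long the window must be before a given relative drop stands out by a chosen number of standard deviations. The rates, the 50% drop and the three-sigma criterion are illustrative assumptions, not measurements from our systems.

```python
def hours_to_detect(rate_per_hour: float, drop_fraction: float, sigmas: float = 3.0) -> float:
    """Rough window length (hours) needed for a relative drop to stand out.

    Assumes Poisson-distributed counts: a window expecting N events has a
    standard deviation of about sqrt(N), so a drop of drop_fraction * N
    exceeds sigmas * sqrt(N) once N >= (sigmas / drop_fraction) ** 2.
    """
    if rate_per_hour <= 0 or not 0 < drop_fraction <= 1:
        raise ValueError("rate must be positive and drop_fraction in (0, 1]")
    needed_events = (sigmas / drop_fraction) ** 2
    return needed_events / rate_per_hour


# Hypothetical volumes: a rare key metric versus a busy micro-metric.
purchases = hours_to_detect(rate_per_hour=6, drop_fraction=0.5)
page_views = hours_to_detect(rate_per_hour=600, drop_fraction=0.5)

print(purchases)   # 6.0 hours
print(page_views)  # about 0.06 hours, i.e. under four minutes
```

Under these assumptions, the same 50% drop that takes hours to show up in purchases is visible within minutes in a micro-metric with a hundred times the volume.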

Use better graphing tools to effectively tune your queries

OK, the Prometheus GUI comes with some core graphing capabilities, but it can’t compare to those provided by tools like Grafana. With support for Prometheus out of the box, Grafana can stack metrics, override time intervals, use template variables and much, much more, all of which enable more efficient testing and analysis of queries.


Wrapping up, I would like to recommend the previously referenced blog by Brian Brazil at Robust Perception for more interesting Prometheus stuff: http://www.robustperception.io/blog/, and of course the official docs, which also link to the community.
