Data numbers

How we use Librato to monitor data quality

 August 9, 2016 

What’s the problem?

WattTime analyzes power grid data in real time from dozens of open data sources. Because the data is used to optimize the behavior of smart devices in real time, it is very important that we always have the most accurate, up-to-date data. This means that we have had to tackle a classic engineering problem: building a highly reliable system out of less reliable components.

There are two main sources of unreliability in our data ingestion system. First, any of our incoming data sources can go down for a period of time without warning, creating a gap in our data record. Second, our cluster of worker servers may not run the data scraping jobs for any number of reasons, deepening the potential gaps. You can see how we’re in a tricky predicament of always needing the most accurate data and yet having a number of reasons for something to go wrong.

As the saying goes, if you can’t measure it, you can’t manage it. Today’s post explains how we built a monitoring and alerting system to detect gaps in our data ingestion pipeline using Librato.

Designing the Solution

Here are the primary characteristics we wanted in a monitoring and alerting system:

  1. The solution needs to run 24/7. We want to provide our end-users with the cleanest available energy, and whether the current energy supply is clean can change every five minutes. If our data is not up-to-date, then the energy supply can change without our knowledge, and we might miss an opportunity to give our users a choice to save carbon.
  2. The solution needs to run in a way that’s decoupled from how the data scraping tasks normally get run. We don’t want our monitoring system to be dependent on the system it’s supposed to be monitoring! This means that we’d either have to spin up our own separate monitoring service (and maybe a monitor for the monitor…) or use a reliable third-party SaaS tool.
  3. The solution needs to have a way to identify any new gaps in our data as soon as possible, so we can triage and fix the problem before it gets worse. If we’re using a third-party tool, that means we need to give it a way to hook into our data pipeline.
  4. The solution needs to have a way to send us alert messages when a problem occurs. We like the workflow of Slack, but email would be ok as a fallback.
  5. The solution should make it easy to configure the frequency and thresholds for triggering alerts. Overly noisy alerting systems get ignored, so we wanted a system that wouldn’t bombard us with notifications and would send only the important, actionable ones.

The WattTime API is currently running on Heroku, a popular cloud platform-as-a-service provider. Heroku has a great ecosystem of high-quality third-party “add-ons” that we can trust to have good uptime (satisfying criterion #1), even if something bad happens on our end (satisfying criterion #2). We decided to start by surveying Heroku’s add-ons to see which ones would help us satisfy our other design criteria: data pipeline integration (#3), Slack integration (#4), and configurability (#5).

After some doc hunting, we established that many add-ons had Slack integration, so that didn’t narrow our solution space very far. Instead, we decided to make our choice primarily based on the mechanism of integrating with our data pipeline. There was a wide range of options here: some add-ons would collect data from us if we printed it to our logs, others would collect data if we raised an exception, etc. Exception-based add-ons would be a great fit for detecting failed requests during data ingestion, but they wouldn’t help us monitor failures in our worker cluster overall. A sufficiently configurable log-based add-on, on the other hand, would be able to send alerts either if a problematic value appeared in the logs, or if the log stream stopped getting updated altogether. If such an add-on existed, it would allow us to meet all five design criteria.

And it does: Librato! Librato is a service for visualizing and creating alerts based on metrics. The Heroku add-on comes preconfigured to read metrics from Heroku log streams. In fact, we were already using Librato to visualize dyno performance metrics that Heroku printed to our log stream—but we hadn’t tried configuring any alerts. Reading Librato’s docs on alerts, we discovered that it supports both kinds of alert triggers we wanted. This made it a great choice for a monitoring service that we can configure to watch our data and alert us if something unexpected happens.

Step 1: Logging data quality

Librato is organized around the concept of a “metric.” A Librato metric can be any kind of time series data: something that can be graphed with a time stamp on the x axis and a number on the y axis. Graphing a metric this way makes it easy to see when it has stopped reporting or hasn’t logged a new data point within a certain period of time.

We chose “lag time” as the metric to track. We define lag time as the age of the most recently ingested data point of a particular type. Lag time serves as a good metric because it gets to the heart of the problem. If our goal is to have the most accurate, up-to-date data, we need to know when gaps start and how long they last.

To implement data quality logging, every time we collect data, we figure out how old the newest data point is, and print that number of minutes to the logs in Librato’s specific format. Here’s a screenshot of what the output looks like in our Papertrail logs:

Log messages formatted as Librato metrics, in Papertrail
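
As a rough illustration of the logging step described above, here is a minimal Python sketch. It assumes the `source=… sample#name=value` log format that the Librato Heroku add-on parses for custom metrics. The metric name `data.lag_minutes`, the source name `example-region`, and both helper functions are hypothetical, not taken from our codebase.

```python
from datetime import datetime, timezone


def lag_minutes(newest_point_time, now=None):
    """Age, in whole minutes, of the most recently ingested data point."""
    now = now or datetime.now(timezone.utc)
    return max(0, int((now - newest_point_time).total_seconds() // 60))


def librato_log_line(metric, value, source):
    """Format one measurement as a log line for the Librato add-on to pick up."""
    return f"source={source} sample#{metric}={value}"


# Hypothetical example: the newest data point is 12 minutes old.
now = datetime(2016, 8, 9, 18, 0, tzinfo=timezone.utc)
newest = datetime(2016, 8, 9, 17, 48, tzinfo=timezone.utc)

# Printing to stdout is enough on Heroku, since stdout goes to the log stream.
print(librato_log_line("data.lag_minutes", lag_minutes(newest, now), "example-region"))
```

Tagging each line with a source makes it possible to graph and alert on each data feed separately while keeping a single metric name.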

Step 2: Setting up alerts

We have several ways to detect if something has gone wrong with our data collection. If a new data point has not been added within an acceptable amount of time, the lag time metric climbs past that limit, and we want to get alerted so we can see if there is anything we can do on our end. To do this, we set “condition type” to “goes above” when creating an alert:

Example Condition

We can also check if our metric stops reporting entirely. In that case, we set “condition type” to “stops reporting”. Both alerts serve to inform us as quickly as possible if something were to go wrong.
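
To make the two condition types concrete, here is a small Python sketch of the checks as we understand them. It is an illustration only, not Librato’s implementation: the function names, the `(timestamp_seconds, value)` data shape, and the example threshold of 30 minutes over a 10-minute window are all assumptions.

```python
def goes_above(points, threshold, duration_s, now):
    """True if there are recent points and all of them exceed the threshold."""
    recent = [value for t, value in points if now - t <= duration_s]
    return bool(recent) and all(value > threshold for value in recent)


def stops_reporting(points, duration_s, now):
    """True if no point at all has arrived within the last duration_s seconds."""
    return not any(now - t <= duration_s for t, _ in points)


# Hypothetical lag-time samples: (seconds since start, lag in minutes).
points = [(0, 5), (300, 6), (600, 35), (900, 40)]

# The data is arriving but stale: "goes above" fires, "stops reporting" does not.
print(goes_above(points, threshold=30, duration_s=600, now=1000))
print(stops_reporting(points, duration_s=600, now=1000))

# Nothing has been logged for a while: only "stops reporting" fires.
print(goes_above(points, threshold=30, duration_s=600, now=2000))
print(stops_reporting(points, duration_s=600, now=2000))
```

The two checks are complementary: the first catches a data source that has gone quiet while our workers keep running, and the second catches the workers themselves going quiet.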

Step 3: Receiving alerts

To set up Slack integration, you first need to retrieve the webhook URL for your Slack channel. In the Slack desktop app, go to the top left and click on Apps & Integration. This will take you to a directory of different apps that can integrate with Slack. Search for "Incoming WebHooks."

From there, go to "add configurations" and you should be able to find the webhook URL.
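
If you want to sanity-check the webhook URL before handing it to Librato, a short Python sketch like the following can help. It relies on Slack incoming webhooks accepting a JSON body with a `text` field. The URL below is a placeholder, and the helper `build_slack_request` is a name made up for this example.

```python
import json
import urllib.request


def build_slack_request(webhook_url, text):
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL: substitute the one Slack shows for your channel.
request = build_slack_request(
    "https://hooks.slack.com/services/XXX/YYY/ZZZ",
    "Test message: webhook is configured",
)

# Uncomment to actually post the message to the channel:
# urllib.request.urlopen(request)
```

If the test message shows up in the channel, the URL is good to paste into Librato.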

Once you have the URL, you can go to Librato’s Integrations tab and click “add configuration” on the Slack tab. Paste in the URL and give it a title you’ll be familiar with.

Then whenever an alert is triggered, you’ll receive a Slack message showing you why the alert was triggered and other relevant information. 

What we're thinking about next

While it’s really nice for our system to alert us whenever anything goes wrong, we think it would be even better if our system were self-healing. As it stands, our system sends us an alert, and from there a living, breathing human being has to take time out of what they’re doing and investigate. So for our next step, we will create a system to patch up data holes as they happen. That way we can be more confident that we have the most accurate, up-to-date data at any time.
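
We haven’t built this yet, but the first building block would likely be a way to find the holes. Here is a minimal Python sketch, assuming data points are expected every five minutes. The function `find_gaps` and the sample record are hypothetical.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, expected_step):
    """Return (start, end) pairs where neighboring points are too far apart."""
    ordered = sorted(timestamps)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > expected_step
    ]


step = timedelta(minutes=5)
start = datetime(2016, 8, 9, 12, 0)

# Hypothetical record with the 12:10 and 12:15 points missing.
record = [start + i * step for i in (0, 1, 4, 5)]

for gap_start, gap_end in find_gaps(record, step):
    print(f"gap between {gap_start:%H:%M} and {gap_end:%H:%M}")
```

Each gap it reports could then be handed to a backfill job that re-requests just that time range from the data source.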