WattTime analyzes power grid data in real time from dozens of open data sources. Because the data is used to optimize the behavior of smart devices in real time, it is very important that we always have the most accurate up-to-date data. This means that we have had to tackle a classic engineering problem: building a highly reliable system out of less reliable components.
There are two main sources of unreliability in our data ingestion system. First, any of our incoming data sources can go down for a period of time without warning, creating a gap in our data record. Second, our cluster of worker servers may not run the data scraping jobs for any number of reasons, deepening the potential gaps. So we’re in a tricky predicament: we always need the most accurate data, yet there are plenty of ways for something to go wrong.
As the saying goes, if you can’t measure it, you can’t manage it. Today’s post explains how we built a monitoring and alerting system to detect gaps in our data ingestion pipeline using Librato.
Here are the primary characteristics we wanted in a monitoring and alerting system:

1. High uptime, so the monitor itself is dependable.
2. Independence from our own infrastructure, so it keeps working even if something bad happens on our end.
3. Integration with our data ingestion pipeline.
4. Slack integration, so alerts reach the team where we already communicate.
5. Enough configurability to express the alert conditions we care about.
The WattTime API is currently running on Heroku, a popular cloud platform-as-a-service provider. Heroku has a great ecosystem of high-quality third-party “add-ons” that we can trust to have good uptime (satisfying criterion #1), even if something bad happens on our end (satisfying criterion #2). We decided to start by surveying Heroku’s add-ons to see which ones would help us satisfy our other design criteria: data pipeline integration (#3), Slack integration (#4), and configurability (#5).
After some doc hunting, we established that many add-ons had Slack integration, so that didn’t narrow our solution space very far. Instead, we decided to make our choice primarily based on the mechanism of integrating with our data pipeline. There was a wide range of options here: some add-ons would collect data from us if we printed it to our logs, others would collect data if we raised an exception, etc. Exception-based add-ons would be a great fit for detecting failed requests during data ingestion, but they wouldn’t help us monitor failures in our worker cluster overall. A sufficiently configurable log-based add-on, on the other hand, would be able to send alerts either if a problematic value appeared in the logs, or if the log stream stopped getting updated altogether. If such an add-on existed, it would allow us to meet all five design criteria.
And it does: Librato! Librato is a service for visualizing and creating alerts based on metrics. The Heroku add-on comes preconfigured to read metrics from Heroku log streams. In fact, we were already using Librato to visualize dyno performance metrics that Heroku printed to our log stream—but we hadn’t tried configuring any alerts. Reading Librato’s docs on alerts, we discovered that it supports both kinds of alert triggers we wanted. This made it a great choice for a monitoring service that we can configure to watch our data and alert us if something unexpected happens.
Librato is organized around the concept of a “metric.” A Librato metric can be any kind of time series data: something that can be graphed with a time stamp on the x axis and a number on the y axis. When a metric stops reporting, or hasn’t logged a new datapoint within an acceptable period of time, a visualization of that time series makes it easy to see what is going on.
We chose “lag time” as the metric to track. We define lag time as the age of the most recently ingested data point of a particular type. Lag time is a good metric because it gets to the heart of the problem: if our goal is to have the most accurate, up-to-date data, we want to know when our gaps start and how long they last.
To implement data quality logging, every time we collect data, we figure out how old the newest data point is, and print that number of minutes to the logs in Librato’s specific format. Here’s a screenshot of what the output looks like in our Papertrail logs:
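As a rough sketch, that logging step might look something like the following. The metric name and the `sample#` prefix here are illustrative; the exact log-line convention the add-on parses should be checked against Librato’s docs for log-based metrics.

```python
from datetime import datetime, timezone

def log_lag_time(metric_name, newest_timestamp):
    """Compute how stale the newest ingested data point is (in minutes)
    and print it to stdout, where Heroku's log stream picks it up and
    the Librato add-on can read it as a metric."""
    lag_minutes = (datetime.now(timezone.utc) - newest_timestamp).total_seconds() / 60.0
    # sample#name=value is one of the log-based metric formats Librato
    # understands; adjust the prefix to match the add-on docs if needed.
    print("sample#%s=%.1f" % (metric_name, lag_minutes))
    return lag_minutes

# Called at the end of each scraping job, once per data source, e.g.:
# log_lag_time("lag_time.caiso_load", newest_point.timestamp)
```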
There are a couple of ways we can detect that something has gone wrong with our data collection. If a new datapoint has not been added within an acceptable amount of time, we want to be alerted so we can see if there is anything we can do on our end. To catch this case, we set “condition type” to “goes above” when creating an alert:
We can also check if our metric stops reporting entirely. In that case, we set “condition type” to “stops reporting”. Both alerts are there to inform us as quickly as possible when something goes wrong.
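Librato also exposes a REST API for managing alerts, so the same two conditions can be created programmatically instead of through the UI. The sketch below is based on our reading of that API, not on our production code; the endpoint, the condition type names (“above” and “absent”), and the field names should be double-checked against Librato’s API docs, and the credentials and metric name are placeholders.

```python
import requests

LIBRATO_USER = "ops@example.com"   # placeholder credentials
LIBRATO_TOKEN = "your-api-token"
ALERTS_URL = "https://metrics-api.librato.com/v1/alerts"

def create_lag_alerts(metric_name, max_lag_minutes, absent_seconds=3600):
    """Create the two alerts described above for a lag-time metric:
    one that fires when the value goes above a threshold, and one
    that fires when the metric stops reporting altogether."""
    alerts = [
        {   # "goes above": lag time exceeded our acceptable threshold
            "name": metric_name.replace(".", "-") + ".too-stale",
            "conditions": [{
                "type": "above",
                "metric_name": metric_name,
                "threshold": max_lag_minutes,
            }],
            "version": 2,
        },
        {   # "stops reporting": no datapoints at all for a while
            "name": metric_name.replace(".", "-") + ".stopped-reporting",
            "conditions": [{
                "type": "absent",
                "metric_name": metric_name,
                "duration": absent_seconds,
            }],
            "version": 2,
        },
    ]
    for alert in alerts:
        resp = requests.post(ALERTS_URL, json=alert,
                             auth=(LIBRATO_USER, LIBRATO_TOKEN))
        resp.raise_for_status()

# create_lag_alerts("lag_time.caiso_load", max_lag_minutes=60)
```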
To set up Slack integration, Librato requires you to retrieve the webhook URL for your Slack channel. In the Slack desktop app, go to the top left and click on Apps & Integrations. This will take you to a directory of apps that can integrate with Slack. Search for "Incoming WebHooks."
From there, go to "add configurations" and you should be able to find the webhook URL.
Once you have the URL, you can go to Librato’s Integrations tab and click “add configuration” on the Slack tab. Paste in the URL and give it a title you’ll be familiar with.
Then whenever an alert is triggered, you’ll receive a Slack message showing you why the alert was triggered and other relevant information.
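If you want to confirm the webhook URL works before (or after) wiring it into Librato, you can post a test message to it directly. This is a minimal sketch using Slack’s standard incoming-webhook payload; the URL below is a placeholder for the one you copied.

```python
import requests

# Placeholder; use the webhook URL copied from the
# "Incoming WebHooks" configuration page.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXX"

def post_test_message(text):
    """Send a simple message to the channel behind the webhook."""
    resp = requests.post(SLACK_WEBHOOK_URL, json={"text": text})
    resp.raise_for_status()  # Slack responds with "ok" on success

# post_test_message("Librato alert integration test")
```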
While it’s really nice for our system to alert us whenever anything goes wrong, we thought it would be even better if our system were self-healing. As it stands, our system sends us an alert, and from there a living, breathing human being has to take time out of what they’re doing and investigate. So for our next step, we will build a system that patches up data holes as they happen. That way we can be even more confident that we have the most accurate, up-to-date data at any time.