Monitoring Sidekiq Using AWS Lambda and Slack
It’s no mystery that I’m a Sidekiq fan; it’s my background job processing library of choice for any non-trivial application. My favorite feature of Sidekiq has to be retries: by default, failed jobs will retry 25 times over the course of 21 days.
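Retries require no setup, and the behavior can be tuned per worker. Here’s a quick illustrative sketch (the worker name and options are just examples):

```ruby
class ExampleWorker
  include Sidekiq::Worker
  # Failed jobs retry 25 times by default (with an exponential backoff);
  # the count can be lowered, or retries disabled, per worker.
  sidekiq_options retry: 10

  def perform(account_id)
    # work that may raise and be retried
  end
end
```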
As a remote company, we use Slack to stay in touch with everyone AND to manage/monitor our infrastructure (hello #chatops). We can deploy from Slack (though we generally don’t, since we have full CI) and be notified of infrastructure and application errors.
When Sidekiq retries accumulate, it’s a good indication that something more severe might be wrong. Rather than get an email we won’t see for 30 minutes, we decided to integrate these notifications into Slack. In doing so, we found AWS Lambda to be a lightweight way to tie Sidekiq monitoring and Slack notifications together.
The Problem
Bark is background job-heavy. The web application is a glorified CRUD app that sets up the data needed to poll a child’s social media feed and monitor for potential issues. The best-case scenario for a parent is that they will never hear from us.
Because Bark’s background jobs commonly interact with 3rd-party APIs, failures aren’t a big surprise. APIs can be down, network connections can fail – Sidekiq’s retry logic protects us from transient network errors. Under normal circumstances, jobs retry and ultimately run successfully after subsequent attempts. These are non-issues and something we don’t need an engineer to investigate.
There are times when retries accumulate, giving us a strong indication that something more severe may be wrong. Initially, we set up New Relic to notify us of an increased error rate. This worked for simple cases but sometimes produced false positives. As a result, we started to ignore the alerts, which potentially masked more important issues.
We soon realized that one of the gauges of application health was the number of retries in the Sidekiq queue. We have the Sidekiq Web UI mounted within our admin application, so we’d browse there a few times a day to make sure the number of retries wasn’t outside our expectations (in this case, < 50 was acceptable).
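For context, mounting the bundled Web UI in a Rails app typically looks like this (simplified; as noted later, our actual mount sits behind an authentication layer):

```ruby
# config/routes.rb
require 'sidekiq/web'

Rails.application.routes.draw do
  # Sidekiq's bundled Web UI; in our case this sits behind admin authentication
  mount Sidekiq::Web => '/sidekiq'
end
```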
This wasn’t a great use of our time. Ideally, we wanted a Slack notification when the number of Sidekiq retries was > 50.
The Solution
Because Bark is on AWS, we naturally looked to their tools for assistance. In this case, we needed something that would poll Sidekiq, check the number of retries, and POST to Slack if the number of retries was > 50.
There were a few options:
- Add the Sidekiq polling and Slack notification logic to our main application and set up a cron job
- Create a new satellite application that ONLY does the above (microservices???)
- Set up an AWS Lambda function to handle the above logic
The first two options would’ve worked, but I was hesitant to add complexity to our main application. I was also hesitant to have to manage another application (updates, etc.) for something that seemed simple.
Option “AWS Lambda” won! Let’s take a look at the implementation.
Sidekiq Queue Data Endpoint
First, we need to expose the number of Sidekiq retries somehow. As I mentioned above, the Sidekiq Web UI is mounted in our admin application, but behind an authentication layer that would’ve been non-trivial to expose publicly.
Instead, we created a new Rails route to respond with some basic details about the Sidekiq system.
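A minimal sketch of what such an endpoint can look like (the controller name, action, and JSON keys here are illustrative rather than our exact code):

```ruby
# app/controllers/sidekiq_stats_controller.rb
require 'sidekiq/api'

class SidekiqStatsController < ApplicationController
  def index
    stats = Sidekiq::Stats.new

    # Only counts are exposed: no job classes or arguments
    render json: {
      enqueued:  stats.enqueued,
      retries:   stats.retry_size,
      scheduled: stats.scheduled_size,
      dead:      stats.dead_size
    }
  end
end
```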
along with the route:
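A matching sketch, assuming the hypothetical controller above:

```ruby
# config/routes.rb
get '/sidekiq_stats', to: 'sidekiq_stats#index'
```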
As you can see, the endpoint is public (there are no job arguments or names exposed, just counts). The code digs into the Sidekiq API to interrogate the size of the queues.
Slack Incoming WebHook
We want to be able to POST to Slack when the number of Sidekiq retries is > 50. To do this, we’ll set up a custom incoming webhook integration in Slack.
We’ll start by choosing Apps & integrations from within the main Slack options. From here, choose Manage in the top right, and then Custom Integrations on the left. You’ll have two options:
- Incoming WebHooks
- Slash Commands
We’ll choose Incoming WebHooks and then Add Configuration to add a new one. From here, we’ll supply the information needed to specify the channel where the notifications will appear and how they look.
The most important part of this step is to get the Webhook URL. This will be the URL we POST to from within our Lambda function when retries are above our acceptable threshold.
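For reference, Slack’s incoming webhooks accept a simple JSON payload, and the only required field is text (the message below is illustrative):

```json
{
  "text": "Sidekiq retries are at 72 (threshold: 50)",
  "username": "sidekiq-monitor",
  "icon_emoji": ":rotating_light:"
}
```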
AWS Lambda Function
Now that we have our endpoint to expose the number of retries (among other things) and the Slack webhook URL to POST to, we need to set up the AWS Lambda function to tie the two together. We’ll start by creating a new Lambda function with the defaults, using the latest Node runtime.
For the trigger, we’ll use “CloudWatch Events - Schedule”:
From here, we’ll enter a name and description for our rule and define its rate (I chose every 5 minutes). Enable the trigger and we’ll move to defining our code. Next, we’ll give the function a name and choose the latest NodeJS as the runtime. Within the inline editor, we’ll use the following code:
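What follows is a sketch of such a handler rather than our exact production code; the retries key assumes the hypothetical endpoint sketched earlier, and sidekiqURL, hookURL, and the threshold are placeholders:

```javascript
'use strict';

const https = require('https');
const url = require('url');

// Placeholders: point these at your own stats endpoint and Slack webhook
const sidekiqURL = 'https://example.com/sidekiq_stats';
const hookURL = 'https://hooks.slack.com/services/XXX/YYY/ZZZ';
const RETRY_THRESHOLD = 50;

exports.handler = (event, context, callback) => {
  // Poll the public Sidekiq stats endpoint
  https.get(sidekiqURL, (res) => {
    let body = '';
    res.on('data', (chunk) => { body += chunk; });
    res.on('end', () => {
      console.log('Sidekiq stats: ' + body);
      const stats = JSON.parse(body);

      if (stats.retries > RETRY_THRESHOLD) {
        postToSlack('Sidekiq retries are at ' + stats.retries +
                    ' (threshold: ' + RETRY_THRESHOLD + ')', callback);
      } else {
        callback(null, 'Retries within threshold; nothing to do.');
      }
    });
  }).on('error', callback);
};

// POST a simple text message to the Slack incoming webhook
function postToSlack(text, callback) {
  const payload = JSON.stringify({ text: text });
  const hook = url.parse(hookURL);

  const req = https.request({
    hostname: hook.hostname,
    path: hook.path,
    method: 'POST',
    headers: { 'Content-Type': 'application/json' }
  }, (res) => {
    res.on('data', () => {});
    res.on('end', () => {
      console.log('Message posted successfully');
      callback(null, 'Notification sent to Slack.');
    });
  });

  req.on('error', callback);
  req.write(payload);
  req.end();
}
```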
Note: sidekiqURL and hookURL need to be defined with appropriate values for this to work.
Review and save the Lambda function and we’re all set!
Review
We can review the Lambda function logs on CloudWatch. Go to CloudWatch and choose “Logs” from the left menu. From here, we’ll click the link to the name of our Lambda function:
From here, logs for each invocation of the Lambda function will be grouped into a log stream:
Grouped by time, each link will contain multiple invocations. A single execution is wrapped with a START and an END, as shown in the logs. Messages in between will be calls to console.log from within our function. We logged the results of the Sidekiq queue poll for debugging purposes, so you can see that below:
This was an invocation where the number of retries was < 50 and, as a result, didn’t need to POST to Slack. Let’s take a look at the opposite case:
We can see the Message posted successfully log, indicating our message was successfully sent to Slack’s incoming webhook.
Finally, here’s what the resulting message looks like in Slack when the number of Sidekiq retries is above our threshold:
Conclusion
Using new tools is fun, but not when it brings operational complexity. I’ve personally found AWS Lambda to be a great place for endpoints/functionality that feels cumbersome to include in my applications. Bringing these notifications into Slack has been a big win for our team. We took a previously untrustworthy notification (the New Relic error rate) and brought some clarity to the state and health of our applications.