Faster CloudWatch Alarm Investigations
When a CloudWatch alarm fires at 2am, the investigation ritual starts. Open the console. Filter thousands of log streams. Search for stack traces. Figure out which repo owns the failing service. Open your IDE. Start debugging from memory. By the time you understand what happened, you have already spent 30 to 45 minutes, and the engineers on your on-call rotation have grown to dread their shifts.
We built a different workflow. An agent does the investigation. You make the call.
The manual process has real costs
Most teams treat CloudWatch alarm investigation as unavoidable overhead. The steps are always the same:
- Navigate to the CloudWatch console and locate the relevant alarm
- Filter through log streams to find entries from the affected time window
- Search for error patterns, exceptions, and stack traces
- Identify which repository and service owns the failing code
- Open the code and reconstruct what was running at the time of the failure
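The steps above can be sketched as a single scripted pass. This is a minimal sketch, assuming boto3 and read access to the log group; the patterns and helper names are illustrative, not part of any particular product:

```python
import re
from datetime import datetime, timedelta

# Signatures a first-pass triage typically greps for (illustrative, not exhaustive).
ERROR_PATTERNS = re.compile(r"\b(ERROR|FATAL|Exception|Traceback)\b")

def alarm_window(alarm_time: datetime, minutes: int = 15) -> tuple[int, int]:
    """Return the (start, end) window around an alarm as epoch milliseconds."""
    start = alarm_time - timedelta(minutes=minutes)
    return int(start.timestamp() * 1000), int(alarm_time.timestamp() * 1000)

def find_error_lines(log_lines: list[str]) -> list[str]:
    """Keep only the lines that match common error signatures."""
    return [line for line in log_lines if ERROR_PATTERNS.search(line)]

def fetch_and_triage(log_group: str, alarm_time: datetime) -> list[str]:
    """Pull the window's log events and filter them (requires AWS credentials)."""
    import boto3  # imported lazily so the rest of the sketch loads without it
    start_ms, end_ms = alarm_window(alarm_time)
    client = boto3.client("logs")
    events = client.filter_log_events(
        logGroupName=log_group, startTime=start_ms, endTime=end_ms
    )["events"]
    return find_error_lines([e["message"] for e in events])
```

Even as a script, this only covers the first three bullets; mapping errors back to a repository and its code is the part that resists simple automation.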
The downstream effect is predictable. Non-critical alarms get muted because the investigation cost is too high to justify. Muted alarms become invisible production issues. Eventually something breaks badly enough that it cannot be ignored.
Triggering an investigation with a Skill
With Blocks, investigation starts from the alarm thread in Slack. Mention the bot and invoke the alarm Skill: `@blocks /alarm`. That is the entire trigger. The agent picks up the alarm metadata from context and starts working.
Skills are reusable, versioned instructions that shape what the agent does. The alarm Skill tells the agent how to read CloudWatch data, what to look for in logs, and how to structure its findings. You can customize it for your team: different log patterns, different output formats, different services. The Skill is the configuration; the agent does the work.
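Blocks' actual Skill format is not shown here, but as a sketch, a team-customized alarm Skill might carry configuration shaped like this. Every field name below is hypothetical:

```python
# Hypothetical shape of a team-customized alarm Skill.
# Field names are illustrative; Blocks' real Skill format may differ.
alarm_skill = {
    "name": "alarm",
    "version": "1.2.0",            # versioned, so changes are auditable
    "log_patterns": [              # what to look for in the logs
        r"\bERROR\b",
        r"Unhandled exception",
        r"Traceback",
    ],
    "log_groups": [                # where the agent is allowed to look
        "/aws/lambda/checkout-service",
    ],
    "output": {                    # how to structure the findings
        "format": "slack_thread_report",
        "include": ["root_cause", "impact", "error_summary"],
    },
}
```

The point of the shape: the Skill holds the team-specific knowledge (patterns, scope, format), and the agent supplies the general investigation behavior.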
What the agent actually does
Once triggered, the agent works across several data sources automatically. It reads the alarm metadata to understand the trigger condition and time window. It pulls the relevant CloudWatch logs using scoped, read-only IAM credentials limited to specific log groups. It searches those logs for error patterns, exceptions, and stack traces. Then it cross-references what it found against the relevant GitHub repositories to locate the specific code involved.
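The cross-referencing step is the interesting one: stack trace frames carry file paths and line numbers that can be mapped back to a repository. A minimal sketch of that mapping, with all paths and repo names hypothetical:

```python
import re
from typing import Optional

# Matches Python-style traceback frames, e.g.:
#   File "/app/services/payments/handler.py", line 42, in process
FRAME_RE = re.compile(r'File "(?P<path>[^"]+)", line (?P<line>\d+)')

def frames_from_trace(trace: str) -> list[tuple[str, int]]:
    """Extract (file_path, line_number) pairs from a stack trace."""
    return [(m["path"], int(m["line"])) for m in FRAME_RE.finditer(trace)]

def owning_repo(path: str, repo_prefixes: dict[str, str]) -> Optional[str]:
    """Map a runtime file path to the repository that owns it.

    repo_prefixes maps deployment path prefixes to GitHub repos,
    e.g. {"/app/services/payments": "org/payments-service"} (illustrative).
    """
    for prefix, repo in repo_prefixes.items():
        if path.startswith(prefix):
            return repo
    return None
```

A real agent has to handle more trace formats than this, but the shape is the same: frames in the logs become file paths and line numbers in a specific repo.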
The IAM scoping is deliberate. The agent gets exactly the access it needs for the specific log groups in scope. Nothing broader. The security boundary is explicit, not implicit.
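Concretely, a scoped read-only policy can look like the following. The log group, region, and account values are placeholders:

```python
import json

# Read-only CloudWatch Logs access limited to a single log group.
# The resource ARN values here are placeholders.
scoped_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "logs:GetLogEvents",
                "logs:FilterLogEvents",
                "logs:DescribeLogStreams",
            ],
            "Resource": "arn:aws:logs:us-east-1:123456789012:log-group:/aws/lambda/checkout-service:*",
        }
    ],
}

policy_json = json.dumps(scoped_policy, indent=2)
```

Note what is absent: no `logs:PutLogEvents`, no wildcard resource, no actions outside CloudWatch Logs. The boundary is visible in the policy itself.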
The output is a complete investigation report
The agent posts its findings directly in the Slack thread where the alarm surfaced. The report includes the root cause with specific file paths and line numbers, an impact assessment across affected services, and a summary of the error pattern found in the logs. When a clear fix exists, it can open a pull request directly.
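Posting back into the originating thread is a standard Slack pattern: reply with the alarm message's `thread_ts`. A sketch using `slack_sdk`, with the report structure and all channel/token values hypothetical:

```python
def format_report(root_cause: str, impact: str, error_summary: str) -> str:
    """Assemble the investigation report text (section names are illustrative)."""
    return (
        f":mag: *Investigation report*\n"
        f"*Root cause:* {root_cause}\n"
        f"*Impact:* {impact}\n"
        f"*Error pattern:* {error_summary}"
    )

def post_to_alarm_thread(channel: str, thread_ts: str, report: str) -> None:
    """Reply in the thread where the alarm fired (requires a Slack bot token)."""
    from slack_sdk import WebClient  # imported lazily so the sketch loads without it
    client = WebClient(token="xoxb-placeholder")  # placeholder token
    client.chat_postMessage(channel=channel, thread_ts=thread_ts, text=report)
```

Passing `thread_ts` is what keeps the report attached to the alarm rather than flooding the channel.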
The whole investigation lands in the same thread where the on-call engineer is already looking. No context switching to the console. No manual log filtering. The engineer reads the report and decides what to do next.
Faster understanding, not just faster action
The goal here is not to automate the remediation. Automatically applying fixes to production systems during an active incident is a different and riskier problem. The goal is to compress the time between alarm and understanding.
An engineer who understands what is failing and why can make a good decision in minutes. An engineer who just got paged and is still navigating the console cannot. The agent handles the mechanical part of the investigation so the human can focus on the judgment call: is this a rollback, a hotfix, a mitigation, or a watch-and-wait?
That shift also makes alarms worth responding to again. When investigation costs 5 minutes instead of 45, unmuting non-critical alarms becomes reasonable. Your signal-to-noise ratio improves, and the incidents you were quietly missing start showing up in your queue.