Darren Lasso
case study #1
How might we empower SREs to manage incident response and make on-call less awful?
Design execution
Design system
UX research
AI/ML

From 2021 to 2023, I led design for a new Observability feature streamlining incident response for on-call SREs. The solution aggregated related alerts into single incidents, improving notification and investigation, reducing MTTX, and creating a cohesive user experience across the suite.

Read about the launch
Framing the problem
In order to improve MTTR, admins require granular control over their alerts

SRE managers rely on alerts to detect performance issues across their tech stack, but investigating and resolving those issues can be complex, noisy, and coordination-heavy. To support this, admins must ensure alerts are clear, consistent, and contextually relevant, often aggregating noisy alerts into incidents or muting those from expected outages like maintenance.

Alert fatigue
Alert storms appear during a critical outage
Burnout
SRE sleep deprivation and burnout are prevalent
$300k/hr
Cost of enterprise outages

The hardest part of being on-call is not the work itself, but the constant anticipation — knowing that at any moment, your peace can be shattered by someone else's emergency. - Anonymous SRE

Timeline & process
June '21
May '23
01.
Project kickoff
Interpret research
Interview SREs
02.
Discovery
Define challenges
Define requirements
03.
Explore
IA / Wireframes
Design concepts
04.
Design
Final designs
Evaluation
05.
Deliver
Launch MVP
Eng review
I remained embedded on this team from the beginning of design execution, through to the launch of our MVP solution.
Results
94%
Improvement to mean time to resolution (MTTR)
6
Fortune 500 companies signed up for pilot
92%
Reduction in alert noise
project kickoff
Lead the incident management admin experience

I joined the team after initial research and prioritization were established, focusing on quickly understanding users and use cases to deliver designs for critical incident response features. My work centered on the admin experience, defining key features like alert aggregation, enrichment, and maintenance policies.

  • Define the end-to-end incident responder journey for admins and SREs.
  • Partner with PM and engineering to rapidly iterate on design concepts.
  • Lead design execution and evaluation across the alert management admin experience.
  • Provide feedback on engineering implementation of designs.
Discovery
Define the incident responder journey
To help the product team prioritize features, we defined the incident responder journey within the product and identified user challenges at each stage, ensuring our work addressed key problems within this scope.
phase 1.
Acknowledge alert

An incident responder is notified that a critical issue has degraded their application’s performance and now requires an investigation to determine its root cause.
phase 2.
Investigate issue

Incident responders analyze alert metadata to identify the problem's origin, checking APM for impacted services, IMM for degraded infrastructure, and logs for key data to determine the cause.
phase 3.
Resolve & report

Responders resolve the incident (e.g., scaling resources, rolling back code) and document actions taken. Afterward, investigators review and document lessons, while admins adjust settings to improve future responses.
admin user research insights
Understanding the SRE journey allows us to equip admins with tools to streamline alerts, provide context, and enhance collaboration for faster resolution
I interviewed SRE admins to understand their needs, receiving numerous feature requests. A key area they identified—and we prioritized—was reducing alert noise through aggregation features.
How do I see through alert noise?
Alerts are configured to fire when a performance degradation is observed. These failures may happen at any time, even in the middle of the night. During a critical outage, this often results in tens or hundreds of alerts firing, even though they may all describe the same root-cause issue.
But the challenge isn't just managing the alerts—it's realizing that many of them are symptoms of the same underlying issue. It's like trying to put out a fire by focusing on every individual spark, instead of stopping the source. - Anonymous SRE
How do I consolidate different data schemas?
Alerts come from multiple sources, and they all use different naming conventions. A field may be labeled "service" by one source and "service_name" by another, but both describe the same thing.
Consistency in naming isn’t just a nice-to-have; it’s a lifeline when you need to make sense of the chaos quickly. - Anonymous SRE
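One way to picture the consolidation work admins are asking for is a simple alias map that folds source-specific field names into one canonical schema. This is a minimal sketch under assumed field names (service_name, svc, hostname), not the product's actual normalization logic.

```python
# Minimal sketch of normalizing alert metadata from different sources.
# The alias map and field names are illustrative, not the product's actual schema.
CANONICAL_FIELDS = {
    "service": "service",
    "service_name": "service",
    "svc": "service",
    "host": "host",
    "hostname": "host",
}

def normalize_alert(alert: dict) -> dict:
    """Remap known field aliases onto one canonical name; pass unknown fields through."""
    return {CANONICAL_FIELDS.get(key, key): value for key, value in alert.items()}

# Two alerts from different sources now share the same "service" key.
print(normalize_alert({"service_name": "checkout", "severity": "critical"}))
print(normalize_alert({"service": "checkout", "hostname": "web-01"}))
```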
Where do I start?
When a user is notified of an incident and doesn't know the root cause, they jump across all of their monitoring tools and spend precious time figuring out which one holds the information needed to uncover the error.
An SRE's job is already tough enough—now add the frustration of chasing down the root cause by hopping between every monitoring tool like a scavenger hunt. - Anonymous SRE
How do I ignore known non-issues?
Sometimes an app needs to run maintenance or deploy a critical update. During this time, the application will briefly go down. As a result, all the alert detectors the team set up will fire and wake the whole team up.
Running maintenance or deploying an update shouldn't feel like an emergency, but when every alert goes off like the system's on fire, it’s hard not to wake up the whole team in a panic—even though it's all part of the plan. - Anonymous SRE
Each failure point became an end-user or admin experience for me to design
These four failure points were critical to solve for our users. We were also under significant pressure to deliver in time for an announcement at our annual conference, and I didn't want design to bottleneck the rest of the team. It was time to start designing. The focus of this case study is failure point 1: alert noise consolidation.
Explore
Enabling alert aggregation
To facilitate incident management through alert noise reduction (alert aggregation), we needed to give admins powerful controls for targeting alerts with specific metadata signals, grouping them into a single incident, and then paging the on-call SRE to investigate and resolve that issue. A rough code sketch of this flow follows the three steps below.
Step 1.
Alert routing

An admin sets filters targeting specific metadata fields or values in alerts. Any metadata indexed by the system is available to filter on. Alerts with matching fields are routed into the policy.
Step 2.
Alert grouping

The system sorts through all the alerts routed into each policy and looks for matching values in admin-specified fields. For example, all routed alerts with the same "service" field value are grouped into a single incident.
Step 3.
Incident automation

Once an incident is created, it is triggered immediately and the on-call team is notified. In addition to notifications, other automated actions such as opening a 3rd party ticket may be useful.
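To make the three steps concrete, here is a minimal Python sketch of the flow, with hypothetical policy filters, field names, and a stand-in notify() action; the real system operates on indexed alert metadata rather than in-memory dictionaries.

```python
from collections import defaultdict

# Minimal sketch of the three-step flow: route alerts by metadata filters,
# group routed alerts by an admin-specified field, then notify per incident.
# Policy shape, field names, and notify() are hypothetical stand-ins.

def route(alerts, filters):
    """Step 1: keep only alerts whose metadata matches every filter condition."""
    return [a for a in alerts if all(a.get(field) == value for field, value in filters.items())]

def group(routed, group_by):
    """Step 2: aggregate routed alerts that share the same value for the chosen field."""
    incidents = defaultdict(list)
    for alert in routed:
        incidents[alert.get(group_by)].append(alert)
    return incidents

def notify(incident_key, alerts):
    """Step 3: trigger automation, e.g. page the on-call SRE or open a ticket."""
    print(f"Incident for {incident_key}: {len(alerts)} alerts aggregated, paging on-call...")

alerts = [
    {"env": "prod", "service": "checkout", "severity": "critical"},
    {"env": "prod", "service": "checkout", "severity": "critical"},
    {"env": "prod", "service": "payments", "severity": "warning"},
]
for key, grouped in group(route(alerts, {"env": "prod"}), "service").items():
    notify(key, grouped)
```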
initial exploration
An overly complex and infeasible start
Before I joined, designers had explored a complex node-based Northstar design for processing alerts and incidents. While exciting for the design team, it lacked stakeholder collaboration and was deemed infeasible due to its complexity and resource constraints. I helped guide the team toward a more practical solution that could meet our delivery timelines.
Low fidelity exploration
Translating the admin journey
I designed wireframes for a multi-step process to set up alert aggregation and automation, adhering to established patterns and components. Collaborating closely with the Design System team, I successfully advocated for new patterns where needed, which were later integrated into the core system.
testing insights
What users told me:
What happens if two policies have similar conditions?
What happens if an alert matches the conditions in multiple policies?
I don't understand how grouping works. I select alerts to be monitored in routing and then I select... less of them to be grouped?
What do we even call this thing?
Our product often struggled with confusing taxonomy. Initially, this admin area was to be called a "service," a term already heavily used in Observability. To address this, I created a clickable prototype from my wireframes and ran unmoderated user testing: participants completed tasks and were then prompted to choose from a list of pre-defined names for this area. This process led to the adoption of "Incident Policy" as the final name.
Design iteration (not quite there yet)
Might this design pattern actually make alerts more noisy?
While designing the lister page for admins' Incident Policies, we realized allowing unlimited filtering controls could result in alerts matching multiple policies, worsening alert storms. To address this, I designed a priority-based pattern where each policy has an inherent order. Admins can easily set which policy takes precedence by reordering the list via drag-and-drop.
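As an illustration of the precedence model, the sketch below assumes first-match-wins semantics: policies are evaluated in the admin's rank order, and an alert is claimed by the highest-priority policy whose filters match. The policy names and fields are made up.

```python
# Minimal sketch of priority-based routing under an assumed first-match-wins rule.
policies = [
    {"name": "Checkout outages", "filters": {"service": "checkout"}},   # rank 1
    {"name": "All prod alerts",  "filters": {"env": "prod"}},           # rank 2 (catch-all)
]

def matching_policy(alert, ranked_policies):
    """Return the highest-ranked policy whose filters all match, so an alert
    feeds exactly one policy instead of fanning out into several incidents."""
    for policy in ranked_policies:
        if all(alert.get(field) == value for field, value in policy["filters"].items()):
            return policy["name"]
    return None

print(matching_policy({"service": "checkout", "env": "prod"}, policies))  # "Checkout outages"
print(matching_policy({"service": "payments", "env": "prod"}, policies))  # "All prod alerts"
```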
testing insights
What users told me:
It’s great to have control over which policy takes precedence for noisy alerts.
It’s hard to drag items when there are too many policies on the screen.
Design iteration (not quite there yet)
How might we simplify alert grouping?
Initially, my alert grouping design mirrored alert routing, requiring admins to set separate, narrower filter conditions. To simplify, I redesigned it so admins could select fields from routed alerts, grouping alerts with matching values into incidents automatically. This allows each Incident Policy to trigger multiple incidents, with similar signals prioritized by the admin.
I added a visualization to show admins the expected impact of their policy by simulating how alert aggregation would have reduced noise during a defined look-back period, highlighting the number of incidents that would be triggered.
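The impact visualization can be thought of as a replay: take the alerts from the look-back window, apply the grouping field, and compare the alert count to the resulting incident count. The sketch below assumes a 24-hour window and an illustrative alert shape; the real simulation runs against indexed alert history.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Sketch of the impact preview: replay alerts from a look-back window and count
# how many incidents the chosen grouping field would have produced.

def simulate_impact(alerts, group_by, lookback=timedelta(hours=24), now=None):
    now = now or datetime.utcnow()
    recent = [a for a in alerts if now - a["timestamp"] <= lookback]
    incidents = defaultdict(list)
    for alert in recent:
        incidents[alert.get(group_by)].append(alert)
    return {"alerts": len(recent), "incidents": len(incidents)}

now = datetime.utcnow()
alerts = [
    {"service": "checkout", "timestamp": now - timedelta(hours=1)},
    {"service": "checkout", "timestamp": now - timedelta(hours=2)},
    {"service": "payments", "timestamp": now - timedelta(hours=3)},
    {"service": "payments", "timestamp": now - timedelta(days=2)},  # outside the window
]
print(simulate_impact(alerts, "service"))  # {'alerts': 3, 'incidents': 2}
```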
testing insights
What users told me:
I like that I can reuse existing fields from alert routing instead of setting new filters again. It’s simpler and faster.
The impact chart showing the reduction from alerts to incidents really helps me understand the benefit of grouping.
Sometimes an alert will resolve itself and send out an info "all clear!" alert. Is there any way to just automatically resolve an incident if the alerts indicate it has resolved itself?
Design
End-to-end incident response management
I iterated on this design for months, meeting daily with stakeholders to gather feedback and refine the core concepts of Incident Response management: selecting alerts to group, grouping them into incidents, and triggering automated actions.
high fidelity Design
Improved stack ranking
My initial design for reordering Incident Policies offered only one mechanism, limiting interaction and discoverability, especially for users with long policy lists. I redesigned the pattern to improve usability, adding drag-and-drop reordering, a stepper for single increments, and a dropdown to jump policies to specific ranks.
high fidelity Design
Robust filtering rules
The routing section remained largely consistent but highlights a successful collaboration with the design system team. Lacking a pattern for complex filtering logic, we co-designed one that debuted on the routing rules page. This page dynamically responds to user input, showing a preview of matching alerts within the defined look-back period as admins update their filters.
high fidelity Design
More control over the incident lifecycle
The grouping section underwent the most change, as the earlier version lacked granular control over the incident lifecycle, risking incidents persisting indefinitely. I introduced controls to auto-resolve incidents based on alert severity changes, define when new incidents should be created, and adjust an incident's severity or status. I also refined the impact visualization so its highlights stand out more clearly.
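For the auto-resolve control specifically, the logic amounts to watching the severities of an incident's member alerts and closing the incident once they have all cleared. The sketch below is a simplified stand-in with assumed severity and status values; the shipped policy exposes more conditions than this.

```python
# Simplified sketch of one lifecycle rule: auto-resolve an incident once every
# alert in it has cleared. Severity/status values are illustrative assumptions.
CLEAR_SEVERITIES = {"info", "ok"}

def auto_resolve(incident):
    """Mark the incident resolved if all of its alerts have dropped to a clear severity."""
    if all(alert["severity"] in CLEAR_SEVERITIES for alert in incident["alerts"]):
        incident["status"] = "resolved"
    return incident

incident = {
    "status": "active",
    "alerts": [
        {"id": "a1", "severity": "ok"},
        {"id": "a2", "severity": "info"},
    ],
}
print(auto_resolve(incident)["status"])  # resolved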
high fidelity Design
Automation rules
Our final design for Incident workflows (automatic actions taken on a triggered incident) better conforms to the established patterns of our product, while exposing effective controls for paging teammates automatically when an incident is triggered.
high fidelity Design
Incident responder experience
This admin experience powers the end-user experience for incident responders. The incident portal consolidates everything defined by the admin configuration, allowing SRE teams to manage investigators, associated alerts, a historical timeline of events, and more.
deliver
Launching Incident Intelligence
Incident Intelligence debuted with a public launch at .Conf, our annual customer conference, generating significant excitement and starting a closed pilot with six large-scale customers. By embedding incident management into our Observability product, we strengthened our case for tool consolidation and expanded sales opportunities.

However, due to shifting priorities, Incident Intelligence features were later spun off into a standalone on-call product, which is still available to customers today.
Post-launch AI enhancements
Smart alert grouping
When Incident Intelligence was spun off, the team was disappointed, but many features I designed proved valuable elsewhere—particularly the end-to-end administration of alert aggregation.

This design Northstar demonstrates how AI can simplify alert grouping. Rather than manually selecting correlating values, admins can now choose alerts to monitor, and AI provides grouping recommendations.
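Since the recommendation model itself isn't described here, the sketch below uses a toy heuristic as a stand-in: score each metadata field by how much it would compress the monitored alerts into fewer incidents, then suggest the strongest candidate. It only illustrates the shape of the interaction, not the ML behind the actual feature.

```python
# Toy stand-in for the grouping recommendation: score each metadata field by how
# much it would compress the monitored alerts into fewer incidents, and suggest
# the best one. Purely illustrative; the real feature uses ML.

def recommend_grouping_field(alerts):
    scores = {}
    fields = {f for a in alerts for f in a}
    for field in fields:
        values = [a[field] for a in alerts if field in a]
        if len(values) == len(alerts):  # only consider fields present on every alert
            scores[field] = len(alerts) / len(set(values))  # average alerts per incident
    return max(scores, key=scores.get) if scores else None

alerts = [
    {"service": "checkout", "host": "web-01"},
    {"service": "checkout", "host": "web-02"},
    {"service": "checkout", "host": "web-03"},
]
print(recommend_grouping_field(alerts))  # "service" compresses 3 alerts into 1 incident
```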
Why not let AI handle the entire incident response process?
Not everyone wants AI to handle all their critical tasks—surprising, I know. During initial user surveys, I found only about 3% of users wanted an AI to tell them what steps to take in an incident investigation. While AI can be valuable in investigations, I believe in a "trust but verify" approach. If your job is ensuring critical applications run smoothly, you'd want to double-check too.

That’s why I design AI/ML-driven features that provide actionable insights that human users can easily validate.
more case studies
Designing for impact
previous case study
User adoption of K8s
View case study
next case study
Public access to Treasury data
View case study