Darren Lasso
case study #1
How might we empower SREs to manage incident response and make on-call less awful?
Design execution
Design system
UX research
AI/ML

From 2021 to 2023, I led design for a new Observability feature streamlining incident response for on-call SREs. The solution aggregated related alerts into single incidents, improving notification and investigation, reducing MTTX, and creating a cohesive user experience across the suite.

Read about the launch
Framing the problem
In order to improve MTTR, admins require granular control over their alerts

SRE managers rely on alerts to detect performance issues across their tech stack, but investigating and resolving those issues can be complex, noisy, and coordination-heavy. To support this, admins must ensure alerts are clear, consistent, and contextually relevant, often aggregating noisy alerts into incidents or muting those from expected outages like maintenance.

Alert fatigue
Alert storms appear during a critical outage
Burnout
SRE sleep deprivation and burnout are prevalent
$300k/hr
Cost of enterprise outages

The hardest part of being on-call is not the work itself, but the constant anticipation — knowing that at any moment, your peace can be shattered by someone else's emergency. - Anonymous SRE

Timeline & process
June '21
May '23
01.
Project kickoff
Interpret research
Interview SREs
02.
Discovery
Define challenges
Define requirements
03.
Explore
IA / Wireframes
Design concepts
04.
Design
Final designs
Evaluation
05.
Deliver
Launch MVP
Eng review
I remained embedded on this team from the beginning of design execution, through to the launch of our MVP solution.
Results
94%
Improvement to mean time to resolution (MTTR)
6
Fortune 500 companies signed up for pilot
92%
Reduction in alert noise
project kickoff
Lead the incident management admin experience

I joined the team after initial research and prioritization were established, focusing on quickly understanding users and use cases to deliver designs for critical incident response features. My work centered on the admin experience, defining key features like alert aggregation, enrichment, and maintenance policies.

  • Define the end-to-end incident responder journey for admins and SREs.
  • Partner with PM and engineering to rapidly iterate on design concepts.
  • Lead design execution and evaluation across the alert management admin experience.
  • Provide feedback on engineering implementation of designs.
Discovery
Define the incident responder journey
To help the product team prioritize features, we defined the incident responder journey within the product and identified user challenges at each stage, ensuring our work addressed key problems within this scope.
phase 1.
Acknowledge alert

An incident responder is notified that a critical issue has degraded their application’s performance and now requires an investigation to determine its root cause.
phase 2.
Investigate issue

Incident responders analyze alert metadata to identify the problem's origin, checking APM for impacted services, IMM for degraded infrastructure, and logs for key data to determine the cause.
phase 3.
Resolve & report

Responders resolve the incident (e.g., scaling resources, rolling back code) and document actions taken. Afterward, investigators review and document lessons, while admins adjust settings to improve future responses.
admin user research insights
Understanding the SRE journey allows us to equip admins with tools to streamline alerts, provide context, and enhance collaboration for faster resolution
I interviewed SRE admins to understand their needs, receiving numerous feature requests. A key area they identified—and we prioritized—was reducing alert noise through aggregation features.
How do I see through alert noise?
Alerts are configured to fire when a performance degradation is observed. These failures may happen at any time, even in the middle of the night. During a critical outage, this often results in tens or hundreds of alerts firing, even though they may all describe the same root-cause issue.
But the challenge isn't just managing the alerts—it's realizing that many of them are symptoms of the same underlying issue. It's like trying to put out a fire by focusing on every individual spark, instead of stopping the source. - Anonymous SRE
How do I consolidate different data schemas?
Alerts come from multiple sources, and they all use different naming conventions. A field may be labeled "service" by one source and "service_name" by another, but both describe the same thing.
Consistency in naming isn’t just a nice-to-have; it’s a lifeline when you need to make sense of the chaos quickly. - Anonymous SRE
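One way to picture the consolidation work admins are asking for is a simple alias map that folds source-specific field names into one canonical schema. This is a minimal sketch under assumed field names (service_name, svc, hostname), not the product's actual normalization logic.

```python
# Minimal sketch of normalizing alert metadata from different sources.
# The alias map and field names are illustrative, not the product's actual schema.
CANONICAL_FIELDS = {
    "service": "service",
    "service_name": "service",
    "svc": "service",
    "host": "host",
    "hostname": "host",
}

def normalize_alert(alert: dict) -> dict:
    """Remap known field aliases onto one canonical name; pass unknown fields through."""
    return {CANONICAL_FIELDS.get(key, key): value for key, value in alert.items()}

# Two alerts from different sources now share the same "service" key.
print(normalize_alert({"service_name": "checkout", "severity": "critical"}))
print(normalize_alert({"service": "checkout", "hostname": "web-01"}))
```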
Where do I start?
When a user is notified of an incident and doesn't know the root cause, they jump across all of their monitoring tools and spend precious time figuring out which one holds the information needed to uncover the error.
An SRE's job is already tough enough—now add the frustration of chasing down the root cause by hopping between every monitoring tool like a scavenger hunt. - Anonymous SRE
How do I ignore known non-issues?
Sometimes an app needs to run maintenance or deploy a critical update. During this time, the application will briefly go down. As a result, all the alert detectors the team set up will fire and wake the whole team up.
Running maintenance or deploying an update shouldn't feel like an emergency, but when every alert goes off like the system's on fire, it’s hard not to wake up the whole team in a panic—even though it's all part of the plan. - Anonymous SRE
Each failure point became an end-user or admin experience for me to design
These four failure points were critical to solve for our users. We were also under significant pressure to deliver in time for an announcement at our annual conference, and I didn't want design to bottleneck the rest of the team. It was time to start designing. The focus of this case study is failure point 1: alert noise consolidation.
Explore
Enabling alert aggregation
To facilitate incident management through alert noise reduction (alert aggregation), we needed to give admins powerful controls for targeting alerts with specific metadata signals, grouping them into a single incident, and then paging the on-call SRE to investigate and resolve that issue. A rough code sketch of this flow follows the three steps below.
Step 1.
Alert routing

An admin sets filters targeting specific metadata fields or values in alerts. Any metadata indexed by the system is available to filter on. Alerts with matching fields are routed into the policy.
Step 2.
Alert grouping

The system sorts through all the alerts routed into each policy and looks for matching values in admin-specified fields. For example, all routed alerts with the same "service" field value are grouped into a single incident.
Step 3.
Incident automation

Once an incident is created, it is triggered immediately and the on-call team is notified. In addition to notifications, other automated actions such as opening a 3rd party ticket may be useful.
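To make the three steps concrete, here is a minimal Python sketch of the flow, with hypothetical policy filters, field names, and a stand-in notify() action; the real system operates on indexed alert metadata rather than in-memory dictionaries.

```python
from collections import defaultdict

# Minimal sketch of the three-step flow: route alerts by metadata filters,
# group routed alerts by an admin-specified field, then notify per incident.
# Policy shape, field names, and notify() are hypothetical stand-ins.

def route(alerts, filters):
    """Step 1: keep only alerts whose metadata matches every filter condition."""
    return [a for a in alerts if all(a.get(field) == value for field, value in filters.items())]

def group(routed, group_by):
    """Step 2: aggregate routed alerts that share the same value for the chosen field."""
    incidents = defaultdict(list)
    for alert in routed:
        incidents[alert.get(group_by)].append(alert)
    return incidents

def notify(incident_key, alerts):
    """Step 3: trigger automation, e.g. page the on-call SRE or open a ticket."""
    print(f"Incident for {incident_key}: {len(alerts)} alerts aggregated, paging on-call...")

alerts = [
    {"env": "prod", "service": "checkout", "severity": "critical"},
    {"env": "prod", "service": "checkout", "severity": "critical"},
    {"env": "prod", "service": "payments", "severity": "warning"},
]
for key, grouped in group(route(alerts, {"env": "prod"}), "service").items():
    notify(key, grouped)
```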
initial exploration
An overly complex and infeasible start
Before I joined, designers had explored a complex node-based Northstar design for processing alerts and incidents. While exciting for the design team, it lacked stakeholder collaboration and was deemed infeasible due to its complexity and resource constraints. I helped guide the team toward a more practical solution that could meet our delivery timelines.
Low fidelity exploration
Translating the admin journey
I designed wireframes for a multi-step process to set up alert aggregation and automation, adhering to established patterns and components. Collaborating closely with the Design System team, I successfully advocated for new patterns where needed, which were later integrated into the core system.
testing insights
What users told me:
What happens if two policies have similar conditions?
What happens if an alert matches the conditions in multiple policies?
I don't understand how grouping works. I select alerts to be monitored in routing and then I select... less of them to be grouped?
What do we even call this thing?
Our product often struggled with confusing taxonomy. Initially, this admin area was to be called a "service," a term already heavily used in Observability. To address this, I created a clickable prototype from my wireframes and ran unmoderated user testing: participants completed tasks and were then prompted to choose from a list of pre-defined names for this area. This process led to the adoption of "Incident Policy" as the final name.
Design iteration (not quite there yet)
Might this design pattern actually make alerts more noisy?
While designing the lister page for admins' Incident Policies, we realized allowing unlimited filtering controls could result in alerts matching multiple policies, worsening alert storms. To address this, I designed a priority-based pattern where each policy has an inherent order. Admins can easily set which policy takes precedence by reordering the list via drag-and-drop.
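As an illustration of the precedence model, the sketch below assumes first-match-wins semantics: policies are evaluated in the admin's rank order, and an alert is claimed by the highest-priority policy whose filters match. The policy names and fields are made up.

```python
# Minimal sketch of priority-based routing under an assumed first-match-wins rule.
policies = [
    {"name": "Checkout outages", "filters": {"service": "checkout"}},   # rank 1
    {"name": "All prod alerts",  "filters": {"env": "prod"}},           # rank 2 (catch-all)
]

def matching_policy(alert, ranked_policies):
    """Return the highest-ranked policy whose filters all match, so an alert
    feeds exactly one policy instead of fanning out into several incidents."""
    for policy in ranked_policies:
        if all(alert.get(field) == value for field, value in policy["filters"].items()):
            return policy["name"]
    return None

print(matching_policy({"service": "checkout", "env": "prod"}, policies))  # "Checkout outages"
print(matching_policy({"service": "payments", "env": "prod"}, policies))  # "All prod alerts"
```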
testing insights
What users told me:
It’s great to have control over which policy takes precedence for noisy alerts.
It’s hard to drag items when there are too many policies on the screen.
Design iteration (not quite there yet)
How might we simplify alert grouping?
Initially, my alert grouping design mirrored alert routing, requiring admins to set separate, narrower filter conditions. To simplify, I redesigned it so admins could select fields from routed alerts, grouping alerts with matching values into incidents automatically. This allows each Incident Policy to trigger multiple incidents, with similar signals prioritized by the admin.
I added a visualization to show admins the expected impact of their policy by simulating how alert aggregation would have reduced noise during a defined look-back period, highlighting the number of incidents that would be triggered.
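The impact visualization can be thought of as a replay: take the alerts from the look-back window, apply the grouping field, and compare the alert count to the resulting incident count. The sketch below assumes a 24-hour window and an illustrative alert shape; the real simulation runs against indexed alert history.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Sketch of the impact preview: replay alerts from a look-back window and count
# how many incidents the chosen grouping field would have produced.

def simulate_impact(alerts, group_by, lookback=timedelta(hours=24), now=None):
    now = now or datetime.utcnow()
    recent = [a for a in alerts if now - a["timestamp"] <= lookback]
    incidents = defaultdict(list)
    for alert in recent:
        incidents[alert.get(group_by)].append(alert)
    return {"alerts": len(recent), "incidents": len(incidents)}

now = datetime.utcnow()
alerts = [
    {"service": "checkout", "timestamp": now - timedelta(hours=1)},
    {"service": "checkout", "timestamp": now - timedelta(hours=2)},
    {"service": "payments", "timestamp": now - timedelta(hours=3)},
    {"service": "payments", "timestamp": now - timedelta(days=2)},  # outside the window
]
print(simulate_impact(alerts, "service"))  # {'alerts': 3, 'incidents': 2}
```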
testing insights
What users told me:
I like that I can reuse existing fields from alert routing instead of setting new filters again. It’s simpler and faster.
The impact chart showing the reduction from alerts to incidents really helps me understand the benefit of grouping.
Sometimes an alert will resolve itself and send out an info "all clear!" alert. Is there any way to just automatically resolve an incident if the alerts indicate it has resolved itself?
Design
End-to-end incident response management
I iterated on this design for months, meeting daily with stakeholders to gather feedback and refine the core concepts of Incident Response management: selecting alerts to group, grouping them into incidents, and triggering automated actions.
high fidelity Design
Improved stack ranking
My initial design for reordering Incident Policies offered only one mechanism, limiting interaction and discoverability, especially for users with long policy lists. I redesigned the pattern to improve usability, adding drag-and-drop reordering, a stepper for single increments, and a dropdown to jump policies to specific ranks.
high fidelity Design
Robust filtering rules
The routing section remained largely consistent but highlights a successful collaboration with the design system team. Lacking a pattern for complex filtering logic, we co-designed one that debuted on the routing rules page. This page dynamically responds to user input, showing a preview of matching alerts within the defined look-back period as admins update their filters.
high fidelity Design
More control over the incident lifecycle
The grouping section underwent the most change, as the earlier version lacked granular control over the incident lifecycle, risking incidents persisting indefinitely. I introduced controls to auto-resolve incidents based on alert severity changes, define when new incidents should be created, and adjust an incident's severity or status. I also refined the impact visualization so its highlights stand out more clearly.
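For the auto-resolve control specifically, the logic amounts to watching the severities of an incident's member alerts and closing the incident once they have all cleared. The sketch below is a simplified stand-in with assumed severity and status values; the shipped policy exposes more conditions than this.

```python
# Simplified sketch of one lifecycle rule: auto-resolve an incident once every
# alert in it has cleared. Severity/status values are illustrative assumptions.
CLEAR_SEVERITIES = {"info", "ok"}

def auto_resolve(incident):
    """Mark the incident resolved if all of its alerts have dropped to a clear severity."""
    if all(alert["severity"] in CLEAR_SEVERITIES for alert in incident["alerts"]):
        incident["status"] = "resolved"
    return incident

incident = {
    "status": "active",
    "alerts": [
        {"id": "a1", "severity": "ok"},
        {"id": "a2", "severity": "info"},
    ],
}
print(auto_resolve(incident)["status"])  # resolved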
high fidelity Design
Automation rules
Our final design for Incident workflows (automatic actions taken on a triggered incident) better conforms to the established patterns of our product, while exposing effective controls for paging teammates automatically when an incident is triggered.
high fidelity Design
Incident responder experience
This admin experience powers the end-user experience for incident responders. The incident portal consolidates everything defined by the admin configuration, allowing SRE teams to manage investigators, associated alerts, a historical timeline of events, and more.
deliver
Launching Incident Intelligence
Incident Intelligence debuted with a public launch at .Conf, our annual customer conference, generating significant excitement and starting a closed pilot with six large-scale customers. By embedding incident management into our Observability product, we strengthened our case for tool consolidation and expanded sales opportunities.

However, due to shifting priorities, Incident Intelligence features were later spun off into a standalone on-call product, which is still available to customers today.
Post-launch AI enhancements
Smart alert grouping
When Incident Intelligence was spun off, the team was disappointed, but many features I designed proved valuable elsewhere—particularly the end-to-end administration of alert aggregation.

This design Northstar demonstrates how AI can simplify alert grouping. Rather than manually selecting correlating values, admins can now choose alerts to monitor, and AI provides grouping recommendations.
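Since the recommendation model itself isn't described here, the sketch below uses a toy heuristic as a stand-in: score each metadata field by how much it would compress the monitored alerts into fewer incidents, then suggest the strongest candidate. It only illustrates the shape of the interaction, not the ML behind the actual feature.

```python
# Toy stand-in for the grouping recommendation: score each metadata field by how
# much it would compress the monitored alerts into fewer incidents, and suggest
# the best one. Purely illustrative; the real feature uses ML.

def recommend_grouping_field(alerts):
    scores = {}
    fields = {f for a in alerts for f in a}
    for field in fields:
        values = [a[field] for a in alerts if field in a]
        if len(values) == len(alerts):  # only consider fields present on every alert
            scores[field] = len(alerts) / len(set(values))  # average alerts per incident
    return max(scores, key=scores.get) if scores else None

alerts = [
    {"service": "checkout", "host": "web-01"},
    {"service": "checkout", "host": "web-02"},
    {"service": "checkout", "host": "web-03"},
]
print(recommend_grouping_field(alerts))  # "service" compresses 3 alerts into 1 incident
```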
Why not let AI handle the entire incident response process?
Not everyone wants AI to handle all their critical tasks—surprising, I know. During initial user surveys, I found only about 3% of users wanted an AI to tell them what steps to take in an incident investigation. While AI can be valuable in investigations, I believe in a "trust but verify" approach. If your job is ensuring critical applications run smoothly, you'd want to double-check too.

That’s why I design AI/ML-driven features that provide actionable insights that human users can easily validate.
more case studies
Designing for impact
previous case study
User adoption of K8s
View case study
next case study
Public access to Treasury data
View case study