From 2021 to 2023, I led design for a new Observability feature that streamlines incident response for on-call SREs. The solution aggregated related alerts into single incidents, which improved notification and investigation, reduced MTTX, and created a cohesive user experience across the suite.
SRE managers rely on alerts to detect performance issues across their tech stack, but investigating and resolving those issues can be complex and noisy, and often requires coordination. To help, admins must ensure alerts are clear, consistent, and contextually relevant, often by aggregating noisy alerts into incidents or muting alerts from expected outages such as maintenance.
The hardest part of being on-call is not the work itself, but the constant anticipation — knowing that at any moment, your peace can be shattered by someone else's emergency. - Anonymous SRE
I joined the team after the initial research and prioritization were complete, so I focused on quickly understanding users and use cases in order to deliver designs for critical incident response features. My work centered on the admin experience, where I defined key features such as alert aggregation, enrichment, and maintenance policies.