Ethan Walker

Ethan Walker in Monitoring Strategy & DevOps

Preventing Alert Storms with Grouping and Deduplication

A technical walkthrough on configuring alert managers and routing rules to condense hundreds of failing underlying service checks into a single incident context.

May 4 • 7 min read

Ethan Walker in Monitoring Strategy & DevOps

Defining SLIs and SLOs That Trigger Meaningful Pages

Transition from theoretical SLOs to practical burn-rate alerts that only wake up the on-call engineer when user experience is actively deteriorating.

May 4 • 7 min read

Ethan Walker in Monitoring Strategy & DevOps

Why CPU and Memory Spikes Make Terrible Alerts

Explains the engineering pitfalls of alerting on resource utilization metrics instead of user-facing latency and error rates.

May 1 • 7 min read

Ethan Walker in Monitoring Strategy & DevOps

The Complete Guide to Beating Alert Fatigue and Fixing On-Call

A comprehensive deep dive into auditing noisy alerts, defining actionable service level objectives (SLOs), and transitioning to symptom-based alerting.

May 1 • 9 min read

Ethan Walker in Reliability & Uptime

What Actually Causes Downtime in Modern Web Applications

Downtime in modern web applications is rarely caused by a single failure. In practice, outages usually happen because several small issues align across multiple layers.

Feb 28 • 3 min read