Explains the engineering pitfalls of alerting on resource utilization metrics instead of user-facing latency and error rates.

It is 3:15 AM. Your phone buzzes with a high-priority PagerDuty incident: CRITICAL: API Server CPU utilization > 95%. You open your laptop, pull up the Grafana dashboard, and notice that while the CPU spiked heavily for 4 minutes, the HTTP 500 error rate remained at 0% and API latency stayed completely flat.
You acknowledge the alert, sigh heavily, and go back to sleep. You just experienced the textbook definition of a terrible alert.
For decades, infrastructure monitoring has relied heavily on utilization metrics like CPU and memory. But in the era of containerized microservices and autoscaling cloud infrastructure, alerting on resource consumption is an anti-pattern. This post explores why.
To understand why we still set CPU threshold alerts, we have to look back at the era of bare-metal servers. When you had a single database server sitting in a rack, a CPU hitting 99% meant that the machine had zero compute headroom left. If traffic increased even slightly, requests queued, latency climbed, and there was no orchestration layer to add capacity; saturation genuinely was an emergency.
Modern cloud architecture operates differently. We explicitly deploy autoscaling groups to maximize resource utilization and reduce cloud computing costs. If a node consistently runs at 40% CPU, you are over-provisioned. You actually want nodes operating efficiently near their limits, allowing horizontal scaling to handle overflow.
When your orchestration layer handles the influx, high CPU is a sign of a healthy, cost-effective system—not an emergency.
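To make that concrete, here is a rough Python sketch of the proportional scaling rule most horizontal autoscalers (Kubernetes' HPA among them) apply; the 70% target utilization is an assumed example value, not a recommendation.

```python
import math

def desired_replicas(current_replicas: int,
                     current_cpu_utilization: float,
                     target_utilization: float = 0.70) -> int:
    """Proportional scale-out: grow the replica count in step with how far
    observed utilization sits above the target."""
    if current_cpu_utilization <= 0:
        return current_replicas
    return max(1, math.ceil(
        current_replicas * current_cpu_utilization / target_utilization
    ))

# Four replicas running hot at 95% CPU against a 70% target scale out to six.
print(desired_replicas(4, 0.95))  # -> 6
```

Under this regime, a node sitting at 95% CPU is not a fire; it is the trigger that makes the scaler do its job.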
When designing an alert, you must differentiate between the cause of an issue and the symptom that users experience.
A high memory spike is a cause (or an underlying state). A user getting a 502 Bad Gateway is a symptom (the actual pain).

If a background cron job runs and consumes 100% of a node's CPU for two minutes while parsing a large file, but it runs on a dedicated background worker queue, the user experiences exactly zero performance degradation. Paging an engineer for this creates pure noise.
Conversely, a deadlock in your database might only consume 5% of the database's CPU, but it halts all user transactions. If you only alert on CPU, you'll completely miss the critical outage.
Let's examine three common scenarios where resource utilization spikes trigger false positives:
In languages like Java or Go, intermittent memory spikes are expected as objects are allocated before a GC pause cleans them up. Triggering memory alerts based on these sawtooth waveforms is notoriously flaky.
A nightly database backup or log rotation naturally requires intense disk I/O and CPU. Unless it prevents primary application functions, it does not warrant an alert.
A sudden influx of connections will immediately tax the CPU as TLS handshakes are negotiated and connection pools warm up. As long as the application autoscales within a few minutes, the brief saturation is standard operating procedure.
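All three scenarios share the same shape: a short-lived spike that resolves on its own. A minimal sketch, with purely illustrative thresholds and window sizes, shows why an instantaneous threshold pages on every one of them while a sustained-duration rule stays quiet:

```python
def fires_instantaneous(cpu_samples: list[float], threshold: float = 0.90) -> bool:
    # Naive rule: page the moment any single sample crosses the line.
    return any(s > threshold for s in cpu_samples)

def fires_sustained(cpu_samples: list[float], threshold: float = 0.90,
                    for_samples: int = 10) -> bool:
    # Page only if the last `for_samples` readings ALL stayed above the line,
    # i.e. the saturation persisted instead of resolving itself.
    return len(cpu_samples) >= for_samples and all(
        s > threshold for s in cpu_samples[-for_samples:]
    )

# One-minute CPU samples during a GC pause, nightly backup, or traffic burst:
spike = [0.42, 0.55, 0.97, 0.99, 0.98, 0.61, 0.44, 0.40, 0.39, 0.41]
print(fires_instantaneous(spike))  # True  -> a 3:15 AM page for nothing
print(fires_sustained(spike))      # False -> the spike resolved; let the team sleep
```

Even a sustained rule on CPU, though, still only tells you the machine is busy. The better fix is to change what you measure, not just how long you measure it.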
Paging alerts should be exclusively reserved for the "Golden Signals": Latency, Traffic, Errors, and Saturation.
Instead of: CPU > 90%
Alert on: P99 Latency > 1500ms for 5m
If CPU hits 99% but latency safely stays under your 1500ms threshold, let the team sleep.
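As a rough sketch of what that evaluation looks like (the 1500 ms / 5 m values mirror the rule above; the per-minute P99 inputs are assumed to come from your metrics pipeline):

```python
import math

def p99(latencies_ms: list[float]) -> float:
    """Nearest-rank 99th percentile of the request latencies in one minute."""
    ordered = sorted(latencies_ms)
    rank = max(0, math.ceil(0.99 * len(ordered)) - 1)
    return ordered[rank]

def latency_breach(per_minute_p99s_ms: list[float],
                   threshold_ms: float = 1500.0) -> bool:
    # Page only if every one-minute P99 in the last five minutes stayed above the threshold.
    return len(per_minute_p99s_ms) >= 5 and all(
        p > threshold_ms for p in per_minute_p99s_ms[-5:]
    )

# CPU may be pinned at 99%, but these P99s never cross 1500 ms, so nobody gets paged.
print(latency_breach([820.0, 900.0, 1130.0, 1040.0, 870.0]))  # False
```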
Instead of: Memory > 85%
Alert on: HTTP 5xx Error Rate > 2%
If a memory leak eventually causes Pod reboots resulting in dropped requests, the 5xx error rate alert will catch the symptom and accurately page the team.
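The error-rate side is just as simple to sketch, assuming you can count total requests and 5xx responses over the evaluation window (the 2% threshold comes from the rule above):

```python
def error_rate(total_requests: int, server_errors: int) -> float:
    """Fraction of requests in the window that returned an HTTP 5xx."""
    return 0.0 if total_requests == 0 else server_errors / total_requests

def error_rate_breach(total_requests: int, server_errors: int,
                      threshold: float = 0.02) -> bool:
    # A leaking pod that reboots and drops requests shows up here as real user pain.
    return error_rate(total_requests, server_errors) > threshold

print(error_rate_breach(total_requests=48_000, server_errors=1_300))  # True: ~2.7% of requests failed
```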
CPU and memory metrics are not useless; they are simply not pager-worthy.
These metrics belong in two places: on dashboards, where they provide context while you debug a symptom-based page, and in low-urgency tickets that feed capacity planning during business hours.
Stop building on-call schedules around infrastructure health. Build them around user health. By eliminating raw hardware thresholds and committing to symptom-based latency and error alerting, you reduce alert fatigue and restore trust in the alerts that do fire.
Transitioning requires reliable external telemetry. Platforms like Heimdall monitor exactly what the user experiences, alerting on real HTTP latencies and DNS resolution failures, which lets teams safely retire the noisy CPU threshold rules inside their clusters.
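The probing idea itself is easy to sketch. This is not Heimdall's API, just a generic illustration of measuring from outside the cluster: time DNS resolution and the full HTTP request separately, the way a user would experience them.

```python
import socket
import time
import urllib.request

def probe(url: str, hostname: str) -> dict:
    """Measure what an outside user sees: DNS resolution time and full HTTP latency."""
    t0 = time.monotonic()
    socket.getaddrinfo(hostname, 443)  # does the name resolve at all, and how fast?
    dns_ms = (time.monotonic() - t0) * 1000

    t1 = time.monotonic()
    status = urllib.request.urlopen(url, timeout=10).status
    http_ms = (time.monotonic() - t1) * 1000

    return {"dns_ms": round(dns_ms, 1), "http_ms": round(http_ms, 1), "status": status}

# Example: probe("https://example.com/health", "example.com")
```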