A practical step-by-step guide to exporting and analyzing past alert data to identify top noisy systems, flapping monitors, and frequently silenced alerts.

You cannot fix what you cannot measure. If your engineering team is suffering from alert fatigue, simply "guessing" which monitors to delete is a recipe for an eventual blind-spot outage. To systematically reclaim your sleep, you must treat your incident routing system as a data source.
Every alert sent to PagerDuty, Opsgenie, or VictorOps leaves a trail: when it fired, who acknowledged it, how quickly it was resolved, and whether it was escalated. By applying a data-driven audit process to this history, you can pinpoint the few bad actors generating the vast majority of the noise. This guide provides a step-by-step framework to identify and silence them.
In almost every infrastructure footprint, the 80/20 rule (the Pareto principle) governs observability: roughly 80% of your meaningless alerts are generated by just 20% of your monitors.
These offenders often hide in plain sight. They are the flaky database backup job that triggers a warning every night. They are the aggressively tuned HTTP check that fails during micro-deployments. Because they are individually acknowledged and dismissed quickly, they feel like minor annoyances. Only in aggregate does their true cost become apparent: engineering toil and normalized deviance.
Start by exporting the last 60 to 90 days of incident data from your incident management platform. Look for CSV/JSON exports that include, at minimum, the alert or monitor name, the creation timestamp, the acknowledgment timestamp, the resolution timestamp, and whether the incident was escalated.
Load the export into a spreadsheet or a Jupyter notebook. Group identical alerts together (using a regex to strip dynamic IDs such as pod names), then count the total occurrences of each group.
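If you are working in pandas, a minimal sketch of this step might look like the following. The file name and column names (title, created_at, acknowledged_at, resolved_at) are assumptions; adjust them to whatever your platform's export actually contains.

```python
import re
import pandas as pd

# Load the export. Column names below are assumptions -- match them
# to your platform's actual CSV schema.
df = pd.read_csv(
    "alerts_export.csv",
    parse_dates=["created_at", "acknowledged_at", "resolved_at"],
)

def normalize(title: str) -> str:
    """Strip dynamic fragments so identical alerts group together."""
    title = re.sub(r"[0-9a-f]{8,}", "<id>", title)    # hex hashes / pod suffixes
    title = re.sub(r"\b\d+(\.\d+)?\b", "<n>", title)  # bare numbers, IPs, percentages
    return title.strip()

df["alert_group"] = df["title"].astype(str).map(normalize)
counts = df["alert_group"].value_counts()  # sorted descending by default
print(counts.head(20))
```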
Look at the top five highest-volume alerts. If an alert constitutes more than 5% of your total weekly volume and mostly resolves without code deploys or rollbacks, disable it. It is too noisy to be actionable.
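Continuing with the counts series from the sketch above, you can compute each group's share of total volume, flag anything over the 5% threshold, and sanity-check the Pareto claim from earlier against your own data:

```python
# Share of total alert volume per group (over the whole export window).
share = counts / counts.sum()

# Deletion candidates per the 5% rule of thumb above.
print(share[share > 0.05])

# Pareto check: how few alert groups produce 80% of all alerts?
cumulative = share.cumsum()
n_for_80 = int((cumulative <= 0.80).sum()) + 1
print(f"{n_for_80} of {len(counts)} alert groups generate 80% of the volume")
```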

During your audit, you will likely encounter these specific profiles of bad monitoring:
Detection: Subtract the creation timestamp from the resolution timestamp. If the resulting duration routinely averages under 3 minutes (with no human intervention), the alert is "flapping": it clears itself before anyone can act on it.
Solution: Add an evaluation delay. In Prometheus, raise the alerting rule's for: parameter from for: 1m to for: 5m, so the condition must hold for five minutes before the alert fires, absorbing transient blips.
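Here is a sketch of the detection step, reusing the DataFrame and assumed column names from the earlier snippet. "No human intervention" is approximated as incidents that were never acknowledged, and the 10-occurrence floor is an arbitrary cutoff worth tuning:

```python
# Time-to-resolve in minutes for every incident.
df["ttr_minutes"] = (df["resolved_at"] - df["created_at"]).dt.total_seconds() / 60

# Approximate "no human intervention" as never-acknowledged incidents.
auto_resolved = df[df["acknowledged_at"].isna()]

stats = auto_resolved.groupby("alert_group")["ttr_minutes"].agg(["count", "mean"])
flappers = stats[(stats["count"] >= 10) & (stats["mean"] < 3)]
print(flappers.sort_values("count", ascending=False))
```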
Detection: Look at the Mean Time To Acknowledge (MTTA). If a specific warning frequently sits unacknowledged for over 45 minutes, the team subconsciously knows it isn't crucial.
Solution: Downgrade its severity. Route it to a daily Slack digest channel instead of an SMS paging workflow.
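The same DataFrame supports this check; a sketch, again under the earlier column-name assumptions:

```python
# Mean time to acknowledge (MTTA) in minutes, per alert group.
df["mtta_minutes"] = (df["acknowledged_at"] - df["created_at"]).dt.total_seconds() / 60

mtta = (
    df.dropna(subset=["acknowledged_at"])  # only incidents someone actually acked
    .groupby("alert_group")["mtta_minutes"]
    .mean()
)

# Alerts that routinely sit for 45+ minutes: downgrade to a digest.
print(mtta[mtta > 45].sort_values(ascending=False))
```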
Engineers often fear deleting noisy legacy monitors because they lack context ("What if Bob set this up for a reason?"). For these edge cases, implement a safe "Delete and Wait" protocol: instead of removing the monitor outright, silence it for a fixed window (30 days, for example), note the change, and delete it permanently only if no incident surfaces and nobody asks where it went.
An audit is not a one-time operation. Entropy ensures that new alerts will slowly begin generating noise as infrastructure grows.
Establish a Monthly Alert Review cadence: pull a fresh export, re-run the volume and duration analysis above, and retire or retune whichever alerts have climbed into the top offenders.
Tag everything for data grouping. Ensure your payloads explicitly label environments (env: production) and services (service: payments). This allows you to pivot your audit data effectively to see if a particular microservice is disproportionately burning out the team.
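Assuming those labels survive into the export as env and service columns, a quick pivot makes the per-service burden obvious:

```python
# Alert volume per service in production; "env" and "service" columns
# are assumptions -- they exist only if your payloads carry those tags.
by_service = (
    df[df["env"] == "production"]
    .pivot_table(index="service", values="alert_group", aggfunc="count")
    .rename(columns={"alert_group": "alert_count"})
    .sort_values("alert_count", ascending=False)
)
print(by_service)
```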
Cleaning up alert history is one of the highest-leverage toil reduction tasks an engineering team can perform. By systematically silencing flapping alerts, downgrading non-critical warnings, and deleting the top offenders, you can dramatically improve the mental health of your on-call responders.
External observability tools can enhance this visibility. Heimdall, for example, natively tracks historical performance and uptime metrics across external endpoints, allowing teams to query and analyze genuine downtime patterns separately from their noisy internal cluster telemetry.
Join thousands of teams who rely on Heimdall to keep their websites and APIs online 24/7. Get started with our free plan today.