A practical step-by-step guide to exporting and analyzing past alert data to identify top noisy systems, flapping monitors, and frequently silenced alerts.

You cannot fix what you cannot measure. If your engineering team is suffering from alert fatigue, simply "guessing" which monitors to delete is a recipe for an eventual blind-spot outage. To systematically reclaim your sleep, you must treat your incident routing system as a data source.
Every alert sent to PagerDuty, Opsgenie, or VictorOps leaves a trail: when it fired, who acknowledged it, how quickly it was resolved, and whether it was escalated. By applying a data-driven audit process to this history, you can pinpoint the few bad actors generating the vast majority of the noise. This guide provides a step-by-step framework to identify and silence them.
In almost every infrastructure footprint, the 80/20 rule (the Pareto principle) governs observability: roughly 80% of your meaningless alerts are generated by just 20% of your monitors.
These offenders often hide in plain sight. They are the flaky database backup job that triggers a warning every night. They are the aggressively tuned HTTP check that fails during micro-deployments. Because they are individually acknowledged and dismissed quickly, they feel like minor annoyances. Only in aggregate does their true cost become apparent: engineering toil and normalized deviance.
Start by exporting the last 60 to 90 days of incident data from your incident management platform. Look for CSV/JSON exports that include, at minimum, the alert or monitor name, the creation timestamp, the acknowledgment timestamp, the resolution timestamp, and whether the incident was escalated.
Load the export into a spreadsheet or a Jupyter notebook. Group identical alerts together (using a regex to strip dynamic IDs such as pod names), then count the total occurrences of each group.
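If you are working in pandas, a minimal sketch of this step might look like the following. The file name and column names (title, created_at, acknowledged_at, resolved_at) are assumptions; adjust them to whatever your platform's export actually contains.

```python
import re
import pandas as pd

# Load the export. Column names below are assumptions -- match them
# to your platform's actual CSV schema.
df = pd.read_csv(
    "alerts_export.csv",
    parse_dates=["created_at", "acknowledged_at", "resolved_at"],
)

def normalize(title: str) -> str:
    """Strip dynamic fragments so identical alerts group together."""
    title = re.sub(r"[0-9a-f]{8,}", "<id>", title)    # hex hashes / pod suffixes
    title = re.sub(r"\b\d+(\.\d+)?\b", "<n>", title)  # bare numbers, IPs, percentages
    return title.strip()

df["alert_group"] = df["title"].astype(str).map(normalize)
counts = df["alert_group"].value_counts()  # sorted descending by default
print(counts.head(20))
```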
Look at the top five highest-volume alerts. If an alert constitutes more than 5% of your total weekly volume and mostly resolves without code deploys or rollbacks, disable it. It is too noisy to be actionable.
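Continuing with the counts series from the sketch above, you can compute each group's share of total volume, flag anything over the 5% threshold, and sanity-check the Pareto claim from earlier against your own data:

```python
# Share of total alert volume per group (over the whole export window).
share = counts / counts.sum()

# Deletion candidates per the 5% rule of thumb above.
print(share[share > 0.05])

# Pareto check: how few alert groups produce 80% of all alerts?
cumulative = share.cumsum()
n_for_80 = int((cumulative <= 0.80).sum()) + 1
print(f"{n_for_80} of {len(counts)} alert groups generate 80% of the volume")
```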

During your audit, you will likely encounter these specific profiles of bad monitoring:
Detection: Subtract the creation timestamp from the resolution timestamp. If the resulting duration routinely averages under 3 minutes (with no human intervention), the alert is "flapping": it clears itself before anyone can act on it.
Solution: Add an evaluation delay. In Prometheus, raise the alerting rule's for: parameter from for: 1m to for: 5m, so the condition must hold for five minutes before the alert fires, absorbing transient blips.
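Here is a sketch of the detection step, reusing the DataFrame and assumed column names from the earlier snippet. "No human intervention" is approximated as incidents that were never acknowledged, and the 10-occurrence floor is an arbitrary cutoff worth tuning:

```python
# Time-to-resolve in minutes for every incident.
df["ttr_minutes"] = (df["resolved_at"] - df["created_at"]).dt.total_seconds() / 60

# Approximate "no human intervention" as never-acknowledged incidents.
auto_resolved = df[df["acknowledged_at"].isna()]

stats = auto_resolved.groupby("alert_group")["ttr_minutes"].agg(["count", "mean"])
flappers = stats[(stats["count"] >= 10) & (stats["mean"] < 3)]
print(flappers.sort_values("count", ascending=False))
```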
Detection: Look at the Mean Time To Acknowledge (MTTA). If a specific warning frequently sits unacknowledged for over 45 minutes, the team subconsciously knows it isn't crucial.
Solution: Downgrade its severity. Route it to a daily Slack digest channel instead of an SMS paging workflow.
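The same DataFrame supports this check; a sketch, again under the earlier column-name assumptions:

```python
# Mean time to acknowledge (MTTA) in minutes, per alert group.
df["mtta_minutes"] = (df["acknowledged_at"] - df["created_at"]).dt.total_seconds() / 60

mtta = (
    df.dropna(subset=["acknowledged_at"])  # only incidents someone actually acked
    .groupby("alert_group")["mtta_minutes"]
    .mean()
)

# Alerts that routinely sit for 45+ minutes: downgrade to a digest.
print(mtta[mtta > 45].sort_values(ascending=False))
```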
Engineers often fear deleting noisy legacy monitors because they lack context ("What if Bob set this up for a reason?"). For these edge cases, implement a safe "Delete and Wait" protocol: instead of removing the monitor outright, silence it for a fixed window (30 days, for example), note the change, and delete it permanently only if no incident surfaces and nobody asks where it went.
An audit is not a one-time operation. Entropy ensures that new alerts will slowly begin generating noise as infrastructure grows.
Establish a Monthly Alert Review cadence: pull a fresh export, re-run the volume and duration analysis above, and retire or retune whichever alerts have climbed into the top offenders.
Tag everything for data grouping. Ensure your payloads explicitly label environments (env: production) and services (service: payments). This allows you to pivot your audit data effectively to see if a particular microservice is disproportionately burning out the team.
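Assuming those labels survive into the export as env and service columns, a quick pivot makes the per-service burden obvious:

```python
# Alert volume per service in production; "env" and "service" columns
# are assumptions -- they exist only if your payloads carry those tags.
by_service = (
    df[df["env"] == "production"]
    .pivot_table(index="service", values="alert_group", aggfunc="count")
    .rename(columns={"alert_group": "alert_count"})
    .sort_values("alert_count", ascending=False)
)
print(by_service)
```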
Cleaning up alert history is one of the highest-leverage toil reduction tasks an engineering team can perform. By systematically silencing flapping alerts, downgrading non-critical warnings, and deleting the top offenders, you can dramatically improve the mental health of your on-call responders.
External observability tools can enhance this visibility. Heimdall, for example, natively tracks historical performance and uptime metrics across external endpoints, allowing teams to query and analyze genuine downtime patterns separately from their noisy internal cluster telemetry.
Join thousands of teams who rely on Heimdall to keep their websites and APIs online 24/7. Get started with our free plan today.