Large-scale automation of Amazon CloudWatch Alarm Cleanup

October 26, 2022

Do you have thousands of Amazon CloudWatch alarms across AWS regions and want to quickly identify the low-value or misconfigured ones? Are you looking for a way to find alarms that have been in the 'ALARM' or 'INSUFFICIENT_DATA' state for days and need to be re-checked? Do you need a cleanup mechanism that reviews low-value alarms in different regions and periodically removes them to optimize alarm costs?


In this article, the authors explore how you can implement a mechanism to clean up low-value CloudWatch alarms at scale across the regions in your AWS account. The mechanism helps customers optimize CloudWatch alarm costs by identifying several types of misconfigured or low-value alarms. Alarms that have remained in the ALARM or INSUFFICIENT_DATA state continuously for several days are classified as stale alarms. Alarms that have no actions and are not referenced by a composite alarm may have a low value. The authors encourage you to review these alarms to confirm they are still useful, and to remove them if you no longer need them.

Implementing the solution

This solution and associated resources are available for deployment in your AWS account as an AWS CloudFormation template.

Prerequisites

The following prerequisites must be met for this guide:


What will the CloudFormation template implement?

The CloudFormation template will deploy the following resources to your AWS account:

  • AWS Identity and Access Management (IAM) role for AWS Lambda
    - CloudWatchAlarmHealthCheckerRole
    - Allows logging to CloudWatch Logs, writing to the S3 bucket, and access to the CloudWatch APIs
  • AWS Lambda function
    - CloudWatchAlarmHealthCheckerpy-<stackname>

How to implement the CloudFormation template

  1. Download the yaml file.
  2. Go to the CloudFormation console in your AWS account.
  3. Select Create Stack.
  4. Select Template is ready, choose Upload a template file, and browse to the yaml file you just downloaded.
  5. Select Next.
  6. Give the stack a name (max length 30 characters) and select Next.
  7. Add tags if required and select Next.
  8. Scroll to Capabilities at the bottom of the screen, tick the box "I acknowledge that AWS CloudFormation might create IAM resources with custom names", and select Create Stack.
  9. Wait for the stack creation to complete.
  10. Go to the Lambda > Functions console.
  11. Select the Lambda function named CloudWatchAlarmHealthCheckerpy-<stackname>.
  12. Scroll down to the Lambda function code section and select Test.
  13. Configure the test event and enter the data below.

Use the sample JSON file below and enter your S3 bucket.

You can change "nodata_days", "stale_days", and "disabled_actions_days" to list suspicious alarms according to your use case or requirements. "outputPath" must be updated to the specific Amazon S3 bucket to which the suspicious-alarms and alarms-to-delete spreadsheets will be sent. If this value is not set, the details will be available in the CloudWatch log group of the Lambda function. "max_iterations" can be configured based on the number of alarms and metrics in the account. "regions" can be configured based on the regions your account uses.
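The parameters described above can be assembled into a test event programmatically. The following is a minimal sketch only: the key names come from this article, but the flat layout of the event, the default values, the "s3://" form of "outputPath", and the helper name build_test_event are all assumptions made for illustration, not the template's confirmed schema.

```python
import json


def build_test_event(output_path, regions, nodata_days=7, stale_days=7,
                     disabled_actions_days=7, max_iterations=100):
    """Assemble a test event from the parameter names used in this article.

    The default values and the flat layout are illustrative assumptions.
    """
    thresholds = {
        "nodata_days": nodata_days,
        "stale_days": stale_days,
        "disabled_actions_days": disabled_actions_days,
        "max_iterations": max_iterations,
    }
    # Reject thresholds that could never match a sensible use case.
    for name, value in thresholds.items():
        if not isinstance(value, int) or value <= 0:
            raise ValueError(f"{name} must be a positive integer, got {value!r}")
    if not regions:
        raise ValueError("regions must list at least one AWS region")
    event = dict(thresholds)
    event["outputPath"] = output_path
    event["regions"] = list(regions)
    return event


# Hypothetical bucket name and regions; replace them with your own.
event = build_test_event("s3://example-bucket/alarm-reports/",
                         ["us-east-1", "eu-west-1"], stale_days=14)
print(json.dumps(event, indent=2))
```

Validating the thresholds before the event reaches the function makes a typo such as a zero or negative day count fail early instead of producing an empty or misleading report.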

When you run this solution in your account, it uploads spreadsheets to the selected S3 bucket for you to review. The spreadsheets list alarms that are ready to be deleted and alarms that are suspicious and require your review.

  • A spreadsheet listing alarms ready for deletion, by region. It contains alarms that are likely safe to delete because they refer to a metric that does not exist, which may mean that the metric is no longer being emitted or was entered incorrectly when the alarm was created.
  • Another spreadsheet listing all suspicious alarms, by region. It contains alarms that are stale or may have a low value.


The Suspicious Alarms spreadsheet lists, for each region in the account, the alarms that:

  • have been in the "ALARM" state for more than "stale_days"
  • have been in the "INSUFFICIENT_DATA" state for more than "nodata_days"
  • are not linked to any action and have no parent composite alarm
  • have had their actions disabled for more than "disabled_actions_days"
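The first three criteria above can be sketched as a small classification function. This is an illustration under stated assumptions, not the code of the deployed Lambda function: it expects alarm dictionaries shaped like the MetricAlarms entries returned by the CloudWatch DescribeAlarms API, the function name suspicious_reasons and the child_alarm_names parameter are hypothetical, and the "disabled_actions_days" check is left out because this article does not describe how the duration of disabled actions is determined.

```python
from datetime import datetime, timedelta, timezone

ACTION_KEYS = ("AlarmActions", "OKActions", "InsufficientDataActions")


def suspicious_reasons(alarm, now, stale_days, nodata_days,
                       child_alarm_names=frozenset()):
    """Return the reasons (possibly none) why an alarm looks suspicious.

    child_alarm_names is the set of alarm names referenced by composite
    alarms; collecting it is outside the scope of this sketch.
    """
    reasons = []
    time_in_state = now - alarm["StateUpdatedTimestamp"]
    if alarm["StateValue"] == "ALARM" and time_in_state > timedelta(days=stale_days):
        reasons.append("in ALARM longer than stale_days")
    if (alarm["StateValue"] == "INSUFFICIENT_DATA"
            and time_in_state > timedelta(days=nodata_days)):
        reasons.append("in INSUFFICIENT_DATA longer than nodata_days")
    # An alarm with no actions and no parent composite alarm notifies nobody.
    has_actions = any(alarm.get(key) for key in ACTION_KEYS)
    if not has_actions and alarm["AlarmName"] not in child_alarm_names:
        reasons.append("no actions and no parent composite alarm")
    return reasons


now = datetime(2022, 10, 26, tzinfo=timezone.utc)
alarm = {
    "AlarmName": "cpu-high",  # hypothetical alarm
    "StateValue": "ALARM",
    "StateUpdatedTimestamp": now - timedelta(days=10),
    "AlarmActions": [],
}
reasons = suspicious_reasons(alarm, now, stale_days=7, nodata_days=7)
print(reasons)
```

Returning a list of reasons rather than a single flag keeps the output reviewable, which matches the article's intent that suspicious alarms are examined by a person before removal.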


"stale_days", "nodata_days", and "disabled_actions_days" can be configured within the CloudFormation template provided in this AWS blog article. The list of suspicious alarms is available for review and removal. Generally, alarms that are not associated with any action and do not have a parent composite alarm are on the suspicious list. Because Amazon EventBridge can still react to the state changes of such alarms, they are placed on the suspicious list for review rather than marked for deletion.

The alarms-to-delete spreadsheet contains alarms of the following kinds:

  • an alarm relating to an invalid namespace or a namespace that does not exist
  • an alarm relating to an unknown metric
  • an alarm relating to a dimension that does not exist for a metric.
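The three checks above can be sketched as a lookup against the set of metrics that actually exist in a region. This is a minimal illustration under stated assumptions: the function names are hypothetical, the input dictionaries follow the shape of CloudWatch ListMetrics and DescribeAlarms results, and only single-metric alarms are handled (metric math alarms are out of scope here). How the deployed Lambda function performs the comparison is not described in this article.

```python
def metric_key(namespace, metric_name, dimensions):
    """Build a hashable identity for a metric from its name and dimensions."""
    dims = frozenset((d["Name"], d["Value"]) for d in dimensions)
    return (namespace, metric_name, dims)


def deletion_reason(alarm, existing_metrics):
    """Return why an alarm's metric cannot be found, or None if it exists.

    existing_metrics is a set of metric_key(...) tuples, for example built
    from the metrics listed for the alarm's region.
    """
    namespaces = {ns for ns, _, _ in existing_metrics}
    metric_names = {(ns, name) for ns, name, _ in existing_metrics}
    ns, name = alarm["Namespace"], alarm["MetricName"]
    if ns not in namespaces:
        return "namespace does not exist"
    if (ns, name) not in metric_names:
        return "unknown metric"
    if metric_key(ns, name, alarm.get("Dimensions", [])) not in existing_metrics:
        return "dimension does not exist for this metric"
    return None


# Hypothetical data: one existing metric and an alarm on a missing instance.
existing = {
    metric_key("AWS/EC2", "CPUUtilization",
               [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}]),
}
orphan = {
    "AlarmName": "cpu-high-old-instance",
    "Namespace": "AWS/EC2",
    "MetricName": "CPUUtilization",
    "Dimensions": [{"Name": "InstanceId", "Value": "i-0fedcba9876543210"}],
}
print(deletion_reason(orphan, existing))
```

Checking the namespace first, then the metric name, then the dimensions yields the most specific explanation for each alarm, mirroring the three cases listed above.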


Costs

There is a cost to using this solution: it stores data in an S3 bucket and runs Lambda code that makes API calls. However, the cost should be minimal. For example, with 100,000 alarms and 300,000 metrics in your account, a run costs less than a few cents.

All pricing details are available on the Amazon S3 and AWS Lambda pages.

Cleanup

If you decide that you no longer want to keep the solution and its associated resources, you can go to CloudFormation in the AWS console, select the stack (by the name you gave it during deployment), and select Delete. All resources will be deleted.

If you want to re-add the cleanup mechanism at any time, you can re-create the stack from the CloudFormation yaml file.

Conclusion

This solution can help you better understand which CloudWatch alarms are low value or obsolete and take action to remove them. You can run it once and review the alarms for deletion, or run it periodically using Amazon EventBridge. Customers can quickly identify and remove obsolete or low-value alarms in different regions and thus save costs.
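Running the function periodically could look roughly like the sketch below, which creates a scheduled EventBridge rule that invokes the Lambda function with a fixed event. It is an illustration, not part of the provided template: the rule name, target ID, schedule, and the helper name schedule_cleanup are assumptions, and the clients are passed in so that you can supply boto3 clients for "events" and "lambda" in the region where the stack is deployed.

```python
import json


def schedule_cleanup(events_client, lambda_client, function_arn, event,
                     rule_name="alarm-cleanup-weekly", schedule="rate(7 days)"):
    """Create a scheduled rule that invokes the function with `event`.

    rule_name and schedule are illustrative defaults, not values taken
    from the CloudFormation template.
    """
    rule_arn = events_client.put_rule(
        Name=rule_name,
        ScheduleExpression=schedule,
        State="ENABLED",
    )["RuleArn"]
    # Allow EventBridge to invoke the function on behalf of this rule only.
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId=f"{rule_name}-invoke",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule_arn,
    )
    # Pass the same JSON used for the manual test as the scheduled input.
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "alarm-cleanup", "Arn": function_arn,
                  "Input": json.dumps(event)}],
    )
    return rule_arn


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   schedule_cleanup(boto3.client("events"), boto3.client("lambda"),
#                    "<ARN of CloudWatchAlarmHealthCheckerpy-<stackname>>", event)
```

Reusing the test event as the rule's input keeps scheduled runs consistent with the manual run you have already reviewed.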
