Publishing ML-based Operational Insights from Amazon DevOps Guru to the Datadog Event Stream
Amazon DevOps Guru is a fully managed AIOps service that uses machine learning (ML) to quickly identify when applications are behaving outside of their normal operating patterns while generating insights into their findings. These insights generated by DevOps Guru can be used to alert on-call teams to respond to anomalies in critical business workloads. If you are already using Datadog to automate infrastructure monitoring, application performance monitoring and log management to observe your entire technology stack in real-time, this article is for you.
You may already be using the Datadog Events interface as a consolidated view to search, analyse and filter events from many different sources in one place. Datadog events are records of significant changes relevant to IT operations management and troubleshooting, such as code deployments, service status, configuration changes and monitoring alerts.
Whenever DevOps Guru detects operational events in your AWS environment that could lead to downtime, it generates insights and recommendations. These insights and recommendations are then pushed to a user-specific Datadog endpoint using the Datadog Events API. You can then create dashboards, monitors and alerts, or take automated corrective action in Datadog based on them.
Datadog collects and unifies all streaming data from these complex environments, integrating it with a single click to pull metrics and tags from over 90 AWS services. Businesses can deploy the Datadog agent directly on their hosts and compute instances to collect metrics with greater detail - down to the nearest second. And with Datadog's off-the-shelf integration dashboards, companies gain not only an overall view of the health of their infrastructure and applications, but also deeper insights into individual services such as AWS Lambda and Amazon EKS.
This article shows how to use Amazon DevOps Guru with Datadog to gain real-time insights and recommendations for your AWS infrastructure. It demonstrates how the insights that Amazon DevOps Guru generates for anomalies can be published automatically to the Datadog event stream, where they can be used to build dashboards, create alerts and trigger corrective action.
Overview of the solution
When an Amazon DevOps Guru insight is created, an Amazon EventBridge rule captures the insight as an event and routes it to an AWS Lambda function target. The Lambda function interacts with Datadog via the REST API to publish the relevant DevOps Guru events captured by Amazon EventBridge.
The EventBridge rule can be customised to capture all DevOps Guru insights or narrowed down to specific ones. In this article, the rule captures all DevOps Guru insights, and actions are performed in Datadog for the following DevOps Guru events:
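To make the flow above concrete, here is a minimal Python sketch of the translation step: it turns an EventBridge event into an HTTP request for the Datadog Events API (`POST /api/v1/events`, authenticated with the `DD-API-KEY` header). It is an illustration only, not the connector's actual code, which targets the Java runtime. The function name, the tag names, the `detail` field names (`insightId`, `insightDescription`, `insightSeverity`) and the US1 site URL are assumptions; adjust them to your account and to the events you actually receive.

```python
import json
import urllib.request

# Assumption: the Datadog US1 site. Other Datadog sites use a different host.
DATADOG_EVENTS_URL = "https://api.datadoghq.com/api/v1/events"


def build_datadog_event_request(eventbridge_event, api_key):
    """Build (but do not send) a Datadog Events API request for one event."""
    detail = eventbridge_event.get("detail", {})
    # Assumed field names inside "detail"; missing fields fall back to defaults.
    insight_id = detail.get("insightId", "unknown")
    tags = ["source:devops-guru", f"insight_id:{insight_id}"]
    severity = detail.get("insightSeverity")
    if severity:
        tags.append(f"severity:{severity}")

    payload = {
        "title": eventbridge_event.get("detail-type", "DevOps Guru event"),
        "text": detail.get("insightDescription", ""),
        # Sharing one key per insight lets Datadog group related events.
        "aggregation_key": insight_id,
        "tags": tags,
    }
    return urllib.request.Request(
        DATADOG_EVENTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "DD-API-KEY": api_key},
        method="POST",
    )


if __name__ == "__main__":
    sample = {
        "detail-type": "DevOps Guru New Insight Open",
        "detail": {"insightId": "EXAMPLE-ID", "insightSeverity": "high"},
    }
    request = build_datadog_event_request(sample, api_key="example-api-key")
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Passing the returned request to `urllib.request.urlopen` would actually send it. The sketch stops short of that so it can be run without credentials.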
- DevOps Guru New Insight Open
- DevOps Guru New Anomaly Association
- DevOps Guru Insight Severity Upgraded
- DevOps Guru New Recommendation Created
- DevOps Guru Insight Closed
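As a rough illustration, the event pattern for a rule that captures these five event types could look like the dictionary below. The `matches` helper is a deliberately simplified stand-in for EventBridge's matching logic (exact matches on top-level fields only) and exists just to show how narrowing the pattern changes what is captured. The variable and function names are hypothetical, and the pattern in the connector's own template may differ.

```python
import json

# Event pattern for the five DevOps Guru event types listed above.
DEVOPS_GURU_EVENT_PATTERN = {
    "source": ["aws.devops-guru"],
    "detail-type": [
        "DevOps Guru New Insight Open",
        "DevOps Guru New Anomaly Association",
        "DevOps Guru Insight Severity Upgraded",
        "DevOps Guru New Recommendation Created",
        "DevOps Guru Insight Closed",
    ],
}


def matches(pattern, event):
    """Simplified matcher: each pattern key must list the event's value.

    Real EventBridge matching also supports nested fields, prefixes and
    other operators that this sketch ignores.
    """
    return all(event.get(key) in allowed for key, allowed in pattern.items())


if __name__ == "__main__":
    print(json.dumps(DEVOPS_GURU_EVENT_PATTERN, indent=2))
    closed = {"source": "aws.devops-guru", "detail-type": "DevOps Guru Insight Closed"}
    print(matches(DEVOPS_GURU_EVENT_PATTERN, closed))
```

To narrow the rule to specific events, shorten the `detail-type` list, for example to only the "New Insight Open" and "Insight Closed" entries.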
Solution implementation steps
Prerequisites
Before deploying the solution, complete the following steps.
Set up your Datadog account: Link your AWS account to Datadog. If you do not have a Datadog account, you can request a free trial developer instance via Datadog.
Datadog credentials: Gather the Datadog key credentials that will be used to connect to AWS. Follow the steps below to create an API key and an application key.
Add an API key or client token:
- Go to your organization settings, then click API Keys or Client Tokens.
- Click New Key or New Client Token, depending on which one you are creating.
- Enter the name of your key or token.
- Click Create API key or Create Client Token.
- Make a note of the newly generated API key value. You will need this in later steps.
Add application keys:
To add a Datadog application key, go to Organization Settings > Application Keys. If you have permission to create application keys, select New Key. Make a note of the newly generated application key. You will need this in later steps.
Add the application key and API key to AWS Secrets Manager: Secrets Manager allows you to replace hard-coded credentials in your code, including passwords, with an API call to Secrets Manager to retrieve the secret programmatically. This helps ensure that the secret cannot be compromised by someone examining your code, as the secret no longer exists in the code. Follow the steps below to create a new secret in AWS Secrets Manager.
- Open the Secrets Manager console at https://console.aws.amazon.com/secretsmanager/.
- Select Store a new secret.
- On the Choose secret type page, do the following:
  - For Secret type, select Other type of secret.
  - Under Key/value pairs, enter your Datadog API key and application key as key/value pairs.
- Click Next, enter "DatadogSecretManager" as the secret name, then review and finish.
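Once the secret exists, code can fetch it at run time instead of embedding the keys. The sketch below shows the idea in Python under a few assumptions: the helper name `load_datadog_keys` is hypothetical, the key names `api_key` and `app_key` are placeholders for whatever names you entered in the console, and a small fake client stands in for `boto3.client("secretsmanager")` so that the example runs without AWS credentials. The only AWS call it relies on is `get_secret_value`, which returns the stored key/value pairs as a JSON string in `SecretString`.

```python
import json

SECRET_NAME = "DatadogSecretManager"  # the secret name chosen in the steps above


def load_datadog_keys(secrets_client, secret_id=SECRET_NAME):
    """Fetch the secret and return its key/value pairs as a dict."""
    response = secrets_client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])


class FakeSecretsClient:
    """Offline stand-in for boto3.client("secretsmanager")."""

    def get_secret_value(self, SecretId):
        # Hypothetical key names and values; use the names you stored.
        secret = {"api_key": "example-api-key", "app_key": "example-app-key"}
        return {"Name": SecretId, "SecretString": json.dumps(secret)}


if __name__ == "__main__":
    keys = load_datadog_keys(FakeSecretsClient())
    # Print only the key names, never the secret values themselves.
    print(sorted(keys))
```

In a real function you would pass `boto3.client("secretsmanager")` instead of the fake client, and the function's IAM role would need permission to read the secret.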
- Follow these steps to deploy a sample serverless application and enable DevOps Guru for your applications. Then, you can generate DevOps Guru insights on anomalies detected in your application.
- AWS Cloud9 is recommended for setting up the environment, as the AWS Serverless Application Model (SAM) CLI and AWS Command Line Interface (CLI) are pre-installed and can be accessed from a bash terminal.
- Install and configure the SAM CLI - Install the SAM CLI.
- Download and set up Java. The version should match the runtime defined for the serverless function in the SAM template.yaml - Install the Java SE Development Kit 11.
- Maven - Install Maven
Option 1: Deploy the Datadog Connector application from the AWS Serverless Application Repository
The DevOps Guru Datadog Connector application is available in the AWS Serverless Application Repository, a managed repository for serverless applications. The application is packaged with an AWS Serverless Application Model (SAM) template, a definition of the AWS resources used, and a link to the source code. Follow the steps below to quickly deploy this serverless application to your AWS account.
- Log in to the AWS Management Console of the account where you plan to deploy this solution.
- Navigate to the DevOps Guru Datadog Connector application in the AWS Serverless Application Repository and click "Deploy."
- The Lambda deployment screen will appear, where you can enter the name of the Datadog application to deploy ML-based
Upon successful deployment, the AWS Lambda application page will display a 'Create complete' status for the serverlessrepo-DevOps-Guru-Datadog-Connector application. The CloudFormation template creates four resources:
- A Lambda function that has the logic to integrate with Datadog
- EventBridge rule for DevOps Guru Insights
- Lambda permission
- IAM role
Now skip option 2 and follow the steps described in the 'Test the solution' section to trigger some DevOps Guru insights/recommendations and check that events are created and updated in Datadog.
Option 2: Build and deploy a sample Datadog Connector application using the AWS SAM command-line interface
As seen above, you can directly deploy the sample serverless application from the serverless repository with a single click. Alternatively, you can clone the source GitHub repository and deploy it using the SAM CLI from your terminal.
The Serverless Application Model Command Line Interface (SAM CLI) is an extension to the AWS CLI that adds functionality for developing and testing serverless applications. The CLI provides commands that allow you to check that AWS SAM template files are written to specification, invoke Lambda functions locally, debug Lambda functions step-by-step, package and deploy serverless applications to the AWS cloud, and more. For detailed information on using the AWS SAM CLI, including a complete description of the AWS SAM CLI commands, see AWS SAM reference - AWS Serverless Application Model.
Before you go any further, complete the prerequisites section at the start, which should configure the AWS SAM CLI, Maven and Java on your local terminal. You must also install and configure Docker to run your functions in an Amazon Linux environment that matches Lambda.
Clone the source code from the GitHub repository.
git clone https://github.com/aws-samples/amazon-devops-guru-connector-datadog.git
Build a sample application using the SAM CLI.
$ cd amazon-devops-guru-connector-datadog/DatadogFunctions
$ sam build
Building codeuri: $\amazon-devops-guru-connector-datadog\DatadogFunctions\Functions runtime: java11 metadata: {} architecture: x86_64 functions: Functions
Running JavaMavenWorkflow:CopySource
Running JavaMavenWorkflow:MavenBuild
Running JavaMavenWorkflow:MavenCopyDependency
Running JavaMavenWorkflow:MavenCopyArtifacts
Build Succeeded
Built Artifacts : .aws-sam\build
Built Template : .aws-sam\build\template.yaml
Commands you can use next
=========================
[*] Validate SAM template: sam validate
[*] Invoke Function: sam local invoke
[*] Test Function in the Cloud: sam sync --stack-name {{stack-name}} --watch
[*] Deploy: sam deploy --guided
This command builds your application's source by installing the dependencies defined in Functions/pom.xml, creating a deployment package, and saving it in the .aws-sam/build folder.
Deploy the sample application using the SAM CLI.
$ sam deploy --guided
This command will package and deploy your application to AWS with a series of prompts that you should respond to, as shown below:
- Stack Name: The name of the stack to deploy to CloudFormation. This should be unique to your account and region; a good starting point is something matching your project name.
- AWS Region: The AWS region where you want to deploy your application.
- Confirm changes before deploy: If set to Y, all change sets are displayed for manual review before execution. If set to N, the AWS SAM CLI automatically deploys changes to your application.
- Allow SAM CLI IAM role creation: Many AWS SAM templates, including this example, create the AWS IAM roles that AWS Lambda functions need to access AWS services. By default, these are scoped down to the minimum required permissions. To deploy an AWS CloudFormation stack that creates or modifies IAM roles, you must provide the CAPABILITY_IAM value for capabilities. If permission is not provided through this prompt, you must explicitly pass --capabilities CAPABILITY_IAM to the sam deploy command to deploy this example.
- Disable rollback [Y/N]: If set to Y, the state of previously provisioned resources is preserved when an operation fails.
- Save arguments to configuration file (samconfig.toml): If set to Y, your selections are saved to the configuration file inside the project, so that in the future you can simply re-run sam deploy without parameters to deploy changes to your application.
After entering the parameters, you should see something like this if you have specified Y to display and confirm the change sets. Enter 'y' at the prompt to deploy the resources.
Initiating deployment
=====================
Uploading to sam-app-datadog/0c2b93e71210af97a8c57710d0463c8b.template 1797 / 1797 (100.00%)
Waiting for changeset to be created..
CloudFormation stack changeset
---------------------------------------------------------------------------------------------------------------------
Operation    LogicalResourceId                ResourceType               Replacement
---------------------------------------------------------------------------------------------------------------------
+ Add        FunctionsDevOpsGuruPermission    AWS::Lambda::Permission    N/A
+ Add        FunctionsDevOpsGuru              AWS::Events::Rule          N/A
+ Add        FunctionsRole                    AWS::IAM::Role             N/A
+ Add        Functions                        AWS::Lambda::Function      N/A
---------------------------------------------------------------------------------------------------------------------
Changeset created successfully. arn:aws:cloudformation:us-east-1:867001007349:changeSet/samcli-deploy1680640852/bdc3039b-cdb7-4d7a-a3a0-ed9372f3cf9a
Previewing CloudFormation changeset before deployment
======================================================
Deploy this changeset? [y/N]: y
2023-04-04 15:41:06 - Waiting for stack create/update to complete
CloudFormation events from stack operations (refresh every 5.0 seconds)
---------------------------------------------------------------------------------------------------------------------
ResourceStatus        ResourceType                  LogicalResourceId                ResourceStatusReason
---------------------------------------------------------------------------------------------------------------------
CREATE_IN_PROGRESS    AWS::IAM::Role                FunctionsRole                    -
CREATE_IN_PROGRESS    AWS::IAM::Role                FunctionsRole                    Resource creation Initiated
CREATE_COMPLETE       AWS::IAM::Role                FunctionsRole                    -
CREATE_IN_PROGRESS    AWS::Lambda::Function         Functions                        -
CREATE_IN_PROGRESS    AWS::Lambda::Function         Functions                        Resource creation Initiated
CREATE_COMPLETE       AWS::Lambda::Function         Functions                        -
CREATE_IN_PROGRESS    AWS::Events::Rule             FunctionsDevOpsGuru              -
CREATE_IN_PROGRESS    AWS::Events::Rule             FunctionsDevOpsGuru              Resource creation Initiated
CREATE_COMPLETE       AWS::Events::Rule             FunctionsDevOpsGuru              -
CREATE_IN_PROGRESS    AWS::Lambda::Permission       FunctionsDevOpsGuruPermission    -
CREATE_IN_PROGRESS    AWS::Lambda::Permission       FunctionsDevOpsGuruPermission    Resource creation Initiated
CREATE_COMPLETE       AWS::Lambda::Permission       FunctionsDevOpsGuruPermission    -
CREATE_COMPLETE       AWS::CloudFormation::Stack    sam-app-datadog                  -
---------------------------------------------------------------------------------------------------------------------
Successfully created/updated stack - sam-app-datadog in us-east-1
Once successfully deployed, you should be able to see the resource's successful creation. Your Lambda, IAM role, and EventBridge rule can also be found in the CloudFormation stack output values.
Using the local SAM CLI functionality, you can also test and debug your function locally with sample events. Test a single function by invoking it directly with a test event. An event is a JSON document representing the input that the function receives from the event source. For more information, see Invoking Lambda functions locally - AWS Serverless Application Model.
$ sam local invoke Functions -e 'event/event.json'
Once you've followed the steps above, go to the 'Test the solution' section below to trigger some DevOps Guru insights and verify that events are being created and sent to Datadog.
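If you want to write your own test event for a local invocation, the Python sketch below produces a hand-written, EventBridge-style JSON document for a newly opened insight. Every value in it is a placeholder, and the field names inside `detail` are assumptions for illustration; compare them with a real event captured from your account before relying on them. The helper names are hypothetical.

```python
import json
from pathlib import Path


def build_sample_event():
    """Return a hand-written EventBridge-style event for a new insight."""
    return {
        "version": "0",
        "detail-type": "DevOps Guru New Insight Open",
        "source": "aws.devops-guru",
        "account": "123456789012",  # placeholder account ID
        "region": "us-east-1",
        "resources": [],
        "detail": {
            # Assumed field names; verify against a real event.
            "insightId": "EXAMPLE-INSIGHT-ID",
            "insightDescription": "Example anomaly in a sample stack",
            "insightSeverity": "high",
            "insightType": "REACTIVE",
        },
    }


def write_sample_event(path):
    """Write the sample event as JSON, creating parent folders if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(build_sample_event(), indent=2), encoding="utf-8")
    return target


if __name__ == "__main__":
    # Print the document; call write_sample_event(...) to save it to a file.
    print(json.dumps(build_sample_event(), indent=2))
```

Calling `write_sample_event("event/event.json")` from the project folder would create a file at the path used in the command above.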
Test the solution
To test the solution, run a DevOps Guru Insight simulation. DevOps Guru creates an insight when an anomaly is detected in the application, as shown below.
In the case of the DevOps Guru insight shown above, the corresponding event is created automatically and passed to Datadog, as shown below. In addition to event creation, any new anomalies and recommendations from DevOps Guru are also linked to events.
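How each DevOps Guru event should appear in Datadog is a design decision. One possible mapping, sketched below in Python, chooses a Datadog event `alert_type` from the EventBridge detail type and the insight severity: closed insights become `success`, recommendations become `recommendation`, and everything else is ranked by severity. The function name and the mapping itself are assumptions for illustration, not the connector's documented behaviour; only the `alert_type` values themselves (`error`, `warning`, `info`, `success`, `recommendation`) come from the Datadog Events API.

```python
def datadog_alert_type(detail_type, severity=""):
    """Pick a Datadog alert_type for a DevOps Guru event (illustrative mapping)."""
    if detail_type == "DevOps Guru Insight Closed":
        return "success"
    if detail_type == "DevOps Guru New Recommendation Created":
        return "recommendation"
    # Remaining event types are ranked by the insight severity, if present.
    level = (severity or "").lower()
    if level == "high":
        return "error"
    if level == "medium":
        return "warning"
    return "info"


if __name__ == "__main__":
    for detail_type, severity in [
        ("DevOps Guru New Insight Open", "high"),
        ("DevOps Guru New Anomaly Association", "medium"),
        ("DevOps Guru New Recommendation Created", "high"),
        ("DevOps Guru Insight Closed", "high"),
    ]:
        print(detail_type, "->", datadog_alert_type(detail_type, severity))
```

Combined with a shared aggregation key per insight, a mapping like this would let a Datadog monitor alert on `error` events and recover when the matching `success` event arrives.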
Cleaning up
To delete the sample application you created, open a new terminal in your Cloud9 environment. Then run the following AWS CLI command, passing the stack name you specified in the deployment step.
aws cloudformation delete-stack --stack-name <stack-name>
Alternatively, you can also use the AWS CloudFormation console to delete the stack.
Conclusion
This article describes how Amazon DevOps Guru monitors resources in a specific region of your AWS account, automatically detecting operational issues, predicting potential resource exhaustion, identifying likely causes, and recommending corrective actions. It describes a bespoke solution to integrate DevOps Guru insights with Datadog, streamlining the management and oversight of AWS services. This solution helps customers using Datadog improve operational efficiency by delivering personalized insights, real-time alerts, and management capabilities directly from DevOps Guru, offering a unified interface to restore services and systems quickly.
Go to the Amazon DevOps Guru documentation page to start gaining operational insights into your AWS infrastructure using Datadog.