Calling asynchronous external APIs using AWS Step Functions
Third-party vendor APIs can help organizations streamline operations, reduce costs, and provide better services to customers. However, integrating with third-party services can present many challenges, such as security, reliability, and cost.
Organizations need to ensure that their systems can handle the third party's performance issues or downtime. In some cases, calling a third-party API incurs costs, such as license fees. If the external API provider has agreed on a maximum number of requests per second (RPS), your system must adapt accordingly.
This blog post walks through how to build an architecture for calling an external provider's API using AWS Step Functions, with specific guidance on reliability.
This pattern applies to any industry that relies on technology and data through third-party API integration. Examples include e-commerce applications in which online merchants integrate with third-party payment gateways or shipping carriers, and applications in the healthcare and banking sectors.
Overview of calling asynchronous external APIs
This solution uses AWS services to build an orchestrator that controls the frequency of third-party service calls and implements the service callback pattern to process long-running tasks. This architecture is also available in the AWS Reference Architecture Diagrams section of the AWS Architecture Center.
As shown in Figure 1, the architecture uses Step Functions to control the rate of calls to an external service according to its maximum RPS contract. Step Functions pauses the main request workflow until it receives a callback from the external system indicating that the task has been completed.
Figure 1: Calling asynchronous external APIs using AWS Step Functions
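The pause-and-resume behavior described above can be sketched as an Amazon States Language definition built as a plain Python dictionary. This is a minimal sketch, not the article's actual state machine: the function name, the state names, the `$.request` input path, and the `Payload` field are assumptions made for illustration. The `.waitForTaskToken` resource suffix, the `$$.Task.Token` context path, `TimeoutSeconds`, and the `States.Timeout` error name are standard Step Functions features.

```python
import json


def build_main_workflow(queue_url: str, timeout_seconds: int) -> dict:
    """Return a hypothetical main workflow definition as a plain dict."""
    return {
        "Comment": "Enqueue a request, then pause until the callback arrives",
        "StartAt": "SendRequestAndWait",
        "States": {
            "SendRequestAndWait": {
                "Type": "Task",
                # The .waitForTaskToken suffix pauses the execution until
                # SendTaskSuccess or SendTaskFailure is called with the token.
                "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
                "Parameters": {
                    "QueueUrl": queue_url,
                    "MessageBody": {
                        # The task token travels with the request payload.
                        "TaskToken.$": "$$.Task.Token",
                        # "$.request" is an assumed location of the payload.
                        "Payload.$": "$.request",
                    },
                },
                # Fail with States.Timeout if no callback arrives in time.
                "TimeoutSeconds": timeout_seconds,
                "Catch": [
                    {
                        "ErrorEquals": ["States.Timeout"],
                        "ResultPath": "$.error",
                        "Next": "HandleTimeout",
                    }
                ],
                "Next": "ProcessResult",
            },
            # Placeholder for retrieving and processing third-party results.
            "ProcessResult": {"Type": "Pass", "End": True},
            "HandleTimeout": {
                "Type": "Fail",
                "Error": "CallbackTimeout",
                "Cause": "No callback received before the timeout",
            },
        },
    }


if __name__ == "__main__":
    definition = build_main_workflow("https://sqs.example/requests", 3600)
    print(json.dumps(definition, indent=2))
```

The JSON produced by `json.dumps` could be supplied as the state machine definition. The timeout value should be chosen from the external system's maximum response time.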
Let's walk through each step.
- Configure Step Functions to handle the lifecycle of long-running requests to a third party. Add a request step inside the workflow that pauses it, using the .waitForTaskToken callback pattern to resume. Set a timeout so that a timeout error is raised if the callback is not received.
- Send the task token and request payload to an Amazon Simple Queue Service (Amazon SQS) queue. Use Amazon CloudWatch to monitor the queue length. If the queue length keeps growing beyond what the maximum RPS agreed with the third party can absorb, consider adjusting your contract with that service.
- Use AWS Lambda to poll the Amazon SQS queue and start a Step Functions Express workflow. Later in this article, we discuss the polling batch size, reserved concurrency, and maximum concurrency, which can be used to control the Lambda invocation rate.
- Optionally, add a dynamic delay inside Lambda, controlled by AWS AppConfig, if the system still needs a lower call rate to comply with the contracted RPS.
- The Step Functions Express workflow calls an Amazon API Gateway HTTP proxy API, which is configured with a rate limit to maintain compliance with the contracted RPS. API Gateway acts as a safety-net proxy that ensures your system does not break the RPS contract while you dynamically adjust the call rate parameters.
- Call the external third-party asynchronous service API, send the payload from the request queue, and receive the task ID from the service. Use Amazon SQS to send failed requests to a dead-letter queue (DLQ).
- Store the main workflow's task token and the task ID returned by the third-party service in an Amazon DynamoDB table. This table is used as a mapping to correlate the task ID with the task token.
- When the external service completes the task, it sends the completed task ID to the callback webhook endpoint implemented using API Gateway.
- Transform the external callback using API Gateway mapping templates, add the payload and task ID to an Amazon SQS callback queue, and respond to the caller immediately.
- Use Lambda to poll the Amazon SQS callback queue and then look up the stored task token. Use the token to unblock the pending workflow by calling SendTaskSuccess, and use a callback DLQ to store failed messages.
- In the main workflow, pass the task ID to the next step and invoke the processor Step Functions workflow to retrieve the third-party results. Finally, process the third-party service results.
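The correlation between the task ID and the task token in the steps above can be sketched as follows. This is an illustrative sketch under stated assumptions: the attribute names `taskId` and `taskToken`, the callback message shape, and the function names are hypothetical. The `table` argument is expected to behave like a boto3 DynamoDB `Table` resource (`put_item`, `get_item`) and `sfn` like a boto3 Step Functions client (`send_task_success`). Both are passed in so the logic can be exercised without AWS access. The return value uses the partial batch response format of Lambda, which assumes that `ReportBatchItemFailures` is enabled on the event source mapping.

```python
import json


def store_mapping(table, task_id: str, task_token: str) -> None:
    """Persist the task ID to task token mapping after the API call."""
    table.put_item(Item={"taskId": task_id, "taskToken": task_token})


def resume_workflow(table, sfn, task_id: str, result: dict) -> bool:
    """Look up the token for task_id and unblock the paused workflow."""
    item = table.get_item(Key={"taskId": task_id}).get("Item")
    if item is None:
        # Unknown task ID: report failure so the message can reach the DLQ.
        return False
    sfn.send_task_success(
        taskToken=item["taskToken"],
        output=json.dumps({"taskId": task_id, "result": result}),
    )
    return True


def handle_callback_batch(event: dict, table, sfn) -> dict:
    """Process a batch of callback messages delivered by Amazon SQS."""
    failures = []
    for record in event["Records"]:
        # Assumed message shape: {"taskId": "...", "result": {...}}
        body = json.loads(record["body"])
        if not resume_workflow(table, sfn, body["taskId"], body.get("result", {})):
            failures.append({"itemIdentifier": record["messageId"]})
    # Only the listed messages are retried; the rest are deleted.
    return {"batchItemFailures": failures}
```

In a deployed function, `table` might come from `boto3.resource("dynamodb").Table(...)` and `sfn` from `boto3.client("stepfunctions")`, with the table name supplied through configuration.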
Control the frequency of external API calls
To comply with third-party RPS contracts, apply a mechanism to control the frequency of your system's calls. The rate at which messages are polled from the Amazon SQS request queue (step 3 in the architecture) directly impacts the frequency of calls.
Several parameters control the invocation rate of a Lambda function that uses Amazon SQS as its event source trigger, for example:
- Batch size: the number of records to send to the function in each batch. For a standard queue, this can be up to 10,000 records. For a FIFO queue, the maximum is 10. Batch size alone does not limit the rate of calls. It should be used with other parameters, such as reserved or maximum concurrency.
- Batch window: the maximum time, in seconds, to accumulate records before invoking the function. This applies only to standard queues.
- Maximum concurrency: a setting at the event source level that limits the number of concurrent instances of the function that an Amazon SQS event source can invoke.
The trigger configuration is shown in Figure 2.
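The trigger parameters above map onto the arguments of the Lambda `CreateEventSourceMapping` API. The helper below is a hedged sketch that only builds and sanity-checks the arguments. The function name and its validation messages are hypothetical, while `BatchSize`, `MaximumBatchingWindowInSeconds`, and `ScalingConfig.MaximumConcurrency` are the API's own parameter names. It assumes the queue type can be inferred from the `.fifo` suffix that FIFO queue names carry.

```python
def build_event_source_mapping(
    function_name: str,
    queue_arn: str,
    batch_size: int,
    max_concurrency: int,
    batch_window_seconds: int = 0,
) -> dict:
    """Build keyword arguments for an SQS event source mapping."""
    is_fifo = queue_arn.endswith(".fifo")
    batch_limit = 10 if is_fifo else 10_000
    if not 1 <= batch_size <= batch_limit:
        raise ValueError(f"batch_size must be between 1 and {batch_limit}")
    if is_fifo and batch_window_seconds:
        raise ValueError("a batch window applies only to standard queues")
    if batch_size > 10 and batch_window_seconds < 1:
        # Lambda requires a window of at least 1 second for batches above 10.
        raise ValueError("batch_size above 10 needs a batch window of 1s or more")

    params = {
        "FunctionName": function_name,
        "EventSourceArn": queue_arn,
        "BatchSize": batch_size,
        # Caps how many concurrent function instances this source can invoke.
        "ScalingConfig": {"MaximumConcurrency": max_concurrency},
    }
    if batch_window_seconds:
        params["MaximumBatchingWindowInSeconds"] = batch_window_seconds
    return params
```

The resulting dictionary could be passed to boto3 as `boto3.client("lambda").create_event_source_mapping(**params)`.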
Other Lambda configuration parameters can also be used to control the rate of calls. These include:
- Reserved concurrency: sets the maximum number of concurrent instances of a function and guarantees that this capacity is available to it. When a function has reserved concurrency, no other function can use that concurrency. This can be used to limit and reduce the frequency of calls.
- Provisioned concurrency: initializes the requested number of execution environments so that they are prepared to respond immediately to your function's invocations. Note that configuring provisioned concurrency incurs charges to your AWS account.
These additional Lambda configuration parameters are shown in Figure 3.
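To see how these settings combine, a rough upper bound on the call rate can be worked out from the concurrency, the batch size, and the time one invocation takes. The sketch below is a simplification, not a guarantee: it assumes every message results in exactly one external API call, that all concurrent instances run back to back with full batches, and that the invocation duration (including any dynamic delay) is constant. The function names are hypothetical.

```python
import math


def estimated_peak_rps(
    concurrency: int, batch_size: int, seconds_per_invocation: float
) -> float:
    """Upper bound on calls per second if every instance stays busy."""
    return concurrency * batch_size / seconds_per_invocation


def max_concurrency_for(
    contracted_rps: float, batch_size: int, seconds_per_invocation: float
) -> int:
    """Largest concurrency whose estimated peak stays within the contract."""
    return math.floor(contracted_rps * seconds_per_invocation / batch_size)


if __name__ == "__main__":
    # 5 concurrent instances, 10 messages per batch, 2 seconds per invocation
    print(estimated_peak_rps(5, 10, 2.0))
    # Concurrency that keeps the same workload within 20 RPS
    print(max_concurrency_for(20, 10, 2.0))
```

A value derived this way could be applied as reserved concurrency with the boto3 call `put_function_concurrency(FunctionName=..., ReservedConcurrentExecutions=...)`, or as the maximum concurrency of the event source. The API Gateway rate limit from step 5 remains the safety net if the estimate turns out to be too optimistic.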
Maximizing the external API architecture
Several use cases must be considered when implementing this architecture to ensure that you are building the right orchestrator.
Here are some examples:
- If the external system does not respond with a callback in step 8, a timeout exception occurs in step 1. Configure a sensible timeout for step 1 in the main state machine. The timeout value should take into account the external system's maximum response time.
The error handling capabilities of Step Functions, described in the AWS Step Functions Developer Guide, allow you to implement custom logic for different types of errors. You can catch timeout errors using the error name States.Timeout.
- As mentioned in step 4, the dynamic delay inside the Lambda function should only be used temporarily, to absorb burst traffic. If the contract with the external party specifies a very low RPS, consider other ways to introduce a delay.
For example, Amazon EventBridge Scheduler can trigger the Lambda function at regular intervals to consume messages from Amazon SQS. This avoids paying for the Lambda function's idle or waiting time.
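The scheduled alternative could look like the sketch below, in which each scheduled run consumes at most a fixed number of messages instead of sleeping inside the function. This is an assumption-laden illustration rather than the article's implementation: `drain_once`, `max_calls`, and `call_api` are hypothetical names, and `sqs` is expected to behave like a boto3 SQS client (`receive_message`, `delete_message`). SQS returns at most 10 messages per receive call, so the loop requests them in chunks.

```python
def drain_once(sqs, queue_url: str, max_calls: int, call_api) -> int:
    """Consume up to max_calls messages in one scheduled run."""
    made = 0
    while made < max_calls:
        response = sqs.receive_message(
            QueueUrl=queue_url,
            # Never request more than the remaining call budget.
            MaxNumberOfMessages=min(10, max_calls - made),
        )
        messages = response.get("Messages", [])
        if not messages:
            # Queue is empty (or nothing is visible right now): stop early.
            break
        for message in messages:
            call_api(message["Body"])
            # Delete only after the call succeeded, so failures are retried.
            sqs.delete_message(
                QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
            )
            made += 1
    return made
```

With a schedule that fires once per minute and `max_calls` set to 30, for instance, the average rate stays at or below 0.5 calls per second, although calls within a single run are made back to back.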
Conclusion
This blog post discusses how to build an end-to-end orchestration that manages the request lifecycle, five different parameters for controlling the frequency of third-party service calls, and how to throttle calls to the third-party service API to stay within a maximum RPS contract.
It also considers use cases for error handling in Step Functions and for monitoring the system using CloudWatch. In addition, this architecture uses fully managed AWS serverless services, which removes the undifferentiated heavy lifting of building highly available, reliable, secure, and cost-effective systems on AWS.