New serverless pattern - lambda-SQS-best-Practices-CDK #2733

Open: 17 commits to merge into base: main

shubsmor
Contributor

Creating a new serverless pattern that demonstrates a prod-ready SQS -> Lambda integration with best practices implemented and more granular observability via Lambda Powertools.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@@ -0,0 +1,207 @@
# Lambda SQS Best Practices with AWS CDK

This pattern demonstrates a production-ready implementation of AWS Lambda processing messages from Amazon SQS using AWS CDK. It serves as a reference architecture for building robust, observable, and maintainable serverless applications, featuring AWS Lambda Powertools integration for enhanced observability through structured logging, custom metrics, and distributed tracing with X-Ray. The pattern implements comprehensive error handling with automatic retries and Dead Letter Queue (DLQ) configuration, along with a detailed CloudWatch Dashboard for operational monitoring. Security is enforced through least privilege IAM roles, while operational excellence is maintained through proper resource configurations and cost optimizations. This enterprise-grade solution includes batch message processing, configurable timeouts, message validation, and a complete monitoring strategy, making it ideal for teams building production serverless applications that require high reliability, observability, and maintainability.
Contributor

Avoid using terms such as production-ready, enterprise-ready, and enterprise-grade.

Contributor Author

Noted, will work on this

Contributor Author

Updated the pattern description to use more generic wording.

- X-Ray tracing
3. A CloudWatch Dashboard for operational monitoring
4. Least privilege permissions implemented on roles and policies
<img src="./resources/Least-priviledge.png" alt="Architecture" width="100%"/>
Contributor

If the purpose is to showcase least privilege, please ask users to navigate to
IAM --> Roles --> LambdaSqsBestPracticesCdk** and review the managed and customer inline policies.

Contributor Author

Added
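
For readers reviewing those policies, here is a minimal sketch of what a queue-scoped consumer statement looks like and how a wildcard grant could be flagged. The three SQS actions are the ones an SQS event source needs on its source queue; the type names, function names, and any ARN passed in are assumptions for illustration and are not taken from the pattern's code.

```typescript
// Hypothetical shapes for illustration; not taken from the pattern's source.
interface PolicyStatement {
  Effect: "Allow";
  Action: string[];
  Resource: string[];
}

// Actions a Lambda SQS event source needs on the source queue.
const SQS_CONSUMER_ACTIONS: readonly string[] = [
  "sqs:ReceiveMessage",
  "sqs:DeleteMessage",
  "sqs:GetQueueAttributes",
];

// Build a statement scoped to a single queue ARN instead of "*".
function consumerStatement(queueArn: string): PolicyStatement {
  if (!/^arn:[^:]+:sqs:/.test(queueArn)) {
    throw new Error(`Not an SQS queue ARN: ${queueArn}`);
  }
  return {
    Effect: "Allow",
    Action: [...SQS_CONSUMER_ACTIONS],
    Resource: [queueArn],
  };
}

// Flag statements that grant a wildcard resource or a wildcard action.
function isOverlyBroad(statement: PolicyStatement): boolean {
  const wildcardResource = statement.Resource.indexOf("*") !== -1;
  const wildcardAction = statement.Action.some((action) => /\*$/.test(action));
  return wildcardResource || wildcardAction;
}
```

When reading the inline policies in the IAM console, the same two questions apply: is each resource a specific ARN, and is each action named explicitly.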


The pattern includes a load testing script to verify functionality:

1. Set the Queue URL environment variable:
Contributor

Consider printing the values at the end:

echo $QUEUE_URL
echo $DLQ_URL
echo $AWS_REGION

Contributor Author

The script expects these environment variables to be exported before it runs; otherwise it prompts us to set valid values. That said, I can add a console.log() for each value inside the script as well.

Contributor Author

Added console.log calls that print the source queue URL, DLQ URL, and Region from the script.
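
A minimal sketch of the check-then-print behavior discussed in this thread, assuming the script reads `QUEUE_URL`, `DLQ_URL`, and `AWS_REGION` (the names from the review comment). The function names and the idea of passing the environment in as an argument are hypothetical and may differ from the actual load testing script.

```typescript
// Variable names come from the review comment; everything else is hypothetical.
const REQUIRED_VARS = ["QUEUE_URL", "DLQ_URL", "AWS_REGION"] as const;
type VarName = (typeof REQUIRED_VARS)[number];
type LoadTestConfig = Record<VarName, string>;

// Validate that every required variable is set and non-blank.
function readLoadTestConfig(
  env: Record<string, string | undefined>
): LoadTestConfig {
  const missing = REQUIRED_VARS.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
  if (missing.length > 0) {
    throw new Error(
      `Set these environment variables first: ${missing.join(", ")}`
    );
  }
  return {
    QUEUE_URL: env["QUEUE_URL"] as string,
    DLQ_URL: env["DLQ_URL"] as string,
    AWS_REGION: env["AWS_REGION"] as string,
  };
}

// Produce one printable line per value, in a fixed order.
function describeConfig(config: LoadTestConfig): string[] {
  return REQUIRED_VARS.map((name) => `${name}=${config[name]}`);
}
```

In a script this could be wired up as `describeConfig(readLoadTestConfig(process.env)).forEach((line) => console.log(line))`, which fails fast with the list of missing variables instead of sending messages to an unset queue.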

Sample result
<img src="./resources/Success-script-sample.png" alt="Architecture" width="100%"/>

Refer to the dashboard to verify that all messages are processed successfully and that no messages are in the DLQ
Contributor

Any reason why there is a failure in this case?

Screenshot 2025-07-20 at 10 45 14 AM

Contributor Author

As discussed, we are also simulating an external API failure from inside the code, so this could be related to that.
The message should succeed on the next retry.


<img src="./resources/dashboard-mesage-processing-2.png" alt="Architecture" width="100%"/>

Additionally, confirm the messages in DLQ
Contributor

Messages won't show on the screen until you poll for them. I see this is highlighted in the snapshot; it is worth mentioning.

Contributor Author

noted

Contributor Author

Added

Verify the same using dashboard
<img src="./resources/dashboard-mesage-processing.png" alt="Architecture" width="100%"/>

<img src="./resources/dashboard-mesage-processing-2.png" alt="Architecture" width="100%"/>
Contributor

What is the difference between these 2 snapshots?

Contributor Author

I have tried to showcase what happened to the complete batch and its retries. I have indicated the same using the receive count.

Contributor Author

Please refer to dashboard 2, which shows that the external API failure message was processed successfully, while the remaining 10 messages were retried 2 more times and, after reaching a receive count of 3, moved to the DLQ.
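
The retry behavior described in this thread can be sketched as a small simulation that makes no AWS calls. It assumes a maximum receive count of 3, matching the receive count mentioned above; the message kinds, type names, and function names are hypothetical and only model the two cases discussed (a simulated external API failure that recovers on retry, and an invalid message that never succeeds).

```typescript
// Simulation only: no AWS calls. The limit of 3 matches the receive count
// mentioned in the thread; all other names are hypothetical.
const MAX_RECEIVE_COUNT = 3;

type MessageKind = "valid" | "transient-failure" | "poison-pill";

interface LifecycleEvent {
  messageId: string;
  receiveCount: number;
  outcome: "processed" | "failed" | "moved-to-dlq";
}

// Decide whether a single processing attempt succeeds.
function attemptSucceeds(kind: MessageKind, receiveCount: number): boolean {
  if (kind === "valid") return true;
  // A transient failure fails on the first receive, then recovers.
  if (kind === "transient-failure") return receiveCount > 1;
  // A poison pill never succeeds.
  return false;
}

// Trace one message from its first receive until it is processed
// or moved to the dead-letter queue.
function traceLifecycle(messageId: string, kind: MessageKind): LifecycleEvent[] {
  const events: LifecycleEvent[] = [];
  for (let receiveCount = 1; receiveCount <= MAX_RECEIVE_COUNT; receiveCount++) {
    if (attemptSucceeds(kind, receiveCount)) {
      events.push({ messageId, receiveCount, outcome: "processed" });
      return events;
    }
    events.push({ messageId, receiveCount, outcome: "failed" });
  }
  // Every allowed receive failed, so the message is dead-lettered.
  events.push({
    messageId,
    receiveCount: MAX_RECEIVE_COUNT,
    outcome: "moved-to-dlq",
  });
  return events;
}
```

Calling `traceLifecycle("msg-1", "poison-pill")` yields three failed receives followed by a move to the DLQ, while `traceLifecycle("msg-2", "transient-failure")` yields one failure followed by a successful retry.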

* Error details
```

Example deep-dive walkthrough on structured logging for a batch:
Contributor

Is there a way you can show the life cycle of a message by taking a messageID and following it all the way to the trace?

Contributor Author

I have highlighted the message ID and the type of retry. The external API failure was successfully processed on retry, whereas the invalid message (poison pill) will be sent to the DLQ. I will try to include the complete retry cycle (all 3 invokes: the original and 2 retries) and the message in the DLQ for the poison pill message.

Contributor Author

I have shown this for a poison pill record: initial invoke, retry 1, retry 2, and the message sent to the DLQ.

@shubsmor
Contributor Author

I have updated the project with the suggested changes. Please review again and let me know if you have any concerns.
