Feature request: Distributed Lock Client #3444
Hey @codyfrisch

How is this different from idempotency? This is exactly what we solve today with Idempotency, by keeping a dedicated record for a payload execution with an expiration and a return payload.

Having a lock for rate limiting is something I agree with. I have had customers in the past ask for the same feature, where they have a configuration for each third-party API, each needing its own rate limit, but from the consumer's perspective. I don't think it's optimal to let the client rate-limit itself, but the reality is that it is needed in some cases. However, EventBridge offers a solution for exactly this situation: calling an API through a rate limiter.

You can also do the reverse and let EventBridge invoke your Lambda function with a rate limit, which of course adds latency and infrastructure configuration, but removes the need for a dedicated custom rate-limit solution. I'd love to add a rate limiter to our set of tools, but we really need to understand the use cases and constraints that would make it a better fit than the current solutions.
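For reference, a hedged sketch of that EventBridge option: an API destination carries a built-in invocation rate limit, so EventBridge throttles the calls instead of the Lambda client (names and ARNs below are placeholders):

```python
import boto3

events = boto3.client("events")

# API destinations wrap an HTTP endpoint; EventBridge enforces the
# InvocationRateLimitPerSecond when delivering events to it
events.create_api_destination(
    Name="third-party-api",  # placeholder name
    ConnectionArn="arn:aws:events:us-east-1:123456789012:connection/example",
    InvocationEndpoint="https://api.example.com/v1/resource",
    HttpMethod="POST",
    InvocationRateLimitPerSecond=10,
)
```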
Hi @codyfrisch - apologies for the late reply. The issue fell through the cracks of re:Invent and end-of-year leave. I'd like to understand a bit more about the request, primarily since, as you also call out, there's quite a lot of similarity with the existing Idempotency utility. Could you please share an example of where the Idempotency utility falls short for this type of use case? I think that would help us better understand what a distributed lock would add. At least based on my current understanding, despite the name and framing in the documentation, the Idempotency utility we have today already acts as a lock mechanism, in the sense that whenever a request is first seen, an idempotency record for that transaction is created (aka the lock), and any other request attempting to perform the operation on the same payload gets rejected. Possibly the only missing piece is a lock release mechanism, but using a short idempotency TTL would equate to an immediate release. Regarding the distributed rate limiter, I am not 100% sure this is a concern that belongs within a Lambda function; perhaps it would be better handled at the WAF or API Gateway level. But maybe it's something to explore.
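To make the comparison concrete, here is a minimal sketch of the Idempotency utility used as a de-facto lock, where a short TTL approximates an immediate release (table name is a placeholder):

```python
from aws_lambda_powertools.utilities.idempotency import (
    DynamoDBPersistenceLayer,
    IdempotencyConfig,
    idempotent,
)

persistence_layer = DynamoDBPersistenceLayer(table_name="IdempotencyTable")
config = IdempotencyConfig(expires_after_seconds=5)  # short TTL ~ quick "release"

@idempotent(config=config, persistence_store=persistence_layer)
def handler(event, context):
    # the first request creates the idempotency record (the "lock");
    # concurrent requests with the same payload are rejected or served
    # the stored result until the record expires
    return {"statusCode": 200}
```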
@dreamorosi the primary difference from the Idempotency tool is that these locks are not per-request but per-resource. The contention could come from many separate requests, including entirely different endpoints. The usage is also different: separate executions wait for release without returning, and then either take the lock themselves or act on the results of the work performed by the execution that held it.

Idempotency is close, but beyond a direct release mechanism, there also needs to be a polling mechanism (for trying to acquire a released lock) and a renewal mechanism (to extend the lock). These locks are short-lived (500ms or so), and if there is a failure where a lock is not released, you want an instance polling the lock to be able to acquire it based simply on the conditional expression and the lack of a timely renewal.

The rate limiter was an example of a use case for this, not its own request. The lock is on the operation of fetching rate limit data from the third-party API (how many requests have been made, consumed query/mutation complexity - this is GraphQL), which itself consumes API requests. If every request to my service made the same API call, the API would reach its limits much sooner and throughput would drop (or real executions would be prevented altogether!). By creating a lock on this call, in a thundering-herd situation (very common), one execution makes the call and gets the results while the others wait (~300ms); once the lock is released, they can use the rate limit data the locking execution cached in DynamoDB (or whatever persistence layer the developer chooses). (I won't get into the details of the actual rate limiting, that's irrelevant here.)

Basically, between Lambda executions which themselves are not idempotent, there are routines that are, despite the overall requests not being.
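To illustrate the flow being described, a hypothetical usage sketch (the `DistributedLock` API below is invented for illustration, not an existing utility):

```python
lock = DistributedLock(table="LockTable", resource="thirdparty-api:rate-limit")

if lock.acquire(duration_ms=1500, renew_every_ms=333):
    try:
        data = fetch_rate_limit_from_api()  # the expensive third-party call
        cache_rate_limit(data)              # e.g. a DynamoDB item others can read
    finally:
        lock.release()
else:
    # another execution holds the lock: poll until it is released or goes
    # stale, then reuse the result the lock holder cached
    lock.wait_for_release(poll_every_ms=500, max_wait_ms=3000)
    data = read_cached_rate_limit()
```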
Thanks for the detailed explanation @codyfrisch! We are a team maintaining Powertools - that's why several of us may comment on the same issue - and we are very interested in hearing from our customers about their needs. To avoid wrong assumptions and misunderstandings on my part, I will add my thoughts to make sure I understood the point, ask questions, or try to get more information.

**Persistent storage**

If we agree to implement this, we will need to decide on a backend to store the lock. I see an opportunity here to use DynamoDB, Redis, and probably some SQL database like Aurora DSQL. Do you have any other storage layer in mind? Lambda is stateless by design and we can't store things in memory and assume they will be available across requests, so we need persistent storage.

**Per-request vs. per-resource**

This is clear to me: you want a mechanism that is shared between Lambda functions, endpoints, and maybe different operations/workloads. The main point of this mechanism is to prevent concurrent calls to an external service, and here I have my first questions:

1. What are these external services? APIs? Database queries? Other? Any kind of code block?

**Polling mechanism for acquire and renew**

Acquiring the lock is clearly Powertools' job, but for the renew part I am not sure whether the renew polling is the responsibility of Powertools or of the customer. Let me try to explain this with code. I'm not the best at TypeScript, so I'll do it using Python pseudocode.

Powertools-managed renewal:

```python
lock = DynamoDBLock('LockTable', 'rate_limit_data')
lock.renew()

class DynamoDBLock:
    def renew(self, ...):
        while True:
            ...
```

Customer-managed renewal:

```python
lock = DynamoDBLock('LockTable', 'rate_limit_data')
while True:
    time.sleep(0.1)  # Renew every 100ms
    lock.renew()
```

The main difference here is that whoever controls the acquire/renew polling has better control over the Lambda timeout.

**Dealing with timeouts**

Unlike other execution platforms, Lambda is an environment with a fixed execution timeout. Acquiring a lock can take seconds or minutes, depending on what the customer configures, and in Lambda we have no way to control the execution beyond that fixed timeout.

**Costs**

Polling to acquire a lock in DynamoDB can be an expensive operation due to DynamoDB's billing model. In my mind we would have to do something like the code below, and even if the insert is not successful because the condition fails, DynamoDB still consumes write capacity for the attempt:

```python
def acquire(self):
    while True:
        try:
            response = self.table.put_item(
                Item={...},
                ConditionExpression='attribute_not_exists(lock_name) OR expiry < :current_time',
                ExpressionAttributeValues={':current_time': int(time.time() * 1000)}
            )
            return True
        except ClientError as e:
            if e.response['Error']['Code'] != 'ConditionalCheckFailedException':
                raise
            time.sleep(0.05)  # Poll every 50ms or configured time
```

These are my points so far, and we will need to iterate a few more times to get clarity on the solution, but I see excellent potential in this feature. Please let me know what you think and I can clarify any points that are unclear.
**Storage**

Persistent storage could be DynamoDB or Redis/Valkey (ElastiCache), or SQL. I used DynamoDB in my implementation (I should make that clear: I wrote one since submitting this suggestion, and it's currently live and working). Redis/ElastiCache would be ideal but also adds complexity, since VPCs are then required, along with other considerations.

**Costs & Polling/Renewals**

As for the costs of the polling mechanism and the timeouts: yes, the expectation is that these are short-lived situations, 1-5s tops. My implementation uses 500ms polls and 333ms renewals. The lock duration is 1500ms, and each renewal advances it by 333ms. There is also a hard maximum duration of 3000ms in my implementation. When the lock is released, the DynamoDB item is changed to released (a boolean value). In my experience it averages one renewal per request, rarely two (less than 15% of the time). Polling tends to match, of course.

Something else to remember: in this use case, once the Lambda instance that took the initial lock completes, the rest that were polling use its result rather than acquiring the lock - all stop waiting at the next poll interval. Further requests within the 60s window of the rate limit then use that result and never attempt to lock in the first place. In my experience at most 10 instances will be waiting for a lock to release (per customer account) before the data exists and they can proceed. Additional requests find the rate limit data and proceed without ever acquiring a lock - they just increment values until they are over a limit, then return 429. (I know how much capacity each incoming request requires in advance.)

**Timeouts**

As for timeouts, that is the reason for the renewal: the lock is set with a short duration and a timestamp, and renewals advance the timestamp by short intervals. If there is a failure, the timestamp of the last renewal plus the lock duration will soon be in the past, and the lock will be considered stale. I also pass in the getRemainingTimeMs() function from the Lambda context; it is checked at various points in the process and controls some limits (it's also checked during renewal and polling). If it reaches a threshold, a release and cleanup happen. The lock also provides an AbortController() which can signal if renewal goes past its limits, the lock gets snatched in some way, or the remaining-time check trips. Anything depending on the lock can then abort accordingly.

**Theory of Operation**

I use a version ID which is set when the lock is first acquired. When trying to acquire the lock, it is first retrieved and checked for being released (or non-existent, or expired/stale). That existing version identifier (UUID) is used when the acquire-lock call is made, as part of the conditional expression. When an execution acquires the lock, it sets its own version at the same time, blocking any others which had read the same older version UUID but not yet acquired the lock; additional requests will see the new lock, which is unreleased and not stale, and go into waiting as well. When the lock is released, new lock attempts get the UUID set the last time the lock was acquired and released, and the process repeats.

As far as the identifiers, in my use case, the partition key is

The use cases I've come up with have all been about dealing with thundering herds, to keep me from propagating that effect. But that is an inherent issue caused by the platform I am providing a plugin "App" for - which I guess I will go ahead and reveal is monday.com.

I'm sure there are other use cases out there, though - part of why I wanted others to chime in, so we can learn what they might be.
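A hedged sketch of that version-fenced acquire as described above (attribute names and the DynamoDB item shape are assumptions, not the actual implementation):

```python
import time
import uuid

from botocore.exceptions import ClientError

def try_acquire(table, lock_name, seen_version, duration_ms=1500):
    """Attempt to take over a lock last observed with version `seen_version`."""
    new_version = str(uuid.uuid4())
    now = int(time.time() * 1000)
    try:
        table.put_item(
            Item={
                'lock_name': lock_name,
                'version': new_version,
                'expiry': now + duration_ms,
                'released': False,
            },
            # succeed only if the lock is absent, or is still the exact
            # version we read earlier AND is released or stale; racers who
            # read the same old version lose and go back to polling
            ConditionExpression=(
                'attribute_not_exists(lock_name) OR '
                '(version = :seen AND (released = :true OR expiry < :now))'
            ),
            ExpressionAttributeValues={
                ':seen': seen_version,
                ':true': True,
                ':now': now,
            },
        )
        return new_version  # we own the lock; use this version to renew/release
    except ClientError as e:
        if e.response['Error']['Code'] != 'ConditionalCheckFailedException':
            raise
        return None  # someone else won; re-read the item and keep polling
```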
Thanks for explaining these points in such detail @codyfrisch! Let's give other maintainers some time to read this and give their opinions. In the meantime, do you have any public branches or repositories where we can see your implementation? I'd be very interested in reading the code to understand how you're addressing some of the challenges in implementing this.
I airlifted the code out of the repo, but here it is: https://github.com/codyfrisch/distributed-lock. It's ugly. Don't laugh. It's what you get when you're in "just make it work" mode. But if anyone notices anything "wrong", please tell me.
It's awesome that you can share this code with us! I'm 1000% sure that reading the code will help us a lot to understand the implementation and see if it makes sense to incorporate it as a utility. I have one last question: are you using this utility only in Lambda, or on other types of platforms? I didn't see anything in the code that made it super specific to Lambda.
I am only using it in Lambda. To be honest, it could probably be used anywhere. The main reason it's of value for Lambda is that functions are by nature distributed and cannot communicate with each other. In a monolith, this could just be done in memory. A small cluster could elect a leader for this and have the others communicate with it. With Lambda, none of that is an option.

Redis/Valkey/ElastiCache ("serverless") would be my preferred storage layer if I could use it, but that's a future project since I'd need to add a VPC - I just didn't have time for it. I think supporting DynamoDB and some or one of the above would be ideal - an in-memory database would be far faster than DynamoDB and less expensive depending on volume (it isn't at my volume, though). Redis also has some things built in to facilitate this already.

If the comments don't match the code, assume the code is right and the comments just need to be updated or removed.
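For reference, the Redis primitives make this pattern very short; a hedged redis-py sketch (the endpoint is a placeholder, and the token doubles as the owner/version ID):

```python
import uuid

import redis

r = redis.Redis(host="my-elasticache-endpoint", port=6379)  # placeholder host
token = str(uuid.uuid4())

# SET NX PX: create the key only if absent, with a 1500ms TTL
if r.set("lock:thirdparty-api", token, nx=True, px=1500):
    try:
        do_locked_work()  # placeholder for the protected operation
    finally:
        # release only if we still own the lock (atomic check-and-delete)
        release_script = """
        if redis.call('get', KEYS[1]) == ARGV[1] then
            return redis.call('del', KEYS[1])
        end
        return 0
        """
        r.eval(release_script, 1, "lock:thirdparty-api", token)
```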
Hi @codyfrisch! Sorry for the long time without any update here. I've been thinking about this utility since our discussion and I would really like to invest more time in drafting an RFC to cover the important aspects and probably implement it. If you think this makes sense, I would like to schedule a meeting with you to discuss a bit more some use cases I have in mind and how we can draft this RFC.
Hi, we also have a use case for this and would love to see and support an implementation. We would like to rate-limit our customers' use of some features on a customer-by-customer basis. The rate limits provided by services like API Gateway or WAF do not allow us to do this at the code level, which would mean manually maintaining complex firewall rules and API Gateway rate limits for different paths and payloads.

We would commonly do this in Redis/Valkey, which fits the feature set perfectly. However, with a migration to Lambda, we don't want to have to launch functions inside VPCs (network-interface allocation time), and having the Redis cluster exposed externally is not an option for security reasons. We are aiming for an implementation in DynamoDB. If you choose not to pursue a DynamoDB adapter, I'd be happy to look into implementing one.

For the RFC, I think it's important to consider different rate-limiting algorithms (sliding window, etc.). Choosing one for the initial implementation would be a good start, but the code should probably be open to others being added.
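To make the DynamoDB direction concrete, a minimal fixed-window sketch (key schema and attribute names are assumptions; a sliding window would need extra state, and a TTL on the table would clean up old windows):

```python
import time

def consume_request(table, customer_id, limit, window_seconds=60):
    """Count one request against the customer's current fixed window."""
    window = int(time.time()) // window_seconds
    resp = table.update_item(
        Key={'pk': f'ratelimit#{customer_id}#{window}'},
        # atomic counter: ADD creates the item and attribute if missing
        UpdateExpression='ADD request_count :one',
        ExpressionAttributeValues={':one': 1},
        ReturnValues='UPDATED_NEW',
    )
    return int(resp['Attributes']['request_count']) <= limit
```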
Hi @alexbaileyuk! Thanks for sharing your use case here. I see growing demand from customers to move from resource-based rate limiting to code-based rate limiting. Code-based rate limiting offers more granular control, allowing limits to be applied directly within the business logic of the application rather than at the resource level. This is particularly effective in scenarios where APIs handle multiple variables or high-value operations, as it enables precise restrictions tailored to specific workflows or user tiers. In the context of Generative AI, where models are increasingly being called with limited token capacities, code-based rate limiting can help manage resource pressure without outright rejecting requests. For example, it can dynamically adjust limits based on token usage or prioritize certain operations over others.

An important thing I want to highlight is that code-based rate limiting isn't meant to replace resource limits entirely; in my opinion it complements them by adding feature-specific restrictions. I'm more than happy to connect and talk about your ideas for this utility. If that makes sense for you, please let me know and we can schedule a meeting.
@leandrodamascena I completely agree that these would not be a replacement for other types of rate limits, which normally sit higher up the stack in the infrastructure layer. When looking for inspiration for utilities such as this, I often find it useful to draw on similar utilities which have been around for a long time and have been thoroughly put through their paces. In this case, I would probably draw inspiration from https://symfony.com/doc/current/components/lock.html and https://symfony.com/doc/current/rate_limiter.html. Whilst Symfony is written in PHP rather than TypeScript, it is a very well-tested and widely used framework, and is the backbone that other frameworks such as Laravel build on. Based on other similar implementations, I've got some suggestions and ideas on how this could look. What's the best way to connect?
I think something to keep in mind is that the original feature suggestion was for a distributed lock - for example, to lock a certain external API call used to get rate limits (in order to then manage rate limiting itself). This arose because the platform I work with has some complex, multi-faceted rate limits. I have to track them locally because with Lambda functions I can't wait until a rate limit resets on the external API (especially with API Gateway involved: when the rate limits reset at 1-minute intervals, there will be a timeout first). The rate limit tool by itself does also seem useful, but ideally both of these tools would be developed across the various languages to be compatible with each other, so that anyone working with multiple languages in their environment could use them in harmony.
@codyfrisch spot on. There are two tools to develop here. One is the distributed lock and the other is an application-level rate limiter.
I am actually starting to think that, rather than just a distributed lock, there may be something here that's more like a coordination system, with locks being one use case and rate limits, idempotency, etc. built on top of it.
Use case
I believe incorporating a distributed lock client into Powertools would be of value. There are times when multiple events being processed need to update the same resource (which requires making calls to some API to retrieve updated data, and updating a record). The triggering events may not even be related to each other, other than all triggering an update.

A distributed lock would permit one event to perform the needed work and update the record while others wait for it to finish (or skip it entirely). This saves on calls to APIs (rate limits exist), and can save execution time by skipping an update being handled by another instance.
One example is an application that from time to time needs to get updated data for a customer account from a third-party service (the application is a plugin for the service; the data needed is the subscription and account details for the plugin from the third-party marketplace). These updates are needed for execution. One incoming event can take a lock while updating data for the account; other executions wait while the update occurs and check for the lock to be released (the time required for the update is well understood and <1s).

Another is setting up per-minute distributed rate limit tracking: when an event is received, a DynamoDB record is created to track the number of requests received in that minute. Different account tiers have different limits. When a tracking record is needed, an API call must be made to get the account tier and current-minute API usage (if any), as well as to start the API minute timer (so the windows are deterministic). A distributed lock would prevent many incoming events from duplicating this work (and contributing to the very API rate limit being tracked!). The rate tracking record allows additional requests to receive a 429 response, causing the external platform to continue to display "in progress" rather than a false "success" if the event were queued. This also minimizes needless executions that would only discover the rate limits have been reached.
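A hypothetical sketch of this combined pattern, with the lock guarding only the one-time initialization of the tracking record (all names, including the `DistributedLock`-style API, are illustrative):

```python
def allow_request(lock, table, account_id, minute_window):
    key = {'pk': f'usage#{account_id}#{minute_window}'}
    item = table.get_item(Key=key).get('Item')
    if item is None:
        # first event in this window: exactly one execution pays for the
        # external API call and creates the tracking record
        if lock.acquire():
            try:
                tier = fetch_account_tier()  # external API call, done once
                table.put_item(Item={**key, 'tier_limit': tier.limit, 'used': 1})
                return True
            finally:
                lock.release()
        lock.wait_for_release()
        item = table.get_item(Key=key).get('Item')
        if item is None:
            return False  # lock holder failed; fail closed with a 429
    # record exists: increment atomically and compare against the tier limit
    resp = table.update_item(
        Key=key,
        UpdateExpression='ADD #u :one',
        ExpressionAttributeNames={'#u': 'used'},
        ExpressionAttributeValues={':one': 1},
        ReturnValues='UPDATED_NEW',
    )
    return int(resp['Attributes']['used']) <= item['tier_limit']
```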
Solution/User Experience
I would suggest a DynamoDB-based lock, built on conditional expressions much like the Idempotency tool. There is an existing Java-based client in awslabs, https://github.com/awslabs/amazon-dynamodb-lock-client, and at least one defunct TypeScript-based project, https://www.npmjs.com/package/@deliveryhero/dynamodb-lock.

(I am internally developing a DynamoDB-based distributed lock, as I need it, and am open to contributing if I can get authorization and get up to speed on this project. An ElastiCache-based solution would probably be more efficient, but it has higher costs at low volume.)
Alternative solutions
Alternative solutions could use ElastiCache, Valkey, Redis, etc., or there could be multiple provider choices within the module.
Future readers
Please react with 👍 and your use case to help us understand customer demand.