Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question: Why no explanation of remote failures in DistributedLock.java #238

Open
simbo1905 opened this issue Dec 7, 2024 · 0 comments
Open
Labels
bug Something isn't working

Comments

@simbo1905
Copy link

simbo1905 commented Dec 7, 2024

Hello 👋🏻

Reading the code of com.coditory.sherlock. DistributedLock it is very careful to validate the application input but says nothing about what might happen if the remote calls to the external store (sql or mongo) fails.

We might expect that under adverse conditions, the developers using the library will see IOExceptions due to transient network errors, timeouts, and the database running into issues like running out of space.

I recognise that using checked exceptions is less than ideal to document problems yet it does force the user of your API to understand that a remote method invocation is inherently perilous.

At the very least, a Java.io.RuntimeIOException being documented as being thrown on the methods that make remote calls would be helpful.

Beyond that, the API documenting error types are being thrown for unrecoverable database errors would be useful.

Take for example this example code on the readme:

// this may or may not make a call to the database so it may or may not throw...?
Sherlock sherlock = MongoSherlock.create(collection);

// this may or may not make a call to the database so it may or may not throw...?
DistributedLock lock = sherlock.createLock("sample-lock");

// obviously this might throw due to not being able to connect to the database. But what does it throw? And can it leave the lock as completely broken if the database runs out of space? Does it return immediately (I think so the documentation doesn't say)? 
lock.runLocked(() -> System.out.println("Lock granted!"));

As per the comments it's not really apparently what may happen.

Historically making a remote procedure call look like a local call is a really bad idea. They are orders of magnitude slower and orders of magnitude more likely to fail.

As the expected behaviour of failures isn't documented it isn't clear how to write safe code that uses this library. More troubling is that there are multiple implementations: SQL and Mongo. As it isn't documented what the failure modes are to clients there are no constraints on the failure modes of any implementations. They are free to fail in any manner they wish without documenting it.

This basically implies that the API provided by this interface might be awesome today and fail one way, but any implementation completely free to be optimised tomorrow in a way that fails in different manner.

I would expect clear documentation that which methods do make remote calls (maybe all of them?) and clear documentation as to what to do when they throw (eg simply retry) and some explanation in the docs that "all valid implementations use transactions else atomic operations that are native to the data store such that network error can never leave the lock state in the database as corrupted".

I think that might be the case, so hopefully, this is just a document matter.

Finally, the IO back to the JVM might fail when we have actually acquired the lock. Consider:

  • We make a call to obtain a lock.
  • The database makes the lock durable.
  • The network stutters, and we get no response.
  • Subsequent connections to the database work as we are using a database connection pool, and the next call on a different connection works.

We now have a situation where we have acquired the lock, but the documentation has not told me if when I am using a non-reentrant lock that has thrown an IO type error that it should be safe to reattempt. The implementation is then free to have a bug to disallow me to attempt to retry to acquire a lock I may have, but I dont know.

Finally, I would like to congratulate you all on having a successful recent 1.0 release.

May I gently suggest a 2.0 that is the same server logic, but with a client facing API that is very rigorous about explaining how to handle the not happy path scenarios, that is a contract that all implementations of the API must comply with.

@simbo1905 simbo1905 added the bug Something isn't working label Dec 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant