improvements on concurrent scenarios #137

amaximciuc · 2025-04-03T18:22:00Z

(part 2/2 of #12; to be merged after #136)

FSBucket now writes to a temp dir first and then renames atomically; this fixes the get_size issue in #136

fixes an issue where the main content was fetched more than once when the cache is used in concurrent gets

refactored the locking mechanism

added documentation

#10

- FSBucket now creates a temp dir in the root dir, uses it for temp files and locks, and ensures it's not listed in the object listings
- atomic FSBucket.put_object(): the content is first stored in a temp file (in the same root dir), which is then renamed atomically
- prefix validations
- added documentation
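
For readers unfamiliar with the write-then-rename pattern described in the list above, here is a minimal sketch of it. It is not the FSBucket code from this PR: `atomic_write`, `list_objects` and the `.tmp` directory name are hypothetical names chosen for illustration. The rename is atomic only when the temp file and the target are on the same filesystem, which is why the temp dir lives under the root.

```python
import os
import tempfile
from pathlib import Path


def atomic_write(root: Path, name: str, content: bytes, temp_dir_name: str = ".tmp") -> Path:
    """Write content to root/name so that readers never observe a partial file."""
    temp_dir = root / temp_dir_name  # hypothetical name; kept under root so the rename stays on one filesystem
    temp_dir.mkdir(parents=True, exist_ok=True)
    target = root / name
    target.parent.mkdir(parents=True, exist_ok=True)

    fd, temp_path = tempfile.mkstemp(dir=temp_dir)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(content)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before publishing the file
        os.replace(temp_path, target)  # single rename step; the target is either absent/old or complete
    except BaseException:
        # On any failure, remove the leftover temp file (if it was not renamed yet).
        try:
            os.unlink(temp_path)
        except FileNotFoundError:
            pass
        raise
    return target


def list_objects(root: Path, temp_dir_name: str = ".tmp") -> list[str]:
    """List object names under root, skipping everything inside the temp dir."""
    names = []
    for path in root.rglob("*"):
        if path.is_file():
            rel = path.relative_to(root)
            if rel.parts[0] != temp_dir_name:
                names.append(rel.as_posix())
    return sorted(names)
```

Because the file only appears under its final name once it is fully written, a size query on the target cannot see a half-written object.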

ensure the content from base bucket is fetched only once. #10
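
The "fetched only once" guarantee can be illustrated with a per-name lock and a second cache check after the lock is taken. This is only a sketch under simplifying assumptions (an in-memory dict as the cache, threads within one process); `FetchOnceCache` and `fetch_from_main` are hypothetical names, not the classes touched by this PR.

```python
import threading
from collections import defaultdict


class FetchOnceCache:
    """In-memory stand-in for a cache in front of a slow main store."""

    def __init__(self, fetch_from_main):
        self._fetch_from_main = fetch_from_main  # callable(name) -> content
        self._cache = {}
        self._locks = defaultdict(threading.Lock)  # one lock per object name
        self._locks_guard = threading.Lock()       # protects the lock table itself

    def _lock_for(self, name):
        with self._locks_guard:
            return self._locks[name]

    def get(self, name):
        # Fast path: no locking once the object is cached.
        try:
            return self._cache[name]
        except KeyError:
            pass
        with self._lock_for(name):
            # Re-check under the lock: another thread may have filled the
            # cache while this one was waiting.
            if name in self._cache:
                return self._cache[name]
            content = self._fetch_from_main(name)
            self._cache[name] = content
            return content
```

Threads asking for different names do not block each other, while threads asking for the same name wait for the first fetch and then read from the cache.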

asuiu · 2025-04-04T11:33:55Z

[esamTrade] Tests that have issues

asuiu · 2025-04-06T17:12:38Z

python/tests/test_namedlock.py

+        lock2 = self.lock_manager.get_lock("test_lock2")
+        self.assertIsNot(lock1, lock2)
+
+    def test_lock_actually_locks(self):


This one seems redundant, since it only tests whether the existing threading.Lock actually works.
It wouldn't hurt if it didn't take so long to run.
Be aware: unit tests should be lightning fast! Test running time also matters a great deal on big projects (for some repositories, inefficient tests cost tens of millions of dollars).
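
If a check like this is kept anyway, one way to make it fast is to avoid threads and sleeps altogether and use non-blocking acquires. A rough sketch, assuming the object returned by get_lock() behaves like a plain threading.Lock (a bare threading.Lock is used here as a stand-in):

```python
import threading
import unittest


class NonBlockingLockTest(unittest.TestCase):
    def test_lock_excludes_second_acquirer(self):
        lock = threading.Lock()  # stand-in for whatever get_lock() returns
        self.assertTrue(lock.acquire(blocking=False))
        try:
            # threading.Lock is not reentrant, so a second non-blocking
            # attempt fails immediately instead of waiting.
            self.assertFalse(lock.acquire(blocking=False))
        finally:
            lock.release()
        # Once released, the lock can be taken again.
        self.assertTrue(lock.acquire(blocking=False))
        lock.release()
```

Nothing in it waits on a timer, so the run time does not depend on any sleep duration.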

asuiu

I have fixed all the noted issues in a separate PR: #139

so we can close this one in favor of #139, but I've added the comments here just for tracking and learning purposes.

asuiu · 2025-04-06T17:13:38Z

python/tests/test_namedlock.py

+        name2 = "path\\with\\backslashes"
+
+        lock1 = self.lock_manager.get_lock(name1)
+        lock2 = self.lock_manager.get_lock(name2)


Here's a BUG: get_lock() should raise an exception, since a path with backslashes is invalid!
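
A sketch of the expected behaviour, with the validation done before any lock is created. `ValidatingLockManager` is a hypothetical stand-in and the choice of ValueError is an assumption; the real lock manager may use a different exception type or stricter name rules.

```python
import threading


class ValidatingLockManager:
    """Hands out one lock per name and rejects names containing backslashes."""

    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()

    def get_lock(self, name: str) -> threading.Lock:
        if "\\" in name:
            # Assumption: object names are POSIX-style, so a backslash is never valid.
            raise ValueError(f"Invalid lock name {name!r}: backslashes are not allowed")
        with self._guard:
            # setdefault returns the existing lock for a known name.
            return self._locks.setdefault(name, threading.Lock())
```

The corresponding test would then use assertRaises for the backslash name instead of asserting that a lock is returned.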

asuiu · 2025-04-06T17:17:38Z

python/bucketbase/cached_immutable_bucket.py

        self._cache = cache
        self._main = main
+        self._lock_manager = lock_manager or namedlock.ThreadLockManager()


I'd say there's a bug here: using a threading synchronization mechanism makes this class fail when the cache is shared by multiple processes, which is our main scenario (using the file system as the cache). bucketbase.cached_immutable_bucket.CachedImmutableBucket.build_from_fs() leaves the lock manager at its default value, so this implementation will definitely fail in a real-world scenario.
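
To illustrate what a lock visible to other processes could look like, here is a rough sketch based on exclusive creation of a lock file. It is not what this PR or #139 implements; `file_lock`, its parameters and the polling strategy are assumptions made for the example. A known limitation: if the holding process crashes, the lock file stays behind and has to be cleaned up by other means.

```python
import os
import time
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def file_lock(lock_path: Path, timeout: float = 10.0, poll_interval: float = 0.01):
    """Hold an inter-process lock represented by the existence of lock_path."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_CREAT | O_EXCL fails if the file already exists, so only one
            # process can create it at a time.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"Could not acquire {lock_path} within {timeout}s")
            time.sleep(poll_interval)
    try:
        yield
    finally:
        os.close(fd)
        os.unlink(lock_path)  # releasing the lock means removing the file
```

Unlike a threading.Lock, the state lives on the file system, so two processes pointing at the same cache directory contend for the same lock.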

asuiu · 2025-04-06T17:22:19Z

python/bucketbase/ibucket.py

@@ -311,6 +327,8 @@ def put_object(self, name: PurePosixPath | str, content: Union[str, bytes, bytea
        :raises io.UnsupportedOperation: If the object already exists
        """
        self._lock_object(name)
+        if self._base_bucket.exists(name):
+            raise io.UnsupportedOperation(f"Object {name} already exists in AppendOnlySynchronizedBucket")


Here's a BUG: the lock is acquired on the previous line but is never released when this exception is raised, so it stays held.
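
The usual way to avoid this is to make the release unconditional, for example with a `with` block (or try/finally) around everything that runs while the lock is held. A simplified sketch: `AppendOnlySketch` is a hypothetical stand-in that uses one lock and an in-memory dict instead of per-object locks and a base bucket.

```python
import io
import threading


class AppendOnlySketch:
    """Stand-in for an append-only bucket; not the real AppendOnlySynchronizedBucket."""

    def __init__(self):
        self._objects = {}
        self._lock = threading.Lock()  # simplification: one lock instead of one per object

    def put_object(self, name, content):
        # The with-block releases the lock on every exit path, including
        # the early raise for an already existing object.
        with self._lock:
            if name in self._objects:
                raise io.UnsupportedOperation(f"Object {name} already exists")
            self._objects[name] = content
```

With this shape, a rejected put leaves the lock free for the next caller.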

asuiu · 2025-04-06T17:22:35Z

python/bucketbase/ibucket.py

@@ -328,6 +346,8 @@ def put_object_stream(self, name: PurePosixPath | str, stream: BinaryIO) -> None
        :raises IOError: If stream operations fail
        """
        self._lock_object(name)
+        if self._base_bucket.exists(name):
+            raise io.UnsupportedOperation(f"Object {name} already exists in AppendOnlySynchronizedBucket")


Here's a BUG: the lock is acquired on the previous line but is never released when this exception is raised, so it stays held.

asuiu · 2025-04-06T17:23:29Z