-
Notifications
You must be signed in to change notification settings - Fork 106
fix: add file-based locking to EventLog.append() #1614
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…itions Fixes OpenHands#1545 When multiple EventLog instances write to the same directory concurrently, they could produce duplicate file indices causing data corruption. Changes: - Add filelock dependency for cross-process file locking - Wrap EventLog.append() in FileLock to serialize concurrent writes - Re-sync _length from disk after acquiring lock to handle concurrent writes - Add exists() and get_absolute_path() to FileStore interface - Add thread safety tests for concurrent append verification
|
[Automatic Post]: I have assigned @jpshackelford as a reviewer based on git blame information. Thanks in advance for the help! |
openhands-sdk/pyproject.toml
Outdated
| dependencies = [ | ||
| "deprecation>=2.1.0", | ||
| "fastmcp>=2.11.3", | ||
| "filelock>=3.15.0", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Vulnerable to CVE-2025-68146: TOCTOU symlink attack vulnerability. Version 3.20.1 is the first patched version. Anyone running on 3.15.x-3.20.0 can have arbitrary files truncated by a local attacker. Pin to >=3.20.1
|
Currently, there's no inter-process locking in the production SDK code. This PR introduces the first such mechanism. The codebase has a custom FIFOLock implementation that provides:
This is used in ConversationState for thread coordination but it does not provide support for inter-process coordination which is required here. In addition to not working reliably on NFS, flock() has some limitations we need to think about:
The Key Question: Is
|
|
If we go with this locking library / mechanism a few things we'd probably want to add:
|
- Add lock timeout (30s) with proper Timeout exception handling - Log exceptions in _count_events_on_disk() and _sync_from_disk() - Separate FileNotFoundError (expected) from unexpected errors - Document NFS limitation in class docstring - Pin filelock>=3.20.1 to fix CVE-2025-68146
|
@jpshackelford Hey! Just pushed updates addressing all your feedback. Added lock timeout w/ proper error handling, split out FileNotFoundError from the catch-all, added logging, and pinned filelock to 3.20.1 for the CVE. |
|
Done 👍 |
|
Refactored to reuse _scan_and_build_index - see latest commit 👍 |
tofarr
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work and a fantastic and needed change!
2 nits which are not blockers - the only thing I would like to see is moving the file lock inside the file store interface so that we don't require it for other implementations (Maybe expose the lock on the file store using __enter__ and __exit__).
A big thank you for taking this on!
|
Refactored locking into FileStore interface with |
tofarr
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great Work! 🍰
|
Glad we got this one across the finish line! 🍰 I think I’ve officially caught the OpenHands bug—that’s 5 PRs down and I’m just getting warmed up. Great working with you all on the technical deep-dives; let’s keep this momentum going! 💪 |
Fixes #1545
Hey folks! 👋
So I was digging into some weird duplicate index issues and found the culprit - classic race condition in
EventLog.append(). When multiple instances hit it at the same time, they both grab the same_length, write to the same file path, and things go sideways fast.What's changed:
filelockfor proper cross-process lockingEventLog.append()with aFileLockexists()andget_absolute_path()toFileStoreinterface (needed for lock file handling)Tests:
All existing tests still pass, plus added:
test_event_log_concurrent_append_thread_safety- hammers it with 5 threads doing 10 writes eachtest_event_log_concurrent_writes_serialized- two EventLog instances sharing the same dirLmk if anything looks off or if you'd prefer a different approach!