fix: add file-based locking to EventLog.append() #1614

ixchio · 2026-01-06T15:57:40Z

Hey folks! 👋

So I was digging into some weird duplicate index issues and found the culprit - classic race condition in EventLog.append(). When multiple instances hit it at the same time, they both grab the same _length, write to the same file path, and things go sideways fast.

What's changed:

Pulled in filelock for proper cross-process locking
Wrapped the critical section in EventLog.append() with a FileLock
After grabbing the lock, we re-check disk state in case someone else wrote while we waited
Added exists() and get_absolute_path() to FileStore interface (needed for lock file handling)
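
For readers who want to see the lock-then-re-check pattern in isolation, here is a minimal sketch. It is not the code from this PR: `TinyEventLog`, the `event-NNNNN.json` file naming, and the helper `dir_lock` are illustrative assumptions, and the standard library's `fcntl.flock` (POSIX only) stands in for `filelock.FileLock` so the example has no third-party dependency.

```python
import fcntl
import json
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def dir_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path (stand-in for FileLock)."""
    with open(lock_path, "a") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)  # blocks until the lock is free
        try:
            yield
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)


class TinyEventLog:
    """Hypothetical, simplified event log: one JSON file per event."""

    def __init__(self, directory):
        self.directory = directory
        self._lock_path = os.path.join(directory, ".eventlog.lock")
        self._length = self._count_on_disk()

    def _count_on_disk(self):
        return sum(1 for n in os.listdir(self.directory) if n.startswith("event-"))

    def append(self, payload):
        with dir_lock(self._lock_path):
            # Another instance may have written while we waited for the lock,
            # so the cached length cannot be trusted: re-read it from disk.
            self._length = self._count_on_disk()
            index = self._length
            path = os.path.join(self.directory, f"event-{index:05d}.json")
            with open(path, "x") as fh:  # "x" fails loudly on a duplicate index
                json.dump(payload, fh)
            self._length += 1
            return index


workdir = tempfile.mkdtemp()
log_a = TinyEventLog(workdir)
log_b = TinyEventLog(workdir)  # second instance sharing the same directory
indices = [log_a.append({"n": 0}), log_b.append({"n": 1}), log_a.append({"n": 2})]
```

Without the re-read inside the lock, `log_b` would still believe the log is empty and try to write index 0 again, which is the duplicate-index symptom described above.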

Tests:

All existing tests still pass, plus added:

test_event_log_concurrent_append_thread_safety - hammers it with 5 threads doing 10 writes each
test_event_log_concurrent_writes_serialized - two EventLog instances sharing the same dir
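
A rough idea of what a "5 threads, 10 writes each" check can look like is sketched below. The real tests exercise `EventLog`; here `CountingLog` and `hammer` are hypothetical stand-ins so the snippet runs on its own, and the only property asserted is that every index is handed out exactly once.

```python
import threading


class CountingLog:
    """Hypothetical stand-in for EventLog: returns the index given to each append."""

    def __init__(self):
        self._lock = threading.Lock()
        self._events = []

    def append(self, event):
        with self._lock:  # serialize index allocation and the write
            index = len(self._events)
            self._events.append((index, event))
            return index


def hammer(log, n_threads=5, writes_per_thread=10):
    """Append from several threads at once and return every index handed out."""
    seen = []
    seen_lock = threading.Lock()
    start = threading.Barrier(n_threads)  # release all workers together

    def worker(tid):
        start.wait()
        for i in range(writes_per_thread):
            idx = log.append(f"t{tid}-e{i}")
            with seen_lock:
                seen.append(idx)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


indices = hammer(CountingLog())
```

A duplicate or missing index would show up as `sorted(indices)` differing from `list(range(50))`.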

Lmk if anything looks off or if you'd prefer a different approach!

…itions

Fixes OpenHands#1545

When multiple EventLog instances write to the same directory concurrently, they could produce duplicate file indices causing data corruption.

Changes:
- Add filelock dependency for cross-process file locking
- Wrap EventLog.append() in FileLock to serialize concurrent writes
- Re-sync _length from disk after acquiring lock to handle concurrent writes
- Add exists() and get_absolute_path() to FileStore interface
- Add thread safety tests for concurrent append verification

all-hands-bot · 2026-01-08T12:21:10Z

[Automatic Post]: I have assigned @jpshackelford as a reviewer based on git blame information. Thanks in advance for the help!

jpshackelford · 2026-01-08T13:04:03Z

openhands-sdk/pyproject.toml

 dependencies = [
    "deprecation>=2.1.0",
    "fastmcp>=2.11.3",
+    "filelock>=3.15.0",


Vulnerable to CVE-2025-68146: TOCTOU symlink attack vulnerability. Version 3.20.1 is the first patched version. Anyone running on 3.15.x-3.20.0 can have arbitrary files truncated by a local attacker. Pin to >=3.20.1

jpshackelford · 2026-01-08T13:42:58Z

Currently, there's no inter-process locking in the production SDK code. This PR introduces the first such mechanism.

The codebase has a custom FIFOLock implementation that provides:

FIFO ordering (fairness)
Reentrancy
Timeout support

This is used in ConversationState for thread coordination but it does not provide support for inter-process coordination which is required here.

In addition to not working reliably on NFS, flock() has some limitations we need to think about:

| Limitation | Relevant? | Analysis |
| --- | --- | --- |
| No Asyncio Support | ⚠️ Potentially | `LocalConversation` calls `self._state.events.append(e)` synchronously from callbacks. However, `RemoteConversation` uses asyncio. If the codebase ever needs async event persistence, this becomes a blocker. Currently OK since `EventLog` is only used in `LocalConversation`. |
| Thread-Local Behavior | ✅ Yes, but OK | Each `EventLog` instance creates its own `FileLock`. The PR's design means the lock is per-EventLog instance, not shared across threads. This is actually the correct behavior - multiple threads sharing an `EventLog` will correctly serialize through the single lock instance. |
| Lock Files Not Auto-Deleted | ⚠️ Minor concern | `.eventlog.lock` files will persist in every conversation's `events/` directory. For long-running systems with many conversations, these accumulate. Not a functional problem, but adds clutter. Could add cleanup in a future PR. |
| Android/Termux Unsupported | ✅ Not relevant | Probably not an issue though I know of one community member doing something with OpenHands on Android... |
| No Deadlock Detection | ⚠️ Potentially | The code only uses ONE lock (`_lock` in EventLog), so circular deadlocks are impossible within EventLog itself. However, if `EventLog.append()` is called while holding `ConversationState._lock` (the `FIFOLock`), and something inside `append()` tried to acquire `ConversationState._lock`, you'd deadlock. Looking at the code, this doesn't happen currently - `append()` only touches the file store. Low risk but worth noting. |

The Key Question: Is `EventLog.append()` Called from Async Code?

Looking at the call chain:

LocalConversation._default_callback(e)
  → self._state.events.append(e)    # This is EventLog.append()

This callback is invoked synchronously during agent execution. The RemoteConversation has a different path that doesn't use local EventLog at all (it streams to a server).

Verdict: Asyncio limitation is NOT currently blocking, but the architecture should be aware that EventLog is sync-only.
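
If async code ever does need to write events, one common way to call a sync-only `append()` without blocking the event loop is to push it onto a worker thread. The sketch below is an assumption about how that could look, not something proposed in this PR; `SyncLog` and `append_async` are hypothetical names, and the `time.sleep` merely simulates a blocking file-locked write.

```python
import asyncio
import threading
import time


class SyncLog:
    """Hypothetical sync-only log; append() blocks like a file-locked write would."""

    def __init__(self):
        self._lock = threading.Lock()
        self.events = []

    def append(self, event):
        with self._lock:
            time.sleep(0.01)  # simulate blocking lock acquisition plus disk I/O
            self.events.append(event)


async def append_async(log, event):
    # Run the blocking call in a worker thread so the event loop stays responsive.
    await asyncio.to_thread(log.append, event)


async def main():
    log = SyncLog()
    await asyncio.gather(*(append_async(log, i) for i in range(5)))
    return log.events


events = asyncio.run(main())
```

The order in which the five events land is not guaranteed here; only that each one is written exactly once.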

jpshackelford · 2026-01-08T14:00:02Z

If we go with this locking library / mechanism a few things we'd probably want to add:

Consider a timeout on lock acquisition and raising an error rather than silent hang for cases where lock holder itself is stuck.
Fix InMemoryFileStore.get_absolute_path() to include a unique instance identifier. Currently all InMemoryFileStore instances return the same lock file path (/tmp/openhands_inmemory/...), meaning separate in-memory stores across different tests or processes will contend on the same lock. Adding a UUID or similar per-instance makes each store's lock independent.
Log exceptions in _count_events_on_disk() and _sync_from_disk() instead of silently swallowing them. Currently these methods catch all exceptions and return quietly, making debugging difficult. At minimum, FileNotFoundError (expected when directory doesn't exist yet) should be handled separately from unexpected errors, which should be logged with context.
Document the NFS/network filesystem limitation in the class docstring. File locking via flock() does not work reliably on NFS mounts or network filesystems. Users deploying with shared storage should be aware of this.
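
To make the first suggestion concrete, here is a minimal sketch of lock acquisition that gives up after a deadline instead of hanging. With the `filelock` library this is what the `timeout` argument and `filelock.Timeout` exception are for; the sketch below uses stdlib `fcntl.flock` (POSIX only) as a stand-in, and `LockTimeout` and `locked` are hypothetical names.

```python
import fcntl
import os
import tempfile
import time
from contextlib import contextmanager


class LockTimeout(Exception):
    """Raised when the lock is not acquired within the allowed time."""


@contextmanager
def locked(lock_path, timeout=30.0, poll=0.05):
    deadline = time.monotonic() + timeout
    with open(lock_path, "a") as fh:
        while True:
            try:
                # Non-blocking attempt: raises BlockingIOError if already held.
                fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
                break
            except BlockingIOError:
                if time.monotonic() >= deadline:
                    raise LockTimeout(f"could not lock {lock_path} within {timeout}s")
                time.sleep(poll)
        try:
            yield
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)


lock_path = os.path.join(tempfile.mkdtemp(), ".eventlog.lock")
timed_out = False
with locked(lock_path):
    try:
        # A second acquisition through a separate file handle models a stuck holder.
        with locked(lock_path, timeout=0.2):
            pass
    except LockTimeout:
        timed_out = True
```

The caller sees an explicit error naming the lock file, which is much easier to diagnose than a process that silently never returns.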

- Add lock timeout (30s) with proper Timeout exception handling
- Log exceptions in _count_events_on_disk() and _sync_from_disk()
- Separate FileNotFoundError (expected) from unexpected errors
- Document NFS limitation in class docstring
- Pin filelock>=3.20.1 to fix CVE-2025-68146
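
The split between expected and unexpected errors in this commit can be illustrated roughly as follows. This is a sketch under assumptions, not the committed code: `count_events_on_disk` here is a free function over a plain directory, the `event-` prefix is an illustrative naming choice, and returning 0 after an unexpected error is only one possible policy.

```python
import logging
import os
import tempfile

logger = logging.getLogger("eventlog_sketch")


def count_events_on_disk(directory):
    """Return the number of event files, or 0 if it cannot be determined."""
    try:
        names = os.listdir(directory)
    except FileNotFoundError:
        # Expected before the first write: the directory has not been created yet.
        return 0
    except OSError:
        # Unexpected (permissions, I/O error, ...): log with context, do not hide it.
        logger.exception("Failed to list event directory %s", directory)
        return 0
    return sum(1 for name in names if name.startswith("event-"))


base = tempfile.mkdtemp()
missing = count_events_on_disk(os.path.join(base, "does-not-exist"))
open(os.path.join(base, "event-00000.json"), "w").close()
present = count_events_on_disk(base)
```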

ixchio · 2026-01-08T18:40:59Z

@jpshackelford Hey! Just pushed updates addressing all your feedback. Added lock timeout w/ proper error handling, split out FileNotFoundError from the catch-all, added logging, and pinned filelock to 3.20.1 for the CVE.
tests still passing 👍

openhands-sdk/openhands/sdk/io/memory.py

openhands-sdk/openhands/sdk/conversation/event_store.py

ixchio · 2026-01-11T15:54:24Z

Done 👍

ixchio · 2026-01-11T15:58:09Z

Refactored to reuse _scan_and_build_index - see latest commit 👍

tofarr

Great work and a fantastic and needed change!

2 nits which are not blockers - the only thing I would like to see is moving the file lock inside the file store interface so that we don't require it for other implementations (Maybe expose the lock on the file store using __enter__ and __exit__).

A big thank you for taking this on!

openhands-sdk/openhands/sdk/conversation/event_store.py

ixchio · 2026-01-11T16:15:56Z

Refactored locking into FileStore interface with __enter__/__exit__. EventLog now uses with self._fs.lock(...). Added threading lock for InMemoryFileStore to keep tests happy. Ready for another look! 🚀
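
A rough sketch of the shape this refactor describes: the store hands out a context manager, and `EventLog` no longer knows which locking mechanism sits behind it. Everything below is a simplified assumption for illustration; the method names `write` and `list_paths`, the lock path, and the file naming are hypothetical, and the real `FileStore` interface has more to it.

```python
import threading
from abc import ABC, abstractmethod
from contextlib import AbstractContextManager


class FileStore(ABC):
    """Hypothetical, trimmed-down interface; the real one has more methods."""

    @abstractmethod
    def write(self, path: str, contents: str) -> None: ...

    @abstractmethod
    def list_paths(self, prefix: str) -> list[str]: ...

    @abstractmethod
    def lock(self, path: str) -> AbstractContextManager: ...


class InMemoryFileStore(FileStore):
    def __init__(self):
        self._files = {}
        self._lock = threading.RLock()  # threads only; nothing to share across processes

    def write(self, path, contents):
        self._files[path] = contents

    def list_paths(self, prefix):
        return sorted(p for p in self._files if p.startswith(prefix))

    def lock(self, path):
        # threading.RLock already implements __enter__/__exit__.
        return self._lock


class EventLog:
    def __init__(self, fs, directory="events"):
        self._fs = fs
        self._dir = directory

    def append(self, event):
        with self._fs.lock(f"{self._dir}/.eventlog.lock"):
            # Count what the store holds while the lock is held, then write.
            index = len(self._fs.list_paths(f"{self._dir}/event-"))
            self._fs.write(f"{self._dir}/event-{index:05d}.json", event)
            return index


fs = InMemoryFileStore()
first, second = EventLog(fs), EventLog(fs)  # two logs sharing one store
indices = [first.append("a"), second.append("b"), first.append("c")]
```

A disk-backed store could return a file lock from `lock()` instead, so only stores that need cross-process coordination pay for it.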

tofarr

Great Work! 🍰

ixchio · 2026-01-12T16:15:10Z

Glad we got this one across the finish line! 🍰 I think I’ve officially caught the OpenHands bug—that’s 5 PRs down and I’m just getting warmed up. Great working with you all on the technical deep-dives; let’s keep this momentum going! 💪

ixchio added 2 commits January 6, 2026 21:23

Merge branch 'main' into fix-eventlog-race-condition

3b2702b

ixchio closed this Jan 6, 2026

ixchio deleted the fix-eventlog-race-condition branch January 6, 2026 15:58

ixchio restored the fix-eventlog-race-condition branch January 6, 2026 15:59

ixchio reopened this Jan 6, 2026

ixchio added 3 commits January 6, 2026 21:29

Merge branch 'main' into fix-eventlog-race-condition

1278c4f

fix: use BaseFileLock type annotation for pyright

971c4b9

Merge branch 'main' into fix-eventlog-race-condition

8be5566

all-hands-bot requested a review from jpshackelford January 8, 2026 12:21

Merge branch 'main' into fix-eventlog-race-condition

509a39c

jpshackelford removed their request for review January 8, 2026 12:46

jpshackelford requested changes Jan 8, 2026

jpshackelford requested a review from tofarr January 8, 2026 14:00

ixchio added 2 commits January 9, 2026 00:09

Merge branch 'main' into fix-eventlog-race-condition

4bd0b41

ixchio requested a review from jpshackelford January 8, 2026 18:41

tofarr reviewed Jan 11, 2026

openhands-sdk/openhands/sdk/io/memory.py (outdated)

tofarr reviewed Jan 11, 2026

openhands-sdk/openhands/sdk/io/memory.py (outdated)

use uuid hex + fstrings per review

e04e2f6

tofarr reviewed Jan 11, 2026

openhands-sdk/openhands/sdk/conversation/event_store.py

ixchio added 2 commits January 11, 2026 21:26

Merge branch 'main' into fix-eventlog-race-condition

011b8f7

refactor: reuse _scan_and_build_index in sync

780281d

ixchio requested a review from tofarr January 11, 2026 16:00

fix: remove unused variable

7783bfa

tofarr requested changes Jan 11, 2026

openhands-sdk/openhands/sdk/conversation/event_store.py (outdated)

refactor: move locking into FileStore interface

4855b3f

ixchio requested a review from tofarr January 11, 2026 16:17

tofarr approved these changes Jan 11, 2026

Merge branch 'main' into fix-eventlog-race-condition

6d0eb21

jpshackelford approved these changes Jan 12, 2026

Merge branch 'main' into fix-eventlog-race-condition

2f1d438

tofarr enabled auto-merge (squash) January 12, 2026 16:14

tofarr merged commit 397ab08 into OpenHands:main Jan 12, 2026
14 checks passed


4 participants
