@ixchio
Contributor

@ixchio ixchio commented Jan 6, 2026

Fixes #1545

Hey folks! 👋

So I was digging into some weird duplicate index issues and found the culprit - classic race condition in EventLog.append(). When multiple instances hit it at the same time, they both grab the same _length, write to the same file path, and things go sideways fast.

What's changed:

  • Pulled in filelock for proper cross-process locking
  • Wrapped the critical section in EventLog.append() with a FileLock
  • After grabbing the lock, we re-check disk state in case someone else wrote while we waited
  • Added exists() and get_absolute_path() to FileStore interface (needed for lock file handling)
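
A minimal sketch of the lock-then-re-check pattern from the list above, not the PR's actual code. `SketchEventLog`, the `event-*.json` naming, and the per-directory `threading.Lock` registry are all assumptions made so the example runs with only the standard library; the PR itself uses `filelock.FileLock`, which also serializes separate processes.

```python
import json
import tempfile
import threading
from collections import defaultdict
from pathlib import Path

# Stand-in for a cross-process file lock: one lock per directory, shared by
# every log instance in this process. The PR uses filelock.FileLock instead.
_DIR_LOCKS: dict[str, threading.Lock] = defaultdict(threading.Lock)


class SketchEventLog:
    """Hypothetical, simplified event log: one JSON file per event."""

    def __init__(self, directory: Path) -> None:
        self._dir = directory
        self._dir.mkdir(parents=True, exist_ok=True)
        self._length = self._count_on_disk()

    def _count_on_disk(self) -> int:
        return sum(1 for _ in self._dir.glob("event-*.json"))

    def append(self, payload: dict) -> int:
        with _DIR_LOCKS[str(self._dir.resolve())]:
            # Another instance may have written while we waited for the lock,
            # so the cached length cannot be trusted; re-read it from disk.
            self._length = self._count_on_disk()
            index = self._length
            path = self._dir / f"event-{index:05d}.json"
            path.write_text(json.dumps(payload))
            self._length += 1
            return index


with tempfile.TemporaryDirectory() as tmp:
    first = SketchEventLog(Path(tmp))
    second = SketchEventLog(Path(tmp))  # same directory, separate cached length
    indices = [
        first.append({"n": 0}),
        second.append({"n": 1}),
        first.append({"n": 2}),
    ]
    files_written = len(list(Path(tmp).glob("event-*.json")))
```

Without the re-count inside the lock, `second` would still believe the directory is empty after `first` wrote, and both would target index 0.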

Tests:

All existing tests still pass, plus added:

  • test_event_log_concurrent_append_thread_safety - hammers it with 5 threads doing 10 writes each
  • test_event_log_concurrent_writes_serialized - two EventLog instances sharing the same dir
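
For readers who want the shape of such a test, here is a rough sketch of a "hammer it from several threads" harness. It is not the test from the PR: `hammer` and `LockedCounter` are hypothetical names, and the subject under test is a toy counter rather than the real EventLog. The 5 threads and 10 writes per thread match the numbers quoted above.

```python
import threading


def hammer(append, threads: int = 5, writes_per_thread: int = 10) -> list[int]:
    """Call append() from several threads and collect every returned index."""
    results: list[int] = []
    results_lock = threading.Lock()

    def worker() -> None:
        for _ in range(writes_per_thread):
            index = append()
            with results_lock:
                results.append(index)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for thread in pool:
        thread.start()
    for thread in pool:
        thread.join()
    return results


class LockedCounter:
    """Toy subject under test: hands out indices under a lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._length = 0

    def append(self) -> int:
        with self._lock:
            index = self._length
            self._length += 1
            return index


indices = hammer(LockedCounter().append)
```

The property worth asserting is that every index appears exactly once, i.e. the sorted results equal `range(threads * writes_per_thread)`.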

Lmk if anything looks off or if you'd prefer a different approach!

ixchio added 2 commits January 6, 2026 21:23
…itions

Fixes OpenHands#1545

When multiple EventLog instances write to the same directory concurrently,
they could produce duplicate file indices causing data corruption.

Changes:
- Add filelock dependency for cross-process file locking
- Wrap EventLog.append() in FileLock to serialize concurrent writes
- Re-sync _length from disk after acquiring lock to handle concurrent writes
- Add exists() and get_absolute_path() to FileStore interface
- Add thread safety tests for concurrent append verification
@ixchio ixchio closed this Jan 6, 2026
@ixchio ixchio deleted the fix-eventlog-race-condition branch January 6, 2026 15:58
@ixchio ixchio restored the fix-eventlog-race-condition branch January 6, 2026 15:59
@ixchio ixchio reopened this Jan 6, 2026
@all-hands-bot
Collaborator

[Automatic Post]: I have assigned @jpshackelford as a reviewer based on git blame information. Thanks in advance for the help!

@jpshackelford jpshackelford removed their request for review January 8, 2026 12:46
dependencies = [
"deprecation>=2.1.0",
"fastmcp>=2.11.3",
"filelock>=3.15.0",
Contributor

Vulnerable to CVE-2025-68146: TOCTOU symlink attack vulnerability. Version 3.20.1 is the first patched version. Anyone running on 3.15.x-3.20.0 can have arbitrary files truncated by a local attacker. Pin to >=3.20.1
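
As a quick illustration of what the pin excludes, here is a small sketch that checks a version string against the first patched release named in this comment. `parse_version` and `is_patched` are hypothetical helpers, and the parser only handles plain dotted releases (no pre-release or local tags).

```python
FIRST_PATCHED = (3, 20, 1)  # first patched filelock release, per the comment above


def parse_version(text: str) -> tuple[int, ...]:
    """Parse a plain dotted release such as '3.20.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def is_patched(version: str) -> bool:
    """True if `version` is at or above the first patched release."""
    return parse_version(version) >= FIRST_PATCHED


checks = {v: is_patched(v) for v in ("3.15.0", "3.20.0", "3.20.1", "3.21.0")}
```

In a real environment the installed version could be read with `importlib.metadata.version("filelock")` and passed to `is_patched`.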

@jpshackelford
Contributor

jpshackelford commented Jan 8, 2026

Currently, there's no inter-process locking in the production SDK code. This PR introduces the first such mechanism.

The codebase has a custom FIFOLock implementation that provides:

  • FIFO ordering (fairness)
  • Reentrancy
  • Timeout support

This is used in ConversationState for thread coordination, but it does not support inter-process coordination, which is what this fix requires.

In addition to not working reliably on NFS, flock() has some limitations we need to think about:

| Limitation | Relevant? | Analysis |
| --- | --- | --- |
| No Asyncio Support | ⚠️ Potentially | LocalConversation calls self._state.events.append(e) synchronously from callbacks. However, RemoteConversation uses asyncio. If the codebase ever needs async event persistence, this becomes a blocker. Currently OK since EventLog is only used in LocalConversation. |
| Thread-Local Behavior | Yes, but OK | Each EventLog instance creates its own FileLock. The PR's design means the lock is per-EventLog instance, not shared across threads. This is actually the correct behavior - multiple threads sharing an EventLog will correctly serialize through the single lock instance. |
| Lock Files Not Auto-Deleted | ⚠️ Minor concern | .eventlog.lock files will persist in every conversation's events/ directory. For long-running systems with many conversations, these accumulate. Not a functional problem, but adds clutter. Could add cleanup in a future PR. |
| Android/Termux Unsupported | Not relevant | Probably not an issue though I know of one community member doing something with OpenHands on Android... |
| No Deadlock Detection | ⚠️ Potentially | The code only uses ONE lock (_lock in EventLog), so circular deadlocks are impossible within EventLog itself. However, if EventLog.append() is called while holding ConversationState._lock (the FIFOLock), and something inside append() tried to acquire ConversationState._lock, you'd deadlock. Looking at the code, this doesn't happen currently - append() only touches the file store. Low risk but worth noting. |

The Key Question: Is EventLog.append() Called from Async Code?

Looking at the call chain:

LocalConversation._default_callback(e)
  → self._state.events.append(e)    # This is EventLog.append()

This callback is invoked synchronously during agent execution. The RemoteConversation has a different path that doesn't use local EventLog at all (it streams to a server).

Verdict: Asyncio limitation is NOT currently blocking, but the architecture should be aware that EventLog is sync-only.
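
If async callers ever do need to append, one common option is to push the blocking call onto a worker thread. The sketch below assumes a hypothetical `BlockingLog` class standing in for a sync-only EventLog; it illustrates the general technique with `asyncio.to_thread` and is not something this PR implements.

```python
import asyncio
import threading
import time


class BlockingLog:
    """Hypothetical sync-only log whose append() may block on a lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.events: list[str] = []

    def append(self, event: str) -> None:
        with self._lock:
            time.sleep(0.01)  # pretend to wait on the file lock and the disk
            self.events.append(event)


async def main() -> list[str]:
    log = BlockingLog()
    # Offload each blocking append to a worker thread so the event loop
    # keeps running while the lock is held.
    await asyncio.gather(
        *(asyncio.to_thread(log.append, f"e{i}") for i in range(3))
    )
    return sorted(log.events)


recorded = asyncio.run(main())
```

Calling `log.append(...)` directly inside a coroutine would still work, but it would stall every other task on the loop for as long as the lock is contended.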

@jpshackelford
Contributor

jpshackelford commented Jan 8, 2026

If we go with this locking library / mechanism a few things we'd probably want to add:

  • Consider a timeout on lock acquisition that raises an error instead of hanging silently when the lock holder itself is stuck.

  • Fix InMemoryFileStore.get_absolute_path() to include a unique instance identifier. Currently all InMemoryFileStore instances return the same lock file path (/tmp/openhands_inmemory/...), meaning separate in-memory stores across different tests or processes will contend on the same lock. Adding a UUID or similar per-instance makes each store's lock independent.

  • Log exceptions in _count_events_on_disk() and _sync_from_disk() instead of silently swallowing them. Currently these methods catch all exceptions and return quietly, making debugging difficult. At minimum, FileNotFoundError (expected when directory doesn't exist yet) should be handled separately from unexpected errors, which should be logged with context.

  • Document the NFS/network filesystem limitation in the class docstring. File locking via flock() does not work reliably on NFS mounts or network filesystems. Users deploying with shared storage should be aware of this.
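
To make the exception-handling suggestion concrete, here is a rough sketch of splitting the expected case from the unexpected one. `count_events_on_disk` is a hypothetical free function, not the real `_count_events_on_disk()` method, and returning 0 after logging an unexpected error is one possible choice rather than the PR's confirmed behavior.

```python
import logging
import os
import tempfile

logger = logging.getLogger("eventlog_sketch")


def count_events_on_disk(directory: str) -> int:
    """Count .json event files in `directory` (hypothetical helper)."""
    try:
        names = os.listdir(directory)
    except FileNotFoundError:
        # Expected before the first event is written: nothing to count yet.
        return 0
    except OSError:
        # Unexpected (permissions, I/O error): record the traceback and the
        # directory so the failure is visible, then fall back to zero.
        logger.exception("Failed to list events in %s", directory)
        return 0
    return sum(1 for name in names if name.endswith(".json"))


with tempfile.TemporaryDirectory() as tmp:
    missing = count_events_on_disk(os.path.join(tmp, "does-not-exist"))
    for name in ("a.json", "b.json", ".eventlog.lock"):
        open(os.path.join(tmp, name), "w").close()
    present = count_events_on_disk(tmp)
```

The lock file is deliberately left out of the count, since it lives in the same directory as the events.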

@jpshackelford jpshackelford requested a review from tofarr January 8, 2026 14:00
ixchio added 2 commits January 9, 2026 00:09
- Add lock timeout (30s) with proper Timeout exception handling
- Log exceptions in _count_events_on_disk() and _sync_from_disk()
- Separate FileNotFoundError (expected) from unexpected errors
- Document NFS limitation in class docstring
- Pin filelock>=3.20.1 to fix CVE-2025-68146
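
A minimal sketch of the timeout behavior described in the first commit line, using a `threading.Lock` as a stand-in so it runs with the standard library alone. `LockTimeout` and `append_with_timeout` are hypothetical names; with filelock, the equivalent is passing `timeout=` to `FileLock` and catching `filelock.Timeout`.

```python
import threading

LOCK_TIMEOUT_SECONDS = 30.0  # the 30s value comes from the commit message


class LockTimeout(Exception):
    """Raised when the lock cannot be acquired in time (hypothetical)."""


def append_with_timeout(lock, write, timeout: float = LOCK_TIMEOUT_SECONDS):
    """Run write() under `lock`, failing loudly instead of hanging forever."""
    if not lock.acquire(timeout=timeout):
        raise LockTimeout(f"could not acquire lock within {timeout}s")
    try:
        return write()
    finally:
        lock.release()


lock = threading.Lock()
ok = append_with_timeout(lock, lambda: "written")

lock.acquire()  # simulate a stuck lock holder
try:
    append_with_timeout(lock, lambda: "never", timeout=0.05)
    timed_out = False
except LockTimeout:
    timed_out = True
finally:
    lock.release()
```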
@ixchio
Contributor Author

ixchio commented Jan 8, 2026

@jpshackelford Hey! Just pushed updates addressing all your feedback. Added lock timeout w/ proper error handling, split out FileNotFoundError from the catch-all, added logging, and pinned filelock to 3.20.1 for the CVE.
tests still passing 👍

@ixchio ixchio requested a review from jpshackelford January 8, 2026 18:41
@ixchio
Contributor Author

ixchio commented Jan 11, 2026

Done 👍

@ixchio
Contributor Author

ixchio commented Jan 11, 2026

Refactored to reuse _scan_and_build_index - see latest commit 👍

@ixchio ixchio requested a review from tofarr January 11, 2026 16:00
Collaborator

@tofarr tofarr left a comment

Great work and a fantastic and needed change!

2 nits which are not blockers - the only thing I would like to see is moving the file lock inside the file store interface so that we don't require it for other implementations (Maybe expose the lock on the file store using __enter__ and __exit__).

A big thank you for taking this on!

@ixchio
Contributor Author

ixchio commented Jan 11, 2026

Refactored locking into FileStore interface with __enter__/__exit__. EventLog now uses with self._fs.lock(...). Added threading lock for InMemoryFileStore to keep tests happy. Ready for another look! 🚀
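
Roughly, the refactor could look like the sketch below. Every class and method name here other than `lock` and the `.eventlog.lock` file name is an assumption for illustration: the real FileStore interface has more methods, and the merged code may differ in detail.

```python
import threading
from abc import ABC, abstractmethod
from contextlib import AbstractContextManager


class SketchFileStore(ABC):
    """Hypothetical, trimmed-down file store interface."""

    @abstractmethod
    def write(self, path: str, contents: str) -> None: ...

    @abstractmethod
    def count(self, prefix: str) -> int: ...

    @abstractmethod
    def lock(self, path: str) -> AbstractContextManager:
        """Return a context manager guarding writes associated with `path`."""


class InMemorySketchStore(SketchFileStore):
    """In-memory store: a per-instance threading lock is enough here."""

    def __init__(self) -> None:
        self._files: dict[str, str] = {}
        self._lock = threading.RLock()  # not shared with other store instances

    def write(self, path: str, contents: str) -> None:
        self._files[path] = contents

    def count(self, prefix: str) -> int:
        return sum(1 for name in self._files if name.startswith(prefix))

    def lock(self, path: str) -> AbstractContextManager:
        return self._lock  # RLock already implements __enter__/__exit__


class SketchEventLog:
    LOCK_NAME = ".eventlog.lock"

    def __init__(self, fs: SketchFileStore, directory: str = "events") -> None:
        self._fs = fs
        self._dir = directory

    def append(self, payload: str) -> int:
        # The log no longer knows how locking works; it only asks the store.
        with self._fs.lock(f"{self._dir}/{self.LOCK_NAME}"):
            index = self._fs.count(f"{self._dir}/event-")
            self._fs.write(f"{self._dir}/event-{index:05d}.json", payload)
            return index


store = InMemorySketchStore()
log_a, log_b = SketchEventLog(store), SketchEventLog(store)
indices = [log_a.append("x"), log_b.append("y"), log_a.append("z")]
```

A disk-backed store would return a file-based lock from `lock()` instead, so only that implementation needs the filelock dependency. Keeping the in-memory lock per instance also means separate in-memory stores never contend with each other.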

@ixchio ixchio requested a review from tofarr January 11, 2026 16:17
Collaborator

@tofarr tofarr left a comment

Great Work! 🍰

@tofarr tofarr enabled auto-merge (squash) January 12, 2026 16:14
@ixchio
Contributor Author

ixchio commented Jan 12, 2026

Glad we got this one across the finish line! 🍰 I think I’ve officially caught the OpenHands bug—that’s 5 PRs down and I’m just getting warmed up. Great working with you all on the technical deep-dives; let’s keep this momentum going! 💪

@tofarr tofarr merged commit 397ab08 into OpenHands:main Jan 12, 2026
14 checks passed
Development

Successfully merging this pull request may close these issues.

Race condition in EventLog.append() causes duplicate indices when multiple instances write concurrently

4 participants