Agent-assisted search for thread races #1321

@jfindlay

As mentioned in this comment, here is the full thread race audit report. I don't expect regression tests for these races to be reliable (non-flaky), because they are sensitive to the runtime (CPython with or without the GIL, PyPy, etc.), the platform (virtualization stack, CPU characteristics, etc.), and the Python version. I can, however, provide at least a minimal reproducible test case for each class of thread race (see below).

If you want, I can submit a pull request addressing all issues with the indicated strategies; let me know if you have any feedback and I will incorporate it into the fixes. Please read over the Fix strategies section: no strategy is sufficient alone. My preference would be for fixes 2, 3, and 5.

[Agent report below]

Background

#1317 (use_original class-attribute race, fixed in PR #1318) and #1319 (_directory_content dict-mutate-during-iterate, open) are the first two confirmed crashing sites in a broader pattern. A systematic static audit of all pyfakefs/*.py source files (Pass A) followed by runtime confirmation under PyPy 7.3.19 and free-threaded CPython 3.13.3 (--disable-gil) found 8 distinct hazard shapes across the library's internal data structures and module-level state.

Hazard inventory

Produced by static-pattern grep over pyfakefs/*.py. Runtime confirmation performed on:

  • PyPy 7.3.19 / Python 3.11, x86_64 Linux
  • CPython 3.13.3 free-threaded (GIL disabled, --disable-gil), x86_64 Linux

| # | Shape | Confirmed sites | Symptom | Runtime confirmation |
|---|---|---|---|---|
| 1 | Integer RMW counters | 7 | Silent: duplicate inodes, wrong link counts, wrong disk usage | Duplicate inodes 10/10 repeats (3.13 FT); used_size mismatch 10/10 (3.13 FT), 1/10 (PyPy) |
| 2 | Heap TOCTOU | 2 | Crash: IndexError / interpreter segfault | IndexError 10/20 repeats (PyPy, 32 threads); segfault on 3.13 FT at 32 threads |
| 3 | Dict mutate-during-iterate | 7 | Crash: RuntimeError: dictionary changed size during iteration | 940 crashes in 50 repeats (PyPy); 620/640 workers (3.13 FT); mount_points crash 20/20 (3.13 FT) |
| 4 | List append + index allocator | 1 | Silent: duplicate fd numbers | Confirmed by Shape 2 crash path |
| 5 | TOCTOU existence check | 5 | Silent: lost writes, duplicate insertions | Not observed (silent under GIL; dict insert atomic at bytecode level) |
| 6 | Class-attribute toggle | 5 | Silent / crash: cross-thread state confusion, premature teardown | #1317 / PR #1318 (one site fixed); tearDown KeyError/NameError on 3.13 FT |
| 7 | Process-global module variables | 2 | Silent: permission checks see another thread's uid | 591,563 cross-thread UID reads in 5/5 repeats at 50k ops/thread (3.13 FT) |
| 8 | Shared-list nullification (free-threaded CPython only) | 2 | Crash: RuntimeError or memory corruption | Confirmed by Shape 3 workload; safe under GIL |

Crashing surface (observed)

These are the crashes that fire without exotic configuration. All confirmed on the platforms listed above.

Shape 2 — _free_fd_heap heap TOCTOU (time-of-check to time-of-use)

Two independent crash paths from add_open_file / close_open_file:

Path A — concurrent open: check-then-pop race

File "pyfakefs/fake_filesystem.py", line 946, in add_open_file
    if self._free_fd_heap:                     # check
File "pyfakefs/fake_filesystem.py", line 947, in add_open_file
    open_fd = heapq.heappop(self._free_fd_heap)  # another thread drained it
File "pypy3.11/heapq.py", line 143, in heappop
    _siftup(heap, 0)
IndexError: list index out of range

Path B — concurrent close: push corrupts heap mid-siftdown

File "pyfakefs/fake_filesystem.py", line 965, in close_open_file
    heapq.heappush(self._free_fd_heap, file_des)
File "pypy3.11/heapq.py", line 135, in heappush
    _siftdown(heap, 0, len(heap)-1)
IndexError: list index out of range

Under free-threaded CPython 3.13 at 32 threads the same workload crashes the interpreter (segmentation fault) rather than raising IndexError.
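
The Path A interleaving can be replayed deterministically in a single thread, which shows the failure mode without depending on scheduler timing. This is a minimal sketch, not pyfakefs code: `free_fd_heap` is a hypothetical local list standing in for `_free_fd_heap`, and the "threads" are just labelled steps.

```python
import heapq

# Stand-in for FakeFilesystem._free_fd_heap holding one recycled descriptor.
free_fd_heap = [3]

# Thread A: the check passes because one descriptor is available.
thread_a_saw_free_fd = bool(free_fd_heap)

# Thread B runs between A's check and A's pop, and takes the descriptor.
thread_b_fd = heapq.heappop(free_fd_heap)

# Thread A: acts on its stale check and pops from a heap that is now empty.
error = None
if thread_a_saw_free_fd:
    try:
        heapq.heappop(free_fd_heap)
    except IndexError as exc:
        error = exc

print(type(error).__name__)
```

The check and the pop have to happen as one atomic step for the check to mean anything; any scheme that keeps them separate leaves this window open.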

Shape 3 — _directory_content dict iterate-during-mutate (see also #1319)

File "pyfakefs/fake_filesystem.py", line 1523, in _directory_content
    matching_content = [
File "pyfakefs/fake_filesystem.py", line 1525, in <listcomp>
    for subdir in directory.entries
RuntimeError: dictionary changed size during iteration

This fires when a concurrent add_entry (line 563 of fake_file.py) inserts into a FakeDirectory._entries dict while another thread's _directory_content is iterating the same dict. The list comprehension is only reached when is_case_sensitive is False. Six other Shape 3 sites (mount_points iteration in add_mount_point, _mount_point_for_path, is_mount_point, etc.) fire on any filesystem configuration.
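
The error itself is not specific to threads: inserting into a dict while iterating it raises the same RuntimeError in a single thread. The sketch below shows that, followed by one possible mitigation, iterating a snapshot taken under a lock that also guards every mutation. `LockedEntries` is a hypothetical wrapper for illustration and does not correspond to any pyfakefs class.

```python
import threading

entries = {"a": 1, "b": 2}

# Inserting while iterating raises the same error the threaded workload hits.
error = None
try:
    for name in entries:
        entries[name + "_copy"] = 0
except RuntimeError as exc:
    error = exc
print(error)


class LockedEntries:
    """Hypothetical wrapper: one lock guards both mutation and snapshotting."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}

    def add(self, name, value):
        with self._lock:
            self._entries[name] = value

    def snapshot(self):
        # Copy the keys while holding the lock; callers iterate the copy.
        with self._lock:
            return list(self._entries)


locked = LockedEntries()
locked.add("a", 1)
locked.add("b", 2)
for name in locked.snapshot():
    locked.add(name + "_copy", 0)  # safe: the loop walks a private copy
print(sorted(locked.snapshot()))
```

A snapshot can be stale by the time it is used, so this trades the crash for the usual "listing reflects a moment in time" semantics.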

Shape 7 — USER_ID / GROUP_ID module-global contamination

Not a crash, but the most reliably observable hazard on free-threaded CPython:

# helpers.py:83, 99
def set_uid(uid):
    global USER_ID
    USER_ID = uid       # plain module-global write, no lock

def get_uid():
    return USER_ID      # plain read, no lock

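On CPython with the GIL, 2 contaminated reads were observed in 1 of 5 repeats (an ordering race, not a torn read); on free-threaded CPython 3.13, 591,563 cross-thread UID reads were observed in 5/5 repeats at 50k ops/thread. The practical consequence is that a thread calling set_uid(0) to bypass permission checks can contaminate permission decisions in sibling threads.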
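
One way to isolate the value per thread is `threading.local()`. The following is a minimal sketch under stated assumptions, not the pyfakefs implementation: `_state` and `DEFAULT_UID` are hypothetical names, and the functions below only mirror the shape of `set_uid` / `get_uid`.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_state = threading.local()   # hypothetical per-thread storage
DEFAULT_UID = 1000           # hypothetical fallback for threads that never set a uid


def set_uid(uid):
    _state.uid = uid         # visible only to the calling thread


def get_uid():
    return getattr(_state, "uid", DEFAULT_UID)


def worker(uid, n=10_000):
    contaminated = 0
    for _ in range(n):
        set_uid(uid)
        if get_uid() != uid:
            contaminated += 1
    return contaminated


with ThreadPoolExecutor(max_workers=8) as pool:
    total = sum(pool.map(worker, range(8)))
print(f"Cross-thread UID reads: {total}")
```

The trade-off is a behaviour change: a uid set in the main thread is no longer seen by worker threads, which may or may not match what existing tests expect.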

How to reproduce the crashing hazards

Shape 2 (_free_fd_heap) — reliable on PyPy

import threading
import os
from concurrent.futures import ThreadPoolExecutor
from pyfakefs.fake_filesystem_unittest import Patcher

THREADS, N = 32, 64

def open_close(n):
    tid = threading.get_ident()
    for i in range(n):
        path = f"/shared/f_{tid}_{i}"
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
        os.close(fd)

with Patcher() as p:
    assert p.fs is not None
    p.fs.create_dir("/shared")
    with ThreadPoolExecutor(max_workers=THREADS) as pool:
        for f in [pool.submit(open_close, N) for _ in range(THREADS)]:
            f.result()

Run under pypy3. Crashes with IndexError within seconds. Under free-threaded CPython 3.13 (an interpreter built with the --disable-gil configure option) the process segfaults.

Shape 3 (_directory_content) — reliable on PyPy (see also #1319)

import threading
from concurrent.futures import ThreadPoolExecutor
from pyfakefs.fake_filesystem import FakeFilesystem

THREADS, N = 32, 64

def mkdir_worker(fs, n, prefix):
    tid = threading.get_ident()
    for i in range(n):
        try:
            fs.create_dir(f"/shared/d_{prefix}_{tid}_{i}")
        except OSError:
            pass   # EEXIST from Shape 5 TOCTOU — benign here

fs = FakeFilesystem("/")
fs.is_case_sensitive = False   # required: listcomp only reached on this path
fs.create_dir("/shared")
with ThreadPoolExecutor(max_workers=THREADS) as pool:
    for f in [pool.submit(mkdir_worker, fs, N, i) for i in range(THREADS)]:
        f.result()

Run under pypy3: 940 RuntimeError crashes were observed over 50 repeats. Under 3.13 FT, 620 of 640 workers crashed.

Shape 7 (USER_ID) — reliable on free-threaded CPython

import threading
from concurrent.futures import ThreadPoolExecutor
from pyfakefs import helpers

THREADS, N = 16, 10_000

def worker(uid, n):
    contaminated = 0
    for _ in range(n):
        helpers.set_uid(uid)
        if helpers.get_uid() != uid:
            contaminated += 1
    return contaminated

with ThreadPoolExecutor(max_workers=THREADS) as pool:
    results = [pool.submit(worker, uid, N) for uid in range(THREADS)]
    total = sum(f.result() for f in results)
print(f"Cross-thread UID reads: {total}")

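Run under free-threaded CPython 3.13. Output: ~590,000 cross-thread reads. Note that the hazard inventory reports this figure at 50k ops/thread; with N = 10_000 as written above, the count cannot exceed THREADS * N = 160,000, so N was presumably 50_000 for that measurement. On CPython with the GIL: 0–2 (ordering race only, not a torn read).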

Fix strategies

In rough increasing order of cost. Non-exclusive — any combination is valid.

2. Static-pattern test (low-cost, doesn't decay).
A ~80-line pytest test (test_thread_safety_static.py) that greps the source for unsynchronized augmented-assignment (+=, -=) on named hazardous attributes (last_ino, last_dev, st_nlink, used_size) and any write to Shape 7 module globals (USER_ID, GROUP_ID). Currently flags 9 genuine hazardous sites with no false positives after annotating 5 init-time safe sites with # thread_safe_ok.

This is the only test form that does not decay silently across revisions: it tests a syntactic property of the source, not a runtime observation.
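
The test file itself is not included in this report, so the following is only a guess at its general shape: a regex scan that flags augmented assignment on the named attributes and plain writes to the Shape 7 globals, skipping lines annotated with `# thread_safe_ok`. The function name `find_hazards` and the `SAMPLE` source are hypothetical; a real test would read the pyfakefs/*.py files instead.

```python
import re

HAZARD_ATTRS = ("last_ino", "last_dev", "st_nlink", "used_size")
HAZARD_GLOBALS = ("USER_ID", "GROUP_ID")
SAFE_MARKER = "# thread_safe_ok"

_attrs = "|".join(HAZARD_ATTRS)
_globals = "|".join(HAZARD_GLOBALS)
# Augmented assignment (+= or -=) on a hazardous attribute.
AUGMENTED = re.compile(rf"\b(?:{_attrs})\s*[+-]=")
# Plain assignment to a hazardous module global ("==" comparisons excluded).
GLOBAL_WRITE = re.compile(rf"^\s*(?:{_globals})\s*=(?!=)")


def find_hazards(source):
    """Return (line number, stripped line) for each unannotated hazard."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if SAFE_MARKER in line:
            continue
        if AUGMENTED.search(line) or GLOBAL_WRITE.search(line):
            hits.append((lineno, line.strip()))
    return hits


SAMPLE = """\
USER_ID = 1  # thread_safe_ok
def set_uid(uid):
    global USER_ID
    USER_ID = uid
def add(self):
    self.last_ino += 1
    if USER_ID == 0:
        pass
"""

hits = find_hazards(SAMPLE)
for lineno, text in hits:
    print(lineno, text)
```

A line-based scan like this cannot tell whether a flagged statement already runs under a lock, which is presumably why the annotation marker is needed.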

3. Surgical fixes for the crashing sites (Shapes 2 and 3).
A per-FakeFilesystem threading.Lock guarding add_open_file, close_open_file, and _entries mutation. Solves the two shapes that produce crashes; leaves the remaining silent hazards as documented limitations. Shape 7 (USER_ID/GROUP_ID) can be fixed independently and cheaply by converting those module globals to threading.local() — same pattern as PR #1318's fix for use_original.
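
For the descriptor table, the surgical fix amounts to making check-then-pop and push each run under one lock. The class below is a simplified, hypothetical stand-in (`FdTable` is not a pyfakefs class, and the real open-file bookkeeping is more involved); it only illustrates where the lock would sit.

```python
import heapq
import threading
from concurrent.futures import ThreadPoolExecutor


class FdTable:
    """Hypothetical stand-in for the open-file bookkeeping."""

    def __init__(self):
        self._lock = threading.Lock()
        self._open_files = []      # index == fd; None marks a closed slot
        self._free_fd_heap = []    # recycled descriptors, smallest first

    def add_open_file(self, file_obj):
        with self._lock:           # check and pop are now one atomic step
            if self._free_fd_heap:
                fd = heapq.heappop(self._free_fd_heap)
                self._open_files[fd] = file_obj
            else:
                fd = len(self._open_files)
                self._open_files.append(file_obj)
            return fd

    def close_open_file(self, fd):
        with self._lock:           # push cannot interleave with a pop
            self._open_files[fd] = None
            heapq.heappush(self._free_fd_heap, fd)


def worker(table, n):
    mismatches = 0
    for _ in range(n):
        token = object()
        fd = table.add_open_file(token)
        if table._open_files[fd] is not token:  # fd handed out twice
            mismatches += 1
        table.close_open_file(fd)
    return mismatches


table = FdTable()
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(worker, table, 2_000) for _ in range(8)]
    total = sum(f.result() for f in futures)
print(f"Duplicate descriptors: {total}")
```

Because the same lock covers the append path, this also closes the Shape 4 window for this one allocator.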

4. Coarse-grained lock.
A single threading.Lock on FakeFilesystem guarding the entire public API surface. Solves all 8 shapes; adds single-threaded overhead (one lock acquire/release per filesystem call) but is probably tolerable for a test helper.
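
A coarse lock is commonly applied with a decorator on each public method. The sketch below assumes that guarded public methods call one another; under that assumption a plain `threading.Lock` would block on the nested call, so it uses `threading.RLock` instead. `synchronized`, `TinyFs`, and `_api_lock` are hypothetical names for illustration only.

```python
import functools
import threading


def synchronized(method):
    """Run the method while holding the instance-wide API lock."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        with self._api_lock:
            return method(self, *args, **kwargs)
    return wrapper


class TinyFs:
    """Hypothetical miniature filesystem with one lock for the whole API."""

    def __init__(self):
        self._api_lock = threading.RLock()  # re-entrant on the same thread
        self._dirs = {"/"}

    @synchronized
    def create_dir(self, path):
        self._dirs.add(path)

    @synchronized
    def makedirs(self, path):
        current = ""
        for part in path.strip("/").split("/"):
            current += "/" + part
            self.create_dir(current)        # nested call, lock already held


fs = TinyFs()
fs.makedirs("/a/b/c")
print(sorted(fs._dirs))

# A non-reentrant lock cannot be taken twice by the same thread.
plain = threading.Lock()
plain.acquire()
second_attempt = plain.acquire(blocking=False)
plain.release()
print(second_attempt)
```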

5. Per-object locks.
Finer-grained (FakeDirectory._entries lock, FakeFilesystem._fd_lock, etc.). Lower contention than strategy 4; significantly more complex.

Relationship to existing issues and PRs

Environment

Crashing hazards confirmed on:

x86_64 Ubuntu 24.04 LTS
PyPy 7.3.19 / Python 3.11 (GIL enabled)
CPython 3.13.3 (--disable-gil, free-threaded build from source)
pyfakefs 6.3.dev0 (main + PR #1318, commit 9697a15)

Silent hazards (Shapes 1, 4, 5, 6) are latent on all runtimes. Shape 7 (USER_ID) fires observably on free-threaded CPython and occasionally on PyPy. All shapes are likely to become observable on free-threaded CPython 3.13 and later as that runtime matures.
