There is a time-of-check to time-of-use race that probably doesn't happen often but could happen at scale. I noticed it while looking at #1353 and wanted to write it down. I will paste my chat with Gemini because I don't have time at the moment to write it up myself.
Gemini chat
Me: There seems to be a potential race condition in the versioning here, can you double-check this? (paste `adapter.py`)
Chat: You have a great eye. There is indeed a classic Time-Of-Check to Time-Of-Use (TOCTOU) race condition in the `replace_metadata` method.
The Race Condition Location
The vulnerability is right here in the `replace_metadata` method when calculating the `next_revision_number`:
```python
async def replace_metadata(
    self, metadata=None, specs=None, access_blob=None, *, drop_revision=False
):
    # ...
    async with self.context.session() as db:
        if not drop_revision:
            current = (
                await db.execute(
                    select(orm.Node).where(orm.Node.id == self.node.id)
                )
            ).scalar_one()
            # --- RACE CONDITION HERE ---
            next_revision_number = 1 + (
                (
                    await db.execute(
                        select(func.max(orm.Revision.revision_number)).where(
                            orm.Revision.node_id == self.node.id
                        )
                    )
                ).scalar_one()
                or 0
            )
            # ---------------------------
            revision = orm.Revision(
                metadata_=current.metadata_,
                specs=current.specs,
                access_blob=current.access_blob,
                node_id=current.id,
                revision_number=next_revision_number,
            )
            db.add(revision)
```
Why it happens
If two requests concurrently attempt to update the metadata for the exact same node:
- Thread A executes the `select(func.max(...))` and finds the current max is 5. It calculates `next_revision_number = 6`.
- Thread B executes the `select(func.max(...))` at the exact same time, before Thread A commits. It also sees the current max as 5 and calculates `next_revision_number = 6`.
- Both threads create an `orm.Revision` object with `revision_number=6` and attempt to commit.
Depending on the database schema setup:
- If there is a unique constraint on `(node_id, revision_number)`, one of the transactions will fail with an `IntegrityError` and the user will get a 500 Server Error.
- If there is no unique constraint, you will end up with two distinct revisions sharing the same `revision_number`, corrupting your version history.
Standard database transaction isolation levels (like Read Committed) will not prevent this because they do not block concurrent reads.
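For concreteness, here is a minimal sketch of the kind of constraint that makes the collision surface as an `IntegrityError` rather than silent duplication. This is an illustrative SQLAlchemy 2.0-style model, not the project's actual `orm.Revision` definition:

```python
from sqlalchemy import Integer, UniqueConstraint
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class Revision(Base):
    __tablename__ = "revisions"

    id: Mapped[int] = mapped_column(Integer, primary_key=True)
    node_id: Mapped[int] = mapped_column(Integer, index=True)
    revision_number: Mapped[int] = mapped_column(Integer)

    # With this constraint, the second concurrent commit of revision_number=6
    # fails loudly with IntegrityError instead of silently duplicating history.
    __table_args__ = (UniqueConstraint("node_id", "revision_number"),)
```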
The Fix: Row Locking
The cleanest way to solve this in SQLAlchemy is pessimistic concurrency control via `with_for_update()`. By locking the target `orm.Node` row at the very beginning of the transaction, any concurrent requests are forced to wait in line until the first transaction finishes and releases the lock.
Here is how you can patch the `replace_metadata` method:
```python
async def replace_metadata(
    self, metadata=None, specs=None, access_blob=None, *, drop_revision=False
):
    values = {}
    if metadata is not None:
        values["metadata_"] = metadata
    if specs is not None:
        values["specs"] = specs
    if access_blob is not None:
        values["access_blob"] = access_blob
    async with self.context.session() as db:
        if not drop_revision:
            # Lock the Node row. Concurrent requests trying to update this
            # specific node will block here until the current transaction commits.
            current = (
                await db.execute(
                    select(orm.Node)
                    .where(orm.Node.id == self.node.id)
                    .with_for_update()  # <-- ADD THIS
                )
            ).scalar_one()
            next_revision_number = 1 + (
                (
                    await db.execute(
                        select(func.max(orm.Revision.revision_number)).where(
                            orm.Revision.node_id == self.node.id
                        )
                    )
                ).scalar_one()
                or 0
            )
            revision = orm.Revision(
                metadata_=current.metadata_,
                specs=current.specs,
                access_blob=current.access_blob,
                node_id=current.id,
                revision_number=next_revision_number,
            )
            db.add(revision)
        else:
            # Even if dropping the revision, if you want to be completely safe
            # against metadata overwrites, you might still want to lock the node here.
            pass
        await db.execute(
            update(orm.Node).where(orm.Node.id == self.node.id).values(**values)
        )
        await db.commit()
    # ... (rest of the streaming cache logic)
```
By adding `.with_for_update()`, you enforce serialization on this specific node's metadata updates, guaranteeing the `max()` check is always accurate.
Me: Don't we have to do something at the API layer as well? (paste api layer)
Chat: You are absolutely right. Adding a row lock at the database layer prevents the `revision_number` collision, but the API layer as it is currently written introduces a classic Lost Update anomaly.
Your intuition is spot on: the read-modify-write cycle is split between the API layer and the database layer, leaving a wide-open window for a race condition.
The Race Condition in `api.py`
Take a look at what happens in the `patch_metadata` route:
```python
# 1. READ: Fetch current metadata into memory (NO LOCK YET)
if body.content_type == patch_mimetypes.JSON_PATCH:
    metadata = apply_json_patch(entry.metadata(), (body.metadata or []))
# ...
# 2. MODIFY: In-memory patching happens here
# 3. WRITE: The DB lock is finally acquired inside this method call
await entry.replace_metadata(...)
```
Because `entry.metadata()` is read before the database transaction and lock are initiated, concurrent requests will read the exact same base metadata.
The Scenario:
- Request A and Request B arrive at the same time.
- Both hit `patch_metadata` and read `entry.metadata()` (let's say it's currently `{"color": "blue"}`).
- Request A applies a patch changing the color to `"red"`. Its in-memory state is now `{"color": "red"}`.
- Request B applies a patch adding a new key `"size": "large"`. Its in-memory state is `{"color": "blue", "size": "large"}`.
- Request A calls `replace_metadata({"color": "red"})`. It acquires the row lock, saves the data, increments the revision, and commits.
- Request B calls `replace_metadata({"color": "blue", "size": "large"})`. It acquires the lock, saves the data, increments the revision, and commits.
- The Bug: Request A's change to `"red"` is completely overwritten and lost, even though both requests generated valid sequential revisions (see the toy sketch below).
How to Fix It
You have two main architectural choices to solve this: Pessimistic or Optimistic concurrency control.
Option 1: Pessimistic Concurrency (Push Patching Down)
To make the database row lock actually protect the data, the read step must happen inside the locked transaction. You would need to refactor `api.py` so that it passes the patch instructions down to `replace_metadata`, rather than doing the patching in the router.
- API Layer (`api.py`): Just pass the payload down.
```python
await entry.patch_and_replace_metadata(
    patch_payload=body.metadata,
    patch_type=body.content_type,
    # ...
)
```
- Database Layer: Lock, Read, Patch, Write.
```python
async with self.context.session() as db:
    # 1. Lock the row
    current = (
        await db.execute(
            select(orm.Node)
            .where(orm.Node.id == self.node.id)
            .with_for_update()
        )
    ).scalar_one()
    # 2. Read and patch inside the safety of the lock
    if patch_type == patch_mimetypes.JSON_PATCH:
        new_metadata = apply_json_patch(current.metadata_, patch_payload)
    # 3. Save new revision and update Node...
```
Option 2: Optimistic Concurrency (HTTP ETags / If-Match)
Instead of relying strictly on database locks, you force the client to prove they are modifying the most up-to-date version of the node.
- When a client GETs the metadata, the server returns an `ETag` header (usually a hash of the current metadata or the current `revision_number`).
- When the client sends a `PATCH` or `PUT`, they must include an `If-Match: <etag>` header.
- In `api.py` (or the database layer), you check if the `If-Match` header matches the current `revision_number` of the node.
- If it doesn't match (meaning someone else modified the node between the client's GET and PATCH), you abort and return an HTTP 412 Precondition Failed or HTTP 409 Conflict. The client must re-fetch the new state and try their patch again (a sketch of this check follows).
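Roughly what that check could look like as a helper called from `patch_metadata`, assuming the routes are FastAPI-style and that the ETag is simply the node's current `revision_number` (both are assumptions; none of this exists in `api.py` today):

```python
from fastapi import HTTPException


def check_if_match(if_match_header: str | None, current_revision_number: int) -> None:
    """Enforce optimistic concurrency: the client must prove it saw the latest revision.

    Hypothetical helper; the ETag format (the quoted revision_number) is an assumption.
    """
    if if_match_header is None:
        # 428 Precondition Required: force clients to send conditional writes.
        raise HTTPException(status_code=428, detail="If-Match header required")
    if if_match_header.strip('"') != str(current_revision_number):
        # Someone else modified the node between the client's GET and this PATCH.
        raise HTTPException(status_code=412, detail="Precondition Failed")
```

The route would call this before applying the patch, and GET responses would set the matching `ETag` header so clients have something to send back.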
Which is better?
- Option 1 is better if you expect a high volume of automated, concurrent patching and don't want clients to write retry logic.
- Option 2 is the RESTful standard and is much safer for user-facing applications (e.g., a user opens a web form, walks away for 10 minutes, and hits "Save": you want that save to fail if someone else edited the node during those 10 minutes). It does push retry logic onto the client, as sketched below.
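For a sense of what that client-side retry logic under Option 2 would involve, a hedged sketch (the endpoint URL, header handling, and use of httpx are illustrative assumptions, not the current client API):

```python
import httpx


def patch_with_retry(url: str, json_patch: list, max_attempts: int = 5) -> httpx.Response:
    """Re-fetch and retry whenever the server reports 412 Precondition Failed."""
    for _ in range(max_attempts):
        # Re-read the current state and its version tag.
        etag = httpx.get(url).headers.get("ETag", "")

        response = httpx.patch(
            url,
            json=json_patch,
            headers={
                "If-Match": etag,
                "Content-Type": "application/json-patch+json",
            },
        )
        if response.status_code != 412:
            return response  # success, or a non-concurrency error to surface
    raise RuntimeError("Metadata kept changing concurrently; gave up after retries.")
```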