There is a time-of-check to time-of-use race that probably doesn't happen often but could happen at scale. I noticed it while looking at #1353 and wanted to write it down. I will paste my chat with Gemini because I don't have time at the moment to write it up myself.
Gemini chat
Me: There seems to be a potential race condition in the versioning here, can you double-check this? (paste `adapter.py`)
Chat: You have a great eye. There is indeed a classic Time-Of-Check to Time-Of-Use (TOCTOU) race condition in the `replace_metadata` method.
The Race Condition Location
The vulnerability is right here in the `replace_metadata` method when calculating the `next_revision_number`:
```python
async def replace_metadata(
    self, metadata=None, specs=None, access_blob=None, *, drop_revision=False
):
    # ...
    async with self.context.session() as db:
        if not drop_revision:
            current = (
                await db.execute(
                    select(orm.Node).where(orm.Node.id == self.node.id)
                )
            ).scalar_one()
            # --- RACE CONDITION HERE ---
            next_revision_number = 1 + (
                (
                    await db.execute(
                        select(func.max(orm.Revision.revision_number)).where(
                            orm.Revision.node_id == self.node.id
                        )
                    )
                ).scalar_one()
                or 0
            )
            # ---------------------------
            revision = orm.Revision(
                metadata_=current.metadata_,
                specs=current.specs,
                access_blob=current.access_blob,
                node_id=current.id,
                revision_number=next_revision_number,
            )
            db.add(revision)
```
Why it happens
If two requests concurrently attempt to update the metadata for the exact same node:
- Thread A executes the `select(func.max(...))` and finds the current max is 5. It calculates `next_revision_number = 6`.
- Thread B executes the `select(func.max(...))` at the exact same time, before Thread A commits. It also sees the current max as 5 and calculates `next_revision_number = 6`.
- Both threads create an `orm.Revision` object with `revision_number=6` and attempt to commit.
Depending on the database schema setup:
- If there is a unique constraint on `(node_id, revision_number)`, one of the transactions will fail with an `IntegrityError` and the user will get a 500 Server Error.
- If there is no unique constraint, you will end up with two distinct revisions sharing the same `revision_number`, corrupting your version history.
Standard database transaction isolation levels (like Read Committed) will not prevent this because they do not block concurrent reads.
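For concreteness, here is a minimal sketch of the kind of constraint that makes the collision surface as an `IntegrityError` rather than silent duplication. This is an illustrative SQLAlchemy 2.0-style model, not the project's actual `orm.Revision` definition:

```python
from sqlalchemy import Integer, UniqueConstraint
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column


class Base(DeclarativeBase):
    pass


class Revision(Base):
    __tablename__ = "revisions"

    id: Mapped[int] = mapped_column(Integer, primary_key=True)
    node_id: Mapped[int] = mapped_column(Integer, index=True)
    revision_number: Mapped[int] = mapped_column(Integer)

    # With this constraint, the second concurrent commit of revision_number=6
    # fails loudly with IntegrityError instead of silently duplicating history.
    __table_args__ = (UniqueConstraint("node_id", "revision_number"),)
```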
The Fix: Row Locking
The cleanest way to solve this in SQLAlchemy is pessimistic concurrency control via `with_for_update()`. By locking the target `orm.Node` row at the very beginning of the transaction, any concurrent requests are forced to wait in line until the first transaction finishes and releases the lock.
Here is how you can patch the `replace_metadata` method:
```python
async def replace_metadata(
    self, metadata=None, specs=None, access_blob=None, *, drop_revision=False
):
    values = {}
    if metadata is not None:
        values["metadata_"] = metadata
    if specs is not None:
        values["specs"] = specs
    if access_blob is not None:
        values["access_blob"] = access_blob
    async with self.context.session() as db:
        if not drop_revision:
            # Lock the Node row. Concurrent requests trying to update this
            # specific node will block here until the current transaction commits.
            current = (
                await db.execute(
                    select(orm.Node)
                    .where(orm.Node.id == self.node.id)
                    .with_for_update()  # <-- ADD THIS
                )
            ).scalar_one()
            next_revision_number = 1 + (
                (
                    await db.execute(
                        select(func.max(orm.Revision.revision_number)).where(
                            orm.Revision.node_id == self.node.id
                        )
                    )
                ).scalar_one()
                or 0
            )
            revision = orm.Revision(
                metadata_=current.metadata_,
                specs=current.specs,
                access_blob=current.access_blob,
                node_id=current.id,
                revision_number=next_revision_number,
            )
            db.add(revision)
        else:
            # Even if dropping the revision, if you want to be completely safe
            # against metadata overwrites, you might still want to lock the node here.
            pass
        await db.execute(
            update(orm.Node).where(orm.Node.id == self.node.id).values(**values)
        )
        await db.commit()
    # ... (rest of the streaming cache logic)
```
By adding `.with_for_update()`, you enforce serialization on this specific node's metadata updates, guaranteeing the `max()` check is always accurate.
Me: Don't we have to do something at the API layer as well? (paste api layer)
Chat: You are absolutely right. Adding a row lock at the database layer prevents the `revision_number` collision, but the API layer as it is currently written introduces a classic Lost Update anomaly.
Your intuition is spot on: the read-modify-write cycle is split between the API layer and the database layer, leaving a wide-open window for a race condition.
The Race Condition in `api.py`
Take a look at what happens in the `patch_metadata` route:
```python
# 1. READ: Fetch current metadata into memory (NO LOCK YET)
if body.content_type == patch_mimetypes.JSON_PATCH:
    metadata = apply_json_patch(entry.metadata(), (body.metadata or []))
# ...
# 2. MODIFY: In-memory patching happens here
# 3. WRITE: The DB lock is finally acquired inside this method call
await entry.replace_metadata(...)
```
Because `entry.metadata()` is read before the database transaction and lock are initiated, concurrent requests will read the exact same base metadata.
The Scenario:
- Request A and Request B arrive at the same time.
- Both hit `patch_metadata` and read `entry.metadata()` (let's say it's currently `{"color": "blue"}`).
- Request A applies a patch changing the color to `"red"`. Its in-memory state is now `{"color": "red"}`.
- Request B applies a patch adding a new key `"size": "large"`. Its in-memory state is `{"color": "blue", "size": "large"}`.
- Request A calls `replace_metadata({"color": "red"})`. It acquires the row lock, saves the data, increments the revision, and commits.
- Request B calls `replace_metadata({"color": "blue", "size": "large"})`. It acquires the lock, saves the data, increments the revision, and commits.
- The Bug: Request A's change to `"red"` is completely overwritten and lost, even though both requests generated valid sequential revisions (see the toy sketch below).
How to Fix It
You have two main architectural choices to solve this: Pessimistic or Optimistic concurrency control.
Option 1: Pessimistic Concurrency (Push Patching Down)
To make the database row lock actually protect the data, the read step must happen inside the locked transaction. You would need to refactor `api.py` so that it passes the patch instructions down to `replace_metadata`, rather than doing the patching in the router.
- API Layer (`api.py`): Just pass the payload down.
```python
await entry.patch_and_replace_metadata(
    patch_payload=body.metadata,
    patch_type=body.content_type,
    # ...
)
```
- Database Layer: Lock, Read, Patch, Write.
```python
async with self.context.session() as db:
    # 1. Lock the row
    current = (
        await db.execute(
            select(orm.Node)
            .where(orm.Node.id == self.node.id)
            .with_for_update()
        )
    ).scalar_one()
    # 2. Read and patch inside the safety of the lock
    if patch_type == patch_mimetypes.JSON_PATCH:
        new_metadata = apply_json_patch(current.metadata_, patch_payload)
    # 3. Save new revision and update Node...
```
Option 2: Optimistic Concurrency (HTTP ETags / If-Match)
Instead of relying strictly on database locks, you force the client to prove they are modifying the most up-to-date version of the node.
- When a client GETs the metadata, the server returns an `ETag` header (usually a hash of the current metadata or the current `revision_number`).
- When the client sends a `PATCH` or `PUT`, they must include an `If-Match: <etag>` header.
- In `api.py` (or the database layer), you check if the `If-Match` header matches the current `revision_number` of the node.
- If it doesn't match (meaning someone else modified the node between the client's GET and PATCH), you abort and return an HTTP 412 Precondition Failed or HTTP 409 Conflict. The client must re-fetch the new state and try their patch again (a sketch of this check follows).
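Roughly what that check could look like as a helper called from `patch_metadata`, assuming the routes are FastAPI-style and that the ETag is simply the node's current `revision_number` (both are assumptions; none of this exists in `api.py` today):

```python
from fastapi import HTTPException


def check_if_match(if_match_header: str | None, current_revision_number: int) -> None:
    """Enforce optimistic concurrency: the client must prove it saw the latest revision.

    Hypothetical helper; the ETag format (the quoted revision_number) is an assumption.
    """
    if if_match_header is None:
        # 428 Precondition Required: force clients to send conditional writes.
        raise HTTPException(status_code=428, detail="If-Match header required")
    if if_match_header.strip('"') != str(current_revision_number):
        # Someone else modified the node between the client's GET and this PATCH.
        raise HTTPException(status_code=412, detail="Precondition Failed")
```

The route would call this before applying the patch, and GET responses would set the matching `ETag` header so clients have something to send back.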
Which is better?
- Option 1 is better if you expect a high volume of automated, concurrent patching and don't want clients to write retry logic.
- Option 2 is the RESTful standard and is much safer for user-facing applications (e.g., a user opens a web form, walks away for 10 minutes, and hits "Save": you want that save to fail if someone else edited the node during those 10 minutes). It does push retry logic onto the client, as sketched below.
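For a sense of what that client-side retry logic under Option 2 would involve, a hedged sketch (the endpoint URL, header handling, and use of httpx are illustrative assumptions, not the current client API):

```python
import httpx


def patch_with_retry(url: str, json_patch: list, max_attempts: int = 5) -> httpx.Response:
    """Re-fetch and retry whenever the server reports 412 Precondition Failed."""
    for _ in range(max_attempts):
        # Re-read the current state and its version tag.
        etag = httpx.get(url).headers.get("ETag", "")

        response = httpx.patch(
            url,
            json=json_patch,
            headers={
                "If-Match": etag,
                "Content-Type": "application/json-patch+json",
            },
        )
        if response.status_code != 412:
            return response  # success, or a non-concurrency error to surface
    raise RuntimeError("Metadata kept changing concurrently; gave up after retries.")
```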