Concept: Rust database access via Python database connection pool (via runInteraction) (LLM splat v1) #19846

Draft
MadLittleMods wants to merge 58 commits into develop from madlittlemods/rust-db-access-using-python-db-pool-run-interaction-llm1

MadLittleMods commented Jun 10, 2026
Note: This iteration isn't meant to be merged; it only exists to see whether the approach is possible. I'm splatting it here for reference. It does seem to work (see testing strategy), although I haven't inspected the code: I just asked an LLM to take my previous non-working structure and make it work.


Rust database access via Python database connection pool.

This is a stepping stone before we can go full Rust everywhere. We're providing a generic interface as we want database access to work in Synapse and synapse-rust-apps. In synapse-rust-apps, we will use a tokio-postgres based database connection pool so it's full Rust.

We want to avoid the situation where we have two database connection pools (one for Python, one for Rust) as we've run into connection exhaustion problems on Matrix.org before.

Why runInteraction(...)?

This is a variation of #19824 that instead uses the runInteraction(...) pattern we already have in Synapse. Everything that uses the transaction happens inside the function callback, which lets the pool re-run the whole callback on serialization/deadlock errors when using repeatable-read isolation.

Using the same runInteraction pattern also means we can port over existing Synapse code/endpoints without much thought.

This strategy is less of an impedance mismatch (aligns more closely) with Synapse so the glue code should also be simpler.
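
To make the retry argument concrete, here is a minimal synchronous sketch of why keeping all transaction work inside one callback makes retries straightforward. `DbError`, `Txn`, `max_attempts`, and this `run_interaction` signature are illustrative assumptions, not the interface from this PR (which is async and goes through the Python pool).

```rust
/// Hypothetical error type: only `Serialization` and `Deadlock` are retried.
#[derive(Debug, PartialEq)]
pub enum DbError {
    Serialization,
    Deadlock,
    Other(String),
}

/// Stand-in for a transaction handle; counts the statements run against it.
#[derive(Default)]
pub struct Txn {
    pub statements: u32,
}

/// Runs `func` in a fresh transaction and re-runs it from the beginning when
/// it fails with a retryable error, up to `max_attempts` attempts in total.
pub fn run_interaction<R>(
    max_attempts: u32,
    func: impl Fn(&mut Txn) -> Result<R, DbError>,
) -> Result<R, DbError> {
    let mut attempt = 1;
    loop {
        // Every attempt gets a brand-new transaction, so nothing from a
        // failed attempt is visible to the next one.
        let mut txn = Txn::default();
        match func(&mut txn) {
            Err(DbError::Serialization | DbError::Deadlock) if attempt < max_attempts => {
                // A real pool would roll back here before retrying.
                attempt += 1;
            }
            // Success, a non-retryable error, or the final failed attempt.
            other => return other,
        }
    }
}
```

Because the callback is the only thing that touches the transaction, the pool can discard a failed attempt and call it again without the caller having to know.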

Testing strategy

Added some tests that exercise some async Rust handlers for the /versions endpoint:

SYNAPSE_TEST_LOG_LEVEL=INFO poetry run trial tests.rest.client.test_versions.VersionsTestCase

Real-world:

  1. poetry run synapse_homeserver --config-path homeserver.yaml
  2. GET http://localhost:8008/_matrix/client/versions

Logging

We already support normal logging from Rust -> Python with pyo3-log and log.

For example, if you add this inside async fn build_versions_response(...) { ... }:

log::info!(
    "asdf msc3881_enabled={:?} msc3575_enabled={:?}",
    msc3881_enabled,
    msc3575_enabled
);

It's logged out on the Python side although it's logged against the sentinel LoggingContext because we're crossing a thread boundary and losing the association to the LoggingContext on the main thread:

2026-06-22 18:00:57-0500 [-] synapse.handlers.versions - 124 - INFO - sentinel - asdf msc3881_enabled=false msc3575_enabled=true

But if you use create_deferred(...) to start running some async Rust function, we will capture the calling logcontext as a Tokio task local which can be referenced by any code running on that other thread.

For example, when using run_python_awaitable(...) we activate the logcontext stored in the Tokio task local. This is why we get database metrics tracked when using run_interaction(...).

Similarly, we can add some sort of helper to wrap any place we want to log that will activate the correct logcontext for that block in Rust. I think this is best left as a follow-up where we can specifically focus on the approach.
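
As a rough illustration of what such a helper could look like, here is a std-only sketch of a scoped "activate this logcontext for a block, then restore the previous one" function. It uses a plain `thread_local!` as a stand-in; the real thing would presumably read the Tokio task local and activate the Python `LoggingContext`. The names `with_logcontext` and `current_logcontext` are hypothetical.

```rust
use std::cell::RefCell;

thread_local! {
    /// Name of the logcontext active on this thread; "sentinel" means none.
    static CURRENT_LOGCONTEXT: RefCell<String> = RefCell::new(String::from("sentinel"));
}

/// Returns the name of the logcontext currently active on this thread.
pub fn current_logcontext() -> String {
    CURRENT_LOGCONTEXT.with(|c| c.borrow().clone())
}

/// Hypothetical helper: runs `f` with `name` as the active logcontext and
/// restores the previous one afterwards, even if `f` panics.
pub fn with_logcontext<R>(name: &str, f: impl FnOnce() -> R) -> R {
    /// Puts the previous logcontext back when dropped.
    struct Restore(String);

    impl Drop for Restore {
        fn drop(&mut self) {
            let previous = std::mem::take(&mut self.0);
            CURRENT_LOGCONTEXT.with(|c| *c.borrow_mut() = previous);
        }
    }

    // Swap in the new logcontext and remember the old one.
    let previous = CURRENT_LOGCONTEXT.with(|c| c.replace(name.to_owned()));
    let _restore = Restore(previous);
    f()
}
```

A log call wrapped as `with_logcontext("GET-6", || log::info!(...))` would then be attributed to the request rather than to the sentinel context, assuming the logger consults the same state.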

For the Processed request log lines:

  1. Does it count CPU time from things run in Rust?
    • No
  2. Does it count database time from things run in Rust?
    • Yes! Because run_interaction(...) activates the correct logcontext and calls runInteraction(...) on the Python side, everything is counted as normal. For example:

2026-06-23 21:20:31-0500 [-] synapse.access.http.fake - 643 - INFO - GET-6 - 127.0.0.1 - test - {@user1:test} Processed request: 0.619sec/0.000sec ru=(0.000sec, 0.000sec) db=(0.000sec/0.601sec/3) 383B 200 "GET /_matrix/client/versions 1.1" "-" [0 dbevts]

Dev notes

LLM Prompt

The goal is to achieve Rust database access via the Python database connection pool.

We're providing a generic interface as we want database access to work in Synapse and a separate project synapse-rust-apps. In synapse-rust-apps, we will use a tokio-postgres based database connection pool so it's full Rust. This is what's being prototyped in rust/src/storage/db/rust_db_pool.rs vs the actual implementation we'll keep here in rust/src/storage/db/python_db_pool.rs.

We want to stick with the runInteraction(...) pattern from Synapse as it allows us to retry transactions in a world where we're using repeatable-read isolation level (serialization and deadlock errors).

I'm in the middle of getting it working and most of it is just non-working structure. Please finish the rest.

As an example, we're doing some Rust database access in rust/src/handlers/versions.rs -> rust/src/storage/store.rs


Transaction retries were originally introduced into Synapse for deadlocks which makes me question why they're even happening in the first place. How are other people dealing with this kind of problem? (poorly, not a problem, retries like us, etc)

@reivilibre points out that we're more likely to hit retries because of serialization errors which aligns with the Postgres docs (we use the repeatable read isolation level in Synapse):

When an application receives this error message, it should abort the current transaction and retry the whole transaction from the beginning. The second time through, the transaction will see the previously-committed change as part of its initial view of the database, so there is no logical conflict in using the new version of the row as the starting point for the new transaction's update.

Note that only updating transactions might need to be retried; read-only transactions will never have serialization conflicts.

-- https://www.postgresql.org/docs/current/transaction-iso.html#XACT-REPEATABLE-READ


Usage:

db_pool
    .run_interaction(|txn| {
        async move {
            /* do stuff with txn */
        }
        .boxed()
    })

Or equivalent (if preferred):

db_pool
    .run_interaction(|txn| {
        Box::pin(async move {
            /* do stuff with txn */
        })
    })
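
For readers who want to try the callback shape in isolation, here is a self-contained, std-only sketch. The `for<'txn> Fn(&'txn mut dyn Transaction) -> BoxFuture<'txn, _>` bound matches the one shown in the review thread; the `execute` method, `RecordingTxn`, the free-standing `run_interaction`, and the toy `block_on` executor are assumptions made only for this example.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Same shape as `futures::future::BoxFuture`, spelled out to stay std-only.
pub type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;

/// Stand-in for the `Transaction` trait (the `execute` method is hypothetical).
pub trait Transaction: Send {
    fn execute(&mut self, sql: &str) -> u64;
}

/// Records statements instead of talking to a database.
#[derive(Default)]
pub struct RecordingTxn {
    pub log: Vec<String>,
}

impl Transaction for RecordingTxn {
    fn execute(&mut self, sql: &str) -> u64 {
        self.log.push(sql.to_owned());
        self.log.len() as u64
    }
}

/// Hands a fresh transaction to `func` and awaits the future it returns.
/// The future may borrow the transaction for `'txn`, hence the boxing.
pub async fn run_interaction<R, F>(func: F) -> (R, Vec<String>)
where
    F: for<'txn> Fn(&'txn mut dyn Transaction) -> BoxFuture<'txn, R>,
{
    let mut txn = RecordingTxn::default();
    let result = func(&mut txn).await;
    (result, txn.log)
}

struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Toy executor: only suitable for futures that never actually wait.
pub fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}
```

With these definitions, `block_on(run_interaction(|txn| Box::pin(async move { txn.execute("SELECT 1") })))` drives the callback to completion; `.boxed()` from the futures crate would be interchangeable with `Box::pin(...)` here.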

AsyncFn (Async closure) with Send trait doesn't seem very viable:


Other PR where I dealt with deferred (make_deferred_yieldable) and async in Rust: #18903

Fix MemoryReactor.callWhenRunning not invoking callbacks if already started, twisted/twisted#12514 - this is relevant because we need to start the tokio thread pool for these tests.


PyO3 Rust -> Python logging: https://pyo3.rs/main/ecosystem/logging

Todo

  • See how LoggingContext (log context) interacts
    • Logging
    • Tracing
  • See how metrics work

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Code style is correct (run the linters)

name: &'static str,
func: InteractionFn,
) -> AnyResult;
}

Contributor Author

From my high-level reading of things, the type erasure stuff here is because we want DatabasePool to be dyn-compatible because we currently store it like this:

pub struct Store {
    pub config: SynapseConfig,
    pub db_pool: Box<dyn DatabasePool>,
}

If we drop that constraint and instead store the DatabasePool like the following, all of the erasure complexity can fall away.

pub struct Store<P: DatabasePool> {
    pub config: SynapseConfig,
    pub db_pool: P,
}

I think the answer depends on whether we care about the Store being Synapse specific or if that also should be shared with synapse-rust-apps.

I think synapse-rust-apps might have its own Store implementation given we already store things that are Synapse specific in it like SynapseConfig. Although that could be refactored to be different 🤔
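
A small sketch of the trade-off, under simplifying assumptions: the trait below is synchronous and its exact signature, `Txn`, `InMemoryPool`, and `count_statements` are made up for illustration. The point is only that a trait with a generic method cannot be used as `Box<dyn DatabasePool>`, but works unchanged once `Store` is generic over the pool.

```rust
/// Stand-in transaction type (hypothetical).
#[derive(Default)]
pub struct Txn {
    pub statements: Vec<String>,
}

/// Because `run_interaction` is generic over `R` and `F`, this trait is not
/// dyn-compatible: `Box<dyn DatabasePool>` would be rejected by the compiler.
pub trait DatabasePool {
    fn run_interaction<R, F>(&self, name: &'static str, func: F) -> Result<R, String>
    where
        F: Fn(&mut Txn) -> Result<R, String>;
}

/// Toy pool that runs the callback against an in-memory transaction.
pub struct InMemoryPool;

impl DatabasePool for InMemoryPool {
    fn run_interaction<R, F>(&self, _name: &'static str, func: F) -> Result<R, String>
    where
        F: Fn(&mut Txn) -> Result<R, String>,
    {
        let mut txn = Txn::default();
        func(&mut txn)
    }
}

/// The store names its pool type, so no boxing or type erasure is needed.
pub struct Store<P: DatabasePool> {
    pub db_pool: P,
}

impl<P: DatabasePool> Store<P> {
    /// Example query method that goes through the generic pool.
    pub fn count_statements(&self) -> Result<usize, String> {
        self.db_pool.run_interaction("count_statements", |txn| {
            txn.statements.push("SELECT 1".to_owned());
            Ok(txn.statements.len())
        })
    }
}
```

The cost is that the pool type parameter spreads to everything that holds a `Store`, which is the part that depends on whether `Store` is meant to be shared with synapse-rust-apps.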

Contributor Author

Added d9a111b to see what this looks like

Comment thread rust/src/storage/db/python_db_pool.rs Outdated
Comment thread rust/src/http_client.rs Outdated
Comment thread rust/src/storage/db/python_db_pool.rs Outdated
Comment thread rust/src/storage/db/mod.rs Outdated
) -> impl Future<Output = anyhow::Result<R>> + Send
where
R: Send + 'static,
F: for<'txn> Fn(&'txn mut dyn Transaction) -> BoxFuture<'txn, anyhow::Result<R>>

Member

You could probably relax this to a FnMut?

Contributor Author

Perhaps it's better if we don't allow FnMut, since func can be retried multiple times, or even run and then fail to commit, leaving the outside state mutated.

For example, in the following code where we do this kind of FnMut thing in Python, the approach seems flawed when the transaction is retried: the retry will use the min_stream_id mutated by the previous attempt.

# The minimum stream ID to delete in the next batch, c.f. comment above.
# We default to 0 here as that is less than all possible stream IDs.
min_stream_id = 0

def prune_device_lists_changes_in_room_txn(txn: LoggingTransaction) -> int:
    nonlocal min_stream_id

    delete_sql = """
        DELETE FROM device_lists_changes_in_room
        WHERE stream_id IN (
            SELECT stream_id FROM device_lists_changes_in_room
            WHERE ? < stream_id AND stream_id <= ?
            ORDER BY stream_id ASC
            LIMIT ?
        )
        RETURNING stream_id
    """
    txn.execute(
        delete_sql,
        (min_stream_id, prune_before_stream_id, PRUNE_DEVICE_LISTS_BATCH_SIZE),
    )

    # We can't use rowcount as that is incorrect on SQLite when using
    # RETURNING.
    num_deleted = 0
    for row in txn:
        num_deleted += 1
        min_stream_id = max(min_stream_id, row[0])

    if num_deleted:
        # Update the max pruned stream ID tracking table so that the
        # safety check knows data up to this point has been deleted.
        self.db_pool.simple_update_one_txn(
            txn,
            table="device_lists_changes_in_room_max_pruned_stream_id",
            keyvalues={},
            updatevalues={"stream_id": min_stream_id},
        )

    return num_deleted

As a further hypothetical example (it doesn't exist here): if we tried to use min_stream_id outside of that function and ignored the runInteraction(...) error, we could continue with flawed data.
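
The same hazard can be reproduced in a few lines of Rust. This is a simplified simulation, not Synapse code: `run_with_retries` and `prune` are hypothetical, the "table" is just a slice that is never modified (which models a rolled-back delete), and the first attempt can be forced to fail after it has already advanced `min_stream_id`.

```rust
/// Re-runs `func` until it succeeds or `max_attempts` is reached, like a pool
/// retrying a transaction. Accepting `FnMut` is the hazard being illustrated.
pub fn run_with_retries<R>(
    max_attempts: u32,
    mut func: impl FnMut(u32) -> Result<R, &'static str>,
) -> Result<R, &'static str> {
    let mut last_err = "no attempts made";
    for attempt in 1..=max_attempts {
        match func(attempt) {
            Ok(value) => return Ok(value),
            // The database rolls back; captured Rust state does not.
            Err(err) => last_err = err,
        }
    }
    Err(last_err)
}

/// Mirrors the Python example: advances `min_stream_id` while "deleting" rows
/// and returns the reported delete count plus the final `min_stream_id`.
pub fn prune(rows: &[i64], fail_first_attempt: bool) -> (Result<usize, &'static str>, i64) {
    let mut min_stream_id = 0;
    let result = run_with_retries(2, |attempt| {
        // Bound into the query up front, as in `WHERE ? < stream_id`.
        let lower_bound = min_stream_id;
        let mut num_deleted = 0;
        for &stream_id in rows.iter().filter(|&&id| id > lower_bound) {
            num_deleted += 1;
            min_stream_id = min_stream_id.max(stream_id);
        }
        if fail_first_attempt && attempt == 1 {
            // Simulated commit failure after the outer state was mutated.
            return Err("serialization failure");
        }
        Ok(num_deleted)
    });
    (result, min_stream_id)
}
```

With rows `[1, 2, 3]` and a failing first attempt, the retry starts from `min_stream_id = 3`, matches nothing, and reports zero deletions even though the first attempt's deletes were rolled back. Requiring `Fn` makes this kind of capture a compile error unless interior mutability is used deliberately.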

`_trial_temp/test.log`
```
2026-06-17 19:17:57-0500 [-] synapse.http.server - 147 - ERROR - GET-3 - Failed handle request via 'VersionsRestServlet': <SynapseRequest at 0x7ff02b1cf9d0 method='GET' uri='/_matrix/client/versions' clientproto='1.1' site='test'>
	Traceback (most recent call last):
	  File "/home/eric/Documents/github/element/synapse/synapse/http/server.py", line 335, in _async_render_wrapper
	    callback_return = await self._async_render(request)
	                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	  File "/home/eric/Documents/github/element/synapse/synapse/http/server.py", line 576, in _async_render
	    callback_return = await raw_callback_return
	                      ^^^^^^^^^^^^^^^^^^^^^^^^^
	  File "/home/eric/Documents/github/element/synapse/synapse/rest/client/versions.py", line 81, in on_GET
	    versions_response_body = await self.rust_handlers.versions.get_versions(user_id)
	                                   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^
	RuntimeError: Tokio runtime is not running
```
…work on our async Rust function

Previously, we were running into this problem:
```
SYNAPSE_TEST_LOG_LEVEL=INFO poetry run trial tests.rest.client.test_versions.VersionsTestCase.test_authenticated
tests.rest.client.test_versions
  VersionsTestCase
    test_authenticated ...                                              [ERROR]

===============================================================================
[ERROR]
Traceback (most recent call last):
  File "/home/eric/Documents/github/element/synapse/tests/rest/client/test_versions.py", line 85, in test_authenticated
    channel = self.make_request(
  File "/home/eric/Documents/github/element/synapse/tests/unittest.py", line 619, in make_request
    return make_request(
  File "/home/eric/Documents/github/element/synapse/tests/server.py", line 486, in make_request
    channel.await_result()
  File "/home/eric/Documents/github/element/synapse/tests/server.py", line 310, in await_result
    raise TimedOutException("Timed out waiting for request to finish.")
tests.server.TimedOutException: Timed out waiting for request to finish.

tests.rest.client.test_versions.VersionsTestCase.test_authenticated
-------------------------------------------------------------------------------
Ran 1 tests in 0.182s

FAILED (errors=1)
```
…pool to work on our async Rust function"

This reverts commit 0b6a973.
Fix Tokio thread waiting a different way
Comment on lines +43 to +51
# XXX: We must create the Rust HTTP client before we call `reactor.run()` below.
# Twisted's `MemoryReactor` doesn't invoke `callWhenRunning` callbacks if it's
# already running and we rely on that to start the Tokio thread pool in Rust. In
# the future, this may not matter, see https://github.com/twisted/twisted/pull/12514
self._http_client = hs.get_proxied_http_client()
_ = HttpClient(
    reactor=hs.get_reactor(),
    user_agent=self._http_client.user_agent.decode("utf8"),
)

Contributor Author

As an update, twisted/twisted#12514 is finally part of a Twisted release 26.4.0 (2026-05-11). If we updated our Twisted version, we could probably get rid of all of this ugliness.

But that may be a few years away, given our deprecation policy only considers an upgrade a "no-brainer" once the version is available in both the latest Debian Stable (currently Twisted 24.11.0) and Ubuntu LTS (currently Twisted 25.5.0) repositories.

```
[ERROR]
Traceback (most recent call last):
  File "/home/eric/.cache/pypoetry/virtualenvs/matrix-synapse-xCtC9ulO-py3.14/lib/python3.14/site-packages/twisted/trial/runner.py", line 711, in loadByName
    return self.suiteFactory([self.findByName(name, recurse=recurse)])
  File "/home/eric/.cache/pypoetry/virtualenvs/matrix-synapse-xCtC9ulO-py3.14/lib/python3.14/site-packages/twisted/trial/runner.py", line 474, in findByName
    obj = reflect.namedModule(searchName)
  File "/home/eric/.cache/pypoetry/virtualenvs/matrix-synapse-xCtC9ulO-py3.14/lib/python3.14/site-packages/twisted/python/reflect.py", line 156, in namedModule
    topLevel = __import__(name)
  File "/home/eric/Documents/github/element/synapse/tests/__init__.py", line 24, in <module>
    from synapse.util.patch_inline_callbacks import do_patch
  File "/home/eric/Documents/github/element/synapse/synapse/__init__.py", line 31, in <module>
    from synapse.util.rust import check_rust_lib_up_to_date
  File "/home/eric/Documents/github/element/synapse/synapse/util/__init__.py", line 41, in <module>
    from synapse.types import JsonDict
  File "/home/eric/Documents/github/element/synapse/synapse/types/__init__.py", line 63, in <module>
    from synapse.api.errors import Codes, SynapseError
  File "/home/eric/Documents/github/element/synapse/synapse/api/errors.py", line 33, in <module>
    from synapse.util.json import json_decoder
  File "/home/eric/Documents/github/element/synapse/synapse/util/json.py", line 23, in <module>
    from synapse.synapse_rust.events import JsonObject
  File "/home/eric/Documents/github/element/synapse/synapse/logging/context.py", line 55, in <module>
    from synapse.util.stringutils import random_string_insecure_fast
  File "/home/eric/Documents/github/element/synapse/synapse/util/stringutils.py", line 31, in <module>
    from synapse.api.errors import Codes, SynapseError
builtins.ImportError: cannot import name 'Codes' from partially initialized module 'synapse.api.errors' (most likely due to a circular import) (/home/eric/Documents/github/element/synapse/synapse/api/errors.py)
```
Prompt:
```
Compare this with a previous approach that was swapped out in `d9a111bdfaf51cd4cadfdc071f9956eec47aea43` (look at the removals). Anything we should change/simplify or something you like better/worse about that one? Give the new approach a review and simplify and make things more clear as necessary
```

Changes made (LLM summary):

1. #[async_trait] on the object-safe method. The core trait method is now async fn run_interaction_erased(...) -> ErasedResult instead of a hand-rolled fn ... -> BoxFuture<'a, _>. This is the convention already used by Transaction in the same file, and it removed the async move { ... }.boxed() wrapper (plus the BoxFuture/FutureExt imports) from both pool impls — their bodies are now plain async fns again.
2. Dropped the constrain / erase_interaction helper. I'd over-engineered the closure erasure out of a worry that higher-ranked-lifetime inference would fail. The previous code proved a direct let erased: ErasedInteraction = Box::new(move |txn| {...}) compiles — the type annotation on the binding is enough. ~30 lines gone.
3. Dropped the unneeded + Sync on the erased callback type. ErasedInteraction is + Send only now; the boxed callback is moved to a single DB thread and called there, never shared by reference.

LLM summary:

func only ever needs to be moved to a thread and called there sequentially, never shared by & across threads:

- In the erasure, func is captured by-move into the Box<dyn Fn ... + Send> callback (only Send is required to box it that way).
- The Python pool moves that callback into the runInteraction DB-thread closure and calls it there; the Rust pool calls it within a single async task. Neither sends &func to multiple threads concurrently.
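
A std-only sketch of the `Send` versus `Sync` point: the callback below captures a `Cell`, which is `Send` but not `Sync`, and can still be boxed as `dyn Fn + Send`, moved to a single worker thread, and called there repeatedly. `ErasedCallback`, `make_callback`, and `run_on_db_thread` are illustrative names, not the types from this PR.

```rust
use std::cell::Cell;
use std::thread;

/// Erased callback type: `Send` so it can move to another thread, with no
/// `Sync` bound because it is never shared by reference across threads.
pub type ErasedCallback = Box<dyn Fn() -> u32 + Send>;

/// Builds a callback that captures a `Cell`, which is `Send` but not `Sync`.
pub fn make_callback() -> ErasedCallback {
    let calls = Cell::new(0u32);
    Box::new(move || {
        calls.set(calls.get() + 1);
        calls.get()
    })
}

/// Moves the callback to one worker thread and calls it there sequentially,
/// as a pool retrying an interaction would. Returns the last call's result.
pub fn run_on_db_thread(callback: ErasedCallback, attempts: u32) -> u32 {
    thread::spawn(move || {
        let mut last = 0;
        for _ in 0..attempts {
            last = callback();
        }
        last
    })
    .join()
    .expect("worker thread panicked")
}
```

Adding `+ Sync` to `ErasedCallback` would make `make_callback` fail to compile, which is the kind of caller the looser bound keeps working.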

Previous logic (LLM summary):

> The current logic always prefers captured over outcome, even when outcome is an Err. Consider: callback runs successfully → slot = Some(Ok(value)) → returns None → but then runInteraction's commit fails and the deferred errbacks. Now captured = Some(Ok(value)) and outcome = Err(...), and we'd return Ok(value) — silently swallowing the commit failure.
>
> Whether that's reachable depends on Synapse's retry behavior (it may retry the whole interaction on commit failure, re-running the callback). It's likely a narrow edge case, but the comment ("we only trust this slot once the deferred has fired") implies the slot is authoritative once fired, which isn't quite true if the deferred fires with an error after a successful callback run.
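
One way to avoid the edge case described in the quote, sketched under assumptions: check the deferred's outcome first and only then trust the captured slot. The function name `resolve`, the `String` error type, and the exact shapes of `captured` and `outcome` are hypothetical; this is not necessarily what the PR ended up doing.

```rust
/// Combines the value captured by the callback with the final outcome of the
/// interaction. `captured` is what the callback stored in the slot; `outcome`
/// is how the deferred fired.
pub fn resolve<T>(
    captured: Option<Result<T, String>>,
    outcome: Result<(), String>,
) -> Result<T, String> {
    // A failed deferred (for example a failed commit) always wins, even if
    // the callback itself had already stored a successful value.
    outcome?;
    match captured {
        Some(result) => result,
        None => Err("interaction finished without running the callback".to_owned()),
    }
}
```

With this ordering, `resolve(Some(Ok(value)), Err(commit_error))` surfaces the commit error instead of silently returning `Ok(value)`.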
Comment on lines +16 to +17
// TODO: remove. This is just here to make sure our `DatabasePool`/`Transaction`
// interfaces are compatible with `tokio-postgres`.

Contributor Author

This is just here to prove it's possible (no scrutiny given to this)

We can care about the implementation in synapse-rust-apps when we actually care about using it

Comment thread rust/src/deferred.rs
Comment on lines +336 to +337
// We can't check this here because of circular import issues
// logging_context_module(py)?;

Contributor Author

Perhaps this will get fixed by #19876 (comment)

Comment thread synapse/config/room.py
RoomCreationPreset.TRUSTED_PRIVATE_CHAT,
RoomCreationPreset.PUBLIC_CHAT,
]
}

Contributor Author

Made this a set to better represent what it is and to allow more efficient lookups. Touched this because I had to model it in SynapseHomeServerConfig on the Rust side.

@MadLittleMods

Extracted things into a reviewable PR in #19878
