Concept: Rust database access via Python database connection pool (via runInteraction) (LLM splat v1) #19846
    name: &'static str,
    func: InteractionFn,
) -> AnyResult;
}
From my high-level reading of things, the type erasure stuff here is because we want `DatabasePool` to be dyn-compatible because we currently store it like this:
pub struct Store {
    pub config: SynapseConfig,
    pub db_pool: Box<dyn DatabasePool>,
}
If we drop that constraint and instead store the `DatabasePool` like the following, all of the erasure complexity can fall away.
pub struct Store<P: DatabasePool> {
    pub config: SynapseConfig,
    pub db_pool: P,
}
I think the answer depends on whether we care about the `Store` being Synapse specific or if that should also be shared with `synapse-rust-apps`.
I think `synapse-rust-apps` might have its own `Store` implementation given we already store things that are Synapse specific in it like `SynapseConfig`. Although that could be refactored to be different 🤔
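To make the trade-off concrete, here is a minimal sketch of the generic `Store<P>` variant. Everything in it is a simplified, hypothetical stand-in rather than the PR's code: the trait is synchronous, `Transaction` is a plain struct, and `InMemoryPool` and `count_queries` exist only for illustration. The point is that a trait with a generic method is not dyn-compatible, so it can only be stored behind a type parameter like this, not as `Box<dyn DatabasePool>`.

```rust
// Hypothetical, simplified stand-ins for the real types in this PR.
pub struct SynapseConfig {
    pub server_name: String,
}

pub struct Transaction {
    pub queries_run: u32,
}

// The generic method makes this trait non-dyn-compatible: `Box<dyn DatabasePool>`
// would be rejected, but `Store<P: DatabasePool>` is fine.
pub trait DatabasePool {
    fn run_interaction<R, F>(&self, name: &'static str, func: F) -> R
    where
        F: Fn(&mut Transaction) -> R;
}

// Toy pool that hands the callback a fresh transaction.
pub struct InMemoryPool;

impl DatabasePool for InMemoryPool {
    fn run_interaction<R, F>(&self, _name: &'static str, func: F) -> R
    where
        F: Fn(&mut Transaction) -> R,
    {
        let mut txn = Transaction { queries_run: 0 };
        func(&mut txn)
    }
}

// Generic over the pool, so no type erasure is needed.
pub struct Store<P: DatabasePool> {
    pub config: SynapseConfig,
    pub db_pool: P,
}

impl<P: DatabasePool> Store<P> {
    // Runs one pretend query and reports how many ran in this transaction.
    pub fn count_queries(&self) -> u32 {
        self.db_pool.run_interaction("count_queries", |txn| {
            txn.queries_run += 1;
            txn.queries_run
        })
    }
}

fn main() {
    let store = Store {
        config: SynapseConfig { server_name: "example.org".to_owned() },
        db_pool: InMemoryPool,
    };
    println!("{} ran {} query", store.config.server_name, store.count_queries());
}
```

The cost of this shape is that the type parameter spreads to everything that holds a `Store`, which is why the answer depends on whether `Store` needs to be shared between projects.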
Added d9a111b to see what this looks like
) -> impl Future<Output = anyhow::Result<R>> + Send
where
    R: Send + 'static,
    F: for<'txn> Fn(&'txn mut dyn Transaction) -> BoxFuture<'txn, anyhow::Result<R>>
You could probably relax this to a FnMut?
Perhaps it's better if we don't allow `FnMut`, as `func` can be retried multiple times, or even run and then fail to commit, leaving outside state mutated.
For example, we do this kind of `FnMut` thing in Python in the snippet below, which seems flawed when the transaction is retried: the retry will use the `min_stream_id` mutated by the previous attempt.
synapse/synapse/storage/databases/main/devices.py
Lines 2557 to 2596 in 4e9f775
As a further example that doesn't exist here: if we used `min_stream_id` outside of that function and ignored the `runInteraction(...)` error, we could continue with flawed data.
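The hazard can be reproduced in a few lines of Rust. This is a toy sketch under stated assumptions: `Transaction`, both `run_with_one_retry*` helpers, and the "first commit always fails" behaviour are hypothetical and only mimic a retried interaction. `min_stream_id` borrows its name from the Python example but none of its logic.

```rust
// Hypothetical stand-in for a database transaction.
#[derive(Default)]
struct Transaction;

// Toy pool whose first commit "fails", so the callback always runs twice.
// `FnMut` is accepted here only to demonstrate the hazard.
fn run_with_one_retry<R>(mut func: impl FnMut(&mut Transaction) -> R) -> R {
    let _discarded = func(&mut Transaction::default()); // attempt 1: commit fails
    func(&mut Transaction::default()) // attempt 2: commit succeeds
}

// Same retry behaviour, but with the stricter `Fn` bound.
fn run_with_one_retry_fn<R>(func: impl Fn(&mut Transaction) -> R) -> R {
    let _discarded = func(&mut Transaction::default());
    func(&mut Transaction::default())
}

// Returns (value the committed attempt started from, outer state afterwards).
fn leaky_fn_mut() -> (u64, u64) {
    let mut min_stream_id: u64 = 10;
    let started_from = run_with_one_retry(|_txn| {
        let started_from = min_stream_id;
        min_stream_id += 5; // pretend we processed a batch of 5 rows
        started_from
    });
    (started_from, min_stream_id)
}

// With `Fn`, the callback cannot mutate captured state, so every attempt
// starts from the same input and the new value is returned instead.
fn retry_safe_fn() -> (u64, u64) {
    let min_stream_id: u64 = 10;
    run_with_one_retry_fn(|_txn| (min_stream_id, min_stream_id + 5))
}

fn main() {
    // The committed attempt started from 15, not 10, and the outer state
    // advanced twice even though only one attempt committed.
    assert_eq!(leaky_fn_mut(), (15, 20));
    // Every attempt starts from 10 and the caller receives the next value.
    assert_eq!(retry_safe_fn(), (10, 15));
    println!("FnMut: {:?}, Fn: {:?}", leaky_fn_mut(), retry_safe_fn());
}
```

Keeping the bound at `Fn` turns this class of bug into a compile error: the mutating closure in `leaky_fn_mut` would not be accepted by `run_with_one_retry_fn`.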
`_trial_temp/test.log`
```
2026-06-17 19:17:57-0500 [-] synapse.http.server - 147 - ERROR - GET-3 - Failed handle request via 'VersionsRestServlet': <SynapseRequest at 0x7ff02b1cf9d0 method='GET' uri='/_matrix/client/versions' clientproto='1.1' site='test'>
Traceback (most recent call last):
  File "/home/eric/Documents/github/element/synapse/synapse/http/server.py", line 335, in _async_render_wrapper
    callback_return = await self._async_render(request)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eric/Documents/github/element/synapse/synapse/http/server.py", line 576, in _async_render
    callback_return = await raw_callback_return
                      ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eric/Documents/github/element/synapse/synapse/rest/client/versions.py", line 81, in on_GET
    versions_response_body = await self.rust_handlers.versions.get_versions(user_id)
                             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^
RuntimeError: Tokio runtime is not running
```
…work on our async Rust function
Previously, we were running into this problem:
```
SYNAPSE_TEST_LOG_LEVEL=INFO poetry run trial tests.rest.client.test_versions.VersionsTestCase.test_authenticated
tests.rest.client.test_versions
VersionsTestCase
test_authenticated ... [ERROR]
===============================================================================
[ERROR]
Traceback (most recent call last):
File "/home/eric/Documents/github/element/synapse/tests/rest/client/test_versions.py", line 85, in test_authenticated
channel = self.make_request(
File "/home/eric/Documents/github/element/synapse/tests/unittest.py", line 619, in make_request
return make_request(
File "/home/eric/Documents/github/element/synapse/tests/server.py", line 486, in make_request
channel.await_result()
File "/home/eric/Documents/github/element/synapse/tests/server.py", line 310, in await_result
raise TimedOutException("Timed out waiting for request to finish.")
tests.server.TimedOutException: Timed out waiting for request to finish.
tests.rest.client.test_versions.VersionsTestCase.test_authenticated
-------------------------------------------------------------------------------
Ran 1 tests in 0.182s
FAILED (errors=1)
```
…pool to work on our async Rust function" This reverts commit 0b6a973.
Fix Tokio thread waiting a different way
# XXX: We must create the Rust HTTP client before we call `reactor.run()` below.
# Twisted's `MemoryReactor` doesn't invoke `callWhenRunning` callbacks if it's
# already running and we rely on that to start the Tokio thread pool in Rust. In
# the future, this may not matter, see https://github.com/twisted/twisted/pull/12514
self._http_client = hs.get_proxied_http_client()
_ = HttpClient(
    reactor=hs.get_reactor(),
    user_agent=self._http_client.user_agent.decode("utf8"),
)
As an update, twisted/twisted#12514 is finally part of a Twisted release 26.4.0 (2026-05-11). If we updated our Twisted version, we could probably get rid of all of this ugliness.
But that may be a few years away, given that our deprecation policy only considers an upgrade a "no-brainer" once the version is available in both the latest Debian Stable (currently Twisted 24.11.0) and Ubuntu LTS (currently Twisted 25.5.0) repositories.
```
[ERROR]
Traceback (most recent call last):
File "/home/eric/.cache/pypoetry/virtualenvs/matrix-synapse-xCtC9ulO-py3.14/lib/python3.14/site-packages/twisted/trial/runner.py", line 711, in loadByName
return self.suiteFactory([self.findByName(name, recurse=recurse)])
File "/home/eric/.cache/pypoetry/virtualenvs/matrix-synapse-xCtC9ulO-py3.14/lib/python3.14/site-packages/twisted/trial/runner.py", line 474, in findByName
obj = reflect.namedModule(searchName)
File "/home/eric/.cache/pypoetry/virtualenvs/matrix-synapse-xCtC9ulO-py3.14/lib/python3.14/site-packages/twisted/python/reflect.py", line 156, in namedModule
topLevel = __import__(name)
File "/home/eric/Documents/github/element/synapse/tests/__init__.py", line 24, in <module>
from synapse.util.patch_inline_callbacks import do_patch
File "/home/eric/Documents/github/element/synapse/synapse/__init__.py", line 31, in <module>
from synapse.util.rust import check_rust_lib_up_to_date
File "/home/eric/Documents/github/element/synapse/synapse/util/__init__.py", line 41, in <module>
from synapse.types import JsonDict
File "/home/eric/Documents/github/element/synapse/synapse/types/__init__.py", line 63, in <module>
from synapse.api.errors import Codes, SynapseError
File "/home/eric/Documents/github/element/synapse/synapse/api/errors.py", line 33, in <module>
from synapse.util.json import json_decoder
File "/home/eric/Documents/github/element/synapse/synapse/util/json.py", line 23, in <module>
from synapse.synapse_rust.events import JsonObject
File "/home/eric/Documents/github/element/synapse/synapse/logging/context.py", line 55, in <module>
from synapse.util.stringutils import random_string_insecure_fast
File "/home/eric/Documents/github/element/synapse/synapse/util/stringutils.py", line 31, in <module>
from synapse.api.errors import Codes, SynapseError
builtins.ImportError: cannot import name 'Codes' from partially initialized module 'synapse.api.errors' (most likely due to a circular import) (/home/eric/Documents/github/element/synapse/synapse/api/errors.py)
```
…-db-pool-run-interaction-llm1
Prompt:
```
Compare this with a previous approach that was swapped out in `d9a111bdfaf51cd4cadfdc071f9956eec47aea43` (look at the removals). Anything we should change/simplify or something you like better/worse about that one? Give the new approach a review and simplify and make things more clear as necessary
```
Changes made (LLM summary):
1. #[async_trait] on the object-safe method. The core trait method is now async fn run_interaction_erased(...) -> ErasedResult instead of a hand-rolled fn ... -> BoxFuture<'a, _>. This is the convention already used by Transaction in the same file, and it removed the async move { ... }.boxed() wrapper (plus the BoxFuture/FutureExt imports) from both pool impls — their bodies are now plain async fns again.
2. Dropped the constrain / erase_interaction helper. I'd over-engineered the closure erasure out of a worry that higher-ranked-lifetime inference would fail. The previous code proved a direct let erased: ErasedInteraction = Box::new(move |txn| {...}) compiles — the type annotation on the binding is enough. ~30 lines gone.
3. Dropped the unneeded + Sync on the erased callback type. ErasedInteraction is + Send only now; the boxed callback is moved to a single DB thread and called there, never shared by reference.
LLM summary: `func` only ever needs to be moved to a thread and called there sequentially, never shared by `&` across threads:
- In the erasure, `func` is captured by-move into the `Box<dyn Fn ... + Send>` callback (only `Send` is required to box it that way).
- The Python pool moves that callback into the `runInteraction` DB-thread closure and calls it there; the Rust pool calls it within a single async task.
Neither sends `&func` to multiple threads concurrently.
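A small std-only sketch of points 2 and 3, under stated assumptions: this `ErasedInteraction` alias is a synchronous simplification of the PR's type, and `Transaction`, `build_interaction` and `run_on_db_thread` are hypothetical. It shows that the type annotation on the binding is enough for the boxed closure to be accepted, and that a callback which is `Send` but not `Sync` (it captures a `Cell`) can still be moved to another thread and called there.

```rust
use std::cell::Cell;
use std::thread;

// Hypothetical stand-in for a database transaction.
struct Transaction {
    rows: Vec<u32>,
}

// Erased callback: `Send` so it can be moved to another thread, but not `Sync`.
type ErasedInteraction = Box<dyn for<'txn> Fn(&'txn mut Transaction) -> usize + Send>;

fn build_interaction() -> ErasedInteraction {
    // `Cell` is `Send` but not `Sync`, so the closure capturing it by move is
    // `Send` but not `Sync` either.
    let calls = Cell::new(0u32);
    // The annotation on the binding lets the closure be inferred as
    // higher-ranked over the `&mut Transaction` lifetime; no helper is needed.
    let erased: ErasedInteraction = Box::new(move |txn| {
        calls.set(calls.get() + 1);
        txn.rows.len()
    });
    erased
}

fn run_on_db_thread(func: ErasedInteraction) -> usize {
    // The callback is moved into the thread and only ever called from there,
    // so it is never shared by reference across threads.
    thread::spawn(move || {
        let mut txn = Transaction { rows: vec![1, 2, 3] };
        func(&mut txn)
    })
    .join()
    .expect("db thread panicked")
}

fn main() {
    let rows_seen = run_on_db_thread(build_interaction());
    println!("rows seen on db thread: {rows_seen}");
}
```

Adding `+ Sync` to the alias would reject `build_interaction` as written, which is a stricter requirement than the move-then-call pattern needs.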
> This concept was formerly known as object safety.
>
> *-- https://doc.rust-lang.org/reference/items/traits.html#dyn-compatibility*
Previous logic (LLM summary):
> The current logic always prefers captured over outcome, even when outcome is an Err. Consider: callback runs successfully → slot = Some(Ok(value)) → returns None → but then runInteraction's commit fails and the deferred errbacks. Now captured = Some(Ok(value)) and outcome = Err(...), and we'd return Ok(value) — silently swallowing the commit failure.
>
> Whether that's reachable depends on Synapse's retry behavior (it may retry the whole interaction on commit failure, re-running the callback). It's likely a narrow edge case, but the comment ("we only trust this slot once the deferred has fired") implies the slot is authoritative once fired, which isn't quite true if the deferred fires with an error after a successful callback run.
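One way to avoid swallowing the commit failure is to let the deferred's outcome take precedence over the captured slot. The sketch below is illustrative only: `resolve_interaction`, its `String` errors and the "callback never ran" message are hypothetical, and the real code deals with a Python deferred and `anyhow` errors.

```rust
// `captured` is what the callback stored in the slot (if it ran at all);
// `outcome` is how the deferred finished (e.g. `Err` if the commit failed).
fn resolve_interaction<T>(
    captured: Option<Result<T, String>>,
    outcome: Result<(), String>,
) -> Result<T, String> {
    // An error from the deferred always wins, even if the callback itself
    // produced `Ok`, so a failed commit can never be reported as success.
    outcome?;
    // The deferred succeeded, so the slot can now be trusted.
    captured.unwrap_or_else(|| {
        Err("interaction finished without running the callback".to_owned())
    })
}

fn main() {
    // Callback succeeded but the commit failed: the commit error is returned.
    let swallowed = resolve_interaction(Some(Ok(42)), Err("commit failed".to_owned()));
    assert_eq!(swallowed, Err("commit failed".to_owned()));

    // Callback succeeded and the commit succeeded: the value is returned.
    assert_eq!(resolve_interaction(Some(Ok(42)), Ok(())), Ok(42));
    println!("commit failures are not swallowed");
}
```

Whether the deferred's error should also win over an error returned by the callback is a design choice; this sketch prefers the deferred in both cases.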
// TODO: remove. This is just here to make sure our `DatabasePool`/`Transaction`
// interfaces are compatible with `tokio-postgres`.
This is just here to prove it's possible (no scrutiny given to this)
We can care about the implementation in synapse-rust-apps when we actually care about using it
// We can't check this here because of circular import issues
// logging_context_module(py)?;
Perhaps this will get fixed by #19876 (comment)
    RoomCreationPreset.TRUSTED_PRIVATE_CHAT,
    RoomCreationPreset.PUBLIC_CHAT,
]
}
Made these a set to better represent what they are and to get more efficient lookups. Touched this because I had to model it in `SynapseHomeServerConfig` on the Rust side
…-db-pool-run-interaction-llm1
Extracted things into a reviewable PR in #19878
Note: This iteration isn't meant to be merged at all and is just meant to see if it's possible. Just splatting this here for the reference. It does seem to work (see testing strategy) although I haven't inspected the code. I just asked an LLM to take my previous non-working structure and make it work.
Rust database access via Python database connection pool.
This is a stepping stone before we can go full Rust everywhere. We're providing a generic interface as we want database access to work in Synapse and `synapse-rust-apps`. In `synapse-rust-apps`, we will use a `tokio-postgres` based database connection pool so it's full Rust.
We want to avoid the situation where we have two database connection pools (one for Python, one for Rust) as we've run into connection exhaustion problems on Matrix.org before.
Why `runInteraction(...)`?
This is a variation of #19824 instead using the `runInteraction(...)` pattern we already have in Synapse. This means everything that uses the transaction happens in the function callback which enables retries for serialization/deadlock errors when using repeatable-read isolation.
Using the same `runInteraction` pattern also means we can port over existing Synapse code/endpoints without much thought.
This strategy is less of an impedance mismatch (aligns more closely) with Synapse so the glue code should also be simpler.
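The retry behaviour described above can be sketched in plain Rust. This is a minimal, std-only illustration and not the PR's implementation: `Transaction`, `TxnError`, `MAX_ATTEMPTS`, the `block_on` helper and the always-failing first commit are hypothetical, `BoxFuture` is a local alias with the same shape as the `futures` crate type, and the real `run_interaction` also takes a `name`.

```rust
use std::future::Future;
use std::pin::{pin, Pin};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Local alias with the same shape as `futures::future::BoxFuture`.
type BoxFuture<'a, T> = Pin<Box<dyn Future<Output = T> + Send + 'a>>;

#[derive(Debug, PartialEq)]
enum TxnError {
    SerializationFailure,
}

// Hypothetical transaction whose first commit hits a serialization failure.
struct Transaction {
    attempt: u32,
}

impl Transaction {
    async fn select_one(&mut self) -> Result<i64, TxnError> {
        Ok(1)
    }

    fn commit(self) -> Result<(), TxnError> {
        if self.attempt == 1 { Err(TxnError::SerializationFailure) } else { Ok(()) }
    }
}

const MAX_ATTEMPTS: u32 = 5;

// All work on the transaction happens inside `func`, so the pool can re-run
// the whole callback on a fresh transaction. Returns the value and the
// attempt number that committed.
async fn run_interaction<R, F>(func: F) -> Result<(R, u32), TxnError>
where
    F: for<'txn> Fn(&'txn mut Transaction) -> BoxFuture<'txn, Result<R, TxnError>>,
{
    let mut attempt = 1;
    loop {
        let mut txn = Transaction { attempt };
        let value = func(&mut txn).await?;
        match txn.commit() {
            Ok(()) => return Ok((value, attempt)),
            Err(TxnError::SerializationFailure) if attempt < MAX_ATTEMPTS => attempt += 1,
            Err(err) => return Err(err),
        }
    }
}

// Tiny busy-polling executor; good enough here because nothing ever pends.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

fn block_on<F: Future>(future: F) -> F::Output {
    let mut future = pin!(future);
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(output) = future.as_mut().poll(&mut cx) {
            return output;
        }
    }
}

fn demo() -> Result<(i64, u32), TxnError> {
    block_on(run_interaction(|txn| {
        Box::pin(async move { txn.select_one().await })
    }))
}

fn main() {
    // The first commit fails, so the callback is re-run and attempt 2 commits.
    assert_eq!(demo(), Ok((1, 2)));
    println!("{:?}", demo());
}
```

Because the callback owns no state beyond what it is handed in `txn`, re-running it is safe, which is what makes retrying on serialization or deadlock errors possible.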
Testing strategy
Added some tests that exercise some `async` Rust handlers for the `/versions` endpoint:
Real-world:
1. `poetry run synapse_homeserver --config-path homeserver.yaml`
2. `GET http://localhost:8008/_matrix/client/versions`
Logging
We already support normal logging from Rust -> Python with `pyo3-log` and `log`:
For example, if you add this inside `async fn build_versions_response(...) { ... }`
It's logged out on the Python side although it's logged against the `sentinel` `LoggingContext` because we're crossing a thread boundary and losing the association to the `LoggingContext` on the main thread:
But if you use `create_deferred(...)` to start running some async Rust function, we will capture the calling logcontext as a Tokio task local which can be referenced by any code running on that other thread.
For example, when using `run_python_awaitable(...)` we activate the logcontext stored in the Tokio task local. This is why we get database metrics tracked when using `run_interaction(...)`.
Similarly, we can add some sort of helper to wrap any place we want to log that will activate the correct logcontext for that block in Rust. I think this is best left as a follow-up where we can specifically focus on the approach.
For the `Processed request` log lines:
`run_interaction(...)` activates the correct logcontext and calls `runInteraction(...)` on the Python side, everything will be counted like normal
Dev notes
LLM Prompt
The goal is to achieve Rust database access via the Python database connection pool.
We're providing a generic interface as we want database access to work in Synapse and a separate project `synapse-rust-apps`. In `synapse-rust-apps`, we will use a `tokio-postgres` based database connection pool so it's full Rust. This is what's being prototyped in `rust/src/storage/db/rust_db_pool.rs` vs the actual implementation we'll keep here in `rust/src/storage/db/python_db_pool.rs`.
We want to stick with the `runInteraction(...)` pattern from Synapse as it allows us to retry transactions in a world where we're using repeatable-read isolation level (serialization and deadlock errors).
I'm in the middle of getting it working and most of it is just non-working structure. Please finish the rest.
As an example, we're doing some Rust database access in `rust/src/handlers/versions.rs` -> `rust/src/storage/store.rs`
Transaction retries were originally introduced into Synapse for deadlocks which makes me question why they're even happening in the first place. How are other people dealing with this kind of problem? (poorly, not a problem, retries like us, etc)
@reivilibre points out that we're more likely to hit retries because of serialization errors which aligns with the Postgres docs (we use the repeatable read isolation level in Synapse):
Usage:
db_pool
    .run_interaction(|txn| {
        async move { /* do stuff with txn */ }
        .boxed()
    })
Or equivalent (if preferred):
db_pool
    .run_interaction(|txn| {
        Box::pin(async move { /* do stuff with txn */ })
    })
`AsyncFn` (Async closure) with `Send` trait doesn't seem very viable:
Other PR where I dealt with deferred (`make_deferred_yieldable`) and async in Rust: #18903
Fix `MemoryReactor.callWhenRunning` not invoking callbacks if already started, twisted/twisted#12514 - this is relevant because we need to start the tokio thread pool for these tests.
PyO3 Rust -> Python logging: https://pyo3.rs/main/ecosystem/logging
Todo
`LoggingContext` (log context) interact
Pull Request Checklist