
Commit 098b998

feat: Sprint 3 — observability, concurrency tests, and production docs
- Add Prometheus Summary metrics to PrometheusMetricsRecorder with configurable quantiles (p50/p95/p99) via enable_summaries flag
- Export DEFAULT_QUANTILES constant from pharox public API
- Add threading stress tests for concurrent acquire/release scenarios
- Add geo utility tests for haversine_distance_km
- Add three ADRs: callbacks over events, sync IStorage, SQLAlchemy Core
- Add OpenTelemetry integration guide and production deployment checklist
- Set pytest coverage threshold to 85% (postgres adapter excluded)
1 parent b0ef5f2 commit 098b998
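The commit message mentions new tests for `haversine_distance_km`. For reference, a standard haversine implementation matching that test name (the exact signature in pharox is assumed here) looks like:

```python
import math

def haversine_distance_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    r = 6371.0  # mean Earth radius, km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

d = haversine_distance_km(48.8566, 2.3522, 51.5074, -0.1278)  # Paris -> London, ~344 km
```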

12 files changed: 574 additions, 6 deletions
Lines changed: 40 additions & 0 deletions
# ADR 001 — Callbacks over an event system

**Status:** Accepted
**Date:** 2025-11-02

## Context

`ProxyManager` needs to let callers plug in observability (logging, metrics) and
custom business logic (notifications, auditing) without hard-coding those concerns
into the core. Two common approaches were considered:

1. **Event bus / pub-sub** — a central dispatcher; handlers subscribe by event type.
2. **Callback lists** — callers register plain callables; the manager invokes them
   at defined lifecycle points (after acquire, after release).

## Decision

Use typed callback lists (`List[Callable[[Payload], None]]`) registered via
`register_acquire_callback` and `register_release_callback`.

## Rationale

- **Zero extra dependencies.** A callback list needs no third-party event library.
- **Explicit typing.** The payload types (`AcquireEventPayload`,
  `ReleaseEventPayload`) are Pydantic models — mypy can verify handler signatures.
- **Simple mental model.** Callbacks fire in registration order; there is no
  routing, priority, or async dispatch complexity.
- **Sufficient for the use cases on record** — metrics recording and structured
  logging only need the two lifecycle hooks.

An event bus would add value if many unrelated subsystems need to observe the
same events. For the current scope (one manager, a handful of optional callbacks),
the overhead is not justified.

## Consequences

- Adding a new lifecycle hook requires a new callback list and a new `register_*`
  method on `ProxyManager` — a small but explicit change.
- Callers that need async callbacks must wrap the synchronous call in
  `asyncio.to_thread` themselves; the manager remains sync-only.
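The pattern this ADR chooses can be sketched in a few lines. The `Manager` class and the payload fields below are simplified stand-ins for illustration, not pharox's actual implementation:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class AcquireEventPayload:
    # Simplified stand-in for the real Pydantic payload model.
    pool_name: str
    duration_ms: float

class Manager:
    """Toy manager showing the callback-list pattern from the ADR."""

    def __init__(self) -> None:
        self._acquire_callbacks: List[Callable[[AcquireEventPayload], None]] = []

    def register_acquire_callback(
        self, cb: Callable[[AcquireEventPayload], None]
    ) -> None:
        self._acquire_callbacks.append(cb)

    def acquire(self, pool_name: str) -> None:
        payload = AcquireEventPayload(pool_name=pool_name, duration_ms=0.4)
        # Callbacks fire synchronously, in registration order.
        for cb in self._acquire_callbacks:
            cb(payload)

seen: List[str] = []
m = Manager()
m.register_acquire_callback(lambda p: seen.append("metrics:" + p.pool_name))
m.register_acquire_callback(lambda p: seen.append("logging:" + p.pool_name))
m.acquire("eu-west")  # seen == ["metrics:eu-west", "logging:eu-west"]
```

Because handlers are plain callables, mypy can check each registration against the payload type with no dispatcher in between.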
Lines changed: 44 additions & 0 deletions
# ADR 002 — Synchronous `IStorage` interface

**Status:** Accepted
**Date:** 2025-11-02

## Context

`IStorage` defines the persistence contract used by `ProxyManager` and
`HealthCheckOrchestrator`. Python has two concurrency models relevant here:

- **Sync / threading** — blocking I/O, standard with WSGI servers and
  `concurrent.futures`.
- **Async / asyncio** — non-blocking I/O, standard with ASGI servers and
  `asyncio`-based workers.

The choice of sync vs. async propagates through the entire call chain.

## Decision

`IStorage` is a synchronous (blocking) interface. All abstract methods return
values directly; none are coroutines.

## Rationale

- **Broader compatibility.** Sync adapters work inside both threaded and async
  runtimes (via `asyncio.to_thread`). An async-only interface would exclude
  threaded hosts without a wrapper.
- **Simpler adapter authoring.** SQLAlchemy Core with a synchronous engine is the
  most common production setup. Sync adapters avoid the complexity of
  `async_sessionmaker` and connection pool lifecycle management.
- **ProxyManager usage pattern.** Lease acquisition is typically short (sub-ms for
  in-memory, low-ms for Postgres). Blocking a thread for that duration is
  acceptable in most real-world scenarios.
- **Async helpers already exist.** `async_helpers.py` wraps `ProxyManager`
  operations in `asyncio.to_thread`, giving async callers a clean interface
  without requiring an async `IStorage`.

## Consequences

- High-throughput async services that want non-blocking storage access should
  use the async wrappers or implement `IAsyncStorage` separately (tracked in
  the backlog as a Sprint 4 item).
- SQLAlchemy async engine users will need a custom adapter; the provided
  `PostgresStorage` reference implementation uses a synchronous engine.
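The sync-interface-plus-`asyncio.to_thread` arrangement can be sketched as follows. The `get_proxy` method and its return shape are hypothetical here, used only to show the wrapping pattern, not pharox's actual `IStorage` contract:

```python
import asyncio
import time
from abc import ABC, abstractmethod

class IStorage(ABC):
    """Simplified sketch of a sync contract: methods block and return directly."""

    @abstractmethod
    def get_proxy(self, proxy_id: str) -> dict: ...

class InMemoryStorage(IStorage):
    def get_proxy(self, proxy_id: str) -> dict:
        time.sleep(0.01)  # stand-in for blocking I/O
        return {"id": proxy_id, "healthy": True}

async def get_proxy_async(storage: IStorage, proxy_id: str) -> dict:
    # Async callers push the blocking call onto the default thread pool,
    # keeping the event loop free while the sync adapter does its I/O.
    return await asyncio.to_thread(storage.get_proxy, proxy_id)

result = asyncio.run(get_proxy_async(InMemoryStorage(), "p1"))
# result == {"id": "p1", "healthy": True}
```

The same adapter instance serves both threaded callers (directly) and async callers (through the wrapper), which is the compatibility argument the Rationale makes.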
Lines changed: 41 additions & 0 deletions
# ADR 003 — SQLAlchemy Core over ORM for the Postgres adapter

**Status:** Accepted
**Date:** 2025-11-08

## Context

The reference `PostgresStorage` adapter needs to issue SQL queries for proxy
lookup, lease creation, health-check updates, and pool statistics. SQLAlchemy
offers two layers:

- **ORM** — declares Python classes mapped to tables; rich relationship loading;
  session-based identity map.
- **Core** — table/column objects and composable SQL expression language; thin
  wrapper over the DBAPI.

## Decision

Use **SQLAlchemy Core** with explicit `Table` definitions (`tables.py`) and
parameterised `select`/`insert`/`update` statements.

## Rationale

- **No identity map overhead.** Proxy and lease objects are value objects — they
  are re-created from database rows on every read. An ORM session tracking
  identity is unnecessary and would add memory and lock-management overhead.
- **Explicit SQL.** Core statements are close to the generated SQL, making
  performance characteristics easy to reason about. `SKIP LOCKED`, window
  functions, and advisory locks are straightforward to express.
- **Pydantic owns validation.** Domain models (`Proxy`, `Lease`, etc.) are
  Pydantic models. Having a parallel set of ORM-mapped Python classes would
  create two representations of the same domain with synchronisation risk.
- **Alembic compatibility.** `MetaData` + `Table` definitions integrate cleanly
  with Alembic migrations regardless of whether ORM is used.

## Consequences

- Developers writing custom adapters for other databases follow the same
  Core pattern; there is no ORM base class to inherit from.
- Relationship traversal (e.g., joining pool → proxies → leases) must be
  written as explicit joins rather than relying on ORM `relationship()`.
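A minimal Core-style sketch of the decision above, runnable against in-memory SQLite (the table name and columns are illustrative, not pharox's actual `tables.py` schema):

```python
from sqlalchemy import (
    Boolean, Column, MetaData, String, Table, create_engine, insert, select,
)

metadata = MetaData()

# Table objects, not ORM classes: rows come back as plain Rows and can be
# re-validated into Pydantic models on every read, so no identity map is needed.
proxies = Table(
    "proxies",
    metadata,
    Column("id", String, primary_key=True),
    Column("pool", String, nullable=False),
    Column("healthy", Boolean, nullable=False, default=True),
)

# SQLite here only keeps the sketch self-contained; the real adapter targets Postgres.
engine = create_engine("sqlite+pysqlite:///:memory:")
metadata.create_all(engine)

with engine.begin() as conn:
    conn.execute(insert(proxies), [{"id": "p1", "pool": "eu"}, {"id": "p2", "pool": "us"}])
    # On Postgres, a lease query would add .with_for_update(skip_locked=True)
    # so concurrent workers never contend for the same row.
    rows = conn.execute(select(proxies).where(proxies.c.pool == "eu")).all()
```

The same `metadata` object is what Alembic's autogenerate would compare against, which is the compatibility point the Rationale notes.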

docs/how-to/opentelemetry.md

Lines changed: 95 additions & 0 deletions
---
title: OpenTelemetry Integration
description: Emit traces and spans from Pharox via the OpenTelemetry SDK.
---

# OpenTelemetry Integration

Pharox ships with Prometheus metrics out of the box, but you can layer
OpenTelemetry traces on top by hooking into the acquire/release callbacks.

## Prerequisites

```bash
pip install opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc
```

## Basic Setup

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("pharox")
```

## Acquire Callback

Register a callback that opens a span for every proxy acquisition:

```python
from pharox import ProxyManager
from pharox.models import AcquireEventPayload

def otel_acquire_callback(payload: AcquireEventPayload) -> None:
    status = "hit" if payload.lease else "miss"
    with tracer.start_as_current_span("pharox.acquire") as span:
        span.set_attribute("pharox.pool", payload.pool_name)
        span.set_attribute("pharox.consumer", payload.consumer_name or "")
        span.set_attribute("pharox.status", status)
        span.set_attribute("pharox.duration_ms", payload.duration_ms)

manager = ProxyManager(storage=...)
manager.register_acquire_callback(otel_acquire_callback)
```

## Release Callback

```python
from pharox.models import ReleaseEventPayload

def otel_release_callback(payload: ReleaseEventPayload) -> None:
    with tracer.start_as_current_span("pharox.release") as span:
        span.set_attribute("pharox.pool", payload.pool_name)
        span.set_attribute("pharox.lease_duration_ms", payload.lease_duration_ms)

manager.register_release_callback(otel_release_callback)
```

## Propagating an Existing Span Context

If the calling code already holds an active span, the callbacks run inside that
context automatically — the `with tracer.start_as_current_span(...)` call will
create a child span of whatever is current in the calling thread.

For explicit propagation across thread boundaries (e.g. when using
`acquire_proxy_with_retry_async` via `asyncio.to_thread`):

```python
from opentelemetry.context import attach, detach
from opentelemetry.propagate import extract

def otel_acquire_callback(payload: AcquireEventPayload) -> None:
    # carrier is set by your async caller before handing off to the thread
    token = attach(extract(carrier={}))
    try:
        with tracer.start_as_current_span("pharox.acquire"):
            ...
    finally:
        detach(token)
```

## Notes

- Pharox callbacks are **synchronous**. If your OTLP exporter is async-only,
  wrap the export in `asyncio.run_coroutine_threadsafe` or use a synchronous
  exporter.
- For high-throughput workers, prefer the `BatchSpanProcessor` over
  `SimpleSpanProcessor` to avoid blocking the acquire path.
- Combine with `PrometheusMetricsRecorder` — both can be registered as callbacks
  on the same manager instance.
Lines changed: 79 additions & 0 deletions
---
title: Production Deployment Checklist
description: Steps to harden a Pharox-based service before going live.
---

# Production Deployment Checklist

Work through this list before exposing a Pharox-backed service to production
traffic.

## Storage

- [ ] **Use `PostgresStorage`** — `InMemoryStorage` does not survive process
  restarts and is not safe across multiple workers.
- [ ] **Run Alembic migrations** before deploying a new version:
  ```bash
  alembic upgrade head
  ```
- [ ] **Connection pool sizing** — set `pool_size` and `max_overflow` on the
  SQLAlchemy engine to match your expected concurrency. A safe starting point:
  `pool_size = <worker_threads>`, `max_overflow = 2`.
- [ ] **SSL/TLS** — pass `connect_args={"sslmode": "require"}` (or `verify-full`)
  to the engine when the database is on a separate host.

## Proxy Data

- [ ] Seed proxies via `add_proxies_bulk` for large initial loads — it uses a
  single transaction and avoids per-row round-trips.
- [ ] Set realistic `max_concurrency` on each pool; leaving it at `None`
  (unlimited) is only appropriate for dev/testing.
- [ ] Configure `expires_at` on leases so stale holds are cleaned up
  automatically by `cleanup_expired_leases`.

## Health Checks

- [ ] Run `HealthCheckOrchestrator` in a dedicated thread or process — it is
  blocking and CPU-light but I/O-heavy.
- [ ] Tune `HealthCheckOptions.interval_seconds` and `timeout_seconds` to your
  proxy SLA. A common production setting: 60 s interval, 10 s timeout.
- [ ] Archive or truncate old health records periodically — see
  `docs/how-to/health-archival.md`.

## Observability

- [ ] Register `PrometheusMetricsRecorder` as acquire/release callbacks and
  expose the `/metrics` endpoint to Prometheus.
- [ ] Set alert rules on `pharox_acquire_total{status="miss"}` to detect pool
  exhaustion.
- [ ] Enable summaries (`enable_summaries=True`) and configure quantiles if
  p95/p99 latency SLOs are required.

## Retry / Resilience

- [ ] Pass an explicit `RetryConfig` to `acquire_proxy_with_retry` rather than
  relying on defaults — defaults are conservative and may not match your SLA.
- [ ] Cap `RetryConfig.max_backoff_seconds` to avoid unbounded blocking in
  synchronous workers.

## Security

- [ ] Store proxy credentials in a secrets manager (e.g. HashiCorp Vault, AWS
  Secrets Manager) — do not hard-code them or commit them to source control.
- [ ] Restrict the database user to `SELECT`, `INSERT`, `UPDATE`, `DELETE` on
  Pharox tables only — no DDL grants in production.
- [ ] Rotate credentials via `update_proxy` rather than deleting and re-inserting,
  to preserve lease history.

## Testing Before Go-Live

- [ ] Run `poetry run pytest` locally with the Postgres extra to confirm the
  adapter contract tests pass against a staging database.
- [ ] Run `poetry run mypy src/pharox` — zero errors expected.
- [ ] Run the production-template stack (`docs/how-to/production-template.md`)
  and verify metrics appear in Grafana before switching production traffic.

## Dependency Versions

- [ ] Pin `pharox` to a specific minor version in your service's lockfile —
  minor versions may add new abstract methods to `IStorage`.
- [ ] Review `CHANGELOG.md` on every upgrade for breaking changes.
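The capped-backoff advice in the Retry / Resilience section can be sketched as follows. The parameter names are illustrative, chosen to mirror `RetryConfig.max_backoff_seconds`; this is not pharox's actual retry implementation:

```python
def backoff_schedule(
    base: float = 0.5,
    factor: float = 2.0,
    max_backoff: float = 8.0,
    attempts: int = 6,
) -> list:
    """Exponential backoff with a hard cap on each sleep."""
    delays = []
    delay = base
    for _ in range(attempts):
        delays.append(min(delay, max_backoff))  # the cap bounds worst-case blocking
        delay *= factor
    return delays

backoff_schedule()  # [0.5, 1.0, 2.0, 4.0, 8.0, 8.0]
```

Without the cap, the sixth attempt would sleep 16 s; in a synchronous worker that whole time is a blocked thread, which is why the checklist says to bound it.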

pyproject.toml

Lines changed: 5 additions & 1 deletion
```diff
@@ -62,7 +62,8 @@ known-first-party = ["pharox"]
 
 [tool.pytest.ini_options]
 minversion = "7.0"
-addopts = "-ra -q --cov=src/pharox --cov-report=term-missing"
+addopts = "-ra -q --cov=src/pharox --cov-report=term-missing --cov-fail-under=85"
+filterwarnings = ["ignore::DeprecationWarning"]
 testpaths = ["tests"]
 markers = ["contract: marks adapter contract tests that may require external services"]
 
@@ -74,6 +75,9 @@ disallow_untyped_defs = true
 requires = ["poetry-core>=2.0.0,<3.0.0"]
 build-backend = "poetry.core.masonry.api"
 
+[tool.coverage.run]
+omit = ["src/pharox/storage/postgres/*", "src/pharox/benchmarks.py"]
+
 [tool.commitizen]
 name = "cz_conventional_commits"
 version_provider = "poetry"
```

src/pharox/__init__.py

Lines changed: 2 additions & 0 deletions
```diff
@@ -42,6 +42,7 @@
 )
 from .observability import (
     DEFAULT_LATENCY_BUCKETS,
+    DEFAULT_QUANTILES,
     PrometheusMetricsRecorder,
     StructuredLogger,
     register_prometheus_metrics,
@@ -97,4 +98,5 @@
     "StructuredLogger",
     "register_prometheus_metrics",
     "DEFAULT_LATENCY_BUCKETS",
+    "DEFAULT_QUANTILES",
 ]
```

src/pharox/observability/__init__.py

Lines changed: 2 additions & 0 deletions
```diff
@@ -3,12 +3,14 @@
 from .logging import StructuredLogger
 from .metrics import (
     DEFAULT_LATENCY_BUCKETS,
+    DEFAULT_QUANTILES,
     PrometheusMetricsRecorder,
     register_prometheus_metrics,
 )
 
 __all__ = [
     "DEFAULT_LATENCY_BUCKETS",
+    "DEFAULT_QUANTILES",
     "PrometheusMetricsRecorder",
     "StructuredLogger",
     "register_prometheus_metrics",
```
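The newly exported `DEFAULT_QUANTILES` covers p50/p95/p99 per the commit message. To make concrete what a Summary-style quantile reports, here is a hypothetical nearest-rank sketch (pharox's actual recorder and constant values are not shown in this diff):

```python
QUANTILES = (0.5, 0.95, 0.99)  # mirrors the p50/p95/p99 named in the commit message

def percentile(samples, q):
    # Nearest-rank percentile: enough to illustrate what a Summary metric
    # reports for each configured quantile.
    s = sorted(samples)
    idx = min(len(s) - 1, max(0, round(q * len(s)) - 1))
    return s[idx]

latencies_ms = list(range(1, 101))  # 1..100 ms of fake acquire latencies
percentiles = {q: percentile(latencies_ms, q) for q in QUANTILES}
# percentiles == {0.5: 50, 0.95: 95, 0.99: 99}
```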
