perf: 3-phase retain pipeline — fix deadlocks, cap temporal links, query-time entity expansion#722

Merged
nicoloboschi merged 21 commits into main from fix/retain-deadlock-prevention on Apr 1, 2026
Conversation

nicoloboschi (Collaborator) commented on Mar 27, 2026

Summary

Major retain pipeline overhaul: fix deadlocks, eliminate write amplification, add producer-consumer pipeline, and defer semantic ANN to a single post-commit pass. Tested on BEAM 10m benchmark (1M+ facts, zero errors).

Architecture: 3-Phase Retain Pipeline

Phase 1 — Entity Resolution (separate connection, read-heavy)

  • Trigram GIN scan + co-occurrence fetch on a dedicated connection outside the write transaction
  • Eliminates TimeoutErrors from slow reads holding the write transaction open

Phase 2 — Core Write Transaction (atomic)

  • Sorted bulk INSERT ... SELECT FROM unnest() for all link types — prevents deadlocks
  • Temporal links: bidirectional B-tree index scan (LIMIT 20 each direction) — capped at 20 per unit
  • Semantic links: skipped in streaming mode (deferred to final ANN pass)
  • skip_exists_check=True on temporal/causal INSERT (saves ~0.5s/batch)
  • Configurable backpressure via HINDSIGHT_API_RETAIN_MAX_CONCURRENT (default: 4)
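
The sorted bulk insert can be sketched as below — a minimal illustration of why a global sort order prevents lock-order cycles between concurrent batches. Table and column names and the asyncpg-style placeholders are hypothetical, not the actual hindsight schema:

```python
# If every transaction inserts (and thus locks) rows in the same global
# order, two concurrent batches can never wait on each other in a cycle,
# so deadlocks are impossible by construction.

def prepare_link_insert(links: list[tuple[str, str, float]]):
    """Sort links by (source_id, target_id), then build one bulk
    INSERT ... SELECT FROM unnest() statement plus its array params."""
    links = sorted(links, key=lambda l: (l[0], l[1]))
    sources = [l[0] for l in links]
    targets = [l[1] for l in links]
    weights = [l[2] for l in links]
    sql = """
        INSERT INTO memory_links (source_unit_id, target_unit_id, weight)
        SELECT * FROM unnest($1::uuid[], $2::uuid[], $3::float8[])
        ON CONFLICT DO NOTHING
    """
    return sql, sources, targets, weights
```

A single set-based statement also avoids the per-row round trips that amplified write time in the old path.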

Phase 3 — Best-Effort Display Data (post-transaction, error-isolated)

  • Entity links for UI graph visualization (retrieval uses unit_entities self-join instead)
  • try/except wrapped — failures logged as warnings, never fail the retain

Producer-Consumer Streaming Pipeline

Replaces sequential extract-write-extract-write with concurrent processing:

  • LLM Producer: fires all chunk extractions concurrently (LLM semaphore provides backpressure), pushes enriched results into asyncio.Queue
  • DB Consumer: drains queue in batches of RETAIN_CHUNK_BATCH_SIZE (default: 100), runs Phase 1+2+3 per batch
  • LLM and DB work overlap instead of running sequentially
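
The producer-consumer shape described above can be sketched as follows. Names (`extract`, `write_batch`) are illustrative stand-ins; the real pipeline enriches results and runs the full 3-phase write per batch:

```python
import asyncio

async def run_pipeline(chunks, extract, write_batch,
                       batch_size=100, llm_concurrency=8):
    queue: asyncio.Queue = asyncio.Queue()
    sem = asyncio.Semaphore(llm_concurrency)  # LLM-side backpressure

    async def produce(chunk):
        async with sem:
            await queue.put(await extract(chunk))  # LLM extraction

    async def consume(expected):
        done, batch = 0, []
        while done < expected:
            batch.append(await queue.get())
            done += 1
            if len(batch) >= batch_size or done == expected:
                await write_batch(batch)  # Phase 1+2+3 per batch
                batch = []

    # Producers and the consumer run concurrently, so LLM and DB
    # work overlap instead of alternating.
    await asyncio.gather(*(produce(c) for c in chunks),
                         consume(len(chunks)))
```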

Deferred Semantic ANN Pass

Instead of creating within-batch semantic links in Phase 2 + fire-and-forget ANN per batch:

  • Skip all semantic links during batch processing (saves 2.6s/batch)
  • Single final ANN pass after all facts committed: parallel (4 connections), chunked (1000 seeds/query), top_k=20
  • Recovery checkpoint: result_metadata.facts_committed flag — on crash/retry, skip extraction and jump straight to ANN pass
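
The fan-out of the final ANN pass can be sketched as below — seeds chunked per query, chunks spread over a small connection pool. `run_ann_query` is a hypothetical helper standing in for the actual HNSW query:

```python
import asyncio

async def final_ann_pass(seed_ids, run_ann_query,
                         connections=4, chunk_size=1000, top_k=20):
    """Chunk seed unit IDs and run one top-k ANN query per chunk,
    at most `connections` queries in flight at once."""
    chunks = [seed_ids[i:i + chunk_size]
              for i in range(0, len(seed_ids), chunk_size)]
    sem = asyncio.Semaphore(connections)  # one query per connection

    async def worker(chunk):
        async with sem:
            return await run_ann_query(chunk, top_k)

    results = await asyncio.gather(*(worker(c) for c in chunks))
    return [link for chunk_links in results for link in chunk_links]
```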

Dead Code Removed

  • process_entities_batch (legacy single-connection entity processing)
  • extract_entities_batch_optimized (legacy monolithic entity+link function)
  • Fallback entity processing inside Phase 2 transaction
  • Legacy ANN inline fallback in create_semantic_links_batch
  • Fallback entity_links direct-insert path in Phase 3

Benchmarks

50MB document (mock LLM, clean DB):

| Metric | Before | After | Improvement |
| --- | --- | --- | --- |
| Batch phase (177 batches) | 23 min (6.4s avg) | 7.5 min (2.5s avg) | 3x |
| Semantic ANN | fire-and-forget (caused spikes) | 1.7 min (parallel, clean) | predictable |
| Total | 23 min | 9.2 min | 2.5x |

BEAM 10m benchmark (real LLM, production workload):

  • 1M+ facts ingested across multiple docs
  • Zero deadlocks, zero DB errors, zero data loss

Test Coverage

  • 83 retain/link tests pass (test_retain.py + test_link_utils.py)
  • Streaming: test_streaming_chunk_batching_produces_same_facts, test_streaming_chunk_batching_recovery, test_streaming_disabled_for_small_docs
  • Unit tests for _cap_links_per_unit, compute_semantic_links_within_batch
  • Delta retain + document upsert tests verify FK safety

Configuration

| Variable | Default | Description |
| --- | --- | --- |
| HINDSIGHT_API_RETAIN_MAX_CONCURRENT | 4 | Max concurrent Phase 1+2 batches |
| HINDSIGHT_API_RETAIN_CHUNK_BATCH_SIZE | 100 | Chunks per streaming batch |

nicoloboschi changed the title from "fix: prevent deadlocks on concurrent retain by sorting link inserts" to "perf: 3-phase retain pipeline — fix deadlocks, cap temporal links, query-time entity expansion" on Mar 27, 2026
nicoloboschi marked this pull request as draft on March 27, 2026 15:10
nicoloboschi force-pushed the fix/retain-deadlock-prevention branch 17 times, most recently from d83b37d to 0f1250c, on March 30, 2026 09:04
…ery-time entity expansion

Major retain pipeline overhaul addressing deadlocks, write amplification,
and TimeoutErrors. Restructures retain into three phases:

Phase 1: Entity resolution on separate connection (read-heavy)
Phase 2: Core write transaction (atomic) — facts, unit_entities, links
Phase 3: Best-effort display data (error-isolated) — entity viz links, stats

Key changes:
- Sorted bulk INSERT FROM unnest() prevents deadlocks
- Temporal links capped to top-20 per unit (95% reduction)
- Batched semantic ANN via temp table + LATERAL
- Query-time entity expansion via unit_entities self-join
- Entity viz links moved to Phase 3 (post-transaction)
- HINDSIGHT_API_RETAIN_MAX_CONCURRENT config (default: 32)
The hardcoded top_k=5 was artificially limiting semantic link creation.
Link expansion retrieval can consume up to budget (50-200) semantic
neighbors per seed set, but each fact only had 5 outgoing edges — making
the bidirectional graph very sparse.

Increasing to 20 gives retrieval 4x more edges to work with. The ANN
probe cost is unchanged (same HNSW traversal per fact, just returning
more rows). INSERT cost is negligible (~14k rows via bulk INSERT).

Also: all 18 TimeoutErrors in the latest benchmark (beam-1m-u20) were
from Gemini LLM calls, zero from the database — confirming the entity
resolution split eliminated DB timeouts entirely.
The batched LATERAL ANN query (700 HNSW probes) was the last remaining
source of DB TimeoutErrors — all 29 in the latest benchmark were from
create_semantic_links_batch inside the Phase 2 write transaction.

Split semantic link creation into three phases:
- Phase 1 (separate conn, autocommit): ANN search via temp table + LATERAL.
  No transaction locks, no contention with concurrent writers.
- Phase 2 (write transaction): within-batch numpy similarities (instant) +
  INSERT of both within-batch and Phase 1 ANN results. No DB reads.
- Phase 3 (flush_pending_stats): future hook point for re-checking ANN
  results after commit to catch links missed by concurrent batches.

Also adds 7 unit tests for compute_semantic_links_within_batch covering
empty input, identical/orthogonal embeddings, threshold filtering, top_k
cap, and tuple structure validation.
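
The within-batch similarity step can be sketched as below — pairwise cosine similarity on the batch's own embeddings, thresholded and capped per unit. Names and the returned tuple shape are illustrative; the real `compute_semantic_links_within_batch` carries additional metadata:

```python
import numpy as np

def within_batch_links(embeddings: np.ndarray,
                       threshold: float = 0.7, top_k: int = 20):
    """Return (i, j, similarity) links between rows of `embeddings`,
    keeping at most top_k neighbors above `threshold` per row."""
    if len(embeddings) == 0:
        return []
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sims = unit @ unit.T              # all-pairs cosine, no DB reads
    np.fill_diagonal(sims, -1.0)      # exclude self-links
    links = []
    for i, row in enumerate(sims):
        for j in np.argsort(row)[::-1][:top_k]:
            if row[j] < threshold:
                break                 # row is sorted; nothing further passes
            links.append((i, int(j), float(row[j])))
    return links
```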
- New test_semantic_links_phase1_ann_cross_batch verifies that the Phase 1
  ANN search with placeholder unit IDs correctly creates cross-batch
  semantic links after remapping to real IDs.
- Test PG port now configurable via HINDSIGHT_TEST_PG_PORT env var
  (default: 5556) to avoid conflicts with running benchmark daemons.
Remove retry_with_backoff from _run_db_work and _run_delta_db_work:
- Deadlocks are prevented by sorted bulk INSERT (no need for retry)
- Transient timeouts are handled by the worker poller's task-level retry
  (3 attempts, 60s spacing) which is better than rapid internal retries
  that amplify I/O pressure during contention storms

Set HINDSIGHT_API_RETAIN_MAX_CONCURRENT default from 32 to 4:
- The semaphore gates Phase 1 (ANN + entity resolution) + Phase 2 (writes)
- At 4 concurrent, HNSW index I/O is manageable; at 10+ concurrent the
  probes saturate disk and cause cascading timeouts
- LLM extraction still runs at full parallelism (semaphore acquired after)
…ndexes

The LATERAL ANN query was falling back to sequential scan + sort (90ms/probe)
because the per-bank HNSW indexes are partial indexes filtered on fact_type.
Without fact_type in the WHERE clause, PostgreSQL couldn't use them.

Fix: iterate over ('world', 'experience') and run one HNSW-indexed ANN per
type. EXPLAIN shows 8ms/probe (was 90ms) — 11x faster.

700 probes × 8ms × 2 types = ~11s total (was ~63s via seq scan).
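
The per-type fix can be sketched as below. Because the HNSW indexes are partial indexes filtered on `fact_type`, the planner only uses them when the query repeats the same predicate. The SQL text and the asyncpg-style `conn.fetch` call are illustrative, not the exact hindsight code:

```python
import asyncio

ANN_SQL = """
    SELECT s.seed_id, f.id, f.embedding <=> s.embedding AS dist
    FROM seed_units s
    CROSS JOIN LATERAL (
        SELECT id, embedding
        FROM facts
        WHERE fact_type = $1      -- matches the partial index predicate
        ORDER BY embedding <=> s.embedding
        LIMIT $2
    ) f
"""

async def ann_per_type(conn, top_k: int = 20):
    """One HNSW-indexed ANN scan per fact type, instead of a single
    cross-type query that falls back to a sequential scan."""
    rows = []
    for fact_type in ("world", "experience"):
        rows += await conn.fetch(ANN_SQL, fact_type, top_k)
    return rows
```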
Temporal links now filter by fact_type in the LATERAL query — world facts
only link to world facts, experience to experience. This matches how
retrieval filters results and avoids wasted cross-type link rows.

New integration tests:
- test_semantic_ann_uses_hnsw_index: verifies Phase 1 ANN creates
  cross-batch semantic links (tests fact_type filter + placeholder remap)
- test_temporal_links_scoped_by_fact_type: verifies world facts get
  temporal links to other world facts but NOT to experience facts
… batch

Changed asyncio.gather(*tasks) to asyncio.gather(*tasks, return_exceptions=True)
in both chunk-level and content-level fact extraction. A single chunk timeout
(e.g., Gemini >90s) no longer discards all other successfully extracted facts.

For a 50MB document with 17k chunks, even a 2% chunk failure rate previously
caused 0 completions (entire batch discarded). Now 16,700 facts are extracted
and only the 300 failed chunks are skipped with a warning log.
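
The failure-isolation change can be sketched as follows — gather with `return_exceptions=True`, then keep successful chunk results and count (in the real code: log) the failures instead of discarding the whole batch:

```python
import asyncio

async def extract_all(chunks, extract_one):
    """Run all chunk extractions; a failed chunk no longer poisons the rest."""
    results = await asyncio.gather(*(extract_one(c) for c in chunks),
                                   return_exceptions=True)
    facts, failed = [], 0
    for chunk, res in zip(chunks, results):
        if isinstance(res, Exception):
            failed += 1   # real code logs a warning with the chunk id
        else:
            facts.extend(res)
    return facts, failed
```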
The LATERAL query for temporal links passed all unit_ids at once into
unnest(), causing PostgreSQL timeouts on documents with 16k+ chunks.
Split into batches of 500 units per query to keep each under the
command_timeout.

Also identified: HNSW index creation on shared pg0 instances with
50k+ existing units exceeds the 60s command_timeout. This is a
test infrastructure issue (shared pg0 accumulates data) but also
affects production when creating new banks on large instances.
…H_SIZE)

Process chunks in mini-batches of N (default 500), committing each batch
to the DB before starting the next. This prevents OOM kills on large
documents (50MB / 17k+ chunks) by keeping only ~500 facts + embeddings
in memory at a time instead of 50k+.

Each mini-batch goes through the full Phase 1 → 2 → 3 pipeline
independently, sharing the same document_id. On recovery (process dies
mid-way), delta retain detects already-committed chunks via content_hash
and skips them — only remaining chunks get re-extracted.

Config: HINDSIGHT_API_RETAIN_CHUNK_BATCH_SIZE (default: 500, 0 to disable)
Per-bank configurable via the hierarchical config system.

Tests:
- test_streaming_chunk_batching_produces_same_facts
- test_streaming_chunk_batching_recovery (delta retain skips committed chunks)
- test_streaming_disabled_for_small_docs
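
The recovery check can be sketched as below — chunks whose content hash is already committed are skipped on retry, so only the remaining chunks are re-extracted. The hashing scheme here is illustrative:

```python
import hashlib

def chunk_hash(text: str) -> str:
    """Stable content hash used to recognize already-committed chunks."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def chunks_to_process(all_chunks: list[str],
                      committed_hashes: set[str]) -> list[str]:
    """On retry, drop chunks the DB already has; only the rest cost LLM calls."""
    return [c for c in all_chunks if chunk_hash(c) not in committed_hashes]
```

This only works if chunk boundaries are stable across attempts, which is exactly why the chunk-size and document_id mismatches fixed later in this PR broke recovery.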
Replace the sequential streaming loop with a producer-consumer pipeline:
- LLM producer fires concurrent chunk extractions (semaphore-bounded)
- DB consumer drains queue in batches, runs Phase 1+2+3 per batch
- LLM and DB work overlap instead of running sequentially

Defer semantic links to a single final ANN pass after all batches commit:
- Remove within-batch semantic links from Phase 2 (was 2.6s/batch)
- Run parallel ANN (4 connections) after all facts committed
- top_k reduced from 50 to 20 (recall uses at most 20 neighbors)
- Recovery via operation result_metadata checkpoint

Additional optimizations:
- skip_exists_check on temporal/causal link INSERT (saves ~0.5s/batch)
- WHERE EXISTS guard on semantic link INSERT (handles document upsert)
- timeout=300s on ANN queries and bulk INSERT for large banks
- Demote [ANN] debug logs to logger.debug()
- Fix docstring typos (agent_id → bank_id)
- Fix content_index remapping in producer-consumer batches
- Fix delta retain passing contents vs delta_contents

50MB benchmark (mock LLM): 9.2 min (was 23 min) — 2.5x faster.
BEAM 10m benchmark: zero deadlocks, zero DB errors.
- Remove process_entities_batch (legacy single-connection entity processing)
- Remove extract_entities_batch_optimized (only caller was the above)
- Remove fallback entity processing inside Phase 2 transaction
- Remove legacy ANN inline fallback in create_semantic_links_batch
- Remove fallback entity_links direct-insert path in Phase 3
- Make resolved_entity_ids/entity_to_unit/unit_to_entity_ids required params
nicoloboschi force-pushed the fix/retain-deadlock-prevention branch from 4232c9e to 1b3f0a1 on March 31, 2026 08:22
… code

- Add EntityResolutionResult and Phase1Result dataclasses in types.py
- Replace 4-tuple return from _pre_resolve_phase1 with Phase1Result
- Remove dead `entity_links = []` variables in retain_batch and _try_delta_retain
- Remove unused `confidence_score` parameter from orchestrator.retain_batch
  and _retain_batch_async_internal (was accepted but never used)
… trigram matching

The entity resolution query had LIKE '%...' substring conditions that bypassed
the GIN trigram index, causing full sequential scans of the entities table.
On banks with 10k+ entities, this caused TimeoutErrors (observed in BEAM 10m).

Changes:
- Remove LIKE fallbacks, use trigram % operator only (GIN index-based)
- Lower similarity threshold from 0.3 to 0.15 to catch substring relationships
- Use LOWER() on both sides for case-insensitive matching
- Migration: recreate GIN trigram index on LOWER(canonical_name)
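
The index-friendly lookup can be sketched as the SQL below (held as Python constants; table and index names are illustrative). The trigram `%` similarity operator can use a GIN index on `LOWER(canonical_name)`, whereas `LIKE '%...%'` forces a sequential scan:

```python
# pg_trgm's % operator honors this GUC; 0.15 catches substring relationships.
SET_THRESHOLD = "SET pg_trgm.similarity_threshold = 0.15"

MATCH_SQL = """
    SELECT id, canonical_name
    FROM entities
    WHERE LOWER(canonical_name) % LOWER($1)  -- GIN trigram index scan
    ORDER BY similarity(LOWER(canonical_name), LOWER($1)) DESC
    LIMIT 10
"""

# The index expression must match the query expression (LOWER on both sides).
INDEX_SQL = """
    CREATE INDEX IF NOT EXISTS entities_name_trgm
    ON entities USING gin (LOWER(canonical_name) gin_trgm_ops)
"""
```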
…000)

_chunk_contents_for_delta defaulted to chunk_size=120000 while the streaming
path used 3000. On retry, delta re-chunked the document with different
boundaries, found 0 matching chunks, and fell through to full re-extraction.
This wasted all LLM calls on already-committed chunks.

Fix: use the same default (3000) so chunk hashes match on recovery.
…retry recovery

When no document_id is provided, retain generates a UUID. On retry, a new UUID
was generated, making delta retain and streaming chunk-hash recovery unable to
find previously committed chunks. All LLM extraction was wasted on retry.

Fix: resolve document_id early in retain_batch (before delta), persist it to
operation result_metadata, and recover it on retry. Both delta and streaming
paths now see the same document_id across attempts.
…reaming path

All retains now go through the producer-consumer streaming pipeline,
regardless of document size. Small documents are processed as a single batch.
This eliminates the maintenance burden of two separate code paths.

Also fix document upsert: compare content hash to distinguish recovery
(same content, partially committed) from update (different content, needs
cascade-delete). Previously, existing chunks always triggered recovery mode.
…ext dataclass

- Remove dead _handle_zero_facts_documents (no callers after path unification)
- Remove unused imports: defaultdict, EntityLink
- Replace raw dict phase3_context with typed Phase3Context dataclass
- Update _build_and_insert_entity_links_phase3 to use typed parameter
nicoloboschi marked this pull request as ready for review on April 1, 2026 10:50
nicoloboschi merged commit 914ba79 into main on Apr 1, 2026
28 of 47 checks passed
nicoloboschi added a commit that referenced this pull request Apr 1, 2026
The 3-phase retain pipeline (914ba79) introduced several regressions:

1. **Per-content tags lost** — streaming pipeline used `contents[0].tags`
   for ALL chunks, breaking tag-based visibility. Fixed by tracking
   chunk-to-content mapping so each chunk uses its source content's tags.

2. **Multi-document batches broken** — batches with per-content
   `document_id` values were merged into a single document. Fixed by
   grouping by document_id and processing each group independently.

3. **Migration ID collision** — `d6e7f8a9b0c1` was used by both
   `drop_documents_metadata` and `case_insensitive_entities_trgm_index`.
   Renamed trgm migration to `e8f9a0b1c2d3`, fixed chain, added missing
   schema prefix on DROP INDEX.

4. **Graph entity inheritance** — `get_graph_data` queried entities for
   observation IDs only, but observations inherit entities from source
   memories. Fixed by querying `all_relevant_ids`.

5. **Docstring false positives** — link_utils.py docstrings triggered
   the SQL schema safety test's unqualified table reference check.

6. **Config test count** — `retain_chunk_batch_size` added to
   `_CONFIGURABLE_FIELDS` without updating the test assertion.