perf: 3-phase retain pipeline — fix deadlocks, cap temporal links, query-time entity expansion#722
Merged
nicoloboschi merged 21 commits into main on Apr 1, 2026
Conversation
…ery-time entity expansion

Major retain pipeline overhaul addressing deadlocks, write amplification, and TimeoutErrors. Restructures retain into three phases:

- Phase 1: Entity resolution on a separate connection (read-heavy)
- Phase 2: Core write transaction (atomic) — facts, unit_entities, links
- Phase 3: Best-effort display data (error-isolated) — entity viz links, stats

Key changes:

- Sorted bulk INSERT FROM unnest() prevents deadlocks
- Temporal links capped to top-20 per unit (95% reduction)
- Batched semantic ANN via temp table + LATERAL
- Query-time entity expansion via unit_entities self-join
- Entity viz links moved to Phase 3 (post-transaction)
- HINDSIGHT_API_RETAIN_MAX_CONCURRENT config (default: 32)
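The deadlock fix rests on deterministic lock ordering: if every writer inserts link rows in one canonical order via a single bulk statement, concurrent transactions can wait on each other but never form a lock cycle. A minimal sketch of the idea — table and column names here are hypothetical, not the actual schema:

```python
def sort_link_rows(rows):
    """Canonical (source_id, target_id) order so all concurrent writers
    acquire row locks in the same sequence — waits are possible, cycles are not."""
    return sorted(rows, key=lambda r: (r[0], r[1]))

def build_bulk_insert_sql(table):
    # One INSERT ... SELECT FROM unnest() instead of N single-row INSERTs
    # (hypothetical columns; parameters bound as arrays by the driver).
    return (
        f"INSERT INTO {table} (source_id, target_id, weight) "
        "SELECT * FROM unnest(%(sources)s, %(targets)s, %(weights)s)"
    )

rows = sort_link_rows([(7, 2, 0.9), (1, 5, 0.8), (1, 3, 0.7)])
```

The sort costs O(n log n) on a batch that is already in memory, which is negligible next to the INSERT itself.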
The hardcoded top_k=5 was artificially limiting semantic link creation. Link expansion retrieval can consume up to budget (50-200) semantic neighbors per seed set, but each fact had only 5 outgoing edges, making the bidirectional graph very sparse. Increasing to 20 gives retrieval 4x more edges to work with. The ANN probe cost is unchanged (same HNSW traversal per fact, just returning more rows), and the INSERT cost is negligible (~14k rows via bulk INSERT).

Also: all 18 TimeoutErrors in the latest benchmark (beam-1m-u20) came from Gemini LLM calls, zero from the database — confirming the entity resolution split eliminated DB timeouts entirely.
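The cap itself is a simple top-k truncation by similarity, applied per fact after the ANN probe returns its candidates. A sketch with a hypothetical function name:

```python
def cap_outgoing_links(candidates, top_k=20):
    """candidates: (neighbor_id, similarity) pairs for one fact.
    Keep only the top_k most similar neighbors — at top_k=5 the graph was
    too sparse for link expansion; 20 gives retrieval 4x more edges per fact."""
    return sorted(candidates, key=lambda c: c[1], reverse=True)[:top_k]

capped = cap_outgoing_links([(i, i / 30) for i in range(30)])
```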
The batched LATERAL ANN query (700 HNSW probes) was the last remaining source of DB TimeoutErrors — all 29 in the latest benchmark came from create_semantic_links_batch inside the Phase 2 write transaction. Split semantic link creation into three phases:

- Phase 1 (separate conn, autocommit): ANN search via temp table + LATERAL. No transaction locks, no contention with concurrent writers.
- Phase 2 (write transaction): within-batch numpy similarities (instant) + INSERT of both within-batch and Phase 1 ANN results. No DB reads.
- Phase 3 (flush_pending_stats): future hook point for re-checking ANN results after commit to catch links missed by concurrent batches.

Also adds 7 unit tests for compute_semantic_links_within_batch covering empty input, identical/orthogonal embeddings, threshold filtering, top_k cap, and tuple structure validation.
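The real compute_semantic_links_within_batch uses numpy; a dependency-free sketch of the same idea (the signature and link-tuple shape are assumptions) — pairwise cosine similarity over the batch's own embeddings, so Phase 2 needs no DB reads at all:

```python
import math

def compute_semantic_links_within_batch(embeddings, threshold=0.7, top_k=20):
    """Return (source_idx, target_idx, similarity) tuples for pairs whose
    cosine similarity clears the threshold, capped at top_k per source."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    links = []
    for i, a in enumerate(embeddings):
        sims = [(j, cos(a, b)) for j, b in enumerate(embeddings) if j != i]
        sims = [s for s in sims if s[1] >= threshold]
        sims.sort(key=lambda s: s[1], reverse=True)
        links.extend((i, j, sim) for j, sim in sims[:top_k])
    return links
```

This mirrors the unit-test cases listed above: empty input yields no links, identical embeddings link both ways, and orthogonal embeddings fall below the threshold.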
- New test_semantic_links_phase1_ann_cross_batch verifies that the Phase 1 ANN search with placeholder unit IDs correctly creates cross-batch semantic links after remapping to real IDs.
- Test PG port is now configurable via the HINDSIGHT_TEST_PG_PORT env var (default: 5556) to avoid conflicts with running benchmark daemons.
Remove retry_with_backoff from _run_db_work and _run_delta_db_work:

- Deadlocks are prevented by the sorted bulk INSERT (no need for retry)
- Transient timeouts are handled by the worker poller's task-level retry (3 attempts, 60s spacing), which is better than rapid internal retries that amplify I/O pressure during contention storms

Set the HINDSIGHT_API_RETAIN_MAX_CONCURRENT default from 32 to 4:

- The semaphore gates Phase 1 (ANN + entity resolution) + Phase 2 (writes)
- At 4 concurrent, HNSW index I/O is manageable; at 10+ concurrent the probes saturate disk and cause cascading timeouts
- LLM extraction still runs at full parallelism (the semaphore is acquired after)
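The "semaphore acquired after extraction" detail is what keeps LLM throughput unthrottled while DB phases stay bounded. A self-contained sketch with stand-in coroutines (names hypothetical):

```python
import asyncio

RETAIN_MAX_CONCURRENT = 4  # mirrors the new HINDSIGHT_API_RETAIN_MAX_CONCURRENT default

async def retain_all(n_docs):
    sem = asyncio.Semaphore(RETAIN_MAX_CONCURRENT)
    active = 0
    peak = 0

    async def extract_facts(doc):
        await asyncio.sleep(0)          # stand-in for the LLM call (unthrottled)
        return [f"fact:{doc}"]

    async def retain_one(doc):
        nonlocal active, peak
        facts = await extract_facts(doc)
        async with sem:                 # only Phase 1 + Phase 2 are gated
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)   # stand-in for ANN + write transaction
            active -= 1
        return facts

    await asyncio.gather(*(retain_one(d) for d in range(n_docs)))
    return peak

peak = asyncio.run(retain_all(20))
```

All 20 extractions fire concurrently, but the observed peak of in-flight DB work never exceeds 4.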
…ndexes
The LATERAL ANN query was falling back to sequential scan + sort (90ms/probe)
because the per-bank HNSW indexes are partial indexes filtered on fact_type.
Without fact_type in the WHERE clause, PostgreSQL couldn't use them.
Fix: iterate over ('world', 'experience') and run one HNSW-indexed ANN per
type. EXPLAIN shows 8ms/probe (was 90ms) — 11x faster.
700 probes × 8ms × 2 types = ~11s total (was ~63s via seq scan).
Temporal links now filter by fact_type in the LATERAL query — world facts only link to world facts, experience to experience. This matches how retrieval filters results and avoids wasted cross-type link rows.

New integration tests:

- test_semantic_ann_uses_hnsw_index: verifies Phase 1 ANN creates cross-batch semantic links (tests fact_type filter + placeholder remap)
- test_temporal_links_scoped_by_fact_type: verifies world facts get temporal links to other world facts but NOT to experience facts
… batch

Changed asyncio.gather(*tasks) to asyncio.gather(*tasks, return_exceptions=True) in both chunk-level and content-level fact extraction. A single chunk timeout (e.g., Gemini >90s) no longer discards all other successfully extracted facts.

For a 50MB document with 17k chunks, even a 2% chunk failure rate previously caused 0 completions (entire batch discarded). Now 16,700 facts are extracted and only the 300 failed chunks are skipped with a warning log.
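With return_exceptions=True, gather returns the exceptions in place of results instead of propagating the first one, so the caller can keep the successful chunks. A toy version of the pattern (the failure mode and names are illustrative, not the real extraction code):

```python
import asyncio

async def extract_chunk(i):
    if i % 5 == 0:                      # stand-in for a Gemini timeout
        raise TimeoutError(f"chunk {i} timed out")
    return [f"fact-{i}"]

async def extract_all(n):
    results = await asyncio.gather(
        *(extract_chunk(i) for i in range(n)),
        return_exceptions=True,         # one failure no longer discards the batch
    )
    facts, failed = [], 0
    for r in results:
        if isinstance(r, BaseException):
            failed += 1                 # skipped with a warning in the real pipeline
        else:
            facts.extend(r)
    return facts, failed

facts, failed = asyncio.run(extract_all(10))
```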
The LATERAL query for temporal links passed all unit_ids at once into unnest(), causing PostgreSQL timeouts on documents with 16k+ chunks. Split into batches of 500 units per query to keep each under the command_timeout.

Also identified: HNSW index creation on shared pg0 instances with 50k+ existing units exceeds the 60s command_timeout. This is a test infrastructure issue (shared pg0 accumulates data) but also affects production when creating new banks on large instances.
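The batching itself is a plain fixed-size slicing of the unit_id list, one query per slice — a sketch (helper name hypothetical):

```python
def batched(ids, size=500):
    """Yield fixed-size slices of unit_ids so each unnest() query
    stays comfortably under the command_timeout."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

batches = list(batched(list(range(16000))))
```

A 16k-chunk document becomes 32 bounded queries instead of one unbounded unnest().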
…H_SIZE)

Process chunks in mini-batches of N (default 500), committing each batch to the DB before starting the next. This prevents OOM kills on large documents (50MB / 17k+ chunks) by keeping only ~500 facts + embeddings in memory at a time instead of 50k+.

Each mini-batch goes through the full Phase 1 → 2 → 3 pipeline independently, sharing the same document_id. On recovery (process dies mid-way), delta retain detects already-committed chunks via content_hash and skips them — only remaining chunks get re-extracted.

Config: HINDSIGHT_API_RETAIN_CHUNK_BATCH_SIZE (default: 500, 0 to disable). Per-bank configurable via the hierarchical config system.

Tests:
- test_streaming_chunk_batching_produces_same_facts
- test_streaming_chunk_batching_recovery (delta retain skips committed chunks)
- test_streaming_disabled_for_small_docs
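The recovery path boils down to a set-membership check on content hashes: anything already committed is filtered out before extraction. A minimal sketch (function names are illustrative):

```python
import hashlib

def chunk_hash(chunk):
    return hashlib.sha256(chunk.encode()).hexdigest()

def pending_chunks(chunks, committed_hashes):
    """Delta-retain recovery: skip chunks whose content_hash is already committed."""
    return [c for c in chunks if chunk_hash(c) not in committed_hashes]

committed = {chunk_hash("chunk-a")}           # batch 1 committed before the crash
remaining = pending_chunks(["chunk-a", "chunk-b", "chunk-c"], committed)
```

Only the remaining chunks are sent back to the LLM on retry, so committed work is never re-extracted.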
Replace the sequential streaming loop with a producer-consumer pipeline:

- LLM producer fires concurrent chunk extractions (semaphore-bounded)
- DB consumer drains the queue in batches, runs Phase 1+2+3 per batch
- LLM and DB work overlap instead of running sequentially

Defer semantic links to a single final ANN pass after all batches commit:

- Remove within-batch semantic links from Phase 2 (was 2.6s/batch)
- Run parallel ANN (4 connections) after all facts are committed
- top_k reduced from 50 to 20 (recall uses at most 20 neighbors)
- Recovery via operation result_metadata checkpoint

Additional optimizations:

- skip_exists_check on temporal/causal link INSERT (saves ~0.5s/batch)
- WHERE EXISTS guard on semantic link INSERT (handles document upsert)
- timeout=300s on ANN queries and bulk INSERT for large banks
- Demote [ANN] debug logs to logger.debug()
- Fix docstring typos (agent_id → bank_id)
- Fix content_index remapping in producer-consumer batches
- Fix delta retain passing contents vs delta_contents

50MB benchmark (mock LLM): 9.2 min (was 23 min) — 2.5x faster. BEAM 10m benchmark: zero deadlocks, zero DB errors.
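The producer-consumer shape can be sketched with an asyncio.Queue and a sentinel: the producer pushes extracted facts as they arrive, the consumer commits whenever a batch fills. Everything below is a simplified stand-in, not the actual pipeline code:

```python
import asyncio

async def pipeline(chunks, batch_size=3):
    queue = asyncio.Queue(maxsize=8)    # bounds memory between producer and consumer
    committed = []

    async def producer():
        for c in chunks:
            facts = [f"fact:{c}"]       # stand-in for LLM extraction
            await queue.put(facts)
        await queue.put(None)           # sentinel: extraction finished

    async def consumer():
        batch = []
        while True:
            item = await queue.get()
            if item is None:
                break
            batch.extend(item)
            if len(batch) >= batch_size:
                committed.append(list(batch))   # stand-in for Phase 1+2+3 commit
                batch = []
        if batch:
            committed.append(batch)     # flush the final partial batch

    await asyncio.gather(producer(), consumer())
    return committed

batches = asyncio.run(pipeline(list(range(7))))
```

Because producer and consumer run under one gather, LLM and DB work overlap instead of alternating.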
- Remove process_entities_batch (legacy single-connection entity processing)
- Remove extract_entities_batch_optimized (only caller was the above)
- Remove fallback entity processing inside the Phase 2 transaction
- Remove legacy ANN inline fallback in create_semantic_links_batch
- Remove fallback entity_links direct-insert path in Phase 3
- Make resolved_entity_ids/entity_to_unit/unit_to_entity_ids required params
… code

- Add EntityResolutionResult and Phase1Result dataclasses in types.py
- Replace the 4-tuple return from _pre_resolve_phase1 with Phase1Result
- Remove dead `entity_links = []` variables in retain_batch and _try_delta_retain
- Remove the unused `confidence_score` parameter from orchestrator.retain_batch and _retain_batch_async_internal (was accepted but never used)
… trigram matching

The entity resolution query had LIKE '%...' substring conditions that bypassed the GIN trigram index, causing full sequential scans of the entities table. On banks with 10k+ entities, this caused TimeoutErrors (observed in BEAM 10m).

Changes:
- Remove LIKE fallbacks, use the trigram % operator only (GIN index-based)
- Lower the similarity threshold from 0.3 to 0.15 to catch substring relationships
- Use LOWER() on both sides for case-insensitive matching
- Migration: recreate the GIN trigram index on LOWER(canonical_name)
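To see why 0.15 catches substring relationships, it helps to mimic what pg_trgm's % operator measures. The sketch below is a rough approximation of pg_trgm's trigram similarity (pg_trgm pads and tokenizes per word, so exact scores differ), not the extension itself:

```python
def trigrams(s):
    """Approximate pg_trgm tokenization: lowercase, pad, extract 3-grams."""
    s = f"  {s.lower()} "
    return {s[i:i + 3] for i in range(len(s) - 2)}

def similarity(a, b):
    """Set-overlap similarity in the spirit of pg_trgm's % operator."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0
```

A name and its longer form share all of the short name's trigrams but dilute the union, landing well below 0.3 yet above 0.15 — hence the lowered threshold.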
…000)

_chunk_contents_for_delta defaulted to chunk_size=120000 while the streaming path used 3000. On retry, delta re-chunked the document with different boundaries, found 0 matching chunks, and fell through to full re-extraction, wasting all LLM calls on already-committed chunks.

Fix: use the same default (3000) so chunk hashes match on recovery.
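The failure is easy to reproduce in miniature: hashing fixed-size chunks only matches across runs if the chunk boundaries are identical. A sketch (chunking here is plain fixed-width slicing, an assumption about the real chunker):

```python
import hashlib

def chunk(text, size):
    return [text[i:i + size] for i in range(0, len(text), size)]

def hashes(chunks):
    return {hashlib.sha256(c.encode()).hexdigest() for c in chunks}

doc = "".join(str(i) for i in range(4000))

committed = hashes(chunk(doc, 3000))     # streaming path at retain time
retry_ok = hashes(chunk(doc, 3000))      # delta retry with the matching default
retry_bad = hashes(chunk(doc, 120000))   # old delta default: boundaries shift
```

With mismatched sizes the hash sets are disjoint, so delta retain sees zero committed chunks and re-extracts everything.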
…retry recovery

When no document_id is provided, retain generates a UUID. On retry, a new UUID was generated, making delta retain and streaming chunk-hash recovery unable to find previously committed chunks. All LLM extraction was wasted on retry.

Fix: resolve document_id early in retain_batch (before delta), persist it to operation result_metadata, and recover it on retry. Both the delta and streaming paths now see the same document_id across attempts.
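The resolve-once-persist-recover order can be sketched as a small helper (the function shape and metadata key are assumptions about the actual code):

```python
import uuid

def resolve_document_id(request_doc_id, result_metadata):
    """Resolve the document_id exactly once per logical retain and persist it,
    so a retried attempt recovers the same id instead of minting a new UUID."""
    if request_doc_id:
        doc_id = request_doc_id
    elif "document_id" in result_metadata:
        doc_id = result_metadata["document_id"]   # recovered on retry
    else:
        doc_id = str(uuid.uuid4())                # first attempt, no id provided
    result_metadata["document_id"] = doc_id       # persisted with the operation
    return doc_id

meta = {}
first = resolve_document_id(None, meta)
retry = resolve_document_id(None, meta)           # same metadata → same id
```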
…reaming path

All retains now go through the producer-consumer streaming pipeline, regardless of document size. Small documents are processed as a single batch. This eliminates the maintenance burden of two separate code paths.

Also fix document upsert: compare the content hash to distinguish recovery (same content, partially committed) from update (different content, needs cascade-delete). Previously, existing chunks always triggered recovery mode.
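The recovery-vs-update decision reduces to a single hash comparison against the stored document. A sketch (mode names and function shape are illustrative):

```python
import hashlib

def upsert_mode(new_content, existing_hash):
    """Decide how to treat an incoming document that may already exist:
    same hash → retry recovery; different hash → genuine update (cascade-delete)."""
    h = hashlib.sha256(new_content.encode()).hexdigest()
    if existing_hash is None:
        return "insert"
    return "recovery" if h == existing_hash else "update"

stored = hashlib.sha256("v1 content".encode()).hexdigest()
```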
…ext dataclass

- Remove dead _handle_zero_facts_documents (no callers after path unification)
- Remove unused imports: defaultdict, EntityLink
- Replace the raw dict phase3_context with a typed Phase3Context dataclass
- Update _build_and_insert_entity_links_phase3 to use the typed parameter
nicoloboschi added a commit that referenced this pull request on Apr 1, 2026
The 3-phase retain pipeline (914ba79) introduced several regressions:

1. **Per-content tags lost** — streaming pipeline used `contents[0].tags` for ALL chunks, breaking tag-based visibility. Fixed by tracking the chunk-to-content mapping so each chunk uses its source content's tags.
2. **Multi-document batches broken** — batches with per-content `document_id` values were merged into a single document. Fixed by grouping by document_id and processing each group independently.
3. **Migration ID collision** — `d6e7f8a9b0c1` was used by both `drop_documents_metadata` and `case_insensitive_entities_trgm_index`. Renamed the trgm migration to `e8f9a0b1c2d3`, fixed the chain, and added a missing schema prefix on DROP INDEX.
4. **Graph entity inheritance** — `get_graph_data` queried entities for observation IDs only, but observations inherit entities from source memories. Fixed by querying `all_relevant_ids`.
5. **Docstring false positives** — link_utils.py docstrings triggered the SQL schema safety test's unqualified table reference check.
6. **Config test count** — `retain_chunk_batch_size` was added to `_CONFIGURABLE_FIELDS` without updating the test assertion.
Summary
Major retain pipeline overhaul: fix deadlocks, eliminate write amplification, add producer-consumer pipeline, and defer semantic ANN to a single post-commit pass. Tested on BEAM 10m benchmark (1M+ facts, zero errors).
Architecture: 3-Phase Retain Pipeline
Phase 1 — Entity Resolution (separate connection, read-heavy)
Phase 2 — Core Write Transaction (atomic)
- `INSERT ... SELECT FROM unnest()` for all link types — prevents deadlocks
- `skip_exists_check=True` on temporal/causal INSERT (saves ~0.5s/batch)
- `HINDSIGHT_API_RETAIN_MAX_CONCURRENT` (default: 4)

Phase 3 — Best-Effort Display Data (post-transaction, error-isolated)
- Query-time entity expansion (`unit_entities` self-join instead)

Producer-Consumer Streaming Pipeline
Replaces sequential extract-write-extract-write with concurrent processing:
- `RETAIN_CHUNK_BATCH_SIZE` (default: 100), runs Phase 1+2+3 per batch

Deferred Semantic ANN Pass
Instead of creating within-batch semantic links in Phase 2 + fire-and-forget ANN per batch:
- `result_metadata.facts_committed` flag — on crash/retry, skip extraction and jump straight to the ANN pass

Dead Code Removed
- `process_entities_batch` (legacy single-connection entity processing)
- `extract_entities_batch_optimized` (legacy monolithic entity+link function)
- Legacy ANN inline fallback in `create_semantic_links_batch`

Benchmarks
50MB document (mock LLM, clean DB):
BEAM 10m benchmark (real LLM, production workload):
Test Coverage
- `test_streaming_chunk_batching_produces_same_facts`, `test_streaming_chunk_batching_recovery`, `test_streaming_disabled_for_small_docs`
- `_cap_links_per_unit`, `compute_semantic_links_within_batch`

Configuration
- `HINDSIGHT_API_RETAIN_MAX_CONCURRENT`
- `HINDSIGHT_API_RETAIN_CHUNK_BATCH_SIZE`