perf: 3-phase retain pipeline — fix deadlocks, cap temporal links, query-time entity expansion#722
Merged
nicoloboschi merged 21 commits into main on Apr 1, 2026
Conversation
…ery-time entity expansion

Major retain pipeline overhaul addressing deadlocks, write amplification, and TimeoutErrors. Restructures retain into three phases:

- Phase 1: Entity resolution on a separate connection (read-heavy)
- Phase 2: Core write transaction (atomic) — facts, unit_entities, links
- Phase 3: Best-effort display data (error-isolated) — entity viz links, stats

Key changes:

- Sorted bulk INSERT FROM unnest() prevents deadlocks
- Temporal links capped to top-20 per unit (95% reduction)
- Batched semantic ANN via temp table + LATERAL
- Query-time entity expansion via unit_entities self-join
- Entity viz links moved to Phase 3 (post-transaction)
- HINDSIGHT_API_RETAIN_MAX_CONCURRENT config (default: 32)
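The deadlock fix rests on deterministic lock ordering: if every writer inserts link rows in one canonical order via a single bulk statement, concurrent transactions can wait on each other but never form a lock cycle. A minimal sketch of the idea — table and column names here are hypothetical, not the actual schema:

```python
def sort_link_rows(rows):
    """Canonical (source_id, target_id) order so all concurrent writers
    acquire row locks in the same sequence — waits are possible, cycles are not."""
    return sorted(rows, key=lambda r: (r[0], r[1]))

def build_bulk_insert_sql(table):
    # One INSERT ... SELECT FROM unnest() instead of N single-row INSERTs
    # (hypothetical columns; parameters bound as arrays by the driver).
    return (
        f"INSERT INTO {table} (source_id, target_id, weight) "
        "SELECT * FROM unnest(%(sources)s, %(targets)s, %(weights)s)"
    )

rows = sort_link_rows([(7, 2, 0.9), (1, 5, 0.8), (1, 3, 0.7)])
```

The sort costs O(n log n) on a batch that is already in memory, which is negligible next to the INSERT itself.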
The hardcoded top_k=5 was artificially limiting semantic link creation. Link expansion retrieval can consume up to budget (50-200) semantic neighbors per seed set, but each fact had only 5 outgoing edges, making the bidirectional graph very sparse. Increasing to 20 gives retrieval 4x more edges to work with. The ANN probe cost is unchanged (same HNSW traversal per fact, just returning more rows), and the INSERT cost is negligible (~14k rows via bulk INSERT).

Also: all 18 TimeoutErrors in the latest benchmark (beam-1m-u20) came from Gemini LLM calls, zero from the database — confirming the entity resolution split eliminated DB timeouts entirely.
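The cap itself is a simple top-k truncation by similarity, applied per fact after the ANN probe returns its candidates. A sketch with a hypothetical function name:

```python
def cap_outgoing_links(candidates, top_k=20):
    """candidates: (neighbor_id, similarity) pairs for one fact.
    Keep only the top_k most similar neighbors — at top_k=5 the graph was
    too sparse for link expansion; 20 gives retrieval 4x more edges per fact."""
    return sorted(candidates, key=lambda c: c[1], reverse=True)[:top_k]

capped = cap_outgoing_links([(i, i / 30) for i in range(30)])
```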
The batched LATERAL ANN query (700 HNSW probes) was the last remaining source of DB TimeoutErrors — all 29 in the latest benchmark came from create_semantic_links_batch inside the Phase 2 write transaction. Split semantic link creation into three phases:

- Phase 1 (separate conn, autocommit): ANN search via temp table + LATERAL. No transaction locks, no contention with concurrent writers.
- Phase 2 (write transaction): within-batch numpy similarities (instant) + INSERT of both within-batch and Phase 1 ANN results. No DB reads.
- Phase 3 (flush_pending_stats): future hook point for re-checking ANN results after commit to catch links missed by concurrent batches.

Also adds 7 unit tests for compute_semantic_links_within_batch covering empty input, identical/orthogonal embeddings, threshold filtering, top_k cap, and tuple structure validation.
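The real compute_semantic_links_within_batch uses numpy; a dependency-free sketch of the same idea (the signature and link-tuple shape are assumptions) — pairwise cosine similarity over the batch's own embeddings, so Phase 2 needs no DB reads at all:

```python
import math

def compute_semantic_links_within_batch(embeddings, threshold=0.7, top_k=20):
    """Return (source_idx, target_idx, similarity) tuples for pairs whose
    cosine similarity clears the threshold, capped at top_k per source."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    links = []
    for i, a in enumerate(embeddings):
        sims = [(j, cos(a, b)) for j, b in enumerate(embeddings) if j != i]
        sims = [s for s in sims if s[1] >= threshold]
        sims.sort(key=lambda s: s[1], reverse=True)
        links.extend((i, j, sim) for j, sim in sims[:top_k])
    return links
```

This mirrors the unit-test cases listed above: empty input yields no links, identical embeddings link both ways, and orthogonal embeddings fall below the threshold.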
- New test_semantic_links_phase1_ann_cross_batch verifies that the Phase 1 ANN search with placeholder unit IDs correctly creates cross-batch semantic links after remapping to real IDs.
- Test PG port is now configurable via the HINDSIGHT_TEST_PG_PORT env var (default: 5556) to avoid conflicts with running benchmark daemons.
Remove retry_with_backoff from _run_db_work and _run_delta_db_work:

- Deadlocks are prevented by the sorted bulk INSERT (no need for retry)
- Transient timeouts are handled by the worker poller's task-level retry (3 attempts, 60s spacing), which is better than rapid internal retries that amplify I/O pressure during contention storms

Set the HINDSIGHT_API_RETAIN_MAX_CONCURRENT default from 32 to 4:

- The semaphore gates Phase 1 (ANN + entity resolution) + Phase 2 (writes)
- At 4 concurrent, HNSW index I/O is manageable; at 10+ concurrent the probes saturate disk and cause cascading timeouts
- LLM extraction still runs at full parallelism (the semaphore is acquired after)
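The "semaphore acquired after extraction" detail is what keeps LLM throughput unthrottled while DB phases stay bounded. A self-contained sketch with stand-in coroutines (names hypothetical):

```python
import asyncio

RETAIN_MAX_CONCURRENT = 4  # mirrors the new HINDSIGHT_API_RETAIN_MAX_CONCURRENT default

async def retain_all(n_docs):
    sem = asyncio.Semaphore(RETAIN_MAX_CONCURRENT)
    active = 0
    peak = 0

    async def extract_facts(doc):
        await asyncio.sleep(0)          # stand-in for the LLM call (unthrottled)
        return [f"fact:{doc}"]

    async def retain_one(doc):
        nonlocal active, peak
        facts = await extract_facts(doc)
        async with sem:                 # only Phase 1 + Phase 2 are gated
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)   # stand-in for ANN + write transaction
            active -= 1
        return facts

    await asyncio.gather(*(retain_one(d) for d in range(n_docs)))
    return peak

peak = asyncio.run(retain_all(20))
```

All 20 extractions fire concurrently, but the observed peak of in-flight DB work never exceeds 4.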
…ndexes
The LATERAL ANN query was falling back to sequential scan + sort (90ms/probe)
because the per-bank HNSW indexes are partial indexes filtered on fact_type.
Without fact_type in the WHERE clause, PostgreSQL couldn't use them.
Fix: iterate over ('world', 'experience') and run one HNSW-indexed ANN per
type. EXPLAIN shows 8ms/probe (was 90ms) — 11x faster.
700 probes × 8ms × 2 types = ~11s total (was ~63s via seq scan).
Temporal links now filter by fact_type in the LATERAL query — world facts only link to world facts, experience to experience. This matches how retrieval filters results and avoids wasted cross-type link rows.

New integration tests:

- test_semantic_ann_uses_hnsw_index: verifies Phase 1 ANN creates cross-batch semantic links (tests fact_type filter + placeholder remap)
- test_temporal_links_scoped_by_fact_type: verifies world facts get temporal links to other world facts but NOT to experience facts
… batch

Changed asyncio.gather(*tasks) to asyncio.gather(*tasks, return_exceptions=True) in both chunk-level and content-level fact extraction. A single chunk timeout (e.g., Gemini >90s) no longer discards all other successfully extracted facts.

For a 50MB document with 17k chunks, even a 2% chunk failure rate previously caused 0 completions (entire batch discarded). Now 16,700 facts are extracted and only the 300 failed chunks are skipped with a warning log.
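With return_exceptions=True, gather returns the exceptions in place of results instead of propagating the first one, so the caller can keep the successful chunks. A toy version of the pattern (the failure mode and names are illustrative, not the real extraction code):

```python
import asyncio

async def extract_chunk(i):
    if i % 5 == 0:                      # stand-in for a Gemini timeout
        raise TimeoutError(f"chunk {i} timed out")
    return [f"fact-{i}"]

async def extract_all(n):
    results = await asyncio.gather(
        *(extract_chunk(i) for i in range(n)),
        return_exceptions=True,         # one failure no longer discards the batch
    )
    facts, failed = [], 0
    for r in results:
        if isinstance(r, BaseException):
            failed += 1                 # skipped with a warning in the real pipeline
        else:
            facts.extend(r)
    return facts, failed

facts, failed = asyncio.run(extract_all(10))
```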
The LATERAL query for temporal links passed all unit_ids at once into unnest(), causing PostgreSQL timeouts on documents with 16k+ chunks. Split into batches of 500 units per query to keep each under the command_timeout.

Also identified: HNSW index creation on shared pg0 instances with 50k+ existing units exceeds the 60s command_timeout. This is a test infrastructure issue (shared pg0 accumulates data) but also affects production when creating new banks on large instances.
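The batching itself is a plain fixed-size slicing of the unit_id list, one query per slice — a sketch (helper name hypothetical):

```python
def batched(ids, size=500):
    """Yield fixed-size slices of unit_ids so each unnest() query
    stays comfortably under the command_timeout."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

batches = list(batched(list(range(16000))))
```

A 16k-chunk document becomes 32 bounded queries instead of one unbounded unnest().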
…H_SIZE)

Process chunks in mini-batches of N (default 500), committing each batch to the DB before starting the next. This prevents OOM kills on large documents (50MB / 17k+ chunks) by keeping only ~500 facts + embeddings in memory at a time instead of 50k+.

Each mini-batch goes through the full Phase 1 → 2 → 3 pipeline independently, sharing the same document_id. On recovery (process dies mid-way), delta retain detects already-committed chunks via content_hash and skips them — only remaining chunks get re-extracted.

Config: HINDSIGHT_API_RETAIN_CHUNK_BATCH_SIZE (default: 500, 0 to disable). Per-bank configurable via the hierarchical config system.

Tests:
- test_streaming_chunk_batching_produces_same_facts
- test_streaming_chunk_batching_recovery (delta retain skips committed chunks)
- test_streaming_disabled_for_small_docs
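The recovery path boils down to a set-membership check on content hashes: anything already committed is filtered out before extraction. A minimal sketch (function names are illustrative):

```python
import hashlib

def chunk_hash(chunk):
    return hashlib.sha256(chunk.encode()).hexdigest()

def pending_chunks(chunks, committed_hashes):
    """Delta-retain recovery: skip chunks whose content_hash is already committed."""
    return [c for c in chunks if chunk_hash(c) not in committed_hashes]

committed = {chunk_hash("chunk-a")}           # batch 1 committed before the crash
remaining = pending_chunks(["chunk-a", "chunk-b", "chunk-c"], committed)
```

Only the remaining chunks are sent back to the LLM on retry, so committed work is never re-extracted.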
Replace the sequential streaming loop with a producer-consumer pipeline:

- LLM producer fires concurrent chunk extractions (semaphore-bounded)
- DB consumer drains the queue in batches, runs Phase 1+2+3 per batch
- LLM and DB work overlap instead of running sequentially

Defer semantic links to a single final ANN pass after all batches commit:

- Remove within-batch semantic links from Phase 2 (was 2.6s/batch)
- Run parallel ANN (4 connections) after all facts are committed
- top_k reduced from 50 to 20 (recall uses at most 20 neighbors)
- Recovery via operation result_metadata checkpoint

Additional optimizations:

- skip_exists_check on temporal/causal link INSERT (saves ~0.5s/batch)
- WHERE EXISTS guard on semantic link INSERT (handles document upsert)
- timeout=300s on ANN queries and bulk INSERT for large banks
- Demote [ANN] debug logs to logger.debug()
- Fix docstring typos (agent_id → bank_id)
- Fix content_index remapping in producer-consumer batches
- Fix delta retain passing contents vs delta_contents

50MB benchmark (mock LLM): 9.2 min (was 23 min) — 2.5x faster. BEAM 10m benchmark: zero deadlocks, zero DB errors.
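The producer-consumer shape can be sketched with an asyncio.Queue and a sentinel: the producer pushes extracted facts as they arrive, the consumer commits whenever a batch fills. Everything below is a simplified stand-in, not the actual pipeline code:

```python
import asyncio

async def pipeline(chunks, batch_size=3):
    queue = asyncio.Queue(maxsize=8)    # bounds memory between producer and consumer
    committed = []

    async def producer():
        for c in chunks:
            facts = [f"fact:{c}"]       # stand-in for LLM extraction
            await queue.put(facts)
        await queue.put(None)           # sentinel: extraction finished

    async def consumer():
        batch = []
        while True:
            item = await queue.get()
            if item is None:
                break
            batch.extend(item)
            if len(batch) >= batch_size:
                committed.append(list(batch))   # stand-in for Phase 1+2+3 commit
                batch = []
        if batch:
            committed.append(batch)     # flush the final partial batch

    await asyncio.gather(producer(), consumer())
    return committed

batches = asyncio.run(pipeline(list(range(7))))
```

Because producer and consumer run under one gather, LLM and DB work overlap instead of alternating.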
- Remove process_entities_batch (legacy single-connection entity processing)
- Remove extract_entities_batch_optimized (only caller was the above)
- Remove fallback entity processing inside the Phase 2 transaction
- Remove legacy ANN inline fallback in create_semantic_links_batch
- Remove fallback entity_links direct-insert path in Phase 3
- Make resolved_entity_ids/entity_to_unit/unit_to_entity_ids required params
… code

- Add EntityResolutionResult and Phase1Result dataclasses in types.py
- Replace the 4-tuple return from _pre_resolve_phase1 with Phase1Result
- Remove dead `entity_links = []` variables in retain_batch and _try_delta_retain
- Remove the unused `confidence_score` parameter from orchestrator.retain_batch and _retain_batch_async_internal (was accepted but never used)
… trigram matching

The entity resolution query had LIKE '%...' substring conditions that bypassed the GIN trigram index, causing full sequential scans of the entities table. On banks with 10k+ entities, this caused TimeoutErrors (observed in BEAM 10m).

Changes:
- Remove LIKE fallbacks, use the trigram % operator only (GIN index-based)
- Lower the similarity threshold from 0.3 to 0.15 to catch substring relationships
- Use LOWER() on both sides for case-insensitive matching
- Migration: recreate the GIN trigram index on LOWER(canonical_name)
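To see why 0.15 catches substring relationships, it helps to mimic what pg_trgm's % operator measures. The sketch below is a rough approximation of pg_trgm's trigram similarity (pg_trgm pads and tokenizes per word, so exact scores differ), not the extension itself:

```python
def trigrams(s):
    """Approximate pg_trgm tokenization: lowercase, pad, extract 3-grams."""
    s = f"  {s.lower()} "
    return {s[i:i + 3] for i in range(len(s) - 2)}

def similarity(a, b):
    """Set-overlap similarity in the spirit of pg_trgm's % operator."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0
```

A name and its longer form share all of the short name's trigrams but dilute the union, landing well below 0.3 yet above 0.15 — hence the lowered threshold.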
…000)

_chunk_contents_for_delta defaulted to chunk_size=120000 while the streaming path used 3000. On retry, delta re-chunked the document with different boundaries, found 0 matching chunks, and fell through to full re-extraction, wasting all LLM calls on already-committed chunks.

Fix: use the same default (3000) so chunk hashes match on recovery.
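The failure is easy to reproduce in miniature: hashing fixed-size chunks only matches across runs if the chunk boundaries are identical. A sketch (chunking here is plain fixed-width slicing, an assumption about the real chunker):

```python
import hashlib

def chunk(text, size):
    return [text[i:i + size] for i in range(0, len(text), size)]

def hashes(chunks):
    return {hashlib.sha256(c.encode()).hexdigest() for c in chunks}

doc = "".join(str(i) for i in range(4000))

committed = hashes(chunk(doc, 3000))     # streaming path at retain time
retry_ok = hashes(chunk(doc, 3000))      # delta retry with the matching default
retry_bad = hashes(chunk(doc, 120000))   # old delta default: boundaries shift
```

With mismatched sizes the hash sets are disjoint, so delta retain sees zero committed chunks and re-extracts everything.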
…retry recovery

When no document_id is provided, retain generates a UUID. On retry, a new UUID was generated, making delta retain and streaming chunk-hash recovery unable to find previously committed chunks. All LLM extraction was wasted on retry.

Fix: resolve document_id early in retain_batch (before delta), persist it to operation result_metadata, and recover it on retry. Both the delta and streaming paths now see the same document_id across attempts.
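The resolve-once-persist-recover order can be sketched as a small helper (the function shape and metadata key are assumptions about the actual code):

```python
import uuid

def resolve_document_id(request_doc_id, result_metadata):
    """Resolve the document_id exactly once per logical retain and persist it,
    so a retried attempt recovers the same id instead of minting a new UUID."""
    if request_doc_id:
        doc_id = request_doc_id
    elif "document_id" in result_metadata:
        doc_id = result_metadata["document_id"]   # recovered on retry
    else:
        doc_id = str(uuid.uuid4())                # first attempt, no id provided
    result_metadata["document_id"] = doc_id       # persisted with the operation
    return doc_id

meta = {}
first = resolve_document_id(None, meta)
retry = resolve_document_id(None, meta)           # same metadata → same id
```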
…reaming path

All retains now go through the producer-consumer streaming pipeline, regardless of document size. Small documents are processed as a single batch. This eliminates the maintenance burden of two separate code paths.

Also fix document upsert: compare the content hash to distinguish recovery (same content, partially committed) from update (different content, needs cascade-delete). Previously, existing chunks always triggered recovery mode.
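The recovery-vs-update decision reduces to a single hash comparison against the stored document. A sketch (mode names and function shape are illustrative):

```python
import hashlib

def upsert_mode(new_content, existing_hash):
    """Decide how to treat an incoming document that may already exist:
    same hash → retry recovery; different hash → genuine update (cascade-delete)."""
    h = hashlib.sha256(new_content.encode()).hexdigest()
    if existing_hash is None:
        return "insert"
    return "recovery" if h == existing_hash else "update"

stored = hashlib.sha256("v1 content".encode()).hexdigest()
```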
…ext dataclass

- Remove dead _handle_zero_facts_documents (no callers after path unification)
- Remove unused imports: defaultdict, EntityLink
- Replace the raw dict phase3_context with a typed Phase3Context dataclass
- Update _build_and_insert_entity_links_phase3 to use the typed parameter
nicoloboschi added a commit that referenced this pull request on Apr 1, 2026
The 3-phase retain pipeline (914ba79) introduced several regressions:

1. **Per-content tags lost** — streaming pipeline used `contents[0].tags` for ALL chunks, breaking tag-based visibility. Fixed by tracking the chunk-to-content mapping so each chunk uses its source content's tags.
2. **Multi-document batches broken** — batches with per-content `document_id` values were merged into a single document. Fixed by grouping by document_id and processing each group independently.
3. **Migration ID collision** — `d6e7f8a9b0c1` was used by both `drop_documents_metadata` and `case_insensitive_entities_trgm_index`. Renamed the trgm migration to `e8f9a0b1c2d3`, fixed the chain, and added a missing schema prefix on DROP INDEX.
4. **Graph entity inheritance** — `get_graph_data` queried entities for observation IDs only, but observations inherit entities from source memories. Fixed by querying `all_relevant_ids`.
5. **Docstring false positives** — link_utils.py docstrings triggered the SQL schema safety test's unqualified table reference check.
6. **Config test count** — `retain_chunk_batch_size` was added to `_CONFIGURABLE_FIELDS` without updating the test assertion.
Summary
Major retain pipeline overhaul: fix deadlocks, eliminate write amplification, add producer-consumer pipeline, and defer semantic ANN to a single post-commit pass. Tested on BEAM 10m benchmark (1M+ facts, zero errors).
Architecture: 3-Phase Retain Pipeline
Phase 1 — Entity Resolution (separate connection, read-heavy)
Phase 2 — Core Write Transaction (atomic)
- `INSERT ... SELECT FROM unnest()` for all link types — prevents deadlocks
- `skip_exists_check=True` on temporal/causal INSERT (saves ~0.5s/batch)
- `HINDSIGHT_API_RETAIN_MAX_CONCURRENT` (default: 4)

Phase 3 — Best-Effort Display Data (post-transaction, error-isolated)
- Query-time entity expansion (`unit_entities` self-join instead)

Producer-Consumer Streaming Pipeline
Replaces sequential extract-write-extract-write with concurrent processing:
- `RETAIN_CHUNK_BATCH_SIZE` (default: 100), runs Phase 1+2+3 per batch

Deferred Semantic ANN Pass
Instead of creating within-batch semantic links in Phase 2 + fire-and-forget ANN per batch:
- `result_metadata.facts_committed` flag — on crash/retry, skip extraction and jump straight to the ANN pass

Dead Code Removed
- `process_entities_batch` (legacy single-connection entity processing)
- `extract_entities_batch_optimized` (legacy monolithic entity+link function)
- Legacy ANN inline fallback in `create_semantic_links_batch`

Benchmarks
50MB document (mock LLM, clean DB):
BEAM 10m benchmark (real LLM, production workload):
Test Coverage
- `test_streaming_chunk_batching_produces_same_facts`, `test_streaming_chunk_batching_recovery`, `test_streaming_disabled_for_small_docs`
- `_cap_links_per_unit`, `compute_semantic_links_within_batch`

Configuration
- `HINDSIGHT_API_RETAIN_MAX_CONCURRENT`
- `HINDSIGHT_API_RETAIN_CHUNK_BATCH_SIZE`