
perf: parallelize sync content processing (4m25s → 2m00s)#1974

Draft
Asi0Flammeus wants to merge 3 commits into `dev` from `feat/sync-parallelization`

Conversation


Asi0Flammeus (Collaborator) commented Mar 29, 2026

Context

The content sync takes ~4m25s, which is an irritant during local review/test workflows where sync is triggered frequently. The bottleneck is sequential DB upserts (for...of await), not network or I/O.

Root cause analysis

The sync architecture uses a mark-and-sweep garbage collection pattern via last_sync:

  1. Capture sync_date = NOW() at start
  2. Process ALL entities → each gets last_sync = NOW()
  3. DELETE WHERE last_sync < sync_date → removes anything not touched

This design is deliberately robust (impossible to leave orphaned data) but forces a full sync every time. Changing the GC mechanism was evaluated and rejected due to high risk/effort ratio.
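The mark-and-sweep pattern can be illustrated with a small in-memory sketch (hypothetical types and names, not the actual sync code; the real version runs these steps as SQL against Postgres):

```typescript
// In-memory illustration of the last_sync mark-and-sweep pattern.
// Every processed entity is stamped with the run's sync date; the
// sweep deletes anything the run did not touch.
type Row = { id: string; lastSync: number };

function runSync(table: Map<string, Row>, presentIds: string[]): void {
  const syncDate = Date.now();          // 1. capture sync_date = NOW()
  for (const id of presentIds) {
    // 2. upsert every entity found in the content source
    table.set(id, { id, lastSync: syncDate });
  }
  for (const [id, row] of table) {
    // 3. DELETE WHERE last_sync < sync_date
    if (row.lastSync < syncDate) table.delete(id);
  }
}
```

Because step 3 only trusts the timestamps written in step 2, a row can never be orphaned: either it was re-marked this run, or it is swept.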

What this PR does

Replaces sequential processing with concurrent execution using a local pMap helper (13 lines, no external dependency). The GC mechanism, all update functions, and all business logic remain untouched.
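A minimal sketch of what such a 13-line pMap helper could look like (illustrative, not the PR's actual code):

```typescript
// Run `mapper` over `items` with at most `concurrency` promises in
// flight. A fixed pool of workers pulls the next index until the
// input is exhausted; results are stored at their original index.
async function pMap<T, R>(
  items: readonly T[],
  mapper: (item: T) => Promise<R>,
  concurrency: number,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  const worker = async (): Promise<void> => {
    while (next < items.length) {
      const i = next++; // claimed synchronously, so no two workers share an index
      results[i] = await mapper(items[i]);
    }
  };
  await Promise.all(
    Array.from({ length: Math.min(concurrency, items.length) }, worker),
  );
  return results;
}
```

The shared `next` counter is safe without locks because JavaScript is single-threaded: each worker claims an index synchronously before its first `await`.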

Three-phase execution respecting foreign key dependencies:

  • Phase 1 (parallel): professors, labs, resources, events, blogs, legals, bcerts
  • Phase 2 (parallel, after phase 1): courses, tutorials — depend on professors
  • Phase 3 (parallel, after phase 2): quiz questions, assignments — depend on courses
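The phase barriers could be orchestrated roughly as follows (stand-in `Task` lists and a tiny inline limiter; the real code calls the existing `update*` functions through the pMap helper):

```typescript
// Three sequential phases, each internally parallel. A phase only
// starts once the previous one has fully settled, which preserves
// the FK dependencies (courses → professors, quizzes → courses).
type Task = () => Promise<void>;

async function runPhase(tasks: Task[], concurrency = 10): Promise<void> {
  // concurrency 10 matches the postgres.js default pool size
  let next = 0;
  const worker = async () => {
    while (next < tasks.length) await tasks[next++]();
  };
  await Promise.all(
    Array.from({ length: Math.min(concurrency, tasks.length) }, worker),
  );
}

async function syncContent(files: Record<string, Task[]>): Promise<void> {
  // Phase 1: independent types
  await runPhase([
    ...files.professors, ...files.labs, ...files.resources,
    ...files.events, ...files.blogs, ...files.legals, ...files.bcerts,
  ]);
  // Phase 2: FK on professors
  await runPhase([...files.courses, ...files.tutorials]);
  // Phase 3: FK on courses
  await runPhase([...files.quizQuestions, ...files.assignments]);
}
```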

Deadlock prevention: Tags are sorted before insertion (.sort() on lowercaseTags) to ensure deterministic lock ordering on the shared content.tags table.
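The idea in a small sketch (`lowercaseTags` is from the PR; the helper name is hypothetical): when every concurrent transaction acquires its row locks on `content.tags` in the same lexicographic order, two transactions can never wait on each other in a cycle.

```typescript
// Deterministic lock ordering: every writer inserts tags in the same
// sorted order, so lock acquisition on content.tags cannot cycle.
function orderTagsForInsert(tags: string[]): string[] {
  const lowercaseTags = tags.map((t) => t.toLowerCase());
  // .sort() is lexicographic; any total order works as long as it is
  // the same for every concurrent writer.
  return Array.from(new Set(lowercaseTags)).sort();
}
```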

Concurrency is set to 10, matching the default postgres.js connection pool size.

Benchmark results

| Phase | Before | After |
| --- | --- | --- |
| Phase 1 (independent types) | ~50s | ~15s |
| Phase 2 (courses + tutorials) | ~40s | ~20s |
| Phase 3 (quiz questions + assignments) | ~120s | ~25s |
| Content processing total | ~210s | ~60s (3.5x faster) |
| Typesense indexing | ~50s | ~50s (no change) |
| Total sync | ~260s | ~110s |

What does NOT change

  • GC mechanism (mark-and-sweep via last_sync)
  • All update* / delete* / groupBy* functions
  • CDN sync, Typesense indexing, location sync
  • Database schema
  • No new external dependencies

Files changed

  • packages/service-content/src/lib/utils/concurrency.ts — NEW: local pMap helper (13 lines)
  • packages/service-content/src/lib/index.ts — sequential loops → 3 parallel phases
  • 7 import files — .sort() added on tag insertion

Local tests

  • Full sync completes without new errors
  • TypeScript compiles cleanly
  • Biome lint passes
  • Entity counts in the DB are consistent with known quantities
  • Second sync completes all phases normally
  • Typesense search returns correct results (3831 results for "bitcoin")
  • GC deletion test (delete file, entity removed) — not testable locally, see note below

Note on GC deletion test

The GC deletion phase (processDeleteOldEntities) is skipped when syncErrors.length > 0. In local dev, 54 assignment PDF uploads fail with ECONNREFUSED because there is no S3 service configured locally. This means the GC never runs in local, regardless of this PR.

This is a pre-existing gap in the local dev setup. Adding a MinIO container (S3-compatible) to local-dev.sh would fix this and enable full GC testing. Tracked separately.

The GC code itself (DELETE WHERE last_sync < sync_date) was not modified by this PR.

Asi0Flammeus and others added 3 commits March 29, 2026 16:18
Replace sequential for...of await loops with concurrent processing
using a local pMap helper (no external dependency).

Three-phase execution respecting FK dependencies:
- Phase 1: professors, labs, resources, events, blogs, legals, bcerts
- Phase 2: courses, tutorials (depends on professors)
- Phase 3: quiz questions, assignments (depends on courses)

Also sort tags before insertion to prevent potential deadlocks
on the shared content.tags table during concurrent processing.

Benchmarked locally: content processing 210s → 61s (3.4x faster).
Total sync 4m25s → 2m00s (remaining time is Typesense indexing).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The 17 DB queries that fetch content for Typesense search indexing
were executed sequentially due to the spread-await pattern used
since the initial implementation. Each query is an independent
SELECT on a different table — no ordering dependency.

Replace [...(await A), ...(await B)] with (await Promise.all([A, B])).flat()
to run all 17 queries concurrently through the connection pool.

Expected: ~35s of sequential DB queries → ~3-5s in parallel.
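The change in a nutshell (`queries` stands in for the 17 real SELECTs; each is independent, so dispatch order does not matter):

```typescript
// Sequential (old): [...(await fetchA()), ...(await fetchB())]
// — each await blocks the next query from being dispatched.
// Concurrent (new): dispatch all queries at once, then flatten.
async function collectDocs<T>(
  queries: Array<() => Promise<T[]>>,
): Promise<T[]> {
  const batches = await Promise.all(queries.map((q) => q()));
  return batches.flat();
}
```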

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Asi0Flammeus Asi0Flammeus marked this pull request as draft March 29, 2026 15:34
