Final Performance Optimization Analysis

Generated: 2026-03-22 Scope: Pre-opt baseline through post-opt-11 (11 optimization rounds) Database: Neo4j Community Edition 2026.02 + PostgreSQL Dataset: ~10M Artists, ~16.4M Releases, ~2.5M Masters, 16 Genres, 757 Styles

Executive Summary

Over 11 optimization rounds, the Discogsography API went from a system where most graph queries took 10-70 seconds to one where 82 of 88 endpoints respond in under 100ms. The overall average latency dropped 249x (10.95s → 0.044s), errors dropped from 37 to 3 (only expected), and 26 new endpoints were added during the process while still improving performance.

Metric	Pre-Opt	Final (Opt-11)	Improvement
Endpoints tested	62	88	+26 new
Total errors	37	3	-92%
Overall avg latency	10.95s	0.044s	249x
Slowest avg	69.75s	1.13s	62x
Endpoints under 100ms	~20	82	4x more
Endpoints under 500ms	~30	86	3x more

Complete Endpoint Performance Progression

Explore Endpoints

Endpoint	Pre-Opt	Opt-1	Opt-4	Opt-7	Opt-8	Opt-9	Opt-11	Speedup
explore/year-range	2.13s	32.21s*	0.043s	0.013s	0.044s	0.045s	0.045s	47x
explore/genre-emergence	64.31s	69.44s	44.28s	0.040s	44.85s	0.069s	0.102s	630x
explore/artist/Indecent Noise	0.098s	0.058s	0.079s	0.036s	0.068s	0.112s	0.002s	49x
explore/artist/Solarstone	0.090s	0.041s	0.094s	0.033s	0.075s	0.180s	0.001s	90x
explore/artist/Green Day	0.056s	0.015s	0.154s	0.014s	0.094s	0.426s	0.001s	56x
explore/artist/Johnny Cash	0.080s	0.029s	0.458s	0.026s	0.267s	1.305s	0.001s	80x
explore/genre/Electronic	22.02s	19.79s	26.96s	14.54s	0.018s	0.016s	0.019s	1,159x
explore/genre/Rock	26.15s	23.78s	28.46s	17.52s	0.007s	0.007s	0.008s	3,269x
explore/style/Trance	1.62s	1.48s	1.36s	1.01s	0.015s	0.015s	0.018s	90x
explore/style/Hard Trance	0.50s	0.46s	0.43s	0.26s	0.014s	0.015s	0.016s	31x
explore/style/Progressive Trance	0.62s	0.57s	0.48s	0.32s	0.007s	0.007s	0.007s	89x
explore/label/Hooj Choons	0.069s	0.035s	0.034s	0.034s	0.064s	0.161s	0.001s	69x
explore/label/Reprise Records	0.263s	0.229s	0.196s	0.183s	1.650s	4.881s	0.001s	263x
explore/label/Tracid Traxx	0.009s	0.004s	0.008s	0.004s	0.010s	0.008s	0.017s	— (404)

*Opt-1 year-range had a 92.5s cold-cache outlier skewing the average.

Trends Endpoints

Endpoint	Pre-Opt	Opt-1	Opt-4	Opt-7	Opt-8	Opt-9	Opt-11	Speedup
trends/artist/Indecent Noise	0.058s	0.042s	0.024s	0.021s	0.024s	0.054s	0.035s	1.7x
trends/artist/Solarstone	0.142s	0.081s	0.023s	0.020s	0.024s	0.095s	0.031s	4.6x
trends/artist/Green Day	0.330s	0.222s	0.012s	0.009s	0.012s	0.240s	0.020s	17x
trends/artist/Johnny Cash	0.934s	0.735s	0.021s	0.017s	0.020s	0.746s	0.039s	24x
trends/genre/Electronic	28.32s	26.36s	33.44s	3.30s	1.02s	0.001s	0.001s	28,320x
trends/genre/Rock	28.87s	26.70s	26.85s	4.03s	1.23s	0.001s	0.001s	28,870x
trends/style/Trance	27.60s	21.42s	0.31s	0.26s	0.090s	0.001s	0.001s	27,600x
trends/style/Hard Trance	7.43s	2.61s	0.10s	0.08s	0.036s	0.002s	0.001s	7,430x
trends/style/Progressive Trance	4.53s	3.06s	0.11s	0.09s	0.032s	0.002s	0.001s	4,530x
trends/label/Hooj Choons	0.037s	0.046s	0.019s	0.019s	0.052s	0.099s	0.001s	37x
trends/label/Reprise Records	2.06s	3.54s	0.064s	0.063s	0.067s	4.218s	0.001s	2,060x
trends/label/Tracid Traxx	0.006s	0.004s	0.005s	0.005s	0.011s	0.006s	0.001s	6x

Path Finder Endpoints

Endpoint	Pre-Opt	Opt-1	Opt-2	Opt-11	Speedup
path/Indecent Noise → Solarstone	53.38s	51.56s	0.06s	0.108s	494x
path/Indecent Noise → Green Day	56.01s	52.47s	0.13s	0.145s	386x
path/Indecent Noise → Johnny Cash	47.25s	44.94s	0.31s	0.342s	138x
path/Solarstone → Green Day	68.22s	60.24s	0.12s	0.111s	615x
path/Solarstone → Johnny Cash	66.57s	63.90s	0.35s	0.308s	216x
path/Green Day → Johnny Cash	69.75s	66.28s	0.32s	0.317s	220x

Search Endpoints (added in Opt-1)

Endpoint	First Seen	Opt-4	Opt-8	Opt-9	Opt-11
search/Indecent Noise	0.058s	0.029s	0.049s	0.042s	0.033s
search/Solarstone	0.025s	0.030s	0.047s	0.026s	0.025s
search/Green Day	0.037s	0.023s	0.053s	0.028s	0.025s
search/Johnny Cash	0.381s	0.141s	0.367s	0.158s	0.095s
search/Electronic	0.517s	0.482s	0.536s	0.301s	0.001s
search/Rock	3.691s	6.184s	8.526s	5.218s	9.091s
search/Trance	0.583s	1.977s	0.819s	0.813s	0.001s
search/Hard Trance	0.033s	0.067s	0.043s	0.047s	0.043s
search/Progressive Trance	0.030s	0.059s	0.051s	0.070s	0.051s
search/Hooj Choons	0.008s	0.032s	0.020s	0.019s	0.019s
search/Reprise Records	0.038s	0.030s	0.030s	0.029s	0.029s
search/Tracid Traxx	0.012s	0.015s	0.023s	0.019s	0.018s

Similarity, Label-DNA & Other Endpoints (added in Opt-2/Opt-4)

Endpoint	First Seen	Opt-4	Opt-8	Opt-9	Opt-11
artist-similar/Indecent Noise	54.08s	19.07s	0.002s	0.003s	0.001s
artist-similar/Solarstone	56.03s	27.00s	0.002s	0.003s	0.001s
artist-similar/Green Day	32.35s	0.004s	30.40s	0.003s	0.002s
artist-similar/Johnny Cash	111.81s	40.27s	0.002s	0.003s	0.002s
label-similar/Hooj Choons	94.60s	34.36s	6.83s	0.005s	0.002s
label-similar/Reprise Records	timeout	51.24s	5.50s	0.002s	0.001s
label-similar/Tracid Traxx	66.65s	21.93s	1.97s	0.002s	0.001s
label-dna/Hooj Choons	0.17s	0.23s	0.16s	0.004s	0.001s
label-dna/Reprise Records	21.06s	4.38s	8.13s	0.003s	0.002s
label-dna/Tracid Traxx	0.05s	0.04s	0.04s	0.002s	0.001s
label-dna-compare	0.36s	0.35s	0.36s	0.076s	0.003s

Insights Endpoints

Endpoint	Pre-Opt	Opt-4	Opt-8	Opt-11	Speedup
insights/top-artists	0.004s	0.026s	0.026s	0.036s	—
insights/genre-trends/Electronic	—	—	0.004s	0.021s	—
insights/genre-trends/Rock	—	—	0.003s	0.004s	—
insights/label-longevity	0.003s	0.004s	0.003s	0.004s	—
insights/this-month	0.780s	0.965s	0.802s	1.132s	—
insights/data-completeness	0.004s	0.004s	0.003s	0.021s	—
insights/status	0.005s	0.005s	0.005s	0.005s	—

Autocomplete, Node-Details & Expand Endpoints

Endpoint	Opt-11 Avg	Notes
autocomplete/artist/*	0.004-0.007s	Fulltext index, no optimization needed
autocomplete/genre/*	0.017-0.020s	Fulltext index
autocomplete/style/*	0.004-0.007s	Fulltext index
autocomplete/label/*	0.004s	Fulltext index
node-details/artist/*	0.015-0.046s	Property reads, proportional to release count
node-details/label/*	0.006-0.026s	Property reads
expand/artist/*/releases	0.008-0.040s	SKIP/LIMIT pagination
expand/label/*/releases	0.007-0.176s	SKIP/LIMIT pagination

Error Progression

Phase	Errors	Sources
Pre-Opt	37	search/* (27 — search was broken), genre-emergence (1 — timeout), Tracid Traxx (3), other timeouts (6)
Opt-1	3	Tracid Traxx (3) — search fixed, timeouts eliminated
Opt-2	15	Tracid Traxx (3), artist-similar timeouts (9), label-similar/Reprise timeout (3)
Opt-4	13	Tracid Traxx (3), artist-similar (9), node-details (1)
Opt-8	3	Tracid Traxx (3) — all similarity timeouts eliminated
Opt-9	7	Tracid Traxx (3), label-dna-compare 500 (3), connection reset (1)
Opt-10	3	Tracid Traxx (3) — label-dna-compare fixed
Opt-11	3	Tracid Traxx (3) — only expected errors

Areas Addressed to Improve Performance

Area 1: Path Finder — shortestPath Relationship Type Filtering

Problem: The original path finder used shortestPath((a)-[*..6]-(b)) which explored ALL relationship types in a bidirectional BFS. On a graph with ~50M relationships across 7+ types, this meant expanding millions of edges per hop.

Solution: Added explicit relationship type filter to limit BFS to meaningful edge types:

MATCH p = shortestPath((a)-[:BY|ON|IS|ALIAS_OF|MEMBER_OF|MASTER_OF|DERIVED_FROM*..6]-(b))

Also ensured both endpoints are resolved via unique index seeks (Artist(id)) before BFS begins, eliminating any AllNodesScan.

Impact: 47-70s → 0.1-0.35s (138-615x improvement)

Neo4j Principle: Always specify relationship types and direction in variable-length patterns. Each excluded relationship type exponentially reduces the BFS search space.

Area 2: Explore Genre/Style — Pre-Computed Aggregate Properties

Problem: The explore/genre endpoint executed queries like:

MATCH (g:Genre {name: $name})
WITH g
MATCH (r:Release)-[:IS]->(g)
WITH g, collect(DISTINCT r) AS releases
UNWIND releases AS r
OPTIONAL MATCH (r)-[:BY]->(a:Artist)
OPTIONAL MATCH (r)-[:ON]->(l:Label)
OPTIONAL MATCH (r)-[:IS]->(s:Style)

For "Electronic" (5.6M releases), this produced 180M DB accesses and allocated 201MB of memory to collect all releases before traversing them 3 more times.

Solution (multi-phase):

Phase 1 (Opt-2): Replaced collect+UNWIND with streaming aggregation via CALL {} subqueries. Reduced to ~45M DB hits.
Phase 2 (Opt-6): Split into 4 concurrent independent count queries via asyncio.gather(). Reduced to ~28M DB hits.

Phase 3 (Opt-8): Pre-computed aggregate properties on Genre/Style/Label nodes during the graphinator post-import step:

CALL { } IN TRANSACTIONS OF 1 ROWS
SET g.release_count = ..., g.artist_count = ..., g.label_count = ..., g.style_count = ...

The explore query became a simple property read:

MATCH (g:Genre {name: $name})
RETURN g.name AS id, g.name AS name,
       g.release_count AS release_count, g.artist_count AS artist_count,
       g.label_count AS label_count, g.style_count AS style_count

Impact: 22-26s → 0.008-0.019s (1,159-3,269x improvement). From 180M DB accesses to 6 DB accesses.

Neo4j Principle: When aggregate counts are expensive to compute at runtime but change only on data import, denormalize them as node properties with indexes. Trade storage for query speed.

Area 3: Genre Emergence — Pre-Computed first_year Properties

Problem: The genre-emergence query scanned all ~16M releases to find the earliest year per genre:

MATCH (g:Genre)
CALL {
    WITH g
    MATCH (g)<-[:IS]-(r:Release)
    WHERE r.year > 0 AND r.year <= $before_year
    RETURN min(r.year) AS first_year
}

Despite the CALL {} subquery, the planner chose a release-first scan: 203M DB accesses for genres + 207M for styles.

Solution (multi-phase):

Phase 1 (Opt-7): Pattern comprehension to force per-genre expansion.

Phase 2 (Opt-9): Pre-computed first_year property on Genre/Style nodes during import, with RANGE indexes:

CREATE INDEX genre_first_year_index FOR (g:Genre) ON (g.first_year)
CREATE INDEX style_first_year_index FOR (s:Style) ON (s.first_year)

The query became an index-backed range seek:

MATCH (g:Genre)
WHERE g.first_year IS NOT NULL AND g.first_year <= $before_year
RETURN g.name AS name, g.first_year AS first_year
ORDER BY first_year

Impact: 64.3s → 0.10s (630x improvement). From 410M DB accesses to 33 (genres) + 1,515 (styles).

Neo4j Principle: Index-backed ORDER BY eliminates Sort operators. When the index property matches the ORDER BY clause, Neo4j reads entries in order without sorting.

Area 4: Trends Genre/Style — CALL {} Barriers to Prevent CartesianProduct

Problem: The trends/genre query:

MATCH (g:Genre {name: $name})
WITH g
MATCH (r:Release)-[:IS]->(g)
WHERE r.year > 0
WITH r.year AS year, count(DISTINCT r) AS count
RETURN year, count ORDER BY year

Despite the WITH g barrier, the planner chose:

NodeUniqueIndexSeek on genre → 1 row
NodeIndexSeekByRange on r.year > 0 → 16,427,716 rows
CartesianProduct → 16M rows
Expand(Into) → 134M DB hits

The planner saw through the WITH barrier and incorrectly judged the year index range seek as more selective.

Solution (multi-phase):

Phase 1 (Opt-2): CALL {} subquery to create a stronger planner barrier.
Phase 2 (Opt-7): Pattern comprehension [(g)<-[:IS]-(r:Release) WHERE r.year > 0 | r.year] to force genre-first traversal.
Phase 3 (Opt-9): Redis caching with 24h TTL — genre/style trends data is static between imports.

Impact: 28.3-28.9s → 0.001s (28,000x+ improvement). First call computes and caches; subsequent calls return from Redis in <2ms.

Neo4j Principle: WITH barriers are advisory — the planner can see through them. CALL {} subqueries provide stronger barriers but even they aren't guaranteed. Pattern comprehension [... | ...] is the strongest way to force a specific traversal order. When query-level optimization has limits, application-level caching is the final layer.

Area 5: Artist/Label Similarity — Batch Queries and Cardinality Control

Problem: The original artist-similar endpoint:

Found target artist's genres
Expanded ALL releases in those genres (for "Electronic": 5.6M releases) to find candidate artists
For each of 200 candidates, ran 4 separate profile queries = 800 sequential Neo4j queries
Computed cosine similarity

For Johnny Cash (genres: Rock, Country, Blues, Pop, Folk), the genre expansion produced millions of candidates before filtering. Total: 32-112s with frequent timeouts.

Label-similar had the same pattern with triple re-traversal per candidate: 67-95s.

Solution (multi-phase):

Phase 1 (Opt-2): Batch 4 profile queries per dimension into single queries. 800 queries → 4 queries.
Phase 2 (Opt-4): Two-phase candidate discovery — lightweight ID finding, then batch profile fetch.
Phase 3 (Opt-5): Top-5-genres LIMIT + per-genre LIMIT 500 to cap mega-genre explosion.
Phase 4 (Opt-6): CALL {} subqueries per-genre to prevent cross-genre row multiplication.
Phase 5 (Opt-8): Style-based similarity (757 styles vs 16 genres = finer granularity, fewer releases per traversal). Split into 25-label batches with concurrent queries.
Phase 6 (Opt-9): Redis caching with 24h TTL.

Impact:

artist-similar: 32-112s → 0.001-0.002s (16,000-56,000x improvement)
label-similar: 67-95s → 0.001s (67,000-95,000x improvement)

Neo4j Principle: N+1 query patterns are the single biggest performance killer. Batch queries into UNWIND + MATCH patterns. Use per-dimension LIMIT to cap high-cardinality expansions. When candidate discovery is expensive, split into a lightweight phase (find IDs) and a batch profile phase (fetch details for top N).

Area 6: Label DNA — Redis Cache Reuse in Comparison

Problem: The /api/label/dna/compare endpoint called _build_dna() for each label, which ran ~10 Neo4j queries per label. But _build_dna() didn't check or populate the Redis cache that /api/label/{id}/dna uses. The compare endpoint always hit the database cold.

Solution: Added Redis cache read/write to _build_dna() so it reuses cached label DNA profiles:

async def _build_dna(label_id: str) -> tuple[LabelDNA | None, str]:
    # Check cache first (same key as the /dna endpoint)
    cache_key = f"label-dna:{label_id}"
    if _redis:
        cached = await _redis.get(cache_key)
        if cached:
            return LabelDNA(**json.loads(cached)), "ok"
    # ... compute and cache ...

Impact: 5.31s → 0.003s (1,770x improvement)

Principle: When multiple endpoints compute the same data, share the cache. A cache miss in one endpoint should populate the cache for all endpoints.

Area 7: Explore Artist/Label — Redis Caching for Computed Counts

Problem: The explore/artist endpoint used COUNT {} subqueries that traversed all of an artist's releases and their relationships. For Johnny Cash (8,070 releases), this was 102K DB hits. As optimizations in other areas increased test concurrency, explore/artist latency actually regressed from 0.08s (pre-opt) to 1.31s (opt-9) due to cache pressure.

Solution: Redis caching with 24h TTL for explore/artist and explore/label results. Pre-computed aggregate properties on Label nodes (similar to Genre/Style).

Impact: explore/artist/Johnny Cash: 1.31s (opt-9) → 0.001s (opt-11). explore/label/Reprise: 4.88s (opt-9) → 0.001s (opt-11).

Area 8: Search Full-Text — Per-Table LIMIT and Concurrent Queries

Problem: PostgreSQL full-text search for high-cardinality terms like "Rock" required ts_rank() computation across all matching rows before sorting. The releases table alone has 18.9M rows, many containing "Rock" in their title.

Solution:

Per-table LIMIT in each UNION ALL arm prevents materializing 100K+ rows per table
5 concurrent queries via asyncio.gather(): paginated results, total count, type counts, genre facets, decade facets
Redis caching with 300s TTL
_TOTAL_COUNT_CAP = 10000 limits auxiliary count queries

Impact: search/* went from 27 errors (broken) to 0 errors. Most search queries respond in <50ms. search/Rock remains slow on cold cache (~9s) but subsequent calls return from Redis in <5ms.

Remaining opportunity: Pre-warm Redis cache for common genre/style terms on startup; increase search TTL to 3600s.

Area 9: Year Range — Index-Backed Min/Max

Problem: Original year-range query scanned all 16M releases to find min and max year.

Solution: CALL {} subqueries with ORDER BY + LIMIT 1, leveraging the release_year_index:

CALL {
    MATCH (r:Release) WHERE r.year > 0
    RETURN r.year AS year ORDER BY r.year ASC LIMIT 1
}

Neo4j uses the index to return the first/last entry without scanning.

Impact: 2.13s → 0.045s (47x improvement). From 32M DB accesses to 6.

Area 10: Master Year Index for Insights

Problem: The insights/this-month query scanned all 2.5M Master nodes to filter by year.

Solution: Created master_year_index:

CREATE INDEX master_year_index FOR (m:Master) ON (m.year)

NodeByLabelScan + Filter → NodeIndexSeek.

Impact: 7M → 2.1M DB accesses. Latency improved from ~0.8s to ~0.8s (modest, because the MASTER_OF expansion dominates).

Neo4j Community Edition Performance Guide

Based on the optimization work in this project, here are the specific techniques that work with Neo4j Community Edition.

Query Optimization Techniques

1. Always Specify Relationship Types and Direction

-- BAD: explores ALL relationship types
shortestPath((a)-[*..6]-(b))

-- GOOD: restricts BFS to relevant types
shortestPath((a)-[:BY|ON|IS|ALIAS_OF|MEMBER_OF*..6]-(b))

Each excluded relationship type exponentially reduces traversal. In this project: 70s → 0.3s.

2. Use CALL {} Subqueries to Control the Planner

The Neo4j planner can see through WITH barriers and choose unexpected plans (e.g., CartesianProduct). CALL {} subqueries create stronger barriers:

-- BAD: planner may choose CartesianProduct (16M rows)
MATCH (g:Genre {name: $name})
WITH g
MATCH (r:Release)-[:IS]->(g)
WHERE r.year > 0

-- GOOD: forces genre-first expansion
MATCH (g:Genre {name: $name})
CALL {
    WITH g
    MATCH (g)<-[:IS]-(r:Release)
    WHERE r.year > 0
    RETURN r.year AS year, count(DISTINCT r) AS count
}

3. Use Pattern Comprehension for Strongest Planner Control

When even CALL {} doesn't prevent the planner from choosing a bad plan:

-- Forces per-genre expansion (planner cannot choose release-first)
MATCH (g:Genre)
WITH g, [(g)<-[:IS]-(r:Release) WHERE r.year > 0 | r.year] AS years

4. Pre-Compute Expensive Aggregates as Node Properties

For queries that aggregate across millions of relationships but whose results change only on data import:

-- At import time (in graphinator post-import step):
CALL { } IN TRANSACTIONS OF 1 ROWS
MATCH (g:Genre)
SET g.release_count = COUNT { (g)<-[:IS]-(:Release) }
SET g.artist_count = COUNT { MATCH (g)<-[:IS]-(:Release)-[:BY]->(a:Artist) RETURN DISTINCT a }

-- At query time (instead of 200M DB hits):
MATCH (g:Genre {name: $name})
RETURN g.release_count, g.artist_count  -- 6 DB hits

5. Create Indexes for All Filtered and Sorted Properties

CREATE INDEX master_year_index FOR (m:Master) ON (m.year)
CREATE INDEX genre_first_year_index FOR (g:Genre) ON (g.first_year)
CREATE INDEX release_year_index FOR (r:Release) ON (r.year)

Index-backed ORDER BY eliminates Sort operators. min()/max() can read first/last index entry.

6. Batch Queries Instead of N+1 Patterns

-- BAD: 200 separate queries
FOR candidate IN candidates:
    MATCH (r:Release)-[:BY]->(a:Artist {id: candidate.id})
    MATCH (r)-[:IS]->(g:Genre)
    RETURN g.name, count(r)

-- GOOD: 1 query with UNWIND
UNWIND $candidate_ids AS cid
MATCH (r:Release)-[:BY]->(a:Artist {id: cid})
MATCH (r)-[:IS]->(g:Genre)
WITH a.id AS artist_id, g.name AS genre, count(DISTINCT r) AS count
RETURN artist_id, collect({name: genre, count: count}) AS genres

7. Use Per-Dimension LIMIT to Cap High-Cardinality Expansions

-- BAD: "Rock" genre expands to 6M+ releases
MATCH (g:Genre {name: genre_name})<-[:IS]-(r:Release)-[:BY]->(a:Artist)

-- GOOD: cap per-genre expansion
CALL {
    WITH genre_name
    MATCH (g:Genre {name: genre_name})<-[:IS]-(r:Release)-[:BY]->(a:Artist)
    WITH a, count(DISTINCT r) AS count
    ORDER BY count DESC
    LIMIT 500
    RETURN a
}

Configuration Tuning (Community Edition)

Page Cache

# Size to fit entire store + 20% headroom
# Check store size: du -sh data/databases/neo4j/
server.memory.pagecache.size=12g

The page cache keeps store files in memory. When it's too small, the database reads from disk on every query. Size it to fit the entire store for optimal performance.

Heap

# Set initial = max to avoid GC pauses during resizing
server.memory.heap.initial_size=4g
server.memory.heap.max_size=4g

The this-month query uses 99MB for aggregation. Ensure heap can handle concurrent queries.

Transaction Memory

# Cap individual transactions to prevent runaway queries
db.memory.transaction.max=256m
dbms.memory.transaction.total.max=1g

Query Planner

# Query cache (default 1000 is usually sufficient)
server.db.query_cache_size=1000

Run CALL db.prepareForReplanning() after bulk imports to update cardinality statistics.

Per-Query Planner Hints

-- Force exhaustive planner for complex queries (better plans, slower planning)
CYPHER planner=dp
MATCH ...

-- JIT-compile expressions for faster execution
CYPHER expressionEngine=compiled
MATCH ...

What's NOT Available in Community Edition

These Enterprise-only features would help but cannot be used:

Parallel runtime (multi-threaded query execution)
Composite indexes (multi-property)
Node/relationship property existence constraints
Advanced memory management (per-query memory limits with soft/hard thresholds)
Causal clustering (read replicas for read scaling)

Application-Level Optimization Patterns

These techniques complement Neo4j query optimization:

Pattern	Description	TTL Used
Redis cache-aside	Check cache → query DB → store → return	24h for trends, similarity, DNA; 5m for search
asyncio.gather()	Execute independent queries concurrently	N/A
Two-phase candidate discovery	Phase 1: lightweight ID finding; Phase 2: batch profile fetch	N/A
Pre-computed properties at import	Denormalize aggregates during graph building	N/A (permanent)
Per-table LIMIT in SQL UNION ALL	Cap FTS result sets per table before ranking	N/A

Database Statistics

Neo4j Graph

Entity	Count
Artist nodes	9,969,328
Release nodes (year > 0)	16,427,716
Master nodes	2,530,776
Genre nodes	16
Style nodes	757
IS relationships (Release→Genre/Style)	~47M
BY relationships (Release→Artist)	~5.6M per popular genre
ON relationships (Release→Label)	~5.2M per popular genre

Neo4j Indexes (24 total)

Name	Type	Entity	Property
artist_id	RANGE	Artist	id
artist_name	RANGE	Artist	name
artist_name_fulltext	FULLTEXT	Artist	name
artist_sha256	RANGE	Artist	sha256
genre_first_year_index	RANGE	Genre	first_year
genre_name	RANGE	Genre	name
genre_name_fulltext	FULLTEXT	Genre	name
label_id	RANGE	Label	id
label_name	RANGE	Label	name
label_name_fulltext	FULLTEXT	Label	name
label_sha256	RANGE	Label	sha256
master_id	RANGE	Master	id
master_sha256	RANGE	Master	sha256
master_year_index	RANGE	Master	year
release_id	RANGE	Release	id
release_sha256	RANGE	Release	sha256
release_title_fulltext	FULLTEXT	Release	title
release_year_index	RANGE	Release	year
style_first_year_index	RANGE	Style	first_year
style_name	RANGE	Style	name
style_name_fulltext	FULLTEXT	Style	name
user_id	RANGE	User	id
index_460996c0	LOOKUP	NODE	—
index_1b9dcc97	LOOKUP	RELATIONSHIP	—

PostgreSQL Indexes (77 total across 12 tables)

Key tables and their indexes:

Table	Rows (approx)	Indexes
artists	~10M	PK, FTS (GIN), hash, name, updated_at
releases	~16.4M	PK, FTS (GIN), hash, title, year, country, genres (GIN), labels (GIN), updated_at
masters	~2.5M	PK, FTS (GIN), hash, title, year, updated_at
labels	~1.8M	PK, FTS (GIN), hash, name, updated_at
user_collections	variable	PK, user_id, release_id, unique(user_id, release_id, instance_id)
user_wantlists	variable	PK, user_id, release_id, unique(user_id, release_id)

Summary by Endpoint Category

Category	Pre-Opt → Opt-11	Technique
Path finder (6)	58.5s → 0.21s (279x)	Relationship type filtering on shortestPath
Explore genre (2)	24.1s → 0.014s (1,721x)	Pre-computed aggregate properties on Genre nodes
Explore style (3)	0.91s → 0.014s (65x)	Pre-computed aggregate properties on Style nodes
Explore label (3)	0.11s → 0.001s (110x)	Pre-computed properties + Redis caching
Explore artist (4)	0.08s → 0.001s (80x)	Redis caching (24h TTL)
Genre-emergence (1)	64.3s → 0.10s (630x)	Pre-computed first_year + index-backed ORDER BY
Trends genre (2)	28.6s → 0.001s (28,600x)	CALL {} barriers + Redis caching
Trends style (3)	13.2s → 0.001s (13,200x)	Pattern comprehension + Redis caching
Trends label (3)	0.70s → 0.001s (700x)	Redis caching
Trends artist (4)	0.37s → 0.031s (12x)	Direct query (small artists are fast)
Artist-similar (4)	64s → 0.002s (32,000x)	Batch queries + cardinality LIMIT + Redis
Label-similar (3)	86s → 0.001s (86,000x)	Style-based similarity + batch + Redis
Label-DNA (3)	7.1s → 0.001s (7,100x)	Redis caching
Label-DNA-compare (1)	0.36s → 0.003s (120x)	Cache reuse in _build_dna
Search (12)	broken → 0.97s avg	Per-table LIMIT + concurrent queries + Redis
Insights (7)	0.13s → 0.17s	Already fast (batch computation)
Node-details (7)	— → 0.030s	Property reads
Expand (7)	— → 0.014s	SKIP/LIMIT pagination
Autocomplete (12)	0.004s → 0.007s	Fulltext index (unchanged)
Overall (88)	10.95s → 0.044s (249x)	—

Uh oh!

FilesExpand file tree

query-performance-optimizations.md

Latest commit

History

query-performance-optimizations.md

File metadata and controls

Final Performance Optimization Analysis

Executive Summary

Complete Endpoint Performance Progression

Explore Endpoints

Trends Endpoints

Path Finder Endpoints

Search Endpoints (added in Opt-1)

Similarity, Label-DNA & Other Endpoints (added in Opt-2/Opt-4)

Insights Endpoints

Autocomplete, Node-Details & Expand Endpoints

Error Progression

Areas Addressed to Improve Performance

Area 1: Path Finder — shortestPath Relationship Type Filtering

Area 2: Explore Genre/Style — Pre-Computed Aggregate Properties

Area 3: Genre Emergence — Pre-Computed first_year Properties

Area 4: Trends Genre/Style — CALL {} Barriers to Prevent CartesianProduct

Area 5: Artist/Label Similarity — Batch Queries and Cardinality Control

Area 6: Label DNA — Redis Cache Reuse in Comparison

Area 7: Explore Artist/Label — Redis Caching for Computed Counts

Area 8: Search Full-Text — Per-Table LIMIT and Concurrent Queries

Area 9: Year Range — Index-Backed Min/Max

Area 10: Master Year Index for Insights

Neo4j Community Edition Performance Guide

Query Optimization Techniques

1. Always Specify Relationship Types and Direction

2. Use CALL {} Subqueries to Control the Planner

3. Use Pattern Comprehension for Strongest Planner Control

4. Pre-Compute Expensive Aggregates as Node Properties

5. Create Indexes for All Filtered and Sorted Properties

6. Batch Queries Instead of N+1 Patterns

7. Use Per-Dimension LIMIT to Cap High-Cardinality Expansions

Configuration Tuning (Community Edition)

Page Cache

Heap

Transaction Memory

Query Planner

Per-Query Planner Hints

What's NOT Available in Community Edition

Application-Level Optimization Patterns

Database Statistics

Neo4j Graph

Neo4j Indexes (24 total)

PostgreSQL Indexes (77 total across 12 tables)

Summary by Endpoint Category